# Exploratory Data Analysis

Here is a clear demonstration, with detailed explanations, of exploratory data analysis and correlation. Correlation is used to measure the strength of the linear relationship between two variables. Exploratory data analysis is applied to learn more about the variables in the data set.

## Dataset: QoG data

The QoG Standard dataset is one of the largest datasets. It draws on over 100 data sources and contains more than 2000 variables. The QoG data is constantly updated, and my study uses the last version of 2019. It contains both cross-sectional and time-series data, with information on countries going back to 1946, which is very useful if a researcher is interested in countries as the unit of analysis. However, one should be careful when using this dataset, because some countries in it no longer exist, while other countries that did not exist before exist now as a result of reunification or secession. Despite that, most of the dataset is constantly updated and can therefore be considered authentic and usable for research purposes.

## Exploratory Data Analysis

### The Dependent variable

### Economic development

Taking GDP per capita as the variable is deemed more suitable than the other variables in the data that measure the same concept. Economic strength and economic development show the same correlation with Democracy and the other independent variables. To operationalize economic development, we also used the QoG dataset 2013, from which GDP per capita is chosen. The reason is that GDP per capita is one of the most widely used instruments for measuring economic development or performance. For the sake of accuracy, this paper uses the Geary-Khamis dollar, also called the international dollar. The Geary-Khamis dollar (GK$) is a hypothetical unit that has the same purchasing power as the US dollar (QoG: 2013). The GK$ is a recognized unit used in studies and other economic contexts.

## The Independent variables

### Democracy

Democracy is an independent variable, measured here with the index of Democracy in the QoG standard dataset 2013. The index is a scale from 0 to 10 built on several other scales intended to measure Democracy: civil liberty, democratic political culture, electoral process and pluralism, political participation, and functioning government. Democracy can be treated as either a nominal or an interval variable. On this index, 0 indicates no democracy and 10 indicates the highest level of Democracy.

### Social cohesion

Social cohesion refers to the sense of solidarity and the strength of relationships among members of a given nation, region, or community. It is normally determined by several indicators, one of which is the amount of social capital collectively held by a community. Social capital relates to the shared group resources present in a community, such as knowledge of a job vacancy held by a close family friend. Social capital is therefore accessed through the social networks formed with other members of society. Social cohesion is concerned with developing shared values, facing common challenges, creating a sense of togetherness, promoting group engagement across different community activities, and mitigating social disparities in wealth distribution and income levels.

### Ethnic fractionalization

Ethnic fractionalization concerns the number, sizes, socio-economic distribution, and geographical location of distinct cultural groups within a given country, such as Japan or Egypt. These cultural features may relate to customs, traditions, ethnicity, religion, and language. In most cases, these features are used in the monopolization of power and in social exclusion.

### Trade openness

Trade openness describes the inward or outward orientation of an economy. Outward orientation refers to economies that seize trading opportunities with other countries, while inward orientation describes countries that avoid or fail to take advantage of such opportunities. Factors and policies directly linked to establishing either orientation include scale economies, market competitiveness, technologies, infrastructure, import-export regulations, and trade barriers.

### Education

We are not dealing only with the conventional interpretation of education, which is the acquisition of knowledge and skills through teaching, self-learning, or apprenticeship. We are also dealing with the assessment of the quality of education, which focuses on the proper cognitive development of learners as they progress through the education system.

## Exploratory Analysis

Before running the regression, several checks were made to see whether the model assumptions hold. A Q-Q plot was drawn for GDP per capita and for its natural log transformation to assess the normality of the dependent variable.

The natural log transformation is clearly much closer to the normality assumption than the untransformed variable, so further analysis is done on the log of per capita GDP. Since the log is a strictly increasing bijective transformation, the ordering of the observations is preserved. A scatter plot of each independent variable against the dependent variable is also drawn to check the linearity of the relationships.
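The normality check described above can also be summarized numerically. The following is a minimal sketch on synthetic data, not the QoG values: it draws a right-skewed sample (an assumption standing in for GDP per capita) and computes the correlation between the sorted data and the theoretical normal quantiles, which is the straightness of the Q-Q plot. The helper names `pearson` and `qq_correlation` are illustrative.

```python
import math
import random
from statistics import NormalDist


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def qq_correlation(values):
    """Correlation between sorted data and standard normal quantiles.

    Values close to 1 mean the Q-Q plot is close to a straight line.
    """
    ordered = sorted(values)
    n = len(ordered)
    normal = NormalDist()
    # Plotting positions (i - 0.5) / n keep every probability inside (0, 1).
    theoretical = [normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return pearson(ordered, theoretical)


# Synthetic, right-skewed stand-in for GDP per capita (NOT the QoG data).
random.seed(0)
gdp_per_capita = [random.lognormvariate(9.0, 1.2) for _ in range(170)]
log_gdp_per_capita = [math.log(v) for v in gdp_per_capita]

r_raw = qq_correlation(gdp_per_capita)
r_log = qq_correlation(log_gdp_per_capita)
print(f"Q-Q correlation, raw values: {r_raw:.3f}")
print(f"Q-Q correlation, log values: {r_log:.3f}")
```

On a skewed sample like this one, the log-transformed values should give the higher Q-Q correlation, which mirrors the visual impression from the plots.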

## Onwards

### Correlation Analysis

### Per Capita GDP vs. Democracy

The correlation between per capita GDP and the democracy indicator is 0.089, which is small. The test of significance of the correlation coefficient yields a p-value of 0.315, so the correlation is not statistically significant. This indicates that there may not be a strong linear relationship between the two variables.

### Per Capita GDP vs. Democracy

The correlation between per capita GDP and the democracy indicator is 0.128, which is small. The test of significance of the correlation coefficient yields a p-value of 0.146, so the correlation is not statistically significant. This indicates that there may not be a strong linear relationship between the two variables.

### Per Capita GDP vs. Social Capital

The correlation between per capita GDP and social capital is 0.128, which is small. The test of significance of the correlation coefficient yields a p-value of 0.146, so the correlation is not statistically significant. This indicates that there may not be a strong linear relationship between the two variables.

### Per Capita GDP vs. Ethnic fractionalization

The correlation between per capita GDP and ethnic fractionalization is -0.340, which is moderate in size. The test of significance of the correlation coefficient yields a p-value of 0.001, so the correlation is statistically significant. This shows that there is a clear negative linear relationship between the two variables. This variable may play an important role in explaining variability in per capita GDP.

### Per Capita GDP vs. Trade openness

The correlation between per capita GDP and trade openness is 0.385, which is moderate in size. The test of significance of the correlation coefficient yields a p-value of 0.001, so the correlation is statistically significant. This indicates that there is a clear positive linear relationship between the two variables.

### Per Capita GDP vs. Education

The correlation between per capita GDP and education is 0.483, which is a relatively high positive value. It is statistically significant: the test of significance of the correlation coefficient yields a p-value of 0.001. This indicates that there may be a strong linear relationship between the two variables.

### Log per capita GDP with variables

A similar correlation study is carried out with log per capita GDP and the explanatory variables.

**Correlations**

| | | log_GDP_PC | Education | Democracy | Social_capital | Ethnic_fractionalization | Trade_Openness |
| --- | --- | --- | --- | --- | --- | --- | --- |
| log_GDP_PC | Pearson Correlation | 1 | .530\*\* | .262\*\* | .275\*\* | -.442\*\* | .504\*\* |
| | Sig. (2-tailed) | | .000 | .003 | .002 | .000 | .000 |
| | N | 163 | 50 | 130 | 130 | 148 | 115 |
| Education | Pearson Correlation | .530\*\* | 1 | .575\*\* | .482\*\* | -.270 | .205 |
| | Sig. (2-tailed) | .000 | | .000 | .000 | .069 | .277 |
| | N | 50 | 53 | 49 | 49 | 46 | 30 |
| Democracy | Pearson Correlation | .262\*\* | .575\*\* | 1 | .856\*\* | -.086 | .437\*\* |
| | Sig. (2-tailed) | .003 | .000 | | .000 | .331 | .000 |
| | N | 130 | 49 | 136 | 136 | 129 | 92 |
| Social_capital | Pearson Correlation | .275\*\* | .482\*\* | .856\*\* | 1 | -.063 | .421\*\* |
| | Sig. (2-tailed) | .002 | .000 | .000 | | .477 | .000 |
| | N | 130 | 49 | 136 | 136 | 129 | 92 |
| Ethnic_fractionalization | Pearson Correlation | -.442\*\* | -.270 | -.086 | -.063 | 1 | -.511\*\* |
| | Sig. (2-tailed) | .000 | .069 | .331 | .477 | | .000 |
| | N | 148 | 46 | 129 | 129 | 154 | 111 |
| Trade_Openness | Pearson Correlation | .504\*\* | .205 | .437\*\* | .421\*\* | -.511\*\* | 1 |
| | Sig. (2-tailed) | .000 | .277 | .000 | .000 | .000 | |
| | N | 115 | 30 | 92 | 92 | 111 | 116 |

\*\* Correlation is significant at the 0.01 level (2-tailed).
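The two-tailed significance values in the correlation table depend only on the coefficient and the number of complete pairs. As a rough cross-check, the sketch below uses the Fisher z approximation, which is an assumption on my part and not necessarily the exact test SPSS applies (SPSS reports a t-based test). The function name `fisher_z_p_value` is illustrative; the inputs are read from the table.

```python
import math
from statistics import NormalDist


def fisher_z_p_value(r, n):
    """Approximate two-tailed p-value for H0: rho = 0 via the Fisher z-transform."""
    # atanh(r) is approximately normal with standard error 1 / sqrt(n - 3).
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Values read from the correlation table.
p_log_gdp_democracy = fisher_z_p_value(0.262, 130)        # table reports .003
p_education_fractionalization = fisher_z_p_value(-0.270, 46)  # table reports .069

print(f"log_GDP_PC vs Democracy: p = {p_log_gdp_democracy:.4f}")
print(f"Education vs Ethnic_fractionalization: p = {p_education_fractionalization:.4f}")
```

The approximation lands close to the tabulated values, which also shows why a coefficient of -.270 is not flagged as significant: with only 46 pairs, the same magnitude of correlation carries much less evidence than it would with 130.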

## Regression Models

The purpose of regression analysis in this study is to identify and measure the strength of relationships between variables and to attribute economic development to the chosen features. Statistical tests such as the t-test and F-test are used to test the significance of the model. The first model is fitted between GDP per capita and Democracy. All the chosen variables are scale variables, so there was no need to introduce dummy variables. This model assesses the role of Democracy in the economic development of a country. Since the dataset has many missing values, they were imputed with the mean of the respective variable. The result from SPSS is presented in the tables below:

**Model Summary**<sup>b</sup>

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .062<sup>a</sup> | .004 | -.002 | 19105.42550 |

a. Predictors: (Constant), Democracy. b. Dependent Variable: GDP_per_capita.

**ANOVA**<sup>a</sup>

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Regression | 237204598.344 | 1 | 237204598.344 | .650 | .421<sup>b</sup> |
| | Residual | 60957886336.257 | 167 | 365017283.451 | | |
| | Total | 61195090934.601 | 168 | | | |

a. Dependent Variable: GDP_per_capita. b. Predictors: (Constant), Democracy.

**Coefficients**<sup>a</sup>

| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | 95.0% CI for B: Lower Bound | 95.0% CI for B: Upper Bound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | (Constant) | 14543.353 | 4641.833 | | 3.133 | .002 | 5379.117 | 23707.590 |
| | Democracy | 639.450 | 793.235 | .062 | .806 | .421 | -926.611 | 2205.512 |

a. Dependent Variable: GDP_per_capita.

**Model Summary**<sup>b</sup>

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .208<sup>a</sup> | .043 | .037 | 1.15660 |

a. Predictors: (Constant), Democracy. b. Dependent Variable: log_GDP_PC.

**ANOVA**<sup>a</sup>

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Regression | 10.053 | 1 | 10.053 | 7.515 | .007<sup>b</sup> |
| | Residual | 223.400 | 167 | 1.338 | | |
| | Total | 233.453 | 168 | | | |

a. Dependent Variable: log_GDP_PC. b. Predictors: (Constant), Democracy.

**Coefficients**<sup>a</sup>

| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | 95.0% CI for B: Lower Bound | 95.0% CI for B: Upper Bound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | (Constant) | 8.481 | .281 | | 30.182 | .000 | 7.927 | 9.036 |
| | Democracy | .132 | .048 | .208 | 2.741 | .007 | .037 | .226 |

a. Dependent Variable: log_GDP_PC.
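The summary figures for the log model with Democracy can be recomputed from its ANOVA sums of squares. The following is a small arithmetic sketch using the values printed in the SPSS output; the variable names are my own, and small discrepancies can arise because the printed inputs are rounded.

```python
import math

# Values read from the ANOVA table for log_GDP_PC regressed on Democracy.
ss_regression = 10.053
ss_residual = 223.400
df_regression = 1
df_residual = 167

ss_total = ss_regression + ss_residual

# R squared is the share of total variation captured by the regression.
r_squared = ss_regression / ss_total

# F is the ratio of the regression mean square to the residual mean square.
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)

# With a single predictor, the t statistic of the slope equals sqrt(F)
# and the multiple R equals sqrt(R squared).
t_democracy = math.sqrt(f_statistic)
multiple_r = math.sqrt(r_squared)

print(f"R squared   = {r_squared:.3f}")
print(f"F statistic = {f_statistic:.3f}")
print(f"t (slope)   = {t_democracy:.3f}")
print(f"R           = {multiple_r:.3f}")
```

The results agree with the reported R Square of .043, F of 7.515, t of 2.741, and R of .208.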

The log model with Democracy as the only predictor has an R² of 0.043. After this interesting finding, a new variable, education, is introduced into the regression while controlling for Democracy. This variable was chosen because of the strong belief that education is the greatest asset for human resources and that a more educated society or country develops more. However, the data do not support this point: the education coefficient is not statistically significant (p = 0.137). The change in R² is negligible, and the model has similar explanatory power as before. Below is the SPSS output after entering the variable education.

**ANOVA**<sup>a</sup>

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Regression | 10.053 | 1 | 10.053 | 7.515 | .007<sup>b</sup> |
| | Residual | 223.400 | 167 | 1.338 | | |
| | Total | 233.453 | 168 | | | |
| 2 | Regression | 13.023 | 2 | 6.511 | 4.904 | .009<sup>c</sup> |
| | Residual | 220.430 | 166 | 1.328 | | |
| | Total | 233.453 | 168 | | | |

a. Dependent Variable: log_GDP_PC. b. Predictors: (Constant), Democracy. c. Predictors: (Constant), Democracy, Education.

**Coefficients**<sup>a</sup>

| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | 95.0% CI for B: Lower Bound | 95.0% CI for B: Upper Bound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | (Constant) | 8.481 | .281 | | 30.182 | .000 | 7.927 | 9.036 |
| | Democracy | .132 | .048 | .208 | 2.741 | .007 | .037 | .226 |
| 2 | (Constant) | 7.957 | .449 | | 17.726 | .000 | 7.071 | 8.843 |
| | Democracy | .110 | .050 | .174 | 2.208 | .029 | .012 | .209 |
| | Education | .014 | .010 | .118 | 1.495 | .137 | -.005 | .033 |

a. Dependent Variable: log_GDP_PC.

**Model Summary**<sup>c</sup>

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .208<sup>a</sup> | .043 | .037 | 1.15660 |
| 2 | .236<sup>b</sup> | .056 | .044 | 1.15234 |

a. Predictors: (Constant), Democracy. b. Predictors: (Constant), Democracy, Education. c. Dependent Variable: log_GDP_PC.
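Adjusted R² is the figure to watch when a predictor is added, because plain R² can never decrease. The sketch below recomputes it for the two models above from the ANOVA sums of squares, assuming 169 observations (total df of 168 plus one); the function name `adjusted_r_squared` is illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R squared: penalizes R squared for the number of predictors."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


n_obs = 169          # assumption: total df (168) + 1
ss_total = 233.453   # from the ANOVA table

# Model 1: Democracy only. Model 2: Democracy and Education.
r2_model1 = 10.053 / ss_total
r2_model2 = 13.023 / ss_total

adj_model1 = adjusted_r_squared(r2_model1, n_obs, 1)
adj_model2 = adjusted_r_squared(r2_model2, n_obs, 2)

print(f"Model 1: R2 = {r2_model1:.3f}, adjusted R2 = {adj_model1:.3f}")
print(f"Model 2: R2 = {r2_model2:.3f}, adjusted R2 = {adj_model2:.3f}")
print(f"Gain in adjusted R2 from Education: {adj_model2 - adj_model1:.3f}")
```

The recomputed values match the reported .037 and .044, so adding Education improves the adjusted fit by less than one percentage point.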

**Model Summary**<sup>c</sup>

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .208<sup>a</sup> | .043 | .037 | 1.15660 |
| 2 | .421<sup>b</sup> | .177 | .167 | 1.07577 |

a. Predictors: (Constant), Democracy. b. Predictors: (Constant), Democracy, Trade_Openness. c. Dependent Variable: log_GDP_PC.

**ANOVA**<sup>a</sup>

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Regression | 10.053 | 1 | 10.053 | 7.515 | .007<sup>b</sup> |
| | Residual | 223.400 | 167 | 1.338 | | |
| | Total | 233.453 | 168 | | | |
| 2 | Regression | 41.344 | 2 | 20.672 | 17.862 | .000<sup>c</sup> |
| | Residual | 192.110 | 166 | 1.157 | | |
| | Total | 233.453 | 168 | | | |

a. Dependent Variable: log_GDP_PC. b. Predictors: (Constant), Democracy. c. Predictors: (Constant), Democracy, Trade_Openness.

**Coefficients**<sup>a</sup>

| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | 95.0% CI for B: Lower Bound | 95.0% CI for B: Upper Bound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | (Constant) | 8.481 | .281 | | 30.182 | .000 | 7.927 | 9.036 |
| | Democracy | .132 | .048 | .208 | 2.741 | .007 | .037 | .226 |
| 2 | (Constant) | 6.712 | .429 | | 15.646 | .000 | 5.865 | 7.559 |
| | Democracy | .072 | .046 | .114 | 1.570 | .118 | -.019 | .163 |
| | Trade_Openness | .270 | .052 | .378 | 5.200 | .000 | .167 | .372 |

a. Dependent Variable: log_GDP_PC.

**ANOVA**<sup>a</sup>

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Regression | 38.489 | 1 | 38.489 | 33.364 | .000<sup>b</sup> |
| | Residual | 194.964 | 169 | 1.154 | | |
| | Total | 233.453 | 170 | | | |
| 2 | Regression | 59.263 | 3 | 19.754 | 18.939 | .000<sup>c</sup> |
| | Residual | 174.191 | 167 | 1.043 | | |
| | Total | 233.453 | 170 | | | |

a. Dependent Variable: log_GDP_PC. b. Predictors: (Constant), Trade_Openness. c. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization.

**Coefficients**<sup>a</sup>

| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | 95.0% CI for B: Lower Bound | 95.0% CI for B: Upper Bound | Tolerance | VIF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | (Constant) | 6.958 | .399 | | 17.443 | .000 | 6.170 | 7.745 | | |
| | Trade_Openness | .290 | .050 | .406 | 5.776 | .000 | .191 | .389 | 1.000 | 1.000 |
| 2 | (Constant) | 7.953 | .534 | | 14.890 | .000 | 6.898 | 9.007 | | |
| | Trade_Openness | .176 | .054 | .247 | 3.261 | .001 | .070 | .283 | .779 | 1.284 |
| | Social_capital | .110 | .052 | .146 | 2.126 | .035 | .008 | .212 | .948 | 1.055 |
| | Ethnic_fractionalization | -1.411 | .351 | -.297 | -4.022 | .000 | -2.103 | -.718 | .817 | 1.223 |

**Model Summary**<sup>c</sup>

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | .406<sup>a</sup> | .165 | .160 | 1.07407 | .165 | 33.364 | 1 | 169 | .000 |
| 2 | .504<sup>b</sup> | .254 | .240 | 1.02130 | .089 | 9.958 | 2 | 167 | .000 |

a. Predictors: (Constant), Trade_Openness. b. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization. c. Dependent Variable: log_GDP_PC.

With social capital and ethnic fractionalization added to trade openness, R² increases to 0.254, which indicates that the model explains 25.4% of the variability in the dependent variable. Finally, a regression is fitted with all variables included to see the effect of Democracy and education. Both newly included variables turn out to be insignificant, and the change in R² is minimal. Hence the final model retains trade openness, social capital, and ethnic fractionalization.

**Model Summary**<sup>c</sup>

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | .504<sup>a</sup> | .254 | .240 | 1.02130 | .254 | 18.939 | 3 | 167 | .000 |
| 2 | .513<sup>b</sup> | .263 | .241 | 1.02124 | .009 | 1.010 | 2 | 165 | .366 |

a. Predictors: (Constant), Social_capital, Ethnic_fractionalization, Trade_Openness. b. Predictors: (Constant), Social_capital, Ethnic_fractionalization, Trade_Openness, Education, Democracy. c. Dependent Variable: log_GDP_PC.
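The "F Change" column is a partial F-test comparing a smaller model with a larger one that contains it. The sketch below shows two equivalent ways of computing it, from residual sums of squares and from R² values, using figures printed in the SPSS output; the function names are illustrative, and the R²-based version is only approximate here because the printed R² values are rounded to three decimals.

```python
def f_change_from_sse(sse_reduced, sse_full, df_added, df_residual_full):
    """Partial F from the drop in residual sum of squares."""
    return ((sse_reduced - sse_full) / df_added) / (sse_full / df_residual_full)


def f_change_from_r2(r2_reduced, r2_full, df_added, df_residual_full):
    """Partial F from the gain in R squared (same test, different inputs)."""
    return ((r2_full - r2_reduced) / df_added) / ((1.0 - r2_full) / df_residual_full)


# Step 1: Trade_Openness -> + Social_capital, Ethnic_fractionalization.
# Residual sums of squares come from the corresponding ANOVA table.
f_step1 = f_change_from_sse(194.964, 174.191, df_added=2, df_residual_full=167)

# Step 2: three-predictor model -> + Education, Democracy.
# Only rounded R squared values are available for this step.
f_step2 = f_change_from_r2(0.254, 0.263, df_added=2, df_residual_full=165)

print(f"F change, step 1: {f_step1:.3f}")   # SPSS reports 9.958
print(f"F change, step 2: {f_step2:.3f}")   # SPSS reports 1.010
```

A large F change, as in the first step, supports keeping the added predictors; a value near 1, as in the second step, is what one would expect if Education and Democracy add nothing beyond noise.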

### Final Model

A final regression model is fitted with the variables that were significant at the last step, and it is checked for multicollinearity and the other assumptions.

In the plot of residuals against predicted values there is no apparent pattern, so the residuals can be regarded as independent and free of heteroscedasticity.

**Model Summary**<sup>b</sup>

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | .504<sup>a</sup> | .254 | .240 | 1.02130 | .254 | 18.939 | 3 | 167 | .000 |

**ANOVA**<sup>a</sup>

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Regression | 59.263 | 3 | 19.754 | 18.939 | .000<sup>b</sup> |
| | Residual | 174.191 | 167 | 1.043 | | |
| | Total | 233.453 | 170 | | | |

a. Dependent Variable: log_GDP_PC. b. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization.

**Coefficients**<sup>a</sup>

| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | 95.0% CI for B: Lower Bound | 95.0% CI for B: Upper Bound | Tolerance | VIF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | (Constant) | 7.953 | .534 | | 14.890 | .000 | 6.898 | 9.007 | | |
| | Trade_Openness | .176 | .054 | .247 | 3.261 | .001 | .070 | .283 | .779 | 1.284 |
| | Ethnic_fractionalization | -1.411 | .351 | -.297 | -4.022 | .000 | -2.103 | -.718 | .817 | 1.223 |
| | Social_capital | .110 | .052 | .146 | 2.126 | .035 | .008 | .212 | .948 | 1.055 |

a. Dependent Variable: log_GDP_PC.
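Two columns of the coefficients table are simple functions of others: t is the coefficient divided by its standard error, and VIF is the reciprocal of the tolerance. The sketch below recomputes both from the printed values as a reading aid; the dictionary layout is my own, and the results differ slightly from SPSS because the printed inputs are rounded to three decimals.

```python
# (B, Std. Error, Tolerance) read from the final-model coefficients table.
coefficients = {
    "Trade_Openness": (0.176, 0.054, 0.779),
    "Ethnic_fractionalization": (-1.411, 0.351, 0.817),
    "Social_capital": (0.110, 0.052, 0.948),
}

results = {}
for name, (b, std_error, tolerance) in coefficients.items():
    results[name] = {
        "t": b / std_error,      # t statistic for H0: coefficient = 0
        "vif": 1.0 / tolerance,  # variance inflation factor
    }

for name, stats in results.items():
    print(f"{name:26s} t = {stats['t']:7.3f}   VIF = {stats['vif']:.3f}")
```

All three VIF values are close to 1, far below the commonly used warning thresholds of 5 or 10, which is consistent with the absence of a multicollinearity problem in this model.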


**Model Summary**<sup>b</sup>

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | .524<sup>a</sup> | .275 | .262 | 1.00583 | .275 | 20.866 | 3 | 165 | .000 |

a. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization. b. Dependent Variable: log_GDP_PC.

**ANOVA**<sup>a</sup>

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Regression | 63.331 | 3 | 21.110 | 20.866 | .000<sup>b</sup> |
| | Residual | 166.930 | 165 | 1.012 | | |
| | Total | 230.260 | 168 | | | |

a. Dependent Variable: log_GDP_PC. b. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization.

**Coefficients**<sup>a</sup>

| Model | | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | Tolerance | VIF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | (Constant) | 7.808 | .568 | | 13.745 | .000 | | |
| | Social_capital | .095 | .052 | .124 | 1.830 | .069 | .958 | 1.044 |
| | Ethnic_fractionalization | -1.428 | .353 | -.300 | -4.040 | .000 | .798 | 1.253 |
| | Trade_Openness | .205 | .057 | .274 | 3.635 | .000 | .770 | 1.298 |

a. Dependent Variable: log_GDP_PC.

After testing, the final model is chosen with log per capita GDP as the dependent variable and with the outliers, North Korea and Venezuela, removed. The model explains 27.5% of the variation in log per capita GDP.
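Because the dependent variable is on a natural log scale, each coefficient of the final model can be read as an approximate proportional effect on per capita GDP itself. The sketch below applies the standard conversion exp(B) - 1 to the coefficients reported after outlier removal. This interpretation step is my addition rather than part of the SPSS output, and it describes a one-unit change in each predictor on whatever scale that predictor uses, holding the others fixed.

```python
import math

# Unstandardized coefficients of the final model (dependent variable: log_GDP_PC).
final_coefficients = {
    "Social_capital": 0.095,
    "Ethnic_fractionalization": -1.428,
    "Trade_Openness": 0.205,
}


def percent_change(b):
    """Percent change in the original-scale outcome per one-unit predictor increase."""
    return (math.exp(b) - 1.0) * 100.0


effects = {name: percent_change(b) for name, b in final_coefficients.items()}

for name, pct in effects.items():
    print(f"{name:26s} {pct:+.1f}% per unit")
```

Under this reading, a one-unit increase in trade openness is associated with roughly 23% higher per capita GDP, while a one-unit increase in ethnic fractionalization is associated with roughly 76% lower per capita GDP. These are associations within the fitted model, not causal estimates.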