# Exploratory Data Analysis

This article walks through an exploratory data analysis and a correlation study, with detailed explanations. Correlation is used to measure the strength of the linear relationship between two variables. Exploratory data analysis is applied to learn more about the variables in the data set.

## Dataset: QoG Standard Data

The QoG standard data is one of the largest datasets of its kind. It draws on over 100 data sources and contains more than 2000 variables. The QoG data is constantly updated, and I have based my study on the latest version, from 2019. It contains both cross-sectional and time-series data: the QoG dataset covers countries since 1946, which is very useful if a researcher is interested in countries as the unit of analysis. However, the dataset should be used with care, because some countries in it no longer exist, while others did not exist before and came into being through the reunification or secession of other countries. Despite that, most of the dataset is constantly updated and can therefore be considered authentic and usable for research purposes.

## Exploratory Data Analysis

### Economic development

GDP per capita is deemed more suitable than the other variables in the data that measure the same concept. Economic strength and economic development show the same correlation with Democracy and the other independent variables. To operationalize economic development, we also used the QoG dataset 2013, from which GDP per capita is chosen. The reason is that GDP per capita is one of the most widely used instruments for measuring economic development or performance. For the sake of accuracy, this paper uses the Geary-Khamis dollar, also called the international dollar. The Geary-Khamis dollar (GK$) is a hypothetical unit that has the same purchasing power as the US dollar (QoG: 2013). The GK$ is a recognized unit used in studies and other economic contexts.

## The Independent Variables

### Democracy

Democracy is an independent variable, measured here with the index of Democracy in the QoG standard dataset 2013. The index is a scale from 0 to 10 built from several component scales intended to measure Democracy: civil liberties, democratic political culture, electoral process and pluralism, political participation, and functioning of government. Democracy can be treated as either a nominal or an interval variable. On this index, 0 indicates no democracy and 10 indicates the highest level of Democracy.

### Social cohesion

Social cohesion describes the sense of solidarity and the strength of the relationships among members of a given nation, region, or community. It is normally determined by several indicators, one of which is the amount of social capital collectively held by a community. Social capital relates to the shared group resources present in a community, such as knowledge of a job vacancy held by a close family friend. Social capital is therefore accessed through the social networks formed with other members of society. Social cohesion is concerned with developing shared values, facing common challenges, creating a sense of togetherness, promoting group engagement across different community activities, and, lastly, mitigating social disparities in wealth distribution and income levels.

### Ethnic fractionalization

Ethnic fractionalization concerns the number, sizes, socio-economic distribution, and geographical location of distinct cultural groups within a given country, such as Japan or Egypt. These cultural features may relate to customs, traditions, ethnicity, religion, and language. In most cases, these features are used in the monopolization of power and social exclusion.

### Trade openness

Trade openness describes the inward or outward orientation of an economy. Outward orientation refers to economies that seize trading opportunities with other countries, while inward orientation describes countries that avoid or fail to take advantage of such opportunities. Factors directly linked to the establishment of either an outward or an inward orientation include scale economies, market competitiveness, technologies, infrastructure, import-export regulations, and trade barriers.

### Education

We are not just dealing with the conventional interpretation of education, which is the acquisition of knowledge and skills through a series of teaching, self-learning, or apprenticeship. We are also dealing with the assessment of quality education, which focuses on the proper cognitive development of learners as they progress through the education system.

## Exploratory Analysis

Before the regression was run on the data, several checks were made to see whether the model assumptions hold. A Q-Q plot of GDP per capita was plotted, along with one of its natural log transformation, to assess the normality of the dependent variable.

The log-transformed variable is clearly much closer to the normality assumption than the untransformed variable, so further analysis is done on the log of per capita GDP. Since the log is a strictly increasing bijective transformation, the ordering of the observations is preserved. Scatter plots of each independent variable against the dependent variable were also plotted to check the linearity of the relationships.
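The normality check described above can be sketched in Python. This is only an illustration under stated assumptions: the data are synthetic (a log-normal sample standing in for GDP per capita, not QoG values), and the helper name `qq_fit` is hypothetical. It uses the correlation between sample and theoretical normal quantiles, as returned by `scipy.stats.probplot`, as a rough numeric summary of how straight the Q-Q plot is.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for GDP per capita (assumption: NOT QoG data).
rng = np.random.default_rng(0)
gdp_pc = rng.lognormal(mean=9.0, sigma=1.2, size=160)

def qq_fit(values):
    """Correlation between ordered sample values and normal quantiles.

    Values close to 1 mean the Q-Q plot is close to a straight line.
    """
    _, (_, _, r) = stats.probplot(values, dist="norm")
    return r

r_raw = qq_fit(gdp_pc)          # untransformed variable
r_log = qq_fit(np.log(gdp_pc))  # natural log transformation

print(f"Q-Q correlation, raw: {r_raw:.3f}; log: {r_log:.3f}")

# The log is strictly increasing, so the ranking of observations is unchanged.
same_order = np.array_equal(np.argsort(gdp_pc), np.argsort(np.log(gdp_pc)))
print("ordering preserved:", same_order)
```

To draw the actual Q-Q plot, `probplot` also accepts a `plot` argument (for example a matplotlib Axes object); the numeric summary is used here only to keep the sketch free of plotting dependencies.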

## Correlation Analysis

### Per Capita GDP vs. Democracy

The correlation between per capita GDP and the democracy indicator is 0.089, which is small. The test of significance of the correlation coefficient yields a p-value of 0.315, so the correlation is not statistically significant. This indicates that there may not be a strong linear relationship between the two variables.

### Per Capita GDP vs. Democracy

The correlation between per capita GDP and the democracy indicator is 0.128, which is small. The test of significance of the correlation coefficient yields a p-value of 0.146, so the correlation is not statistically significant. This indicates that there may not be a strong linear relationship between the two variables.

### Per Capita GDP vs. Social Capital

The correlation between per capita GDP and social capital is 0.128, which is small. The test of significance of the correlation coefficient yields a p-value of 0.146, so the correlation is not statistically significant. This indicates that there may not be a strong linear relationship between the two variables.

### Per Capita GDP vs. Ethnic fractionalization

The correlation between per capita GDP and ethnic fractionalization is -0.340, a moderate negative value. The test of significance of the correlation coefficient yields a p-value of 0.001, so the correlation is statistically significant. This shows that there is a clear negative linear relationship between the two variables. This variable may play an important role in explaining variability in per capita GDP.

### Per Capita GDP vs. Trade openness

The correlation between per capita GDP and trade openness is 0.385, a moderate positive value. The test of significance of the correlation coefficient yields a p-value of 0.001, so the correlation is statistically significant. This indicates that there is a clear positive linear relationship between the two variables.

### Per Capita GDP vs. Education

The correlation between per capita GDP and education is 0.483, the highest positive value among the variables considered. It is statistically significant: the test of significance of the correlation coefficient yields a p-value of 0.001. This indicates that there may be a strong linear relationship between the two variables.
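The pairwise tests in this section follow one pattern: compute the Pearson correlation, then test it against zero. A minimal sketch of that pattern follows, assuming synthetic data and a hypothetical helper `correlation_test`; the variable names are placeholders, not QoG columns. Rows where either value is missing are dropped before the test, which is why each pair can have a different N.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins (assumption: NOT QoG data).
rng = np.random.default_rng(1)
n = 120
education = rng.normal(size=n)                 # hypothetical predictor
gdp_pc = 0.5 * education + rng.normal(size=n)  # hypothetical outcome

def correlation_test(x, y):
    """Pearson r, two-sided p-value, and N, using complete pairs only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))        # pairwise deletion
    r, p = stats.pearsonr(x[keep], y[keep])
    return r, p, int(keep.sum())

r, p, n_used = correlation_test(education, gdp_pc)
print(f"r = {r:.3f}, p = {p:.4f}, N = {n_used}")

# The p-value is equivalent to a t-test with N - 2 degrees of freedom.
t_stat = r * np.sqrt((n_used - 2) / (1 - r**2))
p_from_t = 2 * stats.t.sf(abs(t_stat), df=n_used - 2)

# Missing values reduce N for that pair only.
education_missing = education.copy()
education_missing[:10] = np.nan
_, _, n_missing = correlation_test(education_missing, gdp_pc)
```

A small r with a large p-value, as for the democracy indicator, means the data are compatible with no linear relationship; it does not rule out a nonlinear one.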

### Log per capita GDP with variables

A similar correlation study is carried out with log per capita GDP and the explanatory variables.

Correlations (Pearson correlation, two-tailed significance, and N for each pair):

| Variable | Statistic | log_GDP_PC | Education | Democracy | Social_capital | Ethnic_fractionalization | Trade_Openness |
|---|---|---|---|---|---|---|---|
| log_GDP_PC | Pearson Correlation | 1 | .530** | .262** | .275** | -.442** | .504** |
| | Sig. (2-tailed) | | .000 | .003 | .002 | .000 | .000 |
| | N | 163 | 50 | 130 | 130 | 148 | 115 |
| Education | Pearson Correlation | .530** | 1 | .575** | .482** | -.270 | .205 |
| | Sig. (2-tailed) | .000 | | .000 | .000 | .069 | .277 |
| | N | 50 | 53 | 49 | 49 | 46 | 30 |
| Democracy | Pearson Correlation | .262** | .575** | 1 | .856** | -.086 | .437** |
| | Sig. (2-tailed) | .003 | .000 | | .000 | .331 | .000 |
| | N | 130 | 49 | 136 | 136 | 129 | 92 |
| Social_capital | Pearson Correlation | .275** | .482** | .856** | 1 | -.063 | .421** |
| | Sig. (2-tailed) | .002 | .000 | .000 | | .477 | .000 |
| | N | 130 | 49 | 136 | 136 | 129 | 92 |
| Ethnic_fractionalization | Pearson Correlation | -.442** | -.270 | -.086 | -.063 | 1 | -.511** |
| | Sig. (2-tailed) | .000 | .069 | .331 | .477 | | .000 |
| | N | 148 | 46 | 129 | 129 | 154 | 111 |
| Trade_Openness | Pearson Correlation | .504** | .205 | .437** | .421** | -.511** | 1 |
| | Sig. (2-tailed) | .000 | .277 | .000 | .000 | .000 | |
| | N | 115 | 30 | 92 | 92 | 111 | 116 |

\*\*. Correlation is significant at the 0.01 level (2-tailed).

The correlations are similar to those with the original variable, but social capital and Democracy are now also significantly correlated with the dependent variable.

## Regression Models

The purpose of regression analysis in the study is to identify and measure the strength of relationships between variables and to attribute economic development to the chosen features. Statistical tests such as the t-test and F-test have been used to test the significance of the model. The first model was fitted between GDP per capita and Democracy. All the variables chosen are scale variables, so there was no need to introduce dummy variables. This model assesses the role of Democracy in the economic development of a country. Since the dataset has many missing values, they were imputed with the mean of the respective variable. The result from SPSS is presented in the tables below:

Model Summary (dependent variable: GDP_per_capita; predictors: (Constant), Democracy):

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .062 | .004 | -.002 | 19105.42550 |

ANOVA (dependent variable: GDP_per_capita; predictors: (Constant), Democracy):

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 237204598.344 | 1 | 237204598.344 | .650 | .421 |
| | Residual | 60957886336.257 | 167 | 365017283.451 | | |
| | Total | 61195090934.601 | 168 | | | |

Coefficients (dependent variable: GDP_per_capita):

| Model | Term | B | Std. Error | Beta | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 14543.353 | 4641.833 | | 3.133 | .002 | 5379.117 | 23707.590 |
| | Democracy | 639.450 | 793.235 | .062 | .806 | .421 | -926.611 | 2205.512 |

It is clear from the above model that Democracy alone cannot explain a significant part of the variation in per capita GDP. However, it may be interesting to see the effect of Democracy after controlling for some other variables. One point to note here is that the scale of per capita GDP is very large, so a natural log is taken to bring it to a smaller scale. The subsequent regressions are carried out on the log of per capita GDP. The model is fit again with log GDP per capita as the dependent variable and Democracy as the independent variable.

Model Summary (dependent variable: log_GDP_PC; predictors: (Constant), Democracy):

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .208 | .043 | .037 | 1.15660 |

ANOVA (dependent variable: log_GDP_PC; predictors: (Constant), Democracy):

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 10.053 | 1 | 10.053 | 7.515 | .007 |
| | Residual | 223.400 | 167 | 1.338 | | |
| | Total | 233.453 | 168 | | | |

Coefficients (dependent variable: log_GDP_PC):

| Model | Term | B | Std. Error | Beta | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 8.481 | .281 | | 30.182 | .000 | 7.927 | 9.036 |
| | Democracy | .132 | .048 | .208 | 2.741 | .007 | .037 | .226 |

This time, however, the F-test for the significance of the model is significant, and the test for the significance of Democracy is also significant. However, the model explains only 4.3% of the variation in the log of GDP per capita, as indicated by an R² of 0.043. After this interesting finding, a new variable, education, is introduced into the regression while controlling for Democracy. This variable was chosen because of the strong belief that education is the most valuable asset for human resources and that a more educated society or country develops more. However, this is not supported by the data, as the variable education is not significant. The change in R² is negligible, and the model has similar explanatory power as before. Below is the SPSS output after entering the variable education.

ANOVA (dependent variable: log_GDP_PC; Model 1 predictors: (Constant), Democracy; Model 2 predictors: (Constant), Democracy, Education):

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 10.053 | 1 | 10.053 | 7.515 | .007 |
| | Residual | 223.400 | 167 | 1.338 | | |
| | Total | 233.453 | 168 | | | |
| 2 | Regression | 13.023 | 2 | 6.511 | 4.904 | .009 |
| | Residual | 220.430 | 166 | 1.328 | | |
| | Total | 233.453 | 168 | | | |

Coefficients (dependent variable: log_GDP_PC):

| Model | Term | B | Std. Error | Beta | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 8.481 | .281 | | 30.182 | .000 | 7.927 | 9.036 |
| | Democracy | .132 | .048 | .208 | 2.741 | .007 | .037 | .226 |
| 2 | (Constant) | 7.957 | .449 | | 17.726 | .000 | 7.071 | 8.843 |
| | Democracy | .110 | .050 | .174 | 2.208 | .029 | .012 | .209 |
| | Education | .014 | .010 | .118 | 1.495 | .137 | -.005 | .033 |

Model Summary (dependent variable: log_GDP_PC; Model 1 predictors: (Constant), Democracy; Model 2 predictors: (Constant), Democracy, Education):

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .208 | .043 | .037 | 1.15660 |
| 2 | .236 | .056 | .044 | 1.15234 |

Clearly, the variable education was not significant, so it was removed at this stage, although it will be added again later, after controlling for other variables, to see its effect. Next, a third variable is considered: trade openness is introduced along with Democracy to see its effect on economic development. Intuitively, it should be strongly related to economic development, and this intuition is supported by the data.

Model Summary (dependent variable: log_GDP_PC; Model 1 predictors: (Constant), Democracy; Model 2 predictors: (Constant), Democracy, Trade_Openness):

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .208 | .043 | .037 | 1.15660 |
| 2 | .421 | .177 | .167 | 1.07577 |

ANOVA (dependent variable: log_GDP_PC):

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 10.053 | 1 | 10.053 | 7.515 | .007 |
| | Residual | 223.400 | 167 | 1.338 | | |
| | Total | 233.453 | 168 | | | |
| 2 | Regression | 41.344 | 2 | 20.672 | 17.862 | .000 |
| | Residual | 192.110 | 166 | 1.157 | | |
| | Total | 233.453 | 168 | | | |

Coefficients (dependent variable: log_GDP_PC):

| Model | Term | B | Std. Error | Beta | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 8.481 | .281 | | 30.182 | .000 | 7.927 | 9.036 |
| | Democracy | .132 | .048 | .208 | 2.741 | .007 | .037 | .226 |
| 2 | (Constant) | 6.712 | .429 | | 15.646 | .000 | 5.865 | 7.559 |
| | Democracy | .072 | .046 | .114 | 1.570 | .118 | -.019 | .163 |
| | Trade_Openness | .270 | .052 | .378 | 5.200 | .000 | .167 | .372 |

The model with Democracy and Trade Openness is significant, and the variable trade openness is significant as well. An interesting point to note is that, after the inclusion of trade openness, the variable Democracy is no longer significant. The next model is therefore fit with trade openness, adding social capital and ethnic fractionalization. These two variables are added together because they are elements of society and culture. In principle, they are not expected to have a significant effect on economic development. The results from SPSS are given in the tables below:

ANOVA (dependent variable: log_GDP_PC; Model 1 predictors: (Constant), Trade_Openness; Model 2 predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization):

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 38.489 | 1 | 38.489 | 33.364 | .000 |
| | Residual | 194.964 | 169 | 1.154 | | |
| | Total | 233.453 | 170 | | | |
| 2 | Regression | 59.263 | 3 | 19.754 | 18.939 | .000 |
| | Residual | 174.191 | 167 | 1.043 | | |
| | Total | 233.453 | 170 | | | |

Coefficients (dependent variable: log_GDP_PC):

| Model | Term | B | Std. Error | Beta | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 6.958 | .399 | | 17.443 | .000 | 6.170 | 7.745 | | |
| | Trade_Openness | .290 | .050 | .406 | 5.776 | .000 | .191 | .389 | 1.000 | 1.000 |
| 2 | (Constant) | 7.953 | .534 | | 14.890 | .000 | 6.898 | 9.007 | | |
| | Trade_Openness | .176 | .054 | .247 | 3.261 | .001 | .070 | .283 | .779 | 1.284 |
| | Social_capital | .110 | .052 | .146 | 2.126 | .035 | .008 | .212 | .948 | 1.055 |
| | Ethnic_fractionalization | -1.411 | .351 | -.297 | -4.022 | .000 | -2.103 | -.718 | .817 | 1.223 |

Model Summary (dependent variable: log_GDP_PC):

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .406 | .165 | .160 | 1.07407 | .165 | 33.364 | 1 | 169 | .000 |
| 2 | .504 | .254 | .240 | 1.02130 | .089 | 9.958 | 2 | 167 | .000 |

The result suggests that social capital and ethnic fractionalization are both statistically significant in explaining the variability in log GDP. R² also increases to 0.254, which indicates that the model can explain 25.4% of the variability in the dependent variable. Finally, a regression is fit with all variables included to see the effect of Democracy and education. Both newly included variables turn out to be insignificant, and the change in R² is minimal. Hence the final model includes trade openness, social capital, and ethnic fractionalization.

Model Summary (dependent variable: log_GDP_PC; Model 1 predictors: (Constant), Social_capital, Ethnic_fractionalization, Trade_Openness; Model 2 predictors: (Constant), Social_capital, Ethnic_fractionalization, Trade_Openness, Education, Democracy):

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .504 | .254 | .240 | 1.02130 | .254 | 18.939 | 3 | 167 | .000 |
| 2 | .513 | .263 | .241 | 1.02124 | .009 | 1.010 | 2 | 165 | .366 |
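
The "F Change" values in the model summaries compare a smaller model with a larger one that adds predictors. As a hedged illustration (the function name `f_change` is hypothetical, and this is the standard nested-model F-test rather than a reproduction of SPSS internals), the statistic can be recomputed from the residual sums of squares reported in the ANOVA tables. The numbers below are those for the step that adds Social_capital and Ethnic_fractionalization to Trade_Openness.

```python
from scipy import stats

def f_change(sse_reduced, sse_full, df_added, df_resid_full):
    """Nested-model F-test.

    sse_reduced   -- residual sum of squares of the smaller model
    sse_full      -- residual sum of squares of the larger model
    df_added      -- number of predictors added (df1)
    df_resid_full -- residual degrees of freedom of the larger model (df2)
    """
    f_stat = ((sse_reduced - sse_full) / df_added) / (sse_full / df_resid_full)
    p_value = stats.f.sf(f_stat, df_added, df_resid_full)
    return f_stat, p_value

# Residual sums of squares from the ANOVA table: 194.964 (Trade_Openness only)
# and 174.191 (three predictors), with df1 = 2 and df2 = 167.
f_stat, p_value = f_change(194.964, 174.191, df_added=2, df_resid_full=167)
print(f"F change = {f_stat:.3f}, p = {p_value:.5f}")
```

The result agrees with the reported F Change of 9.958 up to rounding. A large p-value for this test, as in the last step (.366), means the added predictors do not improve the fit beyond what chance would produce.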

### Final Model

A final regression model is fitted with the variables that were significant at the last step, and it is checked for multicollinearity and the other regression assumptions.

In the plot of residuals against predicted values, there is no apparent pattern, so the residuals can be regarded as independent and free of heteroscedasticity.
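The fitting procedure behind the final model (mean imputation followed by ordinary least squares) can be sketched with NumPy. Everything here is an assumption for illustration: the data are synthetic, the three columns only stand in for Trade_Openness, Social_capital, and Ethnic_fractionalization, and the missingness rate is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 171

# Synthetic stand-ins for the three predictors (assumption: NOT QoG data).
X = rng.normal(size=(n, 3))
y = 8.0 + X @ np.array([0.2, 0.1, -0.3]) + rng.normal(size=n)
X[rng.random(X.shape) < 0.15] = np.nan   # simulate missing predictor values

# Mean imputation: replace each missing entry with its column mean.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), X_filled])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta
resid = y - fitted

# R squared and adjusted R squared (p counts the intercept).
p = design.shape[1]
r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")

# Residuals are uncorrelated with fitted values by construction.
resid_fitted_corr = np.corrcoef(fitted, resid)[0, 1]
```

Because that correlation is always zero for OLS with an intercept, heteroscedasticity has to be judged from the spread of the points in a residual-versus-fitted plot, not from the correlation. Note also that mean imputation keeps each variable's mean but shrinks its variance, which is worth keeping in mind when many values are missing.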

Model Summary (final model):

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .504 | .254 | .240 | 1.02130 | .254 | 18.939 | 3 | 167 | .000 |

ANOVA (dependent variable: log_GDP_PC; predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization):

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 59.263 | 3 | 19.754 | 18.939 | .000 |
| | Residual | 174.191 | 167 | 1.043 | | |
| | Total | 233.453 | 170 | | | |

Coefficients (dependent variable: log_GDP_PC):

| Model | Term | B | Std. Error | Beta | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 7.953 | .534 | | 14.890 | .000 | 6.898 | 9.007 | | |
| | Trade_Openness | .176 | .054 | .247 | 3.261 | .001 | .070 | .283 | .779 | 1.284 |
| | Ethnic_fractionalization | -1.411 | .351 | -.297 | -4.022 | .000 | -2.103 | -.718 | .817 | 1.223 |
| | Social_capital | .110 | .052 | .146 | 2.126 | .035 | .008 | .212 | .948 | 1.055 |

Using the VIF metric to test for multicollinearity, none of the variables has VIF > 5, which indicates that there is no serious multicollinearity. For outlier analysis, Cook’s distance was saved for the final model, and the cases were sorted in decreasing order of magnitude. A threshold of 0.05 is used for outlier detection. Only two cases have a Cook’s distance above the threshold: North Korea and Venezuela. The regression is fit again after removing these two observations, and the result is much the same, with a slightly better R², although Social_capital is now only marginally significant (p = .069).

Model Summary (dependent variable: log_GDP_PC; predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization):

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .524 | .275 | .262 | 1.00583 | .275 | 20.866 | 3 | 165 | .000 |

ANOVA (dependent variable: log_GDP_PC):

| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 63.331 | 3 | 21.110 | 20.866 | .000 |
| | Residual | 166.930 | 165 | 1.012 | | |
| | Total | 230.260 | 168 | | | |

Coefficients (dependent variable: log_GDP_PC):

| Model | Term | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 7.808 | .568 | | 13.745 | .000 | | |
| | Social_capital | .095 | .052 | .124 | 1.830 | .069 | .958 | 1.044 |
| | Ethnic_fractionalization | -1.428 | .353 | -.300 | -4.040 | .000 | .798 | 1.253 |
| | Trade_Openness | .205 | .057 | .274 | 3.635 | .000 | .770 | 1.298 |

ACF and PACF for the unstandardized residual are also plotted. There does not seem to be any serious autocorrelation in the residuals, indicating that the independence assumption of regression holds.
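
The two diagnostics used for the final model, VIF and Cook's distance, can be computed directly from their textbook definitions. The sketch below is an illustration on synthetic data with one planted outlier; the function name `ols_diagnostics` is hypothetical, and only the 0.05 and VIF > 5 cut-offs come from the analysis above.

```python
import numpy as np

def ols_diagnostics(X, y):
    """Return (vif, cooks_d) for an OLS fit of y on X plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    p = k + 1                                  # parameters incl. intercept

    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    s2 = resid @ resid / (n - p)               # residual variance estimate

    # Leverage: diagonal of the hat matrix design @ pinv(design).
    leverage = np.einsum("ij,ji->i", design, np.linalg.pinv(design))
    cooks_d = resid**2 / (p * s2) * leverage / (1 - leverage) ** 2

    # VIF_j = 1 / (1 - R2_j), regressing predictor j on the other predictors.
    vif = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        r = X[:, j] - others @ coef
        r2_j = 1 - (r @ r) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vif[j] = 1 / (1 - r2_j)
    return vif, cooks_d

# Synthetic data (assumption: NOT QoG data) with one planted outlier in row 0.
rng = np.random.default_rng(3)
n = 170
X = rng.normal(size=(n, 3))
y = 8.0 + X @ np.array([0.2, 0.1, -1.4]) + rng.normal(size=n)
X[0] = [3.0, -3.0, 3.0]
y[0] += 15.0

vif, cooks_d = ols_diagnostics(X, y)
flagged = np.flatnonzero(cooks_d > 0.05)       # threshold used in the text
print("VIF:", np.round(vif, 3))
print("cases above the Cook's distance threshold:", flagged)
```

Tolerance in the SPSS output is the reciprocal of VIF (for example, 1 / .779 ≈ 1.284), so either column can be used for the multicollinearity check.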

After testing, the final model uses log GDP per capita as the dependent variable, with the outliers North Korea and Venezuela removed. The model explains 27.5% of the variation in log per capita GDP.