Exploratory Data Analysis
This article demonstrates exploratory data analysis and correlation with detailed explanations. Correlation measures the strength of the linear relationship between two variables. Exploratory data analysis is applied to learn more about the variables in the data set.
Dataset: QoG
The QoG standard dataset is one of the largest datasets. It draws on over 100 data sources and contains more than 2000 variables. The QoG is constantly updated, and this study uses the latest version, from 2019. It contains both cross-sectional and time-series data, with information on countries since 1946, which is very useful if a researcher is interested in countries as the unit of analysis. However, the dataset should be used with care, because it includes some countries that no longer exist as well as countries that did not exist earlier but exist now, as a result of either reunification or secession. Despite that, most of the dataset is constantly updated and can therefore be considered reliable and usable for research purposes.
Exploratory Data Analysis
The Dependent variable
Economic development
GDP per capita is deemed more suitable than the other variables in the data that measure the same concept. Economic strength and economic development show the same correlation with Democracy and the other independent variables. To operationalize economic development, we also used the QoG dataset 2013, from which GDP per capita is chosen. The reason is that GDP per capita is one of the most widely used instruments for measuring economic development or performance. For the sake of accuracy, this paper uses the Geary-Khamis dollar, also called the international dollar. The Geary-Khamis dollar (GK$) is a hypothetical unit that has the same purchasing power as the US dollar had in the United States at a given point in time (QoG: 2013). The GK$ is a recognized unit used in studies and other economic contexts.
The Independent variables
Democracy
Democracy is an independent variable. To measure the concept, I have used the index of Democracy in the QoG standard dataset 2013. It is a scale from 0 to 10 built on several component scores intended to measure Democracy: civil liberties, democratic political culture, electoral process and pluralism, political participation, and functioning of government. Democracy can be measured with both nominal and interval variables. The index used here ranges from 0 to 10, where 0 indicates no democracy and 10 indicates the highest level of Democracy.
Social cohesion
Social cohesion refers to the sense of solidarity and the strength of relationships among members of a given nation, region, or community. It is normally determined by several indicators, one of which is the amount of social capital collectively held by a community. Social capital relates to the shared group resources present in a community, such as knowledge of a job vacancy held by a close family friend. Social capital is therefore accessed through social networks formed with other members of society. Social cohesion is concerned with developing shared values, facing common challenges, creating a sense of togetherness, promoting group engagement across community activities, and mitigating social disparities in wealth distribution and income levels.
Ethnic fractionalization
Ethnic fractionalization concerns the number, sizes, socio-economic distribution, and geographical location of distinct cultural groups within a given country, such as Japan or Egypt. These cultural features may relate to customs, traditions, ethnicity, religion, and language. These features are often used in the monopolization of power and social exclusion.
Trade openness
Trade openness describes the inward or outward orientation of an economy. Outward orientation refers to economies that seize trading opportunities with other countries, while inward orientation describes countries that avoid or fail to take advantage of such opportunities. Factors directly linked to the establishment of either orientation include scale economies, market competitiveness, technologies, infrastructure, import-export regulations, and trade barriers.
Education
We are not just dealing with the conventional interpretation of education, which is the acquisition of knowledge and skills through a series of teaching, self-learning, or apprenticeship. We are also dealing with the assessment of quality education, which focuses on the proper cognitive development of learners as they progress through the education system.
Exploratory Analysis
Before the regression was run, several checks were made to see whether the model assumptions hold. A Q-Q plot of GDP per capita was drawn, along with one for its natural log transformation, to assess the normality of the dependent variable.

It is clear that the natural log transformation is much closer to the normality assumption than the untransformed variable. Further analysis is therefore carried out on the log of per capita GDP. Since the log is a strictly increasing bijective transformation, the ordering of countries is preserved. A scatter plot of each independent variable against the dependent variable is drawn to check the linearity of the relationships.
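The Q-Q comparison can also be summarized numerically. The sketch below is an illustration on synthetic, right-skewed data rather than the QoG values: it computes the correlation between the sorted data and the theoretical normal quantiles, which approaches 1 when the Q-Q points lie on a straight line. The function name `qq_correlation` and the plotting positions (i - 0.5) / n are assumptions made for this example.

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(values):
    """Correlation between sorted values and theoretical normal quantiles.

    Values near 1 mean the points of a normal Q-Q plot lie close to a line.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    # Plotting positions (i - 0.5) / n, one common convention for Q-Q plots.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return float(np.corrcoef(theoretical, x)[0, 1])

# Synthetic stand-in for GDP per capita (hypothetical, not QoG data).
rng = np.random.default_rng(0)
gdp_pc = rng.lognormal(mean=9.0, sigma=1.2, size=170)

r_raw = qq_correlation(gdp_pc)
r_log = qq_correlation(np.log(gdp_pc))
print(f"Q-Q correlation, raw: {r_raw:.3f}; log: {r_log:.3f}")
```

On skewed data of this kind, the log-transformed series should give the higher value, mirroring what the two Q-Q plots show visually.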



Correlation Analysis
Per Capita GDP vs. Democracy
The correlation between per capita GDP and the democracy indicator is 0.089, which is small; the test of significance of the correlation coefficient yields a p-value of 0.315. This indicates that there may not be a strong linear relationship between the two variables.
Per Capita GDP vs. Democracy
The correlation between per capita GDP and the democracy indicator is 0.128, which is small; the test of significance of the correlation coefficient yields a p-value of 0.146. This indicates that there may not be a strong linear relationship between the two variables.
Per Capita GDP vs. Social Capital
The correlation between per capita GDP and social capital is 0.128, which is small and not significant: the test of significance of the correlation coefficient yields a p-value of 0.146. This indicates that there may not be a strong linear relationship between the two variables.
Per Capita GDP vs. Ethnic Fractionalization
The correlation between per capita GDP and ethnic fractionalization is -0.340, which is moderate in size and statistically significant: the test of significance of the correlation coefficient yields a p-value of 0.001. This shows that there is a significant negative linear relationship between the two variables. This variable may play an important role in explaining variability in per capita GDP.
Per Capita GDP vs. Trade Openness
The correlation between per capita GDP and trade openness is 0.385, which is moderate in size and statistically significant: the test of significance of the correlation coefficient yields a p-value of 0.001. This indicates that there is a significant positive linear relationship between the two variables.
Per Capita GDP vs. Education
The correlation between per capita GDP and education is 0.483, a fairly high positive value that is significant: the test of significance of the correlation coefficient yields a p-value of 0.001. This indicates that there may be a strong linear relationship between the two variables.
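A correlation coefficient and its significance test can be reproduced outside SPSS along the following lines. This is a minimal sketch on synthetic data, assuming missing values are dropped pair by pair; the helper name `pairwise_pearson` and the variable names are hypothetical, and `scipy.stats.pearsonr` supplies the coefficient and the two-sided p-value.

```python
import numpy as np
from scipy import stats

def pairwise_pearson(x, y):
    """Pearson r, two-sided p-value, and N, using rows where both values exist."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))  # pairwise deletion of missing values
    r, p = stats.pearsonr(x[keep], y[keep])
    return float(r), float(p), int(keep.sum())

# Synthetic stand-ins with some missing entries (hypothetical, not QoG data).
rng = np.random.default_rng(1)
education = rng.normal(size=60)
gdp = 0.5 * education + rng.normal(size=60)
education[:5] = np.nan
gdp[50:] = np.nan

r, p, n = pairwise_pearson(gdp, education)
print(f"r = {r:.3f}, p = {p:.3f}, N = {n}")
```

Reporting N next to each coefficient matters here, because the number of complete pairs can differ considerably from one pair of variables to the next.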
Log per capita GDP with variables
A similar correlation study is carried out with log per capita GDP and the explanatory variables.
Correlations
| | | log_GDP_PC | Education | Democracy | Social_capital | Ethnic_fractionalization | Trade_Openness |
|---|---|---|---|---|---|---|---|
| log_GDP_PC | Pearson Correlation | 1 | .530** | .262** | .275** | -.442** | .504** |
| | Sig. (2-tailed) | | .000 | .003 | .002 | .000 | .000 |
| | N | 163 | 50 | 130 | 130 | 148 | 115 |
| Education | Pearson Correlation | .530** | 1 | .575** | .482** | -.270 | .205 |
| | Sig. (2-tailed) | .000 | | .000 | .000 | .069 | .277 |
| | N | 50 | 53 | 49 | 49 | 46 | 30 |
| Democracy | Pearson Correlation | .262** | .575** | 1 | .856** | -.086 | .437** |
| | Sig. (2-tailed) | .003 | .000 | | .000 | .331 | .000 |
| | N | 130 | 49 | 136 | 136 | 129 | 92 |
| Social_capital | Pearson Correlation | .275** | .482** | .856** | 1 | -.063 | .421** |
| | Sig. (2-tailed) | .002 | .000 | .000 | | .477 | .000 |
| | N | 130 | 49 | 136 | 136 | 129 | 92 |
| Ethnic_fractionalization | Pearson Correlation | -.442** | -.270 | -.086 | -.063 | 1 | -.511** |
| | Sig. (2-tailed) | .000 | .069 | .331 | .477 | | .000 |
| | N | 148 | 46 | 129 | 129 | 154 | 111 |
| Trade_Openness | Pearson Correlation | .504** | .205 | .437** | .421** | -.511** | 1 |
| | Sig. (2-tailed) | .000 | .277 | .000 | .000 | .000 | |
| | N | 115 | 30 | 92 | 92 | 111 | 116 |
**. Correlation is significant at the 0.01 level (2-tailed).
The correlations are similar to those with the original variable, but now social capital and Democracy are also significantly correlated with the dependent variable.
Regression Models
The purpose of regression analysis in the study is to identify and measure the strength of relationships between variables and attribute the economic development to various chosen features. Statistical tests such as t-test and F-test have been used to test the significance of the model.
The first model was fitted between GDP per capita and Democracy. All the variables chosen are scale variables, and hence there was no need for the introduction of dummy variables. This model assesses the role of Democracy in the economic development of a country. Since the dataset has many missing values, they have been imputed with the mean of the respective variable. The result from SPSS is presented in the table below:
Model Summary (b)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .062 (a) | .004 | -.002 | 19105.42550 |
a. Predictors: (Constant), Democracy
b. Dependent Variable: GDP_per_capita

ANOVA (a)
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 237204598.344 | 1 | 237204598.344 | .650 | .421 (b) |
| | Residual | 60957886336.257 | 167 | 365017283.451 | | |
| | Total | 61195090934.601 | 168 | | | |
a. Dependent Variable: GDP_per_capita
b. Predictors: (Constant), Democracy

Coefficients (a)
| Model | | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95.0% CI for B, Lower Bound | 95.0% CI for B, Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 14543.353 | 4641.833 | | 3.133 | .002 | 5379.117 | 23707.590 |
| | Democracy | 639.450 | 793.235 | .062 | .806 | .421 | -926.611 | 2205.512 |
a. Dependent Variable: GDP_per_capita
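The mean imputation applied before fitting this model can be sketched as follows. The scores are hypothetical and the helper `impute_mean` is a name chosen for this example; the snippet only illustrates the mechanics and one known side effect, namely that filling gaps with the mean leaves the mean unchanged but shrinks the spread of the variable.

```python
import numpy as np

def impute_mean(column):
    """Replace missing values (NaN) with the mean of the observed values."""
    col = np.asarray(column, dtype=float)
    return np.where(np.isnan(col), np.nanmean(col), col)

# Hypothetical democracy scores with two missing entries.
democracy = np.array([7.5, np.nan, 3.0, 9.0, np.nan, 5.5])
filled = impute_mean(democracy)

print(filled)
print("std of observed values:", np.nanstd(democracy))
print("std after imputation:  ", filled.std())
```

Because the imputed cases all sit exactly at the mean, correlations and regression slopes involving heavily imputed variables tend to be pulled toward zero, which is worth keeping in mind when reading the regression tables.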
It is clear from the above model that Democracy alone cannot explain a significant part of the variation in per capita GDP. However, it may be interesting to see the effect of Democracy after controlling for some other variables. One point to note is that the scale of per capita GDP is very large, so the natural log is taken to bring it to a smaller scale. Next, regressions are carried out on the log of per capita GDP.
The model is fitted again with log GDP per capita as the dependent variable and Democracy as the independent variable.
Model Summary (b)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .208 (a) | .043 | .037 | 1.15660 |
a. Predictors: (Constant), Democracy
b. Dependent Variable: log_GDP_PC

ANOVA (a)
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 10.053 | 1 | 10.053 | 7.515 | .007 (b) |
| | Residual | 223.400 | 167 | 1.338 | | |
| | Total | 233.453 | 168 | | | |
a. Dependent Variable: log_GDP_PC
b. Predictors: (Constant), Democracy

Coefficients (a)
| Model | | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95.0% CI for B, Lower Bound | 95.0% CI for B, Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 8.481 | .281 | | 30.182 | .000 | 7.927 | 9.036 |
| | Democracy | .132 | .048 | .208 | 2.741 | .007 | .037 | .226 |
a. Dependent Variable: log_GDP_PC
This time, however, the F-test for the model is significant (p = 0.007), and so is the t-test for Democracy. Still, the model explains only 4.3% of the variation in the log of GDP per capita, as indicated by an R2 of 0.043.
After this interesting finding, a new variable, education, is introduced into the regression alongside Democracy. This variable was chosen because of the strong belief that education is the greatest asset for human resources, and that a more educated society or country develops more. However, this is not supported by the data, as education is not significant (p = 0.137). The change in R2 is negligible, and the model has similar explanatory power as before. Below is the SPSS output after entering the variable education.
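The R2 and F values follow directly from the sums of squares in the ANOVA table for the log model. As a small worked check (the function name `anova_summary` is chosen for this example), R2 is the regression sum of squares divided by the total, and F is the ratio of the regression and residual mean squares:

```python
def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Return (R squared, F statistic) from ANOVA sums of squares."""
    ss_total = ss_regression + ss_residual
    r_squared = ss_regression / ss_total
    f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)
    return r_squared, f_stat

# Sums of squares and degrees of freedom reported for log_GDP_PC on Democracy.
r2, f = anova_summary(10.053, 223.400, 1, 167)
print(f"R2 = {r2:.3f}, F = {f:.3f}")  # approximately 0.043 and 7.515
```

The same function applied to the untransformed model (237204598.344 and 60957886336.257, with the same degrees of freedom) gives the much smaller R2 of about 0.004 reported there.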
ANOVA (a)
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 10.053 | 1 | 10.053 | 7.515 | .007 (b) |
| | Residual | 223.400 | 167 | 1.338 | | |
| | Total | 233.453 | 168 | | | |
| 2 | Regression | 13.023 | 2 | 6.511 | 4.904 | .009 (c) |
| | Residual | 220.430 | 166 | 1.328 | | |
| | Total | 233.453 | 168 | | | |
a. Dependent Variable: log_GDP_PC
b. Predictors: (Constant), Democracy
c. Predictors: (Constant), Democracy, Education

Coefficients (a)
| Model | | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95.0% CI for B, Lower Bound | 95.0% CI for B, Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 8.481 | .281 | | 30.182 | .000 | 7.927 | 9.036 |
| | Democracy | .132 | .048 | .208 | 2.741 | .007 | .037 | .226 |
| 2 | (Constant) | 7.957 | .449 | | 17.726 | .000 | 7.071 | 8.843 |
| | Democracy | .110 | .050 | .174 | 2.208 | .029 | .012 | .209 |
| | Education | .014 | .010 | .118 | 1.495 | .137 | -.005 | .033 |
a. Dependent Variable: log_GDP_PC

Model Summary (c)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .208 (a) | .043 | .037 | 1.15660 |
| 2 | .236 (b) | .056 | .044 | 1.15234 |
a. Predictors: (Constant), Democracy
b. Predictors: (Constant), Democracy, Education
c. Dependent Variable: log_GDP_PC
Clearly, the variable education was not significant and was removed at this stage, although it will be added again after controlling for other variables to see its effect. Next, a third variable is considered: trade openness is introduced along with Democracy to see its effect on economic development. Intuitively, it should be strongly related, and this intuition is supported by the data.
Model Summary (c)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .208 (a) | .043 | .037 | 1.15660 |
| 2 | .421 (b) | .177 | .167 | 1.07577 |
a. Predictors: (Constant), Democracy
b. Predictors: (Constant), Democracy, Trade_Openness
c. Dependent Variable: log_GDP_PC

ANOVA (a)
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 10.053 | 1 | 10.053 | 7.515 | .007 (b) |
| | Residual | 223.400 | 167 | 1.338 | | |
| | Total | 233.453 | 168 | | | |
| 2 | Regression | 41.344 | 2 | 20.672 | 17.862 | .000 (c) |
| | Residual | 192.110 | 166 | 1.157 | | |
| | Total | 233.453 | 168 | | | |
a. Dependent Variable: log_GDP_PC
b. Predictors: (Constant), Democracy
c. Predictors: (Constant), Democracy, Trade_Openness

Coefficients (a)
| Model | | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95.0% CI for B, Lower Bound | 95.0% CI for B, Upper Bound |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 8.481 | .281 | | 30.182 | .000 | 7.927 | 9.036 |
| | Democracy | .132 | .048 | .208 | 2.741 | .007 | .037 | .226 |
| 2 | (Constant) | 6.712 | .429 | | 15.646 | .000 | 5.865 | 7.559 |
| | Democracy | .072 | .046 | .114 | 1.570 | .118 | -.019 | .163 |
| | Trade_Openness | .270 | .052 | .378 | 5.200 | .000 | .167 | .372 |
a. Dependent Variable: log_GDP_PC
The model with Democracy and trade openness is significant, and the variable trade openness is significant as well. An interesting point to note is that after the inclusion of trade openness, Democracy is no longer significant (p = 0.118). So, the next model is fitted with trade openness, adding social capital and ethnic fractionalization. These two variables are added together as they are elements of society and culture. A priori, they are not expected to have a significant effect on economic development. The results from SPSS are given in the tables below:
ANOVA (a)
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 38.489 | 1 | 38.489 | 33.364 | .000 (b) |
| | Residual | 194.964 | 169 | 1.154 | | |
| | Total | 233.453 | 170 | | | |
| 2 | Regression | 59.263 | 3 | 19.754 | 18.939 | .000 (c) |
| | Residual | 174.191 | 167 | 1.043 | | |
| | Total | 233.453 | 170 | | | |
a. Dependent Variable: log_GDP_PC
b. Predictors: (Constant), Trade_Openness
c. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization

Coefficients
| Model | | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95.0% CI for B, Lower Bound | 95.0% CI for B, Upper Bound | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 6.958 | .399 | | 17.443 | .000 | 6.170 | 7.745 | | |
| | Trade_Openness | .290 | .050 | .406 | 5.776 | .000 | .191 | .389 | 1.000 | 1.000 |
| 2 | (Constant) | 7.953 | .534 | | 14.890 | .000 | 6.898 | 9.007 | | |
| | Trade_Openness | .176 | .054 | .247 | 3.261 | .001 | .070 | .283 | .779 | 1.284 |
| | Social_capital | .110 | .052 | .146 | 2.126 | .035 | .008 | .212 | .948 | 1.055 |
| | Ethnic_fractionalization | -1.411 | .351 | -.297 | -4.022 | .000 | -2.103 | -.718 | .817 | 1.223 |

Model Summary (c)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .406 (a) | .165 | .160 | 1.07407 | .165 | 33.364 | 1 | 169 | .000 |
| 2 | .504 (b) | .254 | .240 | 1.02130 | .089 | 9.958 | 2 | 167 | .000 |
a. Predictors: (Constant), Trade_Openness
b. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization
c. Dependent Variable: log_GDP_PC
The result suggests that social capital and ethnic fractionalization are both statistically significant in explaining the variability in log GDP. The R2 also increases to 0.254, which indicates that the model explains 25.4% of the variability in the dependent variable. Finally, a regression is fitted with all variables included to see the effect of Democracy and education. Both newly included variables turn out to be insignificant, however, and the change in R2 is minimal (0.009, p = 0.366). Hence the final model was chosen to contain trade openness, social capital, and ethnic fractionalization.
Model Summary (c)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .504 (a) | .254 | .240 | 1.02130 | .254 | 18.939 | 3 | 167 | .000 |
| 2 | .513 (b) | .263 | .241 | 1.02124 | .009 | 1.010 | 2 | 165 | .366 |
a. Predictors: (Constant), Social_capital, Ethnic_fractionalization, Trade_Openness
b. Predictors: (Constant), Social_capital, Ethnic_fractionalization, Trade_Openness, Education, Democracy
c. Dependent Variable: log_GDP_PC
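The F Change statistic used to decide whether a block of added variables is worth keeping can be recomputed from the R2 values alone. The sketch below uses the standard formula for comparing nested regression models; the function name `f_change` is chosen for this example, and because the inputs are R2 values rounded to three decimals, the results only approximate the reported 9.958 and 1.010.

```python
def f_change(r2_reduced, r2_full, df1, df2):
    """F statistic for the gain in R squared when df1 predictors are added.

    df2 is the residual degrees of freedom of the larger model.
    """
    return ((r2_full - r2_reduced) / df1) / ((1.0 - r2_full) / df2)

# Adding social capital and ethnic fractionalization to trade openness.
f_step1 = f_change(0.165, 0.254, df1=2, df2=167)
# Adding education and Democracy to the three-variable model.
f_step2 = f_change(0.254, 0.263, df1=2, df2=165)

print(f"F change, step 1: {f_step1:.2f}")
print(f"F change, step 2: {f_step2:.2f}")
```

The first step yields a large F and the second an F close to 1, which is the numerical basis for keeping the three-variable model and dropping education and Democracy.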
Final Model
A final regression model is fitted with the variables that were significant at the last step, and it is checked for multicollinearity and the other assumptions.
In the plot of residuals against predicted values there is no apparent pattern, so the residuals can be regarded as independent and free of heteroscedasticity.
Model Summary
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .504 (a) | .254 | .240 | 1.02130 | .254 | 18.939 | 3 | 167 | .000 |

ANOVA (a)
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 59.263 | 3 | 19.754 | 18.939 | .000 (b) |
| | Residual | 174.191 | 167 | 1.043 | | |
| | Total | 233.453 | 170 | | | |
a. Dependent Variable: log_GDP_PC
b. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization

Coefficients (a)
| Model | | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | 95.0% CI for B, Lower Bound | 95.0% CI for B, Upper Bound | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 7.953 | .534 | | 14.890 | .000 | 6.898 | 9.007 | | |
| | Trade_Openness | .176 | .054 | .247 | 3.261 | .001 | .070 | .283 | .779 | 1.284 |
| | Ethnic_fractionalization | -1.411 | .351 | -.297 | -4.022 | .000 | -2.103 | -.718 | .817 | 1.223 |
| | Social_capital | .110 | .052 | .146 | 2.126 | .035 | .008 | .212 | .948 | 1.055 |
a. Dependent Variable: log_GDP_PC
By the VIF metric for multicollinearity, none of the variables has VIF > 5, which indicates that there is no problematic multicollinearity.
For outlier analysis, Cook’s distance was saved for the final model, and cases were sorted in decreasing order of magnitude. A threshold of 0.05 is used for outlier detection. Only two cases have a Cook’s distance above the threshold: North Korea and Venezuela. The regression is fitted again after removing these cases, and the result is much the same, with a slightly better R2.
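The VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the others; it is also the reciprocal of the Tolerance column (for example, 1 / .779 is about 1.284). A minimal sketch with synthetic predictors follows; the function name `vif` and the generated data are assumptions for illustration, not the QoG variables.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic predictors: the third is moderately correlated with the first.
rng = np.random.default_rng(2)
trade = rng.normal(size=200)
social = rng.normal(size=200)
ethnic = -0.5 * trade + rng.normal(size=200)
X = np.column_stack([trade, social, ethnic])

v = vif(X)
print(np.round(v, 3))
```

Values close to 1 mean a predictor is nearly uncorrelated with the rest; the cutoff of 5 used in the text is one common rule of thumb.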
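Cook’s distance combines the size of a case’s residual with its leverage. The following sketch computes it from the hat matrix for an ordinary least squares fit on synthetic data with one planted influential case; the function name `cooks_distance` and the data are assumptions for illustration, and the 0.05 cutoff is simply the one used in the text (other rules of thumb, such as 4/n, exist).

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of every case in an OLS fit of y on X (plus intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # add intercept column
    n, p = design.shape
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    leverage = np.diag(hat)
    resid = y - hat @ y
    mse = resid @ resid / (n - p)
    return resid**2 / (p * mse) * leverage / (1.0 - leverage) ** 2

# Synthetic data with one planted influential case at index 7.
rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=40)
x[7], y[7] = 3.0, 15.0  # far from the line y = 1 + 2x, at an extreme x value

d = cooks_distance(x.reshape(-1, 1), y)
flagged = np.flatnonzero(d > 0.05)
print("most influential case:", int(np.argmax(d)))
print("cases above 0.05:", flagged)
```

Sorting `d` in decreasing order reproduces the ranking step described above; the flagged cases would then be dropped and the model refitted.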
Model Summary (b)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change |
|---|---|---|---|---|---|---|---|---|---|
| 1 | .524 (a) | .275 | .262 | 1.00583 | .275 | 20.866 | 3 | 165 | .000 |
a. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization
b. Dependent Variable: log_GDP_PC

ANOVA (a)
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 63.331 | 3 | 21.110 | 20.866 | .000 (b) |
| | Residual | 166.930 | 165 | 1.012 | | |
| | Total | 230.260 | 168 | | | |
a. Dependent Variable: log_GDP_PC
b. Predictors: (Constant), Trade_Openness, Social_capital, Ethnic_fractionalization

Coefficients (a)
| Model | | B (unstandardized) | Std. Error | Beta (standardized) | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 7.808 | .568 | | 13.745 | .000 | | |
| | Social_capital | .095 | .052 | .124 | 1.830 | .069 | .958 | 1.044 |
| | Ethnic_fractionalization | -1.428 | .353 | -.300 | -4.040 | .000 | .798 | 1.253 |
| | Trade_Openness | .205 | .057 | .274 | 3.635 | .000 | .770 | 1.298 |
a. Dependent Variable: log_GDP_PC
ACF and PACF for the unstandardized residual are also plotted. There does not seem to be any serious autocorrelation in the residuals, indicating that the independence assumption of regression holds.
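A residual autocorrelation check of this kind can be sketched as below. The residuals here are synthetic white noise, not the saved SPSS residuals, and the function name `acf` is chosen for this example. The band of plus or minus 1.96 / sqrt(n) is the usual large-sample approximation for uncorrelated data. Note that the result depends on the order in which the cases are listed, which is an assumption for cross-sectional country data.

```python
import numpy as np

def acf(residuals, max_lag):
    """Sample autocorrelations of the residuals for lags 1..max_lag."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    denom = e @ e
    return np.array([(e[k:] @ e[:-k]) / denom for k in range(1, max_lag + 1)])

# Synthetic residuals standing in for the unstandardized residuals.
rng = np.random.default_rng(4)
resid = rng.normal(size=169)

rho = acf(resid, max_lag=10)
bound = 1.96 / np.sqrt(resid.size)   # approximate 95% band under independence
outside = np.flatnonzero(np.abs(rho) > bound) + 1  # lags outside the band

print(np.round(rho, 3))
print("approximate band:", round(bound, 3), "lags outside:", outside)
```

A few lags falling marginally outside the band is expected by chance; a run of large values at the first lags would be the sign of a problem.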

After testing, the final model uses log GDP per capita as the dependent variable, with the outliers North Korea and Venezuela removed. The model explains 27.5% of the variation in log per capita GDP.