# Using STATA for Data Analysis

STATA is a statistical software package for managing, analyzing, and visualizing data. It is widely used by researchers and analysts in political science, biomedicine, and economics to identify patterns and trends in data and to make informed predictions. STATA offers both a graphical user interface and a command-line interface, which makes it accessible to researchers and data analysts of all skill levels.

### Regression Analysis

The dummy variables can be created with the generate command (gen). Because the gender variable is already available, the “female” variable is set to 0 when gender = 1 (male) and 1 when gender = 2 (female). Similarly, the “Never Married” variable is set to 1 when marital status = 5 and 0 otherwise.
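The recoding rule can be checked outside STATA with a short Python sketch. The function names and the sample respondents are hypothetical; only the codings (gender 1 = male, 2 = female; marital status 5 = never married) come from the text. Leaving unknown or missing codes as missing is an assumption made here.

```python
from typing import Optional


def make_female(gender: Optional[int]) -> Optional[int]:
    """Return 0 for male (gender == 1), 1 for female (gender == 2)."""
    if gender == 1:
        return 0
    if gender == 2:
        return 1
    return None  # assumption: unknown or missing codes stay missing


def make_never_married(marstat: Optional[int]) -> Optional[int]:
    """Return 1 when marital status == 5 (never married), 0 otherwise."""
    if marstat is None:
        return None  # assumption: a missing status is not coded as 0
    return 1 if marstat == 5 else 0


# Hypothetical respondents, for illustration only.
gender = [1, 2, 2, None]
marstat = [5, 1, 5, 3]

female = [make_female(g) for g in gender]
never_married = [make_never_married(m) for m in marstat]

print(female)
print(never_married)
```

Handling missing codes explicitly matters because a bare comparison such as "marital status equals 5" would otherwise place a respondent with no recorded status in the 0 group.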

### Table 1: Summary Statistics

| Variable | Label (coding) | Mean | Standard Deviation |
| --- | --- | --- | --- |
| female | 1 if female; 0 if male | 0.44 | 0.4968841 |
| Never_married | 1 if never married; 0 otherwise | 0.486 | 0.5003045 |
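For a 0/1 variable, the standard deviation is fully determined by the mean and the sample size, which gives a quick consistency check on Table 1. The sketch below uses the sample (n - 1) formula. The sample size is not stated in the text; n = 500 is an assumption chosen because it reproduces the reported values closely.

```python
import math


def binary_sd(mean: float, n: int) -> float:
    """Sample standard deviation (n - 1 denominator) of a 0/1 variable."""
    return math.sqrt(mean * (1.0 - mean) * n / (n - 1))


N_ASSUMED = 500  # assumption: the sample size is not given in the source

sd_female = binary_sd(0.44, N_ASSUMED)
sd_never_married = binary_sd(0.486, N_ASSUMED)

print(f"female:        {sd_female:.7f}")
print(f"never_married: {sd_never_married:.7f}")
```

Under that assumed sample size, both computed values agree with the reported standard deviations to about six decimal places.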

a) numchildren and gender

H0: the average number of children a respondent has is the same irrespective of gender

Ha: the average number of children a respondent has differs by gender

Here, the average number of children is expected to be the same for male and female respondents unless there is selection bias, because every child has one male and one female parent.

b) numchildren and marstat

H0: the average number of children a respondent has is the same irrespective of marital status

Ha: the average number of children a respondent has differs by marital status

Here, the null hypothesis is expected to be rejected, as people with different marital statuses may have different numbers of children (especially the never-married category, who might have fewer children on average).

c) numchildren and birthyear

H0: the average number of children a respondent has is the same irrespective of birth year

Ha: the average number of children a respondent has differs by birth year

Here too, the null hypothesis is expected to be rejected, as respondents with a later birth year (younger) might have fewer children than respondents with an earlier birth year (older).

d) numchildren and faminc_new

H0: the average number of children a respondent has is the same irrespective of family income level

Ha: the average number of children a respondent has differs by family income level

Here too, the null hypothesis is expected to be rejected, as people with higher family income might be more willing to have more children because they can better afford them.

### Table 2: OLS Regression Results

| Variable | Regression coefficient | P > \|t\| | Statistically significant? | Interpretation |
| --- | --- | --- | --- | --- |
| gender | 0.291274 | 0.061 | No | Male respondents on average had 0.29 more children than female respondents, but the difference is not statistically significant and is therefore ignored. |
| never_married | -1.509981 | 0.000 | Yes | Respondents who were never married had on average 1.51 fewer children than other respondents. |
| birthyr | -0.017774 | 0.000 | Yes | The number of children decreased by 0.02 for every one-year increase in birth year. This is expected, as older people tend to have more children. |
| faminc_new | -0.0488778 | 0.041 | Yes | People tend to have fewer children on average as their income level rises, but the difference is small and only marginally significant. |
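Written out, the model behind Table 2 has the form below. The symbols $\beta_j$ and $\varepsilon_i$ are notation added here for illustration; the fitted intercept is not reported in the source, so only the slope estimates can be filled in.

$$
\text{numchildren}_i = \beta_0 + \beta_1\,\text{gender}_i + \beta_2\,\text{never\_married}_i + \beta_3\,\text{birthyr}_i + \beta_4\,\text{faminc\_new}_i + \varepsilon_i
$$

As a worked example of reading a slope, holding the other variables fixed, a respondent born ten years later is predicted to have

$$
10 \times \hat{\beta}_3 = 10 \times (-0.017774) = -0.17774
$$

children relative to an otherwise identical respondent, that is, about 0.18 fewer.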

#### Data Analytics

From the two plots above (the residual plot and the residual histogram), the residuals appear approximately normally distributed, but their dispersion is not constant across the values of the independent variables. The homoskedasticity assumption therefore appears to be violated in this model. This may be because all the variables are discrete rather than continuous. The heteroskedasticity test points the same way (p < 0.05). The non-constant variance might also be due to variables omitted from the regression that affect the dependent variable.
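To illustrate what a heteroskedasticity test measures, the sketch below regresses squared OLS residuals on the regressor and computes n times R-squared, which is compared with a chi-squared critical value. This is the studentized (Koenker) form with a single regressor, written in plain Python; it is not necessarily the variant used for the result above, and the data are synthetic, not the survey data.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of a one-regressor least-squares fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R-squared of the one-regressor least-squares fit of y on x."""
    intercept, slope = ols_simple(x, y)
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_resid / ss_total


def lm_statistic(x, y):
    """n * R^2 from regressing the squared OLS residuals on x."""
    intercept, slope = ols_simple(x, y)
    resid_sq = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, resid_sq)


# Synthetic data: deterministic alternating "errors" so the run is repeatable.
x = list(range(1, 21))
sign = [1 if xi % 2 == 0 else -1 for xi in x]
y_het = [2 + 0.5 * xi + 0.3 * xi * s for xi, s in zip(x, sign)]  # spread grows with x
y_hom = [2 + 0.5 * xi + 1.0 * s for xi, s in zip(x, sign)]       # constant spread

CRITICAL_5PCT = 3.84  # chi-squared critical value, 1 degree of freedom

lm_het = lm_statistic(x, y_het)
lm_hom = lm_statistic(x, y_hom)
print(f"growing spread:  LM = {lm_het:.2f}, reject = {lm_het > CRITICAL_5PCT}")
print(f"constant spread: LM = {lm_hom:.2f}, reject = {lm_hom > CRITICAL_5PCT}")
```

A large statistic means the squared residuals move systematically with the regressor, which is the pattern a fanning residual plot shows visually.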

For the logistic regression, the binary variable “children” is generated by setting it to 0 when numchildren = 0 and 1 when numchildren != 0.

### Table 3: Logistic Regression Results

| Variable | Odds ratio | P > \|z\| | Statistically significant? | Interpretation |
| --- | --- | --- | --- | --- |
| female | 0.678909 | 0.102 | No | Female respondents had only 0.679 times the odds of male respondents of having a child, but the difference is not statistically significant and is therefore ignored. |
| never_married | 0.1781378 | 0.000 | Yes | Respondents who were never married had only 0.178 times the odds of other respondents of having a child. |
| birthyr | 0.9810059 | 0.013 | Yes | The odds that a respondent born one year later has a child are 0.98 times those of a respondent born a year earlier. |
| faminc_new | 0.9214551 | 0.027 | Yes | The odds that a respondent one income level higher has a child are 0.92 times those of a respondent one income level lower. |

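The entries in Table 3 are all below 1 and are interpreted as multiples of odds, so they read as odds ratios, that is, exponentiated logit coefficients. The sketch below converts them back to log-odds coefficients and shows how an odds ratio compounds over several units of a predictor. The function names are hypothetical; the numbers are the ones reported in the table.

```python
import math

# Values reported in Table 3.
ODDS_RATIOS = {
    "female": 0.678909,
    "never_married": 0.1781378,
    "birthyr": 0.9810059,
    "faminc_new": 0.9214551,
}


def log_odds_coefficient(odds_ratio: float) -> float:
    """Logit coefficient corresponding to an odds ratio."""
    return math.log(odds_ratio)


def odds_ratio_for_change(odds_ratio: float, units: float) -> float:
    """Odds ratio implied by a change of `units` in the predictor."""
    return odds_ratio ** units


for name, value in ODDS_RATIOS.items():
    print(f"{name}: log-odds coefficient = {log_odds_coefficient(value):.4f}")

ten_years = odds_ratio_for_change(ODDS_RATIOS["birthyr"], 10)
print(f"birthyr, born 10 years later: odds ratio = {ten_years:.3f}")
```

An odds ratio below 1 corresponds to a negative log-odds coefficient, and a per-year ratio of 0.98 compounds to roughly 0.83 over ten years, so the birth-year effect is larger than the single-year figure suggests.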

From the LR test of alpha in the negative binomial regression, there is no evidence of overdispersion. Therefore, the Poisson regression model is an appropriate model to use.
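This conclusion rests on how the negative binomial model nests the Poisson model. Assuming the usual mean-dispersion parameterization (an assumption, since the source does not state which one was fitted), with conditional mean $\mu_i$ and dispersion parameter $\alpha$:

$$
\operatorname{Var}(y_i \mid x_i) = \mu_i + \alpha\,\mu_i^{2}, \qquad H_0:\ \alpha = 0
$$

Under $H_0$ the variance equals the mean, which is the Poisson case, so failing to reject $H_0$ supports using the Poisson model.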