Regression Analysis

Diagnostics and Logistic Regression Using SPSS

Logistic regression is a statistical technique used to estimate the relationship between a categorical (typically binary) dependent variable and a set of independent variables. The independent variables can be nominal, ordinal, interval, or ratio level. The technique produces a regression equation in which the coefficients represent the estimated effect of each independent variable on the dependent variable.

Regression Analysis

Coefficients(a)

Model                       Unstd. B   Std. Error   Std. Beta       t   Sig.   Tolerance      VIF
1  (Constant)                 19.352        6.249               3.097   .003
   % low income                 .038         .062       .050     .621   .537        .699    1.430
   % taking ACT                 .089         .095       .106     .930   .356        .348    2.870
   % meet Math standard         .139         .201       .203     .693   .491        .053   18.809
   % meet Reading standard      .443         .238       .571    1.861   .068        .048   20.673
   % meet Science standard      .023         .187       .032     .121   .904        .066   15.047

a. Dependent Variable: % graduating

Looking at the variance inflation factors, three variables (% meet Math standard, % meet Reading standard, and % meet Science standard) have VIF values greater than 10, which is evidence of multicollinearity. The tolerance values of these variables are also less than 0.1, which points to the same conclusion. Because these three variables are highly correlated (r > 0.9), the remedy is to remove one of them; if the problem persists after removing one, we remove a second. In other words, we will drop one or two of % meet Math standard, % meet Reading standard, and % meet Science standard.
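The VIF > 10 / tolerance < 0.1 rule can be checked directly from the table above, since VIF = 1/tolerance. A minimal Python sketch (the tolerance values are copied from the coefficients table; small differences from the reported VIFs are rounding):

```python
# Flag multicollinear predictors from the tolerances in the coefficients
# table above. VIF = 1/tolerance; common cutoffs are VIF > 10 or
# tolerance < 0.1.
tolerances = {
    "% low income": 0.699,
    "% taking ACT": 0.348,
    "% meet Math standard": 0.053,
    "% meet Reading standard": 0.048,
    "% meet Science standard": 0.066,
}

flagged = []
for name, tol in tolerances.items():
    vif = 1.0 / tol
    if vif > 10 or tol < 0.1:
        flagged.append(name)
    print(f"{name}: tolerance={tol:.3f}, VIF={vif:.3f}")

print("Evidence of multicollinearity:", flagged)
```

Running this flags exactly the three achievement-standard variables discussed above.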

B

1.       We use a box plot of the dependent and independent variables to determine whether there are outliers. The decision is based on values less than Q1 − 1.5 × IQR or greater than Q3 + 1.5 × IQR.

The box plot of grad suggests that there are no outliers, and the box plot of loinc suggests the same, while the box plot of reading suggests that observations 36 and 62 are outliers. The box plots are shown below.

2. i. We use Cook's distance to determine whether there are influential points. The threshold for Cook's distance is calculated as 4/n = 4/62 ≈ 0.065 (a common convention). The Cook's distance plot below shows that observations 3, 18, 35, 47, and 58 are influential points.

ii. Using the standardized DFFITS, the threshold value is determined by

2 × √(p/n)

2 × √(3/62) = 0.4399

The standardized DFFITS plot below shows that observations 3, 4, 18, 35, 47, and 58 are influential points. This result is similar to the Cook's distance result, except that observation 4 is also flagged here.

iii. Next, we plot the DFBETAS for all independent variables. The cut-off for the DFBETAS is given as

2/√n = 2/√62 = 0.254
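The three influence cutoffs used in this part are simple functions of n and p; a Python sketch (the 4/n convention for Cook's distance is an assumption, since the text does not state its formula explicitly):

```python
import math

n, p = 62, 3  # number of observations and model parameters from the text

cooks_cutoff = 4 / n                  # Cook's distance: a common 4/n convention (assumed)
dffits_cutoff = 2 * math.sqrt(p / n)  # standardized DFFITS: 2*sqrt(p/n), as in the text
dfbetas_cutoff = 2 / math.sqrt(n)     # DFBETAS: 2/sqrt(n), as in the text

print(f"Cook's distance cutoff: {cooks_cutoff:.4f}")
print(f"DFFITS cutoff:          {dffits_cutoff:.4f}")
print(f"DFBETAS cutoff:         {dfbetas_cutoff:.4f}")
```

The DFFITS and DFBETAS values reproduce the 0.4399 and 0.254 cutoffs quoted above.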

The plots are presented below. The plot for low income shows that observations 3, 4, 18, 35, 40, and 47 are influential points. The plot for reading shows that only observations 13, 47, and 58 are influential points.

3.

Normality

The histogram below shows that the residuals of the model are slightly skewed to the left; the assumption of normality is not fully met, but this should not be a serious problem.

Linearity and Homoscedasticity

The residual-versus-fitted plot below shows that the assumption of linearity is met because the scatter plot does not show any distinct pattern. Moreover, the points are distributed evenly above and below the origin, which means the assumption of homoscedasticity is met.

Independence

The partial regression plots below show that, for both variables, there is no evidence of nonlinearity. This means that the assumption of independence is met.

Regression Diagnostics

logit(p) = log(p/(1 − p)) = 1.256 + 0.022·age + 0.020·sex − 1.606·college

The value of the −2 log-likelihood for the null model is 1595.048.

For age, the null and alternative hypotheses are

H0: β_age = 0

H1: β_age ≠ 0

The result shows that the Wald statistic is 32.326, p < 0.001, so we reject the null hypothesis. Thus, age is statistically significant.

For sex, the null and alternative hypotheses are

H0: β_sex = 0

H1: β_sex ≠ 0

The result shows that the Wald statistic is 0.024, p = 0.876 > 0.05, so we fail to reject the null hypothesis. Thus, sex is not statistically significant.

For college, the null and alternative hypotheses are

H0: β_college = 0

H1: β_college ≠ 0

The result shows that the Wald statistic is 64.745, p < 0.001, so we reject the null hypothesis. Thus, college is statistically significant.
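Each Wald statistic above is the squared ratio of a coefficient to its standard error, W = (B/S.E.)². A Python sketch using the rounded B and S.E. values from the table below, so age and college reproduce the reported statistics only approximately:

```python
# Wald statistic for each coefficient: W = (B / SE)^2.
# B and S.E. are the rounded values reported in the SPSS table, so the
# results for age and college differ slightly from the exact output.
coefs = {
    "age": (0.022, 0.004),
    "sex": (0.020, 0.128),
    "college": (-1.606, 0.200),
}

for name, (b, se) in coefs.items():
    wald = (b / se) ** 2
    print(f"{name}: Wald = {wald:.3f}")
```

The sex statistic matches the reported 0.024 exactly; age and college come out near 30.3 and 64.5 because of coefficient rounding.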

The null and alternative hypotheses are given as

H0: the observed and expected proportions are the same across voted (π_j = π_j0)

H1: the observed and expected proportions are not the same across voted (π_j ≠ π_j0)

The result shows that χ²(8) = 10.043, p = 0.262 > 0.05, so we cannot reject the null hypothesis. This means that the fit of the model is adequate.

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1          10.043    8   .262
Variables in the Equation

                       B    S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                 Lower      Upper
Step 1(a)  age       .022   .004   32.326    1   .000    1.022   1.014      1.030
           sex       .020   .128     .024    1   .876    1.020    .794      1.310
           college(1) -1.606 .200  64.745    1   .000     .201    .136       .297
           Constant 1.256   .253   24.679    1   .000    3.510

a. Variable(s) entered on step 1: age, sex, college.

B

The null and alternative hypotheses are

H0: the model with income is not significantly different from the model without income

H1: the model with income is significantly different from the model without income

The −2 log-likelihood of the model without income is 1480.412, and the −2 log-likelihood of the model with income is 1417.153. Next, we estimate the ΔG² statistic as

ΔG² = (−2 log-likelihood of the restricted model) − (−2 log-likelihood of the unrestricted model)

ΔG² = 1480.412 − 1417.153 = 63.259

The critical value is χ²(1) = 3.84. Since ΔG² > χ²(1), we reject the null hypothesis and conclude that income is statistically significant.

The null and alternative hypotheses for the goodness-of-fit test are given as

H0: the observed and expected proportions are the same across voted (π_j = π_j0)

H1: the observed and expected proportions are not the same across voted (π_j ≠ π_j0)

The result shows that χ²(8) = 7.217, p = 0.513 > 0.05, so we cannot reject the null hypothesis. This means that the fit of the model is adequate.
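The ΔG² (deviance) test is simple arithmetic on the two −2 log-likelihood values reported in the text; a Python sketch:

```python
# Likelihood-ratio (deviance) test for income, using the -2 log-likelihood
# values reported in the text.
neg2ll_restricted = 1480.412    # model without income
neg2ll_unrestricted = 1417.153  # model with income

delta_g2 = neg2ll_restricted - neg2ll_unrestricted
chi2_crit = 3.84  # chi-square critical value used in the text (df = 1, alpha = .05)

print(f"ΔG² = {delta_g2:.3f}")
print("Reject H0" if delta_g2 > chi2_crit else "Fail to reject H0")
```

This reproduces ΔG² = 63.259 > 3.84, so income is retained as significant.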

C

The null and alternative hypotheses are

H0: the model with sex (male) is not significantly different from the model without it

H1: the model with sex (male) is significantly different from the model without it

The −2 log-likelihood of the model with sex is 1417.153. Removing sex, the −2 log-likelihood of the model increases to 1417.884.

Next, we estimate the ΔG² statistic as

ΔG² = (−2 log-likelihood of the restricted model) − (−2 log-likelihood of the unrestricted model)

ΔG² = 1417.884 − 1417.153 = 0.731

The critical value is χ²(1) = 3.84. Since ΔG² < χ²(1), we fail to reject the null hypothesis and conclude that removing sex does not increase the deviance significantly.

The null and alternative hypotheses are given as

H0: the observed and expected proportions are the same across voted (π_j = π_j0)

H1: the observed and expected proportions are not the same across voted (π_j ≠ π_j0)

The result shows that χ²(8) = 3.810, p = 0.874 > 0.05, so we cannot reject the null hypothesis. This means that the fit of the model is adequate.

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1           7.217    8   .513

I would use the last model in C2 because it contains only significant independent variables and has the highest p-value on the Hosmer and Lemeshow test, which means it has the best fit.

The predicted probability of voting for a 22-year-old college graduate with a family income level in the 1st quartile is

p(voted = 1 | age, college, income) = 1/(1 + e^(−(β_0 + β_1·age + β_2·college + β_3·income)))

p(voted=1│age=22,college=1,income=1st quartile)=1/(1+e^(-(0.274+0.027*22+1.268-1.498)) )

p(voted=1│age=22,college=1,income=1st quartile)=0.653

The predicted probability of voting for a 22-year-old college graduate with a family income level in the 1st quartile is 0.653

Exp(B) for college is 3.553. This means that college graduates have 3.553 times the odds of voting compared with non-college graduates.

Yes, age is a statistically significant unique predictor of the probability of voting (p < 0.001). The odds ratio for age is 1.027, which means that a one-year increase in age increases the odds of voting by 2.7%.

The estimated value of exp[b_income(2)] is 0.462. This means that people in the second quartile of income have 53.8% lower odds of voting than those in the fourth (reference) quartile of income.

No; higher total family income increases the odds of voting, since the odds of voting increase as income moves from the 1st to the 2nd to the 3rd to the 4th quartile.

As shown below, the sensitivity is 93.9% and the specificity is 19.6%. The Cox & Snell R² is 0.125 and the Nagelkerke R² is 0.179.

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1           3.810    8   .874
Classification Table(a)

                                     Predicted
Observed                     DID NOT VOTE   VOTED   Percentage Correct
Step 1  DID NOT VOTE                   75     307                 19.6
        VOTED                          58     890                 93.9
        Overall Percentage                                        72.6

a. The cut value is .500
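Sensitivity, specificity, and overall accuracy follow directly from the classification table counts; a Python sketch:

```python
# Counts from the classification table (cut value .500), treating
# VOTED as the positive class.
tn, fp = 75, 307   # DID NOT VOTE: correctly / incorrectly classified
fn, tp = 58, 890   # VOTED: incorrectly / correctly classified

sensitivity = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)               # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn) # overall percentage correct

print(f"Sensitivity: {sensitivity:.1%}")
print(f"Specificity: {specificity:.1%}")
print(f"Overall:     {accuracy:.1%}")
```

This reproduces the table's 93.9%, 19.6%, and 72.6% figures.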


Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1            1417.884(a)                    .125                  .179

a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.