Diagnostics and Logistic Regression Using SPSS
Logistic regression is a statistical technique used to estimate the relationship between a categorical (typically binary) dependent variable and a set of independent variables. The independent variables can be nominal, ordinal, interval, or ratio-level. The technique produces a regression equation in which each coefficient represents the change in the log odds of the outcome associated with a one-unit change in the corresponding independent variable, holding the other variables constant.
Regression Analysis
Coefficients (a)
Model | Unstandardized B | Std. Error | Standardized Beta | t | Sig. | Tolerance | VIF |
(Constant) | 19.352 | 6.249 | | 3.097 | .003 | | |
% low income | .038 | .062 | .050 | .621 | .537 | .699 | 1.430 |
% taking ACT | .089 | .095 | .106 | .930 | .356 | .348 | 2.870 |
% meet Math standard | .139 | .201 | .203 | .693 | .491 | .053 | 18.809 |
% meet Reading standard | .443 | .238 | .571 | 1.861 | .068 | .048 | 20.673 |
% meet Science standard | .023 | .187 | .032 | .121 | .904 | .066 | 15.047 |
a. Dependent Variable: % graduating
Looking at the variance inflation factors (VIF), we see that three variables (% meet Math standard, % meet Reading standard, and % meet Science standard) have VIF values greater than 10, which is evidence of multicollinearity. The tolerance values of these variables are also below 0.1, which points to the same conclusion. The solution is to remove one of these three variables, because they are highly correlated with one another (r > 0.9). If the problem persists after removing one, we remove another, so in total we drop either one or two of the three.
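The link between the two collinearity columns can be checked by hand, since VIF is the reciprocal of tolerance. The sketch below is only an illustration: it uses the rounded tolerance values from the table, so the computed VIFs differ slightly from the SPSS figures, and the helper name `vif_from_tolerance` is made up for this example.

```python
# Tolerance values copied (rounded) from the Coefficients table.
tolerance = {
    "% low income": 0.699,
    "% taking ACT": 0.348,
    "% meet Math standard": 0.053,
    "% meet Reading standard": 0.048,
    "% meet Science standard": 0.066,
}


def vif_from_tolerance(tol):
    """Return the variance inflation factor, VIF = 1 / tolerance."""
    return 1.0 / tol


# Compute a VIF for every predictor and flag those above the usual cut-off of 10.
vif = {name: vif_from_tolerance(tol) for name, tol in tolerance.items()}
flagged = [name for name, value in vif.items() if value > 10]

for name, value in vif.items():
    print(f"{name}: VIF is about {value:.2f}")
print("VIF > 10:", flagged)
```

The same three test-score variables are flagged whether one looks at VIF above 10 or tolerance below 0.1, because the two rules are equivalent.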
B
1. We use box plots of the dependent and independent variables to determine whether there are outliers. By the usual box plot convention, an observation is flagged when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile.
The box plot of grad suggests that there are no outliers, and the box plot of loinc suggests the same, while the box plot of reading suggests that observations 36 and 62 are outliers. The box plots are shown below.
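A box plot screen of this kind can be sketched with the common 1.5 × IQR fence rule. This is an assumption-laden illustration: the data below are made up (the original data set is not reproduced here), and the quartile method used by Python's `statistics.quantiles` may differ slightly from the hinges SPSS uses, so borderline points could be classified differently.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical percentages, for illustration only.
sample = [52, 55, 57, 58, 60, 61, 63, 64, 66, 95]
print("fences:", iqr_fences(sample))
print("outliers:", outliers(sample))
```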
2. i. We use Cook's distance to determine whether there are influential points. The threshold for Cook's distance is calculated as
The Cook's distance plot below shows that observations 3, 18, 35, 47, and 58 are influential points.
ii. Using the standardized DFFITS, the threshold value is determined by
2×√(p/n)
2×√(3/62)=0.4399
The standardized DFFITS plot below shows that observations 3, 4, 18, 35, 47, and 58 are influential points. This result is similar to the Cook's distance result, except that observation 4 is also included here.
iii. Next, we plot the DFBETAS for all independent variables. The cut-off for the DFBETAS is given as
2/√n=2/√62=0.254
The plots are presented below. The plot for low income shows that observations 3, 4, 18, 35, 40, and 47 are influential points. The plot for reading shows that only observations 13, 47, and 58 are influential points.
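The two cut-offs used above, 2×√(p/n) for DFFITS and 2/√n for DFBETAS, are easy to reproduce. The following is a minimal sketch with p = 3 and n = 62 taken from the text; the function names and the `flag` helper are hypothetical and only show how a cut-off would be applied to a list of diagnostic values.

```python
import math


def dffits_cutoff(p, n):
    """Cut-off for standardized DFFITS: 2 * sqrt(p / n)."""
    return 2 * math.sqrt(p / n)


def dfbetas_cutoff(n):
    """Cut-off for DFBETAS: 2 / sqrt(n)."""
    return 2 / math.sqrt(n)


def flag(values, cutoff):
    """Return 1-based observation numbers whose absolute value exceeds the cut-off."""
    return [i for i, v in enumerate(values, start=1) if abs(v) > cutoff]


p, n = 3, 62  # values used in the text
print(round(dffits_cutoff(p, n), 4))
print(round(dfbetas_cutoff(n), 3))
```

Given the DFFITS or DFBETAS values saved from the regression, `flag(values, cutoff)` would list the observation numbers to inspect.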
3
Normality
The histogram below shows that the residuals of the model are slightly skewed to the left, so the assumption of normality is not fully met, but the departure is mild and should not be a problem.
Linearity and Homoscedasticity
The residual versus fitted plot below shows that the assumption of linearity is met because the scatter plot does not show any distinct pattern. Moreover, the points are distributed evenly above and below the zero line, which means the assumption of homoscedasticity is met.
Independence
The partial regression plots below show that for both variables, there is no evidence of nonlinearity. This means that the assumption of independence is met.
Regression Diagnostics
$l = \log\left(\frac{p}{1-p}\right) = 1.256 + 0.022\,\text{age} + 0.020\,\text{sex} - 1.606\,\text{college}$
The value of the -2 log-likelihood for the null model is 1595.048.
For age, the null and alternative hypotheses are
$H_{01}: \beta_{\text{age}} = 0$
$H_{11}: \beta_{\text{age}} \neq 0$
The result shows that the Wald statistic is 32.326, p<0.001, which means we reject the null hypothesis. Thus, age is statistically significant.
For sex, the null and alternative hypotheses are
$H_{02}: \beta_{\text{sex}} = 0$
$H_{12}: \beta_{\text{sex}} \neq 0$
The result shows that the Wald statistic is 0.024, p=0.876>0.05, which means we cannot reject the null hypothesis. Thus, sex is not statistically significant.
For college, the null and alternative hypotheses are
$H_{03}: \beta_{\text{college}} = 0$
$H_{13}: \beta_{\text{college}} \neq 0$
The result shows that the Wald statistic is 64.745, p<0.001, which means we reject the null hypothesis. Thus, college is statistically significant.
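The Wald tests above can be approximated from the coefficient and standard error of each variable, since the Wald chi-square with 1 degree of freedom is (B / S.E.)². The sketch below uses the rounded B and S.E. values from the Variables in the Equation table, so the statistics come out close to, but not exactly equal to, the SPSS values (SPSS works with unrounded estimates). The function name is an assumption for illustration.

```python
import math


def wald_test(b, se):
    """Return the Wald chi-square (1 df) and its p-value for one coefficient."""
    w = (b / se) ** 2
    # For 1 df, the chi-square upper tail equals erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(w / 2))
    return w, p_value


# Rounded (B, S.E.) pairs as reported by SPSS.
coefficients = {
    "age": (0.022, 0.004),
    "sex": (0.020, 0.128),
    "college": (-1.606, 0.200),
}

results = {name: wald_test(b, se) for name, (b, se) in coefficients.items()}
for name, (w, p_value) in results.items():
    print(f"{name}: Wald is about {w:.3f}, p is about {p_value:.3f}")
```

The conclusions match the text: age and college are significant at the 0.05 level, while sex is not.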
The null and alternative hypotheses for the Hosmer and Lemeshow test are given as
H0: the observed and expected proportions are the same across voted, $\hat{\pi}_{kj} = \hat{\pi}_{kj0}$
H1: the observed and expected proportions are not the same across voted, $\hat{\pi}_{kj} \neq \hat{\pi}_{kj0}$
The result shows that $\chi^2(8) = 10.043$, p=0.262>0.05, which means we cannot reject the null hypothesis. This means that the fit of the model is acceptable.
Hosmer and Lemeshow Test
Step | Chi-square | df | Sig. |
1 | 10.043 | 8 | .262 |
Variables in the Equation
Step | Variable | B | S.E. | Wald | df | Sig. | Exp(B) | 95% C.I. for Exp(B) Lower | 95% C.I. for Exp(B) Upper |
Step 1a | age | .022 | .004 | 32.326 | 1 | .000 | 1.022 | 1.014 | 1.030 |
| sex | .020 | .128 | .024 | 1 | .876 | 1.020 | .794 | 1.310 |
| college(1) | -1.606 | .200 | 64.745 | 1 | .000 | .201 | .136 | .297 |
| Constant | 1.256 | .253 | 24.679 | 1 | .000 | 3.510 | | |
a. Variable(s) entered on step 1: age, sex, college.
B
The null and alternative hypotheses are
H0: the model with income is not significantly different from the model without income
H1: the model with income is significantly different from the model without income
The -2 log-likelihood of the model without income is 1480.412.
The -2 log-likelihood of the model with income is 1417.153.
Next, we estimate the $\Delta G^2$ statistic as
$\Delta G^2 = -2\log L_{\text{restricted}} - \left(-2\log L_{\text{unrestricted}}\right)$
$\Delta G^2 = 1480.412 - 1417.153$
$\Delta G^2 = 63.259$
The critical value at the 0.05 level is $\chi^2(1) = 3.84$.
Since $\Delta G^2 > \chi^2(1)$, we reject the null hypothesis and conclude that income is statistically significant.
The null and alternative hypotheses for the Hosmer and Lemeshow test are given as
H0: the observed and expected proportions are the same across voted, $\hat{\pi}_{kj} = \hat{\pi}_{kj0}$
H1: the observed and expected proportions are not the same across voted, $\hat{\pi}_{kj} \neq \hat{\pi}_{kj0}$
The result shows that $\chi^2(8) = 7.217$, p=0.513>0.05, which means we cannot reject the null hypothesis. This means that the fit of the model is acceptable.
C
The null and alternative hypotheses are
H0: the model with male is not significantly different from the model without male
H1: the model with male is significantly different from the model without male
The -2 log-likelihood of the model with male is 1417.153.
After removing male, the -2 log-likelihood of the model increased to 1417.884.
Next, we estimate the $\Delta G^2$ statistic as
$\Delta G^2 = -2\log L_{\text{restricted}} - \left(-2\log L_{\text{unrestricted}}\right)$
$\Delta G^2 = 1417.884 - 1417.153$
$\Delta G^2 = 0.731$
The critical value at the 0.05 level is $\chi^2(1) = 3.84$.
Since $\Delta G^2 < \chi^2(1)$, we fail to reject the null hypothesis and conclude that removing male does not increase the deviance significantly.
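Both deviance comparisons in this walkthrough (adding income, then removing male) follow the same recipe, which can be sketched as below. The sketch takes the 1 degree of freedom used in the text as given; for 1 df the chi-square upper tail can be computed from the standard library alone. The function name is hypothetical.

```python
import math


def deviance_difference_test(dev_restricted, dev_unrestricted):
    """Return the deviance difference and its p-value, assuming 1 df as in the text."""
    delta_g2 = dev_restricted - dev_unrestricted
    # Upper tail of a chi-square with 1 df.
    p_value = math.erfc(math.sqrt(delta_g2 / 2))
    return delta_g2, p_value


# -2 log-likelihood values quoted in the text.
income_test = deviance_difference_test(1480.412, 1417.153)  # without vs. with income
male_test = deviance_difference_test(1417.884, 1417.153)    # without vs. with male

for label, (delta_g2, p_value) in (("income", income_test), ("male", male_test)):
    decision = "reject H0" if p_value < 0.05 else "fail to reject H0"
    print(f"{label}: deviance difference = {delta_g2:.3f}, {decision}")
```

Comparing the p-value with 0.05 is equivalent to comparing the deviance difference with the critical value 3.84.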
The null and alternative hypotheses for the Hosmer and Lemeshow test are given as
H0: the observed and expected proportions are the same across voted, $\hat{\pi}_{kj} = \hat{\pi}_{kj0}$
H1: the observed and expected proportions are not the same across voted, $\hat{\pi}_{kj} \neq \hat{\pi}_{kj0}$
The result shows that $\chi^2(8) = 3.810$, p=0.874>0.05, which means we cannot reject the null hypothesis. This means that the fit of the model is acceptable.
Hosmer and Lemeshow Test
Step | Chi-square | df | Sig. |
1 | 7.217 | 8 | .513 |
I would use the last model in C2 because it contains only significant independent variables and has the highest p-value for the Hosmer and Lemeshow test, which suggests the best fit among the models considered.
The predicted probability of voting for a 22-year-old college graduate with a family income level in the 1st quartile is computed as follows:
$p(\text{voted}=1 \mid \text{age}, \text{college}, \text{income}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{age} + \beta_2 \text{college} + \beta_3 \text{income})}}$
$p(\text{voted}=1 \mid \text{age}=22, \text{college}=1, \text{income}=\text{1st quartile}) = \frac{1}{1 + e^{-(0.274 + 0.027 \times 22 + 1.268 - 1.498)}}$
$p(\text{voted}=1 \mid \text{age}=22, \text{college}=1, \text{income}=\text{1st quartile}) = 0.653$
The predicted probability of voting for this person is therefore 0.653.
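The calculation can be reproduced with the logistic function, p = 1 / (1 + e^(-η)), where η is the linear predictor. The sketch below plugs in the rounded coefficients quoted in the text; with these rounded values the result is about 0.654, which agrees with the reported 0.653 up to rounding (the reported figure was presumably computed from unrounded estimates). Variable names are illustrative.

```python
import math


def predicted_probability(linear_predictor):
    """Logistic function: convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Rounded coefficients quoted in the text.
b0 = 0.274           # constant
b_age = 0.027        # per year of age
b_college = 1.268    # college graduate
b_income_q1 = -1.498  # 1st income quartile

# 22-year-old college graduate in the 1st income quartile.
eta = b0 + b_age * 22 + b_college * 1 + b_income_q1 * 1
p = predicted_probability(eta)
print(f"linear predictor = {eta:.3f}, probability is about {p:.3f}")
```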
exp(b_college) = 3.553. This means that the odds of voting for college graduates are 3.553 times the odds for non-college graduates.
Yes, age is a statistically significant unique predictor of the probability of voting (p<0.001). The odds ratio for age is 1.027, which means that each additional year of age increases the odds of voting by 2.7%.
The estimated value of exp[b_income(2)] is 0.462. This means that people in the second income quartile have 53.8% lower odds of voting than those in the fourth income quartile.
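The percentage interpretations above follow from the odds ratio: the percent change in the odds is (Exp(B) - 1) × 100. A small sketch, using the odds ratios quoted in the text and hypothetical function names:

```python
import math


def odds_ratio(b):
    """Odds ratio for a logistic regression coefficient: exp(B)."""
    return math.exp(b)


def percent_change_in_odds(ratio):
    """Percent change in the odds implied by an odds ratio."""
    return (ratio - 1.0) * 100.0


print(round(odds_ratio(0.027), 3))               # odds ratio for age
print(round(percent_change_in_odds(1.027), 1))   # age: percent increase per year
print(round(percent_change_in_odds(0.462), 1))   # income(2): negative means lower odds
```

A negative result means the odds are lower than in the reference category, which is how the 53.8% figure for the second income quartile arises.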
No. Higher total family income increases the odds of voting: as income moves from the 1st to the 2nd, 3rd, and 4th quartile, the odds of voting increase.
As shown below, the sensitivity is 93.9% and the specificity is 19.6%. The Cox & Snell R-squared is 0.125 and the Nagelkerke R-squared is 0.179.
Hosmer and Lemeshow Test
Step | Chi-square | df | Sig. |
1 | 3.810 | 8 | .874 |
Classification Table (a)
Observed | Predicted: DID NOT VOTE | Predicted: VOTED | Percentage Correct |
Voted: DID NOT VOTE | 75 | 307 | 19.6 |
Voted: VOTED | 58 | 890 | 93.9 |
Overall Percentage | | | 72.6 |
a. The cut value is .500
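The percentages in the classification table can be recomputed from its four counts. The sketch below treats VOTED as the positive class, which is an assumption consistent with the sensitivity and specificity figures quoted in the text.

```python
# Counts from the classification table (rows: observed, columns: predicted).
true_negative, false_positive = 75, 307   # observed DID NOT VOTE
false_negative, true_positive = 58, 890   # observed VOTED

# Share of actual voters predicted as voters.
sensitivity = true_positive / (true_positive + false_negative)
# Share of actual non-voters predicted as non-voters.
specificity = true_negative / (true_negative + false_positive)
# Share of all cases classified correctly.
total = true_negative + false_positive + false_negative + true_positive
overall = (true_positive + true_negative) / total

print(f"sensitivity = {sensitivity:.1%}")
print(f"specificity = {specificity:.1%}")
print(f"overall     = {overall:.1%}")
```

The low specificity shows that, at the .500 cut value, the model classifies most non-voters as voters.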
Model Summary
Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square |
1 | 1417.884a | .125 | .179 |
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001. |