Diagnostics and Logistic Regression Using SPSS

Logistic regression is a statistical technique used to estimate the relationship between one dependent variable and a set of independent variables. The independent variables can be nominal, ordinal, interval, or ratio level. The technique produces a regression equation in which the coefficients represent the estimated effect of each independent variable on the dependent variable.

Regression Analysis

Coefficients(a)
(B, Std. Error = unstandardized coefficients; Beta = standardized coefficient; Tolerance, VIF = collinearity statistics)

Model                          B      Std. Error   Beta       t      Sig.   Tolerance      VIF
1  (Constant)               19.352      6.249               3.097    .003
   % low income               .038       .062      .050      .621    .537      .699       1.430
   % taking ACT               .089       .095      .106      .930    .356      .348       2.870
   % meet Math standard       .139       .201      .203      .693    .491      .053      18.809
   % meet Reading standard    .443       .238      .571     1.861    .068      .048      20.673
   % meet Science standard    .023       .187      .032      .121    .904      .066      15.047

a. Dependent Variable: % graduating

 

Looking at the variance inflation factors, we see that three variables (% meet Math standard, % meet Reading standard, and % meet Science standard) have VIF values greater than 10, which is evidence of multicollinearity. The tolerance values of these variables are also below 0.1, which points to the same conclusion. The remedy is to remove one of these three variables, because they are highly correlated with one another (r > 0.9). If the problem persists after removing one, we remove another, so in the end we will drop either one or two of % meet Math standard, % meet Reading standard, and % meet Science standard.
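Since the VIF is simply the reciprocal of the tolerance, the flagging rule can be checked directly from either column. A minimal Python sketch using the tolerance values from the table above (the small differences from the printed VIFs come from rounding in the tolerances):

```python
# VIF and tolerance are reciprocals: VIF = 1 / tolerance.
# Tolerance values taken from the coefficients table above.
tolerances = {
    "% low income": 0.699,
    "% taking ACT": 0.348,
    "% meet Math standard": 0.053,
    "% meet Reading standard": 0.048,
    "% meet Science standard": 0.066,
}

# Flag predictors whose VIF exceeds the common rule-of-thumb cutoff of 10
# (equivalently, tolerance below 0.1).
flagged = {name: round(1 / tol, 3) for name, tol in tolerances.items() if 1 / tol > 10}
```

Running this flags exactly the three achievement variables discussed above.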

B

1. We use box plots of the dependent and independent variables to determine whether there are outliers. A value is flagged as an outlier if it is less than Q1 − 1.5×IQR or greater than Q3 + 1.5×IQR.

The box plots of grad and loinc suggest that there are no outliers, while the box plot of reading suggests that observations 36 and 62 are outliers. The box plots are shown below.
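The 1.5×IQR fence rule can be sketched in a few lines of Python. Note that `statistics.quantiles` uses the "exclusive" quartile method by default, which may differ slightly from SPSS's quartile convention; the sample data here are made up purely for illustration:

```python
import statistics

def boxplot_outliers(data):
    """Flag values outside the 1.5*IQR fences used by a box plot."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

# Hypothetical sample: 40 sits far above the upper fence.
sample = [8, 9, 10, 11, 12, 12, 13, 14, 15, 40]
```

For this sample the fences are 3.0 and 21.0, so only the value 40 is flagged.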




2. i. We use Cook's distance to determine whether there are influential points. A common threshold for Cook's distance is calculated as

4/n = 4/62 = 0.0645

The Cook's distance plot below shows that observations 3, 18, 35, 47, and 58 are influential points.


ii. Using the standardized DFFITS, the threshold value is determined by

2×√(p/n)

2×√(3/62) = 0.4399

The standardized DFFITS plot below shows that observations 3, 4, 18, 35, 47, and 58 are influential points. This result matches the Cook's distance result except that observation 4 is also flagged here.


iii. Next, we plot the DFBETAS for all independent variables. The cutoff for the DFBETAS is given as

2/√n = 2/√62 = 0.254

The plots are presented below. The plot for low income shows that observations 3, 4, 18, 35, 40, and 47 are influential points, while the plot for reading shows that only observations 13, 47, and 58 are influential points.



3

Normality

The histogram below shows that the residuals of the model are slightly skewed to the left; the assumption of normality is not fully met, but the departure is mild and should not be a problem.



Linearity and Homoscedasticity

The residual-versus-fitted plot below shows that the assumption of linearity is met, because the scatter plot does not show any distinct pattern. Moreover, the points are distributed evenly above and below zero, which means the assumption of homoscedasticity is met.


Independence

The partial regression plots below show that for both variables, there is no evidence of nonlinearity. This means that the assumption of independence is met.



Logistic Regression

logit(p) = log[p / (1 − p)] = 1.256 + 0.022(age) + 0.020(sex) − 1.606(college)

The value of the −2 log-likelihood for the null model is 1595.048.

For age, the null and alternative hypotheses are

H0: β_age = 0

H1: β_age ≠ 0

The result shows that the Wald statistic is 32.326, p < 0.001, so we reject the null hypothesis. Thus, age is statistically significant.

For sex, the null and alternative hypotheses are

H0: β_sex = 0

H1: β_sex ≠ 0

The result shows that the Wald statistic is 0.024, p = 0.876 > 0.05, so we fail to reject the null hypothesis. Thus, sex is not statistically significant.

For college, the null and alternative hypotheses are

H0: β_college = 0

H1: β_college ≠ 0

The result shows that the Wald statistic is 64.745, p < 0.001, so we reject the null hypothesis. Thus, college is statistically significant.

For the Hosmer and Lemeshow test, the null and alternative hypotheses are

H0: the observed and expected proportions of voting are the same across the groups

H1: the observed and expected proportions of voting are not the same across the groups

The result shows that χ²(8) = 10.043, p = 0.262 > 0.05, so we cannot reject the null hypothesis. This means that the fit of the model is adequate.
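The p-value for this test can be recovered from the chi-square statistic by hand: for an even number of degrees of freedom the chi-square survival function has a closed form, so nothing beyond the standard library is needed. A sketch (this recomputes the p-value; it is not SPSS output):

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function P(X > x) of a chi-square variable, for even df only."""
    assert df % 2 == 0, "closed form holds only for even degrees of freedom"
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

p_value = chi2_sf_even_df(10.043, 8)
```

This gives p ≈ 0.262, matching the Sig. column of the Hosmer and Lemeshow table.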


Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 10.043 8 .262
Variables in the Equation
                   B      S.E.     Wald    df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                Lower      Upper
Step 1(a)
  age            .022    .004    32.326    1    .000    1.022    1.014      1.030
  sex            .020    .128      .024    1    .876    1.020     .794      1.310
  college(1)   -1.606    .200    64.745    1    .000     .201     .136       .297
  Constant      1.256    .253    24.679    1    .000    3.510

a. Variable(s) entered on step 1: age, sex, college.

B

The null and alternative hypotheses are

H0: the model with income is not significantly different from the model without income

H1: the model with income is significantly different from the model without income

-2 log-likelihood of the model without income is 1480.412

-2 log-likelihood of the model with income is 1417.153

Next, we estimate the ΔG² statistic as

ΔG² = (−2 log-likelihood of the restricted model) − (−2 log-likelihood of the unrestricted model)

ΔG² = 1480.412 − 1417.153 = 63.259

The critical value at the 0.05 level is χ²(1) = 3.84. Since ΔG² > χ²(1), we reject the null hypothesis and conclude that income is statistically significant.

For the Hosmer and Lemeshow test, the null and alternative hypotheses are

H0: the observed and expected proportions of voting are the same across the groups

H1: the observed and expected proportions of voting are not the same across the groups

The result shows that χ²(8) = 7.217, p = 0.513 > 0.05, so we cannot reject the null hypothesis. This means that the fit of the model is adequate.

C

The null and alternative hypotheses are

H0: the model with sex is not significantly different from the model without sex

H1: the model with sex is significantly different from the model without sex

The −2 log-likelihood of the model with sex is 1417.153.

Removing sex, the −2 log-likelihood of the model increases to 1417.884.

Next, we estimate the ΔG² statistic as

ΔG² = (−2 log-likelihood of the restricted model) − (−2 log-likelihood of the unrestricted model)

ΔG² = 1417.884 − 1417.153 = 0.731

The critical value at the 0.05 level is χ²(1) = 3.84. Since ΔG² < χ²(1), we fail to reject the null hypothesis and conclude that removing sex does not increase the deviance significantly.

For the Hosmer and Lemeshow test, the null and alternative hypotheses are

H0: the observed and expected proportions of voting are the same across the groups

H1: the observed and expected proportions of voting are not the same across the groups

The result shows that χ²(8) = 3.810, p = 0.874 > 0.05, so we cannot reject the null hypothesis. This means that the fit of the model is adequate.

Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 7.217 8 .513

I would use the last model in C2 because it contains only significant independent variables and has the highest p-value on the Hosmer and Lemeshow test, which indicates the best fit.

The predicted probability of voting for a 22-year-old college graduate with a family income level in the 1st quartile is

p(voted = 1 | age, college, income) = 1 / (1 + e^(−(β0 + β1·age + β2·college + β3·income)))

p(voted = 1 | age = 22, college = 1, income = 1st quartile) = 1 / (1 + e^(−(0.274 + 0.027×22 + 1.268 − 1.498)))

p(voted = 1 | age = 22, college = 1, income = 1st quartile) = 1 / (1 + e^(−0.638)) ≈ 0.653

The predicted probability of voting for a 22-year-old college graduate with a family income level in the 1st quartile is therefore 0.653.
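The substitution above can be checked numerically with the inverse-logit function. The coefficients below (0.274, 0.027, 1.268, −1.498) are the rounded values quoted in the text:

```python
import math

def predicted_probability(intercept, terms):
    """Inverse logit: p = 1 / (1 + exp(-(b0 + sum of b_i * x_i)))."""
    z = intercept + sum(b * x for b, x in terms)
    return 1 / (1 + math.exp(-z))

# 22-year-old college graduate, income in the 1st quartile
# (coefficients as quoted in the text above)
p = predicted_probability(0.274, [(0.027, 22), (1.268, 1), (-1.498, 1)])
```

With the rounded coefficients this evaluates to about 0.654; the small difference from the reported 0.653 is coefficient rounding in the SPSS output.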

exp(b_college) = 3.553. This means that college graduates have 3.553 times the odds of voting compared with non-college graduates.

Yes, age is a statistically significant unique predictor of the probability of voting (p < 0.001). The odds ratio for age is 1.027, which means that each additional year of age increases the odds of voting by 2.7%.

The estimated value of exp[b_income(2)] is 0.462. This means that people in the second quartile of income have 53.8% lower odds of voting than those in the fourth quartile of income.
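Each odds ratio is just the exponentiated coefficient, so the interpretations above can be checked directly. The coefficients used here (1.268 for college, 0.027 for age) are the rounded values quoted in the text, so the results differ from the printed odds ratios only in the last decimal:

```python
import math

# Odds ratios are exponentiated coefficients (rounded values from the text).
or_college = math.exp(1.268)   # ~3.554x the odds of voting for college graduates
or_age     = math.exp(0.027)   # ~1.027, i.e. ~2.7% higher odds per year of age

# An odds ratio below 1 reads as a percentage reduction in the odds:
pct_change_income2 = (0.462 - 1) * 100  # -53.8% vs. the 4th income quartile
```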

No; higher total family income increases the odds of voting: as income moves from the 1st to the 2nd to the 3rd to the 4th quartile, the odds of voting increase.

As shown below, the sensitivity is 93.9% and the specificity is 19.6%. The Cox & Snell R² is 0.125 and the Nagelkerke R² is 0.179.


Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 3.810 8 .874
Classification Table(a)
                                      Predicted
Observed                         DID NOT VOTE   VOTED   Percentage Correct
Step 1   Voted   DID NOT VOTE         75         307          19.6
                 VOTED                58         890          93.9
         Overall Percentage                                   72.6
a. The cut value is .500
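The sensitivity, specificity, and overall accuracy quoted above follow directly from the four cells of the classification table:

```python
# Cells of the classification table above (cut value .500):
tn, fp = 75, 307   # DID NOT VOTE row: correctly / incorrectly classified
fn, tp = 58, 890   # VOTED row: incorrectly / correctly classified

sensitivity = tp / (tp + fn)              # fraction of voters predicted as voters
specificity = tn / (tn + fp)              # fraction of non-voters predicted as non-voters
overall = (tp + tn) / (tp + tn + fp + fn) # overall percentage correct
```

These reproduce the 93.9%, 19.6%, and 72.6% figures in the table.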


Model Summary
Step -2 Log likelihood Cox & Snell R Square Nagelkerke R Square
1 1417.884a .125 .179
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.