
## Logistic Regression in R

Some situations arise where we need to predict an outcome such as success or failure from given information. For example, suppose you need to predict whether a student will pass or fail an upcoming examination based on their scores in the previous two exams. In this case, we only need to say whether the student will pass or fail, i.e. the outcome is dichotomous. These are the main steps you may follow while doing the analysis:
1. ### Converting to Numeric Values

For any mathematical computation, we need numeric values, so it is compulsory to convert 'Pass' and 'Fail' to numbers. In this case, let us assign the value 1 to 'Pass' and 0 to 'Fail.' This is the conventional coding for logistic regression when there are two possible outcomes; its significance will be discussed later.
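As a quick illustration, here is one way this encoding might look in R; the vector of results is made up for the example:

```r
# Hypothetical exam outcomes; the values are illustrative only
results <- c("Pass", "Fail", "Pass", "Pass", "Fail")

# Encode 'Pass' as 1 and 'Fail' as 0
y <- ifelse(results == "Pass", 1, 0)
y
```

Note that `glm()` in R also accepts a two-level factor directly, in which case the first level is treated as failure.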
1. ### Building the Model

Remember that the outcome here must be either 0 or 1, and hence the model should be such that the predicted values lie in the range [0, 1]. Instead of predicting the outcome variable directly, we choose to predict the probability of success (being 1). Note that the closer the predicted probability is to 1, the greater the chance of success, and vice versa for failure. Thus our model is given by

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

Here x is the independent variable, or the given information, which can be either a scalar or a vector, and the β are the unknown coefficients that tell us the effect of x on the predicted probability π. We can interpret β through the log-odds form of the model,

$$\log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x$$

This means that β gives us the change in the log-odds of success with a unit increase in the independent variable x.
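The model above is just the logistic (inverse-logit) function applied to a linear predictor, which is why the predictions always stay inside [0, 1]; in R this function is available as `plogis`. A minimal sketch, with made-up predictor values:

```r
# The inverse-logit maps any real-valued linear predictor into (0, 1)
eta <- c(-5, 0, 5)    # hypothetical values of b0 + b1 * x
p <- plogis(eta)      # same as exp(eta) / (1 + exp(eta))
p                     # close to 0, exactly 0.5, close to 1

# A unit increase in x multiplies the odds p / (1 - p) by exp(b1)
```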

1. ### Estimating β

To estimate the unknown coefficients, we need to solve the likelihood equations, which are constructed on the assumption that the observed responses y (0 or 1) come from a Bernoulli distribution. The likelihood is thus given by

$$L(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i} \left(1 - \pi(x_i)\right)^{1 - y_i}$$

This, of course, cannot be maximised in closed form, so the Newton-Raphson or Fisher-scoring method is used to solve for β.
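As a sketch of what the fitting routine is doing, the log of this likelihood can be maximised numerically. The example below uses `optim` on simulated data (the data, coefficients, and starting values are all made up); `glm` itself uses iteratively reweighted least squares instead, but arrives at essentially the same estimates:

```r
# Simulated data for illustration only
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))

# Negative Bernoulli log-likelihood:
# log L = sum( y * eta - log(1 + exp(eta)) ), with eta = b0 + b1 * x
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log(1 + exp(eta)))
}

fit <- optim(c(0, 0), negloglik)
fit$par  # close to coef(glm(y ~ x, family = 'binomial'))
```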

1. ### Checking for Model Accuracy and Interpreting Significance

To check the overall accuracy of the model, three quantities can be examined: the Null Deviance, the Residual Deviance, and the AIC. Of these, the two commonly used values are the Residual Deviance and the AIC; the smaller these values, the better the model. They are particularly useful when we need to compare multiple models, say one model with two explanatory variables against another with three: the Residual Deviance and AIC would tell us which model is better at predicting success and failure. The next thing to examine is the significance of each explanatory variable, to conclude which of them can effectively predict the variable of interest. For each variable we test the hypothesis β = 0. This is done with a Wald z-test, and the p-value from the test tells us about the significance of the coefficient: the smaller the p-value, the greater the significance.
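For instance, two candidate models can be compared on the mtcars dataset used later in this article; the choice of predictors here is just for illustration:

```r
# Two nested candidate models for the transmission type 'am'
m1 <- glm(am ~ wt,      family = 'binomial', data = mtcars)
m2 <- glm(am ~ wt + hp, family = 'binomial', data = mtcars)

c(AIC(m1), AIC(m2))            # smaller is better
c(deviance(m1), deviance(m2))  # residual deviances
```

Adding a predictor can never increase the residual deviance of a nested model, which is exactly why the AIC, with its penalty for extra parameters, is the fairer yardstick.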
1. ### Predicted values, Accuracy, and Precision

Generally, the predicted values are returned as probabilities lying between 0 and 1. It is up to the user to determine the cut-off. For example, one may decide to declare success (value 1) if the predicted probability of success is more than 0.5 and failure otherwise; other users may choose a smaller or larger cut-off. Once the outcomes have been decided, we might like to check whether they match the observed values, that is, whether predicted successes correspond to observed successes and failures to failures. For this, we can make a 2x2 table as shown below:
|                   | Observed Success   | Observed Failure   |
|-------------------|--------------------|--------------------|
| Predicted Success | (a) True Positive  | (b) False Positive |
| Predicted Failure | (c) False Negative | (d) True Negative  |
We derive the following measures here,
1. Accuracy = (a+d)/(a+b+c+d)
2. Sensitivity (Recall) = a/(a+c)
3. Specificity = d/(b+d)
4. Positive Predictive Value (Precision) = a/(a+b)
5. Negative Predictive Value = d/(c+d)
Of these, Accuracy and Sensitivity are often of greatest interest, and the higher the values of all five measures, the more accurate the model and the better the choice of cut-off point.
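These measures are simple to compute by hand; the counts below are hypothetical, chosen only to show the arithmetic:

```r
# Hypothetical confusion-matrix counts: a = TP, b = FP, cc = FN, d = TN
# ('cc' is used instead of 'c' to avoid masking base::c)
a <- 18; b <- 2; cc <- 1; d <- 11

accuracy    <- (a + d) / (a + b + cc + d)
sensitivity <- a / (a + cc)   # recall
specificity <- d / (b + d)
ppv         <- a / (a + b)    # positive predictive value (precision)
npv         <- d / (cc + d)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv), 4)
```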

## An Illustrated Example with R Dataset

We illustrate the example with the help of the R dataset mtcars.

```r
head(mtcars)
```

Step 1: We see that there are two dichotomous variables, 'vs' and 'am'; we choose one of the two. Let us choose 'am.' The variable has already been encoded into 0 and 1, and hence we do not need to encode it further. As explanatory variables, we can choose 'cyl,' 'hp,' and 'wt,' all of which are continuous.

Step 2: We can now build a model in R using the dataset:

```r
model <- glm(am ~ cyl + hp + wt, family = 'binomial', data = mtcars)
```

Note that we have used the function 'glm' here, which is short for Generalised Linear Model; multiple types of models can be built under glm. Let us explain each part of the call one by one:

1. Formula (am ~ cyl + hp + wt): Since we are predicting am based on cyl, hp, and wt, we define the formula in the prescribed way, as shown.
2. Family (binomial): As already mentioned, glm can be used to fit multiple types of models, but here we only want to build a logistic regression, where the presumed distribution is Bernoulli, a special case of the Binomial. Thus we have defined the family as binomial.
3. Data: This is just the name of the dataset that we have used.
Step 3: Let us now take a look at the estimated values of the coefficients.

```r
summary(model)
```

The estimated values appear under the column named Estimate. Of the three variables used, the relationship is positive with cyl and hp, while for wt the relationship happens to be negative. Now that we know the values of the coefficients, we can easily obtain the probability of success for any known values of the explanatory variables.

Step 4: We refer to the same summary output to check the overall performance of the model and the individual contribution of each variable. We see that the Residual Deviance is 9.8415 and the AIC is 17.841, which are low in magnitude; but, as already mentioned, these are more useful for comparing models than for commenting on the overall model performance. As for individual significance, the p-values are listed in the column 'Pr(>|z|),' and we see that the value is less than 0.05 only for wt (we ignore the intercept because it is not a variable). Thus we may say that 'wt' has a significant effect in predicting 'am,' while the other variables are not of much importance.

Step 5: We decide on a cut-off of 0.5.

```r
library(caret)
library(e1071)
predicted <- predict(model, type = 'response')
confusionMatrix(factor(round(predicted)), factor(mtcars$am))
```

From the output of confusionMatrix we see that

1. Accuracy is 0.9062, which means the model has predicted about 90% of the cases correctly.
2. Sensitivity is 0.9474, which means the model has correctly identified about 95% of the cases in the positive class.
3. Specificity is 0.8462, which means the model has correctly identified about 85% of the cases in the negative class.
Overall, the model is good, and you have successfully built your first logistic regression model!
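As a final touch, the fitted model can also score a new observation; the predictor values below are made up for illustration:

```r
# Refit the model from the article and predict for a hypothetical light car
model <- glm(am ~ cyl + hp + wt, family = 'binomial', data = mtcars)
new_car <- data.frame(cyl = 4, hp = 95, wt = 2.2)

p <- predict(model, newdata = new_car, type = 'response')  # P(am = 1)
as.integer(p > 0.5)                                        # apply the 0.5 cut-off
```

A light, low-cylinder car like this gets a high predicted probability of a manual transmission, in line with the strong negative coefficient on wt seen earlier.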