Logistic Regression in R

Situations often arise in which we need to predict an outcome such as success or failure from some given information. Suppose, for example, that you need to predict whether a student will pass or fail an upcoming examination based on the student's scores in the previous two exams. Here we only need to say whether the student will pass or fail, so the outcome is dichotomous. The main steps of such an analysis are as follows.
  1. Convert to numeric values

Any mathematical computation requires numeric values, so pass and failure must be converted to numbers. Here, let us assign the value 1 to ‘Pass’ and 0 to ‘Failure’. This is the most conventional encoding for logistic regression with two possible outcomes. Its significance will be discussed later.
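A minimal sketch of this encoding step in R. The vector `result` and its six labels are hypothetical, invented purely for illustration:

```r
# Hypothetical outcome labels for six students (illustrative data only)
result <- c("Pass", "Failure", "Pass", "Pass", "Failure", "Pass")

# Encode 'Pass' as 1 and 'Failure' as 0
y <- ifelse(result == "Pass", 1, 0)

# Cross-tabulate to confirm that each label maps to exactly one code
table(result, y)
```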
  2. Building the Model

The observed outcome here is either 0 or 1, so the model should produce predicted values in the range [0, 1]. Instead of predicting the outcome itself, we predict the probability of success (the outcome being 1). The closer the predicted probability is to 1, the greater the chance of success; the closer it is to 0, the greater the chance of failure. The model is therefore the logistic function

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

Here x is the independent variable (the given information), which can be either a scalar or a vector, π(x) is the predicted probability of success, and the β are the unknown coefficients, which describe the effect of x on π. When x is a vector, β₁x is read as the sum of each coefficient times its variable. Rearranging the model gives an interpretation of β:

$$\log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x$$

This means that β₁ gives the change in the log-odds of success for a unit increase in the independent variable x.
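The log-odds interpretation can be checked numerically. The sketch below uses hypothetical coefficient values, chosen only for illustration, together with the base R functions `plogis` (inverse logit) and `qlogis` (logit):

```r
# Hypothetical coefficients, chosen only for illustration
beta0 <- -1.5
beta1 <- 0.8

# Predicted probability of success under the logistic model
prob_success <- function(x) plogis(beta0 + beta1 * x)

# Log-odds of success at a given x
log_odds <- function(x) qlogis(prob_success(x))

# Probabilities always stay between 0 and 1
prob_success(c(0, 1, 2))

# A unit increase in x changes the log-odds by beta1
log_odds(2) - log_odds(1)
```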

  3. Estimating β

To estimate the unknown coefficients, we maximise the likelihood, which is constructed on the assumption that the observed responses y_i (each 0 or 1) come from a Bernoulli distribution with success probability π(x_i). The likelihood is

$$L(\beta) = \prod_{i=1}^{n} \pi(x_i)^{y_i} \, \bigl(1 - \pi(x_i)\bigr)^{1 - y_i}$$

The resulting likelihood equations have no closed-form solution, so β is obtained numerically, typically with the Newton-Raphson or Fisher scoring method.
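As a rough illustration of the numerical step, here is a hand-written Newton-Raphson loop. It is a sketch rather than what `glm` does internally, and it assumes a single predictor (`wt` from the built-in mtcars data) and starting values of zero, both chosen for simplicity. For the logit link, Newton-Raphson and Fisher scoring give the same update.

```r
# Design matrix (intercept + wt) and 0/1 response from the built-in mtcars data
X <- cbind(Intercept = 1, wt = mtcars$wt)
y <- mtcars$am

beta <- c(Intercept = 0, wt = 0)          # starting values (an assumption)
for (iter in 1:25) {
  p     <- as.vector(plogis(X %*% beta))  # current fitted probabilities
  score <- t(X) %*% (y - p)               # gradient of the log-likelihood
  info  <- t(X) %*% (X * (p * (1 - p)))   # information matrix
  step  <- solve(info, score)             # Newton-Raphson step
  beta  <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break       # stop once the update is negligible
}

# Compare with the estimates returned by glm()
fit <- glm(am ~ wt, family = binomial, data = mtcars)
cbind(newton = beta, glm = coef(fit))
```

The two columns should agree up to the convergence tolerance of `glm`.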

  4. Checking for Model Accuracy and Interpreting Significance

To check the overall fit of the model, three quantities may be examined: Null Deviance, Residual Deviance, and AIC. Of these, Residual Deviance and AIC are the two in common use. The lower these values are, the better the model. Note that Residual Deviance never increases when a variable is added, whereas AIC penalises each extra parameter. Both are particularly useful when we need to compare multiple models, for example one model with two explanatory variables and another with three. The Residual Deviance and AIC then tell us which model is better for predicting success and failure. The next point of interest is the significance of each explanatory variable, which tells us which of them can effectively predict the variable of interest. For each variable, we test the hypothesis β=0. In R, glm reports a Wald z-test for this (not a t-test), and the p-value from the test indicates the significance of the coefficient. The smaller the p-value, the stronger the evidence that the coefficient differs from zero.
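A short sketch of such a comparison, using the built-in mtcars data. The choice of `hp + wt` as the smaller model is an assumption made for illustration; the names `model_two` and `model_three` are likewise illustrative.

```r
# Two candidate models for the same 0/1 response (built-in mtcars data)
model_two   <- glm(am ~ hp + wt,       family = binomial, data = mtcars)
model_three <- glm(am ~ cyl + hp + wt, family = binomial, data = mtcars)

# Collect residual deviance and AIC side by side; lower values are preferred
comparison <- data.frame(
  model             = c("hp + wt", "cyl + hp + wt"),
  residual_deviance = c(deviance(model_two), deviance(model_three)),
  AIC               = c(AIC(model_two), AIC(model_three))
)
comparison

# Per-coefficient tests of beta = 0: estimate, standard error, z value, p-value
coef(summary(model_three))
```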
  5. Predicted values, Accuracy, and Precision

Generally, the predicted values are returned as probabilities lying between 0 and 1. It is up to the user to determine the cut-off. For example, one may decide that success has occurred (value 1) if the predicted probability of success exceeds 0.5, and failure otherwise. Some users may choose a lower or higher cut-off. Once the outcomes have been decided, we can check whether they match the original values, that is, whether predicted success corresponds to observed success and predicted failure to observed failure. For this, we can construct a 2x2 table as shown below.
| Predicted Values | Observed: Success | Observed: Failure |
|---|---|---|
| Success | (a) True Positive | (b) False Positive |
| Failure | (c) False Negative | (d) True Negative |
From this table we derive the following measures:
  1. Accuracy = (a+d)/(a+b+c+d)
  2. Sensitivity (Recall) = a/(a+c)
  3. Specificity = d/(b+d)
  4. Positive Predictive Value (Precision) = a/(a+b)
  5. Negative Predictive Value = d/(c+d)
Of these, Accuracy and Sensitivity are often of the greatest interest. The higher the values of these five measures, the better the model classifies and the better the choice of cut-off point.
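The five measures can be computed directly from the four cell counts. The counts below are hypothetical, chosen only to make the arithmetic easy to follow; `tp`, `fp`, `fn`, and `tn` correspond to cells (a), (b), (c), and (d) of the table.

```r
# Hypothetical cell counts for the 2x2 table (illustrative values only)
tp <- 40  # (a) true positives
fp <- 5   # (b) false positives
fn <- 10  # (c) false negatives
tn <- 45  # (d) true negatives

measures <- c(
  accuracy    = (tp + tn) / (tp + fp + fn + tn),
  sensitivity = tp / (tp + fn),   # a / (a + c)
  specificity = tn / (fp + tn),   # d / (b + d)
  ppv         = tp / (tp + fp),   # a / (a + b)
  npv         = tn / (fn + tn)    # d / (c + d)
)
round(measures, 3)
```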

An Illustrated Example with R Dataset 

We illustrate the method with the built-in R dataset mtcars. The first few rows can be inspected with

    head(mtcars)

Step 1: The dataset contains two dichotomous variables, vs and am. We choose one of them, ‘am’. It is already encoded as 0 and 1, so no further encoding is needed. As explanatory variables, we choose ‘cyl’, ‘hp’, and ‘wt’, all of which are numeric.

Step 2: We can now build the model in R using the dataset:

    model <- glm(am ~ cyl + hp + wt, family = 'binomial', data = mtcars)

Note that we have used the function ‘glm’, which stands for Generalised Linear Model. Several kinds of models can be fitted with glm. Let us explain each part of the call one by one.

  1. Formula (am~cyl+hp+wt): Since we are predicting am from cyl, hp, and wt, we write the response on the left of ~ and the explanatory variables, joined by +, on the right.
  2. Family (binomial): glm can fit several types of models, but here we want a logistic regression, whose assumed distribution is Bernoulli, a special case of the Binomial. We therefore set the family to binomial.
  3. Data: The name of the dataset that contains the variables.
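To see what the family argument carries, the family object can be inspected directly. This is a small sketch using base R only; `fam` is just an illustrative name.

```r
# The family object bundles the distribution and its link function
fam <- binomial()
fam$family   # name of the distribution family
fam$link     # name of the default link function

# The link maps a probability to log-odds; its inverse maps back
fam$linkfun(0.5)
fam$linkinv(0)
```

The default link for the binomial family is the logit, which is why this glm call fits a logistic regression.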
Step 3: Let us now take a look at the estimated values of the coefficients:

    summary(model)

The output of summary(model) lists the estimated coefficients under the column named Estimate. Of the three variables used, the relationship is positive for cyl and hp, while for wt it is negative. Now that we know the values of the coefficients, we can obtain the probability of success for any known values of the explanatory variables.

Step 4: We refer again to the same output to check the overall performance of the model and the individual contribution of each variable. The Residual Deviance is 9.8415 and the AIC is 17.841. Both are low in magnitude but, as already mentioned, they are more useful for comparing models than for judging a single model. As for individual significance, the p-values are listed in the column ‘Pr(>|z|)’, and the value is below 0.05 only for wt (we ignore the intercept because it is not an explanatory variable). Thus we may say that ‘wt’ has a significant effect in predicting ‘am’, while the other variables are not statistically significant at the 5% level.

Step 5: We set the cut-off at 0.5.

    library(caret)
    library(e1071)
    predicted <- predict(model, type = 'response')
    confusionMatrix(factor(round(predicted)), factor(mtcars$am))
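As a brief aside on Step 3, the way a probability follows from the estimated coefficients can be sketched as below. The car described by `new_car` is hypothetical (its values are chosen for illustration and are not a row of mtcars), and the block refits the same model so that it runs on its own.

```r
model <- glm(am ~ cyl + hp + wt, family = binomial, data = mtcars)

# Hypothetical car (values chosen for illustration only)
new_car <- data.frame(cyl = 4, hp = 100, wt = 2.5)

# Linear predictor by hand: intercept plus coefficient times value
b <- coef(model)
eta <- b["(Intercept)"] + b["cyl"] * new_car$cyl +
  b["hp"] * new_car$hp + b["wt"] * new_car$wt

# The inverse logit turns the log-odds into a probability
p_manual <- unname(plogis(eta))

# The same probability obtained through predict()
p_predict <- unname(predict(model, newdata = new_car, type = "response"))

c(manual = p_manual, predict = p_predict)
```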

The output of confusionMatrix shows that

  1. Accuracy is 0.9062, which means the model has predicted about 90% of the cases correctly.
  2. Sensitivity is 0.9474. By default, confusionMatrix treats the first factor level (here 0) as the positive class, so this means the model has correctly identified about 95% of the cars with am = 0.
  3. Specificity is 0.8462, which means the model has correctly identified about 85% of the cars in the other class, am = 1.
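The same figures can be cross-checked without caret by tabulating predicted against observed classes in base R. This is a sketch that refits the model so it runs on its own. The names `predicted_class`, `tab`, `correct_am0`, and `correct_am1` are illustrative, and the cut-off is applied with `> 0.5` instead of `round()`.

```r
model <- glm(am ~ cyl + hp + wt, family = binomial, data = mtcars)
predicted <- predict(model, type = "response")

# Classify as 1 when the fitted probability exceeds the 0.5 cut-off
predicted_class <- as.integer(predicted > 0.5)

# Rows: predicted class, columns: observed class
tab <- table(Predicted = factor(predicted_class, levels = c(0, 1)),
             Observed  = factor(mtcars$am, levels = c(0, 1)))
tab

accuracy    <- sum(diag(tab)) / sum(tab)
correct_am0 <- tab["0", "0"] / sum(tab[, "0"])  # share of observed 0s predicted as 0
correct_am1 <- tab["1", "1"] / sum(tab[, "1"])  # share of observed 1s predicted as 1
c(accuracy = accuracy, correct_am0 = correct_am0, correct_am1 = correct_am1)
```

mtcars contains 19 cars with am = 0 and 13 with am = 1, which helps in reading the per-class rates.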
Overall the model is good, and you have successfully built your first Logistic Regression model!