Loan Default Prediction: Linear and Logistic Regression in R

Problem Description

This R programming assignment applies linear and logistic regression to estimate the probability of loan default. The dataset, sourced from Lending Club, comprises records of three-year loans granted between 2007 and 2010. The primary objective is to predict whether a loan will be fully repaid. Three models are provided for evaluation:

Model 1: This model exclusively employs the FICO score as the predictor for loan default.

Model 2: This model uses two predictors: the FICO score and whether the borrower meets Lending Club's credit policy.

Model 3: A more comprehensive ten-variable model that considers a broader spectrum of factors.

Skills: Linear and logistic regression in R

Question 1: Decision Contexts and Regression Choice

For each decision context, we choose between linear and logistic regression:

Identifying contractors at higher risk of going out of business: Logistic regression is chosen because the outcome is binary (a contractor either goes out of business or does not).
Identifying practices associated with students' SAT scores: Linear regression is suitable as we aim to predict a continuous variable (SAT scores).
Predicting the cost of a construction project: Linear regression is employed as the goal is to estimate a continuous outcome (project cost).
Developing a policy for interpreting PSA tests for prostate cancer: Logistic regression is the appropriate choice, considering we are classifying patients into those needing a biopsy or not.
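The rule of thumb in the list above (continuous outcome, linear regression; binary outcome, logistic regression) can be sketched on simulated data. Everything in this block is hypothetical: the variables `x`, `y_cont`, and `y_bin` and the coefficients used to simulate them are illustrative assumptions, not part of the assignment data.

```r
set.seed(1)
n <- 200
x <- rnorm(n)

# Continuous outcome (e.g. a project cost or a test score): linear regression
y_cont <- 3 + 2 * x + rnorm(n)
fit_lm <- lm(y_cont ~ x)

# Binary outcome (e.g. went out of business or not): logistic regression
y_bin <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))
fit_glm <- glm(y_bin ~ x, family = binomial)

# Linear predictions are unbounded; logistic predictions are probabilities
pred_lm <- predict(fit_lm)
pred_glm <- predict(fit_glm, type = "response")
range(pred_lm)
range(pred_glm)
```

The logistic predictions always fall strictly between 0 and 1, which is what makes them usable as risk estimates for a yes/no decision.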

Question 2: Model 1 Analysis

Model 1 relies solely on the FICO score to predict loan default. For two observations (rows 7782 and 9394), we calculate the discrepancy between the model's output and the actual label.

For row 7782, the fitted probability differs from the actual label by 0.366.

print(paste0("The distance between fitted probability and actual label for 7782 observations is " , data$p_value1[7782] - data$credit.policy[7782]))
## [1] "The distance between fitted probability and actual label for 7782 observations is 0.365832610890921"

For row 9394, the fitted probability is first converted to a label using an 8% threshold. The assigned label matches the actual label, so the discrepancy is 0.

print(paste0("The distance between assigned label and actual label for 9394 observations is " , y_hat_9394 - data$credit.policy[9394]))
## [1] "The distance between assigned label and actual label for 9394 observations is 0"
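The two kinds of discrepancy used above (fitted probability minus label, and thresholded label minus label) can be sketched as follows. This is a minimal illustration on simulated data, not the Lending Club records: the data frame `sim`, the simulation coefficients, and the row index `i` are assumptions, and the label is taken to be `not.fully.paid`, which is also an assumption of this sketch.

```r
set.seed(42)
# Simulated stand-in for the loan data (hypothetical values)
n <- 500
sim <- data.frame(fico = round(runif(n, 610, 830)))
sim$not.fully.paid <- rbinom(n, 1, plogis(7 - 0.012 * sim$fico))

# A one-predictor logistic model in the spirit of Model 1
fit <- glm(not.fully.paid ~ fico, data = sim, family = binomial)
sim$p_hat <- fitted(fit)

i <- 10  # arbitrary row chosen for illustration

# Discrepancy between the fitted probability and the actual label
gap_prob <- sim$p_hat[i] - sim$not.fully.paid[i]

# Discrepancy after converting the probability to a label at an 8% threshold
threshold <- 0.08
y_hat <- as.integer(sim$p_hat[i] >= threshold)  # 1 = predicted default
gap_label <- y_hat - sim$not.fully.paid[i]      # one of -1, 0, 1

c(gap_prob = gap_prob, gap_label = gap_label)
```

The first gap is a continuous quantity, while the second can only be -1, 0, or 1, with 0 meaning the thresholded prediction agrees with the label.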

Question 3: Default Risk with FICO Scores

This section compares the difference in default risk between borrowers with FICO scores of 650 and 700 with the difference between borrowers with FICO scores of 750 and 800.

For FICO 650, the default risk is estimated at 0.269, while for FICO 700, it stands at 0.169. This results in a difference of 0.100.

comparison_vals <- predict(model1, newdata = data.frame("fico" = c(650, 700)), type = 'response')
print(paste0("The probability of default risk for a borrower with FICO score 650 is ", comparison_vals[1],
             " and the probability of default risk for a borrower with FICO score 700 is ", comparison_vals[2],
             ". The difference is ", comparison_vals[1] - comparison_vals[2]))
## [1] "The probability of default risk for a borrower with FICO score 650 is 0.268650317872113 and the probability of default risk for a borrower with FICO score 700 is 0.168631839874141. The difference is 0.100018477997972"

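Next, we compute the same difference for borrowers with FICO scores of 750 and 800.

For FICO 750, the default risk is 0.101, and for FICO 800, it is 0.058. The difference is 0.042, less than half the 0.100 gap between FICO 650 and 700, even though both pairs are 50 points apart. A logistic model is linear in the log-odds rather than in the probability, so the same change in FICO score moves the probability less as the risk approaches zero.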

comparison_vals_2 <- predict(model1, newdata = data.frame("fico" = c(750, 800)), type = 'response')
print(paste0("The probability of default risk for a borrower with FICO score 750 is ", comparison_vals_2[1],
             " and the probability of default risk for a borrower with FICO score 800 is ", comparison_vals_2[2],
             ". The difference is ", comparison_vals_2[1] - comparison_vals_2[2]))
## [1] "The probability of default risk for a borrower with FICO score 750 is 0.100721943188717 and the probability of default risk for a borrower with FICO score 800 is 0.0582441529417448. The difference is 0.042477790246972"
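One way to see why equal 50-point steps in FICO score give unequal changes in probability is to move to the log-odds scale. The sketch below assumes Model 1 is a logistic regression with a logit link and FICO as its only predictor. It backs out the log-odds slope from the two probabilities reported for FICO 650 and 700, then extrapolates along the same line to 750 and 800. The variable names are illustrative.

```r
# Fitted probabilities reported above for FICO 650 and 700
p650 <- 0.268650317872113
p700 <- 0.168631839874141

# A logistic model is linear in FICO on the log-odds scale,
# so two points determine the slope (log-odds change per FICO point)
slope <- (qlogis(p700) - qlogis(p650)) / (700 - 650)

# Extrapolate along the same line to FICO 750 and 800
p750 <- plogis(qlogis(p700) + 50 * slope)
p800 <- plogis(qlogis(p700) + 100 * slope)

round(c(slope = slope, p750 = p750, p800 = p800), 4)
```

If the assumption holds, the extrapolated values should land close to the 0.101 and 0.058 reported above: each 50-point step lowers the log-odds by the same amount, but the corresponding drop in probability shrinks as the probability gets smaller.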

Question 4: Model 2 Plot Identification

Model 2 is identified with the non-linear plot, since the fitted probabilities of a logistic regression follow a curve rather than a straight line in the FICO score.

model2 <- glm(not.fully.paid~fico + credit.policy, data = data, family = 'binomial')
data$p_value2 <- model2$fitted.values

Question 5: Default Risk with Model 2

We now compare the default risk of borrowers with FICO scores of 650 and 700 again, this time separately for borrowers who meet Lending Club's credit policy and those who do not.

For borrowers who adhere to the policy, the difference is 0.063.

comparison_vals_3 <- predict(model2, newdata = data.frame("fico" = c(650, 700), "credit.policy" = c(1, 1)),
                             type = 'response')
print(paste0(
  "The probability of default risk for a borrower with FICO score 650 who meets Lending Club's credit policy is ",
  comparison_vals_3[1],
  " and the probability of default risk for a borrower with FICO score 700 who meets Lending Club's credit policy is ",
  comparison_vals_3[2],
  ". The difference is ", comparison_vals_3[1] - comparison_vals_3[2]))
## [1] "The probability of default risk for a borrower with FICO score 650 who meets Lending Club's credit policy is 0.208538445468652 and the probability of default risk for a borrower with FICO score 700 who meets Lending Club's credit policy is 0.145262763892304. The difference is 0.0632756815763477"

For borrowers who do not meet the policy, the difference in default risk is 0.090.

comparison_vals_4 <- predict(model2, newdata = data.frame("fico" = c(650, 700), "credit.policy" = c(0, 0)),
                             type = 'response')
print(paste0(
  "The probability of default risk for a borrower with FICO score 650 who does not meet Lending Club's credit policy is ",
  comparison_vals_4[1],
  " and the probability of default risk for a borrower with FICO score 700 who does not meet Lending Club's credit policy is ",
  comparison_vals_4[2],
  ". The difference is ", comparison_vals_4[1] - comparison_vals_4[2]))
## [1] "The probability of default risk for a borrower with FICO score 650 who does not meet Lending Club's credit policy is 0.337645636288135 and the probability of default risk for a borrower with FICO score 700 who does not meet Lending Club's credit policy is 0.247443144795334. The difference is 0.0902024914928014"
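The same 50-point FICO change produces a 0.063 drop for borrowers who meet the policy and a 0.090 drop for those who do not, yet Model 2 has no interaction term. The sketch below, which uses only the four probabilities reported above, checks that the FICO effect is in fact identical in both groups once measured on the log-odds scale. The vector names are illustrative.

```r
# Fitted probabilities reported above (Model 2)
p <- c(meets_650 = 0.208538445468652, meets_700 = 0.145262763892304,
       fails_650 = 0.337645636288135, fails_700 = 0.247443144795334)

log_odds <- qlogis(p)

# Effect of moving from FICO 650 to 700, on the log-odds scale
effect_meets <- unname(log_odds["meets_700"] - log_odds["meets_650"])
effect_fails <- unname(log_odds["fails_700"] - log_odds["fails_650"])

# The same effect on the probability scale
drop_meets <- unname(p["meets_650"] - p["meets_700"])
drop_fails <- unname(p["fails_650"] - p["fails_700"])

round(c(effect_meets = effect_meets, effect_fails = effect_fails,
        drop_meets = drop_meets, drop_fails = drop_fails), 4)
```

The two log-odds effects agree, as an additive logistic model requires. The probability drops differ only because the two groups start from different baseline risks.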

Question 6: Making Loans with Model 2

Suppose loans are made only when Model 2's fitted default probability is below 8%. Under this rule, 643 loans would be made. Of these, 37 defaulted. The overall accuracy of the resulting classification is 21.9%.

How many loans would you have made?

data$pred_result_2 <- as.integer(ifelse(data$p_value2 < 0.08, 0, 1))
print(paste0(nrow(data) - sum(data$pred_result_2), " loans would be made with threshold 0.08."))
## [1] "643 loans would be made with threshold 0.08."

How many of the loans you made would have defaulted?

table_2_pred <- table(data$not.fully.paid, data$pred_result_2)
print(table_2_pred)
##
##        0    1
##   0  606 7439
##   1   37 1496
print(paste0(table_2_pred[2], " loans will fail."))
## [1] "37 loans will fail."

What is the accuracy of Model 2 using this threshold?

print(paste0("Overall accuracy is ", (table_2_pred[1] + table_2_pred[4])/sum(table_2_pred)))
## [1] "Overall accuracy is 0.219461265399875"

Question 7: Making Loans with Model 1

Applying the same 8% threshold to Model 1's fitted probabilities, we would make 819 loans. Of these, 52 defaulted. The overall accuracy is 23.5%.

How many loans would you have made?

data$pred_result_1 <- as.integer(ifelse(data$p_value1 < 0.08, 0, 1))
print(paste0(nrow(data) - sum(data$pred_result_1), " loans would be made with threshold 0.08."))
## [1] "819 loans would be made with threshold 0.08."

How many of the loans you made would have defaulted?

table_1_pred <- table(data$not.fully.paid, data$pred_result_1)
print(table_1_pred)
##
##        0    1
##   0  767 7278
##   1   52 1481
print(paste0(table_1_pred[2], " loans will fail."))
## [1] "52 loans will fail."

What is the accuracy of Model 1 using this threshold?

print(paste0("Overall accuracy is ", (table_1_pred[1] + table_1_pred[4])/sum(table_1_pred)))
## [1] "Overall accuracy is 0.234704531217373"

Question 8: Model 1 vs. Model 2 Accuracy

The overall accuracy of Model 1 (23.5%) is marginally higher than that of Model 2 (21.9%), a difference of approximately 1.52 percentage points. However, Model 2 approves fewer loans that go on to default: 37 of the 643 loans it makes, versus 52 of the 819 loans made under Model 1.
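The comparison can be reproduced directly from the two confusion tables printed in Questions 6 and 7. The sketch below rebuilds those tables from their counts. The helper functions `make_table` and `summarise_tab` are hypothetical names introduced for illustration, and the cells are addressed by row and column name rather than by position.

```r
# Rebuild a 2x2 confusion table: rows = actual not.fully.paid,
# columns = predicted label (0 = loan made, 1 = loan declined)
make_table <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
}

# Counts are given in column-major order, as R fills matrices
model1_tab <- make_table(c(767, 52, 7278, 1481))
model2_tab <- make_table(c(606, 37, 7439, 1496))

summarise_tab <- function(tab) {
  made <- sum(tab[, "0"])      # loans made = predicted 0
  defaulted <- tab["1", "0"]   # loans made that later defaulted
  c(loans_made = made,
    defaults_among_made = defaulted,
    default_rate_among_made = defaulted / made,
    accuracy = sum(diag(tab)) / sum(tab))
}

results <- rbind(model1 = summarise_tab(model1_tab),
                 model2 = summarise_tab(model2_tab))
round(results, 4)
```

Reporting the default rate among loans made alongside the overall accuracy shows the two models on the measure a lender is most likely to care about, not just on the share of correct labels.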

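Question 9: Confusion Matrix Terms

The two terms from the confusion matrix are defined below, taking a loan that is made (that is, predicted to be repaid) as the positive class. If default is treated as the positive class instead, the two terms become False Negatives and True Positives, respectively.

Loans that were made and then defaulted are False Positives.
Loans that were not extended and would have defaulted are True Negatives.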
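Because the names of confusion matrix cells depend on which outcome is called "positive", it can help to make that choice explicit. The helper below, `cell_name`, is a hypothetical function written for this illustration. Labels follow the coding used in the code above, where 1 means default (loan declined when predicted) and 0 means repaid (loan made when predicted).

```r
# Name a confusion-matrix cell from the actual label, the predicted label,
# and the label that is treated as the positive class
cell_name <- function(actual, predicted, positive) {
  correctness <- if (actual == predicted) "True" else "False"
  sign <- if (predicted == positive) "Positive" else "Negative"
  paste(correctness, sign)
}

# Loan made (predicted 0) that defaulted (actual 1)
made_defaulted_pos0 <- cell_name(actual = 1, predicted = 0, positive = 0)
made_defaulted_pos1 <- cell_name(actual = 1, predicted = 0, positive = 1)

# Loan declined (predicted 1) that defaulted (actual 1)
declined_defaulted_pos0 <- cell_name(actual = 1, predicted = 1, positive = 0)
declined_defaulted_pos1 <- cell_name(actual = 1, predicted = 1, positive = 1)

c(made_defaulted_pos0, made_defaulted_pos1,
  declined_defaulted_pos0, declined_defaulted_pos1)
```

With "repaid" as the positive class the results match the terms given above. With "default" as the positive class the same two cells are named False Negative and True Positive.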

Question 10: Predicting Default Risk with Model 3

Finally, we investigate how Model 3 predicts default risk for two loans that are identical in all variables except purpose. We compare the default risk for a home improvement loan and an education loan, as well as for loans categorized under 'all other' purposes.

For a home improvement loan, the predicted default risk is lower.
For an education loan, the predicted default risk is higher, with a positive coefficient of 0.08286.
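To give the education coefficient a more tangible reading, it can be converted to an odds ratio. The sketch below assumes Model 3 is a logistic regression, so the coefficient acts on the log-odds of default relative to whichever reference category the model uses (the source does not state it). The baseline risk of 0.15 is an arbitrary value chosen for illustration.

```r
# Coefficient reported above for education loans (assumed log-odds scale)
beta_education <- 0.08286

# Multiplicative change in the odds of default
odds_ratio <- exp(beta_education)

# The change in probability depends on the starting risk;
# 0.15 is a hypothetical baseline, not a value from the assignment
baseline_p <- 0.15
new_p <- plogis(qlogis(baseline_p) + beta_education)

round(c(odds_ratio = odds_ratio, baseline_p = baseline_p, new_p = new_p), 4)
```

Under these assumptions the odds of default rise by a little under 9%, while the shift in probability is smaller and varies with the baseline risk of the loan.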

In summary, this assignment underscores the multifaceted application of linear and logistic regression in R for predicting loan defaults, assessing model accuracy, and discerning the influential factors governing default risk.
