# Statistical Analysis of Loan Approval Data: Hypotheses, Methods, and Results

This analysis uses SAS to examine which factors are associated with loan status in a loan dataset. We test five hypotheses covering FICO score, interest rate, loan grade, loan amount, and employment years, and report each result alongside the corresponding SAS output. The findings show statistically significant patterns in the data that may be useful to decision-makers in the finance industry.

## Problem Description

In this SAS assignment, we aim to analyze a loan dataset to determine the factors that affect the status of a loan, specifically whether it will be fully paid or charged off. The dataset contains 26 variables and 39,786 observations. For the purpose of this study, we focus on six key variables:

1. Loan Status: Whether the loan is fully paid or charged off (categorical, 2 levels).
2. FICO Score: Fair Isaac Corporation credit score (continuous).
3. Interest Rate: Interest rate charged on the loan (continuous, %).
4. Loan Grade: Grade assigned to the loan (categorical).
5. Loan Amount: The approved loan amount (continuous).
6. Employment Years: Years of employment (categorical, 11 levels).

## Milestone One: Hypotheses

We start by formulating five hypotheses for our analysis:

1. Hypothesis 1: FICO Score

• Null Hypothesis (H01): There is no significant difference in the average FICO score between those who fully paid and those who charged off.
• Alternative Hypothesis (H11): There is a significant difference in the average FICO score between those who fully paid and those who charged off.

| Loan Status | Method | Mean | 95% CL Mean | Std Dev | 95% CL Std Dev |
|---|---|---|---|---|---|
| Charged Off | | 707.6 | 706.8 | 31.87 | 31.30 |
| Fully Paid | | 720.9 | 720.5 | 36.11 | 35.85 |
| Difference | Pooled | -13.25 | -14.25, -12.25 | 35.54 | |
| Difference | Satterthwaite | -13.25 | -14.17, -12.34 | | |

For the Charged Off and Fully Paid rows, only the lower confidence limits are shown.

Table 1: Hypothesis 1 - FICO Score

2. Hypothesis 2: Interest Rate

• Null Hypothesis (H02): There is no significant difference in the average interest rate between those who fully paid and those who charged off.
• Alternative Hypothesis (H12): There is a significant difference in the average interest rate between those who fully paid and those who charged off.

| Loan Status | Method | Mean | 95% CL Mean | Std Dev | 95% CL Std Dev |
|---|---|---|---|---|---|
| Charged Off | | 0.1384 | 0.1374 | 0.0366 | 0.0359 |
| Fully Paid | | 0.1173 | 0.1169 | 0.0365 | 0.0363 |
| Difference | Pooled | 0.0211 | 0.0201, 0.0221 | 0.0365 | |
| Difference | Satterthwaite | 0.0211 | 0.0201, 0.0221 | | |

For the Charged Off and Fully Paid rows, only the lower confidence limits are shown.

Table 2: Hypothesis 2 - Interest Rate

3. Hypothesis 3: Loan Status and Loan Grade

• Null Hypothesis (H03): There is no significant association between loan status and loan grade.
• Alternative Hypothesis (H13): There is a significant association between loan status and loan grade.

| Statistic | DF | Value | Prob |
|---|---|---|---|
| Chi-Square | 6 | 1472.8151 | <.0001 |
| Likelihood Ratio Chi-Square | 6 | 1475.8336 | <.0001 |
| Mantel-Haenszel Chi-Square | 1 | 1461.2862 | <.0001 |
| Phi Coefficient | | 0.1924 | |
| Contingency Coefficient | | 0.1889 | |
| Cramer's V | | 0.1924 | |

Table 3: Hypothesis 3 - Loan Status and Loan Grade

4. Hypothesis 4: FICO Score and Interest Rate

• Null Hypothesis (H04): There is no significant relationship between FICO score and interest rate.
• Alternative Hypothesis (H14): There is a significant relationship between FICO score and interest rate.

| | fico_range_high | fico_range_low |
|---|---|---|
| int_rate (Pearson r) | -0.70279 | -0.70279 |
| p-value | <.001 | <.001 |

Table 4: Hypothesis 4 - FICO Score and Interest Rate

5. Hypothesis 5: Loan Amount Across Employment Years

• Null Hypothesis (H05): There is no significant difference in the average loan amount across employment years.
• Alternative Hypothesis (H15): There is a significant difference in the average loan amount across employment years.

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 11 | 56229125910 | 5111738719.1 | 101.51 | <.0001 |
| Error | 39774 | 2.0028333E12 | 50355340.338 | | |
| Corrected Total | 39785 | 2.0590624E12 | | | |

Table 5: Hypothesis 5 - Loan Amount Across Employment Years

## Milestone Two: Statistical Approaches

To test these hypotheses, we employ the following statistical approaches:

• Hypotheses 1 and 2: We use an independent-samples t-test, which is appropriate for comparing the mean of a continuous variable (FICO score or interest rate) between two independent groups (fully paid and charged off).
• Hypothesis 3: To test the association between two categorical variables (loan status and loan grade), we use a chi-square test of independence.
• Hypothesis 4: To measure the linear relationship between two continuous variables (FICO score and interest rate), we compute the Pearson correlation coefficient and test its significance.
• Hypothesis 5: We run a one-way ANOVA to test whether the average loan amount differs across the employment-year groups.
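
For reference, the test statistics behind these four approaches can be written in their standard textbook forms. The notation is introduced here for illustration and is not taken from the original analysis: $\bar{x}_g$, $s_g$, and $n_g$ are the mean, standard deviation, and size of group $g$; $O_{ij}$ and $E_{ij}$ are observed and expected cell counts; $k$ is the number of groups; and $N$ is the total sample size.

$$
\begin{aligned}
t &= \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} \\
\chi^2 &= \sum_{i}\sum_{j} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{N} \\
r &= \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \, \sum_i (y_i-\bar{y})^2}} \\
F &= \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
\end{aligned}
$$

The pooled form of $t$ assumes equal group variances; the Satterthwaite rows in the SAS output relax that assumption.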

## Milestone Three: Results

Our analysis yields the following results:

Hypothesis 1: FICO Score

• The independent-samples t-test shows a significant difference in FICO scores (t(39784) = -26.00, p < .001). Borrowers who fully paid had a higher mean FICO score (720.9) than those who were charged off (707.6).

| Method | Variances | DF | t Value | Pr > \|t\| |
|---|---|---|---|---|
| Pooled | Equal | 39784 | -26.00 | <.0001 |
| Satterthwaite | Unequal | 8283.8 | -28.42 | <.0001 |

Table 6: Hypothesis 1 - FICO Score Results
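
As a rough consistency check between Table 1 and Table 6, the standard error implied by the pooled t statistic can be used to rebuild the pooled confidence interval for the mean difference. This sketch assumes the large-sample critical value 1.96 and works from the rounded figures in the tables, so the last digit is approximate. The subscripts $CO$ (charged off) and $FP$ (fully paid) are labels chosen here.

$$
\begin{aligned}
SE &= \frac{\bar{x}_{CO} - \bar{x}_{FP}}{t} = \frac{-13.25}{-26.00} \approx 0.5096 \\
95\%\ \text{CI} &\approx -13.25 \pm 1.96 \times 0.5096 \approx -13.25 \pm 1.00 = (-14.25,\ -12.25)
\end{aligned}
$$

The result agrees with the pooled interval reported in Table 1.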

Hypothesis 2: Interest Rate

• The independent-samples t-test shows a significant difference in interest rates (t(39784) = 40.27, p < .001). Borrowers who fully paid had a lower mean interest rate (0.1173) than those who were charged off (0.1384).

| Method | Variances | DF | t Value | Pr > \|t\| |
|---|---|---|---|---|
| Pooled | Equal | 39784 | 40.27 | <.0001 |
| Satterthwaite | Unequal | 7672.1 | 40.26 | <.0001 |

Table 7: Hypothesis 2 - Interest Rate Results
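
Because both groups are very large, even modest differences produce very large t statistics, so a standardized effect size can help with interpretation. The following is a supplementary sketch and not part of the original SAS output: it computes Cohen's $d$ as the absolute mean difference divided by the pooled standard deviation, using the rounded values from Tables 1 and 2.

$$
d_{\text{FICO}} = \frac{|-13.25|}{35.54} \approx 0.37, \qquad d_{\text{rate}} = \frac{0.0211}{0.0365} \approx 0.58
$$

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), these would read as a small-to-medium effect for FICO score and a medium effect for interest rate.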

Hypothesis 3: Loan Status and Loan Grade

• The chi-square test indicates a significant association between loan status and loan grade (χ²(6) = 1472.82, p < .001). Cramer's V (0.19) indicates that the association is weak.

| Statistic | DF | Value | Prob |
|---|---|---|---|
| Chi-Square | 6 | 1472.8151 | <.0001 |
| Likelihood Ratio Chi-Square | 6 | 1475.8336 | <.0001 |
| Mantel-Haenszel Chi-Square | 1 | 1461.2862 | <.0001 |
| Phi Coefficient | | 0.1924 | |
| Contingency Coefficient | | 0.1889 | |
| Cramer's V | | 0.1924 | |

Table 8: Hypothesis 3 - Loan Status and Loan Grade Results
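
The effect-size measures in Table 8 follow directly from the chi-square statistic. As a worked check, assume that all $N = 39{,}786$ observations enter the table and that it has $n_r = 2$ status rows and $n_c = 7$ grade columns (the column count is inferred from the 6 degrees of freedom, not stated in the source):

$$
\begin{aligned}
V &= \sqrt{\frac{\chi^2}{N \cdot \min(n_r-1,\ n_c-1)}} = \sqrt{\frac{1472.8151}{39786 \times 1}} \approx 0.1924 \\
C &= \sqrt{\frac{\chi^2}{\chi^2 + N}} = \sqrt{\frac{1472.8151}{1472.8151 + 39786}} \approx 0.1889
\end{aligned}
$$

Because loan status has only two levels, $\min(n_r-1,\ n_c-1) = 1$ and Cramer's V coincides with the phi coefficient, which is why both are reported as 0.1924.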

Hypothesis 4: FICO Score and Interest Rate

• Correlation analysis reveals a strong and significant negative correlation between FICO score and interest rate (r = -0.703, p < .001).

Pearson Correlation Coefficients, N = 39786

| | fico_range_high | fico_range_low |
|---|---|---|
| int_rate (Pearson r) | -0.70279 | -0.70279 |
| p-value | <.001 | <.001 |

Table 9: Hypothesis 4 - FICO Score and Interest Rate Results
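
To gauge the strength of this relationship, the correlation can be squared to give the proportion of shared variance, and the usual t statistic for a Pearson correlation shows why the p-value is so small. This is a sketch using the reported $r$ and $N$; it assumes no missing values in either variable.

$$
\begin{aligned}
r^2 &= (-0.70279)^2 \approx 0.494 \\
t &= \frac{r\sqrt{N-2}}{\sqrt{1-r^2}} = \frac{-0.70279 \times \sqrt{39784}}{\sqrt{1 - 0.4939}} \approx -197
\end{aligned}
$$

Roughly 49% of the variance in interest rate is therefore linearly associated with FICO score.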

Hypothesis 5: Loan Amount Across Employment Years

• The one-way ANOVA shows a significant difference in average loan amount across employment years (F(11, 39774) = 101.51, p < .001). Loan amount varies with employment years.

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|---|---|---|---|---|---|
| Model | 11 | 56229125910 | 5111738719.1 | 101.51 | <.0001 |
| Error | 39774 | 2.0028333E12 | 50355340.338 | | |
| Corrected Total | 39785 | 2.0590624E12 | | | |

Table 10: Hypothesis 5 - Loan Amount Across Employment Years Results
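
The entries in Table 10 can be cross-checked by hand, and the same sums of squares give a simple effect size. The sketch below uses the rounded values printed in the table; $\eta^2$ is a supplementary calculation that does not appear in the original output.

$$
\begin{aligned}
MS_{\text{model}} &= \frac{56229125910}{11} \approx 5.1117 \times 10^{9}, \qquad MS_{\text{error}} = \frac{2.0028333 \times 10^{12}}{39774} \approx 5.0355 \times 10^{7} \\
F &= \frac{MS_{\text{model}}}{MS_{\text{error}}} \approx \frac{5.1117 \times 10^{9}}{5.0355 \times 10^{7}} \approx 101.51 \\
\eta^2 &= \frac{SS_{\text{model}}}{SS_{\text{total}}} = \frac{5.6229 \times 10^{10}}{2.0590624 \times 10^{12}} \approx 0.027
\end{aligned}
$$

Employment years therefore accounts for about 2.7% of the variance in loan amount: the difference is statistically clear but small. A model DF of 11 also implies 12 groups in the fitted data, one more than the 11 levels listed in the problem description; a separate category for missing employment length is one possible explanation, though the source does not say.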

## Summary

Our analysis suggests that borrowers who fully paid their loans had higher FICO scores and lower interest rates, on average, than borrowers who were charged off. FICO score and interest rate are strongly negatively correlated (r = -0.703), so higher FICO scores are associated with lower interest rates. Loan status is significantly but weakly associated with loan grade (Cramer's V = 0.19). The average loan amount differs significantly across employment years, with more years of employment associated with higher loan amounts.

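```sas
/* Import the Excel workbook into WORK.IMPORT */
FILENAME REFFILE '/home/u41099423/sasuser.v94/Loan_Data.xlsx';

PROC IMPORT DATAFILE=REFFILE
    DBMS=XLSX
    OUT=WORK.IMPORT;
    GETNAMES=YES;
RUN;

PROC CONTENTS DATA=WORK.IMPORT;
RUN;

/* Assumption: the original listing read some steps from WORK.IMPORT1 and
   WORK.IMPORT4, which it never creates. WORK.IMPORT is used throughout. */

/*** H1: FICO score by loan status ***/
/* Test for normality within each loan-status group */
proc univariate data=WORK.IMPORT normal mu0=0;
    ods select TestsForNormality;
    class loan_status;
    var fico_range_high;
run;

/* Two-sided independent-samples t test */
proc ttest data=WORK.IMPORT sides=2 h0=0 plots(showh0);
    class loan_status;
    var fico_range_high;
run;

/*** H2: interest rate by loan status ***/
/* Test for normality within each loan-status group */
proc univariate data=WORK.IMPORT normal mu0=0;
    ods select TestsForNormality;
    class loan_status;
    var int_rate;
run;

/* Two-sided independent-samples t test */
proc ttest data=WORK.IMPORT sides=2 h0=0 plots(showh0);
    class loan_status;
    var int_rate;
run;

/*** H3: chi-square test of loan status by loan grade ***/
proc freq data=WORK.IMPORT;
    tables (loan_status)*(loan_grade) / chisq measures nopercent norow nocum
        plots(only)=(freqplot mosaicplot);
run;

/*** H4: Pearson correlation of FICO score with interest rate ***/
/* NOPROB from the original is dropped so the p-values in Table 9 are printed */
proc corr data=WORK.IMPORT pearson nosimple plots=none;
    var fico_range_high fico_range_low;
    with int_rate;
run;

/*** H5: one-way ANOVA of loan amount by employment length ***/
proc glm data=WORK.IMPORT;
    class emp_length;
    model loaned_amt=emp_length;
    /* Levene's test for equal variances, plus Welch's ANOVA */
    means emp_length / hovtest=levene welch plots=none;
    /* Tukey-adjusted pairwise comparisons */
    lsmeans emp_length / adjust=tukey pdiff alpha=.05;
run;
quit;
```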