Understanding logistic regression
Logistic regression is used for regression analysis when the dependent variable is binary. Like other regression techniques, it is a predictive tool. It is used to describe data and to explain the relationship between one binary dependent variable and one or more independent variables, which may be nominal, ordinal, interval, or ratio-level.
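To make the idea concrete: logistic regression passes a linear combination of the predictors through the logistic (sigmoid) function to obtain a probability between 0 and 1. The following is a minimal sketch only. The function names, the coefficient values, and the tenure feature are made up for illustration and are not estimated from the dataset discussed below.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(intercept, coefs, features):
    """P(y = 1 | x) for a logistic regression with the given coefficients."""
    z = intercept + sum(b * x for b, x in zip(coefs, features))
    return sigmoid(z)

# Hypothetical coefficients: intercept 0.5 and -0.05 per month of tenure.
p_new = predict_proba(0.5, [-0.05], [1])    # customer with 1 month of tenure
p_old = predict_proba(0.5, [-0.05], [60])   # customer with 60 months of tenure
print(round(p_new, 3), round(p_old, 3))
```

With a negative coefficient, a larger feature value lowers the predicted probability, which is why the long-tenure customer in this toy example gets the smaller value.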
A. Python has been chosen for this task primarily because of its versatility and its ability to move seamlessly from data extraction to data analysis. Packages like Beautiful Soup and Scrapy make data extraction from the web easy, while Pandas and Scikit-learn allow quick data analysis and modeling. It is for these reasons that Python has been chosen over R/SAS.
B. The objective of the data analysis is to derive insights for the Telecommunication Company, specifically regarding the churn characteristics of its customers. Key observations regarding the churn behavior of customers can help the company employ tactics/offers to retain customers and hence increase profitability.
C. For the task, we shall be using the Logistic Regression, Decision Tree, and Random Forest algorithms to model the customer data. In addition, basic descriptive summaries, correlation matrices, and scatter plots will be used to determine the relationship of the independent variables with the dependent Churn variable.
Data Exploration and Preparation
D. The target variable in the dataset is the Churn variable. It indicates whether a customer has been retained or has switched to a different company. It is a categorical variable that takes the value ‘Yes’ or ‘No’.
E. An independent variable in the dataset is the Gender variable. This is a categorical variable that describes the gender of the customer.
Churn with respect to Gender
From the above table, we can observe that gender does not have much predictive power in determining the Churn variable
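A table of this kind can be produced with a row-normalized cross-tabulation. The sketch below uses a tiny made-up DataFrame: the column names `gender` and `Churn` are assumed, and the values are illustrative rather than taken from the actual dataset.

```python
import pandas as pd

# Toy data for illustration only (not the real Telecom dataset).
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "Churn":  ["No",     "No",   "Yes",    "Yes",  "No",     "No",   "No",     "No"],
})

# Share of churners and non-churners within each gender (each row sums to 1).
churn_by_gender = pd.crosstab(df["gender"], df["Churn"], normalize="index")
print(churn_by_gender)
```

When the rows of such a table are nearly identical, as in this toy example, the variable does little to separate churners from non-churners.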
F. The goal of data manipulation and preprocessing is to treat missing values and outliers in the data and to engineer new features that can add to the predictive power of the models built to predict the Churn variable.
G. The phenomenon that is to be predicted is the Churn rate of customers based on customer-level data.
H. The data were checked for missing values. Thereafter, the distributions of the numeric variables were checked for outliers. We observed that there were no missing values in the dataset. Moreover, for the 3 numeric variables, there were no significant outliers that required the removal of any observations. Therefore, the provided data required minimal preparation.
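Checks of this kind can be sketched as follows. The source does not state which outlier criterion was used, so the 1.5 × IQR rule here is an assumption, as are the column names and the toy values.

```python
import pandas as pd

# Toy data for illustration only; column names are assumed.
df = pd.DataFrame({
    "tenure": [1, 12, 24, 36, 48, 60, 72],
    "MonthlyCharges": [29.9, 56.1, 70.7, 42.3, 89.1, 99.7, 105.5],
})

# Number of missing values per column.
missing_counts = df.isna().sum()

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed criterion)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

outlier_counts = {col: iqr_outlier_count(df[col]) for col in df.columns}
print(missing_counts.to_dict(), outlier_counts)
```

A column with zero missing values and zero flagged points needs no imputation or row removal, which is the situation described above.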
I. Univariate statistics
To understand the individual variables in the dataset, we used univariate statistics to explore and plot them. The following are the plots for some of the variables in the dataset.
J. Bivariate Statistics
To understand the relationship between the target variable, Churn, and the independent variables, bivariate statistics were computed. The following are the results for some of the variable combinations checked.
K. To model the data, three different algorithms were chosen:
○ Logistic Regression
○ Decision Tree
○ Random Forest
Before the models could be trained, the categorical variables in the dataset had to be converted into numerical variables, as Scikit-learn models only accept numerical inputs for training.
For this, we used the LabelEncoder class from Scikit-learn to convert categorical variables into numerical variables.
To test the models, the dataset was split into training and testing datasets in an 80:20 ratio. The ratio of the Churn variable was maintained in both the train and test datasets through stratified sampling with the train_test_split function provided in Scikit-learn.
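The encoding and splitting steps described above might look like the sketch below. The DataFrame, its column names, and `random_state=0` are assumptions made for illustration. `LabelEncoder` and `train_test_split` (with its `stratify` parameter) are the Scikit-learn tools named in the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only; column names are assumed.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month", "One year"] * 10,
    "tenure": list(range(1, 51)),
    "Churn": ["Yes", "No", "No", "Yes", "No"] * 10,
})

# Encode each categorical column as integers (a separate encoder per column).
encoded = df.copy()
for col in ["Contract", "Churn"]:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

X = encoded.drop(columns="Churn")
y = encoded["Churn"]

# 80:20 split; stratify=y keeps the churn ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```

One design note: the Scikit-learn documentation describes `LabelEncoder` as intended for target labels, and `OrdinalEncoder` or `OneHotEncoder` is the more usual choice for input features. Integer codes also impose an arbitrary order on the categories, which tree-based models tolerate better than logistic regression does.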
L. The above-mentioned methods were selected because this was a classification task. Logistic Regression and Decision Trees are basic models that provide moderate results and can serve as a benchmark for us. To improve upon these results, we use Random Forest, which is one of the better classification models available.
Random Forest also provides variable importance scores that give the relative importance of each variable. This makes it possible to identify the more important predictive variables in the dataset.
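In Scikit-learn, these scores are exposed through the `feature_importances_` attribute of a fitted `RandomForestClassifier`. The sketch below uses synthetic data in which one feature drives the label and the other is pure noise. The feature names and hyperparameters are assumptions for illustration and are not the settings used in this analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic data: the first feature drives the label, the second is noise.
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
y = (informative + 0.3 * rng.normal(size=n) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Importances are non-negative and sum to 1; higher means more important.
importances = dict(zip(["informative", "noise"], model.feature_importances_))
print(importances)
```

Sorting such a dictionary by value gives a ranking of the predictors.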
M. For univariate statistics, histograms and bar plots have been used to visualize the variables. Histograms were chosen for the numerical variables because they display the distribution of each variable. Bar plots were chosen for the categorical variables because most of them have only 2-3 levels, so a bar plot is enough to convey the size of each level.
For bivariate statistics, we have used grouped histograms to plot numerical variables and their relationship with the target variable, Churn. For categorical variables, we have used the mosaic plot, as it clearly shows the percentage of the Churn variable within each level of the categorical variable.
N. Of the models that were trained, the following accuracy values were achieved on the test dataset:
○ Logistic Regression: 0.789
○ Decision Tree: 0.7785
○ Random Forest: 0.7913
With test accuracy scores above 0.75 (75%) for all three models, we can see that the independent variables hold significant discriminatory power.
Confusion matrix for Random Forest prediction
|          | Truth = 0 | Truth = 1 |
| Pred = 0 | 956       | 219       |
| Pred = 1 | 75        | 159       |
So we can say that the ability to detect the Churn behavior of customers was present in the provided dataset.
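The confusion matrix above can be turned into summary metrics with a few lines of arithmetic. The sketch assumes that class 1 denotes churn and that the rows are predictions and the columns are true labels, as the table labels suggest. The variable names are illustrative.

```python
# Counts copied from the Random Forest confusion matrix above
# (rows = predicted class, columns = true class; 1 = churn is assumed).
tn, fn = 956, 219   # Pred = 0: truth 0, truth 1
fp, tp = 75, 159    # Pred = 1: truth 0, truth 1

total = tn + fn + fp + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)              # of predicted churners, share that churned
recall = tp / (tp + fn)                 # of actual churners, share that was caught
majority_baseline = (tn + fp) / total   # accuracy of always predicting class 0

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} baseline={majority_baseline:.4f}")
```

The accuracy works out to about 0.7913, which matches the Random Forest score reported above. Under the stated assumptions, recall for the churn class is about 0.42, and always predicting the majority class would already score about 0.73, so accuracy alone understates how many churners are missed.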
O. Interactions between variables and the dependent variable were primarily detected through bivariate statistics and visualizations and later confirmed using the variable importance feature of Random Forest.
Certain interactions detected through visualizations indicated a relationship with the Churn variable. One example is the Contract variable.
The mosaic plot showed that customers on One-year and Two-year contracts were much less likely to churn than customers on the Month-to-Month plan. This suggested that Contract can be a strong predictor, which was later confirmed through the variable importance plot of the Random Forest model.
Variable importance by Random Forest model
As can be seen, the Contract variable is the 4th most important predictor variable as identified by the Random Forest model.