# A practical example of the application of predictive modeling

The objective of our predictive modeling in this article is to understand which mode of transport employees prefer for commuting to their office. We are also interested in predicting whether or not an employee will use a car as a mode of transport, and in comparing the performance of different classification algorithms on this problem.

## Exploratory analysis of the data

Exploratory data analysis is an integral part of the data analysis process, which is why our R assignment solver added this section. It provides the code along with a description of what each step does.

### Variable Identification

    colnames(Cars_dataset)

The variables present in our dataset are "Engineer", "Age", "Gender", "Work.Exp", "Distance", "Salary", "license", "MBA", and "Transport". The type of each variable can be inspected as follows:

    str(Cars_dataset)

### Description of variables

| Data Variable | Data Details |
|---|---|
| Age | A continuous variable that gives the age of the person using the transport. |
| Gender | A categorical variable that gives the gender of the person using the transport. |
| Engineer | A categorical variable that indicates whether a person is an engineer or not. |
| MBA | A categorical variable that indicates whether a person has an MBA degree or not. |
| Work Exp | A continuous variable that gives the work experience of the person using the transport. |
| Salary | A continuous variable that gives the salary of the person using the transport. |
| Distance | A continuous variable that gives the distance traveled by the person using the transport. |
| License | A categorical variable that indicates whether a person has a license or not. |
| Transport | A categorical variable that gives the type of transport used by the person. |

Therefore, we have eight independent variables to predict one dependent variable (Transport).

### Basic data Summary, analysis, graphs.

The dataset we have considered has four quantitative variables (Age, Work Exp, Salary, and Distance) and five categorical variables (Gender, Engineer, MBA, License, and Transport). We first count the missing values in each column:

    sapply(Cars_dataset, function(y) sum(is.na(y)))

Data treatment
• We can see that the MBA column has one missing value. We can impute this value with the mode of that column. Since 308 out of the 417 non-missing observations have a value of 0, we will impute the missing value with 0.
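
The mode imputation described above can be sketched in base R as follows. The helper name `impute_mode` and the toy vector are assumptions made for illustration; they are not taken from the original analysis.

```r
# Hypothetical helper: fill missing values with the most frequent observed value
impute_mode <- function(x) {
  observed <- x[!is.na(x)]
  values <- unique(observed)
  # Count how often each distinct value occurs and pick the most frequent one
  mode_value <- values[which.max(tabulate(match(observed, values)))]
  x[is.na(x)] <- mode_value
  x
}

# Toy stand-in for the MBA column (made-up values): mostly 0, one NA
mba <- c(0, 0, 1, NA, 0, 1, 0)
mba_imputed <- impute_mode(mba)
mba_imputed
```

On the real data, the same helper would presumably be applied to the MBA column of `Cars_dataset`.
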
##### Quantitative Variables.

| Variable | Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum |
|---|---|---|---|---|---|---|
| Age | 18 | 25 | 27 | 27.33 | 29 | 43 |
| Work Exp | 0.00 | 3.0 | 5 | 5.87 | 8 | 24 |
| Salary | 6.5 | 9.625 | 13 | 15.42 | 14.9 | 57 |
| Distance | 3.2 | 8.6 | 10.90 | 11.29 | 13.57 | 23.40 |

Interpretation
• The median age is 27, which means 50% of the people surveyed are above the age of 27, and 50% are below it.
• The median distance traveled is 10.90, which means 50% of the people surveyed travel more than 10.9, and 50% travel less than 10.9.
• Also, the median work experience is five years, and the median salary is 13.
##### (a) Box plots

From the above figure (Box plots), it can be seen that the box plots of all four quantitative variables are approximately the same for both genders, which suggests that these variables do not depend on gender.

• The Age feature is approximately normally distributed, with a mean close to 27 years; the observed ages range from 18 to 43 years.
• Outliers are observed in Age, Distance, Salary, and Work Exp.
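
The points a box plot draws as outliers can also be listed numerically. The sketch below uses `boxplot.stats` from base R, which flags values lying more than 1.5 times the hinge spread beyond the hinges; the salary values are made up for illustration.

```r
# Made-up salary-like values with one extreme observation
salary <- c(6.5, 9, 10, 12, 13, 14, 15, 57)

# $out holds the values that a box plot would draw as individual points
salary_outliers <- boxplot.stats(salary)$out
salary_outliers
```
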
##### (b) Histograms

From the above figure (Histograms), the following conclusions can be drawn:

• The Age feature is approximately normally distributed, with a mean close to 27 years; the observed ages range from 18 to 43 years.
• The Distance feature (in miles) is approximately normally distributed, with the majority of people having travel distances between 5 and 20 miles.
• The Salary and Work Exp features are highly skewed to the right. Therefore, for algorithms that assume normally distributed features (for example, Naïve Bayes), we need to apply a logarithmic transformation.
##### (c) Density plot

From the above figure (Density plots), the following conclusions can be drawn:

• The Age feature is approximately normally distributed, with a mean close to 27 years; the observed ages range from 18 to 43 years.
• The Distance feature (in miles) is approximately normally distributed, with the majority of people having travel distances between 5 and 20 miles.
• The Salary and Work Exp features are highly skewed to the right. Therefore, for algorithms that assume normally distributed features (for example, Naïve Bayes), we need to apply a logarithmic transformation.
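
To see why a logarithmic transformation helps with right-skewed features, the following sketch compares the skewness of a toy vector before and after `log1p`. The `skewness` helper (a simple moment-based estimate) and the values are assumptions for illustration only.

```r
# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up, right-skewed values resembling years of work experience
work_exp <- c(0, 1, 2, 3, 5, 8, 24)

skew_raw <- skewness(work_exp)
skew_log <- skewness(log1p(work_exp))
c(raw = skew_raw, log1p = skew_log)
```

A positive value indicates a long right tail; for this toy vector the transformed values are much closer to symmetric than the raw ones.
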
##### (d) Correlation plot

From the above figure (Correlation plot), the following conclusions can be drawn:

• Age and Experience have a high correlation.
• Salary and Experience have a high correlation.
The remaining pairs of features show medium to no correlation.
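
The numbers behind a correlation plot come from a correlation matrix. A minimal sketch with base R's `cor` is shown below; the toy data frame is made up and only mimics the pattern described above (Age and Work.Exp moving together).

```r
# Made-up values for three quantitative variables
toy <- data.frame(
  Age      = c(22, 25, 27, 29, 33, 40),
  Work.Exp = c(0, 3, 5, 6, 10, 18),
  Distance = c(10, 8, 12, 9, 11, 10)
)

# Pearson correlation between every pair of columns
cor_mat <- cor(toy)
round(cor_mat, 2)
```
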

#### Qualitative Variables.

| Transport | Count | Percentage |
|---|---|---|
| 2Wheeler | 83 | 20% |
| Car | 35 | 8% |
| Public Transport | 300 | 72% |
| Grand Total | 418 | 100% |

From the above table and graph, it can be observed that public transport is the most used mode of transport, and the second most used is the two-wheeler.
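
A count and percentage table like the one above can be produced with `table` and `prop.table`. The sketch below rebuilds a Transport column from the reported counts, so the vector itself is a reconstruction rather than the original data.

```r
# Rebuild a Transport vector from the counts reported in the table
transport <- rep(c("2Wheeler", "Car", "Public Transport"), times = c(83, 35, 300))

counts <- table(transport)
pct <- round(100 * prop.table(counts))  # share of each mode, in percent

counts
pct
```
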

We can see that 297 of the 418 observations are males, which indicates that there are more males present in the data than females.

We can see that 313 out of the 418 observations are Engineers, which indicates that there are many more Engineers present in the data than non-Engineers.

We can see that 309 out of the 418 observations are Non-MBA, which indicates that there are many more Non-MBA people present in the data than people who have an MBA degree.

| Transport | Average of Distance |
|---|---|
| 2Wheeler | 12.04 |
| Car | 17.88 |
| Public Transport | 10.32 |
| Grand Total | 11.29 |

From the above table, we can see that the average distance for the people who use a car is much higher than for the people using the other two modes of transportation.

From the above graph, we can see that the average salary of the people who use a car is much higher than that of the people using the other two modes of transportation.

As the transport counts earlier in this section show, 72% of the people use public transport, 20% use a two-wheeler, and only 8% use a car.

## Key insights based on EDA

We have explored the Car Dataset in order to generate insights into the data using graphs and charts. The following can be concluded from the data set.
• People with a higher salary tend to use a car more than others.
• The car is used by people who have to travel farther.
• The most popular mode of transportation is Public Transport.
• Age and Distance are normally distributed.
• Salary and work Experience are highly right-skewed.
• Age and Experience have a high Correlation.
• Salary and Experience have a high Correlation.

## Major Challenges in the analysis

• One of the major challenges is that the variable we want to predict is not balanced, i.e., the class of people who use cars is the minority. Therefore, oversampling of the minority class or under-sampling of the majority class might be needed.
• Not all the features are quantitative, but the Gaussian form of Naïve Bayes used in this analysis works with quantitative features.
• Salary and Work Experience are not normally distributed, but this form of Naïve Bayes assumes that the features are normally distributed.
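
Random oversampling of the minority class, mentioned in the first point, can be sketched in base R as below. The toy data, the seed, and the choice to balance the classes exactly are assumptions for illustration; the article does not show whether or how resampling was applied.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Made-up imbalanced data: 3 car users (1) and 9 others (0)
toy <- data.frame(
  Salary    = c(40, 45, 50, 9, 10, 11, 12, 13, 14, 15, 16, 17),
  Transport = c(1, 1, 1, rep(0, 9))
)

minority <- toy[toy$Transport == 1, ]
majority <- toy[toy$Transport == 0, ]

# Draw minority rows with replacement until both classes have the same size
extra_idx <- sample(nrow(minority), nrow(majority), replace = TRUE)
balanced <- rbind(majority, minority[extra_idx, ])

table(balanced$Transport)
```

Resampling of this kind is normally applied to the training set only, so that the test set keeps the original class proportions.
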

## The data preparation process

Data preparation is vital for any data analysis task. Data has to be cleaned and brought into a form that is suitable for analysis. This is key knowledge to be aware of.

### Coding of the response variable

We need to recode the response variable so that 1 corresponds to the observations whose mode of transport is a car and 0 to every other mode of transport.

    # 1 = car, 0 = any other mode of transport
    Cars_dataset$Transport <- ifelse(Cars_dataset$Transport == "Car", 1, 0)

### Scaling of Continuous variable

Before applying the models, we scale the continuous variables because they are all on different scales.

    # scale() returns a one-column matrix; as.numeric() keeps plain numeric columns
    Cars_dataset$Age <- as.numeric(scale(Cars_dataset$Age))
    Cars_dataset$Work.Exp <- as.numeric(scale(Cars_dataset$Work.Exp))
    Cars_dataset$Salary <- as.numeric(scale(Cars_dataset$Salary))
    Cars_dataset$Distance <- as.numeric(scale(Cars_dataset$Distance))
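
As a quick check of what `scale` does with its default arguments, the sketch below compares it with the manual formula (subtract the mean, divide by the standard deviation). The input values are made up.

```r
x <- c(6.5, 9.6, 13, 14.9, 57)  # made-up salary-like values

z_manual <- (x - mean(x)) / sd(x)
z_scale <- as.numeric(scale(x))  # drop the matrix attributes returned by scale()

all.equal(z_manual, z_scale)
```

After scaling, each variable has mean 0 and standard deviation 1, which keeps distance-based methods such as KNN from being dominated by the variable with the largest units.
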

### Transformation of Salary and Work experience variables for Naïve Bayes

Naïve Bayes assumes features to be normally distributed, but from the EDA (histograms of Salary and Work Experience) we saw that these two variables are right-skewed. Hence, we apply a log1p transformation to them. Our R homework helper experts applied log1p rather than log because log returns -Inf for an input of 0. The transformation has to be applied to the original, unscaled values, since log1p is undefined for values below -1. Only quantitative variables are selected for Naïve Bayes.

    # Keep the quantitative predictors and the response
    Cars_dataset_naive <- Cars_dataset[, c("Age", "Salary", "Work.Exp", "Distance", "Transport")]
    # log1p(x) = log(1 + x), so an input of 0 maps to 0
    Cars_dataset_naive$Salary <- log1p(Cars_dataset_naive$Salary)
    Cars_dataset_naive$Work.Exp <- log1p(Cars_dataset_naive$Work.Exp)
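
The difference between `log` and `log1p` at zero can be seen directly. The vector below reuses the minimum, quartiles, and maximum of Work Exp from the summary table purely as sample inputs.

```r
work_exp <- c(0, 3, 5, 8, 24)  # min, quartiles, and max of Work Exp from the summary table

log_values <- log(work_exp)      # the first element is -Inf because log(0) = -Inf
log1p_values <- log1p(work_exp)  # log(1 + x); the first element is 0

log_values
log1p_values
```
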

### Scaling of Continuous variable for Naïve Bayes

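    # scale() returns a one-column matrix; as.numeric() keeps plain numeric columns
    Cars_dataset_naive$Age <- as.numeric(scale(Cars_dataset_naive$Age))
    Cars_dataset_naive$Work.Exp <- as.numeric(scale(Cars_dataset_naive$Work.Exp))
    Cars_dataset_naive$Salary <- as.numeric(scale(Cars_dataset_naive$Salary))
    Cars_dataset_naive$Distance <- as.numeric(scale(Cars_dataset_naive$Distance))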

### Splitting Data into Test set and training set

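    set.seed(123)
    # 80% of the 418 observations (334) for training, the remaining 84 for testing
    train_ind <- sample(1:418, floor(418 * 0.8))
    train <- Cars_dataset[train_ind, ]
    test <- Cars_dataset[-train_ind, ]
    # Assumption: the Naïve Bayes data is split with the same indices
    train_naive <- Cars_dataset_naive[train_ind, ]
    test_naive <- Cars_dataset_naive[-train_ind, ]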
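
The sizes that result from this 80/20 split can be checked on the row indices alone, without the data. In the sketch below only `n` comes from the article; `test_ind` is a name introduced for illustration.

```r
n <- 418  # number of observations reported in the article

set.seed(123)
train_ind <- sample(seq_len(n), floor(0.8 * n))
test_ind <- setdiff(seq_len(n), train_ind)  # every row not used for training

length(train_ind)
length(test_ind)
```

This gives 334 training rows and 84 test rows, which is where the 84 in the accuracy calculations comes from.
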

## Modeling that was done by our online predictive modeling tutors

This is the part where predictive modeling is applied for the analysis. We fit different types of models using R. Read this section to explore how the models are fitted on the data set.

### KNN

Before applying KNN, we need to choose k. One rule of thumb is to take the square root of the number of training observations: sqrt(334) ≈ 18. Therefore, k = 18.
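
The rule of thumb can be written as a one-line calculation. It is only a starting point; the variable names are illustrative.

```r
n_train <- 334  # size of the training set after the 80/20 split

k <- round(sqrt(n_train))
k
```

In practice, k is often tuned further, for example by comparing a few values around this starting point on held-out data.
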

#### Model training and predicting

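    library(class)  # provides knn()
    # Assumes Transport is the last column and all remaining columns are numeric
    y_train <- train$Transport
    trin <- train[, -length(train)]
    y_test <- test$Transport
    tes <- test[, -length(test)]
    y_pred_knn <- knn(train = trin, test = tes, cl = y_train, k = 18)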

#### Model Evaluation using Confusion matrix, Accuracy, and True Positive Rate.

    accuracy <- (78 + 3) / 84 * 100
    accuracy
    ## 96.42857
    True_positive_rate <- 3 / (2 + 3) * 100
    True_positive_rate
    ## 60

#### Interpretation of model results

Accuracy: The accuracy of the KNN model on the test set is 96.43%, which means our KNN model predicts 96.43% of the observations correctly. True positive rate: The true positive rate of our model is 60%, which means that, out of the people who drive a car, our model correctly identifies 60% as car users.
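
Both metrics can be computed from the four cells of a confusion matrix. In the sketch below, the helper name `classification_metrics` is hypothetical. The KNN counts of 78 true negatives, 3 true positives, and 2 false negatives come from the calculation above; the single false positive is inferred from the 84 test rows rather than stated in the article.

```r
# Hypothetical helper: accuracy and true positive rate (both in %) from the four cells
classification_metrics <- function(tn, fp, fn, tp) {
  c(
    accuracy = (tn + tp) / (tn + fp + fn + tp) * 100,
    true_positive_rate = tp / (tp + fn) * 100
  )
}

# fp = 1 is an inference: 84 - 78 - 3 - 2 = 1
knn_metrics <- classification_metrics(tn = 78, fp = 1, fn = 2, tp = 3)
round(knn_metrics, 2)
```
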

### Naïve Bayes

We cannot apply Naïve Bayes directly to our data because, in the form used here, it assumes that all features are quantitative and normally distributed. To apply Naïve Bayes, we therefore select only the continuous variables and transform the Salary and Work Experience variables to bring them closer to a normal distribution.

#### Model training and predicting

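    library(e1071)  # provides naiveBayes()
    nb_class <- naiveBayes(Transport ~ Age + Work.Exp + Distance + Salary, data = train_naive)
    # type = "raw" returns class probabilities; column 2 is taken as the probability of class 1 (car)
    y_pred_naive_prob <- predict(nb_class, test_naive[-5], type = "raw")
    y_pred_naive <- ifelse(y_pred_naive_prob[, 2] >= 0.5, 1, 0)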

#### Model Evaluation using Confusion matrix, Accuracy, and True Positive Rate.

    library(gmodels)  # provides CrossTable()
    y_test_naive <- test_naive$Transport
    eval_naive <- data.frame(y_test_naive, y_pred_naive)
    CrossTable(x = y_test_naive, y = y_pred_naive, prop.chisq = FALSE, prop.r = FALSE, prop.t = FALSE)

    accuracy <- (75 + 4) / 84 * 100
    accuracy
    ## 94.04762
    True_positive_rate <- 4 / (1 + 4) * 100
    True_positive_rate
    ## 80

#### Interpretation of model results

Accuracy: The accuracy of our Naïve Bayes model on the test set is 94.05%, which means our Naïve Bayes model predicts 94.05% of the observations correctly. True positive rate: The true positive rate of our model is 80%, which means that, out of the people who drive a car, our model correctly identifies 80% as car users.

### Logistic Regression

#### Model training and predicting

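    # Assumes train holds the predictors plus the 0/1 response y_train, without the Transport column
    logisti_model <- glm(y_train ~ ., data = train, family = "binomial")
    y_pred_logistic_prob <- predict.glm(logisti_model, newdata = tes, type = "response")
    # Start from all 1s (car) and set predictions with probability below 0.5 to 0
    y_pred_logistic <- rep(1, 84)
    y_pred_logistic[y_pred_logistic_prob < 0.5] <- 0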

#### Model Evaluation using Confusion matrix, Accuracy, and True Positive Rate.

    CrossTable(y_test, y_pred_logistic, prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, prop.t = FALSE)

    accuracy <- (79 + 4) / 84 * 100
    accuracy
    ## 98.80952
    True_positive_rate <- 4 / (1 + 4) * 100
    True_positive_rate
    ## 80

#### Interpretation of model results

Accuracy: The accuracy of our Logistic Regression model on the test set is 98.81%, which means our Logistic Regression model predicts 98.81% of the observations correctly. True positive rate: The true positive rate of our model is 80%, which means that, out of the people who drive a car, our model correctly identifies 80% as car users.

### Bagging

#### Model training and predicting (Random Forest)

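    library(randomForest)
    # Assumes train holds the 8 predictors plus the response in a column named y_train
    train$y_train <- as.factor(train$y_train)
    colnames(train)[5] <- "work_exp"
    colnames(test)[5] <- "work_exp"
    # mtry = 8 considers all 8 predictors at every split, which turns the random forest into bagging
    bagging <- randomForest(y_train ~ ., data = train, importance = TRUE, mtry = 8)
    y_pred_bagging <- predict(bagging, newdata = test)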

#### Model Evaluation using Confusion matrix, Accuracy, and True Positive Rate.

    accuracy <- (79 + 4) / 84 * 100
    accuracy
    ## 98.80952
    True_positive_rate <- 4 / (1 + 4) * 100
    True_positive_rate
    ## 80

#### Feature Importance

varImpPlot(bagging)

From the above graph, we can see that important features to predict the mode of transportation are Salary, Distance, Age, and Work Experience. So, from a business point of view, one should look at these variables to estimate which mode of transport an individual is going to use.

#### Interpretation of model results

Accuracy: The accuracy of our Bagging (Random Forest) model on the test set is 98.81%, which means our Bagging model predicts 98.81% of the observations correctly. True positive rate: The true positive rate of our model is 80%, which means that, out of the people who drive a car, our model correctly identifies 80% as car users.

### Boosting

#### Model training and predicting (Gradient boosting)

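    library(gbm)
    # distribution = "bernoulli" expects a numeric 0/1 response, so convert the factor back
    train$y_train <- as.numeric(as.character(train$y_train))
    boosting <- gbm(y_train ~ ., data = train, distribution = "bernoulli", n.trees = 1000, interaction.depth = 5)
    # Predictions use the first 100 of the 1000 fitted trees
    y_pred_boosting_prob <- predict(boosting, newdata = test, n.trees = 100, type = "response")
    # Start from all 1s (car) and set predictions with probability below 0.5 to 0
    y_pred_boosting <- rep(1, 84)
    y_pred_boosting[y_pred_boosting_prob < 0.5] <- 0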

#### Model Evaluation using Confusion matrix, Accuracy, and True Positive Rate.

    accuracy
    ## 100
    True_positive_rate <- 5 / (0 + 5) * 100
    True_positive_rate
    ## 100

#### Feature Importance

From the above graph, we can see that important features to predict the mode of transportation are Salary, Distance, Age, and Work Experience. So, from a business point of view, one should look at these variables to estimate which mode of transport an individual is going to use.

#### Interpretation of model results

Accuracy: The accuracy of our Boosting (gbm) model on the test set is 100%, which means our Boosting model predicts 100% of the observations correctly. True positive rate: The true positive rate of our model is 100%, which means that, out of the people who drive a car, our model correctly identifies all of them as car users.
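
To compare the five models side by side, the reported test-set figures can be collected in one data frame. The numbers are the ones stated in the interpretation sections above; the object and column names are chosen for illustration.

```r
# Test-set metrics (in %) as reported for each model in this article
results <- data.frame(
  model = c("KNN", "Naive Bayes", "Logistic Regression", "Bagging", "Boosting"),
  accuracy = c(96.43, 94.05, 98.81, 98.81, 100),
  true_positive_rate = c(60, 80, 80, 80, 100)
)

# Best models first: sort by true positive rate, then by accuracy
ranked <- results[order(-results$true_positive_rate, -results$accuracy), ]
ranked
```

Since the test set contains only five car users, each additional correctly identified car user moves the true positive rate by 20 percentage points, so differences in this metric between models should be read with that in mind.
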