Exploratory analysis of the data

Exploratory data analysis is an integral part of the data analysis process, which is why our R assignment solver added this section. It provides the code together with a description of what each piece of code does.

Variable Identification

colnames(Cars_dataset)
## [1] "Age" "Gender" "Engineer" "MBA" "Work.Exp" "Salary" "Distance" "license" "Transport"

The variables present in our dataset are "Age", "Gender", "Engineer", "MBA", "Work.Exp", "Salary", "Distance", "license", and "Transport". To check the type of each variable in our study:

str(Cars_dataset)

Description of variables

Variable    Details
Age         It is a continuous variable that provides details on the age of the person using the transport.
Gender      It is a categorical variable that provides details on the gender of the person using the transport.
Engineer    It is a categorical variable that provides details on whether the person is an engineer or not.
MBA         It is a categorical variable that provides details on whether the person has an MBA degree or not.
Work Exp    It is a continuous variable that provides details on the work experience of the person using the transport.
Salary      It is a continuous variable that provides details on the salary of the person using the transport.
Distance    It is a continuous variable that provides details on the distance the person travels using the transport.
License     It is a categorical variable that provides details on whether the person has a license or not.
Transport   It is a categorical variable that provides details on the type of transport used by the person.
Therefore, we have eight independent variables to predict one dependent variable (Transport).

Basic data summary, analysis, and graphs

The dataset that we have considered has four quantitative variables, namely Age, Work Exp, Salary, and Distance, and five categorical variables, namely Gender, Engineer, MBA, License, and Transport. To count the missing values in each column:

sapply(Cars_dataset, function(y) sum(is.na(y)))

Data treatment
  • We can see that the MBA column has one missing value. We can impute this value with the mode of that column: we observe that 308 out of 417 non-missing observations have the value 0, so we will impute the missing value with 0, as sketched below.
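A minimal sketch of this imputation (assuming MBA is stored as a numeric 0/1 column, as in the dataset above):

Cars_dataset$MBA[is.na(Cars_dataset$MBA)] <- 0   # impute with the mode of the column, which is 0
sum(is.na(Cars_dataset$MBA))                     # should now be 0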
Quantitative Variables.
  Variable    Minimum   1st Quartile   Median   Mean    3rd Quartile   Maximum
  Age         18        25             27       27.33   29             43
  Work Exp    0.00      3.0            5        5.87    8              24
  Salary      6.5       9.625          13       15.42   14.9           57
  Distance    3.2       8.6            10.90    11.29   13.57          23.40
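These figures can be reproduced with summary() on the quantitative columns (a sketch, assuming the column names shown earlier):

summary(Cars_dataset[, c("Age", "Work.Exp", "Salary", "Distance")])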
Interpretation
  • It can be noticed that the median Age is 27, which means 50% of the people surveyed are above the age of 27 and 50% are below it.
  • It can be noticed that the median Distance traveled is 10.90, which means 50% of the people surveyed travel more than 10.9 and 50% travel less than 10.9.
  • Also, the median work experience is five years, and the median salary is 13.
Graphical Summary
(a) Box plots
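A minimal base-R sketch that could produce such gender-wise box plots (the original plotting code is not shown):

par(mfrow = c(2, 2))   # 2x2 grid, one panel per quantitative variable
boxplot(Age ~ Gender, data = Cars_dataset, main = "Age by Gender")
boxplot(Work.Exp ~ Gender, data = Cars_dataset, main = "Work Exp by Gender")
boxplot(Salary ~ Gender, data = Cars_dataset, main = "Salary by Gender")
boxplot(Distance ~ Gender, data = Cars_dataset, main = "Distance by Gender")
par(mfrow = c(1, 1))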

From the above figure (Box plots), it can be seen that the box plots of all four quantitative variables are approximately the same for both genders, which suggests that these variables are independent of gender.

  • The Age feature is approximately normally distributed, with the majority of people falling between 20 and 40 years; the mean is almost equal to 27 years.
  • Outliers are observed in Age, Distance, Salary, and Work Exp.
(b) Histograms
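A minimal base-R sketch that could produce these histograms (the original plotting code is not shown):

par(mfrow = c(2, 2))   # 2x2 grid, one panel per quantitative variable
hist(Cars_dataset$Age, main = "Age", xlab = "Age")
hist(Cars_dataset$Work.Exp, main = "Work Exp", xlab = "Work Exp")
hist(Cars_dataset$Salary, main = "Salary", xlab = "Salary")
hist(Cars_dataset$Distance, main = "Distance", xlab = "Distance (miles)")
par(mfrow = c(1, 1))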


From the above figure (Histograms), the following conclusions can be drawn:

  • The Age feature is approximately normally distributed, with the majority of people falling between 20 and 40 years; the mean is almost equal to 27 years.
  • The Distance (in miles) feature is approximately normally distributed, with the majority of people traveling between 5 and 20 miles.
  • The Salary and Work Experience features are highly right-skewed. Therefore, for algorithms that assume features to be normally distributed (for example, Naïve Bayes), we need to apply a logarithmic transformation.
(c) Density plots
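A minimal base-R sketch that could produce such density plots (the original plotting code is not shown):

par(mfrow = c(2, 2))   # one density panel per quantitative variable
for (v in c("Age", "Work.Exp", "Salary", "Distance")) {
  plot(density(Cars_dataset[[v]]), main = paste("Density of", v))
}
par(mfrow = c(1, 1))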


From the above figure (Density plots), the following conclusions can be drawn:

  • The Age feature is approximately normally distributed, with the majority of people falling between 20 and 40 years; the mean is almost equal to 27 years.
  • The Distance (in miles) feature is approximately normally distributed, with the majority of people traveling between 5 and 20 miles.
  • The Salary and Work Experience features are highly right-skewed. Therefore, for algorithms that assume features to be normally distributed (for example, Naïve Bayes), we need to apply a logarithmic transformation.
(d) Correlation plot

From the above figure (Correlation plot), the following conclusions can be drawn:

  • Age and Work Experience have a high correlation.
  • Salary and Work Experience have a high correlation.
The rest of the features show moderate to no correlation.
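The correlation matrix behind such a plot can be computed and drawn, for example, with the corrplot package (an assumption; the original plotting code is not shown):

library(corrplot)
corr_matrix <- cor(Cars_dataset[, c("Age", "Work.Exp", "Salary", "Distance")])
corrplot(corr_matrix, method = "number")   # show pairwise correlations as numbers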

Qualitative Variables.


Transport          Count   Percentage
2Wheeler           83      20%
Car                35      8%
Public Transport   300     72%
Grand Total        418     100%

From the above table and graph, it can be observed that public transport accounts for the maximum share of usage among the people (72%), the second most used mode of transport is the two-wheeler (20%), and only 8% of the people use a car.
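These counts and percentages can be reproduced with table() and prop.table() (a sketch):

transport_counts <- table(Cars_dataset$Transport)
transport_counts
round(prop.table(transport_counts) * 100)   # percentage share of each mode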

We can see that 297 of the 418 observations are males, which indicates that there are more males present in the data than females.

We can see that 313 out of the 418 observations are Engineers, which indicates that there are many more Engineers present in the data than Non-Engineers.

We can see that 309 out of the 418 observations are Non-MBA, which indicates that there are many more Non-MBA people present in the data than people who have an MBA degree.

Transport          Average Distance (miles)
2Wheeler           12.04
Car                17.88
Public Transport   10.32
Grand Total        11.29
From the above table, we can see that the average distance for people who use a car is much higher than for people using the other two modes of transportation.

From the above graph, we can see that the average salary for people who use a car is much higher than for people using the other two modes of transportation.
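Both group averages can be computed with aggregate() (a sketch):

aggregate(Distance ~ Transport, data = Cars_dataset, FUN = mean)   # average distance per mode
aggregate(Salary ~ Transport, data = Cars_dataset, FUN = mean)     # average salary per mode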



Key insights based on EDA

We have explored the Car Dataset in order to generate insights into the data using graphs and charts. The following can be concluded from the data set.
  • People with a higher average income tend to use a car more than others.
  • Cars are used by people who have to travel longer distances.
  • The most popular mode of transportation is Public Transport.
  • Age and Distance are approximately normally distributed.
  • Salary and Work Experience are highly right-skewed.
  • Age and Work Experience have a high correlation.
  • Salary and Work Experience have a high correlation.

Major Challenges in the analysis

  • One of the major challenges is that the variable we want to predict is not balanced, i.e., the class of people who use cars is a minority. Therefore, oversampling of the minority class or under-sampling of the majority class might be needed.
  • Not all the features are quantitative, but our Naïve Bayes setup requires all the features to be quantitative in nature.
  • Salary and Work Experience are not normally distributed, but Naïve Bayes assumes that the features are normally distributed.

The data preparation process

Data preparation is vital for any data analysis task. The data has to be cleaned and put into a form that is suitable for analysis. This is key knowledge you should be aware of.

Coding of the response variable

We need to recode the response variable so that 1 corresponds to observations whose mode of transport is a car, and 0 to any other mode of transport.

Cars_dataset$Transport[Cars_dataset$Transport == "Car"] <- 1   # car users become 1
Cars_dataset$Transport[Cars_dataset$Transport != "1"] <- 0     # every other mode becomes 0

Scaling of continuous variables

Before applying the models, we will scale the continuous variables because they are all on different scales.

Cars_dataset$Age <- scale(Cars_dataset$Age)
Cars_dataset$Work.Exp <- scale(Cars_dataset$Work.Exp)
Cars_dataset$Salary <- scale(Cars_dataset$Salary)
Cars_dataset$Distance <- scale(Cars_dataset$Distance)

Transformation of Salary and Work experience variables for Naïve Bayes

Naïve Bayes assumes features to be normally distributed, but from the EDA (histograms of Salary and Work Experience) we saw that they are right-skewed. Hence, we apply a log1p transformation to these variables. Our R homework helper experts applied log1p rather than a plain log transformation because log(0) returns a non-finite value (-Inf) in R, whereas log1p(0) = 0, and Work Experience contains zeros. Only the quantitative variables are selected for Naïve Bayes.

# Note: this subset should be taken before the scaling step above; applying
# log1p to already-scaled (negative) values would produce NaN.
Cars_dataset_naive <- Cars_dataset[, c("Age", "Salary", "Work.Exp", "Distance", "Transport")]
Cars_dataset_naive$Salary <- log1p(Cars_dataset_naive$Salary)
Cars_dataset_naive$Work.Exp <- log1p(Cars_dataset_naive$Work.Exp)

Scaling of continuous variables for Naïve Bayes

Cars_dataset_naive$Age <- scale(Cars_dataset_naive$Age)
Cars_dataset_naive$Work.Exp <- scale(Cars_dataset_naive$Work.Exp)
Cars_dataset_naive$Salary <- scale(Cars_dataset_naive$Salary)
Cars_dataset_naive$Distance <- scale(Cars_dataset_naive$Distance)

Splitting data into training and test sets

set.seed(123)
train_ind <- sample(1:418, size = floor(418 * 0.8))   # 334 observations for training
train <- Cars_dataset[train_ind, ]
test <- Cars_dataset[-train_ind, ]

Modeling that was done by our online predictive modeling tutors

This is the part where predictive modeling is applied for the analysis. We fit different types of models using R. Read this section to explore how the models are fitted to the data set.

KNN

Before applying KNN, we need to choose k. One rule of thumb is to take the square root of the number of training observations: sqrt(334) ≈ 18. Therefore, k = 18, as computed below.
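A one-line sketch of this rule of thumb:

k <- round(sqrt(nrow(train)))   # sqrt(334) = 18.27..., rounds to 18
k
## 18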

Model training and predicting

library(class)                    # provides knn()
y_train <- train$Transport
trin <- train[, -length(train)]   # predictors only: drop the last column (Transport)
y_test <- test$Transport
tes <- test[, -length(test)]      # predictors only: drop the last column (Transport)
y_pred_knn <- knn(train = trin, test = tes, cl = y_train, k = 18)

Model Evaluation using Confusion matrix, Accuracy, and True Positive Rate.
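The confusion matrix itself can be tabulated the same way as in the later sections (a sketch using gmodels::CrossTable; the counts in the accuracy calculation below are read from it):

library(gmodels)   # provides CrossTable()
CrossTable(x = y_test, y = y_pred_knn, prop.chisq = F, prop.r = F, prop.t = F)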


accuracy = (78 + 3) / 84 * 100
accuracy
## 96.42857
True_positive_rate = 3 / (2 + 3) * 100
True_positive_rate
## 60

Interpretation of model results

Accuracy - The accuracy of the KNN model on the test set is 96.43%, which means our KNN model predicts 96.43% of the observations correctly. True positive rate - The true positive rate of our model is 60%, which means that among the people who drive a car, our model correctly predicts that they drive a car 60% of the time.

Naïve Bayes

We cannot apply Naïve Bayes directly to our data, as our Naïve Bayes setup assumes all of its features to be quantitative variables that are normally distributed. To apply Naïve Bayes, we select only the continuous variables and transform the Salary and Work Experience variables to make them approximately normally distributed.
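The objects train_naive, test_naive, and y_test_naive used below are not defined in the code shown; presumably they come from splitting Cars_dataset_naive with the same indices as before (a sketch under that assumption):

train_naive <- Cars_dataset_naive[train_ind, ]    # same training indices as the main split
test_naive <- Cars_dataset_naive[-train_ind, ]
y_test_naive <- test_naive$Transport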

Model training and predicting

library(e1071)   # provides naiveBayes()
nb_class <- naiveBayes(Transport ~ Age + Work.Exp + Distance + Salary, data = train_naive)
y_pred_naive_prob <- predict(nb_class, test_naive[-5], type = "raw")   # raw class probabilities
y_pred_naive <- ifelse(y_pred_naive_prob[, 2] >= 0.5, 1, 0)            # threshold at 0.5

Model Evaluation using Confusion matrix, Accuracy and True Positive Rate.

library(gmodels)   # provides CrossTable()
eval_naive <- data.frame(y_test_naive, y_pred_naive)
CrossTable(x = y_test_naive, y = y_pred_naive, prop.chisq = F, prop.r = F, prop.t = F)

accuracy = (75 + 4) / 84 * 100
accuracy
## 94.04762
True_positive_rate = 4 / (1 + 4) * 100
True_positive_rate
## 80

Interpretation of model results

Accuracy - The accuracy of our Naïve Bayes model on the test set is 94.05%, which means our Naïve Bayes model predicts 94.05% of the observations correctly. True positive rate - The true positive rate of our model is 80%, which means that among the people who drive a car, our model correctly predicts that they drive a car 80% of the time.

Logistic Regression

Model training and predicting

logistic_model <- glm(Transport ~ ., data = train, family = "binomial")   # response taken from the train data frame
y_pred_logistic_prob <- predict.glm(logistic_model, newdata = tes, type = "response")
y_pred_logistic <- rep(1, 84)                       # start with all 84 test predictions set to 1
y_pred_logistic[y_pred_logistic_prob < 0.5] <- 0    # set to 0 where the predicted probability is below 0.5

Model Evaluation using Confusion matrix, Accuracy, and True Positive Rate.

CrossTable(y_test, y_pred_logistic, prop.chisq=F, prop.c = F, prop.r = F, prop.t = F)

accuracy = (79 + 4) / 84 * 100
accuracy
## 98.80952
True_positive_rate = 4 / (1 + 4) * 100
True_positive_rate
## 80

Interpretation of model results

Accuracy - The accuracy of our Logistic Regression model on the test set is 98.81%, which means our Logistic Regression model predicts 98.81% of the observations correctly. True positive rate - The true positive rate of our model is 80%, which means that among the people who drive a car, our model correctly predicts that they drive a car 80% of the time.

Bagging

Model training and predicting (Random Forest)

library(randomForest)                            # provides randomForest()
train$Transport <- as.factor(train$Transport)    # a factor response makes randomForest do classification
colnames(train)[5] <- "work_exp"                 # rename Work.Exp for convenience
colnames(test)[5] <- "work_exp"
# mtry = 8 uses all eight predictors at every split, which makes the random forest equivalent to bagging
bagging <- randomForest(Transport ~ ., data = train, importance = TRUE, mtry = 8)
y_pred_bagging <- predict(bagging, newdata = test)

Model Evaluation using Confusion matrix, Accuracy, and True Positive Rate.


accuracy = (79 + 4) / 84 * 100
accuracy
## 98.80952
True_positive_rate = 4 / (1 + 4) * 100
True_positive_rate
## 80

Feature Importance

varImpPlot(bagging)

From the above graph, we can see that the important features for predicting the mode of transportation are Salary, Distance, Age, and Work Experience. So, from a business point of view, one should look at these variables to estimate which mode of transport an individual is going to use.

Interpretation of model results

Accuracy - The accuracy of our Bagging (Random Forest) model on the test set is 98.81%, which means our Bagging model predicts 98.81% of the observations correctly. True positive rate - The true positive rate of our model is 80%, which means that among the people who drive a car, our model correctly predicts that they drive a car 80% of the time.

Boosting

Model training and predicting (Gradient boosting)

library(gbm)                                                   # provides gbm()
train$Transport <- as.numeric(as.character(train$Transport))   # gbm's bernoulli distribution expects a numeric 0/1 response
boosting <- gbm(Transport ~ ., data = train, distribution = "bernoulli", n.trees = 1000, interaction.depth = 5)
y_pred_boosting_prob <- predict(boosting, newdata = test, n.trees = 1000, type = "response")   # predict with the same number of trees used in training
y_pred_boosting <- rep(1, 84)
y_pred_boosting[y_pred_boosting_prob < 0.5] <- 0

Model Evaluation using Confusion matrix, Accuracy, and True Positive Rate.


accuracy = (79 + 5) / 84 * 100
accuracy
## 100
True_positive_rate = 5 / (0 + 5) * 100
True_positive_rate
## 100

Feature Importance
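The relative-influence plot for a gbm model is typically produced with its summary() method (a sketch; the original plotting call is not shown):

summary(boosting)   # plots and returns the relative influence of each predictor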

From the above graph, we can see that the important features for predicting the mode of transportation are Salary, Distance, Age, and Work Experience. So, from a business point of view, one should look at these variables to estimate which mode of transport an individual is going to use.

Interpretation of model results

Accuracy - The accuracy of our Boosting (GBM) model on the test set is 100%, which means our Boosting model predicts 100% of the observations correctly. True positive rate - The true positive rate of our model is 100%, which means that among the people who drive a car, our model correctly predicts that they drive a car 100% of the time.