Classification

Overview:

This analysis was part of my submission for the Great Learning Spring 2023 Hack Linguist Hackathon Competition. The dataset used in the analysis were provided by Great Learning. The general framework for the hackathon was that competitors were provided a brief outline, goal, and evalutation criteria. Competitors chose the Machine Learning approach to take and tools to use.

This problem was based on the Shinkansen Bullet Train dataset. The dataset contains a random sample of individuals who traveled on the bullet train train. Passengers were asked to provide their feedback on various parameters related to the travel, on-time performance of the trains, passenger information, along with their overall experience is also recorded. In the survey, each passenger was explicitly asked whether they were satisfied with their overall travel experience or not, and that is captured in the data of the survey report under the variable labeled ‘Overall_Experience’. The objective of the problem is to understand which parameters play an important role in swaying passenger feedback towards a positive scale and accurately predict whether a passenger had a good experience.

This is a classification problem involving many features, both categorical and numeric. I chose to take a classic machine learning classification approach and use three different models:

random forest
XGBoost
XGBoost (Regression)

Goal:

The goal of the problem was to predict whether a passenger was satisfied or not considering his/her overall experience of traveling on the Shinkansen Bullet Train.

Dataset:

The problem consists of 2 separate datasets: Travel data & Survey data. Travel data has information related to passengers and attributes related to the Shinkansen train, in which they traveled. The survey data is aggregated data of surveys indicating the post-service experience. You are expected to treat both these datasets as raw data and perform any necessary data cleaning/validation steps as required.

The data has been split into train and test data previously.

Train_Data Test_Data

Target Variable:

Overall_Experience (1 represents ‘satisfied’, and 0 represents ‘not satisfied’)

Evaluation metric:

The primary evaluation metric that I used and that was required by the competition was Accuracy Score. The accuracy socre is simply the percentage of predictions made by the model that turned out to be correct (the accuracy of the model). It was calculated as the total number of correct predictions (True Positives + True Negatives) divided by the total number of observations in the dataset (In other words, the best possible accuracy is 100% (or 1) and the worst possible accuracy 0%).

Importing Libraries and the Dataset

Loading the datasets

First I merge independent survey and travel datasets into single train and test datasets. Because the data has been provided pre split into train and test, I dont need to separate it myself, if I were we would need to take care of that before conducting EDA.

lets get a better understand of the data by taking a look at a summary of the train dataframe…

Exploratory Data Analysis

<class 'pandas.core.frame.DataFrame'>
Int64Index: 94379 entries, 0 to 94378
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 ID                       94379 non-null  int64  
 Gender                   94302 non-null  object 
 Customer_Type            85428 non-null  object 
 Age                      94346 non-null  float64
 Type_Travel              85153 non-null  object 
 Travel_Class             94379 non-null  object 
 Travel_Distance          94379 non-null  int64  
 Departure_Delay_in_Mins  94322 non-null  float64
 Arrival_Delay_in_Mins    94022 non-null  float64
 Overall_Experience       94379 non-null  int64  
Seat_Comfort             94318 non-null  object 
Seat_Class               94379 non-null  object 
Arrival_Time_Convenient  85449 non-null  object 
Catering                 85638 non-null  object 
Platform_Location        94349 non-null  object 
Onboard_Wifi_Service     94349 non-null  object 
Onboard_Entertainment    94361 non-null  object 
Online_Support           94288 non-null  object 
Ease_of_Online_Booking   94306 non-null  object 
Onboard_Service          86778 non-null  object 
Legroom                  94289 non-null  object 
Baggage_Handling         94237 non-null  object 
CheckIn_Service          94302 non-null  object 
Cleanliness              94373 non-null  object 
Online_Boarding          94373 non-null  object 
dtypes: float64(3), int64(3), object(19)
memory usage: 18.7+ MB

Some observations from this overview are that there are 94379 unique rows in the dataframe, and there are no duplicate rows un the ‘ID’ field. There is a mixture of numeric and catagorical data in the dataset. There are 25 total features, but we can drop ‘ID’ as it does not provide any additional information for this analysis (it is an index/key). The field ‘Overall_Experience’ is the y variable, this is what I will be making predictions on. Also we can see that there are missing values in many of the fields (both catagorical and numeric). How you deal with missing values can play an important role in how accurate the ML models will be able to predict Overall_Experience. I still want to get a better understanding of each of the features in the data set and how they are distributed. Next I am going to print out a basic summary of the numeric features:

	count	mean	std	min	25%	50%	75%	max
ID	94379.0	9.884719e+07	27245.014865	98800001.0	98823595.5	98847190.0	98870784.5	98894379.0
Age	94346.0	3.941965e+01	15.116632	7.0	27.0	40.0	51.0	85.0
Travel_Distance	94379.0	1.978888e+03	1027.961019	50.0	1359.0	1923.0	2538.0	6951.0
Departure_Delay_in_Mins	94322.0	1.464709e+01	38.138781	0.0	0.0	0.0	12.0	1592.0
Arrival_Delay_in_Mins	94022.0	1.500522e+01	38.439409	0.0	0.0	0.0	13.0	1584.0
Overall_Experience	94379.0	5.466576e-01	0.497821	0.0	0.0	1.0	1.0	1.0

ID is not important, I will drop that. The median age of survey participants riding the bullet train is 40, and age looks almsot normally distributed. There is large range in the travel distance records. Departure and Arrival delays also look like they may be skewed. Lets plot these to get a better idea:

ID
Skew : 0.0

png

Age
Skew : -0.0

png

Travel_Distance
Skew : 0.47

png

Departure_Delay_in_Mins
Skew : 7.16

png

Arrival_Delay_in_Mins
Skew : 6.98

png

Overall_Experience
Skew : -0.19

png

It looks like Age is actually more or less normally distributed. The more normal the features are, the better the classification models will perform (and the less I will need to transform). Travel Distance is somewhat right skewed, and as I suspected departure and arrival delays are highly right skewed. These will need to be dealt will. There are several different approaches one could take to normalize data like this in classification problems. The approach I take later in this analysis is to use feature engineering to bin the values into categories. Lets also take a look at the balance between negative and postiive overall experience:

1    0.546658
0    0.453342
Name: Overall_Experience, dtype: float64

There are marginally more positives than negatives, we can take balance into account when tuning the classifer models. Next, lets take a closer look at the categorical features:

Female    0.507041
Male      0.492959
Name: Gender, dtype: float64
****************************************
Loyal Customer       0.817332
Disloyal Customer    0.182668
Name: Customer_Type, dtype: float64
****************************************
Business Travel    0.688373
Personal Travel    0.311627
Name: Type_Travel, dtype: float64
****************************************
Eco         0.522807
Business    0.477193
Name: Travel_Class, dtype: float64
****************************************
Acceptable           0.224326
Needs Improvement    0.222079
Good                 0.218357
Poor                 0.160998
Excellent            0.137524
Extremely Poor       0.036716
Name: Seat_Comfort, dtype: float64
****************************************
Green Car    0.502601
Ordinary     0.497399
Name: Seat_Class, dtype: float64
****************************************
Good                 0.229072
Excellent            0.206954
Acceptable           0.177615
Needs Improvement    0.175426
Poor                 0.160236
Extremely Poor       0.050697
Name: Arrival_Time_Convenient, dtype: float64
****************************************
Acceptable           0.215652
Needs Improvement    0.209930
Good                 0.209825
Poor                 0.161821
Excellent            0.157115
Extremely Poor       0.045657
Name: Catering, dtype: float64
****************************************
Manageable           0.256208
Convenient           0.232244
Needs Improvement    0.189000
Inconvenient         0.174342
Very Convenient      0.148184
Very Inconvenient    0.000021
Name: Platform_Location, dtype: float64
****************************************
Good                 0.242027
Excellent            0.222239
Acceptable           0.213230
Needs Improvement    0.207697
Poor                 0.113843
Extremely Poor       0.000965
Name: Onboard_Wifi_Service, dtype: float64
****************************************
Good                 0.322654
Excellent            0.229374
Acceptable           0.186094
Needs Improvement    0.147582
Poor                 0.091574
Extremely Poor       0.022721
Name: Onboard_Entertainment, dtype: float64
****************************************
Good                 0.318344
Excellent            0.274627
Acceptable           0.166532
Needs Improvement    0.132657
Poor                 0.107829
Extremely Poor       0.000011
Name: Online_Support, dtype: float64
****************************************
Good                 0.306545
Excellent            0.262380
Acceptable           0.173796
Needs Improvement    0.153532
Poor                 0.103578
Extremely Poor       0.000170
Name: Ease_of_Online_Booking, dtype: float64
****************************************
Good                 0.314193
Excellent            0.245131
Acceptable           0.208244
Needs Improvement    0.131254
Poor                 0.101132
Extremely Poor       0.000046
Name: Onboard_Service, dtype: float64
****************************************
Good                 0.306186
Excellent            0.263361
Acceptable           0.173764
Needs Improvement    0.167071
Poor                 0.086012
Extremely Poor       0.003606
Name: Legroom, dtype: float64
****************************************
Good                 0.370810
Excellent            0.275932
Acceptable           0.188535
Needs Improvement    0.103558
Poor                 0.061165
Name: Baggage_Handling, dtype: float64
****************************************
Good                 0.281033
Acceptable           0.273621
Excellent            0.208278
Needs Improvement    0.118958
Poor                 0.118099
Extremely Poor       0.000011
Name: CheckIn_Service, dtype: float64
****************************************
Good                 0.375393
Excellent            0.276064
Acceptable           0.184894
Needs Improvement    0.103907
Poor                 0.059689
Extremely Poor       0.000053
Name: Cleanliness, dtype: float64
****************************************
Good                 0.270554
Acceptable           0.238151
Excellent            0.230384
Needs Improvement    0.142530
Poor                 0.118254
Extremely Poor       0.000127
Name: Online_Boarding, dtype: float64
****************************************

most of the categorical features are ordinal. This is important to note as we will need to encode them in preprocessing before running the models. There are no missing categories but there is imbalance in the percentage of data in each catagory (some categories are not very well represented). Finally, lets see if any of the features in the data are correlated:

png

Arrival and Departure delay are highly correlated (suprise suprise :), but there is little correlation between any of the other numeric variables.

Preprocessing the training data

This is where Im going to drop the ID column out, and then split the data into the x (features) and y (target variable)

Im going to impute the numeric features using an iterative imputer to replace missing values based on all other features for a record, and categorical features will be replaced with the most frequent value for each column. After imputing, in the same code chunk, I encode the categorical features. Since most of the categorical features are oridinal, and there are many levels, I use the label_encoder method rather than one-hot encoding. With decision trees and random forest, one hot encoding can increase the number of features so much that it introduces ‘noise’ and leads to multicoliniarity.

Next, I bin the arrival and departure delay features to decrease the effect of skew and outliers

and make sure all of the features are numeric

Random Forest

RandomForestClassifier(random_state=7)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     42786
           1       1.00      1.00      1.00     51593

    accuracy                           1.00     94379
   macro avg       1.00      1.00      1.00     94379
weighted avg       1.00      1.00      1.00     94379

png

RandomForestClassifier(class_weight='balanced', max_depth=20, max_features=0.5,
                       n_estimators=150, random_state=7)

              precision    recall  f1-score   support

           0       0.99      1.00      1.00     42786
           1       1.00      1.00      1.00     51593

    accuracy                           1.00     94379
   macro avg       1.00      1.00      1.00     94379
weighted avg       1.00      1.00      1.00     94379

png

Now predicting y_test on x_test

XGBoost

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='logloss', feature_types=None, gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=1, ...)

Now predicting y_test on x_test

	ID	Overall_Experience
0	99900001	1
1	99900002	1
2	99900003	1
3	99900004	0
4	99900005	1

this method had 0.9496 accuracy

XGBoost Regression

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.8,
             early_stopping_rounds=None, enable_categorical=False, eta=0.1,
             eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
             grow_policy='depthwise', importance_type=None,
             interaction_constraints='', learning_rate=0.100000001, max_bin=256,
             max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,
             max_depth=10, max_leaves=0, min_child_weight=0.5, missing=nan,
             monotone_constraints='()', n_estimators=1000, n_jobs=0,
             num_parallel_tree=1, predictor='auto', ...)

	ID	Overall_Experience
0	99900001	1.0
1	99900002	1.0
2	99900003	1.0
3	99900004	0.0
4	99900005	1.0

<AxesSubplot: >

png

Tuning hyperparameters of XGBoost Regression

Some important hyperparameters that can be tuned:

booster [default = gbtree ] Which booster to use. Can be gbtree, gblinear, or dart; gbtree and dart use tree-based models while gblinear uses linear functions.
min_child_weight [default = 1]

The minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning.The larger min_child_weight is, the more conservative the algorithm will be.

For a better understanding of each parameter in the XGBoost Classifier, please refer to this source.

Fitting 5 folds for each of 25 candidates, totalling 125 fits

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=0.5, colsample_bynode=1,
             colsample_bytree=0.8999999999999999, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=0, gpu_id=-1, grow_policy='depthwise', importance_type=None,
             interaction_constraints='', learning_rate=0.01, max_bin=256,
             max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,
             max_depth=20, max_leaves=0, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=500, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=20, ...)

<AxesSubplot: >

png

<AxesSubplot: >

png

This method had prediction accuracy of 0.9508455, placing 23 in the Great Learning Hack Linguist Hackathon Competition

png

Top of Page Raw Code