Statistical methodologies and machine learning with Python: episode 27 "machine learning with SVM and SVR"

N.B.: if you want to stop receiving private notifications for these episodes, please use the button to leave the conversation. Nobody is forced to stay.

Welcome back, and thanks for reading the previous three machine learning episodes. In this episode, I show how, with just Python code, you can carry out the full workflow of two machine learning algorithms: Support Vector Machine for Classification (SVM) and Support Vector Machine for Regression (SVR).

For SVM, the data I've used is available for download on the internet and is the same data I've used in previous episodes (the dataset was last updated by the source on 03/10/2020). It concerns the chemical elements (oxides, for instance) of the rocks "Andesite, Rhyolite, Dacite, Basalt, Trachyte, Picrite, Phonolite". The rock samples were collected from many places, countries and continents, such as Greece, Italy, Iran, Georgia, Bolivia, Chile, Argentina, Peru, Mexico, Puerto Rico, Guadeloupe, China, Mongolia, Egypt, Morocco, Tanzania, Cameroon, the Solomon Islands, New Zealand, Canada, Yellowstone, etc. I've added nickel (a metal) to the oxide elements since it turned out to contribute well to the predictions.

For SVR, the data I've used is also available for download on the internet and concerns the prediction of the charges (the dependent variable) of a life insurance policy from the features "age of the primary beneficiary, sex, bmi (body mass index), number of children covered by the insurance, smoking status (smoker or non-smoker) and region (the beneficiary's residential area)".

SVM is a classifier that finds an optimal hyperplane maximizing the margin between two classes. When the data cannot be separated by a straight line or a hyperplane, we use what is called a "kernel": it maps data that is not linearly separable in its original n-dimensional space into a higher-dimensional space where it becomes linearly separable. I leave it to the reader to research the theory, because the internet is full of great videos, articles and figures that explain the methodology in an accessible way.
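
As a minimal illustration of this idea (not the exact code of this episode, and using toy two-dimensional data in place of the rock chemistry), an RBF-kernel SVM can be fitted with scikit-learn like this:

# Minimal sketch of a kernel SVM classifier (toy data stands in for the rock chemistry).
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data that a straight line cannot separate.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Scaling matters for SVM because the margin is distance-based.
X_scaled = StandardScaler().fit_transform(X)

# The RBF kernel implicitly maps the data to a higher-dimensional space
# where a separating hyperplane can be found.
clf = SVC(kernel="rbf", C=10, gamma="scale")
clf.fit(X_scaled, y)
print("Training accuracy:", round(clf.score(X_scaled, y), 3))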

SVM works well for "fat" data (many features and relatively few rows) and handles complex relationships better than logistic regression. Indeed, SVM achieved better scores on the test set for Precision, Recall and Accuracy (0.86, 0.859 and 0.859 respectively) than the Multiple Logistic Regression shown in episode 24 (0.838, 0.835 and 0.835 respectively), despite a slight overfitting, which is not worrying at this stage because the test performance is only a little lower than the training performance.

On the other hand, SVM can tolerate many outliers (it only looks at the points closest to the separating lines/planes). Nevertheless, removing the outliers from the rock chemistry data used in this episode did improve the model's performance, likely because those outliers sit near the separating boundaries (an assumption that is hard to verify in a 2D plot, since the data is 9-dimensional).

It's important to note that PCA did not improve the performance of either SVM or SVR, so I've omitted it from the main workflow of this episode. However, I did run SVM on a two-component PCA in order to draw a meshgrid plot (a 2D plot where the SVM classes are predicted from the two PCA components only, in order to visualize the boundaries between the classes). The meshgrid plot right below shows the boundaries between the rock names in the 2D space of the two PCA components. Unfortunately, the boundaries are not clean because the PCA was restricted to two components, which explain much less than 95% of the total variability (the meshgrid plot cannot accept more than two components), and the horizontal and vertical discretization steps did not help much in getting smoother boundaries.

[Figure: meshgrid plot of the SVM class boundaries in the plane of the two PCA components]
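
Below is a sketch of how such a meshgrid plot can be produced (illustrative only: it uses synthetic data in place of the rock chemistry, and the plotting details may differ from the figure above):

# Sketch: 2D decision-boundary (meshgrid) plot of SVM classes on two PCA components.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy stand-in for the 9 standardized chemistry features and 7 rock classes.
X, y = make_classification(n_samples=1000, n_features=9, n_informative=6,
                           n_classes=7, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# PCA forced to 2 components so the classes can be drawn in a plane.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
clf = SVC(kernel="rbf", C=10).fit(X_2d, y)

# Grid over the 2D PCA space; the step controls how smooth the borders look.
step = 0.05
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))

# Predict a class for every grid point and draw the coloured regions.
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10)
plt.xlabel("PCA component 1")
plt.ylabel("PCA component 2")
plt.title("SVM classes predicted on two PCA components")
plt.show()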

For SVR, despite a few underestimated values between 10,000 and 30,000, the coefficient of determination is quite good (R-squared of 0.83), indicating a good fit, and the model achieved a low Mean Squared Error (0.198) on the test set. One should keep in mind that some machine learning regression algorithms do not always perform well compared to classical statistical regression methods.
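
Given the magnitude of the reported MSE (0.198), the charges were presumably transformed or scaled before fitting (a regression line of the back-transformed data is shown in Part 2). Here is a generic sketch of fitting and scoring an SVR on a scaled target; it uses synthetic data, not the episode's exact code:

# Sketch: fit and score an SVR on a scaled target (synthetic stand-in data).
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                       # stands in for the 6 insurance features
y = X @ rng.normal(size=6) + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling the target keeps the epsilon-insensitive loss on a sensible scale
# (which is presumably why the reported MSE is ~0.2 rather than in currency units).
y_scaler = StandardScaler()
y_train_s = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test_s = y_scaler.transform(y_test.reshape(-1, 1)).ravel()

model = SVR(kernel="rbf", C=10).fit(X_train, y_train_s)
y_pred = model.predict(X_test)
print("MSE:", round(mean_squared_error(y_test_s, y_pred), 3))
print("R-squared:", round(r2_score(y_test_s, y_pred), 3))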

Correlation and causation:

It's very important to remember that, in regression, correlation is not causation. In other words, if we take the most important features smoker, age and bmi, then the fact that "smoker" is correlated with the dependent variable "life insurance charges" does NOT necessarily mean that smoking causes death (likewise, one cannot say that age or body mass cause death). That said, I am not claiming that smoking or a high body mass will make you healthy either!

There is a multitude of parameters the analyst can tune in order to seek better performance (a better fit in the regression case); unfortunately I cannot try them all, because it would be too time consuming and this work is for demonstration purposes. For instance, the analyst can try to:

  • Transform the categorical features such as sex, region and smoking status into binary indicator features instead of just label-encoding/factorizing them (binary features give the algorithm extra information to work with, though this can still be problematic for some classification algorithms since it makes the data fatter; see the sketch after this list).
  • Tune the One-Class SVM outlier detection parameters to enlarge the selection area beyond the current frontier (see the figure below and the figures of the previous episodes).
  • Drop the least important features based on the feature selection plots, then re-run the algorithm (parameter tuning and outlier method selection will have to be done again).
  • Try the Elliptic Envelope outlier detection method and tune its parameters.
  • Try the DBSCAN outlier detection method and tune its parameters.
  • Decrease or increase the training set size depending on the algorithm used (e.g. fewer observations for SVM, a small dataset for KNN since it takes longer to run, more than 50 observations for Logistic Regression, etc.).
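
As an example of the first point in the list, here is a sketch of turning the categorical insurance features into binary indicator columns with pandas (the column values shown are illustrative):

# Sketch: one-hot encoding instead of label-encoding (illustrative values).
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "smoker": ["no", "yes", "no"],
    "region": ["northwest", "southeast", "northwest"],
})

# drop_first=True keeps k-1 indicators per k categories to avoid redundancy,
# at the cost of making the data "fatter" than simple label encoding.
encoded = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)
print(encoded.head())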

Alright, that's all for today. If this is interesting to you or to one of your connections, or if you just want to show support, then please hit "like", share it with your connections on LinkedIn, Twitter and other platforms if you can, and pass along my details. If someone sees a startup or business opportunity in applied statistics and machine learning (and geostatistics), please ask them to get in touch with me in private. Enjoy your day.

PART 1 "Support Vector Machine for Classification":


Data shape before removing the duplicates      : (2365, 10)

Data shape after removing the duplicates          : (2363, 10)

Dependent variable name: Rock name

Column names: ['NI(PPM)', 'SIO2(WT%)', 'AL2O3(WT%)', 'FEOT(WT%)', 'CAO(WT%)', 'MGO(WT%)', 'MNO(WT%)', 'K2O(WT%)', 'NA2O(WT%)']
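
For reference, the loading and de-duplication step above can be reproduced with a few lines of pandas (the file name "rocks.csv" is a placeholder, not the actual source file):

# Sketch: load the rock chemistry data and drop exact duplicate rows.
import pandas as pd

df = pd.read_csv("rocks.csv")            # placeholder file name
print("Data shape before removing the duplicates :", df.shape)

df = df.drop_duplicates()
print("Data shape after removing the duplicates  :", df.shape)

target = "Rock name"
features = [c for c in df.columns if c != target]
print("Dependent variable name:", target)
print("Column names:", features)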

Count of zero values in all features (before label-encoding and multi-column encoding if any):

                              Count of zeros

NI(PPM)                            91

SIO2(WT%)                       0

AL2O3(WT%)                   0

FEOT(WT%)                      0

CAO(WT%)                       2

MGO(WT%)                     4

MNO(WT%)                     14

K2O(WT%)                        1

NA2O(WT%)                    11

Count of NaN values in all features:

                              NaN count

NI(PPM)                   0

SIO2(WT%)               0

AL2O3(WT%)           0

FEOT(WT%)              0

CAO(WT%)               0

MGO(WT%)             0

MNO(WT%)             0

K2O(WT%)               0

NA2O(WT%)            0

Number of total observations before dropping the missing values         : 2363

Number of total observations after dropping the missing values             : 2363
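
A sketch of the zero/NaN checks above, continuing from the loading sketch (df and features are assumed from that sketch):

# Sketch: count zeros and NaNs per feature, then drop rows with missing values.
zero_counts = (df[features] == 0).sum().to_frame("Count of zeros")
nan_counts = df[features].isna().sum().to_frame("NaN count")
print(zero_counts)
print(nan_counts)

before = len(df)
df = df.dropna()        # no missing values were found here, so the shape is unchanged
print("Observations before / after dropping missing values:", before, "/", len(df))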

Class names: ['ANDESITE', 'BASALT', 'DACITE', 'PHONOLITE', 'PICRITE', 'RHYOLITE', 'TRACHYTE']

Total number of observations to be analysed for outliers           : 2363

Total number of rows remaining after outlier clean-up                    : 2128
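
The outlier clean-up can be done, for instance, with the One-Class SVM mentioned in the tuning list above; here is a sketch with illustrative parameters (the actual nu/gamma settings used in this episode are not shown in the output):

# Sketch: flag and drop outliers with a One-Class SVM (parameters are illustrative).
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X_scaled = StandardScaler().fit_transform(df[features].values)

# nu is an upper bound on the fraction of points flagged as outliers.
oc_svm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
labels = oc_svm.fit_predict(X_scaled)          # +1 = inlier, -1 = outlier

print("Total number of observations analysed for outliers:", len(df))
df = df[labels == 1]
print("Total number of rows remaining after outlier clean-up:", len(df))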

Rock name frequencies:

ANDESITE  BASALT  DACITE  PHONOLITE  PICRITE  RHYOLITE  TRACHYTE
     575     306     391         45      138       435       238

Summary stats of the numerical features:

       NI(PPM) SIO2(WT%) AL2O3(WT%) FEOT(WT%) CAO(WT%) MGO(WT%) 

count 2128.000  2128.000   2128.000  2128.000 2128.000 2128.000

mean    73.607    60.345     15.006     5.903    5.004    3.749

std    181.467     8.966      2.437     3.361    3.441    4.382

min      0.000    41.600      0.820     0.060    0.020    0.000

25%      3.000    53.058     13.550     3.170    1.958    0.759

50%     12.505    60.390     15.142     5.320    4.500    2.400

75%     50.825    67.567     16.510     8.640    7.700    4.933

max   1356.000    78.040     23.740    17.020   15.410   25.590

      MNO(WT%) K2O(WT%) NA2O(WT%)

count 2128.000 2128.000  2128.000

mean     0.121    2.810     3.433

std      0.096    2.236     1.441

min      0.000    0.000     0.000

25%      0.070    1.130     2.599

50%      0.120    2.190     3.460

75%      0.160    4.021     4.281

max      2.390   11.020     8.950

Class frequencies:

                              Frequencies

ANDESITE                 575

BASALT                      306

DACITE                      391

PHONOLITE              45

PICRITE                      138

RHYOLITE                  435

TRACHYTE                238

Correlations of the features with the target variable (in absolute values):

                              Correlations with the target

CAO(WT%)                           0.623

MGO(WT%)                          0.567

FEOT(WT%)                          0.561

K2O(WT%)                            0.518

SIO2(WT%)                           0.516

MNO(WT%)                          0.392

NI(PPM)                                 0.354

AL2O3(WT%)                        0.331

NA2O(WT%)                         0.097

Spearman correlations between the features:

           NI(PPM) SIO2(WT%) AL2O3(WT%) FEOT(WT%) CAO(WT%) MGO(WT%) 

NI(PPM)        NaN    -0.703     -0.055     0.641    0.675    0.806

SIO2(WT%)      NaN       NaN     -0.186    -0.896   -0.874   -0.875

AL2O3(WT%)     NaN       NaN        NaN     0.033    0.202    0.039

FEOT(WT%)      NaN       NaN        NaN       NaN    0.811    0.834

CAO(WT%)       NaN       NaN        NaN       NaN      NaN    0.848

MGO(WT%)       NaN       NaN        NaN       NaN      NaN      NaN

MNO(WT%)       NaN       NaN        NaN       NaN      NaN      NaN

K2O(WT%)       NaN       NaN        NaN       NaN      NaN      NaN

NA2O(WT%)      NaN       NaN        NaN       NaN      NaN      NaN

           MNO(WT%) K2O(WT%) NA2O(WT%)

NI(PPM)       0.362   -0.522    -0.351

SIO2(WT%)    -0.733    0.604     0.272

AL2O3(WT%)    0.124   -0.034     0.365

FEOT(WT%)     0.749   -0.661    -0.314

CAO(WT%)      0.639   -0.742    -0.320

MGO(WT%)      0.563   -0.672    -0.421

MNO(WT%)        NaN   -0.484    -0.097

K2O(WT%)        NaN      NaN     0.111

NA2O(WT%)       NaN      NaN       NaN
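
A sketch of how the two correlation tables above can be computed (the output does not say which correlation method was used against the target; Spearman is used here for both, and the rock names are label-encoded first):

# Sketch: feature-target correlations (absolute values) and feature-feature Spearman correlations.
import pandas as pd

y_encoded, class_names = pd.factorize(df[target])
print("Class names:", sorted(class_names))

corr_with_target = (
    df[features]
    .corrwith(pd.Series(y_encoded, index=df.index), method="spearman")
    .abs()
    .sort_values(ascending=False)
    .to_frame("Correlations with the target")
)
print(corr_with_target)

# Full symmetric matrix (the article's output masks the lower triangle).
print(df[features].corr(method="spearman").round(3))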

*******************************************************************

Data split: training set size 60.0%

Validation set size out of the data left out from the first split: 20.0%

Test set size out of the data left out from the first split: 80.0%

Data shapes: Training set - (1276, 9) / Validation set - (171, 9) / Test set - (681, 9)
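
The three-way split above can be obtained with two successive train_test_split calls; a sketch (variable names and the stratify/random_state options are mine):

# Sketch: 60% training, then 20%/80% of the remainder for validation/test.
from sklearn.model_selection import train_test_split

X = df[features].values
y = y_encoded  # encoded rock names from the correlation sketch

# First split: 60% training, 40% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=42, stratify=y)

# Second split: 20% of the held-out part for validation, 80% for test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.2, random_state=42, stratify=y_rest)

print("Data shapes: Training set -", X_train.shape,
      "/ Validation set -", X_val.shape, "/ Test set -", X_test.shape)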

*******K-Fold Cross-validation score results on the basic model (5 folds)*******

Scores (accuracy)            : [0.818, 0.829, 0.795, 0.831, 0.814]

Mean score (accuracy) : 0.817
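
A sketch of the 5-fold cross-validation on a baseline (default-parameter) SVC, assuming X_train and y_train from the split sketch above:

# Sketch: 5-fold cross-validated accuracy of a baseline SVC on the training set.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

scores = cross_val_score(SVC(), X_train, y_train, cv=5, scoring="accuracy")
print("Scores (accuracy)     :", np.round(scores, 3).tolist())
print("Mean score (accuracy) :", round(scores.mean(), 3))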

******************Hyperparameters tuning ['kernel', 'C'] *****************

ALL PARAMS:

0.81583 (+/-0.01 std) for {'C': 0.1, 'kernel': 'linear'}

0.79389 (+/-0.01 std) for {'C': 0.1, 'kernel': 'rbf'}

0.82367 (+/-0.017 std) for {'C': 1, 'kernel': 'linear'}

0.8174 (+/-0.013 std) for {'C': 1, 'kernel': 'rbf'}

0.82602 (+/-0.018 std) for {'C': 10, 'kernel': 'linear'}

0.83777 (+/-0.016 std) for {'C': 10, 'kernel': 'rbf'}

CHOSEN PARAMS BASED ON BEST SCORES:

Params model 1: C = 10 , kernel = rbf

Params model 2: C = 10 , kernel = linear

Params model 3: C = 1 , kernel = linear
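
The 'kernel' and 'C' tuning can be reproduced with GridSearchCV; a sketch using the same grid of values as reported above:

# Sketch: grid search over kernel and C, reporting mean and std of CV accuracy.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("ALL PARAMS:")
for mean, std, params in zip(grid.cv_results_["mean_test_score"],
                             grid.cv_results_["std_test_score"],
                             grid.cv_results_["params"]):
    print(f"{round(mean, 5)} (+/-{round(std, 3)} std) for {params}")

print("Best params:", grid.best_params_)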

Evaluation of the performance of the candidate models on the validation set:

Kernel: rbf / C: 10 -- Accuracy: 0.871 / Precision: 0.87 / Recall: 0.871 / Latency: 17.8ms

Kernel: linear / C: 10 -- Accuracy: 0.801 / Precision: 0.796 / Recall: 0.801 / Latency: 5.0ms

Kernel: linear / C: 1 -- Accuracy: 0.801 / Precision: 0.796 / Recall: 0.801 / Latency: 5.0ms

Model chosen: 1

Evaluation of the performance of the best model on the test set:

Kernel: rbf / C: 10 -- Accuracy: 0.859 / Precision: 0.86 / Recall: 0.859 / Latency: 29.0ms
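
A sketch of how a candidate model can be scored on a hold-out set together with its prediction latency (precision and recall are computed here as weighted averages over the seven classes, which is an assumption about how the figures above were obtained):

# Sketch: accuracy / precision / recall plus prediction latency for one candidate model.
import time
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import SVC

model = SVC(kernel="rbf", C=10).fit(X_train, y_train)

start = time.time()
y_val_pred = model.predict(X_val)
latency_ms = (time.time() - start) * 1000

print(f"Kernel: rbf / C: 10 -- "
      f"Accuracy: {round(accuracy_score(y_val, y_val_pred), 3)} / "
      f"Precision: {round(precision_score(y_val, y_val_pred, average='weighted'), 3)} / "
      f"Recall: {round(recall_score(y_val, y_val_pred, average='weighted'), 3)} / "
      f"Latency: {round(latency_ms, 1)}ms")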

Model's performance results on the training set:

Precision = 0.898, Recall = 0.897, F1 = 0.897, Accuracy = 0.897

Cohen's kappa: 0.874

Classification report:

             Precision   Recall  F1-score  Support

          0       0.91     0.94      0.92      336
          1       0.95     0.88      0.92      197
          2       0.81     0.81      0.81      227
          3       0.88     0.79      0.84       29
          4       0.99     0.96      0.97       80
          5       0.88     0.93      0.91      247
          6       0.90     0.88      0.89      160

Accuracy                             0.90     1276
Macro avg         0.91     0.89      0.90     1276
Weighted avg      0.90     0.90      0.90     1276

Model's performance results on the test set:

Precision = 0.860, Recall = 0.859, F1 = 0.859, Accuracy = 0.859

Cohen's kappa: 0.826

Classification report:

             Precision   Recall  F1-score  Support

          0       0.91     0.88      0.89      195
          1       0.85     0.84      0.84       87
          2       0.81     0.78      0.80      132
          3       0.78     0.54      0.64       13
          4       0.96     0.96      0.96       49
          5       0.88     0.90      0.89      141
          6       0.75     0.88      0.81       64

Accuracy                             0.86      681
Macro avg         0.85     0.82      0.83      681
Weighted avg      0.86     0.86      0.86      681
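
The classification reports and Cohen's kappa above can be produced as follows (sketch, assuming the fitted model and the test arrays from the sketches above; the integer class labels come from the label encoding of the rock names):

# Sketch: full classification report and Cohen's kappa on the test set.
from sklearn.metrics import classification_report, cohen_kappa_score

y_test_pred = model.predict(X_test)
print("Cohen's kappa:", round(cohen_kappa_score(y_test, y_test_pred), 3))
print(classification_report(y_test, y_test_pred, digits=2))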

PART 2 "Support Vector Machine for Regression":


Regression line of the back-transformed data:

[Figure: regression line of the back-transformed data]

List of duplicates:

age  sex     bmi        children   smoker    region            charges

581       19    male  30.59      0              no          northwest     1639.563

Data shape before removing the duplicates      : (1338, 7)     

Data shape after removing the duplicates          : (1337, 7)

Dependent variable name: charges

Column names: ['age', 'sex', 'bmi', 'children', 'smoker', 'region']

Count of zero values in all features (before label-encoding and multi-column encoding if any):

                  Count of zeros

age                   0

sex                   0

bmi                  0

children          573

smoker            0

region              0

Count of NaN values in all features:

              NaN count

age              0

sex              0

bmi             5

children     0

smoker      0

region        0

Number of total observations before dropping the missing values         :  1337

Number of total observations after dropping the missing values             : 1332

Overview of the processed and cleaned data (first five rows):

   age sex    bmi children smoker region   charges

0 19.0 0.0 27.900      0.0    0.0    0.0 16884.924

1 18.0 1.0 33.770      1.0    1.0    1.0  1725.552

2 28.0 1.0 33.000      3.0    1.0    1.0  4449.462

3 33.0 1.0 22.705      0.0    1.0    2.0 21984.471

4 32.0 1.0 28.880      0.0    1.0    2.0  3866.855
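
The sex, smoker and region columns in the preview above have been label-encoded into numeric codes; a sketch of that step with pandas (df_ins is a placeholder name for the insurance DataFrame, and the resulting integer codes may differ from the ones actually used):

# Sketch: label-encode the categorical insurance columns
# (pd.factorize assigns integer codes in order of appearance).
import pandas as pd

for col in ["sex", "smoker", "region"]:
    df_ins[col] = pd.factorize(df_ins[col])[0].astype(float)

print(df_ins.head())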

Summary stats of the numerical features:

N.B.: some statistics won't make sense for features such as sex and smoker, but I left them in for consistency of the workflow only (one should remove them from a final report).

           age      sex      bmi children   smoker   region

count 1332.000 1332.000 1332.000 1332.000 1332.000 1332.000

mean    39.207    0.504   30.659    1.096    0.798    1.484

std     14.046    0.500    6.095    1.207    0.402    1.106

min     18.000    0.000   15.960    0.000    0.000    0.000

25%     26.750    0.000   26.309    0.000    1.000    1.000

50%     39.000    1.000   30.380    1.000    1.000    1.000

75%     51.000    1.000   34.681    2.000    1.000    2.000

max     64.000    1.000   53.130    5.000    1.000    3.000

Correlations of the features with the target variable (in absolute values):

                              Correlations with the target

smoker                         0.660

age                                0.535

children                       0.133

bmi                               0.118

region                          0.046

sex                                0.007

Spearman correlations between the features:

         age   sex   bmi children    smoker region

age      NaN -0.022 0.107    0.056 2.823e-02 -0.002

sex      NaN   NaN 0.042    0.017 -7.290e-02 -0.003

bmi      NaN   NaN   NaN    0.018 -7.125e-04 -0.150

children NaN   NaN   NaN      NaN -1.632e-02 -0.011

smoker   NaN   NaN   NaN      NaN       NaN -0.003

region   NaN   NaN   NaN      NaN       NaN    NaN

************************************************************************

Data split: training set size 70.0%

Validation set size out of the data left out from the first split: 20.0%

Test set size out of the data left out from the first split: 80.0%

Data shapes: Training set - (932, 6) / Validation set - (80, 6) / Test set - (320, 6)

********K-Fold Cross-validation score results on the basic model (5 folds)*******

Scores (neg_mean_absolute_error)                   : [0.237, 0.205, 0.23, 0.239, 0.195]

Mean score (neg_mean_absolute_error)           : 0.221

******************Hyperparameters tuning ['kernel', 'C'] *******************

ALL PARAMS:

-0.31089 (+/-0.029 std) for {'C': 0.1, 'kernel': 'linear'}

-0.26476 (+/-0.035 std) for {'C': 0.1, 'kernel': 'rbf'}

-0.30845 (+/-0.031 std) for {'C': 1, 'kernel': 'linear'}

-0.22108 (+/-0.018 std) for {'C': 1, 'kernel': 'rbf'}

-0.30853 (+/-0.031 std) for {'C': 10, 'kernel': 'linear'}

-0.22288 (+/-0.018 std) for {'C': 10, 'kernel': 'rbf'}

CHOSEN PARAMS BASED ON BEST SCORES:

Params model 1: C = 1 , kernel = rbf

Params model 2: C = 10 , kernel = rbf

Params model 3: C = 0.1 , kernel = rbf

Evaluation of the performance of the candidate models on the validation set:

Kernel: rbf / C: 1 -- Accuracy: 0.8026056925533376 / MSE: 0.17204022688220788 / Latency: 2.0ms

Kernel: rbf / C: 10 -- Accuracy: 0.8029337378485801 / MSE: 0.17175431698059435 / Latency: 3.0ms

Kernel: rbf / C: 0.1 -- Accuracy: 0.7335429841312396 / MSE: 0.23223225663082667 / Latency: 3.0ms

Model chosen: 2

Evaluation of the performance of the best model on the test set:

Kernel: rbf / C: 10 -- Accuracy: 0.8267899148382883 / MSE: 0.19888166138867128 / Latency: 12.0ms
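
Note that SVR's score method returns R-squared, which is presumably what the "Accuracy" figures above correspond to (the prose quotes them as R-squared). Here is a sketch of the final evaluation step, assuming the insurance features and the transformed charges are split into training and test sets as in Part 1:

# Sketch: evaluate the chosen RBF-kernel SVR on the test set
# (X_train, y_train, X_test, y_test are assumed to hold the scaled insurance
# features and the transformed charges).
import time
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

model = SVR(kernel="rbf", C=10).fit(X_train, y_train)

start = time.time()
y_test_pred = model.predict(X_test)
latency_ms = (time.time() - start) * 1000

print(f"Kernel: rbf / C: 10 -- "
      f"R-squared: {round(model.score(X_test, y_test), 3)} / "
      f"MSE: {round(mean_squared_error(y_test, y_test_pred), 3)} / "
      f"Latency: {round(latency_ms, 1)}ms")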
