Implementation of Support Vector Machines in R

In our previous article, ‘Beginners Guide to SVM’, we discussed the concepts and basics one needs to know to learn Support Vector Machines.

Now let us talk about how to implement it in R, explaining it with a simple hands-on example as below.

The data set we use here is called iris. It comes built into R and is widely used for practising classification problems (it has three species, so it is a multiclass classification problem).
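For readers who want to look at the data first, here is a quick inspection with base R. This is only a side note, not part of the original walkthrough.

data(iris)           # load the built-in dataset
str(iris)            # 150 rows: four numeric predictors and the factor Species
table(iris$Species)  # 50 observations for each of the three species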

Saving the values of the predictor variables in x

x <- subset(iris, select = -Species)

Saving the values of the dependent variable Species in y

y <- iris$Species

The package e1071 should be installed with: install.packages("e1071").

We load the package before using the functions available in it.

library(e1071)

 Saving the svm model in svm_model

svm_model <- svm(Species ~ ., data = iris)

 The summary of the model shows the default values of the parameters of svm.

summary(svm_model)

Now we predict the values of y with the predict() function and save them in pred1.

pred1 <- predict(svm_model, x)

The confusionMatrix() function is available in the caret package and is used to check the accuracy of the model by comparing the predicted values with the original values. We find from the summary below that we have achieved an accuracy of 97.33% and that there is misclassification in both versicolor and virginica.

library(caret)

confusionMatrix(pred1, y)

PARAMETER TUNING

TYPE

SVM models can be classified into four distinct groups:

  • C-classification
  • nu-classification
  • epsilon-regression
  • nu-regression

The C and nu regularization parameters impose a penalty on the errors made while separating the classes, which helps improve the accuracy of the output. C ranges from 0 to infinity, which can make it a bit hard to estimate and use. nu is a reparameterization that operates between 0 and 1, and its advantage is that it gives control over the number of support vectors.
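As an illustration, here is a minimal sketch of switching from the default C-classification to nu-classification in e1071; the value nu = 0.1 is only an example, not a recommendation.

# Sketch: nu-classification instead of the default C-classification
library(e1071)
svm_nu <- svm(Species ~ ., data = iris, type = "nu-classification", nu = 0.1)
summary(svm_nu)      # note the number of support vectors reported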

Kernel

The kernel parameter selects the type of hyperplane used to separate the data. The ‘linear’ kernel uses a linear hyperplane, while the ‘radial’ and ‘polynomial’ kernels are used for non-linear hyperplanes.

The best way to determine the kernel is to check which one works well for the data. The linear kernel will work fine if the dataset is linearly separable; however, if the dataset is not linearly separable, a linear kernel will lead to a high number of misclassifications.

The radial kernel’s decision boundary is also linear, but in a transformed space: the radial kernel creates non-linear combinations of the features to lift the samples into a higher-dimensional feature space, where a linear decision boundary can separate the classes.

Now another question is, what if both a linear and a nonlinear kernel work equally well on a dataset?

In that case we choose the simpler linear SVM: firstly, the linear kernel is a parametric model; secondly, the complexity of the radial kernel grows with the size of the training set, so it is more expensive to train.
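As a rough illustration only (a sketch that uses training accuracy rather than a proper validation split), the two kernels can be compared side by side:

# Sketch: fit the same model with a linear and a radial kernel
library(e1071)
svm_linear <- svm(Species ~ ., data = iris, kernel = "linear")
svm_radial <- svm(Species ~ ., data = iris, kernel = "radial")
mean(predict(svm_linear, iris) == iris$Species)   # training accuracy, linear kernel
mean(predict(svm_radial, iris) == iris$Species)   # training accuracy, radial kernel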

Gamma

Gamma is a parameter for non-linear hyperplanes. The higher the gamma value, the harder the model tries to fit the training data set exactly. When gamma is very small, the model cannot capture the complexity of the data: the region of influence of any selected support vector includes the whole training data set, i.e. points far away from the separation line. When gamma is too large, the radius of the area of influence of a support vector includes only points very close to the support vector itself, and very high gamma values therefore tend to overfit the training data. So choosing gamma at an intermediate level is desirable.
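A minimal sketch of how gamma changes the fit is given below; the gamma values are only illustrative, and training accuracy is used just to show the trend.

# Sketch: refit a radial-kernel SVM with a few different gamma values
library(e1071)
for (g in c(0.01, 0.1, 10)) {
  m <- svm(Species ~ ., data = iris, kernel = "radial", gamma = g)
  cat("gamma =", g, "training accuracy =",
      mean(predict(m, iris) == iris$Species), "\n")
}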

Cost

Cost controls the trade-off between the width of the margin and the accuracy of the model on the training data: it tells the SVM how much misclassification to avoid. For large values of cost, the optimizer chooses a smaller-margin hyperplane that does a better job of classifying all the training points correctly. A smaller value of cost leads to a larger-margin hyperplane which misclassifies more points.
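A minimal sketch of the effect of cost is given below; the values are only illustrative, and nrow(m$SV) simply counts the support vectors of each fitted model.

# Sketch: refit with a few different cost values
library(e1071)
for (C in c(0.01, 1, 100)) {
  m <- svm(Species ~ ., data = iris, kernel = "radial", cost = C)
  cat("cost =", C, "support vectors =", nrow(m$SV),
      "training accuracy =", mean(predict(m, iris) == iris$Species), "\n")
}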

Degree

Degree is a parameter used when the kernel is set to ‘polynomial’. It is the degree of the polynomial used to find the hyperplane that splits the data.
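A minimal sketch of setting the degree for a polynomial kernel is given below; degree = 2 is only illustrative (the default in e1071 is 3).

# Sketch: polynomial kernel with an explicit degree
library(e1071)
svm_poly <- svm(Species ~ ., data = iris, kernel = "polynomial", degree = 2)
summary(svm_poly)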

Now we use the tune() function to get the best values of the parameters for the given dataset.

svm_tune <- tune(svm, train.x = x, train.y = y,
                 kernel = "radial",
                 ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.1, 1, 10, 100)))

After printing svm_tune we find that the best value for cost is 10 and the best value for gamma is 0.1.

print(svm_tune)

We save the tuned model in svm_model_after_tune

svm_model_after_tune <- svm(Species ~ ., data = iris, kernel = "radial", cost = 10, gamma = 0.1)

Predicting the values of y with the predict() function and saving them in pred2.

pred2 <- predict(svm_model_after_tune, x)

From the updated confusion matrix and the accuracy of the model (given below), it is clear that the model has improved after a bit of parameter tuning. The accuracy is now 98% and the misclassification is confined to virginica.

confusionMatrix(pred2, y)

Parameter tuning does not end here. The svm() function has many more features; for more information, see the links given below:

  1. https://www.rdocumentation.org/packages/e1071/versions/1.7-0/topics/svm
  2. https://www.rdocumentation.org/packages/e1071/versions/1.7-0/topics/tune

You can also use the SVM algorithm from the caret package. Here is the link to the functions available in the caret package:

http://topepo.github.io/caret/train-models-by-tag.html#support-vector-machines

SVM is a very useful technique for nonlinear and high-dimensional data, but it has some drawbacks. Firstly, it is quite complex, and people often prefer simpler, more easily interpretable models. Secondly, the correct choice of kernel and parameters is crucial for obtaining good results, which means extensive experimentation with the parameters is needed before using the model. Furthermore, a tuned model may give excellent classification accuracy on one problem but poor classification accuracy on another.

This article has been contributed by our student Saptarshi Mukherjee.
