SVM
SVM has been quite a hot topic on the e-discovery scene recently. The reason is that Relativity has introduced Active Learning, which is based on SVM, and now everybody knows it works especially well for classification. But what on earth is SVM? This blog is meant to take you through it briefly, so you can know something more than just how to set up an active learning project in Relativity :)
Introduction
First of all, SVM is a supervised learning algorithm, meaning it is typically used to classify data, which makes it fundamentally different from LSI. You need labelled data to train an SVM model before it can predict on unseen data. In a nutshell, SVM tries to find the hyperplane that maximizes the separation margin. In other words, there may be many hyperplanes that can separate your data, but SVM tries to find the best one for you! Nearly every SVM tutorial explains what a hyperplane and a margin are, so I won't discuss them much here. The basic form is linear SVM, which is widely used in text classification. However, sometimes your data cannot be linearly separated in its current space. If that happens, you can use a kernel function to map your data into a higher-dimensional space where it can be linearly separated. The linear hyperplane in the higher-dimensional space corresponds to a non-linear boundary in the original space. This is called the kernel trick. I know this might not make sense intuitively; hopefully the graphs quoted from the scikit-learn website can bring you some intuition.
The graphs below show how the iris data set is plotted in 2D and 3D space. There are three types of iris flower. As you can see, two of them (orange and grey) are not linearly separable in 2D space; after mapping the data into 3D space, however, they can be linearly separated by a plane. That said, the kernel trick is not often used in text classification: with text you usually start with thousands of features already, and mapping into a slightly higher-dimensional space rarely brings any significant gain while dramatically increasing computing time, so it does not scale.
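If you want to poke at this yourself, here is a minimal sketch (assuming scikit-learn is installed) that compares a linear SVM with an RBF-kernel SVM on just two of the iris features, where the classes overlap:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)
X2 = X[:, :2]  # keep only two features, where two of the classes overlap

# Linear SVM in the original 2D space vs. an RBF kernel, which implicitly
# works in a higher-dimensional space.
print('linear:', cross_val_score(LinearSVC(max_iter=10000), X2, y, cv=5).mean())
print('rbf:   ', cross_val_score(SVC(kernel='rbf'), X2, y, cv=5).mean())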
SVM vs LSI
This might be the thing most people are most interested in. We were all happy using LSI (Relativity conceptual analytics) before, and now there is a new tool. Which one works better? When should you use which? Relativity has put up an article about this, but I will try to approach it from a more data-scientist point of view. In one sentence: LSI tries to find similarity, and SVM tries to find dissimilarity. Personally, I don't quite think LSI by itself is a machine learning estimator; it is more of a statistical approach. It uses SVD to linearly transform the feature space into a much lower-dimensional space by forming concepts there, and then you can calculate document similarity using those concepts. SVM tries to find a hyperplane in the current space that separates the data, and with the kernel trick it can find a solution in a higher-dimensional space without explicitly visiting it. To put things into perspective, say you have a lot of documents and you want to tag them as responsive or non-responsive. With LSI, you would sample some documents first and label them manually into two groups, then use LSI to find documents similar to each group based on that manual labelling. With SVM, you do the same manual labelling, then train an SVM model on the labelled data; once training is done, the model generalizes to classify unseen data.
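To make the contrast concrete, here is a toy sketch of both workflows (the documents and labels are made up purely for illustration):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

# Made-up corpus: 1 = responsive, 0 = non-responsive.
docs = ["price forecast for the merger attached",
        "merger due diligence draft report",
        "are we still on for lunch friday",
        "happy birthday, see you at the party"]
labels = [1, 1, 0, 0]
unseen = ["revised merger forecast", "party photos from friday"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
X_new = tfidf.transform(unseen)

# LSI-style: project into a low-dimensional "concept" space with SVD,
# then rank unseen documents by cosine similarity to the labelled ones.
svd = TruncatedSVD(n_components=2)
concepts = svd.fit_transform(X)
print(cosine_similarity(svd.transform(X_new), concepts))

# SVM-style: learn a separating hyperplane from the labels,
# then classify the unseen documents directly.
clf = LinearSVC().fit(X, labels)
print(clf.predict(X_new))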
Regarding which one is better, I would say there is no clear winner; it depends heavily on your data source and what you are trying to do. If the corpus comes from, say, one custodian's mailbox over a small time frame, the concepts in the documents are probably not sparse enough, and using SVM might lower your precision and recall drastically (see the coding example below). Your best bet will be LSI. However, if the collection comes from several custodians' mailboxes over a large time frame and you are looking for something very specific, the concepts are likely very sparse. In this case SVM will give you a great result, plus it is much faster to train than LSI. And if you have no hard deadline to meet, try both!
Example
In this example, I will use SVM to classify the famous 20 newsgroups data set. This collection contains around 20k documents across 20 different newsgroups, and our challenge is to use SVM to learn and classify them correctly. The 20 categories are shown below:
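You can also list the category names straight from scikit-learn:

from sklearn.datasets import fetch_20newsgroups

# Print the 20 newsgroup names we will be classifying.
print(fetch_20newsgroups(subset='train').target_names)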
You can see some of the newsgroups are very close, like comp.sys.ibm.pc.hardware vs comp.sys.mac.hardware, and rec.autos vs rec.motorcycles. Some are very different, like comp groups vs talk groups.
A little background first: the most widely used SVM implementations are Dr. Lin's LIBSVM and LIBLINEAR. LIBLINEAR is especially suitable for document classification because, as I explained earlier, the non-linear kernel trick is rarely used there. In Python's scikit-learn library, there is a module based on LIBLINEAR: sklearn.svm.LinearSVC. There is an alternative estimator which is also a linear SVM but trained with stochastic gradient descent; its advantages are that it requires less memory (so it can handle a larger scale of documents) and it can be set up for online learning. In scikit-learn it is sklearn.linear_model.SGDClassifier. I will write about that module in the future, so in this blog I am using LinearSVC for the demonstration.
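For reference, the two estimators are constructed like this (default settings, just to show where they live):

from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# LIBLINEAR-backed linear SVM, trained in batch mode.
clf_batch = LinearSVC(C=1.0)

# Linear SVM trained with stochastic gradient descent; it supports
# partial_fit, so it can be run in an online/incremental fashion.
clf_online = SGDClassifier(loss='hinge')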
Fortunately, because the 20 newsgroups data set has been used and studied so much, there is already a vectorized version of it in the scikit-learn library, so we don't need to process the documents for feature engineering ourselves. The out-of-the-box version has had noise words removed and TF-IDF applied. If you are interested in designing your own feature engineering step, for example including n-grams, that is also very easy to do. For this demo I will use the out-of-the-box version. After a simple call, we have it ready to be fed into our SVM estimator.
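The simple call looks like this:

from sklearn.datasets import fetch_20newsgroups_vectorized

# Pre-vectorized training split, ready to feed into the estimator.
train = fetch_20newsgroups_vectorized(subset='train')
print(train.data.shape)  # (11314, 130107)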
There are 11,314 documents and 130,107 features. Notice the number of samples is far smaller than the number of features, which happens a lot in text classification. LinearSVC comes with many hyper-parameters you can set, and understanding them is a crucial step for any data scientist, because it saves you from trying things you already know are not going to work.
In a nutshell, SVM solves the unconstrained optimization problem below.
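Roughly, in LIBLINEAR's L2-regularized form it can be written as

\min_{w} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \max\left(0,\, 1 - y_i\, w^{T} x_i\right)^{p}

where p = 1 gives the L1 (hinge) loss, p = 2 gives the L2 (squared hinge) loss, and C is the penalty parameter.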
I will walk through how to set the parameters. 1) We need to pick a solver. There are two available: dual and primal. Dual is the default, and we should only try primal when training is very slow or the number of samples is much larger than the number of features. 2) If picking the solver is easy, picking the regularization function (L1 vs L2), the regularization parameter (C), and the loss function (L1 vs L2) is not easy at all; you need to understand them to make the right choice. Regularization is there to prevent over-fitting. Generally speaking, in terms of the loss function, L2 penalizes the cost function more when outliers are present, while L1 is robust to outliers and still fits most of the data well. A graph for the 2D case is shown below.
Regarding regularization, L1 has built-in feature selection because it shrinks some coefficients to 0 and thus produces a simpler model; L2 is more computationally efficient because it has an analytical solution. Below is a graph for the 2D case.
For most cases in the LIBLINEAR library, L1 regularization (aka lasso) does not give higher accuracy but is slower to train (here). For this reason, L1 regularization + L1 or L2 loss + dual solver is not implemented in the LIBLINEAR package. In practice, you will very rarely want to use L1 regularization.
There is also the penalty parameter C, which is quite important. Generally, once C is larger than a certain value, it brings little improvement in precision and recall but increases training time. In other words, we should not set it too high during our grid search.
Regarding the details of how to choose and set parameters in LinearSVC estimator, please check out this paper.
Code
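In essence, the code boils down to something like the sketch below (a minimal version; the exact script may differ):

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Pre-vectorized train/test splits of the 20 newsgroups corpus.
train = fetch_20newsgroups_vectorized(subset='train')
test = fetch_20newsgroups_vectorized(subset='test')

# Grid search over the penalty parameter C and the loss function.
param_grid = {'C': [0.1, 1, 10], 'loss': ['hinge', 'squared_hinge']}
grid = GridSearchCV(LinearSVC(), param_grid, cv=3)
grid.fit(train.data, train.target)
print('best parameters:', grid.best_params_)

# Evaluate the best model on the held-out test split.
pred = grid.predict(test.data)
print('accuracy:', accuracy_score(test.target, pred))
print(confusion_matrix(test.target, pred))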
Let me explain the code example above a little bit. Basically, I ran a grid search over different values of the hyper-parameter C (0.1, 1, 10) and different loss functions ('hinge', 'squared_hinge') to find the best model. The best model is C = 1 and loss = 'squared_hinge'. The accuracy is 70.476% on average for classification across the 20 classes.
The accuracy might not sound very impressive, but considering we are classifying 20 classes and some of them are really close, like pc.hardware versus mac.hardware, it is actually not bad. More detailed insight can be found in the confusion matrix below:
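Something like the following will draw both the raw and the normalized matrices (a sketch continuing from the grid-search snippet above):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(22, 10))
# Raw counts on the left, row-normalized percentages on the right.
ConfusionMatrixDisplay.from_predictions(
    test.target, pred, display_labels=test.target_names,
    ax=ax1, xticks_rotation='vertical')
ConfusionMatrixDisplay.from_predictions(
    test.target, pred, display_labels=test.target_names,
    ax=ax2, xticks_rotation='vertical', normalize='true', values_format='.2f')
plt.tight_layout()
plt.show()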
We are used to seeing confusion matrices with 2 classes, where you can directly read off true positives, true negatives, and so on. This 20-class confusion matrix is a little scary to look at, but fundamentally it is the same as the 2-class case. The diagonal numbers are the counts of true positives for each class; in the normalized graph they become the percentage of true positives, and our goal is for these to be as high as possible. The other numbers in each row are false negatives, broken down by the remaining 19 classes, and the other numbers in each column are false positives, broken down by class. For example, for the rec.sport.hockey group we correctly classify 343 instances out of 399 in our test set. The accuracy is 86%, which shows up in the normalized matrix as 0.86, and that is not bad. If we look closely, we can see the majority of false negatives fall into rec.sport.baseball, with 26 instances. We also see there are no instances of hockey mis-classified as mac.hardware or windows.x, and no instance of windows.x mis-classified as hockey either. So if we just needed to build a model to classify hockey versus windows, we would be doing a terrific, nearly perfect job. I say nearly because this is only the test set, which is a subset of all the data, so we don't know for certain that it holds everywhere, but we can still draw a fairly safe conclusion. On the contrary, if we just needed to classify hockey versus baseball, we would have fairly high false positive and false negative counts.
How to improve the model? Well, theoretically a non-linear SVM can handle cases where the linear model fails, but it dramatically increases training time, so in practice it is rarely used for text classification because of the high number of features and samples. There are also ways to generate better document vectors, for example by including n-grams, to increase performance.
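For example, building the document vectors yourself with bigrams included is only a few lines (a sketch, not something benchmarked in this post):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

raw_train = fetch_20newsgroups(subset='train')
raw_test = fetch_20newsgroups(subset='test')

# TF-IDF over unigrams and bigrams, then a linear SVM on top.
model = make_pipeline(
    TfidfVectorizer(stop_words='english', ngram_range=(1, 2)),
    LinearSVC(C=1))
model.fit(raw_train.data, raw_train.target)
print('test accuracy:', model.score(raw_test.data, raw_test.target))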
Conclusion
SVM is a very powerful classification tool. Using LinearSVC, we are able to handle a multi-class situation, and the performance is very good. However, SVM works better when you are trying to classify two topics that are quite different, for example hockey versus hardware; it will be less accurate when the topics are already similar, for example hockey versus baseball. With SVM being introduced into Relativity, e-discovery folks now have another tool to perform classification, but keep its limitations in mind; that will save you a lot of time.