SVMs - Part I (Theory)
Support Vector Machines have always been one of the algorithms that I have struggled with. I recently took the course Introduction to Analytics Modeling from Georgia Tech through EdX and these are my notes for Support Vector Machines.
I am going to start by creating some data.
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import pandas as pd
# create dataset
data, labels = make_blobs(n_features=2, centers=2, cluster_std=1.25,
                          random_state=11, n_samples = 10)
df = pd.DataFrame({'Credit Score': data[:, 0], 'Income': data[:, 1],
                   'Target': labels})
with plt.style.context('fivethirtyeight'):
    fig, ax = plt.subplots(figsize = (8, 8))
    df[df['Target'] == 0].plot(x = 'Credit Score', y = 'Income', color = 'red',
                               kind = 'scatter', ax = ax,
                               s = 100, label = 'default')
    df[df['Target'] == 1].plot(x = 'Credit Score', y = 'Income', color = 'blue',
                               kind = 'scatter', ax = ax,
                               s = 100, label = 'repaid')
I now have some data. I am going to try to predict whether someone will repay their loan or default on it based on their credit score and income. The red dots are people that defaulted on their loans and the blue dots are people that repaid their loans. There is a clear separation between the 2 classes, so how can I figure out what line will best separate the data? This is where SVMs come in handy, but before that I need to declare some variables:
- \(n\) = number of data points
- \(m\) = number of attributes
- \(x_{ij}\) = \(i\)th attribute of the \(j\)th data point
- \(x_{1j}\) = credit score of person j
- \(x_{2j}\) = income of person j
- \(y_j\) = response of data point j
- \(y_j\) = 1, if data point is blue (repaid)
- \(y_j\) = -1, if data point is red (defaulted)
A line is defined by a coefficient \(a_i\) for each attribute (\(a_1\) through \(a_m\)) and an intercept \(a_0\) - this can be written as:
\(a_1x_1 + a_2x_2 + ... + a_mx_m +a_0 = 0\) or
\begin{align} \sum_{i=1}^m a_ix_i + a_0 = 0 \end{align}
Parallel lines are 2 lines that have the same coefficients \(a_1\) through \(a_m\) and different intercepts \(a_0\). Below I am going to draw 2 parallel lines that separate the red and blue points. The line in the middle will be our classifier.
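Since I have not actually solved for the coefficients yet, here is a minimal sketch of how such lines could be drawn on top of the scatter plot. The coefficients a1, a2, a0 below are hand-picked placeholders rather than values fitted to this data, so the exact line positions are illustrative only.
import numpy as np
# hypothetical coefficients - placeholders only, not fitted to the data
a1, a2, a0 = 1.0, 1.0, -10.0
x1 = np.linspace(df['Credit Score'].min(), df['Credit Score'].max(), 100)
with plt.style.context('fivethirtyeight'):
    fig, ax = plt.subplots(figsize = (8, 8))
    df[df['Target'] == 0].plot(x = 'Credit Score', y = 'Income', color = 'red',
                               kind = 'scatter', ax = ax, s = 100, label = 'default')
    df[df['Target'] == 1].plot(x = 'Credit Score', y = 'Income', color = 'blue',
                               kind = 'scatter', ax = ax, s = 100, label = 'repaid')
    # a1*x1 + a2*x2 + a0 = c, solved for x2; c = 1 and c = -1 give the parallel
    # lines and c = 0 gives the classifier line in the middle
    for c, style in [(1, '--'), (0, '-'), (-1, '--')]:
        ax.plot(x1, (c - a0 - a1 * x1) / a2, style, color = 'grey')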
We want to find values of \(a_0\) through \(a_m\) that do 2 things:
- Classify the points on the correct side of the line
- Maximize the gap between the 2 parallel lines
To classify the points correctly, we need all the blue points to have:
\(a_1x_{1j} + a_2x_{2j} + ... + a_mx_{mj} + a_0 \geq 1\)
All red points need to have:
\(a_1x_{1j} + a_2x_{2j} + ... + a_mx_{mj} + a_0 \leq -1\)
Since we previously defined \(y_j\) = 1 for the blue points and \(y_j\) = -1 for the red points we can combine those 2 formulas and get the following for all points:
\((a_1x_{1j} + a_2x_{2j} + ... + a_mx_{mj} + a_0)y_j \geq 1\)
This works because the red points (defaults) previously needed values less than or equal to -1; multiplying by \(y_j = -1\) flips the sign, so for correctly classified red points the product is also greater than or equal to 1.
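To make the combined constraint concrete, here is a small check (again using hand-picked placeholder coefficients, not fitted values) that evaluates \((a_1x_{1j} + a_2x_{2j} + a_0)y_j\) for every point in the DataFrame:
import numpy as np
a = np.array([1.0, 1.0])                   # placeholder a_1, a_2
a0 = -10.0                                 # placeholder intercept
X = df[['Credit Score', 'Income']].to_numpy()
y = np.where(df['Target'] == 1, 1, -1)     # blue (repaid) = 1, red (default) = -1
margins = (X @ a + a0) * y
print(margins >= 1)                        # True wherever the constraint is satisfied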
The formula for the distance between 2 parallel lines is
\(\frac{2}{\sqrt{\sum_{i=1}^m a_i^2}}\)
So, if we can minimize \(\sum_{i=1}^m a_i^2\), then we can maximize the margin: in the formula above, making the denominator smaller makes the fraction, and therefore the gap between the lines, larger.
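A quick sanity check of that relationship, sticking with the placeholder coefficients:
import numpy as np
a = np.array([1.0, 1.0])              # placeholder coefficients
print(2 / np.sqrt(np.sum(a ** 2)))    # width of the gap; halving a doubles the width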
Our overall formula is:
Minimize \(\sum_{i=1}^m a_i^2\)
Subject to \((a_1x_{1j} + a_2x_{2j} + ... + a_mx_{mj} + a_0)y_j \geq 1\) for every data point \(j\)
We need to find values of \(a_0\) through \(a_m\) such that the margin is the greatest, but we can only choose from among values of the coefficients that correctly separate all the points.
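To see the whole optimization in one place, here is a sketch that hands the problem to scipy.optimize.minimize. This is for illustration only (real SVM implementations use specialized solvers), and it only works here because the data above is linearly separable.
import numpy as np
from scipy.optimize import minimize

X = df[['Credit Score', 'Income']].to_numpy()
y = np.where(df['Target'] == 1, 1, -1)

def objective(v):
    # v = [a_1, a_2, a_0]; minimize the sum of squared coefficients
    return np.sum(v[:2] ** 2)

def constraint(v):
    # (a_1*x_1j + a_2*x_2j + a_0) * y_j - 1 >= 0 for every data point j
    return (X @ v[:2] + v[2]) * y - 1

result = minimize(objective, x0=np.zeros(3),
                  constraints=[{'type': 'ineq', 'fun': constraint}])
print(result.x)   # a_1, a_2, a_0 of the maximum-margin line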
Soft Classifier
Imagine a scenario where no line can meet all of the above standards. Then I would want to use what is known as a soft classifier. Let's start by making some data.
# create dataset
data, labels = make_blobs(n_features=2, centers=2, cluster_std=11,
                          random_state=11, n_samples = 10)
df = pd.DataFrame({'Credit Score': data[:, 0], 'Income': data[:, 1],
                   'Target': labels})
with plt.style.context('fivethirtyeight'):
    fig, ax = plt.subplots(figsize = (8, 8))
    df[df['Target'] == 0].plot(x = 'Credit Score', y = 'Income', color = 'red',
                               kind = 'scatter', ax = ax,
                               s = 100, label = 'default')
    df[df['Target'] == 1].plot(x = 'Credit Score', y = 'Income', color = 'blue',
                               kind = 'scatter', ax = ax,
                               s = 100, label = 'repaid')
In the above instance, we cannot draw a straight line that separates the red and blue points without violating \((a_1x_{1j} + a_2x_{2j} + ... + a_mx_{mj} + a_0)y_j \geq 1\) for some point.
If a point is on the correct side of the line, it satisfies:
\((a_1x_{1j} + a_2x_{2j} + ... + a_mx_{mj} + a_0)y_j \geq 1\)
However, if a point is on the wrong side of the line:
\((a_1x_{1j} + a_2x_{2j} + ... + a_mx_{mj} + a_0)y_j - 1 < 0\)
The amount by which that quantity falls below zero is the amount of error: the farther a wrongly classified point is from the line, the bigger the mistake we have made. To calculate the error for data point \(j\):
\(\max\left\{0, 1 - \left(\sum_{i=1}^m a_ix_{ij} + a_0\right)y_j\right\}\)
A data point either satisfies the constraint, with \((\sum_{i=1}^m a_ix_{ij} + a_0)y_j \geq 1\), in which case \(1\) minus that value is zero or negative and the max gives an error of 0, or it violates it, in which case \(1\) minus that value is positive and becomes the error.
The total error can be written as:
\(\sum_{j=1}^n \max\left(0, 1 - \left(\sum_{i=1}^m a_ix_{ij} + a_0\right)y_j\right)\)
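As a small sketch, the total error can be written as a function; the coefficient values passed in are whatever the caller wants to evaluate:
import numpy as np

def total_error(a, a0, X, y):
    # sum of max(0, 1 - (a . x_j + a_0) * y_j) over all data points
    return np.sum(np.maximum(0, 1 - (X @ a + a0) * y))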
From earlier, to maximize the margin we minimize this formula:
\(\sum_{i=1}^m a_i^2\)
To trade off between maximizing the margin and getting values correct we can use the following formula:
\(\sum_{j=1}^n \max\left(0, 1 - \left(\sum_{i=1}^m a_ix_{ij} + a_0\right)y_j\right) + \lambda\sum_{i=1}^m a_i^2\)
As \(\lambda\) gets large, this part of the formula - \(\lambda\sum_{i=1}^m a_i^2\) - gets large and thus the importance of a large margin outweighs avoiding mistakes.
As \(\lambda\) gets closer to 0, this part of the formula - \(\lambda\sum_{i=1}^m a_i^2\) - gets close to zero and minimizing mistakes outweighs having a large margin.
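To tie it together, here is a sketch that minimizes the combined formula directly for one choice of \(\lambda\). The \(\lambda\) value is an arbitrary placeholder, and a derivative-free method is used because the max makes the objective non-smooth.
import numpy as np
from scipy.optimize import minimize

X = df[['Credit Score', 'Income']].to_numpy()
y = np.where(df['Target'] == 1, 1, -1)
lam = 1.0                                  # placeholder trade-off value, not tuned

def objective(v):
    a, a0 = v[:2], v[2]
    error = np.sum(np.maximum(0, 1 - (X @ a + a0) * y))   # total error
    return error + lam * np.sum(a ** 2)                   # error + lambda * margin term

result = minimize(objective, x0=np.zeros(3), method='Nelder-Mead')
print(result.x)   # a_1, a_2, a_0 balancing mistakes against margin width
For what it's worth, sklearn's LinearSVC solves essentially this same problem; its C parameter acts roughly like \(1/\lambda\), so a small C favors a wide margin while a large C favors fewer misclassifications.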