Machine Learning Summary
from the Stanford AI Class video sets #5&6
Michael Schippling -- Oct 27, 2011


    Machine Learning extracts information from raw data

Introduction and Terminology
Bayes Network Classifiers
Linear Regression
Clustering
Perceptrons
K Nearest Neighbor
K-Means
Expectation Maximization and Gaussians


Introduction and Terminology

Machine Learning uses a Training Set of example inputs and attempts to build a model which will predict the output when given a new Input.

Why?

What it looks for

How it interacts

What it does

Types


Models

The object of Machine Learning is to build a useful model. If the model is too specific it will Overfit -- it will be too specific to the Training Data and may not work correctly with new data. If the model is not specific enough it will Underfit and fail to work well on the Training Data. There is a trade off between the complexity of the model and its error rate which needs to be minimized. Here the red curve shows the trade off :
overfitting

Data Sets

The Training Set is (usually) a large list of data vectors or sets of values:

{X1, X2, ... Xn}

each of which represent a condition of the system under consideration. In Supervised Learning these vectors will be accompanied by a result or Target Label value:

{X1, X2, ... Xn} -> Y

where Y may be a classification, e.g. Black or White, or a continuous value, e.g. 98.6 degrees.

In order to build and test a model the full data set can be divided into three sub-sets using these approximate percentages:
Training -- 80% -- Used for the actual model building;
Cross-validation -- 10% -- Used to test the complexity of the trained model, i.e., how overfitted is it?
Test -- 10% -- Used for the final model validation.
The Training and Cross-validation sets can be mixed and re-divided in order to do multiple training runs using different model-building parameters, but the Test set should be independent in order to minimize final errors.


Bayes Network Classifiers

A Classifier tries to put data into classes that have common attributes. A simple example would be to decide if an image pixel was Black or White. The Video lesson presented the example of deciding if an email message was Spam or Ham... To do this we take a set of messages which have already been correctly classified and then look at the frequency of word use in each set -- under the assumption that Spam uses words differently than real messages. This is called Bag of Words.

Maximum Likelihood

The initial approach is to determine the probability of each word (W) occurring in the two sets, Spam and notSpam (S, ~S) and then calculate the probability that the words in a new message (M) are in either set. To do this:
  1. Calculate P(S) and P(~S), the probability of S and ~S messages: messages-in-the-set / total-number-of-messages;
  2. Find the unique set of words used in the full input corpus: this makes a dictionary or a set of word classes to count;
  3. Count the use of each word in each message set: how many times is the word used in S and ~S?
  4. Calculate P(Wn|S), the probability that each word is in each set: words-in-the-set / total-words-in-the-set;
  5. Do the Bayesian thing to calculate whether a message is Spam or not: P(S|M) = P(W1|S) * P(W2|S) * P(S) / P(M)

Laplace Smoothing with K=N

With the naive Maximum Likelihood model we quickly find that there is overfitting due to under-sampling of some words in the corpus: if a word doesn't appear in the Spam category it will automatically prevent any message containing that word from being classified as Spam. A solution is to add a constant value, K, to the count of each word in each set, and then re-calculate their probabilities. This is called Laplace Smoothing with K=N (note that this adds K to every message and word count). The example adds K=1 to every class and set in the model, and has the effect of less-horribly miss-classifying low probability messages. It Smooths Out the distributions.

Example

A spreadsheet for the Spam example (videos 5.7-5.17) can be viewed in your browser here, or get the original here.


Linear Regression

Regression attempts to find a function that will model a data set that has continuous values. A non-simple example would take the input vector X: {todays-temperature, todays-humidity, TulsaOK-temperature, big-toe-pain} and try to predict the Target Y: {tomorrows-rainfall}. Regression functions are generally polynomials and the job of the Machine Learner is to estimate the W parameters so as to minimize the errors in the model:

Yi = W0 + W1*Xi + W2*Xi2 + ...

Note that W and X are vectors of values, where the number of values is the dimension of the system, and the * operations are vector cross-products. The degree of the polynomial is the maximum power to which X is taken. Fortunately, for this class, we only deal with single dimension variables -- good'ole numbers -- and degree 1 or Linear polynomials -- good'ole straight lines -- so the functions we want have this form:

Yi = W0 + W1*Xi

What we are trying to do is draw a line through the data set that minimizes the total distance between the line and each data point, as shown here:
regression
where W0 is the Y Intercept (where our line crosses the vertical axis at X=0) and W1 is the Slope. It turns out to be "easy" to do (modulo a lot of algebra and derivative taking) using a Quadratic Loss equation that minimizes the squared error over all the input points. Given a data set of x,y pairs, the optimal Linear Regression W parameters can be calculated with these equations -- where Xj and Yj are each x,y pair in the data set and N is the number of pairs:
linear W equations

Example

A spreadsheet for the video 5.24 example can be viewed in your browser here, or get the original here.

Problems with Linear Regression

Some of these problems can be addressed with Smoothing algorithms, such as a Logistic Function or Regularization which tend to pull the parameter values away from extremes. But once these functions become more complicated they don't have closed form solutions and have to be solved iteratively -- by trying different values to see how well they work. One method for this is Gradient Descent -- a search down a slope for a minimum. This can have problems with finding local solutions rather than the global minimum, but sometimes that's better than nothing.


Clustering

Clustering is a way of classifying continuous data points into sub-sets that are "near" each other, or have similar attributes.

The simplest version is Linear Separation -- drawing a line between two groups. If the data points are already labeled with their class, i.e., Supervised Learning with the outputs specified, this can be as "easy" as finding a line that lies between the two groups, assuming that the groups are Linearly Separable -- they are not mixed together in the input space -- to start with. If the data is not labeled then it is Un-Supervised Learning, and the job is to find a separator based on how close each group's points are to each other -- assigning data points to particular groups in the process.

linear separation

The divisions between clusters are called Voroni Graphs or Fronts.

 There are many clustering algorithms. A few are outlined below.

Perceptrons

A Perceptron is a simple model of a biological neuron. It updates its internal parameter weights in an iterative cycle to reduce the errors in its output using Gradient Descent to find a minimum. For the case of a Linear Separator, the parameters are the W0 and W1 coefficients of the line equation -- just as for Linear Regression. The Perceptron update rule also uses the same technique as regression to find minima, it subtracts a bit of the error from the parameter and tries again:

Wi = Wi + a * (y - f(x))

Wi is the parameter of interest and (y - f(x)) is the difference between a data point's actual Y value and the value "predicted" by the function f(x) using the current Wi value, i.e., the error in the prediction. a -- alpha -- is a small Learning Rate coefficient, often in the range of .001, which controls how much of the error feeds-back into Wi on each update cycle.

A Perceptron is guaranteed to converge to a Linear Separator, if one exists. However, it can find one of many possible separators and may not find the optimal one. To improve on this we want to maximize the margin -- distance -- between the separator and the data points, as shown on the left in the Clustering image. One way to do this is with a Support Vector Machine (SVM) which uses a Quadratic Programming optimization to generalize the error calculations. Using the Kernel Trick an SVM can also modify its internal model to use higher order polynomials and thus find non-linear cluster separators, e.g., circles.

K Nearest Neighbor

The KNN algorithm does pretty much what it says: For a new data point, find the number=K neighbors which are closest in the input space and use them to classify the new point. For K=1 this is just finding the nearest classified point and using it's value. For K>1 you find that many neighbors and average their values.

KNN

When building the models one can use cross-validation to find a good K because there's a tradeoff between the complexity of the data and the goodness of the fit. Using a large K gives a smoother result but may mis-classify outliers. There are also problems searching in large data sets, but these can be fixed with tree or hash data structures. There are worse problems with high dimensional data because it is exponentially more likely that a point will be very distant from others.

K-Means

The K-Means algorithm is an un-supervised clustering tool. The K parameter is the number of clusters to find, for instance K=2 will find two clusters that may not have a Linear Separator. It will converge to a locally optimal clustering, perhaps after a few iterations. In general clustering is NPhard so a locally optimal solution is pretty good.

The algorithm is:
  1. Start by guessing number=K cluster centers at random;
  2. For each center find their set of most likely data points by Euclidean distance.
        Each center separates the space and the line between is the "equidistant line";
  3. Find the two points that are at the geometric center of these preliminary clusters;
  4. Move the centers to those points;
  5. Iterate from 2 -- re-assign points to the centers and then re-center until the centers don't move anymore;
  6. If a cluster center becomes empty -- has no data points associated -- then restart with new random selections.
The problems with this are:

Expectation Maximization and Gaussians

We can use Gaussian distributions in place of straight-ahead Euclidean distances when calculating KNN or K-Means separations. This effectively puts non-linearly decreasing probability "fields" around each point, and then we can use the posterior probabilities of all the points combined to decide which centers to associate with. In this case data points are never completely "cut-off" from any center, but some centers have a higher probability of association than others. This smooths the transitions of the centers during the iterated search and also reduces the effect of outlier data points on the results.

Using Gaussians with the K-Means algorithm is called Expectation Maximization and the two algorithms may converge to different data models.