class 6 Unsupervised Learning Oct 25, 2011

1. Unsupervised Learning
just get data with no target labels
M data items with N dimensions
can have clusters of data
the task of unsupervised learning is to recover the number and centers of the clusters

2. Question
data clustered along a line could be represented along that line, so we can do "dimensionality reduction"
see image of Video 2

3. Terminology
data is IID -- independent and identically distributed, drawn from the same distribution
Density Estimation -- recover the underlying probability distribution that generated the data
clustering using mixture models
dimensionality reduction can be applied to find structure in data
Blind Signal Separation -- separate two speakers on one signal using Factor Analysis, where each speaker is a Factor in the joint signal

4. Google Street View and Clustering
can you take billions of images and discover things like trees, cars and stop signs?
humans learn from unlabeled data, can computers do that?
clustering is the basic form of unsupervised learning
K-Means
Expectation Maximization -- probabilistic extension of K-Means

5-6 K-Means Clustering Example & Algorithm
K-Means runs for a fixed value of K, where K is the number of clusters to find (a minimal code sketch appears after section 10)
1. start by guessing cluster centers at random, using K=2 for instance
2. for each center find the data points closest to it by euclidean distance; the centers separate the space, and the line between them is the "equidistant line", making two clusters
3. now move each center to its optimal point, which is the center (mean) of its cluster's data points
4. iterate from 2 -- reassign points and recompute the cluster centers until they don't move
if a cluster center becomes "empty" (has no data points associated) then we need to restart with new random selections
see Video 5 for animation that makes almost no sense
converges to a locally optimal clustering; in general clustering is NP-hard, so a locally optimal solution is good
Problems:
Need to know K -- how many cluster centers are we looking for
Finds local minima
General problem with high dimensionality, like KNN
Lack of mathematical basis

7. Question
get image from video 6 explanation
get image from video 7 explanation
a distant point pulls the cluster center back from the closest ones

8. Question
get image from video 8 explanation

9. Expectation Maximization
algorithm that uses probability distributions
use the "normal" gaussian distribution with mean "mu" at the center and width "sigma"
y = 1 / (sqrt(2*pi) * sigma) * exp( -1/2 * (x-mu)^2 / sigma^2 )
peaks at x = mu and goes to zero exponentially
the exp() argument is a quadratic, and the 1/(sqrt(2*pi)*sigma) factor is a normalizer that makes sure the area underneath sums to 1 ... just like a probability should
for each x one can calculate a density value, effectively the probability of x, but strictly one should use a small interval around x and calculate the area
multi-variate gaussian -- multiple dimensions -- circular rings mark points of equal probability
...look up the equation someplace in any textbook on multi-variate distributions...
we'll use one-dimensional gaussians, but we want to be complete, no?

10. Gaussian Learning
fitting a gaussian to data
gaussian parameters: mu and sigma^2 (the variance)
f( x | mu, sigma^2 ) = y
the mean is the average of the data points -- mu = 1/M sum(j)( x_j ), where M is the number of points
sigma^2 is the average quadratic deviation from the mean -- sigma^2 = 1/M sum(j)( (x_j - mu)^2 )
(see the sketch below)
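A minimal code sketch of the K-Means loop from sections 5-6, referenced there. This is my own illustration, not lecture code: 1-D data for simplicity, and instead of restarting the whole run when a cluster goes empty (as the lecture suggests) it just re-seeds that one center.

import random

def kmeans_1d(xs, k, iters=100):
    centers = random.sample(xs, k)            # step 1: guess K centers at random
    for _ in range(iters):
        # step 2: assign each point to its nearest center (euclidean distance)
        clusters = [[] for _ in range(k)]
        for x in xs:
            i = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            clusters[i].append(x)
        # step 3: move each center to the mean of its cluster's points
        new_centers = []
        for i in range(k):
            if clusters[i]:
                new_centers.append(sum(clusters[i]) / len(clusters[i]))
            else:
                new_centers.append(random.choice(xs))   # empty cluster: re-seed at random
        # step 4: iterate until the centers stop moving
        if new_centers == centers:
            break
        centers = new_centers
    return centers

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
    print(sorted(kmeans_1d(data, k=2)))       # roughly [1.0, 5.0] for this data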
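A minimal sketch of fitting a 1-D gaussian as in section 10, using the 1/M normalizer from these notes; it also evaluates the density formula from section 9. The function names are my own, and the example data is the same as the quiz in section 11.

import math

def fit_gaussian(xs):
    """Return (mu, sigma^2): sample mean and average squared deviation (1/M normalizer)."""
    m = len(xs)
    mu = sum(xs) / m
    sigma_sq = sum((x - mu) ** 2 for x in xs) / m
    return mu, sigma_sq

def gaussian_density(x, mu, sigma_sq):
    """f(x | mu, sigma^2) from section 9."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma_sq) / math.sqrt(2 * math.pi * sigma_sq)

if __name__ == "__main__":
    data = [3, 4, 5, 6, 7]
    mu, sigma_sq = fit_gaussian(data)
    print(mu, sigma_sq)                        # 5.0 2.0
    print(gaussian_density(5, mu, sigma_sq))   # density at the mean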
11. Maximum Likelihood
the mean is the maximum likelihood estimate of a gaussian's mean, and the maximum likelihood estimate of sigma^2 is the average squared deviation from the mean
...watch the video if you want to derive this from first principles...
gives a nice basis for fitting gaussians to data points
Quiz: data points: 3,4,5,6,7
mean: add all the values and divide by the 5 elements = 25/5
variance: subtract the mean (5) from each value: -2 -1 0 1 2
square them: 4 1 0 1 4
average: 10/5
(mean) mu = 5
(variance) sigma^2 = 2

12. Question
Quiz: 3,9,9,3
mean: add all the values and divide by the 4 elements = 24/4
variance: subtract the mean (6) from each value: -3 3 3 -3
square them: 9 9 9 9
average: 36/4
(mean) mu = 6
(variance) sigma^2 = 9

13. Question
for multi-dimensional data:
x1  x2
 3   8
 4   7
 5   5
 6   3
 7   2
the mean has two values, calculated independently for each dimension; here it just happens to be (5, 5)
then subtract the vector mean from each data vector to get this matrix of deviations:
-2  3
-1  2
 0  0
 1 -2
 2 -3
"which you can do for the main diagonal elements as before but for the off diagonal elements you just have to plug it in" -- the diagonal entries are the variances of each dimension, the off-diagonal entries are the covariances between dimensions
Sigma = [  2.0  -3.2 ]
        [ -3.2   5.2 ]

14. Gaussian Summary
functional form of the gaussian
fitting to data, and multi-variate gaussians

15-18 Expectation Maximization as Generalization of K-Means
change K-Means to use gaussian probabilities
each data point is attracted to every cluster center with a strength in proportion to its "posterior likelihood"
the center positions are adjusted according to all data points, weighted by those strengths
as a result they tend to not move as far, and in a smoother way
Algorithm: use a "mixture", which basically draws gaussian "circles" around each center
P(x) = sum(i=1..K)( P(cluster[i]) * P(x | cluster[i]) )
draw a cluster center and then a gaussian attached to this center...
the unknowns are P(cluster[i]), the prior called pi[i] for each i -- a fixed probability -- and the mu and sigma of gaussian i -- the known gaussian formula
steps:
1. E-step -- expectation step: assume we know the gaussians and the pi values, and compute e[ij], the soft correspondence (posterior probability) of data point j to center i
2. M-step -- figure out what the parameters should have been:
pi[i] = (sum(j) e[ij]) / M -- the total attraction between center i and the data points
mu[i] = sum(j)( e[ij] * x_j ) / sum(j)( e[ij] ) -- weighted mean
sigma[i]^2 = sum(j)( e[ij] * (x_j - mu[i])^2 ) / sum(j)( e[ij] ) -- weighted variance
same calculations as with gaussian fitting before, but weighted by the soft correspondence of each data point to each center (see the sketch at the end of these notes)
For EM:
each data point always corresponds to all cluster centers, but with different strengths
a linear set of data points will have an elongated gaussian "field", not a circle
centers will move to less extreme positions due to the "soft" strengths
so K-Means and Expectation Maximization will converge to different data models

19. Choosing K
how many clusters do we have?
practical implementation: guess the number of clusters along with the parameters
periodically evaluate which data points are not well covered by the existing centers and generate new centers near those points, then run the algorithm again to see how well it works
the score combines how well the centers cover the data with a penalty for the number of clusters

20. Clustering Summary
21. Dimensionality Reduction
22. Question
23. Linear Dimensionality Reduction
24. Face Example
25. Scan Example
26. Piece-Wise Linear Projection
27. Spectral Clustering
28. Spectral Clustering Algorithm
29. Question
30. Supervised vs Unsupervised Learning
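Referenced from sections 15-18: a minimal sketch of the 1-D E-step / M-step updates. The e[ij], pi, mu, and sigma^2 formulas follow the notes above; the initialization, iteration count, and example data are my own illustrative choices, not from the lecture.

import math

def gaussian(x, mu, sigma_sq):
    return math.exp(-0.5 * (x - mu) ** 2 / sigma_sq) / math.sqrt(2 * math.pi * sigma_sq)

def em_1d(xs, k, iters=50):
    m = len(xs)
    # crude initialization: spread the means over the data range, unit variance, uniform priors
    lo, hi = min(xs), max(xs)
    mu = [lo + (hi - lo) * (i + 1) / (k + 1) for i in range(k)]
    sigma_sq = [1.0] * k
    pi = [1.0 / k] * k

    for _ in range(iters):
        # E-step: e[i][j] = posterior probability that point j belongs to cluster i
        e = [[pi[i] * gaussian(x, mu[i], sigma_sq[i]) for x in xs] for i in range(k)]
        for j in range(m):
            total = sum(e[i][j] for i in range(k))
            for i in range(k):
                e[i][j] /= total

        # M-step: re-estimate pi, mu, sigma^2, weighted by the soft correspondences
        for i in range(k):
            w = sum(e[i])                          # total weight of cluster i
            pi[i] = w / m
            mu[i] = sum(e[i][j] * xs[j] for j in range(m)) / w
            sigma_sq[i] = sum(e[i][j] * (xs[j] - mu[i]) ** 2 for j in range(m)) / w
            sigma_sq[i] = max(sigma_sq[i], 1e-6)   # guard against a collapsing cluster

    return pi, mu, sigma_sq

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
    print(em_1d(data, k=2))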