class 6 Unsupervised Learning Oct 25, 2011

1. Unsupervised Learning
just get data with no target labels
M data items with N dimensions
can have clusters of data
the task of unsupervised learning is to recover the number and centers of the clusters

2. Question
data clustered along a line could be represented along that line, so we can do "dimensionality reduction"
see image of Video 2

3. Terminology
data is IID -- independent and identically distributed, drawn from the same distribution
Density Estimation -- recover the underlying probability distribution that generated the data
clustering using mixture models
dimensionality reduction can be applied to find structure in data
Blind Signal Separation -- separate two speakers on one signal using Factor Analysis, where each speaker is a Factor in the joint signal

4. Google Street View and Clustering
can you take billions of images and discover things like trees, cars and stop signs?
humans learn from unlabeled data, can computers do that?
clustering is the basic form of unsupervised learning
K-Means
Expectation Maximization -- probabilistic extension of K-Means

5-6 K-Means Clustering Example & Algorithm
K-Means runs for a fixed value of K, where K is the number of clusters to find (a minimal code sketch appears after section 10)
1. start by guessing cluster centers at random, using K=2 for instance
2. for each center find the data points closest to it by euclidean distance; the centers separate the space, and the line between them is the "equidistant line", making two clusters
3. now move each center to its optimal point, which is the center (mean) of its cluster's data points
4. iterate from 2 -- reassign points and recompute the cluster centers until they don't move
if a cluster center becomes "empty" (has no data points associated) then we need to restart with new random selections
see Video 5 for animation that makes almost no sense
converges to a locally optimal clustering; in general clustering is NP-hard, so a locally optimal solution is good
Problems:
Need to know K -- how many cluster centers are we looking for
Finds local minima
General problem with high dimensionality, like KNN
Lack of mathematical basis

7. Question
get image from video 6 explanation
get image from video 7 explanation
a distant point pulls the cluster center back from the closest ones

8. Question
get image from video 8 explanation

9. Expectation Maximization
algorithm that uses probability distributions
use the "normal" gaussian distribution with mean "mu" at the center and width "sigma"
y = 1 / (sqrt(2*pi) * sigma) * exp( -1/2 * (x-mu)^2 / sigma^2 )
peaks at x = mu and goes to zero exponentially
the exp() argument is a quadratic, and the 1/(sqrt(2*pi)*sigma) factor is a normalizer that makes sure the area underneath sums to 1 ... just like a probability should
for each x one can calculate a density value, effectively the probability of x, but strictly one should use a small interval around x and calculate the area
multi-variate gaussian -- multiple dimensions -- circular rings mark points of equal probability
...look up the equation someplace in any textbook on multi-variate distributions...
we'll use one-dimensional gaussians, but we want to be complete, no?

10. Gaussian Learning
fitting a gaussian to data
gaussian parameters: mu and sigma^2 (the variance)
f( x | mu, sigma^2 ) = y
the mean is the average of the data points -- mu = 1/M sum(j)( x_j ), where M is the number of points
sigma^2 is the average quadratic deviation from the mean -- sigma^2 = 1/M sum(j)( (x_j - mu)^2 )
(see the sketch below)
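A minimal code sketch of the K-Means loop from sections 5-6, referenced there. This is my own illustration, not lecture code: 1-D data for simplicity, and instead of restarting the whole run when a cluster goes empty (as the lecture suggests) it just re-seeds that one center.

import random

def kmeans_1d(xs, k, iters=100):
    centers = random.sample(xs, k)            # step 1: guess K centers at random
    for _ in range(iters):
        # step 2: assign each point to its nearest center (euclidean distance)
        clusters = [[] for _ in range(k)]
        for x in xs:
            i = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            clusters[i].append(x)
        # step 3: move each center to the mean of its cluster's points
        new_centers = []
        for i in range(k):
            if clusters[i]:
                new_centers.append(sum(clusters[i]) / len(clusters[i]))
            else:
                new_centers.append(random.choice(xs))   # empty cluster: re-seed at random
        # step 4: iterate until the centers stop moving
        if new_centers == centers:
            break
        centers = new_centers
    return centers

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
    print(sorted(kmeans_1d(data, k=2)))       # roughly [1.0, 5.0] for this data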
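A minimal sketch of fitting a 1-D gaussian as in section 10, using the 1/M normalizer from these notes; it also evaluates the density formula from section 9. The function names are my own, and the example data is the same as the quiz in section 11.

import math

def fit_gaussian(xs):
    """Return (mu, sigma^2): sample mean and average squared deviation (1/M normalizer)."""
    m = len(xs)
    mu = sum(xs) / m
    sigma_sq = sum((x - mu) ** 2 for x in xs) / m
    return mu, sigma_sq

def gaussian_density(x, mu, sigma_sq):
    """f(x | mu, sigma^2) from section 9."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma_sq) / math.sqrt(2 * math.pi * sigma_sq)

if __name__ == "__main__":
    data = [3, 4, 5, 6, 7]
    mu, sigma_sq = fit_gaussian(data)
    print(mu, sigma_sq)                        # 5.0 2.0
    print(gaussian_density(5, mu, sigma_sq))   # density at the mean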
11. Maximum Likelihood
the mean is the maximum likelihood estimate of a gaussian's mean, and the maximum likelihood estimate of sigma^2 is the average squared deviation from the mean
...watch the video if you want to derive this from first principles...
gives a nice basis for fitting gaussians to data points
Quiz: data points: 3,4,5,6,7
mean: add all the values and divide by the 5 elements = 25/5
variance: subtract the mean (5) from each value: -2 -1 0 1 2
square them: 4 1 0 1 4
average: 10/5
(mean) mu = 5
(variance) sigma^2 = 2

12. Question
Quiz: 3,9,9,3
mean: add all the values and divide by the 4 elements = 24/4
variance: subtract the mean (6) from each value: -3 3 3 -3
square them: 9 9 9 9
average: 36/4
(mean) mu = 6
(variance) sigma^2 = 9

13. Question
for multi-dimensional data:
x1  x2
 3   8
 4   7
 5   5
 6   3
 7   2
the mean has two values, calculated independently for each dimension; here it just happens to be (5, 5)
then subtract the vector mean from each data vector to get this matrix of deviations:
-2  3
-1  2
 0  0
 1 -2
 2 -3
"which you can do for the main diagonal elements as before but for the off diagonal elements you just have to plug it in" -- the diagonal entries are the variances of each dimension, the off-diagonal entries are the covariances between dimensions
Sigma = [  2.0  -3.2 ]
        [ -3.2   5.2 ]

14. Gaussian Summary
functional form of the gaussian
fitting to data, and multi-variate gaussians

15-18 Expectation Maximization as Generalization of K-Means
change K-Means to use gaussian probabilities
each data point is attracted to every cluster center with a strength in proportion to its "posterior likelihood"
the center positions are adjusted according to all data points, weighted by those strengths
as a result they tend to not move as far, and in a smoother way
Algorithm: use a "mixture", which basically draws gaussian "circles" around each center
P(x) = sum(i=1..K)( P(cluster[i]) * P(x | cluster[i]) )
draw a cluster center and then a gaussian attached to this center...
the unknowns are P(cluster[i]), the prior called pi[i] for each i -- a fixed probability -- and the mu and sigma of gaussian i -- the known gaussian formula
steps:
1. E-step -- expectation step: assume we know the gaussians and the pi values, and compute e[ij], the soft correspondence (posterior probability) of data point j to center i
2. M-step -- figure out what the parameters should have been:
pi[i] = (sum(j) e[ij]) / M -- the total attraction between center i and the data points
mu[i] = sum(j)( e[ij] * x_j ) / sum(j)( e[ij] ) -- weighted mean
sigma[i]^2 = sum(j)( e[ij] * (x_j - mu[i])^2 ) / sum(j)( e[ij] ) -- weighted variance
same calculations as with gaussian fitting before, but weighted by the soft correspondence of each data point to each center (see the sketch at the end of these notes)
For EM:
each data point always corresponds to all cluster centers, but with different strengths
a linear set of data points will have an elongated gaussian "field", not a circle
centers will move to less extreme positions due to the "soft" strengths
so K-Means and Expectation Maximization will converge to different data models

19. Choosing K
how many clusters do we have?
practical implementation: guess the number of clusters along with the parameters
periodically evaluate which data points are not well covered by the existing centers and generate new centers near those points, then run the algorithm again to see how well it works
the score combines how well the centers cover the data with a penalty for the number of clusters

20. Clustering Summary
21. Dimensionality Reduction
22. Question
23. Linear Dimensionality Reduction
24. Face Example
25. Scan Example
26. Piece-Wise Linear Projection
27. Spectral Clustering
28. Spectral Clustering Algorithm
29. Question
30. Supervised vs Unsupervised Learning
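Referenced from sections 15-18: a minimal sketch of the 1-D E-step / M-step updates. The e[ij], pi, mu, and sigma^2 formulas follow the notes above; the initialization, iteration count, and example data are my own illustrative choices, not from the lecture.

import math

def gaussian(x, mu, sigma_sq):
    return math.exp(-0.5 * (x - mu) ** 2 / sigma_sq) / math.sqrt(2 * math.pi * sigma_sq)

def em_1d(xs, k, iters=50):
    m = len(xs)
    # crude initialization: spread the means over the data range, unit variance, uniform priors
    lo, hi = min(xs), max(xs)
    mu = [lo + (hi - lo) * (i + 1) / (k + 1) for i in range(k)]
    sigma_sq = [1.0] * k
    pi = [1.0 / k] * k

    for _ in range(iters):
        # E-step: e[i][j] = posterior probability that point j belongs to cluster i
        e = [[pi[i] * gaussian(x, mu[i], sigma_sq[i]) for x in xs] for i in range(k)]
        for j in range(m):
            total = sum(e[i][j] for i in range(k))
            for i in range(k):
                e[i][j] /= total

        # M-step: re-estimate pi, mu, sigma^2, weighted by the soft correspondences
        for i in range(k):
            w = sum(e[i])                          # total weight of cluster i
            pi[i] = w / m
            mu[i] = sum(e[i][j] * xs[j] for j in range(m)) / w
            sigma_sq[i] = sum(e[i][j] * (xs[j] - mu[i]) ** 2 for j in range(m)) / w
            sigma_sq[i] = max(sigma_sq[i], 1e-6)   # guard against a collapsing cluster

    return pi, mu, sigma_sq

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
    print(em_1d(data, k=2))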