Unit 5 Machine Learning -- Oct 25, 2011

Video 5.1. Introduction
  The world is data rich on a scale unthinkable 5 years ago.
  Machine Learning extracts information from that data.
  Three types: Supervised, Unsupervised, Reinforcement.

Video 5.2. What is Machine Learning
  Reasoning within Bayes Networks -- reasoning with known models
  Machine Learning -- learns models from data
  Quiz: who does supervised learning?
    Google -- web mining
    Netflix -- DVD recommendation
    Amazon -- product placement

Video 5.3. Stanley (DARPA Grand Challenge)
  Stanley uses ML extensively for robotics.
  The lasers see only 25 meters out -- ML extrapolates the road out to the horizon.

Video 5.4. Taxonomy
  What:
    Parameters -- e.g. the nodes of a Bayes Network
    Structure -- the arcs of a Bayes Network
    Hidden concepts -- certain training examples form groups
  What from:
    Every system is driven by some kind of data that you care about.
    Supervised -- specific target labels are given
    Unsupervised -- target labels are missing
    Reinforcement -- uses feedback from the environment
  What for: prediction, diagnostics, summarization, ...
  How:
    Passive -- just an observer
    Active -- has an effect on the environment
    Online -- while the data is generated
    Offline -- after the data is generated
  Outputs:
    Classification -- a binary or discrete number of classes
    Regression -- continuous prediction, e.g. 66.5 degrees
  Details:
    Generative -- model the data as generally as possible
    Discriminative -- distinguish the data

Video 5.5. Supervised Learning
  For each training example you get a feature vector and a target label:
    X1 X2 X3 ... Xn -> Y
  Example: credit rating -- X1 = is the person employed, Y = credit default or not
  Example: image recognition -- X = image pixels, Y = chair or not
  Supervised learning takes a big list of actual data ("the data") with X's and Y's
  and tries to build a model that predicts future Y's given only the X's.
  Each input is a vector Xm; we want a function that maps Xm to Ym:
    f(Xm) = Ym
  as close as possible, but tolerating a certain amount of error.
  Quiz: for roughly straight-line data, which function is better, the line or the
  squiggle? ...the line... because the squiggle is more complex and doesn't match
  the data outside the given points.

Video 5.6. Occam's Razor
  Everything else being equal, choose the less complex hypothesis.
  But there is a tradeoff between good fit and low complexity.
  Complexity might be the polynomial degree of the hypothesis.
  Fit to the training data improves with complexity, but errors increase too.
  overfitting -- being too specific about each data point; increases with
    complexity, a major source of error in ML algorithms
  generalization error -- sum of the overfitting and training (fit) errors
  Want to minimize generalization error by picking the model complexity at the
  minimum of the (red) combined curve.
  see image: 05class_Overfitting.jpg

Video 5.7. SPAM Detection
  Get email, categorize it as "spam" or "ham" (good email).
  Email programs allow flagging messages as "spam".
  How to represent emails?
    bag of words -- counts the frequency of words in a dictionary, oblivious to
    the order of the words
  Training set of messages classified as spam and ham:
    SPAM                  HAM
    offer is secret       play sports today
    click secret link     went play sports
    secret sports link    secret sports event
                          sports is today
                          sports costs money
  Distribution of word counts (24 words total):
    word     total   S   H
    offer      1     1   0
    is         2     1   1
    secret     4     3   1
    click      1     1   0
    sports     6     1   5
    link       2     2   0
    play       2     0   2
    today      2     0   2
    went       1     0   1
    event      1     0   1
    costs      1     0   1
    money      1     0   1
    -----------------------
              24     9  15
  Probability of a word within a class: count / class-total,
    e.g. P("secret" | S) = 3/9 = 1/3
  Total probability of a word in the dictionary: count / full-total,
    e.g. P("secret") = 4/24 = 1/6
  The dictionary contains 12 words (word classes).
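  To make the counting above concrete, here is a minimal sketch (in Python, which
  the lecture itself does not use; the variable names and message lists are my own,
  taken from the table above) that builds the bag-of-words counts from the eight
  training messages and reproduces P("secret"|S) = 1/3 and P("secret") = 1/6:

    # Sketch of the bag-of-words counts from Video 5.7 (illustrative, not lecture code).
    from collections import Counter

    spam = ["offer is secret", "click secret link", "secret sports link"]
    ham  = ["play sports today", "went play sports", "secret sports event",
            "sports is today", "sports costs money"]

    spam_counts = Counter(w for msg in spam for w in msg.split())
    ham_counts  = Counter(w for msg in ham  for w in msg.split())

    n_spam_words = sum(spam_counts.values())   # 9
    n_ham_words  = sum(ham_counts.values())    # 15
    n_total      = n_spam_words + n_ham_words  # 24

    # Maximum-likelihood word probabilities, as in the notes above:
    p_secret_given_spam = spam_counts["secret"] / n_spam_words                      # 3/9 = 1/3
    p_secret_total = (spam_counts["secret"] + ham_counts["secret"]) / n_total       # 4/24 = 1/6
    print(p_secret_given_spam, p_secret_total)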
P("secret"|S) = 4/24 = 1/6 the dictionary contains 12 words (classes) Video 5.8. Question Probability that message is spam = 3/8 becasue 3 of 8 messages are spam. Video 5.9. Maximum Likelihood_1 Working out the prior probability more formally... Prior prob of spam -- P(spam) = pi -- that maximizes the likelihood of data assuming identical distribution for example: SSSHHHHH (3 spam, 5 ham) P(Yi) = pi if Yi=S 1-pi if Yi=H the probability of the Yth data item rewrite: P(Yi) = pi^Yi * (1-pi)^(1-Yi) assuming data is independent the probability of all data items is P(data) = prod(1-i) P(Yi) = the joint probability of all data items is the product of each rewrite: = pi^(count of Yi=S) * (1-pi)^(count of Yi=H) for example: = pi^3 * (1-pi)^5 want to maximize the pi. so can take log of both sides: log( P(data) = 3 * log(pi) + 5 * log(1-pi) maximize by setting derivitive = 0: 3/pi - 5/(1-pi) = 0 solving for pi gives: 5*pi + 3*pi = 3 so pi = 3/8 so, in fact: the 3/8 we got by counting is the Maximum Likelihood we want! Quiz: Question: P("secret" | S ) = 1/3 probability that we have "secret" in the message when we know it's Spam P("secret" | H ) = 1/15 probability that we have "secret" in the message when we know it's Ham secret is in Spam 3 times and Ham 1 time Spam messages have 9 words total and Ham has 15 words so 3/9 Spam words are secret 1/15 Ham words are secret note, doing it the other way, I think is this: P(S | "secret") = .75 probability that we have Spam with "secret" in the message = P("secret"|S) * P(S) / tP("secret") = 1/3 * 5/8 / 4/24 P(H | "secret") = .5 probability that we have Ham with "secret" in the message 3/4 of "secret"s are in the Spam messages Video 5.10. Relationship to Bayes Networks Bayes Network for spam filter: Spam / | \ w1 w2 wn A hidden node, Spam, has as many children as there are word classes in a message where each word has an identical conditional distribution of it's occurance given that the message is Spam or Ham Quiz: given a dictionary of 12 words, how many parameters needed in network? 23 -- one for the prior P(S) and 2 for each word = 2*12= 24 but each set of words only needs 11 because the last can be calc P(S) -- 1 == 1 P(Wi | S) -- 12 - 1 == 11 P(Wi | H) -- 12 - 1 == 11 23 Video 5.11. Question Quiz: given message M = "sports", what is the probability of being Spam P(S | M ) = 3/18 by bayes: P(S|M) = P(M|S) * P(S) / totalprob(M) so, P("sports"|S) = "sports" occurs in spam = 1/9 P(S) = probability of message being spam = 3/8 P("sports") = total prob of message (6 out of 24 words) = 6/24 Video 5.12. Question Quiz: M = "secret is secret" what is P(S|M) P("secret","is","secret" | S) * P(S) / TP-of-Message P("secret"|S) = "secret" occurs in spam = 1/3 P("secret"|~S) = "secret" occurs in ham = 1/15 P("is"|S) = "is" occurs in spam = 1/9 P("is"|~S) = "is" occurs in ham = 1/15 P(S) = probability of message being spam = 3/8 P+ = 1/3 * 1/9 * 1/3 * 3/8 = .00462 / TP ~= .9615 P- = 1/15 * 1/15 * 1/15 * 5/8 = .000185 TP = .00480 Video 5.13. Question Quiz: M = "today is secret" what is P(S|M) P("today","is","secret" | S) * P(S) / TP-of-Message P("today"|S) = "today" occurs in spam = 0/9 P("today"|~S) = "today" occurs in ham = 2/15 P("is"|S) = "is" occurs in spam = 1/9 P("is"|S) = "is" occurs in spam = 1/15 P(S) = probability of message being spam = 3/8 P+ = 0/9 * 1/9 * 1/3 * 3/8 = .0 / TP ~= 0.0 P- = 2/15 * 1/15 * 1/15 * 5/8 = .1666 TP = .1666 This says it's impossible for the message to be Spam becasue today has not yet appeared in Spam This is !!!overfitting!!! 
Video 5.14. Laplace Smoothing
  Laplace Smoothing addresses the overfitting problem of the simple Maximum
  Likelihood estimate.
  The ML probability of a word was just:   P(x) = count(x) / N
  With LS we add a constant K to the count and normalize by adding K for each class:
    P(x) = (count(x) + K) / (N + K*|x|)
  where |x| is the number of classes, so K*|x| means K times the number of classes;
  for our 2-class (spam/ham) example, K*|x| = 1*2.
  This is equivalent to adding K training examples to each class.
  If K = 0 we get the ML version; for finite N we get (slightly) different answers.
  Quiz: for K = 1 and these message counts (N total messages, Spam count):
    N     Spam   P(Spam) = (Spam+K)/(N+K*2)
    1       1       2/3    = 0.666
    10      6      7/12    = 0.583
    100    60     61/102   = 0.598
  --> The effect is to approach the ML probability smoothly as the number of
  samples increases.

Video 5.15. Question
  Use LS with K=1 on the original spam model to get the following probabilities for
  the word "today"...
  Original counts:
    messages: 3 spam, 5 ham, 8 total
    dictionary: 12 words -- 12 classes for the word distributions
    "today": 0 in spam, 2 in ham, 2 total
  Adding K=1:
    A. P(S)           = (3+1)/(8+2)   = 4/10   (add K for each of the 2 message classes)
    B. P(H)           = (5+1)/(8+2)   = 6/10
    C. P("today" | S) = (0+1)/(9+12)  = 1/21   (add K for each of the 12 word classes)
    D. P("today" | H) = (2+1)/(15+12) = 3/27

Video 5.16. Question
  Use LS with K=1 on the original spam model to get P(S|M) for M = "today is secret"...
  Original counts:
    messages: 3 spam, 5 ham, 8 total
    words: 24 total (9 in spam, 15 in ham); the dictionary has 12 words -- 12 classes
    "today":  0 in spam, 2 in ham, 2 total
    "is":     1 in spam, 1 in ham, 2 total
    "secret": 3 in spam, 1 in ham, 4 total
  From before, with K=1:
    P(S) = (3+1)/(8+2) = 4/10          P(H) = (5+1)/(8+2) = 6/10
    P("today")     = (2+1)/(24+12) = 3/36
    P("today" | S) = (0+1)/(9+12)  = 1/21
    P("today" | H) = (2+1)/(15+12) = 3/27
  P(S | "today","is","secret") = P(t,i,s | S) * P(S) / P(t,i,s):
    P("today" | S)  = (0+1)/(9+12) = 1/21     P("today" | H)  = (2+1)/(15+12) = 3/27
    P("is" | S)     = (1+1)/(9+12) = 2/21     P("is" | H)     = (1+1)/(15+12) = 2/27
    P("secret" | S) = (3+1)/(9+12) = 4/21     P("secret" | H) = (1+1)/(15+12) = 2/27
    P+ = 1/21 * 2/21 * 4/21 * 4/10 = .0003455
    P- = 3/27 * 2/27 * 2/27 * 6/10 = .0003657
    total = .0007112, so P(S|M) = .0003455 / .0007112 ~= .4857

Video 5.17. Summary
  Naive Bayes -- a generative model
  bag of words model -- an email is represented by its word counts
  Start with training examples that have a result label; use Maximum Likelihood and
  a Laplacian smoother with Bayes rule to get the probability of Spam given the
  word counts.

Video 5.18. Advanced SPAM Filtering
  Other things to consider:
    is the sender a known spammer
    have you emailed this person before
    have many other people received the same message
    is the header consistent
    is the email all CAPS
    do inline URLs point where they say they do
    did they use your correct name

Video 5.19. Digit Recognition
  Naive Bayes can be applied to handwritten digit recognition, but it is not a good
  choice because the pixels are not conditionally independent.
  Input vector = pixel values of a 16x16 image.
  The patterns are not shift invariant -- a shifted pattern is a different input.
  Smoothing the input can blend neighboring pixel values.

Video 5.20. Overfitting Prevention
  There is a tradeoff between how well we can fit the data and how smooth the model
  is. Using Laplacian Smoothing smooths the model, but how do you choose K?
  Cross-validation -- needs lots of training examples. Divide the data into 3 buckets:
    training         -- 80%
    cross-validation -- 10%
    test             -- 10%
  Train with different values of K using the training set, use the cross-validation
  set to find the K that maximizes performance, and use the test set to verify --
  only once. If you use the test set for cross-validation you can still overfit.
  You can split the training and cross-validation sets multiple different ways and
  repeat the procedure.
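  As a numerical check on the Laplace-smoothed quiz above (Videos 5.15-5.16), here
  is a minimal sketch (in Python; the names and structure are my own, the counts are
  the ones tabulated earlier) that reproduces P(S | "today is secret") ~= .4857 with
  K = 1:

    # Sketch of the K=1 Laplace-smoothed spam classifier from Videos 5.15-5.16.
    K = 1
    VOCAB_SIZE = 12                       # 12-word dictionary
    spam_counts = {"offer": 1, "is": 1, "secret": 3, "click": 1, "sports": 1, "link": 2}
    ham_counts  = {"play": 2, "sports": 5, "today": 2, "went": 1, "secret": 1,
                   "event": 1, "is": 1, "costs": 1, "money": 1}
    n_spam_msgs, n_ham_msgs = 3, 5
    n_spam_words, n_ham_words = 9, 15

    # Smoothed priors: add K for each of the 2 message classes.
    p_spam = (n_spam_msgs + K) / (n_spam_msgs + n_ham_msgs + K * 2)   # 4/10
    p_ham  = (n_ham_msgs  + K) / (n_spam_msgs + n_ham_msgs + K * 2)   # 6/10

    def p_word(word, counts, n_words):
        # Smoothed P(word | class): add K for each word class in the dictionary.
        return (counts.get(word, 0) + K) / (n_words + K * VOCAB_SIZE)

    def posterior_spam(message):
        p_s, p_h = p_spam, p_ham
        for w in message.split():
            p_s *= p_word(w, spam_counts, n_spam_words)
            p_h *= p_word(w, ham_counts, n_ham_words)
        return p_s / (p_s + p_h)

    print(posterior_spam("today is secret"))   # ~0.4857, matching the Video 5.16 quiz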
Video 5.21. Classification vs Regression
  Classification -- the target class is discrete
  Regression -- predict continuous values, e.g. Yi in [0,1] or any real number
  Regression models the relationship between two variables, one known and one you
  want to predict.

Video 5.22. Linear Regression
  (Overfitting makes a very un-smooth model.)
  The data is an input vector of X's mapping to a continuous Y:  Xn -> Y
  For one-dimensional linear regression (only one X value):
    y = f(x) = W1*x + W0
  For higher dimensions, W and X are N-element vectors and the multiplication is
  the inner product of the vectors.
  Quiz:
    x    y
    3    0
    6   -3
    4   -1
    5   -2
  It is an exact linear fit. What are W0 and W1?
    W1 is the slope dY/dX = (-3 - (-1)) / (6 - 4) = -2/2 = -1
    W0 is the intercept: for x=3, W1*x = -1*3 = -3, and y = 0 = -3 + W0, so W0 = 3
  This gets much more challenging when the data set is not exactly linear.

Video 5.23. More Linear Regression
  Define an error measure, the "quadratic error" between the target values and our
  estimate:
    LOSS = sum(j) ( Yj - (W1*Xj + W0) )^2
  i.e. the sum, over all data rows j, of the squared difference between the expected
  Y value and our estimated value.
  Minimizing this gives a good regression:
    W* = argmin(W) LOSS
  The solution W* (W-star) to the regression problem is the setting of the W's with
  minimum LOSS.

Video 5.24. Quadratic Loss
  For one-dimensional data, find the regression by minimizing the LOSS over both W's:
    min(W) L = sum(i) ( Yi - W1*Xi - W0 )^2
  Take the partial derivative with respect to W0 and set it to 0:
    dL/dW0 = -2 * sum(i)( Yi - W1*Xi - W0 ) = 0
  With M the number of training examples:
    M*W0 = sum(i)(Yi) - W1*sum(i)(Xi)
    so: W0 = 1/M * sum(i)(Yi) - W1/M * sum(i)(Xi)
  Doing the same derivative for W1:
    dL/dW1 = -2 * sum(i)( Yi - W1*Xi - W0 ) * Xi = 0
    so: sum(i)(Xi*Yi) - W0*sum(i)(Xi) = W1 * sum(i)(Xi^2)
  Cutting to the chase, the W parameters in terms of the data are:
    slope:      W1 = ( M*sum(i)(Xi*Yi) - sum(i)(Xi)*sum(i)(Yi) )
                     / ( M*sum(i)(Xi^2) - (sum(i)(Xi))^2 )
    intercept:  W0 = 1/M * sum(i)(Yi) - W1/M * sum(i)(Xi)
  Answer to this video's quiz (see linearRegression.xls): W0 = .5, W1 = .9

Video 5.25. Problems with Linear Regression
  Lines don't fit curves.
  Outliers skew the regression.
  As X goes to infinity, so does Y, which may not make sense for your model
  (e.g. the weather may not keep getting hotter and hotter).
  Can use Logistic Regression instead: with f(x) a linear function, the output of
  logistic regression is
    z = 1 / (1 + e^-f(x))
  The range of z is 0 to 1:
    as f(x) -> +infinity, e^-f(x) -> 0,        so z -> 1/1 = 1
    as f(x) -> -infinity, e^-f(x) -> infinity, so z -> 1/infinity, close to 0
  The plot of z is approximately linear around f(x) = 0 but levels off
  (asymptotically) at the extremes.

Video 5.26. Linear Regression and Complexity Control
  Regularization, or "complexity control": use the normal loss function plus a
  "loss over the parameters":
    Loss(data) + Loss(parameters)
  where Loss(parameters) penalizes parameters that become large:
    Loss(data)       = sum(j) ( Yj - W1*Xj - W0 )^2
    Loss(parameters) = sum(i) |Wi|^p        where p is usually 1 or 2
  This pulls the parameters in towards 0 -- in a circular shape for the quadratic
  penalty with p=2 (called "L2"), and in a diamond shape for p=1 (called "L1").
  With L1 the parameters are sparser and some tend to be pulled exactly to 0; with
  L2 they are not as sparse.
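  To check the closed-form solution from Video 5.24, here is a minimal sketch (in
  Python; the variable names are my own) that plugs the Video 5.22 quiz data into
  the derived formulas and recovers W1 = -1, W0 = 3:

    # Sketch of the closed-form one-dimensional regression from Video 5.24.
    xs = [3, 6, 4, 5]
    ys = [0, -3, -1, -2]
    M = len(xs)

    sum_x  = sum(xs)
    sum_y  = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # slope and intercept from the derived formulas
    w1 = (M * sum_xy - sum_x * sum_y) / (M * sum_x2 - sum_x ** 2)
    w0 = sum_y / M - w1 * sum_x / M

    print(w1, w0)   # -1.0, 3.0 -- matching the quiz answer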
Videos 5.27-5.30. Minimizing Complicated Loss Functions
  Are there closed-form solutions once you add regularization? Mostly no -- you
  have to use iterative solutions.
  Use gradient descent: a gradient sloping down to the right is negative, up to the
  right is positive.
  These methods are subject to finding local minima rather than the global one;
  optimization books explain how to address the local-minimum problem.
  Start with a guess for the parameters and iterate by subtracting
    alpha * (the gradient of the loss function at that point),
  where alpha is a small learning rate, maybe .001.
  To reach the true minimum, alpha should get smaller as you get close.

Video 5.31. Gradient Descent Implementation
  For the Loss(data) functions, the gradients are the dL/dW equations shown before,
  so on each iteration we subtract alpha times the dL/dW expressions.

Video 5.32. Perceptron
  Linear functions used for classification.
  An iterative algorithm whose update rule is a form of gradient descent.
  ---> It always converges to a linear separation of the data, if one is possible.
  A "linear separator" is a linear function that separates positive from negative
  samples: if the function's value at a sample is above the threshold, the sample
  is classified + (in the example), otherwise -.
  Start with a random guess for W1 and W0, then calculate new W values by adding
    alpha * (actual label - f(x using the old W's)).
  This is an online learning rule: each X value is processed on its own, many times.
  It gives a method to adapt the W weights in proportion to the error.
  It can find many different separators, but may not find the best one. We want the
  separator that is furthest from all the examples (the opposite of just minimizing
  error) -- maximize the "margin", the distance to the nearest training samples.
  That is done using:
    * boosting
    * support vector machines (SVMs) -- pick the linear separator that maximizes
      the margin, using a quadratic program that iterates
  Using linear techniques to solve non-linear problems -- the kernel trick:
  augment the feature set with new features. If the starting features X1 and X2 are
  the x,y axes, add a feature that is a circle:
    X3 = sqrt(X1^2 + X2^2)
  This separates samples close to the origin from those further away, and can be
  done in any linear learning problem.
  (get images from video 32...)
  Summary so far:
    Regression vs Classification
    Exact solutions vs Iterative solutions
    Smoothing
    Extension to non-linear problems

Video 5.33. k Nearest Neighbors
  Supervised learning can be parametric or non-parametric.
  Parametric methods have parameters like weights or probabilities; the number of
  parameters is independent of the training set size (for Naive Bayes it depends
  only on the dictionary size, for regression it is only the W weights).
  Non-parametric methods -- the number of parameters can grow over time.
  k-Nearest Neighbors (1-NN as the simplest case):
    learning -- memorize all the data
    for a new example:
      * find the k nearest neighbors
      * return the label of the majority of those k neighbors as the class label
        for the new example
  (get image from video 34 at end, or from the question explanation)
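  A minimal k-Nearest-Neighbors sketch for the algorithm just described (in Python;
  the function name and the toy 2-D data are my own, not from the lecture):

    # Sketch of kNN classification from Video 5.33 (illustrative code and data).
    from collections import Counter
    import math

    def knn_classify(train, query, k=3):
        """train: list of ((x1, x2), label); return the majority label of the k nearest points."""
        # "Learning" is just memorizing the data; all the work happens at query time.
        by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
        votes = Counter(label for _, label in by_distance[:k])
        return votes.most_common(1)[0][0]

    # Toy data: '+' samples near the origin, '-' samples further out.
    train = [((0, 0), '+'), ((1, 0), '+'), ((0, 1), '+'),
             ((5, 5), '-'), ((6, 5), '-'), ((5, 6), '-')]
    print(knn_classify(train, (0.5, 0.5), k=3))   # '+'
    print(knn_classify(train, (5.5, 5.5), k=3))   # '-'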
Video 5.34. kNN Definition
  A Voronoi graph (diagram) shows the separation between the classes.
  K is a regularizer that controls the complexity of the nearest-neighbor algorithm:
  using a large K gives a smoother result but misclassifies outliers.
  Can use cross-validation to find a good K, because there is a tradeoff between the
  complexity of what we want to fit and the goodness of the fit.

Videos 5.35-5.36. K as a Smoothing Parameter / Problems with kNN
  Large data sets -- long searches for the nearest neighbors, though methods like
  k-d trees reduce the search time.
  Large feature spaces -- more of a problem, because increasing the feature set
  increases the computation, and the tree methods become brittle.
  If you graph the number of input dimensions against the edge length of the
  neighborhood around randomly chosen points, you find that very quickly all points
  are far away: in high dimensions it is more likely that at least one dimension is
  far off, and the number of points you need to stay close grows exponentially with
  the number of dimensions.
  So don't trust kNN to do a good job if the input dimension is high.

Video 5.37. Congratulations