Unit 5 Machine Learning -- Oct 25, 2011

Video 5.1. Introduction
  The world is data rich on a scale unthinkable 5 years ago.
  Machine Learning extracts information from that data.
  Three types: Supervised, Unsupervised, Reinforcement.

Video 5.2. What is Machine Learning
  Reasoning within Bayes Networks -- reasoning with known models
  Machine Learning -- learns models from data
  Quiz: who does supervised learning?
    Google -- web mining
    Netflix -- DVD recommendation
    Amazon -- product placement

Video 5.3. Stanley (DARPA Grand Challenge)
  Stanley uses ML extensively for robotics.
  The lasers see only 25 meters out -- ML extrapolates the road out to the horizon.

Video 5.4. Taxonomy
  What:
    Parameters -- e.g. the nodes of a Bayes Network
    Structure -- the arcs of a Bayes Network
    Hidden concepts -- certain training examples form groups
  What from:
    Every system is driven by some kind of data that you care about.
    Supervised -- specific target labels are given
    Unsupervised -- target labels are missing
    Reinforcement -- uses feedback from the environment
  What for: prediction, diagnostics, summarization, ...
  How:
    Passive -- just an observer
    Active -- has an effect on the environment
    Online -- while the data is generated
    Offline -- after the data is generated
  Outputs:
    Classification -- a binary or discrete number of classes
    Regression -- continuous prediction, e.g. 66.5 degrees
  Details:
    Generative -- model the data as generally as possible
    Discriminative -- distinguish the data

Video 5.5. Supervised Learning
  For each training example you get a feature vector and a target label:
    X1 X2 X3 ... Xn -> Y
  Example: credit rating -- X1 = is the person employed, Y = credit default or not
  Example: image recognition -- X = image pixels, Y = chair or not
  Supervised learning takes a big list of actual data ("the data") with X's and Y's
  and tries to build a model that predicts future Y's given only the X's.
  Each input is a vector Xm; we want a function that maps Xm to Ym:
    f(Xm) = Ym
  as close as possible, but tolerating a certain amount of error.
  Quiz: for roughly straight-line data, which function is better, the line or the
  squiggle? ...the line... because the squiggle is more complex and doesn't match
  the data outside the given points.

Video 5.6. Occam's Razor
  Everything else being equal, choose the less complex hypothesis.
  But there is a tradeoff between good fit and low complexity.
  Complexity might be the polynomial degree of the hypothesis.
  Fit to the training data improves with complexity, but errors increase too.
  overfitting -- being too specific about each data point; increases with
    complexity, a major source of error in ML algorithms
  generalization error -- sum of the overfitting and training (fit) errors
  Want to minimize generalization error by picking the model complexity at the
  minimum of the (red) combined curve.
  see image: 05class_Overfitting.jpg

Video 5.7. SPAM Detection
  Get email, categorize it as "spam" or "ham" (good email).
  Email programs allow flagging messages as "spam".
  How to represent emails?
    bag of words -- counts the frequency of words in a dictionary, oblivious to
    the order of the words
  Training set of messages classified as spam and ham:
    SPAM                  HAM
    offer is secret       play sports today
    click secret link     went play sports
    secret sports link    secret sports event
                          sports is today
                          sports costs money
  Distribution of word counts (24 words total):
    word     total   S   H
    offer      1     1   0
    is         2     1   1
    secret     4     3   1
    click      1     1   0
    sports     6     1   5
    link       2     2   0
    play       2     0   2
    today      2     0   2
    went       1     0   1
    event      1     0   1
    costs      1     0   1
    money      1     0   1
    -----------------------
              24     9  15
  Probability of a word within a class: count / class-total,
    e.g. P("secret" | S) = 3/9 = 1/3
  Total probability of a word in the dictionary: count / full-total,
    e.g. P("secret") = 4/24 = 1/6
  The dictionary contains 12 words (word classes).
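  To make the counting above concrete, here is a minimal sketch (in Python, which
  the lecture itself does not use; the variable names and message lists are my own,
  taken from the table above) that builds the bag-of-words counts from the eight
  training messages and reproduces P("secret"|S) = 1/3 and P("secret") = 1/6:

    # Sketch of the bag-of-words counts from Video 5.7 (illustrative, not lecture code).
    from collections import Counter

    spam = ["offer is secret", "click secret link", "secret sports link"]
    ham  = ["play sports today", "went play sports", "secret sports event",
            "sports is today", "sports costs money"]

    spam_counts = Counter(w for msg in spam for w in msg.split())
    ham_counts  = Counter(w for msg in ham  for w in msg.split())

    n_spam_words = sum(spam_counts.values())   # 9
    n_ham_words  = sum(ham_counts.values())    # 15
    n_total      = n_spam_words + n_ham_words  # 24

    # Maximum-likelihood word probabilities, as in the notes above:
    p_secret_given_spam = spam_counts["secret"] / n_spam_words                      # 3/9 = 1/3
    p_secret_total = (spam_counts["secret"] + ham_counts["secret"]) / n_total       # 4/24 = 1/6
    print(p_secret_given_spam, p_secret_total)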
P("secret"|S) = 4/24 = 1/6 the dictionary contains 12 words (classes) Video 5.8. Question Probability that message is spam = 3/8 becasue 3 of 8 messages are spam. Video 5.9. Maximum Likelihood_1 Working out the prior probability more formally... Prior prob of spam -- P(spam) = pi -- that maximizes the likelihood of data assuming identical distribution for example: SSSHHHHH (3 spam, 5 ham) P(Yi) = pi if Yi=S 1-pi if Yi=H the probability of the Yth data item rewrite: P(Yi) = pi^Yi * (1-pi)^(1-Yi) assuming data is independent the probability of all data items is P(data) = prod(1-i) P(Yi) = the joint probability of all data items is the product of each rewrite: = pi^(count of Yi=S) * (1-pi)^(count of Yi=H) for example: = pi^3 * (1-pi)^5 want to maximize the pi. so can take log of both sides: log( P(data) = 3 * log(pi) + 5 * log(1-pi) maximize by setting derivitive = 0: 3/pi - 5/(1-pi) = 0 solving for pi gives: 5*pi + 3*pi = 3 so pi = 3/8 so, in fact: the 3/8 we got by counting is the Maximum Likelihood we want! Quiz: Question: P("secret" | S ) = 1/3 probability that we have "secret" in the message when we know it's Spam P("secret" | H ) = 1/15 probability that we have "secret" in the message when we know it's Ham secret is in Spam 3 times and Ham 1 time Spam messages have 9 words total and Ham has 15 words so 3/9 Spam words are secret 1/15 Ham words are secret note, doing it the other way, I think is this: P(S | "secret") = .75 probability that we have Spam with "secret" in the message = P("secret"|S) * P(S) / tP("secret") = 1/3 * 5/8 / 4/24 P(H | "secret") = .5 probability that we have Ham with "secret" in the message 3/4 of "secret"s are in the Spam messages Video 5.10. Relationship to Bayes Networks Bayes Network for spam filter: Spam / | \ w1 w2 wn A hidden node, Spam, has as many children as there are word classes in a message where each word has an identical conditional distribution of it's occurance given that the message is Spam or Ham Quiz: given a dictionary of 12 words, how many parameters needed in network? 23 -- one for the prior P(S) and 2 for each word = 2*12= 24 but each set of words only needs 11 because the last can be calc P(S) -- 1 == 1 P(Wi | S) -- 12 - 1 == 11 P(Wi | H) -- 12 - 1 == 11 23 Video 5.11. Question Quiz: given message M = "sports", what is the probability of being Spam P(S | M ) = 3/18 by bayes: P(S|M) = P(M|S) * P(S) / totalprob(M) so, P("sports"|S) = "sports" occurs in spam = 1/9 P(S) = probability of message being spam = 3/8 P("sports") = total prob of message (6 out of 24 words) = 6/24 Video 5.12. Question Quiz: M = "secret is secret" what is P(S|M) P("secret","is","secret" | S) * P(S) / TP-of-Message P("secret"|S) = "secret" occurs in spam = 1/3 P("secret"|~S) = "secret" occurs in ham = 1/15 P("is"|S) = "is" occurs in spam = 1/9 P("is"|~S) = "is" occurs in ham = 1/15 P(S) = probability of message being spam = 3/8 P+ = 1/3 * 1/9 * 1/3 * 3/8 = .00462 / TP ~= .9615 P- = 1/15 * 1/15 * 1/15 * 5/8 = .000185 TP = .00480 Video 5.13. Question Quiz: M = "today is secret" what is P(S|M) P("today","is","secret" | S) * P(S) / TP-of-Message P("today"|S) = "today" occurs in spam = 0/9 P("today"|~S) = "today" occurs in ham = 2/15 P("is"|S) = "is" occurs in spam = 1/9 P("is"|S) = "is" occurs in spam = 1/15 P(S) = probability of message being spam = 3/8 P+ = 0/9 * 1/9 * 1/3 * 3/8 = .0 / TP ~= 0.0 P- = 2/15 * 1/15 * 1/15 * 5/8 = .1666 TP = .1666 This says it's impossible for the message to be Spam becasue today has not yet appeared in Spam This is !!!overfitting!!! 
Video 5.14. Laplace Smoothing
  Laplace Smoothing addresses the overfitting problem of the simple Maximum
  Likelihood estimate.
  The ML probability of a word was just:   P(x) = count(x) / N
  With LS we add a constant K to the count and normalize by adding K for each class:
    P(x) = (count(x) + K) / (N + K*|x|)
  where |x| is the number of classes, so K*|x| means K times the number of classes;
  for our 2-class (spam/ham) example, K*|x| = 1*2.
  This is equivalent to adding K training examples to each class.
  If K = 0 we get the ML version; for finite N we get (slightly) different answers.
  Quiz: for K = 1 and these message counts (N total messages, Spam count):
    N     Spam   P(Spam) = (Spam+K)/(N+K*2)
    1       1       2/3    = 0.666
    10      6      7/12    = 0.583
    100    60     61/102   = 0.598
  --> The effect is to approach the ML probability smoothly as the number of
  samples increases.

Video 5.15. Question
  Use LS with K=1 on the original spam model to get the following probabilities for
  the word "today"...
  Original counts:
    messages: 3 spam, 5 ham, 8 total
    dictionary: 12 words -- 12 classes for the word distributions
    "today": 0 in spam, 2 in ham, 2 total
  Adding K=1:
    A. P(S)           = (3+1)/(8+2)   = 4/10   (add K for each of the 2 message classes)
    B. P(H)           = (5+1)/(8+2)   = 6/10
    C. P("today" | S) = (0+1)/(9+12)  = 1/21   (add K for each of the 12 word classes)
    D. P("today" | H) = (2+1)/(15+12) = 3/27

Video 5.16. Question
  Use LS with K=1 on the original spam model to get P(S|M) for M = "today is secret"...
  Original counts:
    messages: 3 spam, 5 ham, 8 total
    words: 24 total (9 in spam, 15 in ham); the dictionary has 12 words -- 12 classes
    "today":  0 in spam, 2 in ham, 2 total
    "is":     1 in spam, 1 in ham, 2 total
    "secret": 3 in spam, 1 in ham, 4 total
  From before, with K=1:
    P(S) = (3+1)/(8+2) = 4/10          P(H) = (5+1)/(8+2) = 6/10
    P("today")     = (2+1)/(24+12) = 3/36
    P("today" | S) = (0+1)/(9+12)  = 1/21
    P("today" | H) = (2+1)/(15+12) = 3/27
  P(S | "today","is","secret") = P(t,i,s | S) * P(S) / P(t,i,s):
    P("today" | S)  = (0+1)/(9+12) = 1/21     P("today" | H)  = (2+1)/(15+12) = 3/27
    P("is" | S)     = (1+1)/(9+12) = 2/21     P("is" | H)     = (1+1)/(15+12) = 2/27
    P("secret" | S) = (3+1)/(9+12) = 4/21     P("secret" | H) = (1+1)/(15+12) = 2/27
    P+ = 1/21 * 2/21 * 4/21 * 4/10 = .0003455
    P- = 3/27 * 2/27 * 2/27 * 6/10 = .0003657
    total = .0007112, so P(S|M) = .0003455 / .0007112 ~= .4857

Video 5.17. Summary
  Naive Bayes -- a generative model
  bag of words model -- an email is represented by its word counts
  Start with training examples that have a result label; use Maximum Likelihood and
  a Laplacian smoother with Bayes rule to get the probability of Spam given the
  word counts.

Video 5.18. Advanced SPAM Filtering
  Other things to consider:
    is the sender a known spammer
    have you emailed this person before
    have many other people received the same message
    is the header consistent
    is the email all CAPS
    do inline URLs point where they say they do
    did they use your correct name

Video 5.19. Digit Recognition
  Naive Bayes can be applied to handwritten digit recognition, but it is not a good
  choice because the pixels are not conditionally independent.
  Input vector = pixel values of a 16x16 image.
  The patterns are not shift invariant -- a shifted pattern is a different input.
  Smoothing the input can blend neighboring pixel values.

Video 5.20. Overfitting Prevention
  There is a tradeoff between how well we can fit the data and how smooth the model
  is. Using Laplacian Smoothing smooths the model, but how do you choose K?
  Cross-validation -- needs lots of training examples. Divide the data into 3 buckets:
    training         -- 80%
    cross-validation -- 10%
    test             -- 10%
  Train with different values of K using the training set, use the cross-validation
  set to find the K that maximizes performance, and use the test set to verify --
  only once. If you use the test set for cross-validation you can still overfit.
  You can split the training and cross-validation sets multiple different ways and
  repeat the procedure.
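  As a numerical check on the Laplace-smoothed quiz above (Videos 5.15-5.16), here
  is a minimal sketch (in Python; the names and structure are my own, the counts are
  the ones tabulated earlier) that reproduces P(S | "today is secret") ~= .4857 with
  K = 1:

    # Sketch of the K=1 Laplace-smoothed spam classifier from Videos 5.15-5.16.
    K = 1
    VOCAB_SIZE = 12                       # 12-word dictionary
    spam_counts = {"offer": 1, "is": 1, "secret": 3, "click": 1, "sports": 1, "link": 2}
    ham_counts  = {"play": 2, "sports": 5, "today": 2, "went": 1, "secret": 1,
                   "event": 1, "is": 1, "costs": 1, "money": 1}
    n_spam_msgs, n_ham_msgs = 3, 5
    n_spam_words, n_ham_words = 9, 15

    # Smoothed priors: add K for each of the 2 message classes.
    p_spam = (n_spam_msgs + K) / (n_spam_msgs + n_ham_msgs + K * 2)   # 4/10
    p_ham  = (n_ham_msgs  + K) / (n_spam_msgs + n_ham_msgs + K * 2)   # 6/10

    def p_word(word, counts, n_words):
        # Smoothed P(word | class): add K for each word class in the dictionary.
        return (counts.get(word, 0) + K) / (n_words + K * VOCAB_SIZE)

    def posterior_spam(message):
        p_s, p_h = p_spam, p_ham
        for w in message.split():
            p_s *= p_word(w, spam_counts, n_spam_words)
            p_h *= p_word(w, ham_counts, n_ham_words)
        return p_s / (p_s + p_h)

    print(posterior_spam("today is secret"))   # ~0.4857, matching the Video 5.16 quiz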
Video 5.21. Classification vs Regression
  Classification -- the target class is discrete
  Regression -- predict continuous values, e.g. Yi in [0,1] or any real number
  Regression models the relationship between two variables, one known and one you
  want to predict.

Video 5.22. Linear Regression
  (Overfitting makes a very un-smooth model.)
  The data is an input vector of X's mapping to a continuous Y:  Xn -> Y
  For one-dimensional linear regression (only one X value):
    y = f(x) = W1*x + W0
  For higher dimensions, W and X are N-element vectors and the multiplication is
  the inner product of the vectors.
  Quiz:
    x    y
    3    0
    6   -3
    4   -1
    5   -2
  It is an exact linear fit. What are W0 and W1?
    W1 is the slope dY/dX = (-3 - (-1)) / (6 - 4) = -2/2 = -1
    W0 is the intercept: for x=3, W1*x = -1*3 = -3, and y = 0 = -3 + W0, so W0 = 3
  This gets much more challenging when the data set is not exactly linear.

Video 5.23. More Linear Regression
  Define an error measure, the "quadratic error" between the target values and our
  estimate:
    LOSS = sum(j) ( Yj - (W1*Xj + W0) )^2
  i.e. the sum, over all data rows j, of the squared difference between the expected
  Y value and our estimated value.
  Minimizing this gives a good regression:
    W* = argmin(W) LOSS
  The solution W* (W-star) to the regression problem is the setting of the W's with
  minimum LOSS.

Video 5.24. Quadratic Loss
  For one-dimensional data, find the regression by minimizing the LOSS over both W's:
    min(W) L = sum(i) ( Yi - W1*Xi - W0 )^2
  Take the partial derivative with respect to W0 and set it to 0:
    dL/dW0 = -2 * sum(i)( Yi - W1*Xi - W0 ) = 0
  With M the number of training examples:
    M*W0 = sum(i)(Yi) - W1*sum(i)(Xi)
    so: W0 = 1/M * sum(i)(Yi) - W1/M * sum(i)(Xi)
  Doing the same derivative for W1:
    dL/dW1 = -2 * sum(i)( Yi - W1*Xi - W0 ) * Xi = 0
    so: sum(i)(Xi*Yi) - W0*sum(i)(Xi) = W1 * sum(i)(Xi^2)
  Cutting to the chase, the W parameters in terms of the data are:
    slope:      W1 = ( M*sum(i)(Xi*Yi) - sum(i)(Xi)*sum(i)(Yi) )
                     / ( M*sum(i)(Xi^2) - (sum(i)(Xi))^2 )
    intercept:  W0 = 1/M * sum(i)(Yi) - W1/M * sum(i)(Xi)
  Answer to this video's quiz (see linearRegression.xls): W0 = .5, W1 = .9

Video 5.25. Problems with Linear Regression
  Lines don't fit curves.
  Outliers skew the regression.
  As X goes to infinity, so does Y, which may not make sense for your model
  (e.g. the weather may not keep getting hotter and hotter).
  Can use Logistic Regression instead: with f(x) a linear function, the output of
  logistic regression is
    z = 1 / (1 + e^-f(x))
  The range of z is 0 to 1:
    as f(x) -> +infinity, e^-f(x) -> 0,        so z -> 1/1 = 1
    as f(x) -> -infinity, e^-f(x) -> infinity, so z -> 1/infinity, close to 0
  The plot of z is approximately linear around f(x) = 0 but levels off
  (asymptotically) at the extremes.

Video 5.26. Linear Regression and Complexity Control
  Regularization, or "complexity control": use the normal loss function plus a
  "loss over the parameters":
    Loss(data) + Loss(parameters)
  where Loss(parameters) penalizes parameters that become large:
    Loss(data)       = sum(j) ( Yj - W1*Xj - W0 )^2
    Loss(parameters) = sum(i) |Wi|^p        where p is usually 1 or 2
  This pulls the parameters in towards 0 -- in a circular shape for the quadratic
  penalty with p=2 (called "L2"), and in a diamond shape for p=1 (called "L1").
  With L1 the parameters are sparser and some tend to be pulled exactly to 0; with
  L2 they are not as sparse.
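  To check the closed-form solution from Video 5.24, here is a minimal sketch (in
  Python; the variable names are my own) that plugs the Video 5.22 quiz data into
  the derived formulas and recovers W1 = -1, W0 = 3:

    # Sketch of the closed-form one-dimensional regression from Video 5.24.
    xs = [3, 6, 4, 5]
    ys = [0, -3, -1, -2]
    M = len(xs)

    sum_x  = sum(xs)
    sum_y  = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # slope and intercept from the derived formulas
    w1 = (M * sum_xy - sum_x * sum_y) / (M * sum_x2 - sum_x ** 2)
    w0 = sum_y / M - w1 * sum_x / M

    print(w1, w0)   # -1.0, 3.0 -- matching the quiz answer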
Videos 5.27-5.30. Minimizing Complicated Loss Functions
  Are there closed-form solutions once you add regularization? Mostly no -- you
  have to use iterative solutions.
  Use gradient descent: a gradient sloping down to the right is negative, up to the
  right is positive.
  These methods are subject to finding local minima rather than the global one;
  optimization books explain how to address the local-minimum problem.
  Start with a guess for the parameters and iterate by subtracting
    alpha * (the gradient of the loss function at that point),
  where alpha is a small learning rate, maybe .001.
  To reach the true minimum, alpha should get smaller as you get close.

Video 5.31. Gradient Descent Implementation
  For the Loss(data) functions, the gradients are the dL/dW equations shown before,
  so on each iteration we subtract alpha times the dL/dW expressions.

Video 5.32. Perceptron
  Linear functions used for classification.
  An iterative algorithm whose update rule is a form of gradient descent.
  ---> It always converges to a linear separation of the data, if one is possible.
  A "linear separator" is a linear function that separates positive from negative
  samples: if the function's value at a sample is above the threshold, the sample
  is classified + (in the example), otherwise -.
  Start with a random guess for W1 and W0, then calculate new W values by adding
    alpha * (actual label - f(x using the old W's)).
  This is an online learning rule: each X value is processed on its own, many times.
  It gives a method to adapt the W weights in proportion to the error.
  It can find many different separators, but may not find the best one. We want the
  separator that is furthest from all the examples (the opposite of just minimizing
  error) -- maximize the "margin", the distance to the nearest training samples.
  That is done using:
    * boosting
    * support vector machines (SVMs) -- pick the linear separator that maximizes
      the margin, using a quadratic program that iterates
  Using linear techniques to solve non-linear problems -- the kernel trick:
  augment the feature set with new features. If the starting features X1 and X2 are
  the x,y axes, add a feature that is a circle:
    X3 = sqrt(X1^2 + X2^2)
  This separates samples close to the origin from those further away, and can be
  done in any linear learning problem.
  (get images from video 32...)
  Summary so far:
    Regression vs Classification
    Exact solutions vs Iterative solutions
    Smoothing
    Extension to non-linear problems

Video 5.33. k Nearest Neighbors
  Supervised learning can be parametric or non-parametric.
  Parametric methods have parameters like weights or probabilities; the number of
  parameters is independent of the training set size (for Naive Bayes it depends
  only on the dictionary size, for regression it is only the W weights).
  Non-parametric methods -- the number of parameters can grow over time.
  k-Nearest Neighbors (1-NN as the simplest case):
    learning -- memorize all the data
    for a new example:
      * find the k nearest neighbors
      * return the label of the majority of those k neighbors as the class label
        for the new example
  (get image from video 34 at end, or from the question explanation)
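  A minimal k-Nearest-Neighbors sketch for the algorithm just described (in Python;
  the function name and the toy 2-D data are my own, not from the lecture):

    # Sketch of kNN classification from Video 5.33 (illustrative code and data).
    from collections import Counter
    import math

    def knn_classify(train, query, k=3):
        """train: list of ((x1, x2), label); return the majority label of the k nearest points."""
        # "Learning" is just memorizing the data; all the work happens at query time.
        by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
        votes = Counter(label for _, label in by_distance[:k])
        return votes.most_common(1)[0][0]

    # Toy data: '+' samples near the origin, '-' samples further out.
    train = [((0, 0), '+'), ((1, 0), '+'), ((0, 1), '+'),
             ((5, 5), '-'), ((6, 5), '-'), ((5, 6), '-')]
    print(knn_classify(train, (0.5, 0.5), k=3))   # '+'
    print(knn_classify(train, (5.5, 5.5), k=3))   # '-'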
Video 5.34. kNN Definition
  A Voronoi graph (diagram) shows the separation between the classes.
  K is a regularizer that controls the complexity of the nearest-neighbor algorithm:
  using a large K gives a smoother result but misclassifies outliers.
  Can use cross-validation to find a good K, because there is a tradeoff between the
  complexity of what we want to fit and the goodness of the fit.

Videos 5.35-5.36. K as a Smoothing Parameter / Problems with kNN
  Large data sets -- long searches for the nearest neighbors, though methods like
  k-d trees reduce the search time.
  Large feature spaces -- more of a problem, because increasing the feature set
  increases the computation, and the tree methods become brittle.
  If you graph the number of input dimensions against the edge length of the
  neighborhood around randomly chosen points, you find that very quickly all points
  are far away: in high dimensions it is more likely that at least one dimension is
  far off, and the number of points you need to stay close grows exponentially with
  the number of dimensions.
  So don't trust kNN to do a good job if the input dimension is high.

Video 5.37. Congratulations