Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.^{[1]} It infers a function from labeled training data consisting of a set of training examples.^{[2]} In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias).
The parallel task in human and animal psychology is often referred to as concept learning.
Steps
In order to solve a given problem of supervised learning, one has to perform the following steps:
 Determine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.
 Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
 Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output.
 Determine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support vector machines or decision trees.
 Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.
 Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
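The steps above can be sketched end to end on a toy problem. The following is only an illustrative sketch under stated assumptions: the data are synthetic (one numeric feature, two classes), the learner is a simple k-nearest-neighbor classifier, and the split sizes and candidate values of k are arbitrary choices, none of which come from the text.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label (0 or 1) among the k training points nearest to x."""
    neighbors = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    votes_for_one = sum(label for _, label in neighbors)
    return 1 if 2 * votes_for_one > k else 0

def accuracy(train, data, k):
    """Fraction of examples in `data` that a k-NN model built on `train` labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def make_example():
    """Synthetic (feature, label) pair: class 1 is centered at 2.0, class 0 at 0.0."""
    label = rng.randint(0, 1)
    return (rng.gauss(2.0 * label, 1.0), label)

# Gather examples, then hold out a validation set and a separate test set.
examples = [make_example() for _ in range(400)]
train, validation, test = examples[:200], examples[200:300], examples[300:]

# Adjust the control parameter k by optimizing accuracy on the validation set.
candidate_ks = [1, 5, 15, 31]  # odd values avoid tied votes
best_k = max(candidate_ks, key=lambda k: accuracy(train, validation, k))

# Evaluate the final choice on the test set, which played no role in training or tuning.
test_accuracy = accuracy(train, test, best_k)
print(best_k, round(test_accuracy, 3))
```

Keeping the test set out of both training and parameter selection is what makes the final accuracy an estimate of performance on unseen data.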
Algorithm choice
A wide range of supervised learning algorithms are available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).
There are four major issues to consider in supervised learning:
Bias-variance tradeoff
A first issue is the tradeoff between bias and variance.^{[3]} Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input $x$ if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for $x$. A learning algorithm has high variance for a particular input $x$ if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.^{[4]} Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).
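The thought experiment with several training sets can be simulated directly. The sketch below is an illustration under assumptions not found in the text: the "true" function is $x^2$ on $[0, 2]$, the outputs carry Gaussian noise, and the learner is k-nearest-neighbor regression, where k = 1 plays the role of a flexible learner and k = 25 of an inflexible one.

```python
import random
import statistics

def true_function(x):
    """Assumed ground-truth function for this illustration."""
    return x * x

def sample_training_set(rng, n=50, noise_sd=0.3):
    """Draw one training set of n noisy (x, y) pairs with x uniform on [0, 2]."""
    xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
    return [(x, true_function(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def knn_regress(train, x, k):
    """Predict the mean output of the k training points nearest to x."""
    nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def bias_and_variance(k, x0=0.2, n_sets=200, seed=0):
    """Estimate bias and variance of the k-NN prediction at the single input x0."""
    rng = random.Random(seed)
    predictions = [knn_regress(sample_training_set(rng), x0, k)
                   for _ in range(n_sets)]
    bias = statistics.mean(predictions) - true_function(x0)
    variance = statistics.pvariance(predictions)
    return bias, variance

bias_flexible, var_flexible = bias_and_variance(k=1)   # low bias, high variance
bias_rigid, var_rigid = bias_and_variance(k=25)        # high bias, low variance
print(bias_flexible, var_flexible)
print(bias_rigid, var_rigid)
```

With k = 1 the prediction tracks the noise of whichever point happens to be nearest, so it varies widely across training sets; with k = 25 the average is stable but pulled toward values of the function far from the query point.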
Function complexity and amount of training data
The second issue is the amount of training data available relative to the complexity of the "true" function (classifier or regression function). If the true function is simple, then an "inflexible" learning algorithm with high bias and low variance will be able to learn it from a small amount of data. But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a "flexible" learning algorithm with low bias and high variance.
Dimensionality of the input space
A third issue is the dimensionality of the input space. If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias. In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function. In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.
Noise in the output values
A fourth issue is the degree of noise in the desired output values (the supervisory target variables). If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. Attempting to fit the data too carefully leads to overfitting. You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model. In such a situation, the part of the target function that cannot be modeled "corrupts" your training data; this phenomenon has been called deterministic noise. When either type of noise is present, it is better to go with a higher bias, lower variance estimator.
In practice, there are several approaches to alleviating noise in the output values, such as early stopping to prevent overfitting and detecting and removing noisy training examples before training the supervised learning algorithm. Several algorithms identify noisy training examples, and removing the suspected noisy examples prior to training has been shown to decrease generalization error with statistical significance.^{[5]}^{[6]}
Other factors to consider
Other factors to consider when choosing and applying a learning algorithm include the following:
 Heterogeneity of the data. If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others. Many algorithms, including Support Vector Machines, linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1,1] interval). Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees is that they easily handle heterogeneous data.
 Redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance-based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.
 Presence of interactions and nonlinearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, Support Vector Machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support vector machines with Gaussian kernels) generally perform well. However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.
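The rescaling of numerical features to a common range, mentioned under heterogeneity of the data, can be sketched with a simple min-max transformation. This is a minimal illustration: the function name and the example columns (ages and incomes) are hypothetical, and other scaling schemes such as standardization are equally common.

```python
def scale_to_interval(column, low=-1.0, high=1.0):
    """Linearly rescale a list of numbers so its minimum maps to `low` and its maximum to `high`."""
    smallest, largest = min(column), max(column)
    if largest == smallest:
        # A constant feature carries no information; map it to the midpoint.
        return [(low + high) / 2.0 for _ in column]
    span = largest - smallest
    return [low + (high - low) * (value - smallest) / span for value in column]

# Two hypothetical features on very different scales.
ages = [20, 35, 50]
incomes = [10000, 55000, 100000]

scaled_ages = scale_to_interval(ages)
scaled_incomes = scale_to_interval(incomes)
print(scaled_ages)     # [-1.0, 0.0, 1.0]
print(scaled_incomes)  # [-1.0, 0.0, 1.0]
```

Before rescaling, a Euclidean distance between two applicants would be dominated almost entirely by income; afterwards both features contribute on the same scale. In practice the minimum and maximum should be computed on the training set only and then reused for new data.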
When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross-validation). Tuning the performance of a learning algorithm can be very time-consuming. Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.
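An experimental comparison of this kind might look like the following sketch, which estimates accuracy by k-fold cross-validation. The two learners (a majority-class baseline and a 1-nearest-neighbor classifier), the function names, and the tiny synthetic data set are all assumptions made for illustration.

```python
def cross_val_accuracy(fit, examples, n_folds=5):
    """Average held-out accuracy of `fit` over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        held_out = examples[fold::n_folds]
        train = [ex for i, ex in enumerate(examples) if i % n_folds != fold]
        model = fit(train)  # `fit` returns a prediction function
        correct = sum(model(x) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / n_folds

def fit_majority(train):
    """Baseline learner: always predict the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_one_nn(train):
    """1-nearest-neighbor learner: predict the label of the closest training input."""
    return lambda x: min(train, key=lambda ex: abs(ex[0] - x))[1]

# Synthetic data: the label is 1 exactly when the input is at least 10.
examples = [(float(i), int(i >= 10)) for i in range(20)]

majority_accuracy = cross_val_accuracy(fit_majority, examples)
one_nn_accuracy = cross_val_accuracy(fit_one_nn, examples)
print(majority_accuracy, one_nn_accuracy)
```

Because every example is held out exactly once, each learner is always scored on data it did not see during fitting, which makes the two averages comparable.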
Algorithms
The most widely used learning algorithms are:
 Support Vector Machines
 linear regression
 logistic regression
 naive Bayes
 linear discriminant analysis
 decision trees
 k-nearest neighbor algorithm
 Neural Networks (Multilayer perceptron)
 Similarity learning
How supervised learning algorithms work
Given a set of $N$ training examples of the form $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ such that $x_i$ is the feature vector of the $i$-th example and $y_i$ is its label (i.e., class), a learning algorithm seeks a function $g: X \to Y$, where $X$ is the input space and $Y$ is the output space. The function $g$ is an element of some space of possible functions $G$, usually called the hypothesis space. It is sometimes convenient to represent $g$ using a scoring function $f: X \times Y \to \mathbb{R}$ such that $g$ is defined as returning the $y$ value that gives the highest score: $g(x) = \arg\max_{y} f(x, y)$. Let $F$ denote the space of scoring functions.
Although $G$ and $F$ can be any space of functions, many learning algorithms are probabilistic models where $g$ takes the form of a conditional probability model $g(x) = P(y \mid x)$, or $f$ takes the form of a joint probability model $f(x, y) = P(x, y)$. For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model.
There are two basic approaches to choosing $f$ or $g$: empirical risk minimization and structural risk minimization.^{[7]} Empirical risk minimization seeks the function that best fits the training data. Structural risk minimization includes a penalty function that controls the bias/variance tradeoff.
In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs, $(x_i, y_i)$. In order to measure how well a function fits the training data, a loss function $L: Y \times Y \to \mathbb{R}^{\geq 0}$ is defined. For training example $(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is $L(y_i, \hat{y})$.
The risk $R(g)$ of function $g$ is defined as the expected loss of $g$. This can be estimated from the training data as
$$R_{emp}(g) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, g(x_i)).$$
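As a concrete illustration, the sketch below builds a classifier as the argmax of a scoring function over the labels and computes its empirical risk as the average loss over the training examples. The 0-1 loss, the prototype-based scoring function, and the four training examples are hypothetical choices made for this example.

```python
def zero_one_loss(y_true, y_pred):
    """Loss of 0 for a correct prediction and 1 for an incorrect one."""
    return 0.0 if y_true == y_pred else 1.0

def make_classifier(score, labels):
    """Build a classifier that returns the label with the highest score for x."""
    def classify(x):
        return max(labels, key=lambda y: score(x, y))
    return classify

def empirical_risk(classify, examples, loss):
    """Average loss of the classifier over the (x, y) training examples."""
    return sum(loss(y, classify(x)) for x, y in examples) / len(examples)

# Hypothetical scoring function: an input scores higher for the label
# whose prototype value is closer to it.
prototypes = {"low": 0.0, "high": 10.0}

def score(x, y):
    return -abs(x - prototypes[y])

classify = make_classifier(score, list(prototypes))
examples = [(1.0, "low"), (2.0, "low"), (9.0, "high"), (4.0, "high")]

risk = empirical_risk(classify, examples, zero_one_loss)
print(risk)  # 0.25: only the example (4.0, "high") is misclassified
```

With the 0-1 loss the empirical risk is simply the training error rate; other losses, such as squared error, plug into the same function unchanged.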
Empirical risk minimization
In empirical risk minimization, the supervised learning algorithm seeks the function $g$ that minimizes the empirical risk $R_{emp}(g)$. Hence, a supervised learning algorithm can be constructed by applying an optimization algorithm to find $g$.
When $g$ is a conditional probability distribution $P(y \mid x)$ and the loss function is the negative log likelihood, $L(y, \hat{y}) = -\log P(y \mid x)$, then empirical risk minimization is equivalent to maximum likelihood estimation.
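The equivalence with maximum likelihood estimation can be made explicit by expanding the empirical risk under the negative log likelihood loss. This is a short derivation sketch; it assumes $N$ independent and identically distributed training pairs $(x_i, y_i)$, a hypothesis space $G$, and writes $P_g(y \mid x)$ (a notation introduced here) for the conditional model represented by $g$.

```latex
\begin{aligned}
R_{emp}(g) &= \frac{1}{N} \sum_{i=1}^{N} L(y_i, g(x_i))
            = -\frac{1}{N} \sum_{i=1}^{N} \log P_g(y_i \mid x_i), \\
\arg\min_{g \in G} R_{emp}(g)
           &= \arg\max_{g \in G} \sum_{i=1}^{N} \log P_g(y_i \mid x_i)
            = \arg\max_{g \in G} \prod_{i=1}^{N} P_g(y_i \mid x_i).
\end{aligned}
```

The final product is the likelihood of the training labels given the inputs, so the minimizer of the empirical risk is the maximum likelihood estimate.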
When the hypothesis space $G$ contains many candidate functions or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization. The learning algorithm is able to memorize the training examples without generalizing well. This is called overfitting.
Structural risk minimization
Structural risk minimization seeks to prevent overfitting by incorporating a regularization penalty into the optimization. The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.
A wide variety of penalties have been employed that correspond to different definitions of complexity. For example, consider the case where the function $g$ is a linear function of the form
$$g(x) = \sum_{j=1}^{d} \beta_j x_j.$$
A popular regularization penalty is $\sum_j \beta_j^2$, which is the squared Euclidean norm of the weights, also known as the $L_2$ norm. Other norms include the $L_1$ norm, $\sum_j |\beta_j|$, and the $L_0$ "norm", which is the number of non-zero $\beta_j$s. The penalty will be denoted by $C(g)$.
The supervised learning optimization problem is to find the function $g$ that minimizes
$$J(g) = R_{emp}(g) + \lambda C(g).$$
The parameter $\lambda$ controls the bias-variance tradeoff. When $\lambda = 0$, this gives empirical risk minimization with low bias and high variance. When $\lambda$ is large, the learning algorithm will have high bias and low variance. The value of $\lambda$ can be chosen empirically via cross-validation.
The complexity penalty has a Bayesian interpretation as the negative log prior probability of $g$, $-\log P(g)$, in which case minimizing $J(g)$ amounts to maximizing the posterior probability of $g$ (when the loss is the negative log likelihood).
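The effect of the regularization parameter can be seen in the simplest possible case: a linear function with a single weight, squared-error loss, and a squared ($L_2$) penalty. Under these assumptions, which are chosen here for illustration, the penalized objective $\frac{1}{N}\sum_i (y_i - \beta x_i)^2 + \lambda \beta^2$ has the closed-form minimizer $\beta = \sum_i x_i y_i / (\sum_i x_i^2 + N\lambda)$, obtained by setting the derivative with respect to $\beta$ to zero.

```python
def fit_ridge_1d(xs, ys, lam):
    """Weight minimizing mean squared error plus lam * beta**2 for the model y = beta * x."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + n * lam)

# Noise-free data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

beta_unpenalized = fit_ridge_1d(xs, ys, lam=0.0)  # plain empirical risk minimization
beta_penalized = fit_ridge_1d(xs, ys, lam=7.5)    # heavy penalty shrinks the weight
print(beta_unpenalized)  # 2.0
print(beta_penalized)    # 1.0
```

With no penalty the fit recovers the generating weight exactly; as the penalty grows the weight is pulled toward zero, trading a systematically worse fit (higher bias) for less sensitivity to the particular training sample (lower variance).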
Generative training
The training methods described above are discriminative training methods, because they seek to find a function $g$ that discriminates well between the different output values (see discriminative model). For the special case where $f(x, y) = P(x, y)$ is a joint probability distribution and the loss function is the negative log likelihood $-\sum_i \log P(x_i, y_i)$, a risk minimization algorithm is said to perform generative training, because $f$ can be regarded as a generative model that explains how the data were generated. Generative training algorithms are often simpler and more computationally efficient than discriminative training algorithms. In some cases, the solution can be computed in closed form as in naive Bayes and linear discriminant analysis.
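A closed-form generative fit can be sketched for a one-dimensional model in the spirit of linear discriminant analysis. The assumptions here are illustrative: each class generates its input from a Gaussian with its own mean and a variance shared by all classes, so maximum likelihood training reduces to computing class frequencies, class means, and a pooled variance. The function names and data are hypothetical.

```python
import math

def fit_gaussian_generative(examples):
    """Maximum likelihood estimates of class priors, class means, and a shared variance."""
    n = len(examples)
    priors, means = {}, {}
    for label in sorted({y for _, y in examples}):
        xs = [x for x, y in examples if y == label]
        priors[label] = len(xs) / n
        means[label] = sum(xs) / len(xs)
    # Pooled variance: squared deviations from each example's own class mean.
    variance = sum((x - means[y]) ** 2 for x, y in examples) / n
    return priors, means, variance

def predict(model, x):
    """Return the label maximizing the log joint probability of (x, label)."""
    priors, means, variance = model

    def log_joint(label):
        # The Gaussian normalizing constant is identical for all labels, so it is omitted.
        return math.log(priors[label]) - (x - means[label]) ** 2 / (2.0 * variance)

    return max(priors, key=log_joint)

examples = [(1.0, "a"), (2.0, "a"), (3.0, "a"),
            (7.0, "b"), (8.0, "b"), (9.0, "b")]
model = fit_gaussian_generative(examples)

print(model[1])             # {'a': 2.0, 'b': 8.0}
print(predict(model, 4.0))  # a
print(predict(model, 6.5))  # b
```

No iterative optimization is involved: every parameter is a simple average over the training data, which is why generative training of this kind is computationally cheap.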
Generalizations
There are several ways in which the standard supervised learning problem can be generalized:
 Semi-supervised learning: In this setting, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled.
 Weak supervision: In this setting, noisy, limited, or imprecise sources are used to provide a supervision signal for labeling training data.
 Active learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often, the queries are based on unlabeled data, which is a scenario that combines semi-supervised learning with active learning.
 Structured prediction: When the desired output value is a complex object, such as a parse tree or a labeled graph, then standard methods must be extended.
 Learning to rank: When the input is a set of objects and the desired output is a ranking of those objects, then again the standard methods must be extended.
Approaches and algorithms
 Analytical learning
 Artificial neural network
 Backpropagation
 Boosting (meta-algorithm)
 Bayesian statistics
 Case-based reasoning
 Decision tree learning
 Inductive logic programming
 Gaussian process regression
 Genetic Programming
 Group method of data handling
 Kernel estimators
 Learning Automata
 Learning Classifier Systems
 Minimum message length (decision trees, decision graphs, etc.)
 Multilinear subspace learning
 Naive Bayes classifier
 Maximum entropy classifier
 Conditional random field
 Nearest Neighbor Algorithm
 Probably approximately correct (PAC) learning
 Ripple down rules, a knowledge acquisition methodology
 Symbolic machine learning algorithms
 Subsymbolic machine learning algorithms
 Support vector machines
 Minimum Complexity Machines (MCM)
 Random Forests
 Ensembles of Classifiers
 Ordinal classification
 Data Preprocessing
 Handling imbalanced datasets
 Statistical relational learning
 Proaftn, a multicriteria classification algorithm
Applications
 Bioinformatics
 Cheminformatics
 Database marketing
 Handwriting recognition
 Information retrieval
 Information extraction
 Object recognition in computer vision
 Optical character recognition
 Spam detection
 Pattern recognition
 Speech recognition
 Supervised learning is a special case of Downward causation in biological systems
General issues
 Computational learning theory
 Inductive bias
 Overfitting (machine learning)
 (Uncalibrated) Class membership probabilities
 Unsupervised learning
 Version spaces
References
 ^ Stuart J. Russell, Peter Norvig (2010) Artificial Intelligence: A Modern Approach, Third Edition, Prentice Hall ISBN 9780136042594.
 ^ Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar (2012) Foundations of Machine Learning, The MIT Press ISBN 9780262018258.
 ^ S. Geman, E. Bienenstock, and R. Doursat (1992). Neural networks and the bias/variance dilemma. Neural Computation 4, 1–58.
 ^ G. James (2003) Variance and Bias for General Loss Functions, Machine Learning 51, 115–135. (http://www-bcf.usc.edu/~gareth/research/bv.pdf)
 ^ C.E. Brodley and M.A. Friedl (1999). Identifying and Eliminating Mislabeled Training Instances, Journal of Artificial Intelligence Research 11, 131–167. (http://jair.org/media/606/live-606-1803-jair.pdf)
 ^ M.R. Smith and T. Martinez (2011). "Improving Classification Accuracy by Identifying and Removing Instances that Should Be Misclassified". Proceedings of International Joint Conference on Neural Networks (IJCNN 2011). pp. 2690–2697. CiteSeerX 10.1.1.221.1371. doi:10.1109/IJCNN.2011.6033571.
 ^ Vapnik, V. N. The Nature of Statistical Learning Theory (2nd Ed.), Springer Verlag, 2000.