- Artificial Intelligence for Big Data
- Anand Deshpande Manish Kumar
- 2021-06-25 21:57:12
Supervised and unsupervised machine learning
Machine learning at a broad level is categorized into two types: supervised and unsupervised learning. As the names indicate, this categorization is based on the availability of historical data, or the lack thereof. In simple terms, a supervised machine learning algorithm depends on training data, or a version of truth. This version of truth is used to generalize the model so that it can make predictions on new data points.
Let's understand this concept with the following example:
Figure 3.1 Simple training data: input (independent) and target (dependent) variables
Consider that the value of the y variable is dependent on the value of x. Based on a change in the value of x, there is a proportionate change in the value of y (think about any examples where the increase or decrease in the value of one factor proportionally changes the other).
Based on the data presented in the preceding table, it is clear that the value of y increases with an increase in the value of x. That means there is a direct relationship between x and y. In this case, x is called an independent, or input, variable and y is called a dependent, or target, variable. In this example, what will be the value of y when x is 220? At this point, let's understand a fundamental difference between traditional computer programming and machine learning when it comes to predicting the value of the y variable for a specific value of x=220. The following diagram shows the traditional programming process:
Figure 3.2 Traditional computer programming process
A traditional computer program has a predefined function that is applied to the input data to produce the output. In this example, a traditional computer program calculates the value of the output variable (y) as 562.
Have a look at the following diagram:
Figure 3.3 Machine learning process
In the case of supervised machine learning, the input and output data (training data) are used to create the program or the function. This is also termed the predictor function. A predictor function is used to predict the outcome of the dependent variable. In its simplest form, the process of defining the predictor function is called model training. Once a generalized predictor function is defined, we can predict the value of the target variable (y) corresponding to an input value (x). The goal of supervised machine learning is to develop a finely tuned predictor function, h(x), called the hypothesis. The hypothesis is a function that we believe (or hope) is similar to the true function, the target function that we want to model. Let's add some more data points and plot those on a two-dimensional chart, like the following diagram:
Figure 3.4 Supervised learning (linear regression)
We have plotted the input variable on the x axis and the target variable on the y axis. This is the general convention, which is why the input variable is termed x and the output variable is termed y. Once we plot the data points from the training data, we can visualize the correlation between them. In this case, there seems to be a direct proportion between x and y. To predict the value of y when x = 220, we can draw a straight line that tries to characterize, or model, the truth (training data). The straight line represents the predictor function, which is also termed a hypothesis.
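Drawing the straight line can be sketched in code with an ordinary least-squares fit, which is one common way to choose the line and not necessarily the method behind Figure 3.4. The training values below are hypothetical (only the point x = 150, y = 380 appears in the text), and the names `fit_line` and `make_hypothesis` are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x with y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_hypothesis(slope, intercept):
    """Build the predictor function h(x) for a given line."""
    def h(x):
        return slope * x + intercept
    return h


# Hypothetical training data, NOT the values from Figure 3.1.
train_x = [50, 100, 150, 200, 250]
train_y = [180, 240, 380, 390, 470]

slope, intercept = fit_line(train_x, train_y)
h = make_hypothesis(slope, intercept)

print(f"h(x) = {slope:.2f} * x + {intercept:.2f}")
print(f"Predicted y at x = 220: {h(220):.1f}")
```

With this made-up data the fitted line gives a prediction near 434 at x = 220, in the same neighborhood as the value read off the chart; different training data would give a different line.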
Based on the hypothesis, in this case our model predicts that the value of y when x = 220 will be ~430. While this hypothesis predicts the value of y for a certain value of x, the line that defines the predictor function does not pass through every training data point. For example, based on the training data, the value of y = 380 at x = 150. However, as per the hypothesis, the value comes out to be ~325. This difference is called the prediction error (~55 units in this case). Any training data point that does not fall on the predictor function has some prediction error based on the derived hypothesis. The sum of the errors across all the training data (typically their absolute or squared values, so that positive and negative errors do not cancel out) is a good measure of the model's accuracy. The primary goal of any supervised learning algorithm is to minimize this error while defining a hypothesis based on the training data.
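The error arithmetic can be checked with a short sketch. It assumes the hypothesis is the line h(x) = 1.5x + 100, which is reconstructed here from the two approximate readings in the text (~325 at x = 150 and ~430 at x = 220); the actual line in Figure 3.4 is not given. Only the training point (150, 380) comes from the text, and the other two points are hypothetical.

```python
def h(x):
    # Assumed hypothesis: the line through the approximate readings
    # (150, 325) and (220, 430) quoted in the text.
    return 1.5 * x + 100


def prediction_error(x, y_actual):
    """Signed difference between the observed and the predicted value."""
    return y_actual - h(x)


def sum_squared_error(points):
    """Total squared error, so positive and negative errors do not cancel."""
    return sum(prediction_error(x, y) ** 2 for x, y in points)


# (150, 380) is from the text; the remaining points are hypothetical.
training_points = [(150, 380), (100, 240), (200, 410)]

print(prediction_error(150, 380))          # 55.0
print(sum_squared_error(training_points))  # 3225.0
```

Squaring is one common choice of error measure; summing absolute errors would serve the same purpose of keeping errors of opposite sign from offsetting each other.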
A straight-line hypothesis function works well as an illustration. In practice, however, there are usually multiple input variables that influence the output variable, and a good predictor function with minimal error is rarely a simple straight line. Predicting the value of a continuous output variable at a certain value of the input variable is called regression. In certain cases, the historical data, or version of truth, is also used to separate data points into discrete sets (class, type, category). This is termed classification. For example, an email can be flagged as spam or not spam based on the training data. In the case of classification, the classes are known and predefined. The following image shows classification with a Decision Boundary:
Figure 3.5 Classification with Decision Boundary
Here is a two-dimensional training dataset, where the data points belonging to different classes are separated by a Decision Boundary. Classification is a supervised learning technique that defines the Decision Boundary so that there is a clear separation between the classes.
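As a minimal sketch of how a decision boundary can come out of labeled data, the following uses a nearest-centroid classifier. This is one simple technique chosen for illustration, not necessarily the one behind Figure 3.5. Each class is summarized by the mean of its training points, and a new point gets the label of the closest mean, so with two classes the implied boundary is the straight line halfway between the two centroids. The data points and the class names "A" and "B" are hypothetical.

```python
def centroid(points):
    """Mean position of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def train(labeled_points):
    """Group the training points by label and compute one centroid per class."""
    by_class = {}
    for point, label in labeled_points:
        by_class.setdefault(label, []).append(point)
    return {label: centroid(points) for label, points in by_class.items()}


def classify(model, point):
    """Return the label whose centroid is closest to the point."""
    def squared_distance(center):
        return (point[0] - center[0]) ** 2 + (point[1] - center[1]) ** 2
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical two-dimensional training data with predefined classes.
training_data = [
    ((1, 1), "A"), ((2, 1), "A"), ((1, 2), "A"),
    ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B"),
]

model = train(training_data)
print(classify(model, (2, 2)))  # A
print(classify(model, (5, 4)))  # B
```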
Regression and classification, as discussed in this section, require historical data to make predictions about the new data points. These represent supervised learning techniques. The generic process of supervised machine learning can be represented as follows:
Figure 3.6 Generic supervised learning process
The labeled data, or the version of truth, is split into training and validation sets with random sampling. Typically, an 80-20 split is used, with 80% of the data in the training set and 20% in the validation set. The training set is used for training the model (curve fitting) to reduce the overall prediction error. The model is then checked for accuracy against the validation set. The model is further tuned until it meets the accuracy threshold and is then used to predict the dependent variables for new data.
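The random 80-20 split can be sketched with the standard library alone. The function name `train_validation_split`, the fixed seed, and the generated records are assumptions made for illustration; real labeled data would take the place of `labeled`.

```python
import random


def train_validation_split(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it into two parts."""
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    shuffled = list(records)    # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical labeled data: 20 (x, y) pairs.
labeled = [(x, 1.5 * x + 100) for x in range(0, 100, 5)]

train_set, validation_set = train_validation_split(labeled)
print(len(train_set), len(validation_set))  # 16 4
```

The model would then be fitted on `train_set` only, with `validation_set` held back to check how well the hypothesis generalizes to data it has not seen.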
With this background in machine learning, let's take a deep dive into various techniques of supervised and unsupervised machine learning.