Locally Weighted Regression in Machine Learning
First of all, we need to review two topics: parametric and non-parametric learning algorithms.
1-Parametric Learning Algorithm:
An algorithm that has a fixed number of parameters that are fit to the data. Linear regression is an example of a parametric learning algorithm.
When we have a fixed number of parameters, everything is clear: we just need to minimize the prediction error, for example with the gradient descent algorithm.
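As a minimal sketch of that idea, here is batch gradient descent for linear regression; the tiny dataset, the learning rate, and the iteration count below are made up for illustration:

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x plus a little noise
X = np.c_[np.ones(5), np.arange(1.0, 6.0)]        # intercept column + one feature
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

theta = np.zeros(X.shape[1])                      # a fixed number of parameters
alpha = 0.01                                      # learning rate
for _ in range(5000):
    gradient = X.T @ (X @ theta - y) / len(y)     # gradient of the mean squared error
    theta -= alpha * gradient

print(theta)                                      # roughly [0.14, 1.96]: intercept and slope
```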
In the picture above, you can see that we have normally distributed data with a specific and limited number of independent variables (parameters).
The basic assumption of linear regression is that the data must be linearly distributed. But what if the data is not linearly distributed? Can we still apply the idea of regression? The answer is ‘yes’, but how?
First, let’s define a non-parametric learning algorithm, and then we will go through the model that solves the problem above.
2-Non-Parametric Learning Algorithm:
Formal definition: The number of parameters grows with the size of the training set.
Informal definition: The learning algorithm needs to keep the entire training set around, even after training, in order to make predictions.
In the picture above, it’s clear that the data is not linearly distributed. If the prediction were calculated as before (as we did in linear regression), we would get a weak prediction with a large prediction error, as is obvious from the red line, which shows the linear regression fit.
In this case we can still apply a regression model: it is called locally weighted regression. We can apply LOESS or LOWESS (locally weighted scatterplot smoothing) when the relationship between the independent and dependent variables is non-linear, as in the picture above.
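If you only need the smoothed curve, the statsmodels library already ships a LOWESS smoother. A quick sketch might look like this; the noisy sine data is purely illustrative:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical non-linear data: a noisy sine curve
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# frac is the fraction of the data used for each local fit (the "neighborhood" size)
smoothed = lowess(y, x, frac=0.2)      # returns (x, smoothed y) pairs, sorted by x
x_fit, y_fit = smoothed[:, 0], smoothed[:, 1]
```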
Locally Weighted Regression:
This is an algorithm that allows us to worry a bit less about having to choose features very carefully.
Locally weighted learning methods are non-parametric, and the prediction is computed by local functions. The basic idea behind LWR is that instead of building a global model for the whole function space, a local model is created for each point of interest, based on data neighboring the query point. For this purpose, each data point is assigned a weighting factor that expresses its influence on the prediction. In general, data points in the close neighborhood of the current query point receive a higher weight than data points that are far away. LWR is also called lazy learning because the processing of the training data is deferred until a query point needs to be answered. This approach makes LWR a very accurate function-approximation method to which it is easy to add new training points.
Example:
Suppose that you want to evaluate the hypothesis h at a certain query point x, as we did in linear regression.
LR: we should fit θ to minimize
Σᵢ (y(i) − θᵀx(i))²
then return θᵀx.
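As a sketch of that two-step recipe in code (fit θ once on the whole training set, then predict), assuming X already carries an intercept column:

```python
import numpy as np

def lr_predict(X, y, x_query):
    """Ordinary linear regression: fit theta globally, then return theta^T x."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the sum of squared errors
    return x_query @ theta
```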
Now, what should we do in LWR?
First, we need to find the neighboring points (a little bit larger and a little bit smaller than x).
Then we apply weights to these neighbors and do linear regression over this limited area.
Let’s do it in action:
we should fit θ to minimize
Σᵢ w(i) (y(i) − θᵀx(i))²
then return θᵀx
Here, the w(i)’s are non-negative valued weights. Intuitively, if w(i) is large for a particular value of i, then in picking θ, we’ll try hard to make (y(i) − θᵀx(i))² small. If w(i) is small, then the (y(i) − θᵀx(i))² error term will be pretty much ignored in the fit. A fairly standard choice for the weights is:
w(i) = exp(−(x(i) − x)² / (2τ²))
Note that the weights depend on the particular point x at which we’re trying to evaluate h. Moreover, if |x(i) − x| is small, then w(i) is close to 1; and if |x(i) − x| is large, then w(i) is small. Hence, θ is chosen giving a much higher “weight” to the (errors on) training examples close to the query point x. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the w(i)’s do not directly have anything to do with Gaussians; in particular, the w(i) are not random variables, normally distributed or otherwise.) The parameter τ controls how quickly the weight of a training example falls off with the distance of its x(i) from the query point x; τ is called the bandwidth parameter.
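Putting all of this together, here is a from-scratch sketch of locally weighted regression with the Gaussian-shaped weights and the bandwidth τ described above; the function name and the toy sine data are my own, for illustration only:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict at x_query by fitting a weighted linear regression around it.

    X       : (m, n) training inputs, first column = 1 for the intercept
    y       : (m,)   training targets
    x_query : (n,)   query point, also with a leading 1
    tau     : bandwidth; smaller tau makes the weights fall off faster with distance
    """
    # w(i) = exp(-|x(i) - x|^2 / (2 tau^2)): nearby points get weights close to 1
    diff = X - x_query
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * tau ** 2))

    # Fit theta to minimize sum_i w(i) * (y(i) - theta^T x(i))^2
    # via the weighted normal equation: (X^T W X) theta = X^T W y
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Toy usage on noisy non-linear data (hypothetical values)
rng = np.random.default_rng(0)
x_raw = np.linspace(0, 10, 100)
y = np.sin(x_raw) + rng.normal(scale=0.1, size=x_raw.size)
X = np.c_[np.ones_like(x_raw), x_raw]

print(lwr_predict(X, y, np.array([1.0, 5.0]), tau=0.3))   # close to sin(5) ≈ -0.96
```

Unlike plain linear regression, θ here has to be recomputed for every new query point x, which is exactly why LWR is a non-parametric, lazy method: the whole training set must be kept around at prediction time.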