Machine Learning 05 – Probabilistic Interpretation of Least Squares Method

In this post, we will show why the least squares method takes the form it does. To do this, we use probability theory and make some probabilistic assumptions about our variables.

Let's say that every example in our training data satisfies the following rule: $y^{(i)} = h_\theta(x^{(i)}) + \varepsilon^{(i)}$.

Here $\varepsilon^{(i)}$ is an error value: the difference between the actual value $y^{(i)}$ and our prediction $h_\theta(x^{(i)})$. The smaller the error value is in magnitude, the better our prediction.

We may assume that the error value is normally distributed because the learning model does not take every possible feature into consideration. If our training data is about house prices, for example, we might miss features like the buyer's and seller's mood, whether the house is next to a nice place, and so on. If we assume that these missing features act independently of one another, then by the central limit theorem the accumulation of this "noise" (the error value) is approximately Gaussian.
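The central limit argument can be checked with a small simulation. The sketch below is only an illustration under assumed numbers: each error is modelled as the sum of 50 independent uniform effects, a hypothetical stand-in for the unmodelled features, and we check how closely the result matches one property of a Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each error is the sum of 50 small, independent,
# non-Gaussian effects (uniform on [-1, 1]) standing in for missing features.
n_examples, n_effects = 100_000, 50
effects = rng.uniform(-1.0, 1.0, size=(n_examples, n_effects))
errors = effects.sum(axis=1)

# For a Gaussian, about 68.3% of the mass lies within one standard
# deviation of the mean. Check how close the simulated errors come.
mean, std = errors.mean(), errors.std()
within_one_std = np.mean(np.abs(errors - mean) < std)

print(f"mean = {mean:.3f}, std = {std:.3f}")
print(f"fraction within one std = {within_one_std:.3f}")
```

Although no single effect is Gaussian, the summed errors are centred near zero and the fraction within one standard deviation lands close to the Gaussian value.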

\varepsilon^{(i)} \sim N(0, \sigma^2)

The expression above says that the error value follows a normal distribution with mean $0$ and variance $\sigma^2$.

We can write out the density function of this normal distribution as follows.

P(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}\Big)
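As a quick sanity check, the density above can be integrated numerically to confirm that it has total mass 1, mean 0 and variance $\sigma^2$. This is a minimal sketch; the value of `sigma` and the grid are arbitrary choices made for illustration.

```python
import numpy as np

sigma = 1.5  # hypothetical value, any positive number works

# Evaluate the density on a fine, symmetric grid covering +/- 10 sigma.
eps = np.linspace(-10 * sigma, 10 * sigma, 200_001)
d_eps = eps[1] - eps[0]
density = np.exp(-eps**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Simple Riemann sums approximate the integrals.
total_mass = np.sum(density) * d_eps
mean = np.sum(eps * density) * d_eps
variance = np.sum(eps**2 * density) * d_eps

print(total_mass, mean, variance)
```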

The error value can be expressed in terms of $y^{(i)}$ and $h_\theta(x^{(i)})$ simply as $\varepsilon^{(i)}=y^{(i)}-h_\theta(x^{(i)})$. Now let's substitute this into our density function.

P(y^{(i)}|x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)}-h_\theta(x^{(i)}))^2}{2\sigma^2}\Big)

The expression above says that the probability density of $y^{(i)}$, conditioned on $x^{(i)}$ and parameterized by $\theta$, is equal to the function on the right. It's important to use a semicolon before $\theta$, because a comma would mean that the density of $y^{(i)}$ is conditioned on both $x^{(i)}$ and $\theta$. But the vector $\theta$ is just a parameter, not a random variable on which we condition.

Let's write out the "hypothesis" (the prediction function) using its definition $h_\theta(x^{(i)})=\theta^T x^{(i)}$.

P(y^{(i)}|x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big)
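To make the formula concrete, here is a minimal sketch that evaluates this density for a single training example. The function name `gaussian_density` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_density(y, x, theta, sigma):
    """Density of y given x, assuming y = theta^T x + eps with eps ~ N(0, sigma^2)."""
    residual = y - theta @ x
    return np.exp(-residual**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical example values.
theta = np.array([1.0, 2.0])  # intercept and slope
x = np.array([1.0, 3.0])      # the leading 1 is the intercept term
sigma = 0.5

# The prediction is theta^T x = 7.0.
p_exact = gaussian_density(7.0, x, theta, sigma)  # y equals the prediction
p_off = gaussian_density(8.0, x, theta, sigma)    # y is two sigma away

print(p_exact, p_off)
```

The density is largest when $y^{(i)}$ equals the prediction and falls off as the squared residual grows, which is where the squares in least squares will come from.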

We can now define a function $L(\theta)$ as the likelihood of the parameter $\theta$, which will turn out to be the product of the probabilities of all training examples.

L(\theta)=P(\vec{y}|X;\theta)

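Notice that we used the vector $\vec{y}$ of all outputs and, accordingly, the matrix $X$ of all training inputs in our probability function. If we additionally assume that the error values $\varepsilon^{(i)}$ are independent of one another, this joint probability factors into a product over the $m$ training examples.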

L(\theta)=\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^m\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big)

Based on our training data, we want to choose $\theta$ to achieve the maximum likelihood. In other words, we want to maximize $L(\theta)$.

Instead of maximizing $L(\theta)$ directly, we can maximize any function that strictly increases as $L(\theta)$ increases, such as its logarithm, $l(\theta)=\log L(\theta)$. This is especially useful because logarithms are convenient to work with, for example when taking derivatives or applying logarithm rules (as below).
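There is also a practical reason to prefer the logarithm, which the following sketch illustrates on hypothetical synthetic data: a product of many densities quickly underflows in floating-point arithmetic, while the sum of their logarithms stays well behaved.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical synthetic data: m examples from y = theta^T x + Gaussian noise.
m, sigma = 2000, 1.0
theta = np.array([0.5, -1.5])
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = X @ theta + rng.normal(scale=sigma, size=m)

# Per-example densities P(y_i | x_i; theta).
residuals = y - X @ theta
densities = np.exp(-residuals**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

likelihood = np.prod(densities)             # underflows to 0.0 in float64
log_likelihood = np.sum(np.log(densities))  # stays finite

print(likelihood, log_likelihood)
```

Every density here is below 0.4, so the product of 2000 of them is far smaller than the smallest representable float64, whereas the log-likelihood is an ordinary negative number.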

l(\theta)=\log L(\theta)=\log\prod_{i=1}^m\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big)

=\sum_{i=1}^m \log \Bigg(\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big)\Bigg)

(Above we used the logarithm rule $\log(a \cdot b \cdots) = \log(a)+\log(b)+\dots$ to turn the product into a sum. Applying the same rule to the two factors inside each term gives:)

=\sum_{i=1}^m \Bigg[\log \frac{1}{\sqrt{2\pi}\,\sigma} + \log\Bigg(\exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big) \Bigg)\Bigg]

Since $\log \frac{1}{\sqrt{2\pi}\,\sigma}$ is a constant that gets added $m$ times, we can pull it out of the sum. Also, the logarithm and the exponential cancel each other out. So we are left with the following:

=m \log\frac{1}{\sqrt{2\pi}\,\sigma}+\sum_{i=1}^m\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big)

The minus sign and the factor $\frac{1}{2\sigma^2}$ in our summation are constants, so we can pull them out in front of the summation.

=m \log\frac{1}{\sqrt{2\pi}\,\sigma}-\frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^T x^{(i)})^2
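The algebra above can be verified numerically. The sketch below, on hypothetical synthetic data and with an arbitrary candidate `theta`, computes the log-likelihood once directly as a sum of log densities and once with the simplified expression, and the two should agree up to rounding.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical synthetic data.
m, sigma = 50, 0.8
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=sigma, size=m)

theta = np.array([1.5, -0.5])  # an arbitrary candidate, not the optimum
residuals = y - X @ theta

# Direct form: sum of the log of each Gaussian density.
densities = np.exp(-residuals**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
direct = np.sum(np.log(densities))

# Simplified form: constant term minus (1 / sigma^2) * (1/2) * sum of squares.
half_sum_sq = 0.5 * np.sum(residuals**2)
simplified = m * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - half_sum_sq / sigma**2

print(direct, simplified)
```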

Looking at this function, the first term does not depend on $\theta$ and the factor $\frac{1}{\sigma^2}$ is positive, so to maximize the function we have to minimize the sum. If you remember $J(\theta)$ from my second blog post, you will recognize that it is exactly the $\frac{1}{2}\sum$ term in our log-likelihood function. There we also tried to minimize it (using the gradient descent algorithm).

J(\theta)=\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^T x^{(i)})^2
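The equivalence can also be observed numerically. In the sketch below, the data are hypothetical, and `np.linalg.lstsq` is swapped in for gradient descent as a convenient way to minimize $J(\theta)$. It fits $\theta$ by least squares and then checks that randomly perturbed parameter vectors never reach a higher log-likelihood, whichever $\sigma$ is assumed.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical synthetic data.
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])
y = X @ np.array([1.0, 3.0]) + rng.normal(scale=2.0, size=m)

def log_likelihood(theta, sigma):
    """Log-likelihood of theta under Gaussian errors with standard deviation sigma."""
    residuals = y - X @ theta
    return (m * np.log(1 / (np.sqrt(2 * np.pi) * sigma))
            - np.sum(residuals**2) / (2 * sigma**2))

# Least squares fit: the theta that minimizes J(theta).
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Compare against 1000 randomly perturbed parameter vectors.
perturbed = theta_ls + rng.normal(scale=0.5, size=(1000, 2))
for sigma in (0.5, 2.0, 10.0):
    best = log_likelihood(theta_ls, sigma)
    others = np.array([log_likelihood(t, sigma) for t in perturbed])
    print(f"sigma = {sigma}: least squares fit is best -> {best >= others.max()}")
```

Note that the assumed $\sigma$ changes the value of the log-likelihood but not which $\theta$ maximizes it, so we never need to know $\sigma$ to fit the parameters.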

To sum up, we showed that minimizing the least squares cost is equivalent to maximizing our likelihood function, assuming the error values are independent and normally distributed as described above.