# Machine Learning 02 – Gradient Descent

Linear regression is one of the most basic tasks in machine learning.

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.

We will use the following training data (these are just made-up points):

• features = [ 3, 2, 1, 5, 4, 7, 9 ]
• targets = [ 4, 1, 4, 8, 21, 20, 26 ]

The solution to our problem is the function below (also called the hypothesis h(x)). To get to this solution we will use a mathematical model called least squares combined with the optimization algorithm gradient descent.

$h_\theta(x) = \theta_0 + \theta_1 x$

We are searching for the ideal $\theta_0$ and $\theta_1$ pair.
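For example, with $\theta_0 = 1$ and $\theta_1 = 2$ the hypothesis predicts $h(3) = 1 + 2 \cdot 3 = 7$ for the first feature, while its target is $4$, an error of $3$. The goal is to find the pair that keeps such errors small across all seven points.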

A couple of variable definitions:

m … number of training examples (in our case m = 7)
x … input variables = features
y … output variables = targets
(x, y) … training example
$(x^{(i)}, y^{(i)})$ … $i^{th}$ training example

We can define $\theta$ to be the vector of both $\theta_0$ and $\theta_1$, so $\theta = \binom{\theta_0}{\theta_1}$.

Then, using the least squares method, we define the cost function $J(\theta)=\frac{1}{2}\sum_{j=1}^{m}(h_\theta(x^{(j)})-y^{(j)})^2$.
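As a quick sanity check, $J$ can be evaluated directly in a few lines of Python (a minimal sketch using the made-up data from above; the full script comes later):

```python
features = [3, 2, 1, 5, 4, 7, 9]
targets = [4, 1, 4, 8, 21, 20, 26]

def J(theta0, theta1):
    # 1/2 * sum of squared errors of the hypothesis h(x) = theta0 + theta1*x
    return 0.5 * sum((theta0 + theta1 * x - y) ** 2
                     for x, y in zip(features, targets))

print(J(0, 0))  # 807.0, i.e. 1/2 * sum of y^2
```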

Using the gradient of J with respect to $\theta_i$ and updating $\theta_i$ accordingly will move us towards a local minimum.

Update rule:

$\theta_i:=\theta_i-\alpha\frac{\partial}{\partial\theta_i}J(\theta)$

$\alpha$ … learning rate of our learning algorithm (we will use $\alpha=0.01$)

Choosing a good learning rate is crucial for the learning algorithm, because one that is too large can cause the $\theta_i$ to jump around and never reach a minimum. If the learning rate is too small, the algorithm will take a long time to reach the minimum of J (= slow convergence).
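To see this effect in isolation, here is a tiny sketch on the toy function $J(\theta)=\theta^2$ (gradient $2\theta$, minimum at $\theta=0$), separate from our regression problem:

```python
# toy function J(theta) = theta^2, gradient 2*theta, minimum at theta = 0
def descend(alpha, steps=10):
    theta = 1.0
    for _ in range(steps):
        theta -= alpha * 2 * theta
    return theta

print(descend(0.1))  # ~0.107: steadily approaching the minimum
print(descend(1.1))  # ~6.19: overshooting, |theta| grows every step
```

With $\alpha=0.1$ every step shrinks $\theta$ towards the minimum; with $\alpha=1.1$ every step overshoots it and the iterates diverge.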

$\frac{\partial}{\partial\theta_i}J(\theta)$ is called the partial derivative of J with respect to $\theta_i$. We need to figure it out so we can implement it in a computer program.

$$\frac{\partial}{\partial\theta_i}J(\theta)=\frac{\partial}{\partial\theta_i}\frac{1}{2}\sum_{j=1}^{m}(h_\theta(x^{(j)})-y^{(j)})^2$$

Writing out the sum…

$$=\frac{\partial}{\partial\theta_i}\frac{1}{2}\Bigg[(h_\theta(x^{(1)})-y^{(1)})^2+(h_\theta(x^{(2)})-y^{(2)})^2+...+(h_\theta(x^{(m)})-y^{(m)})^2\Bigg]$$

We can pull the constant factor $\frac{1}{2}$ out in front of the derivative.

$$=\frac{1}{2}\frac{\partial}{\partial\theta_i}\Bigg[(h_\theta(x^{(1)})-y^{(1)})^2+(h_\theta(x^{(2)})-y^{(2)})^2+...+(h_\theta(x^{(m)})-y^{(m)})^2\Bigg]$$

To take the derivative of this sum, we take the derivative of each term and add them up (the derivative of a sum is the sum of the derivatives).

$$=\frac{1}{2}\Bigg[\frac{\partial}{\partial\theta_i}(h_\theta(x^{(1)})-y^{(1)})^2+\frac{\partial}{\partial\theta_i}(h_\theta(x^{(2)})-y^{(2)})^2+...+\frac{\partial}{\partial\theta_i}(h_\theta(x^{(m)})-y^{(m)})^2\Bigg]$$

Here we have to use the chain rule.

$$=\frac{1}{2}\Bigg[2(h_\theta(x^{(1)})-y^{(1)})\frac{\partial}{\partial\theta_i}h_\theta(x^{(1)})+...+2(h_\theta(x^{(m)})-y^{(m)})\frac{\partial}{\partial\theta_i}h_\theta(x^{(m)})\Bigg]$$

$$=(h_\theta(x^{(1)})-y^{(1)})\frac{\partial}{\partial\theta_i}h_\theta(x^{(1)})+...+(h_\theta(x^{(m)})-y^{(m)})\frac{\partial}{\partial\theta_i}h_\theta(x^{(m)})$$

Now let us see what $\frac{\partial}{\partial\theta_i}h_\theta(x)$ is for a single training example $x$, writing the hypothesis in its general form with $n$ features and the convention $x_0 = 1$…

$$\frac{\partial}{\partial\theta_i}h_\theta(x)=\frac{\partial}{\partial\theta_i}\Big(\theta_0x_0+...+\theta_ix_i+...+\theta_nx_n\Big)$$

Every term except $\theta_ix_i$ is just a constant, because we are differentiating with respect to $\theta_i$. The constant terms vanish, and the derivative of $\theta_ix_i$ is simply $x_i$, so $\frac{\partial}{\partial\theta_i}h_\theta(x)=x_i$.

Back to our derivative of J: we can now replace $\frac{\partial}{\partial\theta_i}h_\theta(x^{(j)})$ by its actual value $x_i^{(j)}$ (the $i$-th feature of the $j$-th training example).

$$\frac{\partial}{\partial\theta_i}J(\theta)=(h_\theta(x^{(1)})-y^{(1)})x_i^{(1)}+...+(h_\theta(x^{(m)})-y^{(m)})x_i^{(m)}$$

$$\frac{\partial}{\partial\theta_i}J(\theta)=\sum_{j=1}^{m}(h_\theta(x^{(j)})-y^{(j)})x_i^{(j)}$$
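Before trusting the formula, we can double-check the algebra numerically: a finite-difference approximation of the gradient should agree with the derived sum. A small standalone sketch (the test point $\theta = (0.5, 1.5)$ is arbitrary, chosen only for illustration):

```python
features = [3, 2, 1, 5, 4, 7, 9]
targets = [4, 1, 4, 8, 21, 20, 26]

def J(t0, t1):
    return 0.5 * sum((t0 + t1 * x - y) ** 2 for x, y in zip(features, targets))

# derived formula: sum over examples of (h(x) - y) * x_i
def gradient(t0, t1):
    d0 = sum(t0 + t1 * x - y for x, y in zip(features, targets))        # x_0 = 1
    d1 = sum((t0 + t1 * x - y) * x for x, y in zip(features, targets))  # x_1 = x
    return d0, d1

t0, t1, eps = 0.5, 1.5, 1e-6  # arbitrary test point
print(gradient(t0, t1))
print((J(t0 + eps, t1) - J(t0 - eps, t1)) / (2 * eps),
      (J(t0, t1 + eps) - J(t0, t1 - eps)) / (2 * eps))  # should match closely
```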

So our new update rule will be:

$\theta_i:=\theta_i-\alpha\sum_{j=1}^{m}(h_\theta(x^{(j)})-y^{(j)})x_i^{(j)}$.

We have $\theta_0$ and $\theta_1$, which means we will repeat updating both of them until we converge (= reach the optimal $\theta$, i.e. $J(\theta)$ reaches a minimum).
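For example, starting from $\theta_0=\theta_1=0$ the hypothesis predicts $0$ for every point, so $\frac{\partial}{\partial\theta_0}J=\sum_j(0-y^{(j)})=-84$ and $\frac{\partial}{\partial\theta_1}J=\sum_j(0-y^{(j)})x^{(j)}=-516$. With $\alpha=0.01$ the first simultaneous update therefore gives $\theta_0:=0-0.01\cdot(-84)=0.84$ and $\theta_1:=0-0.01\cdot(-516)=5.16$.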

I wrote a Python script to compute the optimal $\theta$ vector. Note that $\theta_0$ and $\theta_1$ became w0 and w1; the reason for choosing the letter w is that the $\theta$ values are sometimes also called weights.

```python
# h(x) = w0 + w1*x1 + w2*x2 + ... + wn*xn
# in this case: h(x) = w0 + w1*x1
features = [ 3, 2, 1, 5, 4, 7, 9 ]
targets = [ 4, 1, 4, 8, 21, 20, 26 ]
w = [ 0, 0 ] # weights w0 and w1, called Theta in the math above
x = [ 1, 1 ] # x0 = 1

# alpha = 0.01
learningRate = 0.01

# we should actually stop when the derivative of J(w) gets near zero,
# but we will use an iteration counter instead
iterations = 5000

def h(x):
    return w[0] + x * w[1]

# derivative of J = sum over j = 1..m of [ h(xj) - yj ] * xij
def derivativeJ(is_x0_soUseOneAsFactor):
    m = len(features)
    total = 0
    for j in range(0, m):
        total += (h(features[j]) - targets[j]) * (1 if is_x0_soUseOneAsFactor else features[j])
    return total
```
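To finish the script, the weights are updated in a loop. A minimal sketch (note that both partial derivatives are computed before either weight is changed, so the update is simultaneous):

```python
for _ in range(iterations):
    d0 = derivativeJ(True)   # partial derivative with respect to w0 (factor x0 = 1)
    d1 = derivativeJ(False)  # partial derivative with respect to w1 (factor x1)
    w[0] -= learningRate * d0
    w[1] -= learningRate * d1

print(w)  # the learned weights w0 and w1
```

With the data above this should converge to roughly w0 ≈ -1.37 and w1 ≈ 3.02, matching the exact least-squares solution of the normal equations.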