Machine Learning 03 – Normal Equation

In the last post I explained the math behind the Gradient Descent algorithm and implemented it in Python. In this part we will see how to compute \theta directly using the normal equation.

We defined J(\theta) to be half of the sum of the squared differences between the predictions and the target values:

J(\theta) = \frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2
We can define a few matrices that let us write J(\theta) in pure matrix notation.

Let's define the matrix X to be the matrix of our training data (the "input" values or features) with one training example per row, y the vector of our "output" or target values, and \theta the vector of the parameters of h_\theta(x).

X_{m,n+1} =\begin{pmatrix} \longleftarrow x^{(1)^T} \longrightarrow \\ \longleftarrow x^{(2)^T} \longrightarrow \\ \vdots \\ \longleftarrow x^{(m)^T} \longrightarrow \end{pmatrix}, y_{m,1} = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix} and \theta_{n+1,1} =\begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}
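As a quick illustration of these shapes (a minimal sketch with made-up numbers; I assume, as usual in this series, that every x^{(i)} carries an extra first component x_0 = 1, which is why X has n+1 columns):

import numpy as np

# Three training examples with two features each (made-up values)
features = np.array([[2.0, 3.0],
                     [1.0, 5.0],
                     [4.0, 2.0]])
m = features.shape[0]

X = np.hstack([np.ones((m, 1)), features])  # prepend the x_0 = 1 column, shape (m, n+1)
y = np.array([[4.0], [6.0], [5.0]])         # target values, shape (m, 1)
theta = np.zeros((X.shape[1], 1))           # parameters, shape (n+1, 1)

print(X.shape, y.shape, theta.shape)        # (3, 3) (3, 1) (3, 1)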

X\theta = \begin{pmatrix} x^{(1)^T}\theta \\ x^{(2)^T}\theta \\ \vdots \\ x^{(m)^T}\theta \end{pmatrix} = \begin{pmatrix} h_\theta(x^{(1)}) \\ h_\theta(x^{(2)}) \\ \vdots \\ h_\theta(x^{(m)}) \end{pmatrix}

X\theta - y = \begin{pmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{pmatrix}

Does this remind you of something? Yes, it's part of the function J(\theta), but without the squares and the sum. We can use the following rule to obtain the sum of squares: multiplying a transposed vector by the vector itself yields the sum of its squared elements. As you can see above, (X\theta-y) is just a vector, so (X\theta-y)^T(X\theta-y) is a scalar (a number), namely the sum of the squared elements of (X\theta-y). Adding a \frac{1}{2} in front of it gives us J(\theta):

J(\theta) = \frac{1}{2}(X\theta-y)^T(X\theta-y)
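We can verify numerically that this matrix form agrees with the original sum-of-squares definition (a small sketch with random, made-up data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 examples, 3 parameters
y = rng.normal(size=(5, 1))
theta = rng.normal(size=(3, 1))

# Matrix form: 1/2 (X theta - y)^T (X theta - y)
residual = X @ theta - y
J_vec = 0.5 * (residual.T @ residual).item()

# Original definition: half the sum of squared prediction errors
J_sum = 0.5 * sum(((X[i] @ theta - y[i]).item()) ** 2 for i in range(5))

print(np.isclose(J_vec, J_sum))  # True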
To get the optimal \theta we can set the derivative of J(\theta) with respect to \theta to zero and solve for \theta. So let's determine \triangledown_{\theta} J(\theta).

\triangledown_{\theta} J(\theta) = \triangledown_{\theta} \frac{1}{2}(X\theta-y)^T(X\theta-y)
= \frac{1}{2} \triangledown_{\theta}(X\theta-y)^T(X\theta-y)
= \frac{1}{2} \triangledown_{\theta}(\theta^TX^T-y^T)(X\theta-y)
= \frac{1}{2} \triangledown_{\theta}(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty)

(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty) is just a scalar, so we can wrap it in the trace function, because tr a = a for any scalar a.

= \frac{1}{2} \triangledown_{\theta}tr(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty)
= \frac{1}{2} (\triangledown_{\theta}tr \theta^TX^TX\theta-\triangledown_{\theta}tr \theta^TX^Ty -\triangledown_{\theta}tr y^TX\theta +\triangledown_{\theta}tr y^Ty)

The part \triangledown_{\theta}tr y^Ty will be zero, because y^Ty contains no \theta – it is just a constant, and constants vanish when we take the derivative. \theta^TX^Ty is a scalar, so we can transpose it – transposing a scalar gives back the same scalar (a^T=a) – to turn the \theta^T into a \theta: (\theta^TX^Ty)^T = y^TX\theta. On the term tr \theta^TX^TX\theta we use the cyclic trace rule tr ABC = tr CAB = tr BCA (always moving the last matrix to the front) to get tr \theta\theta^TX^TX.

= \frac{1}{2} (\triangledown_{\theta}tr \theta\theta^TX^TX-\triangledown_{\theta}tr y^TX\theta -\triangledown_{\theta}tr y^TX\theta)
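As an aside, the cyclic trace rule used above is easy to check numerically with random matrices of compatible shapes (a quick sketch, not part of the derivation):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))

t1 = np.trace(A @ B @ C)
t2 = np.trace(C @ A @ B)  # last matrix moved to the front
t3 = np.trace(B @ C @ A)  # and moved once more
print(np.isclose(t1, t2), np.isclose(t1, t3))  # True True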

The rule \triangledown_A tr ABA^TC = CAB + C^TAB^T can be used to take the derivative \triangledown_{\theta}tr \theta\theta^TX^TX. To make use of it, we insert an identity matrix I between \theta and \theta^T, which changes nothing. Now \theta becomes our A, the identity matrix I becomes our B, and X^TX becomes our C:

= \frac{1}{2} (\triangledown_{\theta}tr \theta I\theta^TX^TX-\triangledown_{\theta}tr y^TX\theta -\triangledown_{\theta}tr y^TX\theta)

Applying that rule with A = \theta, B = I and C = X^TX turns the first term into X^TX\theta I + X^TX\theta I. For the derivative \triangledown_{\theta}tr y^TX\theta we use the rule \triangledown_{A}tr BA = B^T, treating y^TX as B and \theta as A, which gives (y^TX)^T = X^Ty.

= \frac{1}{2} (X^TX\theta I+X^TX\theta I-X^Ty-X^Ty)

We can ignore the identity matrix I, because it doesn’t change the result of a multiplication.

= \frac{1}{2} (X^TX\theta+X^TX\theta-X^Ty-X^Ty)
= \frac{1}{2} (2X^TX\theta-2X^Ty)
= X^TX\theta-X^Ty

\triangledown_{\theta}J(\theta)=X^TX\theta-X^Ty
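Before solving for \theta, we can sanity-check this gradient formula against central finite differences (a sketch with random data; the step size 1e-6 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
y = rng.normal(size=(6, 1))
theta = rng.normal(size=(3, 1))

def J(t):
    r = X @ t - y
    return 0.5 * (r.T @ r).item()

grad = X.T @ X @ theta - X.T @ y  # analytic gradient from the derivation

# Numeric gradient, one component at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(theta.size):
    e = np.zeros_like(theta)
    e[i] = eps
    numeric[i] = (J(theta + e) - J(theta - e)) / (2 * eps)

print(np.allclose(grad, numeric, atol=1e-5))  # True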

Now we set \triangledown_{\theta}J(\theta) to zero and solve for \theta.

\triangledown_{\theta}J(\theta)=0
X^TX\theta-X^Ty=0
X^TX\theta=X^Ty
\theta=(X^TX)^{-1}X^Ty
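Putting it all together in NumPy (a minimal sketch with synthetic data; in practice you would use np.linalg.solve or np.linalg.pinv rather than explicitly inverting X^TX, which may be singular or ill-conditioned):

import numpy as np

rng = np.random.default_rng(3)
m = 50
features = rng.normal(size=(m, 2))
true_theta = np.array([[1.5], [-2.0], [0.5]])        # made-up "true" parameters
X = np.hstack([np.ones((m, 1)), features])           # add the x_0 = 1 column
y = X @ true_theta + 0.01 * rng.normal(size=(m, 1))  # targets with a little noise

# theta = (X^T X)^{-1} X^T y, computed by solving the linear system
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta.round(2))  # close to true_theta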