In the last post I explained the math behind the Gradient Descent algorithm and implemented it in Python. In this part we will see how to compute \theta directly using the normal equation.
We defined J(\theta) to be half of the sum of the squared differences between the predictions and the target values.
J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2

We can define some matrices and then write down J(\theta) using only those matrices.
Let's define the matrix X to be the matrix of our training data ("input" values or features), y a vector of our "output" or target values, and \theta a vector of the parameters of h_\theta(x).
X_{m,n+1} =\begin{pmatrix} \longleftarrow x^{(1)^T} \longrightarrow \\ \longleftarrow x^{(2)^T} \longrightarrow \\ \vdots \\ \longleftarrow x^{(m)^T} \longrightarrow \end{pmatrix}, y_{m,1} = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix} and \theta_{n+1,1} =\begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}
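In NumPy this setup could look like the following sketch. The data values are made up purely for illustration; the important part is that X gets an extra column of ones so that \theta_0 acts as the intercept:

```python
import numpy as np

# Made-up toy data: m = 4 examples, n = 1 feature.
features = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 2.5, 3.5, 4.0])

# Prepend a column of ones so that theta_0 acts as the intercept.
# X then has shape (m, n + 1), matching the definition above.
X = np.hstack([np.ones((features.shape[0], 1)), features])

theta = np.zeros(X.shape[1])  # theta has shape (n + 1,)
```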
X\theta = \begin{pmatrix} x^{(1)^T}\theta \\ x^{(2)^T}\theta \\ \vdots \\ x^{(m)^T}\theta \end{pmatrix} = \begin{pmatrix} h_\theta(x^{(1)}) \\ h_\theta(x^{(2)}) \\ \vdots \\ h_\theta(x^{(m)}) \end{pmatrix}

X\theta - y = \begin{pmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{pmatrix}

Does this remind you of something? Yes, it's part of the function J(\theta), just without the squares and the sum. We can use the following rule to get the sum of the squares: when you multiply a transposed vector by the vector itself, you get the sum of its squared elements. As you can see above, (X\theta-y) is just a vector, so (X\theta-y)^T(X\theta-y) is a scalar (a number), namely the sum of each element of (X\theta-y) squared.
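If you want to convince yourself of this rule numerically, here is a tiny check with random values (the shapes are arbitrary, and NumPy's @ is matrix multiplication):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # random values, only to check the identity
y = rng.normal(size=5)
theta = rng.normal(size=3)

residuals = X @ theta - y
lhs = residuals @ residuals      # (X theta - y)^T (X theta - y)
rhs = np.sum(residuals ** 2)     # sum of the squared elements
print(np.isclose(lhs, rhs))      # True
```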
Now let's add a \frac{1}{2} in front of it and we get J(\theta):

J(\theta)=\frac{1}{2}(X\theta-y)^T(X\theta-y)
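In code, the vectorized form replaces the explicit loop over the training examples. Here is a minimal sketch of both versions (assuming X, y and theta are NumPy arrays as in the toy example above); they should return the same value:

```python
import numpy as np

def cost_loop(theta, X, y):
    """J(theta) as the explicit sum over all m training examples."""
    total = 0.0
    for i in range(X.shape[0]):
        prediction = X[i] @ theta          # h_theta(x^(i)) = x^(i)^T theta
        total += (prediction - y[i]) ** 2
    return 0.5 * total

def cost_vectorized(theta, X, y):
    """J(theta) = 1/2 (X theta - y)^T (X theta - y)."""
    residuals = X @ theta - y
    return 0.5 * (residuals @ residuals)
```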
To get the optimal \theta we can set the derivative of J(\theta) with respect to \theta to zero and solve for \theta. So let's determine \triangledown_{\theta} J(\theta).

\triangledown_{\theta} J(\theta) = \triangledown_{\theta} \frac{1}{2}(X\theta-y)^T(X\theta-y) = \frac{1}{2} \triangledown_{\theta}(X\theta-y)^T(X\theta-y) = \frac{1}{2} \triangledown_{\theta}(\theta^TX^T-y^T)(X\theta-y) = \frac{1}{2} \triangledown_{\theta}(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty)

Since (\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty) is just a scalar, we can wrap it in the trace function, because tr a = a for any scalar a.
= \frac{1}{2} \triangledown_{\theta}tr(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty) = \frac{1}{2} (\triangledown_{\theta}tr \theta^TX^TX\theta-\triangledown_{\theta}tr \theta^TX^Ty -\triangledown_{\theta}tr y^TX\theta +\triangledown_{\theta}tr y^Ty)

The part \triangledown_{\theta}tr y^Ty will be zero, because y^Ty contains no \theta, so it's just a constant whose derivative vanishes. \theta^TX^Ty is a scalar, so we can transpose it – transposing a scalar gives back the same scalar (a^T=a) – to turn it into y^TX\theta. On the term tr \theta^TX^TX\theta we will use the trace rule tr ABC = tr CAB = tr BCA (always moving the last matrix to the front) to get tr \theta\theta^TX^TX.
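The cyclic trace rule is easy to check numerically. Here is a quick sketch with random matrices of compatible shapes, just as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))

print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # True
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))  # True
```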
= \frac{1}{2} (\triangledown_{\theta}tr \theta\theta^TX^TX-\triangledown_{\theta}tr y^TX\theta -\triangledown_{\theta}tr y^TX\theta)

The rule \triangledown_A tr ABA^TC = CAB + C^TAB^T can be used to take the derivative \triangledown_{\theta}tr \theta\theta^TX^TX. To make use of it, we put an identity matrix I between \theta and \theta^T; then \theta becomes our A, the identity matrix I becomes our B and X^TX becomes our C:
= \frac{1}{2} (\triangledown_{\theta}tr \theta I\theta^TX^TX-\triangledown_{\theta}tr y^TX\theta -\triangledown_{\theta}tr y^TX\theta)

To take the derivative \triangledown_{\theta}tr y^TX\theta we make use of the rule \triangledown_{A}tr BA = B^T, treating y^TX as B and \theta as A. Applying both rules gives:
= \frac{1}{2} (X^TX\theta I+X^TX\theta I-X^Ty-X^Ty)

We can drop the identity matrix I, because it doesn't change the result of the multiplication.
= \frac{1}{2} (X^TX\theta+X^TX\theta-X^Ty-X^Ty) = \frac{1}{2} (2X^TX\theta-2X^Ty) = X^TX\theta-X^Ty

\triangledown_{\theta}J(\theta)=X^TX\theta-X^Ty
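We can sanity-check this closed-form gradient against a finite-difference approximation of J(\theta). This is only a numerical check with made-up random data, but it is a good way to make sure no sign or transpose got lost in the derivation:

```python
import numpy as np

def gradient(theta, X, y):
    """Closed-form gradient: X^T X theta - X^T y."""
    return X.T @ X @ theta - X.T @ y

def cost(theta, X, y):
    residuals = X @ theta - y
    return 0.5 * (residuals @ residuals)

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
theta = rng.normal(size=3)

# Finite-difference approximation of the gradient, one component at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)

print(np.allclose(gradient(theta, X, y), numeric))  # True
```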
Now we want to set \triangledown_{\theta}J(\theta) to zero and solve for \theta:

\triangledown_{\theta}J(\theta)=0

X^TX\theta-X^Ty=0

X^TX\theta=X^Ty

\theta=(X^TX)^{-1}X^Ty
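Translated to NumPy, the normal equation becomes a one-liner. One small note: instead of inverting X^TX explicitly with np.linalg.inv, it is numerically safer to let np.linalg.solve handle the linear system. This sketch still assumes X^TX is invertible, i.e. the features are not linearly dependent (the function name normal_equation is just my own choice):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage with the toy data from above:
# theta = normal_equation(X, y)
```

Unlike gradient descent from the last post, there is no learning rate to pick and no iterations to run; the trade-off is that solving the (n+1) by (n+1) system gets expensive when the number of features is very large.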