In the last post I explained the math behind the Gradient Descent algorithm and implemented it in Python. In this part we will see how to compute \theta directly using the normal equation.
We defined J(\theta) to be half of the sum of the squared differences between the predictions and the target values.
J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2

We can define some matrices and then write down J(\theta) using only those matrices.
Let's define the matrix X to be the matrix of our training data ("input" values or features), y a vector of our "output" or target values, and \theta a vector of the parameters of h_\theta(x).
X_{m,n+1} =\begin{pmatrix} \longleftarrow x^{(1)^T} \longrightarrow \\ \longleftarrow x^{(2)^T} \longrightarrow \\ \vdots \\ \longleftarrow x^{(m)^T} \longrightarrow \end{pmatrix}, y_{m,1} = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix} and \theta_{n+1,1} =\begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}
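In NumPy this setup could look like the following sketch. The data values are made up purely for illustration; the important part is that X gets an extra column of ones so that \theta_0 acts as the intercept:

```python
import numpy as np

# Made-up toy data: m = 4 examples, n = 1 feature.
features = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 2.5, 3.5, 4.0])

# Prepend a column of ones so that theta_0 acts as the intercept.
# X then has shape (m, n + 1), matching the definition above.
X = np.hstack([np.ones((features.shape[0], 1)), features])

theta = np.zeros(X.shape[1])  # theta has shape (n + 1,)
```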
X\theta = \begin{pmatrix} x^{(1)^T}\theta \\ x^{(2)^T}\theta \\ \vdots \\ x^{(m)^T}\theta \end{pmatrix} = \begin{pmatrix} h_\theta(x^{(1)}) \\ h_\theta(x^{(2)}) \\ \vdots \\ h_\theta(x^{(m)}) \end{pmatrix}

X\theta - y = \begin{pmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{pmatrix}

Does this remind you of something? Yes, it's part of the function J(\theta), just without the squares and the sum. We can use the following rule to get the sum of the squares: when you multiply a transposed vector by the vector itself, you get the sum of its squared elements. As you can see above, (X\theta-y) is just a vector, so (X\theta-y)^T(X\theta-y) is a scalar (a number), namely the sum of each element of (X\theta-y) squared.
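If you want to convince yourself of this rule numerically, here is a tiny check with random values (the shapes are arbitrary, and NumPy's @ is matrix multiplication):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # random values, only to check the identity
y = rng.normal(size=5)
theta = rng.normal(size=3)

residuals = X @ theta - y
lhs = residuals @ residuals      # (X theta - y)^T (X theta - y)
rhs = np.sum(residuals ** 2)     # sum of the squared elements
print(np.isclose(lhs, rhs))      # True
```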
Now let's add a \frac{1}{2} in front of it and we get J(\theta):

J(\theta)=\frac{1}{2}(X\theta-y)^T(X\theta-y)
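In code, the vectorized form replaces the explicit loop over the training examples. Here is a minimal sketch of both versions (assuming X, y and theta are NumPy arrays as in the toy example above); they should return the same value:

```python
import numpy as np

def cost_loop(theta, X, y):
    """J(theta) as the explicit sum over all m training examples."""
    total = 0.0
    for i in range(X.shape[0]):
        prediction = X[i] @ theta          # h_theta(x^(i)) = x^(i)^T theta
        total += (prediction - y[i]) ** 2
    return 0.5 * total

def cost_vectorized(theta, X, y):
    """J(theta) = 1/2 (X theta - y)^T (X theta - y)."""
    residuals = X @ theta - y
    return 0.5 * (residuals @ residuals)
```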
To get the optimal \theta we can set the derivative of J(\theta) with respect to \theta to zero and solve for \theta. So let's determine \triangledown_{\theta} J(\theta).

\triangledown_{\theta} J(\theta) = \triangledown_{\theta} \frac{1}{2}(X\theta-y)^T(X\theta-y) = \frac{1}{2} \triangledown_{\theta}(X\theta-y)^T(X\theta-y) = \frac{1}{2} \triangledown_{\theta}(\theta^TX^T-y^T)(X\theta-y) = \frac{1}{2} \triangledown_{\theta}(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty)

Since (\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty) is just a scalar, we can wrap it in the trace function, because tr a = a for any scalar a.
= \frac{1}{2} \triangledown_{\theta}tr(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty) = \frac{1}{2} (\triangledown_{\theta}tr \theta^TX^TX\theta-\triangledown_{\theta}tr \theta^TX^Ty -\triangledown_{\theta}tr y^TX\theta +\triangledown_{\theta}tr y^Ty)

The part \triangledown_{\theta}tr y^Ty will be zero, because y^Ty contains no \theta, so it's just a constant whose derivative vanishes. \theta^TX^Ty is a scalar, so we can transpose it – transposing a scalar gives back the same scalar (a^T=a) – to turn it into y^TX\theta. On the term tr \theta^TX^TX\theta we will use the trace rule tr ABC = tr CAB = tr BCA (always moving the last matrix to the front) to get tr \theta\theta^TX^TX.
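The cyclic trace rule is easy to check numerically. Here is a quick sketch with random matrices of compatible shapes, just as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))

print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # True
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))  # True
```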
= \frac{1}{2} (\triangledown_{\theta}tr \theta\theta^TX^TX-\triangledown_{\theta}tr y^TX\theta -\triangledown_{\theta}tr y^TX\theta)

The rule \triangledown_A tr ABA^TC = CAB + C^TAB^T can be used to take the derivative \triangledown_{\theta}tr \theta\theta^TX^TX. To make use of it, we put an identity matrix I between \theta and \theta^T; then \theta becomes our A, the identity matrix I becomes our B and X^TX becomes our C:
= \frac{1}{2} (\triangledown_{\theta}tr \theta I\theta^TX^TX-\triangledown_{\theta}tr y^TX\theta -\triangledown_{\theta}tr y^TX\theta)

To take the derivative \triangledown_{\theta}tr y^TX\theta we make use of the rule \triangledown_{A}tr BA = B^T, treating y^TX as B and \theta as A. Applying both rules gives:
= \frac{1}{2} (X^TX\theta I+X^TX\theta I-X^Ty-X^Ty)

We can drop the identity matrix I, because it doesn't change the result of the multiplication.
= \frac{1}{2} (X^TX\theta+X^TX\theta-X^Ty-X^Ty) = \frac{1}{2} (2X^TX\theta-2X^Ty) = X^TX\theta-X^Ty

\triangledown_{\theta}J(\theta)=X^TX\theta-X^Ty
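We can sanity-check this closed-form gradient against a finite-difference approximation of J(\theta). This is only a numerical check with made-up random data, but it is a good way to make sure no sign or transpose got lost in the derivation:

```python
import numpy as np

def gradient(theta, X, y):
    """Closed-form gradient: X^T X theta - X^T y."""
    return X.T @ X @ theta - X.T @ y

def cost(theta, X, y):
    residuals = X @ theta - y
    return 0.5 * (residuals @ residuals)

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
theta = rng.normal(size=3)

# Finite-difference approximation of the gradient, one component at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)

print(np.allclose(gradient(theta, X, y), numeric))  # True
```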
Now we want to set \triangledown_{\theta}J(\theta) to zero and solve for \theta:

\triangledown_{\theta}J(\theta)=0

X^TX\theta-X^Ty=0

X^TX\theta=X^Ty

\theta=(X^TX)^{-1}X^Ty
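Translated to NumPy, the normal equation becomes a one-liner. One small note: instead of inverting X^TX explicitly with np.linalg.inv, it is numerically safer to let np.linalg.solve handle the linear system. This sketch still assumes X^TX is invertible, i.e. the features are not linearly dependent (the function name normal_equation is just my own choice):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage with the toy data from above:
# theta = normal_equation(X, y)
```

Unlike gradient descent from the last post, there is no learning rate to pick and no iterations to run; the trade-off is that solving the (n+1) by (n+1) system gets expensive when the number of features is very large.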