## Machine Learning 08 – Generalized Linear Models

We used probability theory – and an assumption that our variables were distributed on some specific distribution like the Bernoulli or the Gaussian distribution – to construct our „hypothesis“ function and our learning algorithms.

In this Blog entry we will see that both the Bernoulli and the Gaussian distribution, even others like Poisson and Multinomial distribution all belong to a family, called the exponential family

## The exponential family

For the exponential family you can define three functions $b(y)$$T(y)$ and $a(\eta)$ to get any distribution which belongs to this family. As already mentioned the Gaussian and the Bernoulli are in this family of distributions. The below formula (of the exponential family) defines a set of distributions, which are parameterized by $\eta$.

$$p(y; \eta) = b(y)*exp\big(\eta^T T(y)-a(\eta)\big)$$

The parameter $\eta$ is called the natural parameter. $T(y)$  is called the sufficient statistic.

### Choosing a, b and T for the Bernoulli

$$p(y; \phi)=\phi^y(1-\phi)^{1-y}$$

… show that the Bernoulli distribution is just a special case of the exponential family …

$b(y)=1$$T(y)=y$ and $a(\eta)=-log(1-\phi)$ and

$$\eta=log\frac{\phi}{1-\phi}$$

thus

$$=>\phi=\frac{1}{1+e^{-\eta}}$$

### Choosing a, b and T for the Gaussian

$p(y; \mu)=\frac{1}{\sqrt{2\pi}\sigma^2}exp\big(-\frac{1}{2}(y-\mu)^2\big)$

… show that the Bernoulli distribution is just a special case of the exponential family …

$b(y)=\frac{1}{\sqrt{2\pi}\sigma^2}exp\big(-\frac{1}{2}y^2\big)$$T(y)=y$ and $a(\eta)=-\frac{1}{2}\mu^2=-\frac{1}{2}\eta^2$ and

$$\eta=\mu$$

## Construct a Generalized Linear Model (GLM)

Now that we have chosen an exponential family distribution, how do we derive a GLM? For this we will make the following three assumptions.

1. $y|x;\theta \sim ExpFamily(\eta)$ – Our first assumption states, that $y$ is conditioned on $x$ and parameterized by $\theta$ are distributed by some Member of the exponential family.
2. Our second assumption says that given x we want to predict an output $h(x)=E[T(y)|x]$ which is the expected value of y. It basically says that we want to make a prediction based on the input data. The value of the prediction is the expected value of the probability distribution.
3. The third assumption, we will make will set our natural parameter to be linear to our input and parameters ($\eta=\theta^Tx$). This assumption will be very helpful, because it allows us later to construct the Generalized Linear Model later.

construction of the GLM coming soon…

## Machine Learning 07 – Newton’s Method

In the last couple of blog posts we used Gradient Descent / Gradient Ascent as our optimization algorithm. Gradient Descent works by computing the derivative and then moving in this direction by a small learning rate.

There are many optimization algorithms out there. In this post we will learn another one called Newtons Method.

The idea behind Newtons Method is finding the root of a function $f(x)$. It works by defining a random $x$-value like $x_1$. By putting a tangent in the upper function at the position of $x_1$ you can calculate the root of the tangent which is your $x_2$. This is called one iteration. After very few iterations the root of $f(x)$ will be determined very accurately.

Lets put the above paragraph in a mathematical notation. Let’s define the distance of $x_1$ and $x_2$ to be $\Delta$.

$$x_2 = x_1 - \Delta$$

The slope of the function of at a specific point can be expressed as $f'(x_1) = \frac{f(x_1)}{\Delta}$. So $\Delta$ can be expressed as $\Delta = \frac{f(x_1)}{f'(x_1)}$.

$$x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}$$

Let’s say the function f is not dependent on $x$ but on $\theta$ (notice, that $\theta$ here is a scalar not a vector). We want to find the $\theta$ so that $f(\theta)=0$. Then we can change our iterations to use $\theta$.

$$\theta^{(2)} =\theta^{(1)} - \frac{f(\theta^{(1)})}{f'(\theta^{(1)})}$$

In more general terms:

$$\theta^{(t+1)} =\theta^{(t)} - \frac{ f(\theta^{(t)}) }{ f'(\theta^{(t)}) }$$.

Now we can use this algorithm to maximize our log-Likelihood function $l(\theta)$ from the previous blog posts. We now from calculus that the maximum of a function has $l'(\theta)=0$. So we need to fit $\theta$ to find $l'(\theta)=0$.

$$\theta^{(t+1)} =\theta^{(t)} -\frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}$$

This is now our new optimization algorithm. It turns out that it works very well for logistic regression. As already mentioned the algorithm is known for the speed of convergence (finding the minimum/maximum). In more technical terms Newtons Method has quadratic convergence. That means that for every iteration the error value (the noise) minimizes quadratic.

Example:

• (1st iteration) -> 0.01 error
• (2nd iteration) -> 0.001 error
• (3rd iteration) -> 0.000 000 1 error
• and so on …

The generalization for Newtons method – when $\theta$ is a vector of our parameters instead of a scalar – looks like the following:

$$\theta^{(t+1)} =\theta^{(t)} -H^{-1} \bigtriangledown_\theta l(\theta)$$

Where H is the Hessian matrix (matrix of 2nd derivatives of $l(\theta)$.

$$H_{ij}=\frac{\partial^2 l }{\partial \theta_i\partial \theta_j}$$

For every iteration the Hessian matrix needs to be inverted. This could be computationally expensive for a large number of features.

## Machine Learning 06 – Binary Classification

In this post I will explain how we can use Machine Learning to implement a binary classification algorithm.

In binary classification which is also called logistic regression our $y$ value can only be 0 or 1. To really get a feeling for what logistic regression means I found it very helpful to look at the definition of the word logistic (of or relating to the philosophical attempt to reduce mathematics to logic). In normal regression we could have any possible y value, this time we „reduce it to logic“ so our y value can only be 0 or 1.

Because $y \in {0,1}$ it makes a lot of sense to reduce the output of our prediction function $h_\theta (x)$ to be in the range $[0;1]$. We will let $h_\theta (x)$   be a sigmoid function.

$$h_\theta (x)=g(\theta^Tx)$$ $$g(z)=\frac{1}{1+e^{-z}}$$ $$h_\theta (x)=\frac{1}{1+e^{-\theta^Tx}}$$

We will assume that $P(y=1|x; \theta)=h_\theta (x)$. This says that the probability of $y$ (conditioned by $x$ and parameterized by $\theta$) being $1$ is determined by our „hypothesis“ (=prediction function).

Thus $P(y=0|x; \theta)=1 - h_\theta (x)$ (because $y \in {0,1}$).

To generalize our assumption:

$$P(y|x; \theta)=h_\theta (x)^y*(1 - h_\theta (x))^{1-y}$$

Now the question is how do we fit the parameters $\theta$ so that the hypothesis correctly predicts the values 0 or 1 for a given x value.

To find the optimal parameters we will define a Likelihood function $L$ which is dependent on the parameters. So we have to define $L(\theta)$ .

$$L(\theta)=P(\vec{y}|X;\theta)$$ $$L(\theta)=\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)$$ $$=\prod_{i=1}^m h_\theta (x)^y*(1 - h_\theta (x))^{1-y}$$

We will have to maximize the Likelihood of our parameters, this means we will have to maximize $L(\theta)$. To achieve this a derivation will be needed. So we will define a logarithmic Likelihood function. We do this because it’s easier to work with logarithms, using the logarithm rules. Finding the $\theta$ which maximized the log-Likelihood results in having found the $\theta$ which maximizes the Likelihood function $L(\theta)$. This is because the logarithm increases as the input of the logarithm increases.

$$l(\theta)=log L(\theta)=log \prod_{i=1}^m h_\theta (x)^y*(1 - h_\theta (x))^{1-y}$$ $$=\sum_{i=1}^m log \Big( h_\theta (x)^y*(1 - h_\theta (x))^{1-y} \Big)$$ $$=\sum_{i=1}^m log(h_\theta (x)^y) + log (1 - h_\theta (x))^{1-y}$$ $$=\sum_{i=1}^m y *log(h_\theta (x)) + (1-y)*log (1 - h_\theta (x))$$

The update rule for our learning algorithm will look like the following:

$$\theta := \theta + \alpha \bigtriangledown_\theta l(\theta)$$

This is a gradient ascent algorithm thus we add the product of the learning rate and our derivation.

$$\frac{\partial}{\partial\theta_j} l(\theta)=...$$ $$... = \sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)})*x_j^{(i)}$$ $$\theta_j := \theta_j + \alpha\sum_{i=1}^m(y^{(i)}-h_\theta(x^{(i)})*x_j^{(i)}$$

### Example with a fictional Data Set

python code to classify the above points:

import math

# h(x) = g(z)
# g(z) =  1/(1+e^(-z))

points = [
[-2.02, 0],
[-1.54, 0],
[-1.53, 0],
[-1, 1],
[-1.28, 0],
[-0.6, 1],
[-0.7, 1],
[-0.38, 1],
[2.37, 0],
[2.66, 0],
[3, 0],
[3.43, 0],
[-0.28, 1],
[-0.02, 1]
]

# features
X = [
#   [x0=1, x1, x1^2]
[1, points[0][0], points[0][0]**2],
[1, points[1][0], points[1][0]**2],
[1, points[2][0], points[2][0]**2],
[1, points[3][0], points[3][0]**2],
[1, points[4][0], points[4][0]**2],
[1, points[5][0], points[5][0]**2],
[1, points[6][0], points[6][0]**2],
[1, points[7][0], points[7][0]**2],
[1, points[8][0], points[8][0]**2],
[1, points[9][0], points[9][0]**2],
[1, points[10][0], points[10][0]**2],
[1, points[11][0], points[11][0]**2],
[1, points[12][0], points[12][0]**2],
[1, points[13][0], points[13][0]**2]
]
# targets
y = [
points[0][1],
points[1][1],
points[2][1],
points[3][1],
points[4][1],
points[5][1],
points[6][1],
points[7][1],
points[8][1],
points[9][1],
points[10][1],
points[11][1],
points[12][1],
points[13][1]
]

w = [ 0, 0, 0 ] # weights w0 and w1, called Theta in Math

# alpha = 0.01
learningRate = 0.01
iterations = 5000

# soultion for this training set h(x) = 3.01796x - 1.36527
# korrelation r = 0.846
def g(z): # sigmoid function
return 1/(1+math.e**(-z))

def h(x):
# x = X[i] = eg.  [1, 3]
# using sigmoid function to avoid j from getting bigger than 0 and 1
return g( w[0]*x[0] + w[1]*x[1] + w[2]*x[2])

# derivative of l = Sum j=1 to m [ yi - h(xi) ] * xi
def derivativeLogLikelyHood(j): # j is number of feature, j=0 refers to x0 wich is 1, j=1 refers to x1 which is
m = len(X) # number of trainingdata in trainingset
sum = 0
for i in range(0, m):
sum += ( y[i] - h(X[i]) ) * X[i][j]
return sum

def gradientAsscent(): # Not Descent because we want to maximize the likelikhood ..

for iteration in range(0, iterations):
for j in range(0, len(w)):
w[j] = w[j] + learningRate * derivativeLogLikelyHood(j)

print( w )



## Machine Learning 05 – Probabilistic Interpretation of Least Squares Method

In this post, we will discuss and show, why the least squares method is actually the way it is. To do this, we have to use Probability theory and some probabilistic assumption on our various variables.

Let’s say, that for every data-set in our training data the following rule applies: $y^{(i)} = h_\theta(x^{(i)}) + \varepsilon^{(i)}$.

While $\varepsilon^{(i)}$ is an error value, which is the difference between our prediction $h_\theta(x^{(i)})$ and the actual value $y^{(i)}$. The smaller the error value is, the better our prediction.

We may assume that the error value is distributed normally because the learning model is not taking every possible feature into consideration. If our training data is about house prices for example, we might miss out on features like the buyer and sellers mood, whether the house is next to a nice place and so on. All these features are independent from one another. Thanks to the central limit theory the accumulation of this „noise“ or error value is distributed by the Gaussian distribution.

$$\varepsilon^{(i)} \sim N(0, \sigma^2)$$

The above mathematical expression says that the error value is distributed by the normal distribution with the median $0$ and the variance sigma squared.

We can write out the function for the normal distribution like below.

$$P(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}*\sigma}exp\Big(-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}\Big)$$

The error value can be expressed by $y^{(i)}$ and $h_\theta(x^{(i)})$ simply as $\varepsilon^{(i)}=y^{(i)}-h_\theta(x^{(i)})$. Now lets change the input of our probability function accordingly.

$$P(y^{(i)}|x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}*\sigma}exp\Big(-\frac{(y^{(i)}-h_\theta(x^{(i)}))^2}{2\sigma^2}\Big)$$

The above expression says that the probability of $y^{(i)}$ conditioned by $x^{(i)}$ and parameterized by $\theta$ is equal to the function on the left. It’s important to use a semicolon before the $\theta$ because a coma would mean, that the probability of $y^{(i)}$ is conditioned by BOTH $x^{(i)}$ and by $\theta$. But the vector $\theta$ is just a parameter not a random variable on which the probability is conditioned.

Lets just write out the „hypothesis“ (= prediction function) by the actual definition $h_\theta(x^{(i)})=\theta^T x^{(i)}$.

$$P(y^{(i)}|x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}*\sigma}exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big)$$

We now can define a function $L(\theta)$ to being the Likelihood of the parameter$\theta$ based on the product of all probabilities.

$$L(\theta)=P(\vec{y}|X;\theta)$$

Notice, that we did use a vector for our output variable and accordingly the Matrix of all the training data X in our probability function.

$$L(\theta)=\prod_{i=1}^m P(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^m\frac{1}{\sqrt{2\pi}*\sigma}exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big)$$

Based on our training data we want to fit $\theta$ to achieve the maximum Likelihood. In other words we want to maximize $L(\theta)$.

Now we can maximize the function or any function which increases as $L(\theta)$ increases such as the logarithm of $L(\theta)$,$l(\theta)=log L(\theta)$. This is especially useful since working with logarithms provides some convenience for example when taking the derivative or making use of logarithm rules (like below).

$$l(\theta)=log L(\theta)=log\prod_{i=1}^m\frac{1}{\sqrt{2\pi}*\sigma}exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big)$$ $$=\sum_{i=1}^m log \frac{1}{\sqrt{2\pi}*\sigma}exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big)$$

(Above we used the logarithm rule $log(a*b*..) = log(a)+log(b)+...$)

$$=\sum_{i=1}^m log \frac{1}{\sqrt{2\pi}*\sigma} + log\Bigg(exp\Big(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\Big) \Bigg)$$

As $log \frac{1}{\sqrt{2\pi}*\sigma}$ is just a constant which gets added $m$ times we can bring it in front of the sum. Also logarithm and exp chancel each other out. So we are left with the following:

$$=m * log\frac{1}{\sqrt{2\pi}*\sigma}+\sum_{i=1}^m-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}$$

The „-“ sign as well as the $\frac{1}{2\sigma^2}$ in our summation are just constant factors, which we can put in front of the summation.

$$=m * log\frac{1}{\sqrt{2\pi}*\sigma}-\frac{1}{\sigma^2}\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^T x^{(i)})^2$$

Now if we look at the function, it is clear to say that to maximize the function we have to minimize the sum. If you remember $J(\theta)$ from my second blog post, you will remember that it looks exactly the same as the sum in our log likelihood function. We also tried to minimize it (using the gradient descent algorithm).

$$J(\theta)=\frac{1}{2}\sum_{i=1}^m(y^{(i)}-\theta^T x^{(i)})^2$$

To sum up, we showed how the minimization of least squares is just a maximization of our likelihood function assuming the probabilistic distribution of our error values as shown above.

## Machine Learning 04 – Locally Weighted Regression

In this blog post I will explain what the Locally Weighted Regression is and when it makes sense to use.

Picture the following data set as our given training data. We want to be able to make predictions based on the features, that’s what regression is all about.

If we use the feature $x_1=x$ our prediction function will look like this $h_\theta(x)=\theta_0 + \theta_1*x_1$. This function is only a line. When you try to to fit a line through a data set like the one above, you will most probably get a horizontal line above the x-Axis. But its not really fitting the trends of our Data. We then could start thinking about another feature like $x_2=x_1^2$ or in this case also even $x_3=x_1x^3$ to fit our data. So the prediction function changes to: $h_\theta(x)=\theta_0+ \theta_1*x_1 + \theta_2*x^2+ \theta_3*x^3$.

Now imagine our training data had more of a bell shape curve. We could add even more features like $x_4=sin(x)$ and so on. Another approach would be the Locally Weighted Regression (LWR) also called Loess or Lowess.

### Underfitting

We talk about underfitting, when there are some patterns in the data, wich our model doesn’t recognize. For example trying to fit a straight line to more quadratic correlation.

### Overfitting

Overfitting, on the other hand describes the fitting of idiosyncrasies (= a mode of behavior or a way of thought peculiar to an individual) of a specific data set. This means it’s not pattern finding more exact fitting instead of finding the underlying trends.

### How to Avoid underfitting/overfitting?

To avoid underfitting/overfitting one must keep an eye on the selected features (There are algorithms to choose the best features).

„Parametric“ learning algorithm have a fixed number/set of parameters ($\theta$s).

„Non-Parametric“ learning algorithm: The number of parameters grow with $m$ (remember $m$ was the size of our training set).

LWR is a non-parametric learning algorithm. This algorithm will help us worry less about the features we choose.

Suppose we want to make a prediction on a value $x$. Then LWR says: Fit $\theta$ to minimize $\sum_{i=1}^m w^{(i)}(y^{(i)}-\theta^T x^{(i)})^2$ where $w^{(i)} = exp(-\frac{(x^{(i)}-x)}{2\tau^2})$.

The differences to the „normal“ linear regression are:

• The function we want to minimize looks almost the same as $J(\theta)$, except a weight value. This weight value is calculated by a bell curved function where the peak of the bell curve is at $x$. This means the weight gives $x^{(i)}$s which are far away from $x$ a smaller value.
• The LWR algorithm computes a new optimal $\theta$ each time we want to make a prediction. Thus, if our training set is to large this algorithm can be very costly. Although there are ways to still make it faster, these methods are not covered in the course (Stanford Machine Learning Lectures, by Andrew Ng).
• It fits a line through a subset of the data, instead of trying to make a fit on the whole training data.

## Machine Learning 03 – Normal Equation

Last post I explained the math behind and the Gradient Descent algorithm and implemented it in python. This part we will see how to evaluate $\theta$ using a normal equation.

We defined $J(\theta)$ to be half of the sum of the distance between the target values and the prediction squared.

$$J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$$

We can define some matrices to write down $J(\theta)$ using only the defined matrices.

Lets define the matrix X being the matrix of our training data („input“ values or features), y a vector of our „output“ or target values and $\theta$ a vector of our parameters of $h_\theta(x)$.

$X_{m,n+1} =\begin{pmatrix} \longleftarrow x^{(1)^T} \longrightarrow \\ \longleftarrow x^{(2)^T} \longrightarrow \\ \vdots \\ \longleftarrow x^{(m)^T} \longrightarrow \end{pmatrix}$, $y_{m,1} = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix}$ and $\theta_{n+1,1} =\begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}$

$$X\theta = \begin{pmatrix} x^{(1)^T}\theta \\x^{(2)^T}\theta \\ \vdots \\ x^{(m)^T}\theta \end{pmatrix} = \begin{pmatrix} h_\theta(x^{(1)}) \\h_\theta(x^{(2)}) \\ \vdots \\h_\theta(x^{(m)}) \end{pmatrix}$$ $$X\theta - y = \begin{pmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ h_\theta(x^{(2)}) - y^{(2)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{pmatrix}$$

Does this remind you of something ? Yes, it’s part of the function $J(\theta)$ but without the squares and the sum. We can use the following rule to achieve the sum of the squares: When you multiply a transposed vector times the vector itself, you get the sum of each element squared. As you can see above $(X\theta-y)$ is just a vector. $(X\theta-y)^T(X\theta-y)$ is then a scalar (a number), which is the sum of each element in the vector $(X\theta-y)$ squared. Now lets add a $\frac{1}{2}$ in front of it and we get $J(\theta)$.

$$J(\theta)=\frac{1}{2}(X\theta-y)^T(X\theta-y)$$

To get the optimal $\theta$ we can set the derivative of $J(\theta)$ with respect to $\theta$ to zero and calculate the corresponding $\theta$. So lets determine $\triangledown_{\theta} J(\theta)$.

$$\triangledown_{\theta} J(\theta) = \triangledown_{\theta} \frac{1}{2}(X\theta-y)^T(X\theta-y)$$ $$= \frac{1}{2} \triangledown_{\theta}(X\theta-y)^T(X\theta-y)$$ $$= \frac{1}{2} \triangledown_{\theta}(\theta^TX^T-y^T)(X\theta-y)$$ $$= \frac{1}{2} \triangledown_{\theta}(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty)$$

$(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty)$ is just a scalar, we can use the trace function, because $tr a = a$.

$$= \frac{1}{2} \triangledown_{\theta}tr(\theta^TX^TX\theta-\theta^TX^Ty -y^TX\theta +y^Ty)$$ $$= \frac{1}{2} (\triangledown_{\theta}tr \theta^TX^TX\theta-\triangledown_{\theta}tr \theta^TX^Ty -\triangledown_{\theta}tr y^TX\theta +\triangledown_{\theta}tr y^Ty)$$

The part $\triangledown_{\theta}tr y^Ty$ will be zero, because $y^Ty$ has no $\theta$ in it, so its just a constant which gets cancelled out when we derive the function. $\theta^TX^Ty$ is a scalar, we can transpose it – transposing a scalar gives us the scalar ($a^T=a$) – to turn the $\theta^T$ into a $\theta$. On the term $tr \theta^TX^TX\theta$ we will use the trace rule $tr ABC = tr CAB = tr BCA$ (always moving the last matrix forward) to get $tr\theta\theta^TX^TX$.

$$= \frac{1}{2} (\triangledown_{\theta}tr \theta\theta^TX^TX-\triangledown_{\theta}tr y^TX\theta -\triangledown_{\theta}tr y^TX\theta)$$

The rule $\triangledown_A tr ABA^TC = CAB+ C^TAB^T$ can be used to take the derivative $\triangledown_{\theta}tr \theta\theta^TX^TX$. To make use of it, we will put an Identity matrix $I$ between $\theta$ and $\theta^T$. So $\theta$ becomes our A, the identity matrix$I$ becomes our B and $X^TX$ becomes our C. Now we can apply the rule and get the following result:

$$= \frac{1}{2} (\triangledown_{\theta}tr \theta I\theta^TX^TX-\triangledown_{\theta}tr y^TX\theta -\triangledown_{\theta}tr y^TX\theta)$$

To take the derivative of $\triangledown_{\theta}tr y^TX\theta$ we will make use of the rule $\triangledown_{A}tr BA = B^T$. We will treat $y^TX$ as B and $\theta$ as A.

$$= \frac{1}{2} (X^TX\theta I+X^TX\theta I-X^Ty-X^Ty)$$

We can ignore the identity matrix I, because it doesn’t change the result of a multiplication.

$$= \frac{1}{2} (X^TX\theta+X^TX\theta-X^Ty-X^Ty)$$ $$= \frac{1}{2} (2X^TX\theta-2X^Ty)$$ $$=X^TX\theta-X^Ty$$ $$\triangledown_{\theta}J(\theta)=X^TX\theta-X^Ty$$

Now we want to set $\triangledown_{\theta}J(\theta)$ to zero and determine $\theta$ .

$$\triangledown_{\theta}J(\theta)=0$$ $$X^TX\theta-X^Ty=0$$ $$X^TX\theta=X^Ty$$ $$\theta=(X^TX)^{-1}X^Ty$$

## Machine Learning 02 – Gradient Descent

Linear Regression is the most basic task of Machine Learning.

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.

We will use the following training data (they are just made up points):

• features = [ 3, 2, 1, 5, 4, 7, 9 ]
• targets = [ 4, 1, 4, 8, 21, 20, 26 ]

The solution of our problem is the function above (also called the hypothesis h(x)). To get to this solution we will use a mathematical model called least squares combined with the optimization algorithm gradient descent.

$h_\theta(x) = \theta_0 + \theta_1 x$

We are searching for the ideal $\theta_0$ and $\theta_1$ pair.

A couple of variable definitions:

m … number of training examples (in our case m = 7)
x … input Variables = features
y … output Variables = targets
(x, y) … training example
$(x^{(i)}, y^{(i)})$$i^{th}$ training example

We can define $\theta$ to be the vector of both $\theta_0$ and $\theta_1$, so $\theta = \binom{\theta_0}{\theta_1}$.

Then using the least squares method we define the function $J(\theta)=\frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$.

Using the gradient of J with respect of $\theta_i$ and updating $\theta_i$ accordingly will move us towards a local minimum.

Update rule:

$\theta_i:=\theta_i-\alpha\frac{\partial}{\partial\theta_i}J(\theta)$

$\alpha$ … learning rate of our learning algorithm (we will use $\alpha=0.01$)

Choosing a good learning rate is crucial to the learning Algorithm, because a too big one can cause the $\theta_i$ to jump around, never reaching a minimum. If the learning rate is too small the algorithm will take a lot of time to reach the minimum of J ( = slow convergence).

$\frac{\partial}{\partial\theta_i}J(\theta)$ is called the partial derivative of J with respect to $\theta_i$. We need to figure it out so we can implement it in a computer program.

$$\frac{\partial}{\partial\theta_i}J(\theta)=\frac{\partial}{\partial\theta_i}\frac{1}{2}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$$

Writing out the sum…

$$=\frac{\partial}{\partial\theta_i}\frac{1}{2}\Bigg[(h_\theta(x^{(1)})-y^{(1)})^2+(h_\theta(x^{(2)})-y^{(2)})^2+...+(h_\theta(x^{(m)})-y^{(m)})^2\Bigg]$$

We will put the factor $\frac{1}{2}$ before the derivative.

$$=\frac{1}{2}\frac{\partial}{\partial\theta_i}\Bigg[(h_\theta(x^{(1)})-y^{(1)})^2+(h_\theta(x^{(2)})-y^{(2)})^2+...+(h_\theta(x^{(m)})-y^{(m)})^2\Bigg]$$

To take the derivative of this sum, we will take the derivative of every element and sum them up.

$$=\frac{1}{2}\Bigg[\frac{\partial}{\partial\theta_i}(h_\theta(x^{(1)})-y^{(1)})^2+\frac{\partial}{\partial\theta_i}(h_\theta(x^{(2)})-y^{(2)})^2+...+\frac{\partial}{\partial\theta_i}(h_\theta(x^{(m)})-y^{(m)})^2\Bigg]$$

Here we have to use the chain rule.

$$=\frac{1}{2}\Bigg[2(h_\theta(x^{(1)})-y^{(1)})\frac{\partial}{\partial\theta_i}h_\theta(x^{(1)})+...+2(h_\theta(x^{(m)})-y^{(m)}))\frac{\partial}{\partial\theta_i}h_\theta(x^{(m)})\Bigg]$$ $$=(h_\theta(x^{(1)})-y^{(1)})\frac{\partial}{\partial\theta_i}h_\theta(x^{(1)})+...+(h_\theta(x^{(m)})-y^{(m)}))\frac{\partial}{\partial\theta_i}h_\theta(x^{(m)})$$

Now let us see what $\frac{\partial}{\partial\theta_i}h_\theta(x^{(i)})$ is…

$$=\frac{\partial}{\partial\theta_i}\Big(\theta_0x^{(0)}+...+\theta_ix^{(i)}+...+\theta_nx^{(n)}\Big)$$

Everything expect the $\theta_ix^{(i)}$ is just a constant, because we are deriving the function with respect to $\theta_i$. This means that everything expect $x^{(i)}$ will be cancelled out.

Back to our derivative of J, now we can replace $\frac{\partial}{\partial\theta_i}h_\theta(x^{(i)})$ by the actual value $x^{(i)}$.

$$\frac{\partial}{\partial\theta_i}J(\theta)=(h_\theta(x^{(1)})-y^{(1)})x^{(i)}+...+(h_\theta(x^{(m)})-y^{(m)}))x^{(i)}$$ $$\frac{\partial}{\partial\theta_i}J(\theta)=\sum_{j=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_i^{(j)}$$

So our new update rule will be:

$\theta_i:=\theta_i-\alpha\sum_{j=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_i^{(j)}$.

We have $\theta_0$, and $\theta_1$, which means we will repeat updating both of them until we converge (= reach the optimal $\theta$, that means $J(\theta)$ reaching a minimum).

I wrote a python script, to compute optimum $\theta$ vector. Note that$\theta_0$, and $\theta_1$ became w0 and w1. The reason for choosing the letter w is that $\theta$ is sometimes also called weights.

# h(x) = O0 + O1*x1 + O2*x2 + ... + 0n*xn
# in this case: h(x) = w0 + w1*x1
features = [ 3, 2, 1, 5, 4, 7, 9 ]
targets = [ 4, 1, 4, 8, 21, 20, 26 ]
w = [ 0, 0 ] # weights w0 and w1, called Theta in Math
x = [ 1, 1 ] # x0 = 1

# alpha = 0.01
learningRate = 0.01

# actually should stop when derivative of J(w) gets near zero but we will use an iteration counter instead
iterations = 5000

def h(x):
return w[0] + x*w[1]

# derivative of J = Sum j=1 to m [ h(xj) - yj ] * x1j
def derivativeJ(is_x0_soUseOneAsFactor):
m = len(features)
sum = 0
for j in range(0, m):
sum += ( h(features[j]) - targets[j] ) * ( 1 if is_x0_soUseOneAsFactor else features[j] )
return sum

for j in range(0, iterations):
w[0] = w[0] - learningRate * derivativeJ(True)
w[1] = w[1] - learningRate * derivativeJ(False)

print( w )
# soultion for this training set h(x) = 3.01796x - 1.36527
# correlationfactor r = 0.846

## Machine Learning 01 – Introduction

Machine Learning is a subset of Artificial Intelligence. Computer programs should predict an outcome without beeing explicitly programmed.

### Supervised Learning

We talk about Supervised Learning whenever the machine gets training data to learn from. The machine is „supervised“ by the data. Types of Supervised Learning are Regression and Classification.

Regression: A function is determined, which best fits the training data.

Classification: Categorize given data set, based on training data.

### Unsupervised Learning

When using Unsupervised Learning, the data is unlabeled and the machine tries to find structure in the data.

Unsupervised Learning is used in computer vision and audio processing.

### Reinforcement Learning

Sometimes a system has to constantly come up with decisions over time. That is when Reinforcement Learning comes in handy. The system doesn’t have to come up with a solution all at one, instead it makes a sequence of decisions which get better over time.

One method – which is inspired by training animals – is the Reward System, in which the computer system receives positive or negative Feedback based on the decision taken. The learning algorithm then differentiates between „good“ and „bad“ behavior.

This and the coming blog posts are based on Andrew Ng’s Stanford lectures which are available on YouTube for free. The blog posts are primarily used as a way to summarize what I have learned in layman’s terms. Details will be updated over time.