Machine Learning 07 – Newton’s Method

In the last couple of blog posts we used Gradient Descent / Gradient Ascent as our optimization algorithm. Gradient Descent works by computing the gradient and then taking a small step, scaled by the learning rate, in the direction opposite to it (Gradient Ascent steps along it).

There are many optimization algorithms out there. In this post we will learn about another one called Newton's Method.


The idea behind Newton's Method is finding a root of a function f(x). It works by starting from an initial guess x_1. Drawing the tangent to the function at x_1 and computing where that tangent crosses the x-axis gives you the next guess x_2. This is called one iteration. After very few iterations the root of f(x) is determined very accurately.

Let's put the above paragraph into mathematical notation. Let's define the distance between x_1 and x_2 to be \Delta.

x_2 = x_1 - \Delta

The slope of the function at x_1 can be expressed as f'(x_1) = \frac{f(x_1)}{\Delta}. So \Delta can be expressed as \Delta = \frac{f(x_1)}{f'(x_1)}.

x_2 = x_1 - \frac{f(x_1)}{f'(x_1)}
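This update rule can be sketched in a few lines of code. A minimal sketch, assuming f(x) = x^2 - 2 as an example function (its positive root is \sqrt{2}); the function names are my own for illustration:

```python
# Scalar Newton iteration: x_next = x - f(x) / f'(x)
def newton_root(f, f_prime, x, iterations=6):
    for _ in range(iterations):
        x = x - f(x) / f_prime(x)
    return x

# Example: find the root of f(x) = x^2 - 2, starting from x = 1.0
root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x=1.0)
```

Even with only six iterations, `root` agrees with \sqrt{2} to machine precision, which illustrates the fast convergence discussed below.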

Let’s say the function f is not dependent on x but on \theta (notice that \theta here is a scalar, not a vector). We want to find the \theta for which f(\theta)=0. Then we can rewrite the iteration in terms of \theta.

\theta^{(2)} =\theta^{(1)} - \frac{f(\theta^{(1)})}{f'(\theta^{(1)})}

In more general terms:

\theta^{(t+1)} =\theta^{(t)} - \frac{ f(\theta^{(t)}) }{ f'(\theta^{(t)}) }.

Now we can use this algorithm to maximize our log-likelihood function l(\theta) from the previous blog posts. We know from calculus that at a maximum of a function l'(\theta)=0. So we apply Newton's Method to the derivative, fitting \theta so that l'(\theta)=0.

\theta^{(t+1)} =\theta^{(t)} -\frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}
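As a sketch of this maximization update, consider a toy Bernoulli log-likelihood with k successes in n trials, l(\theta) = k\log\theta + (n-k)\log(1-\theta) (not from the post; chosen because its maximizer is known to be \theta = k/n, so the result is easy to check):

```python
# Newton's update for maximization: theta <- theta - l'(theta) / l''(theta)
def newton_maximize(l_prime, l_double_prime, theta, iterations=10):
    for _ in range(iterations):
        theta = theta - l_prime(theta) / l_double_prime(theta)
    return theta

# Toy Bernoulli log-likelihood: k successes in n trials
k, n = 7, 10
l_p = lambda t: k / t - (n - k) / (1 - t)              # l'(theta)
l_pp = lambda t: -k / t**2 - (n - k) / (1 - t)**2      # l''(theta)
theta_hat = newton_maximize(l_p, l_pp, theta=0.5)
```

`theta_hat` lands on the known maximizer k/n = 0.7, confirming that driving l'(\theta) to zero finds the maximum.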

This is now our new optimization algorithm. It turns out that it works very well for logistic regression. As already mentioned, the algorithm is known for its speed of convergence (finding the minimum/maximum). In more technical terms, Newton's Method has quadratic convergence: near the solution, the error is roughly squared with every iteration, so the number of correct digits roughly doubles.

Example:

  • (1st iteration) -> 0.01 error
  • (2nd iteration) -> 0.0001 error
  • (3rd iteration) -> 0.000 000 01 error
  • and so on …
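You can observe this squaring of the error directly. A small demonstration (my own, not from the post), again using f(x) = x^2 - 2 and tracking the distance to the true root \sqrt{2}:

```python
import math

# Run Newton's method on f(x) = x^2 - 2 and record the error per iteration.
x, errors = 1.0, []
for _ in range(5):
    x = x - (x**2 - 2) / (2 * x)
    errors.append(abs(x - math.sqrt(2)))
```

Printing `errors` shows each error being roughly the square of the previous one (up to a constant factor), which is exactly what quadratic convergence means.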

The generalization of Newton's Method – when \theta is a vector of our parameters instead of a scalar – looks like the following:

\theta^{(t+1)} =\theta^{(t)} -H^{-1} \nabla_\theta l(\theta)

Where H is the Hessian matrix (the matrix of second derivatives of l(\theta)).

H_{ij}=\frac{\partial^2 l }{\partial \theta_i\partial \theta_j}

For every iteration the Hessian matrix needs to be inverted (or, equivalently, a linear system needs to be solved with it). This can be computationally expensive for a large number of features.
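A sketch of the vector update, assuming a toy concave objective l(\theta) = -\frac{1}{2}\theta^T A \theta + b^T \theta with A positive definite (its exact maximizer is A^{-1}b, so the result is easy to verify; A and b are made up for illustration):

```python
import numpy as np

# Toy concave objective: l(theta) = -0.5 theta^T A theta + b^T theta
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(theta):
    return b - A @ theta      # gradient of l

def hessian(theta):
    return -A                 # Hessian of l (constant for a quadratic)

theta = np.zeros(2)
for _ in range(3):
    # Solve H d = grad(theta) instead of forming H^{-1} explicitly:
    # cheaper and numerically more stable than np.linalg.inv.
    theta = theta - np.linalg.solve(hessian(theta), grad(theta))
```

Note the design choice of `np.linalg.solve` over an explicit inverse: solving one linear system per iteration is what makes the cost manageable, though for very many features even that can become the bottleneck, as mentioned above.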