Yuchao's blogspot: 10/27/16

When you first learn to play golf, you spend most of the time developing a basic swing. Likewise, there are so many techniques available that the best strategy is in-depth study of a few of the most important.

While unpleasant, we also learn quickly when we’re decisively wrong. By contrast, we learn more slowly when our errors are less well-defined.

To address the learning slowdown problem, define the cross-entropy cost function:

$C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right]$

Thus:

$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y)$

$\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y)$

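So the rate at which the weights and biases learn is proportional to the “error” $\sigma(z)-y$ in the output: the larger the error, the faster the neuron learns.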
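To make the slowdown concrete, here is a minimal sketch for a single sigmoid neuron with one input. The starting values (w = 2, b = 2, x = 1, y = 0) and the function names are illustrative assumptions, not part of network2. It compares the per-example gradient of a quadratic cost, which carries an extra factor $\sigma'(z)$, with the cross-entropy gradient above, which does not.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_gradients(w, b, x, y):
    """Return (quadratic, cross_entropy) values of dC/dw for one example."""
    z = w * x + b
    a = sigmoid(z)
    quadratic = (a - y) * a * (1.0 - a) * x  # includes sigma'(z) = a * (1 - a)
    cross_entropy = (a - y) * x              # sigma'(z) has cancelled out
    return quadratic, cross_entropy

# Assumed starting point: the neuron is badly wrong (output near 1, target 0).
quad, xent = weight_gradients(w=2.0, b=2.0, x=1.0, y=0.0)
print("quadratic gradient:    ", quad)
print("cross-entropy gradient:", xent)
```

When the neuron is saturated, the quadratic gradient is tiny even though the error is large, while the cross-entropy gradient stays close to the error itself.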

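import mnist_loader
import network2
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()  # Gaussian weights with mean 0 and standard deviation 1
# 30 epochs, mini-batch size 10, learning rate 0.5
net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data, monitor_evaluation_accuracy=True)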

Cross entropy is a measure of surprise.

Softmax

In the output layer, the activation function is changed from the sigmoid function to the softmax function:

$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}$

You can think of softmax as a way to rescale the weighted inputs $z^L_j$ so that the outputs form a probability distribution.

The cost function is correspondingly changed to log-likelihood.
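As a rough illustration of the two ideas above, the sketch below computes softmax activations and the log-likelihood cost $C = -\ln a^L_y$ for one example, where $y$ is the index of the correct class. The function names and the sample weighted inputs are assumptions made for this example. Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(zs):
    """Map weighted inputs to positive activations that sum to 1."""
    m = max(zs)  # shift for numerical stability; the ratios are unchanged
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood_cost(activations, label):
    """Cost for one example: -ln of the activation of the correct class."""
    return -math.log(activations[label])

a = softmax([2.0, 1.0, 0.1])  # made-up weighted inputs
print(a, sum(a))
print(log_likelihood_cost(a, 0), log_likelihood_cost(a, 2))
```

The cost is small when the network assigns high probability to the correct class and large when it does not.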

Overfitting

Fermi was suspicious of models with four free parameters.

We should stop training when accuracy on the validation data is no longer improving.
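One way to turn this rule into code is a "no improvement in the last few epochs" check. The sketch below is an assumption about how one might do it, not something network2 provides; the function name and the `patience` parameter are hypothetical.

```python
def early_stopping_epoch(val_accuracies, patience=10):
    """Return the epoch at which to stop.

    Stops once the validation accuracy has not improved on its best
    value for `patience` consecutive epochs; otherwise runs to the end.
    """
    best = float("-inf")
    best_epoch = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_accuracies) - 1

# Made-up validation accuracies, one per epoch.
history = [0.90, 0.93, 0.95, 0.95, 0.94, 0.95, 0.94]
print(early_stopping_epoch(history, patience=3))
```

A larger `patience` makes the rule more tolerant of plateaus, at the cost of training longer.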

In general, one of the best ways of reducing overfitting is to increase the size of the training data. However, training data can be expensive or difficult to acquire.

One possible approach is to reduce the size of our network, but it will be less powerful.

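We will use a regularization technique called L2 regularization, also known as weight decay, which adds a regularization term to the original cost $C_0$. $C_0$ can be the cross-entropy cost or, as written out here, the quadratic cost:

$C = C_0 + \frac{\lambda}{2n} \sum_w w^2 = \frac{1}{2n} \sum_x \|y-a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2$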
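To see where the name weight decay comes from, it helps to write out the gradient descent step for the regularized cost. Here $\eta$ denotes the learning rate, a symbol introduced for this sketch. Differentiating $C$ with respect to a weight gives:

$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w$

so the update rule for a weight becomes:

$w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w = \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}$

The factor $1 - \frac{\eta \lambda}{n}$ rescales the weight toward 0 on every step. The regularization term does not involve the biases, so their update rule is unchanged.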

Why does regularization help reduce overfitting?

Smaller weights -> lower complexity -> simpler explanation for the data.

A simpler explanation is more resistant to noise.

Some people prefer this idea, which is known as Occam’s Razor.

However, there’s no a priori logical reason to prefer a simple explanation over a more complex one.

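Linear fit vs polynomial fit: here the simpler model is usually the more plausible one.

Newton’s theory vs Einstein’s theory: here the more complex theory turned out to be the better one.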

The true test of a model is not simplicity, but rather how well it does in predicting new phenomena.

So the answer for now is that it’s an empirical fact: regularized networks usually generalize better than unregularized ones.

The human brain has a huge number of free parameters. Shown just a few images of an elephant, a child will quickly learn to generalize and recognize other elephants. How do we do it? At this point we don’t know.

L1 regularization

$C = C_0 + \frac{\lambda}{n} \sum_w |w|$

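With L1 regularization, the weights shrink by a constant amount toward 0, whereas with L2 regularization they shrink by an amount proportional to $w$.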
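The difference in how the two schemes shrink a weight can be sketched with a single update step. The function names, and the symbols `eta` (learning rate), `lmbda` and `n`, are assumptions for illustration. The gradient of the unregularized cost is set to 0 so that only the shrinkage is visible.

```python
def sign(w):
    """Return +1, -1 or 0 according to the sign of w."""
    return (w > 0) - (w < 0)

def l1_step(w, grad, eta, lmbda, n):
    """One gradient step with L1 regularization: subtract a constant."""
    return w - eta * lmbda / n * sign(w) - eta * grad

def l2_step(w, grad, eta, lmbda, n):
    """One gradient step with L2 regularization: rescale the weight."""
    return (1 - eta * lmbda / n) * w - eta * grad

# Made-up hyper-parameters; eta * lmbda / n = 0.25 here.
params = dict(grad=0.0, eta=0.5, lmbda=5.0, n=10)
for w in (4.0, 1.0):
    print(w, "L1 ->", l1_step(w, **params), " L2 ->", l2_step(w, **params))
```

With L1 the large and the small weight lose the same amount, while with L2 the large weight loses more. This is why L1 tends to drive small weights all the way to 0.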

Dropout

Randomly and temporarily delete half the hidden neurons, train on a mini-batch of examples, then restore them and repeat with a different random half.

When we drop out different sets of neurons, it’s rather like we’re training different neural networks, and the full network acts like an average or vote over them. This kind of averaging or voting scheme is found to be a powerful way of reducing overfitting. The technique also reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is therefore forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

Dropout is a way of making sure that the model is robust to the loss of any individual piece of evidence.
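The deletion step can be sketched as a 0/1 mask over the hidden activations, with a fresh mask drawn for each mini-batch. This is only an illustration under stated assumptions: the function names and the sample activations are made up, and a real implementation would apply the mask inside both the forward and the backward pass.

```python
import random

def dropout_mask(n_hidden, rng):
    """Return a 0/1 mask that deletes a random half of the hidden neurons."""
    dropped = set(rng.sample(range(n_hidden), n_hidden // 2))
    return [0 if j in dropped else 1 for j in range(n_hidden)]

def apply_mask(activations, mask):
    """Zero out the activations of the dropped neurons."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.8]  # made-up hidden activations
masks = [dropout_mask(len(hidden), rng) for _ in range(3)]  # one per mini-batch
for mask in masks:
    print(mask, apply_mask(hidden, mask))
```

When the full network is run after training, twice as many hidden neurons are active, so the weights outgoing from the hidden neurons are halved to compensate.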

Artificially expanding the training data

Make many small variations of the kind found in the real world, such as rotating, translating, and skewing the images, or applying elastic distortions.
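The simplest of these variations, translating an image by one pixel, can be sketched as follows. The image representation (a list of rows) and the function names are assumptions made for this example. Rotations, skews and elastic distortions would need an image library and are not shown.

```python
def shift_image(image, dx, dy, fill=0.0):
    """Shift a 2D image (list of rows) by dx columns and dy rows."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + dy, c + dx
            if 0 <= nr < height and 0 <= nc < width:
                shifted[nr][nc] = image[r][c]
    return shifted

def expand(examples):
    """Return each (image, label) pair plus four one-pixel translations."""
    expanded = []
    for image, label in examples:
        expanded.append((image, label))
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            expanded.append((shift_image(image, dx, dy), label))
    return expanded

tiny = [[0, 1, 0],
        [0, 0, 0],
        [0, 0, 0]]  # made-up 3x3 "image"
print(shift_image(tiny, 1, 0))
print(len(expand([(tiny, 7)])))
```

Each translated copy keeps the original label, so the training set grows fivefold without any new labelling work.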

Neural network vs SVM

With the same amount of training data, the neural network performs better than the SVM. However, more training data can sometimes compensate for differences in the machine learning algorithm used.

Is algorithm A better than algorithm B?

What training data set are you using?

Imagine an alternate world in which the people who created the benchmark data set had a larger research grant. They might have used the extra money to collect more training data.

The message to take away is that we want both better algorithms and better training data.

Weight initialization

The simple approach is to initialize each weight and bias as a Gaussian random variable with mean 0 and standard deviation 1. With n inputs, this gives the weighted sum z a standard deviation on the order of sqrt(n), so |z| is often large and the activation of the hidden neuron is close to either 1 or 0, i.e. the neuron saturates.

A better initialization gives each weight a standard deviation of 1/sqrt(n), where n is the number of inputs to the neuron.
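The effect of the two initializations on the weighted sum can be checked with a small simulation. The sketch assumes a neuron with n = 100 inputs that are all equal to 1 and a bias with standard deviation 1. The function name, the number of trials and the seed are arbitrary choices for this example.

```python
import math
import random
import statistics

def weighted_sum_std(n_inputs, weight_std, trials=2000, seed=0):
    """Estimate the standard deviation of z = sum_j w_j * 1 + b."""
    rng = random.Random(seed)
    zs = []
    for _ in range(trials):
        z = sum(rng.gauss(0.0, weight_std) for _ in range(n_inputs))
        z += rng.gauss(0.0, 1.0)  # bias with standard deviation 1
        zs.append(z)
    return statistics.pstdev(zs)

n = 100
std_large = weighted_sum_std(n, 1.0)                 # standard deviation 1
std_small = weighted_sum_std(n, 1.0 / math.sqrt(n))  # standard deviation 1/sqrt(n)
print(std_large, std_small)
```

Under these assumptions the first value should come out near sqrt(n + 1) and the second near sqrt(2), so with the scaled initialization z stays in the range where the sigmoid is not saturated.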

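import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
# network2.Network uses the 1/sqrt(n) weight initialization by default
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
# 30 epochs, mini-batch size 10, learning rate 0.1, regularization parameter lmbda=5.0
net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, evaluation_data=validation_data, monitor_evaluation_accuracy=True)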

How to choose hyper-parameters

You don’t know a priori which hyper-parameters to adjust. If you spend many hours, days, or weeks trying this or that, only to get no result, it will damage your confidence.

Broad strategy

Get any non-trivial learning at all, i.e. achieve results better than chance.
During the early stages, make sure you can get quick feedback from experiments.
As with many things in life, getting started can be the hardest thing to do.

Variations on SGD

Hessian technique
Momentum-based SGD
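The momentum technique can be sketched in a few lines. It keeps a velocity `v` for each parameter, updates it with $v \rightarrow \mu v - \eta \nabla C$, and then moves the parameter with $w \rightarrow w + v$. The names `eta` (learning rate) and `mu` (momentum coefficient), the default values, and the one-dimensional test cost $C(w) = w^2/2$ are assumptions for illustration.

```python
def momentum_sgd(grad_fn, w, eta=0.1, mu=0.9, steps=300):
    """Minimize a 1-D function with momentum-based gradient descent."""
    v = 0.0
    for _ in range(steps):
        v = mu * v - eta * grad_fn(w)  # velocity accumulates past gradients
        w = w + v                      # the parameter moves by the velocity
    return w

# Test cost C(w) = w**2 / 2, whose gradient is simply w.
w_final = momentum_sgd(lambda w: w, w=5.0)
print(w_final)
```

Setting `mu=0.0` recovers ordinary gradient descent, which gives a quick way to check an implementation.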

Other models of artificial neuron

tanh neuron
rectified linear neuron
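Both neuron types differ from the sigmoid neuron only in the function applied to the weighted input $z = w \cdot x + b$. The sketch below uses made-up weights and inputs, and the function names are assumptions for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_input(w, x, b):
    """Compute z = w . x + b for lists w and x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def tanh_neuron(w, x, b):
    """Output lies in (-1, 1) instead of (0, 1)."""
    return math.tanh(weighted_input(w, x, b))

def relu_neuron(w, x, b):
    """Rectified linear output: max(0, z)."""
    return max(0.0, weighted_input(w, x, b))

w, x, b = [0.5, -1.0], [1.0, 2.0], 0.25  # made-up values, z = -1.25
print(tanh_neuron(w, x, b), relu_neuron(w, x, b))
```

The tanh function is a rescaled sigmoid, since $\sigma(z) = (1 + \tanh(z/2))/2$. The rectified linear function does not saturate for large positive $z$.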

We don’t have a solid theory of how activation functions should be chosen.

Thursday, October 27, 2016

Neural network and deep learning, 3