## Thursday, October 27, 2016

### Neural network and deep learning, 3

When you first learn to play golf, you spend most of the time developing a basic swing. There are so many techniques available that the best strategy is in-depth study of a few of the most important.
While unpleasant, we also learn quickly when we’re decisively wrong. By contrast, we learn more slowly when our errors are less well-defined.
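To address the learning slowdown problem, define the cross-entropy cost function for a neuron with output $a = \sigma(z)$:

$$C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln(1 - a) \right]$$

Thus:

$$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j \left( \sigma(z) - y \right)$$

So the rate at which a weight learns is proportional to the “error” $\sigma(z) - y$ in the output.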
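```python
import mnist_loader
import network2
# Load MNIST as (training, validation, test) splits
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()
# 30 epochs, mini-batch size 10, learning rate 0.5
net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data, monitor_evaluation_accuracy=True)
```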

Cross entropy is a measure of surprise.
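To see the learning slowdown in numbers, here is a minimal sketch for a single sigmoid neuron with one input. It compares the weight gradient under the quadratic cost, which carries an extra factor of $\sigma'(z)$, with the gradient under the cross-entropy cost, which is just the error times the input. The function names and the starting values are chosen for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quadratic_grad_w(x, y, w, b):
    """dC/dw for C = (y - a)^2 / 2 with a = sigmoid(w*x + b)."""
    a = sigmoid(w * x + b)
    return (a - y) * a * (1.0 - a) * x  # includes sigma'(z) = a * (1 - a)

def cross_entropy_grad_w(x, y, w, b):
    """dC/dw for C = -[y ln a + (1 - y) ln(1 - a)]."""
    a = sigmoid(w * x + b)
    return (a - y) * x  # the sigma'(z) factor cancels

# A decisively wrong neuron: the target is 0 but the output is close to 1.
x, y, w, b = 1.0, 0.0, 2.0, 2.0
g_quad = quadratic_grad_w(x, y, w, b)
g_ce = cross_entropy_grad_w(x, y, w, b)
print(g_quad, g_ce)
```

With a saturated neuron the quadratic gradient is tiny even though the error is large, while the cross-entropy gradient stays proportional to the error.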

## Softmax

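The activation function of the output layer is changed from the sigmoid function to the softmax function:

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}$$

You can think of softmax as a way to rescale the weighted inputs so that they form a probability distribution.
The cost function is correspondingly changed to the log-likelihood cost, $C = -\ln a^L_y$, where $y$ is the correct class.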
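A small sketch of softmax and the log-likelihood cost in plain Python follows. The names `softmax` and `log_likelihood_cost` are illustrative, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(z):
    """Rescale weighted inputs z into a probability distribution."""
    m = max(z)  # subtracting the max avoids overflow; the ratios are unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood_cost(a, y):
    """Cost -ln a_y, where y is the index of the correct class."""
    return -math.log(a[y])

z = [1.0, 2.0, 3.0]
a = softmax(z)
cost = log_likelihood_cost(a, 2)
print(a, sum(a), cost)
```

The outputs are positive and sum to 1, and the cost is small when the network assigns high probability to the correct class.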

## Overfitting

Fermi was suspicious of models with four free parameters.
We should stop training when accuracy on the validation data is no longer improving.
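One way to make "no longer improving" concrete is a patience rule: stop once the validation accuracy has failed to beat its best value for a fixed number of epochs. The sketch below assumes the per-epoch validation accuracies are already recorded in a list. The function name and the `patience` parameter are hypothetical, not part of `network2`.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops after `patience` consecutive epochs without a new best
    validation accuracy; otherwise it runs to the last epoch.
    """
    best = float("-inf")
    epochs_since_best = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best = acc
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch
    return len(val_accuracies) - 1

accuracies = [0.90, 0.93, 0.95, 0.95, 0.94, 0.95, 0.96]
stop_at = early_stopping_epoch(accuracies, patience=3)
print(stop_at)
```

A small patience stops early but can give up during a temporary plateau, as in this example where accuracy improves again at the last epoch.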
In general, one of the best ways of reducing overfitting is to increase the size of the training data. However, training data can be expensive or difficult to acquire.
One possible approach is to reduce the size of our network, but it will be less powerful.
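We will use a regularization technique called L2 regularization, also known as weight decay, which adds a regularization term to the cross-entropy cost:

$$C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a^L_j + (1 - y_j) \ln(1 - a^L_j) \right] + \frac{\lambda}{2n} \sum_w w^2$$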
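The name "weight decay" comes from the gradient-descent update that L2 regularization produces: each weight is first rescaled by a factor $1 - \eta\lambda/n$ and then moved along the gradient of the unregularized cost. A minimal sketch, with `l2_update` and its argument names chosen for illustration:

```python
def l2_update(w, grad_c0, eta, lmbda, n):
    """One gradient step with L2 regularization.

    w       : list of weights
    grad_c0 : gradient of the unregularized cost with respect to each weight
    eta     : learning rate
    lmbda   : regularization parameter
    n       : size of the training set
    """
    decay = 1.0 - eta * lmbda / n  # the weight-decay factor
    return [decay * wi - eta * gi for wi, gi in zip(w, grad_c0)]

# With a zero gradient, only the decay acts on the weights.
w = [2.0, -4.0]
shrunk = l2_update(w, [0.0, 0.0], eta=0.5, lmbda=5.0, n=50000)
print(shrunk)
```

Large weights lose more per step than small ones, because the shrinkage is proportional to the weight itself.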

## Why does regularization help reduce overfitting?

Smaller weights -> lower complexity -> a simpler explanation for the data.
A simpler explanation is more resistant to noise.
Some people justify this preference by appeal to Occam’s Razor.
However, there’s no a priori logical reason to prefer a simple explanation over a more complex one. Two comparisons to keep in mind:

• linear fit vs polynomial fit
• Newton’s theory vs Einstein’s theory

The true test of a model is not simplicity, but rather how well it does in predicting new phenomena.
So for now the answer is that it’s an empirical fact.
The human brain has a huge number of free parameters. Shown just a few images of an elephant, a child will quickly learn to generalize and recognize other elephants. How do we do it? At this point we don’t know.

## L1 regularization

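With L1 regularization, the weights shrink by a constant amount toward 0, whereas with L2 regularization they shrink by an amount proportional to $w$.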
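A minimal sketch of the L1 update, assuming the regularization term is $\frac{\lambda}{n} \sum_w |w|$ so that each step subtracts $\frac{\eta\lambda}{n} \operatorname{sgn}(w)$ from the weight. The helper names are illustrative.

```python
def sign(v):
    """Return 1, -1, or 0 according to the sign of v."""
    return (v > 0) - (v < 0)

def l1_update(w, grad_c0, eta, lmbda, n):
    """One gradient step with L1 regularization."""
    step = eta * lmbda / n  # the same shrinkage for every nonzero weight
    return [wi - step * sign(wi) - eta * gi for wi, gi in zip(w, grad_c0)]

# With a zero gradient, every nonzero weight moves toward 0 by the same amount.
w = [2.0, -0.5, 0.0]
shrunk = l1_update(w, [0.0, 0.0, 0.0], eta=0.5, lmbda=5.0, n=50)
print(shrunk)
```

Here the large weight 2.0 and the small weight -0.5 both move by 0.05, and a weight that is exactly 0 stays put.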

## Dropout

Randomly and temporarily delete half the hidden neurons, then restore them after training on a mini-batch of examples.
When we drop out different sets of neurons, it’s rather like we’re training different neural networks. This kind of averaging or voting scheme is found to be a powerful way of reducing overfitting.
This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is therefore forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
Dropout is a way of making sure that the model is robust to the loss of any individual piece of evidence.
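The masking step can be sketched as below, assuming hidden activations are held in a plain list and a fresh mask is drawn for each mini-batch. The names `dropout_mask` and `apply_dropout` are hypothetical, and a real implementation would also skip the dropped neurons during backpropagation.

```python
import random

def dropout_mask(n_hidden, p_drop=0.5, rng=random):
    """Draw a 0/1 mask; each hidden neuron is dropped with probability p_drop."""
    return [0 if rng.random() < p_drop else 1 for _ in range(n_hidden)]

def apply_dropout(activations, mask):
    """Zero the activations of the dropped neurons for this mini-batch."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)  # seeded only so that the sketch is repeatable
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.8]
mask = dropout_mask(len(hidden), 0.5, rng)
dropped = apply_dropout(hidden, mask)
print(mask, dropped)
```

When the full network is run after training, all hidden neurons are active, so one common compensation is to halve the weights going out of the hidden layer.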

## Artificially expanding the training data

Make many small real-world variations to the training images, such as rotations, translations, skewing, and elastic distortions.
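As an illustration of the simplest variation, the sketch below translates an image by one pixel in each of four directions, which turns one training image into five. It assumes an image is a list of rows of pixel values; `shift_image` and `expand` are illustrative names.

```python
def shift_image(image, dx, dy, fill=0.0):
    """Translate a 2D image by dx columns and dy rows, padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:  # pixels shifted off the edge are lost
                out[nr][nc] = image[r][c]
    return out

def expand(image):
    """Return the original image plus four one-pixel translations."""
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    return [image] + [shift_image(image, dx, dy) for dx, dy in shifts]

img = [[0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0]]
variants = expand(img)
print(len(variants))
```

Each variant keeps the label of the original image, so the training set grows without any new labeling work.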

## Neural network vs SVM

With the same amount of training data, the neural network performs better than the SVM. However, more training data can sometimes compensate for differences in the machine learning algorithm used.
When asked “Is algorithm A better than algorithm B?”, the right response is “What training data set are you using?”
Imagine an alternate world in which the people who created the benchmark data set had a larger research grant. They might have used the extra money to collect more training data.
The message to take away is that we want both better algorithms and better training data.

## Weight initialization

Initialize each weight in the net as a Gaussian random variable with mean 0 and standard deviation 1. With n inputs, this causes the weighted sum to have a standard deviation on the order of sqrt(n), so the activation of the hidden neuron is close to either 1 or 0, i.e. it saturates.
A better initialization is to give each weight a standard deviation of 1/sqrt(n).
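```python
import mnist_loader
import network2

# Load MNIST as (training, validation, test) splits
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
# Network() uses the 1/sqrt(n) weight initializer by default
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.SGD(training_data, 30, 10, 0.1, lmbda=5.0, evaluation_data=validation_data, monitor_evaluation_accuracy=True)
```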
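The effect of the two initializations on the weighted input can be checked with a short calculation. The sketch below assumes a neuron with 1000 inputs of which 500 are 1 and the rest 0, plus a bias with standard deviation 1. The function names are illustrative and are not part of `network2`.

```python
import math
import random

def weighted_input_std(n_active, weight_std, bias_std=1.0):
    """Std of z = sum_j w_j x_j + b when n_active inputs are 1 and the rest 0."""
    return math.sqrt(n_active * weight_std ** 2 + bias_std ** 2)

def init_weights(n_in, n_out, rng, scaled=True):
    """Gaussian weights with std 1/sqrt(n_in) if scaled, else std 1."""
    std = 1.0 / math.sqrt(n_in) if scaled else 1.0
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

n_in, n_active = 1000, 500
std_unscaled = weighted_input_std(n_active, 1.0)                    # about 22.4
std_scaled = weighted_input_std(n_active, 1.0 / math.sqrt(n_in))    # about 1.22
weights = init_weights(n_in, 30, random.Random(0))
print(std_unscaled, std_scaled)
```

A weighted input with standard deviation around 22 usually lands far out on the flat tails of the sigmoid, while one with standard deviation near 1 mostly stays in the region where the sigmoid still has a usable slope.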


## How to choose hyper-parameters

You don’t a priori know which hyper-parameters to adjust. If you spend many hours or days or weeks trying this or that, only to get no result, it will damage your confidence.

• Get any non-trivial learning first, i.e. achieve results better than chance.
• During the early stages, make sure you can get quick feedback from experiments.
• As with many things in life, getting started can be the hardest thing to do.

## Variations on SGD

• Hessian technique
• Momentum-based SGD
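Momentum-based gradient descent keeps a velocity for each weight: the gradient changes the velocity, and the velocity changes the weight. A minimal sketch of the update $v \to \mu v - \eta \nabla C$, $w \to w + v$ follows, tried on the toy cost $C(w) = w^2/2$. The function name and the toy cost are chosen for illustration.

```python
def momentum_step(w, v, grad, eta, mu):
    """One momentum update; mu is the momentum coefficient, eta the learning rate."""
    new_v = [mu * vi - eta * gi for vi, gi in zip(v, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v

# Minimize C(w) = w^2 / 2, whose gradient is simply w.
w, v = [1.0], [0.0]
for _ in range(200):
    w, v = momentum_step(w, v, grad=w, eta=0.1, mu=0.9)
print(w, v)
```

Setting `mu` to 0 recovers plain gradient descent, and values close to 1 let the velocity build up over many steps.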

## Other models of artificial neuron

• tanh neuron
• rectified linear neuron
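Both neuron types keep the weighted input $z = w \cdot x + b$ and change only the activation function: $\tanh(z)$ for the tanh neuron and $\max(0, z)$ for the rectified linear neuron. A minimal sketch, with function names chosen for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_input(w, x, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def tanh_neuron(w, x, b):
    """Output in (-1, 1); tanh is a rescaled sigmoid: tanh(z) = 2*sigmoid(2z) - 1."""
    return math.tanh(weighted_input(w, x, b))

def relu_neuron(w, x, b):
    """Rectified linear output max(0, z)."""
    return max(0.0, weighted_input(w, x, b))

w, x, b = [1.0, -2.0], [1.0, 1.0], 0.5  # weighted input z = -0.5
out_tanh = tanh_neuron(w, x, b)
out_relu = relu_neuron(w, x, b)
print(out_tanh, out_relu)
```

Unlike the sigmoid, the tanh output can be negative, and the rectified linear output does not saturate for large positive $z$.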
We don’t have a solid theory of how activation functions should be chosen.
How should we approach machine learning techniques that are supported almost entirely empirically?
As long as your method minimizes some sort of objective function and has a finite capacity (or is properly regularized), you are on solid theoretical grounds.
The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.
If you look through the research literature you’ll see that stories in a similar (heuristic) style appear in many research papers on neural nets, often with thin supporting evidence.
We need such heuristics to inspire and guide our thinking.
When you understand something poorly, as the explorers understood geography and as we understand neural nets today, it’s more important to explore boldly than it is to be rigorously correct in every step of your thinking.
Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.