Tuesday, October 25, 2016

Neural networks and deep learning, 1


By Michael Nielsen, Jan 2016
The book is only available online because of its interactive graphs, although I would like to have a local copy, either a PDF or a hard copy.
Alternatively, a more advanced book from MIT Press, http://www.deeplearningbook.org/, was published in 2016.12.

What this book is about

Conventional approach to programming: we tell the computer what to do, breaking big problems up into many small, precisely defined tasks that the computer can easily perform.
Neural network: it learns from observational data, figuring out its own solution to the problem at hand.
Artificial neural networks, which are loosely based on a model of brain neurons, were proposed more than 30 years ago. But it was not until the breakthrough in 2006 that the rebranded “deep learning” became popular. Besides algorithms, several factors collectively contributed to this breakthrough: computing power, large datasets, and GPUs.
The highlight of this book is that it really focuses on a solid understanding of the core concepts of neural networks. As the author says,
Technologies come and go, but insight is forever.

on the exercises and problems

You should do most of the exercises because they’re basic checks that you’ve understood the material. If you can’t solve an exercise relatively easily, you’ve probably missed something fundamental.
The problems are another matter… With that said, I don’t recommend working through all the problems. What’s even better is to find your own project. Maybe you want to use neural nets to classify your music collection. Or to predict stock prices. Or whatever. But find a project you care about. Then you can ignore the problems in the book, or use them simply as inspiration for work on your own project. Struggling with a project you care about will teach you far more than working through any number of set problems. Emotional commitment is a key to achieving mastery.

1. using neural nets to recognize handwritten digits

Two important types of artificial neuron

perceptron

A perceptron takes several binary inputs, compares their weighted sum to a threshold value, and produces a single binary output. So it’s a device that makes decisions by weighing up evidence. The output is 1 if
\sum_j w_j x_j > \text{threshold}
and 0 otherwise. For convenience, vector notation and a bias b = -\text{threshold} are introduced, so the condition becomes:
\overrightarrow{w}\cdot\overrightarrow{x}+b>0
In biological terms, when the condition is true the perceptron is said to fire.
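The decision rule above can be sketched in a few lines of plain Python. This is only an illustration: the function name `perceptron` and the NAND weights (-2, -2) with bias 3 are choices made for this example, not code taken from network.py.

```python
def perceptron(weights, inputs, bias):
    """Return 1 if w . x + b > 0 (the perceptron fires), otherwise 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum + bias > 0 else 0

# Illustrative choice: weights (-2, -2) and bias 3 make the perceptron a NAND gate.
nand_weights, nand_bias = [-2, -2], 3
nand_table = {
    (a, b): perceptron(nand_weights, [a, b], nand_bias)
    for a in (0, 1)
    for b in (0, 1)
}
print(nand_table)
```

Only the input (1, 1) gives a weighted sum of -4 + 3 = -1, which is not greater than 0, so that is the only case where the perceptron does not fire.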

sigmoid neuron

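A sigmoid neuron computes the same weighted input z = \overrightarrow{w}\cdot\overrightarrow{x}+b, but its output is passed through a sigmoid function (or logistic function), defined by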
\sigma(z)=\frac{1}{1+e^{-z}}
so a large positive z gives an output close to 1, and a large negative z gives an output close to 0.
If the perceptron is regarded as a step function, then the sigmoid neuron is a smoothed version of it.
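To see the "smoothed step" idea numerically, here is a small sketch comparing the two functions on a few values of z. The helper names `step` and `sigmoid` are made up for this example.

```python
import math

def step(z):
    """Perceptron-style output: jumps from 0 to 1 at z = 0."""
    return 1 if z > 0 else 0

def sigmoid(z):
    """Sigmoid output: rises smoothly from 0 to 1 (fine for moderate |z|)."""
    return 1.0 / (1.0 + math.exp(-z))

# Print z, the step output, and the sigmoid output side by side.
for z in (-10, -1, 0, 1, 10):
    print(z, step(z), round(sigmoid(z), 4))
```

Far from zero the two agree closely, while near zero the sigmoid changes gradually, so a small change in z gives a small change in output.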
Feedforward neural networks: the output from one layer is used as the input to the next layer, with no loops.
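As a rough sketch of a feedforward pass, the snippet below pushes an input through a toy 2-3-1 network using plain Python lists. The layer layout (a list of `(weights, biases)` pairs) and all the numbers are made up for illustration; the book's network.py uses NumPy arrays instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(layers, inputs):
    """Propagate `inputs` through each layer in turn.

    `layers` is a list of (weights, biases) pairs, where weights[j] holds
    the incoming weights of neuron j in that layer.
    """
    activations = inputs
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

# Toy 2-3-1 network with arbitrary, hand-picked weights and biases.
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, -1.0, 0.5]),
    ([[1.0, -1.0, 0.5]], [0.1]),
]
output = feedforward(layers, [1.0, 0.0])
print(output)
```

Each layer only ever sees the activations of the layer before it, which is what makes the network feedforward.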

stochastic gradient descent

cost function
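C(w,b)=\frac{1}{2n}\sum_x \|y(x)-a\|^2
Here n is the number of training inputs, y(x) is the desired output for input x, and a is the actual output of the network. C is called the quadratic cost function, or the mean squared error.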
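A minimal sketch of this cost in plain Python, assuming outputs and targets are given as lists of equal-length vectors. The name `quadratic_cost` and the toy data are illustrative only.

```python
def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over training inputs of ||y(x) - a||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += sum((y_k - a_k) ** 2 for a_k, y_k in zip(a, y))
    return total / (2 * n)

targets = [[1.0, 0.0], [0.0, 1.0]]
perfect = [[1.0, 0.0], [0.0, 1.0]]  # network output equals the target
wrong = [[0.0, 1.0], [1.0, 0.0]]    # network output is exactly reversed

print(quadratic_cost(perfect, targets))  # 0.0
print(quadratic_cost(wrong, targets))    # (2 + 2) / (2 * 2) = 1.0
```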
This cost function is smoother than counting the number of correctly recognized digits: a small change to the weights and biases usually does not change that count at all, but it does produce a small change in C, which shows how to adjust them.
The algorithm used to make the cost as small as possible is called gradient descent.
The model is to imagine a ball rolling down to the bottom of a valley (the minimum of C) from a randomly chosen starting point. For a small move \Delta v = (\Delta v_1, \Delta v_2), the cost changes by:
\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 = \nabla C \cdot \Delta v
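Here \nabla C = (\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}) is the gradient vector. Choose a small learning rate \eta>0 and set \Delta v so that \Delta C is negative:
\Delta v = -\eta \nabla C
This gives \Delta C \approx -\eta \|\nabla C\|^2 \leq 0, so the cost does not increase as long as the approximation holds. The ball then moves according to: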
v \rightarrow v' = v -\eta \nabla C
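The update rule can be tried on a toy cost where the gradient is known by hand. The sketch below assumes C(v) = v_1^2 + v_2^2, a made-up bowl-shaped function used only for illustration; the starting point and learning rate are arbitrary.

```python
def cost(v):
    """Toy cost C(v) = v1^2 + v2^2, with its minimum at the origin."""
    return v[0] ** 2 + v[1] ** 2

def grad_cost(v):
    """Gradient of the toy cost: (2*v1, 2*v2)."""
    return [2 * v[0], 2 * v[1]]

def gradient_descent(v, eta, steps):
    """Repeatedly apply v -> v - eta * grad C."""
    for _ in range(steps):
        g = grad_cost(v)
        v = [v_i - eta * g_i for v_i, g_i in zip(v, g)]
    return v

start = [3.0, -4.0]
end = gradient_descent(start, eta=0.1, steps=100)
print(cost(start), cost(end))
```

With this cost and \eta = 0.1, every step multiplies v by 0.8, so the ball moves steadily towards the minimum at the origin.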
The cost function is an average over the costs of individual training examples, so computing its gradient can take a long time. To speed up learning, stochastic gradient descent estimates the gradient from a small sample of randomly chosen training inputs (a mini-batch). If only one training input is used at a time, as human beings do, it’s called incremental (or online) learning.
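A minimal sketch of mini-batch stochastic gradient descent, applied to a much simpler model than a neural network: fitting a line w*x + b to synthetic points with the quadratic cost. The function `sgd`, its parameters, and the data are all assumptions made for this example; it is not the SGD method from network.py.

```python
import random

def sgd(data, epochs, mini_batch_size, eta, seed=0):
    """Fit y = w*x + b by mini-batch SGD on the quadratic cost."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random mini-batches every epoch
        for k in range(0, len(data), mini_batch_size):
            batch = data[k:k + mini_batch_size]
            # Gradient of 1/(2m) * sum (w*x + b - y)^2 over this mini-batch only.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

# Synthetic, noise-free data from the line y = 2x + 1 (illustrative only).
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = sgd(data, epochs=200, mini_batch_size=5, eta=0.1)
print(w, b)  # should end up close to 2 and 1
```

Each update uses only a handful of examples, so it is cheap, and the noisy gradient estimates still average out to the right direction over many updates.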

code practice

The official MNIST data has 60 k training images and 10 k test images. Each training example is a 784-dimensional input x (a 28 × 28 pixel image) and a 10-dimensional one-hot label y (but the y values in test_data are not one-hot encoded). So the number of neurons in the first layer is 784. A mini-batch is a group of training inputs, e.g., 10 at a time.
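To make the one-hot encoding concrete, here is a small sketch using plain lists. The helpers `one_hot` and `decode` are hypothetical names for this example; the loader used with the book stores labels as NumPy arrays rather than lists.

```python
def one_hot(digit, num_classes=10):
    """Return a list with 1.0 at index `digit` and 0.0 everywhere else."""
    vec = [0.0] * num_classes
    vec[digit] = 1.0
    return vec

def decode(output):
    """Return the index of the largest entry, i.e. the predicted digit."""
    return max(range(len(output)), key=lambda k: output[k])

y = one_hot(3)
print(y)
print(decode(y))  # 3
```

The same `decode` step is how a 10-neuron output layer is turned back into a single digit for comparison with a plain test label.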
The following code is run at the Python command line. Basically, it uses the Network class from network.py, which defines a sigmoid function for the feedforward pass and its derivative for backpropagation.
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
import network
net = network.Network([784, 30, 10])
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
# arguments: training_data, epochs=30, mini_batch_size=10, eta=3.0, test_data
# score on the test data: ~95%
  • A good starting point is much more important than the choice of algorithm or the fine-tuning of parameters.
  • If making a change improves things, try doing more!
  • Do you have enough training data?
  • Do you have enough epochs?
  • Do you have a good architecture?
  • Is the learning rate too low or too high?
  • sophisticated algorithm \leq simple learning algorithm + good training data
  • The weights and biases have been learned automatically, so we understand neither how the brain works nor how the trained network works!
  • Deep neural networks (5~10 hidden layers) perform far better on many problems than shallow neural networks (a single hidden layer), because deep nets can build up a complex hierarchy of concepts.

Appendix: Is there a simple algorithm for intelligence?

Neural networks can be used to solve pattern recognition problems. The interesting question is: will these approaches eventually be used to build thinking machines that match or surpass human intelligence?
Is there a simple set of principles to explain intelligence?
Intelligence may be explained by a large number of fundamentally distinct mechanisms, which evolved in response to many different selection pressures in our species’ evolutionary history.
From connectomics, a brain contains 100 billion neurons with 100 trillion connections. If we need to understand the details of all those connections to understand how the brain works, we are not going to have a simple algorithm.
From molecular biology, human and chimp DNA differ at roughly 125 million DNA base pairs, out of a total of 3 billion DNA base pairs (125 million / 3 billion ≈ 4%). So we are 96% chimp. If we converted these differing base pairs to letters, that would be about 30 times the length of the Bible.
Yet our genome alone is not enough to completely describe the neural connections. The caveat is that growing children need a healthy, stimulating environment and good nutrition to achieve their intellectual potential.
Chimp and human genetic lines diverged just 5 million years ago.
In 1980, Marvin Minsky developed the “Society of Mind” theory, which proposes that human intelligence is the result of a large society of individually simple computational processes:
What magical trick makes us intelligent? The trick is that there is no trick. The power of intelligence stems from our vast diversity, not from any single, perfect principle.
My own prejudice is in favor of there being a simple algorithm for intelligence. When it comes to research, an unjustified optimism is often more productive than a seemingly better-justified pessimism, for an optimist has the courage to set out and try new things. That’s the path to discovery, even if what is discovered is perhaps not what was originally hoped. A pessimist may be more “correct” in some narrow sense, but will discover less than the optimist.
That’s the path to insight, and by pursuing that path we may one day understand enough to write a longer program or build a more sophisticated network which does exhibit intelligence.