Monday, October 31, 2016

Neural Networks and Deep Learning 4: convolutional networks

[TOC]

4 A visual proof that neural nets can compute any function

As the title says, it’s a visual proof.

5 Why are deep neural networks hard to train?

The vanishing gradient problem:

Neurons in the earlier layers learn much more slowly than neurons in the later layers, because the gradient tends to shrink as it is propagated backward through the layers.
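
As a rough illustration (a minimal numpy sketch of my own, not the book's code): backpropagating through a stack of sigmoid layers multiplies the gradient by a factor of roughly w * sigmoid'(z) per layer, and since sigmoid'(z) <= 0.25, the product tends to shrink geometrically.

import numpy as np

def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value is 0.25, at z = 0."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Illustrative only: track the factor the gradient picks up per layer.
np.random.seed(0)
grad = 1.0
for layer in range(1, 11):
    w = np.random.randn()   # a typical initial weight
    z = np.random.randn()   # a typical weighted input
    grad *= w * sigmoid_prime(z)
    print("after layer %2d: gradient factor = %.2e" % (layer, grad))
# The factor collapses toward 0, so the earliest layers barely learn.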

6 Deep learning

convolutional networks

It’s strange to use networks with fully-connected layers to classify images, because they don’t take the spatial structure of the images into account. Instead, learning can be sped up by convolutional nets, which are based on three basic ideas:
  • local receptive fields
  • shared weights
  • pooling

Local receptive fields

In the conventional approach, every input pixel is connected to every hidden neuron. Instead, taking advantage of the spatial structure, each hidden neuron is connected to only a small window of the input image. This small window, e.g. a 5×5 region, is called a local receptive field, and the weights within it act as a filter (kernel). One such filter requires only 25 + 1 = 26 parameters (25 shared weights plus one bias), which greatly reduces the computation. In some sense, the resulting architecture looks like a pyramidal hierarchy.
All the local receptive fields share a single set of weights and biases, and the resulting hidden layer is called a feature map. For image recognition, a complete convolutional layer consists of several different feature maps, corresponding to different kernels (shared weights/biases).
A big advantage of sharing weights and biases is that it greatly reduces the number of parameters.
A pooling layer, which produces a condensed feature map, comes right after the convolutional layer. For max-pooling, each pooling unit outputs the maximum activation in its 2×2 input region. The intuition is that once a feature has been found, its exact location isn’t as important as its rough location relative to other features. For L2 pooling, we instead take the square root of the sum of the squares of the activations in the region.
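
To make these ideas concrete, here is a minimal numpy sketch (my own toy example, not the book's Theano code) that applies one 5×5 shared-weight filter to a 28×28 image and then 2×2 max-pools the result:

import numpy as np

np.random.seed(0)
image = np.random.rand(28, 28)   # a toy 28x28 input image
w = np.random.randn(5, 5)        # one 5x5 kernel: 25 shared weights...
b = np.random.randn()            # ...plus 1 bias = 26 parameters in total

# Convolution: slide the 5x5 local receptive field over the image.
# The output is (28-5+1) x (28-5+1) = 24 x 24: one feature map.
feature_map = np.zeros((24, 24))
for i in range(24):
    for j in range(24):
        feature_map[i, j] = np.sum(image[i:i+5, j:j+5] * w) + b

# Max-pooling: keep the maximum activation in each 2x2 region,
# condensing the 24x24 feature map to 12x12.
pooled = feature_map.reshape(12, 2, 12, 2).max(axis=(1, 3))
print(feature_map.shape, pooled.shape)   # (24, 24) (12, 12)

For comparison, a fully-connected layer from 784 inputs to just 30 hidden neurons already needs 784*30 + 30 = 23,550 parameters, while 20 such feature maps need only 20*26 = 520.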

code implementation

The differences between network3.py and network1.py/network2.py are well explained in network3.py's docstring. The dirty work is done by theano.function within Network.SGD().
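
As a minimal sketch of that pattern (assuming Theano is installed; the toy cost here is illustrative, not the network's cost), theano.function compiles a symbolic cost, its gradient, and an update rule into a single callable training step:

import theano
import theano.tensor as T

w = theano.shared(0.0, name='w')     # shared variables hold state (weights)
x = T.dscalar('x')
cost = (w * x - 1.0) ** 2            # a toy quadratic cost
grad = T.grad(cost, w)               # symbolic gradient, no manual calculus

# Each call returns the cost and applies one gradient-descent update to w,
# the same compile-once-then-iterate pattern Network.SGD uses.
step = theano.function([x], cost, updates=[(w, w - 0.1 * grad)])

for _ in range(20):
    step(2.0)
print(w.get_value())                 # approaches 0.5, the cost minimizer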
Briefly, the results for different hyper-parameters are:
main code    | hyper-parameters                                | accuracy | time
network3.py  | FullyConnectedLayer, 60 epochs                  | 97.8%    | 223 s
network3.py  | ConvPoolLayer, 60 epochs                        | 98.78%   | 1800 s
network3.py  | ConvPoolLayer x2, 60 epochs                     | 99.06%   |
network3.py  | ConvPoolLayer x2, 60 epochs, ReLU               | 99.23%   |
network3.py  | above + dataset expanded by distortion          | 99.37%   |
network3.py  | above + extra FullyConnectedLayer               | 99.43%   |
network3.py  | above + dropout, 40 epochs                      | 99.60%   |
network3.py  | above + ensemble                                | 99.67%   |
network.py   | Network([784,30,10]), sigmoid, 30 epochs        | 95.42%   | 330 s
network.py   | Network([784,100,10]), sigmoid, 30 epochs       | 96.59%   |
SVM          |                                                 | 94.35%   | 572 s
network2.py  | Network([784,30,10]), cross-entropy, 30 epochs  | 95.49%   |
network2.py  | Network([784,30,10]), cross-entropy, 30 epochs  | 96.49%   |
network2.py  | Network([784,100,10]), cross-entropy, 30 epochs | 97.92%   |
import network3
from network3 import Network
from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer

training_data, validation_data, test_data = network3.load_data_shared()
mini_batch_size = 10

# fully-connected layer as a baseline
net = Network([FullyConnectedLayer(n_in=784, n_out=100),
               SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

# convolutional network: 28x28 input, 5x5 local receptive fields,
# 20 feature maps, 2x2 max-pooling -> 20 maps of 12x12 activations
net = Network(
    [ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                   filter_shape=(20, 1, 5, 5),
                   poolsize=(2, 2)),
     FullyConnectedLayer(n_in=20*12*12, n_out=100),
     SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

# insert a 2nd convolutional-pooling layer: each 12x12 map is convolved
# with 5x5 fields into 40 maps of (12-5+1)=8x8, then 2x2-pooled to 4x4
net = Network(
    [ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                   filter_shape=(20, 1, 5, 5),
                   poolsize=(2, 2)),
     ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                   filter_shape=(40, 20, 5, 5),
                   poolsize=(2, 2)),
     FullyConnectedLayer(n_in=40*4*4, n_out=100),
     SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

# rectified linear units
from network3 import ReLU
net = Network([
    ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                  filter_shape=(20, 1, 5, 5),
                  poolsize=(2, 2), activation_fn=ReLU),
    ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12), filter_shape=(40, 20, 5, 5),
                  poolsize=(2, 2),
                  activation_fn=ReLU),
    FullyConnectedLayer(n_in=40*4*4, n_out=100,
                        activation_fn=ReLU),
    SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.03,
        validation_data, test_data, lmbda=0.1)
Why is the rectified linear activation function f(z) = max(0, z) better than the sigmoid or tanh functions? Its adoption is largely empirical. A heuristic justification is that ReLU doesn’t saturate in the limit of large z, which helps it continue learning.
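
A quick numerical check of that heuristic (a toy sketch): the sigmoid's derivative collapses toward zero for large z, while ReLU's derivative stays at 1 for any positive z, so the unit keeps learning.

import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

for z in [0.0, 2.0, 5.0, 10.0]:
    print("z=%5.1f  sigmoid'=%.2e  ReLU'=%.1f"
          % (z, sigmoid_prime(z), relu_prime(z)))
# sigmoid'(10) is ~4.5e-05, so a saturated sigmoid neuron learns
# almost nothing; ReLU's gradient is 1 for all positive z.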

Recent progress in image recognition

In 1998, MNIST was introduced; training a state-of-the-art model on it took weeks on a workstation. Now it is a problem well suited to teaching and learning purposes.
2011-2015 was an era of huge breakthroughs for computer vision. It’s a bit like watching the discovery of the atom, or the invention of antibiotics.

2014 ILSVRC competition

ImageNet Large-Scale Visual Recognition Challenge.
A training set of 1.2 million images in 1,000 categories, drawn from an original collection of 16 million images.
GoogLeNet achieved a 6.8% top-5 error rate.

Recurrent neural networks

Recurrent neural networks have feedback loops, which let them carry dynamic state that changes over time. They are useful in speech recognition.
Neural networks have done well at pattern-recognition problems, but not at tasks like implementing a web server or a database program.
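
A minimal sketch of that feedback loop, assuming a plain (vanilla) RNN with made-up sizes: the hidden state h is fed back in at every time step, so the network carries information forward through the sequence.

import numpy as np

np.random.seed(0)
n_in, n_hidden = 4, 8
W_xh = np.random.randn(n_hidden, n_in) * 0.1      # input -> hidden
W_hh = np.random.randn(n_hidden, n_hidden) * 0.1  # hidden -> hidden (the loop)
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                 # hidden state persists across time steps
sequence = np.random.randn(5, n_in)    # a toy 5-step input sequence
for t, x in enumerate(sequence):
    h = np.tanh(np.dot(W_xh, x) + np.dot(W_hh, h) + b)   # h depends on past h
    print("t=%d  |h| = %.3f" % (t, np.linalg.norm(h)))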

long short-term memory units

LSTMs help solve the problem of unstable gradients.
An LSTM block contains a “forget gate”. Based on a sigmoid activation value, it decides whether a value is significant enough to remember, or blocks it from being carried forward to the next time step.
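
A minimal sketch of just the forget gate, with made-up names and sizes: a sigmoid output near 1 keeps the old cell value, and an output near 0 erases it.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
n = 8
c_prev = np.random.randn(n)        # previous cell state (the "memory")
xh = np.random.randn(2 * n)        # concatenated input x_t and hidden h_{t-1}
W_f = np.random.randn(n, 2 * n) * 0.1
b_f = np.zeros(n)

# Forget gate: f = sigmoid(W_f . [h_{t-1}, x_t] + b_f), elementwise in (0, 1)
f = sigmoid(np.dot(W_f, xh) + b_f)
c_kept = f * c_prev                # f near 1 remembers, f near 0 forgets
print(np.round(f, 2))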

deep belief nets, generative models, Boltzmann machines

Deep belief nets are generative models: they can learn to “write”, i.e. to generate images rather than just classify them.
They can also do unsupervised and semi-supervised learning.
Though interesting and attractive, DBNs have lessened in popularity. The marketplace of ideas often functions in a winner-take-all fashion, with nearly all attention going to the fashion-of-the-moment in any given area. It can become extremely difficult to work on momentarily unfashionable ideas, even when those ideas are obviously of real long-term interest.

intention-driven user interface

An impatient professor: “Don’t listen to what I say; listen to what I mean.”
Historically, computers have been like the confused student. Now Google search is able to suggest a corrected query.
Products of the future will tolerate imprecision, while discerning and acting on the user’s true intent.

data science

The biggest breakthrough will be that machine learning research becomes profitable, through applications to data science and other areas.
Machine learning is an engine driving the creation of several major new markets and areas of growth in technology.

What next?

We understand neural networks so poorly.
The ability to learn hierarchies of concepts, building up multiple layers of abstraction, seems to be fundamental to making sense of the world.

Will deep learning soon lead to AI?

Conway’s law:
Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization’s communication structure.
This means the design and engineering of a system reflect our understanding of its likely constituent parts, and of how to build them. Deep learning can’t be applied directly to the development of AI, because we don’t know what the constituent parts are. Indeed, we’re not even sure what basic questions to ask. At this point, AI is more a problem of science than of engineering.
Wernher von Braun:
“Basic research is what I’m doing when I don’t know what I’m doing.”
As our knowledge grew, people were forced to specialize. Many deep new ideas emerged, such as the germ theory of disease, an understanding of how antibodies work, and of what forms a complete cardiovascular system. Such deep insights formed the basis for subfields such as epidemiology, immunology, and the cluster of inter-linked fields around the cardiovascular system; immunology could only arise once we realized the immune system exists. In this way the structure of our knowledge has shaped the social structure of medicine.
A field starts out monolithic, with just a few deep ideas that early experts can all master. But as time passes, we discover many deep new ideas, too many for any one person to really master. So the structure of our knowledge shapes the social organization of science, which in turn constrains and helps determine what we can discover.
Deep learning is the latest “super-special weapon” invoked in arguments about how AI will be achieved.
Deep learning is an exciting and fast-paced but also relatively monolithic field. What we don’t yet see is lots of well-developed subfields, each exploring its own set of deep ideas and pushing deep learning in many directions. It is still a rather shallow field: it’s still possible for one person to master most of its deepest ideas.
How complex and powerful a set of ideas will be needed to obtain AI?
No one knows for sure.
We are at least several decades from using deep learning to develop general AI.
This indefinite conclusion will no doubt frustrate people who crave certainty.
If you ask a scientist how far away some discovery is, they say “ten years”, and what they mean is “I’ve got no idea”.
