Monday, October 31, 2016

Deep learning, MIT intro

www.deeplearningbook.org
In its early days, AI solved problems that are difficult for humans but relatively straightforward for computers: problems that can be described by a list of formal, mathematical rules. The true challenge for AI proved to be solving tasks that are easy for people to perform but hard for people to describe formally; problems we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.
Deep learning is a solution to these more intuitive problems: it allows computers to learn from experience and to understand the world in terms of a hierarchy of concepts.
Ironically, abstract and formal tasks, while mentally difficult for a human, are among the easiest for a computer.
A person’s everyday life requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. So one of the key challenges in AI is how to formalize this informal knowledge.
Simple machine learning algorithms depend heavily on the representation of the data they are given; the pieces of information in that representation are known as features.
For many tasks, it’s difficult to know which features should be extracted. The solution is to use machine learning to discover the representation itself; this approach is known as representation learning.
The main reason for the diminished role of neuroscience in deep learning research today is that we simply do not have enough information about the brain to use it as a guide.

my comment

I will stop reading this book here due to time constraints. My largest gain from this introduction is the awareness of informal knowledge. It reminds me that there are so many things a school education fails to teach (at least at this moment) but that are vital to a human’s life. This knowledge includes emotional intelligence, time management, marriage fitness, culture shock, spiritual growth, etc. Unfortunately, we usually regard them as common sense without a systematic understanding.

The deep learning approach sheds some light on this. We can always insert arbitrary hidden layers between what we have and what we want. These hidden layers serve as a thought-provoking buffer that allows room for creative ideas instead of jumping directly to a conclusion. I will practice this method by drawing some mind maps.

Why AlphaGo?

why AlphaGo is really such a big deal

Knight = bishop = 3 pawns, rook = 5 pawns, queen = 9 pawns, king = $\infty$ pawns
The notion of value is crucial in computer chess. The goal is for the program to find a sequence of moves that maximizes the final value of the program’s board position, no matter what the opponent does.
Ideas like this (a pawn blocking a rook devalues the rook) depend on detailed knowledge of chess and were crucial to Deep Blue’s success.
What happens if you apply this strategy to Go? Top Go players use a lot of intuition in judging how good a particular board position is, and it’s not immediately clear how to express this intuition in simple, well-defined systems like the valuation of chess pieces. In 2006, Monte Carlo tree search algorithms were introduced, based on a clever way of randomly simulating games, but they still fell far short of human players.
The mechanics behind AlphaGo were published in Nature in January 2016.
AlphaGo learned in 2 stages:
  1. AlphaGo was trained on 150,000 games played by strong human players (6-9 dan), using an artificial neural network to find patterns in those games. It learned to predict, with high probability, what move a human player would make in any given position.
  2. The neural network was then improved by repeatedly playing it against earlier versions of itself, adjusting the network so it gradually improved its chance of winning.
The neural network is a very complicated mathematical model, with millions of parameters to tune. As the network learned, it kept making tiny adjustments to the parameters, trying to find corresponding tiny improvements in its play. This sounds like a crazy strategy: repeated tiny tweaks to an enormously complicated function. But if you do this for long enough, with enough computing power, the network gets pretty good. And here’s the strange thing: it gets good for reasons no one really understands, since the improvements are a consequence of billions of tiny adjustments made automatically.
The core question, however, is how to get a valuation of a position. While Deep Blue’s valuation system was based on lots of detailed chess knowledge, AlphaGo got there by analyzing thousands of prior games and engaging in a lot of self-play. AlphaGo created a policy network through billions of tiny adjustments, and built a valuation system similar to a good player’s intuition about the value of different board positions.
However, neural networks have drawbacks: they can be fooled, and they need far more training data than human players do.

Misleading thinking: Criticizing each other only creates hate

It is human nature that being dissatisfied and criticizing is easier than building and appreciating, which requires more effort.
I am very thankful to Bill and Pam for inviting me to the watch party and preparing nice light refreshments (celery, carrots, grapes, thin crackers, green tea).
From my perspective, the documentary film “Hillary’s America: The Secret History of the Democratic Party” is highly biased. It tells the audience that the Democratic Party has tried everything to weaken the US, such as Indian removal, slavery, and segregation, while the GOP has always fought against these wrongs. D’Souza twisted the truth so much that I have to offer my own version.

1st party system: 1792-1824

aristocrat vs democrat
The Federalist Party, created by Alexander Hamilton, appealed to the business community.
The Democratic-Republican Party, created by Thomas Jefferson and James Madison, appealed to the southern planters and farmers.
Starting in 1806, the Democratic-Republican Party dominated both the House and the Senate with more than 80% of the seats.
In the election of 1824 there was only one party, the Democratic-Republican Party, but four candidates, and none of them received over 50% of the electoral votes. The decision then fell to the House of Representatives. The top three entered as finalists, with Jackson ranked No. 1. However, the final vote by the representatives chose Adams, who was originally No. 2; Adams won 13 out of 24 state votes. (There were only 24 states at that time; Texas still belonged to Mexico.)

2nd party system

Jackson began his revenge, formed a new party, and won the election in 1828. He used the spoils system to benefit his men.
  • spoils system, or patronage system: when a political party wins an election, government jobs are given to its supporters, friends, and relatives as a reward and as an incentive to keep working for the party.
The modern Republican Party was formed in 1854 to oppose the expansion of slavery.
In political science there are also 3rd, 4th, 5th, and 6th party systems, used to indicate that these two parties evolved or changed their respective ideologies or campaign strategies over time.

donation limitation

Candidate Committee: $2,700 for the primary election and another $2,700 for the general election, as you can see in the Hillary campaign.
PAC: $5000, PAC usually represents business, labor or ideological interests
Super PAC: unlimited. But a Super PAC makes no contributions to candidates or parties; it runs independently.

D’Souza lied about what his felony was

He used a “straw donor” trick: donating money under other people’s names and then reimbursing them for their contributions.
“Mr. D’Souza agreed to accept responsibility for having urged two close associates to make contributions of $10,000 each to the unsuccessful 2012 senate campaign of Wendy Long and then reimbursing them for their contributions. Given the technical nature of the charge, there was no viable defense,” D’Souza Attorneys Benjamin Brafman and Alex Spiro said in a statement.
D’Souza argued for the charges to be dismissed on grounds of selective prosecution. Last week, a judge denied that motion, citing “no evidence” to support it.
“Following the court’s ruling denying Dinesh D’Souza’s baseless claim of selective prosecution, D’Souza now has admitted, through his guilty plea, what we have asserted all along – that he knowingly and intentionally violated federal election laws,” Bharara said in a statement.

my point of view

D’Souza’s supporters claimed that he was the victim of a double standard / selective prosecution. I think D’Souza is shameless:
  1. Others committing the same crime without being prosecuted doesn’t make you innocent. You are just trying to shift attention away from the fact that you committed a felony.
  2. If you know others are committing the same crimes, why don’t you make a movie exposing that? Why don’t you spend effort improving social justice and making the whole election system more transparent? The answer is very simple: no one is going to buy the ticket. You know it’s not good business. You’re good at selling hate.

Neural network and deep learning 4, convolutional network

[TOC]

4 A visual proof that neural nets can compute any function

As the title says, it’s a visual proof.

5 Why are deep neural networks hard to train?

vanishing gradient problem

neurons in the earlier layers learn much more slowly than neurons in later layers.
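A rough sketch of where the slowdown comes from (the chapter's argument for a chain of single neurons): the gradient for an early-layer bias is a product with one weight and one \sigma' factor per layer,
\frac{\partial C}{\partial b_1} = \sigma'(z_1) \, w_2 \, \sigma'(z_2) \, w_3 \, \sigma'(z_3) \, w_4 \, \sigma'(z_4) \, \frac{\partial C}{\partial a_4}
Since \sigma'(z) \le 1/4 and the weights are typically of moderate size, each factor |w_j \sigma'(z_j)| tends to be less than 1, so the product, and hence the learning speed of the early layers, shrinks roughly exponentially with depth.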

6 Deep learning

convolutional networks

It’s strange to use networks with fully-connected layers to classify images, because they don’t take into account the spatial structure of the images. Instead, the learning speed can be improved by convolutional nets, which are based on three basic ideas:
  • local receptive fields
  • shared weights
  • pooling

Local receptive fields

In the conventional approach, every input pixel is connected to every hidden neuron. Instead, taking advantage of the spatial structure, we connect each hidden neuron to only a small window of the input image. This small window, e.g. a 5×5 region, is called a local receptive field. Sliding the same filter across the image produces a feature map, which requires only 5×5 + 1 = 26 parameters, significantly reducing the computation. In some sense, the architecture looks like a pyramid hierarchy.
All the local receptive fields share a single set of weights and biases, and the resulting hidden layer is called a feature map. For image recognition, a complete convolutional layer consists of several different feature maps, corresponding to different kernels (shared weights/biases).
A big advantage of sharing weights and biases is that it greatly reduces the number of parameters.
A pooling layer, or condensed feature map, is the layer after the feature map. For max-pooling, a pooling unit outputs the maximum activation in its 2×2 input region. The intuition is that once a feature has been found, its exact location isn’t as important as its rough relative location. For L2 pooling, we take the square root of the sum of the squares of the activations.
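To make these three ideas concrete, here is a toy NumPy sketch of my own (not the book’s Theano code): one shared 5×5 filter slid over a 28×28 image to build a feature map, followed by 2×2 max-pooling.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))      # a toy 28x28 "image"
w = rng.standard_normal((5, 5))   # one shared 5x5 filter (25 weights)
b = 0.1                           # plus one shared bias -> 26 parameters total

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# feature map: each hidden neuron sees only a 5x5 local receptive field
feature_map = np.array([[sigmoid(np.sum(image[i:i+5, j:j+5] * w) + b)
                         for j in range(24)]
                        for i in range(24)])          # 24x24

# max-pooling: keep the maximum activation in each 2x2 block -> 12x12
pooled = feature_map.reshape(12, 2, 12, 2).max(axis=(1, 3))

print(feature_map.shape, pooled.shape)   # (24, 24) (12, 12)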

code implementation

The differences between network3.py and network.py/network2.py are well explained in its docstring. The dirty work is done by theano.function within Network.SGD().
Briefly, the results for different hyper-parameters are:
| main code | hyper-parameters | accuracy (%) | time |
| --- | --- | --- | --- |
| network3.py | FullyConnected, 60 epochs | 97.8 | 223 s |
| network3.py | ConvPoolLayer, 60 epochs | 98.78 | 1800 s |
| network3.py | ConvPoolLayer ×2, 60 epochs | 99.06 | |
| network3.py | ConvPoolLayer ×2, 60 epochs, ReLU | 99.23 | |
| network3.py | above + dataset expanded by distortion | 99.37 | |
| network3.py | above + extra FullyConnected | 99.43 | |
| network3.py | above + dropout, 40 epochs | 99.60 | |
| network3.py | above + ensemble | 99.67 | |
| network.py | Network(784,30,10), sigmoid, 30 epochs | 95.42 | 330 s |
| network.py | Network(784,100,10), sigmoid, 30 epochs | 96.59 | |
| SVM | | 94.35 | 572 s |
| network2.py | Network(784,30,10), cross-entropy, 30 epochs | 95.49 | |
| network2.py | Network(784,30,10), cross-entropy, 30 epochs | 96.49 | |
| network2.py | Network(784,100,10), cross-entropy, 30 epochs | 97.92 | |
import network3
from network3 import Network
from network3 import ConvPoolLayer, FullyConnectedLayer, SoftmaxLayer

training_data, validation_data, test_data = network3.load_data_shared()
mini_batch_size = 10

# fully-connected layer as baseline
net = Network([FullyConnectedLayer(n_in=784, n_out=100),
               SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
# arguments: training data, epochs, mini-batch size, learning rate eta
net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

# convolutional network: one conv-pool layer, then fully connected + softmax
net = Network(
    [ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                   filter_shape=(20, 1, 5, 5),   # 20 feature maps, 5x5 local receptive fields
                   poolsize=(2, 2)),             # 2x2 max-pooling -> 20 maps of 12x12
     FullyConnectedLayer(n_in=20*12*12, n_out=100),
     SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

# insert a 2nd convolutional-pooling layer
net = Network(
    [ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                   filter_shape=(20, 1, 5, 5), poolsize=(2, 2)),
     ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12),
                   filter_shape=(40, 20, 5, 5), poolsize=(2, 2)),   # -> 40 maps of 4x4
     FullyConnectedLayer(n_in=40*4*4, n_out=100),
     SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.1, validation_data, test_data)

# rectified linear units
from network3 import ReLU
net = Network([
    ConvPoolLayer(image_shape=(mini_batch_size, 1, 28, 28),
                  filter_shape=(20, 1, 5, 5),
                  poolsize=(2, 2), activation_fn=ReLU),
    ConvPoolLayer(image_shape=(mini_batch_size, 20, 12, 12), filter_shape=(40, 20, 5, 5),
                  poolsize=(2, 2),
                  activation_fn=ReLU),
    FullyConnectedLayer(n_in=40*4*4, n_out=100,
                        activation_fn=ReLU),
    SoftmaxLayer(n_in=100, n_out=10)], mini_batch_size)
net.SGD(training_data, 60, mini_batch_size, 0.03,
        validation_data, test_data, lmbda=0.1)
Why is the rectified linear activation function f(z) = max(0, z) better than the sigmoid or tanh functions? The adoption is empirical. A heuristic justification is that ReLU doesn’t saturate in the limit of large z, which helps the neuron continue learning.
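A quick numeric illustration of that heuristic (my own, not from the book): the sigmoid’s slope collapses toward zero for large |z|, while the ReLU’s slope stays at 1 for every positive z.
import numpy as np

z = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])
sigmoid = 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = sigmoid * (1.0 - sigmoid)   # ~0 for large |z|: learning stalls
relu_grad = (z > 0).astype(float)          # 1 for every positive z: no saturation
print(sigmoid_grad)   # [4.5e-05, 0.197, 0.235, 0.197, 4.5e-05]
print(relu_grad)      # [0., 0., 1., 1., 1.]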

Recent progress in image recognition

In 1998, MNIST was introduced. It took weeks to train on a state-of-the-art workstation; now it has become a problem well suited for teaching and learning purposes.
2011-2015 has been an era of huge breakthroughs for computer vision. It’s a bit like watching the discovery of the atom, or the invention of antibiotics.

2014 ILSVRC competition

ImageNet Large-Scale Visual Recognition Challenge.
A training set of 1.2 million images in 1,000 categories, drawn from an original 16 million images.
GoogLeNet achieved a 6.8% error rate.

Recurrent neural networks

Recurrent networks have feedback loops, which can capture dynamic changes over time.
They are useful in speech recognition.
Neural networks have done well on pattern recognition problems, not on implementing a web server or a database program.

long short-term memory units

LSTMs address the problem of unstable gradients.
An LSTM block contains a “forget gate”. Based on the sigmoid activation value, it decides whether a value is significant enough to remember, or blocks the value from entering the next layer.
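Here is a minimal NumPy sketch of a single LSTM step, my own illustration with random placeholder weights rather than anything trained; it just shows how the forget/input/output gates combine.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the stacked parameters of the
    forget, input, output, and candidate gates (4*n rows in total)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # stacked pre-activations
    f = sigmoid(z[0*n:1*n])               # forget gate: keep or discard old cell state
    i = sigmoid(z[1*n:2*n])               # input gate: how much new information to write
    o = sigmoid(z[2*n:3*n])               # output gate: how much of the cell to expose
    g = np.tanh(z[3*n:4*n])               # candidate values
    c = f * c_prev + i * g                # new cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c

# toy dimensions: 3-dimensional input, 5 hidden units, random placeholder weights
rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
W = rng.standard_normal((4*n_hid, n_in))
U = rng.standard_normal((4*n_hid, n_hid))
b = np.zeros(4*n_hid)
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)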

deep belief nets, generative models, Boltzmann machines

Generative models can learn to write, i.e. to generate images.
They can do unsupervised and semi-supervised learning.
Though interesting and attractive, DBNs have lessened in popularity. The marketplace of ideas often functions in a winner-take-all fashion, with nearly all attention going to the current fashion-of-the-moment in any given area. It can become extremely difficult for people to work on momentarily unfashionable ideas, even when those ideas are obviously of real long-term interest.

intention-driven user interface

An impatient professor: Don’t listen to what I say; listen to what I mean
Historically, computers have been like the confused student. Now Google search is able to suggest a corrected query.
Products in the future would tolerate imprecision, while discerning and acting on the user’s true intent.

data science

The biggest breakthrough will be that machine learning research becomes profitable, through applications to data science and other areas.
Machine learning is an engine driving the creation of several major new markets and areas of growth in technology.

What next?

We understand neural networks so poorly.
The ability to learn hierarchies of concepts, building up multiple layers of abstraction, seems to be fundamental to making sense of the world.

Will deep learning soon lead to AI?

Conway’s law:
Any organization that designs a system will inevitably produce a design whose structure is a copy of the organization’s communication structure.
This means the design and engineering of a system reflect our understanding of its likely constituent parts and how to build them. Deep learning can’t be applied directly to the development of AI, because we don’t know what the constituent parts are. Indeed, we’re not even sure what basic questions to ask. At this point, AI is more a problem of science than of engineering.
Wernher von Braun: “Basic research is what I’m doing when I don’t know what I’m doing.”
As our knowledge grew, people were forced to specialize. Many deep new ideas emerged, such as the germ theory of disease, how antibodies work, and what makes up a complete cardiovascular system. Such deep insights formed the basis for subfields such as epidemiology, immunology, and the cluster of interlinked fields around the cardiovascular system. The structure of our knowledge has thus shaped the social structure of medicine: immunology could only become a field once people realized the immune system exists.
A field starts out monolithic, with just a few deep ideas, and early experts can master all of them. But as time passes we discover many deep new ideas, too many for any one person to really master. So the structure of our knowledge shapes the social organization of science, which in turn constrains and helps determine what we can discover.
Deep learning is the latest super-special weapon I’ve heard used in arguments about how close we are to AI.
Deep learning is an exciting and fast-paced but also relatively monolithic field. What we don’t yet see is lots of well-developed subfields, each exploring its own set of deep ideas and pushing deep learning in many directions. By this measure it is still a rather shallow field: it’s still possible for one person to master most of its deepest ideas.
How complex and powerful a set of ideas will be needed to obtain AI?
No one knows for sure.
We are at least several decades from using deep learning to develop general AI.
This indefinite conclusion will no doubt frustrate people who crave certainty.
If you ask a scientist how far away some discovery is,
they say “10 years”.
What they mean is “I’ve got no idea”.

Thursday, October 27, 2016

Neural network and deep learning, 3

When you first learn to play golf, you spend most of the time developing a basic swing. There are so many available techniques that the best strategy is an in-depth study of a few of the most important.
While unpleasant, we also learn quickly when we’re decisively wrong. By contrast, we learn more slowly when our errors are less well-defined.
To address the learning slowdown problem, define the cross-entropy cost function:
C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y ) \ln (1-a) \right]
Thus:
\frac{\partial C}{\partial w_j} =  \frac{1}{n} \sum_x x_j(\sigma(z)-y)
\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z)-y)
So the rate is proportional to the “error”.
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
import network2
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.large_weight_initializer()
net.SGD(training_data, 30, 10, 0.5,
        evaluation_data=test_data, monitor_evaluation_accuracy=True)
Cross entropy is a measure of surprise.

Softmax

The activation function is changed from sigmoid function to softmax function:
a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}
You can think of softmax as a way to rescale the weighted inputs so that they form a probability distribution.
The cost function is correspondingly changed to log-likelihood.
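A tiny sketch of my own showing the rescaling; the outputs are all positive and sum to 1, so they can be read as a probability distribution:
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability (result unchanged)
    e = np.exp(z)
    return e / e.sum()

z_L = np.array([2.0, 1.0, 0.1])  # weighted inputs to the output layer
a_L = softmax(z_L)
print(a_L)                       # [0.659, 0.242, 0.099]
print(a_L.sum())                 # 1.0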

overfitting

Fermi was suspicious of models with four free parameters.
We should stop training when accuracy on the validation data is no longer improving.
In general, one of the best ways of reducing overfitting is to increase the size of the training data. However, training data can be expensive or difficult to acquire.
One possible approach is to reduce the size of our network, but it will be less powerful.
We will use a regularization technique called L2 regularization, or weight decay, which adds a regularization term to the original cost C_0 (cross-entropy, quadratic, or any other):
C = C_0 + \frac{\lambda}{2n} \sum_w w^2, \qquad \text{e.g.} \quad C = \frac{1}{2n} \sum_x \|y-a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2
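Taking the gradient of this regularized cost shows why the technique is called weight decay; the standard update rule (same notation as the book) multiplies each weight by a factor slightly less than 1 before the usual gradient step:
\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w
w \rightarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}, \qquad b \rightarrow b - \eta \frac{\partial C_0}{\partial b}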

why does regularization help reduce overfitting?

smaller weights -> lower complexity -> simpler explanation for data.
simpler explanation is more resistant to noise.
This idea, in the spirit of Occam’s Razor, is preferred by some people.
However, there’s no a priori logical reason to prefer simple explanation over more complex explanation.
linear fit vs polynomial fit
Newton’s theory vs Einstein’s theory
The true test of a model is not simplicity, but rather how well it does in predicting new phenomena.
So the answer now is that it’s an empirical fact.
The human brain has a huge number of free parameters. Shown just a few images of an elephant, a child will quickly learn to generalize and recognize other elephants. How do we do it? At this point we don’t know.

L1 regularization

C = C_0 + \frac{\lambda}{n} \sum_w |w|
The weights shrink by a constant amount toward 0, in contrast to L2 regularization, where the shrinkage is proportional to w.
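For comparison, the standard L1 update rule (same notation as above) subtracts a fixed amount \eta\lambda/n in the direction of \mathrm{sgn}(w), independent of the size of w:
w \rightarrow w - \frac{\eta \lambda}{n} \, \mathrm{sgn}(w) - \eta \frac{\partial C_0}{\partial w}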

Dropout

Randomly and temporarily delete half of the hidden neurons, then restore them after training on a mini-batch of examples. This kind of averaging or voting scheme turns out to be a powerful way of reducing overfitting.
When we drop out different sets of neurons, it’s rather like we’re training different neural networks. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is therefore forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
Dropout is a way of making sure that the model is robust to the loss of any individual piece of evidence.
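As a sketch of the mechanics, here is a small NumPy illustration of my own, using the common “inverted” variant that rescales during training instead of halving the outgoing weights at test time as the chapter describes:
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Randomly zero a fraction p_drop of the activations during training,
    scaling the survivors so the expected activation is unchanged at test time."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop   # keep each unit with probability 1 - p_drop
    return activations * mask / (1.0 - p_drop)

a = rng.standard_normal(10)
print(dropout(a))                   # roughly half the entries are zeroed
print(dropout(a, training=False))   # unchanged at test time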

Artificially expanding the training data

Make many small, real-world-like variations of the training images: rotations, translations, skewing, elastic distortions, etc.
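A small sketch of my own of what such an expansion could look like with scipy.ndimage; `img` is a placeholder for one 28×28 digit, and the ranges are illustrative:
import numpy as np
from scipy.ndimage import rotate, shift

def distorted_copies(img, rng):
    copies = []
    angle = rng.uniform(-15, 15)                       # small random rotation (degrees)
    copies.append(rotate(img, angle, reshape=False, mode="nearest"))
    dy, dx = rng.integers(-2, 3, size=2)               # translate by up to 2 pixels
    copies.append(shift(img, (dy, dx), mode="nearest"))
    return copies

rng = np.random.default_rng(0)
img = rng.random((28, 28))                             # placeholder for one MNIST digit
expanded = [img] + distorted_copies(img, rng)          # original + 2 distorted variants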

Neural network vs SVM

With the same amount of training data, the neural network performs better than the SVM. However, more training data can sometimes compensate for differences in the machine learning algorithm used.
Is algorithm A better than algorithm B?
what training data set are you using?
Imagine an alternate world in which the people who created the benchmark dataset had a larger research grant. They might have used the extra money to collect more training data.
The message to take away is that what we want is both better algorithms and better training data.

Weight initialization

If we initialize each weight as a Gaussian random variable with mean 0 and standard deviation 1, the weighted sum z has a standard deviation of roughly sqrt(n), where n is the number of input connections, so the hidden neuron’s activation ends up close to either 1 or 0, i.e. it saturates.
A better initialization is to give each weight a standard deviation of 1/sqrt(n).
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
import network2
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
        evaluation_data=validation_data, monitor_evaluation_accuracy=True)
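As a quick sanity check of the scaling argument (my own, not from the book), the spread of the weighted sum grows like sqrt(n) with N(0,1) weights but stays near 1 with the 1/sqrt(n) initialization:
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                   # number of input connections
x = np.ones(n)                             # all inputs set to 1 for simplicity
z_std1 = np.array([rng.standard_normal(n) @ x for _ in range(2000)]).std()
z_scaled = np.array([(rng.standard_normal(n) / np.sqrt(n)) @ x for _ in range(2000)]).std()
print(z_std1)    # ~ sqrt(1000) ≈ 31.6 -> the sigmoid saturates
print(z_scaled)  # ~ 1                 -> activations stay in the sensitive range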

How to choose hyper-parameters

You don’t know a priori which hyper-parameters to adjust. If you spend many hours, days, or weeks trying this or that only to get no result, it damages your confidence.

broad strategy

  • get any non-trivial learning to achieve results better than chance.
  • during early stages, make sure you can get quick feedback from experiments.
  • As with many things in life, getting started can be the hardest thing to do.

variations on SGD

  • Hessian technique
  • Momentum-based SGD (update rule sketched below)
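For momentum-based SGD, the standard update rule introduces a velocity v and a momentum (friction) coefficient \mu:
v \rightarrow v' = \mu v - \eta \nabla C
w \rightarrow w' = w + v'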

other models of artificial neuron

  • tanh neuron
  • rectified linear neuron
we don’t have a solid theory of how activation functions should be chosen.
We approach machine learning techniques almost entirely empirically.
As long as your method minimizes some sort of objective function and has a finite capacity (or is properly regularized), you are on solid theoretical grounds.
The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well.
If you look through the research literature you’ll see that stories in a similar (heuristic) style appear in many research papers on neural nets, often with thin supporting evidence.
We need such heuristics to inspire and guide our thinking.
When you understand something poorly - as the explorers understood geography, and as we understand neural nets today - it’s more important to explore boldly than it is to be rigorously correct in every step of your thinking.
Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.