Thursday, February 2, 2017

Machine Learning ND 4, Deep learning, TensorFlow


Update 2017-4-1

While studying the Deep Learning nanodegree, I noticed that Udacity has done a major overhaul of this course. It is much better than the old one.

Update 2017-3-1

This section is exactly the same as the free course, Deep Learning, by Vincent Vanhoucke at Google Research. However, he goes through the workflow quickly, more like a review session, so it is extremely difficult to follow. To fill this gap, Udacity is currently developing the Deep Learning Foundation ND.
To recap, I would recommend the following order to study:
  1. Read through Michael Nielsen’s online book, Neural Networks and Deep Learning. It does an amazing job of explaining the key concepts, and it is frank about what is known, what is unknown, and how the techniques evolved. Run his code and try to understand what is happening. In Chapter 6, he uses the Theano module to do the dirty work of implementing a convolutional network.
  2. Study TensorFlow and work through its official documentation, from the basics up to the application of convolutional networks. Understand terms like graph, session, interactive session, tensor, and operation.
  3. The free course by Udacity provides 6 assignments, which are good practice for getting familiar with TensorFlow after the official tutorial. I would recommend finishing at least 1, 2, 3, and 4, which are character recognition tasks. I think 5 and 6 are optional, because they are about another topic, text processing. Personally, I believe a computer can only “mechanically” understand text from its context, losing the big picture of culture, human interaction, and emotions. It can think, but it cannot feel; feeling is hard-coded in human genes by millions of years of biological evolution.
  4. The final project: multi-digit recognition. The real difficulty of this project is not how to apply deep learning, but how to get a high-quality dataset from messy reality: how does the computer know there is a digit, and how does it locate it, resize it, and convert it into trainable data? Automating this is possible, but not easy.
The comeback of neural networks due to data & GPU power:
  • 2009 speech recognition
  • 2012 computer vision
  • 2014 machine translation
The following 6 notebooks were originally hosted at . They are subject to future updates.


It basically does the following things:
  1. The notMNIST data seems to be hosted at an internal URL of Google. The data is 500 k training + 19 k testing examples, with 10 class labels from ‘A’ to ‘J’. “notMNIST_large.tar.gz” is ~250 MB; “notMNIST_small.tar.gz”, which serves as the test set, is ~8 MB.
  2. Use extractall() to unpack the .tar.gz file into 10 folders. Each folder collects the images of one letter, e.g. ‘A’, with ~52 k png files. The folder names are stored in a variable called “train_folders” or “test_folders”, a list of strings.
  3. Use scipy.ndimage.imread() to convert each png file into a 2D array (28*28), normalize the pixel values by the maximum (i.e., 255), stack the ~52.9 k images of each folder into a 52.9 k*28*28 numpy array, and pickle.dump() it into a .pickle file. The names of the 10 pickle files are stored in a list named “train_datasets”.
  4. Use np.random.shuffle() and extract 20 k training, 1 k validation, and 1 k testing examples within each class. The total training set is 200 k.
  5. Use np.random.permutation() to shuffle the whole dataset.
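Steps 3 to 5 above can be sketched with NumPy alone. This is a minimal sketch on synthetic data, not the notebook's code: the tiny array sizes, the `per_class` dictionary, and the `split_and_merge` helper are made up for illustration, and the real images would come from the png files.

```python
import numpy as np

rng = np.random.default_rng(0)
image_size = 28
num_classes = 3          # the real dataset has 10 classes ('A' to 'J')
images_per_class = 50    # stand-in for the ~52.9k pngs per folder

# Step 3: fake 8-bit images per class, scaled by the maximum pixel value (255).
per_class = {
    label: rng.integers(0, 256, size=(images_per_class, image_size, image_size)) / 255.0
    for label in range(num_classes)
}

def split_and_merge(per_class, train_per_class, valid_per_class):
    """Step 4: shuffle inside each class, then take equal-sized slices."""
    train_x, train_y, valid_x, valid_y = [], [], [], []
    for label, data in per_class.items():
        data = data.copy()
        rng.shuffle(data)                      # shuffles along the first axis
        valid_x.append(data[:valid_per_class])
        train_x.append(data[valid_per_class:valid_per_class + train_per_class])
        valid_y += [label] * valid_per_class
        train_y += [label] * train_per_class
    return (np.concatenate(train_x), np.array(train_y),
            np.concatenate(valid_x), np.array(valid_y))

train_x, train_y, valid_x, valid_y = split_and_merge(per_class, 20, 5)

# Step 5: one permutation applied to both arrays keeps images and labels aligned.
perm = rng.permutation(len(train_y))
train_x, train_y = train_x[perm], train_y[perm]
```

Taking the same number of examples from every class keeps the merged training set balanced, which is why the shuffle happens per class before the global permutation.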
Main code:
train_filename = maybe_download('notMNIST_large.tar.gz', 247336696)  # string
train_folders = maybe_extract(train_filename)  # list of strings, each one a folder
train_datasets = maybe_pickle(train_folders, 45000)  # list of strings, each one a .pickle file holding an np.ndarray
valid_dataset, valid_labels, train_dataset, train_labels = merge_datasets(train_datasets, 200000, 10000)  # np.ndarray
train_dataset, train_labels = randomize(train_dataset, train_labels)  # np.ndarray
Finally, the 6 datasets (features and labels of train, valid, test) are packed into a dictionary and written with pickle.dump() to “notMNIST.pickle”, with a size of 690 MB. The file is so large only because the pickle is compressed far less than the .tar.gz file.
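The size difference can be seen on a small scale. The sketch below is only an illustration: `fake_images` and `fake_labels` are synthetic stand-ins, gzip is applied to the pickle bytes rather than to png files, and the actual ratio depends on the data.

```python
import gzip
import pickle
import numpy as np

# Synthetic stand-in for the real arrays: mostly background with a bright
# square, stored as float32.
fake_images = np.zeros((200, 28, 28), dtype=np.float32)
fake_images[:, 8:20, 8:20] = 1.0
fake_labels = np.arange(200) % 10

save = {"train_dataset": fake_images, "train_labels": fake_labels}

raw = pickle.dumps(save, pickle.HIGHEST_PROTOCOL)  # the bytes pickle.dump would write
packed = gzip.compress(raw)                        # the same bytes after gzip compression

print(len(raw), len(packed))

# Round trip: the compressed bytes restore the same arrays.
restored = pickle.loads(gzip.decompress(packed))
```

If disk space matters, writing the pickle through `gzip.open(..., "wb")` is one option, at the cost of slower loading.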

Logistic Regression

from sklearn.linear_model import LogisticRegression

nsamples, nx, ny = train_dataset.shape
X_train = train_dataset.reshape((nsamples,nx*ny))
y_train = train_labels

nsamples, nx, ny = test_dataset.shape
X_test = test_dataset.reshape((nsamples,nx*ny))
y_test = test_labels

train_size = 5000
test_size = 1000
clf = LogisticRegression()  # default settings assumed
clf.fit(X_train[0:train_size], y_train[0:train_size])
print(clf.score(X_test[0:test_size], y_test[0:test_size]))  # 0.864


  1. Data reshape.
  2. TensorFlow with simple gradient descent. Using 10 k samples and 801 epochs gives ~80% accuracy. Remember that the 10 k subset is only 2% of the total 500 k dataset; the point is to save time, because full-batch gradient descent is expensive.
  3. TensorFlow with stochastic gradient descent. 10 k samples, batch size 128, 3001 epochs, ~85% accuracy. This is essentially the “beginner” tutorial, which gets 92%.
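The difference between the two training modes is only in how much data each update sees. Below is a minimal NumPy sketch of minibatch SGD for softmax regression, assuming synthetic data and made-up hyperparameters (it is not the notebook's TensorFlow code): each step slices a different batch of 128 rows instead of using the whole set.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_features, num_classes = 1000, 10, 3

# Synthetic, roughly separable data: each class has its own mean vector.
means = rng.normal(0.0, 2.0, size=(num_classes, num_features))
labels = rng.integers(0, num_classes, size=num_samples)
features = means[labels] + rng.normal(0.0, 1.0, size=(num_samples, num_features))
one_hot = np.eye(num_classes)[labels]

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

weights = np.zeros((num_features, num_classes))
biases = np.zeros(num_classes)
batch_size, learning_rate, num_steps = 128, 0.05, 301

for step in range(num_steps):
    # Slide a window over the data so each step sees a different minibatch.
    offset = (step * batch_size) % (num_samples - batch_size)
    x = features[offset:offset + batch_size]
    y = one_hot[offset:offset + batch_size]
    probs = softmax(x @ weights + biases)
    # Gradient of the mean cross-entropy with respect to the logits.
    grad_logits = (probs - y) / batch_size
    weights -= learning_rate * (x.T @ grad_logits)
    biases -= learning_rate * grad_logits.sum(axis=0)

train_accuracy = (softmax(features @ weights + biases).argmax(axis=1) == labels).mean()
```

Each step costs the same regardless of the dataset size, which is why the stochastic version can afford many more steps than full-batch gradient descent.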
One of the difficulties is the accuracy function, because the training set and the test set have different sizes. The trick is to set shape=(None, neuron_num) when creating the placeholder.
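The reshape step and a size-independent accuracy function can be sketched in NumPy. The helper names `reformat` and `accuracy` and the toy inputs are assumptions for illustration; the point is that an arg-max comparison over rows works for any number of rows, which is what a (None, n) placeholder allows on the TensorFlow side.

```python
import numpy as np

def reformat(dataset, labels, num_labels=10):
    """Flatten N x 28 x 28 images to N x 784 and one-hot encode integer labels."""
    flat = dataset.reshape((dataset.shape[0], -1)).astype(np.float32)
    one_hot = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
    return flat, one_hot

def accuracy(predictions, one_hot_labels):
    """Percentage of rows whose arg-max prediction matches the label."""
    correct = np.argmax(predictions, axis=1) == np.argmax(one_hot_labels, axis=1)
    return 100.0 * correct.mean()

images = np.zeros((4, 28, 28))
labels = np.array([0, 3, 3, 9])
flat, one_hot = reformat(images, labels)

# Hypothetical model output: right on the first three rows, wrong on the last.
predictions = np.eye(10)[[0, 3, 3, 1]]
print(flat.shape, one_hot.shape, accuracy(predictions, one_hot))
```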
The result is 89%. The strange thing is that the validation accuracy is only 82%. See the complete code.


Why didn’t we figure out earlier that neural networks were effective?
Many reasons.
  1. Deep learning models only really shine when you have enough data to train them.
  2. Better regularization techniques: L2 regularization, dropout.
What changes          Train size  Batch  Epochs  Accuracy
L2 regularization     200k        128    4k      0.905
Dropout               200k        128    4k      0.906
Learning rate decay   200k        128    4k      0.901
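The three techniques in the table are each only a few lines. The following is a hedged NumPy sketch, not the notebook's TensorFlow code: the function names and parameters (`beta`, `keep_prob`, `decay_steps`, `decay_rate`) are chosen for illustration, with the L2 term using the half-sum-of-squares convention and the decay being exponential.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weight_matrices, beta):
    """L2 regularization: beta * sum of 0.5 * ||W||^2, added to the data loss."""
    return beta * sum(0.5 * np.sum(w ** 2) for w in weight_matrices)

def dropout(activations, keep_prob, training):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training:
        return activations          # no dropout at evaluation time
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

def decayed_learning_rate(initial_rate, step, decay_steps, decay_rate):
    """Exponential decay: the rate is multiplied by decay_rate every decay_steps steps."""
    return initial_rate * decay_rate ** (step / decay_steps)

weights = [np.ones((2, 2)), np.ones((2, 1))]
hidden = np.ones((1000, 50))
dropped = dropout(hidden, keep_prob=0.5, training=True)
penalty = l2_penalty(weights, beta=0.01)
rate = decayed_learning_rate(0.5, step=1000, decay_steps=1000, decay_rate=0.9)
```

Rescaling by `1 / keep_prob` keeps the expected activation unchanged, so the network needs no adjustment when dropout is switched off for validation and testing.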


Following the “expert” tutorial does the job.


The father of information retrieval is Gerard Salton, who proposed the Vector Space Model in 1975. The basic idea behind this model is the Distributional Hypothesis: words that appear in the same contexts share semantic meaning.
A group led by Tomas Mikolov at Google created the word embedding toolkit word2vec, which is based on 2-layer neural networks. Two popular flavors are the Continuous Bag-of-Words model and the Skip-gram model. The TensorFlow website provides a tutorial for the latter. The notebook is adapted from . A more serious implementation is here.
The principle in word2vec is to use a noise classifier. In the same context, the real target word is assigned a high probability, and k imaginary noise words are assigned low probabilities. It is computationally efficient because only k words, instead of the whole dictionary, are considered. Such a binary logistic function is called the noise-contrastive estimation (NCE) loss.
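The binary logistic idea can be written down in a few lines. The sketch below uses the negative-sampling form, a simplification of full NCE (which also corrects for the noise distribution's probabilities); the function name and the toy vectors are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def noise_contrastive_loss(context_vec, target_vec, noise_vecs):
    """Binary logistic loss: the real target should score high, the k noise words low."""
    target_term = -np.log(sigmoid(target_vec @ context_vec))
    noise_term = -np.sum(np.log(sigmoid(-(noise_vecs @ context_vec))))
    return target_term + noise_term

rng = np.random.default_rng(0)
embedding_size, k = 8, 5
context_vec = rng.normal(size=embedding_size)
noise_vecs = rng.normal(size=(k, embedding_size))

# Noise vectors forced to point away from the context (negative dot product).
away = -np.abs(noise_vecs) * np.sign(context_vec)

# A target aligned with the context plus noise pointing away gives a low loss ...
good_loss = noise_contrastive_loss(context_vec, context_vec, away)
# ... while the reverse assignment gives a high one.
bad_loss = noise_contrastive_loss(context_vec, -context_vec, -away)
```

Only the target and the k noise rows are touched per example, so the cost does not grow with the vocabulary size.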
A vanilla definition of the context is the words to the left and right of the target, i.e., (context, target) = ([left, right], middle) as input/output pairs. The noisy (contrastive) examples are drawn from some noise distribution, such as the unigram distribution.
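Building these pairs from a token list is straightforward. This is a small illustrative sketch with hypothetical helper names; it also shows how the skip-gram flavor inverts the same pairs so that the middle word predicts each neighbour.

```python
def cbow_pairs(tokens, window=1):
    """(context, target) pairs: the neighbours on each side predict the middle word."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skip_gram_pairs(tokens, window=1):
    """Skip-gram inverts the direction: the middle word predicts each neighbour."""
    return [(target, word)
            for context, target in cbow_pairs(tokens, window)
            for word in context]

sentence = "the quick brown fox".split()
print(cbow_pairs(sentence))
# [(['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown')]
print(skip_gram_pairs(sentence))
# [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]
```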
Visualize the learned vectors by projecting them to 2D using t-SNE.
One way of evaluating is to calculate the distance between the target and other words. Another way is to use analogical reasoning.
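Both evaluations reduce to cosine similarity between embedding vectors. The sketch below uses hand-picked 2-D vectors, which are entirely hypothetical and only chosen so the arithmetic is easy to follow; real embeddings would come from the trained model.

```python
import numpy as np

# Hypothetical 2-D embeddings, chosen by hand for illustration.
vocab = ["king", "queen", "man", "woman", "apple"]
embeddings = np.array([
    [0.9, 0.8],   # king
    [0.9, 0.2],   # queen
    [0.3, 0.8],   # man
    [0.3, 0.2],   # woman
    [-0.8, 0.1],  # apple
])

def normalize(vectors):
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)

def nearest(query_vec, exclude=()):
    """Word whose embedding has the highest cosine similarity to query_vec."""
    sims = normalize(embeddings) @ normalize(query_vec)
    for word in exclude:
        sims[vocab.index(word)] = -np.inf
    return vocab[int(np.argmax(sims))]

def analogy(a, b, c):
    """Solve a : b = c : ? with vector arithmetic (b - a + c)."""
    vec = lambda w: embeddings[vocab.index(w)]
    return nearest(vec(b) - vec(a) + vec(c), exclude=(a, b, c))

print(analogy("man", "king", "woman"))  # queen
```

The query words are excluded from the search because the nearest vector to `b - a + c` is often one of the inputs themselves.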


LSTM, recurrent NN

Practical Methodology for Deploying machine learning

2015.10, AI with the best.
3-step process:
  1. Use needs to define metric-based goals
  2. Build an end-to-end system
  3. Data-driven refinement
Identify the most difficult part as soon as possible.
Deep or not?
  • lots of noise, little structure -> not deep
  • little noise, complex structure -> deep
Getting familiar with one ML algorithm and knowing how to tune it is enough.
What kind of deep?
  • No structure -> fully connected
  • Spatial structure -> convolutional
  • Sequential structure -> recurrent
Baseline: 2-3 hidden layers, ReLU, dropout, SGD + momentum
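Two pieces of this baseline, ReLU and SGD with momentum, are simple enough to sketch in NumPy. This is an illustration on a toy quadratic objective with made-up hyperparameters, not a recommendation of specific values.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative inputs become zero."""
    return np.maximum(x, 0.0)

def momentum_step(params, grads, velocity, learning_rate=0.1, momentum=0.9):
    """Classical momentum: velocity accumulates a decaying sum of past gradients."""
    velocity = momentum * velocity - learning_rate * grads
    return params + velocity, velocity

# Toy objective f(p) = 0.5 * ||p||^2, whose gradient is simply p.
params = np.array([5.0, -3.0])
velocity = np.zeros_like(params)
for _ in range(200):
    params, velocity = momentum_step(params, params, velocity)

activated = relu(np.array([-1.0, 0.0, 2.0]))
```

On this objective the parameters spiral in towards zero; the momentum term trades some oscillation for faster progress along directions where the gradient is consistently small.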