Thursday, March 30, 2017

Deep Learning ND 3, RNN, NLP

Update on 2017-5-5

Udacity reorganized courses into 4 sections:
  1. Neural Networks (8 courses + 1 project: bike share prediction)
  2. Convolutional Neural Networks (13 courses + 1 project: CIFAR-10 image classification)
  3. Recurrent Neural Networks (20 courses + 2 projects: generating TV scripts, translation)
  4. Generative Adversarial Networks (6 courses + 1 project: Generate Faces)
So this post becomes the first half of section 3.



3 Word2vec

Tutorial by Tensorflow

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
Word2vec is actually an efficient algorithm for producing word embeddings. It was published in 2013 by Google researchers led by Tomas Mikolov.
It is unsupervised learning because you only have raw text without any labels; the labels are generated from the text itself. Depending on whether you choose the target word or the context word as the label, there are two model architectures:
  • CBOW: use context to predict target word
  • skip-gram: use word to predict context.
Latest notebook is here. The dataset “text8” is originally hosted by Matt Mahoney.
Several concepts in the implementation are difficult to understand.
Elaborating on these concepts is beyond my ability at this moment. The core code is:
import math
import tensorflow as tf

# Look up embeddings for inputs, a random big matrix.
embeddings = tf.Variable(tf.random_uniform([50000, 128], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([50000, 128], stddev=1.0 / math.sqrt(128)))
nce_biases = tf.Variable(tf.zeros([50000]))
# Compute the average NCE loss for the batch.
# train_inputs, train_labels and num_sampled are defined elsewhere in the tutorial.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_labels, inputs=embed,
                   num_sampled=num_sampled, num_classes=50000))
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

by Radim Řehůřek

by Udacity.Mat

jupyter trust deep-learning/embeddings/Skip-Grams-Solution.ipynb
One difficulty is how to generate the (inputs, targets) pair.
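The pair generation can be sketched in plain Python. This is a minimal illustration and not the notebook's code: the function name skip_gram_pairs, the toy sentence, and the fixed window size are all assumptions made for the example (the course notebook may choose the window differently, e.g. at random).

```python
def skip_gram_pairs(words, window_size=2):
    """Pair every word with each neighbor at most `window_size` positions away.

    Hypothetical helper for illustration: the input is the center word,
    the target is one of its context words.
    """
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window_size)
        stop = min(len(words), i + window_size + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, words[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
pairs = skip_gram_pairs(sentence, window_size=1)
for pair in pairs:
    print(pair)
```

With a window of 1, each inner word contributes two pairs and each edge word contributes one, so the 5-word toy sentence yields 8 pairs.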
The other is the embedding implementation. Think of it this way:
  1. Embedding features are like basis functions or the periodic table: they are the “atoms” of your fictional world. We choose a few hundred simply because it is a manageable number.
  2. A few hundred features times a vocabulary of tens of thousands of words is still a large matrix multiplication, which is computationally expensive. Because the input is one-hot encoded, the lookup table simply extracts the matching row without any multiplication.
  3. The embedding matrix is a monster matrix. Its initial values are random (tf.random_uniform) and are tuned by the optimizer during training.
  4. Use tf.nn.sampled_softmax_loss to calculate the loss. Be sure to read the documentation to figure out how it works.
n_vocab = len(int_to_vocab) # 63 k, vocab basis
n_embedding = 200 # Number of embedding features 
n_sampled = 100
train_graph = tf.Graph()
with train_graph.as_default():
    inputs = tf.placeholder(tf.int32, [None], name='inputs')
    labels = tf.placeholder(tf.int32, [None, None], name='labels')
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)
    softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding), stddev=0.1))
    softmax_b = tf.Variable(tf.zeros(n_vocab))
    # Calculate the loss using negative sampling
    loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b, 
                                      labels, embed,
                                      n_sampled, n_vocab)
    cost = tf.reduce_mean(loss)
    optimizer = tf.train.AdamOptimizer().minimize(cost)
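The lookup shortcut mentioned before the graph code (an embedding lookup gives the same result as multiplying a one-hot vector by the embedding matrix) can be checked on a toy matrix. This is only a sketch in plain Python: the sizes, the helper names one_hot and matmul_row, and the random seed are assumptions for the example, not part of the notebook.

```python
import random

random.seed(0)
n_vocab, n_embedding = 5, 3  # toy sizes, far smaller than the 63 k x 200 above
embedding = [[random.uniform(-1, 1) for _ in range(n_embedding)]
             for _ in range(n_vocab)]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matmul_row(row, matrix):
    """Multiply a 1 x n row vector by an n x m matrix."""
    return [sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(len(matrix[0]))]


word_id = 3
via_matmul = matmul_row(one_hot(word_id, n_vocab), embedding)  # n_vocab * n_embedding multiplications
via_lookup = embedding[word_id]                                # just picks a row

print(via_matmul)
print(via_lookup)
```

Both lines print the same vector; the lookup simply skips all the multiplications by zero.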
Finally, use TSNE().fit_transform() to reduce the 200-dimensional vectors to 2D and draw a scatter plot to show the relations. This is a huge dimensionality reduction, so you are going to lose some information anyway.
My final question is: how useful are the words with similar semantic meanings? Is it the starting point for machine understanding human language?
The inductive bias is that words appearing among similar neighboring words have similar meanings. So it is actually a synonym calculator. It is far from real understanding, e.g., how a learner’s dictionary works.
Maybe we should start from an elementary textbook?

6 TensorBoard

Interview tips

No one talks about their failures. Practice a lot:
  • Intro Data Structures (mycodeschool)
  • Intro to Algorithms (MIT OpenCourseWare)
  • Mock Interviews
Get a Job at a Startup:
Search using the keywords “technical recruiter”

10 Sentiment Prediction RNN

The official solution has some problems; check my note.
The dataset is IMDB movie reviews. It’s supervised learning.
The core code is below. I would like to mention 2 points:
  1. To simplify the code, the Keras-like tf.contrib module is used.
  2. The RNN is implemented with the tf.contrib.rnn submodule.
import tensorflow as tf
lstm_size = 256
lstm_layers = 1
batch_size = 500
learning_rate = 0.001
n_words = len(vocab)  #74072
embed_size = 300 
graph = tf.Graph()
with graph.as_default():
    # layer 0
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    # layer 1
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
    # layer 2
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
    initial_state = cell.zero_state(batch_size, tf.float32)
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
    # output layer
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Project 3, TV script generation

The TV script is from The Simpsons, with 4257 lines and 11.5k unique words. Note that in the helper function, the text is preprocessed into a large array with each word converted into an integer.
Complete implementation is here. I would like to mention 2 obstacles.
First, extract the text into feature-target pairs.
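import numpy as np

def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    # Number of full batches; the -1 leaves room for the shifted targets.
    num = (len(int_text) - 1) // (batch_size * seq_length)
    batches = np.ndarray(shape=(num, 2, batch_size, seq_length), dtype=int)
    gap = num * seq_length  # distance between consecutive rows of one batch
    for i in range(num):
        offset = i * seq_length
        for j in range(batch_size):
            # Inputs, then targets shifted one word to the right.
            batches[i, 0, j] = int_text[offset:offset + seq_length]
            batches[i, 1, j] = int_text[offset + 1:offset + seq_length + 1]
            offset += gap
    return batches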
Second, configure the RNN. Note: this only works for TensorFlow 1.0, not 1.1 or later.
import tensorflow as tf
from tensorflow.contrib import seq2seq
vocab_size = 6779
rnn_size = 256

input_text = tf.placeholder(tf.int32, [None, None], name='input')
targets = tf.placeholder(tf.int32, [None, None], name='targets')
learingRate = tf.placeholder(tf.float32, name='learingRate')
input_data_shape = tf.shape(input_text)
with tf.name_scope("initial_rnn_cell"):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([lstm])
    initial_state = tf.identity (cell.zero_state (input_data_shape[0], tf.float32), name="initial_state")
with tf.name_scope("embed"): 
    embedding = tf.Variable(tf.random_uniform((vocab_size, rnn_size), -1, 1), name = "embedding")
    embed = tf.nn.embedding_lookup(embedding, input_text, name = "lookup")
#with tf.name_scope("build_rnn"):
outputs, final_state = tf.nn.dynamic_rnn(cell, embed ,dtype = tf.float32)
final_state = tf.identity(final_state, name = "final_state")
logits = tf.contrib.layers.fully_connected(outputs, vocab_size, activation_fn= None, name ="logits")

with tf.name_scope("softmax"):
    probs = tf.nn.softmax(logits, name='probs')
    cost = seq2seq.sequence_loss(logits,targets,tf.ones([input_data_shape[0], input_data_shape[1]]))
#with tf.name_scope("train"):
optimizer = tf.train.AdamOptimizer(learingRate)
gradients = optimizer.compute_gradients(cost)
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients]
train_op = optimizer.apply_gradients(capped_gradients)

folder = 'board_rnn'
with tf.Session() as sess:
    writer = tf.summary.FileWriter(folder, sess.graph)  # create writer and write the graph
    print("graph is written into folder:", folder)
# tensorboard --logdir "board_rnn"

Monday, March 27, 2017

Deep Learning ND 2, sentiment analysis, image classification

Course schedule: week 3-6, lesson 10-23, project 2

section 2

2 Sentiment Analysis with Andrew Trask

Andrew Trask is a PhD student at the University of Oxford. He is currently writing a book: Grokking Deep Learning (40% Off: traskud17). It is an in-progress book and you prepay to read each chapter as he finishes.
The course material is a few notebooks: Sentiment Network
Project end goal: analyze IMDB comments to infer “positive” or “negative”. The basic flow is:
  1. You have 25 k reviews with binary target labels. The reviews can be decomposed into a vocabulary of 74 k words.
  2. Write a home-made class called SentimentNetwork that preprocesses the data and constructs a network with a 10-node hidden layer, a sigmoid output, and backpropagation. The input layer has the size of the vocabulary, 74 k.
  3. The last 1 k reviews are used for testing.
Dataset documentation: here.
miniproject 1
  1. Use Counter() to build 3 counters of word occurrences: in positive reviews, in negative reviews, and in all reviews.
  2. Because the most common words are connecting words and prepositions that appear in both positive and negative reviews, we use another counter to store the ratio of positive count to negative count for each word, and use np.log to rescale the very large and very small ratios.
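The two steps of miniproject 1 can be sketched on a handful of made-up reviews. The toy reviews, the variable names, and the add-one smoothing (used here so that a word never seen in negative reviews does not cause a division by zero) are assumptions for this illustration; the course notebook may handle these details differently. math.log is used instead of np.log only to keep the example dependency-free.

```python
from collections import Counter
import math

# Toy data, invented for the example.
reviews = ["great movie great cast", "terrible movie",
           "great fun", "terrible plot terrible cast"]
labels = ["positive", "negative", "positive", "negative"]

positive_counts, negative_counts, total_counts = Counter(), Counter(), Counter()
for review, label in zip(reviews, labels):
    words = review.split(" ")
    total_counts.update(words)
    if label == "positive":
        positive_counts.update(words)
    else:
        negative_counts.update(words)

# The log turns ratios like 4 and 1/4 into symmetric values around 0.
pos_neg_log_ratios = {}
for word in total_counts:
    ratio = (positive_counts[word] + 1) / (negative_counts[word] + 1)  # +1 is an assumed smoothing
    pos_neg_log_ratios[word] = math.log(ratio)

for word, value in sorted(pos_neg_log_ratios.items(), key=lambda kv: kv[1]):
    print(word, round(value, 3))
```

Words that appear equally often in both classes (here "movie" and "cast") end up at 0, while "great" and "terrible" land at opposite ends of the scale.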
miniproject 2
  1. Use set(total_counts.keys()) to build a vocabulary, i.e., a set of unique words.
  2. Use a word2index dictionary to give an index to each word.
  3. Vectorize each review based on this vocabulary.
miniproject 3
  1. Construct a class named SentimentNetwork, initialized with a 10-node hidden layer.
  2. Use 24 k reviews for the training set and 1 k reviews for the testing set. The accuracy is 60%.
miniproject 4
By setting self.layer_0[0][self.word2index[word]] = 1 (instead of counting occurrences), even the most common tokens, such as spaces and prepositions, are capped at a value of 1. The neural network is trained more effectively, and a testing accuracy of 85% is obtained.
miniproject 5
Taking advantage of the sparsity of layer_0, only the few nodes with nonzero values are used to calculate the weighted sum. This increases the training speed by 10 times.
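The sparsity trick can be illustrated with a toy weight matrix. This is a sketch under assumptions, not the SentimentNetwork code: the sizes, the random weights, and the list review_indices are invented for the example.

```python
import random

random.seed(1)
vocab_size, hidden_nodes = 8, 3  # toy sizes; the post uses about 74 k and 10
weights_0_1 = [[random.uniform(-0.1, 0.1) for _ in range(hidden_nodes)]
               for _ in range(vocab_size)]

review_indices = [1, 4, 6]  # indices of the words present in one review

# Dense version: build the full input vector and multiply every entry,
# including all the zeros.
layer_0 = [0.0] * vocab_size
for idx in review_indices:
    layer_0[idx] = 1.0
dense = [sum(layer_0[i] * weights_0_1[i][j] for i in range(vocab_size))
         for j in range(hidden_nodes)]

# Sparse version: add up only the weight rows of the words that are present.
sparse = [0.0] * hidden_nodes
for idx in review_indices:
    for j in range(hidden_nodes):
        sparse[j] += weights_0_1[idx][j]

print(dense)
print(sparse)
```

The dense version touches vocab_size * hidden_nodes weights per review, the sparse one only len(review_indices) * hidden_nodes, which is where the speed-up comes from when a review uses a tiny fraction of the vocabulary.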
miniproject 6
  1. Use the bokeh module to plot D3-style histograms.
  2. Use min_count=10 and polarity_cutoff=0.1 to add only the informative words to the vocabulary. This further increases training speed by 4 times, although the accuracy is slightly reduced to 82%.
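One way to read the two thresholds: a word enters the vocabulary only if it is frequent enough and its positive-to-negative log ratio is far enough from zero. The sketch below assumes exactly that rule; the function name build_vocab and the toy counts and ratios are hypothetical, and the course notebook may combine the conditions differently.

```python
# Toy statistics, invented for the example.
total_counts = {"the": 500, "excellent": 40, "awful": 35, "zyx": 2, "movie": 300}
pos_neg_log_ratios = {"the": 0.02, "excellent": 1.8, "awful": -2.1,
                      "zyx": 0.9, "movie": -0.05}


def build_vocab(total_counts, log_ratios, min_count=10, polarity_cutoff=0.1):
    """Keep words that are both frequent and clearly positive or negative."""
    vocab = set()
    for word, count in total_counts.items():
        frequent_enough = count > min_count
        polarized = abs(log_ratios.get(word, 0.0)) >= polarity_cutoff
        if frequent_enough and polarized:
            vocab.add(word)
    return vocab


vocab = build_vocab(total_counts, pos_neg_log_ratios)
print(sorted(vocab))
```

Here "the" and "movie" are dropped as too neutral and "zyx" as too rare, so fewer input nodes have to be processed per review.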


use the weights to see the similarity under the positive/negative context
def get_most_similar_words(focus = "horrible"):
    most_similar = Counter()
    for word in mlp_full.word2index.keys():
        weights_a = mlp_full.weights_0_1[mlp_full.word2index[word]]
        weights_b = mlp_full.weights_0_1[mlp_full.word2index[focus]]
        most_similar[word] = np.dot(weights_a, weights_b)
    return most_similar.most_common()
Use sklearn.manifold.TSNE to project the word vectors to 2D and visualize the resulting clusters.

3 Intro to TFLearn

This lesson begins with a comparison of different activation functions:
  • sigmoid: its derivative dy/dx has a maximum value of 0.25, so the activation's contribution to the gradient shrinks by at least a factor of 4 per layer, and deep networks are difficult to train.
  • ReLU is better, but the learning rate should be fine-tuned to avoid units getting stuck at an output of 0.
  • softmax is good for multi-class learning. Consequently, the cost function is changed from sum of squared errors to cross entropy.
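The 0.25 figure for sigmoid can be checked numerically. The sketch below only looks at the activation's own derivative and ignores the weights, so the printed numbers are an upper bound on that one factor of the gradient, not the gradient itself; the function names are chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Scan x from -10 to 10 in steps of 0.01; the maximum is reached at x = 0.
max_grad = max(sigmoid_grad(x / 100.0) for x in range(-1000, 1001))
print("max derivative:", max_grad)

# Upper bound on the product of sigmoid derivatives across several layers.
for n_layers in (1, 5, 10):
    print(n_layers, "layers:", max_grad ** n_layers)
```

After 10 sigmoid layers the factor is already below one in a million, which is the vanishing-gradient problem in a nutshell.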
TFLearn does a lot of things for you such as initializing weights, running the forward pass, and performing backpropagation to update the weights. You end up just defining the architecture of the network (number and type of layers, number of units, etc.) and how it is trained.
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical
reviews = pd.read_csv('reviews.txt', header=None) 
labels = pd.read_csv('labels.txt', header=None) # 25 k
from collections import Counter
total_counts = Counter()
for _, row in reviews.iterrows():
    total_counts.update(row[0].split(' ')) #have 74 k keys
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:10000] # keep the 10 k most common words
word2idx = {word: i for i,word in enumerate(vocab)} # used to vectorize the word
def text_to_vector(text):
    word_vector = np.zeros(len(vocab), dtype=np.int_)
    for word in text.split(' '):
        idx = word2idx.get(word,None) # get index or None
        if idx is not None:    # skip words outside the vocabulary
            word_vector[idx] += 1
    return np.array(word_vector)
word_vectors = np.zeros((len(reviews), len(vocab)), dtype=np.int_)
for i, (_, text) in enumerate(reviews.iterrows()):
    word_vectors[i] = text_to_vector(text[0]) # vectorize all reviews
Y = (labels=='positive').astype(np.int_)
records = len(labels)
y = to_categorical(Y,2)  # change 1 label to 2 labels
from sklearn.model_selection import train_test_split
X_train, X_test,y_train,y_test = train_test_split(word_vectors,y,test_size = 0.1)
Build and train model
def build_model():
    net = tflearn.input_data([None,10000]) # unknown instances, 10000 nodes
    net = tflearn.fully_connected(net,200, activation = "ReLU")
    net = tflearn.fully_connected(net,25 , activation = "ReLU")
    net = tflearn.fully_connected(net, 2, activation = "softmax")
    net = tflearn.regression(net, optimizer= 'sgd', learning_rate = 0.1, loss= "categorical_crossentropy")
    model = tflearn.DNN(net)
    return model
model = build_model()
model.fit(X_train, y_train, validation_set=0.1, show_metric=True, batch_size=128, n_epoch=50)
predictions = (np.array(model.predict(X_test))[:,0] >= 0.5).astype(np.int_)
test_accuracy = np.mean(predictions == y_test[:,0], axis=0)
print("Test accuracy: ", test_accuracy)
The tricky thing here is TFLearn does not fully support TensorFlow.


  • Christopher Olah’s blog post on RNNs and LSTMs. This is the shortest and most accessible read.
  • Deep Learning Book chapter on RNNs. This will be a very technical read and is recommended for students very comfortable with advanced mathematical notation and scientific papers.
  • Andrej Karpathy’s lecture on Recurrent Neural Networks. This is a fairly long lecture (around an hour) but covers the content quite well, as always with Karpathy.

7 MiniFlow

This MiniFlow lesson aims to give you practice with the architecture before everything is encapsulated in TensorFlow. My implementation is in this gist. The dataset used in the quiz is sklearn.datasets.load_boston.
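The idea of such a graph library can be sketched in a few lines. This is not the gist mentioned above: the class names, the scalar-only Linear node, and the assumption that the nodes are already listed in dependency order (instead of being sorted topologically) are simplifications made for the example, and only the forward pass is shown.

```python
import math


class Node:
    """A graph node that remembers its inputs and holds a computed value."""

    def __init__(self, inbound=()):
        self.inbound = list(inbound)
        self.value = None

    def forward(self):
        pass  # Input nodes have nothing to compute


class Input(Node):
    def __init__(self, value=None):
        super().__init__()
        self.value = value


class Linear(Node):
    """Scalar version of y = w * x + b."""

    def forward(self):
        x, w, b = (node.value for node in self.inbound)
        self.value = w * x + b


class Sigmoid(Node):
    def forward(self):
        self.value = 1.0 / (1.0 + math.exp(-self.inbound[0].value))


def forward_pass(ordered_nodes):
    """Run forward() on nodes that are already in dependency order."""
    for node in ordered_nodes:
        node.forward()
    return ordered_nodes[-1].value


x, w, b = Input(2.0), Input(3.0), Input(-6.0)
linear = Linear([x, w, b])
output_node = Sigmoid([linear])
output = forward_pass([x, w, b, linear, output_node])
print(output)
```

With these inputs the linear node evaluates to 0, so the sigmoid output is 0.5.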

9,11,12 TensorFlow

These 3 lessons repackage Vincent’s previous deep learning course by adding more illustrative animations and more quizzes. Although I watched Vincent’s previous course several times, I didn’t fully understand what he meant until this time. Now I realize why a picture is worth a thousand words.


(The previous course seems to have been moved somewhere else.)

Project 2: classify image from CIFAR10

The CIFAR-10 dataset is originally hosted at
  • 163 MB
  • 60 k instances (50 k training + 10 k test); every 10 k instances are pickled into one batch
  • input features are 32x32 RGB images; the target feature is one of 10 classes: ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
  • use TensorFlow to build a neural net including 1 CNN layer (32 filters, 5x5) + maxpool + flatten + fully connected layer (1024-node) + dropout + output layer (10-node) + softmax_cross_entropy_with_logits
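To see how large the flattened layer in such a network gets, the layer shapes can be computed by hand. The post does not state the strides, padding, or pooling size, so the values below (stride 1 with SAME padding for the convolution, a 2x2 max pool with stride 2) are assumptions; the helper output_size follows the usual TensorFlow conventions for SAME and VALID padding.

```python
import math


def output_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer (TensorFlow convention)."""
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


side = 32                                                      # CIFAR-10 images are 32x32
side = output_size(side, kernel=5, stride=1, padding="SAME")   # 5x5 convolution (assumed stride/padding)
channels = 32                                                  # 32 filters
conv_shape = (side, side, channels)

side = output_size(side, kernel=2, stride=2, padding="SAME")   # 2x2 max pool (assumed)
pool_shape = (side, side, channels)

flat_size = side * side * channels                             # input size of the 1024-node layer
print(conv_shape, pool_shape, flat_size)
# prints: (32, 32, 32) (16, 16, 32) 8192
```

Under these assumptions the 1024-node fully connected layer receives 8192 inputs per image.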

Thursday, March 23, 2017

SAS, University version

Why SAS?

SAS is short for “Statistical Analysis System”.
  • 1966, a prototype was developed by Barr and Goodnight, funded by NIH
  • 1976, they moved from North Carolina State University and founded SAS Institute.
  • 1985, SAS was rewritten in C to allow it to run on Unix, MS-DOS, and Windows.
  • 2002, the Text Miner component was introduced.
  • 2010, a free version for students was introduced.
  • 2010-12, SAS sued World Programming, but the European Court of Justice ruled that “the functionality of a computer program and the programming language cannot be protected by copyright”
So SAS has a long history, and its target customers are enterprises that need analytics.
  • It is web-browser based. Although starting a local server in a virtual machine seems a little complicated, it has the advantage of being cross-platform.
  • It can be seen as an “advanced statistical version” of Excel: it has a rich GUI that lets people learn quickly, and it comes with excellent technical support.
  • Big corporations like SAS because there’s a complete ecosystem that satisfies customers’ every need.
  • Its direct competitors are Stata and SPSS (acquired by IBM).
  • When you click on the front end, the corresponding code is automatically generated in the back end. This means you have the code to regenerate the exact same graph or to make changes to it.
  • It integrates with SQL seamlessly.
And the usage differs by industry sectors:

University Edition

This version is free. Check here. SAS University Edition includes SAS® Studio, Base SAS®, SAS/STAT®, SAS/IML®, SAS/ACCESS® and several time series forecasting procedures from SAS/ETS®.
There are 2 approaches to get SAS running:
  1. Download a .ova file (2.2 GB). Use VirtualBox to start a local host and run SAS locally.
  2. Use the AWS AMI: SAS University Edition. You have to pay an EC2 fee ranging from 0.012-0.047 /hr. It’s actually pretty cheap.
Open a new browser window with http://localhost:10080/ and you are good to go.


SAS programs have a DATA step, which retrieves and manipulates data, usually creating a SAS data set, and a PROC step, which analyzes the data.
data highchol;
    set sashelp.heart;
    where Chol_Status = "High";
run;
proc print data = highchol;
run;
proc print data = sashelp.cars;    /*two-level name: library.table */
    by Make;
    var Make Model Type;
run;

create library/ import csv

libname libsas 'S:/datafiles'; /* physical location of the dataset, which can be found in file's property */
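data titanic;
    /* dsd handles quoted fields that contain commas, e.g. passenger names */
    infile '/folders/myfolders/train.csv' dlm=',' dsd firstobs=2;
    /* $ marks character columns; :$100. reads names longer than 8 characters */
    input PassengerId Survived Pclass Name :$100. Sex $;
run;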
Using proc import is much more convenient because you don’t need to assign the column names manually. There is a video guide which uses the snippets:
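/** FOR CSV Files uploaded from Unix/MacOS **/
FILENAME CSV "/folders/myfolders/train.csv" TERMSTR=LF;
/** Import the CSV file.  **/
PROC IMPORT DATAFILE=CSV OUT=WORK.MYCSV DBMS=CSV REPLACE;
RUN;
/** Print the results. **/
PROC PRINT DATA=WORK.MYCSV; RUN;
/** Unassign the file reference.  **/
FILENAME CSV;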
Alternatively, you can use tasks and utilities -> utilities -> import data. Then drag and drop the file from the “server files and folders”.



ods graphics / reset imagemap;
proc sgplot data=SASHELP.CARS;
    title "Vehicle Statistics";
    scatter x=Horsepower y=MPG_City / group=Origin 
        markerattrs=(symbol=CircleFilled size=12) transparency=0.7 name='Scatter';
    xaxis grid;
    yaxis grid;
    keylegend / location=Inside across=1;
run;
ods graphics / reset;
Other plots, such as bar plots and histograms, are similar.

Certification training

The ad is for version 9.3, 2011, while the latest version is 9.4, 2013.
There are several certification packages:
  • Base programming: 3.1 k
  • Advanced programming: 3.8 k/2.45k
  • Predictive Modeling: 2.65 k
  • statistical analysis: 3.05 k