Yuchao's blogspot: Deep Learning ND 3, RNN, NLP

Update on 2017-5-5

Udacity reorganized courses into 4 sections:

Neural Networks (8 courses + 1 project: bike share prediction)
Convolutional Neural Networks (13 courses + 1 project: CIFAR-10 image classification)
Recurrent Neural Networks (20 courses + 2 projects: generating TV scripts, translation)
Generative Adversarial Networks (6 courses + 1 project: generating faces)

So this post becomes the first half of section 3.

1 RNN

Resources

Andrej Karpathy’s lecture on RNNs and LSTMs from CS231n
A great blog post by Christopher Olah on how LSTMs work.
Building an RNN from the ground up; this is a little more advanced, but has an implementation in TensorFlow.

The notebook with the character-wise RNN is in the public GitHub repo.

resources

3 Word2vec

Tutorial by TensorFlow

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.

Word2vec is actually an efficient algorithm for producing word embeddings. It was published in 2013 by Google researchers led by Tomas Mikolov.

It is unsupervised learning because you only have raw text without any labels; the labels are taken from the text itself. Depending on whether you choose the target word or the context words as the label, there are two model architectures:

CBOW: use the context words to predict the target word.
skip-gram: use the target word to predict its context words.
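
To make the difference concrete, here is a minimal sketch in plain Python of the training examples each architecture would see. The function name `training_pairs`, the toy sentence, and the fixed window size are my own assumptions for illustration; real implementations usually work on integer word ids and often vary the window.

```python
def training_pairs(tokens, window=2):
    """Build (input, label) examples for both word2vec architectures."""
    cbow, skip_gram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: the whole context is the input, the target word is the label.
        cbow.append((context, target))
        # Skip-gram: the target word is the input, each context word is a label.
        skip_gram.extend((target, c) for c in context)
    return cbow, skip_gram


tokens = "the quick brown fox jumps".split()
cbow, skip_gram = training_pairs(tokens, window=1)
print(cbow[1])        # (['the', 'brown'], 'quick')
print(skip_gram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

Skip-gram turns one position in the text into several (input, label) pairs, which is why it produces more training examples than CBOW from the same corpus.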

https://www.tensorflow.org/tutorials/word2vec

Latest notebook is here. The dataset “text8” is originally hosted by Matt Mahoney.

Several concepts make the implementation difficult to understand.

Elaborating on these concepts is beyond my ability at the moment. The core code is:

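import math
import tensorflow as tf  # TensorFlow 1.x API

vocabulary_size = 50000  # number of distinct words kept
embedding_size = 128     # dimension of each word vector
batch_size = 128         # assumed value, not given in the original snippet
num_sampled = 64         # assumed number of negative examples to sample

# Word ids for the inputs and their labels (one label per input).
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Look up embeddings for inputs in a big matrix initialized at random.
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Construct the variables for the NCE loss.
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Compute the average NCE loss for the batch.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)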

by Radim Řehůřek

https://rare-technologies.com/word2vec-tutorial

https://radimrehurek.com/gensim/models/word2vec.html

https://www.youtube.com/watch?v=wTp3P2UnTfQ

by Udacity.Mat

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

https://github.com/udacity/deep-learning/tree/master/embeddings

jupyter trust deep-learning/embeddings/Skip-Grams-Solution.ipynb

One difficulty is how to generate the (inputs, targets) pairs.

The other is the embedding implementation. Think of it this way:

Embedding features are like basis functions, or like the periodic table: they are the “atoms” of your fictional world. We choose a few hundred simply because it is a manageable number.
Multiplying a one-hot vector by a matrix with tens of thousands of rows and a few hundred columns is computationally expensive, and almost all of the work is multiplying by zero. The lookup table takes advantage of the one-hot encoding: it extracts the row for the word directly, without any computation.
The embedding matrix is a monster matrix. Its initial values are drawn at random by tf.random_uniform and then tuned by the optimizer during training.
Use tf.nn.sampled_softmax_loss to calculate the loss. Be sure to read the documentation to figure out how it works.
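
The claim that a lookup replaces the matrix multiplication can be checked on a toy example. This is a small sketch in plain Python rather than TensorFlow; the sizes and the helper names `one_hot_matmul` and `lookup` are made up for illustration.

```python
import random

random.seed(0)
n_vocab, n_embedding = 6, 3  # toy sizes; the post uses about 63k x 200

# Random initial embedding matrix, one row per word.
embedding = [[random.uniform(-1, 1) for _ in range(n_embedding)]
             for _ in range(n_vocab)]


def one_hot_matmul(word_id):
    """Multiply a one-hot row vector by the embedding matrix the slow way."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(n_vocab)]
    return [sum(one_hot[i] * embedding[i][k] for i in range(n_vocab))
            for k in range(n_embedding)]


def lookup(word_id):
    """Pick the row directly; no arithmetic is needed."""
    return embedding[word_id]


word_id = 4
same = one_hot_matmul(word_id) == lookup(word_id)
print(same)  # True
```

The slow version does n_vocab * n_embedding multiplications per word, while the lookup does none, which is the saving that tf.nn.embedding_lookup is there to provide.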

import tensorflow as tf  # TensorFlow 1.x API

# int_to_vocab (id -> word dictionary) is built earlier in the notebook.
n_vocab = len(int_to_vocab)  # 63 k, vocab basis
n_embedding = 200  # Number of embedding features
n_sampled = 100    # Number of classes sampled for the loss
train_graph = tf.Graph()
with train_graph.as_default():
    inputs = tf.placeholder(tf.int32, [None], name='inputs')
    labels = tf.placeholder(tf.int32, [None, None], name='labels')
    # Embedding matrix: one row of n_embedding features per word.
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)
    # Output layer weights and biases, one row per word in the vocabulary.
    softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding), stddev=0.1))
    softmax_b = tf.Variable(tf.zeros(n_vocab))
    # Calculate the loss on the true class plus n_sampled sampled classes
    loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b,
                                      labels, embed,
                                      n_sampled, n_vocab)
    cost = tf.reduce_mean(loss)
    optimizer = tf.train.AdamOptimizer().minimize(cost)

Finally, use TSNE().fit_transform() to project the 200-dimensional embeddings down to 2D and draw a scatterplot to show how the words relate. This is a huge dimensionality reduction, so some information is inevitably lost.

My final question is: how useful is it to know which words have similar meanings? Is it the starting point for machines to understand human language?

The inductive bias is that words appearing among similar neighboring words have similar meanings. So it is essentially a synonym calculator. It is far from real understanding, e.g., of how a learner’s dictionary works.
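
The “synonym calculator” view amounts to a nearest-neighbour search by cosine similarity. Here is a minimal sketch in plain Python; the four 3-dimensional vectors are hand-made for illustration only (real embeddings are learned and have hundreds of dimensions), and the helper names `cosine` and `nearest` are my own.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
vectors = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
    "chair": [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(word, k=1):
    """Return the k words whose vectors point in the most similar direction."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return others[:k]


print(nearest("good"))   # ['great']
print(nearest("table"))  # ['chair']
```

All the model can return is “words used in similar places”, which is why this stays at the level of a lookup rather than understanding.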

Maybe we should start from an elementary textbook?

6 TensorBoard

see my updated post: http://www.yuchao.us/2017/02/tensorboard.html

Interview tips

No one talks about their failures. Practice a lot:

Intro to Data Structures (mycodeschool)
Intro to Algorithms (MIT OpenCourseWare)
HackerEarth.com HackerRank.com
Mock Interviews

Get a Job at a Startup: https://angel.co

Hacker News: https://news.ycombinator.com/jobs

search using keywords “technical recruiter”

10 Sentiment Prediction RNN

https://github.com/udacity/deep-learning/tree/master/sentiment-rnn

The official solution has some problems; see my note.

The dataset is IMDB movie reviews. It’s supervised learning.

The core code is below. I would like to mention 2 points:

To simplify the code, the Keras-like tf.contrib module is used.
The RNN is implemented with the tf.contrib.rnn submodule.

import tensorflow as tf
lstm_size = 256
lstm_layers = 1
batch_size = 500
learning_rate = 0.001
n_words = len(vocab)  #74072
embed_size = 300 
graph = tf.Graph()
with graph.as_default():
    # layer 0
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    # layer 1
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
    # layer 2
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
    initial_state = cell.zero_state(batch_size, tf.float32)
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
    # output layer
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Project 3, TV script generation

The TV script is from The Simpsons, with 4257 lines and 11.5k unique words. Note that the helper function preprocesses the text into a large array with each word converted into an integer.

Complete implementation is here. I would like to mention 2 obstacles.

First, split the text into feature-target pairs.

import numpy as np

def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    # Number of full batches; the -1 leaves room for the shifted targets.
    num = (len(int_text) - 1) // (batch_size * seq_length)
    batches = np.ndarray(shape=(num, 2, batch_size, seq_length), dtype=int)
    # Distance between the starts of consecutive rows within one batch.
    gap = num * seq_length
    for i in range(num):
        offset = i * seq_length
        for j in range(batch_size):
            batches[i, 0, j, :] = int_text[offset:offset + seq_length]
            # Targets are the inputs shifted one word to the right.
            batches[i, 1, j, :] = int_text[offset + 1:offset + seq_length + 1]
            offset += gap
    return batches
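
To see the layout this produces, here is a small sketch that mirrors the indexing of get_batches with plain Python lists, so the result is easy to print. The name `batch_layout` and the toy input `range(21)` are my own choices for illustration, not part of the project.

```python
def batch_layout(int_text, batch_size, seq_length):
    """Return [(inputs, targets), ...] using the same offsets as get_batches."""
    num = (len(int_text) - 1) // (batch_size * seq_length)
    gap = num * seq_length  # distance between rows of one batch
    batches = []
    for i in range(num):
        starts = [i * seq_length + j * gap for j in range(batch_size)]
        inputs = [list(int_text[s:s + seq_length]) for s in starts]
        # Targets are the inputs shifted one position to the right.
        targets = [list(int_text[s + 1:s + seq_length + 1]) for s in starts]
        batches.append((inputs, targets))
    return batches


batches = batch_layout(list(range(21)), batch_size=2, seq_length=5)
for inputs, targets in batches:
    print(inputs, targets)
# [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]] [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]
# [[5, 6, 7, 8, 9], [15, 16, 17, 18, 19]] [[6, 7, 8, 9, 10], [16, 17, 18, 19, 20]]
```

Row j of the second batch continues exactly where row j of the first batch stopped, so the text in each row stays contiguous from one batch to the next.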

Second, configure the RNN. Note: this only works with TensorFlow 1.0, not 1.1 or later.

import tensorflow as tf
from tensorflow.contrib import seq2seq
vocab_size = 6779
rnn_size = 256

input_text = tf.placeholder(tf.int32, [None, None], name='input')
targets = tf.placeholder(tf.int32, [None, None], name='targets')
learingRate = tf.placeholder(tf.float32, name='learingRate')
input_data_shape = tf.shape(input_text)
with tf.name_scope("initial_rnn_cell"):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([lstm])
    initial_state = tf.identity(cell.zero_state(input_data_shape[0], tf.float32), name="initial_state")
with tf.name_scope("embed"): 
    embedding = tf.Variable(tf.random_uniform((vocab_size, rnn_size), -1, 1), name = "embedding")
    embed = tf.nn.embedding_lookup(embedding, input_text, name = "lookup")
#with tf.name_scope("build_rnn"):
outputs, final_state = tf.nn.dynamic_rnn(cell, embed ,dtype = tf.float32)
final_state = tf.identity(final_state, name = "final_state")
logits = tf.contrib.layers.fully_connected(outputs, vocab_size, activation_fn= None, name ="logits")

with tf.name_scope("softmax"):
    probs = tf.nn.softmax(logits, name='probs')
    cost = seq2seq.sequence_loss(logits,targets,tf.ones([input_data_shape[0], input_data_shape[1]]))
#with tf.name_scope("train"):
optimizer = tf.train.AdamOptimizer(learingRate)
gradients = optimizer.compute_gradients(cost)
# Clip each gradient to [-1, 1]; skip variables that have no gradient (None).
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
train_op = optimizer.apply_gradients(capped_gradients)

folder = 'board_rnn'
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter(folder)  # create writer
    writer.add_graph(sess.graph)
    print("graph is written into folder:", folder)    
# tensorboard --logdir "board_rnn"
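
The gradient clipping step in the configuration above (tf.clip_by_value with -1 and 1) limits every gradient entry independently. Here is a plain-Python sketch of that elementwise behaviour; the function name `clip_by_value` just echoes the TensorFlow one, and the sample gradient is made up.

```python
def clip_by_value(grad, low=-1.0, high=1.0):
    """Clamp every entry of a gradient vector to the range [low, high]."""
    return [min(max(g, low), high) for g in grad]


grad = [-3.0, -0.5, 0.2, 7.0]
print(clip_by_value(grad))  # [-1.0, -0.5, 0.2, 1.0]
```

Entries already inside the range pass through unchanged, so the clip only acts on the occasional exploding gradient. Because each entry is clamped separately, the direction of the gradient vector can change, unlike clipping by norm.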


Thursday, March 30, 2017
