Thursday, March 30, 2017

Deep Learning ND 3, RNN, NLP

Update on 2017-5-5

Udacity reorganized courses into 4 sections:
  1. Neural Networks (8 courses + 1 project: bike share prediction)
  2. Convolutional Neural Networks (13 courses + 1 project: CIFAR-10 image classification)
  3. Recurrent Neural Networks (20 courses + 2 projects: generating TV scripts, translation)
  4. Generative Adversarial Networks (6 courses + 1 project: generate faces)
So this post becomes the first half of section 3.



3 Word2vec

Tutorial by TensorFlow

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
Word2vec is, in effect, an efficient algorithm for producing word embeddings. It was published in 2013 by Google researchers led by Tomas Mikolov.
It is unsupervised learning because you only have raw text without any labels; the labels are taken from the text itself. Depending on whether you choose the target word or the context words as the label, there are two model architectures:
  • CBOW: use the context to predict the target word.
  • skip-gram: use a word to predict its context.
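To make the two architectures concrete, here is a minimal sketch in plain Python. The function name training_pairs, the window size and the toy sentence are assumptions made for illustration; they are not taken from the tutorial.

```python
def training_pairs(words, window=1):
    """Build (input, label) examples for both word2vec architectures."""
    cbow, skip_gram = [], []
    for i, target in enumerate(words):
        # Context = up to `window` words on each side of the target.
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        # CBOW: the whole context is the input, the target word is the label.
        cbow.append((context, target))
        # Skip-gram: the target word is the input, each context word is a label.
        skip_gram.extend((target, c) for c in context)
    return cbow, skip_gram


cbow, skip_gram = training_pairs(["the", "quick", "brown", "fox"])
print(cbow[1])        # (['the', 'brown'], 'quick')
print(skip_gram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

The same window of text yields one CBOW example per word but several skip-gram examples, which is why skip-gram training sets are larger.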
The latest notebook is here. The dataset “text8” is originally hosted by Matt Mahoney.
Understanding the implementation requires several difficult concepts.
Elaborating on these concepts is beyond my ability at the moment. The core code is:
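import math
import tensorflow as tf  # TensorFlow 1.x API

# Integer word ids for the batch and their context-word labels.
train_inputs = tf.placeholder(tf.int32, shape=[None])
train_labels = tf.placeholder(tf.int32, shape=[None, 1])
# Look up embeddings for inputs, a random big matrix (50000 words x 128 features).
embeddings = tf.Variable(tf.random_uniform([50000, 128], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([50000, 128], stddev=1.0 / math.sqrt(128)))
nce_biases = tf.Variable(tf.zeros([50000]))
# Compute the average NCE loss for the batch.
# The nce_loss arguments follow the TensorFlow tutorial; num_sampled=64 is an assumption.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_labels, inputs=embed,
                   num_sampled=64, num_classes=50000))
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)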

by Radim Řehůřek

by Udacity.Mat

jupyter trust deep-learning/embeddings/Skip-Grams-Solution.ipynb
One difficulty is how to generate the (inputs, targets) pair.
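A minimal sketch of one way to build these pairs, assuming the words are already converted to integers. The helper names context_of and skip_gram_batches are hypothetical, and the random window radius is an assumption borrowed from the usual skip-gram recipe rather than a copy of the notebook's code.

```python
import random


def context_of(words, idx, window_size=5):
    """Return the words around words[idx] within a randomly sized window."""
    # Drawing the radius from [1, window_size] means nearby words are
    # selected more often than distant ones.
    r = random.randint(1, window_size)
    start = max(0, idx - r)
    return words[start:idx] + words[idx + 1:idx + r + 1]


def skip_gram_batches(words, batch_size, window_size=5):
    """Yield (inputs, targets) lists; each input is repeated once per target."""
    n_full = len(words) // batch_size * batch_size  # drop the ragged tail
    for start in range(0, n_full, batch_size):
        batch = words[start:start + batch_size]
        inputs, targets = [], []
        for i, word in enumerate(batch):
            context = context_of(batch, i, window_size)
            inputs.extend([word] * len(context))
            targets.extend(context)
        yield inputs, targets


random.seed(0)
inputs, targets = next(skip_gram_batches(list(range(10)), batch_size=5, window_size=2))
print(list(zip(inputs, targets)))
```

Each input word appears once for every context word it is paired with, so inputs and targets always have the same length.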
The other is the embedding implementation. Think of it this way:
  1. Embedding features are like basis functions or the periodic table: they are the “atoms” of your fictional world. We choose a few hundred simply because it is a manageable number.
  2. Even with a few hundred features, multiplying a one-hot vector of vocabulary size by the embedding matrix is computationally expensive. The lookup table takes advantage of the one-hot encoding: the product is just one row of the matrix, so we extract that row directly without computing anything.
  3. The embedding matrix is a monster matrix. Its initial values are drawn at random by tf.random_uniform and then tuned by the optimizer during training.
  4. Use tf.nn.sampled_softmax_loss to calculate the loss. Be sure to read the documentation to figure out how it works.
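Point 2 in the list above can be checked with a small NumPy sketch. The sizes here are toy values chosen for illustration, not the notebook's 63k x 200 matrix, and NumPy indexing stands in for tf.nn.embedding_lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_embedding = 10, 4          # toy sizes (assumed for the demo)
embedding = rng.uniform(-1, 1, size=(n_vocab, n_embedding))

word_ids = np.array([3, 7, 3])        # integer-encoded input words

# Slow path: build one-hot rows and multiply by the embedding matrix.
one_hot = np.eye(n_vocab)[word_ids]   # shape (3, n_vocab)
via_matmul = one_hot @ embedding      # shape (3, n_embedding)

# Fast path: index the rows directly, which is what a lookup table does.
via_lookup = embedding[word_ids]

print(np.allclose(via_matmul, via_lookup))  # True
```

The matrix product touches every entry of the embedding matrix, while the lookup reads only the rows that are actually needed.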
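import tensorflow as tf  # TensorFlow 1.x API

n_vocab = len(int_to_vocab)  # 63 k, vocab basis; int_to_vocab comes from the notebook's preprocessing
n_embedding = 200  # Number of embedding features
n_sampled = 100  # Number of negative classes sampled per batch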
train_graph = tf.Graph()
with train_graph.as_default():
    inputs = tf.placeholder(tf.int32, [None], name='inputs')
    labels = tf.placeholder(tf.int32, [None, None], name='labels')
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)
    softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding), stddev=0.1))
    softmax_b = tf.Variable(tf.zeros(n_vocab))
    # Calculate the loss using negative sampling
    loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b, 
                                      labels, embed,
                                      n_sampled, n_vocab)
    cost = tf.reduce_mean(loss)
    optimizer = tf.train.AdamOptimizer().minimize(cost)
Finally, use TSNE().fit_transform() to project the 200-dimensional embeddings to 2D and draw a scatter plot to show the relations between words. This is a huge dimensionality reduction, so some information is inevitably lost.
My final question is: how useful is it to know which words have similar meanings? Is it the starting point for a machine to understand human language?
The inductive bias is that words appearing among similar neighboring words have similar meanings. So it is actually a synonym calculator. It is far from real understanding, e.g., how a learner’s dictionary works.
Maybe we should start from an elementary textbook?
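The “synonym calculator” view can be sketched as a nearest-neighbor search by cosine similarity. The 3-dimensional vectors below are hypothetical and hand-picked for illustration; trained embeddings would have a few hundred dimensions and come from the model.

```python
import numpy as np

# Hypothetical embeddings, hand-picked so that "good" and "great" point
# in nearly the same direction.
vectors = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad":   np.array([-0.7, 0.3, 0.1]),
    "train": np.array([0.0, 0.1, 0.9]),
}


def nearest(word, k=1):
    """Return the k other words with the highest cosine similarity to word."""
    q = vectors[word] / np.linalg.norm(vectors[word])
    scores = {
        other: float(v @ q / np.linalg.norm(v))
        for other, v in vectors.items() if other != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


print(nearest("good"))  # ['great']
```

In embeddings learned from real text, antonyms can also end up close together because they tend to appear in similar contexts, which is one reason this is still far from understanding.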

6 TensorBoard

Interview tips

No one talks about their failures. Practice a lot:
  • Intro Data Structures (mycodeschool)
  • Intro to Algorithms (MIT OpenCourseWare)
  • Mock Interviews
Get a Job at a Startup:
search using keywords “technical recruiter”

10 Sentiment Prediction RNN

The official solution has some problems; see my note.
The dataset is IMDB movie reviews. It’s supervised learning.
The core code is below. I would like to mention 2 points:
  1. To simplify the code, the Keras-like tf.contrib module is used.
  2. The RNN is implemented with the tf.contrib.rnn submodule.
import tensorflow as tf
lstm_size = 256
lstm_layers = 1
batch_size = 500
learning_rate = 0.001
n_words = len(vocab)  #74072
embed_size = 300 
graph = tf.Graph()
with graph.as_default():
    # layer 0
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    # layer 1
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
    # layer 2
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
    initial_state = cell.zero_state(batch_size, tf.float32)
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
    # output layer
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
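The cost and accuracy lines at the end of the graph can be traced by hand with NumPy. The prediction and label values below are hypothetical; only the sequence of operations mirrors the graph.

```python
import numpy as np

# Hypothetical sigmoid outputs for four reviews and their true labels
# (1 = positive, 0 = negative), shaped (batch, 1) like the graph's tensors.
predictions = np.array([[0.91], [0.15], [0.62], [0.40]])
labels = np.array([[1], [0], [0], [1]])

# Mean squared error between the probabilities and the 0/1 labels.
cost = np.mean((labels - predictions) ** 2)

# Round to the nearest class, compare with the labels, average the hits.
correct_pred = np.round(predictions).astype(np.int32) == labels
accuracy = correct_pred.astype(np.float32).mean()

print(accuracy)  # 0.5
```

The first two reviews are classified correctly and the last two are not, so half of the batch counts as correct.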

Project 3, TV script generation

The TV script is from The Simpsons, with 4257 lines and 11.5k unique words. Note that in the helper function, the text is preprocessed into a large array with each word converted into an integer.
The complete implementation is here. I would like to mention 2 obstacles.
First, extract the text into feature-target pairs.
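import numpy as np

def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    # Number of full batches; the -1 leaves room for the shifted targets.
    num = (len(int_text) - 1) // (batch_size * seq_length)
    batches = np.ndarray(shape=(num, 2, batch_size, seq_length), dtype=int)
    # Row j of every batch reads from its own stretch of the text, gap words apart.
    gap = num * seq_length
    for i in range(num):
        offset = i * seq_length
        for j in range(batch_size):
            # Input is seq_length words; target is the same window shifted by one.
            batches[i, 0, j, :] = int_text[offset:offset + seq_length]
            batches[i, 1, j, :] = int_text[offset + 1:offset + seq_length + 1]
            offset += gap
    return batches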
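The same (num, 2, batch_size, seq_length) layout can also be produced without loops. This is a sketch under the assumption that targets are the inputs shifted by one word; the name get_batches_reshape is hypothetical.

```python
import numpy as np


def get_batches_reshape(int_text, batch_size, seq_length):
    """Return an array shaped (n_batches, 2, batch_size, seq_length)."""
    n_batches = (len(int_text) - 1) // (batch_size * seq_length)
    n_words = n_batches * batch_size * seq_length
    # Split the text into batch_size contiguous stretches, then cut each
    # stretch into n_batches windows of seq_length words.
    x = np.array(int_text[:n_words]).reshape(batch_size, n_batches, seq_length)
    # Targets are the same words shifted forward by one position.
    y = np.array(int_text[1:n_words + 1]).reshape(batch_size, n_batches, seq_length)
    # (batch_size, 2, n_batches, seq_length) -> (n_batches, 2, batch_size, seq_length)
    return np.stack([x, y], axis=1).transpose(2, 1, 0, 3)


batches = get_batches_reshape(list(range(13)), batch_size=2, seq_length=3)
print(batches.shape)            # (2, 2, 2, 3)
print(batches[0, 0].tolist())   # [[0, 1, 2], [6, 7, 8]]
print(batches[0, 1].tolist())   # [[1, 2, 3], [7, 8, 9]]
```

With this layout, row j of one batch continues where row j of the previous batch stopped, which matters if the RNN state is carried over from batch to batch.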
Second, configure the RNN. Note: this only works for TensorFlow 1.0, not 1.1 or later.
import tensorflow as tf
from tensorflow.contrib import seq2seq
vocab_size = 6779
rnn_size = 256

input_text = tf.placeholder(tf.int32, [None, None], name='input')
targets = tf.placeholder(tf.int32, [None, None], name='targets')
learingRate = tf.placeholder(tf.float32, name='learingRate')
input_data_shape = tf.shape(input_text)
with tf.name_scope("initial_rnn_cell"):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([lstm])
    initial_state = tf.identity(cell.zero_state(input_data_shape[0], tf.float32), name="initial_state")
with tf.name_scope("embed"): 
    embedding = tf.Variable(tf.random_uniform((vocab_size, rnn_size), -1, 1), name = "embedding")
    embed = tf.nn.embedding_lookup(embedding, input_text, name = "lookup")
#with tf.name_scope("build_rnn"):
outputs, final_state = tf.nn.dynamic_rnn(cell, embed ,dtype = tf.float32)
final_state = tf.identity(final_state, name = "final_state")
logits = tf.contrib.layers.fully_connected(outputs, vocab_size, activation_fn= None, name ="logits")

with tf.name_scope("softmax"):
    probs = tf.nn.softmax(logits, name='probs')
    cost = seq2seq.sequence_loss(logits,targets,tf.ones([input_data_shape[0], input_data_shape[1]]))
#with tf.name_scope("train"):
optimizer = tf.train.AdamOptimizer(learingRate)
gradients = optimizer.compute_gradients(cost)
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients]
train_op = optimizer.apply_gradients(capped_gradients)

folder = 'board_rnn'
with tf.Session() as sess:
    # Pass the graph so that TensorBoard has something to display.
    writer = tf.summary.FileWriter(folder, sess.graph)  # create writer
    print("graph is written into folder:", folder)
    writer.close()
# tensorboard --logdir "board_rnn"
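The gradient-capping step in the training code above clips every gradient element to the range [-1, 1] before the update is applied. A small NumPy sketch of that operation, with hypothetical gradient values and np.clip standing in for tf.clip_by_value:

```python
import numpy as np

# Hypothetical gradients for two variables; values chosen to show the effect.
gradients = [np.array([0.5, -3.0, 2.5]), np.array([[-0.2, 1.7]])]

# Element-wise capping to [-1, 1]: values inside the range are unchanged,
# values outside are replaced by the nearest bound.
capped = [np.clip(g, -1.0, 1.0) for g in gradients]

print(capped[0].tolist())  # [0.5, -1.0, 1.0]
print(capped[1].tolist())  # [[-0.2, 1.0]]
```

Capping element by element limits the size of each update but can change the direction of the gradient; that trade-off is accepted here to keep the RNN from taking occasional very large steps.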