Update on 2017-5-5
Udacity reorganized courses into 4 sections:
- Neural Networks (8 courses + 1 project: bike share prediction)
- Convolutional Neural Networks (13 courses + 1 project: CIFAR-10 image classification)
- Recurrent Neural Networks (20 courses + 2 projects: generating TV scripts, translation)
- Generative Adversarial Networks (6 courses + 1 project: Generate Faces)
So this post becomes the first half of section 3.
1 RNN
Resources
- Andrej Karpathy’s lecture on RNNs and LSTMs from CS231n
- A great blog post by Christopher Olah on how LSTMs work.
- Building an RNN from the ground up. This is a little more advanced, but has an implementation in TensorFlow.
- The notebook with the character-wise RNN from the public GitHub repo.
resources
- Understanding LSTM Networks
- LSTM Networks for Sentiment Analysis
- A Beginner’s Guide to Recurrent Networks and LSTMs
- TensorFlow’s Recurrent Neural Network Tutorial
- Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras
- Demystifying LSTM neural networks
3 Word2vec
Tutorial by Tensorflow
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
Word2vec is actually an efficient algorithm for producing word embeddings. It was published in 2013 by Google researchers led by Tomas Mikolov.
It is unsupervised learning because you only have raw text without any labels. Depending on whether you choose the target word or the context words as the label, there are two model architectures:
- CBOW: use context to predict target word
- skip-gram: use word to predict context.
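To make the difference between the two architectures concrete, here is a minimal pure-Python sketch of the training examples each one would see. The function name `training_pairs` and the fixed `window` parameter are my own illustrative choices, not part of any word2vec library.

```python
def training_pairs(tokens, window=1):
    """Return (cbow, skipgram) training examples for a list of tokens."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: the whole context is the input, the center word is the label.
        cbow.append((context, target))
        # Skip-gram: the center word is the input, each context word is a label.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


cbow, skipgram = training_pairs(["the", "quick", "brown", "fox"], window=1)
print(cbow[1])       # (['the', 'brown'], 'quick')
print(skipgram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

Note that skip-gram turns one position in the text into several (input, label) examples, one per context word.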
Latest notebook is here. The dataset “text8” is originally hosted by Matt Mahoney.
Several concepts make the implementation difficult to understand:
- Monte Carlo average
- Negative Sampling
- noise-contrastive estimation (NCE)
- t-SNE dimensionality reduction technique
Elaborating on these concepts is beyond my ability at this moment. The core code is:
# Look up embeddings for inputs, a random big matrix.
embeddings = tf.Variable(tf.random_uniform([50000, 128], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([50000, 128], stddev=1.0 / math.sqrt(128)))
nce_biases = tf.Variable(tf.zeros([50000]))
# Compute the average NCE loss for the batch.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=50000))
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
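The idea shared by negative sampling and NCE is to avoid a softmax over all 50000 words by contrasting the true context word with a few randomly drawn "noise" words. As a rough sketch (not what `tf.nn.nce_loss` computes exactly, since NCE also corrects for the noise distribution), the negative-sampling loss for one example is -log σ(u_o · v_c) - Σ_k log σ(-u_k · v_c). The vectors and helper names below are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(center, true_context, noise_contexts):
    """Loss is low when center aligns with the true context and not with noise."""
    loss = -math.log(sigmoid(dot(true_context, center)))
    for noise in noise_contexts:
        loss -= math.log(sigmoid(-dot(noise, center)))
    return loss


center = [1.0, 0.5]  # hypothetical embedding of the input word
# True context points the same way as the center word, noise points away.
aligned = negative_sampling_loss(center, [1.0, 0.5], [[-1.0, -0.5]])
# The reverse situation: the model would be badly wrong here.
opposed = negative_sampling_loss(center, [-1.0, -0.5], [[1.0, 0.5]])
print(aligned < opposed)  # True
```

Only the true word and the handful of sampled noise words contribute to the loss, which is why `num_sampled` can be far smaller than `num_classes`.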
by Radim Řehůřek
by Udacity.Mat
jupyter trust deep-learning/embeddings/Skip-Grams-Solution.ipynb
One difficulty is how to generate the (inputs, targets) pairs.
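A minimal sketch of one way to build these pairs is below. It assumes the trick described in the original word2vec paper of shrinking the window to a random radius between 1 and `window_size` for each word, so that nearby words are sampled more often. The names `get_target` and `make_pairs` and the `rng` parameter are my own choices for illustration.

```python
import random


def get_target(words, idx, window_size=5, rng=random):
    """Context words around words[idx], using a randomly shrunk window."""
    r = rng.randint(1, window_size)  # actual radius, between 1 and window_size
    start = max(0, idx - r)
    before = words[start:idx]
    after = words[idx + 1:idx + r + 1]
    return before + after


def make_pairs(words, window_size=5, rng=random):
    """Flatten the text into parallel (inputs, targets) lists."""
    inputs, targets = [], []
    for idx, word in enumerate(words):
        context = get_target(words, idx, window_size, rng)
        inputs.extend([word] * len(context))  # repeat the input once per target
        targets.extend(context)
    return inputs, targets


words = [10, 11, 12, 13, 14, 15]  # word ids standing in for a tiny text
inputs, targets = make_pairs(words, window_size=2, rng=random.Random(0))
print(len(inputs) == len(targets))  # True
```

Each input word is repeated once per context word, so the two lists line up element by element and can be sliced into batches.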
The other is the embedding implementation. Think of it this way:
- Embedding features are like basis functions or the periodic table: they are the “atoms” of your fictional world. We choose a few hundred simply because it is a manageable number.
- A few hundred is still a large number for matrix multiplication, which is computationally expensive. The lookup table takes advantage of the one-hot encoding: multiplying a one-hot vector by the embedding matrix just selects one row, so we extract that row directly without computing the product.
- The embedding matrix is a monster matrix. Its initial values are drawn at random by tf.random_uniform and tuned by the optimizer during training.
- Use tf.nn.sampled_softmax_loss to calculate the loss. Be sure to read the documentation to figure out how it works.
n_vocab = len(int_to_vocab)  # 63 k, vocab basis
n_embedding = 200  # Number of embedding features
n_sampled = 100
train_graph = tf.Graph()
with train_graph.as_default():
    inputs = tf.placeholder(tf.int32, [None], name='inputs')
    labels = tf.placeholder(tf.int32, [None, None], name='labels')
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)
    softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding), stddev=0.1))
    softmax_b = tf.Variable(tf.zeros(n_vocab))
    # Calculate the loss using negative sampling
    loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b,
                                      labels, embed,
                                      n_sampled, n_vocab)
    cost = tf.reduce_mean(loss)
    optimizer = tf.train.AdamOptimizer().minimize(cost)
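The lookup-table point can be checked by hand. The sketch below uses a tiny made-up 4 x 3 embedding matrix in plain Python (no TensorFlow) to show that multiplying a one-hot vector by the matrix gives the same result as simply indexing the row.

```python
# A made-up embedding matrix: 4 words, 3 embedding features each.
embedding = [
    [0.1, 0.2, 0.3],  # row for word id 0
    [0.4, 0.5, 0.6],  # row for word id 1
    [0.7, 0.8, 0.9],  # row for word id 2
    [1.0, 1.1, 1.2],  # row for word id 3
]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def matmul_row(vector, matrix):
    """Multiply a row vector by a matrix (the slow, explicit way)."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]


word_id = 2
via_matmul = matmul_row(one_hot(word_id, len(embedding)), embedding)
via_lookup = embedding[word_id]
print(via_matmul == via_lookup)  # True
```

With tens of thousands of words, the explicit product wastes almost all of its multiplications on zeros, which is the work the lookup skips.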
At last, use TSNE().fit_transform() to transform the 200-dimensional data to 2D and draw a scatterplot to show the relations. This is a huge dimensionality reduction, so you are going to lose some information anyway.
My final question is: how useful is it to know which words have similar semantic meanings? Is it the starting point for machines understanding human language?
The inductive bias is that words appearing among similar neighboring words have similar meanings. So it is actually a synonym calculator. It is far from real understanding, e.g., how a learner’s dictionary works.
Maybe we should start from an elementary textbook?
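The "synonym calculator" view can be sketched as a nearest-neighbour search by cosine similarity. The 3-dimensional vectors below are made up for illustration only (real embeddings here have 200 features and are learned), and the function names are my own.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings, for illustration only.
toy_embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
    "chair": [0.1, 0.2, 0.8],
}


def nearest(word, embeddings, k=1):
    """Return the k words whose vectors point most nearly the same way."""
    query = embeddings[word]
    scored = [(cosine(query, vec), w) for w, vec in embeddings.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:k]]


print(nearest("good", toy_embeddings))   # ['great']
print(nearest("table", toy_embeddings))  # ['chair']
```

All the model can report is which vectors are close together; nothing in this calculation says what a word means.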
6 TensorBoard
see my updated post: http://www.yuchao.us/2017/02/tensorboard.html
Interview tips
No one talks about their failures. Practice a lot:
- Intro Data Structures (mycodeschool)
- Intro to Algorithms (MIT OpenCourseWare)
- HackerEarth.com HackerRank.com
- Mock Interviews
Get a Job at a Startup: https://angel.co
Hacker News: https://news.ycombinator.com/jobs
Search using the keywords “technical recruiter”
10 Sentiment Prediction RNN
The official solution has some problems, check my note.
The dataset is IMDB movie reviews. It’s supervised learning.
The core code is below. I would like to mention 2 points:
- To simplify the code, the Keras-like tf.contrib module is used.
- The RNN implementation is from the tf.contrib.rnn submodule.
import tensorflow as tf
lstm_size = 256
lstm_layers = 1
batch_size = 500
learning_rate = 0.001
n_words = len(vocab)  # 74072
embed_size = 300
graph = tf.Graph()
with graph.as_default():
    # layer 0
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    # layer 1
    embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs_)
    # layer 2
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
    initial_state = cell.zero_state(batch_size, tf.float32)
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)
    # output layer: only the output of the last time step is used
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
Project 3, TV script generation
The TV script is from The Simpsons, with 4257 lines and 11.5k unique words. Note that in the helper function, the text is preprocessed into a large array with each word converted into an integer.
Complete implementation is here. I would like to mention 2 obstacles.
First, extract the text into feature-target pairs.
import numpy as np

def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    # Number of full batches; the -1 leaves room for the shifted targets.
    num = (len(int_text) - 1) // (batch_size * seq_length)
    batches = np.ndarray(shape=(num, 2, batch_size, seq_length), dtype=int)
    # Distance in the text between consecutive rows of the same batch.
    gap = num * seq_length
    for i in range(num):
        offset = i * seq_length
        for j in range(batch_size):
            batches[i, 0, j, :] = int_text[offset:offset + seq_length]
            # Targets are the inputs shifted by one word.
            batches[i, 1, j, :] = int_text[offset + 1:offset + seq_length + 1]
            offset += gap
    return batches
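The key idea in the function above is that each target sequence is the input sequence shifted by one word. Here is a simplified list-based sketch of just that idea, without the batch dimension or NumPy; `shifted_pairs` is a name I made up, not part of the project code.

```python
def shifted_pairs(int_text, seq_length):
    """Split word ids into (input, target) sequences; target = input shifted by one."""
    pairs = []
    # Stop early enough that every input still has a full-length target.
    for start in range(0, len(int_text) - seq_length, seq_length):
        x = int_text[start:start + seq_length]
        y = int_text[start + 1:start + seq_length + 1]
        pairs.append((x, y))
    return pairs


pairs = shifted_pairs(list(range(10)), seq_length=3)
for x, y in pairs:
    print(x, "->", y)
# [0, 1, 2] -> [1, 2, 3]
# [3, 4, 5] -> [4, 5, 6]
# [6, 7, 8] -> [7, 8, 9]
```

At every position the network sees a word and is asked to predict the next one, which is why one extra word must be left over at the end of the text.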
Second, configure the RNN. Note: this only works for TensorFlow 1.0, not 1.1 or later.
import tensorflow as tf
from tensorflow.contrib import seq2seq
vocab_size = 6779
rnn_size = 256
input_text = tf.placeholder(tf.int32, [None, None], name='input')
targets = tf.placeholder(tf.int32, [None, None], name='targets')
learingRate = tf.placeholder(tf.float32, name='learingRate')
input_data_shape = tf.shape(input_text)
with tf.name_scope("initial_rnn_cell"):
    lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([lstm])
    initial_state = tf.identity(cell.zero_state(input_data_shape[0], tf.float32), name="initial_state")
with tf.name_scope("embed"):
    embedding = tf.Variable(tf.random_uniform((vocab_size, rnn_size), -1, 1), name="embedding")
    embed = tf.nn.embedding_lookup(embedding, input_text, name="lookup")
#with tf.name_scope("build_rnn"):
outputs, final_state = tf.nn.dynamic_rnn(cell, embed, dtype=tf.float32)
final_state = tf.identity(final_state, name="final_state")
logits = tf.contrib.layers.fully_connected(outputs, vocab_size, activation_fn=None, name="logits")
with tf.name_scope("softmax"):
    probs = tf.nn.softmax(logits, name='probs')
# All positions get weight 1 in the sequence loss.
cost = seq2seq.sequence_loss(logits, targets, tf.ones([input_data_shape[0], input_data_shape[1]]))
#with tf.name_scope("train"):
optimizer = tf.train.AdamOptimizer(learingRate)
gradients = optimizer.compute_gradients(cost)
# Clip every gradient element into [-1, 1] before applying it.
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients]
train_op = optimizer.apply_gradients(capped_gradients)
folder = 'board_rnn'
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter(folder)  # create writer
    writer.add_graph(sess.graph)
    print("graph is written into folder:", folder)
# tensorboard --logdir "board_rnn"
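The gradient capping step in the code above clamps each gradient element into [-1, 1], which keeps a single exploding gradient from wrecking the weights. A plain-Python sketch of what element-wise clipping does to a flat list of made-up gradient values (this `clip_by_value` is my own stand-in, not the TensorFlow op itself):

```python
def clip_by_value(values, low, high):
    """Clamp every element of a flat list into the range [low, high]."""
    return [min(max(v, low), high) for v in values]


grads = [-3.2, -0.4, 0.0, 0.9, 7.5]  # made-up gradient values
capped = clip_by_value(grads, -1.0, 1.0)
print(capped)  # [-1.0, -0.4, 0.0, 0.9, 1.0]
```

Because each element is clamped independently, the direction of the gradient vector can change. Norm-based clipping such as tf.clip_by_global_norm rescales the whole vector instead and is a common alternative.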