Update on 2017-5-5
Udacity reorganized courses into 4 sections:
- Neural Networks (8 courses + 1 project: bike share prediction)
- Convolutional Neural Networks (13 courses + 1 project: CIFAR-10 image classification)
- Recurrent Neural Networks (20 courses + 2 projects: generating TV scripts, translation)
- Generative Adversarial Networks (6 courses + 1 project: generating faces)
So this post becomes the first half of section 3.
- Andrej Karpathy’s lecture on RNNs and LSTMs from CS231n
- A great blog post by Christopher Olah on how LSTMs work.
- Building an RNN from the ground up: this is a little more advanced, but has an implementation in TensorFlow.
- The notebook with the character-wise RNN from the public GitHub repo.
- Understanding LSTM Networks
- LSTM Networks for Sentiment Analysis
- A Beginner’s Guide to Recurrent Networks and LSTMs
- TensorFlow’s Recurrent Neural Network Tutorial
- Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras
- Demystifying LSTM neural networks
Tutorial by TensorFlow
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.
Word2vec is actually an efficient algorithm for producing word embeddings. It was published in 2013 by Google researchers led by Tomas Mikolov.
It is unsupervised learning because you only have raw text without any labels. Depending on whether you choose the target word or the context words as the label, there are two model architectures:
- CBOW: use context to predict target word
- skip-gram: use word to predict context.
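The difference between the two architectures is easiest to see in the training pairs they produce. The following is a minimal sketch in plain Python; the function names, the window size, and the toy sentence are illustrative assumptions, not part of the TensorFlow tutorial.

```python
def skipgram_pairs(words, window=2):
    """Return (input_word, context_word) pairs: the word predicts its context."""
    pairs = []
    for i, target in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, words[j]))
    return pairs


def cbow_pairs(words, window=2):
    """Return (context_words, target_word) pairs: the context predicts the word."""
    pairs = []
    for i, target in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        context = [words[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```

Skip-gram emits one pair per (word, neighbor) combination, so it yields more training examples from the same text than CBOW does.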
Understanding the implementation requires several difficult concepts:
- Monte Carlo average
- Negative Sampling
- noise-contrastive estimation (NCE)
- t-SNE dimensionality reduction technique
Elaborating on these concepts is beyond my ability at the moment. The core code is:
    import math
    import tensorflow as tf  # TensorFlow 1.x API

    # train_inputs, train_labels and num_sampled are defined earlier in the tutorial.

    # Look up embeddings for inputs, a random big matrix.
    embeddings = tf.Variable(tf.random_uniform([50000, 128], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([50000, 128], stddev=1.0 / math.sqrt(128)))
    nce_biases = tf.Variable(tf.zeros([50000]))  # one bias per vocabulary word

    # Compute the average NCE loss for the batch.
    loss = tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                       labels=train_labels, inputs=embed,
                       num_sampled=num_sampled, num_classes=50000))

    # Construct the SGD optimizer using a learning rate of 1.0.
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
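Of the concepts listed above, negative sampling is the easiest to write down. The following NumPy sketch computes the per-pair negative sampling objective from the word2vec papers. It is not a reimplementation of tf.nn.nce_loss, which differs in detail, and the function name and toy vectors are illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def negative_sampling_loss(v_in, u_target, u_negatives):
    """Loss for one (input, target) pair plus k sampled noise words.

    v_in:        embedding of the input word, shape (d,)
    u_target:    output vector of the true context word, shape (d,)
    u_negatives: output vectors of k sampled noise words, shape (k, d)
    """
    # Pull the true pair together ...
    positive = -np.log(sigmoid(u_target @ v_in))
    # ... and push the sampled noise words away.
    negative = -np.sum(np.log(sigmoid(-(u_negatives @ v_in))))
    return positive + negative


v_in = np.array([1.0, 0.0])
u_target = np.array([2.0, 0.0])
u_negatives = np.array([[-2.0, 0.0], [-1.0, 1.0]])

aligned = negative_sampling_loss(v_in, u_target, u_negatives)
misaligned = negative_sampling_loss(v_in, -u_target, -u_negatives)
print(aligned < misaligned)
```

The point of the trick is cost: the loss touches only k + 1 output vectors per training pair instead of all 50000 rows that a full softmax would need.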
by Radim Řehůřek
jupyter trust deep-learning/embeddings/Skip-Grams-Solution.ipynb
One difficulty is how to generate the (inputs, targets) pairs.
The other is the embedding implementation. Think of it this way:
- Embedding features are like basis functions, or like the periodic table: they are the “atoms” of your fictional world. We choose a few hundred simply because it is a manageable number.
- A few hundred is still a large number for matrix multiplication, which is computationally expensive. The lookup table takes advantage of the one-hot encoding: multiplying a one-hot vector by the matrix just selects one row, so we extract that row directly without computing anything.
- The embedding matrix is a monster matrix. Its initial values are drawn at random by tf.random_uniform and tuned by the optimizer during training.
Use tf.nn.sampled_softmax_loss to calculate the loss. Be sure to read the documentation to figure out how it works.
    n_vocab = len(int_to_vocab)  # 63 k, vocab basis
    n_embedding = 200            # Number of embedding features
    n_sampled = 100

    train_graph = tf.Graph()
    with train_graph.as_default():
        inputs = tf.placeholder(tf.int32, [None], name='inputs')
        labels = tf.placeholder(tf.int32, [None, None], name='labels')

        embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs)

        softmax_w = tf.Variable(tf.truncated_normal((n_vocab, n_embedding), stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(n_vocab))

        # Calculate the loss using negative sampling
        loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b,
                                          labels, embed,
                                          n_sampled, n_vocab)
        cost = tf.reduce_mean(loss)
        optimizer = tf.train.AdamOptimizer().minimize(cost)
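The lookup-table point above can be checked in a few lines of NumPy. This is a toy sketch with made-up sizes (10 words, 4 features) rather than the 63k x 200 matrix in the notebook; it only shows that multiplying a one-hot vector by the embedding matrix gives the same result as indexing a row.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vocab, n_embedding = 10, 4  # toy sizes, assumed for illustration
embedding = rng.uniform(-1, 1, size=(n_vocab, n_embedding))

word_id = 7
one_hot = np.zeros(n_vocab)
one_hot[word_id] = 1.0

via_matmul = one_hot @ embedding  # n_vocab * n_embedding multiply-adds
via_lookup = embedding[word_id]   # plain row indexing, no arithmetic

print(np.allclose(via_matmul, via_lookup))
```

tf.nn.embedding_lookup plays the role of the row indexing here, which is why the one-hot vectors never need to be built.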
Finally, use TSNE().fit_transform() to project the 200-dimensional embeddings down to 2D and draw a scatter plot to show the relations between words. This is a huge dimensionality reduction, so you are going to lose some information anyway.
My final question is: how useful is it to know which words have similar meanings? Is it the starting point for machines to understand human language?
The inductive bias is that words appearing among similar neighboring words have similar meanings. So it is actually a synonym calculator. It is far from real understanding, e.g., how a learner’s dictionary works.
Maybe we should start from an elementary textbook?
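The “synonym calculator” view amounts to a nearest-neighbor search under cosine similarity. Here is a minimal sketch, assuming a hand-made 3-dimensional embedding for four words; the vectors and the function name are invented for illustration and are not trained values.

```python
import numpy as np


def nearest_words(query, vocab, embedding, k=2):
    """Return the k words whose vectors have the highest cosine similarity to query."""
    unit = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(query)]
    order = np.argsort(-sims)  # most similar first
    return [vocab[i] for i in order if vocab[i] != query][:k]


vocab = ["good", "great", "bad", "table"]
embedding = np.array([
    [1.0, 0.9, 0.0],    # good
    [0.9, 1.0, 0.1],    # great
    [-1.0, -0.8, 0.0],  # bad
    [0.0, 0.1, 1.0],    # table
])

print(nearest_words("good", vocab, embedding, k=1))
```

Everything such a lookup can say about a word is which other vectors sit nearby, which is one way to see why it falls short of a dictionary definition.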
See my updated post: http://www.yuchao.us/2017/02/tensorboard.html
No one talks about their failures. Practice a lot:
- Intro to Data Structures (My Code School)
- Intro to Algorithms (MIT OpenCourseWare)
- HackerEarth.com HackerRank.com
- Mock Interviews
Get a Job at a Startup: https://angel.co
Hacker News: https://news.ycombinator.com/jobs
Search using the keywords “technical recruiter”
10 Sentiment Prediction RNN
The official solution has some problems; check my note.
The dataset is IMDB movie reviews. It’s supervised learning.
The core code is below. I would like to mention 2 points:
- To simplify the code, the Keras-like tf.contrib module is used.
- RNN implementation is by
    import tensorflow as tf

    lstm_size = 256
    lstm_layers = 1
    batch_size = 500
    learning_rate = 0.001
    n_words = len(vocab)  # 74072
    embed_size = 300

    graph = tf.Graph()
    with graph.as_default():
        # layer 0
        inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
        labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
        keep_prob = tf.placeholder(tf.float32, name='keep_prob')

        # layer 1
        embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs_)

        # layer 2
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        cell = tf.contrib.rnn.MultiRNNCell([drop] * lstm_layers)
        initial_state = cell.zero_state(batch_size, tf.float32)
        outputs, final_state = tf.nn.dynamic_rnn(cell, embed, initial_state=initial_state)

        # output layer
        predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
        cost = tf.losses.mean_squared_error(labels_, predictions)
        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

        correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
Project 3, TV script generation
The TV script is from The Simpsons, with 4257 lines and 11.5k unique words. Note that in the helper function, the text is preprocessed into a large array with each word converted into an integer.
The complete implementation is here. I would like to mention 2 obstacles.
First, split the text into feature-target pairs.
    import numpy as np

    def get_batches(int_text, batch_size, seq_length):
        """
        Return batches of input and target
        :param int_text: Text with the words replaced by their ids
        :param batch_size: The size of batch
        :param seq_length: The length of sequence
        :return: Batches as a Numpy array
        """
        # Number of full batches; the -1 leaves one word for the shifted targets.
        num = (len(int_text) - 1) // (batch_size * seq_length)
        batches = np.ndarray(shape=(num, 2, batch_size, seq_length), dtype=int)
        # Distance in the text between consecutive rows of the same batch.
        gap = num * seq_length
        for i in range(num):
            offset = i * seq_length
            for j in range(batch_size):
                # Inputs, and targets shifted by one word.
                batches[i, 0, j, :] = int_text[offset:offset + seq_length]
                batches[i, 1, j, :] = int_text[offset + 1:offset + seq_length + 1]
                offset += gap
        return batches
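The same layout can also be produced without the nested loops. The following is a sketch of a vectorized alternative built on NumPy reshape; the name get_batches_vectorized is hypothetical, and it is meant to return the same (num, 2, batch_size, seq_length) array as the loop above.

```python
import numpy as np


def get_batches_vectorized(int_text, batch_size, seq_length):
    """Return an array of shape (n_batches, 2, batch_size, seq_length)."""
    n_batches = (len(int_text) - 1) // (batch_size * seq_length)
    n_words = n_batches * batch_size * seq_length

    # Row j of each batch reads from its own contiguous slice of the text.
    inputs = np.array(int_text[:n_words]).reshape(batch_size, n_batches, seq_length)
    # Targets are the same words shifted by one position.
    targets = np.array(int_text[1:n_words + 1]).reshape(batch_size, n_batches, seq_length)

    # (2, batch_size, n_batches, seq_length) -> (n_batches, 2, batch_size, seq_length)
    return np.stack([inputs, targets]).transpose(2, 0, 1, 3)


batches = get_batches_vectorized(list(range(20)), batch_size=2, seq_length=3)
print(batches.shape)
print(batches[0, 0])  # inputs of the first batch
print(batches[0, 1])  # targets of the first batch
```

With 20 word ids, a batch size of 2 and a sequence length of 3, there are 3 batches, and the second row of each batch starts 9 words after the first, matching the gap variable in the loop version.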
Second, configure the RNN. Note: this only works for TensorFlow 1.0, not 1.1 or later.
    import tensorflow as tf
    from tensorflow.contrib import seq2seq

    vocab_size = 6779
    rnn_size = 256

    input_text = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    learingRate = tf.placeholder(tf.float32, name='learingRate')
    input_data_shape = tf.shape(input_text)  # [batch_size, seq_length]

    with tf.name_scope("initial_rnn_cell"):
        lstm = tf.contrib.rnn.BasicLSTMCell(rnn_size)
        cell = tf.contrib.rnn.MultiRNNCell([lstm])
        # zero_state expects the batch size, i.e. the first dimension of the input.
        initial_state = tf.identity(cell.zero_state(input_data_shape[0], tf.float32),
                                    name="initial_state")

    with tf.name_scope("embed"):
        embedding = tf.Variable(tf.random_uniform((vocab_size, rnn_size), -1, 1),
                                name="embedding")
        embed = tf.nn.embedding_lookup(embedding, input_text, name="lookup")

    # with tf.name_scope("build_rnn"):
    outputs, final_state = tf.nn.dynamic_rnn(cell, embed, dtype=tf.float32)
    final_state = tf.identity(final_state, name="final_state")
    # fully_connected takes a scope argument rather than name.
    logits = tf.contrib.layers.fully_connected(outputs, vocab_size,
                                               activation_fn=None, scope="logits")

    with tf.name_scope("softmax"):
        probs = tf.nn.softmax(logits, name='probs')

    # Weight every position in the [batch_size, seq_length] grid equally.
    cost = seq2seq.sequence_loss(logits, targets,
                                 tf.ones([input_data_shape[0], input_data_shape[1]]))

    # with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learingRate)
    gradients = optimizer.compute_gradients(cost)
    # Clip each gradient to [-1, 1]; skip variables that have no gradient.
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var)
                        for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

    folder = 'board_rnn'
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        writer = tf.summary.FileWriter(folder)  # create writer
        writer.add_graph(sess.graph)
        print("graph is written into folder:", folder)
    # tensorboard --logdir "board_rnn"