[Translation] Stanford CS 20SI: TensorFlow for Deep Learning Research course notes, Lecture note 4: How to structure your model in TensorFlow


“CS 20SI: TensorFlow for Deep Learning Research”


Prepared by Chip Huyen

Reviewed by Danijar Hafner


Lecture note 4: How to structure your model in TensorFlow

Translator's note: this is a personal translation and some parts are abridged; reading it alongside the original note is recommended.


This lecture builds the word2vec model. If you are not familiar with it, you can read the CS224N lecture slides or Mikolov's original paper.


Skip-gram model vs. CBOW model (Continuous Bag-of-Words):


Algorithmically, the two models are similar. The difference is that CBOW predicts the center word from its surrounding context, while skip-gram does the opposite. Statistically, CBOW smooths over a lot of the distributional information by treating an entire context as one observation, which tends to help on smaller datasets; skip-gram treats each (context, target) pair as a new observation, which tends to work better as the dataset gets larger.
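As an illustration (not from the lecture note), here is a minimal sketch of how the two models turn the same sentence into training examples; the window size and helper names are assumptions:

# Minimal sketch: generating training examples for CBOW vs. skip-gram.
# The sentence, window size, and helper names are illustrative assumptions.
sentence = "the quick brown fox jumps".split()
WINDOW = 1  # one word of context on each side

def cbow_examples(words, window):
    """CBOW: the whole context window is one observation predicting the center word."""
    for i, center in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        yield (context, center)            # e.g. (['the', 'brown'], 'quick')

def skip_gram_examples(words, window):
    """Skip-gram: every (center, context) pair is its own observation."""
    for i, center in enumerate(words):
        for context in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
            yield (center, context)        # e.g. ('quick', 'the'), ('quick', 'brown')

print(list(cbow_examples(sentence, WINDOW)))
print(list(skip_gram_examples(sentence, WINDOW)))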


Word2Vec Tutorial


In building the skip-gram model, what we actually care about are the weights of the hidden layer: those weights are what we are trying to learn, and they are also called the embedding (word vector) matrix.


How to structure a TensorFlow model



  • Phase 1: Assemble the graph:

    1. Define placeholders for input and output

    2. Define the weights

    3. Define the inference model

    4. Define the loss function

    5. Define the optimizer



  • Phase 2: Execute the computation:

    1. Initialize the variables before the first run

    2. Feed in the training data (the samples may need to be shuffled)

    3. Run inference on the training data, i.e. compute the output for the current inputs under the current model parameters

    4. Compute the loss

    5. Adjust the model parameters to minimize/maximize the loss (a minimal end-to-end sketch of these two phases follows this list)
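Before specializing to word2vec, here is a minimal end-to-end sketch of the two-phase pattern on a toy linear model. It is purely illustrative: the data, hyperparameters, and variable names are assumptions and not part of the lecture note.

import tensorflow as tf

# Hypothetical toy data and learning rate, just to show the two phases.
LEARNING_RATE = 0.01
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]

# Phase 1: assemble the graph
X = tf.placeholder(tf.float32, name='X')                                      # 1. placeholders
Y = tf.placeholder(tf.float32, name='Y')
w = tf.Variable(0.0, name='weight')                                           # 2. weights
Y_pred = X * w                                                                # 3. inference
loss = tf.reduce_mean(tf.square(Y - Y_pred), name='loss')                     # 4. loss
optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)   # 5. optimizer

# Phase 2: execute the computation
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())         # 1. initialize variables
    for _ in range(100):                                 # 2./3./4./5. feed data, run inference,
        for x, y in zip(xs, ys):                         #    compute the loss, update the weights
            sess.run(optimizer, feed_dict={X: x, Y: y})
    print(sess.run(w))                                   # should approach 2.0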




Let's follow these steps to create our word2vec skip-gram model:


Phase 1: Assemble the graph



  1. Define placeholders for input and output

    The input is the center word and the output is the target word. Instead of one-hot vectors, we use each word's index in the vocabulary, so both the input and the output are scalar indices of shape [BATCH_SIZE].


center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE])


  2. Define the weights (the embedding matrix)

    Each row is the embedding vector of one word. Each vector has length EMBED_SIZE, so the embedding matrix has shape [VOCAB_SIZE, EMBED_SIZE]. We initialize it from a uniform distribution.


embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0))


  3. Inference (the forward pass of the graph)


tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None)

With this function we can convert a word's index into its embedding vector via the embedding matrix.


embed = tf.nn.embedding_lookup(embed_matrix, center_words)

  4. Define the loss function

Implementing NCE by hand in Python would be complicated, so we use TensorFlow's built-in implementation:


tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1, sampled_values=None, remove_accidental_hits=False, partition_strategy='mod', name='nce_loss')

We need the weights and bias of the hidden layer to compute the NCE loss.


nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                              stddev=1.0 / (EMBED_SIZE ** 0.5)),
                         name='nce_weight')
nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')

Then define the loss:


loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                     biases=nce_bias,
                                     labels=target_words,
                                     inputs=embed,
                                     num_sampled=NUM_SAMPLED,
                                     num_classes=VOCAB_SIZE), name='loss')


  5. Define the optimizer


optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

Phase 2: Execute the computation


Create a session, feed the inputs and outputs into the placeholders, run the optimizer to minimize the loss, and fetch the loss value back.


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    total_loss = 0.0  # we use this to calculate the average loss in the last SKIP_STEP steps
    writer = tf.summary.FileWriter('./my_graph/no_frills/', sess.graph)
    for index in xrange(NUM_TRAIN_STEPS):
        centers, targets = batch_gen.next()
        loss_batch, _ = sess.run([loss, optimizer],
                                 feed_dict={center_words: centers, target_words: targets})
        total_loss += loss_batch
        if (index + 1) % SKIP_STEP == 0:
            print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
            total_loss = 0.0
    writer.close()

Name scope


Give your tensors names and inspect them in TensorBoard:



The resulting graph looks like a mess.


Use tf.name_scope(name) to group nodes. Each scope appears in TensorBoard as a single node, and clicking on it reveals its internal detail. You will also notice that TensorBoard draws two kinds of edges between nodes: solid and dotted. Solid edges represent data flow, while dotted edges show which operations a node depends on (control dependencies).
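A minimal sketch of the idea (the scope names mirror the class later in this note; the sizes are made up for illustration):

import tensorflow as tf

# Group related ops under name scopes so TensorBoard collapses each group into one node.
with tf.name_scope("data"):
    center_words = tf.placeholder(tf.int32, shape=[8], name='center_words')

with tf.name_scope("embed"):
    embed_matrix = tf.Variable(tf.random_uniform([100, 16], -1.0, 1.0), name='embed_matrix')
    embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')

# Ops are now named "data/center_words", "embed/embed_matrix", etc., and TensorBoard
# shows "data" and "embed" as collapsible scope nodes instead of a flat tangle.
print(embed_matrix.name)   # embed/embed_matrix:0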

The complete node graph


So far we have built a simple, sequential model. Using Python's object-oriented features, we can write a model that is easier to reuse by wrapping the skip-gram model in a class:


class SkipGramModel:
    """ Build the graph for word2vec model """
    def __init__(self, vocab_size, embed_size, batch_size, num_sampled, learning_rate):
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.batch_size = batch_size
        self.num_sampled = num_sampled
        self.lr = learning_rate
        self.global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')

    def _create_placeholders(self):
        """ Step 1: define the placeholders for input and output """
        with tf.name_scope("data"):
            self.center_words = tf.placeholder(tf.int32, shape=[self.batch_size], name='center_words')
            self.target_words = tf.placeholder(tf.int32, shape=[self.batch_size, 1], name='target_words')

    def _create_embedding(self):
        """ Step 2: define weights. In word2vec, it's actually the weights that we care about """
        # Assemble this part of the graph on the CPU. You can change it to GPU if you have GPU
        with tf.device('/cpu:0'):
            with tf.name_scope("embed"):
                self.embed_matrix = tf.Variable(tf.random_uniform([self.vocab_size,
                                                                   self.embed_size], -1.0, 1.0),
                                                name='embed_matrix')

    def _create_loss(self):
        """ Step 3 + 4: define the model + the loss function """
        with tf.device('/cpu:0'):
            with tf.name_scope("loss"):
                # Step 3: define the inference
                embed = tf.nn.embedding_lookup(self.embed_matrix, self.center_words, name='embed')

                # Step 4: define loss function
                # construct variables for NCE loss
                nce_weight = tf.Variable(tf.truncated_normal([self.vocab_size, self.embed_size],
                                                             stddev=1.0 / (self.embed_size ** 0.5)),
                                         name='nce_weight')
                nce_bias = tf.Variable(tf.zeros([self.vocab_size]), name='nce_bias')

                # define loss function to be NCE loss function
                self.loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                                          biases=nce_bias,
                                                          labels=self.target_words,
                                                          inputs=embed,
                                                          num_sampled=self.num_sampled,
                                                          num_classes=self.vocab_size), name='loss')

    def _create_optimizer(self):
        """ Step 5: define optimizer """
        with tf.device('/cpu:0'):
            self.optimizer = tf.train.GradientDescentOptimizer(self.lr).minimize(self.loss,
                                                                                 global_step=self.global_step)

    def _create_summaries(self):
        with tf.name_scope("summaries"):
            tf.summary.scalar("loss", self.loss)
            tf.summary.histogram("histogram loss", self.loss)
            # because you have several summaries, we should merge them all
            # into one op to make it easier to manage
            self.summary_op = tf.summary.merge_all()

    def build_graph(self):
        """ Build the graph for our model """
        self._create_placeholders()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()
        self._create_summaries()

def train_model(model, batch_gen, num_train_steps, weights_fld):
    saver = tf.train.Saver()  # defaults to saving all variables - in this case embed_matrix, nce_weight, nce_bias

    initial_step = 0
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/checkpoint'))
        # if that checkpoint exists, restore from checkpoint
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)

        total_loss = 0.0  # we use this to calculate the average loss in the last SKIP_STEP steps
        writer = tf.summary.FileWriter('improved_graph/lr' + str(LEARNING_RATE), sess.graph)
        initial_step = model.global_step.eval()
        for index in xrange(initial_step, initial_step + num_train_steps):
            centers, targets = batch_gen.next()
            feed_dict = {model.center_words: centers, model.target_words: targets}
            loss_batch, _, summary = sess.run([model.loss, model.optimizer, model.summary_op],
                                              feed_dict=feed_dict)
            writer.add_summary(summary, global_step=index)
            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
                saver.save(sess, 'checkpoints/skip-gram', index)
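The note does not show the driver code that ties these pieces together. A hedged sketch of how they might be used follows; the hyperparameter values and the batch generator (make_batch_generator) are assumptions for illustration, not the course's actual helpers.

# Hypothetical driver code: the hyperparameter values and make_batch_generator
# are illustrative assumptions, not part of the lecture note.
VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE = 50000, 128, 128
NUM_SAMPLED, LEARNING_RATE, NUM_TRAIN_STEPS = 64, 1.0, 10000

def main():
    model = SkipGramModel(VOCAB_SIZE, EMBED_SIZE, BATCH_SIZE, NUM_SAMPLED, LEARNING_RATE)
    model.build_graph()                                  # phase 1: assemble the graph
    batch_gen = make_batch_generator()                   # hypothetical: yields (centers, targets) batches
    train_model(model, batch_gen, NUM_TRAIN_STEPS, 'processed/')   # phase 2: run the computation

if __name__ == '__main__':
    main()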

With t-SNE we can visualize our embedding matrix. We can see that all the numbers are clustered into a single line in the bottom right corner, next to letters and names; all the months are grouped together; "do", "does", and "did" form one group; and so on.


You can also print out the words closest to 'American'.


t-SNE (from Wikipedia)

t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten. It is a nonlinear dimensionality reduction technique particularly well suited for embedding high-dimensional data into two or three dimensions, which can then be visualized in a scatter plot. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects by distant points. The t-SNE algorithm has two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects such that similar objects have a high probability of being picked, while dissimilar points have a very small probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map and minimizes the Kullback-Leibler divergence between the two distributions with respect to the locations of the points in the map.


t-SNE visualization of the MNIST dataset


We can also visualize the data with PCA.
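A hedged sketch of what such a visualization could look like with scikit-learn and matplotlib (these libraries, and the words list, are assumptions; the lecture itself uses TensorBoard's embedding projector, shown next):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# final_embed_matrix: the trained [VOCAB_SIZE, EMBED_SIZE] array (obtained as in the
# snippet below); words: the matching vocabulary list (assumed to exist already).
points_2d = TSNE(n_components=2).fit_transform(final_embed_matrix[:500])
# points_2d = PCA(n_components=2).fit_transform(final_embed_matrix[:500])   # PCA alternative

plt.scatter(points_2d[:, 0], points_2d[:, 1], s=3)
for (x, y), word in zip(points_2d, words[:500]):
    plt.annotate(word, (x, y), fontsize=6)
plt.show()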


We can visualize our embeddings in fewer than ten lines of code; TensorBoard provides an excellent tool for this:


from tensorflow.contrib.tensorboard.plugins import projector

# obtain the embedding_matrix after you've trained it
final_embed_matrix = sess.run(model.embed_matrix)

# create a variable to hold your embeddings. It has to be a variable. Constants
# don't work. You also can't just use the embed_matrix we defined earlier for our model. Why
# is that so? I don't know. I get the 500 most popular words.
embedding_var = tf.Variable(final_embed_matrix[:500], name='embedding')
sess.run(embedding_var.initializer)

config = projector.ProjectorConfig()
summary_writer = tf.summary.FileWriter(LOGDIR)

# add embeddings to config
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name

# link the embeddings to their metadata file. In this case, the file that contains
# the 500 most popular words in our vocabulary
embedding.metadata_path = LOGDIR + '/vocab_500.tsv'

# save a configuration file that TensorBoard will read during startup
projector.visualize_embeddings(summary_writer, config)

# save our embedding
saver_embed = tf.train.Saver([embedding_var])
saver_embed.save(sess, LOGDIR + '/skip-gram.ckpt', 1)

Why we still care about gradients


None of the models we have built so far fetch the gradients of individual nodes, because TensorFlow handles backpropagation automatically. We should still know how to fetch gradients, though: TensorFlow cannot tell us when gradients are vanishing or exploding, so we need to inspect the model's gradients to know whether it is training properly.
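For example (a minimal sketch, not from the lecture note), with the plain skip-gram graph built earlier you could fetch the gradient of the loss with respect to the embedding matrix and monitor its magnitude; tf.gradients and tf.global_norm are standard TensorFlow 1.x APIs, while loss, embed_matrix, centers, and targets are the objects defined in the earlier snippets:

# Minimal sketch: explicitly fetching gradients to watch for vanishing/exploding values.
# Assumes loss, embed_matrix, center_words, target_words, centers, targets from above.
grads = tf.gradients(loss, [embed_matrix])   # symbolic gradient d(loss)/d(embed_matrix)
grad_norm = tf.global_norm(grads)            # a single scalar that is easy to monitor

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    norm_value = sess.run(grad_norm,
                          feed_dict={center_words: centers, target_words: targets})
    print('gradient norm:', norm_value)      # near zero suggests vanishing; huge values, exploding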
