Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Perplexity in language model
1.PPL
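PPL (perplexity) is a metric used in natural language processing (NLP) to measure how good a language model is. It estimates the probability of a sentence from the probability of each word and normalizes by the sentence length. The formula is
$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$
The formula shows that the lower the perplexity, the better the model. The last part of the formula, a product of left-to-right conditional probabilities, reads most naturally as a description of generative models such as GPT.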
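As a minimal sketch of the computation, the snippet below takes the per-token conditional probabilities of one sentence and returns PPL = P(w_1 ... w_N)^(-1/N). The function name `perplexity` and the probability values are made up for illustration; a real language model would supply the probabilities.

```python
import math


def perplexity(token_probs):
    """Perplexity of one sentence from per-token conditional probabilities.

    token_probs[i] is assumed to be P(w_i | w_1 ... w_{i-1}) from some
    language model; the values used below are invented.
    """
    n = len(token_probs)
    # Sum log-probabilities instead of multiplying probabilities to avoid underflow.
    log_prob = sum(math.log(p) for p in token_probs)
    # exp(-(1/N) * log P(W)) is the same as P(W)^(-1/N).
    return math.exp(-log_prob / n)


confident = [0.5, 0.5, 0.5, 0.5]   # model is fairly sure of every token
uncertain = [0.1, 0.1, 0.1, 0.1]   # model spreads its probability thinly
ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
print(ppl_confident, ppl_uncertain)
```

The sentence on which the model is more confident gets the lower perplexity, which matches the "lower is better" reading.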
2.Language Model
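- autoregressive (AR) language model
GPT:
$$\max_\theta \; \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})$$
- autoencoding (AE) language model
BERT (denoising auto-encoding):
$$\max_\theta \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})$$
where $\hat{\mathbf{x}}$ is the corrupted (masked) input, $\bar{\mathbf{x}}$ denotes the masked tokens, and $m_t = 1$ indicates $x_t$ is masked.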
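The practical difference between the two objectives is which positions contribute to the sum: an AR model adds log p(x_t | x_<t) at every position, while BERT-style denoising adds a log-probability only where m_t = 1. The sketch below illustrates that bookkeeping only; the function names and the log-probability values are hypothetical, not taken from either model.

```python
def ar_objective(token_log_probs):
    """AR log-likelihood: every position t contributes log p(x_t | x_<t)."""
    return sum(token_log_probs)


def ae_objective(token_log_probs, mask):
    """Denoising objective: only masked positions (m_t == 1) contribute."""
    return sum(lp for lp, m in zip(token_log_probs, mask) if m == 1)


# Invented per-position log-probabilities for a 4-token sentence.
ar_log_probs = [-1.0, -2.0, -0.5, -1.5]   # log p(x_t | x_<t)
ae_log_probs = [-0.2, -1.1, -0.3, -0.9]   # log p(x_t | corrupted input)
mask = [0, 1, 0, 1]                       # positions 2 and 4 are masked

ar_value = ar_objective(ar_log_probs)
ae_value = ae_objective(ae_log_probs, mask)
print(ar_value, ae_value)
```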
[Paper Notes][2020-WWW]Enhanced-RCNN: An Efficient Method for Learning Sentence Similarity
The Related Work section at the front of the paper can serve as material for a survey.
I.Architecture
II.Detail
1.Input Encoding
(a)RNN Encoder
(b)CNN Encoder
2.Interactive Sentence Representation
(a)Soft-attention Alignment (similar to ESIM)
(b)Interaction Modeling
3.Similarity Modeling
(a)Fusion Layer
(b)Label Prediction
MLP+softmax
Decoding Strategies in Text Generation
I.Language model decoding
Given a sequence of m tokens $x_1 \ldots x_m$ as context, the task is to generate the next n continuation tokens to obtain the completed sequence $x_1 \ldots x_{m+n}$. We assume that models compute $P(x_{1:m+n})$ using the common left-to-right decomposition of the text probability,
$$P(x_{1:m+n}) = \prod_{i=1}^{m+n} P(x_i \mid x_1 \ldots x_{i-1}),$$
which is used to generate the continuation token-by-token using a particular decoding strategy.
II.Decoding Strategies
1.Maximization-based decoding
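The decoder's output layer is usually followed by a softmax function that normalizes the output scores into probabilities. With a large vocabulary the softmax becomes expensive, because its denominator sums over the entire vocabulary.
- Greedy decoding
At every step, pick the token with the highest probability as the output, until the end-of-sequence symbol is produced. Multiplying the probabilities of these outputs gives the probability of the sequence.
Drawback: each choice is locally optimal, but the probability of the whole sequence is not necessarily the maximum, so the best result can be missed.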
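The drawback of greedy decoding can be seen on a toy example. The sketch below uses a hypothetical two-step model over the tokens A and B (the table `COND` and its probabilities are invented, and the sequence length is fixed at 2 instead of waiting for an end-of-sequence symbol) and compares the greedy result with an exhaustive search over all sequences.

```python
from itertools import product

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
COND = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"A": 0.55, "B": 0.45},
    ("B",): {"A": 0.9, "B": 0.1},
}
STEPS = 2  # fixed length, a simplification for this sketch


def sequence_prob(seq):
    """Probability of a sequence as the product of its conditional probabilities."""
    prob = 1.0
    for i, tok in enumerate(seq):
        prob *= COND[tuple(seq[:i])][tok]
    return prob


def greedy_decode():
    """Pick the most probable token at every step."""
    seq = []
    for _ in range(STEPS):
        dist = COND[tuple(seq)]
        seq.append(max(dist, key=dist.get))
    return tuple(seq)


def exhaustive_decode():
    """Score every possible sequence and return the most probable one."""
    return max(product("AB", repeat=STEPS), key=sequence_prob)


greedy_seq = greedy_decode()
best_seq = exhaustive_decode()
print(greedy_seq, sequence_prob(greedy_seq))
print(best_seq, sequence_prob(best_seq))
```

Greedy commits to A at the first step because 0.6 > 0.4, but the sequence starting with B ends up with the higher overall probability.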
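- Beam search
Consider `Beam Size=2`. The first step keeps the two most probable tokens, A and B. The second step keeps AB and BB (the large orange arrows in the original figure). Expanding AB and BB gives four candidates, ABA/ABB/BBA/BBB, of which the two with the highest overall probability, ABB and BBB, are kept. This continues until the sentence ends.
Drawbacks:
- Because the beam search score is a running product of probabilities, each at most 1, the earlier a hypothesis reaches the end-of-sequence symbol, the higher its probability tends to be, so beam search favors short outputs. The remedy is to take the log of the output probabilities, sum them, and divide by the length before picking the maximum: $(a \cdot b)^{1/2} \Rightarrow \frac{1}{2}(\log a + \log b)$, i.e. the log of the geometric mean of the probabilities.
- A large beam size makes decoding expensive: the time complexity is O(N*b*V), where N is the sequence length, b is the beam size, and V is the vocabulary size.
- Language models usually assign high scores to well-formed text, but for longer texts the highest-scoring outputs are often generic, repetitive, and awkward. The probability distribution of beam search output is very different from that of human language.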
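The length bias and its remedy can be checked numerically. In the sketch below, two hypothetical finished hypotheses are scored once by raw log-probability and once by log-probability divided by length; the candidate names and per-token probabilities are invented for illustration.

```python
import math


def raw_log_prob(token_probs):
    """Sum of log-probabilities, i.e. the log of the product of probabilities."""
    return sum(math.log(p) for p in token_probs)


def length_normalized_score(token_probs):
    """Average log-probability per token (log of the geometric mean)."""
    return raw_log_prob(token_probs) / len(token_probs)


# Invented per-token probabilities of two finished hypotheses.
candidates = {
    "short": [0.5, 0.5],
    "long": [0.6, 0.6, 0.6, 0.6],
}

best_raw = max(candidates, key=lambda name: raw_log_prob(candidates[name]))
best_norm = max(candidates, key=lambda name: length_normalized_score(candidates[name]))
print(best_raw, best_norm)
```

Every token of the long hypothesis is more probable than every token of the short one, yet the raw product still prefers the short hypothesis; dividing by the length reverses the ranking.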
2.Stochastic Decoding
- Sampling with temperature
- Top-k Sampling
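Select from the vocabulary the k tokens whose probabilities sum to the maximum, i.e. the set $V^{(k)} \subset V$ of size k that maximizes
$$\sum_{x \in V^{(k)}} P(x \mid x_{1:i-1})$$
Drawback: a good value of k is hard to pin down. If k is too small, the output is bland and generic, i.e. it lacks diversity. If k is too large, top-k degenerates into pure sampling and produces semantically incoherent text, because it samples from the tail of the distribution.
- Top-p Sampling(nucleus sampling)
Select from the vocabulary the smallest set of tokens whose cumulative probability reaches a given threshold p, i.e. the smallest set $V^{(p)} \subset V$ such that
$$\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \ge p$$
The size of the sampling set is adjusted dynamically according to the probability distribution at each time step. For a high threshold, this small subset of the vocabulary takes up the vast majority of the probability mass: the nucleus.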
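Two points from the list above can be made concrete with a small standard-library sketch: dividing the logits by a temperature before the softmax sharpens (t < 1) or flattens (t > 1) the distribution, and the number of tokens inside the nucleus changes with the shape of that distribution. The function names and the logit values are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stabilized)."""
    scaled = [u / temperature for u in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs, top_p):
    """Number of highest-probability tokens needed to reach cumulative mass >= top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)  # reached only through floating-point rounding


logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # invented scores for a 5-token vocabulary
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)

print(nucleus_size(sharp, top_p=0.9), nucleus_size(flat, top_p=0.9))
```

With the same threshold p = 0.9, the peaked distribution needs far fewer tokens in its nucleus than the flat one, which is the dynamic adjustment that a fixed k cannot provide.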
III.Implementation
```python
import torch
import torch.nn.functional as F


def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float('Inf')):
    """ Filter a distribution of logits using top-k and/or nucleus (top-p) filtering
        Args:
            logits: logits distribution shape (vocabulary size)
            top_k >0: keep only top k tokens with highest probability (top-k filtering).
            top_p >0.0: keep the top tokens with cumulative probability >= top_p (nucleus filtering).
                Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751)
    """
    assert logits.dim() == 1  # batch size 1 for now - could be updated for more but the code would be less clear
    top_k = min(top_k, logits.size(-1))  # Safety check
    if top_k > 0:
        # Remove all tokens with a probability less than the last token of the top-k
        indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
        logits[indices_to_remove] = filter_value

    if top_p > 0.0:
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

        # Remove tokens with cumulative probability above the threshold
        sorted_indices_to_remove = cumulative_probs > top_p
        # Shift the indices to the right to keep also the first token above the threshold
        sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
        sorted_indices_to_remove[..., 0] = 0

        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = filter_value
    return logits


# Here is how to use this function for top-p sampling
temperature = 1.0
top_k = 0
top_p = 0.9

# Get logits with a forward pass in our model (model and input are pre-defined)
logits = model(input)

# Keep only the last token predictions of the first batch item (batch size 1),
# apply a temperature coefficient and filter
logits = logits[0, -1, :] / temperature
filtered_logits = top_k_top_p_filtering(logits, top_k=top_k, top_p=top_p)

# Sample from the filtered distribution
probabilities = F.softmax(filtered_logits, dim=-1)
next_token = torch.multinomial(probabilities, 1)
```
Beam search
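1.Simple implementation
```python
import torch

# Beam search.
# Assumed to be defined earlier: decoder, encoder_output, decoder_state, de_vocab,
# max_len, and decoder_outputs (scores from the first decoder step, shape (1, vocab);
# treated as log-probabilities so that they can be added along a hypothesis).
samples = []
topk = 10
log_prob, v_idx = decoder_outputs.detach().topk(topk)
for k in range(topk):
    # Each hypothesis is [token ids so far, accumulated score, decoder state].
    samples.append([[v_idx[0][k].item()], log_prob[0][k], decoder_state])

for i in range(max_len):
    new_samples = []
    for sample in samples:
        v_list, score, decoder_state = sample
        # A hypothesis that already ended is carried over unchanged.
        if v_list[-1] == de_vocab.item2index['_EOS_']:
            new_samples.append([v_list, score, decoder_state])
            continue

        decoder_inputs = torch.LongTensor([v_list[-1]])
        decoder_outputs, new_states = decoder(decoder_inputs, encoder_output, decoder_state)
        log_prob, v_idx = decoder_outputs.detach().topk(topk)

        # Extend the hypothesis with each of its topk next tokens.
        for k in range(topk):
            new_v_list = v_list + [v_idx[0][k].item()]
            new_samples.append([new_v_list, score + log_prob[0][k], new_states])

    # Keep the topk hypotheses with the highest accumulated score.
    new_samples = sorted(new_samples, key=lambda sample: sample[1], reverse=True)
    samples = new_samples[:topk]

# Convert the best hypothesis back to tokens.
v_list, score, states = samples[0]
pred_sent = []  # initialized here; the original snippet did not show its definition
for idx in v_list:
    pred_sent.append(de_vocab.index2item[idx])
```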