Calculating the parameter count of BERT_base


I. Embedding layer
Embedding parameters: (30522+512+2)*768 = 23835648 = A
30522 = vocab_size, 512 = max_seq_length (position embeddings), 2 = number of segment ids (0 or 1)

The embedding layer is followed by a LayerNorm (beta and gamma): 768+768 = 1536 = B

II. Inside the encoder (per layer)
1. Multi-head attention
Parameters of the attention heads (Q/K/V projections): 768*(768/12)*3*12 + (768/12)*3*12 = 1769472+2304 = 1771776 = C
768/12 = 64 is the per-head dimension after projection, 3 = Q, K and V, 12 = number of heads; the second term is the biases

Concat + output projection: 768*768+768 = 589824+768 = 590592 = D
After the 12 heads are concatenated the dimension is back to 768; a 768*768 fully connected layer then applies a linear transformation, and the second term is its bias
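
Equivalently, with hidden size H = 768, the per-head Q/K/V projections stack into three full H*H matrices and the output projection is a fourth, which reproduces C and D:

```latex
% Q, K, V and the output projection viewed as full 768x768 matrices (H = 768)
\begin{aligned}
C     &= 3\,(H^2 + H) = 3\,(589824 + 768) = 1771776 \\
D     &= H^2 + H = 590592 \\
C + D &= 4\,(H^2 + H) = 2362368
\end{aligned}
```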

LayerNorm beta and gamma: 768+768 = 1536 = E

2. Feed Forward
FFN(x) = max(0, xW1 + b1)W2 + b2
W1:768*3072+3072=2362368=F
W2: 3072*768+768=2360064=G

LayerNorm beta and gamma: 768+768 = 1536 = H

III. After the encoder
1. Pooler layer (fully connected): 768*768+768 = 590592 = I

2. cls transform (the dense + LayerNorm preceding the MLM output):
Fully connected: 768*768+768 = 590592 = J
LayerNorm beta and gamma: 768+768 = 1536 = K

3. MLM output layer:
Fully connected: 768*30522 (the weights are tied with the embedding matrix, so they are not counted again here)

Only the bias needs to be counted: 30522 = L

4. cls for NSP:
Fully connected: 768*2+2 = 1538 = M

IV. Total (excluding everything after the encoder):
A+B+12*(C+D+E+F+G+H) = 23835648+1536+12*(1771776+590592+1536+2362368+2360064+1536) = 23837184+12*7087872 = 23837184+85054464 = 108891648 (≈110M, the commonly quoted size of BERT_base)
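
As a sanity check, the arithmetic above can be recomputed with a small standalone program. This is purely illustrative (variable names are my own, not from any BERT implementation):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    const int64_t H = 768, V = 30522, P = 512, S = 2;   // hidden size, vocab, positions, segments
    const int64_t heads = 12, layers = 12, ffn = 3072;  // BERT_base configuration

    // I. Embeddings + LayerNorm
    const int64_t A = (V + P + S) * H;   // 23835648
    const int64_t B = 2 * H;             // 1536

    // II. Per encoder layer
    const int64_t C  = H * (H / heads) * 3 * heads + (H / heads) * 3 * heads;  // 1771776
    const int64_t D  = H * H + H;        // 590592  (output projection after concat)
    const int64_t E  = 2 * H;            // 1536    (attention LayerNorm)
    const int64_t F  = H * ffn + ffn;    // 2362368 (W1 + b1)
    const int64_t G  = ffn * H + H;      // 2360064 (W2 + b2)
    const int64_t Hn = 2 * H;            // 1536    (FFN LayerNorm)

    const int64_t per_layer = C + D + E + F + G + Hn;   // 7087872
    const int64_t total     = A + B + layers * per_layer;

    std::cout << "per encoder layer: " << per_layer << "\n";
    std::cout << "total (embeddings + 12 layers): " << total << "\n";  // 108891648
    return 0;
}
```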

KMP

C++ size()

The size() of vector, string, etc. returns an unsigned integer, so comparing it against a negative number can go wrong, for example:
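
The snippet below is an illustrative sketch of the pitfall (not from the original note): comparing an int holding -1 against size() silently converts the int to an unsigned type.

```cpp
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;   // empty, so v.size() == 0
    int i = -1;

    // v.size() returns std::size_t (unsigned). In this comparison i is converted
    // to unsigned and becomes a huge positive value, so the condition is false.
    if (i < v.size()) {
        std::cout << "not printed\n";
    } else {
        std::cout << "-1 < v.size() evaluated as false\n";
    }

    // A common fix: cast size() to a signed type before comparing.
    if (i < static_cast<int>(v.size())) {
        std::cout << "with a signed cast, -1 < 0 is true\n";
    }
    return 0;
}
```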


GPT-2: Language Models are Unsupervised Multitask Learners

Core

  • Language modeling is usually framed as unsupervised distribution estimation from a set of examples (x1, x2, …, xn), each composed of variable-length sequences of symbols (s1, s2, …, sn); the factorization is written out after this list.

  • Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output|input). A general system should condition not only on the input but also on the task to be performed, i.e. it should model p(output|input, task).
  • Language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. Language modeling is also able to, in principle, learn the tasks without the need for explicit supervision of which symbols are the outputs to be predicted.
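
Written out, the factorization and the conditional forms referenced in the bullets above (standard notation, matching the paper's framing):

```latex
% Autoregressive factorization of p(x) over a symbol sequence (s1, ..., sn),
% plus the single-task and task-conditioned objectives.
\begin{aligned}
p(x) &= \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1}) \\
\text{single task:}\quad    & p(\mathrm{output} \mid \mathrm{input}) \\
\text{general system:}\quad & p(\mathrm{output} \mid \mathrm{input}, \mathrm{task})
\end{aligned}
```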

Training Dataset

Slightly over 8 million documents for a total of 40 GB of text.

Input Representation

Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences.
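
A minimal sketch of the core BPE merge loop on made-up toy data (this shows the original symbol-level idea; GPT-2 itself uses a byte-level variant with extra merge restrictions):

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Repeatedly merge the most frequent adjacent symbol pair, starting from characters.
int main() {
    std::vector<std::vector<std::string>> words = {   // toy corpus
        {"l","o","w"}, {"l","o","w"}, {"l","o","w","e","r"},
        {"n","e","w","e","s","t"}, {"n","e","w","e","s","t"}
    };

    const int num_merges = 4;                          // arbitrary for the demo
    for (int m = 0; m < num_merges; ++m) {
        // 1. Count adjacent symbol pairs across the corpus.
        std::map<std::pair<std::string, std::string>, int> pair_counts;
        for (const auto& w : words)
            for (size_t i = 0; i + 1 < w.size(); ++i)
                ++pair_counts[{w[i], w[i + 1]}];
        if (pair_counts.empty()) break;

        // 2. Pick the most frequent pair.
        auto best = pair_counts.begin();
        for (auto it = pair_counts.begin(); it != pair_counts.end(); ++it)
            if (it->second > best->second) best = it;
        const std::string a = best->first.first, b = best->first.second;

        // 3. Replace every occurrence of that pair with the merged symbol.
        for (auto& w : words) {
            std::vector<std::string> merged;
            for (size_t i = 0; i < w.size(); ++i) {
                if (i + 1 < w.size() && w[i] == a && w[i + 1] == b) {
                    merged.push_back(a + b);
                    ++i;                               // skip the second half of the pair
                } else {
                    merged.push_back(w[i]);
                }
            }
            w = std::move(merged);
        }
        std::cout << "merge " << m + 1 << ": '" << a << "' + '" << b << "'\n";
    }
    return 0;
}
```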

Model

  • Follows the details of the OpenAI GPT model, with a few modifications:
  • Layer normalization was moved to the input of each sub-block and an additional layer normalization was added after the final self-attention block.
  • Scale the weights of residual layers at initialization by a factor of 1/√N, where N is the number of residual layers.
  • The vocabulary is expanded to 50,257 tokens, the context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.

paper