## BERT_base Parameter Count Calculation


Embedding layer parameters: (30522+512+2)*768=23835648=A,
where 30522=vocab_size, 512=max_seq_length, and 2=the number of segment IDs (0 or 1)

The embedding layer is followed by a LayerNorm (beta and gamma): 768+768=1536=B

Each of the 12 encoder layers:

1. Multi-Head Attention
Q/K/V projections: 3*(768*768+768)=1771776=C
Output (concat) projection: 768*768+768=589824+768=590592=D

LayerNorm beta and gamma: 768+768=1536=E

2. Feed Forward
FFN(x) = max(0, xW1 + b1)W2 + b2
W1: 768*3072+3072=2362368=F
W2: 3072*768+768=2360064=G

LayerNorm beta and gamma: 768+768=1536=H

After the encoder:

1. Pooler (fully connected): 768*768+768=590592=I

2. cls transform (dense layer of the MLM head): 768*768+768=590592=J

LayerNorm beta and gamma: 768+768=1536=K

3. MLM output: the decoder weight is tied to the embedding matrix, so it only adds an output bias of size 30522

4. cls for NSP: 768*2+2=1538

Total for the embeddings plus the 12 encoder layers (the pooler and pre-training heads above are not included):
A+B+12*(C+D+E+F+G+H)=23835648+1536+12*(1771776+590592+1536+2362368+2360064+1536)=23837184+12*7087872=108891648
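
A minimal C++ sketch to double-check the arithmetic above; the variable names A through H mirror the labels used in the calculation, and the hyperparameters are the BERT-base values already assumed above (hidden size 768, FFN size 3072, 12 layers):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // BERT-base hyperparameters, matching the numbers used above.
    const int64_t hidden = 768, vocab = 30522, max_pos = 512, seg = 2,
                  ffn = 3072, layers = 12;

    int64_t A = (vocab + max_pos + seg) * hidden;  // token + position + segment embeddings
    int64_t B = 2 * hidden;                        // embedding LayerNorm (beta + gamma)
    int64_t C = 3 * (hidden * hidden + hidden);    // attention Q, K, V projections
    int64_t D = hidden * hidden + hidden;          // attention output (concat) projection
    int64_t E = 2 * hidden;                        // attention LayerNorm
    int64_t F = hidden * ffn + ffn;                // FFN W1 + b1
    int64_t G = ffn * hidden + hidden;             // FFN W2 + b2
    int64_t H = 2 * hidden;                        // FFN LayerNorm

    int64_t per_layer = C + D + E + F + G + H;
    int64_t total     = A + B + layers * per_layer;

    std::cout << "per layer: " << per_layer << "\n";  // 7087872
    std::cout << "total:     " << total << "\n";      // 108891648
    return 0;
}
```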

## C++ size()

The size() of vector, string, and similar containers returns an unsigned integer (size_t), so comparing it against a negative number gives surprising results; see the example below.
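
A minimal example of the pitfall: in a mixed signed/unsigned comparison the signed operand is converted to size_t, so -1 becomes a huge positive value.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;  // empty, so v.size() == 0
    int n = -1;

    // v.size() is size_t (unsigned). In `v.size() > n` the int -1 is converted
    // to size_t and becomes a huge positive value, so the comparison is false
    // even though mathematically 0 > -1.
    if (v.size() > n) {
        std::cout << "expected branch, but never taken\n";
    } else {
        std::cout << "-1 converted to " << static_cast<std::size_t>(n) << "\n";
    }

    // Safer: cast the size to a signed type first
    // (or use std::cmp_greater from <utility> in C++20).
    if (static_cast<long long>(v.size()) > n) {
        std::cout << "signed comparison behaves as expected\n";
    }
    return 0;
}
```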

## GPT 2: Language Models are Unsupervised Multitask Learners

#### Core

• Language modeling is usually framed as unsupervised distribution estimation from a set of examples (x1, x2, …, xn), each composed of variable-length sequences of symbols (s1, s2, …, sn).
• Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output|input). A general system that conditions not only on the input but also on the task to be performed should model p(output|input, task).
• Language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. Language modeling is also able to, in principle, learn the tasks without the need for explicit supervision of which symbols are the outputs to be predicted.

#### Training Dataset

The resulting WebText dataset contains slightly over 8 million documents for a total of 40 GB of text.

#### Input Representation

Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences.
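
A minimal C++ sketch of the BPE learning loop on a hypothetical toy corpus (GPT-2's actual tokenizer is byte-level and learns far more merges): repeatedly count adjacent symbol pairs and merge the most frequent one, so frequent words collapse into single tokens while rare words stay split into smaller units.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One word from the corpus: its current symbol sequence and its frequency.
struct Word {
    std::vector<std::string> symbols;
    int count;
};

int main() {
    // Hypothetical toy corpus, initially split into characters.
    std::vector<Word> corpus = {
        {{"l", "o", "w"}, 5},
        {{"l", "o", "w", "e", "r"}, 2},
        {{"n", "e", "w", "e", "s", "t"}, 6},
        {{"w", "i", "d", "e", "s", "t"}, 3},
    };

    const int num_merges = 10;
    for (int m = 0; m < num_merges; ++m) {
        // Count every adjacent symbol pair, weighted by word frequency.
        std::map<std::pair<std::string, std::string>, int> pairs;
        for (const auto& w : corpus)
            for (size_t i = 0; i + 1 < w.symbols.size(); ++i)
                pairs[{w.symbols[i], w.symbols[i + 1]}] += w.count;
        if (pairs.empty()) break;

        // The most frequent pair becomes the next merge rule.
        auto best = pairs.begin();
        for (auto it = pairs.begin(); it != pairs.end(); ++it)
            if (it->second > best->second) best = it;

        // Apply the merge to every word in the corpus.
        for (auto& w : corpus) {
            std::vector<std::string> merged;
            for (size_t i = 0; i < w.symbols.size(); ++i) {
                if (i + 1 < w.symbols.size() && w.symbols[i] == best->first.first &&
                    w.symbols[i + 1] == best->first.second) {
                    merged.push_back(w.symbols[i] + w.symbols[i + 1]);
                    ++i;  // skip the second symbol of the merged pair
                } else {
                    merged.push_back(w.symbols[i]);
                }
            }
            w.symbols = merged;
        }
        std::cout << "merge " << m << ": " << best->first.first << " + "
                  << best->first.second << "\n";
    }
    return 0;
}
```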

#### Model

• Largely follows the details of the OpenAI GPT model, with a few modifications:
• Layer normalization was moved to the input of each sub-block and an additional layer normalization was added after the final self-attention block.
• Scale the weights of residual layers at initialization by a factor of 1/√N, where N is the number of residual layers (a small sketch follows this list).
• The vocabulary is expanded to 50,257 tokens, the context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
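
A minimal sketch of the residual-scaling rule from the bullet above; the base initialization std (0.02) and the residual-layer count are assumed values for illustration, not taken from the note:

```cpp
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
    // Assumed for illustration: 12 Transformer blocks, each with 2 residual
    // sub-layers (attention output and FFN output), giving N = 24.
    const int N = 24;
    const double base_std = 0.02;  // assumed base initialization std

    // Scaling residual-projection weights at init by 1/sqrt(N) is equivalent
    // to drawing them with std = base_std / sqrt(N).
    const double scaled_std = base_std / std::sqrt(static_cast<double>(N));

    std::mt19937 gen(42);
    std::normal_distribution<double> dist(0.0, scaled_std);
    std::vector<double> residual_proj(768 * 768);  // one residual projection matrix
    for (double& w : residual_proj) w = dist(gen);

    std::cout << "scaled init std: " << scaled_std << "\n";
    return 0;
}
```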


## GPT: Improving Language Understanding by Generative Pre-Training

#### Framework  