## BERT_base参数量计算


Embedding layer parameters: (30522+512+2)*768 = 23835648 = A
30522 = vocab_size, 512 = max_seq_length, 2 = segment_id (0 or 1)

The embedding layer is followed by a LayerNorm (beta and gamma): 768+768 = 1536 = B

1. Multi-Head Attention
Q, K, V projections: 3*(768*768+768) = 1771776 = C
Output projection after concatenating the heads: 768*768+768 = 589824+768 = 590592 = D

LayerNorm beta and gamma: 768+768 = 1536 = E

2. Feed Forward
FFN(x) = max(0, xW1 + b1)W2 + b2
W1: 768*3072+3072 = 2362368 = F
W2: 3072*768+768 = 2360064 = G

LayerNorm beta and gamma: 768+768 = 1536 = H

1. Pooler layer (fully connected): 768*768+768 = 590592 = I

2. cls transform (the MLM head's dense layer): 768*768+768 = 590592 = J

LayerNorm beta and gamma: 768+768 = 1536 = K

3. MLM: the decoder weight is tied to the token embedding matrix, so only the output bias adds parameters: 30522 = L

4. cls for NSP: 768*2+2 = 1538 = M

Total for the embeddings plus 12 encoder layers (the pooler and pre-training heads above are not counted here): A+B+12*(C+D+E+F+G+H) = 23835648+1536+12*(1771776+590592+1536+2362368+2360064+1536) = 23837184+12*7087872 = 108891648, i.e. the ~110M usually quoted for BERT-base.


## KMP and naive matching (BF)

#### 1. KMP and naive matching (BF) implementation

```cpp
class Solution {
public:
    int kmp(string ts, string ps) {
        if (ps.empty()) return 0; // guard: next[0] below assumes a non-empty pattern
        int i = 0;
        int j = 0;
        vector<int> next(ps.size());
        get_next(ps, next);
        // string::size() returns an unsigned integer; comparing it with a
        // negative number goes wrong, so cast to int first.
        while (i < (int) ts.size() && j < (int) ps.size()) {
            if (j == -1 || ts[i] == ps[j]) {
                ++i;
                ++j;
            } else {
                j = next[j];
            }
        }
        if (j == (int) ps.size())
            return i - j;
        else
            return -1;
    }

    void get_next(const string &ps, vector<int> &next) {
        next[0] = -1;
        int k = -1;
        int j = 0;
        while (j < (int) ps.size() - 1) {
            if (k == -1 || ps[j] == ps[k]) {
                next[++j] = ++k;
            } else {
                k = next[k];
            }
        }
    }

    int bf(string ts, string ps) { // naive (brute-force) search
        int i = 0;
        int j = 0;
        while (i < (int) ts.size() && j < (int) ps.size()) {
            if (ts[i] == ps[j]) {
                i++;
                j++;
            } else {
                i = i - j + 1; // backtrack the text pointer
                j = 0;
            }
        }
        if (j == (int) ps.size())
            return i - j;
        else
            return -1;
    }
};
```

#### 2. How KMP works

• If j == -1, or the current characters match (S[i] == P[j]), do i++ and j++ and keep matching the next character.
• If j != -1 and the current characters do not match (S[i] != P[j]), keep i fixed and set j = next[j]. On a mismatch this effectively slides the pattern P right by j - next[j] positions relative to the text S.
• In other words, on a mismatch the pattern moves right by (position of the mismatched character) - (next value at that position), i.e. j - next[j], which is always >= 1.


## C++ size()

size() on vector, string, and the other containers returns an unsigned integer, so comparing it with a negative number goes wrong. For example:

```cpp
vector<int> a{1, 4, 3};
int i = -1;
while (i < a.size()) {
    cout << "great" << endl; // never printed: i is converted to unsigned
    break;
}
```

This must be changed to:

```cpp
while (i < (int) a.size()) {
```


## [论文笔记][AAAI-2020]ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding

#### Detail

• Transformer Encoder
• Task Embedding: an id ranging from 0 to N
• Capitalization Prediction Task: the cased model has some advantages in tasks like named entity recognition, while the uncased model is more suitable for some other tasks.
• Token-Document Relation Prediction Task: predicts whether a token in a segment appears in other segments of the original document (to capture keywords).
• Sentence Reordering Task: a given paragraph is randomly split into 1 to m segments, and all of the segments are shuffled into a random permuted order. The pre-trained model has to reorganize these permuted segments, modeled as a k-class classification problem where k = \sum_{n=1}^{m} n!.
• Sentence Distance Task: modeled as a 3-class classification problem. "0" means the two sentences are adjacent in the same document, "1" means they are in the same document but not adjacent, and "2" means they come from two different documents.
• Discourse Relation Task: predicts the semantic or rhetorical relation between two sentences.

• IR Relevance Task: the query is taken as the first sentence and the title as the second. Query-title pairs labelled "0" stand for strong relevance, meaning users clicked the title after entering the query. Pairs labelled "1" represent weak relevance: the title appeared in the search results for the query but was not clicked. The label "2" means the query and title are completely irrelevant and random in terms of semantic information.

