BERT_base Parameter Count Calculation

training_log

I. Embedding Layer
Embedding parameters: (30522+512+2)*768 = 23835648 = A
30522 = vocab_size (token embeddings), 512 = max_seq_length (position embeddings), 2 = segment ids 0/1 (segment embeddings)

The embedding layer is followed by a LayerNorm (beta and gamma): 768+768 = 1536 = B

II. Inside Each Encoder Layer
1. Multi-head attention
Parameters of the heads: 768*(768/12)*3*12 + (768/12)*3*12 = 1769472+2304 = 1771776 = C
768/12 = 64 is the per-head projection dimension, 3 = Q, K and V, 12 = number of heads; the second term is the bias. Equivalently, the Q, K and V projections together amount to three 768*768 weight matrices plus three 768-dim bias vectors.

Output projection after concat: 768*768+768 = 589824+768 = 590592 = D
After the 12 heads are concatenated the dimension is back to 768, and a fully connected layer applies a 768*768 linear transform; the second term is the bias.

LayerNorm beta and gamma: 768+768 = 1536 = E

2. Feed Forward
FFN(x) = max(0, xW1 + b1)W2 + b2
W1 and b1: 768*3072+3072 = 2362368 = F
W2 and b2: 3072*768+768 = 2360064 = G

LayerNorm beta and gamma: 768+768 = 1536 = H

III. After the Encoder
1. Pooler layer (fully connected): 768*768+768 = 590592 = I

2. cls transform (the dense transform in the MLM head):
Fully connected: 768*768+768 = 590592 = J
LayerNorm beta and gamma: 768+768 = 1536 = K

3. MLM output:
Fully connected: 768*30522 (these weights are shared with the Embedding table, so they are not counted here)

Only the bias is counted: 30522 = L

4. cls for NSP:
Fully connected: 768*2+2 = 1538 = M

IV. Total (excluding everything after the Encoder):
A+B+12*(C+D+E+F+G+H) = 23835648+1536+12*(1771776+590592+1536+2362368+2360064+1536) = 23837184+12*7087872 = 108891648
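
To sanity-check the arithmetic, here is a minimal C++ sketch (assuming the standard BERT_base hyperparameters: hidden size 768, 12 heads, 12 encoder layers, FFN size 3072, vocab size 30522); the variables simply mirror the letters A-H used above.

#include <cstdint>
#include <iostream>

int main() {
    const int64_t hidden = 768, heads = 12, layers = 12, ffn = 3072;
    const int64_t vocab = 30522, max_pos = 512, segments = 2;

    // Embedding table (token + position + segment) and its LayerNorm.
    int64_t A = (vocab + max_pos + segments) * hidden;  // 23835648
    int64_t B = 2 * hidden;                             // 1536

    // One encoder layer: Q/K/V projections, output projection, FFN, two LayerNorms.
    int64_t C = hidden * (hidden / heads) * 3 * heads + (hidden / heads) * 3 * heads; // 1771776
    int64_t D = hidden * hidden + hidden;               // 590592
    int64_t E = 2 * hidden;                             // 1536
    int64_t F = hidden * ffn + ffn;                     // 2362368
    int64_t G = ffn * hidden + hidden;                  // 2360064
    int64_t H = 2 * hidden;                             // 1536

    int64_t per_layer = C + D + E + F + G + H;          // 7087872
    int64_t total = A + B + layers * per_layer;         // 108891648

    std::cout << "per encoder layer: " << per_layer << std::endl;
    std::cout << "total: " << total << std::endl;
}

Running it prints 7087872 per encoder layer and 108891648 in total, i.e. roughly the 110M parameters usually quoted for BERT_base.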

KMP and Naive Matching (BF)

  1. KMP and the naive (brute-force, BF) matching algorithm

#include <string>
#include <vector>
using namespace std;

class Solution {
public:
    int kmp(string ts, string ps) {
        int i = 0;
        int j = 0;
        vector<int> next(ps.size());
        get_next(ps, next);
        while (i < (int) ts.size() && j < (int) ps.size()) {// string::size() returns an unsigned integer; comparing it against a negative value (j can become -1) goes wrong without the (int) cast
            if (j == -1 || ts[i] == ps[j]) {
                ++i;
                ++j;
            } else {
                j = next[j];
            }
        }
        if (j == (int) ps.size())
            return i - j;
        else
            return -1;
    }

    // Build the failure function: next[j] is the length of the longest proper prefix
    // of ps[0..j-1] that is also a suffix of it; next[0] = -1 serves as a sentinel.
    void get_next(string ps, vector<int> &next) {
        next[0] = -1;
        int k = -1;
        int j = 0;
        while (j < (int) ps.size() - 1) {
            if (k == -1 || ps[j] == ps[k]) {
                next[++j] = ++k;
            } else
                k = next[k];
        }
    }

    int bf(string ts, string ps) {// naive brute-force search
        int i = 0;
        int j = 0;
        while (i < (int) ts.size() && j < (int) ps.size()) {
            if (ts[i] == ps[j]) {
                i++;
                j++;
            } else {
                i = i - j + 1;// backtrack: restart one position after the previous start
                j = 0;
            }
        }
        if (j == (int) ps.size())
            return i - j;
        else
            return -1;
    }
};
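
A minimal driver showing how the two routines above might be called; the text and pattern strings here are just an illustrative example.

#include <iostream>

int main() {
    Solution s;
    string text = "BBC ABCDAB ABCDABCDABDE";
    string pattern = "ABCDABD";
    cout << "kmp: " << s.kmp(text, pattern) << endl; // prints 15
    cout << "bf:  " << s.bf(text, pattern) << endl;  // prints 15
}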

2. The idea behind KMP

Find the longest prefix that is also a suffix.

Suppose the text string S is currently matched up to position i and the pattern string P up to position j:

  • If j = -1, or the current characters match (i.e. S[i] == P[j]), advance both: i++, j++, and continue with the next character.
  • If j != -1 and the current characters do not match (i.e. S[i] != P[j]), keep i unchanged and set j = next[j]. On a mismatch this effectively slides the pattern P to the right by j - next[j] positions relative to the text S.
    • In other words, the number of positions the pattern shifts right on a mismatch is (index of the mismatched character) - (next value at that index), i.e. j - next[j], which is always at least 1. A worked example follows below.
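
For example, for the pattern P = "ABCDABD" (the same pattern used in the driver above), the get_next routine yields next = [-1, 0, 0, 0, 0, 1, 2]. If a mismatch occurs at j = 6 (the final 'D'), the prefix "ABCDAB" has already matched; its longest proper prefix that is also a suffix is "AB" (length next[6] = 2), so the pattern slides right by j - next[j] = 6 - 2 = 4 positions and matching resumes with j = 2.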

Reference: 字符串查找之KMP (string search with KMP)

C++ size()

size() on vector, string, etc. returns an unsigned integer, so comparing it against a negative number causes problems. For example:

vector<int> a{1, 4, 3};
int i = -1;
while (i < a.size()) {
    cout << "great" << endl;//并不会输出
    break;
}

The condition must be changed to:
while (i < (int)a.size()) {
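
A self-contained sketch of the same pitfall, which also shows what -1 becomes after the implicit conversion to the unsigned size type (the printed value assumes a 64-bit platform):

#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<int> a{1, 4, 3};
    int i = -1;

    // For the comparison, i is implicitly converted to the unsigned size type,
    // so -1 turns into a huge value and the condition is false.
    cout << static_cast<size_t>(i) << endl;                       // 18446744073709551615 on 64-bit
    cout << (i < a.size() ? "runs" : "skipped") << endl;          // prints "skipped"
    cout << ((i < (int) a.size()) ? "runs" : "skipped") << endl;  // prints "runs"
}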

 

[Paper Notes][AAAI-2020] ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding

Framework

Structure

Detail

  • Transformer Encoder
  • Task Embedding: an id ranging from 0 to N
  • Pre-training Tasks
    • Word-aware Pre-training Tasks
        • Knowledge Masking Task: phrase masking and named entity masking (from ERNIE 1.0)
        • Capitalization Prediction Task: the cased model has advantages in tasks like named entity recognition, while the uncased model is more suitable for some other tasks.
        • Token-Document Relation Prediction Task: predicts whether a token appearing in one segment also appears in other segments of the original document (intended to capture keywords).
    • Structure-aware Pre-training Tasks
        • Sentence Reordering Task: a given paragraph is randomly split into 1 to m segments, which are then shuffled into a random permuted order. The pre-trained model is asked to reorganize these permuted segments, modeled as a k-class classification problem where k = \sum_{n=1}^{m} n!
        • Sentence Distance Task: modeled as a 3-class classification problem. "0" means the two sentences are adjacent in the same document, "1" means they are in the same document but not adjacent, and "2" means they come from two different documents.
    • Semantic-aware Pre-training Tasks
        • Discourse Relation Task: To predict the semantic or rhetorical relation between two sentences.

        • IR Relevance Task: the query is taken as the first sentence and the title as the second. Query-title pairs labelled "0" stand for strong relevance, meaning the title was clicked by users after they issued the query. Those labelled "1" represent weak relevance, meaning the title appeared in the search results for the query but was not clicked. The label "2" means the query and title are completely irrelevant and random in terms of semantic information.

paper