[Paper Notes][2019] RoBERTa: A Robustly Optimized BERT Pretraining Approach

Dynamic Masking

In the static-masking baseline, the training data is duplicated 10 times so that each sequence is masked in 10 different ways over the 40 training epochs (similar to the dupe_factor pre-training parameter in BERT). RoBERTa instead uses dynamic masking: a new masking pattern is generated every time a sequence is fed to the model. See the sketch below.
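A minimal sketch of the difference, assuming a plain list of tokens and a 15% masking rate; the function names are illustrative, not from the paper's codebase, and the 80/10/10 replacement rule is omitted for brevity.

```python
import random

MASK_PROB = 0.15  # BERT/RoBERTa mask roughly 15% of tokens

def mask_tokens(tokens, mask_token="[MASK]"):
    """Replace ~15% of tokens with the mask token (80/10/10 rule omitted)."""
    return [mask_token if random.random() < MASK_PROB else tok for tok in tokens]

# Static masking (original BERT): masks are precomputed once per duplicated copy,
# so each sequence only ever sees dupe_factor distinct mask patterns.
def static_masked_copies(tokens, dupe_factor=10):
    return [mask_tokens(tokens) for _ in range(dupe_factor)]

# Dynamic masking (RoBERTa): a fresh pattern is sampled every time the sequence
# is fed to the model, e.g. inside the data loader's collate step.
def dynamic_masked_batch(sequences):
    return [mask_tokens(seq) for seq in sequences]
```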

Training with large batches

Training BERT-base for 1M steps with a batch size of 256 sequences is equivalent in computational cost to training for 125K steps with a batch size of 2K sequences, or for 31K steps with a batch size of 8K; the larger batches can be implemented with gradient accumulation.
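A minimal PyTorch-style sketch of gradient accumulation, assuming a HuggingFace-style model whose forward pass returns an object with a `.loss` attribute; `model`, `optimizer`, and `data_loader` are placeholders supplied by the caller.

```python
def train_with_accumulation(model, optimizer, data_loader, accumulation_steps=8):
    """Accumulate gradients over several micro-batches before each optimizer step.

    With micro-batches of 256 sequences, accumulation_steps=8 gives an
    effective batch of 2K sequences; accumulation_steps=32 gives 8K.
    """
    model.train()
    optimizer.zero_grad()
    for step, (input_ids, labels) in enumerate(data_loader):
        loss = model(input_ids, labels=labels).loss
        # Divide so the accumulated gradient matches the mean over the large batch.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```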

Remove NSP
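RoBERTa drops the NSP loss and packs each input with full sentences sampled contiguously from one or more documents, up to 512 tokens (the FULL-SENTENCES format); the paper finds that removing NSP matches or slightly improves downstream performance. A minimal packing sketch, assuming a generic `tokenize` callable; the function name is illustrative.

```python
def pack_full_sentences(documents, tokenize, max_len=512, sep_token="[SEP]"):
    """Pack contiguous sentences into inputs of at most max_len tokens, with no NSP labels.

    When a document ends, packing continues into the next document after an
    extra separator token (the FULL-SENTENCES input format).
    """
    inputs, current = [], []
    for doc in documents:
        for sentence in doc:
            tokens = tokenize(sentence)
            if current and len(current) + len(tokens) > max_len:
                inputs.append(current)
                current = []
            current.extend(tokens)  # sentences longer than max_len are kept whole here
        if current and len(current) < max_len:
            current.append(sep_token)  # mark the document boundary
    if current:
        inputs.append(current)
    return inputs
```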

More data and more training steps

BOOKCORPUS plus English WIKIPEDIA (16GB), CC-NEWS (76GB), OPENWEBTEXT (38GB), and STORIES (31GB), totaling over 160GB of uncompressed text.

Paper: Liu et al., 2019, RoBERTa: A Robustly Optimized BERT Pretraining Approach (arXiv:1907.11692)
