duplicated 10 times so that each sequence is masked in 10 different ways (like the dupe_factor pre-training parameter in BERT's static masking)
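A minimal sketch of this static-masking duplication: each copy of a sequence gets its own random mask, fixed once at preprocessing time. The function names, the simple 15% mask rule (without BERT's 80/10/10 replacement split), and the sample sentence are illustrative assumptions, not BERT's actual preprocessing code.

```python
import random

MASK = "[MASK]"

def mask_sequence(tokens, mask_prob=0.15, rng=None):
    # Simplified masking: replace ~15% of tokens with [MASK]
    # (real BERT also keeps/randomizes some selected tokens).
    rng = rng or random.Random()
    return [MASK if rng.random() < mask_prob else t for t in tokens]

def duplicate_with_masks(tokens, dupe_factor=10, seed=0):
    # Static masking: emit dupe_factor copies, each with a different
    # random mask decided up front, as controlled by dupe_factor in BERT.
    rng = random.Random(seed)
    return [mask_sequence(tokens, rng=rng) for _ in range(dupe_factor)]

tokens = "the quick brown fox jumps over the lazy dog".split()
copies = duplicate_with_masks(tokens)
```

RoBERTa instead moves to dynamic masking, generating a fresh mask every time a sequence is fed to the model, which removes the need for this duplication.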
Training with large batches
Training BERT-base for 1M steps with a batch size of 256 sequences processes the same number of sequences, and thus has the same computational cost, as training for 125K steps with a batch size of 2K, or for 31K steps with a batch size of 8K; the large-batch schedules can be simulated on limited hardware via gradient accumulation.
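A quick check of the arithmetic (note that 31K in the text rounds 31,250):

```python
# Each schedule processes the same total number of sequences (~256M),
# so the compute cost is matched across batch sizes.
schedules = {256: 1_000_000, 2_048: 125_000, 8_192: 31_250}
totals = {bs: bs * steps for bs, steps in schedules.items()}
# every value in totals is 256_000_000

# Gradient accumulation emulates batch 2,048 with micro-batches of 256:
# accumulate gradients over 8 forward/backward passes, then take one
# optimizer step.
accum_steps = 2_048 // 256  # -> 8
```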
More data and more training steps
BOOKCORPUS plus English WIKIPEDIA (16GB), CC-NEWS (76GB), OPENWEBTEXT (38GB), STORIES (31GB) — about 160GB of text in total