Pretraining record

  1. Distributed training with Horovod
  2. BERT base: 32 × 16 GB GPUs (Tesla V100-SXM2); BERT large: 64 × 16 GB GPUs (Tesla V100-SXM2)
  3. Stage-one:
    --do_train=True \
    --do_eval=True \
    --bert_config_file=bert_base/bert_config.json \
    --train_batch_size=1024 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --num_train_steps=900000 \
    --num_warmup_steps=10000 \
    --learning_rate=1e-4 \
    --save_checkpoints_steps=50000

    Stage-two:

    --do_train=True \
    --do_eval=True \
    --bert_config_file=bert_base/bert_config.json \
    --init_checkpoint=stage-one/model.ckpt-450000 \
    --train_batch_size=256 \
    --max_seq_length=512 \
    --max_predictions_per_seq=76 \
    --num_train_steps=100000 \
    --num_warmup_steps=5000 \
    --learning_rate=1e-4 \
    --save_checkpoints_steps=5000

  4. global step/sec = 1.3-1.9 (a rough wall-clock estimate is sketched below)
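
  The snippet below is only a back-of-the-envelope sketch of what the recorded throughput implies for total training time. It assumes the 1.3-1.9 global step/sec range applies to both stages; the note does not say which stage the measurement came from, so treat the numbers as rough bounds.

    # Rough wall-clock estimate from the recorded throughput (1.3-1.9 global steps/sec).
    # Assumption: the same throughput range applies to both stages.
    def hours(num_steps, steps_per_sec):
        return num_steps / steps_per_sec / 3600

    for name, steps in [("stage-one", 900_000), ("stage-two", 100_000)]:
        print(f"{name}: ~{hours(steps, 1.9):.0f}-{hours(steps, 1.3):.0f} hours")
    # stage-one: ~132-192 hours (about 5.5-8 days)
    # stage-two: ~15-21 hours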

Albert

  1. Gradient accumulation is equivalent to enlarging the batch size by the same factor; for better results, the number of training steps should also be scaled up by that factor.
  2. Code (adapted from the Zhihu answer cited below):
Author: Pascal
Link: https://www.zhihu.com/question/303070254/answer/573037166
Source: Zhihu

import numpy as np
import torch

# model, criterion, optimizer, train_loader and accumulation_steps are
# assumed to be defined elsewhere.
for i, (images, target) in enumerate(train_loader):
    # 1. forward pass
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)

    # 2.1 loss normalization: divide by accumulation_steps so the summed
    #     gradients match those of one large batch
    loss = loss / accumulation_steps
    # 2.2 back propagation (gradients accumulate in .grad)
    loss.backward()

    # 3. update parameters once every accumulation_steps iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()        # update parameters
        optimizer.zero_grad()   # reset accumulated gradients
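
Dividing the loss by accumulation_steps keeps the accumulated gradient on the same scale as the gradient of a single batch that is accumulation_steps times larger (for a mean-reduced loss). Since optimizer.step() now runs only once every accumulation_steps iterations, the loop length (training steps) has to grow by the same factor to keep the number of parameter updates unchanged, which is the point made in item 1 above.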