RoBERTa: A Robustly Optimized BERT Pretraining Approach
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Language Models are Unsupervised Multitask Learners
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding