Scaling Laws for Neural Language Models
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations