Scaling Laws for Neural Language Models
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations