<aside>
For big models on the stability cluster:

- Largest model that fits on a single GPU: **EleutherAI/pythia-1.4b** (minimal tricks)
- Largest model that fits on a single node: **EleutherAI/pythia-12b** (+ many tricks)

How we can train these < 13B models:

- Inter-node distributed data parallelism with torch.distributed
- Intra-node sharding through DeepSpeed ZeRO functionality

What to do about > 13B models: use GPT-NeoX to achieve tensor / model parallelism inter-node? Use multi-node DeepSpeed?
</aside>
Our goal is to achieve inter-node data parallelism, and most likely further intra-node data + model parallelism, so that we can efficiently train models that fit across 8 GPUs (320GB).
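As a rough sketch of this setup (not our exact training script), the snippet below launches data parallelism over all ranks and shards model state with DeepSpeed ZeRO-3; the model name, batch size and optimizer settings are placeholders:

```python
# Minimal sketch: inter-node data parallelism via the DeepSpeed/torchrun
# launcher, with ZeRO-3 sharding of parameters, gradients and optimizer state.
# Model name and hyperparameters below are illustrative, not our actual config.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # assumption: tune per model size
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,                        # shard params, grads and optimizer state
        "overlap_comm": True,
    },
}

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-1.4b", torch_dtype=torch.bfloat16
)

# deepspeed.initialize sets up torch.distributed (NCCL) and wraps the model
# in a ZeRO-sharded data-parallel engine.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Training step (batch assumed to be token ids of shape [batch, 2048]):
# loss = engine(batch, labels=batch).loss
# engine.backward(loss)
# engine.step()
```

This would be launched with something like `deepspeed --num_nodes=<N> --num_gpus=8 train.py` or `torchrun --nnodes=<N> --nproc_per_node=8 train.py`.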
The first two rows below are cumulative changes, while the remaining rows are different DeepSpeed configuration options outlined here. We can see a drop in performance when using DeepSpeed, since it naturally induces more communication than we actually need for models of this size.
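For reference, the kinds of `zero_optimization` blocks being compared look roughly like the following; these are illustrative settings, not the exact configurations behind each row:

```python
# Illustrative DeepSpeed "zero_optimization" blocks for the options compared
# below; exact per-run settings are assumptions, not the recorded configs.
zero_stage_2 = {
    "zero_optimization": {
        "stage": 2,                     # shard optimizer state + gradients
        "overlap_comm": True,
    }
}

zero_stage_3 = {
    "zero_optimization": {
        "stage": 3,                     # additionally shard parameters
        "overlap_comm": True,
    }
}

zero_stage_3_offload = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},   # trade throughput for memory
        "offload_param": {"device": "cpu"},
    }
}
```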
These figures assume a context length of 2048, bf16 mixed precision, a reasonably optimal number of dataloader workers, 8 x A100-40GB GPUs, 96 CPU cores, 960GB RAM, and near-linear scaling along the data-parallel dimension. They also ignore gradient accumulation and model parallelism as further means of achieving a higher effective batch size and throughput for training.
Crucially, these figures restrict the DeepSpeed pipeline to sharding across a single node of 8 A100s, even though you can train larger models more efficiently through multi-node sharding.
See the efficiency_tests group on the LLCheM WandB workspace for some of the runs.
$$ \text{Tokens/second/A100} = \frac{\text{Total Tokens}}{\text{Total Time} \times N_{\text{A100s}}} $$
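Equivalently, as a quick helper (the numbers below are made up for illustration, not measured figures):

```python
# Throughput metric used above: tokens processed per second per A100.
def tokens_per_second_per_a100(total_tokens: int, total_time_s: float, n_a100: int) -> float:
    return total_tokens / (total_time_s * n_a100)

# e.g. 1B tokens processed in 10 hours on a single 8-GPU node:
print(tokens_per_second_per_a100(1_000_000_000, 10 * 3600, 8))  # ~3472 tokens/s/A100
```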
- EleutherAI/pythia-1b
- EleutherAI/pythia-2.8b
- EleutherAI/pythia-6.9b
- EleutherAI/pythia-12b
- EleutherAI/gpt-neox-20b