Multi-node + Parallelism

<aside> ๐Ÿ“ for big models on stability cluster; Largest model fit on a single GPU: ****EleutherAI/pythia-1.4b (minimal tricks) ****Largest model fit on a single node: ****EleutherAI/pythia-12b (+ many tricks)

for how we can train these < 13B models; Inter-node distributed data parallelism with torch.distributed Intra-node sharding through DeepSpeed ZeRO functionality

for what to do about > 13B models; Use GPT-NeoX to achieve tensor / model parallelism inter-node? Use multi-node DeepSpeed?

</aside>

Our goal is to achieve inter-node data parallelism and, most likely, further intra-node data + model parallelism, so that we can efficiently train models that fit across 8 GPUs (320 GB).
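As a rough illustration of the inter-node data-parallel piece, here is a minimal sketch using torch.distributed and DistributedDataParallel. It assumes launching one process per GPU via torchrun (which sets RANK, LOCAL_RANK and WORLD_SIZE); the model, dataset and hyperparameters are placeholders, not our actual training setup.

```python
# Minimal DDP sketch: launch with e.g.
#   torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d train_ddp.py
# Model and data below are placeholders standing in for a Pythia checkpoint
# and the tokenized corpus.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # One process per GPU; NCCL backend for GPU collectives.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(2048, 2048).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 2048), torch.randn(1024, 2048))
    sampler = DistributedSampler(dataset)   # shards batches across all ranks
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)            # reshuffle consistently across ranks
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                 # gradients are all-reduced across nodes here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```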

PyTorch - Multinode DDP Tool Tests

Intra-Node Sharding Tool Tests

Token Efficiency - DeepSpeed ZeRO

The first two rows below are cumulative changes, while the remaining rows correspond to the different DeepSpeed configuration options outlined here. We can see a drop in performance when using DeepSpeed, since sharding naturally induces more communication than a model of this size actually needs.
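For context, the sketch below shows how a ZeRO configuration is wired up through deepspeed.initialize. The specific values (stage 2, overlap_comm, micro-batch size, learning rate) are illustrative assumptions, not the exact configurations benchmarked in the table.

```python
# Illustrative ZeRO setup only -- values here are assumptions, not the tested configs.
# Launch with e.g. `deepspeed --num_gpus=8 train_zero.py` (or with a hostfile for multi-node).
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                   # shard optimizer state + gradients
        "overlap_comm": True,         # overlap communication with the backward pass
        "contiguous_gradients": True,
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(2048, 2048)   # placeholder for the actual Pythia model

# deepspeed.initialize wraps the model and builds the sharded optimizer.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(8, 2048).to(model_engine.device)
y = torch.randn(8, 2048).to(model_engine.device)
loss = torch.nn.functional.mse_loss(model_engine(x), y)
model_engine.backward(loss)
model_engine.step()
```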

These figures assume a context length of 2048, bf16 mixed precision, a relatively optimal number of dataloader workers, 8 x A100-40GB GPUs, 96 CPU cores, 960 GB RAM, and near-linear scaling along the data-parallel dimension. They also ignore gradient accumulation and model parallelism as other means of achieving a higher effective batch size and throughput during training.

Crucially, these figures restrict the DeepSpeed pipeline to sharding across a single node of 8 A100s, even though you can train larger models more efficiently through multi-node sharding.

See the efficiency_tests group on the LLCheM WandB workspace for some of the runs.

$$ \text{Tokens/second/A100} = \frac{\text{Total Tokens}}{\text{Total Time} \times N_{\text{A100}}} $$
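As a quick worked example of this metric (the numbers below are purely hypothetical, not measured results):

```python
# Throughput metric from the formula above; example values are illustrative only.
def tokens_per_second_per_a100(total_tokens: int, total_time_s: float, n_a100: int) -> float:
    """Tokens processed per second, normalised per A100 GPU."""
    return total_tokens / (total_time_s * n_a100)

# e.g. 1B tokens over ~8 hours on a single node of 8 A100s (hypothetical values)
print(tokens_per_second_per_a100(total_tokens=1_000_000_000,
                                 total_time_s=8 * 3600,
                                 n_a100=8))   # ~4340 tokens/s/A100
```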

EleutherAI/pythia-1b

EleutherAI/pythia-2.8b

EleutherAI/pythia-6.9b

1-7B multi-node optimisations

🚧 EleutherAI/pythia-12b

🚧 EleutherAI/gpt-neox-20b