@damhack

damhack@alien.top · 2 years ago

That’s how pretraining is already done. You would have the same issue: latency that is orders of magnitude greater. Given the number of calculations per training epoch, you don’t want to be bound by the slowest worker in the cluster. OpenAI etc. use 40Gbps (or 100Gbps nowadays) backplanes between A100/H100 GPU servers. Sending data over the Internet to an Nvidia 1080 is simply too slow.
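To put rough numbers on that, here is a minimal Python sketch of a synchronous step that has to wait for the slowest worker and then move data over the link. The payload size, compute times, and the 50 Mbps home connection are made-up figures for illustration, and the function names are hypothetical. It also ignores protocol overhead and round-trip latency, so the real gap is likely worse.

```python
def transfer_seconds(payload_bytes: float, link_gbps: float) -> float:
    """Time to push a payload over a link, ignoring overhead and round trips."""
    return payload_bytes * 8 / (link_gbps * 1e9)


def sync_step_seconds(worker_compute_s, payload_bytes, link_gbps):
    """A synchronous step is bound by the slowest worker, plus the transfer time."""
    return max(worker_compute_s) + transfer_seconds(payload_bytes, link_gbps)


PAYLOAD = 1e9  # assumption: 1 GB exchanged per step

# Assumed datacenter case: similar GPUs on a 100 Gbps backplane.
datacenter = sync_step_seconds([0.10, 0.10, 0.11], PAYLOAD, link_gbps=100)

# Assumed volunteer case: one slow consumer GPU behind a 50 Mbps link.
internet = sync_step_seconds([0.10, 0.10, 2.0], PAYLOAD, link_gbps=0.05)

print(f"datacenter step: {datacenter:.2f} s")
print(f"internet step:   {internet:.2f} s")
```

With these assumed inputs the transfer term dominates the Internet case, and the single slow worker sets the compute term for everyone else.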

damhack@alien.top · 2 years ago

You’d better tell the GPU manufacturers that LLM workloads can’t be parallelized.

The point of Transformers is that the matrix operations can be parallelized, unlike in standard RNNs.

The issue with distributing those parallel operations is that for every partition of the workload, you introduce latency.

If you offload a layer at a time, then you are introducing both the latency of the slowest worker and the network latency, plus the latency of combining results back into one set.

If you’re partitioning at a finer grain, e.g. parts of a layer, then you add even more latency.

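Latency can go from 1ms per layer in a monolithic LLM to >1s once the layers are distributed. Every generated token has to pass through every layer in sequence, so that means response times measured in multiple minutes.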
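As a back-of-the-envelope check, here is a small Python sketch of that arithmetic. The 1ms and 1s per-layer figures are the ones from this comment. The layer count and reply length are assumptions picked for illustration, and the function name is hypothetical. It treats generation as strictly sequential (each token goes through all layers before the next token starts) and ignores batching and pipelining.

```python
def response_seconds(per_layer_s: float, n_layers: int, n_tokens: int) -> float:
    """Total time if each token passes through every layer one after another."""
    return per_layer_s * n_layers * n_tokens


N_LAYERS = 32  # assumption: depth of a mid-sized model
N_TOKENS = 20  # assumption: a short reply

monolithic = response_seconds(0.001, N_LAYERS, N_TOKENS)  # 1 ms per layer
distributed = response_seconds(1.0, N_LAYERS, N_TOKENS)   # 1 s per layer

print(f"monolithic:  {monolithic:.2f} s")
print(f"distributed: {distributed / 60:.1f} min")
```

Under these assumptions a reply that takes well under a second on one machine takes over ten minutes once each layer costs a second.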