sdmat • LocalLLaMA • Anyone spend a bunch of $$ on a computer for LLM and regret it?
1 year ago
What did you go with?
No, the primary concern is that network latency kills the serial performance of LLMs.
You can have a distributed LLM achieving decent total throughput across many slow generations. You can't have a distributed LLM whose throughput for a single generation is competitive with running in a single cluster.
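A back-of-envelope calculation makes the point concrete. If the model is pipeline-split across machines, every decoded token pays a network round trip per hop, and decoding is strictly serial, so that latency cannot be hidden. The numbers below are illustrative assumptions, not measurements:

```python
# Rough cost model: why network latency dominates single-stream decoding
# when a model is pipeline-split across machines. All numbers are assumptions.

HOPS = 4            # model split across 4 machines
COMPUTE_MS = 10.0   # assumed per-token compute time across all layers

def tokens_per_second(latency_ms: float) -> float:
    """Decoding is serial: each token waits for compute plus one
    round trip per pipeline hop before the next token can start."""
    per_token_ms = COMPUTE_MS + HOPS * latency_ms
    return 1000.0 / per_token_ms

# Internet-scale latency (~30 ms per hop) vs. intra-cluster (~0.05 ms)
print(f"distributed:    {tokens_per_second(30.0):.1f} tok/s")
print(f"single cluster: {tokens_per_second(0.05):.1f} tok/s")
```

The total throughput across many concurrent generations can still be fine, because independent streams overlap each other's network waits; it is the single stream that gets throttled.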
You can get preemptible A100s for $1/hr, so not exactly breaking the bank if you're willing to take the risk.
This technique is actually really useful for batch processing.
I.e. if you run 100 generations and reuse each layer while it is loaded, the total time will be far less than the serial time of 100 individual runs.
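The amortization above can be sketched with a simple cost model: when weights are streamed from slow storage, the expensive layer load is paid once per layer and shared across the whole batch, instead of once per generation. Constants and function names here are illustrative assumptions:

```python
# Hypothetical cost model for layer-streaming inference (all numbers assumed).
# Loading a layer's weights over a slow link dwarfs the compute on it.

LOAD_MS = 500.0    # assumed time to stream one layer's weights into memory
COMPUTE_MS = 5.0   # assumed per-generation compute time on one loaded layer

def serial_cost_ms(n_layers: int, n_generations: int) -> float:
    # Run generations one at a time: every generation reloads every layer.
    return n_generations * n_layers * (LOAD_MS + COMPUTE_MS)

def batched_cost_ms(n_layers: int, n_generations: int) -> float:
    # Load each layer once, then push all generations through it
    # before moving on to the next layer.
    return n_layers * (LOAD_MS + n_generations * COMPUTE_MS)

if __name__ == "__main__":
    layers, gens = 32, 100
    print(f"serial:  {serial_cost_ms(layers, gens):,.0f} ms")
    print(f"batched: {batched_cost_ms(layers, gens):,.0f} ms")
```

With these assumed numbers the batch pays for 32 layer loads instead of 3,200, which is the whole benefit: per-generation latency is still slow, but aggregate throughput improves by roughly the batch size.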