I saw deepinfra. Their price is $0.7-0.95/million tokens for llama2 70b.
How is that possible? Even the quantized 70b models is 35 gbs.
How do you minimize costs of the GPUs, bandwidths, and on the software side?
On their article:
“I think other than the WhatsApp team, they are maybe first or second in the world to having the capability to build efficient infrastructure to serve hundreds of millions of people.”
But technology is not magic, can someone shine some light on running cost-effective AI clusters? I was looking at vast.ai etc but renting GPUs directly in that way would be much more expensive.
In production, most API uses something like TGI or vLLM that support batching, batch multiple requests and inference them at the same time. This doesn’t increase inference speed but it increase thoughput. For example, if running 70B llama normally take 20token/s for a single user, with batching the speed is 15-18 token/s but you can serve 20-50 users at the same time. The whole throughout will be 300-1000token/s, which makes the low price possible.