In production, most APIs run on something like TGI or vLLM, which support batching: multiple requests are grouped together and run through inference at the same time.
This doesn't make any single request faster, but it increases throughput.
For example, if running a 70B Llama normally gives 20 tokens/s for a single user, with batching each user gets 15-18 tokens/s, but you can serve 20-50 users at the same time. The total throughput will be roughly 300-1000 tokens/s, which is what makes the low price possible.
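To make the arithmetic concrete, here is a minimal sketch that multiplies per-user speed by the number of concurrent users, using the figures from the example above. The function name `aggregate_throughput` is made up for illustration and is not part of TGI or vLLM; real numbers depend on the hardware, model, and sequence lengths.

```python
def aggregate_throughput(per_user_tokens_per_s, concurrent_users):
    """Total tokens per second generated across all requests in the batch."""
    return per_user_tokens_per_s * concurrent_users


# Figures taken from the example above (illustrative, not benchmarks).
baseline = aggregate_throughput(20, 1)   # one user, no batching
low = aggregate_throughput(15, 20)       # slower per user, 20 users
high = aggregate_throughput(18, 50)      # slightly slower per user, 50 users

print(f"unbatched: {baseline} tokens/s")
print(f"batched:   {low}-{high} tokens/s")
print(f"gain:      {low / baseline:.0f}x to {high / baseline:.0f}x")
```

So each user gives up a few tokens/s, while the same hardware produces many times more tokens in total, and the cost per token drops accordingly.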