round-trip latency of an HTTP request (or gRPC, or whatever, pick your poison) is utterly insignificant compared to the time it takes to run the inference process, even for the smallest models with the fastest inference
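To put rough numbers on it, something like this works (a minimal sketch, assuming an OpenAI-compatible server is already running at localhost:8000, e.g. vLLM’s API server; the base URL and model name are just placeholders):

```python
# Compare plain HTTP round-trip overhead against the time a completion
# request actually takes. Assumes an OpenAI-compatible server is running
# at localhost:8000; the URL and model name are assumptions.
import time
import requests

BASE = "http://localhost:8000"

# 1) Plain round trip: hit a cheap endpoint that does no generation.
t0 = time.perf_counter()
requests.get(f"{BASE}/v1/models", timeout=10)
round_trip_s = time.perf_counter() - t0

# 2) Actual inference: a small completion request.
t0 = time.perf_counter()
requests.post(
    f"{BASE}/v1/completions",
    json={"model": "my-model", "prompt": "Hello", "max_tokens": 128},
    timeout=120,
)
inference_s = time.perf_counter() - t0

print(f"HTTP round trip:    {round_trip_s * 1000:.1f} ms")
print(f"Completion request: {inference_s * 1000:.1f} ms")
```

The first number is typically single-digit milliseconds on localhost; the second is hundreds of milliseconds to seconds, which is the whole point.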
what is different/better about whatever you are attempting to suggest compared to the existing prominent solutions such as vLLM, TensorRT-LLM, etc?
it’s not clear to me exactly what the value proposition is of what you’re offering.
preferably around 7 billion parameters
aim to produce flawless generations
LMAO goooooooooood fuckin luck buddy
It’s useful for people who want to know the inference response time.
No, it’s useful for people who want to know the inference response time with batch size 1, which is not something that prospective H200 buyers care about. Are you aware that deployments in business environments for interactive use cases such as real-time chat generally use batching? Perhaps you’re assuming request batching is just for offline / non-interactive use, but that isn’t the case.
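For reference, this is roughly what batched inference looks like with vLLM’s offline engine (the model id is just a placeholder, any model you have access to works; the server deployment does the same thing continuously across incoming requests):

```python
# Minimal sketch of batched inference with vLLM: many prompts are scheduled
# together rather than processed one at a time.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model id
params = SamplingParams(max_tokens=128, temperature=0.7)

# 32 prompts submitted at once are batched internally, so aggregate
# throughput is far higher than 32 sequential batch-1 calls.
prompts = [f"Write a one-line summary of topic {i}." for i in range(32)]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text[:80])
```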
Anyone have any solutions for these?
Use a high quality model.
That means not 7B or 13B.
I know a lot of other people have already said this in the thread, but this keeps coming up in this sub so I’m just gonna say it too.
Bleeding edge 7B and 13B models look good in benchmarks. Try actually using them and the first thing you should realize is how poorly benchmark results indicate real world performance. These models are dumb.
You can get started on runpod by depositing as little as $10, that’s less than some fast food meals, just take the plunge and find out for yourself. If you use an RTX A6000 48GB they’ll only charge you $0.79 per hour so you get quite a few hours of experimenting to feel the difference for yourself. With 48GB VRAM you can run Q4_K_M quants of 70B with full GPU offloading, or try Q5_K_M or even Q6 or Q8 if you tweak the number of layers you’re offloading to fit within 48GB (and still get fast enough generations for interactive chat.)
The difference is just absolutely night and day. Not only do 70Bs rarely make the basic mistakes you are describing, sometimes they even surprise me in a way that feels “clever.”
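For reference, the 70B-on-48GB setup described above looks roughly like this with llama-cpp-python (the model path and context size are placeholders; for Q5/Q6/Q8 you’d dial n_gpu_layers down until it fits):

```python
# Sketch of running a 70B Q4_K_M quant with full GPU offloading on 48GB.
# The model path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="/workspace/models/llama-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers; use a smaller number for bigger quants
    n_ctx=4096,
)

out = llm("Q: Why is batching important for LLM serving?\nA:", max_tokens=200)
print(out["choices"][0]["text"])
```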
It would be better if they provided single-batch information for normal inference in FP8.
better for who? people that are just curious or people that are actually going to consider buying H200s?
who is buying a GPU that costs more than a new car and using it for single batch?
this is sort of like what you’re talking about, and pretty interesting IMO:
If you take a really sober look at the numbers, how does running your own system make sense over renting hardware at runpod or a similar service?
To me it doesn’t. I use runpod, I’m just on this sub because it’s the best place I know to keep up on the latest news in open source / self-hosted LLM stuff. I’m not literally running it “locally.”
As far as I can tell there are lots of others like me here on this sub. Of course also many people here run on their own hardware, but it seems to me like the user base here is pretty split. I wonder what a poll would find.
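A rough back-of-envelope on the rent-vs-buy question, using the $0.79/hr A6000 rate mentioned above and purely hypothetical numbers for purchase price and usage:

```python
# Break-even sketch for rent vs. buy. The rental rate is the A6000 figure
# mentioned earlier; purchase price and hours per week are hypothetical.
rental_rate = 0.79          # $/hr for an RTX A6000 48GB on runpod
purchase_price = 4500.0     # hypothetical used A6000 price, $
hours_per_week = 20         # hypothetical actual usage

break_even_hours = purchase_price / rental_rate
break_even_weeks = break_even_hours / hours_per_week

print(f"Break-even after {break_even_hours:.0f} rented hours "
      f"(~{break_even_weeks:.0f} weeks at {hours_per_week} hrs/week)")
# Ignores electricity, the rest of the machine, and resale value, but it
# shows renting wins unless utilization is consistently high.
```

Under those (made-up) assumptions the break-even is several years out, which is why renting tends to win for intermittent use.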
I use runpod for everything I can’t do locally and I’ve been very happy with it. I initially chose it just because it was one of the cheapest, indeed way cheaper than the big 3, but I’ve had a good experience.
The main downside I know of with runpod is that you can only run a container image, you can’t have a full VM. But for most use cases I think this is really no big deal. If you want a generic sandbox for interactive experimentation, rather than to run an actual containerized app, you can just use the runpod pytorch image to get a starting point with CUDA, PyTorch, and some other common stuff installed, then SSH into it and do whatever (see the quick sanity-check sketch below). In other words, you don’t necessarily have to bother with a more “normal” containerized deployment where you’re writing something that runs unattended or exposes an API or whatever, writing a Dockerfile, etc.
Full disclosure: my recent experiments are all testing different setups for inference with continuous batching; I’m personally not doing training or finetuning. But as far as I can tell runpod would be equally applicable to training and finetuning tasks.
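The sanity check mentioned above is just something like this, run inside the pod after SSHing in:

```python
# Quick check that the GPU and the preinstalled stack in the runpod
# pytorch image are visible before doing anything else.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"VRAM: {free / 1e9:.1f} GB free / {total / 1e9:.1f} GB total")
```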