What are some good GPU cloud options these days for LLM work, primarily for fine-tuning and related experiments? This is for personal work and proof-of-concept type stuff and will be out of pocket, so I’d definitely prefer cheaper options. I’d mostly be using 7-13B models, but later I’d want to test with larger models as well.
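For sizing a GPU against 7-13B models, a back-of-the-envelope memory estimate helps. The sketch below uses common rules of thumb, not measurements: bytes per parameter by dtype for the weights alone, and roughly 16 bytes per parameter for full fine-tuning with Adam in mixed precision (fp16 weights and gradients plus fp32 master weights and two optimizer moments). Activations, KV cache, and quantization overhead are ignored, so treat the numbers as lower bounds. The function names are made up for illustration.

```python
# Rough GPU memory estimates; 1 GB is taken as 1e9 bytes.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "4-bit": 0.5}

# Rule of thumb for full fine-tuning with Adam in mixed precision:
# 2 (weights) + 2 (grads) + 4 (fp32 master copy) + 4 + 4 (Adam moments).
FULL_FT_ADAM_BYTES_PER_PARAM = 16.0


def weights_gb(params_billion, dtype):
    """Memory for the weights alone, in GB."""
    return params_billion * BYTES_PER_PARAM[dtype]


def full_finetune_gb(params_billion):
    """Weights + grads + optimizer state, excluding activations, in GB."""
    return params_billion * FULL_FT_ADAM_BYTES_PER_PARAM


if __name__ == "__main__":
    for size in (7, 13):
        per_dtype = ", ".join(
            f"{dtype}: {weights_gb(size, dtype):.1f} GB" for dtype in BYTES_PER_PARAM
        )
        print(f"{size}B weights -> {per_dtype}")
        print(f"{size}B full fine-tune (Adam) -> at least {full_finetune_gb(size):.0f} GB")
```

By this estimate a 7B model is about 14 GB of weights in fp16/bf16 but at least 112 GB for full fine-tuning, which is why parameter-efficient methods such as LoRA on quantized weights are the usual route on a single rented GPU.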
Most of the providers have on-demand and spot options, with spot obviously being cheaper. I understand that spot instances can go down at any time, but assuming checkpoints are saved regularly and training can resume later, that shouldn’t be a big problem. Are there any gotchas here?
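The save-and-resume pattern for spot instances can be sketched as below. This is a minimal, stdlib-only illustration, not any provider's API: the "training step" is a placeholder, and `CKPT_PATH`, `train`, and `should_stop` are hypothetical names. In a real run the saved state would be the model and optimizer state (for example via `torch.save`). Two gotchas it tries to show: the checkpoint must live on storage that outlives the instance, and it should be written atomically so a kill in the middle of a write cannot destroy the last good copy. Whether a provider sends a termination signal before reclaiming an instance, and how much notice you get, is an assumption to verify per provider.

```python
import json
import os
import signal
import tempfile

CKPT_PATH = "checkpoint.json"  # hypothetical; point this at a persistent volume


def save_checkpoint(state, path=CKPT_PATH):
    """Write to a temp file, then rename, so a partial write never replaces a good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path=CKPT_PATH):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def install_sigterm_flag():
    """Return a callable that becomes True once SIGTERM arrives (call from the main thread)."""
    flag = {"stop": False}

    def handler(signum, frame):
        flag["stop"] = True

    signal.signal(signal.SIGTERM, handler)
    return lambda: flag["stop"]


def train(total_steps, save_every=100, path=CKPT_PATH, should_stop=lambda: False):
    """Resume from the last checkpoint and run until done or asked to stop."""
    state = load_checkpoint(path)
    while state["step"] < total_steps and not should_stop():
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # placeholder for a real training step
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final save, also on a graceful stop
    return state


if __name__ == "__main__":
    final = train(1000, should_stop=install_sigterm_flag())
    print(f"stopped at step {final['step']}")
```

Running the script again after an interruption picks up from the last saved step. If the instance is killed without warning, at most `save_every` steps of work are lost, so that interval is the knob to trade checkpoint overhead against wasted compute.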
The other criterion is a managed/secure environment vs. some kind of open/community environment. Again, the latter options are cheaper, and assuming security is not a major requirement, that seems like the better choice. Any thoughts or advice on this one?
I’m mostly looking at RunPod, Vast.ai, and Replicate based on info from other threads. Are there any other providers folks have had good experiences with?
How do AWS, GCP, or Azure compare to these options? From what I can tell these seem more expensive but I haven’t looked at these too closely.
Any recommendations with some details on your own experience, use cases, and costs would be greatly appreciated.
I use RunPod for everything I can’t do locally and I’ve been very happy with it. I initially chose it just because it was one of the cheapest, way cheaper than the big three (AWS, GCP, Azure), but I’ve had a good experience.
The main downside of RunPod that I know of is that you can only run a container image; you can’t get a full VM. For most use cases I think that’s really no big deal. If you want a generic sandbox for interactive experimentation rather than a place to run an actual containerized app, you can just use the RunPod PyTorch image as a starting point with CUDA, PyTorch, and some other common stuff installed, then SSH into it and do whatever. In other words, you don’t have to bother with a more “normal” containerized deployment, where you write a Dockerfile and build something that runs unattended or exposes an API.
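Once you have SSH'd into a rented container, a quick sanity check confirms that the GPU and PyTorch are actually usable before you start paying for a long run. A minimal sketch, assuming `nvidia-smi` is on the PATH and PyTorch is installed in the image; it degrades gracefully when either is missing. The `gpu_report` name is made up for illustration.

```python
import shutil
import subprocess


def gpu_report():
    """Collect what the driver and PyTorch each say about the available GPUs."""
    report = {"nvidia_smi": None, "torch": None, "cuda_available": False, "devices": []}

    # Ask the driver directly, if nvidia-smi is installed.
    smi = shutil.which("nvidia-smi")
    if smi:
        try:
            result = subprocess.run(
                [smi, "--query-gpu=name,memory.total", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=30,
            )
            if result.returncode == 0:
                report["nvidia_smi"] = result.stdout.strip().splitlines()
        except (OSError, subprocess.TimeoutExpired):
            pass

    # Then check that PyTorch can see the same devices.
    try:
        import torch
    except ImportError:
        return report
    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        report["devices"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `nvidia_smi` lists a GPU but `cuda_available` is false, the usual suspect is a mismatch between the image's CUDA build of PyTorch and the host driver.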
Full disclosure: my recent experiments are all testing different setups for inference with continuous batching; I’m personally not doing training or fine-tuning. But as far as I can tell, RunPod would be equally applicable to training and fine-tuning tasks.