I have a 4090 at work, and quantized 34B models barely fit in its 24GB of VRAM; I get around 20 tokens per second of output. My personal machine has a laptop 3080 Ti with 16GB of VRAM, which can't handle anything bigger than 13B models, but I still get about 20 tokens per second from it. Note that these figures are for quantizations optimized for speed, so depending on the model you're running it may be slower.
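As a rough back-of-envelope sketch (my own estimate, not an exact calculation), you can see why those sizes line up with those cards by estimating weight memory from parameter count and bits per weight, plus some slack for KV cache and runtime overhead:

```python
# Rough VRAM estimate for a quantized model: weight memory plus a flat
# allowance for KV cache / runtime overhead. Approximation only.

def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 2.0) -> float:
    """Back-of-envelope estimate of VRAM needed to run a quantized model."""
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1024**3
    return weight_gb + overhead_gb

# A 34B model at ~4.5 bits/weight comes out near 20 GB, which is why it
# only barely fits on a 24 GB card once context is accounted for.
print(f"34B @ 4.5 bpw: ~{estimate_vram_gb(34, 4.5):.1f} GB")

# A 13B model at ~4.5 bits/weight is roughly 9 GB, comfortably under 16 GB.
print(f"13B @ 4.5 bpw: ~{estimate_vram_gb(13, 4.5):.1f} GB")
```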