Are there any tricks to speed up 13B models on a 3090?
Currently using the regular Hugging Face model, quantized to 8-bit by a GPTQ-capable fork of KoboldAI.
Especially once the context window fills up and the prompt has to be re-evaluated, generation is pretty slow and far from even remotely real time.
Just run it on TGI or vLLM. You get FlashAttention and continuous batching, so parallel requests are handled efficiently instead of queuing.
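For reference, here's a minimal sketch of serving a 13B GPTQ model with vLLM's Python API. The model repo name is a placeholder (swap in whatever checkpoint you actually use), and the memory fraction is an assumption tuned for a 24 GB 3090, not a recommendation from the thread:

```python
# Minimal vLLM sketch: pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-GPTQ",  # placeholder repo; use your own 13B GPTQ checkpoint
    quantization="gptq",                # load the quantized weights so the model fits in 24 GB
    gpu_memory_utilization=0.90,        # assumed value; leaves a little headroom on the 3090
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches these prompts together internally (continuous batching),
# so throughput scales with concurrent requests.
outputs = llm.generate(
    ["Explain continuous batching in one paragraph."],
    params,
)
print(outputs[0].outputs[0].text)
```

If you'd rather keep your existing frontend, vLLM also ships an OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server --model <repo> --quantization gptq`) that you can point clients at instead of using the Python API directly.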