A 70B model at 4.x bits per weight, trained with 16k context, fits with room to spare in exllamav2. If you can also add a 3090 or 4090, you can run a 6-bit 70B at 32k context. That's my standard inference setup and it covers a lot of ground.
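For anyone who wants to sanity-check whether a setup like this fits, here is a rough back-of-the-envelope sketch. It assumes a Llama-2-70B-style shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128), an FP16 KV cache, and a round 70e9 parameter count. The 4.65 bpw figure is just an illustrative pick for "4.x". It ignores activations, framework overhead, and fragmentation, so treat the numbers as a lower bound rather than what exllamav2 will actually report.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size in bytes: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def estimate_gib(n_params, bits_per_weight, context_len,
                 n_layers=80, n_kv_heads=8, head_dim=128):
    """Weights plus KV cache in GiB, using the assumed 70B-style shape."""
    total = weight_bytes(n_params, bits_per_weight) + kv_cache_bytes(
        n_layers, n_kv_heads, head_dim, context_len
    )
    return total / 2**30


# Illustrative configs only (assumed values, not measurements).
for label, bpw, ctx in [("4.65 bpw, 16k", 4.65, 16384), ("6.0 bpw, 32k", 6.0, 32768)]:
    print(f"{label}: ~{estimate_gib(70e9, bpw, ctx):.1f} GiB before overhead")
```

Compare the result against your total VRAM and leave a few GB of headroom for everything the estimate leaves out.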
Do you find any repetition problems at longer context lengths (closer to 4K)?