JustOneAvailableName • r/LocalLLaMA • NVidia H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
1 year ago
For people like us who don't share the GPU, it doesn't make much sense outside of rare cases.
Rare cases like: multiple agents talking to each other, quickly parsing a knowledge base, or sampling methods such as tree of thought, plain old beam search, or running multiple prompts at once.
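To see why throughput matters for those sampling methods, here's a minimal beam-search sketch. The scorer is a hypothetical toy function standing in for an LLM forward pass; the point is that every beam expands over many candidate tokens per step, and all of those expansions can be batched into one model call, which is exactly where 12k tokens/sec hardware pays off.

```python
# Minimal beam-search sketch over a toy next-token scorer.
# toy_log_probs is a made-up stand-in for a real model call.
import math

VOCAB = ["a", "b", "c"]

def toy_log_probs(seq):
    # Hypothetical scorer: penalizes repeating the last token.
    # A real run would batch these scores from the LLM.
    scores = {t: 1.0 for t in VOCAB}
    if seq:
        scores[seq[-1]] = 0.1  # discourage repetition
    total = sum(scores.values())
    return {t: math.log(s / total) for t, s in scores.items()}

def beam_search(beam_width=2, steps=4):
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            # Every beam expands over the whole vocab; on a fast GPU
            # all these expansions go through the model in one batch.
            for tok, lp in toy_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the top-k candidates by cumulative score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search():
    print(" ".join(seq), round(score, 3))
```

Tree of thought works the same way, just with whole reasoning branches instead of single tokens, so the batch sizes get even bigger.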
I don't want to spend that amount of money, but I definitely want to play on one for a few months.
I don't even think PCIe lanes really matter when you're not training.