What is the speed of your RAM: DDR4-2666, DDR4-3200, or DDR5-4800? And how is it installed on the motherboard: 4x128GB (quad channel) or 8x64GB (octa channel)?
RAM speed matters most for LLM token generation, since memory bandwidth is usually the bottleneck. With an Epyc 7003 CPU with 32 or more cores in octa channel (DDR4-3200), you can expect 3 to 4 tokens/s on a 70B model, equivalent to VRAM with about 200GB/s of bandwidth.
For OP, a Genoa CPU with 48 or more cores in 12-channel DDR5-4800 can expect up to 6 to 8 tokens/s (70B), equivalent to VRAM with about 400GB/s of bandwidth.
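To show where bandwidth numbers like these come from, here is a rough back-of-the-envelope sketch. It assumes a 64-bit (8-byte) bus per memory channel, and that generating one token reads every model weight once, so bandwidth divided by model size gives a ceiling on tokens/s. The 4.5 bits per weight for a quantized 70B model is my own illustrative assumption, and the function names are made up for this example.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s.

    Assumes each channel moves bus_bytes (8 bytes = 64 bits) per transfer.
    """
    return mega_transfers_per_s * bus_bytes * channels / 1000


def model_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights in GB."""
    return params_billion * bits_per_weight / 8


def tokens_per_s_ceiling(bandwidth_gbs, size_gb):
    """Upper bound on generation speed if every weight is read once per token."""
    return bandwidth_gbs / size_gb


if __name__ == "__main__":
    # Assumption: 70B model at roughly 4.5 bits per weight (a Q4-style quant).
    size = model_size_gb(70, 4.5)

    for name, speed, channels in [
        ("Epyc 7003, 8 x DDR4-3200", 3200, 8),
        ("Genoa, 12 x DDR5-4800", 4800, 12),
    ]:
        bw = peak_bandwidth_gbs(speed, channels)
        ceiling = tokens_per_s_ceiling(bw, size)
        print(f"{name}: {bw:.1f} GB/s peak, at most {ceiling:.1f} tokens/s")
```

These are theoretical ceilings (about 205GB/s and 461GB/s). Real-world throughput lands below them, which is consistent with the lower tokens/s figures quoted above.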
An RTX 4090 is around 1000GB/s but has only 24GB of VRAM. GPUs are generally much faster at prompt processing than desktop CPUs (over 100 times faster), but I don’t know about modern server CPUs (Genoa). Normally they are faster at prompt processing than desktop CPUs, as they natively support FP16/BF16 operations.
But take this with a pinch of salt: I don’t have these configurations at hand, so you’ll have to ask someone who does.