Macs with 32GB of memory can run 70B models with the GPU.

fallingdowndizzyvr@alien.top · 3 years ago

Macs with 32GB of memory can run 70B models with the GPU.

Aaaaaaaaaeeeee@alien.top · 3 years ago

Bandwidth utilization on the GPU isn't great yet: it's only about a third of the potential 400GB/s.

CPU RAM bandwidth utilization in llama.cpp, on the other hand, is nearly 100%. With my 32GB of DDR4, I get 1.5 t/s with the 70B Q3_K_S model.
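For anyone who wants to check these numbers, here is a rough back-of-the-envelope sketch. It assumes decoding is purely memory-bandwidth bound (every generated token streams all the weights once) and that Q3_K_S averages roughly 3.5 bits per weight. Both are assumptions, not measurements, and the function names are made up for illustration. The DDR4 figure is also an assumption (dual-channel DDR4-3200), since the post doesn't say which memory is in use.

```python
def quantized_model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes for a quantized model."""
    return n_params * bits_per_weight / 8


def bandwidth_bound_tokens_per_second(
    model_bytes: float, bandwidth_bytes_per_s: float, utilization: float = 1.0
) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_bytes_per_s * utilization / model_bytes


# Assumption: ~3.5 bits per weight for Q3_K_S on a 70B-parameter model.
model_bytes = quantized_model_bytes(70e9, 3.5)

# GPU case from the thread: 400GB/s peak, about one third actually used.
gpu_tps = bandwidth_bound_tokens_per_second(model_bytes, 400e9, utilization=1 / 3)

# CPU case. Assumption: dual-channel DDR4-3200, about 51.2GB/s peak.
cpu_tps = bandwidth_bound_tokens_per_second(model_bytes, 51.2e9)

print(f"model size: {model_bytes / 1e9:.1f} GB")
print(f"GPU bound at 1/3 utilization: {gpu_tps:.2f} t/s")
print(f"CPU bound at full utilization: {cpu_tps:.2f} t/s")
```

Under those assumptions the CPU ceiling lands a little above the 1.5 t/s reported, which fits the "nearly 100%" claim. A different memory speed or quant size shifts the result proportionally.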

There will hopefully be more optimizations to speed this up.

fallingdowndizzyvr@alien.top · 3 years ago

I can’t wait for UltraFastBERT. If it delivers on its promise, it’s a game changer that will propel CPU inference to the front of the pack: up to a 78x speedup for 7B models. The speedup decreases as the number of layers increases, but I’m hoping it’ll still be pretty significant at 70B.

DavidSJ@alien.top · 3 years ago

There will hopefully be more optimizations to speed this up.

Speculative, Jacobi, or lookahead decoding could speed things up quite a bit.
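To give a feel for how much speculative decoding could help, here is a small sketch of the usual expected-tokens estimate. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and it ignores the cost of running the draft model. The function name is hypothetical, and the acceptance rates are illustrative inputs rather than measured values.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per pass of the large model.

    Assumes each of the `draft_len` drafted tokens is accepted independently
    with probability `acceptance_rate`; the large model always contributes
    one token itself, so the result is between 1 and draft_len + 1.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


# Illustrative inputs only; real acceptance rates depend on the model pair.
for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_target_pass(rate, draft_len=4)
    print(f"acceptance {rate:.1f}, 4 drafted tokens: {tokens:.2f} tokens per pass")
```

Since a bandwidth-bound 70B model pays roughly the same cost per pass whether it verifies one token or several, tokens per pass is a rough proxy for the speedup before draft overhead.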