• 0 Posts
  • 10 Comments
Joined 1 year ago
Cake day: October 25th, 2023


  • some of my results:

    System:

    2× Xeon E5-2680 v4, 28 cores total, 56 threads (HT), 128 GB RAM

    RTX 2060 6 GB via PCIe 3.0 x16

    RTX 4060 Ti 16 GB via PCIe 4.0 x8

    Windows 11 Pro

    OpenHermes-2.5-AshhLimaRP-Mistral-7B (llama.cpp in text generation UI):

    Q4_K_M, RTX 2060 6 GB, all 35 layers offloaded, 8k context - approx 3 t/s

    Q5_K_M, RTX 4060 Ti 16 GB, all 35 layers offloaded, 32k context - approx 25 t/s

    Q5_K_M, CPU-only, 8 threads, 32k context - approx 2.5-3.5 t/s

    Q5_K_M, CPU-only, 16 threads, 32k context - approx 3-3.5 t/s

    Q5_K_M, CPU-only, 32 threads, 32k context - approx 3-3.6 t/s

    euryale-1.3-l2-70b (llama.cpp in text generation UI)

    Q4_K_M, RTX 2060 + RTX 4060 Ti, 35 layers offloaded, 4k context - 0.6-0.8 t/s

    goliath-120b (llama.cpp in text generation UI)

    Q2_K, CPU-only, 32 threads - 0.4-0.5 t/s

    Q2_K, CPU-only, 8 threads - 0.25-0.3 t/s

    Noromaid-20b-v0.1.1 (llama.cpp in text generation UI)

    Q5_K_M, RTX 2060 + RTX 4060 Ti, 65 layers offloaded, 4k context - approx 5 t/s

    Noromaid-20b-v0.1.1 (exllamav2 in text generation UI)

    3bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context - approx 15 t/s (looks like it fits in the 4060)

    6bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context, no flash attention, GPU split 12,6 - approx 10 t/s

    Observations:

    - the number of cores matters very little in CPU-only mode

    - NUMA does matter (I have 2 CPU sockets)

    I would say: try to get an additional card? (A rough sketch of how runs like these can be timed follows below.)
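
    For reference, here is a minimal sketch of how a single run like the ones above could be timed outside text generation UI, using the llama-cpp-python bindings. This harness is my own assumption (the numbers above came from text generation UI itself), and the model path, prompt, and parameter values are placeholders:

    import time
    from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build for GPU offload)

    # Placeholder GGUF path; any quant (Q4_K_M, Q5_K_M, ...) is timed the same way.
    MODEL_PATH = "OpenHermes-2.5-AshhLimaRP-Mistral-7B.Q5_K_M.gguf"

    llm = Llama(
        model_path=MODEL_PATH,
        n_gpu_layers=35,   # 0 = CPU-only; 35 offloads every layer of a 7B Mistral
        n_ctx=8192,        # context window; 32k needs noticeably more VRAM for the KV cache
        n_threads=16,      # matters only for layers left on the CPU
    )

    prompt = "Write a short story about a lighthouse keeper."
    start = time.perf_counter()
    out = llm(prompt, max_tokens=256)
    elapsed = time.perf_counter() - start

    generated = out["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} t/s")

    Note that this times prompt processing plus generation together, so it will come out a bit lower than pure generation t/s; for the CPU-only comparisons, set n_gpu_layers=0 and vary n_threads.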


  • I would be interested in using such a thing (especially if it's possible to pass custom options to llama.cpp and ask for custom models to be loaded).

    Would it be possible to do something like this:

    I provide a list of models: OpenHermes-2.5-Mistral-7B, Toppy-7B, OpenHermes-2.5-AshhLimaRP-Mistral-7B, Noromaid-v0.1.1-20B, Noromaid-v1.1-13B

    The tool downloads every model from HF in every quantization, runs the tests, and provides a table with the results (including failed runs) - see the sketch below.
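
    Just to illustrate the idea, a rough sketch of such a tool built on huggingface_hub and llama-cpp-python. Both libraries and the repo names are my assumption, and the benchmark body is a placeholder; a real tool would also need to handle split GGUF files and free memory between runs:

    import time
    from huggingface_hub import hf_hub_download, list_repo_files
    from llama_cpp import Llama

    # Placeholder repos; in practice this would come from the user's model list.
    REPOS = [
        "TheBloke/OpenHermes-2.5-Mistral-7B-GGUF",
        "TheBloke/Noromaid-20B-v0.1.1-GGUF",
    ]

    results = []
    for repo in REPOS:
        # every .gguf file in a repo is one quantization to test
        for filename in list_repo_files(repo):
            if not filename.endswith(".gguf"):
                continue
            try:
                path = hf_hub_download(repo_id=repo, filename=filename)
                llm = Llama(model_path=path, n_gpu_layers=35, n_ctx=4096)
                start = time.perf_counter()
                out = llm("Benchmark prompt.", max_tokens=128)
                tps = out["usage"]["completion_tokens"] / (time.perf_counter() - start)
                results.append((repo, filename, f"{tps:.2f} t/s"))
            except Exception as exc:  # failed runs go into the table too
                results.append((repo, filename, f"FAILED: {exc}"))

    for row in results:
        print(" | ".join(row))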


  • Just my thoughts on this:

    Would be great.

    Would be rather limited but possible (thanks to https://llm.mlc.ai/ and increasing memory).

    A lot of CHEAP Chinese devices will say they can actually do it. They will, at 2-bit quantization and <1 t/s, with 7B models or even smaller. They will be unusable.

    Google says it's not necessary because you can use their Firebase services for AI and you can use NNAPI anyway. You must also censor your LLM-using apps in the Play Store to adhere to their rules.

    Apple says it's not necessary; later they will advertise it as a very good thing and provide optimized libraries and some pretrained models, but you need to buy the latest iPhone (last year's won't work, because Apple). You must also censor your apps AND mark them as 18+.

    Areas of usage?

    - Language translation (including voice-to-voice). Basically a much-improved Google Translate.

    - AI Assistant (basically a MUCH improved Siri, used not only as a command interface).