• 0 Posts
  • 10 Comments
Joined 1 year ago
Cake day: October 25th, 2023


  • some of my results:

    System:

    2× Xeon E5-2680 v4, 28 cores total, 56 threads (HT), 128 GB RAM

    RTX 2060 6 GB via PCIe 3.0 x16

    RTX 4060 Ti 16 GB via PCIe 4.0 x8

    Windows 11 Pro

    OpenHermes-2.5-AshhLimaRP-Mistral-7B (llama.cpp in text generation UI):

    Q4_K_M, RTX 2060 6 GB, all 35 layers offloaded, 8k context - approx 3 t/s

    Q5_K_M, RTX 4060 Ti 16 GB, all 35 layers offloaded, 32k context - approx 25 t/s

    Q5_K_M, CPU-only, 8 threads, 32k context - approx 2.5-3.5 t/s

    Q5_K_M, CPU-only, 16 threads, 32k context - approx 3-3.5 t/s

    Q5_K_M, CPU-only, 32 threads, 32k context - approx 3-3.6 t/s

    euryale-1.3-l2-70b (llama.cpp in text generation UI)

    Q4_K_M, RTX 2060 + RTX 4060 Ti, 35 layers offloaded, 4k context - 0.6-0.8 t/s

    goliath-120b (llama.cpp in text generation UI)

    Q2_K, CPU-only, 32 threads - 0.4-0.5 t/s

    Q2_K, CPU-only, 8 threads - 0.25-0.3 t/s

    Noromaid-20b-v0.1.1 (llama.cpp in text generation UI)

    Q5_K_M, RTX 2060 + RTX 4060 Ti, 65 layers offloaded, 4k context - approx 5 t/s

    Noromaid-20b-v0.1.1 (exllamav2 in text generation UI)

    3bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context - approx 15 t/s (looks like it fits in the 4060)

    6bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context, no flash attention, GPU split 12,6 - approx 10 t/s

    Observations:

    - the number of cores matters very little in CPU-only mode

    - NUMA does matter (I have 2 CPU sockets)

    I would say: try to get an additional card? (A rough sketch of how runs like these can be timed follows below.)
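
    For reference, here is a minimal sketch of how a single run like the ones above could be timed outside text generation UI, using the llama-cpp-python bindings. This harness is my own assumption (the numbers above came from text generation UI itself), and the model path, prompt, and parameter values are placeholders:

    import time
    from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build for GPU offload)

    # Placeholder GGUF path; any quant (Q4_K_M, Q5_K_M, ...) is timed the same way.
    MODEL_PATH = "OpenHermes-2.5-AshhLimaRP-Mistral-7B.Q5_K_M.gguf"

    llm = Llama(
        model_path=MODEL_PATH,
        n_gpu_layers=35,   # 0 = CPU-only; 35 offloads every layer of a 7B Mistral
        n_ctx=8192,        # context window; 32k needs noticeably more VRAM for the KV cache
        n_threads=16,      # matters only for layers left on the CPU
    )

    prompt = "Write a short story about a lighthouse keeper."
    start = time.perf_counter()
    out = llm(prompt, max_tokens=256)
    elapsed = time.perf_counter() - start

    generated = out["usage"]["completion_tokens"]
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} t/s")

    Note that this times prompt processing plus generation together, so it will come out a bit lower than pure generation t/s; for the CPU-only comparisons, set n_gpu_layers=0 and vary n_threads.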


  • I would be interested in using such a thing (especially if it's possible to pass custom options to llama.cpp and ask for custom models to be loaded).

    Would it be possible to do something like this:

    I provide a list of models: OpenHermes-2.5-Mistral-7B, Toppy-7B, OpenHermes-2.5-AshhLimaRP-Mistral-7B, Noromaid-v0.1.1-20B, Noromaid-v1.1-13B

    The tool downloads every model from HF in every quantization, runs the tests, and provides a table with the results (including failed runs) - see the sketch below.
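
    Just to illustrate the idea, a rough sketch of such a tool built on huggingface_hub and llama-cpp-python. Both libraries and the repo names are my assumption, and the benchmark body is a placeholder; a real tool would also need to handle split GGUF files and free memory between runs:

    import time
    from huggingface_hub import hf_hub_download, list_repo_files
    from llama_cpp import Llama

    # Placeholder repos; in practice this would come from the user's model list.
    REPOS = [
        "TheBloke/OpenHermes-2.5-Mistral-7B-GGUF",
        "TheBloke/Noromaid-20B-v0.1.1-GGUF",
    ]

    results = []
    for repo in REPOS:
        # every .gguf file in a repo is one quantization to test
        for filename in list_repo_files(repo):
            if not filename.endswith(".gguf"):
                continue
            try:
                path = hf_hub_download(repo_id=repo, filename=filename)
                llm = Llama(model_path=path, n_gpu_layers=35, n_ctx=4096)
                start = time.perf_counter()
                out = llm("Benchmark prompt.", max_tokens=128)
                tps = out["usage"]["completion_tokens"] / (time.perf_counter() - start)
                results.append((repo, filename, f"{tps:.2f} t/s"))
            except Exception as exc:  # failed runs go into the table too
                results.append((repo, filename, f"FAILED: {exc}"))

    for row in results:
        print(" | ".join(row))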


  • Just my thoughts on this:

    Would be great.

    Would be rather limited but possible (thanks to https://llm.mlc.ai/ and increasing memory).

    A lot of CHEAP Chinese devices will say they can actually do it. They will, at 2-bit quantization and <1 t/s, with 7B models or even smaller. They will be unusable.

    Google says it's not necessary because you can use their Firebase services for AI and you can use NNAPI anyway. You must also censor your LLM-using apps in the Play Store to adhere to their rules.

    Apple says it's not necessary; later they will advertise it as a very good thing and provide optimized libraries and some pretrained models, but you need to buy the latest iPhone (last year's won't work, because Apple). You must also censor your apps AND mark them as 18+.

    Areas of usage?

    - Language translation (including voice-to-voice). Basically a much-improved Google Translate.

    - AI Assistant (basically a MUCH improved Siri, used not only as a command interface).