Hi. I’m currently running a 3060 12GB | R7 2700X | 32GB 3200 | Windows 10 with the latest NVIDIA drivers (VRAM-to-RAM overflow disabled). Loading a 20B Q4_K_M model (50/65 layers offloaded seems to be the fastest from my tests), I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.

Are these values expected for my setup, or is there something I can do to improve speeds without changing the model?

It’s pretty much unusable in this state, and since it’s hard to find information on this topic, I figured I’d ask here.

EDIT: I’m running the model on the latest version of text-generation-webui.
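For anyone who wants to reproduce this kind of measurement outside the webui, here is a rough llama-cpp-python sketch that sweeps the offload count (the model path, layer counts, and prompt are placeholders, and it assumes llama-cpp-python is installed with CUDA support):

```python
# Rough offload sweep with llama-cpp-python (assumed installed with CUDA support).
# MODEL_PATH, the layer counts, and the prompt are placeholders -- adjust for your GGUF.
import time
from llama_cpp import Llama

MODEL_PATH = "models/20b-q4_k_m.gguf"  # placeholder path

for n_gpu_layers in (40, 45, 50, 55):
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=n_gpu_layers,
                n_ctx=4096, n_threads=8, verbose=False)
    t0 = time.time()
    out = llm("Write two sentences about GPUs.", max_tokens=128)
    tps = out["usage"]["completion_tokens"] / (time.time() - t0)
    print(f"{n_gpu_layers} layers offloaded: {tps:.2f} t/s")
    del llm  # free VRAM before the next run
```

In practice the best setting is usually the largest layer count that still leaves some VRAM headroom for the KV cache.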

  • vikarti_anatra@alien.topB

Some of my results:

    System:

2x Xeon E5-2680 v4, 28 cores total (56 threads with HT), 128 GB RAM

RTX 2060 6 GB via PCIe 3.0 x16

RTX 4060 Ti 16 GB via PCIe 4.0 x8

    Windows 11 Pro

    OpenHermes-2.5-AshhLimaRP-Mistral-7B (llama.cpp in text generation UI):

Q4_K_M, RTX 2060 6 GB, all 35 layers offloaded, 8k context - approx. 3 t/s

Q5_K_M, RTX 4060 Ti 16 GB, all 35 layers offloaded, 32k context - approx. 25 t/s

Q5_K_M, CPU-only, 8 threads, 32k context - approx. 2.5-3.5 t/s

Q5_K_M, CPU-only, 16 threads, 32k context - approx. 3-3.5 t/s

Q5_K_M, CPU-only, 32 threads, 32k context - approx. 3-3.6 t/s

    euryale-1.3-l2-70b (llama.cpp in text generation UI)

Q4_K_M, RTX 2060 + RTX 4060 Ti, 35 layers offloaded, 4k context - 0.6-0.8 t/s

    goliath-120 (llama.cpp in text generation UI)

Q2_K, CPU-only, 32 threads - 0.4-0.5 t/s

Q2_K, CPU-only, 8 threads - 0.25-0.3 t/s

    Noromaid-20b-v0.1.1 (llama.cpp in text generation UI)

Q5_K_M, RTX 2060 + RTX 4060 Ti, 65 layers offloaded, 4k context - approx. 5 t/s

    Noromaid-20b-v0.1.1 (exllamav2 in text generation UI)

3bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context - approx. 15 t/s (looks like it fits on the 4060)

6bpw-h8-exl2, RTX 2060 + RTX 4060 Ti, 8-bit cache, 4k context, no flash attention, GPU split 12,6 - approx. 10 t/s

    Observations:

- the number of cores in CPU-only mode matters very little (rough sketch below)

- NUMA does matter (I have 2 CPU sockets)

I would say: try to get an additional card?
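For anyone wanting to reproduce the thread-count comparison, here is a minimal llama-cpp-python sketch (the model path and prompt are placeholders; the numa keyword is an assumption about newer llama-cpp-python builds, so check your version):

```python
# CPU-only thread sweep with llama-cpp-python; n_gpu_layers=0 keeps everything in RAM.
# MODEL_PATH is a placeholder; numa=True is an assumption about the installed version.
import time
from llama_cpp import Llama

MODEL_PATH = "models/openhermes-2.5-7b.Q5_K_M.gguf"  # placeholder path

for n_threads in (8, 16, 32):
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=0, n_ctx=4096,
                n_threads=n_threads, numa=True, verbose=False)
    t0 = time.time()
    out = llm("Explain NUMA in one short paragraph.", max_tokens=128)
    print(f"{n_threads} threads: {out['usage']['completion_tokens'] / (time.time() - t0):.2f} t/s")
    del llm
```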

  • multiverse_fan@alien.topB

I have an older 6GB 1660 and get about 0.3 t/s on a Q2 quant of Goliath 120B. Comparatively, your setup with a 20B model should be faster than that, but I’m sure I’m missing something; with offloading, I guess the CPU plays a role as well. How many cores ya got?

  • marblemunkey@alien.topB

It’s been a couple of months since I used less-than-complete GPU offloading, but when I was using my Alienware laptop (8th-gen i7, 2060 6GB) to run 13B models with 13/25 layers offloaded, I was getting 1-2 t/s, so yours sounds low.

  • longtimegoneMTGO@alien.topB

I have a 3080 12GB and can run a 20B Q4_K_M with about 50 layers offloaded and 8k context.

It starts off at just under 4 t/s, and once the context is filled it slows to just over 2 t/s.

It might be worth setting up a Linux partition to boot into for this; I was getting much slower speeds under Windows.

  • -Ellary-@alien.topB

R5 5500 (stock 3.6GHz) | 3060 12GB | 32GB 3600, Win10 v2004.
I’m using LM Studio for heavy models (34B Q4_K_M and 70B Q3_K_M GGUF).
On 70B I’m getting around 1-1.4 t/s depending on context size (4k max), offloading 25 layers to the GPU (trying not to exceed the 11GB VRAM mark; rough layer math below).
On 34B I’m getting around 2-2.5 t/s depending on context size (4k max), offloading 30 layers to the GPU with the same VRAM limit.
On 20B I was getting around 4-5 t/s, but I’m not a huge user of 20B right now.

So I can recommend LM Studio for models heavier than 13B; it works better for me.
Here is the generation speed for a 34B Yi Chat:

    https://preview.redd.it/h4d0lbm5u63c1.png?width=903&format=png&auto=webp&s=fdc161b136879d1c1de6ef065cb80f35f188e46f
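As a rough illustration of the “stay under 11GB” rule of thumb, here is a back-of-the-envelope layer-budget sketch; every number in it is an assumption, not a measured value:

```python
# Back-of-the-envelope estimate of how many layers fit in a VRAM budget.
# Every number here is an assumption -- real usage also depends on the quant,
# the context length (KV cache), and CUDA buffer overhead.
model_file_gb = 19.0    # e.g. size of a 34B Q4_K_M GGUF on disk (assumption)
total_layers = 60       # layer count reported by the loader (assumption)
vram_budget_gb = 11.0   # headroom target on a 12GB card
overhead_gb = 1.5       # KV cache + CUDA buffers, rough guess

per_layer_gb = model_file_gb / total_layers
layers_that_fit = int((vram_budget_gb - overhead_gb) / per_layer_gb)
print(f"~{per_layer_gb:.2f} GB per layer, roughly {layers_that_fit} layers fit in the budget")
```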

  • Desm0nt@alien.topB

Loading a 20B Q4_K_M model (50/65 layers offloaded seems to be the fastest from my tests), I currently get around 0.65 t/s with a low context size of 500 or less, and about 0.45 t/s nearing the max 4096 context.

Sounds suspicious. I use Yi-Chat-34B Q4_K_M on an old 1080 Ti (11 GB VRAM) with 20 layers offloaded and get around 2.5 t/s. But that’s on a Threadripper 2920 with 4-channel RAM (also 3200), though I don’t think that makes that much of a difference. Of course, with 4 channels I have 2x your RAM bandwidth, but I’m running a 34B and loading only 20 layers onto the GPU…
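A crude way to see how much the channel count can matter for the CPU-resident part of the model (all numbers below are rough assumptions, not measurements):

```python
# Rough upper bound on generation speed for the CPU-resident layers:
# each new token has to stream those weights from system RAM once.
# All numbers are assumptions for illustration, not measurements.
dual_channel_gbs = 2 * 8 * 3.2   # ~51 GB/s theoretical for 2-channel DDR4-3200
quad_channel_gbs = 4 * 8 * 3.2   # ~102 GB/s theoretical for 4-channel DDR4-3200
cpu_resident_gb = 6.0            # quantized weights left in RAM after offloading (assumption)

for label, bw in (("2-channel", dual_channel_gbs), ("4-channel", quad_channel_gbs)):
    print(f"{label}: at most ~{bw / cpu_resident_gb:.1f} t/s from RAM bandwidth alone")
```

Real speeds land well below these ceilings, but the ratio between the two configurations is roughly what the extra channels buy you.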