Herr_Drosselmeyer@alien.top · 3 years ago

Maybe anecdotal but I have very high hopes for Yi 34b finetunes.

Herr_Drosselmeyer@alien.top · 3 years ago

The base Yi can handle 200k. The version I used can do 48k (though I only tested 16k so far). Larger context size requires more VRAM.

The size that TheBloke lists for a GGUF is the minimum size at 0 context. As context increases, VRAM use increases.

bullerwins@alien.top · 3 years ago

Thanks a lot! I was not sure how context affected VRAM usage. So each model has a maximum context size, and using more of it takes more VRAM. Thanks!

mcmoose1900@alien.top · 3 years ago

Another thing to note is that the exllamav2 backend is “special” because its context takes up less VRAM than the context in other backends. So let’s say the weights take 18GB, and your context takes up 6GB for a GGUF model. In exllamav2 the context only takes up 3GB with the 8-bit cache.

There are other complications like the prompt processing batch size, but that’s the gist of it.

This makes a dramatic difference when the context gets huge. I’d prefer to use koboldcpp myself, but I just can’t really squeeze it on my 3090 without excessive offloading.
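
To put rough numbers on the 18GB + 6GB example above, here is a minimal sketch. It assumes total VRAM is just weights plus context cache, and that an 8-bit cache is half the size of a 16-bit one. The function name is made up for illustration, and real backends add overhead (prompt processing buffers and so on) that this ignores.

```python
def total_vram_gb(weights_gb: float, fp16_cache_gb: float, cache_bits: int = 16) -> float:
    """Rough VRAM estimate: weights plus a context cache scaled by its precision.

    Assumption: cache size scales linearly with bits per element, so an
    8-bit cache is half the size of a 16-bit one. All other overhead is ignored.
    """
    return weights_gb + fp16_cache_gb * cache_bits / 16


# Numbers from the example above: 18 GB of weights, 6 GB of 16-bit cache.
fp16_total = total_vram_gb(18, 6)
int8_total = total_vram_gb(18, 6, cache_bits=8)

print(f"16-bit cache: {fp16_total:.1f} GB")
print(f" 8-bit cache: {int8_total:.1f} GB")
```

On a 24GB card the first case leaves no headroom at all, while the second frees up about 3GB.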

frozen_tuna@alien.top · 3 years ago

Very good to know! I haven’t fiddled with the new yi models too much yet since I was running into these exact issues. I’ll definitely use this solution soon, thanks.

bullerwins@alien.top · 3 years ago

Interesting! For some reason I had more success with GGUF models, as those work everywhere using koboldcpp and ooba’s. I didn’t know that exllamav2 was better for context. I will try it. That backend is for EXL2 formats, right? I had the impression it was better for speed; I didn’t know its context takes up less VRAM.

ParanoidMarvin42@alien.top · 3 years ago

Do you know how to estimate how much memory the context will need?

Herr_Drosselmeyer@alien.top · 3 years ago

With this particular model, I can crank it up to 32k if I enable "Use 8-bit cache to save VRAM" and that’s as high as it can go in Oobabooga WebUI.

waxbolt@alien.top · 3 years ago

32k seems to be hard-coded in oobabooga, at least for the maximum truncation length. It would take a patch to fix that.

Herr_Drosselmeyer@alien.top · 3 years ago

I know but it’s slowing down quite a bit at 32k already so I don’t think it’s worth pushing it further. But hey, even at just 16k it’s four times what we usually get, so I’m not complaining.

DedyLLlka_GROM@alien.top · 3 years ago

You can change it yourself, although for now the edit has to be reapplied after every update. Just put something like 200000 in these 2 places:

https://github.com/oobabooga/text-generation-webui/blob/454fcf39a95691f5e375c48fbc6fe6aa96f0c738/modules/shared.py#L46

https://github.com/oobabooga/text-generation-webui/blob/454fcf39a95691f5e375c48fbc6fe6aa96f0c738/modules/ui_model_menu.py#L100

andrewlapp@alien.top · 3 years ago

34B Model Memory Requirements (inference)

Sequence Length (SL) vs Bit Precision (BP)
 SL / BP |     4      |     6      |     8      |     16
------------------------------------------------------------
     512 |     15.9GB |     23.8GB |     31.8GB |     63.6GB
    1024 |     16.0GB |     23.9GB |     31.9GB |     63.8GB
    2048 |     16.1GB |     24.1GB |     32.2GB |     64.3GB
    4096 |     16.3GB |     24.5GB |     32.7GB |     65.3GB
    8192 |     16.8GB |     25.2GB |     33.7GB |     67.3GB
   16384 |     17.8GB |     26.7GB |     35.7GB |     71.3GB
   32768 |     19.8GB |     29.7GB |     39.7GB |     79.3GB
   65536 |     23.8GB |     35.7GB |     47.7GB |     95.3GB
  131072 |     31.8GB |     47.7GB |     63.7GB |    127.3GB
  262144 |     47.8GB |     71.7GB |     95.7GB |    191.3GB
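
For context sizes that fall between rows (the 48k mentioned earlier, for example), one option is to interpolate linearly between the two nearest rows. This is only a sketch: the 4-bit column is copied from the table above, the helper name is made up, and the result is no more accurate than the table itself.

```python
# 4-bit column of the table above: (sequence length, GB).
TABLE_4BIT = [
    (512, 15.9), (1024, 16.0), (2048, 16.1), (4096, 16.3), (8192, 16.8),
    (16384, 17.8), (32768, 19.8), (65536, 23.8), (131072, 31.8), (262144, 47.8),
]


def estimate_gb(seq_len: int, table=TABLE_4BIT) -> float:
    """Linearly interpolate memory use between the two nearest table rows."""
    if not table[0][0] <= seq_len <= table[-1][0]:
        raise ValueError("seq_len is outside the range covered by the table")
    for (x0, y0), (x1, y1) in zip(table, table[1:]):
        if x0 <= seq_len <= x1:
            return y0 + (y1 - y0) * (seq_len - x0) / (x1 - x0)


# 48k sits halfway between the 32768 and 65536 rows.
print(f"{estimate_gb(48 * 1024):.1f} GB")
```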

FullOf_Bad_Ideas@alien.top · 3 years ago

Here’s the formula for the KV cache size in bytes:

batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per Float16) * n_kv_heads

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
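
As a rough illustration, the formula above can be written as a small function. The Yi-34B shape used here (60 layers, hidden size 7168, 56 attention heads, 8 KV heads) is an assumption, so check it against the model’s config.json before relying on the numbers. The `bytes_per_value` parameter is added for illustration, to compare a 16-bit cache with an 8-bit one.

```python
def kv_cache_bytes(seq_len, n_layers, d_model, n_heads, n_kv_heads,
                   batch_size=1, bytes_per_value=2):
    """KV cache size in bytes, following the formula above.

    bytes_per_value is 2 for Float16; 1 approximates an 8-bit cache.
    """
    head_dim = d_model // n_heads
    # The factor of 2 accounts for storing both K and V.
    return (batch_size * seq_len * head_dim * n_layers
            * 2 * bytes_per_value * n_kv_heads)


# Assumed Yi-34B shape; verify against the model's config.json.
YI_34B = dict(n_layers=60, d_model=7168, n_heads=56, n_kv_heads=8)

for ctx in (16384, 32768):
    fp16 = kv_cache_bytes(ctx, **YI_34B)
    int8 = kv_cache_bytes(ctx, bytes_per_value=1, **YI_34B)
    print(f"{ctx:>6} tokens: {fp16 / 2**30:.2f} GiB (16-bit), "
          f"{int8 / 2**30:.2f} GiB (8-bit)")
```

This only covers the cache. Weights and any working buffers the backend allocates come on top.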