Herr_Drosselmeyer@alien.top · 3 years ago

Maybe anecdotal but I have very high hopes for Yi 34b finetunes.

ParanoidMarvin42@alien.top · 3 years ago

Do you know how to estimate how much memory the context will need?

andrewlapp@alien.top · 3 years ago

34B Model Memory Requirements (inference)

Sequence length (tokens) vs. bit precision
Seq len |      4-bit |      6-bit |      8-bit |     16-bit
-----------------------------------------------------------
    512 |     15.9GB |     23.8GB |     31.8GB |     63.6GB
   1024 |     16.0GB |     23.9GB |     31.9GB |     63.8GB
   2048 |     16.1GB |     24.1GB |     32.2GB |     64.3GB
   4096 |     16.3GB |     24.5GB |     32.7GB |     65.3GB
   8192 |     16.8GB |     25.2GB |     33.7GB |     67.3GB
  16384 |     17.8GB |     26.7GB |     35.7GB |     71.3GB
  32768 |     19.8GB |     29.7GB |     39.7GB |     79.3GB
  65536 |     23.8GB |     35.7GB |     47.7GB |     95.3GB
 131072 |     31.8GB |     47.7GB |     63.7GB |    127.3GB
 262144 |     47.8GB |     71.7GB |     95.7GB |    191.3GB
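
For a rough sense of where the baseline of each column comes from, here is a minimal sketch that computes the memory taken by the weights alone. It assumes exactly 34e9 parameters and reports GiB (2**30 bytes); the function name is made up for illustration, and the table may well have been produced with a different method. Under these assumptions the results land slightly below the 512-token row, with the remainder growing with sequence length.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone, in GiB (2**30 bytes).

    Ignores the KV cache, activations, and any framework overhead.
    """
    return n_params * bits_per_weight / 8 / 2**30


# Assumption: a round 34e9 parameters for a "34B" model.
for bits in (4, 6, 8, 16):
    print(f"{bits:>2}-bit: {weight_memory_gib(34e9, bits):.1f} GiB")
```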

Herr_Drosselmeyer@alien.top · 3 years ago

With this particular model, I can crank the context up to 32k if I enable "Use 8-bit cache to save VRAM", and that’s as high as it can go in the Oobabooga WebUI.

DedyLLlka_GROM@alien.top · 3 years ago

You can change it yourself, although for now the edit has to be redone after every update. Just put something like 200000 in these two places:

https://github.com/oobabooga/text-generation-webui/blob/454fcf39a95691f5e375c48fbc6fe6aa96f0c738/modules/shared.py#L46

https://github.com/oobabooga/text-generation-webui/blob/454fcf39a95691f5e375c48fbc6fe6aa96f0c738/modules/ui_model_menu.py#L100

waxbolt@alien.top · 3 years ago

32k seems to be hard-coded in oobabooga. At least, it is for the maximum truncation length. It would take a patch to fix it.

Herr_Drosselmeyer@alien.top · 3 years ago

I know, but it’s already slowing down quite a bit at 32k, so I don’t think it’s worth pushing it further. But hey, even at just 16k it’s four times what we usually get, so I’m not complaining.

FullOf_Bad_Ideas@alien.top · 3 years ago

Here’s the formula for the KV cache size in bytes:

batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per Float16) * n_kv_heads

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
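
The formula above can be sketched in Python as follows. The function name is made up for illustration, and the Yi 34B shape values (60 layers, d_model 7168, 56 attention heads, 8 KV heads) are assumptions that should be checked against the model's config.json before relying on the numbers.

```python
def kv_cache_bytes(batch_size: int, seqlen: int, d_model: int, n_heads: int,
                   n_layers: int, n_kv_heads: int, bytes_per_value: int = 2) -> int:
    """KV cache size in bytes, following the formula quoted above.

    bytes_per_value is 2 for Float16; use 1 to approximate an 8-bit cache.
    """
    head_dim = d_model // n_heads  # size of one attention head
    # Factor of 2 accounts for storing both K and V.
    return batch_size * seqlen * head_dim * n_layers * 2 * bytes_per_value * n_kv_heads


# Assumed Yi 34B shape; verify against the model's config.json.
YI_34B = dict(d_model=7168, n_heads=56, n_layers=60, n_kv_heads=8)

for ctx in (4096, 16384, 32768):
    size = kv_cache_bytes(batch_size=1, seqlen=ctx, **YI_34B)
    print(f"{ctx:>6} tokens: {size / 2**30:.2f} GiB at Float16")
```

Under these assumptions the cache costs 245,760 bytes per token, so a 32k context takes about 7.5 GiB at Float16, and an 8-bit cache would halve that.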