Herr_Drosselmeyer@alien.top · 2 years ago

Maybe anecdotal but I have very high hopes for Yi 34b finetunes.

Tacx79@alien.top · 2 years ago

I tried base yi-34-chat yesterday and it felt like the golden times of character.ai again. I imported my c.ai character card with 3-4k tokens, extended the context to 8k, and it’s just the right model for the job. It even followed the short hints about how the character should behave, unlike the original c.ai model. Sure, finetuning on RP chats could make it even better, but I don’t think I will move away from it in the near future.

No_Scarcity5387@alien.top · 2 years ago

Looks promising! Tried the GGUF model from TheBloke at 16k context but got some repetition and some /\/\//\ answers with the original template. Which templates are you guys using?

SomeOddCodeGuy@alien.top · 2 years ago

TheBloke just quantized his newest version of this model. I’m downloading it right now =D

But I’m with you: Capybara-Tess-Yi is amazing. I don’t RP so I can’t speak to that, but for a conversational model that does basic ChatGPT tasks? It’s amazing.

rkzed@alien.top · 2 years ago

I had great fun with this specific model. Tried up to 32K context length with very minimal repetition problems…

AutomataManifold@alien.top · 2 years ago

I’ve been having trouble getting it to run with exllama2_HF in text-gen-webui. Did you run into any issues?

Herr_Drosselmeyer@alien.top · 2 years ago

Try just exllama2, no HF.

Menix333@alien.top · 2 years ago

If you want the best 34b RP, try spicyboros-limarpv3. When I use other 34b models like Tess or Nous-capy, they are not bad, but they tend to get confused with the scene from time to time. However, this wasn’t happening with spicyboros at all. It is indistinguishable from 70b, and I’ve tried a lot of 70b models.

Herr_Drosselmeyer@alien.top · 2 years ago

https://huggingface.co/zgce/Yi-34B-Chat-Spicyboros-limarpv3-4bpw-hb6-exl2

This one?

Vadersays@alien.top · 2 years ago

The first one should be 3, right? Since one of the original 3 is dead?

The_One_Who_Slays@alien.top · 2 years ago

I’m still trying to figure out what the correct settings are for under 200k context. Ooba loads compress_emb (or whatever it’s called) at 5 million, and I don’t know whether you should leave it alone or change it if you change the context size to, say, 64k.

mcmoose1900@alien.top · 2 years ago

No setting changes, as if the model is 200K native.

waxbolt@alien.top · 2 years ago

Don’t touch the truncate length setting in the UI or it’ll be stuck at 32k until resetting the server.

bullerwins@alien.top · 2 years ago

Does the 200K mean that it has up to 200k context size? Is the context limited by the model, or can you just set it to whatever as long as you have enough VRAM? Also, if a GGUF model takes 20GB of VRAM, for example, is that with the “default” context size? Can it be less if you decrease the context, or more if you increase it?

Herr_Drosselmeyer@alien.top · 2 years ago

The base Yi can handle 200k. The version I used can do 48k (though I only tested 16k so far). Larger context size requires more VRAM.

The size that TheBloke gives for GGUF is the minimum size at 0 context. As context increases, VRAM use increases.

ParanoidMarvin42@alien.top · 2 years ago

Do you know how to estimate how much memory the context will need?

Herr_Drosselmeyer@alien.top · 2 years ago

With this particular model, I can crank it up to 32k if I enable "Use 8-bit cache to save VRAM", and that’s as high as it can go in Oobabooga WebUI.

waxbolt@alien.top · 2 years ago

32k seems to be hard coded in oobabooga. At least it is for truncate length max. There’s a patch to be made to fix it.

Herr_Drosselmeyer@alien.top · 2 years ago

I know but it’s slowing down quite a bit at 32k already so I don’t think it’s worth pushing it further. But hey, even at just 16k it’s four times what we usually get, so I’m not complaining.

DedyLLlka_GROM@alien.top · 2 years ago

You can change it yourself, although for now the edit has to be redone after every update. Just put something like 200000 in these 2 places:

https://github.com/oobabooga/text-generation-webui/blob/454fcf39a95691f5e375c48fbc6fe6aa96f0c738/modules/shared.py#L46

https://github.com/oobabooga/text-generation-webui/blob/454fcf39a95691f5e375c48fbc6fe6aa96f0c738/modules/ui_model_menu.py#L100

andrewlapp@alien.top · 2 years ago

34B Model Memory Requirements (infer)

Sequence Length vs Bit Precision
 SL / BP |     4      |     6      |     8      |     16
------------------------------------------------------------
     512 |     15.9GB |     23.8GB |     31.8GB |     63.6GB
    1024 |     16.0GB |     23.9GB |     31.9GB |     63.8GB
    2048 |     16.1GB |     24.1GB |     32.2GB |     64.3GB
    4096 |     16.3GB |     24.5GB |     32.7GB |     65.3GB
    8192 |     16.8GB |     25.2GB |     33.7GB |     67.3GB
   16384 |     17.8GB |     26.7GB |     35.7GB |     71.3GB
   32768 |     19.8GB |     29.7GB |     39.7GB |     79.3GB
   65536 |     23.8GB |     35.7GB |     47.7GB |     95.3GB
  131072 |     31.8GB |     47.7GB |     63.7GB |    127.3GB
  262144 |     47.8GB |     71.7GB |     95.7GB |    191.3GB

FullOf_Bad_Ideas@alien.top · 2 years ago

Here’s the formula

batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per Float16) * n_kv_heads

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
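For anyone who wants to plug numbers into that formula, here is a rough Python sketch. The function name and the Yi-34B shape values (60 layers, d_model 7168, 56 attention heads, 8 KV heads) are my assumptions, not something stated in this thread, so check them against the model's config.json before trusting the output. It only counts the KV cache, not the weights or any backend overhead.

```python
def kv_cache_bytes(seq_len, n_layers, d_model, n_heads, n_kv_heads,
                   batch_size=1, bytes_per_value=2):
    """KV cache size in bytes, following the formula quoted above.

    bytes_per_value=2 corresponds to Float16; the factor 2 is for K and V.
    """
    head_dim = d_model // n_heads
    return (batch_size * seq_len * head_dim * n_layers
            * 2 * bytes_per_value * n_kv_heads)


# Assumed Yi-34B shape (verify against the model's config.json).
YI_34B = dict(n_layers=60, d_model=7168, n_heads=56, n_kv_heads=8)

for seq_len in (4096, 16384, 32768):
    gib = kv_cache_bytes(seq_len, **YI_34B) / 2**30
    print(f"{seq_len:>6} tokens: {gib:.2f} GiB of FP16 KV cache")
```

Under those assumptions the cache costs 245760 bytes per token, so roughly 7.5 GiB at 32k context on top of the weights. Different backends allocate differently, so treat this as a ballpark.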

bullerwins@alien.top · 2 years ago

thanks a lot! I was not sure about how context affected VRAM usage. So each model has a maximum context size and using more will take more vram, thanks!

mcmoose1900@alien.top · 2 years ago

Another thing to note is that the exllamav2 backend is “special” because its context takes up less VRAM than the context in other backends. So let’s say the weights take 18GB, and your context takes up 6GB for a GGUF model. In exllama that’s only 3GB taken up by the context with the 8-bit cache.

There are other complications like the prompt processing batch size, but that’s the gist of it.

This makes a dramatic difference when the context gets huge. I’d prefer to use koboldcpp myself, but I just can’t really squeeze it on my 3090 without excessive offloading.
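To put the 18GB + 6GB example into numbers, here is a tiny Python sketch of the arithmetic. The helper name is made up for illustration, and it assumes the cache shrinks linearly with its bit width and ignores everything else (prompt processing buffers, activations, framework overhead), so real usage will be somewhat higher.

```python
def total_vram_gb(weights_gb, fp16_cache_gb, cache_bits=16):
    """Weights plus KV cache, scaling the FP16 cache size by cache_bits / 16.

    Hypothetical helper: a simplification that ignores all other overhead.
    """
    return weights_gb + fp16_cache_gb * cache_bits / 16


CARD_GB = 24  # e.g. a 3090

for bits in (16, 8):
    total = total_vram_gb(18, 6, cache_bits=bits)
    print(f"{bits}-bit cache: {total:.0f} GB needed, "
          f"{CARD_GB - total:.0f} GB headroom on a {CARD_GB} GB card")
```

With a 16-bit cache the example fills the whole card with no headroom left, while the 8-bit cache leaves about 3GB free.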

frozen_tuna@alien.top · 2 years ago

Very good to know! I haven’t fiddled with the new yi models too much yet since I was running into these exact issues. I’ll definitely use this solution soon, thanks.

bullerwins@alien.top · 2 years ago

Interesting! I had more success for some reason with GGUF models, as those work everywhere using koboldcpp and ooba’s. I didn’t know that exllamav2 was better for context. I will try it. That backend is for EXL2 formats, right? I had the impression it was better for speed; I didn’t know its context takes up less VRAM.