Warning: very long post. TLDR: this post answers some questions I had about generating text with full, unquantized Falcon-180B under budget constraints.
What is the goal
The goal is to benchmark the full, unquantized Falcon-180B. I chose Falcon-180B because it is the biggest open-source model currently available. I do not use any optimization such as speculative decoding, any kind of quantization, or even torch.compile. I benchmark both small and large context sizes, and I aim for maximum utilization of the available GPUs. I use 3090 cards for all experiments, as they are easy to find in used condition (around 700$ each) and have 24GB of memory.
About the model
Falcon-180B has 80 transformer layers and its weights are around ~340GB. Its maximum context size is 2048 tokens, so whenever I say small context size I mean around 100 tokens, and whenever I say large context size I mean 2048 tokens.
Experiment setup
Every LLM can be roughly split into three parts:
- begin - which converts the tokens into a continuous representation (this is usually the embeddings)
- mid - which is a series of transformer layers; in the case of Falcon-180B we have 80 transformer layers
- end - which converts the intermediary result into a prediction for the next token (this is usually the LM head)
I converted Falcon-180B into a separate .pth file for each of those parts, so for Falcon-180B I have 82 .pth files (one for begin, one for end, and 80 for the transformer layers).
This allows me to save disk space: if a given node is going to run layers 5 to 15, it only needs the weights for those particular layers. There is no need to download several big safetensors files and read only parts of them; instead, each node stores exactly what it needs.
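Roughly, the conversion step might look like the sketch below. This is a minimal illustration, not the exact code; the attribute names (word_embeddings, h, ln_f, lm_head) are taken from the Hugging Face Falcon implementation, loading the full model this way needs a lot of CPU RAM, and depending on your transformers version you may need trust_remote_code=True:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the full model once on CPU (the ~340GB of weights must fit in RAM).
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True
)

# begin: token embeddings; end: final layer norm + LM head; mid: the 80 decoder layers.
torch.save(model.transformer.word_embeddings.state_dict(), "begin.pth")
torch.save({"ln_f": model.transformer.ln_f.state_dict(),
            "lm_head": model.lm_head.state_dict()}, "end.pth")
for i, layer in enumerate(model.transformer.h):
    torch.save(layer.state_dict(), f"layer_{i:02d}.pth")
```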
I also refactored Falcon-180B so that I can run parts of the model as a normal PyTorch module, e.g. you can run layers 0 to 5 as a normal PyTorch module. This allows me to run it distributed on heterogeneous hardware, e.g. add machines with other cards (which have very little memory) to the computation.
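Conceptually, a node then wraps its assigned layers in something like the sketch below. This is a simplified illustration: build_layer is a hypothetical helper that constructs one decoder layer and loads its layer_XX.pth state dict, and the real Falcon layers also need attention masks, rotary embeddings and KV-cache handling passed through:

```python
import torch
import torch.nn as nn

class LayerSlice(nn.Module):
    """Runs a contiguous block of decoder layers as a standalone module."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, hidden_states, **layer_kwargs):
        for layer in self.layers:
            # Falcon decoder layers return a tuple; the new hidden states come first.
            hidden_states = layer(hidden_states, **layer_kwargs)[0]
        return hidden_states

# e.g. layers 0-5 on this node (build_layer is hypothetical, see above).
slice_0_5 = LayerSlice([build_layer(i) for i in range(0, 5)]).to("cuda")
```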
The experiments are run in distributed mode, with multiple nodes (PCs) having different numbers of cards, so there is some network overhead, but all nodes are connected to the same switch. In my experiments, I found that the network overhead is about ~25% of the prediction time. This could be improved by using a 10Gbit switch and network cards or InfiniBand, but a 1Gbit network is the best I could do with the available budget.
Questions
How many layers can you fit on a single 3090 card?
I can load around 5 layers of Falcon-180B, which take up around 21GB of memory, leaving ~3GB for intermediate results. To load all the weights of Falcon-180B on 3090 cards, you would need 16 cards, or about 11k USD, assuming used 3090s cost around 700$ (although you can also find them for 500$ in some places).
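As a quick sanity check on those numbers (a back-of-the-envelope calculation, assuming 5 layers per 24GB card and 700$ per used 3090):

```python
import math

n_layers, layers_per_card, card_price_usd = 80, 5, 700
cards_needed = math.ceil(n_layers / layers_per_card)   # 16 cards
print(cards_needed, cards_needed * card_price_usd)     # 16 cards, 11200 USD
```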
How long does it take to load the state dict of a single node on the GPU?
~3.5s
For 5 layers, it takes ~3.5 seconds to move the state dict from the CPU to the GPU.
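One way to measure this is the minimal sketch below; gpu_block is a placeholder for one node's 5-layer module already sitting on the GPU, and cpu_state_dict for its weights preloaded in CPU RAM:

```python
import time
import torch

torch.cuda.synchronize()
start = time.time()
gpu_block.load_state_dict(cpu_state_dict)  # copies the CPU tensors into the GPU parameters
torch.cuda.synchronize()
print(f"upload took {time.time() - start:.2f}s")  # ~3.5s for a 5-layer block
```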
How long does it take to forward a small prompt through a single transformer layer?
~10ms
Since we have 80 layers, the prediction would take at least ~800ms. When you add begin, end, and the data transfer overhead, we end up at a little more than 1s per token.
How long does it take to forward a large prompt through a single transformer layer?
~100ms
Since we have 80 layers, the prediction would take at least ~8000ms, or 8 seconds. When you add begin, end, and the data transfer overhead, we end up at a little more than 10s per token.
How many 3090s do I need to run Falcon-180B with a large prompt?
8
At first glance, it may seem like you need 16 3090s to achieve this, but surprisingly, you can make do with only 8 3090s and get the same generation speed!
Why? Because you can reuse the same GPU multiple times! Let me explain what I mean.
Let’s say node0 loads layers 0-5 on its GPU, node1 loads layers 5-10, and so on, up to node7 with layers 35-40. After node0 does its part of the prediction (which takes ~500ms), it sends its output to the next node. While the other nodes are computing, instead of sitting idle, node0 immediately starts loading layers 40-45 to the GPU; these are pre-loaded in CPU memory. This load takes around ~3.5 seconds, while the prediction on the other nodes takes ~4s, and since the two happen in parallel, no time is added to the total inference time: each node uses the time in which the other nodes are computing to load future layers onto its GPU.
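In pseudo-PyTorch, each node’s loop looks roughly like the sketch below. This is an illustration of the idea rather than the actual implementation: recv_activations, send_activations and load_block_to_gpu are hypothetical helpers, and all of the node’s per-layer state dicts are assumed to be preloaded in CPU RAM:

```python
import torch

def node_loop(blocks, cpu_state_dicts):
    # e.g. node0 owns two blocks: layers 0-5 and layers 40-45
    current = load_block_to_gpu(blocks[0], cpu_state_dicts[blocks[0]])
    while True:
        for k in range(len(blocks)):
            hidden = recv_activations()          # activations from the previous node
            with torch.no_grad():
                hidden = current(hidden)         # ~500ms for 5 layers at 2048 context
            send_activations(hidden)             # hand off to the next node
            # The other 7 nodes now need ~4s for their layers; use that idle time
            # to upload the next block's weights (~3.5s from CPU RAM to the GPU).
            nxt = blocks[(k + 1) % len(blocks)]
            current = load_block_to_gpu(nxt, cpu_state_dicts[nxt])
```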
That’s insane, because for under 6k USD you can get 8 3090s and have Falcon-180B running at maximum context size at 10s/token. Add another 4k USD for the rest of the components, and for under 10k USD you can have Falcon-180B running at a decent speed.
Implementation details
I separated the project into 4 small libraries with minimal third-party dependencies:
- One for converting the weights into a separated weights format
- One for running a node with reloading of future layers
- One for sampling the results
- One with the Falcon-specific code needed to run only parts of it as PyTorch modules. I did regression tests to ensure I have not broken anything and that my implementation conforms to the original one
If there is sufficient interest, I may package and open-source the libraries and notebooks.
Future work
I plan to convert other models into the same format and refactor them so that different parts of the model can be used as normal PyTorch modules. Here’s which models are currently on my TODO list:
- Goliath-120b
- Llama2
- Mistral
- Yi
etc.
If the community is interested, I can open-source the whole project and accept requests for new models to be converted into this format.
Thank you for your attention and sorry once again for the long post.
That is absolutely impressive, but:
- is light quantization that bad? Couldn’t you run 99% of the same model for half the cost? Is running unquantized just a flex/exercise/bragging right?
- Would quantized run faster? Slower? The same?
- Isn’t Falcon-180B kinda… meh? I mean it’s pretty smart from size alone, but the lack of fine tuning by the community means it’s kind of like running LLaMA-70b by itself.
- Would one of those new crazy good Threadrippers beat the GPUs? lol
- It’s not bad at all! I just wanted to see the full model. The approach can be applied to quantized models too; I just wanted the most extreme example in terms of model and context size. It only gets better from there! Light quantization + speculative decoding gets you close to real-time.
- Quantized would run significantly faster, although I haven’t measured it extensively yet. That is because you avoid most of the data transfer, and the layers also take up a lot less memory and run much faster themselves.
- The model is definitely not the best, but what was important for me was to benchmark something that’s close to GPT-3.5 in terms of size, so I have a blueprint for running newer open-source models of similar sizes.
I bet you are really wishing OAI had gone ahead with their briefly considered idea of releasing GPT-3 open source on dev day.
You got me there 😊
As for point #3, have you tried Goliath-120B? If yes, how would you rate it against Falcon-180B?
I haven’t run the full Goliath yet. Soon 😊
I see. Please update us when you do, thanks in advance!
Please open source the code. I am keen
Thanks for your comment! I will, and I will share the Jupyter notebooks as well. It will probably be next week.
!remindme 1 week
Probably no downside to open sourcing this type of work. It’s a bit like fishing with a net: it might be a while before you catch another skilled developer’s attention, but when you do, rate_of_progress(1+1)=4. So unless the work is part of some commercialization effort, there are no downsides if your objective function is purely understanding.
Yes, agreed. But there is effort involved in releasing it: I have to document it, test it, think carefully about naming so that it’s intuitive, etc.
If I release it and no one cares, that would be a waste of time, but since there is interest, this motivates me to release it.
Thanks guys 😊
Amazing work! Thanks!
When I tried running the f16 180B purely from disk, I got ~90s/t with PCIe 4.0.
With Q4_K_S, that becomes ~22s/t.
Also try this out for running on multiple machines:
Not sure if your layer method is fast enough; I think it’s going to be a bottleneck if you get any faster.
BTW, CPU performance can match the bandwidth of good GPUs.
- There is a dude with 512GB of CPU RAM on his server who gets 4.5 t/s on f16 70B, and will probably get 1.8 t/s on f16 180B
Here’s a good post on a potential 1tb ram setup:
That’s awesome, and I could see it being pretty useful for synthetic data generation with more compute intensity.
90s/t is serial decoding, right? I guess your CPU utilization is approaching zero. What happens when you push the batch size until you’re >50% CPU utilization? (At some point it might make sense to dedicate a core to tokenization.) The potential gains from speculative decoding here seem likely to be big, too, since you’d only be running the big model once every several tokens. I imagine sticking Mistral in VRAM, after fine-tuning it with the same instruction tuning corpus as your Falcon (though there are fancier ways to do sketch model / big model alignment, too).
Total aside: I don’t know if you saw the sub-1 bit compression of mixture models, but it might be up your alley. Fun if we ever get weights for a big mixture model (https://github.com/IST-DASLab/qmoe).
I get 1.33 t/s with 180B Q4_K_S with a batch of 64. Here’s my test: https://www.reddit.com/r/LocalLLaMA/comments/17jhwpa/tested_batched_decoding_on_cpu/
Yes, speculative decoding does work with the llama models + TinyLlama. But we don’t have an optimal model trained alongside the original models, so we get no higher than 1.3-1.5x for chat usage.
Lookahead decoding is another thing, I assume it will be better!
thanks for sharing!
Very cool. It’s fun to see praxis match the theory, as small models hit the compute wall at a batch size proportional to their size.
Have you tried cranking the batch size further on Falcon 180B? 16 tokens was 16 times as fast as one token, so it seems like you’re still pretty far from the limit.
And the optimal batch size for the FP16 model should be around 4x higher, right?
Threads are best at 4-5, unless that’s changed, so I think the default in the “batched” binary is set up that way.
I reach the maximum CPU utilization (30-36%) after 64, but still see further gains at 256.
That is amazing. Where do you think the primary bottleneck is?
Thanks for sharing, that’s very useful! What GPUs and how many are you using, just to make sure I understand correctly?
EDIT: What CPU are you using? Because 90s/t is pretty impressive to be honest.
The layer method basically uses the time when the node is idle, so it works best with large context sizes or if you have many GPUs (so you can load a small number of layers on each GPU and reload them super fast).
I use ggml mmap inference, so 0GB of RAM or VRAM is needed. I use this model; it is 360GB in size: https://huggingface.co/imi2/airoboros-180b-2.2.1-gguf/blob/main/airoboros-180b-2.2.1-f16.gguf.a
10s/tok and a couple of kilowatts of power… OK, if it were as smart as Einstein and as unerring as an oracle it might make sense, but you can use it for free on Petals at 3 tok/sec, and it is most certainly not…
I wonder if you could get it running on two Mac Studio Ultras with 192GB of RAM each. With fewer nodes you’d reduce the communication overhead quite a bit.
That sounds like a great idea. I don’t have a Mac Studio, but in theory it should totally work, since every part in this experiment is a normal PyTorch module. So if you can run PyTorch on a Mac (which you definitely can), you can run it on two Mac Studio Ultras.
How does data transfer happen here? Via thunderbolt? Or just networked?
If there are multiple GPUs in the same machine, via PCIe. If on different machines, via networking with a 1Gbit switch.
I wonder if there is a way to do this over Thunderbolt or other high-speed transfers. Networking across machines seems more feasible than PCIe for the majority of users.
Oh no no no, you’re doing it wrong ;) Just kidding. Here are some numbers for reference of what one can get on a budget system without multiple high-end GPUs:
i5-12400F + 128GB DDR4 + some layers offloaded to a 3060 Ti = 0.35 tokens/second on Falcon-180B Q4_K_M
what did you use to run it?
I used oobabooga_windows\text-generation-webui
ok thanks
Thanks for the info! What is the context size? Is it small or big? Because that definitely matters.
I think I tested it up to 500 tokens or so.
Interesting idea.
Being able to run local models cheaply is of chief interest in medicine, given our privacy concerns. Mark me down as pro-open-sourcing this project 👍
!remindme 1 week
This is a good example of how the big cloud computing vendors’ approach is not attractive at all at this scale. For instance, if I am reading the math right, AWS recommends you run this same model on a “p4de.24xlarge” instance, which costs about 40 USD/h (on demand); the equivalent 10k USD budget would be good to run this model for only about 10 days. https://aws.amazon.com/blogs/machine-learning/falcon-180b-foundation-model-from-tii-is-now-available-via-amazon-sagemaker-jumpstart/
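For reference, the break-even works out to roughly:

```python
budget_usd, hourly_usd = 10_000, 40   # one-time hardware budget vs. p4de.24xlarge on demand
print(budget_usd / hourly_usd / 24)   # ~10.4 days of on-demand usage
```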
Latitude.sh has that same instance for $23.2/hr, or you can get 8 x H100 for $35.2/hr.
Yes, please open source this. It is an amazing idea. Thanks for doing this.
One big thing that you (or someone else) could do to make this accessible (and thus more popular) would be to create a “one-click installer”. This would allow those with little to no coding experience to benefit from this (and that’s a lot of people). Or refine the code so that it could work in the background (or could easily be made to work) with any of the existing GUIs out there (e.g. LM Studio, Oobabooga, etc.). No idea how easy or hard this would be (as I am only now learning Python), but I thought I’d throw it out there. Thanks again for working on this.
Good work.
I’m not sure if it would be possible, but could the following be done for loading and processing the layers?
On GPU 1 load layers 1, 3, 5, 7 and on GPU 2 load layers 2, 4, 6, 8, and run the layers in parallel.
Once a layer is complete, start unloading it and loading the next layer instead of waiting for all loaded layers to finish. That might only be useful for those with slower cards, but the loading might slow down the processing and make it worse.
Isn’t that how things like petals.dev work?
Kinda, but not exactly.
Petals also splits the work so that volunteers can pick it up, but it doesn’t use the same GPU multiple times during the same inference. So to achieve similar performance on petals.dev, you would need 16 volunteers with 3090 cards, while here you have 8 3090s locally.
Yes, you are right. Although I guess it could work in Petals as well: if each person has the full model downloaded, then the GPU can be instructed to load the next weights locally when it is done with the current ones?
Yes, you could do it, but there’s no need to load the full model on each node. Only the layers that are assigned for this node.
In the case of Petals, where any client can drop off at any time, each client would need multiple layers for redundancy; maybe not the full weights, but at least 20-30%, so if someone drops off, another one can take over instantly.
If you want to benchmark the largest open source model, Google recently released a 1.6T model: https://huggingface.co/google/switch-c-2048
This is fantastic work