Right now it seems we are once again on the cusp of another round of LLM size upgrades. It appears to me that 24GB of VRAM gets you access to a lot of really great models, but 48GB really opens the door to the impressive 70B models and lets you run the 30B models comfortably. However, I'm seeing more and more 100B+ models being created that push 48GB setups down into lower quants, if they can run the model at all.

This is, in my opinion, big, because 48GB is currently the magic number for consumer-level cards: 2x 3090s or 2x 4090s. Adding an extra 24GB to a build via consumer GPUs turns into a monumental task, due to either space in the tower or the capabilities of the hardware, AND it would only put you at 72GB of VRAM, the very edge of the recommended VRAM for the 120B Q4_K_M models.

I genuinely don't know what I am talking about and I am just rambling, because I am trying to wrap my head around HOW to upgrade my VRAM to load the larger models without buying a massively overpriced workstation card. Should I stuff four 3090s into a large tower? Set up three 4090s in a rig?

How can the average hobbyist make the jump from 48GB to 72GB+?

Is taking a wait-and-see approach, waiting for Nvidia to drop new scalper-priced high-VRAM cards, feasible? Should I hope and pray for some kind of technical magic that drops the required VRAM while keeping quality intact?

The reason I am stressing about this and asking for advice is that the quality difference between smaller models and 70B models is astronomical, and the difference between the 70B models and the 100B+ models is a HUGE jump too. From my testing, it seems that the 100B+ models really turn the "humanization" of the LLM up to the next level, leaving the 70B models to sound like... well... AI.

I am very curious to see where this gets to by the end of 2024, but one thing is for sure: I won't be seeing it on a 48GB VRAM setup.
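To put rough numbers on those thresholds: weight memory is roughly parameter count times bits per weight, plus some overhead for KV cache and activations. A minimal sketch, where the ~4.85 bits/weight for a Q4_K_M-style quant and the 15% overhead are assumptions rather than exact figures:

```python
# Back-of-envelope VRAM estimate for loading an LLM.
# Assumptions: ~4.85 bits/weight for a Q4_K_M-style quant, ~15% overhead
# for KV cache and activations. Real numbers vary by context length and loader.

def vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.15) -> float:
    """Approximate GB of VRAM for weights plus KV-cache/activation overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params cancels 1e9 bytes/GB
    return weight_gb * (1 + overhead)

print(round(vram_gb(70, 4.85), 1))   # ~48.8 GB: already tight on 2x 24GB cards
print(round(vram_gb(120, 4.85), 1))  # ~83.7 GB: well beyond even 72GB
```

Which is roughly why 70B models sit at the edge of a 2x 3090/4090 build, and 120B-class models fall out of reach without dropping to lower quants.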

  • unculturedperl@alien.topB · 1 year ago

    Speed costs money: how fast can you afford to go?

    Why 72GB? 80 or 96 seems like a more reasonable number. H100s have 80GB models if you can afford one (~$29k?). Two A6000 Adas would be ~$15k (plus a system to put them in).

    The higher-end compute cards seem more limited by funds and production than anything; X090 cards are where you find more scalpers and their ilk.

  • corecursion0@alien.topB · 1 year ago

    The next gen of models is at the 110B mark and beyond. I would say: estimate what it takes to run 250B at FP8 and FP16, then structure your purchases accordingly. Favour high-bandwidth memory.
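    That sizing exercise is simple arithmetic; a quick sketch, counting weights only (KV cache and activations add more on top):

    ```python
    # Weights-only memory for a model at a given precision.
    # FP8 = 1 byte/param, FP16 = 2 bytes/param.
    def weights_gb(params_billion: float, bytes_per_param: float) -> float:
        # (params_billion * 1e9 params) * bytes / (1e9 bytes per GB) = GB
        return params_billion * bytes_per_param

    print(weights_gb(250, 1))  # FP8:  250.0 GB of weights
    print(weights_gb(250, 2))  # FP16: 500.0 GB of weights
    ```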

  • tylerbeefish@alien.topB · 1 year ago

    Your wait-and-see approach is probably wise. The newly released GH200 chip leapfrogs the H100 by a considerable margin, and the H100 was already smoking the A100.

    On the consumer side, there does not seem to be high demand for running local LLMs. However, I used a 7B model with GPT4All on my ultrabook from 2014, which has a low-tier Intel 6th-gen chip and 16GB of RAM, and was getting about 2.5 tokens/second. It was super slow, but it shows what would be possible with some optimizations on consumer hardware.

    If you're willing to spend $10k to run an esoteric 110B model, it might be worthwhile to go for the capability to train them in the first place (even if perhaps very slowly). Or consider a Mac with a large amount of memory built into the SoC (unified memory), which would likely run models at an acceptable rate with some optimizations. Of course, that's only if blistering performance isn't necessary.

    Otherwise, patience will likely yield good results in the form of a solid model that works on consumer-grade components. The space seems keen on welcoming general users and enabling alternatives to transmitting data to some random server elsewhere. Just my opinion.

  • fallingdowndizzyvr@alien.topB · 1 year ago

    The easiest thing to do is to get a Mac Studio, and it also happens to be the best value. Three 4090s at $1600 each is $4800, and that's just for the cards; adding a machine to put those cards into will cost another few hundred dollars. Just the cost of the 3x 4090s puts you into Mac Studio Ultra 128GB range, and adding the host machine puts you into 192GB range. With those 3x 4090s you only have 72GB of VRAM, while both of those Mac options give you much more.

  • Bod9001@alien.topB · 1 year ago

    If you want to run a general-purpose model that can do everything, fair enough, throw resources at it. But I feel like there's a lot of optimisation that can be done; e.g., a coding model doesn't need to know how to fill out tax returns or who won the European Cup in 1995–96. It may even be possible to optimise for size without any loss.

  • MindOrbits@alien.topB · 1 year ago

    Yes.

    Workstations are the way to go. There are a few motherboards out there that give you four double-wide (2-slot) GPU spacings.

    Pro tip: think in PCIe 3.0 terms. x16 (PCIe 3.0) is a sought-after baseline, but x8 lanes often perform at about 80% of x16, because other system limitations, not the PCIe bus, are often the bottleneck.

    Depending on the CPU, motherboard chipset, and internal lane routing, you will struggle to find four x16 slots.

    PCIe 4.0 adds to the mess, but always in your benefit, just not as much as you might think, depending on the above.

    Roughly: older cards are PCIe 3.0, most cards you'd consider modern good and better are 4.0, and new cards are 5.0.

    PCIe 4.0 lanes can be split by chipsets for things like NVMe drives and USB, and each has 2x the bandwidth of PCIe 3.0 with supported 4.0 devices (x8 PCIe 4.0 ≈ x16 PCIe 3.0). A nice motherboard feature is when x16 PCIe 4.0 lanes are split into two x16 PCIe 3.0 slots. Chipsets and NVMe drives benefit greatly from PCIe 4.0 and often free up more PCIe 3.0 lanes for the slots.

    So… if you find four double-wide PCIe slots with at least x8 lanes per slot, you're leaving some performance "on the table", but you're really not that handicapped by the loss for what you're buying, especially when shopping used.

    Really new cards would suffer more from lane saturation, and may not have a favorable cost-to-benefit ratio given their prices.
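    The x8-vs-x16 point can be checked with rough per-lane throughput numbers (approximate usable per-direction rates after encoding overhead, not vendor specs):

    ```python
    # Approximate usable per-direction bandwidth, GB/s per lane, by PCIe generation.
    GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}

    def slot_bandwidth_gbps(gen: int, lanes: int) -> float:
        """Total per-direction bandwidth for a slot of the given generation and width."""
        return GBPS_PER_LANE[gen] * lanes

    print(round(slot_bandwidth_gbps(3, 16), 1))  # ~15.8 GB/s
    print(round(slot_bandwidth_gbps(4, 8), 1))   # ~15.8 GB/s: x8 Gen4 ~ x16 Gen3
    ```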

  • nero10578@alien.topB · 1 year ago

    You don’t NEED 3090/4090s. A 3x Tesla P40 setup still streams at reading speed running 120b models.

  • fediverser@alien.top · 1 year ago

    This post is an automated archive from a submission made on /r/LocalLLaMA, powered by Fediverser software running on alien.top. Responses to this submission will not be seen by the original author until they claim ownership of their alien.top account. Please consider reaching out to let them know about this post and to help them migrate to Lemmy.


  • AutomataManifold@alien.topB · 1 year ago

    I think it's worth remembering that while the really big models take a lot of VRAM, they also quantize down to smaller sizes, so the numbers are slightly misleading.

  • Flying_Madlad@alien.topB · 1 year ago

    I think the future is modular. Many small machines contributing to hosting a bigger model.

    That way, if you need to upgrade the capacity of your system, you can just add another compute node.

  • synn89@alien.topB · 1 year ago

    Building a system that supports two 24GB cards doesn't have to cost a lot; boards that can do dual x8 PCIe, and cases/power supplies that can handle two GPUs, aren't very hard to find. The problem I see past that is that you're running into much more exotic/expensive hardware. AMD Threadripper comes to mind, which is a big price jump.

    Given that the market of people who can afford that is much smaller than for dual-card setups, I don't feel like we'll see the lion's share of open source happening at that level. People tend to tinker on things that are likely to get used by a lot of people.

    I don't really see this changing much until AMD/Intel come out with graphics cards that bust the consumer-card 24GB barrier to compete with Nvidia head-on in the AI market. Right now Nvidia won't do that, so as not to compete with their premium-priced server cards.

  • bick_nyers@alien.topB · 1 year ago

    ITT, people are discussing making the jump to Threadripper etc. to afford the PCIe lanes.

    Alternatively, pick up a Zen 2 EPYC on eBay for cheap. A 16-core CPU plus motherboard could run you around $500, and you can get six PCIe 4.0 x16 slots. Check motherboard specs and learn more about using server hardware (loud fans!) via ServeTheHome and Art of Server.

    I saw something a while back saying GDDR7 will have something like 33% more memory per chip, so if the bus width stays the same we are looking at a 32GB 5090. Keep in mind this will be PCIe 5.0.
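    As a sanity check on that guess, assuming a 4090-style 384-bit bus with twelve 2GB chips (hypothetical numbers, not a confirmed spec):

    ```python
    chips = 12    # 384-bit bus / 32 bits per memory chip
    chip_gb = 2   # current per-chip capacity on a 24GB card

    print(chips * chip_gb)                 # 24 GB today
    print(round(chips * chip_gb * 4 / 3))  # 32 GB with ~33% more per chip
    ```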

  • kingp1ng@alien.topB · 1 year ago

    I keep hearing *unsubstantiated* rumors about model optimization breakthroughs. Everyone knows that the cost of compute is too damn high.

    So I'm just waiting until the next performance improvements arrive. Three years ago, a 1B-param model was state of the art. Hopefully by next year there'll be a model and framework that cuts the compute cost in half.