• 0 Posts
  • 8 Comments
Joined 1 year ago
Cake day: October 30th, 2023

  • > We are handling sensitive data which requires us to do it local

    Likely this is incompetence on your end because - cough - OpenAI models are available under Azure and are OK for even medical data. Takes some reading - but most “confidential, cannot use cloud” is “too stupid to read contracts”. There are some edge cases, but heck, you can even get government use approved on Azure.



  • > From what I’ve read mac somehow uses system ram and windows uses the gpu?

    Learn to read.

    SOME Macs - some very specific models - do not have a GPU in the classical sense but an on-chip GPU and super fast RAM. You could essentially say they are a graphics card with CPU functionality and only VRAM - that would come close to the technical implementation side.

    It is not “sometimes this, sometimes that”; it is just that SOME models (M1, M2, M3 chips) basically have a GPU that is also the CPU. The negative? Not expandable.

    Normal RAM is a LOT - seriously, a lot - slower than the VRAM or HBM you find on high end cards. Those are not only way faster (GDDR6 now, while current computers are on DDR5) but also not 64 bit wide but 384 or WAY higher (2048, I think, for HBM), so their transfer speed in GB/s makes normal computers puke.
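
    To get a feel for the gap, here is a quick back-of-the-envelope in Python - the parts and speeds are rounded example numbers I picked for illustration, not exact specs:

    ```python
    # Peak bandwidth ≈ bus width (bytes) * per-pin data rate (GT/s)
    def bandwidth_gb_s(bus_width_bits: int, data_rate_gts: float) -> float:
        return bus_width_bits / 8 * data_rate_gts

    print(bandwidth_gb_s(128, 5.6))   # dual-channel DDR5-5600 system RAM  -> ~90 GB/s
    print(bandwidth_gb_s(384, 21.0))  # 384-bit GDDR6X graphics card       -> ~1008 GB/s
    print(bandwidth_gb_s(256, 6.4))   # M2 Pro class unified LPDDR5 memory -> ~205 GB/s
    ```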

    That, though, comes at a price. Which, as I pointed out, starts with being inflexible - no way to expand RAM, it is all soldered on. Some of the fast RAM is reserved - but essentially on an M2 Pro you get a LOT of RAM usable for LLMs.

    You now say you have 64 GB RAM - unless you put a crappy card into an otherwise modern computer, that also means your RAM is way slower than what is normal today. So you are likely stuck with the 12 GB VRAM to run fast. Models come in layers, and you can offload some of them to normal RAM (see the sketch below), but it is a LOT slower than the VRAM - so not good to use a lot of it.
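
    A minimal sketch of that layer offloading, assuming llama-cpp-python built with GPU support; the model file name and layer count are just placeholders you would tune to your 12 GB card:

    ```python
    from llama_cpp import Llama

    llm = Llama(
        model_path="./model.Q4_K_M.gguf",  # hypothetical local GGUF file
        n_gpu_layers=35,  # however many layers fit in the 12 GB VRAM; the rest stays in system RAM
        n_ctx=4096,
    )

    out = llm("Q: Why is VRAM faster than system RAM? A:", max_tokens=128)
    print(out["choices"][0]["text"])
    ```

    The more layers you have to push back into system RAM, the more the token rate drops toward that slower memory.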


  • I actually do not think so. Let’s have a look at it from various perspectives.

    • The H100 is below 100 GB RAM in the OCI3 form factor - the only one relevant for inference - and the near-200 GB version actually uses 2 cards. That puts 5 of them into a 10x PCIe server.
    • The AMD MI300, coming in the same timeframe, in its SGC form factor has 8 cards of near 200 GB.

    So, AMD wins here, at the price of not using CUDA - which may not be an issue.
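
    A quick back-of-the-envelope with the rough per-card figures above (about 94 GB per H100 in the 2-card variant, about 192 GB per MI300 - treat them as approximations):

    ```python
    h100_per_server  = 5 * 2 * 94   # 5 two-card near-200 GB units in a 10x PCIe box -> ~940 GB
    mi300_per_server = 8 * 192      # 8 cards in the 8-way chassis                   -> ~1536 GB
    print(h100_per_server, mi300_per_server)
    ```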

    Now, performance. The 4.8 TB/s memory speed is absolutely amazing. But 5.2 TB/s on AMD, and totally new architectures with memory-integrated computing coming at the end of the year, like the DMatrix Corsair C8, make a joke out of that.

    I am not sure how NVidia - outside their ecosystem - will justify the price. Anyone who buys it - pressure to deliver may be a point - will get bitten soon.


  • THAT is the cost side, and that is a NASTY one. It is not only the financial cost - it is, as you point out, the response time. And it is NOT just inference, you also have all the lookups that must happen.

    But yes. This is where the price is paid, and it shows that we are still a factor of 10 or 20 away from fast, interactive, complex-data AI.

    But - do not worry, we will get there ;)

    No GitHub repo I am aware of - people are very happy with their naive little innovations and never see the real problems in their simplistic tests. It is an 80/20 or higher-order problem - MOST things work simply, SOME - ah - well ;) You also get into the “smalltalk” problem - you do not want to run a full research cycle when the user input is “Thank you, that was helpful” ;)
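
    A rough sketch of that routing, assuming the OpenAI Python SDK and a cheap model as the classifier - the model name and prompt are illustrative, any fast classifier would do:

    ```python
    from openai import OpenAI

    client = OpenAI()

    def needs_research(user_input: str) -> bool:
        """Cheap gate: only fire the expensive research cycle for real questions."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder for any small, fast model
            messages=[
                {"role": "system",
                 "content": "Answer only YES or NO: does this message need information "
                            "to be looked up, or is it just smalltalk?"},
                {"role": "user", "content": user_input},
            ],
            max_tokens=2,
        )
        return resp.choices[0].message.content.strip().upper().startswith("Y")

    if needs_research("Thank you, that was helpful"):
        ...  # run the full retrieval / research cycle
    else:
        ...  # short direct reply, no retrieval
    ```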

    That said, really, if AI gets 10x faster (and it looks like hardware plus software are on the way to more than that), it is easily doable from the time side.


  • RAG is for KNOWLEDGE - not style. Though you may use RAG on a guidance input to select instructions and send them into the system prompt. The whole thing is a LOT more complicated than most people make it - a LOT, sadly.

    For example, you must GENERATE the questions - you cannot use the user input directly, for 2 reasons. First, let’s be clear - questions should be precise, so you cannot feed in the chat history because it may pollute the vector with irrelevant other stuff. Here are the 2 problems:
    * Questions may be reflexive. “Where is Microsoft located” is a good question, but the follow-up “And how many people does the city have” cannot be answered UNLESS you take information from the first question.
    * Questions may not be singular questions. People have a tendency to sometimes ask multiple things in one input.

    So, on that side, you must use an AI to EXTRACT the ACTIVE questions from the chat history - complete, self-contained questions - separate them, and then use each of them individually for RAG.
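
    A minimal sketch of that extraction step (again assuming the OpenAI SDK; the prompt wording and model are illustrative, not a fixed recipe):

    ```python
    from openai import OpenAI

    client = OpenAI()

    EXTRACT_PROMPT = (
        "From the conversation below, extract every question that is still open. "
        "Rewrite each one so it stands on its own (resolve references like 'the city' or 'it'), "
        "one question per line. If nothing is open, return an empty response."
    )

    def extract_active_questions(chat_history: str) -> list[str]:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder
            messages=[
                {"role": "system", "content": EXTRACT_PROMPT},
                {"role": "user", "content": chat_history},
            ],
        )
        return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

    history = (
        "User: Where is Microsoft located?\n"
        "Assistant: Microsoft is headquartered in Redmond, Washington.\n"
        "User: And how many people does the city have?"
    )
    for question in extract_active_questions(history):
        print(question)  # e.g. "How many people live in Redmond, Washington?"
        # each rewritten question is embedded and sent to the vector store separately
    ```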

    On the other side, preprocessing into Q&A pairs (with an information-density increase AND not vectorizing the answer) and dream-like sequences to combine similar vectors are other approaches to make the answer more relevant.
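
    For the Q&A-pair idea, the point is that only the question side gets embedded; the stored answer is what goes back into the prompt. A small sketch with OpenAI embeddings and plain numpy (the example pairs are made up):

    ```python
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(text: str) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(resp.data[0].embedding)

    qa_pairs = [
        ("Where is Microsoft headquartered?", "Microsoft is headquartered in Redmond, Washington."),
        ("Who founded Microsoft?", "Microsoft was founded by Bill Gates and Paul Allen in 1975."),
    ]

    # Index: embed ONLY the question side of each pair.
    index = [(embed(q), a) for q, a in qa_pairs]

    def retrieve(query: str) -> str:
        qv = embed(query)
        scores = [float(qv @ v) / (np.linalg.norm(qv) * np.linalg.norm(v)) for v, _ in index]
        return index[int(np.argmax(scores))][1]  # return the stored answer, never its vector

    print(retrieve("What city is Microsoft based in?"))
    ```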

    But you must STRICTLY separate what you look for - for rewriting an email, you CAN use RAG to instruct the AI, but it is a little more complex.

    This is one of the reasons I look forward to hopefully soon being able to use 32k and 64k context - I pretty much always fill up 16k.