The title, pretty much.

I’m wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at fp16. It would be great to get some insights from the community.
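
For a sense of the resource side of that trade-off, here is a rough back-of-the-envelope comparison of weight memory at different sizes and precisions. The 1.2x overhead factor for KV cache and runtime buffers is an assumption, and this says nothing about output quality, which is what the question is really about:

```python
# Approximate VRAM needed just to hold the weights, plus a guessed 1.2x
# overhead for KV cache and runtime buffers (the factor is an assumption).
def approx_vram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    bytes_for_weights = params_billion * 1e9 * bits_per_weight / 8
    return bytes_for_weights * overhead / 1e9

for name, params, bits in [("70B @ 4-bit", 70, 4),
                           ("34B @ fp16", 34, 16),
                           ("13B @ fp16", 13, 16),
                           ("7B @ fp16", 7, 16)]:
    print(f"{name}: ~{approx_vram_gb(params, bits):.0f} GB")
```

That prints roughly 42, 82, 31, and 17 GB respectively, so in memory terms the 4-bit 70B sits between the fp16 13B and the fp16 34B.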

  • semicausal@alien.top · 1 year ago

    In my experience, the lower you go in quantization, the model:

    - hallucinates more (one time I asked Llama2 what made the sky blue and it freaked out and generated thousands of similar questions line by line)

    - is more likely to give you an inaccurate response when it doesn’t hallucinate

    - is significantly more unreliable and non-deterministic (seriously, providing the same prompt can cause different answers!)

    At the bottom of this post, I compare the 2-bit and 8-bit extreme ends of the Code Llama Instruct model with the same prompt, and you can see how it played out: https://about.xethub.com/blog/comparing-code-llama-models-locally-macbook
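
    For anyone who wants to try something similar, here is a minimal sketch of that kind of side-by-side using llama-cpp-python. The GGUF filenames and the prompt are placeholders rather than the exact files used in the post:

    ```python
    # Run the same prompt through two quantizations of the same model and
    # compare the outputs by eye. The .gguf filenames below are placeholders.
    from llama_cpp import Llama

    PROMPT = "Write a Python function that checks whether a string is a palindrome."

    for path in ["codellama-7b-instruct.Q2_K.gguf",   # low-bit quant (placeholder filename)
                 "codellama-7b-instruct.Q8_0.gguf"]:  # 8-bit quant (placeholder filename)
        llm = Llama(model_path=path, n_ctx=2048, verbose=False)
        out = llm(PROMPT, max_tokens=256, temperature=0.0)  # temperature 0 to cut run-to-run variance
        print(f"\n=== {path} ===\n{out['choices'][0]['text']}")
    ```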

    • NachosforDachos@alien.top · 1 year ago

      That was useful and interesting.

      Speaking of hypothetical situations, how much money do you think an individual would need to spend on computing power to give themselves a GPT-4 Turbo-like experience locally?