Noxusequal@alien.topB to

LocalLLaMAEnglish · 2 years ago

Need help estimating if my speed is expected. Llama_index



Using a 5800H and RTX 3060 laptop, I built a RAG pipeline to do basically PDF chat with a local Llama 7B 4-bit quantized model in llama_index, using llama.cpp as the backend. I use an embedding model and a vector store through PostgreSQL, all running under WSL.

With a context of 4k and an output length of 256 tokens, generating an answer takes about 2-6 min, which seems relatively long. I wanted to know if that is expected or if I need to go on the hunt for what makes my code inefficient.
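For reference, here is a quick back-of-the-envelope calculation of the throughput those numbers imply. It is only a rough sketch: it assumes all 256 output tokens are actually generated and attributes the whole wall time to generation, so prompt processing and retrieval are folded into the figure. The function name is made up for illustration.

```python
# Effective tokens/s implied by "256 output tokens in 2-6 min".
# Assumption: the full wall time is counted as generation time, which
# overstates the per-token cost because prompt processing is included.
OUTPUT_TOKENS = 256


def tokens_per_second(output_tokens: int, wall_seconds: float) -> float:
    """Return the end-to-end generation rate in tokens per second."""
    return output_tokens / wall_seconds


for minutes in (2, 6):
    rate = tokens_per_second(OUTPUT_TOKENS, minutes * 60)
    print(f"{minutes} min -> {rate:.2f} tokens/s")
```

That works out to roughly 0.7 to 2.1 tokens/s end to end.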

Also, what kind of speedups would other GPUs bring?

Would be very happy to get some thoughts on the matter :)
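One way to start that hunt is to time each stage of the pipeline separately, so it becomes clear whether embedding, retrieval, or generation dominates. The sketch below uses only the standard library; the three stage functions are hypothetical placeholders, not llama_index calls, and would be swapped for the real pipeline steps.

```python
import time
from contextlib import contextmanager

# Accumulated wall time per stage, in seconds.
timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Add the wall time of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[stage] = timings.get(stage, 0.0) + elapsed


# Hypothetical stand-ins for the real pipeline stages.
def embed_query(question: str) -> list[float]:
    return [0.0] * 8


def retrieve_chunks(vector: list[float]) -> list[str]:
    return ["chunk one", "chunk two"]


def generate_answer(question: str, chunks: list[str]) -> str:
    return "placeholder answer"


question = "What does the PDF say about X?"
with timed("embed"):
    vec = embed_query(question)
with timed("retrieve"):
    chunks = retrieve_chunks(vec)
with timed("generate"):
    answer = generate_answer(question, chunks)

# Print the slowest stage first.
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>10}: {seconds:.4f} s")
```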


harrro@alien.topB · 2 years ago
I’m using langchain with qdrant as the vector store.

VRAM is full

How is a 7B model maxing out your VRAM? A 7B model at 4bit and 4k context should not use the 12GB VRAM on a 3060.
- Noxusequal@alien.topOPB · 2 years ago
  It's a 3060 laptop, so only 6 GB of VRAM, and the model plus embedding etc. is at about 5.8 GB.
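The VRAM figures in this thread can be sanity-checked with a rough estimate. The sketch below assumes a Llama-7B-style shape (32 layers, hidden size 4096, no grouped-query attention), exactly 4 bits per weight, and an fp16 KV cache sized for the full 4k context. Real 4-bit formats store extra scale data and the runtime needs scratch buffers, so actual usage will be somewhat higher. All names are illustrative.

```python
# Rough VRAM estimate for a 7B model at 4-bit with a 4k context.
# Assumed (not taken from the thread): 32 layers, hidden size 4096,
# fp16 KV cache, exactly 4 bits per weight with no quantization overhead.
N_PARAMS = 7e9
BITS_PER_WEIGHT = 4
N_LAYERS = 32
HIDDEN_SIZE = 4096
N_CTX = 4096
KV_BYTES_PER_VALUE = 2  # fp16

GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to hold the quantized weights."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, hidden: int, n_ctx: int, bytes_per_value: int) -> int:
    """Bytes for keys and values (factor 2) across all layers and positions."""
    return 2 * n_layers * hidden * n_ctx * bytes_per_value


weights = weight_bytes(N_PARAMS, BITS_PER_WEIGHT)
kv = kv_cache_bytes(N_LAYERS, HIDDEN_SIZE, N_CTX, KV_BYTES_PER_VALUE)

print(f"weights : {weights / GIB:.2f} GiB")
print(f"KV cache: {kv / GIB:.2f} GiB")
print(f"total   : {(weights + kv) / GIB:.2f} GiB")
```

Under these assumptions the weights and KV cache alone come to a bit over 5 GiB, so a figure near 5.8 GB with an embedding model loaded is plausible. That fits comfortably in 12 GB but leaves almost no headroom on a 6 GB card.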