I’m trying to build an application using RAG. I know how RAG helps ground the responses and all, but how do I handle generic queries from users which have nothing to do with what’s stored in my vector database? For example, queries such as “How many gold medals did China win during the Tokyo Olympics?” vs. “Paraphrase this email for me: …”. I would assume an LLM without RAG would do a much better job answering the second question.
How do people usually handle these scenarios? Are there any tools that I can look at? Any help would be greatly appreciated. Thank you.
The way to do this is to generate a bunch of hypothetical questions from the FAQ and index these in the vector DB.
Then, for the user prompt, do a two-stage inference. The first stage uses a very small context size and only determines whether the user is asking a question related to items specifically mentioned in the FAQ. You then retrieve the relevant FAQ section or source document only if the score is within a threshold.
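A minimal sketch of that gating idea, with assumptions stated up front: the `embed` function here is a toy bag-of-words stand-in for a real embedding model, the two FAQ questions and section ids are made up, and the small-context LLM check is replaced by a plain similarity threshold (0.5 is an arbitrary value you would tune on your own data).

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical questions generated offline from the FAQ (made-up examples),
# each mapped to the section it was generated from.
INDEX = [
    ("How do I reset my password?", "faq/account"),
    ("What is the refund policy?", "faq/billing"),
]
INDEX_VECS = [(embed(question), section) for question, section in INDEX]

def route(query, threshold=0.5):
    """Return the best-matching FAQ section, or None if nothing is close enough.

    None means: skip retrieval and let the plain LLM answer.
    """
    query_vec = embed(query)
    score, section = max((cosine(query_vec, vec), sec) for vec, sec in INDEX_VECS)
    return section if score >= threshold else None
```

With this toy index, `route("how do i reset my password")` returns `"faq/account"`, while `route("Paraphrase this email for me")` returns `None`, so the email request never touches the vector DB.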
RAG is for KNOWLEDGE - not style. That said, you may use RAG on a guidance input to select instructions and put them into the system prompt. The whole thing is a LOT more complicated than most people make it - a LOT, sadly.
For example, you must GENERATE the questions - you cannot use the raw user input, for 2 reasons. First, let’s be clear - questions should be precise, so you cannot feed in the chat history because it may pollute the vector with irrelevant other stuff. Here are the 2 problems:
* Questions may be reflexive, i.e. refer back to earlier turns. “Where is Microsoft located” is a good question, but a follow-up of “And how many people does the city have” cannot be answered UNLESS you take information from the first question.
* Questions may not be singular questions. People have a tendency to sometimes ask multiple things in one input.
So, on that side you must use an AI to EXTRACT the ACTIVE questions from the chat history: it extracts complete questions and separates them, and you then use each one separately for RAG.
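A rough sketch of that extraction step, assuming you have some function `llm(prompt) -> str` that calls your model. The prompt wording, the function names, and the `fake_llm` stub (which just returns a canned reply so the example runs without a model) are all made up for illustration.

```python
from typing import Callable, List

# Prompt wording is an assumption; tune it for your model.
PROMPT = (
    "Below is a chat history. List every question the user still wants answered.\n"
    "Rewrite each one so it can be understood without the history, "
    "one question per line, no other text.\n\n"
    "{history}"
)

def extract_questions(history: List[str], llm: Callable[[str], str]) -> List[str]:
    """Ask the model for standalone questions and parse one per line."""
    reply = llm(PROMPT.format(history="\n".join(history)))
    questions = []
    for line in reply.splitlines():
        cleaned = line.strip().lstrip("-* ").strip()  # drop bullet markers
        if cleaned:
            questions.append(cleaned)
    return questions

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned reply."""
    return (
        "- Where is Microsoft located?\n"
        "- How many people live in the city where Microsoft is located?\n"
    )

history = [
    "User: Where is Microsoft located",
    "User: And how many people does the city have",
]
for question in extract_questions(history, fake_llm):
    print(question)  # each of these would get its own vector lookup
```

Each extracted question is then embedded and looked up on its own, so the follow-up no longer depends on the earlier turn and the chat history never pollutes the query vector.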
On the other side, preprocessing the documents into Q&A pairs (increasing information density AND not vectorizing the answer) and dream-like sequences that combine similar vectors are other approaches to make the answer more relevant.
But you must STRICTLY separate what you look for. For rewriting an email you CAN use RAG to instruct the AI, but it is a little more complex.
This is one of the reasons I look forward to hopefully soon being able to use 32k and 64k context - I pretty much always fill up 16k.
But won’t it increase the inference time quite a bit? Or are there any GitHub projects to get started with this?
THAT is the cost side and that is a NASTY one. It is not only the financial cost; it is also, as you point out, the response time. And it is NOT just inference, you also have all the lookups that must happen.
But yes. This is where the price is paid, and it shows that we are still a factor of 10 or 20 away from fast interactive complex data AI.
But - do not worry, we get there ;)
No GitHub project that I am aware of - people are very happy with their naive little innovation and never see the real problems in their simplistic tests. It is an 80/20 or higher order problem - MOST things work simply, SOME - ah - well ;) You also get into the “smalltalk” - you do not want to run a full research cycle when the user input is “Thank you, that was helpful” ;)
That said, really, if AI gets 10x faster (and it looks like hardware and software are on the way to more than that), it is easily doable from the time side.
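For the smalltalk case, a cheap pre-check before any lookup already saves a full cycle. This is only a sketch: the phrase list and the `needs_retrieval` name are made up, and a regex like this is a crude stand-in for a small classifier.

```python
import re

# Inputs that open with a pleasantry and contain no question mark are
# treated as smalltalk. The phrase list is illustrative, not exhaustive.
SMALLTALK = re.compile(
    r"^\s*(thanks|thank you|ok|okay|great|hi|hello|bye)\b[^?]*$",
    re.IGNORECASE,
)

def needs_retrieval(user_input: str) -> bool:
    """Return False for obvious smalltalk so the research cycle is skipped."""
    return SMALLTALK.match(user_input) is None

print(needs_retrieval("Thank you, that was helpful"))   # False
print(needs_retrieval("Where is Microsoft located?"))   # True
```

Anything that passes this check still goes through the normal question extraction and lookup, so a false "needs retrieval" only costs time, not correctness.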