Hello, I’m still new to this, but I want to focus on using a RAG and Vector DB to store all my personal and work-related data.
I’m seeking a better understanding of how things work.
I’m interested in covering multiple domains, such as “Sales,” “Marketing,” and “Security.”
I plan to use an embedding model to create embeddings and then store them in a Vector Database. When I interact with my LLM, it should retrieve relevant data based on my prompt and feed this into my LLM query.
For instance:
“What’s the command for xyz?” or “Create me a good offer for xyz.”
As I understand it, there will be a backend semantic search for “Create me a good offer.” Based on similarity, and possibly nearest neighbors, it will provide the LLM with context based on my prompt. The system’s prompt for the LLM will then be based on this information to deliver the best possible answer.
Now, the big question is… when creating my dataset to store in the Vector DB, should I label the dataset with tags like [M] or [S] for sales? This way, when I type my prompt and add the label [S], the semantic search can more accurately determine where to look.
Does this approach make sense, or could it lead to more problems than it solves?
I mean i asked GPT4 but thas not the same as someone who maybe have some extended knowledge about this.
Thanks!
Couple of key considerations:
Do you always know the “domain” a given query is related to?
Are there cases where documents outside of the domain of the query could be useful?
If you always know and always only care about documents in the domain then I would use a hard filter. If either is fuzzy I would test it out with and without filters and see how that goes. A good embedding model should be able to match only relevant topics without hard filters but depending on the data adding hard filters could be worth it. Make a representative list of queries you might encounter and check the documents being returned.