What do you want to happen when the total chat reaches 8k? Because there the server has to make a choice it can keep adding more context so it slows down, it can simply cut off the first messages but then it will for example forget its own name, or it could for example (this is a method I use but it costs interference time as you ask a 2nd question behind the scenes) ask the model to summarize the first 4K of the context so it will retain some context and still retain speed.
I would be very wary of such an application. There is no current model which does not hallucinates at times. And certainly if you are asking for a factual analysis.
I am using an llm to extract data from some texts, but for every answer it gives I do a simple search through the input text to first see if the text exists in the input text. Because if it does not then it cannot be true. If it exists it doesn’t mean it is correct or anything like that. And that simple check goes wrong on a finetuned model about 1 in a 100 answers.
Or look at the hg leaderboards, if it says 98% on a test then it basically says even after special training on known data it still has 2% percent wrong and now you want to throw unknown data with an unknown question at it.
Sometimes it will return rubbish which you can filter out but sometimes it will just output 23 instead of 22 which was in your input text ( or there was 23 for a different fact in your input text) and these are very hard to filter out and they don’t matter with most applications. But if you want to produce analyzes or facts than these are simply wrong