Today we run a query, the LLM answers. The evolution of this was going multi-modal, and being able to input or output other forms of data, like video and sound
But still, we query a model, the LLM answers. Same with agents, only then we have multiple queries working in a system, often assisted by function calling and automation in order to put thought into action
We do it this way because it’s the most obvious solution. The easiest one to implement, or rather the only one we can implement at this time
At this time
In the future we will get more efficient models. Models that can output in the orders of 100+ t/s. Models that can run so fast that they’re essentially able to simulate continuous runtime
And that’s not all. With a model this fast you could also implement multi-query systems which runs backend to handle any agent related tasks or even for assisting in reasoning by having lightning fast discussions with the frontend before giving any outputs
tl;dr LLMs are just waiting for us to give them more juice so that they’re fast enough to simulate continuous runtime
Maybe nobody is talking about this because it is really obvious? You literally say it in your post title.