I have been thinking about this for a while -- does anyone know how feasible this is? Basically just applying some sort of "LoRA" on top of models to give them vision capabilities -- making them multimodal.
There's more than one image-ingestion model already; several exist for llama/mistral. Rough sketch of how those usually work below.
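Not any particular project's code, just a minimal sketch of the LLaVA-style recipe that most of these follow: a frozen vision encoder's patch features get projected into the LLM's embedding space by a small trained projector, and LoRA adapters go on the language model. Model names and hyperparameters here are illustrative picks, and it assumes `transformers` + `peft`:

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel
from peft import LoraConfig, get_peft_model

# Frozen vision tower -- only the projector and LoRA weights train.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
vision.requires_grad_(False)

llm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
llm_hidden = llm.config.hidden_size

# LoRA on the attention projections; r/alpha are illustrative values.
llm = get_peft_model(llm, LoraConfig(r=16, lora_alpha=32,
                                     target_modules=["q_proj", "v_proj"]))

# Small trainable projector: CLIP patch features -> LLM embedding width.
projector = nn.Sequential(
    nn.Linear(vision.config.hidden_size, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)

def embed_image(pixel_values):
    """Turn an image into a sequence of pseudo-token embeddings."""
    # ViT-L/14 at 224px gives (batch, 257, 1024): 256 patches + CLS.
    patches = vision(pixel_values=pixel_values).last_hidden_state
    return projector(patches)  # (batch, 257, llm_hidden)

# Training then prepends embed_image(...) to the text token embeddings
# and runs the LLM forward pass on the concatenated sequence.
```

So the answer to the feasibility question is basically yes: the LLM weights barely move (LoRA only), and the vision side is a frozen off-the-shelf encoder plus a small projector.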
If you are talking about generating images, I dunno about that. Some people hook up LLMs to prompt Stable Diffusion, but that's not really the same thing -- roughly like this:
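For completeness, a toy sketch of that glue setup: the LLM only writes text, and a separate diffusion model renders it, so nothing here makes the LLM itself multimodal. Assumes `transformers` + `diffusers` and a GPU; model names are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import pipeline

# Step 1: a text-only LLM expands a rough idea into an image prompt.
llm = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
out = llm("Write a one-sentence image prompt about a lighthouse at dusk.",
          max_new_tokens=40, return_full_text=False)
image_prompt = out[0]["generated_text"].strip()

# Step 2: a completely separate image model turns that text into pixels.
sd = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
sd(image_prompt).images[0].save("out.png")
```

The two models never share weights or embeddings, which is why it's not the same thing as a multimodal model.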