I have been thinking about this for a while -- does anyone know how feasible this is? Basically applying some sort of LoRA on top of existing models to give them vision capabilities -- making them multimodal.

  • mcmoose1900@alien.top
    1 year ago

    There’s already more than one image-ingestion model out there. Several for llama/mistral.
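
    For a rough idea of how those vision adapters are usually wired up: a frozen image encoder produces patch embeddings, a small trainable projector maps them into the LLM's token-embedding space, and the projected "visual tokens" get spliced into the prompt. The dimensions and class names below are made up for illustration (not any specific model's sizes); it's just a sketch of the general pattern, assuming a LLaVA-style two-layer MLP projector:

    ```python
    import torch
    import torch.nn as nn

    # Hypothetical dimensions, chosen for illustration only.
    VISION_DIM = 768   # output width of a frozen image encoder (e.g. a ViT)
    LLM_DIM = 4096     # hidden size of the frozen LLM

    class VisionProjector(nn.Module):
        """Two-layer MLP that maps image-patch embeddings into the
        LLM's token-embedding space so they can be mixed into the prompt."""
        def __init__(self, vision_dim: int, llm_dim: int):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
            return self.proj(patch_embeds)

    # Stand-in for an image: 256 patch embeddings from a frozen encoder.
    patches = torch.randn(1, 256, VISION_DIM)
    projector = VisionProjector(VISION_DIM, LLM_DIM)
    visual_tokens = projector(patches)  # shape: (1, 256, LLM_DIM)

    # In a real setup these would be concatenated with the text token
    # embeddings and fed to the frozen LLM; typically only the projector
    # (plus optional LoRA adapters on the LLM) is trained.
    print(visual_tokens.shape)
    ```

    So the "LoRA for vision" idea isn't far off from how it's actually done -- the base model stays frozen and only a small adapter/projector gets trained.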

    If you are talking about generating images, I dunno about that. Some people hook up LLMs to prompt Stable Diffusion, but that’s not really the same thing.