• 9wR8xO@alien.topB
    1 year ago

    Okay, what front-end can I use to run these types of multimodal models?

  • GeraltOfRiga@alien.topB
    1 year ago

    This is kinda nuts (first time I've tried an LLM + vision).

    Tried it with a first-person shooter screenshot with an enemy on screen. Asked it to give me the 2D coordinates of the enemy and it did, precisely.

  • yahma@alien.topB
    1 year ago

    Would love to use this for handling remote security camera footage.

    Tried it with LLaVA, with little success. Has anyone successfully applied any of the open vision models to security use cases?

    • fallingdowndizzyvr@alien.topB
      1 year ago

      I just think you have to set proper expectations. I use LLaVA with my security cameras and it does what I want, which is to tell me when something interesting is happening, like when it sees someone. LLaVA gave me this from one of my security cameras earlier this morning:

      The image features a person walking on a street, captured through a fisheye lens, which distorts the perspective of the scene. The person appears to be carrying a bag, possibly a backpack, while walking down the sidewalk.

      Which IMO is very useful.

  • metalman123@alien.topB
    1 year ago

    This style of captioning could be amazing for text-to-image datasets, and I wouldn’t be surprised to see them take a jump in quality as well.

  • pseudonerv@alien.topB
    1 year ago

    Ha, they used data generated by GPT-4V. It’s not a surprise that it got better than LLaVA 7B, and is comparable or slightly better than LLaVA 13B.

    No innovation needed otherwise!

    The ShareGPT4V-7B model follows the design of LLaVA-1.5 [30], including three integral components: (1) A vision encoder utilizing the CLIP-Large model [45], with a resolution of 336×336 and a patch size of 14, converting input images into 576 tokens. (2) A projector, which is a two-layer multi-layer perceptron (MLP), is introduced to connect the vision and language modalities. (3) A LLM, based on the open-source Vicuna-v1.5 [8], derived from LLaMA2 [53].
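
    For anyone wondering where the 576 comes from, it is just the patch grid: 336 / 14 = 24 patches per side, and 24 × 24 = 576. A quick sketch of that arithmetic (the function name `vit_patch_tokens` is made up for illustration, and it assumes a square image and counts only patch tokens, not any extra class token):

```python
def vit_patch_tokens(image_size: int, patch_size: int) -> int:
    """Count the patch tokens a ViT-style encoder yields for a square image.

    Hypothetical helper for illustration only; it ignores any class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 336 // 14 = 24
    return patches_per_side * patches_per_side   # 24 * 24 = 576


# Values quoted from the paper excerpt above: 336x336 input, patch size 14.
print(vit_patch_tokens(336, 14))
```

    So every image costs the LLM 576 positions of context before any text is added.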

    • justletmefuckinggo@alien.topB
      1 year ago

      I’m new here, but is this true multimodality, or is it the LLM communicating with a vision model?

      And what exactly are those 4 models being benchmarked on here?

    • lakolda@alien.topB
      1 year ago

      This isn’t comparing against the 13B version of LLaVA. I’d be curious to see that.