Hi folks,

Simple question, really: what model (fine-tuned or otherwise) have you found that can extract data from a bunch of text?

I’m happy to fine-tune, so if anyone has had success there, I’d really appreciate some pointers in the right direction.

Really looking for a starting point here. I’m aware of the DETR class of models and how Microsoft trained Table Transformer on DETR. Wondering if something similar can be done with Llama 2 and similar models?

P.S. I cannot use GPT because of sensitive PII data.

  • Iamisseibelial@alien.topB
    1 year ago

    If it’s sensitive, why not use Claude to get a baseline and some examples of what you want? They are SOC 2 and HIPAA compliant, so unless you’re dealing with national security stuff you should be good to go there. Get enough examples done that way, then train a specialized model.

    • sandys1@alien.topOPB
      1 year ago

      It has nothing to do with national security. It has to do with audit and compliance. SOC 2 and HIPAA are not the only compliance artifacts out there. There are many others, including cross-national ones like Singapore’s PDPA.

      This is why OpenAI was FORCED to offer custom models as a service.

      Again, I don’t want this thread to devolve into a regulatory debate, but I have fought long, extended battles in court on these topics: these things are not possible.

      • Iamisseibelial@alien.topB
        1 year ago

        Ohh, that’s absolutely fair, especially when dealing with Singapore, SK or Japan. APPI and PIPA are a pain in the ass to deal with. That said, making fake versions of the data to use as examples is likely the best route to actually being able to train your own model.
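
For anyone wanting to try the fake-data route, here is a rough sketch of what it could look like, using only the Python standard library. Everything in it is an assumption made for illustration: the field names, the value lists, the sentence templates, and the helper names `make_example` and `make_dataset` are all hypothetical, and real documents would need far more varied templates. The idea is just to pair each synthetic text with the fields it contains, so the pairs can serve as fine-tuning examples for an extraction model without touching real PII.

```python
import json
import random

# Hypothetical values, invented purely for illustration (no real people or accounts).
FIRST_NAMES = ["Aiko", "Wei", "Priya", "Min-jun", "Sarah"]
LAST_NAMES = ["Tanaka", "Lim", "Nair", "Kim", "Jones"]
CITIES = ["Singapore", "Seoul", "Tokyo", "Osaka", "Busan"]

# Hypothetical sentence templates; real data would need many more shapes.
TEMPLATES = [
    "Customer {name} from {city} called on {date} about account {account}.",
    "On {date}, account {account} was updated for {name}, who lives in {city}.",
]


def make_example(rng):
    """Build one (text, labels) pair from randomly chosen fake field values."""
    fields = {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "city": rng.choice(CITIES),
        "date": f"2023-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
        "account": f"AC{rng.randint(0, 999999):06d}",
    }
    text = rng.choice(TEMPLATES).format(**fields)
    return {"text": text, "labels": fields}


def make_dataset(n, seed=0):
    """Generate n examples; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


if __name__ == "__main__":
    # Print a few examples as JSON Lines, a common format for fine-tuning data.
    for example in make_dataset(3):
        print(json.dumps(example, ensure_ascii=False))
```

Each output line holds the input text plus the target fields, so the same file can be turned into prompt/completion pairs for whatever local model ends up being fine-tuned.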