Prompt like:

Extract the company names from the texts below and return as an array

– ["Google", "Meta", "Microsoft"]

  • AsliReddington@alien.topB · 1 year ago

    Yeah man, just use LangChain + a Pydantic class, or the Guidance lib by MS, with Mistral Instruct or Zephyr, and you're golden.
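    The Pydantic/Guidance route boils down to enforcing a schema on the model's output. A minimal stdlib-only sketch of just the validation step (the model call is omitted; `raw` stands in for a hypothetical model response):

```python
import json

def parse_companies(raw: str) -> list[str]:
    """Validate that the model's raw output is a JSON array of strings."""
    data = json.loads(raw)
    if not isinstance(data, list) or not all(isinstance(x, str) for x in data):
        raise ValueError("expected a JSON array of strings")
    return data

# hypothetical model response
raw = '["Google", "Meta", "Microsoft"]'
print(parse_companies(raw))  # -> ['Google', 'Meta', 'Microsoft']
```

    In practice Pydantic or Guidance does this (and more) for you; the point is that the schema check, not the model, is what guarantees well-formed output.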

  • LoSboccacc@alien.topB · 1 year ago

    Not to be an ass, but what's wrong with extracting the keywords and then going .split()?
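    For the record, if the model is only asked for a comma-separated line, the post-processing really is one line (the reply string here is made up):

```python
reply = "Google, Meta, Microsoft"  # hypothetical model reply
companies = [name.strip() for name in reply.split(",")]
print(companies)  # -> ['Google', 'Meta', 'Microsoft']
```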

  • DreamGenX@alien.topB · 1 year ago

    On top of what others said, make sure to include a few few-shot examples in your prompt, and consider using constrained decoding (ensuring you get valid JSON for whatever schema you provide; see pointers on how to do it with llama.cpp).

    For few-shotting chat models, append fake previous turns, like:

    System: <system prompt>
    User: <few-shot input 1>
    Assistant: <few-shot output 1>
    ...
    User: <few-shot input N>
    Assistant: <few-shot output N>
    User: <actual input>
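    In chat-API terms, those fake previous turns are just extra entries in the messages list. A sketch (system prompt and example texts are made up):

```python
def few_shot_messages(text: str) -> list[dict]:
    """Build a chat transcript whose fake prior turns act as few-shot examples."""
    return [
        {"role": "system", "content": "Extract company names; reply with a JSON array."},
        # fake previous turn (made-up example input/output pair)
        {"role": "user", "content": "Google and Meta announced a partnership."},
        {"role": "assistant", "content": '["Google", "Meta"]'},
        # the actual request goes last
        {"role": "user", "content": text},
    ]

msgs = few_shot_messages("Microsoft released a new model.")
```

    The model sees the fake turns as its own earlier answers, which anchors both the task and the output format.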
    
  • xelldev13@alien.topB · 1 year ago

    You can do this with an NER model like BERT; it's much faster, but it only does entity recognition.

    • name_is_unimportant@alien.topB · 1 year ago

      Yeah, Named Entity Recognition with BERT works very well, provided that you have a good dataset. Another limitation is that it can only handle 512 tokens at a time.
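      The 512-token limit is usually worked around by splitting the input into overlapping windows, running NER per window, and merging the entities. A stdlib sketch of just the windowing step (the token list is whatever your tokenizer produces; window sizes are illustrative):

```python
def windows(tokens: list, max_len: int = 512, overlap: int = 50) -> list[list]:
    """Split a token sequence into overlapping windows of at most max_len tokens."""
    step = max_len - overlap
    return [tokens[i:i + max_len] for i in range(0, len(tokens), step)]

# e.g. 1000 tokens -> windows starting at 0, 462, 924
chunks = windows(list(range(1000)))
```

      The overlap reduces the chance of an entity being cut in half at a window boundary.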

  • laveshnk@alien.topB · 1 year ago

    NLTK / Beautiful Soup should have some tools to do such things. I guess it's NER.

    For the record, I wouldn't advise using an LLM for this task, unless you can afford to waste VRAM.

  • _omid_@alien.topB · 1 year ago

    I use mistral-7b-openorca.q8_0, and this is my prompt:

    
    system: You are a helpful machine. Always answer with the THREE most important keywords from the information provided to you between BEGININPUT and ENDINPUT. Here is an example:
    user: BEGININPUT A tree is planted for each contract. Your contribution will be invested 100% sustainably! ENDINPUT
    assistant: [contract, tree, sustainable]
    user:
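    Assembled programmatically, that prompt is just string formatting around the BEGININPUT/ENDINPUT markers. A sketch (the model call itself is omitted; the wrapper function is hypothetical):

```python
SYSTEM = (
    "You are a helpful machine. Always answer with the THREE most important "
    "keywords from the information provided to you between BEGININPUT and ENDINPUT."
)

def build_prompt(text: str) -> str:
    """Wrap the user text in the BEGININPUT/ENDINPUT markers the prompt expects."""
    return f"system:{SYSTEM}\nuser: BEGININPUT {text} ENDINPUT\nassistant:"

prompt = build_prompt("A tree is planted for each contract.")
```

    The explicit markers make it unambiguous to the model which part of the prompt is the text to summarize, rather than an instruction.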
    
  • BrainSlugs83@alien.topB · 1 year ago

    Why do you need an LLM for this? Just use any NER model. It will be blazing fast and run locally.

    • LPN64@alien.topB · 1 year ago

      Because, say you train your BERT model to do this: you'll have a specific, limited set of entity classes trained on a specific type of document.

      It will work on Wikipedia articles but not on transcripts from your local police station.

      Using an LLM lets the task inherit the LLM's broad knowledge.