Are there any data cleaning focused LLMs? [also, rant]

AnomalyNexus@alien.top · 3 years ago

georgejrjrjr@alien.top · 3 years ago

Sort-of.

Refuel.ai fine-tuned a 13B Llama 2 for data labeling; it's not hard to imagine applying that here if the data volume is reasonable. Simplest thing that might work: take one paragraph at a time and have the data labeling model answer “Is this boilerplate or content?”
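A rough sketch of that paragraph-at-a-time loop, in case it helps. Everything named here is made up for illustration: `ask_model` is a hypothetical callable you would wire to whatever labeling model you run (it is not Refuel's API), and the prompt wording and the "starts with content" check are assumptions you'd want to tune.

```python
import re
from typing import Callable, List

# Hypothetical prompt; the exact wording is an assumption, not a tested recipe.
PROMPT = (
    "Is the following paragraph boilerplate or content? "
    "Answer with one word.\n\n{paragraph}"
)


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty chunks."""
    parts = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in parts if p]


def keep_content(text: str, ask_model: Callable[[str], str]) -> List[str]:
    """Return the paragraphs the model labels as content.

    `ask_model` takes a prompt string and returns the model's raw answer.
    """
    kept = []
    for para in split_paragraphs(text):
        answer = ask_model(PROMPT.format(paragraph=para)).strip().lower()
        # Anything that is not clearly "content" is dropped.
        if answer.startswith("content"):
            kept.append(para)
    return kept
```

Passing the model in as a plain callable means you can swap in a stub while testing the splitting logic, and only pay for real inference once the plumbing works.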

Another possibility is to use the TART classifier head from Hazy Research: find as many as 256 pairs of boilerplate vs. content, and use only as large a model as you need to get good classification results. If your data volume is large, you would do this for a while to build up a larger corpus of content vs. boilerplate, then train a more efficient classifier with fastText or something similar (probably bigram-based).
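To make the last step concrete, here is a minimal stand-in for the "cheap bigram classifier" idea. To be clear, this is not fastText: it's a small naive Bayes model over unigram and bigram counts, written with the standard library only so it runs anywhere. The class name `BigramNB`, the tokenizer regex, and the smoothing value are all illustrative assumptions; with real volume you would likely reach for fastText's supervised mode instead.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def features(text: str) -> List[str]:
    """Lowercased word unigrams plus adjacent-word bigrams."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class BigramNB:
    """Multinomial naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.counts: Dict[str, Counter] = {}
        self.docs: Counter = Counter()
        self.vocab: set = set()

    def fit(self, examples: Iterable[Tuple[str, str]]) -> "BigramNB":
        for text, label in examples:
            self.docs[label] += 1
            label_counts = self.counts.setdefault(label, Counter())
            for feat in features(text):
                label_counts[feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.docs.values())
        feats = [f for f in features(text) if f in self.vocab]  # skip unseen
        best_label, best_score = "", -math.inf
        for label, label_counts in self.counts.items():
            denom = sum(label_counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.docs[label] / total_docs)  # class prior
            for feat in feats:
                score += math.log((label_counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Labels here would come from the slower LLM/TART pass described above.
train = [
    ("Accept all cookies to continue", "boilerplate"),
    ("Subscribe to our newsletter", "boilerplate"),
    ("All rights reserved", "boilerplate"),
    ("The model was trained on two trillion tokens", "content"),
    ("We fine-tune the classifier on labeled paragraphs", "content"),
]
clf = BigramNB().fit(train)
```

The point of the design is the same as in the reply: the expensive model only has to label enough data to bootstrap this, and the cheap classifier does the bulk filtering afterwards.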