Hey everyone,
I have a dataset of around 8 million prompt/response pairs collected and curated from a bunch of open-source datasets on HF. I wanted to know what's the best method to dedup this dataset. I'm planning on doing this locally (4090 with 64GB RAM), and I've looked into a few methods, but I wasn't able to use them in my case because of compute constraints.
Please let me know if y'all know an efficient method I can use!
TIA.
You could hash each Q/A pair into a set as you iterate through them and only keep a pair if its hash hasn't been seen yet. If you're looking for fuzzier matching, you could embed the pairs and use cosine similarity, throwing out anything whose nearest neighbor is too close.
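A minimal sketch of the exact-match version, assuming pairs come in as `(prompt, response)` tuples (the normalization step and the separator byte are my choices, not anything standard):

```python
import hashlib

def dedup_exact(pairs):
    """Keep only the first occurrence of each (prompt, response) pair,
    keyed by a SHA-256 hash of the lightly normalized text."""
    seen = set()
    unique = []
    for prompt, response in pairs:
        # Light normalization so trivial whitespace/case differences collapse;
        # \x1f is just an unlikely separator so ("ab","c") != ("a","bc")
        key_text = prompt.strip().lower() + "\x1f" + response.strip().lower()
        h = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append((prompt, response))
    return unique

pairs = [
    ("What is 2+2?", "4"),
    ("what is 2+2? ", "4"),   # collapses to the first pair after normalization
    ("Capital of France?", "Paris"),
]
print(len(dedup_exact(pairs)))  # → 2
```

At 8M pairs the set of hex digests fits comfortably in 64GB of RAM. For the fuzzy version at that scale, brute-force cosine similarity is O(n²), so people usually reach for MinHash LSH (e.g., the datasketch library) or approximate nearest neighbors over embeddings (e.g., FAISS) instead.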