Hey everyone,
I have a dataset of around 8 million prompt/response pairs collected and curated from a bunch of open-source datasets on HF. I wanted to know what's the best method to dedup this dataset. I'm planning on doing this locally (4090 with 64GB RAM), and I've looked into a few methods, but I wasn't able to use them in my case because of compute constraints.
Please let me know if y'all know an efficient method I can use!
TIA.
You could hash each Q/A pair into a set as you iterate through them and only keep a pair if its hash hasn't been seen yet. If you're looking for fuzzier matching, you could embed the pairs and use cosine similarity, throwing out anything whose nearest neighbor is too close.
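A minimal sketch of the exact-match version, assuming pairs come in as `(prompt, response)` tuples (the normalization step and the separator byte are my choices, not anything standard):

```python
import hashlib

def dedup_exact(pairs):
    """Keep only the first occurrence of each (prompt, response) pair,
    keyed by a SHA-256 hash of the lightly normalized text."""
    seen = set()
    unique = []
    for prompt, response in pairs:
        # Light normalization so trivial whitespace/case differences collapse;
        # \x1f is just an unlikely separator so ("ab","c") != ("a","bc")
        key_text = prompt.strip().lower() + "\x1f" + response.strip().lower()
        h = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append((prompt, response))
    return unique

pairs = [
    ("What is 2+2?", "4"),
    ("what is 2+2? ", "4"),   # collapses to the first pair after normalization
    ("Capital of France?", "Paris"),
]
print(len(dedup_exact(pairs)))  # → 2
```

At 8M pairs the set of hex digests fits comfortably in 64GB of RAM. For the fuzzy version at that scale, brute-force cosine similarity is O(n²), so people usually reach for MinHash LSH (e.g., the datasketch library) or approximate nearest neighbors over embeddings (e.g., FAISS) instead.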