APaperADay@alien.topB to

LocalLLaMAEnglish · 2 years ago

RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models

Blog: https://together.ai/blog/redpajama-data-v2

Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2

GitHub: https://github.com/togethercomputer/RedPajama-Data

Description:

RedPajama-V2 is an open dataset for training large language models. It includes over 100B text documents drawn from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30B documents additionally come with quality signals, and 20B are deduplicated.
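For anyone wondering what "quality signals" and "deduplicated" could mean in practice, here is a rough sketch of the general idea, not the actual RedPajama pipeline. The records, the field names (`text`, `word_count`), the `min_words` threshold, and the helper functions are all made up for illustration; the real dataset has its own schema and its own deduplication method.

```python
import hashlib

# Hypothetical in-memory records. The field names ("text", "word_count")
# are illustrative assumptions, not the dataset's actual schema.
docs = [
    {"id": "a", "text": "Open datasets help train language models.", "word_count": 6},
    {"id": "b", "text": "open datasets help  train language models.", "word_count": 6},
    {"id": "c", "text": "Buy now", "word_count": 2},
    {"id": "d", "text": "CommonCrawl snapshots are processed into documents.", "word_count": 6},
]


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def content_hash(text):
    """SHA-256 of the normalized text, used as an exact-duplicate key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def filter_and_dedup(records, min_words=5):
    """Drop records below a (hypothetical) quality threshold, then keep
    only the first record seen for each content hash."""
    seen = set()
    kept = []
    for rec in records:
        if rec["word_count"] < min_words:
            continue  # fails the illustrative quality signal
        key = content_hash(rec["text"])
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        kept.append(rec)
    return kept


kept_ids = [rec["id"] for rec in filter_and_dedup(docs)]
print(kept_ids)
```

Here `b` is dropped as a duplicate of `a` once case and whitespace are normalized, and `c` is dropped by the word-count threshold. Shipping signals alongside the documents means everyone can pick their own thresholds instead of accepting one fixed filter.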

Maykey@alien.topB
2 years ago

20B documents that are deduplicated.

I wonder if we'll see an even slimmer version.