RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models

APaperADay@alien.top · 2 years ago

UserMinusOne@alien.top · 2 years ago

How much free space is required to do a “git clone …”?

Is there a better way to download the data without needing additional space for the history (.git)? If so, how big is the whole dataset?

Given the current developments: Maybe someone should start collecting raw data and serving it as torrents. … Just in case.