The Pile (dataset)

The Pile is a diverse, open-source dataset of English text, 886 GB in size, created to train large language models (LLMs). EleutherAI constructed it in 2020 and released it publicly on December 31 of that year.

Source: Wikipedia — The Pile (dataset) (CC BY-SA 4.0)
