The Pile: An 800GB Dataset of Diverse Text for Language Modeling, by EleutherAI
The Pile is a large, widely used dataset of diverse text for pre-training large language models.