Common Crawl
The Common Crawl Foundation (Common Crawl) is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. The data is hosted on Amazon Web Services and is free to access, but users who store or process it may incur their own storage and compute costs.