I'm passing this along because it is difficult to find really large data sets that truly kick the tires in testing and development, especially when trying out cloud-based machine learning, which can use data sets that would normally overwhelm the resources we have on our laptops. Criteo has released a real-world sample data set of over 1 TB in size that provides over “4 billion examples with binary labels (click vs. no-click) including over 156 billion total (dense) feature-values and over 800 million unique attribute values”.
More information is available in the TechNet article where I discovered it, along with more data set references and explanations. I tried to download the data set, which is divided into 24 files of about 15 GB compressed each (roughly 360 GB compressed in total); however, the bandwidth was really hammered, most likely due to the publicity around the announcement.
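If you do get one of the files down, you don't have to decompress it to disk or load it into memory to start poking at it. Here is a minimal Python sketch that streams through a gzip file line by line and counts clicks. It assumes each decompressed line is tab-separated with the binary label (1 = click, 0 = no-click) in the first column; that layout is an assumption on my part, so check the data set's documentation first. The file name in the usage note is hypothetical as well.

```python
import gzip
import io


def count_clicks(lines):
    """Count rows and clicked rows from an iterable of text lines.

    Assumption: fields are tab-separated and the first field is the
    binary label ("1" = click, "0" = no-click).
    """
    total = 0
    clicks = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label = line.split("\t", 1)[0]
        total += 1
        if label == "1":
            clicks += 1
    return total, clicks


def count_clicks_in_gzip(path_or_file):
    """Stream a gzip-compressed file without decompressing it to disk.

    gzip.open in text mode decompresses as it reads, so memory use
    does not grow with the size of the file.
    """
    with gzip.open(path_or_file, "rt", encoding="utf-8") as f:
        return count_clicks(f)


if __name__ == "__main__":
    # Tiny made-up sample in the assumed layout, just to show the call.
    sample = "1\t5\ta\n0\t3\tb\n0\t\tc\n"
    print(count_clicks(io.StringIO(sample)))
```

With a real download you would call something like `count_clicks_in_gzip("day_0.gz")`, where `day_0.gz` stands in for whatever the file is actually named. A full pass over a 15 GB file will still take a while, but it should run on a laptop.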
Criteo’s website has more information. It is always great to see companies giving the community this kind of help.
Let me know what you think in the comments.
** Note: I have another article with free datasets, Great Free Datasets for Your BI Testing.