Note: This sample is no longer available, but the original dataset is here: Download Terabyte Click Logs – Criteo Labs
I'm passing this along because it is hard to find really large data sets that truly kick the tires on testing and development, especially when trying out cloud-based machine learning, which can handle data sets that would normally overwhelm the resources on our laptops. Criteo has released a real-world sample data set, here, of over 1 TB in size that provides over “4 billion examples with binary labels (click vs. no-click) including over 156 billion total (dense) feature-values and over 800 million unique attribute values”.
More information is available here, in the TechNet article where I discovered it, along with more data set references and explanations. I tried to download the data set, which is divided into 24 files of about 15 GB compressed each; however, the bandwidth was really hammered, most likely due to the publicity around the announcement.
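If you do manage to pull down one of the compressed files, you probably don't want to unpack it fully or load it into memory on a laptop. Here is a minimal Python sketch of streaming a gzip-compressed file line by line and counting clicks vs. no-clicks. Note the assumptions: the post doesn't describe the file layout, so I'm assuming tab-separated lines with the label ("1" for click, "0" for no-click) as the first field. The `count_labels` function and the sample rows are made up for illustration, and the tiny in-memory file just stands in for a real download.

```python
import gzip
import io


def count_labels(stream):
    """Count click / no-click rows in a gzip-compressed, tab-separated byte stream.

    Assumption (not confirmed by this post): the label is the first field
    on each line, "1" for click and "0" for no-click.
    """
    clicks = 0
    total = 0
    # gzip.open accepts a file object and decompresses as it is read,
    # so the loop below handles one line at a time.
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            label = line.split("\t", 1)[0].strip()
            if not label:
                continue  # skip blank lines
            total += 1
            if label == "1":
                clicks += 1
    return clicks, total


# Tiny in-memory stand-in for one compressed file (hypothetical rows).
sample_rows = "1\t5\ta1b2\n0\t3\tc3d4\n0\t\te5f6\n"
sample_file = io.BytesIO(gzip.compress(sample_rows.encode("utf-8")))

clicks, total = count_labels(sample_file)
print(f"{clicks} clicks out of {total} rows")
```

For a real file you would pass an opened binary file (or the path itself to `gzip.open`) instead of the in-memory sample. The idea is the same: the data is read as a stream, so memory use stays small no matter how big the file is.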
It's always great to see companies giving the community this kind of help.
Let me know what you think in the comments.