Researchers Find Thousands of Personal Documents in Major AI Training Dataset
“Anything you put online can and probably has been scraped,” concludes AI ethics researcher William Agnew after finding thousands of personal documents in a tiny sample of DataComp CommonPool. The massive dataset, used to train image generation models, likely contains hundreds of millions of private photos, IDs, and résumés scraped from the web. As journalist Eileen Guo notes, the findings expose “the original sin of AI systems built off public data—it’s extractive, misleading, and dangerous.”
Image: Examples of identity-related documents found in CommonPool’s small scale dataset, from “A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset” (2025)
