2025/07/18

Researchers Find Thousands of Personal Documents in Major AI Training Dataset


“Anything you put online can and probably has been scraped,” concludes AI ethics researcher William Agnew after finding thousands of personal documents in a tiny sample of DataComp CommonPool. The massive dataset, used to train image generation models, likely contains hundreds of millions of private photos, IDs, and résumés scraped from the web. As journalist Eileen Guo notes, the findings expose “the original sin of AI systems built off public data—it’s extractive, misleading, and dangerous.”

MIT Technology Review

Metadata: People: Eileen Guo, William Agnew / Contributors: Greg J. Smith / Image: Examples of identity-related documents found in CommonPool’s small scale dataset, from “A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset” (2025)


