Amnesty International Finds Major AI Companies Scraping Private Data Unlawfully

A new report from Amnesty International has named OpenAI, Google Gemini, and Midjourney as participants in what the human rights organization describes as unlawful data scraping - the large-scale, largely unconsented harvesting of personal information from across the public internet to train generative AI systems. The findings land at a moment when public anxiety about digital privacy is running unusually high, and they offer the most detailed human-rights framing yet of a practice the AI industry has treated as a technical necessity rather than an ethical liability. The implications stretch well beyond copyright disputes and into territory that most ordinary internet users have not yet fully reckoned with.

What Data Scraping Actually Means for Ordinary People

The phrase "data scraping" can sound abstract until you consider what it means in practice. When someone posts a photograph on social media - a birthday dinner, a holiday on the coast, a family gathering - that image may be publicly accessible. That does not mean the person posting it consented to it being ingested into a commercial AI training dataset, reproduced in fragments, or used to generate content that resembles real people and real places without attribution or awareness.

Generative image models learn by processing enormous volumes of existing images. A model trained on scraped personal photographs can, in response to a sufficiently specific prompt, produce output that closely resembles real individuals or real situations - not because the system "remembers" a specific image, but because patterns derived from that image are baked into the model's weights. The person in the original photograph has no knowledge this has occurred, no means of opting out retroactively, and, in most jurisdictions, no clearly established legal remedy.

Text-based AI systems present a different but equally serious problem. Chatbots trained on scraped data may hold detailed inferences about a person's health concerns, political views, financial situation, or personal relationships - information assembled from forum posts, reviews, comment sections, and public profiles the person wrote years ago without any expectation that it would become training material for a commercial product. The concern is not just what the AI knows in a technical sense, but what can be inferred, reconstructed, or weaponized from that knowledge.

The Advertising Problem Nobody Is Talking About Loudly Enough

Several major AI chatbot platforms have begun integrating advertising into their interfaces. At present, sponsored content is labeled on-screen and presented as transparent. But the conditions that make AI advertising uniquely dangerous are already in place: a system with extensive behavioral and personal data, a conversational format that lowers a user's critical defenses, and a commercial incentive to keep users engaged and spending. The combination is not theoretical - these are the same ingredients that made targeted social media advertising so effective, only with a more intimate interface and far richer underlying data.

There is also a downstream risk that extends beyond corporate platforms. Scraped personal data, once it circulates in training corpora or in data brokers' pipelines, becomes available to anyone with the technical means to exploit it. A fraudster armed with detailed personal profiles - cross-referenced across old social posts, images, and behavioral data - has a significant advantage over one relying on conventional social engineering tactics. When AI-assisted fraud eventually occurs at scale, the question of where the enabling data came from will matter enormously, both for accountability and for prevention.

Why VPN Searches Have Surged - and What a VPN Can and Cannot Do

Public concern about these issues is measurable. Searches for VPN services reached their highest recorded level in February of this year - a 75% increase over the same month in the prior year, and more than triple the volume of an average month in 2010. The trend reflects a broader and legitimate anxiety: age-verification mandates for online content in a growing number of countries, AI data practices, and revelations about how major platforms share user data with third-party advertisers have combined to push privacy concerns into mainstream awareness.
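
To put the two search figures on a common scale, the arithmetic can be sketched as follows. The baseline of 100 for last February is an arbitrary index chosen for illustration, not a number from the search data; only the 75% increase and the "more than triple" comparison come from the figures quoted above.

```python
# Hypothetical index: last February's search volume is set to 100.
# This baseline is an assumption for illustration, not reported data.
feb_last_year = 100.0

# A 75% increase means this February is 1.75 times last February.
feb_this_year = feb_last_year * (1 + 0.75)

# "More than triple an average 2010 month" means the 2010 monthly
# average must be below one third of this February's volume.
avg_2010_upper_bound = feb_this_year / 3

# It follows that last February was already more than this many
# times the 2010 monthly average.
last_year_vs_2010_lower_bound = feb_last_year / avg_2010_upper_bound

print(f"This February (index): {feb_this_year:.1f}")
print(f"2010 monthly average is below: {avg_2010_upper_bound:.1f}")
print(f"Last February vs. 2010 average, at least: {last_year_vs_2010_lower_bound:.2f}x")
```

On this index, this February sits at 175, the 2010 monthly average is below roughly 58.3, and last February was therefore already more than about 1.7 times the 2010 average. In other words, most of the long-run growth predates the latest surge.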

A VPN - a virtual private network - routes a user's internet traffic through an encrypted tunnel to a server operated by the VPN provider, masking the user's IP address and making it harder for websites, internet service providers, and network-level observers to build a profile of their browsing activity. For many threat models, this is genuinely useful. It limits passive surveillance, protects data on public or untrusted networks, and can reduce the granularity of behavioral tracking tied to a specific device or location.
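
The routing described above can be illustrated with a toy model of who sees what. This is a simplified sketch, not a networking implementation: the `View` class, the function names, and the addresses (taken from IP ranges reserved for documentation) are all hypothetical, and real traffic involves many more observers and identifiers than an IP address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """What one observer can learn about a single request (toy model)."""
    source_ip: str    # address the observer sees the traffic coming from
    destination: str  # where the observer sees the traffic going


def direct_connection(client_ip: str, site: str) -> dict:
    # Without a VPN, both the ISP and the website see the user's real IP,
    # and the ISP can see which site is being contacted.
    return {
        "isp": View(source_ip=client_ip, destination=site),
        "website": View(source_ip=client_ip, destination=site),
    }


def vpn_connection(client_ip: str, vpn_ip: str, site: str) -> dict:
    # With a VPN, the ISP sees only an encrypted tunnel to the VPN server,
    # and the website sees the VPN server's address instead of the user's.
    # The VPN provider itself can see both ends of the connection.
    return {
        "isp": View(source_ip=client_ip, destination=vpn_ip),
        "website": View(source_ip=vpn_ip, destination=site),
        "vpn_provider": View(source_ip=client_ip, destination=site),
    }


# Hypothetical addresses from ranges reserved for documentation.
USER_IP = "203.0.113.7"
VPN_IP = "198.51.100.20"

for name, view in vpn_connection(USER_IP, VPN_IP, "example.org").items():
    print(f"{name}: sees {view.source_ip} -> {view.destination}")
```

In this model the VPN provider is the one party that still sees both the user's address and the destination, so using a VPN moves trust to the provider rather than removing the need for it.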

What a VPN cannot do is undo scraping that has already happened. If personal photographs or text have already been ingested into an AI training set, no amount of encrypted browsing changes that. A VPN is a protective measure for data generated from this point forward - not a remedy for what has already left a person's control. That distinction matters, because a significant portion of the information Amnesty International is concerned about was collected years before most people understood what AI training pipelines even were.

The more durable protections will require legal and regulatory action: enforceable consent requirements for training data, meaningful rights of erasure or opt-out, and liability frameworks that treat large-scale data harvesting as a rights issue rather than a licensing one. Several data protection regimes - including the EU's General Data Protection Regulation - already contain provisions that, in principle, apply to AI training data. Whether regulators will enforce those provisions against companies of this scale remains an open and consequential question.

The Bigger Picture: Privacy as a Structural Problem

What the Amnesty International report makes clear is that the erosion of digital privacy is not a series of isolated incidents but a structural feature of how the most commercially successful AI systems have been built. Training on internet-scale data without meaningful consent was a deliberate design choice, made because it was technically convenient and legally untested rather than because it was ethically sound. The industry bet that regulation would lag behind deployment - and, so far, that bet has largely paid off.

For individuals, the practical advice remains what it has been: limit what you share publicly, review the privacy settings on the platforms you use, and consider using a reputable VPN as a baseline measure. Services like NordVPN, Proton VPN, Surfshark, ExpressVPN, and CyberGhost differ in their jurisdiction, logging policies, and technical implementation - differences worth examining before committing to one. But personal precautions, however sensible, address symptoms rather than causes. The Amnesty report is valuable precisely because it frames AI data harvesting as a human rights issue, not a consumer inconvenience - a framing that will matter when courts and legislatures eventually decide how much of this was permissible all along.