"Cannibalization" of Data by LLMs
Data Feedback Loops and Bathing in Another's Bathwater
How LLMs Use Your Data for Others and Use Others' Data on You
Large Language Models (LLMs) represent the “first wave” of the great AI revolution.
Wide and vast, general and diverse, large. And recycling?
Beneath their impressive capabilities lies a troubling practice: the cannibalization of user data.
When you interact with an LLM—whether by typing a prompt, uploading a document, or sharing personal details—you may unknowingly contribute to a system that reuses your data to serve others while simultaneously feeding you outputs derived from someone else’s input.
It’s akin to taking a bath in someone else’s bathwater, a murky and unsettling exchange where privacy, accuracy, and ownership are eroded.
The Data Feedback Loop
When you submit a prompt to an LLM, you’re not just requesting a response—you’re handing over raw material for the model’s improvement.
Many AI providers, especially those offering free or low-cost services, include clauses in their terms of service (TOS) or end user license agreement (EULA) allowing them to use user inputs and uploaded data to refine their models. This means your carefully crafted prompts, personal queries, or even sensitive documents can become part of the training dataset, stripped of context and repurposed to generate responses for others.
This feedback loop creates a cycle where your data trains the model, which then serves responses to other users, who in turn contribute their data to the same system.
It’s a form of digital cannibalism: your input is consumed, processed, and regurgitated as part of a collective pool, often without your explicit consent or knowledge. The result? The answers you receive may be flavored by someone else’s input—potentially outdated, biased, or irrelevant—while your own contributions are used to shape responses for strangers.
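To make the loop concrete, here is a deliberately tiny sketch. It is not how any real provider trains a model; `SharedPoolModel` and its word-counting "response" are hypothetical stand-ins, chosen only to show that each answer is built from what earlier users typed, and that each new prompt is absorbed for the next user.

```python
from collections import Counter


class SharedPoolModel:
    """Toy stand-in for a model refined on pooled user inputs (hypothetical)."""

    def __init__(self):
        self.pool = []  # (user_id, text) pairs collected from every user

    def respond(self):
        # The "response" is simply the most frequent word in the shared pool.
        words = Counter(w for _, text in self.pool for w in text.lower().split())
        return words.most_common(1)[0][0] if words else None

    def submit(self, user_id, text):
        # Answer from what earlier users contributed, then absorb this input.
        answer = self.respond()
        self.pool.append((user_id, text))
        return answer

    def contributors(self):
        # Everyone whose input now shapes future responses.
        return sorted({user_id for user_id, _ in self.pool})


model = SharedPoolModel()
first = model.submit("alice", "contract breach contract remedy")  # empty pool: None
second = model.submit("bob", "lease termination notice")          # built from alice's words
print(first, second, model.contributors())
```

Bob asked about a lease and got back a word that came from Alice's prompt, and his own prompt is now in the pool waiting for whoever comes next. That is the loop in miniature.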
In law especially, where probably more than 9 out of 10 users are non-lawyers, this represents a grave risk to the integrity of law-oriented outputs. (NOT TO MENTION THAT SOME LAWYERS ARE POTENTIALLY UPLOADING CLIENT DOCUMENTS OR CONFIDENCES!)
The Bathwater Analogy
Imagine soaking in a bathtub filled with water previously used by countless others.
You wouldn’t know whose dirt, soap, or germs you’re sharing the space with, and yet, this is precisely how many LLMs operate.
The data you provide—whether a casual question or a proprietary document—gets dumped into a communal pool. The model doesn’t just use your input to answer your query; it blends it with the data of others, creating a homogenized output that lacks transparency.
When you ask an LLM for advice or analysis, you’re getting a response steeped in the residue of other users’ contributions, with no way to trace its origins or verify its purity.
[I sometimes call this “democratization of data” but I don’t want to besmirch democracy.]
This lack of data hygiene is particularly concerning when sensitive information is involved. For instance, if you upload a confidential business plan or a personal essay, the model might retain elements of it, subtly influencing future outputs. Even if your data is anonymized, the patterns and ideas you contribute can resurface in ways that compromise your privacy or intellectual property. In law, again, the risks are heightened and of a wholly different order.
The Ethics of Data Reuse
The ethical implications of this practice are profound.
Users are rarely informed in clear, accessible terms about how their data is used. Fine print in privacy policies often buries the fact that inputs may be stored, analyzed, and fed back into the model. This raises questions about consent: did you agree to let your data train an AI that benefits others? Were you aware that the response you received might be a patchwork of other people’s inputs?
Moreover, the reuse of data can perpetuate biases and errors. If an LLM is trained on a pool of user prompts that include misinformation, prejudices, or flawed reasoning, those flaws can propagate across the system, affecting everyone who interacts with it. The bathwater becomes dirtier with each use, yet users have no control over the quality of the data pool or how their contributions are filtered.
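The "dirtier with each use" point can be put in rough arithmetic. The sketch below is a toy model with made-up numbers, not a measurement of any real system: it assumes a curated starting pool, then adds batches of unvetted user input with a fixed share of flawed items. The function name and the `filter_recall` parameter (the fraction of flawed items a hypothetical filter catches before ingestion) are illustrative assumptions.

```python
def simulate_pool_error_rate(initial_size, initial_error_rate, batch_size,
                             batch_error_rate, rounds, filter_recall=0.0):
    """Track the flawed share of a training pool as user batches are added.

    filter_recall is the fraction of flawed items removed before ingestion;
    0.0 means everything users submit goes straight into the pool.
    """
    total = float(initial_size)
    flawed = initial_size * initial_error_rate
    history = [flawed / total]
    for _ in range(rounds):
        flawed_in = batch_size * batch_error_rate
        clean_in = batch_size - flawed_in
        flawed_kept = flawed_in * (1.0 - filter_recall)
        flawed += flawed_kept
        total += clean_in + flawed_kept
        history.append(flawed / total)
    return history


# Assumed numbers: 1,000 curated items at 1% flawed, then five batches of
# 500 user inputs at 20% flawed, with and without a filter.
unfiltered = simulate_pool_error_rate(1000, 0.01, 500, 0.20, rounds=5)
filtered = simulate_pool_error_rate(1000, 0.01, 500, 0.20, rounds=5,
                                    filter_recall=0.9)
print([round(x, 3) for x in unfiltered])
print([round(x, 3) for x in filtered])
```

Under these assumptions the unfiltered pool's flawed share climbs every round, while the filtered pool stays much closer to where it started. The individual user sees neither number, which is the control problem described above.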
The Illusion of Personalization
LLMs often market themselves as personalized tools, tailoring responses to your specific needs.
But this personalization is an illusion when your input is just one drop in a vast, shared reservoir. The model doesn’t prioritize your data’s integrity; it optimizes for general performance, blending your input with others’ to maximize efficiency. What you get back is less a bespoke response and more a recycled amalgamation, potentially diluting the accuracy or relevance of the output.
This practice also undermines trust. If you knew your private query about a medical condition or financial decision [or, um hello, LEGAL MATTER] was being used to train the model, would you still share it?
If you suspected the response you received was influenced by someone else’s unrelated or inaccurate input, would you rely on it? The lack of transparency in how data is handled erodes confidence in these systems.
What Can Be Done?
To address this data cannibalization, several strategies are available:
Transparent Data Policies: AI providers must clearly disclose how user inputs and uploaded data are used, stored, and shared. Consent should be explicit, not buried in fine print.
Data Isolation Options: Users should have the choice to opt out of data reuse, ensuring their inputs are used solely for their own queries and not for training purposes.
GUYS, IN AI WE CAN (AND WE DO) SEGMENT DATA AND USER ACCOUNTS FOR SECURITY PURPOSES, AND USE “CLEAN SETS” (UNADULTERATED DATA) WITH ZERO OTHER USER DATA.
Improved Data Hygiene: Providers should implement stricter controls to prevent sensitive or proprietary data from being absorbed into training datasets. Techniques like differential privacy, which adds calibrated noise so that no single contribution stands out in the result, could help protect user contributions.
User Control Over Outputs: Users should have insight into how responses are generated, including whether their data or others’ data influenced the output.
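Three of the strategies above (opt-out, segmented "clean sets", and differential privacy) can be sketched in a few lines. This is an illustration under stated assumptions, not any provider's implementation: the `Record` type, its `allow_training` flag, and the helper names are all hypothetical. The noisy count uses the standard Laplace mechanism for a counting query, with sensitivity 1 and noise scale 1/epsilon.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    tenant_id: str        # account or organization that owns the data
    text: str
    allow_training: bool  # explicit opt-in flag (hypothetical field)


def training_set(records):
    """Keep only records whose owners explicitly opted in to training."""
    return [r for r in records if r.allow_training]


def tenant_view(records, tenant_id):
    """A "clean set": only this tenant's own records, zero other user data."""
    return [r for r in records if r.tenant_id == tenant_id]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Noisy count via the Laplace mechanism (sensitivity 1, scale 1/epsilon)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


records = [
    Record("firm_a", "client engagement letter", allow_training=False),
    Record("firm_a", "public blog draft", allow_training=True),
    Record("user_b", "question about a lease", allow_training=False),
]
print(len(training_set(records)))           # only the opted-in record remains
print(len(tenant_view(records, "firm_a")))  # firm_a sees only its own data
print(private_count(records, lambda r: "lease" in r.text, epsilon=1.0))
```

A smaller epsilon means more noise and stronger protection for any one contributor, at the cost of a less accurate count. The opt-in flag and the tenant filter are the simpler ideas, and they are the ones that keep your data out of the shared pool in the first place.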
Conclusion
The cannibalization of user data by LLMs is a hidden cost of the seeming convenience of “big box, black box” AI.
Every prompt you submit, every file you upload, risks becoming part of a shared, murky pool that serves others while compromising your privacy. Like bathing in someone else’s bathwater, the process is uncomfortable and opaque, leaving you questioning what you’re really soaking in.
As LLMs continue to evolve, users must demand greater transparency and control to ensure their data isn’t just fodder for the machine but a respected and protected contribution.
…AND, BUILD EVEN BETTER SEPARATE SOLUTIONS!


