Understanding Wack LLM Datasets (Researching Categoricals)
Getting Granular Naming Names on LLM Databases
The AI Counsel post which blew up LinkedIn and Substack asked, “YIKES! What is in your LLM?” And then we told you. Many learned for the first time ever.
We have also reported just how very GINORMOUS these databases are (to the moon and back 127 times!) in “Just How Big are LLM Databases?”
So I decided to dig in deeper both qualitatively and contextually.
What are the categories of sources these data are coming from: Wikipedia, Reddit, YouTube, Google, and others?
What does each of the five major LLMs do with respect to each of these categoricals?
Which LLM leans into what types of data?
Below are the main categories of information commonly used as sources of training data for large language models (LLMs).
These are derived from public and open datasets and reflect the diverse textual content that helps models learn language patterns, facts, and reasoning:
Web scraped content: Vast collections of text from websites, including blog posts, articles, reviews, search results, and e-commerce data, often sourced from large crawls like Common Crawl or refined versions like C4 and RefinedWeb.
Books and literary works: Digitized texts from public domain books, novels, and other literature, providing narrative styles and diverse topics, such as from BookCorpus or Project Gutenberg.
Academic and scientific publications: Peer-reviewed papers, research articles, and scholarly content from sources like arXiv, PubMed, or Google Scholar, focusing on technical and factual information.
Source code and programming data: Code snippets, scripts, and documentation from repositories like GitHub or datasets such as StarCoder, covering multiple programming languages.
News and journalistic articles: Content from news outlets and aggregators like Google News, covering current events, politics, and real-world updates.
Social media, forums, and community discussions: Conversational text from platforms like Stack Exchange, Reddit, or other networks, including Q&A and user-generated content.
Encyclopedic knowledge: Structured factual entries from sources like Wikipedia, providing broad coverage of topics in multiple languages.
Transcripts from audio/video content: Text derived from videos, podcasts, or speeches on platforms like YouTube, offering spoken language patterns and dialogues.
Multilingual and diverse corpora: Texts in various languages and formats, often combining multiple sources for global coverage, such as in ROOTS or The Pile.
Specialized documents: Niche content like legal texts, financial reports, or domain-specific data (e.g., protein structures for scientific models), though less common in general-purpose LLMs.
What does each of the major LLMs do with respect to these datasets?
[Note: I have bolded for each particular usage to make it easier to see which LLM is associated with which primary area of content.]
Data Analysis
Observations and interpretations:
ChatGPT and Llama heavily rely on web scraped content.
Claude heavily relies on academic and scientific publications.
Grok heavily relies on news and also forums/discussions from social media.
Gemini heavily relies on code, on audio/video transcripts, and on multilingual data from the Google corpus.
An interesting LOW is Grok on books/literary works.
An interesting LOW (“Negligible”) is Claude on audio/video, in that it leans into text-only content.
Across the board, usage of “specialized” (i.e. expert) materials by ALL the LLMs is Low or Medium (not one makes High use of expert data).
Across the board, web scraping and source code/programming data power most of the operations of most of the LLMs.
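For readers who want to query these observations rather than re-read them, here is a minimal Python sketch. It records only the pairings stated in the text above. Treating “heavily relies” as a “High” rating is my assumption, and the names STATED_RATINGS, categories_with_rating, and models_with_rating are hypothetical, made up for illustration. Any LLM/category pair not mentioned above is left out rather than guessed.

```python
# Ratings taken only from the observations in this post.
# Assumption: "heavily relies" is recorded as "High".
# Pairs the post does not mention are omitted, not guessed.
STATED_RATINGS = {
    ("ChatGPT", "web scraped content"): "High",
    ("Llama", "web scraped content"): "High",
    ("Claude", "academic and scientific publications"): "High",
    ("Grok", "news"): "High",
    ("Grok", "social media and forums"): "High",
    ("Gemini", "source code"): "High",
    ("Gemini", "audio/video transcripts"): "High",
    ("Gemini", "multilingual corpora"): "High",
    ("Grok", "books and literary works"): "Low",
    ("Claude", "audio/video transcripts"): "Negligible",
}


def categories_with_rating(model, rating):
    """Return the categories for which `model` has the given stated rating."""
    return sorted(
        category
        for (name, category), value in STATED_RATINGS.items()
        if name == model and value == rating
    )


def models_with_rating(category, rating):
    """Return the LLMs that have the given stated rating for `category`."""
    return sorted(
        name
        for (name, cat), value in STATED_RATINGS.items()
        if cat == category and value == rating
    )


if __name__ == "__main__":
    print(categories_with_rating("Gemini", "High"))
    print(models_with_rating("web scraped content", "High"))
```

A lookup keyed on (LLM, category) keeps the gaps visible: asking about a pair the post never rated simply returns nothing instead of an invented value.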
Charting
Epilogue / Conclusion
As we deal with and discuss primarily use cases within Law, you should be able to see immediately the conflict between these datasets, and the categories used, and what Law demands. The need to somehow morph these generalized, non-expert, social-media-tilted databases into cogent and precise legal research, at the level required by Law and legal professionals, provides clear evidence of an embedded design error in the use of LLMs for the Research function in Law.






