YIKES! What is in Your LLM?
Reddit Data + AI Data-Fact Totems

Yesterday we talked about the HUGE SIZE of LLM databases.
Today, we talk INGREDIENTS.
The above infographic is a visualization created by Visual Capitalist, illustrating the most frequently cited web domains by large language models (LLMs) such as ChatGPT and Perplexity.
It is based on a comprehensive study conducted by Semrush in June 2025, which analyzed over 150,000 unique citations from AI responses across multiple platforms, including Google AI Mode, Google AI Overviews, ChatGPT, and Perplexity.
The goal was to understand where AI tools source their "facts" and how this might impact accuracy, SEO, and content strategies.
Semrush randomly selected 5,000 keywords from its database, categorized by search intent (informational, transactional, commercial, navigational) and search volume.
These keywords were submitted as queries to the AI platforms in desktop mode, and the citations in the generated responses were then analyzed.
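As a rough illustration of the tallying step, here is a minimal Python sketch of how cited URLs could be rolled up into a per-domain percentage. It is not Semrush's actual pipeline: the function names, the sample URLs, and the reading of each percentage as "share of responses that cite the domain at least once" are all assumptions made for the example. That reading would at least explain why the figures in a ranking like this can add up to more than 100%.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'.

    Simplification: other subdomains (e.g. 'en.wikipedia.org') are kept as-is.
    """
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def domain_citation_rates(responses: list[list[str]]) -> dict[str, float]:
    """Percentage of responses that cite each domain at least once.

    `responses` is a hypothetical structure: one list of cited URLs per AI answer.
    """
    if not responses:
        return {}
    counts: Counter[str] = Counter()
    for cited_urls in responses:
        # A set counts each domain once per response, however many pages are cited.
        counts.update({domain_of(u) for u in cited_urls})
    total = len(responses)
    # most_common() orders domains from most to least frequently cited.
    return {domain: 100 * n / total for domain, n in counts.most_common()}


# Made-up sample data, for illustration only.
sample_responses = [
    ["https://www.reddit.com/r/a/1", "https://en.wikipedia.org/wiki/X"],
    ["https://reddit.com/r/b/2", "https://www.reddit.com/r/b/3"],
    ["https://www.youtube.com/watch?v=abc"],
    ["https://www.reddit.com/r/c/4", "https://www.youtube.com/watch?v=def"],
]
rates = domain_citation_rates(sample_responses)
for domain, pct in rates.items():
    print(f"{domain:20s} {pct:5.1f}%")
```

Because one answer can cite several domains, the percentages in this sketch overlap rather than partition the total.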
The study compared overlaps with traditional Google search rankings, response lengths, and source preferences. It highlighted a strong correlation between high Google organic rankings and AI citations at the domain level, though AI often pulls from different pages within those domains.
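The difference between domain-level and page-level overlap can be made concrete with a small Python sketch. The function names and URLs below are hypothetical and only illustrate the idea; the study's own overlap metric is not described here, so this should not be read as its method.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def overlap_with_google(google_urls: list[str], ai_urls: list[str]) -> dict[str, float]:
    """Share of AI-cited domains and pages that also appear in Google's results."""
    google_pages, ai_pages = set(google_urls), set(ai_urls)
    if not ai_pages:
        return {"domain_overlap": 0.0, "page_overlap": 0.0}
    google_domains = {domain_of(u) for u in google_pages}
    ai_domains = {domain_of(u) for u in ai_pages}
    return {
        "domain_overlap": len(ai_domains & google_domains) / len(ai_domains),
        "page_overlap": len(ai_pages & google_pages) / len(ai_pages),
    }


# Made-up example: same two domains, but only one identical page.
google_top = ["https://www.reddit.com/r/law/1", "https://www.yelp.com/biz/x"]
ai_cited = ["https://www.reddit.com/r/law/2", "https://www.yelp.com/biz/x"]
result = overlap_with_google(google_top, ai_cited)
print(result)
```

In this toy case every AI-cited domain also ranks in Google, yet only half of the cited pages match, which is the pattern the study describes.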
User-generated content (UGC) platforms dominated, partly due to deals like Google's integration of Reddit data into its AI systems.
Top Domains Cited by LLMs
Rank  Domain              Percentage
1     reddit.com          40.1%
2     wikipedia.org       26.3%
3     youtube.com         23.5%
4     google.com          23.3%
5     yelp.com            21.0%
6     facebook.com        20.0%
7     amazon.com          18.7%
8     tripadvisor.com     12.5%
9     mapbox.com          11.3%
10    openstreetmap.org   11.3%
11    instagram.com       10.9%
12    mapquest.com        9.8%
13    walmart.com         9.3%
14    ebay.com            7.7%
15    linkedin.com        5.9%
16    quora.com           4.6%
17    homedepot.com       4.6%
18    yahoo.com           4.4%
19    target.com          4.3%
20    pinterest.com
Heavy reliance on UGC raises issues like misinformation (e.g., unverified rumors spreading), echo chambers (popular but inaccurate views amplified), and lack of authority in sensitive areas like health or finance.
AI COUNSEL’S ONLY (OK, TWO) QUESTIONS FOR YOU TODAY:
ARE THESE THE “INGREDIENTS” THAT YOU WANT IN YOUR AI?
ARE THESE INGREDIENTS, IN YOUR LLM AND IN THE “LLM WRAPPERS” BUILT ON TOP OF IT, A SUITABLE BASE OF INFORMATION FOR THE USE CASE OF LAW (WHICH DEMANDS PRECISION IN WORD, FORM, MEANING, AND CONTENT)?


