Just How Big Are LLM Databases?
35 Libraries of Congress
How very big is the database of a leading LLM?
With an estimated (the number is not disclosed, and this estimate may be low)
70 trillion tokens of training data,
(remember, your shorthand for a token is the word “the,” basically three letters)
ChatGPT’s recently released GPT-5 model draws from the equivalent of a database of
175 BILLION
8 1/2 x 11 pages of content
Or, roughly the text/page volume of 35 (!) Libraries of Congress.
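A quick sanity check of the figures above, as a rough sketch. The 70 trillion token count is the estimate used here, not a disclosed number, and the tokens-per-page and pages-per-library values are only what the stated figures imply, not independently sourced. The name pages_from_tokens is a hypothetical helper for illustration.

```python
# Figures taken from the text above; the token count is an estimate, not a disclosed number.
TRAINING_TOKENS = 70 * 10**12      # 70 trillion tokens
PAGES = 175 * 10**9                # 175 billion letter-size pages
LIBRARIES_OF_CONGRESS = 35

# Implied conversion factors (derived here, not stated in the text).
tokens_per_page = TRAINING_TOKENS // PAGES          # 400
pages_per_library = PAGES // LIBRARIES_OF_CONGRESS  # 5,000,000,000

def pages_from_tokens(tokens: int, per_page: int = tokens_per_page) -> int:
    """Convert a token count to whole pages at a fixed tokens-per-page density."""
    return tokens // per_page

print(f"{tokens_per_page} tokens per page")
print(f"{pages_per_library:,} pages per Library of Congress")
print(f"{pages_from_tokens(TRAINING_TOKENS):,} pages")
```

In other words, the 175 billion page figure assumes about 400 tokens per page, and the comparison assumes about 5 billion pages of text per Library of Congress.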
If we placed the pages on the ground one page touching the other lengthwise, the pages would go for 30.38 MILLION MILES!
That is the same distance as traveling to THE MOON 127 times, or there AND BACK more than 63 times!
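The distance arithmetic can be sketched as follows. It assumes each page contributes its 11-inch length and uses an average Earth-Moon distance of 238,855 miles, which is an assumption added here (the actual distance varies over the Moon's orbit).

```python
PAGES = 175 * 10**9            # 175 billion pages, from the text above
PAGE_LENGTH_INCHES = 11        # letter-size page laid lengthwise
INCHES_PER_MILE = 12 * 5280    # 63,360 inches in a mile
MOON_DISTANCE_MILES = 238_855  # assumed average Earth-Moon distance

# Total length of the paper trail in miles.
total_miles = PAGES * PAGE_LENGTH_INCHES / INCHES_PER_MILE

# One-way trips to the Moon, and round trips (there and back).
one_way_trips = total_miles / MOON_DISTANCE_MILES
round_trips = one_way_trips / 2

print(f"{total_miles / 1e6:.2f} million miles")
print(f"{one_way_trips:.0f} one-way trips to the Moon")
print(f"{round_trips:.1f} round trips")
```

Under these assumptions the trail is about 30.38 million miles, which is roughly 127 one-way trips to the Moon or a little under 64 round trips.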
If it were 20 lb bond paper and we weighed all of it, the data pages would weigh 875,000 US short tons (roughly 794,000 metric tons) or:
2.5 Empire State Building(s)
Roughly 2,000 (!) fully-loaded Boeing 747s
But only 14% of the weight of the Great Pyramid at Giza
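The weight figures can be checked with a short calculation. This is a sketch under stated assumptions: a 500-sheet ream of letter-size 20 lb bond weighs 5 lb (the standard basis-weight definition), the Empire State Building weighs about 365,000 short tons (a commonly cited figure), and a fully loaded 747 weighs about 875,000 lb (the 747-400 maximum takeoff weight; other variants differ). The two comparison weights are assumptions added here, not values from the text.

```python
PAGES = 175 * 10**9               # 175 billion pages, from the text above
SHEETS_PER_REAM = 500
REAM_WEIGHT_LB = 5                # letter-size ream of 20 lb bond
LB_PER_SHORT_TON = 2000
KG_PER_LB = 0.45359237            # exact definition of the pound

# Assumed comparison weights (approximate, commonly cited figures).
EMPIRE_STATE_SHORT_TONS = 365_000
BOEING_747_LOADED_LB = 875_000    # 747-400 maximum takeoff weight

total_lb = PAGES // SHEETS_PER_REAM * REAM_WEIGHT_LB
short_tons = total_lb / LB_PER_SHORT_TON
metric_tons = total_lb * KG_PER_LB / 1000

print(f"{short_tons:,.0f} short tons")
print(f"{metric_tons:,.0f} metric tons")
print(f"{short_tons / EMPIRE_STATE_SHORT_TONS:.1f} Empire State Buildings")
print(f"{total_lb / BOEING_747_LOADED_LB:,.0f} fully loaded 747s")
```

With these inputs the paper weighs 875,000 short tons, or about 794,000 metric tons, which comes to about 2.4 Empire State Buildings and about 2,000 loaded 747s. The jet count is in the thousands, not the millions.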
People, including those in AI, simply have no sense or conception, and (since it is secret) no precise measure, of the staggeringly immense amount of data
referenced when you type in your prompt to make an LLM call.
“Fishing for a ring in the ocean.”
Opaque, vast, diverse, impossible to see through, full of dangers like sharks and completely made-up stuff (like case citations), unlabeled,
with outputs offered with absolute conviction and also flattery.
Data is “Hoover vacuumed” from anywhere and everywhere. There may also be what’s called “synthetic data,” which is exactly what it sounds like, to build out the db. See: IBM on synthetic data.
In law, it’s simply the wrong tool for the use case, which demands accuracy and precision. It can be correct. So can broken clocks.
If you are looking for precise needles, you do not build a bigger (or especially the biggest) haystack.
Library of Congress, earlier data.



