LLMs Don’t Forget: How Transformers Secretly Store Every Word You Type
by ChatGPT-5.1
This paper makes a startling claim about modern large language models (LLMs):
they never actually “forget” the input you give them.
Instead, they transform your text into internal numerical representations (hidden states) that still contain all the information of the original text, down to the exact characters.
The authors prove mathematically — and verify with billions of tests — that:
Every distinct input prompt produces a unique internal representation. It is therefore possible to reconstruct the exact input text simply by looking at the model’s hidden activations.
In other words, the internal states inside a Transformer model are lossless: they encode your entire prompt in a reversible way.
And the authors demonstrate this with a reconstruction algorithm (“SIPIT”) that recovers any input text perfectly and efficiently, token by token, from the model’s activations.
This is true for all major transformer LLMs tested: GPT-2, Gemma-3, Llama-3, Mistral, Phi-4, TinyStories, and others.
1. Explaining the Core Ideas (Non-Technical)
1.1 What people previously assumed
Many people — including experts — believed that LLMs compress or “summarise” the input text inside the model.
This assumption is based on:
Non-linear functions (activations)
Normalization layers
Attention mechanisms
Residual mixing
All of these seem like they would blur information together.
So people assumed:
“Different inputs may collapse into the same internal state.”
If collapse were possible, a model’s hidden state would not fully reveal the prompt.
This paper shows the opposite.
1.2 What the paper proves
The authors show that:
For almost all parameter settings (i.e., every real-world model),
distinct prompts always produce distinct hidden states.
This property holds even after training, no matter how long you train.
Thus the network is injective:
If two inputs are different, their internal representations must also be different.
Strikingly, this means the model retains everything, even things it might not later output.
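To make the injectivity claim concrete, here is a toy collision check. It is a minimal sketch, not the paper’s billion-scale experiment; the choice of GPT-2 small via Hugging Face transformers and the example prompts are illustrative assumptions:

```python
# Toy injectivity check: distinct prompts should yield distinct hidden states.
# (Assumption: GPT-2 small; the paper tests this at far larger scale across many models.)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompts = ["the cat sat on the mat",
           "the cat sat on the hat",
           "the dog sat on the mat"]

with torch.no_grad():
    # Last-layer hidden state of the final token for each prompt.
    states = [model(torch.tensor([tok.encode(p)]),
                    output_hidden_states=True).hidden_states[-1][0, -1]
              for p in prompts]

# Pairwise distances between hidden states: all strictly positive (no collisions).
for i in range(len(prompts)):
    for j in range(i + 1, len(prompts)):
        d = torch.norm(states[i] - states[j]).item()
        print(f"{prompts[i]!r} vs {prompts[j]!r}: distance = {d:.4f}")
```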
1.3 “Invertible”: We can reconstruct the input perfectly
The authors then build SIPIT, an algorithm that uses the model’s own forward pass to reconstruct the original text:
It looks at the hidden state at each position
Explores possible next tokens
Identifies the unique token whose predicted hidden state matches the observed one
Thus it recovers:
the exact token
in exact order
with 100% accuracy, for every model tested
in linear time (very efficient)
This proves that the internal representations are not “semantic abstractions.”
They are, fundamentally, a transform of the text — like a reversible encoding.
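The sketch below shows the token-by-token inversion idea in miniature. It is a brute-force variant in the spirit of SIPIT, not the authors’ implementation; the model (GPT-2 small), the choice of layer 1, and the restricted candidate set are assumptions made so the example runs quickly:

```python
# Minimal brute-force sketch of hidden-state inversion (SIPIT is more efficient).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 1  # even early layers are claimed to be injective

@torch.no_grad()
def hidden_at(ids, layer=LAYER):
    """Hidden states at the given layer for a token-id sequence."""
    out = model(torch.tensor([ids]), output_hidden_states=True)
    return out.hidden_states[layer][0]           # shape: (seq_len, d_model)

@torch.no_grad()
def invert(target, vocab_subset):
    """Recover tokens position by position by matching hidden states.

    `target` is the observed (seq_len, d_model) activation tensor;
    `vocab_subset` restricts the candidate tokens so the demo runs quickly
    (in principle the full vocabulary would be searched)."""
    recovered = []
    for t in range(target.shape[0]):
        best, best_dist = None, float("inf")
        for cand in vocab_subset:
            h = hidden_at(recovered + [cand])
            dist = torch.norm(h[t] - target[t]).item()
            if dist < best_dist:
                best, best_dist = cand, dist
        recovered.append(best)
    return recovered

prompt = "the cat sat"
ids = tok.encode(prompt)
target = hidden_at(ids)
# Restrict candidates to a small set containing the true tokens, for speed.
candidates = list(set(ids + tok.encode(" dog ran on mat")))
print(tok.decode(invert(target, candidates)))    # expected: "the cat sat"
```

Because the hidden state at position t depends only on tokens up to t (causal attention), matching one position at a time is enough to recover the full prompt in order.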
2. Most Surprising Statements and Findings
(1) Transformers do not lose information at any layer
Contrary to the belief that nonlinearities “destroy information,” the model’s transformations are reversible for all practical purposes.
(2) Injectivity is structurally guaranteed
This is not an accident of initialization.
Even after training, even after millions of gradient steps, injectivity remains true.
(3) Billions of collision tests found zero collisions
Every distinct input produced a distinct hidden state, in every model tested.
(4) Internal activations are the input, encoded
The authors explicitly state:
Hidden states are not abstractions; they are the prompt in disguise.
This has enormous implications for privacy and intellectual property.
(5) Regulators’ assumptions about “privacy through abstraction” are false
The paper calls out a real example:
a German data protection authority argued that prompt information cannot be recovered from a model’s internals.
This paper proves the opposite.
(6) A full input can be reconstructed even with only one layer’s activations
You do not need the whole network or outputs.
One layer is enough to reverse-engineer the prompt.
(7) The larger the model, the greater the separation between representations
Deeper layers increase the distinctness of inputs.
(8) Even the earliest layers (1–2) are already fully injective
Meaning the model begins keeping an invertible encoding almost immediately.
3. Most Controversial Claims
(1) Hidden states are personal data under GDPR
Because the exact user input can be recovered, hidden states fall under the GDPR and similar laws.
(2) Model weights may encode training data more directly than acknowledged
While the paper focuses on input-side data, it undermines arguments that weight matrices cannot encode personal or copyrighted information.
(3) Providers cannot claim they “do not store user data” if hidden states are logged
Many AI systems store:
key-value caches,
embeddings,
internal activations (for optimization),
traces for debugging or safety.
All of these become de facto copies of the user’s data.
(4) Some AI researchers’ intuition about “information compression” is wrong
The paper essentially says:
“Transformers do not forget things; they just reshape them.”
This contradicts years of assumptions in interpretability and privacy literature.
(5) Safety: attackers could extract sensitive text
If a malicious actor gains access to internal states (e.g., through a plugin, side channel, or API leakage):
passwords
medical histories
private messages
copyrighted content
all become trivially recoverable.
4. Valuable Insights
(1) Mechanistic interpretability gets a solid foundation
If hidden states preserve all input information:
any failure to find something interpretable is not because the model “forgot” it
it means interpretability methods need to improve
(2) Prompt-stealing attacks become dramatically easier
This paper shows how:
to reconstruct a user’s prompt exactly
without training an auxiliary model
even from partial internal information
even if the attacker sees only a single layer
(3) Privacy-preserving inference is harder than assumed
To protect user data, systems must avoid exposing hidden states or caches — not only outputs.
(4) AI regulation must consider hidden states as user data
Data retention, deletion rights, and access controls all need revision.
(5) The myth of “semantic embeddings” is challenged
Embeddings are not general-purpose, safe, compressed representations.
They encode the exact text.
5. Recommendations
5.1 For AI Developers
(1) Treat all internal activations as sensitive user data
Apply encryption-in-use
Zero-out residuals after inference
Avoid logging activations
Avoid exposing KV caches to plugins, agents, or external tools
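As a minimal illustration of the last two points, the sketch below runs inference without a KV cache and discards intermediate tensors instead of logging them. It is illustrative hygiene only, assuming a Hugging Face GPT-2 model, and is not a complete mitigation:

```python
# Sketch of privacy-conscious inference hygiene (illustrative only):
# no KV cache, no retained activations, no logging of intermediate tensors.
import gc
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def generate_without_retention(prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(
            ids,
            max_new_tokens=20,
            use_cache=False,                 # no KV cache that could be leaked or logged
            pad_token_id=tok.eos_token_id,
        )
    text = tok.decode(out[0], skip_special_tokens=True)
    del ids, out                             # drop tensors instead of persisting them
    gc.collect()
    return text

print(generate_without_retention("The quick brown fox"))
```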
(2) Build architectures that allow deliberate information destruction
If privacy is a design goal, models may need non-injective transformations.
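As a toy illustration of what a deliberately non-injective transformation could look like (purely hypothetical, not a design from the paper), coarse quantization maps nearby hidden states to the same code, so distinct inputs can no longer be distinguished downstream:

```python
# Hypothetical non-injective transformation: coarse quantization of activations.
import torch

def lossy_projection(h: torch.Tensor, bins: int = 4) -> torch.Tensor:
    """Quantize each activation into a small number of levels (information is destroyed)."""
    return torch.clamp((h * bins).round() / bins, -1.0, 1.0)

h1 = torch.tensor([0.101, -0.499, 0.250])
h2 = torch.tensor([0.099, -0.501, 0.251])   # a distinct, nearby hidden state
print(torch.equal(lossy_projection(h1), lossy_projection(h2)))  # True: collision by design
```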
(3) Provide privacy guarantees explicitly addressing internal activations
Claims like “we don’t store prompts” become misleading if hidden states are stored.
(4) Rethink memory and caching mechanisms
KV cache sharing between sessions must be protected.
5.2 For Regulators
(1) Classify hidden states and cached activations as personal data
Because prompt reconstruction is exact, internal states constitute:
a copy of personal data
a reversible encoding
a fully reconstructable derivative
(2) Require strict data minimization and retention rules
Force deletion of cached activations
Regulate server logs that contain hidden states
Mandate user rights to erase not only logs but derived representations
(3) Revisit claims from industry about “privacy by abstraction”
Regulators should not accept privacy arguments based on non-invertibility — this paper disproves them.
(4) Require API boundaries that prevent access to internal states
This includes agents and fine-tuning endpoints.
5.3 For Rights Holders (publishers, authors, media companies)
(1) Treat internal activations as copies of your copyrighted works
If your text was input to the model, its hidden states are reconstructable copies under copyright law.
This strengthens legal arguments in cases like:
Getty Images
GEMA v. OpenAI
Thele v. Google
NYT v. OpenAI
others where “copies” and “derivative works” matter
(2) Demand contractual protections for embeddings, caches, and internal representations
Licensing contracts should explicitly cover:
embeddings
activations
KV caches
intermediate function outputs
any data derived from copyrighted content
(3) Enforce restrictions on retention and redisclosure
A model that keeps your content in hidden states after inference is still storing it.
This triggers:
copyright rights
privacy rights
database rights
trade secret obligations
(4) Support legal and technical standards for “non-reconstructible embeddings”
Industry needs norms for embeddings that cannot be inverted, especially in enterprise environments.
Conclusion
This paper fundamentally changes how we understand language models.
It shows that:
Transformers are not “fuzzy pattern recognizers.”
They mathematically preserve all information from the input prompt.
Their internal activations act as reversible encodings of the exact text.
And a simple algorithm can reconstruct the full prompt with 100% accuracy.
These results challenge assumptions in AI safety, privacy, intellectual property law, and data protection. They imply that AI providers must treat internal states as copies of the input data, not abstract summaries. Regulators must adjust their frameworks accordingly. And rights holders gain a stronger technical basis to claim that models process and store exact copies of their protected works.
The implications for the future of AI governance are profound.

