LLM Reasoning: From Next-Token Prediction to Constrained Decoding
Understanding how LLMs reason, why they hallucinate, and how we can steer their output with structure
A first-principles look at autoregressive transformers, the hallucination problem, and how constrained decoding can ground LLM outputs in structured knowledge.
The Deepest Question
What does it mean for a machine to “reason” about the world?
At its core, a Large Language Model is a single, enormous function:
At step t, given the input x and the tokens generated so far y_1, …, y_{t-1}: what token y_t comes next? Formally, the model outputs a probability distribution P(y_t | x, y_1, …, y_{t-1}) over its vocabulary.
That’s all it does. There is no internal monologue, no working memory, no database lookup, no verifier. It is next-token prediction, all the way down.
And yet, from this apparently shallow operation, the following emerge:
- Translation between languages
- Code generation and debugging
- Multi-step mathematical reasoning
- Theory-of-mind inferences
- Creative writing
The central mystery of LLMs: How can a mechanism so simple produce behaviour so complex?
Andrej Karpathy frames this as compression: training is a lossy compression of the entire internet into a weights file. The 15+ trillion tokens in a typical training corpus get compressed into ~140 GB of parameters for a 70B model. The result is not a database, but a compressed world model: a simulator of the processes that generated the text.
“The model is not exactly a database. It’s more like a dream simulator of the internet. You can’t lookup facts. But you can prompt it in the right way and it will simulate the fact for you.” — Andrej Karpathy
How an Autoregressive Transformer Works
Before we talk about constraints, we need to understand what we’re constraining. Here’s a simplified view of the transformer forward pass.
Each transformer block has two phases:
1. Communication (self-attention): tokens exchange information. Each token computes queries, keys, and values, then attends to all previous tokens (masked to prevent cheating).
2. Computation (feed-forward): each token independently processes the information it gathered. This is where “knowledge” is primarily stored (the FFN layers act as key-value memories).
Karpathy’s analogy: attention is a meeting where everyone shares information; the FFN is everyone going back to their desk to process what they learned.
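The two phases can be sketched in a few lines of numpy. This is a deliberately stripped-down illustration, not any particular model: a single head, no layer norm or residual connections, with the projection matrices `Wq`, `Wk`, `Wv` and a ReLU feed-forward as assumptions of the sketch.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Communication phase: each token attends to itself and earlier tokens.

    x: (T, d) sequence of token vectors; Wq/Wk/Wv: (d, d) projections.
    """
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(d)                # (T, T) pairwise affinities
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                         # no peeking at future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ v                             # information gathered per token

def ffn(x, W1, W2):
    """Computation phase: a per-token MLP, no cross-token interaction."""
    return np.maximum(x @ W1, 0.0) @ W2
```

Because of the mask, the first token’s output can only depend on the first token itself, which is exactly the autoregressive property the loss relies on.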
The output is a probability distribution over the entire vocabulary (~128K tokens for Llama-3.1). The next token is sampled from this distribution.
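Put together, generation is just this forward pass run in a loop: score the prefix, sample one token, append it, repeat. A minimal sketch, with `logits_fn` standing in for the full transformer forward pass:

```python
import numpy as np

def generate(logits_fn, prompt_ids, max_new_tokens, temperature=1.0, seed=0):
    """Autoregressive sampling loop.

    logits_fn(ids) stands in for the transformer: it returns one logit per
    vocabulary entry for the next position, given the token ids so far.
    """
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = np.asarray(logits_fn(ids), dtype=float) / temperature
        probs = np.exp(logits - logits.max())   # numerically stable softmax
        probs /= probs.sum()
        ids.append(int(rng.choice(len(probs), p=probs)))
    return ids
```

Everything interesting about decoding strategies (greedy, temperature, top-p, and the constrained decoding discussed below) happens between computing `logits` and drawing the sample.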
The Two Crises
This compression explains two fundamental problems:
Hallucination
Because the model is a simulator, not a database, it has no fidelity guarantee. For common facts (“Paris is the capital of France”), the compression is nearly lossless. For rare facts (“the 7th largest exporter of cobalt in 2023”), the compression loses information. The model’s output is always its best statistical guess, not a verified retrieval.
Knowledge Cutoff
The model’s knowledge is frozen at the time of training. Any fact that changes after training (e.g., “Who is the current US president?”) cannot be correctly answered without external grounding.
Three Families of Solutions
There are three broad approaches to grounding LLMs in facts:
| Approach | Changes | Examples |
|---|---|---|
| Input augmentation | What the model sees | RAG, prompt engineering |
| Model updating | What the model is | Fine-tuning, LoRA |
| Output constraint | What the model can say | Grammar constraints, structured decoding |
The third family is the most interesting to me: instead of changing the model or its input, you change the allowed output space during decoding.
Constrained Decoding
Constrained decoding modifies the token selection process to enforce structural rules:
The implementation is straightforward: logit masking. Before sampling, set the logit of every disallowed token to negative infinity.

```python
import numpy as np

# logits: the model's (vocab_size,) scores for the next token
for token_id in range(vocab_size):
    if not is_allowed(token_id, prefix):
        logits[token_id] = -np.inf

# softmax over the masked logits: now exactly zero for disallowed tokens
exp = np.exp(logits - logits[np.isfinite(logits)].max())
probabilities = exp / exp.sum()
```
This is a hard constraint: disallowed tokens have exactly zero probability. The model cannot assign any probability to them, no matter how attractive they might seem.
HuggingFace’s model.generate() supports this via a prefix_allowed_tokens_fn callback, which receives the current prefix and returns the list of valid next tokens.
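A sketch of what such a callback can look like. The token-id sequences here are invented for illustration; the only real contract is the signature HuggingFace calls, `fn(batch_id, input_ids) -> list of allowed token ids`:

```python
# Illustrative only: ALLOWED_SEQUENCES and the token ids are made up.
ALLOWED_SEQUENCES = [[5, 7, 9], [5, 8]]

def make_prefix_fn(prompt_len):
    """Build a prefix_allowed_tokens_fn that forces output onto one of the
    allowed token sequences, ignoring the prompt itself."""
    def prefix_allowed_tokens_fn(batch_id, input_ids):
        generated = list(input_ids)[prompt_len:]   # strip the prompt tokens
        allowed = set()
        for seq in ALLOWED_SEQUENCES:
            if seq[:len(generated)] == generated and len(seq) > len(generated):
                allowed.add(seq[len(generated)])
        return sorted(allowed)
    return prefix_allowed_tokens_fn

# usage sketch (not run here):
# model.generate(input_ids, prefix_allowed_tokens_fn=make_prefix_fn(input_ids.shape[1]))
```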
Types of Constraints
| Constraint Type | What It Enforces |
|---|---|
| Trie / prefix tree | Next token must continue a valid prefix |
| Regular expression | Output must match a regex pattern |
| Grammar | Output must follow a context-free grammar (SQL, JSON) |
| Length | Total length within bounds |
| Vocabulary | Certain words are forbidden |
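For the simplest rows of the table, the `is_allowed` predicate used in logit masking is easy to write down. A toy sketch combining a vocabulary constraint with a length constraint (the token ids, the `FORBIDDEN` set, and the EOS id are invented for illustration):

```python
FORBIDDEN = {13, 42}   # vocabulary constraint: banned token ids (made up)
MAX_LEN = 8            # length constraint: cap on generated tokens
EOS_ID = 0             # assumed end-of-sequence token id

def is_allowed(token_id, prefix):
    """True if token_id may follow the generated prefix."""
    if token_id in FORBIDDEN:
        return False
    if len(prefix) >= MAX_LEN:
        return token_id == EOS_ID   # at the cap, only EOS remains legal
    return True
```

Trie, regex, and grammar constraints follow the same shape; only the bookkeeping behind `is_allowed` gets more elaborate.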
Where Constrained Decoding Gets Interesting
The most powerful applications combine structure with semantics. Consider a knowledge graph: a structured collection of facts stored as triples:
(Elvis Presley, people.person.place_of_birth, Tupelo)
(Elvis Presley, film.actor.film, Blue_Hawaii)
(Blue_Hawaii, film.film.directed_by, Norman_Taurog)
If you constrain the LLM to generate only valid paths through a knowledge graph, you get a guarantee that every generated reasoning path corresponds to a real sequence of facts in the graph. The model can think out loud in graph space, and the constraint oracle ensures it never strays from what the graph actually says.
The constraint is a trie (prefix tree) built from all valid paths in the graph. At each decoding step, the trie tells us which tokens are valid. The query is a single trie traversal, essentially instantaneous even for millions of paths, especially with compressed trie implementations like marisa_trie.
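A minimal dict-based trie over the example triples above makes the mechanism concrete. For simplicity each path element counts as one “token” here; a real system would operate on subword token ids and use a compressed implementation such as marisa_trie:

```python
def build_trie(paths):
    """Nested-dict trie: each key is a path element, each value a subtrie."""
    root = {}
    for path in paths:
        node = root
        for token in path:
            node = node.setdefault(token, {})
    return root

def allowed_next(trie, prefix):
    """Every legal continuation of `prefix`, or [] if the prefix is invalid."""
    node = trie
    for token in prefix:
        node = node.get(token)
        if node is None:
            return []
    return sorted(node)

paths = [
    ["Elvis Presley", "people.person.place_of_birth", "Tupelo"],
    ["Elvis Presley", "film.actor.film", "Blue_Hawaii"],
    ["Blue_Hawaii", "film.film.directed_by", "Norman_Taurog"],
]
trie = build_trie(paths)
```

Plugging `allowed_next` into the logit-masking loop from earlier gives the guarantee: every completed generation spells out a path that actually exists in the graph.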
The Faithfulness Guarantee and Its Limit
Here’s where things get subtle.
Constrained decoding can guarantee that a generated path is faithful to the knowledge graph: every relation mentioned actually exists. But faithfulness is not the same as relevance.
A path can be faithful to the graph and irrelevant to the question simultaneously. For a question about Marie Curie’s birthplace, a trie built from all paths in the graph admits:
- Marie Curie → people.person.place_of_birth → Warsaw ✅ relevant
- Marie Curie → people.person.children → Irène Joliot-Curie ❌ faithful but irrelevant
- Marie Curie → award_received → Nobel_Prize ❌ faithful but irrelevant
The constraint guarantees structural validity. It does not guarantee semantic relevance. The model still has to navigate a combinatorially large space of valid paths, many of which lead nowhere useful.
This gap between what the constraint allows and what the question needs is where a lot of interesting research is happening right now. How do you make the constraint itself aware of what the question is asking?
The Path Forward
Some directions being explored:
- Semantic filtering: scoring paths by relevance to the question before building the trie, so irrelevant branches are pruned upfront
- Dynamic expansion: starting with a small trie and expanding it step-by-step during generation, guided by what the model has generated so far
- Backtracking: when the model goes down an unpromising path, pruning that branch and recovering
All of these share a common insight: the constraint oracle should know what the question is asking, not just what paths structurally exist in the graph.
This post scratches the surface of concepts I’m exploring in depth. If you’re working on LLM reasoning, structured generation, or knowledge-grounded AI, I’d love to hear your thoughts.