Transformers are inherently succinct

dfabulich · 2026-06-05T20:32:01 1780691521

The last line of the abstract has the most important takeaway.

> As a consequence of this succinctness, we show that basic verification problems for transformers, such as emptiness and equivalence, are provably intractable: specifically, EXPSPACE-complete.

If you were hoping to formally prove the correctness of a large transformer, it turns out that you're going to need an exponentially larger amount of space to do your verification, more than you could possibly afford.

doug_durham · 2026-06-05T21:30:02 1780695002

This is a truly important paper. It formalizes the intuition that many in the field have. We can stop wasting time doing formal analysis of LLMs. If you have a problem that requires formal verification, don't use an LLM. You can use an LLM to help you build such a system, but the LLM can't be the system.

AlotOfReading · 2026-06-06T00:15:19 1780704919

I'm not sure that's a great takeaway. Lots of problems are undecidable and have reasonably effective solutions in practice (e.g. finding bugs -> testing, static analysis, etc). Mind you, I don't expect we'll find anything like that for transformers, but there's a surprisingly large gap between what's possible in general and what's possible in the cases we care about.

suddenlybananas · 2026-06-05T22:22:07 1780698127

I don't really see the relationship to your comment and the paper's content. Could you elaborate a little?

dfabulich · 2026-06-05T22:34:43 1780698883

It's the last line of the abstract.

> As a consequence of this succinctness, we show that basic verification problems for transformers, such as emptiness and equivalence, are provably intractable: specifically, EXPSPACE-complete.

platinumrad · 2026-06-05T23:00:45 1780700445

That's saying you can't formally verify an LLM, not that LLMs can't be used in formal verification.

nextos · 2026-06-06T01:27:58 1780709278

But, if I have understood correctly on a quick read, they also claim transformers have pretty low expressive power. In particular, they claim they are limited to star-free subregular languages, whereas RNNs can recognize any regular language/simulate finite automata.

This doesn't imply you can't get aid from a LLM to e.g. implement a function that has a formal specification (an application I think is very promising), but surely it has some profound implications on how much of a large system can be understood by a LLM at once, without supervision.

dang · 2026-06-05T19:09:51 1780686591

Discussed (a bit) here:

Transformers Are Inherently Succinct (2025) - https://news.ycombinator.com/item?id=48014197 - May 2026 (9 comments)

thesz · 2026-06-05T20:35:59 1780691759

My comment in the previous discussion of that paper: https://news.ycombinator.com/item?id=48014197

Authors used LTL (linear temporal logic) to express, basically, non-reduced non-ordered binary decision diagrams. Or just binary decision diagrams, BDDs.

BDDs are almost guaranteed to have exponential size because they do not employ reduction (sharing of common expressions). Reduced BDDs are more succinct and reduced ordered BDDs are even more succinct.

Also, transformers in the paper are constructed, not trained. Training any model to express some truth table is very hard. They also did not perform comparison with, say, Kolmogorov-Arnold representation, which is also universal approximator.

So this paper is not as deep as one may think it is.

nico · 2026-06-05T22:31:06 1780698666

Thank you for the explanation, makes it a lot more digestible

Just nit picking a bit:

> Training any model to express some truth table is very hard

What kind of models are you including here? Truth tables can be modeled in regular code very easily and reliably. And I’m sure there are many deterministic models that could do the same. Are you talking about LLMs in particular or a certain category/type of models?

dragontamer · 2026-06-05T22:33:31 1780698811

What?

Reduced Ordered BDDs are likely not as succinct as an arbitrary BDD.

The famous Hidden Weight Bit problem can be more succinctly expressed in arbitrary BDDs (changing the order and revisiting nodes), but are provably EXPSPACE in ROBDDs.

-------

We study ROBDDs because they uniquely reduce to a canonical form. All functions have exactly one (!!!!) ROBDD.

BDDs in general however are really arbitrary, as arbitrary as any codebase can basically get. That makes BDDs in general to difficult to study or do math upon.

Results based on the studies of arbitrary BDDs do NOT apply to the simpler, easier to understand world of ROBDDs. And vice versa.

dragontamer · 2026-06-06T03:08:34 1780715314

I can now elaborate on my comment above a bit more...

Ingo Wegener's "Branching Programs and Binary Decision Diagrams" copyright year 2000, is considered the introductory work on all BDD related matters. On the front cover are two BDDs: BOTH of these implement the hidden weighted bit function (HWB-4) introduced in the first chapter, and is basically the main example used throughout the whole book.

Please locate a copy of the front cover ASAP, before continuing with my post.

-----------

Now that you've found the book cover, look at it carefully and understand the HWB-4 function. (Just traverse it by hand, dotted lines represent "x1 == 0", while solid lines represent "x1 == 1").

Both BDDs fail to be ROBDDs: the left diagram explores the variables in this order: x1, x2, x3, x4, and then (depending on path), x1/x2/x3. That is, the variables x1/x2/x3 are explored _TWICE_ (breaking the rule for ROBDDs).

The example on the right also fails to be a ROBDD. The variable orderings are seemingly random: one branch covers x4, x1, x2, x3. The other branch is x4, x3, x1, x2. This fails the "ordered" property as the different branches cover different orderings.

BOTH are BDDs however. Both clearly implement the HWB-4 function efficiently (less than EXP-space. Indeed, its a very small graph for the front cover of a book). If these were actually ROBDDs, they'd be ridiculously large and unable to fit.

-------

ROBDDs are again, studied due to the unique property (and proof), that for any given function, there can only be one ROBDD ever generated (!!!!). For the verification problem of circuits, this is extremely useful. Different circuits might implement the same function, despite one circuit being smaller or using less power (or other useful properties).

To "prove" that the circuits output the same function, you simply create an ROBDD. Because functions ALWAYS result with the same ROBDD, you can ALWAYS use this process to prove circuit equivalence.

ROBDDs aren't succinct. No: their usefulness is the opposite. Exactly one ROBDD to ever find for any binary function that has ever existed or will ever exist.

parasti · 2026-06-05T20:18:54 1780690734

Paper went over my head but is this in any way related to my experience of Claude Opus 4.8 using increasingly terse language with very short, overloaded words? Lately I've been having trouble parsing the things it writes about my own code, it's using the kind of compressed language that you see typically in git commit message subject lines but relentless, always on.

AlotOfReading · 2026-06-05T20:22:55 1780690975

No, this is in the same ballpark as ideas like big-O notation. The paper is saying that transformers can recognize a language with exponentially fewer symbols than other kinds of systems, i.e. they're more succinct.

It's exactly as related to real models as computer science is to real computers.

dontwannahearit · 2026-06-05T20:33:08 1780691588

Not just me then.

nightshift1 · 2026-06-05T21:04:40 1780693480

i noticed that with 4.7. i tried to add instructions in claude.md to unpack meaning when communicating to me but it did not work.

7e · 2026-06-05T22:40:12 1780699212

That's a budget thing. Claude is suffering from huge demand and they're pulling out all the stops to try to keep the lights on: terse tokens, lobotomizing Claude six ways from Sunday, aggressive batching, the works.

lee_ars · 2026-06-05T21:14:38 1780694078

"Why use lot word, when few word do trick?" —Optimus Prime

Terr_ · 2026-06-05T23:10:47 1780701047

Foolish Autobots! Your so-called Matrix of Leadership is merely a bunch of weights!

lkm0 · 2026-06-05T19:59:43 1780689583

I had no idea that LLMs (or the transformer architecture) were within reach of complexity theory. But if transformers "can be" exponentially more succinct than RNNs, doesn't that mean we're approaching optimality?

muvlon · 2026-06-05T21:54:48 1780696488

No. We have an infinitely more succinct formalism, it's Turing machines. Succinctness is not necessarily a desirable property, it just says where on the capability-tractability tradeoff something is. Turing machines can express literally anything computable, but in exchange we can't use computers to reason about them in general (Rice's Theorem). Regexes are much more limited, they famously can't even recognize HTML, but we get to automatically prove things about them.

thesz · 2026-06-05T20:51:41 1780692701

  > doesn't that mean we're approaching optimality?

No.

Transformers are Markov chains [1]. Somewhere around this fascinating site [2] I read that stateful models have an advantage. Author provided an example, a state machine with two states A and B, where at state A transitions are to state A (output 0) and to state B (output 1) with equal probability and at state B the transition is always to state A and output is always 1.

For this state machine just one bit of memory can make an optimal prediction that ones always go in pairs, whereas Markov chain will approximate this prediction and never reach optimality.

  [1] https://arxiv.org/abs/2410.02724
  [2] https://bactra.org/

hnsr · 2026-06-06T01:37:20 1780709840

I think that might have been the example given here: https://bactra.org/notebooks/nn-attention-and-transformers.h...

pafoster · 2026-06-05T21:28:20 1780694900

Markov chains are themselves a kind of state machine, namely a probabilistic deterministic finite automaton (PDFA), albeit where state is solely governed by the N most recent symbols. (Deterministic means that given a sequence, we can always infer the associated state transitions unambiguously). I believe the example in the reference you provide represents the more general case of PDFA, which is not representable as a finite order Markov processs.

drdeca · 2026-06-05T23:01:41 1780700501

Huh? Any process on a computer by itself is also a Markov chain.

If you include all the information the LLM uses to produce the next token as part of the state, then of course the LLM is a Markov chain.

So would be any other process for sampling continuations of a text, with finite memory.

measurablefunc · 2026-06-05T20:18:13 1780690693

What about the other direction? What languages are expressible w/ RNNs & LTLs that require exponential blowup for transformers?

aesthesia · 2026-06-05T21:38:02 1780695482

Not quite an answer to your question, but you might find this interesting. The Olmo Hybrid paper has some results on relative complexity of problems that can be solved by transformers and RNNs. They don't look at size, just solvability, and find that the sets of problems solvable by the two architectures are incomparable. They actually use these results to inform their architecture design, which includes both attention and state space layers. (Specifically, they choose gated delta-nets with negative eigenvalues, which they show have greater expressivity than those without.)

https://arxiv.org/abs/2604.03444

yorwba · 2026-06-05T20:37:32 1780691852

Proposition 16. UHATs have polynomially bounded expansion over LTL. In particular, given an LTL formula φ, one can construct in polynomial time a UHAT T such that L(T) = L(φ).

i.e. the blowup is only exponential in one direction.

measurablefunc · 2026-06-05T21:20:35 1780694435

That says every LTL formula can be compiled into UHAT w/ polynomial overhead. It doesn't say that all languages recognizable w/ UHATs necessarily do not have succinct recgonizers in LTLs or RNNs.

Edit: Actually nevermind. If UHAT could be compiled into LTL w/ polynomial overhead then that would also work for the languages that have exponential overhead in LTL but since they don't there is a strict separation.

brandonb · 2026-06-05T18:56:59 1780685819

This paper is being published at ICLR 2026 (top AI conference), and was selected as one of three outstanding papers.

dang · 2026-06-05T19:11:29 1780686689

(We'll add that bit to the toptext as well. Thanks!)