The final proposition of the Tractatus is probably the most quoted and least obeyed sentence in the history of philosophy. Wittgenstein wrote it as a limit: there are things language cannot reach, and in the face of them the honest thing to do is stop. Not out of cowardice. Out of precision.
Language models do exactly the opposite. When they don't know, they talk more. When a question exceeds their data, they don't stop: they complete. They produce an answer that sounds right, that has the texture of truth, but isn't. They don't lie in the human sense of the word: they have no intention to deceive. Socrates was declared the wisest man in Athens for one reason alone: he was the only one who knew he didn't know. That capacity — evaluating the limits of one's own knowledge — is precisely what a language model lacks.1 It doesn't know that it doesn't know. For a model, a gap in knowledge and a gap in a sentence are resolved the same way: with the most probable word.
Anyone who has worked with generative AI for more than a week recognizes this phenomenon. What is discussed less is why building the opposite is so difficult.
Designing a system that answers well is an engineering problem. Designing a system that knows when not to answer is a problem of a different nature. Because abstention is not a system failure: it is a system decision. And decisions require judgment, not just data.
In cinema, the territory where I have spent most of my intellectual life, this has been understood for a century. Editing does not consist of placing shots one after another. It consists of deciding what not to show. Hitchcock explained it with brutal clarity: terror lies not in what you see but in what you imagine, because the director chose to look away rather than show. Bresson went further: he stripped out acting itself, so the viewer would fill in what was missing. Tarkovsky sculpted time by emptying it.
Pascal Bonitzer put it precisely: in cinema, the visual field doubles into a blind field. The screen is a mask, a partial view.3 Restraint (the precise cut, the shot that doesn't appear, the dialogue that stays silent) is what separates a film that works from one that merely plays.
Something similar happens with academic writing. Any researcher knows that the hardest moment in a text is not writing what you know, but deciding what to leave out. Which data point not to include. Which nuance not to develop. Which hypothesis to mention but not defend, because the evidence doesn't yet support it. That gesture, restraining oneself with intention, is what turns a draft into an argument.
Language models don't make that gesture. They can't; nothing in their design provides for it. Their architecture is optimized, at its core, to produce the most probable sequence given a context. There are tuning techniques that mitigate this (instructions, alignment, filters), but the underlying tendency persists: more text, more completeness, more flow. When a model produces an invented citation with author, publisher, and year, it isn't fabricating a story: it's completing a pattern. The format of an academic citation is so familiar that the model reproduces it with the same ease as a cooking recipe or a code snippet.
The failure is not in capability. It is the absence of a mechanism that says: I don't have sufficient basis here — I stop. The problem is that for an architecture designed to predict the next token without interruption, the void doesn't exist — or shouldn't exist; it is just another probabilistic space that must be filled. Citing from memory, for a model, is citing from probability; and the difference between the two is exactly what separates a useful tool from a reliable one.
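The point can be made concrete with a toy calculation (the candidate tokens and scores below are invented; any real model behaves analogously). A softmax over scores always sums to one, so something is always the most probable continuation. There is no output that represents the void:

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate continuations of an
# unanswerable prompt; the numbers are invented for illustration.
candidates = ["(1987)", "(1992)", "(2003)", "I don't know"]
logits = [2.1, 1.9, 1.7, 0.4]

probs = softmax(logits)
best = max(zip(candidates, probs), key=lambda pair: pair[1])

# No outcome is ever empty: the distribution always elects a winner,
# even when every candidate is a fabrication.
print(best)  # ('(1987)', 0.374...)
```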
In the Tractatus, Wittgenstein doesn't propose silence as renunciation. He proposes it as the highest form of intellectual honesty. Staying silent where one cannot speak with precision is not a defect of thought: it is its culmination.
The question that interests me is whether that gesture can be instructed.
Not trained: instructed, or designed, if we want more precise terminology. The difference matters. Scaling training data doesn't teach a model to stay silent: it gives it more material to fill in with. What a rigorous query system needs is not more coverage but an abstention protocol: explicit conditions under which the correct answer is "I do not have sufficient evidence to answer this."
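Reduced to a skeleton, such a protocol is not exotic. Here is a minimal sketch, with every name in it (the retriever's scores, the generate call, the thresholds) assumed for illustration rather than taken from any real library:

```python
ABSTAIN = "I do not have sufficient evidence to answer this."

def answer_or_abstain(question, passages, generate,
                      min_score=0.75, min_support=2):
    """Answer only when retrieved evidence clears explicit thresholds.

    `passages` is a list of (text, score) pairs from some retriever;
    `generate` is whatever LLM call the system uses. Both, like the
    thresholds, are stand-ins: the protocol wraps the model, it does
    not replace it.
    """
    support = [text for text, score in passages if score >= min_score]
    if len(support) < min_support:
        return ABSTAIN  # the explicit condition under which silence is correct
    return generate(question, context="\n".join(support))
```

The judgment lives in the condition, not in the model: the model is never asked whether it knows.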
It's a problem that looks technical but is epistemological. How does one formalize judgment? How does one distinguish, within an inference flow, between an absence that is ignorance and an absence that is prudence? How does one teach a machine that sometimes the most precise thing it can say is nothing?
The question of abstention doesn't arise in a vacuum; it is the direct consequence of a historical triumph. For years, the great epic of the digital humanities was material: digitize, index, and make searchable what once required physical presence and months in the archive.4 That first generation solved access, leaving us with a nearly absolute availability of text. But that very abundance is what makes the problem of language models critical today. With access to a practically inexhaustible archive, the machine always finds enough statistical material to maintain its fluency, to sound convincing. The second generation of digital humanities faces a much darker problem than access: judgment. When algorithmic fluency allows the generation of answers about any corpus, the question is no longer "can it answer?" but "should it?"
It's worth being precise about what this implies, though. Asking a model in its prompt to respond "I don't have sufficient information" is not giving it awareness of its own ignorance. For the model, generating that rejection phrase remains a probabilistic exercise; it simply calculates that, under certain parameters, that is the most appropriate sequence of words. It continues to operate under the logic of the oracle, or, in psychoanalytic terms, under the figure of the Subject Supposed to Know: the one to whom a demand is addressed with the illusion that they hold the answer to everything.
Breaking that uninterrupted eloquence requires more than good instructions. It requires an architectural design where the language model ceases to be the absolute engine of truth and becomes subordinated to a strict verification system. Abstention does not emerge from the LLM on its own: it emerges, if it emerges, from the system that constrains it, audits it, and prevents it from completing when evidence falls short. An abstention protocol where the default response, in the absence of documentary evidence, is the cut. An explicit limit that teaches the machine that sometimes, the most precise and rigorous thing it can say is nothing.
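Schematically, and with each function standing in for a component that would have to be built (none of these names is a real API), the subordination this describes looks like this:

```python
ABSTAIN = "I do not have sufficient evidence to answer this."

def constrained_answer(question, retrieve, generate, verify):
    """A pipeline in which the model drafts and the system decides.

    All three callables are hypothetical components (a document
    retriever, an LLM call, a claim-by-claim verifier); the point
    is the order of authority, not any particular implementation.
    """
    evidence = retrieve(question)
    if not evidence:
        return ABSTAIN  # no documents: the default response is the cut
    draft = generate(question, evidence)
    if not verify(draft, evidence):
        return ABSTAIN  # fluent but unsupported text is discarded
    return draft
```

The model appears exactly once, in the middle, as one component among others; the two abstentions belong to the system that surrounds it.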
Deliberate silence — the kind born of judgment and not of ignorance — is probably the hardest capacity to build in an artificial intelligence system. It doesn't generate impressive metrics. It isn't easily demonstrated in a benchmark. It doesn't produce the kind of astonishment that sells licenses.
But it is what separates a tool that answers from a tool that can be trusted.
Wittgenstein closed the Tractatus with a single proposition.2 He had spent the entire book trying to trace the limits of language, and in the end he found them, not in a demonstration but in an act: to stop speaking. Seven propositions of formal logic to arrive at the conclusion that the most important thing he could do was to be silent.
I wonder whether we aren't, with artificial intelligence, at a similar moment. We have demonstrated that models can speak about almost anything. The question that remains — the hardest one, the one that truly matters — is whether they can learn not to.
1. The reason is architectural. A large language model (LLM) is trained through a single task: given a sequence of words, predict the next one. The loss function — cross-entropy loss — measures the distance between the model's prediction and the word that actually appears in the training text. This signal is identical regardless of whether the source text is a verified scientific article, a work of fiction, an erroneous Wikipedia entry, or speculation on a forum. There is no separate signal during training that distinguishes "this is a fact" from "this sounds like a fact." The result is that the model learns to produce statistically plausible sequences but lacks an internal mechanism to evaluate whether what it generates corresponds to something true or merely something probable. Subsequent techniques such as RLHF (Reinforcement Learning from Human Feedback) or instruction-based alignment partially mitigate this problem but do not resolve it at its root: the architecture generates; it does not evaluate. For an accessible technical discussion, see Bender, E. M. and Koller, A.: "Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data," Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5185–5198.
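In standard notation, the objective the note describes is, for a training text of T tokens:

```latex
% Next-token cross-entropy: the model with parameters \theta is
% penalized for assigning low probability to the token that actually
% comes next, never for that token's relation to truth.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})
```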
2. Author's translation from the original German: Wittgenstein, L.: Logisch-philosophische Abhandlung, in Annalen der Naturphilosophie, 14, 1921, pp. 185–262. First edition as a book: London, Kegan Paul, 1922. The standard English translation by D. F. Pears and B. F. McGuinness (Tractatus Logico-Philosophicus, London, Routledge & Kegan Paul, 1961) renders the proposition as: "What we cannot speak about we must pass over in silence."
3. Bonitzer, P.: Le champ aveugle. Essais sur le réalisme au cinéma, Paris, Cahiers du cinéma / Gallimard, 1982. English translation by the author from the Spanish edition: El campo ciego. Ensayos sobre el realismo en el cine, Buenos Aires, Santiago Arcos Editor, 2007, p. 68.
4. The standard periodization of the digital humanities typically identifies a "first wave" or generation focused primarily on infrastructure, text encoding, and the creation of large-scale repositories. This immense material effort spans from the pioneering computational work of Jesuit scholar Roberto Busa (whose Index Thomisticus began taking shape in the late 1940s) to the establishment of the Text Encoding Initiative (TEI) in the 1980s and the massive institutional and corporate digitization projects of the early twenty-first century. This phase laid the foundations of machine readability without which contemporary analysis would be impossible. For a foundational cartography of this development, see Schreibman, S., Siemens, R. and Unsworth, J. (eds.): A Companion to Digital Humanities, Oxford, Blackwell, 2004.