Day 13: Agency
Prerequisites
- The content on AIXI (from the previous day) should already be known
- A wide variety of background knowledge is helpful for understanding the different agendas: probability theory is helpful for money-pumps and complete class theorems; formal logic for tiling agents and Vingean reflection; computability theory for reflective oracles; information theory, causality, and statistical mechanics for the section on descriptive agent foundations
- In general, understanding the motivation for agent foundations research (such as the rocket alignment problem) is out of scope
Content
Fast-track
Minimal
- Why agent foundations
- Coherent decisions imply consistent utilities (why not circular preferences, probabilities & expected utility)
- Complete class: Consequentialist foundations
- Embedded agency
- An intuitive introduction to logical induction
- Vingean reflection
- Introduction to Löb's theorem
- Tiling agent walkthrough
- An intuitive introduction to functional decision theory
- Towards a new decision theory (updateless decision theory)
- Reflective consistency
- Limit computable grain of truth
- Reflective oracle
- Introduction to the infra-Bayesianism sequence
Descriptive agent foundations:
- The ground of optimization
- Optimization at a distance
- Algorithmic thermodynamics and 3 types of optimization
- General purpose search
- Generalized heat engine
- How we picture Bayesian agents
- Selection theorems
Standard
When trying to align a superintelligence to human values, we face a deep difficulty: we are trying to specify the behavior of an agent that does not yet exist, under conditions that may look nothing like anything we have observed. This means we need robust formal concepts of agency: concepts that will not break down when subjected to intense optimization pressure or when applied far out of distribution.
Coherence arguments and complete class theorems give us an initial foothold: they suggest that any sufficiently powerful agent that avoids dominated strategies will behave like a Bayesian expected utility maximizer. However, these frameworks often assume a clean separation between agent and environment; they operate in a "dualistic" setup where the agent contains a model of the world, observes inputs through a well-defined sensor channel, and selects actions through a well-defined output channel, as in the AIXI formalism.
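To make the money-pump intuition concrete, here is a minimal sketch (the items, fee, and trade protocol are illustrative, not taken from the readings): an agent with circular preferences A ≻ B ≻ C ≻ A will pay a small fee for every upgrade it is offered, so a trader can cycle it around the loop and drain its money while its holdings never improve.

```python
# Money-pump sketch: circular preferences A > B > C > A are exploitable.
# All names and numbers here are illustrative.
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y): x strictly preferred to y
FEE = 1.0

def accepts(holding, offered):
    """The agent trades whenever it strictly prefers the offered item."""
    return (offered, holding) in PREFERS

holding, money = "A", 10.0
for offered in ["C", "B", "A"] * 2:  # two trips around the preference cycle
    if accepts(holding, offered):
        holding, money = offered, money - FEE

print(holding, money)  # "A" 4.0 -- six fees paid, holdings unchanged
```

An agent maximizing a fixed utility function can never be cycled like this, which is the sense in which avoiding dominated strategies pushes sufficiently capable agents toward expected utility maximization.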
But real agents are embedded in their environment: they are made of the same physical stuff as everything else, cannot hold a full model of the world in memory because they are smaller than the world, and have no well-defined I/O boundary separating “agent” from “environment”. This embeddedness gives rise to a cluster of downstream subproblems that our standard dualistic frameworks fail to handle:
- Logical uncertainty. Standard Bayesian reasoning handles empirical uncertainty but assumes logical omniscience: the agent already knows all the consequences of its beliefs. Real agents cannot compute all logical consequences. Logical induction generalizes the Dutch book argument underlying Bayesianism to handle this: it describes how an agent should assign and update probabilities over logical facts, including mathematical conjectures it hasn't yet been able to prove.
- Vingean uncertainty. Vingean reflection concerns how an agent should reason about a cognitive system more powerful than itself (such as a smarter successor it is considering building) when it cannot predict the exact outputs of that system. Just as Deep Blue's designers could reason that the program was "trying to win" without knowing its exact moves, an agent constructing a successor must be able to approve that successor's design while reasoning only at an abstract level about its behavior.
- Self-trust and the Löbian obstacle. A natural desideratum for a self-modifying agent is that it trusts its own successor: if the successor proves that a given action leads to a good outcome, the parent (running the same proof system) should also be able to act on that conclusion. But Löb's theorem shows that a system cannot trust its own proofs without witnessing them (the theorem is stated just after this list). The tiling agents framework formalizes this obstacle and explores partial solutions.
- Decision theory. For a dualistic agent with a functional environment, optimization is straightforward: the agent can simply argmax over its actions. The problem for embedded agents is that they don't have a functional environment: their action is just another fact about the world, so there is no well-defined notion of "what would happen if I took a different action." Different decision theories correspond to different ways of constructing these counterfactuals: conditioning on the action (EDT), intervening causally (CDT), or reasoning about the logical consequences of one's decision procedure (FDT). Standard decision theories fail in characteristic ways (including becoming reflectively inconsistent) when faced with agents who can predict them, copies of themselves, or logical correlations; updateless decision theory and functional decision theory were developed to address these failure modes (see the toy Newcomb calculation after this list).
- Self-reference and environments containing other agents. Dualistic frameworks like AIXI assume the agent is "larger" than its environment (e.g. AIXI considers all lower semicomputable environments but is not itself computable), which breaks down when reasoning about environments containing agents of comparable intelligence, or about the agent itself. Reflective oracles address this by introducing a probabilistic oracle that can answer questions about the outputs of machines with access to that same oracle, allowing agents to model opponents as ordinary parts of the environment. A related problem is non-realizability: when the true environment contains the agent, it cannot lie within the agent's hypothesis space. Infra-Bayesianism addresses this by replacing point priors with sets of hypotheses representing Knightian uncertainty, and optimizing against the worst case within that set.
The problems outlined above can be roughly characterized as normative (or ideal) agent foundations: the goal is to characterize how a perfectly rational agent should reason and act in principle. A complementary research direction is descriptive agent foundations, which instead asks what the agents we actually encounter in the wild are, with the ambition of being able to look at any physical system (a neural network, a bacterium) and identify its goals, world-model, and decision-making structure.
A large part of this is formalizing foundational concepts underlying agency, such as optimization (roughly: when a process reliably steers a system toward a narrow set of target configurations despite a wide range of initial conditions), world-models, and general-purpose search. Optimization has deep conceptual ties to thermodynamics: it can be understood as a form of local entropy reduction (concentrating probability mass from a broad initial distribution onto a narrow final one), analogous to doing thermodynamic work. Since agents are embedded in our physical world, they must respect the second law and the reversibility of physics, and these constraints shape what kinds of optimization are possible. Algorithmic thermodynamics develops this connection in a way that’s more consistent with embedded agency, while examples like Maxwell’s demon and the generalized heat engine illustrate how thermodynamic reasoning yields insight about optimization.
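As a toy illustration of the entropy-reduction picture (our own sketch, not an example from the linked posts): run gradient descent from a broad distribution of initial states and watch the entropy of the state distribution collapse.

```python
# Optimization as local entropy reduction: gradient descent on f(x) = x^2
# contracts a broad ensemble of initial states into a narrow one around
# the optimum. In a physical implementation the second law demands that
# this entropy be exported elsewhere.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=100_000)  # broad distribution over initial states

def entropy_bits(samples, bins=200, lo=-10.0, hi=10.0):
    """Shannon entropy (bits) of the samples under a fixed discretization."""
    counts, _ = np.histogram(samples, bins=bins, range=(lo, hi))
    p = counts[counts > 0] / len(samples)
    return float(-(p * np.log2(p)).sum())

before = entropy_bits(x)
for _ in range(50):            # x <- x - lr * f'(x), with f'(x) = 2x, lr = 0.1
    x = x - 0.1 * (2 * x)
after = entropy_bits(x)
print(f"entropy: {before:.2f} bits -> {after:.2f} bits")  # ~7.6 -> ~1.0
```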
A further distinguishing feature of descriptive agent foundations is its methodology: whereas normative foundations start from desiderata for ideal agents and characterize what follows from them, descriptive foundations start from properties of the world and derive what kinds of agents those properties give rise to (with the aim that the theory applies to agents of any complexity, from bacteria to superintelligences). This includes modelling the internal mechanisms of agents in a way that accounts for boundedness and efficiency. For instance, "How we picture Bayesian agents" asks how to represent the internal structure of an agent (its beliefs, goals, and decision-making) in a way that is more mechanistically grounded, rather than treating the agent as an abstract utility-maximizing black box. Descriptive foundations also place greater emphasis on the selection pressures that give rise to agents and their properties, with the aim of understanding what type signature of agents will be selected for.
Learn more
- Foundations of algorithmic thermodynamics
- Causal arrow of time
- Admissibility and complete class
- Dutch book arguments
- Logical induction
- Functional decision theory
- Cartoon guide to Löb's theorem
- Tiling agents for self-modifying AI
Teaching guide
- 1 hour of introductory lecture on the bird's-eye picture & motivation of agent foundations
- 2.5 hours of reading + discussion
- Format: we separate into fundamental readings vs. readings on specific topics; each person reads a fundamental reading plus the readings on one topic, then discusses with others who've read different topics
- Fundamental reading:
- Embedded agency
- Why agent foundations
- General purpose search
- Topics:
- Consequentialist foundations (coherence & complete class theorems)
- Löb's theorem and tiling agents
- Logical induction
- Decision theory
- Optimization and thermodynamics
- Descriptive agent foundations
- 1 hour of lecture by Cole Wyeth on reflective oracles and nonrealizability
- 1 hour of exercises on Löb's theorem, complete class and information theory
- 1 hour of lecture by Aram Ebtekar on decision theory, the information engine, and how embedded agency motivates algorithmic thermodynamics