Exercises: Agent Foundations
The following are exercises on agent foundations. Each problem is broken into a sequence of lemmas leading to a main theorem. For each subquestion, try to prove the stated lemma before reading on. If you get stuck, you may treat the lemma as given and proceed to the next part.
Fix a consistent formal system $\mathsf{PA}$ (Peano Arithmetic) and write $\mathsf{PA} \vdash \varphi$ to mean that $\varphi$ is provable in $\mathsf{PA}$. We write $\Box \varphi$ for the arithmetical sentence asserting that $\varphi$ is provable, i.e. the formula $\mathsf{Provable}_{\mathsf{PA}}(\ulcorner \varphi \urcorner)$ obtained by encoding provability inside arithmetic via Gödel numbering.
Setup. For any sentence $P$, define the program
$$ \texttt{ProofSeeker}(P) := \text{“enumerate all strings; halt iff one is a valid } \mathsf{PA}\text{-proof of } P\text{.”} $$
This gives a correspondence between logical and computational notions:
| Logical statement | Computational analogue |
|---|---|
| $\mathsf{PA} \vdash P$ | $\texttt{ProofSeeker}(P)$ finds a proof and actually halts |
| $\mathsf{PA} \vdash \Box P$ | There exists a $\mathsf{PA}$-proof that $\texttt{ProofSeeker}(P)$ halts |
These two notions are genuinely distinct. When $\mathsf{PA} \vdash P$, the proof is the witness: $\mathsf{PA}$ has directly certified $P$. When $\mathsf{PA} \vdash \Box P$, the proof is not about $P$ itself — it is about the program $\texttt{ProofSeeker}(P)$, asserting that this program terminates. From the outside, we can see that termination of $\texttt{ProofSeeker}(P)$ entails a proof of $P$ exists; but this reasoning is illegible to $\mathsf{PA}$ itself. Löb’s theorem shows that any formal system at least as strong as $\mathsf{PA}$ cannot derive $P$ from $\Box P$ in general without becoming inconsistent.
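The ProofSeeker correspondence can be sketched in code. The sketch below is a toy illustration: `is_valid_proof` is a hypothetical checker supplied by the formal system, and the stand-in checker at the bottom is purely illustrative, not a real $\mathsf{PA}$ proof verifier.

```python
from itertools import count, product

def all_strings(alphabet="01"):
    """Enumerate all finite strings over the alphabet, shortest first."""
    for n in count(0):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def proof_seeker(sentence, is_valid_proof):
    """Halt (returning the proof) iff some string is a valid proof of `sentence`.

    If no proof exists, this loops forever -- exactly the behaviour
    the logical/computational correspondence relies on.
    """
    for candidate in all_strings():
        if is_valid_proof(candidate, sentence):
            return candidate

# Illustrative stand-in checker (NOT a real PA proof verifier): a "proof"
# of the sentence "0=0" is any string containing "11".
toy_checker = lambda candidate, sentence: sentence == "0=0" and "11" in candidate

print(proof_seeker("0=0", toy_checker))  # halts, since a "proof" exists
```

With a sound checker in place of the stand-in, $\texttt{proof\_seeker}(P)$ halting is exactly the computational counterpart of $\mathsf{PA} \vdash P$.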
Exercise 1: Gödel’s second incompleteness theorem
Let $L$ be a first-order system at least as strong as $\mathsf{PA}$. Since $L$ extends $\mathsf{PA}$, it can express statements about Turing machines. Write $\mathsf{Halts}(M)$ for the $L$-statement that Turing machine $M$ halts, and $\bot$ for any fixed $L$-contradiction. We say $L$ is consistent if $L \nvdash \bot$.
Key observation. If $M$ actually halts, then $L \vdash \mathsf{Halts}(M)$: $L$ can verify this by expanding out the execution trace up to the halting step. However, if $M$ does not halt, $L$ need not be able to prove $\neg\mathsf{Halts}(M)$; indeed, $L$ cannot decide $\mathsf{Halts}(M)$ for all $M$ while remaining consistent, for otherwise one could solve the halting problem by searching for $L$-proofs of $\mathsf{Halts}(M)$ and $\neg\mathsf{Halts}(M)$ in parallel.
Now define the machine $Z$ as follows:
$$ Z(A) := \text{“search over all } L\text{-proofs; halt iff one is a valid proof of } \neg\mathsf{Halts}(\ulcorner A(A) \urcorner)\text{.”} $$
Here $\ulcorner A(A) \urcorner$ denotes an encoding of the machine that runs $A$ on its own source code. Let $G$ be the statement
$$ G := \neg\mathsf{Halts}(\ulcorner Z(Z) \urcorner). $$
Part 1(a). Show that $G$ is true, assuming $L$ is consistent.
Part 1(b). Show that if $L \vdash \neg\Box_L(\bot)$ (i.e. $L$ proves its own consistency), then $L$ is inconsistent.
Hint: If $L$ can prove its own consistency, it can carry out inside $L$ the same reasoning you used in Part 1(a).
Part 1(c). Suppose $L \vdash \Box P \to P$ for all $P$. First show that $L$ proves that it never proves a false statement, i.e. $L \vdash \neg\Box P$ for all $P$ such that $L \vdash \neg P$. Then conclude that $L \vdash \neg\Box(\bot)$, and hence $L$ is inconsistent by Part 1(b).
Hint: Recall that $\neg\bot$ is a tautology.
Remark (Self-trust and the Löbian obstacle). A natural desideratum for a self-modifying AI is self-trust: the property $\forall P,\ L \vdash \Box P \to P$, i.e., the system trusts that anything it can prove is in fact true. This property becomes especially important when an AI constructs a successor that is smarter than itself. Since the successor is more capable, the parent AI cannot anticipate every individual action the successor will take. Instead, the parent would like to reason at an abstract level: “if my successor proves that an action achieves the goal, then that action in fact achieves that goal.” Formally, the parent $A_1$, reasoning in system $L_1$, would want to conclude $L_1 \vdash \Box_{L_0} P \to P$ for the successor’s proof system $L_0$, trusting the successor’s proofs without inspecting each one.
Part 1(c) shows this is impossible without inconsistency: any system $L$ strong enough to prove $\Box P \to P$ for all $P$ is already inconsistent. The underlying reason is that $\Box P \to P$ is a soundness schema (it says the system can vouch for its own proofs in the abstract), and Gödel’s Second Incompleteness Theorem rules this out for any consistent system at least as strong as $\mathsf{PA}$. Consequently, a parent AI can only trust the proofs of a successor whose formal system is strictly weaker than its own. This in turn would imply that an agent architecture can only tile a finite chain of successors (or make a finite number of self-modifications) before running out of “trust.”
Exercise 2: Löb’s theorem
We wish to show: if $L \vdash \Box C \to C$, then $L \vdash C$. We will use the following properties of the provability predicate throughout:
| (N) | Necessitation. | If $L \vdash \varphi$ then $L \vdash \Box\varphi$. |
| (K) | Distribution. | $L \vdash \Box(\varphi \to \psi) \to (\Box\varphi \to \Box\psi)$. |
| (4) | Löb condition. | $L \vdash \Box\varphi \to \Box\Box\varphi$. |
Löb’s sentence. There exists a sentence $\lambda$ such that
$$ L \vdash \lambda \leftrightarrow (\Box\lambda \to C). $$
Part 2(a). Show that $L \vdash \Box\lambda \to \Box C$.
Hint: Apply $\Box$ to the Löb sentence.
Part 2(b). Now assume $L \vdash \Box C \to C$. Using Part 2(a) and Löb’s sentence, derive $L \vdash C$.
Remark. The Löbian sentence $\lambda \leftrightarrow (\Box\lambda \to C)$ is a formal analogue of the Santa Claus sentence (Curry’s paradox): “If this sentence is true, then Santa Claus exists.” To see why this is paradoxical, let $S$ denote the sentence itself and $C$ denote “Santa Claus exists”, so $S$ asserts $S \to C$. To prove the conditional $S\to C$ it suffices to assume its antecedent and derive the consequent, so assume $S$ is true. Then what $S$ says is true, namely that $S \to C$. But we have assumed $S$, so $C$ follows. We have therefore shown $S \to C$ — which is exactly $S$ — so $S$ is true, and hence $C$ holds. Since $C$ was arbitrary, this “proves” any statement whatsoever.
The reason $\mathsf{PA}$ does not immediately fall prey to this paradox is Tarski’s undefinability theorem: there is no formula $\mathsf{True}(\ulcorner\varphi\urcorner)$ in the language of $\mathsf{PA}$ that correctly captures truth in the standard model for all sentences $\varphi$. Consequently, $\mathsf{PA}$ cannot form a sentence that refers to its own truth. What it can do, via Gödel numbering, is refer to its own provability: the predicate $\Box\varphi$ is a perfectly legitimate arithmetical formula. The Löbian sentence $\lambda$ therefore substitutes $\Box\lambda$ for truth, and the proof of Löb’s theorem shows that this substitution is enough to force $L \vdash C$ — but only when $\Box C \to C$ is already assumed, not unconditionally.
This mirrors the relationship between the liar’s paradox and Gödel’s first incompleteness theorem. The liar’s sentence “This sentence is false” cannot be formed in $\mathsf{PA}$ because truth is undefinable; but its provability-theoretic analogue “This sentence is not provable” can be, and it yields incompleteness rather than paradox. In both cases, replacing the truth predicate with the provability predicate transforms a semantic paradox into a precise metamathematical theorem.
Part 2(c). In a one-shot Prisoner’s Dilemma, two players each choose to either cooperate ($C$) or defect ($D$). Instead of choosing directly, each player submits a program which receives the opponent’s source code and outputs $C$ or $D$. Consider the following agent, FairBot:
$$ \texttt{FairBot}(\texttt{opponent}) := \text{“if } \mathsf{PA} \vdash \texttt{opponent}(\texttt{FairBot}) = C \text{ then return } C \text{ else return } D\text{.”} $$
That is, FairBot cooperates with an opponent if and only if it can find a $\mathsf{PA}$-proof that the opponent cooperates with FairBot. Note that FairBot is unexploitable: assuming $\mathsf{PA}$ is sound, FairBot never cooperates with an opponent that defects against it.
The interesting question is what happens when FairBot plays against itself. A priori, both mutual cooperation and mutual defection appear to be stable outcomes.
Consider two distinct implementations $\texttt{FairBot}_1$ and $\texttt{FairBot}_2$. Let $A$ be the formula $\texttt{FairBot}_1(\texttt{FairBot}_2) = C$ and $B$ be the formula $\texttt{FairBot}_2(\texttt{FairBot}_1) = C$.
Given these definitions, prove that $\mathsf{PA} \vdash A \wedge B$, i.e. that $\texttt{FairBot}_1$ and $\texttt{FairBot}_2$ mutually cooperate, using Löb’s Theorem.
Hint: What do $\Box A$ and $\Box B$ each imply, given the source code of FairBot? Show that $\mathsf{PA} \vdash \Box(A \wedge B) \to (A \wedge B)$.
Exercise 3: The Complete Class Theorem
Fix a finite set of actions $\mathcal{A} = \{a_1, \ldots, a_k\}$ and a finite set of environments $\Omega = \{\omega_1, \ldots, \omega_n\}$. Each action $a$ in environment $\omega$ yields reward $u(a, \omega) \in \mathbb{R}$. A decision rule $\delta = (\lambda_1, \ldots, \lambda_k)$ is a probability distribution over $\mathcal{A}$, i.e. $\lambda_j \geq 0$ for all $j$ and $\sum_{j=1}^k \lambda_j = 1$. The utility of $\delta$ in environment $\omega$ is
$$ u(\delta, \omega) := \sum_{j=1}^k \lambda_j \, u(a_j, \omega). $$
The risk vector of $\delta$ is $r(\delta) = (u(\delta, \omega_1), \ldots, u(\delta, \omega_n)) \in \mathbb{R}^n$, and the risk set is
$$ \mathcal{R} = \Bigl\{\sum_{j=1}^k \lambda_j \, r(a_j) : \lambda_j \geq 0,\; \sum_{j=1}^k \lambda_j = 1\Bigr\}, $$
i.e. the convex hull of the pure-action risk vectors $\{r(a_1), \ldots, r(a_k)\}$.
A decision rule $\delta$ is admissible if no other rule $\delta'$ satisfies $u(\delta', \omega_i) \geq u(\delta, \omega_i)$ for all $i$ with strict inequality for at least one $i$. The Pareto frontier $\mathcal{F} \subseteq \mathcal{R}$ is the set of all admissible risk vectors, and a pure action $a_j$ is Pareto-optimal if $r(a_j) \in \mathcal{F}$.
Given a prior $\pi = (\pi_1, \ldots, \pi_n)$ with $\pi_i \geq 0$ and $\sum_i \pi_i = 1$, the expected utility of $\delta$ under $\pi$ is
$$ \mathrm{EU}(\delta, \pi) := \sum_{i=1}^n \pi_i \, u(\delta, \omega_i) = \pi \cdot r(\delta). $$
A decision rule $\delta$ is Bayes-optimal under $\pi$ if $\mathrm{EU}(\delta, \pi) \geq \mathrm{EU}(\delta', \pi)$ for all $\delta' \in \Delta(\mathcal{A})$.
Theorem (Complete Class). Every admissible decision rule is Bayes-optimal under some prior $\pi$ with $\pi_i > 0$ for all $i$. Conversely, every Bayes-optimal rule under a prior with full support is admissible.
Part 3(a). Show the following two facts:
- (i) The Pareto frontier $\mathcal{F}$ is contained in the convex hull of the Pareto-optimal pure actions, i.e. any admissible decision rule $\delta = \sum_j \lambda_j a_j$ has $\lambda_j = 0$ whenever $a_j$ is not Pareto-optimal. Conclude that dominated pure actions need not be considered.
- (ii) Every Bayes-optimal rule under a prior $\pi$ with $\pi_i > 0$ for all $i$ is admissible. Hint: if $\delta'$ dominated $\delta$, what would happen to $\mathrm{EU}(\delta', \pi)$ compared to $\mathrm{EU}(\delta, \pi)$?
Part 3(b). A face $F$ of the Pareto frontier is a maximal convex subset of $\mathcal{F}$ of the form
$$ F = \Bigl\{\sum_{l=1}^p \mu_l \, r(a_{i_l}) : \mu_l \geq 0,\; \sum_{l=1}^p \mu_l = 1\Bigr\} $$
for some subset of Pareto-optimal pure actions $\{a_{i_1}, \ldots, a_{i_p}\}$. Define the difference vectors $v_l := r(a_{i_l}) - r(a_{i_1})$ for $l = 2, \ldots, p$, and let $H = \mathrm{span}\{v_2, \ldots, v_p\}$. Show that $H$ consists precisely of the directions along which one can move within $F$ while remaining in $F$: that is, if $r(\delta) \in F$ and $r(\delta) + h \in F$, then $h \in H$.
Part 3(c). Let $\pi$ be a normal vector to the hyperplane containing the face $F$ from Part 3(b) (equivalently, a vector orthogonal to $H$), normalized so that $\pi_i > 0$ for all $i$ and $\sum_i \pi_i = 1$. You may assume that the entire risk set $\mathcal{R}$ lies on one side of this hyperplane. Show that $\mathrm{EU}(\delta, \pi)$ is constant for all $\delta$ with $r(\delta) \in F$, and that every rule on the face is simultaneously Bayes-optimal under $\pi$. Conclude the Complete Class Theorem.
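The theorem can be spot-checked numerically on a small instance. The sketch below uses illustrative numbers (not taken from the exercise): $n = 2$ environments, so full-support priors form a one-parameter family that can be scanned on a grid, and for simplicity only pure actions are compared.

```python
import numpy as np

# Rows: pure actions a_1..a_4; columns: environments w_1, w_2.
# a_1, a_2, a_3 are admissible; a_4 is strictly dominated by a_3,
# so it should never be Bayes-optimal under a full-support prior.
U = np.array([[3.0, 0.0],
              [0.0, 3.0],
              [2.0, 2.0],
              [1.0, 1.0]])

def bayes_optimal_actions(prior):
    """Indices of pure actions maximizing expected utility EU = U @ prior."""
    eu = U @ prior
    return set(np.flatnonzero(np.isclose(eu, eu.max())))

best_for_some_prior = set()
for p in np.linspace(0.01, 0.99, 99):  # scan full-support priors (p, 1-p)
    best_for_some_prior |= bayes_optimal_actions(np.array([p, 1 - p]))

# Every admissible pure action is Bayes-optimal under some full-support prior...
assert {0, 1, 2} <= best_for_some_prior
# ...while the dominated action a_4 never is.
assert 3 not in best_for_some_prior
print("complete-class check passed")
```

For instance, $a_1$ wins when $\pi_1 > 2/3$, $a_2$ when $\pi_1 < 1/3$, and $a_3$ on the middle range, matching the picture of priors as normal vectors to faces of the Pareto frontier.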
Exercise 4: The Do-Divergence Theorem
Consider an agent (the “demon”) with observations $O$, actions $A$, and a policy $\pi$ specifying how actions are chosen as a function of observations. The actions and observations together cause an outcome $X$. The joint distribution factorises as
$$ P[X, A, O] = P[X \mid A, O] \, P[A \mid O] \, P[O]. $$
We compare this to a blind baseline in which the demon takes actions independently of its observations, i.e. actions are sampled from $P[A]$ without looking at $O$. This is written as the do-operation $do(A)$, under which
$$ P[X, A, O \mid do(A)] = P[X \mid A, O] \, P[A] \, P[O]. $$
Prove the Do-Divergence Theorem:
$$ D_{\mathrm{KL}}\bigl(P[X] \,\big\|\, P[X \mid do(A)]\bigr) \leq \mathrm{MI}(A;\, O). $$
Useful facts:
- Monotonicity of KL-divergence. For any two joint distributions $p(x,y)$ and $q(x,y)$, marginalizing out $y$ can only decrease KL-divergence: $D_{\mathrm{KL}}(p(x) \,\|\, q(x)) \leq D_{\mathrm{KL}}(p(x,y) \,\|\, q(x,y))$.
Remark. In Maxwell’s Demon, the outcome $X$ is the final configuration of molecules. When the demon acts blindly, molecules are equally likely to end up on either side, so $X$ is roughly uniform over both halves. If the demon perfectly sorts all $n$ molecules to the right, the KL divergence between these two distributions is $n \cdot \log 2$ (i.e. $n$ bits). The theorem therefore says that perfectly sorting $n$ molecules requires at least $n$ bits of mutual information between the demon’s actions and its observations — so any reduction in entropy of the gas must be paid for by information the demon has gathered about the molecules.
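The inequality can be verified numerically for finite distributions. A minimal sketch, with randomly generated distributions and illustrative dimension sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_dist(*shape, axis=0):
    """Random (conditional) distribution, normalized along `axis`."""
    p = rng.random(shape)
    return p / p.sum(axis=axis, keepdims=True)

nX, nA, nO = 4, 3, 3
P_O = rand_dist(nO)                    # P[O]
P_A_given_O = rand_dist(nA, nO)        # P[A|O], each column sums to 1
P_X_given_AO = rand_dist(nX, nA, nO)   # P[X|A,O], axis 0 sums to 1

# Joint P[X,A,O] = P[X|A,O] P[A|O] P[O], and its X-marginal
joint = P_X_given_AO * P_A_given_O[None, :, :] * P_O[None, None, :]
P_X = joint.sum(axis=(1, 2))

# Blind baseline: replace P[A|O] with the action marginal P[A]
P_A = (P_A_given_O * P_O[None, :]).sum(axis=1)
joint_do = P_X_given_AO * P_A[None, :, None] * P_O[None, None, :]
P_X_do = joint_do.sum(axis=(1, 2))

kl = np.sum(P_X * np.log(P_X / P_X_do))

# MI(A;O) from the joint P[A,O] = P[A|O] P[O]
P_AO = P_A_given_O * P_O[None, :]
mi = np.sum(P_AO * np.log(P_AO / (P_A[:, None] * P_O[None, :])))

assert kl <= mi + 1e-12
print(f"KL = {kl:.4f} <= MI = {mi:.4f}")
```

Trying other seeds and dimensions leaves the inequality intact, as the theorem requires; the gap closes only when the outcome responds sharply to the observation-dependent part of the policy.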
Exercise 5: Channel Additivity
Consider two independent channels $X_1 \to Y_1$ and $X_2 \to Y_2$ with a fixed joint channel
$$ P[Y \mid X] = P[Y_1 \mid X_1] \, P[Y_2 \mid X_2]. $$
We are free to choose the input distribution $P[X]$ (which determines the joint distribution $P[Y,X]=P[Y|X]P[X]$), and the goal is to maximize the mutual information $\mathrm{MI}(X; Y)$, called the information throughput.
Part 5(a). Show that
$$ \mathrm{MI}(X;\, Y) = \mathrm{MI}(X_1;\, Y_1) + \mathrm{MI}(X_2;\, Y_2) - \mathrm{MI}(Y_1;\, Y_2). $$
Hint: Write mutual information as a KL divergence between the joint and the product of marginals, i.e. $\mathrm{MI}(A; B) = D_{\mathrm{KL}}(P[A,B] \,\|\, P[A]\,P[B])$, and use the factorization of the channel to simplify.
Part 5(b). Given any input distribution $P[X_1, X_2]$, construct
$$ Q[X_1, X_2] := P[X_1] \, P[X_2], $$
i.e. the product of the marginals of $P$. Show that $\mathrm{MI}_Q(X; Y) \geq \mathrm{MI}_P(X; Y)$.
Part 5(c). Use Part 5(b) to conclude that there always exists a throughput-maximizing input distribution under which $X_1 \perp\!\!\!\perp X_2$.
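Both the identity in Part 5(a) and the inequality in Part 5(b) can be spot-checked numerically. A sketch with random channels and a random correlated input (all sizes and seeds illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def mi(joint):
    """Mutual information (in nats) of a 2-D joint distribution."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask]))

n1, n2, m1, m2 = 3, 3, 3, 3
C1 = rng.random((m1, n1)); C1 /= C1.sum(axis=0)   # P[Y1|X1]
C2 = rng.random((m2, n2)); C2 /= C2.sum(axis=0)   # P[Y2|X2]
P_X = rng.random((n1, n2)); P_X /= P_X.sum()      # correlated input P[X1,X2]

def throughput(P):
    """MI(X;Y) plus the three terms of the Part 5(a) decomposition."""
    # Full joint P[X1,X2,Y1,Y2] = P[X1,X2] P[Y1|X1] P[Y2|X2]
    J = P[:, :, None, None] * C1.T[:, None, :, None] * C2.T[None, :, None, :]
    mi_xy = mi(J.reshape(n1 * n2, m1 * m2))  # group (X1,X2) vs (Y1,Y2)
    mi_1 = mi(J.sum(axis=(1, 3)))            # MI(X1;Y1)
    mi_2 = mi(J.sum(axis=(0, 2)))            # MI(X2;Y2)
    mi_y = mi(J.sum(axis=(0, 1)))            # MI(Y1;Y2)
    return mi_xy, mi_1, mi_2, mi_y

mi_xy, mi_1, mi_2, mi_y = throughput(P_X)
assert np.isclose(mi_xy, mi_1 + mi_2 - mi_y)      # Part 5(a) identity

Q_X = np.outer(P_X.sum(axis=1), P_X.sum(axis=0))  # product of marginals
assert throughput(Q_X)[0] >= mi_xy - 1e-12        # Part 5(b) inequality
print("channel additivity checks passed")
```

Under the product input $Q$, the term $\mathrm{MI}(Y_1; Y_2)$ vanishes, which is exactly why decorrelating the inputs can only help.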
Remark. This result has a natural interpretation in terms of optimization and agency. Think of $X$ as the actions of an agent and $Y$ as the outcomes it cares about. The channel $P[Y \mid X]$ — how actions influence outcomes — is fixed by the environment and outside the agent’s control; the agent only gets to choose its policy $P[X]$. The factorization condition on $P[Y \mid X]$ says that the environment is modular: the two groups of outcomes $Y_1$ and $Y_2$ are each influenced only by their respective actions $X_1$ and $X_2$. Our theorem says that if the environment is modular in this sense, then the agent can always find an optimal policy that is modular in a corresponding sense — specifically, the two groups of actions need not be coordinated at all and can be chosen independently.