C.3 Agent Foundations: Exercise Solutions

Exercise 1: Gödel’s Second Incompleteness Theorem

Part 1(a): $G$ is true assuming $L$ is consistent

Claim. The sentence $G := \neg\mathsf{Halts}(\ulcorner Z(Z) \urcorner)$ is true, provided $L$ is consistent.

Intuition. $G$ says “the machine $Z$ running on its own source code does not halt.” The machine $Z(Z)$ halts only if it finds an $L$-proof of exactly this non-halting statement. If it did find such a proof and halt, then halting would be a true fact that $L$ could verify — but $L$ would also have proved non-halting. That would make $L$ inconsistent. So, as long as $L$ is consistent, $Z(Z)$ cannot halt, and $G$ is true.

Proof. Suppose for contradiction that $G$ is false. Then $\mathsf{Halts}(\ulcorner Z(Z) \urcorner)$ is true, meaning $Z(Z)$ halts. By the definition of $Z$, the machine $Z(Z)$ halts only when it finds a valid $L$-proof of $\neg\mathsf{Halts}(\ulcorner Z(Z) \urcorner)$, which is the sentence $G$ itself. So:

$$ L \vdash G. $$

On the other hand, since $Z(Z)$ actually halts, the statement $\mathsf{Halts}(\ulcorner Z(Z) \urcorner)$ is true. Because $L$ can verify a halting computation by inspecting the finite execution trace, we also get:

$$ L \vdash \mathsf{Halts}(\ulcorner Z(Z) \urcorner), \quad \text{i.e.} \quad L \vdash \neg G. $$

Now $L$ proves both $G$ and $\neg G$. This is a contradiction, so $L$ is inconsistent — violating our assumption. Therefore $G$ must be true. $\blacksquare$

Part 1(b): If $L$ proves its own consistency, then $L$ is inconsistent

Claim. If $L \vdash \neg\Box_L(\bot)$, then $L$ is inconsistent.

Intuition. Part 1(a) gave us a proof that “if $L$ is consistent, then $G$ is true.” That argument used only elementary reasoning about Turing machines and proofs — reasoning that $L$ itself is powerful enough to carry out internally. So if $L$ can also state “I am consistent” as a theorem, it can chain the two together and derive $G$. But deriving $G$ triggers the very paradox that makes $L$ inconsistent.

Proof. The argument in Part 1(a) was: assuming $L$ is consistent (i.e., $\neg\Box_L(\bot)$), we derived that $G$ is true. This reasoning can be formalized inside $L$, because $L$ is at least as strong as $\mathsf{PA}$ and the argument only involves:

  • the definition of $Z$,
  • the fact that halting computations can be verified,
  • basic logical deduction.

So $L$ can prove the implication:

$$ L \vdash \neg\Box_L(\bot) \to G. $$

Now suppose $L \vdash \neg\Box_L(\bot)$. By modus ponens:

$$ L \vdash G. $$

This means $L$ has produced a proof of $G = \neg\mathsf{Halts}(\ulcorner Z(Z) \urcorner)$. But $\texttt{ProofSeeker}(G)$ searches for exactly such a proof. Since the proof exists, $\texttt{ProofSeeker}(G)$ finds it and halts. The machine $Z(Z)$ runs $\texttt{ProofSeeker}(G)$ internally, so $Z(Z)$ halts as well.

Since $Z(Z)$ actually halts, $L$ can verify this by inspecting the execution trace:

$$ L \vdash \mathsf{Halts}(\ulcorner Z(Z) \urcorner), \quad \text{i.e.} \quad L \vdash \neg G. $$

So $L$ proves both $G$ and $\neg G$, meaning $L$ is inconsistent. $\blacksquare$

Part 1(c): If $L \vdash \Box P \to P$ for all $P$, then $L$ is inconsistent

Claim. If $L$ can prove the schema $\Box P \to P$ for every sentence $P$, then $L$ is inconsistent.

Intuition. The schema “$\Box P \to P$” says “anything I can prove is true.” This is a self-trust property. We will show in three short steps that self-trust lets $L$ prove its own consistency, which Part 1(b) already told us leads to inconsistency.

Step 1: $L$ proves it never proves a false statement. Take any sentence $P$ with $L \vdash \neg P$. We want to show $L \vdash \neg\Box P$.

By assumption, $L \vdash \Box P \to P$. Taking the contrapositive (which is a basic logical equivalence):

$$ L \vdash \neg P \to \neg\Box P. $$

Since $L \vdash \neg P$, applying modus ponens gives:

$$ L \vdash \neg\Box P. $$

In words: for any $P$ that $L$ can refute, $L$ can also prove that $P$ is not provable.

Step 2: $L$ proves its own consistency. The sentence $\bot$ (a contradiction) is false by definition, so $\neg\bot$ is a logical tautology — it is provable in any system that includes basic propositional logic:

$$ L \vdash \neg\bot. $$

Now apply Step 1 with $P = \bot$. Since $L \vdash \neg\bot$, we get:

$$ L \vdash \neg\Box\bot. $$

This is precisely the statement that $L$ is consistent, expressed inside $L$ itself: “$L$ does not prove a contradiction.”

Step 3: Conclude inconsistency. By Part 1(b), any system that proves its own consistency is inconsistent. Since $L \vdash \neg\Box_L(\bot)$, we conclude that $L$ is inconsistent. $\blacksquare$


Exercise 2: Löb’s Theorem

We wish to prove: if $L \vdash \Box C \to C$, then $L \vdash C$. We use the three standard properties of the provability predicate:

  (N) Necessitation. If $L \vdash \varphi$ then $L \vdash \Box\varphi$.
  (K) Distribution. $L \vdash \Box(\varphi \to \psi) \to (\Box\varphi \to \Box\psi)$.
  (4) Löb condition. $L \vdash \Box\varphi \to \Box\Box\varphi$.

The Löb sentence is a sentence $\lambda$ satisfying $L \vdash \lambda \leftrightarrow (\Box\lambda \to C)$.

Part 2(a): Show $L \vdash \Box\lambda \to \Box C$

Intuition. We need to work entirely “inside the box” — showing that if $L$ can prove $\lambda$, then $L$ can prove $C$. The idea is to take the Löb sentence (which says $\lambda$ is equivalent to “$\Box\lambda$ implies $C$”), apply $\Box$ to lift it into the provability world, and then chain things together.

Proof. Start from the Löb sentence. One direction of the biconditional gives us:

$$ L \vdash \lambda \to (\Box\lambda \to C). $$

Apply Necessitation (N). Since this is a theorem of $L$, we can box it:

$$ L \vdash \Box(\lambda \to (\Box\lambda \to C)). $$

Apply Distribution (K) once. The distribution axiom says $\Box(\varphi \to \psi) \to (\Box\varphi \to \Box\psi)$. Setting $\varphi = \lambda$ and $\psi = (\Box\lambda \to C)$:

$$ L \vdash \Box\lambda \to \Box(\Box\lambda \to C). $$

Apply Distribution (K) again. Now apply (K) to $\Box(\Box\lambda \to C)$, with $\varphi = \Box\lambda$ and $\psi = C$:

$$ L \vdash \Box(\Box\lambda \to C) \to (\Box\Box\lambda \to \Box C). $$

Apply the Löb condition (4). This gives us:

$$ L \vdash \Box\lambda \to \Box\Box\lambda. $$

Chain everything together. Starting from $\Box\lambda$:

  1. $\Box\lambda \to \Box(\Box\lambda \to C)$ (from the first application of K above),
  2. $\Box(\Box\lambda \to C) \to (\Box\Box\lambda \to \Box C)$ (from the second application of K),
  3. $\Box\lambda \to \Box\Box\lambda$ (from condition 4).

So from $\Box\lambda$, we get $\Box(\Box\lambda \to C)$ by (1), and $\Box\Box\lambda$ by (3). Feeding both into (2) gives $\Box C$. Therefore:

$$ L \vdash \Box\lambda \to \Box C. \quad \blacksquare $$

Part 2(b): Assuming $L \vdash \Box C \to C$, derive $L \vdash C$

Intuition. We have just shown that $\Box\lambda$ implies $\Box C$. If we additionally assume $\Box C \to C$, then $\Box\lambda$ implies $C$. But the Löb sentence says $\lambda$ is equivalent to “$\Box\lambda$ implies $C$” — and we have just proved that implication! So $\lambda$ is true, hence provable, and we can push through to get $C$.

Proof.

From Part 2(a): $L \vdash \Box\lambda \to \Box C$.

From the assumption: $L \vdash \Box C \to C$.

Chaining these two: $L \vdash \Box\lambda \to C$.

Now recall the Löb sentence: $L \vdash \lambda \leftrightarrow (\Box\lambda \to C)$. We have just shown $L \vdash \Box\lambda \to C$, which is the right-hand side of this biconditional. So by the reverse direction of the biconditional:

$$ L \vdash \lambda. $$

Apply Necessitation (N) to this theorem:

$$ L \vdash \Box\lambda. $$

Finally, since $L \vdash \Box\lambda \to C$ and $L \vdash \Box\lambda$, modus ponens gives:

$$ L \vdash C. \quad \blacksquare $$
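The chain of steps above can be mirrored in Lean 4. The following is a hedged sketch, not a faithful formalization of provability in $L$: `Box` is an abstract operator on `Prop`, and necessitation is modeled as an implication `p → Box p`, which is strictly stronger than the actual inference rule (N).

```lean
-- Hedged sketch: `Box` is an arbitrary operator on `Prop`, and the
-- three conditions plus the Löb sentence are taken as hypotheses.
theorem loeb (Box : Prop → Prop) (lam C : Prop)
    (nec   : ∀ {p : Prop}, p → Box p)                       -- (N), strengthened
    (dist  : ∀ {p q : Prop}, Box (p → q) → Box p → Box q)   -- (K)
    (four  : ∀ {p : Prop}, Box p → Box (Box p))             -- (4)
    (fix   : lam ↔ (Box lam → C))                           -- Löb sentence
    (trust : Box C → C)                                     -- assumption Box C → C
    : C :=
  -- Part 2(a): from Box lam, derive Box C; with `trust`, get Box lam → C.
  have h : Box lam → C := fun bl =>
    trust (dist (dist (nec fix.mp) bl) (four bl))
  -- Part 2(b): the reverse direction of `fix` gives lam, hence Box lam, hence C.
  h (nec (fix.mpr h))
```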

Part 2(c): FairBot mutual cooperation

Claim. When $\texttt{FairBot}_1$ plays $\texttt{FairBot}_2$, both cooperate: $\mathsf{PA} \vdash A \wedge B$.

Let $A$ denote “$\texttt{FairBot}_1(\texttt{FairBot}_2) = C$” and $B$ denote “$\texttt{FairBot}_2(\texttt{FairBot}_1) = C$.”

Intuition. Each FairBot cooperates if it can prove the other cooperates. So $\Box B$ (a proof that FairBot$_2$ cooperates) triggers FairBot$_1$ to cooperate, and vice versa. This circular dependency looks like it could go either way, but Löb’s theorem breaks the symmetry: it says that if provability of a statement implies the statement, then the statement is actually provable. The mutual dependence satisfies exactly this condition.

Proof.

Step 1: Extract what each FairBot’s source code guarantees.

By the definition of $\texttt{FairBot}_1$: if $\mathsf{PA}$ can prove that $\texttt{FairBot}_2$ cooperates (i.e., $\mathsf{PA} \vdash B$), then $\texttt{FairBot}_1$ returns $C$. In other words:

$$ \mathsf{PA} \vdash \Box B \to A. $$

Similarly, by the definition of $\texttt{FairBot}_2$:

$$ \mathsf{PA} \vdash \Box A \to B. $$

Step 2: Show $\mathsf{PA} \vdash \Box(A \wedge B) \to (A \wedge B)$.

Work inside $\mathsf{PA}$ under the assumption $\Box(A \wedge B)$. Since $\Box$ distributes over conjunction (which follows from the Distribution axiom (K)):

$$ \mathsf{PA} \vdash \Box(A \wedge B) \to \Box A \quad \text{and} \quad \mathsf{PA} \vdash \Box(A \wedge B) \to \Box B. $$

From $\Box B$, the first FairBot implication gives $A$; from $\Box A$, the second gives $B$. Discharging the assumption $\Box(A \wedge B)$:

$$ \mathsf{PA} \vdash \Box(A \wedge B) \to (A \wedge B). $$

Step 3: Apply Löb’s theorem. Setting $C = A \wedge B$ in Löb’s theorem: since $\mathsf{PA} \vdash \Box(A \wedge B) \to (A \wedge B)$, we conclude:

$$ \mathsf{PA} \vdash A \wedge B. $$

Both FairBots cooperate. $\blacksquare$
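The fixed point can be computed concretely using Kripke semantics for the provability logic GL on a finite linear frame, where $\Box\varphi$ holds at world $k$ iff $\varphi$ holds at every world $j < k$. This is a hedged semantic sketch of the Löbian argument, not an actual proof search inside $\mathsf{PA}$:

```python
# Evaluate FairBot vs FairBot via Kripke semantics for provability
# logic GL on a finite linear frame 0 < 1 < ... < N-1.  At world k,
# box(phi) is true iff phi holds at every world j < k.  (A semantic
# sketch of the Löbian fixed point, not proof search inside PA.)
from functools import lru_cache

N = 10  # number of worlds in the linear frame

@lru_cache(maxsize=None)
def A(k):  # "FairBot_1 cooperates"  <->  box("FairBot_2 cooperates")
    return all(B(j) for j in range(k))

@lru_cache(maxsize=None)
def B(k):  # "FairBot_2 cooperates"  <->  box("FairBot_1 cooperates")
    return all(A(j) for j in range(k))

# A and B hold at every world: both bots cooperate.
print(all(A(k) and B(k) for k in range(N)))  # prints True
```

At world 0 the $\Box$ is vacuously true, so both bots cooperate there, and the cooperation then propagates up the frame — the semantic shadow of Löb's theorem.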


Exercise 3: The Complete Class Theorem

Recall the setup: we have a finite set of actions $\mathcal{A} = \{a_1, \ldots, a_k\}$, a finite set of environments $\Omega = \{\omega_1, \ldots, \omega_n\}$, and a reward function $u(a, \omega)$. Decision rules $\delta$ are probability distributions over actions. The risk vector $r(\delta) \in \mathbb{R}^n$ records the expected utility of $\delta$ in each environment. The risk set $\mathcal{R}$ is the convex hull of the pure-action risk vectors.

Part 3(a)(i): The Pareto frontier lies in the convex hull of Pareto-optimal pure actions

Claim. If $\delta = \sum_j \lambda_j a_j$ is admissible, then $\lambda_j = 0$ whenever $a_j$ is not Pareto-optimal.

Intuition. If an admissible mixture puts positive weight on a dominated pure action, we can swap that action for one that dominates it, improving the mixture — contradicting admissibility.

Proof. Suppose $\delta = \sum_j \lambda_j a_j$ is admissible but places positive weight $\lambda_j > 0$ on some non-Pareto-optimal pure action $a_j$. Since $a_j$ is not Pareto-optimal, there exists another action $a'$ such that:

  • $u(a', \omega_i) \geq u(a_j, \omega_i)$ for all environments $\omega_i$, and
  • $u(a', \omega_{i_0}) > u(a_j, \omega_{i_0})$ for at least one environment $\omega_{i_0}$.

Construct a new decision rule $\delta'$ that is identical to $\delta$ except with $a_j$ replaced by $a'$ (keeping the same weight $\lambda_j$). Then for every environment $\omega_i$:

$$ u(\delta', \omega_i) = u(\delta, \omega_i) + \lambda_j \bigl[u(a', \omega_i) - u(a_j, \omega_i)\bigr]. $$

The bracketed term is $\geq 0$ for all $i$, and strictly $> 0$ for $i = i_0$. Since $\lambda_j > 0$, we have $u(\delta', \omega_i) \geq u(\delta, \omega_i)$ for all $i$, with strict inequality for $i = i_0$. This means $\delta'$ dominates $\delta$, contradicting the admissibility of $\delta$.

Therefore $\lambda_j = 0$ for every non-Pareto-optimal $a_j$, and the Pareto frontier is contained in the convex hull of the Pareto-optimal pure actions. $\blacksquare$
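The swap identity is easy to sanity-check numerically. A minimal sketch with a hypothetical payoff matrix (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical payoff matrix: row j is the risk vector u(a_j, .).
U = np.array([[1.0, 0.0, 2.0],   # a_0
              [0.0, 1.0, 1.0],   # a_1 -- dominated by a_2
              [1.0, 2.0, 1.0]])  # a_2 -- dominates a_1

lam      = np.array([0.5, 0.3, 0.2])  # mixture delta, weight 0.3 on a_1
lam_swap = np.array([0.5, 0.0, 0.5])  # move that weight onto a_2

u_delta      = lam @ U
u_delta_swap = lam_swap @ U

# The identity u(delta', w) = u(delta, w) + lambda_j [u(a', w) - u(a_j, w)]:
assert np.allclose(u_delta_swap, u_delta + 0.3 * (U[2] - U[1]))
# ... and the swapped mixture dominates the original:
assert np.all(u_delta_swap >= u_delta) and np.any(u_delta_swap > u_delta)
```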

Part 3(a)(ii): Bayes-optimal under full-support prior implies admissible

Claim. If $\delta$ is Bayes-optimal under a prior $\pi$ with $\pi_i > 0$ for all $i$, then $\delta$ is admissible.

Intuition. If some $\delta'$ dominated $\delta$ (doing at least as well everywhere, strictly better somewhere), then $\delta'$ would have strictly higher expected utility under any full-support prior — contradicting Bayes-optimality.

Proof. Suppose for contradiction that $\delta$ is Bayes-optimal under $\pi$ (with all $\pi_i > 0$) but is not admissible. Then there exists $\delta'$ with:

  • $u(\delta', \omega_i) \geq u(\delta, \omega_i)$ for all $i$, and
  • $u(\delta', \omega_{i_0}) > u(\delta, \omega_{i_0})$ for some $i_0$.

Compute the difference in expected utility:

$$ \mathrm{EU}(\delta', \pi) - \mathrm{EU}(\delta, \pi) = \sum_{i=1}^n \pi_i \bigl[u(\delta', \omega_i) - u(\delta, \omega_i)\bigr]. $$

Every term in this sum is $\geq 0$ (since each $\pi_i > 0$ and each difference in brackets is $\geq 0$). The term for $i = i_0$ is strictly positive (since $\pi_{i_0} > 0$ and $u(\delta', \omega_{i_0}) - u(\delta, \omega_{i_0}) > 0$). Therefore:

$$ \mathrm{EU}(\delta', \pi) > \mathrm{EU}(\delta, \pi), $$

contradicting the Bayes-optimality of $\delta$. $\blacksquare$

Part 3(b): $H$ is the set of feasible directions within the face $F$

Setup. We have a face $F$ of the Pareto frontier, spanned by Pareto-optimal pure actions $\{a_{i_1}, \ldots, a_{i_p}\}$. The difference vectors are $v_l := r(a_{i_l}) - r(a_{i_1})$ for $l = 2, \ldots, p$, and $H = \mathrm{span}\{v_2, \ldots, v_p\}$.

Claim. $H$ consists exactly of the directions one can move within $F$: if $r(\delta) \in F$ and $r(\delta) + h \in F$, then $h \in H$; conversely, from an interior point of $F$, any sufficiently small step in a direction $h \in H$ stays in $F$.

Intuition. The face $F$ is a convex polytope — the convex hull of the points $r(a_{i_1}), \ldots, r(a_{i_p})$. Any point in $F$ can be written as $r(a_{i_1}) + \sum_{l=2}^p \mu_l v_l$ for appropriate non-negative coefficients. The directions you can travel while staying in this polytope are exactly the directions that lie within the affine hull of $F$, which is the subspace $H$ translated to pass through $r(a_{i_1})$.

Proof. Any point in $F$ has the form:

$$ r(\delta) = \sum_{l=1}^p \mu_l \, r(a_{i_l}), \quad \mu_l \geq 0, \quad \sum_{l=1}^p \mu_l = 1. $$

Rewrite this using $r(a_{i_l}) = r(a_{i_1}) + v_l$ (with $v_1 = 0$):

$$ r(\delta) = r(a_{i_1}) + \sum_{l=2}^p \mu_l \, v_l. $$

So every point in $F$ equals $r(a_{i_1})$ plus a vector in the non-negative span of $\{v_2, \ldots, v_p\}$ (with the constraint that $\sum_{l=2}^p \mu_l \leq 1$).

Now suppose $r(\delta) + h \in F$ for some point $r(\delta) \in F$. Then $r(\delta) + h = r(a_{i_1}) + \sum_{l=2}^p \mu'_l v_l$ for some valid coefficients. Since $r(\delta) = r(a_{i_1}) + \sum_{l=2}^p \mu_l v_l$, we get:

$$ h = \sum_{l=2}^p (\mu'_l - \mu_l) \, v_l \in \mathrm{span}\{v_2, \ldots, v_p\} = H. $$

Conversely, for any $h \in H$ and any interior point $r(\delta)$ of $F$, a sufficiently small step $r(\delta) + th$ remains in $F$ (because $F$ is a convex polytope and interior points have room to move in any direction within the affine hull). So $H$ is precisely the set of feasible directions. $\blacksquare$

Part 3(c): The normal to $H$ gives Bayes-optimality

Setup. Let $\pi$ be a normal vector to the hyperplane containing $F$ (so $\pi \cdot h = 0$ for all $h \in H$), normalized so that $\pi_i > 0$ for all $i$ and $\sum_i \pi_i = 1$. Assume the entire risk set $\mathcal{R}$ lies on one side of this hyperplane.

Claim. Every decision rule on $F$ is Bayes-optimal under $\pi$, completing the proof of the Complete Class Theorem.

Intuition. Since $\pi$ is perpendicular to every direction within $F$, the dot product $\pi \cdot r(\delta)$ — which is the expected utility $\mathrm{EU}(\delta, \pi)$ — is the same for every $\delta$ on $F$. And since the rest of the risk set lies on the other side of the hyperplane, no other decision rule can achieve a higher expected utility.

Proof.

Expected utility is constant on $F$. Since $\pi$ is normal to $H$, we have $\pi \cdot v_l = 0$ for all $l = 2, \ldots, p$. This means:

$$ \pi \cdot r(a_{i_l}) = \pi \cdot \bigl[r(a_{i_1}) + v_l\bigr] = \pi \cdot r(a_{i_1}) + \pi \cdot v_l = \pi \cdot r(a_{i_1}). $$

So all the vertices of $F$ have the same expected utility under $\pi$. Since every point in $F$ is a convex combination of these vertices, $\mathrm{EU}(\delta, \pi) = \pi \cdot r(\delta)$ is the same constant for all $\delta$ with $r(\delta) \in F$.

No other rule achieves higher expected utility. The risk set $\mathcal{R}$ lies on one side of the hyperplane through $F$ with normal $\pi$. This means for any $\delta'$ with $r(\delta') \in \mathcal{R}$:

$$ \pi \cdot r(\delta') \leq \pi \cdot r(\delta) \quad \text{for any } \delta \text{ on } F. $$

Therefore every rule on $F$ maximizes $\mathrm{EU}(\cdot, \pi)$ over all of $\mathcal{R}$, i.e., every rule on $F$ is Bayes-optimal under $\pi$.
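In two dimensions the construction is easy to visualize. A hedged numeric sketch, with hypothetical risk vectors, of how the normal to a face yields a full-support prior under which the whole face is Bayes-optimal:

```python
import numpy as np

# Hypothetical risk vectors in R^2 (two environments).
r1, r2 = np.array([3.0, 1.0]), np.array([1.0, 2.0])  # vertices of the face F
r_bad  = np.array([1.0, 1.0])                        # some other feasible point

v  = r2 - r1                  # spans the direction space H of the face
pi = np.array([v[1], -v[0]])  # orthogonal to v
pi = pi / pi.sum()            # normalize; both entries positive in this example
assert np.all(pi > 0) and np.isclose(pi @ v, 0.0)

# Expected utility pi . r is constant along F ...
for t in np.linspace(0.0, 1.0, 11):
    assert np.isclose(pi @ ((1 - t) * r1 + t * r2), pi @ r1)
# ... and the other point does no better, so every rule on F is Bayes-optimal.
assert pi @ r_bad <= pi @ r1
```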

Completing the Complete Class Theorem. We have shown both directions:

  • (Admissible $\Rightarrow$ Bayes-optimal): Every admissible rule lies on some face $F$ of the Pareto frontier. By the construction above, every rule on $F$ is Bayes-optimal under a full-support prior $\pi$.
  • (Bayes-optimal $\Rightarrow$ Admissible): By Part 3(a)(ii), every Bayes-optimal rule under a full-support prior is admissible.

This establishes the Complete Class Theorem: the class of admissible rules coincides exactly with the class of Bayes-optimal rules under full-support priors. $\blacksquare$


Exercise 4: The Do-Divergence Theorem

Theorem. $D_{\mathrm{KL}}\bigl(P[X] \,\big\|\, P[X \mid do(A)]\bigr) \leq \mathrm{MI}(A; O)$.

Intuition. The left-hand side measures how much the demon’s policy changes the outcome distribution compared to a blind baseline. The right-hand side is the mutual information between the demon’s actions and observations — how much the demon “looks” at the world before acting. The theorem says the demon cannot influence outcomes more than the information it gathers allows.

The proof has two steps: (1) compute the KL divergence between the full joint distributions (not just the marginals over $X$), and show it equals $\mathrm{MI}(A; O)$; (2) apply monotonicity of KL divergence to conclude that marginalizing down to $X$ can only shrink the divergence.

Step 1: KL divergence of the full joints equals $\mathrm{MI}(A; O)$

The two joint distributions are:

$$ P[X, A, O] = P[X \mid A, O] \cdot P[A \mid O] \cdot P[O], $$

$$ P[X, A, O \mid do(A)] = P[X \mid A, O] \cdot P[A] \cdot P[O]. $$

The only difference is that $P[A \mid O]$ (the demon’s actual policy) is replaced by $P[A]$ (the marginal over actions, ignoring observations). Compute the KL divergence:

$$ D_{\mathrm{KL}}\bigl(P[X,A,O] \,\big\|\, P[X,A,O \mid do(A)]\bigr) = \sum_{x,a,o} P[x,a,o] \log \frac{P[x,a,o]}{P[x,a,o \mid do(A)]}. $$

Substitute the factorizations:

$$ = \sum_{x,a,o} P[x \mid a,o] \, P[a \mid o] \, P[o] \;\log \frac{P[x \mid a,o] \, P[a \mid o] \, P[o]}{P[x \mid a,o] \, P[a] \, P[o]}. $$

The $P[x \mid a, o]$ and $P[o]$ terms cancel inside the logarithm:

$$ = \sum_{x,a,o} P[x \mid a,o] \, P[a \mid o] \, P[o] \;\log \frac{P[a \mid o]}{P[a]}. $$

The log term no longer depends on $x$, so we can sum over $x$ first. Since $\sum_x P[x \mid a, o] = 1$:

$$ = \sum_{a,o} P[a \mid o] \, P[o] \;\log \frac{P[a \mid o]}{P[a]}. $$

Recognizing $P[a \mid o] \, P[o] = P[a, o]$:

$$ = \sum_{a,o} P[a,o] \;\log \frac{P[a \mid o]}{P[a]} = \mathrm{MI}(A; O). $$

Step 2: Apply monotonicity of KL divergence

The monotonicity property (also called the data processing inequality for KL divergence) states that marginalizing out variables can only decrease KL divergence:

$$ D_{\mathrm{KL}}(p(x) \,\|\, q(x)) \leq D_{\mathrm{KL}}(p(x,y) \,\|\, q(x,y)). $$

Apply this with the full joint $(X, A, O)$ playing the role of $(x, y)$ and $X$ alone playing the role of $x$. Marginalizing out $A$ and $O$:

$$ D_{\mathrm{KL}}\bigl(P[X] \,\big\|\, P[X \mid do(A)]\bigr) \leq D_{\mathrm{KL}}\bigl(P[X,A,O] \,\big\|\, P[X,A,O \mid do(A)]\bigr) = \mathrm{MI}(A; O). $$

This completes the proof. $\blacksquare$
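Both steps can be checked numerically on random finite distributions (all tables below are randomly generated, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """KL divergence between two distributions given as flat arrays."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Random finite tables: P[o], the demon's policy P[a|o], and P[x|a,o].
nO, nA, nX = 3, 4, 5
P_o    = rng.dirichlet(np.ones(nO))
P_a_o  = rng.dirichlet(np.ones(nA), size=nO)        # shape (o, a)
P_x_ao = rng.dirichlet(np.ones(nX), size=(nA, nO))  # shape (a, o, x)

# Joint P[x,a,o], and the do() joint with P[a|o] replaced by P[a].
P_joint = np.einsum('aox,oa,o->xao', P_x_ao, P_a_o, P_o)
P_a     = P_joint.sum(axis=(0, 2))                  # action marginal
P_do    = np.einsum('aox,a,o->xao', P_x_ao, P_a, P_o)

# Step 1: KL of the full joints equals MI(A; O).
P_ao = P_joint.sum(axis=0)                          # joint of (a, o)
mi   = kl(P_ao.ravel(), np.outer(P_a, P_o).ravel())
assert np.isclose(kl(P_joint.ravel(), P_do.ravel()), mi)

# Step 2: marginalizing down to X can only shrink the divergence.
kl_X = kl(P_joint.sum(axis=(1, 2)), P_do.sum(axis=(1, 2)))
assert kl_X <= mi + 1e-12
```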


Exercise 5: Channel Additivity

We have two independent channels $X_1 \to Y_1$ and $X_2 \to Y_2$ with joint channel $P[Y \mid X] = P[Y_1 \mid X_1] \, P[Y_2 \mid X_2]$. We can choose the input distribution $P[X]$ and want to maximize the mutual information $\mathrm{MI}(X; Y)$.

Part 5(a): Decomposition of mutual information

Claim. $\mathrm{MI}(X; Y) = \mathrm{MI}(X_1; Y_1) + \mathrm{MI}(X_2; Y_2) - \mathrm{MI}(Y_1; Y_2)$.

Intuition. The total information $X$ conveys about $Y$ is the sum of what each sub-channel conveys, minus a correction for any redundancy between $Y_1$ and $Y_2$. If the inputs $X_1$ and $X_2$ are correlated, their outputs $Y_1$ and $Y_2$ will share some mutual information, which gets counted twice and must be subtracted.

Proof. Since the channel factorizes as $P[Y \mid X] = P[Y_1 \mid X_1] \, P[Y_2 \mid X_2]$, the outputs are conditionally independent given the inputs. Therefore the conditional entropy decomposes:

$$ H(Y \mid X) = H(Y_1, Y_2 \mid X_1, X_2) = H(Y_1 \mid X_1) + H(Y_2 \mid X_2). $$

(This uses the fact that $Y_1$ depends only on $X_1$ and $Y_2$ depends only on $X_2$, so conditioning on all of $X$ is the same as conditioning on the relevant component.)

Now write the mutual information:

$$ \mathrm{MI}(X; Y) = H(Y) - H(Y \mid X) = H(Y_1, Y_2) - H(Y_1 \mid X_1) - H(Y_2 \mid X_2). $$

Expand the joint entropy using the identity $H(Y_1, Y_2) = H(Y_1) + H(Y_2) - \mathrm{MI}(Y_1; Y_2)$. Substituting:

$$ \mathrm{MI}(X; Y) = H(Y_1) + H(Y_2) - \mathrm{MI}(Y_1; Y_2) - H(Y_1 \mid X_1) - H(Y_2 \mid X_2). $$

Regroup:

$$ = \bigl[H(Y_1) - H(Y_1 \mid X_1)\bigr] + \bigl[H(Y_2) - H(Y_2 \mid X_2)\bigr] - \mathrm{MI}(Y_1; Y_2). $$

$$ = \mathrm{MI}(X_1; Y_1) + \mathrm{MI}(X_2; Y_2) - \mathrm{MI}(Y_1; Y_2). \quad \blacksquare $$
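The decomposition can be verified numerically for a random correlated input and random channels (all tables hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mi_from_joint(pxy):
    """MI(X;Y) computed from a 2-D joint probability table."""
    return (entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0))
            - entropy(pxy.ravel()))

# Randomly generated correlated input and channels.
n1, n2, m1, m2 = 2, 3, 2, 3
P_x = rng.dirichlet(np.ones(n1 * n2)).reshape(n1, n2)  # correlated P[x1,x2]
C1  = rng.dirichlet(np.ones(m1), size=n1)              # P[y1|x1]
C2  = rng.dirichlet(np.ones(m2), size=n2)              # P[y2|x2]

# Full joint P[x1,x2,y1,y2] under the product channel.
J = np.einsum('ab,au,bv->abuv', P_x, C1, C2)

MI_XY  = mi_from_joint(J.reshape(n1 * n2, m1 * m2))    # MI(X; Y)
MI_11  = mi_from_joint(J.sum(axis=(1, 3)))             # MI(X1; Y1)
MI_22  = mi_from_joint(J.sum(axis=(0, 2)))             # MI(X2; Y2)
MI_Y12 = mi_from_joint(J.sum(axis=(0, 1)))             # MI(Y1; Y2)

assert np.isclose(MI_XY, MI_11 + MI_22 - MI_Y12)
```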

Part 5(b): Product-of-marginals input achieves at least as much throughput

Claim. Define $Q[X_1, X_2] := P[X_1] \, P[X_2]$ (the product of the marginals of $P$). Then $\mathrm{MI}_Q(X; Y) \geq \mathrm{MI}_P(X; Y)$.

Intuition. Making the inputs independent eliminates any correlation between them. This does not change how much each individual channel conveys (since the marginal distributions of $X_1$ and $X_2$ are unchanged). But it does eliminate correlation between the outputs $Y_1$ and $Y_2$, removing the $\mathrm{MI}(Y_1; Y_2)$ penalty from Part 5(a).

Proof. Under $Q$, the inputs $X_1$ and $X_2$ are independent by construction. Since the channels are also independent ($P[Y_1 \mid X_1]$ does not depend on $X_2$ and vice versa), the outputs $Y_1$ and $Y_2$ are independent under $Q$ as well. Therefore:

$$ \mathrm{MI}_Q(Y_1; Y_2) = 0. $$

The marginal distributions of $X_1$ and $X_2$ are the same under $Q$ as under $P$ (by construction: $Q[X_i] = P[X_i]$). Since $\mathrm{MI}(X_i; Y_i)$ depends only on $P[X_i]$ and the channel $P[Y_i \mid X_i]$ (which is fixed), we have:

$$ \mathrm{MI}_Q(X_1; Y_1) = \mathrm{MI}_P(X_1; Y_1) \quad \text{and} \quad \mathrm{MI}_Q(X_2; Y_2) = \mathrm{MI}_P(X_2; Y_2). $$

Applying the formula from Part 5(a) under both distributions:

$$ \mathrm{MI}_Q(X; Y) = \mathrm{MI}_P(X_1; Y_1) + \mathrm{MI}_P(X_2; Y_2) - 0, $$

$$ \mathrm{MI}_P(X; Y) = \mathrm{MI}_P(X_1; Y_1) + \mathrm{MI}_P(X_2; Y_2) - \mathrm{MI}_P(Y_1; Y_2). $$

Since mutual information is always non-negative ($\mathrm{MI}_P(Y_1; Y_2) \geq 0$):

$$ \mathrm{MI}_Q(X; Y) \geq \mathrm{MI}_P(X; Y). \quad \blacksquare $$
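A quick numeric check of the claim, sampling random correlated inputs against fixed random channels (all tables hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mi_from_joint(pxy):
    return (entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0))
            - entropy(pxy.ravel()))

def throughput(P_x, C1, C2):
    """MI(X; Y) for input P[x1,x2] through the product channel C1, C2."""
    J = np.einsum('ab,au,bv->abuv', P_x, C1, C2)
    n1, n2, m1, m2 = J.shape
    return mi_from_joint(J.reshape(n1 * n2, m1 * m2))

# Fixed random channels; sample many correlated inputs.
C1 = rng.dirichlet(np.ones(2), size=2)   # P[y1|x1]
C2 = rng.dirichlet(np.ones(3), size=3)   # P[y2|x2]
for _ in range(200):
    P = rng.dirichlet(np.ones(6)).reshape(2, 3)       # correlated input
    Q = np.outer(P.sum(axis=1), P.sum(axis=0))        # product of marginals
    assert throughput(Q, C1, C2) >= throughput(P, C1, C2) - 1e-10
```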

Part 5(c): The optimal input has $X_1 \perp\!\!\!\perp X_2$

Claim. There exists a throughput-maximizing input distribution under which $X_1$ and $X_2$ are independent.

Intuition. Part 5(b) shows that for any candidate input distribution, we can “decorrelate” the inputs (replace the joint with the product of marginals) without losing any throughput. So the search for the optimal input can be restricted to product distributions.

Proof. Let $P^*$ be any input distribution that maximizes $\mathrm{MI}(X; Y)$ over all joint distributions on $(X_1, X_2)$. (Such a maximizer exists because we are optimizing a continuous function over a compact set.)

Define $Q^*[X_1, X_2] := P^*[X_1] \, P^*[X_2]$. By Part 5(b):

$$ \mathrm{MI}_{Q^*}(X; Y) \geq \mathrm{MI}_{P^*}(X; Y). $$

But $P^*$ was already a maximizer, so $\mathrm{MI}_{P^*}(X; Y) \geq \mathrm{MI}_{Q^*}(X; Y)$. Together these give:

$$ \mathrm{MI}_{Q^*}(X; Y) = \mathrm{MI}_{P^*}(X; Y). $$

So $Q^*$ achieves the same maximum throughput, and under $Q^*$ the inputs are independent: $X_1 \perp\!\!\!\perp X_2$. $\blacksquare$

Upshot. The capacity of a product channel $P[Y \mid X] = P[Y_1 \mid X_1] \, P[Y_2 \mid X_2]$ equals the sum of the individual channel capacities. Equivalently, if the environment is modular (outcomes split into independent groups, each controlled by its own actions), then the optimal policy is also modular — the agent can optimize each group of actions independently.