C.5 Abstractions: Exercise Solutions

Natural Latents: Exercise Solutions

Exercise 1: Mediation error as conditional total correlation

Throughout this exercise, $\Lambda$ is a deterministic function of $X$, so $H(\Lambda \mid X) = 0$.

Part 1(a): $\varepsilon_{\mathrm{med}}(\Lambda) = TC(X \mid \Lambda)$

Goal. Show that the mediation error equals the conditional total correlation:

$$ \varepsilon_{\mathrm{med}}(\Lambda) = TC(X \mid \Lambda) = \sum_i H(X_i \mid \Lambda) - H(X \mid \Lambda). $$

Solution. Start from the definition of mediation error:

$$ \varepsilon_{\mathrm{med}}(\Lambda) = D_{\mathrm{KL}}\!\bigl(P[X,\Lambda] \;\big\|\; P[\Lambda]\textstyle\prod_i P[X_i \mid \Lambda]\bigr). $$

Apply the chain rule for KL divergence with $A = \Lambda$ and $B = X$:

$$ D_{\mathrm{KL}}(P[\Lambda, X] \,\|\, Q[\Lambda, X]) = D_{\mathrm{KL}}(P[\Lambda] \,\|\, Q[\Lambda]) + \mathbb{E}_\Lambda\bigl[D_{\mathrm{KL}}(P[X \mid \Lambda] \,\|\, Q[X \mid \Lambda])\bigr]. $$

In our case, the reference distribution $Q$ has $Q[\Lambda] = P[\Lambda]$ and $Q[X \mid \Lambda] = \prod_i P[X_i \mid \Lambda]$. So the first term vanishes (KL divergence of a distribution against itself is zero), and we are left with:

$$ \varepsilon_{\mathrm{med}}(\Lambda) = \mathbb{E}_\Lambda\!\left[D_{\mathrm{KL}}\!\left(P[X \mid \Lambda] \;\Big\|\; \prod_i P[X_i \mid \Lambda]\right)\right]. $$

Now expand this KL divergence:

$$ \mathbb{E}_\Lambda\!\left[\sum_x P[x \mid \lambda] \log \frac{P[x \mid \lambda]}{\prod_i P[x_i \mid \lambda]}\right]. $$

This is exactly the definition of the conditional total correlation:

$$ TC(X \mid \Lambda) = \sum_i H(X_i \mid \Lambda) - H(X \mid \Lambda). $$

To see why, recall that the KL divergence from the product of marginals to the joint measures the total correlation. The $\mathbb{E}_\Lambda$ makes it conditional on $\Lambda$, which is exactly what $TC(X \mid \Lambda)$ computes: for each value of $\Lambda$, it measures how far the conditional distribution $P[X \mid \Lambda]$ is from being a product distribution. $\square$
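Sanity check. The identity can be verified numerically on a small example. The joint distribution below and the choice $\Lambda = X_1 \oplus X_2$ are toy choices of ours, not from the exercise:

```python
import math

# Joint over (x1, x2); Λ = x1 XOR x2 is a deterministic function of X.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
f = lambda x: x[0] ^ x[1]

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# P[λ] and the conditionals P[x_i | λ].
P_lam = {}
for x, p in P.items():
    P_lam[f(x)] = P_lam.get(f(x), 0.0) + p

def cond_marginal(i, lam):
    out = {}
    for x, p in P.items():
        if f(x) == lam:
            out[x[i]] = out.get(x[i], 0.0) + p / P_lam[lam]
    return out

# Mediation error: E_λ D_KL(P[X|λ] || Π_i P[X_i|λ]),
# written as a single sum over x weighted by P[x].
eps_med = sum(
    p * math.log2((p / P_lam[f(x)]) /
                  (cond_marginal(0, f(x))[x[0]] * cond_marginal(1, f(x))[x[1]]))
    for x, p in P.items()
)

# Conditional total correlation: Σ_i H(X_i|Λ) - H(X|Λ).
def H_Xi_given_lam(i):
    return sum(P_lam[l] * H(cond_marginal(i, l)) for l in P_lam)

H_X_given_lam = sum(
    P_lam[l] * H({x: p / P_lam[l] for x, p in P.items() if f(x) == l})
    for l in P_lam
)
tc = H_Xi_given_lam(0) + H_Xi_given_lam(1) - H_X_given_lam

assert abs(eps_med - tc) < 1e-9
```

Both quantities agree to floating-point precision, as the derivation predicts.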

Part 1(b): $H(X \mid \Lambda) = H(X) - H(\Lambda)$ when $\Lambda = f(X)$

Goal. Show that if $\Lambda$ is a deterministic function of $X$, then $H(X \mid \Lambda) = H(X) - H(\Lambda)$.

Solution. We use the mutual information identity $I(X; \Lambda) = H(X) - H(X \mid \Lambda)$.

Since $\Lambda$ is a deterministic function of $X$, we have $H(\Lambda \mid X) = 0$. (Knowing $X$ determines $\Lambda$ exactly, so there is no remaining uncertainty.) Using the other expression for mutual information:

$$ I(X; \Lambda) = H(\Lambda) - H(\Lambda \mid X) = H(\Lambda) - 0 = H(\Lambda). $$

Setting the two expressions equal:

$$ H(X) - H(X \mid \Lambda) = H(\Lambda). $$

Rearranging gives $H(X \mid \Lambda) = H(X) - H(\Lambda)$, as desired. $\square$
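Sanity check. A quick numerical confirmation on a toy distribution (the distribution and the function $f$ below are arbitrary choices of ours):

```python
import math

# X over four values; Λ = f(X) = parity of X is deterministic.
P_X = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
f = lambda x: x % 2

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal of Λ.
P_lam = {}
for x, p in P_X.items():
    P_lam[f(x)] = P_lam.get(f(x), 0.0) + p

# H(X | Λ) = Σ_λ P[λ] H(X | Λ = λ)
H_X_given_lam = sum(
    P_lam[lam] * H({x: p / P_lam[lam] for x, p in P_X.items() if f(x) == lam})
    for lam in P_lam
)

# Check H(X|Λ) = H(X) - H(Λ).
assert abs(H_X_given_lam - (H(P_X) - H(P_lam))) < 1e-9
```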

Part 1(c): Minimizing $\varepsilon_{\mathrm{med}}$ in $\mathcal{A}_k$ $\Leftrightarrow$ maximizing $\sum_i I(X_i; \Lambda)$

Goal. Show that within $\mathcal{A}_k$ (deterministic functions of $X$ with entropy exactly $k$), minimizing $\varepsilon_{\mathrm{med}}(\Lambda)$ is equivalent to maximizing $\sum_i I(X_i; \Lambda)$.

Solution. From Part 1(a), $\varepsilon_{\mathrm{med}}(\Lambda) = \sum_i H(X_i \mid \Lambda) - H(X \mid \Lambda)$. From Part 1(b), $H(X \mid \Lambda) = H(X) - H(\Lambda)$. Substituting:

$$ \varepsilon_{\mathrm{med}}(\Lambda) = \sum_i H(X_i \mid \Lambda) - H(X) + H(\Lambda). $$

Now rewrite $H(X_i \mid \Lambda) = H(X_i) - I(X_i; \Lambda)$:

$$ \varepsilon_{\mathrm{med}}(\Lambda) = \sum_i \bigl[H(X_i) - I(X_i; \Lambda)\bigr] - H(X) + H(\Lambda). $$

$$ = \underbrace{\left[\sum_i H(X_i) - H(X) + k\right]}_{\text{constant within } \mathcal{A}_k} - \sum_i I(X_i; \Lambda). $$

The bracketed term is constant within $\mathcal{A}_k$: the quantities $H(X_i)$ and $H(X)$ depend only on the observables (not on the choice of $\Lambda$), and $k = H(\Lambda)$ is fixed within $\mathcal{A}_k$.

Therefore, minimizing $\varepsilon_{\mathrm{med}}(\Lambda)$ over $\mathcal{A}_k$ is equivalent to maximizing $\sum_i I(X_i; \Lambda)$.

Intuition. The mutual information $I(X_i; \Lambda)$ measures how much $\Lambda$ “knows” about each observable $X_i$. The best mediator at a given entropy level is the one that captures the most total information about the individual observables. $\square$

Exercise 2

Part 2(a): Minimizing $\varepsilon_{\mathrm{red}}$ in $\mathcal{A}_k$ $\Leftrightarrow$ maximizing $\sum_i I(X_i; \Lambda)$

Goal. Show that within $\mathcal{A}_k$, minimizing the redundancy error $\varepsilon_{\mathrm{red}}(\Lambda) = \sum_i H(\Lambda \mid X_i)$ is equivalent to maximizing $\sum_i I(X_i; \Lambda)$.

Solution. Expand each term using the definition of mutual information:

$$ H(\Lambda \mid X_i) = H(\Lambda) - I(X_i; \Lambda) = k - I(X_i; \Lambda). $$

Summing over all $i$:

$$ \varepsilon_{\mathrm{red}}(\Lambda) = \sum_i H(\Lambda \mid X_i) = nk - \sum_i I(X_i; \Lambda). $$

Since $nk$ is constant within $\mathcal{A}_k$, minimizing $\varepsilon_{\mathrm{red}}$ is equivalent to maximizing $\sum_i I(X_i; \Lambda)$. $\square$
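Sanity check. The decomposition $\varepsilon_{\mathrm{red}}(\Lambda) = nk - \sum_i I(X_i; \Lambda)$ can be checked numerically; the toy distribution and the choice $\Lambda = X_1 \wedge X_2$ below are ours:

```python
import math

# Joint over (x1, x2); Λ = x1 AND x2, with k = H(Λ).
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}
f = lambda x: x[0] & x[1]

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(keyfn):
    out = {}
    for x, p in P.items():
        out[keyfn(x)] = out.get(keyfn(x), 0.0) + p
    return out

n = 2
k = H(marginal(f))  # H(Λ)

eps_red, mi_sum = 0.0, 0.0
for i in range(n):
    Xi = lambda x, i=i: x[i]
    H_joint = H(marginal(lambda x: (f(x), Xi(x))))
    # H(Λ|X_i) via H(Λ,X_i) - H(X_i); I(X_i;Λ) via H(Λ)+H(X_i)-H(Λ,X_i).
    eps_red += H_joint - H(marginal(Xi))
    mi_sum += k + H(marginal(Xi)) - H_joint

assert abs(eps_red - (n * k - mi_sum)) < 1e-9
```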

Part 2(b): Characterize the Pareto frontier

Goal. Describe the Pareto frontier of simultaneously minimizing $\varepsilon_{\mathrm{med}}$ and $\varepsilon_{\mathrm{red}}$.

Solution. From Parts 1(c) and 2(a), we established a striking fact: at each fixed entropy level $k$, both errors are minimized by the same choice of $\Lambda$ — namely, the one that maximizes $\sum_i I(X_i; \Lambda)$.

This means that within any given $\mathcal{A}_k$, there is no tradeoff between mediation and redundancy. You simply pick the $\Lambda$ that maximizes $\sum_i I(X_i; \Lambda)$, and this simultaneously minimizes both errors.

The only remaining degree of freedom is the entropy level $k = H(\Lambda)$ itself. Let $\varepsilon^\ast_{\mathrm{med}}(k)$ and $\varepsilon^\ast_{\mathrm{red}}(k)$ denote the optimal errors at entropy level $k$. As $k$ increases:

  • $\varepsilon^\ast_{\mathrm{med}}(k)$ decreases: A higher-entropy latent can carry more information about the relationships among the $X_i$’s, so it mediates better.
  • $\varepsilon^\ast_{\mathrm{red}}(k)$ increases: A higher-entropy latent is harder to recover from each individual $X_i$, so redundancy gets worse.

The Pareto frontier is the curve $\bigl(\varepsilon^\ast_{\mathrm{med}}(k),\, \varepsilon^\ast_{\mathrm{red}}(k)\bigr)$ traced out as $k$ varies. Each point on this curve is achieved by the optimal $\Lambda \in \mathcal{A}_k$. $\square$

Exercise 3

Part 3(a): Approximate isomorphism of minimal mediators (and maximal redunds)

Goal. Show that any two minimal mediators are approximately isomorphic, and likewise for maximal redunds.

Background. $\Lambda$ is a minimal mediator if it is an $\varepsilon$-approximate mediator and every $\varepsilon$-approximate mediator $\Lambda'$ satisfies $H(\Lambda \mid \Lambda') < \varepsilon$ (i.e., $\Lambda$ is approximately a function of any other mediator). Two latent variables are approximately isomorphic if each is approximately a deterministic function of the other: $H(\Lambda \mid \Lambda') < \varepsilon$ and $H(\Lambda' \mid \Lambda) < \varepsilon$.

Solution. Let $\Lambda$ and $\Lambda'$ both be minimal mediators.

  • Since $\Lambda$ is minimal and $\Lambda'$ is a mediator, the definition of minimality gives $H(\Lambda \mid \Lambda') < \varepsilon$.
  • Since $\Lambda'$ is minimal and $\Lambda$ is a mediator, the definition of minimality gives $H(\Lambda' \mid \Lambda) < \varepsilon$.

Both conditional entropies are small, so $\Lambda$ and $\Lambda'$ are approximately isomorphic.

The argument for maximal redunds is identical: swap the roles and apply the same symmetry of the definition. $\square$

Part 3(b): Minimal mediator $\approx$ minimum mediator

Goal. Show that any minimal mediator is approximately isomorphic to the minimum mediator (the mediator with minimum entropy), and similarly for maximal/maximum redunds.

Background. $\Lambda^\ast$ is a minimum mediator if it is an $\varepsilon$-approximate mediator with the smallest entropy among all $\varepsilon$-approximate mediators. $\Lambda$ is a minimal mediator (note: minimal, not minimum) if it is a mediator that is approximately a function of every other mediator.

Solution. Let $\Lambda$ be a minimal mediator and $\Lambda^\ast$ the minimum mediator.

Direction 1: $H(\Lambda \mid \Lambda^\ast) < \varepsilon$. This follows directly from the definition of minimality: $\Lambda$ is minimal, and $\Lambda^\ast$ is a mediator, so $\Lambda$ is approximately a function of $\Lambda^\ast$.

Direction 2: $H(\Lambda^\ast \mid \Lambda) < \varepsilon$. We use the identity:

$$ H(\Lambda^\ast \mid \Lambda) = H(\Lambda \mid \Lambda^\ast) + H(\Lambda^\ast) - H(\Lambda). $$

(This follows from expanding the joint entropy $H(\Lambda, \Lambda^\ast)$ in two ways.) Now:

  • From Direction 1: $H(\Lambda \mid \Lambda^\ast) < \varepsilon$.
  • $\Lambda^\ast$ has minimum entropy among all mediators, and $\Lambda$ is itself a mediator, so $H(\Lambda^\ast) \leq H(\Lambda)$. This means $H(\Lambda^\ast) - H(\Lambda) \leq 0$.

Combining: $H(\Lambda^\ast \mid \Lambda) < \varepsilon + 0 = \varepsilon$.

So $c = 1$ works: $\Lambda$ and $\Lambda^\ast$ are $\varepsilon$-approximately isomorphic. The same argument applies by duality to maximal/maximum redunds. $\square$
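Sanity check. The identity used in Direction 2 holds for any joint distribution, since both sides equal $H(\Lambda, \Lambda^\ast) - H(\Lambda)$. A random toy joint (our choice) confirms it:

```python
import itertools, math, random

# Random joint over a pair (a, b), standing in for (Λ, Λ*).
random.seed(1)
pairs = list(itertools.product(range(3), range(2)))
w = [random.random() for _ in pairs]
total = sum(w)
P = {t: x / total for t, x in zip(pairs, w)}

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(keyfn):
    out = {}
    for t, p in P.items():
        out[keyfn(t)] = out.get(keyfn(t), 0.0) + p
    return out

H_ab = H(P)
H_a = H(marginal(lambda t: t[0]))
H_b = H(marginal(lambda t: t[1]))
H_b_given_a = H_ab - H_a
H_a_given_b = H_ab - H_b

# H(B|A) = H(A|B) + H(B) - H(A)
assert abs(H_b_given_a - (H_a_given_b + H_b - H_a)) < 1e-9
```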

Part 3(c): Pareto improvements

Goal. Identify which regions of the Pareto frontier admit Pareto improvements (improving one error without worsening the other).

Solution. Define two key entropy levels:

  • $k_\ast = H(\Lambda_\ast)$: the entropy of the maximum redund (the redund with maximum entropy). At this level, $\varepsilon^\ast_{\mathrm{red}}(k_\ast) = 0$.
  • $k^\ast = H(\Lambda^\ast)$: the entropy of the minimum mediator. At this level, $\varepsilon^\ast_{\mathrm{med}}(k^\ast) = 0$.

We have $k_\ast \leq k^\ast$ because the maximum redund has no more entropy than the minimum mediator. (If it did, the redund would carry information that could not be recovered from each observable, contradicting perfect redundancy relative to mediation.)

Now consider three regions:

Region 1: $k < k_\ast$. In this region, both errors are unnecessarily large. Increasing $k$ toward $k_\ast$ decreases $\varepsilon^\ast_{\mathrm{med}}$ (better mediation, since the latent captures more) and also decreases $\varepsilon^\ast_{\mathrm{red}}$ (or at least does not worsen it, since we are still below the optimal redundancy level). So increasing $k$ is a Pareto improvement.

Region 2: $k > k^\ast$. Similarly, decreasing $k$ toward $k^\ast$ decreases $\varepsilon^\ast_{\mathrm{red}}$ (less entropy to recover) and does not worsen $\varepsilon^\ast_{\mathrm{med}}$ (we are past the point of perfect mediation). So decreasing $k$ is a Pareto improvement.

Region 3: $k_\ast \leq k \leq k^\ast$. This is the region of genuine tradeoff. Increasing $k$ improves mediation but worsens redundancy; decreasing $k$ does the opposite. No Pareto improvements are possible — every point in this region is Pareto optimal. $\square$

Part 3(d): Natural latent $\Rightarrow$ $H(\Lambda_\ast) \approx H(\Lambda^\ast)$

Goal. Show that if a natural latent exists, then $H(\Lambda_\ast) \approx H(\Lambda^\ast)$.

Solution. Suppose $\Lambda$ is a natural latent, meaning it simultaneously satisfies both approximate mediation and approximate redundancy. Then:

  • $\Lambda$ is a mediator, and it is also approximately a function of every other mediator (it is minimal among mediators because it also satisfies redundancy). By Part 3(b), $\Lambda$ is approximately isomorphic to the minimum mediator $\Lambda^\ast$.

  • $\Lambda$ is a redund, and by the dual argument, it is approximately isomorphic to the maximum redund $\Lambda_\ast$.

Approximate isomorphism preserves entropy up to $O(\varepsilon)$: by symmetry of mutual information, $H(\Lambda) - H(\Lambda') = H(\Lambda \mid \Lambda') - H(\Lambda' \mid \Lambda)$, so if $H(\Lambda \mid \Lambda') < \varepsilon$ and $H(\Lambda' \mid \Lambda) < \varepsilon$, then $|H(\Lambda) - H(\Lambda')| < \varepsilon$. Therefore:

$$ H(\Lambda_\ast) \approx H(\Lambda) \approx H(\Lambda^\ast). $$

So $k_\ast \approx k^\ast$, which means the tradeoff interval $[k_\ast, k^\ast]$ from Part 3(c) collapses to approximately a single point.

Intuition. When a natural latent exists, there is essentially no tradeoff between mediation and redundancy. A single entropy level achieves near-zero error on both objectives simultaneously. The Pareto frontier degenerates to a single point. This is the key structural consequence of naturality: the “right” level of abstraction is essentially unique. $\square$


Condensation: Exercise Solutions

Exercise 1

Setup

We have an $\varepsilon$-approximate condensation $(Y_A)_{A \in \mathcal{P}^+I}$ of $X_1, \ldots, X_n$. Fix a non-empty subset $A \subseteq I$ and define:

  • $\Lambda = (Y_B : B \cap A \neq \emptyset,\ |B \cap A| \geq 2)$: latent variables touching at least two elements of $A$.
  • $Y'_{\{i\}} = (Y_B : B \cap A = \{i\})$: latent variables whose only overlap with $A$ is $\{i\}$.

Part 1(a): $X_i$ is a deterministic function of $(\Lambda, Y'_{\{i\}})$

Goal. Show that $H(X_i \mid \Lambda) \leq H(Y'_{\{i\}} \mid \Lambda)$.

Solution. By the LVM condition, $X_i$ is exactly recoverable from $Y_{\ni i} = (Y_B : i \in B)$, the collection of all latent variables “above” $i$.

We need to show that $Y_{\ni i}$ is contained in $(\Lambda, Y'_{\{i\}})$. Take any $Y_B$ with $i \in B$. There are two cases:

  1. $|B \cap A| \geq 2$: Then $Y_B$ is part of $\Lambda$ by definition.
  2. $|B \cap A| = 1$: Since $i \in B$ and $i \in A$, we must have $B \cap A = \{i\}$, so $Y_B$ is part of $Y'_{\{i\}}$.

Note that $|B \cap A| \geq 1$ always holds because $i \in B \cap A$. So every component of $Y_{\ni i}$ appears in either $\Lambda$ or $Y'_{\{i\}}$.

Since $X_i = f_i(Y_{\ni i})$ for some deterministic function $f_i$ (the LVM recovery map), we have $X_i = f_i(\Lambda, Y'_{\{i\}})$.

Now, because $X_i$ is a deterministic function of $(\Lambda, Y'_{\{i\}})$:

$$ H(X_i \mid \Lambda) \leq H(Y'_{\{i\}} \mid \Lambda). \tag{1} $$

This follows from the data processing inequality: given $\Lambda$, $X_i$ is a function of $Y'_{\{i\}}$ alone, so the remaining uncertainty in $X_i$ cannot exceed the remaining uncertainty in $Y'_{\{i\}}$. $\square$
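Sanity check. Inequality (1) can be confirmed numerically: when $X$ is a deterministic function of $(\Lambda, Y')$, we get $H(X \mid \Lambda) \leq H(Y' \mid \Lambda)$. The distributions and the recovery map $f$ below are toy choices of ours:

```python
import math

# Independent toy Λ and Y'; X = f(Λ, Y') is determined by them.
P_lam = {0: 0.6, 1: 0.4}
P_yp = {0: 0.2, 1: 0.5, 2: 0.3}
f = lambda lam, yp: (lam + yp) % 2

# Joint over (Λ, Y', X); the third coordinate is a function of the first two.
P = {(lam, yp, f(lam, yp)): pl * py
     for lam, pl in P_lam.items() for yp, py in P_yp.items()}

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(keyfn):
    out = {}
    for t, p in P.items():
        out[keyfn(t)] = out.get(keyfn(t), 0.0) + p
    return out

def cond_H(afn, bfn):
    # H(A | B) = H(A, B) - H(B)
    return H(marginal(lambda t: (afn(t), bfn(t)))) - H(marginal(bfn))

H_X_given_lam = cond_H(lambda t: t[2], lambda t: t[0])
H_Yp_given_lam = cond_H(lambda t: t[1], lambda t: t[0])

assert H_X_given_lam <= H_Yp_given_lam + 1e-12
```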

Part 1(b): $\sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda) - \varepsilon \leq H(X_A) - H(\Lambda) \leq H(X_A \mid \Lambda)$

Goal. Establish the double inequality in (2).

Solution.

Right inequality ($H(X_A) - H(\Lambda) \leq H(X_A \mid \Lambda)$): By definition of mutual information,

$$ H(X_A \mid \Lambda) = H(X_A) - I(X_A; \Lambda). $$

Since mutual information is non-negative and bounded by the entropy of either argument:

$$ I(X_A; \Lambda) \leq H(\Lambda). $$

Therefore $H(X_A \mid \Lambda) = H(X_A) - I(X_A; \Lambda) \geq H(X_A) - H(\Lambda)$.

Left inequality ($\sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda) - \varepsilon \leq H(X_A) - H(\Lambda)$): This is equivalent to showing

$$ \sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda) + H(\Lambda) \leq H(X_A) + \varepsilon. $$

Step 1. We show that $\sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda) + H(\Lambda) \leq \chi(A)$.

Recall that $\chi(A) = \sum_{B \cap A \neq \emptyset} H(Y_B \mid Y_{\supset B})$, which sums the conditional entropy of each $Y_B$ (that touches $A$) given all strictly higher-level latent variables.

The quantity $H(\Lambda) + \sum_i H(Y'_{\{i\}} \mid \Lambda)$ can be expanded exactly via the chain rule. Order the components of $\Lambda$ by decreasing set size, so each is conditioned on the components listed before it; then expand each $H(Y'_{\{i\}} \mid \Lambda)$ over its own components in the same order:

$$ H(\Lambda) + \sum_i H(Y'_{\{i\}} \mid \Lambda) = \sum_{\substack{B \cap A \neq \emptyset \\ |B \cap A| \geq 2}} H(Y_B \mid \text{earlier components of } \Lambda) + \sum_i \sum_{B \cap A = \{i\}} H(Y_B \mid \Lambda, \text{earlier components of } Y'_{\{i\}}). $$

In each term, the conditioning set contains all of $Y_{\supset B}$: any $C \supsetneq B$ satisfies $C \cap A \supseteq B \cap A$, so $Y_C$ is either a component of $\Lambda$ or an earlier component of the same $Y'_{\{i\}}$. Since conditioning reduces entropy, each term is at most $H(Y_B \mid Y_{\supset B})$. The terms of $\chi(A)$ not appearing in this expansion are nonnegative, so summing over all $B$ touching $A$:

$$ \sum_i H(Y'_{\{i\}} \mid \Lambda) + H(\Lambda) \leq \sum_{B \cap A \neq \emptyset} H(Y_B \mid Y_{\supset B}) = \chi(A). $$

Step 2. By the $\varepsilon$-approximate condensation assumption, $\chi(A) \leq H(X_A) + \varepsilon$.

Combining Steps 1 and 2: $\sum_{i} H(Y'_{\{i\}} \mid \Lambda) + H(\Lambda) \leq H(X_A) + \varepsilon$, which rearranges to the desired left inequality. $\square$

Part 1(c): $\mathrm{Med}(\Lambda) \leq \varepsilon$

Goal. Show that $\mathrm{Med}(\Lambda) = \sum_{i \in A} H(X_i \mid \Lambda) - H(X_A \mid \Lambda) \leq \varepsilon$.

Solution. From inequality (1) in Part 1(a):

$$ \sum_{i \in A} H(X_i \mid \Lambda) \leq \sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda). $$

From the left side of inequality (2) in Part 1(b):

$$ \sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda) \leq H(X_A \mid \Lambda) + \varepsilon. $$

(This is obtained by rearranging $\sum_i H(Y'_{\{i\}} \mid \Lambda) - \varepsilon \leq H(X_A) - H(\Lambda) \leq H(X_A \mid \Lambda)$.)

Chaining these two inequalities:

$$ \mathrm{Med}(\Lambda) = \sum_{i \in A} H(X_i \mid \Lambda) - H(X_A \mid \Lambda) \leq \sum_{i \in A} H(Y'_{\{i\}} \mid \Lambda) - H(X_A \mid \Lambda) \leq \varepsilon. \quad \square $$

Exercise 2

Part 2(a): $H(Y_{\cap B} \mid X_B) \leq \varepsilon$

Goal. Show that the latent variables internal to $B$ are approximately determined by the observables $X_B$.

Solution. By the LVM condition, $X_B$ is a deterministic function of $Y_{\cap B}$: each $X_i$ for $i \in B$ is recoverable from $Y_{\ni i}$, and every $Y_C$ with $i \in C$ satisfies $C \cap B \neq \emptyset$, so it is a component of $Y_{\cap B}$. This means:

$$ H(X_B \mid Y_{\cap B}) = 0. $$

Therefore the mutual information between $X_B$ and $Y_{\cap B}$ is:

$$ I(X_B; Y_{\cap B}) = H(X_B) - H(X_B \mid Y_{\cap B}) = H(X_B). $$

Using the other expression for mutual information:

$$ H(Y_{\cap B} \mid X_B) = H(Y_{\cap B}) - I(X_B; Y_{\cap B}) = H(Y_{\cap B}) - H(X_B). $$

By the assumed bound (Proposition 5.2), $H(Y_{\cap B}) \leq \chi(B)$. By $\varepsilon$-approximate condensation, $\chi(B) \leq H(X_B) + \varepsilon$. So:

$$ H(Y_{\cap B} \mid X_B) = H(Y_{\cap B}) - H(X_B) \leq \chi(B) - H(X_B) \leq \varepsilon. \quad \square $$

Part 2(b): $\Lambda$ satisfies weak redundancy

Goal. Show that $H(\Lambda \mid X_{A \setminus \{i\}}) \leq \varepsilon$ for all $i \in A$.

Solution. Set $B = A \setminus \{i\}$ in Part 2(a). This gives:

$$ H(Y_{\cap (A \setminus \{i\})} \mid X_{A \setminus \{i\}}) \leq \varepsilon. $$

Now we show that $\Lambda$ is a sub-tuple of $Y_{\cap (A \setminus \{i\})}$, i.e., every component of $\Lambda$ also appears among the latent variables touching $A \setminus \{i\}$.

Take any component $Y_B$ of $\Lambda$, so $|B \cap A| \geq 2$. Then there is some $j \in B \cap A$ with $j \neq i$, hence $B \cap (A \setminus \{i\}) \neq \emptyset$ and $Y_B$ is a component of $Y_{\cap(A \setminus \{i\})}$.

Since conditioning on $X_{A \setminus \{i\}}$ leaves at most as much uncertainty about the sub-tuple $\Lambda$ as about the full tuple:

$$ H(\Lambda \mid X_{A \setminus \{i\}}) \leq H(Y_{\cap(A \setminus \{i\})} \mid X_{A \setminus \{i\}}) \leq \varepsilon. \quad \square $$

Exercise 3

Setup

We have an $\varepsilon$-approximate condensation with:

  • $\Lambda = Y_{\supseteq A}$: latent variables whose index set contains all of $A$.
  • $\Lambda' = (Y_B : B \cap A \neq \emptyset,\ |B| \geq 2,\ B \not\supseteq A)$: non-singleton latent variables that partially overlap $A$.
  • $\Lambda^\ast = (\Lambda, \Lambda')$: the combined variable.
  • Assumption: $H(\Lambda' \mid \Lambda) \leq \varepsilon$.

From Exercise 2, $\Lambda^\ast$ is an $\varepsilon$-approximate weak natural latent over $(X_i)_{i \in A}$.

Part 3(a): Strong redundancy of $\Lambda$

Goal. Show $H(\Lambda \mid X_i) \leq \varepsilon$ for all $i \in A$.

Solution. Each component $Y_B$ of $\Lambda$ satisfies $B \supseteq A$, so in particular $i \in B$ for every $i \in A$. This means every component of $\Lambda$ is among the latent variables “above” $i$: $\Lambda$ is a sub-tuple of $Y_{\cap \{i\}} = Y_{\ni i}$, the latent variables touching $\{i\}$.

Now apply Exercise 2(a) with the singleton set $B = \{i\}$:

$$ H(Y_{\cap \{i\}} \mid X_i) \leq \varepsilon. $$

Since $\Lambda$ is a sub-tuple of $Y_{\cap \{i\}}$, conditioning on $X_i$ leaves at most as much uncertainty about $\Lambda$ as about the full tuple:

$$ H(\Lambda \mid X_i) \leq H(Y_{\cap \{i\}} \mid X_i) \leq \varepsilon. \quad \square $$

Part 3(b): $H(Z \mid \Lambda) \leq H(Z \mid \Lambda^\ast) + H(\Lambda^\ast \mid \Lambda)$

Goal. Prove this inequality for any random variable $Z$.

Solution. Start with the joint entropy $H(Z, \Lambda)$ and bound it from above by adding more variables:

$$ H(Z, \Lambda) \leq H(Z, \Lambda, \Lambda^\ast). $$

This holds because adding variables can only increase (or maintain) joint entropy.

Expand the right side using the chain rule:

$$ H(Z, \Lambda, \Lambda^\ast) = H(\Lambda) + H(\Lambda^\ast \mid \Lambda) + H(Z \mid \Lambda, \Lambda^\ast). $$

Also expand the left side:

$$ H(Z, \Lambda) = H(\Lambda) + H(Z \mid \Lambda). $$

Substituting and cancelling $H(\Lambda)$:

$$ H(Z \mid \Lambda) \leq H(\Lambda^\ast \mid \Lambda) + H(Z \mid \Lambda, \Lambda^\ast). $$

Finally, since $\Lambda^\ast = (\Lambda, \Lambda')$, the pair $(\Lambda, \Lambda^\ast)$ carries exactly the same information as $\Lambda^\ast$ alone, so:

$$ H(Z \mid \Lambda, \Lambda^\ast) = H(Z \mid \Lambda^\ast). $$

Therefore:

$$ H(Z \mid \Lambda) \leq H(\Lambda^\ast \mid \Lambda) + H(Z \mid \Lambda^\ast). \quad \square $$
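Sanity check. Since the inequality holds for any random variables, we can confirm it on a random toy joint over $(Z, \Lambda, \Lambda')$, with $\Lambda^\ast = (\Lambda, \Lambda')$; the distribution below is an arbitrary choice of ours:

```python
import itertools, math, random

# Random joint over (z, λ, λ').
random.seed(2)
triples = list(itertools.product(range(2), range(2), range(3)))
w = [random.random() for _ in triples]
total = sum(w)
P = {t: x / total for t, x in zip(triples, w)}

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(keyfn):
    out = {}
    for t, p in P.items():
        out[keyfn(t)] = out.get(keyfn(t), 0.0) + p
    return out

def cond_H(afn, bfn):
    # H(A | B) = H(A, B) - H(B)
    return H(marginal(lambda t: (afn(t), bfn(t)))) - H(marginal(bfn))

z = lambda t: t[0]
lam = lambda t: t[1]
star = lambda t: (t[1], t[2])  # Λ* = (Λ, Λ')

lhs = cond_H(z, lam)
rhs = cond_H(z, star) + cond_H(star, lam)
assert lhs <= rhs + 1e-9
```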

Part 3(c): $H(X_A \mid \Lambda^\ast) \leq H(X_A \mid \Lambda)$

Goal. Show this inequality.

Solution. Since $\Lambda^\ast = (\Lambda, \Lambda')$, conditioning on $\Lambda^\ast$ means conditioning on at least as much information as conditioning on $\Lambda$ alone. By the monotonicity of conditional entropy (conditioning on more can only reduce entropy):

$$ H(X_A \mid \Lambda^\ast) = H(X_A \mid \Lambda, \Lambda') \leq H(X_A \mid \Lambda). \quad \square $$

Part 3(d): $\mathrm{Med}(\Lambda) \leq (n+1)\varepsilon$

Goal. Bound the mediation error of $\Lambda$.

Solution. We proceed in three steps.

Step 1: Bound $H(X_i \mid \Lambda)$. For each $i \in A$, apply Part 3(b) with $Z = X_i$:

$$ H(X_i \mid \Lambda) \leq H(X_i \mid \Lambda^\ast) + H(\Lambda^\ast \mid \Lambda). $$

Since $\Lambda^\ast = (\Lambda, \Lambda')$ and $H(\Lambda' \mid \Lambda) \leq \varepsilon$ by assumption:

$$ H(\Lambda^\ast \mid \Lambda) = H(\Lambda' \mid \Lambda) \leq \varepsilon. $$

(Here we used that $\Lambda^\ast = (\Lambda, \Lambda')$ and $\Lambda$ is already part of $\Lambda^\ast$, so the additional uncertainty is just $H(\Lambda' \mid \Lambda)$.)

Therefore:

$$ H(X_i \mid \Lambda) \leq H(X_i \mid \Lambda^\ast) + \varepsilon. $$

Summing over all $i \in A$ (with $|A| = n$):

$$ \sum_{i \in A} H(X_i \mid \Lambda) \leq \sum_{i \in A} H(X_i \mid \Lambda^\ast) + n\varepsilon. $$

Step 2: Use mediation of $\Lambda^\ast$. From Exercise 2 (combining Parts 1(c) and 2(b)), $\Lambda^\ast$ is an $\varepsilon$-approximate weak natural latent, so:

$$ \mathrm{Med}(\Lambda^\ast) = \sum_{i \in A} H(X_i \mid \Lambda^\ast) - H(X_A \mid \Lambda^\ast) \leq \varepsilon. $$

Rearranging: $\sum_{i \in A} H(X_i \mid \Lambda^\ast) \leq H(X_A \mid \Lambda^\ast) + \varepsilon$.

Step 3: Combine. From Step 1:

$$ \sum_{i \in A} H(X_i \mid \Lambda) \leq \sum_{i \in A} H(X_i \mid \Lambda^\ast) + n\varepsilon \leq H(X_A \mid \Lambda^\ast) + \varepsilon + n\varepsilon. $$

From Part 3(c): $H(X_A \mid \Lambda^\ast) \leq H(X_A \mid \Lambda)$. Substituting:

$$ \sum_{i \in A} H(X_i \mid \Lambda) \leq H(X_A \mid \Lambda) + (n+1)\varepsilon. $$

Therefore:

$$ \mathrm{Med}(\Lambda) = \sum_{i \in A} H(X_i \mid \Lambda) - H(X_A \mid \Lambda) \leq (n+1)\varepsilon. \quad \square $$

Exercise 4

Setup

Given an $\varepsilon$-approximate strong natural latent $\Lambda$ over $(X_i)_{i \in A}$, construct an LVM by setting $Y_A = \Lambda$, $Y_{\{i\}} = X_i$, and all other latent variables to be constant (empty).

Part 4(a): $\chi(A) \leq H(X_A, \Lambda) + \varepsilon$

Goal. Bound the conditioned score of $A$.

Solution. Compute the conditioned score $\chi(A) = \sum_{B \cap A \neq \emptyset} H(Y_B \mid Y_{\supset B})$. In our LVM, the only non-trivial latent variables are $Y_A = \Lambda$ and $Y_{\{i\}} = X_i$ for each $i \in A$. All other $Y_B$ are constant, contributing zero.

The terms in $\chi(A)$ are:

  • $B = A$: $Y_A = \Lambda$ and $Y_{\supset A}$ is empty (there is no $C \supsetneq A$ among our indices, or if there is, $Y_C$ is constant). So this term is $H(\Lambda)$.

  • $B = \{i\}$ for each $i \in A$: $Y_{\{i\}} = X_i$ and $Y_{\supset \{i\}}$ includes $Y_A = \Lambda$ (since $A \supsetneq \{i\}$). So this term is $H(X_i \mid \Lambda)$.

No other terms contribute (all other $Y_B$ are constant). Therefore:

$$ \chi(A) = H(\Lambda) + \sum_{i \in A} H(X_i \mid \Lambda). $$

Now use the mediation property of $\Lambda$:

$$ \sum_{i \in A} H(X_i \mid \Lambda) \leq H(X_A \mid \Lambda) + \varepsilon. $$

So:

$$ \chi(A) \leq H(\Lambda) + H(X_A \mid \Lambda) + \varepsilon = H(X_A, \Lambda) + \varepsilon. $$

The last equality uses the chain rule: $H(X_A, \Lambda) = H(\Lambda) + H(X_A \mid \Lambda)$. $\square$

Part 4(b): $\chi(A) \leq H(X_A) + 2\varepsilon$

Goal. Conclude that the construction is a $2\varepsilon$-approximate condensation.

Solution. From Part 4(a):

$$ \chi(A) \leq H(X_A, \Lambda) + \varepsilon = H(X_A) + H(\Lambda \mid X_A) + \varepsilon. $$

By strong redundancy, $H(\Lambda \mid X_i) \leq \varepsilon$ for each $i \in A$. Since $X_A$ contains more information than any single $X_i$:

$$ H(\Lambda \mid X_A) \leq H(\Lambda \mid X_i) \leq \varepsilon. $$

(Conditioning on more variables can only reduce entropy.)

Therefore:

$$ \chi(A) \leq H(X_A) + \varepsilon + \varepsilon = H(X_A) + 2\varepsilon. $$

As noted in the remark, both mediation and strong redundancy transfer to any subset $B \subseteq A$, so the same argument gives $\chi(B) \leq H(X_B) + 2\varepsilon$ for all non-empty $B \subseteq A$. The construction is therefore a $2\varepsilon$-approximate condensation. $\square$

Exercise 5 (Hierarchical)

Setup

We have a hierarchical natural latent model with partitions $Q^0, Q^1, \ldots, Q^n$ of $I$, where $Q^0$ is the finest (singletons) and $Q^n$ is the coarsest ($\{I\}$). At each level $k$, $\Lambda^k_A$ is an $\varepsilon$-approximate natural latent over $(\Lambda^{k-1}_B)_{B \subset A, B \in Q^{k-1}}$. We also have the conditional independence condition: $\Lambda^{k-1}_B \perp \Lambda^{k-1}_{\bar{B}} \mid \Lambda^k_B$.

Part 5(a): Telescoping across levels

Goal. Show that for each level $k$:

$$ \sum_{B \in Q^k} \sum_{C \subset B} H(\Lambda^{k-1}_C \mid \Lambda^k_B) \leq H(\Lambda^{k-1}_I \mid \Lambda^k_I) + K\varepsilon, $$

where $K$ depends on the partition structure.

Solution.

Step 1: Use conditional independence. Fix $B \in Q^k$. By the conditional independence assumption, $\Lambda^{k-1}_B$ is conditionally independent of $\Lambda^{k-1}_{\bar{B}}$ given $\Lambda^k_B$. We want to show:

$$ H(\Lambda^{k-1}_B \mid \Lambda^k_B) = H(\Lambda^{k-1}_B \mid \Lambda^k_I). $$

Write $\Lambda^k_I = (\Lambda^k_B, \Lambda^k_{\bar{B}})$ (where the “complement” is with respect to the partition $Q^k$, i.e., all other blocks at level $k$). Then:

$$ H(\Lambda^{k-1}_B \mid \Lambda^k_I) = H(\Lambda^{k-1}_B \mid \Lambda^k_B, \Lambda^k_{\bar{B}}). $$

By conditional independence, $\Lambda^{k-1}_B \perp \Lambda^k_{\bar{B}} \mid \Lambda^k_B$ (this follows because $\Lambda^{k-1}_B$ is generated from $\Lambda^k_B$ and is independent of the rest of level $k$). So:

$$ H(\Lambda^{k-1}_B \mid \Lambda^k_B, \Lambda^k_{\bar{B}}) = H(\Lambda^{k-1}_B \mid \Lambda^k_B). $$
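Step 1 rests on the standard identity "conditional independence makes the extra conditioning variable irrelevant": if $A \perp C \mid B$, then $H(A \mid B, C) = H(A \mid B)$. A minimal numeric sketch (the toy distribution and helper functions are my own construction, not from the text): build the conditional independence by making $A$ and $C$ independent noisy copies of a fair bit $B$, then compare the two conditional entropies.

```python
import itertools
import math

def H(p):
    """Shannon entropy (bits) of a distribution given as {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, keep):
    """Marginalize a joint dict onto the given coordinate indices."""
    out = {}
    for x, q in p.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + q
    return out

def cond_H(p, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return H(marginal(p, target + given)) - H(marginal(p, given))

# Joint P[a, b, c]: B is a fair bit; A = B flipped w.p. 0.1; C = B flipped
# w.p. 0.2, with the flips independent, so A ⊥ C | B by construction.
joint = {}
for a, b, c in itertools.product((0, 1), repeat=3):
    pa = 0.9 if a == b else 0.1
    pc = 0.8 if c == b else 0.2
    joint[(a, b, c)] = 0.5 * pa * pc

h_ab = cond_H(joint, (0,), (1,))     # H(A | B)
h_abc = cond_H(joint, (0,), (1, 2))  # H(A | B, C)
print(abs(h_ab - h_abc) < 1e-12)     # → True
```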

Step 2: Use mediation. Since $\Lambda^k_B$ is an $\varepsilon$-approximate natural latent over $(\Lambda^{k-1}_C)_{C \subset B}$, mediation gives:

$$ \sum_{C \subset B} H(\Lambda^{k-1}_C \mid \Lambda^k_B) \leq H(\Lambda^{k-1}_B \mid \Lambda^k_B) + \varepsilon. $$
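The mediation inequality in Step 2 is tight in the exact case: when the children are exactly conditionally independent given the latent, the sum of per-child conditional entropies equals the joint conditional entropy, i.e. the gap $TC(X \mid \Lambda)$ from Exercise 1 is zero. A toy check (my own construction, not from the text), with $X_1, X_2$ independent noisy copies of a fair bit $L$:

```python
import itertools
import math

def H(p):
    """Shannon entropy (bits) of a distribution given as {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, keep):
    """Marginalize a joint dict onto the given coordinate indices."""
    out = {}
    for x, q in p.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + q
    return out

def cond_H(p, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return H(marginal(p, target + given)) - H(marginal(p, given))

# X1, X2 exactly conditionally independent given the latent L (fair bit):
# X1 = L flipped w.p. 0.1, X2 = L flipped w.p. 0.2, flips independent.
joint = {}
for x1, l, x2 in itertools.product((0, 1), repeat=3):
    p1 = 0.9 if x1 == l else 0.1
    p2 = 0.8 if x2 == l else 0.2
    joint[(x1, l, x2)] = 0.5 * p1 * p2

lhs = cond_H(joint, (0,), (1,)) + cond_H(joint, (2,), (1,))  # Σ_i H(X_i | L)
rhs = cond_H(joint, (0, 2), (1,))                            # H(X1, X2 | L)
print(abs(lhs - rhs) < 1e-12)  # → True: TC(X | L) = 0 here, so ε = 0 suffices
```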

Step 3: Sum over blocks. Summing over all $B \in Q^k$ and using Step 1:

$$ \sum_{B \in Q^k} \sum_{C \subset B} H(\Lambda^{k-1}_C \mid \Lambda^k_B) \leq \sum_{B \in Q^k} \bigl[H(\Lambda^{k-1}_B \mid \Lambda^k_I) + \varepsilon\bigr] $$

$$ = \sum_{B \in Q^k} H(\Lambda^{k-1}_B \mid \Lambda^k_I) + |Q^k| \varepsilon. $$

Now, $\sum_{B \in Q^k} H(\Lambda^{k-1}_B \mid \Lambda^k_I)$ is not quite $H(\Lambda^{k-1}_I \mid \Lambda^k_I)$ because the blocks might not be independent given $\Lambda^k_I$. However, applying the chain rule:

$$ H(\Lambda^{k-1}_I \mid \Lambda^k_I) = \sum_{B \in Q^k} H(\Lambda^{k-1}_B \mid \Lambda^k_I, \Lambda^{k-1}_{<B}), $$

where $\Lambda^{k-1}_{<B}$ denotes the blocks processed before $B$. Since conditioning reduces entropy:

$$ H(\Lambda^{k-1}_B \mid \Lambda^k_I, \Lambda^{k-1}_{<B}) \leq H(\Lambda^{k-1}_B \mid \Lambda^k_I). $$

Summing: $H(\Lambda^{k-1}_I \mid \Lambda^k_I) \leq \sum_{B \in Q^k} H(\Lambda^{k-1}_B \mid \Lambda^k_I)$.

By conditional independence (the blocks at level $k-1$ are approximately independent given level $k$), this inequality is in fact an approximate equality: the gap between the two sides is exactly the conditional total correlation of the level-$(k-1)$ blocks given $\Lambda^k_I$, which, as in Exercise 1, is a mediation error. In particular:

$$ \sum_{B \in Q^k} H(\Lambda^{k-1}_B \mid \Lambda^k_I) = H(\Lambda^{k-1}_I \mid \Lambda^k_I) + \text{mediation error at level } k. $$

The mediation error at level $k$ is bounded by $|Q^k| \varepsilon$ (summing the mediation errors of each block). Therefore:

$$ \sum_{B \in Q^k} \sum_{C \subset B} H(\Lambda^{k-1}_C \mid \Lambda^k_B) \leq H(\Lambda^{k-1}_I \mid \Lambda^k_I) + K\varepsilon, $$

where $K$ depends on the partition sizes across levels. $\square$
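The subadditivity step above, $H(\Lambda^{k-1}_I \mid \Lambda^k_I) \leq \sum_{B} H(\Lambda^{k-1}_B \mid \Lambda^k_I)$, holds for any joint distribution, with a strict gap when the blocks are correlated given the conditioning variable. A quick numeric sketch (random toy joint, my own construction, not from the text):

```python
import itertools
import math
import random

def H(p):
    """Shannon entropy (bits) of a distribution given as {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, keep):
    """Marginalize a joint dict onto the given coordinate indices."""
    out = {}
    for x, q in p.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + q
    return out

def cond_H(p, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return H(marginal(p, target + given)) - H(marginal(p, given))

# Random joint over (y1, y2, z): the inequality
# H(Y1, Y2 | Z) <= H(Y1 | Z) + H(Y2 | Z) holds for ANY joint.
random.seed(0)
outcomes = list(itertools.product((0, 1), repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {x: w / total for x, w in zip(outcomes, weights)}

lhs = cond_H(joint, (0,), (2,)) + cond_H(joint, (1,), (2,))  # Σ_B H(block | Z)
rhs = cond_H(joint, (0, 1), (2,))                            # H(all blocks | Z)
print(rhs <= lhs + 1e-12)  # → True
```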

Part 5(b): Bounding the conditioned score

Goal. Show that $\chi(A) \leq H(\Lambda^0_{\cap A}, \Lambda^1_{\cap A}, \ldots, \Lambda^n_{\cap A}) + K_2 \varepsilon$.

Solution. The conditioned score $\chi(A)$ sums $H(Y_B \mid Y_{\supset B})$ over all $B$ touching $A$. In our LVM, $Y_B = \Lambda^k_B$ for some level $k$.

Using Part 5(a) (and its subset version restricted to indices touching $A$), we can telescope level by level. At each level $k$, the contribution to $\chi(A)$ from level-$k$ latents touching $A$ is:

$$ \sum_{\substack{B \in Q^k \\ B \cap A \neq \emptyset}} H(\Lambda^k_B \mid \Lambda^{k+1}_{\text{parent}(B)}) \quad \text{(approximately)}. $$

Applying Part 5(a) at each level and summing from $k = 0$ to $k = n$:

$$ \chi(A) \leq \sum_{k=0}^{n-1} H(\Lambda^k_{\cap A} \mid \Lambda^{k+1}_{\cap A}) + K_2 \varepsilon, $$

where each level contributes at most $K'\varepsilon$ in mediation error, for a total of $K_2 \varepsilon$ across all levels.

By the chain rule applied telescopically (using the hierarchical structure, under which conditioning on the immediate parent level is equivalent to conditioning on all higher levels):

$$ \sum_{k=0}^{n-1} H(\Lambda^k_{\cap A} \mid \Lambda^{k+1}_{\cap A}) = H(\Lambda^0_{\cap A}, \ldots, \Lambda^n_{\cap A}) - H(\Lambda^n_{\cap A}). $$

Including the top-level term $H(\Lambda^n_{\cap A})$ from the conditioned score:

$$ \chi(A) \leq H(\Lambda^0_{\cap A}, \ldots, \Lambda^n_{\cap A}) + K_2 \varepsilon. \quad \square $$
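The telescoping identity used above can be sanity-checked on a toy Markov hierarchy (my own construction, not from the text): take $\Lambda^0$ uniform on $\{0, \ldots, 7\}$ and let $\Lambda^1, \Lambda^2$ be deterministic coarsenings, so the levels form a Markov chain and $\sum_k H(\Lambda^k \mid \Lambda^{k+1}) = H(\Lambda^0, \Lambda^1, \Lambda^2) - H(\Lambda^2)$ holds exactly.

```python
import math

def H(p):
    """Shannon entropy (bits) of a distribution given as {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, keep):
    """Marginalize a joint dict onto the given coordinate indices."""
    out = {}
    for x, q in p.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + q
    return out

def cond_H(p, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return H(marginal(p, target + given)) - H(marginal(p, given))

# Λ^0 uniform on {0,...,7}; Λ^1 = Λ^0 // 2 and Λ^2 = Λ^0 // 4 are
# deterministic coarsenings, so the levels form a Markov chain.
joint = {(l0, l0 // 2, l0 // 4): 1 / 8 for l0 in range(8)}

lhs = cond_H(joint, (0,), (1,)) + cond_H(joint, (1,), (2,))  # Σ_k H(Λ^k | Λ^{k+1})
rhs = H(joint) - H(marginal(joint, (2,)))                    # H(Λ^0,Λ^1,Λ^2) - H(Λ^2)
print(abs(lhs - rhs) < 1e-12)  # → True (both equal 2 bits here)
```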

Part 5(c): Final bound

Goal. Show $\chi(A) \leq H(X_A) + K_3 \varepsilon$.

Solution. From Part 5(b):

$$ \chi(A) \leq H(\Lambda^0_{\cap A}, \Lambda^1_{\cap A}, \ldots, \Lambda^n_{\cap A}) + K_2\varepsilon. $$

Now $\Lambda^0_{\cap A} = X_A$ (the observables), and each higher-level latent $\Lambda^k_{\cap A}$ is approximately a deterministic function of $\Lambda^{k-1}_{\cap A}$ by the redundancy property of natural latents. Specifically:

$$ H(\Lambda^k_{\cap A} \mid \Lambda^{k-1}_{\cap A}) \leq \varepsilon $$

at each level $k$ (from the strong redundancy of $\Lambda^k$ being recoverable from lower-level variables).

By the chain rule:

$$ H(\Lambda^0_{\cap A}, \ldots, \Lambda^n_{\cap A}) = H(X_A) + \sum_{k=1}^{n} H(\Lambda^k_{\cap A} \mid \Lambda^0_{\cap A}, \ldots, \Lambda^{k-1}_{\cap A}). $$

Since conditioning on more can only reduce entropy:

$$ H(\Lambda^k_{\cap A} \mid \Lambda^0_{\cap A}, \ldots, \Lambda^{k-1}_{\cap A}) \leq H(\Lambda^k_{\cap A} \mid \Lambda^{k-1}_{\cap A}) \leq \varepsilon. $$

Summing over all $n$ levels:

$$ H(\Lambda^0_{\cap A}, \ldots, \Lambda^n_{\cap A}) \leq H(X_A) + n\varepsilon. $$

Combining with Part 5(b):

$$ \chi(A) \leq H(X_A) + n\varepsilon + K_2\varepsilon = H(X_A) + K_3\varepsilon, $$

where $K_3 = n + K_2$.

This concludes the proof: the LVM constructed from the hierarchical natural latent model $R$ is a $K_3\varepsilon$-approximate condensation, where $K_3$ depends on the number of levels and the partition structure. $\square$
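The final bound is tight in the exact case: when every higher level is an exact deterministic function of the level below ($\varepsilon = 0$), each conditional-entropy term in the chain rule vanishes and $H(\Lambda^0_{\cap A}, \ldots, \Lambda^n_{\cap A}) = H(X_A)$ exactly. A toy check (my own construction, not from the text), reusing a deterministic three-level coarsening hierarchy:

```python
import math

def H(p):
    """Shannon entropy (bits) of a distribution given as {outcome: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def marginal(p, keep):
    """Marginalize a joint dict onto the given coordinate indices."""
    out = {}
    for x, q in p.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + q
    return out

# Λ^0 (= X_A) uniform on {0,...,7}; higher levels are deterministic
# coarsenings, so they add no entropy to the joint.
joint = {(l0, l0 // 2, l0 // 4): 1 / 8 for l0 in range(8)}

h_all = H(joint)                # H(Λ^0, Λ^1, Λ^2)
h_x = H(marginal(joint, (0,)))  # H(X_A) = H(Λ^0)
print(abs(h_all - h_x) < 1e-12)  # → True: the bound is an equality at ε = 0
```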