jonathan’s blog

mathematics, physics, programming, and machine learning

Stochastic functions and the Giry monad

Stochastic functions

A function \(f\) between sets \(X\) and \(Y\) is a mapping \[\begin{aligned} f &:& X &\to& Y \cr f &:& x &\mapsto& f(y) \end{aligned}\] and one of the most important properties of functions is that they compose, i.e. if we have \(f: X \to Y\) and \(g: Y \to Z\) then we can construct a mapping \(g \circ f: X \to Z\) defined by \((g\circ f)(x) = g(f(x))\). When the sets involved have additional structure (topology, smooth structure, etc.), composition often preserves nice properties of the involved functions, e.g. continuity, differentiability, etc. This allows one to construct interesting categories of “nice” functions between “nice” spaces.

What about the real world? In practice, very few “functions” are true functions in the mathematical sense. Among others, there are some obvious examples of non-functions

  • \(f(x)\) is a measurement, and necessarily comes with error bars
  • \(f(x)\) is unknown but approximated by some known function \(\hat{f}(x)\)

In the first case, repeated measurements could produce different values, and we would like to capture the distribution of possible values upon repeated measurements. In the second case, we would like to have some way to capture the uncertainty in the estimated values. So the question is, what kind of mathematical objects are these not-quite-functions?

Let’s start with a simple example. Let’s suppose that we are measuring some function \(f: X \to \mathbb R\). Suppose that for repeated measurements at a point \(x \in X\), the distribution of observed values is approximately normal with mean \(\mu(x)\) and standard deviation \(\sigma(x)\). So for each \(x\), we have a random variable \[f(x) \sim \mathcal{N}(\mu(x), \sigma(x)).\] Fine. But what is \(f\) as a map? Rewriting the above, we have \[f: x \mapsto \mathcal{N}(\mu(x), \sigma(x)) \in P(\mathbb R) \] where \(P(\mathbb R)\) denotes the set of probability measures on \(\mathbb R\). In other words, we can model the distribution of possible measurements as a map \[ f: X \to P(\mathbb R). \] Taking inspiration from this, we can define a stochastic function from \(X\) to \(Y\) to be a map \(X \to P(Y)\). Now here is the problem: how do stochastic functions compose?

Compostion of stochastic functions

Suppose then that we have stochastic functions \(f: X \to P(Y)\) and \(g: Y \to P(Z)\). We would like to contruct \(g \circ f: X \to P(Z)\). Is there any reasonable way to do this?

First, let’s think about it from the point of view of sampling. Given \(x \in X\), \(f\) gives us a probability measure \(\mu_{f(x)}\) on \(Y\), from which we can sample some \(y \in Y\). Given a sampled \(y\), the stochastic map \(g\) produces a probability measure \(\mu_{g(y)}\) on \(Z\), but of course this depends on the sampled point \(y\). How do we produce something that depends only on \(\mu_{f(x)}\) and not on the particular sampled point \(y\)? We can take the expectation! We can define the probability measure \(\mu_{(g \circ f)(x)}\) by defining what it does on a measurable set \(A \subset Z\): \[ \mu_{(g \circ f)(x)}(A) := \int_Y \mu_{g(y)}(A) d\mu_{f(x)} \] To make this a little bit more explicit, we can take a special case and unwind the definitions. Suppose that \(X = \mathbb{R}^m\), \(Y = \mathbb{R}^n\), and \(Z=\mathbb{R}^k\). Suppose that \(f\) can be represented as a conditional probability density \(p_f(y|x)\) and that \(g\) can likewise be represented by a conditionall probability density \(p_g(z|y)\) (all taken with respect to the usual Lebesgue measure). Then what is \(p_{g\circ f}(z|x)\)? Let’s expand the right hand side of the above definition: \[\begin{aligned} & \int_Y \mu_{g(y)}(A) d\mu_{f(x)} \cr =& \int_Y \left[\int_A p_g(z|y) dz \right] p_f(y|x) dy \cr =& \int_A \left[\int_Y p_g(z|y) p_f(y|x) dy \right] dz \end{aligned}\] from which we deduce that \[ p_{g \circ f}(z|x) = \int_Y p_g(z|y) p_f(y|x) dy. \] In other words, compostion of stochastic functions is just a fancy way to talk about conditional probability!

The Giry monad

The above construction turns out to be monadic. Recall that a monad on a category \(\mathcal{C}\) is a tuple \((T, \eta, \mu)\) satisfying

  • \(T\) is an endofunctor \(\mathcal{C} \to \mathcal{C}\)
  • \(\eta\) is a natural transformation \(1_{\mathcal{C}} \to T\)
  • \(\mu\) is a natural transformation \(T^2 \to T\)
  • \(\mu \circ T\mu = \mu \circ \mu T\) as natural transformations \(T^3 \to T\)
  • \(\mu \circ T \eta = \mu \circ \eta T = 1_T \) as natural transformations \(T \to T\)

(Note carefully that now \(\mu\) refers to the multiplication operation, no longer a measure. In what follows below, we’ll use \(\nu\) to denote measures, to avoid overloading.) If you’ve never seen this definition before, it may seem a bit abstract. But if we think about it in terms of function composition, it is exactly the kind of structure that we have described above.

First, what is the functor \(T\) in our case? Given a reasonable space \(X\), we can take \(T(X)\) to be the set of probability measures on \(X\). To make this into a functor, we need to define how \(T\) acts on maps \(f \in \mathrm{Hom}_{\mathcal{C}}(X, Y)\). There is a natural choice here, the pushforward measure. Suppose we have \(f \in \mathrm{Hom}_{\mathcal{C}}(X, Y)\) and \(\nu \in T(X)\). Then the pushforward measure \(f_\ast\nu \in T(Y)\) is defined by \((f_\ast\nu)(A) := \nu(f^{-1}(A))\). This yields the desired lift \(Tf \in \mathrm{Hom}_{\mathcal{C}}(TX, TY)\).

Second, we need to define the embedding \(\eta: 1_\mathcal{C} \to T \). This is actually very simple: there is a natural map \(X \to TX\) defined by mapping \(x \in X\) to the Dirac measure \(\delta_x\) on \(X\).

And finally, we have to define the multiplication map \(\mu\). For a space \(X\), we need to define \(\mu_X: T(T(X)) \to T(X)\), where \(T\) denotes the operation of taking probability measures, i.e. we need a map from the set of probability measures on the set of probability measures on \(X\), to the set of probability measures on \(X\). This is effectively just an expectation. Let \(M \in T(T(X))\) be a measure on the space of probability measures. We need to define a measure \(\mu M\) on \(X\). We can define this for \(A \subset X\) as \[ (\mu M)(A) := \mathbb{E}_{\nu \sim M}\left[\nu(A) \right], \] i.e. the expected measure of \(A\) with respect to the distribution \(M\) on the space of probability measures.

Under suitable assumptions on spaces and maps under consideration, one can prove that the above operations are well-defined and indeed satisfy the monad laws.

Theorem. Let \(\mathrm{Meas}\) be the category of measurable spaces and measurable maps. Then the operations \((T, \eta, \mu)\) described above are well-defined and satisfy the monad laws. We call this the Giry monad.

The Kleisli category of a monad

Let’s recall monadic compostion. Assume that \((T, \eta, \mu)\) is a monad on some category \(\mathcal{C}\). For \(X, Y \in \mathrm{Ob}(\mathcal{C})\), it makes sense to talk about the set of maps \(\mathrm{Hom}_\mathcal{C}(X, T(Y))\), since \(T\) is an endofunctor. What if we have \(f \in \mathrm{Hom}_\mathcal{C}(X, T(Y))\) and \(g \in \mathrm{Hom}_\mathcal{C}(Y, T(Z))\)? The ordinary rules of category do not allow composing \(f\) and \(g\). A monad is exactly the kind of structure that we need to make the composition well-defined!

Recall that a natural transformation between functors \(F\) and \(G\) is a family of maps \(F(X) \to G(X)\) parametrized by \(X \in \mathrm{Ob}(\mathcal{C})\) that makes the obvious diagram commute. In the monad case, this means that we have maps \(\eta_X: X \to T(X)\) and \(\mu_X: T(T(X)) \to T(X)\). In particular, \(\mu\) allows us to define a monadic lift operation \[(-)^\ast: \mathrm{Hom}_\mathcal{C}(X, T(Y)) \to \mathrm{Hom}_\mathcal{C}(T(X), T(Y))\] by \(f^\ast := \mu_Y \circ Tf\). Then for functions \(f \in \mathrm{Hom}_\mathcal{C}(X, T(Y))\) and \(g \in \mathrm{Hom}_\mathcal{C}(Y, T(Z))\), we can define monadic composition \[ g \circ_T f := g^\ast \circ f. \] Now comes the question: is monadic composition associative? For functions \(f\), \(g\), and \(h\), we have \[ h \circ_T (g \circ_T f) = h^\ast \circ g^\ast \circ f, \] and \[ (h \circ_T g) \circ_T f = (h^\ast \circ g)^\ast \circ f. \] So associativity amounts to the condition \((h^\ast \circ g)^\ast \stackrel{?}{=} h^\ast \circ g^\ast\). Explicitly, we suppose that \(g: Y \to T(Z)\) and \(h: Z \to T(W)\). Expanding definitions and using the monad laws, we have \[\begin{aligned} (h^\ast \circ g)^\ast &= \mu_W \circ T(\mu_W \circ Th \circ g) \cr &= \mu_W \circ T\mu_W \circ T^2 h \circ Tg \cr &= \mu_W \circ \mu_{T(W)} \circ T^2 h \circ Tg \cr &= \mu_W \circ Th \circ \mu_Z \circ Tg \cr &= (\mu_W \circ Th) \circ (\mu_Z \circ Tg) \cr &= h^\ast \circ g^\ast. \end{aligned}\]

Theorem. Let \(\mathcal{C}\) be a category and let \((T, \eta, \mu)\) be a monad on \(\mathcal{C}\). Then we define a new category \(\mathcal{C}_T\) as the category with $$ \mathrm{Ob}(\mathcal{C}_T) := \mathrm{Ob}(\mathcal{C}) $$ $$ \mathrm{Hom}_{\mathcal{C}_T}(X, Y) := \mathrm{Hom}_{\mathcal{C}}(X, T(Y)) $$ with identity function \(1_{X_T} := \eta_X\) and compostion defined by $$g \circ_T f := g^\ast \circ f.$$ Then these objects and morphisms indeed form a category. We call this the Kleisli category of the monad.

Composition of stochastic functions is monadic

Now we can apply the Kleisli category construction to the Giry monad \(G\). Spelling things out explicitly, we have a category \(\mathcal{C}_G\) whose objects are measurable spaces, and whose morphisms are stochastic maps. Then composition of stochastic functions as we defined above is simply monadic composition as we defined it in the previous section.

Let’s spell it out explicitly. Suppose that \(f: X \to P(Y)\) is a stochastic map, which we can think of as being represented by a probability density \(p_f(y|x)\). What is the monadic lift of \(f\)? This is a map \(P(X) \to P(Y)\). How does it act on a measure \(\nu_X \in P(X)\) represented by a probability density \(p_{\nu_X}(x)\)? I.e. how do we combine \(p_f(y|x)\) and \(p_{\nu_X}(x)\) to produce a probability density \(p(y)\) on \(Y\)? The answer should be obvious, so let’s see how it follows from the Kleisli construction.

Recall that the Giry monad acts on functions by taking the pushfoward measure, and that the monad multiplication \(\mu\) is the operation of taking expectation. If we have a function \(h: Y \to \mathbb R\), then its expectation with respect to the monadic lift \(f^\ast := \mu_Y \circ f_\ast \nu_X\) is \[\begin{aligned} &\int_Y h(y) d(\mu_Y \circ f_\ast \nu_X) \cr &= \int_{P(Y)} \left[\int_Y h(y) d\nu \right] d(f_\ast \nu_X) \end{aligned}\] The expression in the square brackets can the thought of as a function \(\tilde{h}: P(Y) \to \mathbb R\) mapping a measure \(\nu\) to the expectation of \(h\) with respect to \(\nu\). Rewriting and using the change of variables formula, we have \[\begin{aligned} & \int_{P(Y)} \left[\int_Y h(y) d\nu \right] d(f_\ast \nu_X) \cr &= \int_X \tilde{h} \circ f d\nu_X \cr &= \int_X \tilde{h}(p_f(-|x)) p_{\nu_X}(x) dx \end{aligned}\] What does the integrand look like in coordinates? By definition, \(\tilde{h}(p_f(-|x))\) is just the expectation of \(h\) with respect to the measure on \(Y\) defined by the density \(p_f(y|x)\), i.e. \[ \tilde{h}(p_f(-|x)) = \int_Y h(y) p_f(y|x) dy\] Combining this with with the above calculation, we see that \(\mu_Y \circ f_\ast \nu_X\) is the map \[ f^\ast: p_{\nu_X}(x) \mapsto \int_X p_f(y|x) p_{\nu_X}(x)dx \] In other words, the monadic lift operation is simply the operation that, given a conditional probability \(p(y|x)\) and an unconditional probability \(p(x)\) on \(X\), and produces an unconditional probability density \(p(y)\) on \(Y\) by integrating over \(X\).

Now suppose that we have stochastic functions \(f: X \to P(Y)\) and \(g: Y \to P(Z)\) represented by conditional probability densities \(p(y|x)\) and \(p(z|y)\). Then we can compute that the monadic composition \( g^\ast \circ f: X \to P(Z)\) is represented by the probability density \[ p(z|x) = \int_Y p(z|y) p(y|x) dx, \] so we find that the previously defined composition of stochastic functions coincides with monadic composition with respect to the Giry monad.

References