Limiting Entropy as a Return

March 23, 2017
generalized returns, reinforcement learning, temporal difference learning, math

\def\Pr#1{\mathbb{P} \left( #1 \right)}

(In progress, just some quick notes until I’m done with my thesis)

Expressing Entropy as a Return #

  • It’s a fun exercise
  • Learning the limiting entropy of a state online may be useful
  • There is probably some interesting analysis that could be done
  • So, let’s look at discrete Markov Processes and show that there’s a recursive Bellman-like equation for entropy.
  • Note that this is different from the entropy rate, which measures the long-run average per-step entropy rather than the total entropy of a trajectory

Quick rundown of relevant MDP basics #

For a Markov chain, we have that the probability of seeing a particular sequence of states, say (s_0, s_1, s_2, \ldots, s_n), is:

\Pr{s_0, s_1, s_2 \ldots s_n} = \Pr{s_{0}} \Pr{s_{1} | s_{0}} \cdots \Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots s_{1}, s_{0}}

Under the Markov assumption: \Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots} = \Pr{s_{n} | s_{n-1}}.

\begin{aligned} \Pr{s_0, s_1, s_2 \ldots s_n} &= \Pr{s_0} \Pr{s_1 | s_0} \Pr{s_2 | s_1} \cdots \Pr{s_{n} | s_{n-1}} \\\\ &=\Pr{s_0} \prod_{t=1}^{n} \Pr{s_{t} | s_{t-1}} \end{aligned}
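As a quick sanity check, this factorization translates directly into code. A minimal sketch, assuming a hypothetical 3-state chain with transition matrix `P` and initial distribution `mu` (both made up for illustration):

```python
import numpy as np

# Hypothetical 3-state chain: P[i, j] = Pr(S_t = j | S_{t-1} = i)
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])
mu = np.array([1.0, 0.0, 0.0])  # initial distribution Pr(S_0)

def sequence_probability(states, P, mu):
    """Pr(s_0, ..., s_n) = Pr(s_0) * prod_t Pr(s_t | s_{t-1})."""
    prob = mu[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= P[prev, cur]
    return prob

print(sequence_probability([0, 1, 2, 2], P, mu))  # 1.0 * 0.5 * 0.3 * 0.6
```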

The value of a state (the expected discounted return) for some reward function r(s, s') and discount factor \gamma is

\begin{aligned} v(s) &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma \Big( \sum_{s_{2}} \Pr{s_{2} | s_{1}} \Big(r(s_1, s_2) + \gamma \sum_{s_{3}} \Big( \ldots \Big) \Big) \Big) \Big) \\\\ &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma v(s_{1}) \Big) \end{aligned}
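Since the recursive form above is linear in v, the values can be computed in one shot by solving (I - \gamma P) v = \bar{r}, where \bar{r} is the expected one-step reward. A sketch, reusing a hypothetical 3-state chain and an arbitrary made-up reward table:

```python
import numpy as np

# Hypothetical 3-state chain and an arbitrary transition reward r(s, s')
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])
R = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [3.0, 0.0, 1.0],
])
gamma = 0.9

# Expected one-step reward: r_bar[s] = sum_{s'} Pr(s' | s) r(s, s')
r_bar = (P * R).sum(axis=1)

# v(s) = r_bar[s] + gamma * sum_{s'} Pr(s' | s) v(s'), i.e. (I - gamma P) v = r_bar
v = np.linalg.solve(np.eye(len(P)) - gamma * P, r_bar)
print(v)
```

Solving the linear system is exact for small state spaces; the same fixed point is what TD methods estimate incrementally.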

Entropy #

The classic definition of entropy for a discrete random variable X which takes values from some set \mathcal{X} is:

H(X) = -\sum_{x \in \mathcal{X}} \Pr{X = x} \log\big( \Pr{X = x} \big)
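This definition is a one-liner in code. A minimal sketch, using the usual convention that 0 \log 0 = 0:

```python
import numpy as np

def entropy(p, base=2.0):
    """H(X) = -sum_x Pr(X = x) log Pr(X = x), with 0 log 0 taken to be 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # drop zero-probability outcomes (0 log 0 = 0)
    return float(-np.sum(nz * np.log(nz)) / np.log(base))

print(entropy([0.5, 0.5]))  # a fair coin carries 1 bit
print(entropy([1.0, 0.0]))  # a certain outcome carries no information
```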

We may also wish to express the entropy of two or more random variables taken together (the joint entropy). The joint entropy of two r.v.s, X and Y, is given by:

\begin{aligned} H(X,Y) &= -\sum_{x, y} \Pr{x,y} \log( \Pr{x, y} ) \\\\ &= H(Y|X) + H(X) = H(X|Y) + H(Y) \end{aligned}

where H(Y|X) is the conditional entropy of Y given X. The formula for conditional entropy can be expressed as:

\begin{aligned} H(Y|X) &= \sum_{x \in \mathcal{X}} \Pr{x} H(Y|X=x) \\\\ &= -\sum_{x \in \mathcal{X}} \Pr{x} \sum_{y \in \mathcal{Y}} \Pr{y|x} \log (\Pr{y|x}) \\\\ &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr{x,y} \log (\Pr{y|x}) \end{aligned}
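The last line of this derivation gives a direct way to compute H(Y|X) from a joint probability table. A sketch, assuming `joint[i, j]` holds the (made-up) probability Pr(X = i, Y = j):

```python
import numpy as np

def conditional_entropy(joint, base=2.0):
    """H(Y|X) = -sum_{x,y} Pr(x, y) log Pr(y | x)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)  # marginal Pr(X = x)
    h = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                # Pr(y | x) = Pr(x, y) / Pr(x)
                h -= joint[i, j] * np.log(joint[i, j] / px[i])
    return h / np.log(base)

# When Y is independent of X, H(Y|X) should equal H(Y):
joint = np.outer([0.5, 0.5], [0.25, 0.75])  # independent X and Y
hy = -sum(p * np.log2(p) for p in [0.25, 0.75])
print(conditional_entropy(joint), hy)  # both roughly 0.811
```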

Note that if Y is independent of X, then H(Y|X) = H(Y).

We can extend the formula for the conditional entropy to H(Y | X_{1}, X_{2}, \ldots, X_{n}) fairly easily:

H(Y|X_{1}, X_{2}, \ldots, X_{n}) = -\sum_{x_{1}, \ldots x_{n}} \sum_{y \in \mathcal{Y}} \Pr{x_{1}, \ldots, x_{n}, y} \log (\Pr{y|x_{1}, \ldots, x_{n}})

Furthermore, for a sequence of r.v.s (X_{1}, \ldots, X_{n}), the joint entropy can be expressed via the chain rule:

\begin{aligned} H(X_{1}, \ldots, X_{n}) &= - \sum_{x_{1}} \sum_{x_{2}} \cdots \sum_{x_{n}} \Pr{x_1, x_2, \ldots x_n} \log \big( \Pr{x_1, x_2, \ldots x_n} \big) \\\\ &= \sum_{k=1}^{n} H(X_{k} | X_{1}, \ldots, X_{k-1}) \end{aligned}
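The chain rule is easy to verify numerically in the two-variable case, where it reads H(X, Y) = H(X) + H(Y|X). A sketch with a hypothetical 2x2 joint distribution:

```python
import numpy as np

# Hypothetical joint distribution: joint[i, j] = Pr(X = i, Y = j)
joint = np.array([[0.2, 0.1],
                  [0.3, 0.4]])

def H(p):
    """Entropy of a probability array, in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_joint = H(joint)            # H(X, Y), from the full table
h_x = H(joint.sum(axis=1))    # H(X), from the marginal
# H(Y|X) computed directly from its definition
h_y_given_x = -sum(joint[i, j] * np.log2(joint[i, j] / joint[i].sum())
                   for i in range(2) for j in range(2))
print(h_joint, h_x + h_y_given_x)  # the two quantities should agree
```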

Entropy in Markov Chains #

When it comes to Markov chains, we can use the Markov property to get a somewhat suggestive expression of the entropy.

First note how it simplifies the expression for conditional entropy:

\begin{aligned} H(X_{k} | X_{0}, \ldots, X_{k-1}) &= \sum_{x_{0}} \cdots \sum_{x_{k-1}} \Pr{x_{0}, \ldots, x_{k-1}} H(X_{k} | X_{0} = x_{0}, \ldots, X_{k-1} = x_{k-1}) \\\\ &= \sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}, x_{0}} \sum_{x_{3}} \cdots \\\\ &\qquad \quad \sum_{x_{k-1}} \Pr{x_{k-1} | x_{0}, \ldots, x_{k-2}} H(X_{k} | X_{0} = x_{0}, \ldots, X_{k-1} = x_{k-1}) \\\\ &= -\sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}} \cdots \sum_{x_{k-1}} \Pr{x_{k-1} | x_{k-2}} \\\\ &\qquad \quad \sum_{x_{k}} \Pr{x_{k} | x_{k-1}} \log \Big( \Pr{x_{k} | x_{k-1}} \Big) \end{aligned}

Note that it has a nice recursion-friendly form. For example, if we substitute this expression in order to calculate the joint entropy, expanding up to X_{3}, we get:

\begin{aligned} H(X_{0}, X_{1}, \ldots, X_{3}) = -&\sum_{x_0} \Pr{x_0} \Big[ \log( \Pr{x_0}) \\\\ &+ \sum_{x_1} \Pr{x_1 | x_0} \Big[ \log(\Pr{x_1 | x_0}) \\\\ & \quad + \sum_{x_2} \Pr{x_2 | x_1} \Big[ \log(\Pr{x_2 | x_1}) \\\\ & \quad\quad + \sum_{x_3} \Pr{x_3 | x_2} \log(\Pr{x_3 | x_2}) \Big]\Big]\Big] \end{aligned}

Bellman Equation #

It’s starting to look very similar to a return.

In fact, we can get a Bellman equation by defining the reward as the negative log-probability of the transition, matching the sign convention in the definition of entropy.

\begin{aligned} R_{t+1} &= -\log \big( \Pr{S_{t+1} | S_{t}} \big) \\\\ G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\\\ &= R_{t+1} + \gamma G_{t+1} \\\\ v_{H}(s) &= \mathbb{E}[G_{t} | S_{t} = s] \end{aligned}
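For a small chain this Bellman equation can be solved directly. A sketch, assuming a hypothetical 3-state chain, and taking the reward to be the negative log-probability of each transition so that the expected one-step reward at s is exactly the entropy of the next-state distribution:

```python
import numpy as np

# Hypothetical 3-state chain; row s is the next-state distribution from s.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])
gamma = 0.9

# Per-transition reward R(s, s') = -log Pr(s' | s). The expected one-step
# reward r_bar[s] is then the entropy (in nats) of row s of P.
logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
r_bar = -(P * logP).sum(axis=1)

# Bellman equation v_H = r_bar + gamma * P v_H, solved as a linear system.
v_H = np.linalg.solve(np.eye(len(P)) - gamma * P, r_bar)
print(v_H)
```

As a sanity check, setting gamma = 0 recovers the one-step entropies r_bar, while gamma closer to 1 weights entropy accumulated further into the future.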

  • Note the discount factor; if it’s less than 1 then we’re not “really” calculating the limiting entropy.
  • However, if the MDP has no terminal states, then the limiting entropy is generally infinite, which is not an especially helpful observation when trying to employ these concepts in practice.
    • The return can easily be infinite in the undiscounted case for a non-terminating MDP. For example, the expected return is infinite whenever the reward function is strictly positive everywhere.
    • So the discount factor can be chosen arbitrarily, or you can perhaps avoid infinities by renormalizing (subtracting the joint entropy averaged over all states)
  • If you wanted a “justified” discount factor, you can look at 1 - \gamma as a termination probability
    • Or perhaps you’re interested in the entropy over some time horizon, say \tau, and set \gamma = (\tau - 1)/\tau.
    • Using a state-dependent\gamma to express whether some outcomes somehow do not count is also plausible, although not super compelling.
  • There’s another way of doing the analysis where you’re working with infinite products instead of infinite sums
    • The trouble arises mainly because we end up with, up to an overall sign, $$ G_{t} = \log \Big( \Pr{S_{t+1} | S_{t}} \cdot \Pr{S_{t+2} | S_{t+1}}^{\gamma} \cdot \Pr{S_{t+3} | S_{t+2}}^{\gamma^{2}} \cdots \Big) $$ which is a little bit hard to interpret.

  • Where might you use it in practice?
    • An estimate of a particular state’s entropy gives you an idea of the number of possible branching outcomes.
      • These outcomes might have the same end result
      • If you wanted to measure how different possible results might be, you’d use something like the variance (or another distance-like measure)
      • In combination with other things, this could be really informative.
    • Given two policies with the same expected return, an agent might prefer the one whose return has lower entropy, since its outcomes are more certain
    • Conversely, during learning it might be better to explore high entropy states, or select them more frequently, because potentially many outcomes can spring from that single state.
    • It may also reveal some information about where to allocate features, but I haven’t analyzed this particularly deeply
    • We can use entropy as guidance for how hard an MDP is, or how hard a state is to approximate
      • States with high entropy may benefit from a lower learning rate, conversely states with low entropy may admit a higher learning rate
  • Of course, issues spring up because in order to get a good estimate for a state’s entropy you need to visit it many times.
