Limiting Entropy as a Return
March 23, 2017
\def\Pr#1{\mathbb{P} \left( #1 \right)}
(In progress, just some quick notes until I’m done my thesis)
Expressing Entropy as a Return #
- It’s a fun exercise
- Learning the limiting entropy of a state online may be useful
- There is probably some interesting analysis that could be done
- So, let’s look at discrete Markov Processes and show that there’s a recursive Bellman-like equation for entropy.
- Note that this is different from the entropy rate
Quick rundown of relevant MDP basics #
For a Markov chain, we have that the probability of seeing a particular sequence of states, say (s_0, s_1, s_2, \ldots, s_n), is:
\Pr{s_0, s_1, s_2, \ldots, s_n} = \Pr{s_{0}} \Pr{s_{1} | s_{0}} \cdots \Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots, s_{1}, s_{0}}
Under the Markov assumption: \Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots} = \Pr{s_{n} | s_{n-1}}.
\begin{aligned} \Pr{s_0, s_1, s_2 \ldots s_n} &= \Pr{s_0} \Pr{s_1 | s_0} \Pr{s_2 | s_1} \cdots \Pr{s_{n} | s_{n-1}} \\\\ &=\Pr{s_0} \prod_{t=1}^{n} \Pr{s_{t} | s_{t-1}} \end{aligned}
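As a quick sanity check, here’s a minimal Python sketch of this factorization; the three-state chain below (P0 and P) is made up purely for illustration and is reused in the later sketches:

```python
import numpy as np

# A made-up three-state chain purely for illustration (numbers are arbitrary).
P0 = np.array([0.5, 0.3, 0.2])           # initial distribution, Pr(S_0 = i)
P = np.array([[0.9, 0.1, 0.0],           # P[i, j] = Pr(S_{t+1} = j | S_t = i)
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])

def trajectory_prob(states):
    """Pr(s_0, ..., s_n) = Pr(s_0) * prod_t Pr(s_t | s_{t-1})."""
    prob = P0[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        prob *= P[prev, cur]
    return prob

print(trajectory_prob([0, 0, 1, 2]))      # 0.5 * 0.9 * 0.1 * 0.2 = 0.009
```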
The value of a state (the expected discounted return) for some reward function r(s, s') is:
\begin{aligned} v(s) &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma \Big( \sum_{s_{2}} \Pr{s_{2} | s_{1}} \Big(r(s_1, s_2) + \gamma \sum_{s_{3}} \Big( \ldots \Big) \Big) \Big) \Big) \\\\ &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma v(s_{1}) \Big) \end{aligned}
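In matrix form this recursion is v = \bar{r} + \gamma P v, where \bar{r}(s) = \sum_{s'} \Pr{s' | s} r(s, s') is the expected one-step reward, so for a small known chain we can solve it directly. A sketch, reusing the toy chain above with an arbitrary made-up reward function:

```python
# Reuses numpy and the toy chain (P) from the sketch above.
gamma = 0.9
r = np.array([[1.0, 0.0, 0.0],             # r[s, s2] = r(s, s2), arbitrary values
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
r_bar = (P * r).sum(axis=1)                # expected one-step reward for each state
v = np.linalg.solve(np.eye(len(P)) - gamma * P, r_bar)   # v = (I - gamma P)^{-1} r_bar
print(v)
```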
Entropy #
The classic definition of entropy for a discrete random variable X which takes values from some set \mathcal{X} is:
H(X) = -\sum_{x \in \mathcal{X}} \Pr{X = x} \log\big( \Pr{X = x} \big)
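(Here and below I’ll use the natural log, so entropy is in nats; base 2 would give bits.) A small helper used in the later sketches:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """H(X) = -sum_x Pr(X = x) log Pr(X = x); zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > eps]
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))    # log(2) ≈ 0.693
print(entropy([1.0, 0.0]))    # 0.0: a deterministic outcome carries no entropy
```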
We may also wish to express the entropy of two or more random variables taken together (the joint entropy). The joint entropy of two r.v.s, X and Y, is given by:
\begin{aligned} H(X,Y) &= -\sum_{x, y} \Pr{x,y} \log( \Pr{x, y} ) \\\\ &= H(Y|X) + H(X) = H(X|Y) + H(Y) \end{aligned}
where H(Y|X) is the conditional entropy of Y given X. The conditional entropy can be expressed as:
\begin{aligned} H(Y|X) &= \sum_{x \in \mathcal{X}} \Pr{x} H(Y|X=x) \\\\ &= -\sum_{x \in \mathcal{X}} \Pr{x} \sum_{y \in \mathcal{Y}} \Pr{y|x} \log (\Pr{y|x}) \\\\ &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr{x,y} \log (\Pr{y|x}) \end{aligned}
Note that if Y is independent of X, then H(Y|X) = H(Y).
We can extend the formula for the conditional entropy to H(Y | X_{1}, X_{2}, \ldots, X_{n}) fairly easily:
H(Y|X_{1}, X_{2}, \ldots, X_{n}) = -\sum_{x_{1}, \ldots, x_{n}} \sum_{y \in \mathcal{Y}} \Pr{x_{1}, \ldots, x_{n}, y} \log (\Pr{y|x_{1}, \ldots, x_{n}})
Furthermore, for a sequence of random variables (X_{1}, \ldots, X_{n}), the joint entropy can be expressed via the chain rule:
\begin{aligned} H(X_{1}, \ldots, X_{n}) &= - \sum_{x_{1}} \sum_{x_{2}} \cdots \sum_{x_{n}} \Pr{x_1, x_2, \ldots x_n} \log \big( \Pr{x_1, x_2, \ldots x_n} \big) \\\\ &= \sum_{k=1}^{n} H(X_{k} | X_{1}, \ldots, X_{k-1}) \end{aligned}
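A quick numerical check of the chain rule H(X, Y) = H(Y|X) + H(X) on a small made-up joint distribution, reusing the entropy helper above:

```python
# Rows index x, columns index y; the numbers are arbitrary but sum to 1.
Pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])
Px = Pxy.sum(axis=1)                       # marginal Pr(X = x)
H_joint = entropy(Pxy.ravel())             # H(X, Y)
H_x = entropy(Px)                          # H(X)
H_y_given_x = sum(Px[i] * entropy(Pxy[i] / Px[i]) for i in range(len(Px)))
print(H_joint, H_x + H_y_given_x)          # the two numbers agree
```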
Entropy in Markov Chains #
When it comes to Markov chains, we can use the Markov property to get a somewhat suggestive expression of the entropy.
First note how it simplifies the expression for conditional entropy:
\begin{aligned} H(X_{k} | X_{0}, \ldots, X_{k-1}) &= \sum_{x_{0}} \cdots \sum_{x_{k-1}} \Pr{x_{0}, \ldots, x_{k-1}} H(X_{k} | X_{0} = x_{0}, \ldots, X_{k-1} = x_{k-1}) \\\\ &= \sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}, x_{0}} \sum_{x_{3}} \cdots \\\\ &\qquad \quad \sum_{x_{k-1}} \Pr{x_{k-1} | x_{0}, \ldots, x_{k-2}} H(X_{k} | X_{0} = x_{0}, \ldots, X_{k-1} = x_{k-1}) \\\\ &= -\sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}} \cdots \sum_{x_{k-1}} \Pr{x_{k-1} | x_{k-2}} \\\\ &\qquad \quad \sum_{x_{k}} \Pr{x_{k} | x_{k-1}} \log \Big( \Pr{x_{k} | x_{k-1}} \Big) \end{aligned}
Note that this has a nice, recursion-friendly form. For example, if we substitute this expression to calculate the joint entropy, expanding up to X_{3}, we get:
\begin{aligned} H(X_{0}, X_{1}, X_{2}, X_{3}) = -&\sum_{x_0} \Pr{x_0} \Big[ \log( \Pr{x_0}) \\\\ &+ \sum_{x_1} \Pr{x_1 | x_0} \Big[ \log(\Pr{x_1 | x_0}) \\\\ & \quad + \sum_{x_2} \Pr{x_2 | x_1} \Big[ \log(\Pr{x_2 | x_1}) \\\\ & \quad\quad + \sum_{x_3} \Pr{x_3 | x_2} \log(\Pr{x_3 | x_2}) \Big]\Big]\Big] \end{aligned}
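To convince ourselves this nested form is right, here’s a sketch comparing it against the brute-force joint entropy on the toy chain from earlier (it reuses P0, P, and trajectory_prob):

```python
from itertools import product

n = len(P0)

# Brute force: enumerate every length-4 trajectory (X_0, ..., X_3).
H_direct = 0.0
for traj in product(range(n), repeat=4):
    p = trajectory_prob(list(traj))
    if p > 0:
        H_direct -= p * np.log(p)

# Nested form: innermost bracket first, working outward.
def nested(prev, depth):
    if depth == 3:                         # final transition, X_2 -> X_3
        return sum(P[prev, s] * np.log(P[prev, s])
                   for s in range(n) if P[prev, s] > 0)
    return sum(P[prev, s] * (np.log(P[prev, s]) + nested(s, depth + 1))
               for s in range(n) if P[prev, s] > 0)

H_nested = -sum(P0[s] * (np.log(P0[s]) + nested(s, 1))
                for s in range(n) if P0[s] > 0)
print(H_direct, H_nested)                  # should match
```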
Bellman Equation #
It’s starting to look very similar to a return.
In fact, we can get a Bellman equation by defining the reward as the negative log-probability of the transition, so that the leading minus sign in the expansion above is absorbed into the reward.
\begin{aligned} R_{t+1} &= -\log \big( \Pr{S_{t+1} | S_{t}} \big) \\\\ G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\\\ &= R_{t+1} + \gamma G_{t+1} \\\\ v_{H}(s) &= \mathbb{E}[G_{t} | S_{t} = s] \end{aligned}
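For a known transition matrix this is again just a linear system: the expected one-step reward from s is the entropy of the next-state distribution, h(s) = -\sum_{s'} \Pr{s' | s} \log \Pr{s' | s}, and v_{H} = (I - \gamma P)^{-1} h. A sketch on the toy chain, reusing P and the entropy helper:

```python
gamma = 0.9
h = np.array([entropy(P[s]) for s in range(len(P))])   # one-step entropy per state
v_H = np.linalg.solve(np.eye(len(P)) - gamma * P, h)
print(v_H)    # discounted "limiting entropy" of each state
```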
- Note the discount factor; if it’s less than 1 then we’re not “really” calculating the limiting entropy.
- However, if the MDP has no terminal states (and its transitions retain some randomness), then the undiscounted entropy is infinite, which is not an especially helpful observation when trying to employ these concepts in practice.
- The return can easily be infinite in the undiscounted case for a non-terminating MDP; for example, the expected return is infinite whenever the rewards are bounded below by some positive constant.
- So the discount factor can be chosen arbitrarily, or you can perhaps avoid the infinities by renormalizing (e.g. subtracting off the per-step entropy averaged over states, in the spirit of the average-reward setting)
- If you wanted a “justified” discount factor, you can look at 1 - \gamma as a per-step termination probability
- Or perhaps you’re interested in the entropy over some time horizon, say \tau, and set \gamma = (\tau - 1)/\tau.
- Using a state-dependent \gamma to express that some outcomes shouldn’t count is also plausible, although not especially compelling.
- There’s another way of doing the analysis where you’re working with infinite products instead of infinite sums
- The trouble arises mainly because, with discounting, we end up with $$ G_{t} = -\log \big( \Pr{S_{t+1} | S_{t}} \big) - \log \big( \Pr{S_{t+2} | S_{t+1}}^{\gamma} \big) - \log \big( \Pr{S_{t+3} | S_{t+2}}^{\gamma^{2}} \big) - \cdots = -\log \Big( \prod_{k \geq 0} \Pr{S_{t+k+1} | S_{t+k}}^{\gamma^{k}} \Big) $$ which is the (negative) log of a \gamma-weighted product of transition probabilities, and a little hard to interpret.
- Where might you use it in practice?
- An estimate of a particular state’s entropy gives you an idea of the number of possible branching outcomes.
- These outcomes might have the same end result
- If you wanted to measure how different possible results might be, you’d use something like the variance (or another distance-like measure)
- In combination with other quantities (like the variance just mentioned), this could be really informative.
- Given two policies with the same expected return, an agent might prefer the one that induces lower entropy, because its outcomes are more certain
- Conversely, during learning it might be better to explore high entropy states, or select them more frequently, because potentially many outcomes can spring from that single state.
- It may also reveal some information about where to allocate features, but I haven’t analyzed this particularly deeply
- We can use entropy as guidance for how hard an MDP is, or how hard a state’s value is to approximate
- States with high entropy may benefit from a lower learning rate; conversely, states with low entropy may admit a higher learning rate
- Of course, issues spring up because in order to get a good estimate of a state’s entropy you need to visit it many times; a minimal online (TD-style) sketch of such an estimate follows below.
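For concreteness, here’s a minimal TD(0)-style sketch of that online estimate on the toy chain; it assumes the transition probabilities are known (or separately estimated) so the surprisal reward can be computed for each observed transition, and the step size and number of steps are arbitrary:

```python
# Reuses numpy and the toy chain (P0, P) from the earlier sketches.
rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.05
n_states = len(P)
v_hat = np.zeros(n_states)                 # online estimate of v_H

s = rng.choice(n_states, p=P0)
for _ in range(200_000):
    s_next = rng.choice(n_states, p=P[s])
    reward = -np.log(P[s, s_next])         # surprisal of the observed transition
    td_error = reward + gamma * v_hat[s_next] - v_hat[s]
    v_hat[s] += alpha * td_error
    s = s_next

print(v_hat)   # should roughly agree with the closed-form v_H above
```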