Limiting Entropy as a Return

March 23, 2017
generalized returns, reinforcement learning, temporal difference learning, math

\def\Pr#1{\mathbb{P} \left( #1 \right)}

(In progress, just some quick notes until I’m done with my thesis)

Expressing Entropy as a Return #

  • It’s a fun exercise
  • Learning the limiting entropy of a state online may be useful
  • There is probably some interesting analysis that could be done
  • So, let’s look at discrete Markov Processes and show that there’s a recursive Bellman-like equation for entropy.
  • Note that this is different from the entropy rate, which measures the long-run average per-step entropy rather than the total entropy of a trajectory

Quick rundown of relevant MDP basics #

For a Markov chain, we have that the probability of seeing a particular sequence of states, say (s_0, s_1, s_2, \ldots, s_n), is:

\Pr{s_0, s_1, s_2 \ldots s_n} = \Pr{s_{0}} \Pr{s_{1} | s_{0}} \cdots \Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots s_{1}, s_{0}}

Under the Markov assumption: \Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots} = \Pr{s_{n} | s_{n-1}}.

\begin{aligned} \Pr{s_0, s_1, s_2 \ldots s_n} &= \Pr{s_0} \Pr{s_1 | s_0} \Pr{s_2 | s_1} \cdots \Pr{s_{n} | s_{n-1}} \\\\ &=\Pr{s_0} \prod_{t=1}^{n} \Pr{s_{t} | s_{t-1}} \end{aligned}
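As a quick sanity check, this factorization translates directly into code. A minimal sketch, assuming a hypothetical 3-state chain with transition matrix `P` and initial distribution `mu` (both made up for illustration):

```python
import numpy as np

# Hypothetical 3-state chain: P[i, j] = Pr(S_t = j | S_{t-1} = i)
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])
mu = np.array([1.0, 0.0, 0.0])  # initial distribution Pr(S_0)

def sequence_probability(states, P, mu):
    """Pr(s_0, ..., s_n) = Pr(s_0) * prod_t Pr(s_t | s_{t-1})."""
    prob = mu[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= P[prev, cur]
    return prob

print(sequence_probability([0, 1, 2, 2], P, mu))  # 1.0 * 0.5 * 0.3 * 0.6
```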

The value of a state (the expected discounted return) for some reward function r(s, s') and discount factor \gamma is

\begin{aligned} v(s) &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma \Big( \sum_{s_{2}} \Pr{s_{2} | s_{1}} \Big(r(s_1, s_2) + \gamma \sum_{s_{3}} \Big( \ldots \Big) \Big) \Big) \Big) \\\\ &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma v(s_{1}) \Big) \end{aligned}
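Since the recursive form above is linear in v, the values can be computed in one shot by solving (I - \gamma P) v = \bar{r}, where \bar{r} is the expected one-step reward. A sketch, reusing a hypothetical 3-state chain and an arbitrary made-up reward table:

```python
import numpy as np

# Hypothetical 3-state chain and an arbitrary transition reward r(s, s')
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])
R = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [3.0, 0.0, 1.0],
])
gamma = 0.9

# Expected one-step reward: r_bar[s] = sum_{s'} Pr(s' | s) r(s, s')
r_bar = (P * R).sum(axis=1)

# v(s) = r_bar[s] + gamma * sum_{s'} Pr(s' | s) v(s'), i.e. (I - gamma P) v = r_bar
v = np.linalg.solve(np.eye(len(P)) - gamma * P, r_bar)
print(v)
```

Solving the linear system is exact for small state spaces; the same fixed point is what TD methods estimate incrementally.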

Entropy #

The classic definition of entropy for a discrete random variable X which takes values from some set \mathcal{X} is:

H(X) = -\sum_{x \in \mathcal{X}} \Pr{X = x} \log\big( \Pr{X = x} \big)
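This definition is a one-liner in code. A minimal sketch, using the usual convention that 0 \log 0 = 0:

```python
import numpy as np

def entropy(p, base=2.0):
    """H(X) = -sum_x Pr(X = x) log Pr(X = x), with 0 log 0 taken to be 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # drop zero-probability outcomes (0 log 0 = 0)
    return float(-np.sum(nz * np.log(nz)) / np.log(base))

print(entropy([0.5, 0.5]))  # a fair coin carries 1 bit
print(entropy([1.0, 0.0]))  # a certain outcome carries no information
```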

We may also wish to express the entropy of two or more random variables taken together (the joint entropy). The joint entropy of two r.v.s, X and Y, is given by:

\begin{aligned} H(X,Y) &= -\sum_{x, y} \Pr{x,y} \log( \Pr{x, y} ) \\\\ &= H(Y|X) + H(X) = H(X|Y) + H(Y) \end{aligned}

where H(Y|X) is the conditional entropy of Y given X. The formula for conditional entropy can be expressed as:

\begin{aligned} H(Y|X) &= \sum_{x \in \mathcal{X}} \Pr{x} H(Y|X=x) \\\\ &= -\sum_{x \in \mathcal{X}} \Pr{x} \sum_{y \in \mathcal{Y}} \Pr{y|x} \log (\Pr{y|x}) \\\\ &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr{x,y} \log (\Pr{y|x}) \end{aligned}
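The last line of this derivation gives a direct way to compute H(Y|X) from a joint probability table. A sketch, assuming `joint[i, j]` holds the (made-up) probability Pr(X = i, Y = j):

```python
import numpy as np

def conditional_entropy(joint, base=2.0):
    """H(Y|X) = -sum_{x,y} Pr(x, y) log Pr(y | x)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)  # marginal Pr(X = x)
    h = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                # Pr(y | x) = Pr(x, y) / Pr(x)
                h -= joint[i, j] * np.log(joint[i, j] / px[i])
    return h / np.log(base)

# When Y is independent of X, H(Y|X) should equal H(Y):
joint = np.outer([0.5, 0.5], [0.25, 0.75])  # independent X and Y
hy = -sum(p * np.log2(p) for p in [0.25, 0.75])
print(conditional_entropy(joint), hy)  # both roughly 0.811
```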

Note that if Y is independent of X, then H(Y|X) = H(Y).

We can extend the formula for the conditional entropy to H(Y | X_{1}, X_{2}, \ldots, X_{n}) fairly easily:

H(Y|X_{1}, X_{2}, \ldots, X_{n}) = -\sum_{x_{1}, \ldots x_{n}} \sum_{y \in \mathcal{Y}} \Pr{x_{1}, \ldots, x_{n}, y} \log (\Pr{y|x_{1}, \ldots, x_{n}})

Furthermore, for a sequence of r.v.s (X_{1}, \ldots, X_{n}), the joint entropy can be expressed via the chain rule:

\begin{aligned} H(X_{1}, \ldots, X_{n}) &= - \sum_{x_{1}} \sum_{x_{2}} \cdots \sum_{x_{n}} \Pr{x_1, x_2, \ldots x_n} \log \big( \Pr{x_1, x_2, \ldots x_n} \big) \\\\ &= \sum_{k=1}^{n} H(X_{k} | X_{1}, \ldots, X_{k-1}) \end{aligned}
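The chain rule is easy to verify numerically in the two-variable case, where it reads H(X, Y) = H(X) + H(Y|X). A sketch with a hypothetical 2x2 joint distribution:

```python
import numpy as np

# Hypothetical joint distribution: joint[i, j] = Pr(X = i, Y = j)
joint = np.array([[0.2, 0.1],
                  [0.3, 0.4]])

def H(p):
    """Entropy of a probability array, in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_joint = H(joint)            # H(X, Y), from the full table
h_x = H(joint.sum(axis=1))    # H(X), from the marginal
# H(Y|X) computed directly from its definition
h_y_given_x = -sum(joint[i, j] * np.log2(joint[i, j] / joint[i].sum())
                   for i in range(2) for j in range(2))
print(h_joint, h_x + h_y_given_x)  # the two quantities should agree
```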

Entropy in Markov Chains #

When it comes to Markov chains, we can use the Markov property to get a somewhat suggestive expression of the entropy.

First note how it simplifies the expression for conditional entropy:

\begin{aligned} H(X_{k} | X_{0}, \ldots, X_{k-1}) &= \sum_{x_{0}} \cdots \sum_{x_{k-1}} \Pr{x_{0}, \ldots, x_{k-1}} H(X_{k} | X_{0} = x_{0}, \ldots, X_{k-1} = x_{k-1}) \\\\ &= \sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}, x_{0}} \sum_{x_{3}} \cdots \\\\ &\qquad \quad \sum_{x_{k-1}} \Pr{x_{k-1} | x_{0}, \ldots, x_{k-2}} H(X_{k} | X_{0} = x_{0}, \ldots, X_{k-1} = x_{k-1}) \\\\ &= -\sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}} \cdots \sum_{x_{k-1}} \Pr{x_{k-1} | x_{k-2}} \\\\ &\qquad \quad \sum_{x_{k}} \Pr{x_{k} | x_{k-1}} \log \Big( \Pr{x_{k} | x_{k-1}} \Big) \end{aligned}

Note that it has a nice recursion-friendly form. For example, if we substitute this expression in order to calculate the joint entropy, expanding up to X_{3}, we get:

\begin{aligned} H(X_{0}, X_{1}, \ldots, X_{3}) = -&\sum_{x_0} \Pr{x_0} \Big[ \log( \Pr{x_0}) \\\\ &+ \sum_{x_1} \Pr{x_1 | x_0} \Big[ \log(\Pr{x_1 | x_0}) \\\\ & \quad + \sum_{x_2} \Pr{x_2 | x_1} \Big[ \log(\Pr{x_2 | x_1}) \\\\ & \quad\quad + \sum_{x_3} \Pr{x_3 | x_2} \log(\Pr{x_3 | x_2}) \Big]\Big]\Big] \end{aligned}

Bellman Equation #

It’s starting to look very similar to a return.

In fact, we can get a Bellman equation by defining the reward as the negative log-probability of the transition, matching the sign convention in the definition of entropy.

\begin{aligned} R_{t+1} &= -\log \big( \Pr{S_{t+1} | S_{t}} \big) \\\\ G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\\\ &= R_{t+1} + \gamma G_{t+1} \\\\ v_{H}(s) &= \mathbb{E}[G_{t} | S_{t} = s] \end{aligned}
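For a small chain this Bellman equation can be solved directly. A sketch, assuming a hypothetical 3-state chain, and taking the reward to be the negative log-probability of each transition so that the expected one-step reward at s is exactly the entropy of the next-state distribution:

```python
import numpy as np

# Hypothetical 3-state chain; row s is the next-state distribution from s.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])
gamma = 0.9

# Per-transition reward R(s, s') = -log Pr(s' | s). The expected one-step
# reward r_bar[s] is then the entropy (in nats) of row s of P.
logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
r_bar = -(P * logP).sum(axis=1)

# Bellman equation v_H = r_bar + gamma * P v_H, solved as a linear system.
v_H = np.linalg.solve(np.eye(len(P)) - gamma * P, r_bar)
print(v_H)
```

As a sanity check, setting gamma = 0 recovers the one-step entropies r_bar, while gamma closer to 1 weights entropy accumulated further into the future.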

  • Note the discount factor; if it’s less than 1 then we’re not “really” calculating the limiting entropy.
  • However, if the MDP has no terminal states, then the limiting entropy is generally infinite, which is not an especially helpful observation when trying to employ these concepts in practice.
    • The return can easily be infinite in the undiscounted case for a non-terminating MDP. For example, the expected return is infinite whenever the reward function is strictly positive everywhere.
    • So the discount factor can be chosen arbitrarily, or you can perhaps avoid infinities by renormalizing (subtracting the joint entropy averaged over all states)
  • If you wanted a “justified” discount factor, you can look at 1 - \gamma as a termination probability
    • Or perhaps you’re interested in the entropy over some time horizon, say \tau, and set \gamma = (\tau - 1)/\tau.
    • Using a state-dependent\gamma to express whether some outcomes somehow do not count is also plausible, although not super compelling.
  • There’s another way of doing the analysis where you’re working with infinite products instead of infinite sums
    • The trouble arises mainly because we end up with, up to an overall sign, $$ G_{t} = \log \Big( \Pr{S_{t+1} | S_{t}} \cdot \Pr{S_{t+2} | S_{t+1}}^{\gamma} \cdot \Pr{S_{t+3} | S_{t+2}}^{\gamma^{2}} \cdots \Big) $$ which is a little bit hard to interpret.

  • Where might you use it in practice?
    • An estimate of a particular state’s entropy gives you an idea of the number of possible branching outcomes.
      • These outcomes might have the same end result
      • If you wanted to measure how different possible results might be, you’d use something like the variance (or another distance-like measure)
      • In combination with other things, this could be really informative.
    • Given two policies with the same expected return, an agent might prefer the one whose return has lower entropy, since its outcomes are more certain
    • Conversely, during learning it might be better to explore high entropy states, or select them more frequently, because potentially many outcomes can spring from that single state.
    • It may also reveal some information about where to allocate features, but I haven’t analyzed this particularly deeply
    • We can use entropy as guidance for how hard an MDP is, or how hard a state is to approximate
      • States with high entropy may benefit from a lower learning rate, conversely states with low entropy may admit a higher learning rate
  • Of course, issues spring up because in order to get a good estimate for a state’s entropy you need to visit it many times.
