Estimating Fractional Returns in Markov Decision Processes

(Just some quick thoughts before I return to typesetting my thesis... I will return to this to flesh things out once I've gotten a bit more sleep and organized my thoughts)

I am writing my thesis on estimating the variance of the return in Markov Decision Processes using online incremental algorithms, which turns out to be a surprisingly complex problem.

Having an estimate of the variance is generally agreed to be a good thing, according to a random sampling (N=2) of statisticians I interviewed when writing this blog post. If you have unlimited computational power and memory, you may not care about having everything be online and incremental, because you can just keep track of literally everything and compute whatever function of the return you want. That approach is rather wasteful, and generally only promoted by individuals who own shares in NVIDIA.

If it's possible to update your estimator(s) as the data comes in, you should probably do that, because it amortizes the computational cost and allows you to modify the learning agent's behavior in real time1. TD(λ) and related algorithms can accomplish this, and there's a lot of work on methods of learning to estimate the expected return online and incrementally. But what if you want to estimate the second moment of the return? Or the square root? This is a tougher nut to crack.

Background

(You can skip this section if you're familiar with reinforcement learning)

Markov decision processes (MDPs) model decision making in situations where the outcomes are a function of the actions taken by an agent, possibly with some randomness thrown in for zest. At time $t$, the agent gets some information about the state of the world ($S_{t}$), selects an action $A_{t}$ according to some policy it's following (so $A_{t} \sim \pi(S_{t})$), transitions to a new state $S_{t+1}$, and gets a reward for its efforts ($R_{t+1}$). It then has to repeat this process again, forever, or at least until it reaches a terminal state. The value of a state depends not just on the reward received on transitioning out of that state, but on all the future rewards that ensue. To make this mathematically tractable, we introduce a discount parameter $\gamma \in (0, 1]$, so that the reward $k$ transitions from now is diminished by $\gamma^{k}$, meaning that the value of each state is finite2. We call this discounted sum of rewards "the return", and say the value of a state is the expected return you would get if you had just transitioned into that state.

We denote the return by $G_{t}$, and the value of some state $s$ by $v(s)$.

$$ \begin{aligned} G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\ v(s) &= \Exp{G_{t} | S_{t} = s} \end{aligned} $$

RL agents learn the value function by predicting the value of the next state, and update their estimate of the value function by something like stochastic gradient descent. That way, we can learn the value function as information comes in, without having to pause in order to update the model's parameters3.

Why learning the value function is "easy"

If you've not studied these ideas in detail, I can perhaps sketch the justification using a technique I like to call "math".

For the TD(0) algorithm, given a transition of the form $(S_{t}, A_{t}, R_{t+1}, S_{t+1})$, we compute the temporal difference error:

\begin{equation} \delta_{t} = R_{t+1} + \gamma v(S_{t+1}) - v(S_{t}) \end{equation}

and we modify the value function, which is parameterized by $w$, with the update rule

\begin{equation} w \leftarrow w + \alpha \delta_{t} \nabla_{w} v(S_{t}) \end{equation}

Part of the reason why this works is that the return, $G_{t}$, is a discounted sum. So we can define it recursively, and use linearity of expectations to get:

$$ \begin{aligned} G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\ &= R_{t+1} + \gamma G_{t+1} \\ \Exp{G_{t}} &= \Exp{R_{t+1} + \gamma G_{t+1}} \\ &= \Exp{R_{t+1}} + \gamma \Exp{G_{t+1}} \end{aligned} $$

If we assume that $v(s) \approx \Exp{G_{t} | S_{t} = s}$, we can get an expression for the (squared) error in the value function:

$$ \begin{aligned} \text{err}(v) &= (R_{t+1} + \gamma G_{t+1} - v(S_{t}))^{2} \\ &\approx (R_{t+1} + \gamma v(S_{t+1}) - v(S_{t}))^{2} \end{aligned} $$

Further assume that $v(S_{t+1})$ is constant4; then we can differentiate w/r/t the parameters of the value function ($w$) to get:

$$ \begin{aligned} \nabla \text{err}(v) &\approx \nabla_{w} (R_{t+1} + \gamma v(S_{t+1}) - v(S_{t}))^{2} \\ &= - 2 (R_{t+1} + \gamma v(S_{t+1}) - v(S_{t})) \nabla_{w} v(S_{t}) \\ &= - 2 \delta_{t} \nabla_{w} v(S_{t}) \end{aligned}
$$

We make a few assumptions and get an algorithm that not only works in practice, but has been proven to converge with bounded error5, and can be used in the online setting.
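
If you want to see that in code, here's a minimal sketch of TD(0) with linear function approximation; the function name and the convention of passing in precomputed feature vectors (`phi_s`, `phi_sp`) are just my own choices for illustration:

```python
import numpy as np

def td0_update(w, phi_s, phi_sp, reward, gamma, alpha, terminal=False):
    """One TD(0) update for a linear value function v(s) = w @ phi(s)."""
    v_s = w @ phi_s
    v_sp = 0.0 if terminal else w @ phi_sp
    delta = reward + gamma * v_sp - v_s   # the TD error
    w = w + alpha * delta * phi_s         # grad_w v(S_t) = phi(S_t) for a linear v
    return w, delta
```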

Second and Higher Moments of the Return

So, we can learn the value function, and now perhaps we'd like the second moment of the return as well so that we can do fun things like computing the variance.

$$ \Var{G_{t}} = \Exp{G_{t}^{2}} - \Exp{G_{t}}^{2} $$

Having the variance is nice because we can make confidence intervals, modify the learning rate (reducing it when variance is high, and ramping it up when it's low), or even include it as a penalty to discourage risky behavior in our agent6. But naturally this is easier said than done.

Why is this so hard?

Mainly because we can use the fact that the expression for the return is recursive and linear to get cool operator equations, but the same cannot be said for e.g., $(G_{t}^{2})$.

For example, if $\vec{v}$ is the value function in vector form (such that $v(s_{i}) = [\vec{v}]_{i}$), $P$ is the transition matrix (so $\Pr (s_{j} | s_{i}) = P_{ij}$), and $\vec{r}$ is the expected immediate reward with $[\vec{r}]_{i} = \Exp{R_{t+1}|S_{t} = s_{i}}$, then

$$ \begin{aligned} \vec{v} &= \vec{r} + \gamma P \vec{v} \\ \Rightarrow \vec{v} &= (I - \gamma P)^{-1} \vec{r} \end{aligned} $$

But if we consider $G_{t}^{2}$, we get

$$ \begin{aligned} G_{t}^{2} &= (R_{t+1} + \gamma G_{t+1})^{2} \\ &= R_{t+1}^{2} + 2 \gamma R_{t+1} G_{t+1} + \gamma^{2} G_{t+1}^{2} \end{aligned} $$

which almost seems like it will work, except we can't just predict the squared return of the next state because of the $2 \gamma R_{t+1} G_{t+1}$ cross-term. If we wanted to be exactly correct, we'd need to track the future rewards from $G_{t+1}$, discount them, and multiply by $R_{t+1}$, and do this for every transition. Thus, it seems like a victory for the forces of batch updating and computational inefficiency.

Saved By Mathematical Chicanery!

However, we can try the same trick that we used before, where we use one estimator to learn the value function, plug that in for $G_{t+1}$ as needed, and then further assume that this estimator is constant. So we end up learning from what I'm calling generalized- or meta-rewards (denoted with a bar, e.g., $\bar{R}_{t+1}$), and it turns out that if you accept the validity of that substitution, we get a Bellman equation:

$$ \begin{aligned} \bar{R}_{t+1} &= R_{t+1}^{2} + 2 \gamma R_{t+1} v(S_{t+1}) \\ \bar{\gamma} &= \gamma^{2} \\ \bar{v}(S_{t}) &= \Exp{\bar{R}_{t+1} + \bar{\gamma} \bar{v}(S_{t+1})} \\ &\approx \Exp{G_{t}^{2}} \end{aligned} $$

And we still have the ability to formulate cool operator equations:

$$ \begin{aligned} \bar{\vec{v}} &= (I - \bar{\gamma} P)^{-1} \bar{\vec{r}} = (I - \gamma^{2} P)^{-1} \bar{\vec{r}}, \qquad [\bar{\vec{r}}]_{i} = \Exp{R_{t+1}^{2} + 2 \gamma R_{t+1} v(S_{t+1}) | S_{t} = s_{i}} \end{aligned} $$

The justification for this is covered in other works (see Sobel). I will note that everything's just groovy from a convergence perspective, because if $\gamma \in (0, 1]$, then so is $\gamma^{2}$, and if the rewards are bounded, then $R_{t}^{2}$ is also bounded, and therefore so is the return. So if we limit the value function to be $|v(s)| \leq \frac{R_{max}}{1 - \gamma}$, then $\bar{R}_{t}$ is bounded, and so if TD would converge on the original problem, then it converges here too.
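
For concreteness, here's a rough tabular sketch of learning the value and the second moment side by side using this meta-reward; the names (`v`, `m`, and so on) are placeholders I made up, not anything standard:

```python
import numpy as np

def value_and_second_moment_update(v, m, s, sp, reward, gamma, alpha, terminal=False):
    """Tabular TD(0) updates for v(s) ~ E[G_t | s] and m(s) ~ E[G_t^2 | s]."""
    v_sp = 0.0 if terminal else v[sp]
    m_sp = 0.0 if terminal else m[sp]

    # ordinary value update
    delta = reward + gamma * v_sp - v[s]
    v[s] += alpha * delta

    # meta-reward and meta-discount for the second moment
    r_bar = reward ** 2 + 2.0 * gamma * reward * v_sp
    gamma_bar = gamma ** 2
    delta_bar = r_bar + gamma_bar * m_sp - m[s]
    m[s] += alpha * delta_bar

    return m[s] - v[s] ** 2   # variance estimate (can be noisy, even negative, early on)
```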

It works well in the tabular case, but under function approximation things begin to break down in odd ways.

Generalizing

As suggested by the section heading, we can generalize, and come up with a reward for the higher moments of the return rather easily:

$$ \begin{aligned} G_{t}^{n} &= \sum_{k=0}^{n} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} G_{t+1}^{k} \\ &\approx \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v(S_{t+1})^{k} + \gamma^{n} G_{t+1}^{n} \end{aligned} $$

Which allows us (equipped only with a value function!) to define rewards and discounts that yield Bellman equations we can use:

$$ \begin{aligned} \bar{R}_{t+1}^{(n)} &= \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v(S_{t+1})^{k} \\ \bar{\gamma} &= \gamma^{n} \\ \bar{v}^{(n)}(S_{t}) &= \Exp{\bar{R}_{t+1}^{(n)} + \bar{\gamma} \bar{v}^{(n)}(S_{t+1})} \end{aligned} $$

When $n = 1$, we have $\bar{R}_{t+1} = R_{t+1}$, and $\bar{\gamma} = \gamma$, which is a nice sanity check.
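
In code, the generalized meta-reward is a short helper (again, the function below is purely illustrative):

```python
from math import comb

def nth_moment_meta_reward(reward, v_next, gamma, n):
    """Meta-reward and meta-discount for estimating E[G^n], plugging v(S_{t+1}) in for G_{t+1}."""
    r_bar = sum(comb(n, k) * reward ** (n - k) * (gamma * v_next) ** k for k in range(n))
    gamma_bar = gamma ** n
    return r_bar, gamma_bar
```

For `n = 1` it returns `(reward, gamma)`, matching the sanity check above.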

Fractional Returns

So while the original problem (estimating the moments of the return) seems completely sane, and its solution a praiseworthy goal, we can drop the mask of being responsible mathematicians and start wondering about weird quantities, like $G_{t}^{p/q}$. We could even consider complex powers, if that sort of thing seems worthwhile.

Let $\beta = p/q$, or more generally $\beta \in \mathbb{C}$, and let's use a binomial series in place of the binomial expansion that came before:

$$ \begin{aligned} G_{t}^{\beta} &= (R_{t+1} + \gamma G_{t+1})^{\beta} \\ &= R_{t+1}^{\beta} (1 + \frac{\gamma G_{t+1}}{R_{t+1}})^{\beta} \\ &= R_{t+1}^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{\gamma G_{t+1}}{R_{t+1}} \Bigg)^{k} \end{aligned} $$

Previously, for the n-th moment, we truncated the series to get a $\gamma^{n} G_{t+1}^{n}$ term, giving us a Bellman equation. Here we have an infinite series, and so the way around this is not exactly clear.

Instead, we can make an unjustified assumption: substitute $v(S_{t+1})$ for $G_{t+1}$ inside the series, then add back a $\gamma^{\beta} G_{t+1}^{\beta}$ term and subtract $(\gamma v(S_{t+1}))^{\beta}$ to restore the recursion:

$$ G_{t}^{\beta} \approx \gamma^{\beta} G_{t+1}^{\beta} - (\gamma v(S_{t+1}))^{\beta} + R_{t+1}^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{\gamma v(S_{t+1})}{R_{t+1}} \Bigg)^{k} $$

$$ \bar{R}_{t+1}^{\beta} = - (\gamma v(S_{t+1}))^{\beta} + R_{t+1}^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{\gamma v(S_{t+1})}{R_{t+1}} \Bigg)^{k} $$

We can compute the generalized binomial coefficients via the gamma function, the analytic extension of the factorial, which is easy to compute.
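
As a sketch, and only under the assumption that we truncate the series at some finite number of terms and that $R_{t+1} \neq 0$, the meta-reward might be computed like this; `binom_general` is a hypothetical helper for the generalized binomial coefficient, written as a falling-factorial product (equivalent to the gamma-function form, but without worrying about its poles):

```python
def binom_general(beta, k):
    """Generalized binomial coefficient binom(beta, k), i.e.
    Gamma(beta + 1) / (Gamma(k + 1) * Gamma(beta - k + 1)), via a falling-factorial product."""
    coeff = 1.0
    for j in range(1, k + 1):
        coeff *= (beta - j + 1) / j
    return coeff

def fractional_meta_reward(reward, v_next, gamma, beta, n_terms=50):
    """Truncated-series meta-reward for G^beta (first expansion).
    Only sensible when reward != 0 and |gamma * v_next / reward| < 1; positive rewards keep things real."""
    x = gamma * v_next / reward
    series = sum(binom_general(beta, k) * x ** k for k in range(n_terms))
    return -(gamma * v_next) ** beta + reward ** beta * series
```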

However, we're still at a bit of an impasse, because this series may not be convergent.

  • If $|\gamma v(S_{t+1})/R_{t+1}| < 1$, then it converges absolutely for all $\beta \in \mathbb{C}$.
  • If $\gamma v(S_{t+1})/R_{t+1} = -1$, then we need $\Re (\beta) > 0$
  • If $\gamma v(S_{t+1})/R_{t+1} \neq -1$ and $|\gamma v(S_{t+1})/R_{t+1}| = 1$, then we just need $\Re (\beta) > -1$.
  • If $|\gamma v(S_{t+1})/R_{t+1}| > 1$, the series diverges unless $\beta$ is a non-negative integer (in which case it terminates after finitely many terms).

Clearly, we have to be concerned with the maximum value that $v(s)$ can take, so we might consider rescaling the rewards, but this doesn't work cleanly because $v(s)$ is itself a function of the rewards. Still, an affine transformation of the rewards might work, along with an appropriately chosen $\gamma$. Affine transformations (rescaling $R \rightarrow aR + b$ for some scalars $a$ and $b$) don't really affect the interpretation of the prediction when we're talking about the expected return, $\Exp{G_{t}}$, but they might change things for $G_{t}^{\beta}$.

Otherwise, we might be limited to cases where the episode returns a value of $\pm 1$ prior to termination. This is not the worst thing in the world, because it still means we can make predictions about whether events will or will not happen. This is still fairly useful, and if the math checks out (which it may not), we have a way of extending TD-methods to these artificially created non-exponentially decaying cases.

Alternative Expansion

We could also look at an alternative expansion:

$$ \begin{aligned} G_{t}^{\beta} &= (R_{t+1} + \gamma G_{t+1})^{\beta} \\ &= (\gamma G_{t+1})^{\beta} \bigg(1 + \frac{R_{t+1}}{\gamma G_{t+1}} \bigg)^{\beta} \\ &= (\gamma G_{t+1})^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{R_{t+1}}{\gamma G_{t+1}} \Bigg)^{k} \\ &= (\gamma G_{t+1})^{\beta} + (\gamma G_{t+1})^{\beta}\sum_{k=1}^{\infty} \binom{\beta}{k} \Bigg( \frac{R_{t+1}}{\gamma G_{t+1}} \Bigg)^{k} \\ &\approx (\gamma G_{t+1})^{\beta} + (\gamma v(S_{t+1}))^{\beta} \sum_{k=1}^{\infty} \binom{\beta}{k} \bigg( \frac{R_{t+1}}{\gamma v(S_{t+1})} \bigg)^{k} \end{aligned} $$

So we can get the sorts of recursive equations that we like to see:

$$ \begin{aligned} \Exp{G_{t}^{\beta}} &\approx \bar{v}^{\beta}(S_{t}) \\ \bar{v}^{\beta}(s) &= \Exp{\bar{R}^{\beta} + \gamma^{\beta} \bar{v}^{\beta}(S_{t+1}) | S_{t} = s} \\ \bar{R}^{\beta} &= (\gamma v(S_{t+1}))^{\beta} \sum_{k=1}^{\infty} \binom{\beta}{k} \Bigg( \frac{R_{t+1}}{\gamma v(S_{t+1})} \Bigg)^{k} \end{aligned} $$

I think I like this expansion better, since we avoid the decision to arbitrarily pull out a $(\gamma G_{t+1})^{\beta}$ in exchange for a $(\gamma v(S_{t+1}))^{\beta}$. Convergence of the series is also better, so long as the rewards are positive and the discount factor is set appropriately.

This time, we need $|R_{t+1}/(\gamma v(S_{t+1}))| \leq 1$ for convergence to be possible if $\beta$ is not a positive integer. Things get awkward if $\gamma$ or $v(S_{t+1})$ is zero for some values of $\beta$, but perhaps not fatally so. I'll have to look at it a bit more before I can speak from a "what does it mean?" standpoint.
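
A corresponding sketch for this alternative expansion, with the same caveats about truncation and assuming $\gamma v(S_{t+1}) \neq 0$, might look like:

```python
def fractional_meta_reward_alt(reward, v_next, gamma, beta, n_terms=50):
    """Truncated meta-reward for the alternative expansion of G^beta.
    Only sensible when gamma * v_next != 0 and |reward / (gamma * v_next)| <= 1."""
    x = reward / (gamma * v_next)
    total, coeff, x_pow = 0.0, 1.0, 1.0
    for k in range(1, n_terms):
        coeff *= (beta - k + 1) / k     # builds binom(beta, k) iteratively
        x_pow *= x                      # x^k
        total += coeff * x_pow
    return (gamma * v_next) ** beta * total
```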

Why?

Well... these fractional returns might still be useful, in the sense that one of the complaints about standard MDPs is that the discounting is too aggressive, and we might wish to learn to predict things with longer tails. If we decide that we care about fractional returns, it also raises questions about whether this is the best approach for computing them, and furthermore, why bother with a discount parameter at all if (given the exponential nature of discounting) we care about processes with much longer memory? It's also just mathematically interesting, so there's that as well.

So perhaps the theory has just not been fully developed yet, which might make for a good research problem.

References

  • Sobel's paper, "The Variance of Discounted Markov Decision Processes", goes into this in greater detail.
    • https://www.jstor.org/stable/3213832
  • Also see White & White and Tamar et al., and my own (soon-to-be-finished!) thesis

  1. This is the difference between your robot car figuring out that it should be steering into the skid vs. the company that sold it to you having an awkward press conference and possible recall later in the week. 

  2. It also immanentizes the common expression "a bird in the hand is worth $\frac{1 - \gamma}{\gamma}$ in the bush". 

  3. Again, you could just track the return from every state and then learn the value function via something like batch processing, but to me that sounds like Washington bureaucracy and seems unlikely to Make Markov Decision Processes Great Again (MMDPGA). 

  4. This "derivation" is rather hand-wavy, so if you'd like something a little more mathematically stringent, see Hamid Maei's thesis or the work on true-online TD(λ) by van Seijen et al. 

  5. Convergence is guaranteed in the linear function approximation case. Follow policy when attempting to estimate return, as off-policy setting may lead to divergence. Do not attempt to use in in tasks with unbounded variance, or in non-stationary adversarial domains. For treatment with nonlinear function approximators, side effects may include: local minima, proliferation of neural net frameworks, superintelligence and superintelligence doomsaying from Elon Musk. Consult with your doctor (PhD, not MD) to see if TD(λ) is right for you. 

  6. Because even if our robot car "expects" to be able to execute a perfect bootleg turn on average, its occupants might prefer a less erratic mode of conveyance. 

Limiting Entropy as a Return

$$ \def\Pr#1{\mathbb{P} \left( #1 \right)} $$

(In progress, just some quick notes until I'm done my thesis)

Expressing Entropy as a Return

  • It's a fun exercise
  • Learning the limiting entropy of a state online may be useful
  • There is probably some interesting analysis that could be done
  • So, let's look at discrete Markov Processes and show that there's a recursive Bellman-like equation for entropy.
  • Note that this is different from the entropy rate

Quick rundown of relevant MDP basics

For a Markov chain, we have that the probability of seeing a particular sequence of states, say $(s_0, s_1, s_2, \ldots, s_n)$, is:

\begin{equation} \Pr{s_0, s_1, s_2 \ldots s_n} = \Pr{s_{0}} \Pr{s_{1} | s_{0}} \cdots \Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots s_{1}, s_{0}} \end{equation}

Under the Markov assumption: $\Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots} = \Pr{s_{n} | s_{n-1}}$.

\begin{equation} \begin{aligned} \Pr{s_0, s_1, s_2 \ldots s_n} &= \Pr{s_0} \Pr{s_1 | s_0} \Pr{s_2 | s_1} \cdots \Pr{s_{n} | s_{n-1}} \\ &=\Pr{s_0} \prod_{t=1}^{n} \Pr{s_{t} | s_{t-1}} \end{aligned} \end{equation}

The expected value of a state for some reward function $r(s,s')$ is

\begin{equation} \begin{aligned} v(s) &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma \Big( \sum_{s_{2}} \Pr{s_{2} | s_{1}} \Big(r(s_1, s_2) + \gamma \sum_{s_{3}} \Big( \ldots \Big) \Big) \Big) \Big) \\ &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma v(s_{1}) \Big) \end{aligned} \end{equation}


Entropy

The classic definition of entropy for a discrete random variable $X$ which takes values from some set $\mathcal{X}$ is:

\begin{equation} H(X) = -\sum_{x \in \mathcal{X}} \Pr{X = x} \log\big( \Pr{X = x} \big) \end{equation}

We may also wish to express the entropy of two or more random variables taken together (the joint entropy). The joint entropy of two r.v.s, $X$ and $Y$, is given by:

\begin{equation} \begin{aligned} H(X,Y) &= -\sum_{x, y} \Pr{x,y} \log( \Pr{x, y} ) \\ &= H(Y|X) + H(X) = H(X|Y) + H(Y) \end{aligned} \end{equation}

Where $H(Y|X)$ is the conditional entropy for $Y$ given $X$. The formula for conditional entropy can be expressed as:

\begin{equation} \begin{aligned} H(Y|X) &= \sum_{x \in \mathcal{X}} \Pr{x} H(Y|X=x) \\ &= -\sum_{x \in \mathcal{X}} \Pr{x} \sum_{y \in \mathcal{Y}} \Pr{y|x} \log (\Pr{y|x}) \\ &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr{x,y} \log (\Pr{y|x}) \end{aligned} \end{equation}

Note that if $Y$ is independent of $X$, $H(Y|X) = H(Y)$.

We can extend the formula for the conditional entropy to $H(Y| X_{1}, X_{2}, \ldots, X_{n})$ fairly easily:

\begin{equation} H(Y|X_{1}, X_{2}, \ldots, X_{n}) = -\sum_{x_{1}, \ldots x_{n}} \sum_{y \in \mathcal{Y}} \Pr{x_{1}, \ldots, x_{n}, y} \log (\Pr{y|x_{1}, \ldots, x_{n}}) \end{equation}


Furthermore, for a sequence of r.v.s, $(X_{1}, \ldots, X_{n})$, the joint entropy can be expressed as:

\begin{equation} \begin{aligned} H(X_{1}, \ldots, X_{n}) &= - \sum_{x_{1}} \sum_{x_{2}} \cdots \sum_{x_{n}} \Pr{x_1, x_2, \ldots x_n} \log \big( \Pr{x_1, x_2, \ldots x_n} \big) \\ &= \sum_{k=1}^{n} H(X_{k} | X_{1}, \ldots, X_{k-1}) \end{aligned} \end{equation}

Entropy in Markov Chains

When it comes to Markov chains, we can use the Markov property to get a somewhat suggestive expression of the entropy.

First note how it simplifies the expression for conditional entropy:

\begin{equation} \begin{aligned} H(X_{k} | X_{0}, \ldots, X_{k-1}) &= \sum_{x_{0}} \cdots \sum_{x_{k-1}} \Pr{x_{0}, \ldots, x_{k-1}} H(X_{k} | X_{0} = x_{0}, \ldots X_{k-1} = x_{k-1}) \\ &= \sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}, x_{0}} \sum_{x_{3}} \cdots \\ &\qquad \quad \sum_{x_{k-1}} \Pr{x_{k-1} | x_{0}, \ldots, x_{k-2}} H(X_{k} | X_{0} = x_{0}, \ldots X_{k-1} = x_{k-1}) \\ &= -\sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}} \cdots \sum_{x_{k-1}} \Pr{x_{k-1} | x_{k-2}} \\ &\qquad \quad \sum_{x_{k}} \Pr{x_{k} | x_{k-1}} \log \Big( \Pr{x_{k} | x_{k-1}} \Big) \end{aligned} \end{equation}

Note that it has a nice recursion-friendly form. For example, if we substitute this expression in order to calculate the joint entropy, expanding up to $X_{3}$, we get:

\begin{equation} \begin{aligned} H(X_{0}, X_{1}, \ldots, X_{3}) = -&\sum_{s_0} \Pr{s_0} \Big[ \log( \Pr{s_0}) \\ &+ \sum_{s_1} \Pr{s_1 | s_0} \Big[ \log(\Pr{s_1 | s_0}) \\ & \quad + \sum_{s_2} \Pr{s_2 | s_1} \Big[ \log(\Pr{s_2 | s_1}) \\ & \quad\quad + \sum_{s_3} \Pr{s_3 | s_2} \log(\Pr{s_3 | s_2}) \Big]\Big]\Big] \end{aligned} \end{equation}

Bellman Equation

It's starting to look very similar to a return.

In fact, we can get a Bellman equation by defining the reward as the negative log-probability of the transition.

\begin{equation} \begin{aligned} R_{t+1} &= -\log \big( \Pr{S_{t+1} | S_{t}} \big) \\ G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\ &= R_{t+1} + \gamma G_{t+1} \\ v_{H}(s) &= \mathbb{E}[G_{t} | S_{t} = s] \end{aligned} \end{equation}
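
As a quick tabular sketch, here's what TD(0) on this entropy-return could look like when the transition matrix is known and we just sample from it; the function name, step counts, and the little two-state chain are all made-up choices of mine (the per-step reward is the negative log-probability, as defined above):

```python
import numpy as np

def estimate_entropy_return(P, gamma=0.95, alpha=0.05, n_steps=200_000, seed=0):
    """TD(0) estimate of v_H(s) = E[ sum_k gamma^k * (-log P(S_{t+k+1} | S_{t+k})) | S_0 = s ]."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    v_h = np.zeros(n)
    s = rng.integers(n)
    for _ in range(n_steps):
        sp = rng.choice(n, p=P[s])
        reward = -np.log(P[s, sp])                 # "entropy reward" for this transition
        v_h[s] += alpha * (reward + gamma * v_h[sp] - v_h[s])
        s = sp
    return v_h

# e.g., a two-state chain: state 0 is noisy, state 1 is nearly deterministic
P = np.array([[0.5, 0.5],
              [0.1, 0.9]])
print(estimate_entropy_return(P))   # state 0 gets the larger entropy-return
```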


  • Note the discount factor; if it's less than 1 then we're not "really" calculating the limiting entropy.
  • However, if the MDP has no terminal states, then the entropy is infinite, which is not an especially helpful observation when trying to employ these concepts in practice.
    • The return can easily be infinite in the undiscounted case for a non-terminating MDP. For example, the expected return is infinite if the rewards are bounded below by a positive constant.
    • So the discount factor can be chosen arbitrarily, or you can perhaps avoid infinities by renormalizing (subtracting the joint entropy averaged over all states)
  • If you wanted a "justified" discount factor, you can look at $1 - \gamma$ as a termination probability
    • Or perhaps you're interested in the entropy over some time horizon, say $\tau$, and set $\gamma = (\tau - 1)/\tau$.
    • Using a state-dependent $\gamma$ to express whether some outcomes somehow do not count is also plausible, although not super compelling.
  • There's another way of doing the analysis where you're working with infinite products instead of infinite sums
    • The trouble arises mainly because we end up with $$ G_{t} = -\log(\Pr{S_{t+1} | S_{t}}) - \log(\Pr{S_{t+2} | S_{t+1}}^{\gamma}) - \log(\Pr{S_{t+3} | S_{t+2}}^{\gamma^2}) - \cdots $$ which is a little bit hard to interpret.

  • Where might you use it in practice?
    • An estimate of a particular state's entropy gives you an idea of the number of possible branching outcomes.
      • These outcomes might have the same end result
      • If you wanted to measure how different possible results might be, you'd use something like the variance (or another distance-like measure)
      • In combination with other things, this could be really informative.
    • Given two policies with the same expected return, an agent might prefer the one with lower entropy, because its outcomes are more certain
    • Conversely, during learning it might be better to explore high entropy states, or select them more frequently, because potentially many outcomes can spring from that single state.
    • It may also reveal some information about where to allocate features, but I haven't analyzed this particularly deeply
    • We can use entropy as guidance for how hard an MDP is, or how hard a state is to approximate
      • States with high entropy may benefit from a lower learning rate, conversely states with low entropy may admit a higher learning rate
  • Of course, issues spring up because in order to get a good estimate for a state's entropy you need to visit it many times.

Fun with the TD-Error Part II

This bit of light entertainment emerged out of a paper I worked on about estimating variance via the TD-error.

We have previously looked at how the discounted sum of TD-errors is just another way of expressing the bias of our value function. We'll quickly recapitulate that proof, but extend it to the λ-return and general value functions, because I'm a sadist1.

$$ \newcommand{\eqdef}{\stackrel{\text{def}}{=}} \newcommand{\eqmaybe}{\stackrel{\text{?}}{=}} $$

The δ-return

We first note the definitions of the TD-error (for time-step $t$, we denote it as $\delta_{t}$) and the λ-return, $G_{t}^{\lambda}$.

$$ \begin{aligned} \delta_{t} &\eqdef R_{t+1} + \gamma_{t+1} v(S_{t+1}) - v(S_{t}) \\ G_{t}^{\lambda} &\eqdef R_{t+1} + \gamma_{t+1} \big( (1 - \lambda_{t+1}) v(S_{t+1}) + \lambda_{t+1} G_{t+1}^{\lambda} \big) \end{aligned} $$

Now we bust out one of our favorite techniques of "expanding a series and grouping terms" (I didn't say that it was a particularly hard technique).

$$ \begin{aligned} G_{t}^{\lambda} - v(S_{t}) &= R_{t+1} + \gamma_{t+1} \big( (1 - \lambda_{t+1}) v(S_{t+1}) + \lambda_{t+1} G_{t+1}^{\lambda} \big) - v(S_{t}) \\ &= R_{t+1} + \gamma_{t+1} v(S_{t+1}) - v(S_{t}) + \gamma_{t+1} \lambda_{t+1} (G_{t+1}^{\lambda} - v(S_{t+1})) \\ &= \delta_{t} + \gamma_{t+1} \lambda_{t+1} \delta_{t+1} + \gamma_{t+1} \lambda_{t+1} \gamma_{t+2} \lambda_{t+2} \delta_{t+2} + \cdots \\ &= \sum_{n=0}^{\infty} \delta_{t+n} \Big( \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k} \Big) \end{aligned} $$

Neat! I call this the δ-return, but it's a special case of defining meta-returns from an MDP + a learning algorithm.
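
Here's a quick numerical sanity check of that telescoping identity on a single made-up episode with constant γ and λ (the episode length, rewards, and "value estimates" below are all arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, lam = 0.9, 0.8
T = 10                                   # episode length
rewards = rng.normal(size=T)             # R_1, ..., R_T
v = rng.normal(size=T + 1)               # arbitrary value estimates for S_0, ..., S_T
v[T] = 0.0                               # terminal state has value zero

# lambda-returns, computed backwards from the end of the episode
G_lam = np.zeros(T + 1)
for t in reversed(range(T)):
    G_lam[t] = rewards[t] + gamma * ((1 - lam) * v[t + 1] + lam * G_lam[t + 1])

# TD errors and the (gamma * lambda)-discounted sum of deltas from t = 0
deltas = rewards + gamma * v[1:] - v[:-1]
delta_return = sum((gamma * lam) ** n * deltas[n] for n in range(T))

print(G_lam[0] - v[0], delta_return)     # the two numbers agree
```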

This implies that you could figure out the bias of your reinforcement learning algorithm (that estimates the value function) by using a second estimator that tracks the δ-return. The definition of the bias (when estimating the λ-return) is $\mathbb{E}_{\pi} [{G_{t}^{\lambda} - v(S_{t})}]$2, so if we have that as our "return" (which is the case with $\tilde{R}_{t+1} = \delta_{t}$), then in principle it is no different than estimating the value function ($v(s) \approx \mathbb{E}_{\pi} [ G_{t}^{\lambda} | S_t = s] $).

However, in practice this may have some unexpected difficulties, mainly because once the original algorithm has converged, $\Exp{\delta_{t} \vec{x}_{t}}$ is either exactly or close to zero 3.

Bellman Equation for the Second Moment of a Return

In order for things to get truly fun, we need to take a quick look at Bellman equations for the second moment of a return. Building on some previous work by the likes of Sobel, Tamar et al., and White & White, we know that we can actually define a Bellman equation for the second moment of the δ-return.

The second moment of a random variable $X$ is just $\Exp{X^2}$ (the mean is $\Exp{X}$ and is the first moment). Most often it comes up in the context of variance, which is the second central moment of a random variable:

$$ \Var{X} \eqdef \Exp{X^2} - \Exp{X}^2 = \Exp{(X - \Exp{X})^2} $$

But ignore the man behind the curtain for a moment.

With some arbitrary return (let's call it $\tilde{G}$) we have

$$ \begin{aligned} (\tilde{G}_{t})^2 &= (\tilde{R}_{t+1} + \tilde{\gamma}_{t+1} \tilde{G}_{t+1})^2 \\ &= \tilde{R}_{t+1}^2 + 2 \tilde{\gamma}_{t+1} \tilde{R}_{t+1} \tilde{G}_{t+1} + \tilde{\gamma}_{t+1}^{2} (\tilde{G}_{t+1})^2 \\ &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \bar{G}_{t+1} \end{aligned} $$

With the substitutions

$$ \begin{aligned} \bar{R}_{t+1} &\eqdef \tilde{R}_{t+1}^2 + 2 \tilde{\gamma}_{t+1} \tilde{R}_{t+1} \tilde{G}_{t+1} \\ \bar{\gamma}_{t} &\eqdef \tilde{\gamma}_{t}^{2} \\ \bar{G}_{t} &\eqdef (\tilde{G}_{t})^2 \end{aligned} $$

Which looks pretty much like the standard recursive equation for the return, $G_{t} \eqdef R_{t+1} + \gamma_{t+1} G_{t+1}$. It's even practically useful4.

The Second Moment of the δ-return

Now, the promised fun. Consider the second moment of the δ-return:

$$ \begin{aligned} \Exp{(G_{t}^{\lambda} (\delta))^2} &= \Exp{ (G_{t}^{\lambda} - v(S_{t}))^2 } \\ &= \Exp{ \Big( \sum_{n=0}^\infty \delta_{t+n} \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k} \Big)^2 } \end{aligned} $$

Wait, I think that $\Exp{ (G_{t}^{\lambda} - v(S_{t}))^2 }$ shows up somewhere in the literature. Oh, it's the mean-squared error!

Well that's pretty neat. When I first realized this I thought that we might be able to use this to achieve a truly error-free algorithm, because if you can get a handle on the errors you're making it's possible to reduce them.
It turns out that this is hard in practice, because we can't reliably estimate the δ-return. To see why this is a problem, we can plug in for the second moment of the δ-return.

Bellman equation for the δ-return's second moment

Setting $\tilde{G}_{t} = G_{t}^{\lambda} - v(S_{t})$ (and therefore $\tilde{R}_{t+1} = \delta_{t}$ and $\tilde{\gamma}_{t+1} = \gamma_{t+1} \lambda_{t+1}$) gives us the following:

$$ \begin{aligned} (\tilde{G}_{t})^2 &= \big( \delta_{t} + \gamma_{t+1} \lambda_{t+1} ( G_{t+1}^{\lambda} - v(S_{t+1}) ) \big)^2 \\ &= \delta_{t}^{2} + 2 \gamma_{t+1} \lambda_{t+1} \delta_{t} \big( G_{t+1}^{\lambda} - v(S_{t+1}) \big) + \gamma_{t+1}^{2} \lambda_{t+1}^{2} \big( G_{t+1}^{\lambda} - v(S_{t+1}) \big)^2 \\ &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \bar{G}_{t+1} \end{aligned} $$

Where we make the following substitutions:

$$ \begin{aligned} \bar{R}_{t+1} &\eqdef \delta_{t}^{2} + 2 \gamma_{t+1} \lambda_{t+1} \delta_{t} \big( G_{t+1}^{\lambda} - v(S_{t+1}) \big) \\ \bar{\gamma}_{t} &\eqdef \gamma_{t}^{2} \lambda_{t}^{2} \\ \bar{G}_{t} &\eqdef \big( G_{t}^{\lambda} - v(S_{t}) \big)^2 \end{aligned} $$

To make this work, we need to have a method of estimating $\Exp{G_{t}^{\lambda} - v(S_{t})}$, but if our value function has converged, then that quantity is zero.

So we end up with just $\bar{R}_{t+1} = \delta_{t}^2$, throwing away all of the "cross-terms" like $2 \gamma_{t+1} \lambda_{t+1} \delta_{t} (G_{t+1}^{\lambda} - v(S_{t+1}))$. These terms might make a substantial contribution to the second moment.

Delta-Squared Return for Variance

So under linear function approximation we are effectively approximating not the second moment of the δ-return, but the first moment of the $\delta^2$-return (the discounted sum of the squared TD-errors).

Depending on how much the cross-terms affect the second moment, this might still be useful.

In the case that $\Exp{\delta_{t}} = 0$, the "cross-term" goes away in expectation, and the second moment of the δ-return and the expected value of the $\delta^2$-return are the same thing. Note that the individual $\delta_{t}$ may still be nonzero; they are only zero in expectation.

So if $\operatorname{MSE} = \operatorname{Bias}^2 + \operatorname{Variance}$, and our estimates are unbiased, then what remains is the variance. This is still pretty neat-- in the tabular case, or when our value function is exact (or close enough to exact), estimating the $\delta^2$-return is the same as estimating the variance.
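
Here's what tracking the $\delta^2$-return might look like in the tabular case, with `u` as a hypothetical table for the $\delta^2$-return estimates; when the value function is (close to) unbiased, `u[s]` is then a rough estimate of the variance of the λ-return from `s`:

```python
def delta_squared_update(v, u, s, sp, reward, gamma, lam, alpha, terminal=False):
    """Tabular updates: v tracks the value; u tracks the delta-squared return,
    i.e. the (gamma * lambda)^2-discounted sum of squared TD errors from state s."""
    v_sp = 0.0 if terminal else v[sp]
    u_sp = 0.0 if terminal else u[sp]

    delta = reward + gamma * v_sp - v[s]     # ordinary TD error
    v[s] += alpha * delta

    gamma_bar = (gamma * lam) ** 2           # meta-discount for the delta-squared return
    u[s] += alpha * (delta ** 2 + gamma_bar * u_sp - u[s])
    return delta
```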

Final Remarks

It may be possible to estimate the second moment of the $\delta$-return by using a different function approximation scheme for each learner, although the utility of this is questionable-- ultimately, we tend to assume that getting the value function correct is the highest priority, and so if you have extra representational power you should probably allocate it there. Using a neural net or other adaptive representation might also be a way forward, and the behavior of such models in estimating the δ-return might be interesting in its own right.

But it's possible that knowing the variance is its own reward (bah-dum-tssh), because you could use it to help drive exploration, set learning rates, or use it to modify the policy to prefer more reliable actions.

In that case, depending on how biased your value function is, the second moment of the $\delta$-return can be well-approximated by tracking only $\delta^2$, or you could go with the second moment methods mentioned previously5.

Either way, I expect that investigating meta-returns will yield similarly fun posts (or perhaps even papers!) in the future.


  1. You can just pretend that γ and λ are constant if you don't want to consider GVFs, I'm mainly just showing that this works even under that setting. 

  2. Where the subscript $\pi$ means we're taking the expectation w/r/t the stationary distribution induced by following the policy $\pi$ that we wish to evaluate, conditioned on following that policy in subsequent states. 

  3. For TD(λ) it's more that $\Exp{\delta_{t} \vec{e}_{t}}$ is zero at convergence, but in general you're going to see (under linear function approximation with the same features and same γ, λ) that the weights for this second approximator should be zero in expectation. 

  4. White & White use this equation to estimate the second moment (and therefore the variance) online with a nice algorithm they call Variance TD, and use it to attempt to optimize λ w/r/t the bias-variance trade-off (although you can avoid that and just estimate the variance); Tamar et. al. provide a couple of algorithms (TD(0) and LSTD variance) for estimating the variance. 

  5. Those methods also rely on the value function in order to estimate variance and to compute their updates, so if your value function sucks you're not going to have a good idea of what your variance is. I have a short "proof" that the $δ^2$-return comes up with the same estimate of the variance as the second moment algorithms, but it's pending a check because I think one of the steps is a bit suspect. 

Fun With The TD-Error

One of the fundamental ideas in reinforcement learning is the temporal difference (TD) error, which is the difference between the reward received plus the discounted value of the next state, and the value of the current state. That may sound abstract, but effectively it's what you actually got (plus what you're expecting next) minus what you expected. Okay, that still sounds incomprehensible to anyone who's not already familiar with RL, so instead here's an equation.

$$ \delta_{t} = R_{t} + \gamma v(S_{t+1}) - v(S_{t}) $$

  • $R_{t}$ denotes the reward we got at time $t$.
  • $v(s)$ denotes how much future reward we're expecting in state $s$.
  • $\gamma$ is the so-called discount factor, which is the way mathematicians say 'a bird in the hand is worth two in the bush'.

Anyways, the usual goal of RL is to make our predictions about the future as accurate as possible. It's surprisingly doable, because even when we don't know what's coming in the future, we can still take steps to make our predictions about what we do know as accurate as possible. By adjusting our models so that the TD-error gets smaller over time, we learn to predict how much reward we'll receive in the future without ever having to keep track of more than the current state, the next state, and the reward we received from transitioning from one to the other.

In fact, subject to some technical conditions, we can prove that the process of minimizing the TD-error will actually lead us to learn the overall value function accurately1. I will not go into that in detail, but remark that one of the big assumptions we tend to work with is that the environment is a Markov Decision Process.

Solving The Bellman Equation Analytically

All this background is getting boring, so I'm just going to skip ahead and trust that the only people who find this site tend to have a background in reinforcement learning.

Assume you've got a transition matrix $\vec{P}$ and a vector $\vec{r}$ that describes the expected reward in each state. The value function can then be represented as a vector, $\vec{v}$:

$$ \begin{align} \vec{v}_{(i)} &= \Exp{G_t | S_t = s_i} \\ \vec{v} &= \vec{r} + \gamma \vec{P} \vec{v} \\ &= \vec{r} + \gamma \vec{P} \vec{r} + (\gamma \vec{P})^2 \vec{r} + \cdots \\ (\vec{I} - \gamma \vec{P}) \vec{v} &= \vec{r} \\ \vec{v} &= (\vec{I} - \gamma \vec{P})^{-1} \vec{r} \end{align} $$

Lines 3-4 used the matrix version of the geometric sum. Assuming that matrix inversion step was valid (which is usually the case), we have a way of solving for the value function in general.
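
For example, here's the whole thing in numpy (the three-state chain and reward vector below are made up purely for illustration):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],     # a small, arbitrary 3-state chain
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, -1.0])     # expected reward on leaving each state
gamma = 0.9

v = np.linalg.solve(np.eye(3) - gamma * P, r)   # v = (I - gamma P)^{-1} r
print(v)
```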

Sum of Discounted Deltas

Suppose now that we have an approximate value function, $\hat{v}$. It may have been learned via some sort of learning algorithm, or it may be completely random numbers, but regardless, we have it.

Now, suppose for some reason we want to know about the discounted sum of future deltas2.

The value function is fixed and we have the expected reward vector from before, so if we let $\vec{d}_{(i)} = \mathbb{E}\Bigg[ \delta_{t} \Bigg\vert S_t = s_i\Bigg]$, we can write it as

$$ \vec{d} = \vec{r} + \gamma \vec{P} \hat{\vec{v}} - \hat{\vec{v}} $$

Note that we're using $\hat{\vec{v}}$, which is different from the "true" $\vec{v}$ that we got from solving the Bellman equation.

Now we go for the infinite sum:

$$ \begin{align} \vec{d} + \gamma \vec{P} \vec{d} + (\gamma \vec{P})^2 \vec{d} + \cdots &= (\vec{I} - \gamma \vec{P})^{-1} \vec{d} \\ &= (\vec{I} - \gamma \vec{P})^{-1} (\vec{r} + \gamma \vec{P} \hat{\vec{v}} - \hat{\vec{v}}) \\ &= (\vec{I} - \gamma \vec{P})^{-1} (\vec{r} - (\vec{I} - \gamma \vec{P}) \hat{\vec{v}}) \\ &= (\vec{I} - \gamma \vec{P})^{-1} \vec{r} - (\vec{I} - \gamma \vec{P})^{-1} (\vec{I} - \gamma \vec{P}) \hat{\vec{v}} \\ &= (\vec{I} - \gamma \vec{P})^{-1} \vec{r} - \hat{\vec{v}} \\ &= \vec{v} - \hat{\vec{v}} \end{align} $$

Amazing. Well, maybe you've seen the result before (I've learned that it shows up in a few places), but I think it's kinda neat that the difference between the true value, $\vec{v}$, and our approximate value function, $\hat{\vec{v}}$, is also encoded in the series of TD-errors the same way that the value was encoded in the rewards.
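
This is easy to check numerically with the same sort of made-up chain as before (again, the numbers are arbitrary):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, -1.0])
gamma = 0.9

v_true = np.linalg.solve(np.eye(3) - gamma * P, r)
v_hat = np.random.default_rng(0).normal(size=3)   # some arbitrary "learned" value function

d = r + gamma * P @ v_hat - v_hat                 # expected TD error in each state
print(np.allclose(np.linalg.solve(np.eye(3) - gamma * P, d), v_true - v_hat))   # True
```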

Of course, we can see this in another way:

$$ \begin{align} G_{t} - \hat{v}(S_t) &= \Bigg( \sum_{k=1}^{\infty} R_{t+k} \gamma^{k-1} \Bigg) - \hat{v}(S_t) \\ &= R_{t+1} + \gamma \hat{v}(S_{t+1}) - \hat{v}(S_t) - \gamma \hat{v}(S_{t+1}) + \gamma \Bigg( \sum_{k=1}^{\infty} R_{t+k+1} \gamma^{k-1} \Bigg) \\ &= \delta_{t} + \gamma \Big( G_{t+1} - \hat{v}(S_{t+1}) \Big) \\ &= \sum_{k=0}^{\infty} \delta_{t+k} \gamma^{k} \end{align} $$

If you wanted to find a use for this, you might try using a cascade of algorithms (the first learns the value, the second learns the error in the first's approximation, and so on) to reduce error3. Or you can point to this as a semi-rigorous explanation of why monitoring the TD-error alone gives you an idea of how well your algorithm is doing in general.

But in any case, the math was fun, which is really all I promised in the title of this post.


  1. How accurately we learn the value function, of course, is its own topic of discussion. Largely, it depends on how accurately we can perceive the state space. If we assume complete omniscience, then asymptotically our accuracy will be perfect. If we fall short of that, then our accuracy suffers, but we'll still do about as well as the best-case scenario that our diminished perception can allow. 

  2. For example, suppose you are writing a thesis where this expression comes up and you're curious about it. OK, maybe that's a reason particular to my case. 

  3. You would likely have to use different representations for each (the reasons for that would require another post).