# Escaping Batch Country

This post sat in my "drafts" folder for many moons, mostly done, so I finished it and then threw it online. There are probably lacunae where I meant to explain something, didn't leave a note in the draft, and then failed to notice. However, you can now email me to tell me about it-- rldotaiblog@gmail.com --for the handful of readers, dozens of webcrawlers, and hundreds of automated vulnerability scanners. Lots of people looking for my wp-login.php page1.

This post is based on a talk that I gave a while ago. In essence, there's a way of approximating functions of returns using online algorithms (such as TD(λ)).

Rather than just posting the slides, I'll recap2 and mention some more technical details from that talk (although the slides are available here).

## Batch Country

I've been using the term "batch country" to refer to problems/approximation targets where online algorithms cannot be used. If you're estimating the mean of a time series (say, $X_{1}, X_{2}, \ldots$), then you can update your estimate when new data becomes available:

$$\mu_{t+1} = \frac{t \mu_{t} + X_{t}}{t+1}$$

Here, we're implictly relying on the linearity of expectation; if linearity doesn't apply, you either have to learn from batches of data or perhaps employ an approximation.

For example, predicting the log of a time series, $\log( X_{1} + X_{2} + \cdots + X_{T} )$. It doesn't dissociate into individual terms that can be aggregated every time-step, so if you want to estimate it you need to collect $X_{1}$ through $X_{T}$ and only then can you update.

### Why Online Algorithms?

Online algorithms in general and reinforcement learning in particular are preferable to batch methods for a variety of reasons. The fact that your estimates are always current is helpful if you're actively using the predictions, e.g., robot control or game playing. Additionally, updating over time means you amortize the cost of learning-- batch methods can waste a lot of available CPU time while waiting for enough new data to become available.

In reinforcement learning, the temporal difference methods often crushes value/policy iteration for these reasons (learning more in a shorter time with less computation) and they're more widely applicable. So it's nice to be able to use an online algorithm.

TD(λ) with accumulating traces is such an algorithm, which uses the following update rules:

\begin{aligned} e_{t} &= x_{t} + \gamma_{t} \lambda_{t} e_{t-1} \\ \delta_{t} &= R_{t+1} + \gamma_{t+1} w_{t}^{\top} x_{t+1} - w_{t}^{\top} x_{t} \\ w_{t+1} &= w_{t} + \alpha_{t} \delta_{t} e_{t} \end{aligned}

### When TD-methods break down

The big thing that Markov Decision Problems have going for them is, unsurprisingly, the Markov property.

A technical way of phrasing that crucial attribute is that they are "memoryless", in the sense that knowing the history of the system provides you no more information into its future evolution than if you just knew the current state. So if you know the current state of the system, then have just as much predictive power as if you knew the previous state, and the previous-previous state, etc.

Furthermore, it ensures that the return (our approximation target) has a nice recursive structure.

\begin{aligned} G_{t} &= R_{t+1} + \gamma G_{t+1} \\ v(S_{t}) &\approx \Exp{G_{t}} &= \Exp{R_{t+1} + \gamma G_{t+1} } \\ \Rightarrow v(S_{t}) &\approx \Exp{R_{t+1}} + \gamma v(S_{t+1}) \\ \Rightarrow \nabla v &= R_{t+1} + \gamma v(S_{t+1}) - v(S_{t}) \end{aligned}

For an online algorithm to work, we need to have enough information to compute the gradient for our estimator at each point in time.

## Possible Exits from Batch Country

We have previously looked at, where we note that we can estimate moments of the return (e.g., $\mathbb{E}[G_{t}^{n}]$, for $n=1, 2, 3, \ldots$. For $n=1$, you have the expected return, for $n=2$, you get the second moment (which can be used to compute the variance), and so on3.

Here's an expression for the second moment of the return as a pseudo-Bellman equation:

\begin{aligned} G_{t}^{2} &= (R_{t+1} + \gamma G_{t+1})^{2} \\ &= R_{t+1}^{2} + 2 \gamma R_{t+1} G_{t+1} + \gamma^{2} G_{t+1}^{2} \\ &= R^{SM}_{t+1} + \gamma^{2} G_{t+1}^{2} \end{aligned}

The second moment seems like it can't be approximated online, because we have to wait for $G_{t+1}$ to come in before we can compute $R^{SM}_{t+1}$.

However, if we approximate $G_{t+1}$ by $v(S_{t+1})$, suddenly we do have a Bellman equation:

\begin{aligned} R^{SM}_{t+1} \approx \bar{R}_{t+1} &= R_{t+1}^{2} + 2 \gamma_{t+1} R_{t+1} v(S_{t+1}) \\ G_{t}^{2} \approx \hat{G}^{2}_{t} &= \bar{R}_{t+1} + \gamma_{t+1}^{2} \hat{G}^{2}_{t+1} \\ &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \hat{G}^{2}_{t+1} \end{aligned}

Where $\bar{\gamma}$ and $\bar{R}$ are the discount factor and rewards for this return we've built.

You can then use TD(λ) or whatever other algorithm you prefer to estimate the value of our newly-defined return for the second moment

For example, we could extend the TD(λ) algorithm above to get an algorithm that estimates $\Exp{G_{t}}$ and $\Exp{G_{t}^{2}}$

\begin{aligned} \bar{R}_{t+1} &= R_{t+1}^{2} + 2 \gamma R_{t+1} w_{t}^{\top} x_{t+1} \\ \bar{\gamma}_{t+1} &= \gamma_{t}^{2} \\ \bar{e}_{t} &= x_{t} + \bar{\gamma}_{t} \lambda_{t} e_{t-1} \\ \bar{\delta}_{t} &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \bar{w}_{t}^{\top} x_{t+1} - \bar{w}_{t}^{\top} x_{t} \\ \bar{w}_{t+1} &= \bar{w}_{t} + \alpha_{t} \bar{\delta}_{t} \bar{e}_{t} \end{aligned}

We would then have $\Exp{G_{t}} \approx v(s) = w^{\top} x(s)$ and $\Exp{G_{t}^{2}} \approx v^{(2)}(s) = \bar{w}^{\top} x(s)$.

### Higher Moments

We can estimate the first and second moments, so the natural course of action is to proceed by induction to estimate higher moments.

For example, we have $G^{3}_{t}$.

\begin{aligned} G_{t}^{3} &= (R_{t+1} + \gamma G_{t+1})^{3} \\ &= R_{t+1}^{3} + 3 \gamma R_{t+1}^{2} G_{t+1} + 3 \gamma^{2} R_{t+1} G_{t+1}^{2} + \gamma^{3} G_{t+1}^{3} \\ &\approx R_{t+1}^{3} + 3 \gamma R_{t+1}^{2} v(S_{t+1}) + 3 \gamma^{2} R_{t+1} v^{(2)}(S_{t+1}) + \gamma^{3} G_{t+1}^{3} \end{aligned}

And in general:

\begin{aligned} G_{t}^{n} &= \sum_{k=0}^{n} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} G_{t+1}^{k} \\ &\approx \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v^{(k)}(S_{t+1}) + \gamma^{n} G_{t+1}^{n} \end{aligned}

### Remarks

In practice, estimating higher moments does not really work out so well. Under linear function approximation, we can actually prove convergence for arbitrary moments, but in practice the fixed-point can be wildly different from the true value. The errors in each level of approximation tend to get magnified (both in terms of the approximation error and the variance of the stochastic approximation).

Furthermore, the various moments of the return are a set of measure zero in the space of "all functions that could be applied to the return"4.

However, to paraphrase a wise man, "there's nothing more dangerous than a man in the depths of an ether binge, except perhaps a theoretician with access to basic calculus".

## Escaping Batch Country

We have, for a suitable well-behaved function5.

$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x - a)^{k}$$

We then consider functions of a random variable. Can we apply a Taylor series approximation? The answer is yes, so long as the random variable is also reasonably well-behaved6.

So we get:

\begin{aligned} \mathbb{E} [ f(X) ] &= \mathbb{E} [ f( \mu + (X - \mu) )] \\ &= \mathbb{E} [f(\mu) + \sum_{k=1}^{\infty} \frac{1}{k!} f^{(k)}(\mu) (X - \mu)^{k} ] \end{aligned}

Where $\mu = \Exp{X}$ is the mean of $X$.

We might reasonably get away with a second-order approximation:

\begin{aligned} \mathbb{E} [ f(X) ] &= \mathbb{E} [f(\mu) + \sum_{k=1}^{\infty} \frac{1}{k!} f^{(k)}(\mu) (X - \mu)^{k} ] \\ &\approx \mathbb{E}[ f(\mu) + f'(\mu) ( X - \mu ) + \frac{1}{2} f''(\mu) (X - \mu)^{2} ] \\ &= \mathbb{E}[ f(\mu) ] + \mathbb{E}[ f'(\mu) ( X - \mu ) ] + \mathbb{E}[\frac{1}{2} f''(\mu) (X - \mu)^{2} ] \\ &= f(\mu) + \mathbb{E}[ f'(\mu) ] \mathbb{E} [ ( X - \mu ) ] + f''(\mu) \mathbb{E}[\frac{1}{2} (X - \mu)^{2} ] \\ &= f(\mu) + \frac{1}{2} f''(\mu) \sigma^{2} \end{aligned}

Where we use the fact that $\Exp{f(\mu)}$ is constant, $\Exp{X - \mu} = 0$, and that $\Exp{(X - \mu)^{2}} = \sigma^{2}$, the variance of $X$.

### In theory, things are looking good

So we have a means of expressing a function of the return in terms of the moments of the return, and we have a method of estimating those moments. In theory, we can then express any function of the return.

\begin{aligned} \mathbb{E} [ f(G_{t}) | S_{t} = s] &= \mathbb{E} \Bigg[ f(v(s)) + \sum_{k=2}^{\infty} \frac{1}{k!} f^{(k)}(v(s)) (G_{t} - v(s))^{k} \Bigg | S_{t} = s \Bigg] \end{aligned}

It's even pretty efficient to compute.

For each $n$, the central moment $u^{(n)}(s) = \Exp{(G_{t} - v(S_{t})^{n} | S_{t} = s}$ can be expressed as:

\begin{aligned} u^{(n)}(s) &= \Exp{(G_{t} - v(s))^{n} | S_{t} = s} \\ &= \mathbb{E} \Bigg[ \sum_{k=0}^{n} \binom{n}{k} (-v(s))^{n-k} G_{t}^{k} \Bigg| S_{t} = s \Bigg] \\ &\approx \sum_{k=0}^{n} \binom{n}{k} (-v(s))^{n-k} v^{(k)}(s) \\ &= (-1)^{n-1} n v^{n}(s) + \sum_{k=2}^{n} \binom{n}{k}(-1)^{n-k} v^{n-k}(s) v^{(k)}(s) \end{aligned}

To compute $n$ central moments of $G_{t}$, we have to evaluate $n$ value functions, with the rest of the multiplication and summing of terms in the binomial being relatively cheap. We then have to compute the derivatives of $f$ at $v(s)$ up to $f^{(n)}$, which is probably not too expensive for typical choices of $f$.

We could have a whole bunch of functions, $f_{1}, \ldots, f_{z}$, and then things become even more efficient. We can memoize the $(G_{t} - v(S_{t}))^{n}$ coefficients and reuse them when calculating the Taylor expansion for each $f_{i}$.

### Is this at all useful in practice?

It depends.

First, let's examine whether the assumptions we're making regarding the Taylor approximation can be expected to hold in practice-- first, that $f(\cdot)$ is well-behaved, and second, that the moments of $G_{t}$ are finite.

1. It's not clear to me that all interesting functions are well-behaved, but perhaps all interesting functions have an analytic approximation that is suitable for our purposes. So, maybe this is true in practice. You probably want to choose functions where the each subsequent derivative is not growing too rapidly, though.

2. It's also unclear whether a typical return will have finite moments, but we can assert that it's true if the reward is bounded.

We have that $R_{t}^{(n)}$ (the "reward" we use when constructing the approximation target for the $n$-th moment) is:

$$R_{t}^{(n)} = \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v^{(k)}(S_{t+1})$$

Suppose the base reward is bounded, then we have a means of bounding the higher rewards as well.

Let $\max | R | = R_{\max}$, so $|G_{t}| \leq \frac{R_{\max}}{1 - \gamma}$, which implies that we can clip the value function so that $|v(s)| \leq \max \frac{R_{\max}}{1 - \gamma}$. Then for the second moment, we have

\begin{aligned} R_{t}^{(2)} &= R_{t+1}^{2} + 2 \gamma R_{t+1} v(S_{t+1}) \\ &\leq R_{\max}^{2} + 2 R_{\max} \frac{R_{\max}}{1 - \gamma} \\ &= R_{\max}^{2} \frac{3 - \gamma}{1 - \gamma} \\ &= R_{\max}^{(2)} \\ G_{t}^{2} &\leq \frac{R_{\max}^{(2)}}{1-\gamma^{2}} \end{aligned}

We observe that the higher moments are all bound7, although depending on the value of $\gamma$ this may be cold comfort.

So by engineer's induction, if the base reward is bound, the higher moments are bound as well.

If your rewards are not bounded, then asserting that the higher moments are bounded becomes harder.

I've run a few experiments to test this, and the results are somewhat mixed. If the discount factor is high ($\gamma \approx 1$), then the higher moments tend to dominate for a while before the factorials in the denominator eventually remove those terms from consideration.

From an implementation point-of-view, we also have to worry about how good our estimate for the various moments will be under function approximation. For example, we might have a representation that is well-suited to $v(\cdot)$, but performs poorly when we try to use it for approximating $v^{(2)}(\cdot)$.

### Final Remarks

Approximating arbitrary functions of the return with an online algorithm sounds like a lot of fun, but the conditions under which it's feasible are pretty rough.

## Further Work

I'd like to take a closer look at what the dominant terms are, and maybe do some experiments to see how many moments we would need to evaluate to get a reasonable estimate.

I tried characterizing when this is possible in detail, and deriving error bounds based on Taylor's theorem, but it really gets ugly quickly. At some point, however, it might be worth pushing through the pain to see if anything neat falls out, or at least getting the chance to behold some hilariously worthless "guarantees".

## References

1. I have cleverly moved it to secret-wp-login.php to prevent hackers from being able to install WordPress on my static site.

2. I omit the introduction, which contains a couple of lousy puns and a bit of elementary stuff on RL/MDP theory, since this is the Internet where both are available in huge quantities.

3. We also derived some rather impractical expressions for fractional returns, which would be things like $G_{t}^{1/2}$, for example.

4. Although they might occupy non-trivial volume of the space of "functions that you'd actually want to apply to the return", so that's nice.

5. In general you want $f$ to be smooth, i.e., infinitely differentiable, but you might be able to loosen the requirements if you have some idea what values $x$ can take.

6. Here we need the moments of $X$ to be finite, although you could potentially loosen the requirements if you knew more about $f$.

7. I suppose that $G_{t}^{n}$ is bound by something like $O(R^{n}/\prod_{i=k}^{n}(1-\gamma)^{k} )$ but I haven't looked deeply into it.

# Max Entropy RL Explanation

The paper on max entropy RL (Reinforcement Learning with Deep Energy-Based Policies) is pretty cool, but some of the proofs are lacking explanation. This is papered over by the standard "it is easy to see..." evasions that mathematicians take such delight in. Usually, space restrictions are the culprit, but I wish authors would either provide more detail or at least a specific reference (including the page number!) for where I can learn more.

On HackerNews, a request for greater explanation of one of the proofs was requested, and since I didn't understand, and understanding this sort of thing is my job, I thought I would look into it.

$$\newcommand{\eqdef}{\stackrel{\text{def}}{=}} \newcommand{\eqmaybe}{\stackrel{\text{?}}{=}}$$

Define the "soft Q-value" for an arbitrary policy $\pi$ via:

$$Q_{soft}^{\pi}(s,a) \eqdef r_{0} + \mathbb{E}_{\pi} \bigg[ \sum_{t=1}^\infty \gamma^{t} \Big( r_{t} + \mathcal{H}(\pi(\cdot | s_{t})) \Big) \bigg| s_{0} = s, a_{0} = a \bigg]$$

(Hereafter, we'll just use $Q^{\pi}$ instead of $Q_{soft}^{\pi}$ since the standard Q-values aren't needed for our purposes).

Equation 18 (in Appendix 2) claims that:

$$\mathcal{H}(\pi(\cdot | s)) + \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] \leq \mathcal{H}(\tilde{\pi}(\cdot | s)) + \mathbb{E}_{a \sim \tilde{\pi}} [ Q_{soft}^{\pi} (s, a) ]$$

For an arbitrary policy $\pi$ with $\tilde{\pi}(a|s) \propto \text{exp}[ Q_{soft}^{\pi}(s,a)]$

They claim that this (and the associated Theorem 4) are easily proved once you "notice" that (from eq. 19)

$$\mathcal{H}(\pi(\cdot | s)) + \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] = - \text{D}_{KL}( \pi (\cdot | s) || \tilde{\pi} (\cdot | s) ) + \log \int \text{exp}[ Q^{\pi}(s,a) ] \, \text{d}a$$

Personally, I don't expect my readers to "notice" things that I cannot prove in a couple of lines or aren't tricks used in every other paper, but maybe I'm just not familiar with this subfield.

Let's tackle the setting where we've got a discrete action space, expanding the Kullback-Leibler divergence term so that we can get rid of the entropy term on the LHS:

\begin{aligned} - \text{D}_{KL}( \pi (\cdot | s) || \tilde{\pi} (\cdot | s) ) &= -\sum_{a} \pi (a | s) \log \bigg[ \frac{ \pi(a|s) }{ \tilde{\pi}(a|s) } \bigg] \\ &= -\sum_{a} \pi (a | s) \big( \log [\pi(a|s)] - \log [ \tilde{\pi}(a|s)] \big) \\ &= \mathcal{H} ( \pi( \cdot | s) ) + \sum_{a} \pi (a | s) \log [ \tilde{\pi}(a|s)] \end{aligned}

So now we need to "notice" that

$$\mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] = \sum_{a} \pi (a | s) \log [ \tilde{\pi}(a|s)] + \log \sum_{a} \text{exp}[ Q^{\pi}(s,a) ]$$

This is actually pretty cool, and it makes use of the fact that $\tilde{\pi}$ is proportional to the exponentiated Q-values. Say that $\tilde{\pi}(a | s) = \beta \exp [Q^{\pi}(s,a)]$ for some $\beta$.

\begin{aligned} \log \sum_{a} \exp [Q^{\pi}(s,a)] &= \log \bigg[ \sum_{a} (1/\beta) \tilde{\pi}(a | s) \bigg] \\ &= \log \bigg[ (1/\beta) \sum_{a} \tilde{\pi}(a | s) \bigg] \\ &= - \log \beta + \log \bigg[ \sum_{a} \tilde{\pi}(a | s) \bigg] \\ &= - \log \beta + \log [ 1 ] \\ &= -\log \beta \end{aligned}

Because $\tilde{\pi}(a|s)$ is a probability distribution and we're summing over all possible actions, it goes to 1, and the logarithm of 1 is zero, as you may have noticed.

Now we further notice that this trick applies in reverse:

\begin{aligned} \sum_{a} \pi (a | s) \log [ \tilde{\pi}(a|s)] &= \sum_{a} \pi (a | s) \log [ \beta \exp [Q^{\pi}(s,a)] ] \\ &= \sum_{a} \pi (a | s) \bigg( \log [ \beta ] + \log [ \exp [Q^{\pi}(s,a)] ] \bigg) \\ &= \sum_{a} \pi (a | s) \bigg( \log [ \beta ] + Q^{\pi}(s,a) \bigg) \\ &= \log \beta + \sum_{a} \pi (a | s) Q^{\pi}(s,a) \end{aligned}

Where we once again noticed that $\pi(a | s)$ is a probability function, and that $\beta$ was independent of it, so we could factor it out and sum. Combining the terms we analyzed, we get:

\begin{aligned} \sum_{a} \pi (a | s) \log [ \tilde{\pi}(a|s)] + \log \sum_{a} \text{exp}[ Q^{\pi}(s,a) ] &= \log \beta + \sum_{a} \pi (a | s) Q^{\pi}(s,a) -\log \beta \\ &= \sum_{a} \pi (a | s) Q^{\pi}(s,a) \end{aligned}

Which allows us to (finally) notice that:

$$\sum_{a} \pi (a | s) Q^{\pi}(s,a) = \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ]$$

### Noticing Our Way to Eq. 18

We still have the claim that:

$$\mathcal{H}(\pi(\cdot | s)) + \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] \leq \mathcal{H}(\tilde{\pi}(\cdot | s)) + \mathbb{E}_{a \sim \tilde{\pi}} [ Q_{soft}^{\pi} (s, a) ]$$

...and it is not immediately clear how (19) helps us arrive at the inequality.

But it actually does, if we notice two things:

1. The KL-divergence is nonnegative,
2. That if we have $\tilde{\pi}(a | s) \propto \exp{Q^{\pi}(s, a)}$, then we need to be a little more specific and introduce a factor (for each state), denoted $\beta(s)$, that ensures our probabilities sum to one
• Let $\tilde{\pi}(a | s) = \beta(s) \exp{Q^{\pi}(s, a)}$
• Then $1/\beta(s) = \sum_{a} \exp [ Q^{\pi}(s, a) ]$.

With this in mind, we have:

\begin{aligned} \mathbb{E}_{a \sim \tilde{\pi}} [ Q_{soft}^{\pi} (s, a) ] + \mathcal{H}(\tilde{\pi}(\cdot | s)) &= \sum_{a} \tilde{\pi}(a | s) \bigg[ Q^{\pi}(s, a) - \log ( \tilde{\pi}(a | s) ) ) \bigg] \\ &= \sum_{a} \tilde{\pi}(a | s) \bigg[ Q^{\pi}(s, a) - \log [ \beta(s) \exp{Q^{\pi}(s, a)} ] \bigg] \\ &= \sum_{a} \tilde{\pi}(a | s) \bigg[ Q^{\pi}(s, a) - \log [ \beta(s) ] - \log [ \exp{Q^{\pi}(s, a)} ] \bigg] \\ &= \sum_{a} \tilde{\pi}(a | s) \bigg[ Q^{\pi}(s, a) - \log [ \beta(s) ] - Q^{\pi}(s, a) \bigg] \\ &= - \sum_{a} \tilde{\pi}(a | s) \log [ \beta(s) ] \\ &= - \log[ \beta(s) ] = \log [ 1/\beta(s) ] \\ &= \log \bigg[ \sum_{a} \exp [ Q^{\pi}(s,a) ] \bigg] \end{aligned}

Which, you may notice, is part of (19). Now we add in the negative KL divergence and we get the inequality in (18). Bam!

\begin{aligned} \mathbb{E}_{a \sim \tilde{\pi}} [ Q_{soft}^{\pi} (s, a) ] + \mathcal{H}(\tilde{\pi}(\cdot | s)) &= \log \bigg[ \sum_{a} \exp [ Q^{\pi}(s,a) ] \bigg] \\ &\geq - \text{D}_{KL}( \pi (\cdot | s) || \tilde{\pi} (\cdot | s) ) + \log \bigg[ \sum_{a} \exp [ Q^{\pi}(s,a) ] \bigg] \\ &= \mathcal{H}(\pi(\cdot | s)) + \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] \end{aligned}

Which is what Eq. 18 claims.

So while the paper was interesting, I do find it kinda frustrating that in order to understand the proof I had to do a series of non-obvious manipulations and make use of some facts that are "obvious" only if you've done this sort of stuff before. And I'm not sure there isn't a better solution or something more intuitive that I could've tried instead-- perhaps if I'd solved it via measure theory instead of just working the discrete case? However, that's about all the noticing I'm able to do at the moment since I've got my own theorems to prove.

# Estimating Fractional Returns in Markov Decision Processes

(Just some quick thoughts before I return to typesetting my thesis... I will return to this to flesh things out once I've gotten a bit more sleep and organized my thoughts)

I am writing my thesis on estimating the variance of the return in Markov Decision Processes using online incremental algorithms, which turns out to a surprisingly complex problem.

Having an estimate of the variance is generally agreed to be a good thing, according to a random sampling (N=2) of statisticians I interviewed when writing this blog post. If you have unlimited computational power and memory, you may not care about having everything be online and incremental, because you can just keep track of literally everything and compute whatever function of the return you want. That approach is rather wasteful, and generally only promoted by individuals who own shares in NVIDIA.

If it's possible to update your estimator(s) as the data comes in, you should probably do that, because it amortizes the computational cost and allows you to modify the learning agent's behavior in real time1. TD(λ) and related algorithms can accomplish this, and there's a lot of work on methods of learning to estimate the expected return online and incrementally. But what if you want to estimate the second moment of the return? Or the square root? This is a tougher nut to crack.

## Background

(You can skip this section if you're familiar with reinforcement learning)

Markov decision processes (MDPs) model decision making in situations where the outcomes are a function of the actions taken by an agent, possibly with some randomness thrown in for zest. At time "t", the agent gets some information about the state of the world ($S_{t}$), selects an action $A_{t}$ according to some policy it's following (so $A_{t} \sim \pi(S_{t})$), transitions to a new state $S_{t+1}$ and gets a reward for its efforts ($R_{t+1}$). It then has to repeat this process again, forever, or at least until it reaches a terminal state. The value of a state depends not just on the reward it receives on transitioning out of that state, but on all the future rewards that ensure. To make this mathematically tractable, we introduce a discount parameter $\gamma \in (0, 1]$, so that the reward $k$ transitions from now is diminished by $\gamma^{k}$, meaning that the value of each state is finite2. We call this discounted sum of reward "the return", and say the value of a state is the expected return you would get if you had just transitioned into that state.

We denote the return by $G_{t}$, and the value of some state $s$ by $v(s)$.

\begin{aligned} G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\ v(s) &= \Exp{G_{t} | S_{t} = s} \end{aligned}

RL agents learn the value function by predicting the value of the next state, and update their estimate of the value function by something like stochastic gradient descent. That way, we can learn the value function as information comes in, without having to pause in order to update the model's parameters3.

### Why learning the value function is "easy"

If you've not studied these ideas in detail, I can perhaps sketch the justification using a technique I like to call "math".

For the TD(0) algorithm, given a transition of the form $(S_{t}, A_{t}, R_{t+1}, S_{t+1})$, and we compute the temporal difference error:

$$\delta_{t} = R_{t+1} + \gamma v(S_{t+1}) - v(S_{t})$$

and we modify the value function, which is parameterized by $w$, with the update rule

$$w \leftarrow + \alpha \delta \nabla_{w} v$$

Part of the reason why this works is that the return, $G_{t}$, is a discounted sum. So we can define it recursively, and use linearity of expectations to get:

\begin{aligned} G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\ &= R_{t+1} + \gamma G_{t+1} \\ \Exp{G_{t}} &= \Exp{R_{t+1} + \gamma G_{t+1}} \\ &= \Exp{R_{t+1}} + \gamma \Exp{G_{t+1}} \end{aligned}

If we assume that $v(s) \approx \Exp{G_{t} | S_{t} = s}$, we can get an expression for the (squared) error in the value function:

\begin{aligned} \text{err}(v) &= (R_{t+1} + \gamma G_{t+1} - v(S_{t}))^{2} \\ &\approx (R_{t+1} + \gamma v(S_{t+1}) - v(S_{t}))^{2} \end{aligned}

Further assume that $v(S_{t+1})$ is constant4, and we can differentiate w/r/t the parameters of the value function ($w$), to get:

\begin{aligned} \nabla \text{err}(v) &\approx \nabla_{w} (R_{t+1} + \gamma v(S_{t+1}) - v(S_{t}))^{2} \\ &= - 2 (R_{t+1} + \gamma v(S_{t+1}) - v(S_{t})) \nabla_{w} v(S_{t}) \\ &= - 2 \delta_{t} \nabla_{w} v(S_{t}) \end{aligned}

We make a few assumptions and get an algorithm that not works in practice, but have been proven to converge with a bounded error5, and can be used in the online setting.

## Second and Higher Moment of the Return

So, we can learn the value function, and now perhaps we'd like the second moment of the return as well so that we can do fun things like computing the variance.

$$\Var{G_{t}} = \Exp{G_{t}^{2}} - \Exp{G_{t}}^{2}$$

Having the variance is nice because we can make confidence intervals, modify the learning rate (reducing it when variance is high, and ramping it up when it's low), or even include it as a penalty to discourage risky behavior in our agent6 But naturally this is easier said than done.

### Why is this so hard?

Mainly because we can use the fact that the expression for the return is recursive and linear to get cool operator equations, but the same cannot be said for e.g., $(G_{t}^{2})$.

For example, if $\vec{v}$ is the value function in vector form (such that $v(s_{i}) = [\vec{v}]_{i}$, $P$ is the transition matrix (so $\Pr (s_{j} | s_{i}) = P_{ij}$, and $\vec{r}$ is the expected immediate reward with $[\vec{r}]_{i} = \Exp{R_{t+1}|S_{t} = s_{i}}$, then

\begin{aligned} \vec{v} &= \vec{r} + \gamma P \vec{v} \\ \Rightarrow \vec{v} &= (I - \gamma P)^{-1} \vec{r} \end{aligned}

But if we consider $G_{t}^{2}$, we get

\begin{aligned} G_{t}^{2} &= (R_{t+1} + \gamma G_{t+1})^{2} \\ &= R_{t+1}^{2} + 2 \gamma R_{t+1} G_{t+1} + \gamma^{2} G_{t+1}^{2} \end{aligned}

which almost seems like it will work, except we can't just predict the squared return of the next state because of the $2 \gamma R_{t+1} G_{t+1}$ cross-term. If we wanted to be exactly correct, we'd need to track the future rewards from $G_{t+1}$, discount them, and multiply by $R_{t+1}$, and do this for every transition. Thus, it seems like a victory for the forces of batch updating and computational inefficiency.

#### Saved By Mathematical Chicanery!

However, we can try the same trick that we used before, where we use one estimator to learn the value function, plug that in for $G_{t+1}$ as needed, and then further assume that this estimator is constant. So we end up learning from what I'm calling generalized- or meta-rewards (denoted by a "bar" superscript, e.g., $\bar{R}_{t+1}$), and it turns out that if you accept the validity of that substitution, we get a Bellman equation:

\begin{aligned} \bar{R}_{t+1} &= R_{t+1}^{2} + 2 \gamma R_{t+1} v(S_{t+1}) \\ \bar{\gamma} &= \gamma^{2} \\ \bar{v}(S_{t}) &= \Exp{\bar{R}_{t+1} + \bar{\gamma} \bar{v}(S_{t+1})} \\ &\approx \Exp{G_{t}^{2}} \end{aligned}

And we still have the ability to formulate cool operator equations:

\begin{aligned} \bar{\vec{v}} &= (I - \bar{\gamma}^{2} P)^{-1} \bar{\vec{r}} \\ &= (I - \gamma^{2} P)(\vec{r}^{(2)} + 2 \gamma \vec{R} P \vec{v}) \end{aligned}

The justification for this is covered in other works, see Sobel. I will note that everything's just groovy from a convergence perspective, because if $\gamma \in (0, 1]$, then so is $\gamma^{2}$, and if the rewards are bounded, the $R_{t}^{2}$ is also bounded, and therefore so is the return. So if we limit the value function to be $|v(s)| \leq \frac{R_{max}}{1 - \gamma}$, then $\bar{R}_{t}$ is bounded, and so if TD would converge on the original problem, then it converges here too.

It works well in the tabular case, but under function approximation things begin to break down in odd ways.

## Generalizing

As suggested by the section heading, we can generalize, and come up with a reward for the higher moments of the return rather easily:

\begin{aligned} G_{t}^{n} &= \sum_{k=0}^{n} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} G_{t+1}^{k} \\ &\approx \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v(S_{t+1})^{k} + \gamma^{n} G_{t+1}^{n} \end{aligned}

Which allows us (equipped only with a value function!) define rewards and discounts that yield Bellman equations we can use:

\begin{aligned} \bar{R}_{t+1}^{(n)} &= \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v(S_{t+1})^{k} \\ \bar{\gamma} &= \gamma^{n} \\ \bar{v}^{(n)}(S_{t}) &= \Exp{\bar{R}_{t+1}^{(n)} + \bar{\gamma} \bar{v}^{(n)}(S_{t+1})} \end{aligned}

When $n = 1$, we have $\bar{R}_{t+1} = R_{t+1}$, and $\bar{\gamma} = \gamma$, which is a nice sanity check.

### Fractional Returns

So while the original problem (estimating the moments of the return) seems completely sane, and its solution a praiseworthy goal, we can drop the mask of being responsible mathematicians and start wondering about weird quantities, like $G_{t}^{p/q}$. We could even consider complex powers, if that sort of thing seems worthwhile.

Let $\beta = p/q \in \mathbb{C}$, and let's use a binomial series in place of the binomial expansion that came before:

\begin{aligned} G_{t}^{\beta} &= (R_{t+1} + \gamma G_{t+1})^{\beta} \\ &= R_{t+1}^{\beta} (1 + \frac{\gamma G_{t+1}}{R_{t+1}})^{\beta} \\ &= R_{t+1}^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{\gamma G_{t+1}}{R_{t+1}} \Bigg)^{k} \end{aligned}

Previously, for the n-th moment, we truncated the series to get a $\gamma^{n} G_{t}$ term, giving us a Bellman equation. Here, we have an infinite series, and so the way around this is not exactly clear. If

Instead, we can make an unjustified assumption, and subtract $(\gamma v(S_{t+1}))^{\beta}$ in exchange for the $(\gamma G_{t+1})^{\beta}$ term:

$$G_{t}^{\beta} \approx \gamma^{\beta} G_{t+1}^{\beta} - (\gamma v(S_{t+1}))^{\beta} + R_{t+1}^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{\gamma v(S_{t+1})}{R_{t+1}} \Bigg)^{k}$$

$$\bar{R}_{t+1}^{\beta} = - (\gamma v(S_{t+1}))^{\beta} + R_{t+1}^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{\gamma v(S_{t+1})}{R_{t+1}} \Bigg)^{k}$$

We can replace the binomial function with its analytic extension, the gamma function, which is easy to compute.

However, we're still at a bit of an impasse, because this series may not be convergent.

• If $|\gamma v(S_{t+1})/R_{t+1}| < 1$, then it converges absolutely for all $\beta \in \mathbb{C}$.
• If $\gamma v(S_{t+1})/R_{t+1} = -1$, then we need $\Re (\beta) > 0$
• If $\gamma v(S_{t+1})/R_{t+1} \neq -1$ and $|\gamma v(S_{t+1})/R_{t+1}| = 1$, then we just need $\Re (\beta) > -1$.
• Divergence occurs if we're using a negative integer for $\beta$ with $|\gamma v(S_{t+1})/R_{t+1}|$ greater than 1.

Clearly, we have to be concerned with the maximum value that $v(s)$ can take, so we might consider rescaling the rewards but this doesn't work cleanly because it's a function of the reward. So an affine transformation of the rewards might work, along with appropriately chosen $\gamma$. Affine transformations (rescaling $R \rightarrow aR + b$ for some scalars $a$ and $b$) don't really affect the interpretation of the prediction when we're talking about the expected return, $\Exp{G_{t}}$, but they might change things for $G_{t}^{\beta}$.

Otherwise, we might be limited to cases where the episode returns a value of $\pm 1$ prior to termination. This is not the worst thing in the world, because it still means we can make predictions about whether events will or will not happen. This is still fairly useful, and if the math checks out (which it may not), we have a way of extending TD-methods to these artificially created non-exponentially decaying cases.

#### Alternative Expansion

We could also look at an alternative expansion:

\begin{aligned} G_{t}^{\beta} &= (R_{t+1} + \gamma G_{t+1})^{\beta} \\ &= (\gamma G_{t+1})^{\beta} \bigg(1 + \frac{R_{t+1}}{\gamma G_{t+1}} \bigg)^{\beta} \\ &= (\gamma G_{t+1})^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{R_{t+1}}{\gamma G_{t+1}} \Bigg)^{k} \\ &= (\gamma G_{t+1})^{\beta} + (\gamma G_{t+1})^{\beta}\sum_{k=1}^{\infty} \binom{\beta}{k} \Bigg( \frac{R_{t+1}}{\gamma G_{t+1}} \Bigg)^{k} \\ &\approx (\gamma G_{t+1})^{\beta} + (\gamma v(S_{t+1})^{\beta} \sum_{k=1}^{\infty} \binom{\beta}{k} \bigg( \frac{R_{t+1}}{\gamma v(S_{t+1})} \bigg)^{k} \end{aligned}

So we can get the sorts of recursive equations that we like to see:

\begin{aligned} \Exp{G_{t}^{\beta}} &\approx \bar{v}^{\beta}(S_{t}) \\ \bar{v}^{\beta}(s) &= \Exp{\bar{R}^{\beta} + \gamma^{\beta} v(S_{t+1}) | S_{t} = s} \\ \bar{R}^{\beta} &= (\gamma v(S_{t+1)})^{\beta} \sum_{k=1}^{\infty} \binom{\beta}{k} \Bigg( \frac{R_{t+1}}{\gamma v(S_{t+1})} \Bigg)^{k} \end{aligned}

I think I like this expansion better, since we avoid the decision to arbitrarily pull out a $(\gamma G_{t+1})^{\beta}$ in exchange for a $(\gamma v(S_{t+1}))^{\beta}$. Convergence of the series is also better, so long as the rewards are positive and the discount factor is set appropriately.

This time, we need $|R_{t+1}/(\gamma v(S_{t+1}))| \leq 1$ for convergence to be possible if $\beta$ is not a positive integer. Things get awkward if $\gamma$ or $v(S_{t+1})$ is zero for some values of $\beta$, but perhaps not fatally so. I'll have to look at it a bit more before I can speak from a "what does it mean?" standpoint.

#### Why?

Well... these fractional returns might still be useful, in the sense that one of the complaints about standard MDPs is that the discounting is too aggressive, and we might wish to learn to predict things with longer tails. If we decide that we care about fractional returns, it also raises questions about whether this is the best approach to go about computing them, and furthermore, why both with a discount parameter if (by the exponential nature of discounting), we care about processes with much longer memory? It's also just mathematically interesting, so there's that as well.

So perhaps the theory has just not been fully developed yet, which might make for a good research problem.

## References

• Sobel's paper, "The Variance of Discounted Markov Decision Processes", goes into this in greater detail.
• https://www.jstor.org/stable/3213832
• Also see White & White and Tamar et. al., and my own (soon-to-be-finished!) thesis

1. This is the difference between your robot car figuring out that it should be steering into the skid vs. the company that sold it to you having an awkward press conference and possible recall later in the week.

2. It also immanentizes the common expression "a bird in the hand is worth $\frac{1 - \gamma}{\gamma}$ in the bush".

3. Again, you could just track the return from every state and then learn the value function via something like batch processing, but to me that sounds like Washington bureaucracy and seems unlikely to Make Markov Decision Processes Great Again (MMDPGA).

4. This "derivation" is rather hand-wavy, so if you'd like something a little more mathematically stringent, see Hamid Maei's thesis or the work on true-online TD(λ) by van Seijen et al.

5. Convergence is guaranteed in the linear function approximation case. Follow policy when attempting to estimate return, as off-policy setting may lead to divergence. Do not attempt to use in in tasks with unbounded variance, or in non-stationary adversarial domains. For treatment with nonlinear function approximators, side effects may include: local minima, proliferation of neural net frameworks, superintelligence and superintelligence doomsaying from Elon Musk. Consult with your doctor (PhD, not MD) to see if TD(λ) is right for you.

6. Because even if our robot car "expects" to be able to execute a perfect bootleg turn on average, its occupants might prefer a less erratic mode of conveyance).

# Limiting Entropy as a Return

$$\def\Pr#1{\mathbb{P} \left( #1 \right)}$$

(In progress, just some quick notes until I'm done my thesis)

## Expressing Entropy as a Return

• It's a fun exercise
• Learning the limiting entropy of a state online may be useful
• There is probably some interesting analysis that could be done
• So, let's look at discrete Markov Processes and show that there's a recursive Bellman-like equation for entropy.
• Note that this is different from the entropy rate

### Quick rundown of relevant MDP basics

For a Markov chain, we have that the probability of seeing a particular sequence of states, say $(s_0, s_1, s_2, \ldots, s_n)$, is:

$$\Pr{s_0, s_1, s_2 \ldots s_n} = \Pr{s_{0}} \Pr{s_{1} | s_{0}} \cdots \Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots s_{1}, s_{0}}$$

Under the Markov assumption: $\Pr{s_{n} | s_{n-1}, s_{n-2}, \ldots} = \Pr{s_{n} | s_{n-1}}$.

\begin{aligned} \Pr{s_0, s_1, s_2 \ldots s_n} &= \Pr{s_0} \Pr{s_1 | s_0} \Pr{s_2 | s_1} \cdots \Pr{s_{n} | s_{n-1}} \\ &=\Pr{s_0} \prod_{t=1}^{n} \Pr{s_{t} | s_{t-1}} \end{aligned}

The expected value of a state for some reward function $r(s,s')$ is

\begin{aligned} v(s) &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma \Big( \sum_{s_{2}} \Pr{s_{2} | s_{1}} \Big(r(s_1, s_2) + \gamma \sum_{s_{3}} \Big( \ldots \Big) \Big) \Big) \Big) \\ &= \sum_{s_{1}} \Pr{s_{1} | s_{0} = s} \Big( r(s_0, s_1) + \gamma v(s_{1}) \Big) \end{aligned}

### Entropy

The classic definition of entropy for a discrete random variable $X$ which takes values from some set $\mathcal{X}$ is:

$$H(X) = \sum_{x \in \mathcal{X}} \Pr{X = x} \log\big( \Pr{X = x} \big)$$

We may also wish to express the entropy of two or more random variables taken together (the joint entropy). The joint entropy of two r.v.s, $X$ and $Y$, is given by:

\begin{aligned} H(X,Y) &= \sum_{x, y} \Pr{x,y} \log( \Pr{x, y} ) \\ &= H(Y|X) + H(X) = H(Y|X) + H(Y) \end{aligned}

Where $H(Y|X)$ is the conditional entropy for $Y$ given $X$. The formula for conditional entropy can be expressed as:

\begin{aligned} H(Y|X) &= \sum_{x \in \mathcal{X}} \Pr{x} H(Y|X=x) \\ &= -\sum_{x \in \mathcal{X}} \Pr{x} \sum_{y \in \mathcal{Y}} \Pr{y|x} \log (\Pr{y|x}) \\ &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr{x,y} \log (\Pr{y|x}) \end{aligned}

Note that if $Y$ is independent of $X$, $H(Y|X) = H(Y)$.

We can extend the formula for the conditional entropy to $H(Y| X_{1}, X_{2}, \ldots, X_{n})$ fairly easily:

H(Y|X_{1}, X_{2}, \ldots, X_{n})
= -\sum_{x_{1}, \ldots x_{n}} \sum_{y \in \mathcal{Y}} \Pr{x_{1}, \ldots, x_{n}, ,y} \log (\Pr{y|x_{1}, \ldots, x_{n}})

Furthermore, the joint entropy for a collection on random variables is

For a sequence of r.v.s, $(X_{1}, \ldots, X_{n})$, the joint entropy can be expressed as:

\begin{aligned} H(X_{1}, \ldots, X_{n}) &= - \sum_{x_{1}} \sum_{x_{2}} \cdots \sum_{x_{n}} \Pr{x_1, x_2, \ldots x_n} \log \big( \Pr{x_1, x_2, \ldots x_n} \big) \\ &= \sum_{k=1}^{n} H(X_{k} | X_{1}, \ldots, X_{k-1}) \end{aligned}

#### Entropy in Markov Chains

When it comes to Markov chains, we can use the Markov property to get a somewhat suggestive expression of the entropy.

First note how it simplifies the expression for conditional entropy:

\begin{aligned} H(X_{k} | X_{0}, \ldots, X_{k-1}) &= \sum_{x_{0}} \cdots \sum_{x_{k-1}} \Pr{x_{0}, \ldots, x_{k-1}} H(X_{k} | X_{0} = x_{0}, \ldots X_{k-1} = x_{k-1}) \\ &= \sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}, x_{0}} \sum_{x_{3}} \cdots \\ &\qquad \quad \sum_{x_{k-1}} \Pr{x_{k-1} | x_{0}, \ldots, x_{k-2}} H(X_{k} | X_{0} = x_{0}, \ldots X_{k-2} = x_{k-1}) \\ &= -\sum_{x_{0}} \Pr{x_0} \sum_{x_{1}} \Pr{x_{1} | x_{0}} \sum_{x_{2}} \Pr{x_{2} | x_{1}} \cdots \sum_{x_{k-1}} \Pr{x_{k-1} | x_{k-2}} \\ &\qquad \quad \sum_{x_{k}} \Pr{X_{k} | X_{k-1}} \log \Big( \Pr{X_{k} | X_{k-1}} \Big) \end{aligned}

Note that it has a nice recursion friendly form. For example, Iif we substitute this expression in order to calculate the joint entropy, expanding up to $X_{3}$, we get:

\begin{aligned} H(X_{0}, X_{1}, \ldots, X_{3}) = -&\sum_{s_0} \Pr{s_0} \Big[ \log( \Pr{s_0}) \\ &+ \sum_{s_1} \Pr{s_1 | s_0} \Big[ \log(\Pr{s_1 | s_0}) \\ & \quad + \sum_{s_2} \Pr{s_2 | s_1} \Big[ \log(\Pr{s_2 | s_1}) \\ & \quad\quad + \sum_{s_3} \Pr{s_3 | s_2} \log(\Pr{s_3 | s_2}) \Big]\Big]\Big]\Big] \end{aligned}

#### Bellman Equation

It's starting to look very similar to a return.

In fact, we can get a Bellman equation by defining the reward as the log-probability of the transition.

\begin{aligned} R_{t+1} &= \log \big( \Pr{S_{t+1} | S_{t}} \big) \\ G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\ &= R_{t+1} + \gamma G_{t+1} \\ v_{H}(s) &= \mathbb{E}[G_{t} | S_{t} = s] \end{aligned}

• Note the discount factor; if it's less than 1 then we're not "really" calculating the limiting entropy.
• However, if the MDP has no terminal states, then the entropy is infinite, which is not an especially helpful observation when trying to employ these concepts in practice.
• The return can easily be infinite in the undiscounted case for a non-terminating MDP. For example, the expected return is infinite if the reward function is positive-definite.
• So the discount factor can be chosen arbitrarily, or you can perhaps avoid infinities by renormalizing (subtracting the joint entropy averaged over all states)
• If you wanted a "justified" discount factor, you can look at $1 - \gamma$ as a termination probability
• Or perhaps you're interested in the entropy over some time horizon, say $\tau$, and set $\gamma = (\tau - 1)/\tau$.
• Using a state-dependent $\gamma$ to express whether some outcomes somehow do not count is also plausible, although not super compelling.
• There's another way of doing the analysis where you're working with infinite products instead of infinite sums
• The trouble arises mainly because we end up with $$G_{t} = \log(\Pr{S_{t+1} | S_{t}}) • \log(\Pr{S_{t+2} | S_{t+1}}^{\gamma}) + \log(\Pr{S_{t+3} | S_{t+2}}^{\gamma^2}) + \cdots$$ Which is a little bit hard to interpret.

• Where might you use it in practice?
• An estimate of a particular state's entropy gives you an idea of the number of possible branching outcomes.
• These outcomes might have the same end result
• If you wanted to measure how different possible results might be, you'd use something like the variance (or another distance-like measure)
• In combination with other things, this could be really informative.
• Given two policies with the same expected return, an agent might want to select one policy over one with higher entropy because it's more certain
• Conversely, during learning it might be better to explore high entropy states, or select them more frequently, because potentially many outcomes can spring from that single state.
• It may also reveal some information about where to allocate features, but I haven't analyzed this particularly deeply
• We can use entropy as guidance for how hard an MDP is, or how hard a state is to approximate
• States with high entropy may benefit from a lower learning rate, conversely states with low entropy may admit a higher learning rate
• Of course, issues spring up because in order to get a good estimate for a state's entropy you need to visit it many times.

# Fun with the TD-Error Part II

This bit of light entertainment emerged out of a paper I worked on estimating variance via the TD-error.

We have previously looked at how the discounted sum of TD-errors is just another way of expressing the bias of our value function. We'll quickly recapitulate that proof, but extend it for the λ-return and general value functions because I'm a sadist1

$$\newcommand{\eqdef}{\stackrel{\text{def}}{=}} \newcommand{\eqmaybe}{\stackrel{\text{?}}{=}}$$

## The δ-return

We first note the definitions of the TD-error (for time-step $t$, we denote it as $\delta_{t}$) and the λ-return, $G_{t}^{\lambda}$.

\begin{aligned} \delta_{t} &\eqdef R_{t+1} + \gamma_{t+1} v(S_{t+1}) - v(S_{t}) \\ G_{t}^{\lambda} &\eqdef R_{t+1} + \gamma_{t+1} \big( (1 - \lambda_{t+1}) v(S_{t+1}) + G_{t+1}^{\lambda} \big) \end{aligned}

Now we bust out one of our favorite techniques of "expanding a series and grouping terms" (I didn't say that it was a particularly hard technique).

\begin{aligned} G_{t}^{\lambda} - v(S_{t}) &= R_{t+1} + \gamma_{t+1} \big( (1 - \lambda_{t+1}) v(S_{t+1}) + G_{t+1}^{\lambda} \big) - v(S_{t}) \\ &= R_{t+1} + \gamma_{t+1} v(S_{t+1}) - v(S_{t}) + \gamma_{t+1} \lambda_{t+1} (G_{t+1}^{\lambda} - v(S_{t+1})) \\ &= \delta_{t} + \gamma_{t+1} \lambda_{t+1} \delta_{t+1} + \gamma_{t+1} \lambda_{t+1} \gamma_{t+2} \lambda_{t+2} \delta_{t+2} + \cdots \\ &= \sum_{n=0}^{\infty} \delta_{t+n} \Big( \prod_{k=1}^{n-1} \gamma_{t+k} \lambda_{t+k} \Big) \end{aligned}

Neat! I call this the δ-return, but it's a special case of defining meta-returns from an MDP + a learning algorithm.

This implies that you could figure out the bias of your reinforcement learning algorithm (that estimates the value function) by using a second estimator that tracks the δ-return. The definition of the bias is (when estimating the λ-return) is $\mathbb{E}_{\pi} [{G_{t}^{\lambda} - v(S_{t})}]$2, so if we have that as our "return" (which is the case with $\tilde{R}_{t+1} = \delta_{t}$), then in principle it is no different than estimating the value function ($v(s) \approx \mathbb{E}_{\pi} [ G_{t}^{\lambda} | S_t = s]$)

However, in practice this may have some unexpected difficulties, mainly because once the original algorithm has converged, $\Exp{\delta_{t} \vec{x}_{t}}$ is either exactly or close to zero 3.

## Bellman Equation for the Second Moment of a Return

In order for things to get truly fun, we need to take a quick look at Bellman equations for the second moment of a return. Building on some previous work by the likes of Sobel, Tamar et. al, and White & White, we know that we can actually define a Bellman equation for the second moment of the δ-return.

The second moment of a random variable $X$ is just $\Exp{X^2}$ (the mean is $\Exp{X}$ and is the first moment). Most often it comes up in the context of variance, which is the second central moment of a random variable:

$$\Var{X} \eqdef \Exp{X^2} - \Exp{X}^2 = \Exp{(X - \Exp{X})^2}$$

But ignore the man behind the curtain for a moment.

With some arbitrary return (let's call it $\tilde{G}$) we have

\begin{aligned} (\tilde{G}_{t})^2 &= (\tilde{R}_{t+1} + \tilde{\gamma}_{t+1} G_{t+1})^2 \\ &= \tilde{R}_{t+1}^2 + 2 \tilde{\gamma}_{t+1} \tilde{R}_{t+1} \tilde{G}_{t+1} + \tilde{\gamma}_{t+1}^{2} (\tilde{G}_{t+1})^2 \\ &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \bar{G_{t+1}} \end{aligned}

With the substitutions

\begin{aligned} \bar{R}_{t+1} &\eqdef \tilde{R}_{t+1}^2 + 2 \tilde{\gamma}_{t+1} \tilde{R}_{t+1} \tilde{G}_{t+1} \\ \bar{\gamma}_{t} &\eqdef \tilde{\gamma}_{t}^{2} \\ \bar{G}_{t} &\eqdef (\tilde{G}_{t})^2 \end{aligned}

Which looks pretty much like the standard recursive equation for the return, $G_{t} \eqdef R_{t+1} + \gamma_{t+1} G_{t+1}$. It's even practically useful4.

## The Second Moment of the δ-return

Now, the promised fun. Consider the second moment of the δ-return:

\begin{aligned} \Exp{(G_{t}^{\lambda} (\delta))^2} &= \Exp{ (G_{t}^{\lambda} - v(S_{t}))^2 } \\ &= \Exp{ \Big( \sum_{n=0}^\infty \delta_{t+n} \prod_{k=1}^{n-1} \gamma_{t+1} \lambda_{t+1} \Big)^2 } \end{aligned}

Wait, I think that $\Exp{ (G_{t}^{\lambda} - v(S_{t}))^2 }$ shows up somewhere in the literature. Oh, it's the mean-squared error!

Well that's pretty neat. When I first realized this I thought that we might be able to use this to achieve a truly error-free algorithm, because if you can get a handle on the errors you're making it's possible to reduce them.
It turns out that this is hard in practice, because we can't reliably estimate the δ-return. To see why this is a problem, we can plug in for the second moment of the δ-return.

### Bellman equation for the δ-return's second moment

Setting $\tilde{G}_{t} = (G_{t}^{\lambda} - v(S_{t})$ (and therefore $\tilde{R}_{t+1} = \delta_{t}$) gives us the following:



Where we make the following substitutions:



To make this work, we need to have a method of estimating $\Exp{G_{t}^{\lambda} - v(S_{t})}$, but if our value function has converged, then that quantity is zero.

So we end up with just $\bar{R}_{t+1} = \delta_{t}^2$, throwing away all of the "cross-terms" like $2 \gamma_{t+1} \lambda_{t+1} \delta_{t+1} (G_{t} - v(S_{t}))$. These terms might make a substantial contribution to the second moment.

### Delta-Squared Return for Variance

So under LFA we are effectively approximating not the second moment of the δ-return, first moment of the $\delta^2$ return (the discounted sum of the squared TD-errors).

Depending on how much the cross-terms affect the second moment, this might still be useful.

In the case that $\Exp{\delta_{t}} = 0$, then in expectation the "cross term" goes away, and the second moment of the δ-return and the expected value of the $\delta^2$ return are the same thing. Note that it may still be nonzero.

So if $\operatorname{MSE} = \operatorname{Bias}^2 + \operatorname{Variance}$, and our estimates are unbiased, then what remains is the variance. This is still pretty neat-- in the tabular case, or when our value function is exact (or close enough to exact), then estimating the $\delta^2$-return is the same as estimating the variance.

## Final Remarks

It may be possible to estimate the second moment of the $\delta$-return via using a different function approximation scheme for each learner, although the utility of this is questionable-- ultimately, we tend to assume that getting the value funciton correct is the highest priority, and so if you have extra representational power you should probably allocate it there. Using a neural net or other adaptive representation might also be a way forward, and the behavior of such models in estimating the δ-return might be interesting in its own rihgt.

But it's possible that knowing the variance is its own reward (bah-dum-tssh), because you could use it to help drive exploration, set learning rates, or use it to modify the policy to prefer more reliable actions.

In that case, depending on how biased your value function is, the second moment of the $\delta$-return can be well-approximated by tracking only $\delta^2$, or you could go with the second moment methods mentioned previously5.

Either way, I expect that investigating meta-returns will yield similarly fun posts (or perhaps even papers!) in the future.

1. You can just pretend that γ and λ are constant if you don't want to consider GVFs, I'm mainly just showing that this works even under that setting.

2. Where the subscript $\pi$ means we're taking the expectation w/r/t the stationary distribution induced by following the policy $\pi$ that we wish to evaluate, conditioned on following that policy in subsequent states.

3. For TD(λ) it's more that $\Exp{\delta_{t} \vec{e}_{t}}$ is zero at convergence, but in general you're going to see that (under linear function approximation with the same features and same γ, λ) that the weights for this second approximator should be zero in expectation.

4. White & White use this equation to estimate the second moment (and therefore the variance) online with a nice algorithm they call Variance TD, and use it attempt to optimize λ w/r/t the bias-variance trade-off (although you can avoid that and just estimate the variance); Tamar et. al. provide a couple of algorithms (TD(0) and LSTD variance) for estimating the variance.

5. Those methods also rely on the value function in order to estimate variance and to compute their updates, so if your value function sucks you're not going to have a good idea of what your variance is. I have a short "proof" that the $δ^2$-return comes up with the same estimate of the variance as the second moment algorithms, but it's pending a check because I think one of the steps is a bit suspect.

# Fun With The TD-Error

One of the fundamental ideas in reinforcement learning is the temporal difference (TD) error, which is the difference between the value of the current state and the reward received plus the discounted value of the next state. That may sound abstract, but effectively it's what you expected minus what you actually got and what you're expecting next. Okay, that still sounds incomprehensible to anyone who's not already familiar with RL, so instead here's an equation.

$$\delta_{t} = R_{t} + \gamma v(S_{t+1}) - v(S_{t})$$

• $R_{t}$ denotes the reward we got at time $t$.
• $v(s)$ denotes how much future reward we're expecting in state $s$.
• $\gamma$ is the so-called discount factor, which is the way mathematicians say 'a bird in the hand is worth two in the bush'.

Anyways, the usual goal of RL is to make our predictions about the future as accurate as possible. It's surprisingly doable, because even when we don't know what's coming in the future, we can still take steps to make our predictions about what we do know as accurate as possible. By adjusting our models so that the TD-error gets smaller over time, we learn to predict how much reward we'll receive in the future without ever having to keep track of more than the current state, the next state, and the reward we received from transitioning from one to the other.

In fact, subject to some technical conditions, we can prove that the process of minizing the TD-error actually will lead to us learning the overall value function accurately1. I will not go into that in detail, but remark that one of the big assumptions we tend to work with is that the environment is a Markov Decision Process.

## Solving The Bellman Equation Analytically

All this background is getting boring, so I'm just going to skip ahead and trust that the only people who find this site tend to have a background in reinforcement learning.

Assume you've got a transition matrix $\vec{P}$, a vector $\vec{r}$ that describes the expected rewards in each state. Let the value function can be represented as a vector, $\vec{v}$

\begin{align} \vec{v}_{(i)} &= \Exp{G_t | S_t = s_i} \\ \vec{v} &= \vec{r} + \gamma \vec{P} \vec{v} \\ &= \vec{r} + \gamma \vec{P} \vec{r} + (\gamma \vec{P})^2 \vec{r} + \cdots \\ \vec{v}(\vec{I} - \gamma \vec{P}) &= \vec{r} \\ \vec{v} &= (\vec{I} - \gamma \vec{P})^{-1} \vec{r} \end{align}

Lines 3-4 used the matrix version of the geometric sum. Assuming that matrix inversion step was valid (which is usually the case), we have a way of solving for the value function in general.

## Sum of Discounted Deltas

Suppose now that we have $\hat{v}$ an approximate value function. It may have been learned via some sort of learning algorithm, or it may be completely random numbers, but regardless, we have it.

Now, suppose for some reason we want to know about the discounted sum of future deltas2.

The value function is fixed and we have the expected reward vector from before, so if we let $\vec{d}_{(i)} = \mathbb{E}\Bigg[ \delta_{t} \Bigg\vert S_t = s_i\Bigg]$, we can write it as

$$\vec{d} = \vec{r} + \gamma \vec{P} \hat{\vec{v}} - \hat{\vec{v}}$$

Note that we're using $\hat{\vec{v}}$, which is different from the "true" $\vec{v}$ that we got from solving the Bellman equation.

Now we go for the infinite sum:

\begin{align} \vec{d} + \gamma \vec{P} \vec{d} + (\gamma \vec{P})^2 \vec{d} + \cdots &= (\vec{I} - \gamma \vec{P})^{-1} \vec{d} \\ &= (\vec{I} - \gamma \vec{P})^{-1} (\vec{r} + \gamma \vec{P} \hat{\vec{v}} - \hat{\vec{v}}) \\ &= (\vec{I} - \gamma \vec{P})^{-1} (\vec{r} - (\vec{I} - \gamma \vec{P}) \hat{\vec{v}}) \\ &= (\vec{I} - \gamma \vec{P})^{-1} \vec{r} - (\vec{I} - \gamma \vec{P})^{-1} (\vec{I} - \gamma \vec{P}) \hat{\vec{v}} \\ &= (\vec{I} - \gamma \vec{P})^{-1} \vec{r} - \hat{\vec{v}} \\ &= \vec{v} - \hat{\vec{v}} \end{align}

Amazing. Well, maybe you've seen the result before (I've learned that it shows in a few places) but I think it's kinda neat that the difference between the true value, $\vec{v}$, and our approximate value function, $\hat{\vec{v}}$, is also encoded in the series of TD-errors the same way that the value was encoded in the rewards.

Of course, we can see this in another way:

\begin{align} G_{t} - \hat{v}(S_t) &= \Bigg( \sum_{k=1}^{\infty} R_{t+k} \gamma^{k-1} \Bigg) - \hat{v}(S_t) \\ &= R_{t+1} + \gamma \hat{v}(S_{t+1}) - \hat{v}(S_t) - \gamma \hat{v}(S_{t+1}) + \gamma \Bigg( \sum_{k=1}^{\infty} R_{t+k+1} \gamma^{k-1} \Bigg) \\ &= \delta_{t} + \gamma \Big( G_{t+1} - \hat{v}(S_{t+1}) \Big) \\ &= \sum_{k=0}^{\infty} \delta_{t+k} \gamma^{k} \end{align}

If you wanted to find a use for this, you might try using a cascade of algorithms (the first learns the value, the second learns the error in the first's approximation, and so on) to reduce error3. Or you can point to this as a semi-rigorous explanation of why monitoring the TD-error alone gives you an idea of how well your algorithm is doing in general.

But in any case, the math was fun, which is really all I promised in the title of this post.

1. How accurately we learn the value function, of course, is its own topic of discussion. Largely, it depends on how accurately we can perceive the state space. If we assume complete omniscience, then asymptotically our accuracy will be perfect. If we fall short of that, then our accuracy suffers, but we we'll still do about as well the best-case scenario that our diminished perception can allow.

2. For example, suppose you are writing a thesis where this expression comes up and you're curious about it. OK, maybe that's a reason particular to my case.

3. You would likely have to use different representations for each (the reasons for that would require another post).