Escaping Batch Country

May 4, 2018

reinforcement learning, temporal difference learning, math

This post sat in my “drafts” folder for many moons, mostly done, so I finished it and then threw it online. There are probably lacunae where I meant to explain something, didn’t leave a note in the draft, and then failed to notice. However, you can now email me to tell me about it– (nevermind, I can’t access that account anymore) –for the handful of readers, dozens of webcrawlers, and hundreds of automated vulnerability scanners. Lots of people looking for my wp-login.php page¹.

This post is based on a talk that I gave a while ago. In essence, there’s a way of approximating functions of returns using online algorithms (such as TD(λ)).

Rather than just posting the slides, I’ll recap² and mention some more technical details from that talk (although the slides are available here).

Batch Country #

I’ve been using the term “batch country” to refer to problems/approximation targets where online algorithms cannot be used. If you’re estimating the mean of a time series (say,X_{1}, X_{2}, \ldots ), then you can update your estimate when new data becomes available:

\mu_{t+1} = \frac{t \mu_{t} + X_{t}}{t+1}

Here, we’re implictly relying on the linearity of expectation; if linearity doesn’t apply, you either have to learn from batches of data or perhaps employ an approximation.

For example, predicting the log of a time series,\log( X_{1} + X_{2} + \cdots + X_{T} ). It doesn’t dissociate into individual terms that can be aggregated every time-step, so if you want to estimate it you need to collectX_{1} throughX_{T} and only then can you update.

Why Online Algorithms? #

Online algorithms in general and reinforcement learning in particular are preferable to batch methods for a variety of reasons. The fact that your estimates are always current is helpful if you’re actively using the predictions, e.g., robot control or game playing. Additionally, updating over time means you amortize the cost of learning– batch methods can waste a lot of available CPU time while waiting for enough new data to become available.

In reinforcement learning, the temporal difference methods often crushes value/policy iteration for these reasons (learning more in a shorter time with less computation) and they’re more widely applicable. So it’s nice to be able to use an online algorithm.

TD(λ) with accumulating traces is such an algorithm, which uses the following update rules:

\begin{aligned} e_{t} &= x_{t} + \gamma_{t} \lambda_{t} e_{t-1} \\\\ \delta_{t} &= R_{t+1} + \gamma_{t+1} w_{t}^{\top} x_{t+1} - w_{t}^{\top} x_{t} \\\\ w_{t+1} &= w_{t} + \alpha_{t} \delta_{t} e_{t} \end{aligned}

When TD-methods break down #

The big thing that Markov Decision Problems have going for them is, unsurprisingly, the Markov property.

A technical way of phrasing that crucial attribute is that they are “memoryless”, in the sense that knowing the history of the system provides you no more information into its future evolution than if you just knew the current state. So if you know the current state of the system, then have just as much predictive power as if you knew the previous state, and the previous-previous state, etc.

Furthermore, it ensures that the return (our approximation target) has a nice recursive structure.

\begin{aligned} G_{t} &= R_{t+1} + \gamma G_{t+1} \\\\ v(S_{t}) &\approx \Exp{G_{t}} &= \Exp{R_{t+1} + \gamma G_{t+1} } \\\\ \Rightarrow v(S_{t}) &\approx \Exp{R_{t+1}} + \gamma v(S_{t+1}) \\\\ \Rightarrow \nabla v &= R_{t+1} + \gamma v(S_{t+1}) - v(S_{t}) \end{aligned}

For an online algorithm to work, we need to have enough information to compute the gradient for our estimator at each point in time.

Possible Exits from Batch Country #

We have previously looked at, where we note that we can estimate moments of the return (e.g.,\mathbb{E}[G_{t}^{n}], forn=1, 2, 3, \ldots. Forn=1, you have the expected return, forn=2, you get the second moment (which can be used to compute the variance), and so on³.

Here’s an expression for the second moment of the return as a pseudo-Bellman equation:

\begin{aligned} G_{t}^{2} &= (R_{t+1} + \gamma G_{t+1})^{2} \\\\ &= R_{t+1}^{2} + 2 \gamma R_{t+1} G_{t+1} + \gamma^{2} G_{t+1}^{2} \\\\ &= R^{SM}_{t+1} + \gamma^{2} G_{t+1}^{2} \end{aligned}

The second moment seems like it can’t be approximated online, because we have to wait forG_{t+1} to come in before we can computeR^{SM}_{t+1}.

However, if we approximateG_{t+1} byv(S_{t+1}), suddenly we do have a Bellman equation:

\begin{aligned} R^{SM}_{t+1} \approx \bar{R}_{t+1} &= R_{t+1}^{2} + 2 \gamma_{t+1} R_{t+1} v(S_{t+1}) \\\\ G_{t}^{2} \approx \hat{G}^{2}_{t} &= \bar{R}_{t+1} + \gamma_{t+1}^{2} \hat{G}^{2}_{t+1} \\\\ &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \hat{G}^{2}_{t+1} \end{aligned}

Where\bar{\gamma} and\bar{R} are the discount factor and rewards for this return we’ve built.

You can then use TD(λ) or whatever other algorithm you prefer to estimate the value of our newly-defined return for the second moment

For example, we could extend the TD(λ) algorithm above to get an algorithm that estimates\Exp{G_{t}} and\Exp{G_{t}^{2}}

\begin{aligned} \bar{R}_{t+1} &= R_{t+1}^{2} + 2 \gamma R_{t+1} w_{t}^{\top} x_{t+1} \\\\ \bar{\gamma}_{t+1} &= \gamma_{t}^{2} \\\\ \bar{e}_{t} &= x_{t} + \bar{\gamma}_{t} \lambda_{t} e_{t-1} \\\\ \bar{\delta}_{t} &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \bar{w}_{t}^{\top} x_{t+1} - \bar{w}_{t}^{\top} x_{t} \\\\ \bar{w}_{t+1} &= \bar{w}_{t} + \alpha_{t} \bar{\delta}_{t} \bar{e}_{t} \end{aligned}

We would then have\Exp{G_{t}} \approx v(s) = w^{\top} x(s) and\Exp{G_{t}^{2}} \approx v^{(2)}(s) = \bar{w}^{\top} x(s).

Higher Moments #

We can estimate the first and second moments, so the natural course of action is to proceed by induction to estimate higher moments.

For example, we haveG^{3}_{t}.

\begin{aligned} G_{t}^{3} &= (R_{t+1} + \gamma G_{t+1})^{3} \\\\ &= R_{t+1}^{3} + 3 \gamma R_{t+1}^{2} G_{t+1} + 3 \gamma^{2} R_{t+1} G_{t+1}^{2} + \gamma^{3} G_{t+1}^{3} \\\\ &\approx R_{t+1}^{3} + 3 \gamma R_{t+1}^{2} v(S_{t+1}) + 3 \gamma^{2} R_{t+1} v^{(2)}(S_{t+1}) + \gamma^{3} G_{t+1}^{3} \end{aligned}

And in general:

\begin{aligned} G_{t}^{n} &= \sum_{k=0}^{n} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} G_{t+1}^{k} \\\\ &\approx \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v^{(k)}(S_{t+1}) + \gamma^{n} G_{t+1}^{n} \end{aligned}

Remarks #

In practice, estimating higher moments does not really work out so well. Under linear function approximation, we can actually prove convergence for arbitrary moments, but in practice the fixed-point can be wildly different from the true value. The errors in each level of approximation tend to get magnified (both in terms of the approximation error and the variance of the stochastic approximation).

Furthermore, the various moments of the return are a set of measure zero in the space of “all functions that could be applied to the return”⁴.

However, to paraphrase a wise man, “there’s nothing more dangerous than a man in the depths of an ether binge, except perhaps a theoretician with access to basic calculus”.

Escaping Batch Country #

We have, for a suitable well-behaved function⁵.

f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x - a)^{k}

We then consider functions of a random variable. Can we apply a Taylor series approximation? The answer is yes, so long as the random variable is also reasonably well-behaved⁶.

So we get:

\begin{aligned} \mathbb{E} [ f(X) ] &= \mathbb{E} [ f( \mu + (X - \mu) )] \\\\ &= \mathbb{E} [f(\mu) + \sum_{k=1}^{\infty} \frac{1}{k!} f^{(k)}(\mu) (X - \mu)^{k} ] \end{aligned}

Where\mu = \Exp{X} is the mean ofX.

We might reasonably get away with a second-order approximation:

\begin{aligned} \mathbb{E} [ f(X) ] &= \mathbb{E} [f(\mu) + \sum_{k=1}^{\infty} \frac{1}{k!} f^{(k)}(\mu) (X - \mu)^{k} ] \\\\ &\approx \mathbb{E}[ f(\mu) + f'(\mu) ( X - \mu ) + \frac{1}{2} f''(\mu) (X - \mu)^{2} ] \\\\ &= \mathbb{E}[ f(\mu) ] + \mathbb{E}[ f'(\mu) ( X - \mu ) ] + \mathbb{E}[\frac{1}{2} f''(\mu) (X - \mu)^{2} ] \\\\ &= f(\mu) + \mathbb{E}[ f'(\mu) ] \mathbb{E} [ ( X - \mu ) ] + f''(\mu) \mathbb{E}[\frac{1}{2} (X - \mu)^{2} ] \\\\ &= f(\mu) + \frac{1}{2} f''(\mu) \sigma^{2} \end{aligned}

Where we use the fact that\Exp{f(\mu)} is constant,\Exp{X - \mu} = 0, and that\Exp{(X - \mu)^{2}} = \sigma^{2}, the variance ofX.

In theory, things are looking good #

So we have a means of expressing a function of the return in terms of the moments of the return, and we have a method of estimating those moments. In theory, we can then express any function of the return.

\begin{aligned} \mathbb{E} [ f(G_{t}) | S_{t} = s] &= \mathbb{E} \Bigg[ f(v(s)) + \sum_{k=2}^{\infty} \frac{1}{k!} f^{(k)}(v(s)) (G_{t} - v(s))^{k} \Bigg | S_{t} = s \Bigg] \end{aligned}

It’s even pretty efficient to compute.

For eachn, the central momentu^{(n)}(s) = \Exp{(G_{t} - v(S_{t})^{n} | S_{t} = s} can be expressed as:

\begin{aligned} u^{(n)}(s) &= \Exp{(G_{t} - v(s))^{n} | S_{t} = s} \\\\ &= \mathbb{E} \Bigg[ \sum_{k=0}^{n} \binom{n}{k} (-v(s))^{n-k} G_{t}^{k} \Bigg| S_{t} = s \Bigg] \\\\ &\approx \sum_{k=0}^{n} \binom{n}{k} (-v(s))^{n-k} v^{(k)}(s) \\\\ &= (-1)^{n-1} n v^{n}(s) + \sum_{k=2}^{n} \binom{n}{k}(-1)^{n-k} v^{n-k}(s) v^{(k)}(s) \end{aligned}

To computen central moments ofG_{t}, we have to evaluaten value functions, with the rest of the multiplication and summing of terms in the binomial being relatively cheap. We then have to compute the derivatives off atv(s) up tof^{(n)}, which is probably not too expensive for typical choices off.

We could have a whole bunch of functions,f_{1}, \ldots, f_{z}, and then things become even more efficient. We can memoize the(G_{t} - v(S_{t}))^{n} coefficients and reuse them when calculating the Taylor expansion for eachf_{i}.

Is this at all useful in practice? #

It depends.

First, let’s examine whether the assumptions we’re making regarding the Taylor approximation can be expected to hold in practice– first, thatf(\cdot) is well-behaved, and second, that the moments ofG_{t} are finite.

It’s not clear to me that all interesting functions are well-behaved, but perhaps all interesting functions have an analytic approximation that is suitable for our purposes. So, maybe this is true in practice. You probably want to choose functions where the each subsequent derivative is not growing too rapidly, though.
It’s also unclear whether a typical return will have finite moments, but we can assert that it’s true if the reward is bounded.

We have thatR_{t}^{(n)} (the “reward” we use when constructing the approximation target for then-th moment) is:

R_{t}^{(n)} = \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v^{(k)}(S_{t+1})

Suppose the base reward is bounded, then we have a means of bounding the higher rewards as well.

Let\max | R | = R_{\max}, so|G_{t}| \leq \frac{R_{\max}}{1 - \gamma}, which implies that we can clip the value function so that|v(s)| \leq \max \frac{R_{\max}}{1 - \gamma}. Then for the second moment, we have

\begin{aligned} R_{t}^{(2)} &= R_{t+1}^{2} + 2 \gamma R_{t+1} v(S_{t+1}) \\\\ &\leq R_{\max}^{2} + 2 R_{\max} \frac{R_{\max}}{1 - \gamma} \\\\ &= R_{\max}^{2} \frac{3 - \gamma}{1 - \gamma} \\\\ &= R_{\max}^{(2)} \\\\ G_{t}^{2} &\leq \frac{R_{\max}^{(2)}}{1-\gamma^{2}} \end{aligned}

We observe that the higher moments are all bound⁷, although depending on the value of\gamma this may be cold comfort.

So by engineer’s induction, if the base reward is bound, the higher moments are bound as well.

If your rewards are not bounded, then asserting that the higher moments are bounded becomes harder.

I’ve run a few experiments to test this, and the results are somewhat mixed. If the discount factor is high \gamma \approx 1), then the higher moments tend to dominate for a while before the factorials in the denominator eventually remove those terms from consideration.

From an implementation point-of-view, we also have to worry about how good our estimate for the various moments will be under function approximation. For example, we might have a representation that is well-suited tov(\cdot), but performs poorly when we try to use it for approximatingv^{(2)}(\cdot).

Final Remarks #

Approximating arbitrary functions of the return with an online algorithm sounds like a lot of fun, but the conditions under which it’s feasible are pretty rough.

Further Work #

I’d like to take a closer look at what the dominant terms are, and maybe do some experiments to see how many moments we would need to evaluate to get a reasonable estimate.

I tried characterizing when this is possible in detail, and deriving error bounds based on Taylor’s theorem, but it really gets ugly quickly. At some point, however, it might be worth pushing through the pain to see if anything neat falls out, or at least getting the chance to behold some hilariously worthless “guarantees”.

References #

I have cleverly moved it to secret-wp-login.php to prevent hackers from being able to install WordPress on my static site. ↩︎
I omit the introduction, which contains a couple of lousy puns and a bit of elementary stuff on RL/MDP theory, since this is the Internet where both are available in huge quantities. ↩︎
We also derived some rather impractical expressions for fractional returns, which would be things likeG_{t}^{1/2}, for example. ↩︎
Although they might occupy non-trivial volume of the space of “functions that you’d actually want to apply to the return”, so that’s nice. ↩︎
In general you wantf to be smooth, i.e., infinitely differentiable, but you might be able to loosen the requirements if you have some idea what valuesx can take. ↩︎
Here we need the moments ofX to be finite, although you could potentially loosen the requirements if you knew more aboutf. ↩︎
I suppose thatG_{t}^{n} is bound by something likeO(R^{n}/\prod_{i=k}^{n}(1-\gamma)^{k} ) but I haven’t looked deeply into it. ↩︎