Escaping Batch Country
This post sat in my "drafts" folder for many moons, mostly done, so I finished it and then threw it online.
There are probably lacunae where I meant to explain something, didn't leave a note in the draft, and then failed to notice.
However, you can now email me to tell me about it rldotaiblog@gmail.com for the handful of readers, dozens of webcrawlers, and hundreds of automated vulnerability scanners.
Lots of people looking for my wplogin.php
page^{1}.
This post is based on a talk that I gave a while ago. In essence, there's a way of approximating functions of returns using online algorithms (such as TD(λ)).
Rather than just posting the slides, I'll recap^{2} and mention some more technical details from that talk (although the slides are available here).
Batch Country
I've been using the term "batch country" to refer to problems/approximation targets where online algorithms cannot be used. If you're estimating the mean of a time series (say, $X_{1}, X_{2}, \ldots $), then you can update your estimate when new data becomes available:
$$ \mu_{t+1} = \frac{t \mu_{t} + X_{t}}{t+1} $$
Here, we're implictly relying on the linearity of expectation; if linearity doesn't apply, you either have to learn from batches of data or perhaps employ an approximation.
For example, predicting the log of a time series, $\log( X_{1} + X_{2} + \cdots + X_{T} )$. It doesn't dissociate into individual terms that can be aggregated every timestep, so if you want to estimate it you need to collect $X_{1}$ through $X_{T}$ and only then can you update.
Why Online Algorithms?
Online algorithms in general and reinforcement learning in particular are preferable to batch methods for a variety of reasons. The fact that your estimates are always current is helpful if you're actively using the predictions, e.g., robot control or game playing. Additionally, updating over time means you amortize the cost of learning batch methods can waste a lot of available CPU time while waiting for enough new data to become available.
In reinforcement learning, the temporal difference methods often crushes value/policy iteration for these reasons (learning more in a shorter time with less computation) and they're more widely applicable. So it's nice to be able to use an online algorithm.
TD(λ) with accumulating traces is such an algorithm, which uses the following update rules:
$$ \begin{aligned} e_{t} &= x_{t} + \gamma_{t} \lambda_{t} e_{t1} \\ \delta_{t} &= R_{t+1} + \gamma_{t+1} w_{t}^{\top} x_{t+1}  w_{t}^{\top} x_{t} \\ w_{t+1} &= w_{t} + \alpha_{t} \delta_{t} e_{t} \end{aligned} $$
When TDmethods break down
The big thing that Markov Decision Problems have going for them is, unsurprisingly, the Markov property.
A technical way of phrasing that crucial attribute is that they are "memoryless", in the sense that knowing the history of the system provides you no more information into its future evolution than if you just knew the current state. So if you know the current state of the system, then have just as much predictive power as if you knew the previous state, and the previousprevious state, etc.
Furthermore, it ensures that the return (our approximation target) has a nice recursive structure.
$$ \begin{aligned} G_{t} &= R_{t+1} + \gamma G_{t+1} \\ v(S_{t}) &\approx \Exp{G_{t}} &= \Exp{R_{t+1} + \gamma G_{t+1} } \\ \Rightarrow v(S_{t}) &\approx \Exp{R_{t+1}} + \gamma v(S_{t+1}) \\ \Rightarrow \nabla v &= R_{t+1} + \gamma v(S_{t+1})  v(S_{t}) \end{aligned} $$
For an online algorithm to work, we need to have enough information to compute the gradient for our estimator at each point in time.
Possible Exits from Batch Country
We have previously looked at, where we note that we can estimate moments of the return (e.g., $\mathbb{E}[G_{t}^{n}]$, for $n=1, 2, 3, \ldots$. For $n=1$, you have the expected return, for $n=2$, you get the second moment (which can be used to compute the variance), and so on^{3}.
Here's an expression for the second moment of the return as a pseudoBellman equation:
$$ \begin{aligned} G_{t}^{2} &= (R_{t+1} + \gamma G_{t+1})^{2} \\ &= R_{t+1}^{2} + 2 \gamma R_{t+1} G_{t+1} + \gamma^{2} G_{t+1}^{2} \\ &= R^{SM}_{t+1} + \gamma^{2} G_{t+1}^{2} \end{aligned} $$
The second moment seems like it can't be approximated online, because we have to wait for $G_{t+1}$ to come in before we can compute $R^{SM}_{t+1}$.
However, if we approximate $G_{t+1}$ by $v(S_{t+1})$, suddenly we do have a Bellman equation:
$$ \begin{aligned} R^{SM}_{t+1} \approx \bar{R}_{t+1} &= R_{t+1}^{2} + 2 \gamma_{t+1} R_{t+1} v(S_{t+1}) \\ G_{t}^{2} \approx \hat{G}^{2}_{t} &= \bar{R}_{t+1} + \gamma_{t+1}^{2} \hat{G}^{2}_{t+1} \\ &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \hat{G}^{2}_{t+1} \end{aligned} $$
Where $\bar{\gamma}$ and $\bar{R}$ are the discount factor and rewards for this return we've built.
You can then use TD(λ) or whatever other algorithm you prefer to estimate the value of our newlydefined return for the second moment
For example, we could extend the TD(λ) algorithm above to get an algorithm that estimates $\Exp{G_{t}}$ and $\Exp{G_{t}^{2}}$
$$ \begin{aligned} \bar{R}_{t+1} &= R_{t+1}^{2} + 2 \gamma R_{t+1} w_{t}^{\top} x_{t+1} \\ \bar{\gamma}_{t+1} &= \gamma_{t}^{2} \\ \bar{e}_{t} &= x_{t} + \bar{\gamma}_{t} \lambda_{t} e_{t1} \\ \bar{\delta}_{t} &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \bar{w}_{t}^{\top} x_{t+1}  \bar{w}_{t}^{\top} x_{t} \\ \bar{w}_{t+1} &= \bar{w}_{t} + \alpha_{t} \bar{\delta}_{t} \bar{e}_{t} \end{aligned} $$
We would then have $\Exp{G_{t}} \approx v(s) = w^{\top} x(s)$ and $\Exp{G_{t}^{2}} \approx v^{(2)}(s) = \bar{w}^{\top} x(s)$.
Higher Moments
We can estimate the first and second moments, so the natural course of action is to proceed by induction to estimate higher moments.
For example, we have $G^{3}_{t}$.
$$ \begin{aligned} G_{t}^{3} &= (R_{t+1} + \gamma G_{t+1})^{3} \\ &= R_{t+1}^{3} + 3 \gamma R_{t+1}^{2} G_{t+1} + 3 \gamma^{2} R_{t+1} G_{t+1}^{2} + \gamma^{3} G_{t+1}^{3} \\ &\approx R_{t+1}^{3} + 3 \gamma R_{t+1}^{2} v(S_{t+1}) + 3 \gamma^{2} R_{t+1} v^{(2)}(S_{t+1}) + \gamma^{3} G_{t+1}^{3} \end{aligned} $$
And in general:
$$ \begin{aligned} G_{t}^{n} &= \sum_{k=0}^{n} \binom{n}{k} R_{t+1}^{nk} \gamma^{k} G_{t+1}^{k} \\ &\approx \sum_{k=0}^{n1} \binom{n}{k} R_{t+1}^{nk} \gamma^{k} v^{(k)}(S_{t+1}) + \gamma^{n} G_{t+1}^{n} \end{aligned} $$
Remarks
In practice, estimating higher moments does not really work out so well. Under linear function approximation, we can actually prove convergence for arbitrary moments, but in practice the fixedpoint can be wildly different from the true value. The errors in each level of approximation tend to get magnified (both in terms of the approximation error and the variance of the stochastic approximation).
Furthermore, the various moments of the return are a set of measure zero in the space of "all functions that could be applied to the return"^{4}.
However, to paraphrase a wise man, "there's nothing more dangerous than a man in the depths of an ether binge, except perhaps a theoretician with access to basic calculus".
Escaping Batch Country
We have, for a suitable wellbehaved function^{5}.
$$ f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x  a)^{k} $$
We then consider functions of a random variable. Can we apply a Taylor series approximation? The answer is yes, so long as the random variable is also reasonably wellbehaved^{6}.
So we get:
$$ \begin{aligned} \mathbb{E} [ f(X) ] &= \mathbb{E} [ f( \mu + (X  \mu) )] \\ &= \mathbb{E} [f(\mu) + \sum_{k=1}^{\infty} \frac{1}{k!} f^{(k)}(\mu) (X  \mu)^{k} ] \end{aligned} $$
Where $\mu = \Exp{X}$ is the mean of $X$.
We might reasonably get away with a secondorder approximation:
$$ \begin{aligned} \mathbb{E} [ f(X) ] &= \mathbb{E} [f(\mu) + \sum_{k=1}^{\infty} \frac{1}{k!} f^{(k)}(\mu) (X  \mu)^{k} ] \\ &\approx \mathbb{E}[ f(\mu) + f'(\mu) ( X  \mu ) + \frac{1}{2} f''(\mu) (X  \mu)^{2} ] \\ &= \mathbb{E}[ f(\mu) ] + \mathbb{E}[ f'(\mu) ( X  \mu ) ] + \mathbb{E}[\frac{1}{2} f''(\mu) (X  \mu)^{2} ] \\ &= f(\mu) + \mathbb{E}[ f'(\mu) ] \mathbb{E} [ ( X  \mu ) ] + f''(\mu) \mathbb{E}[\frac{1}{2} (X  \mu)^{2} ] \\ &= f(\mu) + \frac{1}{2} f''(\mu) \sigma^{2} \end{aligned} $$
Where we use the fact that $\Exp{f(\mu)}$ is constant, $\Exp{X  \mu} = 0$, and that $\Exp{(X  \mu)^{2}} = \sigma^{2}$, the variance of $X$.
In theory, things are looking good
So we have a means of expressing a function of the return in terms of the moments of the return, and we have a method of estimating those moments. In theory, we can then express any function of the return.
$$ \begin{aligned} \mathbb{E} [ f(G_{t})  S_{t} = s] &= \mathbb{E} \Bigg[ f(v(s)) + \sum_{k=2}^{\infty} \frac{1}{k!} f^{(k)}(v(s)) (G_{t}  v(s))^{k} \Bigg  S_{t} = s \Bigg] \end{aligned} $$
It's even pretty efficient to compute.
For each $n$, the central moment $u^{(n)}(s) = \Exp{(G_{t}  v(S_{t})^{n}  S_{t} = s}$ can be expressed as:
$$ \begin{aligned} u^{(n)}(s) &= \Exp{(G_{t}  v(s))^{n}  S_{t} = s} \\ &= \mathbb{E} \Bigg[ \sum_{k=0}^{n} \binom{n}{k} (v(s))^{nk} G_{t}^{k} \Bigg S_{t} = s \Bigg] \\ &\approx \sum_{k=0}^{n} \binom{n}{k} (v(s))^{nk} v^{(k)}(s) \\ &= (1)^{n1} n v^{n}(s) + \sum_{k=2}^{n} \binom{n}{k}(1)^{nk} v^{nk}(s) v^{(k)}(s) \end{aligned} $$
To compute $n$ central moments of $G_{t}$, we have to evaluate $n$ value functions, with the rest of the multiplication and summing of terms in the binomial being relatively cheap. We then have to compute the derivatives of $f$ at $v(s)$ up to $f^{(n)}$, which is probably not too expensive for typical choices of $f$.
We could have a whole bunch of functions, $f_{1}, \ldots, f_{z}$, and then things become even more efficient. We can memoize the $(G_{t}  v(S_{t}))^{n}$ coefficients and reuse them when calculating the Taylor expansion for each $f_{i}$.
Is this at all useful in practice?
It depends.
First, let's examine whether the assumptions we're making regarding the Taylor approximation can be expected to hold in practice first, that $f(\cdot)$ is wellbehaved, and second, that the moments of $G_{t}$ are finite.
1. It's not clear to me that all interesting functions are wellbehaved, but perhaps all interesting functions have an analytic approximation that is suitable for our purposes. So, maybe this is true in practice. You probably want to choose functions where the each subsequent derivative is not growing too rapidly, though.
2. It's also unclear whether a typical return will have finite moments, but we can assert that it's true if the reward is bounded.
We have that $R_{t}^{(n)}$ (the "reward" we use when constructing the approximation target for the $n$th moment) is:
$$ R_{t}^{(n)} = \sum_{k=0}^{n1} \binom{n}{k} R_{t+1}^{nk} \gamma^{k} v^{(k)}(S_{t+1}) $$
Suppose the base reward is bounded, then we have a means of bounding the higher rewards as well.
Let $\max  R  = R_{\max}$, so $G_{t} \leq \frac{R_{\max}}{1  \gamma}$, which implies that we can clip the value function so that $v(s) \leq \max \frac{R_{\max}}{1  \gamma}$. Then for the second moment, we have
$$ \begin{aligned} R_{t}^{(2)} &= R_{t+1}^{2} + 2 \gamma R_{t+1} v(S_{t+1}) \\ &\leq R_{\max}^{2} + 2 R_{\max} \frac{R_{\max}}{1  \gamma} \\ &= R_{\max}^{2} \frac{3  \gamma}{1  \gamma} \\ &= R_{\max}^{(2)} \\ G_{t}^{2} &\leq \frac{R_{\max}^{(2)}}{1\gamma^{2}} \end{aligned} $$
We observe that the higher moments are all bound^{7}, although depending on the value of $\gamma$ this may be cold comfort.
So by engineer's induction, if the base reward is bound, the higher moments are bound as well.
If your rewards are not bounded, then asserting that the higher moments are bounded becomes harder.
I've run a few experiments to test this, and the results are somewhat mixed. If the discount factor is high ($\gamma \approx 1$), then the higher moments tend to dominate for a while before the factorials in the denominator eventually remove those terms from consideration.
From an implementation pointofview, we also have to worry about how good our estimate for the various moments will be under function approximation. For example, we might have a representation that is wellsuited to $v(\cdot)$, but performs poorly when we try to use it for approximating $v^{(2)}(\cdot)$.
Final Remarks
Approximating arbitrary functions of the return with an online algorithm sounds like a lot of fun, but the conditions under which it's feasible are pretty rough.
Further Work
I'd like to take a closer look at what the dominant terms are, and maybe do some experiments to see how many moments we would need to evaluate to get a reasonable estimate.
I tried characterizing when this is possible in detail, and deriving error bounds based on Taylor's theorem, but it really gets ugly quickly. At some point, however, it might be worth pushing through the pain to see if anything neat falls out, or at least getting the chance to behold some hilariously worthless "guarantees".
References

I have cleverly moved it to
secretwplogin.php
to prevent hackers from being able to install WordPress on my static site. ↩ 
I omit the introduction, which contains a couple of lousy puns and a bit of elementary stuff on RL/MDP theory, since this is the Internet where both are available in huge quantities. ↩

We also derived some rather impractical expressions for fractional returns, which would be things like $G_{t}^{1/2}$, for example. ↩

Although they might occupy nontrivial volume of the space of "functions that you'd actually want to apply to the return", so that's nice. ↩

In general you want $f$ to be smooth, i.e., infinitely differentiable, but you might be able to loosen the requirements if you have some idea what values $x$ can take. ↩

Here we need the moments of $X$ to be finite, although you could potentially loosen the requirements if you knew more about $f$. ↩

I suppose that $G_{t}^{n}$ is bound by something like $O(R^{n}/\prod_{i=k}^{n}(1\gamma)^{k} )$ but I haven't looked deeply into it. ↩