Fun with the TD-Error, Part II
February 17, 2017
This bit of light entertainment emerged from a paper I worked on about estimating variance via the TD-error.
We have previously looked at how the discounted sum of TD-errors is just another way of expressing the bias of our value function. We’ll quickly recapitulate that proof, but extend it to the λ-return and general value functions, because I’m a sadist^{1}.
\newcommand{\eqdef}{\stackrel{\text{def}}{=}} \newcommand{\eqmaybe}{\stackrel{\text{?}}{=}} \newcommand{\Exp}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\operatorname{Var}\left[ #1 \right]}
The δ-return #
We first note the definitions of the TD-error (for timestep t, we denote it as \delta_{t}) and the λ-return, G_{t}^{\lambda}.
\begin{aligned} \delta_{t} &\eqdef R_{t+1} + \gamma_{t+1} v(S_{t+1}) - v(S_{t}) \\\\ G_{t}^{\lambda} &\eqdef R_{t+1} + \gamma_{t+1} \big( (1 - \lambda_{t+1}) v(S_{t+1}) + \lambda_{t+1} G_{t+1}^{\lambda} \big) \end{aligned}
Now we bust out one of our favorite techniques of “expanding a series and grouping terms” (I didn’t say that it was a particularly hard technique).
\begin{aligned} G_{t}^{\lambda} - v(S_{t}) &= R_{t+1} + \gamma_{t+1} \big( (1 - \lambda_{t+1}) v(S_{t+1}) + \lambda_{t+1} G_{t+1}^{\lambda} \big) - v(S_{t}) \\\\ &= R_{t+1} + \gamma_{t+1} v(S_{t+1}) - v(S_{t}) + \gamma_{t+1} \lambda_{t+1} (G_{t+1}^{\lambda} - v(S_{t+1})) \\\\ &= \delta_{t} + \gamma_{t+1} \lambda_{t+1} \delta_{t+1} + \gamma_{t+1} \lambda_{t+1} \gamma_{t+2} \lambda_{t+2} \delta_{t+2} + \cdots \\\\ &= \sum_{n=0}^{\infty} \delta_{t+n} \Big( \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k} \Big) \end{aligned}
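The identity is easy to check numerically. Here is a minimal sketch with a single hypothetical episode, constant γ and λ, and deliberately arbitrary (wrong) value estimates; all the names and numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# One hypothetical episode of length T, with arbitrary (wrong) value
# estimates v(S_t) and constant gamma, lambda -- purely to check the identity.
T = 6
R = rng.normal(size=T)       # rewards R_{t+1} for t = 0..T-1
V = rng.normal(size=T + 1)   # value estimates v(S_t)
V[T] = 0.0                   # terminal state has value zero (gamma is 0 there)
gamma, lam = 0.9, 0.8

# TD errors: delta_t = R_{t+1} + gamma_{t+1} v(S_{t+1}) - v(S_t)
delta = R + gamma * V[1:] - V[:-1]

# lambda-return via the recursive definition (G_{T-1} = R_T at termination)
G = np.zeros(T)
G[T - 1] = R[T - 1]
for t in reversed(range(T - 1)):
    G[t] = R[t] + gamma * ((1 - lam) * V[t + 1] + lam * G[t + 1])

# The identity: G_0^lambda - v(S_0) equals the discounted sum of TD errors
lhs = G[0] - V[0]
rhs = sum(delta[n] * (gamma * lam) ** n for n in range(T))
assert np.isclose(lhs, rhs)
```

With constant γ and λ, the product over γ_{t+k} λ_{t+k} collapses to (γλ)^n, which is what the last line uses.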
Neat! I call this the δ-return, but it’s a special case of defining meta-returns from an MDP plus a learning algorithm.
This implies that you could figure out the bias of your reinforcement learning algorithm (the one estimating the value function) by using a second estimator that tracks the δ-return. The bias (when estimating the λ-return) is defined as \mathbb{E}\_{\pi} [ G_{t}^{\lambda} - v(S_{t}) ]^{2}, so if we take the TD-error as our “reward” (which is the case with \tilde{R}\_{t+1} = \delta\_{t}), then in principle it is no different from estimating the value function v(s) \approx \mathbb{E}\_{\pi} [ G_{t}^{\lambda} \mid S\_t = s ].
However, in practice this may present some unexpected difficulties, mainly because once the original algorithm has converged, \Exp{\delta_{t} \vec{x}_{t}} is either exactly zero or close to it^{3}.
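To make the two-estimator idea concrete, here is a minimal tabular sketch on a hypothetical two-state chain (the chain, rewards, and step sizes are all made up for illustration): a primary TD(0) learner estimates v, and a secondary TD(0) learner treats the primary’s δ as its reward, so it estimates the (λ = 1) δ-return.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state chain with uniform random transitions;
# reward is 1 when the next state is state 1, else 0.
n_states, gamma = 2, 0.9
alpha_v, alpha_b = 0.05, 0.01
v = np.zeros(n_states)   # primary value estimates
b = np.zeros(n_states)   # secondary estimates of the delta-return (the bias)

s = 0
for _ in range(50_000):
    s_next = rng.integers(n_states)
    r = 1.0 if s_next == 1 else 0.0
    # primary learner: ordinary TD(0)
    delta = r + gamma * v[s_next] - v[s]
    v[s] += alpha_v * delta
    # secondary learner: same discount, but its "reward" is delta_t
    delta2 = delta + gamma * b[s_next] - b[s]
    b[s] += alpha_b * delta2
    s = s_next

# True value here is 0.5 / (1 - 0.9) = 5 for both states, and once v has
# converged the per-state mean of delta is zero, so b hovers near zero.
assert abs(v[0] - 5.0) < 1.0
assert np.all(np.abs(b) < 0.5)
```

The final assertions illustrate the difficulty raised above: after the primary learner converges, the secondary learner’s target collapses toward zero, so it mostly tracks noise.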
Bellman Equation for the Second Moment of a Return #
In order for things to get truly fun, we need to take a quick look at Bellman equations for the second moment of a return. Building on previous work by the likes of Sobel, Tamar et al., and White & White, we can actually define a Bellman equation for the second moment of the δ-return.
The second moment of a random variable X is just \Exp{X^2} (the mean, \Exp{X}, is the first moment). Most often it comes up in the context of variance, which is the second central moment of a random variable:
\Var{X} \eqdef \Exp{X^2} - \Exp{X}^2 = \Exp{(X - \Exp{X})^2}
But ignore the man behind the curtain for a moment.
With some arbitrary return (let’s call it \tilde{G}), we have
\begin{aligned} (\tilde{G}\_{t})^2 &= (\tilde{R}\_{t+1} + \tilde{\gamma}\_{t+1} \tilde{G}\_{t+1})^2 \\\\ &= \tilde{R}\_{t+1}^2 + 2 \tilde{\gamma}\_{t+1} \tilde{R}\_{t+1} \tilde{G}\_{t+1} + \tilde{\gamma}\_{t+1}^{2} (\tilde{G}\_{t+1})^2 \\\\ &= \bar{R}\_{t+1} + \bar{\gamma}\_{t+1} \bar{G}\_{t+1} \end{aligned}
With the substitutions
\begin{aligned} \bar{R}\_{t+1} &\eqdef \tilde{R}\_{t+1}^2 + 2 \tilde{\gamma}\_{t+1} \tilde{R}\_{t+1} \tilde{G}\_{t+1} \\\\ \bar{\gamma}\_{t} &\eqdef \tilde{\gamma}_{t}^{2} \\\\ \bar{G}\_{t} &\eqdef (\tilde{G}\_{t})^2 \end{aligned}
This looks pretty much like the standard recursive equation for the return, G_{t} \eqdef R_{t+1} + \gamma_{t+1} G_{t+1}. It’s even practically useful^{4}.
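The recursion above holds pathwise, so it can be sanity-checked on a single sampled trajectory. Here is a small sketch (the episode and its rewards are hypothetical, with constant γ): unrolling the transformed Bellman recursion backwards should reproduce the squared return exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

# One hypothetical episode with constant gamma; check that
# Gbar_t = Rbar_{t+1} + gammabar_{t+1} Gbar_{t+1} reproduces G_t^2, where
# Rbar_{t+1} = R_{t+1}^2 + 2 gamma R_{t+1} G_{t+1} and gammabar = gamma^2.
T, gamma = 5, 0.9
R = rng.normal(size=T)

# ordinary returns, computed backwards (G_T = 0 at termination)
G = np.zeros(T + 1)
for t in reversed(range(T)):
    G[t] = R[t] + gamma * G[t + 1]

# squared returns via the transformed Bellman recursion
Gbar = np.zeros(T + 1)
for t in reversed(range(T)):
    Rbar = R[t] ** 2 + 2 * gamma * R[t] * G[t + 1]
    Gbar[t] = Rbar + gamma ** 2 * Gbar[t + 1]

assert np.allclose(Gbar[:T], G[:T] ** 2)
```

Note that the transformed “reward” \bar{R}_{t+1} depends on the downstream return G_{t+1}, which is exactly the quantity that makes this recursion awkward to use online.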
The Second Moment of the δ-return #
Now, the promised fun. Consider the second moment of the δ-return:
\begin{aligned} \Exp{ \big( G_{t}^{\lambda}(\delta) \big)^2 } &= \Exp{ (G_{t}^{\lambda} - v(S_{t}))^2 } \\\\ &= \Exp{ \Big( \sum_{n=0}^\infty \delta_{t+n} \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k} \Big)^2 } \end{aligned}
Wait, I think that \Exp{ (G_{t}^{\lambda} - v(S_{t}))^2 } shows up somewhere in the literature. Oh, it’s the mean-squared error!
Well, that’s pretty neat. When I first realized this, I thought we might be able to use it to achieve a truly error-free algorithm, because if you can get a handle on the errors you’re making, it’s possible to reduce them. It turns out that this is hard in practice, because we can’t reliably estimate the δ-return. To see why this is a problem, we can plug into the Bellman equation for the second moment of the δ-return.
Bellman equation for the δ-return’s second moment #
Setting \tilde{G}\_{t} = G\_{t}^{\lambda} - v(S\_{t}) (and therefore \tilde{R}\_{t+1} = \delta_{t}) gives us the following:
\begin{aligned} (G_{t}^{\lambda} - v(S_{t}))^2 &= \Bigg( \sum_{n=0}^{\infty} \delta_{t+n} \Big( \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k} \Big) \Bigg)^2 \\\\ &= \delta_{t}^2 + 2 \gamma_{t+1} \lambda_{t+1} \delta_{t} (G_{t+1}^{\lambda} - v(S_{t+1})) + \gamma_{t+1}^{2} \lambda_{t+1}^{2} (G_{t+1}^{\lambda} - v(S_{t+1}))^2 \\\\ &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \bar{G}_{t+1} \end{aligned}
Where we make the following substitutions:
\begin{aligned} \bar{R}_{t+1} &= \delta_{t}^2 + 2 \gamma_{t+1} \lambda_{t+1} \delta_{t} (G_{t+1}^{\lambda} - v(S_{t+1})) \\\\ \bar{\gamma}_{t} &= \gamma_{t}^{2} \lambda_{t}^{2} \\\\ \bar{G}_{t} &= (G_{t}^{\lambda} - v(S_{t}))^2 \end{aligned}
To make this work, we need a method of estimating \Exp{G_{t}^{\lambda} - v(S_{t})}, but if our value function has converged, then that quantity is zero.
So we end up with just \bar{R}\_{t+1} = \delta\_{t}^2, throwing away all of the “cross-terms” like 2 \gamma_{t+1} \lambda_{t+1} \delta_{t} (G_{t+1}^{\lambda} - v(S_{t+1})). These terms might make a substantial contribution to the second moment.
The Delta-Squared Return for Variance #
So under linear function approximation, we are effectively approximating not the second moment of the δ-return, but the first moment of the \delta^2-return (the discounted sum of the squared TD-errors).
Depending on how much the cross-terms affect the second moment, this might still be useful.
In the case that \Exp{\delta_{t}} = 0, the “cross-term” goes away in expectation, and the second moment of the δ-return and the expected value of the \delta^2-return are the same thing. Note that \delta_{t} itself may still be nonzero.
So if \operatorname{MSE} = \operatorname{Bias}^2 + \operatorname{Variance}, and our estimates are unbiased, then what remains is the variance. This is still pretty neat: in the tabular case, or when our value function is exact (or close enough to exact), estimating the \delta^2-return is the same as estimating the variance.
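Here is a minimal check of that last claim in the simplest possible setting (a hypothetical one-shot MRP: a single stochastic reward, then termination, with the value function exact by construction). The expected \delta^2-return then matches the Monte Carlo variance of the return.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical one-shot MRP: reward r ~ Normal(mu, sigma), then terminate.
# The exact value function is v = mu, so E[delta_t] = 0 and the
# delta^2-return should recover the variance of the return.
mu, sigma, n = 1.0, 2.0, 200_000
r = rng.normal(mu, sigma, size=n)

v = mu                             # exact value function
delta = r - v                      # TD error on the terminal transition
mc_variance = np.var(r)            # Monte Carlo variance of the return
d2_return = np.mean(delta ** 2)    # expected delta^2-return

assert np.isclose(mc_variance, d2_return, rtol=1e-2)
```

If v were biased (say v = mu + 0.5), the δ²-return would instead estimate the MSE, i.e. variance plus squared bias, which is the decomposition above.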
Final Remarks #
It may be possible to estimate the second moment of the \delta-return by using a different function approximation scheme for each learner, although the utility of this is questionable: ultimately, we tend to assume that getting the value function correct is the highest priority, and so if you have extra representational power you should probably allocate it there. Using a neural net or other adaptive representation might also be a way forward, and the behavior of such models in estimating the δ-return might be interesting in its own right.
But it’s possible that knowing the variance is its own reward (ba-dum-tssh), because you could use it to help drive exploration, set learning rates, or modify the policy to prefer more reliable actions.
In that case, depending on how biased your value function is, the second moment of the \delta-return can be well-approximated by tracking only \delta^2, or you could go with the second-moment methods mentioned previously^{5}.
Either way, I expect that investigating metareturns will yield similarly fun posts (or perhaps even papers!) in the future.

You can just pretend that γ and λ are constant if you don’t want to consider GVFs; I’m mainly just showing that this works even in the general setting. ↩︎

Where the subscript \pi means we’re taking the expectation w/r/t the stationary distribution induced by following the policy \pi that we wish to evaluate, conditioned on following that policy in subsequent states. ↩︎

For TD(λ) it’s more that \Exp{\delta_{t} \vec{e}_{t}} is zero at convergence, but in general you’re going to see (under linear function approximation with the same features and the same γ and λ) that the weights for this second approximator should be zero in expectation. ↩︎

White & White use this equation to estimate the second moment (and therefore the variance) online with a nice algorithm they call Variance TD, and use it to attempt to optimize λ w/r/t the bias-variance tradeoff (although you can avoid that and just estimate the variance); Tamar et al. provide a couple of algorithms (TD(0) and LSTD variants) for estimating the variance. ↩︎

Those methods also rely on the value function in order to estimate variance and to compute their updates, so if your value function sucks, you’re not going to have a good idea of what your variance is. I have a short “proof” that the \delta^2-return comes up with the same estimate of the variance as the second-moment algorithms, but it’s pending a check because I think one of the steps is a bit suspect. ↩︎