Fun with the TD-Error, Part II
February 17, 2017
This bit of light entertainment emerged from a paper I worked on about estimating variance via the TD-error.
We have previously looked at how the discounted sum of TD-errors is just another way of expressing the bias of our value function. We’ll quickly recapitulate that proof, but extend it to the λ-return and general value functions, because I’m a sadist^{1}.
\newcommand{\eqdef}{\stackrel{\text{def}}{=}} \newcommand{\eqmaybe}{\stackrel{\text{?}}{=}} \newcommand{\Exp}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\operatorname{Var}\left[ #1 \right]}
The δ-return #
We first note the definitions of the TD-error (for timestep t, we denote it as \delta_{t}) and the λ-return, G_{t}^{\lambda}.
\begin{aligned} \delta_{t} &\eqdef R_{t+1} + \gamma_{t+1} v(S_{t+1}) - v(S_{t}) \\\\ G_{t}^{\lambda} &\eqdef R_{t+1} + \gamma_{t+1} \big( (1 - \lambda_{t+1}) v(S_{t+1}) + \lambda_{t+1} G_{t+1}^{\lambda} \big) \end{aligned}
Now we bust out one of our favorite techniques of “expanding a series and grouping terms” (I didn’t say that it was a particularly hard technique).
\begin{aligned} G_{t}^{\lambda} - v(S_{t}) &= R_{t+1} + \gamma_{t+1} \big( (1 - \lambda_{t+1}) v(S_{t+1}) + \lambda_{t+1} G_{t+1}^{\lambda} \big) - v(S_{t}) \\\\ &= R_{t+1} + \gamma_{t+1} v(S_{t+1}) - v(S_{t}) + \gamma_{t+1} \lambda_{t+1} (G_{t+1}^{\lambda} - v(S_{t+1})) \\\\ &= \delta_{t} + \gamma_{t+1} \lambda_{t+1} \delta_{t+1} + \gamma_{t+1} \lambda_{t+1} \gamma_{t+2} \lambda_{t+2} \delta_{t+2} + \cdots \\\\ &= \sum_{n=0}^{\infty} \delta_{t+n} \Big( \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k} \Big) \end{aligned}
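The identity is easy to check numerically. Here is a minimal sketch with a single hypothetical episode, constant γ and λ, and deliberately arbitrary (wrong) value estimates; all the names and numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# One hypothetical episode of length T, with arbitrary (wrong) value
# estimates v(S_t) and constant gamma, lambda -- purely to check the identity.
T = 6
R = rng.normal(size=T)       # rewards R_{t+1} for t = 0..T-1
V = rng.normal(size=T + 1)   # value estimates v(S_t)
V[T] = 0.0                   # terminal state has value zero (gamma is 0 there)
gamma, lam = 0.9, 0.8

# TD errors: delta_t = R_{t+1} + gamma_{t+1} v(S_{t+1}) - v(S_t)
delta = R + gamma * V[1:] - V[:-1]

# lambda-return via the recursive definition (G_{T-1} = R_T at termination)
G = np.zeros(T)
G[T - 1] = R[T - 1]
for t in reversed(range(T - 1)):
    G[t] = R[t] + gamma * ((1 - lam) * V[t + 1] + lam * G[t + 1])

# The identity: G_0^lambda - v(S_0) equals the discounted sum of TD errors
lhs = G[0] - V[0]
rhs = sum(delta[n] * (gamma * lam) ** n for n in range(T))
assert np.isclose(lhs, rhs)
```

With constant γ and λ, the product over γ_{t+k} λ_{t+k} collapses to (γλ)^n, which is what the last line uses.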
Neat! I call this the δ-return, but it’s a special case of defining meta-returns from an MDP plus a learning algorithm.
This implies that you could figure out the bias of your reinforcement learning algorithm (the one estimating the value function) by using a second estimator that tracks the δ-return. The bias (when estimating the λ-return) is defined as \mathbb{E}\_{\pi} [ G_{t}^{\lambda} - v(S_{t}) ]^{2}, so if we take the TD-error as our “reward” (which is the case with \tilde{R}\_{t+1} = \delta\_{t}), then in principle it is no different from estimating the value function v(s) \approx \mathbb{E}\_{\pi} [ G_{t}^{\lambda} \mid S\_t = s ].
However, in practice this may present some unexpected difficulties, mainly because once the original algorithm has converged, \Exp{\delta_{t} \vec{x}_{t}} is either exactly zero or close to it^{3}.
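To make the two-estimator idea concrete, here is a minimal tabular sketch on a hypothetical two-state chain (the chain, rewards, and step sizes are all made up for illustration): a primary TD(0) learner estimates v, and a secondary TD(0) learner treats the primary’s δ as its reward, so it estimates the (λ = 1) δ-return.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state chain with uniform random transitions;
# reward is 1 when the next state is state 1, else 0.
n_states, gamma = 2, 0.9
alpha_v, alpha_b = 0.05, 0.01
v = np.zeros(n_states)   # primary value estimates
b = np.zeros(n_states)   # secondary estimates of the delta-return (the bias)

s = 0
for _ in range(50_000):
    s_next = rng.integers(n_states)
    r = 1.0 if s_next == 1 else 0.0
    # primary learner: ordinary TD(0)
    delta = r + gamma * v[s_next] - v[s]
    v[s] += alpha_v * delta
    # secondary learner: same discount, but its "reward" is delta_t
    delta2 = delta + gamma * b[s_next] - b[s]
    b[s] += alpha_b * delta2
    s = s_next

# True value here is 0.5 / (1 - 0.9) = 5 for both states, and once v has
# converged the per-state mean of delta is zero, so b hovers near zero.
assert abs(v[0] - 5.0) < 1.0
assert np.all(np.abs(b) < 0.5)
```

The final assertions illustrate the difficulty raised above: after the primary learner converges, the secondary learner’s target collapses toward zero, so it mostly tracks noise.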
Bellman Equation for the Second Moment of a Return #
In order for things to get truly fun, we need to take a quick look at Bellman equations for the second moment of a return. Building on previous work by the likes of Sobel, Tamar et al., and White & White, we can actually define a Bellman equation for the second moment of the δ-return.
The second moment of a random variable X is just \Exp{X^2} (the mean, \Exp{X}, is the first moment). Most often it comes up in the context of variance, which is the second central moment of a random variable:
\Var{X} \eqdef \Exp{X^2} - \Exp{X}^2 = \Exp{(X - \Exp{X})^2}
But ignore the man behind the curtain for a moment.
With some arbitrary return (let’s call it \tilde{G}), we have
\begin{aligned} (\tilde{G}\_{t})^2 &= (\tilde{R}\_{t+1} + \tilde{\gamma}\_{t+1} \tilde{G}\_{t+1})^2 \\\\ &= \tilde{R}\_{t+1}^2 + 2 \tilde{\gamma}\_{t+1} \tilde{R}\_{t+1} \tilde{G}\_{t+1} + \tilde{\gamma}\_{t+1}^{2} (\tilde{G}\_{t+1})^2 \\\\ &= \bar{R}\_{t+1} + \bar{\gamma}\_{t+1} \bar{G}\_{t+1} \end{aligned}
With the substitutions
\begin{aligned} \bar{R}\_{t+1} &\eqdef \tilde{R}\_{t+1}^2 + 2 \tilde{\gamma}\_{t+1} \tilde{R}\_{t+1} \tilde{G}\_{t+1} \\\\ \bar{\gamma}\_{t} &\eqdef \tilde{\gamma}_{t}^{2} \\\\ \bar{G}\_{t} &\eqdef (\tilde{G}\_{t})^2 \end{aligned}
This looks pretty much like the standard recursive equation for the return, G_{t} \eqdef R_{t+1} + \gamma_{t+1} G_{t+1}. It’s even practically useful^{4}.
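The recursion above holds pathwise, so it can be sanity-checked on a single sampled trajectory. Here is a small sketch (the episode and its rewards are hypothetical, with constant γ): unrolling the transformed Bellman recursion backwards should reproduce the squared return exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

# One hypothetical episode with constant gamma; check that
# Gbar_t = Rbar_{t+1} + gammabar_{t+1} Gbar_{t+1} reproduces G_t^2, where
# Rbar_{t+1} = R_{t+1}^2 + 2 gamma R_{t+1} G_{t+1} and gammabar = gamma^2.
T, gamma = 5, 0.9
R = rng.normal(size=T)

# ordinary returns, computed backwards (G_T = 0 at termination)
G = np.zeros(T + 1)
for t in reversed(range(T)):
    G[t] = R[t] + gamma * G[t + 1]

# squared returns via the transformed Bellman recursion
Gbar = np.zeros(T + 1)
for t in reversed(range(T)):
    Rbar = R[t] ** 2 + 2 * gamma * R[t] * G[t + 1]
    Gbar[t] = Rbar + gamma ** 2 * Gbar[t + 1]

assert np.allclose(Gbar[:T], G[:T] ** 2)
```

Note that the transformed “reward” \bar{R}_{t+1} depends on the downstream return G_{t+1}, which is exactly the quantity that makes this recursion awkward to use online.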
The Second Moment of the δ-return #
Now, the promised fun. Consider the second moment of the δ-return:
\begin{aligned} \Exp{ \big( G_{t}^{\lambda}(\delta) \big)^2 } &= \Exp{ (G_{t}^{\lambda} - v(S_{t}))^2 } \\\\ &= \Exp{ \Big( \sum_{n=0}^\infty \delta_{t+n} \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k} \Big)^2 } \end{aligned}
Wait, I think that \Exp{ (G_{t}^{\lambda} - v(S_{t}))^2 } shows up somewhere in the literature. Oh, it’s the mean-squared error!
Well, that’s pretty neat. When I first realized this, I thought we might be able to use it to achieve a truly error-free algorithm, because if you can get a handle on the errors you’re making, it’s possible to reduce them. It turns out that this is hard in practice, because we can’t reliably estimate the δ-return. To see why this is a problem, we can plug into the Bellman equation for the second moment of the δ-return.
Bellman equation for the δ-return’s second moment #
Setting \tilde{G}\_{t} = G\_{t}^{\lambda} - v(S\_{t}) (and therefore \tilde{R}\_{t+1} = \delta_{t}) gives us the following:
\begin{aligned} (G_{t}^{\lambda} - v(S_{t}))^2 &= \Bigg( \sum_{n=0}^{\infty} \delta_{t+n} \Big( \prod_{k=1}^{n} \gamma_{t+k} \lambda_{t+k} \Big) \Bigg)^2 \\\\ &= \delta_{t}^2 + 2 \gamma_{t+1} \lambda_{t+1} \delta_{t} (G_{t+1}^{\lambda} - v(S_{t+1})) + \gamma_{t+1}^{2} \lambda_{t+1}^{2} (G_{t+1}^{\lambda} - v(S_{t+1}))^2 \\\\ &= \bar{R}_{t+1} + \bar{\gamma}_{t+1} \bar{G}_{t+1} \end{aligned}
Where we make the following substitutions:
\begin{aligned} \bar{R}_{t+1} &= \delta_{t}^2 + 2 \gamma_{t+1} \lambda_{t+1} \delta_{t} (G_{t+1}^{\lambda} - v(S_{t+1})) \\\\ \bar{\gamma}_{t} &= \gamma_{t}^{2} \lambda_{t}^{2} \\\\ \bar{G}_{t} &= (G_{t}^{\lambda} - v(S_{t}))^2 \end{aligned}
To make this work, we need a method of estimating \Exp{G_{t}^{\lambda} - v(S_{t})}, but if our value function has converged, then that quantity is zero.
So we end up with just \bar{R}\_{t+1} = \delta\_{t}^2, throwing away all of the “cross-terms” like 2 \gamma_{t+1} \lambda_{t+1} \delta_{t} (G_{t+1}^{\lambda} - v(S_{t+1})). These terms might make a substantial contribution to the second moment.
The Delta-Squared Return for Variance #
So under linear function approximation, we are effectively approximating not the second moment of the δ-return, but the first moment of the \delta^2-return (the discounted sum of the squared TD-errors).
Depending on how much the cross-terms affect the second moment, this might still be useful.
In the case that \Exp{\delta_{t}} = 0, the “cross-term” goes away in expectation, and the second moment of the δ-return and the expected value of the \delta^2-return are the same thing. Note that \delta_{t} itself may still be nonzero.
So if \operatorname{MSE} = \operatorname{Bias}^2 + \operatorname{Variance}, and our estimates are unbiased, then what remains is the variance. This is still pretty neat: in the tabular case, or when our value function is exact (or close enough to exact), estimating the \delta^2-return is the same as estimating the variance.
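Here is a minimal check of that last claim in the simplest possible setting (a hypothetical one-shot MRP: a single stochastic reward, then termination, with the value function exact by construction). The expected \delta^2-return then matches the Monte Carlo variance of the return.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical one-shot MRP: reward r ~ Normal(mu, sigma), then terminate.
# The exact value function is v = mu, so E[delta_t] = 0 and the
# delta^2-return should recover the variance of the return.
mu, sigma, n = 1.0, 2.0, 200_000
r = rng.normal(mu, sigma, size=n)

v = mu                             # exact value function
delta = r - v                      # TD error on the terminal transition
mc_variance = np.var(r)            # Monte Carlo variance of the return
d2_return = np.mean(delta ** 2)    # expected delta^2-return

assert np.isclose(mc_variance, d2_return, rtol=1e-2)
```

If v were biased (say v = mu + 0.5), the δ²-return would instead estimate the MSE, i.e. variance plus squared bias, which is the decomposition above.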
Final Remarks #
It may be possible to estimate the second moment of the \delta-return by using a different function approximation scheme for each learner, although the utility of this is questionable: ultimately, we tend to assume that getting the value function correct is the highest priority, and so if you have extra representational power you should probably allocate it there. Using a neural net or other adaptive representation might also be a way forward, and the behavior of such models in estimating the δ-return might be interesting in its own right.
But it’s possible that knowing the variance is its own reward (ba-dum-tssh), because you could use it to help drive exploration, set learning rates, or modify the policy to prefer more reliable actions.
In that case, depending on how biased your value function is, the second moment of the \delta-return can be well-approximated by tracking only \delta^2, or you could go with the second-moment methods mentioned previously^{5}.
Either way, I expect that investigating metareturns will yield similarly fun posts (or perhaps even papers!) in the future.

You can just pretend that γ and λ are constant if you don’t want to consider GVFs; I’m mainly just showing that this works even in the general setting. ↩︎

Where the subscript \pi means we’re taking the expectation w/r/t the stationary distribution induced by following the policy \pi that we wish to evaluate, conditioned on following that policy in subsequent states. ↩︎

For TD(λ) it’s more that \Exp{\delta_{t} \vec{e}_{t}} is zero at convergence, but in general you’re going to see (under linear function approximation with the same features and the same γ and λ) that the weights for this second approximator should be zero in expectation. ↩︎

White & White use this equation to estimate the second moment (and therefore the variance) online with a nice algorithm they call Variance TD, and use it to attempt to optimize λ w/r/t the bias-variance tradeoff (although you can avoid that and just estimate the variance); Tamar et al. provide a couple of algorithms (TD(0) and LSTD variants) for estimating the variance. ↩︎

Those methods also rely on the value function in order to estimate variance and to compute their updates, so if your value function sucks, you’re not going to have a good idea of what your variance is. I have a short “proof” that the \delta^2-return comes up with the same estimate of the variance as the second-moment algorithms, but it’s pending a check because I think one of the steps is a bit suspect. ↩︎