Fun With The TD-Error

November 3, 2016

reinforcement learning, temporal difference learning, math

One of the fundamental ideas in reinforcement learning is the temporal difference (TD) error, which is the difference between the value of the current state and the reward received plus the discounted value of the next state. That may sound abstract, but effectively it’s what you expected minus what you actually got and what you’re expecting next. Okay, that still sounds incomprehensible to anyone who’s not already familiar with RL, so instead here’s an equation.

\delta_{t} = R_{t} + \gamma v(S_{t+1}) - v(S_{t})

-R_{t} denotes the reward we got at timet. -v(s) denotes how much future reward we’re expecting in states. -\gamma is the so-called discount factor, which is the way mathematicians say ‘a bird in the hand is worth two in the bush’.

Anyways, the usual goal of RL is to make our predictions about the future as accurate as possible. It’s surprisingly doable, because even when we don’t know what’s coming in the future, we can still take steps to make our predictions about what we do know as accurate as possible. By adjusting our models so that the TD-error gets smaller over time, we learn to predict how much reward we’ll receive in the future without ever having to keep track of more than the current state, the next state, and the reward we received from transitioning from one to the other.

In fact, subject to some technical conditions, we can prove that the process of minizing the TD-error actually will lead to us learning the overall value function accurately¹. I will not go into that in detail, but remark that one of the big assumptions we tend to work with is that the environment is a Markov Decision Process.

Solving The Bellman Equation Analytically #

All this background is getting boring, so I’m just going to skip ahead and trust that the only people who find this site tend to have a background in reinforcement learning.

Assume you’ve got a transition matrix\vec{P}, a vector\vec{r} that describes the expected rewards in each state. Let the value function can be represented as a vector,\vec{v}

\begin{align} \vec{v}_{(i)} &= \Exp{G_t | S_t = s_i} \\\\ \vec{v} &= \vec{r} + \gamma \vec{P} \vec{v} \\\\ &= \vec{r} + \gamma \vec{P} \vec{r} + (\gamma \vec{P})^2 \vec{r} + \cdots \\\\ \vec{v}(\vec{I} - \gamma \vec{P}) &= \vec{r} \\\\ \vec{v} &= (\vec{I} - \gamma \vec{P})^{-1} \vec{r} \end{align}

Lines 3-4 used the matrix version of the geometric sum. Assuming that matrix inversion step was valid (which is usually the case), we have a way of solving for the value function in general.

Sum of Discounted Deltas #

Suppose now that we have\hat{v} an approximate value function. It may have been learned via some sort of learning algorithm, or it may be completely random numbers, but regardless, we have it.

Now, suppose for some reason we want to know about the discounted sum of future deltas².

The value function is fixed and we have the expected reward vector from before, so if we let\vec{d}\_{(i)} = \mathbb{E}\Bigg[ \delta_{t} \Bigg\vert S_t = s_i\Bigg], we can write it as

\vec{d} = \vec{r} + \gamma \vec{P} \hat{\vec{v}} - \hat{\vec{v}}

Note that we’re using\hat{\vec{v}}, which is different from the “true”\vec{v} that we got from solving the Bellman equation.

Now we go for the infinite sum:

\begin{align} \vec{d} + \gamma \vec{P} \vec{d} + (\gamma \vec{P})^2 \vec{d} + \cdots &= (\vec{I} - \gamma \vec{P})^{-1} \vec{d} \\\\ &= (\vec{I} - \gamma \vec{P})^{-1} (\vec{r} + \gamma \vec{P} \hat{\vec{v}} - \hat{\vec{v}}) \\\\ &= (\vec{I} - \gamma \vec{P})^{-1} (\vec{r} - (\vec{I} - \gamma \vec{P}) \hat{\vec{v}}) \\\\ &= (\vec{I} - \gamma \vec{P})^{-1} \vec{r} - (\vec{I} - \gamma \vec{P})^{-1} (\vec{I} - \gamma \vec{P}) \hat{\vec{v}} \\\\ &= (\vec{I} - \gamma \vec{P})^{-1} \vec{r} - \hat{\vec{v}} \\\\ &= \vec{v} - \hat{\vec{v}} \end{align}

Amazing. Well, maybe you’ve seen the result before (I’ve learned that it shows in a few places) but I think it’s kinda neat that the difference between the true value,\vec{v}, and our approximate value function,\hat{\vec{v}}, is also encoded in the series of TD-errors the same way that the value was encoded in the rewards.

Of course, we can see this in another way:

\begin{align} G_{t} - \hat{v}(S_t) &= \Bigg( \sum_{k=1}^{\infty} R_{t+k} \gamma^{k-1} \Bigg) - \hat{v}(S_t) \\\\ &= R_{t+1} + \gamma \hat{v}(S_{t+1}) - \hat{v}(S_t) \- \gamma \hat{v}(S_{t+1}) \+ \gamma \Bigg( \sum_{k=1}^{\infty} R_{t+k+1} \gamma^{k-1} \Bigg) \\\\ &= \delta_{t} + \gamma \Big( G_{t+1} - \hat{v}(S_{t+1}) \Big) \\\\ &= \sum_{k=0}^{\infty} \delta_{t+k} \gamma^{k} \end{align}

If you wanted to find a use for this, you might try using a cascade of algorithms (the first learns the value, the second learns the error in the first’s approximation, and so on) to reduce error³. Or you can point to this as a semi-rigorous explanation of why monitoring the TD-error alone gives you an idea of how well your algorithm is doing in general.

But in any case, the math was fun, which is really all I promised in the title of this post.

How accurately we learn the value function, of course, is its own topic of discussion. Largely, it depends on how accurately we can perceive the state space. If we assume complete omniscience, then asymptotically our accuracy will be perfect. If we fall short of that, then our accuracy suffers, but we we’ll still do about as well the best-case scenario that our diminished perception can allow. ↩︎
For example, suppose you are writing a thesis where this expression comes up and you’re curious about it. OK, maybe that’s a reason particular to my case. ↩︎
You would likely have to use different representations for each (the reasons for that would require another post). ↩︎