Estimating Fractional Returns in Markov Decision Processes

April 28, 2017
reinforcement learning, generalized returns, temporal difference learning

(Just some quick thoughts before I return to typesetting my thesis… I will return to this to flesh things out once I’ve gotten a bit more sleep and organized my thoughts)

I am writing my thesis on estimating the variance of the return in Markov Decision Processes using online incremental algorithms, which turns out to be a surprisingly complex problem.

Having an estimate of the variance is generally agreed to be a good thing, according to a random sampling (N=2) of statisticians I interviewed when writing this blog post. If you have unlimited computational power and memory, you may not care about having everything be online and incremental, because you can just keep track of literally everything and compute whatever function of the return you want. That approach is rather wasteful, and generally only promoted by individuals who own shares in NVIDIA.

If it’s possible to update your estimator(s) as the data comes in, you should probably do that, because it amortizes the computational cost and allows you to modify the learning agent’s behavior in real time1. TD(λ) and related algorithms can accomplish this, and there’s a lot of work on methods of learning to estimate the expected return online and incrementally. But what if you want to estimate the second moment of the return? Or the square root? This is a tougher nut to crack.

Background #

(You can skip this section if you’re familiar with reinforcement learning)

Markov decision processes (MDPs) model decision making in situations where the outcomes are a function of the actions taken by an agent, possibly with some randomness thrown in for zest. At time “t”, the agent gets some information about the state of the world (S_{t}), selects an action A_{t} according to some policy it’s following (so A_{t} \sim \pi(S_{t})), transitions to a new state S_{t+1}, and gets a reward for its efforts (R_{t+1}). It then has to repeat this process again, forever, or at least until it reaches a terminal state. The value of a state depends not just on the reward it receives on transitioning out of that state, but on all the future rewards that ensue. To make this mathematically tractable, we introduce a discount parameter \gamma \in (0, 1], so that the reward k transitions from now is diminished by \gamma^{k}, meaning that the value of each state is finite2. We call this discounted sum of rewards “the return”, and say the value of a state is the expected return you would get if you had just transitioned into that state.

We denote the return by G_{t}, and the value of some state s by v(s).

\begin{aligned} G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\\\ v(s) &= \Exp{G_{t} | S_{t} = s} \end{aligned}

RL agents learn the value function by predicting the value of the next state, and update their estimate of the value function by something like stochastic gradient descent. That way, we can learn the value function as information comes in, without having to pause in order to update the model’s parameters3.

Why learning the value function is “easy” #

If you’ve not studied these ideas in detail, I can perhaps sketch the justification using a technique I like to call “math”.

For the TD(0) algorithm, given a transition of the form (S_{t}, A_{t}, R_{t+1}, S_{t+1}), we compute the temporal difference error:

\delta_{t} = R_{t+1} + \gamma v(S_{t+1}) - v(S_{t})

and we modify the value function, which is parameterized by w, with the update rule

w \leftarrow w + \alpha \delta_{t} \nabla_{w} v(S_{t})
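
To make this concrete, here is a minimal sketch of the TD(0) update with linear function approximation. The function name, the feature vectors phi_s and phi_s_next, and the step-size/discount defaults are all placeholders of my own, not anything standard:

```python
import numpy as np

def td0_update(w, phi_s, phi_s_next, reward, gamma=0.99, alpha=0.1, terminal=False):
    """One TD(0) update for a linear value function v(s) = w @ phi(s)."""
    v_s = w @ phi_s
    v_next = 0.0 if terminal else w @ phi_s_next
    delta = reward + gamma * v_next - v_s   # the TD error, delta_t
    return w + alpha * delta * phi_s        # grad_w v(S_t) is just phi(S_t) for linear v
```

In the tabular case, phi(s) is a one-hot indicator vector, and this collapses to the familiar v(S_{t}) \leftarrow v(S_{t}) + \alpha \delta_{t}.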

Part of the reason why this works is that the return, G_{t}, is a discounted sum. So we can define it recursively, and use linearity of expectations to get:

\begin{aligned} G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots \\\\ &= R_{t+1} + \gamma G_{t+1} \\\\ \Exp{G_{t}} &= \Exp{R_{t+1} + \gamma G_{t+1}} \\\\ &= \Exp{R_{t+1}} + \gamma \Exp{G_{t+1}} \end{aligned}

If we assume that v(s) \approx \Exp{G_{t} | S_{t} = s}, we can get an expression for the (squared) error in the value function:

\begin{aligned} \text{err}(v) &= (R_{t+1} + \gamma G_{t+1} - v(S_{t}))^{2} \\\\ &\approx (R_{t+1} + \gamma v(S_{t+1}) - v(S_{t}))^{2} \end{aligned}

Further assume that v(S_{t+1}) is constant4, and we can differentiate w/r/t the parameters of the value function (w) to get:

\begin{aligned} \nabla \text{err}(v) &\approx \nabla_{w} (R_{t+1} + \gamma v(S_{t+1}) - v(S_{t}))^{2} \\\\ &= - 2 (R_{t+1} + \gamma v(S_{t+1}) - v(S_{t})) \nabla_{w} v(S_{t}) \\\\ &= - 2 \delta_{t} \nabla_{w} v(S_{t}) \end{aligned}

We make a few assumptions and get an algorithm that not only works in practice, but has been proven to converge with a bounded error5, and can be used in the online setting.

Second and Higher Moments of the Return #

So, we can learn the value function, and now perhaps we’d like the second moment of the return as well so that we can do fun things like computing the variance.

\Var{G_{t}} = \Exp{G_{t}^{2}} - \Exp{G_{t}}^{2}

Having the variance is nice because we can make confidence intervals, modify the learning rate (reducing it when variance is high, and ramping it up when it’s low), or even include it as a penalty to discourage risky behavior in our agent6. But naturally this is easier said than done.

Why is this so hard? #

Mainly because we can use the fact that the expression for the return is recursive and linear to get cool operator equations, but the same cannot be said for, e.g., G_{t}^{2}.

For example, if \vec{v} is the value function in vector form (such that v(s_{i}) = [\vec{v}]_{i}), P is the transition matrix (so \Pr(s_{j} | s_{i}) = P_{ij}), and \vec{r} is the expected immediate reward (with [\vec{r}]_{i} = \Exp{R_{t+1} | S_{t} = s_{i}}), then

\begin{aligned} \vec{v} &= \vec{r} + \gamma P \vec{v} \\\\ \Rightarrow \vec{v} &= (I - \gamma P)^{-1} \vec{r} \end{aligned}
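
As a quick sanity check, the operator equation is easy to verify numerically. The three-state transition matrix and rewards below are arbitrary numbers I made up for illustration:

```python
import numpy as np

gamma = 0.9
# A made-up three-state MDP: P[i, j] = Pr(s_j | s_i), r[i] = E[R_{t+1} | S_t = s_i]
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [0.2, 0.0, 0.8]])
r = np.array([1.0, 0.0, -0.5])

# v = (I - gamma P)^{-1} r, solved as a linear system rather than by explicit inversion
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```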

But if we consider G_{t}^{2}, we get

\begin{aligned} G_{t}^{2} &= (R_{t+1} + \gamma G_{t+1})^{2} \\\\ &= R_{t+1}^{2} + 2 \gamma R_{t+1} G_{t+1} + \gamma^{2} G_{t+1}^{2} \end{aligned}

which almost seems like it will work, except we can’t just predict the squared return of the next state because of the 2 \gamma R_{t+1} G_{t+1} cross-term. If we wanted to be exactly correct, we’d need to track the future rewards from G_{t+1}, discount them, multiply by R_{t+1}, and do this for every transition. Thus, it seems like a victory for the forces of batch updating and computational inefficiency.

Saved By Mathematical Chicanery! #

However, we can try the same trick that we used before, where we use one estimator to learn the value function, plug that in for G_{t+1} as needed, and then further assume that this estimator is constant. So we end up learning from what I’m calling generalized- or meta-rewards (denoted by a bar, e.g., \bar{R}_{t+1}), and it turns out that if you accept the validity of that substitution, we get a Bellman equation:

\begin{aligned} \bar{R}_{t+1} &= R_{t+1}^{2} + 2 \gamma R_{t+1} v(S_{t+1}) \\\\ \bar{\gamma} &= \gamma^{2} \\\\ \bar{v}(S_{t}) &= \Exp{\bar{R}_{t+1} + \bar{\gamma} \bar{v}(S_{t+1})} \\\\ &\approx \Exp{G_{t}^{2}} \end{aligned}
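
In code, this amounts to running a second TD(0) learner on the meta-reward and meta-discount above, bootstrapping off the ordinary value estimate. A rough sketch, again with placeholder names and linear features of my own choosing:

```python
import numpy as np

def second_moment_td_update(w_bar, w, phi_s, phi_s_next, reward,
                            gamma=0.99, alpha=0.05, terminal=False):
    """TD(0) on the meta-reward R_bar = R^2 + 2*gamma*R*v(s') with meta-discount
    gamma^2, so that w_bar @ phi(s) approximates E[G_t^2 | S_t = s]."""
    v_next = 0.0 if terminal else w @ phi_s_next         # plug-in estimate of G_{t+1}
    r_bar = reward**2 + 2.0 * gamma * reward * v_next    # meta-reward
    v_bar_next = 0.0 if terminal else w_bar @ phi_s_next
    delta_bar = r_bar + gamma**2 * v_bar_next - w_bar @ phi_s
    return w_bar + alpha * delta_bar * phi_s

def variance_estimate(w, w_bar, phi_s):
    """Var[G_t | S_t = s] ~ E[G_t^2] - E[G_t]^2, from the two learned estimators."""
    return w_bar @ phi_s - (w @ phi_s)**2
```

The variance_estimate helper is just the identity from the previous section, \Var{G_{t}} = \Exp{G_{t}^{2}} - \Exp{G_{t}}^{2}, evaluated with the two learned estimators.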

And we still have the ability to formulate cool operator equations:

\begin{aligned} \bar{\vec{v}} &= (I - \bar{\gamma} P)^{-1} \bar{\vec{r}} \\\\ &= (I - \gamma^{2} P)^{-1} (\vec{r}^{(2)} + 2 \gamma \vec{R} P \vec{v}) \end{aligned}

The justification for this is covered in other works (see Sobel). I will note that everything’s just groovy from a convergence perspective, because if \gamma \in (0, 1], then so is \gamma^{2}, and if the rewards are bounded, then R_{t}^{2} is also bounded, and therefore so is the return. So if we limit the value function to be |v(s)| \leq \frac{R_{max}}{1 - \gamma}, then \bar{R}_{t} is bounded, and so if TD would converge on the original problem, then it converges here too.

It works well in the tabular case, but under function approximation things begin to break down in odd ways.

Generalizing #

As suggested by the section heading, we can generalize, and come up with a reward for the higher moments of the return rather easily:

\begin{aligned} G_{t}^{n} &= \sum_{k=0}^{n} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} G_{t+1}^{k} \\\\ &\approx \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v(S_{t+1})^{k} + \gamma^{n} G_{t+1}^{n} \end{aligned}

Which allows us (equipped only with a value function!) to define rewards and discounts that yield Bellman equations we can use:

\begin{aligned} \bar{R}_{t+1}^{(n)} &= \sum_{k=0}^{n-1} \binom{n}{k} R_{t+1}^{n-k} \gamma^{k} v(S_{t+1})^{k} \\\\ \bar{\gamma} &= \gamma^{n} \\\\ \bar{v}^{(n)}(S_{t}) &= \Exp{\bar{R}_{t+1}^{(n)} + \bar{\gamma} \bar{v}^{(n)}(S_{t+1})} \end{aligned}

When n = 1, we have \bar{R}_{t+1} = R_{t+1} and \bar{\gamma} = \gamma, which is a nice sanity check.
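
For what it’s worth, the meta-reward is nearly a one-liner. This little sketch (names mine) just transcribes the sum above and checks the n = 1 and n = 2 cases:

```python
from math import comb, isclose

def meta_reward(n, reward, v_next, gamma):
    """R_bar^(n) = sum_{k=0}^{n-1} C(n, k) * R^(n-k) * (gamma * v(s'))^k,
    to be paired with the meta-discount gamma^n."""
    return sum(comb(n, k) * reward**(n - k) * (gamma * v_next)**k for k in range(n))

# n = 1 recovers the ordinary reward; n = 2 recovers R^2 + 2*gamma*R*v(s') from above.
assert isclose(meta_reward(1, 2.0, 4.0, 0.5), 2.0)
assert isclose(meta_reward(2, 2.0, 4.0, 0.5), 2.0**2 + 2 * 0.5 * 2.0 * 4.0)
```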

Fractional Returns #

So while the original problem (estimating the moments of the return) seems completely sane, and its solution a praiseworthy goal, we can drop the mask of being responsible mathematicians and start wondering about weird quantities, like G_{t}^{p/q}. We could even consider complex powers, if that sort of thing seems worthwhile.

Let \beta = p/q (or even \beta \in \mathbb{C}), and let’s use a binomial series in place of the binomial expansion that came before:

\begin{aligned} G_{t}^{\beta} &= (R_{t+1} + \gamma G_{t+1})^{\beta} \\\\ &= R_{t+1}^{\beta} (1 + \frac{\gamma G_{t+1}}{R_{t+1}})^{\beta} \\\\ &= R_{t+1}^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{\gamma G_{t+1}}{R_{t+1}} \Bigg)^{k} \end{aligned}

Previously, for the n-th moment, the expansion was finite and we could split off a \gamma^{n} G_{t+1}^{n} term, giving us a Bellman equation. Here, we have an infinite series, so the way around this is not exactly clear.

Instead, we can make an unjustified assumption: substitute v(S_{t+1}) for G_{t+1} inside the series, then add a \gamma^{\beta} G_{t+1}^{\beta} term (so there is something to bootstrap from) and subtract its estimate (\gamma v(S_{t+1}))^{\beta} to compensate:

G_{t}^{\beta} \approx \gamma^{\beta} G_{t+1}^{\beta} - (\gamma v(S_{t+1}))^{\beta} + R_{t+1}^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{\gamma v(S_{t+1})}{R_{t+1}} \Bigg)^{k}

\bar{R}_{t+1}^{\beta} = - (\gamma v(S_{t+1}))^{\beta} + R_{t+1}^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{\gamma v(S_{t+1})}{R_{t+1}} \Bigg)^{k}

We can evaluate the generalized binomial coefficients using the gamma function (the analytic extension of the factorial), which is easy to compute.
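
As a sketch of what this might look like, here is one way to compute the generalized coefficients (via a running product, which agrees with the gamma-function expression but sidesteps its poles) and a truncated version of the meta-reward above. The truncation depth and all names are my own choices, not anything established:

```python
def gen_binom(beta, k):
    """Generalized binomial coefficient C(beta, k), i.e.
    Gamma(beta + 1) / (Gamma(k + 1) * Gamma(beta - k + 1)),
    computed as a product so it also behaves for complex or negative beta."""
    out = 1.0
    for j in range(k):
        out *= (beta - j) / (j + 1)
    return out

def fractional_meta_reward(beta, reward, v_next, gamma, n_terms=50):
    """Truncation of R_bar^beta = -(gamma*v')^beta + R^beta * sum_k C(beta, k) (gamma*v'/R)^k.
    Assumes reward != 0, and positive rewards/values if beta is fractional (to stay real)."""
    x = gamma * v_next / reward
    series = sum(gen_binom(beta, k) * x**k for k in range(n_terms))
    return -(gamma * v_next)**beta + reward**beta * series
```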

However, we’re still at a bit of an impasse, because this series may not be convergent.

  • If |\gamma v(S_{t+1})/R_{t+1}| < 1, then the series converges absolutely for all \beta \in \mathbb{C}.
  • If \gamma v(S_{t+1})/R_{t+1} = -1, then we need \Re(\beta) > 0.
  • If \gamma v(S_{t+1})/R_{t+1} \neq -1 and |\gamma v(S_{t+1})/R_{t+1}| = 1, then we just need \Re(\beta) > -1.
  • If |\gamma v(S_{t+1})/R_{t+1}| > 1, then the series diverges unless \beta is a non-negative integer, in which case it terminates (a quick check of these cases is sketched just below).
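
Concretely, a check of those conditions might look like this sketch (my own naming, treating \beta as a complex number):

```python
def binomial_series_converges(beta, x, tol=1e-12):
    """Does sum_k C(beta, k) x^k converge, for x = gamma * v(s') / R_{t+1}?"""
    beta = complex(beta)
    # Non-negative integer beta: the series terminates, so it always converges.
    if beta.imag == 0 and beta.real >= 0 and beta.real == int(beta.real):
        return True
    if abs(x) < 1:
        return True                      # converges absolutely for any beta
    if abs(abs(x) - 1) < tol:            # on the unit circle
        if abs(x + 1) < tol:             # the x = -1 case
            return beta.real > 0
        return beta.real > -1
    return False                         # |x| > 1 and the series does not terminate

# e.g. a square-root return with gamma * v / R = 0.8 is fine; with 1.5 it is not
print(binomial_series_converges(0.5, 0.8), binomial_series_converges(0.5, 1.5))
```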

Clearly, we have to be concerned with the maximum value that v(s) can take, so we might consider rescaling the rewards, but that alone doesn’t work cleanly, because v(s) is itself a function of the rewards (scaling them just scales the value function along with them). An affine transformation of the rewards (rescaling R \rightarrow aR + b for some scalars a and b) might work, though, along with an appropriately chosen \gamma. Affine transformations don’t really affect the interpretation of the prediction when we’re talking about the expected return, \Exp{G_{t}}, but they might change things for G_{t}^{\beta}.

Otherwise, we might be limited to cases where the episode returns a value of \pm 1 prior to termination. This is not the worst thing in the world, because it still means we can make predictions about whether events will or will not happen. This is still fairly useful, and if the math checks out (which it may not), we have a way of extending TD-methods to these artificially created non-exponentially decaying cases.

Alternative Expansion #

We could also look at an alternative expansion:

\begin{aligned} G_{t}^{\beta} &= (R_{t+1} + \gamma G_{t+1})^{\beta} \\\\ &= (\gamma G_{t+1})^{\beta} \bigg(1 + \frac{R_{t+1}}{\gamma G_{t+1}} \bigg)^{\beta} \\\\ &= (\gamma G_{t+1})^{\beta} \sum_{k=0}^{\infty} \binom{\beta}{k} \Bigg( \frac{R_{t+1}}{\gamma G_{t+1}} \Bigg)^{k} \\\\ &= (\gamma G_{t+1})^{\beta} + (\gamma G_{t+1})^{\beta} \sum_{k=1}^{\infty} \binom{\beta}{k} \Bigg( \frac{R_{t+1}}{\gamma G_{t+1}} \Bigg)^{k} \\\\ &\approx (\gamma G_{t+1})^{\beta} + (\gamma v(S_{t+1}))^{\beta} \sum_{k=1}^{\infty} \binom{\beta}{k} \bigg( \frac{R_{t+1}}{\gamma v(S_{t+1})} \bigg)^{k} \end{aligned}

So we can get the sorts of recursive equations that we like to see:

\begin{aligned} \Exp{G_{t}^{\beta}} &\approx \bar{v}^{\beta}(S_{t}) \\\\ \bar{v}^{\beta}(s) &= \Exp{\bar{R}_{t+1}^{\beta} + \gamma^{\beta} \bar{v}^{\beta}(S_{t+1}) | S_{t} = s} \\\\ \bar{R}_{t+1}^{\beta} &= (\gamma v(S_{t+1}))^{\beta} \sum_{k=1}^{\infty} \binom{\beta}{k} \Bigg( \frac{R_{t+1}}{\gamma v(S_{t+1})} \Bigg)^{k} \end{aligned}
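
Truncating the series as before, the meta-reward under this expansion might be computed like so (reusing the gen_binom helper from the earlier sketch; the names and truncation depth are again my own):

```python
def alt_fractional_meta_reward(beta, reward, v_next, gamma, n_terms=50):
    """Truncation of R_bar^beta = (gamma*v')^beta * sum_{k>=1} C(beta, k) (R / (gamma*v'))^k,
    to be paired with the meta-discount gamma^beta. Assumes gamma * v(s') != 0."""
    x = reward / (gamma * v_next)
    return (gamma * v_next)**beta * sum(
        gen_binom(beta, k) * x**k for k in range(1, n_terms))
```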

I think I like this expansion better, since we avoid the decision to arbitrarily pull out a (\gamma G_{t+1})^{\beta} in exchange for a (\gamma v(S_{t+1}))^{\beta}. Convergence of the series is also better, so long as the rewards are positive and the discount factor is set appropriately.

This time, we need |R_{t+1}/(\gamma v(S_{t+1}))| \leq 1 for convergence to be possible if \beta is not a positive integer. Things get awkward if \gamma or v(S_{t+1}) is zero for some values of \beta, but perhaps not fatally so. I’ll have to look at it a bit more before I can speak from a “what does it mean?” standpoint.

Why? #

Well… these fractional returns might still be useful, in the sense that one of the complaints about standard MDPs is that the discounting is too aggressive, and we might wish to learn to predict things with longer tails. If we decide that we care about fractional returns, it also raises questions about whether this is the best approach for computing them, and furthermore, why bother with a discount parameter if (given the exponential nature of discounting) we care about processes with much longer memory? It’s also just mathematically interesting, so there’s that as well.

So perhaps the theory has just not been fully developed yet, which might make for a good research problem.

References #

  • Sobel’s paper, “The Variance of Discounted Markov Decision Processes”, goes into this in greater detail.
  • Also see White & White, Tamar et al., and my own (soon-to-be-finished!) thesis.

  1. This is the difference between your robot car figuring out that it should be steering into the skid vs. the company that sold it to you having an awkward press conference and possible recall later in the week. ↩︎

  2. It also immanentizes the common expression “a bird in the hand is worth \frac{1 - \gamma}{\gamma} in the bush”. ↩︎

  3. Again, you could just track the return from every state and then learn the value function via something like batch processing, but to me that sounds like Washington bureaucracy and seems unlikely to Make Markov Decision Processes Great Again (MMDPGA). ↩︎

  4. This “derivation” is rather hand-wavy, so if you’d like something a little more mathematically stringent, see Hamid Maei’s thesis or the work on true-online TD(λ) by van Seijen et al. ↩︎

  5. Convergence is guaranteed in the linear function approximation case. Follow policy when attempting to estimate return, as the off-policy setting may lead to divergence. Do not attempt to use it in tasks with unbounded variance, or in non-stationary adversarial domains. For treatment with nonlinear function approximators, side effects may include: local minima, proliferation of neural net frameworks, superintelligence, and superintelligence doomsaying from Elon Musk. Consult with your doctor (PhD, not MD) to see if TD(λ) is right for you. ↩︎

  6. Because even if our robot car “expects” to be able to execute a perfect bootleg turn on average, its occupants might prefer a less erratic mode of conveyance. ↩︎
