math

Escaping Batch Country

May 4, 2018
reinforcement learning, temporal difference learning, math

This post sat in my “drafts” folder for many moons, mostly done, so I finished it and then threw it online. There are probably lacunae where I meant to explain something, didn’t leave a note in the draft, and then failed to notice. However, you can now email me to tell me about it (never mind, I can’t access that account anymore), for the handful of readers, dozens of web crawlers, and hundreds of automated vulnerability scanners. ...

Max Entropy RL Explanation

October 23, 2017
reinforcement learning, math

The paper on max entropy RL (Reinforcement Learning with Deep Energy-Based Policies) is pretty cool, but some of the proofs are lacking explanation. This is papered over by the standard “it is easy to see…” evasions that mathematicians take such delight in. Usually, space restrictions are the culprit, but I wish authors would either provide more detail or at least a specific reference (including the page number!) for where I can learn more. ...

Limiting Entropy as a Return

March 23, 2017
generalized returns, reinforcement learning, temporal difference learning, math

(In progress, just some quick notes until I’m done my thesis.) Expressing entropy as a return: it’s a fun exercise; learning the limiting entropy of a state online may be useful; and there is probably some interesting analysis that could be done. So, let’s look at discrete Markov processes and show that there’s a recursive Bellman-like equation for entropy. Note that this is different from the entropy rate. Quick rundown of relevant MDP basics: for a Markov chain, we have that the probability of seeing a particular sequence of states, say $(s_0, s_1, s_2, \ldots, s_n)$, is: ...

Fun with the TD-Error Part II

February 17, 2017
reinforcement learning, temporal difference learning, math

This bit of light entertainment emerged out of a paper I worked on estimating variance via the TD-error. We have previously looked at how the discounted sum of TD-errors is just another way of expressing the bias of our value function. We’ll quickly recapitulate that proof, but extend it for the λ-return and general value functions because I’m a sadist. The δ-return: we first note the definitions of the TD-error (for time-step $t$, we denote it as $\delta_{t}$) and the λ-return, $G_{t}^{\lambda}$. ...

Fun With The TD-Error

November 3, 2016
reinforcement learning, temporal difference learning, math

One of the fundamental ideas in reinforcement learning is the temporal difference (TD) error, which is the difference between the value of the current state and the reward received plus the discounted value of the next state. That may sound abstract, but effectively it’s what you expected minus what you actually got and what you’re expecting next. Okay, that still sounds incomprehensible to anyone who’s not already familiar with RL, so instead here’s an equation. ...
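As a rough illustration of the verbal description above, here is a minimal sketch of a one-step TD error. It assumes the common sign convention (reward plus discounted next-state value, minus the current-state value), which may differ from the one used in the post itself. The function name `td_error` and its parameters are hypothetical, chosen only for this example.

```python
def td_error(reward, value_current, value_next, discount):
    """One-step TD error under the assumed convention:
    (reward + discount * value of next state) - value of current state."""
    return reward + discount * value_next - value_current


# Hypothetical numbers: we expected a value of 2.0 from the current state,
# then received a reward of 1.0 and landed in a state we value at 3.0.
delta = td_error(reward=1.0, value_current=2.0, value_next=3.0, discount=0.5)
print(delta)
```

A positive result means things went better than the current value estimate predicted, and a negative result means they went worse.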

Generated Wed, 05 May 2021 23:10:04 MDT