##### May 4, 2018

This post sat in my “drafts” folder for many moons, mostly done, so I finished it and then threw it online. There are probably lacunae where I meant to explain something, didn’t leave a note in the draft, and then failed to notice. However, you can now email me to tell me about it – (never mind, I can’t access that account anymore) – for the handful of readers, dozens of webcrawlers, and hundreds of automated vulnerability scanners.
...

##### October 23, 2017

The paper on max entropy RL (Reinforcement Learning with Deep Energy-Based Policies) is pretty cool, but some of the proofs are lacking explanation. This is papered over by the standard “it is easy to see…” evasions that mathematicians take such delight in. Usually, space restrictions are the culprit, but I wish authors would either provide more detail or at least a specific reference (including the page number!) for where I can learn more.
...

##### April 28, 2017

(Just some quick thoughts before I return to typesetting my thesis… I will return to this to flesh things out once I’ve gotten a bit more sleep and organized my thoughts)
I am writing my thesis on estimating the variance of the return in Markov Decision Processes using online incremental algorithms, which turns out to be a surprisingly complex problem.
Having an estimate of the variance is generally agreed to be a good thing, according to a random sampling (N=2) of statisticians I interviewed when writing this blog post.
...

##### March 23, 2017

(In progress, just some quick notes until I’m done my thesis)
###### Expressing Entropy as a Return

- It’s a fun exercise
- Learning the limiting entropy of a state online may be useful
- There is probably some interesting analysis that could be done

So, let’s look at discrete Markov Processes and show that there’s a recursive Bellman-like equation for entropy. Note that this is different from the entropy rate.

###### Quick rundown of relevant MDP basics

For a Markov chain, we have that the probability of seeing a particular sequence of states, say $(s_0, s_1, s_2, \ldots, s_n)$, is:
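By the Markov property, this probability factorizes into the probability of the first state times a product of one-step transition probabilities. Here is a minimal Python sketch of that standard factorization. The names `initial` and `transition`, and the two-state chain itself, are made up for illustration and do not come from the post.

```python
from math import prod


def sequence_probability(states, initial, transition):
    """P(s_0, ..., s_n) = initial[s_0] * product of transition[s_i][s_{i+1}].

    `initial[s]` is the (assumed) probability of starting in state s, and
    `transition[s][t]` is the (assumed) probability of moving from s to t.
    """
    if not states:
        raise ValueError("need at least one state")
    steps = zip(states, states[1:])  # consecutive (s_i, s_{i+1}) pairs
    return initial[states[0]] * prod(transition[s][t] for s, t in steps)


# Hypothetical two-state chain, purely for illustration.
initial = [0.5, 0.5]
transition = [[0.9, 0.1],
              [0.2, 0.8]]

p = sequence_probability([0, 0, 1, 1], initial, transition)
```

Here `p` is the product 0.5 × 0.9 × 0.1 × 0.8, one factor for the starting state and one for each transition.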
...

##### February 17, 2017

This bit of light entertainment emerged out of a paper I worked on estimating variance via the TD-error.
We have previously looked at how the discounted sum of TD-errors is just another way of expressing the bias of our value function. We’ll quickly recapitulate that proof, but extend it for the λ-return and general value functions because I’m a sadist.
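As a quick numerical sanity check of the simplest case (plain discounted return, a single finite episode, terminal value fixed at zero), the discounted sum of TD-errors telescopes to the return minus the value estimate of the starting state. The sketch below is my own illustration under those assumptions, with made-up rewards and value estimates. It does not cover the λ-return or general value functions.

```python
def discounted_sum(xs, gamma):
    """Return sum_k gamma**k * xs[k], accumulated backwards."""
    total = 0.0
    for x in reversed(xs):
        total = x + gamma * total
    return total


def td_errors(rewards, values, gamma):
    """One-step TD-errors: delta_k = r_k + gamma * v_{k+1} - v_k.

    `values` has one more entry than `rewards`; the last entry is the
    terminal state's value, assumed here to be 0.
    """
    return [r + gamma * values[k + 1] - values[k]
            for k, r in enumerate(rewards)]


# Made-up episode and (deliberately wrong) value estimates.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]
values = [0.3, -1.2, 0.7, 0.0]

deltas = td_errors(rewards, values, gamma)
lhs = discounted_sum(deltas, gamma)               # discounted sum of TD-errors
rhs = discounted_sum(rewards, gamma) - values[0]  # return minus estimate
```

The two quantities `lhs` and `rhs` agree up to floating-point error, whatever value estimates are plugged in for the non-terminal states.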
###### The δ-return

We first note the definitions of the TD-error (for time-step $t$, we denote it as $\delta_{t}$) and the λ-return, $G_{t}^{\lambda}$.
...

##### November 3, 2016

One of the fundamental ideas in reinforcement learning is the temporal difference (TD) error, which is the difference between the value of the current state and the reward received plus the discounted value of the next state. That may sound abstract, but effectively it’s what you expected minus what you actually got and what you’re expecting next. Okay, that still sounds incomprehensible to anyone who’s not already familiar with RL, so instead here’s an equation.
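For readers who prefer code, here is a minimal sketch of the one-step TD error in its most common form: the reward plus the discounted value of the next state, minus the value of the current state. Sign conventions vary between write-ups, and the function and argument names here are assumptions chosen for illustration.

```python
def td_error(reward, value_current, value_next, gamma):
    """One-step TD error: (reward + gamma * value_next) - value_current.

    Positive when things went better than the current estimate predicted,
    negative when they went worse.
    """
    return reward + gamma * value_next - value_current


# Made-up numbers: we estimated 0.5 for the current state, then received a
# reward of 1.0 and landed in a state we estimate to be worth 2.0.
delta = td_error(reward=1.0, value_current=0.5, value_next=2.0, gamma=0.9)
```

With these numbers `delta` is positive, meaning the transition turned out better than the estimate for the current state suggested.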
...