## Hugo Customization

##### April 21, 2021

Some notes from the process of customizing my Hugo site.

This post sat in my “drafts” folder for many moons, mostly done, so I finished it and then threw it online. There are probably lacunae where I meant to explain something, didn’t leave a note in the draft, and then failed to notice. However, you can now email me to tell me about it (never mind, I can’t access that account anymore), for the handful of readers, dozens of webcrawlers, and hundreds of automated vulnerability scanners. ...

The paper on max entropy RL (Reinforcement Learning with Deep Energy-Based Policies) is pretty cool, but some of the proofs are lacking explanation. This is papered over by the standard “it is easy to see…” evasions that mathematicians take such delight in. Usually, space restrictions are the culprit, but I wish authors would either provide more detail or at least a specific reference (including the page number!) for where I can learn more. ...

(In progress, just some quick notes until I’m done my thesis)

### Expressing Entropy as a Return

- It’s a fun exercise
- Learning the limiting entropy of a state online may be useful
- There is probably some interesting analysis that could be done

So, let’s look at discrete Markov Processes and show that there’s a recursive Bellman-like equation for entropy. Note that this is different from the entropy rate.

### Quick rundown of relevant MDP basics

For a Markov chain, we have that the probability of seeing a particular sequence of states, say $(s_0, s_1, s_2, \ldots, s_n)$, is: ...
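To make the sequence probability concrete, here is a minimal Python sketch of the usual Markov factorization: the probability of the first state times the one-step transition probabilities along the path. The function name `sequence_probability`, the dict-based representation, and the two-state chain are all invented for illustration and are not taken from the original notes.

```python
def sequence_probability(states, initial, transition):
    """Probability of observing `states` in order under a Markov chain.

    initial[s]        -- probability of starting in state s (assumed given)
    transition[s][s2] -- probability of moving from s to s2 in one step
    """
    prob = initial[states[0]]
    # Multiply in one transition probability per consecutive pair of states.
    for s, s_next in zip(states, states[1:]):
        prob *= transition[s][s_next]
    return prob


# Hypothetical two-state chain, purely for demonstration.
initial = {"a": 0.5, "b": 0.5}
transition = {
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.5, "b": 0.5},
}

p = sequence_probability(["a", "a", "b"], initial, transition)
print(p)  # roughly 0.5 * 0.9 * 0.1
```

The Markov property is what lets the loop look only at consecutive pairs: each step depends on the current state and nothing earlier.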

This bit of light entertainment emerged out of a paper I worked on about estimating variance via the TD-error. We have previously looked at how the discounted sum of TD-errors is just another way of expressing the bias of our value function. We’ll quickly recapitulate that proof, but extend it for the λ-return and general value functions because I’m a sadist.

### The δ-return

We first note the definitions of the TD-error (for time-step $t$, we denote it as $\delta_{t}$) and the λ-return, $G_{t}^{\lambda}$. ...
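As a quick numerical sanity check of the claim that the discounted sum of TD-errors expresses the bias of the value function, here is a small Python sketch. It only covers the plain Monte Carlo return (the λ = 1 case) on a single episode, and it assumes the terminal state has value zero. The helper names and the toy numbers are made up for the example.

```python
def discounted_sum(xs, gamma):
    """Return xs[0] + gamma * xs[1] + gamma**2 * xs[2] + ..."""
    total = 0.0
    for x in reversed(xs):
        total = x + gamma * total
    return total


def td_errors(rewards, values, gamma):
    """One-step TD errors for an episode.

    values[i] is the estimated value of the state at step i, so `values`
    has one more entry than `rewards`; the last entry is the terminal value.
    """
    return [
        r + gamma * values[i + 1] - values[i]
        for i, r in enumerate(rewards)
    ]


# Toy episode with arbitrary (deliberately wrong) value estimates.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.5, -1.0, 0.0]  # terminal value assumed to be 0
gamma = 0.9

deltas = td_errors(rewards, values, gamma)
delta_return = discounted_sum(deltas, gamma)
bias = discounted_sum(rewards, gamma) - values[0]

print(delta_return, bias)  # the two numbers agree up to rounding
```

The agreement comes from telescoping: every intermediate value estimate appears once with a plus sign and once with a minus sign, leaving only the return and the value of the first state.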

One of the fundamental ideas in reinforcement learning is the temporal difference (TD) error, which is the difference between the value of the current state and the reward received plus the discounted value of the next state. That may sound abstract, but effectively it’s what you expected minus what you actually got and what you’re expecting next. Okay, that still sounds incomprehensible to anyone who’s not already familiar with RL, so instead here’s an equation. ...
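For orientation, the one-step TD error is conventionally written as below. The symbols ($R_{t+1}$ for the reward, $\gamma$ for the discount factor, $v$ for the value function) follow common textbook notation and are an assumption here; the full post may use different ones.

$$
\delta_{t} = R_{t+1} + \gamma \, v(S_{t+1}) - v(S_{t})
$$

With this sign convention, a positive $\delta_{t}$ means things turned out better than the current estimate $v(S_{t})$ predicted, and a negative one means they turned out worse.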