# Max Entropy RL Explanation

The paper on max entropy RL (Reinforcement Learning with Deep Energy-Based Policies) is pretty cool, but some of the proofs are lacking explanation. This is papered over by the standard "it is easy to see..." evasions that mathematicians take such delight in. Usually, space restrictions are the culprit, but I wish authors would either provide more detail or at least a specific reference (including the page number!) for where I can learn more.

On HackerNews, a request for greater explanation of one of the proofs was requested, and since I didn't understand, and understanding this sort of thing is my job, I thought I would look into it.

$$ \newcommand{\eqdef}{\stackrel{\text{def}}{=}} \newcommand{\eqmaybe}{\stackrel{\text{?}}{=}} $$

Define the "soft Q-value" for an arbitrary policy $\pi$ via:

$$ Q_{soft}^{\pi}(s,a) \eqdef r_{0} + \mathbb{E}_{\pi} \bigg[ \sum_{t=1}^\infty \gamma^{t} \Big( r_{t} + \mathcal{H}(\pi(\cdot | s_{t})) \Big) \bigg| s_{0} = s, a_{0} = a \bigg] $$

(Hereafter, we'll just use $ Q^{\pi} $ instead of $ Q_{soft}^{\pi} $ since the standard Q-values aren't needed for our purposes).

Equation 18 (in Appendix 2) claims that:

$$ \mathcal{H}(\pi(\cdot | s)) + \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] \leq \mathcal{H}(\tilde{\pi}(\cdot | s)) + \mathbb{E}_{a \sim \tilde{\pi}} [ Q_{soft}^{\pi} (s, a) ] $$

For an arbitrary policy $ \pi $ with $ \tilde{\pi}(a|s) \propto \text{exp}[ Q_{soft}^{\pi}(s,a)] $

They claim that this (and the associated Theorem 4) are easily proved once you "notice" that (from eq. 19)

$$ \mathcal{H}(\pi(\cdot | s)) + \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] = - \text{D}_{KL}( \pi (\cdot | s) || \tilde{\pi} (\cdot | s) ) + \log \int \text{exp}[ Q^{\pi}(s,a) ] \, \text{d}a $$

Personally, I don't expect my readers to "notice" things that I cannot prove in a couple of lines or aren't tricks used in every other paper, but maybe I'm just not familiar with this subfield.

Let's tackle the setting where we've got a discrete action space, expanding the Kullback-Leibler divergence term so that we can get rid of the entropy term on the LHS:

$$
\begin{aligned}
- \text{D}_{KL}( \pi (\cdot | s) || \tilde{\pi} (\cdot | s) )

&= -\sum_{a} \pi (a | s) \log \bigg[ \frac{ \pi(a|s) }{ \tilde{\pi}(a|s) } \bigg]
\\
&= -\sum_{a} \pi (a | s) \big( \log [\pi(a|s)] - \log [ \tilde{\pi}(a|s)] \big)
\\
&= \mathcal{H} ( \pi( \cdot | s) ) + \sum_{a} \pi (a | s) \log [ \tilde{\pi}(a|s)]
\end{aligned}
$$

So now we need to "notice" that

$$ \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] = \sum_{a} \pi (a | s) \log [ \tilde{\pi}(a|s)] + \log \sum_{a} \text{exp}[ Q^{\pi}(s,a) ] $$

This is actually pretty cool, and it makes use of the fact that $ \tilde{\pi} $ is proportional to the exponentiated Q-values. Say that $ \tilde{\pi}(a | s) = \beta \exp [Q^{\pi}(s,a)] $ for some $ \beta $.

$$\begin{aligned} \log \sum_{a} \exp [Q^{\pi}(s,a)] &= \log \bigg[ \sum_{a} (1/\beta) \tilde{\pi}(a | s) \bigg] \\ &= \log \bigg[ (1/\beta) \sum_{a} \tilde{\pi}(a | s) \bigg] \\ &= - \log \beta + \log \bigg[ \sum_{a} \tilde{\pi}(a | s) \bigg] \\ &= - \log \beta + \log [ 1 ] \\ &= -\log \beta \end{aligned}$$

Because $ \tilde{\pi}(a|s) $ is a probability distribution and we're summing over all possible actions, it goes to 1, and the logarithm of 1 is zero, as you may have noticed.

Now we further notice that this trick applies in reverse:

$$\begin{aligned} \sum_{a} \pi (a | s) \log [ \tilde{\pi}(a|s)] &= \sum_{a} \pi (a | s) \log [ \beta \exp [Q^{\pi}(s,a)] ] \\ &= \sum_{a} \pi (a | s) \bigg( \log [ \beta ] + \log [ \exp [Q^{\pi}(s,a)] ] \bigg) \\ &= \sum_{a} \pi (a | s) \bigg( \log [ \beta ] + Q^{\pi}(s,a) \bigg) \\ &= \log \beta + \sum_{a} \pi (a | s) Q^{\pi}(s,a) \end{aligned}$$

Where we once again noticed that $ \pi(a | s) $ is a probability function, and that $ \beta $ was independent of it, so we could factor it out and sum. Combining the terms we analyzed, we get:

$$\begin{aligned} \sum_{a} \pi (a | s) \log [ \tilde{\pi}(a|s)] + \log \sum_{a} \text{exp}[ Q^{\pi}(s,a) ] &= \log \beta + \sum_{a} \pi (a | s) Q^{\pi}(s,a) -\log \beta \\ &= \sum_{a} \pi (a | s) Q^{\pi}(s,a) \end{aligned}$$

Which allows us to (finally) notice that:

$$ \sum_{a} \pi (a | s) Q^{\pi}(s,a) = \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] $$

### Noticing Our Way to Eq. 18

We still have the claim that:

$$ \mathcal{H}(\pi(\cdot | s)) + \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ] \leq \mathcal{H}(\tilde{\pi}(\cdot | s)) + \mathbb{E}_{a \sim \tilde{\pi}} [ Q_{soft}^{\pi} (s, a) ] $$

...and it is not immediately clear how (19) helps us arrive at the inequality.

But it actually does, if we notice two things:

- The KL-divergence is nonnegative,
- That if we have $\tilde{\pi}(a | s) \propto \exp{Q^{\pi}(s, a)}$, then we need to be a little more specific and introduce a factor (for each state), denoted $\beta(s)$, that ensures our probabilities sum to one
- Let $\tilde{\pi}(a | s) = \beta(s) \exp{Q^{\pi}(s, a)}$
- Then $1/\beta(s) = \sum_{a} \exp [ Q^{\pi}(s, a) ]$.

With this in mind, we have:

$$ \begin{aligned} \mathbb{E}_{a \sim \tilde{\pi}} [ Q_{soft}^{\pi} (s, a) ] + \mathcal{H}(\tilde{\pi}(\cdot | s)) &= \sum_{a} \tilde{\pi}(a | s) \bigg[ Q^{\pi}(s, a) - \log ( \tilde{\pi}(a | s) ) ) \bigg] \\ &= \sum_{a} \tilde{\pi}(a | s) \bigg[ Q^{\pi}(s, a) - \log [ \beta(s) \exp{Q^{\pi}(s, a)} ] \bigg] \\ &= \sum_{a} \tilde{\pi}(a | s) \bigg[ Q^{\pi}(s, a) - \log [ \beta(s) ] - \log [ \exp{Q^{\pi}(s, a)} ] \bigg] \\ &= \sum_{a} \tilde{\pi}(a | s) \bigg[ Q^{\pi}(s, a) - \log [ \beta(s) ] - Q^{\pi}(s, a) \bigg] \\ &= - \sum_{a} \tilde{\pi}(a | s) \log [ \beta(s) ] \\ &= - \log[ \beta(s) ] = \log [ 1/\beta(s) ] \\ &= \log \bigg[ \sum_{a} \exp [ Q^{\pi}(s,a) ] \bigg] \end{aligned} $$

Which, you may notice, is part of (19). Now we add in the negative KL divergence and we get the inequality in (18). Bam!

$$
\begin{aligned}
\mathbb{E}_{a \sim \tilde{\pi}} [ Q_{soft}^{\pi} (s, a) ]
+ \mathcal{H}(\tilde{\pi}(\cdot | s))
&= \log \bigg[ \sum_{a} \exp [ Q^{\pi}(s,a) ] \bigg]
\\
&\geq
- \text{D}_{KL}( \pi (\cdot | s) || \tilde{\pi} (\cdot | s) )

+ \log \bigg[ \sum_{a} \exp [ Q^{\pi}(s,a) ] \bigg]
\\
&= \mathcal{H}(\pi(\cdot | s))
+ \mathbb{E}_{a \sim \pi} [ Q_{soft}^{\pi} (s, a) ]
\end{aligned}
$$

Which is what Eq. 18 claims.

So while the paper was interesting, I do find it kinda frustrating that in order to understand the proof I had to do a series of non-obvious manipulations and make use of some facts that are "obvious" only if you've done this sort of stuff before. And I'm not sure there isn't a better solution or something more intuitive that I could've tried instead-- perhaps if I'd solved it via measure theory instead of just working the discrete case? However, that's about all the noticing I'm able to do at the moment since I've got my own theorems to prove.