Review of: Distributional Reinforcement Learning with Quantile Regression


For this series, I will attempt to provide insight into some of the complicated theory that underlies a breakthrough paper by Will Dabney, Mark Rowland, Marc G. Bellemare, Rémi Munos in the field of Reinforcement Learning.

Wassersten Metric

p-Wasserstein Metric is a probability metric that considers the dissimilarity of various outcome events for two distributions. Recall that for a random variable Y, the Inverse Cumulative Function is defined as: $F_Y^{-1}(\omega) := \inf\{\ y \in \ \mathbb{R} : F_Y(y) \geq \omega \}\ $. In plain english, a quantile $\omega$ is defined to be the greatest lower bound (inf) of the set of values of $y \in Y$ such that the cumulative density function $F(y) \geq \omega$.

\[W_{p}(U, V) = \big( \int_{0}^1 \mid F^{-1}_Y(w) - F^{-1}_U(w) \mid^p dw \big)^{\frac{1}{p}}\]

Convergence of Distributional Bellman Operator

Defining Distributional Bellman Operator

If we let $\mathcal{X}$ and $\mathcal{A}$ be the set of states and actions respectively, then we are able to define a action-value distribution with finite moments.

\[\mathcal{Z} := \{\ Z : \mathcal{X} \times \mathcal{A} \implies P(\mathbb{R}) \mid \mathbb{E}[ \mid Z(x,a) \mid^p ] \leq \infty, \forall(x,a), p \geq 1 \}\\]

Since Temporal Difference Algorithms incrementally improve an estimate of Value Expectation through the Bellman Operator, the authors similarily define a Distributional Bellman Operator. Recall an operator typically maps values of one kind of mathematical object to similar objects, in our case.

\[\mathcal{T}^{\pi} : \mathcal{Z} \to \mathcal{Z}\]

The authors choose to define:

\[\mathcal{T}^{\pi}Z(x,a) := R(x,a) + \gamma Z(x\prime,a \prime ), \ x \prime \sim P(\cdot \mid x , a), \ a \prime \sim \pi(\cdot \mid x \prime)\]

Essentially stating that the current distribution of value at a particular state is equivalent to it’s immediate reward and a discounted future value distribution that follows when you sample an $x \prime$ and $a \prime$ from the state transition function and policy distributions.

Distributional Bellman Operator is a Contraction

Now when we select any two value distributions $Z_1, Z_2 \in \mathcal{Z}$ and compare them with the maximal form of the Wasserstein metric we get the following. And remember, maximal form is a fancy way of saying we only care about the largest $\omega-$th difference of two distributions.

\[\bar{d_{p}}(Z_1, Z_2) := \sup_{s,a} W_{p}(Z_1(x,a), Z_2(x,a))\]

The authors prove as a lemma that when we apply the Bellman Operator $\mathcal{T}^{\pi}$ on any two value distribtions $Z_1, Z_2 \in \mathcal{Z}$, it is a contraction.

\[\bar{d_{p}}(\mathcal{T}^{\pi}Z_1, \mathcal{T}^{\pi}Z_2) \leq \gamma \bar{d_{p}}(Z_1, Z_2)\]

The significance of the contraction operation is that the distance metric is decreased after the operation. If this is applied many times over, eventually reaching a unique fixed point (optimal value distribution).

Cannot Minimize Wasserstein Metric Directly Using Stochastic Gradient Descent

This theoretical result is highlighted by the fact that one cannot directly minimize this metric using optimization techniques because the minimum of the expected sample loss is not equivalent to the minimum of the true Wasserstein loss. That is where $\hat{Y_m}$ is a sample distribution of the real distribution $B$

\[\arg\min_{\mu} \mathbb{E}[W_{p}(\hat{Y_m}, B_\mu)] \neq \arg\min_{\mu}W_{p}(B, B_\mu))\]

Approximately Minimizing Wasserstien Metric

Quantile Projection

To circumvent the problem above, the authors define a parameterized value distribution:

\[Z_\theta(x,a) := \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_{i}(x,a)}\]

Essentially, this is a discrete uniform distribution which parameterizes and maps state action pairs (support defined as \(\{\ \theta_1, \dots , \theta_N \}\\) ) to probability outcomes denoted by Dirac Delta Impulses in N locations. Shifting any of these Impulses changes the characterization of the distribution, thus it’s in our interest to fit these as the original distribution so that they communicate the same information.

The authors then define a Quantile Projection operator as the minimum difference of a value distribution $Z \in \mathcal{Z}$ estimated by $\mathcal{Z}_Q$. The operator defines the best parametaterized estimate according to our metric.

\[\Pi_{W_1} := \arg \min_{\mathcal{Z}_Q \in \mathcal{Z}}(Z, Z_{\theta})\]

Figure 2
Figure 2.

Evidently from figure 2 we can see that to minimize the Wasserstein Metric (shaded area) one must fit the support \(\{\ \theta_1, \dots , \theta_N \}\\) with $\theta_i = F^{-1}( \hat{\tau_{i}} )$ where the quantile midpoints is defined as $\hat{\tau_{i}} = \frac{\tau_{i-1} + \tau_{i}}{2}$

Quantile Regression

In Quantile Regression, instead of improving of predictor model for mean valued response variables, we instead model the response at quantiles using this regression technique. Therefore, in the context of our problem we attempt to approximate the quantiles of sample distributions $\hat{Z} \sim Z$ with $\theta$ and perform gradient descent on the following objective function:

\[\sum_{i=1}^{N}\mathbb{E}_{\hat{Z} \sim Z}[\rho_{\hat{\tau_{i}}}(\hat{Z} - \theta_i)]\]

Combining Projection and Bellman Update

Therefore the authors come to the conclusion of the paper stating that:

\[\bar{d_{\infty}}(\Pi_{W_1}\mathcal{T}^{\pi}Z_1, \Pi_{W_1}\mathcal{T}^{\pi}Z_2) \leq \gamma \bar{d_{\infty}}(Z_1, Z_2)\]