Policy gradient

Berkeley Deep RL course, lecture 5 slides: https://rail.eecs.berkeley.edu/deeprlcourse-fa18/static/slides/lec-5.pdf
(from https://rail.eecs.berkeley.edu/deeprlcourse-fa18/)
Lilian Weng, "Policy Gradient Algorithms": https://lilianweng.github.io/posts/2018-04-08-policy-gradient/
The difference between policy-based and value-based methods:
https://www.reddit.com/r/reinforcementlearning/comments/mkz9gl/policybased_vs_valuebased_are_they_truly_different/
$$ J(\theta) = E_{\tau ∼p_\theta(\tau)} \left[ \sum_{t}r(s_t, a_t) \right]$$
$$ \theta^* = \underset{\theta}{\arg\max} \ J(\theta) $$
The objective function \( J(\theta) \) is the expected return of a policy parameterized by \( \theta \).
\( \tau∼p_\theta(\tau) \) means that the trajectory
\( \tau \) is sampled from the distribution
\( p_\theta(\tau) \).
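As a concrete reading of this objective, here is a minimal Monte Carlo sketch that estimates \( J(\theta) \) by sampling trajectories and averaging their total rewards. It assumes a Gymnasium-style environment and a hypothetical `policy(theta, s)` function that samples an action from \( \pi_\theta(a \mid s) \); neither comes from the sources above.

```python
import numpy as np

def estimate_return(env, policy, theta, num_trajectories=100, horizon=200):
    """Monte Carlo estimate of J(theta): average total reward over sampled trajectories."""
    returns = []
    for _ in range(num_trajectories):
        s, _ = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            a = policy(theta, s)                           # sample a_t ~ pi_theta(. | s_t)
            s, r, terminated, truncated, _ = env.step(a)
            total_reward += r                              # accumulate sum_t r(s_t, a_t)
            if terminated or truncated:
                break
        returns.append(total_reward)
    return np.mean(returns)                                # approximates E_{tau ~ p_theta(tau)}[ sum_t r(s_t, a_t) ]
```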
How do we update the policy parameters \( \theta \)?
First, we take the gradient of \( J(\theta) \) with respect to \( \theta \):
$$
\nabla_\theta \ J(\theta)
= \nabla_\theta \ E_{\tau ∼p_\theta(\tau)} \left[ \sum_{t}r(s_t, a_t) \right]
$$
Writing \( r(\tau) = \sum_{t} r(s_t, a_t) \) for the total reward of the trajectory,
$$
= \nabla_\theta \ E_{\tau ∼p_\theta(\tau)} \left[ r(\tau) \right]
$$
$$
= \nabla_\theta \displaystyle \int p_\theta(\tau) \ r(\tau)d\tau
$$
$$
= \displaystyle \int \nabla_\theta \ p_\theta(\tau) \ r(\tau)d\tau
$$
$$
= \displaystyle \int p_\theta(\tau) \frac{\nabla_\theta \ p_\theta(\tau)}{p_\theta(\tau)} \ r(\tau)d\tau
$$
Using the identity \( \nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta \ p_\theta(\tau)}{p_\theta(\tau)} \),
$$
= \displaystyle \int p_\theta(\tau) \ \nabla_\theta \log p_\theta(\tau) \ r(\tau)d\tau
$$
$$
= E_{\tau ∼p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \ r(\tau) \right]
$$
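In practice, \( \nabla_\theta \log p_\theta(\tau) \) reduces to \( \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \), because the initial-state and dynamics terms in \( p_\theta(\tau) \) do not depend on \( \theta \). Below is a minimal REINFORCE-style sketch of the resulting Monte Carlo gradient estimator. It assumes a Gymnasium-style environment with discrete actions, flat observation vectors, and a linear-softmax policy; names like `policy_gradient_estimate` are illustrative, not from the sources above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    # For a linear-softmax policy pi_theta(a|s) = softmax(theta @ s)[a],
    # the gradient wrt row b of theta is (1[a == b] - pi_theta(b|s)) * s.
    probs = softmax(theta @ s)
    grad = -np.outer(probs, s)
    grad[a] += s
    return grad

def policy_gradient_estimate(env, theta, num_trajectories=50, horizon=200, seed=0):
    """Monte Carlo estimate of grad J(theta) = E_tau[ grad_theta log p_theta(tau) * r(tau) ]."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(num_trajectories):
        s, _ = env.reset()
        score = np.zeros_like(theta)   # accumulates sum_t grad_theta log pi_theta(a_t|s_t)
        ret = 0.0                      # accumulates r(tau) = sum_t r(s_t, a_t)
        for _ in range(horizon):
            s = np.asarray(s, dtype=float)
            probs = softmax(theta @ s)
            a = int(rng.choice(len(probs), p=probs))
            score += grad_log_pi(theta, s, a)
            s, r, terminated, truncated, _ = env.step(a)
            ret += r
            if terminated or truncated:
                break
        grad += score * ret            # one sample of grad_theta log p_theta(tau) * r(tau)
    return grad / num_trajectories     # sample mean approximates the expectation
```

A gradient ascent step would then be `theta += learning_rate * policy_gradient_estimate(env, theta)`. This estimator is unbiased but high-variance, which is why variance-reduction techniques such as reward-to-go and baselines are typically applied on top of it.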