policy gradient
https://rail.eecs.berkeley.edu/deeprlcourse-fa18/static/slides/lec-5.pdf
from
https://rail.eecs.berkeley.edu/deeprlcourse-fa18/

Lilian Weng on policy gradients
https://lilianweng.github.io/posts/2018-04-08-policy-gradient/

The difference between policy-based and value-based methods
https://www.reddit.com/r/reinforcementlearning/comments/mkz9gl/policybased_vs_valuebased_are_they_truly_different/

$$ J(\theta) = E_{\tau \sim p_\theta(\tau)} \left[ \sum_{t} r(s_t, a_t) \right] $$
$$ \theta^* = \arg\max_\theta J(\theta) $$

The objective function \( J(\theta) \) is the expected return of a policy parameterized by \( \theta \). Here \( \tau \sim p_\theta(\tau) \) means that the trajectory \( \tau \) is sampled from the trajectory distribution \( p_\theta(\tau) \) induced by the policy.
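In practice this expectation is estimated by Monte Carlo: run the current policy, collect \( N \) sample trajectories \( \tau_i \), and average their total rewards (the index \( i \) labels sampled trajectories):

$$ J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t} r(s_{i,t}, a_{i,t}) $$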

How do we update the policy parameters \( \theta \)? We take the gradient of \( J(\theta) \), writing \( r(\tau) = \sum_{t} r(s_t, a_t) \) for the total reward of a trajectory and using the identity \( \nabla_\theta p_\theta(\tau) = p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \) (the log-derivative trick):

$$ \nabla_\theta J(\theta) = \nabla_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ \sum_{t} r(s_t, a_t) \right] $$
$$ = \nabla_\theta \, E_{\tau \sim p_\theta(\tau)} \left[ r(\tau) \right] $$
$$ = \nabla_\theta \int p_\theta(\tau) \, r(\tau) \, d\tau $$
$$ = \int \nabla_\theta p_\theta(\tau) \, r(\tau) \, d\tau $$
$$ = \int p_\theta(\tau) \, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} \, r(\tau) \, d\tau $$
$$ = \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, r(\tau) \, d\tau $$
$$ = E_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right] $$
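Because \( p_\theta(\tau) = p(s_1) \prod_t \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t) \) and the dynamics terms do not depend on \( \theta \), we have \( \nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \). Sampling \( N \) trajectories then gives the REINFORCE estimator

$$ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) r(\tau_i) $$

Below is a minimal sketch of this estimator in PyTorch, assuming trajectories have already been collected with the current policy as lists of (observation, action, reward) tuples; `PolicyNet` and `reinforce_step` are illustrative names, not part of any library.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Small categorical policy pi_theta(a | s); the architecture is illustrative."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_step(policy, optimizer, trajectories):
    """One gradient ascent step on J(theta) from N sampled trajectories.

    Each trajectory is a list of (obs, action, reward) tuples collected by
    running the current policy in the environment.
    """
    loss = torch.tensor(0.0)
    for traj in trajectories:
        obs = torch.stack([torch.as_tensor(o, dtype=torch.float32) for o, _, _ in traj])
        acts = torch.tensor([a for _, a, _ in traj])
        total_reward = sum(r for _, _, r in traj)          # r(tau): total trajectory reward
        log_prob_tau = policy(obs).log_prob(acts).sum()    # sum_t log pi_theta(a_t | s_t)
        loss = loss - log_prob_tau * total_reward          # minimizing -J ascends grad J
    loss = loss / len(trajectories)                        # Monte Carlo average over N samples
    optimizer.zero_grad()
    loss.backward()   # autograd yields the grad log pi terms, weighted by r(tau)
    optimizer.step()
```

For a runnable loop, pair this with an environment rollout and an optimizer such as `torch.optim.Adam(policy.parameters(), lr=1e-2)`. This plain estimator has high variance; the slides above cover variance-reduction tricks such as baselines and reward-to-go, which reduce variance without biasing the gradient.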