Proximal Policy Optimization: Derivation, Approximation, and Implementation.
Proximal Policy Optimization (PPO) is a state-of-the-art policy-based reinforcement learning algorithm. It is one of the simplest state-of-the-art techniques, yet it produces stable, generalizable, and accurate learning. It outperforms its predecessor, TRPO, and is far simpler to implement. To start, what are policy-based methods, and how do they learn which action is best at every state?
POLICY BASED METHODS
The Policy Gradient method, also known as REINFORCE, tries to maximize the expected return of an agent following its policy. It learns to do this iteratively through gradient ascent. A single neural network, called the policy network, outputs a probability distribution over all actions at the current state. The objective function for the policy gradient method is as follows:
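J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t \Big]

Here R(\tau) is the total discounted return of a trajectory \tau sampled by following the policy, R_t is the discounted return from time step t onward, and the gradient is estimated from sampled trajectories.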
Intuitively, this increases the probability the model assigns to an action that results in a high return, and decreases the probability it assigns to an action that results in a low return. The model computes the gradient of the objective function with respect to its parameters, scales it by a learning rate, and adds it to the current parameters. You should already be familiar with this algorithm.
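To make the update concrete, here is a minimal sketch of a single REINFORCE update in PyTorch (one of the frameworks used in the notebooks linked at the end). The network shape, state and action dimensions, and hyperparameters here are illustrative assumptions, not taken from those notebooks.

```python
import torch

# Illustrative policy network: maps a 4-dimensional state to logits over 2 actions
# (CartPole-sized numbers, chosen only for this sketch).
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def reinforce_update(states, actions, returns):
    """One gradient step on sampled trajectory data.

    states:  float tensor of shape (T, 4), the visited states
    actions: long tensor of shape (T,), the actions that were taken
    returns: float tensor of shape (T,), the discounted return R_t from each step
    """
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # The gradient of this sample-based loss matches the REINFORCE policy gradient:
    # maximizing E[log pi(a_t|s_t) * R_t] is done by minimizing its negative.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```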
Take a look at the objective function once more. Do you see an issue? There are two problems with this function.
- The R(t) term only measures whether an action's return is high or low in absolute terms, not whether it is better or worse than the other available actions. For example, if most actions in a certain state yield a positive return but some are better than others, the gradient pushes up the probability of all of them, and the model struggles to prefer the best one.
- The R(t) term can span a huge range of values, and unscaled returns rarely work well in a neural network's objective. This leads to parameter updates that are far too large or far too small.
REGULARIZING POLICY BASED METHODS
Clearly, we need to regularize this objective function so that it is centered close to 0 and has low variance. We could simply force the returns onto a fixed, centered scale by hand, but that would not accurately represent the real return values we observe. Recall the advantage function from textbook reinforcement learning, A(s, a). It measures how much better or worse an action is than the average action available in that state, so its values are naturally centered around zero and reasonably well scaled. In short, it gives a good estimate of the value of an action in comparison to all other actions.
We can replace R(t) with A(s, a) in our objective function. But wait, how can we compute A(s, a)? For R(t) we just used the discounted sum of rewards over the trajectory starting at state s and following policy π. Here we turn to a textbook identity in reinforcement learning: A(s, a) = Q(s, a) - V(s). Q(s, a) is the Q-value of taking action a in state s, and V(s) is the value of the state itself, regardless of which action is taken there. We can replace Q(s, a) with the discounted sum of rewards collected after the action was taken, so we can rewrite this as A(s, a) = R(t) - V(s), where t indexes the trajectory after the action was taken. All we need now is the V(s) term, which can be estimated with a simple neural network (the value network). The objective function of that network, with parameters φ, is:
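L(\phi) = \mathbb{E}_t\Big[ \big( V_\phi(s_t) - R_t \big)^2 \Big]

That is, the value network is trained with a simple mean-squared-error regression of its prediction V_\phi(s_t) onto the observed return R_t.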
The objective function of the policy network is now:
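L(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_t \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \Big],
\qquad
A(s_t, a_t) = R_t - V_\phi(s_t)

Its gradient has the same form as the policy gradient above, with R_t replaced by the advantage.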
The R(t) term in both of the above equations is replaced with the Bellman backup used in TD learning, as in Actor-Critic methods. This should be familiar to you.
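Concretely, the one-step Bellman backup bootstraps from the value of the next state:

R_t \;\approx\; r_t + \gamma V_\phi(s_{t+1})
\qquad\Rightarrow\qquad
A(s_t, a_t) \;\approx\; r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)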
THE TRUST REGION
These methods work well, and they address many of the problems specific to reinforcement learning, such as sparse rewards. One problem they don't solve, however, is stability of learning. All of the equations in reinforcement learning are approximations, and they require very precise neural network training to improve the expected return efficiently. No matter which method you choose, the curve of average return over epochs fluctuates so much that you often can't even see the model improving until after thousands of epochs. Trust region methods address this by limiting or penalizing large updates: they want the model to change as little as possible per update while still maximizing the expected return.
In most forms of learning outside reinforcement learning, all of the data is handed to you, and it is nowhere near the amount of data required in a typical reinforcement learning problem, where environments can even be continuous. Reinforcement learning has to operate on a lot of unseen data and generalize heavily. This is why we can't just decrease the learning rate or apply other simple forms of regularization. If the model doesn't improve enough over episodes, it will struggle to collect useful data of its own, because it will keep encountering the same situations.
TRPO: AN OVERVIEW
The derivation of the TRPO objective and its update rule requires a lot of math, including Taylor expansions and second-order partial derivatives. I'll give a brief overview of the objective it attempts to optimize; this objective will also be used in PPO.
Let's begin with the equation for the expected return when following a policy. This is the objective function of the REINFORCE method.
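J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_t \gamma^t r_t \Big]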
Now let's rewrite this equation in terms of an old version of our policy, because we will later use this to constrain how far the new policy can move away from the old one.
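J(\pi_\theta) \;=\; J(\pi_{\theta_{\text{old}}}) \;+\; \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_t \gamma^t A^{\pi_{\theta_{\text{old}}}}(s_t, a_t) \Big]

This is the standard performance-difference identity that TRPO starts from: the new policy's return equals the old policy's return plus the advantages (measured under the old policy) of the actions the new policy takes.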
The A(s, a) computed by the old network measures how good an action is relative to the old policy's typical behavior: an average action under the old policy has an advantage close to 0, and a better-than-average action has a positive advantage. Also note that the trajectory is still sampled with the new policy, so the expectation actually reflects how the new policy performs.
While this equation looks good at first glance, in practice we won't have any data from the new policy, since that is exactly what we are trying to learn, so we can't sample trajectories from it. To get around this, we simply follow trajectories from the old policy and correct for the mismatch with importance sampling.
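The surrogate then becomes:

J(\pi_\theta) \;\approx\; J(\pi_{\theta_{\text{old}}}) \;+\; \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\Big[ \sum_t \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A^{\pi_{\theta_{\text{old}}}}(s_t, a_t) \Big]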
For example, if a bad action was taken while following the old policy, and the new policy is much less likely to take that action, then that sample should barely change the new policy.
If the new and old policies sound a bit confusing, especially because we use our new policy in the objective function, don’t worry, they will make sense in a second.
We obviously want to maximize our objective function with respect to the parameters of the new network, which means we can drop the first term, since it does not depend on the new network at all. Adding TRPO's trust-region constraint, this leaves us with the following problem.
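\max_\theta \;\; \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}}\Big[ \sum_t \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A^{\pi_{\theta_{\text{old}}}}(s_t, a_t) \Big]
\quad \text{subject to} \quad
\mathbb{E}_t\Big[ D_{\mathrm{KL}}\big( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \big) \Big] \le \delta

The constraint keeps the new policy within a "trust region" of the old one, measured by the KL divergence between their action distributions and bounded by a small threshold \delta. It is this KL term that makes the TRPO update expensive.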
The major issue with TRPO is the computational cost of the gradient machinery around the KL divergence term. Specifically, the update requires the Fisher information matrix and, in effect, its inverse over the policy's action distribution, which is very costly for agents with many possible actions. This slows training considerably, which matters for reinforcement learning agents that already take months and enormous amounts of experience to train.
PPO: THE SOLUTION
To address the computational expense of TRPO, PPO replaces the KL-divergence constraint with a computationally inexpensive surrogate loss.
Instead of penalizing the objective for deviating from the previous policy's distribution, PPO clips the probability ratio between the new and old policies. Effectively, once that ratio falls below 0.8 or rises above 1.2, moving it further yields no additional reward for the policy term.
Hence, this does not stop a policy that genuinely needs to change, but it does not reward highly volatile behavior either, and it remains cheap to compute.
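Written out, with the probability ratio r_t(\theta) and clipping parameter \epsilon (the value \epsilon = 0.2 gives the 0.8 and 1.2 bounds above), the clipped PPO objective is:

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta)\, A_t,\;\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t \big) \Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}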
THE CODE
To implement PPO, I've linked two notebooks, one in TensorFlow and one in PyTorch.
TensorFlow 2: https://github.com/dude123studios/AdvancedReinforcementLearning/blob/main/PPO_TF2.ipynb
Pytorch: https://github.com/dude123studios/AdvancedReinforcementLearning/blob/main/PPO_Pytorch.ipynb
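As a reference point, here is a minimal sketch of the clipped surrogate loss in PyTorch; the function name and tensor shapes are my own for illustration and are not taken from the notebooks above.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss (returned as a value to minimize).

    new_log_probs: log pi_theta(a_t | s_t) under the current policy (requires grad)
    old_log_probs: log pi_theta_old(a_t | s_t) recorded during the rollout (detached)
    advantages:    advantage estimates A_t for each time step
    """
    ratio = torch.exp(new_log_probs - old_log_probs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Element-wise minimum of the two terms, negated because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```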