Sun Apr 18 2021

How to speed up Deep Reinforcement Learning by telling it what to do?

By Jan Scholten, a Xomnia Machine Learning Engineer.

By taking over tedious tasks, autonomous learning can save us a lot of time and effort. It would also take care of its own control engineering, which is labor-intensive in itself. It can even attain optimality, and retain it in changing environments - such as anything in the real world. Isn’t that nice?

As useful as they are, however, autonomous learning methods such as Deep Reinforcement Learning (DRL) require huge amounts of experience before they perform well, which limits how efficiently they can be applied to everyday problems. In the words of Andrew Ng: “RL is a type of machine learning whose hunger for data is even greater than supervised learning. [...] There’s more work to be done to translate this to businesses and practice.”

So, for actual feasibility, we need to address how fast the DRL algorithm learns, known as its sample efficiency. Otherwise, DRL is limited to the most expensive applications in the world, such as backflipping robots.

This blog provides a rough-around-the-edges explainer of Predictive Probabilistic Merging of Policies (PPMP), the first algorithm to leverage directive human feedback (e.g. left/right) for DRL. You may think of it as a toolbox that enables your favourite deep actor-critic algo to process human feedback and converge much faster (it only needs to be off-policy, which is typically the case for actor-critic architectures). PPMP is demonstrated in combination with DDPG (Deep Deterministic Policy Gradient, Lillicrap et al., 2015), a seminal algorithm for continuous action spaces.

What are Reinforcement Learning, Deep Learning, and Deep Reinforcement Learning?

Reinforcement learning is the process where an agent interacts with an environment, and then obtains a reward signal that reflects upon how the agent is doing with respect to its task. The agent will try to maximize the rewards, and in doing so, it learns how to solve problems without prior knowledge.
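As a toy illustration of this loop (everything below is made up for illustration, not part of any RL library), consider an agent that learns by trial and error which of two actions pays off:

```python
import random

random.seed(0)

# Toy environment: only action 1 yields reward.
class ToyEnv:
    def step(self, action):
        return 1.0 if action == 1 else 0.0

# Toy agent: running-average value per action, epsilon-greedy choice.
class ToyAgent:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.values = {0: 0.0, 1: 0.0}
        self.counts = {0: 0, 1: 0}

    def act(self):
        if random.random() < self.epsilon:  # explore occasionally
            return random.choice([0, 1])
        return max(self.values, key=self.values.get)

    def learn(self, action, reward):
        self.counts[action] += 1
        # Incremental average of the rewards observed for this action.
        self.values[action] += (reward - self.values[action]) / self.counts[action]

env, agent = ToyEnv(), ToyAgent()
for _ in range(200):
    action = agent.act()
    reward = env.step(action)
    agent.learn(action, reward)
# Without prior knowledge, the agent discovers that action 1 is rewarded.
```

This is the entire RL premise in miniature: no labels, no instructions, just a reward signal to maximize.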

Deep Learning is a method to approximate arbitrary functions from input-output data pairs. It involves neural networks, and is therefore suitable for high-dimensional and possibly unstructured data.

Deep Reinforcement Learning (DRL) combines reinforcement learning and deep learning, as it uses neural networks to approximate one or more mappings in the RL framework.

What are examples of Deep Reinforcement Learning?

DRL is quite a generic approach, and therefore suits a broad range of possible applications. To name a few, DRL has been applied in autonomous driving, drug dosing (e.g. cancer treatments), optimization of heat management for Google’s data centers, natural language text summarization, battling against cyber security threats, and robotic control challenges such as opening doors.

Challenge: DRL’s need for excessive training

The major hurdle in DRL applications is the need for lots and lots of training, which makes it an expensive solution, or even one that is not feasible because the required interactions take too much time or wear out test setups.

By advancing algorithms, we can partly overcome the need for excessive training, but not completely. A fundamental issue remains: as it starts learning, a tabula-rasa agent does not have the slightest idea of how its goal is formulated, nor does it have simple notions such as gravity, or the insight that things may break under impact. It starts from a blank slate, and therefore its first attempts will always be extremely ignorant, no matter how smart the learning is.

On the other hand, humans are full of ideas, and many of these help to succeed at tasks even if it is our first try. We can thus realize crucial learning accelerations if we convey our insights to learning control agents.

The Predictive Probabilistic Merging of Policies algorithm as a solution

While humans have superior initial performance, their final performance is typically below that of RL, because of humans’ poorer precision and slower reaction time. It is therefore reasonable to assume that the corrections, of which the intended magnitude is not known, should become more subtle as training progresses.

In PPMP, the imperfect actions of the agent are combined with the noisy feedback in a probabilistic way, with respect to the abilities of the agent and the trainer for the current state. This idea is derived from Losey & O’Malley (2018), who phrased it best: ‘When learning from corrections ...[the agent] should also know what it does not know, and integrate this uncertainty as it makes decisions’.

So in PPMP, the respective uncertainties (which reflect the abilities of the agent and the trainer) determine the magnitude of the corrections. This means that initial exploration is vigorous, whereas upon convergence of the learner the corrections become more subtle, such that the trainer can refine the policy. Assuming Gaussian distributions, this principle may (for the early learning stage) be depicted as:

where we observe a broad distribution for the rather pristine policy, and a narrower one for the human feedback. The posterior estimate of the optimal action a is determined as

where h denotes the human feedback (a vector with entries -1, 0, or 1) and the estimated error

which has predefined bounds c on the correction, and then the actual tradeoff as a function of the covariances (something like a Kalman gain):

For the bounding vectors c, the lower bound expresses a region of indifference (or precision) of the human feedback; it is assumed that the RL algorithm will effectively find the local optimum within it. This is depicted by the truncated green distribution, and it ensures the correction has a significant effect. We may, for example, set this bound to a fraction of the action space. The upper bound may be used to control how aggressive the corrections can be. With the precision of the human feedback assumed to be a constant (multivariate) Gaussian, the only thing we need is a way to obtain up-to-date estimates of the agent’s abilities. For that, we use the multihead approach from Osband et al. (2016).
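A minimal numpy sketch may make this merging step concrete. Note that this is my paraphrase: the gain mirrors the ‘something like a Kalman gain’ description, and the error estimate, function names, and default bounds are illustrative assumptions rather than the paper’s exact formulation:

```python
import numpy as np

def merge_policies(a_policy, cov_policy, h, cov_human, c_lo=0.1, c_hi=1.0):
    """Merge the agent's action with directive human feedback.

    a_policy   : the agent's action estimate (vector)
    cov_policy : covariance over the multihead action estimates
    h          : human feedback, entries in {-1, 0, 1}
    cov_human  : assumed constant covariance of the human feedback
    c_lo, c_hi : predefined bounds c on the correction magnitude
    """
    # Tradeoff as a function of the covariances (something like a Kalman
    # gain): an uncertain agent plus a confident human gives a gain near
    # identity, a confident agent shrinks the gain towards zero.
    gain = cov_policy @ np.linalg.inv(cov_policy + cov_human)
    # Estimated correction magnitude, kept within the bounds c.
    e_hat = np.clip(np.sqrt(np.diag(cov_policy)), c_lo, c_hi)
    # Shift the action in the direction the human indicated.
    return a_policy + gain @ (h * e_hat)
```

Early in learning the policy covariance is large, so the gain is close to identity and the human’s correction comes through almost fully; near convergence the gain, and hence the correction, shrinks.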

In actor-critic DRL, the critic and the actor are estimated using neural networks: a bunch of interconnected neurons. Architectures vary, but if we consider a simple example where the state vector x has three elements and the action vector a two, a two-layer neural net could look like this:

This commonplace architecture can be adapted to our need for covariance. By making a couple of copies of the actor’s output layer (with different initial weights and training samples), we can obtain samples of the action estimate in a straightforward and efficient way. This also allows us to make inferences about the agent’s abilities by looking at the covariance between the action estimates. Observe the following scheme:

Again, under the assumption that everything is Gaussian, the distribution over the optimal action estimate may now be computed from the covariance over all the heads. That leaves us with the question of which action to use as the policy. With multimodality and temporal consistency in mind, it is best to use a designated head per episode.
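A small numpy sketch of the multihead idea (the layer sizes, head count, and all names here are illustrative assumptions, not the paper’s settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: K heads on top of a shared hidden layer.
K, n_hidden, n_action = 10, 16, 2
# Each head is a copy of the actor's output layer with different weights.
heads = [rng.normal(scale=0.5, size=(n_action, n_hidden)) for _ in range(K)]

def head_actions(hidden):
    """One action estimate per head from the shared hidden features."""
    return np.stack([W @ hidden for W in heads])       # shape (K, n_action)

def action_covariance(hidden):
    """Covariance over the heads, reflecting the agent's (un)certainty."""
    return np.cov(head_actions(hidden), rowvar=False)  # shape (n_action, n_action)

# For temporal consistency, a designated head acts for a whole episode.
episode_head = int(rng.integers(K))
hidden = rng.normal(size=n_hidden)
policy_action = head_actions(hidden)[episode_head]
```

As the heads are trained on the same data, their estimates converge and the covariance shrinks, which is exactly the uncertainty signal the merging step needs.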

Architecture of the PPMP algorithm

PPMP consists of five modules, where the fifth one walks and talks:

The ‘merging of policies’ explained above happens in the ‘Selector’ module, which also receives the given feedback. To make the algorithm more feedback efficient, a ‘Predictor’ module (with the same architecture as the critic) is trained with corrected actions, such that it can provide us with rough directions --- just what we need during the earliest stages of learning.

Whilst these predicted corrected actions are very useful as a rough information source, they will never be very accurate. Later, when the actor’s knowledge becomes of better quality, we want to crossfade the influence from the predictor to the actor. Yet exactly how and when this tradeoff is scheduled depends on the problem, the progress, the feedback, and most likely on the state as well. The beauty of using an actor-critic learning scheme is that we may resolve the question of how the actions should best be interleaved using the critic: it can act as an arbiter, since it has value estimates for all state-action pairs.
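To make the arbiter idea concrete, here is a minimal sketch (all names are hypothetical, and the toy critic stands in for a learned Q-network):

```python
import numpy as np

def arbitrate(q_value, state, actor_action, predictor_action):
    """Return whichever candidate action the critic values higher.

    q_value(state, action) -> scalar; a stand-in for the learned critic.
    """
    candidates = [actor_action, predictor_action]
    values = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(values))]

# Toy critic that prefers actions close to +1 in any state.
toy_q = lambda state, action: -float(np.sum((action - 1.0) ** 2))
chosen = arbitrate(toy_q, None, np.array([0.2]), np.array([0.9]))
```

As the critic improves with training, this arbitration automatically shifts trust from the predictor to the actor, without a hand-tuned schedule.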

Now that we have discussed the internals, the question remains:

How much does PPMP increase the sample efficiency of DRL in comparison to DDPG?

PPMP is benchmarked against DDPG (pure DRL) and DCOACH (Deep Corrective Advice Communicated by Humans, Celemin et al.), a deep learning approach that learns from corrective feedback only. As is customary in the domain of RL, we consider standard problems found in the OpenAI gym that require continuous control. The first environment, Mountaincar, requires driving a little cart up a hill and reaching the flag. Because its engine is small, the cart needs to be rocked back and forth a bit to gain momentum. The second environment is the Pendulum problem, again underactuated, where a red pole is to be swung to its upright equilibrium and balanced there. Both problems penalize applied control action.

Besides testing with actual human feedback, feedback is synthesised using a converged DDPG policy, such that the algorithms can be compared consistently and fairly, without having to worry about the variance in human feedback. Below are the learning curves (top charts), where we desire to obtain maximum return as fast as possible for Mountaincar (left) and Pendulum (right). Just beneath is the amount of feedback as a fraction of the samples --- less feedback means less work for the human, which is better.
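A plausible sketch of such an oracle trainer (the exact synthesis protocol isn’t spelled out here; the deadband width is my assumption, mirroring the region of indifference discussed earlier):

```python
import numpy as np

def synthetic_feedback(oracle_action, agent_action, deadband=0.1):
    """Emulate a human trainer using a converged policy as the oracle.

    Returns -1/0/+1 per action dimension: 0 inside the deadband (the
    trainer's region of indifference), otherwise the sign of the error.
    The deadband width is an illustrative assumption.
    """
    error = oracle_action - agent_action
    h = np.sign(error)
    h[np.abs(error) < deadband] = 0.0
    return h
```

An erroneous-feedback oracle, as used in the robustness study below, could then simply flip the sign of h with a given probability.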

We see that DDPG (red) fails in Mountaincar. It rarely reaches the flag and therefore there is little reward. DCOACH (green) hardly solves the Pendulum problem. PPMP (we are blue) uses significantly less feedback but converges at least 5x faster and has superior final performance (on par with the oracle in purple). As an ablation study, the orange lines demonstrate PPMP without predictor (PMP). For both environments, the orange curves get more feedback, but perform worse. This demonstrates that the predictor module makes the teacher’s job even easier.

Although performance is one hurdle towards making DRL work, eventual application also depends on the robustness of algorithms. Real-world problems and actual human feedback both feature real-world noise, and the above innovations are only meaningful if they can cope with this noise without letting applications crash. With the oracle implementation, we can precisely emulate erroneous feedback to assess the robustness of our algorithms. For error rates up to 30%, we stick to the previous colouring, while different line styles now indicate the applied error rate:

PPMP is more robust to erroneous feedback than other algorithms, and it retains optimality.

It is clear that DCOACH (green) cannot handle erroneous feedback: performance quickly drops to zero as the feedback becomes less perfect. Because PPMP also learns from environmental reward, it is eventually able to downplay the misguidance and fully solve the problem.

Everything so far has all been related to simulated feedback. Now, what happens when we use actual human feedback that suffers from errors and delays? Below, we observe the same tendencies: less feedback, faster learning, and greater final performance for both environments:

PPMP outperforms DCOACH for both environments and requires less feedback.

Last but not least, let us consider a typical use case where the teacher has limited performance and is not able to fully solve the problem itself. We now consider the Lunarlander environment, a cute little game (until you actually try it), where a space pod needs to land between some flags (or it crashes). We use an oracle that more or less knows how to hover, but has no clue how to come safely to rest. The environment assigns a large negative reward to crashes, and 100 points for a gentle landing. PPMP compares to DDPG as follows:

Note that DDPG does not solve the problem. The performance difference between PPMP and its teacher may seem small in terms of reward points, but PPMP actually found the complete solution (landing between the flags), thereby exceeding the performance of the teacher (purple).

Incorporating human feedback to speed up DRL

Humans have better insight, whilst computers have greater precision and responsiveness. Therefore, the combined learning potential is greater than the sum of its parts, and we can overcome much of the sample efficiency struggles of DRL by incorporating human feedback.

Directional feedback is, in contrast to other feedback types, particularly effective for this purpose. PPMP takes a probabilistic approach where directions given by the teacher directly affect the action selection of the agent. In addition, the corrections are predicted, such that the need for feedback remains manageable.

As a result, PPMP can be used in cases where DDPG or DCOACH fails. In other cases, learning is accelerated by 5-10x, or performance exceeds that of the trainer. In real-world applications, telling your DRL what to do can thus make the difference between a working robot and a broken one.

Further reading

  • A complete treatment of PPMP, including background on (deep) reinforcement learning, is found in the thesis titled Deep Reinforcement Learning with Feedback-based Exploration.
  • PPMP was presented at the 58th Conference on Decision and Control in Nice. There’s a 6-page paper in the proceedings (also on arXiv).
  • The codebase of this study is hosted on

This blog was written by Jan Scholten, a Xomnia Machine Learning Engineer. Jan develops and productionizes in-house data science projects for clients. When he is not doing that, he enjoys surfing.