DeepMind Found New Approach To Create Faster RL Models

Researchers from DeepMind and McGill University recently proposed a new approach to speed up the solution of complex reinforcement learning problems. They introduced a divide-and-conquer approach to reinforcement learning (RL) that, combined with deep learning, could help scale up the capabilities of RL agents.

For several years, reinforcement learning has provided a conceptual framework for addressing several fundamental problems. It has been used in applications such as controlling robots, simulating artificial limbs, developing self-driving cars, and playing games like poker and Go.

The recent combination of reinforcement learning with deep learning has produced several impressive achievements and is considered a promising approach to important sequential decision-making problems that are currently intractable. One obstacle that remains is the amount of data an RL agent needs to learn to perform a task.

Behind the Approach

In this work, the researchers argue that the range of problems RL agents can tackle could be significantly extended if the agents are endowed with appropriate mechanisms to leverage prior knowledge. The framework rests on the premise that an RL problem can usually be decomposed into a multitude of “tasks.”

The researchers generalised two fundamental operations underlying much of RL, policy evaluation and policy improvement, from single to multiple operands, i.e. tasks and policies, respectively. According to them, this generalisation allows the solution of one task to speed up the solution of other tasks.

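Generalised policy evaluation (GPE) is the computation of the value function of a policy on a set of tasks. Generalised policy improvement (GPI) is the computation of a policy that performs at least as well as each policy in a given set. The generalised versions of these two procedures are jointly referred to as “generalised policy updates.”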
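As a rough sketch of the two operations in symbols: the notation below is an assumption made for illustration and does not appear in the article. Here Q denotes an action-value function, r a task's reward function, and π a policy.

```latex
% GPE: evaluate one policy \pi on every task (reward function) r in a set \mathcal{R}
Q^{\pi}_{r}(s, a) \quad \text{for all } r \in \mathcal{R}

% GPI: improve over a set of policies \pi_1, \dots, \pi_n on a single task r
\pi'(s) \in \arg\max_{a} \, \max_{i \in \{1, \dots, n\}} Q^{\pi_i}_{r}(s, a)
```

Read this way, GPE extends evaluation from one task to many, while GPI extends improvement from one policy to many.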

Generalised policy updates make it possible to reuse the solutions of tasks in two distinct ways:

When a task’s reward function can be approximated as a linear combination of the reward functions of other tasks, the reinforcement learning problem can be reduced to a simpler linear regression that is solvable with only a fraction of the data.
When the linearity constraint is not satisfied, the agent can still leverage the solutions of tasks by using them to interact with and learn about the environment. This can also considerably reduce the amount of data needed to solve the problem.
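The first case can be illustrated with a minimal sketch. It assumes the rewards of previously solved tasks have been recorded on a batch of transitions, and it uses ordinary least squares to find the weights. The function name `fit_preferences`, the synthetic data, and the weight values are hypothetical and are not taken from the researchers' code.

```python
import numpy as np


def fit_preferences(base_rewards, new_rewards):
    """Least-squares weights w such that new_rewards ≈ base_rewards @ w.

    base_rewards: (n_transitions, n_tasks) rewards of previously solved tasks.
    new_rewards:  (n_transitions,) rewards observed for the new task.
    """
    w, *_ = np.linalg.lstsq(base_rewards, new_rewards, rcond=None)
    return w


# Hypothetical, noise-free data: the new task's reward is an exact
# linear combination of three previously solved tasks' rewards.
rng = np.random.default_rng(0)
base_rewards = rng.normal(size=(50, 3))
true_w = np.array([1.0, -0.5, 2.0])
new_rewards = base_rewards @ true_w

w_hat = fit_preferences(base_rewards, new_rewards)
```

In this idealised setting a few dozen transitions are enough to recover the weights. With noisy or only approximately linear rewards, the fit would be an approximation.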

The researchers combined these two strategies in order to produce a divide-and-conquer approach to RL that can assist in scaling the agents to problems that are currently intractable due to issues like lack of data.

They stated, “If the reward function of a task can be well approximated as a linear combination of the reward functions of tasks previously solved, we can reduce a reinforcement-learning problem to a simpler linear regression.”

The researchers further added, “When this is not the case, the agent can still exploit the task solutions by using them to interact with and learn about the environment. Both strategies considerably reduce the amount of data needed to solve a reinforcement-learning problem.”

The Outcome

In the paper, the researchers show how GPE and GPI can be implemented efficiently and discuss how their combination leads to a generalised policy whose behaviour is modulated by a vector of preferences.

The vector of preferences can itself be obtained as the solution of a linear regression problem. This reduces a reinforcement learning task to a much simpler problem that can be solved using only a fraction of the data.
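A minimal sketch of how a preference vector could modulate behaviour is given below. It assumes a small tabular setting in which each stored policy has precomputed "successor features" (expected discounted feature sums), so that evaluating a policy under a preference vector is a dot product. The array shapes, the function names `gpe` and `gpi_policy`, and the toy numbers are illustrative assumptions, not the researchers' implementation.

```python
import numpy as np


def gpe(successor_features, w):
    """Evaluate every stored policy under the preference vector w.

    successor_features: (n_policies, n_states, n_actions, d)
    w:                  (d,)
    Returns action values of shape (n_policies, n_states, n_actions).
    """
    return successor_features @ w


def gpi_policy(successor_features, w):
    """In each state, pick the action with the highest value across all policies."""
    q = gpe(successor_features, w)
    best_over_policies = q.max(axis=0)        # (n_states, n_actions)
    return best_over_policies.argmax(axis=1)  # (n_states,)


# Toy example: 2 policies, 2 states, 2 actions, 2 features (all hypothetical).
psi = np.array([
    # policy 0 mostly collects feature 0
    [[[1.0, 0.0], [0.0, 0.0]],
     [[1.0, 0.0], [0.5, 0.0]]],
    # policy 1 mostly collects feature 1
    [[[0.0, 0.0], [0.0, 1.0]],
     [[0.0, 0.5], [0.0, 1.0]]],
])

actions_pref_feature_0 = gpi_policy(psi, np.array([1.0, 0.0]))
actions_pref_feature_1 = gpi_policy(psi, np.array([0.0, 1.0]))
```

Changing only the preference vector changes the chosen actions, with no further learning of value functions in this sketch.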

Wrapping Up

The researchers proposed a divide-and-conquer approach in which they generalised two fundamental operations in RL, policy evaluation and policy improvement, so that they can be used to speed up the solution of a reinforcement learning problem. The strategy is also claimed to improve sample efficiency if the mapping from states to preferences is simpler to learn than the corresponding policy.

The source code used to generate all of the data in this research is available on GitHub.

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.