Jarek Liesen: Discovering minimal reinforcement learning environments

BCCN Berlin / Technische Universität Berlin

Abstract

Humans often acquire new skills under conditions that are significantly different from the context in which they are evaluated. For example, students prepare for an exam not by taking it, but by studying books or supplementary material. In other words, the environments in which humans train and are evaluated can differ. A natural question is whether artificial agents also benefit from training outside of their evaluation environment.

A potential avenue to answering this question is to train neural-network-based synthetic environments via meta-learning. Surprisingly, we find that meta-learned synthetic Markov decision processes terminate most episodes after a single time step, effectively making them contextual bandits. We subsequently explore synthetic contextual bandits directly and find that they enable training reinforcement learning agents that transfer well to their evaluation environment, even when that environment is a full Markov decision process. We show that synthetic contextual bandits enable training in a fraction of the time steps and wall-clock time, and that they generalize across hyperparameter configurations and learning algorithms. Using our meta-learning algorithm in combination with a curriculum on the performance evaluation horizon, we achieve competitive results on a number of challenging continuous control problems. Our approach opens up a multitude of new research directions: contextual bandits are easy to interpret, yielding insight into the tasks encoded by the evaluation environment, and we demonstrate that synthetic contextual bandits can be used in downstream meta-learning setups.
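To illustrate the central observation above, the sketch below shows a synthetic environment whose episodes always terminate after one step, which is exactly what makes it a contextual bandit. The class name, parameter shapes, and the linear reward map are illustrative assumptions standing in for the meta-learned networks described in the abstract, not the thesis's actual architecture:

```python
import numpy as np

class SyntheticContextualBandit:
    """A one-step 'MDP': every episode ends after a single action.

    The context and reward come from small random linear maps; in the
    meta-learning setup these parameters would be optimized so that
    agents trained here transfer to the evaluation environment.
    """

    def __init__(self, obs_dim=4, act_dim=2, seed=0):
        self._rng = np.random.default_rng(seed)
        self.obs_dim = obs_dim
        # Parameters a meta-learner would tune; random for illustration.
        self.W_reward = self._rng.normal(size=(obs_dim + act_dim,))
        self._context = None

    def reset(self):
        # Sample a fresh context (observation) for the one-step episode.
        self._context = self._rng.normal(size=self.obs_dim)
        return self._context

    def step(self, action_onehot):
        # Reward depends on (context, action); the episode always ends,
        # which is the defining property of a contextual bandit.
        features = np.concatenate([self._context, action_onehot])
        reward = float(self.W_reward @ features)
        done = True
        return self._context, reward, done, {}
```

An agent interacting with this environment sees a standard Gym-style loop, except that `done` is `True` after every single step, e.g. `obs = env.reset(); _, r, done, _ = env.step(np.array([1.0, 0.0]))`.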


Additional Information

Master Thesis Defense


Organized by

Prof. Dr. Henning Sprekeler & Prof. Dr. Klaus Obermayer

Location: MAR 5.013, MAR Building, Marchstraße 23, 10587 Berlin
