Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Peter Pietzuch

Affiliation: Imperial College London (ICL)
Title: Distributed Reinforcement Learning with Dataflow Fragments

 

Abstract

Reinforcement learning (RL) is a key technology for solving hard decision-making problems, surpassing human game play, and enabling conversational AI bots such as ChatGPT. Supporting RL workloads poses new challenges: RL jobs exhibit complex execution and communication patterns, and rely on large amounts of generated training data. Current RL systems fail to support RL algorithms efficiently on GPU clusters: they either hard-code algorithm-specific strategies for parallelisation and distribution, or they accelerate only parts of the computation on GPUs (e.g., DNN policy updates).
In this talk, I will argue that current RL systems lack an abstraction that decouples the definition of an RL algorithm from its strategy for distributed execution. I will describe our work on MSRL, a distributed RL training system that uses the new abstraction of a fragmented dataflow graph (FDG) to execute RL algorithms in a flexible way. An FDG maps functions from the RL training loop to independent parallel dataflow fragments. Fragments can execute on different devices through a low-level dataflow implementation, e.g., an operator graph of a DNN engine, a CUDA GPU kernel, or a multi-threaded CPU process. Our experiments show that MSRL exposes trade-offs between different execution strategies, while surpassing the performance of existing RL systems with fixed strategies.
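
To make the decoupling idea concrete, the following is a minimal Python sketch, not the MSRL API. All names (Fragment, build_fragments, the backend labels, the toy policy and environment) are hypothetical and chosen for illustration only. The algorithm is defined once as plain functions from the training loop; a separate strategy table assigns each function to a fragment with a backend label. In this sketch the label is only metadata and everything runs on the CPU.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical illustration only; none of these names come from MSRL.


@dataclass(frozen=True)
class Fragment:
    """One function of the training loop plus the backend it is assigned to."""
    name: str
    fn: Callable
    backend: str  # label only, e.g. "dnn_graph", "cuda_kernel", "cpu_threads"


# --- Algorithm definition: says nothing about where code runs ---
def act(state: int) -> int:
    """Toy policy: choose an action from the current state."""
    return state % 2


def step_env(state: int, action: int) -> Tuple[int, float]:
    """Toy environment: advance the state, reward equals the action."""
    return state + 1, float(action)


def learn(rewards: List[float]) -> float:
    """Toy 'update': return the mean reward of the episode."""
    return sum(rewards) / len(rewards)


ALGORITHM = {"actor": act, "environment": step_env, "learner": learn}

# --- Execution strategies: defined separately from the algorithm ---
STRATEGY_A = {"actor": "dnn_graph", "environment": "cpu_threads", "learner": "dnn_graph"}
STRATEGY_B = {"actor": "cuda_kernel", "environment": "cuda_kernel", "learner": "dnn_graph"}


def build_fragments(algorithm: Dict[str, Callable],
                    strategy: Dict[str, str]) -> Dict[str, Fragment]:
    """Pair each training-loop function with the backend the strategy names."""
    return {name: Fragment(name, fn, strategy[name]) for name, fn in algorithm.items()}


def run_episode(fragments: Dict[str, Fragment], steps: int = 4) -> float:
    """Run the toy training loop by calling each fragment's function in turn."""
    state, rewards = 0, []
    for _ in range(steps):
        action = fragments["actor"].fn(state)
        state, reward = fragments["environment"].fn(state, action)
        rewards.append(reward)
    return fragments["learner"].fn(rewards)


if __name__ == "__main__":
    for strategy in (STRATEGY_A, STRATEGY_B):
        frags = build_fragments(ALGORITHM, strategy)
        print({f.name: f.backend for f in frags.values()}, run_episode(frags))
```

Switching from STRATEGY_A to STRATEGY_B changes only the placement table; the algorithm functions are untouched and the result of the toy loop is the same. A real system would additionally compile each fragment for its backend and handle communication between fragments, which this sketch leaves out.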

Short CV

Peter Pietzuch is a Professor of Distributed Systems at Imperial College London, where he leads the Large-scale Data & Systems (LSDS) group (https://lsds.doc.ic.ac.uk). His research work focuses on the design and engineering of scalable, reliable and secure data-intensive software systems, with a particular interest in performance, data and security issues. In addition, he is a Co-Director for Imperial’s I-X initiative on AI, data and digital (https://ix.imperial.ac.uk). Recently, he has served as the Chair of the ACM SIGOPS European Chapter (EuroSys) and the Programme Committee Chair for ACM SoCC 2023 and IEEE ICDCS 2018. Before joining Imperial College London, he was a post-doctoral Fellow at Harvard University. He holds PhD and MA degrees from the University of Cambridge.