Deep reinforcement learning from human preferences

Blog post: https://openai.com/index/learning-from-human-preferences/
Paper: https://arxiv.org/abs/1706.03741

For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent’s interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.
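To make "preferences between pairs of trajectory segments" concrete, here is a minimal sketch of how a reward model could be fitted to one human comparison. It assumes a Bradley-Terry style preference model over summed predicted rewards and a linear reward function; the linear model, the feature vectors, and every function name below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def segment_return(weights, segment):
    """Sum of predicted per-step rewards over one trajectory segment.

    segment has shape (T, d): one feature vector per timestep.
    The linear reward r(x) = weights @ x is an assumption of this sketch.
    """
    return float(np.sum(segment @ weights))


def preference_probability(weights, seg_a, seg_b):
    """Modelled probability that a human prefers seg_a over seg_b."""
    diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
    return 1.0 / (1.0 + np.exp(-diff))


def preference_loss(weights, seg_a, seg_b, label):
    """Cross-entropy loss; label is 1.0 if seg_a was preferred, 0.0 if seg_b."""
    p = preference_probability(weights, seg_a, seg_b)
    eps = 1e-12  # guards against log(0)
    return -(label * np.log(p + eps) + (1.0 - label) * np.log(1.0 - p + eps))


def gradient_step(weights, seg_a, seg_b, label, lr=0.1):
    """One gradient-descent update of the reward weights on a single comparison."""
    p = preference_probability(weights, seg_a, seg_b)
    # d(loss)/d(weights) = (p - label) * (summed features of a - summed features of b)
    grad = (p - label) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
    return weights - lr * grad


# Synthetic demo: two random 5-step segments with 3 hypothetical features each.
rng = np.random.default_rng(0)
seg_a = rng.normal(size=(5, 3)) + 1.0
seg_b = rng.normal(size=(5, 3))

w = np.zeros(3)
before = preference_loss(w, seg_a, seg_b, 1.0)  # the "human" prefers seg_a
for _ in range(50):
    w = gradient_step(w, seg_a, seg_b, 1.0)
after = preference_loss(w, seg_a, seg_b, 1.0)
```

In this toy setup the loss on the labelled pair falls as the weights are updated, so the learned reward increasingly ranks the preferred segment higher. An RL agent could then be trained against such a learned reward in place of the environment's true reward function.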