Enhancing OOD Generalization in Offline Reinforcement Learning with Energy-Based Policy Optimization

Cao, Hongye; Yang, Shangdong; Huo, Jing; Chen, Xingguo; Gao, Yang

doi:10.3233/FAIA230288

Abstract

Offline Reinforcement Learning (RL) is an important research domain for real-world applications because it can avert expensive and dangerous online exploration. Offline RL is prone to extrapolation errors caused by the distribution shift between offline datasets and states visited by behavior policy. Existing offline RL methods constrain the policy to offline behavior to prevent extrapolation errors. But these methods limit the generalization potential of agents in Out-Of-Distribution (OOD) regions and cannot effectively evaluate OOD generalization behavior. To improve the generalization of the policy in OOD regions while avoiding extrapolation errors, we propose an Energy-Based Policy Optimization (EBPO) method for OOD generalization. An energy function based on the distribution of offline data is proposed for the evaluation of OOD generalization behavior, instead of relying on model discrepancies to constrain the policy. The way of quantifying exploration behavior in terms of energy values can balance the return and risk. To improve the stability of generalization and solve the problem of sparse reward in complex environment, episodic memory is applied to store successful experiences that can improve sample efficiency. Extensive experiments on the D4RL datasets demonstrate that EBPO outperforms the state-of-the-art methods and achieves robust performance on challenging tasks that require OOD generalization.

This website uses cookies

This website uses cookies