Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification. We learn non-Markovian finite task specifications as finite-state ‘task automata’ from episodes of agent experience within environments with unknown dynamics. First, we learn a product MDP, a model composed of the specification’s automaton and the environment’s MDP (both initially unknown), by treating it as a partially observable MDP and employing a hidden Markov model learning algorithm. Second, we efficiently distil the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our automaton enables a task to be decomposed into sub-tasks, so an RL agent can later synthesise an optimal policy more efficiently. It is also an interpretable encoding of high-level task features, so a human can verify that the agent’s learnt tasks have no misspecifications. Finally, we also take steps towards ensuring that the automaton is environment-agnostic, making it well-suited for use in transfer learning.
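The second step described above, distilling a deterministic finite automaton from the learnt product MDP, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes the product MDP's transition structure has already been recovered as pairs of product states `(s, q)` (environment state `s`, automaton state `q`) together with a labelling function mapping environment states to propositions, and it projects those transitions onto `(q, label) → q'` while checking determinism.

```python
# Hypothetical sketch: distilling a DFA from a learnt product MDP.
# A product state pairs an environment state s with an automaton state q.
# The task automaton must update deterministically on the label of the
# state reached, so we project product transitions onto (q, label) -> q'.

def distil_dfa(product_transitions, labelling):
    """product_transitions: iterable of ((s, q), (s_next, q_next)) pairs
    observed in the learnt product MDP.
    labelling: maps an environment state to the proposition it emits.
    Returns the DFA transition function as a dict {(q, label): q_next};
    raises if the projection is non-deterministic (e.g. model noise)."""
    delta = {}
    for (s, q), (s_next, q_next) in product_transitions:
        key = (q, labelling[s_next])
        if key in delta and delta[key] != q_next:
            raise ValueError(f"non-deterministic projection at {key}")
        delta[key] = q_next
    return delta

# Toy "reach the goal" task: automaton state 0 (in progress), 1 (done).
transitions = [(("a", 0), ("a", 0)),   # stay in an empty cell
               (("a", 0), ("g", 1)),   # reach the goal, automaton advances
               (("g", 1), ("g", 1))]   # done state absorbs
labelling = {"a": "empty", "g": "goal"}
dfa = distil_dfa(transitions, labelling)
```

Here `dfa` maps `(0, "goal")` to `1` and `(0, "empty")` to `0`, recovering the two-state automaton for the toy task; the determinism check reflects the abstract's assumption that the specification is a deterministic finite automaton.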