Master's Thesis: Optimism and Death in Reinforcement Learning


Abstract

This thesis investigates two very different problems in Reinforcement Learning (RL), and is accordingly divided into two self-contained parts.
The first part proposes a new optimistic exploration method for RL that is feasible in very large state spaces. The success of RL algorithms in these domains depends crucially on effective generalisation. Function approximation techniques have recently been scaled to produce robust value estimates in this setting, but fewer attempts have been made to generalise the uncertainty in those estimates. This has largely prevented scalable RL algorithms from being combined with directed exploration strategies, which drive the agent to reduce its uncertainty. We draw on recent attempts to quantify uncertainty by generalising visit-counts across large state spaces, and describe a new approach that does so by exploiting the generalisation induced by the agent’s approximate representation of the value function. The method is less computationally expensive than previous proposals and achieves state-of-the-art results on contemporary RL benchmarks.
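
The idea of generalising visit-counts through the agent's features can be illustrated with a minimal sketch. This is not the estimator developed in the thesis: the class name `FeatureCountBonus`, the averaging rule, and the bonus form `beta / sqrt(count + 1)` are all assumptions chosen for illustration. The sketch only shows how counting feature activations, rather than raw states, lets a never-visited state inherit familiarity from visited states that share its features.

```python
import math
from collections import Counter


class FeatureCountBonus:
    """Illustrative exploration bonus from counts over features, not raw states.

    Assumption: each state is described by a set of hashable active features
    (for example, the non-zero entries of a binary feature vector).
    """

    def __init__(self, beta=0.05):
        self.beta = beta          # bonus scale (hypothetical default)
        self.counts = Counter()   # number of times each feature has been active

    def update(self, features):
        """Record one visit to a state with the given active features."""
        self.counts.update(features)

    def generalised_count(self, features):
        """Average activation count of the state's features.

        States that share features share counts, so the estimate
        generalises to states that were never visited themselves.
        """
        features = list(features)
        if not features:
            return 0.0
        return sum(self.counts[f] for f in features) / len(features)

    def bonus(self, features):
        """Optimistic reward bonus that shrinks as the generalised count grows."""
        return self.beta / math.sqrt(self.generalised_count(features) + 1.0)


explorer = FeatureCountBonus(beta=0.05)
for _ in range(3):
    explorer.update({"wall_left", "key_visible"})

familiar = explorer.bonus({"wall_left", "key_visible"})   # visited three times
related = explorer.bonus({"wall_left", "door_open"})      # unvisited, shares a feature
novel = explorer.bonus({"ladder", "door_open"})           # unvisited, shares nothing
```

In this toy setting the fully novel state receives the largest bonus, the related state a smaller one, and the familiar state the smallest, which is the ordering a directed exploration strategy needs.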

The second part of this work treats RL as a theoretical paradigm for studying intelligent behaviour. The agent AIXI is a universal solution to the RL problem; it can learn any computable environment. A technical subtlety of AIXI is that it is defined using a mixture over semimeasures, which need not sum to 1, rather than over proper probability measures. We argue that the shortfall of a semimeasure can naturally be interpreted as the agent’s estimate of the probability of its death. On this basis we provide an original formal definition of death for generally intelligent agents like AIXI, and prove a number of theorems about their behaviour in relation to death. Notable results include that agent behaviour can change radically under positive linear transformations of the reward signal (from suicidal to dogmatically self-preserving), and that the agent’s posterior belief that it will survive increases over time.
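
A toy calculation can make the last two ideas concrete. The sketch below is not AIXI and not the thesis's formalism: the two-action model, the percept name `ok`, and the one-step value function are hypothetical. It assumes that the probability mass missing from a semimeasure (its shortfall) is the probability of death, and that death contributes no reward to the expected value.

```python
def death_probability(nu):
    """Shortfall of a semimeasure over next percepts: the mass missing from 1."""
    return 1.0 - sum(nu.values())


def action_value(nu, reward):
    """One-step expected reward under a semimeasure.

    Assumption: the missing (death) mass contributes nothing to the sum,
    so death is implicitly worth zero.
    """
    return sum(p * reward[percept] for percept, p in nu.items())


def best_action(model, reward):
    """Action with the highest one-step expected reward."""
    return max(model, key=lambda action: action_value(model[action], reward))


# Hypothetical model: each action maps to a semimeasure over percepts.
model = {
    "stay_safe": {"ok": 1.0},       # proper measure, no chance of death
    "self_destruct": {"ok": 0.25},  # shortfall 0.75, read as death probability
}

rewards = {"ok": 0.5}                                # rewards in [0, 1]
shifted = {e: r - 1.0 for e, r in rewards.items()}   # r -> r - 1, rewards in [-1, 0]

choice_positive = best_action(model, rewards)   # survival is worth more than death
choice_negative = best_action(model, shifted)   # death is worth more than survival
```

With rewards in [0, 1] the toy agent avoids the action that risks death, and after the shift to [-1, 0] it prefers that action, even though the transformation preserves the ordering of rewards among the outcomes in which the agent survives.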