A deep learning model that successfully learns control policies directly from high-dimensional sensory input using reinforcement learning.
The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.
The agent interacts with an environment $\mathcal{E}$ (the Atari emulator) in a sequence of actions, observations and rewards. At each time-step the agent selects an action $a_t$ from the set of legal game actions, $\mathcal{A} = \{1, \ldots, K\}$. The action is passed to the emulator and modifies its internal state and the game score. The emulator's internal state is not observed by the agent; instead it observes an image $x_t \in \mathbb{R}^d$ from the emulator, which is a vector of raw pixel values representing the current screen. In addition it receives a reward $r_t$ representing the change in game score.
Since the agent only observes images of the current screen, the task is partially observed and many emulator states are perceptually aliased. Therefore the agent considers sequences of actions and observations, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, and learns game strategies that depend upon these sequences.
The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards. The standard assumption is that future rewards are discounted by a factor of $\gamma$ per time-step, and the discounted return at time $t$ is $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $T$ is the time-step at which the game terminates.
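As a concrete illustration of this formula, the short Python sketch below computes the discounted return of a reward sequence by accumulating rewards backwards from the terminal step; the reward list and the value of `gamma` are arbitrary example inputs, not values from the paper.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for t = 0."""
    g = 0.0
    # Accumulate backwards so each reward is discounted by its distance from t.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards with a discount factor of 0.99 (illustrative values).
print(discounted_return([0.0, 1.0, 1.0], gamma=0.99))  # 0 + 0.99*1 + 0.99^2*1 = 1.9701
```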
The optimal action-value function $Q^*(s, a)$ is the maximum expected return achievable by following any strategy, after seeing some sequence $s$ and then taking some action $a$: $Q^*(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, where $\pi$ is a policy mapping sequences to actions.
The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value $Q^*(s', a')$ of the sequence $s'$ at the next time-step were known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ maximising the expected value of $r + \gamma Q^*(s', a')$: $Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]$.
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, $Q_{i+1}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a \right]$. Such value iteration algorithms converge to the optimal action-value function, $Q_i \to Q^*$ as $i \to \infty$.
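The snippet below is a minimal sketch of this iterative update in the tabular case, assuming a hypothetical toy deterministic MDP given as `transitions[s][a] = (reward, next_state)` pairs; it only shows the shape of the update $Q_{i+1}(s, a) = r + \gamma \max_{a'} Q_i(s', a')$, not the paper's method.

```python
import numpy as np

# Hypothetical deterministic toy MDP: transitions[s][a] = (reward, next_state).
transitions = {
    0: {0: (0.0, 0), 1: (1.0, 1)},
    1: {0: (0.0, 0), 1: (2.0, 1)},
}
n_states, n_actions, gamma = 2, 2, 0.9

Q = np.zeros((n_states, n_actions))
for i in range(100):  # repeated application of the Bellman update
    Q_next = np.zeros_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            r, s_next = transitions[s][a]
            Q_next[s, a] = r + gamma * Q[s_next].max()
    Q = Q_next

print(Q)  # converges towards Q* for this toy problem
```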
In practice, a function approximator is used to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$. A neural network function approximator with weights $\theta$, referred to as a Q-network, can be trained by minimising a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$, $L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ (y_i - Q(s, a; \theta_i))^2 \right]$, where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]$ is the target for iteration $i$ and $\rho(s, a)$ is a probability distribution over sequences $s$ and actions $a$ that is referred to as the behaviour distribution.
The gradient is obtained by differentiating the loss function with respect to the weights, $\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right]$.
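A minimal PyTorch sketch of this loss is given below, assuming a hypothetical `q_net` holding the current weights $\theta_i$ and a `q_net_prev` holding the previous weights $\theta_{i-1}$ (for example, instances of the Q-network sketched at the end of this section); the target is detached so that only $Q(s, a; \theta_i)$ is differentiated, matching the gradient expression above. Terminal-state handling is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, q_net_prev, states, actions, rewards, next_states, gamma):
    """Squared TD error L_i = E[(y_i - Q(s, a; theta_i))^2] on a minibatch."""
    # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); no gradient flows through the target.
    with torch.no_grad():
        y = rewards + gamma * q_net_prev(next_states).max(dim=1).values
    # Q(s, a; theta_i) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, y)

# loss = q_learning_loss(...); loss.backward() then yields the gradient w.r.t. theta_i.
```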
A technique known as experience replay is used, where the agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step are stored and pooled over many episodes into a replay memory $\mathcal{D}$. During the inner loop of the algorithm, Q-learning updates, or minibatch updates, are applied to samples of experience, $e \sim \mathcal{D}$, drawn at random from the pool of stored samples. After performing experience replay, the agent selects and executes an action according to an $\epsilon$-greedy policy (see the sketch after the list below). This approach has several advantages over standard online Q-learning:
- First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
- Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
- Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on, which can lead to unwanted feedback loops; by using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning.
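The sketch below illustrates a replay memory and an $\epsilon$-greedy action selection in Python; the buffer capacity, batch size and value of $\epsilon$ are placeholder values for illustration, not the paper's hyperparameters.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity pool of experiences e_t = (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are discarded first

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # A uniform random minibatch breaks correlations between consecutive samples.
        return random.sample(self.memory, batch_size)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```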
The input to the neural network is an 84×84×4 image (a stack of four preprocessed frames). The first hidden layer convolves 16 8×8 filters with stride 4 over the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 4×4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.
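As a concrete rendering of this architecture, the PyTorch sketch below stacks the two convolutional layers, the 256-unit fully-connected layer and the linear output layer described above; the class name, the channels-first input layout and the default of 4 actions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN mapping an 84x84x4 stack of frames to one Q-value per action."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 84x84x4 -> 20x20x16
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 20x20x16 -> 9x9x32
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(9 * 9 * 32, 256),                  # fully-connected rectifier layer
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one linear output per valid action
        )

    def forward(self, x):
        # x: float tensor of shape (batch, 4, 84, 84), channels first.
        return self.net(x)

q_net = QNetwork(n_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```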