Deep Q-Learning

Summary

A deep learning model that learns control policies directly from high-dimensional sensory input using reinforcement learning.

The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

Background

The agent interacts with an environment $\mathcal{E}$ in a sequence of actions, observations and rewards. At each time-step the agent selects an action $a_t$ from the set of legal game actions $\mathcal{A} = \{1, \ldots, K\}$. The action is passed to the emulator and modifies its internal state and game score. The emulator's internal state is not observed by the agent; instead it observes an image $x_t \in \mathbb{R}^d$ from the emulator, which is a vector of raw pixel values representing the current screen. In addition it receives a reward $r_t$ representing the change in game score.
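
A minimal sketch of this interaction loop, assuming a Gym-style Atari emulator via the `gymnasium` package (an assumption; any emulator exposing actions, screen images and rewards fits the same pattern). A random policy stands in for the learned agent.

```python
# Agent-emulator loop: select a_t, receive screen x_t and reward r_t.
import gymnasium as gym

env = gym.make("ALE/Breakout-v5")          # emulator E
observation, info = env.reset(seed=0)      # raw screen image x_t

episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()     # a_t from the legal action set A
    observation, reward, terminated, truncated, info = env.step(action)
    episode_return += reward               # r_t is the change in game score
    done = terminated or truncated

env.close()
print("episode return:", episode_return)
```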

Since the agent only observes images of the current screen, the task is partially observed and many emulator states are perceptually aliased. Therefore the algorithm considers sequences of actions and observations, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, and learns game strategies that depend upon these sequences.

The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards. The standard assumption is that future rewards are discounted by a factor of $\gamma$ per time-step, and the discounted return at time $t$ is $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $T$ is the time-step at which the game terminates.
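
As a small worked example, the discounted return can be computed by folding the rewards backwards; the reward sequence and discount factor below are illustrative values only.

```python
# R_t = sum_{t'=t}^{T} gamma^(t'-t) r_{t'}, computed by a backward pass.
def discounted_return(rewards, gamma=0.99):
    """Return R_t for t = 0 given rewards r_0, ..., r_T."""
    R = 0.0
    for reward in reversed(rewards):
        R = reward + gamma * R
    return R

print(discounted_return([0.0, 0.0, 1.0, 0.0, 1.0], gamma=0.99))  # ~1.9407
```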

The optimal action-value function $Q^*(s, a)$ is the maximum expected return achievable by following any strategy, after seeing some sequence $s$ and then taking some action $a$: $Q^*(s, a) = \max_{\pi} \mathbb{E}\left[R_t \mid s_t = s, a_t = a, \pi\right]$, where $\pi$ is a policy mapping sequences to actions.

The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value $Q^*(s', a')$ of the sequence $s'$ at the next time-step were known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ maximising the expected value of $r + \gamma Q^*(s', a')$:

$$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$$

The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, $Q_{i+1}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$. Such value iteration algorithms converge to the optimal action-value function, $Q_i \to Q^*$ as $i \to \infty$.
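
A minimal sketch of this value iteration on a toy tabular problem; the two-state transition and reward tables are invented purely for illustration and are not part of the paper.

```python
import numpy as np

# Toy tabular MDP (2 states, 2 actions) invented for illustration.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for i in range(500):
    # Bellman update: Q_{i+1}(s, a) = E[r + gamma * max_a' Q_i(s', a')]
    Q = R + gamma * P @ Q.max(axis=1)

print(Q)  # converges towards Q* as i -> infinity
```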

A function approximator is used to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$. A neural network function approximator with weights $\theta$, referred to as a Q-network, can be trained by minimising a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,

$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right],$$

where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$ is the target for iteration $i$ and $\rho(s, a)$ is a probability distribution over sequences $s$ and actions $a$ that is referred to as the behaviour distribution.

The gradient is obtained by differentiating the loss function with respect to the weights:

$$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)\nabla_{\theta_i} Q(s, a; \theta_i)\right]$$
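
A sketch of the loss and gradient step, assuming PyTorch; `q_network`, `target_network` and the minibatch layout are placeholder names rather than the paper's implementation, and autograd supplies the gradient instead of the hand-derived expression above.

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_network, target_network, batch, gamma=0.99):
    """Minibatch estimate of L_i(theta_i).

    `batch` is assumed to hold tensors: states (N,4,84,84), actions (N,),
    rewards (N,), next_states (N,4,84,84), dones (N,).
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta_i) for the actions actually taken.
    q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); no gradient flows
    # through the target, which plays the role of the previous weights.
    with torch.no_grad():
        next_q = target_network(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    return F.mse_loss(q_values, targets)

# Gradient step: autograd supplies grad_theta L_i(theta_i), e.g.
# optimizer.zero_grad(); q_learning_loss(...).backward(); optimizer.step()
```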

Approach

A technique known as experience replay is used, where the agent's experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, are stored in a data-set $\mathcal{D} = e_1, \ldots, e_N$ and pooled over many episodes into a replay memory.

During the inner loop of the algorithm, Q-learning updates, or minibatch updates, are applied to samples of experience, $e \sim \mathcal{D}$, drawn at random from the pool of stored samples. After performing experience replay, the agent selects and executes an action according to an $\epsilon$-greedy policy.
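
A sketch of the replay memory and the $\epsilon$-greedy action selection, assuming PyTorch for the Q-network; the class and function names here are illustrative, not taken from the paper.

```python
import random
from collections import deque
import torch

class ReplayMemory:
    """Fixed-capacity pool of transitions e_t = (s_t, a_t, r_t, s_{t+1}, done)."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Minibatch e ~ D drawn uniformly at random from the stored samples.
        return random.sample(self.memory, batch_size)

def epsilon_greedy(q_network, state, num_actions, epsilon):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_network(state.unsqueeze(0)).argmax(dim=1).item())
```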

Advantages

  • First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
  • Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
  • Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on, which can lead to unwanted feedback loops; by using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.

Architecture

The input to the neural network is an 84×84×4 stack of preprocessed frames. The first hidden layer convolves 16 8×8 filters with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 4×4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.
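
A PyTorch sketch of this architecture (the choice of framework is an assumption; the original implementation predates PyTorch). The layer shapes follow directly from the description above, and `num_actions` depends on the game.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 84x84x4 -> 20x20x16
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 20x20x16 -> 9x9x32
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # fully-connected, 256 rectifier units
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # one linear output per valid action
        )

    def forward(self, x):
        return self.head(self.features(x))

q_net = DQN(num_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```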