A deep learning model that successfully learns control policies directly from high-dimensional sensory input using reinforcement learning.
The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.
The agent interacts with an environment $\mathcal{E}$ (the Atari emulator) in a sequence of actions, observations and rewards. At each time-step the agent selects an action $a_t$ from the set of legal game actions, $\mathcal{A} = \{1, \ldots, K\}$. The action is passed to the emulator and modifies its internal state and the game score. The emulator's internal state is not observed by the agent; instead it observes an image $x_t \in \mathbb{R}^d$ from the emulator, which is a vector of raw pixel values representing the current screen. In addition it receives a reward $r_t$ representing the change in game score.
Since the agent only observes images of the current screen, the task is partially observed and many emulator states are perceptually aliased. Therefore the agent considers sequences of actions and observations, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, and learns game strategies that depend upon these sequences.
The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards. The standard assumption is that future rewards are discounted by a factor of $\gamma$ per time-step, and the discounted return at time $t$ is $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $T$ is the time-step at which the game terminates.
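As a concrete illustration of this formula, the short Python sketch below computes the discounted return of a reward sequence by accumulating rewards backwards from the terminal step; the reward list and the value of `gamma` are arbitrary example inputs, not values from the paper.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for t = 0."""
    g = 0.0
    # Accumulate backwards so each reward is discounted by its distance from t.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards with a discount factor of 0.99 (illustrative values).
print(discounted_return([0.0, 1.0, 1.0], gamma=0.99))  # 0 + 0.99*1 + 0.99^2*1 = 1.9701
```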
The optimal action-value function $Q^*(s, a)$ is the maximum expected return achievable by following any strategy, after seeing some sequence $s$ and then taking some action $a$: $Q^*(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, where $\pi$ is a policy mapping sequences to actions.
The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value $Q^*(s', a')$ of the sequence $s'$ at the next time-step were known for all possible actions $a'$, then the optimal strategy is to select the action $a'$ maximising the expected value of $r + \gamma Q^*(s', a')$: $Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]$.
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, $Q_{i+1}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a \right]$. Such value iteration algorithms converge to the optimal action-value function, $Q_i \to Q^*$ as $i \to \infty$.
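The snippet below is a minimal sketch of this iterative update in the tabular case, assuming a hypothetical toy deterministic MDP given as `transitions[s][a] = (reward, next_state)` pairs; it only shows the shape of the update $Q_{i+1}(s, a) = r + \gamma \max_{a'} Q_i(s', a')$, not the paper's method.

```python
import numpy as np

# Hypothetical deterministic toy MDP: transitions[s][a] = (reward, next_state).
transitions = {
    0: {0: (0.0, 0), 1: (1.0, 1)},
    1: {0: (0.0, 0), 1: (2.0, 1)},
}
n_states, n_actions, gamma = 2, 2, 0.9

Q = np.zeros((n_states, n_actions))
for i in range(100):  # repeated application of the Bellman update
    Q_next = np.zeros_like(Q)
    for s in range(n_states):
        for a in range(n_actions):
            r, s_next = transitions[s][a]
            Q_next[s, a] = r + gamma * Q[s_next].max()
    Q = Q_next

print(Q)  # converges towards Q* for this toy problem
```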
In practice, a function approximator is used to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$. A neural network function approximator with weights $\theta$, referred to as a Q-network, can be trained by minimising a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$, $L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ (y_i - Q(s, a; \theta_i))^2 \right]$, where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]$ is the target for iteration $i$ and $\rho(s, a)$ is a probability distribution over sequences $s$ and actions $a$ that is referred to as the behaviour distribution.
The gradient is obtained by differentiating the loss function with respect to the weights, $\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right) \nabla_{\theta_i} Q(s, a; \theta_i) \right]$.
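A minimal PyTorch sketch of this loss is given below, assuming a hypothetical `q_net` holding the current weights $\theta_i$ and a `q_net_prev` holding the previous weights $\theta_{i-1}$ (for example, instances of the Q-network sketched at the end of this section); the target is detached so that only $Q(s, a; \theta_i)$ is differentiated, matching the gradient expression above. Terminal-state handling is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, q_net_prev, states, actions, rewards, next_states, gamma):
    """Squared TD error L_i = E[(y_i - Q(s, a; theta_i))^2] on a minibatch."""
    # y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}); no gradient flows through the target.
    with torch.no_grad():
        y = rewards + gamma * q_net_prev(next_states).max(dim=1).values
    # Q(s, a; theta_i) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, y)

# loss = q_learning_loss(...); loss.backward() then yields the gradient w.r.t. theta_i.
```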
A technique known as experience replay is used, where the agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step are stored and pooled over many episodes into a replay memory $\mathcal{D}$. During the inner loop of the algorithm, Q-learning updates, or minibatch updates, are applied to samples of experience, $e \sim \mathcal{D}$, drawn at random from the pool of stored samples. After performing experience replay, the agent selects and executes an action according to an $\epsilon$-greedy policy (see the sketch after the list below). This approach has several advantages over standard online Q-learning:
- First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
- Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
- Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on, which can lead to unwanted feedback loops; by using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning.
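The sketch below illustrates a replay memory and an $\epsilon$-greedy action selection in Python; the buffer capacity, batch size and value of $\epsilon$ are placeholder values for illustration, not the paper's hyperparameters.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity pool of experiences e_t = (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are discarded first

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # A uniform random minibatch breaks correlations between consecutive samples.
        return random.sample(self.memory, batch_size)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```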
The input to the neural network is an 84×84×4 image (a stack of four preprocessed frames). The first hidden layer convolves 16 8×8 filters with stride 4 over the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 4×4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.
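As a concrete rendering of this architecture, the PyTorch sketch below stacks the two convolutional layers, the 256-unit fully-connected layer and the linear output layer described above; the class name, the channels-first input layout and the default of 4 actions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN mapping an 84x84x4 stack of frames to one Q-value per action."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 84x84x4 -> 20x20x16
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 20x20x16 -> 9x9x32
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(9 * 9 * 32, 256),                  # fully-connected rectifier layer
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # one linear output per valid action
        )

    def forward(self, x):
        # x: float tensor of shape (batch, 4, 84, 84), channels first.
        return self.net(x)

q_net = QNetwork(n_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```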