- Agent: The entity making decisions and interacting with the environment
- Environment: The external system with which the agent interacts
- State: A representation of the current situation or configuration
- Action: The decision or move made by the agent
- Reward: The feedback the agent receives based on its actions
- Policy: The strategy or set of rules guiding the agent's decision making (illustrated in the short interaction loop below)
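To make these terms concrete, here is a minimal sketch of the agent-environment loop using OpenAI Gym's CartPole environment (the same environment used in the implementation further down). The random action sampling is only a placeholder for a learned policy.
import gym

# Minimal sketch of the agent-environment loop (classic Gym API, as used below).
# A random policy stands in for the actor; the learned policy comes later.
env = gym.make('CartPole-v1')                      # Environment
state = env.reset()                                # State: the initial observation
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()             # Action chosen by the (random) policy
    state, reward, done, _ = env.step(action)      # Environment returns the next state and a reward
    total_reward += reward                         # Cumulative reward for this episode
print(f"Episode reward: {total_reward}")
env.close()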
- Actor:
- Makes decisions based on the current policy
- Is responsible for exploring the action space in order to maximise expected cumulative reward
- Refines the policy over time, adapting to the dynamic nature of the environment
- Critic:
- Evaluates the actions taken by the actor
- Estimates the quality of these actions and provides feedback on their performance
- Guides the actor towards actions that lead to higher expected returns, contributing to overall improvement
- Policy (Actor):
- Represents the probability of taking action a in state s
- Denoted as π(a ∣ s)
- The actor seeks to maximise expected return by optimising this policy
- The policy is modelled by the actor network, whose parameters are denoted θ
- Value (Critic):
- Estimates the expected cumulative reward starting from state s
- Denoted as V(s)
- The value function is modelled by the critic network, whose parameters are denoted w
- Combination of policy gradient for the actor and value function for the critic
- Typically expressed as the sum of two components
- ∇_θ J(θ) ≈ (1/N) ∑_{i=1}^{N} ∇_θ log π_θ(a_i ∣ s_i) · A(s_i, a_i)
- Explanation:
- J(θ): the expected return under the policy parameterised by θ
- π_θ(a ∣ s): the policy function
- N: the number of sampled experiences
- A(s_i, a_i): the advantage function, representing the advantage of taking action a_i in state s_i
- i: the index of the sample
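As a rough sketch (not part of the original implementation below), this is how the policy-gradient estimate might be computed in TensorFlow for a batch of sampled transitions. The names `states`, `actions` and `advantages` are assumed placeholders for sampled data; `actor` is the Keras actor network defined further down.
import tensorflow as tf

# Sketch: Monte Carlo estimate of the policy gradient over N sampled transitions.
# Assumed inputs: `actor` (the Keras actor network defined below),
# `states` [N, state_dim], `actions` [N] (int), `advantages` [N] (float).
with tf.GradientTape() as tape:
    probs = actor(states)                                        # pi_theta(a | s_i) for each sample
    chosen = tf.gather(probs, actions, batch_dims=1)             # pi_theta(a_i | s_i)
    log_probs = tf.math.log(chosen)                              # log pi_theta(a_i | s_i)
    actor_loss = -tf.reduce_mean(log_probs * tf.cast(advantages, tf.float32))  # negated so minimising = ascent on J
actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
Minimising this negated loss with an optimiser is equivalent to gradient ascent on J(θ), which is what the update rules further down describe.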
- ∇_w J(w) ≈ (1/N) ∑_{i=1}^{N} ∇_w (V_w(s_i) − Q_w(s_i, a_i))²
- Explanation:
- ∇_w J(w): the gradient of the loss function with respect to the critic's parameters w
- N: the number of samples
- V_w(s_i): the critic's estimate of the value of state s_i using parameters w
- Q_w(s_i, a_i): the critic's estimate of the action value of taking action a_i in state s_i
- i: the index of the sample
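A comparable sketch for the critic's loss, again with assumed names (`critic`, `states`, `rewards`, `next_states`). Here Q_w(s_i, a_i) is approximated by the one-step TD target r_i + γ · V_w(s'_i), which is also what the implementation further down does.
import tensorflow as tf

# Sketch: mean squared value error for the critic over N sampled transitions.
# Assumed inputs: `critic` (the Keras critic network defined below),
# `states` [N, state_dim], `rewards` [N], `next_states` [N, state_dim].
gamma = 0.99
with tf.GradientTape() as tape:
    values = tf.squeeze(critic(states), axis=1)                  # V_w(s_i)
    next_values = tf.squeeze(critic(next_states), axis=1)        # V_w(s'_i)
    targets = tf.cast(rewards, tf.float32) + gamma * next_values # stand-in for Q_w(s_i, a_i)
    critic_loss = tf.reduce_mean(tf.square(values - tf.stop_gradient(targets)))
critic_grads = tape.gradient(critic_loss, critic.trainable_variables)  # grad_w J(w)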
- Involves adjusting the respective parameters of the actor and critic
- Actor
- Gradient ascent
- Critic
- Gradient descent
- Actor
- θ_{t+1} = θ_t + α ∇_θ J(θ_t)
- Explanation
- α: the learning rate for the actor
- t: the time step within an episode
- Critic
- w_{t+1} = w_t − β ∇_w J(w_t)
- Explanation
- w: the parameters of the critic network
- β: the learning rate for the critic
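Written out by hand, the two updates correspond to the sketch below. This is only illustrative: `grad_J_theta` and `grad_J_w` are hypothetical lists assumed to hold the gradient estimates ∇_θ J(θ) and ∇_w J(w), and `actor`/`critic` are the networks defined further down (where Adam performs these updates instead).
# Sketch of the raw update rules under the assumptions stated above.
alpha = 0.001    # actor learning rate
beta = 0.001     # critic learning rate

# Actor: gradient ascent, theta_{t+1} = theta_t + alpha * grad_theta J(theta_t)
for theta, g in zip(actor.trainable_variables, grad_J_theta):
    theta.assign_add(alpha * g)

# Critic: gradient descent, w_{t+1} = w_t - beta * grad_w J(w_t)
for w, g in zip(critic.trainable_variables, grad_J_w):
    w.assign_sub(beta * g)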
- The advantage function A(s,a) measures the advantage of taking action a in state s over the expected value of that state under the current policy
- A(s,a) = Q(s,a) − V(s)
- The advantage function therefore provides a measure of how much better or worse an action is compared to the average action
- Explanation of the A(s,a) = Q(s,a) − V(s) expression:
- The actor is updated based on the policy gradient, encouraging actions with higher advantages
- The critic is updated to minimise the difference between the estimated value and the action value
- Introduces the concept of the advantage function
- Measures how much better an action is compared to the average action in the given state
- By incorporating advantage information, A2C focuses the learning process on actions that have a significantly higher value than the typical action taken in that state
- Learning from average:
- Base Actor-Critic uses the difference between the actual reward and the estimated value (the critic's evaluation) to update the actor
- Learning from advantage:
- A2C uses the difference between an action's value and the average value of actions in that state
- This additional information refines the learning process a little further (see the short sketch below)
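A small numeric sketch of the A(s,a) = Q(s,a) − V(s) idea; the numbers are made up purely for illustration.
# Illustrative (made-up) numbers for a single state s with two actions.
V_s = 5.0                      # critic's estimate of the state value V(s)
Q_s_a1 = 7.0                   # estimated action value Q(s, a1)
Q_s_a2 = 4.0                   # estimated action value Q(s, a2)

advantage_a1 = Q_s_a1 - V_s    # +2.0: a1 is better than the average action in s
advantage_a2 = Q_s_a2 - V_s    # -1.0: a2 is worse than the average action in s

# In practice (and in the implementation below) Q(s, a) is approximated by the
# one-step TD target r + gamma * V(s'), giving A(s, a) ~ r + gamma * V(s') - V(s).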
- Initialise the parameters
- Policy (Actor): θ
- Value function (Critic): w
- The agent interacts with the environment
- Takes actions according to the current policy
- Receives rewards in return
- Compute the advantage function A(s,a) based on the current policy and value estimates
- Simultaneously update the actor's parameters using the policy gradient
- The policy gradient is derived from the advantage function
- It guides the actor to increase the probabilities of actions that lead to higher advantages
- Simultaneously update the critic's parameters using a value-based method
- This often involves minimising the temporal difference (TD) error
- The TD error is the difference between the observed rewards and the predicted values
Tip
- The actor learns a policy
- The critic then evaluates the actions taken by the actor
- The actor is updated using the policy gradient
- The critic is updated using the value-based method
[!NOTE] The actor-critic combination allows for more stable and efficient learning in complex environments
Important
The example implementation below makes use of TensorFlow and OpenAI Gym
- Import libraries
import numpy as np
import tensorflow as tf
import gym
- Creating CartPole Environment
- gym.make provides a standardized and convenient way to interact with various reinforcement learning tasks
env = gym.make('CartPole-v1')
- Define Actor and Critic Networks
- Actor and Critic are implemented as neural networks using TensorFlow's Keras API
- Actor network
- Maps the state to a probability distribution over actions
- Critic network
- Estimates the state's value
actor = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(env.action_space.n, activation='softmax')
])
critic = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(1)
])
- Defining Optimizers and Loss Functions
- Adam optimizer is used for both actor and critic networks
actor_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
- Training Loop
- The main training loop runs for a specified number of episodes (1000)
- The agent interacts with the environment
- For each episode:
- Resets the environment
- Initializes the episode reward to 0
- The tf.GradientTape block
- Used to compute gradients for both the actor and critic networks
- The agent chooses an action based on the actor's output probabilities
- It then takes that action in the environment
- Observes:
- Next state
- Reward
- Whether the episode is done
- The advantage function is computed
- The difference between the expected return and the estimated value at the current state
- Actor and critic losses are calculated based on the advantage function
- Gradients are computed using tape.gradient
- Then applied to update the actor and critic networks using the respective optimisers
- The episode's total reward is updated and the loop continues until the episode ends
- Every 10 episodes the current episode number and reward are printed
num_episodes = 1000
gamma = 0.99
for episode in range(num_episodes):
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape(persistent=True) as tape:
        for t in range(1, 10000):  # Limit the number of time steps
            # Choose an action using the actor
            action_probs = actor(np.array([state]))
            action = np.random.choice(env.action_space.n, p=action_probs.numpy()[0])
            # Take the chosen action and observe the next state and reward
            next_state, reward, done, _ = env.step(action)
            # Compute the advantage
            state_value = critic(np.array([state]))[0, 0]
            next_state_value = critic(np.array([next_state]))[0, 0]
            advantage = reward + gamma * next_state_value - state_value
            # Compute actor and critic losses
            actor_loss = -tf.math.log(action_probs[0, action]) * advantage
            critic_loss = tf.square(advantage)
            episode_reward += reward
            # Update actor and critic
            actor_gradients = tape.gradient(actor_loss, actor.trainable_variables)
            critic_gradients = tape.gradient(critic_loss, critic.trainable_variables)
            actor_optimizer.apply_gradients(zip(actor_gradients, actor.trainable_variables))
            critic_optimizer.apply_gradients(zip(critic_gradients, critic.trainable_variables))
            # Move to the next state; stop when the episode ends
            state = next_state
            if done:
                break
    del tape  # release the persistent tape
    if episode % 10 == 0:
        print(f"Episode {episode}, Reward: {episode_reward}")
env.close()
- Improved Sample Efficiency
- The hybrid nature of Actor-Critic algorithms requires fewer interactions with the environment to achieve optimal performance
- Faster Convergence
- The method updates both the policy and the value function concurrently
- This leads to faster convergence during training and quicker adaptation to the learning task
- Versatility Across Action Spaces
- Actor-Critic architectures can seamlessly handle both discrete and continuous action spaces
- This offers flexibility in addressing a wide range of RL problems
- Off-Policy Learning
- Some actor-critic variants can learn from past experiences, even when not directly following the current policy
Note
A3C builds upon A2C by introducing parallelism
- A2C uses a single actor-critic pair
- A3C uses multiple actor-critic pairs, each operating simultaneously
- Each pair interacts with a separate copy of the environment, collecting data independently
- These experiences are then used to update a global actor-critic network (see the sketch below)
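A rough structural sketch of that idea, reusing the `actor`, `critic` and optimisers defined in the implementation above and using Python threads as stand-in workers. Real A3C implementations add local network copies, n-step returns and proper synchronisation; this only shows the structure.
import threading
import gym
import numpy as np
import tensorflow as tf

# Each worker owns its own environment copy, computes gradients from its own
# experience and applies them to the shared (global) actor and critic.
def worker(global_actor, global_critic, episodes=10, gamma=0.99):
    env = gym.make('CartPole-v1')                 # separate environment per worker
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            with tf.GradientTape(persistent=True) as tape:
                probs = global_actor(np.array([state]))
                action = np.random.choice(env.action_space.n, p=probs.numpy()[0])
                next_state, reward, done, _ = env.step(action)
                advantage = (reward
                             + gamma * global_critic(np.array([next_state]))[0, 0]
                             - global_critic(np.array([state]))[0, 0])
                actor_loss = -tf.math.log(probs[0, action]) * advantage
                critic_loss = tf.square(advantage)
            # Updates from every worker flow into the same global networks
            actor_optimizer.apply_gradients(
                zip(tape.gradient(actor_loss, global_actor.trainable_variables),
                    global_actor.trainable_variables))
            critic_optimizer.apply_gradients(
                zip(tape.gradient(critic_loss, global_critic.trainable_variables),
                    global_critic.trainable_variables))
            del tape
            state = next_state
    env.close()

threads = [threading.Thread(target=worker, args=(actor, critic)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()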
- Robotics
- Empowers robots to learn optimal control policies
- Allows them to adapt to and navigate complex environments
- Game Playing
- Used to train agents to make strategic decisions
- Improves gameplay over time
- Autonomous Vehicles
- Supports dynamic decisions in real time
- Contributes to the evolution of self-driving technology
- Finance and Trading
- Optimizes trading strategies and makes intelligent financial decisions in dynamic markets
- Healthcare
- Personalised treatment planning
- Agents learn to make decisions that maximise patient outcomes based on individual health profiles