- Agent: The entity making decisions and interacting with the environment
- Environment: The external system with which the agent interacts
- State: A representation of the current situation or configuration
- Action: The decision or move made by the agent
- Reward: The feedback the agent receives based on its actions
- Policy: The strategy or set of rules guiding the agent's decision making (illustrated in the short interaction loop below)
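To make these terms concrete, here is a minimal sketch of the agent-environment loop using OpenAI Gym's CartPole environment (the same environment used in the implementation further down). The random action sampling is only a placeholder for a learned policy.
import gym

# Minimal sketch of the agent-environment loop (classic Gym API, as used below).
# A random policy stands in for the actor; the learned policy comes later.
env = gym.make('CartPole-v1')                      # Environment
state = env.reset()                                # State: the initial observation
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()             # Action chosen by the (random) policy
    state, reward, done, _ = env.step(action)      # Environment returns the next state and a reward
    total_reward += reward                         # Cumulative reward for this episode
print(f"Episode reward: {total_reward}")
env.close()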
- Actor:
- Makes decisions based on the current policy
- Is responsible for exploring the action space in order to maximise expected cumulative reward
- Refines the policy over time, adapting to the dynamic nature of the environment
- Critic:
- Evaluates the actions taken by the actor
- Estimates the quality of these actions and provides feedback on their performance
- Guides the actor towards actions that lead to higher expected returns, contributing to overall improvement
- Policy (Actor):
- Represents the probability of taking action a in state s
- Denoted as π(a ∣ s)
- The actor seeks to maximise expected return by optimising this policy
- The policy is modelled by the actor network, whose parameters are denoted θ
- Value (Critic):
- Estimates the expected cumulative reward starting from state s
- Denoted as V(s)
- The value function is modelled by the critic network, whose parameters are denoted w
- Combination of policy gradient for the actor and value function for the critic
- Typically expressed as the sum of two components
- ∇_θ J(θ) ≈ (1/N) ∑_{i=1}^{N} ∇_θ log π_θ(a_i ∣ s_i) · A(s_i, a_i)
- Explanation:
- J(θ): the expected return under the policy parameterised by θ
- π_θ(a ∣ s): the policy function
- N: the number of sampled experiences
- A(s_i, a_i): the advantage function, representing the advantage of taking action a_i in state s_i
- i: the index of the sample
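As a rough sketch (not part of the original implementation below), this is how the policy-gradient estimate might be computed in TensorFlow for a batch of sampled transitions. The names `states`, `actions` and `advantages` are assumed placeholders for sampled data; `actor` is the Keras actor network defined further down.
import tensorflow as tf

# Sketch: Monte Carlo estimate of the policy gradient over N sampled transitions.
# Assumed inputs: `actor` (the Keras actor network defined below),
# `states` [N, state_dim], `actions` [N] (int), `advantages` [N] (float).
with tf.GradientTape() as tape:
    probs = actor(states)                                        # pi_theta(a | s_i) for each sample
    chosen = tf.gather(probs, actions, batch_dims=1)             # pi_theta(a_i | s_i)
    log_probs = tf.math.log(chosen)                              # log pi_theta(a_i | s_i)
    actor_loss = -tf.reduce_mean(log_probs * tf.cast(advantages, tf.float32))  # negated so minimising = ascent on J
actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
Minimising this negated loss with an optimiser is equivalent to gradient ascent on J(θ), which is what the update rules further down describe.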
- ∇_w J(w) ≈ (1/N) ∑_{i=1}^{N} ∇_w (V_w(s_i) − Q_w(s_i, a_i))²
- Explanation:
- ∇_w J(w): the gradient of the loss function with respect to the critic's parameters w
- N: the number of samples
- V_w(s_i): the critic's estimate of the value of state s_i using parameters w
- Q_w(s_i, a_i): the critic's estimate of the action value of taking action a_i in state s_i
- i: the index of the sample
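A comparable sketch for the critic's loss, again with assumed names (`critic`, `states`, `rewards`, `next_states`). Here Q_w(s_i, a_i) is approximated by the one-step TD target r_i + γ · V_w(s'_i), which is also what the implementation further down does.
import tensorflow as tf

# Sketch: mean squared value error for the critic over N sampled transitions.
# Assumed inputs: `critic` (the Keras critic network defined below),
# `states` [N, state_dim], `rewards` [N], `next_states` [N, state_dim].
gamma = 0.99
with tf.GradientTape() as tape:
    values = tf.squeeze(critic(states), axis=1)                  # V_w(s_i)
    next_values = tf.squeeze(critic(next_states), axis=1)        # V_w(s'_i)
    targets = tf.cast(rewards, tf.float32) + gamma * next_values # stand-in for Q_w(s_i, a_i)
    critic_loss = tf.reduce_mean(tf.square(values - tf.stop_gradient(targets)))
critic_grads = tape.gradient(critic_loss, critic.trainable_variables)  # grad_w J(w)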
- Involves adjusting the respective parameters of the actor and critic
- Actor
- Gradient ascent
- Critic
- Gradient descent
- Actor
- θ_{t+1} = θ_t + α ∇_θ J(θ_t)
- Explanation
- α: the learning rate for the actor
- t: the time step within an episode
- Critic
- w_{t+1} = w_t − β ∇_w J(w_t)
- Explanation
- w: the parameters of the critic network
- β: the learning rate for the critic
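Written out by hand, the two updates correspond to the sketch below. This is only illustrative: `grad_J_theta` and `grad_J_w` are hypothetical lists assumed to hold the gradient estimates ∇_θ J(θ) and ∇_w J(w), and `actor`/`critic` are the networks defined further down (where Adam performs these updates instead).
# Sketch of the raw update rules under the assumptions stated above.
alpha = 0.001    # actor learning rate
beta = 0.001     # critic learning rate

# Actor: gradient ascent, theta_{t+1} = theta_t + alpha * grad_theta J(theta_t)
for theta, g in zip(actor.trainable_variables, grad_J_theta):
    theta.assign_add(alpha * g)

# Critic: gradient descent, w_{t+1} = w_t - beta * grad_w J(w_t)
for w, g in zip(critic.trainable_variables, grad_J_w):
    w.assign_sub(beta * g)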
- The advantage function A(s,a) measures the advantage of taking action a in state s over the expected value of that state under the current policy
- A(s,a) = Q(s,a) − V(s)
- The advantage function therefore provides a measure of how much better or worse an action is compared to the average action
- Explanation of the A(s,a) = Q(s,a) − V(s) expression:
- The actor is updated based on the policy gradient, encouraging actions with higher advantages
- The critic is updated to minimise the difference between the estimated value and the action value
- Introduces the concept of the advantage function
- Measures how much better an action is compared to the average action in the given state
- By incorporating advantage information, A2C focuses the learning process on actions that have a significantly higher value than the typical action taken in that state
- Learning from average:
- Base Actor-Critic uses the difference between the actual reward and the estimated value (the critic's evaluation) to update the actor
- Learning from advantage:
- A2C uses the difference between an action's value and the average value of actions in that state
- This additional information refines the learning process a little further (see the short sketch below)
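A small numeric sketch of the A(s,a) = Q(s,a) − V(s) idea; the numbers are made up purely for illustration.
# Illustrative (made-up) numbers for a single state s with two actions.
V_s = 5.0                      # critic's estimate of the state value V(s)
Q_s_a1 = 7.0                   # estimated action value Q(s, a1)
Q_s_a2 = 4.0                   # estimated action value Q(s, a2)

advantage_a1 = Q_s_a1 - V_s    # +2.0: a1 is better than the average action in s
advantage_a2 = Q_s_a2 - V_s    # -1.0: a2 is worse than the average action in s

# In practice (and in the implementation below) Q(s, a) is approximated by the
# one-step TD target r + gamma * V(s'), giving A(s, a) ~ r + gamma * V(s') - V(s).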
- Initialise the parameters
- Policy (Actor): θ
- Value function (Critic): w
- The agent interacts with the environment
- Takes actions according to the current policy
- Receives rewards in return
- Compute the advantage function A(s,a) based on the current policy and value estimates
- Simultaneously update the actor's parameters using the policy gradient
- The policy gradient is derived from the advantage function
- It guides the actor to increase the probabilities of actions that lead to higher advantages
- Simultaneously update the critic's parameters using a value-based method
- This often involves minimising the temporal difference (TD) error
- The TD error is the difference between the observed rewards and the predicted values
Tip
- The actor learns a policy
- The critic then evaluates the actions taken by the actor
- The actor is updated using the policy gradient
- The critic is updated using the value-based method
[!NOTE] The actor-critic combination allows for more stable and efficient learning in complex environments
Important
The example implementation below makes use of TensorFlow and OpenAI Gym
- Import libraries
import numpy as np
import tensorflow as tf
import gym
- Creating CartPole Environment
- gym.make provides a standardized and convenient way to interact with various reinforcement learning tasks
env = gym.make('CartPole-v1')
- Define Actor and Critic Networks
- Actor and Critic are implemented as neural networks using TensorFlow's Keras API
- Actor network
- Maps the state to a probability distribution over actions
- Critic network
- Estimates the state's value
actor = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(env.action_space.n, activation='softmax')
])
critic = tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(1)
])
- Defining Optimizers and Loss Functions
- Adam optimizer is used for both actor and critic networks
actor_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
- Training Loop
- The main training loop runs for a specified number of episodes (1000)
- The agent interacts with the environment
- For each episode:
- Resets the environment
- Initializes the episode reward to 0
- The tf.GradientTape block
- Used to compute gradients for both the actor and critic networks
- The agent chooses an action based on the actor's output probabilities
- It then takes that action in the environment
- Observes:
- Next state
- Reward
- Whether the episode is done
- The advantage function is computed
- The difference between the expected return and the estimated value at the current state
- Actor and critic losses are calculated based on the advantage function
- Gradients are computed using tape.gradient
- Then applied to update the actor and critic networks using the respective optimisers
- The episode's total reward is updated and the loop continues until the episode ends
- Every 10 episodes the current episode number and reward are printed
num_episodes = 1000
gamma = 0.99
for episode in range(num_episodes):
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape(persistent=True) as tape:
        for t in range(1, 10000):  # Limit the number of time steps
            # Choose an action using the actor
            action_probs = actor(np.array([state]))
            action = np.random.choice(env.action_space.n, p=action_probs.numpy()[0])
            # Take the chosen action and observe the next state and reward
            next_state, reward, done, _ = env.step(action)
            # Compute the advantage
            state_value = critic(np.array([state]))[0, 0]
            next_state_value = critic(np.array([next_state]))[0, 0]
            advantage = reward + gamma * next_state_value - state_value
            # Compute actor and critic losses
            actor_loss = -tf.math.log(action_probs[0, action]) * advantage
            critic_loss = tf.square(advantage)
            episode_reward += reward
            # Update actor and critic
            actor_gradients = tape.gradient(actor_loss, actor.trainable_variables)
            critic_gradients = tape.gradient(critic_loss, critic.trainable_variables)
            actor_optimizer.apply_gradients(zip(actor_gradients, actor.trainable_variables))
            critic_optimizer.apply_gradients(zip(critic_gradients, critic.trainable_variables))
            # Move to the next state; stop when the episode ends
            state = next_state
            if done:
                break
    del tape  # release the persistent tape
    if episode % 10 == 0:
        print(f"Episode {episode}, Reward: {episode_reward}")
env.close()
- Improved Sample Efficiency
- The hybrid nature of Actor-Critic algorithms requires fewer interactions with the environment to achieve optimal performance
- Faster Convergence
- The method updates both the policy and the value function concurrently
- This leads to faster convergence during training and quicker adaptation to the learning task
- Versatility Across Action Spaces
- Actor-Critic architectures can seamlessly handle both discrete and continuous action spaces
- This offers flexibility in addressing a wide range of RL problems
- Off-Policy Learning
- Some actor-critic variants can learn from past experiences, even when not directly following the current policy
Note
A3C builds upon A2C by introducing parallelism
- A2C uses a single actor-critic pair
- A3C uses multiple actor-critic pairs, each operating simultaneously
- Each pair interacts with a separate copy of the environment, collecting data independently
- These experiences are then used to update a global actor-critic network (see the sketch below)
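A rough structural sketch of that idea, reusing the `actor`, `critic` and optimisers defined in the implementation above and using Python threads as stand-in workers. Real A3C implementations add local network copies, n-step returns and proper synchronisation; this only shows the structure.
import threading
import gym
import numpy as np
import tensorflow as tf

# Each worker owns its own environment copy, computes gradients from its own
# experience and applies them to the shared (global) actor and critic.
def worker(global_actor, global_critic, episodes=10, gamma=0.99):
    env = gym.make('CartPole-v1')                 # separate environment per worker
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            with tf.GradientTape(persistent=True) as tape:
                probs = global_actor(np.array([state]))
                action = np.random.choice(env.action_space.n, p=probs.numpy()[0])
                next_state, reward, done, _ = env.step(action)
                advantage = (reward
                             + gamma * global_critic(np.array([next_state]))[0, 0]
                             - global_critic(np.array([state]))[0, 0])
                actor_loss = -tf.math.log(probs[0, action]) * advantage
                critic_loss = tf.square(advantage)
            # Updates from every worker flow into the same global networks
            actor_optimizer.apply_gradients(
                zip(tape.gradient(actor_loss, global_actor.trainable_variables),
                    global_actor.trainable_variables))
            critic_optimizer.apply_gradients(
                zip(tape.gradient(critic_loss, global_critic.trainable_variables),
                    global_critic.trainable_variables))
            del tape
            state = next_state
    env.close()

threads = [threading.Thread(target=worker, args=(actor, critic)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()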
- Robotics
- Empowers robots to learn optimal control policies
- Allows them to adapt to and navigate complex environments
- Game Playing
- Used to train agents to make strategic decisions
- Improves gameplay over time
- Autonomous Vehicles
- Supports dynamic decisions in real time
- Contributes to the evolution of self-driving technology
- Finance and Trading
- Optimizes trading strategies and makes intelligent financial decisions in dynamic markets
- Healthcare
- Personalised treatment planning
- Agents learn to make decisions that maximise patient outcomes based on individual health profiles