Consider the problem of determining a value function with function approximation. In the tabular case, we could use backups to compute our value function exactly. With function approximation, we must instead update a parameter vector, taking a gradient step for each sampled state or state-action pair that moves our estimate closer to its target.
However, how can we compute the loss of our current value function if we don't know the true value function?
The solution is actually quite simple. Rather than performing a backup, where we overwrite the stored value of a state with a new estimate, we take a gradient step in the direction of that backup. We step along the gradient rather than backing up because, with an approximator, a single set of parameters must balance prediction errors across many states at once.
If we knew the true value function $v_\pi$, we could simply move the weights towards a local minimum of the error by stochastic gradient descent on the prediction error at each sampled state:

$$\theta_{t+1} \gets \theta_t + \alpha \left[ v_\pi(S_t) - \hat{v}(S_t \mid \theta_t) \right] \nabla \hat{v}(S_t \mid \theta_t)$$

Note that this is (up to a factor of $\tfrac{1}{2}$) a stochastic sample of the negative gradient of the mean squared value error,

$$\overline{VE}(\theta) = \sum_s \mu(s) \left[ v_\pi(s) - \hat{v}(s \mid \theta) \right]^2,$$

where $\mu(s)$ is the fraction of time spent in state $s$ under the policy. However, since we do not know $v_\pi$, we must substitute an estimated target $U_t$ in its place:

$$\theta_{t+1} \gets \theta_t + \alpha \left[ U_t - \hat{v}(S_t \mid \theta_t) \right] \nabla \hat{v}(S_t \mid \theta_t)$$
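As a minimal illustrative sketch (not from the original notes), the update above can be written as follows, assuming linear function approximation, $\hat{v}(s \mid \theta) = \theta^\top x(s)$, so that $\nabla \hat{v}(s \mid \theta) = x(s)$; the function name `gradient_update` and the feature vectors are hypothetical.

```python
import numpy as np

def gradient_update(theta, x, u_t, alpha):
    """One stochastic gradient step of theta towards the target U_t.

    Assumes linear function approximation: v_hat(s | theta) = theta @ x(s),
    so grad_theta v_hat(s | theta) = x(s).
    """
    v_hat = theta @ x                   # current estimate v_hat(S_t | theta_t)
    error = u_t - v_hat                 # U_t - v_hat(S_t | theta_t)
    return theta + alpha * error * x    # theta_t + alpha * error * grad v_hat

# Hypothetical usage: one update with a 4-dimensional feature vector.
theta = np.zeros(4)
x_s = np.array([1.0, 0.0, 0.5, 0.0])   # x(S_t), an assumed feature encoding
theta = gradient_update(theta, x_s, u_t=1.2, alpha=0.1)
```

Both targets in the table below plug into this same update; only the way $U_t$ is computed differs.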
| Target | Target expression $U_t$ | Parameter update step | Biased? |
|---|---|---|---|
| Monte Carlo return | $U_t \gets G_t$ | $\theta_{t+1} \gets \theta_t + \alpha \left[ G_t - \hat{v}(S_t \mid \theta_t) \right] \nabla \hat{v}(S_t \mid \theta_t)$ | No |
| TD(0) return | $U_t \gets R_t + \gamma \hat{v}(S' \mid \theta_t)$ | $\theta_{t+1} \gets \theta_t + \alpha \left[ R_t + \gamma \hat{v}(S' \mid \theta_t) - \hat{v}(S_t \mid \theta_t) \right] \nabla \hat{v}(S_t \mid \theta_t)$ | Yes |
Note that the TD(0) return is biased, since the target is defined in terms of the current parameters $\theta_t$: it bootstraps from our own estimate $\hat{v}(S' \mid \theta_t)$, which in general differs from $v_\pi(S')$.
The Monte Carlo update step, by contrast, uses an unbiased estimate of $v_\pi(S_t)$, since $\mathbb{E}_\pi[G_t \mid S_t = s] = v_\pi(s)$.
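To make the bias distinction concrete, here is a hedged sketch of the two targets under the same linear approximation assumed above; the names `mc_target` and `td0_target`, and the representation of rewards and features, are illustrative assumptions rather than anything from the source.

```python
import numpy as np

def mc_target(rewards, t, gamma):
    """Monte Carlo target U_t = G_t: the full discounted return from time t.

    G_t does not depend on theta, so E[G_t | S_t] = v_pi(S_t): unbiased."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

def td0_target(r_t, x_next, theta, gamma):
    """TD(0) target U_t = R_t + gamma * v_hat(S' | theta).

    The target bootstraps from the current parameters theta, so its
    expectation is not v_pi(S_t) in general: biased."""
    return r_t + gamma * float(np.dot(theta, x_next))
```

Either target could be fed to a generic update like the `gradient_update` sketch above; in the TD(0) case the dependence of the target on $\theta_t$ is ignored when taking the gradient, which is why such updates are commonly called semi-gradient methods.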