Bad performance on long run on MsPacman and SpaceInvaders #233
Hello, just to give more information: both my trainings are now at around 1.8M learning steps and 7.3M env steps (i.e. close to 30M Atari frames), and the performance has still not made any improvement. I will stop them, because it is highly unlikely that they will improve after more than 1M learning steps without any progress. To sum up, performance on both MsPacman and SpaceInvaders seems to be really low and stops improving after only 600k learning steps. Note that I only ran one seed, but this is probably enough to say that there is an issue and that the current code probably can't reproduce the results of the original MuZero.
Hello, all our current experiments (on MsPacman) are being conducted with …

Experimental Observations: …

Future Work: …
Hello, I think it is still way too early to know whether your experiments will continue to improve or whether they will stagnate like the ones I launched. I believe a 'good' score in MsPacman would be around 15k–20k, which is still far from the 250k score mentioned in the original MuZero paper (although they used 20 billion frames, which is insane...). However, consistently completing the first level of the game seems like a good target for determining whether the current implementation is 'working' as expected. Finally, I have one last question: why do you set …?
Consistent with your analysis, we will indeed pay attention to the subsequent long-term experimental results to confirm the robustness of the performance, and we will update the relevant info here. As for the reason for using …: currently, during data collection, the priorities we store are calculated based on the L1 error between the … To maintain consistency, we might need to use the L1 error between the …
It's true that in the original paper they say that priorities are based on the L1 error between the … By the way, I wonder whether we should use the L1 error with the true values or with the scaled values. It seems more intuitive to me to use the TD error with the scaled values, as that is what we actually use to compute the loss for backpropagation. Moreover, if we don't use reward clipping, the TD error with the true values can be really high (in some games the expected value can be greater than 1 million), and I think this could impact the stability of the training.
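To make the two options concrete, here is a minimal sketch (function names are illustrative, not the repo's API) comparing a priority computed in true value space against one computed in the scaled space, assuming the invertible transform h(x) = sign(x)(sqrt(|x|+1) − 1) + εx from the MuZero paper:

```python
import numpy as np

EPS = 0.001  # epsilon used by the value transform in the MuZero paper

def scalar_transform(x):
    """Invertible value scaling h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + EPS * x

def priority_true(search_value, n_step_return):
    """L1 error in the raw (true) value space; unbounded without reward clipping."""
    return np.abs(search_value - n_step_return)

def priority_scaled(search_value, n_step_return):
    """L1 error in the scaled space that the value loss actually operates in."""
    return np.abs(scalar_transform(search_value) - scalar_transform(n_step_return))

# With large unclipped returns the two options differ drastically:
nu, z = 1_000_000.0, 900_000.0
print(priority_true(nu, z))    # 100000.0
print(priority_scaled(nu, z))  # ~151.3 -- the sqrt compression keeps priorities bounded
```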
Here are the results of running the previous experiment for a longer duration. The observations are generally consistent with earlier expectations. As illustrated by the grey line, the collect reward shows …

Additionally, the current time cost is substantial. Besides using a single GPU, our setting of `replay_ratio=0.25` has led to 1 million training iterations (1M train_iter) for 5 million environment steps. In contrast, the original MuZero paper had 1 million training iterations for 50 million environment steps (50M env steps). Therefore, we hypothesize that the discrepancy in experimental results compared to the original paper is …

To verify our hypothesis, we are currently conducting experiments to align the network size with the original MuZero paper by increasing …
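As a back-of-the-envelope check on these figures (a sketch; the 0.02 effective ratio attributed to the original MuZero below is simply 1M iterations divided by 50M env steps, not a quoted hyperparameter):

```python
def train_iters(env_steps, replay_ratio):
    """Gradient updates implied by a replay ratio (updates per env step)."""
    return int(env_steps * replay_ratio)

# This repo's setting: replay_ratio=0.25 over 5M env steps -> ~1.25M train_iter.
print(train_iters(5_000_000, 0.25))   # 1250000

# Original MuZero: 1M train iterations over 50M env steps,
# i.e. an effective ratio of 1e6 / 50e6 = 0.02.
print(train_iters(50_000_000, 0.02))  # 1000000
```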
Hello, thank you for this detailed answer. I just realized that the input used in MuZero is totally different from the one you are currently using (which is, I think, the one used in EfficientZero). Cf. Appendix E in the MuZero paper.

I think this could be a drastic change, with probably a better chance of replicating the results of the original MuZero.
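For reference, Appendix E describes the Atari input as the last 32 RGB frames at 96x96 resolution together with the last 32 actions, each action encoded as a constant bias plane scaled by a/18, i.e. 32*3 + 32 = 128 planes. Below is a minimal sketch of that encoding; the plane ordering and frame normalization are my assumptions, as the paper does not fully pin them down:

```python
import numpy as np

NUM_ACTIONS = 18  # full Atari action set, used to scale the action bias planes

def encode_observation(frames, actions):
    """Stack the last 32 RGB frames (96x96) with the last 32 actions, each
    action broadcast to a constant bias plane scaled by a/18, for a total of
    32*3 + 32 = 128 planes (cf. Appendix E of the MuZero paper)."""
    assert len(frames) == 32 and len(actions) == 32
    rgb_planes = [np.transpose(f, (2, 0, 1)) / 255.0 for f in frames]         # (3, 96, 96) each
    action_planes = [np.full((1, 96, 96), a / NUM_ACTIONS) for a in actions]  # (1, 96, 96) each
    return np.concatenate(rgb_planes + action_planes, axis=0)                 # (128, 96, 96)

obs = encode_observation(
    frames=[np.zeros((96, 96, 3), dtype=np.uint8)] * 32,
    actions=[0] * 32,
)
print(obs.shape)  # (128, 96, 96)
```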
Hello, do you have any news about the long-run performance? Are there new experiments running with a bigger network and/or more RGB frames as input?
Hello, thank you for your patience. Regarding the results of the previous experiment with extended runtime, we found that it was interrupted due to space limitations. Therefore, we currently only have results for 5M environment steps, which show a trend of continued performance improvement. Following the previous suggestion (#233 (comment)), we conducted experiments with a larger network, but the results at 2M environment steps did not show significant performance improvement. As for the experiments using more RGB frames as input, we have not yet conducted them. We plan to restart the previously interrupted experiments within the coming week. Thank you again for your patience.
Hello,
This issue is closely related to #229. I am trying to reproduce the results of the original MuZero paper on Atari (or at least on a small subset of games; for the moment I have tried MsPacman and SpaceInvaders).
I started with the latest commit of main (6090ab1) and made just a few small changes to `atari_muzero_config.py`, listed below (see the sketch after this list):

- `update_per_collect = 1000` ⇒ `model_update_ratio = 0.25`
- `max_env_step = int(1e6)` ⇒ `max_env_step = int(1e8)`
- `optim_type='SGD'` / `learning_rate=0.2` ⇒ `optim_type='Adam'` / `learning_rate=0.003` / `lr_piecewise_constant_decay=False`
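For concreteness, here is a sketch of how those modified fields might look in `atari_muzero_config.py` (the nesting is abbreviated and may not match the repo's config dict exactly):

```python
# Sketch of the changes described above, in the style of atari_muzero_config.py.
# Only the modified keys are shown; the surrounding config is abbreviated.
max_env_step = int(1e8)                  # was: int(1e6)

atari_muzero_config = dict(
    policy=dict(
        model_update_ratio=0.25,         # replaces: update_per_collect=1000
        optim_type='Adam',               # was: 'SGD'
        learning_rate=0.003,             # was: 0.2
        lr_piecewise_constant_decay=False,
    ),
)
```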
The results on both MsPacman and SpaceInvaders seem to be pretty low, with no more learning after 600k learning steps (i.e. 600k * (1/0.25) = 2.4M env steps, because the update ratio is 0.25, and about 10M env frames with a frame_skip of 4; see the worked check below). I still have both experiments running; one is now at around 1.4M learning steps while the other is at 1.1M, but the results don't seem to improve anymore. Do you know what could be the reason for such a failure?
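As a quick worked check of the step/frame accounting (nothing here beyond the arithmetic already stated):

```python
REPLAY_RATIO = 0.25  # model_update_ratio: gradient steps per env step
FRAME_SKIP = 4       # each env step consumes 4 raw Atari frames

learning_steps = 600_000
env_steps = learning_steps / REPLAY_RATIO  # 2,400,000 env steps
raw_frames = env_steps * FRAME_SKIP        # 9,600,000 -- roughly the 10M frames above
print(env_steps, raw_frames)               # 2400000.0 9600000.0
```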
Did you use any `lr_piecewise_constant_decay` in your experiments? Or did you use `eps_greedy_exploration_in_collect`?
Another big question is why so much data is collected before doing any backprop at all. In the current implementation you wait for the end of 8 collected episodes before starting backprop, which can be tons of transitions, since episodes in Atari can be really long. I don't think this is good for stable learning; I think we should start doing backprop after a fixed number of collected transitions (a quick and dirty improvement could be to collect ONLY ONE episode and do some backprop after; see the sketch below).
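A minimal sketch of the suggested alternative, interleaving training with collection after a fixed number of transitions instead of after 8 full episodes; all names here are hypothetical stand-ins, not LightZero's actual API:

```python
import random

REPLAY_RATIO = 0.25   # gradient steps per collected transition
COLLECT_CHUNK = 400   # train after this many transitions, not after 8 episodes
BATCH_SIZE = 32

buffer = []

def collect_steps(n):
    """Stand-in for the collector: returns n dummy transitions (partial episodes are fine)."""
    return [{"obs": None, "action": 0, "reward": 0.0} for _ in range(n)]

def train_step(batch):
    """Stand-in for one gradient update on a sampled batch."""
    pass

for _ in range(10):  # a few interleaved collect/train rounds
    buffer.extend(collect_steps(COLLECT_CHUNK))
    # Keep the number of updates tied to the replay ratio: 0.25 updates per transition.
    for _ in range(int(COLLECT_CHUNK * REPLAY_RATIO)):
        train_step(random.sample(buffer, min(BATCH_SIZE, len(buffer))))
```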
Below are the curves I obtained after around 1.4M learning steps, equivalent to 5.6M env steps (i.e. around 22M frames, because there is a frame_skip of 4):

For `MsPacmanNoFrameskip-v4`: [reward curves]

For `SpaceInvadersNoFrameskip-v4`: [reward curves]