Bad performance on long run on MsPacman and SpaceInvaders #233
Hello, just to give more information: both my trainings are now at around 1.8M learning steps and 7.3M env steps (i.e. close to 30M Atari frames), and the performance has still not made any improvement. I will stop them, because it is highly unlikely that they will improve after more than 1M learning steps without any progress. To sum up, performance on both MsPacman and SpaceInvaders seems to be really low and stops improving after only 600k learning steps. Note that I only ran one seed, but this is probably enough to say that there is an issue and that the current code probably can't reproduce the results of the original MuZero.
Hello, all our current experiments (on MsPacman) are being conducted with …

Experimental Observations: …

Future Work: …
Hello, I think it is still way too early to know whether your experiments will continue to improve or whether they will stagnate like the ones I launched. I believe a 'good' score in MsPacman would be around 15k–20k, which is still far from the 250k score mentioned in the original MuZero paper (although they used 20 billion frames, which is insane...). However, consistently completing the first level of the game seems like a good target for determining whether the current implementation is 'working' as expected. Finally, I have one last question: why do you set …?
Consistent with your analysis, we will indeed pay attention to the subsequent long-term experimental results to confirm the robustness of the performance, and we will update the relevant info here. As for the reason for using …: currently, during data collection, the priorities we store are calculated based on the L1 error between the … To maintain consistency, we might need to use the L1 error between the …
It's true that in the original paper they say that priorities are based on the L1 error between the … By the way, I wonder whether we should use the L1 error with the true values or with the scaled values. It seems more intuitive to me to use the TD error with the scaled values, as that is what we actually use to compute the loss for backpropagation. Moreover, if we don't use reward clipping, the TD error with the true values can be really high (in some games the expected value can be greater than 1 million), and I think this could impact the stability of the training.
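To make the two options concrete, here is a minimal sketch (function names are illustrative, not the repo's API) comparing a priority computed in true value space against one computed in the scaled space, assuming the invertible transform h(x) = sign(x)(sqrt(|x|+1) − 1) + εx from the MuZero paper:

```python
import numpy as np

EPS = 0.001  # epsilon used by the value transform in the MuZero paper

def scalar_transform(x):
    """Invertible value scaling h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + EPS * x

def priority_true(search_value, n_step_return):
    """L1 error in the raw (true) value space; unbounded without reward clipping."""
    return np.abs(search_value - n_step_return)

def priority_scaled(search_value, n_step_return):
    """L1 error in the scaled space that the value loss actually operates in."""
    return np.abs(scalar_transform(search_value) - scalar_transform(n_step_return))

# With large unclipped returns the two options differ drastically:
nu, z = 1_000_000.0, 900_000.0
print(priority_true(nu, z))    # 100000.0
print(priority_scaled(nu, z))  # ~151.3 -- the sqrt compression keeps priorities bounded
```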
Here are the results of running the previous experiment for a longer duration. The observations are generally consistent with earlier expectations. As illustrated by the grey line, the collect reward shows …

Additionally, the current time cost is substantial. Besides using a single GPU, our setting of `replay_ratio=0.25` has led to 1 million training iterations (1M train_iter) for 5 million environment steps. In contrast, the original MuZero paper had 1 million training iterations for 50 million environment steps (50M env steps). Therefore, we hypothesize that the discrepancy in experimental results compared to the original paper is …

To verify our hypothesis, we are currently conducting experiments to align the network size with the original MuZero paper by increasing …
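As a back-of-the-envelope check on these figures (a sketch; the 0.02 effective ratio attributed to the original MuZero below is simply 1M iterations divided by 50M env steps, not a quoted hyperparameter):

```python
def train_iters(env_steps, replay_ratio):
    """Gradient updates implied by a replay ratio (updates per env step)."""
    return int(env_steps * replay_ratio)

# This repo's setting: replay_ratio=0.25 over 5M env steps -> ~1.25M train_iter.
print(train_iters(5_000_000, 0.25))   # 1250000

# Original MuZero: 1M train iterations over 50M env steps,
# i.e. an effective ratio of 1e6 / 50e6 = 0.02.
print(train_iters(50_000_000, 0.02))  # 1000000
```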
Hello, thank you for this detailed answer. I just realized that the input used in MuZero is totally different from the one you are currently using (which is, I think, the one used in EfficientZero). Cf. Appendix E in the MuZero paper.

I think this could be a drastic change, with probably a better chance of replicating the results of the original MuZero.
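For reference, Appendix E describes the Atari input as the last 32 RGB frames at 96x96 resolution together with the last 32 actions, each action encoded as a constant bias plane scaled by a/18, i.e. 32*3 + 32 = 128 planes. Below is a minimal sketch of that encoding; the plane ordering and frame normalization are my assumptions, as the paper does not fully pin them down:

```python
import numpy as np

NUM_ACTIONS = 18  # full Atari action set, used to scale the action bias planes

def encode_observation(frames, actions):
    """Stack the last 32 RGB frames (96x96) with the last 32 actions, each
    action broadcast to a constant bias plane scaled by a/18, for a total of
    32*3 + 32 = 128 planes (cf. Appendix E of the MuZero paper)."""
    assert len(frames) == 32 and len(actions) == 32
    rgb_planes = [np.transpose(f, (2, 0, 1)) / 255.0 for f in frames]         # (3, 96, 96) each
    action_planes = [np.full((1, 96, 96), a / NUM_ACTIONS) for a in actions]  # (1, 96, 96) each
    return np.concatenate(rgb_planes + action_planes, axis=0)                 # (128, 96, 96)

obs = encode_observation(
    frames=[np.zeros((96, 96, 3), dtype=np.uint8)] * 32,
    actions=[0] * 32,
)
print(obs.shape)  # (128, 96, 96)
```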
Hello, do you have any news about the long-run performance? Are there new experiments running with a bigger network and/or more RGB frames as input?
Hello, thank you for your patience. Regarding the results of the previous experiment with extended runtime, we found that it was interrupted due to space limitations. Therefore, we currently only have results for 5M environment steps, which show a trend of continued performance improvement. Following the previous suggestion (#233 (comment)), we conducted experiments with a larger network, but the results at 2M environment steps did not show significant performance improvement. As for the experiments using more RGB frames as input, we have not yet conducted them. We plan to restart the previously interrupted experiments within the coming week. Thank you again for your patience.
Hello,
This issue is closely related to #229. I am trying to reproduce the results of the original MuZero paper on Atari (or at least on a small subset of games; for the moment I have tried MsPacman and SpaceInvaders).
I started with the latest commit of main (6090ab1) and made just a few small changes to `atari_muzero_config.py`, listed below (see the sketch after this list):

- `update_per_collect = 1000` ⇒ `model_update_ratio = 0.25`
- `max_env_step = int(1e6)` ⇒ `max_env_step = int(1e8)`
- `optim_type='SGD'` / `learning_rate=0.2` ⇒ `optim_type='Adam'` / `learning_rate=0.003` / `lr_piecewise_constant_decay=False`
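For concreteness, here is a sketch of how those modified fields might look in `atari_muzero_config.py` (the nesting is abbreviated and may not match the repo's config dict exactly):

```python
# Sketch of the changes described above, in the style of atari_muzero_config.py.
# Only the modified keys are shown; the surrounding config is abbreviated.
max_env_step = int(1e8)                  # was: int(1e6)

atari_muzero_config = dict(
    policy=dict(
        model_update_ratio=0.25,         # replaces: update_per_collect=1000
        optim_type='Adam',               # was: 'SGD'
        learning_rate=0.003,             # was: 0.2
        lr_piecewise_constant_decay=False,
    ),
)
```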
The results on both MsPacman and SpaceInvaders seem to be pretty low, with no more learning after 600k learning steps (i.e. 600k * (1/0.25) = 2.4M env steps, because the update ratio is 0.25, and about 10M env frames with a frame_skip of 4; see the worked check below). I still have both experiments running; one is now at around 1.4M learning steps while the other is at 1.1M, but the results don't seem to improve anymore. Do you know what could be the reason for such a failure?
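As a quick worked check of the step/frame accounting (nothing here beyond the arithmetic already stated):

```python
REPLAY_RATIO = 0.25  # model_update_ratio: gradient steps per env step
FRAME_SKIP = 4       # each env step consumes 4 raw Atari frames

learning_steps = 600_000
env_steps = learning_steps / REPLAY_RATIO  # 2,400,000 env steps
raw_frames = env_steps * FRAME_SKIP        # 9,600,000 -- roughly the 10M frames above
print(env_steps, raw_frames)               # 2400000.0 9600000.0
```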
Did you use any `lr_piecewise_constant_decay` in your experiments? Or did you use `eps_greedy_exploration_in_collect`?
Another big question is why so much data is collected before doing any backprop at all. In the current implementation you wait for the end of 8 collected episodes before starting backprop, which can be tons of transitions, since episodes in Atari can be really long. I don't think this is good for stable learning; I think we should start doing backprop after a fixed number of collected transitions (a quick and dirty improvement could be to collect ONLY ONE episode and do some backprop after; see the sketch below).
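A minimal sketch of the suggested alternative, interleaving training with collection after a fixed number of transitions instead of after 8 full episodes; all names here are hypothetical stand-ins, not LightZero's actual API:

```python
import random

REPLAY_RATIO = 0.25   # gradient steps per collected transition
COLLECT_CHUNK = 400   # train after this many transitions, not after 8 episodes
BATCH_SIZE = 32

buffer = []

def collect_steps(n):
    """Stand-in for the collector: returns n dummy transitions (partial episodes are fine)."""
    return [{"obs": None, "action": 0, "reward": 0.0} for _ in range(n)]

def train_step(batch):
    """Stand-in for one gradient update on a sampled batch."""
    pass

for _ in range(10):  # a few interleaved collect/train rounds
    buffer.extend(collect_steps(COLLECT_CHUNK))
    # Keep the number of updates tied to the replay ratio: 0.25 updates per transition.
    for _ in range(int(COLLECT_CHUNK * REPLAY_RATIO)):
        train_step(random.sample(buffer, min(BATCH_SIZE, len(buffer))))
```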
Below are the curves I obtained after around 1.4M learning steps, equivalent to 5.6M env steps (i.e. around 22M frames, because there is a frame_skip of 4):

For `MsPacmanNoFrameskip-v4`: [reward curves]

For `SpaceInvadersNoFrameskip-v4`: [reward curves]