Possible inconsistencies with the PPO implementation #477
Comments
Wow, impressive, but I'm a bit confused about what you would like to do.
@pseudo-rnd-thoughts Thank you for the prompt response! I would like to work together to determine the cause of these discrepancies, possibly making CleanRL's implementation more consistent as a result :) I don't think the seeds caused these differences, because the other 50 environments also used random seeds and were statistically consistent, as seen in the table below. If seeds were the issue, they probably would have affected more than just the six environments in the table above. I haven't been able to find the cause of these inconsistencies yet. Do you have any suggestions?
The PPO implementations of CleanRL and SB3 are indeed inconsistent, and at least one difference I understand is the handling of truncation. SB3 fixes the mishandling of environment truncation in OpenAI Baselines, while CleanRL keeps this issue. But for Atari envs, I'm not sure how big of an impact that is. See 👇
@sdpkjc Thanks for the suggestions! It's surprising that such inconsistencies exist. I will look into it and determine whether that is really the cause of the discrepancies.
Problem Description
I tested different implementations of the PPO algorithm and found some discrepancies among them. I tested each implementation on 56 Atari environments, with five trials per (implementation, environment) pair.
Checklist
I have installed dependencies via poetry install (see CleanRL's installation guideline).
Current Behavior
The table below shows an environment-wise one-way ANOVA testing the effect of implementation source on mean reward. Out of the 56 environments tested, the implementations differed significantly in nine environments, as seen in the table for Stable Baselines3, CleanRL, and Baselines (not the 108 variant).
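For readers unfamiliar with the test, the per-environment comparison can be sketched as below. This is a minimal illustration, not the script used for the table: the function name `one_way_anova_f` and the reward values are hypothetical, and the sketch only computes the F statistic and its degrees of freedom using the standard library. In practice the p-value would usually come from a statistics package, for example `scipy.stats.f_oneway`.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists; each inner list holds the mean rewards
    of the trials for one implementation in a single environment.
    Assumes at least two groups and non-zero within-group variance.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    group_means = [fmean(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual trials around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up rewards for one environment: three implementations, five trials each.
example_rewards = {
    "impl_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "impl_b": [2.0, 3.0, 4.0, 5.0, 6.0],
    "impl_c": [3.0, 4.0, 5.0, 6.0, 7.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(example_rewards.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

Repeating this for each of the 56 environments gives one test per environment, which is the structure the table summarizes.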
Expected Behavior
The implementations should not significantly differ in terms of mean reward.
Possible Solution
I believe that there are inconsistencies among the implementations that cause the observed environment-dependent discrepancies. For example, I found an inconsistency (i.e., a bug) in Baselines' implementation where the frames per episode did not conform to 108K as per the v4 ALE specification, causing mean rewards to differ significantly in some environments. After correcting this, three of the nine environments previously flagged as statistically different were no longer different, as seen in the table above with Baselines108. The remaining inconsistency is likely related to the environments, so I would suggest starting with parts of the implementation that might affect only a subset of environments (similar to the frames per episode).
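To make the 108K figure concrete, the arithmetic below converts a frame cap into agent steps and wall-clock game time. It is only a sketch: the frame skip of 4 and the 60 frames-per-second rate are assumptions typical of Atari setups, not values confirmed in this issue, and the helper names are hypothetical.

```python
MAX_FRAMES_PER_EPISODE = 108_000  # frame cap cited in this issue
ASSUMED_FRAME_SKIP = 4            # assumption: each agent action repeats for 4 frames
ASSUMED_FPS = 60                  # assumption: emulator frames per second


def agent_steps_for_cap(max_frames, frame_skip):
    """Number of agent decisions that fit within the frame cap."""
    return max_frames // frame_skip


def game_minutes_for_cap(max_frames, fps):
    """Length of a capped episode in minutes of emulated game time."""
    return max_frames / fps / 60


steps = agent_steps_for_cap(MAX_FRAMES_PER_EPISODE, ASSUMED_FRAME_SKIP)
minutes = game_minutes_for_cap(MAX_FRAMES_PER_EPISODE, ASSUMED_FPS)
print(f"{steps} agent steps, {minutes:.0f} minutes of game time")
```

An implementation that enforces a different cap lets long-running episodes accumulate more reward, which would only show up in environments where episodes actually reach the cap.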
Steps to Reproduce
I used ppo_atari.py and followed the same PPO hyperparameters (without LSTM) as discussed in the ICLR Blog by @vwxyzjn.