Possible inconsistencies with the PPO implementation #477

Open
3 tasks done
rajdeepsh opened this issue Aug 2, 2024 · 4 comments


@rajdeepsh

Problem Description

I tested several implementations of the PPO algorithm and found discrepancies among them. Each implementation was run on 56 Atari environments, with five trials per (implementation, environment) pair.

Checklist

Current Behavior

The table below shows a per-environment one-way ANOVA testing the effect of implementation source on mean reward. Of the 56 environments tested, the implementations differed significantly in nine, as the table shows for Stable Baselines3, CleanRL, and Baselines (not the 108 variant).
Screenshot 2024-08-02 at 3 39 59 PM
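
For readers who want to see what the per-environment test computes, here is a minimal sketch of the one-way ANOVA F statistic in plain Python. The function name and the reward values are made up for illustration and are not the data behind the table. The sketch stops at the F statistic; a p-value needs the F distribution, which `scipy.stats.f_oneway` provides if SciPy is available.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups.

    Each group holds the mean rewards from the trials of one implementation
    on a single environment.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual trials around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical mean rewards: three implementations, five trials each.
rewards = [
    [310.0, 295.0, 330.0, 305.0, 320.0],
    [300.0, 315.0, 290.0, 325.0, 310.0],
    [250.0, 265.0, 240.0, 270.0, 255.0],
]
f_stat, df_b, df_w = one_way_anova_f(rewards)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With three implementations and five trials each, the test has 2 and 12 degrees of freedom, so each per-environment decision rests on a small sample.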

Expected Behavior

The implementations should not significantly differ in terms of mean reward.

Possible Solution

I believe inconsistencies among the implementations cause the observed environment-dependent discrepancies. For example, I found a bug in Baselines' implementation: the frames per episode did not conform to the 108K limit in the v4 ALE specification, which caused mean rewards to differ significantly in some environments. After correcting this, three of the nine environments previously flagged as statistically different no longer differed, as seen in the table above under Baselines108. The remaining inconsistency is likely environment-related, so I would suggest starting with the parts of the implementation that could affect only a subset of environments (as the frames-per-episode limit did).
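
To make the frames-per-episode point concrete, the sketch below converts an emulator-frame budget into a cap on agent steps, using the 108K frame limit and a frameskip of 4. The helper names are hypothetical and this is not code from Baselines or any of the other libraries; it only illustrates the arithmetic an implementation has to get right.

```python
def max_agent_steps(max_frames: int, frameskip: int) -> int:
    """Number of agent decisions that fit in the emulator-frame budget."""
    if frameskip <= 0:
        raise ValueError("frameskip must be positive")
    return max_frames // frameskip


def should_truncate(agent_step: int, max_frames: int, frameskip: int) -> bool:
    """True once the episode has used up its frame budget.

    agent_step counts the decisions already taken in the current episode.
    """
    return agent_step >= max_agent_steps(max_frames, frameskip)


MAX_FRAMES_PER_EPISODE = 108_000  # frame limit quoted in the issue
FRAMESKIP = 4                     # each agent action is repeated for 4 frames

print(max_agent_steps(MAX_FRAMES_PER_EPISODE, FRAMESKIP))  # 27000
```

An implementation that applies the limit in the wrong unit (frames instead of agent steps, or the reverse) ends episodes at a different point, which would matter only in games where episodes can run that long.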

Steps to Reproduce

  1. I used ppo_atari.py and followed the same PPO hyperparameters (without LSTM) as discussed in the ICLR Blog by @vwxyzjn
# Environment
Max Frames Per Episode = 108000
Frameskip = 4
Max Of Last 2 Frames = True
Max Steps Per Episode = 27000
Framestack = 4

Observation Type = Grayscale
Frame Size = 84 x 84

Max No Operation Actions = 30
Repeat Action Probability = 0.0

Terminal On Life Loss = True
Fire Action on Reset = True
Reward Clip = {-1, 0, 1}
Full Action Space = False

# Algorithm
Neural Network Feature Extractor = Nature CNN
Neural Network Policy Head = Linear Layer with n_actions output features
Neural Network Value Head = Linear Layer with 1 output feature
Shared Feature Extractor = True
Orthogonal Initialization = True
Scale Images to [0, 1] = True
Optimizer = Adam with 1e-5 Epsilon

Learning Rate = 2.5e-4
Decay Learning Rate = True

Number of Environments = 8
Number of Steps = 128
Batch Size = 256
Number of Minibatches = 4
Number of Epochs = 4
Gamma = 0.99
GAE Lambda = 0.95
Clip Range = 0.1
VF Clip Range = 0.1
Normalize Advantage = True
Entropy Coefficient = 0.01
VF Coefficient = 0.5
Max Gradient Normalization = 0.5
Use Target KL = False
Total Timesteps = 10000000
Log Interval = 1
Evaluation Episodes = 100
Deterministic Evaluation = False

Seed = Random
Number of Trials = 5
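
The quantities derived from the settings above can be checked with a short sketch. It assumes that "Batch Size = 256" refers to the minibatch size (the rollout of 8 environments x 128 steps split into 4 minibatches) and that the learning rate decays linearly to zero over training; both are readings of the list, not confirmed details, and all names here are hypothetical.

```python
NUM_ENVS = 8
NUM_STEPS = 128
NUM_MINIBATCHES = 4
NUM_EPOCHS = 4
TOTAL_TIMESTEPS = 10_000_000
LEARNING_RATE = 2.5e-4

# Transitions collected per policy update (one rollout across all envs).
rollout_size = NUM_ENVS * NUM_STEPS                # 1024
# Transitions per gradient step (assumed to be the listed "Batch Size").
minibatch_size = rollout_size // NUM_MINIBATCHES   # 256
# Number of policy updates that fit in the timestep budget.
num_updates = TOTAL_TIMESTEPS // rollout_size      # 9765
# Gradient steps performed on each rollout.
grad_steps_per_update = NUM_EPOCHS * NUM_MINIBATCHES  # 16


def linear_lr(update: int, total_updates: int, base_lr: float) -> float:
    """Linearly decayed learning rate for a 1-indexed update (assumed schedule)."""
    frac = 1.0 - (update - 1) / total_updates
    return frac * base_lr


print(rollout_size, minibatch_size, num_updates, grad_steps_per_update)
print(linear_lr(1, num_updates, LEARNING_RATE))
print(linear_lr(num_updates, num_updates, LEARNING_RATE))
```

Libraries name these quantities differently, so matching the minibatch size of 256 across implementations means checking which parameter each one actually exposes.
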
@pseudo-rnd-thoughts
Collaborator

Wow, impressive, but I'm a bit confused about what you would like to do.
You note that this is with different seeds; could this explain the difference?
Have you been able to find any implementation details that could cause these issues?

@rajdeepsh
Author

@pseudo-rnd-thoughts Thank you for the prompt response! I would like to work together to determine the cause of these discrepancies, possibly making CleanRL's implementation more consistent as a result :) I don't think the seeds caused these differences: the other 50 environments also used random seeds, and they were statistically consistent, as seen in the table below. If seeds were the issue, they would probably have affected more than the six environments in the table above. I have not been able to find the cause of these inconsistencies yet. Do you have any suggestions?

Screenshot 2024-08-02 at 5 48 18 PM

@sdpkjc
Collaborator

sdpkjc commented Aug 2, 2024

The PPO implementations of CleanRL and SB3 are indeed inconsistent; at least one difference I know of is the handling of truncation. SB3 fixes the mishandling of environment truncation in OpenAI Baselines, while CleanRL keeps this issue. For Atari envs, though, I'm not sure how big an impact that has.
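
As a rough illustration of the truncation distinction (not taken from either codebase), the sketch below computes a one-step value target for a single transition. The function and the `bootstrap_on_truncation` flag are hypothetical. The idea is that an episode cut off by a time limit has not reached a true terminal state, so the target should still include the discounted value of the last observation; treating truncation like termination drops that term.

```python
def td_target(
    reward: float,
    next_value: float,
    gamma: float,
    terminated: bool,
    truncated: bool,
    bootstrap_on_truncation: bool = True,
) -> float:
    """One-step value target for a single transition.

    next_value is the critic's estimate for the observation reached by this
    transition (the final observation of the episode if it just ended).
    """
    if terminated:
        # True terminal state: no future return to bootstrap from.
        return reward
    if truncated and not bootstrap_on_truncation:
        # Mishandled case: the time limit is treated as a terminal state.
        return reward
    # Ongoing episode, or a time-limit cutoff handled by bootstrapping.
    return reward + gamma * next_value


handled = td_target(1.0, 10.0, 0.99, terminated=False, truncated=True)
mishandled = td_target(
    1.0, 10.0, 0.99, terminated=False, truncated=True,
    bootstrap_on_truncation=False,
)
print(handled, mishandled)
```

The two targets differ only on transitions where a time limit ends the episode, so the size of the effect depends on how often an environment actually reaches that limit.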

See 👇

@rajdeepsh
Author

rajdeepsh commented Aug 2, 2024

@sdpkjc Thanks for the suggestions! It's surprising that such inconsistencies exist. Will look into it and determine if that is really the cause of the discrepancies.
