How to access policy state with good train results? #711
-
Hi everyone, I implemented a custom environment (trading). When an episode is done, it prints out the sum of the rewards and some info. During training I sometimes see some pretty good rewards in these results, but even then the policy is not stored (`save_best` is not called). And unfortunately, even when it is stored and loaded back afterwards, the results never reach those good episodes. I'm pretty sure it is something that I'm missing and not understanding here, not a technical problem or a bug, and I hope you can give me some advice.
Is `best_reward` always the best reward from a test? (tianshou/tianshou/trainer/base.py Line 348 in f270e88) My test and train envs are identical and I'm using DQN (offpolicy). I was assuming that train and test should produce the same results (which is obviously wrong). Thank you
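For reference, here is roughly how I call the trainer and reload the saved weights afterwards (a simplified sketch: the env/collector setup is omitted, and `best.pth` plus the hyperparameter values are just placeholders):

```python
import torch
from tianshou.trainer import offpolicy_trainer

# As far as I can tell, save_fn only fires when a *test* run achieves
# a new best_reward, never directly from a good training episode.
result = offpolicy_trainer(
    policy, train_collector, test_collector,
    max_epoch=10, step_per_epoch=10000, step_per_collect=10,
    update_per_step=0.1, episode_per_test=10, batch_size=64,
    save_fn=lambda policy: torch.save(policy.state_dict(), "best.pth"),
)

# Later, restoring the saved weights:
policy.load_state_dict(torch.load("best.pth"))
policy.eval()
```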
-
DQN's performance is largely affected by eps-greedy. `eps_test` and `eps_train` are set to different values, so that's the reason for the different performance between train and test. `best_reward` always comes from a test. But if you are curious about "some pretty good result" in training, you can set `test_in_train=True` in the offpolicy trainer. This will freeze the policy and call `test_episode` to evaluate it once a training episode's reward rises above the given threshold.
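For example, the two eps values are typically wired in through `train_fn`/`test_fn`, and `test_in_train` hooks into `stop_fn`. An illustrative sketch (not a drop-in config; `reward_threshold` is a placeholder you'd choose yourself):

```python
from tianshou.trainer import offpolicy_trainer

reward_threshold = 100.0  # placeholder: whatever counts as "good enough"

result = offpolicy_trainer(
    policy, train_collector, test_collector,
    max_epoch=10, step_per_epoch=10000, step_per_collect=10,
    update_per_step=0.1, episode_per_test=10, batch_size=64,
    # exploration noise used while collecting training episodes
    train_fn=lambda epoch, env_step: policy.set_eps(0.1),
    # (near-)greedy behavior used during evaluation
    test_fn=lambda epoch, env_step: policy.set_eps(0.05),
    stop_fn=lambda mean_rewards: mean_rewards >= reward_threshold,
    # when a training episode clears the threshold, the trainer pauses
    # collection and runs test_episode with eps_test to verify it
    test_in_train=True,
)
```

That way, a lucky high-reward training episode is only trusted if the policy also performs well under `eps_test`.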