A learning rate of 5e-4 is generally a good starting point for most algorithms.
Evaluating an algorithm over multiple random seeds is standard practice in RL, since, unlike many CV tasks, training results can vary widely from seed to seed.
The networks used are generally shallow (2-3 layers); deeper networks may be needed for complex environments and for algorithms like IMPALA.
Unlike in computer vision, both Tanh and ReLU are commonly used as activations.
If observations have an unknown range, standardize them.
Keep a running estimate of the mean µ and standard deviation σ, and use x' = clip((x − µ)/σ, −10, 10).
Rescale the rewards, but don't shift the mean, as that affects the agent's will to live.
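A minimal sketch of both tips above (running standardization of observations, and scaling rewards by a running standard deviation without subtracting the mean). `RunningMeanStd`, `standardize_obs`, and `scale_reward` are hypothetical helpers written here for illustration, assuming only NumPy:

```python
import numpy as np

class RunningMeanStd:
    """Running estimate of mean and variance (parallel/Welford-style update)."""
    def __init__(self, shape=()):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 1e-4          # avoids division by zero on the first update

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64).reshape(-1, *self.mean.shape)
        b_mean, b_var, b_count = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        new_mean = self.mean + delta * b_count / total
        m2 = self.var * self.count + b_var * b_count + delta ** 2 * self.count * b_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

obs_rms = RunningMeanStd(shape=(4,))   # e.g. CartPole observations
rew_rms = RunningMeanStd(shape=())     # scalar rewards

def standardize_obs(obs):
    """x' = clip((x - mu) / sigma, -10, 10), as in the tip above."""
    obs_rms.update(obs)
    return np.clip((obs - obs_rms.mean) / np.sqrt(obs_rms.var + 1e-8), -10.0, 10.0)

def scale_reward(r):
    """Divide by the running std only; the mean is NOT subtracted."""
    rew_rms.update(r)
    return r / np.sqrt(rew_rms.var + 1e-8)
```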
If you have implemented an algorithm from scratch, it is advisable to test it on simpler environments like CartPole (discrete) or Pendulum (continuous action space) before moving on to complex environments like BipedalWalker or MuJoCo.
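For such a sanity check, a minimal loop like the one below is enough (a sketch assuming the Gymnasium API; `agent_act` is a hypothetical hook for your own policy, defaulting to random actions):

```python
import gymnasium as gym

def smoke_test(env_id="CartPole-v1", episodes=5, agent_act=None):
    """Run a few episodes and print the returns as a quick sanity check."""
    env = gym.make(env_id)
    for ep in range(episodes):
        obs, _ = env.reset(seed=ep)
        done, ep_return = False, 0.0
        while not done:
            # Plug in your own agent here; random actions are just a placeholder.
            action = env.action_space.sample() if agent_act is None else agent_act(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += reward
            done = terminated or truncated
        print(f"{env_id} episode {ep}: return = {ep_return:.1f}")

smoke_test("CartPole-v1")   # discrete action space
smoke_test("Pendulum-v1")   # continuous action space
```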
This video by John Schulman (OpenAI) and the corresponding slides are very helpful: video{:target="_blank"}, slides{:target="_blank"}
Referring to pre-tuned hyperparameters is also very helpful to get an idea of good values.
These two links have a good collection of tuned hyperparameters: link 1, link 2
lr, entropy_coeff (1e-2 - 1e-5; larger values promote exploration). A configuration sketch follows this list.
epsilon (1e-1 - 1e-7), if using the RMSprop optimizer as in the paper.
Can take advantage of multiple CPUs.
Generally used for discrete action spaces; continuous action spaces, where supported, may still throw errors.
Performance increases with the number of workers, since more workers means more exploration.
For best performance, use more than 32 workers.
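As a concrete illustration of these knobs, here is a sketch using Stable-Baselines3's A2C; the choice of library and the exact values are assumptions for illustration, not a prescription:

```python
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

# Many parallel workers -> more exploration (16 here is illustrative).
env = make_vec_env("CartPole-v1", n_envs=16)

model = A2C(
    "MlpPolicy",
    env,
    learning_rate=5e-4,   # the generic starting point mentioned above
    ent_coef=1e-2,        # entropy_coeff: larger values promote exploration
    rms_prop_eps=1e-5,    # epsilon of the RMSprop optimizer
    verbose=1,
)
model.learn(total_timesteps=100_000)
```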
Deep Deterministic Policy Gradients (DDPG): Off-Policy
tau: Soft update coefficient for the target network (0.001 - 0.01; start with 0.001). A configuration sketch follows these notes.
learning_starts: number of steps sampled with a random policy before learning starts; this favors exploration (1,000 - 20,000 time steps; use higher values for complex envs).
Only for continuous action spaces.
Parameter tuning can be difficult. Try using TD3 first.
The critic is a Q-network and the actor is a deterministic policy network.
The actor outputs an action directly instead of a probability distribution, hence the name deterministic.
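A hedged configuration sketch for these knobs, again assuming Stable-Baselines3 (parameter values are illustrative starting points):

```python
import numpy as np
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise

n_actions = 1  # Pendulum has a single action dimension

model = DDPG(
    "MlpPolicy",
    "Pendulum-v1",              # continuous action space only
    tau=0.001,                  # soft update of the target networks
    learning_starts=10_000,     # random-policy steps before learning begins
    action_noise=NormalActionNoise(np.zeros(n_actions), 0.1 * np.ones(n_actions)),
    verbose=1,
)
model.learn(total_timesteps=50_000)
```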
Twin Delayed DDPG (TD3): Off-Policy
tau: Soft update coefficient for the target network (0.001 - 0.01; start with 0.001)
learning_starts: similar to DDPG (1,000 - 20,000 time steps; use higher values for complex envs). A configuration sketch follows these notes.
Similar to DDPG, with a few upgrades.
Only for continuous action spaces.
The critic is a twin Q-network (two Q-functions) and the actor is a deterministic policy network.
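A similar sketch for TD3, again assuming Stable-Baselines3 (values are illustrative starting points):

```python
from stable_baselines3 import TD3

model = TD3(
    "MlpPolicy",
    "Pendulum-v1",              # continuous action space only
    tau=0.001,                  # soft update of the target networks
    learning_starts=10_000,     # random-policy steps before learning begins
    policy_delay=2,             # delayed policy updates (one of TD3's upgrades)
    target_policy_noise=0.2,    # smoothing noise added to the target action
    verbose=1,
)
model.learn(total_timesteps=50_000)
```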