Running Experiments
One of the best ways to get a feel for deep RL is to run the algorithms and see how they perform on different tasks. The SLC library makes small-scale (local) experiments easy to do, and in this section, we’ll discuss two ways to run them: either from the command line or through function calls in scripts.
Launching from the Command Line
Important
To run the examples in this section, you need to install the Gymnasium Mujoco environments package, including all its necessary dependencies. To do so, execute the following command:
pip install stable-learning-control[mujoco]
For more detailed information about the Gymnasium Mujoco environments package, please consult the documentation available here.
SLC ships with a convenient command line interface (CLI) that lets you quickly launch any algorithm (with any choices of hyperparameters) from the command line. It also serves as a thin wrapper over the utilities for watching/evaluating the trained policies and plotting. However, that functionality is not discussed on this page (for those details, see the pages on experiment outputs, robustness evaluation and Plotting Results).
The standard way to run an SLC algorithm from the command line is
python -m stable_learning_control.run [algo name] [experiment flags]
For example:
python -m stable_learning_control.run sac --env Walker2d-v4 --exp_name walker
You Should Know
If you are using ZShell: ZShell interprets square brackets as special characters. SLC uses square brackets in a few ways for command-line arguments; make sure to escape them or try the solution recommended here if you want to escape them by default.
Detailed Quickstart Guide
python -m stable_learning_control.run sac --exp_name sac_ant --env Ant-v4 --clip_ratio 0.1 0.2
--hid[h] [32,32] [64,32] --act torch.nn.Tanh --seed 0 10 20 --dt
--data_dir path/to/data
runs the SAC algorithm in the Ant-v4 gymnasium environment, with various settings controlled by the flags.

By default, the PyTorch version will run. You can, however, substitute sac with sac_tf2 for the TensorFlow version.
clip_ratio, hid, and act are flags to set some algorithm hyperparameters. You can provide multiple values for hyperparameters to run multiple experiments. Check the docs to see what hyperparameters you can set (click here for the SAC documentation).
hid and act are special shortcut flags for setting the hidden sizes and activation function for the neural networks trained by the algorithm.
The seed flag sets the seed for the random number generator. RL algorithms have high variance, so try multiple seeds to get a feel for how performance varies.
The dt flag ensures that the save directory names will have timestamps in them (otherwise, they don’t, unless you set FORCE_DATESTAMP=True in stable_learning_control.user_config).
The data_dir flag allows you to set the save folder for results. The default value is set by DEFAULT_DATA_DIR in stable_learning_control.user_config, which will be a subfolder data in the stable_learning_control folder (unless you change it).
The save directory names are based on exp_name and any flags which have multiple values. Instead of the full flag, a shorthand will appear in the directory name. Shorthands can be provided by the user in square brackets after the flag, like --hid[h]; otherwise, shorthands are substrings of the flag (clip_ratio becomes cli). To illustrate, the save directory for the run with clip_ratio=0.1, hid=[32,32], and seed=10 will be:
path/to/data/YY-MM-DD_sac_ant_cli0-1_h32-32/YY-MM-DD_HH-MM-SS-sac_ant_cli0-1_h32-32_seed10
Choosing PyTorch or TensorFlow from the Command Line
To use a PyTorch version of an algorithm, run with
python -m stable_learning_control.run [algo]_pytorch
To use a TensorFlow version of an algorithm, run with
python -m stable_learning_control.run [algo]_tf2
If you run python -m stable_learning_control.run [algo] without _pytorch or _tf2, the runner will look in stable_learning_control/user_config.py to determine which version it should default to for that algorithm.
Attention
The TensorFlow version is still experimental. It is not guaranteed to work, and it is not guaranteed to be up-to-date with the PyTorch version.
Setting Hyperparameters from the Command Line
Every hyperparameter in every algorithm can be controlled directly from the command line. If kwarg is a valid keyword arg for the function call of an algorithm, you can set values for it with the flag --kwarg.
To find out what keyword args are available, see the docs page for an algorithm or the API reference, or try
python -m stable_learning_control.run [algo name] --help
to see a readout of the docstring.
You Should Know
Values pass through safer_eval() before being used so that you can describe some functions and objects directly from the command line.
For example:
python -m stable_learning_control.run sac --env Walker2d-v4 --exp_name walker --act torch.nn.ReLU
sets torch.nn.ReLU as the activation function. (TensorFlow equivalent: run sac_tf2 with --act tf.nn.relu.)
You Should Know
There’s some excellent handling for kwargs that take dict values. Instead of having to provide
--key dict(v1=value_1, v2=value_2)
you can give
--key:v1 value_1 --key:v2 value_2
to get the same result.
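For instance, a purely illustrative command that sets nested actor-critic kwargs this way (using the ac_kwargs argument that every SLC algorithm accepts) would be:
python -m stable_learning_control.run sac --env Walker2d-v4 --exp_name walker --ac_kwargs:hidden_sizes [64,64] --ac_kwargs:activation torch.nn.ReLU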
Launching Multiple Experiments at Once
You can launch multiple experiments, to be executed in series, by simply providing more than one value for a given argument. (An experiment for each possible combination of values will be launched.)
For example, to launch otherwise-equivalent runs with different random seeds (0, 10, and 20), do:
python -m stable_learning_control.run sac --env Walker2d-v4 --exp_name walker --seed 0 10 20
Experiments don’t launch in parallel because they soak up enough resources that executing several simultaneously wouldn’t get a speedup.
Special Flags
A few flags receive special treatment.
Environment Flags
- --env, --env_name (str). The name of an environment in gymnasium. All SLC algorithms are implemented as functions that accept env_fn as an argument, where env_fn must be a callable function that builds a copy of the RL environment. Since the most common use case is gymnasium environments, though, all of which are built through gym.make(env_name), we allow you to specify env_name (or env for short) at the command line, which gets converted to a lambda-function that builds the correct gymnasium environment. You can prefix the environment name with a module name, separated by a colon, to specify a custom gymnasium environment (e.g. --env stable_gym:Oscillator-v1).
- --env_k, --env_kwargs (object). Additional keyword arguments you want to pass to the gym environment. If you, for example, want to change the forward reward weight and healthy reward of the Walker2d-v4 environment, you can do so by passing --env_kwargs "{'forward_reward_weight': 0.5, 'healthy_reward': 0.5}" to the run command (a full example is shown below).
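For instance, putting this together into a complete call (an illustrative sketch that reuses the env_kwargs value quoted above):
python -m stable_learning_control.run sac --env Walker2d-v4 --exp_name walker --env_kwargs "{'forward_reward_weight': 0.5, 'healthy_reward': 0.5}"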
Algorithm Flags
General Flags
- --save_cps, --save_checkpoints (bool, default: False). Only the most recent state of the agent and environment is saved by default. When the --save_checkpoints flag is supplied, a snapshot (checkpoint) of the agent and environment will be saved at each epoch. These snapshots are saved in a checkpoints folder inside the Logger output directory (for more information, see Saving and Loading Experiment Outputs). An example invocation is sketched below.
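A hypothetical invocation that enables checkpoint saving could look as follows (depending on how the boolean flag is parsed, you may need to pass an explicit True value):
python -m stable_learning_control.run sac --env Walker2d-v4 --exp_name walker --save_checkpoints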
Shortcut Flags
Some algorithm arguments are relatively long, so shortcuts are available for them:
- --hid, --ac_kwargs:hidden_sizes (list of ints). Sets the sizes of the hidden layers in the neural networks of both the actor and critic.
- --hid_a, --ac_kwargs:hidden_sizes:actor (list of ints). Sets the sizes of the hidden layers in the neural networks of the actor.
- --hid_c, --ac_kwargs:hidden_sizes:critic (list of ints). Sets the sizes of the hidden layers in the neural networks of the critic.
- --act, --ac_kwargs:activation (torch.nn or tf.nn). The activation function for the neural networks in the actor and critic.
- --act_out, --ac_kwargs:output_activation (torch.nn or tf.nn). The output activation function for the neural networks in the actor and critic.
- --act_a, --ac_kwargs:activation:actor (torch.nn or tf.nn). The activation function for the neural networks in the actor.
- --act_c, --ac_kwargs:activation:critic (torch.nn or tf.nn). The activation function for the neural networks in the critic.
- --act_out_a, --ac_kwargs:output_activation:actor (torch.nn or tf.nn). The output activation function of the actor.
- --act_out_c, --ac_kwargs:output_activation:critic (torch.nn or tf.nn). The output activation function of the critic.
These flags are valid for all current SLC algorithms.
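As an illustration of how these shortcuts combine, the following sketch gives the actor and critic different architectures and activation functions (the specific values are arbitrary):
python -m stable_learning_control.run sac --env Walker2d-v4 --exp_name walker --hid_a [64,64] --hid_c [128,128] --act_a torch.nn.ReLU --act_c torch.nn.Tanh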
Config Flags
These flags are not hyperparameters of any algorithm but change the experimental configuration in some way.
- --cpu, --num_cpu (int). If this flag is set, the experiment is launched with this many processes, one per CPU, connected by MPI. Some algorithms are amenable to this sort of parallelization, but not all. If you try setting num_cpu > 1 for an incompatible algorithm, an error will be raised. You can also set --num_cpu auto, which will automatically use as many CPUs as are available on the machine.
- --exp_name (str). The experiment name. This is used in naming the save directory for each experiment. The default is “cmd” + [algo name].
- --data_dir (path str). Set the base save directory for this experiment or set of experiments. If none is given, the DEFAULT_DATA_DIR in stable_learning_control/user_config.py will be used.
Logger Flags
The CLI also contains several (shortcut) flags that can be used to change the behaviour of the stable_learning_control.utils.log_utils.logx.EpochLogger.
- --tb_log_freq, --logger_kwargs:tb_log_freq (str, default='low'). The TensorBoard log frequency. Options are low (recommended: logs at every epoch) and high (logs at every SGD update batch). Defaults to low since this is less resource intensive.
- --wandb_job_type, --logger_kwargs:wandb_job_type (str, default='train'). The Weights & Biases job type.
- --wandb_project, --logger_kwargs:wandb_project (str, default='stable_learning_control'). The Weights & Biases project name.
- --verbose_fmt, --logger_kwargs:verbose_fmt (str, default='line'). The format in which the diagnostics are displayed to the terminal when quiet is False. Options are table, which supplies them as a table, and line, which prints them on one line.
- --verbose_vars, --logger_kwargs:verbose_vars (list, default=None). A list of variables you want to log to the stdout when quiet is False. The default None means that all variables are logged.
Important
The verbose_vars list should be supplied as a list that can be evaluated in Python (e.g. --verbose_vars ["Lr_a", "Lr_c"]).
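Combining these logger flags, an illustrative run that keeps the TensorBoard log frequency at low, prints diagnostics as a table, and only prints the actor and critic learning rates could look like:
python -m stable_learning_control.run sac --env Walker2d-v4 --exp_name walker --tb_log_freq low --verbose_fmt table --verbose_vars ["Lr_a", "Lr_c"]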
Using experimental configuration files (yaml)
The SLC CLI comes with a handy configuration file loader that can be used to load YAML configuration files. These configuration files provide a convenient way to store your experiments’ hyperparameters so that results can be reproduced. You can supply the CLI with an experiment configuration file using the --exp_cfg flag.
- --exp_cfg (path str). Sets the path to the yml config file used for loading experiment hyperparameters.
For example, we can use the following command to train a SAC algorithm using the original hyperparameters used by Haarnoja et al., 2019.
python -m stable_learning_control.run --exp_cfg ./experiments/haarnoja_et_al_2019.yml
Important
Please note that if you want to run multiple hyperparameter variants, for example, multiple seeds or learning rates, you have to use comma/space-separated strings in your configuration file:
alg_name: lac
exp_name: my_experiment
seed: 0 12345 342699
ac_kwargs:
  hidden_sizes:
    actor: [64, 64]
    critic: [256, 256, 16]
lr_a: "1e-4, 1e-3, 1e-2"
Additionally, if you want to specify an on/off flag, you can supply an empty key (see the illustrative snippet below).
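For example, a hypothetical entry that switches on the checkpoint saving behaviour described earlier would simply leave the value empty:
save_checkpoints: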
Where Results are Saved
Results for a particular experiment (a single run of a configuration of hyperparameters) are stored in
data_dir/[outer_prefix]exp_name[suffix]/[inner_prefix]exp_name[suffix]_s[seed]
where

- data_dir is the value of the --data_dir flag (defaults to DEFAULT_DATA_DIR from stable_learning_control/user_config.py if --data_dir is not given),
- outer_prefix is a YY-MM-DD_ timestamp if the --datestamp flag is raised, otherwise nothing,
- inner_prefix is a YY-MM-DD_HH-MM-SS- timestamp if the --datestamp flag is raised, otherwise nothing, and
- suffix is a special string based on the experiment hyperparameters.
How is Suffix Determined?
Suffixes are only included if you run multiple experiments at once, and they only have references to hyperparameters that differ across experiments, except for the random seed. The goal is to ensure that results for similar experiments (ones that share all parameters except the seed) are grouped in the same folder.
Suffixes are constructed by combining shorthands for hyperparameters with their values, where a shorthand is either 1) constructed automatically from the hyperparameter name or 2) supplied by the user. The user can write a shorthand in square brackets after the kwarg flag.
For example, consider:
python -m stable_learning_control.run sac_tf2 --env Hopper-v4 --hid[h] [300] [128,128] --act tf.nn.tanh tf.nn.relu
Here, the --hid flag is given a user-supplied shorthand, h. The user does not provide the --act flag with a shorthand, so one will be constructed for it automatically.
The suffixes produced in this case are:
_h128-128_ac-actrelu
_h128-128_ac-acttanh
_h300_ac-actrelu
_h300_ac-acttanh
Note that the h was given by the user; the ac-act shorthand was constructed from ac_kwargs:activation (the true name for the act flag).
Extra
You Don’t Actually Need to Know This One
Each individual algorithm is located in a file stable_learning_control/algos/BACKEND/ALGO_NAME/ALGO_NAME.py, and these files can be run directly from the command line with a limited set of arguments (some of which differ from what’s available to stable_learning_control/run.py). However, the command line support in the individual algorithm files is vestigial, and this is not a recommended way to perform experiments. This documentation page will not describe those command line calls and will only describe calls through stable_learning_control/run.py.
Use transfer learning
The start_policy command-line flag allows you to use an already trained algorithm as the starting point for your new algorithm:
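For example (an illustrative sketch; the path is a placeholder that should point at the output of a previously trained policy, in whatever form the flag expects):
python -m stable_learning_control.run sac --env Walker2d-v4 --exp_name walker_transfer --start_policy path/to/previous/run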
Using custom environments
The SLC package can be used with any Gymnasium-based environment. To use a custom environment, you need to ensure it inherits from the gym.Env class and implements the following methods (a minimal sketch is given below the list):
- reset(self): Reset the environment’s state. Returns observation, info.
- step(self, action): Step the environment by one timestep. Returns observation, reward, terminated, truncated, info.
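A minimal sketch of what such an environment might look like is shown below; the class name, observation/action spaces, and dynamics are purely illustrative:
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class CustomEnv(gym.Env):
    """Toy environment that satisfies the reset/step contract."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self._state = np.zeros(3, dtype=np.float32)
        self._steps = 0

    def reset(self, seed=None, options=None):
        # Reset the environment's state and return (observation, info).
        super().reset(seed=seed)
        self._state = np.zeros(3, dtype=np.float32)
        self._steps = 0
        return self._state, {}

    def step(self, action):
        # Advance one timestep and return (observation, reward, terminated, truncated, info).
        self._steps += 1
        self._state = np.clip(self._state + float(action[0]), -1.0, 1.0).astype(np.float32)
        reward = -float(np.linalg.norm(self._state))
        terminated = False
        truncated = self._steps >= 200  # Truncate after a fixed episode length.
        return self._state, reward, terminated, truncated, {}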
Additionally, you must ensure that your environment is registered in the Gymnasium registry. This can be done by adding the following lines to your environment file:
import gymnasium as gym
from gymnasium.envs.registration import register
register(
    id='CustomEnv-v1',
    entry_point='path.to.your.env:CustomEnv',
)
After these requirements are met, you can use it with the SLC package by passing the environment
name to the --env
command-line flag. For example, if your environment is called CustomEnv
and is located in
the file custom_env_module.py
, you can run the SLC package with your environment by running:
python -m stable_learning_control.run sac --env custom_env_module:CustomEnv-v1
Launching from Scripts
Each algorithm is implemented as a Python function, which can be imported directly from the stable_learning_control package, e.g.
>>> from stable_learning_control.control import sac_pytorch as sac
See the documentation page for each algorithm for a complete account of possible arguments. These methods can be used to set up specialized custom experiments, for example:
from stable_learning_control.control import sac_tf2 as sac
import tensorflow as tf
import gymnasium as gym
env_fn = lambda : gym.make('LunarLander-v2')
ac_kwargs = dict(hidden_sizes=[64,64], activation=tf.nn.relu)
logger_kwargs = dict(output_dir='path/to/output_dir', exp_name='experiment_name')
sac(env_fn=env_fn, ac_kwargs=ac_kwargs, steps_per_epoch=5000, epochs=250, logger_kwargs=logger_kwargs)
Using ExperimentGrid
An easy way to find good hyperparameters is to run the same algorithm with many possible hyperparameters. SLC ships with a simple tool for facilitating this, called ExperimentGrid.
Consider the example in stable_learning_control/examples/pytorch/sac_exp_grid_search.py
:
import argparse

import torch

# Import the RL agent you want to perform the grid search for.
from stable_learning_control.algos.pytorch.sac import sac
from stable_learning_control.utils.run_utils import ExperimentGrid

# Script parameters.
ENV_NAME = (
    "stable_gym:Oscillator-v1"  # The environment on which you want to train the agent.
)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--cpu", type=int, default=5)
    parser.add_argument("--num_runs", type=int, default=1)
    args = parser.parse_args()

    # Setup Grid search parameters.
    # NOTE: Here you can add the algorithm parameters you want using their name.
    eg = ExperimentGrid(name="sac-grid-search")
    eg.add("env_name", "stable_gym:Oscillator-v1", "", True)
    eg.add("seed", [10 * i for i in range(args.num_runs)])
    eg.add("epochs", 100)
    eg.add("steps_per_epoch", 4000)
    eg.add("ac_kwargs:hidden_sizes", [(32,), (64, 64)], "hid")
    eg.add("ac_kwargs:activation", [torch.nn.ReLU, torch.nn.ReLU], "")

    # Run the grid search.
    eg.run(sac, num_cpu=args.cpu)
After making the ExperimentGrid object, parameters are added to it with
eg.add(param_name, values, shorthand, in_name)
where in_name forces a parameter to appear in the experiment name, even if it has the same value across all experiments.

After all parameters have been added,

eg.run(thunk, **run_kwargs)

runs all experiments in the grid (one experiment per valid configuration), by providing the configurations as kwargs to the function thunk. ExperimentGrid.run uses a function named call_experiment to launch thunk, and **run_kwargs specify behaviors for call_experiment. See the documentation page for details.
Except for the absence of shortcut kwargs (you can’t use hid for ac_kwargs:hidden_sizes in ExperimentGrid), the basic behaviour of ExperimentGrid is the same as running things from the command line. (In fact, stable_learning_control.run uses an ExperimentGrid under the hood.)
Note
An equivalent TensorFlow example is available in stable_learning_control/examples/tf2/sac_exp_grid_search.py.