stable_learning_control
Module that Initialises the stable_learning_control package.
Subpackages
Submodules
Attributes
Functions
- lac_pytorch – Trains the LAC algorithm in a given environment.
- latc_pytorch – Trains the LATC algorithm in a given environment.
- sac_pytorch – Trains the SAC algorithm in a given environment.
- Checks if TensorFlow is installed.
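A minimal import sketch, assuming the trainers are exposed at the package top level as the Package Contents listing below indicates:

```python
# Assumed top-level import path; the functions are documented below as
# stable_learning_control.lac_pytorch, latc_pytorch and sac_pytorch.
from stable_learning_control import lac_pytorch, latc_pytorch, sac_pytorch
```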
Package Contents
- stable_learning_control.lac_pytorch(env_fn, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation=nn.ReLU, output_activation=nn.ReLU), opt_type='minimize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, alpha3=0.2, labda=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_labda=0.0003, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_labda_final=1e-10, lr_decay_type=DEFAULT_DECAY_TYPE, lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_labda_decay_type=None, lr_decay_ref=DEFAULT_DECAY_REFERENCE, batch_size=256, replay_size=int(1000000.0), horizon_length=0, seed=None, device='cpu', logger_kwargs=dict(), save_freq=1, start_policy=None, export=False)
Trains the LAC algorithm in a given environment.
- Parameters:
env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional) – The constructor method for a Torch Module with an `act` method, a `pi` module and several `Q` or `L` modules. The `act` method and `pi` module should accept batches of observations as inputs, and the `Q*` and `L` modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

| Call | Output Shape | Description |
| --- | --- | --- |
| `act` | (batch, act_dim) | Numpy array of actions for each observation. |
| `Q*`/`L` | (batch,) | Tensor containing one current estimate of `Q*`/`L` for the provided observations and actions. (Critical: make sure to flatten this!) |

Calling `pi` should return:

| Symbol | Shape | Description |
| --- | --- | --- |
| `a` | (batch, act_dim) | Tensor containing actions from policy given observations. |
| `logp_pi` | (batch,) | Tensor containing log probabilities of actions in `a`. Importantly: gradients should be able to flow back into `a`. |

Defaults to `LyapunovActorCritic`.
ac_kwargs (dict, optional) – Any kwargs appropriate for the ActorCritic object you provided to LAC. Defaults to:

| Kwarg | Value |
| --- | --- |
| `hidden_sizes_actor` | 256 x 2 |
| `hidden_sizes_critic` | 256 x 2 |
| `activation` | nn.ReLU |
| `output_activation` | nn.ReLU |

opt_type (str, optional) – The optimization type you want to use. Options are `maximize` and `minimize`. Defaults to `minimize`.
max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.
epochs (int, optional) – Number of epochs to run and train the agent for. Defaults to 100.
steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) between the agent and the environment in each epoch. Defaults to 2048.
start_steps (int, optional) – Number of steps of uniform-random action selection before running the real policy. Helps exploration. Defaults to 0.
update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to 100.
update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures the replay buffer is full enough for useful updates. Defaults to 1000.
steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. update_every/steps_per_update). Defaults to 100.
num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to 10.
alpha (float, optional) – Entropy regularization coefficient (equivalent to the inverse of the reward scale in the original SAC paper). Defaults to 0.99.
alpha3 (float, optional) – The Lyapunov constraint error boundary. Defaults to 0.2.
labda (float, optional) – The Lyapunov Lagrange multiplier. Defaults to 0.99.
gamma (float, optional) – Discount factor (always between 0 and 1). Defaults to 0.99.
polyak (float, optional) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to $\theta_{\text{targ}} \leftarrow \rho\,\theta_{\text{targ}} + (1 - \rho)\,\theta$, where $\rho$ is polyak (always between 0 and 1, usually close to 1). In some papers $\rho$ is defined as $(1 - \tau)$, where $\tau$ is the soft replacement factor. Defaults to 0.995.
target_entropy (float, optional) – Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space, which corresponds to the heuristic $-\dim(\mathcal{A})$ (the negative of the number of action dimensions).
adaptive_temperature (bool, optional) – Enables automatic entropy temperature adjustment (Automating Entropy Adjustment for Maximum Entropy RL). Defaults to True.
lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.
lr_c (float, optional) – Learning rate used for the (Lyapunov) critic. Defaults to 3e-4.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.
lr_labda (float, optional) – Learning rate used for the Lyapunov Lagrange multiplier. Defaults to 3e-4.
lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_labda_final (float, optional) – The final labda learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_decay_type (str, optional) – The learning rate decay type that is used (options are: `linear`, `exponential` and `constant`). Defaults to `linear`. Can be overridden by the specific learning rate decay types below.
lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_labda_decay_type (str, optional) – The learning rate decay type that is used for the labda learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: `epoch` and `step`). Defaults to `epoch`.
batch_size (int, optional) – Minibatch size for SGD. Defaults to 256.
replay_size (int, optional) – Maximum length of the replay buffer. Defaults to 1e6.
horizon_length (int, optional) – The length of the finite horizon used for the Lyapunov critic target. Defaults to 0, meaning the infinite-horizon Bellman backup is used.
seed (int) – Seed for the random number generators. Defaults to None.
device (str, optional) – The device the networks are placed on (options: `cpu`, `gpu`, `gpu:0`, `gpu:1`, etc.). Defaults to `cpu`.
logger_kwargs (dict, optional) – Keyword args for the EpochLogger.
save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.
start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.
export (bool) – Whether you want to export the model as a `TorchScript` so that it can be deployed on hardware. Defaults to False.
- Returns:
tuple containing:
policy (LAC): The trained actor-critic policy.
replay_buffer (Union[ReplayBuffer, FiniteHorizonReplayBuffer]): The replay buffer used during training.
- Return type:
(tuple)
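As a usage illustration, a minimal sketch of a lac_pytorch training run is shown below. The environment id (Pendulum-v1), the hyperparameter values and the logger keyword names are illustrative assumptions; only the function signature and return values come from the entry above.

```python
# Minimal LAC training sketch (illustrative settings, not tuned hyperparameters).
import gymnasium as gym

from stable_learning_control import lac_pytorch  # assumed top-level import path


def env_fn():
    """Factory that returns a fresh copy of a Gymnasium-compliant environment."""
    return gym.make("Pendulum-v1")  # illustrative environment


policy, replay_buffer = lac_pytorch(
    env_fn,
    opt_type="minimize",   # LAC minimises the (cost) return by default
    epochs=10,             # short run for illustration; defaults to 100
    steps_per_epoch=2048,
    batch_size=256,
    device="cpu",
    seed=0,
    logger_kwargs=dict(output_dir="./data/lac_pendulum"),  # assumed EpochLogger kwarg
)
```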
- stable_learning_control.latc_pytorch(env_fn, actor_critic=None, *args, **kwargs)
Trains the LATC algorithm in a given environment.
- Parameters:
env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional) – The constructor method for a Torch Module with an `act` method, a `pi` module and several `Q` or `L` modules. The `act` method and `pi` module should accept batches of observations as inputs, and the `Q*` and `L` modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

| Call | Output Shape | Description |
| --- | --- | --- |
| `act` | (batch, act_dim) | Numpy array of actions for each observation. |
| `Q*`/`L` | (batch,) | Tensor containing one current estimate of `Q*`/`L` for the provided observations and actions. (Critical: make sure to flatten this!) |

Calling `pi` should return:

| Symbol | Shape | Description |
| --- | --- | --- |
| `a` | (batch, act_dim) | Tensor containing actions from policy given observations. |
| `logp_pi` | (batch,) | Tensor containing log probabilities of actions in `a`. Importantly: gradients should be able to flow back into `a`. |

Defaults to `LyapunovActorTwinCritic`.
*args – The positional arguments to pass to the `lac()` method.
**kwargs – The keyword arguments to pass to the `lac()` method.
Note
Wraps the `lac()` function so that the `LyapunovActorTwinCritic` architecture is used as the actor critic.
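Because latc_pytorch forwards its positional and keyword arguments to lac(), it is invoked in the same way; the sketch below reuses the illustrative assumptions from the LAC example above (environment and hyperparameter choices are not recommendations).

```python
# Minimal LATC training sketch; extra keyword arguments are forwarded to lac().
import gymnasium as gym

from stable_learning_control import latc_pytorch  # assumed top-level import path

policy, replay_buffer = latc_pytorch(
    lambda: gym.make("Pendulum-v1"),  # illustrative environment factory
    epochs=10,                        # forwarded to lac()
    lr_a=1e-4,                        # actor learning rate (documented default)
    lr_c=3e-4,                        # Lyapunov critic learning rate (documented default)
    seed=0,
)
```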
- stable_learning_control.sac_pytorch(env_fn, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation={'actor': nn.ReLU, 'critic': nn.ReLU}, output_activation={'actor': nn.ReLU, 'critic': nn.Identity}), opt_type='maximize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_decay_type=DEFAULT_DECAY_TYPE, lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_decay_ref=DEFAULT_DECAY_REFERENCE, batch_size=256, replay_size=int(1000000.0), seed=None, device='cpu', logger_kwargs=dict(), save_freq=1, start_policy=None, export=False)
Trains the SAC algorithm in a given environment.
- Parameters:
env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional) – The constructor method for a Torch Module with an `act` method, a `pi` module and several `Q` or `L` modules. The `act` method and `pi` module should accept batches of observations as inputs, and the `Q*` and `L` modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

| Call | Output Shape | Description |
| --- | --- | --- |
| `act` | (batch, act_dim) | Numpy array of actions for each observation. |
| `Q*`/`L` | (batch,) | Tensor containing one current estimate of `Q*`/`L` for the provided observations and actions. (Critical: make sure to flatten this!) |

Calling `pi` should return:

| Symbol | Shape | Description |
| --- | --- | --- |
| `a` | (batch, act_dim) | Tensor containing actions from policy given observations. |
| `logp_pi` | (batch,) | Tensor containing log probabilities of actions in `a`. Importantly: gradients should be able to flow back into `a`. |

Defaults to `SoftActorCritic`.
ac_kwargs (dict, optional) – Any kwargs appropriate for the ActorCritic object you provided to SAC. Defaults to:

| Kwarg | Value |
| --- | --- |
| `hidden_sizes_actor` | 256 x 2 |
| `hidden_sizes_critic` | 256 x 2 |
| `activation` | nn.ReLU (actor and critic) |
| `output_activation` | nn.ReLU (actor), nn.Identity (critic) |

opt_type (str, optional) – The optimization type you want to use. Options are `maximize` and `minimize`. Defaults to `maximize`.
max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.
epochs (int, optional) – Number of epochs to run and train the agent for. Defaults to 100.
steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) between the agent and the environment in each epoch. Defaults to 2048.
start_steps (int, optional) – Number of steps of uniform-random action selection before running the real policy. Helps exploration. Defaults to 0.
update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to 100.
update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures the replay buffer is full enough for useful updates. Defaults to 1000.
steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. update_every/steps_per_update). Defaults to 100.
num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to 10.
alpha (float, optional) – Entropy regularization coefficient (equivalent to the inverse of the reward scale in the original SAC paper). Defaults to 0.99.
gamma (float, optional) – Discount factor (always between 0 and 1). Defaults to 0.99.
polyak (float, optional) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to $\theta_{\text{targ}} \leftarrow \rho\,\theta_{\text{targ}} + (1 - \rho)\,\theta$, where $\rho$ is polyak (always between 0 and 1, usually close to 1). In some papers $\rho$ is defined as $(1 - \tau)$, where $\tau$ is the soft replacement factor. Defaults to 0.995.
target_entropy (float, optional) – Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space, which corresponds to the heuristic $-\dim(\mathcal{A})$ (the negative of the number of action dimensions).
adaptive_temperature (bool, optional) – Enables automatic entropy temperature adjustment (Automating Entropy Adjustment for Maximum Entropy RL). Defaults to True.
lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.
lr_c (float, optional) – Learning rate used for the (soft) critic. Defaults to 3e-4.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.
lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to 1e-10.
lr_decay_type (str, optional) – The learning rate decay type that is used (options are: `linear`, `exponential` and `constant`). Defaults to `linear`. Can be overridden by the specific learning rate decay types below.
lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: `linear`, `exponential` and `constant`). If not specified, the general learning rate decay type is used.
lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: `epoch` and `step`). Defaults to `epoch`.
batch_size (int, optional) – Minibatch size for SGD. Defaults to 256.
replay_size (int, optional) – Maximum length of the replay buffer. Defaults to 1e6.
seed (int) – Seed for the random number generators. Defaults to None.
device (str, optional) – The device the networks are placed on (options: `cpu`, `gpu`, `gpu:0`, `gpu:1`, etc.). Defaults to `cpu`.
logger_kwargs (dict, optional) – Keyword args for the EpochLogger.
save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.
start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.
export (bool) – Whether you want to export the model as a `TorchScript` so that it can be deployed on hardware. Defaults to False.
- Returns:
tuple containing:
policy (SAC): The trained actor-critic policy.
replay_buffer (Union[ReplayBuffer, FiniteHorizonReplayBuffer]): The replay buffer used during training.
- Return type:
(tuple)
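Analogously, a minimal sac_pytorch sketch is given below; the environment and hyperparameter choices are again illustrative assumptions rather than recommended settings.

```python
# Minimal SAC training sketch (illustrative settings, not tuned hyperparameters).
import gymnasium as gym

from stable_learning_control import sac_pytorch  # assumed top-level import path

policy, replay_buffer = sac_pytorch(
    lambda: gym.make("Pendulum-v1"),  # illustrative environment factory
    opt_type="maximize",  # SAC maximises the expected return by default
    epochs=10,            # short run for illustration; defaults to 100
    gamma=0.99,
    polyak=0.995,         # target update: theta_targ <- polyak * theta_targ + (1 - polyak) * theta
    device="cpu",
    seed=0,
)
```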