stable_learning_control.algos
Contains the PyTorch and TensorFlow RL algorithms.
Warning
Because it is more user-friendly, the PyTorch implementation was ultimately used during the research. As a result, the TensorFlow implementation has not yet been thoroughly tested, and no guarantees can be given about the correctness of those algorithms.
Subpackages
Classes
LAC_pytorch – The Lyapunov (soft) Actor-Critic (LAC) algorithm.
SAC_pytorch – The Soft Actor Critic algorithm.
Functions
Checks if TensorFlow is installed.
Package Contents
- class stable_learning_control.algos.LAC_pytorch(env, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation=nn.ReLU, output_activation={'actor': nn.ReLU}), opt_type='minimize', alpha=0.99, alpha3=0.2, labda=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_labda=0.0003, device='cpu')
Bases:
torch.nn.Module
The Lyapunov (soft) Actor-Critic (LAC) algorithm.
- ac (torch.nn.Module) – The (Lyapunov) actor-critic module.
- ac_ (torch.nn.Module) – The (Lyapunov) target actor-critic module.
- log_alpha (torch.Tensor) – The temperature Lagrange multiplier.
- log_labda (torch.Tensor) – The Lyapunov Lagrange multiplier.
Initialise the LAC algorithm.
- Parameters:
env (gym.Env) – The gymnasium environment the LAC is training in. This is used to retrieve the action and observation space dimensions, which are needed to build the networks. The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional) – The constructor method for a Torch Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

| Call | Output Shape | Description |
| --- | --- | --- |
| act | (batch, act_dim) | Numpy array of actions for each observation. |
| Q*/L | (batch,) | Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!) |

Calling pi should return:

| Symbol | Shape | Description |
| --- | --- | --- |
| a | (batch, act_dim) | Tensor containing actions from the policy given observations. |
| logp_pi | (batch,) | Tensor containing log probabilities of the actions in a. Importantly: gradients should be able to flow back into a. |

Defaults to LyapunovActorCritic.
ac_kwargs (dict, optional) – Any kwargs appropriate for the ActorCritic object you provided to LAC. Defaults to:

| Kwarg | Value |
| --- | --- |
| hidden_sizes_actor | 256 x 2 |
| hidden_sizes_critic | 256 x 2 |
| activation | torch.nn.ReLU |
| output_activation | torch.nn.ReLU |
opt_type (str, optional) – The optimization type you want to use. Options are maximize and minimize. Defaults to minimize.
alpha (float, optional) – Entropy regularization coefficient (equivalent to the inverse of the reward scale in the original SAC paper). Defaults to 0.99.
alpha3 (float, optional) – The Lyapunov constraint error boundary. Defaults to 0.2.
labda (float, optional) – The Lyapunov Lagrange multiplier. Defaults to 0.99.
gamma (float, optional) – Discount factor (always between 0 and 1). Defaults to 0.99 per Haarnoja et al. 2018, not 0.995 as in Han et al. 2020.
polyak (float, optional) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to $\theta_{\text{targ}} \leftarrow \rho\,\theta_{\text{targ}} + (1 - \rho)\,\theta$, where $\rho$ is polyak (always between 0 and 1, usually close to 1). In some papers $\rho$ is defined as $(1 - \tau)$, where $\tau$ is the soft replacement factor. Defaults to 0.995. A small sketch of this update follows the parameter list.
target_entropy (float, optional) – Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space, which can be calculated as $-\prod_i \text{act\_dim}_i$ (the negative product of the action-space dimensions).
adaptive_temperature (bool, optional) – Enables automatic entropy adjustment for maximum entropy RL learning. Defaults to True.
lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.
lr_c (float, optional) – Learning rate used for the (Lyapunov) critic. Defaults to 3e-4.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.
lr_labda (float, optional) – Learning rate used for the Lyapunov Lagrange multiplier. Defaults to 3e-4.
device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.
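The polyak update above can be illustrated with a small PyTorch sketch. The soft_update helper name below is hypothetical and not part of the package API; the class performs this update internally (see _update_targets).

```python
import torch


def soft_update(ac: torch.nn.Module, ac_targ: torch.nn.Module, polyak: float = 0.995):
    """Move the target network parameters towards the main network parameters
    according to: theta_targ <- polyak * theta_targ + (1 - polyak) * theta.
    """
    with torch.no_grad():
        for p, p_targ in zip(ac.parameters(), ac_targ.parameters()):
            p_targ.data.mul_(polyak)
            p_targ.data.add_((1 - polyak) * p.data)
```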
Attention
This class will behave differently when the actor_critic argument is set to LyapunovActorTwinCritic. For more information, see the LATC documentation.
- _setup_kwargs
- _act_dim
- _obs_dim
- _device
- _adaptive_temperature
- _opt_type
- _polyak
- _gamma
- _alpha3
- _lr_a
- _lr_lag
- _lr_c
- _use_twin_critic = False
- log_alpha
- log_labda
- actor_critic
- ac
- ac_targ
- _pi_optimizer
- _pi_params
- _log_labda_optimizer
- _c_optimizer
- _c_params
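A minimal instantiation sketch, assuming a continuous-action gymnasium environment; the Pendulum-v1 id is only an example, and the keyword values shown simply repeat the documented defaults for illustration.

```python
import gymnasium as gym
import torch.nn as nn

from stable_learning_control.algos import LAC_pytorch

env = gym.make("Pendulum-v1")  # Any continuous-action gymnasium environment.

policy = LAC_pytorch(
    env,
    ac_kwargs=dict(
        hidden_sizes={"actor": [256] * 2, "critic": [256] * 2},
        activation=nn.ReLU,
        output_activation={"actor": nn.ReLU},
    ),
    opt_type="minimize",
    gamma=0.99,
    lr_a=1e-4,
    lr_c=3e-4,
    device="cpu",
)
```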
- forward(s, deterministic=False)
Wrapper around the get_action() method that enables users to also receive actions directly by invoking LAC(observations).
- Parameters:
s (numpy.ndarray) – The current state.
deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.
- Returns:
The current action.
- Return type: numpy.ndarray
- get_action(s, deterministic=False)
Returns the current action of the policy.
- Parameters:
s (numpy.ndarray) – The current state.
deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.
- Returns:
The current action.
- Return type: numpy.ndarray
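A short usage sketch, assuming policy and env are the objects from the instantiation example above.

```python
# Reset the environment and query the policy for an action.
obs, _ = env.reset()

action = policy.get_action(obs)                            # Stochastic action (exploration).
eval_action = policy.get_action(obs, deterministic=True)   # Deterministic action (evaluation).

obs, reward, terminated, truncated, info = env.step(action)
```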
- update(data)
Update the actor critic network using stochastic gradient descent.
- Parameters:
data (dict) – Dictionary containing a batch of experiences.
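A hedged sketch of calling update(); the dictionary keys and tensor shapes below are assumptions for illustration only, since in practice the batch is produced by the package's replay buffer.

```python
import torch

batch_size, obs_dim, act_dim = 256, 3, 1  # Illustrative sizes only.
data = {
    "obs": torch.randn(batch_size, obs_dim),       # Observations (assumed key).
    "act": torch.randn(batch_size, act_dim),       # Actions taken (assumed key).
    "rew": torch.randn(batch_size),                # Rewards received (assumed key).
    "obs_next": torch.randn(batch_size, obs_dim),  # Next observations (assumed key).
    "done": torch.zeros(batch_size),               # Episode-termination flags (assumed key).
}

# One gradient step on the actor and critic (and, when adaptive, the Lagrange multipliers).
policy.update(data)
```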
- save(path)
Can be used to save the current model state.
- restore(path, restore_lagrance_multipliers=False)
Restores an already trained policy. Used for transfer learning.
- Parameters:
path (str) – The path where the model state_dict of the policy is found.
restore_lagrance_multipliers (bool, optional) – Whether you want to restore the Lagrange multipliers. Defaults to False.
- Raises:
Exception – Raises an exception if something goes wrong during loading.
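A save/restore sketch for transfer learning, continuing the earlier LAC example; the path is illustrative and the exact on-disk format is determined by the package.

```python
policy.save("./models/lac_pendulum")  # Persist the current model state.

# Later, restore into a freshly constructed policy with the same network sizes.
new_policy = LAC_pytorch(env)
new_policy.restore(
    "./models/lac_pendulum",
    restore_lagrance_multipliers=False,  # Start with fresh Lagrange multipliers.
)
```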
- abstract export(path)
Can be used to export the model as a TorchScript so that it can be deployed to hardware.
- Parameters:
path (str) – The path where you want to export the policy to.
- Raises:
NotImplementedError – Raised until the feature is fixed upstream.
- load_state_dict(state_dict, restore_lagrance_multipliers=True)
Copies parameters and buffers from state_dict into this module and its descendants.
- state_dict()
Simple wrapper around the torch.nn.Module.state_dict() method that saves the current class name. This is used to enable easy loading of the model.
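A checkpointing sketch using the standard torch serialization utilities; the file name is illustrative only.

```python
import torch

torch.save(policy.state_dict(), "lac_checkpoint.pt")  # Checkpoint includes the class name.

state_dict = torch.load("lac_checkpoint.pt")
policy.load_state_dict(state_dict, restore_lagrance_multipliers=True)
```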
- bound_lr(lr_a_final=None, lr_c_final=None, lr_alpha_final=None, lr_labda_final=None)
Function that can be used to make sure the learning rates do not drop below a given lower bound.
- Parameters:
lr_a_final (float, optional) – The lower bound for the actor learning rate. Defaults to None.
lr_c_final (float, optional) – The lower bound for the critic learning rate. Defaults to None.
lr_alpha_final (float, optional) – The lower bound for the alpha Lagrange multiplier learning rate. Defaults to None.
lr_labda_final (float, optional) – The lower bound for the labda Lagrange multiplier learning rate. Defaults to None.
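A usage sketch; the lower-bound values below are illustrative only.

```python
# Keep a decay schedule from shrinking the learning rates below these floors.
policy.bound_lr(
    lr_a_final=1e-5,
    lr_c_final=1e-5,
    lr_alpha_final=1e-5,
    lr_labda_final=1e-5,
)
```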
- _update_targets()
Updates the target networks based on an exponential moving average (polyak averaging).
- _set_learning_rates(lr_a=None, lr_c=None, lr_alpha=None, lr_labda=None)
Can be used to manually adjust the learning rates of the optimizers.
- Parameters:
lr_a (float, optional) – The learning rate of the actor optimizer. Defaults to None.
lr_c (float, optional) – The learning rate of the (Lyapunov) critic. Defaults to None.
lr_alpha (float, optional) – The learning rate of the temperature optimizer. Defaults to None.
lr_labda (float, optional) – The learning rate of the Lyapunov Lagrange multiplier optimizer. Defaults to None.
- property alpha
Property used to clip alpha to be greater than or equal to 0.0 to prevent it from becoming nan when log_alpha becomes -inf. No upper bound is used for alpha.
- property labda
Property used to clip labda to be greater than or equal to 0.0 to prevent it from becoming nan when log_labda becomes -inf. It is further clipped to be less than or equal to 1.0 to prevent labda from exploding when the hyperparameters are chosen badly.
- property target_entropy
The target entropy used while learning the entropy temperature alpha.
- property device
The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.).
- class stable_learning_control.algos.SAC_pytorch(env, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation={'actor': nn.ReLU, 'critic': nn.ReLU}, output_activation={'actor': nn.ReLU, 'critic': nn.Identity}), opt_type='maximize', alpha=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, device='cpu')
Bases:
torch.nn.Module
The Soft Actor Critic algorithm.
- ac (torch.nn.Module) – The soft actor-critic module.
- ac_ (torch.nn.Module) – The target soft actor-critic module.
- log_alpha (torch.Tensor) – The temperature Lagrange multiplier.
Initialise the SAC algorithm.
- Parameters:
env (gym.Env) – The gymnasium environment the SAC is training in. This is used to retrieve the action and observation space dimensions, which are needed to build the networks. The environment must satisfy the gymnasium API.
actor_critic (torch.nn.Module, optional) – The constructor method for a Torch Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

| Call | Output Shape | Description |
| --- | --- | --- |
| act | (batch, act_dim) | Numpy array of actions for each observation. |
| Q*/L | (batch,) | Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!) |

Calling pi should return:

| Symbol | Shape | Description |
| --- | --- | --- |
| a | (batch, act_dim) | Tensor containing actions from the policy given observations. |
| logp_pi | (batch,) | Tensor containing log probabilities of the actions in a. Importantly: gradients should be able to flow back into a. |

Defaults to SoftActorCritic.
ac_kwargs (dict, optional) – Any kwargs appropriate for the ActorCritic object you provided to SAC. Defaults to:

| Kwarg | Value |
| --- | --- |
| hidden_sizes_actor | 256 x 2 |
| hidden_sizes_critic | 256 x 2 |
| activation | torch.nn.ReLU |
| output_activation | torch.nn.ReLU (actor), torch.nn.Identity (critic) |
opt_type (str, optional) – The optimization type you want to use. Options are maximize and minimize. Defaults to maximize.
alpha (float, optional) – Entropy regularization coefficient (equivalent to the inverse of the reward scale in the original SAC paper). Defaults to 0.99.
gamma (float, optional) – Discount factor (always between 0 and 1). Defaults to 0.99.
polyak (float, optional) – Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to $\theta_{\text{targ}} \leftarrow \rho\,\theta_{\text{targ}} + (1 - \rho)\,\theta$, where $\rho$ is polyak (always between 0 and 1, usually close to 1). In some papers $\rho$ is defined as $(1 - \tau)$, where $\tau$ is the soft replacement factor. Defaults to 0.995.
target_entropy (float, optional) – Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space, which can be calculated as $-\prod_i \text{act\_dim}_i$ (the negative product of the action-space dimensions). A small sketch of this heuristic follows the parameter list.
adaptive_temperature (bool, optional) – Enables automatic entropy adjustment for maximum entropy RL learning. Defaults to True.
lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.
lr_c (float, optional) – Learning rate used for the (soft) critic. Defaults to 3e-4.
lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.
device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.
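A minimal sketch of the default target-entropy heuristic described above, assuming env is a gymnasium environment; whether the package computes exactly this expression internally is an assumption, and the Pendulum-v1 id is only an example.

```python
import gymnasium as gym
import numpy as np

env = gym.make("Pendulum-v1")  # Example environment only.

# Negative product of the action-space dimensions.
heuristic_target_entropy = -float(np.prod(env.action_space.shape))
print(heuristic_target_entropy)  # -1.0 for Pendulum-v1 (a 1-D action space).
```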
- _setup_kwargs
- _act_dim
- _obs_dim
- _device
- _adaptive_temperature
- _opt_type
- _polyak
- _gamma
- _lr_a
- _lr_c
- log_alpha
- actor_critic
- ac
- ac_targ
- _pi_optimizer
- _pi_params
- _c_params
- _c_optimizer
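A minimal SAC instantiation sketch, assuming a continuous-action gymnasium environment; the environment id is only an example, and the keyword values shown simply repeat the documented defaults for illustration.

```python
import gymnasium as gym

from stable_learning_control.algos import SAC_pytorch

env = gym.make("Pendulum-v1")  # Any continuous-action gymnasium environment.

policy = SAC_pytorch(
    env,
    opt_type="maximize",
    gamma=0.99,
    lr_a=1e-4,
    lr_c=3e-4,
    device="cpu",
)
```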
- forward(s, deterministic=False)
Wrapper around the get_action() method that enables users to also receive actions directly by invoking SAC(observations).
- Parameters:
s (numpy.ndarray) – The current state.
deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.
- Returns:
The current action.
- Return type: numpy.ndarray
- get_action(s, deterministic=False)
Returns the current action of the policy.
- Parameters:
s (numpy.ndarray) – The current state.
deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.
- Returns:
The current action.
- Return type: numpy.ndarray
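A short deterministic evaluation rollout, assuming policy and env come from the SAC sketch above.

```python
obs, _ = env.reset()
episode_return, done = 0.0, False

while not done:
    action = policy.get_action(obs, deterministic=True)  # Evaluation-mode action.
    obs, reward, terminated, truncated, _ = env.step(action)
    episode_return += float(reward)
    done = terminated or truncated

print(f"Episode return: {episode_return:.2f}")
```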
- update(data)
Update the actor critic network using stochastic gradient descent.
- Parameters:
data (dict) – Dictionary containing a batch of experiences.
- save(path)
Can be used to save the current model state.
- restore(path, restore_lagrance_multipliers=False)
Restores an already trained policy. Used for transfer learning.
- Parameters:
path (str) – The path where the model state_dict of the policy is found.
restore_lagrance_multipliers (bool, optional) – Whether you want to restore the Lagrange multipliers. Defaults to False.
- Raises:
Exception – Raises an exception if something goes wrong during loading.
- abstract export(path)
Can be used to export the model as a TorchScript so that it can be deployed to hardware.
- Parameters:
path (str) – The path where you want to export the policy to.
- Raises:
NotImplementedError – Raised until the feature is fixed upstream.
- load_state_dict(state_dict, restore_lagrance_multipliers=True)
Copies parameters and buffers from state_dict into this module and its descendants.
- state_dict()
Simple wrapper around the torch.nn.Module.state_dict() method that saves the current class name. This is used to enable easy loading of the model.
- bound_lr(lr_a_final=None, lr_c_final=None, lr_alpha_final=None)
Function that can be used to make sure the learning rates do not drop below a given lower bound.
- Parameters:
lr_a_final (float, optional) – The lower bound for the actor learning rate. Defaults to None.
lr_c_final (float, optional) – The lower bound for the critic learning rate. Defaults to None.
lr_alpha_final (float, optional) – The lower bound for the alpha Lagrange multiplier learning rate. Defaults to None.
- _update_targets()
Updates the target networks based on an exponential moving average (polyak averaging).
- _set_learning_rates(lr_a=None, lr_c=None, lr_alpha=None)
Can be used to manually adjust the learning rates of the optimizers.
- property alpha
Property used to clip alpha to be greater than or equal to 0.0 to prevent it from becoming nan when log_alpha becomes -inf. No upper bound is used for alpha.
- property target_entropy
The target entropy used while learning the entropy temperature alpha.
- property device
The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.).