stable_learning_control.algos

Contains the PyTorch and TensorFlow RL algorithms.

Warning

Because the PyTorch implementation was easier to work with, it was the one used during research. As a result, the TensorFlow implementation has not yet been thoroughly tested, and no guarantees can be given about the correctness of its algorithms.

Subpackages

Package Contents

Classes

LAC_pytorch

The Lyapunov (soft) Actor-Critic (LAC) algorithm.

SAC_pytorch

The Soft Actor Critic algorithm.

Functions

tf_installed()

Checks if TensorFlow is installed.

class stable_learning_control.algos.LAC_pytorch(env, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation=nn.ReLU, output_activation={'actor': nn.ReLU}), opt_type='minimize', alpha=0.99, alpha3=0.2, labda=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_labda=0.0003, device='cpu')

Bases: torch.nn.Module

The Lyapunov (soft) Actor-Critic (LAC) algorithm.

ac

The (lyapunov) actor critic module.

Type:

torch.nn.Module

ac_

The (lyapunov) target actor critic module.

Type:

torch.nn.Module

log_alpha

The temperature Lagrange multiplier.

Type:

torch.Tensor

log_labda

The Lyapunov Lagrange multiplier.

Type:

torch.Tensor

Initialise the LAC algorithm.

Parameters:
  • env (gym.env) – The gymnasium environment the LAC is training in. It is used to retrieve the action and observation space dimensions, which determine the network sizes. The environment must satisfy the gymnasium API.

  • actor_critic (torch.nn.Module, optional) –

    The constructor method for a Torch Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

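    =======  ================  ==============================================
    Call     Output Shape      Description
    =======  ================  ==============================================
    act      (batch, act_dim)  Numpy array of actions for each observation.
    Q*/L     (batch,)          Tensor containing one current estimate of Q*/L
                               for the provided observations and actions.
                               (Critical: make sure to flatten this!)
    =======  ================  ==============================================

    Calling pi should return:

    =======  ================  ==============================================
    Symbol   Shape             Description
    =======  ================  ==============================================
    a        (batch, act_dim)  Tensor containing actions from policy given
                               observations.
    logp_pi  (batch,)          Tensor containing log probabilities of actions
                               in a. Importantly: gradients should be able to
                               flow back into a.
    =======  ================  ==============================================

    Defaults to LyapunovActorCritic.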

  • ac_kwargs (dict, optional) –

    Any kwargs appropriate for the ActorCritic object you provided to LAC. Defaults to:

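    ===================  =============
    Kwarg                Value
    ===================  =============
    hidden_sizes_actor   256 x 2
    hidden_sizes_critic  256 x 2
    activation           torch.nn.ReLU
    output_activation    torch.nn.ReLU
    ===================  =============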

  • opt_type (str, optional) – The optimization type you want to use. Options: maximize and minimize. Defaults to minimize.

  • alpha (float, optional) – Entropy regularization coefficient (Equivalent to inverse of reward scale in the original SAC paper). Defaults to 0.99.

  • alpha3 (float, optional) – The Lyapunov constraint error boundary. Defaults to 0.2.

  • labda (float, optional) – The Lyapunov Lagrange multiplier. Defaults to 0.99.

  • gamma (float, optional) – Discount factor (always between 0 and 1). Defaults to 0.99 per Haarnoja et al. 2018, not 0.995 as in Han et al. 2020.

  • polyak (float, optional) –

    Interpolation factor in Polyak averaging for target networks. Target networks are updated towards main networks according to:

    \theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta

    where \rho is polyak (always between 0 and 1, usually close to 1). In some papers \rho is defined as (1 - \tau), where \tau is the soft replacement factor. Defaults to 0.995.

  • target_entropy (float, optional) –

    Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space. This can be calculated according to:

    -\prod_{i=0}^{n} \text{action\_dim}_{i}

  • adaptive_temperature (bool, optional) – Enables automatic entropy temperature adjustment for maximum-entropy RL. Defaults to True.

  • lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.

  • lr_c (float, optional) – Learning rate used for the (Lyapunov) critic. Defaults to 3e-4.

  • lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.

  • lr_labda (float, optional) – Learning rate used for the Lyapunov Lagrange multiplier. Defaults to 3e-4.

  • device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.

Attention

This class will behave differently when the actor_critic argument is set to the LyapunovActorTwinCritic. For more information see the LATC documentation.

property alpha

Property used to clip alpha to be greater than or equal to 0.0, which prevents it from becoming nan when log_alpha becomes -inf. No upper bound is used for alpha.

property labda

Property used to clip labda to be greater than or equal to 0.0, which prevents it from becoming nan when log_labda becomes -inf. It is also clipped to be less than or equal to 1.0 to prevent labda from exploding when the hyperparameters are chosen badly.
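The two clipping rules described for alpha and labda can be illustrated with a minimal pure-Python sketch. The helper names `clipped_alpha` and `clipped_labda` are hypothetical, and the sketch works on plain floats, whereas the real properties operate on `torch.Tensor` values. Only the bounds (a lower bound of 0.0 for both, an upper bound of 1.0 for labda) are taken from the descriptions.

```python
import math


def clipped_alpha(log_alpha: float) -> float:
    """Recover alpha from its logarithm and clip it to [0, inf)."""
    return max(math.exp(log_alpha), 0.0)


def clipped_labda(log_labda: float) -> float:
    """Recover labda from its logarithm and clip it to [0, 1]."""
    return min(max(math.exp(log_labda), 0.0), 1.0)


# exp(-inf) evaluates to 0.0, so the lower bound is reached without nan.
print(clipped_alpha(float("-inf")))
# A large log value saturates at the upper bound of 1.0.
print(clipped_labda(5.0))
```

Storing the multipliers in log space keeps the recovered values non-negative by construction; the explicit clip is a safeguard at the extremes.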

property target_entropy

The target entropy used while learning the entropy temperature alpha.
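As a rough illustration of the default described for the target_entropy parameter (the negative product of the action-space dimensions), the following sketch computes that value from an action-space shape. The function name `default_target_entropy` is hypothetical and not part of the package.

```python
import math


def default_target_entropy(action_shape) -> int:
    """Return minus the product of the action-space dimensions."""
    return -math.prod(action_shape)


print(default_target_entropy((2,)))    # -2
print(default_target_entropy((3, 2)))  # -6
```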

property device

The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.).

forward(s, deterministic=False)

Wrapper around the get_action() method that enables users to also receive actions directly by invoking LAC(observations).

Parameters:
  • s (numpy.ndarray) – The current state.

  • deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray

get_action(s, deterministic=False)

Returns the current action of the policy.

Parameters:
  • s (numpy.ndarray) – The current state.

  • deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray

update(data)

Update the actor critic network using stochastic gradient descent.

Parameters:

data (dict) – Dictionary containing a batch of experiences.

save(path)

Can be used to save the current model state.

Parameters:

path (str) – The path where you want to save the policy.

Raises:

Exception – Raises an exception if something goes wrong during saving.

restore(path, restore_lagrance_multipliers=False)

Restores an already trained policy. Used for transfer learning.

Parameters:
  • path (str) – The path where the model state_dict of the policy is found.

  • restore_lagrance_multipliers (bool, optional) – Whether you want to restore the Lagrange multipliers. Defaults to False.

Raises:

Exception – Raises an exception if something goes wrong during loading.

abstract export(path)

Can be used to export the model as a TorchScript such that it can be deployed to hardware.

Parameters:

path (str) – The path where you want to export the policy to.

Raises:

NotImplementedError – Raised until the feature is fixed upstream.

load_state_dict(state_dict, restore_lagrance_multipliers=True)

Copies parameters and buffers from state_dict into this module and its descendants.

Parameters:
  • state_dict (dict) – a dict containing parameters and persistent buffers.

  • restore_lagrance_multipliers (bool, optional) – Whether you want to restore the Lagrange multipliers. Defaults to True.
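The effect of the restore_lagrance_multipliers flag can be sketched with plain dictionaries. This is an assumption-laden illustration, not the package's implementation: the helper `merge_state` is hypothetical, and it assumes the multipliers are stored under the keys `log_alpha` and `log_labda` (the attribute names listed for the class).

```python
def merge_state(current, incoming, restore_lagrance_multipliers=True):
    """Return the state to load, optionally keeping the current multipliers."""
    multiplier_keys = ("log_alpha", "log_labda")  # assumed key names
    merged = dict(incoming)
    if not restore_lagrance_multipliers:
        # Keep the multipliers of the current model instead of the saved ones.
        for key in multiplier_keys:
            if key in current:
                merged[key] = current[key]
    return merged


current = {"log_alpha": 0.0, "log_labda": 0.0, "weights": [0.1]}
saved = {"log_alpha": -1.5, "log_labda": -0.3, "weights": [0.7]}
print(merge_state(current, saved, restore_lagrance_multipliers=False))
```

With the flag set to False, only the network parameters come from the saved state, which is the behaviour one would typically want when reusing a policy for transfer learning.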

state_dict()

Simple wrapper around the torch.nn.Module.state_dict() method that saves the current class name. This is used to enable easy loading of the model.

bound_lr(lr_a_final=None, lr_c_final=None, lr_alpha_final=None, lr_labda_final=None)

Function that can be used to make sure the learning rates don’t drop below a lower bound.

Parameters:
  • lr_a_final (float, optional) – The lower bound for the actor learning rate. Defaults to None.

  • lr_c_final (float, optional) – The lower bound for the critic learning rate. Defaults to None.

  • lr_alpha_final (float, optional) – The lower bound for the alpha Lagrange multiplier learning rate. Defaults to None.

  • lr_labda_final (float, optional) – The lower bound for the labda Lagrange multiplier learning rate. Defaults to None.
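A minimal sketch of what bounding a learning rate from below could look like is given here. It assumes optimizer state shaped like PyTorch's `param_groups` (a list of dictionaries with an `"lr"` entry) but uses plain dictionaries so it runs without PyTorch. The helper name `apply_lr_floor` is hypothetical.

```python
def apply_lr_floor(param_groups, lr_final=None):
    """Raise every learning rate below lr_final up to lr_final, in place."""
    if lr_final is None:
        return  # No lower bound requested.
    for group in param_groups:
        if group["lr"] < lr_final:
            group["lr"] = lr_final


groups = [{"lr": 1e-5}, {"lr": 1e-3}]
apply_lr_floor(groups, lr_final=1e-4)
print(groups)  # The first rate is raised to the bound, the second is untouched.
```

A bound like this is useful in combination with a decaying learning rate schedule, where the rate would otherwise keep shrinking towards zero.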

_update_targets()

Updates the target networks based on an exponential moving average (Polyak averaging).
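The Polyak update rule given for the polyak parameter, \theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta, can be sketched on plain lists of floats. The function name `polyak_update` is hypothetical, and the real method operates on network parameters rather than lists.

```python
def polyak_update(target, main, polyak=0.995):
    """Return target parameters moved towards the main parameters."""
    return [polyak * t + (1.0 - polyak) * m for t, m in zip(target, main)]


target = [0.0, 2.0]
main = [1.0, 2.0]
# With polyak close to 1 the target moves only slightly towards main.
print(polyak_update(target, main, polyak=0.995))
```

Setting polyak to 1.0 freezes the target, while 0.0 copies the main parameters outright, which is why values close to 1 are the usual choice.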

_set_learning_rates(lr_a=None, lr_c=None, lr_alpha=None, lr_labda=None)

Can be used to manually adjust the learning rates of the optimizers.

Parameters:
  • lr_a (float, optional) – The learning rate of the actor optimizer. Defaults to None.

  • lr_c (float, optional) – The learning rate of the (Lyapunov) Critic. Defaults to None.

  • lr_alpha (float, optional) – The learning rate of the temperature optimizer. Defaults to None.

  • lr_labda (float, optional) – The learning rate of the Lyapunov Lagrange multiplier optimizer. Defaults to None.

class stable_learning_control.algos.SAC_pytorch(env, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation={'actor': nn.ReLU, 'critic': nn.ReLU}, output_activation={'actor': nn.ReLU, 'critic': nn.Identity}), opt_type='maximize', alpha=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, device='cpu')

Bases: torch.nn.Module

The Soft Actor Critic algorithm.

ac

The soft actor critic module.

Type:

torch.nn.Module

ac_

The target soft actor critic module.

Type:

torch.nn.Module

log_alpha

The temperature Lagrange multiplier.

Type:

torch.Tensor

Initialise the SAC algorithm.

Parameters:
  • env (gym.env) – The gymnasium environment the SAC is training in. It is used to retrieve the action and observation space dimensions, which determine the network sizes. The environment must satisfy the gymnasium API.

  • actor_critic (torch.nn.Module, optional) –

    The constructor method for a Torch Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

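    =======  ================  ==============================================
    Call     Output Shape      Description
    =======  ================  ==============================================
    act      (batch, act_dim)  Numpy array of actions for each observation.
    Q*/L     (batch,)          Tensor containing one current estimate of Q*/L
                               for the provided observations and actions.
                               (Critical: make sure to flatten this!)
    =======  ================  ==============================================

    Calling pi should return:

    =======  ================  ==============================================
    Symbol   Shape             Description
    =======  ================  ==============================================
    a        (batch, act_dim)  Tensor containing actions from policy given
                               observations.
    logp_pi  (batch,)          Tensor containing log probabilities of actions
                               in a. Importantly: gradients should be able to
                               flow back into a.
    =======  ================  ==============================================

    Defaults to SoftActorCritic.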

  • ac_kwargs (dict, optional) –

    Any kwargs appropriate for the ActorCritic object you provided to SAC. Defaults to:

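    ===================  =============
    Kwarg                Value
    ===================  =============
    hidden_sizes_actor   256 x 2
    hidden_sizes_critic  256 x 2
    activation           torch.nn.ReLU
    output_activation    torch.nn.ReLU
    ===================  =============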

  • opt_type (str, optional) – The optimization type you want to use. Options: maximize and minimize. Defaults to maximize.

  • alpha (float, optional) – Entropy regularization coefficient (Equivalent to inverse of reward scale in the original SAC paper). Defaults to 0.99.

  • gamma (float, optional) – Discount factor (always between 0 and 1). Defaults to 0.99.

  • polyak (float, optional) –

    Interpolation factor in Polyak averaging for target networks. Target networks are updated towards main networks according to:

    \theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta

    where \rho is polyak (always between 0 and 1, usually close to 1). In some papers \rho is defined as (1 - \tau), where \tau is the soft replacement factor. Defaults to 0.995.

  • target_entropy (float, optional) –

    Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space. This can be calculated according to:

    -\prod_{i=0}^{n} \text{action\_dim}_{i}

  • adaptive_temperature (bool, optional) – Enables automatic entropy temperature adjustment for maximum-entropy RL. Defaults to True.

  • lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.

  • lr_c (float, optional) – Learning rate used for the (soft) critic. Defaults to 3e-4.

  • lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.

  • device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.

property alpha

Property used to clip alpha to be greater than or equal to 0.0, which prevents it from becoming nan when log_alpha becomes -inf. No upper bound is used for alpha.

property target_entropy

The target entropy used while learning the entropy temperature alpha.

property device

The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.).

forward(s, deterministic=False)

Wrapper around the get_action() method that enables users to also receive actions directly by invoking SAC(observations).

Parameters:
  • s (numpy.ndarray) – The current state.

  • deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray

get_action(s, deterministic=False)

Returns the current action of the policy.

Parameters:
  • s (numpy.ndarray) – The current state.

  • deterministic (bool, optional) – Whether to return a deterministic action. Defaults to False.

Returns:

The current action.

Return type:

numpy.ndarray
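The difference between deterministic and stochastic action selection can be sketched as follows. This assumes a tanh-squashed Gaussian policy, which is common for SAC but is not confirmed here as the behaviour of the default actor. The function `sample_action` and its `mean` and `log_std` inputs are hypothetical stand-ins for the outputs of a policy network.

```python
import math
import random


def sample_action(mean, log_std, deterministic=False, rng=random):
    """Return a tanh-squashed action from per-dimension Gaussian parameters."""
    if deterministic:
        # Use the mean of the distribution; no sampling noise.
        pre_squash = list(mean)
    else:
        # Draw one sample per action dimension.
        pre_squash = [rng.gauss(m, math.exp(s)) for m, s in zip(mean, log_std)]
    # Squash into the interval [-1, 1].
    return [math.tanh(u) for u in pre_squash]


mean, log_std = [0.0, 1.0], [-1.0, -1.0]
print(sample_action(mean, log_std, deterministic=True))
print(sample_action(mean, log_std, deterministic=False))
```

Deterministic actions are typically used at evaluation time, while sampled actions provide exploration during training.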

update(data)

Update the actor critic network using stochastic gradient descent.

Parameters:

data (dict) – Dictionary containing a batch of experiences.

save(path)

Can be used to save the current model state.

Parameters:

path (str) – The path where you want to save the policy.

Raises:

Exception – Raises an exception if something goes wrong during saving.

restore(path, restore_lagrance_multipliers=False)

Restores an already trained policy. Used for transfer learning.

Parameters:
  • path (str) – The path where the model state_dict of the policy is found.

  • restore_lagrance_multipliers (bool, optional) – Whether you want to restore the Lagrange multipliers. Defaults to False.

Raises:

Exception – Raises an exception if something goes wrong during loading.

abstract export(path)

Can be used to export the model as a TorchScript such that it can be deployed to hardware.

Parameters:

path (str) – The path where you want to export the policy to.

Raises:

NotImplementedError – Raised until the feature is fixed upstream.

load_state_dict(state_dict, restore_lagrance_multipliers=True)

Copies parameters and buffers from state_dict into this module and its descendants.

Parameters:
  • state_dict (dict) – a dict containing parameters and persistent buffers.

  • restore_lagrance_multipliers (bool, optional) – Whether you want to restore the Lagrange multipliers. Defaults to True.

state_dict()

Simple wrapper around the torch.nn.Module.state_dict() method that saves the current class name. This is used to enable easy loading of the model.

bound_lr(lr_a_final=None, lr_c_final=None, lr_alpha_final=None)

Function that can be used to make sure the learning rates don’t drop below a lower bound.

Parameters:
  • lr_a_final (float, optional) – The lower bound for the actor learning rate. Defaults to None.

  • lr_c_final (float, optional) – The lower bound for the critic learning rate. Defaults to None.

  • lr_alpha_final (float, optional) – The lower bound for the alpha Lagrange multiplier learning rate. Defaults to None.

_update_targets()

Updates the target networks based on an exponential moving average (Polyak averaging).

_set_learning_rates(lr_a=None, lr_c=None, lr_alpha=None)

Can be used to manually adjust the learning rates of the optimizers.

Parameters:
  • lr_a (float, optional) – The learning rate of the actor optimizer. Defaults to None.

  • lr_c (float, optional) – The learning rate of the (soft) Critic. Defaults to None.

  • lr_alpha (float, optional) – The learning rate of the temperature optimizer. Defaults to None.

stable_learning_control.algos.tf_installed()

Checks if TensorFlow is installed.

Returns:

True if TensorFlow is installed, False otherwise.

Return type:

bool
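One way such a check could be written with the standard library alone is sketched below. It is not necessarily how tf_installed() is implemented; the function name `tensorflow_available` is hypothetical, and the sketch only tests whether a top-level `tensorflow` module can be located, without importing it.

```python
import importlib.util


def tensorflow_available() -> bool:
    """Return True if a top-level 'tensorflow' module can be found."""
    return importlib.util.find_spec("tensorflow") is not None


if tensorflow_available():
    print("TensorFlow found; the TensorFlow algorithms can be imported.")
else:
    print("TensorFlow not found; only the PyTorch algorithms are usable.")
```

Looking up the module spec instead of importing the package keeps the check cheap, since importing TensorFlow itself can take several seconds.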