stable_learning_control

Module that initialises the stable_learning_control package.

Package Contents

Functions

lac_pytorch(env_fn[, actor_critic, ac_kwargs, ...])    Trains the LAC algorithm in a given environment.
latc_pytorch(env_fn[, actor_critic])                   Trains the LATC algorithm in a given environment.
sac_pytorch(env_fn[, actor_critic, ac_kwargs, ...])    Trains the SAC algorithm in a given environment.
tf_installed()                                         Checks if TensorFlow is installed.

Attributes

__version__

__version_tuple__

stable_learning_control.lac_pytorch(env_fn, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation=nn.ReLU, output_activation=nn.ReLU), opt_type='minimize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, alpha3=0.2, labda=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_labda=0.0003, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_labda_final=1e-10, lr_decay_type=DEFAULT_DECAY_TYPE, lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_labda_decay_type=None, lr_decay_ref=DEFAULT_DECAY_REFERENCE, batch_size=256, replay_size=int(1000000.0), horizon_length=0, seed=None, device='cpu', logger_kwargs=dict(), save_freq=1, start_policy=None, export=False)

Trains the LAC algorithm in a given environment.

Parameters:
  • env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.

  • actor_critic (torch.nn.Module, optional) –

    The constructor method for a Torch Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

    Call    Output Shape        Description
    act     (batch, act_dim)    NumPy array of actions for each observation.
    Q*/L    (batch,)            Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!)

    Calling pi should return:

    Symbol     Shape               Description
    a          (batch, act_dim)    Tensor containing actions from policy given observations.
    logp_pi    (batch,)            Tensor containing log probabilities of actions in a. Importantly: gradients should be able to flow back into a.

    Defaults to LyapunovActorCritic. A minimal sketch of this interface is given after this parameter list.

  • ac_kwargs (dict, optional) –

    Any kwargs appropriate for the ActorCritic object you provided to LAC. Defaults to:

    Kwarg                    Value
    hidden_sizes_actor       256 x 2
    hidden_sizes_critic      256 x 2
    activation               torch.nn.ReLU
    output_activation        torch.nn.ReLU

  • opt_type (str, optional) – The optimization type you want to use. Options are maximize and minimize. Defaults to minimize.

  • max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.

  • epochs (int, optional) – Number of epochs to run and train agent. Defaults to 100.

  • steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch. Defaults to 2048.

  • start_steps (int, optional) – Number of steps for uniform-random action selection, before running real policy. Helps exploration. Defaults to 0.

  • update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to 100.

  • update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures replay buffer is full enough for useful updates. Defaults to 1000.

  • steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. update_every/ steps_per_update). Defaults to 100.

  • num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to 10.

  • alpha (float, optional) – Entropy regularization coefficient (Equivalent to inverse of reward scale in the original SAC paper). Defaults to 0.99.

  • alpha3 (float, optional) – The Lyapunov constraint error boundary. Defaults to 0.2.

  • labda (float, optional) – The Lyapunov Lagrange multiplier. Defaults to 0.99.

  • gamma (float, optional) – Discount factor. (Always between 0 and 1.). Defaults to 0.99.

  • polyak (float, optional) –

    Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:

    \theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta

    where \rho is polyak (always between 0 and 1, usually close to 1). In some papers \rho is defined as (1 - \tau), where \tau is the soft replacement factor. Defaults to 0.995.

  • target_entropy (float, optional) –

    Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space. This can be calculated according to:

    -\prod_{i=0}^{n} \text{action\_dim}_{i}

    For example, for an action space with shape (3,) this gives a target entropy of -3.

  • adaptive_temperature (bool, optional) – Enables automatic entropy temperature adjustment for maximum entropy RL learning. Defaults to True.

  • lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.

  • lr_c (float, optional) – Learning rate used for the (Lyapunov) critic. Defaults to 3e-4.

  • lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.

  • lr_labda (float, optional) – Learning rate used for the Lyapunov Lagrange multiplier. Defaults to 3e-4.

  • lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_labda_final (float, optional) – The final labda learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_decay_type (str, optional) – The learning rate decay type that is used (options are: linear and exponential and constant). Defaults to linear. Can be overridden by the specific learning rate decay types.

  • lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_labda_decay_type (str, optional) – The learning rate decay type that is used for the labda learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: epoch and step). Defaults to epoch.

  • batch_size (int, optional) – Minibatch size for SGD. Defaults to 256.

  • replay_size (int, optional) – Maximum length of replay buffer. Defaults to 1e6.

  • horizon_length (int, optional) – The length of the finite-horizon used for the Lyapunov Critic target. Defaults to 0, meaning the infinite-horizon Bellman backup is used.

  • seed (int) – Seed for random number generators. Defaults to None.

  • device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.

  • logger_kwargs (dict, optional) – Keyword args for EpochLogger.

  • save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.

  • start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.

  • export (bool) – Whether you want to export the model as a TorchScript such that it can be deployed on hardware. By default False.
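For reference, below is a minimal sketch of a custom actor-critic module that satisfies the act/pi/Q*/L interface described under actor_critic above. The class and helper names (SimpleLyapunovActorCritic, _SquashedGaussianActor, _mlp) are hypothetical, the tanh log-probability correction is omitted for brevity, and the critic head only illustrates the expected input/output shapes; use the packaged LyapunovActorCritic unless a custom architecture is required.

```python
import torch
import torch.nn as nn


def _mlp(sizes, activation=nn.ReLU, output_activation=nn.Identity):
    """Hypothetical helper that stacks fully connected layers."""
    layers = []
    for i in range(len(sizes) - 1):
        act = activation if i < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    return nn.Sequential(*layers)


class _SquashedGaussianActor(nn.Module):
    """Sketch of a `pi` module: returns actions and their log probabilities."""

    def __init__(self, obs_dim, act_dim, hidden_sizes=(256, 256)):
        super().__init__()
        self.net = _mlp([obs_dim, *hidden_sizes], output_activation=nn.ReLU)
        self.mu_layer = nn.Linear(hidden_sizes[-1], act_dim)
        self.log_std_layer = nn.Linear(hidden_sizes[-1], act_dim)

    def forward(self, obs):
        h = self.net(obs)
        mu = self.mu_layer(h)
        log_std = torch.clamp(self.log_std_layer(h), -20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        a = dist.rsample()                  # reparameterised sample, so gradients flow back into `a`
        logp_pi = dist.log_prob(a).sum(-1)  # shape (batch,); tanh correction omitted in this sketch
        return torch.tanh(a), logp_pi       # squash actions to a bounded range


class SimpleLyapunovActorCritic(nn.Module):
    """Hypothetical actor-critic exposing the act/pi/Q*/L interface described above."""

    def __init__(self, obs_dim, act_dim, hidden_sizes=(256, 256)):
        super().__init__()
        self.pi = _SquashedGaussianActor(obs_dim, act_dim, hidden_sizes)
        # Critic head mapping (obs, act) pairs to one scalar estimate per sample.
        self.L = _mlp([obs_dim + act_dim, *hidden_sizes, 1])

    def forward(self, obs, act):
        # Flattened to shape (batch,), as required for the Q*/L call.
        return self.L(torch.cat([obs, act], dim=-1)).squeeze(-1)

    def act(self, obs):
        with torch.no_grad():
            a, _ = self.pi(obs)
        return a.cpu().numpy()              # NumPy array of shape (batch, act_dim)
```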

Returns:

tuple containing:

Return type:

(tuple)
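A minimal usage sketch, assuming gymnasium is installed. The environment id ("Pendulum-v1"), the hyperparameter values, and the logger_kwargs keys (output_dir, exp_name, as typically accepted by an EpochLogger) are illustrative assumptions, not recommendations.

```python
import gymnasium as gym

from stable_learning_control import lac_pytorch

# Environment factory; any continuous-action gymnasium environment can be used.
env_fn = lambda: gym.make("Pendulum-v1")

# Train LAC with the default LyapunovActorCritic. The values below simply
# mirror the signature defaults to show how they are passed.
lac_pytorch(
    env_fn,
    opt_type="minimize",   # LAC minimises a cost signal by default
    epochs=100,
    steps_per_epoch=2048,
    gamma=0.99,
    lr_a=1e-4,
    lr_c=3e-4,
    seed=0,
    logger_kwargs=dict(output_dir="./data/lac_pendulum", exp_name="lac_pendulum"),
)
```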

stable_learning_control.latc_pytorch(env_fn, actor_critic=None, *args, **kwargs)

Trains the LATC algorithm in a given environment.

Parameters:
  • env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.

  • actor_critic (torch.nn.Module, optional) –

    The constructor method for a Torch Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

    Call    Output Shape        Description
    act     (batch, act_dim)    NumPy array of actions for each observation.
    Q*/L    (batch,)            Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!)

    Calling pi should return:

    Symbol     Shape               Description
    a          (batch, act_dim)    Tensor containing actions from policy given observations.
    logp_pi    (batch,)            Tensor containing log probabilities of actions in a. Importantly: gradients should be able to flow back into a.

    Defaults to LyapunovActorTwinCritic.

  • *args – The positional arguments to pass to the lac() function.

  • **kwargs – The keyword arguments to pass to the lac() function.

Note

Wraps the lac() function so that the LyapunovActorTwinCritic architecture is used as the actor critic.
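Because latc_pytorch() simply forwards its arguments to lac(), it is called exactly like lac_pytorch(). A brief sketch, reusing the illustrative environment from the LAC example above:

```python
import gymnasium as gym

from stable_learning_control import latc_pytorch

# Same call pattern as lac_pytorch; only the default actor-critic differs
# (LyapunovActorTwinCritic instead of LyapunovActorCritic).
latc_pytorch(lambda: gym.make("Pendulum-v1"), epochs=50, seed=0)
```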

stable_learning_control.sac_pytorch(env_fn, actor_critic=None, ac_kwargs=dict(hidden_sizes={'actor': [256] * 2, 'critic': [256] * 2}, activation={'actor': nn.ReLU, 'critic': nn.ReLU}, output_activation={'actor': nn.ReLU, 'critic': nn.Identity}), opt_type='maximize', max_ep_len=None, epochs=100, steps_per_epoch=2048, start_steps=0, update_every=100, update_after=1000, steps_per_update=100, num_test_episodes=10, alpha=0.99, gamma=0.99, polyak=0.995, target_entropy=None, adaptive_temperature=True, lr_a=0.0001, lr_c=0.0003, lr_alpha=0.0001, lr_a_final=1e-10, lr_c_final=1e-10, lr_alpha_final=1e-10, lr_decay_type=DEFAULT_DECAY_TYPE, lr_a_decay_type=None, lr_c_decay_type=None, lr_alpha_decay_type=None, lr_decay_ref=DEFAULT_DECAY_REFERENCE, batch_size=256, replay_size=int(1000000.0), seed=None, device='cpu', logger_kwargs=dict(), save_freq=1, start_policy=None, export=False)

Trains the SAC algorithm in a given environment.

Parameters:
  • env_fn – A function which creates a copy of the environment. The environment must satisfy the gymnasium API.

  • actor_critic (torch.nn.Module, optional) –

    The constructor method for a Torch Module with an act method, a pi module and several Q or L modules. The act method and pi module should accept batches of observations as inputs, and the Q* and L modules should accept a batch of observations and a batch of actions as inputs. When called, these modules should return:

    Call    Output Shape        Description
    act     (batch, act_dim)    NumPy array of actions for each observation.
    Q*/L    (batch,)            Tensor containing one current estimate of Q*/L for the provided observations and actions. (Critical: make sure to flatten this!)

    Calling pi should return:

    Symbol     Shape               Description
    a          (batch, act_dim)    Tensor containing actions from policy given observations.
    logp_pi    (batch,)            Tensor containing log probabilities of actions in a. Importantly: gradients should be able to flow back into a.

    Defaults to SoftActorCritic.

  • ac_kwargs (dict, optional) –

    Any kwargs appropriate for the ActorCritic object you provided to SAC. Defaults to:

    Kwarg                    Value
    hidden_sizes_actor       256 x 2
    hidden_sizes_critic      256 x 2
    activation               torch.nn.ReLU
    output_activation        torch.nn.ReLU (actor), torch.nn.Identity (critic)

  • opt_type (str, optional) – The optimization type you want to use. Options are maximize and minimize. Defaults to maximize.

  • max_ep_len (int, optional) – Maximum length of trajectory / episode / rollout. Defaults to the environment maximum.

  • epochs (int, optional) – Number of epochs to run and train agent. Defaults to 100.

  • steps_per_epoch (int, optional) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch. Defaults to 2048.

  • start_steps (int, optional) – Number of steps for uniform-random action selection, before running real policy. Helps exploration. Defaults to 0.

  • update_every (int, optional) – Number of env interactions that should elapse between gradient descent updates. Defaults to 100.

  • update_after (int, optional) – Number of env interactions to collect before starting to do gradient descent updates. Ensures replay buffer is full enough for useful updates. Defaults to 1000.

  • steps_per_update (int, optional) – Number of gradient descent steps that are performed for each gradient descent update. This determines the ratio of env steps to gradient steps (i.e. update_every/ steps_per_update). Defaults to 100.

  • num_test_episodes (int, optional) – Number of episodes used to test the deterministic policy at the end of each epoch. This is used for logging the performance. Defaults to 10.

  • alpha (float, optional) – Entropy regularization coefficient (Equivalent to inverse of reward scale in the original SAC paper). Defaults to 0.99.

  • gamma (float, optional) – Discount factor. (Always between 0 and 1.). Defaults to 0.99.

  • polyak (float, optional) –

    Interpolation factor in polyak averaging for target networks. Target networks are updated towards main networks according to:

    \theta_{\text{targ}} \leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta

    where \rho is polyak (always between 0 and 1, usually close to 1). In some papers \rho is defined as (1 - \tau), where \tau is the soft replacement factor. Defaults to 0.995. A code sketch of this soft update is given after this parameter list.

  • target_entropy (float, optional) –

    Initial target entropy used while learning the entropy temperature (alpha). Defaults to the maximum information (bits) contained in the action space. This can be calculated according to:

    -\prod_{i=0}^{n} \text{action\_dim}_{i}

  • adaptive_temperature (bool, optional) – Enables automatic entropy temperature adjustment for maximum entropy RL learning. Defaults to True.

  • lr_a (float, optional) – Learning rate used for the actor. Defaults to 1e-4.

  • lr_c (float, optional) – Learning rate used for the (soft) critic. Defaults to 3e-4.

  • lr_alpha (float, optional) – Learning rate used for the entropy temperature. Defaults to 1e-4.

  • lr_a_final (float, optional) – The final actor learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_c_final (float, optional) – The final critic learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_alpha_final (float, optional) – The final alpha learning rate that is achieved at the end of the training. Defaults to 1e-10.

  • lr_decay_type (str, optional) – The learning rate decay type that is used (options are: linear and exponential and constant). Defaults to linear. Can be overridden by the specific learning rate decay types.

  • lr_a_decay_type (str, optional) – The learning rate decay type that is used for the actor learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_c_decay_type (str, optional) – The learning rate decay type that is used for the critic learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_alpha_decay_type (str, optional) – The learning rate decay type that is used for the alpha learning rate (options are: linear and exponential and constant). If not specified, the general learning rate decay type is used.

  • lr_decay_ref (str, optional) – The reference variable that is used for decaying the learning rate (options: epoch and step). Defaults to epoch.

  • batch_size (int, optional) – Minibatch size for SGD. Defaults to 256.

  • replay_size (int, optional) – Maximum length of replay buffer. Defaults to 1e6.

  • seed (int) – Seed for random number generators. Defaults to None.

  • device (str, optional) – The device the networks are placed on (options: cpu, gpu, gpu:0, gpu:1, etc.). Defaults to cpu.

  • logger_kwargs (dict, optional) – Keyword args for EpochLogger.

  • save_freq (int, optional) – How often (in terms of gap between epochs) to save the current policy and value function.

  • start_policy (str) – Path of an already trained policy to use as the starting point for the training. By default a new policy is created.

  • export (bool) – Whether you want to export the model as a TorchScript such that it can be deployed on hardware. By default False.
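As referenced in the polyak parameter description above, the target-network update amounts to the following soft update. This is only a minimal sketch, assuming target and main are torch modules with identically ordered parameters; the helper name soft_update is hypothetical.

```python
import torch


def soft_update(target, main, polyak=0.995):
    """theta_targ <- polyak * theta_targ + (1 - polyak) * theta."""
    with torch.no_grad():
        for p_targ, p in zip(target.parameters(), main.parameters()):
            p_targ.mul_(polyak)
            p_targ.add_((1.0 - polyak) * p)
```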

Returns:

tuple containing:

Return type:

(tuple)
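A minimal usage sketch for the SAC trainer, mirroring the LAC example above; the environment id and hyperparameter values are illustrative assumptions.

```python
import gymnasium as gym

from stable_learning_control import sac_pytorch

# Train SAC with the default SoftActorCritic on a continuous-control task.
sac_pytorch(
    lambda: gym.make("Pendulum-v1"),
    opt_type="maximize",        # SAC maximises the expected return
    epochs=100,
    adaptive_temperature=True,  # learn the entropy temperature alpha
    seed=0,
)
```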

stable_learning_control.tf_installed()

Checks if TensorFlow is installed.

Returns:

Returns True if TensorFlow is installed.

Return type:

bool
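A small usage example; the function only reports availability, so it is typically used to guard TensorFlow-backend code paths:

```python
from stable_learning_control import tf_installed

if tf_installed():
    print("TensorFlow backend available.")
else:
    print("TensorFlow not installed; only the PyTorch implementations can be used.")
```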

stable_learning_control.__version__ = '6.0.0'
stable_learning_control.__version_tuple__