stable_gym.envs.mujoco

Stable Gym gymnasium environments that are based on Mujoco or Mujoco gymnasium environments.

Subpackages

Package Contents

Classes

AntCost

Custom Ant gymnasium environment.

HalfCheetahCost

Custom HalfCheetah gymnasium environment.

HopperCost

Custom Hopper gymnasium environment.

HumanoidCost

Custom Humanoid gymnasium environment.

SwimmerCost

Custom Swimmer gymnasium environment.

Walker2dCost

Custom Walker2d gymnasium environment.

class stable_gym.envs.mujoco.AntCost(reference_forward_velocity=1.0, randomise_reference_forward_velocity=False, randomise_reference_forward_velocity_range=(0.5, 1.5), forward_velocity_weight=1.0, include_ctrl_cost=False, include_health_penalty=True, health_penalty_size=10, ctrl_cost_weight=0.0001, use_contact_forces=False, contact_cost_weight=0.0005, terminate_when_unhealthy=True, healthy_z_range=(0.2, 1.0), contact_force_range=(-1.0, 1.0), reset_noise_scale=0.1, exclude_current_positions_from_observation=True, exclude_reference_from_observation=False, exclude_reference_error_from_observation=True, exclude_x_velocity_from_observation=False, action_space_dtype=np.float32, observation_space_dtype=np.float64, **kwargs)[source]

Bases: gymnasium.envs.mujoco.ant_v4.AntEnv, gymnasium.utils.EzPickle

Custom Ant gymnasium environment.

Note

Can also be used in a vectorized manner. See the gym.vector documentation.

Source:

This is a modified version of the Ant Mujoco environment found in the gymnasium library. This modification was first described by Han et al. 2020. Compared to the original Ant environment in this modified version:

  • The objective was changed to a velocity-tracking task. To do this, the reward is replaced with a cost. This cost is the squared difference between the Ant’s forward velocity and a reference value (error). Additionally, also a control cost and health penalty can be included in the cost.

  • Three optional variables were added to the observation space; The reference velocity, the reference error (i.e. the difference between the ant’s forward velocity and the reference) and the ant’s forward velocity. These variables can be enabled using the exclude_reference_from_observation, exclude_reference_error_from_observation and exclude_velocity_from_observation environment arguments.

The rest of the environment is the same as the original Ant environment. Below, the modified cost is described. For more information about the environment (e.g. observation space, action space, episode termination, etc.), please refer to the gymnasium library.

Modified cost:

A cost, computed using the AntCost.cost() method, is given for each simulation step, including the terminal step. This cost is defined as the error between the Ant’s forward velocity and a reference value. A control cost and health penalty can also be included in the cost. The cost is computed as:

\[cost = w_{forward\_velocity} \times (x_{velocity} - x_{reference\_x\_velocity})^2 + w_{ctrl} \times c_{ctrl} + p_{health}\]
Solved Requirements:

Considered solved when the average cost is less than or equal to 50 over 100 consecutive trials.

How to use:
import stable_gym
import gymnasium as gym
env = gym.make("stable_gym:AntCost-v1")
state

The current system state.

Type:

numpy.ndarray

dt

The environment step size. Also available as tau.

Type:

float

reference_forward_velocity

The forward velocity that the agent should try to track.

Type:

float

Initialise a new AntCost environment instance.

Parameters:
  • reference_forward_velocity (float, optional) – The forward velocity that the agent should try to track. Defaults to 1.0.

  • randomise_reference_forward_velocity (bool, optional) – Whether to randomize the reference forward velocity. Defaults to False.

  • randomise_reference_forward_velocity_range (tuple, optional) – The range of the random reference forward velocity. Defaults to (0.5, 1.5).

  • forward_velocity_weight (float, optional) – The weight used to scale the forward velocity error. Defaults to 1.0.

  • include_ctrl_cost (bool, optional) – Whether you also want to penalize the ant if it takes actions that are too large. Defaults to False.

  • include_health_penalty (bool, optional) – Whether to penalize the ant if it becomes unhealthy (i.e. if it falls over). Defaults to True.

  • health_penalty_size (int, optional) – The size of the unhealthy penalty. Defaults to 10.

  • ctrl_cost_weight (float, optional) – The weight used to scale the control cost. Defaults to 1e-4.

  • use_contact_forces (bool, optional) – Whether to include contact forces in the observation and cost. Defaults to False.

  • contact_cost_weight (float, optional) – The weight used to scale the contact cost. Defaults to 5e-4.

  • terminate_when_unhealthy (bool, optional) – Whether to terminate the episode when the ant becomes unhealthy. Defaults to True.

  • healthy_z_range (tuple, optional) – The range of healthy z values. Defaults to (0.2, 1.0).

  • contact_force_range (tuple, optional) – The range of healthy contact forces. Defaults to (-1.0, 1.0).

  • reset_noise_scale (float, optional) – Scale of random perturbations of the initial position and velocity. Defaults to 0.1.

  • exclude_current_positions_from_observation (bool, optional) – Whether to omit the x- and y-coordinates of the front tip from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behaviour in policies. Defaults to True.

  • exclude_reference_from_observation (bool, optional) – Whether the reference should be excluded from the observation. Defaults to False.

  • exclude_reference_error_from_observation (bool, optional) – Whether the error should be excluded from the observation. Defaults to True.

  • exclude_x_velocity_from_observation (bool, optional) – Whether to omit the x- component of the velocity from observations. Defaults to False.

  • action_space_dtype (union[numpy.dtype, str], optional) – The data type of the action space. Defaults to np.float32.

  • observation_space_dtype (union[numpy.dtype, str], optional) – The data type of the observation space. Defaults to np.float64.

  • **kwargs – Extra keyword arguments to pass to the AntEnv class.

property tau

Alias for the environment step size. Done for compatibility with the other gymnasium environments.

property t

Environment time.

property physics_time

Returns the physics time.

cost(x_velocity, ctrl_cost)[source]

Compute the cost of a given x velocity and control cost.

Parameters:
  • x_velocity (float) – The Ant’s x velocity.

  • ctrl_cost (float) – The control cost.

Returns:

tuple containing:

  • cost (float): The cost of the action.

  • info (dict): Additional information about the cost.

Return type:

(tuple)

step(action)[source]

Take step into the environment.

Note

This method overrides the step() method such that the new cost function is used.

Parameters:

action (np.ndarray) – Action to take in the environment.

Returns:

tuple containing:

  • obs (np.ndarray): Environment observation.

  • cost (float): Cost of the action.

  • terminated (bool): Whether the episode is terminated.

  • truncated (bool): Whether the episode was truncated. This value is set by wrappers when for example a time limit is reached or the agent goes out of bounds.

  • info (dict): Additional information about the environment.

Return type:

(tuple)

reset(seed=None, options=None)[source]

Reset gymnasium environment.

Parameters:
  • seed (int, optional) – A random seed for the environment. By default None.

  • options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Not used in this environment.

Returns:

tuple containing:

  • obs (numpy.ndarray): Initial environment observation.

  • info (dict): Dictionary containing additional information.

Return type:

(tuple)

class stable_gym.envs.mujoco.HalfCheetahCost(reference_forward_velocity=1.0, randomise_reference_forward_velocity=False, randomise_reference_forward_velocity_range=(0.5, 1.5), forward_velocity_weight=1.0, include_ctrl_cost=False, ctrl_cost_weight=0.0001, reset_noise_scale=0.1, exclude_current_positions_from_observation=True, exclude_reference_from_observation=False, exclude_reference_error_from_observation=True, exclude_x_velocity_from_observation=False, action_space_dtype=np.float32, observation_space_dtype=np.float64, **kwargs)[source]

Bases: gymnasium.envs.mujoco.half_cheetah_v4.HalfCheetahEnv, gymnasium.utils.EzPickle

Custom HalfCheetah gymnasium environment.

Note

Can also be used in a vectorized manner. See the gym.vector documentation.

Source:

This is a modified version of the HalfCheetah Mujoco environment found in the gymnasium library. This modification was first described by Han et al. 2020. Compared to the original HalfCheetah environment in this modified version:

  • The objective was changed to a velocity-tracking task. To do this, the reward is replaced with a cost. This cost is the squared difference between the HalfCheetah’s forward velocity and a reference value (error). Additionally, also a control cost can be included in the cost.

  • Three optional variables were added to the observation space; The reference velocity, the reference error (i.e. the difference between the cheetah’s forward velocity and the reference) and the cheetah’s forward velocity. These variables can be enabled using the exclude_reference_from_observation, exclude_reference_error_from_observation and exclude_velocity_from_observation environment arguments.

The rest of the environment is the same as the original HalfCheetah environment. Below, the modified cost is described. For more information about the environment (e.g. observation space, action space, episode termination, etc.), please refer to the gymnasium library.

Important

The original code from Han et al. 2020 terminates the episode if the cheetah’s back thigh angle exceeds \(0.5 \pi\) or falls below \(-0.5 \pi\). This condition, not mentioned in the paper, is not implemented here as it’s not part of the original environment.

Modified cost:

A cost, computed using the HalfCheetahCost.cost() method, is given for each simulation step, including the terminal step. This cost is defined as the error between the Cheetah’s forward velocity and a reference value. A control cost can also be included in the cost. The cost is computed as:

\[cost = w_{forward\_velocity} \times (x_{velocity} - x_{reference\_x\_velocity})^2 + w_{ctrl} \times c_{ctrl}\]
Solved Requirements:

Considered solved when the average cost is less than or equal to 50 over 100 consecutive trials.

How to use:
import stable_gym
import gymnasium as gym
env = gym.make("stable_gym:HalfCheetahCost-v1")
state

The current system state.

Type:

numpy.ndarray

dt

The environment step size. Also available as tau.

Type:

float

reference_forward_velocity

The forward velocity that the agent should try to track.

Type:

float

Initialise a new HalfCheetahCost environment instance.

Parameters:
  • reference_forward_velocity (float, optional) – The forward velocity that the agent should try to track. Defaults to 1.0.

  • randomise_reference_forward_velocity (bool, optional) – Whether to randomize the reference forward velocity. Defaults to False.

  • randomise_reference_forward_velocity_range (tuple, optional) – The range of the random reference forward velocity. Defaults to (0.5, 1.5).

  • forward_velocity_weight (float, optional) – The weight used to scale the forward velocity error. Defaults to 1.0.

  • include_ctrl_cost (bool, optional) – Whether you also want to penalize the half cheetah if it takes actions that are too large. Defaults to False.

  • ctrl_cost_weight (float, optional) – The weight used to scale the control cost. Defaults to 1e-4.

  • reset_noise_scale (float, optional) – Scale of random perturbations of the initial position and velocity. Defaults to 0.1.

  • exclude_current_positions_from_observation (bool, optional) – Whether to omit the x- and y-coordinates of the front tip from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behaviour in policies. Defaults to True.

  • exclude_reference_from_observation (bool, optional) – Whether the reference should be excluded from the observation. Defaults to False.

  • exclude_reference_error_from_observation (bool, optional) – Whether the error should be excluded from the observation. Defaults to True.

  • exclude_x_velocity_from_observation (bool, optional) – Whether to omit the x- component of the velocity from observations. Defaults to False.

  • action_space_dtype (union[numpy.dtype, str], optional) – The data type of the action space. Defaults to np.float32.

  • observation_space_dtype (union[numpy.dtype, str], optional) – The data type of the observation space. Defaults to np.float64.

  • **kwargs – Extra keyword arguments to pass to the HalfCheetahEnv class.

property tau

Alias for the environment step size. Done for compatibility with the other gymnasium environments.

property t

Environment time.

property physics_time

Returns the physics time.

cost(x_velocity, ctrl_cost)[source]

Compute the cost of a given x velocity and control cost.

Parameters:
  • x_velocity (float) – The HalfCheetah’s x velocity.

  • ctrl_cost (float) – The control cost.

Returns:

tuple containing:

  • cost (float): The cost of the action.

  • info (dict): Additional information about the cost.

Return type:

(tuple)

step(action)[source]

Take step into the environment.

Note

This method overrides the step() method such that the new cost function is used.

Parameters:

action (np.ndarray) – Action to take in the environment.

Returns:

tuple containing:

  • obs (np.ndarray): Environment observation.

  • cost (float): Cost of the action.

  • terminated (bool): Whether the episode is terminated.

  • truncated (bool): Whether the episode was truncated. This value is set by wrappers when for example a time limit is reached or the agent goes out of bounds.

  • info (dict): Additional information about the environment.

Return type:

(tuple)

reset(seed=None, options=None)[source]

Reset gymnasium environment.

Parameters:
  • seed (int, optional) – A random seed for the environment. By default None.

  • options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Not used in this environment.

Returns:

tuple containing:

  • obs (numpy.ndarray): Initial environment observation.

  • info (dict): Dictionary containing additional information.

Return type:

(tuple)

class stable_gym.envs.mujoco.HopperCost(reference_forward_velocity=1.0, randomise_reference_forward_velocity=False, randomise_reference_forward_velocity_range=(0.5, 1.5), forward_velocity_weight=1.0, include_ctrl_cost=False, include_health_penalty=True, health_penalty_size=10, ctrl_cost_weight=0.001, terminate_when_unhealthy=True, healthy_state_range=(-100.0, 100.0), healthy_z_range=(0.7, float('inf')), healthy_angle_range=(-0.2, 0.2), reset_noise_scale=0.005, exclude_current_positions_from_observation=True, exclude_reference_from_observation=False, exclude_reference_error_from_observation=True, exclude_x_velocity_from_observation=False, action_space_dtype=np.float32, observation_space_dtype=np.float64, **kwargs)[source]

Bases: gymnasium.envs.mujoco.hopper_v4.HopperEnv, gymnasium.utils.EzPickle

Custom Hopper gymnasium environment.

Note

Can also be used in a vectorized manner. See the gym.vector documentation.

Source:

This is a modified version of the Hopper Mujoco environment found in the gymnasium library. This modification was first described by Han et al. 2020. Compared to the original Hopper environment in this modified version:

  • The objective was changed to a velocity-tracking task. To do this, the reward is replaced with a cost. This cost is the squared difference between the Hopper’s forward velocity and a reference value (error). Additionally, also a control cost and health penalty can be included in the cost.

  • Three optional variables were added to the observation space; The reference velocity, the reference error (i.e. the difference between the hopper’s forward velocity and the reference) and the hopper’s forward velocity. These variables can be enabled using the exclude_reference_from_observation, exclude_reference_error_from_observation and exclude_velocity_from_observation environment arguments.

The rest of the environment is the same as the original Hopper environment. Below, the modified cost is described. For more information about the environment (e.g. observation space, action space, episode termination, etc.), please refer to the gymnasium library.

Modified cost:

A cost, computed using the HopperCost.cost() method, is given for each simulation step, including the terminal step. This cost is defined as the error between the Hopper’s forward velocity and a reference value. A control cost and health penalty can also be included in the cost. The cost is computed as:

\[cost = w_{forward\_velocity} \times (x_{velocity} - x_{reference\_x\_velocity})^2 + w_{ctrl} \times c_{ctrl} + p_{health}\]
Solved Requirements:

Considered solved when the average cost is less than or equal to 50 over 100 consecutive trials.

How to use:
import stable_gym
import gymnasium as gym
env = gym.make("stable_gym:HopperCost-v1")
state

The current system state.

Type:

numpy.ndarray

dt

The environment step size. Also available as tau.

Type:

float

reference_forward_velocity

The forward velocity that the agent should try to track.

Type:

float

Initialise a new HopperCost environment instance.

Parameters:
  • reference_forward_velocity (float, optional) – The forward velocity that the agent should try to track. Defaults to 1.0.

  • randomise_reference_forward_velocity (bool, optional) – Whether to randomize the reference forward velocity. Defaults to False.

  • randomise_reference_forward_velocity_range (tuple, optional) – The range of the random reference forward velocity. Defaults to (0.5, 1.5).

  • forward_velocity_weight (float, optional) – The weight used to scale the forward velocity error. Defaults to 1.0.

  • include_ctrl_cost (bool, optional) – Whether you also want to penalize the hopper if it takes actions that are too large. Defaults to False.

  • include_health_penalty (bool, optional) – Whether to penalize the hopper if it becomes unhealthy (i.e. if it falls over). Defaults to True.

  • health_penalty_size (int, optional) – The size of the unhealthy penalty. Defaults to 10.

  • ctrl_cost_weight (float, optional) – The weight used to scale the control cost. Defaults to 1e-3.

  • terminate_when_unhealthy (bool, optional) – Whether to terminate the episode when the hopper becomes unhealthy. Defaults to True.

  • healthy_state_range (tuple, optional) – The range of healthy states. Defaults to (-100.0, 100.0).

  • healthy_z_range (tuple, optional) – The range of healthy z values. Defaults to (0.7, float("inf")).

  • healthy_angle_range (tuple, optional) – The range of healthy angles. Defaults to (-0.2, 0.2).

  • reset_noise_scale (float, optional) – Scale of random perturbations of the initial position and velocity. Defaults to 5e-3.

  • exclude_current_positions_from_observation (bool, optional) – Whether to omit the x- and y-coordinates of the front tip from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behaviour in policies. Defaults to True.

  • exclude_reference_from_observation (bool, optional) – Whether the reference should be excluded from the observation. Defaults to False.

  • exclude_reference_error_from_observation (bool, optional) – Whether the error should be excluded from the observation. Defaults to True.

  • exclude_x_velocity_from_observation (bool, optional) – Whether to omit the x- component of the velocity from observations. Defaults to False.

  • action_space_dtype (union[numpy.dtype, str], optional) – The data type of the action space. Defaults to np.float32.

  • observation_space_dtype (union[numpy.dtype, str], optional) – The data type of the observation space. Defaults to np.float64.

  • **kwargs – Extra keyword arguments to pass to the HopperEnv class.

property tau

Alias for the environment step size. Done for compatibility with the other gymnasium environments.

property t

Environment time.

property physics_time

Returns the physics time.

cost(x_velocity, ctrl_cost)[source]

Compute the cost of a given x velocity and control cost.

Parameters:
  • x_velocity (float) – The Hopper’s x velocity.

  • ctrl_cost (float) – The control cost.

Returns:

tuple containing:

  • cost (float): The cost of the action.

  • info (dict): Additional information about the cost.

Return type:

(tuple)

step(action)[source]

Take step into the environment.

Note

This method overrides the step() method such that the new cost function is used.

Parameters:

action (np.ndarray) – Action to take in the environment.

Returns:

tuple containing:

  • obs (np.ndarray): Environment observation.

  • cost (float): Cost of the action.

  • terminated (bool): Whether the episode is terminated.

  • truncated (bool): Whether the episode was truncated. This value is set by wrappers when for example a time limit is reached or the agent goes out of bounds.

  • info (dict): Additional information about the environment.

Return type:

(tuple)

reset(seed=None, options=None)[source]

Reset gymnasium environment.

Parameters:
  • seed (int, optional) – A random seed for the environment. By default None.

  • options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Not used in this environment.

Returns:

tuple containing:

  • obs (numpy.ndarray): Initial environment observation.

  • info (dict): Dictionary containing additional information.

Return type:

(tuple)

class stable_gym.envs.mujoco.HumanoidCost(reference_forward_velocity=1.0, randomise_reference_forward_velocity=False, randomise_reference_forward_velocity_range=(0.5, 1.5), forward_velocity_weight=1.0, include_ctrl_cost=False, include_health_penalty=True, health_penalty_size=10, ctrl_cost_weight=0.0001, terminate_when_unhealthy=True, healthy_z_range=(1.0, 2.0), reset_noise_scale=0.01, exclude_current_positions_from_observation=True, exclude_reference_from_observation=False, exclude_reference_error_from_observation=True, exclude_x_velocity_from_observation=False, action_space_dtype=np.float32, observation_space_dtype=np.float64, **kwargs)[source]

Bases: gymnasium.envs.mujoco.humanoid_v4.HumanoidEnv, gymnasium.utils.EzPickle

Custom Humanoid gymnasium environment.

Note

Can also be used in a vectorized manner. See the gym.vector documentation.

Source:

This is a modified version of the Humanoid Mujoco environment found in the gymnasium library. This modification was first described by Han et al. 2020. Compared to the original Humanoid environment in this modified version:

  • The objective was changed to a velocity-tracking task. To do this, the reward is replaced with a cost. This cost is the squared difference between the Humanoid’s forward velocity and a reference value (error). Additionally, also a control cost and health penalty can be included in the cost.

  • Three optional variables were added to the observation space; The reference velocity, the reference error (i.e. the difference between the humanoid’s forward velocity and the reference) and the humanoid’s forward velocity. These variables can be enabled using the exclude_reference_from_observation, exclude_reference_error_from_observation and exclude_velocity_from_observation environment arguments.

The rest of the environment is the same as the original Humanoid environment. Below, the modified cost is described. For more information about the environment (e.g. observation space, action space, episode termination, etc.), please refer to the gymnasium library.

Modified cost:

A cost, computed using the HumanoidCost.cost() method, is given for each simulation step, including the terminal step. This cost is defined as the error between the Humanoid’s forward velocity and a reference value. A control cost and health penalty can also be included in the cost. The cost is computed as:

\[cost = w_{forward\_velocity} \times (x_{velocity} - x_{reference\_x\_velocity})^2 + w_{ctrl} \times c_{ctrl} + p_{health}\]
Solved Requirements:

Considered solved when the average cost is less than or equal to 50 over 100 consecutive trials.

How to use:
import stable_gyms
import gymnasium as gym
env = gym.make("stable_gym:HumanoidCost-v1")
state

The current system state.

Type:

numpy.ndarray

dt

The environment step size. Also available as tau.

Type:

float

reference_forward_velocity

The forward velocity that the agent should try to track.

Type:

float

Initialise a new HumanoidCost environment instance.

Parameters:
  • reference_forward_velocity (float, optional) – The forward velocity that the agent should try to track. Defaults to 1.0.

  • randomise_reference_forward_velocity (bool, optional) – Whether to randomize the reference forward velocity. Defaults to False.

  • randomise_reference_forward_velocity_range (tuple, optional) – The range of the random reference forward velocity. Defaults to (0.5, 1.5).

  • forward_velocity_weight (float, optional) – The weight used to scale the forward velocity error. Defaults to 1.0.

  • include_ctrl_cost (bool, optional) – Whether you also want to penalize the humanoid if it takes actions that are too large. Defaults to False.

  • include_health_penalty (bool, optional) – Whether to penalize the humanoid if it becomes unhealthy (i.e. if it falls over). Defaults to True.

  • health_penalty_size (int, optional) – The size of the unhealthy penalty. Defaults to 10.

  • ctrl_cost_weight (float, optional) – The weight used to scale the control cost. Defaults to 1e-4.

  • terminate_when_unhealthy (bool, optional) – Whether to terminate the episode when the humanoid becomes unhealthy. Defaults to True.

  • healthy_z_range (tuple, optional) – The range of healthy z values. Defaults to (1.0, 2.0).

  • reset_noise_scale (float, optional) – Scale of random perturbations of the initial position and velocity. Defaults to 1e-2.

  • exclude_current_positions_from_observation (bool, optional) – Whether to omit the x- and y-coordinates of the front tip from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behaviour in policies. Defaults to True.

  • exclude_reference_from_observation (bool, optional) – Whether the reference should be excluded from the observation. Defaults to False.

  • exclude_reference_error_from_observation (bool, optional) – Whether the error should be excluded from the observation. Defaults to True.

  • exclude_x_velocity_from_observation (bool, optional) – Whether to omit the x- component of the velocity from observations. Defaults to False.

  • action_space_dtype (union[numpy.dtype, str], optional) – The data type of the action space. Defaults to np.float32.

  • observation_space_dtype (union[numpy.dtype, str], optional) – The data type of the observation space. Defaults to np.float64.

  • **kwargs – Extra keyword arguments to pass to the HumanoidEnv class.

property tau

Alias for the environment step size. Done for compatibility with the other gymnasium environments.

property t

Environment time.

property physics_time

Returns the physics time.

cost(x_velocity, ctrl_cost)[source]

Compute the cost of a given x velocity and control cost.

Parameters:
  • x_velocity (float) – The Humanoid’s x velocity.

  • ctrl_cost (float) – The control cost.

Returns:

tuple containing:

  • cost (float): The cost of the action.

  • info (dict): Additional information about the cost.

Return type:

(tuple)

step(action)[source]

Take step into the environment.

Note

This method overrides the step() method such that the new cost function is used.

Parameters:

action (np.ndarray) – Action to take in the environment.

Returns:

tuple containing:

  • obs (np.ndarray): Environment observation.

  • cost (float): Cost of the action.

  • terminated (bool): Whether the episode is terminated.

  • truncated (bool): Whether the episode was truncated. This value is set by wrappers when for example a time limit is reached or the agent goes out of bounds.

  • info (dict): Additional information about the environment.

Return type:

(tuple)

reset(seed=None, options=None)[source]

Reset gymnasium environment.

Parameters:
  • seed (int, optional) – A random seed for the environment. By default None.

  • options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Not used in this environment.

Returns:

tuple containing:

  • obs (numpy.ndarray): Initial environment observation.

  • info (dict): Dictionary containing additional information.

Return type:

(tuple)

class stable_gym.envs.mujoco.SwimmerCost(reference_forward_velocity=1.0, randomise_reference_forward_velocity=False, randomise_reference_forward_velocity_range=(0.5, 1.5), forward_velocity_weight=1.0, include_ctrl_cost=False, ctrl_cost_weight=0.0001, reset_noise_scale=0.1, exclude_current_positions_from_observation=True, exclude_reference_from_observation=False, exclude_reference_error_from_observation=True, exclude_x_velocity_from_observation=False, action_space_dtype=np.float32, observation_space_dtype=np.float64, **kwargs)[source]

Bases: gymnasium.envs.mujoco.swimmer_v4.SwimmerEnv, gymnasium.utils.EzPickle

Custom Swimmer gymnasium environment.

Note

Can also be used in a vectorized manner. See the gym.vector documentation.

Source:

This is a modified version of the swimmer Mujoco environment found in the gymnasium library. This modification was first described by Han et al. 2020. Compared to the original Swimmer environment in this modified version:

  • The objective was changed to a velocity-tracking task. To do this, the reward is replaced with a cost. This cost is the squared difference between the swimmer’s forward velocity and a reference value (error). Additionally, also a control cost can be included in the cost.

  • Three optional variables were added to the observation space; The reference velocity, the reference error (i.e. the difference between the swimmer’s forward velocity and the reference) and the swimmer’s forward velocity. These variables can be enabled using the exclude_reference_from_observation, exclude_reference_error_from_observation and exclude_velocity_from_observation environment arguments.

The rest of the environment is the same as the original Swimmer environment. Below, the modified cost is described. For more information about the environment (e.g. observation space, action space, episode termination, etc.), please refer to the gymnasium library.

Modified cost:

A cost, computed using the SwimmerCost.cost() method, is given for each simulation step, including the terminal step. This cost is defined as the error between the Swimmer’s forward velocity and a reference value. A control cost can also be included in the cost. The cost is computed as:

\[cost = w_{forward\_velocity} \times (x_{velocity} - x_{reference\_x\_velocity})^2 + w_{ctrl} \times c_{ctrl}\]
Solved Requirements:

Considered solved when the average cost is less than or equal to 50 over 100 consecutive trials.

How to use:
import stable_gym
import gymnasium as gym
env = gym.make("stable_gym:SwimmerCost-v1")
state

The current system state.

Type:

numpy.ndarray

dt

The environment step size. Also available as tau.

Type:

float

reference_forward_velocity

The forward velocity that the agent should try to track.

Type:

float

Initialise a new SwimmerCost environment instance.

Parameters:
  • reference_forward_velocity (float, optional) – The forward velocity that the agent should try to track. Defaults to 1.0.

  • randomise_reference_forward_velocity (bool, optional) – Whether to randomize the reference forward velocity. Defaults to False.

  • randomise_reference_forward_velocity_range (tuple, optional) – The range of the random reference forward velocity. Defaults to (0.5, 1.5).

  • forward_velocity_weight (float, optional) – The weight used to scale the forward velocity error. Defaults to 1.0.

  • include_ctrl_cost (bool, optional) – Whether you also want to penalize the swimmer if it takes actions that are too large. Defaults to False.

  • ctrl_cost_weight (float, optional) – The weight used to scale the control cost. Defaults to 1e-4.

  • reset_noise_scale (float, optional) – Scale of random perturbations of the initial position and velocity. Defaults to 0.1.

  • exclude_current_positions_from_observation (bool, optional) – Whether to omit the x- and y-coordinates of the front tip from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behaviour in policies. Defaults to True.

  • exclude_reference_from_observation (bool, optional) – Whether the reference should be excluded from the observation. Defaults to False.

  • exclude_reference_error_from_observation (bool, optional) – Whether the error should be excluded from the observation. Defaults to True.

  • exclude_x_velocity_from_observation (bool, optional) – Whether to omit the x- component of the velocity from observations. Defaults to False.

  • action_space_dtype (union[numpy.dtype, str], optional) – The data type of the action space. Defaults to np.float32.

  • observation_space_dtype (union[numpy.dtype, str], optional) – The data type of the observation space. Defaults to np.float64.

  • **kwargs – Extra keyword arguments to pass to the SwimmerEnv class.

property tau

Alias for the environment step size. Done for compatibility with the other gymnasium environments.

property t

Environment time.

property physics_time

Returns the physics time.

cost(x_velocity, ctrl_cost)[source]

Compute the cost of a given x velocity and control cost.

Parameters:
  • x_velocity (float) – The swimmer’s x velocity.

  • ctrl_cost (float) – The control cost.

Returns:

tuple containing:

  • cost (float): The cost of the action.

  • info (dict): Additional information about the cost.

Return type:

(tuple)

step(action)[source]

Take step into the environment.

Note

This method overrides the step() method such that the new cost function is used.

Parameters:

action (np.ndarray) – Action to take in the environment.

Returns:

tuple containing:

  • obs (np.ndarray): Environment observation.

  • cost (float): Cost of the action.

  • terminated (bool): Whether the episode is terminated.

  • truncated (bool): Whether the episode was truncated. This value is set by wrappers when for example a time limit is reached or the agent goes out of bounds.

  • info (dict): Additional information about the environment.

Return type:

(tuple)

reset(seed=None, options=None)[source]

Reset gymnasium environment.

Parameters:
  • seed (int, optional) – A random seed for the environment. By default None.

  • options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Not used in this environment.

Returns:

tuple containing:

  • obs (numpy.ndarray): Initial environment observation.

  • info (dict): Dictionary containing additional information.

Return type:

(tuple)

class stable_gym.envs.mujoco.Walker2dCost(reference_forward_velocity=1.0, randomise_reference_forward_velocity=False, randomise_reference_forward_velocity_range=(0.5, 1.5), forward_velocity_weight=1.0, include_ctrl_cost=False, include_health_penalty=True, health_penalty_size=10, ctrl_cost_weight=0.001, terminate_when_unhealthy=True, healthy_z_range=(0.8, 2.0), healthy_angle_range=(-1.0, 1.0), reset_noise_scale=0.005, exclude_current_positions_from_observation=True, exclude_reference_from_observation=False, exclude_reference_error_from_observation=True, exclude_x_velocity_from_observation=False, action_space_dtype=np.float32, observation_space_dtype=np.float64, **kwargs)[source]

Bases: gymnasium.envs.mujoco.walker2d_v4.Walker2dEnv, gymnasium.utils.EzPickle

Custom Walker2d gymnasium environment.

Note

Can also be used in a vectorized manner. See the gym.vector documentation.

Source:

This is a modified version of the Walker2d Mujoco environment found in the gymnasium library. This modification was first described by Han et al. 2020. Compared to the original Walker2d environment in this modified version:

  • The objective was changed to a velocity-tracking task. To do this, the reward is replaced with a cost. This cost is the squared difference between the Walker2d’s forward velocity and a reference value (error). Additionally, also a control cost and health penalty can be included in the cost.

  • Three optional variables were added to the observation space; The reference velocity, the reference error (i.e. the difference between the walker2d’s forward velocity and the reference) and the walker2d’s forward velocity. These variables can be enabled using the exclude_reference_from_observation, exclude_reference_error_from_observation and exclude_velocity_from_observation environment arguments.

The rest of the environment is the same as the original Walker2d environment. Below, the modified cost is described. For more information about the environment (e.g. observation space, action space, episode termination, etc.), please refer to the gymnasium library.

Modified cost:

A cost, computed using the Walker2dCost.cost() method, is given for each simulation step, including the terminal step. This cost is defined as the error between the Walkers’s forward velocity and a reference value. A control cost and health penalty can also be included in the cost. The cost is computed as:

\[cost = w_{forward\_velocity} \times (x_{velocity} - x_{reference\_x\_velocity})^2 + w_{ctrl} \times c_{ctrl} + p_{health}\]
Solved Requirements:

Considered solved when the average cost is less than or equal to 50 over 100 consecutive trials.

How to use:
import stable_gyms
import gymnasium as gym
env = gym.make("stable_gym:Walker2dCost-v1")
state

The current system state.

Type:

numpy.ndarray

dt

The environment step size. Also available as tau.

Type:

float

reference_forward_velocity

The forward velocity that the agent should try to track.

Type:

float

Initialise a new Walker2dCost environment instance.

Parameters:
  • reference_forward_velocity (float, optional) – The forward velocity that the agent should try to track. Defaults to 1.0.

  • randomise_reference_forward_velocity (bool, optional) – Whether to randomize the reference forward velocity. Defaults to False.

  • randomise_reference_forward_velocity_range (tuple, optional) – The range of the random reference forward velocity. Defaults to (0.5, 1.5).

  • forward_velocity_weight (float, optional) – The weight used to scale the forward velocity error. Defaults to 1.0.

  • include_ctrl_cost (bool, optional) – Whether you also want to penalize the 2D walker if it takes actions that are too large. Defaults to False.

  • include_health_penalty (bool, optional) – Whether to penalize the 2D walker if it becomes unhealthy (i.e. if it falls over). Defaults to True.

  • health_penalty_size (int, optional) – The size of the unhealthy penalty. Defaults to 10.

  • ctrl_cost_weight (float, optional) – The weight used to scale the control cost. Defaults to 1e-3.

  • terminate_when_unhealthy (bool, optional) – Whether to terminate the episode when the 2D walker becomes unhealthy. Defaults to True.

  • healthy_z_range (tuple, optional) – The range of healthy z values. Defaults to (0.8, 2.0).

  • healthy_angle_range (tuple, optional) – The range of healthy angles. Defaults to (-1.0, 1.0),

  • reset_noise_scale (float, optional) – Scale of random perturbations of the initial position and velocity. Defaults to 5e-3.

  • exclude_current_positions_from_observation (bool, optional) – Whether to omit the x- and y-coordinates of the front tip from observations. Excluding the position can serve as an inductive bias to induce position-agnostic behaviour in policies. Defaults to True.

  • exclude_reference_from_observation (bool, optional) – Whether the reference should be excluded from the observation. Defaults to False.

  • exclude_reference_error_from_observation (bool, optional) – Whether the error should be excluded from the observation. Defaults to True.

  • exclude_x_velocity_from_observation (bool, optional) – Whether to omit the x- component of the velocity from observations. Defaults to False.

  • action_space_dtype (union[numpy.dtype, str], optional) – The data type of the action space. Defaults to np.float32.

  • observation_space_dtype (union[numpy.dtype, str], optional) – The data type of the observation space. Defaults to np.float64.

  • **kwargs – Extra keyword arguments to pass to the Walker2dEnv class.

property tau

Alias for the environment step size. Done for compatibility with the other gymnasium environments.

property t

Environment time.

property physics_time

Returns the physics time.

cost(x_velocity, ctrl_cost)[source]

Compute the cost of a given x velocity and control cost.

Parameters:
  • x_velocity (float) – The Walker2d’s x velocity.

  • ctrl_cost (float) – The control cost.

Returns:

tuple containing:

  • cost (float): The cost of the action.

  • info (dict): Additional information about the cost.

Return type:

(tuple)

step(action)[source]

Take step into the environment.

Note

This method overrides the step() method such that the new cost function is used.

Parameters:

action (np.ndarray) – Action to take in the environment.

Returns:

tuple containing:

  • obs (np.ndarray): Environment observation.

  • cost (float): Cost of the action.

  • terminated (bool): Whether the episode is terminated.

  • truncated (bool): Whether the episode was truncated. This value is set by wrappers when for example a time limit is reached or the agent goes out of bounds.

  • info (dict): Additional information about the environment.

Return type:

(tuple)

reset(seed=None, options=None)[source]

Reset gymnasium environment.

Parameters:
  • seed (int, optional) – A random seed for the environment. By default None.

  • options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Not used in this environment.

Returns:

tuple containing:

  • obs (numpy.ndarray): Initial environment observation.

  • info (dict): Dictionary containing additional information.

Return type:

(tuple)