stable_gym.envs.classic_control

Stable Gym gymnasium environments based on classical control problems or classical control environments found in the gymnasium library.

Package Contents

Classes

CartPoleCost

Custom CartPole Gymnasium environment.

CartPoleTrackingCost

Custom CartPole Gymnasium environment.

Ex3EKF

Noisy master-slave system

class stable_gym.envs.classic_control.CartPoleCost(render_mode=None, max_cost=100.0, clip_action=True, action_space_dtype=np.float64, observation_space_dtype=np.float64)[source]

Bases: gymnasium.Env

Custom CartPole Gymnasium environment.

Note

This environment can be used in a vectorized manner. Refer to the gym.vector documentation for details.
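
For example, a minimal vectorization sketch using gymnasium's SyncVectorEnv (the environment id follows the usage example further below):

import gymnasium as gym
import stable_gym  # noqa: F401 (registers the Stable Gym environments)

# Step four synchronous copies of the environment in lockstep.
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("stable_gym:CartPoleCost-v1") for _ in range(4)]
)
obs, info = envs.reset(seed=42)
obs, cost, terminated, truncated, info = envs.step(envs.action_space.sample())
envs.close()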

Attention

If you’re using this environment to reproduce the results of Han et al. (2020), please note that slight differences may occur due to the modifications mentioned below. For an accurate reproduction, refer to the separate han2020 branch, which mirrors the environment used in their study. It can be accessed here.

Source:

This environment is a modified version of the CartPole environment from the Farama Foundation’s Gymnasium package, first used by Han et al. in 2020. Modifications made by Han et al. include:

  • The action space is continuous, contrasting with the original discrete setting.

  • Offers an optional feature to confine actions within the defined action space, preventing the agent from exceeding set boundaries when activated.

  • The reward function is replaced with a (positive definite) cost function (negated reward), in line with Lyapunov stability theory.

  • Maximum cart force is increased from 10 to 20.

  • Episode length is reduced from 500 to 250.

  • A termination cost of \(c=100\) is introduced for early episode termination, to promote cost minimization.

  • The terminal angle limit is expanded from the original 12 degrees to 20 degrees, enhancing recovery potential.

  • The terminal position limit is extended from 2.4 meters to 10 meters, broadening the recovery range.

  • Velocity limits are adjusted from \(\pm \infty\) to \(\pm 50\), accelerating training.

  • Angular velocity termination threshold is lowered from \(\pm \infty\) to \(\pm 50\), likely for improved training efficiency.

  • Random initial state range is modified from [-0.05, 0.05] to [-5, 5] for the cart position and [-0.2, 0.2] for all other states, allowing for expanded exploration.

  • The info dictionary is expanded to include the reference state, state of interest, and reference error.

Additional modifications in our implementation:

  • Unlike the original environment’s fixed cost threshold of 100, this version allows users to adjust the maximum cost threshold via the max_cost input, improving training adaptability.

  • The gravity constant is adjusted back from 10 to the real-world value of 9.8, aligning it closer with the original CartPole environment.

  • The data types for action and observation spaces are set to np.float64, diverging from the np.float32 used by Han et al. 2020. This aligns the Gymnasium implementation with the original CartPole environment.

Observation:

Type: Box(4)

Num | Observation           | Min                    | Max
----+-----------------------+------------------------+---------------------
0   | Cart Position         | -20                    | 20
1   | Cart Velocity         | -50                    | 50
2   | Pole Angle            | ~ -0.698 rad (-40 deg) | ~ 0.698 rad (40 deg)
3   | Pole Angular Velocity | -50 rad/s              | 50 rad/s

Note

While the ranges above denote the possible values of each element of the observation space, they do not reflect the allowed values of the state space in an unterminated episode. In particular:

  • The cart x-position (index 0) can take values between (-20, 20), but the episode terminates if the cart leaves the (-10, 10) range.

  • The pole angle can be observed between (-0.698, 0.698) radians (or ±40°), but the episode terminates if the pole angle is not in the range (-0.349, 0.349) (or ±20°).

Actions:

Type: Box(1)

Num | Action               | Min | Max
----+----------------------+-----+----
0   | The controller force | -20 | 20

Note

The velocity that is reduced or increased by the applied force is not fixed; it depends on the angle the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it.

Cost:

A cost, computed using the CartPoleCost.cost() method, is given for each simulation step, including the terminal step. This cost is the error of the cart position and the pole angle relative to the zero position and angle. The cost is set to the maximum cost when the episode is terminated. The cost is defined as:

\[cost = (x / x_{threshold})^2 + 20 * (\theta / \theta_{threshold})^2\]
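
As an illustration, a minimal sketch of this cost, assuming the thresholds equal the termination limits listed in this document (x_threshold = 10 m, theta_threshold = 20 degrees):

import math

X_THRESHOLD = 10.0                    # terminal cart position (m), assumed
THETA_THRESHOLD = math.radians(20.0)  # terminal pole angle (rad), assumed

def cart_pole_cost(x, theta):
    # Quadratic penalty on the normalised cart position and pole angle.
    return (x / X_THRESHOLD) ** 2 + 20.0 * (theta / THETA_THRESHOLD) ** 2
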
Starting State:

The cart position is assigned a uniform random value in [-5, 5] and the other states are assigned uniform random values in [-0.2, 0.2].

Episode Termination:
  • Pole Angle is more than 20 degrees.

  • Cart Position is more than 10 m (center of the cart reaches the edge of the display).

  • Episode length is greater than 250.

  • The cost is greater than a threshold (100 by default). This threshold can be changed using the max_cost environment argument.

Solved Requirements:

Considered solved when the average cost is less than or equal to 50 over 100 consecutive trials.

How to use:
import stable_gym
import gymnasium as gym
env = gym.make("stable_gym:CartPoleCost-v1")

On reset, the options parameter allows the user to change the bounds used to determine the new random state when random=True.
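
A minimal interaction sketch; the cost is returned in the reward slot of the standard gymnasium step tuple:

import gymnasium as gym
import stable_gym  # noqa: F401 (registers the Stable Gym environments)

env = gym.make("stable_gym:CartPoleCost-v1")
obs, info = env.reset(seed=42)
for _ in range(250):
    action = env.action_space.sample()  # replace with a trained policy
    obs, cost, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()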

state

The current state.

Type:

numpy.ndarray

t

Current time step.

Type:

float

tau

The time step size. Also available as self.dt.

Type:

float

target_pos

The target position.

Type:

float

constraint_pos

The constraint position.

Type:

float

kinematics_integrator

The kinematics integrator used to update the state. Options are euler and semi-implicit euler.

Type:

str

theta_threshold_radians

The angle at which the pole is considered to be at a terminal state.

Type:

float

x_threshold

The position at which the cart is considered to be at a terminal state.

Type:

float

max_v

The maximum velocity of the cart.

Type:

float

max_w

The maximum angular velocity of the pole.

Type:

float

max_cost

The maximum cost.

Type:

float

Initialise a new CartPoleCost environment instance.

Parameters:
  • render_mode (str, optional) – Gym rendering mode. By default None.

  • max_cost (float, optional) – The maximum cost allowed before the episode is terminated. Defaults to 100.0.

  • clip_action (bool, optional) – Whether the actions should be clipped if they are greater than the set action limit. Defaults to True.

  • action_space_dtype (union[numpy.dtype, str], optional) – The data type of the action space. Defaults to np.float64.

  • observation_space_dtype (union[numpy.dtype, str], optional) – The data type of the observation space. Defaults to np.float64.

property total_mass

Property that returns the full mass of the system.

property _com_length

Property that returns the position of the center of mass.

property polemass_length

Property that returns the pole mass times the COM length.

property pole_mass_length

Alias for polemass_length.

property mass_pole

Alias for masspole.

property mass_cart

Alias for masscart.

property dt

Property that also makes the timestep available under the dt attribute.

property physics_time

Returns the physics time. Alias for t.

metadata

set_params(length, mass_of_cart, mass_of_pole, gravity)[source]

Sets the most important system parameters.

Parameters:
  • length (float) – The pole length.

  • mass_of_cart (float) – Cart mass.

  • mass_of_pole (float) – Pole mass.

  • gravity (float) – The gravity constant.

get_params()[source]

Retrieves the most important system parameters.

Returns:

tuple containing:

  • length (float): The pole length.

  • pole_mass (float): The pole mass.

  • cart_mass (float): The cart mass.

  • gravity (float): The gravity constant.

Return type:

(tuple)

reset_params()[source]

Resets the most important system parameters.
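
A sketch of adjusting the physical parameters through these methods (the keyword names follow the set_params() signature above; env.unwrapped is needed because gym.make() wraps the environment):

import gymnasium as gym
import stable_gym  # noqa: F401 (registers the Stable Gym environments)

env = gym.make("stable_gym:CartPoleCost-v1")
env.unwrapped.set_params(
    length=1.0, mass_of_cart=1.5, mass_of_pole=0.2, gravity=9.8
)
length, pole_mass, cart_mass, gravity = env.unwrapped.get_params()
env.unwrapped.reset_params()  # restore the default parameters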

cost(x, theta)[source]

Returns the cost for a given cart position (x) and a pole angle (theta).

Parameters:
  • x (float) – The current cart position.

  • theta (float) – The current pole angle (rads).

Returns:

tuple containing:

  • cost (float): The current cost.

Return type:

(tuple)

step(action)[source]

Take a step in the environment.

Parameters:

action (numpy.ndarray) – The action we want to perform in the environment.

Returns:

tuple containing:

  • obs (np.ndarray): Environment observation.

  • cost (float): Cost of the action.

  • terminated (bool): Whether the episode is terminated.

  • truncated (bool): Whether the episode was truncated. This value is set by wrappers when, for example, a time limit is reached or the agent goes out of bounds.

  • info (dict): Additional information about the environment.

Return type:

(tuple)

reset(seed=None, options=None, random=True)[source]

Reset gymnasium environment.

Parameters:
  • seed (int, optional) – A random seed for the environment. By default None.

  • options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Can be used to change the bounds used for determining the random initial state (see above).

  • random (bool, optional) – Whether we want to randomly initialise the environment. By default True.

Returns:

tuple containing:

  • obs (numpy.ndarray): Initial environment observation.

  • info (dict): Dictionary containing additional information.

Return type:

(tuple)

render()[source]

Render one frame of the environment.

close()[source]

Close down the viewer.

class stable_gym.envs.classic_control.CartPoleTrackingCost(render_mode=None, reference_target_position=0.0, reference_amplitude=7.0, reference_frequency=0.005, max_cost=100.0, clip_action=True, exclude_reference_from_observation=False, exclude_reference_error_from_observation=True, action_space_dtype=np.float64, observation_space_dtype=np.float64)[source]

Bases: gymnasium.Env

Custom CartPole Gymnasium environment.

Note

This environment can be used in a vectorized manner. Refer to the gym.vector documentation for details.

Source:

This environment is a modified version of the CartPole environment from the Farama Foundation’s Gymnasium package, first used by Han et al. in 2020. Modifications made by Han et al. include:

  • The action space is continuous, contrasting with the original discrete setting.

  • Offers an optional feature to confine actions within the defined action space, preventing the agent from exceeding set boundaries when activated.

  • The reward function is replaced with a (positive definite) cost function (negated reward), in line with Lyapunov stability theory. This cost is the difference between a state variable and a reference value (error).

  • Maximum cart force is increased from 10 to 20.

  • Episode length is reduced from 500 to 250.

  • A termination cost of \(c=100\) is introduced for early episode termination, to promote cost minimization.

  • The terminal angle limit is expanded from the original 12 degrees to 20 degrees, enhancing recovery potential.

  • The terminal position limit is extended from 2.4 meters to 10 meters, broadening the recovery range.

  • Velocity limits are adjusted from \(\pm \infty\) to \(\pm 50\), accelerating training.

  • Angular velocity termination threshold is lowered from \(\pm \infty\) to \(\pm 50\), likely for improved training efficiency.

  • Random initial state range is modified from [-0.05, 0.05] to [-5, 5] for the cart position and [-0.2, 0.2] for all other states, allowing for expanded exploration.

Additional modifications in our implementation:

  • Unlike the original environment’s fixed cost threshold of 100, this version allows users to adjust the maximum cost threshold via the max_cost input, improving training adaptability.

  • The gravity constant is adjusted back from 10 to the real-world value of 9.8, aligning it closer with the original CartPole environment.

  • The stabilization objective is replaced with a reference tracking task for enhanced control.

  • Two additional observations are introduced, facilitating reference tracking.

  • The info dictionary now provides extra information about the reference to be tracked.

  • The data types for action and observation spaces are set to np.float64, diverging from the np.float32 used by Han et al. 2020. This aligns the Gymnasium implementation with the original CartPole environment.

Observation:

Type: Box(4) to Box(6), depending on which of the optional reference observations are excluded (see the exclude_reference_from_observation and exclude_reference_error_from_observation arguments).

Num          | Observation                  | Min                    | Max
-------------+------------------------------+------------------------+---------------------
0            | Cart Position                | -20                    | 20
1            | Cart Velocity                | -50                    | 50
2            | Pole Angle                   | ~ -0.698 rad (-40 deg) | ~ 0.698 rad (40 deg)
3            | Pole Angular Velocity        | -50 rad/s              | 50 rad/s
4            | The cart position reference  | -20                    | 20
5 (optional) | The reference tracking error | -20                    | 20

Note

While the ranges above denote the possible values of each element of the observation space, they do not reflect the allowed values of the state space in an unterminated episode. In particular:

  • The cart x-position (index 0) can take values between (-20, 20), but the episode terminates if the cart leaves the (-10, 10) range.

  • The pole angle can be observed between (-0.698, 0.698) radians (or ±40°), but the episode terminates if the pole angle is not in the range (-0.349, 0.349) (or ±20°).

Actions:

Type: Box(1)

Num | Action               | Min | Max
----+----------------------+-----+----
0   | The controller force | -20 | 20

Note

The velocity that is reduced or increased by the applied force is not fixed; it depends on the angle the pole is pointing. The center of gravity of the pole varies the amount of energy needed to move the cart underneath it.

Cost:

A cost, computed using the CartPoleTrackingCost.cost() method, is given for each simulation step, including the terminal step. This cost is the error between a state variable and a reference value. The cost is set to the maximum cost when the episode is terminated. The cost is defined as:

\[cost = (x - x_{ref})^2 + (\theta / \theta_{threshold})^2\]
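
As an illustration, a minimal sketch of this cost, assuming theta_threshold equals the 20 degree termination limit listed in this document:

import math

THETA_THRESHOLD = math.radians(20.0)  # terminal pole angle (rad), assumed

def tracking_cost(x, x_ref, theta):
    # Squared position tracking error plus the normalised squared pole angle.
    return (x - x_ref) ** 2 + (theta / THETA_THRESHOLD) ** 2
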
Starting State:

The cart position is assigned a uniform random value in [-5, 5] and the other states are assigned uniform random values in [-0.2, 0.2].

Episode Termination:
  • Pole Angle is more than 20 degrees.

  • Cart Position is more than 10 m (center of the cart reaches the edge of the display).

  • Episode length is greater than 250.

  • The cost is greater than a threshold (100 by default). This threshold can be changed using the max_cost environment argument.

Solved Requirements:

Considered solved when the average cost is less than or equal to 50 over 100 consecutive trials.

How to use:
import stable_gym
import gymnasium as gym
env = gym.make("stable_gym:CartPoleTrackingCost-v1")

On reset, the options parameter allows the user to change the bounds used to determine the new random state when random=True.
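
A minimal interaction sketch; with the default arguments the observation also contains the cart position reference at index 4 (see the observation table above):

import gymnasium as gym
import stable_gym  # noqa: F401 (registers the Stable Gym environments)

env = gym.make("stable_gym:CartPoleTrackingCost-v1")
obs, info = env.reset(seed=42)
for _ in range(250):
    obs, cost, terminated, truncated, info = env.step(env.action_space.sample())
    x, x_ref = obs[0], obs[4]  # cart position and its reference
    if terminated or truncated:
        obs, info = env.reset()
env.close()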

state

The current state.

Type:

numpy.ndarray

t

Current time step.

Type:

float

tau

The time step size. Also available as self.dt.

Type:

float

target_pos

The target position.

Type:

float

constraint_pos

The constraint position.

Type:

float

kinematics_integrator

The kinematics integrator used to update the state. Options are euler and semi-implicit euler.

Type:

str

theta_threshold_radians

The angle at which the pole is considered to be at a terminal state.

Type:

float

x_threshold

The position at which the cart is considered to be at a terminal state.

Type:

float

max_v

The maximum velocity of the cart.

Type:

float

max_w

The maximum angular velocity of the pole.

Type:

float

max_cost

The maximum cost.

Type:

float

Initialise a new CartPoleTrackingCost environment instance.

Parameters:
  • render_mode (str, optional) – Gym rendering mode. By default None.

  • reference_target_position (float, optional) – The reference target position, by default 0.0 (i.e. the mean of the reference signal).

  • reference_amplitude (float, optional) – The reference amplitude, by default 7.0.

  • reference_frequency (float, optional) – The reference frequency, by default 0.005.

  • max_cost (float, optional) – The maximum cost allowed before the episode is terminated. Defaults to 100.0.

  • clip_action (bool, optional) – Whether the actions should be clipped if they are greater than the set action limit. Defaults to True.

  • exclude_reference_from_observation (bool, optional) – Whether the reference should be excluded from the observation. Defaults to False.

  • exclude_reference_error_from_observation (bool, optional) – Whether the error should be excluded from the observation. Defaults to True.

  • action_space_dtype (union[numpy.dtype, str], optional) – The data type of the action space. Defaults to np.float64.

  • observation_space_dtype (union[numpy.dtype, str], optional) – The data type of the observation space. Defaults to np.float64.

property total_mass

Property that returns the full mass of the system.

property _com_length

Property that returns the position of the center of mass.

property polemass_length

Property that returns the pole mass times the COM length.

property pole_mass_length

Alias for polemass_length.

property mass_pole

Alias for masspole.

property mass_cart

Alias for masscart.

property dt

Property that also makes the timestep available under the dt attribute.

property physics_time

Returns the physics time. Alias for t.

metadata

set_params(length, mass_of_cart, mass_of_pole, gravity)[source]

Sets the most important system parameters.

Parameters:
  • length (float) – The pole length.

  • mass_of_cart (float) – Cart mass.

  • mass_of_pole (float) – Pole mass.

  • gravity (float) – The gravity constant.

get_params()[source]

Retrieves the most important system parameters.

Returns:

tuple containing:

  • length (float): The pole length.

  • pole_mass (float): The pole mass.

  • cart_mass (float): The cart mass.

  • gravity (float): The gravity constant.

Return type:

(tuple)

reset_params()[source]

Resets the most important system parameters.

reference(t)[source]

Returns the current value of the periodic cart reference signal that is tracked by the cart-pole system.

Parameters:

t (float) – The current time step.

Returns:

The current reference value.

Return type:

float
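
The exact waveform is defined by the implementation; one plausible sketch, assuming a sinusoid built from the reference_target_position, reference_amplitude, and reference_frequency constructor arguments:

import math

def reference(t, target=0.0, amplitude=7.0, frequency=0.005):
    # Hypothetical sinusoidal cart position reference; check the source for
    # the actual waveform used by the environment.
    return target + amplitude * math.sin(2 * math.pi * frequency * t)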

cost(x, theta)[source]

Returns the cost for a given cart position (x) and a pole angle (theta).

Parameters:
  • x (float) – The current cart position.

  • theta (float) – The current pole angle (rads).

Returns:

tuple containing:

  • cost (float): The current cost.

  • r_1 (float): The current position reference.

Return type:

(tuple)

step(action)[source]

Take a step in the environment.

Parameters:

action (numpy.ndarray) – The action we want to perform in the environment.

Returns:

tuple containing:

  • obs (np.ndarray): Environment observation.

  • cost (float): Cost of the action.

  • terminated (bool): Whether the episode is terminated.

  • truncated (bool): Whether the episode was truncated. This value is set by wrappers when, for example, a time limit is reached or the agent goes out of bounds.

  • info (dict): Additional information about the environment.

Return type:

(tuple)

reset(seed=None, options=None, random=True)[source]

Reset gymnasium environment.

Parameters:
  • seed (int, optional) – A random seed for the environment. By default None.

  • options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Can be used to change the bounds used for determining the random initial state (see above).

  • random (bool, optional) – Whether we want to randomly initialise the environment. By default True.

Returns:

tuple containing:

  • obs (numpy.ndarray): Initial environment observation.

  • info (dict): Dictionary containing additional information.

Return type:

(tuple)

render()[source]

Render one frame of the environment.

close()[source]

Close down the viewer.

class stable_gym.envs.classic_control.Ex3EKF(render_mode=None, clipped_action=True)[source]

Bases: gymnasium.Env

Noisy master-slave system

Description:

The goal of the agent in the Ex3EKF environment is to act in such a way that the estimator perfectly estimates the original noisy system. By doing this, it serves as an RL-based stationary Kalman filter. First presented by Wu et al. 2023.

Observation:

Type: Box(4)

Num | Observation             | Min        | Max
----+-------------------------+------------+----------
0   | The estimated angle     | -10000 rad | 10000 rad
1   | The estimated frequency | -10000 Hz  | 10000 Hz
2   | Actual angle            | -10000 rad | 10000 rad
3   | Actual frequency        | -10000 Hz  | 10000 Hz

Actions:

Type: Box(2)

Num | Action
----+-----------------------------------------------
0   | First action coming from the RL Kalman filter
1   | Second action coming from the RL Kalman filter

Cost:

A cost, computed as the sum of the squared differences between the estimated and the actual states:

\[C = {(\hat{x}_1 - x_1)}^2 + {(\hat{x}_2 - x_2)}^2\]
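
As an illustration, a minimal sketch of this cost, assuming the observation layout of the table above (estimated state first, actual state second):

import numpy as np

def ex3ekf_cost(obs):
    # Sum of squared estimation errors for the two state variables.
    x_hat, x = obs[:2], obs[2:4]
    return float(np.sum((x_hat - x) ** 2))
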
Starting State:

All observations are assigned a uniform random value in [-0.05, 0.05].

Episode Termination:
  • When the step cost is higher than 100.

Solved Requirements:

Considered solved when the average cost is lower than 300.
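
A minimal interaction sketch, assuming the environment is registered as stable_gym:Ex3EKF-v1 in line with the other environments above:

import gymnasium as gym
import stable_gym  # noqa: F401 (registers the Stable Gym environments)

env = gym.make("stable_gym:Ex3EKF-v1")
obs, info = env.reset(seed=42)
for _ in range(100):
    obs, cost, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
env.close()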

state

The current system state.

Type:

numpy.ndarray

t

The current time step.

Type:

float

dt

The environment step size. Also available as tau.

Type:

float

sigma

The variance of the system noise.

Type:

float

Initialise a new Ex3EKF environment instance.

Parameters:
  • render_mode (str, optional) – The render mode you want to use. Defaults to None. Not used in this environment.

  • clipped_action (bool, optional) – Whether the actions should be clipped if they are greater than the set action limit. Defaults to True.

property tau

Alias for the environment step size. Done for compatibility with the other gymnasium environments.

property physics_time

Returns the physics time. Alias for t.

step(action)[source]

Take a step in the environment.

Parameters:

action (numpy.ndarray) – The action we want to perform in the environment.

Returns:

tuple containing:

  • obs (np.ndarray): Environment observation.

  • cost (float): Cost of the action.

  • terminated (bool): Whether the episode is terminated.

  • truncated (bool): Whether the episode was truncated. This value is set by wrappers when, for example, a time limit is reached or the agent goes out of bounds.

  • info (dict): Additional information about the environment.

Return type:

(tuple)

reset(seed=None, options=None)[source]

Reset gymnasium environment.

Parameters:
  • seed (int, optional) – A random seed for the environment. By default None.

  • options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Not used in this environment.

Returns:

tuple containing:

  • obs (numpy.ndarray): Initial environment observation.

  • info (dict): Dictionary containing additional information.

Return type:

(tuple)

reference(x)[source]

Returns the current value of the periodic reference signal that is tracked by the noisy master-slave system.

Parameters:

x (float) – The reference value.

Returns:

The current reference value.

Return type:

float

abstract render(mode='human')[source]

Render one frame of the environment.

Parameters:

mode (str, optional) – Gym rendering mode. The default mode will do something human friendly, such as pop up a window.

Raises:

NotImplementedError – Will throw a NotImplementedError since the render method has not yet been implemented.

Note

This method is currently not implemented.