stable_gym.envs.classic_control.cartpole_tracking_cost
Modified version of the cart-pole environment found in the gymnasium library. This modification was first described by Han et al. 2020. In this modified version:
The action space is continuous, whereas in the original version it is discrete.
The reward is replaced with a cost. This cost is defined as the difference between a state variable and a reference value (error).
The stabilization task was replaced with a reference tracking task.
Two additional observations are returned to enable reference tracking.
Some of the environment parameters were changed slightly.
The info dictionary returns extra information about the reference tracking task.
Submodules
Classes
Custom CartPole Gymnasium environment.
Package Contents
- class stable_gym.envs.classic_control.cartpole_tracking_cost.CartPoleTrackingCost(render_mode=None, reference_target_position=0.0, reference_amplitude=7.0, reference_frequency=0.005, max_cost=100.0, clip_action=True, exclude_reference_from_observation=False, exclude_reference_error_from_observation=True, action_space_dtype=np.float64, observation_space_dtype=np.float64)[source]
Bases:
gymnasium.Env
Custom CartPole Gymnasium environment.
Note
This environment can be used in a vectorized manner. Refer to the gym.vector documentation for details.
- Source:
This environment is a modified version of the CartPole environment from the Farama Foundation’s Gymnasium package, first used by Han et al. in 2020. Modifications made by Han et al. include:
The action space is continuous, contrasting with the original discrete setting.
Offers an optional feature to confine actions within the defined action space, preventing the agent from exceeding set boundaries when activated.
The reward function is replaced with a (positive definite) cost function (negated reward), in line with Lyapunov stability theory. This cost is the difference between a state variable and a reference value (error).
Maximum cart force is increased from 10 to 20.
Episode length is reduced from 500 to 250.
A termination cost of \(c=100\) is introduced for early episode termination, to promote cost minimization.
The terminal angle limit is expanded from the original 12 degrees to 20 degrees, enhancing recovery potential.
The terminal position limit is extended from 2.4 meters to 10 meters, broadening the recovery range.
Velocity limits are adjusted from \(\pm \infty\) to \(\pm 50\), accelerating training.
Angular velocity termination threshold is lowered from \(\pm \infty\) to \(\pm 50\), likely for improved training efficiency.
Random initial state range is modified from [-0.05, 0.05] to [-5, 5] for the cart position and [-0.2, 0.2] for all other states, allowing for expanded exploration.
Additional modifications in our implementation:
Unlike the original environment’s fixed cost threshold of 100, this version allows users to adjust the maximum cost threshold via the max_cost input, improving training adaptability.
The gravity constant is adjusted back from 10 to the real-world value of 9.8, aligning it closer with the original CartPole environment.
The stabilization objective is replaced with a reference tracking task for enhanced control.
Two additional observations are introduced, facilitating reference tracking.
The info dictionary now provides extra information about the reference to be tracked.
The data types for action and observation spaces are set to np.float64, diverging from the np.float32 used by Han et al. 2020. This aligns the Gymnasium implementation with the original CartPole environment.
- Observation:
Type: Box(4) or Box(6)

Num | Observation | Min | Max
0 | Cart Position | -20 | 20
1 | Cart Velocity | -50 | 50
2 | Pole Angle | ~ -.698 rad (-40 deg) | ~ .698 rad (40 deg)
3 | Pole Angular Velocity | -50 rad/s | 50 rad/s
4 | The cart position reference (optional) | -20 | 20
5 | The reference tracking error (optional) | -20 | 20
Note
While the ranges above denote the possible values for the observation space of each element, they are not reflective of the allowed values of the state space in an un-terminated episode. In particular:
The cart x-position (index 0) can take values between (-20, 20), but the episode terminates if the cart leaves the (-10, 10) range.
The pole angle can be observed between (-0.698, .698) radians (or ±40°), but the episode terminates if the pole angle is not in the range (-.349, .349) (or ±20°).
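The optional entries in the observation table are controlled by the exclude_reference_from_observation and exclude_reference_error_from_observation constructor flags. The following sketch of how the observation vector could be assembled is illustrative only (build_observation is not part of the package's API):

```python
import numpy as np

def build_observation(x, x_dot, theta, theta_dot, x_ref,
                      exclude_reference=False, exclude_reference_error=True):
    """Assemble the observation; shows why its size varies with the flags."""
    obs = [x, x_dot, theta, theta_dot]
    if not exclude_reference:
        obs.append(x_ref)          # index 4: cart position reference
    if not exclude_reference_error:
        obs.append(x - x_ref)      # index 5: reference tracking error
    return np.array(obs, dtype=np.float64)

print(build_observation(0.0, 0.0, 0.0, 0.0, 1.0).shape)  # (5,) with the default flags
```

With both flags enabled the observation shrinks to the classic 4-dimensional CartPole state; with both disabled it grows to 6 dimensions.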
- Actions:
Type: Box(1)

Num | Action | Min | Max
0 | The controller force | -20 | 20
Note
The velocity that is reduced or increased by the applied force is not fixed; it depends on the angle the pole is pointing, because the pole’s center of gravity changes the amount of energy needed to move the cart underneath it.
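With clip_action=True, forces outside the Box(1) bounds are confined to the action space before being applied. A minimal sketch of that behaviour, assuming the documented force limit of 20 (the clip_action helper below is illustrative, not the package's code):

```python
import numpy as np

FORCE_MAG = 20.0  # documented maximum cart force

def clip_action(action, limit=FORCE_MAG):
    """Confine a requested force to the action-space bounds."""
    return np.clip(action, -limit, limit)

print(clip_action(np.array([35.0])))  # [20.]
```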
- Cost:
A cost, computed using the CartPoleTrackingCost.cost() method, is given for each simulation step, including the terminal step. This cost is defined as the error between a state variable and a reference value, and it is set to the maximum cost when the episode is terminated. The cost is defined as:

\[cost = (x - x_{ref})^2 + (\theta / \theta_{threshold})^2\]

- Starting State:
The position is assigned a random value in [-5, 5] and the other states are assigned a uniform random value in [-0.2, 0.2].
- Episode Termination:
Pole angle is more than 20 degrees (the terminal angle limit).
Cart position is more than 10 m (center of the cart reaches the edge of the display).
Episode length is greater than 250.
The cost is greater than a threshold (100 by default). This threshold can be changed using the max_cost environment argument.
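The cost formula above can be sketched in a few lines. This is a minimal illustration; the 20-degree threshold and the name tracking_cost are assumptions, not the package's API, and the real environment additionally returns max_cost on termination:

```python
import math

THETA_THRESHOLD = math.radians(20)  # assumed terminal angle of 20 degrees

def tracking_cost(x, x_ref, theta, theta_threshold=THETA_THRESHOLD):
    """Squared position-tracking error plus normalised squared pole angle."""
    return (x - x_ref) ** 2 + (theta / theta_threshold) ** 2

# Perfect tracking with an upright pole gives zero cost.
print(tracking_cost(x=0.0, x_ref=0.0, theta=0.0))  # 0.0
```

Note that the angle term is normalised by the threshold, so a pole sitting exactly at the terminal angle contributes a cost of 1 regardless of the threshold value.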
- Solved Requirements:
Considered solved when the average cost is less than or equal to 50 over 100 consecutive trials.
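The solved criterion can be checked mechanically; a small helper sketch (solved is a hypothetical name, not part of the package):

```python
def solved(costs, threshold=50.0, window=100):
    """Return True when the mean cost over the last `window` trials is <= threshold."""
    recent = costs[-window:]
    return len(recent) == window and sum(recent) / window <= threshold

print(solved([40.0] * 100))  # True
```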
- How to use:
import stable_gym
import gymnasium as gym

env = gym.make("stable_gym:CartPoleTrackingCost-v1")
On reset, the options parameter allows the user to change the bounds used to determine the new random state when random=True.
- state
The current state.
- Type: numpy.ndarray
- kinematics_integrator
The kinematics integrator used to update the state. Options are euler and semi-implicit euler.
- Type: str
- theta_threshold_radians
The angle at which the pole is considered to be at a terminal state.
- Type: float
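The two kinematics_integrator options differ only in whether the position update uses the old or the freshly updated velocity. A generic sketch of the two schemes (the step functions are illustrative, with dt matching the documented tau = 0.02):

```python
def euler_step(x, v, a, dt=0.02):
    """Explicit Euler: position is advanced with the *old* velocity."""
    return x + dt * v, v + dt * a

def semi_implicit_euler_step(x, v, a, dt=0.02):
    """Semi-implicit Euler: velocity is updated first, then used for position."""
    v_new = v + dt * a
    return x + dt * v_new, v_new

# With non-zero acceleration the two schemes diverge slightly each step.
print(euler_step(0.0, 1.0, 2.0))
print(semi_implicit_euler_step(0.0, 1.0, 2.0))
```

The semi-implicit variant is often preferred in physics simulation because it conserves energy better over long horizons.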
Initialise a new CartPoleTrackingCost environment instance.
- Parameters:
render_mode (str, optional) – Gym rendering mode. By default None.
reference_target_position (float, optional) – The reference target position (i.e. the mean of the reference signal). By default 0.0.
reference_amplitude (float, optional) – The reference amplitude. By default 7.0.
reference_frequency (float, optional) – The reference frequency. By default 0.005.
max_cost (float, optional) – The maximum cost allowed before the episode is terminated. Defaults to 100.0.
clip_action (bool, optional) – Whether the actions should be clipped if they are greater than the set action limit. Defaults to True.
exclude_reference_from_observation (bool, optional) – Whether the reference should be excluded from the observation. Defaults to False.
exclude_reference_error_from_observation (bool, optional) – Whether the reference error should be excluded from the observation. Defaults to True.
action_space_dtype (union[numpy.dtype, str], optional) – The data type of the action space. Defaults to np.float64.
observation_space_dtype (union[numpy.dtype, str], optional) – The data type of the observation space. Defaults to np.float64.
- metadata
- render_mode
- max_cost
- _clip_action
- _action_space_dtype
- _observation_space_dtype
- _action_dtype_conversion_warning = False
- force_mag = 20
- tau = 0.02
- kinematics_integrator = 'euler'
- theta_threshold_radians
- x_threshold = 10
- max_v = 50
- max_w = 50
- _exclude_reference_from_observation
- _exclude_reference_error_from_observation
- high
- action_space
- observation_space
- reward_range
- screen_width = 600
- screen_height = 400
- screen = None
- clock = None
- isopen = True
- state = None
- steps_beyond_terminated = None
- t = 0
- _action_clip_warning = False
- _init_state
- _init_state_range
- reference_target_pos
- reference_amplitude
- reference_frequency
- set_params(length, mass_of_cart, mass_of_pole, gravity)[source]
Sets the most important system parameters.
- get_params()[source]
Retrieves the most important system parameters.
- Returns:
tuple containing the current system parameters: the pole length, the cart mass, the pole mass, and the gravity constant (the same quantities accepted by set_params()).
- Return type:
(tuple)
- reference(t)[source]
Returns the current value of the periodic cart reference signal that is tracked by the cart-pole system.
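Given the constructor arguments (target position, amplitude, frequency), the reference is presumably a sinusoid; a plausible sketch, where the exact waveform and phase used by the package are assumptions:

```python
import math

def reference(t, target=0.0, amplitude=7.0, frequency=0.005):
    """Periodic cart-position reference: a sine oscillating around `target`."""
    return target + amplitude * math.sin(2 * math.pi * frequency * t)

print(reference(0.0))  # 0.0 — at t=0 the signal starts at the target position
```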
- cost(x, theta)[source]
Returns the cost for a given cart position (x) and pole angle (theta).
- Parameters:
x (float) – The current cart position.
theta (float) – The current pole angle (rads).
- Returns:
tuple containing:
cost (float): The current cost.
r_1 (float): The current position reference.
- Return type:
(tuple)
- step(action)[source]
Take step into the environment.
- Parameters:
action (numpy.ndarray) – The action we want to perform in the environment.
- Returns:
tuple containing:
obs (np.ndarray): Environment observation.
cost (float): Cost of the action.
terminated (bool): Whether the episode is terminated.
truncated (bool): Whether the episode was truncated. This value is set by wrappers when, for example, a time limit is reached or the agent goes out of bounds.
info (dict): Additional information about the environment.
- Return type:
(tuple)
- reset(seed=None, options=None, random=True)[source]
Reset gymnasium environment.
- Parameters:
seed (int, optional) – A random seed for the environment. By default None.
options (dict, optional) – A dictionary containing additional options for resetting the environment. By default None. Not used in this environment.
random (bool, optional) – Whether we want to randomly initialise the environment. By default True.
- Returns:
tuple containing:
obs (numpy.ndarray): Initial environment observation.
info (dict): Dictionary containing additional information.
- Return type:
(tuple)
- property total_mass
- Property that returns the full mass of the system.
- property _com_length
- Property that returns the position of the center of mass.
- property polemass_length
- Property that returns the pole mass times the COM length.
- property pole_mass_length
- Alias for :attr:`polemass_length`.
- property mass_pole
- Alias for :attr:`masspole`.
- property mass_cart
- Alias for :attr:`masscart`.
- property dt
- Property that also makes the timestep available under the :attr:`dt` attribute.
- property physics_time
- Returns the physics time. Alias for :attr:`.t`.