CartPoleCost gymnasium environment
Attention
If you’re using this environment to reproduce the results of Han et al. (2020), please note that slight differences may occur due to the modifications mentioned below. For an accurate reproduction, refer to the separate han2020 branch, which mirrors the environment used in their study. It can be accessed here.
An unactuated joint attaches a pole to a cart, which moves along a frictionless track. This environment is a modified version of the CartPole-v1 environment found in the Gymnasium package, with several key alterations:
The action space is continuous, contrasting with the original discrete setting.
An optional feature confines actions to the defined action space, preventing the agent from exceeding the set bounds when enabled.
The reward function is replaced with a (positive definite) cost function (negated reward), in line with Lyapunov stability theory.
Maximum cart force is increased from 10 to 20.
Episode length is reduced from 500 to 250.
A termination cost of c=100 is introduced for early episode termination, to promote cost minimization.
The terminal angle limit is expanded from the original 12 degrees to 20 degrees, enhancing recovery potential.
The terminal position limit is extended from 2.4 meters to 10 meters, broadening the recovery range.
Velocity limits are adjusted from ±∞ to ±50, accelerating training.
Angular velocity termination threshold is lowered from ±∞ to ±50, likely for improved training efficiency.
Random initial state range is modified from [-0.05, 0.05] to [-5, 5] for the cart position and [-0.2, 0.2] for all other states, allowing for expanded exploration.
The info dictionary is expanded to include the reference state, state of interest, and reference error.
Additional modifications in our implementation:
Unlike the original environment’s fixed cost threshold of 100, this version allows users to adjust the maximum cost threshold, improving training adaptability.
The gravity constant is adjusted back from 10 to the real-world value of 9.8, aligning it closer with the original CartPole environment.
The data types for action and observation spaces are set to np.float64, diverging from the np.float32 used by Han et al. 2020. This aligns the Gymnasium implementation with the original CartPole environment.
These modifications were first described in Han et al. 2020 and further adapted in our version for enhanced training and exploration.
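Two of the changes listed above, action clipping and the wider initial-state range, can be sketched in a few lines. This is a minimal illustration using only the standard library, not the package's implementation: the names `clip_action` and `sample_initial_state` are hypothetical, the bounds are the values quoted in the list, and uniform sampling is an assumption.

```python
import random

MAX_FORCE = 20.0  # maximum cart force quoted in the list above


def clip_action(force, max_force=MAX_FORCE):
    """Confine a continuous force command to [-max_force, max_force]."""
    return max(-max_force, min(max_force, force))


def sample_initial_state(rng=random):
    """Draw [x, x_dot, w, w_dot] from the ranges quoted above.

    Uniform sampling is an assumption made for this sketch.
    """
    x = rng.uniform(-5.0, 5.0)  # cart position
    others = [rng.uniform(-0.2, 0.2) for _ in range(3)]  # remaining states
    return [x] + others
```

Passing a seeded `random.Random` instance as `rng` makes the sampled initial state reproducible.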
Observation space
The environment returns the following observation:
\(x\) - Cart Position.
\(x_{dot}\) - Cart Velocity.
\(w\) - Pole angle.
\(w_{dot}\) - Pole angle velocity.
Action space
u1: The x-force applied on the cart.
Episode Termination
An episode is terminated when:
Pole Angle is more than 20 degrees.
Cart Position is more than 10 m (center of the cart reaches the edge of the display).
Episode length is greater than 250.
The cost is greater than a set threshold (100 by default). This threshold can be changed using the max_cost environment argument.
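The conditions above can be sketched as a single check. This is an illustration under stated assumptions rather than the environment's code: `episode_status` is a hypothetical helper, the angle is assumed to be in radians, the limits are assumed to be symmetric around zero, and the step limit is passed in explicitly and reported as truncation, following the usual Gymnasium split between termination and truncation.

```python
import math


def episode_status(x, angle_rad, cost, step, max_steps,
                   x_limit=10.0, angle_limit_deg=20.0, max_cost=100.0):
    """Return (terminated, truncated) for the current step.

    Assumes symmetric limits and an angle given in radians.
    """
    terminated = (
        abs(x) > x_limit                                   # cart left the track
        or abs(angle_rad) > math.radians(angle_limit_deg)  # pole fell too far
        or cost > max_cost                                 # cost threshold exceeded
    )
    truncated = step >= max_steps  # step limit reached
    return terminated, truncated
```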
Environment goals
The goal is similar to the original CartPole-v1 environment. The pole starts upright, and the goal is to prevent it from falling over by increasing and reducing the force applied to the cart, without the cart violating its position constraints. These constraints are defined in the cost function.
Cost function
The cost function of this environment penalizes the error between the current cart position and pole angle and their zero references.
The cost is between 0 and a set threshold value in both tasks, and the maximum cost is used when the episode is terminated.
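The text above does not spell out the exact expression, so the following is only one plausible shape for such a cost: a weighted sum of squared, normalized position and angle errors, capped at the threshold. Everything here is an assumption made for illustration, including the function name `sketch_cost`, the normalization by the termination limits, and the placeholder weights `w_pos` and `w_angle`.

```python
import math


def sketch_cost(x, angle_rad, terminated=False, max_cost=100.0,
                x_limit=10.0, angle_limit_deg=20.0,
                w_pos=1.0, w_angle=1.0):
    """Hypothetical positive definite cost around the zero reference.

    The weights and the normalization are placeholders, not values
    taken from the environment.
    """
    if terminated:
        # The maximum cost is used when the episode is terminated.
        return max_cost
    angle_limit = math.radians(angle_limit_deg)
    cost = w_pos * (x / x_limit) ** 2 + w_angle * (angle_rad / angle_limit) ** 2
    return min(cost, max_cost)  # keep the cost within [0, max_cost]
```

The cost is zero only at the zero position and angle and grows as the state moves away from that reference, which is the property the Lyapunov-based framing relies on.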
Environment step return
In addition to the observations, the cost and a termination and truncation boolean, the environment also returns an info dictionary:
[observation, cost, termination, truncation, info_dict]
The info dictionary contains the following keys:
reference: The set cart position and angle reference (i.e. the zero position and angle).
state_of_interest: The state that should track the reference (SOI).
reference_error: The error between SOI and the reference.
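As a rough picture of what such a dictionary might hold, the sketch below builds one from a cart position and pole angle. Only the three key names come from the list above; the helper `build_info`, the tuple layout, and the sign convention of the error (state of interest minus reference) are assumptions.

```python
def build_info(x, angle, reference=(0.0, 0.0)):
    """Assemble a hypothetical info dictionary for one step."""
    state_of_interest = (x, angle)  # the states that should track the reference
    # Assumed sign convention: error = state of interest - reference.
    reference_error = tuple(
        s - r for s, r in zip(state_of_interest, reference)
    )
    return {
        "reference": tuple(reference),
        "state_of_interest": state_of_interest,
        "reference_error": reference_error,
    }
```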
How to use
This environment is part of the Stable Gym package. It is therefore registered as the stable_gym:CartPoleCost-v1 gymnasium environment when you import the Stable Gym package. If you want to use the environment in stand-alone mode, you can register it yourself.
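Once an environment instance exists, interacting with it follows the standard Gymnasium loop. The helper below is a minimal sketch that works with any object exposing the Gymnasium `reset` and `step` methods; `rollout` and `policy` are hypothetical names, and `max_steps` is only a safety cap for the loop, not the environment's own episode limit.

```python
def rollout(env, policy, max_steps=1000):
    """Run one episode and return the accumulated cost."""
    obs, info = env.reset()
    total_cost = 0.0
    for _ in range(max_steps):
        action = policy(obs)  # map the observation to a cart force
        obs, cost, terminated, truncated, info = env.step(action)
        total_cost += float(cost)
        if terminated or truncated:
            break
    return total_cost
```

With Gymnasium and Stable Gym installed, `env` would typically be created with `gymnasium.make("stable_gym:CartPoleCost-v1")`, and a random policy such as `lambda obs: env.action_space.sample()` is enough to check that the loop runs.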