==================
Environment Module
==================

.. module:: src.state_action_reward
   :synopsis: Gymnasium RL environment for UNO

The environment module provides the reinforcement learning (RL) interface to the UNO game.

UnoEnv Class
============

.. class:: UnoEnv()

   Gymnasium environment for UNO.

   This environment wraps the UNO game for reinforcement learning.

   .. attribute:: observation_space
      :type: gym.spaces.Box

      17-dimensional continuous observation space.
      Shape: (17,), Range: [0, 1]

   .. attribute:: action_space
      :type: gym.spaces.Discrete

      9 discrete actions.

   .. method:: reset(seed=None, options=None)

      Reset the environment for a new game.

      :param seed: Random seed for reproducibility
      :type seed: Optional[int]
      :param options: Additional options (unused)
      :type options: Optional[dict]
      :returns: Initial observation and info dict
      :rtype: Tuple[np.ndarray, dict]

   .. method:: step(action)

      Execute an action in the environment.

      :param action: Action index (0-8)
      :type action: int
      :returns: (observation, reward, done, truncated, info)
      :rtype: Tuple[np.ndarray, float, bool, bool, dict]

   .. method:: get_valid_actions()

      Get the list of action indices that are valid in the current state.

      :returns: Valid action indices
      :rtype: List[int]

   .. method:: render(mode='human')

      Render the current game state.

      :param mode: Render mode ('human' or 'ansi')
      :type mode: str

Observation Space
=================

The 17-dimensional observation encodes:

.. list-table::
   :header-rows: 1
   :widths: 15 40 20 25

   * - Index
     - Description
     - Range
     - Encoding
   * - 0-3
     - Open card color
     - [0, 1]
     - One-hot
   * - 4-7
     - Cards per color in hand
     - [0, 1]
     - Normalized count
   * - 8-10
     - Special cards (Skip/Reverse/+2)
     - [0, 1]
     - Normalized count
   * - 11-12
     - Wild cards
     - [0, 1]
     - Normalized count
   * - 13-16
     - Playable colors
     - [0, 1]
     - Binary

Action Space
============

.. list-table::
   :header-rows: 1
   :widths: 15 30 55

   * - Index
     - Action
     - Description
   * - 0
     - RED
     - Play any red card from hand
   * - 1
     - GREEN
     - Play any green card from hand
   * - 2
     - BLUE
     - Play any blue card from hand
   * - 3
     - YELLOW
     - Play any yellow card from hand
   * - 4
     - SKIP
     - Play skip card (any color)
   * - 5
     - REVERSE
     - Play reverse card (any color)
   * - 6
     - DRAW2
     - Play draw two card (any color)
   * - 7
     - DRAW4
     - Play wild draw four
   * - 8
     - WILD
     - Play wild card (choose best color)

Reward Structure
================

The reward is sparse: it is zero until the game ends, then +1 if the agent
(player 0) wins and -1 otherwise.

.. code-block:: python

   def _get_reward(self, done, winner):
       if not done:
           return 0.0
       if winner == 0:  # Agent wins
           return 1.0
       return -1.0  # Agent loses

MultiplayerUnoEnv Class
=======================

.. class:: MultiplayerUnoEnv(num_players=4)

   Extended environment for 2-4 players.

   :param num_players: Number of players (2-4)
   :type num_players: int

   Extends the observation to 25 dimensions to include opponent hand sizes.

   .. attribute:: observation_space
      :type: gym.spaces.Box

      25-dimensional observation space for multiplayer.

   .. attribute:: direction
      :type: int

      Current turn direction (1=clockwise, -1=counter-clockwise)

   .. attribute:: current_player
      :type: int

      Index of current player (0 to num_players-1)

Example Usage
=============

Basic Usage
-----------

.. code-block:: python

   from src.state_action_reward import UnoEnv
   import numpy as np

   # Create environment
   env = UnoEnv()

   # Reset for new game
   obs, info = env.reset(seed=42)
   print(f"Initial observation shape: {obs.shape}")

   # Game loop: stop when the episode terminates or is truncated
   done = False
   truncated = False
   total_reward = 0.0

   while not (done or truncated):
       # Get valid actions
       valid_actions = env.get_valid_actions()

       # Random policy restricted to valid actions
       action = int(np.random.choice(valid_actions))

       # Step
       obs, reward, done, truncated, info = env.step(action)
       total_reward += reward

   print(f"Game over! Total reward: {total_reward}")

With Stable-Baselines3
----------------------

.. code-block:: python

   from src.state_action_reward import UnoEnv
   from sb3_contrib import RecurrentPPO

   # Create environment
   env = UnoEnv()

   # Train model
   model = RecurrentPPO("MlpLstmPolicy", env, verbose=1)
   model.learn(total_timesteps=100000)

   # Evaluate one episode, carrying the LSTM state between steps
   obs, _ = env.reset()
   lstm_state = None
   episode_reward = 0.0

   while True:
       action, lstm_state = model.predict(obs, state=lstm_state, deterministic=True)
       # predict() returns a NumPy value; step() expects a plain int
       obs, reward, done, truncated, _ = env.step(int(action))
       episode_reward += reward
       if done or truncated:
           break

   print(f"Episode reward: {episode_reward}")

Multiplayer Usage
-----------------

.. code-block:: python

   from src.multiplayer_env import MultiplayerUnoEnv

   # 4-player game
   env = MultiplayerUnoEnv(num_players=4)
   obs, info = env.reset()

   print(f"Observation shape: {obs.shape}")  # (25,)
   print(f"Current player: {env.current_player}")
   print(f"Turn direction: {env.direction}")
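
Decoding an Observation
-----------------------

When debugging a policy it can help to split the flat 17-dimensional
observation back into the segments listed in the Observation Space table.
The following is a minimal sketch, not part of the module: ``OBS_LAYOUT``,
``decode_observation``, and the field names are hypothetical helpers chosen
for illustration, and only the index ranges come from the table. The order of
colors within each segment is not specified here, so the sketch does not
assign color names.

.. code-block:: python

   import numpy as np

   # Index ranges copied from the observation table. The field names are
   # illustrative assumptions and are not part of the UnoEnv API.
   OBS_LAYOUT = {
       "open_card_color": slice(0, 4),    # one-hot
       "cards_per_color": slice(4, 8),    # normalized counts
       "special_cards": slice(8, 11),     # Skip / Reverse / +2
       "wild_cards": slice(11, 13),       # normalized counts
       "playable_colors": slice(13, 17),  # binary flags
   }


   def decode_observation(obs):
       """Split a 17-dimensional observation into named segments."""
       obs = np.asarray(obs, dtype=np.float32)
       if obs.shape != (17,):
           raise ValueError(f"expected shape (17,), got {obs.shape}")
       return {name: obs[idx] for name, idx in OBS_LAYOUT.items()}


   # Hand-built example vector (not produced by the environment).
   example = np.zeros(17, dtype=np.float32)
   example[0] = 1.0   # first slot of the open-card one-hot
   example[13] = 1.0  # first playable-color flag

   fields = decode_observation(example)
   for name, values in fields.items():
       print(f"{name}: {values}")

In practice the vector returned by ``env.reset()`` or ``env.step()`` would be
passed in place of ``example``. The 25-dimensional multiplayer observation is
rejected by the shape check, since its layout is not documented above.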