Environment Module

The environment module provides the RL interface to the UNO game.

UnoEnv Class

class src.state_action_reward.UnoEnv

Gymnasium environment for UNO.

This environment wraps the UNO game for reinforcement learning.

observation_space: gym.spaces.Box: 17-dimensional continuous observation space. Shape: (17,), Range: [0, 1]

action_space: gym.spaces.Discrete: 9 discrete actions.

reset(seed=None, options=None)

Reset the environment for a new game.

Parameters:

seed (Optional[int]) – Random seed for reproducibility
options (Optional[dict]) – Additional options (unused)

Returns:

Initial observation and info dict

Return type:

Tuple[np.ndarray, dict]

step(action)

Execute an action in the environment.

Parameters: action (int) – Action index (0-8)
Returns: (observation, reward, done, truncated, info)
Return type: Tuple[np.ndarray, float, bool, bool, dict]

get_valid_actions()

Get list of valid action indices.

Returns: Valid action indices
Return type: List[int]
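
A common use of get_valid_actions() is to restrict a policy to legal moves. The following is a minimal sketch of that idea, not code from the module: best_valid_action and the scores list are hypothetical names, and the sketch assumes the list of valid actions is non-empty.

```python
from typing import List, Sequence


def best_valid_action(scores: Sequence[float], valid_actions: List[int]) -> int:
    """Return the valid action index with the highest score.

    `scores` is a hypothetical per-action preference vector of length 9
    (for example, policy logits); `valid_actions` is the kind of list
    that get_valid_actions() is documented to return.
    """
    if not valid_actions:
        # Assumption: the caller decides what to do when nothing is playable.
        raise ValueError("no valid actions available")
    return max(valid_actions, key=lambda a: scores[a])


# Action 1 has the highest score overall, but only 0, 4 and 8 are valid here.
scores = [0.2, 0.9, 0.1, 0.0, 0.5, 0.3, 0.0, 0.0, 0.4]
chosen = best_valid_action(scores, [0, 4, 8])
print(chosen)
```

In a real loop, valid_actions would come from env.get_valid_actions() and the chosen index would be passed to env.step().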

render(mode='human')

Render the current game state.

Parameters: mode (str) – Render mode ('human' or 'ansi')

Observation Space

The 17-dimensional observation encodes:

Index	Description	Range	Encoding
0-3	Open card color	[0, 1]	One-hot
4-7	Cards per color in hand	[0, 1]	Normalized count
8-10	Special cards (Skip/Reverse/Draw Two)	[0, 1]	Normalized count
11-12	Wild cards	[0, 1]	Normalized count
13-16	Playable colors	[0, 1]	Binary
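
To make the layout concrete, the sketch below splits a 17-element observation into the five groups listed in the table. The index ranges come from the table; the field names and the split_observation helper are illustrative assumptions, not part of the module, and the order of colors within each group is not specified here.

```python
from typing import Dict, List, Sequence

# Index ranges taken from the observation table; field names are hypothetical.
OBS_LAYOUT = {
    "open_card_color": slice(0, 4),     # one-hot
    "hand_color_counts": slice(4, 8),   # normalized counts
    "special_counts": slice(8, 11),     # normalized counts
    "wild_counts": slice(11, 13),       # normalized counts
    "playable_colors": slice(13, 17),   # binary flags
}


def split_observation(obs: Sequence[float]) -> Dict[str, List[float]]:
    """Split a flat 17-element observation into named groups."""
    if len(obs) != 17:
        raise ValueError(f"expected 17 values, got {len(obs)}")
    return {name: list(obs[part]) for name, part in OBS_LAYOUT.items()}


# A made-up observation used only to exercise the helper.
example = [1, 0, 0, 0, 0.5, 0.0, 0.25, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 1, 0, 0, 0]
fields = split_observation(example)
print(fields["open_card_color"])
print(fields["special_counts"])
```

The helper accepts any sequence that supports slicing, so the np.ndarray returned by reset() or step() can be passed in directly.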

Action Space

Index	Action	Description
0	RED	Play any red card from hand
1	GREEN	Play any green card from hand
2	BLUE	Play any blue card from hand
3	YELLOW	Play any yellow card from hand
4	SKIP	Play skip card (any color)
5	REVERSE	Play reverse card (any color)
6	DRAW2	Play draw two card (any color)
7	DRAW4	Play wild draw four
8	WILD	Play wild card (choose best color)
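
When logging or debugging, named constants are easier to read than raw indices. The enum below mirrors the table above; UnoAction and describe are hypothetical names introduced for illustration, and the module may expose its own constants instead.

```python
from enum import IntEnum
from typing import Iterable, List


class UnoAction(IntEnum):
    """Action indices as listed in the action table (illustrative enum)."""
    RED = 0
    GREEN = 1
    BLUE = 2
    YELLOW = 3
    SKIP = 4
    REVERSE = 5
    DRAW2 = 6
    DRAW4 = 7
    WILD = 8


def describe(action_indices: Iterable[int]) -> List[str]:
    """Map action indices to their names from the table."""
    return [UnoAction(int(i)).name for i in action_indices]


print(describe([0, 7, 8]))
```

Because IntEnum members compare equal to plain integers, a member such as UnoAction.SKIP can be used wherever an action index is expected.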

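Reward Structure

The reward is sparse and terminal: 0.0 on every step until the game ends, then 1.0 if the agent (player 0) wins and -1.0 otherwise.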

def _get_reward(self, done, winner):
    if not done:
        return 0.0
    if winner == 0:  # Agent wins
        return 1.0
    return -1.0  # Agent loses
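
Because the only nonzero reward arrives on the final step, the discounted return at earlier steps is just that terminal reward scaled by a power of the discount factor. The sketch below illustrates this with a generic helper; discounted_returns and the gamma value are assumptions for illustration and are not part of the module.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Compute the discounted return G_t for every step of one episode."""
    returns: List[float] = []
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# A four-step game that the agent wins: reward is 0.0 until the last step.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

With a discount factor close to 1, early moves in a long game still receive most of the terminal signal; a smaller factor concentrates credit on the final moves.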

MultiplayerUnoEnv Class

class src.state_action_reward.MultiplayerUnoEnv(num_players=4)

Extended environment for 2-4 players.

Parameters: num_players (int) – Number of players (2-4)

Extends observation to 25 dimensions to include opponent hand sizes.

observation_space: gym.spaces.Box: 25-dimensional observation space for multiplayer.

direction: int: Current turn direction (1=clockwise, -1=counter-clockwise)

current_player: int: Index of current player (0 to num_players-1)
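
The two attributes above are enough to work out whose turn would come next in the plain case. The sketch below assumes the turn simply advances by direction modulo num_players; next_player is a hypothetical helper, and the environment's actual turn logic (for example, the effect of Skip or Reverse cards) may differ.

```python
def next_player(current_player: int, direction: int, num_players: int) -> int:
    """Return the index of the next player, wrapping around the table.

    Assumes direction is 1 (clockwise) or -1 (counter-clockwise), as
    documented for the `direction` attribute.
    """
    if direction not in (1, -1):
        raise ValueError("direction must be 1 or -1")
    # Python's % always yields a non-negative result for a positive modulus,
    # so stepping back from player 0 wraps to the last player.
    return (current_player + direction) % num_players


print(next_player(3, 1, 4))   # clockwise from the last seat wraps to 0
print(next_player(0, -1, 4))  # counter-clockwise from seat 0 wraps to 3
```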

Example Usage

Basic Usage

from src.state_action_reward import UnoEnv
import numpy as np

# Create environment
env = UnoEnv()

# Reset for new game
obs, info = env.reset(seed=42)
print(f"Initial observation shape: {obs.shape}")

# Game loop
done = False
total_reward = 0

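while not done:
    # Get valid actions
    valid_actions = env.get_valid_actions()

    # Random policy: sample uniformly among the valid actions
    action = int(np.random.choice(valid_actions))

    # Step; stop when the game ends or the episode is truncated
    obs, reward, done, truncated, info = env.step(action)
    done = done or truncated
    total_reward += reward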

print(f"Game over! Total reward: {total_reward}")

With Stable-Baselines3

from src.state_action_reward import UnoEnv
from sb3_contrib import RecurrentPPO

# Create environment
env = UnoEnv()

# Train model
model = RecurrentPPO("MlpLstmPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

# Evaluate
obs, _ = env.reset()
lstm_state = None
episode_reward = 0

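while True:
    # predict() returns the action as a NumPy value; convert it to a plain int
    action, lstm_state = model.predict(obs, state=lstm_state, deterministic=True)
    obs, reward, done, truncated, _ = env.step(int(action))
    episode_reward += reward
    # Stop on either termination or truncation
    if done or truncated:
        break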

print(f"Episode reward: {episode_reward}")

Multiplayer Usage

from src.state_action_reward import MultiplayerUnoEnv

# 4-player game
env = MultiplayerUnoEnv(num_players=4)
obs, info = env.reset()

print(f"Observation shape: {obs.shape}")  # (25,)
print(f"Current player: {env.current_player}")
print(f"Turn direction: {env.direction}")