Environment Module

The environment module provides the RL interface to the UNO game.

UnoEnv Class

class src.state_action_reward.UnoEnv

Gymnasium environment for UNO.

This environment wraps the UNO game for reinforcement learning.

observation_space: gym.spaces.Box

17-dimensional continuous observation space. Shape: (17,), Range: [0, 1]

action_space: gym.spaces.Discrete

9 discrete actions.

reset(seed=None, options=None)

Reset the environment for a new game.

Parameters:
  • seed (Optional[int]) – Random seed for reproducibility

  • options (Optional[dict]) – Additional options (unused)

Returns:

Initial observation and info dict

Return type:

Tuple[np.ndarray, dict]

step(action)

Execute an action in the environment.

Parameters:

action (int) – Action index (0-8)

Returns:

(observation, reward, done, truncated, info)

Return type:

Tuple[np.ndarray, float, bool, bool, dict]

get_valid_actions()

Get list of valid action indices.

Returns:

Valid action indices

Return type:

List[int]
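Because `get_valid_actions()` returns a list of indices, an agent that scores all 9 actions has to restrict its choice to that list. The following is a minimal sketch of such a restriction; `masked_argmax` is a hypothetical helper and not part of the module, and the scores are made-up numbers.

```python
from typing import List, Sequence


def masked_argmax(scores: Sequence[float], valid_actions: List[int]) -> int:
    """Return the valid action index with the highest score.

    Hypothetical helper: `scores` holds one value per action (9 for UnoEnv),
    `valid_actions` is a list like the one returned by get_valid_actions().
    """
    if not valid_actions:
        raise ValueError("no valid actions available")
    # Only scores at valid indices are compared; invalid actions are never chosen.
    return max(valid_actions, key=lambda a: scores[a])


# Action 7 has the highest score overall but is not valid in this example.
scores = [0.1, 0.5, 0.2, 0.0, 0.3, 0.0, 0.0, 0.9, 0.4]
print(masked_argmax(scores, valid_actions=[0, 2, 4]))  # prints 4
```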

render(mode='human')

Render the current game state.

Parameters:

mode (str) – Render mode ('human' or 'ansi')

Observation Space

The 17-dimensional observation encodes:

Index   Description                   Range    Encoding
-----   ---------------------------   ------   ----------------
0-3     Open card color               [0, 1]   One-hot
4-7     Cards per color in hand       [0, 1]   Normalized count
8-10    Special cards (Skip/Rev/+2)   [0, 1]   Normalized count
11-12   Wild cards                    [0, 1]   Normalized count
13-16   Playable colors               [0, 1]   Binary
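When debugging, it can help to split an observation back into the groups listed above. The sketch below uses only the index ranges from the table; the group names and the example vector are illustrative assumptions, and the order of colors within each group is not specified here.

```python
from typing import Dict, List, Sequence

# Index ranges copied from the observation table; the field names are illustrative.
OBS_LAYOUT = {
    "open_card_color": (0, 4),
    "cards_per_color": (4, 8),
    "special_cards": (8, 11),
    "wild_cards": (11, 13),
    "playable_colors": (13, 17),
}


def split_observation(obs: Sequence[float]) -> Dict[str, List[float]]:
    """Slice a 17-element observation into named groups."""
    if len(obs) != 17:
        raise ValueError(f"expected 17 values, got {len(obs)}")
    return {name: list(obs[start:stop]) for name, (start, stop) in OBS_LAYOUT.items()}


# Hand-written example vector, not real environment output.
example = [0, 1, 0, 0, 0.2, 0.1, 0.0, 0.3, 0.5, 0.0, 0.5, 0.0, 0.25, 0, 1, 0, 1]
fields = split_observation(example)
print(fields["wild_cards"])  # prints [0.0, 0.25]
```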

Action Space

Index   Action    Description
-----   -------   ----------------------------------
0       RED       Play any red card from hand
1       GREEN     Play any green card from hand
2       BLUE      Play any blue card from hand
3       YELLOW    Play any yellow card from hand
4       SKIP      Play skip card (any color)
5       REVERSE   Play reverse card (any color)
6       DRAW2     Play draw two card (any color)
7       DRAW4     Play wild draw four
8       WILD      Play wild card (choose best color)
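For logging or readable agent code, the indices above can be given names. The enum below is a hypothetical convenience, not something the module is known to export; it only mirrors the index-to-action mapping in the table.

```python
from enum import IntEnum


class UnoAction(IntEnum):
    """Names for the 9 action indices (hypothetical enum, mirrors the table)."""
    RED = 0
    GREEN = 1
    BLUE = 2
    YELLOW = 3
    SKIP = 4
    REVERSE = 5
    DRAW2 = 6
    DRAW4 = 7
    WILD = 8


# IntEnum members are ints, so they can be used wherever an index is expected.
valid_actions = [1, 4, 8]  # e.g. a list returned by get_valid_actions()
print([UnoAction(a).name for a in valid_actions])  # prints ['GREEN', 'SKIP', 'WILD']
```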

Reward Structure

def _get_reward(self, done, winner):
    if not done:
        return 0.0
    if winner == 0:  # Agent wins
        return 1.0
    return -1.0  # Agent loses
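
The reward is sparse: every step returns 0.0 except the last, which returns +1.0 or -1.0. With a discount factor gamma, the return seen from the first step of a T-step game is therefore gamma**(T-1) times the final reward. The sketch below works through that arithmetic; the value of gamma and the game length are assumptions chosen for illustration.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Compute sum over t of gamma**t * rewards[t], folding from the last step back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# A 5-step game that the agent wins: four zero rewards, then +1.0.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # prints 0.0625, i.e. 0.5**4
```

Longer games shrink the learning signal at early steps, which is one reason a discount factor close to 1 is a common choice with terminal-only rewards.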

MultiplayerUnoEnv Class

class src.state_action_reward.MultiplayerUnoEnv(num_players=4)

Extended environment for 2-4 players.

Parameters:

num_players (int) – Number of players (2-4)

Extends observation to 25 dimensions to include opponent hand sizes.

observation_space: gym.spaces.Box

25-dimensional observation space for multiplayer.

direction: int

Current turn direction (1=clockwise, -1=counter-clockwise)

current_player: int

Index of current player (0 to num_players-1)
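
The two attributes above are enough to work out whose turn comes next. The following sketch shows the usual modular arithmetic; `next_player` is a hypothetical helper, and the `skip` flag reflects how a Skip card is commonly handled rather than a confirmed detail of this environment.

```python
def next_player(current_player: int, direction: int, num_players: int, skip: bool = False) -> int:
    """Advance the turn index, wrapping around the table.

    Hypothetical helper; `skip=True` jumps over one player.
    """
    if direction not in (1, -1):
        raise ValueError("direction must be 1 or -1")
    step = 2 if skip else 1
    # Python's % returns a non-negative result for a positive modulus,
    # so a counter-clockwise move from player 0 wraps to num_players - 1.
    return (current_player + direction * step) % num_players


print(next_player(0, -1, 4))            # prints 3
print(next_player(3, 1, 4, skip=True))  # prints 1
```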

Example Usage

Basic Usage

from src.state_action_reward import UnoEnv
import numpy as np

# Create environment
env = UnoEnv()

# Reset for new game
obs, info = env.reset(seed=42)
print(f"Initial observation shape: {obs.shape}")

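# Game loop: stop when the game ends or the episode is truncated
done = truncated = False
total_reward = 0.0

while not (done or truncated):
    # Get valid actions
    valid_actions = env.get_valid_actions()

    # Random policy over valid actions; cast NumPy's integer to a plain int
    action = int(np.random.choice(valid_actions))

    # Step
    obs, reward, done, truncated, info = env.step(action)
    total_reward += reward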

print(f"Game over! Total reward: {total_reward}")
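
A single game says little about a policy, so the loop above is usually repeated to estimate a win rate. The sketch below assumes only the method names documented for UnoEnv; `win_rate` and `CoinFlipEnv` are hypothetical, and the stand-in environment exists solely so the example runs without the real game.

```python
import random
from typing import Callable, List


def win_rate(env, choose: Callable[[List[int]], int], episodes: int) -> float:
    """Play `episodes` games and return the fraction that end with reward +1.0."""
    wins = 0
    for _ in range(episodes):
        env.reset()
        finished, reward = False, 0.0
        while not finished:
            action = choose(env.get_valid_actions())
            _, reward, done, truncated, _ = env.step(action)
            finished = done or truncated
        if reward == 1.0:  # terminal reward of +1.0 means the agent won
            wins += 1
    return wins / episodes


class CoinFlipEnv:
    """Stand-in with the same method names as UnoEnv; NOT the real game."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.turns = 0

    def reset(self, seed=None, options=None):
        self.turns = 0
        return [0.0] * 17, {}

    def get_valid_actions(self):
        return [0, 1, 2]

    def step(self, action):
        self.turns += 1
        done = self.turns >= 3  # every stand-in game lasts three steps
        reward = self.rng.choice([1.0, -1.0]) if done else 0.0
        return [0.0] * 17, reward, done, False, {}


rate = win_rate(CoinFlipEnv(seed=0), choose=random.choice, episodes=100)
print(f"Win rate: {rate:.2f}")
```

To evaluate against the real game, pass a `UnoEnv()` instance in place of `CoinFlipEnv`.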

With Stable-Baselines3

from src.state_action_reward import UnoEnv
from sb3_contrib import RecurrentPPO

# Create environment
env = UnoEnv()

# Train model
model = RecurrentPPO("MlpLstmPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

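# Evaluate one episode; the LSTM state starts as None and is carried between steps
obs, _ = env.reset()
lstm_state = None
episode_reward = 0.0

while True:
    action, lstm_state = model.predict(obs, state=lstm_state, deterministic=True)
    # predict() returns a NumPy value; step() expects a plain int
    obs, reward, done, truncated, _ = env.step(int(action))
    episode_reward += reward
    if done or truncated:
        break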

print(f"Episode reward: {episode_reward}")

Multiplayer Usage

from src.state_action_reward import MultiplayerUnoEnv

# 4-player game
env = MultiplayerUnoEnv(num_players=4)
obs, info = env.reset()

print(f"Observation shape: {obs.shape}")  # (25,)
print(f"Current player: {env.current_player}")
print(f"Turn direction: {env.direction}")