Environment Module
The environment module provides the RL interface to the UNO game.
UnoEnv Class
- class src.state_action_reward.UnoEnv
Gymnasium environment for UNO.
This environment wraps the UNO game for reinforcement learning.
- observation_space: gym.spaces.Box
17-dimensional continuous observation space. Shape: (17,), Range: [0, 1]
- action_space: gym.spaces.Discrete
9 discrete actions.
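- reset(seed=None, options=None)
Reset the environment for a new game. Returns the initial observation and an info dict as (obs, info).
- step(action)
Execute an action in the environment. Returns the Gymnasium five-tuple (obs, reward, terminated, truncated, info).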
Observation Space
The 17-dimensional observation encodes:
| Index | Description | Range | Encoding |
|---|---|---|---|
| 0-3 | Open card color | [0, 1] | One-hot |
| 4-7 | Cards per color in hand | [0, 1] | Normalized count |
| 8-10 | Special cards (Skip/Rev/+2) | [0, 1] | Normalized count |
| 11-12 | Wild cards | [0, 1] | Normalized count |
| 13-16 | Playable colors | [0, 1] | Binary |
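To make the layout above concrete, here is a minimal sketch of how a 17-element vector with these index ranges could be assembled. It is not the `UnoEnv` implementation. The `(color, kind)` card tuples, the color order (borrowed from the action table), the wild-card order, the `MAX_COUNT` normalization cap, and the choice to count a colored special card under both its color and its kind are all assumptions made for illustration.

```python
from collections import Counter

COLORS = ["RED", "GREEN", "BLUE", "YELLOW"]  # assumed order
SPECIALS = ["SKIP", "REVERSE", "DRAW2"]      # indices 8-10
WILDS = ["WILD", "DRAW4"]                    # indices 11-12, order assumed
MAX_COUNT = 10.0                             # hypothetical normalization cap


def encode_observation(open_color, hand, playable_colors):
    """Build a 17-element observation list.

    hand is a list of (color, kind) tuples; color is None for wild cards.
    """
    obs = [0.0] * 17

    # 0-3: one-hot encoding of the open card color
    obs[COLORS.index(open_color)] = 1.0

    # 4-7: normalized count of cards per color, clipped to 1.0
    color_counts = Counter(color for color, _ in hand if color is not None)
    for i, color in enumerate(COLORS):
        obs[4 + i] = min(color_counts[color] / MAX_COUNT, 1.0)

    # 8-10 and 11-12: normalized counts of special and wild cards
    kind_counts = Counter(kind for _, kind in hand)
    for i, kind in enumerate(SPECIALS):
        obs[8 + i] = min(kind_counts[kind] / MAX_COUNT, 1.0)
    for i, kind in enumerate(WILDS):
        obs[11 + i] = min(kind_counts[kind] / MAX_COUNT, 1.0)

    # 13-16: binary flag per color that can currently be played
    for i, color in enumerate(COLORS):
        obs[13 + i] = 1.0 if color in playable_colors else 0.0

    return obs
```

Clipping each count to 1.0 keeps every entry inside the documented [0, 1] range even when a hand grows beyond the assumed cap.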
Action Space
| Index | Action | Description |
|---|---|---|
| 0 | RED | Play any red card from hand |
| 1 | GREEN | Play any green card from hand |
| 2 | BLUE | Play any blue card from hand |
| 3 | YELLOW | Play any yellow card from hand |
| 4 | SKIP | Play skip card (any color) |
| 5 | REVERSE | Play reverse card (any color) |
| 6 | DRAW2 | Play draw two card (any color) |
| 7 | DRAW4 | Play wild draw four |
| 8 | WILD | Play wild card (choose best color) |
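The nine indices above can be mirrored in client code with an `IntEnum`, which avoids magic numbers when logging or filtering actions. The sketch below is illustrative only. `UnoAction` and `actions_in_hand` are hypothetical names, the `(color, kind)` card tuples are an assumed representation, and the helper only checks which actions the hand could support. It does not check legality against the open card, which is what `env.get_valid_actions()` is for.

```python
from enum import IntEnum


class UnoAction(IntEnum):
    """Action indices as listed in the action table."""
    RED = 0
    GREEN = 1
    BLUE = 2
    YELLOW = 3
    SKIP = 4
    REVERSE = 5
    DRAW2 = 6
    DRAW4 = 7
    WILD = 8


def actions_in_hand(hand):
    """Return sorted action indices the hand holds at least one card for.

    hand is a list of (color, kind) tuples; color is None for wild cards.
    Assumption: a colored special card enables both its color action and
    its kind action.
    """
    available = set()
    for color, kind in hand:
        if color is not None:
            available.add(UnoAction[color])
        if kind != "NUMBER":
            available.add(UnoAction[kind])
    return sorted(int(action) for action in available)
```

Because `IntEnum` members compare equal to plain integers, `UnoAction.SKIP` can be passed anywhere the environment expects the index `4`.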
Reward Structure
def _get_reward(self, done, winner):
    if not done:
        return 0.0
    if winner == 0:  # Agent wins
        return 1.0
    return -1.0  # Agent loses
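Because the reward is zero on every step except the last, the discounted return seen from the first move depends only on the game length and the outcome. The sketch below works through that arithmetic. The discount factor `gamma` and both helper names are assumptions for illustration and do not appear in the environment.

```python
def episode_rewards(num_steps, agent_won):
    """Reward sequence for one game under the sparse scheme above."""
    terminal = 1.0 if agent_won else -1.0
    return [0.0] * (num_steps - 1) + [terminal]


def discounted_return(rewards, gamma=0.99):
    """Compute G_0 = sum_t gamma**t * r_t by folding from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For a game of T steps this reduces to plus or minus `gamma ** (T - 1)`, so with `gamma < 1` a faster win is worth slightly more than a slower one, and a slower loss costs slightly less than a faster one.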
MultiplayerUnoEnv Class
- class src.state_action_reward.MultiplayerUnoEnv(num_players=4)
Extended environment for 2-4 players.
- Parameters:
num_players (int) – Number of players (2-4)
Extends observation to 25 dimensions to include opponent hand sizes.
- observation_space: gym.spaces.Box
25-dimensional observation space for multiplayer.
Example Usage
Basic Usage
from src.state_action_reward import UnoEnv
import numpy as np

# Create environment
env = UnoEnv()

# Reset for new game
obs, info = env.reset(seed=42)
print(f"Initial observation shape: {obs.shape}")

# Game loop
done = False
total_reward = 0.0
while not done:
    # Get valid actions
    valid_actions = env.get_valid_actions()

    # Random policy
    action = np.random.choice(valid_actions)

    # Step; stop when the game ends or the episode is truncated
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward

print(f"Game over! Total reward: {total_reward}")
With Stable-Baselines3
from src.state_action_reward import UnoEnv
from sb3_contrib import RecurrentPPO

# Create environment
env = UnoEnv()

# Train model
model = RecurrentPPO("MlpLstmPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Evaluate one episode, carrying the LSTM state between steps
obs, _ = env.reset()
lstm_state = None
episode_reward = 0.0
while True:
    action, lstm_state = model.predict(obs, state=lstm_state, deterministic=True)
    # predict() returns a NumPy value; convert it to a plain int action index
    obs, reward, terminated, truncated, _ = env.step(int(action))
    episode_reward += reward
    if terminated or truncated:
        break

print(f"Episode reward: {episode_reward}")
Multiplayer Usage
from src.multiplayer_env import MultiplayerUnoEnv
# 4-player game
env = MultiplayerUnoEnv(num_players=4)
obs, info = env.reset()
print(f"Observation shape: {obs.shape}") # (25,)
print(f"Current player: {env.current_player}")
print(f"Turn direction: {env.direction}")
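The `current_player` and `direction` attributes printed above suggest the usual UNO turn-order arithmetic. The sketch below shows one way it could work. It assumes players are indexed from 0 (consistent with `winner == 0` denoting the agent) and that `direction` is `+1` or `-1`. The `next_player` helper and its `skip` flag are hypothetical and are not part of `MultiplayerUnoEnv`.

```python
def next_player(current_player, direction, num_players, skip=False):
    """Return the index of the player who moves next.

    direction is +1 or -1 (assumed encoding); a skip jumps over one player.
    """
    step = 2 if skip else 1
    # Python's % always yields a non-negative result for a positive modulus,
    # so moving backwards from player 0 wraps around to the last player.
    return (current_player + direction * step) % num_players
```

In a two-player game a skip returns the turn to the same player, which the modular arithmetic handles without a special case.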