Algorithms

This page explains the Reinforcement Learning algorithms used in this project.

Overview

We explored several RL algorithms for learning to play UNO:

Algorithm	Type	Win Rate	Best For
Recurrent PPO	Policy	60%+	Partial observability (recommended)
PPO	Policy	53%	Stable training
DQN	Value	48%	Discrete actions
A2C	Policy	45%	Fast training
Q-Learning	Tabular	35%	Small state spaces

Q-Learning

The simplest algorithm we tested. Uses a table to store Q-values for each state-action pair.

Update Rule:

\[Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\]

Where: - \(\alpha\) is the learning rate - \(\gamma\) is the discount factor - \(r\) is the immediate reward

Limitations: - State space too large for tables - No generalization across similar states - Only achieves ~35% win rate

Deep Q-Network (DQN)

Uses neural networks to approximate Q-values, enabling generalization.

Architecture:

Input (17) → Dense(256) → ReLU → Dense(256) → ReLU → Output (9)

Key Techniques:

Experience Replay: Store transitions in buffer, sample randomly
Target Network: Separate network for stable targets
ε-greedy Exploration: Random actions with probability ε

Configuration:

DQN(
    policy="MlpPolicy",
    env=env,
    learning_rate=1e-4,
    buffer_size=100000,
    batch_size=64,
    gamma=0.99,
    exploration_fraction=0.1,
    exploration_final_eps=0.02,
    target_update_interval=1000
)

Proximal Policy Optimization (PPO)

A policy gradient method with clipped objective for stable training.

Objective:

\[L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t) \right]\]

Where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio.

Benefits:

Stable training
Sample efficient
Works well with neural networks

Configuration:

PPO(
    policy="MlpPolicy",
    env=env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01
)

Recurrent PPO (LSTM)

Our best performing algorithm! Uses LSTM networks to maintain memory of past observations.

Why LSTM for UNO?

UNO is a partially observable game: - You can’t see opponent’s cards - Cards in the deck are hidden - Past plays contain useful information

LSTM maintains a hidden state that accumulates information over time.

Architecture:

Input (17) → LSTM(256) → Dense(256) → Dense(128) → Policy(9) + Value(1)

Key Insight:

The LSTM learns to: 1. Track which cards have been played 2. Estimate remaining cards in the deck 3. Infer opponent’s hand composition 4. Remember game patterns

Configuration:

RecurrentPPO(
    policy="MlpLstmPolicy",
    env=env,
    learning_rate=3e-4,
    n_steps=128,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    policy_kwargs=dict(
        lstm_hidden_size=256,
        n_lstm_layers=1,
        shared_lstm=True,
        enable_critic_lstm=True
    )
)

Algorithm Comparison

Training Efficiency

Algorithm	Timesteps to 50% WR	GPU Memory	Training Time (1M steps)
Q-Learning	Never reached	N/A	10 min
DQN	800K	~2GB	30 min
PPO	500K	~1GB	20 min
Recurrent PPO	300K	~3GB	45 min

Recommendations

For beginners: Start with PPO - it’s stable and easy to tune.

For best results: Use Recurrent PPO - the memory is essential for UNO.

For research: Try different architectures with the RecurrentPPO base.

Hyperparameter Tuning

Key parameters to tune for UNO:

Parameter	Range	Effect
learning_rate	1e-5 to 1e-3	Higher = faster but unstable
n_steps	64 to 2048	Higher = more stable gradients
batch_size	32 to 256	Must divide n_steps evenly
gamma	0.9 to 0.999	Higher = more farsighted
ent_coef	0.0 to 0.1	Higher = more exploration

Our best configuration is in config.py.