Algorithms

This page explains the Reinforcement Learning algorithms used in this project.

Overview

We explored several RL algorithms for learning to play UNO:

Algorithm

Type

Win Rate

Best For

Recurrent PPO

Policy

60%+

Partial observability (recommended)

PPO

Policy

53%

Stable training

DQN

Value

48%

Discrete actions

A2C

Policy

45%

Fast training

Q-Learning

Tabular

35%

Small state spaces

Q-Learning

The simplest algorithm we tested. Uses a table to store Q-values for each state-action pair.

Update Rule:

\[Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]\]

Where: - \(\alpha\) is the learning rate - \(\gamma\) is the discount factor - \(r\) is the immediate reward

Limitations: - State space too large for tables - No generalization across similar states - Only achieves ~35% win rate

Deep Q-Network (DQN)

Uses neural networks to approximate Q-values, enabling generalization.

Architecture:

Input (17) → Dense(256) → ReLU → Dense(256) → ReLU → Output (9)

Key Techniques:

  1. Experience Replay: Store transitions in buffer, sample randomly

  2. Target Network: Separate network for stable targets

  3. ε-greedy Exploration: Random actions with probability ε

Configuration:

DQN(
    policy="MlpPolicy",
    env=env,
    learning_rate=1e-4,
    buffer_size=100000,
    batch_size=64,
    gamma=0.99,
    exploration_fraction=0.1,
    exploration_final_eps=0.02,
    target_update_interval=1000
)

Proximal Policy Optimization (PPO)

A policy gradient method with clipped objective for stable training.

Objective:

\[L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t) \right]\]

Where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio.

Benefits:

  • Stable training

  • Sample efficient

  • Works well with neural networks

Configuration:

PPO(
    policy="MlpPolicy",
    env=env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01
)

Recurrent PPO (LSTM)

Our best performing algorithm! Uses LSTM networks to maintain memory of past observations.

Why LSTM for UNO?

UNO is a partially observable game: - You can’t see opponent’s cards - Cards in the deck are hidden - Past plays contain useful information

LSTM maintains a hidden state that accumulates information over time.

Architecture:

Input (17) → LSTM(256) → Dense(256) → Dense(128) → Policy(9) + Value(1)

Key Insight:

The LSTM learns to: 1. Track which cards have been played 2. Estimate remaining cards in the deck 3. Infer opponent’s hand composition 4. Remember game patterns

Configuration:

RecurrentPPO(
    policy="MlpLstmPolicy",
    env=env,
    learning_rate=3e-4,
    n_steps=128,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    policy_kwargs=dict(
        lstm_hidden_size=256,
        n_lstm_layers=1,
        shared_lstm=True,
        enable_critic_lstm=True
    )
)

Algorithm Comparison

Training Efficiency

Algorithm

Timesteps to 50% WR

GPU Memory

Training Time (1M steps)

Q-Learning

Never reached

N/A

10 min

DQN

800K

~2GB

30 min

PPO

500K

~1GB

20 min

Recurrent PPO

300K

~3GB

45 min

Recommendations

For beginners: Start with PPO - it’s stable and easy to tune.

For best results: Use Recurrent PPO - the memory is essential for UNO.

For research: Try different architectures with the RecurrentPPO base.

Hyperparameter Tuning

Key parameters to tune for UNO:

Parameter

Range

Effect

learning_rate

1e-5 to 1e-3

Higher = faster but unstable

n_steps

64 to 2048

Higher = more stable gradients

batch_size

32 to 256

Must divide n_steps evenly

gamma

0.9 to 0.999

Higher = more farsighted

ent_coef

0.0 to 0.1

Higher = more exploration

Our best configuration is in config.py.