Algorithms
This page explains the Reinforcement Learning algorithms used in this project.
Overview
We explored several RL algorithms for learning to play UNO:
Algorithm |
Type |
Win Rate |
Best For |
|---|---|---|---|
Recurrent PPO |
Policy |
60%+ |
Partial observability (recommended) |
PPO |
Policy |
53% |
Stable training |
DQN |
Value |
48% |
Discrete actions |
A2C |
Policy |
45% |
Fast training |
Q-Learning |
Tabular |
35% |
Small state spaces |
Q-Learning
The simplest algorithm we tested. Uses a table to store Q-values for each state-action pair.
Update Rule:
Where: - \(\alpha\) is the learning rate - \(\gamma\) is the discount factor - \(r\) is the immediate reward
Limitations: - State space too large for tables - No generalization across similar states - Only achieves ~35% win rate
Deep Q-Network (DQN)
Uses neural networks to approximate Q-values, enabling generalization.
Architecture:
Input (17) → Dense(256) → ReLU → Dense(256) → ReLU → Output (9)
Key Techniques:
Experience Replay: Store transitions in buffer, sample randomly
Target Network: Separate network for stable targets
ε-greedy Exploration: Random actions with probability ε
Configuration:
DQN(
policy="MlpPolicy",
env=env,
learning_rate=1e-4,
buffer_size=100000,
batch_size=64,
gamma=0.99,
exploration_fraction=0.1,
exploration_final_eps=0.02,
target_update_interval=1000
)
Proximal Policy Optimization (PPO)
A policy gradient method with clipped objective for stable training.
Objective:
Where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) is the probability ratio.
Benefits:
Stable training
Sample efficient
Works well with neural networks
Configuration:
PPO(
policy="MlpPolicy",
env=env,
learning_rate=3e-4,
n_steps=2048,
batch_size=64,
n_epochs=10,
gamma=0.99,
gae_lambda=0.95,
clip_range=0.2,
ent_coef=0.01
)
Recurrent PPO (LSTM)
Our best performing algorithm! Uses LSTM networks to maintain memory of past observations.
Why LSTM for UNO?
UNO is a partially observable game: - You can’t see opponent’s cards - Cards in the deck are hidden - Past plays contain useful information
LSTM maintains a hidden state that accumulates information over time.
Architecture:
Input (17) → LSTM(256) → Dense(256) → Dense(128) → Policy(9) + Value(1)
Key Insight:
The LSTM learns to: 1. Track which cards have been played 2. Estimate remaining cards in the deck 3. Infer opponent’s hand composition 4. Remember game patterns
Configuration:
RecurrentPPO(
policy="MlpLstmPolicy",
env=env,
learning_rate=3e-4,
n_steps=128,
batch_size=64,
n_epochs=10,
gamma=0.99,
gae_lambda=0.95,
clip_range=0.2,
policy_kwargs=dict(
lstm_hidden_size=256,
n_lstm_layers=1,
shared_lstm=True,
enable_critic_lstm=True
)
)
Algorithm Comparison
Training Efficiency
Algorithm |
Timesteps to 50% WR |
GPU Memory |
Training Time (1M steps) |
|---|---|---|---|
Q-Learning |
Never reached |
N/A |
10 min |
DQN |
800K |
~2GB |
30 min |
PPO |
500K |
~1GB |
20 min |
Recurrent PPO |
300K |
~3GB |
45 min |
Recommendations
For beginners: Start with PPO - it’s stable and easy to tune.
For best results: Use Recurrent PPO - the memory is essential for UNO.
For research: Try different architectures with the RecurrentPPO base.
Hyperparameter Tuning
Key parameters to tune for UNO:
Parameter |
Range |
Effect |
|---|---|---|
learning_rate |
1e-5 to 1e-3 |
Higher = faster but unstable |
n_steps |
64 to 2048 |
Higher = more stable gradients |
batch_size |
32 to 256 |
Must divide n_steps evenly |
gamma |
0.9 to 0.999 |
Higher = more farsighted |
ent_coef |
0.0 to 0.1 |
Higher = more exploration |
Our best configuration is in config.py.