Self-Play Training
Self-play is our recommended training method for achieving the highest win rates (70%+).
What is Self-Play?
In self-play training, the agent plays against copies of itself or against earlier versions of itself. Because the opponent improves along with the agent, the agent cannot overfit to a single fixed strategy.
Benefits:
No need for hand-crafted opponents
Discovers novel strategies
Robust to different play styles
Continuously improves
Prevents exploitation of fixed patterns
How It Works
Our self-play implementation uses several techniques:
1. Population-Based Training
Maintain a pool of agents at different skill levels:
Population Pool
├── Current Agent (training)
├── Checkpoint @ 100K steps
├── Checkpoint @ 200K steps
├── Checkpoint @ 500K steps
└── Best Agent (highest eval score)
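The pool above can be sketched as a small bounded container. This is a minimal illustration, not the project's actual code: the names `Checkpoint` and `OpponentPool`, the file paths, and the eviction rule (drop the oldest checkpoint once the pool is full) are all assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """A saved snapshot of the agent (hypothetical record type)."""
    step: int          # training step at which the snapshot was taken
    path: str          # where the weights were saved (illustrative path)
    eval_score: float  # evaluation score at save time


class OpponentPool:
    """Bounded pool of past checkpoints plus the best one seen so far."""

    def __init__(self, max_size=5):
        # deque(maxlen=...) evicts the oldest entry automatically when full
        self.checkpoints = deque(maxlen=max_size)
        self.best = None

    def add(self, checkpoint):
        self.checkpoints.append(checkpoint)
        # Track the best agent separately so eviction never discards it
        if self.best is None or checkpoint.eval_score > self.best.eval_score:
            self.best = checkpoint


pool = OpponentPool(max_size=3)
for step, score in [(100_000, 0.42), (200_000, 0.55), (500_000, 0.51), (600_000, 0.53)]:
    pool.add(Checkpoint(step, f"checkpoints/step_{step}.zip", score))

print([c.step for c in pool.checkpoints])  # [200000, 500000, 600000]
print(pool.best.step)                      # 200000
```

Keeping the best agent outside the rolling window means a strong early checkpoint stays available as an opponent even after newer checkpoints push it out of the pool.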
2. Opponent Sampling
Sample opponents from the pool with the following probabilities:
opponent_weights = {
    "current": 0.3,  # Train against itself
    "recent": 0.4,   # Recent checkpoints
    "best": 0.2,     # Best historical agent
    "random": 0.1,   # Random opponent, to maintain exploration
}
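One way to apply these weights is to sample in two stages: first pick a category by weight, then pick uniformly among the opponents in that category. The sketch below assumes this two-stage scheme; the function `sample_opponent` and the opponent identifiers are hypothetical and not taken from the training script.

```python
import random

# Category weights from the section above
opponent_weights = {"current": 0.3, "recent": 0.4, "best": 0.2, "random": 0.1}


def sample_opponent(candidates, weights, rng=random):
    """Pick a category by weight, then a member of that category uniformly.

    `candidates` maps each category name to a non-empty list of opponents.
    """
    categories = list(weights)
    category = rng.choices(categories, weights=[weights[c] for c in categories])[0]
    return category, rng.choice(candidates[category])


# Hypothetical opponent identifiers, for illustration only
candidates = {
    "current": ["agent_live"],
    "recent": ["ckpt_400k", "ckpt_500k"],
    "best": ["ckpt_best"],
    "random": ["random_policy"],
}

rng = random.Random(0)  # seeded so the empirical frequencies are reproducible
draws = [sample_opponent(candidates, opponent_weights, rng)[0] for _ in range(10_000)]
for name in opponent_weights:
    print(name, round(draws.count(name) / len(draws), 2))
```

Over many draws the printed frequencies should sit close to the configured weights, which is a quick sanity check that the sampler matches the intended distribution.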
3. Curriculum Learning
Gradually increase difficulty:
Phase 1 (0-100K): vs Random
Phase 2 (100K-500K): vs Mix(Random, Self)
Phase 3 (500K+): vs Self + Population
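The phase schedule can be expressed as a lookup on the current timestep. This is a sketch: the function name `opponent_mix` and the returned labels are illustrative, and assigning the exact boundary steps (100K and 500K) to the later phase is an assumption, since the ranges above overlap at their endpoints.

```python
def opponent_mix(timestep):
    """Return the opponent sources used at a given training timestep.

    Boundaries follow the three phases listed above; the labels are
    illustrative names, not identifiers from the training script.
    """
    if timestep < 100_000:
        return ("random",)               # Phase 1: random opponent only
    if timestep < 500_000:
        return ("random", "self")        # Phase 2: mix of random and self
    return ("self", "population")        # Phase 3: self plus population


for t in (0, 99_999, 100_000, 499_999, 500_000, 2_000_000):
    print(t, opponent_mix(t))
```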
Running Self-Play Training
Basic Usage
python training/train_selfplay.py --mode selfplay --timesteps 1000000
Advanced Options
python training/train_selfplay.py \
    --mode selfplay \
    --timesteps 2000000 \
    --checkpoint-freq 100000 \
    --population-size 5 \
    --learning-rate 1e-4 \
    --eval-episodes 200
Command Line Arguments
| Argument | Default | Description |
|---|---|---|
| `--mode` | selfplay | Training mode: selfplay, curriculum, mixed |
| `--timesteps` | 1000000 | Total training timesteps |
| `--checkpoint-freq` | 100000 | Save checkpoint frequency (in timesteps) |
| `--population-size` | 5 | Max opponents in pool |
| `--learning-rate` | 3e-4 | Learning rate |
| `--eval-episodes` | 100 | Evaluation episodes |
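For reference, a command line with the flags shown under Advanced Options and the defaults in the table above could be declared with `argparse` as follows. This is a sketch under that assumption; the actual parser in training/train_selfplay.py may be written differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults (sketch)."""
    parser = argparse.ArgumentParser(description="Self-play training (sketch)")
    parser.add_argument("--mode", choices=["selfplay", "curriculum", "mixed"],
                        default="selfplay", help="Training mode")
    parser.add_argument("--timesteps", type=int, default=1_000_000,
                        help="Total training timesteps")
    parser.add_argument("--checkpoint-freq", type=int, default=100_000,
                        help="Save a checkpoint every N timesteps")
    parser.add_argument("--population-size", type=int, default=5,
                        help="Max opponents in pool")
    parser.add_argument("--learning-rate", type=float, default=3e-4,
                        help="Learning rate")
    parser.add_argument("--eval-episodes", type=int, default=100,
                        help="Evaluation episodes")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible
args = build_parser().parse_args(["--timesteps", "2000000", "--learning-rate", "1e-4"])
print(args.mode, args.timesteps, args.learning_rate)  # selfplay 2000000 0.0001
```

Note that `argparse` converts dashes in flag names to underscores, so `--checkpoint-freq` is read back as `args.checkpoint_freq`.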
Training Modes
Selfplay Mode
Pure self-play against population:
python training/train_selfplay.py --mode selfplay
Curriculum Mode
Gradual difficulty increase:
python training/train_selfplay.py --mode curriculum
Mixed Mode
Combination of techniques:
python training/train_selfplay.py --mode mixed
Implementation Details
The self-play environment wrapper:
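import random

import gym


class SelfPlayEnv(gym.Wrapper):
    # Sampling weights for the four pool slots: current, recent, best, random
    OPPONENT_WEIGHTS = [0.3, 0.4, 0.2, 0.1]

    def __init__(self, env, opponent_pool):
        super().__init__(env)
        if len(opponent_pool) != len(self.OPPONENT_WEIGHTS):
            raise ValueError("opponent_pool must have one entry per weight")
        self.opponent_pool = opponent_pool
        self.current_opponent = None

    def reset(self, **kwargs):
        # Sample a new opponent each episode
        self.current_opponent = self._sample_opponent()
        return super().reset(**kwargs)

    def _sample_opponent(self):
        return random.choices(self.opponent_pool, weights=self.OPPONENT_WEIGHTS)[0]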
Monitoring Progress
Track these metrics during self-play:
| Metric | Good Sign | Bad Sign |
|---|---|---|
| Win rate vs Random | > 60% | < 50% |
| Win rate vs Self | 45-55% | < 30% or > 70% |
| Episode length | Decreasing | Increasing |
| Reward | Increasing | Flat or decreasing |
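The two win-rate thresholds above are easy to check automatically during training. The helper below is a hypothetical sketch (the name `check_metrics` is not part of the project); it covers only the win-rate rows, since the episode-length and reward rows describe trends rather than fixed thresholds.

```python
def check_metrics(win_rate_vs_random, win_rate_vs_self):
    """Return warnings for win rates that fall in the 'bad sign' ranges.

    Win rates are fractions in [0, 1].
    """
    warnings = []
    if win_rate_vs_random < 0.50:
        warnings.append("win rate vs random below 50%")
    if win_rate_vs_self < 0.30 or win_rate_vs_self > 0.70:
        warnings.append("win rate vs self outside 30-70%")
    return warnings


print(check_metrics(0.62, 0.51))  # []
print(check_metrics(0.45, 0.80))  # both warnings
```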
Expected Results
Training Timeline
Timesteps Win Rate (vs Random)
--------- -------------------
100K 40-45%
250K 50-55%
500K 55-60%
1M 60-65%
2M 65-70%
5M 70%+
Final Performance
After 2M timesteps of self-play:
vs Random: 65-70% win rate (70%+ at 5M, per the timeline above)
vs PPO: 60%+ win rate
vs DQN: 65%+ win rate
Tips for Best Results
Long Training: Self-play benefits from extended training (2M+ steps)
Large Population: Use 5-10 agents in the population
Regular Checkpoints: Save every 100K steps for diversity
Evaluation: Test against fixed baselines periodically
Patience: Early training may show unstable metrics
Resources: Self-play uses more memory (multiple models loaded)
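When evaluating against fixed baselines, keep in mind that a win rate measured over a few hundred episodes carries noticeable sampling noise. The sketch below uses the standard normal approximation to a binomial proportion to estimate that noise; the helper name `win_rate_with_error` is an assumption, not part of the project.

```python
import math


def win_rate_with_error(wins, episodes, z=1.96):
    """Return (win_rate, half_width) of an approximate 95% confidence interval."""
    rate = wins / episodes
    # Standard error of a binomial proportion, scaled by z for ~95% coverage
    half_width = z * math.sqrt(rate * (1 - rate) / episodes)
    return rate, half_width


for episodes in (100, 200, 1000):
    rate, half_width = win_rate_with_error(wins=episodes * 6 // 10, episodes=episodes)
    print(f"{episodes} episodes: {rate:.0%} +/- {half_width:.1%}")
```

At 100 evaluation episodes and a 60% win rate, the interval is roughly plus or minus 10 percentage points, so differences of a few points between checkpoints are within noise. Raising `--eval-episodes` narrows the interval, but only in proportion to the square root of the episode count.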
Troubleshooting
Win Rate Not Improving
Increase population diversity
Add more random agents to pool
Lower learning rate
Training Unstable
Resample opponents less often (for example, keep the same opponent for several episodes)
Increase batch size
Use a smaller population
Out of Memory
Reduce population size
Use a smaller LSTM hidden size
Clear old checkpoints
Using the Trained Model
After training, the champion model is saved to:
models/selfplay_champion.zip
Use it in the GUI or evaluation:
python uno_gui.py # Select "Self-Play Champion" from dropdown
Or load programmatically:
from sb3_contrib import RecurrentPPO
model = RecurrentPPO.load("models/selfplay_champion.zip")