Self-Play Training

Self-play is our recommended training method for reaching the highest win rates (70%+ against a random opponent).

What is Self-Play?

In self-play training, the agent plays against copies of itself or earlier versions of itself. Because the opponent improves along with the agent, the agent is less likely to overfit to any fixed strategy.

Benefits:

No need for hand-crafted opponents
Discovers novel strategies
Robust to different play styles
Continuously improves
Prevents exploitation of fixed patterns

How It Works

Our self-play implementation uses several techniques:

1. Population-Based Training

Maintain a pool of agents at different skill levels:

Population Pool
├── Current Agent (training)
├── Checkpoint @ 100K steps
├── Checkpoint @ 200K steps
├── Checkpoint @ 500K steps
└── Best Agent (highest eval score)
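A pool like the one above could be kept in a small bounded container. The following is a minimal sketch, not the project's implementation: the `OpponentPool` class, its method names, and the string stand-ins for agents are all hypothetical.

```python
from collections import deque


class OpponentPool:
    """Hypothetical bounded pool of past checkpoints plus a tracked best agent."""

    def __init__(self, max_size=5):
        # Once max_size is reached, the oldest checkpoint is evicted first.
        self.checkpoints = deque(maxlen=max_size)
        self.best = None
        self.best_score = float("-inf")

    def add_checkpoint(self, step, agent, eval_score):
        """Store a snapshot and update the best agent if the score improved."""
        self.checkpoints.append((step, agent))
        if eval_score > self.best_score:
            self.best, self.best_score = agent, eval_score

    def steps(self):
        """Return the training steps of the checkpoints currently in the pool."""
        return [step for step, _ in self.checkpoints]


pool = OpponentPool(max_size=3)
history = [(100_000, 0.42), (200_000, 0.51), (300_000, 0.48), (400_000, 0.55)]
for step, score in history:
    # Strings stand in for real model snapshots in this demo.
    pool.add_checkpoint(step, agent=f"agent@{step}", eval_score=score)

print(pool.steps())  # the oldest snapshot has been evicted
print(pool.best)
```

The best agent is stored separately from the rolling checkpoints, so it survives even after its own checkpoint has been evicted.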

2. Opponent Sampling

At the start of each episode, an opponent is sampled from the pool with the following probabilities:

opponent_weights = {
    "current": 0.3,      # Train against itself
    "recent": 0.4,       # Recent checkpoints
    "best": 0.2,         # Best historical
    "random": 0.1        # Maintain exploration
}
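As a rough illustration of how such weights translate into draws, the sketch below samples a category name with `random.choices` and prints the empirical frequencies. The helper `sample_opponent_category` is a hypothetical name, and mapping a category to an actual agent is left out.

```python
import random
from collections import Counter

opponent_weights = {"current": 0.3, "recent": 0.4, "best": 0.2, "random": 0.1}


def sample_opponent_category(weights, rng=random):
    """Draw one category name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
draws = 10_000
counts = Counter(sample_opponent_category(opponent_weights, rng) for _ in range(draws))
for name in opponent_weights:
    print(f"{name:8s} {counts[name] / draws:.2f}")
```

Keeping the weights in a dictionary keyed by category avoids relying on list order when the pool changes size.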

3. Curriculum Learning

Gradually increase difficulty:

Phase 1 (0-100K):   vs Random
Phase 2 (100K-500K): vs Mix(Random, Self)
Phase 3 (500K+):     vs Self + Population
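The schedule above can be read as a function from the current timestep to the allowed opponent types. This sketch assumes half-open phase boundaries (100K belongs to Phase 2, 500K to Phase 3), which the table does not specify; the function name and the string labels are illustrative.

```python
def curriculum_opponents(timestep):
    """Return the opponent types allowed at a given training timestep."""
    if timestep < 100_000:
        return ["random"]              # Phase 1: random opponent only
    if timestep < 500_000:
        return ["random", "self"]      # Phase 2: mix of random and self
    return ["self", "population"]      # Phase 3: self plus population pool


for t in (0, 99_999, 100_000, 499_999, 500_000, 2_000_000):
    print(t, curriculum_opponents(t))
```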

Running Self-Play Training

Basic Usage

python training/train_selfplay.py --mode selfplay --timesteps 1000000

Advanced Options

python training/train_selfplay.py \
    --mode selfplay \
    --timesteps 2000000 \
    --checkpoint-freq 100000 \
    --population-size 5 \
    --learning-rate 1e-4 \
    --eval-episodes 200

Command Line Arguments

Argument	Default	Description
`--mode`	selfplay	Training mode: selfplay, curriculum, mixed
`--timesteps`	1000000	Total training timesteps
`--checkpoint-freq`	100000	Timesteps between saved checkpoints
`--population-size`	5	Maximum number of opponents in the pool
`--learning-rate`	3e-4	Optimizer learning rate
`--eval-episodes`	100	Number of episodes per evaluation
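For reference, the table above corresponds to an `argparse` setup along the following lines. This is an illustrative reconstruction from the documented flags and defaults, not the actual parser in `training/train_selfplay.py`.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Self-play training (illustrative)")
    parser.add_argument("--mode", choices=["selfplay", "curriculum", "mixed"],
                        default="selfplay")
    parser.add_argument("--timesteps", type=int, default=1_000_000)
    parser.add_argument("--checkpoint-freq", type=int, default=100_000)
    parser.add_argument("--population-size", type=int, default=5)
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    parser.add_argument("--eval-episodes", type=int, default=100)
    return parser


# Flags that are not passed keep their defaults.
args = build_parser().parse_args(["--timesteps", "2000000", "--learning-rate", "1e-4"])
print(args.mode, args.timesteps, args.checkpoint_freq, args.learning_rate)
```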

Training Modes

Selfplay Mode

Pure self-play against the population pool:

python training/train_selfplay.py --mode selfplay

Curriculum Mode

Gradual difficulty increase:

python training/train_selfplay.py --mode curriculum

Mixed Mode

Combination of techniques:

python training/train_selfplay.py --mode mixed

Implementation Details

The self-play environment wrapper:

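import random

import gym


class SelfPlayEnv(gym.Wrapper):
    # Sampling weights, in order: current, recent, best, random
    OPPONENT_WEIGHTS = [0.3, 0.4, 0.2, 0.1]

    def __init__(self, env, opponent_pool):
        super().__init__(env)
        # random.choices needs one pool entry per weight, in the same order
        if len(opponent_pool) != len(self.OPPONENT_WEIGHTS):
            raise ValueError("opponent_pool must have one entry per weight")
        self.opponent_pool = opponent_pool
        self.current_opponent = None

    def reset(self, **kwargs):
        # Sample a new opponent each episode
        self.current_opponent = self._sample_opponent()
        return super().reset(**kwargs)

    def _sample_opponent(self):
        return random.choices(self.opponent_pool, weights=self.OPPONENT_WEIGHTS)[0]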

Monitoring Progress

Track these metrics during self-play:

Metric	Good Sign	Bad Sign
Win rate vs Random	> 60%	< 50%
Win rate vs Self	45-55%	< 30% or > 70%
Episode length	Decreasing	Increasing
Reward	Increasing	Flat or decreasing
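The win-rate thresholds in the table can be turned into a simple automated check. In this sketch the function name and the "inconclusive" label for values that fall between the good and bad ranges are assumptions; the table itself only defines the two extremes.

```python
def check_win_rates(vs_random, vs_self):
    """Classify win rates (fractions in [0, 1]) using the table's thresholds."""
    report = {}

    if vs_random > 0.60:
        report["vs_random"] = "good"
    elif vs_random < 0.50:
        report["vs_random"] = "bad"
    else:
        report["vs_random"] = "inconclusive"

    if 0.45 <= vs_self <= 0.55:
        report["vs_self"] = "good"
    elif vs_self < 0.30 or vs_self > 0.70:
        report["vs_self"] = "bad"
    else:
        report["vs_self"] = "inconclusive"

    return report


print(check_win_rates(vs_random=0.63, vs_self=0.51))
print(check_win_rates(vs_random=0.47, vs_self=0.75))
```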

Expected Results

Training Timeline

Timesteps     Win Rate (vs Random)
---------     -------------------
100K          40-45%
250K          50-55%
500K          55-60%
1M            60-65%
2M            65-70%
5M            70%+
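The bands in this timeline are only five points wide, so the number of evaluation episodes matters when comparing a run against them. The sketch below uses the standard normal approximation to the binomial to estimate the uncertainty of a measured win rate; the function name is illustrative and the approximation assumes independent episodes.

```python
import math


def win_rate_interval(wins, episodes, z=1.96):
    """Return (win_rate, half_width) of an approximate 95% confidence interval."""
    p = wins / episodes
    half_width = z * math.sqrt(p * (1 - p) / episodes)
    return p, half_width


for episodes in (100, 200, 1000):
    p, hw = win_rate_interval(wins=round(0.6 * episodes), episodes=episodes)
    print(f"{episodes:5d} episodes: {p:.0%} +/- {hw:.1%}")
```

At a 60% win rate, 100 evaluation episodes give an interval of roughly plus or minus 10 points, so a larger `--eval-episodes` value is worth considering before concluding that a run has moved from one band to the next.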

Final Performance

After extended self-play (2M-5M timesteps; per the timeline above, 2M steps yields 65-70% vs Random and 70%+ is reached by 5M):

vs Random: 70%+ win rate
vs PPO: 60%+ win rate
vs DQN: 65%+ win rate

Tips for Best Results

Long Training: Self-play benefits from extended training (2M+ steps)
Large Population: Use 5-10 agents in the population
Regular Checkpoints: Save every 100K steps for diversity
Evaluation: Test against fixed baselines periodically
Patience: Early training may show unstable metrics
Resources: Self-play uses more memory (multiple models loaded)

Troubleshooting

Win Rate Not Improving

Increase population diversity (larger pool, more widely spaced checkpoints)
Add more random agents to the pool
Lower the learning rate

Training Unstable

Reduce how often a new opponent is sampled
Increase the batch size
Use a smaller population

Out of Memory

Reduce the population size
Use a smaller LSTM hidden size
Delete old checkpoints that are no longer needed
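Clearing old checkpoints can be scripted. The helper below is a hypothetical sketch: the function name, the `*.zip` pattern, and the idea of protecting `selfplay_champion.zip` by filename are assumptions, and the checkpoint directory is not specified in this document. It deletes files permanently, so try it on a copy first.

```python
from pathlib import Path


def prune_checkpoints(directory, keep=5, pattern="*.zip",
                      protect=("selfplay_champion.zip",)):
    """Delete all but the `keep` most recently modified files matching `pattern`.

    Files whose names are listed in `protect` are never deleted.
    Returns the names of the files that were removed.
    """
    files = [p for p in Path(directory).glob(pattern) if p.name not in protect]
    # Newest first, so everything after the first `keep` entries is stale.
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    removed = []
    for path in files[keep:]:
        path.unlink()
        removed.append(path.name)
    return removed


# Example call (the directory name is a placeholder):
# prune_checkpoints("path/to/checkpoints", keep=5)
```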

Using the Trained Model

After training, the champion model is saved to:

models/selfplay_champion.zip

Use it in the GUI or evaluation:

python uno_gui.py  # Select "Self-Play Champion" from dropdown

Or load programmatically:

from sb3_contrib import RecurrentPPO

model = RecurrentPPO.load("models/selfplay_champion.zip")