Self-Play Training

Self-play is our recommended training method for achieving the highest win rates (70%+).

What is Self-Play?

In self-play training, the agent plays against copies of itself or against earlier checkpoints of itself. The opponent therefore improves along with the agent, which prevents overfitting to any single fixed strategy.

Benefits:

  • No need for hand-crafted opponents

  • Discovers novel strategies

  • Robust to different play styles

  • Continuously improves

  • Prevents exploitation of fixed patterns

How It Works

Our self-play implementation uses several techniques:

1. Population-Based Training

Maintain a pool of agents at different skill levels:

Population Pool
├── Current Agent (training)
├── Checkpoint @ 100K steps
├── Checkpoint @ 200K steps
├── Checkpoint @ 500K steps
└── Best Agent (highest eval score)
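
As a rough illustration of the pool above, the following sketch keeps a bounded set of checkpoints plus a separately tracked best agent. The `Checkpoint` and `PopulationPool` names and their fields are hypothetical and not taken from the training script; the bound plays the role of the maximum pool size.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Optional


@dataclass
class Checkpoint:
    """A frozen snapshot of the agent (hypothetical container)."""
    policy: Any        # whatever object represents the saved agent
    timestep: int      # training step at which the snapshot was taken
    eval_score: float  # score from the most recent evaluation


@dataclass
class PopulationPool:
    """Bounded pool of checkpoints plus the best agent seen so far."""
    max_checkpoints: int = 5
    checkpoints: Deque[Checkpoint] = field(default_factory=deque, init=False)
    best: Optional[Checkpoint] = None

    def __post_init__(self) -> None:
        # A deque with maxlen drops the oldest checkpoint once it is full.
        self.checkpoints = deque(maxlen=self.max_checkpoints)

    def add(self, ckpt: Checkpoint) -> None:
        self.checkpoints.append(ckpt)
        # Track the best agent separately so eviction never loses it.
        if self.best is None or ckpt.eval_score > self.best.eval_score:
            self.best = ckpt
```

Keeping the best agent outside the bounded queue means an old but strong checkpoint stays available as an opponent even after newer checkpoints push it out.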

2. Opponent Sampling

Sample opponents from the pool with probabilities:

opponent_weights = {
    "current": 0.3,      # Train against itself
    "recent": 0.4,       # Recent checkpoints
    "best": 0.2,         # Best historical
    "random": 0.1        # Maintain exploration
}
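
One way to draw an opponent category from these weights is `random.choices`. The helper name `sample_opponent_category` is an assumption made for illustration; only the weights come from the configuration above.

```python
import random
from collections import Counter

opponent_weights = {
    "current": 0.3,  # train against itself
    "recent": 0.4,   # recent checkpoints
    "best": 0.2,     # best historical agent
    "random": 0.1,   # maintain exploration
}


def sample_opponent_category(weights, rng=random):
    """Return one category name, drawn in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Empirical check: over many draws the frequencies approach the weights.
rng = random.Random(0)
counts = Counter(
    sample_opponent_category(opponent_weights, rng) for _ in range(10_000)
)
```

Passing a seeded `random.Random` instance makes the opponent sequence reproducible, which helps when comparing training runs.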

3. Curriculum Learning

Gradually increase difficulty:

Phase 1 (0-100K):   vs Random
Phase 2 (100K-500K): vs Mix(Random, Self)
Phase 3 (500K+):     vs Self + Population
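
The phase schedule can be sketched as a lookup on the current timestep. The function name and the opponent labels are hypothetical, and the treatment of the boundaries (100K belongs to Phase 2, 500K to Phase 3) is an assumption, since the ranges above share their endpoints.

```python
from typing import Tuple


def curriculum_phase(timestep: int) -> Tuple[int, Tuple[str, ...]]:
    """Map a timestep to (phase number, opponent types allowed in that phase)."""
    if timestep < 0:
        raise ValueError("timestep must be non-negative")
    if timestep < 100_000:
        return 1, ("random",)            # Phase 1: vs Random
    if timestep < 500_000:
        return 2, ("random", "self")     # Phase 2: vs Mix(Random, Self)
    return 3, ("self", "population")     # Phase 3: vs Self + Population
```

For example, `curriculum_phase(250_000)` falls in Phase 2, where opponents are a mix of random and self-play agents.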

Running Self-Play Training

Basic Usage

python training/train_selfplay.py --mode selfplay --timesteps 1000000

Advanced Options

python training/train_selfplay.py \
    --mode selfplay \
    --timesteps 2000000 \
    --checkpoint-freq 100000 \
    --population-size 5 \
    --learning-rate 1e-4 \
    --eval-episodes 200

Command Line Arguments

Argument            Default    Description
--------            -------    -----------
--mode              selfplay   Training mode: selfplay, curriculum, mixed
--timesteps         1000000    Total training timesteps
--checkpoint-freq   100000     Save checkpoint frequency
--population-size   5          Max opponents in pool
--learning-rate     3e-4       Learning rate
--eval-episodes     100        Evaluation episodes

Training Modes

Selfplay Mode

Pure self-play against the population:

python training/train_selfplay.py --mode selfplay

Curriculum Mode

Gradual difficulty increase:

python training/train_selfplay.py --mode curriculum

Mixed Mode

Combination of techniques:

python training/train_selfplay.py --mode mixed

Implementation Details

The self-play environment wrapper:

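import random

import gym


class SelfPlayEnv(gym.Wrapper):
    """Wraps an environment and picks a new opponent at every reset."""

    def __init__(self, env, opponent_pool):
        super().__init__(env)
        # Must hold four entries in this order: current, recent, best, random
        self.opponent_pool = opponent_pool
        self.current_opponent = None

    def reset(self, **kwargs):
        # Sample new opponent each episode
        self.current_opponent = self._sample_opponent()
        return super().reset(**kwargs)

    def _sample_opponent(self):
        weights = [0.3, 0.4, 0.2, 0.1]  # current, recent, best, random
        # random.choices needs exactly one weight per pool entry
        return random.choices(self.opponent_pool, weights=weights)[0]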

Monitoring Progress

Track these metrics during self-play:

Metric               Good Sign    Bad Sign
------               ---------    --------
Win rate vs Random   > 60%        < 50%
Win rate vs Self     45-55%       < 30% or > 70%
Episode length       Decreasing   Increasing
Reward               Increasing   Flat or decreasing
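
The two win-rate thresholds can be turned into a simple check for a logging callback. This is a minimal sketch: the function names are hypothetical, win rates are given as fractions in [0, 1], and the "neutral" label for values between the good and bad ranges is an assumption, since those ranges are not classified here.

```python
def grade_vs_random(win_rate: float) -> str:
    """Classify the win rate against the random opponent."""
    if win_rate > 0.60:
        return "good"
    if win_rate < 0.50:
        return "bad"
    return "neutral"  # between 50% and 60%: not classified above


def grade_vs_self(win_rate: float) -> str:
    """Classify the win rate against the agent's own copies."""
    if 0.45 <= win_rate <= 0.55:
        return "good"  # evenly matched, as expected in self-play
    if win_rate < 0.30 or win_rate > 0.70:
        return "bad"   # strongly one-sided matches
    return "neutral"
```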

Expected Results

Training Timeline

Timesteps     Win Rate (vs Random)
---------     -------------------
100K          40-45%
250K          50-55%
500K          55-60%
1M            60-65%
2M            65-70%
5M            70%+

Final Performance

After 2M timesteps of self-play:

  • vs Random: 70%+ win rate

  • vs PPO: 60%+ win rate

  • vs DQN: 65%+ win rate

Tips for Best Results

  1. Long Training: Self-play benefits from extended training (2M+ steps)

  2. Large Population: Use 5-10 agents in the population

  3. Regular Checkpoints: Save every 100K steps for diversity

  4. Evaluation: Test against fixed baselines periodically

  5. Patience: Early training may show unstable metrics

  6. Resources: Self-play uses more memory (multiple models loaded)
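
When comparing evaluation win rates against fixed baselines, it helps to know how noisy the estimate is. The sketch below computes a win rate with an approximate 95% interval using the normal approximation to the binomial; the function name is illustrative and not part of the training scripts.

```python
import math
from typing import Tuple


def win_rate_with_interval(
    wins: int, episodes: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (win rate, lower bound, upper bound) for an evaluation run."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    if not 0 <= wins <= episodes:
        raise ValueError("wins must be between 0 and episodes")
    rate = wins / episodes
    # Standard error of a proportion, scaled by z for the interval width.
    half_width = z * math.sqrt(rate * (1.0 - rate) / episodes)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)
```

With 100 evaluation episodes and a 60% win rate, the interval is roughly plus or minus 10 percentage points, so small differences between checkpoints may be noise. Raising the number of evaluation episodes narrows the interval.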

Troubleshooting

Win Rate Not Improving

  • Increase population diversity

  • Add more random agents to the pool

  • Lower learning rate

Training Unstable

  • Switch opponents less often (the wrapper above resamples every episode)

  • Increase batch size

  • Use a smaller population

Out of Memory

  • Reduce population size

  • Use a smaller LSTM hidden size

  • Clear old checkpoints

Using the Trained Model

After training, the champion model is saved to:

models/selfplay_champion.zip

Use it in the GUI or evaluation:

python uno_gui.py  # Select "Self-Play Champion" from dropdown

Or load programmatically:

from sb3_contrib import RecurrentPPO

model = RecurrentPPO.load("models/selfplay_champion.zip")