==================
Self-Play Training
==================

Self-play is our recommended training method for reaching the highest win rates (70%+ against a random opponent).

What is Self-Play?
==================

In self-play training, the agent plays against copies of itself or against earlier versions of itself. The opponent improves as the agent does, which discourages overfitting to any fixed strategy.

**Benefits:**

- No need for hand-crafted opponents
- Discovers novel strategies
- Robust to different play styles
- Continuously improves
- Prevents exploitation of fixed patterns

How It Works
============

Our self-play implementation uses several techniques:

1. Population-Based Training
----------------------------

Maintain a pool of agents at different skill levels:

.. code-block:: text

    Population Pool
    ├── Current Agent (training)
    ├── Checkpoint @ 100K steps
    ├── Checkpoint @ 200K steps
    ├── Checkpoint @ 500K steps
    └── Best Agent (highest eval score)
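
A minimal sketch of how such a pool could be kept bounded, assuming each checkpoint is identified by its step count and evaluation score. ``CheckpointPool`` and its methods are hypothetical names used for illustration; they are not taken from the project's code.

.. code-block:: python

    from collections import deque


    class CheckpointPool:
        """Bounded pool of past checkpoints plus the best one seen so far."""

        def __init__(self, max_size=5):
            # The oldest checkpoint is evicted automatically once the pool is full
            self.checkpoints = deque(maxlen=max_size)
            self.best = None  # (step, eval_score) of the highest-scoring checkpoint

        def add(self, step, eval_score):
            entry = (step, eval_score)
            self.checkpoints.append(entry)
            # The best agent is tracked separately so eviction never drops it
            if self.best is None or eval_score > self.best[1]:
                self.best = entry

        def steps(self):
            return [step for step, _ in self.checkpoints]


    pool = CheckpointPool(max_size=3)
    for step, score in [(100_000, 0.42), (200_000, 0.55), (300_000, 0.51), (400_000, 0.53)]:
        pool.add(step, score)

Keeping the best agent outside the rolling window means a strong early checkpoint remains available as an opponent even after newer checkpoints push it out.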

2. Opponent Sampling
--------------------

Sample opponents from the pool with the following probabilities:

.. code-block:: python

    opponent_weights = {
        "current": 0.3,      # Train against itself
        "recent": 0.4,       # Recent checkpoints
        "best": 0.2,         # Best historical
        "random": 0.1        # Maintain exploration
    }
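
One way these weights might be used is to draw an opponent category first and then pick a concrete agent from that category. The sketch below covers only the first step; ``sample_opponent_kind`` is a hypothetical helper, not a function from the training script.

.. code-block:: python

    import random

    opponent_weights = {
        "current": 0.3,
        "recent": 0.4,
        "best": 0.2,
        "random": 0.1,
    }


    def sample_opponent_kind(weights, rng=random):
        """Return one category name, drawn in proportion to its weight."""
        kinds = list(weights)
        return rng.choices(kinds, weights=list(weights.values()), k=1)[0]


    rng = random.Random(0)  # seeded only to make the example reproducible
    counts = {kind: 0 for kind in opponent_weights}
    for _ in range(10_000):
        counts[sample_opponent_kind(opponent_weights, rng)] += 1

Over many draws, the share of each category in ``counts`` should approach its weight.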

3. Curriculum Learning
----------------------

Gradually increase difficulty:

.. code-block:: text

    Phase 1 (0-100K):     vs Random
    Phase 2 (100K-500K):  vs Mix(Random, Self)
    Phase 3 (500K+):      vs Self + Population
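
The schedule above can be sketched as a lookup from the current timestep to an opponent mix. The function name and the returned labels are illustrative, and the sketch assumes each phase boundary belongs to the later phase (so step 100,000 is already Phase 2), which the schedule does not specify.

.. code-block:: python

    def curriculum_phase(timestep):
        """Map a training timestep to the opponent mix for that phase."""
        if timestep < 100_000:
            return "random"               # Phase 1
        if timestep < 500_000:
            return "mix(random, self)"    # Phase 2
        return "self + population"        # Phase 3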

Running Self-Play Training
==========================

Basic Usage
-----------

.. code-block:: bash

    python training/train_selfplay.py --mode selfplay --timesteps 1000000

Advanced Options
----------------

.. code-block:: bash

    python training/train_selfplay.py \
        --mode selfplay \
        --timesteps 2000000 \
        --checkpoint-freq 100000 \
        --population-size 5 \
        --learning-rate 1e-4 \
        --eval-episodes 200

Command Line Arguments
----------------------

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Argument
     - Default
     - Description
   * - ``--mode``
     - selfplay
     - Training mode: selfplay, curriculum, mixed
   * - ``--timesteps``
     - 1000000
     - Total training timesteps
   * - ``--checkpoint-freq``
     - 100000
     - Interval between checkpoint saves, in timesteps
   * - ``--population-size``
     - 5
     - Maximum number of opponents kept in the pool
   * - ``--learning-rate``
     - 3e-4
     - Learning rate
   * - ``--eval-episodes``
     - 100
     - Number of episodes per evaluation run
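
For reference, the table above corresponds to an ``argparse`` setup along the following lines. This is a sketch reconstructed from the documented flags and defaults, not the actual parser in ``train_selfplay.py``; ``build_parser`` is a hypothetical name.

.. code-block:: python

    import argparse


    def build_parser():
        parser = argparse.ArgumentParser(description="Self-play training (sketch)")
        parser.add_argument("--mode", default="selfplay",
                            choices=["selfplay", "curriculum", "mixed"])
        parser.add_argument("--timesteps", type=int, default=1_000_000)
        parser.add_argument("--checkpoint-freq", type=int, default=100_000)
        parser.add_argument("--population-size", type=int, default=5)
        parser.add_argument("--learning-rate", type=float, default=3e-4)
        parser.add_argument("--eval-episodes", type=int, default=100)
        return parser


    # Flags that are not passed fall back to the defaults from the table
    args = build_parser().parse_args(["--timesteps", "2000000", "--learning-rate", "1e-4"])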

Training Modes
==============

Selfplay Mode
-------------

Pure self-play against the population:

.. code-block:: bash

    python training/train_selfplay.py --mode selfplay

Curriculum Mode
---------------

Gradual difficulty increase:

.. code-block:: bash

    python training/train_selfplay.py --mode curriculum

Mixed Mode
----------

Combines self-play with curriculum training:

.. code-block:: bash

    python training/train_selfplay.py --mode mixed

Implementation Details
======================

The self-play environment wrapper:

.. code-block:: python

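    import random

    import gym  # or `import gymnasium as gym`, depending on the installed version


    class SelfPlayEnv(gym.Wrapper):
        """Samples a new opponent from the pool at the start of every episode."""

        # Sampling weights, in order: current, recent, best, random
        OPPONENT_WEIGHTS = [0.3, 0.4, 0.2, 0.1]

        def __init__(self, env, opponent_pool):
            super().__init__(env)
            # One entry per weight above, in the same order
            self.opponent_pool = opponent_pool
            self.current_opponent = None

        def reset(self, **kwargs):
            # Sample a new opponent each episode
            self.current_opponent = self._sample_opponent()
            # Forward seed/options so the wrapper works with either reset signature
            return super().reset(**kwargs)

        def _sample_opponent(self):
            # random.choices raises ValueError if pool and weights differ in length
            return random.choices(self.opponent_pool, weights=self.OPPONENT_WEIGHTS)[0]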

Monitoring Progress
===================

Track these metrics during self-play:

.. list-table::
   :header-rows: 1

   * - Metric
     - Good Sign
     - Bad Sign
   * - Win rate vs Random
     - > 60%
     - < 50%
   * - Win rate vs Self
     - 45-55%
     - < 30% or > 70%
   * - Episode length
     - Decreasing
     - Increasing
   * - Reward
     - Increasing
     - Flat or decreasing
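
The win-rate thresholds in the table can be turned into a simple automated check. The sketch below is illustrative only: ``check_metrics`` is a hypothetical helper, and win rates are assumed to be fractions between 0 and 1.

.. code-block:: python

    def check_metrics(win_rate_vs_random, win_rate_vs_self):
        """Return warnings for win rates that fall in the "Bad Sign" ranges."""
        warnings = []
        if win_rate_vs_random < 0.50:
            warnings.append("win rate vs random below 50%")
        if win_rate_vs_self < 0.30 or win_rate_vs_self > 0.70:
            warnings.append("win rate vs self outside 30-70%")
        return warnings


    healthy = check_metrics(win_rate_vs_random=0.62, win_rate_vs_self=0.50)
    unhealthy = check_metrics(win_rate_vs_random=0.45, win_rate_vs_self=0.80)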

Expected Results
================

Training Timeline
-----------------

.. code-block:: text

    Timesteps     Win Rate (vs Random)
    ---------     -------------------
    100K          40-45%
    250K          50-55%
    500K          55-60%
    1M            60-65%
    2M            65-70%
    5M            70%+

Final Performance
-----------------

After extended self-play training (2M-5M timesteps, per the timeline above):

- **vs Random**: 70%+ win rate
- **vs PPO**: 60%+ win rate
- **vs DQN**: 65%+ win rate

Tips for Best Results
=====================

1. **Long Training**: Self-play benefits from extended training (2M+ steps)

2. **Large Population**: Use 5-10 agents in the population

3. **Regular Checkpoints**: Save every 100K steps for diversity

4. **Evaluation**: Test against fixed baselines periodically

5. **Patience**: Early training may show unstable metrics

6. **Resources**: Self-play uses more memory (multiple models loaded)
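
When evaluating against fixed baselines, the number of evaluation episodes determines how noisy the measured win rate is. As a rough guide, and assuming games are independent, the binomial standard error can be computed as follows (``win_rate_std_error`` is a hypothetical helper):

.. code-block:: python

    import math


    def win_rate_std_error(win_rate, episodes):
        """Binomial standard error of a win rate estimated from `episodes` games."""
        return math.sqrt(win_rate * (1.0 - win_rate) / episodes)


    se_100 = win_rate_std_error(0.5, 100)  # default --eval-episodes
    se_200 = win_rate_std_error(0.5, 200)

At the default of 100 episodes, a true 50% win rate is measured with a standard error of about 5 percentage points, so small changes between evaluations may be noise rather than progress.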

Troubleshooting
===============

Win Rate Not Improving
----------------------

- Increase population diversity
- Add more random agents to pool
- Lower learning rate

Training Unstable
-----------------

- Switch opponents less often (for example, every few episodes instead of every episode)
- Increase batch size
- Use smaller population

Out of Memory
-------------

- Reduce population size
- Use smaller LSTM hidden size
- Clear old checkpoints

Using the Trained Model
=======================

After training, the champion model is saved to:

.. code-block:: text

    models/selfplay_champion.zip

Use it in the GUI or evaluation:

.. code-block:: bash

    python uno_gui.py  # Select "Self-Play Champion" from dropdown

Or load programmatically:

.. code-block:: python

    from sb3_contrib import RecurrentPPO
    
    model = RecurrentPPO.load("models/selfplay_champion.zip")