========
Training
========

This guide covers how to train your own UNO RL agents.

Quick Training
==============

Standard PPO Training
---------------------

.. code-block:: bash

    python train_rl.py --algorithm ppo --timesteps 500000

This trains a PPO agent for 500,000 timesteps and saves the model to ``models/``.

Recurrent PPO Training
----------------------

RecurrentPPO adds an LSTM to the policy, which tends to give better results for UNO:

.. code-block:: bash

    python train_recurrent_ppo.py --timesteps 1000000

Training Parameters
===================

Command Line Arguments
----------------------

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Argument
     - Default
     - Description
   * - ``--timesteps``
     - 100000
     - Total training timesteps
   * - ``--algorithm``
     - ppo
     - Algorithm: ppo, dqn, a2c
   * - ``--learning-rate``
     - 3e-4
     - Learning rate
   * - ``--batch-size``
     - 64
     - Batch size for updates
   * - ``--eval-freq``
     - 10000
     - Evaluation frequency
   * - ``--eval-episodes``
     - 100
     - Episodes per evaluation
   * - ``--save-path``
     - models/
     - Where to save model
   * - ``--log-dir``
     - logs/
     - TensorBoard log directory
   * - ``--seed``
     - 42
     - Random seed

Example with Custom Parameters
------------------------------

.. code-block:: bash

    python train_rl.py \
        --algorithm ppo \
        --timesteps 2000000 \
        --learning-rate 1e-4 \
        --batch-size 128 \
        --eval-freq 50000 \
        --seed 123

Training Scripts
================

Available Training Scripts
--------------------------

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Script
     - Description
   * - ``train_rl.py``
     - General training script (PPO, DQN, A2C)
   * - ``train_sb3.py``
     - Stable-Baselines3 focused training
   * - ``train_recurrent_ppo.py``
     - Standard RecurrentPPO training
   * - ``train_best_recurrent_ppo.py``
     - Optimized RecurrentPPO
   * - ``train_optimal_recurrent_ppo.py``
     - Hyperparameter-tuned RecurrentPPO
   * - ``train_best_ppo.py``
     - Best non-recurrent PPO
   * - ``training/train_selfplay.py``
     - Self-play training (recommended)

Using Config File
-----------------

Edit ``config.py`` to set defaults that persist across runs:

.. code-block:: python

    training_config = {
        "timesteps": 1000000,
        "learning_rate": 3e-4,
        "batch_size": 64,
        "n_steps": 128,
        "n_epochs": 10,
        "gamma": 0.99,
        "clip_range": 0.2,
    }
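
The interaction between ``n_steps``, ``batch_size``, and ``n_epochs`` is easier to see with a little arithmetic. The sketch below assumes these keys are passed unchanged to Stable-Baselines3's PPO and that training uses a single environment (``n_envs = 1``); neither is stated in this guide.

.. code-block:: python

    import math

    training_config = {
        "timesteps": 1_000_000,
        "batch_size": 64,
        "n_steps": 128,
        "n_epochs": 10,
    }
    n_envs = 1  # assumption: one environment; adjust if you train on several

    # Transitions collected before each policy update.
    rollout_size = training_config["n_steps"] * n_envs

    # Each epoch splits the rollout into minibatches of ``batch_size``.
    minibatches_per_epoch = rollout_size // training_config["batch_size"]

    # Gradient steps taken on every rollout.
    gradient_steps_per_rollout = minibatches_per_epoch * training_config["n_epochs"]

    # Rollouts are collected until the timestep budget is reached.
    num_rollouts = math.ceil(training_config["timesteps"] / rollout_size)

    print(f"rollout size:          {rollout_size}")
    print(f"gradient steps/update: {gradient_steps_per_rollout}")
    print(f"policy updates:        {num_rollouts}")

Keeping ``rollout_size`` a multiple of ``batch_size`` avoids a truncated final minibatch in each epoch.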

Monitoring Training
===================

TensorBoard
-----------

View training progress with TensorBoard:

.. code-block:: bash

    tensorboard --logdir logs/

Open http://localhost:6006 in your browser to see:

- Episode rewards
- Episode lengths
- Loss curves
- Learning rate
- Explained variance

Evaluation During Training
--------------------------

Enable periodic evaluation:

.. code-block:: bash

    python train_rl.py --eval-freq 10000 --eval-episodes 100

Results are saved to ``logs/evaluations.npz``.
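
The saved results can be inspected with NumPy. This sketch assumes the file is written by Stable-Baselines3's ``EvalCallback``, which stores the arrays ``timesteps``, ``results`` (one row of episode rewards per evaluation), and ``ep_lengths``; the helper name ``summarize_evaluations`` is made up for this example.

.. code-block:: python

    from pathlib import Path

    import numpy as np


    def summarize_evaluations(path):
        """Return (timestep, mean_reward, std_reward) for each evaluation round."""
        data = np.load(path)
        timesteps = data["timesteps"]  # shape: (n_evaluations,)
        results = data["results"]      # shape: (n_evaluations, n_eval_episodes)
        return [
            (int(t), float(rewards.mean()), float(rewards.std()))
            for t, rewards in zip(timesteps, results)
        ]


    log_file = Path("logs/evaluations.npz")
    if log_file.exists():
        for step, mean, std in summarize_evaluations(log_file):
            print(f"{step:>9} steps: mean reward {mean:.3f} +/- {std:.3f}")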

Checkpointing
=============

Save Checkpoints
----------------

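Checkpoints are saved periodically during training through Stable-Baselines3's ``CheckpointCallback``, which is passed to ``model.learn()`` via its ``callback`` argument: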

.. code-block:: python

    from stable_baselines3.common.callbacks import CheckpointCallback
    
    checkpoint_callback = CheckpointCallback(
        save_freq=50000,
        save_path="./models/checkpoints/",
        name_prefix="uno_model"
    )

Load from Checkpoint
--------------------

Resume training from a checkpoint:

.. code-block:: python

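    from sb3_contrib import RecurrentPPO

    # env: the UNO environment used for the original run (create it as your training script does)
    model = RecurrentPPO.load("models/checkpoints/uno_model_500000_steps", env=env)
    # Train 500K more steps; reset_num_timesteps=False keeps the step counter running
    model.learn(total_timesteps=500000, reset_num_timesteps=False)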
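
When several checkpoints exist, the most recent one can be picked by parsing the step count from the file name. The sketch below assumes the ``<prefix>_<steps>_steps.zip`` naming that ``CheckpointCallback`` produces; ``latest_checkpoint`` is a hypothetical helper, not part of this project.

.. code-block:: python

    import re
    from pathlib import Path

    STEP_PATTERN = re.compile(r"_(\d+)_steps\.zip$")


    def latest_checkpoint(directory, prefix="uno_model"):
        """Return the checkpoint with the highest step count, or None if there is none."""
        best_path, best_steps = None, -1
        for path in Path(directory).glob(f"{prefix}_*_steps.zip"):
            match = STEP_PATTERN.search(path.name)
            if match and int(match.group(1)) > best_steps:
                best_path, best_steps = path, int(match.group(1))
        return best_path


    checkpoint = latest_checkpoint("models/checkpoints")
    print(checkpoint if checkpoint else "no checkpoints found")

Comparing step counts as integers matters here: sorting the file names as strings would rank ``uno_model_50000_steps`` after ``uno_model_500000_steps``.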

Best Practices
==============

1. **Start Small**: Begin with 100K steps to verify everything works.

2. **Use RecurrentPPO**: For UNO, LSTM-based models consistently outperform MLP.

3. **Monitor Early**: Check TensorBoard after 10K steps to catch issues.

4. **Save Often**: Use checkpoints every 50K steps.

5. **Evaluate Consistently**: Always evaluate against the same opponents.

6. **Use Self-Play**: For 70%+ win rates, self-play training is essential.
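
When comparing win rates between runs, it helps to know how much of a difference is just noise. As an illustration (this is a standard statistical tool, not something the training scripts compute), the sketch below calculates a 95% Wilson score interval for a win rate measured over a fixed number of evaluation games.

.. code-block:: python

    import math


    def wilson_interval(wins, games, z=1.96):
        """Return the (low, high) Wilson score interval for a win rate."""
        if games <= 0:
            raise ValueError("games must be positive")
        p = wins / games
        denom = 1 + z**2 / games
        center = (p + z**2 / (2 * games)) / denom
        half_width = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
        return center - half_width, center + half_width


    low, high = wilson_interval(wins=70, games=100)
    print(f"70/100 wins: 95% interval {low:.3f} to {high:.3f}")

With 100 evaluation episodes, an observed 70% win rate is compatible with a true rate of roughly 60% to 78%, so small differences between models need more games to confirm.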

Common Issues
=============

Training Doesn't Converge
-------------------------

- Lower learning rate (try 1e-4 or 1e-5)
- Increase batch size
- Check reward function
- Verify that the environment behaves correctly (for example, with Stable-Baselines3's ``check_env``)
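
Instead of a fixed lower learning rate, another option is to decay it over the run. Stable-Baselines3 accepts a callable for ``learning_rate`` that receives the remaining progress (from 1.0 at the start to 0.0 at the end). The sketch below is a minimal linear schedule; whether ``train_rl.py`` exposes this option is not covered by this guide, so it assumes you construct the model yourself.

.. code-block:: python

    from typing import Callable


    def linear_schedule(initial_value: float, final_value: float = 0.0) -> Callable[[float], float]:
        """Build a schedule that decays linearly from initial_value to final_value."""

        def schedule(progress_remaining: float) -> float:
            # progress_remaining goes from 1.0 (start of training) to 0.0 (end).
            return final_value + progress_remaining * (initial_value - final_value)

        return schedule


    lr = linear_schedule(1e-4, 1e-5)
    for progress in (1.0, 0.5, 0.0):
        print(f"progress remaining {progress:.1f}: lr = {lr(progress):.2e}")

The schedule is then passed in place of a number, for example ``PPO("MlpPolicy", env, learning_rate=linear_schedule(1e-4, 1e-5))``.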

Slow Training
-------------

- Reduce ``n_steps`` so that policy updates happen more often
- Use smaller network
- Enable GPU (install PyTorch with CUDA)

Model Overfits
--------------

- Increase entropy coefficient (``ent_coef=0.05``)
- Use self-play training
- Train against diverse opponents

GPU Training
============

Enable GPU training (requires CUDA):

.. code-block:: bash

    # Install PyTorch with CUDA
    pip install torch --index-url https://download.pytorch.org/whl/cu118
    
    # Training automatically uses GPU if available
    python train_rl.py --timesteps 1000000

Check GPU availability:

.. code-block:: python

    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")
    print(f"Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")