Training

This guide covers how to train your own UNO RL agents.

Quick Training

Standard PPO Training

python train_rl.py --algorithm ppo --timesteps 500000

This trains a PPO agent for 500K timesteps and saves the model to models/.

Recurrent PPO Training

For better results, train an LSTM-based agent with RecurrentPPO:

python train_recurrent_ppo.py --timesteps 1000000

Training Parameters

Command Line Arguments

Argument          Default   Description
--timesteps       100000    Total training timesteps
--algorithm       ppo       Algorithm: ppo, dqn, a2c
--learning-rate   3e-4      Learning rate
--batch-size      64        Batch size for updates
--eval-freq       10000     Evaluation frequency
--eval-episodes   100       Episodes per evaluation
--save-path       models/   Where to save model
--log-dir         logs/     TensorBoard log directory
--seed            42        Random seed
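
The arguments and defaults listed above map naturally onto an argparse parser. The following is a minimal sketch of what such a parser could look like; the function name build_parser is an assumption for illustration, and the actual parser in train_rl.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Train a UNO RL agent")
    parser.add_argument("--timesteps", type=int, default=100_000,
                        help="Total training timesteps")
    parser.add_argument("--algorithm", choices=["ppo", "dqn", "a2c"], default="ppo",
                        help="Algorithm to train")
    parser.add_argument("--learning-rate", type=float, default=3e-4,
                        help="Learning rate")
    parser.add_argument("--batch-size", type=int, default=64,
                        help="Batch size for updates")
    parser.add_argument("--eval-freq", type=int, default=10_000,
                        help="Evaluation frequency")
    parser.add_argument("--eval-episodes", type=int, default=100,
                        help="Episodes per evaluation")
    parser.add_argument("--save-path", default="models/",
                        help="Where to save the model")
    parser.add_argument("--log-dir", default="logs/",
                        help="TensorBoard log directory")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    return parser


# Arguments that are not passed fall back to the documented defaults.
args = build_parser().parse_args(["--timesteps", "2000000", "--learning-rate", "1e-4"])
print(args.timesteps, args.learning_rate, args.batch_size)
```

Note that argparse converts dashes to underscores, so --learning-rate is read as args.learning_rate.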

Example with Custom Parameters

python train_rl.py \
    --algorithm ppo \
    --timesteps 2000000 \
    --learning-rate 1e-4 \
    --batch-size 128 \
    --eval-freq 50000 \
    --seed 123

Training Scripts

Available Training Scripts

Script                            Description
train_rl.py                       General training script (PPO, DQN, A2C)
train_sb3.py                      Stable-Baselines3 focused training
train_recurrent_ppo.py            Standard RecurrentPPO training
train_best_recurrent_ppo.py       Optimized RecurrentPPO
train_optimal_recurrent_ppo.py    Hyperparameter-tuned RecurrentPPO
train_best_ppo.py                 Best non-recurrent PPO
training/train_selfplay.py        Self-play training (recommended)

Using Config File

Modify config.py for persistent settings:

training_config = {
    "timesteps": 1000000,
    "learning_rate": 3e-4,
    "batch_size": 64,
    "n_steps": 128,
    "n_epochs": 10,
    "gamma": 0.99,
    "clip_range": 0.2,
}
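
How config.py is consumed by the training scripts is not shown here. As a rough sketch of one possible approach, the dictionary can be split into the timestep budget (which goes to learn()) and the remaining keys, all of which are valid PPO constructor arguments in Stable-Baselines3. The helper split_config and its n_envs parameter are hypothetical.

```python
from typing import Any, Dict, Tuple

training_config = {
    "timesteps": 1000000,
    "learning_rate": 3e-4,
    "batch_size": 64,
    "n_steps": 128,
    "n_epochs": 10,
    "gamma": 0.99,
    "clip_range": 0.2,
}


def split_config(config: Dict[str, Any], n_envs: int = 1) -> Tuple[int, Dict[str, Any]]:
    """Separate the timestep budget from the algorithm keyword arguments."""
    algo_kwargs = dict(config)  # copy so the original config is not mutated
    total_timesteps = algo_kwargs.pop("timesteps")
    # PPO collects n_steps * n_envs transitions per update; a batch size that
    # does not divide this evenly leaves a truncated final minibatch.
    rollout_size = algo_kwargs["n_steps"] * n_envs
    if rollout_size % algo_kwargs["batch_size"] != 0:
        raise ValueError(
            f"batch_size={algo_kwargs['batch_size']} does not divide "
            f"n_steps * n_envs = {rollout_size}"
        )
    return total_timesteps, algo_kwargs


total_timesteps, algo_kwargs = split_config(training_config)
# Assumed usage (env is a placeholder for the UNO environment):
#   model = PPO("MlpPolicy", env, **algo_kwargs)
#   model.learn(total_timesteps=total_timesteps)
print(total_timesteps, sorted(algo_kwargs))
```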

Monitoring Training

TensorBoard

View training progress with TensorBoard:

tensorboard --logdir logs/

Open http://localhost:6006 in your browser to see:

  • Episode rewards

  • Episode lengths

  • Loss curves

  • Learning rate

  • Explained variance

Evaluation During Training

Enable periodic evaluation:

python train_rl.py --eval-freq 10000 --eval-episodes 100

Results are saved to logs/evaluations.npz.
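
To inspect these results outside TensorBoard, the file can be loaded with NumPy. The sketch below assumes the file is written by Stable-Baselines3's EvalCallback, which stores the arrays timesteps, results, and ep_lengths; the helper name summarize_evaluations is hypothetical.

```python
import os

import numpy as np


def summarize_evaluations(path: str):
    """Return (timestep, mean reward, std reward) for each evaluation round."""
    with np.load(path) as data:
        timesteps = data["timesteps"]  # shape: (n_evaluations,)
        results = data["results"]      # shape: (n_evaluations, n_eval_episodes)
    means = results.mean(axis=1)
    stds = results.std(axis=1)
    return [(int(t), float(m), float(s)) for t, m, s in zip(timesteps, means, stds)]


# Print one line per evaluation round if a log file is present.
if os.path.exists("logs/evaluations.npz"):
    for step, mean, std in summarize_evaluations("logs/evaluations.npz"):
        print(f"{step:>9d} steps: mean reward {mean:.3f} +/- {std:.3f}")
```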

Checkpointing

Save Checkpoints

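Checkpoints are saved periodically during training through a CheckpointCallback:

from stable_baselines3.common.callbacks import CheckpointCallback

# Save a copy of the model every 50K steps under ./models/checkpoints/
checkpoint_callback = CheckpointCallback(
    save_freq=50000,
    save_path="./models/checkpoints/",
    name_prefix="uno_model"
)
# The callback takes effect once it is passed to model.learn(..., callback=checkpoint_callback).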
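
One detail worth knowing: in Stable-Baselines3, save_freq counts callback calls, and each call corresponds to one step in every parallel environment. With several vectorized environments, a save_freq of 50000 therefore means 50000 * n_envs total timesteps between checkpoints. A small sketch of the conversion follows; the helper save_freq_for is hypothetical.

```python
def save_freq_for(timesteps_between_saves: int, n_envs: int = 1) -> int:
    """Convert a checkpoint interval in total timesteps into a save_freq value."""
    if n_envs < 1:
        raise ValueError("n_envs must be at least 1")
    # One callback call advances the global step counter by n_envs timesteps.
    return max(timesteps_between_saves // n_envs, 1)


print(save_freq_for(50_000))            # single environment: 50000
print(save_freq_for(50_000, n_envs=8))  # eight parallel environments: 6250
```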

Load from Checkpoint

Resume training from a checkpoint:

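from sb3_contrib import RecurrentPPO

# `env` must be the UNO environment used for the original run, created beforehand;
# a model loaded without an environment cannot continue training.
model = RecurrentPPO.load("models/checkpoints/uno_model_500000_steps", env=env)
# reset_num_timesteps=False keeps the step counter and TensorBoard curves continuous.
model.learn(total_timesteps=500000, reset_num_timesteps=False)  # Train for 500K more steps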
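
When several checkpoints exist, picking the most recent one by hand is error-prone. The sketch below assumes the files follow the <name_prefix>_<steps>_steps.zip naming that CheckpointCallback produces; the helper latest_checkpoint is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(directory: str, prefix: str = "uno_model") -> Optional[Path]:
    """Return the checkpoint with the highest step count, or None if none exist."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)_steps\.zip$")
    best_steps, best_path = -1, None
    for path in Path(directory).glob("*.zip"):
        match = pattern.match(path.name)
        # Compare step counts numerically; sorting names as strings would
        # rank 50000 above 500000.
        if match and int(match.group(1)) > best_steps:
            best_steps, best_path = int(match.group(1)), path
    return best_path


print(latest_checkpoint("models/checkpoints"))
```

The returned path can then be passed to RecurrentPPO.load as a string.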

Best Practices

  1. Start Small: Begin with 100K steps to verify everything works.

  2. Use RecurrentPPO: For UNO, LSTM-based models consistently outperform MLP policies, likely because memory of earlier plays helps when opponents' hands are hidden.

  3. Monitor Early: Check TensorBoard after 10K steps to catch issues.

  4. Save Often: Use checkpoints every 50K steps.

  5. Evaluate Consistently: Always evaluate against the same opponents.

  6. Use Self-Play: For 70%+ win rates, self-play training is essential.
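
When comparing win rates across runs (practice 5), it helps to know how much of a difference is plain sampling noise. The following is a minimal sketch using the normal approximation to the binomial distribution; the helper win_rate_interval is hypothetical and not part of the training scripts.

```python
import math


def win_rate_interval(wins: int, episodes: int, z: float = 1.96):
    """Return (win_rate, lower, upper) for an approximate 95% interval."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    rate = wins / episodes
    # Standard error of a proportion, scaled by z for the interval half-width.
    half_width = z * math.sqrt(rate * (1.0 - rate) / episodes)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


rate, low, high = win_rate_interval(wins=70, episodes=100)
print(f"win rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
# prints: win rate 0.70, 95% interval [0.61, 0.79]
```

With 100 evaluation episodes the interval spans roughly 9 percentage points on either side, so differences of a few points between two runs may not be meaningful without more episodes.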

Common Issues

Training Doesn’t Converge

  • Lower learning rate (try 1e-4 or 1e-5)

  • Increase batch size

  • Check reward function

  • Ensure environment is correct
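
As a variation on lowering the learning rate to a fixed value, Stable-Baselines3 also accepts a callable for learning_rate that receives the remaining training progress (1.0 at the start, 0.0 at the end). The sketch below shows a linear decay; whether the project's training scripts expose this option is not stated here, so treat it as an assumption to adapt.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Return a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start of training) to 0.0 (end).
        return progress_remaining * initial_value

    return schedule


lr = linear_schedule(3e-4)
print(lr(1.0), lr(0.5), lr(0.0))
# Assumed usage (env is a placeholder for the UNO environment):
#   model = PPO("MlpPolicy", env, learning_rate=linear_schedule(3e-4))
```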

Slow Training

  • Reduce n_steps for faster updates

  • Use smaller network

  • Enable GPU (install PyTorch with CUDA)

Model Overfits

  • Increase entropy coefficient (ent_coef=0.05)

  • Use self-play training

  • Train against diverse opponents

GPU Training

Enable GPU training (requires CUDA):

# Install PyTorch with CUDA
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Training automatically uses GPU if available
python train_rl.py --timesteps 1000000

Check GPU availability:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")