Training

This guide covers how to train your own UNO RL agents.

Quick Training

Standard PPO Training

python train_rl.py --algorithm ppo --timesteps 500000

This trains a PPO agent for 500,000 timesteps and saves the model to models/.

Recurrent PPO Training

For better results, train an LSTM-based RecurrentPPO agent:

python train_recurrent_ppo.py --timesteps 1000000

Training Parameters

Command Line Arguments

Argument	Default	Description
`--timesteps`	100000	Total training timesteps
`--algorithm`	ppo	Algorithm: ppo, dqn, a2c
`--learning-rate`	3e-4	Learning rate
`--batch-size`	64	Batch size for updates
`--eval-freq`	10000	Timesteps between evaluations
`--eval-episodes`	100	Episodes per evaluation
`--save-path`	models/	Directory where the trained model is saved
`--log-dir`	logs/	TensorBoard log directory
`--seed`	42	Random seed
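
The table above can be mirrored in an argparse parser. This is a minimal sketch, assuming the script uses argparse: the real train_rl.py parser is not shown in this guide, and build_parser is a hypothetical name. Only the flag names and defaults come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Train a UNO RL agent")
    parser.add_argument("--timesteps", type=int, default=100_000,
                        help="Total training timesteps")
    parser.add_argument("--algorithm", choices=["ppo", "dqn", "a2c"], default="ppo",
                        help="Algorithm to train")
    parser.add_argument("--learning-rate", type=float, default=3e-4,
                        help="Learning rate")
    parser.add_argument("--batch-size", type=int, default=64,
                        help="Batch size for updates")
    parser.add_argument("--eval-freq", type=int, default=10_000,
                        help="Evaluation frequency")
    parser.add_argument("--eval-episodes", type=int, default=100,
                        help="Episodes per evaluation")
    parser.add_argument("--save-path", default="models/",
                        help="Where to save the model")
    parser.add_argument("--log-dir", default="logs/",
                        help="TensorBoard log directory")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    return parser


# Parsing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--timesteps", "500000", "--learning-rate", "1e-4"])
print(args.algorithm, args.timesteps, args.learning_rate, args.batch_size)
```

Flags that are not passed fall back to the defaults listed in the table, and argparse exposes hyphenated flags with underscores (for example args.learning_rate).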

Example with Custom Parameters

python train_rl.py \
    --algorithm ppo \
    --timesteps 2000000 \
    --learning-rate 1e-4 \
    --batch-size 128 \
    --eval-freq 50000 \
    --seed 123

Training Scripts

Available Training Scripts

Script	Description
`train_rl.py`	General training script (PPO, DQN, A2C)
`train_sb3.py`	Stable-Baselines3 focused training
`train_recurrent_ppo.py`	Standard RecurrentPPO training
`train_best_recurrent_ppo.py`	Optimized RecurrentPPO
`train_optimal_recurrent_ppo.py`	Hyperparameter-tuned RecurrentPPO
`train_best_ppo.py`	Best non-recurrent PPO
`training/train_selfplay.py`	Self-play training (recommended)

Using Config File

Modify config.py for persistent settings:

training_config = {
    "timesteps": 1000000,
    "learning_rate": 3e-4,
    "batch_size": 64,
    "n_steps": 128,
    "n_epochs": 10,
    "gamma": 0.99,
    "clip_range": 0.2,
}
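
To see how these values interact, the arithmetic below estimates how many rollouts and gradient steps a run performs. It is a sketch that assumes Stable-Baselines3 PPO semantics (each rollout collects n_steps * n_envs transitions, then runs n_epochs passes over minibatches of batch_size) and a single environment. The names n_envs and update_budget are introduced here for illustration and do not come from config.py.

```python
import math

training_config = {
    "timesteps": 1_000_000,
    "learning_rate": 3e-4,
    "batch_size": 64,
    "n_steps": 128,
    "n_epochs": 10,
    "gamma": 0.99,
    "clip_range": 0.2,
}


def update_budget(config: dict, n_envs: int = 1) -> dict:
    """Estimate rollouts and gradient steps (n_envs=1 is an assumption)."""
    rollout_size = config["n_steps"] * n_envs
    # Training stops after the first rollout that reaches the timestep target,
    # so the number of rollouts is rounded up.
    n_rollouts = math.ceil(config["timesteps"] / rollout_size)
    minibatches_per_epoch = math.ceil(rollout_size / config["batch_size"])
    gradient_steps = n_rollouts * config["n_epochs"] * minibatches_per_epoch
    return {
        "rollout_size": rollout_size,
        "n_rollouts": n_rollouts,
        "minibatches_per_epoch": minibatches_per_epoch,
        "gradient_steps": gradient_steps,
    }


budget = update_budget(training_config)
print(budget)
```

Under these assumptions each rollout holds only 128 transitions, so every policy update is computed from two minibatches. Raising n_steps or the number of environments gives each update more data at the cost of fewer updates per timestep.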

Monitoring Training

TensorBoard

View training progress with TensorBoard:

tensorboard --logdir logs/

Open http://localhost:6006 in your browser to see:

Episode rewards
Episode lengths
Loss curves
Learning rate
Explained variance

Evaluation During Training

Enable periodic evaluation:

python train_rl.py --eval-freq 10000 --eval-episodes 100

Results are saved to logs/evaluations.npz.
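
The saved file can be inspected with NumPy. This sketch assumes the file is written by Stable-Baselines3's EvalCallback, which stores the arrays timesteps, results (one row per evaluation, one column per episode reward) and ep_lengths. If the training script writes a different layout, the keys need adjusting. summarize_evaluations is a hypothetical helper, and the demo writes a small synthetic file rather than reading real results.

```python
import os
import tempfile

import numpy as np


def summarize_evaluations(path: str) -> list:
    """Return (timestep, mean reward, reward std) for each evaluation round."""
    data = np.load(path)
    rewards = data["results"]  # assumed shape: (n_evaluations, n_eval_episodes)
    return [
        (int(t), float(r.mean()), float(r.std()))
        for t, r in zip(data["timesteps"], rewards)
    ]


# Demo on synthetic data with the assumed layout (not real training results).
demo_path = os.path.join(tempfile.mkdtemp(), "evaluations.npz")
np.savez(
    demo_path,
    timesteps=np.array([10_000, 20_000]),
    results=np.array([[0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 1.0, 0.0]]),
    ep_lengths=np.array([[30, 42, 25, 51], [28, 33, 40, 37]]),
)
summary = summarize_evaluations(demo_path)
for timestep, mean_reward, std_reward in summary:
    print(f"{timestep:>7} steps: mean reward {mean_reward:.2f} +/- {std_reward:.2f}")
```

For a real run, point summarize_evaluations at logs/evaluations.npz.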

Checkpointing

Save Checkpoints

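Checkpoints are saved automatically during training by a Stable-Baselines3 CheckpointCallback:

from stable_baselines3.common.callbacks import CheckpointCallback

# save_freq counts callback calls; with N parallel environments
# a checkpoint is written every save_freq * N timesteps.
checkpoint_callback = CheckpointCallback(
    save_freq=50000,
    save_path="./models/checkpoints/",
    name_prefix="uno_model"
)

# The callback only takes effect when it is passed to learn():
# model.learn(total_timesteps=500000, callback=checkpoint_callback)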

Load from Checkpoint

Resume training from a checkpoint:

from sb3_contrib import RecurrentPPO

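# `env` is your UNO environment instance; it must be supplied to continue training
model = RecurrentPPO.load("models/checkpoints/uno_model_500000_steps", env=env)
# reset_num_timesteps=False keeps the step counter, so this trains 500,000 more steps
model.learn(total_timesteps=500000, reset_num_timesteps=False)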
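
When several checkpoints exist, the most recent one can be picked by step count. The helper below is a sketch that assumes the files follow CheckpointCallback's naming pattern, <name_prefix>_<steps>_steps.zip. latest_checkpoint is a hypothetical name, and the demo creates empty placeholder files instead of real models.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def latest_checkpoint(directory: str, prefix: str = "uno_model") -> Optional[Path]:
    """Return the checkpoint with the highest step count, or None if there is none."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)_steps\.zip$")
    candidates = []
    for path in Path(directory).glob("*.zip"):
        match = pattern.match(path.name)
        if match:
            # Store the step count as an integer so the comparison is numeric.
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Demo with empty placeholder files (not real models).
demo_dir = tempfile.mkdtemp()
for steps in (50_000, 100_000, 500_000):
    Path(demo_dir, f"uno_model_{steps}_steps.zip").touch()

newest = latest_checkpoint(demo_dir)
print(newest.name)
```

Comparing step counts as integers matters: a plain alphabetical sort would rank uno_model_50000_steps after uno_model_500000_steps. The returned path can then be passed to RecurrentPPO.load.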

Best Practices

Start Small: Begin with 100K steps to verify everything works.
Use RecurrentPPO: For UNO, LSTM-based models consistently outperform MLP.
Monitor Early: Check TensorBoard after 10K steps to catch issues.
Save Often: Use checkpoints every 50K steps.
Evaluate Consistently: Always evaluate against the same opponents.
Use Self-Play: For 70%+ win rates, self-play training is essential.
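
Evaluation sample size matters when comparing agents or checking a win-rate target. As a rough illustration, the sketch below uses a normal approximation to show how uncertain a win rate measured over 100 episodes is. It is not part of the project's code, and win_rate_interval is a hypothetical helper.

```python
import math


def win_rate_interval(wins: int, episodes: int, z: float = 1.96) -> tuple:
    """Return (win rate, lower bound, upper bound) using a normal approximation."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    rate = wins / episodes
    margin = z * math.sqrt(rate * (1.0 - rate) / episodes)
    # Clamp to [0, 1] because a win rate cannot leave that range.
    return rate, max(0.0, rate - margin), min(1.0, rate + margin)


rate, low, high = win_rate_interval(wins=70, episodes=100)
print(f"win rate {rate:.2f}, approx. 95% interval [{low:.2f}, {high:.2f}]")
```

With 100 episodes, a measured 70% win rate is compatible with true rates of roughly 61% to 79%, so small differences between checkpoints need more evaluation episodes before they mean much.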

Common Issues

Training Doesn’t Converge

Lower the learning rate (try 1e-4 or 1e-5)
Increase the batch size
Check the reward function
Verify that the environment behaves correctly
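
Besides a fixed lower rate, a decaying learning rate is a common remedy. Stable-Baselines3 accepts a callable for learning_rate that receives the remaining training progress (1.0 at the start, 0.0 at the end). The sketch below defines such a schedule. linear_schedule is a hypothetical helper, and whether the training scripts expose a way to pass it is an assumption.

```python
from typing import Callable


def linear_schedule(initial_value: float, final_value: float = 0.0) -> Callable[[float], float]:
    """Build a schedule that decays linearly from initial_value to final_value."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start of training) to 0.0 (end).
        return final_value + progress_remaining * (initial_value - final_value)

    return schedule


lr = linear_schedule(3e-4, 1e-5)
for progress_remaining in (1.0, 0.5, 0.0):
    print(progress_remaining, lr(progress_remaining))
```

When constructing a model directly, the schedule would be passed in place of the constant, for example learning_rate=linear_schedule(3e-4, 1e-5).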

Slow Training

Reduce n_steps so that policy updates happen more often
Use a smaller network
Enable GPU (install PyTorch with CUDA)

Model Overfits

Increase the entropy coefficient (e.g. ent_coef=0.05)
Use self-play training
Train against diverse opponents

GPU Training

Enable GPU training (requires CUDA):

# Install a CUDA-enabled PyTorch build (this index URL is for CUDA 11.8)
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Training automatically uses GPU if available
python train_rl.py --timesteps 1000000

Check GPU availability:

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")