Training
This guide covers how to train your own UNO RL agents.
Quick Training
Standard PPO Training
python train_rl.py --algorithm ppo --timesteps 500000
This trains a PPO agent for 500K timesteps and saves the model to models/.
Recurrent PPO Training
For better results, train the LSTM-based RecurrentPPO agent:
python train_recurrent_ppo.py --timesteps 1000000
Training Parameters
Command Line Arguments
| Argument | Default | Description |
|---|---|---|
| --timesteps | 100000 | Total training timesteps |
| --algorithm | ppo | Algorithm: ppo, dqn, a2c |
| --learning-rate | 3e-4 | Learning rate |
| --batch-size | 64 | Batch size for updates |
| --eval-freq | 10000 | Evaluation frequency |
| --eval-episodes | 100 | Episodes per evaluation |
| | models/ | Where to save model |
| | logs/ | TensorBoard log directory |
| --seed | 42 | Random seed |
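As a rough illustration of how these defaults could map onto a parser, here is a minimal argparse sketch. It is not the actual code in train_rl.py. It covers only the flags that appear in the commands in this guide (--timesteps, --algorithm, --learning-rate, --batch-size, --eval-freq, --eval-episodes, --seed), and the function name build_parser is made up for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented defaults (not the real script)."""
    parser = argparse.ArgumentParser(description="Train a UNO RL agent (sketch)")
    parser.add_argument("--timesteps", type=int, default=100_000,
                        help="Total training timesteps")
    parser.add_argument("--algorithm", choices=["ppo", "dqn", "a2c"], default="ppo",
                        help="Algorithm to train")
    parser.add_argument("--learning-rate", type=float, default=3e-4,
                        help="Learning rate")
    parser.add_argument("--batch-size", type=int, default=64,
                        help="Batch size for updates")
    parser.add_argument("--eval-freq", type=int, default=10_000,
                        help="Evaluation frequency")
    parser.add_argument("--eval-episodes", type=int, default=100,
                        help="Episodes per evaluation")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    return parser


if __name__ == "__main__":
    # Parse an explicit list so the sketch runs without real command-line input.
    args = build_parser().parse_args(["--timesteps", "2000000", "--learning-rate", "1e-4"])
    print(args.timesteps, args.algorithm, args.learning_rate, args.seed)
```

Note that argparse turns dashes into underscores, so --learning-rate is read back as args.learning_rate.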
Example with Custom Parameters
python train_rl.py \
--algorithm ppo \
--timesteps 2000000 \
--learning-rate 1e-4 \
--batch-size 128 \
--eval-freq 50000 \
--seed 123
Training Scripts
Available Training Scripts
| Script | Description |
|---|---|
| train_rl.py | General training script (PPO, DQN, A2C) |
| | Stable-Baselines3 focused training |
| train_recurrent_ppo.py | Standard RecurrentPPO training |
| | Optimized RecurrentPPO |
| | Hyperparameter-tuned RecurrentPPO |
| | Best non-recurrent PPO |
| | Self-play training (recommended) |
Using Config File
Modify config.py for persistent settings:
training_config = {
    "timesteps": 1000000,
    "learning_rate": 3e-4,
    "batch_size": 64,
    "n_steps": 128,
    "n_epochs": 10,
    "gamma": 0.99,
    "clip_range": 0.2,
}
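To make these numbers concrete, the sketch below splits the dictionary into model hyperparameters and the total timestep budget, then works out how many rollouts and gradient steps the settings imply. It assumes the keys map directly onto Stable-Baselines3 PPO parameters of the same name and that training uses a single environment (n_envs=1). How config.py is actually consumed by the training scripts is not shown in this guide, and the helper names are hypothetical.

```python
import math

training_config = {
    "timesteps": 1000000,
    "learning_rate": 3e-4,
    "batch_size": 64,
    "n_steps": 128,
    "n_epochs": 10,
    "gamma": 0.99,
    "clip_range": 0.2,
}


def split_config(config: dict) -> tuple[dict, int]:
    """Separate the timestep budget from the per-model hyperparameters."""
    model_kwargs = {k: v for k, v in config.items() if k != "timesteps"}
    return model_kwargs, config["timesteps"]


def update_budget(config: dict, n_envs: int = 1) -> dict:
    """Estimate rollout and gradient-step counts (n_envs=1 is an assumption)."""
    rollout_size = config["n_steps"] * n_envs  # transitions collected per rollout
    minibatches = math.ceil(rollout_size / config["batch_size"])  # per epoch
    steps_per_rollout = minibatches * config["n_epochs"]
    rollouts = math.ceil(config["timesteps"] / rollout_size)
    return {
        "rollout_size": rollout_size,
        "gradient_steps_per_rollout": steps_per_rollout,
        "rollouts": rollouts,
        "total_gradient_steps": rollouts * steps_per_rollout,
    }


if __name__ == "__main__":
    model_kwargs, total_timesteps = split_config(training_config)
    print(sorted(model_kwargs), total_timesteps)
    print(update_budget(training_config))
```

With these values, each rollout holds 128 transitions and triggers 20 gradient steps, and one million timesteps take 7,813 rollouts. Raising n_steps or batch_size changes that balance between data collection and updates.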
Monitoring Training
TensorBoard
View training progress with TensorBoard:
tensorboard --logdir logs/
Open http://localhost:6006 in your browser to see:
Episode rewards
Episode lengths
Loss curves
Learning rate
Explained variance
Evaluation During Training
Enable periodic evaluation:
python train_rl.py --eval-freq 10000 --eval-episodes 100
Results are saved to logs/evaluations.npz.
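A short sketch for reading those results back. It assumes the file is written by Stable-Baselines3's EvalCallback, which stores the arrays timesteps, results (one row per evaluation, one column per episode reward), and ep_lengths. If the training script writes the file some other way, the keys may differ. The function name is hypothetical.

```python
import os

import numpy as np


def summarize_evaluations(path: str) -> list[tuple[int, float, float]]:
    """Return (timestep, mean reward, reward std) for each evaluation round."""
    with np.load(path) as data:
        timesteps = data["timesteps"]
        results = data["results"]  # shape: (n_evaluations, n_eval_episodes)
        return [
            (int(t), float(r.mean()), float(r.std()))
            for t, r in zip(timesteps, results)
        ]


if __name__ == "__main__":
    path = "logs/evaluations.npz"
    if os.path.exists(path):
        for step, mean_reward, std_reward in summarize_evaluations(path):
            print(f"{step:>9d} steps: reward {mean_reward:.2f} +/- {std_reward:.2f}")
    else:
        print(f"{path} not found; run training with evaluation enabled first")
```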
Checkpointing
Save Checkpoints
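Checkpoints are saved automatically during training through a Stable-Baselines3 CheckpointCallback, which is passed to model.learn() via its callback argument: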
from stable_baselines3.common.callbacks import CheckpointCallback

checkpoint_callback = CheckpointCallback(
    save_freq=50000,
    save_path="./models/checkpoints/",
    name_prefix="uno_model",
)
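One detail worth a sketch: in Stable-Baselines3, save_freq counts calls to the callback (one per step of the vectorized environment), not total timesteps. With several parallel environments the interval should therefore be divided by the number of environments. Saved model files are named <name_prefix>_<num_timesteps>_steps.zip. The helpers below are hypothetical and only do the arithmetic. n_envs is an assumption, since this guide does not say how many environments the training scripts use.

```python
def per_env_save_freq(target_timesteps: int, n_envs: int = 1) -> int:
    """Convert a checkpoint interval in total timesteps to callback calls."""
    return max(target_timesteps // n_envs, 1)


def checkpoint_filename(name_prefix: str, num_timesteps: int) -> str:
    """File name pattern CheckpointCallback uses for model checkpoints."""
    return f"{name_prefix}_{num_timesteps}_steps.zip"


if __name__ == "__main__":
    # With 4 parallel environments, 50K total timesteps is 12,500 callback calls.
    print(per_env_save_freq(50_000, n_envs=4))
    print(checkpoint_filename("uno_model", 500_000))
```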
Load from Checkpoint
Resume training from a checkpoint:
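from sb3_contrib import RecurrentPPO

# `env` must be the UNO training environment (its construction is not shown here);
# a model loaded without an environment cannot call learn().
model = RecurrentPPO.load("models/checkpoints/uno_model_500000_steps", env=env)
# reset_num_timesteps=False keeps the step counter and TensorBoard curves continuous.
model.learn(total_timesteps=500000, reset_num_timesteps=False)  # Continue training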
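When a run has produced many checkpoints, picking the newest one by hand is error-prone, because sorting file names alphabetically puts uno_model_500000_steps after uno_model_1000000_steps. The sketch below selects the checkpoint with the highest step count instead. It assumes the <name_prefix>_<steps>_steps.zip naming used by CheckpointCallback, and the function name is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

CHECKPOINT_PATTERN = re.compile(r"^(?P<prefix>.+)_(?P<steps>\d+)_steps\.zip$")


def latest_checkpoint(directory: str, name_prefix: str = "uno_model") -> Optional[Path]:
    """Return the checkpoint with the highest step count, or None if there is none."""
    best_steps = -1
    best_path: Optional[Path] = None
    for path in Path(directory).glob(f"{name_prefix}_*_steps.zip"):
        match = CHECKPOINT_PATTERN.match(path.name)
        if match is None or match.group("prefix") != name_prefix:
            continue  # skip files that only look similar
        steps = int(match.group("steps"))  # compare numerically, not as text
        if steps > best_steps:
            best_steps, best_path = steps, path
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint("models/checkpoints"))
```

The returned path can be handed to RecurrentPPO.load(), which accepts the file name with or without the .zip suffix.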
Best Practices
Start Small: Begin with 100K steps to verify everything works.
Use RecurrentPPO: For UNO, LSTM-based models consistently outperform MLP.
Monitor Early: Check TensorBoard after 10K steps to catch issues.
Save Often: Use checkpoints every 50K steps.
Evaluate Consistently: Always evaluate against the same opponents.
Use Self-Play: For 70%+ win rates, self-play training is essential.
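When comparing win rates between runs, it helps to know how noisy a single evaluation is. The sketch below uses the normal approximation to the binomial distribution to put a rough 95% interval around a measured win rate. It assumes the games are independent, and the function name is hypothetical.

```python
import math


def win_rate_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (win rate, lower bound, upper bound) using the normal approximation."""
    if games <= 0:
        raise ValueError("games must be positive")
    rate = wins / games
    half_width = z * math.sqrt(rate * (1.0 - rate) / games)
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


if __name__ == "__main__":
    rate, low, high = win_rate_interval(wins=70, games=100)
    print(f"win rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

With 100 evaluation episodes, a measured 70% win rate carries an uncertainty of roughly plus or minus 9 percentage points, so small differences between checkpoints may be noise rather than real improvement.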
Common Issues
Training Doesn’t Converge
Lower learning rate (try 1e-4 or 1e-5)
Increase batch size
Check reward function
Ensure environment is correct
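Besides a lower constant value, Stable-Baselines3 accepts a callable for learning_rate that receives the remaining training progress (1.0 at the start, 0.0 at the end). A linear decay is one common option, sketched below. It is an illustration, not something the training scripts in this project are known to use.

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Build a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining goes from 1.0 (start of training) to 0.0 (end).
        return progress_remaining * initial_value

    return schedule


if __name__ == "__main__":
    lr = linear_schedule(1e-4)
    for progress in (1.0, 0.5, 0.0):
        print(progress, lr(progress))
```

The returned function could be passed as learning_rate=linear_schedule(1e-4) when constructing the model.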
Slow Training
Reduce n_steps for faster updates
Use smaller network
Enable GPU (install PyTorch with CUDA)
Model Overfits
Increase entropy coefficient (ent_coef=0.05)
Use self-play training
Train against diverse opponents
GPU Training
Enable GPU training (requires CUDA):
# Install PyTorch with CUDA
pip install torch --index-url https://download.pytorch.org/whl/cu118
# Training automatically uses GPU if available
python train_rl.py --timesteps 1000000
Check GPU availability:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}")
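If the same script has to run on machines with and without a CUDA-enabled PyTorch, the check can be wrapped in a small helper. This is a sketch with a hypothetical function name. The returned string is meant for the device argument that Stable-Baselines3 models accept ("auto", "cuda", or "cpu").

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Training device: {pick_device()}")
```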