==========
Algorithms
==========

This page explains the Reinforcement Learning algorithms used in this project.

Overview
========

We explored several RL algorithms for learning to play UNO:

.. list-table::
   :header-rows: 1
   :widths: 25 15 20 40

   * - Algorithm
     - Type
     - Win Rate
     - Best For
   * - Recurrent PPO
     - Policy
     - 60%+
     - Partial observability (recommended)
   * - PPO
     - Policy
     - 53%
     - Stable training
   * - DQN
     - Value
     - 48%
     - Discrete actions
   * - A2C
     - Policy
     - 45%
     - Fast training
   * - Q-Learning
     - Tabular
     - 35%
     - Small state spaces

Q-Learning
==========

The simplest algorithm we tested. Uses a table to store Q-values for each state-action pair.

**Update Rule:**

.. math::

    Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]

Where:
- :math:`\alpha` is the learning rate
- :math:`\gamma` is the discount factor
- :math:`r` is the immediate reward

**Limitations:**
- State space too large for tables
- No generalization across similar states
- Only achieves ~35% win rate

Deep Q-Network (DQN)
====================

Uses neural networks to approximate Q-values, enabling generalization.

**Architecture:**

.. code-block:: text

    Input (17) → Dense(256) → ReLU → Dense(256) → ReLU → Output (9)

**Key Techniques:**

1. **Experience Replay**: Store transitions in buffer, sample randomly
2. **Target Network**: Separate network for stable targets
3. **ε-greedy Exploration**: Random actions with probability ε

**Configuration:**

.. code-block:: python

    DQN(
        policy="MlpPolicy",
        env=env,
        learning_rate=1e-4,
        buffer_size=100000,
        batch_size=64,
        gamma=0.99,
        exploration_fraction=0.1,
        exploration_final_eps=0.02,
        target_update_interval=1000
    )

Proximal Policy Optimization (PPO)
==================================

A policy gradient method with clipped objective for stable training.

**Objective:**

.. math::

    L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t) \right]

Where :math:`r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}` is the probability ratio.

**Benefits:**

- Stable training
- Sample efficient
- Works well with neural networks

**Configuration:**

.. code-block:: python

    PPO(
        policy="MlpPolicy",
        env=env,
        learning_rate=3e-4,
        n_steps=2048,
        batch_size=64,
        n_epochs=10,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01
    )

Recurrent PPO (LSTM)
====================

Our **best performing** algorithm! Uses LSTM networks to maintain memory of past observations.

**Why LSTM for UNO?**

UNO is a **partially observable** game:
- You can't see opponent's cards
- Cards in the deck are hidden
- Past plays contain useful information

LSTM maintains a **hidden state** that accumulates information over time.

**Architecture:**

.. code-block:: text

    Input (17) → LSTM(256) → Dense(256) → Dense(128) → Policy(9) + Value(1)

**Key Insight:**

The LSTM learns to:
1. Track which cards have been played
2. Estimate remaining cards in the deck
3. Infer opponent's hand composition
4. Remember game patterns

**Configuration:**

.. code-block:: python

    RecurrentPPO(
        policy="MlpLstmPolicy",
        env=env,
        learning_rate=3e-4,
        n_steps=128,
        batch_size=64,
        n_epochs=10,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        policy_kwargs=dict(
            lstm_hidden_size=256,
            n_lstm_layers=1,
            shared_lstm=True,
            enable_critic_lstm=True
        )
    )

Algorithm Comparison
====================

Training Efficiency
-------------------

.. list-table::
   :header-rows: 1

   * - Algorithm
     - Timesteps to 50% WR
     - GPU Memory
     - Training Time (1M steps)
   * - Q-Learning
     - Never reached
     - N/A
     - 10 min
   * - DQN
     - 800K
     - ~2GB
     - 30 min
   * - PPO
     - 500K
     - ~1GB
     - 20 min
   * - Recurrent PPO
     - 300K
     - ~3GB
     - 45 min

Recommendations
---------------

**For beginners:** Start with PPO - it's stable and easy to tune.

**For best results:** Use Recurrent PPO - the memory is essential for UNO.

**For research:** Try different architectures with the RecurrentPPO base.

Hyperparameter Tuning
=====================

Key parameters to tune for UNO:

.. list-table::
   :header-rows: 1

   * - Parameter
     - Range
     - Effect
   * - learning_rate
     - 1e-5 to 1e-3
     - Higher = faster but unstable
   * - n_steps
     - 64 to 2048
     - Higher = more stable gradients
   * - batch_size
     - 32 to 256
     - Must divide n_steps evenly
   * - gamma
     - 0.9 to 0.999
     - Higher = more farsighted
   * - ent_coef
     - 0.0 to 0.1
     - Higher = more exploration

Our best configuration is in ``config.py``.