This document is an overview of reinforcement learning (RL), a field concerned with methods for solving sequential decision-making problems. It covers a broad range of topics, including:
1. Introduction
- Sequential decision making: Defining the basic RL problem, where an agent interacts with an environment to maximize cumulative rewards.
- Canonical examples: Introducing Markov decision processes (MDPs), partially observed MDPs (POMDPs), contextual MDPs, contextual bandits, and belief-state MDPs, and showing how optimization problems can be seen as a special case of RL.
- Reinforcement learning: Categorizing RL methods based on whether they are value-based, policy-based, or model-based, and discussing how to deal with partial observability.
- Exploration-exploitation tradeoff: Highlighting the fundamental challenge of balancing exploration (learning about the environment) and exploitation (acting on current knowledge to maximize reward). Covering simple heuristics like epsilon-greedy and Boltzmann exploration, as well as more advanced methods like upper confidence bounds (UCBs) and Thompson sampling; a minimal epsilon-greedy sketch appears after this section's list.
- RL as a posterior inference problem: Presenting an alternative view of RL as an inference problem, where the goal is to infer a policy that maximizes the probability of achieving high rewards. This perspective leads to methods like maximum entropy RL and active inference.
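To make the exploration heuristics above concrete, here is a minimal epsilon-greedy sketch for a multi-armed bandit; it is not code from the paper, and the `pull_arm` callable, the arm count, and the epsilon value are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy_bandit(pull_arm, n_arms, n_steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a bandit; `pull_arm(a)` is an assumed callable
    returning a scalar reward for arm `a`."""
    rng = np.random.default_rng(seed)
    q = np.zeros(n_arms)  # running estimate of each arm's mean reward
    n = np.zeros(n_arms)  # number of times each arm was pulled
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))   # explore: pick a random arm
        else:
            a = int(np.argmax(q))           # exploit: pick the best-looking arm
        r = pull_arm(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]           # incremental mean update
    return q
```

UCB and Thompson sampling replace the uniform-random exploration branch with action choices that account for the uncertainty in the value estimates `q`.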
2. Value-based RL
- Basic concepts: Defining value functions, Q-functions, and advantage functions, and deriving Bellman’s equations.
- Computing the value function and policy given a known world model: Discussing value iteration, real-time dynamic programming (RTDP), and policy iteration.
- Computing the value function without knowing the world model: Introducing Monte Carlo estimation, temporal difference (TD) learning, eligibility traces, and SARSA.
- Q-learning: Describing tabular Q-learning and Q-learning with function approximation, including neural fitted Q and DQN. Discussing stabilization techniques such as experience replay and target networks, challenges such as the deadly triad and maximization bias, and solutions such as double Q-learning; a tabular Q-learning sketch appears after this section's list.
- DQN extensions: Reviewing various extensions to DQN, including Q-learning for continuous actions, dueling DQN, noisy nets, multi-step DQN, and Rainbow.
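To illustrate the value-based methods above, here is a minimal sketch of tabular Q-learning. The environment interface (`reset() -> state`, `step(action) -> (next_state, reward, done)`) and all hyperparameters are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500,
                       alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning; `env` is assumed to expose reset() -> state
    and step(action) -> (next_state, reward, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy TD target uses the max over next actions
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

DQN replaces the table `Q` with a neural network and adds experience replay and a target network to stabilize training; double Q-learning addresses the maximization bias introduced by the `max` in the TD target.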
3. Policy-based RL
- The policy gradient theorem: Deriving the fundamental theorem for policy gradient methods.
- REINFORCE: Introducing the REINFORCE algorithm and discussing the use of baselines to reduce variance; a minimal REINFORCE sketch appears after this section's list.
- Actor-critic methods: Describing advantage actor-critic (A2C), generalized advantage estimation (GAE), and two-timescale actor-critic algorithms.
- Policy improvement methods: Discussing trust region policy optimization (TRPO), proximal policy optimization (PPO), and V-MPO.
- Off-policy methods: Introducing importance sampling for policy evaluation and describing off-policy actor-critic methods, including V-trace and IMPALA.
- Soft actor-critic (SAC): Presenting the SAC algorithm, which combines off-policy learning with maximum entropy RL.
- Deterministic policy gradient methods: Describing DDPG and TD3.
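To make the policy-gradient ideas above concrete, here is a short REINFORCE sketch for a tabular softmax policy pi(a|s) ∝ exp(theta[s, a]); the episode format, the mean-return baseline, and the learning rate are illustrative assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = sum_{k>=t} gamma^(k-t) * r_k for one episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
    """One REINFORCE step for a tabular softmax policy pi(a|s) ∝ exp(theta[s, a]).
    `episode` is an assumed list of (state, action, reward) tuples; the returns
    are centered by their mean as a simple illustrative baseline."""
    states, actions, rewards = zip(*episode)
    G = discounted_returns(list(rewards), gamma)
    G = G - G.mean()                      # baseline to reduce variance
    for s, a, g in zip(states, actions, G):
        probs = np.exp(theta[s] - theta[s].max())
        probs /= probs.sum()
        grad_logp = -probs                # gradient of log pi(a|s) w.r.t. theta[s, :]
        grad_logp[a] += 1.0
        theta[s] += lr * g * grad_logp    # gradient ascent on expected return
    return theta
```

Actor-critic methods replace the Monte Carlo return `G` with a learned value or advantage estimate, trading variance for bias.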
4. Model-based RL
- Decision-time planning: Discussing model predictive control (MPC), heuristic search, Monte Carlo tree search (MCTS), and trajectory optimization.
- Background planning: Introducing Dyna and discussing how to deal with model errors and uncertainty; a Dyna-style planning sketch appears after this section's list.
- World models: Describing different types of world models, including generative models (observation-space, factored, latent-space, Dreamer, IRIS) and non-generative models (value prediction, self-prediction, policy prediction, observation prediction, BYOL-Explore).
- Beyond one-step models: Discussing general value functions, successor representations, successor models, successor features, and generalized policy improvement.
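As a sketch of background planning in the Dyna style, the loop below replays simulated transitions from a learned tabular model and applies Q-learning updates to them. The dictionary format of `model`, mapping (state, action) to (reward, next_state), and the numpy random generator `rng` are assumptions made for this example.

```python
import numpy as np

def dyna_q_planning(Q, model, rng, n_planning=10, alpha=0.1, gamma=0.99):
    """Background planning in the style of Dyna-Q: sample previously seen
    (state, action) pairs from a learned `model` and apply Q-learning
    updates to the simulated transitions."""
    if not model:
        return Q
    visited = list(model.keys())
    for _ in range(n_planning):
        s, a = visited[rng.integers(len(visited))]
        r, s_next = model[(s, a)]                 # simulated experience
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

In full Dyna-Q, these planning updates are interleaved with direct updates from real experience and with updates to the model itself.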
5. Other topics in RL
- Distributional RL: Discussing methods that learn a distribution over returns rather than just the expected return.
- Reward functions: Highlighting the challenges of reward hacking and sparse rewards, and discussing reward shaping as a way to densify the learning signal; a potential-based shaping sketch appears after this section's list.
- Hierarchical RL: Introducing feudal (goal-conditioned) HRL and the options framework.
- Imitation learning: Describing behavior cloning and inverse reinforcement learning (IRL).
- Offline RL: Discussing the challenges of learning from a fixed dataset and presenting various approaches, including policy constraint methods, uncertainty penalties, conservative Q-learning, and offline RL using reward-conditioned sequence modeling.
- LLMs and RL: Exploring the connections between large language models (LLMs) and RL, including using RL to fine-tune LLMs, using LLMs to pre-process inputs for RL, using LLMs for reward functions, using LLMs as world models, and using LLMs as policies.
- General RL, AIXI and universal AGI: Briefly discussing the general RL setting and its connection to the theoretical concept of AIXI and artificial general intelligence (AGI).
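Among the reward-function topics above, reward shaping admits a compact worked example: potential-based shaping adds gamma * Phi(s') - Phi(s) to the reward, which can densify the learning signal without changing the optimal policy. The user-supplied `potential` function is an assumption of this sketch.

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99, done=False):
    """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s).
    `potential` is an assumed user-supplied function Phi over states."""
    next_phi = 0.0 if done else potential(s_next)
    return r + gamma * next_phi - potential(s)
```

Because the shaping terms telescope along any trajectory, the relative value of policies is preserved and the optimal policy is unchanged.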
The document provides a comprehensive overview of the field of RL, covering a wide range of methods, challenges, and applications. It serves as a valuable resource for anyone interested in learning about the state of the art in RL research.
Possible questions about the document include:
- What is the difference between value-based and policy-based RL methods? What are the advantages and disadvantages of each approach?
- How does Q-learning work, and what are some of the challenges associated with using it?
- What is the exploration-exploitation tradeoff, and how do different RL algorithms address it?
- What is the difference between on-policy and off-policy learning?
- What is the difference between model-based and model-free RL, and what are the advantages and disadvantages of each approach?
- What is a world model, and how can it be used in RL?
- What is the difference between decision-time planning and background planning?
- What is hierarchical RL, and why is it important?
- What is imitation learning, and how does it differ from reinforcement learning?
- What is offline RL, and what are some of the challenges associated with it?
- How can large language models (LLMs) be used in RL?
- What is the connection between RL and artificial general intelligence (AGI)?
- What is the difference between MCTS and MPC?
- What is the difference between UCB and Thompson sampling?
- What is meant by the “reward hacking” problem?
- What is meant by “hybrid offline/online” RL methods, and why might these be useful?