Subtopic Deep Dive
Thompson Sampling Bandit Algorithms
Research Guide
What Are Thompson Sampling Bandit Algorithms?
Thompson Sampling is a Bayesian heuristic for multi-armed bandit problems that samples actions from the posterior distribution of the reward model to balance exploration and exploitation.
Thompson Sampling achieves near-optimal regret bounds in finite time, as shown by Kaufmann et al. (2012, 361 citations) and Agrawal and Goyal (2012, 302 citations). Related posterior- and optimism-based methods extend these ideas to contextual bandits and to reinforcement learning with linear function approximation (Jin et al., 2019, 219 citations). Over 1,000 papers cite these foundational works.
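The core loop described above can be sketched for Bernoulli rewards with Beta priors. This is a minimal illustration, not any paper's reference implementation; the arm means are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli arm means
successes = np.zeros(3)
failures = np.zeros(3)

for t in range(2000):
    # Sample one value per arm from its Beta posterior (Beta(1,1) prior),
    # then play the arm whose sample is largest.
    samples = rng.beta(successes + 1, failures + 1)
    arm = int(np.argmax(samples))
    reward = float(rng.random() < true_means[arm])
    successes[arm] += reward          # Bayesian posterior update
    failures[arm] += 1.0 - reward

pulls = successes + failures          # pull counts concentrate on the best arm
```

Because each arm is chosen with the posterior probability that it is optimal, exploration fades automatically as the posteriors sharpen.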
Why It Matters
Thompson Sampling powers A/B testing in web optimization, as in Hill et al. (2017, 111 citations) for multivariate page layouts. It enables response-adaptive randomization in clinical trials (Robertson et al., 2023, 87 citations). In recommendation systems, it drives factorization bandits (Wang et al., 2017, 95 citations), improving click-through rates.
Key Research Challenges
Finite-Time Regret Bounds
Deriving tight regret guarantees beyond asymptotic analysis remains challenging. Kaufmann et al. (2012) provide an asymptotically optimal finite-time analysis, and Agrawal and Goyal (2012) tighten the bounds to hold for any gap sequence. Extensions to non-stationary environments still lack matching lower bounds.
Contextual Bandit Extensions
Adapting posterior sampling to high-dimensional contexts raises computational demands. Jin et al. (2019) obtain provably efficient RL with linear function approximation, but learning the features adds complexity (Wang et al., 2016, 84 citations), and scaling to continuous arms typically relies on Gaussian processes (Schulz et al., 2016, 98 citations).
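One standard contextual variant maintains a Gaussian posterior over linear reward weights and samples from it each round. A minimal sketch, assuming unit observation noise in the posterior update; the dimensions, arm count, noise level, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_arms = 4, 5
theta_star = rng.normal(size=d)   # hypothetical true linear reward weights

# Ridge / Gaussian prior N(0, I): track precision matrix and weighted sums.
B = np.eye(d)                     # posterior precision
f = np.zeros(d)                   # sum of reward-weighted contexts

for t in range(500):
    contexts = rng.normal(size=(n_arms, d))
    cov = np.linalg.inv(B)
    # Sample weights from the current Gaussian posterior, then act greedily.
    theta_sample = rng.multivariate_normal(cov @ f, cov)
    arm = int(np.argmax(contexts @ theta_sample))
    x = contexts[arm]
    reward = x @ theta_star + 0.1 * rng.normal()
    B += np.outer(x, x)           # rank-one posterior update
    f += reward * x

theta_hat = np.linalg.inv(B) @ f  # posterior mean after learning
```

The per-round cost here is a d×d inverse, which is exactly the scalability pressure the paragraph above describes once d grows large.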
Logged Data Off-Policy Learning
Learning from implicit exploration data demands bias correction. Strehl et al. (2010, 105 citations) provide foundations for contextual settings. Combining these corrections with Thompson Sampling requires importance-sampling ratios that preserve regret optimality.
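The basic bias correction at issue is inverse propensity scoring: reweight each logged reward by the ratio of target-policy to logging-policy probability. A minimal sketch with a made-up two-action log (uniform logging policy, hypothetical reward means):

```python
import numpy as np

def ips_value(logged, pi):
    """Unbiased value estimate of a target policy from logged tuples
    (action, reward, logging_prob): average of r * pi(a) / mu(a)."""
    return float(np.mean([r * pi(a) / mu for a, r, mu in logged]))

rng = np.random.default_rng(2)
true_means = [0.2, 0.8]               # hypothetical per-action reward means
logged = []
for _ in range(20000):
    a = int(rng.integers(2))          # uniform logging policy, mu(a) = 0.5
    r = float(rng.random() < true_means[a])
    logged.append((a, r, 0.5))

# Evaluate a target policy that always plays action 1; its true value is 0.8.
est = ips_value(logged, lambda a: 1.0 if a == 1 else 0.0)
```

The estimator is unbiased only when the logging probabilities are known and bounded away from zero, which is why nonrandom (implicit) exploration logs need the careful treatment Strehl et al. develop.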
Essential Papers
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
Emilie Kaufmann, Nathaniel Korda, Rémi Munos · 2012 · Lecture notes in computer science · 361 citations
Further Optimal Regret Bounds for Thompson Sampling
Shipra Agrawal, Navin Goyal · 2012 · arXiv (Cornell University) · 302 citations
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several...
Provably Efficient Reinforcement Learning with Linear Function Approximation
Chi Jin, Zhuoran Yang, Zhaoran Wang et al. · 2019 · arXiv (Cornell University) · 219 citations
Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function...
Interactive Anomaly Detection on Attributed Networks
Kaize Ding, Jundong Li, Huan Liu · 2019 · 150 citations
Performing anomaly detection on attributed networks concerns with finding nodes whose patterns or behaviors deviate significantly from the majority of reference nodes. Its success can be easily found...
An Efficient Bandit Algorithm for Realtime Multivariate Optimization
Daniel Hill, Houssam Nassif, Yi Liu et al. · 2017 · 111 citations
Optimization is commonly employed to determine the content of web pages, such as to maximize conversions on landing pages or click-through rates on search engine result pages. Often the layout of the...
Learning from Logged Implicit Exploration Data
Alex Strehl, John Langford, Lihong Li et al. · 2010 · arXiv (Cornell University) · 105 citations
We provide a sound and consistent foundation for the use of nonrandom exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned...
A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions
Eric Schulz, Maarten Speekenbrink, Andreas Krause · 2016 · 98 citations
This tutorial introduces the reader to Gaussian process regression as an expressive tool to model, actively explore and exploit unknown functions. Gaussian process regression is a powerful...
Reading Guide
Foundational Papers
Start with Kaufmann et al. (2012, 361 citations) for finite-time analysis and Agrawal and Goyal (2012, 302 citations) for optimal regret; then Strehl et al. (2010, 105 citations) for logged data foundations.
Recent Advances
Study Jin et al. (2019, 219 citations) for linear RL and Robertson et al. (2023, 87 citations) for clinical trial applications.
Core Methods
Posterior sampling with Beta/Bernoulli priors; regret analysis via KL-divergence and concentration; Gaussian processes for continuous arms.
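The KL-divergence term in these regret analyses can be computed directly. For Bernoulli arms, the asymptotic (Lai–Robbins-style) regret per log T for a suboptimal arm is its gap divided by its KL divergence to the best mean; the arm means below are illustrative:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q); assumes 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Asymptotic regret constant for one suboptimal arm: gap / KL to the best mean.
mu_star, mu = 0.7, 0.5
constant = (mu_star - mu) / kl_bernoulli(mu, mu_star)
```

This is the constant that Kaufmann et al.'s analysis shows Thompson Sampling attains asymptotically.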
How PapersFlow Helps You Research Thompson Sampling Bandit Algorithms
Discover & Search
Research Agent uses citationGraph on Kaufmann et al. (2012) to map its 361 citing works, then findSimilarPapers for contextual extensions such as Jin et al. (2019). exaSearch queries 'Thompson Sampling regret bounds continuous arms' to uncover 50+ relevant papers from the 250M+ work OpenAlex database.
Analyze & Verify
Analysis Agent runs readPaperContent on Agrawal and Goyal (2012), then verifyResponse with CoVe to check the regret bound derivations. runPythonAnalysis simulates Thompson Sampling against UCB in a NumPy sandbox, with GRADE scoring how closely empirical regret matches the theoretical O(√T) rate.
Synthesize & Write
Synthesis Agent detects gaps in non-stationary Thompson Sampling via contradiction flagging across Kaufmann et al. (2012) and more recent works. Writing Agent applies latexEditText to draft proofs, latexSyncCitations for 10+ papers, and latexCompile to produce an arXiv-ready bandit-analysis document.
Use Cases
"Simulate Thompson Sampling regret on 5-arm bandit with beta priors."
Research Agent → searchPapers 'Thompson Sampling simulation' → Analysis Agent → runPythonAnalysis (NumPy bandit env, 10k trials) → matplotlib regret plot exported as PNG.
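The core of such a simulation is a few lines of NumPy (a hypothetical 5-arm Bernoulli instance with Beta(1,1) priors; plotting omitted):

```python
import numpy as np

rng = np.random.default_rng(3)
means = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # hypothetical 5 Bernoulli arms
S, F = np.zeros(5), np.zeros(5)              # Beta posterior success/failure counts
T = 10_000
cum_regret = np.zeros(T)
regret = 0.0
for t in range(T):
    arm = int(np.argmax(rng.beta(S + 1, F + 1)))   # posterior sampling step
    reward = float(rng.random() < means[arm])
    S[arm] += reward
    F[arm] += 1.0 - reward
    regret += means.max() - means[arm]             # expected (pseudo-)regret
    cum_regret[t] = regret
# cum_regret grows roughly logarithmically in t, far below the linear worst case.
```

Plotting `cum_regret` against `np.log1p(np.arange(T))` makes the logarithmic growth visible.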
"Write LaTeX survey on Thompson Sampling in clinical trials."
Synthesis Agent → gap detection on Robertson et al. (2023) → Writing Agent → latexGenerateFigure (regret curves), latexSyncCitations (15 papers), latexCompile → PDF with theorems and proofs.
"Find GitHub repos implementing contextual Thompson Sampling."
Research Agent → searchPapers 'contextual Thompson Sampling code' → Code Discovery → paperExtractUrls → paperFindGithubRepo → githubRepoInspect → verified implementations from Wang et al. (2016).
Automated Workflows
Deep Research workflow scans 50+ Thompson Sampling papers via searchPapers → citationGraph into a structured report with a regret-bound taxonomy. DeepScan applies 7-step analysis: readPaperContent on Agrawal and Goyal (2012) → runPythonAnalysis verification → GRADE scoring. Theorizer generates new hypotheses such as 'Thompson Sampling optimality under partial feedback' from Kaufmann et al. (2012) and Strehl et al. (2010).
Frequently Asked Questions
What defines Thompson Sampling?
Thompson Sampling samples arms from the posterior distribution over reward parameters, updated via Bayesian inference after each pull.
What are key methods in Thompson Sampling research?
Core methods prove finite-time regret via concentration inequalities (Kaufmann et al., 2012) and optimize for gap-dependent bounds (Agrawal and Goyal, 2012). Extensions use linear approximation (Jin et al., 2019).
What are the most cited papers?
Kaufmann et al. (2012, 361 citations) provide an asymptotically optimal finite-time analysis; Agrawal and Goyal (2012, 302 citations) establish further optimal regret bounds.
What open problems exist?
Tight bounds for non-stationary environments and scalable posterior sampling in high-dimensional contexts remain open problems, beyond Gaussian process approximations (Schulz et al., 2016).
Research Advanced Bandit Algorithms with AI
PapersFlow provides specialized AI tools for Decision Sciences researchers. Here are the most relevant for this topic:
Systematic Review
AI-powered evidence synthesis with documented search strategies
AI Literature Review
Automate paper discovery and synthesis across 474M+ papers
Deep Research Reports
Multi-source evidence synthesis with counter-evidence
See how researchers in Economics & Business use PapersFlow
Field-specific workflows, example queries, and use cases.
Start Researching Thompson Sampling Bandit Algorithms with AI
Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.
See how PapersFlow works for Decision Sciences researchers