Subtopic Deep Dive
Thompson Sampling Bandit Algorithms
Research Guide
What Are Thompson Sampling Bandit Algorithms?
Thompson Sampling is a Bayesian heuristic for multi-armed bandit problems that samples actions from the posterior distribution of the reward model to balance exploration and exploitation.
Thompson Sampling achieves near-optimal regret bounds in finite time, as shown by Kaufmann et al. (2012, 361 citations) and Agrawal and Goyal (2012, 302 citations). Related posterior- and optimism-based methods extend these ideas to contextual bandits and to reinforcement learning with linear function approximation (Jin et al., 2019, 219 citations). Over 1,000 papers cite these foundational works.
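The core loop described above can be sketched for Bernoulli rewards with Beta priors. This is a minimal illustration, not any paper's reference implementation; the arm means are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical Bernoulli arm means
successes = np.zeros(3)
failures = np.zeros(3)

for t in range(2000):
    # Sample one value per arm from its Beta posterior (Beta(1,1) prior),
    # then play the arm whose sample is largest.
    samples = rng.beta(successes + 1, failures + 1)
    arm = int(np.argmax(samples))
    reward = float(rng.random() < true_means[arm])
    successes[arm] += reward          # Bayesian posterior update
    failures[arm] += 1.0 - reward

pulls = successes + failures          # pull counts concentrate on the best arm
```

Because each arm is chosen with the posterior probability that it is optimal, exploration fades automatically as the posteriors sharpen.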
Why It Matters
Thompson Sampling powers A/B testing in web optimization, as in Hill et al. (2017, 111 citations) for multivariate page layouts. It enables response-adaptive randomization in clinical trials (Robertson et al., 2023, 87 citations). In recommendation systems, it drives factorization bandits (Wang et al., 2017, 95 citations), improving click-through rates.
Key Research Challenges
Finite-Time Regret Bounds
Deriving tight regret guarantees beyond asymptotic analysis remains challenging. Kaufmann et al. (2012) provide an asymptotically optimal finite-time analysis, and Agrawal and Goyal (2012) tighten the bounds to hold for any gap sequence. Extensions to non-stationary environments still lack matching lower bounds.
Contextual Bandit Extensions
Adapting posterior sampling to high-dimensional contexts raises computational demands. Jin et al. (2019) obtain provably efficient RL with linear function approximation, but learning the features adds complexity (Wang et al., 2016, 84 citations), and scaling to continuous arms typically relies on Gaussian processes (Schulz et al., 2016, 98 citations).
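One standard contextual variant maintains a Gaussian posterior over linear reward weights and samples from it each round. A minimal sketch, assuming unit observation noise in the posterior update; the dimensions, arm count, noise level, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_arms = 4, 5
theta_star = rng.normal(size=d)   # hypothetical true linear reward weights

# Ridge / Gaussian prior N(0, I): track precision matrix and weighted sums.
B = np.eye(d)                     # posterior precision
f = np.zeros(d)                   # sum of reward-weighted contexts

for t in range(500):
    contexts = rng.normal(size=(n_arms, d))
    cov = np.linalg.inv(B)
    # Sample weights from the current Gaussian posterior, then act greedily.
    theta_sample = rng.multivariate_normal(cov @ f, cov)
    arm = int(np.argmax(contexts @ theta_sample))
    x = contexts[arm]
    reward = x @ theta_star + 0.1 * rng.normal()
    B += np.outer(x, x)           # rank-one posterior update
    f += reward * x

theta_hat = np.linalg.inv(B) @ f  # posterior mean after learning
```

The per-round cost here is a d×d inverse, which is exactly the scalability pressure the paragraph above describes once d grows large.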
Logged Data Off-Policy Learning
Learning from implicit exploration data demands bias correction. Strehl et al. (2010, 105 citations) provide foundations for contextual settings. Combining these corrections with Thompson Sampling requires importance-sampling ratios that preserve regret optimality.
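The basic bias correction at issue is inverse propensity scoring: reweight each logged reward by the ratio of target-policy to logging-policy probability. A minimal sketch with a made-up two-action log (uniform logging policy, hypothetical reward means):

```python
import numpy as np

def ips_value(logged, pi):
    """Unbiased value estimate of a target policy from logged tuples
    (action, reward, logging_prob): average of r * pi(a) / mu(a)."""
    return float(np.mean([r * pi(a) / mu for a, r, mu in logged]))

rng = np.random.default_rng(2)
true_means = [0.2, 0.8]               # hypothetical per-action reward means
logged = []
for _ in range(20000):
    a = int(rng.integers(2))          # uniform logging policy, mu(a) = 0.5
    r = float(rng.random() < true_means[a])
    logged.append((a, r, 0.5))

# Evaluate a target policy that always plays action 1; its true value is 0.8.
est = ips_value(logged, lambda a: 1.0 if a == 1 else 0.0)
```

The estimator is unbiased only when the logging probabilities are known and bounded away from zero, which is why nonrandom (implicit) exploration logs need the careful treatment Strehl et al. develop.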
Essential Papers
Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
Emilie Kaufmann, Nathaniel Korda, Rémi Munos · 2012 · Lecture notes in computer science · 361 citations
Further Optimal Regret Bounds for Thompson Sampling
Shipra Agrawal, Navin Goyal · 2012 · arXiv (Cornell University) · 302 citations
Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several...
Provably Efficient Reinforcement Learning with Linear Function Approximation
Chi Jin, Zhuoran Yang, Zhaoran Wang et al. · 2019 · arXiv (Cornell University) · 219 citations
Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function...
Interactive Anomaly Detection on Attributed Networks
Kaize Ding, Jundong Li, Huan Liu · 2019 · 150 citations
Performing anomaly detection on attributed networks concerns with finding nodes whose patterns or behaviors deviate significantly from the majority of reference nodes. Its success can be easily found...
An Efficient Bandit Algorithm for Realtime Multivariate Optimization
Daniel Hill, Houssam Nassif, Yi Liu et al. · 2017 · 111 citations
Optimization is commonly employed to determine the content of web pages, such as to maximize conversions on landing pages or click-through rates on search engine result pages. Often the layout of the...
Learning from Logged Implicit Exploration Data
Alex Strehl, John Langford, Lihong Li et al. · 2010 · arXiv (Cornell University) · 105 citations
We provide a sound and consistent foundation for the use of nonrandom exploration data in "contextual bandit" or "partially labeled" settings where only the value of a chosen action is learned...
A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions
Eric Schulz, Maarten Speekenbrink, Andreas Krause · 2016 · 98 citations
This tutorial introduces the reader to Gaussian process regression as an expressive tool to model, actively explore and exploit unknown functions. Gaussian process regression is a powerful...
Reading Guide
Foundational Papers
Start with Kaufmann et al. (2012, 361 citations) for finite-time analysis and Agrawal and Goyal (2012, 302 citations) for optimal regret; then Strehl et al. (2010, 105 citations) for logged data foundations.
Recent Advances
Study Jin et al. (2019, 219 citations) for linear RL and Robertson et al. (2023, 87 citations) for clinical trial applications.
Core Methods
Posterior sampling with Beta/Bernoulli priors; regret analysis via KL-divergence and concentration; Gaussian processes for continuous arms.
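The KL-divergence term in these regret analyses can be computed directly. For Bernoulli arms, the asymptotic (Lai–Robbins-style) regret per log T for a suboptimal arm is its gap divided by its KL divergence to the best mean; the arm means below are illustrative:

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q); assumes 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Asymptotic regret constant for one suboptimal arm: gap / KL to the best mean.
mu_star, mu = 0.7, 0.5
constant = (mu_star - mu) / kl_bernoulli(mu, mu_star)
```

This is the constant that Kaufmann et al.'s analysis shows Thompson Sampling attains asymptotically.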
How PapersFlow Helps You Research Thompson Sampling Bandit Algorithms
Discover & Search
Research Agent uses citationGraph on Kaufmann et al. (2012) to map its 361 citing works, then findSimilarPapers for contextual extensions such as Jin et al. (2019). exaSearch queries 'Thompson Sampling regret bounds continuous arms' to uncover 50+ relevant papers from the 250M+ work OpenAlex database.
Analyze & Verify
Analysis Agent runs readPaperContent on Agrawal and Goyal (2012), then verifyResponse with CoVe to check the regret bound derivations. runPythonAnalysis simulates Thompson Sampling against UCB in a NumPy sandbox, with GRADE scoring how closely empirical regret matches the theoretical O(√T) rate.
Synthesize & Write
Synthesis Agent detects gaps in non-stationary Thompson Sampling via contradiction flagging across Kaufmann et al. (2012) and more recent works. Writing Agent applies latexEditText to draft proofs, latexSyncCitations for 10+ papers, and latexCompile to produce an arXiv-ready bandit-analysis document.
Use Cases
"Simulate Thompson Sampling regret on 5-arm bandit with beta priors."
Research Agent → searchPapers 'Thompson Sampling simulation' → Analysis Agent → runPythonAnalysis (NumPy bandit env, 10k trials) → matplotlib regret plot exported as PNG.
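The core of such a simulation is a few lines of NumPy (a hypothetical 5-arm Bernoulli instance with Beta(1,1) priors; plotting omitted):

```python
import numpy as np

rng = np.random.default_rng(3)
means = np.array([0.1, 0.3, 0.5, 0.7, 0.9])  # hypothetical 5 Bernoulli arms
S, F = np.zeros(5), np.zeros(5)              # Beta posterior success/failure counts
T = 10_000
cum_regret = np.zeros(T)
regret = 0.0
for t in range(T):
    arm = int(np.argmax(rng.beta(S + 1, F + 1)))   # posterior sampling step
    reward = float(rng.random() < means[arm])
    S[arm] += reward
    F[arm] += 1.0 - reward
    regret += means.max() - means[arm]             # expected (pseudo-)regret
    cum_regret[t] = regret
# cum_regret grows roughly logarithmically in t, far below the linear worst case.
```

Plotting `cum_regret` against `np.log1p(np.arange(T))` makes the logarithmic growth visible.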
"Write LaTeX survey on Thompson Sampling in clinical trials."
Synthesis Agent → gap detection on Robertson et al. (2023) → Writing Agent → latexGenerateFigure (regret curves), latexSyncCitations (15 papers), latexCompile → PDF with theorems and proofs.
"Find GitHub repos implementing contextual Thompson Sampling."
Research Agent → searchPapers 'contextual Thompson Sampling code' → Code Discovery → paperExtractUrls → paperFindGithubRepo → githubRepoInspect → verified implementations from Wang et al. (2016).
Automated Workflows
Deep Research workflow scans 50+ Thompson Sampling papers via searchPapers → citationGraph into a structured report with a regret-bound taxonomy. DeepScan applies 7-step analysis: readPaperContent on Agrawal and Goyal (2012) → runPythonAnalysis verification → GRADE scoring. Theorizer generates new hypotheses such as 'Thompson Sampling optimality under partial feedback' from Kaufmann et al. (2012) and Strehl et al. (2010).
Frequently Asked Questions
What defines Thompson Sampling?
Thompson Sampling samples arms from the posterior distribution over reward parameters, updated via Bayesian inference after each pull.
What are key methods in Thompson Sampling research?
Core methods prove finite-time regret via concentration inequalities (Kaufmann et al., 2012) and optimize for gap-dependent bounds (Agrawal and Goyal, 2012). Extensions use linear approximation (Jin et al., 2019).
What are the most cited papers?
Kaufmann et al. (2012, 361 citations) provide an asymptotically optimal finite-time analysis; Agrawal and Goyal (2012, 302 citations) establish further optimal regret bounds.
What open problems exist?
Tight bounds for non-stationary environments and scalable posterior sampling in high-dimensional contexts remain open problems, beyond Gaussian process approximations (Schulz et al., 2016).
Research Advanced Bandit Algorithms with AI
PapersFlow provides specialized AI tools for Decision Sciences researchers. Here are the most relevant for this topic:
Systematic Review
AI-powered evidence synthesis with documented search strategies
AI Literature Review
Automate paper discovery and synthesis across 474M+ papers
Deep Research Reports
Multi-source evidence synthesis with counter-evidence
See how researchers in Economics & Business use PapersFlow
Field-specific workflows, example queries, and use cases.
Start Researching Thompson Sampling Bandit Algorithms with AI
Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.
See how PapersFlow works for Decision Sciences researchers