Subtopic Deep Dive

Contextual Multi-Armed Bandits
Research Guide

What are Contextual Multi-Armed Bandits?

Contextual Multi-Armed Bandits extend the multi-armed bandit framework by incorporating side information or contexts to inform action selection in sequential decision-making.

Algorithms handle contexts via linear payoff functions (Chu et al., 2011, 577 citations), Thompson Sampling (Agrawal and Goyal, 2012, 547 citations), and epoch-greedy methods (Langford and Zhang, 2007, 328 citations). Over 10 key papers from 2007-2022 address linear, similarity-based, and neural models. Applications span recommendation systems and reinforcement learning.
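The interaction protocol these algorithms share can be sketched in a few lines. The linear reward model, dimensions, and noise scale below are illustrative assumptions, not taken from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 10, 1000                  # context dimension, arms, rounds (assumed)
theta = rng.normal(size=(K, d))        # hidden per-arm payoff parameters

def pull(arm, context):
    # Linear expected payoff plus Gaussian noise, as in the linear-payoff setting
    return theta[arm] @ context + rng.normal(scale=0.1)

total_reward = 0.0
for t in range(T):
    context = rng.normal(size=d)       # side information revealed each round
    arm = int(rng.integers(K))         # placeholder policy: uniform exploration
    total_reward += pull(arm, context)
```

A real contextual bandit algorithm replaces the uniform choice with one informed by the revealed context.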

15 Curated Papers · 3 Key Challenges

Why It Matters

Contextual bandits power personalized news recommendation in DRN (Zheng et al., 2018, 612 citations) and collaborative filtering (Li et al., 2016, 299 citations). They enable efficient exploration in high-dimensional ad targeting and medical treatment selection. In robotics, linear function approximation aids provably efficient RL (Jin et al., 2019, 219 citations), reducing regret in state-dependent actions.

Key Research Challenges

High-Dimensional Contexts

Scaling to high-dimensional feature spaces inflates regret bounds, as in the O(√(Td ln³ K)) bound for linear payoffs (Chu et al., 2011). Kernelized and neural extensions face the curse of dimensionality. Slivkins (2009, 255 citations) exploits similarity information, but at growing computational cost.
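For reference, the abbreviated bound above reads in fuller form (for T rounds, K arms, d-dimensional features, and confidence parameter δ; this is the SupLinUCB bound as stated by Chu et al., 2011):

```latex
% SupLinUCB regret, holding with probability at least 1 - \delta
R(T) = O\!\left( \sqrt{ T d \, \ln^3\!\left( \frac{K T \ln T}{\delta} \right) } \right)
```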

Non-Stationary Environments

Dynamic user preferences challenge static models, evident in news recommendation (Zheng et al., 2018). Collaborative bandits handle evolving interactions (Li et al., 2016). Epoch-greedy adapts without horizon knowledge (Langford and Zhang, 2007).
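Epoch-Greedy's explore-then-exploit epoch structure, which needs no horizon knowledge, can be sketched as follows. The fixed epoch length and arm count are simplifying assumptions; the paper grows epoch lengths adaptively from the learner's sample complexity:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5                                      # number of arms (illustrative)

def epoch_greedy_step(t, best_arm, epoch_len=10):
    """One Epoch-Greedy decision: a single uniform exploration step opens each
    epoch, and its (context, arm, reward) triple feeds a supervised learner;
    the remaining steps exploit the policy fit on that exploration data."""
    if t % epoch_len == 0:
        return int(rng.integers(K)), True  # explore uniformly at random
    return best_arm, False                 # exploit current learned policy
```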

Optimal Regret Guarantees

Achieving optimal instance-independent regret remains open beyond the linear case. Thompson Sampling provides near-optimal bounds (Agrawal and Goyal, 2012, and their follow-up; 547 and 302 citations). Extending these guarantees to RL settings currently relies on linear function approximation (Jin et al., 2019).

Essential Papers

1.

DRN

Guanjie Zheng, Fuzheng Zhang, Zihan Zheng et al. · 2018 · 612 citations

In this paper, we propose a novel Deep Reinforcement Learning framework for news recommendation. Online personalized news recommendation is a highly challenging problem due to the dynamic nature of...

2.

Contextual Bandits with Linear Payoff Functions

Wei Chu, Lihong Li, Lev Reyzin et al. · 2011 · 577 citations

In this paper, we study the contextual bandit problem (also known as the multi-armed bandit problem with expert advice) for linear payoff functions. For T rounds, K actions, and d-dimensional fea...

3.

Thompson Sampling for Contextual Bandits with Linear Payoffs

Shipra Agrawal, Navin Goyal · 2012 · arXiv (Cornell University) · 547 citations

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after severa...

4.

Artificial intelligence in recommender systems

Qian Zhang, Jie Lu, Yaochu Jin · 2020 · Complex & Intelligent Systems · 397 citations

Abstract Recommender systems provide personalized service support to users by learning their previous behaviors and predicting their current preferences for particular products. Artificial intellig...

5.

The Epoch-Greedy algorithm for contextual multi-armed bandits

John Langford, Tong Zhang · 2007 · 328 citations

We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon...

6.

Further Optimal Regret Bounds for Thompson Sampling

Shipra Agrawal, Navin Goyal · 2012 · arXiv (Cornell University) · 302 citations

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after severa...

7.

Collaborative Filtering Bandits

Shuai Li, Alexandros Karatzoglou, Claudio Gentile · 2016 · 299 citations

Classical collaborative filtering, and content-based filtering methods try to learn a static recommendation model given training data. These approaches are far from ideal in highly dynamic recommen...

Reading Guide

Foundational Papers

Start with Epoch-Greedy (Langford and Zhang, 2007) for intuition, then linear payoffs (Chu et al., 2011, 577 cites) and Thompson Sampling (Agrawal and Goyal, 2012, 547 cites) for regret analysis.

Recent Advances

Study DRN neural methods (Zheng et al., 2018, 612 cites), collaborative bandits (Li et al., 2016, 299 cites), and provably efficient RL with linear function approximation (Jin et al., 2019, 219 cites).

Core Methods

Core techniques: linear regression (LinUCB), Bayesian posterior sampling (TS), similarity graphs (Slivkins, 2009), epoch exploration (Langford and Zhang, 2007).
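As a concrete anchor for the first technique, here is a minimal per-arm (disjoint-model) LinUCB sketch: ridge-regression estimates plus a confidence bonus. The exploration width alpha and dimensions are illustrative assumptions, and this is the simple per-arm variant rather than the SupLinUCB construction analyzed by Chu et al. (2011):

```python
import numpy as np

d, K, alpha = 5, 3, 1.0                    # feature dim, arms, exploration width
A = [np.eye(d) for _ in range(K)]          # per-arm regularized design matrices
b = [np.zeros(d) for _ in range(K)]        # per-arm reward-weighted features

def linucb_select(x):
    """Pick the arm maximizing estimated payoff plus a confidence bonus."""
    scores = []
    for a in range(K):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]           # ridge-regression estimate
        scores.append(theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x))
    return int(np.argmax(scores))

def linucb_update(a, x, r):
    """Rank-one update of arm a's statistics after observing reward r."""
    A[a] += np.outer(x, x)
    b[a] += r * x
```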

How PapersFlow Helps You Research Contextual Multi-Armed Bandits

Discover & Search

Research Agent uses searchPapers and citationGraph to map 10+ papers and surface the centrality of Chu et al. (2011); findSimilarPapers then uncovers kernel extensions. exaSearch queries such as 'contextual bandits linear regret' yield Zheng et al. (2018)'s DRN.

Analyze & Verify

Analysis Agent runs readPaperContent on Agrawal and Goyal (2012) to extract Thompson Sampling pseudocode, verifies regret claims via verifyResponse (CoVe), and uses runPythonAnalysis for a NumPy simulation of the O(√T) bound, with GRADE scoring for empirical validation.

Synthesize & Write

Synthesis Agent detects gaps in non-stationary handling beyond Li et al. (2016) and flags contradictions in regret proofs; Writing Agent applies latexEditText for bandit algorithm sections, latexSyncCitations for the 577-cite Chu paper, and latexCompile for the full survey, while exportMermaid diagrams epoch-greedy phases.

Use Cases

"Simulate Thompson Sampling regret on synthetic linear contextual bandit data"

Research Agent → searchPapers 'Thompson Sampling contextual' → Analysis Agent → readPaperContent (Agrawal 2012) → runPythonAnalysis (NumPy bandit sim with 1000 arms, plot cumulative regret) → matplotlib plot and CSV output.
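A self-contained version of such a simulation might look like the following. The problem sizes are scaled down from the query's 1000 arms, and the shared linear model and noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, T = 4, 20, 500                       # feature dim, arms, rounds (scaled down)
theta = rng.normal(size=d)                 # true shared linear parameter (assumed)
X = rng.normal(size=(K, d))                # fixed arm feature vectors
means = X @ theta
best = means.max()

B, bvec = np.eye(d), np.zeros(d)           # Gaussian posterior statistics
regret = np.zeros(T)
for t in range(T):
    # Thompson step: sample theta from the posterior, play the best-scoring arm
    sample = rng.multivariate_normal(np.linalg.solve(B, bvec), np.linalg.inv(B))
    a = int(np.argmax(X @ sample))
    r = means[a] + rng.normal(scale=0.1)   # noisy observed reward
    B += np.outer(X[a], X[a])
    bvec += r * X[a]
    regret[t] = best - means[a]            # per-round (pseudo-)regret
cumulative = np.cumsum(regret)
```

Plotting `cumulative` with matplotlib, or writing it to CSV, reproduces the described output.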

"Draft LaTeX survey on epoch-greedy vs LinUCB for recommendations"

Research Agent → citationGraph (Langford 2007 hub) → Synthesis → gap detection → Writing Agent → latexEditText (intro section) → latexSyncCitations (Chu 2011, Zheng 2018) → latexCompile → PDF with Mermaid decision tree.

"Find GitHub code for collaborative filtering bandits"

Research Agent → searchPapers 'Collaborative Filtering Bandits' → Code Discovery → paperExtractUrls (Li 2016) → paperFindGithubRepo → githubRepoInspect → verified implementation notebook.

Automated Workflows

Deep Research workflow scans 50+ contextual bandit papers via searchPapers chains, structures report with regret tables from Chu et al. (2011). DeepScan applies 7-step CoVe to verify Thompson Sampling optimality (Agrawal 2012), checkpointing simulations. Theorizer generates hypotheses on neural extensions from Zheng et al. (2018) DRN.

Frequently Asked Questions

What defines Contextual Multi-Armed Bandits?

They incorporate contexts as side information for bandit action selection, enabling context-dependent exploration-exploitation (Chu et al., 2011).

What are key methods?

LinUCB for linear payoffs (Chu et al., 2011), Thompson Sampling (Agrawal and Goyal, 2012), Epoch-Greedy (Langford and Zhang, 2007).

What are top cited papers?

DRN (Zheng et al., 2018, 612 cites), Chu et al. (2011, 577 cites), Agrawal and Goyal (2012, 547 cites).

What open problems exist?

Optimal regret for non-linear payoffs, scaling to massive contexts, non-stationarity beyond collaborative settings (Li et al., 2016).

Research Advanced Bandit Algorithms with AI

PapersFlow provides specialized AI tools for Decision Sciences researchers.

See how researchers in Economics & Business use PapersFlow

Field-specific workflows, example queries, and use cases.

Economics & Business Guide

Start Researching Contextual Multi-Armed Bandits with AI

Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.

See how PapersFlow works for Decision Sciences researchers