Subtopic Deep Dive

Instance Selection and Hard Example Mining with Noisy Labels
Research Guide

What is Instance Selection and Hard Example Mining with Noisy Labels?

Instance Selection and Hard Example Mining with Noisy Labels applies co-teaching, divide-and-conquer, and confidence-based selection to filter clean samples from noisy datasets for robust classifier training.

Studies focus on selecting reliable instances from label-noisy data to improve classification performance. Techniques such as boosting and ensemble methods reweight or select hard examples iteratively (Friedman et al., 2000; Geurts et al., 2006). The curated papers below address scalability and noise robustness in large datasets.
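
As a concrete illustration, confidence-based selection can be sketched in a few lines of NumPy: keep only the samples whose predicted probability for their given label clears a threshold. The function name, threshold, and toy data here are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

def select_by_confidence(probs, labels, threshold=0.5):
    """Keep indices of samples whose predicted probability for the
    *given* label exceeds `threshold` -- a simple clean-sample filter."""
    conf = probs[np.arange(len(labels)), labels]  # p(given label | x)
    return np.where(conf >= threshold)[0]

# Toy example: 4 samples, 3 classes.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 1, 0, 2])  # sample 2's label disagrees with the model
clean_idx = select_by_confidence(probs, labels)  # keeps samples 0, 1, 3
```

In practice the probabilities would come from a trained model, and the threshold is a tuning knob: too low lets noise through, too high discards genuinely hard examples.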

15 Curated Papers · 3 Key Challenges

Why It Matters

Instance selection enables robust models in real-world scenarios like web-scale image classification or medical data annotation where labels are error-prone, reducing relabeling costs. Boosting reweights noisy examples sequentially for better margins (Friedman et al., 2000, 6854 citations). Extremely randomized trees handle noisy splits effectively in high-dimensional noisy data (Geurts et al., 2006, 8106 citations), critical for industrial ML pipelines with imperfect labels.

Key Research Challenges

Scalability to Large Noisy Datasets

Selecting clean instances from millions of noisy labels requires efficient divide-and-conquer without full data passes. Boosting variants struggle with quadratic scaling (Bauer and Kohavi, 1999). Surveys highlight computational bottlenecks in semi-supervised noisy settings (van Engelen and Hoos, 2019).
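
The divide-and-conquer idea can be sketched minimally: filter clean candidates chunk by chunk, so selection never requires holding the full confidence array in memory at once. The function name and chunk size are illustrative assumptions.

```python
import numpy as np

def filter_in_chunks(confidences, threshold, chunk_size=100_000):
    """Divide-and-conquer clean-sample filtering: scan the confidence
    array in fixed-size chunks and collect the indices that pass."""
    kept = []
    for start in range(0, len(confidences), chunk_size):
        chunk = confidences[start:start + chunk_size]
        kept.append(start + np.where(chunk >= threshold)[0])
    return np.concatenate(kept) if kept else np.array([], dtype=int)

conf = np.array([0.9, 0.1, 0.8, 0.2, 0.95, 0.4])
out = filter_in_chunks(conf, 0.5, chunk_size=2)  # same result as one full pass
```

Each chunk is processed independently, so the same loop parallelizes trivially across workers when the data lives in shards.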

Distinguishing Hard vs Noisy Examples

Hard negatives must be mined without confusing them with label errors during confidence-based selection. Optimality results for the simple Bayesian classifier assume clean training data (Domingos and Pazzani, 1997), and feature selection degrades when noise makes relevant features appear irrelevant (Li et al., 2017).

Robustness Across Noise Types

Symmetric, asymmetric, and instance-dependent noise demand adaptive co-teaching strategies. Transfer learning surveys note domain shifts exacerbate label noise (Weiss et al., 2016). Ensemble voting dilutes noise but risks overfitting hard examples (Bauer and Kohavi, 1999).
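
The difference between symmetric and asymmetric noise can be made concrete with a small NumPy sketch: symmetric noise flips a label to any other class uniformly at random, while asymmetric noise flips it only to a fixed confusable class. The noise rates and class mapping below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def symmetric_noise(labels, n_classes, rate, rng):
    """Flip each label to a uniformly random *different* class with prob `rate`."""
    noisy = labels.copy()
    flip = rng.random(len(labels)) < rate
    shift = rng.integers(1, n_classes, size=len(labels))  # nonzero offset
    noisy[flip] = (labels[flip] + shift[flip]) % n_classes
    return noisy

def asymmetric_noise(labels, mapping, rate, rng):
    """Flip class c only to its confusable partner mapping[c] with prob `rate`."""
    noisy = labels.copy()
    flip = rng.random(len(labels)) < rate
    for i in np.where(flip)[0]:
        noisy[i] = mapping[int(labels[i])]
    return noisy

labels = rng.integers(0, 10, size=20000)
noisy_sym = symmetric_noise(labels, 10, 0.4, rng)
noisy_asym = asymmetric_noise(labels, {c: (c + 1) % 10 for c in range(10)}, 0.3, rng)
```

Injecting both noise types into the same dataset is the usual way to benchmark whether a selection strategy is robust across noise structures rather than tuned to one.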

Essential Papers

1.

Extremely randomized trees

Pierre Geurts, Damien Ernst, Louis Wehenkel · 2006 · Machine Learning · 8.1K citations

2.

Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors)

Jerome H. Friedman, Trevor Hastie, Robert Tibshirani · 2000 · The Annals of Statistics · 6.9K citations

Boosting is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training ...

3.

A survey of transfer learning

Karl R. Weiss, Taghi M. Khoshgoftaar, Dingding Wang · 2016 · Journal Of Big Data · 5.9K citations

Machine learning and data mining techniques have been used in numerous real-world applications. An assumption of traditional machine learning methodologies is the training data and testing data are...

4.

“Why Should I Trust You?”: Explaining the Predictions of Any Classifier

Marco Ribeiro, Sameer Singh, Carlos Guestrin · 2016 · 4.6K citations

Despite widespread adoption in NLP, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust in a model. Trust i...

5.

On the Optimality of the Simple Bayesian Classifier under Zero-One Loss

Pedro Domingos, Michael J. Pazzani · 1997 · Machine Learning · 3.0K citations

6.

An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants

Eric Bauer, Ron Kohavi · 1999 · Machine Learning · 2.6K citations

7.

A survey on semi-supervised learning

Jesper E. van Engelen, Holger H. Hoos · 2019 · Machine Learning · 2.4K citations

Semi-supervised learning is the branch of machine learning concerned with using labelled as well as unlabelled data to perform certain learning tasks. Conceptually situated between supervi...

Reading Guide

Foundational Papers

Start with Friedman et al. (2000) for boosting's reweighting of noisy examples; Geurts et al. (2006) for ensemble split selection robust to label errors; and Bauer and Kohavi (1999) for an empirical comparison of bagging and boosting on noisy benchmarks.
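
The boosting reweighting analyzed by Friedman et al. (2000) can be sketched as a single AdaBoost-style weight update, in which misclassified (hard or noisy) examples gain weight. This is a minimal sketch of the standard update, not code from the paper.

```python
import numpy as np

def reweight(weights, correct):
    """One AdaBoost-style round: upweight misclassified samples.

    `correct` is a boolean mask marking correctly classified samples."""
    err = np.sum(weights[~correct]) / np.sum(weights)   # weighted error
    alpha = 0.5 * np.log((1 - err) / err)               # learner's vote weight
    new_w = weights * np.exp(np.where(correct, -alpha, alpha))
    return new_w / new_w.sum(), alpha

w0 = np.full(4, 0.25)
correct = np.array([True, True, True, False])  # sample 3 is misclassified
w, alpha = reweight(w0, correct)
# The misclassified sample's weight grows from 0.25 to 0.5.
```

Under label noise this same dynamic is the failure mode: mislabeled examples are persistently "misclassified" and accumulate weight round after round, which is why boosting needs selection or robust losses on noisy data.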

Recent Advances

van Engelen and Hoos (2019) surveys semi-supervised extensions to noisy selection; Li et al. (2017) links feature selection to noisy instance filtering; Hancock and Khoshgoftaar (2020) reviews CatBoost handling noisy big data.

Core Methods

Co-teaching has two networks cross-select high-confidence (small-loss) instances for each other; boosting iteratively reweights hard examples; confidence thresholding discards samples with low model agreement.
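
The cross-selection step of co-teaching can be sketched as follows: each network keeps the peer's smallest-loss samples, presumed clean under the small-loss assumption. Function and variable names here are illustrative, and the per-sample losses would come from the two networks' forward passes.

```python
import numpy as np

def small_loss_selection(losses_a, losses_b, keep_ratio):
    """Co-teaching-style cross update: each network trains on the
    samples its *peer* finds easiest (smallest per-sample loss)."""
    k = int(keep_ratio * len(losses_a))
    for_b = np.argsort(losses_a)[:k]  # A's presumed-clean picks, fed to B
    for_a = np.argsort(losses_b)[:k]  # B's presumed-clean picks, fed to A
    return for_a, for_b

losses_a = np.array([0.1, 2.0, 0.2, 1.5])
losses_b = np.array([0.3, 1.8, 2.5, 0.1])
for_a, for_b = small_loss_selection(losses_a, losses_b, keep_ratio=0.5)
```

Exchanging selections between two differently initialized networks, rather than letting each network pick its own batch, is what keeps their confirmation biases from reinforcing each other.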

How PapersFlow Helps You Research Instance Selection and Hard Example Mining with Noisy Labels

Discover & Search

Research Agent uses searchPapers with 'co-teaching noisy labels instance selection' to surface 50+ papers, runs citationGraph on Geurts et al. (2006) to reveal ensemble connections to boosting, and applies findSimilarPapers to link to Friedman et al. (2000) for hard-example weighting.

Analyze & Verify

Analysis Agent runs readPaperContent on Friedman et al. (2000) to extract boosting pseudocode, verifies claims via verifyResponse (CoVe) against Geurts et al. (2006), and uses runPythonAnalysis to simulate noisy label filtering with NumPy/pandas on toy datasets, graded by GRADE for statistical significance.

Synthesize & Write

Synthesis Agent detects gaps in noisy label scalability across boosting papers, flags contradictions between Bayesian optimality (Domingos and Pazzani, 1997) and ensemble robustness; Writing Agent applies latexEditText to draft methods, latexSyncCitations for 20+ refs, and exportMermaid for co-teaching flowcharts.

Use Cases

"Simulate co-teaching on noisy CIFAR-10 dataset"

Research Agent → searchPapers 'co-teaching noisy labels' → Analysis Agent → runPythonAnalysis (NumPy/pandas reimplementation of co-teaching's small-loss clean-sample selection) → matplotlib accuracy plots under 40% noise.

"Write LaTeX review of hard example mining ensembles"

Synthesis Agent → gap detection across Bauer/Kohavi (1999) and Geurts (2006) → Writing Agent → latexEditText 'divide-and-conquer section' → latexSyncCitations → latexCompile → PDF with noise robustness tables.

"Find GitHub code for boosting noisy labels"

Research Agent → exaSearch 'noisy label boosting github' → Code Discovery → paperExtractUrls (Friedman 2000) → paperFindGithubRepo → githubRepoInspect → verified implementations of AdaBoost variants.

Automated Workflows

Deep Research workflow scans 50+ papers via citationGraph from Geurts et al. (2006), structures noisy selection methods report with GRADE-verified claims. DeepScan applies 7-step CoVe to validate hard mining scalability in van Engelen (2019) survey. Theorizer generates hypotheses linking boosting reweighting (Friedman et al., 2000) to feature selection under noise (Li et al., 2017).

Frequently Asked Questions

What defines instance selection with noisy labels?

It filters clean samples from noisy datasets using co-teaching or confidence thresholds, enabling robust training (Friedman et al., 2000).

What are core methods?

Boosting reweights hard examples sequentially; randomized trees select splits robust to noise (Geurts et al., 2006; Bauer and Kohavi, 1999).

What are key papers?

Friedman et al. (2000, 6.9K citations) on boosting; Geurts et al. (2006, 8.1K citations) on extremely randomized trees; Domingos and Pazzani (1997) on Bayesian optimality.

What open problems exist?

Scaling co-teaching to billion-scale data and handling instance-dependent noise without clean validation sets (van Engelen and Hoos, 2019).

Research Machine Learning and Data Classification with AI

PapersFlow provides specialized AI tools for Computer Science researchers. Here are the most relevant for this topic:

See how researchers in Computer Science & AI use PapersFlow

Field-specific workflows, example queries, and use cases.

Computer Science & AI Guide

Start Researching Instance Selection and Hard Example Mining with Noisy Labels with AI

Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.

See how PapersFlow works for Computer Science researchers