Subtopic Deep Dive

Feature Selection Techniques for Text Categorization
Research Guide

What Are Feature Selection Techniques for Text Categorization?

Feature Selection Techniques for Text Categorization select subsets of text features to reduce dimensionality while maintaining classification performance.

Methods include document frequency, chi-squared, and mutual information, as evaluated in comparative studies (Yang and Pedersen, 1997; 4766 citations). These techniques address high-dimensionality in text data by eliminating non-discriminative terms (Forman, 2003; 2389 citations). Over 10 key papers from 1997-2020 analyze their impact on classifiers like naive Bayes and SVM.

15 Curated Papers · 3 Key Challenges

Why It Matters

Feature selection improves text classifier efficiency by mitigating the curse of dimensionality, enabling scalable categorization for news filtering and spam detection (Yang and Pedersen, 1997). Yang and Liu (1999; 2651 citations) showed that information gain and chi-squared outperform frequency-based methods by 10-20% in precision. Forman (2003) demonstrated that balanced metrics such as MCC yield robust selections across imbalanced datasets (Chicco and Jurman, 2020; 5276 citations). Applications span sentiment analysis (Pang et al., 2002; 6979 citations) and semi-supervised learning (Nigam et al., 2000; 2732 citations).

Key Research Challenges

Metric Selection Bias

Choosing among accuracy, F1, and MCC affects feature rankings differently across datasets (Chicco and Jurman, 2020). Yang and Pedersen (1997) found that document frequency underperforms on sparse corpora. Forman (2003) highlights the need for balanced metrics in imbalanced text classes.
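The divergence is easy to see on a toy example in the spirit of Chicco and Jurman (2020); the labels and predictions below are fabricated purely for illustration:

```python
# Illustrative only: how metric choice diverges on imbalanced labels.
# A degenerate "always positive" model looks strong under accuracy
# and F1 but scores zero under MCC.
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = [1] * 90 + [0] * 10   # 90% positive class (toy imbalance)
y_pred = [1] * 100             # model that always predicts positive

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.90
print("F1:", f1_score(y_true, y_pred))              # ~0.95
print("MCC:", matthews_corrcoef(y_true, y_pred))    # 0.0
```

A filter ranked by accuracy or F1 on such data can therefore retain features that carry no class information at all, which is the bias this challenge describes.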

Scalability to Large Vocabularies

Text corpora yield millions of features, straining univariate filters (Yang and Liu, 1999). Aggressive reduction risks losing discriminative n-grams. Empirical studies show computational trade-offs with classifier retraining (Forman, 2003).

Integration with Embeddings

Traditional term-based selection mismatches dense embeddings in modern classifiers (Yang et al., 2016). Semi-supervised contexts complicate feature relevance (Nigam et al., 2000). Hierarchical models demand multi-level selection strategies.

Essential Papers

1.

Thumbs up? Sentiment Classification using Machine Learning Techniques

Bo Pang, Lillian Lee, Shivakumar Vaithyanathan · 2002 · 7.0K citations

We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standa...

3.

A Comparative Study on Feature Selection in Text Categorization

Yiming Yang, Jan Pedersen · 1997 · 4.8K citations

This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, in...

4.

Hierarchical Attention Networks for Document Classification

Zichao Yang, Diyi Yang, Chris Dyer et al. · 2016 · 4.7K citations

Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

5.

Semi-Supervised Learning

Olivier Chapelle, Bernhard Schölkopf, Alexander Zien · 2006 · The MIT Press eBooks · 4.3K citations

A comprehensive review of an area of machine learning that deals with the use of unlabeled data in classification problems: state-of-the-art algorithms, a taxonomy of the field, applications, bench...

6.

Text Classification from Labeled and Unlabeled Documents using EM

Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun et al. · 2000 · Machine Learning · 2.7K citations

7.

A re-examination of text categorization methods

Yiming Yang, Xin Liu · 1999 · 2.7K citations

Yiming Yang and Xin Liu · School of Computer Science, Carnegie Mellon University, Pittsburgh, PA

Reading Guide

Foundational Papers

Start with Yang and Pedersen (1997; 4766 citations) for core comparisons of document frequency, chi-squared, information gain, and odds ratio; then Forman (2003; 2389 citations) for an extensive empirical study of selection metrics; Yang and Liu (1999) for classifier integration.

Recent Advances

Chicco and Jurman (2020; 5276 citations) for MCC superiority; Yang et al. (2016; 4716 citations) for attention-based feature weighting; van Engelen and Hoos (2019) for semi-supervised extensions.

Core Methods

Univariate filters: document frequency, chi-squared, mutual information, odds ratio (Yang and Pedersen, 1997). Multivariate: latent semantic indexing hybrids. Evaluation: MCC and macro-F1 on the Reuters-21578 and 20 Newsgroups datasets (Forman, 2003).

How PapersFlow Helps You Research Feature Selection Techniques for Text Categorization

Discover & Search

Research Agent uses searchPapers to query 'chi-squared feature selection text categorization' retrieving Yang and Pedersen (1997), then citationGraph reveals Forman (2003) and Yang and Liu (1999) as high-impact citations. exaSearch uncovers sparse embedding variants; findSimilarPapers expands to semi-supervised extensions like Nigam et al. (2000).

Analyze & Verify

Analysis Agent applies readPaperContent to extract chi-squared formulas from Yang and Pedersen (1997), then runPythonAnalysis recreates selection on Reuters dataset with NumPy/pandas for precision curves. verifyResponse (CoVe) cross-checks claims against Forman (2003); GRADE scores evidence strength for metric comparisons (Chicco and Jurman, 2020).

Synthesize & Write

Synthesis Agent detects gaps in univariate vs. multivariate selection via contradiction flagging across Yang papers. Writing Agent uses latexEditText for method comparisons, latexSyncCitations for 10+ refs, and latexCompile for tables. exportMermaid visualizes selection-classifier pipeline diagrams.

Use Cases

"Reproduce chi-squared vs DF selection from Yang 1997 on sample text data"

Research Agent → searchPapers('Yang Pedersen 1997') → Analysis Agent → readPaperContent → runPythonAnalysis (chi2 vs df on sklearn datasets) → matplotlib precision-recall plot.
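A minimal, self-contained sketch of the analysis step in this workflow is shown below; it uses a repeated toy corpus instead of the Reuters data from the paper, and the k=10 feature budget and naive Bayes classifier are illustrative choices:

```python
# Compare chi-squared selection against document-frequency selection,
# loosely following the setup of Yang and Pedersen (1997).
# Data, budget, and classifier are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

docs = [
    "oil prices climb on supply fears", "crude oil output cut announced",
    "fed raises interest rates again", "central bank holds rates steady",
    "striker scores twice in derby win", "late goal seals cup final win",
    "injury rules out star striker", "oil exports rise despite sanctions",
] * 3  # repeat to give cross-validation enough samples
y = np.array([0, 0, 1, 1, 2, 2, 2, 0] * 3)

vec = CountVectorizer()
X = vec.fit_transform(docs)
k = 10

# Chi-squared: keep the k highest-scoring terms.
X_chi = SelectKBest(chi2, k=k).fit_transform(X, y)
# Document frequency: keep the k most widespread terms.
df = np.asarray((X > 0).sum(axis=0)).ravel()
X_df = X[:, np.argsort(df)[::-1][:k]]

for name, Xs in [("chi2", X_chi), ("df", X_df)]:
    acc = cross_val_score(MultinomialNB(), Xs, y, cv=3).mean()
    print(f"{name}: accuracy={acc:.2f}")
```

Swapping the toy corpus for `fetch_20newsgroups` (or the Reuters data) and the accuracy score for precision-recall curves recovers the full use case described above.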

"Write LaTeX section comparing feature selection metrics for text classification"

Synthesis Agent → gap detection (Yang 1997 + Forman 2003) → Writing Agent → latexEditText(table) → latexSyncCitations(10 papers) → latexCompile → PDF output.

"Find GitHub repos implementing mutual information selection from text papers"

Research Agent → searchPapers('mutual information text feature selection') → Code Discovery → paperExtractUrls → paperFindGithubRepo → githubRepoInspect → working sklearn implementations.

Automated Workflows

Deep Research workflow scans 50+ papers via searchPapers on 'feature selection text categorization', structures report with metric comparisons from Yang and Forman papers. DeepScan applies 7-step CoVe chain: readPaperContent → runPythonAnalysis(reproduce experiments) → GRADE(classifier metrics). Theorizer generates hypotheses on embedding-compatible selection from hierarchical attention insights (Yang et al., 2016).

Frequently Asked Questions

What defines feature selection in text categorization?

It reduces high-dimensional term vectors using statistical measures like chi-squared or mutual information while preserving class separability (Yang and Pedersen, 1997).

Which methods perform best per key papers?

Chi-squared and information gain outperform document frequency by 10-15% on Reuters-21578; chi-squared performs best under balanced accuracy (Yang and Pedersen, 1997; Forman, 2003).

What are foundational papers?

Yang and Pedersen (1997; 4766 citations) compares 5 methods; Forman (2003; 2389 citations) studies 35+ metrics; Yang and Liu (1999; 2651 citations) re-examines kNN/SVM baselines.

What open problems remain?

Adapting univariate filters to dense embeddings; scalable multivariate selection for billion-term corpora; robust metrics beyond F1/MCC for streaming text (Chicco and Jurman, 2020).

Research Text and Document Classification Technologies with AI

PapersFlow provides specialized AI tools for Computer Science researchers. Here are the most relevant for this topic:

See how researchers in Computer Science & AI use PapersFlow

Field-specific workflows, example queries, and use cases.

Computer Science & AI Guide

Start Researching Feature Selection Techniques for Text Categorization with AI

Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.

See how PapersFlow works for Computer Science researchers