Subtopic Deep Dive

Stylometry for Authorship Attribution
Research Guide

What is Stylometry for Authorship Attribution?

Stylometry for authorship attribution uses statistical and machine learning methods to identify authors from linguistic style markers like n-gram frequencies, function words, and syntactic patterns.
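
As a rough illustration of one such style marker, the sketch below builds a character n-gram frequency profile using only the Python standard library. The function name `char_ngram_profile`, the trigram default, and the sample sentence are assumptions chosen for this example; they are not taken from any of the cited papers.

```python
from collections import Counter


def char_ngram_profile(text: str, n: int = 3, top_k: int = 5) -> list[tuple[str, float]]:
    """Return the top_k most frequent character n-grams with relative frequencies.

    Hypothetical helper for illustration only.
    """
    # Lowercase and collapse runs of whitespace so layout does not affect the profile.
    normalized = " ".join(text.lower().split())
    grams = [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
    if not grams:
        return []
    counts = Counter(grams)
    total = len(grams)
    return [(gram, count / total) for gram, count in counts.most_common(top_k)]


# Made-up sample text, not drawn from any corpus.
profile = char_ngram_profile("the cat sat on the mat and the hat")
for gram, rel_freq in profile:
    print(repr(gram), round(rel_freq, 3))
```

Profiles like this are usually computed per text and then compared across candidate authors; the choice of `n` and of how many n-grams to keep is a tuning decision rather than a fixed rule.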

Researchers apply stylometric models to literary, forensic, and digital texts across languages. Key works include Stamatatos et al. (2000), with 396 citations, on genre and author categorization in Modern Greek, and Eder et al. (2016), with 370 citations, which introduces the 'stylo' R package for computational stylistics. More than 1,500 papers have explored these techniques since 2000.

15 Curated Papers · 3 Key Challenges

Why It Matters

Stylometry enables forensic identification of authors in anonymous texts, supporting law enforcement in criminal investigations (Gamon, 2004). In digital humanities, it resolves literary disputes, such as attributing disputed works via function words (Kestemont, 2014). Plagiarism detection benefits from stylometric features, aiding academic integrity (Meuschke and Gipp, 2013). Applications extend to de-anonymizing programmers from binaries (Caliskan et al., 2018).

Key Research Challenges

Many Authors Limited Data

Attribution accuracy drops when there are many candidate authors and few training texts. Luyckx and Daelemans (2008) show that studies with only a few authors overestimate the importance of the extracted features, which leads to poor generalization (151 citations). Models therefore need to handle data sparsity robustly.

Explaining Function Word Use

Function words succeed empirically but lack theoretical justification in stylometry. Kestemont (2014) critiques their 'black magic' status, urging mechanistic explanations (108 citations). This hampers model interpretability.
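
To make the feature concrete, here is a minimal sketch of turning a text into a vector of function-word relative frequencies. The eight-word list, the tokenizer, and the name `function_word_vector` are hypothetical choices for this example; published studies typically use much longer, language-specific lists.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (assumption for this sketch).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it")


def function_word_vector(text: str, vocabulary=FUNCTION_WORDS) -> list[float]:
    """Relative frequency of each vocabulary word among all tokens of the text."""
    # Very simple tokenizer: runs of ASCII letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


# Made-up sample sentence.
vec = function_word_vector("The cat and the dog went to the park in a hurry")
print(dict(zip(FUNCTION_WORDS, vec)))
```

Because such words occur in almost any text regardless of topic, vectors like `vec` can be compared across documents on different subjects, which is part of why the feature works well in practice even though its theoretical basis is debated.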

Style Obfuscation Detection

Adversarial changes to writing style can evade stylometric detection in forensic settings. Caliskan et al. (2018) demonstrate that coding style survives compilation into executable binaries, but note vulnerabilities in text (105 citations). Features that stay robust under manipulation are still needed.

Essential Papers

1. Automatic Text Categorization in Terms of Genre and Author

Efstathios Stamatatos, Nikos Fakotakis, George Kokkinakis · 2000 · Computational Linguistics · 396 citations

The two main factors that characterize a text are its content and its style, and both can be used as a means of categorization. In this paper we present an approach to text categorization in terms ...

2. Stylometry with R: A Package for Computational Text Analysis

Maciej Eder, Jan Rybicki, Mike Kestemont · 2016 · The R Journal · 370 citations

This software paper describes 'Stylometry with R' (stylo), a flexible R package for the high-level analysis of writing style in stylometry. Stylometry (computational stylistics) is concerned with the...

3. Computational Sociolinguistics: A Survey

Dong Nguyen, A. Seza Doğruöz, Carolyn Penstein Rosé et al. · 2016 · Computational Linguistics · 219 citations

Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimens...

4. Linguistic correlates of style

Michael Gamon · 2004 · 188 citations

The identification of authorship falls into the category of style classification, an interesting sub-field of text categorization that deals with properties of the form of linguistic expression as ...

5. Authorship attribution and verification with many authors and limited data

Kim Luyckx, Walter Daelemans · 2008 · 151 citations

Most studies in statistical or machine learning based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of the features extracted from the trai...

6. Function Words in Authorship Attribution. From Black Magic to Theory?

Mike Kestemont · 2014 · 108 citations

This position paper focuses on the use of function words in computational authorship attribution. Although recently there have been multiple successful applications of authorship attribution, the fi...

7. When Coding Style Survives Compilation: De-anonymizing Programmers from Executable Binaries

Aylin Caliskan, Fabian Yamaguchi, Edwin Dauber et al. · 2018 · 105 citations

The ability to identify authors of computer programs based on their coding style is a direct threat to the privacy and anonymity of programmers. While recent work found that source code can be at...

Reading Guide

Foundational Papers

Read Stamatatos et al. (2000) first for baseline genre/author stylometry (396 citations); Gamon (2004) next for linguistic feature sets (188 citations); Luyckx and Daelemans (2008) for multi-author scaling (151 citations).

Recent Advances

Study the 'stylo' package of Eder et al. (2016) for tools (370 citations); Caliskan et al. (2018) for code stylometry (105 citations); Akimushkin et al. (2017) for network dynamics (95 citations).

Core Methods

Core techniques: function word ratios (Kestemont, 2014), n-gram frequencies (Stamatatos et al., 2000), syntactic patterns (Gamon, 2004), R-based stylometry with 'stylo' (Eder et al., 2016), and word co-occurrence networks (Akimushkin et al., 2017).
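
Of the techniques listed above, the co-occurrence network is probably the least self-explanatory, so a small sketch may help. It links words that appear within a fixed window of each other and reports each word's degree. This adjacency-based construction, the `window` parameter, and the function names are assumptions for illustration; the exact construction in Akimushkin et al. (2017) may differ.

```python
import re
from collections import defaultdict


def cooccurrence_network(text: str, window: int = 1) -> dict[str, set[str]]:
    """Undirected graph linking each word to words within `window` positions after it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    graph: defaultdict[str, set[str]] = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            other = tokens[j]
            if other != word:  # skip self-loops from immediate repetitions
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


def node_degrees(graph: dict[str, set[str]]) -> dict[str, int]:
    """Number of distinct neighbours per word, a simple topological feature."""
    return {word: len(neighbours) for word, neighbours in graph.items()}


# Made-up sample sentence.
net = cooccurrence_network("the cat saw the dog and the dog saw the cat")
degrees = node_degrees(net)
print(degrees)
```

Network measures such as degree can then serve as style features alongside plain frequency counts.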

How PapersFlow Helps You Research Stylometry for Authorship Attribution

Discover & Search

Research Agent uses searchPapers and citationGraph to map stylometry literature from Stamatatos et al. (2000), revealing 396 citations and descendants like Eder et al. (2016). findSimilarPapers expands to function word studies from Kestemont (2014); exaSearch queries 'stylometry n-grams authorship Greek' for cross-language works.

Analyze & Verify

Analysis Agent applies readPaperContent to extract features from Gamon (2004), then uses verifyResponse with CoVe to check claims against Nguyen et al. (2016). runPythonAnalysis replicates the Delta measures of the 'stylo' package with NumPy/pandas on n-gram frequencies; GRADE scores the strength of evidence for function word efficacy (Kestemont, 2014).
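
For orientation, a Delta-style comparison can be sketched in a few lines. The example below follows the general idea of Burrows's Delta (z-score each feature across texts, then average the absolute differences) using only the standard library. It is not the 'stylo' implementation and not PapersFlow's code; the function names and the frequency values are invented for illustration, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def zscore_columns(freq_table: dict[str, list[float]]) -> dict[str, list[float]]:
    """z-score each feature (column) across all texts; needs at least two texts."""
    names = list(freq_table)
    columns = list(zip(*(freq_table[name] for name in names)))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 when a feature has zero variance to avoid division by zero.
    spreads = [stdev(col) or 1.0 for col in columns]
    return {
        name: [(x - m) / s for x, m, s in zip(freq_table[name], means, spreads)]
        for name in names
    }


def burrows_delta(z_a: list[float], z_b: list[float]) -> float:
    """Mean absolute difference between two z-scored feature vectors."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))


# Invented relative frequencies of three frequent words in three texts.
freqs = {
    "author_A_known": [0.060, 0.030, 0.020],
    "author_B_known": [0.040, 0.045, 0.010],
    "disputed": [0.058, 0.032, 0.019],
}
z = zscore_columns(freqs)
candidates = ["author_A_known", "author_B_known"]
best = min(candidates, key=lambda name: burrows_delta(z["disputed"], z[name]))
print(best)
```

The candidate with the smallest distance to the disputed text is taken as the most likely author; with real corpora the feature set would normally be the few hundred most frequent words rather than three.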

Synthesize & Write

Synthesis Agent detects gaps in multi-author attribution (Luyckx and Daelemans, 2008) and flags contradictions in style markers. Writing Agent uses latexEditText for methods sections, latexSyncCitations for 10+ papers, latexCompile for reports; exportMermaid diagrams co-occurrence networks from Akimushkin et al. (2017).

Use Cases

"Reproduce stylometry analysis from Eder et al. 2016 stylo package on my corpus"

Research Agent → searchPapers('stylo R package') → Analysis Agent → readPaperContent + runPythonAnalysis (pandas n-gram freqs, matplotlib deltas) → researcher gets reproducible plots and attribution accuracy metrics.

"Write LaTeX review of function words in authorship attribution citing Kestemont 2014"

Synthesis Agent → gap detection → Writing Agent → latexEditText(draft) → latexSyncCitations(5 papers) → latexCompile → researcher gets compiled PDF with bibliography and inline citations.

"Find GitHub repos implementing stylometric models from recent papers"

Research Agent → citationGraph(Eder 2016) → Code Discovery → paperExtractUrls → paperFindGithubRepo → githubRepoInspect → researcher gets repo code, README, and stylometry scripts for local testing.

Automated Workflows

The Deep Research workflow scans 50+ stylometry papers via searchPapers, then structures reports on function words (Kestemont, 2014) using citationGraph and GRADE. DeepScan's 7-step process verifies multi-author challenges (Luyckx and Daelemans, 2008) with CoVe checkpoints and runPythonAnalysis. Theorizer generates hypotheses on network-based stylometry from Akimushkin et al. (2017).

Frequently Asked Questions

What is stylometry for authorship attribution?

Stylometry identifies authors using quantitative linguistic features such as n-grams and function words, rather than the semantic content of a text (Gamon, 2004).

What are core methods in stylometry?

Methods include function word frequencies (Kestemont, 2014), R-based 'stylo' package (Eder et al., 2016), and word co-occurrence networks (Akimushkin et al., 2017).

What are key papers?

Stamatatos et al. (2000, 396 citations) on genre/author categorization; Gamon (2004, 188 citations) on style correlates; Luyckx and Daelemans (2008, 151 citations) on many authors.

What open problems exist?

Challenges include limited data for many authors (Luyckx and Daelemans, 2008), theoretical basis for function words (Kestemont, 2014), and obfuscation resistance (Caliskan et al., 2018).
