PapersFlow Research Brief
Authorship Attribution and Profiling
Research Guide
What is Authorship Attribution and Profiling?
Authorship attribution and profiling is the application of stylometry, text classification, machine learning, and forensic linguistics to identify authors of anonymous texts, predict demographic attributes from online content, and analyze linguistic uniqueness across genres and languages, including gender differences and language use in social media.
The field encompasses 23,296 works on authorship attribution, stylometry, and user profiling in text. Its techniques combine text classification, machine learning, and forensic linguistics, and studies range from identifying the authors of anonymous texts to predicting demographic attributes, analyzing gender differences, and characterizing language use in social media.
Research Sub-Topics
Stylometry for Authorship Attribution
This sub-topic examines statistical and machine learning methods to identify authors based on linguistic style markers such as n-gram frequencies, function word usage, and syntactic patterns. Researchers develop and evaluate stylometric models for attributing authorship in literary, forensic, and digital texts across languages.
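The style markers named above can be turned into numeric profiles and compared. The following is a minimal sketch, not a method taken from any cited paper: it assumes a tiny illustrative function-word list (`FUNCTION_WORDS`), character trigrams, and nearest-profile attribution by cosine similarity. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list; real studies use far more.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "but"]


def function_word_profile(text):
    """Relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def char_ngram_profile(text, n=3):
    """Relative frequency of each character n-gram in `text`."""
    text = re.sub(r"\s+", " ", text.lower())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}


def cosine(u, v):
    """Cosine similarity between two sparse profiles stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def attribute(disputed, candidates, n=3):
    """Return the candidate whose n-gram profile is closest to the disputed text."""
    target = char_ngram_profile(disputed, n)
    scores = {
        author: cosine(target, char_ngram_profile(sample, n))
        for author, sample in candidates.items()
    }
    return max(scores, key=scores.get), scores
```

Here `candidates` maps an author label to a known writing sample. In practice each author would be represented by many samples, and a trained classifier would usually replace the single nearest-profile comparison.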
Authorship Attribution in Social Media
This area focuses on attributing authorship and profiling users from short, noisy social media texts using features like emojis, hashtags, and posting patterns combined with deep learning classifiers. Studies address challenges like evolving language and multi-author accounts in platforms such as Twitter and Facebook.
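Surface features of this kind can be extracted with a few regular expressions before any classifier is applied. The sketch below is illustrative only: the feature set, the function names, and the rough `EMOJI` character ranges are assumptions, not a definition taken from the studies summarized here.

```python
import re

HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")
# Rough emoji matcher over common Unicode blocks; an assumption, not a full definition.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def post_features(post):
    """Count simple style markers in one short social media post."""
    no_urls = URL.sub(" ", post)  # drop URLs first so '#fragment' is not a hashtag
    plain = MENTION.sub(" ", HASHTAG.sub(" ", no_urls))
    words = re.findall(r"[A-Za-z']+", plain)
    return {
        "hashtags": len(HASHTAG.findall(no_urls)),
        "mentions": len(MENTION.findall(no_urls)),
        "urls": len(URL.findall(post)),
        "emojis": len(EMOJI.findall(post)),
        "exclamations": post.count("!"),
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        # A letter repeated three or more times in a row, as in "todaaay".
        "elongated_words": sum(1 for w in words if re.search(r"(.)\1{2}", w)),
    }


def author_profile(posts):
    """Average each feature over an author's posts."""
    rows = [post_features(p) for p in posts]
    if not rows:
        return {}
    return {key: sum(r[key] for r in rows) / len(rows) for key in rows[0]}
```

Averaging over many posts is one way to offset the brevity and noise of individual posts; the resulting vector can be fed to any classifier.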
Cross-Lingual Authorship Attribution
Researchers investigate methods to attribute authorship across different languages, leveraging transfer learning and language-independent stylometric features to handle multilingual corpora. Work includes evaluating performance on low-resource languages and genre-specific adaptations.
Demographic Profiling from Text
This sub-topic explores predicting user attributes like age, gender, and personality from linguistic cues in online texts using supervised learning and topic models. Research quantifies biases in profiling models and improves accuracy across genres like blogs and forums.
Adversarial Stylometry and Obfuscation
Studies develop attacks on stylometric systems and countermeasures, including authorship obfuscation techniques that alter text style while preserving semantics using GANs and paraphrasing. Researchers benchmark robustness of attribution models against such evasions.
Why It Matters
Authorship attribution and profiling enable identification of the authors behind anonymous texts, with applications in forensic linguistics and security. Bertrand and Mullainathan (2004) demonstrated labor market discrimination by sending fictitious resumes with African-American- or White-sounding names: White-sounding names received 50 percent more callbacks, showing how a single textual cue to a demographic attribute can bias text-based decisions. Caliskan et al. (2017) showed that semantics derived from language corpora contain human-like biases: machines learn word associations that mirror societal prejudices, which affects AI fairness in hiring and content moderation. Chung and Pennebaker (2012) describe LIWC, software that classifies texts along psychological dimensions using word categories and aids prediction of outcomes from social media language.
Reading Guide
Where to Start
"Linguistic Inquiry and Word Count (LIWC)" by Chung and Pennebaker (2012) because it provides a practical tool for text classification along psychological dimensions using word categories, serving as an accessible entry to profiling techniques.
Key Papers Explained
Bertrand and Mullainathan (2004), in "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination", showed name-based discrimination in resumes, with White-sounding names receiving 50 percent more callbacks. Brown et al. (1992), in "Class-based n-gram models of natural language", predicted words via classes derived from co-occurrence, and Brown et al. (1993), in "The mathematics of statistical machine translation: parameter estimation", built foundational statistical models for word alignment. Church and Hanks (1990), in "Word association norms, mutual information, and lexicography", used mutual information to measure word association, while Caliskan et al. (2017), in "Semantics derived automatically from language corpora contain human-like biases", connected such corpus statistics to bias detection in modern ML.
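The mutual information measure associated with Church and Hanks (1990) compares how often two words occur together with how often they would co-occur by chance: I(x, y) = log2( P(x, y) / (P(x) P(y)) ). The sketch below is a simplification that assumes adjacent word pairs and raw maximum-likelihood estimates; the function name `pmi_table` is hypothetical.

```python
import math
from collections import Counter


def pmi_table(tokens):
    """Pointwise mutual information (base 2) for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    table = {}
    for (x, y), f_xy in bigrams.items():
        p_xy = f_xy / n_bi          # joint probability of the ordered pair
        p_x = unigrams[x] / n_uni   # marginal probability of the first word
        p_y = unigrams[y] / n_uni   # marginal probability of the second word
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table
```

A positive score means the pair co-occurs more often than independence would predict. Scores from very low counts are unreliable, so a minimum-frequency cutoff is a common safeguard.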
Advanced Directions
Research continues to build on the statistical models of Brown et al. (1993) and the n-gram approaches of Brown et al. (1992), with a focus on social media analysis; no recent preprints are reported. Emphasis remains on forensic applications drawing on Chung and Pennebaker (2012) and on bias mitigation following Caliskan et al. (2017).
Papers at a Glance
| # | Paper | Year | Venue | Citations | Open Access |
|---|---|---|---|---|---|
| 1 | Are Emily and Greg More Employable Than Lakisha and Jamal? A F... | 2004 | American Economic Review | 4.3K | ✕ |
| 2 | The mathematics of statistical machine translation: parameter ... | 1993 | — | 4.1K | ✕ |
| 3 | Word association norms, mutual information, and lexicography | 1990 | Computational Linguistics | 3.7K | ✕ |
| 4 | Language identification in the limit | 1967 | Information and Control | 3.6K | ✕ |
| 5 | Quantitative Analysis of Culture Using Millions of Digitized B... | 2010 | Science | 3.0K | ✓ |
| 6 | Class-based n-gram models of natural language | 1992 | Computational Linguistics | 2.9K | ✕ |
| 7 | Semantics derived automatically from language corpora contain ... | 2017 | Science | 2.6K | ✓ |
| 8 | A survey of named entity recognition and classification | 2007 | Lingvisticae Investiga... | 2.5K | ✕ |
| 9 | Linguistic Inquiry and Word Count (LIWC) | 2012 | IGI Global eBooks | 2.2K | ✕ |
| 10 | On the prediction of occurrence of particular verbal intrusion... | 1959 | Journal of Experimenta... | 2.1K | ✕ |
Frequently Asked Questions
What is stylometry in authorship attribution?
Stylometry analyzes linguistic features to identify authors of anonymous texts. It uses techniques such as word frequency counts and n-gram models to distinguish individual writing styles. Brown et al. (1993) estimated the parameters of statistical word-alignment models for translation, a statistical foundation that stylometric parameter estimation also draws on.
How does machine learning contribute to user profiling?
Machine learning classifies texts to predict demographic attributes from online content. Caliskan et al. (2017) showed that models trained on language corpora absorb human-like biases in word associations. Chung and Pennebaker (2012) describe LIWC as an efficient tool for psychological classification of texts.
What role does forensic linguistics play in gender differences analysis?
Forensic linguistics examines language use in social media to attribute authorship and profile users. Nadeau and Sekine (2007) surveyed named entity recognition systems developed with hand-crafted grammars for text analysis. Church and Hanks (1990) used mutual information from word associations for statistical descriptions of language patterns.
What are key methods in text classification for author identification?
Methods include class-based n-gram models and dictionary-based tools such as LIWC. Brown et al. (1992) developed n-gram models that assign words to classes based on co-occurrence frequencies. Chung and Pennebaker (2012) describe LIWC, which counts grammatical, psychological, and content words for text classification.
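In a class-based bigram model, the probability of a word given its predecessor factors as P(w_i | w_{i-1}) = P(w_i | c_i) · P(c_i | c_{i-1}), where c_i is the class of w_i. The sketch below assumes a hand-written word-to-class map (`word_class`, hypothetical); it does not implement the clustering that Brown et al. (1992) use to derive classes from co-occurrence.

```python
from collections import Counter


def train_class_bigram(tokens, word_class):
    """Return prob(word, prev_word) under a class-based bigram model."""
    classes = [word_class[w] for w in tokens]
    word_counts = Counter(tokens)
    class_counts = Counter(classes)
    class_bigrams = Counter(zip(classes, classes[1:]))
    prev_counts = Counter(classes[:-1])  # classes that have a successor

    def prob(word, prev_word):
        c, c_prev = word_class[word], word_class[prev_word]
        if not class_counts[c] or not prev_counts[c_prev]:
            return 0.0
        emit = word_counts[word] / class_counts[c]               # P(w | c)
        trans = class_bigrams[(c_prev, c)] / prev_counts[c_prev]  # P(c | c_prev)
        return emit * trans

    return prob
```

Because probability mass is shared within a class, a pair never seen in training (for example a determiner followed by a noun it never preceded) can still receive a non-zero estimate, which is the main appeal of the approach for sparse data.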
How has LIWC advanced authorship profiling?
LIWC references a dictionary of word categories to classify texts psychologically. Chung and Pennebaker (2012) describe its use in predicting outcomes from language use. It efficiently analyzes social media for user profiling and demographic prediction.
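The dictionary-lookup idea can be illustrated with a toy lexicon. This is a sketch under stated assumptions, not LIWC itself: the categories and entries in `TOY_LEXICON` are invented for illustration, and the trailing `*` for stem matching is treated here as a convention of this sketch.

```python
import re
from collections import Counter

# Invented toy lexicon; the real LIWC dictionary is far larger.
TOY_LEXICON = {
    "pronoun": ["i", "me", "my", "you", "we"],
    "posemo": ["happy", "love*", "good"],
    "negemo": ["sad", "hate*", "angry"],
}


def matches(token, entry):
    """Exact match, or prefix match when the entry ends with '*'."""
    if entry.endswith("*"):
        return token.startswith(entry[:-1])
    return token == entry


def category_percentages(text, lexicon=TOY_LEXICON):
    """Percentage of tokens in `text` that fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for token in tokens:
        for category, entries in lexicon.items():
            if any(matches(token, entry) for entry in entries):
                hits[category] += 1
    total = len(tokens) or 1
    return {category: 100.0 * hits[category] / total for category in lexicon}
```

The resulting percentages form a compact feature vector per text, which can then be compared across authors or passed to a classifier for profiling.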
What is the current state of research in this field?
The field includes 23,296 works on stylometry and profiling across languages and genres. Top papers focus on statistical models and bias detection, led by Bertrand and Mullainathan (2004) with 4,317 citations. No recent preprints or news coverage were reported in the last 12 months.
Open Research Questions
- How can stylometric models distinguish authors across multiple languages and genres while accounting for topic variation?
- What techniques mitigate human-like biases in machine learning models trained for authorship profiling?
- How do n-gram models and word association norms improve prediction of demographic attributes from short social media texts?
- Which linguistic features best capture individual uniqueness for forensic attribution of anonymous online content?
- How can LIWC categories be extended to real-time profiling in diverse cultural contexts?
Recent Trends
The field maintains 23,296 works with no reported 5-year growth rate.
Highly cited papers such as Bertrand and Mullainathan (2004), with 4,317 citations, and Brown et al. (1993), with 4,118 citations, dominate, focusing on linguistic discrimination and statistical models.
The absence of recent preprints or news coverage in the last 12 months suggests steady reliance on established methods.