Subtopic Deep Dive

Demographic Profiling from Text
Research Guide

What is Demographic Profiling from Text?

Demographic profiling from text predicts user attributes such as age, gender, and personality from linguistic patterns in online text, using supervised learning and topic models.

Researchers analyze social media data such as Facebook messages and Twitter posts to identify linguistic markers correlated with demographics. Key methods include open-vocabulary approaches and predictive lexica (Schwartz et al., 2013; Sap et al., 2014). More than 10 papers published between 2007 and 2021 examine accuracy, bias, and privacy in this area.
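The idea of finding linguistic markers correlated with demographics can be illustrated with a small sketch. It correlates each word's relative frequency per user with a numeric attribute, which is the general shape of an open-vocabulary analysis. It is not the pipeline from Schwartz et al. (2013): the whitespace tokenizer, the function names, and the four toy users are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def relative_frequencies(text):
    """Map each lowercase whitespace token to its share of the user's tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()} if total else {}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def word_attribute_correlations(texts, attribute):
    """Correlate every observed word's relative frequency with the attribute."""
    freqs = [relative_frequencies(t) for t in texts]
    vocab = set().union(*freqs)
    return {
        word: pearson([f.get(word, 0.0) for f in freqs], attribute)
        for word in vocab
    }


# Hypothetical toy data: one text per user plus a self-reported age.
texts = [
    "homework tomorrow ugh school",
    "school exam tomorrow",
    "meeting at work then daughter recital",
    "work deadline and mortgage paperwork",
]
ages = [15, 17, 41, 45]

correlations = word_attribute_correlations(texts, ages)

# Words sorted from most "younger-leaning" to most "older-leaning".
for word, r in sorted(correlations.items(), key=lambda item: item[1]):
    print(f"{word:12s} r = {r:+.2f}")
```

With only four users these correlations are not meaningful; a real study would need far more data, proper tokenization, and correction for multiple comparisons.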

15 Curated Papers · 3 Key Challenges

Why It Matters

Demographic profiling enables personalized advertising by matching linguistic cues to user age and gender, building on predictive lexica developed from social media text (Sap et al., 2014). It supports mental health monitoring through personality recognition from text (Farnadi et al., 2016). Bias detection in AI systems benefits from adversarial techniques that try to strip demographic signals from learned representations (Elazar and Goldberg, 2018). Applications also include robust, privacy-preserving text representations across genres such as blogs and forums (Li et al., 2018).

Key Research Challenges

Bias in Profiling Models

Models encode unwanted demographic signals, which can bias predictions across genres. Adversarial training appears to remove these attributes, yet demographic information remains recoverable from the learned representations (Elazar and Goldberg, 2018). Nguyen et al. (2021) highlight the annotation challenges of fine-grained age prediction on Twitter.

Privacy Preservation

Text representations leak author demographics despite debiasing efforts. Li et al. (2018) show that author attributes affect model performance. Elazar and Goldberg (2018) demonstrate that gender and age can be recovered from learned embeddings.
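Leakage of this kind is commonly measured with a probe: a separate classifier trained to predict the protected attribute from frozen representations, where held-out accuracy well above a majority baseline suggests the attribute is still encoded. The sketch below is a minimal version of that idea, not the setup used in the cited papers. It assumes a plain logistic-regression probe trained by gradient descent, and the two-dimensional vectors and binary labels are invented toy data.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict_proba(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_probe(vectors, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with batch gradient descent."""
    dim = len(vectors[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = predict_proba(x, weights, bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(vectors, labels, weights, bias):
    """Share of examples whose protected attribute the probe recovers."""
    hits = sum(
        int((predict_proba(x, weights, bias) >= 0.5) == bool(y))
        for x, y in zip(vectors, labels)
    )
    return hits / len(labels)


# Hypothetical frozen representations; the first dimension happens to
# track the protected attribute (1 or 0), the second is noise.
train_vectors = [[1.0, 0.2], [0.8, -0.1], [0.9, 0.3],
                 [-1.0, 0.1], [-0.7, -0.2], [-0.9, 0.3]]
train_labels = [1, 1, 1, 0, 0, 0]
heldout_vectors = [[0.95, 0.0], [0.75, 0.2], [-0.8, 0.1], [-0.95, -0.1]]
heldout_labels = [1, 1, 0, 0]

weights, bias = train_probe(train_vectors, train_labels)
heldout_accuracy = probe_accuracy(heldout_vectors, heldout_labels, weights, bias)
positives = sum(heldout_labels)
majority_baseline = max(positives, len(heldout_labels) - positives) / len(heldout_labels)

print(f"probe accuracy: {heldout_accuracy:.2f}, majority baseline: {majority_baseline:.2f}")
```

A probe that fails to beat the baseline does not prove the attribute is gone, since a stronger probe might still recover it.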

Cross-Genre Generalization

Linguistic markers vary between platforms such as Facebook and Twitter, which reduces accuracy when models are transferred. Schwartz et al. (2013) report strong results on Facebook messages, while Nguyen et al. (2021) note that tweets call for dedicated life-stage annotation. Sap et al. (2014) develop predictive lexica, but genre shifts degrade their performance.

Essential Papers

1. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach

H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern et al. · 2013 · PLoS ONE · 1.7K citations

We analyzed 700 million words, phrases, and topic instances collected from the Facebook messages of 75,000 volunteers, who also took standard personality tests, and found striking variations in lan...

2. "How Old Do You Think I Am?" A Study of Language and Age in Twitter

Dong Nguyen, Rilana Gravel, Dolf Trieschnigg et al. · 2021 · Proceedings of the International AAAI Conference on Web and Social Media · 295 citations

In this paper we focus on the connection between age and language use, exploring age prediction of Twitter users based on their tweets. We discuss the construction of a fine-grained annotation effo...

3. Adversarial Removal of Demographic Attributes from Text Data

Yanai Elazar, Yoav Goldberg · 2018 · 243 citations

Recent advances in Representation Learning and Adversarial Training seem to succeed in removing unwanted features from the learned representation. We show that demographic information of authors is...

4. Developing Age and Gender Predictive Lexica over Social Media

Maarten Sap, Gregory Park, Johannes C. Eichstaedt et al. · 2014 · 236 citations

Maarten Sap, Gregory Park, Johannes Eichstaedt, Margaret Kern, David Stillwell, Michal Kosinski, Lyle Ungar, Hansen Andrew Schwartz. Proceedings of the 2014 Conference on Empirical Methods in Natur...

5. Computational Sociolinguistics: A Survey

Dong Nguyen, A. Seza Doğruöz, Carolyn Penstein Rosé et al. · 2016 · Computational Linguistics · 219 citations

Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimens...

6. Computational personality recognition in social media

Golnoosh Farnadi, Geetha Sitaraman, Shanu Sushmita et al. · 2016 · User Modeling and User-Adapted Interaction · 211 citations

7. Workshop on Computational Personality Recognition: Shared Task

Fabio Celli, Fabio Pianesi, David Stillwell et al. · 2021 · Proceedings of the International AAAI Conference on Web and Social Media · 159 citations

In the Workshop on Computational Personality Recognition (Shared Task), we released two datasets, varying in size and genre, annotated with gold standard personality labels. This allowed participan...

Reading Guide

Foundational Papers

Read Schwartz et al. (2013) first for the open-vocabulary baseline on 75,000 Facebook users, then Sap et al. (2014) for the lexicon methods that build on it.

Recent Advances

Study Nguyen et al. (2021) for Twitter age annotation advances; Elazar and Goldberg (2018) for adversarial privacy techniques.

Core Methods

Core techniques: open-vocabulary topic models (Schwartz et al., 2013); predictive lexica (Sap et al., 2014); adversarial training (Elazar and Goldberg, 2018); fine-grained annotation (Nguyen et al., 2021).
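To make the predictive-lexicon technique concrete, here is a minimal sketch of how a weighted lexicon can be applied to a text: each lexicon word contributes its weight times its relative frequency in the document, and the contributions are summed. The word weights and the intercept below are invented for illustration and are not taken from the lexica of Sap et al. (2014); the whitespace tokenizer and the inclusion of an intercept term are also assumptions.

```python
from collections import Counter


def lexicon_score(text, weights, intercept=0.0):
    """Score a text as intercept + sum(weight * relative frequency of word)."""
    tokens = text.lower().split()
    if not tokens:
        return intercept
    counts = Counter(tokens)
    total = len(tokens)
    return intercept + sum(
        weight * counts[word] / total
        for word, weight in weights.items()
        if word in counts
    )


# Hypothetical age lexicon: negative weights lean younger, positive lean older.
toy_age_weights = {
    "homework": -12.0,
    "school": -9.0,
    "mortgage": 30.0,
    "work": 15.0,
}
toy_intercept = 25.0

younger_text = "school homework tomorrow school"
older_text = "work mortgage paperwork today"

younger_score = lexicon_score(younger_text, toy_age_weights, toy_intercept)
older_score = lexicon_score(older_text, toy_age_weights, toy_intercept)

print(f"younger-leaning text: {younger_score:.2f}")
print(f"older-leaning text:   {older_score:.2f}")
```

Because scoring needs only a word-to-weight table, a lexicon of this shape is easy to share and apply to new text, but it inherits the vocabulary of the genre it was fitted on.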

How PapersFlow Helps You Research Demographic Profiling from Text

Discover & Search

Research Agent uses searchPapers and exaSearch to find core papers such as Schwartz et al. (2013), with 1,701 citations, on personality and age in Facebook language. citationGraph reveals connections from Sap et al. (2014) to the age-prediction work of Nguyen et al. (2021). findSimilarPapers expands the set to privacy work such as Elazar and Goldberg (2018).

Analyze & Verify

Analysis Agent applies readPaperContent to extract linguistic features from Schwartz et al. (2013), then uses verifyResponse with CoVe to check claims against Nguyen et al. (2021). runPythonAnalysis replicates lexicon correlations using pandas on provided datasets, with GRADE scoring of evidence strength for the bias claims in Elazar and Goldberg (2018). Statistical verification checks reported age-prediction accuracies.

Synthesize & Write

Synthesis Agent detects gaps in cross-genre generalization between Sap et al. (2014) lexica and Twitter studies, flagging contradictions in bias removal (Elazar and Goldberg, 2018). Writing Agent uses latexEditText for methods sections, latexSyncCitations for 10+ references, and latexCompile for full reports. exportMermaid visualizes citation flows from foundational to recent works.

Use Cases

"Reproduce age-gender lexica correlation stats from Sap et al. 2014 on social media data"

Research Agent → searchPapers('Sap 2014 lexica') → Analysis Agent → readPaperContent → runPythonAnalysis(pandas correlation on extracted features) → matplotlib plot of accuracies.

"Write LaTeX review of demographic biases in profiling models citing Elazar Goldberg 2018"

Synthesis Agent → gap detection(Elazar Goldberg 2018, Li 2018) → Writing Agent → latexEditText(draft section) → latexSyncCitations(10 papers) → latexCompile(PDF with bias diagram via latexGenerateFigure).

"Find GitHub repos implementing open-vocabulary approach from Schwartz 2013"

Research Agent → searchPapers('Schwartz 2013 open-vocabulary') → paperExtractUrls → paperFindGithubRepo → Code Discovery → githubRepoInspect(code for personality prediction) → exportCsv(features list).

Automated Workflows

Deep Research workflow conducts systematic review: searchPapers(50+ on 'demographic profiling text') → citationGraph → structured report with GRADE scores on Schwartz et al. (2013). DeepScan applies 7-step analysis: readPaperContent(Nguyen 2021) → CoVe verification → runPythonAnalysis(bias stats). Theorizer generates hypotheses on privacy from Elazar and Goldberg (2018) patterns across Sap et al. (2014).

Frequently Asked Questions

What is Demographic Profiling from Text?

It predicts age, gender, and personality from linguistic cues in text using supervised models and topic analysis (Schwartz et al., 2013).

What are key methods?

Open-vocabulary analysis on 700M Facebook words (Schwartz et al., 2013); predictive lexica from social media (Sap et al., 2014); adversarial debiasing (Elazar and Goldberg, 2018).

What are key papers?

Schwartz et al. (2013, 1701 citations) on personality/gender/age; Sap et al. (2014, 236 citations) on lexica; Nguyen et al. (2021, 295 citations) on Twitter age.

What are open problems?

Cross-platform generalization, full removal of demographic signals without accuracy loss, and the ethics of ascribing demographic variables to authors (Larson, 2017).
