Subtopic Deep Dive

Multilingual Hate Speech Detection
Research Guide

What is Multilingual Hate Speech Detection?

Multilingual Hate Speech Detection develops language-agnostic and multilingual models for identifying hate speech across diverse languages and scripts on social media.

Researchers employ transfer learning, cross-lingual embeddings, and multilingual benchmarks to detect hate speech beyond English. SemEval-2019 Task 5 by Basile et al. (2019, 839 citations) introduced multilingual detection of hate against immigrants and women in English and Spanish Twitter data. SemEval-2020 Task 12 by Zampieri et al. (2020, 378 citations) extended offensive language identification to multiple languages using the OLID schema.

14 Curated Papers · 3 Key Challenges

Why It Matters

Multilingual detection enables equitable content moderation on global platforms where non-English hate speech proliferates. Basile et al. (2019) showed that multilingual benchmarks improve model robustness across languages, aiding platforms like Twitter in diverse regions. Zampieri et al. (2020) demonstrated that hierarchical classification reduces false positives in low-resource languages, supporting safer online spaces worldwide. Poletto et al. (2020, 362 citations) reviewed benchmarks and revealed gaps in non-English corpora, motivating more inclusive AI safety work.

Key Research Challenges

Cross-lingual Transfer Gaps

Models trained on English data underperform on low-resource languages due to embedding mismatches. MacAvaney et al. (2019, 541 citations) identified subtleties in multilingual language use as key hurdles. Transfer learning struggles with script diversity and cultural nuances.

Scarce Non-English Datasets

Annotated hate speech corpora are predominantly English, limiting multilingual training. Poletto et al. (2020) systematically reviewed resources, finding fewer benchmarks for languages like Arabic or Hindi. Vidgen and Derczynski (2020, 269 citations) warned of 'garbage in, garbage out' from imbalanced data.

Ambiguous Multilingual Offensiveness

Offensive language varies culturally, complicating universal detection. Zampieri et al. (2019, 673 citations) highlighted the need for a hierarchical taxonomy in OffensEval to capture multilingual subtlety. Schmidt and Wiegand (2017, 1340 citations) surveyed detection challenges that are amplified across languages.

Essential Papers

1.

A Survey on Hate Speech Detection using Natural Language Processing

Anna Schmidt, Michael Wiegand · 2017 · 1.3K citations

This paper presents a survey on hate speech detection. Given the steadily growing body of social media content, the amount of online hate speech is also increasing. Due to the massive scale of the ...

2.

SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter

Valerio Basile, Cristina Bosco, Elisabetta Fersini et al. · 2019 · 839 citations

The paper describes the organization of the SemEval 2019 Task 5 about the detection of hate speech against immigrants and women in Spanish and English messages extracted from Twitter. The task is o...

3.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

Marcos Zampieri, Shervin Malmasi, Preslav Nakov et al. · 2019 · 673 citations

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval). The task was based on a new dataset, the Offensi...

4.

Hate speech detection: Challenges and solutions

Sean MacAvaney, Hao-Ren Yao, Eugene Yang et al. · 2019 · PLoS ONE · 541 citations

As online content continues to grow, so does the spread of hate speech. We identify and examine challenges faced by online automatic approaches for hate speech detection in text. Among these diffic...

5.

Taxonomy of Risks posed by Language Models

Laura Weidinger, Jonathan Uesato, Maribeth Rauh et al. · 2022 · 2022 ACM Conference on Fairness, Accountability, and Transparency · 482 citations

Responsible innovation on large-scale Language Models (LMs) requires foresight into and in-depth understanding of the risks these models may pose. This paper develops a comprehensive taxonomy o...

6.

SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

Marcos Zampieri, Preslav Nakov, Sara Rosenthal et al. · 2020 · 378 citations

We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The task involves three subtasks corresponding ...

7.

Resources and benchmark corpora for hate speech detection: a systematic review

Fabio Poletto, Valerio Basile, Manuela Sanguinetti et al. · 2020 · Language Resources and Evaluation · 362 citations

Hate Speech in social media is a complex phenomenon, whose detection has recently gained significant traction in the Natural Language Processing community, as attested by several recent re...

Reading Guide

Foundational Papers

Start with Schmidt and Wiegand (2017, 1340 citations) for the survey foundations of hate speech detection, then Basile et al. (2019, 839 citations), the first multilingual Twitter benchmark, to grasp how the task evolved.

Recent Advances

Study Zampieri et al. (2020, 378 citations) for the OffensEval 2020 multilingual extension and Poletto et al. (2020, 362 citations) for a systematic resource review that tracks dataset advances.

Core Methods

Core techniques include cross-lingual transfer learning, the OLID hierarchical taxonomy (Zampieri et al., 2019), and benchmark evaluation on the SemEval shared tasks.
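The OLID taxonomy mentioned above is a three-level cascade: Subtask A decides offensive vs. not, Subtask B decides targeted vs. untargeted, and Subtask C identifies the target type. A minimal sketch of that decision cascade follows; the keyword-based predictors are hypothetical stand-ins for trained classifiers, for illustration only:

```python
# Sketch of OLID's three-level hierarchical labeling (Zampieri et al., 2019).
# The predict_* callables are hypothetical stand-ins for trained classifiers.

def classify_olid(text, predict_offensive, predict_targeted, predict_target):
    """Return the OLID label path for a post, stopping early when a
    lower level does not apply (the hierarchy is a cascade)."""
    if not predict_offensive(text):              # Subtask A: OFF vs. NOT
        return ("NOT",)
    if not predict_targeted(text):               # Subtask B: TIN vs. UNT
        return ("OFF", "UNT")
    return ("OFF", "TIN", predict_target(text))  # Subtask C: IND/GRP/OTH


# Toy keyword-based stand-ins, not real models.
offensive = lambda t: "idiot" in t.lower()
targeted = lambda t: "you" in t.lower()
target = lambda t: "IND"

print(classify_olid("You are an idiot", offensive, targeted, target))
# -> ('OFF', 'TIN', 'IND')
```

The early returns mirror the annotation scheme: a post labeled NOT at Subtask A receives no Subtask B or C label at all.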

How PapersFlow Helps You Research Multilingual Hate Speech Detection

Discover & Search

Research Agent uses searchPapers and exaSearch to find multilingual benchmarks like SemEval-2019 Task 5 (Basile et al., 2019), then citationGraph traces the 378+ works citing Zampieri et al. (2020). findSimilarPapers clusters cross-lingual studies from 250M+ OpenAlex papers.

Analyze & Verify

Analysis Agent applies readPaperContent to Basile et al. (2019) to extract Spanish-English F1 scores, verifyResponse with CoVe checks cross-lingual claims, and runPythonAnalysis computes dataset imbalances via pandas on OLID schema stats. GRADE grading scores the strength of evidence behind low-resource language claims.
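The dataset-imbalance check described above boils down to per-class label proportions. A minimal standard-library stand-in for that computation (the labels below are toy examples, not real OLID statistics):

```python
from collections import Counter

def label_distribution(labels):
    """Return each label's share of the dataset,
    e.g. {'NOT': 0.75, 'OFF': 0.25}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy OLID-style Subtask A labels, for illustration only.
toy_labels = ["NOT", "NOT", "OFF", "NOT"]
print(label_distribution(toy_labels))  # {'NOT': 0.75, 'OFF': 0.25}
```

With pandas the same check is typically a one-liner over the label column using `value_counts(normalize=True)`.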

Synthesize & Write

Synthesis Agent detects gaps in non-English corpora from Poletto et al. (2020) and flags contradictions in transfer learning efficacy. Writing Agent uses latexEditText for benchmark tables, latexSyncCitations for 10+ papers, latexCompile for reports, and exportMermaid to diagram hierarchical taxonomies.

Use Cases

"Compare F1 scores of multilingual hate models on SemEval tasks"

Research Agent → searchPapers('SemEval multilingual hate') → Analysis Agent → runPythonAnalysis(pandas on extracted metrics) → CSV table of Basile (2019) vs Zampieri (2020) scores.
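The final step of the pipeline above is a CSV comparison table. A sketch of that output step using only the standard library; the macro-F1 values are placeholders to be filled from the extracted metrics, not the papers' reported results:

```python
import csv
import io

# Placeholder rows; macro_f1 values of 0.0 stand in for extracted metrics.
rows = [
    {"benchmark": "SemEval-2019 Task 5 (Basile et al.)",
     "language": "es", "macro_f1": 0.0},
    {"benchmark": "SemEval-2020 Task 12 (Zampieri et al.)",
     "language": "en", "macro_f1": 0.0},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["benchmark", "language", "macro_f1"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Writing to a `StringIO` buffer keeps the sketch self-contained; in practice the table would be written to a file for download.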

"Draft LaTeX section on cross-lingual hate detection challenges"

Synthesis Agent → gap detection (Poletto 2020) → Writing Agent → latexEditText('challenges') → latexSyncCitations(5 papers) → latexCompile → PDF with cited taxonomy diagram.

"Find GitHub repos for OffensEval multilingual datasets"

Research Agent → searchPapers('OffensEval 2020') → Code Discovery → paperExtractUrls → paperFindGithubRepo → githubRepoInspect → list of OLID dataset repos with inspection summaries.

Automated Workflows

Deep Research workflow scans 50+ papers via searchPapers on 'multilingual hate speech', structures report with SemEval benchmarks and citation graphs. DeepScan applies 7-step CoVe analysis to verify transfer learning claims in MacAvaney et al. (2019). Theorizer generates hypotheses on zero-shot detection from Zampieri et al. (2020) patterns.

Frequently Asked Questions

What defines Multilingual Hate Speech Detection?

It develops models for hate speech identification across languages using cross-lingual embeddings and benchmarks like SemEval-2019 Task 5 (Basile et al., 2019).

What are key methods in this subtopic?

Methods include multilingual BERT variants and hierarchical classification from OffensEval (Zampieri et al., 2019; 2020), trained on datasets like OLID.

What are influential papers?

Schmidt and Wiegand (2017, 1340 citations) survey foundations; Basile et al. (2019, 839 citations) and Zampieri et al. (2020, 378 citations) provide multilingual benchmarks.

What open problems exist?

Challenges include low-resource language scarcity (Poletto et al., 2020) and cultural ambiguity in offensiveness (MacAvaney et al., 2019; Vidgen and Derczynski, 2020).

Research Hate Speech and Cyberbullying Detection with AI

PapersFlow provides specialized AI tools for Computer Science researchers. Here are the most relevant for this topic:

See how researchers in Computer Science & AI use PapersFlow

Field-specific workflows, example queries, and use cases.

Computer Science & AI Guide

Start Researching Multilingual Hate Speech Detection with AI

Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.

See how PapersFlow works for Computer Science researchers