Subtopic Deep Dive
Corpus Linguistics in Terminology
Research Guide
What is Corpus Linguistics in Terminology?
Corpus Linguistics in Terminology applies corpus-based statistical and distributional methods to extract, validate, and analyze terms from domain-specific corpora, focusing on neologisms, polysemy, and term variation.
Researchers use methods like C-value/NC-value for multi-word term extraction from special language corpora (Frantzi and Ananiadou, 1999, 227 citations). FrameNet leverages corpus evidence for semantic and syntactic frame analysis (Baker et al., 1998, 2556 citations). Studies address term normalization across medical and agricultural domains (Jacquemin, 1999, 111 citations).
Why It Matters
Corpus-driven term extraction builds empirical terminological resources for machine translation and knowledge bases, as shown by Frantzi and Ananiadou's (1999) domain-independent C-value/NC-value method applied to technical corpora. FrameNet's corpus-based lexicography supports NLP applications in semantic role labeling (Baker et al., 1998). Jacquemin's (1999) model for syntagmatic and paradigmatic term variation enables normalization in bilingual terminography, improving cross-domain consistency.
Key Research Challenges
Multi-word Term Extraction
Identifying nested and discontinuous multi-word terms requires combining linguistic filters with frequency-based scores. Frantzi and Ananiadou (1999) address this via C-value/NC-value, but domain shifts reduce precision. Evaluation is further hampered by the lack of shared gold standards across languages.
Term Variation Normalization
Morphological, syntactic, and semantic variations complicate term matching in corpora. Jacquemin (1999) models five variation types in medical and agricultural texts, yet scaling to large corpora demands efficient parsing. Polysemy cannot be resolved without contextual information.
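To make the matching problem concrete, here is a toy Python sketch of variant conflation — a deliberately crude stand-in for Jacquemin's two-tier model, collapsing word-order and plural variants under a shared normalization key (the stopword list and the trailing-'s' stemming rule are illustrative assumptions, not part of the original method):

```python
def variant_key(term, stopwords=frozenset({"of", "the", "a", "an", "in", "for"})):
    """Map a term string to a crude normalization key.

    Toy illustration of term-variation conflation: lowercase, drop
    function words, strip a trailing plural 's', and sort the remaining
    content words so that word-order (syntactic) variants collapse
    together. Jacquemin's (1999) model is far richer; this only hints
    at the idea.
    """
    words = [w.lower() for w in term.split() if w.lower() not in stopwords]
    stems = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return tuple(sorted(stems))

# "treatment of cancers" and "cancer treatment" share a key:
print(variant_key("treatment of cancers") == variant_key("cancer treatment"))
```

A real normalizer would use lemmatization and parse-based variation rules instead of these string heuristics, but the key-based grouping pattern is the same.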
Domain Adaptation of Methods
Statistical methods tuned on one domain underperform in others due to vocabulary shifts. Baker et al.'s (1998) FrameNet relies on annotated English corpora, limiting transfer. Hockenmaier and Steedman's (2007) CCGbank conversion highlights parsing challenges for dependency extraction.
Essential Papers
The Berkeley FrameNet Project
Collin F. Baker, Charles J. Fillmore, John B. Lowe · 1998 · 2.6K citations
FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a)...
CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank
Julia Hockenmaier, Mark Steedman · 2007 · Computational Linguistics · 386 citations
This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word-word dependencies. Th...
The C-value/NC-value domain-independent method for multi-word term extraction
Katerina T. Frantzi, Sophia Ananiadou · 1999 · Journal of Natural Language Processing · 227 citations
In this paper we present a domain-independent method for the automatic extraction of multi-word (technical) terms, from machine-readable special language corpora. The method, (C-value/NC-value), co...
Differential Object Marking in Spanish: state of the art
Antonio Fábregas · 2013 · Borealis – An International Journal of Hispanic Linguistics · 145 citations
Manners of human gait: a crosslinguistic event-naming study
Dan I. Slobin, Iraide Ibarretxe‐Antuñano, Anetta Kopecka et al. · 2014 · Cognitive Linguistics · 128 citations
Abstract Crosslinguistic studies of expressions of motion events have found that Talmy's binary typology of verb-framed and satellite-framed languages is reflected in language use. In particular, M...
Syntagmatic and paradigmatic representations of term variation
Christian Jacquemin · 1999 · 111 citations
A two-tier model for the description of morphological, syntactic and semantic variations of multi-word terms is presented. It is applied to term normalization of French and English corpora in the m...
"I don't believe in word senses"
Adam Kilgarriff · 1997 · arXiv (Cornell University) · 103 citations
Word sense disambiguation assumes word senses. Within the lexicography and linguistics literature, they are known to be very slippery entities. The paper looks at problems with existing accounts of...
Reading Guide
Foundational Papers
Start with Baker et al. (1998, 2556 citations) for corpus-based frame semantics as an empirical base; Frantzi and Ananiadou (1999, 227 citations) for the C-value/NC-value extraction standard; Jacquemin (1999, 111 citations) for variation models.
Recent Advances
Hoek et al. (2017, 98 citations) on coherence in parallel corpora; Slobin et al. (2014, 128 citations) on crosslinguistic event naming; Fábregas (2013, 145 citations) on Spanish object marking.
Core Methods
C-value/NC-value (Frantzi and Ananiadou, 1999) for term scoring; FrameNet annotation (Baker et al., 1998); CCG derivations (Hockenmaier and Steedman, 2007); variation normalization (Jacquemin, 1999).
How PapersFlow Helps You Research Corpus Linguistics in Terminology
Discover & Search
Research Agent uses searchPapers on 'C-value NC-value term extraction' to retrieve Frantzi and Ananiadou (1999), then citationGraph reveals 227 citing papers, and findSimilarPapers uncovers Jacquemin (1999) for variation models.
Analyze & Verify
Analysis Agent applies readPaperContent to extract C-value formulas from Frantzi and Ananiadou (1999), verifies term frequency stats via runPythonAnalysis on corpus samples with pandas, and grades extraction-precision claims with GRADE alongside CoVe (chain-of-verification) checks.
Synthesize & Write
Synthesis Agent detects gaps in term variation handling beyond Jacquemin (1999), flags contradictions in polysemy approaches from Kilgarriff (1997), and Writing Agent employs latexEditText for term tables, latexSyncCitations for Baker et al. (1998), and latexCompile for full reports.
Use Cases
"Reimplement C-value/NC-value on medical corpus sample for neologism detection"
Research Agent → searchPapers 'Frantzi Ananiadou 1999' → Analysis Agent → readPaperContent + runPythonAnalysis (pandas frequency counts, NumPy log scoring) → CSV export of top terms.
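As a flavor of what such a reimplementation involves, here is a minimal Python sketch of the C-value score from Frantzi and Ananiadou (1999) — stdlib only rather than pandas/NumPy, and assuming candidate terms have already passed a linguistic (POS-pattern) filter; the example frequencies are invented:

```python
import math
from collections import defaultdict

def c_value(candidates):
    """Compute C-value scores for multi-word term candidates.

    candidates: dict mapping a term (tuple of words) to its corpus frequency.
    Returns a dict mapping each term to its C-value. Follows the formulation
    in Frantzi and Ananiadou (1999): log2(term length) times raw frequency,
    discounted for candidates nested inside longer candidate terms.
    """
    # For each candidate, collect the longer candidates that contain it.
    containers = defaultdict(list)
    terms = list(candidates)
    for longer in terms:
        n = len(longer)
        for shorter in terms:
            m = len(shorter)
            if m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1)):
                containers[shorter].append(longer)

    scores = {}
    for term, freq in candidates.items():
        weight = math.log2(len(term))
        nests = containers[term]
        if not nests:
            scores[term] = weight * freq
        else:
            # Discount by the mean frequency of the containing candidates.
            nested_freq = sum(candidates[t] for t in nests)
            scores[term] = weight * (freq - nested_freq / len(nests))
    return scores

# Invented toy frequencies for illustration:
scores = c_value({
    ("real", "time"): 5,
    ("real", "time", "clock"): 3,
    ("real", "time", "system"): 2,
})
```

NC-value then re-ranks these scores using context words around the top candidates; extending the sketch with pandas frequency counts and a CSV export, as in the workflow above, is straightforward.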
"Draft LaTeX section comparing FrameNet and CCGbank for term frames"
Research Agent → citationGraph 'Baker 1998' → Synthesis Agent → gap detection → Writing Agent → latexEditText (frame tables) → latexSyncCitations → latexCompile (PDF with diagrams).
"Find GitHub repos implementing term variation normalization from Jacquemin"
Research Agent → searchPapers 'Jacquemin 1999 term variation' → Code Discovery → paperExtractUrls → paperFindGithubRepo → githubRepoInspect (code for syntagmatic parsing).
Automated Workflows
Deep Research workflow scans 50+ papers on 'corpus term extraction' via searchPapers → citationGraph → structured report on C-value evolutions (Frantzi and Ananiadou, 1999). DeepScan applies 7-step analysis with CoVe checkpoints to verify Jacquemin (1999) variation types on custom corpora. Theorizer generates hypotheses on polysemy integration from Kilgarriff (1997) and FrameNet data.
Frequently Asked Questions
What defines Corpus Linguistics in Terminology?
It uses corpus-based methods to extract and analyze terms, focusing on multi-word units, variations, and domain specificity via statistical measures like C-value (Frantzi and Ananiadou, 1999).
What are key methods?
C-value/NC-value extracts multi-word terms domain-independently (Frantzi and Ananiadou, 1999); FrameNet builds frames from corpus evidence (Baker et al., 1998); Jacquemin's (1999) model handles term variation.
What are key papers?
Baker et al. (1998, 2556 citations) on FrameNet; Frantzi and Ananiadou (1999, 227 citations) on C-value/NC-value; Jacquemin (1999, 111 citations) on term variation; Hockenmaier and Steedman (2007, 386 citations) on CCGbank.
What open problems exist?
Scaling variation normalization to low-resource languages; integrating polysemy resolution (Kilgarriff, 1997); adapting extraction to noisy social media corpora without gold standards.
Research linguistics and terminology studies with AI
PapersFlow provides specialized AI tools for Arts and Humanities researchers. Here are the most relevant for this topic:
AI Literature Review
Automate paper discovery and synthesis across 474M+ papers
AI Academic Writing
Write research papers with AI assistance and LaTeX support
Citation Manager
Organize references with Zotero sync and smart tagging
See how researchers in Arts & Humanities use PapersFlow
Field-specific workflows, example queries, and use cases.
Start Researching Corpus Linguistics in Terminology with AI
Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.
See how PapersFlow works for Arts and Humanities researchers