Subtopic Deep Dive
Corpus Linguistics in Terminology
Research Guide
What is Corpus Linguistics in Terminology?
Corpus Linguistics in Terminology applies corpus-based statistical and distributional methods to extract, validate, and analyze terms from domain-specific corpora, focusing on neologisms, polysemy, and term variation.
Researchers use methods like C-value/NC-value for multi-word term extraction from special language corpora (Frantzi and Ananiadou, 1999, 227 citations). FrameNet leverages corpus evidence for semantic and syntactic frame analysis (Baker et al., 1998, 2556 citations). Studies address term normalization across medical and agricultural domains (Jacquemin, 1999, 111 citations).
Why It Matters
Corpus-driven term extraction builds empirical terminological resources for machine translation and knowledge bases, as shown by Frantzi and Ananiadou's (1999) domain-independent C-value/NC-value method applied to technical corpora. FrameNet's corpus-based lexicography supports NLP applications in semantic role labeling (Baker et al., 1998). Jacquemin's (1999) model for syntagmatic and paradigmatic term variation enables normalization in bilingual terminography, improving cross-domain consistency.
Key Research Challenges
Multi-word Term Extraction
Identifying nested and discontinuous multi-word terms requires combining linguistic filters with frequency-based scores. Frantzi and Ananiadou (1999) address this via C-value/NC-value, but domain shifts reduce precision. Evaluation is further hampered by the lack of shared gold standards across languages.
Term Variation Normalization
Morphological, syntactic, and semantic variations complicate term matching in corpora. Jacquemin (1999) models five variation types in medical and agricultural texts, yet scaling to large corpora demands efficient parsing. Polysemy cannot be resolved without contextual information.
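To make the matching problem concrete, here is a toy Python sketch of variant conflation — a deliberately crude stand-in for Jacquemin's two-tier model, collapsing word-order and plural variants under a shared normalization key (the stopword list and the trailing-'s' stemming rule are illustrative assumptions, not part of the original method):

```python
def variant_key(term, stopwords=frozenset({"of", "the", "a", "an", "in", "for"})):
    """Map a term string to a crude normalization key.

    Toy illustration of term-variation conflation: lowercase, drop
    function words, strip a trailing plural 's', and sort the remaining
    content words so that word-order (syntactic) variants collapse
    together. Jacquemin's (1999) model is far richer; this only hints
    at the idea.
    """
    words = [w.lower() for w in term.split() if w.lower() not in stopwords]
    stems = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return tuple(sorted(stems))

# "treatment of cancers" and "cancer treatment" share a key:
print(variant_key("treatment of cancers") == variant_key("cancer treatment"))
```

A real normalizer would use lemmatization and parse-based variation rules instead of these string heuristics, but the key-based grouping pattern is the same.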
Domain Adaptation of Methods
Statistical methods tuned on one domain underperform in others due to vocabulary shifts. Baker et al.'s (1998) FrameNet relies on annotated English corpora, limiting transfer. Hockenmaier and Steedman's (2007) CCGbank conversion highlights parsing challenges for dependency extraction.
Essential Papers
The Berkeley FrameNet Project
Collin F. Baker, Charles J. Fillmore, John B. Lowe · 1998 · 2.6K citations
FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a)...
CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank
Julia Hockenmaier, Mark Steedman · 2007 · Computational Linguistics · 386 citations
This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word-word dependencies. Th...
The C-value/NC-value domain-independent method for multi-word term extraction
Katerina T. Frantzi, Sophia Ananiadou · 1999 · Journal of Natural Language Processing · 227 citations
In this paper we present a domain-independent method for the automatic extraction of multi-word (technical) terms, from machine-readable special language corpora. The method, (C-value/NC-value), co...
Differential Object Marking in Spanish: state of the art
Antonio Fábregas · 2013 · Borealis – An International Journal of Hispanic Linguistics · 145 citations
Manners of human gait: a crosslinguistic event-naming study
Dan I. Slobin, Iraide Ibarretxe‐Antuñano, Anetta Kopecka et al. · 2014 · Cognitive Linguistics · 128 citations
Abstract Crosslinguistic studies of expressions of motion events have found that Talmy's binary typology of verb-framed and satellite-framed languages is reflected in language use. In particular, M...
Syntagmatic and paradigmatic representations of term variation
Christian Jacquemin · 1999 · 111 citations
A two-tier model for the description of morphological, syntactic and semantic variations of multi-word terms is presented. It is applied to term normalization of French and English corpora in the m...
"I don't believe in word senses"
Adam Kilgarriff · 1997 · arXiv (Cornell University) · 103 citations
Word sense disambiguation assumes word senses. Within the lexicography and linguistics literature, they are known to be very slippery entities. The paper looks at problems with existing accounts of...
Reading Guide
Foundational Papers
Start with Baker et al. (1998, 2556 citations) for corpus-based frame semantics as an empirical base; Frantzi and Ananiadou (1999, 227 citations) for the C-value/NC-value extraction standard; Jacquemin (1999, 111 citations) for variation models.
Recent Advances
Hoek et al. (2017, 98 citations) on coherence in parallel corpora; Slobin et al. (2014, 128 citations) on crosslinguistic event naming; Fábregas (2013, 145 citations) on Spanish object marking.
Core Methods
C-value/NC-value (Frantzi and Ananiadou, 1999) for term scoring; FrameNet annotation (Baker et al., 1998); CCG derivations (Hockenmaier and Steedman, 2007); variation normalization (Jacquemin, 1999).
How PapersFlow Helps You Research Corpus Linguistics in Terminology
Discover & Search
Research Agent uses searchPapers on 'C-value NC-value term extraction' to retrieve Frantzi and Ananiadou (1999), then citationGraph reveals 227 citing papers, and findSimilarPapers uncovers Jacquemin (1999) for variation models.
Analyze & Verify
Analysis Agent applies readPaperContent to extract C-value formulas from Frantzi and Ananiadou (1999), verifies term frequency stats via runPythonAnalysis on corpus samples with pandas, and grades extraction-precision claims with GRADE alongside CoVe (chain-of-verification) checks.
Synthesize & Write
Synthesis Agent detects gaps in term variation handling beyond Jacquemin (1999), flags contradictions in polysemy approaches from Kilgarriff (1997), and Writing Agent employs latexEditText for term tables, latexSyncCitations for Baker et al. (1998), and latexCompile for full reports.
Use Cases
"Reimplement C-value/NC-value on medical corpus sample for neologism detection"
Research Agent → searchPapers 'Frantzi Ananiadou 1999' → Analysis Agent → readPaperContent + runPythonAnalysis (pandas frequency counts, NumPy log scoring) → CSV export of top terms.
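As a flavor of what such a reimplementation involves, here is a minimal Python sketch of the C-value score from Frantzi and Ananiadou (1999) — stdlib only rather than pandas/NumPy, and assuming candidate terms have already passed a linguistic (POS-pattern) filter; the example frequencies are invented:

```python
import math
from collections import defaultdict

def c_value(candidates):
    """Compute C-value scores for multi-word term candidates.

    candidates: dict mapping a term (tuple of words) to its corpus frequency.
    Returns a dict mapping each term to its C-value. Follows the formulation
    in Frantzi and Ananiadou (1999): log2(term length) times raw frequency,
    discounted for candidates nested inside longer candidate terms.
    """
    # For each candidate, collect the longer candidates that contain it.
    containers = defaultdict(list)
    terms = list(candidates)
    for longer in terms:
        n = len(longer)
        for shorter in terms:
            m = len(shorter)
            if m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1)):
                containers[shorter].append(longer)

    scores = {}
    for term, freq in candidates.items():
        weight = math.log2(len(term))
        nests = containers[term]
        if not nests:
            scores[term] = weight * freq
        else:
            # Discount by the mean frequency of the containing candidates.
            nested_freq = sum(candidates[t] for t in nests)
            scores[term] = weight * (freq - nested_freq / len(nests))
    return scores

# Invented toy frequencies for illustration:
scores = c_value({
    ("real", "time"): 5,
    ("real", "time", "clock"): 3,
    ("real", "time", "system"): 2,
})
```

NC-value then re-ranks these scores using context words around the top candidates; extending the sketch with pandas frequency counts and a CSV export, as in the workflow above, is straightforward.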
"Draft LaTeX section comparing FrameNet and CCGbank for term frames"
Research Agent → citationGraph 'Baker 1998' → Synthesis Agent → gap detection → Writing Agent → latexEditText (frame tables) → latexSyncCitations → latexCompile (PDF with diagrams).
"Find GitHub repos implementing term variation normalization from Jacquemin"
Research Agent → searchPapers 'Jacquemin 1999 term variation' → Code Discovery → paperExtractUrls → paperFindGithubRepo → githubRepoInspect (code for syntagmatic parsing).
Automated Workflows
Deep Research workflow scans 50+ papers on 'corpus term extraction' via searchPapers → citationGraph → structured report on C-value evolutions (Frantzi and Ananiadou, 1999). DeepScan applies 7-step analysis with CoVe checkpoints to verify Jacquemin (1999) variation types on custom corpora. Theorizer generates hypotheses on polysemy integration from Kilgarriff (1997) and FrameNet data.
Frequently Asked Questions
What defines Corpus Linguistics in Terminology?
It uses corpus-based methods to extract and analyze terms, focusing on multi-word units, variations, and domain specificity via statistical measures like C-value (Frantzi and Ananiadou, 1999).
What are key methods?
C-value/NC-value extracts multi-word terms domain-independently (Frantzi and Ananiadou, 1999); FrameNet builds frames from corpus evidence (Baker et al., 1998); Jacquemin's (1999) model handles term variation.
What are key papers?
Baker et al. (1998, 2556 citations) on FrameNet; Frantzi and Ananiadou (1999, 227 citations) on C-value/NC-value; Jacquemin (1999, 111 citations) on term variation; Hockenmaier and Steedman (2007, 386 citations) on CCGbank.
What open problems exist?
Scaling variation normalization to low-resource languages; integrating polysemy resolution (Kilgarriff, 1997); adapting extraction to noisy social media corpora without gold standards.
Research linguistics and terminology studies with AI
PapersFlow provides specialized AI tools for Arts and Humanities researchers. Here are the most relevant for this topic:
AI Literature Review
Automate paper discovery and synthesis across 474M+ papers
AI Academic Writing
Write research papers with AI assistance and LaTeX support
Citation Manager
Organize references with Zotero sync and smart tagging
See how researchers in Arts & Humanities use PapersFlow
Field-specific workflows, example queries, and use cases.
Start Researching Corpus Linguistics in Terminology with AI
Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.
See how PapersFlow works for Arts and Humanities researchers