Subtopic Deep Dive
Corpus-Based Lexicography
Research Guide
What is Corpus-Based Lexicography?
Corpus-Based Lexicography uses large language corpora to empirically derive dictionary entries, definitions, and usage patterns.
Researchers analyze corpora for word frequency, collocations, and sense distributions to inform lexicographic decisions (Tognini-Bonelli, 2001). Key tools include Sketch Engine for corpus querying (Kilgarriff et al., 2014). More than ten papers in this list demonstrate applications; Miller's WordNet alone has been cited 13,914 times (Miller, 1995).
Why It Matters
Corpus-based methods enable dictionaries such as the Oxford Dictionary of English to reflect actual usage drawn from massive corpora (Pearsall, 2010). Academic word lists derived from corpora improve language teaching, as in Coxhead's list compiled from 3.5 million words (Coxhead, 2000). FrameNet applies corpus evidence to frame semantics for precise definitions (Baker et al., 1998). The English Lexicon Project provides corpus-derived lexical decision data for psycholinguistic validation (Balota et al., 2007).
Key Research Challenges
Sense Disambiguation
Distinguishing word senses in corpora requires context analysis beyond frequency counts. FrameNet uses corpus evidence but struggles with polysemy (Baker et al., 1998). Tognini-Bonelli notes the methodological shifts required for systematic corpus reading (Tognini-Bonelli, 2001).
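The core idea of context-based sense selection can be sketched with a simplified Lesk-style overlap measure: pick the sense whose gloss shares the most words with the surrounding context. The two-sense inventory for "bank" below is a hypothetical toy, not WordNet data; real disambiguation needs far richer context modeling.

```python
# Toy sense inventory (hypothetical glosses, not taken from WordNet).
SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land alongside a river or stream",
}

def disambiguate(context: str) -> str:
    """Return the sense whose gloss shares the most words with the
    context window (a simplified Lesk-style overlap measure)."""
    context_words = set(context.lower().split())

    def overlap(sense: str) -> int:
        gloss_words = set(SENSES[sense].split())
        return len(context_words & gloss_words)

    return max(SENSES, key=overlap)

print(disambiguate("she sat on the bank of the river watching the stream"))
# → bank/river
```

Gloss overlap is weak on short contexts with no shared vocabulary, which is exactly the polysemy problem noted above: frequency alone cannot separate the senses, and glosses only help when the context happens to echo their wording.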
Collocation Extraction
Identifying significant word combinations demands robust statistical measures. Sketch Engine provides the tools but requires large corpora for reliable results (Kilgarriff et al., 2014). The Longman Grammar highlights spoken-written differences that complicate extraction (Biber et al., 2000).
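One standard statistical measure for collocation strength is pointwise mutual information (PMI), which compares a bigram's observed frequency with what independent occurrence of its parts would predict. The sketch below runs on a toy corpus purely for illustration; as noted above, reliable scores need corpora orders of magnitude larger.

```python
import math
from collections import Counter

# Toy corpus; real collocation work needs millions of tokens
# (Kilgarriff et al., 2014).
tokens = ("strong tea tastes good strong coffee smells good "
          "he drank strong tea she drank weak tea").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1: str, w2: str) -> float:
    """Pointwise mutual information: log2( P(w1,w2) / (P(w1) * P(w2)) )."""
    p_joint = bigrams[(w1, w2)] / (n - 1)
    p1, p2 = unigrams[w1] / n, unigrams[w2] / n
    return math.log2(p_joint / (p1 * p2))

# Both bigrams score above zero, i.e. they co-occur more than chance predicts.
print(pmi("strong", "tea"), pmi("strong", "coffee"))
```

On tiny samples PMI notoriously inflates rare pairs, which is one reason tools like Sketch Engine use smoothed association scores over very large corpora.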
Frequency Representation
Corpus size biases frequency counts toward common usages, underrepresenting rare senses. Coxhead's academic list addressed range but did not fully address dispersion (Coxhead, 2000). Miller's WordNet supplements corpora with manually built synsets (Miller, 1995).
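The frequency-versus-range distinction can be made concrete: Coxhead's selection criteria required a word both to be frequent overall and to appear across multiple subcorpora. The three mini subcorpora below are hypothetical stand-ins for her discipline-based sections.

```python
from collections import Counter

# Hypothetical subcorpora standing in for discipline-based sections
# of an academic corpus (Coxhead, 2000).
subcorpora = {
    "law":      "the analysis of the contract data shows the analysis holds".split(),
    "science":  "the data analysis confirms the data support the model".split(),
    "commerce": "market data drive the analysis of the market trend".split(),
}

def freq_and_range(word: str) -> tuple[int, int]:
    """Total frequency plus range: the number of subcorpora the word
    occurs in. Coxhead (2000) required both high frequency and wide range."""
    freq = sum(Counter(toks)[word] for toks in subcorpora.values())
    rng = sum(1 for toks in subcorpora.values() if word in toks)
    return freq, rng

print(freq_and_range("analysis"))   # frequent AND dispersed: (4, 3)
print(freq_and_range("contract"))   # rare and confined to one subcorpus: (1, 1)
```

Range catches words like "contract" that are frequent only within one domain, but it still does not measure how evenly occurrences are spread inside each subcorpus, which is the dispersion gap noted above.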
Essential Papers
WordNet
George A. Miller · 1995 · Communications of the ACM · 13.9K citations
Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information...
Longman Grammar of Spoken and Written English
Kathleen M. Broussard, Douglas Biber, Stig Johansson et al. · 2000 · TESOL Quarterly · 8.2K citations
Introduction Since its publication in 1985, the outstanding 1,800-page Comprehensive Grammar of the English Language, by Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik, has been...
The English Lexicon Project
David A. Balota, Melvin J. Yap, Keith A. Hutchison et al. · 2007 · Behavior Research Methods · 2.7K citations
A New Academic Word List
Averil Coxhead · 2000 · TESOL Quarterly · 2.7K citations
This article describes the development and evaluation of a new academic word list (Coxhead, 1998), which was compiled from a corpus of 3.5 million running words of written academic text by examinin...
The Berkeley FrameNet Project
Collin F. Baker, Charles J. Fillmore, John B. Lowe · 1998 · 2.6K citations
FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a)...
The Encyclopedia of Applied Linguistics
Beverley Collins · 2012 · 2.0K citations
French connectives have been the object of detailed linguistic descriptions since the 1970s. Several of these early works have paved the way for the development of classical concepts in discourse an...
The Sketch Engine
Adam Kilgarriff, Vít Baisa, Jan Bušta et al. · 2014 · Lexicography · 1.9K citations
The Sketch Engine is a leading corpus tool, widely used in lexicography. Now, at 10 years old, it is mature software. The Sketch Engine website offers many ready-to-use corpora, and tools for users...
Reading Guide
Foundational Papers
Start with Miller (1995) WordNet for lexical semantics basics (13,914 citations), then Baker et al. (1998) FrameNet for corpus-driven frames, and Tognini-Bonelli (2001) for methodology.
Recent Advances
Study Kilgarriff et al. (2014) Sketch Engine for practical tools and Pearsall (2010) Oxford Dictionary for applied outcomes.
Core Methods
Core techniques: corpus query (Sketch Engine), frequency/range analysis (Coxhead, 2000), synset building (WordNet), frame annotation (FrameNet).
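The corpus-query technique listed first is typically delivered as a concordance: a Key Word In Context (KWIC) view of every hit. A minimal sketch of that view, the basic display behind tools like Sketch Engine, might look like this (the example corpus is invented):

```python
def kwic(tokens: list[str], keyword: str, width: int = 3) -> list[str]:
    """Key Word In Context: list every hit of `keyword` with up to
    `width` tokens of left and right context, aligned on the keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30} [{keyword}] {right}")
    return lines

corpus = "the new dictionary entry cites the corpus entry twice".split()
for line in kwic(corpus, "entry"):
    print(line)
```

Lexicographers scan such aligned lines to spot recurring patterns by eye; frequency, range, and collocation statistics like those above are then computed over the same token stream.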
How PapersFlow Helps You Research Corpus-Based Lexicography
Discover & Search
Research Agent uses searchPapers and exaSearch to find corpus tools like Sketch Engine (Kilgarriff et al., 2014), then citationGraph reveals connections to FrameNet (Baker et al., 1998) and WordNet (Miller, 1995), while findSimilarPapers uncovers related works like Coxhead (2000).
Analyze & Verify
Analysis Agent applies readPaperContent to extract corpus methods from Tognini-Bonelli (2001), verifies claims with CoVe against Biber et al. (2000), and runs PythonAnalysis for frequency stats from English Lexicon Project data (Balota et al., 2007) using pandas, with GRADE scoring evidence strength.
Synthesize & Write
Synthesis Agent detects gaps in sense coverage between WordNet and FrameNet, flags contradictions in usage frequencies; Writing Agent uses latexEditText for dictionary entry drafts, latexSyncCitations for Miller (1995), and latexCompile for publication-ready reports with exportMermaid for collocation graphs.
Use Cases
"Analyze frequency distributions in Coxhead's Academic Word List corpus data"
Research Agent → searchPapers('Coxhead 2000') → Analysis Agent → runPythonAnalysis(pandas frequency plot from extracted data) → matplotlib visualization of word ranges.
"Draft LaTeX entry for 'frame' using FrameNet corpus examples"
Research Agent → readPaperContent(Baker et al. 1998) → Synthesis Agent → gap detection → Writing Agent → latexEditText + latexSyncCitations + latexCompile → camera-ready dictionary section.
"Find code for Sketch Engine-like collocation tools from papers"
Research Agent → paperExtractUrls(Kilgarriff et al. 2014) → Code Discovery → paperFindGithubRepo → githubRepoInspect → working collocation extractor scripts.
Automated Workflows
The Deep Research workflow scans 50+ papers, such as Miller (1995) and Kilgarriff et al. (2014), for a systematic review of corpus tools, producing a structured corpus-lexicography report. DeepScan applies a 7-step analysis with CoVe checkpoints to verify Sketch Engine claims against Biber et al. (2000). Theorizer generates hypotheses on semantic change from corpus patterns in Traugott & Dasher (2001).
Frequently Asked Questions
What defines Corpus-Based Lexicography?
It applies empirical corpus analysis to dictionary making, replacing intuition with data on usage, frequency, and collocations (Tognini-Bonelli, 2001).
What are key methods?
Methods include frequency analysis (Coxhead, 2000), collocation extraction via Sketch Engine (Kilgarriff et al., 2014), and frame semantics from corpora (Baker et al., 1998).
What are key papers?
Foundational: WordNet (Miller, 1995, 13,914 citations), FrameNet (Baker et al., 1998, 2,556 citations); Recent: Sketch Engine (Kilgarriff et al., 2014, 1,936 citations).
What are open problems?
Challenges persist in rare sense detection and cross-corpus comparability, as corpora bias toward frequent usages (Balota et al., 2007; Tognini-Bonelli, 2001).
Research Lexicography and Language Studies with AI
PapersFlow provides specialized AI tools for Arts and Humanities researchers. Here are the most relevant for this topic:
AI Literature Review
Automate paper discovery and synthesis across 474M+ papers
AI Academic Writing
Write research papers with AI assistance and LaTeX support
Citation Manager
Organize references with Zotero sync and smart tagging
See how researchers in Arts & Humanities use PapersFlow
Field-specific workflows, example queries, and use cases.
Start Researching Corpus-Based Lexicography with AI
Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.
See how PapersFlow works for Arts and Humanities researchers
Part of the Lexicography and Language Studies Research Guide