
Corpus-Based Lexicography
Research Guide

What is Corpus-Based Lexicography?

Corpus-Based Lexicography uses large language corpora to empirically derive dictionary entries, definitions, and usage patterns.

Researchers analyze corpora for word frequency, collocations, and sense distributions to inform lexicographic decisions (Tognini-Bonelli, 2001). Key tools include Sketch Engine for corpus querying (Kilgarriff et al., 2014). More than ten papers in the curated list below demonstrate these applications; Miller's WordNet alone has been cited 13,914 times (Miller, 1995).

15 curated papers · 3 key challenges

Why It Matters

Corpus-based methods enable dictionaries such as the Oxford Dictionary of English to reflect actual usage drawn from massive corpora (Pearsall, 2010). Academic word lists derived from corpora improve language teaching, as in Coxhead's list compiled from 3.5 million words (Coxhead, 2000). FrameNet applies corpus evidence to frame semantics for more precise definitions (Baker et al., 1998). The English Lexicon Project provides corpus-grounded lexical decision data for psycholinguistic validation (Balota et al., 2007).

Key Research Challenges

Sense Disambiguation

Distinguishing word senses in corpora requires context analysis that goes beyond frequency counts. FrameNet draws on corpus evidence but still struggles with polysemy (Baker et al., 1998), and Tognini-Bonelli argues that reading corpora demands a methodological shift away from introspection (Tognini-Bonelli, 2001).
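The kind of context analysis described here can be sketched with a toy Lesk-style disambiguator: pick the sense whose gloss overlaps most with the surrounding words. This is a minimal illustration, not FrameNet's or WordNet's actual machinery, and the glosses below are invented for the example.

```python
def simple_lesk(context_tokens, sense_glosses):
    """Pick the sense whose gloss shares the most words with the
    surrounding context, a toy version of the Lesk algorithm."""
    ctx = set(context_tokens)
    return max(sense_glosses,
               key=lambda s: len(ctx & set(sense_glosses[s].split())))

# Invented glosses for two senses of "bank" (not WordNet's entries).
glosses = {
    "bank/finance": "institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water",
}
context = "she sat on the grassy land beside the water".split()
sense = simple_lesk(context, glosses)  # -> "bank/river"
```

Even this crude overlap count resolves the example correctly; real systems add sense frequency priors and richer context features.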

Collocation Extraction

Identifying significant word combinations demands robust statistical measures. Sketch Engine provides the tools but requires large corpora for reliable results (Kilgarriff et al., 2014), and the Longman Grammar highlights spoken-written differences that complicate extraction (Biber et al., 2000).
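As a minimal sketch of the statistical measures involved, the snippet below scores adjacent word pairs by pointwise mutual information (PMI), a common first-pass collocation measure. It is an illustration on a toy corpus, not Sketch Engine's actual scoring.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2(p(x, y) / (p(x) * p(y))). Higher scores mark
    pairs that co-occur more often than chance predicts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue  # rare pairs give unstable PMI estimates
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus: "strong tea" recurs as a unit, so it should score highest.
corpus = ("strong tea is nice strong tea is cheap "
          "the day is long the road is long").split()
scores = pmi_bigrams(corpus)
best = max(scores, key=scores.get)  # -> ('strong', 'tea')
```

The `min_count` cutoff reflects the reliability point made above: PMI inflates scores for rare pairs, which is why real collocation work needs large corpora or more robust statistics such as log-likelihood.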

Frequency Representation

Corpus size biases frequency counts toward common usages, underrepresenting rare senses. Coxhead's academic list addressed range but did not fully account for dispersion (Coxhead, 2000), and Miller's WordNet supplements corpus evidence with manually built synsets (Miller, 1995).
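The frequency-versus-range distinction behind Coxhead's selection criteria can be shown in a small sketch. The subcorpora and words below are invented; the point is that a word frequent in only one text type gets a low range score.

```python
from collections import Counter

def freq_and_range(subcorpora):
    """For each word, return (total frequency, range), where range is
    the number of subcorpora the word occurs in. Combining both
    criteria filters out words that are frequent in a single text type."""
    total, rng = Counter(), Counter()
    for tokens in subcorpora:
        counts = Counter(tokens)
        total.update(counts)
        rng.update(counts.keys())
    return {w: (total[w], rng[w]) for w in total}

subcorpora = [
    "analysis of data shows analysis".split(),                # science
    "the analysis of the poem".split(),                       # humanities
    "careful analysis of zygote development zygote zygote".split(),  # biology
]
stats = freq_and_range(subcorpora)
# 'analysis' is frequent AND dispersed: (4, 3).
# 'zygote' is frequent overall but occurs in one subcorpus: (3, 1).
```

A raw frequency cutoff would admit both words; the range criterion keeps only the genuinely cross-disciplinary item.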

Essential Papers

1. WordNet

George A. Miller · 1995 · Communications of the ACM · 13.9K citations

Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information...

2. Longman Grammar of Spoken and Written English

Kathleen M. Broussard, Douglas Biber, Stig Johansson et al. · 2000 · TESOL Quarterly · 8.2K citations

Introduction Since its publication in 1985, the outstanding 1,800-page Comprehensive Grammar of the English Language, by Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik, has been...

3. The English Lexicon Project

David A. Balota, Melvin J. Yap, Keith A. Hutchison et al. · 2007 · Behavior Research Methods · 2.7K citations

4. A New Academic Word List

Averil Coxhead · 2000 · TESOL Quarterly · 2.7K citations

This article describes the development and evaluation of a new academic word list (Coxhead, 1998), which was compiled from a corpus of 3.5 million running words of written academic text by examinin...

5. The Berkeley FrameNet Project

Collin F. Baker, Charles J. Fillmore, John B. Lowe · 1998 · 2.6K citations

FrameNet is a three-year NSF-supported project in corpus-based computational lexicography, now in its second year (NSF IRI-9618838, "Tools for Lexicon Building"). The project's key features are (a)...

6. The Encyclopedia of Applied Linguistics

Beverley Collins · 2012 · 2.0K citations

French connectives have been the object of detailed linguistic descriptions since the 1970s. Several of these early works have paved the way for the development of classical concepts in discourse an...

7. The Sketch Engine

Adam Kilgarriff, Vít Baisa, Jan Bušta et al. · 2014 · Lexicography · 1.9K citations

The Sketch Engine is a leading corpus tool, widely used in lexicography. Now, at 10 years old, it is mature software. The Sketch Engine website offers many ready-to-use corpora, and tools for users...

Reading Guide

Foundational Papers

Start with Miller (1995) WordNet for lexical semantics basics (13,914 citations), then Baker et al. (1998) FrameNet for corpus-driven frames, and Tognini-Bonelli (2001) for methodology.

Recent Advances

Study Kilgarriff et al. (2014) Sketch Engine for practical tools and Pearsall (2010) Oxford Dictionary for applied outcomes.

Core Methods

Core techniques: corpus query (Sketch Engine), frequency/range analysis (Coxhead, 2000), synset building (WordNet), frame annotation (FrameNet).
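The corpus-query technique can be illustrated with a minimal keyword-in-context (KWIC) concordance, the basic operation behind corpus tools. This is a toy stand-in, not Sketch Engine's real query API.

```python
def kwic(tokens, keyword, width=3):
    """Minimal keyword-in-context concordance: for each occurrence of
    the keyword, return (left context, keyword, right context)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

tokens = "the lexicographer checked the corpus and the corpus confirmed it".split()
lines = kwic(tokens, "corpus", width=2)
# -> [('checked the', 'corpus', 'and the'),
#     ('and the', 'corpus', 'confirmed it')]
```

Lexicographers read such concordance lines to separate senses and spot recurring patterns; production tools add lemmatization, part-of-speech queries, and sorting by the right or left context.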

How PapersFlow Helps You Research Corpus-Based Lexicography

Discover & Search

Research Agent uses searchPapers and exaSearch to find corpus tools like Sketch Engine (Kilgarriff et al., 2014), then citationGraph reveals connections to FrameNet (Baker et al., 1998) and WordNet (Miller, 1995), while findSimilarPapers uncovers related works like Coxhead (2000).

Analyze & Verify

Analysis Agent applies readPaperContent to extract corpus methods from Tognini-Bonelli (2001), verifies claims with CoVe against Biber et al. (2000), and runs PythonAnalysis for frequency stats from English Lexicon Project data (Balota et al., 2007) using pandas, with GRADE scoring evidence strength.

Synthesize & Write

Synthesis Agent detects gaps in sense coverage between WordNet and FrameNet, flags contradictions in usage frequencies; Writing Agent uses latexEditText for dictionary entry drafts, latexSyncCitations for Miller (1995), and latexCompile for publication-ready reports with exportMermaid for collocation graphs.

Use Cases

"Analyze frequency distributions in Coxhead's Academic Word List corpus data"

Research Agent → searchPapers('Coxhead 2000') → Analysis Agent → runPythonAnalysis(pandas frequency plot from extracted data) → matplotlib visualization of word ranges.

"Draft LaTeX entry for 'frame' using FrameNet corpus examples"

Research Agent → readPaperContent(Baker et al. 1998) → Synthesis Agent → gap detection → Writing Agent → latexEditText + latexSyncCitations + latexCompile → camera-ready dictionary section.

"Find code for Sketch Engine-like collocation tools from papers"

Research Agent → paperExtractUrls(Kilgarriff et al. 2014) → Code Discovery → paperFindGithubRepo → githubRepoInspect → working collocation extractor scripts.

Automated Workflows

Deep Research workflow scans 50+ papers like Miller (1995) and Kilgarriff et al. (2014) for a systematic review of corpus tools, producing a structured corpus lexicography report. DeepScan applies a 7-step analysis with CoVe checkpoints to verify Sketch Engine claims against Biber et al. (2000). Theorizer generates hypotheses on semantic change from Traugott & Dasher (2001) corpus patterns.

Frequently Asked Questions

What defines Corpus-Based Lexicography?

It applies empirical corpus analysis to dictionary making, replacing intuition with data on usage, frequency, and collocations (Tognini-Bonelli, 2001).

What are key methods?

Methods include frequency analysis (Coxhead, 2000), collocation extraction via Sketch Engine (Kilgarriff et al., 2014), and frame semantics from corpora (Baker et al., 1998).

What are key papers?

Foundational: WordNet (Miller, 1995, 13,914 citations), FrameNet (Baker et al., 1998, 2,556 citations); Recent: Sketch Engine (Kilgarriff et al., 2014, 1,936 citations).

What are open problems?

Challenges persist in rare sense detection and cross-corpus comparability, as corpora bias toward frequent usages (Balota et al., 2007; Tognini-Bonelli, 2001).

Research Lexicography and Language Studies with AI

PapersFlow provides specialized AI tools for Arts and Humanities researchers. Here are the most relevant for this topic:

See how researchers in Arts & Humanities use PapersFlow

Field-specific workflows, example queries, and use cases.

Arts & Humanities Guide

Start Researching Corpus-Based Lexicography with AI

Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.

See how PapersFlow works for Arts and Humanities researchers