Record Linkage
Research Guide

What is Record Linkage?

Record linkage is the process of identifying records in separate databases that refer to the same real-world entity using probabilistic, deterministic, or machine learning methods.

Fellegi and Sunter (1969, 2345 citations) established the foundational probabilistic model. Jaro (1989, 1302 citations) advanced the methodology for census matching, and Christen (2011, 658 citations) surveyed scalable indexing techniques. More than 10 key papers from 1969 to 2012 cover string metrics, health record linkage, and data quality integration.

Why It Matters

Record linkage enables accurate data integration for census analysis (Jaro, 1989), health services research (Holman et al., 1999), and public policy evaluation. It supports merging disparate sources in social sciences, improving decision-making in operations research. Winkler (1999) highlights its role in addressing large-scale matching problems without unique identifiers.

Key Research Challenges

Scalable Indexing for Large Datasets

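Comparing every record in one large file with every record in another can mean billions of record pairs, so efficient blocking and indexing are needed to reduce the number of comparisons. Christen (2011) surveys these techniques but notes that computational limits persist. Stricter blocking removes more comparisons but can also discard true matches, and this trade-off between recall and precision is hardest to manage in real-time applications.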
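A minimal Python sketch of standard blocking may make the comparison savings concrete. The record fields (id, surname, birth_year), the blocking_key function, and the sample records are illustrative assumptions, not taken from Christen (2011).

```python
from collections import defaultdict
from itertools import product


def blocking_key(record):
    """Hypothetical blocking key: surname initial plus birth year."""
    return (record["surname"][:1].upper(), record["birth_year"])


def candidate_pairs(file_a, file_b, key=blocking_key):
    """Return (id_a, id_b) pairs only for records that share a blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for rec in file_a:
        blocks[key(rec)][0].append(rec["id"])
    for rec in file_b:
        blocks[key(rec)][1].append(rec["id"])
    pairs = []
    for ids_a, ids_b in blocks.values():
        # Compare records only within the same block.
        pairs.extend(product(ids_a, ids_b))
    return pairs


# Made-up sample records for illustration.
file_a = [
    {"id": "a1", "surname": "Smith", "birth_year": 1970},
    {"id": "a2", "surname": "Jones", "birth_year": 1985},
    {"id": "a3", "surname": "Smyth", "birth_year": 1970},
]
file_b = [
    {"id": "b1", "surname": "Smith", "birth_year": 1970},
    {"id": "b2", "surname": "Johnson", "birth_year": 1985},
    {"id": "b3", "surname": "Brown", "birth_year": 1990},
]

pairs = candidate_pairs(file_a, file_b)
print(len(pairs), "candidate pairs instead of", len(file_a) * len(file_b))
```

With this hypothetical key, 3 of the 9 possible pairs are compared. The sketch also shows the recall risk: a surname with a mistyped first letter would fall into a different block and never be compared.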

String Matching Accuracy

Name variations and typographical errors demand robust distance metrics such as edit distance. Cohen et al. (2003) compare metrics, showing no single best performer across datasets. Winkler's (1990) enhancements to the Fellegi-Sunter decision rules improve matching but require domain-specific tuning.
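As an illustration of edit distance, the sketch below uses the standard Levenshtein dynamic program. It is not the toolkit evaluated by Cohen et al. (2003); the function names and the length-based normalization are assumptions made for this example.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(s, t):
    """Map edit distance to [0, 1]; 1.0 means identical (an assumed normalization)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


print(levenshtein("smith", "smyth"))            # one substitution
print(normalized_similarity("smith", "smyth"))
```

A threshold on such a similarity still has to be chosen per dataset, which is one reason no single metric wins everywhere.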

Probabilistic Model Calibration

Estimating match probabilities needs labeled training data, often unavailable. Fellegi and Sunter (1969) provide theory, but Winkler (1999) identifies ongoing calibration issues. Clerical review for uncertain pairs increases costs.
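When labeled pairs are unavailable, one commonly used option is to estimate the parameters with expectation-maximization. The following is a rough sketch under stated assumptions: binary agree/disagree comparison vectors, conditional independence between fields, and simulated data. The function names, starting values, and simulation settings are hypothetical, and the simulated match rate is far higher than in most real linkage problems.

```python
import random


def em_estimate(vectors, n_iter=50, p=0.1, m_start=0.8, u_start=0.2, eps=1e-6):
    """Estimate the match proportion p and per-field m/u probabilities by EM."""
    n_fields = len(vectors[0])
    m = [m_start] * n_fields
    u = [u_start] * n_fields
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for gamma in vectors:
            like_m, like_u = p, 1.0 - p
            for k, agree in enumerate(gamma):
                like_m *= m[k] if agree else 1.0 - m[k]
                like_u *= u[k] if agree else 1.0 - u[k]
            post.append(like_m / (like_m + like_u))
        # M-step: re-estimate parameters from the soft assignments.
        total_m = sum(post)
        total_u = len(post) - total_m
        p = min(max(total_m / len(post), eps), 1.0 - eps)
        for k in range(n_fields):
            agree_m = sum(g for g, gamma in zip(post, vectors) if gamma[k])
            agree_u = sum(1.0 - g for g, gamma in zip(post, vectors) if gamma[k])
            # Clamp away from 0 and 1 so later likelihoods stay positive.
            m[k] = min(max(agree_m / total_m, eps), 1.0 - eps)
            u[k] = min(max(agree_u / total_u, eps), 1.0 - eps)
    return p, m, u


def simulate(n_pairs, p_true, m_true, u_true, seed=0):
    """Draw synthetic comparison vectors from known parameters (illustration only)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n_pairs):
        probs = m_true if rng.random() < p_true else u_true
        vectors.append(tuple(int(rng.random() < q) for q in probs))
    return vectors


vectors = simulate(5000, p_true=0.2, m_true=[0.9] * 4, u_true=[0.1] * 4)
p_hat, m_hat, u_hat = em_estimate(vectors)
print("estimated match proportion:", round(p_hat, 3))
print("estimated m:", [round(x, 3) for x in m_hat])
print("estimated u:", [round(x, 3) for x in u_hat])
```

On well-separated simulated data the estimates land near the generating values; on real files, convergence depends on the starting values and on how far the independence assumption holds, which is part of the calibration problem described above.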

Essential Papers

A Theory for Record Linkage

Ivan P. Fellegi, A. B. Sunter · 1969 · Journal of the American Statistical Association · 2.3K citations

A mathematical model is developed to provide a theoretical framework for a computer-oriented solution to the problem of recognizing those records in two files which represent identical per...

A comparison of string distance metrics for name-matching tasks

William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg · 2003 · 1.4K citations

Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics pro...

Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida

Matthew A. Jaro · 1989 · Journal of the American Statistical Association · 1.3K citations

A test census of Tampa, Florida and an independent postenumeration survey (PES) were conducted by the U.S. Census Bureau in 1985. The PES was a stratified block sample with heavy emphasis ...

Population-based linkage of health records in Western Australia: development of a health services research linked database

C. D’Arcy J. Holman, A. Bass, Ian Rouse et al. · 1999 · Australian and New Zealand Journal of Public Health · 1.0K citations

A product perspective on total data quality management

Richard Y. Wang · 1998 · Communications of the ACM · 899 citations

Author: Richard Y. Wang, Massachusetts Institute of Technology, Cambridge

The State of Record Linkage and Current Research Problems

William E. Winkler · 1999 · 817 citations

This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathemat...

String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.

William E. Winkler · 1990 · 785 citations

To locate matches across pairs of lists without unique identifiers it is sometimes necessary to compare strings of letters. String comparators are used in production computer matching software duri...

Reading Guide

Foundational Papers

Start with Fellegi and Sunter (1969) for probabilistic theory, then Jaro (1989) for practical census matching, and Cohen et al. (2003) for string metrics fundamentals.

Recent Advances

Read Christen (2011) for indexing scalability, Winkler (1999) for an overview of research problems, and Christen's (2012) book for comprehensive coverage of data matching.

Core Methods

Fellegi-Sunter likelihood ratios with m/u probabilities; blocking via sorting/indexing (Christen, 2011); Jaro-Winkler distance; expectation-maximization for parameter estimation.
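To show how the m/u probabilities turn into a linkage decision, here is a small sketch of Fellegi-Sunter match weights as summed log-likelihood ratios under conditional independence. The field list, the m and u values, and both thresholds are made-up illustrative numbers, not figures from any of the cited papers.

```python
import math


def match_weight(gamma, m, u):
    """Sum of per-field log2 likelihood ratios for a binary comparison vector."""
    weight = 0.0
    for agree, mk, uk in zip(gamma, m, u):
        if agree:
            weight += math.log2(mk / uk)              # agreement weight
        else:
            weight += math.log2((1 - mk) / (1 - uk))  # disagreement weight
    return weight


def classify(weight, lower, upper):
    """Three-way decision rule: link, non-link, or clerical review."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Hypothetical parameters for fields: surname, birth year, postcode.
m = [0.9, 0.9, 0.8]   # P(field agrees | records are a true match)
u = [0.1, 0.1, 0.2]   # P(field agrees | records are not a match)
LOWER, UPPER = -3.0, 5.0  # assumed thresholds

for gamma in [(1, 1, 1), (1, 1, 0), (0, 0, 0)]:
    w = match_weight(gamma, m, u)
    print(gamma, round(w, 2), classify(w, LOWER, UPPER))
```

Pairs whose weight falls between the two thresholds are the ones sent to clerical review; moving the thresholds trades review cost against error rates.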

How PapersFlow Helps You Research Record Linkage

Discover & Search

Research Agent uses citationGraph on Fellegi and Sunter (1969) to map the lineage of the probabilistic model through its 2345 citing papers, then findSimilarPapers reveals the extensions by Jaro (1989) and Winkler (1999). exaSearch queries 'scalable record linkage indexing' to surface Christen (2011) among 250M+ OpenAlex papers.

Analyze & Verify

Analysis Agent applies readPaperContent to extract the Fellegi-Sunter equations from the 1969 paper, then runPythonAnalysis simulates string comparators from Cohen et al. (2003) with pandas/NumPy. verifyResponse uses CoVe to cross-check claims against Winkler (1990), and GRADE scores the probabilistic assumptions.

Synthesize & Write

Synthesis Agent detects gaps in scalable methods after Christen (2011) and flags contradictions in string metrics from Cohen et al. (2003). Writing Agent uses latexEditText for Fellegi-Sunter model equations, latexSyncCitations for a 10-paper bibliography, and latexCompile for a camera-ready review; exportMermaid diagrams the decision-rule flows.

Use Cases

"Reimplement Jaro-Winkler string similarity in Python for record linkage testing"

Research Agent → searchPapers 'Jaro string comparator' → Analysis Agent → readPaperContent (Winkler 1990) → runPythonAnalysis (NumPy/pandas sandbox computes Jaro metric on sample names, outputs similarity scores CSV).
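A hedged sketch of what such a reimplementation might look like in plain Python follows. The prefix scaling factor of 0.1 and the prefix cap of 4 characters are commonly used defaults treated here as assumptions, and this variant applies the prefix bonus unconditionally, whereas some implementations apply it only above a similarity threshold. It is not the code from Winkler (1990).

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1] based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed default parameters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


for a, b in [("MARTHA", "MARHTA"), ("DWAYNE", "DUANE")]:
    print(a, b, round(jaro(a, b), 4), round(jaro_winkler(a, b), 4))
```

The sample names are standard textbook pairs; writing the scores to CSV, as the use case describes, would be a small addition with the csv module or pandas.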

"Write LaTeX appendix comparing Fellegi-Sunter to deterministic linkage"

Synthesis Agent → gap detection (Fellegi 1969 vs Jaro 1989) → Writing Agent → latexEditText (inserts probability formulas) → latexSyncCitations (adds 5 papers) → latexCompile (PDF with tables of m-probabilities).

"Find GitHub repos implementing scalable blocking from Christen survey"

Research Agent → searchPapers 'Christen 2011 indexing' → Code Discovery → paperExtractUrls → paperFindGithubRepo → githubRepoInspect (reviews blocking code, extracts Python snippets for canopy indexing).

Automated Workflows

The Deep Research workflow scans 50+ record linkage papers via searchPapers → citationGraph, producing a structured report with Fellegi-Sunter descendants and citation trends. DeepScan applies a 7-step CoVe to verify string metric benchmarks from Cohen et al. (2003), with GRADE checkpoints. Theorizer generates extensions to the Fellegi-Sunter model from the problems identified by Winkler (1999).

Frequently Asked Questions

What defines record linkage?

Record linkage identifies matching records across databases without unique IDs using comparison vectors and decision rules (Fellegi and Sunter, 1969).

What are core methods?

Core methods are probabilistic linkage (Fellegi-Sunter), deterministic linkage (exact matches), and blocking/indexing (Christen, 2011); string metrics include Jaro-Winkler (Winkler, 1990).

What are key papers?

Key papers include Fellegi and Sunter (1969, 2345 citations) for theory, Cohen et al. (2003, 1375 citations) for string metrics, and Jaro (1989, 1302 citations) for census applications.

What open problems exist?

Open problems include scalable entity resolution on massive datasets, unsupervised probability estimation, and robust matching under data drift (Winkler, 1999; Christen, 2011).