PapersFlow Research Brief
Data Quality and Management
Research Guide
What is Data Quality and Management?
Data Quality and Management is the cluster of techniques for assessing, improving, and maintaining data quality. It spans record linkage, data cleaning, entity resolution, information quality benchmarks, privacy-preserving record linkage, name disambiguation, data integration, and the quality challenges posed by big data.
This field encompasses 61,971 works focused on data quality assessment and improvement methods such as duplicate detection and string similarity measures. Key contributions include frameworks like FAIR principles for data stewardship and models for privacy protection such as k-anonymity. Data consumers define quality beyond accuracy to include broader dimensions like accessibility and timeliness.
Topic Hierarchy
Research Sub-Topics
Record Linkage
Covers probabilistic, deterministic, and machine learning approaches to linking records across databases. Researchers evaluate accuracy in large-scale datasets.
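The two classic linkage strategies can be sketched in a few lines. This is a minimal illustration, not a production linker: the field names, weights, and records are invented for the example, and the weighted-agreement score only gestures at Fellegi-Sunter-style probabilistic linkage.

```python
# Deterministic linkage: link only when an exact key (e.g. a national ID) agrees.
# Probabilistic-style linkage: sum agreement weights over compared fields.
# Field names and weights below are illustrative, not from any real schema.

WEIGHTS = {"name": 2.0, "dob": 3.0, "zip": 1.0}  # hypothetical agreement weights

def deterministic_match(a, b):
    """Link only if both records share the same non-missing exact key."""
    return a.get("id") is not None and a.get("id") == b.get("id")

def probabilistic_score(a, b, weights=WEIGHTS):
    """Higher total agreement weight = more likely the records link."""
    return sum(w for f, w in weights.items() if a.get(f) == b.get(f))

rec1 = {"id": None, "name": "Ann Lee", "dob": "1980-01-02", "zip": "12345"}
rec2 = {"id": None, "name": "Ann Lee", "dob": "1980-01-02", "zip": "99999"}

print(deterministic_match(rec1, rec2))   # False: no shared exact key
print(probabilistic_score(rec1, rec2))   # 5.0: name and dob agree, zip does not
```

In practice the score is compared against learned upper and lower thresholds to classify pairs as links, non-links, or candidates for clerical review.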
Entity Resolution
Focuses on resolving duplicates and merging entities in structured and unstructured data. Studies blocking, matching, and clustering techniques for scalability.
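Blocking is the main lever for scalability: a cheap key partitions records so that expensive pairwise matching runs only within partitions. A minimal sketch, with an invented blocking key (surname initial plus ZIP code):

```python
# Blocking for entity resolution: group records by a cheap key so pairwise
# comparison is limited to within-block pairs. The key choice is illustrative.
from collections import defaultdict
from itertools import combinations

def block_key(rec):
    return (rec["surname"][0].upper(), rec["zip"])

def candidate_pairs(records):
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    for group in blocks.values():
        yield from combinations(group, 2)  # compare only within a block

records = [
    {"surname": "Smith", "zip": "12345"},
    {"surname": "Smyth", "zip": "12345"},
    {"surname": "Jones", "zip": "54321"},
]
print(len(list(candidate_pairs(records))))  # 1 candidate pair instead of 3
```

The trade-off is recall: a blocking key that is too strict can separate true duplicates, which is why real systems often union candidates from several keys.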
Data Cleaning
Examines automated detection and correction of errors, outliers, and inconsistencies in datasets. Researchers develop tools for error localization and repair.
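One common error-localization rule is the interquartile-range (IQR) test for numeric outliers, with repair by clipping to the fences. A minimal sketch using the standard Tukey 1.5 x IQR convention; the data is invented:

```python
# IQR-based outlier detection and repair: flag values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] and clip them to those fences.
import statistics

def iqr_fences(values, k=1.5):
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clean(values):
    lo, hi = iqr_fences(values)
    return [min(max(v, lo), hi) for v in values]

data = [10, 12, 11, 13, 12, 11, 250]  # 250 is a likely data-entry error
lo, hi = iqr_fences(data)
print([v for v in data if v < lo or v > hi])  # flags the outlier
```

Clipping is only one repair policy; depending on the domain, flagged values may instead be imputed, sent for manual review, or dropped.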
Name Disambiguation
Addresses author and entity name variants using similarity metrics and supervised learning. Applied in bibliometrics and citation analysis.
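A basic building block is a normalized string-similarity score between name variants. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for the metrics (Jaro-Winkler, edit distance) more common in the bibliometrics literature; the names are invented:

```python
# Name-variant similarity: normalize (lowercase, strip punctuation, collapse
# whitespace), then score with difflib's ratio in [0, 1].
from difflib import SequenceMatcher

def name_similarity(a, b):
    def norm(s):
        return " ".join(s.lower().replace(".", "").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

print(name_similarity("J. Smith", "John Smith"))  # high: likely the same author
print(name_similarity("J. Smith", "A. Jones"))    # low: likely different authors
```

Supervised disambiguators typically combine such scores with non-name features (coauthors, venues, affiliations) before clustering name mentions into author identities.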
Privacy-Preserving Record Linkage
Develops cryptographic and secure multi-party computation methods for linking without revealing sensitive data. Balances utility with privacy guarantees.
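The core idea can be shown with keyed hashing: both parties encode a normalized identifier with a shared secret key and compare only the digests, so raw values never leave either site. This is a simplified sketch; real protocols add Bloom-filter encodings or secure multi-party computation, and the salt and names here are invented:

```python
# Privacy-preserving comparison via HMAC: parties exchange digests, not names.
import hmac
import hashlib

SHARED_SALT = b"agreed-out-of-band"  # illustrative; exchanged securely in practice

def encode(identifier):
    norm = identifier.strip().lower().encode()
    return hmac.new(SHARED_SALT, norm, hashlib.sha256).hexdigest()

hospital_a = encode("Ann Lee ")
hospital_b = encode("ann lee")
print(hospital_a == hospital_b)  # True: records link, name stays private
```

Exact-hash matching is brittle against typos, which is why the literature moves to error-tolerant encodings; it is also why the normalization step matters so much in these protocols.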
Why It Matters
Data Quality and Management enables reliable data sharing in healthcare, as shown by the REDCap consortium, which built an international community of software platform partners for secure clinical data collection; the consortium paper has been cited 21,869 times (Harris et al., 2019). In scientific research, the FAIR Guiding Principles make data findable, accessible, interoperable, and reusable, improving reproducibility across disciplines; the paper has accrued 16,387 citations (Wilkinson et al., 2016). Privacy models like k-anonymity allow hospitals and banks to release person-specific data to researchers with provable guarantees against re-identification (Sweeney, 2002; 8,343 citations). Poor data quality also distorts business decisions: Wang and Strong (1996), cited 4,344 times, identified multiple quality dimensions beyond accuracy that affect economic outcomes.
Reading Guide
Where to Start
'Beyond Accuracy: What Data Quality Means to Data Consumers' by Wang and Strong (1996): it provides a foundational, consumer-focused framework of data quality dimensions (4,344 citations) and is accessible before diving into the technical methods.
Key Papers Explained
Wang and Strong (1996), in 'Beyond Accuracy: What Data Quality Means to Data Consumers', establish multiple quality dimensions. Wilkinson et al. (2016), in 'The FAIR Guiding Principles for scientific data management and stewardship', operationalize these through findability and reusability standards. Sweeney (2002), in 'k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY', builds privacy protections compatible with them. Bizer, Heath, and Berners-Lee (2009), in 'Linked Data - The Story So Far', extend integration practices to the Web, while Chen (2002), in 'The Entity Relationship Model — Toward a Unified View of Data', supplies the structural foundation.
Advanced Directions
Current work emphasizes scalability in big-data cleaning, duplicate detection, and string similarity across the field's 61,971 papers. With no recent preprints or news surfaced, open frontiers remain in privacy-preserving entity resolution and in information quality benchmarks for social-science applications.
Papers at a Glance
| # | Paper | Year | Venue | Citations | Open Access |
|---|---|---|---|---|---|
| 1 | The REDCap consortium: Building an international community of ... | 2019 | Journal of Biomedical ... | 21.9K | ✓ |
| 2 | The FAIR Guiding Principles for scientific data management and... | 2016 | Scientific Data | 16.4K | ✓ |
| 3 | Bayesian Data Analysis | 1995 | — | 13.7K | ✕ |
| 4 | k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY | 2002 | International Journal ... | 8.3K | ✕ |
| 5 | Business Intelligence and Analytics: From Big Data to Big Impact | 2012 | MIS Quarterly | 5.8K | ✕ |
| 6 | The Entity Relationship Model — Toward a Unified View of Data | 2002 | — | 5.8K | ✕ |
| 7 | Linked Data - The Story So Far | 2009 | International Journal ... | 4.5K | ✕ |
| 8 | Beyond Accuracy: What Data Quality Means to Data Consumers | 1996 | Journal of Management ... | 4.3K | ✕ |
| 9 | Software Framework for Topic Modelling with Large Corpora | 2010 | — | 3.8K | ✓ |
| 10 | Journal of Statistical Software | 2009 | Wiley Interdisciplinar... | 3.6K | ✓ |
Frequently Asked Questions
What are the FAIR Guiding Principles?
The FAIR Guiding Principles are guidelines for scientific data management and stewardship that make data findable, accessible, interoperable, and reusable. Wilkinson et al. (2016) introduced them in 'The FAIR Guiding Principles for scientific data management and stewardship,' which has 16,387 citations. These principles support global data integration in research.
How does k-anonymity protect privacy in data sharing?
k-anonymity is a model that protects privacy by ensuring each record in released data is indistinguishable from at least k-1 other records on its quasi-identifying attributes. Sweeney (2002) defined it in 'k-ANONYMITY: A MODEL FOR PROTECTING PRIVACY', allowing data holders like hospitals to share field-structured data with researchers while providing guarantees against re-identification. The paper has 8,343 citations.
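Checking the property is straightforward: count how often each combination of quasi-identifier values occurs and require every group to have at least k records. A minimal sketch in the sense of Sweeney (2002), with invented column names and generalized values:

```python
# k-anonymity check: every quasi-identifier combination must occur >= k times.
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(c >= k for c in counts.values())

table = [
    {"zip": "123**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "123**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "456**", "age": "40-49", "diagnosis": "flu"},
]
print(is_k_anonymous(table, ["zip", "age"], 2))  # False: one group has 1 record
```

When the check fails, data holders typically generalize values further (e.g. widen age bands) or suppress the offending records until every group reaches size k.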
What data quality dimensions matter to consumers?
Data consumers consider quality beyond accuracy, including dimensions like completeness, timeliness, and accessibility. Wang and Strong (1996) showed in 'Beyond Accuracy: What Data Quality Means to Data Consumers' (4,344 citations) that a narrow focus on accuracy misses these broader impacts. This wider view guides improvement efforts in organizations.
What is record linkage in data management?
Record linkage identifies and links records from different databases that refer to the same entities, typically using techniques like string similarity and duplicate detection. The field also encompasses privacy-preserving methods and entity resolution, and is central to the 61,971 works in data quality management; work on name disambiguation extends these techniques to big-data integration.
How do Linked Data principles support data quality?
Linked Data provides best practices for publishing and connecting structured data on the Web, creating a global space with billions of assertions. Bizer, Heath, and Berners-Lee (2009) described this in 'Linked Data - The Story So Far,' with 4,533 citations. It enhances data integration and quality through interoperability.
What role does the Entity Relationship Model play?
The Entity Relationship Model offers a unified view of data for design and integration. Chen (2002) presented it in 'The Entity Relationship Model — Toward a Unified View of Data,' cited 5,769 times. It supports data quality by standardizing structures in management systems.
Open Research Questions
- How can privacy-preserving record linkage scale to big data volumes while maintaining linkage accuracy?
- What metrics best capture data quality dimensions beyond accuracy for diverse consumer needs?
- How do entity resolution techniques handle name disambiguation in multilingual datasets?
- What integration methods resolve conflicts in linked data from heterogeneous sources?
- How can FAIR principles be automated in data management pipelines for real-time stewardship?
Recent Trends
The field holds steady at 61,971 works, with no 5-year growth rate specified. Highly cited papers such as Harris et al.'s 'The REDCap consortium: Building an international community of software platform partners' (21,869 citations) indicate a sustained focus on collaborative platforms, while the FAIR principles of Wilkinson et al. (2016) (16,387 citations) drive ongoing adoption of data stewardship. No recent preprints or news appeared in the last 12 months.
Research Data Quality and Management with AI
PapersFlow provides specialized AI tools for Decision Sciences researchers. Here are the most relevant for this topic:
Systematic Review
AI-powered evidence synthesis with documented search strategies
AI Literature Review
Automate paper discovery and synthesis across 474M+ papers
Deep Research Reports
Multi-source evidence synthesis with counter-evidence
See how researchers in Economics & Business use PapersFlow
Field-specific workflows, example queries, and use cases.
Start Researching Data Quality and Management with AI
Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.
See how PapersFlow works for Decision Sciences researchers