Subtopic Deep Dive

Data Provenance in Scientific Computing
Research Guide

What is Data Provenance in Scientific Computing?

Data provenance in scientific computing captures the lineage of data transformations, parameter choices, and workflow executions to ensure reproducibility and validation in computational experiments.

Standards and tools track data origins and processing steps in scientific workflows. Query languages and visualization enable exploration of provenance graphs. Over 2,700 citations across 10 key papers from 2002-2019 document core systems like Kepler and myExperiment (Goble et al., 2010; Altıntaş et al., 2006).
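The lineage idea above can be sketched concretely: provenance is, at its core, a record of which process derived which data artifact from which inputs. The snippet below is a minimal illustration (the `record`/`lineage` helpers are hypothetical, not any real system's API), showing how a handful of triples already supports ancestry queries.

```python
# Minimal sketch: provenance as (output, process, inputs) lineage triples.
provenance = []  # a real system would persist this alongside the data

def record(output, process, *inputs):
    """Append one lineage triple: which process derived `output` from `inputs`."""
    provenance.append((output, process, inputs))

# A toy two-step workflow: normalize raw readings, then average them.
raw = [3.0, 5.0, 10.0]
normalized = [x / max(raw) for x in raw]
record("normalized", "normalize(max-scaling)", "raw")

mean = sum(normalized) / len(normalized)
record("mean", "average", "normalized")

def lineage(artifact):
    """Walk the triples backwards to collect everything `artifact` depends on."""
    deps = set()
    for out, _, ins in provenance:
        if out == artifact:
            for i in ins:
                deps.add(i)
                deps |= lineage(i)
    return deps

print(lineage("mean"))  # the full ancestry of the final result
```

With the triples in place, validating a result reduces to replaying its ancestry, which is exactly the reproducibility argument the guide makes.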

15 curated papers · 3 key challenges

Why It Matters

Provenance tracking validates computational results in bioinformatics and oceanography, enabling data reuse across teams (Buneman et al., 2006; Tanhua et al., 2019). The myExperiment repository, cited 299 times, supports sharing workflows with embedded provenance (Goble et al., 2010). Kepler's provenance collection ensures auditability in workflow systems (Altıntaş et al., 2006). FAIR Computational Workflows integrate provenance for reusable pipelines (Goble et al., 2019).

Key Research Challenges

Provenance Capture in Workflows

Capturing complete lineage during dynamic workflow execution remains difficult due to varying tool integrations. Kepler addresses this with built-in support but struggles with non-standard actors (Altıntaş et al., 2006). The First Provenance Challenge exposed inconsistencies across systems (Moreau et al., 2007).
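One common capture strategy, used here purely as an illustration rather than as Kepler's actual mechanism, is to wrap each workflow step so that its inputs, outputs, and timing are logged automatically. The decorator names below are hypothetical; the point is that capture breaks down exactly where a step cannot be wrapped, mirroring Kepler's trouble with non-standard actors.

```python
import functools
import json
import time

LOG = []  # in-memory provenance log; a real system would persist this

def traced(step):
    """Record inputs, outputs, and timing for each workflow step."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = step(*args, **kwargs)
        LOG.append({
            "step": step.__name__,
            "inputs": repr((args, kwargs)),
            "output": repr(result),
            "seconds": round(time.time() - start, 6),
        })
        return result
    return wrapper

@traced
def clean(values):
    return [v for v in values if v is not None]

@traced
def total(values):
    return sum(values)

total(clean([1, None, 2, 3]))
print(json.dumps(LOG, indent=2))  # one log entry per executed step
```

Any step invoked outside the `traced` wrapper leaves a hole in the log, which is the capture-completeness problem in miniature.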

Annotation Propagation Through Views

Propagating deletions and annotations across relational views loses information without precise mapping rules. Buneman et al. formalize conditions for correct propagation in databases (Buneman et al., 2002). Curated databases require manual provenance tracking (Buneman et al., 2006).
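A toy example of the propagation problem, assuming made-up curator annotations rather than any real database's scheme: when each source tuple carries an annotation, a monotone selection-and-projection view can carry the annotations along, and deleting by annotation then has a well-defined effect on the view. The difficulty Buneman et al. formalize is that for non-monotone or lossy views this propagation is no longer unambiguous.

```python
# Each source tuple carries an annotation identifying its origin (illustrative).
source = [
    {"gene": "BRCA1", "organism": "human", "ann": {"curator:alice"}},
    {"gene": "TP53",  "organism": "human", "ann": {"curator:bob"}},
    {"gene": "FOXP2", "organism": "mouse", "ann": {"import:ensembl"}},
]

def view_human_genes(rows):
    """A selection + projection view; annotations propagate with each tuple."""
    return [{"gene": r["gene"], "ann": set(r["ann"])} for r in rows
            if r["organism"] == "human"]

def delete_by_annotation(rows, tag):
    """Deleting a source annotation removes the view tuples it supports."""
    return [r for r in rows if tag not in r["ann"]]

v = view_human_genes(source)
print([r["gene"] for r in v])                                       # ['BRCA1', 'TP53']
print([r["gene"] for r in delete_by_annotation(v, "curator:bob")])  # ['BRCA1']
```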

Scalable Provenance Querying

Querying large provenance graphs demands efficient languages and storage. myExperiment enables social sharing but scales poorly for complex inquiries (Goble et al., 2010). FAIR workflows need standardized provenance models for interoperability (Goble et al., 2019).
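The basic query behind most provenance inquiries is ancestry: which upstream artifacts does a result depend on? A minimal sketch, assuming a hypothetical edge list rather than any particular system's store, shows that this is a linear-time graph traversal; the scaling challenge is doing it over millions of edges with richer predicates.

```python
from collections import deque

# Provenance edges point from each artifact to the artifacts it was derived from
# (file names are illustrative).
derived_from = {
    "figure3":   ["stats.csv"],
    "stats.csv": ["clean.csv"],
    "clean.csv": ["raw_a.csv", "raw_b.csv"],
}

def ancestors(artifact):
    """Breadth-first walk: every upstream artifact `artifact` depends on."""
    seen, queue = set(), deque([artifact])
    while queue:
        for parent in derived_from.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(ancestors("figure3")))  # all upstream artifacts, O(nodes + edges)
```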

Essential Papers

1. myExperiment: a repository and social network for the sharing of bioinformatics workflows

Carole Goble, Jiten Bhagat, Sergejs Aleksejevs et al. · 2010 · Nucleic Acids Research · 299 citations

myExperiment is a collaborative environment where scientists can safely publish their workflows and in silico experiments, share them with groups and find those o...

2. Provenance management in curated databases

Peter Buneman, Adriane Chapman, James Cheney · 2006 · 296 citations

Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning ...

3. An annotation management system for relational databases

Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan et al. · 2005 · The VLDB Journal · 276 citations

4. Provenance Collection Support in the Kepler Scientific Workflow System

İlkay Altıntaş, Oscar Barney, Efrat Jaeger-Frank · 2006 · Lecture Notes in Computer Science · 269 citations

5. Ocean FAIR Data Services

Toste Tanhua, Sylvie Pouliquen, Jessica Hausman et al. · 2019 · Frontiers in Marine Science · 235 citations

Well-founded data management systems are of vital importance for ocean observing systems as they ensure that essential data are not only collected but also retained and made accessible for analysis...

6. On propagation of deletions and annotations through views

Peter Buneman, Sanjeev Khanna, Wang-Chiew Tan · 2002 · 219 citations

We study two classes of view update problems in relational databases. We are given a source database S, a monotone query Q, and the view Q(S) generated by the query. The first problem that we consi...

7. Update exchange with mappings and provenance

Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives et al. · 2007 · ScholarlyCommons (University of Pennsylvania) · 159 citations

We consider systems for data sharing among heterogeneous peers related by a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to ask queries ove...

Reading Guide

Foundational Papers

Start with Buneman et al. (2002) for view propagation theory, then Buneman et al. (2006) for curated database provenance, and Altıntaş et al. (2006) for practical Kepler implementation.

Recent Advances

Study Goble et al. (2019) FAIR Computational Workflows for modern standards and Tanhua et al. (2019) Ocean FAIR Data for domain application.

Core Methods

Core techniques include workflow logging (Kepler, Altıntaş 2006), annotation propagation (Buneman 2002), social repository sharing (myExperiment, Goble 2010), and FAIR provenance models (Goble 2019).

How PapersFlow Helps You Research Data Provenance in Scientific Computing

Discover & Search

Research Agent uses citationGraph on 'Provenance Collection Support in the Kepler Scientific Workflow System' (Altıntaş et al., 2006) to map connections to myExperiment (Goble et al., 2010) and FAIR Workflows (Goble et al., 2019), then exaSearch for 'provenance query languages in Kepler'. findSimilarPapers expands to annotation systems like Bhagwat et al. (2005).

Analyze & Verify

Analysis Agent runs readPaperContent on Altıntaş et al. (2006) to extract Kepler provenance schemas, then verifyResponse with CoVe against Buneman et al. (2006) for database consistency. runPythonAnalysis builds provenance graph visualizations using NetworkX on extracted data. GRADE grading scores evidence strength for workflow reproducibility claims.
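Since the analysis step above names NetworkX, here is a small sketch of what such a provenance-graph build might look like; the node names are hypothetical lineage from a workflow run, not output of any PapersFlow tool. Standard graph operations then answer the usual provenance questions directly.

```python
import networkx as nx

# Hypothetical lineage from a workflow run: edges point from
# input artifact to derived artifact.
G = nx.DiGraph()
G.add_edges_from([
    ("raw_reads", "trimmed_reads"),
    ("trimmed_reads", "alignment"),
    ("reference_genome", "alignment"),
    ("alignment", "variant_calls"),
])

# Provenance queries reduce to standard graph algorithms.
print(sorted(nx.ancestors(G, "variant_calls")))  # everything the result depends on
print(list(nx.topological_sort(G)))              # a valid execution order
```

From here, `nx.draw` (or an export to Graphviz) gives the visualization the text describes.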

Synthesize & Write

Synthesis Agent detects gaps in view-propagation coverage between Buneman et al. (2002) and Green et al. (2007) and flags contradictions in annotation models. Writing Agent applies latexEditText to generate provenance diagrams, latexSyncCitations for the 10-paper bibliography, and latexCompile for workflow reports. exportMermaid creates interactive lineage graphs.

Use Cases

"Extract provenance code examples from Kepler papers and test reproducibility"

Research Agent → searchPapers 'Kepler provenance' → paperExtractUrls → Code Discovery → paperFindGithubRepo → githubRepoInspect → runPythonAnalysis (pandas workflow simulation) → reproducible execution logs.

"Write LaTeX paper section on FAIR workflows with provenance citations"

Synthesis Agent → gap detection (Goble 2019 vs Tanhua 2019) → Writing Agent → latexEditText (provenance section) → latexSyncCitations (10 papers) → latexCompile → PDF with embedded mermaid diagrams.

"Find GitHub repos implementing myExperiment-style provenance sharing"

Research Agent → citationGraph 'myExperiment Goble 2010' → Code Discovery → paperFindGithubRepo 'workflow provenance' → githubRepoInspect → exportCsv (repo metrics, stars, languages).

Automated Workflows

Deep Research workflow scans 50+ provenance papers via searchPapers → citationGraph clustering → structured report with GRADE scores on reproducibility claims. DeepScan applies 7-step analysis to Altıntaş et al. (2006): readPaperContent → runPythonAnalysis (schema validation) → CoVe verification chain. Theorizer generates hypotheses on scalable provenance from Buneman papers (2002, 2006).

Frequently Asked Questions

What is data provenance in scientific computing?

Data provenance records the origin, transformations, and lineage of data in computational workflows. It enables validation and reuse (Buneman et al., 2006).

What are key methods for provenance capture?

The Kepler workflow system collects provenance natively (Altıntaş et al., 2006). myExperiment shares workflows with embedded lineage (Goble et al., 2010). Annotation systems track changes in relational databases (Bhagwat et al., 2005).

What are the most cited papers?

myExperiment (Goble et al., 2010, 299 citations), Provenance in curated databases (Buneman et al., 2006, 296 citations), Annotation management (Bhagwat et al., 2005, 276 citations).

What open problems exist in provenance?

Scalable querying of large graphs and consistent propagation through views. First Provenance Challenge revealed representation inconsistencies (Moreau et al., 2007). FAIR integration needs better standards (Goble et al., 2019).

Research Scientific Computing and Data Management with AI

PapersFlow provides specialized AI tools for Decision Sciences researchers working on data provenance and workflow management.


Start Researching Data Provenance in Scientific Computing with AI

Search 474M+ papers, run AI-powered literature reviews, and write with integrated citations — all in one workspace.
