Subtopic Deep Dive

Automatic Wrapper Generation
Research Guide

What is Automatic Wrapper Generation?

Automatic Wrapper Generation induces wrappers using machine learning and pattern-based methods to extract structured data from semi-structured web pages.

This subtopic covers template detection, attribute labeling, and wrapper adaptability to site changes. Key works include Wang and Lochovsky (2003) on data extraction and label assignment (335 citations) and Embley et al. (1999) on record-boundary discovery (234 citations). Over 10 influential papers from 1998-2011 address these techniques.

15 curated papers · 3 key challenges

Why It Matters

Robust wrappers enable extraction of product data from e-commerce sites, news from portals, and records from web databases. Such extraction underpins applications like the continuing-education search site that the U.S. Department of Labor was tasked with building (McCallum, 2005). Lerman et al. (2003) show that machine learning can maintain wrappers as sites change, supporting scalable web data integration. Dalvi et al. (2011) demonstrate unsupervised wrappers for large-scale extraction, which is relevant to search engines and data aggregation services.

Key Research Challenges

Template Detection

Identifying repeated page layouts for record extraction remains difficult across varied HTML structures. Embley et al. (1999) address record-boundary discovery, though the approach struggles with noisy web documents. Florescu et al. (1998) survey early database techniques for the Web, highlighting inconsistent templates across sites.
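To make record-boundary discovery concrete, the following is a minimal sketch of a single tag-counting heuristic: find the element with the most direct children and treat its most frequent child tag as the record separator. It is an illustration under simplifying assumptions, not the full method of Embley et al. (1999); the names FanoutParser and guess_record_separator are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they are recorded but not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class FanoutParser(HTMLParser):
    """Records, for every element, the tag names of its direct children."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements as (tag, child_tags)
        self.elements = []  # every element seen as (tag, child_tags)

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][1].append(tag)
        node = (tag, [])
        self.elements.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def guess_record_separator(html):
    """Return (container_tag, separator_tag, count), or None if no nesting is found."""
    parser = FanoutParser()
    parser.feed(html)
    if not parser.elements:
        return None
    # Assumption: the element with the highest fan-out holds the records.
    tag, children = max(parser.elements, key=lambda e: len(e[1]))
    if not children:
        return None
    separator, count = Counter(children).most_common(1)[0]
    return tag, separator, count


page = "<html><body><h1>Jobs</h1><ul><li>A</li><li>B</li><li>C</li></ul></body></html>"
print(guess_record_separator(page))  # ('ul', 'li', 3)
```

A single heuristic like this is easily fooled by navigation menus or decorative markup, which is one way to read the difficulty with noisy documents noted above.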

Attribute Labeling

Assigning semantic labels to extracted fields requires understanding context without user input. Wang and Lochovsky (2003) propose methods for web databases but face ambiguity in attribute roles. Muslea et al. (2006) use active learning with multiple views to reduce labeling needs.
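One simple way to label a field without user input is to test its values against conventional formats. The sketch below assumes a small, hand-written pattern table and a majority threshold; both are illustrative choices, and the function label_column is hypothetical rather than the procedure of Wang and Lochovsky (2003).

```python
import re

# Hypothetical format patterns; a real system would need many more.
PATTERNS = {
    "price": re.compile(r"^[$€£]\s?\d[\d,]*(\.\d{2})?$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def label_column(values, threshold=0.8):
    """Return the label whose pattern matches at least `threshold` of the values, else None."""
    if not values:
        return None
    best_label, best_share = None, 0.0
    for label, pattern in PATTERNS.items():
        share = sum(bool(pattern.match(v.strip())) for v in values) / len(values)
        if share > best_share:
            best_label, best_share = label, share
    return best_label if best_share >= threshold else None


print(label_column(["$12.99", "$5.00", "$1,200.00"]))  # price
print(label_column(["Dune", "Emma"]))                   # None
```

Columns such as titles or names match no format pattern and stay unlabeled, which illustrates the ambiguity in attribute roles mentioned above.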

Wrapper Maintenance

Wrappers break when sites change layouts, demanding automatic adaptation. Lerman et al. (2003) apply machine learning for maintenance but note scalability issues. Dalvi et al. (2011) introduce noise-tolerant induction for unsupervised updates.
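Detecting that a wrapper has broken can be sketched as a comparison between what the wrapper used to extract and what it extracts now. The example below assumes a very coarse token-type signature and a fixed agreement threshold; it is a simplified stand-in for learned data patterns, not the algorithm of Lerman et al. (2003), and all function names are hypothetical.

```python
import re


def signature(value, n=2):
    """Coarse type signature of the first n tokens of an extracted value."""
    types = []
    for token in re.findall(r"\w+|[^\w\s]", value)[:n]:
        if token.isdigit():
            types.append("NUM")
        elif token[0].isupper():
            types.append("CAP")
        elif token.isalpha():
            types.append("LOW")
        else:
            types.append("SYM")
    return tuple(types)


def learn_signatures(values):
    """Collect the signatures seen while the wrapper was known to work."""
    return {signature(v) for v in values}


def wrapper_looks_healthy(known, batch, min_share=0.7):
    """True if enough newly extracted values match a previously seen signature."""
    if not batch:
        return False
    matches = sum(signature(v) in known for v in batch)
    return matches / len(batch) >= min_share


known = learn_signatures(["$12.99", "$5.00"])
print(wrapper_looks_healthy(known, ["$3.50", "$100.00"]))          # True
print(wrapper_looks_healthy(known, ["Add to cart", "Add to cart"]))  # False
```

When the check fails, a maintenance system would try to relocate the field on the changed pages and re-induce the wrapper; that step is outside this sketch.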

Essential Papers

1.

Database techniques for the World-Wide Web

Daniela Florescu, Alon Y. Levy, Alberto O. Mendelzon · 1998 · ACM SIGMOD Record · 560 citations

2.

Data extraction and label assignment for web databases

Jiying Wang, Frederick H. Lochovsky · 2003 · 335 citations

Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is t...

3.

Learning to Match the Schemas of Data Sources: A Multistrategy Approach

AnHai Doan, Pedro Domingos, Alon Halevy · 2003 · Machine Learning · 242 citations

4.

Active Learning with Multiple Views

Ion Muslea, Steven Minton, Craig A. Knoblock · 2006 · Journal of Artificial Intelligence Research · 237 citations

Active learners alleviate the burden of labeling large amounts of data by detecting and asking the user to label only the most informative examples in the domain. We focus here on active learning f...

5.

Record-boundary discovery in Web documents

David W. Embley, Yi Jiang, Y.-K. Ng · 1999 · ACM SIGMOD Record · 234 citations

Extraction of information from unstructured or semistructured Web documents often requires a recognition and delimitation of records. (By “record” we mean a group of information relevant to some en...

6.

Information Extraction

Andrew McCallum · 2005 · Queue · 202 citations

In 2001 the U.S. Department of Labor was tasked with building a Web site that would help people find continuing education opportunities at community colleges, universities, and organizations across...

7.

Learning to remove Internet advertisements

Nicholas Kushmerick · 1999 · 192 citations

Reading Guide

Foundational Papers

Start with Florescu et al. (1998) for WWW database survey, then Wang and Lochovsky (2003) for extraction basics, and Embley et al. (1999) for record boundaries to build core understanding.

Recent Advances

Study Dalvi et al. (2011) for unsupervised large-scale wrappers and Lerman et al. (2003) for maintenance to see adaptability advances.
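The idea of inducing wrappers from noisy labels can be illustrated with a toy ranking problem: several candidate wrappers are scored by how plausibly they explain labels produced by an imperfect annotator. In the sketch below a "wrapper" is just a column index, the annotator is a dictionary lookup, and the recall and false-positive rates are assumed values. The scoring rule is a simple likelihood chosen for illustration and should not be taken as the model of Dalvi et al. (2011).

```python
import math


def score_wrapper(column, rows, is_labeled, recall=0.6, false_pos=0.05):
    """Log-likelihood of the noisy labels if `column` were the true target field."""
    total = 0.0
    for row in rows:
        for j, cell in enumerate(row):
            labeled = is_labeled(cell)
            if j == column:
                # True target cells are labeled with probability `recall`.
                total += math.log(recall if labeled else 1 - recall)
            else:
                # Other cells are labeled only by mistake.
                total += math.log(false_pos if labeled else 1 - false_pos)
    return total


def best_wrapper(rows, is_labeled):
    """Pick the column index that best explains the noisy labels."""
    n_cols = max(len(row) for row in rows)
    return max(range(n_cols), key=lambda c: score_wrapper(c, rows, is_labeled))


# Toy data: the dictionary misses one author and wrongly contains one title.
rows = [
    ["Dune", "Frank Herbert", "$9"],
    ["Emma", "Jane Austen", "$7"],
    ["Ulysses", "James Joyce", "$8"],
]
noisy_dictionary = {"Frank Herbert", "Jane Austen", "Emma"}
print(best_wrapper(rows, lambda cell: cell in noisy_dictionary))  # 1
```

Even though the annotator both misses a true value and mislabels a title, the author column still wins, which is the sense in which ranking tolerates noise.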

Core Methods

Core techniques: pattern matching (Embley et al., 1999), active multi-view learning (Muslea et al., 2006), ML-based maintenance (Lerman et al., 2003), noise-tolerant induction (Dalvi et al., 2011).
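Active multi-view learning can be sketched with two extractors that look at the same page from different directions and a selection step that asks the user only about pages where they disagree. The landmarks, page snippets, and function names below are hypothetical, and the extractors are hand-written rather than learned, so this shows only the query-selection idea behind Muslea et al. (2006).

```python
def forward_view(page, landmark="Phone: <b>"):
    """Extract the text that follows a prefix landmark, scanning left to right."""
    i = page.find(landmark)
    if i < 0:
        return None
    start = i + len(landmark)
    end = page.find("<", start)
    return page[start:end if end >= 0 else len(page)].strip()


def backward_view(page, landmark="</b>"):
    """Extract the text that precedes a suffix landmark, scanning right to left."""
    end = page.rfind(landmark)
    if end < 0:
        return None
    start = page.rfind(">", 0, end) + 1
    return page[start:end].strip()


def contention_points(pages):
    """Pages on which the two views disagree; these are the ones worth labeling."""
    return [p for p in pages if forward_view(p) != backward_view(p)]


pages = [
    "<li>Phone: <b>555-1234</b></li>",
    "<li>Phone: <b>555-9876</b> (fax: <b>555-0000</b>)</li>",
]
print(contention_points(pages))  # only the second page
```

On the first page both views return the same number, so no label is requested. On the second the backward view picks up the fax number, and that disagreement marks the page as informative.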

How PapersFlow Helps You Research Automatic Wrapper Generation

Discover & Search

Research Agent uses searchPapers and citationGraph to map influences from Florescu et al. (1998, 560 citations) to Dalvi et al. (2011), then findSimilarPapers uncovers related works like Lerman et al. (2003) on maintenance.

Analyze & Verify

Analysis Agent applies readPaperContent to Wang and Lochovsky (2003), verifies claims with CoVe (chain-of-verification), and uses runPythonAnalysis with pandas to statistically compare wrapper accuracy metrics and extraction pattern similarity across Embley et al. (1999) and Muslea et al. (2006).

Synthesize & Write

Synthesis Agent detects gaps in adaptability research after Lerman et al. (2003) and flags contradictions between supervised and unsupervised methods. Writing Agent uses latexEditText, latexSyncCitations (for example, for Embley et al. (1999)), and latexCompile to generate reports, with exportMermaid producing diagrams of wrapper induction flows.

Use Cases

"Compare extraction accuracy of record-boundary methods in Embley 1999 vs Dalvi 2011"

Analysis Agent → readPaperContent (both papers) → runPythonAnalysis (pandas to parse F1-scores from tables) → GRADE grading → outputs a statistical verification table.

"Draft a survey section on wrapper maintenance techniques"

Synthesis Agent → gap detection (post-2003 papers) → Writing Agent → latexEditText (add Lerman 2003 content) → latexSyncCitations → latexCompile (produces PDF with sections on ML adaptation).

"Find code for unsupervised wrapper induction"

Research Agent → searchPapers (Dalvi 2011) → Code Discovery workflow (paperExtractUrls → paperFindGithubRepo → githubRepoInspect) → outputs runnable Python scripts for noisy training data wrappers.
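For the accuracy comparison in the first use case, extraction quality is usually summarized as precision, recall, and F1 over the set of extracted records. The sketch below uses only the standard library and made-up record identifiers; it does not reproduce numbers from any of the cited papers.

```python
def precision_recall_f1(extracted, gold):
    """Score a set of extracted records against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical record ids: two of three extractions are correct, one gold record is missed.
p, r, f1 = precision_recall_f1({"r1", "r2", "r3"}, {"r2", "r3", "r4"})
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```

When comparing figures reported in different papers, it is worth checking that they were computed at the same granularity (records versus individual fields) before putting them in one table.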

Automated Workflows

Deep Research workflow scans 50+ papers from citationGraph of Florescu et al. (1998), producing structured reports on evolution from supervised to unsupervised wrappers. DeepScan applies 7-step analysis with CoVe checkpoints to verify claims in Lerman et al. (2003) maintenance methods. Theorizer generates hypotheses on combining active learning (Muslea et al., 2006) with noise-tolerant induction (Dalvi et al., 2011).

Frequently Asked Questions

What is Automatic Wrapper Generation?

It induces wrappers via ML and patterns to extract structured data from semi-structured web pages, covering template detection and labeling.

What are key methods?

Methods include record-boundary discovery (Embley et al., 1999), active learning (Muslea et al., 2006), and noise-tolerant induction (Dalvi et al., 2011).

What are foundational papers?

Florescu et al. (1998, 560 citations) surveys database techniques; Wang and Lochovsky (2003, 335 citations) handles extraction and labeling.

What are open problems?

Challenges persist in wrapper maintenance against site changes (Lerman et al., 2003) and scaling unsupervised methods to diverse templates.
