The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often, interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats are not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval.
Hence, web users and applications need a smart way of extracting data from these web sources. One popular approach is to write wrappers around the sources, either manually or with software assistance, to bring the web data within the reach of more sophisticated query tools and general mediator-based information integration systems.
In this paper, we describe the methodology and the software development of XWRAP, an XML-enabled wrapper construction system for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content filtering process is performed against the XML documents.
The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates the tasks of building wrappers that are specific to a Web source from the tasks that are repetitive for any source, and uses a component library to provide basic building blocks for wrapper programs.
Second, it provides inductive learning algorithms that derive or discover wrapper patterns by reasoning about sample pages or sample specifications. Third and most importantly, we introduce and develop a two-phase code generation framework.
The first phase utilizes an interactive interface facility to encode the source-specific metadata knowledge identified by individual wrapper developers as declarative information extraction rules.
The second phase combines the information extraction rules generated in the first phase with the XWRAP component library to construct an executable wrapper program for the given web source. The two-phase code generation approach exhibits a number of advantages over existing approaches.
First, it provides a user-friendly interface that lets users generate their information extraction rules with a few mouse clicks. Second, it provides a clean separation of the information extraction semantics from the generation of procedural wrapper programs (e.g., Java code).
Such separation allows new extraction rules to be incorporated into a wrapper program incrementally. Third, it facilitates the use of the micro-feedback approach to revisit and tune the wrapper programs at run time. We report the performance of XWRAP and our experiments, demonstrating the benefit of building wrappers for a number of Web sources in different domains using the XWRAP generation system.
The architecture of XWRAP for data wrapping consists of four components: Syntactical Structure Normalization, Information Extraction, Code Generation, and Program Testing and Packaging. Figure 1 illustrates how the wrapper generation process works in the context of a data wrapping scenario.
Syntactical Structure Normalization is the first component, also called the Syntactical Normalizer. It prepares and sets up the environment for the information extraction process by performing the following three tasks. First, the syntactical normalizer accepts a URL selected and entered by the XWRAP user, issues an HTTP request to the remote server identified by the given URL, and fetches the corresponding web document (the so-called page object). This page object is used as a sample for XWRAP to interact with the user to learn and derive the important information extraction rules. Second, it cleans up bad HTML tags and syntactical errors.
Third, it transforms the retrieved page object into a parse tree, the so-called syntactic token tree. Information Extraction is the second component, which is responsible for deriving extraction rules that use declarative specifications to describe how to extract the information content of interest from its HTML formatting.
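The three tasks of the Syntactical Normalizer (fetch, clean up, build a token tree) can be illustrated with a minimal sketch. This is not XWRAP's implementation: it uses Python's standard library, and the names Node, TokenTreeBuilder, fetch_page, normalize, and outline are hypothetical. The cleanup policy shown (close unclosed tags when an enclosing tag closes, drop stray end tags) is an assumption chosen for brevity.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.request import urlopen

# Tags that never have a closing tag in HTML (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


@dataclass
class Node:
    """One node of the token tree: an element or a '#text' leaf."""
    tag: str
    children: list = field(default_factory=list)
    text: str = ""


class TokenTreeBuilder(HTMLParser):
    """Builds a tree from possibly malformed HTML (assumed repair policy)."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        open_tags = [n.tag for n in self.stack[1:]]
        if tag not in open_tags:
            return  # stray end tag: drop it
        # Close any unclosed descendants, then the matching element itself.
        while self.stack[-1].tag != tag:
            self.stack.pop()
        self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data.strip()))


def fetch_page(url):
    """Task 1: issue an HTTP request and return the page object as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def normalize(html):
    """Tasks 2 and 3: tolerate bad markup and return the token tree root."""
    builder = TokenTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node):
    """Render the tree as nested tuples, which is convenient for inspection."""
    if node.tag == "#text":
        return node.text
    return (node.tag, [outline(child) for child in node.children])
```

Calling normalize(fetch_page(url)) on a user-supplied URL would chain the three tasks; outline makes the resulting tree easy to compare against what the user sees in the page.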
XWRAP performs the information extraction task in three steps:
(1) identifying interesting regions in the retrieved document,
(2) identifying the important semantic tokens and their logical paths and node positions in the parse tree, and
(3) identifying the useful hierarchical structures of the retrieved document. Each step results in a set of extraction rules specified in declarative languages.
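To make the three kinds of rules concrete, the sketch below expresses them as plain data and applies them to a small, already normalized sample page. The rule format, the key names, the sample data, and the function apply_rules are all hypothetical and are not XWRAP's rule language; the paths use the limited XPath subset supported by Python's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed sample page standing in for a normalized page object.
SAMPLE = """<html><body>
<h1>Weather</h1>
<table>
<tr><td>Station</td><td>Temp</td></tr>
<tr><td>Atlanta</td><td>71</td></tr>
<tr><td>Portland</td><td>58</td></tr>
</table>
</body></html>"""

# Declarative rules (hypothetical format), one group per extraction step.
RULES = {
    # Step 1: the interesting region, as a path from the page root.
    "region": "./body/table",
    # Step 2: semantic tokens, each with a path and node position in a record.
    "tokens": {"station": "td[1]", "temp": "td[2]"},
    # Step 3: hierarchical structure: which nodes form records, how many
    # leading records are headers, and the XML tags to emit.
    "record": "tr",
    "skip_records": 1,
    "root_tag": "report",
    "record_tag": "observation",
}


def apply_rules(page, rules):
    """Interpret the rules against a parsed page and return an XML element."""
    region = page.find(rules["region"])
    out = ET.Element(rules["root_tag"])
    records = region.findall(rules["record"])[rules["skip_records"]:]
    for row in records:
        record = ET.SubElement(out, rules["record_tag"])
        for name, path in rules["tokens"].items():
            ET.SubElement(record, name).text = row.find(path).text
    return out


def wrap_sample():
    """Apply RULES to SAMPLE and return the wrapped document as a string."""
    return ET.tostring(apply_rules(ET.fromstring(SAMPLE), RULES), encoding="unicode")
```

Because the rules are data rather than code, a new token can be added by editing the "tokens" entry alone, which mirrors the incremental incorporation of rules that the two-phase approach aims for.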
Code Generation is the third component, which generates the wrapper program code by applying the three sets of information extraction rules produced by the second component.
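The idea of turning declarative rules into an executable wrapper by combining them with reusable components can be sketched as follows. This is an illustrative assumption rather than XWRAP's generator: it emits Python source instead of Java, and the rule keys, the two "component library" functions, and the names generate_wrapper and load_wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET


# --- Stand-in component library: source-independent building blocks. ---
def select_region(page, path):
    """Return the region of the page identified by a path."""
    return page.find(path)


def emit_record(parent, tag, fields):
    """Append one record element with one child element per field."""
    record = ET.SubElement(parent, tag)
    for name, value in fields.items():
        ET.SubElement(record, name).text = value


COMPONENTS = {"ET": ET, "select_region": select_region, "emit_record": emit_record}


def generate_wrapper(rules):
    """Translate source-specific rules into the source text of a wrapper."""
    lines = [
        "def wrap(page):",
        f"    region = select_region(page, {rules['region']!r})",
        f"    out = ET.Element({rules['root_tag']!r})",
        f"    for row in region.findall({rules['record']!r}):",
        "        fields = {}",
    ]
    for name, path in rules["tokens"].items():
        lines.append(f"        fields[{name!r}] = row.find({path!r}).text")
    lines.append(f"        emit_record(out, {rules['record_tag']!r}, fields)")
    lines.append("    return out")
    return "\n".join(lines) + "\n"


def load_wrapper(source):
    """Link generated source against the component library and return wrap()."""
    namespace = dict(COMPONENTS)
    exec(source, namespace)
    return namespace["wrap"]


# Made-up rules and page for a hypothetical catalog source.
CATALOG_RULES = {
    "region": "./body/ul",
    "record": "li",
    "tokens": {"title": "b", "price": "i"},
    "root_tag": "catalog",
    "record_tag": "item",
}
CATALOG_PAGE = "<html><body><ul><li><b>Book</b><i>12</i></li></ul></body></html>"
```

A typical use would be wrap = load_wrapper(generate_wrapper(CATALOG_RULES)) followed by wrap(ET.fromstring(CATALOG_PAGE)). Only generate_wrapper depends on the rules, while the components stay the same across sources, which is the separation the architecture describes.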