Information Integration from the Web

The Web provides access to large data sources that are not explicitly organized as databases. Instead, the information is presented as semistructured data. In contrast to integrating classical distributed databases, integrating such data raises several new problems, such as schema discovery, wrapping and reorganizing the data sources, and coping with changes in autonomous sources.

The project started in 1997 at Freiburg University, where the FLORID system was used for the extraction and integration of semistructured data from the Web.

In 1997-1999, the FLORID system was extended with Web access capabilities (Versions 2.x). A methodology for wrapping and integrating HTML pages was developed: the information is mapped into an integrated F-Logic data model that both represents the structure of the data sources and contains an application-level model of the information. HTML pages are wrapped using generic rules for the usual structuring means (i.e., lists, tables, comma-lists, emphasized keywords). The MONDIAL case study documents the practicability of the approach.
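
To illustrate what a generic wrapping rule for lists and tables might look like, here is a minimal sketch in Python. It is not FLORID's rule set and it is not F-Logic: the class GenericWrapper, the helpers wrap and table_to_records, and the sample page are all hypothetical names and data chosen for this example. The sketch assumes flat (non-nested) lists and tables whose first row holds the attribute names.

```python
from html.parser import HTMLParser


class GenericWrapper(HTMLParser):
    """Hypothetical generic wrapper: collects list items and table rows."""

    def __init__(self):
        super().__init__()
        self.lists = []    # one list of item strings per <ul>/<ol>
        self.tables = []   # one list of rows per <table>; a row is a list of cells
        self._text = None  # buffer for the item or cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.lists.append([])
        elif tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("li", "td", "th"):
            self._text = []

    def handle_data(self, data):
        # Text outside an item or cell (e.g. layout whitespace) is ignored.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._text is None:
            return
        # Normalize runs of whitespace inside the collected text.
        value = " ".join("".join(self._text).split())
        if tag == "li" and self.lists:
            self.lists[-1].append(value)
            self._text = None
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append(value)
            self._text = None


def wrap(html):
    """Run the wrapper over one page and return the parser with its results."""
    parser = GenericWrapper()
    parser.feed(html)
    parser.close()
    return parser


def table_to_records(rows):
    """Assumption: the first row names the attributes, later rows are objects."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Illustrative page, not taken from the MONDIAL sources.
page = """
<table>
  <tr><th>name</th><th>capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Italy</td><td>Rome</td></tr>
</table>
<ul><li>Europe</li><li>Asia</li></ul>
"""

result = wrap(page)
records = table_to_records(result.tables[0])
```

Here records holds one attribute-value mapping per table row, and result.lists holds the items of each list. In the methodology described above, an application-level model would then be built on top of such a structural representation; that second step is not shown here.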

In 2000, FLORID was extended to FloXML with special functionality for handling XML data (XML/DTD parsing, metadata provided by DTDs and XML Schema, XML export functionality).

The experience gained with F-Logic and FLORID was incorporated into the LoPiX (Logic Programming in XML) project (2000-2003), which dealt with the integration of XML data.