Information Integration from the Web
The Web provides access to large data sources which are not explicitly
organized as databases. Instead, the information is presented as
semistructured data. In contrast to integrating classical
distributed databases, integration of such data raises several new
problems such as schema discovery, wrapping and reorganizing the data
sources, and coping with changes in autonomous sources.
The project started in 1997 at Freiburg University. The FLORID
system has been used for extraction and integration of semistructured
data from the Web.
In 1997-1999, the FLORID system was extended with Web access
capabilities (Versions 2.x). A methodology was developed for wrapping
and integrating HTML pages by mapping the information into an
integrated F-Logic data model that both represents the structure of the
data sources and contains an application-level model of the
information. HTML pages are wrapped using generic rules for the usual
structuring means (i.e., lists, tables, comma-lists, emphasized
keywords). The MONDIAL case study documents the practicability of the
approach.
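The idea of generic wrapping rules for lists and tables can be illustrated with a small sketch. This is not FLORID code and does not use F-Logic: it is a Python illustration built on the standard library's html.parser, and the names GenericWrapper, wrap and rows_to_records are hypothetical. It assumes flat (non-nested) lists and tables whose first row is a header row.

```python
from html.parser import HTMLParser


class GenericWrapper(HTMLParser):
    """Hypothetical generic wrapper: collects <li> items and table rows."""

    def __init__(self):
        super().__init__()
        self.lists = []    # one list of strings per <ul>/<ol>
        self.tables = []   # one list of rows per <table>
        self._text = None  # buffer for the item or cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.lists.append([])
        elif tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("li", "td", "th"):
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a list item or table cell.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._text is None:
            return
        value = "".join(self._text).strip()
        if tag == "li" and self.lists:
            self.lists[-1].append(value)
            self._text = None
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append(value)
            self._text = None


def wrap(html):
    """Apply the generic rules to one page; returns (lists, tables)."""
    parser = GenericWrapper()
    parser.feed(html)
    parser.close()
    return parser.lists, parser.tables


def rows_to_records(table):
    """Assumes the first row is a header; maps each later row to a dict."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]
```

In this sketch, wrap() captures only the page structure, while rows_to_records() is a first step towards an application-level view in which column headers become attribute names. In the approach described above, both levels are represented in one integrated F-Logic model instead.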
In 2000, FLORID was extended to FloXML with special functionality for
handling XML data (XML/DTD parsing, metadata provided by DTDs and
XML Schema, XML export functionality).
Documents (Florid, 1997-2000):
- Underlying Considerations on Semistructured Data:
Managing Semistructured Data with FLORID:
A Deductive Object-Oriented Perspective,
B. Ludäscher, R. Himmeröder, G. Lausen, W. May, and
C. Schlepphorst.
Information Systems, 23(8), Special Issue on Semistructured Data,
pp. 589-612, 1998.
- Architecture:
An Integrated Architecture for Exploring, Wrapping,
Mediating and Restructuring Information from the Web, Wolfgang
May, Australian Database Conference (ADC2000),
Jan. 31 - Feb. 3, 2000, Canberra, Australia. IEEE CS Press.
- The Underlying Web Model:
Modeling and Querying Structure and Contents of the Web, Wolfgang
May, International Workshop on Internet Data Management (IDM'99) at
DEXA'99 Workshop, Sept. 2, 1999, Firenze, Italy. IEEE Comp. Soc.
- Wrapping:
A Unified Framework for Wrapping, Mediating and Restructuring
Information from the Web, Wolfgang May, Rainer Himmeröder,
Georg Lausen, Bertram Ludäscher.
International Workshop on the World-Wide Web and Conceptual Modeling
(WWWCM'99), Nov. 15 - 18, 1999, Paris, France. Springer LNCS 1727,
pp. 307-320.
- A Long Version of the Report:
Information Extraction from the Web,
Wolfgang May, Georg Lausen; Technical Report 136, Institut
für Informatik, Universität Freiburg, 2000.
- Slides:
Information Extraction from the Web with FLORID, Wolfgang May,
Slides of the talk given at TU Vienna, November 5, 1999.
- Case-Study: The
MONDIAL
case study documents the practicability of the approach.
- A Retrospective Report and Conclusions:
A Uniform Framework for Integration of Information from the Web,
Wolfgang May and Georg Lausen;
Information Systems, 29(1), pp. 59-91, 2004.
Documents (FloXML, 2000):
The experiences with F-Logic and FLORID have been incorporated into
the LoPiX (Logic Programming in XML) project (2000 - 2003), which deals
with the integration of XML data.