Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Project Background

The wealth of freely available, structured information on the Web is constantly growing. This is especially true for public data from and about governments and administrations. Data‐providing projects, such as DBpedia and Freebase from the linked open data community, as well as structured data from domain‐specific sites, such as senate.gov, USASpending.gov, or epp.eurostat.ec.europa.eu, make it possible to integrate data from multiple sources and thus create new data sets with added value. The recent appointment of Tim Berners‐Lee to lead a review on how the UK government can open up access to official information reinforces this trend. However, the integration of such data sources is far from trivial: apart from the technical difficulties of accessing the data, structural and semantic differences in the data must be overcome. In particular, the various data sets must be standardized, transformed to a common structure, cleaned, and finally consolidated into a single, consistent and complete data set.

 

Cooperation Partner

Midas is a joint project between IBM’s Almaden Research Lab and IBM’s Silicon Valley Lab. The project’s goal is to provide an end‐to‐end framework for the process of integrating heterogeneous Web data into a common, clean and consistent data set. Individual components extract data, scrub it in a source‐specific manner, identify common entities across multiple sources, transform data to a common structure, and finally fuse possibly conflicting data into a value‐added, rich and clean data set. Underlying technologies of this Java‐based tool include the relational and JSON data models, database operations based on SQL and Jaql, text extraction rules, similarity‐based duplicate detection, mapping‐based data transformation, and data fusion.
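
To make the duplicate detection and fusion steps more concrete, the following is a minimal sketch in Python. It is not Midas code: the function names (similarity, find_duplicates, fuse), the record fields, the sample records and the threshold of 0.85 are all assumptions chosen for illustration, and the string similarity from Python's standard difflib module merely stands in for whatever similarity measures Midas actually uses.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, key="name", threshold=0.85):
    """Return index pairs whose `key` values are at least `threshold` similar.

    Naive pairwise comparison; the threshold is an assumed value.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i][key], records[j][key]) >= threshold:
                pairs.append((i, j))
    return pairs


def fuse(a, b):
    """Merge two duplicate records: keep a's values, fill its gaps from b."""
    fused = dict(a)
    for field, value in b.items():
        if fused.get(field) in (None, ""):
            fused[field] = value
    return fused


# Hypothetical records, as they might arrive from two different sources.
records = [
    {"name": "John A. Smith", "state": "CA", "party": None},
    {"name": "John A Smith", "state": "", "party": "D"},
    {"name": "Maria Lopez", "state": "TX", "party": "R"},
]

duplicate_pairs = find_duplicates(records)
fused_records = [fuse(records[i], records[j]) for i, j in duplicate_pairs]
```

The fusion rule here simply prefers the first record and fills missing values from the second. Resolving genuinely conflicting values, for example by source reliability or recency, would need a more elaborate strategy.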

Project Description

The goal of the bachelor project is to establish the government domain for Midas. This goal includes discovering relevant data sources from both the US and the EU, exploring and extracting data from those sources, developing methods for scrubbing such data, developing techniques to discover duplicate entries and links among data, and finally fusing duplicate data entries. Thus, the project team will add a new set of sources to Midas, adapt existing techniques for the individual steps, and possibly develop new techniques that are more suitable for the domain and use case.

 

Midas team (left to right): Tobias Schmidt, Norman Höfler, Andrina Mascher, Stefan George, Martin Köppelmann, Claudia Lehmann, Markus Freitag