Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

12.02.2021

Two Papers accepted at ICDE 2021

We are happy to announce that two of our research papers have been accepted for presentation at the 37th IEEE International Conference on Data Engineering (ICDE 2021).  The conference will take place virtually between 19-22 April, 2021.

Relational Header Discovery using Similarity Search in a Table Corpus

Hazar Harmouch, Thorsten Papenbrock, Felix Naumann

Abstract

Column headers are among the most relevant types of meta-data for relational tables, because they provide meaning and context in which the data is to be interpreted. Headers play an important role in many data integration, exploration, and cleaning scenarios, such as schema matching, knowledge base augmentation, and similarity search. Unfortunately, in many cases column headers are missing, because they were never defined properly, are meaningless, or have been lost during data extraction, transmission, or storage. For example, around one third of the tables on the Web have missing headers.

Missing headers leave abundant tabular data shrouded and inaccessible to many data-driven applications. We introduce a fully automated, multi-phase system that discovers table column headers for cases where headers are missing, meaningless, or unrepresentative for the column values. It leverages existing table headers from web tables to suggest human-understandable, representative, and consistent headers for any target table. We evaluate our system on tables extracted from Wikipedia. Overall, 60% of the automatically discovered table headers are exact and complete. Considering more header candidates, top-5 for example, increases this percentage to 72%.

Structured Object Matching across Web Page Revisions

Tobias Bleifuß, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, Divesh Srivastava

Abstract

A considerable amount of useful information on the web is (semi-)structured, such as tables and lists. An extensive corpus of prior work addresses the problem of making these human-readable representations interpretable by algorithms. Most of these works focus only on the most recent snapshot of these web objects. However, their evolution over time represents valuable information that has barely been tapped, enabling various applications, including visual change exploration and trust assessment. To realize the full potential of this information, it is critical to match such objects across page revisions.

In this work, we present novel techniques that match tables, infoboxes and lists within a page across page revisions. We are, thus, able to extract the evolution of structured information in various forms from a long series of web page revisions. We evaluate our approach on a representative sample of pages and measure the number of correct matches. Our approach achieves a significant improvement in object matching over baselines and over related work.