Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Ziawasch Abedjan is Professor for “Databases and Information Systems” at Leibniz Universität Hannover. He is a Junior Fellow of the German Computer Science Society, a Fellow of the Berlin Institute for the Foundations of Learning and Data (BIFOLD), and a member of the L3S Research Center. Ziawasch Abedjan received his PhD at the Hasso-Plattner-Institut in Potsdam and received the best dissertation award of the University of Potsdam in 2014. After his PhD, he was a postdoctoral associate at MIT and later Junior Professor at TU Berlin. He was also a Senior Researcher at the German Research Center for Artificial Intelligence (DFKI) and a Visiting Academic at Amazon. Ziawasch has published more than 70 peer-reviewed papers in the area of data integration and data analytics and is a recipient of the SIGMOD 2019 Most Reproducible Paper Award, the SIGMOD 2015 Best Demonstration Award, and the CIKM 2014 Best Student Paper Award. His research is funded by the German Research Foundation (DFG) and the German Federal Ministry of Education and Research (BMBF).

About the Talk

Enabling automated data integration and cleaning has been a fundamental research goal for several decades, yet it remains difficult because the requirements heavily depend on the application scenario. Recent learning-based techniques rely on heavy parameter tuning by experts or on the provision of large amounts of labeled data, impeding their deployment in ad-hoc integration workflows.

In my talk, I discuss how we overcome the aforementioned problems by building systems that follow example-based and declarative paradigms.

Data Cleaning

Summary written by Nadin Abdelhamid and Christian Falkenberg

Motivation for Data Cleaning

In organizations, Data Quality directly influences decision-making quality, impacting analytics accuracy and revenue. Similarly, in AI, poor Data Quality leads to inaccurate inferences, which in turn harms decision-making. Enhancing Data Quality is therefore crucial, yet Data Cleaning is a tedious task that is hard to automate. With 80% of a data scientist's work dedicated to data preparation, including 60% for Data Cleaning and organizing, research into automating these processes is essential.

What is meant by Data Quality?

At its core, we're discussing errors, referred to as Data Errors, that degrade the quality of a dataset. These errors cover various types, including:
Integrity constraint violations, such as duplicate ID assignments.
Representation mistakes, like incorrect ordering of first and last names.
Column shifts and contradictions.
Missing values, including hidden missing values, i.e., values intentionally not disclosed by the user.
Typos and semantic errors.
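To make these error types concrete, here is a small, entirely invented example table in which each kind of error appears once (all values are hypothetical):

rows = [
    # id, name,            city,     zip
    (1,  "Anna Schmidt",   "Berlin", "10115"),
    (2,  "Müller, Jonas",  "Pariis", "75001"),   # name order swapped; typo in city
    (2,  "Lee Kim",        "20095",  "Hamburg"), # duplicate id; city/zip column shift
    (4,  "N/A",            "Munich", "10115"),   # hidden missing value; semantically wrong zip for Munich
]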

Traditional Data Cleaning Approaches

There are four main categories of traditional Data Cleaning approaches: Rule-Based Techniques, Pattern-Based Techniques, Statistical Techniques, and Master-Data/Knowledge-Base Approaches.
Pattern-Based Techniques are primarily used to ensure uniformity and to fix syntactical representations such as dates, categories, and missing values.
Statistical Techniques can identify outliers, including infrequent string occurrences and numerical anomalies.
Master-Data/Knowledge-Base Approaches leverage high-quality data from knowledge bases or crowdsourced judgments against which values can be compared.
Rule-Based Techniques employ rules, integrity constraints, and dependencies to ensure consistency in the data. One method involves parsing the rules to build a conflict graph and intelligently repairing the most conflicting values until consistency is achieved, as sketched below. However, errors for which no consistency rule is defined cannot be cleaned this way.
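As a hedged sketch of that rule-based procedure, assuming a single functional dependency zip → city and invented data, one could build the conflict graph and repair greedily like this:

from collections import Counter, defaultdict

# Invented rows violating the functional dependency zip -> city.
rows = [("10115", "Berlin"), ("10115", "Berlin"), ("10115", "Bärlin"),
        ("75001", "Paris"), ("75001", "Paris"), ("75001", "Pariss")]

def conflict_graph(rows):
    """Two rows conflict if they share a zip but name different cities."""
    graph = defaultdict(set)
    for i, (zi, ci) in enumerate(rows):
        for j, (zj, cj) in enumerate(rows):
            if i < j and zi == zj and ci != cj:
                graph[i].add(j)
                graph[j].add(i)
    return graph

# Greedy repair: take the cell involved in the most conflicts and replace its
# city with the majority city among its conflict partners, until consistent.
graph = conflict_graph(rows)
while graph:
    worst = max(graph, key=lambda r: len(graph[r]))
    majority = Counter(rows[j][1] for j in graph[worst]).most_common(1)[0][0]
    rows[worst] = (rows[worst][0], majority)
    graph = conflict_graph(rows)  # a real system would update this incrementally

print(rows)  # "Bärlin" -> "Berlin", "Pariss" -> "Paris"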

Where are we and what needs to be done?

Professor Abedjan and his colleagues identified two main findings during their literature review on detecting data errors. First, most research papers inject errors into clean data to test their methods rather than evaluating on naturally occurring errors. Second, studies typically compare only tools from the same category.
They aimed to conduct a new case study to assess how error detection techniques perform in real-world scenarios and how combining these techniques affects their performance. The focus was on determining the most reliable approaches, rather than simply the best ones.
To achieve this, they used five real-world datasets to evaluate eight error detection methods, both individually and in combination, measuring Precision, Recall, and F-Measure. Different ways of combining the tools were considered, for example whether all tools must agree on an error or only a subset of them; requiring agreement from more tools favors Precision, while accepting fewer votes favors Recall, as the sketch below illustrates.
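A minimal sketch of such a voting-based combination, with entirely made-up detector outputs ("min-k" here means a cell counts as an error if at least k detectors flag it):

# Made-up outputs of three error detectors over six cells (1 = flagged).
detector_votes = [
    [1, 1, 0, 1, 0, 0],  # detector A
    [1, 0, 0, 1, 1, 0],  # detector B
    [1, 1, 0, 0, 0, 1],  # detector C
]
ground_truth = [1, 1, 0, 1, 0, 0]  # hypothetical true error labels

def combine(votes, k):
    """Flag a cell if at least k detectors flag it (k=1: union; k=3: intersection)."""
    return [int(sum(col) >= k) for col in zip(*votes)]

def precision_recall(pred, truth):
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp / (tp + fp), tp / (tp + fn)

for k in (1, 2, 3):
    p, r = precision_recall(combine(detector_votes, k), ground_truth)
    print(f"min-{k}: precision={p:.2f} recall={r:.2f}")

On this toy data, the union (k=1) reaches full recall at low precision, while full agreement (k=3) is precise but misses errors.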
Among the conclusions: there is no single dominant tool, although effective ones exist. Performance varies notably across datasets, and even combining all methods does not ensure consistent results. They also discovered that enhancements to individual methods lead to only marginal improvements, emphasizing that combining techniques well is the more important lever.

In real-world scenarios, users prefer all-in-one tools. However, even AI-based tools require training and parameter tuning, and users often need to define constraints they may not know or may not wish to write. Moreover, labeling 5% of a large dataset for training is unrealistic, especially when dealing with sensitive organizational data that cannot be outsourced.

Example-based Cleaning

Example-based Cleaning aims to meet these requirements by allowing the user to specify what needs to be cleaned without dictating how it should be done. It is an approach that aims to autonomously detect and repair data errors without requiring user-defined constraints, relying instead on minimal user labeling. But solutions to the following problems are needed:
How to represent instances? Labeled typos may not cover all types, and the same value can be wrong or right depending on the context.
How to select instances? The dataset has a dirty-clean imbalance, making instance selection for labeling challenging. Also, all error types should ideally be labeled at least once.
How to model the classification task? As multiclass classification or generatively? And how to handle out-of-dataset corrections? Limiting solutions to values that already exist in the clean dataset may be problematic.

Professor Abedjan and his colleagues addressed these challenges by developing two tools for the Example-based Cleaning approach: Raha, dedicated to error detection, and Baran, responsible for managing the correction process.

RAHA

Raha is an example-based, configuration-free system for error detection. It takes a dirty dataset and a few labeled tuples as input and marks erroneous cells. Raha utilizes error detection tools from all categories in an ensemble, which again raises the question of how the strategies should be combined:
Unsupervised strategies such as top-k selection suffer from a precision/recall tradeoff and perform poorly. A semi-supervised search for the best combination would be possible, but it requires training data for a classification task. In any case, the individual tools and prediction models would still have to be tuned, which is not desired. Therefore, there are two goals:
We want to automatically generate algorithm configurations.
And we want to limit the number of labels: each error type that occurs should receive at least one label, rather than requiring a fixed number of labeled examples.

Raha's solution: because we do not want to tune each tool individually, Raha generates many configurations for each tool, and each configuration of an algorithm becomes its own error detection strategy. Each strategy marks every data cell as erroneous or clean with a binary value. Raha then builds a feature vector for each data cell from the outputs of all strategies. Based on these feature vectors, the cells can be clustered by similarity. The cluster assumption is that data points within one cluster likely belong to the same label class; once one data point in a cluster is labeled, the label can be propagated to all data points in that cluster.
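A minimal sketch of this pipeline in Python, with toy strategies standing in for Raha's generated configurations (the strategies, the data, and the use of scikit-learn's agglomerative clustering are illustrative assumptions, not Raha's actual implementation):

from collections import Counter
from sklearn.cluster import AgglomerativeClustering

# One column of cell values; "Bärlin" and "Pariss" are the errors (invented data).
cells = ["Berlin", "Berlin", "Bärlin", "Paris", "Paris", "Pariss"]
freq = Counter(cells)

# Toy stand-ins for generated detection strategies: each configuration maps a
# cell to 1 (suspicious) or 0 (clean). Invented for illustration only.
strategies = [
    lambda v: int(freq[v] == 1),                  # statistical: value is rare
    lambda v: int(v not in {"Berlin", "Paris"}),  # master data lookup
    lambda v: int(any(ord(c) > 127 for c in v)),  # pattern: non-ASCII character
]

# Feature vector per cell: the binary verdicts of all strategies.
features = [[s(v) for s in strategies] for v in cells]

# Cluster cells with similar feature vectors; by the cluster assumption,
# cells in one cluster likely share the same label.
clusters = AgglomerativeClustering(n_clusters=2).fit_predict(features)

# The user labels a single cell; its label propagates within the cluster.
dirty_cluster = clusters[cells.index("Bärlin")]  # user marks "Bärlin" as dirty
for cell, c in zip(cells, clusters):
    print(cell, "dirty" if c == dirty_cluster else "clean")

Here "Bärlin" and "Pariss" receive similar feature vectors, end up in the same cluster, and the single user label covers both.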

In this Raha example, a feature vector (one row in the table below) was formed for each data cell (Berlin, Paris, ...) from the generated error detection strategies (s1, s2, s3). The feature vectors of the data cells "Bärlin" and "Pariss" are similar. Based on the cluster assumption, we assume that they belong to the same label class, since similar values with similar errors fall into the same region of the feature space. Because "Bärlin" is marked as dirty, this label can be extended to "Pariss", which belongs to the same cluster. Smart instance sampling for labeling can then be conducted over the clusters to ensure that each suspected error type is labeled at least once.
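A plausible rendering of the feature table referenced above (the strategy outputs are assumed for illustration):

cell      s1   s2   s3
Berlin     0    0    0
Bärlin     1    1    1
Paris      0    0    0
Pariss     1    1    0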

This method consistently surpasses state-of-the-art standalone error detection tools with as few as 20 labeled tuples, while remaining a configuration-free error detection system.

BARAN

The correction engine Baran takes Raha's output, the dirty table with marked data errors, together with a few labeled tuples, and outputs transformation rules to correct the data errors. Challenges include:
A high precision requirement despite an infinite search space, as every string is a potential correction.
A high recall requirement, even though the correct value may lie outside the dataset, in contrast to the closed-world assumption that every answer can be found inside the dataset.
Keeping the number of labels very low, even though the approach is still supervised.

The solution involves two steps:
Generating a set of correction strategies.
Learning which of the proposed candidates are correct.

To generate correctors, and thus candidates for error correction, the context of each data error is utilized (vicinity-based, value-based, and domain-based). Using the user labels, Baran also measures the fit between each data error and each correction candidate, which allows a binary classification model to identify the most accurate correction among all proposed candidates for each data error. This correction model improves with additional user labels, and the correctors can also be pre-trained.
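A rough sketch of these two steps under stated assumptions: hand-rolled domain- and value-based correctors stand in for Baran's context models, and a logistic regression acts as the binary fit classifier (all names, features, and labels are invented):

import re
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

column = ["Berlin", "Paris", "Paris", "Hamburg"]  # clean values in the column

def candidates(error):
    """Invented stand-ins for Baran's correctors: propose repair candidates
    from the column domain and from a simple value-based transformation."""
    cands = set(column)                        # domain-based candidates
    cands.add(re.sub(r"(.)\1", r"\1", error))  # value-based: collapse doubled letters
    return cands

def features(error, cand):
    """Measure how well a candidate fits a data error (the 'fit' features)."""
    similarity = SequenceMatcher(None, error, cand).ratio()
    frequency = column.count(cand) / len(column)
    return [similarity, frequency]

# User-labeled (error, candidate, is_correct) pairs train a binary classifier
# that picks the best candidate. Labels invented for illustration.
labeled = [("Bärlin", "Berlin", 1), ("Bärlin", "Paris", 0),
           ("Haamburg", "Hamburg", 1), ("Haamburg", "Berlin", 0)]
model = LogisticRegression().fit(
    [features(e, c) for e, c, _ in labeled],
    [y for _, _, y in labeled])

# For a new data error, rank all candidates by the classifier's confidence.
error = "Pariss"
best = max(candidates(error),
           key=lambda c: model.predict_proba([features(error, c)])[0][1])
print(error, "->", best)  # "Pariss" is corrected to "Paris"

The classifier learns that high string similarity indicates a good fit, so "Paris" wins over the other candidates for the error "Pariss".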

With this method, Baran outperforms all state-of-the-art baselines with few labels, again fewer than 20, although its runtime is quite high.

How important is Data Cleaning?

In optimized pipelines, cleaning often does not significantly affect accuracy because outliers are ignored anyway. In the long run, however, noise and spurious correlations can be harmful, and explainability remains important.

Further research

Further research delves into cleaning multiple datasets simultaneously, investigating Task-driven data cleaning, and exploring the use of new technologies like LLMs.

Summary

Professor Ziawasch Abedjan provided a comprehensive overview of Data Cleaning, highlighting its challenges, and introduced Raha and Baran as advanced example-based, configuration-free systems for error detection and correction.