Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Felix Naumann is a full professor and head of the Information Systems group at the Hasso Plattner Institute in Potsdam, Germany. He is an active member of the research community and regularly serves on the organizing committees of conferences such as VLDB and SIGMOD. His group focuses, among other topics, on metadata management, information integration, data quality, data cleaning, and data profiling.

About the Talk

Data profiling comprises a broad range of methods to efficiently extract various metadata from a given dataset, including data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and various other data dependencies. The talk highlights the key insights behind recent state-of-the-art methods and presents various use cases in the areas of data cleaning and data integration: violations of dependencies point to errors in the data; key discovery identifies the core entities of a data source; inclusion dependencies are candidates for joining multiple sources; and in general, data profiling results can be used to organize data lakes.

Data Profiling

Summary written by Fabian Mahling and Jan Schneider

Data profiling refers to the activity of creating small but informative summaries of a database. ~ Ted Johnson, Encyclopedia of Database Systems

These data profiling activities can be classified into different kinds of tasks - single-column tasks, multi-column tasks, and dependencies. Single-column and multi-column tasks are computationally rather simple, e.g. determining cardinalities, value distributions, correlations, or clusters and outliers. Dependencies, such as Unique Column Combinations (UCCs), Functional Dependencies (FDs), and Inclusion Dependencies (INDs), can be very costly to compute.
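
To make the simpler tasks concrete, the following is a minimal sketch of single-column profiling; the column values are illustrative and not taken from the talk.

```python
from collections import Counter

# Illustrative column values (not from the talk).
values = ["red", "blue", "red", "green", None, "red"]

non_null = [v for v in values if v is not None]
cardinality = len(set(non_null))            # number of distinct values
completeness = len(non_null) / len(values)  # share of non-null values
distribution = Counter(non_null)            # value frequencies

print(cardinality)    # 3
print(completeness)   # 0.833...
print(distribution)   # Counter({'red': 3, 'blue': 1, 'green': 1})
```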

Use Cases

Data profiling can be used in a variety of use cases. In query optimization, cardinalities, histograms, and counts play a crucial role in cost estimation and form the statistical basis on which a concrete query plan is chosen. In data cleansing, patterns and rules found via data profiling can be used to identify violations. Dependencies that are found across multiple datasets can help in data integration processes. In database reverse engineering, data profiling is used to recover the underlying data model by helping to understand the relations between the entities. Data profiling also serves as a prerequisite for data preparation.

Problems

Commercial and research tools are available for many data profiling tasks. Yet most are complex to configure or do not offer a convincing way to view and interpret the results. For example, IBM's Information Analyzer requires the user to input potential foreign key candidates, which are then analyzed via an SQL query. Thus, there is no completely automatic discovery, and the actual checking of a candidate does not scale well, as the whole query is run even if violations are found early on. Therefore, even if these tools support the required tasks, running them and interpreting their results can be very tedious.

In data profiling one does not simply analyze single columns, but instead looks at combinations of multiple columns, e.g. column A depends on columns B, C, and D. To check every key candidate of a table with 100 columns, about 1.3 nonillion column combinations (a number with 31 digits) would have to be checked. This is because the number of non-empty column combinations grows exponentially with the number of columns (2^n - 1). Data profiling tasks can thus, at least in theory, be computationally highly expensive.
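
For illustration, the candidate count from the paragraph above can be reproduced in a few lines (a sketch, not part of the talk):

```python
# Number of non-empty column combinations for n columns: 2^n - 1.
n = 100
candidates = 2 ** n - 1
print(candidates)            # 1267650600228229401496703205375
print(len(str(candidates)))  # 31 digits, i.e. roughly 1.3 nonillion
```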

Uniqueness, Keys, and Foreign Keys

Unique Column Combinations (UCCs) are combinations of columns for which every value combination of the specified columns is unique.

A | B
1 | a
1 | b
2 | b

While in the given table neither A nor B is unique by itself, their combination (A, B) is a UCC, as no value combination occurs twice. A UCC is considered minimal if none of its proper column subsets is unique. Minimal UCCs are often used to determine key candidates (with the additional requirement that they contain no null values).
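
A minimal sketch of such a uniqueness check on the example table above; the dictionary-based row representation is an assumption made for illustration.

```python
# Example table from above, one dict per row.
rows = [
    {"A": 1, "B": "a"},
    {"A": 1, "B": "b"},
    {"A": 2, "B": "b"},
]

def is_ucc(rows, columns):
    """A column combination is unique if no projected value combination repeats."""
    projection = [tuple(row[c] for c in columns) for row in rows]
    return len(projection) == len(set(projection))

print(is_ucc(rows, ["A"]))       # False: the value 1 occurs twice
print(is_ucc(rows, ["B"]))       # False: the value "b" occurs twice
print(is_ucc(rows, ["A", "B"]))  # True: all (A, B) pairs are distinct
```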

Lattice

The number of candidates grows exponentially with the number of columns (2^n - 1). Thus, the task of finding all UCCs is often reduced to finding all minimal UCCs. Given the minimal UCCs, we can "reconstruct" all UCCs by simply adding further columns to them, as adding columns does not affect uniqueness. A maximal non-unique combination, on the other hand, is a non-unique combination that becomes unique as soon as any further column is added. All subsets of a (maximal) non-unique combination are also non-unique.

To efficiently work on or visualize UCC discovery, a lattice can be used. A lattice is a partially ordered set in which each pair of elements has a unique supremum (here: the union) and infimum (here: the intersection). Lattices are visualized using Hasse diagrams, where each element is a node and its edges go upward to the smallest larger element(s).

Lattice structure for a five-column table (A,B,C,D,E), with a minimal UCC (A,D) and a maximal non-unique (B,C,D,E).

UCC Discovery Algorithms

UCC discovery algorithms can be divided into three categories - column-based, row-based, and hybrid.
Row-based algorithms look at the rows of the table and derive UCCs by, for example, finding duplicate value combinations. This usually leads to algorithms that scale well with the number of columns but rather badly with the number of rows.
Column-based algorithms traverse the lattice and perform uniqueness checks on its nodes. Depending on the order of traversal, fewer or more checks have to be performed, as the search space can be pruned whenever a UCC is encountered.
Hybrid algorithms combine column- and row-based ideas. HyUCC, for example, switches back and forth between a column- and a row-based strategy to discover UCCs. In the ideal case, this leads to good scaling with both the number of columns and the number of rows.

The column-based Apriori algorithm follows a level-wise, bottom-up idea: first, each combination on the first level is checked for uniqueness, then those on the second level, the third, and so on. Once a UCC is encountered, we know that it is a minimal UCC and can prune away all the search space "above" it, as its supersets are also unique. This way, many checks, and ultimately a lot of computation time, can be saved whenever a UCC is encountered.
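
The following is a simplified sketch of this level-wise search with superset pruning; it is not the exact algorithm from the talk, and the helper names are chosen for illustration.

```python
from itertools import combinations

def discover_minimal_uccs(rows, columns):
    """Level-wise (Apriori-style) search for minimal UCCs with superset pruning."""
    def is_unique(cols):
        projection = [tuple(row[c] for c in cols) for row in rows]
        return len(projection) == len(set(projection))

    minimal_uccs = []
    for level in range(1, len(columns) + 1):
        for candidate in combinations(columns, level):
            candidate_set = set(candidate)
            # Prune: supersets of a known minimal UCC are unique but not minimal.
            if any(ucc <= candidate_set for ucc in minimal_uccs):
                continue
            if is_unique(candidate):
                minimal_uccs.append(candidate_set)
    return minimal_uccs

# On the small example table from above, this finds the single minimal UCC {A, B}.
```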

A lattice traversal performed by DUCC.

DUCC, an algorithm co-developed by Felix Naumann, takes another approach. For most datasets, the minimal UCCs are distributed randomly across the lattice. Thus, instead of traversing the lattice bottom-up or top-down, DUCC walks the lattice randomly to find a UCC and then traverses it along the boundary between UCCs and non-UCCs. This saves a lot of checks, and the final runtime mainly depends on the size of the solution set itself. Therefore, DUCC scales with the result size.

Functional Dependencies

A Functional Dependency (FD) describes a relationship between a set of columns A and a column B. We have an FD A → B (read "A determines B") if, for every two rows whose values in A are identical, the values in B are also identical.
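
A minimal sketch of checking a single FD candidate; the row representation and example values are assumptions for illustration and are not taken from the talk's table.

```python
def holds_fd(rows, lhs, rhs):
    """Check lhs -> rhs: rows that agree on all columns in lhs must agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False  # two rows agree on the LHS but differ on the RHS
        seen[key] = row[rhs]
    return True

rows = [
    {"Person": "P1", "House": "Stark", "Seat": "Winterfell"},
    {"Person": "P2", "House": "Stark", "Seat": "Winterfell"},
    {"Person": "P3", "House": "Lannister", "Seat": "Casterly Rock"},
]
print(holds_fd(rows, ["House"], "Seat"))   # True
print(holds_fd(rows, ["Seat"], "Person"))  # False: same Seat, different Person
```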

A table containing main characters of Game of Thrones with some attributes and Functional Dependencies that can be derived from it. However, King Joffrey violates FD #4, so strictly speaking FD #4 is not an FD.

In the image above, we can see some example FDs, where Person → Lineage means that if we know the person, we also know the person's lineage. This does not really add any information here, as Person is unique and therefore, by definition, determines every other column.

Use Cases & Discovery

FDs can be used for the normalization of databases. However, one should pay attention to whether the FDs are reasonable: if they are merely coincidental or trivial (see Person → Lineage above), normalizing based on them probably makes little sense. FDs can also be used in data cleansing, query optimization, schema design, and key discovery.

Candidate Set Growth for FDs for different numbers of columns and levels inside the lattice.

The naive (brute-force) discovery of FDs is in O(n^2 * 2^m * (m/2)^2), with n being the number of rows and m the number of columns. The search space is larger than the one for UCC discovery, as each column combination on the left-hand side can (on average) be combined with half of the remaining columns as the right-hand side.

We can again visualize the discovery problem as a lattice. Here, each edge represents an FD candidate: the lower node of the edge forms the left-hand side, and the column added in the upper node is the right-hand side.

Lattice structure for a five-column table (A,B,C,D,E). Each edge represents an FD candidate.

FD discovery algorithms can again be categorized as row-based, column-based, or hybrid. Felix Naumann and Thorsten Papenbrock have developed HyFD, a hybrid FD discovery algorithm. It combines a row-based and a column-based approach by using the row-based approach for FD candidate discovery and the column-based approach for candidate validation. HyFD switches to validation when only few new candidates can be found per row comparison, and switches back to discovery when only few new FDs can be confirmed per validation attempt. This yields an approach that is both column- and row-efficient.

Inclusion Dependencies

Inclusion Dependencies (INDs) can be divided into unary and n-ary INDs. A unary IND A ⊆ B states that every value in column A is also contained in column B. An n-ary IND between two column lists (C_1, ..., C_i) and (D_1, ..., D_i) states that every value combination occurring in C_1, ..., C_i also occurs in D_1, ..., D_i. This leads to an even larger search space than the one for FDs. Thus, the trivial (brute-force) discovery algorithm for n-ary INDs is in O(2^n * n!), with n here denoting the number of columns. INDs can be used to detect unknown foreign keys.
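
A minimal sketch of checking a single unary IND candidate; the column names are illustrative and not from the talk.

```python
def holds_unary_ind(dependent_values, referenced_values):
    """A ⊆ B: every value of the dependent column also appears in the referenced column."""
    return set(dependent_values) <= set(referenced_values)

# Illustrative columns: order.customer_id ⊆ customer.id suggests a foreign key.
order_customer_ids = [1, 2, 2, 3]
customer_ids = [1, 2, 3, 4]
print(holds_unary_ind(order_customer_ids, customer_ids))  # True: foreign-key candidate
```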

Outlook - Other Dependencies and More

In addition to UCCs, FDs, and INDs, there exist various other dependencies, such as Order Dependencies (ODs), Multi-Valued Dependencies (MVDs), or Denial Constraints (DCs).
Dependencies can also be extended with conditions, so that a dependency is only required to hold on a subset of the data. Partial dependencies describe dependencies that do not hold perfectly for the whole dataset but may be violated up to a set fault tolerance.
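
As a rough illustration of a partial FD check, the sketch below counts, per left-hand-side group, the rows that deviate from the group's most frequent right-hand-side value. This particular error measure and the threshold are assumptions made for illustration; the talk does not prescribe a specific measure.

```python
from collections import Counter, defaultdict

def fd_violation_ratio(rows, lhs, rhs):
    """Fraction of rows that would have to be removed for lhs -> rhs to hold exactly."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in lhs)].append(row[rhs])
    violations = sum(len(values) - Counter(values).most_common(1)[0][1]
                     for values in groups.values())
    return violations / len(rows)

def holds_partial_fd(rows, lhs, rhs, tolerance=0.05):
    """A partial FD holds if at most `tolerance` of the rows violate it."""
    return fd_violation_ratio(rows, lhs, rhs) <= tolerance
```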

Metanome is a tool developed at the chair that allows users to develop, run, and evaluate data profiling algorithms. It is written in Java and already ships with 37 data profiling algorithms to compare against.

Summary

Felix Naumann introduces the field of data profiling and his work in the area. In detail, he explains different dependencies, such as Unique Column Combinations (UCCs) and Functional Dependencies (FDs), and how the theoretically very complex task of discovering these dependencies can be tackled through "smart scaling", i.e. smarter algorithms. DUCC and HyFD are presented, two algorithms for the discovery of UCCs and FDs developed at Naumann's chair. As an outlook, he also briefly presents other dependencies and Metanome, a data profiling tool developed at his chair.

Sources

All images are taken from Felix Naumann's presentation Data Profiling - a Primer (WiSe 23/24).