Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Prof. Torsten Grust is a Professor and the head of the Data Research Group at the University of Tübingen since 2008. Before this, Prof. Grust worked as a professor of Database Systems at the TU Muenchen and TU Clausthal. In 1999, he received his Ph.D. in Computer Science from the University of Konstanz, Germany.

He and his team's research is centered around databases and is mostly concerned about query languages, examples for this are query optimization, rewriting, transformation and compilation. While doing their research they often take concepts from procedural and functional languages and bring them closer to the relational database systems.

About the Talk

We derive new iterative CTE variants from the simple loop-based, operational semantics of SQL:1999’s WITH RECURSIVE. In the absence of fixpoint-based semantics and monotonicity restrictions, these CTE variants (1) can hold onto as well as forget the results of earlier loop iterations or (2) maintain iteration results in a keyed table, enabling a SQL authoring style that mimics imperative algorithms. We exercise the new variants using a series of examples to demonstrate that this fresh look at CTEs has a beneficial impact on the readability and performance of iterative SQL queries.

A Fix for the Fixation on Fixpoints

Summary written by Lucas Kerschke and Felix Koch

Background

A Common Table Expression (CTE) is a named temporary result set in a query that can be referenced within other statements. It defines a query that can be referred to later within a larger SQL Statement. One can initiate it using the keyword WITH and then provide a name and optionally specify some column names. This is followed by the keyword AS and then the temporary query:

WITH cte_name (column1, column2, ...) AS ( -- CTE query here SELECT column1, column2, ... FROM some_table WHERE some_condition ) -- Main query that references the CTE SELECT * FROM cte_name;

A recursive CTE is a query, that refers to itself. A recursive CTE is structured like a normal CTE and has the keyword WITH RECURSIVE to start it off. It basically consists of two different parts, an anchor member, that is the initial query which is the starting point for the recursion and a recursive member which refers to the CTE itself. The UNION ALL combines these two parts:

WITH RECURSIVE cte_name (column1, column2, ...) AS ( -- Anchor member (initial query) SELECT column1, column2, ... FROM some_table WHERE condition UNION ALL -- Recursive member (refers to the CTE itself) SELECT column1, column2, ... FROM cte_name WHERE condition ) -- Final query that references the CTE SELECT * FROM cte_name;

Problems

While the WITH RECURSIVE keyword is a part of SQL since 1999, it is still considered an obscure and difficult to understand feature by many.

Even some of the most experienced SQL programmers struggle with it. As shown in the lecture, the programmer named ‘snth’ tried to implement a calculation for the fibonacci series with SQLs WITH RECURSIVE and was puzzled by it’s behavior.

A big part of the difficulty comes from the requirement that the recursive member of the CTE (in the following called q) has to be monotonic. This excludes many potentially useful operations like aggregations, groupings and more.

If a query developer requires these operations, they use workarounds. Which aren’t always straightforward and often lead to unreadable syntax or unwanted behavior.

Additionaly, the semi-naive evaluation of the query means that the recursive query only ever gets those rows as input, which the previous evaluation returned. While this can be useful for some algorithms, if a query does need to inspect the history of previous results, there is no built in way to do that.

If those results are needed, developers try to build in their own version of history management, for example by adding additional columns to store previous results. This not only makes crafting such queries very difficult, it also leads to inefficiencies.

Solution

Grust proposed a revolutionary solution: the WITH ITERATIVE keyword.

Per default WITH ITERATIVE does not create history of all results anymore. This makes it clear that really only the previous result is available for evaluation. Now, instead of returning a table of all intermediate results, only the last non-zero result is returned. This means, that much less memory is used during the recursion. Additionally, new keywords are specified to allow q to access and maintain some sort of history.

WITH ITERATIVE tab(c1,...cn) AS ( initial_query() UNION ALL recursive_query(tab) ) SELECT * FROM tab;

Goal

The aim is to use the simple and loop-based behavior everybody knows from CTEs and leverage the already existing CTE infrastructur, but also lift some restrictions on q, mainly the monotonicity.

Main ideas, methodology, approach

Variant 1: WITH ITERATIVE … KEY

The first proposed variant involves using WITH ITERATIVE in conjunction with an updatable, keyed dictionary. This approach allows the query author to control the distribution of keys within the dictionary. When a new key is assigned for rows in each iteration (for example, based on the cumulative iteration count), it requires a significant amount of space as the dictionary effectively operates in append-only mode. However, if keys from previous iterations are reused, the upsert behavior comes into play, potentially keeping the dictionary compact. This overwriting process bears similarity to imperative programming languages, where the current value of a variable can also be updated through overwriting. In each iteration, q can read the entire dictionary, ensuring continued access to the data.

WITH ITERATIVE tab(c1,...cn) KEY(k1,...,kn) AS ( initial_query() UNION ALL recursive_query(tab, RECURRING(tab)) ) SELECT * FROM tab;

This approach was then used by Professor Grust to showcase a basic Graph-based algorithm. He wanted to map each node in a graph to a component. In a component one node can always reach another node via the edges, if two nodes are not in the same component, they can also not reach each other. This algorithm was then described and the runtime and space consumption of the WITH ITERATIVE approach was compared to the vanilla WITH RECURSIVE function. The runtime of the proposed approach is always one to two magnitues lower than the vanilla one. The same goes for the memory consumption, where the variant with the keyed dictionary always consums less memory.

Variant 2: WITH ITERATIVE … TTL

The second proposed variant is the usage of WITH ITERATIVE in combination with a time to live (TTL) value.

Now the DBMS collects two history-tables. One like it does with WITH RECURSIVE, where it appends every result and which is not accessible by q. And a second one, which is accessible inside the query by using RECURRING(<table_name>). After every iteration the ttl value of all rows in there is aged (decremented) and those which reach zero are removed from that table.

WITH ITERATIVE tab(ttl, c1, ...cn) TTL(ttl) AS ( initial_query() UNION ALL recursive_query(tab, RECURRING(tab)) ) SELECT * from tab;

This allows a query developer to make of some previous results in a controlled manner, as they do not have to manage this history inside custom columns of the result. Additionally, because they can control the value of ttl they have some control over the required memory.

Professor Grust then presented an algorithm that parses a string, with the condition that the grammar implemented by the parser must be in Chomsky normal form (CNF). The TTL algorithm demonstrated remarkable efficiency, being 2 to 3 times faster than naive expiry while simultaneously expiring a significantly larger number of rows in the process.

An straightforward example for a use for TTL would be computing the Fibonacci series:

WITH ITERATIVE fib(ttl, n , F) TTL(ttl) AS ( (SELECT 1 AS ttl, 0 AS n, 0 AS F UNION ALL SELECT 2 AS ttl, 1 AS n, 1 AS F) UNION ALL SELECT 2 AS ttl, 1 + MAX(n) AS n, SUM(F) FROM RECURRING(fib) AS F ) TABLE fib LIMIT 10;

Here the anchor member (also called q1) creates the two base cases for Fibonacci as two rows: (1, 0, 0), (2, 1, 1). Then q accesses all currently “alive” rows by selecting from RECURRING(fib) and calculates the next row for the sequence.

By setting ttl to 2 for each new row, this CTE ensures that each row is only available to the next two calls of q.

Summary

In summary, Torsten Grust’s lecture proposed a groundbreaking solution to challenges associated with recursive queries in SQL, introducing the WITH ITERATIVE keyword. This innovation addresses the complexity and limitations of WITH RECURSIVE, providing a more intuitive and efficient approach.

The lecture presented two variants of the solution, emphasizing reduced memory consumption and improved runtime. Demonstrations of graph-based and string-parsing algorithms showcased the remarkable advantages of WITH ITERATIVE over traditional methods. Overall, the lecture marks a significant advancement in the field of database query processing, offering a more accessible and powerful tool for developers grappling with recursive queries.