



Lecture Series on

### Database Research

Winter Semester 2023 / 2024 Introduction & Hardware Efficient Stream Processing

Prof. Dr. Tilmann Rabl, Prof. Dr. Felix Naumann
Data Engineering Systems
Hasso-Plattner-Institut

#### Attention!



This lecture is recorded and will be available in Tele-Task.

We do not record the videos in online sessions, feel free to enable video sharing.

You are not allowed to record the lecture or take screenshots or pictures.



### This Lecture



- Motivation
  - DB Research
  - This lecture
  - Speakers
- 2. Course Organization
  - Exercises
  - Grading
  - Registration
- 3. Stream Processing
  - Basics & Code Generation
  - Portable SIMD Code







- Tilmann Rabl
- Chair for Data Engineering Systems

- At Hasso Plattner Institute
- Department of U Potsdam
- In the vicinity of Berlin







## Why Database Research - my personal view

#### Why I chose to do DBS research:

- Database systems are examples of large complex software systems
  - Like OS, compilers, computer games
  - Building a DBS touches all topics of computer science
- Database systems are at the core of all major businesses
  - Safe and efficient processing is relevant everywhere
  - Significance of data processing is only increasing

#### Why I got stuck in DBS research

- Fun research and ongoing challenges
- Great community





## Why this course

- Germany has an active and productive DB research community
- Best DB conferences always with good German participation
  - E.g., VLDB, SIGMOD
- However, there is little awareness and exchange on student level
- We want to make you aware of excellent German database research
  - Foster exchange
  - Show opportunities

#### Contents



Presentations by *distinguished researchers* from the *German database research* community (and

friends and family).

Some universities also offer the course locally









**Pinar Tözün** ITU Copenhagen

Hardware Parallelism & Transaction Processing Systems



Matthias Böhm TU Berlin

System Infrastructure for Data-centric ML Pipelines







14.11.

## Torsten Grust Uni Tübingen

A Fix for the Fixation on Fixpoints (Rethinking Iteration in SQL)



21.11.

Hannes Mühleisen CWI & Uni Nijmegen

In-Process OLAP







28.11.

## **Stefanie Scherzinger**

## Uni Passau

Challenges with JSON Schema Data Modeling



05.12.

Viktor Leis TUM

Commoditizing Data Analytics in the Cloud

#### Presenters IV





12.12.

## **Wolfgang Lehner**

### TU Dresden

Flexible Vector Processing for Data Science Engines



# **Zsolt István**TU Darmstadt

Software-Defined Data Protection: Low Overhead Policy Compliance at the Storage Layer is Within Reach!







09.01.

Matthias Weidlich
HU Berlin

Pushing Computation to the Sources: On Distribution in Complex Event Processing



25.01.

Ziawasch Abedjan Leibniz Uni Hannover

Data Cleaning







Felix Naumann

Data Profiling



30.01.

Carsten Binnig
TU Darmstadt
Towards Learned
Database Systems

### **Presenters VII**





06.02.

Jana Giceva

TUM

**TBD** 



## **Poster Session**

**HPI** 

**Event Space L** 



# Course Logistics at HPI





#### Lecture

- Tuesdays, 17-18:30, L.E-03
- Bachelor & Master, 3 ECTS

Contact (at HPI)

Prof. Dr. Tilmann Rabl, Prof. Dr. Felix Naumann

e-mail: tilmann.rabl@hpi.de, felix.naumann@hpi.de





#### Lecture Summary

- Summarize one presentation
- In group or individually
- Contents
  - Introduction of speaker
  - Engaging overview of talk
    - Goal, problem, solution
    - Background
    - Main ideas, methodology, approach
    - Main results
    - Summary

- Will be added to the course website (should be 10-15 min reading time)
  - Cf. Morning Paper https://blog.acolyer.org/
- We will ask the presenter to review the summary
- Plain HTML formatting
- Deadline 2 weeks after presentation
- 50% of the final grade





- Poster project
  - A1 poster
  - Plenary presentation at Feb 9
- Bachelor Students
  - Based on one of the lectures
  - Related work overview, lecture summary,
  - As detailed as possible
- Master Students
  - Research project proposal
  - Extend the technology or methodology of one more lectures
  - Goal, problem, solution & connection to the lecture



## Registration

- Organization through HPI Moodle
  - https://moodle.hpi.de/course/view.php?id=630
  - Sign up now!
- Groups
  - Will be assigned randomly
  - TBD...





## Grading in a Nutshell

- Lecture Summary
  - Team project
  - 50% of total points
- Poster
  - Individual project
  - 50% of total points
- Graded by the teachers

#### HPI Hasso Plattner Institut

#### Code of Conduct

- Asking questions is greatly encouraged
  - Discuss questions with each other (except exams)
- The limits of collaboration
  - Plagiarism, copying, or other forms of dishonesty will result in failing the course
- Communication
  - Write professional and polite emails
  - Use netiquette in forum, email, chats, etc.
- Generally
  - Treat everyone with respect and consideration -> especially important for online settings
  - This course and the university in general should be safe spaces for everyone



## Stream Processing End-to-end Hardware-conscious Data Processing







Stream Processor

Result Stream



ML System

Stream Processor

#### Challenge

- Potentially unlimited data set
- Many different queries
- Low latency, continuous results

**Typical Systems** 

Queries





#### Stream Processor Research



#### Dataflow

- Operators
  - Records
  - Control events
- State

#### Research topic examples

- Application (e.g., incremental instead of batch)
- Operator-level (e.g., aggregation, join)
- Semantics (e.g., window types, watermarks)
- Execution strategy (e.g., multi-query processing, compilation)
- Hardware optimizations (e.g, PMem, RDMA)





#### not possible!

Four Billion Records per Second! What is Behind Alibaba Double 11 - Flink Stream-Batch Unification Practice during Double 11 for the Very First Time

Alibaba Clouder

December 2, 2020

1.889



scale-out system

16-core instance @ \$1/h

#### ~42k events/s per server

93750 VMs @ \$2.25 million/day

## scale-up system

#### **Grizzly: Efficient Stream Processing Through Adaptive Query Compilation**

Philipp M. Grulich<sup>1</sup> Sebastian Breß<sup>1</sup> Steffen Zeuch<sup>1,2</sup> Jonas Traub<sup>1</sup> Ianis von Bleichert<sup>1</sup> Zongxiong Chen<sup>2</sup> Tilmann Rabl<sup>3</sup> Volker Markl<sup>1,2</sup> <sup>1</sup>Technische Universitat Berlin <sup>2</sup>DFKI GmbH <sup>3</sup>HPI, University of Potsdam

#### ABSTRACT

Stream Processing Engines (SPEs) execute long-running queries on unbounded data streams. They follow an interpretation-based processing model and do not perform runtime optimizations. This limits the utilization of modern hardware and neglects changing data characteristics at runtime.

In this paper, we present Grizzly, a novel adaptive query compilation-based SPE, to enable highly efficient query execution. We extend query compilation and task-based par



Figure 1: Yahoo! Streaming Benchmark (8 Threads).

such as Flink [15] and Storm [66] scale-out executions to

#### > 100 million events/s per server

40 VMs @ \$5760/day 52-core instance @ \$6/h

https://www.alibabacloud.com/blog/four-billion-records-per-second-stream-batch-integrationimplementation-of-alibaba-cloud-realtime-compute-for-apache-flink-during-double-11-526962



## Stream Processing – Iterator Model



- Many virtual method calls :(
- Bad cache locality :(



## Streaming Processing – Query Compilation Model



Zeuch et al. "Analyzing Efficient Stream Processing on Modern Hardware" PVLDB'19 Grulich et al. "Grizzly: Efficient Stream Processing Through Adaptive Query Compilation" SIGMOD'20 Benson et al. "Darwin: Scale-In Stream Processing" CIDR'22



## What code to generate? Let's take a closer look...



Modern computer architecture with PCIe Bus

Memory, PCIe, core and socket connection through internal bus (ring or mesh)



## CPU Die – Intel i7 (from 2011)



Intel Core i7-3960X
The die is 21 by 21 mm and has 2.27 billion transistors





Performance Cores
L1 Instruction Cache 192 KB
L1 Data Cache 128KB
Shared L2 Cache 12 MB

Efficiency Cores
L1 Instruction Cache 128 KB
L1 Data Cache 64KB
Shared L2 Cache 4 MB

System Level Cache
Shared with GPU: 16MB







192KB L1I\$

8-wide decoder

8 instructions per cycle

630 size reorder buffer

7 integer ports (plus ld/st,fp)

8  $\mu$ -ops per cycle





## SIMD Compiler Intrinsics

Lawrence Benson, Richard Ebeling, Tilmann Rabl ADMS 2023



#### SIMD in a Nutshell

Single Instruction Multiple Data (= SIMD)

Most common instruction sets

- x86: SSE, AVX, AVX2, AVX512 (> 6k instructions)
- ARM: Neon (> 4k instructions), SVE
- PowerPC AltiVec, RISC-V V

Arithmetic, Logical, Shuffle, Shift, Load, Store, ...

Focus on x86 and Neon

- x86: 128 512 Bit registers
- Neon: 128 Bit registers





#### SIMD in Databases

#### Used to speed up, e.g.,

- Table scans
- Hash tables
- Sorting

#### SIMD code is ...

- ... hard to develop
- ... hard to test
- ... hard to benchmark

#### Non-x86 CPUs on the rise

How to translate x86 SIMD code?



What do these functions do?

\_mm\_add\_epi32()
\_mm512\_srl\_epi64() vaddq\_s32()

Don't have AVX512?

→ Can't compile

Don't have NEON?

→ Can't compile



## Abstractions on top of Abstractions

Add two 128-bit registers of 4x 32-bit integers



```
Application code

vec C = A + B;

SIMD intrinsics

mm_add_epi32

vaddq_s32

Compiler representation

__attribute__((
    vec operator+(vec, vec);
}

compiler representation

__attribute__((
    vector_size(N)));
```



### Cut the Middle Man: GCC Vector Extensions

Compilers map platform-specific types and operations to canonical internal representation anyway

- Decouples from external specifications
- Allows reusing analysis and optimization steps
- → We could directly target this canonical representation with our code
- Let the compiler do platform-specific instruction selection

<u>GCC-API</u> allows natural code using arithmetic (\*, +, -, /) and comparison (>, >=, ==, <, <=) operators



## GCC Vector Extensions (2)

```
#include <cstdint>
using VecT __attribute__((vector_size(16))) = int32_t;

void add(VecT* X, VecT* Y, VecT* Z, int MAX) {
    for (int i = 0; i < MAX/4; i++) {
        *Z++ = *X++ + *Y++;
    }
}</pre>
```

Produces identical code to platform-specific intrinsics (ARM, x86)

```
#include <immintrin.h>

void add(__m128i* X, __m128i* Y, __m128i* Z, int MAX) {
    for (int i = 0; i < MAX/4; i++) {
        *Z++ = _mm_add_epi32(*X++, *Y++);
    }
}</pre>
```

```
#include <arm_neon.h>

void add(int32x4_t* X, int32x4_t* Y, int32x4_t* Z, int MAX) {
    for (int i = 0; i < MAX/4; i++) {
        *Z++ = vaddq_s32(*X++, *Y++);
    }
}</pre>
```





M1: Apple Macbook Pro 14" M1 2021

x86 Icelake: Intel Xeon Platinum 8352Y

All experiments with:

■ Clang 15 (x86) and trunk Clang ~17 (ARM)

-O3, -march/-mtune=native

Single-threaded

Show relative speedup over scalar version

Also other x86/ARM CPUs  $\rightarrow$  see paper



## Bit-Packed Integer Decompression

Based on VLDB '09 SIMD-Scan paper

Unpack packed 9-Bit integers to 32 bits → shuffling + shifting

Compiler vec >= hand-written SIMD

Neon: vec-128 generates better code than translated x86 intrinsics

Implementation artifact on NEON







Velox: Meta's new unified query engine

Removed xSIMD dependency Use only compiler-intrinsics

End-to-end TPC-H SF1 x86  $\rightarrow$  0.1% diff NEON  $\rightarrow$  0.13% diff

#### Removed:

- 54 platform-specific functions
- Hundreds of lines of SIMD code





## Outlook





Efficiency is needed more than ever

1xChatGPT ~ 1000xGoogle¹

Efficiency does not mean newer bigger faster

- The cloud won't solve this
- There is little economic incentive in being most efficient

Need to make best use of what you have



https://www.tagesschau.de/multimedia/bilder/blickpunkte-8196.html



## Measuring Carbon Footprint



https://climateboard-bptr1.hpi.de

## Thank you for your attention!



- Course Motivation and Contents
- Course Logistics
- Stream Processing
- Next session: Pinar Tözün 24.10.



24.10.

**Pinar Tözün**ITU Copenhagen

Hardware Parallelism & Transaction Processing Systems