Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Pınar Tözün is an Associate Professor at the IT University of Copenhagen. Before ITU, she was a research staff member at IBM Almaden Research Center. Prior to joining IBM, she received her PhD from EPFL. Her thesis received an Honorable Mention for the ACM SIGMOD Jim Gray Doctoral Dissertation Award in 2016. Her research focuses on resource-aware machine learning, performance characterization of data-intensive systems, and the scalability and efficiency of data-intensive systems on modern hardware.

Website: www.pinartozun.com

About the Talk

While modern hardware keeps offering increased parallelism and capabilities, harnessing them has been a perpetual challenge for data-intensive systems. In particular, hardware trends oblige software to overcome three major challenges to system scalability: (1) taking advantage of the implicit parallelism within a core, (2) exploiting the abundant explicit parallelism provided by multicores, and (3) achieving predictably efficient execution despite the non-uniform access latencies in multisocket multicores. This lecture will focus on these challenges in the context of transaction processing systems.

Hardware Parallelism & Transaction Processing

Summary written by Philipp Bielefeld and Ivan Tunov

Today's computers have more and more CPU cores. Coordinating jobs to run in parallel on several cores is therefore more important than ever in order to utilize their full computational power.

Prof. Pınar Tözün, associate professor at the IT University of Copenhagen, gave a lecture on "Hardware Parallelism & Transaction Processing Systems", which examined exactly this problem with a focus on online transaction processing. This blog post summarizes the key points of this lecture.

Before we dive into the topic, it is worth mentioning that online transaction processing (OLTP) is only a subset of data processing, in which short-running, atomic (indivisible and irreducible) operations, called transactions, are processed. Typical examples of OLTP are bank transfers or booking systems. OLTP systems usually handle a large number of transactions of a limited set of types, where each transaction accesses only a very small amount of data. Online analytical processing (OLAP), on the other hand, is characterized by long-running requests, each accessing a lot of data. Typical use cases are complex analyses, e.g. forecast planning, data mining, or business intelligence applications. Both types of processing have their own characteristics; only OLTP was considered in the lecture and is covered in this blog post.
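As a minimal, hypothetical illustration of what such a transaction looks like (not taken from the lecture), consider a toy bank transfer: it is short, atomic, and touches only two rows. A real OLTP system would express this in SQL and let the DBMS handle concurrency control and logging; the mutex below merely stands in for that.

```cpp
#include <mutex>
#include <stdexcept>
#include <unordered_map>

// A toy bank transfer, the classic OLTP example: the transaction is short,
// atomic (it either fully happens or not at all), and touches only two
// "rows" of the table.
struct Bank {
    std::unordered_map<int, long> balances;  // account id -> balance in cents
    std::mutex m;

    void transfer(int from, int to, long amount) {
        std::lock_guard<std::mutex> guard(m);  // stands in for DBMS concurrency control
        if (balances[from] < amount)
            throw std::runtime_error("insufficient funds");  // abort the transaction
        balances[from] -= amount;
        balances[to]   += amount;
    }
};
```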

Background

CPU generations

General purpose CPUs – the hardware we run the transactions on – have evolved drastically over the last decades, from single-core CPUs to multicore CPUs and then to multicore multisocket models. As shown in the figure below, single-core CPUs relied on ever faster and more complex cores, an approach that reached its limits around 2005, when multicore CPUs were introduced to further increase performance by adding more cores of the same speed and complexity.

Types of hardware parallelism

There are two basic types of hardware parallelism: implicit and explicit parallelism.

Implicit parallelism

The first type of hardware parallelism to be highlighted is implicit parallelism, also known as vertical parallelism, where a single core runs multiple instructions in parallel, for example by using the following techniques:

  • Instruction- and data-level parallelism focus on executing multiple instructions and manipulating multiple data items simultaneously. This kind of parallelism is provided automatically by the hardware.
  • Multithreading is the concurrent execution of multiple threads. Threads are smaller execution sequences that share the same resources within a process. Because of this sharing, the caches can be utilized better when a core switches between threads or when several cores share the same caches.

The main idea behind single-core parallelism at the CPU level is to keep the processor busy with other work during waiting times, e.g. when fetching data from main memory, which can take hundreds of CPU cycles (and far longer when the data has to come from disk). The figure below illustrates the hardware structure and storage access times.
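To make this concrete, here is a minimal C++ sketch (not from the lecture) of code from which the hardware can extract implicit parallelism: the loop iterations are independent, so the CPU can overlap their instructions and the compiler can vectorize them without any explicit effort by the developer.

```cpp
#include <cstddef>
#include <vector>

// Element-wise addition: the iterations carry no dependencies, so
// pipelining and out-of-order execution overlap the instructions, and
// SIMD units can process several elements per instruction.
void add(const std::vector<float>& a, const std::vector<float>& b,
         std::vector<float>& out) {       // out is assumed to be pre-sized
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = a[i] + b[i];              // no loop-carried dependency
    }
}
```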

Explicit parallelism

The second type of parallelism is explicit or horizontal parallelism, where multiple threads run in parallel on different cores. Historically, making single-core CPUs run ever faster reached its limits around 2005, which threatened to end the steady performance gains associated with Moore's law. In order to further increase performance, more cores were added. In short: while implicit parallelism is handled by the machine for the user and is therefore (almost) a free lunch, explicit parallelism is hard to exploit. The developer has to explicitly define concurrent patterns in the source code. Concurrency requires much more attention from the developer, because its concepts are harder to apply and errors are harder to debug.
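As a small, hypothetical example of explicit parallelism (not from the lecture), the following C++ sketch splits a summation across two threads; the developer has to partition the work, spawn the threads, and join the results explicitly.

```cpp
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Explicit parallelism: the developer divides the work across threads.
// Two threads sum disjoint halves of a vector; partitioning the data and
// combining the partial results is the developer's responsibility.
int main() {
    std::vector<int> data(1'000'000, 1);
    long sum_lo = 0, sum_hi = 0;

    std::thread t1([&] {
        sum_lo = std::accumulate(data.begin(), data.begin() + data.size() / 2, 0L);
    });
    std::thread t2([&] {
        sum_hi = std::accumulate(data.begin() + data.size() / 2, data.end(), 0L);
    });
    t1.join();
    t2.join();

    std::cout << (sum_lo + sum_hi) << "\n";  // prints 1000000
}
```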

Main ideas to increase CPU utilization

To reduce cache misses, several techniques were used in the past:

  • Splitting the instruction and data caches: The next instruction can be predicted much better than the data used in a transaction. On the one hand, code can only branch at specific positions, usually where a condition is checked; on the other hand, in an OLTP database context, the set of frequently used transaction statements is usually very limited, for example mostly SELECT and UPDATE statements.
  • Sharing the instruction cache between cores: The instructions used for executing transactions are often needed by several cores. Sharing the cache not only eliminates duplicates in the caches, but also increases the chance of a hit, because each core effectively sees a bigger cache.

While the approaches above aim to reduce the number of cache misses, the increasing number of CPU cores in a computer comes with another problem: the throughput of a multisocket multicore CPU can decrease with the number of cores, due to critical sections that are shared between cores and effectively block the work of other cores. Moreover, if a critical section is accessed by several cores and bounces between them, moving its state between cores comes with serious costs in a NUMA (non-uniform memory access) architecture.
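The following minimal C++ sketch (my own illustration, not from the lecture) shows the effect of such a shared critical section: all threads contend for one mutex-protected counter, so adding threads mostly adds contention and cache-line ping-pong instead of throughput.

```cpp
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// A single critical section shared by all threads: every increment has to
// acquire the same lock, so more threads (cores) largely means more waiting.
int main() {
    constexpr int kOpsPerThread = 1'000'000;
    for (int threads : {1, 2, 4, 8}) {
        long counter = 0;
        std::mutex m;
        auto start = std::chrono::steady_clock::now();

        std::vector<std::thread> workers;
        for (int t = 0; t < threads; ++t) {
            workers.emplace_back([&] {
                for (int i = 0; i < kOpsPerThread; ++i) {
                    std::lock_guard<std::mutex> guard(m);  // shared critical section
                    ++counter;
                }
            });
        }
        for (auto& w : workers) w.join();

        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::cout << threads << " threads: " << ms << " ms (counter=" << counter << ")\n";
    }
}
```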

The synchronization between cores around a critical section can be put into one of three categories:

  • unbounded communication, where the critical section is shared between all cores. The more cores there are, the worse the performance impact.
  • cooperative communication, where one thread can do the work for other threads as well, e.g. group commits.
  • fixed communication: here a fixed relationship, for example a consumer/producer relationship, is established between threads. The shared data is then only relevant for the threads in that relationship (see the sketch after this list).
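The sketch below is a hypothetical C++ illustration (not from the lecture) of fixed communication: one producer and one consumer share a queue, so the shared state matters only to these two threads instead of being contended by every core in the system.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Fixed communication: exactly one producer and one consumer share a queue.
int main() {
    std::queue<int> queue;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::thread producer([&] {
        for (int i = 0; i < 5; ++i) {
            { std::lock_guard<std::mutex> g(m); queue.push(i); }
            cv.notify_one();
        }
        { std::lock_guard<std::mutex> g(m); done = true; }
        cv.notify_one();
    });

    std::thread consumer([&] {
        while (true) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !queue.empty() || done; });
            if (queue.empty() && done) break;        // producer finished
            int item = queue.front(); queue.pop();
            lock.unlock();
            std::cout << "consumed " << item << "\n";
        }
    });

    producer.join();
    consumer.join();
}
```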

As shown in the figure above, such a context switch between cores can take up to 500 cycles in the worst case. The obvious approach to tackle this problem is to reduce the number of context switches between cores or, even better, to avoid them completely where possible. To reduce the problem of context switches, the following techniques can be applied:

  • Physiological partitioning (PLP): If each core is assigned a disjoint index range of the database, critical sections are not shared between cores. In the tests performed by Prof. Tözün, this eliminated 70% of all shared critical sections. Nevertheless, splitting the index range in this way is not always possible (a rough sketch of the partitioning idea follows after this list).
  • Executing instances of the same transaction on the same cores avoids switching context between cores and maximizes cache hits. Although this technique showed a substantial throughput increase in TPC-B/C/E benchmark tests, it comes with the problem of identifying identical transaction instances in order to run them on the same core. For the academic work, this was done by assigning them to cores manually. In practice, if this cannot be done automatically, it would mean more manual work for software developers.
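The following rough C++ sketch (my own illustration of the partitioning idea, not the actual PLP implementation) shows why partitioning removes shared critical sections: each worker thread exclusively owns a disjoint key range, so its updates need no locks at all.

```cpp
#include <iostream>
#include <map>
#include <thread>
#include <vector>

// Each worker thread exclusively owns one partition ("index range"),
// so updates within a range never touch a shared critical section.
int main() {
    constexpr int kPartitions = 4;
    constexpr int kKeysPerPartition = 1000;

    // One private index per partition; no locks are required because only
    // the owning thread ever accesses its partition.
    std::vector<std::map<int, int>> partitions(kPartitions);

    std::vector<std::thread> workers;
    for (int p = 0; p < kPartitions; ++p) {
        workers.emplace_back([&, p] {
            int lo = p * kKeysPerPartition;
            for (int key = lo; key < lo + kKeysPerPartition; ++key) {
                partitions[p][key] += 1;   // update stays within the owned range
            }
        });
    }
    for (auto& w : workers) w.join();

    std::cout << "partition 0 holds " << partitions[0].size() << " keys\n";
}
```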

Main results and summary

The research is almost a decade old, but the problem of fully utilizing CPU cores is more pressing than ever. The growing number of cores in today's machines makes the distribution problem between cores worse. The work also shows that there is huge potential for better utilization: more than 50% of CPU cycles are stalls. In order to make OLTP systems scale with the number of CPU cores, it is not enough to build lock-free applications, nor can a high throughput on a limited number of cores be extrapolated to a high throughput on a system with more cores. Instead, unbounded communication should be avoided whenever possible, so that critical sections do not jump between cores, which leads to substantial performance losses. As an alternative, the communication between cores can often be built in a fixed or cooperative manner, where a fixed producer/consumer relationship between cores is established or a single thread does the work for the other threads as well.

During the past decade, OLTP systems have benefited from bigger caches and bigger main memory, as well as from more efficient code and better algorithms, leading to more cache hits and to little or no disk usage during transactions. While this makes modern OLTP systems much faster than traditional systems, writing only to non-persistent memory leads to other problems, such as the need for lightweight logging mechanisms that allow recovery from failures without slowing down the database system.