Thomas Rauber • Gudula Rünger

Parallel Programming for Multicore and Cluster Systems

Third Edition

Thomas Rauber
Lehrstuhl für Angewandte Informatik II
University of Bayreuth
Bayreuth, Bayern, Germany

Gudula Rünger
Fakultät für Informatik
Chemnitz University of Technology
Chemnitz, Sachsen, Germany
The second English edition was a translation of the 3rd German language edition: Parallele
Programmierung (3. Aufl. 2012) by T. Rauber and G. Rünger, Springer-Verlag Berlin
Heidelberg 2000, 2007, 2012.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2010, 2013, 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The content of this new English edition includes an extended update of the chapter on
computer architecture and performance analysis, taking new developments such as the
aspect of energy consumption into consideration. The description of OpenMP has
been extended and now also captures the task concept of OpenMP. The chapter
on message-passing programming has been extended and updated to include new
features of MPI such as extended reduction operations and non-blocking collec-
tive communication operations. The chapter on GPU programming also has been
updated. All other chapters also have been revised carefully.
The content of the book consists of three main parts, covering all areas of parallel
computing: the architecture of parallel systems, parallel programming models and
environments, and the implementation of efficient application algorithms. The em-
phasis lies on parallel programming techniques needed for different architectures.
The first part contains an overview of the architecture of parallel systems, includ-
ing cache and memory organization, interconnection networks, routing and switch-
ing techniques as well as technologies that are relevant for modern and future mul-
ticore processors. Issues of power and energy consumption are also covered.
The second part presents parallel programming models, performance models,
and parallel programming environments for message passing and shared memory
models, including the message passing interface (MPI), Pthreads, Java threads, and
OpenMP. For each of these parallel programming environments, the book intro-
duces basic concepts as well as more advanced programming methods and enables
the reader to write and run semantically correct and computationally efficient par-
allel programs. Parallel design patterns, such as pipelining, client-server, or task
pools are presented for different environments to illustrate parallel programming
techniques and to facilitate the implementation of efficient parallel programs for a
wide variety of application areas. Performance models and techniques for runtime
analysis are described in detail, as they are a prerequisite for achieving efficiency
and high performance. A chapter gives a detailed description of the architecture of
GPUs and also contains an introduction into programming approaches for general
purpose GPUs concentrating on CUDA and OpenCL. Programming examples are
provided to demonstrate the use of the specific programming techniques introduced.
The third part applies the parallel programming techniques from the second part
to representative algorithms from scientific computing. The emphasis lies on basic
methods for solving linear equation systems, which play an important role for many
scientific simulations. The focus of the presentation is the analysis of the algorithmic
structure of the different algorithms, which is the basis for a parallelization, and not
so much on mathematical properties of the solution methods. For each algorithm,
the book discusses different parallelization variants, using different methods and
strategies.
Many colleagues and students have helped to improve the quality of this book.
We would like to thank all of them for their help and constructive criticisms. For
numerous corrections we would like to thank Robert Dietze, Jörg Dümmler, Mar-
vin Ferber, Michael Hofmann, Ralf Hoffmann, Sascha Hunold, Thomas Jakobs,
Oliver Klöckner, Matthias Korch, Ronny Kramer, Raphael Kunis, Jens Lang, Isabel
Mühlmann, John O’Donnell, Andreas Prell, Carsten Scholtes, Michael Schwind,
and Jesper Träff. Many thanks to Thomas Jakobs, Matthias Korch, Carsten Scholtes
and Michael Schwind for their help with the program examples and the exercises.
We thank Monika Glaser and Luise Steinbach for their help and support with the
LATEX typesetting of the book. We also thank all the people who have been involved
in the writing of the three German versions of this book. It has been a pleasure work-
ing with the Springer Verlag in the development of this book. We especially thank
Ralf Gerstner for his continuous support.
Contents

1 Introduction
1.1 Classical Use of Parallelism
1.2 Parallelism in Today's Hardware
1.3 Basic Concepts of Parallel Programming
1.4 Overview of the Book
8.2 Direct Methods for Linear Systems with Banded Structure
8.2.1 Discretization of the Poisson Equation
8.2.2 Tridiagonal Systems
8.2.3 Generalization to Banded Matrices
8.2.4 Solving the Discretized Poisson Equation
8.3 Iterative Methods for Linear Systems
8.3.1 Standard Iteration Methods
8.3.2 Parallel Implementation of the Jacobi Iteration
8.3.3 Parallel Implementation of the Gauss-Seidel Iteration
8.3.4 Gauss-Seidel Iteration for Sparse Systems
8.3.5 Red-black Ordering
8.4 Conjugate Gradient Method
8.4.1 Sequential CG Method
8.4.2 Parallel CG Method
8.5 Cholesky Factorization for Sparse Matrices
8.5.1 Sequential Algorithm
8.5.2 Storage Scheme for Sparse Matrices
8.5.3 Implementation for Shared Variables
8.6 Exercises for Chapter 8
References
Index
Chapter 1
Introduction
Computer simulations are mainly used in two situations: to predict developments that
cannot be observed directly, and to replace practical experiments that would be too
expensive or infeasible. A typical example of the first application area is weather
forecasting, where the future development of the atmosphere has to be predicted, which
can only be done by simulations. In the second application area, computer simulations are used to
obtain results that are more precise than results from practical experiments or that
can be performed at lower cost. An example is the use of simulations to determine
the air resistance of vehicles: Compared to a classical wind tunnel experiment, a
computer simulation can provide more precise results because the relative movement of
the vehicle in relation to the ground can be included in the simulation. This is not
possible in the wind tunnel, since the vehicle cannot be moved. Crash tests of ve-
hicles are an obvious example where computer simulations can be performed at
lower cost.
Computer simulations often require a large computational effort. Thus, a low
performance of the computer system used can restrict the simulations and the ac-
curacy of the results obtained significantly. Using a high-performance system al-
lows larger simulations which lead to better results and therefore, parallel comput-
ers have usually been used to perform computer simulations. Today, cluster systems
built up from server nodes are widely available and are now also often used for
parallel simulations. Additionally, multicore processors within the nodes provide
further parallelism, which can be exploited for a fast computation. To use paral-
lel computers or cluster systems, the computations to be performed must be parti-
tioned into several parts which are assigned to the parallel resources for execution.
These computation parts should be independent of each other, and the algorithm
performed must provide enough independent computations to be suitable for a par-
allel execution. This is normally the case for scientific simulations, which often use
one- or multi-dimensional arrays as data structures and organize their computations
in nested loops. To obtain a parallel program for parallel execution, the algorithm
must be formulated in a suitable programming language. Parallel execution is often
controlled by specific runtime libraries or compiler directives which are added to
a standard programming language, such as C, Fortran, or Java. The programming
techniques needed to obtain efficient parallel programs are described in this book.
Popular runtime systems and environments are also presented.
Today, processor chips typically contain several cores. Thus, using multicore processors
makes each desktop computer a small parallel system. The technological development
toward multicore processors was driven by physical constraints, since the clock speed of
chips with more and more transistors cannot be increased at the previous rate without overheating.
Multicore architectures in the form of single multicore processors, shared mem-
ory systems of several multicore processors, or clusters of multicore processors with
a hierarchical interconnection network will have a large impact on software develop-
ment. In 2022, quad-core and octa-core processors are standard for normal desktop
computers, and chips with up to 64 cores are already available for use in high-
end systems. It can be predicted from Moore’s law that the number of cores per
processor chip will double every 18–24 months, so that within several years a typical
processor chip might consist of dozens to hundreds of cores. Some of the cores will
be dedicated to specific purposes such as network management, encryption and
decryption, or graphics [138]; the majority of the cores will be available
for application programs, providing a huge performance potential. Another trend in
parallel computing is the use of GPUs for compute-intensive applications. GPU ar-
chitectures provide many hundreds of specialized processing cores that can perform
computations in parallel.
The users of a computer system are interested in benefitting from the perfor-
mance increase provided by multicore processors. If this can be achieved, they can
expect their application programs to keep getting faster and keep getting more and
more additional features that could not be integrated in previous versions of the soft-
ware because they needed too much computing power. To ensure this, there should
definitely be support from the operating system, e.g., by using dedicated cores for
their intended purpose or by running multiple user programs in parallel if enough
cores are available. But when a large number of cores is provided, which will be the case in
the near future, there is also the need to execute a single application program on
multiple cores. The best situation for the software developer would be that there is
an automatic transformer that takes a sequential program as input and generates a
parallel program that runs efficiently on the new architectures. If such a transformer
were available, software development could proceed as before. But unfortunately,
the experience of the research in parallelizing compilers during the last 20 years has
shown that for many sequential programs it is not possible to extract enough paral-
lelism automatically. Therefore, there must be some help from the programmer and
application programs need to be restructured accordingly.
For the software developer, the new hardware development toward multicore ar-
chitectures is a challenge, since existing software must be restructured toward paral-
lel execution to take advantage of the additional computing resources. In particular,
software developers can no longer expect that the increase of computing power can
automatically be used by their software products. Instead, additional effort is re-
quired at the software level to take advantage of the increased computing power. If
a software company is able to transform its software so that it runs efficiently on
novel multicore architectures, it will likely have an advantage over its competitors.
There is much research going on in the area of parallel programming languages
and environments with the goal of facilitating parallel programming by providing
support at the right level of abstraction. But there are also many effective techniques
and environments already available. We give an overview in this book and present
important programming techniques, enabling the reader to develop efficient parallel
programs. There are several aspects that must be considered when developing a
parallel program, no matter which specific environment or system is used. We give
a short overview in the following section.
An important distinction is between machines with a shared memory and machines with
a distributed memory. For shared memory machines, a global shared memory stores the data of
an application and can be accessed by all processors or cores of the hardware sys-
tems. Information exchange between threads is done by shared variables written by
one thread and read by another thread. The correct behavior of the entire program
has to be achieved by synchronization between threads so that the access to shared
data is coordinated, i.e., a thread must not read a data element before the write op-
eration by another thread storing the data element has been finalized. Depending on
the programming language or environment, synchronization is done by the runtime
system or by the programmer. For distributed memory machines, there exists a pri-
vate memory for each processor, which can only be accessed by this processor and
no synchronization for memory access is needed. Information exchange is done by
sending data from one processor to another processor via an interconnection net-
work by explicit communication operations.
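As a minimal illustration of information exchange via shared variables (a sketch under the assumption of a POSIX threads environment, not an example taken from this chapter), the following C program lets two threads increment a shared counter; the mutex coordinates the accesses so that no update is lost:

#include <pthread.h>
#include <stdio.h>

int counter = 0;                                   /* shared variable */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* coordinates the accesses */

void *work(void *arg) {
  for (int i = 0; i < 1000; i++) {
    pthread_mutex_lock(&lock);    /* enter critical section */
    counter++;                    /* access to the shared variable */
    pthread_mutex_unlock(&lock);  /* leave critical section */
  }
  return NULL;
}

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, work, NULL);
  pthread_create(&t2, NULL, work, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("counter = %d\n", counter);  /* 2000 if all updates are coordinated */
  return 0;
}

The program can be compiled with a C compiler using the -pthread option; without the mutex, concurrent increments could overwrite each other and the final value would be unpredictable.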
Specific barrier operations offer another form of coordination which is avail-
able for both shared memory and distributed memory machines. All processes or
threads have to wait at a barrier synchronization point until all other processes or
threads have also reached that point. Only after all processes or threads have exe-
cuted the code before the barrier, they can continue their work with the subsequent
code after the barrier.
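A corresponding sketch of barrier synchronization with POSIX threads is given below (again an illustrative example with assumed names, not taken from this chapter): all threads must reach pthread_barrier_wait before any of them continues with the second phase.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
pthread_barrier_t barrier;          /* shared barrier for all threads */

void *phase_worker(void *arg) {
  int id = *(int *) arg;
  printf("thread %d: phase 1\n", id);   /* phase 1 computations */
  pthread_barrier_wait(&barrier);       /* wait until all threads arrive */
  printf("thread %d: phase 2\n", id);   /* phase 2 may use phase 1 results */
  return NULL;
}

int main(void) {
  pthread_t t[NUM_THREADS]; int id[NUM_THREADS];
  pthread_barrier_init(&barrier, NULL, NUM_THREADS);
  for (int i = 0; i < NUM_THREADS; i++) {
    id[i] = i;
    pthread_create(&t[i], NULL, phase_worker, &id[i]);
  }
  for (int i = 0; i < NUM_THREADS; i++) pthread_join(t[i], NULL);
  pthread_barrier_destroy(&barrier);
  return 0;
}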
An important aspect of parallel computing is the parallel execution time which
consists of the time for the computation on processors or cores and the time for data
exchange or synchronization. The parallel execution time should be smaller than the
sequential execution time on one processor so that designing a parallel program is
worth the effort. The parallel execution time is the time elapsed between the start of
the application on the first processor and the end of the execution of the application
on all processors. This time is influenced by the distribution of work to processors or
cores, the time for information exchange or synchronization, and idle times in which
a processor cannot do anything useful but wait for an event to happen. In general,
a smaller parallel execution time results when the work load is assigned equally to
processors or cores, which is called load balancing, and when the overhead for
information exchange, synchronization and idle times is small. Finding a specific
scheduling and mapping strategy which leads to a good load balance and a small
overhead is often difficult because of many interactions. For example, reducing the
overhead for information exchange may lead to load imbalance whereas a good load
balance may require more overhead for information exchange or synchronization.
For a quantitative evaluation of the execution time of parallel programs, cost
measures like speedup and efficiency are used, which compare the resulting paral-
lel execution time with the sequential execution time on one processor. There are
different ways to measure the cost or runtime of a parallel program and a large va-
riety of parallel cost models based on parallel programming models have been pro-
posed and used. These models are meant to bridge the gap between specific parallel
hardware and more abstract parallel programming languages and environments.
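In their most common form (standard definitions, stated here for orientation and treated in detail in Chapter 4), these measures are

S(p) = T_seq / T_par(p)  and  E(p) = S(p) / p,

where T_seq denotes the sequential execution time, T_par(p) the parallel execution time on p processors, S(p) the speedup, and E(p) the efficiency; in the ideal case S(p) = p and E(p) = 1.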
The rest of the book is structured as follows. Chapter 2 gives an overview of impor-
tant aspects of the hardware of parallel computer systems and addresses new devel-
opments such as the trends toward multicore architectures. In particular, the chap-
ter covers important aspects of memory organization with shared and distributed
address spaces as well as popular interconnection networks with their topological
properties. Since memory hierarchies with several levels of caches may have an
important influence on the performance of (parallel) computer systems, they are
covered in this chapter. The architecture of multicore processors is also described in
detail. The main purpose of the chapter is to give a solid overview of the important
aspects of parallel computer architectures that play a role for parallel programming
and the development of efficient parallel programs.
Chapter 3 considers popular parallel programming models and paradigms and
discusses how the inherent parallelism of algorithms can be presented to a par-
allel runtime environment to enable an efficient parallel execution. An important
part of this chapter is the description of mechanisms for the coordination of paral-
lel programs, including synchronization and communication operations. Moreover,
mechanisms for exchanging information and data between computing resources for
different memory models are described. Chapter 4 is devoted to the performance
analysis of parallel programs. It introduces popular performance or cost measures
that are also used for sequential programs, as well as performance measures that
have been developed for parallel programs. Especially, popular communication pat-
terns for distributed address space architectures are considered and their efficient
implementations for specific interconnection structures are given.
Chapter 5 considers the development of parallel programs for distributed address
spaces. In particular, a detailed description of MPI (Message Passing Interface) is
given, which is by far the most popular programming environment for distributed
address spaces. The chapter describes important features and library functions of
MPI and shows which programming techniques must be used to obtain efficient
MPI programs. Chapter 6 considers the development of parallel programs for shared
address spaces. Popular programming environments are Pthreads, Java threads, and
OpenMP. The chapter describes all three and considers programming techniques
to obtain efficient parallel programs. Many examples help to understand the rel-
evant concepts and to avoid common programming errors that may lead to low
performance or may cause problems such as deadlocks or race conditions. Pro-
gramming examples and parallel programming patterns are presented. Chapter 7
introduces programming approaches for the execution of non-graphics application
programs, e.g., from the area of scientific computing, on GPUs. The chapter de-
scribes the architecture of GPUs and concentrates on the programming environment
CUDA (Compute Unified Device Architecture) from NVIDIA. A short overview of
OpenCL is also given in this chapter. Chapter 8 considers algorithms from numer-
ical analysis as representative examples and shows how the sequential algorithms
can be transferred into parallel programs in a systematic way.
The main emphasis of the book is to provide the reader with the programming
techniques that are needed for developing efficient parallel programs for different
architectures and to give enough examples to enable the reader to use these tech-
niques for programs from other application areas. In particular, reading and using
the book is a good training for software development for modern parallel architec-
tures, including multicore architectures.
The content of the book can be used for courses in the area of parallel com-
puting with different emphasis. All chapters are written in a self-contained way
so that chapters of the book can be used in isolation; cross-references are given
when material from other chapters might be useful. Thus, different courses in the
area of parallel computing can be assembled from chapters of the book in a mod-
ular way. Exercises are provided for each chapter separately. For a course on the
programming of multicore systems, Chapters 2, 3 and 6 should be covered. In par-
ticular Chapter 6 provides an overview of the relevant programming environments
and techniques. For a general course on parallel programming, Chapters 2, 5, and 6
can be used. These chapters introduce programming techniques for both distributed
and shared address space. For a course on parallel numerical algorithms, mainly
Chapters 5 and 8 are suitable; Chapter 6 can be used additionally. These chapters
consider the parallel algorithms used as well as the programming techniques re-
quired. For a general course on parallel computing, Chapters 2, 3, 4, 5, and 6 can
be used with selected applications from Chapter 8. Depending on the emphasis,
Chapter 7 on GPU programming can be included in each of the courses mentioned
above. The following web page will be maintained for additional and new material:
ai2.inf.uni-bayreuth.de/ppbook3
Chapter 2
Parallel Computer Architecture
In more detail, Section 2.1 gives an overview of the use of parallelism within
a single processor or processor core. Using the available resources within a single
processor core at instruction level can lead to a significant performance increase.
Section 2.2 focuses on important aspects of the power and energy consumption of
processors. Section 2.3 addresses techniques that influence memory access times
and play an important role for the performance of (parallel) programs. Section 2.4
introduces Flynn’s taxonomy and Section 2.5 addresses the memory organization of
parallel platforms. Section 2.6 presents the architecture of multicore processors and
describes the use of thread-based parallelism for simultaneous multithreading.
Section 2.7 describes interconnection networks which connect the resources of
parallel platforms and are used to exchange data and information between these
resources. Interconnection networks also play an important role for multicore pro-
cessors for the connection between the cores of a processor chip. The section covers
static and dynamic interconnection networks and discusses important characteris-
tics, such as diameter, bisection bandwidth and connectivity of different network
types as well as the embedding of networks into other networks. Section 2.8 ad-
dresses routing techniques for selecting paths through networks and switching tech-
niques for message forwarding over a given path. Section 2.9 considers memory
hierarchies of sequential and parallel platforms and discusses cache coherence and
memory consistency for shared memory platforms. Section 2.10 shows examples
for the use of parallelism in today’s computer architectures by describing the ar-
chitecture of the Intel Cascade Lake and Ice Lake processors on the one hand and the
Top500 list on the other hand.
Processor chips are the key components of computers. Considering the trends that
can be observed for processor chips during recent years, estimations for future de-
velopments can be deduced.
An important performance factor is the clock frequency (also called clock rate
or clock speed) of the processor which is the number of clock cycles per second,
measured in Hertz = 1/second, abbreviated as Hz = 1/s. The clock frequency f
determines the clock cycle time t of the processor by t = 1/ f , which is usually
the time needed for the execution of one instruction. Thus, an increase of the clock
frequency leads to a faster program execution and therefore a better performance.
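As a small worked example (values chosen for illustration only): a clock frequency of f = 2 GHz corresponds to a cycle time of t = 1/(2 · 10^9 1/s) = 0.5 ns, so roughly two billion instructions can be finished per second under the simplifying assumption of one instruction per cycle.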
Between 1987 and 2003, an average annual increase of the clock frequency of
about 40% could be observed for desktop processors [103]. Since 2003, the clock
frequency of desktop processors remains nearly unchanged and no significant in-
creases can be expected in the near future [102, 134]. The reason for this develop-
ment lies in the fact that an increase in clock frequency leads to an increase in power
consumption, mainly due to leakage currents which are transformed into heat, which
then requires a larger amount of cooling. Using current state-of-the-art cooling tech-
nology, processors with a clock rate significantly above 4 GHz cannot be cooled
permanently without a large additional effort.
Another important influence on the processor development are technical im-
provements in processor manufacturing. Internally, processor chips consist of tran-
sistors. The number of transistors contained in a processor chip can be used as a
rough estimate of its complexity and performance. Moore’s law is an empirical
observation which states that the number of transistors of a typical processor chip
doubles every 18 to 24 months. This observation has first been made by Gordon
Moore in 1965 and has been valid for more than 40 years. However, the transistor
increase due to Moore’s law has slowed down during the last years [104]. Neverthe-
less, the number of transistors still increases and the increasing number of transistors
has been used for architectural improvements, such as additional functional units,
more and larger caches, and more registers, as described in the following sections.
In 2022, a typical desktop processor chip contains between 5 and 20 billion
transistors, depending on the specific configuration. For example, an AMD Ryzen 7
3700X with eight cores (introduced in 2019) comprises about 6 billion transistors,
an AMD Ryzen 7 5800H with eight cores (introduced in 2021) contains 10.7 billion
transistors, and a 10-core Apple M2 (introduced in 2022) consists of 20 billion tran-
sistors, using an ARM-based system-on-a-chip (SoC) design, for more information
see en.wikipedia.org/wiki/Transistor_count. The manufacturer Intel does
not disclose the number of transistors of its processors.
The increase of the number of transistors and the increase in clock speed has
led to a significant increase in the performance of computer systems. Processor
performance can be measured by specific benchmark programs that have been se-
lected from different application areas to get a representative performance metric of
computer systems. Often, the SPEC benchmarks (System Performance and Evalu-
ation Cooperative) are used to measure the integer and floating-point performance
2.1 Processor Architecture and Technology Trends 11
of computer systems [113, 104, 206, 136], see www.spec.org. Measurements with
these benchmarks show that different time periods with specific performance in-
creases can be identified for desktop processors [104]: Between 1986 and 2003,
an average annual performance increase of about 50% could be reached. The time
period of this large performance increase corresponds to the time period in which
the clock frequency has been increased significantly each year. Between 2003 and
2011, the average annual performance increase of desktop processors is about 23%.
This is still a significant increase which has been reached although the clock fre-
quency remained nearly constant, indicating that the annual increase in transistor
count has been used for architectural improvements leading to a reduction of the
average time for executing an instruction. Between 2011 and 2015, the performance
increase slowed down to 12% per year and is currently only at about 3.5 % per
year, mainly due to the limited degree of parallelism available in the benchmark
programs.
In the following, a short overview of architectural improvements that contributed
to performance increase during the last decades is given. Four phases of micropro-
cessor design trends can be observed [45] which are mainly driven by an internal
use of parallelism:
1. Parallelism at bit level: Up to about 1986, the word size used by the processors
for operations increased stepwise from 4 bits to 32 bits. This trend has slowed
down and ended with the adoption of 64-bit operations at the beginning of the
1990’s. This development has been driven by demands for improved floating-
point accuracy and a larger address space. The trend has stopped at a word size
of 64 bits, since this gives sufficient accuracy for floating-point numbers and
covers a sufficiently large address space of 2^64 bytes.
2. Parallelism by pipelining: The idea of pipelining at instruction level is to over-
lap the execution of multiple instructions. For this purpose, the execution of each
instruction is partitioned into several steps which are performed by dedicated
hardware units (pipeline stages) one after another. A typical partitioning could
result in the following steps:
(a) fetch: the next instruction to be executed is fetched from memory;
(b) decode: the instruction fetched in step (a) is decoded;
(c) execute: the operands specified are loaded and the instruction is executed;
(d) write back: the result is written into the destination register.
An instruction pipeline is similar to an assembly line in the automobile industry.
The advantage is that the different pipeline stages can operate in parallel, if there
are no control or data dependencies between the instructions to be executed, see
Fig. 2.1 for an illustration. To avoid waiting times, the execution of the different
pipeline stages should take about the same amount of time. In each clock cycle,
the execution of one instruction is finished and the execution of another instruc-
tion is started, if there are no dependencies between the instructions. The number
of instructions finished per time unit is called the throughput of the pipeline.
Thus, in the absence of dependencies, the throughput is one instruction per clock
cycle; a simple timing model is sketched at the end of this section.
In the absence of dependencies, all pipeline stages work in parallel. Thus, the
number of pipeline stages determines the degree of parallelism attainable by a
pipelined computation. The number of pipeline stages used in practice depends
on the specific instruction and its potential to be partitioned into stages. Typical
numbers of pipeline stages lie between 2 and 26 stages. Processors which use
pipelining to execute instructions are called ILP processors (instruction level
parallelism processors). Processors with a relatively large number of pipeline
stages are sometimes called superpipelined. Although the available degree of
parallelism increases with the number of pipeline stages, this number cannot be
arbitrarily increased, since it is not possible to partition the execution of the in-
struction into a very large number of steps of equal size. Moreover, data depen-
dencies often inhibit a completely parallel use of the stages.
Fig. 2.1 Pipelined execution of four independent instructions 1–4 at times t1, t2, t3, t4. The
execution of each instruction is split into four stages: fetch (F), decode (D), execute (E), and
write back (W).
3. Parallelism by multiple functional units: Many processors contain multiple,
independent functional units such as ALUs, FPUs, load/store units, or branch units
that can work in parallel on independent instructions. Modern superscalar processors
can dispatch up to ten independent instructions to functional units in one machine cycle per core,
see Section 2.10.
4. Parallelism at process or thread level: The three techniques described so far
assume a single sequential control flow which is provided by the compiler and
which determines the execution order if there are dependencies between instruc-
tions. For the programmer, this has the advantage that a sequential programming
language can be used nevertheless leading to a parallel execution of instructions
due to ILP. However, the degree of parallelism obtained by pipelining and mul-
tiple functional units is limited. Thus, the increasing number of transistors avail-
able per processor chip according to Moore’s law should be used for other tech-
niques. One approach is to integrate larger caches on the chip. But the cache sizes
cannot be arbitrarily increased either, as larger caches lead to a larger access time,
see Section 2.9.
An alternative approach to use the increasing number of transistors on a chip is
to put multiple, independent processor cores onto a single processor chip. This
approach has been used for typical desktop processors since 2005. The resulting
processor chips are called multicore processors. Each of the cores of a multi-
core processor must obtain a separate flow of control, i.e., parallel programming
techniques must be used. The cores of a processor chip access the same mem-
ory and may even share caches. Therefore, memory accesses of the cores must
be coordinated. The coordination and synchronization techniques required are
described in later chapters.
A more detailed description of parallelism at hardware level using the four tech-
niques described can be found in [45, 104, 171, 207]. Section 2.6 describes tech-
niques such as simultaneous multithreading and multicore processors requiring an
explicit specification of parallelism.
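As a rough illustration of the pipelined execution described in item 2 above (a standard textbook approximation, not a formula taken from this chapter), consider n independent instructions executed on a pipeline with k stages and cycle time t. The pipelined and the non-pipelined execution times are approximately

T_pipe(n) = (k + n − 1) · t  and  T_seq(n) = n · k · t,

so that the speedup T_seq(n) / T_pipe(n) = n · k / (k + n − 1) approaches the number of pipeline stages k for large n, provided that no dependencies stall the pipeline.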
Until 2003, a significant average annual increase of the clock frequency of proces-
sors could be observed. This trend has stopped in 2003 at a clock frequency of about
3.3 GHz and since then, only slight increases of the clock frequency could be ob-
served. A further increase of the clock frequency is difficult because of the increased
heat production due to leakage currents. Such leakage currents also occur if the pro-
cessor is not performing computations. Therefore, the resulting power consumption
is called static power consumption. The power consumption caused by compu-
tations is called dynamic power consumption. The overall power consumption is
the sum of the static power consumption and the dynamic power consumption. In
2011, depending on the processor architecture, the static power consumption typi-
cally contributed between 25% and 50% to the total power consumption [104]. The
heat produced by leakage currents must be carried away from the processor chip by
using a sophisticated cooling technology.
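A commonly used first-order model for the dynamic power consumption of CMOS circuits (a general hardware model, not a formula quoted from this section) is

P_dyn = α · C_L · V² · f,

where α denotes the switching activity, C_L the load capacitance, V the supply voltage, and f the clock frequency. Since a lower clock frequency also allows a lower supply voltage, reducing f can decrease the dynamic power consumption more than linearly; the static power consumption caused by leakage currents is not captured by this model.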
Fig. 2.2 Development of the power consumption during program execution for 10 threads (top)
and 20 threads (bottom) on an Intel Broadwell processor when solving an ordinary differential
equation [184].
An important technique to reduce the power consumption is dynamic voltage and frequency scaling (DVFS). Such techniques are often controlled and co-
ordinated by a special Power Management Unit (PMU). The idea of DVFS is to
reduce the clock frequency of the processor chip to save energy during time periods
with a small workload and to increase the clock frequency again if the workload
increases again. Increasing the frequency reduces the cycle time of the processor,
and in the same amount of time more instructions can be executed than when using
a smaller frequency. An example of a desktop microprocessor with DVFS capability
is the Intel Core i7 9700 processor (Coffee Lake architecture) for which 16 clock
frequencies between 0.8 GHz and 3.0 GHz are available (3.0 GHz, 2.9 GHz, 2.7
GHz, 2.6 GHz, 2.4 GHz, 2.3 GHz, 2.1 GHz, 2.0 GHz, 1.8 GHz, 1.7 GHz, 1.5 GHz,
1.4 GHz, 1.2 GHz, 1.1 GHz, 900 MHz, 800 MHz). The operating system can dynam-
ically switch between these frequencies according to the current workload observed.
Tools such as cpufreq-set can be used by the application programmer to adjust
the clock frequency manually. For some processors, it is even possible to increase
the clock frequency of individual cores beyond the nominal frequency for short
periods of time (turbo mode).
The energy consumption E of a program execution is the product of the average power
consumption Pav and the execution time T:

E = Pav · T. (2.2)
Thus, the energy unit is Watt · sec = W s = Joule. The average power Pav captures
the average static and dynamic power consumption during program execution. The
clock frequency can be set at a fixed value or can be changed dynamically during
program execution by the operating system. The clock frequency also has an effect
on the execution time: decreasing the clock frequency leads to a larger machine cy-
cle time and, thus, a larger execution time of computations. Hence, the overall effect
of a reduction of the clock frequency on the energy consumption of a program ex-
ecution is not a priori clear: reducing the frequency decreases the power consumption,
but increases the resulting execution time. Experiments with DVFS processors have
shown that the smallest energy consumption does not necessarily correspond to the
use of the smallest operational frequency available [186, 184]. Instead, using a small
but not the smallest frequency often leads to the smallest energy consumption. The
best frequency to use depends strongly on the processor and the application program
executed, but also on the number of threads used in parallel executions.
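A small numerical illustration with invented values: if a program runs for T = 10 s at an average power of Pav = 60 W when using 3.0 GHz, the consumed energy is E = 60 W · 10 s = 600 J according to Eq. (2.2). If reducing the frequency to 2.0 GHz increases the execution time to 14 s but lowers the average power to 35 W, the energy drops to E = 35 W · 14 s = 490 J, although the program runs longer. Whether such a saving actually occurs depends on the processor and the application program, as discussed above.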
The access time to memory can have a large influence on the execution time of a
program, which is referred to as the program’s (runtime) performance. Reducing
the memory access time can improve the performance. The amount of improvement
depends on the memory access behavior of the program considered. Programs for
which the amount of memory accesses is large compared to the number of com-
putations performed may exhibit a significant benefit; these programs are called
memory-bound. Programs for which the amount of memory accesses is small com-
pared to the number of computations performed may exhibit a smaller benefit; these
programs are called compute-bound.
The technological development with a steady reduction in the VLSI (Very-large-
scale integration) feature size has led to significant improvements in processor per-
formance. Since 1980, the integer and floating-point performance on the SPEC benchmark
suite has been increasing substantially per year, see Section 2.1.
Access to DRAM chips is standardized by the JEDEC Solid State Technology As-
sociation where JEDEC stands for Joint Electron Device Engineering Council, see
jedec.org. The organization is responsible for the development of open indus-
try standards for semiconductor technologies, including DRAM chips. All leading
processor manufacturers are members of JEDEC.
For a performance evaluation of DRAM chips, the latency and the bandwidth
are used. The latency of a DRAM chip is defined as the total amount of time that
elapses between the point of time at which a memory access to a data block is issued
by the CPU and the point in time when the first byte of the block of data arrives at
the CPU. The latency is typically measured in microseconds (µs) or nanoseconds
(ns). The bandwidth denotes the number of data elements that can be read from
a DRAM chip per time unit. The bandwidth is also denoted as throughput. The
bandwidth is typically measured in megabytes per second (MB/s) or gigabytes per
second (GB/s). For the latency of DRAM chips, an average decrease of about 5%
per year could be observed between 1980 and 2005; since 2005 the improvement
in access time has declined [104]. For the bandwidth of DRAM chips, an average
annual increase of about 10% can be observed.
In 2022, the latency of the newest DRAM technology (DDR5, Double Data Rate)
lies between 13.75 and 18 ns, depending on the specific JEDEC standard used. For
the DDR5 technology, a bandwidth between 25.6 GB/s and 51.2 GB/s per DRAM
chip is obtained. For example, the DDR5-3200 A specification leads to a peak band-
width of 25.6 GB/s with a latency of 13.75 ns, the DDR5-6400 C specification has a
peak bandwidth of 51.2 GB/s and a latency of 17.50 ns. Several DRAM chips (typ-
ically between 4 and 16) can be connected to DIMMs (dual inline memory module)
to provide even larger bandwidths.
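A simple model for the time to transfer a block of m bytes from memory (an approximation, not a formula from this section) is t(m) ≈ t_lat + m / B, where t_lat denotes the latency and B the bandwidth of the DRAM chip. For example, using the DDR5-3200 A values given above (t_lat = 13.75 ns, B = 25.6 GB/s), transferring a 64-byte cache line takes about 13.75 ns + 64 B / (25.6 GB/s) = 13.75 ns + 2.5 ns ≈ 16.25 ns, i.e., for small blocks the latency clearly dominates the transfer time.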
Considering DRAM latency, it can be observed that the average memory access
time is significantly larger than the processor cycle time. The large gap between pro-
cessor cycle time and memory access time makes a suitable organization of memory
access more and more important to get good performance results at program level.
Two important approaches have been proposed to reduce the average latency for
memory access [14]: the simulation of virtual processors by each physical proces-
sor (multithreading) and the use of local caches to store data values that are accessed
often. The next two subsections give a short overview of these approaches.
A cache is a small, but fast memory that is logically located between the processor
and main memory. Physically, caches are located on the processor chip to ensure a
fast access time. A cache can be used to store data that is often accessed by the pro-
cessor, thus avoiding expensive main memory access. In most cases, the inclusion
property is used, i.e., the data stored in the cache is a subset of the data stored in main
memory. The management of the data elements in the cache is done by hardware,
e.g. by employing a set-associative strategy, see [104] and Section 2.9.1 for a de-
tailed treatment. For each memory access issued by the processor, it is first checked
by hardware whether the memory address specified currently resides in the cache.
If so, the data is loaded from the cache and no memory access is necessary. There-
fore, memory accesses that go into the cache are significantly faster than memory
accesses that require a load from the main memory. Since fast memory is expensive,
several levels of caches are typically used, starting from a small, fast and expen-
sive level 1 (L1) cache over several stages (L2, L3) to the large, but slower main
memory. For a typical processor architecture, access to the L1 cache only takes 2-4
cycles whereas access to main memory can take up to several hundred cycles. The
primary goal of cache organization is to reduce the average memory access time as
far as possible and to achieve an access time as close as possible to that of the L1
cache. Whether this can be achieved depends on the memory access behavior of the
program considered, see also Section 2.9.
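The benefit of a cache can be quantified by the standard average memory access time model (a common textbook formula, not taken verbatim from this chapter):

T_avg = T_hit + m · T_miss,

where T_hit is the cache access time, m the miss rate, and T_miss the additional penalty for accessing main memory. For hypothetical values T_hit = 3 cycles and T_miss = 200 cycles, a miss rate of m = 0.02 already yields T_avg = 3 + 0.02 · 200 = 7 cycles, which illustrates why a low miss rate is essential for good performance.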
Caches are used for nearly all processors, and they also play an important role
for SMPs with a shared address space and parallel computers with distributed mem-
ory organization. If shared data is used by multiple processors or cores, it may be
replicated in multiple caches to reduce access latency. Each processor or core should
have a coherent view of the memory system, i.e., any read access should return the
most recently written value no matter which processor or core has issued the cor-
responding write operation. A coherent view would be destroyed if a processor p
changes the value in a memory address in its local cache without writing this value
back to main memory. If another processor q would later read this memory address,
it would not get the most recently written value. But even if p writes the value back
to main memory, this may not be sufficient if q has a copy of the same memory
location in its local cache. In this case, it is also necessary to update the copy in the
local cache of q. The problem of providing a coherent view of the memory system is
often referred to as the cache coherence problem. To ensure cache coherence, a cache
coherence protocol must be used, see Section 2.9.3 and [45, 104, 100] for a more
detailed description.
Parallel computers have been used for many years, and many different architectural
alternatives have been proposed and implemented. In general, a parallel computer
can be characterized as a collection of processing elements that can communicate
and cooperate to solve large problems quickly [14]. This definition is intentionally
quite vague to capture a large variety of parallel platforms. Many important details
are not addressed by the definition, including the number and complexity of the
processing elements, the structure of the interconnection network between the pro-
cessing elements, the coordination of the work between the processing elements as
well as important characteristics of the problem to be solved.
For a more detailed investigation, it is useful to introduce a classification accord-
ing to important characteristics of a parallel computer. A simple model for such a
classification is given by Flynn’s taxonomy [64]. This taxonomy characterizes par-
allel computers according to the global control and the resulting data and control
flows. Four categories are distinguished:
1. Single-Instruction, Single-Data (SISD): There is one processing element which
has access to a single program and a single data storage. In each step, the pro-
cessing element loads an instruction and the corresponding data and executes the
instruction. The result is stored back into the data storage. Thus, SISD describes
a conventional sequential computer according to the von Neumann model.
2. Multiple-Instruction, Single-Data (MISD): There are multiple processing ele-
ments each of which has a private program memory, but there is only one com-
mon access to a single global data memory. In each step, each processing element
obtains the same data element from the data memory and loads an instruction
from its private program memory. These possibly different instructions are then
executed in parallel by the processing elements using the previously obtained
(identical) data element as operand. This execution model is very restrictive and
no commercial parallel computer of this type has ever been built.
3. Single-Instruction, Multiple-Data (SIMD): There are multiple processing ele-
ments each of which has a private access to a (shared or distributed) data mem-
ory, see Section 2.5 for a discussion of shared and distributed address spaces.
But there is only one program memory from which a special control processor
fetches and dispatches instructions. In each step, each processing element obtains
the same instruction from the control processor and loads a separate data element
through its private data access on which the instruction is performed. Thus, the
same instruction is synchronously applied in parallel by all processing elements
to different data elements.
For applications with a significant degree of data parallelism, the SIMD approach
can be very efficient. Examples are multimedia applications or computer graph-
ics algorithms which generate realistic three-dimensional views of computer-
generated environments. Algorithms from scientific computing are often based
on large arrays and can therefore also benefit from SIMD computations.
4. Multiple-Instruction, Multiple-Data (MIMD): There are multiple processing
elements each of which has a separate instruction and a separate data access to
a (shared or distributed) program and data memory. In each step, each process-
ing element loads a separate instruction and a separate data element, applies the
instruction to the data element, and stores a possible result back into the data
storage. The processing elements work asynchronously to each other. MIMD
computers are the most general form of parallel computers in Flynn’s taxonomy.
Multicore processors or cluster systems are examples for the MIMD model.
Compared to MIMD computers, SIMD computers have the advantage that they are
easy to program, since there is only one program flow, and the synchronous execu-
tion does not require synchronization at program level. But the synchronous execu-
tion is also a restriction, since conditional statements of the form
if (b == 0) c = a; else c = a / b;
must be executed in two steps. In the first step, all processing elements whose lo-
cal value of b is zero execute the then part. In the second step, all other process-
ing elements execute the else part. Some processors support SIMD computations
as additional possibility for processing large uniform data sets. An example is the
x86 architecture which provides SIMD instructions in the form of SSE (Streaming
SIMD Extensions) or AVX (Advanced Vector Extensions) instructions. AVX exten-
sions were first introduced in 2011 by Intel and AMD and are now supported
in nearly all modern desktop and server processors. The features of AVX have been
extended several times, and since 2017 AVX-512 is available. AVX-512 is based
on a separate set of 512-bit registers. Each of these registers can store 16 single-
precision 32-bit or eight double-precision 64-bit floating-point numbers, on which
arithmetic operations can be executed in SIMD style, see Sect. 3.4 for a more de-
tailed description. The computations of GPUs are also based on the SIMD concept,
see Sect. 7.1 for a more detailed description.
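As a small illustration of SIMD processing on the x86 architecture, the following C sketch adds two float arrays with AVX-512 intrinsics. It is a minimal example under the assumption that the processor supports AVX-512F and that the array length n is a multiple of 16; the function and variable names are chosen for illustration only.

#include <immintrin.h>

/* Sketch: elementwise addition c[i] = a[i] + b[i] using 512-bit SIMD registers.
   Assumes AVX-512F support and n being a multiple of 16. */
void add_avx512(const float *a, const float *b, float *c, int n) {
  for (int i = 0; i < n; i += 16) {
    __m512 va = _mm512_loadu_ps(&a[i]);   /* load 16 single-precision values */
    __m512 vb = _mm512_loadu_ps(&b[i]);
    __m512 vc = _mm512_add_ps(va, vb);    /* one instruction adds 16 pairs */
    _mm512_storeu_ps(&c[i], vc);          /* store 16 results */
  }
}

Compiled with AVX-512 support enabled (e.g., gcc -mavx512f), each loop iteration processes 16 array elements with a single vector addition, which corresponds to the synchronous application of the same instruction to different data elements described above.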
MIMD computers are more flexible than SIMD computers, since each processing
element can execute its own program flow. On the upper level, multicore processors
as well as all parallel computers are based on the MIMD concept. Although Flynn’s
taxonomy only provides a coarse classification, it is useful to give an overview of
the design space of parallel computers.
Nearly all general-purpose parallel computers are based on the MIMD model. A fur-
ther classification of MIMD computers can be done according to their memory orga-
nization. Two aspects can be distinguished: the physical memory organization and
the view of the programmer to the memory. For the physical organization, comput-
ers with a physically shared memory (also called multiprocessors) and computers
with a physically distributed memory (also called multicomputers) can be distin-
guished, see Fig. 2.3. But there exist also many hybrid organizations, for example
providing a virtually shared memory on top of a physically distributed memory.
From the programmer’s point of view, there are computers with a distributed ad-
dress space and computers with a shared address space. This view does not necessar-
ily correspond to the actual physical memory organization. For example, a parallel
computer with a physically distributed memory may appear to the programmer as a
computer with a shared address space when a corresponding programming environ-
ment is used. In the following, the physical organization of the memory is discussed
in more detail.
Computers with a physically distributed memory are also called distributed mem-
ory machines (DMM). They consist of a number of processing elements (called
nodes) and an interconnection network which connects the nodes and supports the
transfer of data between the nodes. A node is an independent unit, consisting of
processor, local memory and, sometimes, peripheral elements, see Fig. 2.4 a).
The data of a program is stored in the local memory of one or several nodes.
All local memory is private and only the local processor can access its own lo-
cal memory directly. When a processor needs data from the local memory of other
nodes to perform local computations, message-passing has to be performed via the
interconnection network. Therefore, distributed memory machines are strongly con-
nected with the message-passing programming model which is based on communi-
cation between cooperating sequential processes, see Chapters 3 and 5. To perform
message-passing, two processes PA and PB on different nodes A and B issue corre-
sponding send and receive operations. When PB needs data from the local memory
of node A, PA performs a send operation containing the data for the destination pro-
cess PB . PB performs a receive operation specifying a receive buffer to store the data
from the source process PA from which the data is expected.
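The following C sketch (a minimal example assuming a standard MPI installation, not a program from the book) shows such a pair of send and receive operations: process 0 plays the role of PA on node A and process 1 the role of PB on node B.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank;
  double data[100];
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {                           /* process PA on node A */
    for (int i = 0; i < 100; i++) data[i] = i;   /* data in local memory of A */
    MPI_Send(data, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {                    /* process PB on node B */
    MPI_Recv(data, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("received %f ... %f\n", data[0], data[99]); /* data now in local memory of B */
  }
  MPI_Finalize();
  return 0;
}

The program can be compiled with mpicc and started with mpirun -np 2, so that the two processes may run on different nodes and exchange the array via the interconnection network.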
The architecture of computers with a distributed memory has experienced many
changes over the years, especially concerning the interconnection network and the
coupling of network and nodes. The interconnection networks of earlier multicom-
puters were often based on point-to-point connections between nodes. A node is
connected to a fixed set of other nodes by physical connections. The structure of
the interconnection network can be represented as a graph structure. The nodes of
the graph represent the processors, the edges represent the physical interconnections
(also called links). Typically, the graph exhibits a regular structure. A typical net-
work structure is the hypercube which is used in Fig. 2.4 b) to illustrate the node
connections; a detailed description of interconnection structures is given in Section
2.7. In networks with point-to-point connections, the structure of the network deter-
mines the possible communications, since each node can only exchange data with
its direct neighbors. To decouple send and receive operations, buffers can be used
to store a message until the communication partner is ready. Point-to-point con-
Fig. 2.4 Illustration of computers with distributed memory: a) abstract structure, b) computer
with distributed memory and hypercube as interconnection structure, c) DMA (direct memory
access), d) processor-memory node with router and e) interconnection network in form of a
mesh to connect the routers of the different processor-memory nodes.
Point-to-point connections restrict parallel programming, since the network topology determines the possibilities for data exchange, and parallel algorithms have to be formulated such that their communication pattern fits the given network structure [8, 144].
The execution of communication operations can be decoupled from the proces-
sor’s operations by adding a DMA controller (DMA - direct memory access) to the
nodes to control the data transfer between the local memory and the I/O controller.
This enables data transfer from or to the local memory without participation of the
processor (see Fig. 2.4 c) for an illustration) and allows asynchronous communica-
tion. A processor can issue a send operation to the DMA controller and can then
continue local operations while the DMA controller executes the send operation.
Messages are received at the destination node by its DMA controller which copies
the enclosed data to a specific system location in local memory. When the processor
then performs a receive operation, the data are copied from the system location to
the specified receive buffer. Communication is still restricted to neighboring nodes
in the network. Communication between nodes that do not have a direct connec-
tion must be controlled by software to send a message along a path of direct inter-
connections. Therefore, communication times between nodes that are not directly
connected can be much larger than communication times between direct neighbors.
Thus, it is still more efficient to use algorithms whose communication pattern matches the given network structure.
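At the programming level, such asynchronous, DMA-supported transfers correspond to nonblocking communication operations. The following hedged sketch uses the nonblocking MPI operations covered in Chapter 5 to start a send, overlap it with local computations, and wait for its completion afterwards; the destination rank 1 and the tag 0 are arbitrary.

#include <mpi.h>

/* Start a nonblocking send, overlap it with local work, then wait for completion. */
void overlapped_send(double *data, int n) {
  MPI_Request req;
  MPI_Isend(data, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
  /* ... local computations that do not modify data[] ... */
  MPI_Wait(&req, MPI_STATUS_IGNORE);    /* the send buffer may be reused afterwards */
}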
A further decoupling can be obtained by putting routers into the network, see Fig.
2.4 d). The routers form the actual network over which communication can be per-
formed. The nodes are connected to the routers, see Fig. 2.4 e). Hardware-supported
routing reduces communication times as messages for processors on remote nodes
can be forwarded by the routers along a pre-selected path without interaction of
the processors in the nodes along the path. With router support, the difference in communication time between neighboring and remote nodes is not large; it depends on the switching technique, see Sect. 2.8.3. Each physical I/O channel of a router can be used by only one message at a specific point in time. To decouple message forwarding, message buffers are used for each I/O channel to store messages; in addition, specific routing algorithms are applied to avoid deadlocks, see also Sect. 2.8.1.
Technically, DMMs are quite easy to assemble since standard desktop computers
or servers can be used as nodes. The programming of DMMs requires a careful
data layout, since each processor can access only its local data directly. Non-local
data must be accessed via message-passing, and the execution of the corresponding
send and receive operations takes significantly longer than a local memory access.
Depending on the interconnection network and the communication library used, the
difference can be more than a factor of 100. Therefore, data layout may have a
significant influence on the resulting parallel runtime of a program. The data layout
should be selected such that the number of message transfers and the size of the data
blocks exchanged are minimized.
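The following hedged sketch illustrates this guideline by contrasting an element-wise transfer of an array with the transfer of the same array as one aggregated message; since every message transfer incurs a startup overhead, the aggregated variant is usually much faster. The destination rank, the tag, and the function names are chosen for illustration only.

#include <mpi.h>

/* Variant 1: n separate messages, paying the message startup time n times. */
void send_elementwise(double *v, int n) {
  for (int i = 0; i < n; i++)
    MPI_Send(&v[i], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
}

/* Variant 2: one message containing all n elements, paying the startup time once. */
void send_aggregated(double *v, int n) {
  MPI_Send(v, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
}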
The structure of DMMs has many similarities with networks of workstations
(NOWs) in which standard workstations are connected by a fast local area network
(LAN). An important difference is that interconnection networks of DMMs are typi-
cally more specialized and provide larger bandwidth and lower latency, thus leading
to a faster message exchange.
Collections of complete computers with a dedicated interconnection network are
often called clusters. Clusters are usually based on standard computers and even
standard network topologies. The entire cluster is addressed and programmed as a
single unit. The popularity of clusters as parallel machines comes from the availabil-
ity of standard high-speed interconnections, such as FCS (Fibre Channel Standard),
SCI (Scalable Coherent Interface), Switched Gigabit Ethernet, Myrinet, or Infini-
Band, see [175, 104, 171]. A natural programming model of DMMs is the message-
passing model that is supported by communication libraries, such as MPI or PVM,
see Chapter 5 for a detailed treatment of MPI. These libraries are often based on
standard protocols such as TCP/IP [139, 174].
The difference between cluster systems and distributed systems lies in the fact
that the nodes in cluster systems use the same operating system and usually cannot be addressed individually; instead, a special job scheduler must be used. Several
cluster systems can be connected to grid systems by using middleware software,
such as the Globus Toolkit, see www.globus.org [72]. This allows a coordinated
collaboration of several clusters. In Grid systems, the execution of application pro-
grams is controlled by the middleware software.
Cluster systems are also used for the provision of services in the area of cloud
computing. Using cloud computing, each user can allocate virtual resources which
are provided via the cloud infrastructure as part of the cluster system. The user can
dynamically allocate and use resources according to his or her computational re-
quirements. Depending on the allocation, a virtual resource can be a single cluster
node or a collection of cluster nodes. Examples of cloud infrastructures are the
Amazon Elastic Compute Cloud (EC2), Microsoft Azure, and Google Cloud. Ama-
zon EC2 offers a variety of virtual machines with different performance and memory
capacity that can be rented by users on an hourly or monthly basis. Amazon EC2 is
part of Amazon Web Services (AWS), which provides many cloud computing ser-
vices in different areas, such as computing, storage, database, network and content
delivery, analytics, machine learning, and security, see aws.amazon.com. In the
third quarter of 2022, AWS had a share of 34 % of the global cloud infrastructure
service market, followed by Microsoft Azure with 21 % and Google Cloud with 11
%, see www.srgresearch.com.
Computers with a physically shared memory are also called shared memory ma-
chines (SMMs). The shared memory is also called global memory. SMMs consist
of a number of processors or cores, a shared physical memory (global memory) and
an interconnection network to connect the processors with the memory. The shared
memory can be implemented as a set of memory modules. Data can be exchanged
between processors via the global memory by reading or writing shared variables.
The cores of a multicore processor are an example of an SMM, see Sect. 2.6.2 for
a more detailed description. Physically, the global memory usually consists of sep-
arate memory modules providing a common address space which can be accessed
by all processors, see Fig. 2.5 for an illustration.
Fig. 2.5 Illustration of a computer with shared memory: a) abstract view and b) implementation of the shared memory with several memory modules (P = processor, M = memory module).
A natural programming model for SMMs is the use of shared variables which
can be accessed by all processors. Communication and cooperation between the
processors is organized by writing and reading shared variables that are stored in
the global memory. Unsynchronized concurrent accesses to shared variables by several processors should be avoided, since they can cause race conditions with unpredictable effects, see also Chapters 3 and 6.
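As a minimal illustration, the following Pthreads sketch lets several threads increment a shared counter; a mutex protects the shared variable so that the concurrent accesses do not lead to a race condition. The number of threads and increments is chosen arbitrarily; thread programming is treated in detail in Chapter 6.

/* Several threads update a shared variable; a mutex protects the critical section. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

long counter = 0;                                  /* shared variable in global memory */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg) {
  (void) arg;
  for (int i = 0; i < 100000; i++) {
    pthread_mutex_lock(&lock);                     /* enter critical section */
    counter++;                                     /* safe update of the shared variable */
    pthread_mutex_unlock(&lock);
  }
  return NULL;
}

int main(void) {
  pthread_t t[NUM_THREADS];
  for (int i = 0; i < NUM_THREADS; i++)
    pthread_create(&t[i], NULL, work, NULL);
  for (int i = 0; i < NUM_THREADS; i++)
    pthread_join(t[i], NULL);
  printf("counter = %ld\n", counter);              /* prints 400000 */
  return 0;
}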
The existence of a global memory is a significant advantage, since communica-
tion via shared variables is easy and since no data replication is necessary, as is sometimes the case for DMMs. However, the technical realization of SMMs requires
a larger effort, in particular because the interconnection network must provide fast
access to the global memory for each processor. This can be ensured for a small
number of processors, but scaling beyond a few dozen processors is difficult.
A special variant of SMMs are symmetric multiprocessors (SMPs). SMPs have
a single shared memory which provides a uniform access time from any processor
for all memory locations, i.e., all memory locations are equidistant to all processors
[45, 171]. SMPs usually have a small number of processors that are connected via
a central bus or interconnection network, which also provides access to the shared
memory. There are usually no private memories of processors or specific I/O pro-
cessors, but each processor has a private cache hierarchy. As usual, access to a local
cache is faster than access to the global memory. In the spirit of the definition from
above, each multicore processor with several cores is an SMP system.
SMPs usually have only a small number of processors, since the central bus has
to provide a constant bandwidth which is shared by all processors. When too many
processors are connected, more and more access collisions may occur, thus increas-
ing the effective memory access time. This can be alleviated by the use of caches and
suitable cache coherence protocols, see Sect. 2.9.3. The maximum number of pro-
cessors used in bus-based SMPs typically lies between 32 and 64. Interconnection
schemes with a higher bandwidth are also used, such as parallel buses (IBM Power
8), ring interconnects (Intel Xeon E7) or crossbar connections (Fujitsu SPARC64
X+) [171].
Parallel programs for SMMs are often based on the execution of threads. A
thread is a separate control flow which shares data with other threads via a global
address space. A distinction can be made between kernel threads, which are managed by the operating system, and user threads, which are explicitly generated and controlled by the parallel program, see Section 3.8.2. The kernel threads are mapped by the op-
erating system to processors or cores for execution. User threads are managed by the
specific programming environment used and are mapped to kernel threads for exe-
cution. The mapping algorithms as well as the exact number of processors or cores
can be hidden from the user by the operating system. The processors or cores are
completely controlled by the operating system. The operating system can also start
multiple sequential programs from several users on different processors or cores,
when no parallel program is available. Small SMP systems are often used as servers because of their cost-effectiveness, see [45, 175] for a detailed description.
SMP systems can be used as nodes of a larger parallel computer by employ-
ing an interconnection network for data exchange between processors of different
SMP nodes. For such systems, a shared address space can be defined by using a
suitable cache coherence protocol, see Sect. 2.9.3. A coherence protocol provides
the view of a shared address space, although the actual physical memory might be
distributed. Such a protocol must ensure that any memory access returns the most
recently written value for a specific memory address, no matter where this value is
stored physically. The resulting systems are also called distributed shared mem-
ory (DSM) architectures. In contrast to single SMP systems, the access time in DSM
systems depends on the location of a data value in the global memory, since an ac-
cess to a data value in the local SMP memory is faster than an access to a data value
in the memory of another SMP node via the coherence protocol. These systems are
therefore also called NUMAs (non-uniform memory access), see Fig. 2.6 (b). Since
single SMP systems have a uniform memory latency for all processors, they are also
called UMAs (uniform memory access).
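The placement of data in the distributed physical memory therefore influences the access time observed by a processor. As a hedged sketch, assuming a Linux system with the libnuma library installed (compiled with -lnuma), the following fragment allocates a buffer in the local memory of NUMA node 0; the buffer size and the node number are chosen for illustration only.

#include <numa.h>                        /* Linux libnuma (assumption) */
#include <stdio.h>

int main(void) {
  if (numa_available() < 0) {            /* check for NUMA support */
    printf("no NUMA support on this system\n");
    return 1;
  }
  size_t size = 1024 * 1024 * sizeof(double);
  double *buf = numa_alloc_onnode(size, 0);   /* place the pages on NUMA node 0 */
  if (buf != NULL) {
    buf[0] = 42.0;                        /* accesses from node 0 are local and fast */
    numa_free(buf, size);
  }
  return 0;
}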
CC-NUMA (Cache-Coherent NUMA) systems are computers with a virtually
shared address space for which cache coherence is ensured, see Fig. 2.6 (c). Thus,
each processor’s cache can store data not only from the processor’s local memory
but also from the shared address space. A suitable coherence protocol, see Sect.
2.9.3, ensures a consistent view by all processors. COMA (Cache-Only Memory
Architecture) systems are variants of CC-NUMA in which the local memories of
the different processors are used as caches, see Fig. 2.6 (d).
The architectural organization within a processor chip may require the use of ex-
plicitly parallel programs to efficiently use the resources provided. This is called
thread-level parallelism, since the multiple control flows needed are often called threads.
Fig. 2.6 Illustration of the architecture of computers with shared memory: a) SMP – symmetric multiprocessors, b) NUMA – non-uniform memory access, c) CC-NUMA – cache-coherent NUMA and d) COMA – cache-only memory architecture (P = processing element, M = local memory, C = cache).
With simultaneous multithreading (SMT), a single physical processor appears to the operating system as multiple logical processors that share most of the hardware resources of the physical processor. For two logical processors, the required increase in chip area for an Intel Xeon pro-
cessor is less than 5% [149, 225]. The shared resources are assigned to the logical
processors for simultaneous use, thus leading to a simultaneous execution of logical
processors. When a logical processor must wait for an event, the resources can be
assigned to another logical processor. This leads to a continuous use of the resources
from the view of the physical processor. Waiting times for logical processors can oc-
cur for cache misses, wrong branch predictions, dependencies between instructions,
and pipeline hazards.
Investigations have shown that the simultaneous use of processor resources by
two logical processors can lead to performance improvements between 15% and
30%, depending on the application program [149]. Since the processor resources are
shared by the logical processors, it cannot be expected that the use of more than two
logical processors can lead to a significant additional performance improvement.
Therefore, SMT will likely be restricted to a small number of logical processors.
In 2022, examples of processors that support SMT are the Intel Core i3, i5, and
i7 processors (supporting two logical processors), the IBM Power9 and Power10
processors (four or eight logical processors, depending on the configuration), as
well as the AMD Zen 3 processors (two logical processors per core).
To use SMT to obtain performance improvements, it is necessary that the op-
erating system is able to control logical processors. From the point of view of the
application program, it is necessary that there is a separate thread available for ex-
ecution for each logical processor. Therefore, the application program must apply
parallel programming techniques to get performance improvements for SMT pro-
cessors.
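The following hedged sketch illustrates this requirement: on a Linux/POSIX system, the number of logical processors controlled by the operating system can be queried with sysconf, and one thread can then be created per logical processor; the work performed by the threads is left as a placeholder.

/* Create one worker thread per logical processor reported by the operating system
   (assumes a POSIX system such as Linux). */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void *worker(void *arg) {
  long id = (long) arg;                  /* index of this worker */
  (void) id;
  /* ... computational work of thread id ... */
  return NULL;
}

int main(void) {
  long n = sysconf(_SC_NPROCESSORS_ONLN);   /* logical processors, including SMT */
  pthread_t *t = malloc(n * sizeof(pthread_t));
  for (long i = 0; i < n; i++)
    pthread_create(&t[i], NULL, worker, (void *) i);
  for (long i = 0; i < n; i++)
    pthread_join(t[i], NULL);
  printf("used %ld logical processors\n", n);
  free(t);
  return 0;
}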
There are two main reasons why the speed of processor clocks cannot be in-
creased significantly [132]. First, the increase of the number of transistors on
a chip is mainly achieved by increasing the transistor density. But this also in-
creases the power density and heat production because of leakage current and
power consumption, thus requiring an increased effort and more energy for cool-
ing. Second, memory access times could not be reduced at the same rate as pro-
cessor clock speed has been increased. This leads to an increased number of ma-
chine cycles for a memory access. For example, in 1990 a main memory access required between 6 and 8 machine cycles for a typical desktop computer system. By 2012, the memory access time had increased significantly, to about 180 machine cycles
for an Intel Core i7 processor. Since then, memory access time has increased fur-
ther and in 2022, the memory latencies for an AMD EPYC Rome and an Intel
Xeon Cascade Lake SP server processor are 220 cycles and 200 cycles, respec-
tively [221]. Therefore, memory access times could become a limiting factor for
further performance increase, and cache memories are used to prevent this, see
Section 2.9 for a further discussion. In the future, it can be expected that the number of cycles needed for a memory access will not change significantly.
There are more problems that processor designers have to face: Using the in-
creased number of transistors to increase the complexity of the processor archi-
tecture may also lead to an increase in processor-internal wire length to transfer
control and data between the functional units of the processor. Here, the speed
of signal transfers within the wires could become a limiting factor. For example,
a processor with a clock frequency of 3 GHz = 3 · 10⁹ Hz has a cycle time of 1/(3 · 10⁹ Hz) ≈ 0.33 · 10⁻⁹ s = 0.33 ns. Assuming a signal transfer at the speed of light (which is about 0.3 · 10⁹ m/s), a signal can cross a distance of 0.33 · 10⁻⁹ s · 0.3 · 10⁹ m/s ≈ 10 cm in one processor cycle. This is not significantly larger than the typical size
of a processor chip and wire lengths become an important issue when the clock
frequency is increased further.
Another problem is the following: The physical size of a processor chip limits
the number of pins that can be used, thus limiting the bandwidth between CPU and
main memory. This may lead to a processor-to-memory performance gap which
is sometimes referred to as memory wall. This makes the use of high-bandwidth
memory architectures with an efficient cache hierarchy necessary [19].
All these reasons inhibit a processor performance increase at the previous rate
when using the traditional techniques. Instead, new processor architectures have to
be used, and the use of multiple cores on a single processor die is considered as the
most promising approach. Instead of further increasing the complexity of the inter-
nal organization of a processor chip, this approach integrates multiple independent
processing cores with a relatively simple architecture onto one processor chip. This
has the additional advantage that the energy consumption of a processor chip can
be reduced if necessary by switching off unused processor cores during idle times
[102].
Multicore processors integrate multiple execution cores on a single processor
chip. For the operating system, each execution core represents an independent log-
ical processor with separate execution resources, such as functional units or execu-
tion pipelines. Each core has to be controlled separately, and the operating system
can assign different application programs to the different cores to obtain a parallel
execution. Background applications like virus checking, image compression, and
encoding can run in parallel to application programs of the user. By using techniques
of parallel programming, it is also possible to execute a computation-intensive application program (such as a computer game, computer vision, or a scientific simulation) in parallel on a set of cores, thus reducing the execution time compared to an execution on a single core, or producing more accurate results by performing more computations than in the sequential case. In the future, users of standard application programs such as computer games will likely expect an efficient use of the execution cores of a
processor chip. To achieve this, programmers have to use techniques from parallel
programming.
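As a minimal example of such a technique, the following OpenMP sketch distributes the iterations of a compute-intensive loop over the cores of a multicore processor; the array size and the computation performed are arbitrary. OpenMP and other thread programming models are treated in Chapter 6.

/* Distribute loop iterations over the available cores with OpenMP;
   compile, e.g., with gcc -fopenmp. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
  #pragma omp parallel for               /* iterations are shared among the cores */
  for (int i = 0; i < N; i++)
    b[i] = 2.0 * a[i] + 1.0;
  printf("up to %d threads were available\n", omp_get_max_threads());
  return 0;
}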
The use of multiple cores on a single processor chip also enables standard pro-
grams, such as text processing, office applications, or computer games, to provide
additional features that are computed in the background on a separate core so that
the user does not notice any delay in the main application. But again, techniques of
parallel programming have to be used for the implementation.
There are many different design variants for multicore processors, differing in the
number of cores, the structure and size of the caches, the access of cores to caches,
and the use of heterogeneous components. From a high level view, [133] distin-
guished three main types of architectures: (a) a hierarchical design in which mul-
tiple cores share multiple caches that are organized in a tree-like configuration, (b)
a pipelined design where multiple execution cores are arranged in a pipelined way,
and (c) a network-based design where the cores are connected via an on-chip inter-
connection network, see Fig. 2.7 for an illustration. In 2022, most multicore proces-
sors are based on a network-based design using a fast on-chip interconnect to couple
the different cores. Earlier multicore processors often relied on a hierarchical design.
Fig. 2.7 Design variants for multicore processors: a) hierarchical design, b) pipelined design, and c) network-based design.
In 2022, multicore processors for desktop and mobile systems usually have between four and 16 cores. The reason for this usage lies in the fact that proces-
sors with a large number of cores provide a higher performance, but they also have
a larger power consumption and a higher price than processors with a smaller num-
ber of cores. The larger performance can especially be exploited for server systems
when executing jobs from different users. Desktop computers or mobile computers
are used by a single user and the performance requirement is therefore smaller than
for server systems. Hence, less expensive systems are sufficient with the positive
effect that they are also more energy-efficient.
Table 2.1 Examples of multicore processors with a homogeneous design available in 2022.
Most processors for server and desktop systems rely on a homogeneous design,
i.e., they contain identical cores where each core has the same computational per-
formance and the same power consumption, see Figure 2.8 (left) for an illustration.
Fig. 2.8 Illustration of multicore architectures with a homogeneous design (left) and with a heterogeneous design (right). The homogeneous design contains eight identical cores C. The heterogeneous design has three performance cores P, five energy-efficient cores E, a cryptography core Cr, and a graphics unit Gr.
This has the advantage that the computational work can be distributed evenly among
the different cores, such that each core gets about the same amount of work. Thus,
the execution time of application programs can be reduced if suitable multithreading
programming techniques are used.
Table 2.1 gives examples of multicore processors with a homogeneous design, specifying the number of cores and threads, the base clock frequency in GHz, as well as information about the sizes of the L1, L2, and L3 caches and the release year. The size given for the L1 cache is the size of the L1 data cache. The Intel Core i9-11900 and the AMD Ryzen 9 7950X processors shown in the table are desktop processors; the Intel Core i9-11950H is a processor for notebooks.
The other four processors are designed for use in server systems. These server processors provide higher performance than the desktop or mobile processors, but
they are also much more expensive. All processors shown have private L1 and L2
caches for each core and use shared L3 caches. The clock frequencies given are
the base frequencies that can be increased to a higher turbo frequency for a short
time period if required. Considering the power consumption, it can be observed that the server processors have a much larger power consumption than the desktop and mobile processors. This can be seen from the Thermal Design Power (TDP), which captures the maximum amount of heat that is generated by a processor. The TDP is measured in watts; it determines the cooling requirement for the processor and can be used as a rough measure when comparing the average power consumption of processor chips. For example, the TDP is 270 W for the Intel Xeon Platinum 8380 server processor, 65 W for the Intel Core i9-11900 desktop processor, and 35 W for the Intel Core i9-11950H mobile processor.