Parallel Programming
Thomas Rauber • Gudula Rünger

Parallel Programming
for Multicore and Cluster Systems

Third Edition
Thomas Rauber, Lehrstuhl für Angewandte Informatik II, University of Bayreuth, Bayreuth, Bayern, Germany
Gudula Rünger, Fakultät für Informatik, Chemnitz University of Technology, Chemnitz, Sachsen, Germany

Second English Edition was a translation from the 3rd German language edition: Parallele
Programmierung (3. Aufl. 2012) by T. Rauber and G. Rünger, Springer-Verlag Berlin
Heidelberg 2000, 2007, 2012.

ISBN 978-3-031-28923-1 ISBN 978-3-031-28924-8 (eBook)


https://doi.org/10.1007/978-3-031-28924-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2010, 2013, 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Innovations in hardware architecture, such as hyper-threading or multicore processors, make parallel computing resources available for computer systems in differ-
ent areas, including desktop and laptop computers, mobile devices, and embedded
systems. However, the efficient usage of the parallel computing resources requires
parallel programming techniques. Today, many standard software products are al-
ready based on concepts of parallel programming to use the hardware resources of
multicore processors efficiently. This trend will continue and the need for parallel
programming will extend to all areas of software development. The application area
will be much larger than the area of scientific computing, which used to be the main
area for parallel computing for many years. The expansion of the application area
for parallel computing will lead to an enormous need for software developers with
parallel programming skills. Some chip manufacturers already demand to include
parallel programming as a standard course in computer science curricula. A more
recent trend is the use of Graphics Processing Units (GPUs), which may comprise
several thousands of cores, for the execution of compute-intensive non-graphics ap-
plications.
This book covers the new development in processor architecture and parallel
hardware. Moreover, important parallel programming techniques that are necessary
for developing efficient programs for multicore processors as well as for parallel
cluster systems or supercomputers are provided. Both shared and distributed address
space architectures are covered. The main goal of the book is to present parallel
programming techniques that can be used in many situations for many application
areas and to enable the reader to develop correct and efficient parallel programs.
Many example programs and exercises are provided to support this goal and to show
how the techniques can be applied to further applications. The book can be used as a
textbook for students as well as a reference book for professionals. The material of
the book has been used for courses in parallel programming at different universities
for many years.
This third edition of the English book on parallel programming is an updated
and revised version based on the second edition of this book from 2013. The three
earlier German editions appeared in 2000, 2007, and 2012, respectively. The update

of this new English edition includes an extended update of the chapter on com-
puter architecture and performance analysis taking new developments such as the
aspect of energy consumption into consideration. The description of OpenMP has
been extended and now also captures the task concept of OpenMP. The chapter
on message-passing programming has been extended and updated to include new
features of MPI such as extended reduction operations and non-blocking collec-
tive communication operations. The chapter on GPU programming also has been
updated. All other chapters also have been revised carefully.
The content of the book consists of three main parts, covering all areas of parallel
computing: the architecture of parallel systems, parallel programming models and
environments, and the implementation of efficient application algorithms. The em-
phasis lies on parallel programming techniques needed for different architectures.
The first part contains an overview of the architecture of parallel systems, includ-
ing cache and memory organization, interconnection networks, routing and switch-
ing techniques as well as technologies that are relevant for modern and future mul-
ticore processors. Issues of power and energy consumption are also covered.
The second part presents parallel programming models, performance models,
and parallel programming environments for message passing and shared memory
models, including the message passing interface (MPI), Pthreads, Java threads, and
OpenMP. For each of these parallel programming environments, the book intro-
duces basic concepts as well as more advanced programming methods and enables
the reader to write and run semantically correct and computationally efficient par-
allel programs. Parallel design patterns, such as pipelining, client-server, or task
pools are presented for different environments to illustrate parallel programming
techniques and to facilitate the implementation of efficient parallel programs for a
wide variety of application areas. Performance models and techniques for runtime
analysis are described in detail, as they are a prerequisite for achieving efficiency
and high performance. A chapter gives a detailed description of the architecture of
GPUs and also contains an introduction into programming approaches for general
purpose GPUs concentrating on CUDA and OpenCL. Programming examples are
provided to demonstrate the use of the specific programming techniques introduced.
The third part applies the parallel programming techniques from the second part
to representative algorithms from scientific computing. The emphasis lies on basic
methods for solving linear equation systems, which play an important role for many
scientific simulations. The focus of the presentation is the analysis of the algorithmic
structure of the different algorithms, which is the basis for a parallelization, and not
so much on mathematical properties of the solution methods. For each algorithm,
the book discusses different parallelization variants, using different methods and
strategies.
Many colleagues and students have helped to improve the quality of this book.
We would like to thank all of them for their help and constructive criticisms. For
numerous corrections we would like to thank Robert Dietze, Jörg Dümmler, Mar-
vin Ferber, Michael Hofmann, Ralf Hoffmann, Sascha Hunold, Thomas Jakobs,
Oliver Klöckner, Matthias Korch, Ronny Kramer, Raphael Kunis, Jens Lang, Isabel
Mühlmann, John O’Donnell, Andreas Prell, Carsten Scholtes, Michael Schwind,
and Jesper Träff. Many thanks to Thomas Jakobs, Matthias Korch, Carsten Scholtes
and Michael Schwind for their help with the program examples and the exercises.
We thank Monika Glaser and Luise Steinbach for their help and support with the
LaTeX typesetting of the book. We also thank all the people who have been involved
in the writing of the three German versions of this book. It has been a pleasure work-
ing with the Springer Verlag in the development of this book. We especially thank
Ralf Gerstner for his continuous support.

Bayreuth and Chemnitz, January 2023
Thomas Rauber
Gudula Rünger
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Classical Use of Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Parallelism in Today’s Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Basic Concepts of parallel programming . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Overview of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Parallel Computer Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


2.1 Processor Architecture and Technology Trends . . . . . . . . . . . . . . . . . . 10
2.2 Power and Energy Consumption of Processors . . . . . . . . . . . . . . . . . . 13
2.3 Memory access times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 DRAM access times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Multithreading for hiding memory access times . . . . . . . . . . . 18
2.3.3 Caches for reducing the average memory access time . . . . . . 18
2.4 Flynn’s Taxonomy of Parallel Architectures . . . . . . . . . . . . . . . . . . . . 19
2.5 Memory Organization of Parallel Computers . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Computers with Distributed Memory Organization . . . . . . . . 22
2.5.2 Computers with Shared Memory Organization . . . . . . . . . . . . 25
2.6 Thread-Level Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6.1 Simultaneous Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6.2 Multicore Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.3 Architecture of Multicore Processors . . . . . . . . . . . . . . . . . . . . 32
2.7 Interconnection Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.7.1 Properties of Interconnection Networks . . . . . . . . . . . . . . . . . . 38
2.7.2 Direct Interconnection Networks . . . . . . . . . . . . . . . . . . . . . . . 41
2.7.3 Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.7.4 Dynamic Interconnection Networks . . . . . . . . . . . . . . . . . . . . . 49
2.8 Routing and Switching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.8.1 Routing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.8.2 Routing in the Omega Network . . . . . . . . . . . . . . . . . . . . . . . . 64
2.8.3 Switching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.8.4 Flow control mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

2.9 Caches and Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74


2.9.1 Characteristics of Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.9.2 Write Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.9.3 Cache coherency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.9.4 Memory consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
2.10 Examples for hardware parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.10.1 Intel Cascade Lake and Ice Lake Architectures . . . . . . . . . . . 101
2.10.2 Top500 list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.11 Exercises for Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

3 Parallel Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109


3.1 Models for parallel systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.2 Parallelization of programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.3 Levels of parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
3.3.1 Parallelism at instruction level . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.3.2 Data parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.3.3 Loop parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.3.4 Functional parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.3.5 Explicit and implicit representation of parallelism . . . . . . . . . 122
3.3.6 Parallel programming patterns . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.4 SIMD Computations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.4.1 Execution of vector operations . . . . . . . . . . . . . . . . . . . . . . . . . 130
3.4.2 SIMD instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
3.5 Data distributions for arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.5.1 Data distribution for one-dimensional arrays . . . . . . . . . . . . . 134
3.5.2 Data distribution for two-dimensional arrays . . . . . . . . . . . . . 135
3.5.3 Parameterized data distribution . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.6 Information exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.6.1 Shared variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.6.2 Communication operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.7 Parallel matrix-vector product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.7.1 Parallel computation of scalar products . . . . . . . . . . . . . . . . . . 147
3.7.2 Parallel computation of the linear combinations . . . . . . . . . . . 150
3.8 Processes and Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.8.1 Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.8.2 Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.8.3 Synchronization mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . 158
3.8.4 Developing efficient and correct thread programs . . . . . . . . . 161
3.9 Further parallel programming approaches . . . . . . . . . . . . . . . . . . . . . . 163
3.10 Exercises for Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

4 Performance Analysis of Parallel Programs . . . . . . . . . . . . . . . . . . . . . . . 169


4.1 Performance Evaluation of Computer Systems . . . . . . . . . . . . . . . . . . 170
4.1.1 Evaluation of CPU Performance . . . . . . . . . . . . . . . . . . . . . . . . 170
4.1.2 MIPS and MFLOPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.1.3 Performance of Processors with a Memory Hierarchy . . . . . . 174


4.1.4 Benchmark Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.2 Performance Metrics for Parallel Programs . . . . . . . . . . . . . . . . . . . . . 181
4.2.1 Parallel Runtime and Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.2.2 Speedup and Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
4.2.3 Weak and Strong Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
4.3 Energy Measurement and Energy Metrics . . . . . . . . . . . . . . . . . . . . . . 191
4.3.1 Performance and Energy Measurement Techniques . . . . . . . . 191
4.3.2 Modeling of Power and Energy Consumption for DVFS . . . . 196
4.3.3 Energy Metrics for Parallel Programs . . . . . . . . . . . . . . . . . . . 197
4.4 Asymptotic Times for Global Communication . . . . . . . . . . . . . . . . . . 200
4.4.1 Implementing Global Communication Operations . . . . . . . . . 201
4.4.2 Communication Operations on a Hypercube . . . . . . . . . . . 207
4.4.3 Communication Operations on a Complete Binary Tree . . . . 215
4.5 Analysis of Parallel Execution Times . . . . . . . . . . . . . . . . . . . . . . . . . . 218
4.5.1 Parallel Scalar Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4.5.2 Parallel Matrix-vector Product . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.6 Parallel Computational Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
4.6.1 PRAM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
4.6.2 BSP Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
4.6.3 LogP Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
4.7 Loop Scheduling and Loop Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
4.7.1 Loop Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
4.7.2 Loop Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
4.8 Exercises for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

5 Message-Passing Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249


5.1 Introduction to MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
5.1.1 MPI point-to-point communication . . . . . . . . . . . . . . . . . . . . . 252
5.1.2 Deadlocks with Point-to-point Communications . . . . . . . . . . 257
5.1.3 Nonblocking Point-to-Point Operations . . . . . . . . . . . . . . . . . . 260
5.1.4 Communication modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
5.2 Collective Communication Operations . . . . . . . . . . . . . . . . . . . . . . . . . 266
5.2.1 Collective Communication in MPI . . . . . . . . . . . . . . . . . . . . . . 266
5.2.2 Deadlocks with Collective Communication . . . . . . . . . . . . . . 280
5.2.3 Nonblocking Collective Communication Operations . . . . . . . 283
5.3 Process Groups and Communicators . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
5.3.1 Process Groups in MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
5.3.2 Process Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
5.3.3 Timings and aborting processes . . . . . . . . . . . . . . . . . . . . . . . . 295
5.4 Advanced topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
5.4.1 Dynamic Process Generation and Management . . . . . . . . . . . 296
5.4.2 One-sided communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
5.5 Exercises for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
6 Thread Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313


6.1 Programming with Pthreads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
6.1.1 Creating and Merging Threads . . . . . . . . . . . . . . . . . . . . . . . . . 315
6.1.2 Thread Coordination with Pthreads . . . . . . . . . . . . . . . . . . . . . 318
6.1.3 Condition Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
6.1.4 Extended Lock Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
6.1.5 One-time initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
6.2 Parallel programming patterns with Pthreads . . . . . . . . . . . . . . . . . . . . 333
6.2.1 Implementation of a Task Pool . . . . . . . . . . . . . . . . . . . . . . . . . 333
6.2.2 Parallelism by Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
6.2.3 Implementation of a Client-Server Model . . . . . . . . . . . . . . . . 345
6.3 Advanced Pthread features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
6.3.1 Thread Attributes and Cancellation . . . . . . . . . . . . . . . . . . . . . 349
6.3.2 Thread Scheduling with Pthreads . . . . . . . . . . . . . . . . . . . . . . . 358
6.3.3 Priority Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
6.3.4 Thread-specific Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
6.4 Java Threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
6.4.1 Thread Generation in Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
6.4.2 Synchronization of Java Threads . . . . . . . . . . . . . . . . . . . . . . . 370
6.4.3 Wait and Notify . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
6.4.4 Extended Synchronization Patterns . . . . . . . . . . . . . . . . . . . . . 385
6.4.5 Thread Scheduling in Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
6.4.6 Package java.util.concurrent . . . . . . . . . . . . . . . . . . . . . 391
6.5 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
6.5.1 Compiler directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
6.5.2 Execution environment routines . . . . . . . . . . . . . . . . . . . . . . . . 406
6.5.3 Coordination and synchronization of threads . . . . . . . . . . . . . 407
6.5.4 OpenMP task model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
6.6 Exercises for Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419

7 General Purpose GPU Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423


7.1 The Architecture of GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
7.2 Introduction to CUDA programming . . . . . . . . . . . . . . . . . . . . . . . . . . 429
7.3 Synchronization and Shared Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 435
7.4 CUDA Thread Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
7.5 Efficient Memory Access and Tiling Technique . . . . . . . . . . . . . . . . . 441
7.6 Introduction to OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
7.7 Exercises for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449

8 Algorithms for Systems of Linear Equations . . . . . . . . . . . . . . . . . . . . . . 451


8.1 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
8.1.1 Gaussian Elimination and LU Decomposition . . . . . . . . . . . . 452
8.1.2 Parallel Row-Cyclic Implementation . . . . . . . . . . . . . . . . . . . . 456
8.1.3 Parallel Implementation with Checkerboard Distribution . . . 460
8.1.4 Analysis of the Parallel Execution Time . . . . . . . . . . . . . . . . . 464
8.2 Direct Methods for Linear Systems with Banded Structure . . . . . . . . 470
8.2.1 Discretization of the Poisson Equation . . . . . . . . . . . . . . . . . . 470
8.2.2 Tridiagonal Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
8.2.3 Generalization to Banded Matrices . . . . . . . . . . . . . . . . . . . . . 489
8.2.4 Solving the Discretized Poisson Equation . . . . . . . . . . . . . . . . 491
8.3 Iterative Methods for Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 493
8.3.1 Standard Iteration Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
8.3.2 Parallel implementation of the Jacobi Iteration . . . . . . . . . . . . 498
8.3.3 Parallel Implementation of the Gauss-Seidel Iteration . . . . . . 499
8.3.4 Gauss-Seidel Iteration for Sparse Systems . . . . . . . . . . . . . . . 501
8.3.5 Red-black Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
8.4 Conjugate Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
8.4.1 Sequential CG method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
8.4.2 Parallel CG Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
8.5 Cholesky Factorization for Sparse Matrices . . . . . . . . . . . . . . . . . . . . . 518
8.5.1 Sequential Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
8.5.2 Storage Scheme for Sparse Matrices . . . . . . . . . . . . . . . . . . . . 525
8.5.3 Implementation for Shared Variables . . . . . . . . . . . . . . . . . . . . 526
8.6 Exercises for Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
Chapter 1
Introduction

About this Chapter

Parallel programming is increasingly important for software development today and in the future. This introduction outlines the classical use of parallelism in scientific computing with supercomputers as well as the parallelism available in today's hardware, which broadens the use of parallelism to a larger class of applications. The basic concepts of parallel programming are introduced briefly by informally defining key terms, such as task decomposition or potential parallelism, and bringing them into context. The content of this book on parallel programming is described and suggestions for course structures are given.

1.1 Classical Use of Parallelism

Parallel programming and the design of efficient parallel programs have been well established in high-performance scientific computing for many years. The simulation of scientific problems is an area of growing importance in the natural and engineering sciences. More precise simulations or the simulation of larger problems
lead to an increasing demand for computing power and memory space. In the last
decades, high performance research also included the development of new parallel
hardware and software technologies, and steady progress in parallel high perfor-
mance computing can be observed. Popular examples are simulations for weather
forecast based on complex mathematical models involving partial differential equa-
tions or crash simulations from car industry based on finite element methods. Other
examples include drug design and computer graphics applications for film and ad-
vertising industry.
Depending on the specific application, computer simulation is the main method
to obtain the desired result or it is used to replace or enhance physical experiments.

A typical example for the first application area is weather forecasting where the fu-
ture development in the atmosphere has to be predicted, which can only be obtained
by simulations. In the second application area, computer simulations are used to
obtain results that are more precise than results from practical experiments or that
can be performed at lower cost. An example is the use of simulations to determine
the air resistance of vehicles: Compared to a classical wind tunnel experiment, a
computer simulation can get more precise results because the relative movement of
the vehicle in relation to the ground can be included in the simulation. This is not
possible in the wind tunnel, since the vehicle cannot be moved. Crash tests of ve-
hicles are an obvious example where computer simulations can be performed with
lower cost.
Computer simulations often require a large computational effort. Thus, a low
performance of the computer system used can restrict the simulations and the ac-
curacy of the results obtained significantly. Using a high-performance system al-
lows larger simulations which lead to better results and therefore, parallel comput-
ers have usually been used to perform computer simulations. Today, cluster systems
built up from server nodes are widely available and are now also often used for
parallel simulations. Additionally, multicore processors within the nodes provide
further parallelism, which can be exploited for a fast computation. To use paral-
lel computers or cluster systems, the computations to be performed must be parti-
tioned into several parts which are assigned to the parallel resources for execution.
These computation parts should be independent of each other, and the algorithm
performed must provide enough independent computations to be suitable for a par-
allel execution. This is normally the case for scientific simulations, which often use one- or multi-dimensional arrays as data structures and organize their computations in nested loops. To obtain a parallel program for parallel execution, the algorithm
must be formulated in a suitable programming language. Parallel execution is often
controlled by specific runtime libraries or compiler directives which are added to
a standard programming language, such as C, Fortran, or Java. The programming
techniques needed to obtain efficient parallel programs are described in this book.
Popular runtime systems and environments are also presented.

1.2 Parallelism in Today’s Hardware

Parallel programming is an important aspect of high performance scientific com-


puting but it used to be a niche within the entire field of hardware and software
products. However, more recently parallel programming has left this niche and will
become the mainstream of software development techniques due to a radical change
in hardware technology.
Major chip manufacturers have started to produce processors with several power-
efficient computing units on one chip, which have an independent control and can
access the same memory concurrently. Normally, the term core is used for single
computing units and the term multicore is used for the entire processor having sev-
eral cores. Thus, using multicore processors makes each desktop computer a small
parallel system. The technological development toward multicore processors was
forced by physical reasons, since the clock speed of chips with more and more tran-
sistors cannot be increased at the previous rate without overheating.
Multicore architectures in the form of single multicore processors, shared mem-
ory systems of several multicore processors, or clusters of multicore processors with
a hierarchical interconnection network will have a large impact on software develop-
ment. In 2022, quad-core and oct-core processors are standard for normal desktop
computers, and chips with up to 64 cores are already available for use in high-
end systems. It can be predicted from Moore’s law that the number of cores per
processor chip will double every 18 – 24 months and in several years, a typical
processor chip might consist of dozens up to hundreds of cores where some of the
cores will be dedicated to specific purposes such as network management, encryp-
tion and decryption, or graphics [138]; the majority of the cores will be available
for application programs, providing a huge performance potential. Another trend in
parallel computing is the use of GPUs for compute-intensive applications. GPU ar-
chitectures provide many hundreds of specialized processing cores that can perform
computations in parallel.
The users of a computer system are interested in benefitting from the perfor-
mance increase provided by multicore processors. If this can be achieved, they can
expect their application programs to keep getting faster and keep getting more and
more additional features that could not be integrated in previous versions of the soft-
ware because they needed too much computing power. To ensure this, there should
definitely be support from the operating system, e.g., by using dedicated cores for
their intended purpose or by running multiple user programs in parallel, if they are
available. But when a large number of cores is provided, which will be the case in
the near future, there is also the need to execute a single application program on
multiple cores. The best situation for the software developer would be that there is
an automatic transformer that takes a sequential program as input and generates a
parallel program that runs efficiently on the new architectures. If such a transformer
were available, software development could proceed as before. But unfortunately,
the experience of the research in parallelizing compilers during the last 20 years has
shown that for many sequential programs it is not possible to extract enough paral-
lelism automatically. Therefore, there must be some help from the programmer and
application programs need to be restructured accordingly.
For the software developer, the new hardware development toward multicore ar-
chitectures is a challenge, since existing software must be restructured toward paral-
lel execution to take advantage of the additional computing resources. In particular,
software developers can no longer expect that the increase of computing power can
automatically be used by their software products. Instead, additional effort is re-
quired at the software level to take advantage of the increased computing power. If
a software company is able to transform its software so that it runs efficiently on
novel multicore architectures, it will likely have an advantage over its competitors.
There is much research going on in the area of parallel programming languages
and environments with the goal of facilitating parallel programming by providing
support at the right level of abstraction. But there are also many effective techniques
and environments already available. We give an overview in this book and present
important programming techniques, enabling the reader to develop efficient parallel
programs. There are several aspects that must be considered when developing a
parallel program, no matter which specific environment or system is used. We give
a short overview in the following section.

1.3 Basic Concepts of parallel programming

A first step in parallel programming is the design of a parallel algorithm or program for a given application problem. The design starts with the decomposition of
the computations of an application into several parts, called tasks, which can be
computed in parallel on the cores or processors of the parallel hardware. The de-
composition into tasks can be complicated and laborious, since there are usually
many different possibilities of decomposition for the same application algorithm.
The size of tasks (e.g. in terms of the number of instructions) is called granularity
and there is typically the possibility of choosing tasks of different sizes. Defining
the tasks of an application appropriately is one of the main intellectual challenges
in the development of a parallel program and is difficult to automate. The potential
parallelism is an inherent property of an application algorithm and influences how
an application can be split into tasks.
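
As a minimal illustration of task decomposition and granularity, the following C sketch splits the element-wise addition of two arrays into tasks of a fixed chunk size; the problem size, the chunk size, and all helper names are chosen only for this illustration. A larger chunk size yields a coarser granularity, i.e., fewer but larger tasks.

#include <stdio.h>

#define N 1000        /* problem size, chosen for illustration */
#define CHUNK 250     /* granularity: number of array elements per task */

/* A task: compute c[i] = a[i] + b[i] for all i in [start, end). */
typedef struct { int start; int end; } task_t;

static void run_task(task_t t, const double *a, const double *b, double *c) {
  for (int i = t.start; i < t.end; i++)
    c[i] = a[i] + b[i];
}

int main(void) {
  static double a[N], b[N], c[N];
  for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

  /* Decompose the computation into independent tasks of CHUNK elements. */
  int num_tasks = (N + CHUNK - 1) / CHUNK;
  for (int t = 0; t < num_tasks; t++) {
    int end = (t + 1) * CHUNK < N ? (t + 1) * CHUNK : N;
    task_t task = { t * CHUNK, end };
    run_task(task, a, b, c);  /* executed sequentially here; in a parallel
                                 program the tasks would be assigned to
                                 threads or processes for execution */
  }
  printf("c[N-1] = %f\n", c[N - 1]);
  return 0;
}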
The tasks of an application are coded in a parallel programming language or envi-
ronment and are assigned to processes or threads which are then assigned to physi-
cal computation units for execution. The assignment of tasks to processes or threads
is called scheduling and fixes the order in which the tasks are executed. Schedul-
ing can be done by hand in the source code or by the programming environment,
at compile time or dynamically at runtime. The assignment of processes or threads
onto the physical units, processors or cores, is called mapping and is usually done
by the runtime system but can sometimes be influenced by the programmer. The
tasks of an application algorithm can be independent but can also depend on each
other resulting in data or control dependencies of tasks. Data and control depen-
dencies may require a specific execution order of the parallel tasks: If a task needs
data produced by another task, the execution of the first task can start only after
the other task has actually produced these data and provides the information. Thus,
dependencies between tasks are constraints for the scheduling. In addition, parallel
programs need synchronization and coordination of threads and processes in order
to execute correctly. The methods of synchronization and coordination in parallel
computing are strongly connected with the way in which information is exchanged
between processes or threads, and this depends on the memory organization of the
hardware.
A coarse classification of the memory organization distinguishes between shared
memory machines and distributed memory machines. Often the term thread is
connected with shared memory and the term process is connected with distributed
memory. For shared memory machines, a global shared memory stores the data of
an application and can be accessed by all processors or cores of the hardware sys-
tems. Information exchange between threads is done by shared variables written by
one thread and read by another thread. The correct behavior of the entire program
has to be achieved by synchronization between threads so that the access to shared
data is coordinated, i.e., a thread must not read a data element before the write op-
eration by another thread storing the data element has been finalized. Depending on
the programming language or environment, synchronization is done by the runtime
system or by the programmer. For distributed memory machines, there exists a pri-
vate memory for each processor, which can only be accessed by this processor and
no synchronization for memory access is needed. Information exchange is done by
sending data from one processor to another processor via an interconnection net-
work by explicit communication operations.
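
As a minimal sketch of explicit communication between private address spaces, the following MPI program (MPI is treated in detail in Chapter 5) lets process 0 send a single value to process 1; the value sent and the message tag are chosen arbitrarily for this illustration.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int rank;
  double value = 0.0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    value = 3.14;   /* data produced in the private memory of process 0 */
    MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("process 1 received %f\n", value);
  }

  MPI_Finalize();
  return 0;
}

Compiled with an MPI compiler wrapper (e.g., mpicc) and started with two processes (e.g., mpiexec -n 2 ./a.out), the program performs exactly one explicit data transfer over the interconnection network.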
Specific barrier operations offer another form of coordination which is avail-
able for both shared memory and distributed memory machines. All processes or
threads have to wait at a barrier synchronization point until all other processes or
threads have also reached that point. Only after all processes or threads have exe-
cuted the code before the barrier, they can continue their work with the subsequent
code after the barrier.
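
A minimal sketch of a barrier in a shared address space can be given with the barrier operations of the Pthreads library (treated in Chapter 6); the number of threads and the output statements are chosen only for this illustration.

#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

static pthread_barrier_t barrier;

static void *worker(void *arg) {
  long id = (long) arg;
  printf("thread %ld: work before the barrier\n", id);
  pthread_barrier_wait(&barrier);   /* wait until all threads have arrived */
  printf("thread %ld: work after the barrier\n", id);
  return NULL;
}

int main(void) {
  pthread_t threads[NUM_THREADS];
  pthread_barrier_init(&barrier, NULL, NUM_THREADS);
  for (long i = 0; i < NUM_THREADS; i++)
    pthread_create(&threads[i], NULL, worker, (void *) i);
  for (int i = 0; i < NUM_THREADS; i++)
    pthread_join(threads[i], NULL);
  pthread_barrier_destroy(&barrier);
  return 0;
}

No output line produced after the barrier can appear before all output lines produced before the barrier, since every thread must reach the barrier before any thread may continue.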
An important aspect of parallel computing is the parallel execution time which
consists of the time for the computation on processors or cores and the time for data
exchange or synchronization. The parallel execution time should be smaller than the
sequential execution time on one processor so that designing a parallel program is
worth the effort. The parallel execution time is the time elapsed between the start of
the application on the first processor and the end of the execution of the application
on all processors. This time is influenced by the distribution of work to processors or
cores, the time for information exchange or synchronization, and idle times in which
a processor cannot do anything useful but wait for an event to happen. In general,
a smaller parallel execution time results when the work load is assigned equally to
processors or cores, which is called load balancing, and when the overhead for
information exchange, synchronization and idle times is small. Finding a specific
scheduling and mapping strategy which leads to a good load balance and a small
overhead is often difficult because of many interactions. For example, reducing the
overhead for information exchange may lead to load imbalance whereas a good load
balance may require more overhead for information exchange or synchronization.
For a quantitative evaluation of the execution time of parallel programs, cost
measures like speedup and efficiency are used, which compare the resulting paral-
lel execution time with the sequential execution time on one processor. There are
different ways to measure the cost or runtime of a parallel program and a large va-
riety of parallel cost models based on parallel programming models have been pro-
posed and used. These models are meant to bridge the gap between specific parallel
hardware and more abstract parallel programming languages and environments.
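
Anticipating the detailed treatment in Section 4.2.2, the two cost measures mentioned above are commonly defined as

   S(p) = T_seq / T_p    and    E(p) = S(p) / p = T_seq / (p * T_p),

where T_seq denotes the sequential execution time on one processor and T_p the parallel execution time on p processors. A speedup close to p, i.e., an efficiency close to 1, indicates a good load balance and a small overhead.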
1.4 Overview of the Book

The rest of the book is structured as follows. Chapter 2 gives an overview of impor-
tant aspects of the hardware of parallel computer systems and addresses new devel-
opments such as the trends toward multicore architectures. In particular, the chap-
ter covers important aspects of memory organization with shared and distributed
address spaces as well as popular interconnection networks with their topological
properties. Since memory hierarchies with several levels of caches may have an
important influence on the performance of (parallel) computer systems, they are
covered in this chapter. The architecture of multicore processors is also described in
detail. The main purpose of the chapter is to give a solid overview of the important
aspects of parallel computer architectures that play a role for parallel programming
and the development of efficient parallel programs.
Chapter 3 considers popular parallel programming models and paradigms and
discusses how the inherent parallelism of algorithms can be presented to a par-
allel runtime environment to enable an efficient parallel execution. An important
part of this chapter is the description of mechanisms for the coordination of paral-
lel programs, including synchronization and communication operations. Moreover,
mechanisms for exchanging information and data between computing resources for
different memory models are described. Chapter 4 is devoted to the performance
analysis of parallel programs. It introduces popular performance or cost measures
that are also used for sequential programs, as well as performance measures that
have been developed for parallel programs. Especially, popular communication pat-
terns for distributed address space architectures are considered and their efficient
implementations for specific interconnection structures are given.
Chapter 5 considers the development of parallel programs for distributed address
spaces. In particular, a detailed description of MPI (Message Passing Interface) is
given, which is by far the most popular programming environment for distributed
address spaces. The chapter describes important features and library functions of
MPI and shows which programming techniques must be used to obtain efficient
MPI programs. Chapter 6 considers the development of parallel programs for shared
address spaces. Popular programming environments are Pthreads, Java threads, and
OpenMP. The chapter describes all three and considers programming techniques
to obtain efficient parallel programs. Many examples help to understand the rel-
evant concepts and to avoid common programming errors that may lead to low
performance or may cause problems such as deadlocks or race conditions. Pro-
gramming examples and parallel programming patterns are presented. Chapter 7
introduces programming approaches for the execution of non-graphics application
programs, e.g., from the area of scientific computing, on GPUs. The chapter de-
scribes the architecture of GPUs and concentrates on the programming environment
CUDA (Compute Unified Device Architecture) from NVIDIA. A short overview of
OpenCL is also given in this chapter. Chapter 8 considers algorithms from numer-
ical analysis as representative examples and shows how the sequential algorithms
can be transferred into parallel programs in a systematic way.
The main emphasis of the book is to provide the reader with the programming
techniques that are needed for developing efficient parallel programs for different
architectures and to give enough examples to enable the reader to use these tech-
niques for programs from other application areas. In particular, reading and using
the book is a good training for software development for modern parallel architec-
tures, including multicore architectures.
The content of the book can be used for courses in the area of parallel com-
puting with different emphasis. All chapters are written in a self-contained way
so that chapters of the book can be used in isolation; cross-references are given
when material from other chapters might be useful. Thus, different courses in the
area of parallel computing can be assembled from chapters of the book in a mod-
ular way. Exercises are provided for each chapter separately. For a course on the
programming of multicore systems, Chapters 2, 3 and 6 should be covered. In par-
ticular Chapter 6 provides an overview of the relevant programming environments
and techniques. For a general course on parallel programming, Chapters 2, 5, and 6
can be used. These chapters introduce programming techniques for both distributed
and shared address space. For a course on parallel numerical algorithms, mainly
Chapters 5 and 8 are suitable; Chapter 6 can be used additionally. These chapters
consider the parallel algorithms used as well as the programming techniques re-
quired. For a general course on parallel computing, Chapters 2, 3, 4, 5, and 6 can
be used with selected applications from Chapter 8. Depending on the emphasis,
Chapter 7 on GPU programming can be included in each of the courses mentioned
above. The following web page will be maintained for additional and new material:
ai2.inf.uni-bayreuth.de/ppbook3
Chapter 2
Parallel Computer Architecture

About this Chapter

The possibility for a parallel execution of computations strongly depends on the architecture of the execution platform, which determines how computa-
tions of a program can be mapped to the available resources, such that a par-
allel execution is supported. This chapter gives an overview of the general
architecture of parallel computers, which includes the memory organization
of parallel computers, thread-level parallelism and multicore processors, in-
terconnection networks, routing and switching as well as caches and memory
hierarchies. The issue of energy efficiency is also considered.

In more detail, Section 2.1 gives an overview of the use of parallelism within
a single processor or processor core. Using the available resources within a single
processor core at instruction level can lead to a significant performance increase.
Section 2.2 focuses on important aspects of the power and energy consumption of
processors. Section 2.3 addresses techniques that influence memory access times
and play an important role for the performance of (parallel) programs. Section 2.4
introduces Flynn’s taxonomy and Section 2.5 addresses the memory organization of
parallel platforms. Section 2.6 presents the architecture of multicore processors and
describes the use of thread-based parallelism for simultaneous multithreading.
Section 2.7 describes interconnection networks which connect the resources of
parallel platforms and are used to exchange data and information between these
resources. Interconnection networks also play an important role for multicore pro-
cessors for the connection between the cores of a processor chip. The section covers
static and dynamic interconnection networks and discusses important characteris-
tics, such as diameter, bisection bandwidth and connectivity of different network
types as well as the embedding of networks into other networks. Section 2.8 ad-
dresses routing techniques for selecting paths through networks and switching tech-
niques for message forwarding over a given path. Section 2.9 considers memory
hierarchies of sequential and parallel platforms and discusses cache coherence and
memory consistency for shared memory platforms. Section 2.10 shows examples
for the use of parallelism in today’s computer architectures by describing the ar-
chitecture of the Intel Cascade Lake and Ice Lake processors on one hand and the
Top500 list on the other hand.

2.1 Processor Architecture and Technology Trends

Processor chips are the key components of computers. Considering the trends that
can be observed for processor chips during recent years, estimations for future de-
velopments can be deduced.
An important performance factor is the clock frequency (also called clock rate
or clock speed) of the processor which is the number of clock cycles per second,
measured in Hertz = 1/second, abbreviated as Hz = 1/s. The clock frequency f
determines the clock cycle time t of the processor by t = 1/f, which is usually
the time needed for the execution of one instruction. Thus, an increase of the clock
frequency leads to a faster program execution and therefore a better performance.
Between 1987 and 2003, an average annual increase of the clock frequency of
about 40% could be observed for desktop processors [103]. Since 2003, the clock
frequency of desktop processors remains nearly unchanged and no significant in-
creases can be expected in the near future [102, 134]. The reason for this develop-
ment lies in the fact that an increase in clock frequency leads to an increase in power
consumption, mainly due to leakage currents which are transformed into heat, which
then requires a larger amount of cooling. Using current state-of-the-art cooling tech-
nology, processors with a clock rate significantly above 4 GHz cannot be cooled
permanently without a large additional effort.
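
As a small numerical example, a clock frequency of f = 4 GHz corresponds to a clock cycle time of

   t = 1/f = 1 / (4 * 10^9 1/s) = 0.25 * 10^-9 s = 0.25 ns.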
Another important influence on processor development comes from technical improvements in processor manufacturing. Internally, processor chips consist of tran-
sistors. The number of transistors contained in a processor chip can be used as a
rough estimate of its complexity and performance. Moore’s law is an empirical
observation which states that the number of transistors of a typical processor chip
doubles every 18 to 24 months. This observation has first been made by Gordon
Moore in 1965 and has been valid for more than 40 years. However, the transistor
increase due to Moore’s law has slowed down during the last years [104]. Neverthe-
less, the number of transistors still increases and the increasing number of transistors
has been used for architectural improvements, such as additional functional units,
more and larger caches, and more registers, as described in the following sections.
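
Moore's observation can be expressed as a simple growth formula: starting from N_0 transistors, the expected transistor count after t months is approximately

   N(t) ≈ N_0 * 2^(t/T)    with a doubling period of T ≈ 18 to 24 months,

so that a doubling period of 24 months, for example, corresponds to roughly a 32-fold increase within a decade.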
In 2022, a typical desktop processor chip contains between 5 and 20 billion
transistors, depending on the specific configuration. For example, an AMD Ryzen 7
3700X with eight cores (introduced in 2019) comprises about 6 billion transistors,
an AMD Ryzen 7 5800H with eight cores (introduced in 2021) contains 10.7 billion
transistors, and a 10-core Apple M2 (introduced in 2022) consists of 20 billion tran-
sistors, using an ARM-based system-on-a-chip (SoC) design, for more information
see en.wikipedia.org/wiki/Transistor_count. The manufacturer Intel does
not disclose the number of transistors of its processors.
The increase of the number of transistors and the increase in clock speed have
led to a significant increase in the performance of computer systems. Processor
performance can be measured by specific benchmark programs that have been se-
lected from different application areas to get a representative performance metric of
computer systems. Often, the SPEC benchmarks (System Performance and Evalu-
ation Cooperative) are used to measure the integer and floating-point performance
of computer systems [113, 104, 206, 136], see www.spec.org. Measurements with
these benchmarks show that different time periods with specific performance in-
creases can be identified for desktop processors [104]: Between 1986 and 2003,
an average annual performance increase of about 50% could be reached. The time
period of this large performance increase corresponds to the time period in which
the clock frequency has been increased significantly each year. Between 2003 and
2011, the average annual performance increase of desktop processors is about 23%.
This is still a significant increase which has been reached although the clock fre-
quency remained nearly constant, indicating that the annual increase in transistor
count has been used for architectural improvements leading to a reduction of the
average time for executing an instruction. Between 2011 and 2015, the performance
increase slowed down to 12% per year and is currently only at about 3.5 % per
year, mainly due to the limited degree of parallelism available in the benchmark
programs.
In the following, a short overview of architectural improvements that contributed
to performance increase during the last decades is given. Four phases of micropro-
cessor design trends can be observed [45] which are mainly driven by an internal
use of parallelism:
1. Parallelism at bit level: Up to about 1986, the word size used by the processors
for operations increased stepwise from 4 bits to 32 bits. This trend has slowed
down and ended with the adoption of 64-bit operations at the beginning of the
1990’s. This development has been driven by demands for improved floating-
point accuracy and a larger address space. The trend has stopped at a word size
of 64 bits, since this gives sufficient accuracy for floating-point numbers and
covers a sufficiently large address space of 2^64 bytes.
2. Parallelism by pipelining: The idea of pipelining at instruction level is to over-
lap the execution of multiple instructions. For this purpose, the execution of each
instruction is partitioned into several steps which are performed by dedicated
hardware units (pipeline stages) one after another. A typical partitioning could
result in the following steps:
(a) fetch: the next instruction to be executed is fetched from memory;
(b) decode: the instruction fetched in step (a) is decoded;
(c) execute: the operands specified are loaded and the instruction is executed;
(d) write back: the result is written into the destination register.
An instruction pipeline is similar to an assembly line in the automobile industry.
The advantage is that the different pipeline stages can operate in parallel, if there
are no control or data dependencies between the instructions to be executed, see
Fig. 2.1 for an illustration. To avoid waiting times, the execution of the different
pipeline stages should take about the same amount of time. In each clock cycle,
the execution of one instruction is finished and the execution of another instruc-
tion is started, if there are no dependencies between the instructions. The number
of instructions finished per time unit is called the throughput of the pipeline.
Thus, in the absence of dependencies, the throughput is one instruction per clock
cycle.
In the absence of dependencies, all pipeline stages work in parallel. Thus, the
number of pipeline stages determines the degree of parallelism attainable by a
pipelined computation. The number of pipeline stages used in practice depends
on the specific instruction and its potential to be partitioned into stages. Typical
numbers of pipeline stages lie between 2 and 26. Processors which use
pipelining to execute instructions are called ILP processors (instruction level
parallelism processors). Processors with a relatively large number of pipeline
stages are sometimes called superpipelined. Although the available degree of
parallelism increases with the number of pipeline stages, this number cannot be
arbitrarily increased, since it is not possible to partition the execution of the in-
struction into a very large number of steps of equal size. Moreover, data depen-
dencies often inhibit a completely parallel use of the stages.

instruction 4:                  F4  D4  E4  W4
instruction 3:              F3  D3  E3  W3
instruction 2:          F2  D2  E2  W2
instruction 1:      F1  D1  E1  W1
                    t1  t2  t3  t4   ...   time

Fig. 2.1 Pipelined execution of four independent instructions 1–4 at times t1,t2,t3,t4. The
execution of each instruction is split into four stages: fetch (F), decode (D), execute (E), and
write back (W).

3. Parallelism by multiple functional units: Many processors are multiple-issue
processors, which means that they use multiple, independent functional units,
such as ALUs (arithmetic logical unit), FPUs (floating-point unit), load/store
units, or branch units. These units can work in parallel, i.e., different independent
instructions can be executed in parallel by different functional units, so that par-
allelism at instruction level is exploited. Therefore this technique is also referred
to as instruction-level parallelism (ILP). Using ILP, the average execution rate
of instructions can be increased. Multiple-issue processors can be distinguished
into superscalar processors and VLIW (very long instruction word) processors,
see [104, 45] for a more detailed treatment.
The number of functional units that can efficiently be utilized is restricted be-
cause of data dependencies between neighboring instructions. For superscalar
processors, these dependencies are dynamically determined at runtime by hard-
ware, and decoded instructions are dispatched to the instruction units by hard-
ware using dynamic scheduling. This may increase the complexity of the circuit
significantly. But simulations have shown that superscalar processors with up to
four functional units yield a substantial benefit over the use of a single functional
unit. However, using a significantly larger number of functional units provides
little additional gain [45, 123] because of dependencies between instructions and
branching of control flow. In 2022, some server processors are able to submit up
to ten independent instructions to functional units in one machine cycle per core,
see Section 2.10.
4. Parallelism at process or thread level: The three techniques described so far
assume a single sequential control flow which is provided by the compiler and
which determines the execution order if there are dependencies between instruc-
tions. For the programmer, this has the advantage that a sequential programming
language can be used nevertheless leading to a parallel execution of instructions
due to ILP. However, the degree of parallelism obtained by pipelining and mul-
tiple functional units is limited. Thus, the increasing number of transistors avail-
able per processor chip according to Moore’s law should be used for other tech-
niques. One approach is to integrate larger caches on the chip. But the cache sizes
cannot be arbitrarily increased either, as larger caches lead to a larger access time,
see Section 2.9.
An alternative approach to use the increasing number of transistors on a chip is
to put multiple, independent processor cores onto a single processor chip. This
approach has been used for typical desktop processors since 2005. The resulting
processor chips are called multicore processors. Each of the cores of a multi-
core processor must obtain a separate flow of control, i.e., parallel programming
techniques must be used. The cores of a processor chip access the same mem-
ory and may even share caches. Therefore, memory accesses of the cores must
be coordinated. The coordination and synchronization techniques required are
described in later chapters.
A more detailed description of parallelism at hardware level using the four tech-
niques described can be found in [45, 104, 171, 207]. Section 2.6 describes tech-
niques such as simultaneous multithreading and multicore processors requiring an
explicit specification of parallelism.
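
The separate control flows required for a multicore processor (point 4 above) can be
provided, for example, with POSIX threads (Pthreads), which are treated in detail later
in the book. The following minimal C sketch creates one thread per core; the number of
cores and the work performed by each thread are placeholder assumptions, and the
program has to be compiled with -pthread.

   #include <pthread.h>
   #include <stdio.h>

   #define NUM_CORES 4                 /* assumed number of cores */

   /* Each thread represents a separate flow of control that the
      operating system can map to one of the cores. */
   static void *work(void *arg) {
     long id = (long) arg;
     printf("thread %ld executes its own control flow\n", id);
     return NULL;
   }

   int main(void) {
     pthread_t threads[NUM_CORES];
     for (long i = 0; i < NUM_CORES; i++)
       pthread_create(&threads[i], NULL, work, (void *) i);
     for (long i = 0; i < NUM_CORES; i++)
       pthread_join(threads[i], NULL);
     return 0;
   }
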

2.2 Power and Energy Consumption of Processors

Until 2003, a significant average annual increase of the clock frequency of proces-
sors could be observed. This trend has stopped in 2003 at a clock frequency of about
3.3 GHz and since then, only slight increases of the clock frequency could be ob-
served. A further increase of the clock frequency is difficult because of the increased
heat production due to leakage currents. Such leakage currents also occur if the pro-
cessor is not performing computations. Therefore, the resulting power consumption
is called static power consumption. The power consumption caused by compu-
tations is called dynamic power consumption. The overall power consumption is
the sum of the static power consumption and the dynamic power consumption. In
2011, depending on the processor architecture, the static power consumption typi-
cally contributed between 25% and 50% to the total power consumption [104]. The
heat produced by leakage currents must be carried away from the processor chip by
using a sophisticated cooling technology.
An increase in clock frequency of a processor usually corresponds to a larger
amount of leakage currents, leading to a larger power consumption and an increased
heat production. Models to capture this phenomenon describe the dynamic power
consumption P_dyn (measured in Watt, abbreviated as W) of a processor by

     P_dyn(f) = α · C_L · V^2 · f                                          (2.1)

where α is a switching probability, C_L is the load capacitance, V is the supply volt-
age (measured in Volt, abbreviated as V) and f is the clock frequency (measured
in Hertz, abbreviated as Hz) [125, 186]. Since V depends linearly on f, a cubic
dependence of the dynamic power consumption on the clock frequency results,
i.e., the dynamic power consumption increases significantly if the clock frequency
is increased. This can be confirmed by looking at the history of processor devel-
opment: The first 32-bit microprocessors (such as the Intel 80386 processor) had a
(fixed) clock frequency between 12 MHz and 40 MHz and a power consumption of
about 2 W. A more recent 4.0 GHz Intel Core i7 6700K processor has a power con-
sumption of about 95 W [104]. An increase in clock frequency which significantly
exceeds 4.0 GHz would lead to an intolerable increase in dynamic power consump-
tion. The cubic dependency of the power consumption on the clock frequency also
explains why no significant increase in the clock frequency of desktop processors
could be observed since 2003. Decreasing or increasing the clock frequency by a
scaling factor is known as frequency scaling. There are several other factors that
have an effect on the power consumption, including the computational intensity of
an application program and the number of threads employed for the execution of the
program. Using more threads usually leads to an increase of the power consumption.
Moreover, the execution of floating-point operations usually leads to a larger power
consumption than the use of integer operations [104].
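
Combining Equation (2.1) with the stated linear dependence of V on f shows that the
dynamic power consumption grows with the third power of the clock frequency. The
following small sketch makes this relation explicit; the proportionality of V to f is an
assumption of the model, and all constants are normalized away.

   #include <stdio.h>

   /* Relative dynamic power for scaled clock frequencies under the
      model P_dyn ~ f * V^2 with V proportional to f, i.e., P_dyn ~ f^3. */
   int main(void) {
     double scale[] = { 1.0, 0.9, 0.8, 0.7 };
     for (int i = 0; i < 4; i++) {
       double s = scale[i];
       printf("frequency scaled by %.1f -> dynamic power scaled by %.2f\n",
              s, s * s * s);
     }
     return 0;
   }

For example, reducing the clock frequency by 20% reduces the dynamic power
consumption to about half (0.8^3 ≈ 0.51), at the price of a longer execution time.
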
The power consumption is normally not constant, but may vary during program
execution, depending on the power consumption caused by the computations per-
formed by the program and the clock frequency used during the execution. This
can be seen in Figure 2.2 from [184]. The figure shows the power consumption of
an application program for the numerical solution of ordinary differential equations
using 10 threads (top) and 20 threads (bottom). Two different implementation ver-
sions of the solution methods are shown in the figure (Version 1 in red and Version
4 in green). Both versions exhibit an initialization phase with a small power con-
sumption followed by a computation phase with the actual numerical solution using
nine time steps, which can be clearly identified in the figure. During the individ-
ual time steps, there are large variations of the power consumption due to different
types of computations. The figure also shows that during the computation phase, the
use of 20 threads (bottom) leads to a larger power consumption than the use of 10
threads (top). Moreover, it can be seen that implementation Versions 1 and 4 have
about the same execution time using 10 threads. However, when using 20 threads,
implementation Version 4 is significantly faster than implementation Version 1.
To reduce energy consumption, modern microprocessors use several techniques
such as shutting down inactive parts of the processor chip as well as dynamic volt-
(Plot area of Fig. 2.2: power [Watt] over time [sec*10] for N=4096 and nine time steps;
each panel shows implementation Version 1 and Version 4; top panel: 10 threads,
bottom panel: 20 threads.)

Fig. 2.2 Development of the power consumption during program execution for 10 threads (top)
and 20 threads (bottom) on an Intel Broadwell processor when solving an ordinary differential
equation [184].

age and frequency scaling (DVFS). The techniques are often controlled and co-
ordinated by a special Power Management Unit (PMU). The idea of DVFS is to
reduce the clock frequency of the processor chip to save energy during time periods
with a small workload and to increase the clock frequency again if the workload
increases again. Increasing the frequency reduces the cycle time of the processor,
and in the same amount of time more instructions can be executed than when using
a smaller frequency. An example of a desktop microprocessor with DVFS capability
is the Intel Core i7 9700 processor (Coffee Lake architecture) for which 16 clock
frequencies between 0.8 GHz and 3.0 GHz are available (3.0 GHz, 2.9 GHz, 2.7
GHz, 2.6 GHz, 2.4 GHz, 2.3 GHz, 2.1 GHz, 2.0 GHz, 1.8 GHz, 1.7 GHz, 1.5 GHz,
1.4 GHz, 1.2 GHz, 1.1 GHz, 900 MHz, 800 MHz). The operating system can dynam-
ically switch between these frequencies according to the current workload observed.
Tools such as cpufreq-set can be used by the application programmer to adjust
the clock frequency manually. For some processors, it is even possible to increase
the frequency to a turbo-mode, exceeding the maximum frequency available for a
short period of time. This is also called overclocking and allows an especially fast
execution of computations during time periods with a heavy workload. Overclock-
ing is typically restricted to about 10 % over the normal clock rate of the processor
to avoid overheating [104].
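
On Linux systems, the frequency selected by DVFS can be inspected via the cpufreq
interface in the sysfs file system; the following C sketch reads the current frequency of
core 0. The sysfs path is a Linux-specific assumption, and changing frequencies, e.g.,
with cpufreq-set, additionally requires a suitable governor and appropriate permissions.

   #include <stdio.h>

   /* Read the current clock frequency of core 0 from the Linux cpufreq
      sysfs interface (the value is given in kHz). */
   int main(void) {
     const char *path =
       "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
     FILE *f = fopen(path, "r");
     if (f == NULL) {
       perror("cpufreq interface not available");
       return 1;
     }
     long khz = 0;
     if (fscanf(f, "%ld", &khz) == 1)
       printf("core 0 currently runs at %.2f GHz\n", khz / 1e6);
     fclose(f);
     return 0;
   }
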
The overall energy consumption of an application program depends on the exe-
cution time of the program obtained on a specific computer system and the power
consumption of the computer system during program execution. The energy con-
sumption E can be expressed as the product of the execution time T of the program
and the average power consumption P_av during the execution:

     E = P_av · T.                                                         (2.2)

Thus, the energy unit is Watt · sec = Ws = Joule. The average power P_av captures
the average static and dynamic power consumption during program execution. The
clock frequency can be set at a fixed value or can be changed dynamically during
program execution by the operating system. The clock frequency also has an effect
on the execution time: decreasing the clock frequency leads to a larger machine cy-
cle time and, thus, a larger execution time of computations. Hence, the overall effect
of a reduction of the clock frequency on the energy consumption of a program ex-
ecution is not a priori clear: reducing the frequency decreases the power consumption,
but increases the resulting execution time. Experiments with DVFS processors have
shown that the smallest energy consumption does not necessarily correspond to the
use of the smallest operational frequency available [186, 184]. Instead, using a small
but not the smallest frequency often leads to the smallest energy consumption. The
best frequency to be used depends strongly on the processor used and the application
program executed, but also on the number of threads on parallel executions.
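
A simple application of Equation (2.2) is sketched below; the power and time values
are hypothetical and merely indicate the orders of magnitude involved.

   #include <stdio.h>

   /* Energy consumption according to E = P_av * T with hypothetical values. */
   int main(void) {
     double p_av = 45.0;        /* assumed average power in Watt */
     double t    = 180.0;       /* assumed execution time in seconds */
     double e    = p_av * t;    /* energy in Joule (Ws) */
     printf("E = %.1f W * %.1f s = %.0f J = %.2f Wh\n",
            p_av, t, e, e / 3600.0);
     return 0;
   }
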

2.3 Memory access times

The access time to memory can have a large influence on the execution time of a
program, which is referred to as the program’s (runtime) performance. Reducing
the memory access time can improve the performance. The amount of improvement
depends on the memory access behavior of the program considered. Programs for
which the amount of memory accesses is large compared to the number of com-
putations performed may exhibit a significant benefit; these programs are called
memory-bound. Programs for which the amount of memory accesses is small com-
pared to the number of computations performed may exhibit a smaller benefit; these
programs are called compute-bound.
The technological development with a steady reduction in the VLSI (Very-large-
scale integration) feature size has led to significant improvements in processor per-
formance. Since 1980, integer and floating-point performance on the SPEC benchmark
suite has been increasing substantially per year, see Section 2.1. A significant con-
tribution to these improvements comes from a reduction in processor cycle time.
At the same time, the capacity of DRAM (Dynamic random-access memory) chips,
which are used for building main memory of computer systems, increased signifi-
cantly: Between 1986 and 2003, the storage capacity of DRAM chips increased by
about 60% per year. Since 2003, the annual average increase lies between 25% and
40% [103]. In the following, a short overview of DRAM access times is given.

2.3.1 DRAM access times

Access to DRAM chips is standardized by the JEDEC Solid State Technology As-
sociation where JEDEC stands for Joint Electron Device Engineering Council, see
jedec.org. The organization is responsible for the development of open indus-
try standards for semiconductor technologies, including DRAM chips. All leading
processor manufacturers are members of JEDEC.
For a performance evaluation of DRAM chips, the latency and the bandwidth
are used. The latency of a DRAM chip is defined as the total amount of time that
elapses between the point of time at which a memory access to a data block is issued
by the CPU and the point in time when the first byte of the block of data arrives at
the CPU. The latency is typically measured in micro-seconds (µs) or nano-seconds
(ns). The bandwidth denotes the number of data elements that can be read from
a DRAM chip per time unit. The bandwidth is also denoted as throughput. The
bandwidth is typically measured in megabytes per second (MB/s) or gigabytes per
second (GB/s). For the latency of DRAM chips, an average decrease of about 5%
per year could be observed between 1980 and 2005; since 2005 the improvement
in access time has declined [104]. For the bandwidth of DRAM chips, an average
annual increase of about 10% can be observed.
In 2022, the latency of the newest DRAM technology (DDR5, Double Data Rate)
lies between 13.75 and 18 ns, depending on the specific JEDEC standard used. For
the DDR5 technology, a bandwidth between 25.6 GB/s and 51.2 GB/s per DRAM
chip is obtained. For example, the DDR5-3200 A specification leads to a peak band-
width of 25.6 GB/s with a latency of 13.75 ns, the DDR5-6400 C specification has a
peak bandwidth of 51.2 GB/s and a latency of 17.50 ns. Several DRAM chips (typ-
ically between 4 and 16) can be combined into DIMMs (dual inline memory modules)
to provide even larger bandwidths.
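
The combined effect of latency and bandwidth on the time for transferring a data block
can be estimated with the simple model t(n) = latency + n / bandwidth; the following
sketch uses the DDR5-3200 A values quoted above, and the purely additive model is an
idealization.

   #include <stdio.h>

   /* Idealized transfer time t(n) = latency + n / bandwidth for a block
      of n bytes, using the DDR5-3200 A values given in the text. */
   int main(void) {
     double latency   = 13.75e-9;            /* 13.75 ns */
     double bandwidth = 25.6e9;              /* 25.6 GB/s */
     long sizes[] = { 64, 4096, 1 << 20 };   /* cache line, page, 1 MiB */
     for (int i = 0; i < 3; i++) {
       double t = latency + sizes[i] / bandwidth;
       printf("%8ld bytes: %10.2f ns\n", sizes[i], t * 1e9);
     }
     return 0;
   }

For small blocks the latency dominates the transfer time, whereas for large blocks the
bandwidth becomes the decisive factor.
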
Considering DRAM latency, it can be observed that the average memory access
time is significantly larger than the processor cycle time. The large gap between pro-
cessor cycle time and memory access time makes a suitable organization of memory
access more and more important to get good performance results at program level.
Two important approaches have been proposed to reduce the average latency for
memory access [14]: the simulation of virtual processors by each physical proces-
sor (multithreading) and the use of local caches to store data values that are accessed
often. The next two subsections give a short overview of these approaches.
2.3.2 Multithreading for hiding memory access times

The idea of interleaved multithreading is to hide the latency of memory accesses
by simulating a fixed number of virtual processors for each physical processor. The
physical processor contains a separate program counter (PC) as well as a separate
set of registers for each virtual processor. After the execution of a machine instruc-
tion, an implicit switch to the next virtual processor is performed, i.e. the virtual
processors are simulated by the physical processor in a round-robin fashion. The
number of virtual processors per physical processor should be selected such that
the time between the execution of successive instructions of a specific virtual pro-
cessor is sufficiently large to load required data from the global memory. Thus,
the memory latency is hidden by executing instructions of other virtual processors.
This approach does not reduce the amount of data loaded from the global mem-
ory via the network. Instead, instruction execution is organized such that a virtual
processor does not access requested data until after its arrival. Therefore, from the
point of view of a virtual processor, the memory latency cannot be observed. This
approach is also called fine-grained multithreading, since a switch is performed
after each instruction. An alternative approach is coarse-grained multithreading
which switches between virtual processors only on costly stalls, such as level 2
cache misses [104]. For the programming of fine-grained multithreading architec-
tures, a PRAM-like programming model can be used, see Section 4.6.1. There are
two drawbacks of fine-grained multithreading:
• The programming must be based on a large number of virtual processors. There-
fore, the algorithm executed must provide a sufficiently large potential of paral-
lelism to employ all virtual processors.
• The physical processors must be especially designed for the simulation of virtual
processors. A software-based simulation using standard microprocessors would
be too slow.
There have been several examples for the use of fine-grained multithreading in the
past, including the Denelcor HEP (heterogeneous element processor) [202], NYU Ul-
tracomputer [87], SB-PRAM [2], Tera MTA [45, 116], as well as the Oracle/Fujitsu
T1 – T5 and M7/M8 multiprocessors. For example, each T5 processor supports 16
cores with eight threads per core, i.e., 128 threads per processor, acting as virtual
processors. Graphics processing
units (GPUs), such as NVIDIA GPUs, also use fine-grained multithreading to hide
memory access latencies, see Chapter 7 for more information. Section 2.6.1 will
describe another variation of multithreading which is simultaneous multithreading.

2.3.3 Caches for reducing the average memory access time

A cache is a small, but fast memory that is logically located between the processor
and main memory. Physically, caches are located on the processor chip to ensure a
fast access time. A cache can be used to store data that is often accessed by the pro-
cessor, thus avoiding expensive main memory access. In most cases, the inclusion
property is used, i.e., the data stored in the cache is a subset of the data stored in main
memory. The management of the data elements in the cache is done by hardware,
e.g. by employing a set-associative strategy, see [104] and Section 2.9.1 for a de-
tailed treatment. For each memory access issued by the processor, it is first checked
by hardware whether the memory address specified currently resides in the cache.
If so, the data is loaded from the cache and no memory access is necessary. There-
fore, memory accesses that go into the cache are significantly faster than memory
accesses that require a load from the main memory. Since fast memory is expensive,
several levels of caches are typically used, starting from a small, fast and expen-
sive level 1 (L1) cache over several stages (L2, L3) to the large, but slower main
memory. For a typical processor architecture, access to the L1 cache only takes 2-4
cycles whereas access to main memory can take up to several hundred cycles. The
primary goal of cache organization is to reduce the average memory access time as
far as possible and to achieve an access time as close as possible to that of the L1
cache. Whether this can be achieved depends on the memory access behavior of the
program considered, see also Section 2.9.
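
The influence of the memory access behavior can be illustrated by the order in which a
matrix is traversed; the following C sketch contrasts a row-wise and a column-wise
summation, where the matrix size N is an arbitrary assumption and the actual timing is
left to the reader.

   /* In C, a two-dimensional array is stored row by row. The row-wise loop
      therefore accesses consecutive memory locations (stride 1) and profits
      from the cache, whereas the column-wise loop jumps N elements between
      accesses and causes many more cache misses for large N. */
   #include <stdio.h>

   #define N 2048
   static double a[N][N];

   double sum_rowwise(void) {
     double s = 0.0;
     for (int i = 0; i < N; i++)
       for (int j = 0; j < N; j++)
         s += a[i][j];               /* stride-1 accesses */
     return s;
   }

   double sum_columnwise(void) {
     double s = 0.0;
     for (int j = 0; j < N; j++)
       for (int i = 0; i < N; i++)
         s += a[i][j];               /* stride-N accesses */
     return s;
   }

   int main(void) {
     printf("%f %f\n", sum_rowwise(), sum_columnwise());
     return 0;
   }
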
Caches are used for nearly all processors, and they also play an important role
for SMPs with a shared address space and parallel computers with distributed mem-
ory organization. If shared data is used by multiple processors or cores, it may be
replicated in multiple caches to reduce access latency. Each processor or core should
have a coherent view to the memory system, i.e., any read access should return the
most recently written value no matter which processor or core has issued the cor-
responding write operation. A coherent view would be destroyed if a processor p
changes the value in a memory address in its local cache without writing this value
back to main memory. If another processor q would later read this memory address,
it would not get the most recently written value. But even if p writes the value back
to main memory, this may not be sufficient if q has a copy of the same memory
location in its local cache. In this case, it is also necessary to update the copy in the
local cache of q. The problem of providing a coherent view to the memory system is
often referred to as cache coherence problem. To ensure cache coherency, a cache
coherency protocol must be used, see Section 2.9.3 and [45, 104, 100] for a more
detailed description.

2.4 Flynn’s Taxonomy of Parallel Architectures

Parallel computers have been used for many years, and many different architectural
alternatives have been proposed and implemented. In general, a parallel computer
can be characterized as a collection of processing elements that can communicate
and cooperate to solve large problems quickly [14]. This definition is intentionally
quite vague to capture a large variety of parallel platforms. Many important details
are not addressed by the definition, including the number and complexity of the
processing elements, the structure of the interconnection network between the pro-
cessing elements, the coordination of the work between the processing elements as
well as important characteristics of the problem to be solved.
For a more detailed investigation, it is useful to introduce a classification accord-
ing to important characteristics of a parallel computer. A simple model for such a
classification is given by Flynn’s taxonomy [64]. This taxonomy characterizes par-
allel computers according to the global control and the resulting data and control
flows. Four categories are distinguished:
1. Single-Instruction, Single-Data (SISD): There is one processing element which
has access to a single program and a single data storage. In each step, the pro-
cessing element loads an instruction and the corresponding data and executes the
instruction. The result is stored back into the data storage. Thus, SISD describes
a conventional sequential computer according to the von Neumann model.
2. Multiple-Instruction, Single-Data (MISD): There are multiple processing ele-
ments each of which has a private program memory, but there is only one com-
mon access to a single global data memory. In each step, each processing element
obtains the same data element from the data memory and loads an instruction
from its private program memory. These possibly different instructions are then
executed in parallel by the processing elements using the previously obtained
(identical) data element as operand. This execution model is very restrictive and
no commercial parallel computer of this type has ever been built.
3. Single-Instruction, Multiple-Data (SIMD): There are multiple processing ele-
ments each of which has a private access to a (shared or distributed) data mem-
ory, see Section 2.5 for a discussion of shared and distributed address spaces.
But there is only one program memory from which a special control processor
fetches and dispatches instructions. In each step, each processing element obtains
the same instruction from the control processor and loads a separate data element
through its private data access on which the instruction is performed. Thus, the
same instruction is synchronously applied in parallel by all processing elements
to different data elements.
For applications with a significant degree of data parallelism, the SIMD approach
can be very efficient. Examples are multimedia applications or computer graph-
ics algorithms which generate realistic three-dimensional views of computer-
generated environments. Algorithms from scientific computing are often based
on large arrays and can therefore also benefit from SIMD computations.
4. Multiple-Instruction, Multiple-Data (MIMD): There are multiple processing
elements each of which has a separate instruction and a separate data access to
a (shared or distributed) program and data memory. In each step, each process-
ing element loads a separate instruction and a separate data element, applies the
instruction to the data element, and stores a possible result back into the data
storage. The processing elements work asynchronously to each other. MIMD
computers are the most general form of parallel computers in Flynn’s taxonomy.
Multicore processors or cluster systems are examples for the MIMD model.
Compared to MIMD computers, SIMD computers have the advantage that they are
easy to program, since there is only one program flow, and the synchronous execu-
tion does not require synchronization at program level. But the synchronous execu-
tion is also a restriction, since conditional statements of the form
if (b==0) c=a; else c = a/b;
must be executed in two steps. In the first step, all processing elements whose lo-
cal value of b is zero execute the then part. In the second step, all other process-
ing elements execute the else part. Some processors support SIMD computations
as additional possibility for processing large uniform data sets. An example is the
x86 architecture which provides SIMD instructions in the form of SSE (Streaming
SIMD Extensions) or AVX (Advanced Vector Extensions) instructions. AVX exten-
sions have first been introduced in 2011 by Intel and AMD and are now supported
in nearly all modern desktop and server processors. The features of AVX have been
extended several times, and since 2017 AVX-512 is available. AVX-512 is based
on a separate set of 512-bit registers. Each of these registers can store 16 single-
precision 32-bit or eight double-precision 64-bit floating-point numbers, on which
arithmetic operations can be executed in SIMD style, see Sect. 3.4 for a more de-
tailed description. The computations of GPUs are also based on the SIMD concept,
see Sect. 7.1 for a more detailed description.
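
The two-step execution of the conditional statement shown above can be expressed
directly with the mask registers of AVX-512. The following sketch assumes a processor
with AVX-512F support and compilation with -mavx512f; the intrinsics are declared in
immintrin.h.

   #include <immintrin.h>

   /* SIMD version of: for all i: c[i] = (b[i] == 0) ? a[i] : a[i] / b[i].
      A mask register selects the lanes in which the division is performed;
      in all other lanes the value of a is kept. Requires AVX-512F. */
   void cond_div(const float *a, const float *b, float *c, int n) {
     int i;
     for (i = 0; i + 16 <= n; i += 16) {     /* 16 floats per 512-bit register */
       __m512 va = _mm512_loadu_ps(a + i);
       __m512 vb = _mm512_loadu_ps(b + i);
       __mmask16 nz = _mm512_cmp_ps_mask(vb, _mm512_setzero_ps(), _CMP_NEQ_OQ);
       __m512 vc = _mm512_mask_div_ps(va, nz, va, vb);  /* divide where b != 0 */
       _mm512_storeu_ps(c + i, vc);
     }
     for (; i < n; i++)                      /* scalar remainder */
       c[i] = (b[i] == 0.0f) ? a[i] : a[i] / b[i];
   }
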
MIMD computers are more flexible than SIMD computers, since each processing
element can execute its own program flow. On the upper level, multicore processors
as well as all parallel computers are based on the MIMD concept. Although Flynn’s
taxonomy only provides a coarse classification, it is useful to give an overview of
the design space of parallel computers.

2.5 Memory Organization of Parallel Computers

Nearly all general-purpose parallel computers are based on the MIMD model. A fur-
ther classification of MIMD computers can be done according to their memory orga-
nization. Two aspects can be distinguished: the physical memory organization and
the view of the programmer to the memory. For the physical organization, comput-
ers with a physically shared memory (also called multiprocessors) and computers
with a physically distributed memory (also called multicomputers) can be distin-
guished, see Fig. 2.3. But there exist also many hybrid organizations, for example
providing a virtually shared memory on top of a physically distributed memory.
From the programmer’s point of view, there are computers with a distributed ad-
dress space and computers with a shared address space. This view does not necessar-
ily correspond to the actual physical memory organization. For example, a parallel
computer with a physically distributed memory may appear to the programmer as a
computer with a shared address space when a corresponding programming environ-
ment is used. In the following, the physical organization of the memory is discussed
in more detail.
(Diagram of Fig. 2.3: MIMD computer systems are divided into Multicomputer systems
(computers with distributed memory), computers with virtually shared memory, and
Multiprocessor systems (computers with shared memory).)

Fig. 2.3 Classification of the memory organization of MIMD computers.

2.5.1 Computers with Distributed Memory Organization

Computers with a physically distributed memory are also called distributed mem-
ory machines (DMM). They consist of a number of processing elements (called
nodes) and an interconnection network which connects the nodes and supports the
transfer of data between the nodes. A node is an independent unit, consisting of
processor, local memory and, sometimes, peripheral elements, see Fig. 2.4 a).
The data of a program is stored in the local memory of one or several nodes.
All local memory is private and only the local processor can access its own lo-
cal memory directly. When a processor needs data from the local memory of other
nodes to perform local computations, message-passing has to be performed via the
interconnection network. Therefore, distributed memory machines are strongly con-
nected with the message-passing programming model which is based on communi-
cation between cooperating sequential processes, see Chapters 3 and 5. To perform
message-passing, two processes PA and PB on different nodes A and B issue corre-
sponding send and receive operations. When PB needs data from the local memory
of node A, PA performs a send operation containing the data for the destination pro-
cess PB. PB performs a receive operation specifying a receive buffer to store the data
from the source process PA from which the data is expected.
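
The send and receive operations of the processes PA and PB can be expressed, for
example, with MPI, the message-passing library treated in Chapter 5. In the following
sketch, rank 0 plays the role of PA and rank 1 the role of PB; the message size of 100
double values is an arbitrary assumption.

   #include <mpi.h>
   #include <stdio.h>

   /* PA (rank 0) sends a block of data to PB (rank 1), which receives it
      into a buffer of the same size. Start with: mpirun -np 2 ./a.out */
   int main(int argc, char *argv[]) {
     int rank;
     double buf[100];
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     if (rank == 0) {                          /* process PA on node A */
       for (int i = 0; i < 100; i++) buf[i] = i;
       MPI_Send(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
     } else if (rank == 1) {                   /* process PB on node B */
       MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                MPI_STATUS_IGNORE);
       printf("PB received last value %.1f\n", buf[99]);
     }
     MPI_Finalize();
     return 0;
   }
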
The architecture of computers with a distributed memory has experienced many
changes over the years, especially concerning the interconnection network and the
coupling of network and nodes. The interconnection networks of earlier multicom-
puters were often based on point-to-point connections between nodes. A node is
connected to a fixed set of other nodes by physical connections. The structure of
the interconnection network can be represented as a graph structure. The nodes of
the graph represent the processors, the edges represent the physical interconnections
(also called links). Typically, the graph exhibits a regular structure. A typical net-
work structure is the hypercube which is used in Fig. 2.4 b) to illustrate the node
connections; a detailed description of interconnection structures is given in Section
2.7. In networks with point-to-point connections, the structure of the network deter-
mines the possible communications, since each node can only exchange data with
its direct neighbors. To decouple send and receive operations, buffers can be used
to store a message until the communication partner is ready. Point-to-point con-
(Diagrams of Fig. 2.4, parts a) to e); abbreviations used in the figure: P = processor,
M = local memory, DMA = direct memory access, R = router, N = node consisting of
processor and local memory.)

Fig. 2.4 Illustration of computers with distributed memory: a) abstract structure, b) computer
with distributed memory and hypercube as interconnection structure, c) DMA (direct memory
access), d) processor-memory node with router and e) interconnection network in form of a
mesh to connect the routers of the different processor-memory nodes.
nections restrict parallel programming, since the network topology determines the
possibilities for data exchange, and parallel algorithms have to be formulated such
that their communication pattern fits to the given network structure [8, 144].
The execution of communication operations can be decoupled from the proces-
sor’s operations by adding a DMA controller (DMA - direct memory access) to the
nodes to control the data transfer between the local memory and the I/O controller.
This enables data transfer from or to the local memory without participation of the
processor (see Fig. 2.4 c) for an illustration) and allows asynchronous communica-
tion. A processor can issue a send operation to the DMA controller and can then
continue local operations while the DMA controller executes the send operation.
Messages are received at the destination node by its DMA controller which copies
the enclosed data to a specific system location in local memory. When the processor
then performs a receive operation, the data are copied from the system location to
the specified receive buffer. Communication is still restricted to neighboring nodes
in the network. Communication between nodes that do not have a direct connec-
tion must be controlled by software to send a message along a path of direct inter-
connections. Therefore, communication times between nodes that are not directly
connected can be much larger than communication times between direct neighbors.
Thus, it is still more efficient to use algorithms with a communication pattern ac-
cording to the given network structure.
A further decoupling can be obtained by putting routers into the network, see Fig.
2.4 d). The routers form the actual network over which communication can be per-
formed. The nodes are connected to the routers, see Fig. 2.4 e). Hardware-supported
routing reduces communication times as messages for processors on remote nodes
can be forwarded by the routers along a pre-selected path without interaction of
the processors in the nodes along the path. With router support there is not a large
difference in communication time between neighboring nodes and remote nodes,
depending on the switching technique, see Sect. 2.8.3. Each physical I/O channel of
a router can be used by one message only at a specific point in time. To decouple
message forwarding, message buffers are used for each I/O channel to store mes-
sages and apply specific routing algorithms to avoid deadlocks, see also Sect. 2.8.1.
Technically, DMMs are quite easy to assemble since standard desktop computers
or servers can be used as nodes. The programming of DMMs requires a careful
data layout, since each processor can access only its local data directly. Non-local
data must be accessed via message-passing, and the execution of the corresponding
send and receive operations takes significantly longer than a local memory access.
Depending on the interconnection network and the communication library used, the
difference can be more than a factor of 100. Therefore, data layout may have a
significant influence on the resulting parallel runtime of a program. The data layout
should be selected such that the number of message transfers and the size of the data
blocks exchanged are minimized.
The structure of DMMs has many similarities with networks of workstations
(NOWs) in which standard workstations are connected by a fast local area network
(LAN). An important difference is that interconnection networks of DMMs are typi-
cally more specialized and provide larger bandwidth and lower latency, thus leading
to a faster message exchange.
Collections of complete computers with a dedicated interconnection network are
often called clusters. Clusters are usually based on standard computers and even
standard network topologies. The entire cluster is addressed and programmed as a
single unit. The popularity of clusters as parallel machines comes from the availabil-
ity of standard high-speed interconnections, such as FCS (Fibre Channel Standard),
SCI (Scalable Coherent Interface), Switched Gigabit Ethernet, Myrinet, or Infini-
Band, see [175, 104, 171]. A natural programming model of DMMs is the message-
passing model that is supported by communication libraries, such as MPI or PVM,
see Chapter 5 for a detailed treatment of MPI. These libraries are often based on
standard protocols such as TCP/IP [139, 174].
The difference between cluster systems and distributed systems lies in the fact
that the nodes in cluster systems use the same operating system and can usually
not be addressed individually; instead a special job scheduler must be used. Several
cluster systems can be connected to grid systems by using middleware software,
such as the Globus Toolkit, see www.globus.org [72]. This allows a coordinated
collaboration of several clusters. In Grid systems, the execution of application pro-
grams is controlled by the middleware software.
Cluster systems are also used for the provision of services in the area of cloud
computing. Using cloud computing, each user can allocate virtual resources which
are provided via the cloud infrastructure as part of the cluster system. The user can
dynamically allocate and use resources according to his or her computational re-
quirements. Depending on the allocation, a virtual resource can be a single cluster
node or a collection of cluster nodes. Examples for cloud infrastructures are the
Amazon Elastic Compute Cloud (EC2), Microsoft Azure, and Google Cloud. Ama-
zon EC2 offers a variety of virtual machines with different performance and memory
capacity that can be rented by users on an hourly or monthly basis. Amazon EC2 is
part of Amazon Web Services (AWS), which provides many cloud computing ser-
vices in different areas, such as computing, storage, database, network and content
delivery, analytics, machine learning, and security, see aws.amazon.com. In the
third quarter of 2022, AWS had a share of 34 % of the global cloud infrastructure
service market, followed by Microsoft Azure with 21 % and Google Cloud with 11
%, see www.srgresearch.com.

2.5.2 Computers with Shared Memory Organization

Computers with a physically shared memory are also called shared memory ma-
chines (SMMs). The shared memory is also called global memory. SMMs consist
of a number of processors or cores, a shared physical memory (global memory) and
an interconnection network to connect the processors with the memory. The shared
memory can be implemented as a set of memory modules. Data can be exchanged
between processors via the global memory by reading or writing shared variables.
The cores of a multicore processor are an example for an SMM, see Sect. 2.6.2 for
a more detailed description. Physically, the global memory usually consists of sep-
arate memory modules providing a common address space which can be accessed
by all processors, see Fig. 2.5 for an illustration.

(Diagrams of Fig. 2.5, parts a) and b): processors P connected by an interconnection
network to a shared memory (a) or to several memory modules M (b).)

Fig. 2.5 Illustration of a computer with shared memory: a) abstract view and b) implementation
of the shared memory with several memory modules.

A natural programming model for SMMs is the use of shared variables which
can be accessed by all processors. Communication and cooperation between the
processors is organized by writing and reading shared variables that are stored in
the global memory. Accessing shared variables concurrently by several processors
should be avoided, since race conditions with unpredictable effects can occur, see
also Chapters 3 and 6.
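
The following Pthreads sketch illustrates such a race condition and its avoidance: two
threads increment a shared counter, and without the lock the read-modify-write
sequences of the threads could interleave so that updates are lost. The number of
increments is an arbitrary assumption; the program is compiled with -pthread.

   #include <pthread.h>
   #include <stdio.h>

   static long counter = 0;                           /* shared variable */
   static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

   /* The mutex makes the increment of the shared counter atomic. */
   static void *increment(void *arg) {
     (void) arg;
     for (int i = 0; i < 1000000; i++) {
       pthread_mutex_lock(&lock);
       counter++;
       pthread_mutex_unlock(&lock);
     }
     return NULL;
   }

   int main(void) {
     pthread_t t1, t2;
     pthread_create(&t1, NULL, increment, NULL);
     pthread_create(&t2, NULL, increment, NULL);
     pthread_join(t1, NULL);
     pthread_join(t2, NULL);
     printf("counter = %ld (expected 2000000)\n", counter);
     return 0;
   }
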
The existence of a global memory is a significant advantage, since communica-
tion via shared variables is easy and since no data replication is necessary as it is
sometimes the case for DMMs. But technically, the realization of SMMs requires
a larger effort, in particular because the interconnection network must provide fast
access to the global memory for each processor. This can be ensured for a small
number of processors, but scaling beyond a few dozen processors is difficult.
A special variant of SMMs are symmetric multiprocessors (SMPs). SMPs have
a single shared memory which provides a uniform access time from any processor
for all memory locations, i.e., all memory locations are equidistant to all processors
[45, 171]. SMPs usually have a small number of processors that are connected via
a central bus or interconnection network, which also provides access to the shared
memory. There are usually no private memories of processors or specific I/O pro-
cessors, but each processor has a private cache hierarchy. As usual, access to a local
cache is faster than access to the global memory. In the spirit of the definition from
above, each multicore processor with several cores is an SMP system.
SMPs usually have only a small number of processors, since the central bus has
to provide a constant bandwidth which is shared by all processors. When too many
processors are connected, more and more access collisions may occur, thus increas-
ing the effective memory access time. This can be alleviated by the use of caches and
suitable cache coherence protocols, see Sect. 2.9.3. The maximum number of pro-
cessors used in bus-based SMPs typically lies between 32 and 64. Interconnection
schemes with a higher bandwidth are also used, such as parallel buses (IBM Power
8), ring interconnects (Intel Xeon E7) or crossbar connections (Fujitsu SPARC64
X+) [171].
Parallel programs for SMMs are often based on the execution of threads. A
thread is a separate control flow which shares data with other threads via a global
address space. It can be distinguished between kernel threads that are managed by
the operating system, and user threads that are explicitly generated and controlled
by the parallel program, see Section 3.8.2. The kernel threads are mapped by the op-
erating system to processors or cores for execution. User threads are managed by the
specific programming environment used and are mapped to kernel threads for exe-
cution. The mapping algorithms as well as the exact number of processors or cores
can be hidden from the user by the operating system. The processors or cores are
completely controlled by the operating system. The operating system can also start
multiple sequential programs from several users on different processors or cores,
when no parallel program is available. Small size SMP systems are often used as
servers, because of their cost-effectiveness, see [45, 175] for a detailed description.
SMP systems can be used as nodes of a larger parallel computer by employ-
ing an interconnection network for data exchange between processors of different
SMP nodes. For such systems, a shared address space can be defined by using a
suitable cache coherence protocol, see Sect. 2.9.3. A coherence protocol provides
the view of a shared address space, although the actual physical memory might be
distributed. Such a protocol must ensure that any memory access returns the most
recently written value for a specific memory address, no matter where this value is
stored physically. The resulting systems are also called distributed shared mem-
ory (DSM) architectures. In contrast to single SMP systems, the access time in DSM
systems depends on the location of a data value in the global memory, since an ac-
cess to a data value in the local SMP memory is faster than an access to a data value
in the memory of another SMP node via the coherence protocol. These systems are
therefore also called NUMAs (non-uniform memory access), see Fig. 2.6 (b). Since
single SMP systems have a uniform memory latency for all processors, they are also
called UMAs (uniform memory access).
CC-NUMA (Cache-Coherent NUMA) systems are computers with a virtually
shared address space for which cache coherence is ensured, see Fig. 2.6 (c). Thus,
each processor’s cache can store data not only from the processor’s local memory
but also from the shared address space. A suitable coherence protocol, see Sect.
2.9.3, ensures a consistent view by all processors. COMA (Cache-Only Memory
Architecture) systems are variants of CC-NUMA in which the local memories of
the different processors are used as caches, see Fig. 2.6 (d).

2.6 Thread-Level Parallelism

The architectural organization within a processor chip may require the use of ex-
plicitly parallel programs to efficiently use the resources provided. This is called
thread-level parallelism, since the multiple control flows needed are often called
(Diagrams of Fig. 2.6, parts a) to d): processing elements P1, ..., Pn with caches
C1, ..., Cn and memories M1, ..., Mn connected by an interconnection network.)

Fig. 2.6 Illustration of the architecture of computers with shared memory: a) SMP – symmetric
multiprocessors, b) NUMA – non-uniform memory access, c) CC-NUMA – cache coherent
NUMA and d) COMA – cache only memory access.
threads. The corresponding architectural organization is also called chip multipro-
cessing (CMP). An example for CMP is the placement of multiple independent exe-
cution cores with all execution resources onto a single processor chip. The resulting
processors are called multicore processors, see Section 2.6.2.
An alternative approach is the use of multithreading to execute multiple threads
simultaneously on a single processor by switching between the different threads
when needed by the hardware. As described in Section 2.3, this can be obtained
by fine-grained or coarse-grained multithreading. A variant of coarse-grained mul-
tithreading is timeslice multithreading in which the processor switches between
the threads after a predefined timeslice interval has elapsed. This can lead to situa-
tions where the timeslices are not effectively used if a thread must wait for an event.
If this happens in the middle of a timeslice, the processor may remain idle for the
rest of the timeslice because of the waiting. Such unnecessary waiting times can
be avoided by using switch-on-event multithreading [149] in which the processor
can switch to the next thread if the current thread must wait for an event to occur as
can happen for cache misses.
A variant of this technique is simultaneous multithreading (SMT) which will
be described in the following. This technique is called hyperthreading for some
Intel processors. The technique is based on the observation that a single thread of
control often does not provide enough instruction-level parallelism to use all func-
tional units of modern superscalar processors.

2.6.1 Simultaneous Multithreading

The idea of simultaneous multithreading (SMT) is to use several threads and to
schedule executable instructions from different threads in the same cycle if neces-
sary, thus using the functional units of a processor more effectively. This leads to
a simultaneous execution of several threads which gives the technique its name. In
each cycle, instructions from several threads compete for the functional units of a
processor. Hardware support for simultaneous multithreading is based on the repli-
cation of the chip area which is used to store the processor state. This includes the
program counter (PC), user and control registers as well as the interrupt controller
with the corresponding registers. Due to this replication, the processor appears to
the operating system and the user program as a set of logical processors to which
processes or threads can be assigned for execution. These processes or threads typ-
ically come from a single user program, provided that this program has been written
using parallel programming techniques. The number of replications of the processor
state determines the number of logical processors.
Each logical processor stores its processor state in a separate processor resource.
This avoids overhead for saving and restoring processor states when switching to
another logical processor. All other resources of the processor chip, such as caches,
bus system, and function and control units, are shared by the logical processors.
Therefore, the implementation of SMT only leads to a small increase in chip size.
For two logical processors, the required increase in chip area for an Intel Xeon pro-
cessor is less than 5% [149, 225]. The shared resources are assigned to the logical
processors for simultaneous use, thus leading to a simultaneous execution of logical
processors. When a logical processor must wait for an event, the resources can be
assigned to another logical processor. This leads to a continuous use of the resources
from the view of the physical processor. Waiting times for logical processors can oc-
cur for cache misses, wrong branch predictions, dependencies between instructions,
and pipeline hazards.
Investigations have shown that the simultaneous use of processor resources by
two logical processors can lead to performance improvements between 15% and
30%, depending on the application program [149]. Since the processor resources are
shared by the logical processors, it cannot be expected that the use of more than two
logical processors can lead to a significant additional performance improvement.
Therefore, SMT will likely be restricted to a small number of logical processors.
In 2022, examples of processors that support SMT are the Intel Core i3, i5, and
i7 processors (supporting two logical processors), the IBM Power9 and Power10
processors (four or eight logical processors, depending on the configuration), as
well as the AMD Zen 3 processors (two threads per core, e.g., up to 24 threads for a
12-core processor).
To use SMT to obtain performance improvements, it is necessary that the op-
erating system is able to control logical processors. From the point of view of the
application program, it is necessary that there is a separate thread available for ex-
ecution for each logical processor. Therefore, the application program must apply
parallel programming techniques to get performance improvements for SMT pro-
cessors.
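
The number of logical processors visible to the operating system, which on an SMT
processor includes the hardware-supported threads and not only the physical cores, can
be queried at program start, for example as in the following sketch; the sysconf
parameter _SC_NPROCESSORS_ONLN is a widely available Linux/glibc extension,
but not part of every platform.

   #include <stdio.h>
   #include <unistd.h>

   /* Query the number of logical processors that are currently online. */
   int main(void) {
     long n = sysconf(_SC_NPROCESSORS_ONLN);
     if (n < 1) {
       fprintf(stderr, "number of logical processors not available\n");
       return 1;
     }
     printf("%ld logical processors online\n", n);
     return 0;
   }
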

2.6.2 Multicore Processors

The enormous annual increase of the number of transistors of a processor chip
has enabled hardware manufacturers for many years to provide a significant per-
formance increase for application programs, see also Section 2.1. Thus, a typical
computer is considered old-fashioned and too slow after at most five years and cus-
tomers buy new computers quite often. Hardware manufacturers are therefore trying
to keep the obtained performance increase at least at the current level to avoid re-
duction in computer sales figures.
As discussed in Section 2.1, the most important factors for the performance in-
crease per year have been an increase in clock speed and the internal use of parallel
processing, such as pipelined execution of instructions and the use of multiple func-
tional units. But these traditional techniques have mainly reached their limits:
• Although it is possible to put additional functional units on the processor chip,
this would not increase performance for most application programs because de-
pendencies between instructions of a single control thread inhibit their parallel
execution. A single control flow does not provide enough instruction-level paral-
lelism to keep a large number of functional units busy.
• There are two main reasons why the speed of processor clocks cannot be in-
creased significantly [132]. First, the increase of the number of transistors on
a chip is mainly achieved by increasing the transistor density. But this also in-
creases the power density and heat production because of leakage current and
power consumption, thus requiring an increased effort and more energy for cool-
ing. Second, memory access times could not be reduced at the same rate as pro-
cessor clock speeds have been increased. This leads to an increased number of ma-
chine cycles per memory access. For example, in 1990 a main memory access
required between 6 and 8 machine cycles for a typical desktop computer system.
By 2012, the memory access time had increased significantly to 180 machine cycles
for an Intel Core i7 processor. Since then, memory access time has increased fur-
ther and in 2022, the memory latencies for an AMD EPYC Rome and an Intel
Xeon Cascade Lake SP server processor are 220 cycles and 200 cycles, respec-
tively [221]. Therefore, memory access times could become a limiting factor for
further performance increase, and cache memories are used to prevent this, see
Section 2.9 for a further discussion. In the future, it can be expected that the number
of cycles needed for a memory access will not change significantly.
There are more problems that processor designers have to face: Using the in-
creased number of transistors to increase the complexity of the processor archi-
tecture may also lead to an increase in processor-internal wire length to transfer
control and data between the functional units of the processor. Here, the speed
of signal transfers within the wires could become a limiting factor. For example,
a processor with a clock frequency of 3 GHz = 3 · 10^9 Hz has a cycle time of
1/(3 · 10^9 Hz) = 0.33 · 10^-9 s = 0.33 ns. Assuming a signal transfer at the speed of
light (which is 0.3 · 10^9 m/s), a signal can cross a distance of 0.33 · 10^-9 s · 0.3 · 10^9 m/s
= 10 cm in one processor cycle. This is not significantly larger than the typical size
of a processor chip and wire lengths become an important issue when the clock
frequency is increased further.
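
This back-of-the-envelope calculation can be checked with a few lines of C; the clock frequency and signal speed below are only the illustrative values used in the text, not measured processor data.

#include <stdio.h>

int main(void) {
    double clock_hz     = 3.0e9;   /* clock frequency: 3 GHz                       */
    double signal_speed = 0.3e9;   /* assumed signal speed in m/s (speed of light) */
    double cycle_time   = 1.0 / clock_hz;            /* approx. 0.33 ns            */
    double distance     = cycle_time * signal_speed; /* approx. 0.1 m = 10 cm      */
    printf("cycle time: %.3g s, distance per cycle: %.3g m\n",
           cycle_time, distance);
    return 0;
}
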
Another problem is the following: The physical size of a processor chip limits
the number of pins that can be used, thus limiting the bandwidth between CPU and
main memory. This may lead to a processor-to-memory performance gap which
is sometimes referred to as memory wall. This makes the use of high-bandwidth
memory architectures with an efficient cache hierarchy necessary [19].
All these reasons inhibit a processor performance increase at the previous rate
when using the traditional techniques. Instead, new processor architectures have to
be used, and the use of multiple cores on a single processor die is considered as the
most promising approach. Instead of further increasing the complexity of the inter-
nal organization of a processor chip, this approach integrates multiple independent
processing cores with a relatively simple architecture onto one processor chip. This
has the additional advantage that the energy consumption of a processor chip can
be reduced if necessary by switching off unused processor cores during idle times
[102].
Multicore processors integrate multiple execution cores on a single processor
chip. For the operating system, each execution core represents an independent log-
ical processor with separate execution resources, such as functional units or execu-
tion pipelines. Each core has to be controlled separately, and the operating system
can assign different application programs to the different cores to obtain a parallel
execution. Background applications like virus checking, image compression, and
encoding can run in parallel to application programs of the user. By using techniques
of parallel programming, it is also possible to execute a computational-intensive ap-
plication program (like computer games, computer vision, or scientific simulations)
in parallel on a set of cores, thus reducing execution time compared to an execution
on a single core or leading to more accurate results by performing more computa-
tions than in the sequential case. In the future, users of standard application programs
such as computer games will likely expect an efficient use of the execution cores of a
processor chip. To achieve this, programmers have to use techniques from parallel
programming.
The use of multiple cores on a single processor chip also enables standard pro-
grams, such as text processing, office applications, or computer games, to provide
additional features that are computed in the background on a separate core so that
the user does not notice any delay in the main application. But again, techniques of
parallel programming have to be used for the implementation.
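
As a minimal sketch of how such techniques look in practice (assuming an OpenMP-capable C compiler; the loop and the arrays are made up for illustration), a compute-intensive loop can be distributed across the cores of a multicore processor with a single directive:

#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    /* the loop iterations are divided among the cores by the OpenMP runtime */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("executed with up to %d threads\n", omp_get_max_threads());
    return 0;
}

The sketch is compiled, e.g., with gcc -fopenmp; the number of threads used can be set with the environment variable OMP_NUM_THREADS.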

2.6.3 Architecture of Multicore Processors

There are many different design variants for multicore processors, differing in the
number of cores, the structure and size of the caches, the access of cores to caches,
and the use of heterogeneous components. From a high-level view, [133] distin-
guishes three main types of architectures: (a) a hierarchical design in which mul-
tiple cores share multiple caches that are organized in a tree-like configuration, (b)
a pipelined design where multiple execution cores are arranged in a pipelined way,
and (c) a network-based design where the cores are connected via an on-chip inter-
connection network, see Fig. 2.7 for an illustration. In 2022, most multicore proces-
sors are based on a network-based design using a fast on-chip interconnect to couple
the different cores. Earlier multicore processors often relied on a hierarchical design.

2.6.3.1 Homogeneous Multicore Processors

As described in Section 2.1, the exploitation of parallelism by pipelining and ILP is
limited due to dependencies between instructions. This limitation and the need for
further performance improvement have led to the introduction of multicore designs
starting in the year 2005 with the Intel Pentium D processor, which was a dual-core
processor with two identical cores. The first quad-core processors were released in
2006, the first 12-core processors in 2010. This trend has continued, and in 2022
multicore processors with up to 96 cores are available, see Table 2.1. Processors with a large
number of cores are typically used for server processors, whereas desktop computers
or mobile computers normally use processors with a smaller number of cores, usually
between four and 16 cores.

[Fig. 2.7 Design choices for multicore chips according to [133]: (a) hierarchical design, (b) pipelined
design, (c) network-based design; the original figure shows block diagrams of cores, caches, memory,
and the interconnection network for each variant.]

The reason for this usage lies in the fact that proces-
sors with a large number of cores provide a higher performance, but they also have
a larger power consumption and a higher price than processors with a smaller num-
ber of cores. The larger performance can especially be exploited for server systems
when executing jobs from different users. Desktop computers or mobile computers
are used by a single user and the performance requirement is therefore smaller than
for server systems. Hence, less expensive systems are sufficient, with the positive
effect that they are also more energy-efficient.

Processor                                   Cores  Threads  Clock (GHz)  L1 cache     L2 cache      L3 cache  Year
Intel Core i9-11900 "Rocket Lake"               8       16          2.5  8 x 64 KB    8 x 512 KB      16 MB   2021
Intel Xeon Platinum 8380 "Ice Lake"            40       80          2.3  40 x 48 KB   40 x 1.25 MB    60 MB   2021
Intel Mobil-Core i9-11950H "Tiger Lake-H"       8       16          2.6  8 x 48 KB    8 x 1.28 MB     24 MB   2020
AMD Ryzen 9 7950X "Zen 4"                      16       32          4.5  1 MB         16 MB           64 MB   2022
AMD EPYC 9654P "Zen 4 Genoa"                   96      192          2.4  6 MB         96 MB          384 MB   2022
IBM Power10                                    15      120          3.5  15 x 32 KB   15 x 2 MB      120 MB   2021
Ampere Altra Max ARM v8                       128      128          3.0  128 x 64 KB  128 x 1 MB      16 MB   2020

Table 2.1 Examples of multicore processors with a homogeneous design available in 2022.

Most processors for server and desktop systems rely on a homogeneous design,
i.e., they contain identical cores where each core has the same computational per-
formance and the same power consumption, see Figure 2.8 (left) for an illustration.

Fig. 2.8 Illustration of multicore architectures with a homogeneous design (left) and with a het-
erogeneous design (right). The homogeneous design contains eight identical cores C. The hetero-
geneous design has three performance cores P, five energy-efficient cores E, a cryptography core
Cr, and a graphics unit Gr.

This has the advantage that the computational work can be distributed evenly among
the different cores, such that each core gets about the same amount of work. Thus,
the execution time of application programs can be reduced if suitable multithreading
programming techniques are used.
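
A simple way to exploit this property is a static block distribution of the work. The following C sketch (with hypothetical names; n independent work items are assumed, distributed among p identical cores) computes the contiguous block of items assigned to thread k so that each core gets about the same amount of work:

/* compute the block [*lo, *hi) of n work items assigned to thread k
   out of p threads; each thread gets either n/p or n/p + 1 items    */
void block_range(long n, int p, int k, long *lo, long *hi) {
    long base = n / p;   /* minimum number of items per thread          */
    long rest = n % p;   /* the first 'rest' threads get one extra item */
    *lo = k * base + (k < rest ? k : rest);
    *hi = *lo + base + (k < rest ? 1 : 0);
}

With this distribution, the blocks differ in size by at most one item, so identical cores finish their share at about the same time.
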
Table 2.1 gives examples of multicore processors with a homogeneous design,
specifying the number of cores and threads, the base clock frequency in GHz as
well as information about the size of the L1, L2, and L3 caches and the release year.
The size for the L1 cache given in the table is the size of the L1 data cache. The
Intel Core i9-11900 and the AMD Ryzen 9 7950X processors shown in the table
are desktop processors; the Intel i9-11950H processor is a processor for notebooks.
The other four processors are designed for use in server systems. These server
processors provide a larger performance than the desktop or mobile processors, but
they are also much more expensive. All processors shown have private L1 and L2
caches for each core and use shared L3 caches. The clock frequencies given are
the base frequencies that can be increased to a higher turbo frequency for a short
time period if required. Considering the power consumption, it can be observed that
the server processors have a much larger power consumption that the desktop and
mobile processors. This can be seen when considering the Thermal Design Power
(TDP), which captures the maximum amount of heat that is generated by a proces-
sor. The TDP is measured in Watt and determines the cooling requirement for the
processor and can be considered as a rough measure when comparing the average
power consumption of processor chips. For example, the TDP is 270 W for the Intel
Xeon Platinum 8380 server processor, 65 W for the Intel Core i9-11900 desktop
processor, and 35 W for the Intel Mobil-Core i9-11950H mobile processor.