Elements of Parallel Computing
Eric Aubanel
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To my grandfather, Dr. E.P. Scarlett: physician, educator and
scholar.
Contents
1.1 INTRODUCTION
1.2 TERMINOLOGY
1.3 EVOLUTION OF PARALLEL COMPUTERS
1.4 EXAMPLE: WORD COUNT
1.5 PARALLEL PROGRAMMING MODELS
1.5.1 Implicit Models
1.5.2 Semi-Implicit Models
1.5.3 Explicit Models
1.5.4 Thinking in Parallel
1.6 PARALLEL DESIGN PATTERNS
1.6.1 Structural Patterns
1.6.2 Computational Patterns
1.6.3 Patterns in the Lower Layers
1.7 WORD COUNT IN PARALLEL
1.8 OUTLINE OF THE BOOK
Bibliography
Index
Preface
Parallel computing is hard, it’s creative, and it’s an essential part of high performance
scientific computing. I got my start in this field parallelizing quantum mechanical wave
packet evolution for the IBM SP. Parallel computing has now joined the mainstream, thanks
to multicore and manycore processors, and to the cloud and its Big Data applications. This
ubiquity has resulted in a move to include parallel computing concepts in undergraduate
computer science curricula. Clearly, a CS graduate must be familiar with the basic concepts
and pitfalls of parallel computing, even if he/she only ever uses high level frameworks.
After all, we expect graduates to have some knowledge of computer architecture, even if
they never write code in an assembler.
Exposing undergraduates to parallel computing concepts doesn’t mean dismantling the
teaching of this subject in dedicated courses, as it remains an important discipline in com-
puter science. I’ve found it a very challenging subject to teach effectively, for several reasons.
One reason is that it requires students to have a strong background in sequential program-
ming and algorithm design. Students with a shaky mental model of programming quickly
get bogged down with parallel programming. Parallel computing courses attract many stu-
dents, but many of them struggle with the challenges of parallel programming, debugging,
and getting even modest speedup. Another challenge is that the discipline has been driven
throughout its history by advances in hardware, and these advances keep coming at an
impressive pace. I’ve regularly had to redesign my courses to keep up. Unfortunately, I’ve
had little help from textbooks, as they have gone out of print or out of date.
This book presents the fundamental concepts of parallel computing not from the point
of view of hardware, but from a more abstract view of the algorithmic and implementation
patterns. While the hardware keeps changing, the same basic conceptual building blocks
are reused. For instance, SIMD computation has survived through many incarnations from
processor arrays to pipelined vector processors to SIMT execution on GPUs. Books on the
theory of parallel computation approach the subject from a similar level of abstraction, but
practical parallel programming books tend to be tied to particular programming models
and hardware. I’ve been inspired by the work on parallel programming patterns, but I
haven’t adopted the formal design patterns approach, as I feel it is more suited to expert
programmers than to novices.
My aim is to facilitate the teaching of parallel programming by surveying some key
algorithmic structures and programming models, together with an abstract representation
of the underlying hardware. The presentation is meant to be friendly and informal. The
motivation, goals, and means of my approach are the subject of Chapter 1. The content
of the book is language neutral, using pseudocode that represents common programming
language models.
The first five chapters present the core concepts. After the introduction in Chapter 1,
Chapter 2 presents SIMD, shared memory, and distributed memory machine models, along
with a brief discussion of what their execution models look like. Chapter 2 concludes with a
presentation of the task graph execution model that will be used in the following chapters.
Chapter 3 discusses decomposition as a fundamental activity in parallel algorithmic design,
starting with a naive example, and continuing with a discussion of some key algorithmic
structures. Chapter 4 covers some important programming models in depth, and shows
contrasting implementations of the task graphs presented in the previous chapter. Finally,
Chapter 5 presents the important concepts of performance analysis, including work-depth
analysis of task graphs, communication analysis of distributed memory algorithms, some
key performance metrics, and a discussion of barriers to obtaining good performance. A
brief discussion of how to measure and report performance is included, because I have
observed that this is often done poorly by students and even in the literature.
This book is meant for an introductory parallel computing course at the advanced un-
dergraduate or beginning graduate level. A basic background in computer architecture and
algorithm design and implementation is assumed. Clearly, hands-on experience with parallel
programming is essential, and the instructor will have selected one or more languages and
computing platforms. There are many good online and print resources for learning particu-
lar parallel programming language models, which could supplement the concepts presented
in this book. While I think it’s valuable for students to be familiar with all the parallel pro-
gram structures in Chapter 4, in a given course a few could be studied in more detail along
with practical programming experience. The instructor who wants to get students program-
ming as soon as possible may want to supplement the algorithmic structures of Chapter 3
with simple programming examples. I have postponed performance analysis until Chapter 5,
but sections of this chapter could be covered earlier. For instance, the work-depth analysis
could be presented together with Chapter 3, and performance analysis and metrics could
be presented in appropriate places in Chapter 4. The advice of Section 5.4 on reporting
performance should be presented before students have to write up their experiments.
The second part of the book presents three case studies that reinforce the concepts of
the earlier chapters. One feature of these chapters is to contrast different solutions to the
same problem. I have tried for the most part to select problems that aren’t discussed all
that often in parallel computing textbooks. They include the Single Source Shortest Path
Problem in Chapter 6, the Eikonal equation in Chapter 7, which is a partial differential
equation with relevance to many fields, including graphics and AI, and finally in Chapter 8
a classical computational geometry problem, computation of the two-dimensional convex
hull. These chapters could be supplemented with material from other sources on other well-
known problems, such as dense matrix operations and the Fast Fourier Transform. I’ve also
found it valuable, particularly in a graduate course, to have students research and present
case studies from the literature.
Acknowledgements
I would first like to acknowledge an important mentor at the University of New Brunswick,
professor Virendra Bhavsar, who helped me transition from a high performance computing
practitioner to a researcher and teacher. I’ve used the generalized fractals he worked on,
instead of the more common Mandelbrot set, to illustrate the need for load balancing. This
book wouldn’t have been possible without my experience teaching CS4745/6025, and the
contributions of the students. In particular, my attention was drawn to the subset sum
problem by Steven Stewart’s course project and Master’s thesis.
The formal panel sessions and informal discussions I witnessed from 2002 to 2014 at the
International Parallel and Distributed Processing Symposium were very stimulating and
influential in developing my views. I finally made the decision to write this book after reading
the July 2014 JPDC special issue on Perspectives on Parallel and Distributed Processing.
In this issue, Robert Schreiber's "A few bad ideas on the way to the triumph of parallel
computing" [57] echoed many of my views, and inspired me to set them down in print. Finally,
I would like to thank Siew Yin Chan, my former PhD student, for valuable discussions and
revision of early material.
CHAPTER 1
Overview of Parallel Computing
1.1 INTRODUCTION
In the first 60 years of the electronic computer, beginning in 1940, computing performance
per dollar increased on average by 55% per year [52]. This staggering 100 billion-fold increase
hit a wall in the middle of the first decade of this century. The so-called power wall arose
when processors couldn’t work any faster because they couldn’t dissipate the heat they pro-
duced. Performance has kept increasing since then, but only by placing multiple processors
on the same chip, and limiting clock rates to a few GHz. These multicore processors are
found in devices ranging from smartphones to servers.
Before the multicore revolution, programmers could rely on a free performance increase
with each processor generation. However, the disparity between theoretical and achievable
performance kept increasing, because processing speed grew much faster than memory band-
width. Attaining peak performance required careful attention to memory access patterns in
order to maximize re-use of data in cache memory. The multicore revolution made things
much worse for programmers. Now increasing the performance of an application required
parallel execution on multiple cores.
Enabling parallel execution on a few cores isn’t too challenging, with support available
from language extensions, compilers and runtime systems. The number of cores keeps in-
creasing, and manycore processors, such as Graphics Processing Units (GPUs), can have
thousands of cores. This makes achieving good performance more challenging, and parallel
programming is required to exploit the potential of these parallel processors.
better use of them. Specialized parallel code is often essential for applications requiring high
performance. The challenge posed by the rapidly growing number of cores has meant that
more programmers than ever need to understand something about parallel programming.
Fortunately parallel processing is natural for humans, as our brains have been described as
parallel processors, even though we have been taught to program in a sequential manner.
1.2 TERMINOLOGY
It’s important to be clear about terminology, since parallel computing, distributed computing,
and concurrency are all overlapping concepts that have been defined in different ways.
Parallel computers can also be placed in several categories.
Definition 1.1 (Parallel Computing). Parallel Computing means solving a computing prob-
lem in less time by breaking it down into parts and computing those parts simultaneously.
Parallel computers provide more computing resources and memory in order to tackle
problems that cannot be solved in a reasonable time by a single processor core. They differ
from sequential computers in that there are multiple processing elements that can execute
instructions in parallel, as directed by the parallel program. We can think of a sequential
                          Instruction streams
                          single      multiple
Data streams   single     SISD        MISD
               multiple   SIMD        MIMD
Figure: parallel computing, distributed computing, and concurrency as overlapping concepts.
the domain of enthusiasts and those whose needs required high performance computing
resources.
Not only is expert knowledge needed to write parallel programs for a cluster, but the
significant computing needs of large scale applications also require the use of shared
supercomputing facilities. Users had to master the complexities of coordinating the execution of applications
and file management, sometimes across different administrative and geographic domains.
This led to the idea of Grid computing in the late 1990s, with the dream of computing as
a utility, which would be as simple to use as the power grid. While the dream hasn’t been
realized, Grid computing has provided significant benefit to collaborative scientific data
analysis and simulation. This type of distributed computing wasn’t adopted by the wider
community until the development of cloud computing in the early 2000s.
Cloud computing has grown rapidly, thanks to improved network access and virtualiza-
tion techniques. It allows users to rent computing resources on demand. While using the
cloud doesn’t require parallel programming, it does remove financial barriers to the use of
compute clusters, as these can be assembled and configured on demand. The introduction
of frameworks based on the MapReduce programming model eliminated the difficulty of
parallel programming for a large class of data processing applications, particularly those
associated with the mining of large volumes of data.
With the emergence of multicore and manycore processors all computers are parallel
computers. A desktop computer with an attached manycore co-processor features thousands
of cores and offers performance in the trillions of operations per second. Put a number of
these computers on a network and even more performance is available, with the main
limiting factor being power consumption and heat dissipation. Parallel computing is now
relevant to all application areas. Scientific computing isn’t the only player any more in
large scale high performance computing. The need to make sense of the vast quantity of
data that cheap computing and networks have produced, so-called Big Data, has created
another important use for parallel computing.
If the input consists of two documents, one containing “The quick brown fox jumps
over a lazy dog” and the other containing “The brown dog chases the tabby cat,” then the
output would be the list: <a, 1>, <brown, 2>, ..., <tabby, 1>, <the, 3>.
Think about how you would break this algorithm into parts that could be computed in
parallel, before we discuss below how this could be done.
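For concreteness, here is one possible sequential implementation, sketched in C++ rather than the language-neutral pseudocode used in this book (the two documents are hard coded and lowercased to keep the sketch minimal):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // The two small "documents" of the example, already lowercased.
    std::vector<std::string> documents = {
        "the quick brown fox jumps over a lazy dog",
        "the brown dog chases the tabby cat"};

    std::map<std::string, int> counts;   // word -> number of occurrences
    for (const std::string& doc : documents) {
        std::istringstream words(doc);
        std::string word;
        while (words >> word)
            ++counts[word];              // purely sequential update
    }

    for (const auto& [word, n] : counts)
        std::cout << "<" << word << ", " << n << ">\n";
}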
Once loops that have independent iterations have been identified, the parallelism is very
simple to express. Ensuring correctness is another matter, as we’ll see in Chapters 2 and 4.
Figure: structural patterns and computational patterns, with the case studies of Chapters 6-8.
ways of thinking [69]. It could be argued that making the transition from sequential to
parallel notional machines is another threshold concept.
Figure: a text processing pipeline that turns a stream of messages into plain text words, with stages including English language identification, tokenization, and metadata removal.
Figure 1.5: Matrix-vector multiplication Ax = b on four cores; core i computes the inner product of row i of A with the vector x to produce element b_i.
libraries. The High Performance Linpack (HPL) benchmark used to classify the top 500
computers involves solving a dense system of linear equations. Parallelism arises naturally
in these applications. In matrix-vector multiplication, for example, the inner products that
compute each element of the result vector can be computed independently, and hence in
parallel, as seen in Figure 1.5. In practice it is more difficult to develop solutions that scale
well with matrix size and the number of processors, but the plentiful literature provides
guidance.
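The independence of the inner products can be made explicit in code. The following C++ sketch assigns rows to a few threads (the thread count, the cyclic row distribution and the 4 x 4 test case are illustrative choices, not prescribed here):

#include <cstdio>
#include <thread>
#include <vector>

// y = A*x, with A an n-by-n matrix stored row-major. Each thread computes the
// inner products for its own subset of rows, independently of the other threads.
void matvec(const std::vector<double>& A, const std::vector<double>& x,
            std::vector<double>& y, std::size_t n, unsigned nthreads) {
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < n; i += nthreads) {   // cyclic row distribution
                double sum = 0.0;
                for (std::size_t j = 0; j < n; ++j)
                    sum += A[i * n + j] * x[j];
                y[i] = sum;   // threads write distinct elements: no data race
            }
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    std::size_t n = 4;
    std::vector<double> A(n * n, 1.0), x(n, 2.0), y(n);
    matvec(A, x, y, n, 4);                         // one thread per row, as in Figure 1.5
    for (double v : y) std::printf("%g ", v);      // prints 8 8 8 8
    std::printf("\n");
}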
Another important pattern is one where operations are performed on a grid of data. It
occurs in scientific simulations that numerically solve partial differential equations, and also
in image processing that executes operations on pixels. The solutions for each data point
can be computed independently but they require data from neighboring points. Other pat-
terns include those found in graph algorithms, optimization (backtrack/branch and bound,
dynamic programming), and sorting. It is reassuring that the lists of these patterns that
have been drawn up contain fewer than twenty entries. Even though the landscape of parallel
computing is vast, most applications can be composed of a small number of well studied
computational patterns.
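As a small sketch of the grid pattern mentioned above, consider one sweep of a simple averaging stencil (a Jacobi-style update, chosen purely for illustration). Every interior point of the new grid can be computed independently, and hence in parallel, but each computation reads the neighboring values of the current grid:

#include <cstdio>
#include <vector>

// One sweep of a 4-point averaging stencil on an n-by-n grid stored row-major.
void sweep(const std::vector<double>& curr, std::vector<double>& next, std::size_t n) {
    for (std::size_t i = 1; i + 1 < n; ++i)
        for (std::size_t j = 1; j + 1 < n; ++j)
            next[i * n + j] = 0.25 * (curr[(i - 1) * n + j] + curr[(i + 1) * n + j] +
                                      curr[i * n + j - 1] + curr[i * n + j + 1]);
}

int main() {
    std::size_t n = 8;
    std::vector<double> curr(n * n, 1.0), next(n * n, 0.0);
    curr[3 * n + 3] = 100.0;                        // a single hot spot
    sweep(curr, next, n);
    std::printf("%g\n", next[2 * n + 3]);           // a neighbor of the hot spot: 25.75
}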
Figure: word count with MapReduce; map tasks emit word-count pairs that are combined into <a,1>, <brown,2>, <cat,1>, <chases,1>, <dog,2>, <fox,1>, <jumps,1>, <lazy,1>, <over,1>, <quick,1>, <tabby,1>, <the,3>.
algorithmic and implementation structures to create parallel software. These structures will
be explored in detail in Chapters 3 and 4.
Distributed and shared memory implementations could also be combined. Taking a look at the
sequential algorithm we can see that word counts could be done independently for each line
of text. The processing of the words of the sentences of each document could be accomplished
in parallel using a shared data structure, while the documents could be processed by multiple
computers.
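One possible shared memory rendering of this idea in C++ is sketched below: one thread per document, each filling its own private map, followed by a sequential merge (the per-document granularity and the choice of data structures are illustrative):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

int main() {
    std::vector<std::string> documents = {
        "the quick brown fox jumps over a lazy dog",
        "the brown dog chases the tabby cat"};

    // One private map per document, so the counting threads never share data.
    std::vector<std::map<std::string, int>> partial(documents.size());
    std::vector<std::thread> workers;
    for (std::size_t d = 0; d < documents.size(); ++d) {
        workers.emplace_back([&, d] {
            std::istringstream words(documents[d]);
            std::string word;
            while (words >> word) ++partial[d][word];
        });
    }
    for (auto& w : workers) w.join();

    // Combine the partial counts sequentially (a reduction).
    std::map<std::string, int> counts;
    for (const auto& p : partial)
        for (const auto& [word, n] : p) counts[word] += n;

    for (const auto& [word, n] : counts)
        std::cout << "<" << word << ", " << n << ">\n";
}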
This is a glimpse of how rich the possibilities can be in parallel programming. It can get
even richer as new computational platforms emerge, as happened in the mid 2000s with gen-
eral purpose programming on graphics processing units (GPGPU). The patterns of parallel
computing provided a guide in our example problem, as they allowed the recognition that
it fit the MapReduce structural pattern, and that implementation could be accomplished
using well established algorithmic and implementation patterns. While this book will not
follow the formalism of design patterns, it does adopt the similar view that there is a set
of parallel computing elements that can be composed in many ways to produce clear and
effective solutions to computationally demanding problems.
Unsurprisingly, parallel programs are executed on parallel machines. It may be less obvious
that they are also executed in the mind of the parallel programmer. All programmers have a
mental model of program execution that allows them to trace through their code. Program-
mers who care about performance also have a mental model of the machine they are using.
Knowledge about the operation of the cache hierarchy, for instance, allows a programmer
to structure loops to promote cache reuse. While many programmers can productively work
at a high level of abstraction without any concern for the hardware, parallel programmers
cannot afford to do so. The first part of this chapter discusses three machine models and
the second part presents a task graph execution model. The task graph model is used ex-
tensively in Chapter 3 when presenting algorithmic structures. The machine models are
examined more deeply together with implementation structures in Chapter 4.
Figure 2.1: a) scalar architecture b) SIMD architecture, with vector registers that can contain 4 data elements.
general purpose parallel computers: SIMD, shared memory and distributed memory. Recall from
Chapter 1 that the latter two are subcategories of the MIMD architecture. These machines
mainly need to be understood at a high level of abstraction for the purpose of algorithm
design and implementation strategy. Understanding of deeper levels of machine architecture
is important during implementation, at which point technical documentation from the hard-
ware vendor should be consulted. One exception has to do with access to shared variables,
where race conditions can lead to incorrect programs. Another is the performance impact
of cache memory, which is sensitive to data access patterns.
2.1.1 SIMD
SIMD computers simultaneously execute a single instruction on multiple data elements. His-
torically, processors in SIMD computers consisted either of a network of Arithmetic Logic
Units (ALUs), as in the Connection Machine, or had deeply pipelined vector arithmetic
units, as in computers from Cray and NEC. Currently, SIMD execution is mainly found in
functional units in general purpose processors, and our discussion will reflect this architec-
ture, as sketched in Figure 2.1. SIMD ALUs are wider than conventional ALUs and can
perform multiple operations simultaneously in a single clock cycle. They use wide SIMD
registers that can load and store multiple data elements to memory in a single transaction.
For example, for a scalar ALU to add the first 4 elements of two arrays a and b and
store the result in array c, machine instructions similar to the following would need to be
executed in sequence:
1: r1 ← load a[0]
2: r2 ← load b[0]
3: r2 ← add r1, r2
4: c[0] ← store r2
5: r1 ← load a[1]
6: r2 ← load b[1]
7: r2 ← add r1, r2
8: c[1] ← store r2
9: r1 ← load a[2]
10: r2 ← load b[2]
11: r2 ← add r1, r2
12: c[2] ← store r2
13: r1 ← load a[3]
14: r2 ← load b[3]
15: r2 ← add r1, r2
16: c[3] ← store r2
A SIMD ALU of width equal to the size of four elements of the arrays could do the same
operations in just four instructions:
1: v1 ← vload a
2: v2 ← vload b
3: v2 ← vadd v1, v2
4: c ← vstore v2
The vload and vstore instructions load and store four elements of an array from memory
into a vector register in a single transaction. The vadd instruction simultaneously adds the
four values stored in each vector register.
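This pseudocode corresponds closely to what SIMD compiler intrinsics expose. The following C++ fragment uses x86 SSE intrinsics purely as an illustration (the choice of instruction set is mine; the model itself is not tied to one):

#include <cstdio>
#include <immintrin.h>   // x86 SSE intrinsics

int main() {
    alignas(16) float a[4] = {1, 2, 3, 4};        // aligned so the vector loads are legal
    alignas(16) float b[4] = {10, 20, 30, 40};
    alignas(16) float c[4];

    __m128 v1 = _mm_load_ps(a);    // v1 <- vload a   (4 floats in one register)
    __m128 v2 = _mm_load_ps(b);    // v2 <- vload b
    v2 = _mm_add_ps(v1, v2);       // v2 <- vadd v1, v2 (4 additions at once)
    _mm_store_ps(c, v2);           // c  <- vstore v2

    for (float x : c) std::printf("%g ", x);       // prints 11 22 33 44
    std::printf("\n");
}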
SIMD execution is also found in manycore co-processors, such as the Intel Xeon Phi.
Nvidia Graphics Processing Units (GPUs) use SIMD in a different way, by scheduling
threads in groups (called warps), where the threads in each group perform the same in-
struction simultaneously on their own data. The difference with conventional SIMD led
the company to coin a new term, Single Instruction Multiple Threads (SIMT). It works
something like this:
t0: r00 ← load a[0] r10 ← load a[1] r20 ← load a[2] r30 ← load a[3]
t1: r01 ← load b[0] r11 ← load b[1] r21 ← load b[2] r31 ← load b[3]
t2: r01 ← add r00, r01 r11 ← add r10, r11 r21 ← add r20, r21 r31 ← add r30, r31
t3: c[0] ← store r01 c[1] ← store r11 c[2] ← store r21 c[3] ← store r31
Thread i loads a[i], and all loads are coalesced into a single transaction at time t0. The same
thing occurs at time t1 for loading the elements of b. All threads simultaneously perform an
addition at time t2. Finally all stores to c are coalesced at time t3. The difference between
SIMD and SIMT for the programmer is in the programming model: SIMD uses data parallel
loops or array operations whereas in the SIMT model the programmer specifies work for
each thread, and the threads are executed in SIMD fashion.
The popularity of the SIMD architecture comes from the common programming pattern
Figure 2.2: Basic multicore processor model, consisting of multiple cores (C) sharing memory (M) units organized hierarchically.
where the same operation is executed independently on the elements of an array. SIMD
units can be programmed explicitly using a parallel programming language that supports
data parallelism. They can also be programmed directly in assembly language using vec-
tor instructions, as in the above example. More commonly the parallelism is not specified
explicitly by the programmer, but is exploited by the compiler or by the runtime system.
The SIMD model is not hard to understand from the programmer’s perspective, but it
can be challenging to work within the restrictions imposed by the model. Common execution
patterns often deviate from the model, such as conditional execution and non-contiguous
data, as we’ll see in Chapter 4.
Figure 2.3: Basic parallel computer with two processors. Can be implemented as shared memory (multiprocessor) or distributed memory (multicomputer).
Figure 2.4: Manycore processor with 32-way SIMD units connected to a multicore processor.
functional units, as in the Intel Xeon Phi, or SIMT execution of threads in a warp on an
Nvidia GPU. The multicore processor and manycore co-processor can share memory, or they
can have distinct memory spaces. In either case there are likely to be nonuniform access
times to memory, as in the multiprocessor of Figure 2.3. Multiple hybrid multicore/manycore
computers can be combined on a network to produce a distributed multicomputer with very
high performance.
// process 0 // process 1
1: send a[2 . . . 3] to process 1 1: receive a[0 . . . 1] from process 0
2: send b[2 . . . 3] to process 1 2: receive b[0 . . . 1] from process 0
3: c[0] ← a[0] + b[0] 3: c[0] ← a[0] + b[0]
4: c[1] ← a[1] + b[1] 4: c[1] ← a[1] + b[1]
5: receive c[2 . . . 3] from process 1 5: send c[0 . . . 1] to process 0
Arrays a and b are initially in the memory of process 0, and the result of a + b is to be stored
there as well. The communication is illustrated in Figure 2.5. Process 0 sends the second
half of arrays a and b to process 1. Notice that each process uses local indexing for the
Figure 2.5: Communication involved in adding two arrays with two processes.
arrays, and hence both use identical statements to add their arrays. Execution of the sums
can then take place independently on each process until process 1 has finished its sums and
sends its array c to process 0. Lines 3 and 4 won’t necessarily execute exactly simultaneously.
Process 0 might start earlier since it could proceed to line 3 while both messages are in flight
and before process 1 has received them. The execution times of these operations could be
different for each process, due to different memory access times or competition for CPU
resources from operating system processes.
The message passing programming model gives a lot of control to the programmer.
Since communication is explicit it’s possible to trace through execution and reason about
correctness and performance. In practice this type of programming can be quite complex.
We’ll look at some common patterns in message passing programming in Chapter 4.
Here the threads don’t execute their instructions in lockstep, unlike the SIMT example
above. Their execution will overlap in time, but the exact timing is unpredictable. Observe
that each thread loads and stores to different memory locations, so the relative timing of
threads has no impact on the result.
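A minimal C++ rendering of this situation, with one thread per array element so that every load and store touches a distinct location, could look like this (the thread-per-element granularity is only for illustration):

#include <cstdio>
#include <thread>

int main() {
    double a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];

    std::thread threads[4];
    for (int i = 0; i < 4; ++i)
        threads[i] = std::thread([&, i] { c[i] = a[i] + b[i]; });  // disjoint locations: no race
    for (auto& t : threads) t.join();

    for (double x : c) std::printf("%g ", x);   // 11 22 33 44, whatever the thread timing
    std::printf("\n");
}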
Data Races
Let’s say after adding a and b we want to use array c, and we add two more instructions
for thread 0:
r00 ← load c[3]
x ← store r00
This means thread 0 is loading from a memory location that thread 3 stores to. The value of
x is indeterminate, since it depends on the relative timing of threads 0 and 3. The problem
is that thread 0 could load c[3] before thread 3 stored its value to c[3]. This is an example
of a data race, which occurs when one thread stores a value to a memory location while
other threads load or store to the same location. Data races are a serious threat to program
correctness.
Data races can be hidden from view in higher level programming languages. Consider
two threads incrementing the same variable:
// thread 0 // thread 1
sum ← sum + 1 sum ← sum + 1
If sum is initially 0, then one might think that the result of these two operations would be
sum = 2, whatever the temporal order of the two threads. However, another possible result
is sum = 1, because the statement sum ← sum + 1 is not executed atomically (from the
Greek atomos meaning indivisible) at the machine level. Let’s look at machine instructions
for this operation:
// thread 0 // thread 1
r00 ← load sum r10 ← load sum
r01 ← add r00,1 r11 ← add r10,1
sum ← store r01 sum ← store r11
Now the data race is more easily seen, as both threads could load sum before either of them
stored their value, as illustrated in Figure 2.6.
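The lost update is easy to reproduce, and to repair, in a language such as C++. In the sketch below (my own illustration), two threads increment a plain shared variable, which is a data race, while two other threads increment a std::atomic variable, whose fetch_add executes atomically:

#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    int sum = 0;                 // plain shared variable: increments can be lost
    std::atomic<int> asum{0};    // atomic variable: each increment is indivisible

    auto racy = [&] { for (int i = 0; i < 100000; ++i) sum = sum + 1; };
    auto safe = [&] { for (int i = 0; i < 100000; ++i) asum.fetch_add(1); };

    std::thread t0(racy), t1(racy), t2(safe), t3(safe);
    t0.join(); t1.join(); t2.join(); t3.join();

    // sum is often less than 200000; asum is always exactly 200000.
    std::printf("plain: %d  atomic: %d\n", sum, asum.load());
}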
Returning to our previous example, we could try to fix this data race by synchronizing
threads so that thread 0 doesn’t read from c[3] until thread 3 has written its value there.
Thread 3 sets a flag once its value is written, and thread 0 makes sure that flag is set before
loading c[3] (flag is initialized to 0):
// thread 0                  // thread 3
L: r00 ← load flag           c[3] ← store r31
   If flag ≠ 1 goto L        flag ← store 1
   r00 ← load c[3]
   x ← store r00
We might expect that, even if thread 3 were delayed, thread 0 would keep looping until
thread 3 had written to c[3] and set the flag. This expectation comes from our practice of
reasoning about sequential programs, where we know that instructions may be reordered,
but the result is the same as if they executed in program order. This intuition is no longer
valid for multithreaded execution. The ordering of instructions executed by multiple threads
is governed by the memory model of the system.
Memory Model
A useful way to look at multithreaded execution is to imagine that all the threads are running
on a single core, so they have to be executed in sequence in some order. An example of such an
interleaved order for the sum of arrays a and b could be this:
1: r00 ← load a[0]
2: r10 ← load a[1]
3: r20 ← load a[2]
4: r30 ← load a[3]
5: r01 ← load b[0]
6: r11 ← load b[1]
7: r21 ← load b[2]
8: r31 ← load b[3]
9: r21 ← add r20, r21
10: r01 ← add r00, r01
11: r31 ← add r30, r31
12: r11 ← add r10, r11
13: c[0] ← store r01
14: c[1] ← store r11
15: c[2] ← store r21
16: c[3] ← store r31
Of course, we want to run each thread on its own core to speed up execution, but logically
the result (array c) after multicore execution is the same as if they did execute in interleaved
fashion on one core.
We can observe that each thread executes its instructions in program order in this
example. The order of the interleaving is arbitrary, although here the loads and stores are
in thread order and the adds are in a permuted order. This type of multithreaded execution
order, which preserves the program order of each thread, is called sequential consistency.
It’s the most natural way to think about multithreaded execution, but it’s not the way most
processors work. The problem comes as a result of instruction reordering by the processor
at runtime. How this is done is a complex topic which we won’t get into [68]. It’s only
important to know that it’s done to improve performance and that it can also be a threat
to the correctness of parallel program execution. Reordering is limited by the dependencies
between the instructions of each thread. In this example the loads of each thread can
obviously be done in either order, but they must complete before the add, after which the
store can take place.
Returning to our attempt to synchronize threads 0 and 3, we can see that the instructions
of thread 3 could be reordered, since they each access a different memory location. This
could lead to the following order:
thread 3: flag ← store 1
thread 0: L: r00 ← load flag
thread 0: If flag ≠ 1 goto L
thread 0: r00 ← load c[3]
thread 0: x ← store r00
thread 3: c[3] ← store r31
What we have tried to do here is to intentionally use a data race with flag to synchronize
two threads, which is called a synchronization race. However our solution has been undone
by the reordering of instructions. Note that this ordering violates sequential consistency,
because the instructions of thread 3 don’t take effect in program order.
Sequential consistency is one possible memory model. Other models involve some form
of relaxed consistency where the requirement for instructions to execute in program order
is relaxed, in order to obtain increased performance.
It’s possible to reason about program execution for relaxed consistency models, but it’s
much easier to use sequential consistency. Fortunately, we can keep reasoning with the se-
quential consistency model, even if the underlying memory model uses relaxed consistency,
as long as we eliminate data races. If we need to synchronize the execution of threads, we
must use specialized language constructs, such as locks or barriers, rather than rolling our
own synchronization as in this example. This sequential consistency for
data race free execution applies to higher level languages such as C++ and Java as well [68].
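For example, in C++ the flag-based synchronization that failed above becomes correct once flag is a std::atomic, because the language then rules out the problematic reorderings. The sketch below uses release/acquire ordering, one standard way to express this (the spin loop only mirrors the earlier example; locks or barriers are the usual tools):

#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    double c3 = 0.0;               // stands in for c[3]
    std::atomic<int> flag{0};

    std::thread t3([&] {
        c3 = 42.0;                                   // c[3] <- store r31
        flag.store(1, std::memory_order_release);    // ordered after the store to c3
    });
    std::thread t0([&] {
        while (flag.load(std::memory_order_acquire) != 1) {}  // spin until the flag is set
        std::printf("x = %g\n", c3);                 // guaranteed to print 42
    });

    t3.join();
    t0.join();
}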
Figure 2.7: Cache coherence problem. Core C0 loads 4 from memory location x and C1 writes 5 to x.
Cache Coherence
The discussion so far has ignored the effect of cache memory. On-chip caches offer perfor-
mance benefits by enabling much faster access to a relatively small amount of memory. The
operation of most caches is invisible to programmers. A block of data is fetched from main
memory into cache memory when a load or store occurs to a data element not already in
the cache. Programmers concerned about performance know to maximize the number of
memory accesses to cache blocks, for example by favoring traversal of consecutive elements
of arrays. Otherwise, cache memory can safely be ignored for sequential programs.
In a single core processor the fact that there can be two copies of the data corresponding
to the same memory address is not a problem, as the program can never access an out of
date value in main memory. The situation is different in a multicore processor. As Figure 2.7
illustrates, one core has loaded the value corresponding to variable x in its own cache and
the other core has updated the value of x in its cache. The first core could then read an
outdated value of x. In most processors this problem is solved by using a hardware protocol
so that the caches are coherent. A good definition of cache coherence states that at any given
time either any number of cores can read or only one core can write to a cache block, and in
addition that once a write occurs it is immediately visible to a subsequent read [68]. In other
words cache coherence makes the cache almost disappear from the multicore programmer’s
view, since the behavior described in the definition is what is expected of concurrent accesses
to the same location in main memory. Since loads and stores are atomic it is not possible for
multiple writes (or reads and writes) to occur simultaneously to the same memory location.
Why does the cache only almost disappear? This is because the above definition of
cache coherence refers to a cache block, which contains multiple data elements. If cores are
accessing distinct data that are in the same cache block in their local cache they cannot
do concurrent writes (or writes and reads). The programmer thinks the threads are mak-
ing concurrent accesses to these data elements, but in fact they cannot if the block is in
coherent caches. This is called false sharing, which we’ll explore when we discuss barriers
to performance in Chapter 5.
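A sketch of the effect (my own illustration, assuming a 64 byte cache block): two threads update two distinct counters, first placed in the same cache block and then padded so that each counter occupies a block of its own. Both versions compute the same result, but the first is typically much slower because the block ping-pongs between the cores:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct PaddedCounter { alignas(64) std::atomic<long> value{0}; };  // one counter per 64-byte block

int main() {
    std::atomic<long> same_block[2] = {0, 0};   // adjacent: almost certainly share a cache block
    PaddedCounter own_block[2];                 // each counter in its own cache block

    auto time_two_threads = [](auto&& increment) {
        auto start = std::chrono::steady_clock::now();
        std::thread a(increment, 0), b(increment, 1);
        a.join();
        b.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    };

    const long iters = 20000000;
    double t_false = time_two_threads([&](int i) {
        for (long k = 0; k < iters; ++k) same_block[i].fetch_add(1, std::memory_order_relaxed);
    });
    double t_padded = time_two_threads([&](int i) {
        for (long k = 0; k < iters; ++k) own_block[i].value.fetch_add(1, std::memory_order_relaxed);
    });

    // Each thread writes only its own counter in both runs; only the cache block placement differs.
    std::printf("same block: %.2f s   padded: %.2f s\n", t_false, t_padded);
}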
2.1.5 Summary
We’ve seen the three machine models that are relevant to parallel programmers. Each one
can incorporate the previous one, as SIMD execution is enabled in most processors, and
distributed multicomputers consist of a number of shared memory multiprocessors. We’ll
see in Chapter 4 that there are different implementation strategies for each model.
Parallelism is expressed differently in each model. SIMD execution is usually specified
Example 2.3
1: a ← 3
2: b ← 5 + a
3: a ← 42
4: c ← 2 ∗ a + 1
Only one of statements 2 and 4 is executed. The task graph model does not model control
dependencies and is only concerned with real data dependencies.
Definition 2.1 (Task Graph Model). A parallel algorithm is represented by a directed
acyclic graph, where each vertex represents execution of a task, and an edge between two
vertices represents a real data dependence between the corresponding tasks. A directed edge
u → v indicates that task u must execute before task v, and implies communication between
the tasks. Tasks may be fine-grained, such as a single statement, or may be coarse-grained,
containing multiple statements and control structures. Any control dependencies are only
encapsulated in the tasks. The vertices and edges may be labeled with nonnegative weights,
indicating relative computation and communication costs, respectively.
We’ll see in the following chapters that the granularity of tasks is an important consider-
ation in parallel algorithm design. The rest of this chapter explores some simple examples of
task graphs generated from sequences of statements, to get a feeling for how this execution
model represents dependencies between tasks.
2.2.2 Examples
Work-intensive applications usually spend most of their time in loops. Consider the loop
for addition of two arrays in Example 2.5.
Example 2.5
for i ← 0 to n − 1 do
c[i] ← a[i] + b[i]
end
The statement in each iteration can be represented by a vertex in the task graph, but no
edges are required since the statements are independent (Figure 2.9a). We are not usually
this lucky, as often there are dependencies between iterations, as in Example 2.6.
Figure 2.9: Task graphs for four iterations of Examples (a) 2.5, (b) 2.6 and (c) 2.7.
Example 2.6
for i ← 1 to n − 1 do
a[i] ← a[i − 1] + x ∗ i
end
Each iteration depends on the value computed in the previous iteration, as shown in Fig-
ure 2.9b for the case with four iterations.
Task graphs for iterative programs can get very large. The number of iterations may not
even be known in advance. Consider Example 2.7:
Example 2.7
for i ← 1 to n − 1 do
1: a[i] ← b[i] + x ∗ i
2: c[i] ← a[i − 1] ∗ b[i]
end
Figure 2.9c shows the task graph for the first four iterations of Example 2.7. Each task has
an extra label indicating to which iteration it belongs. The dependence between tasks 1 and
2 is from each iteration to the next.
There are two alternative ways to model iterative computation. One is to only show
dependencies within each iteration, which would produce two unconnected tasks for Exam-
ple 2.7. Another alternative is to use a compact representation called a flow graph, where
tasks can be executed multiple times, which we won’t discuss here.
Reduction
One particular case is interesting, as it represents a common pattern, called a reduction.
We’ll bring back the word count example from Chapter 1, but to simplify we are only
interested in occurrences of a given word:
Figure 2.10: Task graph for (a) sequential and (b) parallel reduction.
count ← 0
foreach document in collection do
count ← count + countOccurrences(word, document)
end
Let’s see if we can build a task graph with independent tasks, which will allow parallel
execution. To keep things simple, we’ll take the case where there are four documents, and
unroll the loop:
0: count ← 0
1: count ← count + countOccurrences(word, document1)
2: count ← count + countOccurrences(word, document2)
3: count ← count + countOccurrences(word, document3)
4: count ← count + countOccurrences(word, document4)
These statements aren’t independent, since they are all updating the count variable, as
shown by the task graph in Figure 2.10a. This contradicts our intuition, which tells us that
we can compute partial sums independently followed by a combination of the result, as in:
0: count1 ← 0
1: count2 ← 0
2: count1 ← count1 + countOccurrences(word, document1)
3: count2 ← count2 + countOccurrences(word, document2)
4: count1 ← count1 + countOccurrences(word, document3)
5: count2 ← count2 + countOccurrences(word, document4)
6: count ← count1 + count2
As the task graph in Figure 2.10b illustrates, statements 0, 2 and 4 can be computed at the
same time as statements 1, 3, and 5. The final statement needs to wait until tasks 4 and 5
are complete. We’ll look at reduction in more detail in Chapter 3.
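This partial-sum idea maps directly onto threads. In the C++ sketch below, countOccurrences is a stand-in I wrote for the function named in the pseudocode, and the two-thread split mirrors statements 0-6 above:

#include <cstdio>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Stand-in for the pseudocode's countOccurrences: counts whole-word matches.
static int countOccurrences(const std::string& word, const std::string& document) {
    std::istringstream in(document);
    std::string w;
    int n = 0;
    while (in >> w)
        if (w == word) ++n;
    return n;
}

int main() {
    std::vector<std::string> documents = {
        "the quick brown fox", "the lazy dog", "the brown dog", "the tabby cat"};
    const std::string word = "the";

    int count1 = 0, count2 = 0;   // two independent partial counts
    std::thread t1([&] { for (std::size_t d = 0; d < documents.size(); d += 2)
                             count1 += countOccurrences(word, documents[d]); });
    std::thread t2([&] { for (std::size_t d = 1; d < documents.size(); d += 2)
                             count2 += countOccurrences(word, documents[d]); });
    t1.join();
    t2.join();

    int count = count1 + count2;   // combine the partial counts (statement 6)
    std::printf("%d\n", count);    // prints 4
}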
Transitive Dependencies
We’ve left out some dependencies between the tasks for sequential reduction in Figure 2.10a.
Each task writes to count and all following tasks read from count. In Figure 2.11 we show
all the real dependencies. This graph illustrates transitive dependence. For example, task 1
depends on task 0, task 2 depends on task 1, and the transitive dependence where task 2
depends on task 0 is also included. The process of removing the transitive dependencies to
produce Figure 2.10a is called transitive reduction.
We don’t always want to remove transitive dependencies, as in the following example:
Example 2.8
1: y ← foo()
2: ymin ← min(y)
3: ymax ← max(y)
4: for i ← 0 to n − 1 do
ynorm[i] ← (y[i] − ymin)/(ymax − ymin)
end
Statement 1 generates an array y, statements 2 and 3 find the minimum and maximum of
y and the loop uses the results of the three previous statements to produce array ynorm
with values normalized between 0 and 1. The task graph is shown in Figure 2.12, with
the edge labels indicating the relative size of the data being communicated between tasks.
Here the transitive dependence between tasks 1 and 4 is useful as it not only indicates
the dependence of task 4 on task 1 (through y) but also that this dependence involves a
communication volume that is n times larger than its other dependencies.
Figure 2.12: Task graph for Example 2.8. The edges from task 1 to tasks 2, 3, and 4 are labeled n, and the edges from tasks 2 and 3 to task 4 are labeled 1.