Elements of
Parallel Computing
Chapman & Hall/CRC
Computational Science Series
SERIES EDITOR

Horst Simon
Deputy Director
Lawrence Berkeley National Laboratory
Berkeley, California, U.S.A.

PUBLISHED TITLES

COMBINATORIAL SCIENTIFIC COMPUTING


Edited by Uwe Naumann and Olaf Schenk
CONTEMPORARY HIGH PERFORMANCE COMPUTING: FROM PETASCALE
TOWARD EXASCALE
Edited by Jeffrey S. Vetter
CONTEMPORARY HIGH PERFORMANCE COMPUTING: FROM PETASCALE
TOWARD EXASCALE, VOLUME TWO
Edited by Jeffrey S. Vetter
DATA-INTENSIVE SCIENCE
Edited by Terence Critchlow and Kerstin Kleese van Dam
ELEMENTS OF PARALLEL COMPUTING
Eric Aubanel
THE END OF ERROR: UNUM COMPUTING
John L. Gustafson
FROM ACTION SYSTEMS TO DISTRIBUTED SYSTEMS: THE REFINEMENT APPROACH
Edited by Luigia Petre and Emil Sekerinski
FUNDAMENTALS OF MULTICORE SOFTWARE DEVELOPMENT
Edited by Victor Pankratius, Ali-Reza Adl-Tabatabai, and Walter Tichy
FUNDAMENTALS OF PARALLEL MULTICORE ARCHITECTURE
Yan Solihin
THE GREEN COMPUTING BOOK: TACKLING ENERGY EFFICIENCY AT LARGE SCALE
Edited by Wu-chun Feng
GRID COMPUTING: TECHNIQUES AND APPLICATIONS
Barry Wilkinson
HIGH PERFORMANCE COMPUTING: PROGRAMMING AND APPLICATIONS
John Levesque with Gene Wagenbreth
HIGH PERFORMANCE PARALLEL I/O
Prabhat and Quincey Koziol
PUBLISHED TITLES CONTINUED

HIGH PERFORMANCE VISUALIZATION:


ENABLING EXTREME-SCALE SCIENTIFIC INSIGHT
Edited by E. Wes Bethel, Hank Childs, and Charles Hansen
INDUSTRIAL APPLICATIONS OF HIGH-PERFORMANCE COMPUTING:
BEST GLOBAL PRACTICES
Edited by Anwar Osseyran and Merle Giles
INTRODUCTION TO COMPUTATIONAL MODELING USING C AND
OPEN-SOURCE TOOLS
José M Garrido
INTRODUCTION TO CONCURRENCY IN PROGRAMMING LANGUAGES
Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
INTRODUCTION TO ELEMENTARY COMPUTATIONAL MODELING: ESSENTIAL
CONCEPTS, PRINCIPLES, AND PROBLEM SOLVING
José M. Garrido
INTRODUCTION TO HIGH PERFORMANCE COMPUTING FOR SCIENTISTS
AND ENGINEERS
Georg Hager and Gerhard Wellein
INTRODUCTION TO REVERSIBLE COMPUTING
Kalyan S. Perumalla
INTRODUCTION TO SCHEDULING
Yves Robert and Frédéric Vivien
INTRODUCTION TO THE SIMULATION OF DYNAMICS USING SIMULINK®
Michael A. Gray
PEER-TO-PEER COMPUTING: APPLICATIONS, ARCHITECTURE, PROTOCOLS,
AND CHALLENGES
Yu-Kwong Ricky Kwok
PERFORMANCE TUNING OF SCIENTIFIC APPLICATIONS
Edited by David Bailey, Robert Lucas, and Samuel Williams
PETASCALE COMPUTING: ALGORITHMS AND APPLICATIONS
Edited by David A. Bader
PROCESS ALGEBRA FOR PARALLEL AND DISTRIBUTED PROCESSING
Edited by Michael Alexander and William Gardner
SCIENTIFIC DATA MANAGEMENT: CHALLENGES, TECHNOLOGY, AND DEPLOYMENT
Edited by Arie Shoshani and Doron Rotem
SOFTWARE ENGINEERING FOR SCIENCE
Edited by Jeffrey C. Carver, Neil P. Chue Hong, and George K. Thiruvathukal
Elements of
Parallel Computing

Eric Aubanel

Boca Raton London New York

CRC Press is an imprint of the


Taylor & Francis Group, an informa business
A CHAPMAN & HALL BOOK
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20161028

International Standard Book Number-13: 978-1-4987-2789-1 (Paperback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To my grandfather, Dr. E.P. Scarlett: physician, educator and
scholar.
Contents

Chapter 1  Overview of Parallel Computing 1

1.1 INTRODUCTION 1
1.2 TERMINOLOGY 2
1.3 EVOLUTION OF PARALLEL COMPUTERS 3
1.4 EXAMPLE: WORD COUNT 4
1.5 PARALLEL PROGRAMMING MODELS 5
1.5.1 Implicit Models 5
1.5.2 Semi-Implicit Models 5
1.5.3 Explicit Models 6
1.5.4 Thinking in Parallel 6
1.6 PARALLEL DESIGN PATTERNS 7
1.6.1 Structural Patterns 7
1.6.2 Computational Patterns 8
1.6.3 Patterns in the Lower Layers 9
1.7 WORD COUNT IN PARALLEL 10
1.8 OUTLINE OF THE BOOK 11

Chapter 2  Parallel Machine and Execution Models 13

2.1 PARALLEL MACHINE MODELS 13


2.1.1 SIMD 14
2.1.2 Shared Memory and Distributed Memory Computers 16
2.1.3 Distributed Memory Execution 18
2.1.4 Shared Memory Execution 19
2.1.5 Summary 22
2.2 PARALLEL EXECUTION MODEL 23
2.2.1 Task Graph Model 23
2.2.2 Examples 25
2.2.3 Summary 29
2.3 FURTHER READING 29
2.4 EXERCISES 30


Chapter 3  Parallel Algorithmic Structures 33

3.1 HISTOGRAM EXAMPLE 33


3.1.1 Guidelines for Parallel Algorithm Design 34
3.2 EMBARRASSINGLY PARALLEL 35
3.3 REDUCTION 36
3.4 SCAN 39
3.5 DIVIDE-AND-CONQUER 42
3.6 PIPELINE 45
3.7 DATA DECOMPOSITION 50
3.8 SUMMARY 56
3.9 FURTHER READING 56
3.10 EXERCISES 56

Chapter 4  Parallel Program Structures 59

4.1 LOAD BALANCE 59


4.2 SIMD: STRICTLY DATA PARALLEL 60
4.3 FORK-JOIN 65
4.4 PARALLEL LOOPS AND SYNCHRONIZATION 71
4.4.1 Shared and Private Variables 74
4.4.2 Synchronization 74
4.4.3 Thread Safety 78
4.5 TASKS WITH DEPENDENCIES 79
4.6 SINGLE PROGRAM MULTIPLE DATA 83
4.7 MASTER-WORKER 90
4.8 DISTRIBUTED MEMORY PROGRAMMING 92
4.8.1 Distributed Arrays 92
4.8.2 Message Passing 94
4.8.3 Map-Reduce 102
4.9 CONCLUSION 105
4.10 FURTHER READING 105
4.11 EXERCISES 105

Chapter 5  Performance Analysis and Optimization 109

5.1 WORK-DEPTH ANALYSIS 109


5.2 PERFORMANCE ANALYSIS 111
5.2.1 Performance Metrics 111
5.2.2 Communication Analysis 116
5.3 BARRIERS TO PERFORMANCE 120

5.4 MEASURING AND REPORTING PERFORMANCE 127


5.5 FURTHER READING 128
5.6 EXERCISES 129

Chapter 6  Single Source Shortest Path 131

6.1 SEQUENTIAL ALGORITHMS 132


6.1.1 Data Structures 133
6.1.2 Bellman-Ford Algorithm 134
6.1.3 Dijkstra’s Algorithm 134
6.1.4 Delta-Stepping Algorithm 135
6.2 PARALLEL DESIGN EXPLORATION 138
6.3 PARALLEL ALGORITHMS 141
6.3.1 Shared Memory Delta-Stepping 141
6.3.2 SIMD Bellman-Ford for GPU 144
6.3.3 Message Passing Algorithm 146
6.4 CONCLUSION 149
6.5 FURTHER READING 150
6.6 EXERCISES 150

Chapter 7  The Eikonal Equation 153

7.1 NUMERICAL SOLUTION 155


7.1.1 Fast Sweeping Method 156
7.1.2 Fast Marching Method 160
7.2 PARALLEL DESIGN EXPLORATION 163
7.2.1 Parallel Fast Sweeping Methods 164
7.2.2 Parallel Fast Marching Methods 167
7.3 PARALLEL ALGORITHMS 177
7.3.1 Parallel Fast Sweeping Methods 178
7.3.2 Parallel Fast Marching Methods 178
7.4 FURTHER READING 182
7.5 EXERCISES 182

Chapter 8  Planar Convex Hull 185

8.1 SEQUENTIAL ALGORITHMS 185


8.2 PARALLEL DESIGN EXPLORATION 190
8.2.1 Parallel Hull Merge 191
8.3 PARALLEL ALGORITHMS 196
8.3.1 SIMD QuickHull 197
8.3.2 Coarse-Grained Shared Memory MergeHull 203

8.3.3 Distributed Memory MergeHull 207


8.4 CONCLUSION 212
8.5 FURTHER READING 212
8.6 EXERCISES 213

Bibliography 215

Index 221
Preface

Parallel computing is hard, it’s creative, and it’s an essential part of high performance
scientific computing. I got my start in this field parallelizing quantum mechanical wave
packet evolution for the IBM SP. Parallel computing has now joined the mainstream, thanks
to multicore and manycore processors, and to the cloud and its Big Data applications. This
ubiquity has resulted in a move to include parallel computing concepts in undergraduate
computer science curricula. Clearly, a CS graduate must be familiar with the basic concepts
and pitfalls of parallel computing, even if he/she only ever uses high level frameworks.
After all, we expect graduates to have some knowledge of computer architecture, even if
they never write code in an assembler.
Exposing undergraduates to parallel computing concepts doesn’t mean dismantling the
teaching of this subject in dedicated courses, as it remains an important discipline in com-
puter science. I’ve found it a very challenging subject to teach effectively, for several reasons.
One reason is that it requires students to have a strong background in sequential program-
ming and algorithm design. Students with a shaky mental model of programming quickly
get bogged down with parallel programming. Parallel computing courses attract many stu-
dents, but many of them struggle with the challenges of parallel programming, debugging,
and getting even modest speedup. Another challenge is that the discipline has been driven
throughout its history by advances in hardware, and these advances keep coming at an
impressive pace. I’ve regularly had to redesign my courses to keep up. Unfortunately, I’ve
had little help from textbooks, as they have gone out of print or out of date.
This book presents the fundamental concepts of parallel computing not from the point
of view of hardware, but from a more abstract view of the algorithmic and implementation
patterns. While the hardware keeps changing, the same basic conceptual building blocks
are reused. For instance, SIMD computation has survived through many incarnations from
processor arrays to pipelined vector processors to SIMT execution on GPUs. Books on the
theory of parallel computation approach the subject from a similar level of abstraction, but
practical parallel programming books tend to be tied to particular programming models
and hardware. I’ve been inspired by the work on parallel programming patterns, but I
haven’t adopted the formal design patterns approach, as I feel it is more suited to expert
programmers than to novices.
My aim is to facilitate the teaching of parallel programming by surveying some key
algorithmic structures and programming models, together with an abstract representation
of the underlying hardware. The presentation is meant to be friendly and informal. The
motivation, goals, and means of my approach are the subject of Chapter 1. The content
of the book is language neutral, using pseudocode that represents common programming
language models.
The first five chapters present the core concepts. After the introduction in Chapter 1,
Chapter 2 presents SIMD, shared memory, and distributed memory machine models, along
with a brief discussion of what their execution models look like. Chapter 2 concludes with a
presentation of the task graph execution model that will be used in the following chapters.
Chapter 3 discusses decomposition as a fundamental activity in parallel algorithmic design,
starting with a naive example, and continuing with a discussion of some key algorithmic


structures. Chapter 4 covers some important programming models in depth, and shows
contrasting implementations of the task graphs presented in the previous chapter. Finally,
Chapter 5 presents the important concepts of performance analysis, including work-depth
analysis of task graphs, communication analysis of distributed memory algorithms, some
key performance metrics, and a discussion of barriers to obtaining good performance. A
brief discussion of how to measure performance and report performance is included because
I have observed that this is often done poorly by students and even in the literature.
This book is meant for an introductory parallel computing course at the advanced un-
dergraduate or beginning graduate level. A basic background in computer architecture and
algorithm design and implementation is assumed. Clearly, hands-on experience with parallel
programming is essential, and the instructor will have selected one or more languages and
computing platforms. There are many good online and print resources for learning particu-
lar parallel programming language models, which could supplement the concepts presented
in this book. While I think it’s valuable for students to be familiar with all the parallel pro-
gram structures in Chapter 4, in a given course a few could be studied in more detail along
with practical programming experience. The instructor who wants to get students program-
ming as soon as possible may want to supplement the algorithmic structures of Chapter 3
with simple programming examples. I have postponed performance analysis until Chapter 5,
but sections of this chapter could be covered earlier. For instance, the work-depth analysis
could be presented together with Chapter 3, and performance analysis and metrics could
be presented in appropriate places in Chapter 4. The advice of Section 5.4 on reporting
performance should be presented before students have to write up their experiments.
The second part of the book presents three case studies that reinforce the concepts of
the earlier chapters. One feature of these chapters is to contrast different solutions to the
same problem. I have tried for the most part to select problems that aren’t discussed all
that often in parallel computing textbooks. They include the Single Source Shortest Path
Problem in Chapter 6, the Eikonal equation in Chapter 7, which is a partial differential
equation with relevance to many fields, including graphics and AI, and finally in Chapter 8
a classical computational geometry problem, computation of the two-dimensional convex
hull. These chapters could be supplemented with material from other sources on other well-
known problems, such as dense matrix operations and the Fast Fourier Transform. I’ve also
found it valuable, particularly in a graduate course, to have students research and present
case studies from the literature.
Acknowledgements

I would first like to acknowledge an important mentor at the University of New Brunswick,
professor Virendra Bhavsar, who helped me transition from a high performance computing
practitioner to a researcher and teacher. I’ve used the generalized fractals he worked on,
instead of the more common Mandelbrot set, to illustrate the need for load balancing. This
book wouldn’t have been possible without my experience teaching CS4745/6025, and the
contributions of the students. In particular, my attention was drawn to the subset sum
problem by Steven Stewart’s course project and Master’s thesis.
The formal panel sessions and informal discussions I witnessed from 2002 to 2014 at the
International Parallel and Distributed Processing Symposium were very stimulating and
influential in developing my views. I finally made the decision to write this book after reading
the July 2014 JPDC special issue on Perspectives on Parallel and Distributed Processing.
In this issue, Robert Schreiber’s A few bad ideas on the way to the triumph of parallel
computing [57] echoed many of my views, and inspired me to set them down in print. Finally,
I would like to thank Siew Yin Chan, my former PhD student, for valuable discussions and
revision of early material.

CHAPTER 1

Overview of Parallel
Computing

1.1 INTRODUCTION
In the first 60 years of the electronic computer, beginning in 1940, computing performance
per dollar increased on average by 55% per year [52]. This staggering 100 billion-fold increase
hit a wall in the middle of the first decade of this century. The so-called power wall arose
when processors couldn’t work any faster because they couldn’t dissipate the heat they pro-
duced. Performance has kept increasing since then, but only by placing multiple processors
on the same chip, and limiting clock rates to a few GHz. These multicore processors are
found in devices ranging from smartphones to servers.
Before the multicore revolution, programmers could rely on a free performance increase
with each processor generation. However, the disparity between theoretical and achievable
performance kept increasing, because processing speed grew much faster than memory band-
width. Attaining peak performance required careful attention to memory access patterns in
order to maximize re-use of data in cache memory. The multicore revolution made things
much worse for programmers. Now increasing the performance of an application required
parallel execution on multiple cores.
Enabling parallel execution on a few cores isn’t too challenging, with support available
from language extensions, compilers and runtime systems. The number of cores keeps in-
creasing, and manycore processors, such as Graphics Processing Units (GPUs), can have
thousands of cores. This makes achieving good performance more challenging, and parallel
programming is required to exploit the potential of these parallel processors.

Why Learn Parallel Computing?


Compilers already exploit instruction level parallelism to speed up sequential code, so
couldn’t they also automatically generate multithreaded code to take advantage of multiple
cores? Why learn distributed parallel programming, when frameworks such as MapReduce
can meet the application programmer’s needs? Unfortunately it’s not so easy. Compilers
can only try to optimize the code that’s given to them, but they can’t rewrite the underly-
ing algorithms to be parallel. Frameworks, on the other hand, do offer significant benefits.
However, they tend to be restricted to particular domains and they don’t always produce
the desired performance.
High level tools are very important, but we will always need to go deeper and apply par-
allel programming expertise to understand the performance of frameworks in order to make


better use of them. Specialized parallel code is often essential for applications requiring high
performance. The challenge posed by the rapidly growing number of cores has meant that
more programmers than ever need to understand something about parallel programming.
Fortunately parallel processing is natural for humans, as our brains have been described as
parallel processors, even though we have been taught to program in a sequential manner.

Why is a New Approach Needed?


Rapid advances in hardware make parallel computing exciting for the devotee. Unfortu-
nately, advances in the field tend to be driven by the hardware. This has resulted in so-
lutions tied to particular architectures that quickly go out of date, as do textbooks. On
the software front it’s easy to become lost amid the large number of parallel programming
languages and environments.
Good parallel algorithm design requires postponing consideration of the hardware, but
at the same time good algorithms and implementations must take the hardware into ac-
count. The way out of this parallel software/hardware thicket is to find a suitable level of
abstraction and to recognize that a limited number of solutions keep being re-used.
This book will guide you toward being able to think in parallel, using a task graph model
for parallel computation. It makes use of recent work on parallel programming patterns to
identify commonly used algorithmic and implementation strategies. It takes a language-
neutral approach using pseudocode that reflects commonly used language models. The
pseudocode can quite easily be adapted for implementation in relevant languages, and there
are many good resources online and in print for parallel languages.

1.2 TERMINOLOGY
It’s important to be clear about terminology, since parallel computing, distributed computing,
and concurrency are all overlapping concepts that have been defined in different ways.
Parallel computers can also be placed in several categories.
Definition 1.1 (Parallel Computing). Parallel Computing means solving a computing prob-
lem in less time by breaking it down into parts and computing those parts simultaneously.
Parallel computers provide more computing resources and memory in order to tackle
problems that cannot be solved in a reasonable time by a single processor core. They differ
from sequential computers in that there are multiple processing elements that can execute
instructions in parallel, as directed by the parallel program. We can think of a sequential

                            Instruction streams
                            single        multiple
    Data streams  single    SISD          MISD
                  multiple  SIMD          MIMD

Figure 1.1: Flynn's Taxonomy.



distributed
concurrency
computing
parallel
computing

Figure 1.2: Three overlapping disciplines.

computer, as described by the von Neumann architecture, as executing a stream of instructions
that accesses a stream of data. Parallel computers work with multiple streams of data
and/or instructions, which is the basis of Flynn’s taxonomy, given in Figure 1.1. A sequential
computer is in the Single Instruction Single Data (SISD) category and parallel computers
are in the other categories. Single Instruction Multiple Data (SIMD) computers have a
single stream of instructions that operate on multiple streams of data in parallel. Multiple
Instruction Multiple Data (MIMD) is the most general category, where each instruction
stream can operate on different data. There aren’t currently any Multiple Instruction Single
Data (MISD) computers in production. While most parallel computers are in the MIMD
category, most also incorporate SIMD processing elements.
MIMD computers are further classified into shared memory and distributed memory
computers. The processing elements of a shared memory computer all have access to a single
memory address space. Distributed memory computers have memory that is distributed
among compute nodes. If a processing element on one node wants to access data on another
node, it can’t directly refer to it by its address, but must obtain it from the other node by
exchanging messages.
Distributed memory parallel computing can be done on a cluster of computers connected
by a network. The larger field of distributed computing has some of the same concerns, such
as speed, but is not mainly concerned with the solution of a single problem. Distributed
systems are typically loosely coupled, such as peer-to-peer systems, and the main concerns
include reliability as well as performance.
Parallel computing can also be done on a shared memory multiprocessor. Simultaneous
access to the same data can lead to incorrect results. Techniques from the field of con-
currency, such as mutual exclusion, can be used to ensure correctness. The discipline of
concurrency is about much more than parallel computing. Shared access to data is also
important for databases and operating systems. For example, concurrency addresses the
problem of ensuring that two simultaneous operations on a bank account don’t conflict.
We can summarize the three overlapping disciplines with the diagram in Figure 1.2.
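
To make the hazard concrete, the following C++ sketch (a generic illustration, not an example
from later chapters; the names are arbitrary) has four threads repeatedly incrementing a shared
balance. The mutex ensures that only one update happens at a time; without it, simultaneous
updates could interleave and increments would be lost:

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    long balance = 0;   // shared data, like a bank account balance
    std::mutex m;       // mutual exclusion protects the shared update

    auto deposit = [&](int times) {
        for (int i = 0; i < times; i++) {
            std::lock_guard<std::mutex> lock(m);  // one thread updates at a time
            balance += 1;
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; t++)
        threads.emplace_back(deposit, 100000);
    for (auto& t : threads)
        t.join();

    // With the lock the result is always 400000; without it, updates can be lost.
    std::cout << balance << "\n";
    return 0;
}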

1.3 EVOLUTION OF PARALLEL COMPUTERS


Parallel computers used to be solely large expensive machines with specialized hardware.
They were housed at educational and governmental institutions and were mainly dedicated
to scientific computing. An important development came in the 1990s with the so-called
Beowulf revolution, where the best performance to cost ratio could be obtained with clusters
of commodity PCs connected by network switches rather than with expensive purpose-
built supercomputers. While this made parallel computing more accessible, it still remained

the domain of enthusiasts and those whose needs required high performance computing
resources.
Not only is expert knowledge needed to write parallel programs for a cluster, the signif-
icant computing needs of large scale applications require the use of shared supercomputing
facilities. Users had to master the complexities of coordinating the execution of applications
and file management, sometimes across different administrative and geographic domains.
This led to the idea of Grid computing in the late 1990s, with the dream of computing as
a utility, which would be as simple to use as the power grid. While the dream hasn’t been
realized, Grid computing has provided significant benefit to collaborative scientific data
analysis and simulation. This type of distributed computing wasn’t adopted by the wider
community until the development of cloud computing in the early 2000s.
Cloud computing has grown rapidly, thanks to improved network access and virtualiza-
tion techniques. It allows users to rent computing resources on demand. While using the
cloud doesn’t require parallel programming, it does remove financial barriers to the use of
compute clusters, as these can be assembled and configured on demand. The introduction
of frameworks based on the MapReduce programming model eliminated the difficulty of
parallel programming for a large class of data processing applications, particularly those
associated with the mining of large volumes of data.
With the emergence of multicore and manycore processors all computers are parallel
computers. A desktop computer with an attached manycore co-processor features thousands
of cores and offers performance in the trillions of operations per second. Put a number of
these computers on a network and even more performance is available, with the main
limiting factor being power consumption and heat dissipation. Parallel computing is now
relevant to all application areas. Scientific computing isn’t the only player any more in
large scale high performance computing. The need to make sense of the vast quantity of
data that cheap computing and networks have produced, so-called Big Data, has created
another important use for parallel computing.

1.4 EXAMPLE: WORD COUNT


Let’s consider a simple problem that can be solved using parallel computing. We wish to list
all the words in a collection of documents, together with their frequency. We can use a list
containing key-value pairs to store words and their frequency. The sequential Algorithm 1.1
is straightforward.

Algorithm 1.1: Sequential word count


Input: collection of text documents
Output: list of ⟨word, count⟩ pairs
foreach document in collection do
    foreach word in document do
        if first occurrence of word then
            add ⟨word, 1⟩ to ordered list
        else
            increment count in ⟨word, count⟩
        end
    end
end

If the input consists of two documents, one containing “The quick brown fox jumps
over a lazy dog” and the other containing “The brown dog chases the tabby cat,” then the
output would be the list: ⟨a, 1⟩, ⟨brown, 2⟩, . . . , ⟨tabby, 1⟩, ⟨the, 3⟩.
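
For concreteness, here is one possible sequential rendering of Algorithm 1.1 in C++ (the two
in-memory strings below stand in for a real document collection, and an ordered map plays the
role of the ordered list of ⟨word, count⟩ pairs):

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

int main() {
    // Hypothetical documents standing in for the input collection.
    std::vector<std::string> collection = {
        "the quick brown fox jumps over a lazy dog",
        "the brown dog chases the tabby cat"};

    std::map<std::string, int> counts;  // ordered list of <word, count> pairs
    for (const auto& document : collection) {
        std::istringstream words(document);
        std::string word;
        while (words >> word)
            ++counts[word];  // inserts <word, 1> on first occurrence, else increments
    }

    for (const auto& [word, count] : counts)
        std::cout << word << " " << count << "\n";
    return 0;
}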
Think about how you would break this algorithm into parts that could be computed in
parallel, before we discuss below how this could be done.

1.5 PARALLEL PROGRAMMING MODELS


Many programming models have been proposed over the more than 40 year history of par-
allel computing. Parallel computing involves identifying those tasks that can be performed
concurrently and coordinating their execution. Fortunately, abstraction is our friend, as in
other areas of computer science. It can mask some of the complexities involved in imple-
menting algorithms that make use of parallel execution. Programming models that enable
parallel execution operate at different levels of abstraction. While only a few will be men-
tioned here, parallel programming models can be divided into three categories, where the
parallelism is implicit, partly explicit, and completely explicit [66].

1.5.1 Implicit Models


There are some languages where parallelism is implicit. This is the case for functional pro-
gramming languages, such as Haskell, where the runtime works with the program as a graph
and can identify those functions that can be executed concurrently. There are also algorith-
mic skeletons, which may be separate languages or frameworks built on existing languages.
Applications are composed of parallel skeletons, such as in the case of MapReduce. The
programmer is concerned with the functionality of the components of the skeletons, not
with how they will be executed in parallel.
The word count problem solved in Algorithm 1.1 is a canonical MapReduce application.
The programmer writes a map function that emits ⟨word, count⟩ pairs and a reduce function
to sum the values of all pairs with the same word. We’ll examine MapReduce in more detail
below and in Chapter 4.
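
As a sketch of this division of labor, the programmer might supply functions like the ones
below, while the framework calls the mapper on each document, groups the emitted pairs by
word, and passes each group to the reducer. The function names and types are illustrative and
do not correspond to the API of any particular MapReduce framework; a small driver stands in
for the framework itself:

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Mapper: emit a <word, 1> pair for every word in one document.
std::vector<std::pair<std::string, int>> word_count_map(const std::string& document) {
    std::vector<std::pair<std::string, int>> pairs;
    std::istringstream words(document);
    std::string word;
    while (words >> word)
        pairs.emplace_back(word, 1);
    return pairs;
}

// Reducer: sum the counts of all pairs that share the same word.
int word_count_reduce(const std::vector<int>& counts) {
    int total = 0;
    for (int c : counts)
        total += c;
    return total;
}

int main() {
    // Driver standing in for the framework: map, group by word, then reduce.
    std::vector<std::string> docs = {"the quick brown fox", "the brown dog"};
    std::map<std::string, std::vector<int>> grouped;
    for (const auto& doc : docs)
        for (const auto& [word, count] : word_count_map(doc))
            grouped[word].push_back(count);
    for (const auto& [word, counts] : grouped)
        std::cout << word << " " << word_count_reduce(counts) << "\n";
    return 0;
}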

1.5.2 Semi-Implicit Models


There are other languages where programmers identify regions of code that can be executed
concurrently, but where they do not have to determine how many resources (threads/pro-
cesses) to use and how to assign tasks to them. The Cilk language, based on C/C++, allows
the programmer to identify recursive calls that can be done in parallel and to annotate loops
whose iterations can be computed independently. OpenMP is another approach, which aug-
ments existing imperative languages with an API to enable loop and task level specification
of parallelism. Java supports parallelism in several ways. The Fork/Join framework of Java
7 provides an Executor service that supports recursive parallelism much in the same way
as Cilk. The introduction of streams in Java 8 makes it possible to identify streams that
can be decomposed into parallel sub-streams. For these languages the compiler and runtime
systems take care of the assignment of tasks to threads. Semi-implicit models are becoming
more relevant with the increasing size and complexity of parallel computers.
For example, parallel loops can be identified by the programmer, such as:
parallel for i ← 0 to n − 1 do
    c[i] ← a[i] + b[i]
end
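
In OpenMP, one concrete realization of this idea, the same loop is annotated with a compiler
directive and the runtime decides how many threads to use and how to assign iterations to
them. A minimal sketch follows (the array size is arbitrary, and the program must be compiled
with OpenMP support, for example with -fopenmp):

#include <iostream>
#include <vector>

int main() {
    const long n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // Each iteration is independent, so the loop can be marked parallel;
    // the runtime assigns iterations to threads.
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    std::cout << c[0] << "\n";  // prints 3
    return 0;
}

Compiled without OpenMP support, the directive is ignored and the loop simply runs
sequentially.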

Once loops that have independent iterations have been identified, the parallelism is very
simple to express. Ensuring correctness is another matter, as we’ll see in Chapters 2 and 4.

1.5.3 Explicit Models


Finally, there are lower level programming models where parallelism is completely explicit.
The most popular has been the Message Passing Interface (MPI), a library that enables
parallel programming for C/C++ and Fortran. Here the programmer is responsible for iden-
tifying tasks, mapping them to processors, and sending messages between processors. MPI
has been very successful due to its proven performance and portability, and the ease with
which parallel algorithms can be expressed. However, MPI programs can require significant
development time, and achieving the portability of performance across platforms can be very
challenging. OpenMP, mentioned above, can be used in a similar way, where all parallelism
and mapping to threads is expressed explicitly. It has the advantage over MPI of a shared
address space, but can suffer from poorer performance, and is restricted to platforms that
offer shared memory. So-called partitioned global address space (PGAS) languages combine
some of the advantages of MPI and OpenMP. These languages support data parallelism
by allowing the programmer to specify data structures that can be distributed across pro-
cesses, while providing the familiar view of a single address space. Examples include Unified
Parallel C and Chapel.
In the following message passing example for the sum of two arrays a and b on p proces-
sors, where the arrays are initially on processor 0, the programmer has to explicitly scatter
the operands and gather the results:
scatter(0, a, n/p, aLoc)
scatter(0, b, n/p, bLoc)
for i ← 0 to n/p − 1 do
    cLoc[i] ← aLoc[i] + bLoc[i]
end
gather(0, c, n/p, cLoc)
Here arrays a and b of length n are scattered in contiguous chunks of n/p elements to
p processors, and stored in arrays aLoc and bLoc on each processor. The resulting cLoc
arrays are gathered into the c array on processor 0. We’ll examine message passing in detail
in Chapter 4.
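
A corresponding C++ sketch using MPI's collective operations is shown below. It assumes, as
the pseudocode does, that n is divisible by the number of processes p, and it fills the
operands with arbitrary values on process 0:

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int n = 1024;                     // assumed divisible by p
    std::vector<double> a, b, c;
    if (rank == 0) {                        // operands start on process 0
        a.assign(n, 1.0);
        b.assign(n, 2.0);
        c.resize(n);
    }

    // Scatter contiguous chunks of n/p elements to all processes.
    std::vector<double> aLoc(n / p), bLoc(n / p), cLoc(n / p);
    MPI_Scatter(a.data(), n / p, MPI_DOUBLE, aLoc.data(), n / p, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Scatter(b.data(), n / p, MPI_DOUBLE, bLoc.data(), n / p, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    // Each process adds its local chunk.
    for (int i = 0; i < n / p; i++)
        cLoc[i] = aLoc[i] + bLoc[i];

    // Gather the local results into c on process 0.
    MPI_Gather(cLoc.data(), n / p, MPI_DOUBLE, c.data(), n / p, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}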

1.5.4 Thinking in Parallel


Parallel programming models where the parallelism is implicit don’t require programmers
to “think in parallel.” As long as the underlying model (e.g., functional language or skeleton
framework) is familiar, then the benefit of harnessing multiple cores can be achieved without
the additional development time required by explicit parallel programming. However, not
only can this approach limit the attainable performance, more importantly it limits the
exploration space in the design of algorithms. Learning to think in parallel exposes a broader
landscape of algorithm design. The difficulty is that it requires a shift in the mental model of
the algorithm designer or programmer. Skilled programmers can look at code and run it in
their mind, executing it on the notional machine associated with the programming language.
A notional machine explains how a programming language’s statements are executed. Being
able to view a static program as something that is dynamically executed on a notional
machine is a threshold concept. Threshold concepts are transformative and lead to new

[Diagram of the OPL hierarchy with the corresponding chapters: Structural Patterns and
Computational Patterns at the top (Chapters 6-8: Case Studies), then Algorithm Strategy
Patterns (Chapter 3: Parallel Algorithmic Structures), Implementation Strategy Patterns
(Chapter 4: Parallel Program Structures), and Parallel Execution Patterns at the bottom,
with Chapter 2 (Parallel Machine and Execution Models) and Chapter 5 (Performance Analysis
and Optimization) spanning the lower layers.]

Figure 1.3: OPL hierarchy (left) and corresponding chapters (right).

ways of thinking [69]. It could be argued that making the transition from sequential to
parallel notional machines is another threshold concept.

1.6 PARALLEL DESIGN PATTERNS


While the prospect of an enlarged algorithmic design space may be thrilling to some, it can
be daunting to many. Fortunately, we can build on the work of the parallel programming
community by reusing existing algorithmic techniques. This has been successful particularly
in object-oriented software design, where design patterns offer guidance to common software
design problems at different layers of abstraction. This idea has been extended to parallel
programming, notably with Berkeley’s Our Pattern Language (OPL) [44]. Programmers can
identify patterns in program structure at a high level of abstraction, such as pipe-and-filter,
or those that are found in particular application domains, such as in graph algorithms.
Although these high level patterns do not mention parallelism, they can naturally suggest
parallel implementation, as in the case of pipe-and-filter, which can benefit from pipelined
parallel execution. These patterns also document which lower level patterns are appropriate.
At lower levels there are patterns that are commonly used in algorithm and software de-
sign, such as divide-and-conquer, geometric decomposition, and the master-worker pattern.
There is a seemingly never ending number of algorithmic and design patterns, which is why
mastering the discipline can take a long time. The true benefit of the work of classifying
patterns is that it can provide a map of parallel computing techniques.
Figure 1.3 illustrates the OPL hierarchy and indicates how the chapters of this book
refer to different layers. The top two layers are discussed in this chapter. We will not be
considering the formal design patterns in the lower layers, but will examine in Chapters 3
and 4 the algorithmic and implementation structures that they cover.

1.6.1 Structural Patterns


Structural patterns consist of interacting components that describe the high level structure
of a software application. These include well known structures that have been studied by
the design pattern community. Examples include the model-view-controller pattern that is
used in graphical user interface frameworks, the pipe-and-filter pattern for applying a series
of filters to a stream of data, and the agent-and-repository and MapReduce patterns used

[Pipeline: stream of messages → language identification → English messages → metadata
removal → plain text → tokenization → words → ...]

Figure 1.4: Part of a text processing pipeline.

in data analysis. These patterns can be represented graphically as a group of interacting
tasks.
Consider the pipe-and-filter pattern, which is useful when a series of transformations
needs to be applied to a collection of data. The transformations, also called filters, can
be organized in a pipeline. Each filter is independent and does not produce side effects. It
reads its input, applies its transformation, then outputs the result. As in a factory assembly
line, once the stream of data fills the pipeline the filters can be executed in parallel on
different processors. If the filters take roughly the same amount of time, the execution time
can be reduced proportionately to the number of processors used. It can happen that some
filters will take much longer to execute than others, which can cause a bottleneck and slow
down execution. The slower filters can be sped up by using parallel processing techniques to
harness multiple processors. An example is shown in Figure 1.4, which shows the first few
stages of a text processing pipeline. A stream of social media messages is first filtered to
only keep messages in English, then any metadata (such as URLs) is removed in the next
filter. The third filter tokenizes the messages into words. The pipeline could continue with
other filters to perform operations such as tagging and classification.
Another pattern, MapReduce, became popular when Google introduced the framework
of the same name in 2004, but it has been used for much longer. It consists of a map phase,
where the same operation is performed on objects in a collection, followed by a reduce phase
where a summary of the results of the map phase is collected. Many applications that need
to process and summarize large quantities of data fit this pattern. Continuing with the text
processing example, we might want to produce a list of words and their frequency from a
stream of social media messages, as in the example of Section 1.4.
Both map and reduce phases of this pattern can be executed in parallel. Since the map
operations are independent, they can easily be done in parallel across multiple processors.
This type of parallel execution is sometimes called embarrassingly parallel, since there are
no dependencies between the tasks and they can be trivially executed in parallel. The reduce
phase can be executed in parallel using the reduction operation, which is a frequently used
lower level parallel execution pattern. The attraction of MapReduce, implemented in a
framework, is that the developer can build the mapper and reducer without any knowledge
of how they are executed.
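
The reduction pattern is supported directly by several programming models. For example,
OpenMP's reduction clause gives each thread a private partial sum and combines the partial
results when the loop finishes. The following sketch illustrates the general pattern with a
simple sum (it is not meant to show how a MapReduce framework implements its reduce phase):

#include <iostream>
#include <vector>

int main() {
    const long n = 1000000;
    std::vector<double> x(n, 1.0);

    double sum = 0.0;
    // Each thread accumulates a private partial sum; the partial sums are
    // combined into the shared variable at the end of the loop.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; i++)
        sum += x[i];

    std::cout << sum << "\n";  // prints 1e+06
    return 0;
}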

1.6.2 Computational Patterns


Whereas in cases like pipe-and-filter and MapReduce the structure reveals the potential for
parallel execution, usually parallelism is found in the functions that make up the compo-
nents of the software architecture. In practice these functions are usually constructed from
a limited set of computational patterns. Dense and sparse linear algebra computations are
probably the two most common patterns. They are used by applications such as games, ma-
chine learning, image processing and high performance scientific computing. There are well
established parallel algorithms for these patterns, which have been implemented in many

[Diagram: a 4 x 4 matrix A multiplied by the vector x = (x0, x1, x2, x3); core 0 computes
b0 from row (A00 A01 A02 A03), core 1 computes b1 from row (A10 A11 A12 A13), core 2
computes b2 from row (A20 A21 A22 A23), and core 3 computes b3 from row (A30 A31 A32 A33).]

Figure 1.5: Parallel matrix-vector multiplication b = Ax.

libraries. The High Performance Linpack (HPL) benchmark used to classify the top 500
computers involves solving a dense system of linear equations. Parallelism arises naturally
in these applications. In matrix-vector multiplication, for example, the inner products that
compute each element of the result vector can be computed independently, and hence in
parallel, as seen in Figure 1.5. In practice it is more difficult to develop solutions that scale
well with matrix size and the number of processors, but the plentiful literature provides
guidance.
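
As a simple illustration of this independence, the following C++ sketch parallelizes a
matrix-vector multiplication over the rows of A using OpenMP (row-major storage and a small
arbitrary problem size are assumed; a tuned library routine would also optimize for cache and
vector units):

#include <iostream>
#include <vector>

// b = A x, where A is an n x n matrix stored in row-major order.
// Each element b[i] is an independent inner product, so the rows
// can be computed in parallel.
void matvec(const std::vector<double>& A, const std::vector<double>& x,
            std::vector<double>& b, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        double dot = 0.0;
        for (long j = 0; j < n; j++)
            dot += A[i * n + j] * x[j];
        b[i] = dot;
    }
}

int main() {
    const long n = 4;
    std::vector<double> A(n * n, 1.0), x(n, 2.0), b(n);
    matvec(A, x, b, n);
    std::cout << b[0] << "\n";  // prints 8
    return 0;
}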
Another important pattern is one where operations are performed on a grid of data. It
occurs in scientific simulations that numerically solve partial differential equations, and also
in image processing that executes operations on pixels. The solutions for each data point
can be computed independently but they require data from neighboring points. Other pat-
terns include those found in graph algorithms, optimization (backtrack/branch and bound,
dynamic programming), and sorting. It is reassuring that the lists that have been drawn up
of these patterns include less than twenty patterns. Even though the landscape of parallel
computing is vast, most applications can be composed of a small number of well studied
computational patterns.

1.6.3 Patterns in the Lower Layers


Structural and computational patterns allow developers to identify opportunities for par-
allelism and exploit ready-made solutions by using frameworks and libraries, without the
need to acquire expertise in parallel programming. Exploration of the patterns in the lower
layers is necessary for those who wish to develop new parallel frameworks and libraries, or
need a customized solution that will obtain higher performance. This requires the use of

Document 1: "The quick brown fox jumps over a lazy dog"
Document 2: "The brown dog chases the tabby cat"

map (document 1): ⟨a,1⟩, ⟨brown,1⟩, ⟨dog,1⟩, ⟨fox,1⟩, ⟨jumps,1⟩, ⟨lazy,1⟩, ⟨over,1⟩,
⟨quick,1⟩, ⟨the,1⟩
map (document 2): ⟨brown,1⟩, ⟨cat,1⟩, ⟨chases,1⟩, ⟨dog,1⟩, ⟨tabby,1⟩, ⟨the,2⟩

reduce: ⟨a,1⟩, ⟨brown,2⟩, ⟨cat,1⟩, ⟨chases,1⟩, ⟨dog,2⟩, ⟨fox,1⟩, ⟨jumps,1⟩, ⟨lazy,1⟩,
⟨over,1⟩, ⟨quick,1⟩, ⟨tabby,1⟩, ⟨the,3⟩

Figure 1.6: Word count as a MapReduce pattern.

algorithmic and implementation structures to create parallel software. These structures will
be explored in detail in Chapters 3 and 4.

1.7 WORD COUNT IN PARALLEL


Let’s return to the word count problem of Section 1.4. Figure 1.6 illustrates how this
algorithm can be viewed as an instance of the MapReduce pattern. In the map phase,
⟨word, count⟩ pairs are produced for each document. The reduce phase aggregates the pairs
produced from all documents. What’s not shown in Figure 1.6 is that there can be multiple
reducers. If we’re using a MapReduce framework then we just need to implement a mapper
function to generate ⟨word, count⟩ pairs given a document and a reducer function to sum
the values of a given word, as we’ll see in Chapter 4.
To produce an explicitly parallel solution we need to start by finding the parallelism
in the problem. The task of creating hword, counti pairs can be done independently on
each document, and therefore can be done trivially in parallel. The reduction phase is not
as simple, since some coordination among tasks is needed, as described in the reduction
algorithm pattern. Next, we need to consider what type of computational platform is to
be used, and whether the documents are stored locally or are distributed over multiple
computers. A distributed approach is required if the documents are not stored in one place
or if the number of documents is large enough to justify using a cluster of computers, which
might be located in the cloud.
Alternatively, a local computer offers the choice of using a shared data structure. A
concurrent hash map would allow multiple threads to update ⟨word, count⟩ pairs, while
providing the necessary synchronization to avoid conflicts should multiple updates overlap.
A distributed implementation could use the master-worker pattern, where the master per-
forms the distribution and collection of work among the workers. Note that this approach
could also be used on multiple cores of a single computer.
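
The following C++ sketch takes this second approach, with a mutex-protected ordered map
standing in for a true concurrent hash map and two in-memory documents standing in for the
collection. Each thread counts its share of the documents locally and then merges its counts
into the shared structure:

#include <iostream>
#include <map>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

int main() {
    // Hypothetical in-memory documents standing in for files on disk.
    std::vector<std::string> docs = {"the quick brown fox jumps over a lazy dog",
                                     "the brown dog chases the tabby cat"};

    std::map<std::string, int> counts;  // shared <word, count> structure
    std::mutex m;                       // serializes updates to the shared map

    auto worker = [&](size_t first, size_t last) {
        for (size_t d = first; d < last; d++) {
            std::map<std::string, int> local;  // count one document locally
            std::istringstream words(docs[d]);
            std::string word;
            while (words >> word)
                ++local[word];
            std::lock_guard<std::mutex> lock(m);  // then merge into the shared map
            for (const auto& [w, count] : local)
                counts[w] += count;
        }
    };

    std::thread t0(worker, 0, docs.size() / 2);
    std::thread t1(worker, docs.size() / 2, docs.size());
    t0.join();
    t1.join();

    for (const auto& [word, count] : counts)
        std::cout << word << " " << count << "\n";
    return 0;
}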

Both distributed and shared implementations could be combined. Taking a look at the
sequential algorithm we can see that word counts could be done independently for each line
of text. The processing of the words of the sentences of each document could be accomplished
in parallel using a shared data structure, while the documents could be processed by multiple
computers.
This is a glimpse of how rich the possibilities can be in parallel programming. It can get
even richer as new computational platforms emerge, as happened in the mid 2000s with gen-
eral purpose programming on graphics processing units (GPGPU). The patterns of parallel
computing provided a guide in our example problem, as they allowed the recognition that
it fit the MapReduce structural pattern, and that implementation could be accomplished
using well established algorithmic and implementation patterns. While this book will not
follow the formalism of design patterns, it does adopt the similar view that there is a set
of parallel computing elements that can be composed in many ways to produce clear and
effective solutions to computationally demanding problems.

1.8 OUTLINE OF THE BOOK


Parallel programming requires some knowledge of parallel computer organization. Chapter 2
begins with a discussion of three abstract machine models: SIMD, shared memory, and dis-
tributed memory. It also discusses the hazards of access to shared variables, which threaten
program correctness. Chapter 2 then presents the task graph as an execution model for
parallel computing. This model is used for algorithm design, analysis, and implementation
in the following chapters. Chapter 3 presents an overview of algorithmic structures that are
often used as building blocks. It focuses on the fundamental task of finding parallelism via
task and data decomposition. Chapter 4 explores the three machine models more deeply
through examination of the implementation structures that are relevant to each model.
Performance is a central concern for parallel programmers. Chapter 5 first shows how
work-depth analysis allows evaluation of the task graph of a parallel algorithm, without
considering the target machine model. The barriers to achieving good performance that are
encountered when implementing parallel algorithms are discussed next. This chapter closes
with advice on how to effectively and honestly present performance results.
The final chapters contain three detailed case studies, drawing on what was learned
in the previous chapters. They consider different types of parallel solutions for different
machine models, and take a step-by-step voyage through the algorithm design and analysis
process.
CHAPTER 2

Parallel Machine and
Execution Models

Unsurprisingly, parallel programs are executed on parallel machines. It may be less obvious
that they are also executed in the mind of the parallel programmer. All programmers have a
mental model of program execution that allows them to trace through their code. Program-
mers who care about performance also have a mental model of the machine they are using.
Knowledge about the operation of the cache hierarchy, for instance, allows a programmer
to structure loops to promote cache reuse. While many programmers can productively work
at a high level of abstraction without any concern for the hardware, parallel programmers
cannot afford to do so. The first part of this chapter discusses three machine models and
the second part presents a task graph execution model. The task graph model is used ex-
tensively in Chapter 3 when presenting algorithmic structures. The machine models are
examined more deeply together with implementation structures in Chapter 4.

2.1 PARALLEL MACHINE MODELS


We will not delve into the details of actual parallel computers, because they continue to
evolve at a very rapid pace and an abstract representation is more appropriate for our
purposes. We’ll look instead at some general features that can be observed when surveying
parallel computers past and present. Stated simply, parallel computers have multiple pro-
cessing cores. They range from the most basic multi-core processor to clusters of hundreds
of thousands of processors. They also vary in the functionality of the cores and in how they
are interconnected.
Modern microprocessors have become incredibly complicated, as they have used the
ever increasing number of transistors to improve the performance of execution of a single
instruction stream, using techniques such as pipelining and superscalar execution. They have
also devoted an increasing fraction of the chip to high speed cache memory, in an attempt
to compensate for slow access to main memory. Because of this, multicore processors have
a limited number of cores. So-called manycore processors simplify the design of the cores
so that more can be placed on a chip, and therefore more execution streams can proceed in
parallel. These two types of architectures have been called latency oriented and throughput
oriented. They can be seen in general purpose microprocessors on one hand, and graphics
processing units (GPUs) on the other [28].
Three machine models can represent, individually and in combination, all current gen-


[Diagram: a) a scalar ALU with registers r0 and r1 connected to memory M; b) a SIMD ALU with
vector registers v0 and v1 connected to memory M.]

Figure 2.1: a) scalar architecture b) SIMD architecture, with vector registers that can
contain 4 data elements.

eral purpose parallel computers: SIMD, shared memory and distributed memory. Recall from
Chapter 1 that the latter two are subcategories of the MIMD architecture. These machines
mainly need to be understood at a high level of abstraction for the purpose of algorithm
design and implementation strategy. Understanding of deeper levels of machine architecture
is important during implementation, at which point technical documentation from the hard-
ware vendor should be consulted. One exception has to do with access to shared variables,
where race conditions can lead to incorrect programs. Another is the performance impact
of cache memory, which is sensitive to data access patterns.

2.1.1 SIMD
SIMD computers simultaneously execute a single instruction on multiple data elements. His-
torically, processors in SIMD computers consisted either of a network of Arithmetic Logic
Units (ALUs), as in the Connection Machine, or had deeply pipelined vector arithmetic
units, as in computers from Cray and NEC. Currently, SIMD execution is mainly found in
functional units in general purpose processors, and our discussion will reflect this architec-
ture, as sketched in Figure 2.1. SIMD ALUs are wider than conventional ALUs and can
perform multiple operations simultaneously in a single clock cycle. They use wide SIMD
registers that can load and store multiple data elements to memory in a single transaction.
For example, for a scalar ALU to add the first 4 elements of two arrays a and b and
store the result in array c, machine instructions similar to the following would need to be
executed in sequence:
1: r1 ← load a[0] 5: r1 ← load a[1] 9: r1 ← load a[2] 13: r1 ← load a[3]
2: r2 ← load b[0] 6: r2 ← load b[1] 10: r2 ← load b[2] 14: r2 ← load b[3]
3: r2 ← add r1, r2 7: r2 ← add r1, r2 11: r2 ← add r1, r2 15: r2 ← add r1, r2
4: c[0] ← store r2 8: c[1] ← store r2 12: c[2] ← store r2 16: c[3] ← store r2

A SIMD ALU of width equal to the size of four elements of the arrays could do the same
operations in just four instructions:

1: v1 ← vload a
2: v2 ← vload b
3: v2 ← vadd v1, v2
4: c ← vstore v2
The vload and vstore instructions load and store four elements of an array from memory
into a vector register in a single transaction. The vadd instruction simultaneously adds the
four values stored in each vector register.
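
On current general purpose processors this kind of execution is exposed, for example, through
compiler intrinsics. The following C++ sketch uses the x86 SSE intrinsics _mm_load_ps,
_mm_add_ps and _mm_store_ps to add four floats at a time, mirroring the vload/vadd/vstore
sequence above; in practice compilers will often generate such vector instructions
automatically from a plain loop:

#include <immintrin.h>
#include <iostream>

int main() {
    // _mm_load_ps requires 16-byte aligned data.
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float c[4];

    __m128 v1 = _mm_load_ps(a);   // vload: 4 floats in one transaction
    __m128 v2 = _mm_load_ps(b);
    v2 = _mm_add_ps(v1, v2);      // vadd: 4 additions in one instruction
    _mm_store_ps(c, v2);          // vstore: 4 floats back to memory

    std::cout << c[0] << " " << c[1] << " " << c[2] << " " << c[3] << "\n";
    return 0;
}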
SIMD execution is also found in manycore co-processors, such as the Intel Xeon Phi.
Nvidia Graphics Processing Units (GPUs) use SIMD in a different way, by scheduling
threads in groups (called warps), where the threads in each group perform the same in-
struction simultaneously on their own data. The difference with conventional SIMD led
the company to coin a new term, Single Instruction Multiple Threads (SIMT). It works
something like this:
t0: r00 ← load a[0] r10 ← load a[1] r20 ← load a[2] r30 ← load a[3]
t1: r01 ← load b[0] r11 ← load b[1] r21 ← load b[2] r31 ← load b[3]
t2: r01 ← add r00, r01 r11 ← add r10, r11 r21 ← add r20, r21 r31 ← add r30, r31
t3: c[0] ← store r01 c[1] ← store r11 c[2] ← store r21 c[3] ← store r31

Thread i loads a[i], and all loads are coalesced into a single transaction at time t0. The same
thing occurs at time t1 for loading the elements of b. All threads simultaneously perform an
addition at time t2. Finally all stores to c are coalesced at time t3. The difference between
SIMD and SIMT for the programmer is in the programming model: SIMD uses data parallel
loops or array operations whereas in the SIMT model the programmer specifies work for
each thread, and the threads are executed in SIMD fashion.

Connection Machine: Art and Science


Computers are becoming invisible as they become ubiquitous, either disappearing into
devices or off in the cloud somewhere. Supercomputers tend to be physically impressive
because of the space they occupy and the visible cables, cooling, and power infrastructure.
But to the uninitiated they are not very impressive because their design doesn’t express
how they function. The Connection Machine was an exception. It looked like a work of
art, because in a way it was. Produced by Thinking Machines between 1983 and 1994, it
was the realization of Danny Hillis’s PhD research at MIT. The Connection Machine was a
massively parallel SIMD computer. The CM-1 had 4096 chips with 16 processors each, for
a total of 65,536 processors. The chips were connected in a 12-dimensional hypercube. The
task of designing the enclosure for the CM-1 was assigned to the artist Tamiko Thiel. With
the help of Nobel Prize-winning physicist Richard Feynman, she designed the striking cubic case
with transparent sides that revealed red LED lights indicating the activity of the processors.
These blinking lights not only helped programmers verify that as many processors were
occupied as possible, but they made the computer look like a living, thinking machine.
The Connection Machine was originally designed to solve problems in artificial intelligence,
and was programmed using a version of Lisp. During his brief stay at Thinking Machines,
Feynman also demonstrated that the Connection Machine was good for scientific problems
by writing a program to numerically solve problems in Quantum Chromodynamics (QCD).
Feynman’s QCD solver outperformed a computer that Caltech, his home institution, was
building to solve QCD problems.

Figure 2.2: Basic multicore processor model, consisting of multiple cores (C) sharing memory
(M) units organized hierarchically.

The popularity of the SIMD architecture comes from the common programming pattern
where the same operation is executed independently on the elements of an array. SIMD
units can be programmed explicitly using a parallel programming language that supports
data parallelism. They can also be programmed directly in assembly language using vec-
tor instructions, as in the above example. More commonly the parallelism is not specified
explicitly by the programmer, but is exploited by the compiler or by the runtime system.
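
As a sketch of this last approach, the loop below has no dependencies between iterations, so
a compiler can map it onto SIMD instructions. The #pragma omp simd hint is an assumption
(it requires an OpenMP-capable compiler); many compilers will auto-vectorize such a loop
even without it.

// Element-wise array addition: each iteration is independent, so the compiler is free
// to execute several iterations at once in a SIMD unit.
void vector_add(const float* a, const float* b, float* c, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}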
The SIMD model is not hard to understand from the programmer’s perspective, but it
can be challenging to work within the restrictions imposed by the model. Common execution
patterns often deviate from the model, such as conditional execution and non-contiguous
data, as we’ll see in Chapter 4.

2.1.2 Shared Memory and Distributed Memory Computers


Multicore Processor
Most parallel computers fall into the Multiple Instruction Multiple Data (MIMD) category,
as each execution unit (process or thread) can perform different instructions on different
data. These computers differ in how the processing units are connected to each other and
to memory modules. A notable feature that unites them is that the memory modules are
organized hierarchically.
Figure 2.2 shows a model of a hypothetical multicore processor. The memory node at
the root represents the main off-chip memory associated with the processor. When a core
loads or stores a data element, a block of data containing that element is copied into its
local cache, represented in Figure 2.2 as a child of the main memory node. Access to off-
chip memory is typically several orders of magnitude slower than access to on-chip cache
memory. While all cores have access to the entire memory address space, those sharing a
cache are closer together in the sense that they can share data more efficiently among each
other than with cores in the other group. The memory hierarchy is normally invisible to
programmers, which means they do not have to be concerned with moving data between
the memory units, although some processors (e.g. Nvidia GPUs) have memory local to a
group of cores that is managed explicitly by the programmer.

Multiprocessors and Multicomputers


More processing power can be obtained by connecting several multicore processors together,
as in Figure 2.3. This machine can be realized in two ways. First, it can provide a single
memory address space to all processing units. This can be realized in a single multiproces-
sor computer, where an on-board network connects the memory modules of each processor.
Figure 2.3: Basic parallel computer with two processors. Can be implemented as shared
memory (multiprocessor) or distributed memory (multicomputer).

It can also be realized in a distributed shared memory computer as a network of multi-
processors, where a single address space is enabled either in hardware or software. These
multiprocessors implement what is called a Non Uniform Memory Architecture (NUMA),
due to the fact that memory access times are not uniform across the address space. A core
in one processor will obtain data more quickly from its processor’s memory than from an-
other processor. This difference in access times can be a factor of two or more. In
contrast, the architecture in Figure 2.2 can be called UMA, since access times to an element
of data will be the same for all cores, assuming it is not already in any of the caches.
Second, each processor may have a separate address space and therefore messages need
to be sent between processors to obtain data not stored locally. This is called a distributed
multicomputer. In either case it is clear that increasing the ratio of local to nonlocal memory
accesses can provide a performance gain.
The multiprocessor in Figure 2.3 can be expanded by adding more processors at the same
level, or combining several multiprocessors to form a deeper hierarchy. Knowledge of the
hierarchical structure of the machine can guide program design, for instance by encouraging
a hierarchical decomposition of tasks. The actual network topology is not relevant to the
parallel programmer, however. It is possible to design the communication pattern of a
parallel program tailored to a given network topology. This usually isn’t practical as it
limits the type of machine the program will perform well on. However, the performance
characteristics of the network are relevant when implementing an algorithm. For instance,
if point-to-point latency is high but there is plenty of bandwidth then it may be better to
avoid sending many small messages and to send them in larger batches instead. We’ll look
at performance analysis of distributed memory programs in Chapter 5.
Figure 2.4: Manycore processor with 32-way SIMD units connected to a multicore processor.

Figure 2.4 shows a machine model for a hybrid multicore/manycore processor. The
manycore co-processor follows the shared memory model, as all cores have access to the
same memory address space. It also incorporates SIMD execution, either through SIMD
functional units, as in the Intel Xeon Phi, or SIMT execution of threads in a warp on an
Nvidia GPU. The multicore processor and manycore co-processor can share memory, or they
can have distinct memory spaces. In either case there are likely to be nonuniform access
times to memory, as in the multiprocessor of Figure 2.3. Multiple hybrid multicore/manycore
computers can be combined on a network to produce a distributed multicomputer with very
high performance.

2.1.3 Distributed Memory Execution


Parallel execution in the distributed memory model involves execution of multiple processes,
with each process normally assigned to its own processor core. Each process has its own
private memory, and can only access data in another process if that process sends the data
in a message. Distributed memory execution is suited to multicomputers, but it can also be
applied to shared memory multiprocessors.
To continue with our simple array example, consider adding the four elements with two
processes, this time with pseudocode representing a higher level language than assembler
to keep the exposition brief:

// process 0 // process 1
1: send a[2 . . . 3] to process 1 1: receive a[0 . . . 1] from process 0
2: send b[2 . . . 3] to process 1 2: receive b[0 . . . 1] from process 0
3: c[0] ← a[0] + b[0] 3: c[0] ← a[0] + b[0]
4: c[1] ← a[1] + b[1] 4: c[1] ← a[1] + b[1]
5: receive c[2 . . . 3] from process 1 5: send c[0 . . . 1] to process 0

Arrays a and b are initially in the memory of process 0, and the result of a + b is to be stored
there as well. The communication is illustrated in Figure 2.5. Process 0 sends the second
half of arrays a and b to process 1.

Figure 2.5: Communication involved in adding two arrays with two processes.

Notice that each process uses local indexing for the
arrays, and hence both use identical statements to add their arrays. Execution of the sums
can then take place independently on each process until process 1 has finished its sums and
sends its array c to process 0. Lines 3 and 4 won’t necessarily execute exactly simultaneously.
Process 0 might start earlier since it could proceed to line 3 while both messages are in flight
and before process 1 has received them. The execution times of these operations could be
different for each process, due to different memory access times or competition for CPU
resources from operating system processes.
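
As a rough sketch of how this two-process example might be written with MPI, a widely used
message passing library: the choice of MPI, the use of double elements, the example data,
and the assumption of exactly two processes are all illustrative additions, not the book's
code.

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int half = 2;                         // each process adds two of the four elements
    double a[4] = {0}, b[4] = {0}, c[4] = {0};

    if (rank == 0) {
        for (int i = 0; i < 4; i++) { a[i] = i; b[i] = 10.0 * i; }  // example data
        MPI_Send(&a[2], half, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    // send a[2..3] to process 1
        MPI_Send(&b[2], half, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);    // send b[2..3] to process 1
        for (int i = 0; i < half; i++) c[i] = a[i] + b[i];          // local sums
        MPI_Recv(&c[2], half, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(a, half, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // local a[0..1]
        MPI_Recv(b, half, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // local b[0..1]
        for (int i = 0; i < half; i++) c[i] = a[i] + b[i];          // identical local-index sums
        MPI_Send(c, half, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);        // send c[0..1] back
    }
    MPI_Finalize();
    return 0;
}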
The message passing programming model gives a lot of control to the programmer.
Since communication is explicit it’s possible to trace through execution and reason about
correctness and performance. In practice this type of programming can be quite complex.
We’ll look at some common patterns in message passing programming in Chapter 4.

2.1.4 Shared Memory Execution


Threads are the execution units for shared memory programs. Unlike processes they share
memory with other threads, so they do not need to use explicit communication to share
data. They can follow their own execution path with private data stored on the runtime
stack. Execution of our array sum by four threads would look like this:
// thread 0 // thread 1 // thread 2 // thread 3
r00 ← load a[0] r10 ← load a[1] r20 ← load a[2] r30 ← load a[3]
r01 ← load b[0] r11 ← load b[1] r21 ← load b[2] r31 ← load b[3]
r01 ← add r00, r01 r11 ← add r10, r11 r21 ← add r20, r21 r31 ← add r30, r31
c[0] ← store r01 c[1] ← store r11 c[2] ← store r21 c[3] ← store r31

Here the threads don’t execute their instructions in lockstep, unlike the SIMT example
above. Their execution will overlap in time, but the exact timing is unpredictable. Observe
that each thread loads and stores to different memory locations, so the relative timing of
threads has no impact on the result.
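
A minimal sketch of this four-thread execution using C++ threads shows why no
synchronization is needed: every thread writes a different element of c. The thread library
and the example data are assumptions; the book's listing is at the machine-instruction level.

#include <thread>

int main() {
    int a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];   // example data
    std::thread workers[4];
    for (int i = 0; i < 4; i++) {
        // Each thread adds one pair of elements; the index i is captured by value.
        workers[i] = std::thread([&, i]() { c[i] = a[i] + b[i]; });
    }
    for (auto& w : workers) {
        w.join();   // wait for all four threads to finish before using c
    }
    return 0;
}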

Data Races
Let’s say after adding a and b we want to use array c, and we add two more instructions
for thread 0:
r00 ← load c[3]
x ← store r00
Figure 2.6: Example of a data race.

This means thread 0 is loading from a memory location that thread 3 stores to. The value of
x is indeterminate, since it depends on the relative timing of threads 0 and 3. The problem
is that thread 0 could load c[3] before thread 3 stored its value to c[3]. This is an example
of a data race, which occurs when one thread stores a value to a memory location while
other threads load or store to the same location. Data races are a serious threat to program
correctness.
Data races can be hidden from view in higher level programming languages. Consider
two threads incrementing the same variable:
// thread 0 // thread 1
sum ← sum + 1 sum ← sum + 1
If sum is initially 0, then one might think that the result of these two operations would be
sum = 2, whatever the temporal order of the two threads. However, another possible result
is sum = 1, because the statement sum ← sum + 1 is not executed atomically (from the
Greek atomos meaning indivisible) at the machine level. Let’s look at machine instructions
for this operation:
// thread 0 // thread 1
r00 ← load sum r10 ← load sum
r01 ← add r00,1 r11 ← add r10,1
sum ← store r01 sum ← store r11
Now the data race is more easily seen, as both threads could load sum before either of them
stored their value, as illustrated in Figure 2.6.
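
In C++ this race can be reproduced with two threads incrementing a plain shared int; making
the variable a std::atomic<int> makes the load, add and store indivisible, so the result is
always 2. The sketch below is an illustration under those assumptions, not code from the
book.

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> sum{0};    // with a plain int instead, the final value may be 1 or 2
    auto increment = [&sum]() { sum.fetch_add(1); };   // atomic load, add 1, store
    std::thread t0(increment);
    std::thread t1(increment);
    t0.join();
    t1.join();
    std::cout << "sum = " << sum.load() << "\n";       // always prints 2 with std::atomic
    return 0;
}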
Returning to our previous example, we could try to fix this data race by synchronizing
threads so that thread 0 doesn’t read from c[3] until thread 3 has written its value there.
Thread 3 sets a flag once its value is written, and thread 0 makes sure that flag is set before
loading c[3] (flag is initialized to 0):
// thread 0 // thread 3
L: r00 ← load flag                c[3] ← store r31
If flag ≠ 1 goto L                flag ← store 1
r00 ← load c[3]
x ← store r00

We might expect that, even if thread 3 were delayed, thread 0 would keep looping until
thread 3 had written to c[3] and set the flag. This expectation comes from our practice of
reasoning about sequential programs, where we know that instructions may be reordered,
but the result is the same as if they executed in program order. This intuition is no longer
valid for multithreaded execution. The ordering of instructions executed by multiple threads
is governed by the memory model of the system.

Memory Model
A useful way to look at multithreaded execution is to imagine that the threads are all running
on a single core, so they have to be executed in sequence in some order. An example of such an
interleaved order for the sum of arrays a and b could be this:
1: r00 ← load a[0] 5: r01 ← load b[0] 9: r21 ← add r20, r21 13: c[0] ← store r01
2: r10 ← load a[1] 6: r11 ← load b[1] 10: r01 ← add r00, r01 14: c[1] ← store r11
3: r20 ← load a[2] 7: r21 ← load b[2] 11: r31 ← add r30, r31 15: c[2] ← store r21
4: r30 ← load a[3] 8: r31 ← load b[3] 12: r11 ← add r10, r11 16: c[3] ← store r31

Of course, we want to run each thread on its own core to speed up execution, but logically
the result (array c) after multicore execution is the same as if they did execute in interleaved
fashion on one core.
We can observe that each thread executes its instructions in program order in this
example. The order of the interleaving is arbitrary, although here the loads and stores are
in thread order and the adds are in a permuted order. This type of multithreaded execution
order, which preserves the program order of each thread, is called sequential consistency.
It’s the most natural way to think about multithreaded execution, but it’s not the way most
processors work. The problem comes as a result of instruction reordering by the processor
at runtime. How this is done is a complex topic which we won’t get into [68]. It’s only
important to know that it’s done to improve performance and that it can also be a threat
to the correctness of parallel program execution. Reordering is limited by the dependencies
between the instructions of each thread. In this example the loads of each thread can
obviously be done in either order, but they must complete before the add, after which the
store can take place.
Returning to our attempt to synchronize threads 0 and 3, we can see that the instructions
of thread 3 could be reordered, since they each access a different memory location. This
could lead to the following order:
thread 3: flag ← store 1
thread 0: L: r00 ← load flag
thread 0: If flag ≠ 1 goto L
thread 0: r00 ← load c[3]
thread 0: x ← store r00
thread 3: c[3] ← store r31

What we have tried to do here is to intentionally use a data race with flag to synchronize
two threads, which is called a synchronization race. However our solution has been undone
by the reordering of instructions. Note that this ordering violates sequential consistency,
because the instructions of thread 3 don’t take effect in program order.
Sequential consistency is one possible memory model. Other models involve some form
of relaxed consistency where the requirement for instructions to execute in program order
is relaxed, in order to obtain increased performance.
It’s possible to reason about program execution for relaxed consistency models, but it’s
much easier to use sequential consistency. Fortunately, we can keep reasoning with the se-
quential consistency model, even if the underlying memory model uses relaxed consistency,
as long as we eliminate data races. If we need to synchronize the execution of threads we
can only use specialized language constructs, such as locks or barriers, to perform synchro-
nization, rather than rolling our own as in this example. This sequential consistency for
data race free execution applies to higher level languages such as C++ and Java as well [68].
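
For instance, the flag hand-off above can be expressed safely in C++ with std::atomic, one
such specialized construct: atomic accesses are not data races and default to sequentially
consistent ordering, so once thread 0 sees flag equal to 1 it is also guaranteed to see the
store to c[3]. This is a sketch under assumed values (42 stands in for the contents of r31),
not the book's code.

#include <atomic>
#include <thread>

int c[4];
std::atomic<int> flag{0};

void thread3() {
    c[3] = 42;                    // c[3] ← store r31
    flag.store(1);                // flag ← store 1; cannot be reordered before the line above
}

void thread0(int& x) {
    while (flag.load() != 1) { }  // spin until thread 3 has published c[3]
    x = c[3];                     // safe: the store to c[3] is now visible
}

int main() {
    int x = 0;
    std::thread t3(thread3);
    std::thread t0(thread0, std::ref(x));
    t3.join();
    t0.join();
    return 0;                     // x is 42
}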
Figure 2.7: Cache coherence problem. Core C0 loads 4 from memory location x and C1
writes 5 to x.

Cache Coherence
The discussion so far has ignored the effect of cache memory. On-chip caches offer perfor-
mance benefits by enabling much faster access to a relatively small amount of memory. The
operation of most caches is invisible to programmers. A block of data is fetched from main
memory into cache memory when a load or store occurs to a data element not already in
the cache. Programmers concerned about performance know to maximize the number of
memory accesses to cache blocks, for example by favoring traversal of consecutive elements
of arrays. Otherwise, cache memory can safely be ignored for sequential programs.
In a single core processor the fact that there can be two copies of the data corresponding
to the same memory address is not a problem, as the program can never access an out of
date value in main memory. The situation is different in a multicore processor. As Figure 2.7
illustrates, one core has loaded the value corresponding to variable x in its own cache and
the other core has updated the value of x in its cache. The first core could then read an
outdated value of x. In most processors this problem is solved by using a hardware protocol
so that the caches are coherent. A good definition of cache coherence states that at any given
time either any number of cores can read or only one core can write to a cache block, and in
addition that once a write occurs it is immediately visible to a subsequent read [68]. In other
words cache coherence makes the cache almost disappear from the multicore programmer’s
view, since the behavior described in the definition is what is expected of concurrent accesses
to the same location in main memory. Since loads and stores are atomic it is not possible for
multiple writes (or reads and writes) to occur simultaneously to the same memory location.
Why does the cache only almost disappear? This is because the above definition of
cache coherence refers to a cache block, which contains multiple data elements. If cores are
accessing distinct data that are in the same cache block in their local cache they cannot
do concurrent writes (or writes and reads). The programmer thinks the threads are mak-
ing concurrent accesses to these data elements, but in fact they cannot if the block is in
coherent caches. This is called false sharing, which we’ll explore when we discuss barriers
to performance in Chapter 5.
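
As a brief preview (a sketch with an assumed 64-byte cache block, not the book's example),
the layout below shows the issue and the usual remedy of padding each thread's data so that
it occupies its own block:

#include <cstdint>

struct PackedCounters {
    // count[0] and count[1] almost certainly share one cache block, so two threads
    // updating them will contend even though the data elements are distinct.
    int64_t count[2];
};

struct alignas(64) PaddedCounter {
    int64_t count;                     // one counter per (assumed) 64-byte cache block
    char pad[64 - sizeof(int64_t)];    // padding keeps neighbouring counters in separate blocks
};

PaddedCounter counters[2];             // counters[0] and counters[1] no longer share a block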

2.1.5 Summary
We’ve seen the three machine models that are relevant to parallel programmers. Each one
can incorporate the previous one, as SIMD execution is enabled in most processors, and
distributed multicomputers consist of a number of shared memory multiprocessors. We’ll
see in Chapter 4 that there are different implementation strategies for each model.
Parallelism is expressed differently in each model. SIMD execution is usually specified
implicitly by the lack of dependencies in a loop. Distributed memory programming makes
use of message passing, where communication is explicit and synchronization is implicitly
specified by the messages. In shared memory programming, threads synchronize explicitly
but communication is implicitly done by the hardware.
Program correctness is a serious issue when memory is shared. Understanding how
to eliminate data races is key, since it enables us to use the intuitive sequential consis-
tency model to reason about program execution. We’ll explore synchronization techniques
in Chapter 4.
Data movement is a key concern for all models. Successful implementations will minimize
communication. A good way to do this is to favor access to local data over remote data.

2.2 PARALLEL EXECUTION MODEL


All programming languages have an execution model that specifies the building blocks
of a program and their order of execution. Sequential execution models impose an order
of execution. Compilers can reorder machine instructions only if the result is unchanged.
Parallel execution models specify tasks and, directly or indirectly, the dependencies between
them. The order of execution of tasks is only constrained by the dependencies. Programmers
use a language’s execution model as a mental model when designing, writing, and reading
programs.
Execution models are also necessary for theoretical analysis of algorithms. A number
of models have been proposed, which represent both a machine and an execution model.
The random access machine (RAM) is a sequential model with a single processing unit and
memory. Execution is measured in the number of operations and memory required. The
parallel RAM (PRAM) model has multiple processing units working synchronously and
sharing memory. Subcategories of the PRAM model specify how concurrent accesses to the
same memory location are arbitrated.
Other theoretical models take into account the computation and communication costs
of execution on parallel computers. While they can more accurately assess algorithm perfor-
mance than the simple PRAM model, they are not necessarily the best suited as execution
models for parallel algorithm design. It’s becoming increasingly difficult to model complex
parallel computers at a level relevant to programmers. A suitable model should capture the
essential features of parallel algorithms at a level of abstraction high enough to be applicable
for execution on any computer.

2.2.1 Task Graph Model


The task graph model is our candidate for an execution model. A parallel program designer
is concerned with specifying the units of work and the dependencies between them. These
are captured in the vertices and edges of the task graph. This model is independent of the
target computer, which is appropriate for the designer since algorithm exploration should
start with the properties of the problem at hand and not the computer. This model allows
theoretical analysis, as we will see in Chapter 5. Of course, task graphs don’t operate on
their own, they just specify the work. They need to be mapped to a particular computer.
The task graph model is well suited to analyzing this process as it is the preferred model
for analysis of task scheduling on parallel computers.
Figure 2.8: Dependence graph for Example 2.1.

Real Data Dependencies


Before we define the task graph model we need to be clear about what we mean by depen-
dencies between tasks. Consider the following four statements:
Example 2.1
1: a ← 5
2: b ← 3 + a
3: c ← 2 ∗ b − 1
4: d ← b + 1

Normally a programmer thinks of these statements executing as they appear in a pro-
gram, that is, one after another. Let's look more closely. Statement 1 assigns a value to a,
which is then needed by statement 2. Statement 2 assigns a value to b, which is needed by
both statements 3 and 4. These dependencies can be represented by the graph in Figure 2.8.
Now we can see that there are several possible ways these statements could be executed.
Since there is no dependence between statements 3 and 4, they could be executed in ei-
ther order or in parallel. The first two statements must be executed sequentially in order,
however, because of the dependence between them.
The edges of the graph in Figure 2.8 also represent the flow or communication of data
between statements. The value of a is communicated between statements 1 and 2, and state-
ment 2 communicates the value of b to statements 3 and 4. In practice the communication
might occur simply by writing and reading to a register, but for parallel execution it could
involve sending data over a communication link.
The dependencies in Example 2.1 are called real dependencies. In contrast, consider the
next group of statements:
Example 2.2
1: a ← 8
2: b ← 7 ∗ a
3: c ← 3 ∗ b + 1
4: b ← 2 ∗ a
5: c ← (a + b)/2

Statements 2, 4 and 5 require the value of a to be communicated from statement 1, and
statement 3 needs the value of b from statement 2. There is also a dependence between
statements 3 and 4, through b, which requires them to be executed one after the other. This
dependence is in the reverse direction of the other ones, since it involves a read followed
by a write rather than a write followed by a read. This type of dependence is called an
anti-dependence. It’s also known as a false dependence, since it can easily be eliminated by
using another variable in statement 4, d ← 2 ∗ a, and changing statement 5 to c ← (a + d)/2.
There is another type of false dependence, illustrated in the following example:

Example 2.3
1: a ← 3
2: b ← 5 + a
3: a ← 42
4: c ← 2 ∗ a + 1

Statements 1 and 2 should be able to be executed independently from statements 3 and
4, but they can't because the final value of a (42) requires statement 1 to be executed before
statement 3. This is called an output dependence, since two statements are writing to the
same variable. It is a false dependence because it can easily be eliminated by replacing a
with another variable in statements 3 and 4.
The examples so far have shown data dependencies. Control dependencies are also fa-
miliar, as in:
Example 2.4
1: if a > b then
2: c ← a/2
3: else
4: c ← b/2
5: end

Only one of statements 2 and 4 is executed. The task graph model does not model control
dependencies and is only concerned with real data dependencies.
Definition 2.1 (Task Graph Model). A parallel algorithm is represented by a directed
acyclic graph, where each vertex represents execution of a task, and an edge between two
vertices represents a real data dependence between the corresponding tasks. A directed edge
u → v indicates that task u must execute before task v, and implies communication between
the tasks. Tasks may be fine-grained, such as a single statement, or may be coarse-grained,
containing multiple statements and control structures. Any control dependencies are only
encapsulated in the tasks. The vertices and edges may be labeled with nonnegative weights,
indicating relative computation and communication costs, respectively.
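
One possible in-memory representation of such a graph is sketched below; the structure and
field names are illustrative assumptions, not notation from the book.

#include <vector>

struct Edge {
    int to;         // index of the dependent task v in the edge u → v
    double comm;    // nonnegative weight: relative communication cost along the edge
};

struct Task {
    double work;                // nonnegative weight: relative computation cost of the task
    std::vector<Edge> out;      // real data dependencies from this task to later tasks
};

using TaskGraph = std::vector<Task>;   // vertices of the directed acyclic graph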
We’ll see in the following chapters that the granularity of tasks is an important consider-
ation in parallel algorithm design. The rest of this chapter explores some simple examples of
task graphs generated from sequences of statements, to get a feeling for how this execution
model represents dependencies between tasks.

2.2.2 Examples
Work-intensive applications usually spend most of their time in loops. Consider the loop
for addition of two arrays in Example 2.5.
Example 2.5
for i ← 0 to n − 1 do
c[i] ← a[i] + b[i]
end

The statement in each iteration can be represented by a vertex in the task graph, but no
edges are required since the statements are independent (Figure 2.9a). We are not usually
this lucky, as often there are dependencies between iterations, as in Example 2.6.
Figure 2.9: Task graphs for four iterations of Examples (a) 2.5, (b) 2.6 and (c) 2.7. In (a)
the tasks are independent; in (b) each iteration depends on the previous one; in (c) task 1
of each iteration feeds task 2 of the following iteration.

Example 2.6
for i ← 1 to n − 1 do
a[i] ← a[i − 1] + x ∗ i
end

Each iteration depends on the value computed in the previous iteration, as shown in Fig-
ure 2.9b for the case with four iterations.
Task graphs for iterative programs can get very large. The number of iterations may not
even be known in advance. Consider Example 2.7:
Example 2.7
for i ← 1 to n − 1 do
1: a[i] ← b[i] + x ∗ i
2: c[i] ← a[i − 1] ∗ b[i]
end

Figure 2.9c shows the task graph for the first four iterations of Example 2.7. Each task has
an extra label indicating to which iteration it belongs. The dependence between tasks 1 and
2 is from each iteration to the next.
There are two alternative ways to model iterative computation. One is to only show
dependencies within each iteration, which would produce two unconnected tasks for Exam-
ple 2.7. Another alternative is to use a compact representation called a flow graph, where
tasks can be executed multiple times, which we won’t discuss here.

Reduction
One particular case is interesting, as it represents a common pattern, called a reduction.
We’ll bring back the word count example from Chapter 1, but to simplify we are only
interested in occurrences of a given word:
Figure 2.10: Task graph for (a) sequential and (b) parallel reduction.

count ← 0
foreach document in collection do
count ← count + countOccurrences(word, document)
end

Let’s see if we can build a task graph with independent tasks, which will allow parallel
execution. To keep things simple, we’ll take the case where there are four documents, and
unroll the loop:

0: count ←0
1: count ← count + countOccurrences(word, document1)
2: count ← count + countOccurrences(word, document2)
3: count ← count + countOccurrences(word, document3)
4: count ← count + countOccurrences(word, document4)

These statements aren’t independent, since they are all updating the count variable, as
shown by the task graph in Figure 2.10a. This contradicts our intuition, which tells us that
we can compute partial sums independently followed by a combination of the result, as in:

0: count1 ← 0
1: count2 ← 0
2: count1 ← count1 + countOccurrences(word, document1)
3: count2 ← count2 + countOccurrences(word, document2)
4: count1 ← count1 + countOccurrences(word, document3)
5: count2 ← count2 + countOccurrences(word, document4)
6: count ← count1 + count2

As the task graph in Figure 2.10b illustrates, statements 0, 2 and 4 can be computed at the
same time as statements 1, 3, and 5. The final statement needs to wait until tasks 4 and 5
are complete. We’ll look at reduction in more detail in Chapter 3.
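
In a shared memory setting this pattern is usually written with a reduction construct rather
than by hand. The sketch below uses OpenMP's reduction clause (an assumption; Document and
countOccurrences stand in for the book's pseudocode and are only declared, not implemented):
each thread accumulates a private partial count and the runtime combines the partial counts,
as in Figure 2.10b.

#include <string>
#include <vector>

struct Document;                                                       // placeholder type
int countOccurrences(const std::string& word, const Document& doc);    // declared only

int countWord(const std::vector<const Document*>& docs, const std::string& word) {
    int count = 0;
    #pragma omp parallel for reduction(+ : count)
    for (int i = 0; i < static_cast<int>(docs.size()); i++) {
        count += countOccurrences(word, *docs[i]);   // each thread adds into its private copy
    }
    return count;
}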

Transitive Dependencies
Figure 2.11: Reduction task graph showing transitive dependencies.

We’ve left out some dependencies between the tasks for sequential reduction in Figure 2.10a.
Each task writes to count and all following tasks read from count. In Figure 2.11 we show
all the real dependencies. This graph illustrates transitive dependence. For example, task 1
depends on task 0, task 2 depends on task 1, and the transitive dependence where task 2
depends on task 0 is also included. The process of removing the transitive dependencies to
produce Figure 2.10a is called transitive reduction.
We don’t always want to remove transitive dependencies, as in the following example:
Example 2.8

1: y ← foo()
2: ymin ← min(y)
3: ymax ← max(y)
4: for i ← 0 to n − 1 do
ynorm[i] ← (y[i] − ymin)/(ymax − ymin)
end

Statement 1 generates an array y, statements 2 and 3 find the minimum and maximum of
y and the loop uses the results of the three previous statements to produce array ynorm
with values normalized between 0 and 1. The task graph is shown in Figure 2.12, with
the edge labels indicating the relative size of the data being communicated between tasks.
Here the transitive dependence between tasks 1 and 4 is useful as it not only indicates
the dependence of task 4 on task 1 (through y) but also that this dependence involves a
communication volume that is n times larger than its other dependencies.

Mapping Task Graphs


Task graphs are a very useful tool for parallel algorithm design. An important design goal
is to have many tasks with as few dependencies as possible. We’ll see in Chapter 5 that a
task graph can be analyzed to assess properties such as the available parallelism.
The task graph then has to be mapped to a given machine, whether this is done by the
programmer or by the runtime system. Edges between tasks on different processors indicate

Figure 2.12: Task graph for Example 2.8. Task 1 sends data of size n to each of tasks 2, 3
and 4; tasks 2 and 3 each send data of size 1 to task 4.

