An Introduction to Parallel
Programming
SECOND EDITION
Peter S. Pacheco
University of San Francisco
Matthew Malensek
University of San Francisco
Table of Contents
Cover image
Title page
Copyright
Dedication
Preface
1.10. Summary
1.11. Exercises
Bibliography
2.6. Performance
2.9. Assumptions
2.10. Summary
2.11. Exercises
Bibliography
3.8. Summary
3.9. Exercises
Bibliography
4.5. Busy-waiting
4.6. Mutexes
4.11. Thread-safety
4.12. Summary
4.13. Exercises
Bibliography
5.10. Tasking
5.11. Thread-safety
5.12. Summary
5.13. Exercises
Bibliography
6.13. CUDA trapezoidal rule III: blocks with more than one warp
6.14. Bitonic sort
6.15. Summary
6.16. Exercises
Bibliography
7.5. Summary
7.6. Exercises
Bibliography
Index
Copyright
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
Notices
Knowledge and best practice in this field are constantly changing.
As new research and experience broaden our understanding,
changes in research methods, professional practices, or medical
treatment may become necessary.
Practitioners and researchers must always rely on their own
experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described
herein. In using such information or methods they should be
mindful of their own safety and the safety of others, including
parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the
authors, contributors, or editors, assume any liability for any injury
and/or damage to persons or property as a matter of products
liability, negligence or otherwise, or from any use or operation of
any methods, products, instructions, or ideas contained in the
material herein.
ISBN: 978-0-12-804605-0
Typeset by VTeX
Printed in the United States of America
Now suppose we also have p cores, and p is much smaller than n. Then each core can
form a partial sum of approximately n/p values:
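The block each core executes might look something like the following
sketch (the names here are illustrative; Compute_next_value() stands for
whatever computation produces each value, and my_first_i and my_last_i
delimit this core's block of roughly n/p values):

    /* Each core sums its own block of values into its private my_sum. */
    my_sum = 0;
    for (my_i = my_first_i; my_i < my_last_i; my_i++) {
        my_x = Compute_next_value(my_i);
        my_sum += my_x;
    }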
Here the prefix my_ indicates that each core is using its own, private
variables, and each core can execute this block of code
independently of the other cores.
After each core completes execution of this code, its variable my_sum
will store the sum of the values computed by its calls to
Compute_next_value. For example, if there are eight cores, n = 24, and the 24 calls to
Compute_next_value return the values
1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9,
then the values stored in my_sum on cores 0 through 7 will be 8, 19, 7,
15, 7, 13, 12, and 14, respectively. When the cores are done computing
their values of my_sum, they can form a global sum by sending their
results to a designated "master" core, which adds them up.
In our example, if the master core is core 0, it would add the values
8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95.
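Viewed serially, this first global sum amounts to a simple loop on the
master core, which performs one receive and one addition for each of the
other p − 1 cores. A minimal C sketch of ours (not the authors' code) that
mimics it on an array of partial sums:

    /* Serial simulation of the first global sum: "core 0" adds in the
       partial sum received from each of the other p-1 cores. */
    double master_sum(const double my_sums[], int p) {
        double sum = my_sums[0];               /* the master's own partial sum */
        for (int core = 1; core < p; core++)   /* one receive and one add per core */
            sum += my_sums[core];
        return sum;
    }

With the eight partial sums above, master_sum returns 95.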
But you can probably see a better way to do this—especially if the
number of cores is large. Instead of making the master core do all the
work of computing the final sum, we can pair the cores so that while
core 0 adds in the result of core 1, core 2 can add in the result of core
3, core 4 can add in the result of core 5, and so on. Then we can
repeat the process with only the even-ranked cores: 0 adds in the
result of 2, 4 adds in the result of 6, and so on. Now cores divisible
by 4 repeat the process, and so on. See Fig. 1.1. The circles contain
the current value of each core's sum, and the lines with arrows
indicate that one core is sending its sum to another core. The plus
signs indicate that a core is receiving a sum from another core and
adding the received sum into its own sum.
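To make the pattern concrete, the following small C program (a sketch of
ours, not code from the text) simulates the tree-structured sum serially:
at each stage the divisor doubles, and each core that is still active adds
in the partial sum of the core divisor ranks above it.

    #include <stdio.h>

    /* Serial simulation of the tree-structured global sum of Fig. 1.1. */
    double tree_sum(double my_sums[], int p) {
        for (int divisor = 1; divisor < p; divisor *= 2)         /* one pass per stage */
            for (int rank = 0; rank + divisor < p; rank += 2 * divisor)
                my_sums[rank] += my_sums[rank + divisor];        /* "receive and add" */
        return my_sums[0];                                       /* core 0 holds the total */
    }

    int main(void) {
        double my_sums[] = {8, 19, 7, 15, 7, 13, 12, 14};  /* partial sums from the example */
        printf("total = %g\n", tree_sum(my_sums, 8));      /* prints total = 95 */
        return 0;
    }

With eight cores there are three stages, which matches the three receives
and adds performed by core 0 in Fig. 1.1.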
For both “global” sums, the master core (core 0) does more work
than any other core, and the length of time it takes the program to
complete the final sum should be the length of time it takes for the
master to complete. However, with eight cores, the master will carry
out seven receives and adds using the first method, while with the
second method, it will only carry out three. So the second method
results in an improvement of more than a factor of two. The
difference becomes much more dramatic with large numbers of
cores. With 1000 cores, the first method will require 999 receives and
adds, while the second will only require 10 (since ⌈log₂ 1000⌉ = 10)—an
improvement of almost a factor of 100!
The first global sum is a fairly obvious generalization of the serial
global sum: divide the work of adding among the cores, and after
each core has computed its part of the sum, the master core simply
repeats the basic serial addition—if there are p cores, then it needs to
add p values. The second global sum, on the other hand, bears little
relation to the original serial addition.
The point here is that it's unlikely that a translation program
would “discover” the second global sum. Rather, there would more
likely be a predefined efficient global sum that the translation
program would have access to. It could “recognize” the original
serial loop and replace it with a precoded, efficient, parallel global
sum.
We might expect that software could be written so that a large
number of common serial constructs could be recognized and
efficiently parallelized, that is, modified so that they can use
multiple cores. However, as we apply this principle to ever more
complex serial programs, it becomes more and more difficult to
recognize the construct, and it becomes less and less likely that we'll
have a precoded, efficient parallelization.
Thus we cannot simply continue to write serial programs; we
must write parallel programs, programs that exploit the power of
multiple processors.
Now suppose we have n SIMD cores, and each core is assigned one
element from each of the three arrays: core i is assigned elements
x[i], y[i], and z[i]. Then our program can simply tell each core to add its
x- and y-values to get the z value:
z[i] = x[i] + y[i];
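For comparison, a conventional single-core system would carry out the same
computation as n successive additions, as in the following C loop (a sketch
of ours); on the SIMD system, all n cores execute the single addition in
lockstep, each on its own elements:

    /* Serial version of the same vector addition: one core, n additions. */
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];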
• Definitions are given in the body of the text, and the term
being defined is printed in boldface type: A parallel
program can make use of multiple cores.
• When we need to refer to the environment in which a
program is being developed, we'll assume that we're using a
UNIX shell, such as bash, and we'll use a $ to indicate the shell
prompt:
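For example, a command to compile a program might be displayed as
follows (the particular command is only an illustration):

$ gcc -g -Wall -o my_prog my_prog.c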
1.10 Summary
For many years we've reaped the benefits of having ever-faster
processors. However, because of physical limitations, the rate of
performance improvement in conventional processors has decreased
dramatically. To increase the power of processors, chipmakers have
turned to multicore integrated circuits, that is, integrated circuits
with multiple conventional processors on a single chip.
Ordinary serial programs, which are programs written for a
conventional single-core processor, usually cannot exploit the
presence of multiple cores, and it's unlikely that translation
programs will be able to shoulder all the work of converting serial
programs into parallel programs—programs that can make use of
multiple cores. As software developers, we need to learn to write
parallel programs.
When we write parallel programs, we usually need to coordinate
the work of the cores. This can involve communication among the
cores, load balancing, and synchronization of the cores.
In this book we'll be learning to program parallel systems, so that
we can maximize their performance. We'll be using the C language
with four different application program interfaces or APIs: MPI,
Pthreads, OpenMP, and CUDA. These APIs are used to program
parallel systems that are classified according to how the cores access
memory and whether the individual cores can operate
independently of each other.
In the first classification, we distinguish between shared-memory
and distributed-memory systems. In a shared-memory system, the
cores share access to one large pool of memory, and they can
coordinate their actions by accessing shared memory locations. In a
distributed-memory system, each core has its own private memory,
and the cores can coordinate their actions by sending messages
across a network.
In the second classification, we distinguish between systems with
cores that can operate independently of each other and systems in
which the cores all execute the same instruction. In both types of
system, the cores can operate on their own data stream. So the first
type of system is called a multiple-instruction multiple-data or
MIMD system, and the second type of system is called a single-
instruction multiple-data or SIMD system.
MPI is used for programming distributed-memory MIMD
systems. Pthreads is used for programming shared-memory MIMD
systems. OpenMP can be used to program both shared-memory
MIMD and shared-memory SIMD systems, although we'll be
looking at using it to program MIMD systems. CUDA is used for
programming Nvidia graphics processing units or GPUs. GPUs
have aspects of all four types of system, but we'll be mainly
interested in the shared-memory SIMD and shared-memory MIMD
aspects.
Concurrent programs can have multiple tasks in progress at any
instant. Parallel and distributed programs usually have tasks that
execute simultaneously. There isn't a hard and fast distinction
between parallel and distributed, although in parallel programs, the
tasks are usually more tightly coupled.
Parallel programs are usually very complex. So it's even more
important to use good program development techniques with
parallel programs.
1.11 Exercises
From the table, we see that during the first stage each core is
paired with the core whose rank differs in the rightmost or
first bit. During the second stage, cores that continue are
paired with the core whose rank differs in the second bit; and
during the third stage, cores are paired with the core whose
rank differs in the third bit. Thus if we have a binary value bitmask
that is 001₂ for the first stage, 010₂ for the second, and
100₂ for the third, we can get the rank of the core we're
paired with by "inverting" the bit in our rank that is nonzero
in bitmask. This can be done using the bitwise exclusive or (^)
operator.
Implement this algorithm in pseudocode using the bitwise
exclusive or and the left-shift operator.
1.5 What happens if your pseudocode in Exercise 1.3 or Exercise
1.4 is run when the number of cores is not a power of two
(e.g., 3, 5, 6, 7)? Can you modify the pseudocode so that it
will work correctly regardless of the number of cores?
1.6 Derive formulas for the number of receives and additions
that core 0 carries out using
a. the original pseudocode for a global sum, and
b. the tree-structured global sum.
Make a table showing the numbers of receives and additions
carried out by core 0 when the two sums are used with 2, 4, 8, . . . ,
1024 cores.
1.7 The first part of the global sum example—when each core
adds its assigned computed values—is usually considered to
be an example of data-parallelism, while the second part of
the first global sum—when the cores send their partial sums
to the master core, which adds them—could be considered to
be an example of task-parallelism. What about the second
part of the second global sum—when the cores use a tree
structure to add their partial sums? Is this an example of
data- or task-parallelism? Why?
1.8 Suppose the faculty members are throwing a party for the
students in the department.
a. Identify tasks that can be assigned to the faculty
members that will allow them to use task-parallelism
when they prepare for the party. Work out a schedule
that shows when the various tasks can be performed.
b. We might hope that one of the tasks in the preceding
part is cleaning the house where the party will be held.
How can we use data-parallelism to partition the work
of cleaning the house among the faculty?
c. Use a combination of task- and data-parallelism to
prepare for the party. (If there's too much work for the
faculty, you can use TAs to pick up the slack.)
1.9 Write an essay describing a research problem in your major
that would benefit from the use of parallel computing.
Provide a rough outline of how parallelism would be used.
Would you use task- or data-parallelism?
Chapter 2: Parallel hardware and
parallel software
It's perfectly feasible for specialists in disciplines other than
computer science and computer engineering to write parallel
programs. However, to write efficient parallel programs, we often
need some knowledge of the underlying hardware and system
software. It's also very useful to have some knowledge of different
types of parallel software, so in this chapter we'll take a brief look at
a few topics in hardware and software. We'll also take a brief look at
evaluating program performance and a method for developing
parallel programs. We'll close with a discussion of what kind of
environment we might expect to be working in, and a few rules and
assumptions we'll make in the rest of the book.
This is a long, broad chapter, so it may be a good idea to skim
through some of the sections on a first reading so that you have a
good idea of what's in the chapter. Then, when a concept or term in a
later chapter isn't quite clear, it may be helpful to refer back to this
chapter. In particular, you may want to skim over most of the
material in “Modifications to the von Neumann Model,” except “The
Basics of Caching.” Also, in the “Parallel Hardware” section, you can
safely skim the material on “Interconnection Networks.” You can
also skim the material on “SIMD Systems” unless you're planning to
read the chapter on CUDA programming.