An Introduction to Parallel
Programming
SECOND EDITION
Peter S. Pacheco
University of San Francisco
Matthew Malensek
University of San Francisco
Table of Contents
Cover image
Title page
Copyright
Dedication
Preface
1.10. Summary
1.11. Exercises
Bibliography
2.6. Performance
2.9. Assumptions
2.10. Summary
2.11. Exercises
Bibliography
3.8. Summary
3.9. Exercises
Bibliography
4.5. Busy-waiting
4.6. Mutexes
4.11. Thread-safety
4.12. Summary
4.13. Exercises
Bibliography
5.10. Tasking
5.11. Thread-safety
5.12. Summary
5.13. Exercises
Bibliography
6.13. CUDA trapezoidal rule III: blocks with more than one warp
6.15. Summary
6.16. Exercises
Bibliography
7.5. Summary
7.6. Exercises
Bibliography
Bibliography
Bibliography
Bibliography
Index
Copyright
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139,
United States
Notices
Knowledge and best practice in this field are constantly
changing. As new research and experience broaden our
understanding, changes in research methods,
professional practices, or medical treatment may become
necessary.
Practitioners and researchers must always rely on their
own experience and knowledge in evaluating and using
any information, methods, compounds, or experiments
described herein. In using such information or methods
they should be mindful of their own safety and the safety
of others, including parties for whom they have a
professional responsibility.
ISBN: 978-0-12-804605-0
Typeset by VTeX
Printed in the United States of America
Here the prefix my_ indicates that each core is using its
own, private variables, and each core can execute this block
of code independently of the other cores.
After each core completes execution of this code, its
variable my_sum will store the sum of the values computed by
its calls to Compute_next_value. For example, if there are
eight cores, n = 24, and the 24 calls to Compute_next_value
return the values
1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5,
1, 2, 3, 9,
then the values stored in my_sum might be

Core      0    1    2    3    4    5    6    7
my_sum    8   19    7   15    7   13   12   14
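As a concrete illustration, the following is a minimal C
sketch of the block each core executes. The function
Compute_next_value and the loop bounds my_first_i and
my_last_i are hypothetical stand-ins for the quantities
discussed in the text, not the book's exact code:

    int Compute_next_value(int my_i);   /* hypothetical: defined elsewhere */

    /* The block executed independently by each core: accumulate this
       core's share of the values into a private variable my_sum. */
    int Partial_sum(int my_first_i, int my_last_i) {
        int my_sum = 0;
        for (int my_i = my_first_i; my_i < my_last_i; my_i++) {
            int my_x = Compute_next_value(my_i);  /* this core's next value    */
            my_sum += my_x;                       /* add it to the private sum */
        }
        return my_sum;
    }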
The idea here is that each core will wait in the function
until all the cores have entered it, and in particular until
the master core has entered it.
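For readers who want to see what this kind of wait looks
like in real code, here is a hedged sketch using a POSIX
threads barrier (Pthreads is covered in Chapter 4):
pthread_barrier_wait blocks each thread until every thread
has reached the barrier, which is the behavior described
above. The thread function and the barrier initialization
are assumptions for illustration only:

    #include <pthread.h>

    pthread_barrier_t barrier;  /* assumed to be initialized elsewhere with
                                   the number of threads/cores */

    void* Thread_work(void* rank) {      /* hypothetical thread function */
        /* ... each thread computes its private partial sum here ... */
        pthread_barrier_wait(&barrier);  /* block until every thread arrives */
        /* ... the master can now safely combine the partial sums ... */
        return NULL;
    }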
Currently, the most powerful parallel programs are
written using explicit parallel constructs, that is, they are
written using extensions to languages such as C, C++, and
Java. These programs include explicit instructions for
parallelism: core 0 executes task 0, core 1 executes task 1,
…, all cores synchronize, …, and so on, so such programs
are often extremely complex. Furthermore, the complexity
of modern cores often makes it necessary to use
considerable care in writing the code that will be executed
by a single core.
There are other options for writing parallel programs—
for example, higher level languages—but they tend to
sacrifice performance to make program development
somewhat easier.
1.10 Summary
For many years we've reaped the benefits of having ever-
faster processors. However, because of physical limitations,
the rate of performance improvement in conventional
processors has decreased dramatically. To increase the
power of processors, chipmakers have turned to multicore
integrated circuits, that is, integrated circuits with multiple
conventional processors on a single chip.
Ordinary serial programs, which are programs written
for a conventional single-core processor, usually cannot
exploit the presence of multiple cores, and it's unlikely that
translation programs will be able to shoulder all the work
of converting serial programs into parallel programs—
programs that can make use of multiple cores. As software
developers, we need to learn to write parallel programs.
When we write parallel programs, we usually need to
coordinate the work of the cores. This can involve
communication among the cores, load balancing, and
synchronization of the cores.
In this book we'll be learning to program parallel
systems, so that we can maximize their performance. We'll
be using the C language with four different application
program interfaces or APIs: MPI, Pthreads, OpenMP, and
CUDA. These APIs are used to program parallel systems
that are classified according to how the cores access
memory and whether the individual cores can operate
independently of each other.
In the first classification, we distinguish between shared-
memory and distributed-memory systems. In a shared-
memory system, the cores share access to one large pool of
memory, and they can coordinate their actions by accessing
shared memory locations. In a distributed-memory system,
each core has its own private memory, and the cores can
coordinate their actions by sending messages across a
network.
In the second classification, we distinguish between
systems with cores that can operate independently of each
other and systems in which the cores all execute the same
instruction. In both types of system, the cores can operate
on their own data stream. So the first type of system is
called a multiple-instruction multiple-data or MIMD
system, and the second type of system is called a single-
instruction multiple-data or SIMD system.
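A hedged illustration of the kind of computation a SIMD
system handles well: a single instruction (here, an
addition) applied to many data items. The function and array
names are hypothetical:

    /* SIMD-friendly loop: the same operation is applied to every element,
       so a SIMD system can carry out several of the additions
       simultaneously on different elements of x and y. */
    void Vector_add(float x[], const float y[], int n) {
        for (int i = 0; i < n; i++)
            x[i] += y[i];
    }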
MPI is used for programming distributed-memory MIMD
systems. Pthreads is used for programming shared-memory
MIMD systems. OpenMP can be used to program both
shared-memory MIMD and shared-memory SIMD systems,
although we'll be looking at using it to program MIMD
systems. CUDA is used for programming Nvidia graphics
processing units or GPUs. GPUs have aspects of all four
types of system, but we'll be mainly interested in the
shared-memory SIMD and shared-memory MIMD aspects.
Concurrent programs can have multiple tasks in
progress at any instant. Parallel and distributed
programs usually have tasks that execute simultaneously.
There isn't a hard and fast distinction between parallel and
distributed, although in parallel programs, the tasks are
usually more tightly coupled.
Parallel programs are usually very complex. So it's even
more important to use good program development
techniques with parallel programs.
1.11 Exercises
1.1 Devise formulas for the functions that calculate
my_first_i and my_last_i in the global sum example.
Remember that each core should be assigned
roughly the same number of elements of
computations in the loop. Hint: First consider the
case when n is evenly divisible by p.
1.2 We've implicitly assumed that each call to
Compute_next_value requires roughly the same amount of
work as the other calls. How would you change your
answer to the preceding question if call i = k requires
k + 1 times as much work as the call with i = 0? How
would you change your answer if the first call (i = 0)
requires 2 milliseconds, the second call (i = 1)
requires 4, the third (i = 2) requires 6, and so on?
1.3 Try to write pseudocode for the tree-structured
global sum illustrated in Fig. 1.1. Assume the
number of cores is a power of two (1, 2, 4, 8, …).
Hint: Use a variable divisor to determine whether a
core should send its sum or receive and add. The
divisor should start with the value 2 and be doubled
after each iteration. Also use a variable
core_difference to determine which core should be
partnered with the current core. It should start with
the value 1 and also be doubled after each iteration.
For example, in the first iteration 0 % divisor = 0
and 1 % divisor = 1, so 0 receives and adds, while 1
sends. Also in the first iteration
0 + core_difference = 1 and 1 − core_difference = 0,
so 0 and 1 are paired in the first iteration.
1.4 As an alternative to the approach outlined in the
preceding problem, we can use C's bitwise operators
to implement the tree-structured global sum. To see
how this works, it helps to write down the binary
(base 2) representation of each of the core ranks and
note the pairings during each stage (× marks a core that
has dropped out of the sum):

Core        Stage 1     Stage 2     Stage 3
0 = 000₂    1 = 001₂    2 = 010₂    4 = 100₂
1 = 001₂    0 = 000₂    ×           ×
2 = 010₂    3 = 011₂    0 = 000₂    ×
3 = 011₂    2 = 010₂    ×           ×
4 = 100₂    5 = 101₂    6 = 110₂    0 = 000₂
5 = 101₂    4 = 100₂    ×           ×
6 = 110₂    7 = 111₂    4 = 100₂    ×
7 = 111₂    6 = 110₂    ×           ×
From the table, we see that during the first stage each
core is paired with the core whose rank differs in the
rightmost or first bit. During the second stage, cores
that continue are paired with the core whose rank
differs in the second bit; and during the third stage,
cores are paired with the core whose rank differs in
the third bit. Thus if we have a binary value bitmask that
is 001₂ for the first stage, 010₂ for the second, and
100₂ for the third, we can get the rank of the core
we're paired with by "inverting" the bit in our rank
that is nonzero in bitmask. This can be done using the
bitwise exclusive or (^) operator.
Implement this algorithm in pseudocode using the
bitwise exclusive or and the left-shift operator.
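As an illustration of the two operators just mentioned (not
a complete solution to the exercise), the partner
computation described above can be written in C roughly as
follows; the function and variable names are hypothetical:

    /* Hedged sketch: compute the rank of the core we're paired with at a
       given stage (stage 0, 1, 2, ...) by flipping the bit selected by
       bitmask.  Writing the loop over stages and deciding which core
       sends and which receives is left to the exercise. */
    unsigned Partner(unsigned my_rank, unsigned stage) {
        unsigned bitmask = 1u << stage;   /* 001, 010, 100, ... in binary */
        return my_rank ^ bitmask;         /* invert that bit of my_rank   */
    }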
1.5 What happens if your pseudocode in Exercise 1.3 or
Exercise 1.4 is run when the number of cores is not a
power of two (e.g., 3, 5, 6, 7)? Can you modify the
pseudocode so that it will work correctly regardless
of the number of cores?
1.6 Derive formulas for the number of receives and
additions that core 0 carries out using
a. the original pseudocode for a global sum, and
b. the tree-structured global sum.
Make a table showing the numbers of receives and
additions carried out by core 0 when the two sums
are used with 2, 4, 8, …, 1024 cores.
1.7 The first part of the global sum example—when
each core adds its assigned computed values—is
usually considered to be an example of data-
parallelism, while the second part of the first global
sum—when the cores send their partial sums to the
master core, which adds them—could be considered
to be an example of task-parallelism. What about the
second part of the second global sum—when the
cores use a tree structure to add their partial sums?
Is this an example of data- or task-parallelism? Why?
1.8 Suppose the faculty members are throwing a party
for the students in the department.
a. Identify tasks that can be assigned to the
faculty members that will allow them to use task-
parallelism when they prepare for the party.
Work out a schedule that shows when the various
tasks can be performed.
b. We might hope that one of the tasks in the
preceding part is cleaning the house where the
party will be held. How can we use data-
parallelism to partition the work of cleaning the
house among the faculty?
c. Use a combination of task- and data-parallelism
to prepare for the party. (If there's too much
work for the faculty, you can use TAs to pick up
the slack.)
1.9 Write an essay describing a research problem in
your major that would benefit from the use of
parallel computing. Provide a rough outline of how
parallelism would be used. Would you use task- or
data-parallelism?
Bibliography
[5] Clay Breshears, The Art of Concurrency: A Thread
Monkey's Guide to Writing Parallel Applications.
Sebastopol, CA: O'Reilly; 2009.
[28] John Hennessy, David Patterson, Computer
Architecture: A Quantitative Approach. 6th ed.
Burlington, MA: Morgan Kaufmann; 2019.
[31] IBM, IBM InfoSphere Streams v1.2.0 supports
highly complex heterogeneous data analysis, IBM
United States Software Announcement 210-037,
Feb. 23, 2010
http://www.ibm.com/common/ssi/rep_ca/7/897/ENUS
210-037/ENUS210-037.PDF.
[36] John Loeffler, No more transistors: the end of
Moore's Law, Interesting Engineering, Nov 29, 2018.
See https://interestingengineering.com/no-more-
transistors-the-end-of-moores-law.
Chapter 2: Parallel
hardware and parallel
software
It's perfectly feasible for specialists in disciplines other
than computer science and computer engineering to write
parallel programs. However, to write efficient parallel
programs, we often need some knowledge of the underlying
hardware and system software. It's also very useful to have
some knowledge of different types of parallel software, so
in this chapter we'll take a brief look at a few topics in
hardware and software. We'll also take a brief look at
evaluating program performance and a method for
developing parallel programs. We'll close with a discussion
of what kind of environment we might expect to be working
in, and a few rules and assumptions we'll make in the rest
of the book.
This is a long, broad chapter, so it may be a good idea to
skim through some of the sections on a first reading so that
you have a good idea of what's in the chapter. Then, when a
concept or term in a later chapter isn't quite clear, it may
be helpful to refer back to this chapter. In particular, you
may want to skim over most of the material in
“Modifications to the von Neumann Model,” except “The
Basics of Caching.” Also, in the “Parallel Hardware”
section, you can safely skim the material on
“Interconnection Networks.” You can also skim the material
on “SIMD Systems” unless you're planning to read the
chapter on CUDA programming.