An Introduction to Parallel
Programming
SECOND EDITION
Peter S. Pacheco
University of San Francisco
Matthew Malensek
University of San Francisco
Table of Contents
Cover image
Title page
Copyright
Dedication
Preface
1.10. Summary
1.11. Exercises
Bibliography
2.6. Performance
2.9. Assumptions
2.10. Summary
2.11. Exercises
Bibliography
3.8. Summary
3.9. Exercises
Bibliography
4.5. Busy-waiting
4.6. Mutexes
4.11. Thread-safety
4.12. Summary
4.13. Exercises
Bibliography
5.10. Tasking
5.11. Thread-safety
5.12. Summary
5.13. Exercises
Bibliography
6.13. CUDA trapezoidal rule III: blocks with more than one warp
6.15. Summary
6.16. Exercises
Bibliography
7.5. Summary
7.6. Exercises
Bibliography
Bibliography
Bibliography
Bibliography
Index
Copyright
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139,
United States
Notices
Knowledge and best practice in this field are constantly
changing. As new research and experience broaden our
understanding, changes in research methods,
professional practices, or medical treatment may become
necessary.
Practitioners and researchers must always rely on their
own experience and knowledge in evaluating and using
any information, methods, compounds, or experiments
described herein. In using such information or methods
they should be mindful of their own safety and the safety
of others, including parties for whom they have a
professional responsibility.
ISBN: 978-0-12-804605-0
Typeset by VTeX
Printed in United States of America
Here the prefix my_ indicates that each core is using its own, private variables, and each core can execute this block of code independently of the other cores.
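For reference, here is a minimal C sketch of the kind of per-core block being described. The my_ prefix and the Compute_next_value call follow the book's running example; Compute_next_value is stubbed out here, and the range [my_first_i, my_last_i) is assumed to have already been assigned to the core (Exercise 1.1 asks how to compute it).

    /* Stand-in for the book's Compute_next_value(...): any function that
       produces the i-th value to be summed. */
    double Compute_next_value(int i) { return (double)(i % 10); }

    /* The block each core executes independently; every variable prefixed
       with my_ is private to that core. */
    double Core_partial_sum(int my_first_i, int my_last_i) {
        double my_sum = 0.0;
        for (int my_i = my_first_i; my_i < my_last_i; my_i++) {
            double my_x = Compute_next_value(my_i);
            my_sum += my_x;
        }
        return my_sum;
    }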
After each core completes execution of this code, its my_sum variable will store the sum of the values computed by its calls to Compute_next_value. For example, if there are eight cores, n = 24, and the 24 calls to Compute_next_value return the values

1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9,

then the values stored in my_sum might be

Core      0    1    2    3    4    5    6    7
my_sum    8   19    7   15    7   13   12   14
The idea here is that each core will wait in the synchronization function until all the cores have entered the function—in particular, until the master core has entered this function.
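The synchronization function itself is left unspecified here. As one concrete illustration (an assumption for this sketch, not the book's construct), a POSIX threads barrier has exactly this behavior: every thread that calls pthread_barrier_wait blocks until all of the expected threads have made the call.

    #include <pthread.h>

    /* Illustrative barrier shared by all threads of the program.  Each
       thread that calls Synchronize blocks until all thread_count threads
       have reached the same call. */
    static pthread_barrier_t barrier;

    void Initialize_sync(unsigned thread_count) {
        pthread_barrier_init(&barrier, NULL, thread_count);
    }

    void Synchronize(void) {
        pthread_barrier_wait(&barrier);   /* no thread returns until all arrive */
    }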
Currently, the most powerful parallel programs are
written using explicit parallel constructs, that is, they are
written using extensions to languages such as C, C++, and
Java. These programs include explicit instructions for
parallelism: core 0 executes task 0, core 1 executes task 1,
…, all cores synchronize, …, and so on, so such programs
are often extremely complex. Furthermore, the complexity
of modern cores often makes it necessary to use
considerable care in writing the code that will be executed
by a single core.
There are other options for writing parallel programs—
for example, higher level languages—but they tend to
sacrifice performance to make program development
somewhat easier.
1.10 Summary
For many years we've reaped the benefits of having ever-
faster processors. However, because of physical limitations,
the rate of performance improvement in conventional
processors has decreased dramatically. To increase the
power of processors, chipmakers have turned to multicore
integrated circuits, that is, integrated circuits with multiple
conventional processors on a single chip.
Ordinary serial programs, which are programs written
for a conventional single-core processor, usually cannot
exploit the presence of multiple cores, and it's unlikely that
translation programs will be able to shoulder all the work
of converting serial programs into parallel programs—
programs that can make use of multiple cores. As software
developers, we need to learn to write parallel programs.
When we write parallel programs, we usually need to
coordinate the work of the cores. This can involve
communication among the cores, load balancing, and
synchronization of the cores.
In this book we'll be learning to program parallel
systems, so that we can maximize their performance. We'll
be using the C language with four different application
program interfaces or APIs: MPI, Pthreads, OpenMP, and
CUDA. These APIs are used to program parallel systems
that are classified according to how the cores access
memory and whether the individual cores can operate
independently of each other.
In the first classification, we distinguish between shared-
memory and distributed-memory systems. In a shared-
memory system, the cores share access to one large pool of
memory, and they can coordinate their actions by accessing
shared memory locations. In a distributed-memory system,
each core has its own private memory, and the cores can
coordinate their actions by sending messages across a
network.
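As a concrete illustration of the distributed-memory style, here is a minimal sketch using MPI, which is covered in a later chapter; the program below is ours, not an example from the book. Two processes coordinate by sending a message across the network. In a shared-memory program the same coordination could be done by having the threads read and write a shared variable, typically protected by a mutex.

    #include <mpi.h>
    #include <stdio.h>

    /* Two processes coordinate by message passing: process 1 sends an int
       to process 0, which prints it.  Run with, e.g.,
       mpiexec -n 2 ./coordination (assumes a working MPI installation). */
    int main(int argc, char* argv[]) {
        int my_rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        if (my_rank == 1) {
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (my_rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Process 0 received %d from process 1\n", value);
        }
        MPI_Finalize();
        return 0;
    }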
In the second classification, we distinguish between
systems with cores that can operate independently of each
other and systems in which the cores all execute the same
instruction. In both types of system, the cores can operate
on their own data stream. So the first type of system is
called a multiple-instruction multiple-data or MIMD
system, and the second type of system is called a single-
instruction multiple-data or SIMD system.
MPI is used for programming distributed-memory MIMD
systems. Pthreads is used for programming shared-memory
MIMD systems. OpenMP can be used to program both
shared-memory MIMD and shared-memory SIMD systems,
although we'll be looking at using it to program MIMD
systems. CUDA is used for programming Nvidia graphics
processing units or GPUs. GPUs have aspects of all four
types of system, but we'll be mainly interested in the
shared-memory SIMD and shared-memory MIMD aspects.
Concurrent programs can have multiple tasks in
progress at any instant. Parallel and distributed
programs usually have tasks that execute simultaneously.
There isn't a hard and fast distinction between parallel and
distributed, although in parallel programs, the tasks are
usually more tightly coupled.
Parallel programs are usually very complex. So it's even
more important to use good program development
techniques with parallel programs.
1.11 Exercises
1.1 Devise formulas for the functions that calculate my_first_i and my_last_i in the global sum example. Remember that each core should be assigned roughly the same number of computations in the loop. Hint: First consider the case when n is evenly divisible by p. (One common choice is sketched below.)
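There is no unique answer, but the following C sketch shows one standard block partition, handling the case where p does not divide n evenly by giving the first n mod p cores one extra computation. The helper name Block_range is ours; my_first_i and my_last_i follow the book's naming.

    /* One standard block partition of n computations among p cores.
       Cores 0 .. r-1 (where r = n % p) get q+1 computations; the rest get q. */
    void Block_range(int my_rank, int p, int n,
                     int* my_first_i, int* my_last_i) {
        int q = n / p;
        int r = n % p;
        if (my_rank < r) {
            *my_first_i = my_rank * (q + 1);
            *my_last_i  = *my_first_i + (q + 1);
        } else {
            *my_first_i = r * (q + 1) + (my_rank - r) * q;
            *my_last_i  = *my_first_i + q;
        }
    }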
1.2 We've implicitly assumed that each call to Compute_next_value requires roughly the same amount of work as the other calls. How would you change your answer to the preceding question if call i = k requires k + 1 times as much work as the call with i = 0? How would you change your answer if the first call (i = 0) requires 2 milliseconds, the second call (i = 1) requires 4, the third (i = 2) requires 6, and so on?
1.3 Try to write pseudocode for the tree-structured global sum illustrated in Fig. 1.1. Assume the number of cores is a power of two (1, 2, 4, 8, …). Hint: Use a variable divisor to determine whether a core should send its sum or receive and add. The divisor should start with the value 2 and be doubled after each iteration. Also use a variable core_difference to determine which core should be partnered with the current core. It should start with the value 1 and also be doubled after each iteration. For example, in the first iteration 0 % divisor = 0 and 1 % divisor = 1, so 0 receives and adds, while 1 sends. Also in the first iteration 0 + core_difference = 1 and 1 − core_difference = 0, so 0 and 1 are paired in the first iteration.
1.4 As an alternative to the approach outlined in the preceding problem, we can use C's bitwise operators to implement the tree-structured global sum. To see how this works, it helps to write down the binary (base 2) representation of each of the core ranks and note the pairings during each stage:

Core   Binary rank   Stage 1 partner   Stage 2 partner   Stage 3 partner
 0         000              1                 2                 4
 1         001              0                 -                 -
 2         010              3                 0                 -
 3         011              2                 -                 -
 4         100              5                 6                 0
 5         101              4                 -                 -
 6         110              7                 4                 -
 7         111              6                 -                 -
From the table, we see that during the first stage each core is paired with the core whose rank differs in the rightmost or first bit. During the second stage, cores that continue are paired with the core whose rank differs in the second bit; and during the third stage, cores are paired with the core whose rank differs in the third bit. Thus if we have a binary value bitmask that is 001₂ for the first stage, 010₂ for the second, and 100₂ for the third, we can get the rank of the core we're paired with by "inverting" the bit in our rank that is nonzero in bitmask. This can be done using the bitwise exclusive or ∧ operator. (A sketch of this pairing computation follows the exercise.)
Implement this algorithm in pseudocode using the
bitwise exclusive or and the left-shift operator.
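To make the bit manipulation concrete, here is a small C sketch of just the pairing and continue/send decision described above; the sends, receives, and additions themselves are left to the exercise. The variable name bitmask comes from the exercise; the helper name Tree_sum_pairings is ours, and p is assumed to be a power of two.

    #include <stdio.h>

    /* For each stage, decide whether this core continues (receives and adds)
       or sends, and find its partner by flipping one bit of its rank with
       bitwise exclusive or. */
    void Tree_sum_pairings(int my_rank, int p) {
        for (int bitmask = 1; bitmask < p; bitmask <<= 1) {
            int partner = my_rank ^ bitmask;   /* rank differing in exactly one bit */
            if ((my_rank & bitmask) == 0) {
                /* This core's bit is 0: it receives partner's partial sum and adds. */
                printf("core %d receives from core %d\n", my_rank, partner);
            } else {
                /* This core's bit is 1: it sends its partial sum and is done. */
                printf("core %d sends to core %d\n", my_rank, partner);
                break;
            }
        }
    }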
1.5 What happens if your pseudocode in Exercise 1.3 or
Exercise 1.4 is run when the number of cores is not a
power of two (e.g., 3, 5, 6, 7)? Can you modify the
pseudocode so that it will work correctly regardless
of the number of cores?
1.6 Derive formulas for the number of receives and
additions that core 0 carries out using
a. the original pseudocode for a global sum, and
b. the tree-structured global sum.
Make a table showing the numbers of receives and
additions carried out by core 0 when the two sums
are used with 2, 4, 8, …, 1024 cores.
1.7 The first part of the global sum example—when
each core adds its assigned computed values—is
usually considered to be an example of data-
parallelism, while the second part of the first global
sum—when the cores send their partial sums to the
master core, which adds them—could be considered
to be an example of task-parallelism. What about the
second part of the second global sum—when the
cores use a tree structure to add their partial sums?
Is this an example of data- or task-parallelism? Why?
1.8 Suppose the faculty members are throwing a party
for the students in the department.
a. Identify tasks that can be assigned to the
faculty members that will allow them to use task-
parallelism when they prepare for the party.
Work out a schedule that shows when the various
tasks can be performed.
b. We might hope that one of the tasks in the
preceding part is cleaning the house where the
party will be held. How can we use data-
parallelism to partition the work of cleaning the
house among the faculty?
c. Use a combination of task- and data-parallelism
to prepare for the party. (If there's too much
work for the faculty, you can use TAs to pick up
the slack.)
1.9 Write an essay describing a research problem in
your major that would benefit from the use of
parallel computing. Provide a rough outline of how
parallelism would be used. Would you use task- or
data-parallelism?
Chapter 2: Parallel hardware and parallel software
It's perfectly feasible for specialists in disciplines other
than computer science and computer engineering to write
parallel programs. However, to write efficient parallel
programs, we often need some knowledge of the underlying
hardware and system software. It's also very useful to have
some knowledge of different types of parallel software, so
in this chapter we'll take a brief look at a few topics in
hardware and software. We'll also take a brief look at
evaluating program performance and a method for
developing parallel programs. We'll close with a discussion
of what kind of environment we might expect to be working
in, and a few rules and assumptions we'll make in the rest
of the book.
This is a long, broad chapter, so it may be a good idea to
skim through some of the sections on a first reading so that
you have a good idea of what's in the chapter. Then, when a
concept or term in a later chapter isn't quite clear, it may
be helpful to refer back to this chapter. In particular, you
may want to skim over most of the material in
“Modifications to the von Neumann Model,” except “The
Basics of Caching.” Also, in the “Parallel Hardware”
section, you can safely skim the material on
“Interconnection Networks.” You can also skim the material
on “SIMD Systems” unless you're planning to read the
chapter on CUDA programming.