An Introduction to Parallel
Programming

SECOND EDITION

Peter S. Pacheco
University of San Francisco

Matthew Malensek
University of San Francisco
Table of Contents

Cover image

Title page

Copyright

Dedication

Preface

Chapter 1: Why parallel computing

1.1. Why we need ever-increasing performance

1.2. Why we're building parallel systems

1.3. Why we need to write parallel programs

1.4. How do we write parallel programs?

1.5. What we'll be doing

1.6. Concurrent, parallel, distributed

1.7. The rest of the book


1.8. A word of warning

1.9. Typographical conventions

1.10. Summary

1.11. Exercises

Bibliography

Chapter 2: Parallel hardware and parallel software

2.1. Some background

2.2. Modifications to the von Neumann model

2.3. Parallel hardware

2.4. Parallel software

2.5. Input and output

2.6. Performance

2.7. Parallel program design

2.8. Writing and running parallel programs

2.9. Assumptions

2.10. Summary

2.11. Exercises

Bibliography

Chapter 3: Distributed memory programming with MPI


3.1. Getting started

3.2. The trapezoidal rule in MPI

3.3. Dealing with I/O

3.4. Collective communication

3.5. MPI-derived datatypes

3.6. Performance evaluation of MPI programs

3.7. A parallel sorting algorithm

3.8. Summary

3.9. Exercises

3.10. Programming assignments

Bibliography

Chapter 4: Shared-memory programming with Pthreads

4.1. Processes, threads, and Pthreads

4.2. Hello, world

4.3. Matrix-vector multiplication

4.4. Critical sections

4.5. Busy-waiting

4.6. Mutexes

4.7. Producer–consumer synchronization and semaphores

4.8. Barriers and condition variables


4.9. Read-write locks

4.10. Caches, cache-coherence, and false sharing

4.11. Thread-safety

4.12. Summary

4.13. Exercises

4.14. Programming assignments

Bibliography

Chapter 5: Shared-memory programming with OpenMP

5.1. Getting started

5.2. The trapezoidal rule

5.3. Scope of variables

5.4. The reduction clause

5.5. The parallel for directive

5.6. More about loops in OpenMP: sorting

5.7. Scheduling loops

5.8. Producers and consumers

5.9. Caches, cache coherence, and false sharing

5.10. Tasking

5.11. Thread-safety

5.12. Summary
5.13. Exercises

5.14. Programming assignments

Bibliography

Chapter 6: GPU programming with CUDA

6.1. GPUs and GPGPU

6.2. GPU architectures

6.3. Heterogeneous computing

6.4. CUDA hello

6.5. A closer look

6.6. Threads, blocks, and grids

6.7. Nvidia compute capabilities and device architectures

6.8. Vector addition

6.9. Returning results from CUDA kernels

6.10. CUDA trapezoidal rule I

6.11. CUDA trapezoidal rule II: improving performance

6.12. Implementation of trapezoidal rule with warpSize thread


blocks

6.13. CUDA trapezoidal rule III: blocks with more than one warp

6.14. Bitonic sort

6.15. Summary
6.16. Exercises

6.17. Programming assignments

Bibliography

Chapter 7: Parallel program development

7.1. Two n-body solvers

7.2. Sample sort

7.3. A word of caution

7.4. Which API?

7.5. Summary

7.6. Exercises

7.7. Programming assignments

Bibliography

Chapter 8: Where to go from here

Bibliography

Bibliography

Bibliography

Index
Copyright
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139,
United States

Copyright © 2022 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or


transmitted in any form or by any means, electronic or
mechanical, including photocopying, recording, or any
information storage and retrieval system, without
permission in writing from the publisher. Details on how to
seek permission, further information about the Publisher's
permissions policies and our arrangements with
organizations such as the Copyright Clearance Center and
the Copyright Licensing Agency, can be found at our
website: www.elsevier.com/permissions.

This book and the individual contributions contained in it


are protected under copyright by the Publisher (other than
as may be noted herein).
Cover art: “seven notations,” nickel/silver etched plates,
acrylic on wood structure, copyright © Holly Cohn

Notices
Knowledge and best practice in this field are constantly
changing. As new research and experience broaden our
understanding, changes in research methods,
professional practices, or medical treatment may become
necessary.
Practitioners and researchers must always rely on their
own experience and knowledge in evaluating and using
any information, methods, compounds, or experiments
described herein. In using such information or methods
they should be mindful of their own safety and the safety
of others, including parties for whom they have a
professional responsibility.

To the fullest extent of the law, neither the Publisher nor


the authors, contributors, or editors, assume any liability
for any injury and/or damage to persons or property as a
matter of products liability, negligence or otherwise, or
from any use or operation of any methods, products,
instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data


A catalog record for this book is available from the Library
of Congress

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the
British Library

ISBN: 978-0-12-804605-0

For information on all Morgan Kaufmann publications


visit our website at https://www.elsevier.com/books-and-
journals

Publisher: Katey Birtcher


Acquisitions Editor: Stephen Merken
Content Development Manager: Meghan Andress
Publishing Services Manager: Shereen Jameel
Production Project Manager: Rukmani Krishnan
Designer: Victoria Pearson

Typeset by VTeX
Printed in United States of America

Last digit is the print number: 9 8 7 6 5 4 3 2 1


Dedication

To the memory of Robert S. Miller


Preface
Parallel hardware has been ubiquitous for some time
now: it's difficult to find a laptop, desktop, or server that
doesn't use a multicore processor. Cluster computing is
nearly as common today as high-powered workstations
were in the 1990s, and cloud computing is making
distributed-memory systems as accessible as desktops. In
spite of this, most computer science majors graduate with
little or no experience in parallel programming. Many
colleges and universities offer upper-division elective
courses in parallel computing, but since most computer
science majors have to take a large number of required
courses, many graduate without ever writing a
multithreaded or multiprocess program.
It seems clear that this state of affairs needs to change.
Whereas many programs can obtain satisfactory
performance on a single core, computer scientists should
be made aware of the potentially vast performance
improvements that can be obtained with parallelism, and
they should be able to exploit this potential when the need
arises.
An Introduction to Parallel Programming was written to
partially address this problem. It provides an introduction
to writing parallel programs using MPI, Pthreads, OpenMP,
and CUDA, four of the most widely used APIs for parallel
programming. The intended audience is students and
professionals who need to write parallel programs. The
prerequisites are minimal: a college-level course in
mathematics and the ability to write serial programs in C.
The prerequisites are minimal, because we believe that
students should be able to start programming parallel
systems as early as possible. At the University of San
Francisco, computer science students can fulfill a
requirement for the major by taking a course on which this
text is based immediately after taking the “Introduction to
Computer Science I” course that most majors take in the
first semester of their freshman year. It has been our
experience that there really is no reason for students to
defer writing parallel programs until their junior or senior
year. To the contrary, the course is popular, and students
have found that using concurrency in other courses is much
easier after having taken this course.
If second-semester freshmen can learn to write parallel
programs by taking a class, then motivated computing
professionals should be able to learn to write parallel
programs through self-study. We hope this book will prove
to be a useful resource for them.
The Second Edition
It has been nearly ten years since the first edition of
An Introduction to Parallel Programming was published.
During that time much has changed in the world of parallel
programming, but, perhaps surprisingly, much also remains
the same. Our intent in writing this second edition has been
to preserve the material from the first edition that
continues to be generally useful, but also to add new
material where we felt it was needed.
The most obvious addition is the inclusion of a new
chapter on CUDA programming. When the first edition was
published, CUDA was still very new. It was already clear
that the use of GPUs in high-performance computing would
become very widespread, but at that time we felt that
GPGPU wasn't readily accessible to programmers with
relatively little experience. In the last ten years, that has
clearly changed. Of course, CUDA is not a standard, and
features are added, modified, and deleted with great
rapidity. As a consequence, authors who use CUDA must
present a subject that changes much faster than a
standard, such as MPI, Pthreads, or OpenMP. In spite of
this, we hope that our presentation of CUDA will continue
to be useful for some time.
Another big change is that Matthew Malensek has come
onboard as a coauthor. Matthew is a relatively new
colleague at the University of San Francisco, but he has
extensive experience with both the teaching and
application of parallel computing. His contributions have
greatly improved the second edition.
About This Book
As we noted earlier, the main purpose of the book is to
teach parallel programming in MPI, Pthreads, OpenMP, and
CUDA to an audience with a limited background in
computer science and no previous experience with
parallelism. We also wanted to make the book as flexible as
possible so that readers who have no interest in learning
one or two of the APIs can still read the remaining material
with little effort. Thus the chapters on the four APIs are
largely independent of each other: they can be read in any
order, and one or two of these chapters can be omitted.
This independence has some cost: it was necessary to
repeat some of the material in these chapters. Of course,
repeated material can be simply scanned or skipped.
On the other hand, readers with no prior experience with
parallel computing should read Chapter 1 first. This
chapter attempts to provide a relatively nontechnical
explanation of why parallel systems have come to dominate
the computer landscape. It also provides a short
introduction to parallel systems and parallel programming.
Chapter 2 provides technical background on computer
hardware and software. Chapters 3 to 6 provide
independent introductions to MPI, Pthreads, OpenMP, and
CUDA, respectively. Chapter 7 illustrates the development
of two different parallel programs using each of the four
APIs. Finally, Chapter 8 provides a few pointers to
additional information on parallel computing.
We use the C programming language for developing our
programs, because all four API's have C-language
interfaces, and, since C is such a small language, it is a
relatively easy language to learn—especially for C++ and
Java programmers, since they will already be familiar with
C's control structures.
Classroom Use
This text grew out of a lower-division undergraduate
course at the University of San Francisco. The course
fulfills a requirement for the computer science major, and it
also fulfills a prerequisite for the undergraduate operating
systems, architecture, and networking courses. The course
begins with a four-week introduction to C programming.
Since most of the students have already written Java
programs, the bulk of this introduction is devoted to the
use of pointers in C.1 The remainder of the course provides
introductions first to programming in MPI, then Pthreads
and/or OpenMP, and it finishes with material covering
CUDA.
We cover most of the material in Chapters 1, 3, 4, 5, and
6, and parts of the material in Chapters 2 and 7. The
background in Chapter 2 is introduced as the need arises.
For example, before discussing cache coherence issues in
OpenMP (Chapter 5), we cover the material on caches in
Chapter 2.
The coursework consists of weekly homework
assignments, five programming assignments, a couple of
midterms and a final exam. The homework assignments
usually involve writing a very short program or making a
small modification to an existing program. Their purpose is
to ensure that the students stay current with the
coursework, and to give the students hands-on experience
with ideas introduced in class. It seems likely that their
existence has been one of the principal reasons for the
course's success. Most of the exercises in the text are
suitable for these brief assignments.
The programming assignments are larger than the
programs written for homework, but we typically give the
students a good deal of guidance: we'll frequently include
pseudocode in the assignment and discuss some of the
more difficult aspects in class. This extra guidance is often
crucial: it's easy to give programming assignments that will
take far too long for the students to complete.
The results of the midterms and finals and the
enthusiastic reports of the professor who teaches operating
systems suggest that the course is actually very successful
in teaching students how to write parallel programs.
For more advanced courses in parallel computing, the
text and its online supporting materials can serve as a
supplement so that much of the material on the syntax and
semantics of the four APIs can be assigned as outside
reading.
The text can also be used as a supplement for project-
based courses and courses outside of computer science
that make use of parallel computation.
Support Materials
An online companion site for the book is located at
www.elsevier.com/books-and-journals/book-
companion/9780128046050. This site will include errata
and complete source for the longer programs we discuss in
the text. Additional material for instructors, including
downloadable figures and solutions to the exercises in the
book, can be downloaded from
https://educate.elsevier.com/9780128046050.
We would greatly appreciate readers' letting us know of
any errors they find. Please send email to
mmalensek@usfca.edu if you do find a mistake.
Acknowledgments
In the course of working on this book we've received
considerable help from many individuals. Among them we'd
like to thank the reviewers of the second edition, Steven
Frankel (Technion) and Il-Hyung Cho (Saginaw Valley State
University), who read and commented on draft versions of
the new CUDA chapter. We'd also like to thank the
reviewers who read and commented on the initial proposal
for the book: Fikret Ercal (Missouri University of Science
and Technology), Dan Harvey (Southern Oregon
University), Joel Hollingsworth (Elon University), Jens
Mache (Lewis and Clark College), Don McLaughlin (West
Virginia University), Manish Parashar (Rutgers University),
Charlie Peck (Earlham College), Stephen C. Renk (North
Central College), Rolfe Josef Sassenfeld (The University of
Texas at El Paso), Joseph Sloan (Wofford College), Michela
Taufer (University of Delaware), Pearl Wang (George Mason
University), Bob Weems (University of Texas at Arlington),
and Cheng-Zhong Xu (Wayne State University). We are also
deeply grateful to the following individuals for their
reviews of various chapters of the book: Duncan Buell
(University of South Carolina), Matthias Gobbert
(University of Maryland, Baltimore County), Krishna Kavi
(University of North Texas), Hong Lin (University of
Houston–Downtown), Kathy Liszka (University of Akron),
Leigh Little (The State University of New York), Xinlian Liu
(Hood College), Henry Tufo (University of Colorado at
Boulder), Andrew Sloss (Consultant Engineer, ARM), and
Gengbin Zheng (University of Illinois). Their comments and
suggestions have made the book immeasurably better. Of
course, we are solely responsible for remaining errors and
omissions.
Slides and the solutions manual for the first edition were
prepared by Kathy Liszka and Jinyoung Choi, respectively.
Thanks to both of them.
The staff at Elsevier has been very helpful throughout
this project. Nate McFadden helped with the development
of the text. Todd Green and Steve Merken were the
acquisitions editors. Meghan Andress was the content
development manager. Rukmani Krishnan was the
production editor. Victoria Pearson was the designer. They
did a great job, and we are very grateful to all of them.
Our colleagues in the computer science and mathematics
departments at USF have been extremely helpful during
our work on the book. Peter would like to single out Prof.
Gregory Benson for particular thanks: his understanding of
parallel computing—especially Pthreads and semaphores—
has been an invaluable resource. We're both very grateful
to our system administrators, Alexey Fedosov and Elias
Husary. They've patiently and efficiently dealt with all of
the “emergencies” that cropped up while we were working
on programs for the book. They've also done an amazing
job of providing us with the hardware we used to do all
program development and testing.
Peter would never have been able to finish the book
without the encouragement and moral support of his
friends Holly Cohn, John Dean, and Maria Grant. He will
always be very grateful for their help and their friendship.
He is especially grateful to Holly for allowing us to use her
work, seven notations, for the cover.
Matthew would like to thank his colleagues in the USF
Department of Computer Science, as well as Maya
Malensek and Doyel Sadhu, for their love and support.
Most of all, he would like to thank Peter Pacheco for being
a mentor and infallible source of advice and wisdom during
the formative years of his career in academia.
Our biggest debt is to our students. As always, they
showed us what was too easy and what was far too difficult.
They taught us how to teach parallel computing. Our
deepest thanks to all of them.
1 “Interestingly, a number of students have said that they
found the use of C pointers more difficult than MPI
programming.”
Chapter 1: Why parallel
computing
From 1986 to 2003, the performance of microprocessors
increased, on average, more than 50% per year [28]. This
unprecedented increase meant that users and software
developers could often simply wait for the next generation
of microprocessors to obtain increased performance from
their applications. Since 2003, however, single-processor
performance improvement has slowed to the point that in
the period from 2015 to 2017, it increased at less than 4%
per year [28]. This difference is dramatic: at 50% per year,
performance will increase by almost a factor of 60 in 10
years, while at 4%, it will increase by about a factor of 1.5.
Furthermore, this difference in performance increase has
been associated with a dramatic change in processor
design. By 2005, most of the major manufacturers of
microprocessors had decided that the road to rapidly
increasing performance lay in the direction of parallelism.
Rather than trying to continue to develop ever-faster
monolithic processors, manufacturers started putting
multiple complete processors on a single integrated circuit.
This change has a very important consequence for
software developers: simply adding more processors will
not magically improve the performance of the vast majority
of serial programs, that is, programs that were written to
run on a single processor. Such programs are unaware of
the existence of multiple processors, and the performance
of such a program on a system with multiple processors
will be effectively the same as its performance on a single
processor of the multiprocessor system.
All of this raises a number of questions:

• Why do we care? Aren't single-processor systems


fast enough?
• Why can't microprocessor manufacturers continue
to develop much faster single-processor systems?
Why build parallel systems? Why build systems
with multiple processors?
• Why can't we write programs that will automatically
convert serial programs into parallel programs,
that is, programs that take advantage of the
presence of multiple processors?

Let's take a brief look at each of these questions. Keep in


mind, though, that some of the answers aren't carved in
stone. For example, the performance of many applications
may already be more than adequate.

1.1 Why we need ever-increasing performance


The vast increases in computational power that we've been
enjoying for decades now have been at the heart of many of
the most dramatic advances in fields as diverse as science,
the Internet, and entertainment. For example, decoding the
human genome, ever more accurate medical imaging,
astonishingly fast and accurate Web searches, and ever
more realistic and responsive computer games would all
have been impossible without these increases. Indeed,
more recent increases in computational power would have
been difficult, if not impossible, without earlier increases.
But we can never rest on our laurels. As our computational
power increases, the number of problems that we can
seriously consider solving also increases. Here are a few
examples:

• Climate modeling. To better understand climate


change, we need far more accurate computer
models, models that include interactions between
the atmosphere, the oceans, solid land, and the ice
caps at the poles. We also need to be able to make
detailed studies of how various interventions might
affect the global climate.
• Protein folding. It's believed that misfolded proteins
may be involved in diseases such as Huntington's,
Parkinson's, and Alzheimer's, but our ability to study
configurations of complex molecules such as
proteins is severely limited by our current
computational power.
• Drug discovery. There are many ways in which
increased computational power can be used in
research into new medical treatments. For example,
there are many drugs that are effective in treating a
relatively small fraction of those suffering from some
disease. It's possible that we can devise alternative
treatments by careful analysis of the genomes of the
individuals for whom the known treatment is
ineffective. This, however, will involve extensive
computational analysis of genomes.
• Energy research. Increased computational power
will make it possible to program much more detailed
models of technologies, such as wind turbines, solar
cells, and batteries. These programs may provide
the information needed to construct far more
efficient clean energy sources.
• Data analysis. We generate tremendous amounts of
data. By some estimates, the quantity of data stored
worldwide doubles every two years [31], but the vast
majority of it is largely useless unless it's analyzed.
As an example, knowing the sequence of nucleotides
in human DNA is, by itself, of little use.
Understanding how this sequence affects
development and how it can cause disease requires
extensive analysis. In addition to genomics, huge
quantities of data are generated by particle
colliders, such as the Large Hadron Collider at
CERN, medical imaging, astronomical research, and
Web search engines—to name a few.

These and a host of other problems won't be solved without


tremendous increases in computational power.

1.2 Why we're building parallel systems


Much of the tremendous increase in single-processor
performance was driven by the ever-increasing density of
transistors—the electronic switches—on integrated circuits.
As the size of transistors decreases, their speed can be
increased, and the overall speed of the integrated circuit
can be increased. However, as the speed of transistors
increases, their power consumption also increases. Most of
this power is dissipated as heat, and when an integrated
circuit gets too hot, it becomes unreliable. In the first
decade of the twenty-first century, air-cooled integrated
circuits reached the limits of their ability to dissipate heat
[28].
Therefore it is becoming impossible to continue to
increase the speed of integrated circuits. Indeed, in the last
few years, the increase in transistor density has slowed
dramatically [36].
But given the potential of computing to improve our
existence, there is a moral imperative to continue to
increase computational power.
How then, can we continue to build ever more powerful
computers? The answer is parallelism. Rather than building
ever-faster, more complex, monolithic processors, the
industry has decided to put multiple, relatively simple,
complete processors on a single chip. Such integrated
circuits are called multicore processors, and core has
become synonymous with central processing unit, or CPU.
In this setting a conventional processor with one CPU is
often called a single-core system.
1.3 Why we need to write parallel programs
Most programs that have been written for conventional,
single-core systems cannot exploit the presence of multiple
cores. We can run multiple instances of a program on a
multicore system, but this is often of little help. For
example, being able to run multiple instances of our
favorite game isn't really what we want—we want the
program to run faster with more realistic graphics. To do
this, we need to either rewrite our serial programs so that
they're parallel, so that they can make use of multiple
cores, or write translation programs, that is, programs that
will automatically convert serial programs into parallel
programs. The bad news is that researchers have had very
limited success writing programs that convert serial
programs in languages such as C, C++, and Java into
parallel programs.
This isn't terribly surprising. While we can write
programs that recognize common constructs in serial
programs, and automatically translate these constructs into
efficient parallel constructs, the sequence of parallel
constructs may be terribly inefficient. For example, we can
view the multiplication of two matrices as a sequence
of dot products, but parallelizing a matrix multiplication as
a sequence of parallel dot products is likely to be fairly slow
on many systems.
An efficient parallel implementation of a serial program
may not be obtained by finding efficient parallelizations of
each of its steps. Rather, the best parallelization may be
obtained by devising an entirely new algorithm.
As an example, suppose that we need to compute n
values and add them together. We know that this can be
done with the following serial code:
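A minimal C sketch of such a loop; here Compute_next_value is a hypothetical function standing in for whatever computation produces the ith value:

    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double x = Compute_next_value(i);  /* hypothetical: produces the ith value */
        sum += x;
    }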
Now suppose we also have p cores and p is much smaller than n. Then each
core can form a partial sum of approximately n/p values:
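A sketch of the block each core might execute. The names are assumptions: Compute_next_value is the same hypothetical function as above, and my_first_i and my_last_i are the first and one-past-last indices assigned to this core (Exercise 1.1 asks for formulas for them):

    double my_sum = 0.0;
    for (int my_i = my_first_i; my_i < my_last_i; my_i++) {
        double my_x = Compute_next_value(my_i);  /* this core's next value */
        my_sum += my_x;                          /* accumulate into the core's private sum */
    }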

Here the prefix my_ indicates that each core is using its own,
private variables, and each core can execute this block of
code independently of the other cores.
After each core completes execution of this code, its
my_sum variable will store the sum of the values computed by
its calls to Compute_next_value. For example, if there are eight
cores, n = 24, and the 24 calls to Compute_next_value return the
values

1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5,
1, 2, 3, 9,
then the values stored in my_sum might be

Core      0    1    2    3    4    5    6    7
my_sum    8   19    7   15    7   13   12   14

Here we're assuming the cores are identified by


nonnegative integers in the range 0, 1, …, p − 1, where p is the
number of cores.
When the cores are done computing their values of my_sum,
they can form a global sum by sending their results to a
designated “master” core, which can add their results:
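In outline, the exchange might look like the following sketch. Send_to and Receive_from are hypothetical placeholders for whatever message-passing calls the API provides (MPI's versions appear in Chapter 3):

    if (my_rank == 0) {                          /* the designated master core */
        double sum = my_sum;
        for (int core = 1; core < p; core++) {
            double value = Receive_from(core);   /* hypothetical receive */
            sum += value;
        }
    } else {
        Send_to(0, my_sum);                      /* hypothetical send to the master */
    }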

In our example, if the master core is core 0, it would add
the my_sum values of cores 1 through 7 to its own.
But you can probably see a better way to do this—
especially if the number of cores is large. Instead of making
the master core do all the work of computing the final sum,
we can pair the cores so that while core 0 adds in the result
of core 1, core 2 can add in the result of core 3, core 4 can
add in the result of core 5, and so on. Then we can repeat
the process with only the even-ranked cores: 0 adds in the
result of 2, 4 adds in the result of 6, and so on. Now cores
divisible by 4 repeat the process, and so on. See Fig. 1.1.
The circles contain the current value of each core's sum,
and the lines with arrows indicate that one core is sending
its sum to another core. The plus signs indicate that a core
is receiving a sum from another core and adding the
received sum into its own sum.
FIGURE 1.1 Multiple cores forming a global sum.

For both “global” sums, the master core (core 0) does


more work than any other core, and the length of time it
takes the program to complete the final sum should be the
length of time it takes for the master to complete. However,
with eight cores, the master will carry out seven receives
and adds using the first method, while with the second
method, it will only carry out three. So the second method
results in an improvement of more than a factor of two. The
difference becomes much more dramatic with large
numbers of cores. With 1000 cores, the first method will
require 999 receives and adds, while the second will only
require 10—an improvement of almost a factor of 100!
The first global sum is a fairly obvious generalization of
the serial global sum: divide the work of adding among the
cores, and after each core has computed its part of the
sum, the master core simply repeats the basic serial
addition—if there are p cores, then it needs to add p values.
The second global sum, on the other hand, bears little
relation to the original serial addition.
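The tree-structured pattern can be simulated in ordinary serial C, which may help make the pairing scheme concrete. This is only a simulation of the communication pattern (Exercises 1.3 and 1.4 ask for the parallel pseudocode); the eight partial sums are the ones from the example above:

    #include <stdio.h>

    int main(void) {
        int p = 8;                                   /* number of cores (a power of two) */
        double my_sum[8] = {8, 19, 7, 15, 7, 13, 12, 14};

        /* In each stage, every core whose rank is a multiple of divisor
           "receives" from the partner diff ranks above it and adds. */
        for (int divisor = 2, diff = 1; diff < p; divisor *= 2, diff *= 2)
            for (int core = 0; core < p; core++)
                if (core % divisor == 0)
                    my_sum[core] += my_sum[core + diff];

        printf("Global sum = %.1f\n", my_sum[0]);    /* prints 95.0 */
        return 0;
    }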
The point here is that it's unlikely that a translation
program would “discover” the second global sum. Rather,
there would more likely be a predefined efficient global
sum that the translation program would have access to. It
could “recognize” the original serial loop and replace it
with a precoded, efficient, parallel global sum.
We might expect that software could be written so that a
large number of common serial constructs could be
recognized and efficiently parallelized, that is, modified so
that they can use multiple cores. However, as we apply this
principle to ever more complex serial programs, it becomes
more and more difficult to recognize the construct, and it
becomes less and less likely that we'll have a precoded,
efficient parallelization.
Thus we cannot simply continue to write serial programs;
we must write parallel programs, programs that exploit the
power of multiple processors.

1.4 How do we write parallel programs?


There are a number of possible answers to this question,
but most of them depend on the basic idea of partitioning
the work to be done among the cores. There are two widely
used approaches: task-parallelism and data-parallelism.
In task-parallelism, we partition the various tasks carried
out in solving the problem among the cores. In data-
parallelism, we partition the data used in solving the
problem among the cores, and each core carries out more
or less similar operations on its part of the data.
As an example, suppose that Prof P has to teach a section
of “Survey of English Literature.” Also suppose that Prof P
has one hundred students in her section, so she's been
assigned four teaching assistants (TAs): Mr. A, Ms. B, Mr. C,
and Ms. D. At last the semester is over, and Prof P makes
up a final exam that consists of five questions. To grade the
exam, she and her TAs might consider the following two
options: each of them can grade all one hundred responses
to one of the questions; say, P grades question 1, A grades
question 2, and so on. Alternatively, they can divide the one
hundred exams into five piles of twenty exams each, and
each of them can grade all the papers in one of the piles; P
grades the papers in the first pile, A grades the papers in
the second pile, and so on.
In both approaches the “cores” are the professor and her
TAs. The first approach might be considered an example of
task-parallelism. There are five tasks to be carried out:
grading the first question, grading the second question,
and so on. Presumably, the graders will be looking for
different information in question 1, which is about
Shakespeare, from the information in question 2, which is
about Milton, and so on. So the professor and her TAs will
be “executing different instructions.”
On the other hand, the second approach might be
considered an example of data-parallelism. The “data” are
the students' papers, which are divided among the cores,
and each core applies more or less the same grading
instructions to each paper.
The first part of the global sum example in Section 1.3
would probably be considered an example of data-
parallelism. The data are the values computed by
Compute_next_value, and each core carries out roughly the same
operations on its assigned elements: it computes the
required values by calling Compute_next_value and adds them
together. The second part of the first global sum example
might be considered an example of task-parallelism. There
are two tasks: receiving and adding the cores' partial sums,
which is carried out by the master core; and giving the
partial sum to the master core, which is carried out by the
other cores.
When the cores can work independently, writing a
parallel program is much the same as writing a serial
program. Things get a great deal more complex when the
cores need to coordinate their work. In the second global
sum example, although the tree structure in the diagram is
very easy to understand, writing the actual code is
relatively complex. See Exercises 1.3 and 1.4.
Unfortunately, it's much more common for the cores to
need coordination.
In both global sum examples, the coordination involves
communication: one or more cores send their current
partial sums to another core. The global sum examples
should also involve coordination through load balancing.
In the first part of the global sum, it's clear that we want
the amount of time taken by each core to be roughly the
same as the time taken by the other cores. If the cores are
identical, and each call to Compute_next_value requires the same
amount of work, then we want each core to be assigned
roughly the same number of values as the other cores. If,
for example, one core has to compute most of the values,
then the other cores will finish much sooner than the
heavily loaded core, and their computational power will be
wasted.
A third type of coordination is synchronization. As an
example, suppose that instead of computing the values to
be added, the values are read from stdin. Say, x is an array
that is read in by the master core:
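For example, the read might be guarded so that only the master performs it. This is only a sketch; my_rank, n, and x are assumed to be set up elsewhere:

    if (my_rank == 0)                /* only the master core reads the data */
        for (int i = 0; i < n; i++)
            scanf("%lf", &x[i]);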

In most systems the cores are not automatically


synchronized. Rather, each core works at its own pace. In
this case, the problem is that we don't want the other cores
to race ahead and start computing their partial sums before
the master is done initializing x and making it available to
the other cores. That is, the cores need to wait before
starting to execute the code that computes their partial sums.
We need to add in a point of synchronization between the
initialization of x and the computation of the partial sums:
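One way to picture the result is the following sketch. Synchronize_cores is a hypothetical placeholder for whatever barrier the API provides (for example, MPI_Barrier or an OpenMP barrier in later chapters):

    if (my_rank == 0)
        for (int i = 0; i < n; i++)
            scanf("%lf", &x[i]);     /* master initializes x */

    Synchronize_cores();             /* every core waits here until all have arrived */

    double my_sum = 0.0;
    for (int my_i = my_first_i; my_i < my_last_i; my_i++)
        my_sum += x[my_i];           /* now it is safe to read x */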

The idea here is that each core will wait in the synchronization
function until all the cores have entered it, and in
particular until the master core has entered it.
Currently, the most powerful parallel programs are
written using explicit parallel constructs, that is, they are
written using extensions to languages such as C, C++, and
Java. These programs include explicit instructions for
parallelism: core 0 executes task 0, core 1 executes task 1,
…, all cores synchronize, …, and so on, so such programs
are often extremely complex. Furthermore, the complexity
of modern cores often makes it necessary to use
considerable care in writing the code that will be executed
by a single core.
There are other options for writing parallel programs—
for example, higher level languages—but they tend to
sacrifice performance to make program development
somewhat easier.

1.5 What we'll be doing


We'll be focusing on learning to write programs that are
explicitly parallel. Our purpose is to learn the basics of
programming parallel computers using the C language and
four different APIs or application program interfaces:
the Message-Passing Interface or MPI, POSIX threads
or Pthreads, OpenMP, and CUDA. MPI and Pthreads are
libraries of type definitions, functions, and macros that can
be used in C programs. OpenMP consists of a library and
some modifications to the C compiler. CUDA consists of a
library and modifications to the C++ compiler.
You may well wonder why we're learning about four
different APIs instead of just one. The answer has to do
with both the extensions and parallel systems. Currently,
there are two main ways of classifying parallel systems: one
is to consider the memory that the different cores have
access to, and the other is to consider whether the cores
can operate independently of each other.
In the memory classification, we'll be focusing on
shared-memory systems and distributed-memory
systems. In a shared-memory system, the cores can share
access to the computer's memory; in principle, each core
can read and write each memory location. In a shared-
memory system, we can coordinate the cores by having
them examine and update shared-memory locations. In a
distributed-memory system, on the other hand, each core
has its own, private memory, and the cores can
communicate explicitly by doing something like sending
messages across a network. Fig. 1.2 shows schematics of
the two types of systems.

FIGURE 1.2 (a) A shared memory system and (b) a distributed memory system.
The second classification divides parallel systems
according to the number of independent instruction
streams and the number of independent data streams. In
one type of system, the cores can be thought of as
conventional processors, so they have their own control
units, and they are capable of operating independently of
each other. Each core can manage its own instruction
stream and its own data stream, so this type of system is
called a Multiple-Instruction Multiple-Data or MIMD
system.
An alternative is to have a parallel system with cores that
are not capable of managing their own instruction streams:
they can be thought of as cores with no control unit.
Rather, the cores share a single control unit. However, each
core can access either its own private memory or memory
that's shared among the cores. In this type of system, all
the cores carry out the same instruction on their own data,
so this type of system is called a Single-Instruction
Multiple-Data or SIMD system.
In a MIMD system, it's perfectly feasible for one core to
execute an addition while another core executes a multiply.
In a SIMD system, two cores either execute the same
instruction (on their own data) or, if they need to execute
different instructions, one executes its instruction while the
other is idle, and then the second executes its instruction
while the first is idle. In a SIMD system, we couldn't have
one core executing an addition while another core executes
a multiplication. The system would have to do something
like this:

Time   First core   Second core
  1    Addition     Idle
  2    Idle         Multiply
Since you're used to programming a processor with its
own control unit, MIMD systems may seem more natural to
you. However, as we'll see, there are many problems that
are very easy to solve using a SIMD system. As a very
simple example, suppose we have three arrays, each with n
elements, and we want to add corresponding entries of the
first two arrays to get the values in the third array. The
serial pseudocode might look like this:
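A sketch of that loop in C, assuming the three arrays are called x, y, and z:

    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];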

Now suppose we have n SIMD cores, and each core is


assigned one element from each of the three arrays: core i
is assigned elements x[i], y[i], and z[i]. Then our program can
simply tell each core to add its x- and y-values to get the z
value:
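In a sketch, each core i would execute a single statement, in lockstep with all the other cores:

    z[i] = x[i] + y[i];   /* core i operates only on its own elements */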

This type of system is fundamental to modern Graphics


Processing Units or GPUs, and since GPUs are extremely
powerful parallel processors, it's important that we learn
how to program them.
Our different APIs are used for programming different
types of systems:

• MPI is an API for programming distributed memory


MIMD systems.
• Pthreads is an API for programming shared memory
MIMD systems.
• OpenMP is an API for programming both shared
memory MIMD and shared memory SIMD systems,
although we'll be focusing on programming MIMD
systems.
• CUDA is an API for programming Nvidia GPUs,
which have aspects of all four of our classifications:
shared memory and distributed memory, SIMD, and
MIMD. We will, however, be focusing on the shared
memory SIMD and MIMD aspects of the API.

1.6 Concurrent, parallel, distributed


If you look at some other books on parallel computing or
you search the Web for information on parallel computing,
you're likely to also run across the terms concurrent
computing and distributed computing. Although there
isn't complete agreement on the distinction between the
terms parallel, distributed, and concurrent, many authors
make the following distinctions:

• In concurrent computing, a program is one in which


multiple tasks can be in progress at any instant [5].
• In parallel computing, a program is one in which
multiple tasks cooperate closely to solve a problem.
• In distributed computing, a program may need to
cooperate with other programs to solve a problem.

So parallel and distributed programs are concurrent, but


a program such as a multitasking operating system is also
concurrent, even when it is run on a machine with only one
core, since multiple tasks can be in progress at any instant.
There isn't a clear-cut distinction between parallel and
distributed programs, but a parallel program usually runs
multiple tasks simultaneously on cores that are physically
close to each other and that either share the same memory
or are connected by a very high-speed network. On the
other hand, distributed programs tend to be more “loosely
coupled.” The tasks may be executed by multiple
computers that are separated by relatively large distances,
and the tasks themselves are often executed by programs
that were created independently. As examples, our two
concurrent addition programs would be considered parallel
by most authors, while a Web search program would be
considered distributed.
But beware, there isn't general agreement on these
terms. For example, many authors consider shared-memory
programs to be “parallel” and distributed-memory
programs to be “distributed.” As our title suggests, we'll be
interested in parallel programs—programs in which closely
coupled tasks cooperate to solve a problem.

1.7 The rest of the book


How can we use this book to help us write parallel
programs?
First, when you're interested in high performance,
whether you're writing serial or parallel programs, you
need to know a little bit about the systems you're working
with—both hardware and software. So in Chapter 2, we'll
give an overview of parallel hardware and software. In
order to understand this discussion, it will be necessary to
review some information on serial hardware and software.
Much of the material in Chapter 2 won't be needed when
we're getting started, so you might want to skim some of
this material and refer back to it occasionally when you're
reading later chapters.
The heart of the book is contained in Chapters 3–7.
Chapters 3, 4, 5, and 6 provide a very elementary
introduction to programming parallel systems using C and
MPI, Pthreads, OpenMP, and CUDA, respectively. The only
prerequisite for reading these chapters is a knowledge of C
programming. We've tried to make these chapters
independent of each other, and you should be able to read
them in any order. However, to make them independent, we
did find it necessary to repeat some material. So if you've
read one of these chapters, and you go on to read
another, be prepared to skim over some of the material in
the new chapter.
Chapter 7 puts together all we've learned in the
preceding chapters. It develops two fairly large programs
using each of the four APIs. However, it should be possible
to read much of this even if you've only read one of
Chapters 3, 4, 5, or 6. The last chapter, Chapter 8, provides
a few suggestions for further study on parallel
programming.

1.8 A word of warning


Before proceeding, a word of warning. It may be tempting
to write parallel programs “by the seat of your pants,”
without taking the trouble to carefully design and
incrementally develop your program. This will almost
certainly be a mistake. Every parallel program contains at
least one serial program. Since we almost always need to
coordinate the actions of multiple cores, writing parallel
programs is almost always more complex than writing a
serial program that solves the same problem. In fact, it is
often far more complex. All the rules about careful design
and development are usually far more important for the
writing of parallel programs than they are for serial
programs.

1.9 Typographical conventions


We'll make use of the following typefaces in the text:

• Program text, displayed or within running text, will


use the following typefaces:
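For example, a fragment of program text might be displayed like this (the fragment itself is only illustrative):

    /* A short program fragment */
    printf("hello, world\n");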
• Definitions are given in the body of the text, and the
term being defined is printed in boldface type: A
parallel program can make use of multiple cores.
• When we need to refer to the environment in which
a program is being developed, we'll assume that
we're using a UNIX shell, such as bash, and we'll use a $
to indicate the shell prompt:
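For example, compiling and running a program might be shown as follows (the particular command is illustrative):

    $ gcc -g -Wall -o hello hello.c
    $ ./hello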

• We'll specify the syntax of function calls with fixed


argument lists by including a sample argument list.
For example, the integer absolute value function, abs,
in stdlib.h, might have its syntax specified with
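For instance, using the standard C library prototype:

    int abs(int x);   /* Returns the absolute value of x */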

For more complicated syntax, we'll enclose required


content in angle brackets < > and optional content in
square brackets [ ]. For example, the C if statement
might have its syntax specified as follows:
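A sketch of such a specification for the if statement, using the bracket conventions just described:

    if ( <expression> )
        <statement1>
    [ else
        <statement2> ]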
This says that the if statement must include an
expression enclosed in parentheses, and the right
parenthesis must be followed by a statement. This
statement can be followed by an optional else clause.
If the else clause is present, it must include a second
statement.

1.10 Summary
For many years we've reaped the benefits of having ever-
faster processors. However, because of physical limitations,
the rate of performance improvement in conventional
processors has decreased dramatically. To increase the
power of processors, chipmakers have turned to multicore
integrated circuits, that is, integrated circuits with multiple
conventional processors on a single chip.
Ordinary serial programs, which are programs written
for a conventional single-core processor, usually cannot
exploit the presence of multiple cores, and it's unlikely that
translation programs will be able to shoulder all the work
of converting serial programs into parallel programs—
programs that can make use of multiple cores. As software
developers, we need to learn to write parallel programs.
When we write parallel programs, we usually need to
coordinate the work of the cores. This can involve
communication among the cores, load balancing, and
synchronization of the cores.
In this book we'll be learning to program parallel
systems, so that we can maximize their performance. We'll
be using the C language with four different application
program interfaces or APIs: MPI, Pthreads, OpenMP, and
CUDA. These APIs are used to program parallel systems
that are classified according to how the cores access
memory and whether the individual cores can operate
independently of each other.
In the first classification, we distinguish between shared-
memory and distributed-memory systems. In a shared-
memory system, the cores share access to one large pool of
memory, and they can coordinate their actions by accessing
shared memory locations. In a distributed-memory system,
each core has its own private memory, and the cores can
coordinate their actions by sending messages across a
network.
In the second classification, we distinguish between
systems with cores that can operate independently of each
other and systems in which the cores all execute the same
instruction. In both types of system, the cores can operate
on their own data stream. So the first type of system is
called a multiple-instruction multiple-data or MIMD
system, and the second type of system is called a single-
instruction multiple-data or SIMD system.
MPI is used for programming distributed-memory MIMD
systems. Pthreads is used for programming shared-memory
MIMD systems. OpenMP can be used to program both
shared-memory MIMD and shared-memory SIMD systems,
although we'll be looking at using it to program MIMD
systems. CUDA is used for programming Nvidia graphics
processing units or GPUs. GPUs have aspects of all four
types of system, but we'll be mainly interested in the
shared-memory SIMD and shared-memory MIMD aspects.
Concurrent programs can have multiple tasks in
progress at any instant. Parallel and distributed
programs usually have tasks that execute simultaneously.
There isn't a hard and fast distinction between parallel and
distributed, although in parallel programs, the tasks are
usually more tightly coupled.
Parallel programs are usually very complex. So it's even
more important to use good program development
techniques with parallel programs.
1.11 Exercises
1.1 Devise formulas for the functions that calculate
my_first_i and my_last_i in the global sum example.
Remember that each core should be assigned
roughly the same number of elements or
computations in the loop. Hint: First consider the
case when n is evenly divisible by p.
1.2 We've implicitly assumed that each call to
Compute_next_value requires roughly the same amount of
work as the other calls. How would you change your
answer to the preceding question if call i = k requires
k + 1 times as much work as the call with i = 0? How
would you change your answer if the first call (i = 0)
requires 2 milliseconds, the second call (i = 1)
requires 4, the third (i = 2) requires 6, and so on?
1.3 Try to write pseudocode for the tree-structured
global sum illustrated in Fig. 1.1. Assume the
number of cores is a power of two (1, 2, 4, 8, …).
Hint: Use a variable divisor to determine whether a
core should send its sum or receive and add. The
divisor should start with the value 2 and be doubled
after each iteration. Also use a variable core_difference to
determine which core should be partnered with the
current core. It should start with the value 1 and
also be doubled after each iteration. For example, in
the first iteration 0 % divisor = 0 and 1 % divisor = 1, so 0
receives and adds, while 1 sends. Also in the first
iteration 0 + core_difference = 1 and 1 − core_difference = 0, so 0 and
1 are paired in the first iteration.
1.4 As an alternative to the approach outlined in the
preceding problem, we can use C's bitwise operators
to implement the tree-structured global sum. To see
how this works, it helps to write down the binary
(base 2) representation of each of the core ranks and
note the pairings during each stage:

Core (decimal)    0    1    2    3    4    5    6    7
Core (binary)    000  001  010  011  100  101  110  111
Stage 1 pairs    (0,1), (2,3), (4,5), (6,7)   (ranks differ in the rightmost bit)
Stage 2 pairs    (0,2), (4,6)                 (ranks differ in the second bit)
Stage 3 pairs    (0,4)                        (ranks differ in the third bit)
From the table, we see that during the first stage each
core is paired with the core whose rank differs in the
rightmost or first bit. During the second stage, cores
that continue are paired with the core whose rank
differs in the second bit; and during the third stage,
cores are paired with the core whose rank differs in
the third bit. Thus if we have a binary value bitmask that
is 001₂ for the first stage, 010₂ for the second, and
100₂ for the third, we can get the rank of the core
we're paired with by “inverting” the bit in our rank
that is nonzero in bitmask. This can be done using the
bitwise exclusive or operator (^ in C).
Implement this algorithm in pseudocode using the
bitwise exclusive or and the left-shift operator.
1.5 What happens if your pseudocode in Exercise 1.3 or
Exercise 1.4 is run when the number of cores is not a
power of two (e.g., 3, 5, 6, 7)? Can you modify the
pseudocode so that it will work correctly regardless
of the number of cores?
1.6 Derive formulas for the number of receives and
additions that core 0 carries out using
a. the original pseudocode for a global sum, and
b. the tree-structured global sum.
Make a table showing the numbers of receives and
additions carried out by core 0 when the two sums
are used with 2, 4, 8, …, 1024 cores.
1.7 The first part of the global sum example—when
each core adds its assigned computed values—is
usually considered to be an example of data-
parallelism, while the second part of the first global
sum—when the cores send their partial sums to the
master core, which adds them—could be considered
to be an example of task-parallelism. What about the
second part of the second global sum—when the
cores use a tree structure to add their partial sums?
Is this an example of data- or task-parallelism? Why?
1.8 Suppose the faculty members are throwing a party
for the students in the department.
a. Identify tasks that can be assigned to the
faculty members that will allow them to use task-
parallelism when they prepare for the party.
Work out a schedule that shows when the various
tasks can be performed.
b. We might hope that one of the tasks in the
preceding part is cleaning the house where the
party will be held. How can we use data-
parallelism to partition the work of cleaning the
house among the faculty?
c. Use a combination of task- and data-parallelism
to prepare for the party. (If there's too much
work for the faculty, you can use TAs to pick up
the slack.)
1.9 Write an essay describing a research problem in
your major that would benefit from the use of
parallel computing. Provide a rough outline of how
parallelism would be used. Would you use task- or
data-parallelism?

Bibliography
[5] Clay Breshears, The Art of Concurrency: A Thread
Monkey's Guide to Writing Parallel Applications.
Sebastopol, CA: O'Reilly; 2009.
[28] John Hennessy, David Patterson, Computer
Architecture: A Quantitative Approach. 6th ed.
Burlington, MA: Morgan Kaufmann; 2019.
[31] IBM, IBM InfoSphere Streams v1.2.0 supports
highly complex heterogeneous data analysis, IBM
United States Software Announcement 210-037,
Feb. 23, 2010
http://www.ibm.com/common/ssi/rep_ca/7/897/ENUS
210-037/ENUS210-037.PDF.
[36] John Loeffler, No more transistors: the end of
Moore's Law, Interesting Engineering, Nov 29, 2018.
See https://interestingengineering.com/no-more-
transistors-the-end-of-moores-law.
Chapter 2: Parallel
hardware and parallel
software
It's perfectly feasible for specialists in disciplines other
than computer science and computer engineering to write
parallel programs. However, to write efficient parallel
programs, we often need some knowledge of the underlying
hardware and system software. It's also very useful to have
some knowledge of different types of parallel software, so
in this chapter we'll take a brief look at a few topics in
hardware and software. We'll also take a brief look at
evaluating program performance and a method for
developing parallel programs. We'll close with a discussion
of what kind of environment we might expect to be working
in, and a few rules and assumptions we'll make in the rest
of the book.
This is a long, broad chapter, so it may be a good idea to
skim through some of the sections on a first reading so that
you have a good idea of what's in the chapter. Then, when a
concept or term in a later chapter isn't quite clear, it may
be helpful to refer back to this chapter. In particular, you
may want to skim over most of the material in
“Modifications to the von Neumann Model,” except “The
Basics of Caching.” Also, in the “Parallel Hardware”
section, you can safely skim the material on
“Interconnection Networks.” You can also skim the material
on “SIMD Systems” unless you're planning to read the
chapter on CUDA programming.