Parallel programming: concepts and practice 1st Edition - eBook PDF download

The document is an introduction to the book 'Parallel Programming: Concepts and Practice', which covers essential techniques for programming parallel systems using C++11 and various APIs like OpenMP and MPI. It emphasizes the importance of parallel programming in modern computing due to the prevalence of multi-core processors and outlines the book's structure, targeting students and professionals in computer science and engineering. The content includes theoretical concepts, practical examples, and exercises to enhance understanding of parallel algorithms and their implementation.


https://ebookluna.com/download/parallel-programming-concepts-and-practice-ebook-pdf/


Parallel Programming
Concepts and Practice
Bertil Schmidt
Institut für Informatik
Staudingerweg 9
55128 Mainz
Germany

Jorge González-Domínguez
Computer Architecture Group
University of A Coruña
Edificio área científica (Office 3.08), Campus de Elviña
15071, A Coruña
Spain

Christian Hundt
Institut für Informatik
Staudingerweg 9
55128 Mainz
Germany

Moritz Schlarb
Data Center
Johannes Gutenberg-University Mainz
Anselm-Franz-von-Bentzel-Weg 12
55128 Mainz
Germany
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
Copyright © 2018 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including
photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on
how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as
the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted
herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes
in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information,
methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety
and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or
damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods,
products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data


A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library

ISBN: 978-0-12-849890-3

For information on all Morgan Kaufmann publications


visit our website at https://www.elsevier.com/books-and-journals

Publisher: Katey Birtcher


Acquisition Editor: Steve Merken
Developmental Editor: Nate McFadden
Production Project Manager: Sreejith Viswanathan
Designer: Christian J. Bilbow
Typeset by VTeX
Preface

Parallelism abounds. Nowadays, any modern CPU contains at least two cores, whereas some CPUs
feature more than 50 processing units. An even higher degree of parallelism is available on larger sys-
tems containing multiple CPUs such as server nodes, clusters, and supercomputers. Thus, the ability
to program these types of systems efficiently and effectively is an essential aspiration for scientists,
engineers, and programmers. The subject of this book is a comprehensive introduction to the area of
parallel programming that addresses this need. Our book teaches practical parallel programming for
shared memory and distributed memory architectures based on the C++11 threading API, Open Mul-
tiprocessing (OpenMP), Compute Unified Device Architecture (CUDA), Message Passing Interface
(MPI), and Unified Parallel C++ (UPC++), as well as necessary theoretical background. We have in-
cluded a large number of programming examples based on the recent C++11 and C++14 dialects of
the C++ programming language.
This book targets participants of “Parallel Programming” or “High Performance Computing”
courses which are taught at most universities at senior undergraduate level or graduate level in com-
puter science or computer engineering. Moreover, it serves as suitable literature for undergraduates in
other disciplines with a computer science minor or professionals from related fields such as research
scientists, data analysts, or R&D engineers. Prerequisites for being able to understand the contents
of our book include some experience with writing sequential code in C/C++ and basic mathematical
knowledge.
In good tradition with the historic symbiosis of High Performance Computing and natural science,
we introduce parallel concepts based on real-life applications ranging from basic linear algebra routines
and machine learning algorithms to physical simulations, as well as traditional algorithms from
computer science. The writing of correct yet efficient code is a key skill for every programmer. Hence,
we focus on the actual implementation and performance evaluation of algorithms. Nevertheless, the
theoretical properties of algorithms are discussed in depth, too. Each chapter features a collection of
additional programming exercises that can be solved within a web framework that is distributed with
this book. The System for Automated Code Evaluation (SAUCE) provides a web-based testing
environment for the submission of solutions and their subsequent evaluation in a classroom setting: the
only prerequisite is an HTML5-compatible web browser allowing for the embedding of interactive
programming exercises in lectures. SAUCE is distributed as a Docker image and can be downloaded at
https://parallelprogrammingbook.org
This website serves as hub for related content such as installation instructions, a list of errata, and
supplementary material (such as lecture slides and solutions to selected exercises for instructors).
If you are a student or professional who aims to learn a certain programming technique, we advise you to
initially read the first three chapters on the fundamentals of parallel programming, theoretical models,
and hardware architectures. Subsequently, you can dive into one of the introductory chapters on C++11
Multithreading, OpenMP, CUDA, or MPI which are mostly self-contained. The chapters on Advanced
C++11 Multithreading, Advanced CUDA, and UPC++ build upon the techniques of their preceding
chapter and thus should not be read in isolation.

If you are a lecturer, we propose a curriculum consisting of 14 lectures mainly covering applications
from the introductory chapters. You could start with a lecture discussing the fundamentals from the
first chapter including parallel summation using a hypercube and its analysis, the definition of basic
measures such as speedup, parallelization efficiency and cost, and a discussion of ranking metrics. The
second lecture could cover an introduction to PRAM, network topologies, weak and strong scaling.
You can spend more time on PRAM if you aim to later discuss CUDA in more detail or emphasize
hardware architectures if you focus on CPUs. Two to three lectures could be spent on teaching the
basics of the C++11 threading API, CUDA, and MPI, respectively. OpenMP can be discussed within
a span of one to two lectures. The remaining lectures can be used to either discuss the content in the
advanced chapters on multithreading, CUDA, or the PGAS-based UPC++ language.
An alternative approach is splitting the content into two courses with a focus on pair-programming
within the lecture. You could start with a course on CPU-based parallel programming covering selected
topics from the first three chapters. Hence, C++11 threads, OpenMP, and MPI could be taught in full
detail. The second course would focus on advanced parallel approaches covering extensive CUDA
programming in combination with (CUDA-aware) MPI and/or the PGAS-based UPC++.
We wish you a great time with the book. Be creative and investigate the code! Finally, we would be
happy to hear any feedback from you so that we could improve any of our provided material.
Acknowledgments

This book would not have been possible without the contributions of many people.
Initially, we would like to thank the anonymous and few non-anonymous reviewers who com-
mented on our book proposal and the final draft: Eduardo Cesar Galobardes, Ahmad Al-Khasawneh,
and Mohammad Olaimat.
Moreover, we would like to thank our colleagues who thoroughly peer-reviewed the chapters and
provided essential feedback: André Müller for his valuable advice on C++ programming, Robin Kobus
for being a tough code reviewer, Felix Kallenborn for his steady proofreading sessions, Daniel Jünger
for constantly complaining about the CUDA chapter, as well as Stefan Endler and Elmar Schömer for
their suggestions.
Additionally, we would like to thank the staff of Morgan Kaufmann and Elsevier who coordinated
the making of this book. In particular, we would like to mention Nate McFadden.
Finally, we would like to thank our spouses and children for their ongoing support and patience
during the countless hours we could not spend with them.

CHAPTER 1
INTRODUCTION

Abstract
In the recent past, teaching and learning of parallel programming has become increasingly important
due to the ubiquity of parallel processors in portable devices, workstations, and compute clusters. Stag-
nating single-threaded performance of modern CPUs requires future computer scientists and engineers
to write highly parallelized code in order to fully utilize the compute capabilities of current hardware
architectures. The design of parallel algorithms, however, can be challenging especially for inexpe-
rienced students due to common pitfalls such as race conditions when concurrently accessing shared
resources, defective communication patterns causing deadlocks, or the non-trivial task of efficiently
scaling an application over the whole number of available compute units. Hence, acquiring parallel
programming skills is nowadays an important part of many undergraduate and graduate curricula.
More importantly, education of concurrent concepts is not limited to the field of High Performance
Computing (HPC). The emergence of deep learning and big data lectures requires teachers and stu-
dents to adopt HPC as an integral part of their knowledge domain. An understanding of basic concepts
is indispensable for acquiring a deep understanding of fundamental parallelization techniques.
The goal of this chapter is to provide an overview of introductory concepts and terminologies in parallel
computing. We start with learning about speedup, efficiency, cost, scalability, and the computation-to-
communication ratio by analyzing a simple yet instructive example for summing up numbers using a
varying number of processors. We get to know about the two most important parallel architectures:
distributed memory systems and shared memory systems. Designing efficient parallel programs re-
quires a lot of experience and we will study a number of typical considerations for this process such
as problem partitioning strategies, communication patterns, synchronization, and load balancing. We
end this chapter with learning about current and past supercomputers and their historical and upcoming
architectural trends.

Keywords
Parallelism, Speedup, Parallelization, Efficiency, Scalability, Reduction, Computation-to-communication ratio, Distributed memory, Shared memory, Partitioning, Communication, Synchronization, Load balancing, Task parallelism, Prefix sum, Deep learning, Top500

CONTENTS
1.1 Motivational Example and Its Analysis
    The General Case and the Computation-to-Communication Ratio
1.2 Parallelism Basics
    Distributed Memory Systems
    Shared Memory Systems
    Considerations When Designing Parallel Programs
1.3 HPC Trends and Rankings
1.4 Additional Exercises

1.1 MOTIVATIONAL EXAMPLE AND ITS ANALYSIS


In this section we learn about some basic concepts and terminologies. They are important for analyzing
parallel algorithms or programs in order to understand their behavior. We use a simple example for
summing up numbers using an increasing number of processors in order to explain and apply the
following concepts:

• Speedup. You have designed a parallel algorithm or written a parallel code. Now you want to
know how much faster it is than your sequential approach; i.e., you want to know the speedup.
The speedup (S) is usually measured or calculated for almost every parallel code or algorithm and
is simply defined as the quotient of the time taken using a single processor (T (1)) over the time
measured using p processors (T (p)) (see Eq. (1.1)).
$$S = \frac{T(1)}{T(p)} \qquad (1.1)$$
• Efficiency and cost. The best speedup you can usually expect is a linear speedup; i.e., the maximal
speedup you can achieve with p processors or cores is p (although there are exceptions to this,
which are referred to as super-linear speedups). Thus, you want to relate the speedup to the number
of utilized processors or cores. The efficiency E measures exactly that by dividing S by p (see
Eq. (1.2)); i.e., linear speedup would then be expressed by a value close to 100%. The cost C is
similar but relates the runtime T (p) (instead of the speedup) to the number of utilized processors
(or cores) by multiplying T (p) and p (see Eq. (1.3)).
$$E = \frac{S}{p} = \frac{T(1)}{T(p) \times p} \qquad (1.2)$$

$$C = T(p) \times p \qquad (1.3)$$

• Scalability. Often we do not only want to measure the efficiency for one particular number of pro-
cessors or cores but for a varying number; e.g. p = 1, 2, 4, 8, 16, 32, 64, 128, etc. This is called
scalability analysis and indicates the behavior of a parallel program when the number of processors
increases. Besides varying the number of processors, the input data size is another parameter that
you might want to vary when executing your code. Thus, there are two types of scalability: strong
scalability and weak scalability. In the case of strong scalability we measure efficiencies for a vary-
ing number of processors and keep the input data size fixed. In contrast, weak scalability shows the
behavior of our parallel code for varying both the number of processors and the input data size; i.e.
when doubling the number of processors we also double the input data size.
• Computation-to-communication ratio. This is an important metric influencing the achievable
scalability of a parallel implementation. It can be defined as the time spent calculating divided by
the time spent communicating messages between processors. A higher ratio often leads to improved
speedups and efficiencies. A short code sketch below illustrates how speedup, efficiency, and cost can be computed from measured runtimes.
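The following minimal C++ sketch (our own illustration, not code from the book) implements Eqs. (1.1)-(1.3) for a measured sequential runtime T(1) and a parallel runtime T(p):

    #include <iostream>

    // Basic measures from Eqs. (1.1)-(1.3).
    // t1: runtime on a single processor, T(1); tp: runtime on p processors, T(p).
    double speedup(double t1, double tp)           { return t1 / tp; }        // Eq. (1.1)
    double efficiency(double t1, double tp, int p) { return t1 / (tp * p); }  // Eq. (1.2)
    double cost(double tp, int p)                  { return tp * p; }         // Eq. (1.3)

    int main() {
        // Values from the summation example discussed below: T(1,1024) = 1023, T(2,1024) = 518.
        double t1 = 1023, tp = 518;
        int p = 2;
        std::cout << "speedup    = " << speedup(t1, tp)       << '\n'   // ~1.975
                  << "efficiency = " << efficiency(t1, tp, p) << '\n'   // ~0.9875
                  << "cost       = " << cost(tp, p)           << '\n';  // 1036
    }

Efficiency is reported as a fraction here; multiplying by 100 gives the percentages used in the text.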

The example we now want to look at is a simple summation; i.e., given an array A of n numbers we
want to compute $\sum_{i=0}^{n-1} A[i]$. We parallelize this problem using an array of processing elements (PEs).
We make the following (not necessarily realistic) assumptions:

• Computation. Each PE can add two numbers stored in its local memory in one time unit.
• Communication. A PE can send data from its local memory to the local memory of any other PE
in three time units (independent of the size of the data).
• Input and output. At the beginning of the program the whole input array A is stored in PE #0. At
the end the result should be gathered in PE #0.
• Synchronization. All PEs operate in lock-step manner; i.e. they can either compute, communicate,
or be idle. Thus, it is not possible to overlap computation and communication on this architecture.

Speedup is relative. Therefore, we need to establish the runtime of a sequential program first. The
sequential program simply uses a single processor (e.g. PE #0) and adds the n numbers using n − 1
additions in n − 1 time units; i.e. T (1, n) = n − 1. In the following we illustrate our parallel algorithm
for varying p, where p denotes the number of utilized PEs. We further assume that n is a power of 2;
i.e., n = 2^k for a positive integer k.

• p = 2. PE #0 sends half of its array to PE #1 (takes three time units). Both PEs then compute the
sum of their respective n/2 numbers (takes time n/2 − 1). PE #1 sends its partial sum back to PE
#0 (takes time 3). PE #0 adds the two partial sums (takes time 1). The overall required runtime is
T (2, n) = 3 + n/2 − 1 + 3 + 1. Fig. 1.1 illustrates the computation for n = 1024 = 2^10, which has
a runtime of T (2, 1024) = 3 + 511 + 3 + 1 = 518. This is significantly faster than the sequential
runtime. We can calculate the speedup for this case as T (1, 1024)/T (2, 1024) = 1023/518 = 1.975.
This is very close to the optimum of 2 and corresponds to an efficiency of 98.75% (calculated
dividing the speedup by the number of utilized PEs; i.e. 1.975/2).
• p = 4. PE #0 sends half of the input data to PE #1 (takes time 3). Afterwards PE #0 and PE #1
each send a quarter of the input data to PE #2 and PE #3 respectively (takes time 3). All four PEs
then compute the sum of their respective n/4 numbers in parallel (takes time n/4 − 1). PE #2 and
PE #3 send their partial sums to PE #0 and PE #1, respectively (takes time 3). PE #0 and PE #1
add their respective partial sums (takes time 1). PE #1 then sends its partial sum to PE #0 (takes
time 3). Finally, PE #0 adds the two partial sums (takes time 1). The overall required runtime is
T (4, n) = 3 + 3 + n/4 − 1 + 3 + 1 + 3 + 1. Fig. 1.2 illustrates the computation for n = 1024 = 2^10,
which has a runtime of T (4, 1024) = 3 + 3 + 255 + 3 + 1 + 3 + 1 = 269. We can again calculate the
speedup for this case as T (1, 1024)/T (4, 1024) = 1023/269 = 3.803 resulting in an efficiency of
95.07%. Even though this value is also close to 100%, it is slightly reduced in comparison to p = 2.
The reduction is caused by the additional communication overhead required for the larger number
of processors.
• p = 8. PE #0 sends half of its array to PE #1 (takes time 3). PE #0 and PE #1 then each send a
quarter of the input data to PE #2 and PE #3 (takes time 3). Afterwards, PE #0, PE #1, PE #2, and
PE #3 each send one eighth of the input data to PE #4, PE #5, PE #6, and PE #7 (takes again time 3).
Fig. 1.3 illustrates the three initial data distribution steps for n = 1024 = 2^10. All eight PEs then
compute the sum of their respective n/8 numbers (takes time n/8 − 1). PE #4, PE #5, PE #6, and
PE #7 send their partial sums to PE #0, PE #1, PE #2, and PE #3, respectively (takes time 3).

FIGURE 1.1
Summation of n = 1024 numbers on p = 2 PEs: (A) initially PE #0 stores the whole input data locally; (B) PE #0
sends half of the input to PE #1 (takes time 3); (C) Each PE sums up its 512 numbers (takes time 511);
(D) PE #1 sends its partial sum back to PE #0 (takes time 3); (E) To finalize the computation, PE #0 adds the
two partial sums (takes time 1). Thus, the total runtime is T (2, 1024) = 3 + 511 + 3 + 1 = 518.

Subsequently, PE #0, PE #1, PE #2, and PE #3 add their respective partial sums (takes time 1). PE
#2 and PE #3 then send their partial sums to PE #0 and PE #1, respectively (takes time 3). PE #0
and PE #1 add their respective partial sums (takes time 1). PE #1 then sends its partial sum to PE #0
(takes time 3). Finally, PE #0 adds the two partial sums (takes time 1). The overall required runtime
is T (8, n) = 3 + 3 + 3 + n/8 − 1 + 3 + 1 + 3 + 1 + 3 + 1. The computation for n = 1024 = 2^10
thus has a runtime of T (8, 1024) = 3 + 3 + 3 + 127 + 3 + 1 + 3 + 1 + 3 + 1 = 148. The speedup
for this case is T (1, 1024)/T (8, 1024) = 1023/148 = 6.91 resulting in an efficiency of 86%. The
decreasing efficiency is again caused by the additional communication overhead required for the
larger number of processors. A shared memory analogue of this divide-and-combine scheme, written
with C++11 threads, is sketched below.
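On an actual shared memory machine, the divide-and-combine idea behind this example can be expressed with the C++11 threading API that is introduced later in the book. The following sketch is only our own simplified illustration (it ignores the communication costs of the machine model used above): each of p threads sums a contiguous chunk of the array, and the partial results are combined afterwards.

    #include <cstdint>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = 1024;                 // number of values (a power of 2)
        const std::size_t p = 8;                    // number of threads (processing elements)
        std::vector<std::uint64_t> A(n, 1);         // example input: n ones, so the sum is n

        std::vector<std::uint64_t> partial(p, 0);   // one partial sum per thread
        std::vector<std::thread> threads;

        // Each thread i sums its contiguous chunk A[i*n/p, (i+1)*n/p).
        for (std::size_t i = 0; i < p; ++i)
            threads.emplace_back([&, i] {
                const std::size_t lo = i * n / p, hi = (i + 1) * n / p;
                partial[i] = std::accumulate(A.begin() + lo, A.begin() + hi, std::uint64_t(0));
            });
        for (auto& t : threads) t.join();

        // Combine the p partial results (the collection phase of the model).
        const std::uint64_t sum = std::accumulate(partial.begin(), partial.end(), std::uint64_t(0));
        std::cout << "sum = " << sum << '\n';       // prints 1024
    }

Compile with a C++11 compiler and thread support, e.g. g++ -std=c++11 -pthread.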

We are now able to analyze the runtime of our parallel summation algorithm in a more general way
using p = 2^q PEs and n = 2^k input numbers:

• Data distribution time: 3 × q.
• Computing local sums: n/p − 1 = 2^(k−q) − 1.
• Collecting partial results: 3 × q.
• Adding partial results: q.

FIGURE 1.2
Summation of n = 1024 numbers on p = 4 PEs: (A) initially PE #0 stores the whole input in its local memory;
(B) PE #0 sends half of its input to PE #1 (takes time 3); (C) PE #0 and PE #1 send half of their data to PE #2
and PE #3 (takes time 3); (D) Each PE adds its 256 numbers (takes time 255); (E) PE #2 and PE #3 send their
partial sums to PE #0 and PE #1, respectively (takes time 3). Subsequently, PE #0 and PE #1 add their
respective partial sums (takes time 1); (F) PE #1 sends its partial sum to PE #0 (takes time 3), which then
finalizes the computation by adding them (takes time 1). Thus, the total runtime is
T (4, 1024) = 3 + 3 + 255 + 3 + 1 + 3 + 1 = 269.

FIGURE 1.3
The three initial data distribution steps for n = 1024 and p = 8: (A) Initially PE #0 stores the whole input in its
local memory and sends half of its input to PE #1; (B) PE #0 and PE #1 send half of their (remaining) data to
PE #2 and PE #3; (C) PE #0, PE #1, PE #2, and PE #3 each send half of their (remaining) input data to PE #4,
PE #5, PE #6, and PE #7.

Thus, we get the following formula for the runtime:

$$T(p, n) = T(2^q, 2^k) = 3q + \left(2^{k-q} - 1\right) + 3q + q = 2^{k-q} - 1 + 7q. \qquad (1.4)$$

For example, T(8, 1024) = 2^7 − 1 + 7 × 3 = 148, which matches the step-by-step count above.

Fig. 1.4 shows the runtime, speedup, cost, and efficiency of our parallel algorithm for n = 1024
and p ranging from 1 to 512. This type of runtime analysis (where the input size is kept constant
and the number of PEs is scaled) is called strong scalability analysis.

FIGURE 1.4
Strong scalability analysis: runtime, speedup, cost, and efficiency of our parallel summation algorithm for
adding n = 1024 numbers on a varying number of PEs (ranging from 1 to 512).

We can see that the efficiency
is high for a small number of PEs (i.e. p ≪ n), but is low for a large number of PEs (i.e. p ≈ n).
This behavior can also be deduced from Eq. (1.4): for the case p ≪ n, it holds that 2^(k−q) ≫ 7q (i.e., the
term for computation time dominates), while 2^(k−q) ≪ 7q holds for the case p ≈ n (i.e., the term
for communication time dominates). Thus, we can conclude that our algorithm is not strongly scalable.
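The strong scaling numbers plotted in Fig. 1.4 can be reproduced directly from Eq. (1.4). A small C++ sketch of this evaluation (our own illustration) for n = 1024 and p = 1, 2, 4, ..., 512 could look as follows:

    #include <cmath>
    #include <iostream>

    // Runtime model of Eq. (1.4): T(p, n) = 2^(k-q) - 1 + 7q for p = 2^q and n = 2^k.
    // For q = 0 this reduces to the sequential runtime T(1, n) = n - 1.
    double model_runtime(int q, int k) {
        return std::pow(2.0, k - q) - 1 + 7 * q;
    }

    int main() {
        const int k = 10;                          // n = 2^10 = 1024 (fixed input size)
        const double t1 = model_runtime(0, k);     // T(1, 1024) = 1023
        for (int q = 0; q <= 9; ++q) {             // p = 1, 2, 4, ..., 512
            const int p = 1 << q;
            const double tp = model_runtime(q, k);
            std::cout << "p = " << p
                      << "  T(p) = " << tp
                      << "  speedup = " << t1 / tp
                      << "  efficiency = " << t1 / (tp * p) << '\n';
        }
    }

For p = 512, for instance, the model gives T = 2^1 − 1 + 63 = 64 and an efficiency of only about 3%, the behavior visible in Fig. 1.4.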
Now, we want to change our analysis a bit by not only increasing the number of PEs but additionally
increasing the input data size at the same time. This is known as weak scalability analysis. Fig. 1.5
shows the speedup and efficiency of our algorithm for n ranging from 1024 to 524,288 and p ranging
from 1 to 512. We can see that the efficiency is kept high (close to 100%) even for a large number of
PEs. This behavior can again be deduced from Eq. (1.4): since both n and p are scaled at the same
rate, the term relating to the computation time is constant for varying number of PEs (i.e. 2^(k−q) = 1024
in Fig. 1.5), while the term for the communication time (7q = 7 × log(p)) only grows at a logarithmic
rate. Thus, we can conclude that our algorithm is weakly scalable.
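The weak scaling behavior of Fig. 1.5 follows from the same model when the input grows with the number of PEs, i.e. n = 1024 × p (so k = 10 + q and 2^(k−q) = 1024 stays constant). A corresponding sketch (again our own illustration):

    #include <cmath>
    #include <iostream>

    int main() {
        // Weak scaling: n = 1024 * p, i.e. k = 10 + q, so the local work 2^(k-q) = 1024 is constant.
        for (int q = 0; q <= 9; ++q) {                   // p = 1, 2, 4, ..., 512
            const int p = 1 << q;
            const double t1 = std::pow(2.0, 10 + q) - 1; // sequential time T(1, n) = n - 1
            const double tp = 1024 - 1 + 7 * q;          // Eq. (1.4) with 2^(k-q) = 1024
            std::cout << "p = " << p
                      << "  speedup = " << t1 / tp               // grows almost linearly with p ...
                      << "  efficiency = " << t1 / (tp * p) << '\n'; // ... so efficiency stays near 1
        }
    }

For p = 512 the efficiency of this model is still about 94%, matching the near-constant curves in Fig. 1.5.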
The terms weak and strong scalability are also related to two well-known laws in parallel comput-
ing: Amdahl’s law and Gustafson’s law, which we will discuss in more detail in Chapter 2.

FIGURE 1.5
Weak scalability analysis: speedup and efficiency of our parallel summation algorithm for adding n = 1024 × p
numbers on p PEs (p ranging from 1 to 512).

THE GENERAL CASE AND THE COMPUTATION-TO-COMMUNICATION RATIO


In general, let α > 0 be the time needed to perform a single addition and β > 0 be the time to com-
municate a stack of numbers. Note that we have previously chosen α = 1 and β = 3. Then the general
formula for the runtime is given by

$$T_{\alpha,\beta}(2^q, 2^k) = \beta q + \alpha\left(2^{k-q} - 1\right) + \beta q + \alpha q = 2\beta q + \alpha\left(2^{k-q} - 1 + q\right). \qquad (1.5)$$

The speedup is defined as the quotient of the sequential and the parallel runtime:

$$S_{\alpha,\beta}(2^q, 2^k) = \frac{T_{\alpha,\beta}(2^0, 2^k)}{T_{\alpha,\beta}(2^q, 2^k)} = \frac{\alpha\left(2^k - 1\right)}{2\beta q + \alpha\left(2^{k-q} - 1 + q\right)}. \qquad (1.6)$$

For our example we define the computation-to-communication ratio as γ = α/β. The speedup then tends
to zero if we compute the limit γ → 0 for q > 0:

$$S_{\gamma}(2^q, 2^k) = \frac{\gamma\left(2^k - 1\right)}{2q + \gamma\left(2^{k-q} - 1 + q\right)} \quad \text{and} \quad \lim_{\gamma \to 0} S_{\gamma}(2^q, 2^k) = 0. \qquad (1.7)$$

The first derivative of S_γ(2^q, 2^k) with respect to γ for fixed q and k is always positive, i.e. the speedup
is monotonically decreasing if we increase the communication time (reduce the value of γ). Let k >
q > 0, A(k) = 2^k − 1 > 0, and B(q, k) = 2^(k−q) − 1 + q > 0; then we can simply apply the quotient
rule:

$$\frac{d}{d\gamma} S_{\gamma}(2^q, 2^k) = \frac{d}{d\gamma}\,\frac{\gamma A(k)}{2q + \gamma B(q, k)} = \frac{2q\,A(k)}{\left(2q + \gamma B(q, k)\right)^2} > 0. \qquad (1.8)$$

As a result, decreasing the computation-to-communication ratio decreases the speedup independent
of the number of used compute units p = 2^q > 1 – an observation that is true for the majority of
parallel algorithms. The speedup S_γ(2^q, 2^k) interpreted as a function of q exhibits a local maximum at
Another Random Scribd Document with Unrelated Content
The Neanderthal was not a pretty spectacle. He had the low
forehead and heavy brow ridges of Java and Peking man, and the
same lack of chin. And yet he was the cleverest fellow by far that
had ever lived, and the most sensitive, which may seem rather odd,
since some of the earlier skulls found in England, Africa, Java, and
Australia come nearer the modern type of what we call thinking
man, Homo sapiens.

On the evidence we have, the Neanderthal seems to have been the
inventor of religion. In his caves we find burials for the first time,
and burials accompanied by tools for the dead man to use in the
other world. We also find shrines made of the skulls of cave bears.

As a chipper of flint, he was much more skillful than those that had
gone before. He soon gave up making tools which, like the hand ax,
would serve a number of purposes, but none very well. He invented
the stone spear point. There is still considerable argument about his
flint work, or at least about the technique which he may or may not
have developed.

89
THREE TYPES OF OLD WORLD MAN
Note the progressive lessening of brow ridge and receding
chin and the increase in the height of forehead and vault.
Pithecanthropus robustus was found in the same general
area as Pithecanthropus erectus, or Java man. (Robustus,
after Weidenreich, 1946; Neanderthal, after McGregor,
1926; Cro-Magnon, after Verneau, 1906.)

90

MAN’S FIRST SPEAR POINTS


Two views of a flake of flint that has been chipped on one
face and retouched along the edges by a Neanderthal man.
To remove the tiny lateral chips, he probably held it with
the smoother side against a chunk of wood, and struck
small and careful blows with a hammerstone (as shown at
the bottom of page 91). Acheulean man may have used
this technique in making the best of his hand axes. (After
Mortillet, 1881.)

When man began to make tools he pounded one rock with another.
He hoped he would knock off just the right chip in the right spot.
This is called the percussion method of flaking (see illustration, page
91). Some say the Neanderthal was not content with this. They think
that he must have discovered how to place a piece of bone or very
hard wood against a flint at the point where he wanted to knock off
a flake, and then strike it with a hammerstone (see illustration, page
92). This might account for the small chips, or “retouches,” taken off
the edge of some of his spear points as in the illustration below.
Even the Acheuleans are occasionally credited with this invention
because many of their hand axes are so symmetrical. There are
those who say that the Neanderthal had progressed so far in flint
work that he knew the art of pressure flaking—the third step in flint
knapping—which involved the pressing off of small chips with the bit
of wood or bone held in the hand (see illustration, page 93). It
seems more likely that Neanderthals and men of Acheulean times
used the anvil method of percussion flaking (lower drawing, page
91), not an inaccurate way of knocking off small chips.

91
PERCUSSION FLAKING
The first method by which early man shaped his tools.
(After Holmes, 1919.)

92
THE SECOND STEP IN FLINT KNAPPING
For more accurate work, early man applied a small stick of
hardwood or a piece of bone at the proper spot and hit the
interposed tool with a mallet of heavy wood or a rock. No
one knows who invented this technique—Acheulean,
Neanderthal, or later man. (After Holmes, 1919.)

93
THE THIRD STEP—PRESSURE FLAKING
The discovery that gave early man complete control over
the shaping of flints was that a slow and continued
pressure would dislodge just the flake he desired. Above,
we see how he worked on a small point, and below, to the
left, how he chipped thin slivers from a core. (After Holmes,
1919.)

The Neanderthal—with his Mousterian culture—seems to have
invaded Europe from Asia toward the end of the third and last
interglacial, anywhere from 80,000 to 125,000 years ago, and to
have left from 15,000 to 100,000 years ago, depending on what
authority you choose, and how that authority dates the last
interglacial. Unlike his predecessors, the Neanderthal lived in caves;
but that was probably because he was the first man in Europe to
survive a glacial winter—tens of thousands of them.

The Neanderthal seems to have disappeared quite suddenly from
Europe, taking his Australoid features with him. There are traces of
him in Africa, and also in Palestine where he is thought to have
produced a hybrid among the Mount Carmel people. Sir Arthur Keith
said, in 1915, that the Neanderthal never left Europe, but was
merely absorbed into the next peoples. We can see the Neanderthal
profile on an occasional passer-by.

Most anthropologists are rather cool to the Neanderthal. They cast
him quite outside the sacred ranks of our ancestors. They say he
was not Homo sapiens—merely Homo neanderthalensis. This means
that he was a sort of dead end, a blind alley, up which one sort of
ape-man ran, while another was taking a turn that ended in his
being master of the atom but not of the atomic bomb. Other
anthropologists do not agree. Like Keith, they take the Neanderthal
into the sacred circle—at least at stud.
Radiocarbon Dates for the Mousterian

However early the Mousterian culture may have begun, the later
stages fall within the range of one of the archaeologist’s most
interesting and precise techniques for dating. This is the
radiocarbon, or Carbon 14, method.[19] Much simplified, it
depends upon the following phenomena: Most plants are
radioactive, and so are all animals that depend directly or indirectly
upon these plants for food. This radioactivity is found in a rare form
of carbon called radiocarbon, or Carbon 14. While a plant or an
animal is alive, it contains a constant proportion of this radioactive
material. Radiocarbon is always breaking down and disappearing,
but while a tree or a bird or another animal lives, this material is
being renewed. When this same tree or animal dies, it stops
acquiring new radiocarbon, and therefore its radioactivity decreases.
Heartwood from a 4,000-year-old sequoia tree is appreciably less
radioactive than the living outer layer. Antlers shed by a buck in the
last spring are more radioactive than any reindeer antlers left in a
cave in France by Old Stone Age hunters 17,000 years ago.

After death, radiocarbon disappears at a rate of speed that can be
measured. This rate may seem strange to most laymen. It is based
on what physicists call “half-lives.” The half-life of radiocarbon is
5,568 years, give or take a few. This means that after 5,568 years,
half the radiocarbon is gone. If the material weighed a pound at the
death of the plant or animal, only half a pound would be left. After
another half-life of the same length, only a quarter of a pound would
remain. And so on and so on. With highly refined techniques, what is
left of this radioactive substance can be detected in matter 60,000 to
70,000 years old,[20] and it can be measured and its age
determined, with a few percentage points of error, up to at least
50,000 years ago. The scientist must be sure, however, that the
material—such as wood, charcoal, peat, antler, shell, bone, or hair—
has not been exposed to contamination that would add radiocarbon.
This method of dating, developed by Willard F. Libby in the late
forties, supplied archaeologists with fairly close estimates of
the age of sites containing wood, charcoal, shell, or bone.
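Put as a formula, using the half-life of 5,568 years quoted above, a sample that still contains a fraction f of its original radiocarbon has an age t (in years) given approximately by

    f = (1/2)^(t / 5568), i.e. t = 5568 × log2(1/f),

so one half left corresponds to about 5,568 years, one quarter to about 11,136 years, and one eighth to about 16,704 years.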

Neanderthal is the first of our early men to have lived within the
range of radiocarbon dating. To be more precise, his Mousterian
culture has left traces that can be measured. We have several
radiocarbon measurements that tell us how recently Neanderthal
was around; his oldest cultural materials are beyond the present
range of radiocarbon measurement. At Godarville, Belgium,
Mousterian artifacts were found underlying an accumulation of peat
that dated from more than 36,000 years ago. Since the stone tools
were deposited before the peat, they must be at least as old.
Charcoal from an ancient hearth in a Libyan cave at Haua Fteah,
associated with Levalloiso-Mousterian materials and about three feet
or so above a Neanderthaloid jaw, was dated at 34,000 years, or
possibly older. The archaeology of this cave suggests that the
Neanderthal survived in North Africa until about 30,000 years ago. In
Israel, south of Haifa in the Mount Carmel range, at a site called
Mugharet-el-Kebara, a very small sample of charcoal, thought to
correlate with a nearby Levalloiso-Mousterian deposit, furnished a
date of more than 30,000 years.

In the Near East, archaeological excavation directed by Dr. Ralph S.
Solecki, of the Smithsonian Institution, turned up yet more material
that helps us date the Neanderthal. In the Zagros Mountains of
northern Iraq, at Shanidar Cave, in Shanidar Valley, there is a
remarkable deposit of both the tools and the bones of early man.
The cave contained a dozen feet or so of Upper Paleolithic remains
overlying a deep Mousterian deposit. The bottom level of the Upper
Paleolithic materials has been dated at over 34,000 years. How
much older the Mousterian layers may be we cannot yet be sure.
From depths of 14½ and 23 feet below the surface and within the
Mousterian deposit, adult skeletons have been recovered.
These seem to be Neanderthaloid, as does the skeleton of a
child, recovered at a depth of 26 feet. The estimate is that the
shallower of the skeletons may be about 45,000 years old, the lower
adult perhaps 60,000.[21]

Homo sapiens—New or Old?

The relationship of all these forms of early man is much disputed.


For many, many years they were all supposed to be barren offshoots
of our ancestral tree. Nobody could find the particular breed of ape-
man from which we were descended. Now science is inclined to
lump most of them together in one way or another. There are many
theories and many genealogies. Swanscombe man plays grandfather
to Homo sapiens. Java man and Peking man become the forebears
of the Mongoloid. Other men from Java father the Australian, and
even the Neanderthal. The Neanderthals breed out their crudeness
in some sort of union with Homo sapiens. Or all of them are
admitted to the ranks of Homo, with Neanderthal a degenerate
offshoot without issue.

The earlier picture was simpler and more dramatic. From
Pithecanthropus erectus to Homo neanderthalensis—Java man to the
Neanderthal—these creatures bore no relation to our own happy
breed. Then, quite suddenly, came Homo sapiens in the person of
the Cro-Magnon. He was the kind of tall fellow with a well domed,
narrow Nordic head whom Hitler identified with the better class of
human beings. Except for the “Red Lady of Paviland,” the first
specimens were found at Aurignac, France, in 1852, though nobody
recognized the outstanding quality of the skulls until the find at Cro-
Magnon, France, in 1868 (see illustration, page 89).

98
SCULPTURE OF THE OLD STONE AGE
Above, one of the carved and perforated reindeer antlers of
the Magdalenians, which are sometimes described as
bâtons de commandement; the Eskimos used a somewhat
similar tool for straightening their arrows. Left, the Venus of
Willendorf, an Aurignacian carving in stone, found near
Spitz, Austria. The woman’s head from the Grotte du Pape,
Brassempouy, France, may be either Aurignacian or
Magdalenian. The horse’s head, made of reindeer antlers,
from Mas d’Azil, France, is Magdalenian. (After Osborn,
1915.)

For many years the French clung to what Hooton calls the
rather chauvinistic myth that here, in the waning years of the
Great Ice Age, we find a superior kind of man that was
predominantly a product of the French area. Certainly he was a
remarkable person in many ways. For one thing, he discovered art.
He painted on the walls of his caves and carved on pieces of bone
and elephant ivory pictures of mammoths, bison, and boars, and he
made sculptures of fat women in stone. Also, he began to fish in the
swift streams that ran off from the glaciers. He hunted reindeer and
made use of their antlers as tools. For quite a time he was supposed
to represent the peak of achievement by early man.

Before long, however, the Cro-Magnon became only a factor in a
broader culture, described as the Aurignacian, and soon the
Aurignacian suffered from scientific fission. Through this whole
period and, indeed, until the end of the Old Stone Age, new tools in
the form of blades, chisellike burins, and implements of reindeer
bone make their appearance; but they vary in shape and in the time
of their emergence. Some of these tools divide what was formerly
called the Aurignacian into three parts: the Châtelperron, the Middle
Aurignacian, and the Gravettian. The Châtelperron people developed
a narrow, curved blade out of a tool vaguely Mousterian. The Middle
Aurignacians appeared as invaders with thin blades and scrapers
notched or narrowed halfway along each side. Finally, a people who
had hunted mammoths in southern Russia—the Gravettians—turned
up in France as the inventors of a thin, narrow, and straight
blade made by carefully detaching sliver after sliver from a
well shaped core of flint. Sometimes one edge was blunted to make
it handier to use; occasionally the point of a blade or other tool was
chipped off diagonally to produce a chisellike engraving tool. Another
type of tool, the Font Robert point with a stem, also appeared (see
illustration, page 101).
How blades were split off a core. The technique was
perfected by the Gravettians, an Aurignacian people, and
was practiced by the Aztecs of Mexico. (After Evans, 1872.)

Henry Fairfield Osborn once dated the European advent of the
Aurignacians at about 27,000 years ago, Nelson at 20,000, Mather at 15,000.[22] Zeuner, however,
believes they flourished from about 100,000 until 75,000 years ago.[23] Dating the last of the glaciers
was the key to this dispute. Radiocarbon dates now suggest that the Aurignacian period survived in
Europe and the Near East until 18,000 to 34,000 years ago.[24] Its beginnings may well extend beyond the
range of this method.
101
Upper Paleolithic tools from long flakes taken off cores after
the manner shown on page 100. The burin, or graver, at
the upper left is probably Upper Aurignacian, though
commoner in the Magdalenian culture. The others are
usually called blades. (The burin, after Burkitt, 1933; the
blades, after MacCurdy, 1924.)

The Aurignacians are a variegated lot, which argues further
for subdividing them. One specimen, the tall, high-domed
Cro-Magnon, is variously credited with producing the modern
European man, the Eskimo, and even the Indian of America. Another
specimen, the Grimaldi from the Riviera, is distinctly Negroid.
Another—from hints in several places—seems to be Mongoloid.
Apparently, the Aurignacians were almost variegated enough to have
peopled the modern world. But almost as much could be said for the
inhabitants of an upper level in the Choukoutien Cave near Peking.
There, in one spot, they divide nicely into Negroid, Eskimoid, and
Melanesoid.

Solutrean Flint Workers Invade Europe

Toward the close of Aurignacian times comes a remarkable people
called the Solutreans. They appear quite suddenly as invading
hunters, and they disappear as suddenly. Their culture does not
evolve out of the Aurignacian, and it does not evolve into the next
culture, the Magdalenian. The Solutreans stayed a relatively short
time in Europe; Braidwood once gave them 10,000 years, but Mather and Peake and Fleure only
500.[25] Guesses as to when they arrived vary as widely. Peake and Fleure think it was about 12,000
years ago, while Zeuner puts them back to 67,000 years before our time.[26] Radiocarbon dates
indicate only 18,000 years ago.
Three Aurignacian types that suggest the Negroid, the
Caucasoid, and the Mongoloid. (After Peake and Fleure,
1927.)

103
THE MEANING OF SCRAPERS
“A primitive thing called a scraper is crude and not at all
eloquent until you realize that it points to much else. It
means not only a scraper, but a thing to be scraped, most
likely a hide; therefore it means a growing ability to kill, to
take off the hide and cure it. That is just the beginning, for
a scraper also shows a knowledge of how to scrape, and a
desire for scraping, and enough leisure (beyond the
struggle to get food) to allow time for scraping. All this
means self-restraint and thought for the future, and it
implies a certain confidence in the ways of life, because no
one would be liable to go to all the trouble of scraping if he
did not have reasonable hope of enjoying the results of the
work.”—George R. Stewart, in Man: An Autobiography. (Left
and center, after MacCurdy, 1924; right, after Leakey,
1935.)

Where the Solutreans came from is another of the unsolved
riddles of archaeology. Until recently, they were generally
supposed to have come out of western Asia, because the most
primitive of their remarkable tools were found plentifully in Hungary
and sparsely in western Europe. For sixty years Solutrean points
were found no farther south than northern and eastern Spain. Now,
however, points of Solutrean type have begun to appear in North
Africa, Egypt, and Kenya Colony. Here they are jumbled together in
the same strata with Mousterian points and the tanged points of the
Aterians, a purely African people. Hence certain archaeologists are
inclined to believe that the Solutreans may have originated in Africa
as an offshoot of the Mousterians (see illustration, page 105).

In spite of their fondness for the chase, the Solutreans of Europe
continued the interest that the Aurignacians had shown in art—or so
at least certain authorities who admire the relief carvings of Le Roc
in France tell us. But their chief distinction is that they knew the
craft of pressure flaking typical of Folsom and Eden cultures in the
New World. The Solutreans are represented mainly by thin, willow-
or laurel-shaped tools. By pressing—not pounding—a piece of bone
or wood against the surface of the flint they flaked off slivers across
the tool in a way that no one equaled in the Old World until the
Egyptians had entered the neolithic and agricultural age many