Parallel Programming
Concepts and Practice
Bertil Schmidt
Institut für Informatik
Staudingerweg 9
55128 Mainz
Germany
Jorge González-Domínguez
Computer Architecture Group
University of A Coruña
Edificio área científica (Office 3.08), Campus de Elviña
15071, A Coruña
Spain
Christian Hundt
Institut für Informatik
Staudingerweg 9
55128 Mainz
Germany
Moritz Schlarb
Data Center
Johannes Gutenberg-University Mainz
Anselm-Franz-von-Bentzel-Weg 12
55128 Mainz
Germany
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
Copyright © 2018 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including
photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on
how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as
the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted
herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes
in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information,
methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety
and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or
damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods,
products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-849890-3
Preface
Parallelism abounds. Nowadays, any modern CPU contains at least two cores, whereas some CPUs
feature more than 50 processing units. An even higher degree of parallelism is available on larger sys-
tems containing multiple CPUs such as server nodes, clusters, and supercomputers. Thus, the ability
to program these types of systems efficiently and effectively is an essential aspiration for scientists,
engineers, and programmers. The subject of this book is a comprehensive introduction to the area of
parallel programming that addresses this need. Our book teaches practical parallel programming for
shared memory and distributed memory architectures based on the C++11 threading API, Open Mul-
tiprocessing (OpenMP), Compute Unified Device Architecture (CUDA), Message Passing Interface
(MPI), and Unified Parallel C++ (UPC++), as well as necessary theoretical background. We have in-
cluded a large number of programming examples based on the recent C++11 and C++14 dialects of
the C++ programming language.
This book targets participants of “Parallel Programming” or “High Performance Computing”
courses which are taught at most universities at senior undergraduate level or graduate level in com-
puter science or computer engineering. Moreover, it serves as suitable literature for undergraduates in
other disciplines with a computer science minor or professionals from related fields such as research
scientists, data analysts, or R&D engineers. Prerequisites for being able to understand the contents
of our book include some experience with writing sequential code in C/C++ and basic mathematical
knowledge.
In the good tradition of the historic symbiosis between High Performance Computing and the natural sciences,
we introduce parallel concepts based on real-life applications, ranging from basic linear algebra routines
and machine learning algorithms to physical simulations, as well as traditional algorithms from
computer science. Writing correct yet efficient code is a key skill for every programmer. Hence,
we focus on the actual implementation and performance evaluation of algorithms. Nevertheless, the
theoretical properties of algorithms are discussed in depth, too. Each chapter features a collection of
additional programming exercises that can be solved within a web framework that is distributed with
this book. The System for Automated Code Evaluation (SAUCE) provides a web-based testing en-
vironment for the submission of solutions and their subsequent evaluation in a classroom setting: the
only prerequisite is an HTML5-compatible web browser, allowing for the embedding of interactive
programming exercises in lectures. SAUCE is distributed as a Docker image and can be downloaded at
https://parallelprogrammingbook.org
This website serves as a hub for related content such as installation instructions, a list of errata, and
supplementary material (such as lecture slides and solutions to selected exercises for instructors).
If you are a student or professional who aims to learn a certain programming technique, we advise you to
first read the three introductory chapters on the fundamentals of parallel programming, theoretical models,
and hardware architectures. Subsequently, you can dive into one of the introductory chapters on C++11
Multithreading, OpenMP, CUDA, or MPI which are mostly self-contained. The chapters on Advanced
C++11 Multithreading, Advanced CUDA, and UPC++ build upon the techniques of their preceding
chapter and thus should not be read in isolation.
If you are a lecturer, we propose a curriculum consisting of 14 lectures mainly covering applications
from the introductory chapters. You could start with a lecture discussing the fundamentals from the
first chapter including parallel summation using a hypercube and its analysis, the definition of basic
measures such as speedup, parallelization efficiency and cost, and a discussion of ranking metrics. The
second lecture could cover an introduction to PRAM, network topologies, weak and strong scaling.
You can spend more time on PRAM if you aim to later discuss CUDA in more detail or emphasize
hardware architectures if you focus on CPUs. Two to three lectures could be spent on teaching the
basics of the C++11 threading API, CUDA, and MPI, respectively. OpenMP can be discussed within
a span of one to two lectures. The remaining lectures can be used to discuss the content of the
advanced chapters on multithreading and CUDA, or the PGAS-based UPC++ language.
An alternative approach is splitting the content into two courses with a focus on pair-programming
within the lecture. You could start with a course on CPU-based parallel programming covering selected
topics from the first three chapters. Hence, C++11 threads, OpenMP, and MPI could be taught in full
detail. The second course would focus on advanced parallel approaches covering extensive CUDA
programming in combination with (CUDA-aware) MPI and/or the PGAS-based UPC++.
We wish you a great time with the book. Be creative and investigate the code! Finally, we would be
happy to hear your feedback so that we can improve our provided material.
Acknowledgments
This book would not have been possible without the contributions of many people.
Initially, we would like to thank the anonymous and few non-anonymous reviewers who com-
mented on our book proposal and the final draft: Eduardo Cesar Galobardes, Ahmad Al-Khasawneh,
and Mohammad Olaimat.
Moreover, we would like to thank our colleagues who thoroughly peer-reviewed the chapters and
provided essential feedback: André Müller for his valuable advice on C++ programming, Robin Kobus
for being a tough code reviewer, Felix Kallenborn for his steady proofreading sessions, Daniel Jünger
for constantly complaining about the CUDA chapter, as well as Stefan Endler and Elmar Schömer for
their suggestions.
Additionally, we would like to thank the staff of Morgan Kaufmann and Elsevier who coordinated
the making of this book. In particular, we would like to mention Nate McFadden.
Finally, we would like to thank our spouses and children for their ongoing support and patience
during the countless hours we could not spend with them.
CHAPTER 1
INTRODUCTION
Abstract
In the recent past, teaching and learning of parallel programming has become increasingly important
due to the ubiquity of parallel processors in portable devices, workstations, and compute clusters. Stag-
nating single-threaded performance of modern CPUs requires future computer scientists and engineers
to write highly parallelized code in order to fully utilize the compute capabilities of current hardware
architectures. The design of parallel algorithms, however, can be challenging, especially for inexpe-
rienced students, due to common pitfalls such as race conditions when concurrently accessing shared
resources, defective communication patterns causing deadlocks, or the non-trivial task of efficiently
scaling an application across all available compute units. Hence, acquiring parallel
programming skills is nowadays an important part of many undergraduate and graduate curricula.
More importantly, education of concurrent concepts is not limited to the field of High Performance
Computing (HPC). The emergence of deep learning and big data lectures requires teachers and stu-
dents to adopt HPC as an integral part of their knowledge domain. An understanding of basic concepts
is indispensable for acquiring a deep understanding of fundamental parallelization techniques.
The goal of this chapter is to provide an overview of introductory concepts and terminologies in parallel
computing. We start with learning about speedup, efficiency, cost, scalability, and the computation-to-
communication ratio by analyzing a simple yet instructive example for summing up numbers using a
varying number of processors. We get to know about the two most important parallel architectures:
distributed memory systems and shared memory systems. Designing efficient parallel programs re-
quires a lot of experience and we will study a number of typical considerations for this process such
as problem partitioning strategies, communication patterns, synchronization, and load balancing. We
end this chapter with learning about current and past supercomputers and their historical and upcoming
architectural trends.
Keywords
Parallelism, Speedup, Parallelization, Efficiency, Scalability, Reduction, Computation-to-communication
ratio, Distributed memory, Shared memory, Partitioning, Communication, Synchronization, Load
balancing, Task parallelism, Prefix sum, Deep learning, Top500
CONTENTS
1.1 Motivational Example and Its Analysis ............................................................................ 2
The General Case and the Computation-to-Communication Ratio..................................... 8
1.2 Parallelism Basics .................................................................................................... 10
Distributed Memory Systems................................................................................ 10
Shared Memory Systems..................................................................................... 11
• Speedup. You have designed a parallel algorithm or written a parallel code. Now you want to
know how much faster it is than your sequential approach; i.e., you want to know the speedup.
The speedup (S) is usually measured or calculated for almost every parallel code or algorithm and
is simply defined as the quotient of the time taken using a single processor (T (1)) over the time
measured using p processors (T (p)) (see Eq. (1.1)).
S = T(1) / T(p)    (1.1)
• Efficiency and cost. The best speedup you can usually expect is a linear speedup; i.e., the maximal
speedup you can achieve with p processors or cores is p (although there are exceptions to this,
which are referred to as super-linear speedups). Thus, you want to relate the speedup to the number
of utilized processors or cores. The efficiency E measures exactly that by dividing S by p (see
Eq. (1.2)); i.e., linear speedup would then be expressed by a value close to 100%. The cost C is
similar but relates the runtime T(p) (instead of the speedup) to the number of utilized processors
(or cores) by multiplying T(p) and p (see Eq. (1.3)). A small code sketch following this list shows
how these three measures are computed from measured runtimes.
E = S / p = T(1) / (T(p) × p)    (1.2)
C = T(p) × p    (1.3)
• Scalability. Often we want to measure the efficiency not only for one particular number of pro-
cessors or cores but for a varying number, e.g. p = 1, 2, 4, 8, 16, 32, 64, 128, etc. This is called
scalability analysis and indicates the behavior of a parallel program when the number of processors
increases. Besides varying the number of processors, the input data size is another parameter that
you might want to vary when executing your code. Thus, there are two types of scalability: strong
scalability and weak scalability. In the case of strong scalability we measure efficiencies for a vary-
ing number of processors and keep the input data size fixed. In contrast, weak scalability shows the
behavior of our parallel code for varying both the number of processors and the input data size; i.e.
when doubling the number of processors we also double the input data size.
• Computation-to-communication ratio. This is an important metric influencing the achievable
scalability of a parallel implementation. It can be defined as the time spent calculating divided by
the time spent communicating messages between processors. A higher ratio often leads to improved
speedups and efficiencies.
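The code sketch referenced above is given here: a minimal C++ illustration of Eqs. (1.1)–(1.3), where the runtimes T(1) = 100 s, T(p) = 13.5 s and p = 8 are made-up example values rather than measurements from the book.

#include <iostream>

int main() {
    // Hypothetical measured runtimes in seconds (illustrative values, not from the book).
    const double T1 = 100.0; // T(1): runtime on a single processor
    const double Tp = 13.5;  // T(p): runtime on p processors
    const int    p  = 8;

    const double S = T1 / Tp; // speedup, Eq. (1.1)
    const double E = S / p;   // efficiency, Eq. (1.2)
    const double C = Tp * p;  // cost, Eq. (1.3)

    std::cout << "speedup    S = " << S << "\n"
              << "efficiency E = " << E << "\n"
              << "cost       C = " << C << " processor-seconds\n";
    return 0;
}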
The example we now want to look at is a simple summation; i.e., given an array A of n numbers we
want to compute the sum ∑_{i=0}^{n−1} A[i]. We parallelize this problem using an array of processing elements (PEs).
We make the following (not necessarily realistic) assumptions:
• Computation. Each PE can add two numbers stored in its local memory in one time unit.
• Communication. A PE can send data from its local memory to the local memory of any other PE
in three time units (independent of the size of the data).
• Input and output. At the beginning of the program the whole input array A is stored in PE #0. At
the end the result should be gathered in PE #0.
• Synchronization. All PEs operate in lock-step manner; i.e. they can either compute, communicate,
or be idle. Thus, it is not possible to overlap computation and communication on this architecture.
Speedup is relative. Therefore, we need to establish the runtime of a sequential program first. The
sequential program simply uses a single processor (e.g. PE #0) and adds the n numbers using n − 1
additions in n − 1 time units; i.e. T (1, n) = n − 1. In the following we illustrate our parallel algorithm
for varying p, where p denotes the number of utilized PEs. We further assume that n is a power of 2;
i.e., n = 2^k for a positive integer k.
• p = 2. PE #0 sends half of its array to PE #1 (takes three time units). Both PEs then compute the
sum of their respective n/2 numbers (takes time n/2 − 1). PE #1 sends its partial sum back to PE
#0 (takes time 3). PE #0 adds the two partial sums (takes time 1). The overall required runtime is
T(2, n) = 3 + n/2 − 1 + 3 + 1. Fig. 1.1 illustrates the computation for n = 1024 = 2^10, which has
a runtime of T(2, 1024) = 3 + 511 + 3 + 1 = 518. This is significantly faster than the sequential
runtime. We can calculate the speedup for this case as T(1, 1024)/T(2, 1024) = 1023/518 ≈ 1.975.
This is very close to the optimum of 2 and corresponds to an efficiency of 98.75% (calculated
by dividing the speedup by the number of utilized PEs, i.e. 1.975/2).
• p = 4. PE #0 sends half of the input data to PE #1 (takes time 3). Afterwards PE #0 and PE #1
each send a quarter of the input data to PE #2 and PE #3 respectively (takes time 3). All four PEs
then compute the sum of their respective n/4 numbers in parallel (takes time n/4 − 1). PE #2 and
PE #3 send their partial sums to PE #0 and PE #1, respectively (takes time 3). PE #0 and PE #1
add their respective partial sums (takes time 1). PE #1 then sends its partial sum to PE #0 (takes
time 3). Finally, PE #0 adds the two partial sums (takes time 1). The overall required runtime is
T(4, n) = 3 + 3 + n/4 − 1 + 3 + 1 + 3 + 1. Fig. 1.2 illustrates the computation for n = 1024 = 2^10,
which has a runtime of T(4, 1024) = 3 + 3 + 255 + 3 + 1 + 3 + 1 = 269. We can again calculate the
speedup for this case as T(1, 1024)/T(4, 1024) = 1023/269 ≈ 3.803, resulting in an efficiency of
95.07%. Even though this value is also close to 100%, it is slightly reduced in comparison to p = 2.
The reduction is caused by the additional communication overhead required for the larger number
of processors.
• p = 8. PE #0 sends half of its array to PE #1 (takes time 3). PE #0 and PE #1 then each send a
quarter of the input data to PE #2 and PE #3 (takes time 3). Afterwards, PE #0, PE #1, PE #2, and
PE #3 each send 1/8 of the input data to PE #4, PE #5, PE #6, and PE #7 (takes again time 3).
Fig. 1.3 illustrates the three initial data distribution steps for n = 1024 = 2^10. All eight PEs then
compute the sum of their respective n/8 numbers (takes time n/8 − 1). PE #4, PE #5, PE #6, and
PE #7 send their partial sums to PE #0, PE #1, PE #2, and PE #3, respectively (takes time 3).
FIGURE 1.1
Summation of n = 1024 numbers on p = 2 PEs: (A) initially PE #0 stores the whole input data locally; (B) PE #0
sends half of the input to PE #1 (takes time 3); (C) Each PE sums up its 512 numbers (takes time 511);
(D) PE #1 sends its partial sum back to PE #0 (takes time 3); (E) To finalize the computation, PE #0 adds the
two partial sums (takes time 1). Thus, the total runtime is T (2, 1024) = 3 + 511 + 3 + 1 = 518.
Subsequently, PE #0, PE #1, PE #2, and PE #3 add their respective partial sums (takes time 1). PE
#2 and PE #3 then send their partial sums to PE #0 and PE #1, respectively (takes time 3). PE #0
and PE #1 add their respective partial sums (takes time 1). PE #1 then sends its partial sum to PE #0
(takes time 3). Finally, PE #0 adds the two partial sums (takes time 1). The overall required runtime
is T(8, n) = 3 + 3 + 3 + n/8 − 1 + 3 + 1 + 3 + 1 + 3 + 1. The computation for n = 1024 = 2^10
thus has a runtime of T(8, 1024) = 3 + 3 + 3 + 127 + 3 + 1 + 3 + 1 + 3 + 1 = 148. The speedup
for this case is T(1, 1024)/T(8, 1024) = 1023/148 ≈ 6.91, resulting in an efficiency of about 86%. The
decreasing efficiency is again caused by the additional communication overhead required for the
larger number of processors. (A code sketch of the same partition-and-combine idea on a shared
memory machine follows after this list.)
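As mentioned at the end of the list above, the same partition-and-combine pattern can also be expressed on a real shared memory machine with the C++11 threading API. The following is a minimal illustrative sketch rather than the book's reference code: it assumes the input fits in memory, splits the array into contiguous chunks (one per hardware thread) instead of the tree-shaped distribution of the PE model, and combines the partial sums sequentially at the end.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;                               // 2^20 input numbers
    const unsigned    p = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> A(n, 1.0);                               // example input: all ones
    std::vector<double> partial(p, 0.0);                         // one partial sum per thread

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < p; ++t)
        threads.emplace_back([&, t] {
            // each thread sums a contiguous chunk of the array
            const std::size_t lo = std::size_t(t) * n / p;
            const std::size_t hi = std::size_t(t + 1) * n / p;
            partial[t] = std::accumulate(A.begin() + lo, A.begin() + hi, 0.0);
        });
    for (auto& th : threads) th.join();

    // final combination of the p partial sums (the "reduction" step)
    const double sum = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::cout << "sum = " << sum << "\n";
    return 0;
}

Compile, for example, with g++ -std=c++11 -pthread. The chunked partitioning keeps the example short; a tree-shaped combination of the partial sums would mirror the PE algorithm more closely.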
We are now able to analyze the runtime of our parallel summation algorithm in a more general way
using p = 2^q PEs and n = 2^k input numbers: the initial data distribution requires q communication
steps (time 3q), the local summations require time 2^(k−q) − 1, and the final reduction consists of
q rounds of one communication and one addition each (time 4q). The overall parallel runtime is therefore

T(2^q, 2^k) = 3q + 2^(k−q) − 1 + 4q = 2^(k−q) − 1 + 7q    (1.4)
FIGURE 1.2
Summation of n = 1024 numbers on p = 4 PEs: (A) initially PE #0 stores the whole input in its local memory;
(B) PE #0 sends half of its input to PE #1 (takes time 3); (C) PE #0 and PE #1 send half of their data to PE #2
and PE #3 (takes time 3); (D) Each PE adds its 256 numbers (takes time 255); (E) PE #2 and PE #3 send their
partial sums to PE #0 and PE #1, respectively (takes time 3). Subsequently, PE #0 and PE #1 add their
respective partial sums (takes time 1); (F) PE #1 sends its partial sum to PE #0 (takes time 3), which then
finalizes the computation by adding them (takes time 1). Thus, the total runtime is
T (4, 1024) = 3 + 3 + 255 + 3 + 1 + 3 + 1 = 269.
FIGURE 1.3
The three initial data distribution steps for n = 1024 and p = 8: (A) Initially PE #0 stores the whole input in its
local memory and sends half of its input to PE #1; (B) PE #0 and PE #1 send half of their (remaining) data to
PE #2 and PE #3; (C) PE #0, PE #1, PE #2, and PE #3 each send half of their (remaining) input data to PE #4,
PE #5, PE #6, and PE #7.
Fig. 1.4 shows the runtime, speedup, cost, and efficiency of our parallel algorithm for n = 1024
and p ranging from 1 to 512. This type of runtime analysis (where the input size is kept constant
and the number of PEs is scaled) is called strong scalability analysis. We can see that the efficiency
FIGURE 1.4
Strong scalability analysis: runtime, speedup, cost, and efficiency of our parallel summation algorithm for
adding n = 1024 numbers on a varying number of PEs (ranging from 1 to 512).
is high for a small number of PEs (i.e. p ≪ n), but is low for a large number of PEs (i.e. p ≈ n).
This behavior can also be deduced from Eq. (1.4): for the case p ≪ n we have 2^(k−q) ≫ 7q (i.e., the
term for the computation time dominates), while 2^(k−q) ≪ 7q holds for the case p ≈ n (i.e., the term
for the communication time dominates). Thus, we can conclude that our algorithm is not strongly scal-
able.
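Since the model runtime from Eq. (1.4) is a closed formula, the strong scalability numbers of Fig. 1.4 can be reproduced directly; the following short sketch (an illustration, not code from the book) evaluates runtime, speedup, cost, and efficiency for n = 1024 and p = 1, 2, ..., 512.

#include <cstdio>

int main() {
    const int    k  = 10;                  // n = 2^10 = 1024 numbers (fixed input size)
    const double T1 = (1 << k) - 1;        // sequential runtime T(1, n) = n - 1
    std::printf("%6s %10s %10s %10s %12s\n", "p", "T(p,n)", "speedup", "cost", "efficiency");
    for (int q = 0; q <= 9; ++q) {         // p = 2^q ranges from 1 to 512
        const int    p = 1 << q;
        const double T = (1 << (k - q)) - 1 + 7 * q;   // model runtime, Eq. (1.4)
        const double S = T1 / T;
        std::printf("%6d %10.0f %10.3f %10.0f %11.1f%%\n", p, T, S, T * p, 100.0 * S / p);
    }
    return 0;
}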
Now, we want to change our analysis a bit by not only increasing the number of PEs but additionally
increasing the input data size at the same time. This is known as weak scalability analysis. Fig. 1.5
shows the speedup and efficiency of our algorithm for n ranging from 1024 to 524,288 and p ranging
from 1 to 512. We can see that the efficiency is kept high (close to 100%) even for a large number of
PEs. This behavior can again be deduced from Eq. (1.4): since both n and p are scaled at the same
rate, the term relating to the computation time is constant for a varying number of PEs (i.e. 2^(k−q) = 1024
in Fig. 1.5), while the term for the communication time (7q = 7 × log2(p)) only grows at a logarithmic
rate. Thus, we can conclude that our algorithm is weakly scalable.
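Analogously, the weak scaling behavior can be checked numerically by growing the input together with the number of PEs (n = 1024 · p, i.e. k = 10 + q); again a minimal sketch based on the model formula rather than on measured runtimes.

#include <cstdio>

int main() {
    for (int q = 0; q <= 9; ++q) {                        // p = 2^q from 1 to 512
        const long   p  = 1L << q;
        const long   n  = 1024L * p;                      // input grows with p (k = 10 + q)
        const double T1 = n - 1;                          // sequential runtime T(1, n)
        const double Tp = 1024 - 1 + 7 * q;               // Eq. (1.4) with 2^(k-q) = 1024
        const double S  = T1 / Tp;
        std::printf("p = %4ld  n = %8ld  speedup = %8.2f  efficiency = %6.1f%%\n",
                    p, n, S, 100.0 * S / p);
    }
    return 0;
}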
The terms weak and strong scalability are also related to two well-known laws in parallel comput-
ing: Amdahl's law and Gustafson's law, which we will discuss in more detail in Chapter 2.
FIGURE 1.5
Weak scalability analysis: speedup and efficiency of our parallel summation algorithm for adding n = 1024 × p
numbers on p PEs (p ranging from 1 to 512).
In the general case, assume that each addition takes α time units and each communication takes β time
units, so that the overall parallel runtime becomes T_α,β(2^q, 2^k) = 2βq + α·(2^(k−q) − 1 + q). The
speedup is defined as the quotient of the sequential and the parallel runtime:
S_α,β(2^q, 2^k) = T_α,β(2^0, 2^k) / T_α,β(2^q, 2^k) = α·(2^k − 1) / (2βq + α·(2^(k−q) − 1 + q))    (1.6)
For our example we define the computation-to-communication ratio as γ = βα . The speedup then tends
to zero if we compute the limit γ → 0 for q > 0:
γ 2k − 1
Sγ (2 , 2 ) =
q k and lim Sγ (2q , 2k ) = 0 . (1.7)
2q + γ 2k−q − 1 + q γ →0
The first derivative of S_γ(2^q, 2^k) with respect to γ for fixed q and k is always positive, i.e. the speedup
is monotonically decreasing if we increase the communication time (reduce the value of γ). Let k >
q > 0, A(k) = 2^k − 1 > 0, and B(q, k) = 2^(k−q) − 1 + q > 0; then we can simply apply the quotient
rule:

d/dγ S_γ(2^q, 2^k) = d/dγ [ γ·A(k) / (2q + γ·B(q, k)) ] = 2q·A(k) / (2q + γ·B(q, k))^2 > 0    (1.8)
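The monotone dependence on γ and the limit in Eq. (1.7) can also be observed numerically. The following small sketch (an illustration, not code from the book) evaluates S_γ(2^q, 2^k) for q = 3 and k = 10 over a decreasing sequence of γ values.

#include <cstdio>

// Speedup S_gamma(2^q, 2^k) as given in Eq. (1.7).
double speedup(double gamma, int q, int k) {
    const double A = (1 << k) - 1;            // A(k) = 2^k - 1
    const double B = (1 << (k - q)) - 1 + q;  // B(q, k) = 2^(k-q) - 1 + q
    return gamma * A / (2 * q + gamma * B);
}

int main() {
    const int q = 3, k = 10;                  // p = 8 PEs, n = 1024 numbers
    for (double gamma = 1.0; gamma >= 1e-6; gamma /= 10)
        std::printf("gamma = %8.1e  speedup = %8.4f\n", gamma, speedup(gamma, q, k));
    // As gamma -> 0 (communication much more expensive than computation),
    // the speedup tends to 0, consistent with Eq. (1.7).
    return 0;
}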
As a chipper of flint, he was much more skillful than those that had
gone before. He soon gave up making tools which, like the hand ax,
would serve a number of purposes, but none very well. He invented
the stone spear point. There is still considerable argument about his
flint work, or at least about the technique which he may or may not
have developed.
THREE TYPES OF OLD WORLD MAN
Note the progressive lessening of brow ridge and receding
chin and the increase in the height of forehead and vault.
Pithecanthropus robustus was found in the same general
area as Pithecanthropus erectus, or Java man. (Robustus,
after Weidenreich, 1946; Neanderthal, after McGregor,
1926; Cro-Magnon, after Verneau, 1906.)
When man began to make tools he pounded one rock with another.
He hoped he would knock off just the right chip in the right spot.
This is called the percussion method of flaking (see illustration, page
91). Some say the Neanderthal was not content with this. They think
that he must have discovered how to place a piece of bone or very
hard wood against a flint at the point where he wanted to knock off
a flake, and then strike it with a hammerstone (see illustration, page
92). This might account for the small chips, or “retouches,” taken off
the edge of some of his spear points as in the illustration below.
Even the Acheuleans are occasionally credited with this invention
because many of their hand axes are so symmetrical. There are
those who say that the Neanderthal had progressed so far in flint
work that he knew the art of pressure flaking—the third step in flint
knapping—which involved the pressing off of small chips with the bit
of wood or bone held in the hand (see illustration, page 93). It
seems more likely that Neanderthals and men of Acheulean times
used the anvil method of percussion flaking (lower drawing, page
91), not an inaccurate way of knocking off small chips.
PERCUSSION FLAKING
The first method by which early man shaped his tools.
(After Holmes, 1919.)
THE SECOND STEP IN FLINT KNAPPING
For more accurate work, early man applied a small stick of
hardwood or a piece of bone at the proper spot and hit the
interposed tool with a mallet of heavy wood or a rock. No
one knows who invented this technique—Acheulean,
Neanderthal, or later man. (After Holmes, 1919.)
THE THIRD STEP—PRESSURE FLAKING
The discovery that gave early man complete control over
the shaping of flints was that a slow and continued
pressure would dislodge just the flake he desired. Above,
we see how he worked on a small point, and below, to the
left, how he chipped thin slivers from a core. (After Holmes,
1919.)
However early the Mousterian culture may have begun, the later
stages fall within the range of one of the archaeologist’s most
interesting and precise techniques for dating. This is the radiocarbon, or Carbon 14, method.[19]
Much simplified, it depends upon the following phenomena: Most plants are
radioactive, and so are all animals that depend directly or indirectly
upon these plants for food. This radioactivity is found in a rare form
of carbon called radiocarbon, or Carbon 14. While a plant or an
animal is alive, it contains a constant proportion of this radioactive
material. Radiocarbon is always breaking down and disappearing,
but while a tree or a bird or another animal lives, this material is
being renewed. When this same tree or animal dies, it stops
acquiring new radiocarbon, and therefore its radioactivity decreases.
Heartwood from a 4,000-year-old sequoia tree is appreciably less
radioactive than the living outer layer. Antlers shed by a buck in the
last spring are more radioactive than any reindeer antlers left in a
cave in France by Old Stone Age hunters 17,000 years ago.
Neanderthal is the first of our early men to have lived within the
range of radiocarbon dating. To be more precise, his Mousterian
culture has left traces that can be measured. We have several
radiocarbon measurements that tell us how recently Neanderthal
was around; his oldest cultural materials are beyond the present
range of radiocarbon measurement. At Godarville, Belgium,
Mousterian artifacts were found underlying an accumulation of peat
that dated from more than 36,000 years ago. Since the stone tools
were deposited before the peat, they must be at least as old.
Charcoal from an ancient hearth in a Libyan cave at Haua Fteah,
associated with Levalloiso-Mousterian materials and about three feet
or so above a Neanderthaloid jaw, was dated at 34,000 years, or
possibly older. The archaeology of this cave suggests that the
Neanderthal survived in North Africa until about 30,000 years ago. In
Israel, south of Haifa in the Mount Carmel range, at a site called
Mugharet-el-Kebara, a very small sample of charcoal, thought to
correlate with a nearby Levalloiso-Mousterian deposit, furnished a
date of more than 30,000 years.
SCULPTURE OF THE OLD STONE AGE
Above, one of the carved and perforated reindeer antlers of
the Magdalenians, which are sometimes described as
bâtons de commandement; the Eskimos used a somewhat
similar tool for straightening their arrows. Left, the Venus of
Willendorf, an Aurignacian carving in stone, found near
Spitz, Austria. The woman’s head from the Grotte du Pape,
Brassempouy, France, may be either Aurignacian or
Magdalenian. The horse’s head, made of reindeer antlers,
from Mas d’Azil, France, is Magdalenian. (After Osborn,
1915.)
For many years the French clung to what Hooton calls the rather chauvinistic myth that here, in the waning years of the
Great Ice Age, we find a superior kind of man that was
predominantly a product of the French area. Certainly he was a
remarkable person in many ways. For one thing, he discovered art.
He painted on the walls of his caves and carved on pieces of bone
and elephant ivory pictures of mammoths, bison, and boars, and he
made sculptures of fat women in stone. Also, he began to fish in the
swift streams that ran off from the glaciers. He hunted reindeer and
made use of their antlers as tools. For quite a time he was supposed
to represent the peak of achievement by early man.
THE MEANING OF SCRAPERS
“A primitive thing called a scraper is crude and not at all
eloquent until you realize that it points to much else. It
means not only a scraper, but a thing to be scraped, most
likely a hide; therefore it means a growing ability to kill, to
take off the hide and cure it. That is just the beginning, for
a scraper also shows a knowledge of how to scrape, and a
desire for scraping, and enough leisure (beyond the
struggle to get food) to allow time for scraping. All this
means self-restraint and thought for the future, and it
implies a certain confidence in the ways of life, because no
one would be liable to go to all the trouble of scraping if he
did not have reasonable hope of enjoying the results of the
work.”—George R. Stewart, in Man: An Autobiography. (Left
and center, after MacCurdy, 1924; right, after Leakey,
1935.)