High Performance
Computing
Programming and Applications
Chapman & Hall/CRC
Computational Science Series

SERIES EDITOR
Horst Simon
Deputy Director
Lawrence Berkeley National Laboratory
Berkeley, California, U.S.A.

AIMS AND SCOPE

This series aims to capture new developments and applications in the field of computational sci-
ence through the publication of a broad range of textbooks, reference works, and handbooks.
Books in this series will provide introductory as well as advanced material on mathematical, sta-
tistical, and computational methods and techniques, and will present researchers with the latest
theories and experimentation. The scope of the series includes, but is not limited to, titles in the
areas of scientific computing, parallel and distributed computing, high performance computing,
grid computing, cluster computing, heterogeneous computing, quantum computing, and their
applications in scientific disciplines such as astrophysics, aeronautics, biology, chemistry, climate
modeling, combustion, cosmology, earthquake prediction, imaging, materials, neuroscience, oil
exploration, and weather forecasting.

PUBLISHED TITLES
PETASCALE COMPUTING: ALGORITHMS AND APPLICATIONS
Edited by David A. Bader
PROCESS ALGEBRA FOR PARALLEL AND DISTRIBUTED PROCESSING
Edited by Michael Alexander and William Gardner
GRID COMPUTING: TECHNIQUES AND APPLICATIONS
Barry Wilkinson
INTRODUCTION TO CONCURRENCY IN PROGRAMMING LANGUAGES
Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen
INTRODUCTION TO SCHEDULING
Yves Robert and Frédéric Vivien
SCIENTIFIC DATA MANAGEMENT: CHALLENGES, TECHNOLOGY, AND DEPLOYMENT
Edited by Arie Shoshani and Doron Rotem
INTRODUCTION TO THE SIMULATION OF DYNAMICS USING SIMULINK®
Michael A. Gray
INTRODUCTION TO HIGH PERFORMANCE COMPUTING FOR SCIENTISTS
AND ENGINEERS, Georg Hager and Gerhard Wellein
PERFORMANCE TUNING OF SCIENTIFIC APPLICATIONS, Edited by David Bailey,
Robert Lucas, and Samuel Williams
HIGH PERFORMANCE COMPUTING: PROGRAMMING AND APPLICATIONS
John Levesque with Gene Wagenbreth
High Performance
Computing
Programming and Applications

John Levesque
with Gene Wagenbreth
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2011 by Taylor and Francis Group, LLC


Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed in the United States of America on acid-free paper


10 9 8 7 6 5 4 3 2 1

International Standard Book Number: 978-1-4200-7705-6 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Levesque, John M.
High performance computing : programming and applications / John Levesque, Gene
Wagenbreth.
p. cm. -- (Chapman & Hall/CRC computational science series)
Includes bibliographical references and index.
ISBN 978-1-4200-7705-6 (hardcover : alk. paper)
1. High performance computing. 2. Supercomputers--Programming. I. Wagenbreth,
Gene. II. Title.

QA76.88.L48 2011
004.1’1--dc22 2010044861

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com
Contents

Introduction

Chapter 1 ◾ Multicore Architectures
1.1 Memory Architecture
1.1.1 Why the Memory Wall?
1.1.2 Hardware Counters
1.1.3 Translation Look-Aside Buffer
1.1.4 Caches
1.1.4.1 Associativity
1.1.4.2 Memory Alignment
1.1.5 Memory Prefetching
1.2 SSE Instructions
1.3 Hardware Described in This Book
Exercises

Chapter 2 ◾ The MPP: A Combination of Hardware and Software
2.1 Topology of the Interconnect
2.1.1 Job Placement on the Topology
2.2 Interconnect Characteristics
2.2.1 The Time to Transfer a Message
2.2.2 Perturbations Caused by Software
2.3 Network Interface Computer
2.4 Memory Management for Messages
2.5 How Multicores Impact the Performance of the Interconnect
Exercises

Chapter 3 ◾ How Compilers Optimize Programs
3.1 Memory Allocation
3.2 Memory Alignment
3.3 Vectorization
3.3.1 Dependency Analysis
3.3.2 Vectorization of IF Statements
3.3.3 Vectorization of Indirect Addressing and Strides
3.3.4 Nested DO Loops
3.4 Prefetching Operands
3.5 Loop Unrolling
3.6 Interprocedural Analysis
3.7 Compiler Switches
3.8 Fortran 2003 and Its Inefficiencies
3.8.1 Array Syntax
3.8.2 Using Optimized Libraries
3.8.3 Passing Array Sections
3.8.4 Using Modules for Local Variables
3.8.5 Derived Types
3.9 Scalar Optimizations Performed by the Compiler
3.9.1 Strength Reduction
3.9.2 Avoiding Floating Point Exponents
3.9.3 Common Subexpression Elimination
Exercises

Chapter 4 ◾ Parallel Programming Paradigms
4.1 How Cores Communicate with Each Other
4.1.1 Using MPI across All the Cores
4.1.2 Decomposition
4.1.3 Scaling an Application
4.2 Message Passing Interface
4.2.1 Message Passing Statistics
4.2.2 Collectives
4.2.3 Point-to-Point Communication
4.2.3.1 Combining Messages into Larger Messages
4.2.3.2 Preposting Receives
4.2.4 Environment Variables
4.2.5 Using Runtime Statistics to Aid MPI-Task Placement
4.3 Using OpenMPTM
4.3.1 Overhead of Using OpenMPTM
4.3.2 Variable Scoping
4.3.3 Work Sharing
4.3.4 False Sharing in OpenMPTM
4.3.5 Some Advantages of Hybrid Programming: MPI with OpenMPTM
4.3.5.1 Scaling of Collectives
4.3.5.2 Scaling Memory Bandwidth Limited MPI Applications
4.4 POSIX® Threads
4.5 Partitioned Global Address Space Languages (PGAS)
4.5.1 PGAS for Adaptive Mesh Refinement
4.5.2 PGAS for Overlapping Computation and Communication
4.5.3 Using CAF to Perform Collective Operations
4.6 Compilers for PGAS Languages
4.7 Role of the Interconnect
Exercises

Chapter 5 ◾ A Strategy for Porting an Application to a Large MPP System
5.1 Gathering Statistics for a Large Parallel Program
Exercises

Chapter 6 ◾ Single Core Optimization
6.1 Memory Accessing
6.1.1 Computational Intensity
6.1.2 Looking at Poor Memory Utilization
6.2 Vectorization
6.2.1 Memory Accesses in Vectorized Code
6.2.2 Data Dependence
6.2.3 IF Statements
6.2.4 Subroutine and Function Calls
6.2.4.1 Calling Libraries
6.2.5 Multinested DO Loops
6.2.5.1 More Efficient Memory Utilization
6.3 Summary
6.3.1 Loop Reordering
6.3.2 Index Reordering
6.3.3 Loop Unrolling
6.3.4 Loop Splitting (Loop Fission)
6.3.5 Scalar Promotion
6.3.6 Removal of Loop-Independent IFs
6.3.7 Use of Intrinsics to Remove IFs
6.3.8 Strip Mining
6.3.9 Subroutine Inlining
6.3.10 Pulling Loops into Subroutines
6.3.11 Cache Blocking
6.3.12 Loop Jamming (Loop Fusion)
Exercises

Chapter 7 ◾ Parallelism across the Nodes
7.1 LESLIE3D
7.2 Parallel Ocean Program (POP)
7.3 SWIM
7.4 S3D
7.5 Load Imbalance
7.5.1 SWEEP3D
7.6 Communication Bottlenecks
7.6.1 Collectives
7.6.1.1 Writing One’s Own Collectives to Achieve Overlap with Computation
7.6.2 Point to Point
7.6.3 Using the “Hybrid Approach” Efficiently
7.6.3.1 SWIM Benchmark
7.7 Optimization of Input and Output (I/O)
7.7.1 Parallel File Systems
7.7.2 An Inefficient Way of Doing I/O on a Large Parallel System
7.7.3 An Efficient Way of Doing I/O on a Large Parallel System
Exercises

Chapter 8 ◾ Node Performance
8.1 Wuppertal Wilson Fermion Solver: WUPWISE
8.2 SWIM
8.3 MGRID
8.4 APPLU
8.5 GALGEL
8.6 APSI
8.7 EQUAKE
8.8 FMA-3D
8.9 ART
8.10 Another Molecular Mechanics Program (AMMP)
8.11 Summary
Exercises

Chapter 9 ◾ Accelerators and Conclusion
9.1 Accelerators
9.1.1 Using Extensions to OpenMPTM Directives for Accelerators
9.1.2 Efficiency Concerns When Using an Accelerator
9.1.3 Summary for Using Accelerators
9.1.4 Programming Approaches for Using Multicore Nodes with Accelerators
9.2 Conclusion
Exercises

Appendix A: Common Compiler Directives
Appendix B: Sample MPI Environment Variables
References
Index

Introduction

The future of high-performance computing (HPC) lies with large distrib-
uted parallel systems with three levels of parallelism, thousands of nodes
containing MIMD (multiple instruction, multiple data) groups of SIMD
(single instruction, multiple data) processors. For the past 50 years, the
clock cycle of single processors has decreased steadily and the cost of each
processor has decreased. Most applications would obtain speed increases
inversely proportional to the decrease in the clock cycle and cost reduction
by running on new hardware every few years with little or no additional
programming effort. However, that era is over. Users now must utilize
parallel algorithms in their application codes to reap the benefit of new
hardware advances.
In the near term, the HPC community will see an evolution in the
architecture of its basic building block processors. So, while clock cycles
are not decreasing in these newer chips, we see that Moore’s law still holds
with denser circuitry and multiple processors on the die. The net result is
more floating point operations/second (FLOPS) being produced by wider
functional units on each core and multiple cores on the chip. Additionally,
a new source of energy-efficient performance has entered the HPC com-
munity. The traditional graphics processing units are becoming more reli-
able and more powerful for high-precision operations, making them viable
to be used for HPC applications.
But to realize this increased potential FLOP rate from the wider
instruction words and the parallel vector units on the accelerators, the
compiler must generate vector or streaming SIMD extension (SSE)
instructions. Using vector and SSE instructions is by no means automatic.
The user must either hand code the instructions using the assembler and/
or the compiler must generate the SSE instructions and vector code when
compiling the program. This means that the compiler must perform the

extra dependence analysis needed to transform scalar loops into vector-


izable loops. Once the SSE instructions on the X86 processor are employed,
a significant imbalance between greater processor performance and lim-
ited memory bandwidth is revealed. Memory architectures may eventu-
ally remove this imbalance, but in the near term, poor memory
performance remains as a major concern in the overall performance of
HPC applications.
Moving beyond the chip, clustered systems of fast multicore processors
and/or accelerators on the node immediately reveal a serious imbalance
between the speed of the network and the performance of each node.
While node processor performance gains are from multiple cores, multi-
ple threads, and multiple wide functional units, the speed gains of net-
work interconnects lag and are not as likely to increase at the same rate to
keep up with the nodes. Programmers will have to rethink how they use
those high-performing nodes in a parallel job, because this recent chip
development with multiple cores per node demands a more efficient mes-
sage passing code.
This book attempts to give the HPC applications programmer an aware-
ness of the techniques needed to address these new performance issues.
We discuss hardware architectures from the point of view of what an
application developer really needs to know to achieve high performance,
leaving out the deep details of how the hardware works. Similarly, when
discussing programming techniques to achieve performance, we avoid
detailed discourse on programming language syntax or semantics other
than just what is required to get the idea behind those techniques across. In
this book, we concentrate on C and Fortran, but the techniques described
may just as well be applicable to other languages such as C++ and JAVA®.
Throughout the book, the emphasis will be on chips from Advanced
Micro Devices (AMD) and systems, and interconnects and software from
Cray Inc., as the authors have much more experience on those systems.
While the concentration is on a subset of the HPC industry, the techniques
discussed have application across the entire breadth of the HPC industry,
from the desktop to the Petaflop system. Issues that give rise to bottle-
necks to attaining good performance will be discussed in a generic sense,
having applicability to all vendors’ systems.
To set the foundation, users must start thinking about three levels of
parallelism. At the outermost level is message passing to communicate
between the nodes of the massively parallel computer, shared memory
parallelism in the middle to utilize the cores on the nodes or the MIMD
units on the accelerator, and finally, vectorization on the inner level.


Techniques to address each of these levels are found within these pages.
What are the application programmers to do when asked to port their
application to a massively parallel multicore architecture? After discuss-
ing the architectural and software issues, the book outlines a strategy for
taking an existing application and identifying how to address the formi-
dable task of porting and optimizing their application for the target sys-
tem. The book is organized to facilitate the implementation of that strategy.
First, performance data must be collected for the application and the user
is directed to individual chapters for addressing the bottlenecks indicated
in the performance data. If the desire is to scale the application to 10,000
processors and it currently quits scaling at 500 processors, processor per-
formance is not important, but improving the message passing and/or
reducing synchronization time or load imbalance is what should be
addressed. If, on the other hand, the application scales extremely well but
only 1–2% of peak performance is being seen, then processor performance
should be examined.
Chapter 1 addresses the node architecture in the latest chips from
AMD. AMD supplies the same basic chip instruction architecture as
Intel®, which can be employed as single-socket nodes, or as cache-coherent,
multisocket nodes. Of particular interest here is the memory architecture,
with its caches and prefetching mechanisms, and the high-performance
functional units capable of producing four 8-byte or eight 4-byte floating
point results/core. And, as mentioned earlier, some sort of “vectorization”
scheme must be employed to achieve these high result rates. While the
unit of computation will primarily be the core, the cores are organized
within a shared memory socket, and then there are
multiple sockets per node. Within the socket, the cores share memory
with uniform memory access (UMA); that is, fetching any memory on the
socket from any core on the socket will take the same amount of time.
Across the sockets, the cores experience a nonuniform memory access
(NUMA); that is, the time to access memory connected to a particular
core’s socket is faster than accessing memory connected to the other
socket. The latency is higher and the bandwidth is slightly smaller.
Chapter 2 discusses the infrastructure of connecting the nodes together
including hardware and software, the hardware that supplies connectivity
between the nodes in a massively parallel processor (MPP) system and the
software that orchestrates the MPP to work on a single application. Certain
characteristics of the interconnect such as latency, injection bandwidth,
global bandwidth, and messages/second that can be handled by the


network interface computer (NIC) play an important role in how well the
application scales across the nodes in the system.
Chapter 3 brings the role of compilers into discussion. In this chapter,
we concentrate on the compilers from Portland Group Inc. (PGI) and the
Cray compilation environment (CCE). Users tend to expect (or hope) that
the compiler will automatically perform all the right transformations and
optimizations and generate the best possible executable code. Unfortu-
nately, reality is far from this; however, it is not entirely the compiler
technology at fault, but the way the application is written and the ambi-
guities inherent in the source code. Typically, what the compiler needs to
know to properly transform and optimize the generated executable code is
hidden, and so it must take a conservative approach by making minimal
assumptions in order to preserve the validity of the program. This chapter
discusses various source code syndromes that inhibit compilers from
optimizing the code. We will see that, for some compilers, simple com-
ment line directives can help remove certain optimization inhibitors.
In Chapter 4, we discuss the message passing interface (MPI) and
OpenMPTM, which are the most prevalent parallel programming
approaches used today and continue with Pthreads for shared memory
parallelization and the partitioned global address space (PGAS) languages
co-array Fortran and unified parallel C (UPC). We focus primarily on
MPI message passing and OpenMP shared memory models, and the
hybrid model of distributed shared memory (DSM) that applies both MPI
and OpenMP together over thousands of multicore nodes. For the hybrid
DSM approach to be successful, both the MPI calls across the network,
and the placement of OpenMP directives on the shared memory node
must be done effectively. We show in Chapter 4 the high-level mechanics
of how this might be achieved. In later chapters, specific applications that
have been implemented in OpenMP and MPI are discussed to see the
good and not-so-good implementations.
Given the discussion in the first four chapters, we then propose a strat-
egy to be taken by an application developer when faced with the challenge
of porting and optimizing their application for the multicore, massively
parallel architecture. This strategy starts with the process of gathering
runtime statistics from running the application on a variety of processor
configurations. Chapter 5 discusses what statistics would be required and
how one should interpret the data and then proceed to investigate the
optimization of the application. Given the runtime statistics, the user
should address the most obvious bottleneck first and work on the “whack-
a-mole” principle; that is, address the bottlenecks from the highest to the
lowest. Potentially, the highest bottleneck could be single-processor per-
formance, single-node performance, scaling issues, or input/output (I/O).
If the most time-consuming bottleneck is single-processor performance,
Chapter 6 discusses how the user can address performance bottlenecks in
processor or single-core performance. In this chapter, we examine compu-
tational kernels from a wide spectrum of real HPC applications. Here read-
ers should find examples that relate to important code sections in their own
applications. The actual timing results for these examples were gathered
by running on hardware current at the time this book was published.
The companion Web site (www.hybridmulticoreoptimization.com) con-
tains all the examples from the book, along with updated timing results on
the latest released processors. The focus here is on single-core performance,
efficient cache utilization, and loop vectorization techniques.
Chapter 7 addresses scaling performance. If the most obvious bottleneck
is communication and/or synchronization time, the user must understand
how a given application can be restructured to scale to higher processor
counts. This chapter looks at numerous MPI implementations, illustrating
bottlenecks to scaling. By looking at existing message passing implementa-
tions, the reader may be able to identify those techniques that work very well
and some that have limitations in scaling to a large number of processors.
In Chapter 7, we address the optimization of I/O as an application is
moved to higher processor counts. Of course, this section depends heavily
on the network and I/O capabilities of the target system; however, there is
a set of rules that should be considered on how to do I/O from a large par-
allel job.
As the implementation of a hybrid MPI/OpenMP programming para-
digm is one method of improving scalability of an application, Chapter 8
is all about OpenMP performance issues. In particular, we discuss the effi-
cient use of OpenMP by looking at the Standard Performance Evaluation
Corporation (SPEC®) OpenMP examples. In this chapter, we look at each
of the applications and discuss memory bandwidth issues that arise from
multiple cores accessing memory simultaneously. We see that efficient
OpenMP requires good cache utilization on each core of the node.
A new computational resource is starting to become a viable HPC alter-
native. General purpose graphics processor units (GPGPUs) have gained
significant market share in the gaming and graphics area. These systems
previously did not have an impact on the HPC community owing to the
lack of hardware 64-bit operations. The performance of the high-precision


software was significantly lower than the hardware 32-bit performance.
Initially, the GPGPUs did not have error correction, because it was really
not required by the gaming industry. For these reasons, the original
releases of these systems were not viable for HPC where high precision and
reliability are required. Recently, this situation has changed. Once again,
with the ability to pack more and more into the chip, GPGPUs are being
designed with hardware 64-bit precision whose performance is close to
the 32-bit performance (a factor of 2–3 slower instead of 10–12) and error
detection–correction is being introduced to enable correction of single-bit
errors and detection of double-bit errors. These accelerators supply impres-
sive FLOPS/clock by supplying multiple SIMD units. The programming
paradigm is therefore to parallelize the outermost loops of a kernel as in
OpenMP or threading and vectorize the innermost loops. Unlike the SSE
instructions, vectorization for the accelerators can deliver factors of 10–20
in performance improvement. The final chapter will look at the future and
discuss what the application programmer should know about utilizing the
GPGPUs to carry out HPC.
Chapter 1

Multicore Architectures

The multicore architectures that we see today are due to an
increased desire to supply more floating point operations (FLOPS)
each clock cycle from the computer chip. The typical decrease of the clock
cycle that we have seen over the past 20 years is no longer evident; in fact,
on some systems the clock cycles are becoming longer and the only way to
supply more FLOPS is by increasing the number of cores on the chip or by
increasing the number of floating point results from the functional units
or a combination of the two.
When the clock cycle shortens, everything on the chip runs faster
without programmer interaction. When the number of results per clock
cycle increases, it usually means that the application must be structured to
optimally use the increased operation count. When more cores are intro-
duced, the application must be restructured to incorporate more parallel-
ism. In this section, we will examine the most important elements of the
multicore system—the memory architecture and the vector instructions
which supply more FLOPS/clock cycle.

1.1 Memory Architecture


The faster the memory circuitry, the more the memory system costs.
Memory hierarchies are built by having faster, more-expensive memory
close to the CPU and slower, less-expensive memory further away. All
multicore architectures have a memory hierarchy: from the register set, to
the various levels of cache, to the main memory. The closer the memory is
to the processing unit, the lower the latency to access the data from that
memory component and the higher the bandwidth for accessing the data.

For example, registers can typically deliver multiple operands to the func-
tional units in one clock cycle and multiple registers can be accessed in
each clock cycle. Level 1 cache can deliver 1–2 operands to the registers in
each clock cycle. Lower levels of cache deliver fewer operands per clock
cycle and the latency to deliver the operands increases as the cache is fur-
ther from the processor. As the distance from the processor increases the
size of the memory component increases. There are tens of registers; Level
1 cache is typically 64 KB (when used in this context, K represents 1024
bytes—64 KB is 65,536 bytes). Higher levels of cache hold more data and
main memory is the largest component of memory.
Utilizing this memory architecture is the most important lesson a pro-
grammer can learn to effectively program the system. Unfortunately,
compilers cannot solve the memory locality problem automatically. The
programmer must understand the various components of the memory
architecture and be able to build their data structures to most effectively
mitigate the lack of sufficient memory bandwidth.
The amount of data that is processed within a major computational rou-
tine must be known to understand how that data flows from memory
through the caches and back to memory. To some extent the computation
performed on the data is not important. A very important optimization
technique, “cache blocking,” is a restructuring process that restricts the
amount of data processed in a computational chunk so that the data fits
within Level 1 and/or Level 2 cache during the execution of the blocked DO
loop. Cache blocking will be discussed in more detail in Chapter 6.
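
As a preview of the technique, the following sketch shows the general shape of a cache-blocked loop nest. It is not one of the examples from this book or its companion Web site, and the block size NB is a placeholder that would have to be tuned to the Level 1 or Level 2 cache of the target processor.

! A minimal cache-blocking sketch: the I and J loops are strip mined so that
! an NB x NB tile of A and B is processed while it fits in cache. NB is a
! placeholder value to be tuned for the cache sizes of the target processor.
INTEGER, PARAMETER :: N=4096, NB=64
REAL*8 A(N,N), B(N,N)
INTEGER I, J, II, JJ
DO JJ = 1, N, NB
   DO II = 1, N, NB
      DO J = JJ, MIN(JJ+NB-1,N)
         DO I = II, MIN(II+NB-1,N)
            A(I,J) = B(J,I)      ! transposed access reuses the tile of B
         ENDDO
      ENDDO
   ENDDO
ENDDO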

1.1.1 Why the Memory Wall?


On today’s microprocessors, as well as future generations, the ratio
between the rate at which results can be generated by the functional units
(the result rate) and the rate at which operands can be fetched from and
stored to memory is becoming very large. For example, with the advent of the
SSE3 (streaming single instruction, multiple data extensions) instruction
the result rate doubled and the memory access rate remained the same.
Many refer to the “memory wall” as a way of explaining that memory
technologies have not kept up with processor technology and are far
behind the processor in producing inexpensive fast circuitry.

We all know that the rate of improvement in the microprocessor speed
exceeds the rate of improvement in DRAM memory speed, each is
improving exponentially, but the exponent for micro-processors is
substantially larger than that for DRAMs. The difference between


diverging exponentials also grows exponentially; so, although the
disparity between processor and memory speed is already an issue,
downstream someplace it will be a much bigger one.
Wulf and McKee, 1995

Given this mismatch between processor speed and memory speed, hard-
ware designers have built special logic to mitigate this imbalance. This sec-
tion explains how the memory system of a typical microprocessor works
and what the programmer must understand to allocate their data in a form
that can most effectively be accessed during the execution of their program.
Understanding how to most effectively use memory is perhaps the most
important lesson a programmer can learn to write efficient programs.

1.1.2 Hardware Counters


Since the early days of super scalar processing, chip designers incorpo-
rated hardware counters into their systems. These counters were originally
introduced to measure important performance issues in the hardware.
HPC proponents soon found out about these counters and developed soft-
ware to read the counters from an application program. Today, software
supplied by the Innovative Computing Laboratory at the University of
Tennessee, called PAPI, gives “raw” hardware counter data [2]. Throughout
the remainder of the book, the importance of this information will be dis-
cussed. Within the book, the CrayPatTM performance analysis tool [3] will
be used to access the hardware counters for examples shown. Most ven-
dors supply profiling software that can access these hardware counters.
It is important to understand that the “raw” hardware counter data
are somewhat useless. For example, when one runs an application the
number of cache misses is not enough information. What is needed is the
number of memory accesses for each cache miss. This important piece of
information is known as a derived metric. It is obtained from the perfor-
mance tools measuring the number of cache misses and the number of
memory accesses and computing the derived metric (memory accesses/
cache miss).
Without the hardware counter data the application programmer would
have a very difficult time figuring out why their application did not run
well. With these data, the application programmer can quickly zero in on
the portion of the application that takes most of the CPU time and then
understand why the performance is or is not good.

In the following discussion, it is important to understand that derived


metrics are available to be used in a profiling mode to determine the
important memory access details about the executing application.
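
To make the derived metric concrete, the sketch below gathers raw counts around a simple kernel and combines them into memory accesses per cache miss. It assumes PAPI's classic high-level Fortran interface (PAPIF_start_counters/PAPIF_stop_counters) and the preset events PAPI_L1_DCA and PAPI_L1_DCM; the routine names, the include file, and the events actually available depend on the PAPI release and on the processor, so treat this as a sketch rather than a recipe, and error checking is omitted.

! Gather two raw counters around a kernel and form the derived metric
! "memory accesses per cache miss." Assumes PAPI's classic high-level
! Fortran API; compile as a .F90 file so the preprocessor handles the include.
PROGRAM DERIVED_METRIC
#include "fpapi.h"
   INTEGER, PARAMETER :: NEV=2, N=1000000
   INTEGER EVENTS(NEV), CHECK, I
   INTEGER*8 VALUES(NEV)
   REAL*8 A(N), B(N), S, ACCPERMISS
   EVENTS(1) = PAPI_L1_DCA          ! Level 1 data cache accesses
   EVENTS(2) = PAPI_L1_DCM          ! Level 1 data cache misses
   B = 1.0D0
   S = 2.0D0
   CALL PAPIF_start_counters(EVENTS, NEV, CHECK)
   DO I = 1, N                      ! the kernel being measured
      A(I) = S*B(I)
   ENDDO
   CALL PAPIF_stop_counters(VALUES, NEV, CHECK)
   ACCPERMISS = DBLE(VALUES(1))/DBLE(VALUES(2))
   PRINT *, 'memory accesses per cache miss =', ACCPERMISS
END PROGRAM DERIVED_METRIC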

1.1.3 Translation Look-Aside Buffer


The mechanics of the translation look-aside buffer (TLB) is the first impor-
tant lesson in how to effectively utilize memory. When the processor issues
an address for a particular operand, that address is a logical address.
Logical addresses are what the application references and, in the applica-
tion’s view, consecutive logical addresses are contiguous in memory. In
practice, this is not the case. Logical memory is mapped onto physical
memory with the use of the TLB. The TLB contains entries that give the
translation of the location of logical memory pages within physical mem-
ory (Figure 1.1). The default page size on most Linux® systems is 4096
bytes. If one stores four-byte operands, there will be 1024 four-byte oper-
ands in a page. The page holds 512 eight-byte operands.
Physical memory can be fragmented and two pages adjacent in logical
memory may not be adjacent in physical memory (Figure 1.2). The map-
ping between the two, logical and physical, is performed by an entry in the
TLB. The first time an operand is fetched to the processor, the page table
Figure 1.1 Operation of the TLB to locate an operand in memory: the virtual page number of a virtual address is looked up in the page table to form the physical address of the operand in main memory. (Adapted from Hennessy, J. L. and Patterson, D. A. with contributions by Dusseau, A. C. et al. Computer Architecture—A Quantitative Approach. Burlington, MA: Morgan Kaufmann.)

Figure 1.2 Fragmentation in physical memory: pages that are adjacent in virtual memory are mapped by address translation onto pages scattered through physical memory.

entry must be fetched, the logical address translated to a physical address


and then the cache line that contains the operand is fetched to Level 1
cache. To access a single element of memory, two memory loads, the table
entry and the cache line, are issued and 64 bytes are transferred to Level 1
cache—all this for supplying a single operand. If the next operand is con-
tiguous to the previous one in logical memory, the likelihood of it being in
the same page and same cache line is very high. (The only exception would
be if the first operand was the last operand in the cache line which hap-
pens to be the last cache line in the page).
A TLB table entry allows the processor to access all the 4096 bytes
within the page. As the processor accesses additional operands, either the
operand resides within a page whose address resides in the TLB or another
table entry must be fetched to access the physical page containing the
operand. A very important hardware statistic that can be measured for a
section of code is the effective TLB miss ratio. A TLB miss is the term used
to describe the action when no table entry within the TLB contains the
physical address required—then a miss occurs and a page entry must be
loaded to the TLB and then the cache line that contains the operands can
be fetched. A TLB miss is very expensive, since it requires two memory
accesses to obtain the data. Unfortunately, the size of the TLB is relatively
small. On some AMD microprocessors the TLB only has 48 entries, which
means that the maximum amount of memory that can be “mapped” at
any given time is 4096 × 48 = 196,608 bytes. Given this limitation, the
potential of TLB “thrashing” is possible. Thrashing the TLB refers to the
condition where very few of the operands within the page are referenced
before the page table entry is flushed from the TLB. In subsequent chapters,
examples where the TLB performance is poor will be discussed in more
detail. The programmer should always obtain hardware counters for the
important sections of their application to determine if their TLB utiliza-
tion is a problem.
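
To make the thrashing condition concrete, the sketch below, a generic illustration rather than an example from the book, compares two loop orders over the same array. The first dimension is 512 REAL*8 words, exactly one 4096-byte page per column, so the first loop nest touches a single operand in each of 10,000 different pages on every sweep of its inner loop and overwhelms a 48-entry TLB, while the second nest uses all 512 operands of a page before another page entry is needed.

! TLB-unfriendly versus TLB-friendly access order. Each column of A spans
! one full 4096-byte page (512 eight-byte words).
INTEGER, PARAMETER :: NWPP=512, NPAGES=10000
REAL*8 A(NWPP,NPAGES), S
INTEGER I, J
A = 1.0D0
S = 0.0D0
DO I = 1, NWPP           ! inner loop strides across pages: one operand per
   DO J = 1, NPAGES      ! page, so nearly every access forces a TLB miss
      S = S + A(I,J)
   ENDDO
ENDDO
DO J = 1, NPAGES         ! inner loop walks through all 512 operands of a
   DO I = 1, NWPP        ! page while its entry is still resident in the TLB
      S = S + A(I,J)
   ENDDO
ENDDO

The hardware counter data described above would show the difference between the two nests directly as the TLB miss ratio.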

1.1.4 Caches
The processor cannot fetch a single bit, byte, or word; it must fetch an entire
cache line. Cache lines are 64 contiguous bytes, which would contain
either 16 4-byte operands or 8 8-byte operands. When a cache line is
fetched from memory several memory banks are utilized. When accessing
a memory bank, the bank must refresh. Once this happens, new data can-
not be accessed from that bank until the refresh is completed. By having
interleaved memory banks, data from several banks can be accessed in
parallel, thus delivering data to the processor at a faster rate than any one
memory bank can deliver. Since a cache line is contiguous in memory, it
spans several banks and by the time a second cache line is accessed the
banks have had a chance to recycle.

1.1.4.1 Associativity
A cache is a highly structured, expensive collection of memory banks.
There are many different types of caches. A very important characteristic
of the cache is cache associativity. In order to discuss the operation of a
cache, envision the following memory structure: memory is a two-
dimensional grid where the width of each box in the grid
is a cache line. The size of a row of memory is the same as one associativity
class of the cache. Each associativity class also has boxes the size of a cache
line. When a system has a direct mapped cache, this means that there is
only one associativity class. On all x86 systems the associativity of Level 1
cache is two-way. This means that there are two rows of associativity classes
in Level 1 cache. Now consider a column of memory. Any cache line in a
given column of memory must be fetched to the corresponding column of
the cache. In a two-way associative cache, there are only two locations in
Level 1 cache for any of the cache lines in a given column of memory. The
following diagram tries to depict the concept.
Figure 1.3 depicts a two-way associative Level 1 cache. There are only
two cache lines in Level 1 cache that can contain any of the cache lines in
the Nth column of memory. If a third cache line from the same column is
required from memory, one of the cache lines contained in associativity
Figure 1.3 Two-way associative Level 1 cache: any cache line in a given column of memory can reside only in the corresponding column of the cache, which has one cache-line slot in each of the two associativity classes.

class 1 or associativity class 2 must be flushed out of cache, typically to


Level 2 cache. Consider the following DO loop:
REAL*8 A(65536),B(65536),C(65536)
DO I=1,65536
C(I) = A(I)+SCALAR*B(I)
ENDDO

Let us assume that A(1)–A(8) are contained in the Nth column in mem-
ory. What is contained in the second row of the Nth column? The width of
the cache is the size of the cache divided by its associativity. The size of this
Level 1 cache is 65,536 bytes, and each associativity class has 32,768 bytes
of locations. Since the array A contains 8 bytes per word, the length of the
Figure 1.4 Contents of Level 1 cache after the fetch of A(1) and B(1): the cache lines holding A(1–8) and B(1–8) occupy the two associativity slots of the Nth column.

A array is 65,536 × 8 = 524,288 bytes. The second row of the Nth column
will contain A(4097–8192), the third A(8193–12,288), and the sixteenth row
will contain the last part of the array A. The 17th row of the Nth column
will contain B(1)–B(4096), since the compiler should store the array B right
after the array A, and C(1)–C(4096) will be contained in the 33rd row of the Nth
column. Given the size of the dimensions of the arrays A, B, and C, the first
cache line required from each array will be exactly in the same column. We
are not concerned with the operand SCALAR—this will be fetched to a
register for the duration of the execution of the DO loop.
When the compiler generates the fetch for A(1) the cache line contain-
ing A(1) will be fetched to either associativity 1 or associativity 2 in the
Nth column of Level 1 cache. Then the compiler fetches B(1) and the cache
line containing that element will go into the other slot in the Nth column
of Level 1 cache, either into associativity 1 or associativity 2. Figure 1.4
illustrates the contents of Level 1 cache after the fetch of A(1) and B(1).
The add is then generated. To store the result into C(1), the cache line
containing C(1) must be fetched to Level 1 cache. Since there are only two
slots available for this cache line, either the cache line containing B(1) or
the cache line containing A(1) will be flushed out of Level 1 cache into

Figure 1.5 State of Level 1 cache after the fetch of C(1): the cache line holding C(1–8) has displaced the line holding A(1–8) from the Nth column.

Level 2 cache. Figure 1.5 depicts the state of Level 1 cache once C(1) is
fetched to cache.
On the second pass through the DO loop, the cache line containing A(2) or
B(2) will have to be fetched from Level 2; however, the access will over-
write one of the other two slots and we end up thrashing cache with the
execution of this DO loop.
Consider the following storage scheme:

REAL*8 A(65544),B(65544),C(65544)
DO I=1,65536
C(I) = A(I)+SCALAR*B(I)
ENDDO

The A array now occupies 16 full rows of memory plus one cache line.
This causes B(1)–B(8) to be stored in the N + 1 column of the 17th row and
C(1)–C(8) is stored in the N + 2 column of the 33rd row. We have offset the
arrays and so they do not conflict in the cache. This storage therefore
results in more efficient execution than that of the previous version. The
next example investigates the performance impact of this rewrite. Figure
1.6 depicts the status of Level 1 cache given this new storage scheme.
Figure 1.7 gives the performance of this simple kernel for various values
of vector length. In these tests the loop is executed 1000 times. This is done
to measure the performance when the arrays, A, B, and C are fully con-
tained within the caches. If the vector length is greater than 2730 [(65,536
bytes, the size of Level 1 cache)/(8 bytes per REAL(8) operand × 3 operands
to be fetched or stored) = 2730], the three arrays cannot be contained within
Level 1 cache and we have some overflow into Level 2 cache. When the vector
length is greater than 21,857 [(524,588 bytes in Level 2 cache)/(8 bytes × 3
operands) = 21,857], then the three arrays will not fit in Level 2 cache and
the arrays will spill over into Level 3 cache.

Figure 1.6 Status of Level 1 cache given the offset-array storage scheme: A(1–8), B(1–8), and C(1–8) now occupy different columns of the cache and no longer displace one another.

Figure 1.7 Performance based on the storage alignment of the arrays: FLOPS versus vector length (1 to 100,000, logarithmic scale); each series corresponds to arrays dimensioned 65,536, 65,544, 65,552, 65,568, 65,600, 65,664, 65,792, or 66,048 REAL*8 words.

In Figure 1.7, there
is a degradation of performance as N increases, and this is due to where
the operands reside prior to being fetched.
More variation exists than that attributed to increasing the vector
length. There is a significant variation of performance of this simple kernel
due to the size of each of the arrays. Each individual series indicates the
dimension of the three arrays in memory. Notice that the series with the
worst performance is dimensioned by 65,536 REAL*8 words. The perfor-
mance of this memory alignment is extremely bad because the three arrays
are overwriting each other in Level 1 cache as explained earlier. The next
series, which represents the case where we add a cache line to the dimen-
sion of the arrays, gives slightly better performance; however, it is still
poor; the third series once again gives slightly better performance, and
so on. The best-performing series is when the arrays are dimensioned by
65,536 plus a page (512 8-byte words).
This significant difference in performance is due to the
memory banks in Level 1 cache. When the arrays are dimensioned by a large
power of two, they are aligned in cache; as the three arrays are accessed,
the accesses pass over the same memory banks and the fetching of oper-
ands stalls until the banks refresh. When the arrays are offset by a full
page of 4096 bytes, the banks have time to recycle before another operand
is accessed.
The lesson from this example is to pay attention to the alignment of
arrays. As will be discussed in the compiler section, the compiler can only
do so much to “pad” arrays to avoid these alignment issues. The applica-
tion programmer needs to understand how best to organize their arrays to
achieve the best possible cache utilization. Memory alignment plays a very
important role in effective cache utilization which is an important lesson
a programmer must master when writing efficient applications.

1.1.4.2 Memory Alignment


A page always starts on a 4096-byte boundary. Within the page, the next
unit of alignment is the cache line and there is always an even number of
full cache lines within a page. A 4096-byte page would contain 64 cache
lines of 512 bits (64 bytes) each. Within a cache line, there are 4 super words of 128 bits each.
The functional units on the chips, discussed in the next section, can accept
two 128 bit super words each clock cycle. Adding two arrays as in the
following DO loop

DO I=1,20
A(I)=B(I)+C(I)
ENDDO

If B(1) and C(1) are both the first elements of a super word, the move
from cache to register and from register through the adder back into the
register can be performed at two results a clock cycle. If on the other hand
they are not aligned, the compiler must move the unaligned operands into
a 128-bit register prior to issuing the add. These align operations tend to
decrease the performance attained by the functional unit.
Unfortunately it is extremely difficult to assure that the arrays are
aligned on super word boundaries. One way to help would be to always
make the arrays a multiple of 128 bits in length. What about scalars? One can
always try to pack the scalars together in the data allocation and make
sure that the sum of the scalar storage is a multiple of 128 bits. If every-
thing allocated is a multiple of 128 bits in length, the first word of
every array will fall on a 128-bit boundary, provided the allocation itself starts on one.
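
A minimal sketch of this packing idea follows; the layout is an assumption made for illustration, not a rule taken from the book. Every array in the block has an even number of 8-byte words and the two scalars are grouped so that together they fill one 128-bit super word, so no object in the block disturbs the 128-bit alignment of the objects that follow it.

! Keep every object a multiple of 128 bits (16 bytes): arrays with an even
! number of REAL*8 words, and the two 8-byte scalars packed side by side.
! If the COMMON block itself starts on a 128-bit boundary, every array in it
! then starts on a 128-bit boundary as well.
INTEGER, PARAMETER :: IIDIM=65544      ! an even number of 8-byte words
COMMON /ALIGNED/ A(IIDIM), B(IIDIM), C(IIDIM), SCALAR1, SCALAR2
REAL*8 A, B, C, SCALAR1, SCALAR2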

1.1.5 Memory Prefetching


Memory prefetching is an optimization that both the hardware and the
compiler can use to assure that operands required in subsequent iterations
of the DO loop are available when the functional units are available.
When the memory controller (hardware) senses a logical pattern to address-
ing the operands, the next logical cache line is fetched automatically before
the processor requests it. Hardware prefetching cannot be turned off. The
hardware does this to help mitigate the latency to access the memory.
When the compiler optimizes a DO loop, it may use software prefetches
prior to the DO loop and just before the end of the DO loop to prefetch
operands required in the next pass through the loop. Additionally, com-
pilers allow the user to influence the prefetching with comment line direc-
tives. These directives give the user the ability to override the compiler and
indicate which array should be prefetched.
Since two contiguous logical pages may not be contiguous in physical
memory, neither type of prefetching can prefetch across a page boundary.
When using small pages prefetching is limited to 63 cache lines on each
array.
When an application developer introduces their own cache blocking, it
is usually designed without considering hardware or software prefetching.
This turns out to be extremely important since the prefetching can utilize
cache space. When cache blocking is performed manually it might be best
to turn off software prefetching.
The intent of this section is to give the programmer the important
aspects of memory that impact the performance of their application.
Future sections will examine kernels of real applications that utilize mem-
ory poorly; unfortunately most applications do not effectively utilize
memory. By using the hardware counter information and following the
techniques discussed in later chapters, significant performance gains can
be achieved by effectively utilizing the TLB and cache.

1.2 SSE Instructions


On AMD’s multicore chips, there are two separate functional units, one
for performing an ADD and one for performing a MULTIPLY. Each of
these is independent and the size of the floating point add or multiply
determines how many operations can be produced in parallel. On the
dual core systems, the width of the SSE instruction is 64 bits wide. If the
application uses 32 bit or single precision arithmetic, each functional unit
(ADD and MULTIPLY) produces two single precision results each clock
cycle. With the advent of the SSE3 instructions on the quad core systems,
the functional units are 128 bits wide and each functional unit produces
four single precision results each clock cycle (cc) or two double precision
results.
When running in REAL*8 or double precision, the result rate of the
functional units is cut in half, since each result requires 64 bit wide oper-
ands. The following table gives the results rate per clock cycle (cc) from the
SSE2 and SSE3 instructions.

                          SSE2 (Dual Core Systems)     SSE3 (Quad Core Systems)
32-bit single precision   2 Adds and 2 Multiplies/cc   4 Adds and 4 Multiplies/cc
64-bit double precision   1 Add and 1 Multiply/cc      2 Adds and 2 Multiplies/cc

For the compiler to utilize SSE instructions, it must know that the two
operations being performed within the instruction are independent of
each other. All the current generation compilers are able to determine
the lack of dependency by analyzing the DO loop for vectorization.
Vectorization will be discussed in more detail in later chapters; however, it
is important to understand that the compiler must be able to vectorize a
DO loop or at least a part of the DO loop in order to utilize the SSE
instructions.
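
As a simple illustration of that independence requirement, consider the two loops below; this is a generic example rather than one of the book's kernels. The first loop can be vectorized because each C(I) is computed only from A(I) and B(I), whereas the second contains a recurrence: each C(I) needs the C(I-1) produced by the previous iteration, so its iterations cannot be packed into SSE instructions.

REAL*8 A(1024), B(1024), C(1024), SCALAR
INTEGER I
DO I = 2, 1024       ! independent iterations: the compiler can vectorize
   C(I) = A(I) + SCALAR*B(I)
ENDDO
DO I = 2, 1024       ! loop-carried dependence on C(I-1): not vectorizable
   C(I) = C(I-1) + SCALAR*B(I)
ENDDO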
Not all adds and multiplies would use the SSE instructions. There is an
added requirement for using the SSE instructions. The operands to the
instruction must be aligned on instruction width boundary (super words
of 128 bits). For example:

DO I=1, 100
A(I)=B(I)+SCALAR*C(I)
ENDDO

The arguments for the MULTIPLY and ADD units must be on a 128-bit
boundary to use the SSE or packed instruction. This is a severe restriction;
however, the compiler can employ operand shifts to align the operands in
the 128 bit registers. Unfortunately, these shift operations will detract
from the overall performance of the DO loops. This loop was run with the
following two allocations:

PARAMETER (IIDIM=100)
COMMON A(IIDIM),AF,B(IIDIM),BF,C(IIDIM) ==> run I
COMMON A(IIDIM),B(IIDIM),C(IIDIM) ==> run II
REAL*8 A,B,C,AF,BF,SCALAR

Since IIDIM is even, we know that the arrays will not be aligned in run
I since there is a single 8-byte operand between A and B. On run II the
arrays are aligned on 128-bit boundaries. The difference in performance
follows:

            Run I     Run II
MFLOPS       208        223

This run was made repetitively to make sure that the operands were in
cache after the first execution of the loop, and so the difference in these
two runs is simply the additional alignment operations. We get a 10% dif-
ference in performance when the arrays are aligned on 128-bit boundar-
ies. This is yet another example where memory alignment impacts the
performance of an application.
Given these alignment issues, applications which stride through mem-
ory and/or use indirect addressing will suffer significantly from alignment
problems and usually the compiler will only attempt to use SSE instruc-
tions for DO loops that access memory contiguously.
In addition to the floating point ADD and MULTIPLY, the memory
functions (LOAD, STORE, MOVE) are also as wide as the floating point
units. On the SSE2 64 bit wide instructions, there were 128 bit wide mem-
ory functions. On the multicores with the SSE2 instructions, vectorization
was still valuable, not because of the width of the floating point units
which were only 64 bits wide, but because of the memory operations which
could be done as 128-bit operations.

1.3 Hardware Described in This Book


In the first release of the book, we will be concentrating on one of the
recent chips from AMD. All the examples in the book have been run on
the AMD Magny-Cours with two six-core sockets sharing memory. The
competitor to the Magny-Cours is the Intel® NehalemTM also available
with two multicore sockets sharing memory.
The following diagram (Figure 1.8) shows some earlier chips from AMD
and Intel, the AMD Barcelona and the Intel Nehalem; a third chip, the
Harpertown, is presented to illustrate the dynamic change that Intel incor-
porated in the development of the Nehalem. The Nehalem looks much
more like an OpteronTM than any previous Xeon® processor. The diagram
shows AMD's Barcelona socket; the differences between the current
Magny-Cours and the Barcelona are the size of Level 3 Cache (2 MB on
Barcelona and 6 MB on Magny-Cours), the clock cycle, the number of
cores per socket and the memory controller. The diagram shows only one
of the quad core sockets. Many HPC systems are delivered with two or
four of the multicores sharing memory on the node.

Figure 1.8 System architecture comparison; B indicates bytes, b indicates bits.
(Adapted from Kanter, D. Inside Nehalem: Intel's future processor and system.
http://realworldtech.com/includes/templates/articles.cfm; April 2, 2008.)
The figure compares three sockets:

Nehalem (2.3 GHz): four cores, each with a 32 KB L1D cache and a
256 KB L2 cache, sharing an 8 MB L3 cache; DDR3 memory controllers
(3x8B at 1.33 GT/s) and the QuickPath interconnect (4x20b at 6.4 GT/s).

Barcelona (2.3 GHz): four cores, each with a 64 KB L1D cache and a
512 KB L2 cache, sharing a 2 MB L3 cache; DDR2 memory controllers
(2x8B at 667 MT/s) and HyperTransport 2.0 (6x2B at 2 GT/s).

Harpertown (3.2 GHz): four Wolfdale cores, each with a 32 KB L1D
cache, sharing two 6 MB L2 caches (one per core pair); front side bus
interface (8B at 1.6 GT/s).
Like the Opteron's HyperTransport™, the Nehalem introduced a new high-
speed interconnect called "QuickPath," which provides a bypass to the
traditional PCI buses. As with HyperTransport on the Barcelona,
QuickPath allows higher bandwidth off the node as well as between cores
on the node. Both the AMD Magny-Cours and the Nehalem use DDR3
memory, and their effective memory bandwidth is significantly higher than
the Barcelona's. Notice in the diagram that the Nehalem can access three
8-byte quantities at 1.33 GHz, while the Barcelona accesses two 8-byte
quantities at 0.800 GHz. This is an overall bandwidth improvement of 2.5 in
favor of the DDR3 memory system. When a kernel is memory bound, the
performance difference should be greater, and when a kernel is compute
bound, the difference is less.
Note in Figure 1.8 that each core on the socket has its own Level 1 and
Level 2 caches and all four cores on the socket share a Level 3 cache. When
an application is threaded, either with OpenMP™ or Pthreads, data
may be shared between the four cores without accessing memory. The
mechanism is quite simple when the cores are only reading a cache line; it
becomes more complicated as soon as a core wants to write into a cache
line. When using OpenMP, there is a performance hit, referred to as
"false sharing," when multiple cores are trying to store into the same cache
line. There can be only one owner of a cache line, and a core must own a
cache line when it is modifying it.
Additionally, the user should understand that the memory bandwidth is
a resource that is shared by all the cores on the socket. When a core per-
forms a very memory-intensive operation, say adding two large arrays
together, it would require a significant portion of the total memory band-
width. If two cores are performing the memory-intensive operation, they
have to share the available memory bandwidth. Typically, the memory
bandwidth is not sufficient to have every core performing a memory-
intensive operation at the same time. We will see in Chapter 8 that OpenMP
tends to put more pressure on the memory bandwidth, since the execution
of each core is more tightly coupled than in an MPI application.

EXERCISES
1. What is the size of TLB on current AMD processors? How
much memory can be mapped by the TLB at any one time?
What are possible causes and solutions for TLB thrashing?
2. What is cache line associativity? On x86 systems how many
rows of associativity are there for Level 1 cache? How can these
cause performance problems? What remedies are there for
these performance problems?

3. How does SSE3 improve on the original SSE2 instructions?
What can prevent the compiler from using SSE instructions
directly?
4. In relation to cache, what is false sharing?
5. What is the size of typical Level 1 cache, Level 2 cache, and
Level 3 cache?
6. How is performance affected by memory-intensive operations
running simultaneously on multiple cores? Why?
7. What are some of the reasons why a user may choose to use
only 1/2 of the cores on a node? On a two socket node, is it
important to spread out the cores being used evenly between
the sockets?
9. What are some of the improvements in going from the
Harpertown chip to the Nehalem?
Chapter 2

The MPP
A Combination of Hardware
and Software

When an application runs across a massively parallel processor


(MPP), numerous inefficiencies can arise that may be caused by the
structure of the application, interference with other applications running
on the MPP system or inefficiencies of the hardware and software that
makes up the MPP. The interference due to the other applications can be
the source of nonreproducibility of runtimes. For example, all applica-
tions run best when they run on an MPP in a dedicated mode. As soon as
a second application is introduced, either message passing and/or I/O
from the second application could perturb the performance of the first
application by competing for the interconnect bandwidth. As an applica-
tion is scaled to larger and larger processor counts, it becomes more sensi-
tive to this potential interference. The application programmer should
understand these issues and know what steps can be taken to minimize
the impact of the inefficiencies. The issues we cover in this chapter are

1. Topology of the interconnect and how knowledge of it can be used to
minimize interference from other applications
2. Interconnect characteristics and how they might impact runtimes
3. Operating system jitter


2.1 Topology of the Interconnect


The topology of the interconnect dictates how the MPP nodes are connected
together. There are numerous interconnect topologies in use today. As
MPP systems grow to larger and larger node counts, the cost of some
interconnect topologies grows more than others. For example, a complete
crossbar switch that connects every processor to every other processor
becomes prohibitively expensive as the number of processors increases. On
the other hand, a two- or three-dimensional torus grows linearly with the
number of nodes. All very large MPP systems such as IBM®’s Blue Gene®
[5] and Cray's XT™ [6] use a 3D torus. In this discussion, we will concen-
trate on the torus topology.
In the 3D torus, every node has a connection to its north, south, east,
west, up, and down neighboring nodes. This sounds ideal for any mesh-
based finite difference, finite-element application, or other application that
does nearest-neighbor communication. Of course within the node itself,
all the cores can share memory and so there is a complete crossbar inter-
connect within the node. When mapping an application onto an MPP
system, the scheduler and the user should try to keep a decomposed neigh-
borhood region within the node. In this way, the MPI tasks within the
node can take advantage of the ability to perform message passing using
shared memory moves for the MPI tasks within the node. Additionally, it
would be good to make sure that the MPI task layout on the interconnect
is performed in a way where neighboring MPI tasks are adjacent to each
other on the torus. Imagine that the 3D simulation is laid out on the torus
in the same way as the physical problem is organized. For example, con-
sider an application that uses a 1000 × 1000 × 10 3D mesh and that the
grid is decomposed in 10 × 10 × 5 cubical sections. There are therefore
100 × 100 × 2 cubical sections. If we want to use 20,000 processors we
would want to lay out the cubical sections on the nodes so that we have a
2 × 2 × 2 group of cubical sections on a two-socket quad-core node. This
leaves us with a 2D array of 50 × 50 nodes. This application could there-
fore be efficiently mapped on a 2D torus of 50 × 50 nodes and all nearest
neighbor communication would be optimized on the node as well as
across the interconnect. Such a mapping to the MPP will minimize the
latency and maximize the bandwidth for the MPI message passing.
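As a sketch of how an application might express this kind of mapping (illustrative code, not from the text; the 100 x 100 x 2 shape mirrors the decomposition described above, and whether the reordered ranks actually land as 2 x 2 x 2 blocks per node depends on the MPI library and the job launcher), MPI's Cartesian topology routines let the library reorder ranks to match the logical grid:

! Hedged sketch: create a 3D Cartesian communicator so that neighboring
! subdomains become neighboring ranks; dims mirrors the 100 x 100 x 2
! decomposition of the example above.
program cart_map
   use mpi
   implicit none
   integer :: ierr, comm3d, myrank
   integer :: dims(3)
   logical :: periods(3)
   integer :: west, east, south, north, down, up

   dims    = (/ 100, 100, 2 /)
   periods = (/ .true., .true., .true. /)

   call MPI_Init(ierr)
   ! reorder = .true. allows the library to renumber ranks to fit the topology
   call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., &
                        comm3d, ierr)
   call MPI_Comm_rank(comm3d, myrank, ierr)
   ! nearest neighbors in each dimension, used for halo exchanges
   call MPI_Cart_shift(comm3d, 0, 1, west,  east,  ierr)
   call MPI_Cart_shift(comm3d, 1, 1, south, north, ierr)
   call MPI_Cart_shift(comm3d, 2, 1, down,  up,    ierr)
   call MPI_Finalize(ierr)
end program cart_map

Grouping 2 x 2 x 2 blocks of these ranks onto each two-socket quad-core node is then a matter of the task-placement options given to the job launcher, as discussed in Section 2.1.1.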
What is good for a single application may not be best for the overall
throughput of the MPP system. The previous mapping discussion is the best
thing to do when running in dedicated mode where the application has

access to the entire system; however, what would happen if there were several
large applications running in the system? Consider an analogy. Say we pack
eggs in large cubical boxes holding 1000 eggs each in a 10 × 10 × 10 configu-
ration. Now we want to ship these boxes of eggs in a truck that has a dimen-
sion of 95 × 45 × 45. With the eggs contained in the 10 × 10 × 10 boxes we
could only get 144,000 (90 × 40 × 40) eggs in the truck. If, on the other hand,
we could split up some of the 10 × 10 × 10 boxes into smaller boxes we could
put 192,375 (95 × 45 × 45) eggs in the truck. This egg-packing problem illus-
trates one of the trade-offs in scheduling several MPI jobs onto a large MPP
system. Given a 3D torus as discussed earlier, it has three dimensions which
correspond to the dimensions of the truck. There are two techniques that
can be used. The first does not split up jobs; it allocates the jobs in the pre-
ferred 3D shape and the scheduler will never split up jobs. Alternatively, the
scheduler allocates the jobs in the preferred 3D shape until the remaining
holes in the torus are not big enough for another contiguous 3D shape and
then the scheduler will split up the job into smaller chunks that can fit into
the holes remaining in the torus. In the first case, a large majority of the
communication is performed within the 3D shape and very few messages
will be passed outside of that 3D shape. The second approach will utilize
much more of the system’s processors; however, individual jobs may inter-
fere with other jobs since their messages would necessarily go through that
portion of the torus that is being used for another application.

2.1.1 Job Placement on the Topology


The user cannot change the policy of the computing center that runs the
MPP system; however, there are several options the user can take to try to
get as much locality as possible. For example, the scheduler does not know
what kind of decomposition a particular application might have and there-
fore it is at a loss as to the shape of a desired 3D or 2D shape to minimize
nearest-neighbor communication. To remedy this situation, the mpirun
command needs to have an option for the user to specify how to organize
the MPI tasks on the topology of the interconnect. There are typically
mechanisms to group certain MPI tasks on nodes, to take advantage of the
shared memory communication as well as mechanisms to allocate MPI
tasks across the nodes in a certain shape. If only one application can have
access to a given node, the internode allocation would always be honored
by the scheduler; however, when the scheduler is trying to fill up the entire
system, the allocation of nodes within a certain shape is just a suggestion

and may be overridden by the scheduler. Later we will see how the perfor-
mance of an application can be improved by effectively mapping the MPI
task onto the 3D torus.

2.2 Interconnect Characteristics


The interconnect can be designed to improve the performance of disjoint
applications that arise from using the second scheduler strategy. For
example, if the bandwidth of the torus links is higher than the injection
bandwidth of a node, then there will be capacity for the link bandwidth
to handle more than the typical nearest-neighbor communication.
Additionally, if the latency across the entire MPP system is not much
greater than the latency between nearest nodes, then the distance between
neighboring MPI tasks may not seem so long. The latency across an inter-
connect is a function of both hardware and software. In a 3D torus, the
time for a message to pass through an intersection of the torus can be very
short and the numerous hops to get from one part of the interconnect to
another may not greatly increase the latency of a neighbor message that
has to go across the entire system.
Other characteristics of the interconnect can impact the latency across
the entire machine. In a very large application, the likelihood of link errors
can be high. Some interconnects only detect an error at the receiving end
of the message and then ask the sender to resend. The more sophisticated
interconnects can correct errors on the link which will not significantly
impact the overall message time. Additionally, as the number of messages
in the system increases with large node counts and global calls such as
MPI_ALLTOALL and MPI_ALLREDUCE, the interconnect may exceed
its capability for keeping track of the messages in flight. If the number of
active connections exceeds the interconnect’s cache of messages, then the
amount of time to handle additional messages increases and causes per-
turbation in the time to send the messages.

2.2.1 The Time to Transfer a Message


The time to transfer a message of N bytes from one processor to another is

Time to transfer N bytes = Latency of the interconnect
                         + N bytes/Bandwidth of the interconnect

The latency is a function of the location of the sender and the receiver
on the network and how long it takes the operating system at the receiving
end to recognize that it has a message. The bandwidth is a function of not


only the hardware bandwidth of the interconnect, but also what other
traffic occupies the interconnect when the message is being transferred.
Obviously, the time to transfer the message would be shorter if the applica-
tion runs in dedicated mode. In actuality, when an application is run-
ning in a production environment, it must compete with other applications
running on the MPP system. Not only does it compete with message pass-
ing from the other applications, but also any I/O that is performed by any
of the applications that run at the same time necessarily utilize the inter-
connect and interfere with MPI messages on the interconnect.
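As a rough worked example of this model (the 1.5 μs latency and 5 GB/s bandwidth below are assumed, illustrative numbers, not the characteristics of any particular interconnect), it can be evaluated with a few lines of code:

! Hedged sketch of the latency/bandwidth model; constants are assumed.
program transfer_time
   implicit none
   real*8, parameter :: latency   = 1.5d-6   ! seconds (assumed)
   real*8, parameter :: bandwidth = 5.0d9    ! bytes/second (assumed)
   real*8 :: nbytes, t
   integer :: k

   do k = 0, 6
      nbytes = 8.0d0 * 10.0d0**k             ! message sizes from 8 B up to 8 MB
      t = latency + nbytes/bandwidth         ! time to transfer nbytes
      print *, 'bytes =', nbytes, ' time =', t, ' s'
   end do
end program transfer_time

With these assumed numbers, an 8-byte message costs essentially the latency, while an 8 MB message costs about 1.6 ms and is dominated by the bandwidth term; this is why aggregating many small messages into fewer large ones usually pays off.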

2.2.2 Perturbations Caused by Software


Very few, if any, MPP systems have a systemwide clock. Maintaining a
consistent clock across all the nodes in the system would be a nightmare.
Why does one need a consistent clock? On all MPP systems, each core
runs an operating system that is designed to perform many different func-
tions in addition to running an application. For example, Linux®, a widely
used operating system, has the ability to handle numerous services, such
as disk I/O, sockets, and so on. Each of these requires a daemon running to
handle requests, and Linux must schedule the daemons to service potential
requests. The more services that are available in the operating system, the more
daemons that must be run to service the different requests. On a small system of
500 or fewer nodes, the time to perform these services may not impact the
performance of an application. As the number of nodes used by a given
application increases, the likelihood that these daemons interfere with the
application increases. This interference typically comes when an applica-
tion comes to a barrier. Say a large parallel application must perform a
global sum across all the MPI tasks in the system. Prior to performing the
sum, each processor must arrive at the call and have its contribution added
to the sum. If some processors are off servicing operating system daemons,
the processor may be delayed in arriving at the sum. As the number of
processors grows, the time to synchronize prior to the summation can
grow significantly, depending on the number of daemons the operating
system must service.
To address operating system jitter, both IBM's Blue Gene and Cray's XT
systems have "lightweight" operating system kernels that reduce the
number of daemons. Consequently, many of these "lightweight" operating
systems cannot support all of the services that are contained in a normal
Linux system.

2.3 Network Interface Computer


In addition to the handling of the messages coming off the node, in some
interconnects, the network interface computer (NIC) must handle the
messages and I/O that pass through it from other nodes, going to other
nodes. For nearest-neighbor communication, it is sufficient that the link
bandwidth of the interconnect be the same as the injection bandwidth
from each of the nodes; however, when global communication and I/O
is performed, the NICs must not only handle the local communication, but
also route messages from distant nodes to other distant nodes. To handle
significant global communication, as in a 3D transpose of an array, the link
bandwidth has to handle much more traffic than just the processors
running on the nodes to which it is connected. An important capability of
the NIC is referred to as adaptive routing. Adaptive routing is the notion of
being able to send a message on a different path to the destination depend-
ing on the traffic in the network or a disabled NIC. Adaptive routing can
significantly improve the effective bandwidth of the network and can be
utilized when the MPP system has the ability to identify when an NIC on
the network is down and reconfigure the routing to bypass the down NIC.
Another very important characteristic of the NIC is the rate at which it
can handle messages. As the number of cores on the multicore node grows,
the NIC’s ability to handle all the messages that come off the node becomes
a bottleneck. As the number of messages exceeds its rate of handling
messages, there is a slowdown that may introduce interference into the
MPP system.

2.4 Memory Management for Messages


Frequently, when a message is received by a node, memory must be
dynamically allocated to accommodate the message. The management of
this message space is a very complex problem for the operating system. On
the Linux operating system with small pages of size 4096 bytes, the alloca-
tion, deallocation, and garbage collection for the message space can take
an inordinate amount of time. Very often, this memory can become frag-
mented which impacts the memory transfer rates. The previous section
discussed the use of small pages and how they may not be contiguous in
memory and so the actual memory bandwidth may suffer as a result of the
disjointed pages.
The application programmer can help in this area. By preposting the
receives prior to sending data, the message may not be held in intermediate
message buffers, but may be delivered directly to the application buffer
which has a much better chance of being contiguous in memory. More
will be said about this in the MPI section.
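A minimal sketch of what preposting looks like in practice (illustrative code, not from the text; the tag, datatype, and buffer names are arbitrary choices): the receive is posted with MPI_Irecv before the matching send is issued, so the incoming message can be placed directly into the application's buffer.

! Hedged sketch: prepost the receive so the message can land directly
! in buf_in instead of an intermediate system buffer.
subroutine exchange(buf_out, buf_in, n, partner, comm)
   use mpi
   implicit none
   integer, intent(in) :: n, partner, comm
   real*8 :: buf_out(n), buf_in(n)
   integer :: req, ierr
   integer :: status(MPI_STATUS_SIZE)

   ! 1) post the receive first
   call MPI_Irecv(buf_in, n, MPI_REAL8, partner, 100, comm, req, ierr)
   ! 2) then send to the partner (tag 100 is an arbitrary choice)
   call MPI_Send(buf_out, n, MPI_REAL8, partner, 100, comm, ierr)
   ! 3) wait for the preposted receive to complete
   call MPI_Wait(req, status, ierr)
end subroutine exchange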

2.5 How Multicores Impact the Performance of the Interconnect
As the number of cores on the multicore node increases, more pressure
is placed on the interconnect. When an application is using all MPI, the
number of messages going out to the interconnect increases as the num-
ber of cores increases. It is a common occurrence to move an application
from an 8- to a 12-core node and have the overall performance of the
application decrease. How can the user deal with the problems that arise
as a system is upgraded to larger multicore sockets?
The first thing the user can do is to make sure that the minimum
amount of data is shipped off the node. By organizing the optimal neigh-
borhood of MPI tasks on the multicore node, the number of messages
going off the node can be minimized. Messages that are on the node can
be performed with efficient shared memory transfers.
The second approach would be to introduce OpenMP™ into the all-
MPI application, so that fewer MPI tasks are placed on the multicore node.
In the next 4–5 years the number of cores on a node will probably increase
to 24–32. With such a fat node, the use of MPI across all the cores on the
multicore node will cease to be an efficient option. As multicore nodes are
designed with larger number of cores, the days of an application using
MPI between all of the cores of a MPP system may be nearing the end.
The reluctance of application developers to introduce OpenMP into
their applications may disappear once the GPGPUs emerge as viable
HPC computational units. Since GPGPUs require some form of high-
level shared memory parallelism and low-level vectorization, the same
programming paradigm should be used on the multicore host. A popular
code design for using attached GPGPUs is to have an application that uses
MPI between nodes and OpenMP on the node, where high-level kernels of
the OpenMP are computed on the GPGPU. Ideally, an application can be
structured to run efficiently on a multicore MPP without GPGPUs or to
run efficiently on a multicore MPP with GPGPUs simply with comment-
line directives. This approach will be discussed in Chapter 9.
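The kind of structure being described might look like the following sketch (illustrative only, and deliberately trivial): MPI is initialized once per node-level task, an OpenMP region spreads the node's cores over a loop that stands in for a high-level kernel, and a single MPI reduction combines the node results. The thread-support level, array size, and kernel are assumptions, not the book's code.

! Hedged sketch of an MPI-between-nodes, OpenMP-on-the-node design.
program hybrid
   use mpi
   implicit none
   integer, parameter :: n = 1000000
   integer :: ierr, provided, rank, i
   real*8  :: a(n), local_sum, global_sum

   ! MPI_THREAD_FUNNELED is enough when only the master thread calls MPI
   call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   local_sum = 0.0d0
!$omp parallel do reduction(+:local_sum)
   do i = 1, n
      a(i) = dble(i + rank)              ! stand-in for a real kernel
      local_sum = local_sum + a(i)
   end do
!$omp end parallel do

   call MPI_Reduce(local_sum, global_sum, 1, MPI_REAL8, MPI_SUM, 0, &
                   MPI_COMM_WORLD, ierr)
   if (rank == 0) print *, 'global sum =', global_sum
   call MPI_Finalize(ierr)
end program hybrid

With one such MPI task per node, only the node-level results cross the interconnect, which reduces the pressure on the NIC described earlier in the chapter.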

EXERCISES
1. The number of MPI messages that an NIC can handle each
second is becoming more and more of a bottleneck on multicore
systems. The user can reduce the number of messages sent by
assuring the best decomposition on the node. Given an all-
MPI application using 8 MPI tasks on the node, how would
the number of messages for one halo exchange change in the
following three scenarios?
a. The MPI tasks on the node are not neighbors of other MPI
tasks on the node.
b. The MPI tasks on the node are arranged in a 2 × 4 × 1 plane
of a grid.
c. The MPI tasks on the node are arranged in a 2 × 2 × 2 cube.
2. Having seen that the decomposition can impact halo exchange
performance, can it also impact the global sum of a single vari-
able? How about the performance of a 3D ALL to ALL?
3. When an application is memory bandwidth limited, users may
be tempted to use fewer cores per node and more nodes. If the
computer center charges by the node-hour, will this method ever
result in reduced cost? What if the computer center charges for
power used?
4. What are some typical interconnect topologies? Why would
the manufacturer choose one over another? What might be the
advantages and disadvantages of the following interconnects?
a. Full all-to-all switch
b. Fat Tree
c. Hypercube
d. 2D Torus
e. 3D Torus
5. Why would an application’s I/O impact other users in the
system?
6. What is operating system jitter? When and why does it inter-
fere with performance? How is it minimized?
7. What is injection bandwidth? How can communication from
one job interfere with the performance of a different job?
8. Given an interconnect with an injection bandwidth of 2 GB/s
and a latency of 5 μs, what is the minimum time to send the
following messages:
a. 100 messages of 80 bytes
b. 10 messages of 800 bytes
c. 1 message of 8000 bytes
9. As multicore nodes grow larger and larger, how might this
impact the design of the interconnect?
a. In injection bandwidth?
b. In message/s handled?
c. In the topology of the interconnect?
Chapter 3

How Compilers
Optimize Programs

To many application developers, compilation of their code is a


“black art.” Often, a programmer questions why the compiler cannot
“automatically” generate the most optimal code from their application.
This chapter reveals the problems a compiler faces generating an efficient
machine code from the source code, and gives some hints about how to
write a code to be more amenable to automatic optimization.

3.1 Memory Allocation


The compiler has several different ways of allocating the data referenced in
a program. First, there is the notion of static arrays; that is, arrays that are
allocated at compile time. Whenever all the characteristics of a variable
are known, it may be allocated in the executable and therefore the data are
actually allocated when the executable is loaded. Most modern applica-
tions dynamically allocate their arrays after reading the input data and
only then allocate the amount of data required. The most often used syntax
to do this is the typical malloc in C and/or using ALLOCATABLE arrays
in Fortran. A more dynamic way to allocate data is by using automatic
arrays. This is achieved in Fortran by passing a subroutine integers that
are then used to dynamically allocate the array on the subroutine stack.
The efficiency of an application heavily depends upon the way arrays
are allocated. When an array is allocated within a larger allocation, and
that allocation is performed only once, the likelihood that the array is
allocated on contiguous physical pages is very high. When arrays are

frequently allocated and deallocated, the likelihood that the array is allo-
cated on contiguous physical pages is very low. There is a significant ben-
efit to having an array allocated on contiguous physical pages. Compilers
can only do so much when the application dynamically allocates and deal-
locates memory. When all the major work arrays are allocated together on
subsequent ALLOCATE statements, the compiler usually allocates a large
chunk of memory and suballocates the individual arrays; this is good.
When users write their own memory allocation routine and call it from
numerous locations within the application, the compiler cannot help to
allocate the data in contiguous pages. If the user must do this, it is far
better to allocate a large chunk of memory first and then manage that data
by cutting up the memory into smaller chunks, reusing the memory when
the data is no longer needed. Deallocating and reallocating memory is
very detrimental to an efficient program. Allocating and deallocating
arrays ends up increasing the compute time spent performing garbage collec-
tion. "Garbage collection" is the term used to describe the process of
releasing unnecessary memory areas and combining them into available
memory for future allocations. Additionally, the allocation and dealloca-
tion of arrays leads to memory fragmentation.
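One way to follow this advice is sketched below (illustrative code; the array names, sizes, and the use of pointer association are assumptions, and integer offsets into the pool are an equally common alternative): a single ALLOCATE provides one large pool, and the individual work arrays are carved out of it.

! Hedged sketch: one large allocation, suballocated into three work arrays.
program pool_allocation
   implicit none
   integer, parameter :: n = 1000000
   real*8, allocatable, target :: pool(:)
   real*8, pointer :: a(:), b(:), c(:)

   allocate(pool(3*n))            ! one allocation for all three arrays
   a => pool(      1:  n)
   b => pool(  n + 1:2*n)
   c => pool(2*n + 1:3*n)

   a = 1.0d0
   b = 2.0d0
   c = a + b                      ! the slices behave like ordinary arrays

   deallocate(pool)               ! a single deallocation at the end
end program pool_allocation

Because the three arrays are adjacent slices of one allocation performed once, the operating system has its best chance of backing them with contiguous physical pages, and no garbage collection is triggered by repeated allocate/deallocate cycles.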

3.2 Memory Alignment


As discussed in Chapter 2, the way the program memory is allocated
impacts the runtime performance. How are arrays, which are often used
together, aligned, and does that alignment facilitate the use of SSE instruc-
tions, the cache, and the TLB? The compiler tries to align arrays to utilize
cache effectively; however, the semantics of the Fortran language do not
always allow the compiler to move the location of one array relative to
another. The following Fortran constructs inhibit a compiler from padding
and/or aligning arrays:

1. Fortran COMMON BLOCK


2. Fortran MODULE
3. Fortran EQUIVALENCE
4. Passing arrays as arguments to a subroutine
5. Any usage of POINTER
6. ALLOCATABLE ARRAYS

When the application contains any of these structures, there can be an


implicit understanding of the location of one array in relation to another.
Fortran has very strict storage and sequence association rules which must
be obeyed when compiling the code. While COMMON blocks are being
replaced by MODULES in later versions of Fortran, application developers
have come to expect the strict memory alignment imposed by COMMON
blocks. For example, when performing I/O on a set of arrays, application
developers frequently pack the arrays into a contiguous chunk of logical
memory and then write out the entire chunk with a single write. For
example,
COMMON A(100,100), B(100,100), C(100,100)
WRITE (10) (A(I), I=1, 30000)

Using this technique results in a single I/O operation, and in this case,
a write outputting all A, B, and C arrays in one large block, which is a good
strategy for efficient I/O. If an application employs this type of coding, the
compiler cannot perform padding on any of the arrays A, B, or C.
Consider the following call to a subroutine crunch.
CALL CRUNCH (A(1,10), B(1,1), C(5,10))
ooo
SUBROUTINE CRUNCH (D,E,F)
DIMENSION D(100), E(10000), F(1000)

This is legal Fortran and it certainly keeps the compiler from moving A,
B, and C around. Unfortunately, since Fortran does not prohibit this type of
coding, the alignment and padding of arrays by compilers is very limited.
A compiler can pad and modify the alignment of arrays when they are
allocated as automatic or local data. In this case the compiler allocates
memory and adds padding if necessary to properly align the array for effi-
cient access. Unfortunately, the inhibitors to alignment and padding far
outnumber these cases. The application developer should accept responsi-
bility for allocating their arrays in such a way that the alignment is condu-
cive to effective memory utilization. This will be covered in more in
Chapter 6.

3.3 Vectorization
Vectorization is an ancient art, developed over 40 years ago for real vector
machines. While the SSE instructions require the compiler to vectorize
the Fortran DO loop to generate the instructions, today SSE instructions

are not as powerful as the vector instructions of the past. We see this
changing with the next generation of multicore architectures. In an
attempt to generate more floating point operations/clock cycle, the SSE
instructions are becoming wider and will benefit more from vectorization.
While some of today's compilers existed and generated vector code for
the past vector processors, they have had to dumb down their analysis
for the SSE instructions. While performance increases from vectorization
on the past vector systems ranged from 5 to 20, the SSE instructions at
most achieve a factor of 2 for 64-bit arithmetic and a factor of 4 for 32-bit
arithmetic. For this reason, many loops that were parallelizable with some
compiler restructuring (e.g., IF statements) are not vectorized for the SSE
instructions, because the overhead of performing the vectorization is not
justified by the meager performance gain with the SSE instructions.
With the advent of the GPGPUs for HPC and the wider SSE instruc-
tions, using all the complicated restructuring to achieve vectorization will
once again have a big payoff. For this reason, we will include many vector-
ization techniques in the later chapters that may not obtain a performance
gain with today’s SSE instructions; however, they absolutely achieve good
performance gain when moving to a GPGPU. The difference is that the
vector performance of the accelerator is 20–30 times faster than the scalar
performance of the processing unit driving the accelerator. The remainder
of this section concentrates on compiling for cores with SSE instructions.
The first requirement for vectorizing SSE instructions is to make sure
that the DO loops access the arrays contiguously. While there are cases
when a compiler vectorizes part of a loop that accesses the arrays with a
stride and/or indirect addressing, the compiler must perform some over-
head prior to issuing the SSE instructions. Each SSE instruction issued
must operate on a 128-bit register that contains two 64-bit operands or
four 32-bit operands. When arrays are accessed with a stride or with indi-
rect addressing, the compiler must fetch the operands to cache, and then
pack the 128-bit registers element by element prior to issuing the floating
point operation. This overhead in packing the operands and subsequently
unpacking and storing the results back into memory introduces an over-
head that is not required when the scalar, non-SSE instructions are issued.
That overhead degrades from the factor of 2 in 64-bit mode or 4 in 32-bit
mode. In Chapter 5, we examine cases where packing and unpacking of
operands for subsequent vectorization does not pay off. Since compilers
tend to be conservative, most do not vectorize any DO loop with noncon-
tiguous array accesses. Additionally, when a compiler is analyzing DO

loops containing IF conditions, it does not vectorize the DO loop due to


the overhead required to achieve that vectorization. With the current gen-
eration of multicore chips, application developers should expect that only
contiguous DO loops without IFs will be analyzed for vectorization for the
SSE instructions.
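As a small, assumed illustration (not an example from the text), the first loop below runs through A and B with stride 1 and is the kind of loop the compiler will consider for SSE vectorization, while the second uses indirect addressing and would require the gather/pack overhead described above, so most compilers will leave it scalar:

! Contiguous, stride-1 access: a straightforward SSE candidate
      DO I = 1, N
         A(I) = A(I) + SCALAR*B(I)
      ENDDO

! Indirect addressing: operands must be gathered into the 128-bit
! registers, so most compilers will not use SSE instructions here
      DO I = 1, N
         A(I) = A(I) + SCALAR*B(INDX(I))
      ENDDO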

3.3.1 Dependency Analysis


The hardest part in vectorizing a DO loop is determining if the operations
in the loop are independent. Can the loop operations be distributed across
all the iterations of the loop? If the loop has a “loop-carried” dependency,
the loop cannot be vectorized. Consider this classic example:
DO I = N1,N2
A(I) = A(I+K) + scalar * B(I)
ENDDO

If K is −1, the loop cannot be vectorized, but if K is +1, it can be
vectorized.
So what happens if K is equal to –1:
• Iteration I = N1: A(N1) = A(N1 − 1) + scalar * B(N1)
• Iteration I = N1 + 1: A(N1 + 1) = A(N1) + scalar * B(N1 + 1)
• Iteration I = N1 + 2: A(N1 + 2) = A(N1 + 1) + scalar * B(N1 + 2)
(Bold, italic indicates an array element computed in the DO loop.)
Here the element of the A array needed as input to the second iteration
is calculated in the first iteration of the loop. Each update after the first
requires a value calculated the previous time through the loop. We say that
this loop has a loop-carried dependency.
When a loop is vectorized, all the data on the right-hand side of the
equal sign must be available before the loop executes or be calculated ear-
lier in the loop, and this is not the case with K = –1 above.
Now, what if K is equal to +1:
• Iteration I = N1: A(N1) = A(N1 + 1) + scalar * B(N1)
• Iteration I = N1 + 1: A(N1 + 1) = A(N1 + 2) + scalar * B(N1 + 1)
• Iteration I = N1 + 2: A(N1 + 2) = A(N1 + 3) + scalar * B(N1 + 2)
Here, all the values needed as input are available before the loop exe-
cutes. Although A(N1 + 1) is calculated in the second pass of the DO, its

old value is used in the first pass of the DO loop and that value is available.
Another way of looking at this loop is to look at the array assignment:
A(N1 : N2) = A(N1 + K:N2 + K) + scalar * B(N1:N2)

Regardless of what value K takes on, the array assignment specifies that
all values on the right hand side of the replacement sign are values that
exist prior to executing the array assignment. For this example, the array
assignment when K = −1 is not equivalent to the DO loop when K = −1.
When K = +1, the array assignment and the DO loop are equivalent.
The compiler cannot vectorize the DO loop without knowing the value
of K. Some compilers may compile both a scalar and vector version of the
loop and perform a runtime check on K to choose which loop to execute.
But this adds overhead that might even cancel any potential speedup that
could be gained from vectorization, especially if the loop involves more
than one value that needs to be checked. Another solution is to have a
comment line directive such as
!DIR$ IVDEP

This directive was introduced by Cray® Research in 1976 to address situ-


ations where the compiler needed additional information about the DO
loop. A problem would arise if this directive was placed on the DO loop and
there were true data dependencies. Wrong answers would be generated.
When a compiler is analyzing a DO loop in C, additional complications
can hinder its optimization. For example, when using C pointers
for (i = 0; i < 100; i++) p1[i] = p2[i];

p1 and p2 can point anywhere; because the compiler cannot prove that they
do not overlap, it is restricted from vectorizing any computation that uses
pointers. There are compiler switches that can override such concerns for a
compilation unit.

3.3.2 Vectorization of IF Statements


The legacy vector computers had special hardware to handle the vectoriza-
tion of conditional blocks of the code within loops controlled by IF state-
ments. Lacking this hardware and any real performance boost from
vectorization, the effect of vectorizing DO loops containing a conditional
code seems marginal on today’s SSE instructions; however, future, wider
SSE instructions and GPUs can benefit from such vectorization.
There are two ways to vectorize a DO loop with an IF statement. One is
to generate a code that computes all values of the loop index and then only

store results when the IF condition is true. This is called a controlled store.
For example,
DO I = 1,N
IF(C(I).GE.0.0)B(I) = SQRT(A(I))
ENDDO

The controlled store approach would compute all the values for I = 1,
2, 3, …, N and then only store the values where the condition C(I).GE.0.0
is true. If the condition is never true, this will give extremely poor performance.
If, on the other hand, a majority of the conditions are true, the benefit can
be significant. Controlled store treatment of IF statements has a problem in
that the condition could be hiding a singularity. For example,
DO I = 1,N
IF(A(I).GE.0.0)B(I) = SQRT(A(I))
ENDDO

Here, the SQRT(A(I)) is not defined when A(I) is less than zero. Most
smart compilers handle this by artificially replacing A(I) with 1.0 when-
ever A(I) is less than zero and take the SQRT of the resultant operand as
follows:
DO I = 1,N
IF(A(I).LT.0.0) TEMP(I) = 1.0
IF(A(I).GE.0.0) TEMP(I) = A(I)
IF(A(I).GE.0.0)B(I) = SQRT(TEMP(I))
ENDDO

A second way is to compile a code that gathers all the operands for the
cases when the IF condition is true, then perform the computation for the
“true” path, and finally scatter the results out into the result arrays.
Considering the previous DO loop,
DO I = 1,N
IF(A(I).GE.0.0)B(I) = SQRT(A(I))
ENDDO

The compiler effectively performs the following operations:

II = 1
DO I = 1,N
IF(A(I).GE.0.0)THEN
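!     The excerpt breaks off at this point.  What follows is a hedged
!     sketch, not the book's code, of how the gather/compute/scatter
!     sequence described above might continue; the work arrays TEMP and
!     INDEX are assumed names.
            TEMP(II) = A(I)
            INDEX(II) = I
            II = II + 1
         ENDIF
      ENDDO
!     Compute only the gathered "true" elements
      DO I = 1, II-1
         TEMP(I) = SQRT(TEMP(I))
      ENDDO
!     Scatter the results back into B
      DO I = 1, II-1
         B(INDEX(I)) = TEMP(I)
      ENDDO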