SERIES EDITOR
Horst Simon
Deputy Director
Lawrence Berkeley National Laboratory
Berkeley, California, U.S.A.
This series aims to capture new developments and applications in the field of computational sci-
ence through the publication of a broad range of textbooks, reference works, and handbooks.
Books in this series will provide introductory as well as advanced material on mathematical, sta-
tistical, and computational methods and techniques, and will present researchers with the latest
theories and experimentation. The scope of the series includes, but is not limited to, titles in the
areas of scientific computing, parallel and distributed computing, high performance computing,
grid computing, cluster computing, heterogeneous computing, quantum computing, and their
applications in scientific disciplines such as astrophysics, aeronautics, biology, chemistry, climate
modeling, combustion, cosmology, earthquake prediction, imaging, materials, neuroscience, oil
exploration, and weather forecasting.
PUBLISHED TITLES
PETASCALE COMPUTING: ALGORITHMS AND APPLICATIONS
Edited by David A. Bader
PROCESS ALGEBRA FOR PARALLEL AND DISTRIBUTED PROCESSING
Edited by Michael Alexander and William Gardner
GRID COMPUTING: TECHNIQUES AND APPLICATIONS
Barry Wilkinson
INTRODUCTION TO CONCURRENCY IN PROGRAMMING LANGUAGES
Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen
INTRODUCTION TO SCHEDULING
Yves Robert and Frédéric Vivien
SCIENTIFIC DATA MANAGEMENT: CHALLENGES, TECHNOLOGY, AND DEPLOYMENT
Edited by Arie Shoshani and Doron Rotem
INTRODUCTION TO THE SIMULATION OF DYNAMICS USING SIMULINK®
Michael A. Gray
INTRODUCTION TO HIGH PERFORMANCE COMPUTING FOR SCIENTISTS
AND ENGINEERS, Georg Hager and Gerhard Wellein
PERFORMANCE TUNING OF SCIENTIFIC APPLICATIONS, Edited by David Bailey,
Robert Lucas, and Samuel Williams
HIGH PERFORMANCE COMPUTING: PROGRAMMING AND APPLICATIONS
John Levesque with Gene Wagenbreth
High Performance
Computing
Programming and Applications
John Levesque
with Gene Wagenbreth
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Levesque, John M.
High performance computing : programming and applications / John Levesque, Gene
Wagenbreth.
p. cm. -- (Chapman & Hall/CRC computational science series)
Includes bibliographical references and index.
ISBN 978-1-4200-7705-6 (hardcover : alk. paper)
1. High performance computing. 2. Supercomputers--Programming. I. Wagenbreth,
Gene. II. Title.
QA76.88.L48 2011
004.1’1--dc22 2010044861
Contents

Introduction, xi
4.1.2 Decomposition 52
4.1.3 Scaling an Application 53
4.2 MESSAGE PASSING INTERFACE 55
4.2.1 Message Passing Statistics 55
4.2.2 Collectives 56
4.2.3 Point-to-Point Communication 57
4.2.3.1 Combining Messages into
Larger Messages 58
4.2.3.2 Preposting Receives 58
4.2.4 Environment Variables 61
4.2.5 Using Runtime Statistics to Aid MPI-Task
Placement 61
4.3 USING OPENMP™ 62
4.3.1 Overhead of Using OpenMP™ 63
4.3.2 Variable Scoping 64
4.3.3 Work Sharing 66
4.3.4 False Sharing in OpenMP™ 68
4.3.5 Some Advantages of Hybrid Programming:
MPI with OpenMP™ 70
4.3.5.1 Scaling of Collectives 70
4.3.5.2 Scaling Memory Bandwidth Limited MPI
Applications 70
4.4 POSIX THREADS® 71
4.5 PARTITIONED GLOBAL ADDRESS SPACE LANGUAGES (PGAS) 77
4.5.1 PGAS for Adaptive Mesh Refinement 78
4.5.2 PGAS for Overlapping Computation and
Communication 78
4.5.3 Using CAF to Perform Collective Operations 79
4.6 COMPILERS FOR PGAS LANGUAGES 83
4.7 ROLE OF THE INTERCONNECT 85
EXERCISES 85
References, 221
Index, 223
Introduction
should address the most obvious bottleneck first and work on the “whack-a-mole” principle; that is, address the bottlenecks from the highest to the
lowest. Potentially, the highest bottleneck could be single-processor per-
formance, single-node performance, scaling issues, or input/output (I/O).
If the most time-consuming bottleneck is single-processor performance,
Chapter 6 discusses how the user can address performance bottlenecks in
processor or single-core performance. In this chapter, we examine compu-
tational kernels from a wide spectrum of real HPC applications. Here read-
ers should find examples that relate to important code sections in their own
applications. The actual timing results for these examples were gathered
by running on hardware current at the time this book was published.
The companion Web site (www.hybridmulticoreoptimization.com) con-
tains all the examples from the book, along with updated timing results on
the latest released processors. The focus here is on single-core performance,
efficient cache utilization, and loop vectorization techniques.
Chapter 7 addresses scaling performance. If the most obvious bottleneck
is communication and/or synchronization time, the user must understand
how a given application can be restructured to scale to higher processor
counts. This chapter looks at numerous MPI implementations, illustrating
bottlenecks to scaling. By looking at existing message passing implementa-
tions, the reader may be able to identify those techniques that work very well
and some that have limitations in scaling to a large number of processors.
In Chapter 7, we address the optimization of I/O as an application is
moved to higher processor counts. Of course, this section depends heavily
on the network and I/O capabilities of the target system; however, there is a set of rules that should be considered when performing I/O from a large parallel job.
As the implementation of a hybrid MPI/OpenMP programming para-
digm is one method of improving scalability of an application, Chapter 8
is all about OpenMP performance issues. In particular, we discuss the effi-
cient use of OpenMP by looking at the Standard Performance Evaluation
Corporation (SPEC®) OpenMP examples. In this chapter, we look at each
of the applications and discuss memory bandwidth issues that arise from
multiple cores accessing memory simultaneously. We see that efficient
OpenMP requires good cache utilization on each core of the node.
A new computational resource is starting to become a viable HPC alter-
native. General purpose graphics processor units (GPGPUs) have gained
significant market share in the gaming and graphics area. These systems
previously did not have an impact on the HPC community owing to the
Multicore Architectures
For example, registers can typically deliver multiple operands to the func-
tional units in one clock cycle and multiple registers can be accessed in
each clock cycle. Level 1 cache can deliver 1–2 operands to the registers in
each clock cycle. Lower levels of cache deliver fewer operands per clock
cycle and the latency to deliver the operands increases as the cache is fur-
ther from the processor. As the distance from the processor increases, the size of the memory component increases. There are tens of registers; Level 1 cache is typically 64 KB (in this context, K represents 1024 bytes, so 64 KB is 65,536 bytes). Higher levels of cache hold more data and
main memory is the largest component of memory.
Utilizing this memory architecture is the most important lesson a pro-
grammer can learn to effectively program the system. Unfortunately,
compilers cannot solve the memory locality problem automatically. The
programmer must understand the various components of the memory
architecture and be able to build their data structures to most effectively
mitigate the lack of sufficient memory bandwidth.
The amount of data that is processed within a major computational rou-
tine must be known to understand how that data flows from memory
through the caches and back to memory. To some extent the computation
performed on the data is not important. A very important optimization
technique, “cache blocking,” is a restructuring process that restricts the
amount of data processed in a computational chunk so that the data fits
within Level 1 and/or Level 2 cache during the execution of the blocked DO
loop. Cache blocking will be discussed in more detail in Chapter 6.
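As a preview, the following is a minimal sketch of cache blocking applied to a matrix multiply; the block size NB is an assumed, tunable value chosen so that the working set of the inner loops fits within Level 1 or Level 2 cache:

      PARAMETER (N=1024, NB=64)
      REAL*8 A(N,N), B(N,N), C(N,N)
! Blocked matrix multiply (C assumed zeroed): each (JJ,KK) block pair
! reuses an NB x NB tile of B while it is resident in cache
      DO JJ = 1, N, NB
         DO KK = 1, N, NB
            DO J = JJ, MIN(JJ+NB-1,N)
               DO K = KK, MIN(KK+NB-1,N)
                  DO I = 1, N
                     C(I,J) = C(I,J) + A(I,K)*B(K,J)
                  ENDDO
               ENDDO
            ENDDO
         ENDDO
      ENDDO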
Given this mismatch between processor speed and memory speed, hard-
ware designers have built special logic to mitigate this imbalance. This sec-
tion explains how the memory system of a typical microprocessor works
and what the programmer must understand to allocate their data in a form
that can most effectively be accessed during the execution of their program.
Understanding how to most effectively use memory is perhaps the most
important lesson a programmer can learn to write efficient programs.
[Figure 1.1: Operation of the TLB to locate an operand in memory: the page table in main memory translates a virtual address to a physical address. (Adapted from Hennessy, J. L. and Patterson, D. A. with contributions by Dusseau, A. C. et al. Computer Architecture—A Quantitative Approach. Burlington, MA: Morgan Kaufmann.)]
The programmer should always obtain hardware counters for the important sections of their application to determine whether TLB utilization is a problem.
1.1.4 Caches
The processor cannot fetch a single bit, byte, or word; it must fetch an entire
cache line. Cache lines are 64 contiguous bytes, which would contain
either 16 4-byte operands or 8 8-byte operands. When a cache line is
fetched from memory several memory banks are utilized. When accessing
a memory bank, the bank must refresh. Once this happens, new data can-
not be accessed from that bank until the refresh is completed. By having
interleaved memory banks, data from several banks can be accessed in
parallel, thus delivering data to the processor at a faster rate than any one
memory bank can deliver. Since a cache line is contiguous in memory, it
spans several banks and by the time a second cache line is accessed the
banks have had a chance to recycle.
1.1.4.1 Associativity
A cache is a highly structured, expensive collection of memory banks.
There are many different types of caches. A very important characteristic
of the cache is cache associativity. In order to discuss the operation of a
cache we need to envision the following memory structure. Envision that
memory is a two dimensional grid where the width of each box in the grid
is a cache line. The size of a row of memory is the same as one associativity
class of the cache. Each associativity class also has boxes the size of a cache
line. When a system has a direct mapped cache, this means that there is
only one associativity class. On all x86 systems the associativity of Level 1
cache is two way. This means that there are two rows of associativity classes
in Level 1 cache. Now consider a column of memory. Any cache line in a
given column of memory must be fetched to the corresponding column of
the cache. In a two-way associative cache, there are only two locations in
Level 1 cache for any of the cache lines in a given column of memory. The
following diagram tries to depict the concept.
Figure 1.3 depicts a two-way associative Level 1 cache. There are only
two cache lines in Level 1 cache that can contain any of the cache lines in
the Nth column of memory. If a third cache line from the same column is
required from memory, one of the two cache lines already occupying the associativity slots must be evicted.
[Figure 1.3: A column of memory rows mapping into the two associativity classes of Level 1 cache.]
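The kernel under discussion is, presumably, the unpadded counterpart of the offset version shown later:

REAL*8 A(65536),B(65536),C(65536)
DO I=1,65536
C(I) = A(I)+SCALAR*B(I)
ENDDO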
Let us assume that A(1)–A(8) are contained in the Nth column in mem-
ory. What is contained in the second row of the Nth column? The width of
the cache is the size of the cache divided by its associativity. The size of this
Level 1 cache is 65,536 bytes, and each associativity class has 32,768 bytes
of locations. Since the array A contains 8 bytes per word, the length of the A array is 65,536 × 8 = 524,288 bytes. The second row of the Nth column will contain A(4097–8192), the third A(8193–12,288), and the sixteenth row
will contain the last part of the array A. The 17th row of the Nth column
will contain B(1)–B(4096), since the compiler should store the array B right
after the array A, and C(1)–C(4096) will be contained in the 33rd row of the Nth
column. Given the size of the dimensions of the arrays A, B, and C, the first
cache line required from each array will be exactly in the same column. We
are not concerned with the operand SCALAR—this will be fetched to a
register for the duration of the execution of the DO loop.
When the compiler generates the fetch for A(1) the cache line contain-
ing A(1) will be fetched to either associativity 1 or associativity 2 in the
Nth column of Level 1 cache. Then the compiler fetches B(1) and the cache
line containing that element will go into the other slot in the Nth column
of Level 1 cache, either into associativity 1 or associativity 2. Figure 1.4
illustrates the contents of Level 1 cache after the fetch of A(1) and B(1).
[Figure 1.4: The two associativity slots of the Nth column hold A(1–8) and B(1–8).]
The add is then generated. To store the result into C(1), the cache line
containing C(1) must be fetched to Level 1 cache. Since there are only two
slots available for this cache line, either the cache line containing B(1) or
the cache line containing A(1) will be flushed out of Level 1 cache into Level 2 cache. Figure 1.5 depicts the state of Level 1 cache once C(1) is fetched to cache.
[Figure 1.5: The two associativity slots of the Nth column now hold C(1–8) and B(1–8).]
On the second pass through the DO loop, the cache line containing A(2) or B(2) will have to be fetched from Level 2; however, that access will overwrite one of the two occupied slots, and we end up thrashing the cache throughout the execution of this DO loop.
Consider the following storage scheme:
REAL*8 A(65544),B(65544),C(65544)
DO I=1,65536
C(I) = A(I)+SCALAR*B(I)
ENDDO
The A array now occupies 16 full rows of memory plus one cache line.
This causes B(1)–B(8) to be stored in the N + 1 column of the 17th row and
C(1)–C(8) is stored in the N + 2 column of the 33rd row. We have offset the
arrays and so they do not conflict in the cache. This storage therefore
results in more efficient execution than that of the previous version. The
next example investigates the performance impact of this rewrite. Figure
1.6 depicts the status of Level 1 cache given this new storage scheme.
Figure 1.7 gives the performance of this simple kernel for various values
of vector length. In these tests the loop is executed 1000 times. This is done
to measure the performance when the arrays A, B, and C are fully contained within the caches. If the vector length is greater than 2730 [65,536 bytes of Level 1 cache / (8 bytes per REAL*8 word × 3 operands fetched and stored) = 2730], the three arrays cannot be contained within Level 1 cache and we have some overflow into Level 2 cache. When the vector length is greater than 21,845 [524,288 bytes of Level 2 cache / (8 bytes per word × 3 operands) = 21,845], the three arrays will not fit in Level
2 cache and the arrays will spill over into Level 3 cache.
[Figure 1.6: Status of Level 1 cache given the new storage scheme; A, B, and C now map to different columns.]
[Figure 1.7: Performance (FLOPS) versus vector length, one series per array dimension: 65,536; 65,544; 65,552; 65,568; 65,600; 65,664; 65,792; 66,048.]
In Figure 1.7, there
is a degradation of performance as N increases, and this is due to where
the operands reside prior to being fetched.
More variation exists than that attributed to increasing the vector
length. There is a significant variation of performance of this simple kernel
due to the size of each of the arrays. Each individual series indicates the
dimension of the three arrays in memory. Notice that the series with the
worst performance is dimensioned by 65,536 REAL*8 words. The perfor-
mance of this memory alignment is extremely bad because the three arrays
are overwriting each other in Level 1 cache, as explained earlier. The next series, which represents the case where we add a cache line to the dimension of the arrays, gives slightly better performance; however, it is still poor. The third series once again gives slightly better performance, and so on. The best-performing series is when the arrays are dimensioned by 65,536 plus a page (512 8-byte words).
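Concretely, the best series corresponds to declarations of the following form (66,048 = 65,536 + 512):

REAL*8 A(66048),B(66048),C(66048)   ! each array padded by one 4096-byte page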
The reason for this significant difference in performance is due to the
memory banks in Level 1 cache. When the arrays are dimensioned as a large power of two, they are aligned in cache; as the three arrays are accessed, the accesses pass over the same memory banks, and the fetching of operands stalls until the banks refresh. When the arrays are offset by a full
page of 4096 bytes, the banks have time to recycle before another operand
is accessed.
The lesson from this example is to pay attention to the alignment of
arrays. As will be discussed in the compiler section, the compiler can only
do so much to “pad” arrays to avoid these alignment issues. The applica-
tion programmer needs to understand how best to organize their arrays to
achieve the best possible cache utilization. Memory alignment plays a very important role in effective cache utilization, a lesson a programmer must master when writing efficient applications.
Consider the following DO loop:

DO I=1,20
A(I)=B(I)+C(I)
ENDDO
If B(1) and C(1) are both the first elements of a super word, the move
from cache to register and from register through the adder back into the
register can be performed at two results a clock cycle. If on the other hand
they are not aligned, the compiler must move the unaligned operands into
a 128-bit register prior to issuing the add. These align operations tend to
decrease the performance attained by the functional unit.
Unfortunately it is extremely difficult to assure that the arrays are
aligned on super word boundaries. One way to help would be to always
make the arrays a multiple of 128 bits. What about scalars? One can always pack the scalars together in the data allocation and make sure that the sum of the scalar storage is a multiple of 128 bits. If everything allocated has a length that is a multiple of 128 bits, the first word of every array is assured of falling on a 128-bit boundary.
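A minimal sketch of this packing idea, with illustrative names: the four 8-byte scalars total 32 bytes, a multiple of 128 bits, so each array that follows starts on a super-word boundary (assuming the COMMON block itself is 16-byte aligned):

      REAL*8 S1, S2, S3, S4            ! packed scalars: 4 x 8 bytes = 32 bytes
      REAL*8 A(1024), B(1024)          ! array lengths are multiples of 128 bits
      COMMON /ALIGNED/ S1, S2, S3, S4, A, B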
Both hardware and software prefetching strive to ensure that the operands of the DO loop are available when the functional units are available.
When the memory controller (hardware) senses a logical pattern to address-
ing the operands, the next logical cache line is fetched automatically before
the processor requests it. Hardware prefetching cannot be turned off. The
hardware does this to help mitigate the latency to access the memory.
When the compiler optimizes a DO loop, it may use software prefetches
prior to the DO loop and just before the end of the DO loop to prefetch
operands required in the next pass through the loop. Additionally, com-
pilers allow the user to influence the prefetching with comment line direc-
tives. These directives give the user the ability to override the compiler and
indicate which array should be prefetched.
Since two contiguous logical pages may not be contiguous in physical
memory, neither type of prefetching can prefetch across a page boundary. When using small pages, prefetching is limited to 63 cache lines on each array (a 4096-byte small page holds 64 cache lines of 64 bytes each, so at most the 63 lines remaining in the page can be prefetched ahead).
When an application developer introduces their own cache blocking, it
is usually designed without considering hardware or software prefetching.
This turns out to be extremely important since the prefetching can utilize
cache space. When cache blocking is performed manually it might be best
to turn off software prefetching.
The intent of this section is to give the programmer the important
aspects of memory that impact the performance of their application.
Future sections will examine kernels of real applications that utilize mem-
ory poorly; unfortunately most applications do not effectively utilize
memory. By using the hardware counter information and following the
techniques discussed in later chapters, significant performance gains can
be achieved by effectively utilizing the TLB and cache.
The SSE instructions can produce four single precision results each clock cycle (cc) or two double precision results. When running in REAL*8 or double precision, the result rate of the functional units is cut in half, since each result requires 64-bit-wide operands. The following table gives the result rates per clock cycle (cc) from the SSE2 and SSE3 instructions:

Precision           Results per cc
32-bit (single)     4
64-bit (double)     2
For the compiler to utilize SSE instructions, it must know that the two
operations being performed within the instruction are independent of
each other. All the current generation compilers are able to determine
the lack of dependency by analyzing the DO loop for vectorization.
Vectorization will be discussed in more detail in later chapters; however, it
is important to understand that the compiler must be able to vectorize a
DO loop or at least a part of the DO loop in order to utilize the SSE
instructions.
Not all adds and multiplies would use the SSE instructions. There is an
added requirement for using the SSE instructions. The operands to the
instruction must be aligned on an instruction-width boundary (super words of 128 bits). For example:
DO I=1, 100
A(I)=B(I)+SCALAR*C(I)
ENDDO
The arguments for the MULTIPLY and ADD units must be on a 128-bit
boundary to use the SSE or packed instruction. This is a severe restriction;
however, the compiler can employ operand shifts to align the operands in
the 128 bit registers. Unfortunately, these shift operations will detract
from the overall performance of the DO loops. This loop was run with the
following two allocations:
PARAMETER (IIDIM=100)
COMMON A(IIDIM),AF,B(IIDIM),BF,C(IIDIM) ==> run I
COMMON A(IIDIM),B(IIDIM),C(IIDIM) ==> run II
REAL*8 A,B,C,AF,BF,SCALAR
Since IIDIM is even, we know that the arrays will not be aligned in run
I since there is a single 8-byte operand between A and B. On run II the
arrays are aligned on 128-bit boundaries. The difference in performance
follows:
          Run I    Run II
MFLOPS    208      223
This run was made repetitively to make sure that the operands were in
cache after the first execution of the loop, and so the difference in these
two runs is simply the additional alignment operations. We get roughly a 7% difference in performance (223 versus 208 MFLOPS) when the arrays are aligned on 128-bit boundaries.
ies. This is yet another example where memory alignment impacts the
performance of an application.
Given these alignment issues, applications which stride through mem-
ory and/or use indirect addressing will suffer significantly from alignment
problems and usually the compiler will only attempt to use SSE instruc-
tions for DO loops that access memory contiguously.
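For example, loops such as the following sketches (the names are illustrative) access memory with a stride or through an index array, and most compilers will leave them scalar:

      DO I = 1, N, 4                 ! stride-4 access: noncontiguous
         A(I) = B(I) + C(I)
      ENDDO
      DO I = 1, N
         A(I) = B(INDX(I)) + C(I)    ! indirect addressing through INDX
      ENDDO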
In addition to the floating point ADD and MULTIPLY, the memory
functions (LOAD, STORE, MOVE) are also as wide as the floating point
units. With the SSE2 instruction set, the memory functions were 128 bits wide. On the multicores with SSE2 instructions, vectorization was therefore still valuable, not because of the width of the floating point units, which were only 64 bits wide, but because of the memory operations, which could be done as 128-bit operations.
[Figure: Memory hierarchies of the Intel Nehalem, AMD Barcelona, and Intel Harpertown multicore processors, showing L1D (32 KB), L2 (6 MB), and L3 (2 MB, 8 MB) cache sizes and their memory and interconnect transfer rates.]
EXERCISES
1. What is the size of the TLB on current AMD processors? How
much memory can be mapped by the TLB at any one time?
What are possible causes and solutions for TLB thrashing?
2. What is cache line associativity? On x86 systems how many
rows of associativity are there for Level 1 cache? How can these cause performance problems? What remedies are there for these performance problems?
The MPP
A Combination of Hardware
and Software
access to the entire system; however, what would happen if there were several
large applications running in the system? Consider an analogy. Say we pack
eggs in large cubical boxes holding 1000 eggs each in a 10 × 10 × 10 configu-
ration. Now we want to ship these boxes of eggs in a truck that has a dimen-
sion of 95 × 45 × 45. With the eggs contained in the 10 × 10 × 10 boxes we
could only get 144,000 (90 × 40 × 40) eggs in the truck. If, on the other hand,
we could split up some of the 10 × 10 × 10 boxes into smaller boxes we could
put 192,375 (95 × 45 × 45) eggs in the truck. This egg-packing problem illus-
trates one of the trade-offs in scheduling several MPI jobs onto a large MPP
system. Given a 3D torus as discussed earlier, it has three dimensions which
correspond to the dimensions of the truck. There are two techniques that
can be used. The first does not split up jobs; it allocates the jobs in the pre-
ferred 3D shape and the scheduler will never split up jobs. Alternatively, the
scheduler allocates the jobs in the preferred 3D shape until the remaining
holes in the torus are not big enough for another contiguous 3D shape and
then the scheduler will split up the job into smaller chunks that can fit into
the holes remaining in the torus. In the first case, a large majority of the
communication is performed within the 3D shape and very few messages
will be passed outside of that 3D shape. The second approach will utilize
much more of the system’s processors; however, individual jobs may inter-
fere with other jobs since their messages would necessarily go through that
portion of the torus that is being used for another application.
The default placement of MPI tasks is only a default and may be overridden by the scheduler. Later we will see how the performance of an application can be improved by effectively mapping the MPI tasks onto the 3D torus.
The latency is a function of the location of the sender and the receiver on the network and of how long it takes the operating system at the receiving end to process the incoming message.
EXERCISES
1. The number of MPI messages that an NIC can handle each
second is becoming more and more of a bottleneck on multicore systems.
How Compilers
Optimize Programs
When arrays are frequently allocated and deallocated, the likelihood that an array is allocated on contiguous physical pages is very low. There is a significant benefit to having an array allocated on contiguous physical pages. Compilers
can only do so much when the application dynamically allocates and deal-
locates memory. When all the major work arrays are allocated together on
subsequent ALLOCATE statements, the compiler usually allocates a large
chunk of memory and suballocates the individual arrays; this is good.
When users write their own memory allocation routine and call it from
numerous locations within the application, the compiler cannot help to
allocate the data in contiguous pages. If the user must do this it is far
better to allocate a large chunk of memory first and then manage that data
by cutting up the memory into smaller chunks, reusing the memory when
the data is no longer needed. Deallocating and reallocating memory is very detrimental to an efficient program. Allocating and deallocating arrays ends up increasing the compute time spent performing garbage collection. "Garbage collection" is the term used to describe the process of
releasing unnecessary memory areas and combining them into available
memory for future allocations. Additionally, the allocation and dealloca-
tion of arrays leads to memory fragmentation.
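The recommended pattern, as a minimal sketch with assumed names and sizes, is to allocate one large pool once and suballocate the individual arrays from it:

      PARAMETER (N=1000000)
      REAL*8, ALLOCATABLE :: POOL(:)
      INTEGER IA, IB, IC
      ALLOCATE (POOL(3*N))   ! one large allocation, likely on contiguous pages
      IA = 1                 ! A occupies POOL(IA:IA+N-1)
      IB = IA + N            ! B occupies POOL(IB:IB+N-1)
      IC = IB + N            ! C occupies POOL(IC:IC+N-1)
! when A, B, and C are no longer needed, reuse POOL for new arrays
! instead of deallocating and reallocating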
Allocating the arrays together and writing them with a single statement results in a single I/O operation, in this case a write outputting all of the A, B, and C arrays in one large block, which is a good strategy for efficient I/O. If an application employs this type of coding, however, the compiler cannot perform padding on any of the arrays A, B, or C.
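The coding in question presumably resembles the following sketch (unit number and sizes assumed): because A, B, and C are contiguous in one COMMON block, a single WRITE moves all three arrays in one large transfer.

      PARAMETER (N=100000)
      REAL*8 A, B, C
      COMMON /IOBLK/ A(N), B(N), C(N)
      WRITE (10) A, B, C    ! one I/O operation outputs all three arrays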
Consider the following call to a subroutine crunch.
CALL CRUNCH (A(1,10), B(1,1), C(5,10))
...
SUBROUTINE CRUNCH (D,E,F)
DIMENSION D(100), E(10000), F(1000)
This is legal Fortran and it certainly keeps the compiler from moving A,
B, and C around. Unfortunately, since Fortran does not prohibit this type of
coding, the alignment and padding of arrays by compilers is very limited.
A compiler can pad and modify the alignment of arrays when they are
allocated as automatic or local data. In this case the compiler allocates
memory and adds padding if necessary to properly align the array for effi-
cient access. Unfortunately, the inhibitors to alignment and padding far
outnumber these cases. The application developer should accept responsi-
bility for allocating their arrays in such a way that the alignment is condu-
cive to effective memory utilization. This will be covered in more detail in Chapter 6.
3.3 VECTORIZATION
Vectorization is an ancient art, developed over 40 years ago for real vector
machines. While the SSE instructions require the compiler to vectorize
the Fortran DO loop to generate the instructions, today SSE instructions
are not as powerful as the vector instructions of the past. We see this
changing with the next generation of multicore architectures. In an
attempt to generate more floating point operations/clock cycle, the SSE
instructions are becoming wider and will benefit more from vectorization.
While some of today’s compilers existed and generated vector code for the past vector processors, they have had to dumb down their analysis for the SSE instructions. While performance increases from vectorization on the past vector systems ranged from factors of 5 to 20, the SSE instructions at
most achieve a factor of 2 for 64-bit arithmetic and a factor of 4 for 32-bit
arithmetic. For this reason, many loops that were parallelizable with some
compiler restructuring (e.g., IF statements) are not vectorized for the SSE
instructions, because the overhead of performing the vectorization is not
justified by the meager performance gain with the SSE instructions.
With the advent of the GPGPUs for HPC and the wider SSE instruc-
tions, using all the complicated restructuring to achieve vectorization will
once again have a big payoff. For this reason, we will include many vector-
ization techniques in the later chapters that may not obtain a performance
gain with today’s SSE instructions; however, they absolutely achieve good
performance gain when moving to a GPGPU. The difference is that the
vector performance of the accelerator is 20–30 times faster than the scalar
performance of the processing unit driving the accelerator. The remainder
of this section concentrates on compiling for cores with SSE instructions.
The first requirement for vectorizing SSE instructions is to make sure
that the DO loops access the arrays contiguously. While there are cases
when a compiler vectorizes part of a loop that accesses the arrays with a
stride and/or indirect addressing, the compiler must perform some over-
head prior to issuing the SSE instructions. Each SSE instruction issued
must operate on a 128-bit register that contains two 64-bit operands or
four 32-bit operands. When arrays are accessed with a stride or with indi-
rect addressing, the compiler must fetch the operands to cache, and then
pack the 128-bit registers element by element prior to issuing the floating
point operation. This overhead in packing the operands and subsequently
unpacking and storing the results back into memory introduces an over-
head that is not required when the scalar, non-SSE instructions are issued.
That overhead detracts from the potential factor of 2 in 64-bit mode or 4 in 32-bit mode. In Chapter 5, we examine cases where packing and unpacking of
operands for subsequent vectorization does not pay off. Since compilers
tend to be conservative, most do not vectorize any DO loop with noncon-
tiguous array accesses. Additionally, when a compiler is analyzing DO
how compilers optimize Programs ◾ 31
old value is used in the first pass of the DO loop and that value is available.
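Reconstructed from the array-assignment form quoted next, the DO loop is presumably:

DO I = N1,N2
A(I) = A(I+K) + SCALAR*B(I)
ENDDO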
Another way of looking at this loop is to look at the array assignment:
A(N1:N2) = A(N1+K:N2+K) + SCALAR*B(N1:N2)
Regardless of what value K takes on, the array assignment specifies that
all values on the right hand side of the replacement sign are values that
exist prior to executing the array assignment. For this example, the array
assignment when K = −1 is not equivalent to the DO loop when K = −1.
When K = +1, the array assignment and the DO loop are equivalent.
The compiler cannot vectorize the DO loop without knowing the value
of K. Some compilers may compile both a scalar and vector version of the
loop and perform a runtime check on K to choose which loop to execute.
But this adds overhead that might even cancel any potential speedup that
could be gained from vectorization, especially if the loop involves more
than one value that needs to be checked. Another solution is to have a
comment line directive such as
!DIR$ IVDEP
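Applied to the loop above, the directive is placed immediately ahead of the DO statement; presumably:

!DIR$ IVDEP
DO I = N1,N2
A(I) = A(I+K) + SCALAR*B(I)
ENDDO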
Such a directive tells the compiler to ignore potential dependencies in the loop that follows. C codes raise a similar issue: pointers p1 and p2 can point anywhere, and the compiler is restricted from vectorizing any computation that uses pointers. There are compiler switches that can override such concerns for a compilation unit.
When a DO loop contains an IF statement, one vectorization strategy is to compute every iteration and store results only when the IF condition is true. This is called a controlled store. For example,
DO I = 1,N
IF(C(I).GE.0.0)B(I) = SQRT(A(I))
ENDDO
The controlled store approach would compute all the values for I = 1, 2, 3, …, N and then store only the values where the condition C(I).GE.0.0 was true. If the condition is never true, this will give extremely poor performance. If, on the other hand, a majority of the conditions are true, the benefit can be significant. Controlled-store treatment of IF statements has a problem
that the condition could be hiding a singularity. For example,
DO I = 1,N
IF(A(I).GE.0.0)B(I) = SQRT(A(I))
ENDDO
Here, the SQRT(A(I)) is not defined when A(I) is less than zero. Most
smart compilers handle this by artificially replacing A(I) with 1.0 whenever A(I) is less than zero and taking the SQRT of the resultant operand, as follows:
DO I = 1,N
IF(A(I).LT.0.0) TEMP(I) = 1.0
IF(A(I).GE.0.0) TEMP(I) = A(I)
IF(A(I).GE.0.0)B(I) = SQRT(TEMP(I))
ENDDO
A second way is to compile a code that gathers all the operands for the
cases when the IF condition is true, then perform the computation for the
“true” path, and finally scatter the results out into the result arrays.
Considering the previous DO loop,
DO I = 1,N
IF(A(I).GE.0.0)B(I) = SQRT(A(I))
ENDDO
II = 1
DO I = 1,N
IF(A(I).GE.0.0)THEN
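! (assumed completion of the truncated gather loop: TEMP collects the
!  operands that satisfy the condition and INDX records their indices)
TEMP(II) = A(I)
INDX(II) = I
II = II + 1
ENDIF
ENDDO
! the gathered operands are dense, so this computation vectorizes
DO J = 1, II-1
TEMP(J) = SQRT(TEMP(J))
ENDDO
! finally, scatter the results back into the result array
DO J = 1, II-1
B(INDX(J)) = TEMP(J)
ENDDO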