Optimizing HPC Applications with Intel Cluster Tools, 1st Edition, Alexander Supalov

The document discusses the book 'Optimizing HPC Applications with Intel Cluster Tools' by Alexander Supalov, which focuses on performance optimization for high-performance computing (HPC) applications. It outlines the importance of optimization, the top-down methodology for addressing application and system bottlenecks, and the tools provided by Intel for effective optimization. The book is structured into chapters that guide readers through various aspects of optimization, making it suitable for both students and professionals in the field.

For your convenience Apress has placed some of the front
matter material after the index. Please use the Bookmarks
and Contents at a Glance links to access them.
Contents at a Glance

About the Authors
About the Technical Reviewers
Acknowledgments
Foreword
Introduction

Chapter 1: No Time to Read This Book?
Chapter 2: Overview of Platform Architectures
Chapter 3: Top-Down Software Optimization
Chapter 4: Addressing System Bottlenecks
Chapter 5: Addressing Application Bottlenecks: Distributed Memory
Chapter 6: Addressing Application Bottlenecks: Shared Memory
Chapter 7: Addressing Application Bottlenecks: Microarchitecture
Chapter 8: Application Design Considerations

Index

Introduction

Let’s optimize some programs. We have been doing this for years, and we still love doing it.
One day we thought, Why not share this fun with the world? And just a year later, here we are.
Oh, you just need your program to run faster NOW? We understand. Go to Chapter 1
and get quick tuning advice. You can return later to see how the magic works.
Are you a student? Perfect. This book may help you pass that “Software Optimization
101” exam. Talking seriously about programming is a cool party trick, too. Try it.
Are you a professional? Good. You have hit the one-stop-shopping point for Intel’s
proven top-down optimization methodology and Intel Cluster Studio that includes
Message Passing Interface* (MPI), OpenMP, math libraries, compilers, and more.
Or are you just curious? Read on. You will learn how high-performance computing
makes your life safer, your car faster, and your day brighter.
And, by the way: You will find all you need to carry on, including free trial
software, code snippets, checklists, expert advice, fellow readers, and more at
www.apress.com/source-code.

HPC: The Ever-Moving Frontier


High-performance computing, or simply HPC, is mostly concerned with
floating-point operations per second, or FLOPS. The more FLOPS you get, the better.
For convenience, FLOPS on large HPC systems are typically counted by the trillions
(tera, or 10 to the power of 12) and by the quadrillions (peta, or 10 to the power of 15), hence
TeraFLOPS and PetaFLOPS. Performance of stand-alone computers is currently hovering
at around 1 to 2 TeraFLOPS, which is three orders of magnitude below PetaFLOPS. In
other words, you need around a thousand modern computers to get to the PetaFLOPS
level for the whole system. This will not stay this way forever, for HPC is an ever-moving
frontier: ExaFLOPS are three orders of magnitude above PetaFLOPS, and whole countries
are setting their sights on reaching this level of performance now.
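
The arithmetic behind these prefixes is easy to check. The following is a minimal Python sketch; the helper name machines_needed is ours (hypothetical), and the per-machine figures are simply the 1 to 2 TeraFLOPS quoted above.

```python
# Unit prefixes as powers of ten.
TERA = 10**12
PETA = 10**15
EXA = 10**18


def machines_needed(target_flops, per_machine_flops):
    """Return how many machines are needed to reach the target (rounded up)."""
    return -(-target_flops // per_machine_flops)  # ceiling division on integers


# A stand-alone computer delivers roughly 1 to 2 TeraFLOPS (figure from the text).
print(machines_needed(PETA, 1 * TERA))  # machines needed at 1 TeraFLOPS each
print(machines_needed(PETA, 2 * TERA))  # machines needed at 2 TeraFLOPS each
print(EXA // PETA)                      # ratio between ExaFLOPS and PetaFLOPS
```

At 1 TeraFLOPS per machine you need a thousand machines for one PetaFLOPS; at 2 TeraFLOPS you need five hundred. Either way, "around a thousand" is the right order of magnitude.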
We have come a long way since the days when computing started in earnest. Back
then [sigh!], just before WWII, computing speed was indicated by the two hours necessary
to crack the daily key settings of the Enigma encryption machine. It is indicative that
already then the computations were being done in parallel: each of the several “bombs” [1]
united six reconstructed Enigma machines and reportedly relieved a hundred human
operators from boring and repetitive tasks.

* Here and elsewhere, certain product names may be the property of their respective third parties.

Computing has progressed a lot since those heady days. There is hardly a better
illustration of this than the famous TOP500 list [2]. Twice a year, the teams running the
most powerful non-classified computers on earth report their performance. This
data is then collated and published in time for two major annual trade shows: the
International Supercomputing Conference (ISC), typically held in Europe in June, and the
Supercomputing Conference (SC), traditionally held in the United States in November.
Figure 1 shows how certain aspects of this list have changed over time.

Figure 1. Observed and projected performance of the Top 500 systems


(Source: top500.org; used with permission)

There are several observations we can make looking at this graph [3]:

1. Performance available in every represented category
is growing exponentially (hence, linear graphs in this
logarithmic representation).
2. Only part of this growth comes from the incessant
improvement of processor technology, as represented, for
example, by Moore’s Law [4]. The other part is coming from
putting many machines together to form still larger machines.
3. An extrapolation made on the data obtained so far predicts
that an ExaFLOPS machine is likely to appear by 2018. Very
soon (around 2016) there may be PetaFLOPS machines at
personal disposal.
So, it’s time to learn how to optimize programs for these systems.

Why Optimize?
Optimization is probably the most profitable time investment an engineer can make, as
far as programming is concerned. Indeed, a day spent optimizing a program that takes an
hour to complete may decrease the program turn-around time by half. This means that
after 48 runs, you will recover the time invested in optimization, and then move into
the black.
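
The break-even point is simple arithmetic. Here is a minimal Python sketch of it; the function name break_even_runs is ours (hypothetical), and we assume that "a day" means a full 24 hours, as the figure of 48 runs implies.

```python
def break_even_runs(hours_invested, run_hours, speedup):
    """Number of runs after which the optimization effort has paid for itself.

    hours_invested: time spent optimizing, in hours
    run_hours:      duration of one run before optimization, in hours
    speedup:        factor by which the run became faster (2.0 means half the time)
    """
    hours_saved_per_run = run_hours * (1.0 - 1.0 / speedup)
    return hours_invested / hours_saved_per_run


# One day (assumed to be 24 hours) invested, a one-hour program, twice as fast afterward.
print(break_even_runs(24, 1.0, 2.0))
```

Each run now saves half an hour, so 24 hours of effort are recovered after 48 runs. A larger speedup or a longer-running program shortens the payback accordingly.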
Optimization is also a measure of software maturity. Donald Knuth famously said,
“Premature optimization is the root of all evil,” [5] and he was right in some sense. We will
deal with how far this goes when we get closer to the end of this book. In any case, no one
should start optimizing what has not been proven to work correctly in the first place. And
a correct program is still a very rare and very satisfying piece of art.
Yes, this is not a typo: art. Despite zillions of thick volumes that have been written
and the conferences held on a daily basis, programming is still more art than science.
Likewise, for the process of program optimization. It is somewhat akin to architecture: it
must include flight of fantasy, forensic attention to detail, deep knowledge of underlying
materials, and wide expertise in the prior art. Only this combination—and something
else, something intangible and exciting, something we call “talent”—makes a good
programmer in general and a good optimizer in particular.
Finally, optimization is fun. Some 25 years later, one of us still cherishes the
memories of a day when he made a certain graphical program run 300 times faster than
it used to. A screen update that had been taking half a minute in the morning became
almost instantaneous by midnight. It felt almost like love.

The Top-down Optimization Method


Of course, the optimization process we mention is of the most common type—namely,
performance optimization. We will be dealing with this kind of optimization almost
exclusively in this book. There are other optimization targets, going beyond performance
and sometimes hurting it a lot, like code size, data size, and energy.

The good news is that, once you know what you want to achieve, the methodology is
roughly the same. We will look into those details in Chapter 3. Briefly, you proceed in
top-down fashion through the levels of the problem under analysis (platform,
distributed memory, shared memory, microarchitecture) and iterate in a closed-loop manner
until you exhaust the optimization opportunities at each of these levels. Keep in mind that
a problem fixed at one level may expose a problem somewhere else, so you may need to
revisit the higher levels once more.
This approach crystallized quite a while ago. Its previous incarnation was
formulated by Intel application engineers working in Intel’s application solution centers
in the 1990s [6]. Our book builds on that solid foundation, certainly taking some things a tad
further to account for the time passed.
Now, what happens when top-down optimization meets the closed-loop approach?
Well, this is a happy marriage. Every single level of the top-down method can be handled
by the closed-loop approach. Moreover, the top-down method itself can be enclosed
in another, bigger closed loop where every iteration addresses the biggest remaining
problem at any level where it has been detected. This keeps your priorities
straight and helps you stay focused.
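
To make the outer closed loop more tangible, here is a toy Python sketch. It is an illustration under stated assumptions, not the methodology itself: the function closed_loop, the 5 percent threshold, and the bottleneck costs (fractions of runtime) are all hypothetical, and a real loop would re-measure the application after every fix instead of working from a fixed list.

```python
# The four levels of the top-down method, highest level first.
LEVELS = ["platform", "distributed memory", "shared memory", "microarchitecture"]


def closed_loop(bottlenecks, threshold=0.05):
    """Repeatedly pick the biggest remaining problem at any level.

    bottlenecks: dict mapping a level name to a list of estimated costs
                 (hypothetical fractions of total runtime).
    threshold:   problems cheaper than this are not worth another iteration.
    Returns the order in which the problems would be addressed.
    """
    remaining = {level: list(costs) for level, costs in bottlenecks.items()}
    order = []
    while True:
        # Ties on cost are resolved in favor of the higher (earlier) level.
        candidates = [
            (cost, -LEVELS.index(level), level)
            for level, costs in remaining.items()
            for cost in costs
            if cost >= threshold
        ]
        if not candidates:
            break  # optimization opportunities are exhausted
        cost, _, level = max(candidates)
        remaining[level].remove(cost)  # "fix" the problem
        order.append((level, cost))
    return order


# Hypothetical measurements for one application.
measured = {
    "platform": [0.10],
    "distributed memory": [0.30, 0.04],
    "shared memory": [0.10],
    "microarchitecture": [0.20],
}
for level, cost in closed_loop(measured):
    print(f"{level}: {cost:.0%} of runtime")
```

In this made-up example the distributed memory problem is tackled first because it is the largest, regardless of its position in the level hierarchy, while the 4 percent item is left alone.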

Intel Parallel Studio XE Cluster Edition


Let there be no mistake: the bulk of HPC is still made up of C and Fortran, MPI, OpenMP,
Linux OS, and Intel Xeon processors. This is what we will focus on, with occasional
excursions into several adjacent areas.
There are many good parallel programming packages around, some of them
available for free, some sold commercially. However, to the best of our absolutely
unbiased professional knowledge, for completeness none of them comes in anywhere
close to Intel Parallel Studio XE Cluster Edition [7].
Indeed, just look at what it has to offer—and for a very modest price that does not
depend on the size of the machines you are going to use, or indeed on their number.
• Intel Parallel Studio XE Cluster Edition [8] compilers and libraries,
including:
  • Intel Fortran Compiler [9]
  • Intel C++ Compiler [10]
  • Intel Cilk Plus [11]
  • Intel Math Kernel Library (MKL) [12]
  • Intel Integrated Performance Primitives (IPP) [13]
  • Intel Threading Building Blocks (TBB) [14]
• Intel MPI Benchmarks (IMB) [15]
• Intel MPI Library [16]
• Intel Trace Analyzer and Collector [17]
• Intel VTune Amplifier XE [18]
• Intel Inspector XE [19]
• Intel Advisor XE [20]
All these riches and beauty work on the Linux and Microsoft Windows OS,
sometimes more; support all modern Intel platforms, including, of course, Intel Xeon
processors and Intel Xeon Phi coprocessors; and come at a cumulative discount akin
to the miracles of the Arabian 1001 Nights. Best of all, Intel runtime libraries come
traditionally free of charge.
Certainly, there are good tools beyond Intel Parallel Studio XE Cluster Edition, both
offered by Intel and available in the world at large. Whenever possible and sensible, we
employ those tools in this book, highlighting their relative advantages and drawbacks
compared to those described above. Some of these tools come as open source, some
come with the operating system involved; some can be evaluated for free, while others
may have to be purchased. While considering the alternative tools, we focus mostly on
the open-source, free alternatives that are easy to get and simple to use.

The Chapters of this Book


This is what awaits you, chapter by chapter:
1. No Time to Read This Book? helps you out on the burning
optimization assignment by providing several proven recipes
out of an Intel application engineer’s magic toolbox.
2. Overview of Platform Architectures introduces common
terminology, outlines performance features in modern
processors and platforms, and shows you how to estimate
peak performance for a particular target platform.
3. Top-down Software Optimization introduces the generic
top-down software optimization process flow and the
closed-loop approach that will help you keep the challenge of
multilevel optimization under secure control.
4. Addressing System Bottlenecks demonstrates how you can
utilize Intel Cluster Studio XE and other tools to discover
and remove system bottlenecks as limiting factors to the
maximum achievable application performance.
5. Addressing Application Bottlenecks: Distributed Memory
shows how you can identify and remove distributed memory
bottlenecks using Intel MPI Library, Intel Trace Analyzer and
Collector, and other tools.
6. Addressing Application Bottlenecks: Shared Memory explains
how you can identify and remove threading bottlenecks using
Intel VTune Amplifier XE and other tools.

7. Addressing Application Bottlenecks: Microarchitecture
demonstrates how you can identify and remove microarchitecture
bottlenecks using Intel VTune Amplifier XE and Intel
Composer XE, as well as other tools.
8. Application Design Considerations deals with the key tradeoffs
guiding the design and optimization of applications. You will
learn how to make your next program fast from the start.
Most chapters are sufficiently self-contained to permit individual reading in
any order. However, if you are interested in one particular optimization aspect, you
may decide to go through those chapters that naturally cover that topic. Here is a
recommended reading guide for several selected topics:
• System optimization: Chapters 2, 3, and 4.
• Distributed memory optimization: Chapters 2, 3, and 5.
• Shared memory optimization: Chapters 2, 3, and 6.
• Microarchitecture optimization: Chapters 2, 3, and 7.
Use your judgment and common sense to find your way around. Good luck!

References
1. “Bomba_(cryptography),” [Online]. Available: http://en.wikipedia.org/wiki/Bomba_(cryptography).
2. Top500.Org, “TOP500 Supercomputer Sites,” [Online]. Available: http://www.top500.org/.
3. Top500.Org, “Performance Development TOP500 Supercomputer Sites,” [Online]. Available: http://www.top500.org/statistics/perfdevel/.
4. G. E. Moore, “Cramming More Components onto Integrated Circuits,” Electronics, pp. 114–117, 19 April 1965.
5. “Knuth,” [Online]. Available: http://en.wikiquote.org/wiki/Donald_Knuth.
6. Intel Corporation, “ASC Performance Methodology - Top-Down/Closed Loop Approach,” 1999. [Online]. Available: http://smartdata.usbid.com/datasheets/usbid/2001/2001-q1/asc_methodology.pdf.
7. Intel Corporation, “Intel Cluster Studio XE,” [Online]. Available: http://software.intel.com/en-us/intel-cluster-studio-xe.
8. Intel Corporation, “Intel Composer XE,” [Online]. Available: http://software.intel.com/en-us/intel-composer-xe/.
9. Intel Corporation, “Intel Fortran Compiler,” [Online]. Available: http://software.intel.com/en-us/fortran-compilers.
10. Intel Corporation, “Intel C++ Compiler,” [Online]. Available: http://software.intel.com/en-us/c-compilers.
11. Intel Corporation, “Intel Cilk Plus,” [Online]. Available: http://software.intel.com/en-us/intel-cilk-plus.
12. Intel Corporation, “Intel Math Kernel Library,” [Online]. Available: http://software.intel.com/en-us/intel-mkl.
13. Intel Corporation, “Intel Performance Primitives,” [Online]. Available: http://software.intel.com/en-us/intel-ipp.
14. Intel Corporation, “Intel Threading Building Blocks,” [Online]. Available: http://software.intel.com/en-us/intel-tbb.
15. Intel Corporation, “Intel MPI Benchmarks,” [Online]. Available: http://software.intel.com/en-us/articles/intel-mpi-benchmarks/.
16. Intel Corporation, “Intel MPI Library,” [Online]. Available: http://software.intel.com/en-us/intel-mpi-library/.
17. Intel Corporation, “Intel Trace Analyzer and Collector,” [Online]. Available: http://software.intel.com/en-us/intel-trace-analyzer/.
18. Intel Corporation, “Intel VTune Amplifier XE,” [Online]. Available: http://software.intel.com/en-us/intel-vtune-amplifier-xe.
19. Intel Corporation, “Intel Inspector XE,” [Online]. Available: http://software.intel.com/en-us/intel-inspector-xe/.
20. Intel Corporation, “Intel Advisor XE,” [Online]. Available: http://software.intel.com/en-us/intel-advisor-xe/.

Chapter 1

No Time to Read This Book?

We know what it feels like to be under pressure. Try out a few quick and proven optimization
stunts described below. They may provide a good enough performance gain right away.
There are several parameters that can be adjusted with relative ease. Here are the
steps we follow when hard pressed:
• Use Intel MPI Library [1] and Intel Composer XE [2]
• Got more time? Tune Intel MPI:
• Collect built-in statistics data
• Tune Intel MPI process placement and pinning
• Tune OpenMP thread pinning
• Got still more time? Tune Intel Composer XE:
• Analyze optimization and vectorization reports
• Use interprocedural optimization

Using Intel MPI Library


The Intel MPI Library delivers good out-of-the-box performance for bandwidth-bound
applications. If your application belongs to this popular class, you should feel the
difference immediately when switching over.
If your application has been built for Intel MPI compatible distributions like
MPICH [3], MVAPICH2 [4], or IBM POE [5], and some others, there is no need to recompile the
application. You can switch by dynamically linking the Intel MPI 5.0 libraries at runtime:

$ source /opt/intel/impi_latest/bin64/mpivars.sh
$ mpirun -np 16 -ppn 2 xhpl

If you use another MPI and have access to the application source code, you can
rebuild your application using Intel MPI compiler scripts:
• Use mpicc (for C), mpicxx (for C++), and mpifc/mpif77/mpif90
(for Fortran) if you target GNU compilers.
• Use mpiicc, mpiicpc, and mpiifort if you target Intel Composer XE.
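
If your build scripts are generated programmatically, the choice of compiler script can be captured in a small lookup table. The following Python sketch only restates the two bullets above; the function name mpi_wrapper and the dictionary layout are ours (hypothetical), and for Fortran with GNU compilers we pick mpifc out of the three scripts listed.

```python
# Intel MPI compiler scripts, keyed by the underlying compiler family and language.
WRAPPERS = {
    "gnu": {"c": "mpicc", "c++": "mpicxx", "fortran": "mpifc"},
    "intel": {"c": "mpiicc", "c++": "mpiicpc", "fortran": "mpiifort"},
}


def mpi_wrapper(language, compiler_family):
    """Return the name of the Intel MPI compiler script for the given combination."""
    return WRAPPERS[compiler_family.lower()][language.lower()]


print(mpi_wrapper("C", "Intel"))        # script for C with Intel Composer XE
print(mpi_wrapper("Fortran", "GNU"))    # script for Fortran with GNU compilers
```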

Using Intel Composer XE


The invocation of the Intel Composer XE is largely compatible with the widely used GNU
Compiler Collection (GCC). This includes both the most commonly used command line
options and the language support for C/C++ and Fortran. For many applications you can
simply replace gcc with icc, g++ with icpc, and gfortran with ifort. However, be aware
that although the binary code generated by Intel C/C++ Composer XE is compatible with the
GCC-built executable code, the binary code generated by the Intel Fortran Composer is not.
For example:

$ source /opt/intel/composerxe/bin/compilervars.sh intel64
$ icc -O3 -xHost -qopenmp -c -o example.o example.c

Revisit the compiler flags you used before the switch; you may have to remove some
of them. Make sure that Intel Composer XE is invoked with the flags that give the best
performance for your application (see Table 1-1). More information can be found in the
Intel Composer XE documentation [6].

Table 1-1. Selected Intel Composer XE Optimization Flags

GCC                  ICC            Effect
-------------------  -------------  --------------------------------------------------
-O0                  -O0            Disable (almost all) optimization. Not something you want to use for performance!
-O1                  -O1            Optimize for speed (no code size increase for ICC)
-O2                  -O2            Optimize for speed and enable vectorization
-O3                  -O3            Turn on high-level optimizations
-flto                -ipo           Enable interprocedural optimization
-ftree-vectorize     -vec           Enable auto-vectorization (auto-enabled with -O2 and -O3)
-fprofile-generate   -prof-gen      Generate runtime profile for optimization
-fprofile-use        -prof-use      Use runtime profile for optimization
                     -parallel      Enable auto-parallelization
-fopenmp             -qopenmp       Enable OpenMP
-g                   -g             Emit debugging symbols
                     -qopt-report   Generate the optimization report
                     -vec-report    Generate the vectorization report
                     -ansi-alias    Enable ANSI aliasing rules for C/C++
-msse4.1             -xSSE4.1       Generate code for Intel processors with SSE 4.1 instructions
-mavx                -xAVX          Generate code for Intel processors with AVX instructions
-mavx2               -xCORE-AVX2    Generate code for Intel processors with AVX2 instructions
-march=native        -xHost         Generate code for the current machine used for compilation
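
When you have many build commands to convert, the mapping in Table 1-1 can be applied mechanically. Below is a minimal Python sketch of such a translation. The function translate and the dictionary names are ours (hypothetical), the mapping covers only the rows of Table 1-1 that have a GCC counterpart, and the GCC spellings -flto and -march=native are the ones GCC itself documents. Any option the sketch does not recognize is kept as is and reported for manual review, since, as noted above, you may have to remove some flags.

```python
# Compiler driver names: GNU on the left, Intel Composer XE on the right.
DRIVERS = {"gcc": "icc", "g++": "icpc", "gfortran": "ifort"}

# Flag pairs taken from Table 1-1 (only rows that list a GCC flag).
FLAGS = {
    "-O0": "-O0", "-O1": "-O1", "-O2": "-O2", "-O3": "-O3",
    "-flto": "-ipo",
    "-ftree-vectorize": "-vec",
    "-fprofile-generate": "-prof-gen",
    "-fprofile-use": "-prof-use",
    "-fopenmp": "-qopenmp",
    "-g": "-g",
    "-msse4.1": "-xSSE4.1",
    "-mavx": "-xAVX",
    "-mavx2": "-xCORE-AVX2",
    "-march=native": "-xHost",
}

# Options that mean the same thing to both compilers and need no review.
PASS_THROUGH = {"-c", "-o"}


def translate(command):
    """Translate a simple GCC command line; return (new_command, flags_to_review)."""
    translated, review = [], []
    for position, token in enumerate(command.split()):
        if position == 0 and token in DRIVERS:
            translated.append(DRIVERS[token])
        elif token in FLAGS:
            translated.append(FLAGS[token])
        else:
            if token.startswith("-") and token not in PASS_THROUGH:
                review.append(token)  # unknown option: keep it, but flag it
            translated.append(token)
    return " ".join(translated), review


new_command, review = translate("gcc -O3 -march=native -fopenmp -c -o example.o example.c")
print(new_command)
print("check manually:", review)
```

The sketch splits the command on whitespace, so it is only suitable for simple command lines without quoted arguments.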

For most applications, the default optimization level of -O2 will suffice. The compiler runs
fast at this level, and the resulting code gives reasonable performance. If you feel
adventurous, try -O3. It is more aggressive but it also increases the compilation time.

Tuning Intel MPI Library


If you have more time, you can try to tune Intel MPI parameters without changing the
application source code.

Gather Built-in Statistics


Intel MPI comes with a built-in statistics-gathering mechanism. It creates a negligible
runtime overhead and reports key performance metrics (for example, MPI to
computation ratio, message sizes, counts, and collective operations used) in the popular
IPM format [7].
To switch the IPM statistics gathering mode on and do the measurements, enter the
following commands:

$ export I_MPI_STATS=ipm
$ mpirun -np 16 xhpl

By default, this will generate a file called stats.ipm. Listing 1-1 shows an example
of the MPI statistics gathered for the well-known High Performance Linpack (HPL)
benchmark [8]. (We will return to this benchmark throughout this book, by the way.)

Listing 1-1. MPI Statistics for the HPL Benchmark with the Most Interesting Fields
Highlighted

Intel(R) MPI Library Version 5.0

Summary MPI Statistics


Stats format: region
Stats scope : full

############################################################################
#
# command : /home/book/hpl/./xhpl_hybrid_intel64_dynamic (completed)
# host : esg066/x86_64_Linux mpi_tasks : 16 on 8 nodes
# start : 02/14/14/12:43:33 wallclock : 2502.401419 sec
# stop : 02/14/14/13:25:16 %comm : 8.43
# gbytes : 0.00000e+00 total gflop/sec : NA
#
############################################################################
# region : * [ntasks] = 16
#
# [total] <avg> min max
# entries 16 1 1 1
# wallclock 40034.7 2502.17 2502.13 2502.4
# user 446800 27925 27768.4 28192.7
# system 1971.27 123.205 102.103 145.241
# mpi 3375.05 210.941 132.327 282.462
# %comm 8.43032 5.28855 11.2888
# gflop/sec NA NA NA NA
# gbytes 0 0 0 0
#
#
# [time] [calls] <%mpi> <%wall>
# MPI_Send 2737.24 1.93777e+06 81.10 6.84
# MPI_Recv 394.827 16919 11.70 0.99
# MPI_Wait 236.568 1.92085e+06 7.01 0.59
# MPI_Iprobe 3.2257 6.57506e+06 0.10 0.01
# MPI_Init_thread 1.55628 16 0.05 0.00
# MPI_Irecv 1.31957 1.92085e+06 0.04 0.00
# MPI_Type_commit 0.212124 14720 0.01 0.00
# MPI_Type_free 0.0963376 14720 0.00 0.00
# MPI_Comm_split 0.0065608 48 0.00 0.00
# MPI_Comm_free 0.000276804 48 0.00 0.00
# MPI_Wtime 9.67979e-05 48 0.00 0.00
# MPI_Comm_size 9.13143e-05 452 0.00 0.00
# MPI_Comm_rank 7.77245e-05 452 0.00 0.00
# MPI_Finalize 6.91414e-06 16 0.00 0.00
# MPI_TOTAL 3375.05 1.2402e+07 100.00 8.43
############################################################################

From Listing 1-1 you can deduce that MPI communication occupies between 5.3
and 11.3 percent of the total runtime, and that the MPI_Send, MPI_Recv, and MPI_Wait
operations take about 81, 12, and 7 percent, respectively, of the total MPI time. The spread
between the minimum and maximum per-process MPI time (about 132 versus 282 seconds)
points to potential load imbalances between the job processes. With this data at hand,
you can also see that you should focus on making the MPI_Send operation as fast as it can
go to achieve a noticeable performance hike.
Note that if you use the full IPM package instead of the built-in statistics, you will also
get data on the total communication volume and floating point performance that are not
measured by the Intel MPI Library.
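
The percentages quoted from Listing 1-1 can be reproduced from the raw timings. The following Python sketch does just that; all numbers are copied from Listing 1-1, while the variable names and the helper share are ours (hypothetical).

```python
# Timings in seconds, copied from Listing 1-1.
mpi_time = {"MPI_Send": 2737.24, "MPI_Recv": 394.827, "MPI_Wait": 236.568}
mpi_total = 3375.05         # MPI_TOTAL, summed over all 16 processes
wallclock_total = 40034.7   # wallclock [total], summed over all 16 processes
wallclock_avg = 2502.17     # wallclock <avg> per process
mpi_min, mpi_max = 132.327, 282.462  # mpi min and max per process


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of each operation in the total MPI time (the <%mpi> column).
for name, seconds in mpi_time.items():
    print(f"{name:10s} {share(seconds, mpi_total):6.2f} % of MPI time")

# Overall communication share (the %comm field in the header).
print(f"%comm      {share(mpi_total, wallclock_total):6.2f} % of wallclock time")

# Spread between the least and the most communication-heavy process.
print(f"min/max    {share(mpi_min, wallclock_avg):.1f} % to "
      f"{share(mpi_max, wallclock_avg):.1f} % of wallclock time")
print(f"imbalance  {mpi_max / mpi_min:.1f}x between slowest and fastest process")
```

The last line is the one to watch: the busiest process spends roughly twice as long in MPI as the least busy one, which is the load imbalance hint mentioned above.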

Optimize Process Placement


The Intel MPI Library puts adjacent MPI ranks on one cluster node as long as there are cores
to occupy. Use the Intel MPI command line argument -ppn to control the process placement
across the cluster nodes. For example, this command will start two processes per node:

$ mpirun -np 16 -ppn 2 xhpl
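
To see what this placement looks like, here is a small Python model of it. It is a simplified sketch, not the launcher's actual logic: the function block_placement is ours (hypothetical), and it assumes that enough nodes are available and that adjacent ranks fill one node before moving to the next, as described above.

```python
def block_placement(num_ranks, ppn):
    """Map MPI ranks to node indices, placing ppn adjacent ranks on each node."""
    layout = {}
    for rank in range(num_ranks):
        node = rank // ppn  # adjacent ranks share a node
        layout.setdefault(node, []).append(rank)
    return layout


# Equivalent of "-np 16 -ppn 2": 16 ranks, two per node.
for node, ranks in block_placement(16, 2).items():
    print(f"node {node}: ranks {ranks}")
```

With 16 ranks and two processes per node, the job occupies eight nodes, which matches the "16 on 8 nodes" line in Listing 1-1.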

Intel MPI supports process pinning to restrict the MPI ranks to parts of the system
so as to optimize process layout (for example, to avoid NUMA effects or to reduce latency
to the InfiniBand adapter). Many relevant settings are described in the Intel MPI Library
Reference Manual [9].
Briefly, if you want to run a pure MPI program only on the physical processor cores,
enter the following commands:

$ export I_MPI_PIN_PROCESSOR_LIST=allcores
$ mpirun -np 2 your_MPI_app

If you want to run a hybrid MPI/OpenMP program, don’t change the default Intel
MPI settings, and see the next section for the OpenMP ones.
If you want to analyze Intel MPI process layout and pinning, set the following
environment variable:

$ export I_MPI_DEBUG=4

Optimize Thread Placement


If the application uses OpenMP for multithreading, you may want to control thread
placement in addition to the process placement. Two possible strategies are:

$ export KMP_AFFINITY=granularity=thread,compact
$ export KMP_AFFINITY=granularity=thread,scatter

The first setting keeps threads close together to improve inter-thread
communication, while the second setting distributes the threads across the system to
maximize memory bandwidth.
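
The difference between the two strategies is easiest to see on a small example. The Python sketch below is a deliberately simplified model, not the algorithm used by the OpenMP runtime: the function place_threads is ours (hypothetical), and it ignores hardware threads, treating the machine as a set of sockets with a fixed number of cores each.

```python
def place_threads(num_threads, sockets, cores_per_socket, strategy):
    """Return a list of (socket, core) pairs, one per thread.

    "compact" fills one socket before moving to the next;
    "scatter" deals threads out across the sockets in round-robin fashion.
    """
    placement = []
    for thread in range(num_threads):
        if strategy == "compact":
            socket, core = divmod(thread, cores_per_socket)
        elif strategy == "scatter":
            core, socket = divmod(thread, sockets)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        # Wrap around if there are more threads than cores.
        placement.append((socket % sockets, core % cores_per_socket))
    return placement


# Four threads on a hypothetical machine with two sockets of four cores each.
print("compact:", place_threads(4, 2, 4, "compact"))
print("scatter:", place_threads(4, 2, 4, "scatter"))
```

In this model, compact puts all four threads on socket 0, where they can share a cache, while scatter puts two threads on each socket, so that both memory controllers are in use.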

Programs that use the OpenMP API version 4.0 can use the equivalent OpenMP
affinity settings instead of the KMP_AFFINITY environment variable:

$ export OMP_PROC_BIND=close
$ export OMP_PROC_BIND=spread

If you use I_MPI_PIN_DOMAIN (for example, I_MPI_PIN_DOMAIN=socket), Intel MPI will
confine the OpenMP threads of an MPI process to a single socket. Then you can use the
following setting to avoid thread movement between the logical cores of the socket:

$ export KMP_AFFINITY=granularity=thread

Tuning Intel Composer XE


If you have access to the source code of the application, you can perform optimizations
by selecting appropriate compiler switches and recompiling the source code.

Analyze Optimization and Vectorization Reports


Add compiler flags -qopt-report and/or -vec-report to see what the compiler did to
your source code. This will report all the transformations applied to your code. It will also
highlight those code patterns that prevented successful optimization. Address them if you
have time left.
Here is a small example. Because the optimization report may be very long, Listing 1-2
only shows an excerpt from it. The example code contains several loop nests of seven loops.
The compiler found an OpenMP directive to parallelize the loop nest. It also recognized
that the overall loop nest was not optimal, and it automatically permuted some loops
to improve the situation for vectorization. Then it vectorized all inner-most loops while
leaving the outer-most loops as they are.

Listing 1-2. Example Optimization Report with the Most Interesting Fields Highlighted

$ ifort -O3 -qopenmp -qopt-report -qopt-report-file=stdout -c example.F90

Report from: Interprocedural optimizations [ipo]

[...]

OpenMP Construct at example.F90(8,7)
remark #15059: OpenMP DEFINED LOOP WAS PARALLELIZED
OpenMP Construct at example.F90(25,7)
remark #15059: OpenMP DEFINED LOOP WAS PARALLELIZED

[...]

LOOP BEGIN at example.F90(9,2)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)
remark #25448: Loopnest Interchanged : ( 1 2 3 4 ) --> ( 1 4 2 3 )
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)
remark #15018: loop was not vectorized: not inner loop

[...]

LOOP BEGIN at example.F90(15,8)
remark #25446: blocked by 125 (pre-vector)
remark #25444: unrolled and jammed by 4 (pre-vector)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(13,6)
remark #25446: blocked by 125 (pre-vector)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(14,7)
remark #25446: blocked by 128 (pre-vector)
remark #15003: PERMUTED LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at example.F90(14,7)
Remainder
remark #25460: Loop was not optimized
LOOP END
LOOP END
LOOP END

[...]

LOOP END
LOOP END
LOOP END
LOOP END
LOOP END

LOOP BEGIN at example.F90(26,2)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(29,5)
remark #25448: Loopnest Interchanged : ( 1 2 3 4 ) --> ( 1 3 2 4 )
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(29,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(29,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(29,5)
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(29,5)
remark #25446: blocked by 125 (pre-vector)
remark #25444: unrolled and jammed by 4 (pre-vector)
remark #15018: loop was not vectorized: not inner loop

[...]
LOOP END
LOOP END
LOOP END
LOOP END
LOOP END
LOOP END

Listing 1-3 shows the vectorization report for the example in Listing 1-2. As you can
see, the vectorization report contains the same information about vectorization as the
optimization report.

Listing 1-3. Example Vectorization Report with the Most Interesting Fields Highlighted

$ ifort -O3 -qopenmp -vec-report=2 -qopt-report-file=stdout -c example.F90

[...]

LOOP BEGIN at example.F90(9,2)


remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)


remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)


remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)


remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(12,5)


remark #15018: loop was not vectorized: not inner loop


LOOP BEGIN at example.F90(12,5)


remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(15,8)


remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(13,6)


remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(14,7)


remark #15003: PERMUTED LOOP WAS VECTORIZED
LOOP END

[...]

LOOP END
LOOP END

LOOP BEGIN at example.F90(15,8)


Remainder
remark #15018: loop was not vectorized: not inner loop

LOOP BEGIN at example.F90(13,6)


remark #15018: loop was not vectorized: not inner loop

[...]

LOOP BEGIN at example.F90(14,7)


remark #15003: PERMUTED LOOP WAS VECTORIZED
LOOP END

[...]

LOOP END
LOOP END
LOOP END

[...]

LOOP END
LOOP END
LOOP END
LOOP END
LOOP END

[...]


Use Interprocedural Optimization


Add the compiler flag -ipo to switch on interprocedural optimization. This will give the
compiler a holistic view of the program and open more optimization opportunities for the
program as a whole. Note that this will also increase the overall compilation time.
Runtime profiling can also increase the chances for the compiler to generate better
code. Profile-guided optimization requires a three-stage process. First, compile the
application with the compiler flag -prof-gen to instrument the application with profiling
code. Second, run the instrumented application with a typical dataset to produce a
meaningful profile. Third, feed the compiler with the profile (-prof-use) and let it
optimize the code.

Summary
Switching to Intel MPI and Intel Composer XE can help improve performance because
the two strive to optimally support Intel platforms and deliver good out-of-the-box (OOB)
performance. Tuning measures can further improve the situation. The next chapters will
reiterate the quick and dirty examples of this chapter and show you how to push the limits.

References
1. Intel Corporation, “Intel(R) MPI Library,” http://software.intel.com/en-us/intel-mpi-library.
2. Intel Corporation, “Intel(R) Composer XE Suites,” http://software.intel.com/en-us/intel-composer-xe.
3. Argonne National Laboratory, “MPICH: High-Performance Portable MPI,” www.mpich.org.
4. Ohio State University, “MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE,” http://mvapich.cse.ohio-state.edu/overview/mvapich2/.
5. International Business Machines Corporation, “IBM Parallel Environment,” www-03.ibm.com/systems/software/parallel/.
6. Intel Corporation, “Intel Fortran Composer XE 2013 - Documentation,” http://software.intel.com/articles/intel-fortran-composer-xe-documentation/.
7. The IPM Developers, “Integrated Performance Monitoring - IPM,” http://ipm-hpc.sourceforge.net/.
8. A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary, “HPL: A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers,” 10 September 2008, www.netlib.org/benchmark/hpl/.
9. Intel Corporation, “Intel MPI Library Reference Manual,” http://software.intel.com/en-us/node/500285.

Chapter 2

Overview of Platform
Architectures

In order to optimize software you need to understand hardware. In this chapter we give
you a brief overview of the typical system architectures found in high-performance
computing (HPC) today. We also introduce terminology that will be used throughout
the book.

Performance Metrics and Targets


The definition of optimization found in Merriam-Webster’s Collegiate Dictionary reads
as follows: “an act, process, or methodology of making something (as a design, system,
or decision) as fully perfect, functional, or effective as possible.”1 To become practically
applicable, this definition requires establishment of clear success criteria. These objective
criteria need to be based on quantifiable metrics and on well-defined standards of
measurement. We deal with the metrics in this chapter.

Latency, Throughput, Energy, and Power


Let us start with the most common class of metrics: those that are based on the total time
required to complete an action, for example, the time it takes for a car to drive from the
start to the finish on a race track, as shown in Figure 2-1. Execution (or wall-clock) time
is one of the most common measures of application performance: you measure the
application's runtime on a specific system and report it in seconds (or hours, or sometimes
days). In this context, the time required to complete an action is a typical latency metric.


Figure 2-1. Runtime: observed time interval between the start and the finish of a car on a
race track

The runtime, or the period of time from the start to the completion of an application,
is important because it tells you how long you need to wait for the results. In networking,
latency is the amount of time it takes a data packet to travel from the source to the
destination; it can also be referred to as the response time. For measurements inside the
processor, we often use the term instruction latency for the time from a machine
instruction entering the execution unit until the results of that instruction are available
(that is, written to the register file and ready to be used by subsequent instructions). In
more general terms, latency can be defined as the observed time interval between the start
of a process and its completion.
We can extend this class of metrics to a more general class of consumable resources.
Time is one kind of consumable resource, such as the time allocated for your job on a
supercomputer. Another important example of a consumable resource is the amount of
electrical energy required to complete your job, called energy to solution. The official
unit in which energy is measured is the joule, while in everyday life we more often use
watt-hours. One watt-hour is equal to 3600 joules.
The amount of energy consumption defines your electricity bill and is a very visible
item among operating expenses of major, high-performance computing facilities. It drives
demand for optimization of the energy to solution, in addition to the traditional efforts
to reduce the runtime, improve parallel efficiency, and so on. Energy optimization work
spans very different scales, from gigajoules (GJ, or 10^9 joules) consumed at the application
level to picojoules (pJ, or 10^-12 joules) per instruction.
One of the specific properties of the latency metrics is that they are additive, so that
they can be viewed as a cumulative sum of several latencies of subtasks. This means that
if the application has three subtasks following one after another, and these subtasks take
times T1, T2 and T3, respectively, then the total application runtime is Tapp = T1 + T2 + T3.
Other types of metrics describe the amount of work that can be completed by the
system per unit of time, or per unit of another consumable resource. One example of car
performance would be its speed, defined as the distance covered per unit of time, or its
fuel efficiency, defined as the distance covered per unit of fuel, such as miles per gallon.
We call these metrics throughput metrics. For example, the number of instructions per
second (IPS) executed by the processor, or the number of floating point operations per
second (FLOPS) are both throughput metrics. Other widely used metrics of this class are
memory bandwidth (reaching tens and hundreds of gigabytes per second these days),
and network interconnection throughput (in either bits per second or bytes per second).
The unit of power (watt) is also a throughput metric that is defined as energy flow per unit
of time, and is exactly equal to 1 joule per second.
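As a rough illustration of how energy, power, and time relate, the sketch below estimates energy to solution from an average power draw and a runtime, and converts joules to watt-hours. The function names and the 350 W / 2 hour figures are hypothetical values chosen for the example, not measurements.

```python
JOULES_PER_WATT_HOUR = 3600.0  # 1 Wh = 1 W sustained for 3600 s


def energy_to_solution_joules(avg_power_watts, runtime_seconds):
    """Energy (J) = average power (W, i.e. J/s) * time (s)."""
    return avg_power_watts * runtime_seconds


def joules_to_watt_hours(joules):
    """Convert an energy value from joules to watt-hours."""
    return joules / JOULES_PER_WATT_HOUR


# Hypothetical job: one node drawing 350 W on average for 2 hours.
energy_j = energy_to_solution_joules(350.0, 2 * 3600)
energy_wh = joules_to_watt_hours(energy_j)
print(f"{energy_j:.0f} J = {energy_wh:.0f} Wh")
```

Because energy is the product of power and time, a change that shortens the runtime reduces energy to solution only if it does not raise the average power by a larger factor.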


You may encounter situations where throughput is described as the inverse of
latency. This is correct only when both metrics describe the same process applied to the
same amount of work. In particular, for an application or kernel that takes one second to
complete 10^9 arithmetic operations on floating point numbers, it is correct to state that its
throughput is 1 GFLOPS (gigaFLOPS, or 10^9 FLOPS).
However, very often, especially in computer networks, latency is understood
as the time from the beginning of the packet shipment until the first data arrives at
the destination. In this context, latency will not be equal to the inverse value of the
throughput. To grasp why this happens, compare sending a very large amount of data
(say, 1 terabyte (TB), which is 10^12 bytes) using two different methods2:
1. Shipping with overnight express mail
2. Uploading via broadband Internet access
The overnight (24-hour) shipment of the 1 TB hard drive has good throughput but
lousy latency. The throughput is (1 × 10^12 × 8) bits / (24 × 60 × 60) seconds = about 92.6
million bits per second (bps), which is comparable to modern broadband networks. The
difference is that the overnight shipment bits are delayed for a day and then arrive all
at once, but the bits we send over the Internet start appearing almost immediately. We
would say that the network has much better latency, even though both methods have
approximately the same throughput when considered over the interval of one day.
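The arithmetic behind this comparison can be sketched in a few lines. The helper name `throughput_bps` and the 50 ms network latency are assumptions made for illustration; only the 1 TB payload and the 24-hour delivery time come from the text.

```python
def throughput_bps(num_bytes, seconds):
    """Average throughput in bits per second."""
    return num_bytes * 8 / seconds


TERABYTE = 10**12        # bytes
ONE_DAY = 24 * 60 * 60   # seconds

shipment_bps = throughput_bps(TERABYTE, ONE_DAY)
print(f"Overnight shipment: {shipment_bps / 1e6:.1f} million bits per second")

# Latency to the first bit: a full day for the parcel versus an assumed
# 50 ms for a broadband link (the 50 ms value is hypothetical).
shipment_latency_s = ONE_DAY
network_latency_s = 0.050
ratio = shipment_latency_s / network_latency_s
print(f"The first bit arrives about {ratio:.0f} times later by mail")
```

The two channels deliver a similar number of bits per day, yet their latencies differ by several orders of magnitude, which is why one metric cannot be derived from the other here.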
Although high-throughput systems may have low latency, there is no causal link.
Comparing the bandwidth and latency of GDDR5 (Graphics Double Data Rate, version 5)
and DDR3 (Double Data Rate, type 3) memory, one notices that systems with GDDR5
(such as Intel Xeon Phi coprocessors) deliver three to five times more bandwidth, while
the latency to access data (measured in an idle environment) is five to six times higher
than in systems with DDR3 memory.
Finally, a graph of latency versus load looks very different from a graph of throughput
versus load. As we will see later in this chapter, memory access latency goes up
exponentially as the load increases. Throughput will go up almost linearly at first, then
level out to become nearly flat when the physical capacity of the transport medium is
saturated. Simply by looking at a graph of test results and keeping those features in mind,
you can guess whether it is a latency graph or a throughput graph.
Another important concept and property of a system or process is its degree of
concurrency or parallelism. Concurrency (or degree of concurrency) is defined as the
number of work items that can potentially be performed simultaneously. In the example
illustrated by Figure 2-2, where three cars can race simultaneously, each on its own
track, we would say this system has concurrency of 3. In computation, an example of
concurrency would be the simultaneous execution of multiple, structurally different
application “threads” by a multicore processor. Presence of concurrency is an intrinsic
property of any modern high-performance system. Processes running on different
machines of a cluster form a common system that executes application code on multiple
machines at the same time. This, too, is an example of concurrency in action.


Figure 2-2. A system with the degree of concurrency equal to 3

Cantrill and Bonwick describe three fundamental ways of using concurrency to
improve application performance.3 At the same time, these three ways represent the
typical optimization targets for either latency or throughput metrics:
• Increase throughput: By executing multiple tasks concurrently,
the general system throughput can be increased.
• Reduce latency: A given amount of work is completed in shorter
time by dividing it into parts that can be completed concurrently.
• Hide latency: Multiple long-running tasks are executed in
parallel by the underlying system. This is particularly effective
when some tasks are blocked (for example, if they must wait
upon disk or network I/O operations), while others can proceed
independently.
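The third item, hiding latency, can be illustrated with a small experiment. This is a minimal sketch in which `time.sleep` stands in for a blocking disk or network operation; the task function and the 0.2 second delays are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_task(delay_s):
    """Stand-in for a task that is blocked waiting on disk or network I/O."""
    time.sleep(delay_s)
    return delay_s


delays = [0.2, 0.2, 0.2, 0.2]

# One after another: the waiting times add up.
start = time.perf_counter()
for d in delays:
    blocking_task(d)
sequential_s = time.perf_counter() - start

# Concurrently: while one task is blocked, the others proceed.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(delays)) as pool:
    results = list(pool.map(blocking_task, delays))
concurrent_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f} s, concurrent: {concurrent_s:.2f} s")
```

No individual task finishes sooner in the concurrent run; the gain comes from overlapping the periods during which each task would otherwise just wait.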

Peak Performance as the Ultimate Limit


Every time we talk about performance of an application running on a machine, we try to
compare it to the maximum attainable performance on that specific machine, or peak
performance of that machine. The ratio between the achieved (or measured) performance
and the peak performance gives the efficiency metric. This metric is often used to drive
the performance optimization, for an increase in efficiency will also lead to an increase in
performance according to the underlying metric. For example, efficiency for the wall-clock
time is the fraction of time that is spent doing useful work, while efficiency for throughput is
a measure of useful capacity utilization.
Consider the example of how to quantify efficiency for a network protocol. Network
protocols normally require each packet to contain a header and a footer. The actual data
transmitted in the packet is then the size of the packet minus the protocol overhead.
Therefore, efficiency of using the network, from the application point of view, is reduced
from the total utilization according to the size of the header and the footer. For Ethernet,
the frame payload size equals 1536 bytes. The TCP/IP header and footer take 40 bytes
extra. Hence, efficiency here is equal to 1536 / 1576 × 100, or 97.5 percent.
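The same calculation can be written as a small helper, which also makes it easy to see how efficiency changes with payload size. The function name is an assumption, and the 64-byte payload is a hypothetical value added for contrast; the 1536-byte payload and 40-byte overhead are the figures used in the text.

```python
def protocol_efficiency(payload_bytes, overhead_bytes):
    """Fraction of the transmitted bytes that carry application data."""
    return payload_bytes / (payload_bytes + overhead_bytes)


eff = protocol_efficiency(1536, 40)
print(f"{eff * 100:.1f} percent")

# The same 40-byte overhead weighs much more heavily on a small payload.
small = protocol_efficiency(64, 40)
print(f"{small * 100:.1f} percent")
```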
Understanding the limitations of maximum achievable performance is an important
step in guiding the optimization process: the limits are always there! These limits are
driven by physical properties of the available materials, maturity of the technology, or
(trivially) the cost. Particularly, the propagation of signals along the wires is limited by
the speed of light in the respective material. Thus, the latency for completing any work
using electronic equipment will always be greater than zero. In the same way, it is not
possible to build an infinitely wide highway, for its throughput will always be limited by
the number of lanes and their individual throughputs.


Scalability and Maximum Parallel Speedup


The ability to increase performance by using more resources in parallel (for example,
more processors) is called scalability. The basic approach in high-performance
computing is to use many computational resources in parallel to solve one problem, and
to add still more resources if higher performance is required. Scalability analysis indicates
how efficiently an application uses an increasing number of parallel computing
elements, such as cores, vector units, memory, or network connections.
Increase in performance before and after addition of the resources is called speedup.
When talking about throughput-related metrics, speedup is expressed as the ratio of the
throughput after addition of the resources versus the original throughput. For latency
metrics, speedup is the ratio between the original latency and the latency after addition of
the resources. This way speedup is always greater than 1.0 if performance improves. If the
ratio goes below 1.0, we call this negative speedup, or simply slowdown.
Amdahl’s Law, also known as Amdahl’s argument,4 is used to find the maximum
expected improvement for an entire application when only a part of the application is
improved. This law is often used in parallel computing to predict the theoretical maximum
speedup that can be achieved by adding multiple processors. In essence, Amdahl’s Law
says that speedup of a program using p processors in parallel is limited by the time needed
for the nonparallel fraction of the program (f ), according to the following formula:
\[
\text{Speedup} \le \frac{p}{1 + f \times (p - 1)}
\]

where f takes values between 0 and 1.
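One way to see where this bound comes from is the short derivation below. It assumes that the serial fraction f of the single-processor runtime stays serial and that the remaining work divides perfectly among p processors; the symbols T_1 and T_p (runtime on one and on p processors) are introduced here only for the derivation.

```latex
\begin{align*}
T_p &\ge f\,T_1 + \frac{(1 - f)\,T_1}{p}, \\
\text{Speedup} = \frac{T_1}{T_p}
  &\le \frac{1}{f + \dfrac{1 - f}{p}}
  = \frac{p}{1 + f\,(p - 1)}, \\
\lim_{p \to \infty} \frac{p}{1 + f\,(p - 1)} &= \frac{1}{f} \qquad (f > 0).
\end{align*}
```

The last line shows that, for any nonzero serial fraction, the bound approaches 1/f no matter how many processors are added.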


As an example, think about an application that needs 10 hours when running on
a single processor core, where a particular portion of the program takes two hours to
execute and cannot be made parallel (for instance, since it performs sequential I/O
operations). If the remaining 8 hours of the runtime can be efficiently parallelized, then
regardless of how many processors are devoted to the parallelized execution of this
program, the minimum execution time cannot be less than those critical 2 hours. Hence,
speedup is limited to at most five times (usually denoted as 5x). In reality, even this 5x
speedup goal is not attainable, since infinite parallelization of code is not possible for the
parallel part of the application. Figure 2-3 illustrates Amdahl’s law in action. If the parallel
component is made 50 times faster, then the maximum speedup with 20 percent of time
taken by the serial part will be equal to 4.63x.
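The numbers in this example can be reproduced with a short sketch of Amdahl's Law. The function name is an assumption; the serial fraction of 0.2 corresponds to the 2 serial hours out of 10 described above.

```python
def amdahl_speedup(serial_fraction, p):
    """Upper bound on speedup when the parallel part runs p times faster."""
    return p / (1.0 + serial_fraction * (p - 1))


f = 2 / 10  # 2 of the 10 hours cannot be parallelized

for p in (2, 8, 50, 1000):
    print(f"p = {p:4d}: speedup <= {amdahl_speedup(f, p):.2f}x")

limit = 1 / f  # value the bound approaches as p grows without limit
print(f"limit: {limit:.1f}x")
```

Going from 50 to 1000 processors moves the bound only a little closer to 5x, which is the practical message of the law: beyond a certain point the serial part dominates.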


Figure 2-3. Illustration of Amdahl’s Law

It may be depressing to realize that the maximum possible speedup will be limited
by something you can’t improve by adding more resources. Even so, consider the same
speedup problem from another angle: what happens if the amount of work in the
parallelizable part of the execution can be increased?
If the relative share of time taken by the serial portion of the application remains
unchanged with the increase of the workload size, there is no inherent speedup factor
available, and as illustrated in Figure 2-4 (left), Amdahl’s Law still works. However, John
Gustafson observed that there was significant speedup opportunity available if the serial
component shrank in size relative to the parallel part as the amount of data processed by
the application (and consequently the amount of computation) expanded.5

Figure 2-4. Illustration of Gustafson’s observation


This observation leads to two kinds of scalability metrics:


• Strong scaling: How performance varies with the number of
computing elements for a fixed total problem size. In strong
scaling, perfect scaling (i.e., when performance improves linearly)
is achieved when speedup is equal to the number of computing
elements involved.
• Weak scaling: How performance varies with the number of
computing elements for a fixed problem size per processor, and
additional computing elements are used to solve a larger total
problem. In the case of weak scaling, perfect scaling is achieved
if the runtime remains constant while the workload is increased
proportionally to the number of computing elements involved.
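Both kinds of scaling are often summarized as an efficiency between 0 and 1, where 1 means perfect scaling. The sketch below uses one common convention for these efficiencies; the function names and all timings are hypothetical.

```python
def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: achieved speedup t1/tp divided by ideal speedup p."""
    return (t1 / tp) / p


def weak_scaling_efficiency(t1, tp):
    """Fixed problem size per element: ideally the runtime stays at t1."""
    return t1 / tp


# Hypothetical timings in seconds.
# Strong scaling: the same problem takes 100 s on 1 element and 16 s on 8.
strong = strong_scaling_efficiency(t1=100.0, tp=16.0, p=8)
# Weak scaling: an 8 times larger problem on 8 elements takes 125 s instead of 100 s.
weak = weak_scaling_efficiency(t1=100.0, tp=125.0)

print(f"strong scaling efficiency: {strong:.2f}")
print(f"weak scaling efficiency:   {weak:.2f}")
```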

Bottlenecks and a Bit of Queuing Theory


Performance analysis is a process of identifying bottlenecks and removing them, with the
objective of increasing overall application performance. Certain parts of the application
that limit performance of the entire application are called performance bottlenecks. The
significance of the term bottleneck can be illustrated with the same car metaphor that we
have used before (see Figure 2-5). When there is a toll gate on the road that can process
only one car at a time, the rate at which cars will pass along the highway (that is, highway
throughput) is limited by the width of the toll gate, irrespective of how many more lanes are
on the road before and after it. In other words, the toll gate is a bottleneck. By increasing the
width of the toll gate, it is possible to increase the rate of cars on the highway.

Figure 2-5. Bottlenecks on the road are commonly known as traffic jams

As shown in Figure 2-5, bottlenecks can create traffic jams on the highway. Using
the terminology of queuing theory,6 we are talking about the toll gate as a single service
center. Customers (here, cars) arrive at this service center at a certain rate, called arrival
rate or workload intensity. There is also a certain duration of time required to collect money
from each car, which is referred to as service demand. For specific parameter values of
the workload intensity and the service demand, it is possible to analytically evaluate this
model and produce performance metrics, such as utilization (proportion of time when
the service center is busy), residence time (average time spent at the service center by a
customer), length of the queue (average number of customers waiting at the service center),
and throughput (rate at which customers depart from the service center).
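To make these metrics concrete, here is a minimal sketch that evaluates a single service center analytically. It assumes the simplest textbook case, random (Poisson) arrivals and exponentially distributed service times (an M/M/1 queue); the text does not specify the model, so the formulas and the toll gate numbers are illustrative assumptions only.

```python
def single_service_center(arrival_rate, service_demand):
    """Steady-state metrics for one service center under M/M/1 assumptions.

    arrival_rate   -- customers per second (workload intensity)
    service_demand -- seconds of service needed per customer
    """
    utilization = arrival_rate * service_demand
    if utilization >= 1.0:
        raise ValueError("unstable: customers arrive faster than they are served")
    residence_time = service_demand / (1.0 - utilization)
    # Average number of customers at the center, waiting plus being served.
    customers_at_center = utilization / (1.0 - utilization)
    # In steady state, everything that arrives also departs.
    throughput = arrival_rate
    return utilization, residence_time, customers_at_center, throughput


# Hypothetical toll gate: one car every 2 s on average, 1 s to collect the toll.
u, r, n, x = single_service_center(arrival_rate=0.5, service_demand=1.0)
print(f"utilization = {u:.2f}, residence time = {r:.2f} s")
print(f"cars at the gate = {n:.2f}, throughput = {x:.2f} cars/s")
```

Under these assumptions the residence time grows without bound as utilization approaches 1, which is the analytical counterpart of the traffic jam in Figure 2-5.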


This approach is widely used by queuing network modeling, where a computer
system is represented as a network of queues, that is, a collection of service centers that
represent system resources and customers who represent users or transactions. This
model provides a framework for gathering, organizing, evaluating, and understanding
information about the computer system, as well as for identifying possible bottlenecks
and testing ideas for system improvement. Such models are widely used for quantitative
analysis during computer system design and the application development process.

Roofline Model
Amdahl’s law and the queuing network models both offer “bound and bottleneck
analysis,” and they work quite well in many cases. However, both complexity and the
level of concurrency of modern high-performance systems keep increasing. Indeed, even
smartphones today have complex multicore chips with pipelines, caches, superscalar
instruction issue, and out-of-order execution, while the applications increasingly use
tasks and threads with asynchronous communication between them. Quantitative
queuing network models that simulate behavior of very complex applications on modern
multicore and heterogeneous systems have become very complex. At the same time, the
speed of microprocessor development has outpaced the speed of the memory evolution;
and in most cases, specifically in high-performance computing, the bandwidth of the
memory subsystem is the main bottleneck.
In search of a simplified model that would relate processor performance to the
off-chip memory traffic, Williams, Waterman, and Patterson observed that “the
Roofline [model] sets an upper bound on performance of a kernel depending on the
kernel’s operational intensity.”7 The Roofline model subsumes two platform-specific
ceilings in one single graph: floating-point performance and memory bandwidth. The
model, despite its apparent simplicity, provides an insightful visualization of the system
bottlenecks. Peak floating point and memory throughput performances can usually be
found from the architecture specifications. Alternatively, it is possible to find sustained
memory performance by running the STREAM benchmark.8
Figure 2-6 shows a roofline plot for a platform with peak performance P = 518.4
GFLOPS (such as a dual-socket server with Intel Xeon E5-2697 v2 processors) and
bandwidth B = 101 GB/s (gigabytes per second) attainable with the STREAM TRIAD
benchmark on this system.
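The peak figure can be reproduced from a few hardware parameters. The values below (12 cores per socket, a 2.7 GHz nominal clock, and 8 double-precision floating point operations per cycle per core) are assumptions about this processor that are not stated in the text; treat the sketch as a plausibility check rather than a specification.

```python
sockets = 2
cores_per_socket = 12   # assumed core count for the Intel Xeon E5-2697 v2
clock_ghz = 2.7         # assumed nominal clock frequency
flops_per_cycle = 8     # assumed: one 4-wide add plus one 4-wide multiply per cycle

peak_gflops = sockets * cores_per_socket * clock_ghz * flops_per_cycle
print(f"Peak: {peak_gflops:.1f} GFLOPS")
```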


Figure 2-6. Roofline model for dual Intel Xeon E5-2697 v2 server with DDR3-1866 memory

The horizontal line shows peak performance of the computer. This is a hardware
limit for this server. The X-axis represents amount of work (in number of floating point
operations, or Flops) done for every byte of data coming from memory: Flops/byte (here,
“Flops” stands for the plural of “Flop”, the number of floating point operations, rather
than FLOPS, which is Flops per second). And the Y-axis represents gigaFLOPS (10^9
FLOPS), which is a throughput metric showing the number of floating point operations
executed every second (Flops/second, or FLOPS). With that, taking into account that
bytes/second = (Flops/second) / (Flops/byte), the memory throughput metric
gigabytes/second is represented by a line of unit slope in Figure 2-6. Thus, the slanted
line shows the maximum floating point performance that the memory subsystem can
support for the given operational intensity. The following formula drives the two
performance limits in the graph shown in Figure 2-6:
\[
\text{Attainable performance [GFLOPS]} = \min
\begin{cases}
\text{Peak floating point performance}, \\
\text{Peak memory bandwidth} \times \text{Operational intensity}
\end{cases}
\]
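The roofline bound is simple enough to evaluate directly. The sketch below uses the peak performance (518.4 GFLOPS) and STREAM TRIAD bandwidth (101 GB/s) quoted for this platform; the function and variable names, and the sample intensities other than 6 Flops/byte, are assumptions made for illustration.

```python
PEAK_GFLOPS = 518.4   # P, peak floating point performance from the text
STREAM_GBPS = 101.0   # B, sustained memory bandwidth from the text


def attainable_gflops(operational_intensity, peak=PEAK_GFLOPS, bandwidth=STREAM_GBPS):
    """Roofline bound for a kernel with the given intensity in Flops/byte."""
    return min(peak, bandwidth * operational_intensity)


# Intensity at which the slanted and the horizontal ceilings meet.
ridge_point = PEAK_GFLOPS / STREAM_GBPS

for intensity in (0.5, 1.0, ridge_point, 6.0):
    bound = attainable_gflops(intensity)
    limiter = "memory" if intensity < ridge_point else "compute"
    print(f"{intensity:5.2f} Flops/byte -> {bound:6.1f} GFLOPS ({limiter} bound)")
```

Kernels to the left of the ridge point (about 5.1 Flops/byte on this platform) are limited by memory bandwidth, while kernels to the right of it are limited by peak floating point performance.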

The horizontal and diagonal lines form a kind of roofline, and this gives the
model its name. The roofline sets an upper bound on performance of a computational
kernel depending on its operational intensity. Improving performance of a kernel with
operational intensity of 6 Flops/byte (shown as the dotted line marked by “O” in the
plot) will hit the flat part of the roof, meaning that the kernel performance is ultimately

have stopped to bait.
After we had safely crossed the moat, she alternately grasped our
hands in a tumult of joy; named you, named me, but talked on the
never-failing theme of her Clement.
She rode behind Richardson.—I see she is much worse for the
journey; yet her burning eye and vehement spirits would persuade
me otherwise.
She kindly ceased her torturing questions concerning Clement,
imagining, by my abrupt answers, I was too ill to talk.—She says you
will heal me—for you have healed her.—Miss Ashburn, how ardently
she loves you!
I find you will receive this letter an hour before we come.—Won't
you thank, and praise me?—It is written with a shaking hand, and
throbbing temples. I know it would be difficult to keep Sibella from
mounting the same horse, if she were informed of the messenger.
When we enter the chaise, I will tell her what I have done.
A. MURDEN
LETTER XXXI
FROM THE SAME
TO
THE SAME
Why should I have the rage of distraction without the phrenzy? Dare
they tell me I am a lunatic?—She is gone, Miss Ashburn? I have lost
your treasure!—Some villain, lured by the vestiges of her
transcendent beauty, has taken her from me.—They have forced me
into a bed!—The barbarians confine me here.—Won't you order me
to be released?—Oh sweet Miss Ashburn, won't you tell them I must
be released?
Now I recollect I wanted to tell you all the particulars.——Ha! they
fade from me, and I dream again!——

Madam,
I keep the Blue Boar at Hipsley; and the poor unhappy gentleman
who wrote the above came to my house with his lady yesterday
morning. As long as ever I live, I shan't forget the poor gentleman's
ravings, when he discovered that his lady had ran away from him,
and he only came to his senses about an hour ago, when he ordered
us to send for you, and he wrote till his raving fit returned; and it
would melt your heart, madam, to hear how he is bemoaning
himself and calling by the kindest names the ungrateful wicked lady
who served him so badly.—I saw her jump into the chaise myself;
and she went willingly enough, though he won't believe it. My son
brings you this, madam; and I hope you will tell us what we must do
for the poor gentleman. From
Your ladyship's humble servant,
MARY HOLMES
LETTER XXXII
FROM CAROLINE ASHBURN
TO
ARTHUR MURDEN
Ah my friend, my beloved Murden, if an interval of memory, happily,
now is thine, read these lines which thy friend pens to thee in agony.
—She follows on the instant. You once demanded my consolations
and friendship, as a reparation for the mischievous error into which I
had led you.—Will you receive them now as such, against the
manifold mischiefs I have brought upon you?
Ah! what, what had I to do with secret escapes!—I, who exclaimed
from the beginning against Valmont's secresy, and prophesied its
fatal consequences!—Must I too conspire to make Sibella the victim
of secresy?—Unhappy sufferer! Yet more unhappy Caroline! She,
debarred the use of her judgment, erred only from mistake; I, alas!
have sinned against reason and conviction!
Clement, I suspect, has watched our footsteps. I fear he has secured
her.—Ah, miserable fate!
Console yourself, my dearest friend. It will please you to know that,
even before I come to you, I am going to B——, to send messengers
in search of Sibella. And if money and vigilance can bring us tidings
of our lost friend, I have the power of employing both.—Prepare to
receive me with calmness.—Already, I have the aggravated distress
of your and Sibella's feelings to endure.—I am pained beyond
description.
CAROLINE ASHBURN
LETTER XXXIII
FROM LORD FILMAR
TO
SIR WALTER BOYER
I can truly say, I neither sit, stand, walk, nor lie:—that is, the
complete I, body and soul together; for, let the former attempt its
mechanical motions as it will, the other in a quite opposite direction
is striking, curvetting, capering, twisting, tumbling, and playing more
tricks than any fantastical ape in nature.
Therefore, dear Walter, you must send me instantly a hundred
guineas. Yes really, if you want to quiet my conscience, you must
send me a hundred guineas.
Nothing will quiet my conscience but matrimony.
I cannot marry without a hundred guineas.—Ergo, if you don't send
me a hundred guineas, and I should die, and be ——, the sin will lay
at your door, and you will die and be —— likewise.
As I have much consideration for you, my dear Walter, and as I
know that people who have very weak heads, have sometimes also
very weak nerves, I would advise you to lay down my letter, unless
you are seated in some safe place, for, should your situation be
dangerous, and should the surprise I am preparing for you rob you
of apprehension, down you drop and leave me in utter despair—lest
your executors should refuse me the hundred guineas.
Well—are you settled to your satisfaction?—Here it comes like a
thunderbolt!
Miss Valmont is mine, and I am her's—your hundred guineas will buy
a parson and a prayer book; and then the L.7000 a year is mine
also.
You know, dear Walter, that resolved to obtain the heiress of
Valmont castle, I left London to return to Monkton Hall, with a heart
full of promise, and a head full of stratagem. Fortune, that dear blind
inconstant goddess, who formerly was almost within my grasp, now
dashes the projects of others to the ground, to give my wishes their
triumph. Some Merlin, with more potent spells than mine, broke the
enchanted castle, bore off the damsel, and, directed by fate and
fortune, brought her on the road, to meet me, to the very spot
where it was decreed his success should end, that mine might begin.
And begun it has.
Last night, I slept at B——; and intended breakfasting with Sir
Gilbert Monkton: this morning I ordered the driver to leave the high
road, and cross the country, by which means I should save six miles
of the journey. Griffiths had been unwell some days; and he now
appeared so cold, and so much indisposed, I thought it prudent to
give him a breakfast on the road. The postilion, by the luckiest of all
chances, drove up to a pretty little white-washed inn, that I shall
love dearly for six months to come.
The landlady, a curtsying civil woman, was mighty sorry she had not
a better room to receive my honour in; but her best parlour, she
said, was already taken up with a lady and gentleman, who had
arrived at seven o'clock in the morning. And she showed me into a
little place, which had two excellent properties, namely, perfect
cleanliness, and a good fire.
By this good fire I had sat five or perhaps ten minutes, when
Griffiths entered. 'My Lord! my Lord!' said he, and turned back to
shut the door; 'I have seen the strangest sight, my Lord! I have seen
a gentleman——' At that moment a tea-kettle was brought into the
room; and Griffiths grew downright pettish with the damsel who
bore the kettle, because she did not quit the room with sufficient
speed.
His information, Walter, amounted to thus much—that in the
passage he had seen the gentleman who occupied the landlady's
best parlour; and that this gentleman, of whom Griffiths had had a
very distinct view, certainly was, or Griffiths was much deceived, the
very identical spright who reminded some of us of our devotions in
the narrow passage of the west tower at Valmont castle. ''Tis
impossible!' said I.
'My Lord, 'tis true,' said Griffiths. 'I should know him among a
thousand. I know his eyes and nose as well as I know your's, my
Lord.'
This you will allow, Walter, was but a very vague sort of a
supposition to ground any belief upon; for, as eyes and noses are
the common lot of all mankind, it may happen now and then that
two or more may be greatly alike. Yet, so diligent is hope and
imagination, I could not persuade myself these eyes and this nose
had any owner but the spright of the castle.—It was Miss Valmont
and her hermit, my fancy said. I blessed my stars. I cursed my stars.
I wondered how and why they should come hither. Then, I
remembered, that fancy, though sometimes a prophetess, is rarely
an oracle, and I thought it might not be Miss Valmont and her
hermit.—I consulted much with Griffiths; and, at length, had
recourse to the waiter, a dapper shabby-coated fellow with a wooden
leg.
They came, he said, on horseback before seven o'clock. A man, who
conducted them, did not alight. They were impatient to be gone.
They waited for a chaise. They had ordered a breakfast which
neither of them had tasted. The lady did not appear, he thought,
equipped for travelling. The gentleman was melancholy, and the lady
restless and agitated.
Miss Valmont: whispered I to myself.
'They are a fine couple,' said the waiter.
I asked if he thought they were a married pair. He answered, he was
sure she must be a married lady. I enquired if the gentleman
seemed to be very fond of her.
'Not at all,' replied the waiter. 'The gentleman sits writing, Sir, with
his back to her. She walks about the room, muttering to herself.
When I carried in the breakfast, he leaned his head against the wall,
and groaned with his eyes shut.'
It cannot be Miss Valmont and her hermit, thought I.
'Is the lady handsome?' I asked.
The waiter thought she was too pale to be very handsome; but he
added that in all his born days he never beheld such a head of hair.
'Of what colour is it?' I asked.
It was neither black, nor brown, nor as red as Jenny's;—he thought
it was not any colour, but it shone as if gold threads were among it.
Miss Valmont: whispered my forward heart. I rose and walked hastily
across the room.
'What did they talk about?' said I.
'They don't talk at all, Sir. The poor gentleman seems very bad; and
as I told your honor before, she walks about muttering. When they
first came, as I was lighting a fire for them, the lady pulled off her
hat just as if she was in a passion, and then she shook her fine
waving locks, as though she was wond'rous proud of them. And she
said her head ached with that—cumbrance; and she said something
more about customs and cumbrances, but I forget what, your honor.
While she talked, the gentleman looked so kind and pretty at her it
did my heart good to see him; but he is either very ill or very
whimmy, for, immediately, while she took off her cloak, he laid his
head on the table with his face downward and sighed as if his heart
was breaking.'
I asked if he had heard her call him by any name, and the waiter
replied he had heard her twice name somebody as she walked about
the room; but to my great disappointment at that moment, his
memory had not retained the name.
At this part of our conference, the parlour bell rang, and the waiter
disappeared; not, though, till I had sealed him mine by a bribe, and
given him orders to return instantly. However, Griffiths, who was
most zealous on this occasion, thought proper to follow him.
Fortunate was it that he did; for, my waiter, dull at a hint, had
received a letter from the guest in the parlour, which, without
consulting my will and pleasure, he was quietly bearing to a courier
ready mounted and waiting for it at the inn door. Griffiths with a
careless air took the letter from his hand. It was addressed, Walter,
to—Miss Ashburn.
I began to stalk, to exclaim, to ejaculate. Go on, Filmar, cried I, and
prosper! Henceforward be plot and stratagem sanctified! for Miss
Ashburn deigns to plot.
Griffiths prudently reminded me that it would be quite as well at
present to think of Miss Valmont, and leave Miss Ashburn alone till
another opportunity.
'Right,' said I.
''Tis folly! 'Tis madness, but to think for a moment of such a
project!', said I, ten minutes after, and turning myself half round in
my chair, throwing one arm across its back and one leg over the
other. 'No! no! I'll have nothing to do with it!' and I fell to shaking the
uppermost leg furiously. 'It might be very easily managed though,'
said Griffiths; 'and then your Lordship——'
'Would have nothing to do but to digest Montgomery's bullets and
Miss Ashburn's harder words.—Oh that Miss Ashburn can find words
to lash like scorpion's stings! Say no more of it, Griffiths, I have
given her up.'
'As you please, my Lord.'
'Ay! ay!' muttered I to myself. 'Let her go to her Montgomery! There
are men who perhaps are worthy of being loved as himself, and
might perhaps be more capable of constancy. There are other
women too in the world, thank heaven!—Strange,' continued I, 'that
Miss Ashburn with her understanding, and who must know the
imbecility of Montgomery's love, could dream of joining in any plot
whose object was to bear Miss Valmont to Montgomery!' For, Walter,
I had by this time concluded that the quondam hermit was some
righteous go-between of Miss Valmont and her lover; and I felt
inclined to be mortally offended with her, because Montgomery had
so well concealed from my penetration their mutual intelligence. I
shifted to the other side of my seat; and I did not sigh; but I blew
my breath from me with much more force than usual.
I mused during the greater part of an hour. 'Your chaise is waiting,
my Lord,' said Griffiths; 'and, as you have quite done with this affair,
if your Lordship thinks proper it is as well not to keep the horses in
the cold.—Well, I must say 'twas a fine opportunity!'
'Do you think so, Griffiths?' said I mildly.
The rogue exultingly smiled; and, to change my wavering into
downright resolution, he recapitulated all the probabilities that
awaited my attempt, and noticed the trifling hazard that would
accrue (provided I adopted his plan for the purpose), should the
attempt fail: nor did he forget an oblique glance or two at certain
prospects which he knew put no inconsiderable weight into the
balance.
'Away! away!' I cried, 'give the driver his directions; let him draw up
close to the door, before the other chaise; and let him be sure to
keep his chaise door open, but not the step down.'
Signifies it to thee, Walter, of what Griffiths' plan consisted? Surely
not. Nothing could be more easy than, at the instant of their
departure, to request a moment's conversation with the gentleman;
nothing more simple than to invent a tale of a pursuit, to be
delivered into the attentive ear of Miss Valmont's guide: nothing
could promise fairer, and surely never did fulfilment better follow
promise.
Our casement looked upon a garden, and there the melancholy
conductor of Miss Valmont came to walk for a few minutes. There
needed no screen to hide us from his glance. His arms were folded,
and his eyes intensely fixed on the earth. His hat shaded the upper
part of his face, so that I could see no part of his said resemblance
to the bearded youth of the armoury; but I observed with pleasure
and thrilling expectation that he and I were nearly of one stature,
both booted, and both wearing dark blue great coats. This only
difference existed, one of his capes he had drawn round his chin, all
mine lay on my shoulders.—Walter, I could button mine up on
occasion.
George had ridden my grey mare from town. I felt no way inclined to
make him a party in the transaction; and I also wanted the mare for
Griffiths. I therefore ordered him to return to B——, and take a
stage for London, waiting there my further orders. Griffiths saw him
mount a post horse, and led the grey mare round the house, and
fastened her to some rails in readiness.
It was exactly two hours and one quarter from the time of our
arrival, before their chaise came to the door. The horses were to
have a feed in their harness; the guests were impatient to be gone. I
shuddered: and, as I traversed our little room, the echo of my
footsteps seemed to be blabbing tell tales. I shall never, Walter,
know such another minute as that. All in future will be the dull
uniformity of peace and plenty.
It was done. The waiter delivered Griffiths' message in the best
parlour. I, from a distant peeping station, saw the gentleman walk to
our room. I heard the door shut—the waiter stump away. Thrilling,
throbbing with hope and fear, I walked up the passage to their
parlour. Wrapped in her cloak, the hood drawn over her head, her
hat in her hand, stood the fair expectant. 'O come,' said she, 'do let
us hasten!' The day was gloomy, the passage was dark, I had drawn
up my cape and drawn down my hat. My hand took her's. She
tripped along. No creature was in sight. I caught her up in my arms,
lifted her into the chaise, and we whirled off, just as the landlady
came bustling up to the door.
I had my cue of silence and reserve in the intelligence I had received
from the waiter. During the first three miles, I neither spoke nor
looked up. She, the while, clasping her hands and muttering, as the
waiter called it. I heard her pronounce the names of Miss Ashburn,
of Montgomery, and of some one else. For three miles, I say, we
interchanged not one word: then, Walter, the first word betrayed
me.
And now what a list of sobs, tears, screams, prayers, and
lamentations you expect! I have not one for you. She sighed,
indeed, and a few drops forced a reluctant way; but she neither
prayed, threatened, nor lamented. She demanded her liberty. She
reasoned for her liberty; reasoned with a firmness collected, vigilant,
manly, let me say. She remembers seeing me in the castle, and
takes me for her uncle's agent. In truth, Walter, I suffer her to think
it still; for I do not find, when carefully examined, that my own
character and motives in this business possess much to recommend
them.
In a little glen, between two hills of which the barrenness of one
frowns on the cultivation of the other, stands a farm, embosomed
hid in secresy and solitude. No traveller eyes it from the distant
heath. No horses, save its own, leave the print of their hoofs at its
entrance. But even more than usual gloom and dulness now reigns
around it. The lively whistle of the ploughman and hind no longer
chear the echoes of the hill. The farm yard is emptied of its gabbling
tenants. The master is dead, the stock sold, the tenants discharged,
and one solitary daughter, with one solitary female domestic alone,
remains to guard the house till quarter day shall yield it to a new
tenant.
'Tis neither fit employment for my time to relate, nor for your's to
read, the trifling adventures by which Griffiths became acquainted
with this fair daughter, her circumstances and abode; nor how he
wooed and won her love during our residence at Monkton Hall. At
Griffiths' instigation, hither I brought Miss Valmont; and here, till
your cash arrives, as in a place of trust and safety, do I mean to
keep my treasure, although I am little more than three leagues
distant from Monkton Hall, and scarcely four from Valmont castle.
A less ready imagination than even thine, Walter, might picture to
itself the manner in which Griffiths deluded Miss Valmont's knight-
errant with a tale of pursuit and discovery. The youth checked his
surprise, and renewed his vigour. He hastened to secure his lovely
ward; and Griffiths, mean while, stole round the inn, mounted the
grey mare, and was out of sight and sound of the consequences.
I hear her walking. A slight partition divides her chamber from mine.
No more of those deep-drawn sighs, my fair one! I thank heaven I
am not an agent of Valmont's neither. He must have used her
cruelly. She is excessively pale; and strangely altered.
I stand, Walter, the watchful sentinel of her chamber door, which I
presume not to enter. Till I had her in my possession, my thoughts,
in gadding after the enterprise, possessed all the saucy gaiety which
youth and untamed spirits could impart. Nay, when I began to write
this letter, they wore their natural character. I must shift my station
from this room. Those deep deep sighs will undo me! Hasten, dear
Walter, make the wings of speed thy messengers to bear to me a
hundred guineas, that we may fly to the land of blessings ere I
forget that her cupids have golden headed arrows.—Hem!—Seven
thousand per annum—O 'tis an elixir to chear the fainting spirits!
And now, as sure as I have possession of the rich and beauteous
prize after which I have so long yearned, so sure will I recompense
her present uneasiness by a life of tenderness, attention, and, to the
best of my present belief, of unabated constancy.
But marry me she must and shall, by G—d!
FILMAR
LETTER XXXIV
FROM CAROLINE ASHBURN
TO
LADY BARLOWE
Dear Madam,
By a strange concurrence of accidents I am at present attending Mr.
Murden, who during many days has lain dangerously ill in a small
country inn nine miles from Valmont castle. I must leave it to your
prudence to acquaint Sir Thomas Barlowe (to whom I know it will be
most distressing tidings) that his nephew is in danger, but it is
necessary that Sir Thomas should know it immediately, for I have
made preparations for bringing Mr. Murden to London, that he may
have better accommodation and better advice. Though I speak of
advice, I dare not encourage any hope in Sir Thomas, for I have
watched the progress of his nephew's disorder, and I believe he is
only lingering—abide he cannot.
Sir Thomas Barlowe loved this young man as a son; and, to receive
him scarcely a shadow of his former self, will create distressing
emotions. Yet, I beseech you to urge Sir Thomas carefully to avoid
any strong expressions of sorrow when his nephew arrives, for I
have the grief to tell you that Mr. Murden's reason is shaken: and
dreadful paroxysms may follow the slightest agitation.
Nought but the power I have long laboured to obtain and have in
part obtained over my sensations could have preserved any degree
of fortitude in me under the most trying events of my life, events
which have lately befallen Miss Valmont and Mr. Murden. On them I
had bestowed the warmest tribute of my affections. In the
enjoyment of their virtues and happiness, I expected daily to
augment my own. But, alas! it is gone; and my wretched hopes still
wear their beautiful and alluring form while sinking in
disappointment.
I am aware, Madam, that Mr. Murden's misfortune cannot create
more concern in your breast than the circumstance of my being with
him will raise wonder and curiosity; nor have I any other than a full
intention of making you acquainted with the circumstances that
drew us both hither, whose sad termination has operated so fatally
on Mr. Murden. But I am obliged to defer the relation till our arrival
in town, both on account of its length, of the preparations I am
making for Mr. Murden's ease and safety on the journey, and the
continual anxiety of watchfulness which possesses me for the sake
of Miss Valmont, to whom I have been unhappily the cause of evils
possibly worse than that which has befallen Mr. Murden.
I cannot name the day when you and Sir Thomas may expect us, for
the time consumed in the journey must be regulated by the
abatement or increase of Mr. Murden's disorder. He shall travel in a
litter; and I hope it is unnecessary for me to assure Sir Thomas
nothing shall be wanting to his accommodation that I have means to
procure.
I remain your Ladyship's well wisher and servant,
CAROLINE ASHBURN
LETTER XXXV
FROM CAROLINE ASHBURN
TO
GEORGE VALMONT
Sir,
By the messenger of mine, who, on his search for my lost friend,
came to your gates a few days since, you were informed that it was
through my means Sibella escaped from your castle; and, however
stern may be the anger you entertain against me, be assured, Sir, it
cannot exceed the vehemence of that self-reproach and sorrow
which now assail me, for having been the contriver of so
unjustifiable an undertaking.
I send you, Sir, a pacquet containing all the letters I have received
from Sibella, and also the letters that have passed between Mr.
Murden and myself. I lay them before you, with the confidence that
you will afford them a patient and temperate perusal; for I think
they will serve to convince you, as they have already convinced me
by the unfortunate event to which they have led, that, however
plausible and even necessary in appearance, yet artifice and secresy
are dangerous vicious tools.
Your secrets were the preparatory step to the errors of Clement and
Sibella. Had Sibella never departed from strict truth and sincerity,
she had never formed her rash engagement with Clement. Had
Murden never (with his dangerous refinements of fancy) longed
secretly to view this rare child of seclusion, he had not bartered his
life and happiness for a sigh. And lastly, had I not given way to the
fatal mistake that secresy could repair the inability of reason, I had,
instead of availing myself of the ruin on the rock, ere now perhaps
released Sibella by convincing you. And we had all been
comparatively happy.
Murden's unfinished letter from the village of Hipsley will show you
his deplorable situation, and all that we know concerning the loss of
our Sibella.
I have six agents employed to discover her. But they wander blindly,
for I have neither trace, nor supposition, to guide them. What can I
do, Sir? if you have any advice to offer, I hope you will not withhold
it, from animosity to me. Excessively do I love the friend I have
helped to sacrifice, yet I can readily and sincerely forgive you the
errors of your conduct towards her. Oh then, Sir, pardon mine, and
in pity to the anxiety of my heart aid me with your advice and
assistance.
I do not even hate Mr. Montgomery; though I do despise him
altogether. You suspected him of taking Sibella from the castle. I
suspected him of stealing her from Mr. Murden. He was otherwise
employed.
I arrived in town, with my poor patient under my protection,
yesterday evening, and resigned Mr. Murden to the care of his uncle,
Sir Thomas Barlowe. When I drove up to my mother's door, I found
it more than usually crowded with carriages and servants; hung
upon the pillars; and, when several of my mother's footmen stepped
from among the crowd, I perceived they were in new liveries
adorned in the highest stile of elegant expence. Though it was
impossible not to notice the uncommon glare of splendor that
saluted my eyes, yet our changes have always been so various and
profuse, I never thought of enquiring into the cause of the present.
Unfitted by my dress, but still more by weariness of limbs and
depression of mind, to encounter company, I retired to my chamber
and to bed.
This morning my maid attended me; and, with the natural hesitation
of good nature in relating disagreeable tidings, she informed me—
Mr. Montgomery was married to my mother.
Sir, it is the fact. Last Saturday, my mother became the bride of your
son; and the parade I witnessed last night was to do honour to the
first complimentors of this extraordinary hymeneal.
The tidings stunned me, for I was no way prepared from the
conduct of either to expect such an event. Uninvited and assuredly
unwelcome, I visited their apartment at the hour of breakfast, and my
mother collected the utmost of her haughtiness and Mr. Montgomery
his gay indifference, to repel the reproaches they expected I should
be prompted to bestow on them. But, Sir, they mistook me. I went
only to deliver to them a plain history of the mischiefs I have heaped
on Mr. Murden and my Sibella, to remind them how early, and, alas!
how severe a punishment has followed my deviation from rectitude.
I saw Montgomery's countenance become pale and ghastly. It was,
Sir, when I spoke of Miss Valmont's independent fortune. Then, I
believe, all the force of his situation was present with him. May it
often recur, and be the preservative against future follies.
Allow me, Sir, to say a word or two of him who most loved your
niece and best deserved her. Mr. Murden intruded on your domain,
and destroyed some of your unripe projects; yet I persuade myself
you will feel a pity for his misfortunes. His life pays the forfeiture of
his curiosity and secresy. A romantic love of Miss Valmont sapped its
foundation, and his nights of watching amidst the chilling damps of
the Ruin hastened the progress of its destruction. Sibella's
unaccountable escape from him at a time when his high toned
feelings were wrought upon, in a way that I cannot express, by the
alteration in her person, drove him to madness. Then it was that I
saw him who once possessed every advantage of manly grace and
beauty changed to a living skeleton, whose eyes starting from their
sockets glanced around with wild horror and insanity. Oh, Sir, it was
indeed a scene that called forth all my fortitude!
As his delirium had no mischievous tendencies, it was judged better
to remove him to London; and whether change of air and place had
the salutary effect, or the delirium had exhausted its force I know
not, but he became perfectly restored to reason before we reached
London. That restoration was almost beyond my hopes; and there
hope rests, it dares not presume further. The most certain
indications of speedy dissolution now appear; and all my time must
be given to the endeavour of tracing my beloved Sibella, and
consoling the anxious Murden for her loss. On his own account,
consolation pains him. All his wishes centre in death; and the
irrevocable union will soon take place.
Will you be kind enough to inform me of the name of Sibella's other
guardian?—Adieu, Sir, may that peace which is only to be purchased
by rectitude become an inmate of your abode.
CAROLINE ASHBURN
LETTER XXXVI
FROM LORD FILMAR
TO
SIR WALTER BOYER
Faith, Walter, I have secured a rich prize, indeed. Hear but its
estimate.
In the first place, a very lovely and adorable woman.
In the second, a fine estate.
In the third,——an heir (in embrio) to inherit it.
True, by the Gods!—Nevertheless, stop your rash conclusions, for I
have heard her whole story, therefore I tell you that Miss Ashburn is
an angel, Mr. Murden a fine fellow, Mr. Valmont an idiot, Sibella a
saint, and Montgomery—a scoundrel: though on my soul she talked
so movingly of his never fading faith I could not for my life persuade
myself to tell her my true opinion of him.
From the little she knows of Murden, (her hermit and deliverer) I
long to know more. I burn to tell you of her wonderful escape, of
the marvellous Ruin on the rock, but I have resolved to wave
explanations till I come.—I charge you, by your friendship, breathe
not a whisper of the adventure till you see me. I am going to restore
her to her friends; her eloquence did part, but truly her condition did
more.—I never bargained to pay off such a mortgage. I could love
her dearly; but then you know my name is Filmar, and as a Lord I
am bound in duty to love and cherish no son but a son of my own
begetting.