GPU Parallel Program
Development Using CUDA
Chapman & Hall/CRC
Computational Science Series
SERIES EDITOR

Horst Simon
Deputy Director
Lawrence Berkeley National Laboratory
Berkeley, California, U.S.A.

PUBLISHED TITLES

COMBINATORIAL SCIENTIFIC COMPUTING
Edited by Uwe Naumann and Olaf Schenk
CONTEMPORARY HIGH PERFORMANCE COMPUTING: FROM PETASCALE
TOWARD EXASCALE
Edited by Jeffrey S. Vetter
CONTEMPORARY HIGH PERFORMANCE COMPUTING: FROM PETASCALE
TOWARD EXASCALE, VOLUME TWO
Edited by Jeffrey S. Vetter
DATA-INTENSIVE SCIENCE
Edited by Terence Critchlow and Kerstin Kleese van Dam
ELEMENTS OF PARALLEL COMPUTING
Eric Aubanel
THE END OF ERROR: UNUM COMPUTING
John L. Gustafson
EXASCALE SCIENTIFIC APPLICATIONS: SCALABILITY AND
PERFORMANCE PORTABILITY
Edited by Tjerk P. Straatsma, Katerina B. Antypas, and Timothy J. Williams
FROM ACTION SYSTEMS TO DISTRIBUTED SYSTEMS: THE REFINEMENT APPROACH
Edited by Luigia Petre and Emil Sekerinski
FUNDAMENTALS OF MULTICORE SOFTWARE DEVELOPMENT
Edited by Victor Pankratius, Ali-Reza Adl-Tabatabai, and Walter Tichy
FUNDAMENTALS OF PARALLEL MULTICORE ARCHITECTURE
Yan Solihin
THE GREEN COMPUTING BOOK: TACKLING ENERGY EFFICIENCY AT LARGE SCALE
Edited by Wu-chun Feng
GRID COMPUTING: TECHNIQUES AND APPLICATIONS
Barry Wilkinson
GPU PARALLEL PROGRAM DEVELOPMENT USING CUDA
Tolga Soyata
HIGH PERFORMANCE COMPUTING: PROGRAMMING AND APPLICATIONS
John Levesque with Gene Wagenbreth
HIGH PERFORMANCE PARALLEL I/O
Prabhat and Quincey Koziol
HIGH PERFORMANCE VISUALIZATION:
ENABLING EXTREME-SCALE SCIENTIFIC INSIGHT
Edited by E. Wes Bethel, Hank Childs, and Charles Hansen
INDUSTRIAL APPLICATIONS OF HIGH-PERFORMANCE COMPUTING:
BEST GLOBAL PRACTICES
Edited by Anwar Osseyran and Merle Giles
INTRODUCTION TO COMPUTATIONAL MODELING USING C AND
OPEN-SOURCE TOOLS
José M Garrido
INTRODUCTION TO CONCURRENCY IN PROGRAMMING LANGUAGES
Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen
INTRODUCTION TO ELEMENTARY COMPUTATIONAL MODELING: ESSENTIAL
CONCEPTS, PRINCIPLES, AND PROBLEM SOLVING
José M. Garrido
INTRODUCTION TO HIGH PERFORMANCE COMPUTING FOR SCIENTISTS
AND ENGINEERS
Georg Hager and Gerhard Wellein
INTRODUCTION TO MODELING AND SIMULATION WITH MATLAB® AND PYTHON
Steven I. Gordon and Brian Guilfoos
INTRODUCTION TO REVERSIBLE COMPUTING
Kalyan S. Perumalla
INTRODUCTION TO SCHEDULING
Yves Robert and Frédéric Vivien
INTRODUCTION TO THE SIMULATION OF DYNAMICS USING SIMULINK®
Michael A. Gray
PEER-TO-PEER COMPUTING: APPLICATIONS, ARCHITECTURE, PROTOCOLS,
AND CHALLENGES
Yu-Kwong Ricky Kwok
PERFORMANCE TUNING OF SCIENTIFIC APPLICATIONS
Edited by David Bailey, Robert Lucas, and Samuel Williams
PETASCALE COMPUTING: ALGORITHMS AND APPLICATIONS
Edited by David A. Bader
PROCESS ALGEBRA FOR PARALLEL AND DISTRIBUTED PROCESSING
Edited by Michael Alexander and William Gardner
PROGRAMMING FOR HYBRID MULTI/MANY-CORE MPP SYSTEMS
John Levesque and Aaron Vose
SCIENTIFIC DATA MANAGEMENT: CHALLENGES, TECHNOLOGY, AND DEPLOYMENT
Edited by Arie Shoshani and Doron Rotem
SOFTWARE ENGINEERING FOR SCIENCE
Edited by Jeffrey C. Carver, Neil P. Chue Hong, and George K. Thiruvathukal
GPU Parallel Program
Development Using CUDA

Tolga Soyata
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-4987-5075-2 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize
to copyright holders if permission to publish in this form has not been obtained. If any copyright material
has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
——————————————————————————————————————————————–
Library of Congress Cataloging-in-Publication Data
——————————————————————————————————————————————–
Names: Soyata, Tolga, 1967- author.
Title: GPU parallel program development using CUDA
/ by Tolga Soyata.
Description: Boca Raton, Florida : CRC Press, [2018] | Includes bibliographical
references and index.
Identifiers: LCCN 2017043292 | ISBN 9781498750752 (hardback) |
ISBN 9781315368290 (e-book)
Subjects: LCSH: Parallel programming (Computer science) | CUDA (Computer architecture) |
Graphics processing units–Programming.
Classification: LCC QA76.642.S67 2018 | DDC 005.2/75–dc23
LC record available at https://lccn.loc.gov/2017043292
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at

http://www.crcpress.com
To my wife Eileen
and my step-children Katherine, Andrew, and Eric.
Contents

List of Figures xxiii

List of Tables xxix
Preface xxxiii
About the Author xxxv

Part I Understanding CPU Parallelism

Chapter 1 Introduction to CPU Parallel Programming 3

1.1 EVOLUTION OF PARALLEL PROGRAMMING 3

1.2 MORE CORES, MORE PARALLELISM 4
1.3 CORES VERSUS THREADS 5
1.3.1 More Threads or More Cores to Parallelize? 5
1.3.2 Influence of Core Resource Sharing 7
1.3.3 Influence of Memory Resource Sharing 7
1.4 OUR FIRST SERIAL PROGRAM 8
1.4.1 Understanding Data Transfer Speeds 8
1.4.2 The main() Function in imflip.c 10
1.4.3 Flipping Rows Vertically: FlipImageV() 11
1.4.4 Flipping Columns Horizontally: FlipImageH() 12
1.5 WRITING, COMPILING, RUNNING OUR PROGRAMS 13
1.5.1 Choosing an Editor and a Compiler 13
1.5.2 Developing in Windows 7, 8, and Windows 10 Platforms 13
1.5.3 Developing in a Mac Platform 15
1.5.4 Developing in a Unix Platform 15
1.6 CRASH COURSE ON UNIX 15
1.6.1 Unix Directory-Related Commands 15
1.6.2 Unix File-Related Commands 16
1.7 DEBUGGING YOUR PROGRAMS 19
1.7.1 gdb 20
1.7.2 Old School Debugging 21
1.7.3 valgrind 22

1.8 PERFORMANCE OF OUR FIRST SERIAL PROGRAM 23

1.8.1 Can We Estimate the Execution Time? 24
1.8.2 What Does the OS Do When Our Code Is Executing? 24
1.8.3 How Do We Parallelize It? 25
1.8.4 Thinking About the Resources 25

Chapter 2 Developing Our First Parallel CPU Program 27

2.1 OUR FIRST PARALLEL PROGRAM 27

2.1.1 The main() Function in imflipP.c 28
2.1.2 Timing the Execution 29
2.1.3 Split Code Listing for main() in imflipP.c 29
2.1.4 Thread Initialization 32
2.1.5 Thread Creation 32
2.1.6 Thread Launch/Execution 34
2.1.7 Thread Termination (Join) 35
2.1.8 Thread Task and Data Splitting 35
2.2 WORKING WITH BITMAP (BMP) FILES 37
2.2.1 BMP is a Non-Lossy/Uncompressed File Format 37
2.2.2 BMP Image File Format 38
2.2.3 Header File ImageStuff.h 39
2.2.4 Image Manipulation Routines in ImageStuff.c 40
2.3 TASK EXECUTION BY THREADS 42
2.3.1 Launching a Thread 43
2.3.2 Multithreaded Vertical Flip: MTFlipV() 45
2.3.3 Comparing FlipImageV() and MTFlipV() 48
2.3.4 Multithreaded Horizontal Flip: MTFlipH() 50
2.4 TESTING/TIMING THE MULTITHREADED CODE 51

Chapter 3 Improving Our First Parallel CPU Program 53

3.1 EFFECT OF THE “PROGRAMMER” ON PERFORMANCE 53

3.2 EFFECT OF THE “CPU” ON PERFORMANCE 54
3.2.1 In-Order versus Out-Of-Order Cores 55
3.2.2 Thin versus Thick Threads 57
3.3 PERFORMANCE OF IMFLIPP 57
3.4 EFFECT OF THE “OS” ON PERFORMANCE 58
3.4.1 Thread Creation 59
3.4.2 Thread Launch and Execution 59
3.4.3 Thread Status 60
3.4.4 Mapping Software Threads to Hardware Threads 61

3.4.5 Program Performance versus Launched Pthreads 62
3.5 IMPROVING IMFLIPP 63
3.5.1 Analyzing Memory Access Patterns in MTFlipH() 64
3.5.2 Multithreaded Memory Access of MTFlipH() 64
3.5.3 DRAM Access Rules of Thumb 66
3.6 IMFLIPPM: OBEYING DRAM RULES OF THUMB 67
3.6.1 Chaotic Memory Access Patterns of imflipP 67
3.6.2 Improving Memory Access Patterns of imflipP 68
3.6.3 MTFlipHM(): The Memory Friendly MTFlipH() 69
3.6.4 MTFlipVM(): The Memory Friendly MTFlipV() 71
3.7 PERFORMANCE OF IMFLIPPM.C 72
3.7.1 Comparing Performances of imflipP.c and imflipPM.c 72
3.7.2 Speed Improvement: MTFlipV() versus MTFlipVM() 73
3.7.3 Speed Improvement: MTFlipH() versus MTFlipHM() 73
3.7.4 Understanding the Speedup: MTFlipH() versus MTFlipHM() 73
3.8 PROCESS MEMORY MAP 74
3.9 INTEL MIC ARCHITECTURE: XEON PHI 76
3.10 WHAT ABOUT THE GPU? 77
3.11 CHAPTER SUMMARY 78

Chapter 4 Understanding the Cores and Memory 79

4.1 ONCE UPON A TIME ... INTEL ... 79

4.2 CPU AND MEMORY MANUFACTURERS 80
4.3 DYNAMIC (DRAM) VERSUS STATIC (SRAM) MEMORY 81
4.3.1 Static Random Access Memory (SRAM) 81
4.3.2 Dynamic Random Access Memory (DRAM) 81
4.3.3 DRAM Interface Standards 81
4.3.4 Influence of DRAM on our Program Performance 82
4.3.5 Influence of SRAM (Cache) on our Program Performance 83
4.4 IMAGE ROTATION PROGRAM: IMROTATE.C 83
4.4.1 Description of the imrotate.c 84
4.4.2 imrotate.c: Parametric Restrictions and Simplifications 84
4.4.3 imrotate.c: Theory of Operation 85
4.5 PERFORMANCE OF IMROTATE 89
4.5.1 Qualitative Analysis of Threading Efficiency 89
4.5.2 Quantitative Analysis: Defining Threading Efficiency 89
4.6 THE ARCHITECTURE OF THE COMPUTER 91

4.6.1 The Cores, L1$ and L2$ 91
4.6.2 Internal Core Resources 92
4.6.3 The Shared L3 Cache Memory (L3$) 94
4.6.4 The Memory Controller 94
4.6.5 The Main Memory 95
4.6.6 Queue, Uncore, and I/O 96
4.7 IMROTATEMC: MAKING IMROTATE MORE EFFICIENT 97
4.7.1 Rotate2(): How Bad is Square Root and FP Division? 99
4.7.2 Rotate3() and Rotate4(): How Bad Is sin() and cos()? 100
4.7.3 Rotate5(): How Bad Is Integer Division/Multiplication? 102
4.7.4 Rotate6(): Consolidating Computations 102
4.7.5 Rotate7(): Consolidating More Computations 104
4.7.6 Overall Performance of imrotateMC 104
4.8 CHAPTER SUMMARY 106

Chapter 5 Thread Management and Synchronization 107

5.1 EDGE DETECTION PROGRAM: IMEDGE.C 107

5.1.1 Description of the imedge.c 108
5.1.2 imedge.c: Parametric Restrictions and Simplifications 108
5.1.3 imedge.c: Theory of Operation 109
5.2 IMEDGE.C : IMPLEMENTATION 111
5.2.1 Initialization and Time-Stamping 112
5.2.2 Initialization Functions for Different Image Representations 113
5.2.3 Launching and Terminating Threads 114
5.2.4 Gaussian Filter 115
5.2.5 Sobel 116
5.2.6 Threshold 117
5.3 PERFORMANCE OF IMEDGE 118
5.4 IMEDGEMC: MAKING IMEDGE MORE EFFICIENT 118
5.4.1 Using Precomputation to Reduce Bandwidth 119
5.4.2 Storing the Precomputed Pixel Values 120
5.4.3 Precomputing Pixel Values 121
5.4.4 Reading the Image and Precomputing Pixel Values 122
5.4.5 PrGaussianFilter 123
5.4.6 PrSobel 124
5.4.7 PrThreshold 125
5.5 PERFORMANCE OF IMEDGEMC 126
5.6 IMEDGEMCT: SYNCHRONIZING THREADS EFFICIENTLY 127

5.6.1 Barrier Synchronization 128
5.6.2 MUTEX Structure for Data Sharing 129
5.7 IMEDGEMCT: IMPLEMENTATION 130
5.7.1 Using a MUTEX: Read Image, Precompute 132
5.7.2 Precomputing One Row at a Time 133
5.8 PERFORMANCE OF IMEDGEMCT 134

Part II GPU Programming Using CUDA

Chapter 6 Introduction to GPU Parallelism and CUDA 137

6.1 ONCE UPON A TIME ... NVIDIA ... 137

6.1.1 The Birth of the GPU 137
6.1.2 Early GPU Architectures 138
6.1.3 The Birth of the GPGPU 140
6.1.4 Nvidia, ATI Technologies, and Intel 141
6.2 COMPUTE-UNIFIED DEVICE ARCHITECTURE (CUDA) 143
6.2.1 CUDA, OpenCL, and Other GPU Languages 143
6.2.2 Device Side versus Host Side Code 143
6.3 UNDERSTANDING GPU PARALLELISM 144
6.3.1 How Does the GPU Achieve High Performance? 145
6.3.2 CPU versus GPU Architectural Differences 146
6.4 CUDA VERSION OF THE IMAGE FLIPPER: IMFLIPG.CU 147
6.4.1 imflipG.cu: Read the Image into a CPU-Side Array 149
6.4.2 Initialize and Query the GPUs 151
6.4.3 GPU-Side Time-Stamping 153
6.4.4 GPU-Side Memory Allocation 155
6.4.5 GPU Drivers and Nvidia Runtime Engine 155
6.4.6 CPU→GPU Data Transfer 156
6.4.7 Error Reporting Using Wrapper Functions 157
6.4.8 GPU Kernel Execution 157
6.4.9 Finish Executing the GPU Kernel 160
6.4.10 Transfer GPU Results Back to the CPU 161
6.4.11 Complete Time-Stamping 161
6.4.12 Report the Results and Cleanup 162
6.4.13 Reading and Writing the BMP File 163
6.4.14 Vflip(): The GPU Kernel for Vertical Flipping 164
6.4.15 What Is My Thread ID, Block ID, and Block Dimension? 166
6.4.16 Hflip(): The GPU Kernel for Horizontal Flipping 169
6.4.17 Hardware Parameters: threadIdx.x, blockIdx.x, blockDim.x 169
6.4.18 PixCopy(): The GPU Kernel for Copying an Image 169
6.4.19 CUDA Keywords 170
6.5 CUDA PROGRAM DEVELOPMENT IN WINDOWS 170
6.5.1 Installing MS Visual Studio 2015 and CUDA Toolkit 8.0 171
6.5.2 Creating Project imflipG.cu in Visual Studio 2015 172
6.5.3 Compiling Project imflipG.cu in Visual Studio 2015 174
6.5.4 Running Our First CUDA Application: imflipG.exe 177
6.5.5 Ensuring Your Program’s Correctness 178
6.6 CUDA PROGRAM DEVELOPMENT ON A MAC PLATFORM 179
6.6.1 Installing XCode on Your Mac 179
6.6.2 Installing the CUDA Driver and CUDA Toolkit 180
6.6.3 Compiling and Running CUDA Applications on a Mac 180
6.7 CUDA PROGRAM DEVELOPMENT IN A UNIX PLATFORM 181
6.7.1 Installing Eclipse and CUDA Toolkit 181
6.7.2 ssh into a Cluster 182
6.7.3 Compiling and Executing Your CUDA Code 182

Chapter 7 CUDA Host/Device Programming Model 185

7.1 DESIGNING YOUR PROGRAM’S PARALLELISM 185

7.1.1 Conceptually Parallelizing a Task 186
7.1.2 What Is a Good Block Size for Vflip()? 187
7.1.3 imflipG.cu: Interpreting the Program Output 187
7.1.4 imflipG.cu: Performance Impact of Block and Image Size 188
7.2 KERNEL LAUNCH COMPONENTS 189
7.2.1 Grids 189
7.2.2 Blocks 190
7.2.3 Threads 191
7.2.4 Warps and Lanes 192
7.3 IMFLIPG.CU: UNDERSTANDING THE KERNEL DETAILS 193
7.3.1 Launching Kernels in main() and Passing Arguments to Them 193
7.3.2 Thread Execution Steps 194
7.3.3 Vflip() Kernel Details 195
7.3.4 Comparing Vflip() and MTFlipV() 196
7.3.5 Hflip() Kernel Details 197
7.3.6 PixCopy() Kernel Details 197
7.4 DEPENDENCE OF PCI EXPRESS SPEED ON THE CPU 199
7.5 PERFORMANCE IMPACT OF PCI EXPRESS BUS 200

7.5.1 Data Transfer Time, Speed, Latency, Throughput, and Bandwidth 200
7.5.2 PCIe Throughput Achieved with imflipG.cu 201
7.6 PERFORMANCE IMPACT OF GLOBAL MEMORY BUS 204
7.7 PERFORMANCE IMPACT OF COMPUTE CAPABILITY 206
7.7.1 Fermi, Kepler, Maxwell, Pascal, and Volta Families 207
7.7.2 Relative Bandwidth Achieved in Different Families 207
7.7.3 imflipG2.cu: Compute Capability 2.0 Version of imflipG.cu 208
7.7.4 imflipG2.cu: Changes in main() 210
7.7.5 The PxCC20() Kernel 211
7.7.6 The VfCC20() Kernel 212
7.8 PERFORMANCE OF IMFLIPG2.CU 214
7.9 OLD-SCHOOL CUDA DEBUGGING 214
7.9.1 Common CUDA Bugs 216
7.9.2 return Debugging 218
7.9.3 Comment-Based Debugging 220
7.9.4 printf() Debugging 220
7.10 BIOLOGICAL REASONS FOR SOFTWARE BUGS 221
7.10.1 How Is Our Brain Involved in Writing/Debugging Code? 222
7.10.2 Do We Write Buggy Code When We Are Tired? 222
7.10.2.1 Attention 223
7.10.2.2 Physical Tiredness 223
7.10.2.3 Tiredness Due to Heavy Physical Activity 223
7.10.2.4 Tiredness Due to Needing Sleep 223
7.10.2.5 Mental Tiredness 224

Chapter 8 Understanding GPU Hardware Architecture 225

8.1 GPU HARDWARE ARCHITECTURE 226

8.2 GPU HARDWARE COMPONENTS 226
8.2.1 SM: Streaming Multiprocessor 226
8.2.2 GPU Cores 227
8.2.3 Giga-Thread Scheduler 227
8.2.4 Memory Controllers 229
8.2.5 Shared Cache Memory (L2$) 229
8.2.6 Host Interface 229
8.3 NVIDIA GPU ARCHITECTURES 230
8.3.1 Fermi Architecture 231
8.3.2 GT, GTX, and Compute Accelerators 231
8.3.3 Kepler Architecture 232
8.3.4 Maxwell Architecture 232

8.3.5 Pascal Architecture and NVLink 233
8.4 CUDA EDGE DETECTION: IMEDGEG.CU 233
8.4.1 Variables to Store the Image in CPU, GPU Memory 233
8.4.1.1 TheImage and CopyImage 233
8.4.1.2 GPUImg 234
8.4.1.3 GPUBWImg 234
8.4.1.4 GPUGaussImg 234
8.4.1.5 GPUGradient and GPUTheta 234
8.4.1.6 GPUResultImg 235
8.4.2 Allocating Memory for the GPU Variables 235
8.4.3 Calling the Kernels and Time-Stamping Their Execution 238
8.4.4 Computing the Kernel Performance 239
8.4.5 Computing the Amount of Kernel Data Movement 239
8.4.6 Reporting the Kernel Performance 242
8.5 IMEDGEG: KERNELS 242
8.5.1 BWKernel() 242
8.5.2 GaussKernel() 244
8.5.3 SobelKernel() 246
8.5.4 ThresholdKernel() 249
8.6 PERFORMANCE OF IMEDGEG.CU 249
8.6.1 imedgeG.cu: PCIe Bus Utilization 250
8.6.2 imedgeG.cu: Runtime Results 250
8.6.3 imedgeG.cu: Kernel Performance Comparison 252
8.7 GPU CODE: COMPILE TIME 253
8.7.1 Designing CUDA Code 253
8.7.2 Compiling CUDA Code 255
8.7.3 GPU Assembly: PTX, CUBIN 255
8.8 GPU CODE: LAUNCH 255
8.8.1 OS Involvement and CUDA DLL File 255
8.8.2 GPU Graphics Driver 256
8.8.3 CPU←→GPU Memory Transfers 256
8.9 GPU CODE: EXECUTION (RUN TIME) 257
8.9.1 Getting the Data 257
8.9.2 Getting the Code and Parameters 257
8.9.3 Launching Grids of Blocks 258
8.9.4 Giga Thread Scheduler (GTS) 258
8.9.5 Scheduling Blocks 259
8.9.6 Executing Blocks 260
8.9.7 Transparent Scalability 261

Chapter 9 Understanding GPU Cores 263

9.1 GPU ARCHITECTURE FAMILIES 263

9.1.1 Fermi Architecture 263
9.1.2 Fermi SM Structure 264
9.1.3 Kepler Architecture 266
9.1.4 Kepler SMX Structure 267
9.1.5 Maxwell Architecture 268
9.1.6 Maxwell SMM Structure 268
9.1.7 Pascal GP100 Architecture 270
9.1.8 Pascal GP100 SM Structure 271
9.1.9 Family Comparison: Peak GFLOPS and Peak DGFLOPS 272
9.1.10 GPU Boost 273
9.1.11 GPU Power Consumption 274
9.1.12 Computer Power Supply 274
9.2 STREAMING MULTIPROCESSOR (SM) BUILDING BLOCKS 275
9.2.1 GPU Cores 275
9.2.2 Double Precision Units (DPU) 276
9.2.3 Special Function Units (SFU) 276
9.2.4 Register File (RF) 276
9.2.5 Load/Store Queues (LDST) 277
9.2.6 L1$ and Texture Cache 277
9.2.7 Shared Memory 278
9.2.8 Constant Cache 278
9.2.9 Instruction Cache 278
9.2.10 Instruction Buffer 278
9.2.11 Warp Schedulers 278
9.2.12 Dispatch Units 279
9.3 PARALLEL THREAD EXECUTION (PTX) DATA TYPES 279
9.3.1 INT8 : 8-bit Integer 280
9.3.2 INT16 : 16-bit Integer 280
9.3.3 24-bit Integer 280
9.3.4 INT32 : 32-bit Integer 281
9.3.5 Predicate Registers (32-bit) 281
9.3.6 INT64 : 64-bit Integer 282
9.3.7 128-bit Integer 282
9.3.8 FP32: Single Precision Floating Point (float) 282
9.3.9 FP64: Double Precision Floating Point (double) 283
9.3.10 FP16: Half Precision Floating Point (half) 284
9.3.11 What is a FLOP? 284
9.3.12 Fused Multiply-Accumulate (FMA) versus Multiply-Add (MAD) 285
9.3.13 Quad and Octo Precision Floating Point 285
9.3.14 Pascal GP104 Engine SM Structure 285
9.4 IMFLIPGC.CU: CORE-FRIENDLY IMFLIPG 286
9.4.1 Hflip2(): Precomputing Kernel Parameters 288
9.4.2 Vflip2(): Precomputing Kernel Parameters 290
9.4.3 Computing Image Coordinates by a Thread 290
9.4.4 Block ID versus Image Row Mapping 291
9.4.5 Hflip3(): Using a 2D Launch Grid 292
9.4.6 Vflip3(): Using a 2D Launch Grid 293
9.4.7 Hflip4(): Computing Two Consecutive Pixels 294
9.4.8 Vflip4(): Computing Two Consecutive Pixels 295
9.4.9 Hflip5(): Computing Four Consecutive Pixels 296
9.4.10 Vflip5(): Computing Four Consecutive Pixels 297
9.4.11 PixCopy2(), PixCopy3(): Copying 2,4 Consecutive Pixels at a Time 298
9.5 IMEDGEGC.CU: CORE-FRIENDLY IMEDGEG 299
9.5.1 BWKernel2(): Using Precomputed Values and 2D Blocks 299
9.5.2 GaussKernel2(): Using Precomputed Values and 2D Blocks 300

Chapter 10 Understanding GPU Memory 303

10.1 GLOBAL MEMORY 303

10.2 L2 CACHE 304
10.3 TEXTURE/L1 CACHE 304
10.4 SHARED MEMORY 305
10.4.1 Split versus Dedicated Shared Memory 305
10.4.2 Memory Resources Available Per Core 306
10.4.3 Using Shared Memory as Software Cache 306
10.4.4 Allocating Shared Memory in an SM 307
10.5 INSTRUCTION CACHE 307
10.6 CONSTANT MEMORY 307
10.7 IMFLIPGCM.CU: CORE AND MEMORY FRIENDLY IMFLIPG 308
10.7.1 Hflip6(),Vflip6(): Using Shared Memory as Buffer 308
10.7.2 Hflip7(): Consecutive Swap Operations in Shared Memory 310
10.7.3 Hflip8(): Using Registers to Swap Four Pixels 312
10.7.4 Vflip7(): Copying 4 Bytes (int) at a Time 314
10.7.5 Aligned versus Unaligned Data Access in Memory 314
10.7.6 Vflip8(): Copying 8 Bytes at a Time 315
10.7.7 Vflip9(): Using Only Global Memory, 8 Bytes at a Time 316
10.7.8 PixCopy4(), PixCopy5(): Copying One versus 4 Bytes Using Shared Memory 317
10.7.9 PixCopy6(), PixCopy7(): Copying One/Two Integers Using Global Memory 318
10.8 IMEDGEGCM.CU: CORE- & MEMORY-FRIENDLY IMEDGEG 319
10.8.1 BWKernel3(): Using Byte Manipulation to Extract RGB 319
10.8.2 GaussKernel3(): Using Constant Memory 321
10.8.3 Ways to Handle Constant Values 321
10.8.4 GaussKernel4(): Buffering Neighbors of 1 Pixel in Shared Memory 323
10.8.5 GaussKernel5(): Buffering Neighbors of 4 Pixels in Shared Memory 325
10.8.6 GaussKernel6(): Reading 5 Vertical Pixels into Shared Memory 327
10.8.7 GaussKernel7(): Eliminating the Need to Account for Edge Pixels 329
10.8.8 GaussKernel8(): Computing 8 Vertical Pixels 331
10.9 CUDA OCCUPANCY CALCULATOR 333
10.9.1 Choosing the Optimum Threads/Block 334
10.9.2 SM-Level Resource Limitations 335
10.9.3 What is “Occupancy”? 336
10.9.4 CUDA Occupancy Calculator: Resource Computation 336
10.9.5 Case Study: GaussKernel7() 340
10.9.6 Case Study: GaussKernel8() 343

Chapter 11 CUDA Streams 345

11.1 WHAT IS PIPELINING? 347
11.1.1 Execution Overlapping 347
11.1.2 Exposed versus Coalesced Runtime 348
11.2 MEMORY ALLOCATION 349
11.2.1 Physical versus Virtual Memory 349
11.2.2 Physical to Virtual Address Translation 350
11.2.3 Pinned Memory 350
11.2.4 Allocating Pinned Memory with cudaMallocHost() 351
11.3 FAST CPU←→GPU DATA TRANSFERS 351
11.3.1 Synchronous Data Transfers 351
11.3.2 Asynchronous Data Transfers 351
11.4 CUDA STREAMS 352
11.4.1 CPU→GPU Transfer, Kernel Exec, GPU→CPU Transfer 352
11.4.2 Implementing Streaming in CUDA 353
11.4.3 Copy Engine 353
11.4.4 Kernel Execution Engine 353
11.4.5 Concurrent Upstream and Downstream PCIe Transfers 354
11.4.6 Creating CUDA Streams 355
11.4.7 Destroying CUDA Streams 355
11.4.8 Synchronizing CUDA Streams 355
11.5 IMGSTR.CU: STREAMING IMAGE PROCESSING 356
11.5.1 Reading the Image into Pinned Memory 356
11.5.2 Synchronous versus Single Stream 358
11.5.3 Multiple Streams 359
11.5.4 Data Dependence Across Multiple Streams 361
11.5.4.1 Horizontal Flip: No Data Dependence 362
11.5.4.2 Edge Detection: Data Dependence 363
11.5.4.3 Preprocessing Overlapping Rows Synchronously 363
11.5.4.4 Asynchronous Processing the Non-Overlapping Rows 364
11.6 STREAMING HORIZONTAL FLIP KERNEL 366
11.7 IMGSTR.CU: STREAMING EDGE DETECTION 367
11.8 PERFORMANCE COMPARISON: IMGSTR.CU 371
11.8.1 Synchronous versus Asynchronous Results 371
11.8.2 Randomness in the Results 372
11.8.3 Optimum Queuing 372
11.8.4 Best Case Streaming Results 373
11.8.5 Worst Case Streaming Results 374
11.9 NVIDIA VISUAL PROFILER: NVVP 375
11.9.1 Installing nvvp and nvprof 375
11.9.2 Using nvvp 376
11.9.3 Using nvprof 377
11.9.4 imGStr Synchronous and Single-Stream Results 377
11.9.5 imGStr 2- and 4-Stream Results 378

Part III More To Know

Chapter 12 CUDA Libraries 383
Mohamadhadi Habibzadeh, Omid Rajabi Shishvan, and Tolga Soyata
12.1 cuBLAS 383
12.1.1 BLAS Levels 383
12.1.2 cuBLAS Datatypes 384
12.1.3 Installing cuBLAS 385
12.1.4 Variable Declaration and Initialization 385
12.1.5 Device Memory Allocation 386
12.1.6 Creating Context 386
12.1.7 Transferring Data to the Device 386
12.1.8 Calling cuBLAS Functions 387
12.1.9 Transfer Data Back to the Host 388
12.1.10 Deallocating Memory 388
12.1.11 Example cuBLAS Program: Matrix Scalar 388
12.2 CUFFT 390
12.2.1 cuFFT Library Characteristics 390
12.2.2 A Sample Complex-to-Complex Transform 390
12.2.3 A Sample Real-to-Complex Transform 391
12.3 NVIDIA PERFORMANCE PRIMITIVES (NPP) 392
12.4 THRUST LIBRARY 393

Chapter 13 Introduction to OpenCL 397

Chase Conklin and Tolga Soyata
13.1 WHAT IS OpenCL? 397
13.1.1 Multiplatform 397
13.1.2 Queue-Based 397
13.2 IMAGE FLIP KERNEL IN OPENCL 398
13.3 RUNNING OUR KERNEL 399
13.3.1 Selecting a Device 400
13.3.2 Running the Kernel 401
13.3.2.1 Creating a Compute Context 401
13.3.2.2 Creating a Command Queue 401
13.3.2.3 Loading Kernel File 402
13.3.2.4 Setting Up Kernel Invocation 403
13.3.3 Runtimes of Our OpenCL Program 405
13.4 EDGE DETECTION IN OpenCL 406

Chapter 14 Other GPU Programming Languages 413

Sam Miller, Andrew Boggio-Dandry, and Tolga Soyata
14.1 GPU PROGRAMMING WITH PYTHON 413
14.1.1 PyOpenCL Version of imflip 414
14.1.2 PyOpenCL Element-Wise Kernel 418
14.2 OPENGL 420
14.3 OPENGL ES: OPENGL FOR EMBEDDED SYSTEMS 420
14.4 VULKAN 421
14.5 MICROSOFT’S HIGH-LEVEL SHADING LANGUAGE (HLSL) 421
14.5.1 Shading 421
14.5.2 Microsoft HLSL 422
14.6 APPLE’S METAL API 422
14.7 APPLE’S SWIFT PROGRAMMING LANGUAGE 423
14.8 OPENCV 423
14.8.1 Installing OpenCV and Face Recognition 423
14.8.2 Mobile-Cloudlet-Cloud Real-Time Face Recognition 423
14.8.3 Acceleration as a Service (AXaas) 423

Chapter 15 Deep Learning Using CUDA 425

Omid Rajabi Shishvan and Tolga Soyata
15.1 ARTIFICIAL NEURAL NETWORKS (ANNS) 425
15.1.1 Neurons 425
15.1.2 Activation Functions 425
15.2 FULLY CONNECTED NEURAL NETWORKS 425
15.3 DEEP NETWORKS/CONVOLUTIONAL NEURAL NETWORKS 427
15.4 TRAINING A NETWORK 428
15.5 CUDNN LIBRARY FOR DEEP LEARNING 428
15.5.1 Creating a Layer 429
15.5.2 Creating a Network 430
15.5.3 Forward Propagation 431
15.5.4 Backpropagation 431
15.5.5 Using cuBLAS in the Network 431
15.6 KERAS 432

Bibliography 435

Index 439

List of Figures

1.1 Harvesting each coconut requires two consecutive 30-second tasks (threads).
Thread 1: get a coconut. Thread 2: crack (process) that coconut using the
hammer. 4
1.2 Simultaneously executing Thread 1 (“1”) and Thread 2 (“2”). Accessing
shared resources will cause a thread to wait (“-”). 6
1.3 Serial (single-threaded) program imflip.c flips a 640×480 dog picture (left)
horizontally (middle) or vertically (right). 8
1.4 Running gdb to catch a segmentation fault. 20
1.5 Running valgrind to catch a memory access error. 23

2.1 Windows Task Manager, showing 1499 threads, however, there is 0% CPU
utilization. 33

3.1 The life cycle of a thread. From the creation to its termination, a thread is
cycled through many different statuses, assigned by the OS. 60
3.2 Memory access patterns of MTFlipH() in Code 2.8. A total of 3200 pixels’
RGB values (9600 Bytes) are flipped for each row. 65
3.3 The memory map of a process when only a single thread is running within
the process (left) or multiple threads are running in it (right). 75

4.1 Inside a computer containing an i7-5930K CPU [10] (CPU5 in Table 3.1),
and 64 GB of DDR4 memory. This PC has a GTX Titan Z GPU that will
be used to test a lot of the programs in Part II. 80
4.2 The imrotate.c program rotates a picture by a specified angle. Original dog
(top left), rotated +10◦ (top right), +45◦ (bottom left), and −75◦ (bottom
right) clockwise. Scaling is done to avoid cropping of the original image area. 84
4.3 The architecture of one core of the i7-5930K CPU (the PC in Figure 4.1).
This core is capable of executing two threads (hyper-threading, as defined
by Intel). These two threads share most of the core resources, but have their
own register files. 92
4.4 Architecture of the i7-5930K CPU (6C/12T). This CPU connects to the
GPUs through an external PCI express bus and memory through the
memory bus. 94

5.1 The imedge.c program is used to detect edges in the original image
astronaut.bmp (top left). Intermediate processing steps are: GaussianFilter()
(top right), Sobel() (bottom left), and finally Threshold() (bottom right). 108
5.2 Example barrier synchronization for 4 threads. Serial runtime is 7281 ms
and the 4-threaded runtime is 2246 ms. The speedup of 3.24× is close to
the best-expected 4×, but not equal due to the imbalance of each thread’s
runtime. 128
5.3 Using a MUTEXdata structure to access shared variables. 129

6.1 Turning the dog picture into a 3D wire frame. Triangles are used to represent
the object, rather than pixels. This representation allows us to map a texture
to each triangle. When the object moves, so does each triangle, along with
their associated textures. To increase the resolution of this kind of object
representation, we can divide triangles into smaller triangles in a process
called tessellation. 139
6.2 Steps to move triangulated 3D objects. Triangles contain two attributes:
their location and their texture. Objects are moved by performing mathematical
operations only on their coordinates. A final texture mapping places
the texture back on the moved object coordinates, while a 3D-to-2D transformation
allows the resulting image to be displayed on a regular 2D computer
monitor. 140
6.3 Three farmer teams compete in Analogy 6.1: (1) Arnold competes alone
with his 2× bigger tractor and “the strongest farmer” reputation, (2) Fred
and Jim compete together in a much smaller tractor than Arnold. (3) Tolga,
along with 32 boy and girl scouts, compete together using a bus. Who wins? 145
6.4 Nvidia Runtime Engine is built into your GPU drivers, shown in your Windows
10 Pro SysTray. When you click the Nvidia symbol, you can open the
Nvidia control panel to see the driver version as well as the parameters of
your GPU(s). 156
6.5 Creating a Visual Studio 2015 CUDA project named imflipG.cu. Assume
that the code will be in a directory named Z:\code\imflipG in this example. 172
6.6 Visual Studio 2015 source files are in the Z:\code\imflipG\imflipG directory.
In this specific example, we will remove the default file, kernel.cu, that
VS 2015 creates. After this, we will add an existing file, imflipG.cu, to the
project. 173
6.7 The default CPU platform is x86. We will change it to x64. We will also
remove the GPU debugging option. 174
6.8 The default Compute Capability is 2.0. This is too old. We will change it to
Compute Capability 3.0, which is done by editing Code Generation under
Device and changing it to compute_30, sm_30. 175
6.9 Compiling imflipG.cu to get the executable file imflipG.exe in the
Z:\code\imflipG\x64\Debug directory. 176
6.10 Running imflipG.exe from a CMD command line window. 177
6.11 The /usr/local directory in Unix contains your CUDA directories. 181
6.12 Creating a new CUDA project using the Eclipse IDE in Unix. 183

7.1 The PCIe bus connects the host (CPU) and the device(s) (GPUs).
The host and each device have their own I/O controllers to allow transfers
through the PCIe bus, while both the host and the device have their own
memory, with a dedicated bus to it; in the GPU this memory is called global
memory. 205

8.1 Analogy 8.1 for executing a massively parallel program using a significant
number of GPU cores, which receive their instructions and data from different
sources. Melissa (Memory controller) is solely responsible for bringing
the coconuts from the jungle and dumping them into the big barrel (L2$).
Larry (L2$ controller) is responsible for distributing these coconuts into the
smaller barrels (L1$) of Laura, Linda, Lilly, and Libby; eventually, these four
folks distribute the coconuts (data) to the scouts (GPU cores). On the right
side, Gina (Giga-Thread Scheduler) has the big list of tasks (list of blocks to
be executed); she assigns each block to a school bus (SM or streaming
multiprocessor). Inside each bus, one person (Tolga, Tony, Tom, or Tim)
is responsible for assigning the tasks to the scouts (instruction schedulers). 228
8.2 The internal architecture of the GTX550Ti GPU. A total of 192 GPU cores
are organized into six streaming multiprocessor (SM) groups of 32 GPU
cores. A single L2$ is shared among all 192 cores, while each SM has its
own L1$. A dedicated memory controller is responsible for bringing data in
and out of the GDDR5 global memory and dumping it into the shared L2$,
while a dedicated host interface is responsible for shuttling data (and code)
between the CPU and GPU over the PCIe bus. 230
8.3 A sample output of the imedgeG.cu program executed on the astronaut.bmp
image using a GTX Titan Z GPU. Kernel execution times and the amount
of data movement for each kernel are clearly shown. 242

9.1 GF110 Fermi architecture with 16 SMs, where each SM houses 32 cores, 16
LD/ST units, and 4 Special Function Units (SFUs). The highest end Fermi
GPU contains 512 cores (e.g., GTX 580). 264
9.2 GF110 Fermi SM structure. Each SM has a 128 KB register file that contains
32,768 (32 K) registers, where each register is 32-bits. This register file feeds
operands to the 32 cores and 4 Special Function Units (SFU). 16 Load/Store
(LD/ST) units are used to queue memory load/store requests. A 64 KB total
cache memory is used for L1$ and shared memory. 265
9.3 GK110 Kepler architecture with 15 SMXs, where each SMX houses 192
cores, 64 double-precision units (DPU), 32 LD/ST units, and 32 Special
Function Units (SFU). The highest end Kepler GPU contains 2880 cores
(e.g., GTX Titan Black); its “double” version GTX Titan Z contains 5760
cores. 266
9.4 GK110 Kepler SMX structure. A 256 KB (64 K-register) register file feeds
192 cores, 64 Double-Precision Units (DPU), 32 Load/Store units, and 32
SFUs. Four warp schedulers can schedule four warps, which are dispatched
as 8 half-warps. Read-only cache is used to hold constants. 267
9.5 GM200 Maxwell architecture with 24 SMMs, housed inside 6 larger GPC
units; each SMM houses 128 cores, 32 LD/ST units, and 32 Special Function
Units (SFU), but does not contain double-precision units (DPUs). The highest
end Maxwell GPU contains 3072 cores (e.g., GTX Titan X). 268
9.6 GM200 Maxwell SMM structure consists of 4 identical sub-structures with
32 cores, 8 LD/ST units, 8 SFUs, and 16 K registers. Two of these sub-
structures share an L1$, while four of them share a 96 KB shared memory. 269
9.7 GP100 Pascal architecture with 60 SMs, housed inside 6 larger GPC units,
each containing 10 SMs. The highest end Pascal GPU contains 3840 cores
(e.g., P100 compute accelerator). NVLink and High Bandwidth Memory
(HBM2) allow significantly faster memory bandwidths as compared to previous
generations. 270
9.8 GP100 Pascal SM structure consists of two identical sub-structures that
contain 32 cores, 16 DPUs, 8 LD/ST units, 8 SFUs, and 32 K registers.
They share an instruction cache; however, each has its own instruction
buffer. 271
9.9 IEEE 754-2008 floating point standard and the supported floating point data
types by CUDA. half data type is supported in Compute Capability 5.3 and
above, while float has seen support from the first day of the introduction
of CUDA. Support for double types started in Compute Capability 1.3. 284

10.1 CUDA Occupancy Calculator: Choosing the Compute Capability, max.
shared memory size, registers/kernel, and kernel shared memory usage. In
this specific case, the occupancy is 24 warps per SM (out of a total of 64),
translating to an occupancy of 24 ÷ 64 = 38 %. 337
10.2 Analyzing the occupancy of a case with (1) registers/thread=16, (2) shared
memory/kernel=8192 (8 KB), and (3) threads/block=128 (4 warps). CUDA
Occupancy Calculator plots the occupancy when each kernel contains more
registers (top) and as we launch more blocks (bottom), each requiring an
additional 8 KB. With 8 KB/block, the limitation is 24 warps/SM; however,
it would go up to 32 warps/SM if each block only required 6 KB of shared
memory (6144 Bytes), as shown in the shared memory plot (below). 338
10.3 Analyzing the occupancy of a case with (1) registers/thread=16, (2) shared
memory/kernel=8192 (8 KB), and (3) threads/block=128 (4 warps). CUDA
Occupancy Calculator plots the occupancy when we launch our blocks with
more threads/block (top) and provides a summary of which one of the three
resources will be exposed to the limitation before the others (bottom). In this
specific case, the limited amount of shared memory (48 KB) limits the total
number of blocks we can launch to 6. Alternatively, the number of registers
or the maximum number of blocks per SM does not become a limitation. 339
10.4 Analyzing the GaussKernel7(), which uses (1) registers/thread ≈ 16, (2)
shared memory/kernel=40,960 (40 KB), and (3) threads/block=256. It is
clear that the shared memory limitation does not allow us to launch more
than a single block with 256 threads (8 warps). If you could reduce the
shared memory down to 24 KB by redesigning your kernel, you could launch
at least 2 blocks (16 warps, as shown in the plot) and double the occupancy. 341
10.5 Analyzing the GaussKernel7() with (1) registers/thread=16, (2) shared
memory/kernel=40,960, and (3) threads/block=256. 342
10.6 Analyzing the GaussKernel8() with (1) registers/thread=16, (2) shared
memory/kernel=24,576, and (3) threads/block=256. 343
10.7 Analyzing the GaussKernel8() with (1) registers/thread=16, (2) shared
memory/kernel=24,576, and (3) threads/block=256. 344

11.1 Nvidia visual profiler. 376
11.2 Nvidia profiler, command line version. 377
11.3 Nvidia NVVP results with no streaming and using a single stream, on the
K80 GPU. 378
11.4 Nvidia NVVP results with 2 and 4 streams, on the K80 GPU. 379

14.1 imflip.py kernel runtimes on different devices. 417

15.1 Generalized architecture of a fully connected artificial neural network with
n inputs, k hidden layers, and m outputs. 426
15.2 Inner structure of a neuron used in ANNs. ωij are the weights by which
inputs to the neuron (x1, x2, ..., xn) are multiplied before they are summed.
“Bias” is a value by which this sum is augmented, and f() is the activation
function, which is used to introduce a non-linear component to the output. 426
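
Several captions in the list above quote small calculations: the 3.24× speedup in Figure 5.2, the 24 ÷ 64 occupancy in Figure 10.1, and the shared-memory limits in Figures 10.2 to 10.4. The sketch below reproduces that arithmetic. It is written in Python purely for illustration and is not code from the book; the helper names `speedup` and `warps_per_sm` are hypothetical, and the model assumes that shared memory is the only limiting resource, with 32 threads per warp, 48 KB of shared memory per SM, and a ceiling of 64 warps per SM, as the captions state or imply.

```python
WARP_SIZE = 32  # threads per warp, as implied by "128 threads (4 warps)" in the captions


def speedup(serial_ms, parallel_ms):
    """Ratio of serial runtime to parallel runtime."""
    return serial_ms / parallel_ms


def warps_per_sm(shared_bytes_per_block, threads_per_block,
                 shared_bytes_per_sm=48 * 1024, max_warps_per_sm=64):
    """Resident warps per SM, assuming shared memory is the only limit (a simplification)."""
    blocks = shared_bytes_per_sm // shared_bytes_per_block  # blocks that fit in shared memory
    warps_per_block = threads_per_block // WARP_SIZE
    return min(blocks * warps_per_block, max_warps_per_sm)


# Figure 5.2: serial 7281 ms versus 4-threaded 2246 ms
barrier_speedup = speedup(7281, 2246)

# Figures 10.1 to 10.3: 8 KB of shared memory per block, 128 threads per block
warps_8kb = warps_per_sm(8192, 128)
occupancy_8kb = warps_8kb / 64

# Figure 10.2: the same block size if each block needed only 6 KB
warps_6kb = warps_per_sm(6144, 128)

# Figure 10.4: 256 threads per block with 40 KB versus 24 KB of shared memory
warps_40kb = warps_per_sm(40960, 256)
warps_24kb = warps_per_sm(24576, 256)

print(f"speedup: {barrier_speedup:.2f}x")
print(f"8 KB/block: {warps_8kb} warps/SM, occupancy {occupancy_8kb:.1%}")
print(f"6 KB/block: {warps_6kb} warps/SM")
print(f"40 KB/block: {warps_40kb} warps/SM, 24 KB/block: {warps_24kb} warps/SM")
```

A real occupancy calculation also accounts for register usage and the per-SM block limit, which this sketch deliberately leaves out.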
Then I went back, through the pine wood, neck-deep in red weed here and
there, and learned that the landlord of the “Spotted Dog” had already
been buried. At last, passing the College, I reached home. A man standing
in the open door of one of the cottages greeted me by name as I went past.

I glanced at my house and a ray of hope flashed within me, but it soon
faded. The gate had been forced; it was unbolted, and as I reached it,
it slowly swung open.

It closed again. The curtains of my study fluttered out of the open
window from which the artilleryman and I had kept watch that morning.
No one had closed the window since. The trampled bushes were just as
they had been four weeks before, when I was last here.

I staggered into the hall. There was no sign of anyone in the house.
The stair carpet was crumpled and faded at the spot where I had crouched,
soaked to the skin by the storm, on the night of the catastrophe. I could
still see our muddy footprints on the stairs.

I followed the footprints to my study and found my manuscript on the
writing table, under the paperweight, just as I had left it on the
afternoon the cylinder opened. For a while I stood there reading over my
abandoned work, which dealt with the probable development of moral ideas
in connection with the progress of civilization.

The last sentence was the beginning of a prophecy: “In about two hundred
years we may expect…” and there the sentence broke off. I remember that
on that morning, scarcely a month ago, I had been unable to fix my
thoughts on the subject of my paper. I remember that I stopped writing
and went down to get my Daily Chronicle from the newsboy; I ran to meet
him at the garden gate and listened excitedly to his strange tale of the
“Men from Mars.”

I went down to the dining room. There, far gone in decay, were the mutton
and the bread and an overturned beer bottle, just as the artilleryman and
I had left them. My home was utterly desolate. I saw how foolish a hope
I had chased for so long. And then a strange thing happened.

“There is no one here,” said a voice. “The house is empty. Not a soul has
been here for ten days. Do not stay here any longer and torment yourself
in vain. No one escaped but you.”

I was startled. Had I been thinking aloud? I turned and saw that the
French window behind me stood open. I stepped to it and looked out.

And there, almost numb with amazement and as frightened as I was myself,
stood my cousin and my wife… my wife pale and tearless. She gave a
faint cry.

“I came,” she stammered. “I knew… I knew…”

She put her hand to her head and swayed. I rushed to her and caught her
in my arms.

X.
Conclusion.

I regret now, as I near the end of my story, that I can contribute so
little to settling the many disputed and still undecided questions. In
one respect I shall certainly provoke criticism. My own field is
speculative philosophy. My knowledge of comparative physiology comes from
one or two books in all, yet I believe that Carver’s explanation of the
cause of the Martians’ rapid death can be regarded as almost a proven
truth. I have adopted that explanation throughout my narrative.

In the bodies of the Martians that were examined after the war, no
bacteria were found in any case other than those of long-known
terrestrial species. The fact that they did not bury a single one of
their dead, and that they slaughtered people with the utmost
carelessness, likewise shows that they had no notion of the process of
putrefaction. But however probable this explanation seems, it is not
fully proven.

Nor do we know the composition of the Black Smoke, which the Martians
used with such deadly effect, and the origin of the Heat-Ray also remains
a riddle. The terrible disaster at the Ealing and South Kensington
laboratories has taken away all the scientists’ desire to analyze the
origin of the Heat-Ray any further. Spectrum analysis has shown
unmistakably that the black powder contains some unknown element, which
resolves into a brilliant group of three lines in the green. It is
possible that this element, combining with argon, acts in an instant with
fatal effect on some constituent of the blood. But such unproven
speculations will hardly interest the great mass of readers for whom I
wrote my book.

Of the brown scum that drifted down the Thames after the destruction of
Shepperton, none was examined at the time, and none has been found since.

The results of the anatomical study of the Martians, so far as the stray
dogs had not made such study entirely impossible, I have already
summarized; in any case everyone knows the magnificent and almost
complete specimen preserved in spirit at the Natural History Museum, or
at least the countless drawings made of it. Every other physiological and
morphological detail is of purely scientific interest.
A far more important question, and surely one of general interest, is
whether the Martians may launch another attack on the Earth. I believe
that this question has so far received nothing like the attention it
deserves. At present the planet Mars is in conjunction; but every time it
returns to opposition I, for my part, am ready for another expedition of
theirs. What is certain is that we must not neglect due preparations. I
believe it will be possible to determine the position of the gun from
which they fire at us. That part of the planet must be watched
continuously, so that we learn in good time of the start of the next
attack.

In the event of another attack, the cylinder could easily be destroyed
with dynamite or artillery before it had cooled enough for the Martians
to emerge; or they too could be shelled to death as soon as the top of
the cylinder unscrewed. I believe the failure of their first venture cost
them dearly. They may see it that way themselves.

Lessing has supported with excellent arguments his supposition that the
Martians have also broken into the planet Venus, and successfully. Seven
months ago Venus and Mars were in line with the Sun; that is, Mars was in
opposition for an observer on Venus. Later a peculiar luminous, wave-like
marking appeared on the unilluminated half of the inner planet, and
almost simultaneously a similar faint dark wave-like marking was
discovered on a photograph of the Martian disc. One must see the drawings
of both phenomena to appreciate fully the remarkable resemblance in their
character.

But whether we expect another attack or not, what has happened has in any
case greatly altered our ideas of the future of mankind. We have learned
that the Earth is not an enclosed territory and by no means a secure
abode for man. We can never have any idea what unseen good or evil may
burst upon us suddenly out of space. It may be that, from the standpoint
of the universe, this attack of the Martians will ultimately prove to be
to the benefit of men; it has robbed man of that serene confidence in the
future which is the most fruitful source of decadence. It has enriched
human knowledge with a vast amount of new information, and it has done
much to hasten a clearer conception of the common interests of mankind.
It may be that the Martians, across the immensity of space, have watched
the fate of their pioneers and learned from their own loss; it may also
be that they have found a safer settlement on the planet Venus. Be that
as it may, one thing is certain: the disc of Mars will be observed with
the greatest care for many years, and those fiery darts of the sky, the
shooting stars, will inevitably make all the sons of men tremble whenever
they fall.

The deepening of human knowledge that we owe to the Martian attack can
scarcely be exaggerated. Before the cylinder fell, it was the general
conviction that in the infinity of space there was no life anywhere
beyond the thin crust of our tiny globe. Now we see further. If the
Martians can reach Venus, we have no reason to suppose that men cannot do
the same. And if the slow cooling of the Sun makes this Earth
uninhabitable, as it ultimately will, it may be that the thread of life
that began here will spread out from the Earth and draw our sister planet
into its web. Shall we conquer?

Dim and wonderful is the picture that has arisen in my mind of how life
will spread slowly from the small hotbed of the Solar System across the
soulless infinity of starry space. But that is only a distant dream. It
may equally be that the destruction of the Martians is only a respite.
Perhaps the future belongs to them and not to us.
I cannot deny that ever since that time of distress and danger there has
been an abiding doubt and insecurity in my mind. I sit in my study
writing by lamplight, and suddenly I see again the beautiful valley below
overrun with writhing flames, and I feel that the house behind and around
me is empty and deserted. I go out onto the Byfleet road, where vehicles
pass me: a butcher’s boy in a cart, a carriage full of visitors, a
workman on a bicycle, children hurrying to school. And suddenly they all
grow blurred before me, and I am hurrying again with the artilleryman
through the silent heat. One night I saw the black powder darkening the
silent streets and covering the contorted bodies, which rose upon me
tattered and torn. Gibbering unintelligible sounds, they grew ever
fiercer, paler, and more hideous, until at last they lost every human
form, whereupon I awoke, cold and wretched, in the darkness of the night.

I go to London and watch the teeming crowds in Fleet Street and the
Strand, and suddenly it occurs to me that these people are only ghosts of
the past, haunting the silent and ruined streets, phantoms in a dead
city, mockeries of life in a galvanized body. And it is strange, too, to
stand on Primrose Hill, as I did the day before writing this last
chapter. It is so strange to look out over the wilderness of houses,
dimly blue through the smoky haze, which finally melts into the arching
sky; to watch the people walking to and fro among the flower beds on the
hill; to watch the sightseers who always stand in crowds around the
Martian machine that still stands there; to hear the noise of playing
children and then to think back to the time when, standing on this same
spot at the dawn of that last great day, I saw everything grim and silent
in the radiance of the rising sun…

But strangest of all is that I hold my wife’s little hand in mine again,
and think that I once counted her, and she once counted me, among the
dead…

Footnotes.
1) Throughout the book, a mile means an English mile: 1609.33 meters.
CONTENTS.

Part One. The Arrival of the Martians.

I. On the Threshold of the War 5
II. The Falling Star 12
III. On Horsell Common 17
IV. The Cylinder Opens 20
V. The Heat-Ray 24
VI. The Heat-Ray on the Chobham Road 28
VII. How Did I Get Home? 31
VIII. Friday Night 36
IX. The Fighting Begins 39
X. In the Storm 45
XI. At the Window 52
XII. What Did I See of the Destruction of Weybridge and Shepperton? 59
XIII. How Did I Meet the Curate? 71
XIV. In London 77
XV. What Happened in Surrey? 89
XVI. The Exodus from London 98
XVII. The “Lightning” 112

Part Two. The Earth under the Rule of the Martians.

I. Under the Ruins 123
II. What Did We See from the Ruined House? 131
III. The Days of Imprisonment 142
IV. The Death of the Curate 148
V. The Silence 154
VI. The Work of Two Weeks 157
VII. The Man on Putney Hill 161
VIII. Dead London 180
IX. Ruins 189
X. Conclusion 195
*** END OF THE PROJECT GUTENBERG EBOOK VILÁGOK
HARCA ***

Updated editions will replace the previous one—the old editions will
be renamed.

Creating the works from print editions not protected by U.S.
copyright law means that no one owns a United States copyright in
these works, so the Foundation (and you!) can copy and distribute it
in the United States without permission and without paying copyright
royalties. Special rules, set forth in the General Terms of Use part of
this license, apply to copying and distributing Project Gutenberg™
electronic works to protect the PROJECT GUTENBERG™ concept
and trademark. Project Gutenberg is a registered trademark, and
may not be used if you charge for an eBook, except by following the
terms of the trademark license, including paying royalties for use of
the Project Gutenberg trademark. If you do not charge anything for
copies of this eBook, complying with the trademark license is very
easy. You may use this eBook for nearly any purpose such as
creation of derivative works, reports, performances and research.
Project Gutenberg eBooks may be modified and printed and given
away—you may do practically ANYTHING in the United States with
eBooks not protected by U.S. copyright law. Redistribution is subject
to the trademark license, especially commercial redistribution.

START: FULL LICENSE

THE FULL PROJECT GUTENBERG LICENSE
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the free
distribution of electronic works, by using or distributing this work (or
any other work associated in any way with the phrase “Project
Gutenberg”), you agree to comply with all the terms of the Full
Project Gutenberg™ License available with this file or online at
www.gutenberg.org/license.

Section 1. General Terms of Use and Redistributing Project Gutenberg™
electronic works
1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand, agree
to and accept all the terms of this license and intellectual property
(trademark/copyright) agreement. If you do not agree to abide by all
the terms of this agreement, you must cease using and return or
destroy all copies of Project Gutenberg™ electronic works in your
possession. If you paid a fee for obtaining a copy of or access to a
Project Gutenberg™ electronic work and you do not agree to be
bound by the terms of this agreement, you may obtain a refund from
the person or entity to whom you paid the fee as set forth in
paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only be
used on or associated in any way with an electronic work by people
who agree to be bound by the terms of this agreement. There are a
few things that you can do with most Project Gutenberg™ electronic
works even without complying with the full terms of this agreement.
See paragraph 1.C below. There are a lot of things you can do with
Project Gutenberg™ electronic works if you follow the terms of this
agreement and help preserve free future access to Project
Gutenberg™ electronic works. See paragraph 1.E below.
1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright law in
the United States and you are located in the United States, we do
not claim a right to prevent you from copying, distributing,
performing, displaying or creating derivative works based on the
work as long as all references to Project Gutenberg are removed. Of
course, we hope that you will support the Project Gutenberg™
mission of promoting free access to electronic works by freely
sharing Project Gutenberg™ works in compliance with the terms of
this agreement for keeping the Project Gutenberg™ name
associated with the work. You can easily comply with the terms of
this agreement by keeping this work in the same format with its
attached full Project Gutenberg™ License when you share it without
charge with others.

1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside the
United States, check the laws of your country in addition to the terms
of this agreement before downloading, copying, displaying,
performing, distributing or creating derivative works based on this
work or any other Project Gutenberg™ work. The Foundation makes
no representations concerning the copyright status of any work in
any country other than the United States.

1.E. Unless you have removed all references to Project Gutenberg:

1.E.1. The following sentence, with active links to, or other
immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project Gutenberg™
work (any work on which the phrase “Project Gutenberg” appears, or
with which the phrase “Project Gutenberg” is associated) is
accessed, displayed, performed, viewed, copied or distributed:
This eBook is for the use of anyone anywhere in the United
States and most other parts of the world at no cost and with
almost no restrictions whatsoever. You may copy it, give it away
or re-use it under the terms of the Project Gutenberg License
included with this eBook or online at www.gutenberg.org. If you
are not located in the United States, you will have to check the
laws of the country where you are located before using this
eBook.

1.E.2. If an individual Project Gutenberg™ electronic work is derived
from texts not protected by U.S. copyright law (does not contain a
notice indicating that it is posted with permission of the copyright
holder), the work can be copied and distributed to anyone in the
United States without paying any fees or charges. If you are
redistributing or providing access to a work with the phrase “Project
Gutenberg” associated with or appearing on the work, you must
comply either with the requirements of paragraphs 1.E.1 through
1.E.7 or obtain permission for the use of the work and the Project
Gutenberg™ trademark as set forth in paragraphs 1.E.8 or 1.E.9.

1.E.3. If an individual Project Gutenberg™ electronic work is posted
with the permission of the copyright holder, your use and distribution
must comply with both paragraphs 1.E.1 through 1.E.7 and any
additional terms imposed by the copyright holder. Additional terms
will be linked to the Project Gutenberg™ License for all works posted
with the permission of the copyright holder found at the beginning of
this work.

1.E.4. Do not unlink or detach or remove the full Project
Gutenberg™ License terms from this work, or any files containing a
part of this work or any other work associated with Project
Gutenberg™.

1.E.5. Do not copy, display, perform, distribute or redistribute this
electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1 with
active links or immediate access to the full terms of the Project
Gutenberg™ License.
1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if you
provide access to or distribute copies of a Project Gutenberg™ work
in a format other than “Plain Vanilla ASCII” or other format used in
the official version posted on the official Project Gutenberg™ website
(www.gutenberg.org), you must, at no additional cost, fee or expense
to the user, provide a copy, a means of exporting a copy, or a means
of obtaining a copy upon request, of the work in its original “Plain
Vanilla ASCII” or other form. Any alternate format must include the
full Project Gutenberg™ License as specified in paragraph 1.E.1.

1.E.7. Do not charge a fee for access to, viewing, displaying,
performing, copying or distributing any Project Gutenberg™ works
unless you comply with paragraph 1.E.8 or 1.E.9.

1.E.8. You may charge a reasonable fee for copies of or providing
access to or distributing Project Gutenberg™ electronic works
provided that:

• You pay a royalty fee of 20% of the gross profits you derive from
the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who
notifies you in writing (or by e-mail) within 30 days of receipt that
s/he does not agree to the terms of the full Project Gutenberg™
License. You must require such a user to return or destroy all
copies of the works possessed in a physical medium and
discontinue all use of and all access to other copies of Project
Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of
any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project Gutenberg™
electronic work or group of works on different terms than are set
forth in this agreement, you must obtain permission in writing from
the Project Gutenberg Literary Archive Foundation, the manager of
the Project Gutenberg™ trademark. Contact the Foundation as set
forth in Section 3 below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend
considerable effort to identify, do copyright research on, transcribe
and proofread works not protected by U.S. copyright law in creating
the Project Gutenberg™ collection. Despite these efforts, Project
Gutenberg™ electronic works, and the medium on which they may
be stored, may contain “Defects,” such as, but not limited to,
incomplete, inaccurate or corrupt data, transcription errors, a
copyright or other intellectual property infringement, a defective or
damaged disk or other medium, a computer virus, or computer
codes that damage or cannot be read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except
for the “Right of Replacement or Refund” described in paragraph
1.F.3, the Project Gutenberg Literary Archive Foundation, the owner
of the Project Gutenberg™ trademark, and any other party
distributing a Project Gutenberg™ electronic work under this
agreement, disclaim all liability to you for damages, costs and
expenses, including legal fees. YOU AGREE THAT YOU HAVE NO
REMEDIES FOR NEGLIGENCE, STRICT LIABILITY, BREACH OF
WARRANTY OR BREACH OF CONTRACT EXCEPT THOSE
PROVIDED IN PARAGRAPH 1.F.3. YOU AGREE THAT THE
FOUNDATION, THE TRADEMARK OWNER, AND ANY
DISTRIBUTOR UNDER THIS AGREEMENT WILL NOT BE LIABLE
TO YOU FOR ACTUAL, DIRECT, INDIRECT, CONSEQUENTIAL,
PUNITIVE OR INCIDENTAL DAMAGES EVEN IF YOU GIVE
NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you
discover a defect in this electronic work within 90 days of receiving it,
you can receive a refund of the money (if any) you paid for it by
sending a written explanation to the person you received the work
from. If you received the work on a physical medium, you must
return the medium with your written explanation. The person or entity
that provided you with the defective work may elect to provide a
replacement copy in lieu of a refund. If you received the work
electronically, the person or entity providing it to you may choose to
give you a second opportunity to receive the work electronically in
lieu of a refund. If the second copy is also defective, you may
demand a refund in writing without further opportunities to fix the
problem.

1.F.4. Except for the limited right of replacement or refund set forth in
paragraph 1.F.3, this work is provided to you ‘AS-IS’, WITH NO
OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied
warranties or the exclusion or limitation of certain types of damages.
If any disclaimer or limitation set forth in this agreement violates the
law of the state applicable to this agreement, the agreement shall be
interpreted to make the maximum disclaimer or limitation permitted
by the applicable state law. The invalidity or unenforceability of any
provision of this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the
Foundation, the trademark owner, any agent or employee of the
Foundation, anyone providing copies of Project Gutenberg™
electronic works in accordance with this agreement, and any
volunteers associated with the production, promotion and distribution
of Project Gutenberg™ electronic works, harmless from all liability,
costs and expenses, including legal fees, that arise directly or
indirectly from any of the following which you do or cause to occur:
(a) distribution of this or any Project Gutenberg™ work, (b)
alteration, modification, or additions or deletions to any Project
Gutenberg™ work, and (c) any Defect you cause.

Section 2. Information about the Mission of Project Gutenberg™

Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new computers.
It exists because of the efforts of hundreds of volunteers and
donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the
assistance they need are critical to reaching Project Gutenberg™’s
goals and ensuring that the Project Gutenberg™ collection will
remain freely available for generations to come. In 2001, the Project
Gutenberg Literary Archive Foundation was created to provide a
secure and permanent future for Project Gutenberg™ and future
generations. To learn more about the Project Gutenberg Literary
Archive Foundation and how your efforts and donations can help,
see Sections 3 and 4 and the Foundation information page at
www.gutenberg.org.

Section 3. Information about the Project Gutenberg Literary Archive Foundation

The Project Gutenberg Literary Archive Foundation is a non-profit
501(c)(3) educational corporation organized under the laws of the
state of Mississippi and granted tax exempt status by the Internal
Revenue Service. The Foundation’s EIN or federal tax identification
number is 64-6221541. Contributions to the Project Gutenberg
Literary Archive Foundation are tax deductible to the full extent
permitted by U.S. federal laws and your state’s laws.

The Foundation’s business office is located at 809 North 1500 West,
Salt Lake City, UT 84116, (801) 596-1887. Email contact links and up
to date contact information can be found at the Foundation’s website
and official page at www.gutenberg.org/contact

Section 4. Information about Donations to the Project Gutenberg Literary Archive Foundation

Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission of
increasing the number of public domain and licensed works that can
be freely distributed in machine-readable form accessible by the
widest array of equipment including outdated equipment. Many small
donations ($1 to $5,000) are particularly important to maintaining tax
exempt status with the IRS.

The Foundation is committed to complying with the laws regulating
charities and charitable donations in all 50 states of the United
States. Compliance requirements are not uniform and it takes a
considerable effort, much paperwork and many fees to meet and
keep up with these requirements. We do not solicit donations in
locations where we have not received written confirmation of
compliance. To SEND DONATIONS or determine the status of
compliance for any particular state visit www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states where
we have not met the solicitation requirements, we know of no
prohibition against accepting unsolicited donations from donors in
such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot make
any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.

Please check the Project Gutenberg web pages for current donation
methods and addresses. Donations are accepted in a number of
other ways including checks, online payments and credit card
donations. To donate, please visit: www.gutenberg.org/donate.

Section 5. General Information About Project Gutenberg™ electronic works

Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could be
freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose network of
volunteer support.

Project Gutenberg™ eBooks are often created from several printed
editions, all of which are confirmed as not protected by copyright in
the U.S. unless a copyright notice is included. Thus, we do not
necessarily keep eBooks in compliance with any particular paper
edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how
to subscribe to our email newsletter to hear about new eBooks.

Welcome to our website – the ideal destination for book lovers and
knowledge seekers. With a mission to inspire endlessly, we offer a
vast collection of books, ranging from classic literary works to
specialized publications, self-development books, and children's
literature. Each book is a new journey of discovery, expanding
knowledge and enriching the soul of the reader.

Our website is not just a platform for buying books, but a bridge
connecting readers to the timeless values of culture and wisdom. With
an elegant, user-friendly interface and an intelligent search system,
we are committed to providing a quick and convenient shopping
experience. Additionally, our special promotions and home delivery
services ensure that you save time and fully enjoy the joy of reading.

Let us accompany you on the journey of exploring knowledge and
personal growth!

textbookfull.com
