Instant Download GPU Parallel Program Development Using CUDA 1st Edition Tolga Soyata PDF All Chapters
Instant Download GPU Parallel Program Development Using CUDA 1st Edition Tolga Soyata PDF All Chapters
com
https://textbookfull.com/product/gpu-parallel-program-
development-using-cuda-1st-edition-tolga-soyata/
OR CLICK BUTTON
DOWNLOAD NOW
https://textbookfull.com/product/comprehensive-healthcare-simulation-
program-center-development-1st-edition-michael-a-seropian/
textboxfull.com
https://textbookfull.com/product/cybersecurity-program-development-
for-business-the-essential-planning-guide-1st-edition-chris-
moschovitis/
textboxfull.com
https://textbookfull.com/product/freight-transport-and-distribution-
concepts-and-optimisation-models-tolga-bektas/
textboxfull.com
https://textbookfull.com/product/gpu-pro-360-guide-to-geometry-
manipulation-1st-edition-wolfgang-engel/
textboxfull.com
GPU Pro 360 Guide to Mobile Devices 1st Edition Wolfgang
Engel
https://textbookfull.com/product/gpu-pro-360-guide-to-mobile-
devices-1st-edition-wolfgang-engel/
textboxfull.com
https://textbookfull.com/product/gpu-pro-360-guide-to-gpgpu-1st-
edition-wolfgang-engel-editor/
textboxfull.com
https://textbookfull.com/product/gpu-zen-advanced-rendering-
techniques-wolfgang-engel-editor/
textboxfull.com
https://textbookfull.com/product/gpu-pro-360-guide-to-image-space-1st-
edition-wolfgang-engel-author/
textboxfull.com
GPU Parallel Program
Development Using CUDA
Chapman & Hall/CRC
Computational Science Series
SERIES EDITOR
Horst Simon
Deputy Director
Lawrence Berkeley National Laboratory
Berkeley, California, U.S.A.
PUBLISHED TITLES
Tolga Soyata
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize
to copyright holders if permission to publish in this form has not been obtained. If any copyright material
has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, trans-
mitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and regis-
tration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
——————————————————————————————————————————————–
Library of Congress Cataloging-in-Publication Data
——————————————————————————————————————————————–
Names: Soyata, Tolga, 1967- author.
Title: GPU parallel program development using CUDA
/ by Tolga Soyata.
Description: Boca Raton, Florida : CRC Press, [2018] | Includes bibliographical
references and index.
Identifiers: LCCN 2017043292 | ISBN 9781498750752 (hardback) |
ISBN 9781315368290 (e-book)
Subjects: LCSH: Parallel programming (Computer science) | CUDA (Computer architecture) |
Graphics processing units–Programming.
Classification: LCC QA76.642.S67 2018 | DDC 005.2/75–dc23
LC record available at https://lccn.loc.gov/2017043292
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
ix
x Contents
1.1 Harvesting each coconut requires two consecutive 30-second tasks (threads).
Thread 1: get a coconut. Thread 2: crack (process) that coconut using the
hammer. 4
1.2 Simultaneously executing Thread 1 (“1”) and Thread 2 (“2”). Accessing
shared resources will cause a thread to wait (“-”). 6
1.3 Serial (single-threaded) program imflip.c flips a 640×480 dog picture (left)
horizontally (middle) or vertically (right). 8
1.4 Running gdb to catch a segmentation fault. 20
1.5 Running valgrind to catch a memory access error. 23
2.1 Windows Task Manager, showing 1499 threads, however, there is 0% CPU
utilization. 33
3.1 The life cycle of a thread. From the creation to its termination, a thread is
cycled through many different statuses, assigned by the OS. 60
3.2 Memory access patterns of MTFlipH() in Code 2.8. A total of 3200 pixels’
RGB values (9600 Bytes) are flipped for each row. 65
3.3 The memory map of a process when only a single thread is running within
the process (left) or multiple threads are running in it (right). 75
4.1 Inside a computer containing an i7-5930K CPU [10] (CPU5 in Table 3.1),
and 64 GB of DDR4 memory. This PC has a GTX Titan Z GPU that will
be used to test a lot of the programs in Part II. 80
4.2 The imrotate.c program rotates a picture by a specified angle. Original dog
(top left), rotated +10◦ (top right), +45◦ (bottom left), and −75◦ (bottom
right) clockwise. Scaling is done to avoid cropping of the original image area. 84
4.3 The architecture of one core of the i7-5930K CPU (the PC in Figure 4.1).
This core is capable of executing two threads (hyper-threading, as defined
by Intel). These two threads share most of the core resources, but have their
own register files. 92
4.4 Architecture of the i7-5930K CPU (6C/12T). This CPU connects to the
GPUs through an external PCI express bus and memory through the mem-
ory bus. 94
5.1 The imedge.c program is used to detect edges in the original image
astronaut.bmp (top left). Intermediate processing steps are: GaussianFilter()
(top right), Sobel() (bottom left), and finally Threshold() (bottom right). 108
xxiii
Contents xix
Bibliography 435
Index 439
xxiv List of Figures
6.1 Turning the dog picture into a 3D wire frame. Triangles are used to represent
the object, rather than pixels. This representation allows us to map a texture
to each triangle. When the object moves, so does each triangle, along with
their associated textures. To increase the resolution of this kind of an object
representation, we can divide triangles into smaller triangles in a process
called tesselation. 139
6.2 Steps to move triangulated 3D objects. Triangles contain two attributes:
their location and their texture. Objects are moved by performing mathe-
matical operations only on their coordinates. A final texture mapping places
the texture back on the moved object coordinates, while a 3D-to-2D transfor-
mation allows the resulting image to be displayed on a regular 2D computer
monitor. 140
6.3 Three farmer teams compete in Analogy 6.1: (1) Arnold competes alone
with his 2× bigger tractor and “the strongest farmer” reputation, (2) Fred
and Jim compete together in a much smaller tractor than Arnold. (3) Tolga,
along with 32 boy and girl scouts, compete together using a bus. Who wins? 145
6.4 Nvidia Runtime Engine is built into your GPU drivers, shown in your Win-
dows 10 Pro SysTray. When you click the Nvidia symbol, you can open the
Nvidia control panel to see the driver version as well as the parameters of
your GPU(s). 156
6.5 Creating a Visual Studio 2015 CUDA project named imflipG.cu. Assume
that the code will be in a directory named Z:\code\imflipG in this example. 172
6.6 Visual Studio 2015 source files are in the Z:\code\imflipG\imflipG direc-
tory. In this specific example, we will remove the default file, kernel.cu, that
VS 2015 creates. After this, we will add an existing file, imflipG.cu, to the
project. 173
6.7 The default CPU platform is x86. We will change it to x64. We will also
remove the GPU debugging option. 174
6.8 The default Compute Capability is 2.0. This is too old. We will change it to
Compute Capability 3.0, which is done by editing Code Generation under
Device and changing it to compute 30, sm 30. 175
6.9 Compiling imflipG.cu to get the executable file imflipG.exe in the
Z:\code\imflipG\x64\Debug directory. 176
6.10 Running imflipG.exe from a CMD command line window. 177
6.11 The /usr/local directory in Unix contains your CUDA directories. 181
6.12 Creating a new CUDA project using the Eclipse IDE in Unix. 183
7.1 The PCIe bus connects for the host (CPU) and the device(s) (GPUs).
The host and each device have their own I/O controllers to allow transfers
through the PCIe bus, while both the host and the device have their own
memory, with a dedicated bus to it; in the GPU this memory is called global
memory. 205
List of Figures xxv
8.1 Analogy 8.1 for executing a massively parallel program using a significant
number of GPU cores, which receive their instructions and data from differ-
ent sources. Melissa (Memory controller ) is solely responsible for bringing
the coconuts from the jungle and dumping them into the big barrel (L2$).
Larry (L2$ controller ) is responsible for distributing these coconuts into the
smaller barrels (L1$) of Laura, Linda, Lilly, and Libby; eventually, these four
folks distribute the coconuts (data) to the scouts (GPU cores). On the right
side, Gina (Giga-Thread Scheduler ) has the big list of tasks (list of blocks to
be executed ); she assigns each block to a school bus (SM or streaming mul-
tiprocessor ). Inside the bus, one person — Tolga, Tony, Tom, and Tim —
is responsible to assign them to the scouts (instruction schedulers). 228
8.2 The internal architecture of the GTX550Ti GPU. A total of 192 GPU cores
are organized into six streaming multiprocessor (SM) groups of 32 GPU
cores. A single L2$ is shared among all 192 cores, while each SM has its
own L1$. A dedicated memory controller is responsible for bringing data in
and out of the GDDR5 global memory and dumping it into the shared L2$,
while a dedicated host interface is responsible for shuttling data (and code)
between the CPU and GPU over the PCIe bus. 230
8.3 A sample output of the imedgeG.cu program executed on the astronaut.bmp
image using a GTX Titan Z GPU. Kernel execution times and the amount
of data movement for each kernel is clearly shown. 242
9.1 GF110 Fermi architecture with 16 SMs, where each SM houses 32 cores, 16
LD/ST units, and 4 Special Function Units (SFUs). The highest end Fermi
GPU contains 512 cores (e.g., GTX 580). 264
9.2 GF110 Fermi SM structure. Each SM has a 128 KB register file that contains
32,768 (32 K) registers, where each register is 32-bits. This register file feeds
operands to the 32 cores and 4 Special Function Units (SFU). 16 Load/Store
(LD/ST) units are used to queue memory load/store requests. A 64 KB total
cache memory is used for L1$ and shared memory. 265
9.3 GK110 Kepler architecture with 15 SMXs, where each SMX houses 192
cores, 48 double precision units (DPU), 32 LD/ST units, and 32 Special
Function Units (SFU). The highest end Kepler GPU contains 2880 cores
(e.g., GTX Titan Black); its “double” version GTX Titan Z contains 5760
cores. 266
9.4 GK110 Kepler SMX structure. A 256 KB (64 K-register) register file feeds
192 cores, 64 Double-Precision Units (DPU), 32 Load/Store units, and 32
SFUs. Four warp schedulers can schedule four warps, which are dispatched
as 8 half-warps. Read-only cache is used to hold constants. 267
9.5 GM200 Maxwell architecture with 24 SMMs, housed inside 6 larger GPC
units; each SMM houses 128 cores, 32 LD/ST units, and 32 Special Function
Units (SFU), does not contain double-precision units (DPUs). The highest
end Maxwell GPU contains 3072 cores (e.g., GTX Titan X). 268
9.6 GM200 Maxwell SMM structure consists of 4 identical sub-structures with
32 cores, 8 LD/ST units, 8 SFUs, and 16 K registers. Two of these sub-
structures share an L1$, while four of them share a 96 KB shared memory. 269
9.7 GP100 Pascal architecture with 60 SMs, housed inside 6 larger GPC units,
each containing 10 SMs. The highest end Pascal GPU contains 3840 cores
(e.g., P100 compute accelerator). NVLink and High Bandwidth Memory
xxvi List of Figures
X.
Befejezés.
I. A harc küszöbén 5
II. A hulló csillag 12
III. A horselli réten 17
IV. A henger megnyílik 20
V. A hősugár 24
VI. A hősugár a chobhami országúton 28
VII. Hogyan értem haza? 31
VIII. Péntek éjjel 36
IX. Kezdődik a harc 39
X. A viharban 45
XI. Az ablakban 52
XII. Mit láttam Weybridge és Shepperton
pusztulásából? 59
XIII. Hogy találkoztam a tiszteletessel? 71
XIV. Londonban 77
XV. Mi történt Surreyben? 89
XVI. A londoni kivándorlás 98
XVII. A «Villám» 112
Második rész. A Föld a Mars-lakók uralma alatt.
Updated editions will replace the previous one—the old editions will
be renamed.
1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside the
United States, check the laws of your country in addition to the terms
of this agreement before downloading, copying, displaying,
performing, distributing or creating derivative works based on this
work or any other Project Gutenberg™ work. The Foundation makes
no representations concerning the copyright status of any work in
any country other than the United States.
• You pay a royalty fee of 20% of the gross profits you derive from
the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”
• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.
1.F.
1.F.4. Except for the limited right of replacement or refund set forth in
paragraph 1.F.3, this work is provided to you ‘AS-IS’, WITH NO
OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.
Please check the Project Gutenberg web pages for current donation
methods and addresses. Donations are accepted in a number of
other ways including checks, online payments and credit card
donations. To donate, please visit: www.gutenberg.org/donate.
Most people start at our website which has the main PG search
facility: www.gutenberg.org.
Our website is not just a platform for buying books, but a bridge
connecting readers to the timeless values of culture and wisdom. With
an elegant, user-friendly interface and an intelligent search system,
we are committed to providing a quick and convenient shopping
experience. Additionally, our special promotions and home delivery
services ensure that you save time and fully enjoy the joy of reading.
textbookfull.com