
Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL, 2nd Edition, James Reinders

Download: https://ebookbell.com/product/data-parallel-c-programming-accelerated-systems-using-c-and-sycl-2nd-edition-james-reinders-52721604

Explore and download more ebooks at ebookbell.com


Here are some recommended products that we believe you will be
interested in. You can click the link to download.

Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL, James Reinders, Ben Ashbaugh, James Brodman, Michael Kinsner, John Pennycook, Xinmin Tian
https://ebookbell.com/product/data-parallel-c-programming-accelerated-systems-using-c-and-sycl-james-reinders-ben-ashbaugh-james-brodman-michael-kinsner-john-pennycook-xinmin-tian-52722142

Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems Using C++ and SYCL, 1st Ed, James Reinders
https://ebookbell.com/product/data-parallel-c-mastering-dpc-for-programming-of-heterogeneous-systems-using-c-and-sycl-1st-ed-james-reinders-30716410

Parallel Computing for Data Science: With Examples in R, C++ and CUDA, 2nd Edition, Norman Matloff
https://ebookbell.com/product/parallel-computing-for-data-science-with-examples-in-r-c-and-cuda-2nd-edition-norman-matloff-5127002

Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing, 1st Edition, Bhabani Shankar Prasad Mishra
https://ebookbell.com/product/techniques-and-environments-for-big-data-analysis-parallel-cloud-and-grid-computing-1st-edition-bhabani-shankar-prasad-mishra-5355428

Parallel Spatial-Data Conversion Engine: Enabling Fast Sharing of Massive Geospatial Data, Shuai Zhang
https://ebookbell.com/product/parallel-spatialdata-conversion-engine-enabling-fast-sharing-of-massive-geospatial-data-shuai-zhang-10882728

Ultimate Parallel and Distributed Computing with Julia for Data Science: Excel in Data Analysis, Statistical Modeling and Machine Learning by Leveraging MLBase.jl and MLJ.jl to Optimize Workflows, English Edition, Nabanita Dash
https://ebookbell.com/product/ultimate-parallel-and-distributed-computing-with-julia-for-data-science-excel-in-data-analysis-statistical-modeling-and-machine-learning-by-leveraging-mlbasejl-and-mljjl-to-optimize-workflows-english-edition-nabanita-dash-54858136

Sequential and Parallel Algorithms and Data Structures: The Basic Toolbox, Sanders
https://ebookbell.com/product/sequential-and-parallel-algorithms-and-data-structures-the-basic-toolbox-sanders-10600742

Parallel Computing Architectures and APIs: IoT Big Data Stream Processing, Vivek Kale
https://ebookbell.com/product/parallel-computing-architectures-and-apis-iot-big-data-stream-processing-vivek-kale-11088800

Data Mining for Association Rules and Sequential Patterns: Sequential and Parallel Algorithms, 1st Edition, Jean-Marc Adamo
https://ebookbell.com/product/data-mining-for-association-rules-and-sequential-patterns-sequential-and-parallel-algorithms-1st-edition-jeanmarc-adamo-auth-4198728
Data Parallel C++
Programming Accelerated Systems Using C++ and SYCL

Second Edition

James Reinders
Ben Ashbaugh
James Brodman
Michael Kinsner
John Pennycook
Xinmin Tian

Foreword by Erik Lindahl, GROMACS and Stockholm University
Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL, Second Edition
James Reinders, Beaverton, OR, USA
Michael Kinsner, Halifax, NS, Canada
Ben Ashbaugh, Folsom, CA, USA
John Pennycook, San Jose, CA, USA
James Brodman, Marlborough, MA, USA
Xinmin Tian, Fremont, CA, USA

ISBN-13 (pbk): 978-1-4842-9690-5
ISBN-13 (electronic): 978-1-4842-9691-2


https://doi.org/10.1007/978-1-4842-9691-2

Copyright © 2023 by Intel Corporation


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes
were made.
The images or other third party material in this book are included in the book’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the book’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of
a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the
trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such,
is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Intel, the Intel logo, Intel Optane, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. OpenCL
and the OpenCL logo are trademarks of Apple Inc. in the U.S. and/or other countries. OpenMP and the OpenMP logo are
trademarks of the OpenMP Architecture Review Board in the U.S. and/or other countries. SYCL, the SYCL logo, Khronos and
the Khronos Group logo are trademarks of the Khronos Group Inc. The open source DPC++ compiler is based on a published
Khronos SYCL specification. The current conformance status of SYCL implementations can be found at
https://www.khronos.org/conformance/adopters/conformant-products/sycl.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests are measured using specific computer systems, components, software, operations and functions. Any change
to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you
in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. For more complete information visit https://www.intel.com/benchmarks. Performance results are based on testing
as of dates shown in configuration and may not reflect all publicly available security updates. See configuration disclosure for
details. No product or component can be absolutely secure. Intel technologies’ features and benefits depend on system
configuration and may require enabled hardware, software or service activation. Performance varies depending on system
configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at
www.intel.com.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the
authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The
publisher makes no warranty, express or implied, with respect to the material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Susan McDermot
Development Editor: James Markham
Coordinating Editor: Jessica Vakili
Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 NY Plaza, New York, NY 10004.
Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit https://www.springeronline.com.
Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM
Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail booktranslations@springernature.com; for reprint, paperback, or audio rights,
please e-mail bookpermissions@springernature.com.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also
available for most titles. For more information, reference our Print and eBook Bulk Sales web page at
https://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to readers on the Github
repository: https://github.com/Apress/Data-Parallel-CPP. For more detailed information, please visit
https://www.apress.com/gp/services/source-code.
Paper in this product is recyclable
Table of Contents

About the Authors .......... xix
Preface .......... xxi
Foreword .......... xxv
Acknowledgments .......... xxix

Chapter 1: Introduction .......... 1
    Read the Book, Not the Spec .......... 2
    SYCL 2020 and DPC++ .......... 3
    Why Not CUDA? .......... 4
    Why Standard C++ with SYCL? .......... 5
    Getting a C++ Compiler with SYCL Support .......... 5
    Hello, World! and a SYCL Program Dissection .......... 6
    Queues and Actions .......... 7
    It Is All About Parallelism .......... 8
    Throughput .......... 8
    Latency .......... 9
    Think Parallel .......... 9
    Amdahl and Gustafson .......... 10
    Scaling .......... 11
    Heterogeneous Systems .......... 11
    Data-Parallel Programming .......... 13
    Key Attributes of C++ with SYCL .......... 14
    Single-Source .......... 14
    Host .......... 15
    Devices .......... 15
    Kernel Code .......... 16
    Asynchronous Execution .......... 18
    Race Conditions When We Make a Mistake .......... 19
    Deadlock .......... 22
    C++ Lambda Expressions .......... 23
    Functional Portability and Performance Portability .......... 26
    Concurrency vs. Parallelism .......... 28
    Summary .......... 30

Chapter 2: Where Code Executes .......... 31
    Single-Source .......... 31
    Host Code .......... 33
    Device Code .......... 34
    Choosing Devices .......... 36
    Method#1: Run on a Device of Any Type .......... 37
    Queues .......... 37
    Binding a Queue to a Device When Any Device Will Do .......... 41
    Method#2: Using a CPU Device for Development, Debugging, and Deployment .......... 42
    Method#3: Using a GPU (or Other Accelerators) .......... 45
    Accelerator Devices .......... 46
    Device Selectors .......... 46
    Method#4: Using Multiple Devices .......... 50
    Method#5: Custom (Very Specific) Device Selection .......... 51
    Selection Based on Device Aspects .......... 51
    Selection Through a Custom Selector .......... 53
    Creating Work on a Device .......... 54
    Introducing the Task Graph .......... 54
    Where Is the Device Code? .......... 56
    Actions .......... 60
    Host tasks .......... 63
    Summary .......... 65

Chapter 3: Data Management .......... 67
    Introduction .......... 68
    The Data Management Problem .......... 69
    Device Local vs. Device Remote .......... 69
    Managing Multiple Memories .......... 70
    Explicit Data Movement .......... 70
    Implicit Data Movement .......... 71
    Selecting the Right Strategy .......... 71
    USM, Buffers, and Images .......... 72
    Unified Shared Memory .......... 72
    Accessing Memory Through Pointers .......... 73
    USM and Data Movement .......... 74
    Buffers .......... 77
    Creating Buffers .......... 78
    Accessing Buffers .......... 78
    Access Modes .......... 80
    Ordering the Uses of Data .......... 80
    In-order Queues .......... 83
    Out-of-Order Queues .......... 84
    Choosing a Data Management Strategy .......... 92
    Handler Class: Key Members .......... 93
    Summary .......... 96

Chapter 4: Expressing Parallelism .......... 97
    Parallelism Within Kernels .......... 98
    Loops vs. Kernels .......... 99
    Multidimensional Kernels .......... 101
    Overview of Language Features .......... 102
    Separating Kernels from Host Code .......... 102
    Different Forms of Parallel Kernels .......... 103
    Basic Data-Parallel Kernels .......... 105
    Understanding Basic Data-Parallel Kernels .......... 105
    Writing Basic Data-Parallel Kernels .......... 107
    Details of Basic Data-Parallel Kernels .......... 109
    Explicit ND-Range Kernels .......... 112
    Understanding Explicit ND-Range Parallel Kernels .......... 113
    Writing Explicit ND-Range Data-Parallel Kernels .......... 121
    Details of Explicit ND-Range Data-Parallel Kernels .......... 122
    Mapping Computation to Work-Items .......... 127
    One-to-One Mapping .......... 128
    Many-to-One Mapping .......... 128
    Choosing a Kernel Form .......... 130
    Summary .......... 132

Chapter 5: Error Handling .......... 135
    Safety First .......... 135
    Types of Errors .......... 136
    Let’s Create Some Errors! .......... 138
    Synchronous Error .......... 139
    Asynchronous Error .......... 139
    Application Error Handling Strategy .......... 140
    Ignoring Error Handling .......... 141
    Synchronous Error Handling .......... 143
    Asynchronous Error Handling .......... 144
    The Asynchronous Handler .......... 145
    Invocation of the Handler .......... 148
    Errors on a Device .......... 149
    Summary .......... 150

Chapter 6: Unified Shared Memory .......... 153
    Why Should We Use USM? .......... 153
    Allocation Types .......... 154
    Device Allocations .......... 154
    Host Allocations .......... 155
    Shared Allocations .......... 155
    Allocating Memory .......... 156
    What Do We Need to Know? .......... 156
    Multiple Styles .......... 157
    Deallocating Memory .......... 164
    Allocation Example .......... 165
    Data Management .......... 165
    Initialization .......... 165
    Data Movement .......... 166
    Queries .......... 174
    One More Thing .......... 177
    Summary .......... 178

Chapter 7: Buffers .......... 179
    Buffers .......... 180
    Buffer Creation .......... 181
    What Can We Do with a Buffer? .......... 188
    Accessors .......... 189
    Accessor Creation .......... 192
    What Can We Do with an Accessor? .......... 198
    Summary .......... 199

Chapter 8: Scheduling Kernels and Data Movement .......... 201
    What Is Graph Scheduling? .......... 202
    How Graphs Work in SYCL .......... 202
    Command Group Actions .......... 203
    How Command Groups Declare Dependences .......... 203
    Examples .......... 204
    When Are the Parts of a Command Group Executed? .......... 213
    Data Movement .......... 213
    Explicit Data Movement .......... 213
    Implicit Data Movement .......... 214
    Synchronizing with the Host .......... 216
    Summary .......... 218

Chapter 9: Communication and Synchronization .......... 221
    Work-Groups and Work-Items .......... 221
    Building Blocks for Efficient Communication .......... 223
    Synchronization via Barriers .......... 223
    Work-Group Local Memory .......... 225
    Using Work-Group Barriers and Local Memory .......... 227
    Work-Group Barriers and Local Memory in ND-Range Kernels .......... 231
    Sub-Groups .......... 235
    Synchronization via Sub-Group Barriers .......... 236
    Exchanging Data Within a Sub-Group .......... 237
    A Full Sub-Group ND-Range Kernel Example .......... 239
    Group Functions and Group Algorithms .......... 241
    Broadcast .......... 241
    Votes .......... 242
    Shuffles .......... 243
    Summary .......... 246

Chapter 10: Defining Kernels .......... 249
    Why Three Ways to Represent a Kernel? .......... 249
    Kernels as Lambda Expressions .......... 251
    Elements of a Kernel Lambda Expression .......... 251
    Identifying Kernel Lambda Expressions .......... 254
    Kernels as Named Function Objects .......... 255
    Elements of a Kernel Named Function Object .......... 256
    Kernels in Kernel Bundles .......... 259
    Interoperability with Other APIs .......... 264
    Summary .......... 264

Chapter 11: Vectors and Math Arrays .......... 267
    The Ambiguity of Vector Types .......... 268
    Our Mental Model for SYCL Vector Types .......... 269
    Math Array (marray) .......... 271
    Vector (vec) .......... 273
    Loads and Stores .......... 274
    Interoperability with Backend-Native Vector Types .......... 276
    Swizzle Operations .......... 276
    How Vector Types Execute .......... 280
    Vectors as Convenience Types .......... 280
    Vectors as SIMD Types .......... 284
    Summary .......... 286

Chapter 12: Device Information and Kernel Specialization .......... 289
    Is There a GPU Present? .......... 290
    Refining Kernel Code to Be More Prescriptive .......... 291
    How to Enumerate Devices and Capabilities .......... 293
    Aspects .......... 296
    Custom Device Selector .......... 298
    Being Curious: get_info<> .......... 300
    Being More Curious: Detailed Enumeration Code .......... 301
    Very Curious: get_info plus has() .......... 303
    Device Information Descriptors .......... 303
    Device-Specific Kernel Information Descriptors .......... 303
    The Specifics: Those of “Correctness” .......... 304
    Device Queries .......... 305
    Kernel Queries .......... 306
    The Specifics: Those of “Tuning/Optimization” .......... 307
    Device Queries .......... 307
    Kernel Queries .......... 308
    Runtime vs. Compile-Time Properties .......... 308
    Kernel Specialization .......... 309
    Summary .......... 312

Chapter 13: Practical Tips .......... 313
    Getting the Code Samples and a Compiler .......... 313
    Online Resources .......... 313
    Platform Model .......... 314
    Multiarchitecture Binaries .......... 315
    Compilation Model .......... 316
    Contexts: Important Things to Know .......... 319
    Adding SYCL to Existing C++ Programs .......... 321
    Considerations When Using Multiple Compilers .......... 322
    Debugging .......... 323
    Debugging Deadlock and Other Synchronization Issues .......... 325
    Debugging Kernel Code .......... 326
    Debugging Runtime Failures .......... 327
    Queue Profiling and Resulting Timing Capabilities .......... 330
    Tracing and Profiling Tools Interfaces .......... 334
    Initializing Data and Accessing Kernel Outputs .......... 335
    Multiple Translation Units .......... 344
    Performance Implication of Multiple Translation Units .......... 345
    When Anonymous Lambdas Need Names .......... 345
    Summary .......... 346

Chapter 14: Common Parallel Patterns .......... 349
    Understanding the Patterns .......... 350
    Map .......... 351
    Stencil .......... 352
    Reduction .......... 354
    Scan .......... 356
    Pack and Unpack .......... 358
    Using Built-In Functions and Libraries .......... 360
    The SYCL Reduction Library .......... 360
    Group Algorithms .......... 366
    Direct Programming .......... 370
    Map .......... 370
    Stencil .......... 371
    Reduction .......... 373
    Scan .......... 374
    Pack and Unpack .......... 377
    Summary .......... 380
    For More Information .......... 381

Chapter 15: Programming for GPUs .......... 383
    Performance Caveats .......... 383
    How GPUs Work .......... 384
    GPU Building Blocks .......... 384
    Simpler Processors (but More of Them) .......... 386
    Simplified Control Logic (SIMD Instructions) .......... 391
    Switching Work to Hide Latency .......... 398
    Offloading Kernels to GPUs .......... 400
    SYCL Runtime Library .......... 400
    GPU Software Drivers .......... 401
    GPU Hardware .......... 402
    Beware the Cost of Offloading! .......... 403
    GPU Kernel Best Practices .......... 405
    Accessing Global Memory .......... 405
    Accessing Work-Group Local Memory .......... 409
    Avoiding Local Memory Entirely with Sub-Groups .......... 412
    Optimizing Computation Using Small Data Types .......... 412
    Optimizing Math Functions .......... 413
    Specialized Functions and Extensions .......... 414
    Summary .......... 414
    For More Information .......... 415

Chapter 16: Programming for CPUs .......... 417
    Performance Caveats .......... 418
    The Basics of Multicore CPUs .......... 419
    The Basics of SIMD Hardware .......... 422
    Exploiting Thread-Level Parallelism .......... 428
    Thread Affinity Insight .......... 431
    Be Mindful of First Touch to Memory .......... 435
    SIMD Vectorization on CPU .......... 436
    Ensure SIMD Execution Legality .......... 437
    SIMD Masking and Cost .......... 440
    Avoid Array of Struct for SIMD Efficiency .......... 442
    Data Type Impact on SIMD Efficiency .......... 444
    SIMD Execution Using single_task .......... 446
    Summary .......... 448

Chapter 17: Programming for FPGAs .......... 451
    Performance Caveats .......... 452
    How to Think About FPGAs .......... 452
    Pipeline Parallelism .......... 456
    Kernels Consume Chip “Area” .......... 459
    When to Use an FPGA .......... 460
    Lots and Lots of Work .......... 460
    Custom Operations or Operation Widths .......... 461
    Scalar Data Flow .......... 462
    Low Latency and Rich Connectivity .......... 463
    Customized Memory Systems .......... 464
    Running on an FPGA .......... 465
    Compile Times .......... 467
    The FPGA Emulator .......... 469
    FPGA Hardware Compilation Occurs “Ahead-of-Time” .......... 470
    Writing Kernels for FPGAs .......... 471
    Exposing Parallelism .......... 472
    Keeping the Pipeline Busy Using ND-Ranges .......... 475
    Pipelines Do Not Mind Data Dependences! .......... 478
    Spatial Pipeline Implementation of a Loop .......... 481
    Loop Initiation Interval .......... 483
    Pipes .......... 489
    Custom Memory Systems .......... 495
    Some Closing Topics .......... 498
    FPGA Building Blocks .......... 498
    Clock Frequency .......... 500
    Summary .......... 501

Chapter 18: Libraries .......... 503
    Built-In Functions .......... 504
    Use the sycl:: Prefix with Built-In Functions .......... 506
    The C++ Standard Library .......... 507
    oneAPI DPC++ Library (oneDPL) .......... 510
    SYCL Execution Policy .......... 511
    Using oneDPL with Buffers .......... 513
    Using oneDPL with USM .......... 517
    Error Handling with SYCL Execution Policies .......... 519
    Summary .......... 520

Chapter 19: Memory Model and Atomics .......... 523
    What’s in a Memory Model? .......... 525
    Data Races and Synchronization .......... 526
    Barriers and Fences .......... 529
    Atomic Operations .......... 531
    Memory Ordering .......... 532
    The Memory Model .......... 534
    The memory_order Enumeration Class .......... 536
    The memory_scope Enumeration Class .......... 538
    Querying Device Capabilities .......... 540
    Barriers and Fences .......... 542
    Atomic Operations in SYCL .......... 543
    Using Atomics with Buffers .......... 548
    Using Atomics with Unified Shared Memory .......... 550
    Using Atomics in Real Life .......... 550
    Computing a Histogram .......... 551
    Implementing Device-Wide Synchronization .......... 553
    Summary .......... 556
    For More Information .......... 557

Chapter 20: Backend Interoperability .......... 559
    What Is Backend Interoperability? .......... 559
    When Is Backend Interoperability Useful? .......... 561
    Adding SYCL to an Existing Codebase .......... 562
    Using Existing Libraries with SYCL .......... 564
    Using Backend Interoperability for Kernels .......... 569
    Interoperability with API-Defined Kernel Objects .......... 569
    Interoperability with Non-SYCL Source Languages .......... 571
    Backend Interoperability Hints and Tips .......... 574
    Choosing a Device for a Specific Backend .......... 574
    Be Careful About Contexts! .......... 576
    Access Low-Level API-Specific Features .......... 576
    Support for Other Backends .......... 577
    Summary .......... 577

Chapter 21: Migrating CUDA Code .......... 579
    Design Differences Between CUDA and SYCL .......... 579
    Multiple Targets vs. Single Device Targets .......... 579
    Aligning to C++ vs. Extending C++ .......... 581
    Terminology Differences Between CUDA and SYCL .......... 582
    Similarities and Differences .......... 583
    Execution Model .......... 584
    Memory Model .......... 589
    Other Differences .......... 592
    Features in CUDA That Aren’t In SYCL… Yet! .......... 595
    Global Variables .......... 595
    Cooperative Groups .......... 596
    Matrix Multiplication Hardware .......... 597
    Porting Tools and Techniques .......... 598
    Migrating Code with dpct and SYCLomatic .......... 598
    Summary .......... 603
    For More Information .......... 604

Epilogue: Future Direction of SYCL .......... 605

Index .......... 615
About the Authors
James Reinders is an Engineer at Intel Corporation with more than four
decades of experience in parallel computing and is an author/coauthor/
editor of more than ten technical books related to parallel programming.
James has a passion for system optimization and teaching. He has had the
great fortune to help make contributions to several of the world’s fastest
computers (#1 on the TOP500 list) as well as many other supercomputers
and software developer tools.

Ben Ashbaugh is a Software Architect at Intel Corporation where he has
worked for over 20 years developing software drivers and compilers for
Intel graphics products. For the past ten years, Ben has focused on parallel
programming models for general-purpose computation on graphics
processors, including SYCL and the DPC++ compiler. Ben is active in the
Khronos SYCL, OpenCL, and SPIR working groups, helping define industry
standards for parallel programming, and he has authored numerous
extensions to expose unique Intel GPU features.

James Brodman is a Principal Engineer at Intel Corporation working on
runtimes and compilers for parallel programming, and he is one of the
architects of DPC++. James has a Ph.D. in Computer Science from the
University of Illinois at Urbana-Champaign.


Michael Kinsner is a Principal Engineer at Intel Corporation developing
parallel programming languages and compilers for a variety of
architectures. Michael contributes extensively to spatial architectures and
programming models and is an Intel representative within The Khronos
Group where he works on the SYCL and OpenCL industry standards for
parallel programming. Mike has a Ph.D. in Computer Engineering from
McMaster University and is passionate about programming models that
cross architectures while still enabling performance.

John Pennycook is a Software Enabling and Optimization Architect
at Intel Corporation, focused on enabling developers to fully utilize
the parallelism available in modern processors. John is experienced
in optimizing and parallelizing applications from a range of scientific
domains, and previously served as Intel’s representative on the steering
committee for the Intel eXtreme Performance User’s Group (IXPUG).
John has a Ph.D. in Computer Science from the University of Warwick.
His research interests are varied, but a recurring theme is the ability to
achieve application “performance portability” across different hardware
architectures.

Xinmin Tian is an Intel Fellow and Compiler Architect at Intel Corporation
and serves as Intel’s representative on the OpenMP Architecture Review
Board (ARB). Xinmin has been driving OpenMP offloading, vectorization,
and parallelization compiler technologies for Intel architectures. His
current focus is on LLVM-based OpenMP offloading, SYCL/DPC++
compiler optimizations for CPUs/GPUs, and tuning HPC/AI application
performance. He has a Ph.D. in Computer Science from Tsinghua
University, holds 27 US patents, has published over 60 technical papers
with more than 1300 citations of his work, and has coauthored two books that
span his expertise.

Preface
If you are new to parallel programming, that is okay. If you have never
heard of SYCL or the DPC++ compiler, that is also okay.
Compared with programming in CUDA, C++ with SYCL offers
portability beyond NVIDIA and beyond GPUs, plus tight alignment
with modern C++ as it evolves. C++ with SYCL offers these
advantages without sacrificing performance.
C++ with SYCL allows us to accelerate our applications by harnessing
the combined capabilities of CPUs, GPUs, FPGAs, and processing devices
of the future without being tied to any one vendor.
SYCL is an industry-driven Khronos Group standard adding
advanced support for data parallelism with C++ to exploit accelerated
(heterogeneous) systems. SYCL provides mechanisms for C++ compilers
that are highly synergistic with C++ and C++ build systems. DPC++ is an
open source compiler project based on LLVM that adds SYCL support.
All examples in this book should work with any C++ compiler supporting
SYCL 2020 including the DPC++ compiler.
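
As a taste of what is ahead, here is a minimal sketch of a complete
SYCL 2020 program. It is our illustration for this preface, not one
of the book's sample listings; the array size is arbitrary, and it
assumes a SYCL 2020 compiler such as DPC++:

    #include <sycl/sycl.hpp>  // SYCL 2020 header location
    #include <iostream>

    int main() {
      sycl::queue q;  // queue bound to a default-selected device
      constexpr size_t N = 16;

      // Unified Shared Memory: visible to both host and device
      int *data = sycl::malloc_shared<int>(N, q);

      // Data-parallel kernel: one work-item per array element
      q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
        data[i] = static_cast<int>(i) * 2;
      }).wait();  // wait for the kernel before reading on the host

      for (size_t i = 0; i < N; ++i)
        std::cout << data[i] << ' ';
      std::cout << '\n';

      sycl::free(data, q);
      return 0;
    }

Every line of a program much like this one is dissected in Chapter 1.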
If you are a C programmer who is not well versed in C++, you are in
good company. Several of the authors of this book happily share that
they picked up much of C++ by reading books that utilized C++ like this
one. With a little patience, this book should also be approachable by C
programmers with a desire to write modern C++ programs.

Second Edition
With the benefit of feedback from a growing community of SYCL users, we
have been able to add content that helps readers learn SYCL better than ever.


This edition teaches C++ with SYCL 2020. The first edition preceded
the SYCL 2020 specification, which differed only slightly from what the
first edition taught (the most obvious changes for SYCL 2020 in this edition
are the header file location, the device selector syntax, and dropping an
explicit host device).
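
For readers of the first edition, a small sketch of those three
changes, written by us for illustration (the selector values such as
sycl::gpu_selector_v are part of SYCL 2020):

    #include <sycl/sycl.hpp>  // SYCL 2020 header (was <CL/sycl.hpp>)

    int main() {
      // First edition (SYCL 1.2.1) style, for comparison:
      //   sycl::queue q{sycl::gpu_selector{}};   // selector object
      //   sycl::queue h{sycl::host_selector{}};  // explicit host device
      sycl::queue q{sycl::gpu_selector_v};  // SYCL 2020: selector value;
                                            // throws if no GPU is present
      // The explicit host device is gone in SYCL 2020; a CPU device
      // (sycl::cpu_selector_v) covers development and debugging instead.
      return 0;
    }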

Important resources for updated SYCL information, including any
known book errata, include the book GitHub
(https://github.com/Apress/data-parallel-CPP), the Khronos Group SYCL
standards website (www.khronos.org/sycl), and a key SYCL
education website (https://sycl.tech).

Chapters 20 and 21 are additions encouraged by readers of the first
edition of this book.
We added Chapter 20 to discuss backend interoperability. One of
the key goals of the SYCL 2020 standard is to enable broad support for
hardware from many vendors with many architectures. This required
expanding beyond the OpenCL-only backend support of SYCL 1.2.1. While
generally “it just works,” Chapter 20 explains this in more detail for those
who find it valuable to understand and interface at this level.
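
As a tiny illustration of interfacing at this level, here is a sketch
of ours that assumes the OpenCL backend is present (the function name
is ours, and the interop header path follows the SYCL 2020
specification; implementations may differ slightly). Chapter 20
covers the details and the other backends:

    #include <sycl/sycl.hpp>
    #include <sycl/backend/opencl.hpp>  // OpenCL interop header per SYCL 2020

    // Retrieve the native OpenCL handle behind a SYCL queue so it can
    // be handed to existing OpenCL-based code. Only valid when the
    // queue's backend is OpenCL.
    cl_command_queue native_queue_of(sycl::queue &q) {
      return sycl::get_native<sycl::backend::opencl>(q);
    }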
For experienced CUDA programmers, we have added Chapter 21 to
explicitly connect C++ with SYCL concepts to CUDA concepts both in
terms of approach and vocabulary. While the core issues of expressing
heterogeneous parallelism are fundamentally similar, C++ with SYCL offers
many benefits because of its multivendor and multiarchitecture approach.
Chapter 21 is the only place we mention CUDA terminology; the rest of this
book teaches using C++ and SYCL terminology with its open multivendor,
multiarchitecture approaches. In Chapter 21, we strongly encourage
looking at the open source tool “SYCLomatic” (github.com/oneapi-src/SYCLomatic),
which helps automate migration of CUDA code. Because it is helpful, we
recommend it as the preferred first step in migrating code.
Developers using C++ with SYCL have been reporting strong results on
NVIDIA, AMD, and Intel GPUs on both codes that have been ported from
CUDA and original C++ with SYCL code. The resulting C++ with SYCL
offers portability that is not possible with NVIDIA CUDA.
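
As a small preview of that mapping, here is a sketch of ours (not a
listing from Chapter 21) showing how a typical CUDA-style vector-add
launch is expressed as a SYCL nd_range kernel. The function name and
work-group size are illustrative, and the pointers are assumed to be
USM device allocations; the caller is responsible for synchronization:

    #include <sycl/sycl.hpp>

    // CUDA launch for comparison (comment only):
    //   vecAdd<<<numBlocks, blockSize>>>(a, b, c, n);
    void vec_add(sycl::queue &q, const float *a, const float *b,
                 float *c, size_t n) {
      constexpr size_t wg = 256;          // work-group ~ CUDA thread block
      size_t global = ((n + wg - 1) / wg) * wg;  // global size rounded up
                                                 // to a work-group multiple
      q.parallel_for(sycl::nd_range<1>{global, wg},
                     [=](sycl::nd_item<1> it) {
        size_t i = it.get_global_id(0);   // ~ blockIdx.x*blockDim.x+threadIdx.x
        if (i < n)                        // guard the rounded-up range
          c[i] = a[i] + b[i];
      });
    }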
The evolution of C++, SYCL, and compilers including DPC++
continues. Prospects for the future are discussed in the Epilogue, after
we have taken a journey together to learn how to create programs for
heterogeneous systems using C++ with SYCL.
It is our hope that this book supports and helps grow the SYCL
community and helps promote data-parallel programming in C++
with SYCL.

Structure of This Book


This book takes us on a journey through what we need to know to be an
effective programmer of accelerated/heterogeneous systems using C++
with SYCL.

Chapters 1–4: Lay Foundations


Chapters 1–4 are important to read in order when first approaching C++
with SYCL.
Chapter 1 lays the first foundation by covering core concepts that are
either new or worth refreshing in our minds.
Chapters 2–4 lay a foundation of understanding for data-parallel
programming in C++ with SYCL. When we finish reading Chapters 1–4,
we will have a solid foundation for data-parallel programming in C++.
Chapters 1–4 build on each other and are best read in order.


Chapters 5–12: Build on Foundations


With the foundations established, Chapters 5–12 fill in vital details by
building on each other to some degree while being easy to jump between
as desired. All these chapters should be valuable to all users of C++
with SYCL.

Chapters 13–21: Tips/Advice for SYCL in Practice


These final chapters offer advice and details for specific needs. We
encourage at least skimming them all to find content that is important to
your needs.

Epilogue: Speculating on the Future


The book concludes with an Epilogue that discusses likely and potential
future directions for C++ with SYCL, and the Data Parallel C++ compiler
for SYCL.
We wish you the best as you learn to use C++ with SYCL.

Foreword
SYCL 2020 is a milestone in parallel computing. For the first time we have
a modern, stable, feature-complete, and portable open standard that can
target all types of hardware, and the book you hold in your hand is the
premier resource to learn SYCL 2020.
Computer hardware development is driven by our needs to solve
larger and more complex problems, but those hardware advances are
largely useless unless programmers like you and me have languages that
allow us to implement our ideas and exploit the power available with
reasonable effort. There are numerous examples of amazing hardware,
and the first solutions to use them have often been proprietary since it
saves time not having to bother with committees agreeing on standards.
However, in the history of computing, they have eventually always ended
up as vendor lock-in—unable to compete with open standards that allow
developers to target any hardware and share code—because ultimately the
resources of the worldwide community and ecosystem are far greater than
any individual vendor, not to mention how open software standards drive
hardware competition.
Over the last few years, my team has had the tremendous privilege
of contributing to shaping the emerging SYCL ecosystem through our
development of GROMACS, one of the world’s most widely used scientific
HPC codes. We need our code to run on every supercomputer in the
world as well as our laptops. While we cannot afford to lose performance,
we also depend on being part of a larger community where other teams
invest effort in libraries we depend on, where there are open compilers
available, and where we can recruit talent. Since the first edition of this
book, SYCL has matured into such a community; in addition to several vendor-provided compilers, we now have a major community-driven implementation1 that targets all hardware, and there are thousands of
developers worldwide sharing experiences, contributing to training
events, and participating in forums. The outstanding power of open
source—whether it is an application, a compiler, or an open standard—is
that we can peek under the hood to learn, borrow, and extend. Just as we
repeatedly learn from the code in the Intel-led LLVM implementation,2
the community-driven implementation from Heidelberg University, and
several other codes, you can use our public repository3 to compare CUDA
and SYCL implementations in a large production codebase or borrow
solutions for your needs—because when you do so, you are helping to
further extend our community.
Perhaps surprisingly, data-parallel programming as a paradigm is
arguably far easier than classical solutions such as message-passing
communication or explicit multithreading—but it poses special challenges
to those of us who have spent decades in the old paradigms that focus on
hardware and explicit data placement. On a small scale, it was trivial for
us to explicitly decide how data is moved between a handful of processes,
but as the problem scales to thousands of units, it becomes a nightmare to
manage the complexity without introducing bugs or having the hardware
sit idle waiting for data. Data-parallel programming with SYCL solves
this by striking the balance of primarily asking us to explicitly express the
inherent parallelism of our algorithm, but once we have done that, the
compiler and drivers will mostly handle the data locality and scheduling
over tens of thousands of functional units. To be successful in data-parallel
programming, it is important not to think of a computer as a single unit
executing one program, but as a collection of units working independently

1 Community-driven implementation from Heidelberg University: tinyurl.com/HeidelbergSYCL
2 DPC++ compiler project: github.com/intel/llvm
3 GROMACS: gitlab.com/gromacs/gromacs/

to solve parts of a large problem. As long as we can express our problem as an algorithm where each part does not have dependencies on other parts,
it is in theory straightforward to implement it, for example, as a parallel
for-loop that is executed on a GPU through a device queue. However, for
more practical examples, our problems are frequently not large enough
to use an entire device efficiently, or we depend on performing tens of
thousands of iterations per second where latency in device drivers starts
to be a major bottleneck. While this book is an outstanding introduction to performance-portable GPU programming, it goes far beyond this to show how both throughput and latency matter for real-world applications and how SYCL can be used to exploit unique features of CPUs, GPUs, SIMD units, and FPGAs; it also covers the caveat that for good performance we need to understand, and possibly adapt code to, the characteristics of each type of hardware. Doing so, it is not merely a great tutorial on data-parallel programming, but an authoritative text that anybody interested in programming modern computer hardware in general should read.
One of SYCL’s key strengths is the close alignment to modern C++.
This can seem daunting at first; C++ is not an easy language to fully master
(I certainly have not), but Reinders and coauthors take our hand and lead
us on a path where we only need to learn a handful of C++ concepts to get
started and be productive in actual data-parallel programming. However,
as we become more experienced, SYCL 2020 allows us to combine this
with the extreme generality of C++17 to write code that can be dynamically
targeted to different devices, or that relies on heterogeneous parallelism that
uses CPU, GPU, and network units in parallel for different tasks. SYCL is
not a separate bolted-on solution to enable accelerators but instead holds
great promise to be the general way we express data parallelism in C++.
The SYCL 2020 standard now includes several features previously only
available as vendor extensions, for example, Unified Shared Memory,
sub-groups, atomic operations, reductions, simpler accessors, and many
other concepts that make code cleaner and facilitate both development and porting from standard C++17 or CUDA to have your code target

more diverse hardware. This book provides a wonderful and accessible
introduction to all of them, and you will also learn how SYCL is expected to
evolve together with the rapid development C++ is undergoing.
This all sounds great in theory, but how portable is SYCL in practice?
Our application is an example of a codebase that is quite challenging to
optimize since data access patterns are random, the amount of data to
process in each step is limited, we need to achieve thousands of iterations
per second, and we are limited by memory bandwidth as well as floating-point and integer operations—it is an extreme opposite of a simple data-parallel
problem. We spent over two decades writing assembly SIMD instructions
and native implementations for several GPU architectures, and our
very first encounters with SYCL involved both pains with adapting to
differences and reporting performance regressions to driver and compiler
developers. However, as of spring 2023, our SYCL kernels can achieve
80–100% of native performance on all GPU architectures not only from a
single codebase but even a single precompiled binary.
SYCL is still young and has a rapidly evolving ecosystem. There are
a few things not yet part of the language, but SYCL is unique as the only
performance-portable standard available that successfully targets all
modern hardware. Whether you are a beginner wanting to learn parallel
programming, an experienced developer interested in data-parallel
programming, or a maintainer needing to port 100,000 lines of proprietary
API code to an open standard, this second edition is the only book you will
need to become part of this community.

Erik Lindahl
Professor of Biophysics
Dept. Biophysics & Biochemistry
Science for Life Laboratory
Stockholm University

Acknowledgments
We have been blessed with an outpouring of community input for this
second edition of our book. Much inspiration came from interactions with
developers as they use SYCL in production, classes, tutorials, workshops,
conferences, and hackathons. SYCL deployments that include NVIDIA
hardware, in particular, have helped us enhance the inclusiveness and
practical tips in our teaching of SYCL in this second edition.
The SYCL community has grown a great deal—and consists of
engineers implementing compilers and tools, and a much larger group of
users that adopt SYCL to target hardware of many types and vendors. We
are grateful for their hard work, and shared insights.
We thank the Khronos SYCL Working Group that has worked diligently
to produce a highly functional specification. In particular, Ronan Keryell
has been the SYCL specification editor and a longtime vocal advocate
for SYCL.
We are in debt to the numerous people who gave us feedback from
the SYCL community in all these ways. We are also deeply grateful for
those who helped with the first edition a few years ago, many of whom we
named in the acknowledgments of that edition.
The first edition received feedback via GitHub,1 which we did review
but we were not always prompt in acknowledging (imagine six coauthors
all thinking “you did that, right?”). We did benefit a great deal from that
feedback, and we believe we have addressed all the feedback in the
samples and text for this edition. Jay Norwood was the most prolific at
commenting and helping us—a big thank you to Jay from all the authors!

1 github.com/apress/data-parallel-CPP

Other feedback contributors include Oscar Barenys, Marcel Breyer, Jeff
Donner, Laurence Field, Michael Firth, Piotr Fusik, Vincent Mierlak, and
Jason Mooneyham. Regardless of whether we recalled your name here or
not, we thank everyone who has provided feedback and helped refine our
teaching of C++ with SYCL.
For this edition, a handful of volunteers tirelessly read draft
manuscripts and provided insightful feedback for which we are incredibly
grateful. These reviewers include Aharon Abramson, Thomas Applencourt,
Rod Burns, Joe Curley, Jessica Davies, Henry Gabb, Zheming Jin, Rakshith
Krishnappa, Praveen Kundurthy, Tim Lewis, Erik Lindahl, Gregory Lueck,
Tony Mongkolsmai, Ruyman Reyes Castro, Andrew Richards, Sanjiv Shah,
Neil Trevett, and Georg Viehöver.
We all enjoy the support of our family and friends, and we cannot
thank them enough. As coauthors, we have enjoyed working as a team
challenging each other and learning together along the way. We appreciate
our collaboration with the entire Apress team in getting this book
published.
We are sure that there are more than a few people whom we have failed
to mention explicitly who have positively impacted this book project. We
thank all who helped us.
As you read this second edition, please do provide feedback if you find
any way to improve it. Feedback via GitHub can open up a conversation,
and we will update the online errata and book samples as needed.
Thank you all, and we hope you find this book invaluable in your
endeavors.

CHAPTER 1

Introduction
We have undeniably entered the age of accelerated computing. In order to
satisfy the world’s insatiable appetite for more computation, accelerated
computing drives complex simulations, AI, and much more by providing
greater performance and improved power efficiency when compared with
earlier solutions.
Heralded as a “New Golden Age for Computer Architecture,”1 we are
faced with enormous opportunity through a rich diversity in compute
devices. We need portable software development capabilities that are
not tied to any single vendor or architecture in order to realize the full
potential for accelerated computing.
SYCL (pronounced sickle) is an industry-driven Khronos Group
standard adding advanced support for data parallelism with C++ to
support accelerated (heterogeneous) systems. SYCL provides mechanisms
for C++ compilers to exploit accelerated (heterogeneous) systems in a way
that is highly synergistic with modern C++ and C++ build systems. SYCL is
not an acronym; SYCL is simply a name.

1 A New Golden Age for Computer Architecture by John L. Hennessy, David A. Patterson; Communications of the ACM, February 2019, Vol. 62, No. 2, Pages 48-60.


ACCELERATED VS. HETEROGENEOUS

These terms go together. Heterogeneous is a technical description acknowledging the combination of compute devices that are programmed differently. Accelerated is the motivation for adding this complexity to systems and programming. There is no guarantee of acceleration ever; programming heterogeneous systems will only accelerate our applications when we do it right. This book helps teach us how to do it right!

Data parallelism in C++ with SYCL provides access to all the compute
devices in a modern accelerated (heterogeneous) system. A single C++
application can use any combination of devices—including GPUs, CPUs,
FPGAs, and application-specific integrated circuits (ASICs)—that are
suitable to the problems at hand. No proprietary, single-vendor solution
can offer us the same level of flexibility.
This book teaches us how to harness accelerated computing through data-parallel programming using C++ with SYCL and provides practical
advice for balancing application performance, portability across compute
devices, and our own productivity as programmers. This chapter lays
the foundation by covering core concepts, including terminology, which
are critical to have fresh in our minds as we learn how to accelerate C++
programs using data parallelism.

Read the Book, Not the Spec


No one wants to be told “Go read the spec!”—specifications are hard to
read, and the SYCL specification (www.khronos.org/sycl/) is no different.
Like every great language specification, it is full of precision but is light on
motivation, usage, and teaching. This book is a “study guide” to teach C++
with SYCL.

No book can explain everything at once. Therefore, this chapter does
what no other chapter will do: the code examples contain programming
constructs that go unexplained until future chapters. We should not get
hung up on understanding the coding examples completely in Chapter 1
and should trust that it will get better with each chapter.

SYCL 2020 and DPC++


This book teaches C++ with SYCL 2020. The first edition of this book preceded the SYCL 2020 specification, so this edition includes updates such as adjustments to the header file location (sycl instead of CL), the device selector syntax, and the removal of an explicit host device.
DPC++ is an open source compiler project based on LLVM. It is
our hope that SYCL will eventually be supported by default in the LLVM
community and that the DPC++ project will help make that happen. The
DPC++ compiler offers broad heterogeneous support that includes GPU,
CPU, and FPGA. All examples in this book work with the DPC++ compiler
and should work with any C++ compiler supporting SYCL 2020.

Important resources for updated SYCL information, including any known book errata, include the book GitHub (github.com/Apress/data-parallel-CPP), the Khronos Group SYCL standards website (www.khronos.org/sycl), and a key SYCL education website (sycl.tech).

As of publication time, no C++ compiler claims full conformance or compliance with the SYCL 2020 specification. Nevertheless, the code in this book works with the DPC++ compiler and should work with other C++ compilers that have most of SYCL 2020 implemented. We use only standard C++ with SYCL 2020 except for a few DPC++-specific extensions that are clearly called out in Chapter 17 (Programming for FPGAs) to a small degree, in Chapter 20 (Backend Interoperability) when connecting to Level Zero backends, and in the Epilogue when speculating on the future.

Why Not CUDA?


Unlike CUDA, SYCL supports data parallelism in C++ for all vendors and
all types of architectures (not just GPUs). CUDA is focused on NVIDIA
GPU support only, and efforts (such as HIP/ROCm) to reuse it for GPUs from other vendors have, despite some solid successes, limited ability to succeed broadly. With the explosion of accelerator architectures, only SYCL offers the support we need for harnessing this diversity, with a multivendor/multiarchitecture approach to portability that CUDA does not provide. To more deeply understand this motivation,
we highly recommend reading (or watching the video recording of their
excellent talk) “A New Golden Age for Computer Architecture” by industry
legends John L. Hennessy and David A. Patterson. We consider this a
must-read article.
Chapter 21, in addition to addressing topics useful for migrating code
from CUDA to C++ with SYCL, is valuable for those experienced with
CUDA to bridge some terminology and capability differences. The most
significant capabilities beyond CUDA come from the ability for SYCL to
support multiple vendors, multiple architectures (not just GPUs), and
multiple backends even for the same device. This flexibility answers the
question “Why not CUDA?”
SYCL does not involve any extra overhead compared with CUDA or
HIP. It is not a compatibility layer—it is a generalized approach that is open
to all devices regardless of vendor and architecture while simultaneously
being in sync with modern C++. Like other open multivendor and
multiarchitecture techniques, such as OpenMP and OpenCL, the ultimate
proof is in the implementations including options to access hardware-specific optimizations when absolutely needed.


Why Standard C++ with SYCL?


As we will point out repeatedly, every program using SYCL is first and
foremost a C++ program. SYCL does not rely on any language changes
to C++. SYCL does take C++ programming places it cannot go without
SYCL. We have no doubt that all programming for accelerated computing
will continue to influence language standards including C++, but we do
not believe the C++ standard should (or will) evolve to displace the need
for SYCL any time soon. SYCL has a rich set of capabilities, which we spend this book covering, that extend C++ through classes and through support for new compiler capabilities necessary to meet needs (already existing today) for multivendor and multiarchitecture support.

Getting a C++ Compiler with SYCL Support


All examples in this book compile and work with all the various
distributions of the DPC++ compiler and should compile with other C++
compilers supporting SYCL (see “SYCL Compilers in Development” at
www.khronos.org/sycl). We are careful to note the very few places where
extensions are used that are DPC++ specific at the time of publication.
The authors recommend the DPC++ compiler for a variety of reasons,
including our close association with the DPC++ compiler. DPC++ is an
open source compiler project to support SYCL. By using LLVM, the DPC++
compiler project has access to backends for numerous devices. This has
already resulted in support for Intel, NVIDIA, and AMD GPUs, numerous
CPUs, and Intel FPGAs. The ability to extend and enhance support openly
for multiple vendors and multiple architectures makes LLVM a great choice
for open source efforts to support SYCL.
There are distributions of the DPC++ compiler, augmented with additional libraries, debuggers, and other tools, available as part of a larger project to offer broad support for heterogeneous systems, known as the oneAPI project. The oneAPI tools, including the DPC++ compiler, are freely available (www.oneapi.io/implementations).

1. #include <iostream>
2. #include <sycl/sycl.hpp>
3. using namespace sycl;
4.
5. const std::string secret{
6. "Ifmmp-!xpsme\"\012J(n!tpssz-!Ebwf/!"
7. "J(n!bgsbje!J!dbo(u!ep!uibu/!.!IBM\01"};
8.
9. const auto sz = secret.size();
10.
11. int main() {
12. queue q;
13.
14. char* result = malloc_shared<char>(sz, q);
15. std::memcpy(result, secret.data(), sz);
16.
17. q.parallel_for(sz, [=](auto& i) {
18. result[i] -= 1;
19. }).wait();
20.
21. std::cout << result << "\n";
22. free(result, q);
23. return 0;
24. }

Figure 1-1. Hello data-parallel programming

Hello, World! and a SYCL Program Dissection
Figure 1-1 shows a sample SYCL program. Compiling and running it
results in the following being printed:
Hello, world! (and some additional text left to experience by running it)
We will completely understand this example by the end of Chapter 4.
Until then, we can observe the single include of <sycl/sycl.hpp> (line 2)
that is needed to define all the SYCL constructs. All SYCL constructs live
inside a namespace called sycl.


• Line 3 lets us avoid writing sycl:: over and over.

• Line 12 instantiates a queue for work requests directed to a particular device (Chapter 2).

• Line 14 creates an allocation for data shared with the device (Chapter 3).

• Line 15 copies the secret string into device memory, where it will be processed by the kernel.

• Line 17 enqueues work to the device (Chapter 4).

• Line 18 is the only line of code that will run on the device. All other code runs on the host (CPU).

Line 18 is the kernel code that we want to run on devices. That kernel
code decrements a single character. With the power of parallel_for(),
that kernel is run on each character in our secret string in order to decode
it into the result string. There is no ordering of the work required, and it is
run asynchronously relative to the main program once the parallel_for
queues the work. It is critical that there is a wait (line 19) before looking at
the result to be sure that the kernel has completed, since in this example
we are using a convenient feature (Unified Shared Memory, Chapter 6).
Without the wait, the output may occur before all the characters have been
decrypted. There is more to discuss, but that is the job of later chapters.

Queues and Actions


Chapter 2 discusses queues and actions, but we can start with a simple
explanation for now. Queues are the only connection that allows an
application to direct work to be done on a device. There are two types
of actions that can be placed into a queue: (a) code to execute and (b)
memory operations. Code to execute is expressed via either single_task
or parallel_for (used in Figure 1-1). Memory operations perform copy operations between host and device or fill operations to initialize memory.
We only need to use memory operations if we seek more control than
what is done automatically for us. These are all discussed later in the
book starting with Chapter 2. For now, we should be aware that queues
are the connection that allows us to command a device, and we have
a set of actions available to put in queues to execute code and to move
around data. It is also very important to understand that requested actions
are placed into a queue without waiting. The host, after submitting an
action into a queue, continues to execute the program, while the device
will eventually, and asynchronously, perform the action requested via
the queue.
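
To make these ideas concrete, here is a minimal sketch of our own (not one of the book's numbered figures) that places both kinds of actions into the same queue: a fill memory operation to initialize an allocation, followed by a parallel_for to execute code:

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;  // our connection to a device (the default device here)

  // A memory operation placed into the queue: initialize the allocation.
  int* data = sycl::malloc_shared<int>(16, q);
  q.fill(data, 0, 16).wait();

  // Code to execute placed into the queue: the submission returns
  // immediately, so we wait before the host reads the results.
  q.parallel_for(16, [=](auto i) { data[i] += 1; }).wait();

  sycl::free(data, q);
  return 0;
}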

QUEUES CONNECT US TO DEVICES

We submit actions into queues to request computational work and data movement.

Actions happen asynchronously.

It Is All About Parallelism


Since programming in C++ for data parallelism is all about parallelism,
let’s start with this critical concept. The goal of parallel programming is
to compute something faster. It turns out there are two aspects to this:
increased throughput and reduced latency.

Throughput
Increasing throughput of a program comes when we get more work done
in a set amount of time. Techniques like pipelining may stretch out the
time necessary to get a single work-item done, to allow overlapping of work that leads to more work-per-unit-of-time being done. Humans
encounter this often when working together. The very act of sharing work
involves overhead to coordinate that often slows the time to do a single
item. However, the power of multiple people leads to more throughput.
Computers are no different—spreading work to more processing cores
adds overhead to each unit of work that likely results in some delays, but
the goal is to get more total work done because we have more processing
cores working together.

Latency
What if we want to get one thing done faster—for instance, analyzing
a voice command and formulating a response? If we only cared about
throughput, the response time might grow to be unbearable. The concept
of latency reduction requires that we break up an item of work into
pieces that can be tackled in parallel. For throughput, image processing
might assign whole images to different processing units—in this case,
our goal may be optimizing for images per second. For latency, image
processing might assign each pixel within an image to different processing
cores—in this case, our goal may be maximizing pixels per second from a
single image.

Think Parallel
Successful parallel programmers use both techniques in their
programming. This is the beginning of our quest to Think Parallel.
We want to adjust our minds to think first about where parallelism
can be found in our algorithms and applications. We also think about how
different ways of expressing the parallelism affect the performance we
ultimately achieve. That is a lot to take in all at once. The quest to Think
Parallel becomes a lifelong journey for parallel programmers. We can learn
a few tips here.


Amdahl and Gustafson


Amdahl’s Law, stated by the supercomputer pioneer Gene Amdahl in
1967, is a formula to predict the theoretical maximum speed-up when
using multiple processors. Amdahl lamented that the maximum gain from
parallelism is limited to (1/(1-p)) where p is the fraction of the program
that runs in parallel. If we only run two-thirds of our program in parallel,
then the most that program can speed up is a factor of 3. We definitely
need that concept to sink in deeply! This happens because no matter how
fast we make that two-thirds of our program run, the other one-third still
takes the same time to complete. Even if we add 100 GPUs, we will only get
a factor of 3 increase in performance.
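
To make the arithmetic concrete: for n processing units, Amdahl's formula is speed-up = 1/((1 - p) + p/n), which approaches the 1/(1 - p) limit above as n grows. The small sketch below (our own illustration, plain C++) evaluates it for the two-thirds example:

#include <initializer_list>
#include <iostream>

// Amdahl's Law: predicted speed-up when fraction p of a program
// runs in parallel across n processing units.
double amdahl_speedup(double p, double n) {
  return 1.0 / ((1.0 - p) + p / n);
}

int main() {
  const double p = 2.0 / 3.0;  // two-thirds of the program is parallel
  for (double n : {2.0, 10.0, 100.0, 1000000.0})
    std::cout << "n = " << n
              << "  speed-up = " << amdahl_speedup(p, n) << "\n";
  // As n grows, the speed-up approaches 1/(1 - p) = 3, as described above.
  return 0;
}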
For many years, some viewed this as proof that parallel computing
would not prove fruitful. In 1988, John Gustafson wrote an article titled
“Reevaluating Amdahl’s Law.” He observed that parallelism was not used
to speed up fixed workloads, but it was used to allow work to be scaled
up. Humans experience the same thing. One delivery person cannot
deliver a single package faster with the help of many more people and
trucks. However, a hundred people and trucks can deliver one hundred
packages more quickly than a single driver with a truck. Multiple drivers
will definitely increase throughput and will also generally reduce latency
for package deliveries. Amdahl’s Law tells us that a single driver cannot
deliver one package faster by adding ninety-nine more drivers with their
own trucks. Gustafson noticed the opportunity to deliver one hundred
packages faster with these extra drivers and trucks.
This emphasizes that parallelism is most useful because the size of the problems we tackle keeps growing year after year. Parallelism would not be nearly as important to study if year after year we only wanted to run the same size problems faster. This quest to solve larger and larger problems fuels our interest in exploiting data parallelism, using C++ with SYCL, for the future of computing on heterogeneous/accelerated systems.


Scaling
The word “scaling” appeared in our prior discussion. Scaling is a measure
of how much a program speeds up (simply referred to as “speed-up”)
when additional computing is available. Perfect speed-up happens if
one hundred packages are delivered in the same time as one package,
by simply having one hundred trucks with drivers instead of a single
truck and driver. Of course, it does not reliably work that way. At some
point, there is a bottleneck that limits speed-up. There may not be one
hundred places for trucks to dock at the distribution center. In a computer
program, bottlenecks often involve moving data around to where it will
be processed. Distributing to one hundred trucks is similar to having to
distribute data to one hundred processing cores. The act of distributing
is not instantaneous. Chapter 3 starts our journey of exploring how to
distribute data to where it is needed in a heterogeneous system. It is critical
that we know that data distribution has a cost, and that cost affects how
much scaling we can expect from our applications.

Heterogeneous Systems
For our purposes, a heterogeneous system is any system which contains
multiple types of computational devices. For instance, a system with both
a central processing unit (CPU) and a graphics processing unit (GPU) is a
heterogeneous system. The CPU is often just called a processor, although
that can be confusing when we speak of all the processing units in a
heterogeneous system as compute processors. To avoid this confusion,
SYCL refers to processing units as devices. An application always runs on
a host that in turn sends work to devices. Chapter 2 begins the discussion
of how our main application (host code) will steer work (computations) to
particular devices in a heterogeneous system.


A program using C++ with SYCL runs on a host and issues kernels of
work to devices. Although it might seem confusing, it is important to know
that the host will often be able to serve as a device. This is valuable for two
key reasons: (1) the host is most often a CPU that will run a kernel if no
accelerator is present—a key promise of SYCL for application portability
is that a kernel can always be run on any system even those without
accelerators—and (2) CPUs often have vector, matrix, tensor, and/or AI processing capabilities that act as accelerators onto which kernels map well.

Host code invokes code on devices. The capabilities of the host are very often available as a device also, both to provide a back-up device and to offer any acceleration capabilities the host has for processing kernels. Our host is most often a CPU, and as such it may be available as a CPU device. SYCL makes no guarantee of a CPU device, only that at least one device is available to be the default device for our application.
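
To see which devices a particular system exposes, and which one a default-constructed queue selects, a short query such as this sketch can help (it uses standard SYCL 2020 device queries; the output varies by system):

#include <iostream>
#include <sycl/sycl.hpp>

int main() {
  // Enumerate every device visible to this SYCL implementation.
  for (auto& dev : sycl::device::get_devices())
    std::cout << "Found device: "
              << dev.get_info<sycl::info::device::name>() << "\n";

  // A default-constructed queue is bound to the default device.
  sycl::queue q;
  std::cout << "Default device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}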

While heterogeneous describes the system from a technical standpoint, the reason to complicate our hardware and software is to
obtain higher performance. Therefore, the term accelerated computing is
popular for marketing heterogeneous systems or their components. We
like to emphasize that there is no guarantee of acceleration. Programming
of heterogeneous systems will only accelerate our applications when we do
it right. This book helps teach us how to do it right!
GPUs have evolved to become high-performance computing (HPC)
devices and therefore are sometimes referred to as general-purpose GPUs,
or GPGPUs. For heterogeneous programming purposes, we can simply
assume we are programming such powerful GPGPUs and refer to them
as GPUs.


Today, the collection of devices in a heterogeneous system can include CPUs, GPUs, FPGAs (field-programmable gate arrays), DSPs (digital signal
processors), ASICs (application-specific integrated circuits), and AI chips
(graph, neuromorphic, etc.).
The design of such devices will involve duplication of compute
processors (multiprocessors) and increased connections (increased
bandwidth) to data sources such as memory. The first of these,
multiprocessing, is particularly useful for raising throughput. In our
analogy, this was done by adding additional drivers and trucks. The latter
of these, higher bandwidth for data, is particularly useful for reducing
latency. In our analogy, this was done with more loading docks to enable
trucks to be fully loaded in parallel.
Having multiple types of devices, each with different architectures and
therefore different characteristics, leads to different programming and
optimization needs for each device. That becomes the motivation for C++
with SYCL and the majority of what this book has to teach.

SYCL was created to address the challenges of C++ data-parallel programming for heterogeneous (accelerated) systems.

Data-Parallel Programming
The phrase “data-parallel programming” has been lingering unexplained
ever since the title of this book. Data-parallel programming focuses on
parallelism that can be envisioned as a bunch of data to operate on in
parallel. This shift in focus is like Gustafson vs. Amdahl. We need one
hundred packages to deliver (effectively lots of data) in order to divide
up the work among one hundred trucks with drivers. The key concept
comes down to what we should divide. Should we process whole images or process them in smaller tiles or process them pixel by pixel? Should
we analyze a collection of objects as a single collection or a set of smaller
groupings of objects or object by object?
Choosing the right division of work and mapping that work onto
computational resources effectively is the responsibility of any parallel
programmer using C++ with SYCL. Chapter 4 starts this discussion, and it
continues through the rest of the book.
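
For example, the pixel-by-pixel choice maps naturally onto a two-dimensional parallel_for. The sketch below is our own illustration (the function name is hypothetical, and the pixel data is assumed to already reside in device-accessible Unified Shared Memory):

#include <sycl/sycl.hpp>

// One work-item per pixel of a height x width image; each work-item
// performs a small, independent piece of the overall job.
void brighten(sycl::queue& q, float* pixels, size_t height, size_t width) {
  q.parallel_for(sycl::range<2>{height, width}, [=](sycl::id<2> idx) {
     pixels[idx[0] * width + idx[1]] *= 1.1f;  // per-pixel work
   }).wait();
}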

Key Attributes of C++ with SYCL


Every program using SYCL is first and foremost a C++ program. SYCL does
not rely on any language changes to C++.
C++ compilers with SYCL support will optimize code based on built-in knowledge of the SYCL specification as well as implement support so
heterogeneous compilations “just work” within traditional C++ build
systems.
Next, we will explain the key attributes of C++ with SYCL: single-source
style, host, devices, kernel code, and asynchronous task graphs.

Single-Source
Programs are single-source, meaning that the same translation unit2
contains both the code that defines the compute kernels to be executed
on devices and also the host code that orchestrates execution of those
compute kernels. Chapter 2 begins with a more detailed look at this
capability. We can still divide our program source into different files and
translation units for host and device code if we want to, but the key is that
we don’t have to!

2 We could just say “file,” but that is not entirely correct here. A translation unit is the actual input to the compiler, made from the source file after it has been processed by the C preprocessor to inline header files and expand macros.


Host
Every program starts by running on a host, and most of the lines of code
in a program are usually for the host. Thus far, hosts have always been
CPUs. The standard does not require this, so we carefully describe it as
a host. This seems unlikely to be anything other than a CPU because the
host needs to fully support C++17 in order to support all C++ with SYCL
programs. As we will see shortly, devices (accelerators) do not need to
support all of C++17.

Devices
Using multiple devices in a program is what makes it heterogeneous
programming. That is why the word device has been recurring in this
chapter since the explanation of heterogeneous systems a few pages ago.
We already learned that the collection of devices in a heterogeneous
system can include GPUs, FPGAs, DSPs, ASICs, CPUs, and AI chips, but is
not limited to any fixed list.
Devices are the targets to gain acceleration. The idea of offloading
computations is to transfer work to a device that can accelerate completion
of the work. We have to worry about making up for time lost moving
data—a topic that needs to constantly be on our minds.

Sharing Devices
On a system with a device, such as a GPU, we can envision two or more
programs running and wanting to use a single device. They do not need to
be programs using SYCL. Programs can experience delays in processing by
the device if another program is currently using it. This is really the same
philosophy used in C++ programs in general for CPUs. Any system can be
overloaded if we run too many active programs on our CPU (mail, browser,
virus scanning, video editing, photo editing, etc.) all at once.


On supercomputers, when nodes (CPUs + all attached devices) are granted exclusively to a single application, sharing is not usually a concern.
On non-supercomputer systems, we can just note that the performance
of a program may be impacted if there are multiple applications using the
same devices at the same time.
Everything still works, and there is no programming we need to do
differently.

Kernel Code
Code for a device is specified as kernels. This is a concept that is not
unique to C++ with SYCL: it is a core concept in other offload acceleration
languages including OpenCL and CUDA. While it is distinct from loop-oriented approaches (such as commonly used with OpenMP target
offloads), it may resemble the body of code within the innermost loop
without requiring the programmer to write the loop nest explicitly.
Kernel code has certain restrictions to allow broader device support
and massive parallelism. The list of features not supported in kernel code
includes dynamic polymorphism, dynamic memory allocations (therefore
no object management using new or delete operators), static variables,
function pointers, runtime type information (RTTI), and exception
handling. No virtual member functions, and no variadic functions, are
allowed to be called from kernel code. Recursion is not allowed within
kernel code.


VIRTUAL FUNCTIONS?

While we will not discuss it further in this book, the DPC++ compiler project does
have an experimental extension (visible in the open source project, of course) to
implement some support for virtual functions within kernels. Given the nature of offloading to accelerators efficiently, virtual functions cannot be supported well
without some restrictions, but many users have expressed interest in seeing
SYCL offer such support even with some restrictions. The beauty of open source,
and the open SYCL specification, is the opportunity to participate in experiments
that can inform the future of C++ and SYCL specifications. Visit the DPC++
project (github.com/intel/llvm) for more information.

Chapter 3 describes how memory allocations are done before and after kernels are invoked, thereby making sure that kernels stay focused
on massively parallel computations. Chapter 5 describes handling of
exceptions that arise in connection with devices.
The rest of C++ is fair game in a kernel, including functors, lambda
expressions, operator overloading, templates, classes, and static
polymorphism. We can also share data with the host (see Chapter 3) and
share the read-only values of (non-global) host variables (via lambda
expression captures).
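
As a small illustration of that list, the sketch below (our own, with hypothetical names) uses a class template with an overloaded operator() as the kernel:

#include <sycl/sycl.hpp>

// Templates, classes, and operator overloading are all usable in kernels.
template <typename T>
struct Scale {
  T factor;
  T* data;
  void operator()(sycl::id<1> i) const { data[i] *= factor; }
};

void scale_in_place(sycl::queue& q, float* data, size_t n) {
  // The functor instance is copied by value into the kernel.
  q.parallel_for(sycl::range<1>{n}, Scale<float>{2.0f, data}).wait();
}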

Kernel: Vector Addition (DAXPY)


Kernels should feel familiar to any programmer who has worked on
computationally complex code. Consider implementing DAXPY, which
stands for “double-precision A times X Plus Y,” a classic for decades.
Figure 1-2 shows DAXPY implemented in modern Fortran, C/C++, and
SYCL. Amazingly, the computation lines (line 3) are virtually identical.
Chapters 4 and 10 explain kernels in detail. Figure 1-2 should help remove
any concerns that kernels are difficult to understand—they should feel
familiar even if the terminology is new to us.


1. ! Fortran loop
2. do i = 1, n
3. z(i) = alpha * x(i) + y(i)
4. end do

1. // C/C++ loop
2. for (int i=0;i<n;i++) {
3. z[i] = alpha * x[i] + y[i];
4. }

1. // SYCL kernel
2. q.parallel_for(range{n},[=](id<1> i) {
3. z[i] = alpha * x[i] + y[i];
4. }).wait();

Figure 1-2. DAXPY computations in Fortran, C/C++, and SYCL

Asynchronous Execution
The asynchronous nature of programming using C++ with SYCL must not
be missed. Asynchronous programming is critical to understand for two
reasons: (1) proper use gives us better performance (better scaling), and
(2) mistakes lead to parallel programming errors (usually race conditions)
that make our applications unreliable.
The asynchronous nature comes about because work is transferred to
devices via a “queue” of requested actions. The host program submits a
requested action into a queue, and the program continues without waiting
for any results. This lack of waiting is important so that we can try to keep
computational resources (devices and the host) busy all the time. If we had
to wait, that would tie up the host instead of allowing the host to do useful
work. It would also create serial bottlenecks when the device finished, until
we queued up new work. Amdahl’s Law, as discussed earlier, penalizes us
for time spent not doing work in parallel. We need to construct our programs
to be moving data to and from devices while the devices are busy and keep
all the computational power of the devices and host busy any time work is
available. Failure to do so will bring the full curse of Amdahl’s Law upon us.

Chapter 3 starts the discussion on thinking of our program as an asynchronous task graph, and Chapter 8 greatly expands upon this
concept.

Race Conditions When We Make a Mistake


In our first code example (Figure 1-1), we specifically did a “wait” on
line 19 to prevent line 21 from writing out the value from result before it
was available. We must keep this asynchronous behavior in mind. There
is another subtle thing done in that same code example—line 15 uses
std::memcpy to load the input. Since std::memcpy runs on the host, lines 17 and later do not execute until line 15 has completed. After reading
Chapter 3, we could be tempted to change this to use q.memcpy (using
SYCL). We have done exactly that in Figure 1-3 on line 7. Since that is a
queue submission, there is no guarantee that it will execute before line
9. This creates a race condition, which is a type of parallel programming
bug. A race condition exists when two parts of a program access the same
data without coordination. Since we expect to write data using line 7 and
then read it in line 9, we do not want a race that might have line 9 execute
before line 7 completes! Such a race condition would make our program
unpredictable—our program could get different results on different runs
and on different systems. A fix for this would be to explicitly wait for
q.memcpy to complete before proceeding by adding .wait() to the end of
line 7. That is not the best fix. We could have used event dependences to
solve this (Chapter 8). Creating the queue as an ordered queue would also
add an implicit dependence between the memcpy and the parallel_for.
As an alternative, in Chapter 7, we will see how a buffer and accessor
programming style can be used to have SYCL manage the dependences
and waits automatically for us.


1. // ...we are changing one line from Figure 1-1
2. char* result = malloc_shared<char>(sz, q);
3.
4. // Introduce potential data race! We don't define a
5. // dependence to ensure correct ordering with later
6. // operations.
7. q.memcpy(result, secret.data(), sz);
8.
9. q.parallel_for(sz, [=](auto& i) {
10. result[i] -= 1;
11. }).wait();
12.
13. // ...

Figure 1-3. Adding a race condition to illustrate a point about being asynchronous

RACE CONDITIONS DO NOT ALWAYS CAUSE A PROGRAM TO FAIL

An astute reader noticed that the code in Figure 1-3 did not fail on every
system they tried. Using a GPU with partition_max_sub_devices==0 did
not fail because it was a small GPU not capable of running the parallel_for
until the memcpy had completed. Regardless, the code is flawed because the
race condition exists even if it does not universally cause a failure at runtime.
We call it a race—sometimes we win, and sometimes we lose. Such coding
flaws can lay dormant until the right combination of compile and runtime
environments lead to an observable failure.

Adding a wait() forces host synchronization between the memcpy and the kernel, which goes against the previous advice to keep the device busy
all the time. Much of this book covers the different options and trade-offs
that balance program simplicity with efficient use of our systems.
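
For example, the ordered queue alternative mentioned above requires only a different queue construction. This minimal sketch reuses the names from Figure 1-3:

// An in-order queue executes its actions in submission order, so the
// parallel_for cannot begin until the memcpy has completed.
sycl::queue q{sycl::property::queue::in_order{}};

char* result = sycl::malloc_shared<char>(sz, q);
q.memcpy(result, secret.data(), sz);  // action 1
q.parallel_for(sz, [=](auto& i) {     // action 2: ordered after action 1
  result[i] -= 1;
}).wait();                            // the host waits only here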

20
Random documents with unrelated
content Scribd suggests to you:
The Project Gutenberg eBook of Garside's
Career: A Comedy in Four Acts
This ebook is for the use of anyone anywhere in the United States
and most other parts of the world at no cost and with almost no
restrictions whatsoever. You may copy it, give it away or re-use it
under the terms of the Project Gutenberg License included with this
ebook or online at www.gutenberg.org. If you are not located in the
United States, you will have to check the laws of the country where
you are located before using this eBook.

Title: Garside's Career: A Comedy in Four Acts

Author: Harold Brighouse

Release date: August 7, 2017 [eBook #55290]


Most recently updated: October 23, 2024

Language: English

Credits: Produced by David Widger from page images generously


provided by the Internet Archive

*** START OF THE PROJECT GUTENBERG EBOOK GARSIDE'S


CAREER: A COMEDY IN FOUR ACTS ***
GARSIDE'S CAREER
A Comedy In Four Acts
By Harold Brighouse
London: Constable And Company Ltd.

1914
TO

A. N. MONKHOUSE
CONTENTS

GARSIDE'S CAREER

ACT I

ACT II

ACT III.

ACT IV

GARSIDE'S CAREER
ACT I
Interior of an artisan cottage. Door centre, leading direct to street,
door right to house. Fireplace with kitchen range left. Table centre,
with print cloth. Two plain chairs under it, one left, one centre,
facing audience. Rocking-chair by fireplace. Two chairs against wall
right, above door. Dresser right, below door. Small hanging bookcase
on wall, left centre. Window right centre. On walls plainly framed
photographs of Socialist leaders—Blatchford, Hyndman, Hardie. The
time is 7.0 p.m. on a June evening.

Mrs. Garside is a working-class woman of 50, greyhaired, slight,


with red toil-worn hands and a face expressive of resignation marred
by occasional petulance, dressed in a rough serge skirt, dark print
blouse, elastic-sided boots, and a white apron. She sits in the
rocking-chair, watching the cheap alarm-clock fretfully. Outside a boy
is heard calling "Last Edishun." She rises hastily, feels on the
mantelpiece for her purse, opens the door centre and buys a paper
from the boy who appears through the doorway. She closes door,
sits centre and spreads the paper on the table, rises again and gets
spectacle-case from mantelpiece. She sits with spectacles on and
rapidly goes through the paper seeking some particular item.

The door centre opens and Margaret Shawcross enters. She is


young, dark, with a face beautiful in expression rather than feature.
It is the face of an idealist, one who would go through fire and water
for the faith that is in her.

She is a school teacher, speaking with an educated voice in a


slightly apparent northern accent, dressed neatly and serviceably.
Mrs. Garside turns eagerly as she enters and is disappointed on
seeing Margaret.

Mrs. Gar. Oh, it's you. I thought it might be——

Mar. (closing door, sympathetically). Yes. But it's too early to


expect Peter back yet.

Mrs. G. (with some truculence). He'll not be long. He's always


said he'd let his mother be the first to hear the news.

Mar. (gently). You don't mind my being here to hear it with you?

Mrs. G. (rising and putting spectacles back on mantelpiece,


speaking ungraciously). No, you've got a right to hear it too,
Margaret. (Margaret picks up paper.) I can't find anything in that.

Mar. Peter said the results come out too late for the evening
papers.

Mrs. G. He never told me. (Margaret folds paper on table.) I'm


glad though. There's no one else 'ull know a-front of me. He'll bring
the good news home himself. He's coming now as fast as train and
car 'ull bring him. (Sitting in rocking-chair.)

Mar. Yes. He knows we're waiting here, we two who care for
Peter more than anything on earth.
Mrs. G. (giving her a jealous glance). I wish he'd come.

Mar. Try to be calm, Mrs. Garside.

Mrs. G. (irritably). Calm? How can I be calm? I'm on edge till I


know. (Rocking her chair quickly.)

Mar. (trying to soothe her). It isn't as if he can't try again if he's


not through this time.

Mrs. G. (confidently, keeping her chair still). He'll have no need to


try again. I've a son and his name this night is Peter Garside, b.a. I
know he's through.

Mar. (sitting in chair lift of table). Then if you're sure——

Mrs. G. Yes. I know I'm a fidget. I want to hear it from his own
lips. He's worked so hard he can't fail. (Accusingly.) You don't believe
me, Margaret. You're not sure of him.

Mar. (with elbows on table and head on hands). I'm fearful of the
odds against him—the chances that the others have and he hasn't.
Peter's to work for his living. They're free to study all day long.
(Rising, enthusiastically.) Oh, if he does it, what a triumph for our
class. Peter Garside, the Board School boy, the working engineer,
keeping himself and you, and studying at night for his degree.

Mrs. G. (dogmatically). The odds don't count. I know Peter.


Peter's good enough for any odds. You doubt him, Margaret.

Mar. No. I've seen him work. I've worked with him till he
distanced me and left me far behind. He knows enough to pass, to
pass above them all——

Mrs. G. Yes, yes!


Mar. But examinations are a fearful hazard.

Mrs. G. Not to Peter. He's fighting for his class, he's showing them
he's the better man. He can work with his hands and they can't, and
he can work with his brain as well as the best of them.

Mar. He'll do it. It may not be this time, but he'll do it in the end.

Mrs. G. (obstinately). This time, Margaret.

Mar. I do hope so.

Mrs. G. (looking at the clock). Do you think there's been a


breakdown on the cars?

Mar. No, no.

Mrs. G. (rising anxiously). He said seven, and it's after that.

Mar. (trying to soothe her). Not much.

Mrs. G. (pacing about). Why doesn't he come? (Stopping short.)


Where do people go to find out if there's been an accident? It's the
police station, isn't it?

Mar. Oh, there's no need——

[Peter Garside bursts in through centre door and closes it behind


him as he speaks. He is 23, cleanshaven, fair, sturdily built, with a
large, loose mouth, strong jaw, and square face, dressed in a cheap
tweed suit, wearing a red tie.

Peter (breathlessly). I've done it.

Mrs. G. (going to him; he puts his arm round her and pats her
back, while she hides her face against his chest). My boy, my boy!
Peter. I've done it, mother. (Looking proudly at Margaret.) I'm an
honours man of Midlandton University.

Mar. First class, Peter?

Peter. Yes. First Class. (Gently disengaging himself from Mrs.


Garside.)

Mrs. G. (standing by his left, looking up at him). I knew, I knew


it, Peter. I had the faith in you.

Peter (hanging his cap behind the door right, then coming back
to centre. Margaret is standing on the hearthrug). Ah, little mother,
what a help that faith has been to me. I couldn't disappoint a faith
like yours. I had to win. Mother, Margaret, I've done it. Done it. Oh, I
think I'm not quite sane to-night. This room seems small all of a
sudden. I want to leap, to dance, and I know I'd break my neck
against the ceiling if I did. Peter Garside, b.a. (Approaching
Margaret.) Margaret, tell me I deserve it. You know what it means to
me. The height of my ambition. The crown, the goal, my target
reached at last. Margaret, isn't it a great thing that I've done?

Mar. (taking both his hands). A great thing, Peter.

Peter. Oh, but I was lucky in my papers.

Mar. No, you just deserve it all.

Peter (dropping her hands). Up to the end I didn't know. I


thought I'd failed. And here I'm through first class. I've beaten men
I never hoped to equal. I've called myself a swollen-headed fool for
dreaming to compete with them, and now——

Mrs. G. Now you've justified my faith. I never doubted you—like


Margaret did.
Peter (looking from her to Margaret). Margaret did?

Mar. I didn't dare to hope for this, Peter—at a first attempt.

Mrs. G. (contemptuously). She didn't dare. But I did. I dared,


Peter, I knew.

Peter (putting his arm over her shoulder). Oh, mother, mother!
But Margaret was right, if I hadn't had such luck in the papers I——

Mrs. G. (slipping from him and going to where her cape and
bonnet hang on the door right). It wasn't luck. Even Margaret said
you deserved it all.

Peter. Even Margaret! (Seeing her putting cape on.) You're not
going out, mother?

Mrs. G. (with determination). Yes, I am. There's others besides


Margaret doubted you. I'm going to tell them all. I'm going to be the
first to spread the news. And won't it spread! Like murder.

[Margaret sits l.c.

Peter. Oh, yes. It'll spread fast enough. They may know already.

Mrs. G. (turning with her hand on the centre door latch). How
could they?

Peter. News travels fast.

Mrs. G. But you haven't told anyone else. Have you, Peter?
(Reproachfully.) You said you'd let me be the first to know.

Peter. I met O'Callagan on his way to the Club. He asked me. I


couldn't refuse to answer.
Mrs. G. (energetically). He'd no right to meet you. A dreamy
wastrel like O'Callagan to know before your mother!

Peter. He'll only tell the men at the Club, mother.

Mrs. G. (opening door). And I'll tell the women. They're going to
know the kind of son I've borne. I'm a proud woman this night, and
all Belinda Street is going to know I've cause to be. (Sniffing.)
O'Callagan indeed!

[Exit Mrs. Garside.

Peter. And now, Margaret? (He stands centre behind table.)

Mar. (looking up and holding out her hand across table; she takes
his, bending). Oh, my dear, my dear.

Peter. Are you pleased with me?

Mar. Pleased!

Peter (rising). Yes. We've done it.

Mar. You, not we. My hero.

Peter. We, Margaret, we. I'm no hero. I owe it all to you.

Mar. (rising). You owe it to yourself.

Peter. You inspired me. You helped me on. You kept me at it


when my courage failed. When I wanted to slack you came and
worked with me. It was your idea from the first.

Mar. My idea but your deed.


Peter (sitting centre, behind table). I've had dreams of this.
Dreams of success. I never thought it would come. It was there on
the horizon—a far-off nebulous dream.

Mar. (standing right). It's a reality to-day.

Peter. Yes. It's a reality to day. I've done the task you set me.
I've proved my class as good as theirs. That's what you wanted,
wasn't it?

Mar. I wanted you to win, Peter.

Peter. I've won because you wanted it, because after I won I
knew that you—— (Rising.) Has it been wearisome to wait,
Margaret? I had the work, lectures, study. You had the tedious clays
of teaching idiotic middle-class facts to idiotic middle-class children,
and evenings when you ought to have had me and didn't because I
couldn't lose a single precious moment's chance of study.

Mar. That's clean forgotten. To-night is worth it all.

Peter. To-night, and the future, Margaret.

Mar. (solemnly). Yes, the future, Peter.

Peter. This night was always in my dreams. The night when I


should come to you and say, Margaret Shawcross, this have I done
for you, because you wanted it. Was it well done, Margaret?

Mar. Nobly done.

Peter. And the labourer is worthy of his hire? I ask for my reward.

Mar. (shaking her head). I can give you no reward that's big
enough.
Peter. You can give the greatest prize on earth. We ought to have
been married long ago. I've kept you waiting.

Mar. That had to be. They won't have married women teachers at
the Midlandton High School. I couldn't burden you until this fight
was fought.

Peter. And now, Margaret?

Mar. Now I'm ready—if——

Peter. More if's?

Mar. A very little one. If you've money to keep us three. No going


short for mother.

Peter. You trust me, don't you?

Mar. (giving hand). Yes, Peter, I trust you.

Peter (bursting with thoughts). There's my journalism. This


degree 'ull give me a lift at that. I shall get lecture engagements too.

Mar. (alarmed). Peter, you didn't do it for that!

Peter. I did it for you. But I mean to enjoy the fruits of all this
work. Public speaking's always been a joy to me. You don't know the
glorious sensation of holding a crowd in the hollow of your hand,
mastering it, doing what you like with it.

Mar. (sadly). I hoped you'd given up speaking.

Peter. I haven't spoken lately because I'd other things to do. I
haven't given it up.

Mar. You did too much before.

Peter. You don't know the fascination of the thing.

Mar. (bracing herself for a tussle). I know the fascination's fatal. I
saw it growing on you—this desire to speak, to be the master of a
mob. I hoped I'd cured you of it.

Peter. Cured me?

Mar. I thought I'd given you a higher aim.

Peter. And that was why you urged this study on me?

Mar. Yes.

Peter. Margaret! Why? (Backing from her, and sitting centre
during her speech.)

Mar. I've seen men ruined by this itch to speak. You know them.
Men we had great hopes of in the movement. Men we thought
would be real leaders of the people. And they spoke, and spoke, and
soon said all they had to say, became mere windbags trading on a
reputation till people tired and turned to some new orator. Don't be
one of these, Peter. You've solider grit than they. The itch to speak is
like the itch to drink, except that it's cheaper to talk yourself tipsy.

Peter. You ask a great thing of me, Margaret.

Mar. (sitting right). What shall I see of you if you're out speaking
every night? You pitied me just now because you had to close your
door against me while you studied. I could bear that for the time.
But this other thing, married and widowed at once, with you out at
your work all day and away night after night——

Peter. But I shan't always be working in the daytime.

Mar. (alarmed). Not work! Peter—they haven't dismissed you?

Peter. Oh, no. I'm safe if anyone is safe. No one is, of course, but
I'm as safe as man can be. I'm a first-class workman.

Mar. I know that, dear.

Peter. So do they. They'll not sack me. I might sack them some
day.

Mar. But—how shall we live?

Peter (impatiently). Oh, not yet. I'm speaking of the future. Don't
you see? I'm not content to be a workman all my life. I ought to
make a living easily by writing and—and speaking if you'll let me.
Then I could be with you all day long.

Mar. (looking straight in front of her). Have I set fire to this train?

Peter. You don't suppose a B.A. means to stick to manual labour
all his life, do you?

Mar. Oh, dear! This wasn't my idea at all. I wanted you to win
your degree for the honour of the thing, to show them what a
working engineer could do. Cease to be a workman and you confess
another, worse motive. It's as though you only passed to make a
profit for yourself.

Peter. I can't help being ambitious. I wasn't till you set me on.

Mar. If you listened to me then, listen to me now.

Peter (pushing his chair back and rising). I might have a career.
(Crossing to fireplace.)

Mar. (still sitting). And I might have a husband. I don't want to
marry a career, Peter.