USING OPENCL: PROGRAMMING MASSIVELY PARALLEL COMPUTERS
Advances in Parallel Computing
This book series publishes research and development results on all aspects of parallel
computing. Topics may include one or more of the following: high-speed computing
architectures (Grids, clusters, Service Oriented Architectures, etc.), network technology,
performance measurement, system software, middleware, algorithm design,
development tools, software engineering, services and applications.
Series Editor:
Professor Dr. Gerhard R. Joubert
Volume 21
Recently published in this series
Vol. 20. I. Foster, W. Gentzsch, L. Grandinetti and G.R. Joubert (Eds.), High
Performance Computing: From Grids and Clouds to Exascale
Vol. 19. B. Chapman, F. Desprez, G.R. Joubert, A. Lichnewsky, F. Peters and T. Priol
(Eds.), Parallel Computing: From Multicores and GPU’s to Petascale
Vol. 18. W. Gentzsch, L. Grandinetti and G. Joubert (Eds.), High Speed and Large Scale
Scientific Computing
Vol. 17. F. Xhafa (Ed.), Parallel Programming, Models and Applications in Grid and
P2P Systems
Vol. 16. L. Grandinetti (Ed.), High Performance Computing and Grids in Action
Vol. 15. C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr and F.
Peters (Eds.), Parallel Computing: Architectures, Algorithms and Applications
Janusz Kowalik
16477-107th PL NE, Bothell, WA 98011, USA
and
Tadeusz Puźniakowski
UG, MFI, Wita Stwosza Street 57, 80-952 Gdańsk, Poland
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, without prior written permission from the publisher.
Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: order@iospress.nl
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.
Preface
This book contains the most important and essential information required for de-
signing correct and efficient OpenCL programs. Some details have been omitted but
can be found in the provided references. The authors assume that readers are famil-
iar with basic concepts of parallel computation, have some programming experience
with C or C++ and have a fundamental understanding of computer architecture.
In the book, all terms, definitions and function signatures have been copied from
the official API documents available on the website of the creators of the OpenCL standard.
The book was written in 2011, when OpenCL was in transition from its infancy
to maturity as a practical programming tool for solving real-life problems in science
and engineering. Earlier, the Khronos Group successfully defined OpenCL specifica-
tions, and several companies developed stable OpenCL implementations ready for
learning and testing. A significant contribution to programming heterogeneous com-
puters was made by NVIDIA which created one of the first working systems for pro-
gramming massively parallel computers – CUDA. OpenCL has borrowed from CUDA
several key concepts. At this time (fall 2011), one can install OpenCL on a hetero-
geneous computer and perform meaningful computing experiments. Since OpenCL
is relatively new, there are not many experienced users or sources of practical infor-
mation. One can find on the Web some helpful publications about OpenCL, but there
is still a shortage of complete descriptions of the system suitable for students and
potential users from the scientific and engineering application communities.
Chapter 1 provides short but realistic examples of codes using MPI and OpenMP
in order for readers to compare these two mature and very successful systems with
the fledgling OpenCL. MPI, used for programming clusters, and OpenMP, used for shared
memory computers, have achieved remarkable worldwide success for several rea-
sons. Both have been designed by groups of parallel computing specialists that per-
fectly understood scientific and engineering applications and software development
tools. Both MPI and OpenMP are very compact and easy to learn. Our experience
indicates that it is possible to teach scientists or students whose disciplines are other
than computer science how to use MPI and OpenMP in a matter of several hours. We
hope that OpenCL will benefit from this experience and achieve, in the near future,
a similar success.
Paraphrasing the wisdom of Albert Einstein, we need to simplify OpenCL as
much as possible but not more. The reader should keep in mind that OpenCL will
be evolving and that pioneer users always have to pay an additional price in terms
of initially longer program development time and suboptimal performance before
they gain experience. The goal of achieving simplicity for OpenCL programming re-
quires an additional comment. OpenCL supporting heterogeneous computing offers
us opportunities to select diverse parallel processing devices manufactured by differ-
ent vendors in order to achieve near-optimal or optimal performance. We can select
multi-core CPUs, GPUs, FPGAs and other parallel processing devices to fit the prob-
lem we want to solve. This flexibility is welcomed by many users of HPC technology,
but it has a price.
Programming heterogeneous computers is somewhat more complicated than
writing programs in conventional MPI and OpenMP. We hope this gap will disappear
as OpenCL matures and is universally used for solving large scientific and engineer-
ing problems.
vii
Acknowledgements
It is our pleasure to acknowledge assistance and contributions made by several per-
sons who helped us in writing and publishing the book.
First of all, we express our deep gratitude to Prof. Gerhard Joubert who has
accepted the book as a volume in the book series he is editing, Advances in Parallel
Computing. We are proud to have our book in his very prestigious book series.
Two members of the Khronos organization, Elizabeth Riegel and Neil Trevett,
helped us with evaluating the initial draft of Chapter 2 Fundamentals and provided
valuable feedback. We thank them for the feedback and for their offer of promoting
the book among the Khronos Group member companies.
Our thanks are due to NVIDIA for two hardware grants that enabled our com-
puting work related to the book.
Our thanks are due to Piotr Arłukowicz, who contributed two sections to the
book and helped us with editorial issues related to using LaTeX and the open-source
Blender 3D modeling program.
We thank two persons who helped us improve the book structure and the lan-
guage. They are Dominic Eschweiler from FIAS, Germany and Roberta Scholz from
Redmond, USA.
We also thank several friends and family members who helped us indirectly by
supporting our book-writing effort in various ways.
Janusz Kowalik
Tadeusz Puźniakowski
Contents
1 Introduction 1
1.1 Existing Standard Parallel Programming Systems . . . . . . . . . . . . . . 1
1.1.1 MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Two Parallelization Strategies: Data Parallelism and Task Parallelism . 9
1.2.1 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Task Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 History and Goals of OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Origins of Using GPU in General Purpose Computing . . . . . . 12
1.3.2 Short History of OpenCL . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Heterogeneous Computer Memories and Data Transfer . . . . . . . . . . 14
1.4.1 Heterogeneous Computer Memories . . . . . . . . . . . . . . . . . 14
1.4.2 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.3 The Fourth Generation CUDA . . . . . . . . . . . . . . . . . . . . . 15
1.5 Host Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5.1 Phase a. Initialization and Creating Context . . . . . . . . . . . . 17
1.5.2 Phase b. Kernel Creation, Compilation and Preparations . . . . . 17
1.5.3 Phase c. Creating Command Queues and Kernel Execution . . . 17
1.5.4 Finalization and Releasing Resource . . . . . . . . . . . . . . . . . 18
1.6 Applications of Heterogeneous Computing . . . . . . . . . . . . . . . . . . 18
1.6.1 Accelerating Scientific/Engineering Applications . . . . . . . . . 19
1.6.2 Conjugate Gradient Method . . . . . . . . . . . . . . . . . . . . . . 19
1.6.3 Jacobi Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6.4 Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6.5 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.7 Benchmarking CGM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.2 Additional CGM Description . . . . . . . . . . . . . . . . . . . . . . 24
1.7.3 Heterogeneous Machine . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.4 Algorithm Implementation and Timing Results . . . . . . . . . . 24
1.7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 OpenCL Fundamentals 27
2.1 OpenCL Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 What is OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 CPU + Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.3 Massive Parallelism Idea . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.4 Work Items and Workgroups . . . . . . . . . . . . . . . . . . . . . . 29
2.1.5 OpenCL Execution Model . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.6 OpenCL Memory Structure . . . . . . . . . . . . . . . . . . . . . . . 30
2.1.7 OpenCL C Language for Programming Kernels . . . . . . . . . . . 30
2.1.8 Queues, Events and Context . . . . . . . . . . . . . . . . . . . . . . 30
2.1.9 Host Program and Kernel . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.10 Data Parallelism in OpenCL . . . . . . . . . . . . . . . . . . . . . . 31
2.1.11 Task Parallelism in OpenCL . . . . . . . . . . . . . . . . . . . . . . 32
2.2 How to Start Using OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.1 Header Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3 Platforms and Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 OpenCL Platform Properties . . . . . . . . . . . . . . . . . . . . . . 36
2.3.2 Devices Provided by Platform . . . . . . . . . . . . . . . . . . . . . 37
2.4 OpenCL Platforms – C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 OpenCL Context to Manage Devices . . . . . . . . . . . . . . . . . . . . . . 41
2.5.1 Different Types of Devices . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.2 CPU Device Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.3 GPU Device Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.4 Accelerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.5 Different Device Types – Summary . . . . . . . . . . . . . . . . . . 44
2.5.6 Context Initialization – by Device Type . . . . . . . . . . . . . . . 45
2.5.7 Context Initialization – Selecting Particular Device . . . . . . . . 46
2.5.8 Getting Information about Context . . . . . . . . . . . . . . . . . . 47
2.6 OpenCL Context to Manage Devices – C++ . . . . . . . . . . . . . . . . . 48
2.7 Error Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.7.1 Checking Error Codes . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.7.2 Using Exceptions – Available in C++ . . . . . . . . . . . . . . . . 53
2.7.3 Using Custom Error Messages . . . . . . . . . . . . . . . . . . . . . 54
2.8 Command Queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.8.1 In-order Command Queue . . . . . . . . . . . . . . . . . . . . . . . 55
2.8.2 Out-of-order Command Queue . . . . . . . . . . . . . . . . . . . . 57
2.8.3 Command Queue Control . . . . . . . . . . . . . . . . . . . . . . . 60
2.8.4 Profiling Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8.5 Profiling Using Events – C example . . . . . . . . . . . . . . . . . . 61
2.8.6 Profiling Using Events – C++ example . . . . . . . . . . . . . . . 63
2.9 Work-Items and Work-Groups . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.9.1 Information About Index Space from a Kernel . . . . . . . . . . 66
2.9.2 NDRange Kernel Execution . . . . . . . . . . . . . . . . . . . . . . 67
2.9.3 Task Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.9.4 Using Work Offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.10 OpenCL Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.10.1 Different Memory Regions – the Kernel Perspective . . . . . . . . 71
2.10.2 Relaxed Memory Consistency . . . . . . . . . . . . . . . . . . . . . 73
2.10.3 Global and Constant Memory Allocation – Host Code . . . . . . 75
2.10.4 Memory Transfers – the Host Code . . . . . . . . . . . . . . . . . . 78
2.11 Programming and Calling Kernel . . . . . . . . . . . . . . . . . . . . . . . . 79
2.11.1 Loading and Compilation of an OpenCL Program . . . . . . . . . 81
2.11.2 Kernel Invocation and Arguments . . . . . . . . . . . . . . . . . . 88
2.11.3 Kernel Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.11.4 Supported Scalar Data Types . . . . . . . . . . . . . . . . . . . . . 90
2.11.5 Vector Data Types and Common Functions . . . . . . . . . . . . . 92
2.11.6 Synchronization Functions . . . . . . . . . . . . . . . . . . . . . . . 94
2.11.7 Counting Parallel Sum . . . . . . . . . . . . . . . . . . . . . . . . . 96
2.11.8 Parallel Sum – Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.11.9 Parallel Sum – Host Program . . . . . . . . . . . . . . . . . . . . . 100
2.12 Structure of the OpenCL Host Program . . . . . . . . . . . . . . . . . . . . 103
2.12.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.12.2 Preparation of OpenCL Programs . . . . . . . . . . . . . . . . . . . 106
2.12.3 Using Binary OpenCL Programs . . . . . . . . . . . . . . . . . . . . 107
2.12.4 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
2.12.5 Release of Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.13 Structure of OpenCL host Programs in C++ . . . . . . . . . . . . . . . . . 114
2.13.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.13.2 Preparation of OpenCL Programs . . . . . . . . . . . . . . . . . . . 115
2.13.3 Using Binary OpenCL Programs . . . . . . . . . . . . . . . . . . . . 116
2.13.4 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
2.13.5 Release of Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2.14 The SAXPY Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.14.1 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
2.14.2 The Example SAXPY Application – C Language . . . . . . . . . . 123
2.14.3 The example SAXPY application – C++ language . . . . . . . . 128
2.15 Step by Step Conversion of an Ordinary C Program to OpenCL . . . . . 131
2.15.1 Sequential Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.15.2 OpenCL Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.15.3 Data Allocation on the Device . . . . . . . . . . . . . . . . . . . . . 134
2.15.4 Sequential Function to OpenCL Kernel . . . . . . . . . . . . . . . 135
2.15.5 Loading and Executing a Kernel . . . . . . . . . . . . . . . . . . . . 136
2.15.6 Gathering Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
2.16 Matrix by Vector Multiplication Example . . . . . . . . . . . . . . . . . . . 139
2.16.1 The Program Calculating matrix × vector . . . . . . . . . . . . . 140
2.16.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.16.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.16.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
3.1.2 Detecting Available Extensions from API . . . . . . . . . . . . . . 148
3.1.3 Using Runtime Extension Functions . . . . . . . . . . . . . . . . . 149
3.1.4 Using Extensions from OpenCL Program . . . . . . . . . . . . . . 153
3.2 Debugging OpenCL codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.2.1 Printf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.2.2 Using GDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.3 Performance and Double Precision . . . . . . . . . . . . . . . . . . . . . . 162
3.3.1 Floating Point Arithmetics . . . . . . . . . . . . . . . . . . . . . . . 162
3.3.2 Arithmetics Precision – Practical Approach . . . . . . . . . . . . . 165
3.3.3 Profiling OpenCL Application . . . . . . . . . . . . . . . . . . . . . 172
3.3.4 Using the Internal Profiler . . . . . . . . . . . . . . . . . . . . . . . 173
3.3.5 Using External Profiler . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.3.6 Effective Use of Memories – Memory Access Patterns . . . . . . . 183
3.3.7 Matrix Multiplication – Optimization Issues . . . . . . . . . . . . 189
3.4 OpenCL and OpenGL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
3.4.1 Extensions Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
3.4.2 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
3.4.3 Header Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
3.4.4 Common Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
3.4.5 OpenGL Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 198
3.4.6 OpenCL Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 201
3.4.7 Creating Buffer for OpenGL and OpenCL . . . . . . . . . . . . . . 203
3.4.8 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
3.4.9 Generating Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
3.4.10 Running Kernel that Operates on Shared Buffer . . . . . . . . . . 215
3.4.11 Results Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
3.4.12 Message Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.4.13 Cleanup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.4.14 Notes and Further Reading . . . . . . . . . . . . . . . . . . . . . . . 221
3.5 Case Study – Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.1 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.3 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
3.5.4 Example Problem Definition . . . . . . . . . . . . . . . . . . . . . . 225
3.5.5 Genetic Algorithm Implementation Overview . . . . . . . . . . . 225
3.5.6 OpenCL Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
3.5.7 Most Important Elements of Host Code . . . . . . . . . . . . . . . 234
3.5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
3.5.9 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
A.2.2 Blocks and Threads Indexing Formulas . . . . . . . . . . . . . . . 257
A.2.3 Runtime Error Handling . . . . . . . . . . . . . . . . . . . . . . . . 260
A.2.4 CUDA Driver API Example . . . . . . . . . . . . . . . . . . . . . . . 262
Chapter 1
Introduction
1.1. Existing Standard Parallel Programming Systems
1.1.1. MPI
MPI is a programming system but not a programming language. It is a library of func-
tions for C and subroutines for FORTRAN that are used for message communication
between parallel processes created by MPI. The message-passing computing model
(Fig. 1.1) is a collection of interconnected processes that use their local memories
exclusively. Each process has an individual address identification number called the
rank. Ranks are used for sending and receiving messages and for workload distribu-
tion. There are two kinds of messages: point-to-point messages and collective mes-
sages. Point-to-point message C functions contain several parameters: the address of
the sender, the address of the receiver, the message size and type and some additional
information. In general, message parameters describe the nature of the transmitted
data and the delivery envelope description.
The collection of processes that can exchange messages is called the communica-
tor. In the simplest case there is only one communicator and each process is assigned
to one processor. In more general settings, there are several communicators and sin-
gle processors serve several processes. A process rank is usually an integer running
from 0 to p−1, where p is the total number of processes. It is also possible to
number processes in a more general way than by consecutive integer numbers. For
example, their address ID can be a pair of numbers, such as a point (x, y) in Carte-
sian space. This method of identifying processes may be very helpful in handling ma-
trix operations or partial differential equations (PDEs) in two-dimensional Cartesian
space.
The C loop below will be executed by all processes, from the process with
my_rank equal to 0 to the process with my_rank equal to p − 1.
{
    int i;
    for (i = my_rank*N; i < (my_rank+1)*N; i++)
        z[i] = a*x[i] + y[i];
}
For example, the process with my_rank=1 will add one thousand elements
a*x[i] and y[i], from i = N = 1000 to 2N−1 = 1999. The process with
my_rank = p−1 = 9 will add elements from i = n−N = 9000 to n−1 = 9999.
One can now appreciate the usefulness of assigning a rank ID to each process.
This simple concept makes it possible to send and receive messages and share the
workloads as illustrated above. Of course, before each process can execute its code
for computing a part of the vector z it must receive the needed data. This can be
done by one process initializing the vectors x and y and sending appropriate groups
of data to all remaining processes. The computation of SAXPY is an example of data
parallelism. Each process executes the same code on different data.
To implement functional parallelism, where several processes execute different pro-
grams, process ranks can be used. In the SAXPY example the block of code containing
the loop will be executed by all p processes without exception. But if one process, for
example, the process with rank 0, has to do something different from all the others, this could
be accomplished by specifying the task as follows:
if (my_rank == 0)
    { /* execute the specified task */ }
This block of code will be executed only by the process with my_rank==0. In the
absence of "if (my_rank==something)" instructions, all processes will execute the
enclosed block of code. This technique can be used for a case where
processes have to perform several different computations required in the task-parallel
algorithm. MPI is universal. It can express every kind of algorithmic parallelism.
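As an illustration only (this sketch is not taken from the book; the block size N, the
initialization and the variable names are assumed for the example), the following
complete C program shows how the rank returned by MPI_Comm_rank is used both for
data-parallel SAXPY work division and for assigning a special task to the process with
rank 0:

#include <mpi.h>
#include <stdio.h>
#define N 1000                    /* elements handled by each process (assumed) */

int main(int argc, char *argv[])
{
    int my_rank, p;
    double a = 2.0, x[N], y[N], z[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* rank of this process  */
    MPI_Comm_size(MPI_COMM_WORLD, &p);         /* total number of ranks */

    if (my_rank == 0)                          /* a task done by rank 0 only */
        printf("running on %d processes\n", p);

    /* every process fills and computes its own local block of N elements */
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }
    for (int i = 0; i < N; i++)
        z[i] = a*x[i] + y[i];

    MPI_Finalize();
    return 0;
}

In a realistic program the data initialized by rank 0 would first be distributed to the
other ranks with point-to-point messages or a collective operation such as MPI_Scatter.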
An important concern in parallel computing is the efficiency issue. MPI often
can be computationally efficient because all processes use only local memories. On
the other hand, processes require network communication to place data in the proper
process memories at the proper times. This need for moving data is called algorithmic
synchronization, and it creates communication overhead that negatively impacts
parallel program performance. Performance may suffer significantly if the program
sends many short messages. The communication overhead can be reduced by grouping
data for communication: larger user-defined data types allow larger messages to be
sent less frequently, so the per-message latency is incurred fewer times.
On the negative side of the MPI evaluation scorecard is its unsuitability for incre-
mental (part-by-part) conversion of a serial code into an algorithmically equivalent
MPI parallel code. This limitation stems from the relatively high level at which
MPI programs are designed. To design an MPI program, one has to modify algorithms.
This work is done at an earlier stage of the parallel computing process than develop-
ing parallel codes. Algorithmic level modification usually can’t be done piecewise. It
has to be done all at once. In contrast, in OpenMP a programmer makes changes to
sequential codes written in C or FORTRAN.
Fig. 1.2 shows the difference. Code-level modification makes incremental conver-
sions possible. In a commonly practiced conversion technique, a serial code is
converted incrementally from the most compute-intensive program parts to the least
compute intensive parts until the parallelized code runs sufficiently fast. For further
study of MPI, the book [1] is highly recommended.
1. Mathematical model.
2. Computational model.
3. Numerical algorithm and parallel conversion: MPI
4. Serial code and parallel modification: OpenMP
5. Computer runs.
Figure 1.2: The computer solution process. OpenMP will be discussed in the next
section.
1.1.2. OpenMP
OpenMP is a shared address space computer application programming interface
(API). It contains a number of compiler directives that instruct C/C++ or FORTRAN
“OpenMP aware” compilers to execute certain instructions in parallel and distribute
the workload among multiple parallel threads. Shared address space computers fall
into two major groups: centralized memory multiprocessors, also called Symmetric
Multi-Processors (SMP) as shown in Fig. 1.3, and Distributed Shared Memory (DSM)
multiprocessors whose common representative is the cache coherent Non Uniform
Memory Access architecture (ccNUMA) shown in Fig. 1.4.
SMP computers are also called Uniform Memory Access (UMA). They were the
first commercially successful shared memory computers and they still remain popu-
lar. Their main advantage is uniform memory access. Unfortunately for large num-
bers of processors, the bus becomes overloaded and performance deteriorates. For
this reason, the bus-based SMP architectures are limited to about 32 or 64 proces-
sors. Beyond these sizes, single address memory has to be physically distributed.
Every processor has a chunk of the single address space.
Both architectures have single address space and can be programmed by using
OpenMP. OpenMP parallel programs are executed by multiple independent threads
Figure 1.4: ccNUMA with cache coherent interconnect.
that are streams of instructions having access to shared and private data, as shown
in Fig. 1.5.
The programmer can explicitly define data that are shared and data that are
private. Private data can be accessed only by the thread owning these data. The flow
of OpenMP computation is shown in Fig. 1.6. One thread, the master thread, runs
continuously from the beginning to the end of the program execution. The worker
threads are created for the duration of parallel regions and then are terminated.
When a programmer inserts a proper compiler directive, the system creates a
team of worker threads and distributes workload among them. This operation point
is called the fork. After forking, the program enters a parallel region where parallel
processing is done. After all threads finish their work, the program worker threads
cease to exist or are redeployed while the master thread continues its work. The
event of returning to the master thread is called the join. The fork and join tech-
nique may look like a simple and easy way for parallelizing sequential codes, but the
Figure 1.6: The fork and join OpenMP code flow.
reader should not be deceived. There are several difficulties that will be described
and discussed shortly. First of all, how to find whether several tasks can be run in
parallel?
In order for any group of computational tasks to be correctly executed in parallel,
they have to be independent. That means that the result of the computation does not
depend on the order of task execution. Formally, two programs or parts of programs
are independent if they satisfy the Bernstein conditions shown in (1.1):

    I_j ∩ O_i = ∅
    I_i ∩ O_j = ∅                  (1.1)
    O_i ∩ O_j = ∅
Letters I and O signify input and output. The ∩ symbol means the intersection
of two sets that belong to task i or j. In practice, determining if some group of
tasks can be executed in parallel has to be done by the programmer and may not
be very easy. To discuss other shared memory computing issues, consider a very
simple programming task: computing the dot product of two large vectors x and y,
each having n components. The sequential computation would
be accomplished by a simple C loop
double dp = 0;
for (int i = 0; i < n; i++)
    dp += x[i]*y[i];
Inserting, in front of the above for-loop, the OpenMP directive for parallelizing
the loop and declaring shared and private variables leads to:
double dp = 0;
int i;
#pragma omp parallel for shared(x,y,dp) private(i)
for (i = 0; i < n; i++)
    dp += x[i]*y[i];

This version, however, contains a race condition: several threads update the shared
variable dp concurrently, so the result is unpredictable. One remedy is to protect the
update with the critical construct:
double dp = 0;
int i;
#pragma omp parallel for shared(x,y,dp) private(i)
for (i = 0; i < n; i++) {
    #pragma omp critical
    dp += x[i]*y[i];
}

A more efficient alternative uses the reduction clause:
double dp = 0;
int i;
#pragma omp parallel for reduction(+:dp) shared(x,y) private(i)
for (i = 0; i < n; i++)
    dp += x[i]*y[i];
Use of the critical construct amounts to serializing the computation of the critical
region, the block of code following the critical construct. For this reason, large critical
regions degrade program performance and ought to be avoided. Using the reduction
clause is more efficient and preferable. The reader may have noticed that the loop
counter variable i is declared private, so each thread updates its loop counter in-
dependently without interference. In addition to the parallelizing construct parallel
for that applies to C for-loops, there is a more general sections construct that paral-
lelizes independent sections of code. The sections construct makes it possible to apply
OpenMP to task parallelism where several threads can compute different sections of
a code. For computing two parallel sections, the code structure is:
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        /* some program segment computation */
        #pragma omp section
        /* another program segment computation */
    } /* end of sections block */
} /* end of parallel region */
OpenMP has tools that can be used by programmers for improving performance.
One of them is the clause nowait. Consider the program fragment:
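The original listing is not reproduced here; the sketch below (with illustrative
arrays b, c and d, an illustrative result g, and an arbitrary computation in each loop)
is consistent with the description that follows:

void example(int n, double *b, double *c, double *d, double *g)
{
    int i;
    #pragma omp parallel shared(n, b, c, d, g) private(i)
    {
        #pragma omp for nowait      /* a thread may proceed as soon as its share is done */
        for (i = 0; i < n; i++)
            c[i] = b[i] * b[i];

        #pragma omp for             /* second loop; does not use results of the first */
        for (i = 0; i < n; i++)
            d[i] = 2.0 * b[i];

        #pragma omp barrier         /* all of d must be complete ...                  */
        #pragma omp single
        {                           /* ... before g, a function of d, is computed     */
            double s = 0.0;
            for (int k = 0; k < n; k++) s += d[k];
            *g = s;
        }
    }
}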
In the first parallelized loop, there is the clause nowait. Since the second loop
variables do not depend on the results of the first loop, it is possible to use the clause
nowait – telling the compiler that as soon as any first loop thread finishes its work, it
can start doing the second loop work without waiting for other threads. This speeds
up computation by reducing the potential waiting time for the threads that finish
work at different times.
On the other hand, the construct #pragma omp barrier is inserted after the
second loop to make sure that the second loop is fully completed before the calcula-
tion of g, which is a function of d, is performed. At the barrier, all
threads computing the second loop must wait for the last thread to finish before they
proceed. In addition to the explicit barrier construct #pragma omp barrier used by
the programmer, there are also implicit barriers used by OpenMP automatically at
the end of every parallel region for the purpose of thread synchronization. To sum
up, barriers should be used sparingly and only if necessary. nowait clauses should
be used as frequently as possible, provided that their use is safe.
Finally, we have to point out that the major performance issue in numerical com-
putation is the use of cache memories. For example, in computing the matrix/vector
product c = Ab, two mathematically equivalent methods could be used. In the first
method, elements of the vector c are computed by multiplying each row of A by the
vector b, i.e., computing dot products. An inferior performance will be obtained if c
is computed as the sum of the columns of A multiplied by the elements of b. In this
case, the program would access the columns of A in an order that does not match the
way the matrix data are stored in main memory and transferred to caches (C stores
matrices row by row).
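A minimal sketch of the two access patterns, assuming the C row-major storage
convention and illustrative function names:

/* c = A*b for an n-by-n matrix A stored row-major (C convention) */

/* Method 1: dot products of rows of A with b - contiguous, cache-friendly access */
void matvec_rows(int n, const double *A, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        double dot = 0.0;
        for (int j = 0; j < n; j++)
            dot += A[i*n + j] * b[j];   /* A is traversed in storage order */
        c[i] = dot;
    }
}

/* Method 2: sum of columns of A scaled by elements of b - strided, cache-unfriendly */
void matvec_cols(int n, const double *A, const double *b, double *c)
{
    for (int i = 0; i < n; i++) c[i] = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            c[i] += A[i*n + j] * b[j];  /* stride-n jumps through A */
}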
For further in-depth study of OpenMP and its performance, reading [2] is highly
recommended. Of special value for parallel computing practitioners are Chapter 5
“How to get Good Performance by Using OpenMP” and Chapter 6 “Using OpenMP
in the Real World”. Chapter 6 offers advice for and against using combined MPI and
OpenMP. A chapter on combining MPI and OpenMP can also be found in [3]. Like
MPI and OpenMP, the OpenCL system is standardized. It has been designed to run
regardless of processor types, operating systems and memories. This makes OpenCL
programs highly portable, but the method of developing OpenCL codes is more com-
plicated. The OpenCL programmer has to deal with several low-level programming
issues, including memory management.
their own data sets. Task parallelism can be called Multiple Programs Multiple Data
(MPMD). A small-size example of a task parallel problem is shown in Fig. 1.8. The
directed graph indicates the task execution precedence. Two tasks can execute in
parallel if they are not dependent.
In the case shown in Fig. 1.8, there are two options for executing the entire
set of tasks. Option 1. Execute tasks T1, T2 and T4 in parallel, followed by Task T3
and finally T5. Option 2. Execute tasks T1 and T2 in parallel, then T3 and T4 in
parallel and finally T5. In both cases, the total computing work is equal but the time
to solution may not be.
1.2.3. Example
Consider a problem that can be computed in both ways – via data parallelism and
task parallelism. The problem is to calculate C = A × B − (D + E) where A, B, D
and E are all square matrices of size n × n. An obvious task parallel version would
be to compute in parallel two tasks A × B and D + E and then subtract the sum
from the product. Of course, the task of computing A × B and the task of computing
the sum D + E can be calculated in a data parallel fashion. Here, there are two
levels of parallelism: the higher task level and the lower data parallel level. Similar
multilevel parallelism is common in real-world applications. Not surprisingly, there
is also for this problem a direct data parallel method based on the observation that
every element of C can be computed directly and independently from the coefficients
of A, B, D and E. This computation is shown in equation 1.2.
    c_ij = Σ_{k=0}^{n−1} a_ik b_kj − d_ij − e_ij        (1.2)
The equation (1.2) means that it is possible to calculate all n² elements of C
in parallel using the same formula. This is good news for OpenCL devices that can
handle only one kernel and related data parallel computation. Those devices will
be discussed in the chapters that follow. The matrix C can be computed using three
standard programming systems: MPI, OpenMP and OpenCL.
Using MPI, the matrix C could be partitioned into sub-matrix components and
assigned the subcomponents to processes. Every process would compute a set of
elements of C, using the expression 1.2. The main concern here is not computation
itself but the ease of assembling results and minimizing the cost of communication
while distributing data and assembling the results. A reasonable partitioning would
be dividing C by blocks of rows that can be scattered and gathered as user-defined
data types. Each process would get a set of the rows of A, D and E and the entire
matrix B. If matrix C size is n and the number of processes is p, then each process
would get n/p rows of C to compute. If a new data type is defined as n/p rows,
the data can easily be distributed by strips of rows to processes and then results can
be gathered. The suggested algorithm is a data parallel method. Data partitioning is
shown in Fig. 1.9.
Figure 1.9: Strips of data needed by a process to compute the topmost strip of C.
The data needed for each process include one strip of A, D and E and the entire
matrix B. Each process of rank 0<=rank<p computes one strip of C rows. After fin-
ishing computation, the matrix C can be assembled by the collective MPI communi-
cation function MPI_Gather. An alternative approach would be partitioning C into
blocks and assigning each process to compute one block of C. However, assembling
the results would be harder than in the strip partitioning case.
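A sketch of the strip-partitioned computation described above is given below. It is
not the book's code: it assumes n is divisible by p, row-major storage, and that rank 0
holds the full matrices; it scatters contiguous strips of n/p rows instead of defining a
derived data type, and allocation of B on every rank and error checking are omitted.

#include <mpi.h>
#include <stdlib.h>

void compute_C(int n, double *A, double *B, double *D, double *E, double *C)
{
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows  = n / p;          /* rows of C computed by each process */
    int strip = rows * n;       /* number of elements in one strip    */

    double *a = malloc(strip * sizeof(double));   /* local strips of A, D, E, C */
    double *d = malloc(strip * sizeof(double));
    double *e = malloc(strip * sizeof(double));
    double *c = malloc(strip * sizeof(double));

    /* distribute strips of rows; every process also needs the whole of B */
    MPI_Scatter(A, strip, MPI_DOUBLE, a, strip, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(D, strip, MPI_DOUBLE, d, strip, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(E, strip, MPI_DOUBLE, e, strip, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* each element of the local strip of C follows equation (1.2) */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i*n + k] * B[k*n + j];
            c[i*n + j] = s - d[i*n + j] - e[i*n + j];
        }

    /* reassemble the full matrix C on rank 0 */
    MPI_Gather(c, strip, MPI_DOUBLE, C, strip, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(a); free(d); free(e); free(c);
}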
OpenMP would take advantage of the task parallel approach. First, the subtasks
A × B and D + E would be parallelized and computed separately, and then C = A ×
B − (D + E) would be computed, as shown in Fig. 1.10. One weakness of task parallel
programs is the efficiency loss if all parallel tasks do not represent equal workloads.
For example, in the matrix C computation, the task of computing A×B and the task of
computing D+E are not load-equal. Unequal workloads could cause some threads to
idle unless tasks are selected dynamically for execution by the scheduler. In general,
data parallel implementations are well load balanced and tend to be more efficient.
In the case of OpenCL implementation of the data parallel option for computing C, a
single kernel function would compute elements of C by the equation 1.2 where the
sum represents dot products of A × B and the remaining terms represent subtracting
D and E. If the matrix size is n = 1024, a compute device could execute over one
million work items in parallel. Additional performance gain can be achieved by using
the tiling technique [4].
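A kernel implementing equation (1.2) could look like the following sketch (not taken
from the book; the kernel name and the float data type are illustrative). The host
would enqueue it over a two-dimensional index space of n × n work items:

__kernel void compute_c(const int n,
                        __global const float *A,
                        __global const float *B,
                        __global const float *D,
                        __global const float *E,
                        __global float *C)
{
    int i = get_global_id(0);     /* row index of the element    */
    int j = get_global_id(1);     /* column index of the element */
    if (i < n && j < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)       /* dot product: row i of A with column j of B */
            sum += A[i*n + k] * B[k*n + j];
        C[i*n + j] = sum - D[i*n + j] - E[i*n + j];   /* equation (1.2) */
    }
}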
Figure 1.10: The task parallel approach for OpenMP.
Numerical algorithms containing data parallel computations are discussed in Section
1.6 of this Chapter.
As a parallel programming system, OpenCL is preceded by MPI, OpenMP and
CUDA. Parallel regions in OpenMP are comparable to parallel kernel executions
in OpenCL. A crucial difference is OpenMP’s limited scalability due to heavy over-
head in creating and managing threads. Threads used on heterogeneous systems are
lightweight, hence suitable for massive parallelism. The most similar to OpenCL is
NVIDIA’s CUDA (Compute Unified Device Architecture) because OpenCL heavily bor-
rowed from CUDA its fundamental features. Readers who know CUDA will find it
relatively easy to learn and use OpenCL after becoming familiar with the mapping
between CUDA and OpenCL technical terms and minor programming differences.
The technical terminology differences are shown in Tab. 1.1.

Table 1.1: CUDA and OpenCL terminology.
    CUDA            OpenCL
    Thread          Work item
    Block           Work group
    Grid            Index space
Similar one-to-one mapping exists for the CUDA and OpenCL API calls.
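For example, the mapping shows up directly in how a kernel obtains its global index.
The OpenCL SAXPY sketch below (illustrative, not from the book) notes the
corresponding CUDA expression in a comment:

__kernel void saxpy_cl(int n, float a, __global const float *x, __global float *y)
{
    /* OpenCL: the global work-item id; the CUDA equivalent would be
       blockIdx.x * blockDim.x + threadIdx.x inside a __global__ kernel */
    int i = get_global_id(0);
    if (i < n)
        y[i] = a * x[i] + y[i];
}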
1.4. Heterogeneous Computer Memories and Data
Transfer
1.4.1. Heterogeneous Computer Memories
A device’s main memory is the global memory (Fig. 1.11). All work items have access
to this memory. It is the largest but also the slowest memory on a heterogeneous
device. The global memory can be dynamically allocated by the host program and
can be read and written by the host and the device.
The constant memory can also be dynamically allocated by the host. It can be
used for read and write operations by the host, but it is read-only for the device. All
work items can read from it. This memory can be fast if the system has a supporting
cache.
Local memories are shared by work items within a work group. For example, two
work items can synchronize their operation using this memory if they belong to the
same work group. Local memories can’t be accessed by the host.
Private memories can be accessed only by each individual work item. They are
registers. A kernel can use only these four device memories.
In contrast, the host system can use the host memory and the global/constant
memories on the device. If a device is a multicore CPU, then all device memories are
portions of the RAM.
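The four device memory regions correspond to address space qualifiers in OpenCL C.
The following illustrative kernel (not from the book) touches all of them:

__kernel void memory_regions(__global float *data,     /* global memory              */
                             __constant float *coeff,  /* constant memory            */
                             __local float *scratch)   /* local, shared per work group */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    float v = data[gid] * coeff[0];   /* private variable v typically lives in registers */
    scratch[lid] = v;                 /* visible to all work items of the same group     */

    barrier(CLK_LOCAL_MEM_FENCE);     /* work items of one group synchronize here        */

    data[gid] = scratch[lid];         /* result written back to global memory            */
}

The __local buffer is not allocated inside the kernel; the host reserves its size with
clSetKernelArg, passing NULL as the argument value.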
Figure 1.13: The CGM algorithm iterations. The only data transfer takes place at the
very beginning of the iterative process.
and global GPU memory. This change will simplify programming systems that com-
bine the CPU and GPU and will eliminate currently necessary data transfers between
the host and the device.
Kernels are written in the OpenCL C language and executed on highly parallel
devices. They provide performance improvements. An example of a kernel code is
shown in section 2.1.
In this section, the structure and the functionality of host programs are described.
The main purpose of the host code is to manage device(s). More specifically, host
codes arrange and submit kernels for execution on device(s). In general, a host code
has four phases: (a) initialization and creating a context; (b) kernel creation, compi-
lation and preparations; (c) creating command queues and kernel execution; and
(d) finalization and releasing resources.
Section 2.8 in Chapter 2 provides details of the four phases and an example of
the host code structure. This section is an introduction to section 2.11 in Chapter
2. This introduction may be useful for readers who start learning OpenCL since the
concept of the host code has not been used in other parallel programming systems.
1.5.4. Finalization and Releasing Resource
After finishing computation, each code should perform a cleanup operation that in-
cludes releasing of resources in preparation for the next application. The entire host
code structure is shown in Fig. 1.15. In principle, the host code can also perform
some algorithmic computation that is not executed on the device – for example, the
initial data preparation.
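A compressed sketch of such a host code is shown below. It is not the book's example:
error checking, data transfers and buffer initialization are omitted, the one-line kernel
source is only a placeholder, and an OpenCL 1.x installation is assumed.

#include <CL/cl.h>

const char *src = "__kernel void k(__global float *a) { a[get_global_id(0)] *= 2.0f; }";

int main(void)
{
    /* Phase a: initialization and creating a context */
    cl_platform_id platform;  cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    /* Phase b: kernel creation, compilation and preparations */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "k", NULL);
    size_t n = 1024;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* Phase c: creating a command queue and kernel execution */
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    /* Phase d: finalization and releasing resources */
    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}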
1.6.1. Accelerating Scientific/Engineering Applications
Computational linear algebra has been regarded as the workhorse for applied math-
ematics in nearly all scientific and engineering applications. Software systems such
as Linpack have been used heavily by supercomputer users solving large-scale prob-
lems.
Algebraic linear equations solvers have been used for many years as the perfor-
mance benchmark for testing and ranking the fastest 500 computers in the world.
The latest (2010) TOP500 champion is a Chinese computer, the Tianhe-1A.
The Tianhe supercomputer is a heterogeneous machine. It has 14336 Intel Xeon
CPUs and 7168 NVIDIA Tesla M2050 GPUs.
The champion’s Linpack benchmark performance record is about 2.5 petaflops
(2.5 × 10¹⁵ floating-point operations per second). Such impressive
speed has been achieved to a great extent by using massively parallel GPU accelera-
tors. An important question arises: do commonly used scientific and engineering so-
lution algorithms contain components that can be accelerated by massively parallel
devices? To partially answer this question, this section of the book considers several
numerical algorithms frequently used by scientists and engineers.
(x^(0) ∈ ℜ^n given)
1.  x := x^(0)
2.  r := b − Ax
3.  p := r
4.  α := ‖r‖²
5.  while α > tol²:
6.      λ := α/(pᵀAp)
7.      x := x + λp
8.      r := r − λAp
9.      p := r + (‖r‖²/α)p
10.     α := ‖r‖²
11. end
The CGM algorithm solves the system of linear equations Ax = b where the ma-
trix A is positive definite. Every iteration of the CGM algorithm requires one matrix-
vector multiplication Ap, two vector dot products:
    dp = xᵀy = Σ_{i=0}^{n−1} x_i y_i        (1.3)

iteration is N = n² + 10n. The dominant first term is contributed by the matrix-vector
multiplication w = Ap.
For large problems, the first term will be several orders of magnitude greater
than the second linear term. For this reason, we may be tempted to compute in
parallel on the device only the operation w = Ap.
The matrix-vector multiplication can be done in parallel by computing the ele-
ments of w as dot products of the rows of A and the vector p. The parallel compute
time will now be proportional to 2n instead of 2n², where n is the vector size. This
would mean faster computing and superior scalability. Unfortunately, sharing com-
putation of each iteration by the CPU and the GPU requires data transfers between
the CPU and the device, as depicted in Fig. 1.16.
Figure 1.16: The CPU+GPU shared execution of the CGM with data transfer between
the CPU and the GPU.
In the CGM algorithm, the values of p and w change in every iteration and
need to be transferred if q is computed by the GPU and d is computed by the CPU.
The matrix A remains constant and need not be moved. To avoid data transfers that
would seriously degrade performance, it has been decided to compute the entire
iteration on the device. It is common practice to regard the CGM as an iterative
method, although for a problem of size n the method converges in n iterations if
exact arithmetic is used. The speed of the convergence depends upon the condition
number for the matrix A. For a positive definite matrix, the condition number is
the ratio of the largest to the smallest eigenvalues of A. The speed of convergence
increases as the condition number approaches 1.0. There are methods for improving
the conditioning of A. The CGM method with improved conditioning of A is called the
preconditioned CGM and is often used in practical applications. Readers interested
in preconditioning techniques and other mathematical aspects of the CGM may find
useful information in [6] and [7].
1.6.3. Jacobi Method
Another linear equations solver often used for equations derived from discretizing
linear partial differential equations is the stationary iterative Jacobi method. In this
method, the matrix A is split as follows: A = L+D+U where L and U are strictly lower
and upper triangular and D is diagonal. Starting from some initial approximation x 0 ,
the subsequent approximations are computed from x^(k+1) = D⁻¹(b − (L + U)x^(k)).