USING OPENCL
Advances in Parallel Computing
This book series publishes research and development results on all aspects of parallel
computing. Topics may include one or more of the following: high-speed computing
architectures (Grids, clusters, Service Oriented Architectures, etc.), network technology,
performance measurement, system software, middleware, algorithm design,
development tools, software engineering, services and applications.

Series Editor:
Professor Dr. Gerhard R. Joubert

Volume 21
Recently published in this series
Vol. 20. I. Foster, W. Gentzsch, L. Grandinetti and G.R. Joubert (Eds.), High
Performance Computing: From Grids and Clouds to Exascale
Vol. 19. B. Chapman, F. Desprez, G.R. Joubert, A. Lichnewsky, F. Peters and T. Priol
(Eds.), Parallel Computing: From Multicores and GPU’s to Petascale
Vol. 18. W. Gentzsch, L. Grandinetti and G. Joubert (Eds.), High Speed and Large Scale
Scientific Computing
Vol. 17. F. Xhafa (Ed.), Parallel Programming, Models and Applications in Grid and
P2P Systems
Vol. 16. L. Grandinetti (Ed.), High Performance Computing and Grids in Action
Vol. 15. C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr and F.
Peters (Eds.), Parallel Computing: Architectures, Algorithms and Applications

Volumes 1–14 published by Elsevier Science.

ISSN 0927-5452 (print)


ISSN 1879-808X (online)
Using OpenCL
Programming Massively Parallel Computers

Janusz Kowalik
16477-107th PL NE, Bothell, WA 98011, USA
and
Tadeusz Puźniakowski
UG, MFI, Wit Stwosz Street 57, 80-952 Gdańsk, Poland

Amsterdam • Berlin • Tokyo • Washington, DC


© 2012 The authors and IOS Press.

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 978-1-61499-029-1 (print)


ISBN 978-1-61499-030-7 (online)
Library of Congress Control Number: 2012932792
doi:10.3233/978-1-61499-030-7-i

Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: order@iospress.nl

Distributor in the USA and Canada


IOS Press, Inc.
4502 Rachael Manor Drive
Fairfax, VA 22032
USA
fax: +1 703 323 3668
e-mail: iosbooks@iospress.com

LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS


This book is dedicated to Alex, Bogdan and Gabriela
with love and consideration.

Preface
This book contains the most important and essential information required for de-
signing correct and efficient OpenCL programs. Some details have been omitted but
can be found in the provided references. The authors assume that readers are famil-
iar with basic concepts of parallel computation, have some programming experience
with C or C++ and have a fundamental understanding of computer architecture.
In the book, all terms, definitions and function signatures have been copied from
the official API documents available on the website of the OpenCL standard's creators.
The book was written in 2011, when OpenCL was in transition from its infancy
to maturity as a practical programming tool for solving real-life problems in science
and engineering. Earlier, the Khronos Group successfully defined OpenCL specifica-
tions, and several companies developed stable OpenCL implementations ready for
learning and testing. A significant contribution to programming heterogeneous com-
puters was made by NVIDIA which created one of the first working systems for pro-
gramming massively parallel computers – CUDA. OpenCL has borrowed from CUDA
several key concepts. At this time (fall 2011), one can install OpenCL on a hetero-
geneous computer and perform meaningful computing experiments. Since OpenCL
is relatively new, there are not many experienced users or sources of practical infor-
mation. One can find on the Web some helpful publications about OpenCL, but there
is still a shortage of complete descriptions of the system suitable for students and
potential users from the scientific and engineering application communities.
Chapter 1 provides short but realistic examples of codes using MPI and OpenMP
in order for readers to compare these two mature and very successful systems with
the fledgling OpenCL. MPI used for programming clusters and OpenMP for shared
memory computers, have achieved remarkable worldwide success for several rea-
sons. Both have been designed by groups of parallel computing specialists that per-
fectly understood scientific and engineering applications and software development
tools. Both MPI and OpenMP are very compact and easy to learn. Our experience
indicates that it is possible to teach scientists or students whose disciplines are other
than computer science how to use MPI and OpenMP in a few hours. We
hope that OpenCL will benefit from this experience and achieve, in the near future,
a similar success.
Paraphrasing the wisdom of Albert Einstein, we need to simplify OpenCL as
much as possible but not more. The reader should keep in mind that OpenCL will
be evolving and that pioneer users always have to pay an additional price in terms
of initially longer program development time and suboptimal performance before
they gain experience. The goal of achieving simplicity for OpenCL programming re-
quires an additional comment. OpenCL supporting heterogeneous computing offers
us opportunities to select diverse parallel processing devices manufactured by differ-
ent vendors in order to achieve near-optimal or optimal performance. We can select
multi-core CPUs, GPUs, FPGAs and other parallel processing devices to fit the prob-
lem we want to solve. This flexibility is welcomed by many users of HPC technology,
but it has a price.
Programming heterogeneous computers is somewhat more complicated than
writing programs in conventional MPI and OpenMP. We hope this gap will disappear
as OpenCL matures and is universally used for solving large scientific and engineer-
ing problems.

Acknowledgements
It is our pleasure to acknowledge assistance and contributions made by several per-
sons who helped us in writing and publishing the book.
First of all, we express our deep gratitude to Prof. Gerhard Joubert who has
accepted the book as a volume in the book series he is editing, Advances in Parallel
Computing. We are proud to have our book in his very prestigious book series.
Two members of the Khronos organization, Elizabeth Riegel and Neil Trevett,
helped us with evaluating the initial draft of Chapter 2 Fundamentals and provided
valuable feedback. We thank them for the feedback and for their offer of promoting
the book among the Khronos Group member companies.
Our thanks are due to NVIDIA for two hardware grants that enabled our com-
puting work related to the book.
Our thanks are due to Piotr Arłukowicz, who contributed two sections to the
book and helped us with editorial issues related to using LaTeX and the Blender3D
modeling open-source program.
We thank two persons who helped us improve the book structure and the lan-
guage. They are Dominic Eschweiler from FIAS, Germany and Roberta Scholz from
Redmond, USA.
We also thank several friends and family members who helped us indirectly by
supporting in various ways our book writing effort.
Janusz Kowalik
Tadeusz Puźniakowski

How to read this book


The text and the source code presented in this book are written using different text
fonts. Here are some examples of the different typography styles used.
variable – for example:
. . . the variable platform represents an object of class . . .
type or class name – for example:
. . . the value is always of type cl_ulong. . .
. . . is an object of class cl::Platform. . .
constant or macro – for example:
. . . the value CL_PLATFORM_EXTENSIONS means that. . .
function, method or constructor – for example:
. . . the host program has to execute clGetPlatformIDs. . .
. . . can be retrieved using cl::Platform::getInfo method. . .
. . . the context is created by the cl::Context constructor. . .
file name – for example:
. . . the cl.h header file contains. . .
keyword – for example:
. . . identified by the __kernel qualifier. . .

Contents

1 Introduction 1
1.1 Existing Standard Parallel Programming Systems . . . . . . . . . . . . . . 1
1.1.1 MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Two Parallelization Strategies: Data Parallelism and Task Parallelism . 9
1.2.1 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Task Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 History and Goals of OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Origins of Using GPU in General Purpose Computing . . . . . . 12
1.3.2 Short History of OpenCL . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Heterogeneous Computer Memories and Data Transfer . . . . . . . . . . 14
1.4.1 Heterogeneous Computer Memories . . . . . . . . . . . . . . . . . 14
1.4.2 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.3 The Fourth Generation CUDA . . . . . . . . . . . . . . . . . . . . . 15
1.5 Host Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5.1 Phase a. Initialization and Creating Context . . . . . . . . . . . . 17
1.5.2 Phase b. Kernel Creation, Compilation and Preparations . . . . . 17
1.5.3 Phase c. Creating Command Queues and Kernel Execution . . . 17
1.5.4 Finalization and Releasing Resource . . . . . . . . . . . . . . . . . 18
1.6 Applications of Heterogeneous Computing . . . . . . . . . . . . . . . . . . 18
1.6.1 Accelerating Scientific/Engineering Applications . . . . . . . . . 19
1.6.2 Conjugate Gradient Method . . . . . . . . . . . . . . . . . . . . . . 19
1.6.3 Jacobi Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6.4 Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6.5 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.7 Benchmarking CGM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.2 Additional CGM Description . . . . . . . . . . . . . . . . . . . . . . 24
1.7.3 Heterogeneous Machine . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.4 Algorithm Implementation and Timing Results . . . . . . . . . . 24
1.7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2 OpenCL Fundamentals 27
2.1 OpenCL Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 What is OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 CPU + Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.3 Massive Parallelism Idea . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.4 Work Items and Workgroups . . . . . . . . . . . . . . . . . . . . . . 29
2.1.5 OpenCL Execution Model . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.6 OpenCL Memory Structure . . . . . . . . . . . . . . . . . . . . . . . 30
2.1.7 OpenCL C Language for Programming Kernels . . . . . . . . . . . 30
2.1.8 Queues, Events and Context . . . . . . . . . . . . . . . . . . . . . . 30
2.1.9 Host Program and Kernel . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.10 Data Parallelism in OpenCL . . . . . . . . . . . . . . . . . . . . . . 31
2.1.11 Task Parallelism in OpenCL . . . . . . . . . . . . . . . . . . . . . . 32
2.2 How to Start Using OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.1 Header Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3 Platforms and Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 OpenCL Platform Properties . . . . . . . . . . . . . . . . . . . . . . 36
2.3.2 Devices Provided by Platform . . . . . . . . . . . . . . . . . . . . . 37
2.4 OpenCL Platforms – C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 OpenCL Context to Manage Devices . . . . . . . . . . . . . . . . . . . . . . 41
2.5.1 Different Types of Devices . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.2 CPU Device Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.3 GPU Device Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.4 Accelerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.5 Different Device Types – Summary . . . . . . . . . . . . . . . . . . 44
2.5.6 Context Initialization – by Device Type . . . . . . . . . . . . . . . 45
2.5.7 Context Initialization – Selecting Particular Device . . . . . . . . 46
2.5.8 Getting Information about Context . . . . . . . . . . . . . . . . . . 47
2.6 OpenCL Context to Manage Devices – C++ . . . . . . . . . . . . . . . . . 48
2.7 Error Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.7.1 Checking Error Codes . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.7.2 Using Exceptions – Available in C++ . . . . . . . . . . . . . . . . 53
2.7.3 Using Custom Error Messages . . . . . . . . . . . . . . . . . . . . . 54
2.8 Command Queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.8.1 In-order Command Queue . . . . . . . . . . . . . . . . . . . . . . . 55
2.8.2 Out-of-order Command Queue . . . . . . . . . . . . . . . . . . . . 57
2.8.3 Command Queue Control . . . . . . . . . . . . . . . . . . . . . . . 60
2.8.4 Profiling Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8.5 Profiling Using Events – C example . . . . . . . . . . . . . . . . . . 61
2.8.6 Profiling Using Events – C++ example . . . . . . . . . . . . . . . 63
2.9 Work-Items and Work-Groups . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.9.1 Information About Index Space from a Kernel . . . . . . . . . . 66
2.9.2 NDRange Kernel Execution . . . . . . . . . . . . . . . . . . . . . . 67
2.9.3 Task Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.9.4 Using Work Offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

2.10 OpenCL Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.10.1 Different Memory Regions – the Kernel Perspective . . . . . . . . 71
2.10.2 Relaxed Memory Consistency . . . . . . . . . . . . . . . . . . . . . 73
2.10.3 Global and Constant Memory Allocation – Host Code . . . . . . 75
2.10.4 Memory Transfers – the Host Code . . . . . . . . . . . . . . . . . . 78
2.11 Programming and Calling Kernel . . . . . . . . . . . . . . . . . . . . . . . . 79
2.11.1 Loading and Compilation of an OpenCL Program . . . . . . . . . 81
2.11.2 Kernel Invocation and Arguments . . . . . . . . . . . . . . . . . . 88
2.11.3 Kernel Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.11.4 Supported Scalar Data Types . . . . . . . . . . . . . . . . . . . . . 90
2.11.5 Vector Data Types and Common Functions . . . . . . . . . . . . . 92
2.11.6 Synchronization Functions . . . . . . . . . . . . . . . . . . . . . . . 94
2.11.7 Counting Parallel Sum . . . . . . . . . . . . . . . . . . . . . . . . . 96
2.11.8 Parallel Sum – Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.11.9 Parallel Sum – Host Program . . . . . . . . . . . . . . . . . . . . . 100
2.12 Structure of the OpenCL Host Program . . . . . . . . . . . . . . . . . . . . 103
2.12.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.12.2 Preparation of OpenCL Programs . . . . . . . . . . . . . . . . . . . 106
2.12.3 Using Binary OpenCL Programs . . . . . . . . . . . . . . . . . . . . 107
2.12.4 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
2.12.5 Release of Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.13 Structure of OpenCL host Programs in C++ . . . . . . . . . . . . . . . . . 114
2.13.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.13.2 Preparation of OpenCL Programs . . . . . . . . . . . . . . . . . . . 115
2.13.3 Using Binary OpenCL Programs . . . . . . . . . . . . . . . . . . . . 116
2.13.4 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
2.13.5 Release of Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2.14 The SAXPY Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.14.1 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
2.14.2 The Example SAXPY Application – C Language . . . . . . . . . . 123
2.14.3 The example SAXPY application – C++ language . . . . . . . . 128
2.15 Step by Step Conversion of an Ordinary C Program to OpenCL . . . . . 131
2.15.1 Sequential Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.15.2 OpenCL Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.15.3 Data Allocation on the Device . . . . . . . . . . . . . . . . . . . . . 134
2.15.4 Sequential Function to OpenCL Kernel . . . . . . . . . . . . . . . 135
2.15.5 Loading and Executing a Kernel . . . . . . . . . . . . . . . . . . . . 136
2.15.6 Gathering Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
2.16 Matrix by Vector Multiplication Example . . . . . . . . . . . . . . . . . . . 139
2.16.1 The Program Calculating matrix × vector . . . . . . . . . . . . . . 140
2.16.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.16.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.16.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

3 Advanced OpenCL 147


3.1 OpenCL Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.1.1 Different Classes of Extensions . . . . . . . . . . . . . . . . . . . . 147

3.1.2 Detecting Available Extensions from API . . . . . . . . . . . . . . 148
3.1.3 Using Runtime Extension Functions . . . . . . . . . . . . . . . . . 149
3.1.4 Using Extensions from OpenCL Program . . . . . . . . . . . . . . 153
3.2 Debugging OpenCL codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.2.1 Printf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.2.2 Using GDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.3 Performance and Double Precision . . . . . . . . . . . . . . . . . . . . . . 162
3.3.1 Floating Point Arithmetics . . . . . . . . . . . . . . . . . . . . . . . 162
3.3.2 Arithmetics Precision – Practical Approach . . . . . . . . . . . . . 165
3.3.3 Profiling OpenCL Application . . . . . . . . . . . . . . . . . . . . . 172
3.3.4 Using the Internal Profiler . . . . . . . . . . . . . . . . . . . . . . . 173
3.3.5 Using External Profiler . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.3.6 Effective Use of Memories – Memory Access Patterns . . . . . . . 183
3.3.7 Matrix Multiplication – Optimization Issues . . . . . . . . . . . . 189
3.4 OpenCL and OpenGL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
3.4.1 Extensions Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
3.4.2 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
3.4.3 Header Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
3.4.4 Common Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
3.4.5 OpenGL Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 198
3.4.6 OpenCL Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 201
3.4.7 Creating Buffer for OpenGL and OpenCL . . . . . . . . . . . . . . 203
3.4.8 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
3.4.9 Generating Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
3.4.10 Running Kernel that Operates on Shared Buffer . . . . . . . . . . 215
3.4.11 Results Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
3.4.12 Message Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.4.13 Cleanup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.4.14 Notes and Further Reading . . . . . . . . . . . . . . . . . . . . . . . 221
3.5 Case Study – Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.1 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.3 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
3.5.4 Example Problem Definition . . . . . . . . . . . . . . . . . . . . . . 225
3.5.5 Genetic Algorithm Implementation Overview . . . . . . . . . . . 225
3.5.6 OpenCL Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
3.5.7 Most Important Elements of Host Code . . . . . . . . . . . . . . . 234
3.5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
3.5.9 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

A Comparing CUDA with OpenCL 245


A.1 Introduction to CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
A.1.1 Short CUDA Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 245
A.1.2 CUDA 4.0 Release and Compatibility . . . . . . . . . . . . . . . . . 245
A.1.3 CUDA Versions and Device Capability . . . . . . . . . . . . . . . . 247
A.2 CUDA Runtime API Example . . . . . . . . . . . . . . . . . . . . . . . . . . 249
A.2.1 CUDA Program Explained . . . . . . . . . . . . . . . . . . . . . . . 251

A.2.2 Blocks and Threads Indexing Formulas . . . . . . . . . . . . . . . 257
A.2.3 Runtime Error Handling . . . . . . . . . . . . . . . . . . . . . . . . 260
A.2.4 CUDA Driver API Example . . . . . . . . . . . . . . . . . . . . . . . 262

B Theoretical Foundations of Heterogeneous Computing 269


B.1 Parallel Computer Architectures . . . . . . . . . . . . . . . . . . . . . . . . 269
B.1.1 Clusters and SMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
B.1.2 DSM and ccNUMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
B.1.3 Parallel Chip Computer . . . . . . . . . . . . . . . . . . . . . . . . . 270
B.1.4 Performance of OpenCL Programs . . . . . . . . . . . . . . . . . . 270
B.2 Combining MPI with OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 277

C Matrix Multiplication – Algorithm and Implementation 279


C.1 Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
C.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
C.2.1 OpenCL Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
C.2.2 Initialization and Setup . . . . . . . . . . . . . . . . . . . . . . . . . 280
C.2.3 Kernel Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
C.2.4 Executing Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282

D Using Examples Attached to the Book 285


D.1 Compilation and Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
D.1.1 Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
D.1.2 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286

Bibliography and References 289

Chapter 1

Introduction

1.1. Existing Standard Parallel Programming Systems


The last decade of the 20th century and the first decade of the 21st century can be
called the Era of Parallel Computing. In this period of time, not only were extremely
powerful supercomputers designed and built, but two de facto standard parallel pro-
gramming systems for scientific and engineering applications were successfully intro-
duced worldwide. They are MPI (Message Passing Interface) for clusters of comput-
ers and OpenMP for shared memory multi-processors. Both systems are predecessors
of the subject of this book, OpenCL. They deserve a short technical description and
discussion. This will help to show the differences between the older MPI and OpenMP
systems and the newer OpenCL parallel programming system.

1.1.1. MPI
MPI is a programming system but not a programming language. It is a library of func-
tions for C and subroutines for FORTRAN that are used for message communication
between parallel processes created by MPI. The message-passing computing model
(Fig. 1.1) is a collection of interconnected processes that use their local memories
exclusively. Each process has an individual address identification number called the
rank. Ranks are used for sending and receiving messages and for workload distribu-
tion. There are two kinds of messages: point-to-point messages and collective mes-
sages. Point-to-point message C functions contain several parameters: the address of
the sender, the address of the receiver, the message size and type and some additional
information. In general, message parameters describe the nature of the transmitted
data and the delivery envelope description.
The collection of processes that can exchange messages is called the communica-
tor. In the simplest case there is only one communicator and each process is assigned
to one processor. In more general settings, there are several communicators and sin-
gle processors serve several processes. A processor rank is usually an integer number
running from 0 to p-1 where p is the total number of processes. It is also possible to
number processes in a more general way than by consecutive integer numbers. For
example, their address ID can be a double number such as a point (x, y) in Carte-

1
sian space. This method of identifying processes may be very helpful in handling ma-
trix operations or partial differential equations (PDEs) in two-dimensional Cartesian
space.

Figure 1.1: The message passing model.

An example of a collective message is the broadcast message that sends data


from a single process to all other processes. Other collective messages such as Gather
and Scatter are very helpful and are often used in computational mathematics
algorithms.
For the purpose of illustrating MPI, consider a parallel computation of the SAXPY
operation. SAXPY is a linear algebra operation z = ax+y where a is a constant scalar
and x, y, z are vectors of the same size. The name SAXPY is the BLAS routine name for
Single-precision A times X Plus Y. An example of a SAXPY operation is presented in section 2.14.
The following assumptions and notations are made:

1. The number of processes is p.


2. The vector size n is divisible by p.
3. Only a part of the code for parallel calculation of the vector z will be written.
4. All required MPI initialization instructions have been done in the front part of
the code.
5. The vectors x and y have been initialized.
6. The process ID is called my_rank and runs from 0 to p−1. The number of vector element
pairs that must be computed by each process is: N = n/p. The communicator
is the entire system.
7. n = 10000 and p = 10 so N = 1000.

The C loop below will be executed by all processes, from the process with
my_rank equal to 0 to the process with my_rank equal to p − 1.

{
    int i;
    for (i = my_rank*N; i < (my_rank+1)*N; i++)
        z[i] = a*x[i] + y[i];
}

For example, the process with my_rank=1 will add one thousand element pairs
a*x[i] and y[i], from i=N=1000 to 2N−1=1999. The process with
my_rank=p−1=9 will add elements from i=n−N=9000 to n−1=9999.
One can now appreciate the usefulness of assigning a rank ID to each process.
This simple concept makes it possible to send and receive messages and share the
workloads as illustrated above. Of course, before each process can execute its code
for computing a part of the vector z it must receive the needed data. This can be
done by one process initializing the vectors x and y and sending appropriate groups
of data to all remaining processes. The computation of SAXPY is an example of data
parallelism. Each process executes the same code on different data.
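As an illustration only (not code from the book), the distribution and collection of the
SAXPY data could be expressed with the MPI collective functions MPI_Bcast, MPI_Scatter
and MPI_Gather. The sketch assumes that x, y and z of size n are allocated on rank 0, that
local buffers xloc, yloc and zloc of size N = n/p exist on every process, and that all values
are of type double.

/* Minimal sketch of SAXPY data distribution with MPI collectives.
   Error checking is omitted. */
MPI_Bcast(&a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);            /* send the scalar a to all processes */
MPI_Scatter(x, N, MPI_DOUBLE, xloc, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Scatter(y, N, MPI_DOUBLE, yloc, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

for (int i = 0; i < N; i++)                                  /* local part of z = a*x + y */
    zloc[i] = a*xloc[i] + yloc[i];

MPI_Gather(zloc, N, MPI_DOUBLE, z, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

With this approach the explicit indexing by my_rank is replaced by indexing into the local
buffers, but the amount of work per process is the same.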
To implement function parallelism, where several processes execute different programs,
process ranks can be used. In the SAXPY example the block of code containing
the loop will be executed by all p processes without exception. But if one process, for
example, the process rank 0, has to do something different than all others this could
be accomplished by specifying the task as follows:

if (my_rank == 0)
    {execute this task}

This block of code will be executed only by the process with my_rank==0. In the
absence of "if (my_rank==something)" instructions, all processes will execute the
block of code {execute this task}. This technique can be used for a case where
processes have to perform several different computations required in the task-parallel
algorithm. MPI is universal. It can express every kind of algorithmic parallelism.
An important concern in parallel computing is efficiency. MPI often can be
computationally efficient because all processes use only local memories. On
the other hand, processes require network communication to place data in the proper
process memories at the proper times. This need for moving data is called algorithmic
synchronization, and it creates communication overhead, which negatively impacts
parallel program performance. Performance may be damaged significantly if the
program sends many short messages. The communication overhead can be reduced
by grouping data for communication: by creating larger user-defined data types,
larger messages can be sent less frequently, and the per-message latency is amortized.
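As a sketch of this idea (the names strip_type, rows_per_proc, ncols, first_row, dest and
tag are illustrative, not from the book), a strip of contiguously stored rows can be declared
once as a user-defined MPI data type and sent as a single message:

/* Sketch: declare a strip of rows as one user-defined data type so that
   the whole strip travels in a single message instead of row by row. */
MPI_Datatype strip_type;
MPI_Type_contiguous(rows_per_proc * ncols, MPI_DOUBLE, &strip_type);
MPI_Type_commit(&strip_type);

MPI_Send(&A[first_row * ncols], 1, strip_type, dest, tag, MPI_COMM_WORLD);

MPI_Type_free(&strip_type);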
On the negative side of the MPI evaluation scorecard is its lack of support for incremental
(part by part) conversion from a serial code to an algorithmically equivalent MPI
parallel code. This limitation is attributed to the relatively high level at which MPI
programs are designed. To design an MPI program, one has to modify the algorithm.
This work is done at an earlier stage of the parallel computing process than developing
parallel codes. Algorithmic level modification usually can't be done piecewise; it
has to be done all at once. In contrast, in OpenMP a programmer makes changes to
sequential codes written in C or FORTRAN.
Fig. 1.2 shows the difference. Code level modification makes incremental conversion
possible. In a commonly practiced conversion technique, a serial code is
converted incrementally from the most compute-intensive program parts to the least
compute-intensive parts until the parallelized code runs sufficiently fast. For further
study of MPI, the book [1] is highly recommended.

1. Mathematical model.
2. Computational model.
3. Numerical algorithm and parallel conversion: MPI
4. Serial code and parallel modification: OpenMP
5. Computer runs.

Figure 1.2: The computer solution process. OpenMP will be discussed in the next
section.

1.1.2. OpenMP
OpenMP is a shared address space computer application programming interface
(API). It contains a number of compiler directives that instruct C/C++ or FORTRAN
“OpenMP aware” compilers to execute certain instructions in parallel and distribute
the workload among multiple parallel threads. Shared address space computers fall
into two major groups: centralized memory multiprocessors, also called Symmetric
Multi-Processors (SMP) as shown in Fig. 1.3, and Distributed Shared Memory (DSM)
multiprocessors whose common representative is the cache coherent Non Uniform
Memory Access architecture (ccNUMA) shown in Fig. 1.4.

Figure 1.3: The bus based SMP multiprocessor.

SMP computers are also called Uniform Memory Access (UMA). They were the
first commercially successful shared memory computers and they still remain popu-
lar. Their main advantage is uniform memory access. Unfortunately for large num-
bers of processors, the bus becomes overloaded and performance deteriorates. For
this reason, the bus-based SMP architectures are limited to about 32 or 64 proces-
sors. Beyond these sizes, single address memory has to be physically distributed.
Every processor has a chunk of the single address space.
Both architectures have single address space and can be programmed by using
OpenMP. OpenMP parallel programs are executed by multiple independent threads

Figure 1.4: ccNUMA with cache coherent interconnect.

that are streams of instructions having access to shared and private data, as shown
in Fig. 1.5.

Figure 1.5: The threads’ access to data.

The programmer can explicitly define data that are shared and data that are
private. Private data can be accessed only by the thread owning these data. The flow
of OpenMP computation is shown in Fig. 1.6. One thread, the master thread, runs
continuously from the beginning to the end of the program execution. The worker
threads are created for the duration of parallel regions and then are terminated.
When a programmer inserts a proper compiler directive, the system creates a
team of worker threads and distributes workload among them. This operation point
is called the fork. After forking, the program enters a parallel region where parallel
processing is done. After all threads finish their work, the program worker threads
cease to exist or are redeployed while the master thread continues its work. The
event of returning to the master thread is called the join. The fork and join tech-
nique may look like a simple and easy way for parallelizing sequential codes, but the

Figure 1.6: The fork and join OpenMP code flow.

reader should not be deceived. There are several difficulties that will be described
and discussed shortly. First of all, how does one determine whether several tasks can
be run in parallel?
In order for any group of computational tasks to be correctly executed in parallel,
they have to be independent. That means that the result of the computation does not
depend on the order of task execution. Formally, two programs or parts of programs
are independent if they satisfy the Bernstein conditions shown in 1.1.

Ij ∩ Oi = ∅
Ii ∩ Oj = ∅          (1.1)
Oi ∩ Oj = ∅

Bernstein’s conditions for task independence.

The letters I and O signify the input and output sets of tasks i and j; the ∩ symbol
denotes set intersection and ∅ the empty set. In practice, determining if some group of
tasks can be executed in parallel has to be done by the programmer and may not
be very easy. To discuss other shared memory computing issues, consider a very
simple programming task: computing the dot product of two large vectors x and y,
each with n components. The sequential computation would be accomplished by a
simple C loop:

double dp = 0;
for(int i=0; i<n; i++)
    dp += x[i]*y[i];

Inserting, in front of the above for-loop, the OpenMP directive for parallelizing
the loop and declaring shared and private variables leads to:

double dp = 0;
int i;
#pragma omp parallel for shared(x,y,dp) private(i)
for(i=0; i<n; i++)
    dp += x[i]*y[i];

Unfortunately, this solution would encounter a serious difficulty in computing dp.
The difficulty arises because the computation of dp+=x[i]*y[i]; is
not atomic. This means that more than one thread may attempt to update the value
of dp simultaneously. If this happens, the value of dp will depend on the timing of
the individual threads' operations. This situation is called a data race. The difficulty
can be removed in two ways.
One way is by using the OpenMP critical construct #pragma omp critical in
front of the statement dp+=x[i]*y[i]; The directive critical forces threads to up-
date dp one at a time. In this way, the updating relation dp+=x[i]*y[i] becomes
atomic and is executed correctly. That means every thread executes this operation
alone and completely without interference of other threads. With this addition, the
code becomes:

double dp = 0;
int i;
#pragma omp parallel for shared(x,y,dp) private(i)
for(i=0; i<n; i++){
    #pragma omp critical
    dp += x[i]*y[i];
}

In general, the block of code following the critical construct is computed by


one thread at a time. In our case, the critical block is just one line of code
dp+=x[i]*y[i];. The second way of fixing the problem is by using the reduction
clause, which ensures that all partial results of the dot product are added correctly. Below is
a correct fragment of code for computing dp with the reduction clause.

double dp = 0;
int i;
#pragma omp parallel for reduction(+:dp) shared(x,y) private(i)
for (i=0; i<n; i++)
    dp += x[i]*y[i];

Use of the critical construct amounts to serializing the computation of the critical
region, the block of code following the critical construct. For this reason, large critical
regions degrade program performance and ought to be avoided. Using the reduction
clause is more efficient and preferable. The reader may have noticed that the loop
counter variable i is declared private, so each thread updates its loop counter in-
dependently without interference. In addition to the parallelizing construct parallel
for that applies to C for-loops, there is a more general section construct that paral-
lelizes independent sections of code. The section construct makes it possible to apply
OpenMP to task parallelism where several threads can compute different sections of
a code. For computing two parallel sections, the code structure is:

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        /* some program segment computation */
        #pragma omp section
        /* another program segment computation */
    }
    /* end of sections block */
}
/* end of parallel region */

OpenMP has tools that can be used by programmers for improving performance.
One of them is the clause nowait. Consider the program fragment:

int i, j;
#pragma omp parallel shared(a,b,c,d,e,f) private(i,j)
{
    #pragma omp for nowait
    for(i=0; i<n; i++)
        c[i] = a[i]+b[i];
    #pragma omp for
    for(j=0; j<m; j++)
        d[j] = e[j]*f[j];
    #pragma omp barrier
    g = func(d);
}
/* end of the parallel region; implied barrier */

In the first parallelized loop, there is the clause nowait. Since the second loop
variables do not depend on the results of the first loop, it is possible to use the clause
nowait – telling the compiler that as soon as any first loop thread finishes its work, it
can start doing the second loop work without waiting for other threads. This speeds
up computation by reducing the potential waiting time for the threads that finish
work at different times.
On the other hand, the construct #pragma omp barrier is inserted after the
second loop to make sure that the second loop is fully completed before g, which is
a function of d, is calculated. At the barrier, all
threads computing the second loop must wait for the last thread to finish before they
proceed. In addition to the explicit barrier construct #pragma omp barrier used by
the programmer, there are also implicit barriers used by OpenMP automatically at
the end of every parallel region for the purpose of thread synchronization. To sum
up, barriers should be used sparingly and only if necessary. nowait clauses should
be used as frequently as possible, provided that their use is safe.
Finally, we have to point out that the major performance issue in numerical com-
putation is the use of cache memories. For example, in computing the matrix/vector
product c = Ab, two mathematically equivalent methods could be used. In the first
method, elements of the vector c are computed by multiplying each row of A by the
vector b, i.e., computing dot products. Inferior performance will be obtained if c
is computed as the sum of the columns of A multiplied by the elements of b. In this
case, the program would access the columns of A in a way that does not match how
the matrix data are stored in main memory and transferred to caches.
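The difference is visible in the loop ordering of a plain C implementation. The sketch
below assumes a square matrix A of order n stored row by row, as C stores
two-dimensional data:

/* Row-oriented version: A is traversed in the order it is stored,
   so consecutive accesses reuse the cache lines already loaded. */
for (int i = 0; i < n; i++) {
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += A[i*n + j] * b[j];     /* dot product of row i of A with b */
    c[i] = sum;
}

/* Column-oriented version: mathematically equivalent, but A is read
   with stride n, so most of each fetched cache line is wasted. */
for (int i = 0; i < n; i++)
    c[i] = 0.0;
for (int j = 0; j < n; j++)
    for (int i = 0; i < n; i++)
        c[i] += A[i*n + j] * b[j];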
For further in-depth study of OpenMP and its performance, reading [2] is highly
recommended. Of special value for parallel computing practitioners are Chapter 5
“How to get Good Performance by Using OpenMP” and Chapter 6 “Using OpenMP
in the Real World”. Chapter 6 offers advice for and against using combined MPI and
OpenMP. A chapter on combining MPI and OpenMP can also be found in [3]. Like
MPI and OpenMP, the OpenCL system is standardized. It has been designed to run
regardless of processor types, operating systems and memories. This makes OpenCL
programs highly portable, but the method of developing OpenCL codes is more com-
plicated. The OpenCL programmer has to deal with several low-level programming
issues, including memory management.

1.2. Two Parallelization Strategies: Data Parallelism and Task Parallelism

There are two strategies for designing parallel algorithms and related codes: data
parallelism and task parallelism. This Section describes both concepts.

1.2.1. Data Parallelism


Data parallelism, also called Single Program Multiple Data (SPMD) is very common
in computational linear algebra. In a data parallel code, data structures such as ma-
trices are divided into blocks, sets of rows or columns, and a single program performs
identical operations on these partitions that contain different data. An example of
data parallelism is the matrix/vector multiplication a = Ab where every element of
a can be computed by performing the dot product of one row of A and the vector b.
Fig. 1.7 shows this data parallel concept.

Figure 1.7: Computing matrix/vector product.

1.2.2. Task Parallelism


The task parallel approach is more general. It is assumed that there are multiple
different independent tasks that can be computed in parallel. The tasks operate on

their own data sets. Task parallelism can be called Multiple Programs Multiple Data
(MPMD). A small-size example of a task parallel problem is shown in Fig. 1.8. The
directed graph indicates the task execution precedence. Two tasks can execute in
parallel if they are not dependent.

Figure 1.8: Task dependency graph.

In the case shown in Fig. 1.8, there are two options for executing the entire
set of tasks. Option 1. Execute tasks T1, T2 and T4 in parallel, followed by Task T3
and finally T5. Option 2. Execute tasks T1 and T2 in parallel, then T3 and T4 in
parallel and finally T5. In both cases, the total computing work is equal but the time
to solution may not be.

1.2.3. Example
Consider a problem that can be computed in both ways – via data parallelism and
task parallelism. The problem is to calculate C = A × B − (D + E) where A, B, D
and E are all square matrices of size n × n. An obvious task parallel version would
be to compute in parallel two tasks A × B and D + E and then subtract the sum
from the product. Of course, the task of computing A × B and the task of computing
the sum D + E can be calculated in a data parallel fashion. Here, there are two
levels of parallelism: the higher task level and the lower data parallel level. Similar
multilevel parallelism is common in real-world applications. Not surprisingly, there
is also a direct data parallel method for this problem, based on the observation that
every element of C can be computed directly and independently from the coefficients
of A, B, D and E. This computation is shown in equation 1.2.


c_ij = Σ_{k=0}^{n−1} a_ik b_kj − d_ij − e_ij        (1.2)

The direct computation of C.

The equation (1.2) means that it is possible to calculate all n² elements of C
in parallel using the same formula. This is good news for OpenCL devices that can
handle only one kernel and related data parallel computation. Those devices will
be discussed in the chapters that follow. The matrix C can be computed using three
standard programming systems: MPI, OpenMP and OpenCL.
Using MPI, the matrix C could be partitioned into sub-matrix components and
the subcomponents assigned to processes. Every process would compute a set of
elements of C, using the expression 1.2. The main concern here is not computation
itself but the ease of assembling results and minimizing the cost of communication
while distributing data and assembling the results. A reasonable partitioning would
be dividing C by blocks of rows that can be scattered and gathered as user-defined
data types. Each process would get a set of the rows of A, D and E and the entire
matrix B. If matrix C size is n and the number of processes is p, then each process
would get n/p rows of C to compute. If a new data type is defined as n/p rows,
the data can easily be distributed by strips of rows to processes and then results can
be gathered. The suggested algorithm is a data parallel method. Data partitioning is
shown in Fig. 1.9.

Figure 1.9: Strips of data needed by a process to compute the topmost strip of C.

The data needed for each process include one strip of A, D and E and the entire
matrix B. Each process of rank 0<=rank<p computes one strip of C rows. After fin-
ishing computation, the matrix C can be assembled by the collective MPI communi-
cation function Gather. An alternative approach would be partitioning C into blocks
and assigning the computation of one block of C to each process. However, assembling
the results would be harder than in the strip partitioning case.
OpenMP would take advantage of the task parallel approach. First, the subtasks
A × B and D + E would be parallelized and computed separately, and then C = A ×
B − (D + E) would be computed, as shown in Fig. 1.10. One weakness of task parallel
programs is the efficiency loss if all parallel tasks do not represent equal workloads.
For example, in the matrix C computation, the task of computing A×B and the task of
computing D+E are not load-equal. Unequal workloads could cause some threads to
idle unless tasks are selected dynamically for execution by the scheduler. In general,
data parallel implementations are well load balanced and tend to be more efficient.
In the case of OpenCL implementation of the data parallel option for computing C, a
single kernel function would compute elements of C by the equation 1.2 where the
sum represents dot products of A × B and the remaining terms represent subtracting
D and E. If the matrix size is n = 1024, a compute device could execute over one
million work items in parallel. Additional performance gain can be achieved by using
the tiling technique [4].
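A kernel implementing equation (1.2) directly could look like the sketch below. The kernel
name, the argument order and the row-by-row storage of the matrices are illustrative
assumptions, not a fixed interface:

/* Illustrative kernel for equation (1.2): each work item computes one element
   c[i][j] = sum_k a[i][k]*b[k][j] - d[i][j] - e[i][j].
   All matrices are assumed to be stored row by row in global memory. */
__kernel void compute_c(__global const float *a, __global const float *b,
                        __global const float *d, __global const float *e,
                        __global float *c, const int n)
{
    int i = get_global_id(0);    /* row index of the element */
    int j = get_global_id(1);    /* column index of the element */
    if (i < n && j < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += a[i*n + k] * b[k*n + j];
        c[i*n + j] = sum - d[i*n + j] - e[i*n + j];
    }
}

Launched over a two-dimensional index space of n × n work items, such a kernel realizes
the direct data parallel method described above; its untiled global memory accesses are
exactly what the tiling technique improves.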

Figure 1.10: The task parallel approach for OpenMP.

In the tiling technique applied to the matrix/matrix product operation, matrices


are partitioned into tiles small enough so that the data for dot products can be placed
in local (in the NVIDIA terminology, shared) memories. This way, slow access to
global memory is avoided and smaller, but faster, local memories are used.
The tiling technique is one of the most effective ways for enhancing perfor-
mance of heterogeneous computing. In the matrix/matrix multiplication problem,
the NVIDIA GPU Tesla C1060 processor was used as a massively parallel acceleration
device.
In general, the most important heterogeneous computing strategy for achieving
efficient applications is optimizing memory use. This includes avoiding CPU-GPU
data transfers and using fast local memory. Finally, a serious limitation of many
current (2011) generation GPUs has to be mentioned. Some of them are not capable
of running multiple kernels in parallel; they run only one kernel at a time in data
parallel fashion. This eliminates the possibility of a task parallel program structure
requiring multiple kernels. Using such GPUs, the only possibility of running multiple
kernels would be to have multiple devices, each one executing a single kernel, but
the host would be a single processor. The context for managing two OpenCL devices
is shown in Fig. 2.3 in section 2.1.

1.3. History and Goals of OpenCL


1.3.1. Origins of Using GPU in General Purpose Computing
Initially, massively parallel processors called GPUs (Graphics Processor Units) were
built for applications in graphics and gaming.
Soon the scientific and engineering communities realized that data parallel com-
putation is so common in the majority of scientific/engineering numerical algorithms
that GPUs can also be used for compute intensive parts of large numerical algorithms
solving scientific and engineering problems. This idea contributed to creating a new
computing discipline under the name General Purpose GPU or GPGPU. Several nu-

merical algorithms containing data parallel computations are discussed in Section
1.6 of this Chapter.
As a parallel programming system, OpenCL is preceded by MPI, OpenMP and
CUDA. Parallel regions in OpenMP are comparable to parallel kernel executions
in OpenCL. A crucial difference is OpenMP’s limited scalability due to heavy over-
head in creating and managing threads. Threads used on heterogeneous systems are
lightweight, hence suitable for massive parallelism. The system most similar to OpenCL is
NVIDIA's CUDA (Compute Unified Device Architecture), because OpenCL heavily
borrowed its fundamental features from CUDA. Readers who know CUDA will find it
relatively easy to learn and use OpenCL after becoming familiar with the mapping
between CUDA and OpenCL technical terms and minor programming differences.
The technical terminology differences are shown in Tab.1.1.

Table 1.1: Comparison of CUDA and OpenCL terminology

CUDA       OpenCL
Thread     Work item
Block      Work group
Grid       Index space

Similar one-to-one mapping exists for the CUDA and OpenCL API calls.
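To give a flavor of this mapping, the same trivial vector-scaling kernel is sketched below
in both notations; the code is an illustration only and is not taken from the book.

/* CUDA version: the thread index is computed from built-in variables. */
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}

/* OpenCL C version: the work-item index is obtained from get_global_id. */
__kernel void scale(__global float *x, float a, int n)
{
    int i = get_global_id(0);
    if (i < n) x[i] = a * x[i];
}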

1.3.2. Short History of OpenCL


OpenCL (Open Computing Language) was initiated by Apple, Inc., which holds its
trademark rights. Currently, OpenCL is managed by the Khronos Group, which in-
cludes representatives from several major computer vendors. Technical specification
details were finalized in November 2008 and released for public use in December
of that year. In 2010, IBM and Intel released their implementations of OpenCL. In
the middle of November 2010, Wolfram Research released Mathematica 8 with an
OpenCL link. Implementations of OpenCL have been released also by other compa-
nies, including NVIDIA and AMD. The date for a stable release of OpenCL 1.2 stan-
dard is November 2011.
One of the prime goals of OpenCL designers and specification authors has been
portability of the OpenCL codes. Using OpenCL is not limited to any specific vendor
hardware, operating systems or type of memory. This is the most unique feature of
the emerging standard programming system called OpenCL. Current (2011) OpenCL
specification management is in the hands of the international group Khronos, which
includes IBM, SONY, Apple, NVIDIA, Texas Instruments, AMD and Intel.
Khronos manages specifications of OpenCL C language and OpenCL runtime
APIs. One of the leaders in developing and spreading OpenCL is the Fixstars Corpo-
ration, whose main technology focus has been programming multi-core systems and
optimizing their application performance. This company published the first commer-
cially available book on OpenCL [5].
The book was written by five Japanese software specialists and has been trans-
lated into English. It is currently (2011) available from Amazon.com in paperback or
the electronic book form. In the latter form, the book can be read using Kindle.

1.4. Heterogeneous Computer Memories and Data Transfer
1.4.1. Heterogeneous Computer Memories
A device’s main memory is the global memory (Fig. 1.11). All work items have access
to this memory. It is the largest but also the slowest memory on a heterogeneous
device. The global memory can be dynamically allocated by the host program and
can be read and written by the host and the device.

Figure 1.11: Heterogeneous computer memories. Host memory is accessible only to


processes working on the host. Global memory is a GPU memory accessible both in
read and write for a kernel run on the GPU device. Constant memory is accessible
only for read operations.

The constant memory can also be dynamically allocated by the host. It can be
used for read and write operations by the host, but it is read only by the device. All
work items can read from it. This memory can be fast if the system has a supporting
cache.
Local memories are shared by work items within a work group. For example, two
work items can synchronize their operation using this memory if they belong to the
same work group. Local memories can’t be accessed by the host.

Private memories can be accessed only by their individual work items; they are
typically implemented in registers. A kernel can use only these four device memories.
In contrast, the host system can use the host memory and the global/constant
memories on the device. If a device is a multicore CPU, then all device memories are
portions of the RAM.
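From the kernel's point of view, these memory regions correspond to the address space
qualifiers of the OpenCL C language. The kernel below is a hypothetical fragment written
only to illustrate them:

/* Illustration of the four device memory regions visible to a kernel. */
__kernel void memory_regions(__global float *data,      /* global memory: read/write for the kernel  */
                             __constant float *coeffs,  /* constant memory: read only for the kernel */
                             __local float *scratch)    /* local memory: shared within one work group */
{
    float tmp;                                           /* private memory: one copy per work item */
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tmp = data[gid] * coeffs[0];
    scratch[lid] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);                        /* synchronization within the work group */
    data[gid] = scratch[lid];
}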

1.4.2. Data Transfer


It has to be pointed out that all data movements required by a particular algorithm
have to be programmed explicitly by the programmer, using special commands for
data transfer between different kinds of memories. To illustrate different circum-
stances of transferring data, consider two algorithms: algorithm 1 for matrix–matrix
multiplication and algorithm 2 for solving sets of linear equations Ax = b with a pos-
itive definite matrix A. Algorithm 2 is an iterative procedure called The Conjugate
Gradient Method (CGM).

Figure 1.12: Data transfer in matrix/matrix multiplication.

Executing a matrix/matrix multiplication algorithm requires only two data transfer operations. They are shown in Fig. 1.12 by thick arrows. The first transfer loads matrices A and B into the device global memory. The second transfer moves the resulting matrix C to the host RAM so that the host can print the results.
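Using the OpenCL runtime API, these two transfers might look roughly as follows; the buffer names, the matrix dimension n and the surrounding context and queue objects are illustrative assumptions, and error handling is omitted:

/* Assumes an existing context and command queue, and host arrays A, B, C of n*n floats. */
size_t bytes = n * n * sizeof(float);
cl_mem dA = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
cl_mem dB = clCreateBuffer(context, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
cl_mem dC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

/* Transfer 1: load the input matrices A and B into device global memory. */
clEnqueueWriteBuffer(queue, dA, CL_TRUE, 0, bytes, A, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, dB, CL_TRUE, 0, bytes, B, 0, NULL, NULL);

/* ... set kernel arguments and enqueue the multiplication kernel here ... */

/* Transfer 2: move the result matrix C back to host RAM for printing. */
clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, bytes, C, 0, NULL, NULL);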
The second algorithm is the Conjugate Gradient Method (CGM) discussed in section 1.6. It is assumed that each iteration is computed entirely on the GPU device. Fig. 1.13 shows the algorithm flow.
CGM is executed on a heterogeneous computer with a CPU host and a GPU device. The thick arrow indicates data transfer from the CPU memory to the device global memory. Except for the initial transfer of the matrix A and the vector p, and one bit returned after every iteration, the CGM algorithm requires no other data transfers.

1.4.3. The Fourth Generation CUDA


The most recent (March 2011) version of NVIDIA's Compute Unified Device Archi-
tecture (CUDA) brings a new and important memory technology called Unified Vir-
tual Addressing (UVA). UVA provides a single address space for merged CPU memory

Figure 1.13: The CGM algorithm iterations. The only data transfer takes place at the
very beginning of the iterative process.

and global GPU memory. This change will simplify programming systems that com-
bine the CPU and GPU and will eliminate currently necessary data transfers between
the host and the device.

1.5. Host Code


Every OpenCL program contains two major components: a host code and at least one
kernel, as shown in Fig. 1.14.

Figure 1.14: OpenCL program components.

Kernels are written in the OpenCL C language and executed on highly parallel devices. They are the source of the performance improvements. An example of a kernel code is shown in section 2.1.
In this section, the structure and the functionality of host programs are described. The main purpose of the host code is to manage device(s). More specifically, host codes arrange and submit kernels for execution on device(s). In general, a host code has four phases:

a) initialization and creating the context,
b) kernel creation and preparation for kernel execution,
c) creating command queues and kernel execution,
d) finalization and release of resources.

Section 2.8 in Chapter 2 provides details of the four phases and an example of the host code structure. This section is an introduction to section 2.11 in Chapter 2. This introduction may be useful for readers who are starting to learn OpenCL, since the concept of the host code has not been used in other parallel programming systems.

1.5.1. Phase a. Initialization and Creating Context


The very first step in initialization is getting available OpenCL platforms. This is fol-
lowed by device selection. The programmer can query available devices and choose
those that could help achieve the required level of performance for the application
at hand.
The second step is creating the context.
In OpenCL, devices are managed through contexts. Please see Fig. 2.3, illustrat-
ing the context for managing devices. To determine the types and the numbers of de-
vices in the system, a special API function is used. All other steps are also performed by invoking runtime API functions. Hence the host code is written in conventional C or C++ plus calls to the runtime API library.
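A hedged sketch of phase a, assuming a single GPU device is sufficient (the variable names are illustrative and most error checking is omitted):

#include <CL/cl.h>

cl_int err;
cl_platform_id platform;
cl_device_id device;

/* Step 1: obtain the first available OpenCL platform and a GPU device on it. */
err = clGetPlatformIDs(1, &platform, NULL);
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

/* Step 2: create the context through which the device will be managed. */
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
if (err != CL_SUCCESS) {
    /* a real program would report the error or fall back to CL_DEVICE_TYPE_CPU */
}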

1.5.2. Phase b. Kernel Creation, Compilation and Preparations for Kernel Execution
This phase includes kernel creation, compilation and loading. The kernel has to be
loaded to the global memory for execution on the devices. It can be loaded in binary
form or as source code.
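Continuing the sketch, phase b for the source-code case might look as follows; the string source_str and the kernel name mat_mul are assumed for illustration (the binary form would be loaded with clCreateProgramWithBinary instead):

/* Create a program object from OpenCL C source and compile it for the chosen device. */
cl_program program = clCreateProgramWithSource(context, 1,
                                               (const char **)&source_str, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

/* Create a kernel object from the compiled program. */
cl_kernel kernel = clCreateKernel(program, "mat_mul", &err);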

1.5.3. Phase c. Creating Command Queues and Kernel Execution


In this phase, a Command Queue is created. Kernels must be placed in command
queues for execution. When a device is ready, the kernel at the head of the command
queue will be executed. After kernel execution, the results can be transferred from
the device global memory to the host memory for display or printing. Kernels can be executed out of order or in order. In the out-of-order case, the kernels are independent and can be executed in any sequence. In the in-order case, the kernels must be executed in a certain fixed order.
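For the matrix multiplication example, phase c could be sketched as below, reusing the illustrative objects from the earlier snippets; passing CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE instead of 0 would request an out-of-order queue:

/* Create an in-order command queue for the device (OpenCL 1.x API). */
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

/* Pass the buffers and the matrix size to the kernel. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &dB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &dC);
clSetKernelArg(kernel, 3, sizeof(int),    &n);

/* Launch one work item per element of the n x n result matrix. */
size_t global_size[2] = { (size_t)n, (size_t)n };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, NULL, 0, NULL, NULL);

/* A blocking read returns the result to host memory and acts as a synchronization point. */
clEnqueueReadBuffer(queue, dC, CL_TRUE, 0, n * n * sizeof(float), C, 0, NULL, NULL);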

1.5.4. Phase d. Finalization and Releasing Resources
After finishing computation, each code should perform a cleanup operation that includes releasing resources in preparation for the next application. The entire host code structure is shown in Fig. 1.15. In principle, the host code can also perform some algorithmic computation that is not executed on the device, for example the initial data preparation.

Figure 1.15: Host code structure and functionality.
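Phase d then releases the illustrative objects created in the previous sketches; a minimal cleanup might look like this:

/* Release OpenCL objects, roughly in the reverse order of their creation. */
clReleaseMemObject(dA);
clReleaseMemObject(dB);
clReleaseMemObject(dC);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);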

1.6. Applications of Heterogeneous Computing


Heterogeneous computation is already present in many fields. It extends general-purpose computing on graphics processing units by combining traditional computation on multi-core CPUs with computation on any other available computing device. OpenCL can be used for real-time post-processing of a rendered image, or for
accelerating a ray-tracing engine. This standard is also used in real-time enhance-
ment for full-motion video, like image stabilization or improving the image quality of
one frame using consecutive surrounding frames. Nonlinear video-editing software
can accelerate standard operations by using OpenCL; for example, most video filters
work per-pixel, so they can be naturally parallelized. Libraries that accelerate basic
linear algebra operations are another example of OpenCL applications. There is a
wide range of simulation issues that can be targeted by OpenCL – like rigid-body
dynamics, fluid and gas simulation and virtually any imaginable physical simulation
involving many objects or complicated matrix computation. For example, a game
physics engine can use OpenCL for accelerating rigid- and soft-body dynamics in
real time to improve graphics and interactivity of the product. Another field where
massively parallel algorithms can be used is stock trade analysis.
This section describes several algorithms for engineering and scientific applica-
tions that could be accelerated by OpenCL. This is by no means an exhaustive list of
all the algorithms that can be accelerated by OpenCL.

1.6.1. Accelerating Scientific/Engineering Applications
Computational linear algebra has been regarded as the workhorse for applied math-
ematics in nearly all scientific and engineering applications. Software systems such
as Linpack have been used heavily by supercomputer users solving large-scale prob-
lems.
Algebraic linear equations solvers have been used for many years as the perfor-
mance benchmark for testing and ranking the fastest 500 computers in the world.
The latest (2010) TOP500 champion is a Chinese computer, the Tianhe-1A. The Tianhe supercomputer is a heterogeneous machine. It has 14336 Intel Xeon CPUs and 7168 NVIDIA Tesla M2050 GPUs.
The champion's Linpack benchmark performance record is about 2.5 petaflops (2.5 × 10^15 floating-point operations per second). Such impressive
speed has been achieved to a great extent by using massively parallel GPU accelera-
tors. An important question arises: do commonly used scientific and engineering so-
lution algorithms contain components that can be accelerated by massively parallel
devices? To partially answer this question, this section of the book considers several
numerical algorithms frequently used by scientists and engineers.

1.6.2. Conjugate Gradient Method


The first algorithm, a popular method for solving algebraic linear equations, is called
the conjugate gradient method (CG or CGM) [6].

(x^(0) ∈ ℜ^n given)
 1. x := x^(0)
 2. r := b − Ax
 3. p := r
 4. α := ‖r‖²
 5. while α > tol²:
 6.     λ := α / (p^T A p)
 7.     x := x + λp
 8.     r := r − λAp
 9.     p := r + (‖r‖²/α) p
10.     α := ‖r‖²
11. end

The CGM algorithm solves the system of linear equations Ax = b where the ma-
trix A is positive definite. Every iteration of the CGM algorithm requires one matrix-
vector multiplication Ap, two vector dot products:


dp = x^T y = Σ_{i=0}^{n−1} x_i y_i        (1.3)

and three SAXPY operations z := αx + y, where x, y and z are n-component vectors and α is a scalar. The total number of flops (floating-point operations) per iteration is N = n² + 10n. The dominant first term is contributed by the matrix-vector multiplication w = Ap.
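As a side note, a SAXPY operation maps onto a very short OpenCL C kernel; the sketch below is illustrative and not a listing from the book:

/* SAXPY: z := alpha*x + y, one work item per vector element. */
__kernel void saxpy(const float alpha,
                    __global const float *x,
                    __global const float *y,
                    __global float *z)
{
    int i = get_global_id(0);
    z[i] = alpha * x[i] + y[i];
}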
For large problems, the first term will be several orders of magnitude greater
than the second linear term. For this reason, we may be tempted to compute in
parallel on the device only the operation w = Ap.
The matrix-vector multiplication can be done in parallel by computing the ele-
ments of w as dot products of the rows of A and the vector p. The parallel compute
time will now be proportional to 2n instead of 2n², where n is the vector size. This
would mean faster computing and superior scalability. Unfortunately, sharing com-
putation of each iteration by the CPU and the GPU requires data transfers between
the CPU and the device, as depicted in Fig. 1.16.

Figure 1.16: The CPU+GPU shared execution of the CGM with data transfer between
the CPU and the GPU.
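The row-wise matrix-vector product described above could be expressed roughly as the following kernel, in which one work item computes one element of w (the argument names and the row-major storage of A are assumptions for illustration):

/* w := A*p for an n x n matrix A stored row-major in global memory. */
__kernel void mat_vec(__global const float *A,
                      __global const float *p,
                      __global float *w,
                      const int n)
{
    int row = get_global_id(0);    /* each work item owns one row of A */
    float sum = 0.0f;              /* accumulated in private memory */
    for (int j = 0; j < n; j++)
        sum += A[row * n + j] * p[j];
    w[row] = sum;
}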

In the CGM algorithm, the values of p and w change in every iteration and
need to be transferred if q is computed by the GPU and d is computed by the CPU.
The matrix A remains constant and need not be moved. To avoid data transfers that
would seriously degrade performance, it has been decided to compute the entire
iteration on the device. It is common practice to regard the CGM as an iterative
method, although for a problem of size n the method converges in n iterations if
exact arithmetic is used. The speed of the convergence depends upon the condition
number for the matrix A. For a positive definite matrix, the condition number is
the ratio of the largest to the smallest eigenvalues of A. The speed of convergence
increases as the condition number approaches 1.0. There are methods for improving
the conditioning of A. The CGM method with improved conditioning of A is called the
preconditioned CGM and is often used in practical applications. Readers interested
in preconditioning techniques and other mathematical aspects of the CGM may find
useful information in [6] and [7].

1.6.3. Jacobi Method
Another linear equations solver often used for equations derived from discretizing
linear partial differential equations is the stationary iterative Jacobi method. In this
method, the matrix A is split as follows: A = L+D+U where L and U are strictly lower
and upper triangular and D is diagonal. Starting from some initial approximation x^(0), the subsequent approximations are computed from

Dx^(k+1) = b − (L + U)x^(k)        (1.4)

This iteration is perfectly parallel: each element of x can be computed in parallel. Putting it differently, the Jacobi iteration requires a matrix-vector multiplication followed by a SAXPY, both parallelizable. Fig. 1.17 shows the Jacobi algorithm computation on a heterogeneous computer.

Figure 1.17: The Jacobi algorithm computation on a heterogeneous machine.
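A hedged sketch of one Jacobi sweep as an OpenCL C kernel is given below; the row-major storage and the argument names are assumptions, and the convergence check stays on the host, as discussed in the following paragraph:

/* One Jacobi step: x_new[i] = (b[i] - sum over j != i of A[i][j]*x_old[j]) / A[i][i]. */
__kernel void jacobi_step(__global const float *A,
                          __global const float *b,
                          __global const float *x_old,
                          __global float *x_new,
                          const int n)
{
    int i = get_global_id(0);          /* one work item per unknown */
    float sum = b[i];
    for (int j = 0; j < n; j++)
        if (j != i)
            sum -= A[i * n + j] * x_old[j];
    x_new[i] = sum / A[i * n + i];     /* divide by the diagonal entry of D */
}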

Implementing the Jacobi iterative algorithm on a heterogeneous machine can result in a significant acceleration over the sequential execution time, because most of the work is done by the massively parallel GPU. The CPU must initialize the computation before the iterations start, and it keeps checking whether the solution has converged.
Closely related to the Jacobi iterative method is the well-known Gauss-Seidel method, but it is not as easily parallelizable as the Jacobi algorithm. If both methods converge, the Gauss-Seidel converges twice as fast as the Jacobi. Here there is an interesting dilemma: which of the two algorithms is preferred, the slower-converging but highly parallel Jacobi algorithm or the faster-converging Gauss-Seidel method, which may not parallelize as well?