USING OPENCL: PROGRAMMING MASSIVELY PARALLEL COMPUTERS
Advances in Parallel Computing
This book series publishes research and development results on all aspects of parallel
computing. Topics may include one or more of the following: high-speed computing
architectures (Grids, clusters, Service Oriented Architectures, etc.), network technology,
performance measurement, system software, middleware, algorithm design,
development tools, software engineering, services and applications.
Series Editor:
Professor Dr. Gerhard R. Joubert
Volume 21
Recently published in this series
Vol. 20. I. Foster, W. Gentzsch, L. Grandinetti and G.R. Joubert (Eds.), High
Performance Computing: From Grids and Clouds to Exascale
Vol. 19. B. Chapman, F. Desprez, G.R. Joubert, A. Lichnewsky, F. Peters and T. Priol
(Eds.), Parallel Computing: From Multicores and GPU’s to Petascale
Vol. 18. W. Gentzsch, L. Grandinetti and G. Joubert (Eds.), High Speed and Large Scale
Scientific Computing
Vol. 17. F. Xhafa (Ed.), Parallel Programming, Models and Applications in Grid and
P2P Systems
Vol. 16. L. Grandinetti (Ed.), High Performance Computing and Grids in Action
Vol. 15. C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr and F.
Peters (Eds.), Parallel Computing: Architectures, Algorithms and Applications
Janusz Kowalik
16477-107th PL NE, Bothell, WA 98011, USA
and
Tadeusz Puźniakowski
UG, MFI, Wita Stwosza Street 57, 80-952 Gdańsk, Poland
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, without prior written permission from the publisher.
Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: order@iospress.nl
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.
Preface
This book contains the most important and essential information required for de-
signing correct and efficient OpenCL programs. Some details have been omitted but
can be found in the provided references. The authors assume that readers are famil-
iar with basic concepts of parallel computation, have some programming experience
with C or C++ and have a fundamental understanding of computer architecture.
In the book, all terms, definitions and function signatures have been copied from
the official API documents available on the website of the creators of the OpenCL standard.
The book was written in 2011, when OpenCL was in transition from its infancy
to maturity as a practical programming tool for solving real-life problems in science
and engineering. Earlier, the Khronos Group successfully defined OpenCL specifica-
tions, and several companies developed stable OpenCL implementations ready for
learning and testing. A significant contribution to programming heterogeneous com-
puters was made by NVIDIA which created one of the first working systems for pro-
gramming massively parallel computers – CUDA. OpenCL has borrowed from CUDA
several key concepts. At this time (fall 2011), one can install OpenCL on a hetero-
geneous computer and perform meaningful computing experiments. Since OpenCL
is relatively new, there are not many experienced users or sources of practical infor-
mation. One can find on the Web some helpful publications about OpenCL, but there
is still a shortage of complete descriptions of the system suitable for students and
potential users from the scientific and engineering application communities.
Chapter 1 provides short but realistic examples of codes using MPI and OpenMP
in order for readers to compare these two mature and very successful systems with
the fledgling OpenCL. MPI, used for programming clusters, and OpenMP, used for shared
memory computers, have achieved remarkable worldwide success for several rea-
sons. Both have been designed by groups of parallel computing specialists that per-
fectly understood scientific and engineering applications and software development
tools. Both MPI and OpenMP are very compact and easy to learn. Our experience
indicates that it is possible to teach scientists or students whose disciplines are other
than computer science how to use MPI and OpenMP in a matter of several hours. We
hope that OpenCL will benefit from this experience and achieve, in the near future,
a similar success.
Paraphrasing the wisdom of Albert Einstein, we need to simplify OpenCL as
much as possible but not more. The reader should keep in mind that OpenCL will
be evolving and that pioneer users always have to pay an additional price in terms
of initially longer program development time and suboptimal performance before
they gain experience. The goal of achieving simplicity for OpenCL programming re-
quires an additional comment. OpenCL supporting heterogeneous computing offers
us opportunities to select diverse parallel processing devices manufactured by differ-
ent vendors in order to achieve near-optimal or optimal performance. We can select
multi-core CPUs, GPUs, FPGAs and other parallel processing devices to fit the prob-
lem we want to solve. This flexibility is welcomed by many users of HPC technology,
but it has a price.
Programming heterogeneous computers is somewhat more complicated than
writing programs in conventional MPI and OpenMP. We hope this gap will disappear
as OpenCL matures and is universally used for solving large scientific and engineer-
ing problems.
vii
Acknowledgements
It is our pleasure to acknowledge assistance and contributions made by several per-
sons who helped us in writing and publishing the book.
First of all, we express our deep gratitude to Prof. Gerhard Joubert who has
accepted the book as a volume in the book series he is editing, Advances in Parallel
Computing. We are proud to have our book in his very prestigious book series.
Two members of the Khronos organization, Elizabeth Riegel and Neil Trevett,
helped us with evaluating the initial draft of Chapter 2 Fundamentals and provided
valuable feedback. We thank them for the feedback and for their offer of promoting
the book among the Khronos Group member companies.
Our thanks are due to NVIDIA for two hardware grants that enabled our com-
puting work related to the book.
Our thanks are due to Piotr Arłukowicz, who contributed two sections to the
book and helped us with editorial issues related to using LaTeX and the open-source
Blender 3D modeling program.
We thank two persons who helped us improve the book structure and the lan-
guage. They are Dominic Eschweiler from FIAS, Germany and Roberta Scholz from
Redmond, USA.
We also thank several friends and family members who helped us indirectly by
supporting our book-writing effort in various ways.
Janusz Kowalik
Tadeusz Puźniakowski
Contents
1 Introduction 1
1.1 Existing Standard Parallel Programming Systems . . . . . . . . . . . . . . 1
1.1.1 MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 OpenMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Two Parallelization Strategies: Data Parallelism and Task Parallelism . 9
1.2.1 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Task Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 History and Goals of OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.1 Origins of Using GPU in General Purpose Computing . . . . . . 12
1.3.2 Short History of OpenCL . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Heterogeneous Computer Memories and Data Transfer . . . . . . . . . . 14
1.4.1 Heterogeneous Computer Memories . . . . . . . . . . . . . . . . . 14
1.4.2 Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.3 The Fourth Generation CUDA . . . . . . . . . . . . . . . . . . . . . 15
1.5 Host Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5.1 Phase a. Initialization and Creating Context . . . . . . . . . . . . 17
1.5.2 Phase b. Kernel Creation, Compilation and Preparations . . . . . 17
1.5.3 Phase c. Creating Command Queues and Kernel Execution . . . 17
1.5.4 Finalization and Releasing Resource . . . . . . . . . . . . . . . . . 18
1.6 Applications of Heterogeneous Computing . . . . . . . . . . . . . . . . . . 18
1.6.1 Accelerating Scientific/Engineering Applications . . . . . . . . . 19
1.6.2 Conjugate Gradient Method . . . . . . . . . . . . . . . . . . . . . . 19
1.6.3 Jacobi Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6.4 Power Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6.5 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.7 Benchmarking CGM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.2 Additional CGM Description . . . . . . . . . . . . . . . . . . . . . . 24
1.7.3 Heterogeneous Machine . . . . . . . . . . . . . . . . . . . . . . . . 24
1.7.4 Algorithm Implementation and Timing Results . . . . . . . . . . 24
1.7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 OpenCL Fundamentals 27
2.1 OpenCL Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 What is OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 CPU + Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.3 Massive Parallelism Idea . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.4 Work Items and Workgroups . . . . . . . . . . . . . . . . . . . . . . 29
2.1.5 OpenCL Execution Model . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.6 OpenCL Memory Structure . . . . . . . . . . . . . . . . . . . . . . . 30
2.1.7 OpenCL C Language for Programming Kernels . . . . . . . . . . . 30
2.1.8 Queues, Events and Context . . . . . . . . . . . . . . . . . . . . . . 30
2.1.9 Host Program and Kernel . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.10 Data Parallelism in OpenCL . . . . . . . . . . . . . . . . . . . . . . 31
2.1.11 Task Parallelism in OpenCL . . . . . . . . . . . . . . . . . . . . . . 32
2.2 How to Start Using OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.1 Header Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.2 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3 Platforms and Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 OpenCL Platform Properties . . . . . . . . . . . . . . . . . . . . . . 36
2.3.2 Devices Provided by Platform . . . . . . . . . . . . . . . . . . . . . 37
2.4 OpenCL Platforms – C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 OpenCL Context to Manage Devices . . . . . . . . . . . . . . . . . . . . . . 41
2.5.1 Different Types of Devices . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.2 CPU Device Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.3 GPU Device Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.4 Accelerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.5 Different Device Types – Summary . . . . . . . . . . . . . . . . . . 44
2.5.6 Context Initialization – by Device Type . . . . . . . . . . . . . . . 45
2.5.7 Context Initialization – Selecting Particular Device . . . . . . . . 46
2.5.8 Getting Information about Context . . . . . . . . . . . . . . . . . . 47
2.6 OpenCL Context to Manage Devices – C++ . . . . . . . . . . . . . . . . . 48
2.7 Error Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.7.1 Checking Error Codes . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.7.2 Using Exceptions – Available in C++ . . . . . . . . . . . . . . . . 53
2.7.3 Using Custom Error Messages . . . . . . . . . . . . . . . . . . . . . 54
2.8 Command Queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.8.1 In-order Command Queue . . . . . . . . . . . . . . . . . . . . . . . 55
2.8.2 Out-of-order Command Queue . . . . . . . . . . . . . . . . . . . . 57
2.8.3 Command Queue Control . . . . . . . . . . . . . . . . . . . . . . . 60
2.8.4 Profiling Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8.5 Profiling Using Events – C example . . . . . . . . . . . . . . . . . . 61
2.8.6 Profiling Using Events – C++ example . . . . . . . . . . . . . . . 63
2.9 Work-Items and Work-Groups . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.9.1 Information About Index Space from a Kernel . . . . . . . . . . 66
2.9.2 NDRange Kernel Execution . . . . . . . . . . . . . . . . . . . . . . 67
2.9.3 Task Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.9.4 Using Work Offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.10 OpenCL Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.10.1 Different Memory Regions – the Kernel Perspective . . . . . . . . 71
2.10.2 Relaxed Memory Consistency . . . . . . . . . . . . . . . . . . . . . 73
2.10.3 Global and Constant Memory Allocation – Host Code . . . . . . 75
2.10.4 Memory Transfers – the Host Code . . . . . . . . . . . . . . . . . . 78
2.11 Programming and Calling Kernel . . . . . . . . . . . . . . . . . . . . . . . . 79
2.11.1 Loading and Compilation of an OpenCL Program . . . . . . . . . 81
2.11.2 Kernel Invocation and Arguments . . . . . . . . . . . . . . . . . . 88
2.11.3 Kernel Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.11.4 Supported Scalar Data Types . . . . . . . . . . . . . . . . . . . . . 90
2.11.5 Vector Data Types and Common Functions . . . . . . . . . . . . . 92
2.11.6 Synchronization Functions . . . . . . . . . . . . . . . . . . . . . . . 94
2.11.7 Counting Parallel Sum . . . . . . . . . . . . . . . . . . . . . . . . . 96
2.11.8 Parallel Sum – Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . 97
2.11.9 Parallel Sum – Host Program . . . . . . . . . . . . . . . . . . . . . 100
2.12 Structure of the OpenCL Host Program . . . . . . . . . . . . . . . . . . . . 103
2.12.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.12.2 Preparation of OpenCL Programs . . . . . . . . . . . . . . . . . . . 106
2.12.3 Using Binary OpenCL Programs . . . . . . . . . . . . . . . . . . . . 107
2.12.4 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
2.12.5 Release of Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
2.13 Structure of OpenCL host Programs in C++ . . . . . . . . . . . . . . . . . 114
2.13.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
2.13.2 Preparation of OpenCL Programs . . . . . . . . . . . . . . . . . . . 115
2.13.3 Using Binary OpenCL Programs . . . . . . . . . . . . . . . . . . . . 116
2.13.4 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
2.13.5 Release of Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
2.14 The SAXPY Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
2.14.1 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
2.14.2 The Example SAXPY Application – C Language . . . . . . . . . . 123
2.14.3 The example SAXPY application – C++ language . . . . . . . . 128
2.15 Step by Step Conversion of an Ordinary C Program to OpenCL . . . . . 131
2.15.1 Sequential Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.15.2 OpenCL Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 132
2.15.3 Data Allocation on the Device . . . . . . . . . . . . . . . . . . . . . 134
2.15.4 Sequential Function to OpenCL Kernel . . . . . . . . . . . . . . . 135
2.15.5 Loading and Executing a Kernel . . . . . . . . . . . . . . . . . . . . 136
2.15.6 Gathering Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
2.16 Matrix by Vector Multiplication Example . . . . . . . . . . . . . . . . . . . 139
2.16.1 The Program Calculating matrix × vector . . . . . . . . . . . . . 140
2.16.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.16.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
2.16.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
3.1.2 Detecting Available Extensions from API . . . . . . . . . . . . . . 148
3.1.3 Using Runtime Extension Functions . . . . . . . . . . . . . . . . . 149
3.1.4 Using Extensions from OpenCL Program . . . . . . . . . . . . . . 153
3.2 Debugging OpenCL codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.2.1 Printf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.2.2 Using GDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.3 Performance and Double Precision . . . . . . . . . . . . . . . . . . . . . . 162
3.3.1 Floating Point Arithmetics . . . . . . . . . . . . . . . . . . . . . . . 162
3.3.2 Arithmetics Precision – Practical Approach . . . . . . . . . . . . . 165
3.3.3 Profiling OpenCL Application . . . . . . . . . . . . . . . . . . . . . 172
3.3.4 Using the Internal Profiler . . . . . . . . . . . . . . . . . . . . . . . 173
3.3.5 Using External Profiler . . . . . . . . . . . . . . . . . . . . . . . . . 180
3.3.6 Effective Use of Memories – Memory Access Patterns . . . . . . . 183
3.3.7 Matrix Multiplication – Optimization Issues . . . . . . . . . . . . 189
3.4 OpenCL and OpenGL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
3.4.1 Extensions Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
3.4.2 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
3.4.3 Header Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
3.4.4 Common Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
3.4.5 OpenGL Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 198
3.4.6 OpenCL Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . 201
3.4.7 Creating Buffer for OpenGL and OpenCL . . . . . . . . . . . . . . 203
3.4.8 Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
3.4.9 Generating Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
3.4.10 Running Kernel that Operates on Shared Buffer . . . . . . . . . . 215
3.4.11 Results Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
3.4.12 Message Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
3.4.13 Cleanup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
3.4.14 Notes and Further Reading . . . . . . . . . . . . . . . . . . . . . . . 221
3.5 Case Study – Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.1 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
3.5.3 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
3.5.4 Example Problem Definition . . . . . . . . . . . . . . . . . . . . . . 225
3.5.5 Genetic Algorithm Implementation Overview . . . . . . . . . . . 225
3.5.6 OpenCL Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
3.5.7 Most Important Elements of Host Code . . . . . . . . . . . . . . . 234
3.5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
3.5.9 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
A.2.2 Blocks and Threads Indexing Formulas . . . . . . . . . . . . . . . 257
A.2.3 Runtime Error Handling . . . . . . . . . . . . . . . . . . . . . . . . 260
A.2.4 CUDA Driver API Example . . . . . . . . . . . . . . . . . . . . . . . 262
Chapter 1
Introduction
1.1. Existing Standard Parallel Programming Systems
1.1.1. MPI
MPI is a programming system but not a programming language. It is a library of func-
tions for C and subroutines for FORTRAN that are used for message communication
between parallel processes created by MPI. The message-passing computing model
(Fig. 1.1) is a collection of interconnected processes that use their local memories
exclusively. Each process has an individual address identification number called the
rank. Ranks are used for sending and receiving messages and for workload distribu-
tion. There are two kinds of messages: point-to-point messages and collective mes-
sages. Point-to-point message C functions contain several parameters: the address of
the sender, the address of the receiver, the message size and type and some additional
information. In general, message parameters describe the nature of the transmitted
data and the delivery envelope description.
The collection of processes that can exchange messages is called the communica-
tor. In the simplest case there is only one communicator and each process is assigned
to one processor. In more general settings, there are several communicators and sin-
gle processors serve several processes. A process rank is usually an integer running
from 0 to p−1, where p is the total number of processes. It is also possible to
number processes in a more general way than by consecutive integer numbers. For
example, their address ID can be a pair of numbers, such as a point (x, y) in Carte-
sian space. This method of identifying processes may be very helpful in handling ma-
trix operations or partial differential equations (PDEs) in two-dimensional Cartesian
space.
The C loop below will be executed by all processes, from the process with
my_rank equal to 0 to the process with my_rank equal to p − 1.
{
    int i;
    for (i = my_rank*N; i < (my_rank+1)*N; i++)
        z[i] = a*x[i] + y[i];
}
For example, the process with my_rank=1 will add one thousand elements
a*x[i] and y[i], from i = N = 1000 to 2N−1 = 1999. The process with
my_rank = p−1 = 9 will add elements from i = n−N = 9000 to n−1 = 9999.
One can now appreciate the usefulness of assigning a rank ID to each process.
This simple concept makes it possible to send and receive messages and share the
workloads as illustrated above. Of course, before each process can execute its code
for computing a part of the vector z it must receive the needed data. This can be
done by one process initializing the vectors x and y and sending appropriate groups
of data to all remaining processes. The computation of SAXPY is an example of data
parallelism. Each process executes the same code on different data.
To implement functional parallelism, where several processes execute different pro-
grams, process ranks can be used. In the SAXPY example the block of code containing
the loop will be executed by all p processes without exception. But if one process, for
example, the process with rank 0, has to do something different from all the others, this could
be accomplished by specifying the task as follows:
if (my_rank == 0)
    { /* execute the specified task */ }
This block of code will be executed only by the process with my_rank==0. In the
absence of "if (my_rank==something)" instructions, all processes will execute the
enclosed block of code. This technique can be used for a case where
processes have to perform several different computations required in the task-parallel
algorithm. MPI is universal. It can express every kind of algorithmic parallelism.
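As an illustration only (this sketch is not taken from the book; the block size N, the
initialization and the variable names are assumed for the example), the following
complete C program shows how the rank returned by MPI_Comm_rank is used both for
data-parallel SAXPY work division and for assigning a special task to the process with
rank 0:

#include <mpi.h>
#include <stdio.h>
#define N 1000                    /* elements handled by each process (assumed) */

int main(int argc, char *argv[])
{
    int my_rank, p;
    double a = 2.0, x[N], y[N], z[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* rank of this process  */
    MPI_Comm_size(MPI_COMM_WORLD, &p);         /* total number of ranks */

    if (my_rank == 0)                          /* a task done by rank 0 only */
        printf("running on %d processes\n", p);

    /* every process fills and computes its own local block of N elements */
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }
    for (int i = 0; i < N; i++)
        z[i] = a*x[i] + y[i];

    MPI_Finalize();
    return 0;
}

In a realistic program the data initialized by rank 0 would first be distributed to the
other ranks with point-to-point messages or a collective operation such as MPI_Scatter.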
An important concern in parallel computing is the efficiency issue. MPI often
can be computationally efficient because all processes use only local memories. On
the other hand, processes require network communication to place data in the proper
process memories at the proper times. This need for moving data is called algorithmic
synchronization, and it creates communication overhead that negatively impacts
parallel program performance. Performance may suffer significantly if the program
sends many short messages. The communication overhead can be reduced by grouping
data for communication: larger user-defined data types allow larger messages to be
sent less frequently, so the per-message latency is incurred fewer times.
On the negative side of the MPI evaluation scorecard is its unsuitability for incre-
mental (part-by-part) conversion of a serial code into an algorithmically equivalent
MPI parallel code. This limitation stems from the relatively high level at which
MPI programs are designed. To design an MPI program, one has to modify algorithms.
This work is done at an earlier stage of the parallel computing process than develop-
ing parallel codes. Algorithmic level modification usually can’t be done piecewise. It
has to be done all at once. In contrast, in OpenMP a programmer makes changes to
sequential codes written in C or FORTRAN.
Fig. 1.2 shows the difference. Code-level modification makes incremental conver-
sions possible. In a commonly practiced conversion technique, a serial code is
converted incrementally from the most compute-intensive program parts to the least
compute intensive parts until the parallelized code runs sufficiently fast. For further
study of MPI, the book [1] is highly recommended.
1. Mathematical model.
2. Computational model.
3. Numerical algorithm and parallel conversion: MPI
4. Serial code and parallel modification: OpenMP
5. Computer runs.
Figure 1.2: The computer solution process. OpenMP will be discussed in the next
section.
1.1.2. OpenMP
OpenMP is a shared address space computer application programming interface
(API). It contains a number of compiler directives that instruct C/C++ or FORTRAN
“OpenMP aware” compilers to execute certain instructions in parallel and distribute
the workload among multiple parallel threads. Shared address space computers fall
into two major groups: centralized memory multiprocessors, also called Symmetric
Multi-Processors (SMP) as shown in Fig. 1.3, and Distributed Shared Memory (DSM)
multiprocessors whose common representative is the cache coherent Non Uniform
Memory Access architecture (ccNUMA) shown in Fig. 1.4.
SMP computers are also called Uniform Memory Access (UMA). They were the
first commercially successful shared memory computers and they still remain popu-
lar. Their main advantage is uniform memory access. Unfortunately for large num-
bers of processors, the bus becomes overloaded and performance deteriorates. For
this reason, the bus-based SMP architectures are limited to about 32 or 64 proces-
sors. Beyond these sizes, single address memory has to be physically distributed.
Every processor has a chunk of the single address space.
Both architectures have single address space and can be programmed by using
OpenMP. OpenMP parallel programs are executed by multiple independent threads
Figure 1.4: ccNUMA with cache coherent interconnect.
that are streams of instructions having access to shared and private data, as shown
in Fig. 1.5.
The programmer can explicitly define data that are shared and data that are
private. Private data can be accessed only by the thread owning these data. The flow
of OpenMP computation is shown in Fig. 1.6. One thread, the master thread, runs
continuously from the beginning to the end of the program execution. The worker
threads are created for the duration of parallel regions and then are terminated.
When a programmer inserts a proper compiler directive, the system creates a
team of worker threads and distributes workload among them. This operation point
is called the fork. After forking, the program enters a parallel region where parallel
processing is done. After all threads finish their work, the program worker threads
cease to exist or are redeployed while the master thread continues its work. The
event of returning to the master thread is called the join. The fork and join tech-
nique may look like a simple and easy way for parallelizing sequential codes, but the
Figure 1.6: The fork and join OpenMP code flow.
reader should not be deceived. There are several difficulties that will be described
and discussed shortly. First of all, how to find whether several tasks can be run in
parallel?
In order for any group of computational tasks to be correctly executed in parallel,
they have to be independent. That means that the result of the computation does not
depend on the order of task execution. Formally, two programs or parts of programs
are independent if they satisfy the Bernstein conditions shown in (1.1):

    I_j ∩ O_i = ∅
    I_i ∩ O_j = ∅                  (1.1)
    O_i ∩ O_j = ∅
Letters I and O signify input and output. The ∩ symbol means the intersection
of two sets that belong to task i or j. In practice, determining if some group of
tasks can be executed in parallel has to be done by the programmer and may not
be very easy. To discuss other shared memory computing issues, consider a very
simple programming task: computing the dot product of two large vectors x and y,
each having n components. The sequential computation would
be accomplished by a simple C loop
double dp = 0;
for (int i = 0; i < n; i++)
    dp += x[i]*y[i];
Inserting, in front of the above for-loop, the OpenMP directive for parallelizing
the loop and declaring shared and private variables leads to:
double dp = 0;
int i;
#pragma omp parallel for shared(x,y,dp) private(i)
for (i = 0; i < n; i++)
    dp += x[i]*y[i];

This version, however, contains a race condition: several threads update the shared
variable dp concurrently, so the result is unpredictable. One remedy is to protect the
update with the critical construct:
double dp = 0;
int i;
#pragma omp parallel for shared(x,y,dp) private(i)
for (i = 0; i < n; i++) {
    #pragma omp critical
    dp += x[i]*y[i];
}

A more efficient alternative uses the reduction clause:
double dp = 0;
int i;
#pragma omp parallel for reduction(+:dp) shared(x,y) private(i)
for (i = 0; i < n; i++)
    dp += x[i]*y[i];
Use of the critical construct amounts to serializing the computation of the critical
region, the block of code following the critical construct. For this reason, large critical
regions degrade program performance and ought to be avoided. Using the reduction
clause is more efficient and preferable. The reader may have noticed that the loop
counter variable i is declared private, so each thread updates its loop counter in-
dependently without interference. In addition to the parallelizing construct parallel
for that applies to C for-loops, there is a more general sections construct that paral-
lelizes independent sections of code. The sections construct makes it possible to apply
OpenMP to task parallelism where several threads can compute different sections of
a code. For computing two parallel sections, the code structure is:
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        /* some program segment computation */
        #pragma omp section
        /* another program segment computation */
    } /* end of sections block */
} /* end of parallel region */
OpenMP has tools that can be used by programmers for improving performance.
One of them is the clause nowait. Consider the program fragment:
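The original listing is not reproduced here; the sketch below (with illustrative
arrays b, c and d, an illustrative result g, and an arbitrary computation in each loop)
is consistent with the description that follows:

void example(int n, double *b, double *c, double *d, double *g)
{
    int i;
    #pragma omp parallel shared(n, b, c, d, g) private(i)
    {
        #pragma omp for nowait      /* a thread may proceed as soon as its share is done */
        for (i = 0; i < n; i++)
            c[i] = b[i] * b[i];

        #pragma omp for             /* second loop; does not use results of the first */
        for (i = 0; i < n; i++)
            d[i] = 2.0 * b[i];

        #pragma omp barrier         /* all of d must be complete ...                  */
        #pragma omp single
        {                           /* ... before g, a function of d, is computed     */
            double s = 0.0;
            for (int k = 0; k < n; k++) s += d[k];
            *g = s;
        }
    }
}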
In the first parallelized loop, there is the clause nowait. Since the second loop
variables do not depend on the results of the first loop, it is possible to use the clause
nowait – telling the compiler that as soon as any first loop thread finishes its work, it
can start doing the second loop work without waiting for other threads. This speeds
up computation by reducing the potential waiting time for the threads that finish
work at different times.
On the other hand, the construct #pragma omp barrier is inserted after the
second loop to make sure that the second loop is fully completed before the calcula-
tion of g, which is a function of d, is performed. At the barrier, all
threads computing the second loop must wait for the last thread to finish before they
proceed. In addition to the explicit barrier construct #pragma omp barrier used by
the programmer, there are also implicit barriers used by OpenMP automatically at
the end of every parallel region for the purpose of thread synchronization. To sum
up, barriers should be used sparingly and only if necessary. nowait clauses should
be used as frequently as possible, provided that their use is safe.
Finally, we have to point out that the major performance issue in numerical com-
putation is the use of cache memories. For example, in computing the matrix/vector
product c = Ab, two mathematically equivalent methods could be used. In the first
method, elements of the vector c are computed by multiplying each row of A by the
vector b, i.e., computing dot products. An inferior performance will be obtained if c
is computed as the sum of the columns of A multiplied by the elements of b. In this
case, the program would access the columns of A in an order that does not match the
way the matrix data are stored in main memory and transferred to caches (C stores
matrices row by row).
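A minimal sketch of the two access patterns, assuming the C row-major storage
convention and illustrative function names:

/* c = A*b for an n-by-n matrix A stored row-major (C convention) */

/* Method 1: dot products of rows of A with b - contiguous, cache-friendly access */
void matvec_rows(int n, const double *A, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        double dot = 0.0;
        for (int j = 0; j < n; j++)
            dot += A[i*n + j] * b[j];   /* A is traversed in storage order */
        c[i] = dot;
    }
}

/* Method 2: sum of columns of A scaled by elements of b - strided, cache-unfriendly */
void matvec_cols(int n, const double *A, const double *b, double *c)
{
    for (int i = 0; i < n; i++) c[i] = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            c[i] += A[i*n + j] * b[j];  /* stride-n jumps through A */
}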
For further in-depth study of OpenMP and its performance, reading [2] is highly
recommended. Of special value for parallel computing practitioners are Chapter 5
“How to get Good Performance by Using OpenMP” and Chapter 6 “Using OpenMP
in the Real World”. Chapter 6 offers advice for and against using combined MPI and
OpenMP. A chapter on combining MPI and OpenMP can also be found in [3]. Like
MPI and OpenMP, the OpenCL system is standardized. It has been designed to run
regardless of processor types, operating systems and memories. This makes OpenCL
programs highly portable, but the method of developing OpenCL codes is more com-
plicated. The OpenCL programmer has to deal with several low-level programming
issues, including memory management.
their own data sets. Task parallelism can be called Multiple Programs Multiple Data
(MPMD). A small-size example of a task parallel problem is shown in Fig. 1.8. The
directed graph indicates the task execution precedence. Two tasks can execute in
parallel if they are not dependent.
In the case shown in Fig. 1.8, there are two options for executing the entire
set of tasks. Option 1. Execute tasks T1, T2 and T4 in parallel, followed by Task T3
and finally T5. Option 2. Execute tasks T1 and T2 in parallel, then T3 and T4 in
parallel and finally T5. In both cases, the total computing work is equal but the time
to solution may not be.
1.2.3. Example
Consider a problem that can be computed in both ways – via data parallelism and
task parallelism. The problem is to calculate C = A × B − (D + E) where A, B, D
and E are all square matrices of size n × n. An obvious task parallel version would
be to compute in parallel two tasks A × B and D + E and then subtract the sum
from the product. Of course, the task of computing A × B and the task of computing
the sum D + E can be calculated in a data parallel fashion. Here, there are two
levels of parallelism: the higher task level and the lower data parallel level. Similar
multilevel parallelism is common in real-world applications. Not surprisingly, there
is also for this problem a direct data parallel method based on the observation that
every element of C can be computed directly and independently from the coefficients
of A, B, D and E. This computation is shown in equation 1.2.
    c_ij = Σ_{k=0}^{n−1} a_ik b_kj − d_ij − e_ij        (1.2)
The equation (1.2) means that it is possible to calculate all n² elements of C
in parallel using the same formula. This is good news for OpenCL devices that can
handle only one kernel and related data parallel computation. Those devices will
be discussed in the chapters that follow. The matrix C can be computed using three
standard programming systems: MPI, OpenMP and OpenCL.
Using MPI, the matrix C could be partitioned into sub-matrix components and
assigned the subcomponents to processes. Every process would compute a set of
elements of C, using the expression 1.2. The main concern here is not computation
itself but the ease of assembling results and minimizing the cost of communication
while distributing data and assembling the results. A reasonable partitioning would
be dividing C by blocks of rows that can be scattered and gathered as user-defined
data types. Each process would get a set of the rows of A, D and E and the entire
matrix B. If matrix C size is n and the number of processes is p, then each process
would get n/p rows of C to compute. If a new data type is defined as n/p rows,
the data can easily be distributed by strips of rows to processes and then results can
be gathered. The suggested algorithm is a data parallel method. Data partitioning is
shown in Fig. 1.9.
Figure 1.9: Strips of data needed by a process to compute the topmost strip of C.
The data needed for each process include one strip of A, D and E and the entire
matrix B. Each process of rank 0<=rank<p computes one strip of C rows. After fin-
ishing computation, the matrix C can be assembled by the collective MPI communi-
cation function MPI_Gather. An alternative approach would be partitioning C into
blocks and assigning each process to compute one block of C. However, assembling
the results would be harder than in the strip partitioning case.
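A sketch of the strip-partitioned computation described above is given below. It is
not the book's code: it assumes n is divisible by p, row-major storage, and that rank 0
holds the full matrices; it scatters contiguous strips of n/p rows instead of defining a
derived data type, and allocation of B on every rank and error checking are omitted.

#include <mpi.h>
#include <stdlib.h>

void compute_C(int n, double *A, double *B, double *D, double *E, double *C)
{
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows  = n / p;          /* rows of C computed by each process */
    int strip = rows * n;       /* number of elements in one strip    */

    double *a = malloc(strip * sizeof(double));   /* local strips of A, D, E, C */
    double *d = malloc(strip * sizeof(double));
    double *e = malloc(strip * sizeof(double));
    double *c = malloc(strip * sizeof(double));

    /* distribute strips of rows; every process also needs the whole of B */
    MPI_Scatter(A, strip, MPI_DOUBLE, a, strip, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(D, strip, MPI_DOUBLE, d, strip, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(E, strip, MPI_DOUBLE, e, strip, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* each element of the local strip of C follows equation (1.2) */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[i*n + k] * B[k*n + j];
            c[i*n + j] = s - d[i*n + j] - e[i*n + j];
        }

    /* reassemble the full matrix C on rank 0 */
    MPI_Gather(c, strip, MPI_DOUBLE, C, strip, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(a); free(d); free(e); free(c);
}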
OpenMP would take advantage of the task parallel approach. First, the subtasks
A × B and D + E would be parallelized and computed separately, and then C = A ×
B − (D + E) would be computed, as shown in Fig. 1.10. One weakness of task parallel
programs is the efficiency loss if all parallel tasks do not represent equal workloads.
For example, in the matrix C computation, the task of computing A×B and the task of
computing D+E are not load-equal. Unequal workloads could cause some threads to
idle unless tasks are selected dynamically for execution by the scheduler. In general,
data parallel implementations are well load balanced and tend to be more efficient.
In the case of OpenCL implementation of the data parallel option for computing C, a
single kernel function would compute elements of C by the equation 1.2 where the
sum represents dot products of A × B and the remaining terms represent subtracting
D and E. If the matrix size is n = 1024, a compute device could execute over one
million work items in parallel. Additional performance gain can be achieved by using
the tiling technique [4].
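A kernel implementing equation (1.2) could look like the following sketch (not taken
from the book; the kernel name and the float data type are illustrative). The host
would enqueue it over a two-dimensional index space of n × n work items:

__kernel void compute_c(const int n,
                        __global const float *A,
                        __global const float *B,
                        __global const float *D,
                        __global const float *E,
                        __global float *C)
{
    int i = get_global_id(0);     /* row index of the element    */
    int j = get_global_id(1);     /* column index of the element */
    if (i < n && j < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)       /* dot product: row i of A with column j of B */
            sum += A[i*n + k] * B[k*n + j];
        C[i*n + j] = sum - D[i*n + j] - E[i*n + j];   /* equation (1.2) */
    }
}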
Figure 1.10: The task parallel approach for OpenMP.
Numerical algorithms containing data parallel computations are discussed in Section
1.6 of this Chapter.
As a parallel programming system, OpenCL is preceded by MPI, OpenMP and
CUDA. Parallel regions in OpenMP are comparable to parallel kernel executions
in OpenCL. A crucial difference is OpenMP’s limited scalability due to heavy over-
head in creating and managing threads. Threads used on heterogeneous systems are
lightweight, hence suitable for massive parallelism. The most similar to OpenCL is
NVIDIA’s CUDA (Compute Unified Device Architecture) because OpenCL heavily bor-
rowed from CUDA its fundamental features. Readers who know CUDA will find it
relatively easy to learn and use OpenCL after becoming familiar with the mapping
between CUDA and OpenCL technical terms and minor programming differences.
The technical terminology differences are shown in Tab. 1.1.

Table 1.1: CUDA and OpenCL terminology.
    CUDA            OpenCL
    Thread          Work item
    Block           Work group
    Grid            Index space
Similar one-to-one mapping exists for the CUDA and OpenCL API calls.
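For example, the mapping shows up directly in how a kernel obtains its global index.
The OpenCL SAXPY sketch below (illustrative, not from the book) notes the
corresponding CUDA expression in a comment:

__kernel void saxpy_cl(int n, float a, __global const float *x, __global float *y)
{
    /* OpenCL: the global work-item id; the CUDA equivalent would be
       blockIdx.x * blockDim.x + threadIdx.x inside a __global__ kernel */
    int i = get_global_id(0);
    if (i < n)
        y[i] = a * x[i] + y[i];
}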
1.4. Heterogeneous Computer Memories and Data
Transfer
1.4.1. Heterogeneous Computer Memories
A device’s main memory is the global memory (Fig. 1.11). All work items have access
to this memory. It is the largest but also the slowest memory on a heterogeneous
device. The global memory can be dynamically allocated by the host program and
can be read and written by the host and the device.
The constant memory can also be dynamically allocated by the host. It can be
used for read and write operations by the host, but it is read-only for the device. All
work items can read from it. This memory can be fast if the system has a supporting
cache.
Local memories are shared by work items within a work group. For example, two
work items can synchronize their operation using this memory if they belong to the
same work group. Local memories can’t be accessed by the host.
Private memories can be accessed only by each individual work item. They are
registers. A kernel can use only these four device memories.
In contrast, the host system can use the host memory and the global/constant
memories on the device. If a device is a multicore CPU, then all device memories are
portions of the RAM.
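The four device memory regions correspond to address space qualifiers in OpenCL C.
The following illustrative kernel (not from the book) touches all of them:

__kernel void memory_regions(__global float *data,     /* global memory              */
                             __constant float *coeff,  /* constant memory            */
                             __local float *scratch)   /* local, shared per work group */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    float v = data[gid] * coeff[0];   /* private variable v typically lives in registers */
    scratch[lid] = v;                 /* visible to all work items of the same group     */

    barrier(CLK_LOCAL_MEM_FENCE);     /* work items of one group synchronize here        */

    data[gid] = scratch[lid];         /* result written back to global memory            */
}

The __local buffer is not allocated inside the kernel; the host reserves its size with
clSetKernelArg, passing NULL as the argument value.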
Figure 1.13: The CGM algorithm iterations. The only data transfer takes place at the
very beginning of the iterative process.
and global GPU memory. This change will simplify programming systems that com-
bine the CPU and GPU and will eliminate currently necessary data transfers between
the host and the device.
Kernels are written in the OpenCL C language and executed on highly parallel
devices. They provide performance improvements. An example of a kernel code is
shown in section 2.1.
In this section, the structure and the functionality of host programs are described.
The main purpose of the host code is to manage device(s). More specifically, host
codes arrange and submit kernels for execution on device(s). In general, a host code
has four phases: (a) initialization and creating a context; (b) kernel creation, compi-
lation and preparations; (c) creating command queues and kernel execution; and
(d) finalization and releasing resources.
Section 2.8 in Chapter 2 provides details of the four phases and an example of
the host code structure. This section is an introduction to section 2.11 in Chapter
2. This introduction may be useful for readers who start learning OpenCL since the
concept of the host code has not been used in other parallel programming systems.
1.5.4. Finalization and Releasing Resource
After finishing computation, each code should perform a cleanup operation that in-
cludes releasing of resources in preparation for the next application. The entire host
code structure is shown in Fig. 1.15. In principle, the host code can also perform
some algorithmic computation that is not executed on the device – for example, the
initial data preparation.
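A compressed sketch of such a host code is shown below. It is not the book's example:
error checking, data transfers and buffer initialization are omitted, the one-line kernel
source is only a placeholder, and an OpenCL 1.x installation is assumed.

#include <CL/cl.h>

const char *src = "__kernel void k(__global float *a) { a[get_global_id(0)] *= 2.0f; }";

int main(void)
{
    /* Phase a: initialization and creating a context */
    cl_platform_id platform;  cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    /* Phase b: kernel creation, compilation and preparations */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "k", NULL);
    size_t n = 1024;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* Phase c: creating a command queue and kernel execution */
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    /* Phase d: finalization and releasing resources */
    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}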
1.6.1. Accelerating Scientific/Engineering Applications
Computational linear algebra has been regarded as the workhorse for applied math-
ematics in nearly all scientific and engineering applications. Software systems such
as Linpack have been used heavily by supercomputer users solving large-scale prob-
lems.
Algebraic linear equations solvers have been used for many years as the perfor-
mance benchmark for testing and ranking the fastest 500 computers in the world.
The latest (2010) TOP500 champion is a Chinese computer, the Tianhe-1A.
The Tianhe supercomputer is a heterogeneous machine. It has 14336 Intel Xeon
CPUs and 7168 NVIDIA Tesla M2050 GPUs.
The champion’s Linpack benchmark performance record is about 2.5 petaflops
(2.5 × 10¹⁵ floating-point operations per second). Such impressive
speed has been achieved to a great extent by using massively parallel GPU accelera-
tors. An important question arises: do commonly used scientific and engineering so-
lution algorithms contain components that can be accelerated by massively parallel
devices? To partially answer this question, this section of the book considers several
numerical algorithms frequently used by scientists and engineers.
(x^(0) ∈ ℜ^n given)
1.  x := x^(0)
2.  r := b − Ax
3.  p := r
4.  α := ‖r‖²
5.  while α > tol²:
6.      λ := α/(pᵀAp)
7.      x := x + λp
8.      r := r − λAp
9.      p := r + (‖r‖²/α)p
10.     α := ‖r‖²
11. end
The CGM algorithm solves the system of linear equations Ax = b where the ma-
trix A is positive definite. Every iteration of the CGM algorithm requires one matrix-
vector multiplication Ap, two vector dot products:
    dp = xᵀy = Σ_{i=0}^{n−1} x_i y_i        (1.3)

iteration is N = n² + 10n. The dominant first term is contributed by the matrix-vector
multiplication w = Ap.
For large problems, the first term will be several orders of magnitude greater
than the second linear term. For this reason, we may be tempted to compute in
parallel on the device only the operation w = Ap.
The matrix-vector multiplication can be done in parallel by computing the ele-
ments of w as dot products of the rows of A and the vector p. The parallel compute
time will now be proportional to 2n instead of 2n², where n is the vector size. This
would mean faster computing and superior scalability. Unfortunately, sharing com-
putation of each iteration by the CPU and the GPU requires data transfers between
the CPU and the device, as depicted in Fig. 1.16.
Figure 1.16: The CPU+GPU shared execution of the CGM with data transfer between
the CPU and the GPU.
In the CGM algorithm, the values of p and w change in every iteration and
need to be transferred if q is computed by the GPU and d is computed by the CPU.
The matrix A remains constant and need not be moved. To avoid data transfers that
would seriously degrade performance, it has been decided to compute the entire
iteration on the device. It is common practice to regard the CGM as an iterative
method, although for a problem of size n the method converges in n iterations if
exact arithmetic is used. The speed of the convergence depends upon the condition
number for the matrix A. For a positive definite matrix, the condition number is
the ratio of the largest to the smallest eigenvalues of A. The speed of convergence
increases as the condition number approaches 1.0. There are methods for improving
the conditioning of A. The CGM method with improved conditioning of A is called the
preconditioned CGM and is often used in practical applications. Readers interested
in preconditioning techniques and other mathematical aspects of the CGM may find
useful information in [6] and [7].
1.6.3. Jacobi Method
Another linear equations solver often used for equations derived from discretizing
linear partial differential equations is the stationary iterative Jacobi method. In this
method, the matrix A is split as follows: A = L+D+U where L and U are strictly lower
and upper triangular and D is diagonal. Starting from some initial approximation x 0 ,
the subsequent approximations are computed from x^(k+1) = D⁻¹(b − (L + U)x^(k)).