ARCoSS
LNCS 13240

Ilya Sergey (Ed.)

Programming Languages and Systems

31st European Symposium on Programming, ESOP 2022
Held as Part of the European Joint Conferences
on Theory and Practice of Software, ETAPS 2022
Munich, Germany, April 2–7, 2022
Proceedings
Lecture Notes in Computer Science 13240
Founding Editors
Gerhard Goos, Germany
Juris Hartmanis, USA

Editorial Board Members


Elisa Bertino, USA
Wen Gao, China
Bernhard Steffen, Germany
Gerhard Woeginger, Germany
Moti Yung, USA

Advanced Research in Computing and Software Science


Subline of Lecture Notes in Computer Science

Subline Series Editors


Giorgio Ausiello, University of Rome ‘La Sapienza’, Italy
Vladimiro Sassone, University of Southampton, UK

Subline Advisory Board


Susanne Albers, TU Munich, Germany
Benjamin C. Pierce, University of Pennsylvania, USA
Bernhard Steffen, University of Dortmund, Germany
Deng Xiaotie, Peking University, Beijing, China
Jeannette M. Wing, Microsoft Research, Redmond, WA, USA
More information about this series at https://link.springer.com/bookseries/558
Ilya Sergey (Ed.)

Programming Languages and Systems
31st European Symposium on Programming, ESOP 2022
Held as Part of the European Joint Conferences
on Theory and Practice of Software, ETAPS 2022
Munich, Germany, April 2–7, 2022
Proceedings

Editor
Ilya Sergey
National University of Singapore
Singapore, Singapore

ISSN 0302-9743 ISSN 1611-3349 (electronic)


Lecture Notes in Computer Science
ISBN 978-3-030-99335-1 ISBN 978-3-030-99336-8 (eBook)
https://doi.org/10.1007/978-3-030-99336-8
© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution
and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this book are included in the book’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use,
you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, expressed or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
ETAPS Foreword

Welcome to the 25th ETAPS! ETAPS 2022 took place in Munich, the beautiful capital
of Bavaria, in Germany.
ETAPS 2022 is the 25th instance of the European Joint Conferences on Theory and
Practice of Software. ETAPS is an annual federated conference established in 1998,
and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each
conference has its own Program Committee (PC) and its own Steering Committee
(SC). The conferences cover various aspects of software systems, ranging from theo-
retical computer science to foundations of programming languages, analysis tools, and
formal approaches to software engineering. Organizing these conferences in a coherent,
highly synchronized conference program enables researchers to participate in an
exciting event, having the possibility to meet many colleagues working in different
directions in the field, and to easily attend talks of different conferences. On the
weekend before the main conference, numerous satellite workshops took place that
attracted many researchers from all over the globe.
ETAPS 2022 received 362 submissions in total, 111 of which were accepted,
yielding an overall acceptance rate of 30.7%. I thank all the authors for their interest in
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con-
tributions, and in particular the PC (co-)chairs for their hard work in running this entire
intensive process. Last but not least, my congratulations to all authors of the accepted
papers!
ETAPS 2022 featured the unifying invited speakers Alexandra Silva (University
College London, UK, and Cornell University, USA) and Tomáš Vojnar (Brno
University of Technology, Czech Republic) and the conference-specific invited
speakers Nathalie Bertrand (Inria Rennes, France) for FoSSaCS and Lenore Zuck
(University of Illinois at Chicago, USA) for TACAS. Invited tutorials were provided by
Stacey Jeffery (CWI and QuSoft, The Netherlands) on quantum computing and
Nicholas Lane (University of Cambridge and Samsung AI Lab, UK) on federated
learning.
As this event was the 25th edition of ETAPS, part of the program was a special
celebration where we looked back on the achievements of ETAPS and its constituting
conferences in the past, but we also looked into the future, and discussed the challenges
ahead for research in software science. This edition also reinstated the ETAPS men-
toring workshop for PhD students.
ETAPS 2022 took place in Munich, Germany, and was organized jointly by the
Technical University of Munich (TUM) and the LMU Munich. The former was
founded in 1868, and the latter in 1472 as the 6th oldest German university still running
today. Together, they have 100,000 enrolled students, regularly rank among the top
100 universities worldwide (with TUM’s computer-science department ranked #1 in
the European Union), and their researchers and alumni include 60 Nobel laureates.

The local organization team consisted of Jan Křetínský (general chair), Dirk Beyer
(general, financial, and workshop chair), Julia Eisentraut (organization chair), and
Alexandros Evangelidis (local proceedings chair).
ETAPS 2022 was further supported by the following associations and societies:
ETAPS e.V., EATCS (European Association for Theoretical Computer Science),
EAPLS (European Association for Programming Languages and Systems), and EASST
(European Association of Software Science and Technology).
The ETAPS Steering Committee consists of an Executive Board, and representa-
tives of the individual ETAPS conferences, as well as representatives of EATCS,
EAPLS, and EASST. The Executive Board consists of Holger Hermanns
(Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofroň (Prague), Barbara König
(Duisburg), Thomas Noll (Aachen), Caterina Urban (Paris), Tarmo Uustalu (Reykjavik
and Tallinn), and Lenore Zuck (Chicago).
Other members of the Steering Committee are Patricia Bouyer (Paris), Einar Broch
Johnsen (Oslo), Dana Fisman (Be’er Sheva), Reiko Heckel (Leicester), Joost-Pieter
Katoen (Aachen and Twente), Fabrice Kordon (Paris), Jan Křetínský (Munich), Orna
Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick),
Andrew M. Pitts (Cambridge), Elizabeth Polgreen (Edinburgh), Grigore Roşu (Illinois),
Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella
(Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Natasha Sharygina
(Lugano), Pawel Sobocinski (Tallinn), Peter Thiemann (Freiburg), Sebastián Uchitel
(London and Buenos Aires), Jan Vitek (Prague), Andrzej Wasowski (Copenhagen),
Thomas Wies (New York), Anton Wijs (Eindhoven), and Manuel Wimmer (Linz).
I’d like to take this opportunity to thank all authors, attendees, organizers of the
satellite workshops, and Springer-Verlag GmbH for their support. I hope you all
enjoyed ETAPS 2022.
Finally, a big thanks to Jan, Julia, Dirk, and their local organization team for all their
enormous efforts to make ETAPS a fantastic event.

February 2022 Marieke Huisman


ETAPS SC Chair
ETAPS e.V. President
Preface

This volume contains the papers accepted at the 31st European Symposium on
Programming (ESOP 2022), held during April 5–7, 2022, in Munich, Germany
(COVID-19 permitting). ESOP is one of the European Joint Conferences on Theory
and Practice of Software (ETAPS); it is dedicated to fundamental issues in the spec-
ification, design, analysis, and implementation of programming languages and systems.
The 21 papers in this volume were selected by the Program Committee (PC) from
64 submissions. Each submission received three to four reviews. After
receiving the initial reviews, the authors had a chance to respond to questions and
clarify misunderstandings of the reviewers. After the author response period, the papers
were discussed electronically using the HotCRP system by the 33 Program Committee
members and 33 external reviewers. Two papers, for which the PC chair had a conflict
of interest, were kindly managed by Zena Ariola. The reviewing for ESOP 2022 was
double-anonymous, and only authors of the eventually accepted papers have been
revealed.
Following the example set by other major conferences in programming languages,
for the first time in its history, ESOP featured optional artifact evaluation. Authors
of the accepted manuscripts were invited to submit artifacts, such as code, datasets, and
mechanized proofs, that supported the conclusions of their papers. Members of the
Artifact Evaluation Committee (AEC) read the papers and explored the artifacts,
assessing their quality and checking that they supported the authors’ claims. The
authors of eleven of the accepted papers submitted artifacts, which were evaluated by
20 AEC members, with each artifact receiving four reviews. Authors of papers with
accepted artifacts were assigned official EAPLS artifact evaluation badges, indicating
that they have taken the extra time and have undergone the extra scrutiny to prepare a
useful artifact. The ESOP 2022 AEC awarded Artifacts Functional and Artifacts
(Functional and) Reusable badges. All submitted artifacts were deemed Functional, and
all but one were found to be Reusable.
My sincere thanks go to all who contributed to the success of the conference and to
its exciting program. This includes the authors who submitted papers for consideration;
the external reviewers who provided timely expert reviews sometimes on very short
notice; the AEC members and chairs who took great care of this new aspect of ESOP;
and, of course, the members of the ESOP 2022 Program Committee. I was extremely
impressed by the excellent quality of the reviews, the amount of constructive feedback
given to the authors, and the criticism delivered in a professional and friendly tone.
I am very grateful to Andreea Costea and KC Sivaramakrishnan who kindly agreed to
serve as co-chairs for the ESOP 2022 Artifact Evaluation Committee. I would like to
thank the ESOP 2021 chair Nobuko Yoshida for her advice, patience, and the many
insightful discussions on the process of running the conference. I thank all who con-
tributed to the organization of ESOP: the ESOP steering committee and its chair Peter
Thiemann, as well as the ETAPS steering committee and its chair Marieke Huisman.

Finally, I would like to thank Barbara König and Alexandros Evangelidis for their help
with assembling the proceedings.

February 2022 Ilya Sergey


Organization

Program Chair
Ilya Sergey National University of Singapore, Singapore

Program Committee
Michael D. Adams Yale-NUS College, Singapore
Danel Ahman University of Ljubljana, Slovenia
Aws Albarghouthi University of Wisconsin-Madison, USA
Zena M. Ariola University of Oregon, USA
Ahmed Bouajjani Université de Paris, France
Giuseppe Castagna CNRS, Université de Paris, France
Cristina David University of Bristol, UK
Mariangiola Dezani Università di Torino, Italy
Rayna Dimitrova CISPA Helmholtz Center for Information Security,
Germany
Jana Dunfield Queen’s University, Canada
Aquinas Hobor University College London, UK
Guilhem Jaber Université de Nantes, France
Jeehoon Kang KAIST, South Korea
Ekaterina Komendantskaya Heriot-Watt University, UK
Ori Lahav Tel Aviv University, Israel
Ivan Lanese Università di Bologna, Italy, and Inria, France
Dan Licata Wesleyan University, USA
Sam Lindley University of Edinburgh, UK
Andreas Lochbihler Digital Asset, Switzerland
Cristina Lopes University of California, Irvine, USA
P. Madhusudan University of Illinois at Urbana-Champaign, USA
Stefan Marr University of Kent, UK
James Noble Victoria University of Wellington, New Zealand
Burcu Kulahcioglu Ozkan Delft University of Technology, The Netherlands
Andreas Pavlogiannis Aarhus University, Denmark
Vincent Rahli University of Birmingham, UK
Robert Rand University of Chicago, USA
Christine Rizkallah University of Melbourne, Australia
Alejandro Russo Chalmers University of Technology, Sweden
Gagandeep Singh University of Illinois at Urbana-Champaign, USA
Gordon Stewart BedRock Systems, USA
Joseph Tassarotti Boston College, USA
Bernardo Toninho Universidade NOVA de Lisboa, Portugal

Additional Reviewers
Andreas Abel Gothenburg University, Sweden
Guillaume Allais University of St Andrews, UK
Kalev Alpernas Tel Aviv University, Israel
Davide Ancona Università di Genova, Italy
Stephanie Balzer Carnegie Mellon University, USA
Giovanni Bernardi Université de Paris, France
Soham Chakraborty Delft University of Technology, The Netherlands
Arthur Chargueraud Inria, France
Ranald Clouston Australian National University, Australia
Fredrik Dahlqvist University College London, UK
Olivier Danvy Yale-NUS College, Singapore
Benjamin Delaware Purdue University, USA
Dominique Devriese KU Leuven, Belgium
Paul Downen University of Massachusetts, Lowell, USA
Yannick Forster Saarland University, Germany
Milad K. Ghale University of New South Wales, Australia
Kiran Gopinathan National University of Singapore, Singapore
Tristan Knoth University of California, San Diego, USA
Paul Levy University of Birmingham, UK
Umang Mathur National University of Singapore, Singapore
McKenna McCall Carnegie Mellon University, USA
Garrett Morris University of Iowa, USA
Fredrik Nordvall Forsberg University of Strathclyde, UK
José N. Oliveira University of Minho, Portugal
Alex Potanin Australian National University, Australia
Susmit Sarkar University of St Andrews, UK
Filip Sieczkowski Heriot-Watt University, UK
Kartik Singhal University of Chicago, USA
Sandro Stucki Chalmers University of Technology and University
of Gothenburg, Sweden
Amin Timany Aarhus University, Denmark
Klaus v. Gleissenthall Vrije Universiteit Amsterdam, The Netherlands
Thomas Wies New York University, USA
Vladimir Zamdzhiev Inria, Loria, Université de Lorraine, France

Artifact Evaluation Committee Chairs


Andreea Costea National University of Singapore, Singapore
K. C. Sivaramakrishnan IIT Madras, India

Artifact Evaluation Committee


Utpal Bora IIT Hyderabad, India
Darion Cassel Carnegie Mellon University, USA

Pritam Choudhury University of Pennsylvania, USA


Jan de Muijnck-Hughes University of Glasgow, UK
Darius Foo National University of Singapore, Singapore
Léo Gourdin Université Grenoble-Alpes, France
Daniel Hillerström University of Edinburgh, UK
Jules Jacobs Radboud University, The Netherlands
Chaitanya Koparkar Indiana University, USA
Yinling Liu Toulouse Computer Science Research Center, France
Yiyun Liu University of Pennsylvania, USA
Kristóf Marussy Budapest University of Technology and Economics,
Hungary
Orestis Melkonian University of Edinburgh, UK
Shouvick Mondal Concordia University, Canada
Krishna Narasimhan TU Darmstadt, Germany
Mário Pereira Universidade NOVA de Lisboa, Portugal
Goran Piskachev Fraunhofer IEM, Germany
Somesh Singh Inria, France
Yahui Song National University of Singapore, Singapore
Vimala Soundarapandian IIT Madras, India
Contents

Categorical Foundations of Gradient-Based Learning . . . . . . . . . . . . . . . . . . 1


Geoffrey S. H. Cruttwell, Bruno Gavranović, Neil Ghani, Paul Wilson,
and Fabio Zanasi

Compiling Universal Probabilistic Programming Languages with Efficient


Parallel Sequential Monte Carlo Inference . . . . . . . . . . . . . . . . . . . . . . . . . 29
Daniel Lundén, Joey Öhman, Jan Kudlicka, Viktor Senderov,
Fredrik Ronquist, and David Broman

Foundations for Entailment Checking in Quantitative Separation Logic . . . . . 57


Kevin Batz, Ira Fesefeldt, Marvin Jansen, Joost-Pieter Katoen,
Florian Keßler, Christoph Matheja, and Thomas Noll

Extracting total Amb programs from proofs . . . . . . . . . . . . . . . . . . . . . . . . 85


Ulrich Berger and Hideki Tsuiki

Why3-do: The Way of Harmonious Distributed System Proofs . . . . . . . . . . . 114


Cláudio Belo Lourenço and Jorge Sousa Pinto

Relaxed virtual memory in Armv8-A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143


Ben Simner, Alasdair Armstrong, Jean Pichon-Pharabod,
Christopher Pulte, Richard Grisenthwaite, and Peter Sewell

Verified Security for the Morello Capability-enhanced Prototype


Arm Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Thomas Bauereiss, Brian Campbell, Thomas Sewell,
Alasdair Armstrong, Lawrence Esswood, Ian Stark, Graeme Barnes,
Robert N. M. Watson, and Peter Sewell

The Trusted Computing Base of the CompCert Verified Compiler . . . . . . . . 204


David Monniaux and Sylvain Boulmé

View-Based Owicki–Gries Reasoning for Persistent x86-TSO. . . . . . . . . . . . 234


Eleni Vafeiadi Bila, Brijesh Dongol, Ori Lahav, Azalea Raad,
and John Wickerson

Abstraction for Crash-Resilient Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 262


Artem Khyzha and Ori Lahav

Static Race Detection for Periodic Programs . . . . . . . . . . . . . . . . . . . . . . . . 290


Varsha P Suresh, Rekha Pai, Deepak D’Souza, Meenakshi D’Souza,
and Sujit Kumar Chakrabarti

Probabilistic Total Store Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317


Parosh Aziz Abdulla, Mohamed Faouzi Atig, Raj Aryan Agarwal,
Adwait Godbole, and Krishna S.

Linearity and Uniqueness: An Entente Cordiale . . . . . . . . . . . . . . . . . . . . . 346


Daniel Marshall, Michael Vollmer, and Dominic Orchard

A Framework for Substructural Type Systems . . . . . . . . . . . . . . . . . . . . . . 376


James Wood and Robert Atkey

A Dependent Dependency Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403


Pritam Choudhury, Harley Eades III, and Stephanie Weirich

Polarized Subtyping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431


Zeeshan Lakhani, Ankush Das, Henry DeYoung, Andreia Mordido,
and Frank Pfenning

Structured Handling of Scoped Effects. . . . . . . . . . . . . . . . . . . . . . . . . . . . 462


Zhixuan Yang, Marco Paviotti, Nicolas Wu, Birthe van den Berg,
and Tom Schrijvers

Region-based Resource Management and Lexical Exception Handlers


in Continuation-Passing Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
Philipp Schuster, Jonathan Immanuel Brachthäuser,
and Klaus Ostermann

A Predicate Transformer for Choreographies: Computing Preconditions


in Choreographic Programming. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
Sung-Shik Jongmans and Petra van den Bos

Comparing the Expressiveness of the π-calculus and CCS . . . . . . . . . . . . . . 548


Rob van Glabbeek

Concurrent NetKAT: Modeling and analyzing stateful,


concurrent networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
Jana Wagemaker, Nate Foster, Tobias Kappé, Dexter Kozen,
Jurriaan Rot, and Alexandra Silva

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603


Categorical Foundations of Gradient-Based Learning

Geoffrey S. H. Cruttwell1, Bruno Gavranović2, Neil Ghani2, Paul Wilson3, and Fabio Zanasi3

1 Mount Allison University, Canada
2 University of Strathclyde, United Kingdom
3 University College London

Abstract. We propose a categorical semantics of gradient-based machine
learning algorithms in terms of lenses, parametric maps, and reverse
derivative categories. This foundation provides a powerful explanatory
and unifying framework: it encompasses a variety of gradient descent
algorithms such as ADAM, AdaGrad, and Nesterov momentum, as well
as a variety of loss functions such as MSE and Softmax cross-entropy,
shedding new light on their similarities and differences. Our approach to
gradient-based learning has examples generalising beyond the familiar
continuous domains (modelled in categories of smooth maps) and can
be realized in the discrete setting of boolean circuits. Finally, we demonstrate
the practical significance of our framework with an implementation
in Python.

1 Introduction
The last decade has witnessed a surge of interest in machine learning, fuelled by
the numerous successes and applications that these methodologies have found in
many fields of science and technology. As machine learning techniques become
increasingly pervasive, algorithms and models become more sophisticated, posing
a significant challenge both to the software developers and the users that need to
interface, execute and maintain these systems. In spite of this rapidly evolving
picture, the formal analysis of many learning algorithms mostly takes place at a
heuristic level [41], or using definitions that fail to provide a general and scalable
framework for describing machine learning. Indeed, it is commonly acknowledged
through academia, industry, policy makers and funding agencies that there is a
pressing need for a unifying perspective, which can make this growing body of
work more systematic, rigorous, transparent and accessible both for users and
developers [2, 36].
Consider, for example, one of the most common machine learning scenar-
ios: supervised learning with a neural network. This technique trains the model
towards a certain task, e.g. the recognition of patterns in a data set (cf. Fig-
ure 1). There are several different ways of implementing this scenario. Typically,
at their core, there is a gradient update algorithm (often called the “optimiser”),
depending on a given loss function, which updates in steps the parameters of the
network, based on some learning rate controlling the “scaling” of the update. All

of these components can vary independently in a supervised learning algorithm


and a number of choices is available for loss maps (quadratic error, Softmax
cross entropy, dot product, etc.) and optimisers (Adagrad [20], Momentum [37],
and Adam [32], etc.).

Fig. 1: An informal illustration of gradient-based learning. This neural network


is trained to distinguish different kinds of animals in the input image. Given an
input X, the network predicts an output Y , which is compared by a ‘loss map’
with what would be the correct answer (‘label’). The loss map returns a real
value expressing the error of the prediction; this information, together with the
learning rate (a weight controlling how much the model should be changed in
response to error) is used by an optimiser, which computes by gradient-descent
the update of the parameters of the network, with the aim of improving its
accuracy. The neural network, the loss map, the optimiser and the learning rate
are all components of a supervised learning system, and can vary independently
of one another.

This scenario highlights several questions: is there a uniform mathemati-


cal language capturing the different components of the learning process? Can
we develop a unifying picture of the various optimisation techniques, allowing
for their comparative analysis? Moreover, it should be noted that supervised
learning is not limited to neural networks. For example, supervised learning is
surprisingly applicable to the discrete setting of boolean circuits [50] where con-
tinuous functions are replaced by boolean-valued functions. Can we identify an
abstract perspective encompassing both the real-valued and the boolean case?
In a nutshell, this paper seeks to answer the question:
what are the fundamental mathematical structures underpinning gradient-
based learning?
Our approach to this question stems from the identification of three funda-
mental aspects of the gradient-descent learning process:
(I) computation is parametric, e.g. in the simplest case we are given a function
f : P × X → Y and learning consists of finding a parameter p : P such

that f (p, −) is the best function according to some criteria. Specifically, the
weights on the internal nodes of a neural network are a parameter which the
learning is seeking to optimize. Parameters also arise elsewhere, e.g. in the
loss function (see later).
(II) information flows bidirectionally: in the forward direction, the computa-
tion turns inputs via a sequence of layers into predicted outputs, and then
into a loss value; in the reverse direction, backpropagation is used to propagate
the changes backwards through the layers, and then turn them into
parameter updates.
(III) the basis of parameter update via gradient descent is differentiation e.g.
in the simple case we differentiate the function mapping a parameter to its
associated loss to reduce that loss.

We model bidirectionality via lenses [6, 12, 29] and based upon the above
three insights, we propose the notion of parametric lens as the fundamental
semantic structure of learning. In a nutshell, a parametric lens is a process with
three kinds of interfaces: inputs, outputs, and parameters. On each interface,
information flows both ways, i.e. computations are bidirectional. These data
are best explained with our graphical representation of parametric lenses, with
inputs A, A′ , outputs B, B ′ , parameters P , P ′ , and arrows indicating information
flow (below left). The graphical notation also makes evident that parametric
lenses are open systems, which may be composed along their interfaces (below
center and right).
[String diagrams (1): left, a single parametric lens with input wires A, A′, output wires B, B′, and parameter wires P, P′; centre, the composite of two parametric lenses joined along their B, B′ wires, with parameters P, P′ and Q, Q′; right, a parametric lens reparameterised along its parameter wires by Q, Q′.]   (1)

This pictorial formalism is not just an intuitive sketch: as we will show, it can
be understood as a completely formal (graphical) syntax using the formalism of
string diagrams [39], in a way similar to how other computational phenomena
have been recently analysed e.g. in quantum theory [14], control theory [5, 8],
and digital circuit theory [26].
It is intuitively clear how parametric lenses express aspects (I) and (II) above,
whereas (III) will be achieved by studying them in a space of ‘differentiable
objects’ (in a sense that will be made precise). The main technical contribution
of our paper is showing how the various ingredients involved in learning (the
model, the optimiser, the error map and the learning rate) can be uniformly
understood as being built from parametric lenses.
We will use category theory as the formal language to develop our notion of
parametric lenses, and make Figure 2 mathematically precise. The categorical
perspective brings several advantages, which are well-known, established princi-
ples in programming language semantics [3,40,49]. Three of them are particularly

[Figure 2 diagram: the model lens (A, A′) → (B, B′) is composed with a loss lens (B, B′) → (L, L′); a learning-rate lens caps the (L, L′) wires, and an optimiser lens is attached to the model's parameter ports (P, P′).]

Fig. 2: The parametric lens that captures the learning process informally sketched
in Figure 1. Note each component is a lens itself, whose composition yields the
interactions described in Figure 1. Defining this picture formally will be the
subject of Sections 3-4.

important to our contribution, as they constitute distinctive advantages of our


semantic foundations:
Abstraction Our approach studies which categorical structures are sufficient
to perform gradient-based learning. This analysis abstracts away from the
standard case of neural networks in several different ways: as we will see, it
encompasses other models (namely Boolean circuits), different kinds of op-
timisers (including Adagrad, Adam, Nesterov momentum), and error maps
(including quadratic and softmax cross entropy loss). These can be all un-
derstood as parametric lenses, and different forms of learning result from
their interaction.
Uniformity As seen in Figure 1, learning involves ingredients that are seem-
ingly quite different: a model, an optimiser, a loss map, etc. We will show
how all these notions may be seen as instances of the categorical defini-
tion of a parametric lens, thus yielding a remarkably uniform description of
the learning process, and supporting our claim of parametric lenses being a
fundamental semantic structure of learning.
Compositionality The use of categorical structures to describe computation
naturally enables compositional reasoning whereby complex systems are anal-
ysed in terms of smaller, and hence easier to understand, components. Com-
positionality is a fundamental tenet of programming language semantics; in
the last few years, it has found application in the study of diverse kinds of
computational models, across different fields— see e.g. [8,14,25,45]. As made
evident by Figure 2, our approach models a neural network as a parametric
lens, resulting from the composition of simpler parametric lenses, capturing
the different ingredients involved in the learning process. Moreover, as all
the simpler parametric lenses are themselves composable, one may engineer
a different learning process by simply plugging a new lens on the left or right
of existing ones. This means that one can glue together smaller and relatively
simple networks to create larger and more sophisticated neural networks.

We now give a synopsis of our contributions:


– In Section 2, we introduce the tools necessary to define our notion of para-
metric lens. First, in Section 2.1, we introduce a notion of parametric cat-
egories, which amounts to a functor Para(−) turning a category C into one
Para(C) of ‘parametric C-maps’. Second, we recall lenses (Section 2.2). In a
nutshell, a lens is a categorical morphism equipped with operations to view
and update values in a certain data structure. Lenses play a prominent role
in functional programming [47], as well as in the foundations of database
theory [31] and more recently game theory [25]. Considering lenses in C sim-
ply amounts to the application of a functorial construction Lens(−), yield-
ing Lens(C). Finally, we recall the notion of a cartesian reverse differential
category (CRDC): a categorical structure axiomatising the notion of differ-
entiation [13] (Section 2.4). We wrap up in Section 2.3, by combining these
ingredients into the notion of parametric lens, formally defined as a morphism
in Para(Lens(C)) for a CRDC C. In terms of our desiderata (I)-(III) above,
note that Para(−) accounts for (I), Lens(−) accounts for (II), and the CRDC
structure accounts for (III).
– As seen in Figure 1, in the learning process there are many components at
work: the model, the optimiser, the loss map, the learning rate, etc.. In Sec-
tion 3, we show how the notion of parametric lens provides a uniform char-
acterisation for such components. Moreover, for each of them, we show how
different variations appearing in the literature become instances of our ab-
stract characterisation. The plan is as follows:
◦ In Section 3.1, we show how the combinatorial model subject of the training
can be seen as a parametric lens. The conditions we provide are met by the
‘standard’ case of neural networks, but also enables the study of learning for
other classes of models. In particular, another instance are Boolean circuits:
learning of these structures is relevant to binarisation [16] and it has been
explored recently using a categorical approach [50], which turns out to be
a particular case of our framework.
◦ In Section 3.2, we show how the loss maps associated with training are also
parametric lenses. Our approach covers the cases of quadratic error, Boolean
error, Softmax cross entropy, but also the ‘dot product loss’ associated with
the phenomenon of deep dreaming [19, 34, 35, 44].
◦ In Section 3.3, we model the learning rate as a parametric lens. This
analysis also allows us to contrast how learning rate is handled in the ‘real-
valued’ case of neural networks with respect to the ‘Boolean-valued’ case of
Boolean circuits.
◦ In Section 3.4, we show how optimisers can be modelled as ‘reparame-
terisations’ of models as parametric lenses. As case studies, in addition to
basic gradient update, we consider the stateful variants: Momentum [37],
Nesterov Momentum [48], Adagrad [20], and Adam (Adaptive Moment Es-
timation) [32]. Also, on Boolean circuits, we show how the reverse derivative
ascent of [50] can be also regarded in such way.
– In Section 4, we study how the composition of the lenses defined in Section 3
yields a description of different kinds of learning processes.

◦ Section 4.1 is dedicated to modelling supervised learning of parameters,


in the way described in Figure 1. This amounts essentially to study of
the composite of lenses expressed in Figure 2, for different choices of the
various components. In particular we look at (i) quadratic loss with basic
gradient descent, (ii) softmax cross entropy loss with basic gradient descent,
(iii) quadratic loss with Nesterov momentum, and (iv) learning in Boolean
circuits with XOR loss and basic gradient ascent.
◦ In order to showcase the flexibility of our approach, in Section 4.2 we de-
part from our ‘core’ case study of parameter learning, and turn attention
to supervised learning of inputs, also called deep dreaming — the idea
behind this technique is that, instead of the network parameters, one up-
dates the inputs, in order to elicit a particular interpretation [19, 34, 35, 44].
Deep dreaming can be easily expressed within our approach, with a differ-
ent rearrangement of the parametric lenses involved in the learning process,
see (8) below. The abstract viewpoint of categorical semantics provides a
mathematically precise and visually captivating description of the differ-
ences between the usual parameter learning process and deep dreaming.
– In Section 5 we describe a proof-of-concept Python implementation, avail-
able at [17], based on the theory developed in this paper. This code is intended
to show more concretely the payoff of our approach. Model architectures, as
well as the various components participating in the learning process, are now
expressed in a uniform, principled mathematical language, in terms of lenses.
As a result, computing network gradients is greatly simplified, as it amounts
to lens composition. Moreover, the modularity of this approach allows one to
more easily tune the various parameters of training.
We show our library via a number of experiments, and prove correctness by
achieving accuracy on par with an equivalent model in Keras, a mainstream
deep learning framework [11]. In particular, we create a working non-trivial
neural network model for the MNIST image-classification problem [33].
– Finally, in Sections 6 and 7, we discuss related and future work.

2 Categorical Toolkit

In this section we describe the three categorical components of our framework,


each corresponding to an aspect of gradient-based learning: (I) the Para con-
struction (Section 2.1), which builds a category of parametric maps, (II) the
Lens construction, which builds a category of “bidirectional” maps (Section
2.2), and (III) the combination of these two constructions into the notion of
“parametric lenses” (Section 2.3). Finally (IV) we recall Cartesian reverse dif-
ferential categories — categories equipped with an abstract gradient operator.

Notation We shall use f ; g for sequential composition of morphisms f : A → B


and g : B → C in a category, 1A for the identity morphism on A, and I for the
unit object of a symmetric monoidal category.

2.1 Parametric Maps

In supervised learning one is typically interested in approximating a function


g : Rn → Rm for some n and m. To do this, one begins by building a neural
network, which is a smooth map f : Rp × Rn → Rm where Rp is the set of
possible weights of that neural network. Then one looks for a value of q ∈ Rp
such that the function f (q, −) : Rn → Rm closely approximates g. We formalise
these maps categorically via the Para construction [9, 23, 24, 30].

Definition 1 (Parametric category). Let (C, ⊗, I) be a strict4 symmetric


monoidal category. We define a category Para(C) with objects those of C, and
a map from A to B a pair (P, f ), with P an object of C and f : P ⊗ A →
B. The composite of maps (P, f ) : A → B and (P ′ , f ′ ) : B → C is the pair
(P ′ ⊗ P, (1P ′ ⊗ f ); f ′ ). The identity on A is the pair (I, 1A ).

Example 1. Take the category Smooth whose objects are natural numbers and
whose morphisms f : n → m are smooth maps from Rn to Rm . As described
above, the category Para(Smooth) can be thought of as a category of neural
networks: a map in this category from n to m consists of a choice of p and a
map f : Rp × Rn → Rm with Rp representing the set of possible weights of the
neural network.
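As a concrete, informal illustration of Definition 1, the following Python sketch (our own; not taken from the implementation described in Section 5) represents a morphism of Para(Smooth) as a pair of a parameter count and a function of (parameters, input), and composes two such maps by concatenating their parameter vectors as in P′ ⊗ P. The names para_compose, layer1 and layer2 are ours.

```python
import numpy as np

def para_compose(Pf, Pg):
    """Composite of (P, f) : A -> B and (P', g) : B -> C as in Definition 1:
    parameter space P' x P, underlying map (1_{P'} x f) ; g."""
    (p, f), (q, g) = Pf, Pg
    def h(params, x):
        q_params, p_params = params[:q], params[q:]   # split P' x P
        return g(q_params, f(p_params, x))
    return (q + p, h)

# A toy example: two affine layers R^2 -> R^3 -> R^1.
layer1 = (3 * 2 + 3, lambda w, x: w[:6].reshape(3, 2) @ x + w[6:])
layer2 = (1 * 3 + 1, lambda w, x: w[:3].reshape(1, 3) @ x + w[3:])
network = para_compose(layer1, layer2)

params = np.random.randn(network[0])
print(network[1](params, np.array([1.0, -2.0])))
```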

As we will see in the next sections, the interplay of the various components
at work in the learning process becomes much clearer once the morphisms
of Para(C) are represented using the pictorial formalism of string diagrams, which we
now recall. In fact, we will mildly massage the traditional notation for string
diagrams (below left), by representing a morphism f : A → B in Para(C) as
below right.
[String diagrams: left, the usual notation for f : P ⊗ A → B with P and A drawn as parallel input wires; right, our notation, with the parameter wire P entering the box vertically.]

This is to emphasise the special role played by P , reflecting the fact that in
machine learning data and parameters have different semantics. String-diagrammatic
notation also allows us to neatly represent composition of maps (P, f ) : A →
B and (P ′ , f ′ ) : B → C (below left), and “reparameterisation” of (P, f ) : A → B
by a map α : Q → P (below right), yielding a new map (Q, (α⊗1A ); f ) : A → B.

[String diagrams (2): left, the composite of (P, f) : A → B and (P′, f′) : B → C, with parameter wires P and P′; right, the reparameterisation of (P, f) : A → B by α : Q → P.]   (2)

4 One can also define Para(C) in the case when C is non-strict; however, the result would be not a category but a bicategory.

Intuitively, reparameterisation changes the parameter space of (P, f ) : A → B to


some other object Q, via some map α : Q → P . We shall see later that gradient
descent and its many variants can naturally be viewed as reparameterisations.
Note coherence rules in combining the two operations in (2) just work as ex-
pected, as these diagrams can be ultimately ‘compiled’ down to string diagrams
for monoidal categories.

2.2 Lenses

In machine learning (or even learning in general) it is fundamental that infor-


mation flows both forwards and backwards: the ‘forward’ flow corresponds to a
model’s predictions, and the ‘backwards’ flow to corrections to the model. The
category of lenses is the ideal setting to capture this type of structure, as it is a
category consisting of maps with both a “forward” and a “backward” part.

Definition 2. For any Cartesian category C, the category of (bimorphic) lenses


in C, Lens(C), is the category with the following data. Objects are pairs (A, A′ )
of objects in C. A map from (A, A′ ) to (B, B ′ ) consists of a pair (f, f ∗ ) where
f : A → B (called the get or forward part of the lens) and f ∗ : A × B ′ →
A′ (called the put or backwards part of the lens). The composite of (f, f ∗ ) :
(A, A′ ) → (B, B ′ ) and (g, g ∗ ) : (B, B ′ ) → (C, C ′ ) is given by get f ; g and put
⟨π0 , ⟨π0 ; f, π1 ⟩; g ∗ ⟩; f ∗ . The identity on (A, A′ ) is the pair (1A , π1 ).
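Definition 2 translates directly into code. The sketch below (our own naming, assuming nothing beyond the get/put formulas above) represents a lens as a pair of functions and composes them accordingly.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Lens:
    get: Callable[[Any], Any]        # f  : A -> B
    put: Callable[[Any, Any], Any]   # f* : A x B' -> A'

def lens_compose(l1: Lens, l2: Lens) -> Lens:
    """(f, f*) ; (g, g*): get is f;g, put sends (a, c') to f*(a, g*(f(a), c'))."""
    return Lens(
        get=lambda a: l2.get(l1.get(a)),
        put=lambda a, c_dash: l1.put(a, l2.put(l1.get(a), c_dash)),
    )

identity_lens = Lens(get=lambda a: a, put=lambda a, b_dash: b_dash)

# Example: the lens (f, R[f]) for f(x) = x^2 in Smooth.
square = Lens(get=lambda x: x * x, put=lambda x, dy: 2 * x * dy)
print(lens_compose(square, square).put(2.0, 1.0))   # derivative of x^4 at x = 2: 32.0
```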

The embedding of Lens(C) into the category of Tambara modules over C


(see [7, Thm. 23]) provides a rich string diagrammatic language, in which lenses
may be represented with forward/backward wires indicating the information
flow. In this language, a morphism (f, f ∗ ) : (A, A′ ) → (B, B ′ ) is written as
below left, which can be ‘expanded’ as below right.

[String diagrams: left, the lens (f, f∗) : (A, A′) → (B, B′) drawn as a single box; right, its expansion into the forward part f : A → B and the backward part f∗ : A × B′ → A′.]

It is clear in this language how to describe the composite of (f, f ∗ ) : (A, A′ ) →


(B, B ′ ) and (g, g ∗ ) : (B, B ′ ) → (C, C ′ ):

[String diagram (3): the composite lens, with the forward parts f and g chained left to right and the backward parts g∗ and f∗ chained right to left.]   (3)

2.3 Parametric Lenses

The fundamental category where supervised learning takes place is the composite
Para(Lens(C)) of the two constructions in the previous sections:

Definition 3. The category Para(Lens(C)) of parametric lenses on C has


as objects pairs (A, A′ ) of objects from C. A morphism from (A, A′ ) to (B, B ′ ),
called a parametric lens5 , is a choice of parameter pair (P, P ′ ) and a lens (f, f ∗ ) :
(P, P ′ ) × (A, A′ ) → (B, B ′ ) so that f : P × A → B and f ∗ : P × A × B ′ → P ′ × A′
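Unfolding Definition 3, a parametric lens is determined by a forward map f : P × A → B and a backward map f∗ : P × A × B′ → P′ × A′. The following sketch (names ours, not the interface of the library in Section 5) shows how such pairs compose, mirroring the middle diagram of (1).

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class ParaLens:
    forward: Callable[[Any, Any], Any]                      # f  : P x A -> B
    backward: Callable[[Any, Any, Any], Tuple[Any, Any]]    # f* : P x A x B' -> P' x A'

def para_lens_compose(pl1: ParaLens, pl2: ParaLens) -> ParaLens:
    """Compose (P, f) : (A, A') -> (B, B') with (Q, g) : (B, B') -> (C, C');
    the composite has parameters (Q, P) and returns both requested changes."""
    def forward(params, a):
        q, p = params
        return pl2.forward(q, pl1.forward(p, a))
    def backward(params, a, c_dash):
        q, p = params
        b = pl1.forward(p, a)
        q_dash, b_dash = pl2.backward(q, b, c_dash)
        p_dash, a_dash = pl1.backward(p, a, b_dash)
        return (q_dash, p_dash), a_dash
    return ParaLens(forward, backward)
```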
String diagrams for parametric lenses are built by simply composing the graph-
ical languages of the previous two sections — see (1), where respectively a mor-
phism, a composition of morphisms, and a reparameterisation are depicted.
Given a generic morphism in Para(Lens(C)) as depicted in (1) on the left,
one can see how it is possible to “learn” new values from f : it takes as input an
input A, a parameter P , and a change B ′ , and outputs a change in A, a value
of B, and a change P ′ . This last element is the key component for supervised
learning: intuitively, it says how to change the parameter values to get the neural
network closer to the true value of the desired function.
The question, then, is how one is to define such a parametric lens given
nothing more than a neural network, i.e., a parametric map (P, f ) : A → B.
This is precisely what the gradient operation provides, and its generalization to
categories is explored in the next subsection.

2.4 Cartesian Reverse Differential Categories

Fundamental to all types of gradient-based learning is, of course, the gradient


operation. In most cases this gradient operation is performed in the category of
smooth maps between Euclidean spaces. However, recent work [50] has shown
that gradient-based learning can also work well in other categories; for example,
in a category of boolean circuits. Thus, to encompass these examples in a single
framework, we will work in a category with an abstract gradient operation.

Definition 4. A Cartesian left additive category [13, Defn. 1] consists of


a category C with chosen finite products (including a terminal object), and an
addition operation and zero morphism in each homset, satisfying various axioms.
A Cartesian reverse differential category (CRDC) [13, Defn. 13] consists
of a Cartesian left additive category C, together with an operation which provides,
for each map f : A → B in C, a map R[f ] : A × B → A satisfying various
axioms.

For f : A → B, the pair (f, R[f ]) forms a lens from (A, A) to (B, B). We
will pursue the idea that R[f ] acts as a backwards map, thus giving a means to
"learn" f .
5 In [23], these are called learners. However, in this paper we study them in a much broader light; see Section 6.

Note that assigning type A×B → A to R[f ] hides some relevant information:
B-values in the domain and A-values in the codomain of R[f ] do not play the
same role as values of the same types in f : A → B: in R[f ], they really take in a
tangent vector at B and output a tangent vector at A (cf. the definition of R[f ]
in Smooth, Example 2 below). To emphasise this, we will type R[f ] as a map
A × B ′ → A′ (even though in reality A = A′ and B = B ′ ), thus meaning that
(f, R[f ]) is actually a lens from (A, A′ ) to (B, B ′ ). This typing distinction will
be helpful later on, when we want to add additional components to our learning
algorithms.
The following two examples of CRDCs will serve as the basis for the learning
scenarios of the upcoming sections.

Example 2. The category Smooth (Example 1) is Cartesian with product given


by addition, and it is also a Cartesian reverse differential category: given a
smooth map f : Rn → Rm , the map R[f ] : Rn × Rm → Rn sends a pair (x, v)
to $J[f]^T(x) \cdot v$: the transpose of the Jacobian of f at x in the direction v. For
example, if f : R2 → R3 is defined as $f(x_1, x_2) := (x_1^3 + 2x_1x_2,\ x_2,\ \sin(x_1))$, then
R[f ] : R2 × R3 → R2 is given by
$$(x, v) \mapsto \begin{pmatrix} 3x_1^2 + 2x_2 & 0 & \cos(x_1) \\ 2x_1 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix}.$$
Using
the reverse derivative (as opposed to the forward derivative) is well-known to be
much more computationally efficient for functions f : Rn → Rm when m ≪ n
(for example, see [28]), as is the case in most supervised learning situations
(where often m = 1).
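For the f of Example 2 one can write R[f] out by hand and check it against the formula above; this is our own numpy illustration, not part of the paper's artifact.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([x1**3 + 2 * x1 * x2, x2, np.sin(x1)])

def R_f(x, v):
    """R[f](x, v) = J[f]^T(x) . v : the transposed Jacobian of f at x, applied to v."""
    x1, x2 = x
    JT = np.array([[3 * x1**2 + 2 * x2, 0.0, np.cos(x1)],
                   [2 * x1,             1.0, 0.0      ]])
    return JT @ v

x, v = np.array([1.0, 2.0]), np.array([1.0, 0.0, 0.0])
print(f(x), R_f(x, v))   # pulls the output direction v back to an input direction
```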

Example 3. Another CRDC is the symmetric monoidal category POLY Z2 [13,


Example 14] with objects the natural numbers and morphisms f : A → B the B-
tuples of polynomials Z2 [x1 . . . xA ]. When presented by generators and relations
these morphisms can be viewed as a syntax for boolean circuits, with parametric
lenses for such circuits (and their reverse derivative) described in [50].

3 Components of learning as Parametric Lenses


As seen in the introduction, in the learning process there are many components
at work: a model, an optimiser, a loss map, a learning rate, etc. In this section
we show how each such component can be understood as a parametric lens.
Moreover, for each component, we show how our framework encompasses several
variations of the gradient-descent algorithms, thus offering a unifying perspective
on many different approaches that appear in the literature.

3.1 Models as Parametric Lenses


We begin by characterising the models used for training as parametric lenses.
In essence, our approach identifies a set of abstract requirements necessary to
perform training by gradient descent, which covers the case studies that we will
consider in the next sections.

The leading intuition is that a suitable model is a parametric map, equipped


with a reverse derivative operator. Using the formal developments of Section 2,
this amounts to assuming that a model is a morphism in Para(C), for a CRDC
C. In order to visualise such morphism as a parametric lens, it then suffices to
apply under Para(−) the canonical morphism R : C → Lens(C) (which exists
for any CRDC C, see [13, Prop. 31]), mapping f to (f, R[f ]). This yields a functor
Para(R) : Para(C) → Para(Lens(C)), pictorially defined as

[String diagram (4): the parametric map (P, f) : A → B on the left is sent to the parametric lens on the right, whose forward part is f and whose backward part is R[f], with parameter wires P, P′.]   (4)

Example 4 (Neural networks). As noted previously, to learn a function of type


Rn → Rm , one constructs a neural network, which can be seen as a function of
type Rp × Rn → Rm where Rp is the space of parameters of the neural network.
As seen in Example 1, this is a map in the category Para(Smooth) of type
Rn → Rm with parameter space Rp . Then one can apply the functor in (4)
to present a neural network together with its reverse derivative operator as a
parametric lens, i.e. a morphism in Para(Lens(Smooth)).
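As a hypothetical instance of (4), the following sketch writes down the parametric lens of a single dense layer with sigmoid activation: dense_forward plays the role of f and dense_backward that of R[f]. The split into (dW, db) and a_dash matches the P′ and A′ outputs of the lens; the function names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward part  f : P x A -> B, with parameters P = (W, b).
def dense_forward(params, a):
    W, b = params
    return sigmoid(W @ a + b)

# Backward part R[f] : P x A x B' -> P' x A'.
def dense_backward(params, a, b_dash):
    W, b = params
    s = sigmoid(W @ a + b)
    dz = s * (1.0 - s) * b_dash      # chain rule through the activation
    dW = np.outer(dz, a)             # requested change in W   (part of P')
    db = dz                          # requested change in b   (part of P')
    a_dash = W.T @ dz                # change passed back to the input (A')
    return (dW, db), a_dash
```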

Example 5 (Boolean circuits). For learning of Boolean circuits as described in


[50], the recipe is the same as in Example 4, except that the base category is
POLYZ2 (see Example 3). The important observation here is that POLY Z2 is a
CRDC, see [13, 50], and thus we can apply the functor in (4).

Note a model/parametric lens f can take as inputs an element of A, an


element of B ′ (a change in B) and a parameter P and outputs an element of
B, a change in A, and a change in P . This is not yet sufficient to do machine
learning! When we perform learning, we want to input a parameter P and a pair
A × B and receive a new parameter P . Instead, f expects a change in B (not an
element of B) and outputs a change in P (not an element of P ). Deep dreaming,
on the other hand, wants to return an element of A (not a change in A). Thus, to
do machine learning (or deep dreaming) we need to add additional components
to f ; we will consider these additional components in the next sections.

3.2 Loss Maps as Parametric Lenses


Another key component of any learning algorithm is the choice of loss map.
This gives a measurement of how far the current output of the model is from
the desired output. In standard learning in Smooth, this loss map is viewed as
a map of type B × B → R. However, in our setup, this is naturally viewed as a

parametric map from B to R with parameter space B.6 We also generalize the
codomain to an arbitrary object L.

Definition 5. A loss map on B consists of a parametric map (B, loss) :


Para(C)(B, L) for some object L.

Note that we can precompose a loss map (B, loss) : B → L with a neural
network (P, f ) : A → B (below left), and apply the functor in (4) (with C =
Smooth) to obtain the parametric lens below right.

[String diagrams (5): left, the composite parametric map f ; loss : A → L, with parameters P and B; right, the corresponding parametric lens, with forward parts f and loss and backward parts R[f] and R[loss].]   (5)

This is getting closer to the parametric lens we want: it can now receive
inputs of type B. However, this is at the cost of now needing an input to L′ ; we
consider how to handle this in the next section.

Example 6 (Quadratic error). In Smooth, the standard loss function on Rb is


quadratic error: it uses L = R and has parametric map e : Rb × Rb → R given
by $e(b_t, b_p) = \frac{1}{2}\sum_{i=1}^{b} ((b_p)_i - (b_t)_i)^2$, where we think of bt as the “true” value and
bp the predicted value. This has reverse derivative R[e] : Rb × Rb × R → Rb × Rb
given by R[e](bt , bp , α) = α · (bp − bt , bt − bp ) — note α suggests the idea of
learning rate, which we will explore in Section 3.3.
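The quadratic error of Example 6, transcribed as a pair of Python functions (our own sketch): the forward part and the reverse derivative, with the component ordering exactly as printed above.

```python
import numpy as np

# Forward part: e(b_t, b_p) = 1/2 * sum_i ((b_p)_i - (b_t)_i)^2, parameter b_t first.
def quadratic_loss(b_t, b_p):
    return 0.5 * np.sum((b_p - b_t) ** 2)

# Reverse derivative, with components ordered as in Example 6:
# R[e](b_t, b_p, alpha) = alpha * (b_p - b_t, b_t - b_p).
def quadratic_loss_rev(b_t, b_p, alpha):
    return alpha * (b_p - b_t), alpha * (b_t - b_p)
```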

Example 7 (Boolean error). In POLYZ2 , the loss function on Zb which is im-


plicitly used in [50] is a bit different: it uses L = Zb and has parametric map
e : Zb × Zb → Zb given by
e(bt , bp ) = bt + bp .
(Note that this is + in Z2 ; equivalently this is given by XOR.) Its reverse deriva-
tive is of type R[e] : Zb × Zb × Zb → Zb × Zb given by R[e](bt , bp , α) = (α, α).
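The Boolean error of Example 7 can be sketched the same way, treating Z2-vectors as numpy arrays of bits (our own rendering).

```python
import numpy as np

# e(b_t, b_p) = b_t + b_p in Z_2, i.e. elementwise XOR.
def xor_loss(b_t, b_p):
    return (b_t + b_p) % 2

# R[e](b_t, b_p, alpha) = (alpha, alpha), as in Example 7.
def xor_loss_rev(b_t, b_p, alpha):
    return alpha, alpha

print(xor_loss(np.array([1, 0, 1]), np.array([1, 1, 0])))   # -> [0 1 1]
```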

Example 8 (Softmax cross entropy). The Softmax cross entropy loss is a Rb-parametric map Rb → R defined by
$e(b_t, b_p) = \sum_{i=1}^{b} (b_t)_i \big((b_p)_i - \log(\mathrm{Softmax}(b_p)_i)\big)$,
where $\mathrm{Softmax}(b_p)_i = \frac{\exp((b_p)_i)}{\sum_{j=1}^{b} \exp((b_p)_j)}$ is defined componentwise for each class i.

We note that, although bt needs to be a probability distribution, at the


moment there is no need to ponder the question of interaction of probability
distributions with the reverse derivative framework: one can simply consider bt
as the image of some logits under the Softmax function.
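A direct transcription of the formula of Example 8, with a numerically stable Softmax; this is our own sketch and the backward part is omitted.

```python
import numpy as np

def softmax(b_p):
    z = np.exp(b_p - np.max(b_p))   # shift for numerical stability
    return z / np.sum(z)

# e(b_t, b_p) = sum_i (b_t)_i * ((b_p)_i - log(Softmax(b_p)_i)), as in Example 8.
def softmax_cross_entropy(b_t, b_p):
    return np.sum(b_t * (b_p - np.log(softmax(b_p))))
```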
6 Here the loss map has its parameter space equal to its input space. However, putting loss maps on the same footing as models lends itself to further generalizations where the parameter space is different, and where the loss map can itself be learned. See Generative Adversarial Networks, [9, Figure 7].

Example 9 (Dot product). In Deep Dreaming (Section 4.2) we often want to focus
only on a particular element of the network output Rb . This is done by supplying
a one-hot vector bt as the ground truth to the loss function e(bt , bp ) = bt ·bp which
computes the dot product of two vectors. If the ground truth vector y is a one-
hot vector (active at the i-th element), then the dot product performs masking of
all inputs except the i-th one. Note the reverse derivative R[e] : Rb × Rb × R →
Rb × Rb of the dot product is defined as R[e](bt , bp , α) = (α · bp , α · bt ).

3.3 Learning Rates as Parametric Lenses


After models and loss maps, another ingredient of the learning process is the
learning rate, which we formalise as follows.
Definition 6. A learning rate α on L consists of a lens from (L, L′ ) to (1, 1)
where 1 is a terminal object in C.
Note that the get component of the learning rate lens must be the unique map
to 1, while the put component is a map L × 1 → L′ ; that is, simply a map
α∗ : L → L′ . Thus we can view α as a parametric lens from (L, L′ ) → (1, 1)
(with trivial parameter space) and compose it in Para(Lens(C)) with a model
and a loss map (cf. (5)) to get
[String diagram (6): the model lens and the loss lens composed as in (5), with the learning-rate lens α capping the (L, L′) wires on the right.]   (6)

Example 10. In standard supervised learning in Smooth, one fixes some ϵ > 0
as a learning rate, and this is used to define α: α is simply constantly −ϵ, i.e.,
α(l) = −ϵ for any l ∈ L.
Example 11. In supervised learning in POLY Z2 , the standard learning rate is
quite different: for a given L it is defined as the identity function, α(l) = l.
Other learning rate morphisms are possible as well: for example, one could
fix some ϵ > 0 and define a learning rate in Smooth by α(l) = −ϵ · l. Such a
choice would take into account how far away the network is from its desired goal
and adjust the learning rate accordingly.
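As a sketch, a learning rate can be represented as a pair (get, put) with get the unique map to the terminal object and put the map α∗ : L → L′; the constructors below are illustrative only, not library code.

def constant_rate(eps):
    get = lambda l: ()                  # the unique map into the terminal object 1
    put = lambda l, unit=(): -eps       # alpha(l) = -eps, the Smooth convention
    return get, put

def identity_rate():
    get = lambda l: ()
    put = lambda l, unit=(): l          # alpha(l) = l, the POLY_Z2 convention
    return get, put

def scaled_rate(eps):
    get = lambda l: ()
    put = lambda l, unit=(): -eps * l   # alpha(l) = -eps * l, a loss-dependent rate
    return get, put

_, put = constant_rate(0.01)
print(put(0.37))                        # -0.01, independent of the loss value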

3.4 Optimisers as Reparameterisations


In this section we consider how to implement gradient descent (and its variants)
into our framework. To this aim, note that the parametric lens (f, R[f ]) rep-
resenting our model (see (4)) outputs a P ′ , which represents a change in the
parameter space. Now, we would like to receive not just the requested change
in the parameter, but the new parameter itself. This is precisely what gradient
descent accomplishes, when formalised as a lens.

Definition 7. In any CRDC C we can define gradient update as a map G in


Lens(C) from (P, P ) to (P, P ′ ) consisting of (G, G∗ ) : (P, P ) → (P, P ′ ), where
G(p) = p and G∗(p, p′) = p + p′.⁷

Intuitively, such a lens allows one to receive the requested change in parameter
and implement that change by adding that value to the current parameter. By its
type, we can now “plug” the gradient descent lens G : (P, P ) → (P, P ′ ) above the
model (f, R[f ]) in (4) — formally, this is accomplished as a reparameterisation
of the parametric morphism (f, R[f ]), cf. Section 2.1. This gives us Figure 3
(left).

[String diagrams: a model whose vertical P/P′ ports are reparameterised, on the left by the gradient-update lens (+) and on the right by a generic stateful optimiser with state object S, so the remaining parameter ports are P/P (left) and S × P/S × P (right).]

Fig. 3: Model reparameterised by basic gradient descent (left) and a generic stateful optimiser (right).

Example 12 (Gradient update in Smooth). In Smooth, the gradient descent repa-


rameterisation will take the output from P ′ and add it to the current value of
P to get a new value of P .

Example 13 (Gradient update in Boolean circuits). In the CRDC POLY Z2 , the


gradient descent reparameterisation will again take the output from P ′ and
add it to the current value of P to get a new value of P ; however, since + in
Z2 is the same as XOR, this can be also be seen as taking the XOR of the
current parameter and the requested change; this is exactly how this algorithm
is implemented in [50].
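A sketch of the gradient update lens of Definition 7 in both settings; the helper names are ours.

import numpy as np

def gd_get(p):
    return p                             # get is the identity

def gd_put(p, p_change):
    return p + p_change                  # put adds the requested change

p = np.array([0.5, -1.0])
print(gd_put(p, np.array([0.1, 0.2])))   # [0.6 -0.8]

# In POLY_Z2 the same "+" is XOR on the parameters:
pb = np.array([1, 0, 1], dtype=np.uint8)
print(np.bitwise_xor(pb, np.array([1, 1, 0], dtype=np.uint8)))   # [0 1 1]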

Other variants of gradient descent also fit naturally into this framework by
allowing for additional input/output data with P . In particular, many of them
keep track of the history of previous updates and use that to inform the next one.
This is easy to model in our setup: instead of asking for a lens (P, P ) → (P, P ′ ),
we ask instead for a lens (S ×P, S ×P ) → (P, P ′ ) where S is some “state” object.
⁷ Note that, as in the discussion in Section 2.4, we are implicitly assuming that P = P′;
we have merely notated them differently to emphasize the different “roles” they play
(the first P can be thought of as “points”, the second as “vectors”).

Definition 8. A stateful parameter update consists of a choice of object S


(the state object) and a lens U : (S × P, S × P ) → (P, P ′ ).
Again, we view this optimiser as a reparameterisation which may be “plugged
in” a model as in Figure 3 (right). Let us now consider how several well-known
optimisers can be implemented in this way.
Example 14 (Momentum). In the momentum variant of gradient descent, one
keeps track of the previous change and uses this to inform how the current
parameter should be changed. Thus, in this case, we set S = P , fix some γ >
0, and define the momentum lens (U, U∗) : (P × P, P × P) → (P, P′) by
U (s, p) = p and U ∗ (s, p, p′ ) = (s′ , p + s′ ), where s′ = −γs + p′ . Note momentum
recovers gradient descent when γ = 0.
In both standard gradient descent and momentum, our lens representation
has trivial get part. However, as soon as we move to more complicated variants,
this is not anymore the case, as for instance in Nesterov momentum below.
Example 15 (Nesterov momentum). In Nesterov momentum, one uses the mo-
mentum from previous updates to tweak the input parameter supplied to the
network. We can precisely capture this by using a small variation of the lens in
the previous example. Again, we set S = P , fix some γ > 0, and define the Nes-
terov momentum lens (U, U ∗ ) : (P × P, P × P ) → (P, P ′ ) by U (s, p) = p + γs
and U ∗ as in the previous example.
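A sketch of the momentum and Nesterov momentum lenses, with state S = P; the only difference between the two is the non-trivial get of the latter. The names are illustrative only.

import numpy as np

gamma = 0.9

def momentum_get(s, p):
    return p                              # trivial get

def momentum_put(s, p, p_change):
    s_new = -gamma * s + p_change         # s' = -gamma*s + p'
    return s_new, p + s_new               # (s', p + s')

def nesterov_get(s, p):
    return p + gamma * s                  # "lookahead" using the state

s, p = np.zeros(2), np.array([1.0, 2.0])
s, p = momentum_put(s, p, np.array([0.1, -0.1]))
print(s, p)                # [ 0.1 -0.1] [1.1 1.9]
print(nesterov_get(s, p))  # [1.19 1.81]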
Example 16 (Adagrad). Given any fixed ϵ > 0 and δ ∼ 10⁻⁷, Adagrad [20] is given by S = P, with the lens whose get part is (g, p) ↦ p. The put is (g, p, p′) ↦ (g′, p + (ϵ / (δ + √g′)) ⊙ p′), where g′ = g + p′ ⊙ p′ and ⊙ is the elementwise (Hadamard) product. Unlike other optimisation algorithms, where the learning rate is the same for all parameters, Adagrad divides the learning rate of each individual parameter by the square root of the past accumulated gradients.
Example 17 (Adam). Adaptive Moment Estimation (Adam) [32] is another method that computes adaptive learning rates for each parameter by storing exponentially decaying averages of past gradients (m) and past squared gradients (v). For fixed β1, β2 ∈ [0, 1), ϵ > 0, and δ ∼ 10⁻⁸, Adam is given by S = P × P, with the lens whose get part is (m, v, p) ↦ p and whose put part is put(m, v, p, p′) = (m̂′, v̂′, p + (ϵ / (δ + √v̂′)) ⊙ m̂′), where m′ = β1·m + (1 − β1)·p′, v′ = β2·v + (1 − β2)·p′², m̂′ = m′/(1 − β1^t), and v̂′ = v′/(1 − β2^t).
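Both optimisers above can be sketched directly as stateful put maps; the names and the fixed time step t below are illustrative assumptions, not part of the formal development.

import numpy as np

eps, delta = 0.01, 1e-7

def adagrad_put(g, p, p_change):
    g_new = g + p_change * p_change                      # accumulate squared gradients
    return g_new, p + (eps / (delta + np.sqrt(g_new))) * p_change

beta1, beta2, delta_adam, t = 0.9, 0.999, 1e-8, 1        # t would count update steps

def adam_put(m, v, p, p_change):
    m_new = beta1 * m + (1 - beta1) * p_change
    v_new = beta2 * v + (1 - beta2) * p_change ** 2
    m_hat = m_new / (1 - beta1 ** t)
    v_hat = v_new / (1 - beta2 ** t)
    return m_hat, v_hat, p + (eps / (delta_adam + np.sqrt(v_hat))) * m_hat

p = np.array([0.5, -0.5])
print(adagrad_put(np.zeros(2), p, np.array([0.2, -0.1])))
print(adam_put(np.zeros(2), np.zeros(2), p, np.array([0.2, -0.1])))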

Note that, so far, optimisers/reparameterisations have been attached to the P/P′ wires in order to change the model’s parameters (Fig. 3). In Section 4.2 we will study them on the A/A′ wires instead, giving deep dreaming.

4 Learning with Parametric Lenses


In the previous section we have seen how all the components of learning can be
modeled as parametric lenses. We now study how all these components can be

put together to form supervised learning systems. In addition to studying the


most common examples of supervised learning: systems that learn parameters,
we also study different kinds of systems: those that learn their inputs. This is a
technique commonly known as deep dreaming, and we present it as a natural
counterpart of supervised learning of parameters.
Before we describe these systems, it will be convenient to represent all the
inputs and outputs of our parametric lenses as parameters. In (6), we see the
P/P ′ and B/B ′ inputs and outputs as parameters; however, the A/A′ wires are
not. To view the A/A′ inputs as parameters, we compose that system with the
parametric lens η we now define. The parametric lens η has the type (1, 1) →
(A, A′ ) with parameter space (A, A′ ) defined by (getη = 1A , putη = π1 ) and can
be depicted graphically as a bent wire turning the vertical parameter wire A into the horizontal output wire A, with the backward wire A′ passed back up along the parameter direction. Composing η with the rest of the learning
system in (6) gives us the closed parametric lens

[Diagram (7): the composite of η, the model, the loss map, and α; all of A/A′, P/P′, and B/B′ now appear as vertical parameter wires, and the horizontal interface is trivial.]

This composite is now a map in Para(Lens(C)) from (1, 1) to (1, 1); all its inputs
and outputs are now vertical wires, i.e., parameters. Unpacking it further, this is
a lens of type (A × P × B, A′ × P ′ × B ′ ) → (1, 1) whose get map is the terminal
map, and whose put map is of the type A × P × B → A′ × P ′ × B ′ . It can be
unpacked as the composite put(a, p, bt ) = (a′ , p′ , b′t ), where

bp = f (p, a) (b′t , b′p ) = R[loss](bt , bp , α(loss(bt , bp ))) (p′ , a′ ) = R[f ](p, a, b′p ).

In the next two sections we consider further additions to the image above which
correspond to different types of supervised learning.

4.1 Supervised Learning of Parameters

The most common type of learning performed on (7) is supervised learning of


parameters. This is done by reparameterising (cf. Section 2.1) the image in the
following manner. The parameter ports are reparameterised by one of the (pos-
sibly stateful) optimisers described in the previous section, while the backward
wires A′ of inputs and B ′ of outputs are discarded. This finally yields the com-
plete picture of a system which learns the parameters in a supervised manner:

[Diagram: the closed parametric lens of (7) with its P/P′ ports reparameterised by a (possibly stateful) optimiser with state S, and with the backward wires A′ and B′ discarded; the remaining vertical wires are A, B, S × P and S × P.]

Fixing a particular optimiser (U, U∗) : (S × P, S × P) → (P, P′), we again


unpack the entire construction. This is a map in Para(Lens(C)) from (1, 1) to


(1, 1) whose parameter space is (A × S × P × B, S × P ). In other words, this
is a lens of type (A × S × P × B, S × P ) → (1, 1) whose get component is the
terminal map. Its put map has the type A × S × P × B → S × P and unpacks
to put(a, s, p, bt ) = U ∗ (s, p, p′ ), where

p̄ = U(s, p)      bp = f(p̄, a)
(b′t, b′p) = R[loss](bt, bp, α(loss(bt, bp)))      (p′, a′) = R[f](p̄, a, b′p).

While this formulation might seem daunting, we note that it just explicitly
specifies the computation performed by a supervised learning system. The vari-
able p̄ represents the parameter supplied to the network by the stateful gradient
update rule (in many cases this is equal to p); bp represents the prediction of
the network (contrast this with bt which represents the ground truth from the
dataset). Variables with a tick ′ represent changes: b′p and b′t are the changes
on predictions and true values respectively, while p′ and a′ are changes on the
parameters and inputs. Furthermore, this arises automatically out of the rule for
lens composition (3); what we needed to specify is just the lenses themselves.
We justify and illustrate our approach on a series of case studies drawn from
the literature. This presentation has the advantage of treating all these instances
uniformly in terms of basic constructs, highlighting their similarities and differ-
ences. First, we fix some parametric map (Rp , f ) : Para(Smooth)(Ra , Rb ) in
Smooth and the constant negative learning rate α : R (Example 10). We then
vary the loss function and the gradient update, seeing how the put map above
reduces to many of the known cases in the literature.

Example 18 (Quadratic error, basic gradient descent). Fix the quadratic error
(Example 6) as the loss map and basic gradient update (Example 12). Then the
aforementioned put map simplifies. Since there is no state, its type reduces to
A × P × B → P , and we have put(a, p, bt ) = p + p′ , where (p′ , a′ ) = R[f ](p, a, α ·
(f (p, a) − bt )). Note that α here is simply a constant, and due to the linearity
of the reverse derivative (Def 4), we can slide the α from the costate into the
basic gradient update lens. Rewriting this update, and performing this sliding we
obtain a closed form update step put(a, p, bt ) = p+α·(R[f ](p, a, f (p, a)−bt ); π0 ),

where the negative descent component of gradient descent is here contained in


the choice of the negative constant α.
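A minimal sketch of this update for a one-parameter linear model f(p, a) = p · a, whose reverse derivative is R[f](p, a, c) = (a · c, p · c); everything here is a toy instance, not library code.

alpha = -0.1                         # constant negative learning rate

def f(p, a):
    return p * a

def f_rev(p, a, c):
    return a * c, p * c              # (p', a')

def put(a, p, b_true):
    b_pred = f(p, a)
    p_change, _ = f_rev(p, a, alpha * (b_pred - b_true))
    return p + p_change

print(put(2.0, 0.5, 3.0))            # 0.5 + (-0.1) * 2.0 * (1.0 - 3.0) = 0.9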

This example gives us a variety of regression algorithms solved iteratively


by gradient descent: it embeds some parametric map (Rp , f ) : Ra → Rb into the
system which performs regression on input data, where a denotes the input to
the model and bt denotes the ground truth. If the corresponding f is linear and
b = 1, we recover simple linear regression with gradient descent. If the codomain
is multi-dimensional, i.e. we are predicting multiple scalars, then we recover
multivariate linear regression. Likewise, we can model a multi-layer perceptron or
even more complex neural network architectures performing supervised learning
of parameters simply by changing the underlying parametric map.

Example 19 (Softmax cross entropy, basic gradient descent). Fix Softmax cross
entropy (Example 8) as the loss map and basic gradient update (Example 12).
Again the put map simplifies. The type reduces to A × P × B → P and we have
put(a, p, bt ) = p + p′ where (p′ , a′ ) = R[f ](p, a, α · (Softmax(f (p, a)) − bt )). The
same rewriting performed on the previous example can be done here.

This example recovers logistic regression, i.e., classification.

Example 20 (Mean squared error, Nesterov Momentum). Fix the quadratic error
(Example 6) as the loss map and Nesterov momentum (Example 15) as the
gradient update. This time the put map A × S × P × B → S × P does not have a
simplified type. The implementation of put reduces to put(a, s, p, bt ) = (s′ , p+s′ ),
where p̄ = p + γs, (p′, a′) = R[f](p̄, a, α · (f(p̄, a) − bt)), and s′ = −γs + p′.

This example with Nesterov momentum differs in two key points from all
the other ones: i) the optimiser is stateful, and ii) its get map is not trivial.
While many other optimisers are stateful, the non-triviality of the get map here
showcases the importance of lenses. They allow us to make precise the notion of
computing a “lookahead” value for Nesterov momentum, something that is in
practice usually handled in ad-hoc ways. Here, the algebra of lens composition
handles this case naturally by using the get map, a seemingly trivial, unused
piece of data for previous optimisers.
Our last example, using a different base category POLY Z2 , shows that our
framework captures learning in not just continuous, but discrete settings too.
Again, we fix a parametric map (Zp , f ) : POLYZ2 (Za , Zb ) but this time we fix
the identity learning rate (Example 11), instead of a constant one.

Example 21 (Basic learning in Boolean circuits). Fix XOR as the loss map (Ex-
ample 7) and the basic gradient update (Example 13). The put map again
simplifies. The type reduces to A × P × B → P and the implementation to
put(a, p, bt ) = p + p′ where (p′ , a′ ) = R[f ](p, a, f (p, a) + bt ).

A sketch of learning iteration. Having described a number of examples in


supervised learning, we outline how to model learning iteration in our framework.
Recall the aforementioned put map whose type is A × P × B → P (for simplicity

here modelled without state S). This map takes an input-output pair (a0 , b0 ),
the current parameter pi and produces an updated parameter pi+1 . At the next
time step, it takes a potentially different input-output pair (a1 , b1 ), the updated
parameter pi+1 and produces pi+2 . This process is then repeated. We can model
this iteration as a composition of the put map with itself, as a composite (A ×
put × B); put whose type is A × A × P × B × B → P . This map takes two input-
output pairs A × B, a parameter and produces a new parameter by processing
these datapoints in sequence. One can see how this process can be iterated any
number of times, and even represented as a string diagram.
But we note that with a slight reformulation of the put map, it is possible
to obtain a conceptually much simpler definition. The key insight lies in seeing
that the map put : A × P × B → P is essentially an endo-map P → P with some
extra inputs A × B; it’s a parametric map!
In other words, we can recast the put map as a parametric map (A × B, put) :
Para(C)(P, P ). Being an endo-map, it can be composed with itself. The resulting
composite is an endo-map taking two “parameters”: input-output pair at the
time step 0 and time step 1. This process can then be repeated, with Para
composition automatically taking care of the algebra of iteration.

[Diagram: n copies of the parametric map put : P → P composed end to end, each copy taking its own A × B parameter wire.]

This reformulation captures the essence of parameter iteration: one can think
of it as a trajectory pi , pi+1 , pi+2 , ... through the parameter space; but it is a
trajectory parameterised by the dataset. With different datasets the algorithm
will take a different path through this space and learn different things.
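A sketch of this iteration for the toy model of Example 18, composing put with itself once per data point; the dataset and constants are made up for illustration.

alpha = -0.1

def f(p, a):
    return p * a

def f_rev(p, a, c):
    return a * c, p * c

def put(a, p, b_true):
    p_change, _ = f_rev(p, a, alpha * (f(p, a) - b_true))
    return p + p_change

dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # pairs (a, b_true) with slope 2
p = 0.0
for a, b_true in dataset:                        # the trajectory p0, p1, p2, ...
    p = put(a, p, b_true)
print(p)                                         # already close to the true slope 2.0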

4.2 Deep Dreaming: Supervised Learning of Inputs

We have seen that reparameterising the parameter port with gradient descent
allows us to capture supervised parameter learning. In this section we describe
how reparameterising the input port provides us with a way to enhance an input
image to elicit a particular interpretation. This is the idea behind the technique
called Deep Dreaming, appearing in the literature in many forms [19, 34, 35, 44].

[Diagram (8): the closed parametric lens with its A/A′ ports reparameterised by a (possibly stateful) optimiser with state S, while the backward wires P′ and B′ are discarded; the remaining vertical wires are S × A, S × A, P and B.]

Deep dreaming is a technique which uses the parameters p of some trained


classifier network to iteratively dream up, or amplify some features of a class b on
a chosen input a. For example, if we start with an image of a landscape a0 , a label
b of a “cat” and a parameter p of a sufficiently well-trained classifier, we can start
performing “learning” as usual: computing the predicted class for the landscape
a0 for the network with parameters p, and then computing the distance between
the prediction and our label of a cat b. When performing backpropagation, the
respective changes computed for each layer tell us how the activations of that
layer should have been changed to be more “cat” like. This includes the first
(input) layer of the landscape a0 . Usually, we discard these changes and apply
gradient update to the parameters. In deep dreaming we discard the parameters
and apply gradient update to the input (see (8)). Gradient update here takes these
changes and computes a new image a1 which is the same image of the landscape,
but changed slightly so to look more like whatever the network thinks a cat looks
like. This is the essence of deep dreaming, where iteration of this process allows
networks to dream up features and shapes on a particular chosen image [1].
Just like in the previous subsection, we can write this deep dreaming system
as a map in Para(Lens(C)) from (1, 1) to (1, 1) whose parameter space is (S×A×
P ×B, S ×A). In other words, this is a lens of type (S ×A×P ×B, S ×A) → (1, 1)
whose get map is trivial. Its put map has the type S × A × P × B → S × A
and unpacks to put(s, a, p, bt) = U∗(s, a, a′), where ā = U(s, a), bp = f(p, ā),
(b′t, b′p) = R[loss](bt, bp, α(loss(bt, bp))), and (p′, a′) = R[f](p, ā, b′p).
We note that deep dreaming is usually presented without any loss function as
a maximisation of a particular activation in the last layer of the network output
[44, Section 2.]. This maximisation is done with gradient ascent, as opposed to
gradient descent. However, this is just a special case of our framework where
the loss function is the dot product (Example 9). The choice of the particular
activation is encoded as a one-hot vector, and the loss function in that case
essentially masks the network output, leaving active only the particular chosen
activation. The final component is the gradient ascent: this is simply recovered
by choosing a positive, instead of a negative learning rate [44]. We explicitly
unpack this in the following example.
Example 22 (Deep dreaming, dot product loss, basic gradient update). Fix Smooth
as base category, a parametric map (Rp , f ) : Para(Smooth)(Ra , Rb ), the dot
product loss (Example 9), basic gradient update (Example 12), and a positive
learning rate α : R. Then the above put map simplifies. Since there is no state, its
type reduces to A × P × B → A and its implementation to put(a, p, bt ) = a + a′ ,
where (p′ , a′ ) = R[f ](p, a, α · bt ). Like in Example 18, this update can be rewrit-
ten as put(a, p, bt ) = a + α · (R[f ](p, a, bt ); π1 ), making a few things apparent.
This update does not depend on the prediction f (p, a): no matter what the net-
work has predicted, the goal is always to maximize particular activations. Which
activations? The ones chosen by bt . When bt is a one-hot vector, this picks out
the activation of just one class to maximize, which is often done in practice.
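A sketch of this deep dreaming step for a fixed linear "classifier" f(p, a) = p · a with reverse derivative (c ⊗ a, pᵀc); the parameters, input, and target below are arbitrary illustrations, not a trained network.

import numpy as np

alpha = 0.1                                   # positive learning rate: gradient ascent

def f_rev(p, a, c):
    return np.outer(c, a), p.T @ c            # reverse derivative of a -> p @ a

def dream_step(a, p, b_target):
    _, a_change = f_rev(p, a, alpha * b_target)
    return a + a_change                       # update the input, not the parameters

p = np.array([[1.0, -1.0], [0.5, 2.0]])       # fixed, "trained" parameters
a = np.array([0.2, 0.3])                      # the image we dream on
b_target = np.array([0.0, 1.0])               # one-hot: amplify the second activation
for _ in range(3):
    a = dream_step(a, p, b_target)
print(a)                                      # the input drifts to raise that activation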
While we present only the most basic image, there is plenty of room left
for exploration. The work of [44, Section 2.] adds an extra regularization term

to the image. In general, the neural network f is sometimes changed to copy


a number of internal activations which are then exposed on the output layer.
Maximizing all these activations often produces more visually appealing results.
In the literature we did not find an example which uses the Softmax-cross entropy
(Example 8) as a loss function in deep dreaming, which seems like the more
natural choice in this setting. Furthermore, while deep dreaming commonly uses
basic gradient descent, there is nothing preventing the use of any of the optimiser
lenses discussed in the previous section, or even doing deep dreaming in the
context of Boolean circuits. Lastly, learning iteration, which was described at
the end of the previous subsection, can be modelled here in an analogous way.

5 Implementation

We provide a proof-of-concept implementation as a Python library — full usage


examples, source code, and experiments can be found at [17]. We demonstrate
the correctness of our library empirically using a number of experiments im-
plemented both in our library and in Keras [11], a popular framework for deep
learning. For example, one experiment is a model for the MNIST image clas-
sification problem [33]: we implement the same model in both frameworks and
achieve comparable accuracy. Note that despite similarities between the user in-
terfaces of our library and of Keras, a model in our framework is constructed
as a composition of parametric lenses. This is fundamentally different to the
approach taken by Keras and other existing libraries, and highlights how our
proposed algebraic structures naturally guide programming practice.
In summary, our implementation demonstrates the advantages of our ap-
proach. Firstly, computing the gradients of the network is greatly simplified
through the use of lens composition. Secondly, model architectures can be ex-
pressed in a principled, mathematical language: as morphisms of a monoidal
category. Finally, the modularity of our approach makes it easy to see how var-
ious aspects of training can be modified: for example, one can define a new
optimization algorithm simply by defining an appropriate lens. We now give a
brief sketch of our implementation.

5.1 Constructing a Model with Lens and Para

We model a lens (f, f ∗ ) in our library with the Lens class, which consists of a
pair of maps fwd and rev corresponding to f and f ∗ , respectively. For example,
we write the identity lens (1A , π2 ) as follows:
identity = Lens(lambda x: x, lambda x_dy: x_dy[1])

The composition (in diagrammatic order) of Lens values f and g is written


f >> g, and monoidal composition as f @ g. Similarly, the type of Para maps
is modeled by the Para class, with composition and monoidal product written
the same way. Our library provides several primitive Lens and Para values.
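For intuition, the following is a minimal sketch of such a Lens class together with sequential composition; it uses the standard lens composition rule and is not the library's actual implementation.

class Lens:
    # A lens is a pair of maps fwd : A -> B and rev : A x B' -> A'.
    def __init__(self, fwd, rev):
        self.fwd, self.rev = fwd, rev

    def __rshift__(self, other):              # diagrammatic composition f >> g
        fwd = lambda x: other.fwd(self.fwd(x))
        rev = lambda x_dz: self.rev(
            (x_dz[0], other.rev((self.fwd(x_dz[0]), x_dz[1]))))
        return Lens(fwd, rev)

identity = Lens(lambda x: x, lambda x_dy: x_dy[1])
double = Lens(lambda x: 2 * x, lambda x_dy: 2 * x_dy[1])  # rev of x -> 2x sends dy to 2*dy

composed = double >> identity
print(composed.fwd(3))        # 6
print(composed.rev((3, 1)))   # 2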

Let us now see how to construct a single layer neural network from the com-
position of such primitives. Diagrammatically, we wish to construct the following
model, representing a single ‘dense’ layer of a neural network:

[Diagram (9): the composite linear ; bias ; activation from Ra to Rb, where linear has parameter wire Rb×a, bias has parameter wire Rb, and activation has trivial parameter space.]
Here, the parameters of linear are the coefficients of a b × a matrix, and the
underlying lens has as its forward map the function (M, x) → M · x, where M is
the b × a matrix whose coefficients are the Rb×a parameters, and x ∈ Ra is the
input vector. The bias map is even simpler: the forward map of the underlying
lens is simply pointwise addition of inputs and parameters: (b, x) → b+x. Finally,
the activation map simply applies a nonlinear function (e.g., sigmoid) to the
input, and thus has the trivial (unit) parameter space. The representation of
this composition in code is straightforward: we can simply compose the three
primitive Para maps as in (9):
def dense(a, b, activation):
    return linear(a, b) >> bias(b) >> activation

Note that by constructing model architectures in this way, the computation


of reverse derivatives is greatly simplified: we obtain the reverse derivative ‘for
free’ as the put map of the model. Furthermore, adding new primitives is also
simplified: the user need simply provide a function and its reverse derivative in
the form of a Para map. Finally, notice also that our approach is truly composi-
tional: we can define a hidden layer neural network with n hidden units simply
by composing two dense layers, as follows:
dense(a, n, activation) >> dense(n, b, activation)

5.2 Learning
Now that we have constructed a model, we also need to use it to learn from
data. Concretely, we will construct a full parametric lens as in Figure 2 then
extract its put map to iterate over the dataset.
By way of example, let us see how to construct the following parametric lens,
representing basic gradient descent over a single layer neural network with a
fixed learning rate:
[Diagram (10): the dense model composed with the loss lens and the constant learning rate ϵ, with its P/P′ ports reparameterised by basic gradient descent and the backward wires A′ and B′ discarded.]

This morphism is constructed essentially as below, where apply_update(α,


f ) represents the ‘vertical stacking’ of α atop f :
apply_update(basic_update, dense) >> loss >> learning_rate(ϵ)

Now, given the parametric lens of (10), one can construct a morphism step :
B ×P ×A → P which is simply the put map of the lens. Training the model then
consists of iterating the step function over dataset examples (x, y) ∈ A×B to op-
timise some initial choice of parameters θ0 ∈ P , by letting θi+1 = step(yi , θi , xi ).
Note that our library also provides a utility function to construct step from
its various pieces:
step = supervised_step(model, update, loss, learning_rate)
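Around such a step function one would then write an ordinary training loop; the sketch below assumes only that step has the shape B × P × A → P described above, and the stand-in toy_step is a toy model of ours, not the library's.

def train(step, theta0, dataset, epochs=10):
    theta = theta0
    for _ in range(epochs):
        for x, y in dataset:              # x in A, y in B
            theta = step(y, theta, x)     # theta_{i+1} = step(y_i, theta_i, x_i)
    return theta

def toy_step(y, theta, x):                # a stand-in for the composed put map
    return theta + 0.1 * x * (y - theta * x)

dataset = [(1.0, 2.0), (2.0, 4.0)]
print(train(toy_step, 0.0, dataset))      # approaches the slope 2.0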

For an end-to-end example of model training and iteration, we refer the


interested reader to the experiments accompanying the code [17].

6 Related Work
The work [23] is closely related to ours, in that it provides an abstract categorical
model of backpropagation. However, it differs in a number of key aspects. We
give a complete lens-theoretic explanation of what is back-propagated via (i)
the use of CRDCs to model gradients; and (ii) the Para construction to model
parametric functions and parameter update. We thus can go well beyond [23]
in terms of examples: their example of smooth functions and basic gradient
descent is covered in our subsection 4.1.
We also explain some of the constructions of [23] in a more structured way.
For example, rather than considering the category Learn of [23] as primitive,
here we construct it as a composite of two more basic constructions (the Para
and Lens constructions). This flexibility could be used, for example, to com-
positionally replace Para with a variant allowing parameters to come from a
different category, or lenses with the category of optics [38] enabling us to model
things such as control flow using prisms.
One more relevant aspect is functoriality. We use a functor to augment a
parametric map with its backward pass, just like [23]. However, they additionally
augmented this map with a loss map and gradient descent using a functor as
well. This added extra conditions on the partial derivatives of the loss function:
it needed to be invertible in the 2nd variable. This constraint was not justified
in [23], nor is it a constraint that appears in machine learning practice. This led
us to reexamine their constructions, coming up with our reformulation that does
not require it. While loss maps and optimisers are mentioned in [23] as parts of
the aforementioned functor, here they are extracted out and play a key role: loss
maps are parametric lenses and optimisers are reparameterisations. Thus, in this
paper we instead use Para-composition to add the loss map to the model, and
Para 2-cells to add optimisers. The mentioned inverse of the partial derivative
of the loss map in the 2nd variable was also hypothesised to be relevant to deep
dreaming. We have investigated this possibility thoroughly in our paper, showing

it is gradient update which is used to dream up pictures. We also correct a small


issue in Theorem III.2 of [23]. There, the morphisms of Learn were defined up to
an equivalence (pg. 4 of [23]) but, unfortunately, the functor defined in Theorem
III.2 does not respect this equivalence relation. Our approach instead uses 2-cells
which comes from the universal property of Para — a 2-cell from (P, f ) : A → B
to (Q, g) : A → B is a lens, and hence has two components: a map α : Q → P
and α∗ : Q × P → Q. By comparison, we can see the equivalence relation of [23]
as being induced by map α : Q → P , and not a lens. Our approach highlights
the importance of the 2-categorical structure of learners. In addition, it does not
treat the functor Para(C) → Learn as a primitive. In our case, this functor
has the type Para(C) → Para(Lens(C)) and arises from applying Para to a
canonical functor C → Lens(C) existing for any reverse derivative category, not
just Smooth. Lastly, in our paper we took advantage of the graphical calculus
for Para, redrawing many diagrams appearing in [23] in a structured way.
Other than [23], there are a few more relevant papers. The work of [18] con-
tains a sketch of some of the ideas this paper evolved from. They are based
on the interplay of optics with parameterisation, albeit framed in the setting of
diffeological spaces, and requiring cartesian and local cartesian closed structure
on the base category. Lenses and Learners are studied in the eponymous work
of [22] which observes that learners are parametric lenses. They do not explore
any of the relevant Para or CRDC structure, but make the distinction between
symmetric and asymmetric lenses, studying how they are related to learners de-
fined in [23]. A lens-like implementation of automatic differentiation is the focus
of [21], but learning algorithms aren’t studied. A relationship between category-
theoretic perspective on probabilistic modeling and gradient-based optimisation
is studied in [42] which also studies a variant of the Para construction. Usage of
Cartesian differential categories to study learning is found in [46]. They extend
the differential operator to work on stateful maps, but do not study lenses, pa-
rameterisation nor update maps. The work of [24] studies deep learning in the
context of Cycle-consistent Generative Adversarial Networks [51] and formalises
it via free and quotient categories, making parallels to the categorical formula-
tions of database theory [45]. They do use the Para construction, but do not
relate it to lenses nor reverse derivative categories. A general survey of category
theoretic approaches to machine learning, covering many of the above papers,
can be found in [43]. Lastly, the concept of parametric lenses has started appear-
ing in recent formulations of categorical game theory and cybernetics [9,10]. The
work of [9] generalises the study of parametric lenses into parametric optics and
connects it to game-theoretic concepts such as Nash equilibria.

7 Conclusions and Future Directions

We have given a categorical foundation of gradient-based learning algorithms


which achieves a number of important goals. The foundation is principled and
mathematically clean, based on the fundamental idea of a parametric lens. The
foundation covers a wide variety of examples: different optimisers and loss maps

in gradient-based learning, different settings where gradient-based learning hap-


pens (smooth functions vs. boolean circuits), and both learning of parameters
and learning of inputs (deep dreaming). Finally, the foundation is more than
a mere abstraction: we have also shown how it can be used to give a practical
implementation of learning, as discussed in Section 5.
There are a number of important directions which are possible to explore
because of this work. One of the most exciting ones is the extension to more
complex neural network architectures. Our formulation of the loss map as a
parametric lens should pave the way for Generative Adversarial Networks [27],
an exciting new architecture whose loss map can be said to be learned in tandem
with the base network. In all our settings we have fixed an optimiser beforehand.
The work of [4] describes a meta-learning approach which sees the optimiser as a
neural network whose parameters and gradient update rule can be learned. This
is an exciting prospect since one can model optimisers as parametric lenses;
and our framework covers learning with parametric lenses. Recurrent neural
networks are another example of a more complex architecture, which has already
been studied in the context of differential categories in [46]. When it comes to
architectures, future work includes modelling some classical systems as well, such
as the Support Vector Machines [15], which should be possible with the usage
of loss maps such as Hinge loss.
Future work also includes using the full power of CRDC axioms. In particular,
axioms RD.6 or RD.7, which deal with the behaviour of higher-order derivatives,
were not exploited in our work, but they should play a role in modelling some
supervised learning algorithms using higher-order derivatives (for example, the
Hessian) for additional optimisations. Taking this idea in a different direction,
one can see that much of our work can be applied to any functor of the form
F : C → Lens(C); F does not necessarily have to be of the form f ↦ (f, R[f])
for a CRDC R. Moreover, by working with more generalised forms of the lens
category (such as dependent lenses), we may be able to capture ideas related
to supervised learning on manifolds. And, of course, we can vary the parameter
space to endow it with different structure from the functions we wish to learn. In
this vein, we wish to use fibrations/dependent types to model the use of tangent
bundles: this would foster the extension of the correct by construction paradigm
to machine learning, and thereby addressing the widely acknowledged problem
of trusted machine learning. The possibilities are made much easier by the com-
positional nature of our framework. Another key topic for future work is to link
gradient-based learning with game theory. At a high level, the former takes lit-
tle incremental steps to achieve an equilibrium while the latter aims to do so in
one fell swoop. Formalising this intuition is possible with our lens-based frame-
work and the lens-based framework for game theory [25]. Finally, because our
framework is quite general, in future work we plan to consider further modifica-
tions and additions to encompass non-supervised, probabilistic and non-gradient
based learning. This includes genetic algorithms and reinforcement learning.

Acknowledgements Fabio Zanasi acknowledges support from epsrc EP/V002376/1.


Geoff Cruttwell acknowledges support from NSERC.

References

1. Inceptionism: Going deeper into neural networks (2015), https://ai.googleblog.


com/2015/06/inceptionism-going-deeper-into-neural.html
2. Explainable AI: the basics - policy briefing (2019), royalsociety.org/ai-
interpretability
3. Abramsky, S., Coecke, B.: A categorical semantics of quantum protocols. In: Pro-
ceedings of the 19th Annual IEEE Symposium on Logic in Computer Science, 2004.
pp. 415–425 (2004). https://doi.org/10.1109/LICS.2004.1319636
4. Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M.W., Pfau, D., Schaul, T.,
Shillingford, B., de Freitas, N.: Learning to learn by gradient descent by gradient
descent. In: 30th Conference on Neural Information Processings Systems (NIPS)
(2016)
5. Baez, J.C., Erbele, J.: Categories in Control. Theory and Applications of Categories
30(24), 836–881 (2015)
6. Bohannon, A., Foster, J.N., Pierce, B.C., Pilkiewicz, A., Schmitt, A.: Boomerang:
Resourceful lenses for string data. SIGPLAN Not. 43(1), 407–419 (Jan 2008).
https://doi.org/10.1145/1328897.1328487
7. Boisseau, G.: String Diagrams for Optics. arXiv:2002.11480 (2020)
8. Bonchi, F., Sobocinski, P., Zanasi, F.: The calculus of signal flow di-
agrams I: linear relations on streams. Inf. Comput. 252, 2–29 (2017).
https://doi.org/10.1016/j.ic.2016.03.002, https://doi.org/10.1016/j.ic.2016.
03.002
9. Capucci, M., Gavranović, B., Hedges, J., Rischel, E.F.: Towards foundations of
categorical cybernetics. arXiv:2105.06332 (2021)
10. Capucci, M., Ghani, N., Ledent, J., Nordvall Forsberg, F.: Translating Extensive
Form Games to Open Games with Agency. arXiv:2105.06763 (2021)
11. Chollet, F., et al.: Keras (2015), https://github.com/fchollet/keras
12. Clarke, B., Elkins, D., Gibbons, J., Loregian, F., Milewski, B., Pillmore, E., Román,
M.: Profunctor optics, a categorical update. arXiv:2001.07488 (2020)
13. Cockett, J.R.B., Cruttwell, G.S.H., Gallagher, J., Lemay, J.S.P., MacAdam, B.,
Plotkin, G.D., Pronk, D.: Reverse derivative categories. In: Proceedings of the
28th Computer Science Logic (CSL) conference (2020)
14. Coecke, B., Kissinger, A.: Picturing Quantum Processes: A First Course in Quan-
tum Theory and Diagrammatic Reasoning. Cambridge University Press (2017).
https://doi.org/10.1017/9781316219317
15. Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20(3), 273–297
(1995)
16. Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: Training Deep Neural
Networks with binary weights during propagations. arXiv:1511.00363
17. CRCoauthors, A.: Numeric Optics: A python library for constructing and training
neural networks based on lenses and reverse derivatives. https://github.com/
anonymous-c0de/esop-2022
18. Dalrymple, D.: Dioptics: a common generalization of open games and gradient-
based learners. SYCO7 (2019), https://research.protocol.ai/publications/
dioptics-a-common-generalization-of-open-games-and-gradient-based-
learners/dalrymple2019.pdf
19. Dosovitskiy, A., Brox, T.: Inverting convolutional networks with convolutional net-
works. arXiv:1506.02753 (2015)

20. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research 12(Jul), 2121–
2159 (2011)
21. Elliott, C.: The simple essence of automatic differentiation (differentiable functional
programming made easy). arXiv:1804.00746 (2018)
22. Fong, B., Johnson, M.: Lenses and learners. In: Proceedings of the 8th International
Workshop on Bidirectional transformations (Bx@PLW) (2019)
23. Fong, B., Spivak, D.I., Tuyéras, R.: Backprop as functor: A compositional per-
spective on supervised learning. In: Proceedings of the Thirty fourth Annual IEEE
Symposium on Logic in Computer Science (LICS 2019). pp. 1–13. IEEE Computer
Society Press (June 2019)
24. Gavranovic, B.: Compositional deep learning. arXiv:1907.08292 (2019)
25. Ghani, N., Hedges, J., Winschel, V., Zahn, P.: Compositional game theory. In:
Proceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer
Science. p. 472–481. LICS ’18 (2018). https://doi.org/10.1145/3209108.3209165
26. Ghica, D.R., Jung, A., Lopez, A.: Diagrammatic Semantics for Digital Circuits.
arXiv:1703.10247 (2017)
27. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Ghahramani, Z.,
Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in
Neural Information Processing Systems 27, pp. 2672–2680 (2014), http://papers.
nips.cc/paper/5423-generative-adversarial-nets.pdf
28. Griewank, A., Walther, A.: Evaluating derivatives: principles and techniques of
algorithmic differentiation. Society for Industrial and Applied Mathematics (2008)
29. Hedges, J.: Limits of bimorphic lenses. arXiv:1808.05545 (2018)
30. Hermida, C., Tennent, R.D.: Monoidal indeterminates and cate-
gories of possible worlds. Theor. Comput. Sci. 430, 3–22 (Apr 2012).
https://doi.org/10.1016/j.tcs.2012.01.001
31. Johnson, M., Rosebrugh, R., Wood, R.: Lenses, fibrations and universal transla-
tions. Mathematical structures in computer science 22, 25–42 (2012)
32. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio,
Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations,
ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings
(2015), http://arxiv.org/abs/1412.6980
33. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied
to document recognition. In: Proceedings of the IEEE. pp. 2278–2324 (1998).
https://doi.org/10.1109/5.726791
34. Mahendran, A., Vedaldi, A.: Understanding deep image representations by invert-
ing them. arXiv:1412.0035 (2014)
35. Nguyen, A.M., Yosinski, J., Clune, J.: Deep neural networks are easily fooled: High
confidence predictions for unrecognizable images. arXiv:1412.1897 (2014)
36. Olah, C.: Neural networks, types, and functional programming (2015), http://
colah.github.io/posts/2015-09-NN-Types-FP/
37. Polyak, B.: Some methods of speeding up the convergence of iteration meth-
ods. USSR Computational Mathematics and Mathematical Physics 4(5), 1 –
17 (1964). https://doi.org/10.1016/0041-5553(64)90137-5, http://www.sciencedirect.com/science/article/pii/0041555364901375
38. Riley, M.: Categories of optics. arXiv:1809.00738 (2018)
39. Selinger, P.: A survey of graphical languages for monoidal categories. Lecture Notes
in Physics p. 289–355 (2010)

40. Selinger, P.: Control categories and duality: on the categorical semantics of the
lambda-mu calculus. Mathematical Structures in Computer Science 11(02), 207–
260 (2001). http://journals.cambridge.org/article_S096012950000311X
41. Seshia, S.A., Sadigh, D.: Towards verified artificial intelligence. CoRR
abs/1606.08514 (2016), http://arxiv.org/abs/1606.08514
42. Shiebler, D.: Categorical Stochastic Processes and Likelihood. Compositionality
3(1) (2021)
43. Shiebler, D., Gavranović, B., Wilson, P.: Category Theory in Machine Learning.
arXiv:2106.07032 (2021)
44. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks:
Visualising image classification models and saliency maps. arXiv:1312.6034 (2014)
45. Spivak, D.I.: Functorial data migration. arXiv:1009.1166 (2010)
46. Sprunger, D., Katsumata, S.y.: Differentiable causal computations via delayed
trace. In: Proceedings of the 34th Annual ACM/IEEE Symposium on Logic in
Computer Science. LICS ’19, IEEE Press (2019)
47. Steckermeier, A.: Lenses in functional programming. Preprint, available at
https://sinusoid.es/misc/lager/lenses.pdf (2015)
48. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initial-
ization and momentum in deep learning. In: Dasgupta, S., McAllester, D. (eds.)
Proceedings of the 30th International Conference on Machine Learning. vol. 28,
pp. 1139–1147 (2013), http://proceedings.mlr.press/v28/sutskever13.html
49. Turi, D., Plotkin, G.: Towards a mathematical operational semantics. In: Pro-
ceedings of Twelfth Annual IEEE Symposium on Logic in Computer Science. pp.
280–291 (1997). https://doi.org/10.1109/LICS.1997.614955
50. Wilson, P., Zanasi, F.: Reverse derivative ascent: A categorical approach to learn-
ing boolean circuits. In: Proceedings of Applied Category Theory (ACT) (2020),
https://cgi.cse.unsw.edu.au/~eptcs/paper.cgi?ACT2020:31
51. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired Image-to-Image Translation
using Cycle-Consistent Adversarial Networks. arXiv:1703.10593 (2017)

Open Access This chapter is licensed under the terms of the Creative Commons
Attribution 4.0 International License (http://creativecommons.org/licenses/by/
4.0/), which permits use, sharing, adaptation, distribution and reproduction in any
medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons license and indicate if changes
were made.
The images or other third party material in this chapter are included in the chapter’s
Creative Commons license, unless indicated otherwise in a credit line to the material. If
material is not included in the chapter’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
Compiling Universal Probabilistic Programming
Languages with Efficient Parallel Sequential
Monte Carlo Inference⋆

Daniel Lundén1 (✉), Joey Öhman2, Jan Kudlicka3, Viktor Senderov4,


Fredrik Ronquist4,5 , and David Broman1

1 EECS and Digital Futures, KTH Royal Institute of Technology, Stockholm, Sweden, {dlunde,dbro}@kth.se
2 AI Sweden, Stockholm, Sweden, joey.ohman@ai.se
3 Department of Data Science and Analytics, BI Norwegian Business School, Oslo, Norway, jan.kudlicka@bi.no
4 Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden, {viktor.senderov,fredrik.ronquist}@nrm.se
5 Department of Zoology, Stockholm University

Abstract. Probabilistic programming languages (PPLs) allow users to


encode arbitrary inference problems, and PPL implementations provide
general-purpose automatic inference for these problems. However, con-
structing inference implementations that are efficient enough is challeng-
ing for many real-world problems. Often, this is due to PPLs not fully ex-
ploiting available parallelization and optimization opportunities. For ex-
ample, handling probabilistic checkpoints in PPLs through continuation-
passing style transformations or non-preemptive multitasking—as is done
in many popular PPLs—often disallows compilation to low-level lan-
guages required for high-performance platforms such as GPUs. To solve
the checkpoint problem, we introduce the concept of PPL control-flow
graphs (PCFGs)—a simple and efficient approach to checkpoints in low-
level languages. We use this approach to implement RootPPL: a low-level
PPL built on CUDA and C++ with OpenMP, providing highly effi-
cient and massively parallel SMC inference. We also introduce a general
method of compiling universal high-level PPLs to PCFGs and illustrate
its application when compiling Miking CorePPL—a high-level universal
PPL—to RootPPL. The approach is the first to compile a universal PPL
to GPUs with SMC inference. We evaluate RootPPL and the CorePPL
compiler through a set of real-world experiments in the domains of phylo-
genetics and epidemiology, demonstrating up to 6× speedups over state-
of-the-art PPLs implementing SMC inference.

Keywords: Probabilistic Programming Languages · Compilers · Se-


quential Monte Carlo · GPU Compilation
⋆ This project is financially supported by the Swedish Foundation for Strategic Re-
search (FFL15-0032 and RIT15-0012), the European Union’s Horizon 2020 re-
search and innovation program under the Marie Skłodowska-Curie grant agreement
PhyPPL (No 898120), and the Swedish Research Council (grant number 2018-04620).

© The Author(s) 2022


I. Sergey (Ed.): ESOP 2022, LNCS 13240, pp. 29–56, 2022.
https://doi.org/10.1007/978-3-030-99336-8_2

1 Introduction

Probabilistic programming languages (PPLs) allow for encoding a wide range of


statistical inference problems and provide inference algorithms as part of their
implementations. Specifically, PPLs allow language users to focus solely on en-
coding their statistical problems, which the language implementation then solves
automatically. Many such languages exist and are applied in, e.g., statistics, ma-
chine learning, and artificial intelligence. Some example PPLs are WebPPL [20],
Birch [32], Anglican [40], and Pyro [10].
However, implementing efficient PPL inference algorithms is challenging for
many real-world problems. Most often, universal 6 PPLs implement general-
purpose inference algorithms—most commonly sequential Monte Carlo (SMC)
methods [14], Markov chain Monte Carlo (MCMC) methods [18], Hamiltonian
Monte Carlo (HMC) methods [12], variational inference (VI) [39], or a combina-
tion of these. In some cases, poor efficiency may be due to an inference algorithm
not well suited to the particular PPL program. However, in other cases, the PPL
implementations do not fully exploit opportunities for parallelization and opti-
mization on the available hardware. Unfortunately, doing this is often tricky
without introducing complexity for end-users of PPLs.
A critical performance consideration is handling probabilistic checkpoints [37]
in PPLs. Checkpoints are locations in probabilistic programs where inference al-
gorithms must interject, for example, to resample in SMC inference or record
random draw locations where MCMC inference can explore alternative execution
paths. The most common approach to checkpoints—used in universal PPLs such
as WebPPL [20], Anglican [40], and Birch [32]—is to associate them with PPL-
specific language constructs. In general, PPL users can place these constructs
without restriction, and inference algorithms interject through continuation-
passing style (CPS) transformations [9,20,40] or non-preemptive multitasking
[32] (e.g., coroutines) that enable pausing and resuming executions. These so-
lutions are often not available in languages such as C and CUDA [1] used for
high-performance platforms such as graphics processing units (GPUs), making
compiling PPLs to these languages and platforms challenging. Some approaches
for running PPLs on GPUs do exist, however. LibBi [29] runs on GPUs with
SMC inference but is not universal. Stan [12] and AugurV2 [22] partially run
MCMC inference on GPUs but have limited expressive power. Pyro [10] runs on
GPUs, but currently not in combination with SMC. In this paper, we compile a
universal PPL and run it with SMC on GPUs for the first time.
A more straightforward approach to checkpoints, used for SMC in Birch [32]
and Pyro [10], is to encode models with a step function called iteratively. Check-
points then occur each time step returns. This paper presents a new approach to
checkpoint handling, generalizing the step function approach. We write prob-
abilistic programs as a set of code blocks connected in what we term a PPL
⁶ A term due to Goodman et al. [19]. No precise definition exists, but in principle, a
universal PPL program can perform probabilistic operations at any point. In partic-
ular, it is not always possible to statically determine the number of random variables.

[Diagram: Miking CorePPL (Section 2) → Compiler (Section 4) → RootPPL Language (Section 3) → RootPPL Compiler → C++ or CUDA Executable, together with the RootPPL SMC Inference Engine.]

Fig. 1: The CorePPL and RootPPL toolchain. Solid rectangular components


(gray) represent programs and rounded components (blue) translations. The
dashed rectangles indicate paper sections.

control-flow graph (PCFG). PPL checkpoints are restricted to only occur at


tail position in these blocks, and communication between blocks is only allowed
through an explicit PCFG state. As a result, pausing and resuming executions
is straightforward: it is simply a matter of stopping after executing a block and
then resuming by running the next block. A variable in the PCFG state, set from
within the blocks, determines the next block. This variable allows for loops and
branching and gives the same expressive power as other universal PPLs. We im-
plement the above approach in RootPPL: a low-level universal PPL framework
built using C++ and CUDA with highly efficient and parallel SMC inference.
RootPPL consists of both an inference engine and a simple macro-based PPL.
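To illustrate the execution model (in Python rather than CUDA, and with names that are purely illustrative rather than RootPPL's API): each block is a function that updates an explicit per-particle state and names the next block, and the inference engine may interject, e.g. to resample, only between blocks.

import random

def block_init(state):
    state["x"], state["logw"], state["t"] = 0.0, 0.0, 0
    state["next"] = "step"

def block_step(state):
    state["x"] += random.gauss(0.0, 1.0)            # a random draw
    state["logw"] += -abs(state["x"])               # a stand-in log-weight update
    state["t"] += 1
    state["next"] = "step" if state["t"] < 3 else None   # loop and halt via the state

blocks = {"init": block_init, "step": block_step}
particles = [{"next": "init"} for _ in range(4)]

while any(p["next"] is not None for p in particles):
    for p in particles:
        if p["next"] is not None:
            blocks[p["next"]](p)
    # checkpoint: a real SMC engine would resample the particles by weight here

print([round(p["x"], 2) for p in particles])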
A problem with RootPPL is that it is low-level and, therefore, challenging
to write programs in. In particular, sending data between blocks through the
PCFG state can quickly get difficult for more complex models. To solve this, we
develop a general technique for compiling high-level universal PPLs to PCFGs.
The key idea is to decompose functions in the high-level language to a set of
PCFG blocks, such that checkpoints in the original function always occur at
tail position in blocks. As a result of the decomposition, the PCFG state must
store a part of the call stack. The compiler adds code for handling this call
stack explicitly in the PCFG blocks. We illustrate the compilation technique by
introducing a high-level source language, Miking CorePPL, and compiling it to
RootPPL. Fig. 1 illustrates the overall toolchain.
In summary, we make the following contributions.

– We introduce PCFGs, a framework for checkpoint handling in PPLs, and use


it to implement RootPPL: a low-level universal PPL with highly efficient and
parallel SMC inference (Section 3).
– We develop an approach for compiling high-level universal PPLs to PCFGs
and use it to compile Miking CorePPL to RootPPL. In particular, we give an
algorithm for decomposing high-level functions to PCFG blocks (Section 4).

Furthermore, we introduce Miking CorePPL in Section 2 and evaluate the


performance of RootPPL and the CorePPL compiler in Section 5 on real-world
models from phylogenetics and epidemiology, achieving up to 6× speedups over
the state-of-the-art. An artifact accompanying this paper supports the evalua-
tion [26]. An extended version of this article is also available [27]. A † symbol in
the text indicates more information is available in the extended version.
Other documents randomly have
different content
Die Geschichte von der übermütigen
Mohrenprinzessin

von

Albert Roderich.
Pelusa, die Tochter des Königs der Mohren,
war schwarz im Gesicht bis hinter die Ohren;
sie war wie geschnitzelt aus Ebenholz
und übermütig und scheußlich stolz.
Sie spielte aber vortrefflich Schach
und übte darin sich jeden Tag.
Einst machte bekannt sie durch ihre Bonzen
und auch zugleich durch Zeitungsannoncen,
es könnt’ mit ihr spielen um hohen Gewinns
eine Partie Schach jeder Vollblutprinz;
gewinnt er, so wird sie sein Ehegesponst,
verliert er, so muß er ihr dienen umsonst,
muß scheuern und putzen des Schlosses Treppen,
muß Holz zerspalten und Wasser schleppen. —
Es waren gekommen, auf Klugheit trutzend,
von Mohrenprinzen diverse Dutzend,
so viele, daß ich sie einzeln nicht zähl’,
zu Wasser, zu Pferde und auch zu Kamel,
Pelusa besiegte sie alle im Schach,
und Hausknechte wurden die Prinzen sonach.
Da kam mal ein weißer, ein Prinz vom Norden
— der Name ist nicht bekannt geworden —
der zeigte seinen Geburtsschein und sprach:
„Bitte, melden Sie mich der Prinzessin zum Schach!“
Wie die beiden einander gegenübersaßen,
da gefiel er dem Fräulein über die Maßen;
anstatt, daß wie sonst vorsichtig sie spielt,
hat heimlich sie nach dem Prinzen geschielt.
Ihre Kunst, die bewährte, ward immer geringer,
jetzt nimmt ihr der Prinz schon den zweiten Springer.
Die Schranzen können sich wundern nicht satt;
jetzt ruft schon der eine: „Beim nächsten Zug matt!“
Da beugte der Prinz vor Pelusa das Knie
und sagte: „Mein Fräulein, ich geb’ es remis!“
grüßt hübsch in der Runde verschiedene Mal
und verläßt mit zierlichem Lächeln den Saal.
Da glich das Antlitz der stolzen Pelusa
dem Angesicht einer schwarzen Medusa,
und regungslos saß sie voll Wut und Stolz,
als wär’ sie geschnitzelt aus Ebenholz.
Sie wartet noch heut’ auf den Prinzen vom Norden —
laß sie warten, bis sie weiß geworden.
Der Junge

von

Ferdinand Avenarius.
Wer war weggegangen, wer,
sag’ mir, Frau, kam wieder her?
Mit roten Backen, heisassa,
unsere Jugend ist wieder da!
Sieht wie ein großer Junge aus,
lärmt und tollt, es ist ein Graus.
Sitz’ ich bei der Arbeit sacht,
hängt er mir plötzlich am Hals und lacht,
macht mir das, wie sich’s gehört, Verdruß,
mir nichts, dir nichts, gibt’s einen Kuß.
Wehr’ ich mich endlich: „Nun aber hinaus!“
schaut er auf einmal ganz anders aus,
sieht mich aus den Augen verschmitzt
an, daß mir’s zum Herzen blitzt,
klatscht dann plötzlich in die Hand —
Himmel: von Pult und Schrank und Wand
von Mucken, Motten und Hummeln brummts
und hinaus zum Fenster summts!
„Ich bin die Jugend,“ lacht er dazu:
„Das kann ich — nun duld mich, du!“
Gut, so mag’s fortan denn sein:
Wir Alten, die Jugend, wir bleiben zu drei’n!
Ein Bildchen

von

Carl Spitteler.
Den Rain hinauf, mit trotzigem Alarm
fuchtelt ein Kinderschwarm.
„Vorwärts! Hurra!“
Hut ab! Du schaust kein Spiel.
Den Himmel zu erstürmen gilt das ernste Ziel.
Er ist so nah!
Siehst, wie er aus dem Grase guckt dort oben?

Zwei Glockentöne, leicht vom Morgenwind gehoben,


kommen vergnügt und ungezwungen
dahergesungen.
„Wo geht denn hier der Weg?“
„Wir wollen durch den Kindersternenhaufen
über den Hügel weg
die lange Kirschenblütenstraße laufen.“
Gesagt. Ein Sang, ein Flug:
verschwunden in den Kirschen überm Hügelzug.

Der Kindersturm aber dort unten


hat einen Igel gefunden.
In Anbetracht dessen
ist der Himmel vergessen.
Das Brückengespenst

von

Carl Spitteler.
Am Kreuzweg seufzt’ ein Brückengeist,
umringt von sieben Kleinen,
mit Wanderpack und Bettelsack,
und alle Kleinen weinen.
„Was fehlt dir, Vater? fasse Mut,
erzähle mir die Märe,
was dir geschah, und ob ich dir
vielleicht behilflich wäre.“
Der Alte ächzt’ und wischte sich
die tränenfeuchten Lider,
hernach mit kummervollem Blick
gab er die Antwort wieder:
„Ich lebt’ als ehrliches Gespenst
im trauten Uferloche
friedlich am heimatlichen Fluß
unter dem Brückenjoche.
Ach! war das eine schöne Zeit!
Die Brücke war in Stücken,
zwei Balken fehlten, einer wich,
die andern hatten Lücken,
der Mittelpfosten schaukelte
und tanzte vor Vergnügen;
kurz, selbst der strengsten Forderung
konnte der Bau genügen.
Und da einmal Gespensterpflicht
erfordert, wen zu necken,
so wählten wir die Profession,
die Pferde zu erschrecken.
’s ist eine angestammte Kunst
vom Urgroßvater ferne,
und wenn wir drinnen Meister sind,
das macht: wir tun’s halt gerne.
Zwar so ein Gaul am Wägelein
und solche kleine Dinge —
bewahr’! dergleichen lockt uns nicht,
das war uns zu geringe;
dagegen eine Jagdpartie,
ein Picknick meinetwegen
auf heißen Rasserossen! Hah!
da lohnte sich’s hingegen!
Man ließ das Trüpplein ungestört
tripp trapp im muntern Schritte,
mit Scherz und Sang tralli tralla
bis auf die Brückenmitte.
Dann, auf mein Zeichen, ging es los:
verborgen im Gebälke,
eröffneten zugleich den Krieg
die sieben süßen Schälke.
Der Leopold, der Barnabas,
der Klaus, der Sakranitsche
klatschten den Pferden um die Knie
mit Latten und mit Pritsche.
Der Wenzel zerrte sie am Schweif,
der Philipp, nach den Regeln,
wippt’ ihnen Balken an den Bauch,
die kitzelten mit Nägeln.
‚Ich komme auch!‘ rief Fridolin,
‚wart’ doch! nicht solche Eile!‘
nahm hurtig einen Span und stieß
und stach die Hinterteile.
War das ein Wirrwarr und Geschrei!
Das hätt’st du sehen sollen!
Vor Angst und Aufruhr wußte keins,
ob vor- ob rückwärtswollen.
Und war nun alles unterobs,
dann fuhr ich wie der Teufel
haushoch hervor mit „Holdridu“.
Da schwand der letzte Zweifel.
Links, rechts hinunter in den Fluß,
plumps über das Geländer.
Und lustig schwammen Sonnenschirm’
und Strohhüt’ und Gewänder.

„Ach Gott! was schwatz’ ich unnütz da!
Das sind vergang’ne Zeiten!
Es geht jetzt alles mit Benzin,
vorüber ist das Reiten.
Ein Maultier von Gemeinderat —
man sollt’ ihn „Unrat“ heißen —
ließ all’ die schöne Herrlichkeit
vandalisch niederreißen.
Statt des elastischen Gebälks
glotzt eine starre Mauer.
Ach was! was weiß von Pietät
und Heimatschutz ein Bauer.
Der kennt nur seinen Marktverkehr
und seine Dorfint’ressen.
Ich aber irre seither nun
verstoßen und vergessen
mit meinen Kindern durch die Welt,
ob ich vielleicht am Ende
für sie — ich denk’ ja nicht an mich —
Arbeit und Stellung fände.
Ansprüche, große, mach’ ich nicht,
sei’s eine hohle Eiche,
ein Kirchhof, ein verwunsch’nes Schloß,
es ist mir ganz das gleiche,
ich selber würde unterdes
etwa bei Spiritisten
als Klopfgeist oder Gabriel
zunächst mein Leben fristen.
’s ist furchtbar schwierig heutzutag’
für körperlose Seelen!
Drum falls du jemals etwas weißt,
so möcht’ ich mich empfehlen.“
Bruder Liederlich

von

Detlev von Liliencron.


Die Feder am Sturmhut in Spiel und Gefahren,
Halli.
Nie lernt ich im Leben fasten, noch sparen,
Hallo.
Der Dirne laß ich die Wege nicht frei,
wo Männer sich raufen, da bin ich dabei,
und wo sie saufen, da sauf’ ich für drei.
Halli und Hallo.

Verdammt, es blieb mir ein Mädchen hängen,
Halli.
Ich kann sie mir nicht aus dem Herzen zwängen,
Hallo.
Ich glaube, sie war erst sechzehn Jahr’,
trug rote Bänder im schwarzen Haar
und plauderte wie der lustigste Staar.
Halli und Hallo.

Was hatte das Mädel zwei frische Backen,
Halli.
Krach, konnten die Zähne die Haselnuß knacken,
Hallo.
Sie hat mir das Zimmer mit Blumen geschmückt,
die wir auf heimlichen Wegen gepflückt,
wie hab ich dafür ans Herz sie gedrückt.
Halli und Hallo.

Ich schenkt ihr ein Kleidchen von gelber Seiden,
Halli.
Sie sagte, sie möcht’ mich unsäglich gern leiden,
Hallo.
Und als ich die Taschen ihr vollgesteckt
mit Pralinés, Feigen und feinem Konfekt,
da hat sie von morgens bis abends geschleckt.
Halli und Hallo.
Wir haben superb uns die Zeit vertrieben,
Halli.
Ich wollte, wir wären zusammen geblieben,
Hallo.
Doch wurde die Sache mir stark ennuyant,
ich sagt’ ihr, daß mich die Regierung ernannt,
Kamele zu kaufen in Samarkand.
Halli und Hallo.

Und als ich zum Abschied die Hand gab der Kleinen,
Halli.
Da fing sie bitterlich an zu weinen,
Hallo.
Was denk ich just heut’ ohn’ Unterlaß,
daß ich ihr so rauh gab den Reisepaß ...
Wein her, zum Henker, und da liegt Trumpf Aß.
Halli und Hallo.
Die Musik kommt

von

Detlev von Liliencron.


Klingling, bumbum und tschingdada,
zieht im Triumph der Perserschah?
Und um die Ecke brausend bricht’s
wie Tubaton des Weltgerichts,
voran der Schellenträger.

Brumbum, das große Bombardon,
der Beckenschlag, das Helikon,
die Pikkolo, der Zinkenist,
die Türkentrommel, der Flötist,
und dann der Herre Hauptmann.

Der Hauptmann naht mit stolzem Sinn,
die Schuppenketten unterm Kinn,
die Schärpe schnürt den schlanken Leib,
beim Zeus! Das ist kein Zeitvertreib,
und dann die Herren Leutnants.

Zwei Leutnants, rosenrot und braun,
die Fahnen schützen sie als Zaun,
die Fahne kommt, den Hut nimm ab,
der bleiben treu wir bis ans Grab,
und dann die Grenadiere.

Der Grenadier im strammen Tritt,
in Schritt und Tritt und Tritt und Schritt,
das stampft und dröhnt und klappt und flirrt,
Laternenglas und Fenster klirrt,
und dann die kleinen Mädchen.

Die Mädchen alle, Kopf an Kopf,
das Auge blau und blond der Zopf,
aus Tür und Tor und Hof und Haus
schaut Mine, Trine, Stine aus,
vorbei ist die Musike.
Klingling, tschingtsching und Paukenkrach,
noch aus der Ferne tönt es schwach,
ganz leise bumbumbumbum tsching,
zog da ein bunter Schmetterling,
tschingtsching, bum, um die Ecke?
Ich und die Rose warten

von

Detlev von Liliencron.


Vor mir
auf der dunkelbraunen Tischdecke
liegt eine große hellgelbe Rose.
Sie wartet mit mir
auf die Liebste,
der ich ins schwarze Haar
sie flechten will.

Wir warten schon eine Stunde.

Die Haustür geht.
Sie kommt, sie kommt.
Doch herein tritt
mein Freund, der Assessor;
geschniegelt, gebügelt, wie stets.
Der Assessor, ein Streber,
will Bürgermeister werden.
Gräßlich sind seine Erzählungen
über Wahlen, Vereine, Gegenpartei.
Endlich bemerkt er die Blume,
und seine gierigen,
perlgrauglacébehandschuhten Hände
greifen nach ihr:
„Äh, süperb!
Müssen mir geben fürs Knopfloch.“
„Nein!“ ruf ich grob.
„Herr Jess’ noch mal,
sind heut’ nicht in Laune,
denn nicht.
Empfehl’ mich Ihnen.
Sie kommen doch morgen in die Versammlung?“

Ich und die Rose warten.

Die Haustür geht.
Sie kommt, sie kommt.
Doch herein tritt
mein Freund, Herr von Schnelleben.
Unerträglich langweilig sind seine Erzählungen
über Bälle und Diners.
Endlich bemerkt er die Blume.
Und seine bismarckbraunglacébehandschuhten Hände
greifen nach ihr:
„Ah, das trifft sich,
brauch’ ich nicht erst zu Bünger.
Hinein ins Knopfloch.
Du erlaubst doch?“
„Nein!“ schrei ich wütend.
„Na, aber,
warum denn so ausfallend,
bist heut’ nicht in Laune.
Denn nicht.
Empfehl’ mich dir.“

Ich und die Rose warten.

Die Haustür geht.
Sie kommt, sie kommt.
Doch herein tritt
mein Freund, der Dichter.
Der bemerkt sofort die hellgelbe.
Und er leiert ohn’ Umstände drauf los:
„Die Rose wallet am Busen des Mädchens,
wenn sie spät abends im Parke des Städtchens
gehet allein im mondlichen Schein ...“
„Halt ein, halt ein!“
„Was ist dir denn, Mensch.
Aber du schenkst mir doch die Blume?
Ich will sie mir ins Knopfloch stecken.“
Und gierig greift er nach ihr.
„Nein!“ brüll’ ich wie rasend.
„Aber was ist denn?
Bist heut’ nicht in Laune.
Denn nicht.
Empfehl’ mich dir.“

Ich und die Rose warten.

Die Haustür geht.
Sie kommt, sie kommt.
Und — da ist sie.
„Hast du mich aber lange lauern lassen.“
„Ich konnte doch nicht eher ...
Oh, die Rose, die Rose.“
„Hut ab erst.
Stillgestanden!
Nicht gemuckst.
Kopf vorwärts beugt!“
Und ich nestl’ ihr
die gelbe Rose ins schwarze Haar.
Ein letzter Sonnenschein
fällt ins Zimmer
über ihr reizend Gesicht.
Auf der Kasse

von

Detlev von Liliencron.


Heute war ich zur Kasse bestellt,
dort läge für mich auf dem Zahltisch Geld.
Waren’s auch nur drei Mark und acht,
hinein in den Beutel die fröhliche Fracht.

Auf der Kasse die Zähler und Schreiber,
die Pfennigumdreher und Steuereintreiber,
wie sie kalt auf den Sitzböcken tronen,
sichten das Gold wie Kaffeebohnen.
Möchte doch lieber Zigeuner sein,
als Mammonbeschnüffler im güldenen Schrein.

Im Bureau ist jeder zu warten schuldig,
stand ich denn auch eine Stunde geduldig.
Dacht’ ich mir plötzlich, mit Verlaub,
wären doch alle hier blind und taub.
Der Geldschrank steht offen, rasch wie der Pfiff,
tät ich hinein einen herzhaften Griff,
packte mir berstvoll alle Taschen,
machte mich schleunigst auf die Gamaschen,
nähme Schritte wie zwanzig Meter.
Hinter mir der Gendarm mit Gezeter,
brächt’ mich nicht ein, so sehr er auch liefe,
säß auf der schnellsten Lokomotive.

Mit der Verwendung des Geldes, nun ...
bin ich doch kein blindes Huhn.
Stolziert’ umher wie der König von Polen,
suchte mir bald ein Bräutchen zu holen.
So ein Mädchen mit blanken Zöpfen
könnt’ ich wahrhaftig vor Liebe köpfen.
Vor dem Spiegel, auf hohen Zehen,
stehn wir, wer größer ist, zu sehen.
Ach, diese Nähe! Den Puls ihres Lebens
fühl’ ich im Spiele des neckischen Strebens.
Weiter, natürlich Wagen und Pferde,
Länder und Leute, Himmel und Erde.
Tausend, wie will ich mich amüsieren ...

„Bitte, wollen Sie hier quittieren.“
O, wie das nüchtern und eisig klang.
Nahm die drei Mark und acht in Empfang,
trank bescheiden ein Krüglein Bier,
trollte nach Hause, ich armes Tier,
schalt meine Frau mich bis spät in die Nacht,
daß ich so wenig Geld gebracht.
Trin

von

Detlev von Liliencron.


Mit Nadel un Tweern[217]
keem de lütt Deern[218].
As[219] se mi nu den utneiten Knoop anneiht[220],
un so flink de Finger ehr geiht,
un se so neech bi mi steiht[221],
denk ick, wat kann dat sien[222], man to,
un ick gev ehr’n Söten[223], hallo, hallo.
Auk, har ick een weg, un dat wem Släg[224],
datt ick glieks dat Jammern kreeg[225],
do kiekt se mi ganz luri[226] an;
„häv ick wehdahn? min leve[227] Mann?“
„Ja,“ segg ick, un ganz sachen[228]
fat ick se üm, greep frischen Moot[229],
un nu güngt ja allns up eenmal got.

As se gung, seg ick: „Lütt Deern,
kumms ock mal weller[230] mit Nadel un Tweern?“
„Ja geern!“
Der Handkuß

von

Detlev von Liliencron.


Viere lang,
zum Empfang,
vorne Jean,
elegant,
fährt meine süße Lady.

Schilderhaus,
Wache ’raus.
Schloßportal,
und im Saal
steht meine süße Lady.

Hofmarschall,
Pagenwall.
Sehr graziös,
merveillös
knixt meine süße Lady.

Königin,
hoher Sinn,
ihre Hand,
interessant,
küßt meine süße Lady.

„Nun, wie war’s
heut’ bei Zars?“
„Ach, ich bin
noch ganz hin,“
haucht meine süße Lady.

Nach und nach,
allgemach,
ihren Mann
wieder dann
kennt meine süße Lady.
Hans der Schwärmer

von

Detlev von Liliencron.


Hans Töffel liebte schön’ Doris sehr,
schön Doris Hans Töffel vielleicht noch mehr.
Doch seine Liebe, ich weiß nicht wie,
ist zu scheu, zu schüchtern, zu viel Elegie.
Im Kreise liest er Gedichte vor,
schön Doris steht unten am Gartentor:
„Ach, käm er doch frisch zu mir hergesprungen,
wie wollt ich ihn herzen, den lieben Jungen.“
Hans Töffel liest oben Gedichte.

Am andern Abend, der blöde Tor,
Hans Töffel trägt wieder Gedichte vor.
Schön Doris das wirklich sehr verdrießt,
daß er immer weiter und weiter liest.
Sie schleicht sich hinaus, er gewahrt es nicht,
just sagt er von Heine ein herrlich Gedicht.
Schön Doris steht unten in Rosendüften
und hätte so gern seinen Arm um die Hüften.
Hans Töffel liest oben Gedichte.

Am andern Abend ist großes Fest,
viel Menschen sind eng aneinander gepreßt.
Heut muß er’s doch endlich sehn, der Poet,
wenn schön Doris sacht aus der Türe geht.
Der Junker Hans Jürgen, der merkt es gleich,
die Linden duften, die Nacht ist so weich.
Und unten im stillen, dunklen Garten
braucht heute schön Doris nicht lange zu warten.
Hans Töffel liest oben Gedichte. —
Lebensjuchzer

von

Detlev von Liliencron.


————————————
Darum, nach vollbrachter Tagespflicht,
stülp’ ich mir meinen alten Filzhut auf,
mit der unscheinbaren Sperberfeder dran,
stecke mir einige blaue Scheine ein,
trumpfe auf den Tisch,
und alle nüchternen Gewohnheitsunkenseelen
tief bedauernd,
ruf’ ich voll kommender Freude:
„Nu wüllt wi uns ook mal fix ameseern!“
Betrunken

von

Detlev von Liliencron.


Ich sitze zwischen Mine und Stine,
den hellblonden hübschen Friesenmädchen,
und trinke Grog.
Die Mutter ging schlafen.
Geht Mine hinaus,
um heißes Wasser zu holen,
küß’ ich Stine.
Geht Stine hinaus,
um ein Brötchen mit aufgelegten kalten Eiern
und Anchovis zu bringen,
küß’ ich Mine.
Nun sitzen wieder beide neben mir.
Meinen rechten Arm halt’ ich um Stine,
meinen linken um Mine.
Wir sind lustig und lachen.
Stine häkelt,
Mine blättert
in einem verjährten Modejournal.
Und ich erzähl’ ihnen Geschichten.

Draußen tobt, höchst ungezogen,
unser guter Freund,
der Nordwest.
Die Wellen spritzen,
es ist Hochflut,
zuweilen über den nahen Deich
und sprengen Tropfen
an unsre Fenster.

Ich bin verbannt und ein Gefangener
auf dieser vermaledeiten
einsamen kleinen Insel.
Zwei Panzerfregatten
und sechs Kreuzer spinnen mich ein.
Auf den Wällen
wachen die Posten,
und einer ruft dem andern zu,
durch die hohle Hand,
von Viertel- zu Viertelstunde,
in singendem Tone:
„Kamerad, lebst du noch?“

Wie wohl mir wird.
Alles Leid sinkt, sinkt.
Mine und Stine lehnen sich
an meine Schultern.
Ich ziehe sie dichter und dichter
an mich heran.
Denn im Lande der Hyperboreer,
wo wir wohnen,
ist es kalt.

Ich trank das sechste Glas.
Ich stehe draußen
an der Mauer des Hauses,
barhaupt,
und schaue in die Sterne:
der winzige, matt blinkende,
grad über mir,
ist der Stern der Gemütlichkeit,
zugleich der Stern
der äußersten geistigen Genügsamkeit.
Der nah daneben blitzt,
der große, feuerfunkelnde,
ist der Stern des Zorns.
Welten-Rätsel.
Die Welt — das Rätsel der Rätsel.
Wie mir der Wind die heiße Stirn kühlt.
Angenehm, höchst angenehm.

Ich bin wieder im Zimmer.
Ich trinke mein achtes Glas Nordnordgrog.
