Combinatorial Pattern Matching Algorithms in Computational Biology Using Perl and R
Author(s): Gabriel Valiente
ISBN(s): 9781420069730, 142006973X
Edition: 1
Year: 2009
Language: English
Combinatorial Pattern
Matching Algorithms in
Computational Biology
Using Perl and R

© 2009 by Taylor & Francis Group, LLC


CHAPMAN & HALL/CRC
Mathematical and Computational Biology Series

Aims and scope:


This series aims to capture new developments and summarize what is known
over the whole spectrum of mathematical and computational biology and
medicine. It seeks to encourage the integration of mathematical, statistical and
computational methods into biology by publishing a broad range of textbooks,
reference works and handbooks. The titles included in the series are meant to
appeal to students, researchers and professionals in the mathematical, statistical
and computational sciences, fundamental biology and bioengineering, as well
as interdisciplinary researchers involved in the field. The inclusion of concrete
examples and applications, and programming techniques and examples, is
highly encouraged.

Series Editors
Alison M. Etheridge
Department of Statistics
University of Oxford

Louis J. Gross
Department of Ecology and Evolutionary Biology
University of Tennessee

Suzanne Lenhart
Department of Mathematics
University of Tennessee

Philip K. Maini
Mathematical Institute
University of Oxford

Shoba Ranganathan
Research Institute of Biotechnology
Macquarie University

Hershel M. Safer
Weizmann Institute of Science
Bioinformatics & Bio Computing

Eberhard O. Voit
The Wallace H. Coulter Department of Biomedical Engineering
Georgia Tech and Emory University

Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
4th Floor, Albert House
1-4 Singer Street
London EC2A 4BQ
UK

Published Titles
Bioinformatics: A Practical Approach
Shui Qing Ye
Cancer Modelling and Simulation
Luigi Preziosi
Computational Biology: A Statistical Mechanics Perspective
Ralf Blossey
Computational Neuroscience: A Comprehensive Approach
Jianfeng Feng
Data Analysis Tools for DNA Microarrays
Sorin Draghici
Differential Equations and Mathematical Biology
D.S. Jones and B.D. Sleeman
Exactly Solvable Models of Biological Invasion
Sergei V. Petrovskii and Bai-Lian Li
Handbook of Hidden Markov Models in Bioinformatics
Martin Gollery
Introduction to Bioinformatics
Anna Tramontano
An Introduction to Systems Biology: Design Principles of Biological Circuits
Uri Alon
Kinetic Modelling in Systems Biology
Oleg Demin and Igor Goryanin
Knowledge Discovery in Proteomics
Igor Jurisica and Dennis Wigle
Modeling and Simulation of Capsules and Biological Cells
C. Pozrikidis
Niche Modeling: Predictions from Statistical Distributions
David Stockwell
Normal Mode Analysis: Theory and Applications to Biological and
Chemical Systems
Qiang Cui and Ivet Bahar
Pattern Discovery in Bioinformatics: Theory & Algorithms
Laxmi Parida
Spatiotemporal Patterns in Ecology and Epidemiology: Theory, Models, and Simulation
Horst Malchow, Sergei V. Petrovskii, and Ezio Venturino
Stochastic Modelling for Systems Biology
Darren J. Wilkinson
Structural Bioinformatics: An Algorithmic Approach
Forbes J. Burkowski
The Ten Most Wanted Solutions in Protein Bioinformatics
Anna Tramontano

Combinatorial Pattern
Matching Algorithms in
Computational Biology
Using Perl and R

Gabriel Valiente

Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487‑2742
© 2009 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Printed in the United States of America on acid‑free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number‑13: 978‑1‑4200‑6973‑0 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher can‑
not assume responsibility for the validity of all materials or the consequences of their use. The
authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copy‑
right.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978‑750‑8400. CCC is a not‑for‑profit organization that pro‑
vides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Valiente, Gabriel, 1963‑


Combinatorial pattern matching algorithms in computational biology using
Perl and R / Gabriel Valiente.
p. cm. ‑‑ (Mathematical and computational biology series)
Includes bibliographical references and index.
ISBN 978‑1‑4200‑6973‑0 (hardcover : alk. paper)
1. Computational biology. 2. Pattern formation (Biology)‑‑Computer
simulation. 3. Graph algorithms. 4. Perl (Computer program language) 5. R
(Computer program language) I. Title. II. Series.

QH324.2.V35 2009
572.80285‑‑dc22 2009003714

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com

To the loving memory of Helda Zulma Feruglio

Contents

Foreword

Preface

1 Introduction
1.1 Combinatorial Pattern Matching
1.2 Computational Biology
1.3 A Motivating Example: Gene Prediction
Bibliographic Notes

I Sequence Pattern Matching

2 Sequences
2.1 Sequences in Mathematics
2.1.1 Counting Labeled Sequences
2.2 Sequences in Computer Science
2.2.1 Traversing Labeled Sequences
2.3 Sequences in Computational Biology
2.3.1 Reverse Complementing DNA Sequences
2.3.2 Counting RNA Sequences
2.3.3 Generating DNA Sequences
2.3.4 Representing Sequences in Perl
2.3.5 Representing Sequences in R
Bibliographic Notes

3 Simple Pattern Matching in Sequences
3.1 Finding Words in Sequences
3.1.1 Word Composition of Sequences
3.1.2 Alignment Free Comparison of Sequences
Bibliographic Notes

4 General Pattern Matching in Sequences
4.1 Finding Subsequences
4.1.1 Suffix Arrays
4.2 Finding Common Subsequences
4.2.1 Generalized Suffix Arrays
4.3 Comparing Sequences
4.3.1 Edit Distance-Based Comparison of Sequences
4.3.2 Alignment-Based Comparison of Sequences
Bibliographic Notes

II Tree Pattern Matching

5 Trees
5.1 Trees in Mathematics
5.1.1 Counting Labeled Trees
5.2 Trees in Computer Science
5.2.1 Traversing Rooted Trees
5.3 Trees in Computational Biology
5.3.1 The Newick Linear Representation
5.3.2 Counting Phylogenetic Trees
5.3.3 Generating Phylogenetic Trees
5.3.4 Representing Trees in Perl
5.3.5 Representing Trees in R
Bibliographic Notes

6 Simple Pattern Matching in Trees
6.1 Finding Paths in Unrooted Trees
6.1.1 Distances in Unrooted Trees
6.1.2 The Partition Distance between Unrooted Trees
6.1.3 The Nodal Distance between Unrooted Trees
6.2 Finding Paths in Rooted Trees
6.2.1 Distances in Rooted Trees
6.2.2 The Partition Distance between Rooted Trees
6.2.3 The Nodal Distance between Rooted Trees
Bibliographic Notes

7 General Pattern Matching in Trees
7.1 Finding Subtrees
7.1.1 Finding Subtrees Induced by Triplets
7.1.2 Finding Subtrees Induced by Quartets
7.2 Finding Common Subtrees
7.2.1 Maximum Agreement of Rooted Trees
7.2.2 Maximum Agreement of Unrooted Trees
7.3 Comparing Trees
7.3.1 The Triplets Distance between Rooted Trees
7.3.2 The Quartets Distance between Unrooted Trees
Bibliographic Notes

III Graph Pattern Matching

8 Graphs
8.1 Graphs in Mathematics
8.1.1 Counting Labeled Graphs
8.2 Graphs in Computer Science
8.2.1 Traversing Directed Graphs
8.3 Graphs in Computational Biology
8.3.1 The eNewick Linear Representation
8.3.2 Counting Phylogenetic Networks
8.3.3 Generating Phylogenetic Networks
8.3.4 Representing Graphs in Perl
8.3.5 Representing Graphs in R
Bibliographic Notes

9 Simple Pattern Matching in Graphs
9.1 Finding Paths in Graphs
9.1.1 Distances in Graphs
9.1.2 The Path Multiplicity Distance between Graphs
9.1.3 The Tripartition Distance between Graphs
9.1.4 The Nodal Distance between Graphs
9.2 Finding Trees in Graphs
9.2.1 The Statistical Error between Graphs
Bibliographic Notes

10 General Pattern Matching in Graphs
10.1 Finding Subgraphs
10.1.1 Finding Subgraphs Induced by Triplets
10.2 Finding Common Subgraphs
10.2.1 Maximum Agreement of Rooted Networks
10.3 Comparing Graphs
10.3.1 The Triplets Distance between Graphs
Bibliographic Notes

A Elements of Perl
A.1 Perl Scripts
A.2 Overview of Perl
A.3 Perl Quick Reference Card
Bibliographic Notes

B Elements of R
B.1 R Scripts
B.2 Overview of R
B.3 R Quick Reference Card
Bibliographic Notes

References

Foreword

When, more than 25 years ago, Zvi Galil and I decided to organize the
presentations at a NATO workshop into the volume entitled “Combinatorial
Algorithms on Words,” windows were building fixtures and webs were inhabited
by spiders and navigated, more or less inadvertently, by doomed insects.
The “Atlas of Protein Sequences” by Margaret Dayhoff listed a handful of Cy-
tochrome C proteins and the first sequenced genome was more than a decade
away. Nonetheless, we were convinced that the few, scattered properties and
constructs of pattern matching available at the time had the potential to spark
and shape a specialty of algorithm design that would not fade in comparison
with already well-established areas such as graph and numerical algorithms.
It is safe to say that our volume presented a detailed account of the state of the
art, attracted attention to data structures such as the suffix trees that are still
the subject of deep study, and also listed most of the relevant open problems
that would be tackled, with varying success, in the following years. A reader
willing to invest some time on the contents of that volume could immigrate
rather quickly into the issues at the frontier and start making contributions
depending only on their own taste and skills.
About ten years later, as we attempted to collect the state of the art in
“Pattern Matching Algorithms,” we faced an amazingly expanded scenario.
This time, each one of the contributed chapters had to be a rather synthetic
survey of the many intervening results, some of which had started new special-
ized sections, notable among which were two-dimensional and tree matching.
The neophyte interested in the subject could now take steps from reading one
or two chapters, but would then have to invest considerable additional time
in the study of numerous references. Many things had happened in between,
bringing new meanings to webs and windows and growing protein databases
and sequence repositories that included now sizable genomes such as yeast.
Since then, additional beautiful volumes have been produced and even con-
ferences such as CPM, RECOMB, SPIRE, WABI have started that keep
churning out an unrelenting crop of problems, applications, and results. All
of this shows how challenging it is to embark on a compendium of the state of
the art, and thus how admirable is the volume that results from the effort of
Gabriel Valiente.
The reader of this nicely structured volume will find a well-rounded expo-
sition of the traditional issues accompanied by an up-to-date account of more
recent developments, such as graph similarity and search. As is well known, a
great many pattern matching problems have been inspired over the years by
the growing domain of computational molecular biology. This is due in part
to the fact that the crisp lexicon of biological sequence analysis provides the
natural habitat for pattern matching, and while probably not every problem
has received a fungible solution, it is admitted that nowhere else has pattern
matching found more stimuli and opportunities for growth as a discipline.
Continuing competently in this vein, the organization of this volume dove-
tails with the transition of computational biology from the molecular to the
cellular level.
For more than three decades, I have observed that one of the greatest diffi-
culties that biologists and computer scientists have to overcome when engaging
in interdisciplinary inquiries is the lack of a common language, hence of the
ability to set common goals. Such a language is the necessary prerequisite to
the ambitious and yet necessary task of developing a hybrid specialist, conver-
sant in both disciplines and able to appreciate a problem and its solution from
the standpoint of both computing and biology. Balancing a careful mixture of
formal methods, programming, and examples, Gabriel Valiente has managed
to harmoniously bridge languages and contents into a self-contained source of
lasting influence. It is not difficult to predict that this book will be studied
indifferently by the specialist of biology and computer science, helping each to
walk a few steps towards the other. It will entice new generations of scholars
to engage in its beautiful subject.

Alberto Apostolico
Atlanta

Preface

Combinatorial pattern matching algorithms count among the main sources
of solutions to the computational biology problems that arise in the analysis
of genomic, transcriptomic, proteomic, metabolomic, and interactomic data.
Top in the ranking is the BLAST software tool for rapid searching of nucleotide
and protein databases, which is based on combinatorial pattern matching
algorithms for local sequence alignment. Also high in the ranking, suffix trees
and suffix arrays were developed to efficiently solve specific problems that
arise in computational biology, such as the search for patterns in nucleotide
or protein sequences.
This is a book on combinatorial pattern matching algorithms, with special
emphasis on computational biology and also on their implementation in Perl
and R, two widely used scripting languages in computational biology. The
book is aimed at anyone with an interest in combinatorial pattern matching
and in the broader subject of combinatorial algorithms, the only prerequisites
being an elementary knowledge of mathematics and computer programming,
the desire to learn, and unlimited time and patience.

Acknowledgments
This book is based on graduate lectures taught at the Technical Univer-
sity of Catalonia, Barcelona, and also invited lectures at the Phylogenet-
ics Programme of the Isaac Newton Institute for Mathematical Sciences,
held September 3 to December 21, 2007, in Cambridge, UK; at the Gul-
benkian PhD Program in Computational Biology of the Instituto Gulbenkian
de Ciência, held February 4–8, 2008, and January 26–30, 2009, in Oeiras,
Portugal; and at the Lipari International Summer School on Bioinformat-
ics and Computational Biology, held February 4–8, 2008, in Lipari Island,
Italy. I am very grateful to Vincent Moulton, Mike Steel, and Daniel Huson
(Isaac Newton Institute for Mathematical Sciences), to Jorge Carneiro and
Manuela Cordeiro (Gulbenkian PhD Program in Computational Biology), and
to Alfredo Ferro, Raffaele Giancarlo, Concettina Guerra, and Michael Levitt
(Lipari International Summer School on Bioinformatics and Computational
Biology) for their continuous encouragement and support.
The very idea of presenting combinatorial pattern matching problems in a
uniform framework (pattern matching in and between sequences, trees, and
graphs) arose during the Second Haifa Annual International Stringology Re-
search Workshop, held April 3–8, 2005, at the Caesarea Rothschild Institute
for Interdisciplinary Applications of Computer Science, University of Haifa,
Israel. I am very grateful to Amihood Amir, Martin C. Golumbic, and Gad M.
Landau for providing such a stimulating environment.
The approach to algorithms in bioinformatics and computational biology
expressed in this book has been influenced by the interaction with numerous
colleagues at the Barcelona Biomedical Research Park, especially within the
Centre for Genomic Regulation and also in the Research Unit on Biomedical
Informatics. In particular, I would like to thank Roderic Guigó and Ferran
Sanz for granting me access to the Barcelona Biomedical Research Park.
I am grateful to the people who have read and commented on draft material.
In particular, I would like to thank José Clemente, Eduardo Eyras, Vincent
Lacroix, Michael Levitt, and Francesc Rosselló.
Last, but not least, it has been a pleasure to work out editorial matters
together with Amber Donley, Sarah Morris, and Sunil Nair of Taylor & Francis
Group.

Gabriel Valiente
Barcelona

Chapter 1
Introduction

Computational biology, the application of computational and mathematical
techniques to problems inspired by biology, has witnessed an unprecedented
expansion over the last few years. The wide availability of genomic, tran-
scriptomic, proteomic, metabolomic, and interactomic data has fostered the
development of computational techniques for their analysis. Combinatorial
pattern matching is one of the main sources of algorithmic solutions to the
problems that arise in their analysis.
This is a text on combinatorial pattern matching algorithms, with special
emphasis on computational biology. Pattern matching is well known in com-
putational biology, not only because of biological sequence alignment. Data
structures such as suffix trees and suffix arrays were developed within the
combinatorial pattern matching research community to efficiently solve spe-
cific problems that arise in computational biology. This book provides an
organized and comprehensive view of the whole field of combinatorial pattern
matching from a computational biology perspective and addresses specific
pattern matching problems within and between sequences, trees, and graphs.
Much of the material presented in the book is only available in the specialized
research literature.
The book is structured around the specific algorithmic problems that arise
when dealing with those structures that are commonly found in computa-
tional biology, namely: biological sequences (such as DNA, RNA, and protein
sequences), trees (such as phylogenetic trees and RNA structures), and graphs
(such as phylogenetic networks, metabolic pathways, protein interaction net-
works, and signaling pathways). The emphasis throughout this book is on the
search for patterns within and between biological sequences, trees, and graphs,
with the understanding of exact (rather than approximate) occurrences as pat-
terns and pairwise (rather than multiple) comparison of structures. There is
also a strong emphasis on phylogenetic trees and networks as examples of
trees and graphs in computational biology.
For each of these structures (sequences, trees, and graphs), a clear distinc-
tion is made between the problems that arise in the analysis of one structure
(finding patterns within a structure) and in the comparative analysis of two
or more structures (finding patterns common to two structures and aligning
these structures).
The patterns contained in a sequence are words, and k-mer composition is
the basis of an important form of alignment-free sequence comparison. Suffix
arrays allow for the efficient search of occurrences of a sequence in another
sequence, while generalized suffix arrays allow for the efficient search of oc-
currences common to two sequences. Besides finding common patterns, the
comparison of two sequences is also made on the basis of the Hamming dis-
tance, the Levenshtein distance, and the edit distance, as well as by means of
a global or a local alignment of the sequences.
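
As a minimal illustration of three notions from the paragraph above, the
following sketch counts the k-mers (words of length k) of a sequence, computes
the Hamming distance between two sequences of equal length, and searches a
naively built suffix array for the occurrences of one sequence in another. It
is written in Python rather than in the book's Perl or R, and the function
names and example sequences are hypothetical, not taken from the book's code.

```python
from collections import Counter

def kmer_composition(seq, k):
    """Count every word of length k occurring in seq (overlaps included)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))

def suffix_array(text):
    """Start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting the suffixes; fine for short examples only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary search the suffix array for suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix of length m is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix of length m is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

counts = kmer_composition("GATTACA", 2)
distance = hamming_distance("GATTACA", "GACTATA")
text = "GATTACATTA"
occurrences = find_occurrences(text, suffix_array(text), "TTA")
```

The two binary searches work because truncating lexicographically sorted
suffixes to the pattern length keeps them in non-decreasing order, so all
suffixes that start with the pattern form one contiguous block of the array.
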
The patterns contained in a tree are paths, and the distances (lengths of the
shortest paths) between the nodes of a tree are the basis of the nodal distance
between phylogenetic trees. The partition distance builds upon the distinction
in phylogenetic trees between descendant and non-descendant nodes. Small
subtrees common to two trees underlie both the triplets distance and the
quartets distance between phylogenetic trees. The comparison of two trees is
also made on the basis of their maximum agreement subtree.
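
The distances between the nodes of a tree mentioned above can be sketched as
follows. This is only an illustration in Python (not the book's Perl or R
code): the tree, its node names, and the function names are hypothetical, and
the sketch computes just the path lengths between leaves, not the nodal
distance itself.

```python
from collections import deque

def path_lengths_from(tree, source):
    """Breadth-first search: number of edges from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in tree[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def leaf_distances(tree):
    """Path length between every pair of leaves (nodes with one neighbor)."""
    leaves = sorted(n for n in tree if len(tree[n]) == 1)
    result = {}
    for a in leaves:
        dist = path_lengths_from(tree, a)
        for b in leaves:
            if a < b:
                result[(a, b)] = dist[b]
    return result

# Hypothetical unrooted tree with leaves A, B, C, D and internal nodes u, v:
#   A - u - v - C
#       |   |
#       B   D
tree = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"], "v": ["C", "D", "u"],
}
distances = leaf_distances(tree)
```

Since a tree has exactly one path between any two nodes, the breadth-first
search from each leaf yields the length of that path directly.
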
The patterns contained in a graph are paths and trees, and the distances
between the nodes of a graph are the basis of the nodal distance between
graphs. The path multiplicity distance between phylogenetic networks builds
upon the number of different paths between the nodes of a graph. The tri-
partition distance is based on the distinction between strict descendant, non-
strict descendant, and non-descendant nodes in a phylogenetic network. Small
subgraphs common to two graphs underlie the triplets distance between phy-
logenetic networks, and large subgraphs common to two graphs underlie the
statistical error between phylogenetic networks. The comparison of two phy-
logenetic networks is also made on the basis of their maximum agreement
subgraph.
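
The number of different paths between the nodes of a graph, on which the path
multiplicity distance builds, can be counted as in the sketch below. It assumes
a rooted directed acyclic graph stored as a dictionary of child lists; the
example network and all names are hypothetical, and the code (in Python, not
the book's Perl or R) stops at counting paths rather than computing the
distance.

```python
from functools import lru_cache

# Hypothetical rooted network (a directed acyclic graph): node h has the two
# parents a and b, so leaf y can be reached from the root r along two paths.
network = {
    "r": ["a", "b"],
    "a": ["x", "h"],
    "b": ["h", "z"],
    "h": ["y"],
    "x": [], "y": [], "z": [],
}

@lru_cache(maxsize=None)
def count_paths(source, target):
    """Number of different directed paths from source to target."""
    if source == target:
        return 1
    return sum(count_paths(child, target) for child in network[source])

leaves = sorted(node for node, children in network.items() if not children)
multiplicities = {leaf: count_paths("r", leaf) for leaf in leaves}
```

Memoizing the recursion keeps the count linear in the size of the graph for
each target, even when the number of paths itself grows quickly.
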
A thorough discussion is made of each of these specific problems, together
with detailed algorithmic solutions in pseudo-code, full Perl and R implemen-
tation, and pointers to off-the-shelf software and alternative implementations
such as those found on CPAN, the Comprehensive Perl Archive Network, or
integrated into the BioPerl project, as well as those found on CRAN, the Com-
prehensive R Archive Network, or integrated into the BioConductor project.
The Perl and R source code of all of the algorithms presented in the book is
also available at http://www.lsi.upc.edu/~valiente/comput-biol/.
The rest of this chapter contains an introduction to some of the biological,
mathematical, and computational notions used in this book, by means of a
motivating example, in an effort to make it as self-contained as possible for the
biologist, the mathematician, and the computer scientist reader alike. The
book itself is organized in a first part devoted to sequence pattern matching,
a second part on tree pattern matching, and a third part about graph pattern
matching, followed by a brief introduction to Perl and R in two appendices.
The first part contains an introductory chapter about sequences, a chapter
devoted to the problem of pattern matching within a sequence, and a chapter
on finding patterns common to two sequences. Following the same scheme, the
second part contains an introductory chapter about trees, a chapter devoted
to finding patterns within a tree, and a chapter on finding patterns common to
two trees, and the third part contains an introductory chapter about graphs,
a chapter devoted to finding patterns within a graph, and a chapter on finding
patterns common to two graphs. Each chapter includes detailed bibliographic
notes and pointers to the specialized research literature.
Throughout this book, sequence is often used as a synonym for string,
because the reader is assumed to be familiar with the notion of biological
(DNA, RNA, protein) sequences. The distinction between string and sequence
is important when referring to substructures, however, because subsequences
are not necessarily substrings. In general, subsequences will be referred to as
gapped subsequences, and the particular case of substrings (consecutive parts
of a string) will be referred to as just subsequences.

1.1 Combinatorial Pattern Matching


Pattern matching refers to the search for the occurrences of a pattern in
a text, where both the pattern and the text can be discrete structures such
as sequences, trees, and graphs. Examples of patterns in computational bi-
ology are a short nucleotide sequence, such as the TATAAAA motif found in
the promoter region of most eukaryotic genes; the amino acid sequence of a
transcription factor, such as the prokaryotic C2H2 zinc-finger motif x(2)-Cys-
x(2)-Cys-x(9)-His-x(2)-His-x(2); and an RNA secondary structure motif, such
as the CUUCGG hairpin found in the small subunit ribosomal RNA of most
bacteria. Corresponding examples of pattern matching problems are finding
motifs and transcription factor binding sites in DNA sequences, and searching
RNA sequences for recurrent structural motifs.
Sequence patterns are often described by means of regular expressions, with
a special syntax such as a vertical bar to separate alternatives, parentheses to
group patterns, and wild cards such as a dot for matching a single character,
a question mark for matching zero or one occurrence, an asterisk for matching
zero or more occurrences, and a plus sign for matching one or more occur-
rences. Regular expressions can be used as patterns for selecting and replacing
text with the utilities awk (named after the authors), ed (text editor), expr
(evaluate expression), grep (global regular expression print), sed (stream
editor), vim (visual editor improved), and in scripting programming languages
such as perl (practical extraction and report language), among others.
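As a minimal sketch of how such patterns look in Perl, the following fragment
reports the positions at which a regular expression matches a string. The sub
name find_motif and both input strings are made up for illustration; the
second pattern is a direct transcription of the zinc-finger motif quoted
above, with one-letter amino acid codes and a dot for the wild card x.

```perl
use strict;
use warnings;

# Return the 1-based start positions of all (possibly overlapping)
# matches of the regular expression $re in the string $text.
sub find_motif {
    my ($text, $re) = @_;
    my @positions;
    # The zero-width lookahead lets overlapping matches be reported.
    while ($text =~ /(?=$re)/g) {
        push @positions, pos($text) + 1;   # pos() is 0-based
    }
    return @positions;
}

# Hypothetical DNA string containing the TATAAAA motif twice.
print join(",", find_motif("GGTATAAAAGGCTATAAAAC", qr/TATAAAA/)), "\n";  # prints 3,13

# x(2)-Cys-x(2)-Cys-x(9)-His-x(2)-His-x(2) as a regular expression,
# searched in a made-up amino acid string.
my $zinc_finger = qr/.{2}C.{2}C.{9}H.{2}H.{2}/;
print join(",", find_motif("AACAACAAAAAAAAAHAAHAA", $zinc_finger)), "\n"; # prints 1
```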
Combinatorial pattern matching addresses issues of searching and matching
strings and more complex patterns such as trees, regular expressions, graphs,
point sets, and arrays, with the goal of deriving non-trivial combinatorial
properties for such structures and then exploiting these properties in order
to either achieve improved performance for the corresponding computational
problems or pinpoint properties and conditions under which searches cannot
be performed efficiently.


1.2 Computational Biology


In a broad sense, computational biology is the application of computational
and mathematical techniques to problems inspired by biology. A distinction is
often drawn between computational biology and bioinformatics, where com-
putational biology involves the development and application of theoretical
methods, mathematical modeling techniques, and computational simulation
techniques to biological data, while bioinformatics is centered on the develop-
ment and application of computational approaches and tools for the acquisi-
tion, organization, storage, analysis, and visualization of biological data.
Molecular biology itself is experiencing a shift from an understanding of
biological systems at the molecular level (nucleotide or amino acid sequences
and structures of individual genes or proteins) to an understanding of biolog-
ical systems at a system level (integrated function of hundreds or thousands
of genes and proteins in the cell, in tissues, and in whole organisms), and this
shift is also influencing computational biology.
There are two main branches in computational biology. On the one hand,
the area of biological data mining focuses on extracting hidden patterns from
large amounts of experimental data, forming hypotheses as a result. Most
of the research in computational genomics and computational proteomics be-
longs in this area. On the other hand, modeling and simulation focuses instead
on developing mathematical models and performing simulations to test hy-
potheses with in-silico experiments, providing predictions that can be tested
by in-vitro and in-vivo studies. Much of the research in mathematical bi-
ology, computational biochemistry, computational biophysics, and computa-
tional systems biology falls in this area.
Combinatorial pattern matching algorithms belong in the biological data
mining branch of computational biology.

1.3 A Motivating Example: Gene Prediction


The whole genome of an organism can be revealed from tissue samples by
using one of several DNA sequencing technologies, each of them producing a
large number of DNA fragments of various lengths that are then assembled
into the DNA sequence of the molecules in either the mitochondria or the
nucleus (for eukaryotes) or in the cytoplasm (for prokaryotes) of the cells. The
whole genomes of thousands of extant species have already been sequenced,
including 111 archaeal genomes ranging from 1,668 to 5,751,492 nucleotides;
2,167 bacterial genomes with 846 to 13,033,779 nucleotides; 2,593 eukaryote
genomes with 1,028 to 748,055,161 nucleotides; 2,651 viral genomes with 200
to 1,181,404 nucleotides; 39 viroid RNA genomes with 246 to 399 nucleotides;
and 1,504 plasmid genomes with 846 to 2,094,509 nucleotides. Extant species
represent only a small fraction of the genetic diversity that has ever existed,
however, and whole genomes of extinct species can also be sequenced from
well-conserved tissue samples.
Once the genome of a species has been sequenced, one of the first steps
towards understanding it consists in the identification of genes coding for
proteins. In prokaryotic genomes, the sequence coding for a protein occurs as
one contiguous open reading frame, while in eukaryotic genomes, it is often
spliced into several coding exons separated by non-coding introns, and these
exons can be combined in different arrangements to code for different proteins
by the cellular process of alternative splicing.

Example 1.1
The DNA sequence of Bacteriophage φ-X174, which was the first DNA genome to
be sequenced, has 11 protein coding genes within a circular single strand of
5,386 nucleotides. One of these genes is shown highlighted.
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAAT
TATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCG
GAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGC
CATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTG
GCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGT
TGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAG
AAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCT
CATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAA
GTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGC
TGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTA
TTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAA
AACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTG
TACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGC
GGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGC
CGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATT
TTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCC
GAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTA
CCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCG
TCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTC
CCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTC
CTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCC
TGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTT
AAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTC
GTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGA
GCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTAT
GCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTT
CATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCT


CTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGT
GTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTA
CTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGG
TGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAA
ATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTC
AGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATT
CATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGAC
CAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTT
ATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGAGTGTGAGGT
TATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGG
TAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTT
TTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACA
CCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTT
TTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGC
TGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCG
GTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTT
ATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTA
TGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAAT
CAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTAT
TGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAA
AAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGG
GTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCC
TAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCT
GGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTG
ATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCG
TGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTT
ACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGA
TTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGA
GATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAAT
CTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTG
GTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTT
AGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGAT
ATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAG
CTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTT
GTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTT
GGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTG
AGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCC
AATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTT
CGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTG
ACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAA
TGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGG
TTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTA
AGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATT
GCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATG
CGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGAT


TAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGT
TCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTG
CCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTC
CTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTT
GCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTT
TCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATA
TGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAA
AGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAG
CTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCA
CGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAA
GCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCG
GCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTAT
CCAACCTGCA

Protein coding regions of a DNA sequence are first transcribed into messen-
ger RNA and then translated into protein. A codon of three DNA nucleotides
is transcribed into a codon of three complementary RNA nucleotides, which
is translated in turn into a single amino acid within a protein. A fragment of
single-stranded DNA sequence has three possible reading frames, and transla-
tion takes place in an open reading frame, a sequence of codons from a certain
start codon to a certain stop codon and containing no further stop codon.

Example 1.2
Reading frame 2 of the DNA sequence of Bacteriophage φ-X174 from the pre-
vious example contains 15 open reading frames of more than 108 nucleotides,
which can potentially code for proteins of more than 36 amino acids. Only
two of them, shown highlighted in the next table, actually code for a protein.

sequence fragment    start     stop   length
ATG TGA                 17      136      120
ATG TAA                848      964      117
ATG TGA              1,001    2,284    1,284
ATG TGA              1,031    2,284    1,254
ATG TGA              1,130    2,284    1,155
ATG TGA              1,256    2,284    1,029
ATG TGA              1,421    2,284      864
ATG TGA              1,550    2,284      735
ATG TGA              1,580    2,284      705
ATG TGA              1,637    2,284      648
ATG TGA              1,715    2,284      570
ATG TGA              1,850    2,284      435
ATG TGA              1,991    2,284      294
ATG TGA              2,543    2,731      189
ATG TGA              2,552    2,731      180


The reading frame determines the actual amino acids encoded by a gene.
For instance, the DNA sequence fragment
GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA
can be read in the 5′ to 3′ direction in the following three frames:
1 GTC GCC ATG ATG GTG GTT ATT ATA CCG TCA AGG ACT GTG TGA CTA
2 TCG CCA TGA TGG TGG TTA TTA TAC CGT CAA GGA CTG TGT GAC TA
3 CGC CAT GAT GGT GGT TAT TAT ACC GTC AAG GAC TGT GTG ACT A
A fragment of double-stranded DNA sequence, on the other hand, has six
possible reading frames, three in each direction. An open reading frame begins
with the start codon ATG (methionine) in most species and ends with a stop
codon TAA, TAG, or TGA.
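The six reading frames can be listed with a few lines of Perl. This is only a
sketch: the sub names codons_in_frame and reverse_complement are chosen here
for illustration, and the input is assumed to be an upper-case string over
the letters A, C, G, and T.

```perl
use strict;
use warnings;

# Codons of $seq in reading frame $frame (1, 2, or 3); a trailing
# incomplete codon of one or two nucleotides is kept.
sub codons_in_frame {
    my ($seq, $frame) = @_;
    my @codons = substr($seq, $frame - 1) =~ /.{1,3}/g;
    return @codons;
}

# Reverse complement of an upper-case DNA string over A, C, G, T.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = reverse $seq;        # scalar context: reverses the string
    $rc =~ tr/ACGT/TGCA/;         # complement each nucleotide
    return $rc;
}

# Three frames on the given strand, three on its reverse complement.
my $fragment = 'GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA';
for my $strand ($fragment, reverse_complement($fragment)) {
    for my $frame (1, 2, 3) {
        print join(' ', codons_in_frame($strand, $frame)), "\n";
    }
}
```

The first three lines of output reproduce the three frames of the sequence
fragment used in the text; the remaining three belong to the opposite strand.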
The identification of genes coding for proteins in a DNA sequence is a very
difficult task. Even a simple organism such as Bacteriophage φ-X174, with
a single-stranded DNA sequence of only 5,386 nucleotides, has a total of 117
open reading frames, only 11 of which actually code for a protein. There are
several other biological signals that help the computational biologist in the
task of gene finding, but to start with, the known protein with the shortest
sequence has 8 amino acids and, thus, short open reading frames, with fewer
than 3 + 24 + 3 = 30 nucleotides, cannot code for a protein.
A first algorithmic problem consists in extracting all open reading frames
in the three reading frames of a DNA sequence fragment. The problem has
to be solved on the reverse complement of the sequence as well if the DNA is
double stranded.
Given a fragment of DNA sequence S of n nucleotides, let S[i] denote the
i-th nucleotide of sequence S, for 1 ≤ i ≤ n. Thus, in the sequence
S = GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA, which
has n = 45 nucleotides, S[1] = G, S[2] = T, S[3] = C, and S[n] = A. Let
also S[i, ..., j], where i ≤ j, denote the fragment of S containing nucleotides
S[i], S[i+1], ..., S[j]. For instance, S[1, ..., 4] = GTCG, and S[1, ..., n] = S.
Therefore, S[i, ..., i] = S[i] for any 1 ≤ i ≤ n.
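The notation carries over to Perl with one shift of index, since substr
counts positions from 0. The helper below is a sketch; its name, fragment, is
an assumption made here and is not used elsewhere in the book.

```perl
use strict;
use warnings;

# S[i, ..., j] for 1-based, inclusive positions $i <= $j,
# expressed with the 0-based substr($s, offset, length).
sub fragment {
    my ($s, $i, $j) = @_;
    return substr($s, $i - 1, $j - $i + 1);
}

my $s = 'GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA';
my $n = length $s;
print fragment($s, 1, 4), "\n";     # prints GTCG
print fragment($s, $n, $n), "\n";   # prints A
```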
With this notation, an open reading frame is a fragment S[i, ..., j], of length
j − i + 1, such that S[i, ..., i + 2] is the start codon ATG and S[j − 2, ..., j]
is one of the stop codons TAA, TAG, or TGA. This definition is not yet complete,
however. The fragment has to be at least 30 nucleotides long, that is, it must
fulfill j − i + 1 ≥ 30. And it cannot contain any other stop codon in frame,
that is, it must also fulfill the condition S[k, ..., k + 2] ∉ {TAA, TAG, TGA}
for k = i + 3, i + 6, ..., j − 5. In the sequence fragment
S = GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA, for instance,
S[7, ..., 42] is an open reading frame, as it begins
with S[7, ..., 9] = ATG, ends with S[40, ..., 42] = TGA, and has no other
codon between S[10] and S[39] equal to TAA, TAG, or TGA.
GTC GCC ATG ATG GTG GTT ATT ATA CCG TCA AGG ACT GTG TGA CTA
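The conditions on an open reading frame can be collected into a single test.
The following Perl predicate is a sketch of that definition; the sub name
is_open_reading_frame and the optional minimum-length argument are
assumptions made here, as is the explicit check that the length is a multiple
of three, which the definition implies by reading whole codons.

```perl
use strict;
use warnings;

# True if S[i, ..., j] (1-based, inclusive) is an open reading frame of
# at least $min_len nucleotides (30 unless stated otherwise).
sub is_open_reading_frame {
    my ($s, $i, $j, $min_len) = @_;
    $min_len = 30 unless defined $min_len;
    my $len = $j - $i + 1;
    return 0 if $i < 1 || $j > length($s);
    return 0 if $len < $min_len || $len % 3 != 0;
    return 0 unless substr($s, $i - 1, 3) eq 'ATG';                  # start codon
    return 0 unless substr($s, $j - 3, 3) =~ /^(?:TAA|TAG|TGA)$/;   # stop codon
    # No further stop codon at in-frame positions k = i+3, i+6, ..., j-5.
    for (my $k = $i + 3; $k <= $j - 5; $k += 3) {
        return 0 if substr($s, $k - 1, 3) =~ /^(?:TAA|TAG|TGA)$/;
    }
    return 1;
}

my $s = 'GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA';
print is_open_reading_frame($s, 7, 42) ? "yes" : "no", "\n";   # prints yes
print is_open_reading_frame($s, 1, 42) ? "yes" : "no", "\n";   # prints no
```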
The reading frame determines a partition of the DNA sequence fragment S
into codons of three consecutive nucleotides. In reading frame 1, the first
codon is S[1, ..., 3], the second codon is S[4, ..., 6], and so on. In reading
frame 2, however, the first codon is S[2, ..., 4], and the second codon is
S[5, ..., 7]. The first codon in reading frame 3 is S[3, ..., 5].
In a given reading frame, the codons can be accessed by sliding a window
of length three over the sequence, starting at position 1, 2, or 3, depending on
the reading frame. The sliding window is thus a kind of looking glass under
which a codon of the sequence can be seen and accessed:
   1-3  4-6  7-9  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ..n
→ [GTC] GCC  ATG  ATG  GTG  GTT  ATT  ATA  CCG  TCA  AGG  ACT  GTG  TGA  CTA
   GTC [GCC] ATG  ATG  GTG  GTT  ATT  ATA  CCG  TCA  AGG  ACT  GTG  TGA  CTA
   GTC  GCC [ATG] ATG  GTG  GTT  ATT  ATA  CCG  TCA  AGG  ACT  GTG  TGA  CTA
   ...
   GTC  GCC  ATG  ATG  GTG  GTT  ATT  ATA  CCG  TCA  AGG  ACT  GTG [TGA] CTA
   GTC  GCC  ATG  ATG  GTG  GTT  ATT  ATA  CCG  TCA  AGG  ACT  GTG  TGA [CTA]
(The codon under the sliding window is shown in brackets.)
Consider, as a first example, the problem of finding an open reading frame
in a reading frame of a sequence, and let S[k, . . . , k+2] be the codon under the
sliding window in the given reading frame. Starting with an initial position k
given by the reading frame, the sliding window has to be displaced by three
nucleotides each time until accessing a start codon and then continue sliding
by three nucleotides each time until accessing a stop codon. Again, this is not
quite the whole picture. The reading frame of the sequence fragment could
contain no start codon at all, or it could contain a start codon but no stop
codon, and the search for the beginning or the end of an open reading frame
might go beyond the end of the sequence.
The first start codon in the k-th reading frame of a given DNA sequence
fragment S of n nucleotides can be found by sliding a window S[i, ..., i + 2]
of three nucleotides along S[k, ..., n], until either i + 2 > n or
S[i, ..., i + 2] = ATG. In the following description, the initial position i of
the candidate start codon is incremented by three as long as the codon does not
fall off the sequence (that is, i + 2 ≤ n) and is not a start codon (that is,
S[i, ..., i + 2] ≠ ATG).

i ← k
while i + 2 ≤ n and S[i, ..., i + 2] ≠ ATG do
    i ← i + 3
if i + 2 ≤ n then
    output S[i, ..., i + 2]
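The search for the first start codon translates almost literally into Perl.
The following sub is a sketch that keeps the 1-based positions of the
description and converts to the 0-based positions of substr only when the
sequence is accessed; the name first_start_codon and the convention of
returning 0 when no start codon exists are assumptions made here.

```perl
use strict;
use warnings;

# 1-based position of the first start codon in reading frame $k
# (1, 2, or 3) of $s, or 0 if that frame contains no ATG.
sub first_start_codon {
    my ($s, $k) = @_;
    my $n = length $s;
    my $i = $k;
    # Slide the window by one codon while it fits and is not ATG.
    while ($i + 2 <= $n && substr($s, $i - 1, 3) ne 'ATG') {
        $i += 3;
    }
    return $i + 2 <= $n ? $i : 0;
}

my $s = 'GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA';
print first_start_codon($s, 1), "\n";   # prints 7
print first_start_codon($s, 2), "\n";   # prints 0
```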

After having found a start codon S[i, ..., i + 2], the first stop codon can be
found by sliding a window S[j, ..., j + 2] of three nucleotides, this time along
S[i + 3, ..., n], until either j + 2 > n or S[j, ..., j + 2] ∈ {TAA, TAG, TGA}.
In the following description, the initial position j of the candidate stop codon
is incremented by three as long as the codon does not fall off the sequence
(that is, j + 2 ≤ n) and the candidate codon is not a stop codon (that is,
S[j, ..., j + 2] ∉ {TAA, TAG, TGA}).

j ← i + 3
while j + 2 ≤ n and S[j, ..., j + 2] ∉ {TAA, TAG, TGA} do
    j ← j + 3
if j + 2 ≤ n then
    output S[j, ..., j + 2]

Now, the problem of extracting the first open reading frame in the k-th
reading frame of a DNA sequence fragment S of length n can be solved by
putting together the search for a start codon and the search for a stop codon.
In the following description, the start codon is S[i, ..., i + 2] and the stop
codon is S[j, ..., j + 2] of the sequence and, thus, the open reading frame
S[i, ..., j + 2] is output.

i ← k
while i + 2 ≤ n and S[i, ..., i + 2] ≠ ATG do
    i ← i + 3
if i + 2 ≤ n then
    j ← i + 3
    while j + 2 ≤ n and S[j, ..., j + 2] ∉ {TAA, TAG, TGA} do
        j ← j + 3
    if j + 2 ≤ n then
        output S[i, ..., j + 2]
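Putting the two searches together in Perl gives the following sketch. The sub
name first_open_reading_frame is hypothetical, and returning the start
position, the end position, and the fragment itself (or an empty list when no
open reading frame exists) is a convention chosen here for illustration.

```perl
use strict;
use warnings;

# First open reading frame in reading frame $k (1, 2, or 3) of $s, as a
# list (start, end, fragment) with 1-based positions; empty list if none.
sub first_open_reading_frame {
    my ($s, $k) = @_;
    my $n = length $s;
    my $i = $k;
    $i += 3 while $i + 2 <= $n && substr($s, $i - 1, 3) ne 'ATG';
    return unless $i + 2 <= $n;                     # no start codon
    my $j = $i + 3;
    $j += 3 while $j + 2 <= $n
               && substr($s, $j - 1, 3) !~ /^(?:TAA|TAG|TGA)$/;
    return unless $j + 2 <= $n;                     # no stop codon
    return ($i, $j + 2, substr($s, $i - 1, $j + 2 - $i + 1));
}

my $s = 'GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA';
my ($start, $end, $orf) = first_open_reading_frame($s, 1);
print "$start $end $orf\n";   # prints 7 42 ATGATGGTGGTTATTATACCGTCAAGGACTGTGTGA
```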

Notice that only the first start codon is found, and the first stop codon
after this start codon will then signal the end of the first open reading frame.
There may be other start codons in the sequence fragment between the first
start codon and the first stop codon, however, which would signal shorter
open reading frames contained in the first open reading frame found. Also,
the first open reading frame might be shorter than 30 nucleotides, much too
short to actually code for a protein.
The problem of extracting all open reading frames of at least 30 nucleotides
in the k-th reading frame of a DNA sequence fragment S of length n ≥ 30
can be solved by repeating the previous procedure for each start codon found
in turn, checking that the open reading frames thus found have at least 30
nucleotides, as follows.

i ← k
while i + 2 ≤ n do
    if S[i, ..., i + 2] = ATG then
        j ← i + 3
        while j + 2 ≤ n and S[j, ..., j + 2] ∉ {TAA, TAG, TGA} do
            j ← j + 3
        if j + 2 ≤ n then
            if j + 2 − i + 1 ≥ 30 then
                output S[i, ..., j + 2]
    i ← i + 3

Finally, the problem of extracting all open reading frames of at least 30
nucleotides in the three reading frames of a DNA sequence fragment S of length
n ≥ 30 can be solved by repeating the previous procedure for each reading
frame and for each start codon in turn, checking again that the open reading
frames thus found have at least 30 nucleotides. In the following description,
the whole algorithm is wrapped into a procedure that takes the DNA se-
quence fragment S as input and reports each of the open reading frames of S
as output.

procedure extract_open_reading_frames(S)
    n ← length(S)
    for i ← 1, 2, 3 do
        while i + 2 ≤ n do
            if S[i, ..., i + 2] = ATG then
                j ← i + 3
                while j + 2 ≤ n and S[j, ..., j + 2] ∉ {TAA, TAG, TGA} do
                    j ← j + 3
                if j + 2 ≤ n then
                    if j + 2 − i + 1 ≥ 30 then
                        output S[i, ..., j + 2]
            i ← i + 3

The previous algorithm for extracting all open reading frames in the three
reading frames of a given DNA sequence fragment can be implemented in
Perl in a straightforward way. An open reading frame S[i, ..., j + 2] is
represented as the fragment of sequence $seq with starting position $i and
length $j+2-$i+1, that is, substr($seq,$i,$j+2-$i+1). Notice that positions in
Perl strings, like Perl array indices, do not start at 1 but, rather, at 0 and,
thus, the first codon is substr($seq,0,3), the last nucleotide is
substr($seq,$n-1,1), and the three reading frames $r are numbered 0, 1, 2.
This is all shown in the following
Perl script.
sub extract_open_reading_frames {
    my $seq = shift;
    for my $r (0, 1, 2) {                                    # reading frames
        for (my $i = $r; $i <= length($seq) - 3; $i += 3) {
            if (substr($seq, $i, 3) eq "ATG") {              # start codon
                my $j = $i + 3;
                while ($j <= length($seq) - 3 &&
                       substr($seq, $j, 3) ne "TAA" &&
                       substr($seq, $j, 3) ne "TAG" &&
                       substr($seq, $j, 3) ne "TGA") {
                    $j += 3;
                }
                if ($j <= length($seq) - 3) {                # stop codon found
                    my $len = $j + 2 - $i + 1;
                    if ($len >= 30) {
                        print substr($seq, $i, $j + 2 - $i + 1), "\n";
                    }
                }
            }
        }
    }
}
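When the open reading frames are to be processed further rather than printed,
it may be more convenient to collect them together with their positions. The
following variant is only a sketch along those lines; the sub name
open_reading_frames, the optional minimum-length argument, and the choice of
reporting 1-based start and end positions are assumptions made here.

```perl
use strict;
use warnings;

# All open reading frames of at least $min_len nucleotides (30 unless
# stated otherwise) in the three reading frames of $s, each returned as
# an array reference [start, end, fragment] with 1-based positions.
sub open_reading_frames {
    my ($s, $min_len) = @_;
    $min_len = 30 unless defined $min_len;
    my $n = length $s;
    my @orfs;
    for my $k (1, 2, 3) {
        for (my $i = $k; $i + 2 <= $n; $i += 3) {
            next unless substr($s, $i - 1, 3) eq 'ATG';     # start codon
            my $j = $i + 3;
            $j += 3 while $j + 2 <= $n
                       && substr($s, $j - 1, 3) !~ /^(?:TAA|TAG|TGA)$/;
            next unless $j + 2 <= $n;                       # no stop codon
            my $len = $j + 2 - $i + 1;
            push @orfs, [$i, $j + 2, substr($s, $i - 1, $len)]
                if $len >= $min_len;
        }
    }
    return @orfs;
}

my $s = 'GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA';
for my $orf (open_reading_frames($s)) {
    print join(' ', @$orf), "\n";
}
```

For double-stranded DNA, the same sub would be called a second time on the
reverse complement of the sequence.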

The algorithm for extracting all open reading frames in the three reading
frames of a given DNA sequence fragment can also be implemented in R in a
straightforward way, as shown in the following R script.
extract.open.reading.frames <- function (seq) {
  for (i in 1:3) {
    while (i+2 <= nchar(seq)) {
      if (substr(seq,i,i+2) == "ATG") {
        j <- i + 3
        while (j+2 <= nchar(seq) &&
               substr(seq,j,j+2) != "TAA" &&
               substr(seq,j,j+2) != "TAG" &&
               substr(seq,j,j+2) != "TGA") {
          j <- j + 3
        }
        if (j+2 <= nchar(seq)) {
          if (j+2-i+1 >= 30) {
            print(c(i,j+2,substr(seq,i,j+2)))
          }
        }
      }
      i <- i + 3
    }
  }
}

There are indeed 104 open reading frames of at least 30 nucleotides and up
to 1,284 nucleotides in the DNA sequence of Bacteriophage φ-X174.

> seq <- "GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...CGGATA"
> extract.open.reading.frames(seq)
[1] "133" "393" "ATGAGAAAATTCGACCTATCCTTG...TCATGA"
[1] "250" "393" "ATGCTTGGCACGTTCGTCAAGGAC...TCATGA"
[1] "568" "843" "ATGGTACGCTGGACTTTGTGGGAT...GAGTGA"
[1] "643" "843" "ATGTTCATCCCGTCAACATTCAAA...GAGTGA"
[1] "715" "843" "ATGGCGTCGAGCGTCCGGTTAAAG...GAGTGA"
[1] "2395" "2922" "ATGTTTCAGACTTTTATTTCTCGC...AAGTGA"
[1] "2578" "2922" "ATGGATACATCTGTCAACGCCGCT...AAGTGA"
[1] "2827" "2922" "ATGGTTTGGTCTAACTTTACCGCT...AAGTGA"
[1] "3037" "3066" "ATGTGCTTGCTACCGATAACAATA...CTGTAG"
[1] "3076" "3684" "ATGCTGGTATTAAATCTGCCATTC...AAATGA"
[1] "3109" "3684" "ATGTTCCTAACCCTGATGAGGCCG...AAATGA"
[1] "3124" "3684" "ATGAGGCCGCCCCTAGTTTTGTTT...AAATGA"
[1] "3316" "3684" "ATGCTTGGGAGCGTGCTGGTGCTG...AAATGA"
[1] "3340" "3684" "ATGCTTCCTCTGCTGGTATGGTTG...AAATGA"
[1] "3439" "3684" "ATGAGACTCAAAAAGAGATTGCTG...AAATGA"
[1] "3508" "3684" "ATGCACAAAATGAGATGCTTGCTT...AAATGA"
[1] "3517" "3684" "ATGAGATGCTTGCTTATCAACAGA...AAATGA"
[1] "3742" "3930" "ATGGCTCTTCTCATATTGGCGCTA...GATTGA"
[1] "3784" "3930" "ATGTCGTCACTGATGCTGCTTCTG...GATTGA"
[1] "3796" "3930" "ATGCTGCTTCTGGTGTGGTTGATA...GATTGA"
[1] "3826" "3930" "ATGGTATTGATAAAGCTGTTGCCG...GATTGA"
[1] "3886" "3930" "ATGGTATTGGCTCTAATTTGTCTA...GATTGA"
[1] "3946" "4263" "ATGTTTTCATGCCTCCAAATCTTG...AGTTAA"
[1] "4186" "4263" "ATGGTGATATGTATGTTGACGGCC...AGTTAA"
[1] "4198" "4263" "ATGTTGACGGCCATAAGGCTGCTT...AGTTAA"
[1] "4234" "4263" "ATGAGTTTGTATCTGTTACTGAGA...AGTTAA"
[1] "4267" "4323" "ATGAATTGGCACAATGCTACAATG...CTATAG"
[1] "4288" "4323" "ATGTGCTCCCCCAACTTGATATTA...CTATAG"
[1] "4429" "4500" "ATGAGTATAATTACCCCAAAAAGA...CTATGA"
[1] "4465" "4500" "ATGAGTGTTCAAGATTGCTGGAGG...CTATGA"
[1] "4537" "4611" "ATGCAATGCGACAGGCTCATGCTG...GATTAG"
[1] "4555" "4611" "ATGCTGATGGTTGGTTTATCGTTT...GATTAG"
[1] "4561" "4611" "ATGGTTGGTTTATCGTTTTTGACA...GATTAG"
[1] "4621" "4857" "ATGATAATCCCAATGCTTTGCGTG...AGTTAA"
[1] "4633" "4857" "ATGCTTTGCGTGACTATTTTCGTG...AGTTAA"
[1] "4699" "4857" "ATGATTCACACGCCGACTGCTATC...AGTTAA"
[1] "4744" "4857" "ATGGTACAGCTAATGGCCGTCTTC...AGTTAA"
[1] "4756" "4857" "ATGGCCGTCTTCATTTCCATGCGG...AGTTAA"
[1] "4774" "4857" "ATGCGGTGCACTTTATGCGGACAC...AGTTAA"
[1] "4882" "5064" "ATGGTTACAGTATGCCCATCGCAG...GTCTAG"
[1] "4957" "5064" "ATGCTAAAGGTGAGCCGCTTAAAG...GTCTAG"
[1] "5008" "5064" "ATGTGGCTAAATACGTTAACAAAA...GTCTAG"
[1] "17" "136" "ATGACGCAGAAGTTAACACTTTCG...AAATGA"

[1] "230" "331" "ATGAGGAGAAGTGGCTTAATATGC...TTTTAA"
[1] "284" "331" "ATGAGTCACATTTTGTTCATGGTA...TTTTAA"
[1] "302" "331" "ATGGTAGAGATTCTCTTGTTGACA...TTTTAA"
[1] "848" "964" "ATGTCTAAAGGTAAAAAACGTTCT...TTTTAA"
[1] "1001" "2284" "ATGTCTAATATTCAAACTGGCGCC...TCGTGA"
[1] "1031" "2284" "ATGCCGCATGACCTTTCCCATCTT...TCGTGA"
[1] "1130" "2284" "ATGGACGCCGTTGGCGCTCTCCGT...TCGTGA"
[1] "1256" "2284" "ATGAAGGATGGTGTTAATGCCACT...TCGTGA"
[1] "1421" "2284" "ATGCCTGACCGTACCGAGGCTAAC...TCGTGA"
[1] "1550" "2284" "ATGACGACTTCTACCACATCTATT...TCGTGA"
[1] "1580" "2284" "ATGGGTCTGCAAGCTGCTTATGCT...TCGTGA"
[1] "1637" "2284" "ATGCAGCGTTACCATGATGTTATT...TCGTGA"
[1] "1715" "2284" "ATGCGCTCTAATCTCTGGGCATCT...TCGTGA"
[1] "1850" "2284" "ATGTTTACTCTTGCGCTTGTTCGT...TCGTGA"
[1] "1991" "2284" "ATGAAGGATGTTTTCCGTTCTGGT...TCGTGA"
[1] "2543" "2731" "ATGCTGGTAATGGTGGTTTTCTTC...CTTTGA"
[1] "2552" "2731" "ATGGTGGTTTTCTTCATTGCATTC...CTTTGA"
[1] "2639" "2731" "ATGCCGACCCTAAATTTTTTGCCT...CTTTGA"
[1] "2732" "2776" "ATGGTCGCCATGATGGTGGTTATT...GTGTGA"
[1] "2741" "2776" "ATGATGGTGGTTATTATACCGTCA...GTGTGA"
[1] "2744" "2776" "ATGGTGGTTATTATACCGTCAAGG...GTGTGA"
[1] "2816" "2878" "ATGTTGGTTTCATGGTTTGGTCTA...CGCTGA"
[1] "4349" "4405" "ATGGTTTTTAGAGAACGAGAAGAC...TGCTGA"
[1] "51" "221" "ATGAGTCGAAAAATTATCTTGATA...AACTGA"
[1] "390" "848" "ATGAGTCAAGTTACTGAACAATCC...ATGTAA"
[1] "681" "848" "ATGGAAGGCGCTGAATTTACGGAA...ATGTAA"
[1] "1038" "1196" "ATGACCTTTCCCATCTTGGCTTCC...CTGTAG"
[1] "1212" "1259" "ATGTCCCTCATCGTCACGTTTATG...TCATGA"
[1] "1263" "1388" "ATGGTGTTAATGCCACTCCTCTCC...ATTTGA"
[1] "1272" "1388" "ATGCCACTCCTCTCCCGACTGTTA...ATTTGA"
[1] "1317" "1388" "ATGCCGCTTTTCTTGGCACGATTA...ATTTGA"
[1] "1449" "1553" "ATGAGCTTAATCAAGATGATGCTC...AAATGA"
[1] "1464" "1553" "ATGATGCTCGTTATGGTTTCCGTT...AAATGA"
[1] "1467" "1553" "ATGCTCGTTATGGTTTCCGTTGCT...AAATGA"
[1] "1476" "1553" "ATGGTTTCCGTTGCTGCCATCTCA...AAATGA"
[1] "1599" "1775" "ATGCTAATTTGCATACTGACCAAG...CGTTAG"
[1] "1650" "1775" "ATGATGTTATTTCTTCATTTGGAG...CGTTAG"
[1] "1653" "1775" "ATGTTATTTCTTCATTTGGAGGTA...CGTTAG"
[1] "1686" "1775" "ATGACGCTGACAACCGTCCTTTAC...CGTTAG"
[1] "1743" "1775" "ATGATGTTGATGGAACTGACCAAA...CGTTAG"
[1] "1746" "1775" "ATGTTGATGGAACTGACCAAACGT...CGTTAG"
[1] "1842" "1928" "ATGGCACTATGTTTACTCTTGCGC...CTTTGA"
[1] "1962" "1994" "ATGGCAACTTGCCGCCGCGTGAAA...CTATGA"
[1] "1998" "2234" "ATGTTTTCCGTTCTGGTGATTCGT...ATGTGA"
[1] "2061" "2234" "ATGCGCCTTCGTATGTTTCTCCTG...ATGTGA"

[1] "2073" "2234" "ATGTTTCTCCTGCTTATCACCTTC...ATGTGA"
[1] "2166" "2234" "ATGATTATGACCAGTGTTTCCAGT...ATGTGA"
[1] "2172" "2234" "ATGACCAGTGTTTCCAGTCCGTTC...ATGTGA"
[1] "2856" "2891" "ATGCCGCGGATTGGTTTCGCTGAA...TATTAA"
[1] "2931" "3917" "ATGTTTGGTGCTATTGCTGGCGGT...AAATAA"
[1] "2982" "3917" "ATGTCTAAATTGTTTGGAGGCGGT...AAATAA"
[1] "3069" "3917" "ATGGGTGATGCTGGTATTAAATCT...AAATAA"
[1] "3156" "3917" "ATGGCTAAAGCTGGTAAAGGACTT...AAATAA"
[1] "3357" "3917" "ATGGTTGACGCCGGATTTGAGAAT...AAATAA"
[1] "3399" "3917" "ATGCAACTGGACAATCAGAAAGAG...AAATAA"
[1] "3432" "3917" "ATGCAAAATGAGACTCAAAAAGAG...AAATAA"
[1] "3522" "3917" "ATGCTTGCTTATCAACAGAAGGAG...AAATAA"
[1] "3570" "3917" "ATGGAAAACACCAATCTTTCCAAG...AAATAA"
[1] "3615" "3917" "ATGCGCCAAATGCTTACTCAAGCT...AAATAA"
[1] "3624" "3917" "ATGCTTACTCAAGCTCAAACGGCT...AAATAA"
[1] "3681" "3917" "ATGACTCGCAAGGTTAGTGCTGAG...AAATAA"
A related algorithmic problem consists of finding the longest open reading
frame of a given DNA sequence fragment. The longest open reading frame
often determines the correct reading frame for eukaryotes, where translation
usually takes place in one reading frame only. Again, the problem has to be
solved on the reverse complement of the sequence as well if the DNA is double
stranded.
The previous algorithm for extracting all open reading frames can be
extended to find the longest open reading frame by keeping the positions of
the start and stop codons of the longest open reading frame found so far. In
the following description, the start codon of the longest open reading frame
found so far is $S[i', \ldots, i'+2]$, and the corresponding stop codon is
$S[j', \ldots, j'+2]$.

function longest_open_reading_frame(S)
  i′ ← j′ ← 0
  n ← length(S)
  for i ← 1, 2, 3 do
    while i + 2 ≤ n do
      if S[i, …, i + 2] = ATG then
        j ← i + 3
        while j + 2 ≤ n and S[j, …, j + 2] ∉ {TAA, TAG, TGA} do
          j ← j + 3
        if j + 2 ≤ n then
          if j + 2 − i + 1 > j′ + 2 − i′ + 1 then
            i′ ← i
            j′ ← j
      i ← i + 3
  return (i′, j′ + 2)

The previous algorithm for finding the longest open reading frame of a given
DNA sequence fragment can be implemented in Perl in a straightforward way,
as shown in the following Perl script.
sub longest_open_reading_frame {
    my $seq = shift;
    # 0-based positions of the start and stop codons of the best frame so far
    my ($ii, $jj) = (0, 0);
    for my $r (0, 1, 2) {    # the three reading frames
        for (my $i = $r; $i <= length($seq) - 3; $i += 3) {
            if (substr($seq, $i, 3) eq "ATG") {
                my $j = $i + 3;
                # advance codon by codon until a stop codon or the end
                while ($j <= length($seq) - 3 &&
                       substr($seq, $j, 3) ne "TAA" &&
                       substr($seq, $j, 3) ne "TAG" &&
                       substr($seq, $j, 3) ne "TGA") {
                    $j += 3;
                }
                if ($j <= length($seq) - 3) {
                    my $len = $j + 2 - $i + 1;
                    if ($len > $jj + 2 - $ii + 1) {
                        $ii = $i;
                        $jj = $j;
                    }
                }
            }
        }
    }
    return [$ii, $jj + 2];
}

The algorithm for finding the longest open reading frame of a given DNA
sequence fragment can also be easily implemented in R, as shown in the
following R script.
longest.open.reading.frame <- function(seq) {
  ii <- jj <- 0
  for (i in 1:3) {
    while (i + 2 <= nchar(seq)) {
      if (substr(seq, i, i + 2) == "ATG") {
        j <- i + 3
        while (j + 2 <= nchar(seq) &&
               substr(seq, j, j + 2) != "TAA" &&
               substr(seq, j, j + 2) != "TAG" &&
               substr(seq, j, j + 2) != "TGA") {
          j <- j + 3
        }
        if (j + 2 <= nchar(seq)) {
          if (j + 2 - i + 1 > jj + 2 - ii + 1) {
            ii <- i
            jj <- j
          }
        }
      }
      i <- i + 3
    }
  }
  c(ii, jj + 2)
}
The longest of the 104 open reading frames with at least 30 nucleotides in
the DNA sequence of Bacteriophage φ-X174 has indeed 2,284 − 1,001 + 1 =
1,284 nucleotides.
> longest.open.reading.frame(seq)
[1] 1001 2284
The actual reading frame it belongs to can be obtained from the remainder of
an integer division: open reading frame $S[i, \ldots, j]$ comes from reading
frame $((i - 1) \bmod 3) + 1$.
> ((1001 - 1) %% 3) + 1
[1] 2
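
For double-stranded DNA the search has to be repeated on the reverse
complement of the sequence. A minimal R sketch of that step, together with
the reading-frame formula, might look as follows. The helper names
reverse.complement and reading.frame are illustrative and not taken from the
book, and the input is assumed to contain only the uppercase letters A, C, G
and T.

```r
# Reverse complement of a DNA sequence given as a single character string.
# Assumption: only uppercase A, C, G, T occur; chartr leaves other symbols as-is.
reverse.complement <- function(seq) {
  complemented <- chartr("ACGT", "TGCA", seq)
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# Reading frame (1, 2 or 3) of an open reading frame that starts at the
# 1-based position i.
reading.frame <- function(i) ((i - 1) %% 3) + 1

reverse.complement("ATGC")
reading.frame(1001)
```

Positions found on the reverse complement refer to that strand, so they
would still have to be mapped back to the original sequence if coordinates
on the forward strand are needed.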

Bibliographic Notes
Most of the research in combinatorial pattern matching is reflected in the
various editions of the Annual Symposium on Combinatorial Pattern Matching
(Apostolico et al. 1992; 1993; Crochemore and Gusfield 1994; Galil and
Ukkonen 1995; Hirschberg and Myers 1996; Apostolico and Hein 1997;
Farach-Colton 1998; Crochemore and Paterson 1999; Giancarlo and Sankoff 2000;
Amir and Landau 2001; Apostolico and Takeda 2002; Baeza-Yates et al. 2003;
Sahinalp et al. 2004; Apostolico et al. 2005; Lewenstein and Valiente 2006; Ma
and Zhang 2007; Ferragina and Landau 2008). There are also several books on
specific aspects of combinatorial pattern matching, focused on algorithms on
sequences (Stephen 1998; Crochemore and Rytter 1994; Navarro and Raffinot
2002; Crochemore and Rytter 2003; Smyth 2003; Crochemore et al. 2007).
There are several books on algorithms in computational biology which also
address combinatorial pattern matching, including (Waterman 1995; Gusfield
1997; Pevzner 2000; Valiente 2002; Dwyer 2003; Jones and Pevzner 2004;
Deonier et al. 2005; Kasahara and Morishita 2006). A brief introduction to
bioinformatics was written by Cohen (2004). Systems biology is a quite recent
discipline, although there are already a couple of textbooks (Alon 2006;
Palsson 2006), and a brief introduction to systems biology was written by
Kitano (2002a;b).
The algorithmic techniques used in this book are rather simple, in order to
make life easier for the biologist reader. A basic understanding of algorithms
and computing will be more than sufficient to follow this book: the most
advanced algorithmic technique used is perhaps dynamic programming, and the
algorithms are presented iteratively rather than recursively. The use of
dynamic programming in computational biology was reviewed by Giegerich (2000)
and Eddy (2004b).
Alternative Perl and R implementations for some of the algorithms presented
in this book can be found within the BioPerl project (Stajich et al. 2002;
Birney et al. 2009) and in the Bioconductor project (Gentleman et al. 2005;
Hahne et al. 2008), respectively. See the appendices for further
bibliographic notes on Perl and R.
The DNA sequence of Bacteriophage φ-X174, the first complete genome to
be sequenced, was determined by Sanger et al. (1977). The first complete
mitochondrial genome sequence of an extinct species was reported by Haddrath
and Baker (2001). See also (Sanger and Dowding 1996; Green et al. 2008).

Part I

Sequence Pattern Matching

Chapter 2
Sequences

Sequences are fundamental mathematical objects that count among the most
common combinatorial structures in computer science and computational
biology. Basic notions underlying combinatorial algorithms on sequences, such
as counting, generation, and traversal algorithms, as well as appropriate data
structures for the representation of sequences, are the subject of this
introductory chapter.

2.1 Sequences in Mathematics


The notion of sequence most often found in discrete mathematics is that of
a (finite or infinite) ordered list of elements. The same element can appear
multiple times at different positions in the sequence. A sequence thus defines
an ordered multiset, that is, an ordered set of elements, each belonging to the
multiset with a certain multiplicity.
Some applications of sequences in mathematics involve labeled sequences,
where the elements have additional attributes such as, for instance, their
multiplicity in the sequence.

Example 2.1
The following three sequences (shown with the elements separated by spaces,
for clarity) are identical as multisets of elements, but they are all different
sequences.
A B C C D D D E E E E E F F F F F F F F
A B C D E F C D E F D E F E F E F F F F
F F F F F F F F E E E E E D D D C C B A

Actually, they all define the same labeled sequence: an ordered multiset with
elements A and B, element C twice, three occurrences of element D, five
occurrences of element E, and eight occurrences of element F.
(A,1) (B,1) (C,2) (D,3) (E,5) (F,8)
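
The claim of Example 2.1 can be checked with a short R sketch that counts the
multiplicity of each element. The function name labeled.sequence is an
illustrative assumption, and sequences are assumed to be given as character
strings with one character per element.

```r
# Multiplicity of each element of a sequence, with elements in sorted order.
labeled.sequence <- function(s) table(strsplit(s, "")[[1]])

s1 <- "ABCCDDDEEEEEFFFFFFFF"
s2 <- "ABCDEFCDEFDEFEFEFFFF"
s3 <- "FFFFFFFFEEEEEDDDCCBA"

labeled.sequence(s1)

# The three sequences differ as strings but share the same multiplicities.
c(s1 == s2, s2 == s3, s1 == s3)
all(labeled.sequence(s1) == labeled.sequence(s2))
all(labeled.sequence(s1) == labeled.sequence(s3))
```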


2.1.1 Counting Labeled Sequences


Determining the number of possible labeled sequences is a trivial exercise
in mathematics. Here, counting refers to determining the number of possible
sequences that have certain properties, while generation is the process of
obtaining the actual sequences with these properties such as, for instance, all
labeled sequences.
Assume the elements are drawn from the alphabet {A, B}. There are $2^1 = 2$
ways to make a sequence of length 1 with elements from this alphabet:
A
B
Each of these two sequences can be extended in two different ways to make
a sequence of length 2 and, thus, there are $2 \cdot 2 = 2^2 = 4$ possible
sequences of length 2 with elements from that alphabet:
AA
AB
BA
BB
Each of these four sequences can now be extended in two different ways
to make a sequence of length 3 and, thus, there are $2 \cdot 4 = 2^3 = 8$
possible sequences of length 3 with elements from that alphabet:
AAA
AAB
ABA
ABB
BAA
BAB
BBA
BBB
In general, there are $2^n$ possible sequences of length $n$ with elements
from that alphabet, and there are $m^n$ possible sequences of length $n$ with
elements from an alphabet of size $m$, as shown in the following R script for
sequence length $1 \le n \le 12$ and alphabet size $1 \le m \le 6$.
> outer(1:12, 1:6, function(n, m) m^n)
      [,1] [,2]   [,3]     [,4]      [,5]       [,6]
 [1,]    1    2      3        4         5          6
 [2,]    1    4      9       16        25         36
 [3,]    1    8     27       64       125        216
 [4,]    1   16     81      256       625       1296
 [5,]    1   32    243     1024      3125       7776
 [6,]    1   64    729     4096     15625      46656
 [7,]    1  128   2187    16384     78125     279936
 [8,]    1  256   6561    65536    390625    1679616
 [9,]    1  512  19683   262144   1953125   10077696
[10,]    1 1024  59049  1048576   9765625   60466176
[11,]    1 2048 177147  4194304  48828125  362797056
[12,]    1 4096 531441 16777216 244140625 2176782336
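
Counting can be complemented by generation. The following R sketch produces
the actual sequences of a given length rather than just their number. The
function name all.sequences is an illustrative assumption, and the alphabet
is assumed to be passed as a character vector in increasing order.

```r
# All sequences of length n over an alphabet, listed in lexicographical order
# (assuming the alphabet vector itself is given in increasing order).
all.sequences <- function(alphabet, n) {
  # One column per position; expand.grid varies the first column fastest.
  grid <- expand.grid(rep(list(alphabet), n), stringsAsFactors = FALSE)
  # Paste the columns from last to first, so that the leftmost position
  # of each resulting sequence is the one that varies slowest.
  do.call(paste0, unname(as.list(grid)[n:1]))
}

all.sequences(c("A", "B"), 3)

# The number of generated sequences agrees with m^n.
length(all.sequences(c("A", "C", "G", "T"), 5)) == 4^5
```

Since the number of sequences grows as $m^n$, explicit generation of this
kind is only practical for short sequences and small alphabets.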
Assume now the elements are drawn from the alphabet {A, B, C}. There
are $\binom{3+1-1}{3-1} = \binom{3}{2} = 3$ ways to make a labeled sequence
of length 1 with elements from this alphabet:
(A,1)
(B,1)
(C,1)

There are $\binom{3+2-1}{3-1} = \binom{4}{2} = 6$ ways to make a labeled
sequence of length 2 with elements from that alphabet:
(A,1) (B,1)
(A,1) (C,1)
(A,2)
(B,1) (C,1)
(B,2)
(C,2)

Also, there are $\binom{3+3-1}{3-1} = \binom{5}{2} = 10$ ways to make a
labeled sequence of length 3 with elements from that alphabet:
(A,1) (B,1) (C,1)
(A,1) (B,2)
(A,1) (C,2)
(A,2) (B,1)
(A,2) (C,1)
(A,3)
(B,1) (C,2)
(B,2) (C,1)
(B,3)
(C,3)

In general, there are $\binom{3+n-1}{3-1} = (n+2)(n+1)/2$ possible labeled
sequences of length $n$ with elements from that alphabet, and there are
$\binom{m+n-1}{m-1}$ possible labeled sequences of length $n$ with elements
from an alphabet of size $m$, as shown in the following R script for labeled
sequence length $1 \le n \le 20$ and alphabet size $1 \le m \le 8$.
> outer(1:20, 1:8, function(n, m) choose(m + n - 1, m - 1))
      [,1] [,2] [,3] [,4]  [,5]  [,6]   [,7]   [,8]
 [1,]    1    2    3    4     5     6      7      8
 [2,]    1    3    6   10    15    21     28     36
 [3,]    1    4   10   20    35    56     84    120
 [4,]    1    5   15   35    70   126    210    330
 [5,]    1    6   21   56   126   252    462    792
 [6,]    1    7   28   84   210   462    924   1716
 [7,]    1    8   36  120   330   792   1716   3432
 [8,]    1    9   45  165   495  1287   3003   6435
 [9,]    1   10   55  220   715  2002   5005  11440
[10,]    1   11   66  286  1001  3003   8008  19448
[11,]    1   12   78  364  1365  4368  12376  31824
[12,]    1   13   91  455  1820  6188  18564  50388
[13,]    1   14  105  560  2380  8568  27132  77520
[14,]    1   15  120  680  3060 11628  38760 116280
[15,]    1   16  136  816  3876 15504  54264 170544
[16,]    1   17  153  969  4845 20349  74613 245157
[17,]    1   18  171 1140  5985 26334 100947 346104
[18,]    1   19  190 1330  7315 33649 134596 480700
[19,]    1   20  210 1540  8855 42504 177100 657800
[20,]    1   21  231 1771 10626 53130 230230 888030
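
The binomial formula can be cross-checked by generating the labeled sequences
themselves. The R sketch below enumerates every way of distributing $n$
occurrences among the $m$ elements of the alphabet. The function names
count.vectors and labeled.sequences are illustrative assumptions, and the
output order is not necessarily the lexicographical order used in the
listings above.

```r
# All vectors of m non-negative multiplicities that sum to n, one per row.
count.vectors <- function(m, n) {
  if (m == 1) return(matrix(n, nrow = 1))
  # Fix the multiplicity k of the first element, then recurse on the rest.
  do.call(rbind, lapply(n:0, function(k)
    cbind(k, count.vectors(m - 1, n - k), deparse.level = 0)))
}

# All labeled sequences of length n over the alphabet, written as
# "(x,multiplicity)" pairs; elements with multiplicity 0 are omitted.
labeled.sequences <- function(alphabet, n) {
  counts <- count.vectors(length(alphabet), n)
  apply(counts, 1, function(cnt) {
    keep <- cnt > 0
    paste0("(", alphabet[keep], ",", cnt[keep], ")", collapse = " ")
  })
}

labeled.sequences(c("A", "B", "C"), 2)

# The number of generated labeled sequences agrees with the formula.
length(labeled.sequences(c("A", "B", "C"), 3)) == choose(3 + 3 - 1, 3 - 1)
```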

2.2 Sequences in Computer Science


The notion of sequence most often found in computer science is also that
of an ordered list of elements, where it is often called a string of characters
or symbols drawn from an underlying set or alphabet. The alphabet itself is
usually ordered, thus allowing the definition of an ordering among sequences,
called the dictionary or lexicographical order.
A sequence $(x_1, x_2, \ldots, x_k)$ precedes a sequence
$(x'_1, x'_2, \ldots, x'_\ell)$ in lexicographical order if $x_1 < x'_1$, or
$x_1 = x'_1$ and $(x_2, \ldots, x_k)$ precedes $(x'_2, \ldots, x'_\ell)$
in lexicographical order, where the empty sequence precedes any non-empty
sequence.
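
The recursive definition translates almost literally into R. The sketch below
is only an illustration: the names precedes and chars are assumptions,
sequences are represented as character vectors with one element per
position, and the ordering of the alphabet is taken to be the ordering R
applies to single characters.

```r
# Split a character string into a vector of single characters.
chars <- function(s) strsplit(s, "")[[1]]

# TRUE if sequence x strictly precedes sequence y in lexicographical order.
precedes <- function(x, y) {
  if (length(y) == 0) return(FALSE)  # nothing precedes the empty sequence
  if (length(x) == 0) return(TRUE)   # empty precedes any non-empty sequence
  if (x[1] != y[1]) return(x[1] < y[1])
  precedes(x[-1], y[-1])             # equal first elements: compare the rest
}

precedes(chars("A"), chars("AA"))
precedes(chars("AAAB"), chars("AAB"))
precedes(chars("B"), chars("ABBB"))
```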

Example 2.2
The 2 + 4 + 8 + 16 = 30 sequences of length 1 through 4 over the alphabet
{A, B} are shown in lexicographical order.
A
AA
AAA
AAAA
AAAB
AAB
AABA
AABB
AB
ABA
ABAA
ABAB
ABB
ABBA
ABBB
B
BA
BAA
BAAA
BAAB
BAB
BABA
BABB
BB
BBA
BBAA
BBAB
BBB
BBBA
BBBB

In labeled sequences, both the alphabet and the attributes associated with
the elements of the sequences are usually ordered, allowing also the definition
of a lexicographical ordering among labeled sequences. A labeled sequence
$((x_1, n_1), (x_2, n_2), \ldots, (x_k, n_k))$ with $x_1 < x_2 < \cdots < x_k$
precedes another labeled sequence
$((x'_1, n'_1), (x'_2, n'_2), \ldots, (x'_\ell, n'_\ell))$ with
$x'_1 < x'_2 < \cdots < x'_\ell$ in lexicographical order if $x_1 < x'_1$, or
$x_1 = x'_1$ and $n_1 < n'_1$, or $x_1 = x'_1$ and $n_1 = n'_1$ and
$((x_2, n_2), \ldots, (x_k, n_k))$ precedes
$((x'_2, n'_2), \ldots, (x'_\ell, n'_\ell))$ in lexicographical order, where
the empty labeled sequence precedes any non-empty labeled sequence.

Example 2.3
The 2 + 3 + 4 + 5 = 14 labeled sequences of length 1 through 4 over the
alphabet {A, B} are shown in lexicographical order.
(A,1)
(A,1) (B,1)
(A,1) (B,2)
(A,1) (B,3)
(A,2)
(A,2) (B,1)
(A,2) (B,2)
(A,3)
(A,3) (B,1)
(A,4)
(B,1)
(B,2)
(B,3)
(B,4)
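
The ordering of labeled sequences can be sketched in R along the same lines.
Here a labeled sequence is represented, as an assumption made for this
illustration only, by a list with a character vector elem of distinct
elements in increasing order and a vector mult of their multiplicities. The
names labeled and labeled.precedes are hypothetical.

```r
# Build a labeled sequence from elements and their multiplicities.
labeled <- function(elem = character(0), mult = integer(0)) {
  list(elem = elem, mult = mult)
}

# TRUE if labeled sequence x strictly precedes labeled sequence y.
labeled.precedes <- function(x, y) {
  if (length(y$elem) == 0) return(FALSE)
  if (length(x$elem) == 0) return(TRUE)
  # Compare the first elements, then their multiplicities.
  if (x$elem[1] != y$elem[1]) return(x$elem[1] < y$elem[1])
  if (x$mult[1] != y$mult[1]) return(x$mult[1] < y$mult[1])
  # Equal first pairs: compare the remaining pairs.
  labeled.precedes(labeled(x$elem[-1], x$mult[-1]),
                   labeled(y$elem[-1], y$mult[-1]))
}

# (A,1) (B,3) precedes (A,2), which precedes (B,1), as in Example 2.3.
labeled.precedes(labeled(c("A", "B"), c(1, 3)), labeled("A", 2))
labeled.precedes(labeled("A", 2), labeled("B", 1))
```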

A first assessment of the similarities and differences between two sequences
can be made by means of the symmetric difference of the corresponding labeled
sequences, that is, the labeled sequence of the elements in which the two
sequences differ. The length of the symmetric difference of the labeled
sequences is at least the difference in length of the two sequences, with
equality when one labeled sequence contains the other, and different weights
can be given to each element of the alphabet in a particular application.
Distance measures over sequences will be discussed in the next two chapters.

Example 2.4
Consider the following two sequences over the alphabet {A, B}.
AABAAABBAAAABBB
ABBAABBBAAABBBBAAAABBBBB
The corresponding labeled sequences, of length 15 and 24, are as follows.
(A,9) (B,6)
(A,10) (B,14)
Their symmetric difference is thus the following labeled sequence, of length 9.
(A,1) (B,8)
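
The computation in Example 2.4 can be reproduced with a few lines of R. This
is a minimal sketch: the function names labeled.sequence and
symmetric.difference are illustrative assumptions, and the alphabet is passed
explicitly so that elements absent from a sequence get multiplicity 0.

```r
# Multiplicity of each alphabet element in a sequence given as a string.
labeled.sequence <- function(seq, alphabet) {
  chars <- strsplit(seq, "")[[1]]
  sapply(alphabet, function(a) sum(chars == a))
}

# Symmetric difference of the labeled sequences of x and y: for each element,
# the absolute difference of its multiplicities.
symmetric.difference <- function(x, y, alphabet) {
  abs(labeled.sequence(x, alphabet) - labeled.sequence(y, alphabet))
}

d <- symmetric.difference("AABAAABBAAAABBB",
                          "ABBAABBBAAABBBBAAAABBBBB",
                          c("A", "B"))
d        # multiplicities: A = 1, B = 8
sum(d)   # length of the symmetric difference: 9
```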

2.2.1 Traversing Labeled Sequences


Most algorithms on sequences require a systematic method of accessing the
elements of a sequence, and combinatorial pattern matching algorithms are no
exception. The most common method for accessing the elements of a sequence
is by traversing the ordered list of elements, from first to last.
The following Perl script illustrates the traversal of a sequence represented
as a character string.
my $seq = "AABAAABBAAAABBB";

for (my $i = 0; $i < length($seq); $i++) {
    print substr($seq, $i, 1), "\n";
}

TWO-LEAVED SOLOMON’S
SEAL: Smilacina trifolia.
Solomon Zigzag. Smilacina racemosa.
False Spikenard.

Found in June, in moist woods and brookside copses.


The leafy stalk (from 1 to 2 feet high) is oblique and zigzag in
gesture; it has a strong fibre, and a smooth surface, and is light
green.
The leaf is oval—long in proportion to its width,—tapering to a
slender tip, with an entire and much ruffled margin, and 3 noticeable
ribs; the surface is finely downy; the color a strong, vigorous green.
The leaves have almost no stem at all, and are placed alternately
along the stalk.
The flower is small, with 6 petal-like parts and 6 stamens, all
greenish white. Many flowers are gathered in a branching cluster
upon the end of the leafy curving stalk.
The berries are pale red, speckled with dark. Gray’s Manual in
speaking of this genus says its name is a diminution of Smilax, “to
which, however, these plants bear little resemblance.” For a similar
reason, perhaps, this plant is called “false” after the true Spikenard!
SOLOMON ZIGZAG: S.
racemosa.
Carrion Flower. Smilax herbacea.

Found in moist meadows, along river banks, and in wayside


thickets in June.
The round, smooth, tough, green stalk grows upward at first, but
soon swings over to one side with a strong curve, leaning on
surrounding plants for support, and further assisting its progress by
means of many small twining tendrils.
The large leaf is nearly round, and somewhat heart-shaped at the
base; it has an entire edge, and strongly marked parallel ribs;
though thin, it is tough, and has a smooth, shining surface, and a
bright green color. The leaves are alternate at short intervals, almost
crowded. The tendrils spring from the angles of the leaves.
The small flower has its parts in threes, with twice three stamens;
it is dull light green in color. Very many flowers, some 20 to 40, on
short flower-stems, are gathered together in a round head; this head
is on the end of a stem from 4 to 6 or more inches long. The flowers
exhale a disagreeable odor which gives rise to its folk-name.
The fruit is a round blue-black berry; as it has not the
objectionable odor of the flower, the vine becomes more attractive in
its fruiting than in its blossoming season.
CARRION FLOWER: Smilax
herbacea.
False Hellebore. Veratrum viride.
Indian Poke.
Poor Annie.

Found blossoming in June, in wet hollows, and along the borders


of upland streams.
Its simple, erect stalk, growing from 2 to 4 feet high, is very leafy;
it is round, and stout, an inch or more in diameter at the base;
smooth, and green.
The very large leaf, from 10 to 12 inches long, is broadly oval,
tapering at the tip, and deeply pleated on its many parallel ribs; the
surface is finely downy, especially beneath; in color a bright grass
green.
The flower is about three fourths of an inch across; the 6
spreading petal-like parts are leaf-like in texture, and of a yellowish-
green color. The flowers, growing on very short foot-stems, with a
narrow leaf (or bract) to each one, are thickly set in a pyramidal
cluster, on the top of the leafy stalk.
The juice, particularly that of the root, is said to be a strong acrid
poison. It is a plant of splendid vigor, and curves; in early spring,
before the twiggery shows any green, it pushes up from the dark
earth a large, lush, green bud, charged to the full with the impulse
of growth; later its leafage becomes more or less mingled with that
of its neighbors, and so does not receive the recognition it deserves
for its striking qualities.
FALSE HELLEBORE:
Veratum viride—⅓ life size.
Indian Cucumber-root. Medeola Virginiana.

Found in moist rich woods, during June.


The slender, smooth, green stalk varies from 1 to 3 feet in height;
a light fleecy wool is loosely caught around it.
The oblong leaf tapers at both ends, it is 3-ribbed, and the margin
is entire; its color is a fresh full green. The leaves clasp the stalk,
and are set in two whorls, of 3 to 9 in number.
The flower is inconspicuous on account of its light greenish color;
its 3 petals and 3 calyx-parts are long, narrow, and much recurved;
the 6 stamens are colored a dark crimson-red, with brown tips, and
the 3 divisions of the pistil are long and spreading. Three or four
blossoms on their slender stems hang beneath the upper whorl of
leaves.
The green, spidery flower of the Medeola is curious rather than
pretty,—the charm of the plant lies in the slender stalk, with its two
whorls of fine green leaves lightly poised about it. When the dark
blue-black berry is ripe its stem takes on an upward curve, and at
about the same time a crimson-red spot appears at the base of the
leaves.
INDIAN CUCUMBER-ROOT:
Medeola Virginiana.
Great Solomon’s Seal. Polygonatum giganteum.
Giant Solomon’s Seal.

Found on rich banks, in partial shade, during June.


The single leafy stalk grows from 3 to 8 feet high; at first it stands
erect, but later the tip curves over and downward; it is tough-fibred,
smooth and fine in surface, and green.
The large broadly oval leaf is sometimes 6 inches long and 3
broad; it is pointed at the tip, and partly clasps the stalk at its base;
the margin is entire, the ribs parallel and deeply marked, and the
texture is fine, while the surface is smooth. In color it is a cool dark
green. The leaves occur alternately along the stalk.
The tubular flower is from ½ an inch to about ¾ of an inch long,
and spreads into 6 divisions. The color is pale green, the tips of the
6 stamens which first show in the opening of the flower are pale
straw-color. The flowers swing on slender stems, from the angles of
the leaves, in clusters of two or three (or sometimes singly,) forming
a row upon the curving stalk.
Following the flower-bells come the globular blue-black berries,
about the size of a pea; they are fully as charming as the blossoms,
but seldom remain long on the stalk, since they are fully appreciated
by the birds who devour them quickly. This is a plant of fine gesture,
and splendid curves, too large to be figured full-size upon the
accompanying plate.
GREAT SOLOMON’S SEAL:
Polygonatum giganteum.
Nodding Lily. Lilium Canadense.
Field Lily.

Found in grass fields, and moist meadows, during July.


The single, leafy, and smooth stalk is 2 or 3, or more, feet high. In
color it is green, often inclined to take on a dull reddish-brown hue
near the flowers.
The leaf is long, pointed at the tip, and clasping the stalk at the
base; it is of strong fibre; in color a vigorous green. The leaves are
inclined to grow in whorls about the stalk, but are often placed
irregularly near the top.
The large and spreading, bell-shaped flower is formed of 6 petal-
like parts, whose tips are pointed, and curved a trifle,—3 of the parts
have prominent ribs down the middle; it is orange-yellow in color, on
the inside speckled with many small reddish-brown dots. The 6
stamens and the club-shaped pistil have dull tawny-orange or
reddish-brown tips. From 1 to 3, or more, flowers swing nodding on
their short stems from the top of the stalk.
When the orange bells of the Field Lily may be seen gaily nodding
here and there just above the feathers of the red-top grass in level
meadows, midsummer has come in.
NODDING LILY: Lilium
Canadense.
Flame Lily. Lilium Philadelphicum.
Wood Lily.

Found in upland meadows, woods, and along copse-borders in


July.
The stalk grows from 1 to 2 feet high; it is single, leafy, and
strong-fibred and smooth; in color a purplish-green.
The leaf is long, narrow, and pointed, of a firm strong texture, and
smooth surface. The color is a fall-toned green. The leaves are
placed upon the stalk, in whorls of 5 or more, with an occasional
one, escaping regularity, lodged between.
The 6 petal-like parts of the large flower-bell are narrowed at their
bases into little stems; 3 of the parts have pronounced midribs. The
color of this Lily is orange-red, or flame, irregularly marked on the
inside with large spots of reddish brown; the 6 stamens, and the
pistil, have reddish-brown tips. Usually a single flower, but
sometimes two, on slender stems, are erect upon the top of the
stalk.
While the general direction of the stalk is upright the flower sways
from side to side with a free grace of movement. Sometimes a single
plant will stray into some little open clearing of a lonely wood where
its flame warms the whole space; or again its bell swings out from
the rocky slope of a mountain pasture. Near the seaboard it grows in
communities, where its color, intensified by the sea air, gives it the
folk-name of “Flame Lily.” The plant is said to be especially
indifferent to drought.
FLAME LILY: L.
Philadelphicum.
PICKEREL-WEED FAMILY.
PONTEDERIACEÆ.

Pickerel-weed. Pontederia cordata.

Found in shallow water from July to September.


The height is variable, from 4 to 6 inches or more. The stalk is
stout, round, smooth, and green; it grows with sharp-angled turns
below the water.
The leaf is large and arrow-shaped, with a blunt tip; the margin is
entire, the fibre tough and leathery, the surface extremely smooth,
and the color a dark strong green. The stem is round, large, and
sheathes the stalk.
The irregularly 4-parted flower has a short tube; the upper
division is erect, broad, and 3-lobed; the 3 lower divisions are long,
narrow, and spreading; it has 6 stamens. In color it is a dull bluish
violet, the broad division marked with two round greenish-yellow
spots. The flowers grow in a thick blunt spike, and bloom spirally;
the stem is enfolded about midway by a small sheathing, green leaf.
The flowers are fleeting with the day. In general lines the Pickerel-
weed is full of vigor, and strong swinging curves, but there is a
primitive lack of finish in its growth.
PICKEREL-WEED:
Pontederia cordata.
CAT-TAIL FAMILY.
TYPHACEÆ.

Bur-reed. Sparganium simplex.

Found along the borders of ponds, blossoming in June, July and


August.
The erect, simple, stalk is round, smooth, strong, and fine in fibre;
of a bright yellow-green.
The leaf is a long, narrow, green ribbon, pointed at the tip, and
growing thick toward the base where it sheathes the stalk; it has a
smooth surface, and exceedingly fine texture; of a beautiful grass-
green color.
The small flowers are of 2 kinds in densely crowded, round heads
which are threaded on a long curving flower-stem; the lower ones
are the seed-bearing heads, and develop into green burs about 1
inch in diameter, the upper, stamen-bearing ones, are light and fluffy
with many fine thread-like stamens, of a dull white color, tipped with
gray.
One involuntarily says of this plant, How Japanese! Our Western
neighbors have shown in their drawings that they appreciate the
long sweeping curves and original gesture found in many water
plants. The name of the genus means a fillet, and is derived from
the ribbon-like leaf.
BUR-REED: Sparganium
simplex.
ARUM FAMILY.
ARACEÆ.

Skunk Cabbage. Symplocarpus fœtidus.

Found in March or early April, in damp meadows, and moist or


swampy woodlands.
The leaves and hooded flower-clusters rise from the ground.
The large and conspicuous leaves, which do not unfold until the
flowering season is past, vary from 1 to 2 feet in length; they are
oval in shape with a blunt tip and heart-shaped base, have entire
margins, firm texture, and smooth surfaces, and resemble the
garden Day-Lily because of their many parallel ribs. In color a light
clear green.
The unnoticeable 4-parted greenish-yellow flowers are gathered
closely on a fleshy round club (that is about an inch in diameter) and
enveloped by a protecting hood. This hood is large and sharp-
pointed, of a very thick and leathery texture, with a smooth and dull
glossy surface; it is a dull brown or mahogany color, mottled or
streaked with darker purple or red. From 1 to 3 or 4 of these hood-
protected flower-heads are crowded close together, along with the
rolled up leaf, in the hold of several dull greenish or slightly purple
leaf-like parts which serve as weather blankets wrapped about the
whole plant.
After the flowers mature the hood shrivels and falls away, the
blankets disappear, and the pointed leaf-bud then unfolds, the leaves
pushing forth with fine springing curves. The strong odor of the
plant prevents close observation, and denies to it the praise its
growth deserves. In habit it is highly gregarious, and favorable
meadows are thickly sprinkled with these rich-hued hoods of our
earliest spring flower.
SKUNK CABBAGE:
Symplocarpus fœtidus.
Golden Club. Orontium aquaticum.

Found in ponds, or standing water, in early May.


The flower-spike and leaves lie upon the water, or hold themselves
just above its surface, springing on long stems from the submerged
root.
The oblong, narrow leaf is thick and juicy, with an entire margin,
that is conspicuously folded together and joined at the base, and a
smooth surface. In color it is a rich-toned green with a bloom above,
the under surface being shining and pale and much colored by fine
speckles of bronze. The leaf-stem is round, porous, smooth, and
bronzy.
The small, flat, indefinite flowers are inset closely upon a round
club, in a smooth, curiously patterned mosaic-work of a bright
golden-yellow color. This club terminates a smooth porous stem,
which becomes large and is flattened on one side beneath the
flowers; it is streaked with bronze beneath the water, but clear white
above its surface.
“I don’t feel as if I should search for it!” said the farmer when he
saw this plant for the first time.
GOLDEN CLUB: Orontium
aquaticum.
Green Dragon. Arisæma Dracontium.
Dragon-root.

Found in moist shade, in May.


The root produces usually one leaf-stem, 1 to 2 feet high, which
bears the blossom-stem.
The leaflets of the compound leaf are 5 to 13 in number, in shape
oblong, tapering at both ends, and entire; the surface is smooth,
and a rich dark green color. The stem is smooth, round, and light
green.
The little flowers are borne on the base of a long wand, which
projects one or two inches beyond the ensheathing wrapper,—both
wand and envelope are green; the stem is round, smooth, and
green, and grows from the side of the leaf-stem.
This well-named plant has many marks of the dragon upon it; its
solitary leaf spreads, particularly when unfolding, a green claw-like
hand above the flowers, and all about the club-shaped root-bulb
grow numerous flesh-colored bulblets highly suggestive of his
dragon-ship’s toes!
GREEN DRAGON: Arisæma
Dracontium.
Jack-in-the-Pulpit. Arisæma triphyllum.
Indian Turnip.

Found in damp shady nooks, blossoming in May.


The root sends up two or three leaf-bearing stems which vary
from 8 to 20 inches in height.
The leaf is compound, and often grows to a considerable size; the
3 leaflets are a broad oval shape, tapering at the tip, the ribs much
marked, the fibre fine, and surface smooth; a fall, juicy green. The
stem is long, round and smooth, and sheathed at the foot.
The inconspicuous flowers are borne at the base of a green wand,
which is wrapped around by a leaf-like sheath, its tip curving over
the head of Jack and making the sounding board of his airy green
pulpit. This sheath is tougher in texture and more shining than the
leaves, and varies in color; on the stamen-bearing plant it is green,
striped with greenish-white, while that of the pistil-bearing plant is
green, striped with blackish violet. The Jack-in-the-Pulpit is borne on
a stout, round, shining stem, which springs from between the
sheaths of the leaf-stem.
In late summer the ripened seeds are found, a thick short club of
bright red berries; the leaves of the seed-bearing plants often grow
very large and are a rich, dark green color. It will be observed that
Jack-in-the-Pulpit is “brother to dragons”!
JACK-IN-THE-PULPIT: A.
triphyllum.
Sweet Flag. Acorus Calamus.

Found in swamps and on the borders of runlets, in flower in June.


The Sweet Flag has no common stalk; the leaves springing directly
from the rootstock; long and narrow (from 3 to 4 feet long), sword-
like in shape, thin and even on the edge, and rising to a sharply
defined ridge on one side of the center; the surface is like silk, and
the texture is firm and fine; of a beautiful grass-green color.
The minute greenish-yellow flowers are closely set in a long club-
like spike, growing out abruptly from the middle of a leaf-like stalk.
The large root is sought and eaten for its pungent and aromatic
flavor; the tender young flower-club also has the same qualities of
flavor and is eaten by children, who in certain localities call it “the
grater.”
SWEET FLAG: Acorus Calamus.
WATER-PLANTAIN FAMILY.
ALISMACEÆ.

Arrowhead. Sagittaria variabilis.

Found in shallow water and low moist grounds, from July to August.


The leaf- and flower-stems rise 1 or 2 feet from the root.
The leaf varies very much in size and proportion; it is arrow-
shaped, with strong ribs, a smooth and fine surface, and is borne on
a single stem that rises directly from the root. It is green, of a
vigorous quality.
The flowers have 3 rounding petals, concave, and of an
exceptionally pure white color. They are of two kinds, sometimes
borne on separate stems, sometimes upon the same stem (in which
case the stamen-bearing flowers are placed upon its upper part); the
stamens are many, and of a pure orange yellow, making a heart of
gold in their white blossoms, while the pistil appears in its flower as
a beautiful light green ball. The calyx is green and 3-parted. The
flowers are arranged in whorls of threes and fours, upon inch long
foot-stems, which are placed along a large many-angled, green
stem, rising from the root.
The Arrowhead is attractive in all its parts, and in its gesture; it is
truly decorative, and suggests the subjects treated by the best
Gothic artists.