Combinatorial pattern matching algorithms in
computational biology using Perl and R 1st Edition
Gabriel Valiente
Author(s): Gabriel Valiente
ISBN(s): 9781420069730, 142006973X
Edition: 1
Year: 2009
Language: english
Combinatorial Pattern
Matching Algorithms in
Computational Biology
Using Perl and R
Series Editors
Alison M. Etheridge
Department of Statistics
University of Oxford
Louis J. Gross
Department of Ecology and Evolutionary Biology
University of Tennessee
Suzanne Lenhart
Department of Mathematics
University of Tennessee
Philip K. Maini
Mathematical Institute
University of Oxford
Shoba Ranganathan
Research Institute of Biotechnology
Macquarie University
Hershel M. Safer
Weizmann Institute of Science
Bioinformatics & Bio Computing
Eberhard O. Voit
The Wallace H. Coulter Department of Biomedical Engineering
Georgia Tech and Emory University
Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
4th Floor, Albert House
1-4 Singer Street
London EC2A 4BQ
UK
Gabriel Valiente
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The
authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides
licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
QH324.2.V35 2009
572.80285‑‑dc22 2009003714
Foreword
Preface
1 Introduction
1.1 Combinatorial Pattern Matching
1.2 Computational Biology
1.3 A Motivating Example: Gene Prediction
Bibliographic Notes
2 Sequences
2.1 Sequences in Mathematics
2.1.1 Counting Labeled Sequences
2.2 Sequences in Computer Science
2.2.1 Traversing Labeled Sequences
2.3 Sequences in Computational Biology
2.3.1 Reverse Complementing DNA Sequences
2.3.2 Counting RNA Sequences
2.3.3 Generating DNA Sequences
2.3.4 Representing Sequences in Perl
2.3.5 Representing Sequences in R
Bibliographic Notes
5 Trees
5.1 Trees in Mathematics
5.1.1 Counting Labeled Trees
5.2 Trees in Computer Science
5.2.1 Traversing Rooted Trees
5.3 Trees in Computational Biology
5.3.1 The Newick Linear Representation
5.3.2 Counting Phylogenetic Trees
5.3.3 Generating Phylogenetic Trees
5.3.4 Representing Trees in Perl
5.3.5 Representing Trees in R
Bibliographic Notes
8 Graphs
8.1 Graphs in Mathematics
8.1.1 Counting Labeled Graphs
8.2 Graphs in Computer Science
8.2.1 Traversing Directed Graphs
8.3 Graphs in Computational Biology
8.3.1 The eNewick Linear Representation
8.3.2 Counting Phylogenetic Networks
8.3.3 Generating Phylogenetic Networks
8.3.4 Representing Graphs in Perl
8.3.5 Representing Graphs in R
Bibliographic Notes
References
When, more than 25 years ago, Zvi Galil and I decided to organize the
presentations at a NATO workshop into the volume entitled “Combinatorial
Algorithms on Words,” windows were building fixtures and webs were inhabited
by spiders and navigated, more or less inadvertently, by doomed insects.
The “Atlas of Protein Sequences” by Margaret Dayhoff listed a handful of
Cytochrome C proteins and the first sequenced genome was more than a decade
away. Nonetheless, we were convinced that the few, scattered properties and
constructs of pattern matching available at the time had the potential to spark
and shape a specialty of algorithm design that would not fade in comparison
with already well-established areas such as graph and numerical algorithms.
It is safe to say that our volume presented a detailed account of the state of the
art, attracted attention to data structures such as the suffix trees that are still
the subject of deep study, and also listed most of the relevant open problems
that would be tackled, with varying success, in the following years. A reader
willing to invest some time on the contents of that volume could move
rather quickly to the issues at the frontier and start making contributions
depending only on their own taste and skills.
About ten years later, as we attempted to collect the state of the art in
“Pattern Matching Algorithms,” we faced an amazingly expanded scenario.
This time, each one of the contributed chapters had to be a rather synthetic
survey of the many intervening results, some of which had started new
specialized sections, notable among which were two-dimensional and tree matching.
The neophyte interested in the subject could now take steps from reading one
or two chapters, but would then have to invest considerable additional time
in the study of numerous references. Many things had happened in between,
bringing new meanings to webs and windows and growing protein databases
and sequence repositories that now included sizable genomes such as that of yeast.
Since then, additional beautiful volumes have been produced, and conferences
such as CPM, RECOMB, SPIRE, and WABI have started, which keep
churning out an unrelenting crop of problems, applications, and results. All
of this shows how challenging it is to embark on a compendium of the state of
the art, and thus how admirable is the volume that results from the effort of
Gabriel Valiente.
The reader of this nicely structured volume will find a well-rounded exposition
of the traditional issues accompanied by an up-to-date account of more
recent developments, such as graph similarity and search. As is well known, a
great many pattern matching problems have been inspired over the years by
Alberto Apostolico
Atlanta
Acknowledgments
This book is based on graduate lectures taught at the Technical University
of Catalonia, Barcelona, and also invited lectures at the Phylogenetics
Programme of the Isaac Newton Institute for Mathematical Sciences,
held September 3 to December 21, 2007, in Cambridge, UK; at the
Gulbenkian PhD Program in Computational Biology of the Instituto Gulbenkian
de Ciência, held February 4–8, 2008, and January 26–30, 2009, in Oeiras,
Portugal; and at the Lipari International Summer School on Bioinformatics
and Computational Biology, held February 4–8, 2008, in Lipari Island,
Italy. I am very grateful to Vincent Moulton, Mike Steel, and Daniel Huson
(Isaac Newton Institute for Mathematical Sciences), to Jorge Carneiro and
Manuela Cordeiro (Gulbenkian PhD Program in Computational Biology), and
to Alfredo Ferro, Raffaele Giancarlo, Concettina Guerra, and Michael Levitt
(Lipari International Summer School on Bioinformatics and Computational
Biology) for their continuous encouragement and support.
The very idea of presenting combinatorial pattern matching problems in a
Gabriel Valiente
Barcelona
© 2009 by Taylor & Francis Group, LLC
Example 1.1
The DNA sequence of Bacteriophage φ-X174, which was the first genome to
be sequenced, has 11 protein-coding genes within a circular single strand of
5,386 nucleotides. One of these genes is shown highlighted.
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAAT
TATCTTGATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCG
GAAAATGAGAAAATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGC
CATCAACTAACGATTCTGTCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTG
GCACGTTCGTCAAGGACTGGTTTAGATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGT
TGACATTTTAAAAGAGCGTGGATTACTATCTGAGTCCGATGCTGTTCAACCACTAATAGGTAAG
AAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGCCTCTATTAAGCT
CATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTTCGATTTTCTGACGAGTAACAAA
GTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGC
TGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCGTCATTGCTTA
TTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTACGGAA
AACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTG
TACGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGC
GGAAGGAGTGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGC
CGTTGCGAGGTACTAAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATT
TTAATTGCAGGGGCTTCGGCCCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCC
GAGCGTATGCCGCATGACCTTTCCCATCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTA
CCATTTCAACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCGCTCTCCG
TCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTACTGTAGACATTTTTACTTTTTATGTC
CCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGCCACTC
CTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTTGGCACGATTAACCC
TGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACAACTATTTT
AAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTC
GTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGA
GCTTTCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTAT
GCTAATTTGCATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCATGATGTTATTTCTT
CATTTGGAGGTAAAACCTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCT
CTGGGCATCTGGCTATGATGTTGATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGT
GTTCAACAGACCTATAAACATTCTGTGCCGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTA
CTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGG
TGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTGTATGGCAACTTGCCGCCGCGTGAA
ATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGTTTAAGATTGCTGAGGGTC
AGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGAAGGCTTCCCATT
CATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGATTATGAC
CAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTT
ATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGAGTGTGAGGT
TATAACGCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGG
TAGGTTTTCTGCTTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTT
TTTTTCTGATAAGCTGGTTCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACA
CCTAAAGCTACATCGTCAACGTTATATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTT
TTCTTCATTGCATTCAGATGGATACATCTGTCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGC
TGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCG
GTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTGAATGGTCGCCATGATGGTGGTT
ATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGCCGGGCAATAACGTTTA
TGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAAT
CAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTAT
TGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAA
AAAGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGG
GTGATGCTGGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCC
TAGTTTTGTTTCTGGTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCT
GGCACTTCTGCCGTTTCTGATAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTG
ATAAAGGAAAGGATACTCGTGATTATCTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCG
TGCTGGTGCTGATGCTTCCTCTGCTGGTATGGTTGACGCCGGATTTGAGAATCAAAAAGAGCTT
ACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGAGATGCAAAATGAGACTCAAAAAGAGA
TTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGACCAGGTATATGCACAAAATGA
GATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTATGGAAAACACCAAT
CTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCAAACGGCTG
GTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTT
AGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGAT
ATTTCTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAG
CTGTTGCCGATACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTT
GTCTAGGAAATAACCGTCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTT
GGAGGCTTTTTTATGGTTCGTTCTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTG
AGCGTATCGAGGCTCTTAAACCTGCTATTGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCC
AATGCTTGGCTTCCATAAGCAGATGGATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCTTTT
CGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATGTTGACGGCCATAAGGCTGCTTCTG
ACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGAATTGGCACAATGCTACAA
TGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGG
TTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCCCTCTTA
AGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATT
GCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATGAATGCAATG
CGACAGGCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGAT
TAGAGGCGTTTTATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGT
TCTTGCTGCCGAGGGTCGCAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTG
CCTGAGTATGGTACAGCTAATGGCCGTCTTCATTTCCATGCGGTGCACTTTATGCGGACACTTC
CTACAGGTAGCGTTGACCCTAATTTTGGTCGTCGGGTACGCAATCGCCGCCAGTTAAATAGCTT
GCAAAATACGTGGCCTTATGGTTACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTT
TCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAGCCGCTTAAAGCTACCAGTTATA
TGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAA
AGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCAAGAAG
CTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTGTCCA
CGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAA
GCAGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCG
GCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTAT
CCAACCTGCA
Protein-coding regions of a DNA sequence are first transcribed into messenger
RNA and then translated into protein. A codon of three DNA nucleotides
is transcribed into a codon of three complementary RNA nucleotides, which
is translated in turn into a single amino acid within a protein. A fragment of
single-stranded DNA sequence has three possible reading frames, and translation
takes place in an open reading frame, a sequence of codons from a certain
start codon to a certain stop codon and containing no further stop codon.
Example 1.2
Reading frame 2 of the DNA sequence of Bacteriophage φ-X174 from the previous
example contains 15 open reading frames of more than 108 nucleotides,
which can potentially code for proteins of more than 36 amino acids. Only
two of them, shown highlighted in the next table, actually code for a protein.
The reading frame determines the actual amino acids encoded by a gene.
For instance, the DNA sequence fragment
GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA can be read in the
5′ to 3′ direction in the following three frames:
1 GTC GCC ATG ATG GTG GTT ATT ATA CCG TCA AGG ACT GTG TGA CTA
2 TCG CCA TGA TGG TGG TTA TTA TAC CGT CAA GGA CTG TGT GAC TA
3 CGC CAT GAT GGT GGT TAT TAT ACC GTC AAG GAC TGT GTG ACT A
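As a minimal sketch of how the three readings above arise, the following Perl fragment drops the first 0, 1, or 2 nucleotides of the fragment and cuts the rest into chunks of three. The helper name codons_in_frame is an assumption made for illustration and does not come from the text; the output should reproduce the three lines of the listing above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the text): split a DNA string into the
# codons of reading frame $frame (1, 2, or 3). Trailing nucleotides that
# do not fill a whole codon are kept as a shorter last chunk.
sub codons_in_frame {
    my ($seq, $frame) = @_;
    my $rest = substr($seq, $frame - 1);    # drop the first $frame-1 nucleotides
    my @codons = $rest =~ /.{1,3}/g;        # chunks of up to three nucleotides
    return @codons;
}

my $s = "GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA";
for my $frame (1, 2, 3) {
    print "$frame ", join(" ", codons_in_frame($s, $frame)), "\n";
}
```

Only the starting offset differs between the frames, which is why the same start codon ATG is visible in frame 1 but not in frames 2 and 3 of this fragment.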
A fragment of double-stranded DNA sequence, on the other hand, has six
possible reading frames, three in each direction. An open reading frame begins
with the start codon ATG (methionine) in most species and ends with a stop
codon TAA, TAG, or TGA.
The identification of genes coding for proteins in a DNA sequence is a very
difficult task. Even a simple organism such as Bacteriophage φ-X174, with
a single-stranded DNA sequence of only 5,386 nucleotides, has a total of 117
open reading frames, only 11 of which actually code for a protein. There are
several other biological signals that help the computational biologist in the
task of gene finding, but to start with, the known protein with the shortest
sequence has 8 amino acids and, thus, short open reading frames, with fewer
than 3 + 24 + 3 = 30 nucleotides, cannot code for a protein.
A first algorithmic problem consists in extracting all open reading frames
in the three reading frames of a DNA sequence fragment. The problem has
to be solved on the reverse complement of the sequence as well if the DNA is
double stranded.
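For the double-stranded case, the reverse complement can be obtained by reversing the string and swapping complementary nucleotides. The following is a minimal sketch, assuming the sequence is an uppercase string over A, C, G, T; the function name reverse_complement is chosen here for illustration.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string: reverse it, then swap A<->T and C<->G.
# Assumes an uppercase sequence over the alphabet A, C, G, T.
sub reverse_complement {
    my $seq = shift;
    my $rc = reverse $seq;    # scalar context: reverses the characters
    $rc =~ tr/ACGT/TGCA/;     # complement each nucleotide
    return $rc;
}

print reverse_complement("GTCGCCATGATG"), "\n";    # prints CATCATGGCGAC
```

Running an open reading frame search on both the sequence and its reverse complement would then cover the six reading frames of double-stranded DNA.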
Given a fragment of DNA sequence S of n nucleotides, let S[i] denote the
i-th nucleotide of sequence S, for 1 ≤ i ≤ n. Thus, in the sequence
S = GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA, which
has n = 45 nucleotides, S[1] = G, S[2] = T, S[3] = C, and S[n] = A. Let
also S[i, ..., j], where i ≤ j, denote the fragment of S containing nucleotides
S[i], S[i+1], ..., S[j]. For instance, S[1, ..., 4] = GTCG, and S[1, ..., n] = S.
Therefore, S[i, ..., i] = S[i] for any 1 ≤ i ≤ n.
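The notation above uses 1-based, inclusive positions, whereas Perl's substr takes a 0-based offset and a length. A small sketch of the correspondence follows; the helper name fragment is hypothetical and introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: fragment($s, $i, $j) returns S[i, ..., j], with
# 1-based, inclusive positions as in the notation of the text.
sub fragment {
    my ($s, $i, $j) = @_;
    return substr($s, $i - 1, $j - $i + 1);    # 0-based offset, length j-i+1
}

my $s = "GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA";
my $n = length $s;
print "n = $n\n";
print "S[1] = ", fragment($s, 1, 1), "\n";
print "S[1,...,4] = ", fragment($s, 1, 4), "\n";
print "S[n] = ", fragment($s, $n, $n), "\n";
```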
With this notation, an open reading frame is a fragment S[i, ..., j], of length
j − i + 1, such that S[i, ..., i+2] is the start codon ATG and S[j−2, ..., j] is one
of the stop codons TAA, TAG, or TGA. This is actually not quite the case. It
has to be at least 30 nucleotides long, that is, it must fulfill j − i + 1 ≥ 30. And
it cannot contain any other stop codon in frame, that is, it must also fulfill the
condition S[k, ..., k+2] ∉ {TAA, TAG, TGA} for k = i+3, i+6, ..., j−5. In the
sequence fragment S = GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA,
for instance, S[7, ..., 42] is an open reading frame, as it begins
with S[7, ..., 9] = ATG, ends with S[40, ..., 42] = TGA, and has no other
codon between S[10] and S[39] equal to TAA, TAG, or TGA.
GTC GCC ATG ATG GTG GTT ATT ATA CCG TCA AGG ACT GTG TGA CTA
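The conditions in the definition can be checked one by one. The following Perl sketch does so for 1-based positions i and j; the function name, the %is_stop lookup table, and the optional minimum-length argument (defaulting to 30) are assumptions made for this illustration, and the length is additionally required to be a multiple of three so that the stop codon lies in frame.

```perl
use strict;
use warnings;

my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);

# True if S[i, ..., j] (1-based, inclusive) is an open reading frame of at
# least $min nucleotides: it starts with ATG, ends with a stop codon, has a
# length that is a multiple of three, and has no earlier in-frame stop codon.
sub is_open_reading_frame {
    my ($s, $i, $j, $min) = @_;
    $min = 30 unless defined $min;
    my $len = $j - $i + 1;
    return 0 if $i < 1 || $j > length($s) || $len < $min || $len % 3 != 0;
    return 0 unless substr($s, $i - 1, 3) eq "ATG";        # S[i, ..., i+2]
    return 0 unless $is_stop{ substr($s, $j - 3, 3) };     # S[j-2, ..., j]
    for (my $k = $i + 3; $k <= $j - 5; $k += 3) {          # inner codons
        return 0 if $is_stop{ substr($s, $k - 1, 3) };
    }
    return 1;
}

my $s = "GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA";
print is_open_reading_frame($s, 7, 42) ? "ORF\n" : "not an ORF\n";
```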
The reading frame determines a partition of the DNA sequence fragment S
in codons of three consecutive nucleotides. In reading frame 1, the first codon
is S[1, ..., 3], the second codon is S[4, ..., 6], and so on. In reading frame
2, however, the first codon is S[2, ..., 4], and the second codon is S[5, ..., 7].
The first codon in reading frame 3 is S[3, ..., 5].
In a given reading frame, the codons can be accessed by sliding a window
of length three over the sequence, starting at position 1, 2, or 3, depending on
the reading frame. The sliding window is thus a kind of looking glass under
which a codon of the sequence can be seen and accessed:
    1-3 4-6 7-9 ... ... ... ... ... ... ... ... ... ... ... ..n
−→ [GTC]GCC ATG ATG GTG GTT ATT ATA CCG TCA AGG ACT GTG TGA CTA
    GTC[GCC]ATG ATG GTG GTT ATT ATA CCG TCA AGG ACT GTG TGA CTA
    GTC GCC[ATG]ATG GTG GTT ATT ATA CCG TCA AGG ACT GTG TGA CTA
    ...
    GTC GCC ATG ATG GTG GTT ATT ATA CCG TCA AGG ACT GTG[TGA]CTA
    GTC GCC ATG ATG GTG GTT ATT ATA CCG TCA AGG ACT GTG TGA[CTA]
Consider, as a first example, the problem of finding an open reading frame
in a reading frame of a sequence, and let S[k, ..., k+2] be the codon under the
sliding window in the given reading frame. Starting with an initial position k
given by the reading frame, the sliding window has to be displaced by three
nucleotides each time until accessing a start codon and then continue sliding
by three nucleotides each time until accessing a stop codon. Again, this is not
actually quite the case. The reading frame of the sequence fragment could
contain no start codon at all, or it could contain a start codon but no stop
codon, and the search for the beginning or the end of an open reading frame
might go beyond the end of the sequence.
The first start codon in the k-th reading frame of a given DNA sequence
fragment S of n nucleotides can be found by sliding a window S[i, ..., i+2]
of three nucleotides along S[k, ..., n], until either i + 2 > n or
S[i, ..., i+2] = ATG. In the following description, the initial position i of
the candidate start codon is incremented by three as long as the codon does
not fall off the sequence (that is, i + 2 ≤ n) and is not a start codon (that is,
S[i, ..., i+2] ≠ ATG).
i ← k
while i + 2 ≤ n and S[i, ..., i+2] ≠ ATG do
    i ← i + 3
if i + 2 ≤ n then
    output S[i, ..., i+2]
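The search for a start codon described above can be sketched in Perl as follows. The function name first_start_codon and the convention of returning undef when no start codon is found are assumptions made for this illustration; positions are kept 1-based, as in the description, and converted to Perl's 0-based offsets only inside substr.

```perl
use strict;
use warnings;

# Position (1-based) of the first start codon ATG in reading frame $k
# (1, 2, or 3) of $s, or undef if the window falls off the sequence first.
sub first_start_codon {
    my ($s, $k) = @_;
    my $n = length $s;
    my $i = $k;
    while ($i + 2 <= $n && substr($s, $i - 1, 3) ne "ATG") {
        $i += 3;    # slide the window by one codon
    }
    return $i + 2 <= $n ? $i : undef;
}

my $s = "GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA";
for my $k (1, 2, 3) {
    my $i = first_start_codon($s, $k);
    print "frame $k: ", defined $i ? "start codon at $i" : "no start codon", "\n";
}
```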
After having found a start codon S[i, ..., i+2], the first stop codon can be
found by sliding a window S[j, ..., j+2] of three nucleotides, this time along
S[i+3, ..., n], until either j + 2 > n or S[j, ..., j+2] ∈ {TAA, TAG, TGA}.
In the following description, the initial position j of the candidate stop codon
is incremented by three as long as the codon does not fall off the sequence
(that is, j + 2 ≤ n) and the candidate codon is not a stop codon (that is, with
S[j, ..., j+2] ∉ {TAA, TAG, TGA}).
j ← i + 3
while j + 2 ≤ n and S[j, ..., j+2] ∉ {TAA, TAG, TGA} do
    j ← j + 3
if j + 2 ≤ n then
    output S[j, ..., j+2]
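The search for the stop codon can be sketched in the same way. Here the function name first_stop_codon and the %is_stop lookup table are assumptions made for illustration; the argument $i is the 1-based position of a start codon already found, and undef is returned when no in-frame stop codon follows it.

```perl
use strict;
use warnings;

my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);

# Position (1-based) of the first stop codon in frame after the start codon
# at position $i of $s, or undef if the window falls off the sequence first.
sub first_stop_codon {
    my ($s, $i) = @_;
    my $n = length $s;
    my $j = $i + 3;    # first codon after the start codon
    while ($j + 2 <= $n && !$is_stop{ substr($s, $j - 1, 3) }) {
        $j += 3;
    }
    return $j + 2 <= $n ? $j : undef;
}

my $s = "GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA";
my $j = first_stop_codon($s, 7);
print defined $j ? "stop codon at $j\n" : "no stop codon\n";
```

A hash lookup replaces the three string comparisons; both forms test membership in the same set of stop codons.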
Now, the problem of extracting the first open reading frame in the k-th
reading frame of a DNA sequence fragment S of length n can be solved by
putting together the search for a start codon and the search for a stop codon.
In the following description, the start codon is S[i, ..., i+2] and the stop
codon is S[j, ..., j+2] of the sequence and, thus, the open reading frame
S[i, ..., j+2] is output.
i ← k
while i + 2 ≤ n and S[i, ..., i+2] ≠ ATG do
    i ← i + 3
if i + 2 ≤ n then
    j ← i + 3
    while j + 2 ≤ n and S[j, ..., j+2] ∉ {TAA, TAG, TGA} do
        j ← j + 3
    if j + 2 ≤ n then
        output S[i, ..., j+2]
Notice that only the first start codon is found, and the first stop codon
after this start codon will then signal the end of the first open reading frame.
There may be other start codons in the sequence fragment between the first
start codon and the first stop codon, however, which would signal shorter
open reading frames contained in the first open reading frame found. Also,
the first open reading frame might be shorter than 30 nucleotides, much too
short to actually code for a protein.
The problem of extracting all open reading frames of at least 30 nucleotides
in the k-th reading frame of a DNA sequence fragment S of length n ≥ 30
can be solved by repeating the previous procedure for each start codon found
in turn, checking that the open reading frames thus found have at least 30
nucleotides, as follows.
i ← k
while i + 2 ≤ n do
    if S[i, ..., i+2] = ATG then
        j ← i + 3
        while j + 2 ≤ n and S[j, ..., j+2] ∉ {TAA, TAG, TGA} do
            j ← j + 3
        if j + 2 ≤ n and j + 2 − i + 1 ≥ 30 then output S[i, ..., j+2]
    i ← i + 3
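A sketch of the procedure for a single reading frame follows, under the assumption that it is convenient to collect the open reading frames as pairs of 1-based positions [i, j+2] rather than to print them. The function name, the %is_stop table, and the optional minimum-length argument are hypothetical and introduced only for this illustration.

```perl
use strict;
use warnings;

my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);

# All open reading frames of at least $min nucleotides (default 30) in
# reading frame $k (1, 2, or 3) of $s, as pairs [first, last] of 1-based
# positions. Every in-frame ATG is tried as a start codon in turn.
sub open_reading_frames_in_frame {
    my ($s, $k, $min) = @_;
    $min = 30 unless defined $min;
    my $n = length $s;
    my @orfs;
    for (my $i = $k; $i + 2 <= $n; $i += 3) {
        next unless substr($s, $i - 1, 3) eq "ATG";
        my $j = $i + 3;
        $j += 3 while $j + 2 <= $n && !$is_stop{ substr($s, $j - 1, 3) };
        push @orfs, [$i, $j + 2] if $j + 2 <= $n && $j + 2 - $i + 1 >= $min;
    }
    return @orfs;
}

my $s = "GTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTA";
for my $orf (open_reading_frames_in_frame($s, 1)) {
    my ($first, $last) = @$orf;
    print "$first $last ", substr($s, $first - 1, $last - $first + 1), "\n";
}
```

On the example fragment, both start codons in reading frame 1 share the same stop codon, so the shorter open reading frame is contained in the longer one, as noted in the text.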
Finally, the problem of extracting all open reading frames of at least 30
nucleotides in the three reading frames of a DNA sequence fragment S of length
n ≥ 30 can be solved by repeating the previous procedure for each reading
frame and for each start codon in turn, checking again that the open reading
frames thus found have at least 30 nucleotides. In the following description,
the whole algorithm is wrapped into a procedure that takes the DNA sequence
fragment S as input and reports each of the open reading frames of S
as output.
The previous algorithm for extracting all open reading frames in the three
reading frames of a given DNA sequence fragment can be implemented in
Perl in a straightforward way. An open reading frame S[i, ..., j+2] is
represented as the fragment of sequence $seq with starting position $i and length
$j+2-$i+1, that is, substr($seq,$i,$j+2-$i+1). Notice that positions in Perl
strings, like Perl array indices, do not start with 1 but, rather, with 0 and,
thus, the first codon is substr($seq,0,3), the last nucleotide is
substr($seq,$n-1,1), and the three reading frames $r are numbered 0, 1, 2.
This is all shown in the following Perl script.
sub extract_open_reading_frames {
    my $seq = shift;
    for my $r (0, 1, 2) {    # reading frames, numbered from 0
        for (my $i = $r; $i <= length($seq) - 3; $i += 3) {
            if (substr($seq, $i, 3) eq "ATG") {    # start codon at position $i
                my $j = $i + 3;
                # slide to the first in-frame stop codon, if any
                while ($j <= length($seq) - 3 &&
                       substr($seq, $j, 3) ne "TAA" &&
                       substr($seq, $j, 3) ne "TAG" &&
                       substr($seq, $j, 3) ne "TGA") {
                    $j += 3;
                }
                if ($j <= length($seq) - 3) {    # a stop codon was found
                    my $len = $j + 2 - $i + 1;
                    if ($len >= 30) {
                        print substr($seq, $i, $len), "\n";
                    }
                }
            }
        }
    }
}
The algorithm for extracting all open reading frames in the three reading
frames of a given DNA sequence fragment can also be implemented in R in a
straightforward way, as shown in the following R script.
extract.open.reading.frames <- function(seq) {
  for (i in 1:3) {    # reading frames, numbered from 1
    while (i+2 <= nchar(seq)) {
      if (substr(seq,i,i+2) == "ATG") {    # start codon at position i
        j <- i + 3
        # slide to the first in-frame stop codon, if any
        while (j+2 <= nchar(seq) &&
               substr(seq,j,j+2) != "TAA" &&
               substr(seq,j,j+2) != "TAG" &&
               substr(seq,j,j+2) != "TGA") {
          j <- j + 3
        }
        if (j+2 <= nchar(seq)) {    # a stop codon was found
          if (j+2-i+1 >= 30) {
            print(c(i, j+2, substr(seq,i,j+2)))
          }
        }
      }
      i <- i + 3
    }
  }
}
There are indeed 117 open reading frames of at least 30 nucleotides and up
to 1,284 nucleotides in the DNA sequence of Bacteriophage φ-X174.
[1] "230" "331" "ATGAGGAGAAGTGGCTTAATATGC...TTTTAA"
[1] "284" "331" "ATGAGTCACATTTTGTTCATGGTA...TTTTAA"
[1] "302" "331" "ATGGTAGAGATTCTCTTGTTGACA...TTTTAA"
[1] "848" "964" "ATGTCTAAAGGTAAAAAACGTTCT...TTTTAA"
[1] "1001" "2284" "ATGTCTAATATTCAAACTGGCGCC...TCGTGA"
[1] "1031" "2284" "ATGCCGCATGACCTTTCCCATCTT...TCGTGA"
[1] "1130" "2284" "ATGGACGCCGTTGGCGCTCTCCGT...TCGTGA"
[1] "1256" "2284" "ATGAAGGATGGTGTTAATGCCACT...TCGTGA"
[1] "1421" "2284" "ATGCCTGACCGTACCGAGGCTAAC...TCGTGA"
[1] "1550" "2284" "ATGACGACTTCTACCACATCTATT...TCGTGA"
[1] "1580" "2284" "ATGGGTCTGCAAGCTGCTTATGCT...TCGTGA"
[1] "1637" "2284" "ATGCAGCGTTACCATGATGTTATT...TCGTGA"
[1] "1715" "2284" "ATGCGCTCTAATCTCTGGGCATCT...TCGTGA"
[1] "1850" "2284" "ATGTTTACTCTTGCGCTTGTTCGT...TCGTGA"
[1] "1991" "2284" "ATGAAGGATGTTTTCCGTTCTGGT...TCGTGA"
[1] "2543" "2731" "ATGCTGGTAATGGTGGTTTTCTTC...CTTTGA"
[1] "2552" "2731" "ATGGTGGTTTTCTTCATTGCATTC...CTTTGA"
[1] "2639" "2731" "ATGCCGACCCTAAATTTTTTGCCT...CTTTGA"
[1] "2732" "2776" "ATGGTCGCCATGATGGTGGTTATT...GTGTGA"
[1] "2741" "2776" "ATGATGGTGGTTATTATACCGTCA...GTGTGA"
[1] "2744" "2776" "ATGGTGGTTATTATACCGTCAAGG...GTGTGA"
[1] "2816" "2878" "ATGTTGGTTTCATGGTTTGGTCTA...CGCTGA"
[1] "4349" "4405" "ATGGTTTTTAGAGAACGAGAAGAC...TGCTGA"
[1] "51" "221" "ATGAGTCGAAAAATTATCTTGATA...AACTGA"
[1] "390" "848" "ATGAGTCAAGTTACTGAACAATCC...ATGTAA"
[1] "681" "848" "ATGGAAGGCGCTGAATTTACGGAA...ATGTAA"
[1] "1038" "1196" "ATGACCTTTCCCATCTTGGCTTCC...CTGTAG"
[1] "1212" "1259" "ATGTCCCTCATCGTCACGTTTATG...TCATGA"
[1] "1263" "1388" "ATGGTGTTAATGCCACTCCTCTCC...ATTTGA"
[1] "1272" "1388" "ATGCCACTCCTCTCCCGACTGTTA...ATTTGA"
[1] "1317" "1388" "ATGCCGCTTTTCTTGGCACGATTA...ATTTGA"
[1] "1449" "1553" "ATGAGCTTAATCAAGATGATGCTC...AAATGA"
[1] "1464" "1553" "ATGATGCTCGTTATGGTTTCCGTT...AAATGA"
[1] "1467" "1553" "ATGCTCGTTATGGTTTCCGTTGCT...AAATGA"
[1] "1476" "1553" "ATGGTTTCCGTTGCTGCCATCTCA...AAATGA"
[1] "1599" "1775" "ATGCTAATTTGCATACTGACCAAG...CGTTAG"
[1] "1650" "1775" "ATGATGTTATTTCTTCATTTGGAG...CGTTAG"
[1] "1653" "1775" "ATGTTATTTCTTCATTTGGAGGTA...CGTTAG"
[1] "1686" "1775" "ATGACGCTGACAACCGTCCTTTAC...CGTTAG"
[1] "1743" "1775" "ATGATGTTGATGGAACTGACCAAA...CGTTAG"
[1] "1746" "1775" "ATGTTGATGGAACTGACCAAACGT...CGTTAG"
[1] "1842" "1928" "ATGGCACTATGTTTACTCTTGCGC...CTTTGA"
[1] "1962" "1994" "ATGGCAACTTGCCGCCGCGTGAAA...CTATGA"
[1] "1998" "2234" "ATGTTTTCCGTTCTGGTGATTCGT...ATGTGA"
[1] "2061" "2234" "ATGCGCCTTCGTATGTTTCTCCTG...ATGTGA"
[1] "2073" "2234" "ATGTTTCTCCTGCTTATCACCTTC...ATGTGA"
[1] "2166" "2234" "ATGATTATGACCAGTGTTTCCAGT...ATGTGA"
[1] "2172" "2234" "ATGACCAGTGTTTCCAGTCCGTTC...ATGTGA"
[1] "2856" "2891" "ATGCCGCGGATTGGTTTCGCTGAA...TATTAA"
[1] "2931" "3917" "ATGTTTGGTGCTATTGCTGGCGGT...AAATAA"
[1] "2982" "3917" "ATGTCTAAATTGTTTGGAGGCGGT...AAATAA"
[1] "3069" "3917" "ATGGGTGATGCTGGTATTAAATCT...AAATAA"
[1] "3156" "3917" "ATGGCTAAAGCTGGTAAAGGACTT...AAATAA"
[1] "3357" "3917" "ATGGTTGACGCCGGATTTGAGAAT...AAATAA"
[1] "3399" "3917" "ATGCAACTGGACAATCAGAAAGAG...AAATAA"
[1] "3432" "3917" "ATGCAAAATGAGACTCAAAAAGAG...AAATAA"
[1] "3522" "3917" "ATGCTTGCTTATCAACAGAAGGAG...AAATAA"
[1] "3570" "3917" "ATGGAAAACACCAATCTTTCCAAG...AAATAA"
[1] "3615" "3917" "ATGCGCCAAATGCTTACTCAAGCT...AAATAA"
[1] "3624" "3917" "ATGCTTACTCAAGCTCAAACGGCT...AAATAA"
[1] "3681" "3917" "ATGACTCGCAAGGTTAGTGCTGAG...AAATAA"
A related algorithmic problem consists of finding the longest open reading
frame of a given DNA sequence fragment. The longest open reading frame
often determines the correct reading frame for eukaryotes, where translation
usually takes place in one reading frame only. Again, the problem has to be
solved on the reverse complement of the sequence as well if the DNA is double
stranded.
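The double-stranded case can be illustrated with a minimal sketch, written here in Python rather than the Perl and R used in this book. The function names (reverse_complement, longest_orf, longest_orf_both_strands) are hypothetical, and the sketch assumes an uppercase sequence over A, C, G, T, the start codon ATG, and the stop codons TAA, TAG, and TGA. Positions are 0-based and, for the reverse strand, refer to the reverse complement rather than to the original sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every nucleotide, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Return (start, end), 0-based and inclusive, of the longest
    open reading frame of seq, or None if there is none."""
    best = None
    for frame in range(3):                       # the three reading frames
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] != "ATG":
                continue
            # scan forward, in frame, for the first stop codon
            for j in range(i + 3, len(seq) - 2, 3):
                if seq[j:j + 3] in STOP_CODONS:
                    if best is None or j + 3 - i > best[1] - best[0] + 1:
                        best = (i, j + 2)
                    break
    return best


def longest_orf_both_strands(seq):
    """Return (strand, (start, end)) for the longer of the two strands'
    longest open reading frames, or None if neither strand has one."""
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        orf = longest_orf(s)
        if orf is not None:
            length = orf[1] - orf[0] + 1
            if best is None or length > best[0]:
                best = (length, strand, orf)
    return None if best is None else (best[1], best[2])
```

For instance, the fragment CTATTTCAT has no start codon on the forward strand, but its reverse complement ATGAAATAG is itself an open reading frame, so the sketch reports the reverse strand.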
The previous algorithm for extracting all open reading frames can be ex-
tended to find the longest open reading frame, by keeping the position of the
start and stop codon of the longest open reading frame found so far. In the
following description, the start codon of the longest open reading frame found
so far is $S[i', \ldots, i'+2]$, and the corresponding stop codon is $S[j', \ldots, j'+2]$.
The previous algorithm for finding the longest open reading frame of a given
DNA sequence fragment can be implemented in Perl in a straightforward way,
as shown in the following Perl script.
sub longest_open_reading_frame {
    my $seq = shift;
    # 0-based positions of the start and stop codons of the longest
    # open reading frame found so far
    my ($ii, $jj) = (0, 0);
    for my $r (0, 1, 2) {    # the three reading frames
        for (my $i = $r; $i <= length($seq) - 3; $i += 3) {
            if (substr($seq, $i, 3) eq "ATG") {
                # scan forward, in frame, for the first stop codon
                my $j = $i + 3;
                while ($j <= length($seq) - 3 &&
                       substr($seq, $j, 3) ne "TAA" &&
                       substr($seq, $j, 3) ne "TAG" &&
                       substr($seq, $j, 3) ne "TGA") {
                    $j += 3;
                }
                if ($j <= length($seq) - 3) {    # a stop codon was found
                    my $len = $j + 2 - $i + 1;
                    if ($len > $jj + 2 - $ii + 1) {
                        $ii = $i;
                        $jj = $j;
                    }
                }
            }
        }
    }
    # first and last position of the frame; [0, 2] if none was found
    return [$ii, $jj + 2];
}
The algorithm for finding the longest open reading frame of a given DNA
sequence fragment can also be easily implemented in R, as shown in the fol-
lowing R script.
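longest.open.reading.frame <- function (seq) {
  ii <- jj <- 0
  for (i in 1:3) {
    while (i+2 <= nchar(seq)) {
      if (substr(seq,i,i+2) == "ATG") {
        j <- i + 3
        while (j+2 <= nchar(seq) &&
               substr(seq,j,j+2) != "TAA" &&
               substr(seq,j,j+2) != "TAG" &&
               substr(seq,j,j+2) != "TGA") {
          j <- j + 3
        }
        # The listing breaks off here in the source; the remainder is an
        # assumption, reconstructed by analogy with the Perl version above.
        if (j+2 <= nchar(seq) && j+2 - i+1 > jj+2 - ii+1) {
          ii <- i
          jj <- j
        }
      }
      i <- i + 3
    }
  }
  c(ii, jj+2)
}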
Bibliographic Notes
Most of the research in combinatorial pattern matching is reflected in the
various editions of the Annual Symposium on Combinatorial Pattern Match-
ing (Apostolico et al. 1992; 1993; Crochemore and Gusfield 1994; Galil and
Ukkonen 1995; Hirschberg and Myers 1996; Apostolico and Hein 1997; Farach-
Colton 1998; Crochemore and Paterson 1999; Giancarlo and Sankoff 2000;
Amir and Landau 2001; Apostolico and Takeda 2002; Baeza-Yates et al. 2003;
Sahinalp et al. 2004; Apostolico et al. 2005; Lewenstein and Valiente 2006; Ma
and Zhang 2007; Ferragina and Landau 2008). There are also several books on
specific aspects of combinatorial pattern matching, focused on algorithms on
sequences (Stephen 1998; Crochemore and Rytter 1994; Navarro and Raffinot
2002; Crochemore and Rytter 2003; Smyth 2003; Crochemore et al. 2007).
There are several books on algorithms in computational biology which also
address combinatorial pattern matching, including (Waterman 1995; Gusfield
1997; Pevzner 2000; Valiente 2002; Dwyer 2003; Jones and Pevzner 2004;
Deonier et al. 2005; Kasahara and Morishita 2006). A brief introduction to
Sequence Pattern
Matching
Sequences are fundamental mathematical objects that count among the most
common combinatorial structures in computer science and computational bi-
ology. Basic notions underlying combinatorial algorithms on sequences, such
as counting, generation, and traversal algorithms, as well as appropriate data
structures for the representation of sequences, are the subject of this intro-
ductory chapter.
Example 2.1
The following three sequences (shown with the elements separated by spaces,
for clarity) are identical as multisets of elements, but they are all different
sequences.
A B C C D D D E E E E E F F F F F F F F
A B C D E F C D E F D E F E F E F F F F
F F F F F F F F E E E E E D D D C C B A
Actually, they all define the same labeled sequence: an ordered multiset with
elements A and B, element C twice, three occurrences of element D, five
occurrences of element E, and eight occurrences of element F.
(A,1) (B,1) (C,2) (D,3) (E,5) (F,8)
© 2009 by Taylor & Francis Group, LLC
 [3,] 1  4  10   20    35    56     84    120
 [4,] 1  5  15   35    70   126    210    330
 [5,] 1  6  21   56   126   252    462    792
 [6,] 1  7  28   84   210   462    924   1716
 [7,] 1  8  36  120   330   792   1716   3432
 [8,] 1  9  45  165   495  1287   3003   6435
 [9,] 1 10  55  220   715  2002   5005  11440
[10,] 1 11  66  286  1001  3003   8008  19448
[11,] 1 12  78  364  1365  4368  12376  31824
[12,] 1 13  91  455  1820  6188  18564  50388
[13,] 1 14 105  560  2380  8568  27132  77520
[14,] 1 15 120  680  3060 11628  38760 116280
[15,] 1 16 136  816  3876 15504  54264 170544
[16,] 1 17 153  969  4845 20349  74613 245157
[17,] 1 18 171 1140  5985 26334 100947 346104
[18,] 1 19 190 1330  7315 33649 134596 480700
[19,] 1 20 210 1540  8855 42504 177100 657800
[20,] 1 21 231 1771 10626 53130 230230 888030
Example 2.2
The 2 + 4 + 8 + 16 = 30 sequences of length 1 through 4 over the alphabet
{A, B} are shown in lexicographical order.
A
AA
AAA
AAAA
AAAB
AAB
AABA
AABB
AB
ABA
ABAA
ABAB
ABB
ABBA
ABBB
B
BA
BAA
BAAA
BAAB
BAB
BABA
BABB
BB
BBA
BBAA
BBAB
BBB
BBBA
BBBB
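The listing of Example 2.2 can be reproduced with a short sketch, given here in Python rather than Perl or R. The generator name sequences_up_to is hypothetical. It assumes that a sequence precedes all of its proper extensions, which is the convention the listing follows (A before AA, AA before AAA), so each sequence is emitted before the sequences that extend it.

```python
def sequences_up_to(alphabet, max_len, prefix=""):
    """Yield all non-empty sequences over alphabet of length at most
    max_len, in lexicographical order."""
    for symbol in sorted(alphabet):
        word = prefix + symbol
        yield word                      # a word precedes its extensions
        if len(word) < max_len:
            yield from sequences_up_to(alphabet, max_len, word)


if __name__ == "__main__":
    for word in sequences_up_to("AB", 4):
        print(word)
```

With alphabet {A, B} and maximum length 4, the generator yields 2 + 4 + 8 + 16 = 30 sequences, starting with A, AA, AAA, AAAA, AAAB and ending with BBBB.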
In labeled sequences, both the alphabet and the attributes associated with
the elements of the sequences are usually ordered, allowing also the definition
of a lexicographical ordering among labeled sequences. A labeled sequence
$((x_1, n_1), (x_2, n_2), \ldots, (x_k, n_k))$ with $x_1 < x_2 < \cdots < x_k$ precedes another
labeled sequence $((x'_1, n'_1), (x'_2, n'_2), \ldots, (x'_\ell, n'_\ell))$ with $x'_1 < x'_2 < \cdots < x'_\ell$ in
lexicographical order if $x_1 < x'_1$, or $x_1 = x'_1$ and $n_1 < n'_1$, or $x_1 = x'_1$ and
$n_1 = n'_1$ and $((x_2, n_2), \ldots, (x_k, n_k))$ precedes $((x'_2, n'_2), \ldots, (x'_\ell, n'_\ell))$ in lexicographical order, where
the empty labeled sequence precedes any non-empty labeled sequence.
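The definition translates almost literally into a recursive comparison. The following is a minimal sketch in Python rather than Perl or R, under the assumption that a labeled sequence is represented as a list of (element, count) pairs sorted by element. The function name precedes is hypothetical.

```python
def precedes(s, t):
    """Return True if labeled sequence s strictly precedes labeled
    sequence t in lexicographical order."""
    if not s:
        return bool(t)          # the empty sequence precedes any non-empty one
    if not t:
        return False            # a non-empty sequence never precedes the empty one
    (x, n), (y, m) = s[0], t[0]
    if x != y:
        return x < y            # first elements differ
    if n != m:
        return n < m            # same element, different number of occurrences
    return precedes(s[1:], t[1:])   # compare the remaining pairs
```

For example, precedes([("A", 1), ("B", 3)], [("A", 2)]) holds because the first elements agree and 1 < 2, whereas no labeled sequence precedes itself.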
Example 2.3
The 2 + 3 + 4 + 5 = 14 labeled sequences of length 1 through 4 over the
alphabet {A, B} are shown in lexicographical order.
(A,1)
(A,1) (B,1)
(A,1) (B,2)
(A,1) (B,3)
(A,2)
(A,2) (B,1)
(A,2) (B,2)
(A,3)
(A,3) (B,1)
(A,4)
(B,1)
(B,2)
(B,3)
(B,4)
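The listing of Example 2.3 can be reproduced with a short sketch, in Python rather than Perl or R. The names labeled_sequences and count_labeled_sequences are hypothetical. The sketch assumes that the length of a labeled sequence is the total number of occurrences of its elements, and it relies on Python comparing lists of pairs exactly as in the lexicographical order defined above. The count uses the standard formula for multisets of size n over k elements, C(n + k - 1, n).

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb


def labeled_sequences(alphabet, max_len):
    """All labeled sequences of length 1 through max_len over alphabet,
    as lists of (element, count) pairs, in lexicographical order."""
    result = []
    for n in range(1, max_len + 1):
        # each multiset of n elements gives one labeled sequence
        for combo in combinations_with_replacement(sorted(alphabet), n):
            counts = Counter(combo)
            result.append([(x, counts[x]) for x in sorted(counts)])
    return sorted(result)


def count_labeled_sequences(k, n):
    """Number of labeled sequences of length n over k elements."""
    return comb(n + k - 1, n)


if __name__ == "__main__":
    for s in labeled_sequences("AB", 4):
        print(" ".join(f"({x},{c})" for x, c in s))
```

Over the alphabet {A, B} the counts for lengths 1 through 4 are 2, 3, 4, and 5, which add up to the 14 labeled sequences listed.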
Example 2.4
Consider the following two sequences over the alphabet {A, B}.
AABAAABBAAAABBB
ABBAABBBAAABBBBAAAABBBBB
The corresponding labeled sequences, of length 15 and 24, are as follows.
(A,9) (B,6)
(A,10) (B,14)
Their symmetric difference is thus the following labeled sequence, of length 9.
(A,1) (B,8)
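The numbers in Example 2.4 can be checked with a small sketch, in Python rather than Perl or R. The function names labeled and symmetric_difference are hypothetical. The sketch assumes that the symmetric difference of two labeled sequences keeps, for each element, the absolute difference between its numbers of occurrences and drops elements whose difference is zero, which is consistent with the example.

```python
from collections import Counter


def labeled(seq):
    """Labeled sequence of seq: (element, count) pairs sorted by element."""
    counts = Counter(seq)
    return [(x, counts[x]) for x in sorted(counts)]


def symmetric_difference(s, t):
    """Symmetric difference of the labeled sequences of s and t,
    taken as the per-element absolute difference of counts."""
    a, b = Counter(s), Counter(t)
    diff = {x: abs(a[x] - b[x]) for x in set(a) | set(b)}
    return [(x, diff[x]) for x in sorted(diff) if diff[x] > 0]


if __name__ == "__main__":
    s = "AABAAABBAAAABBB"
    t = "ABBAABBBAAABBBBAAAABBBBB"
    print(labeled(s), labeled(t), symmetric_difference(s, t))
```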