Graph Data Management:
Techniques and Applications
Sherif Sakr
University of New South Wales, Australia
Eric Pardede
La Trobe University, Australia
Senior Editorial Director: Kristin Klinger
Director of Book Publications: Julia Mosemann
Editorial Director: Lindsay Johnston
Acquisitions Editor: Erika Carter
Development Editor: Mike Killian
Production Editor: Sean Woznicki
Typesetters: Adrienne Freeland, Jennifer Romanchak
Print Coordinator: Jamie Snavely
Cover Design: Nick Newcomer
Copyright © 2012 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.
Editorial Advisory Board
Sourav S. Bhowmick, Nanyang Technological University, Singapore
Michael Böhlen, University of Zurich, Switzerland
Marlon Dumas, University of Tartu, Estonia
Claudio Gutierrez, Universidad de Chile, Chile
Jun Huan, University of Kansas, USA
Irwin King, The Chinese University of Hong Kong, China
Raymond Wong, University of New South Wales, Australia
Mohammed Zaki, Rensselaer Polytechnic Institute, USA
Xiaofang Zhou, University of Queensland, Australia
Table of Contents
Foreword
Preface
Acknowledgment
Section 1
Basic Challenges of Data Management in Graph Databases
Chapter 1
Graph Representation
D. Dominguez-Sal, Universitat Politècnica de Catalunya, Spain
V. Muntés-Mulero, Universitat Politècnica de Catalunya, Spain
N. Martínez-Bazán, Universitat Politècnica de Catalunya, Spain
J. Larriba-Pey, Universitat Politècnica de Catalunya, Spain
Chapter 2
The Graph Traversal Pattern
Marko A. Rodriguez, AT&T Interactive, USA
Peter Neubauer, Neo Technology, Sweden
Chapter 3
Data, Storage and Index Models for Graph Databases
Srinath Srinivasa, International Institute of Information Technology, India
Chapter 4
An Overview of Graph Indexing and Querying Techniques
Sherif Sakr, University of New South Wales, Australia
Ghazi Al-Naymat, University of Tabuk, Saudi Arabia
Chapter 5
Efficient Techniques for Graph Searching and Biological Network Mining
Alfredo Ferro, Università di Catania, Italy
Rosalba Giugno, Università di Catania, Italy
Alfredo Pulvirenti, Università di Catania, Italy
Dennis Shasha, Courant Institute of Mathematical Sciences, USA
Chapter 6
A Survey of Relational Approaches for Graph Pattern Matching over Large Graphs
Jiefeng Cheng, The University of Hong Kong, China
Jeffrey Xu Yu, The Chinese University of Hong Kong, China
Chapter 7
Labelling-Scheme-Based Subgraph Query Processing on Graph Data
Hongzhi Wang, Harbin Institute of Technology, China
Jianzhong Li, Harbin Institute of Technology, China
Hong Gao, Harbin Institute of Technology, China
Section 2
Advanced Querying and Mining Aspects of Graph Databases
Chapter 8
G-Hash: Towards Fast Kernel-Based Similarity Search in Large Graph Databases
Xiaohong Wang, University of Kansas, USA
Jun Huan, University of Kansas, USA
Aaron Smalter, University of Kansas, USA
Gerald H. Lushington, University of Kansas, USA
Chapter 9
TEDI: Efficient Shortest Path Query Answering on Graphs
Fang Wei, University of Freiburg, Germany
Chapter 10
Graph Mining Techniques: Focusing on Discriminating between Real and Synthetic Graphs
Ana Paula Appel, Federal University of Espirito Santo at São Mateus, Brazil
Christos Faloutsos, Carnegie Mellon University, USA
Caetano Traina Junior, University of São Paulo at São Carlos, Brazil
Chapter 11
Matrix Decomposition-Based Dimensionality Reduction on Graph Data
Hiroto Saigo, Kyushu Institute of Technology, Japan
Koji Tsuda, National Institute of Advanced Industrial Science and Technology (AIST), Japan
Chapter 12
Clustering Vertices in Weighted Graphs
Derry Tanti Wijaya, Carnegie Mellon University, USA
Stephane Bressan, National University of Singapore, Singapore
Chapter 13
Large Scale Graph Mining with MapReduce: Counting Triangles in Large Real Networks
Charalampos E. Tsourakakis, Carnegie Mellon University, USA
Chapter 14
Graph Representation and Anonymization in Large Survey Rating Data
Xiaoxun Sun, Australian Council for Educational Research, Australia
Min Li, University of Southern Queensland, Australia
Section 3
Graph Database Applications in Various Domains
Chapter 15
Querying RDF Data
Faisal Alkhateeb, Yarmouk University, Jordan
Jérôme Euzenat, INRIA & LIG, France
Chapter 16
On the Efficiency of Querying and Storing RDF Documents
Maria-Esther Vidal, Universidad Simón Bolívar, Venezuela
Amadís Martínez, Universidad Simón Bolívar & Universidad de Carabobo, Venezuela
Edna Ruckhaus, Universidad Simón Bolívar, Venezuela
Tomas Lampo, University of Maryland, USA
Javier Sierra, Universidad Simón Bolívar, Venezuela
Chapter 17
Graph Applications in Chemoinformatics and Structural Bioinformatics
Eleanor Joyce Gardiner, University of Sheffield, UK
Chapter 18
Business Process Graphs: Similarity Search and Matching
Remco Dijkman, Eindhoven University of Technology, The Netherlands
Marlon Dumas, University of Tartu, Estonia
Luciano García-Bañuelos, University of Tartu, Estonia
Chapter 19
A Graph-Based Approach for Semantic Process Model Discovery
Ahmed Gater, Universite de Versailles Saint-Quentin en Yvelines, France
Daniela Grigori, Universite de Versailles Saint-Quentin en Yvelines, France
Mokrane Bouzeghoub, Universite de Versailles Saint-Quentin en Yvelines, France
Chapter 20
Shortest Path in Transportation Network and Weighted Subdivisions
Radwa Elshawi, National ICT Australia (NICTA) - University of Sydney, Australia
Joachim Gudmundsson, National ICT Australia (NICTA) - University of Sydney, Australia
Index
Foreword
In recent years, many database researchers, myself among them, have become fascinated by graphs. Like
many others, I have spent years working with the relational model, including its data access methods,
optimization techniques, and query languages. The rise of the graph model has been somewhat disruptive,
and our natural tendency is to ask, “Can we address new challenges using the existing relational model?”
Unfortunately, many efforts in this direction do not seem to work well.
The advent of the Web, and in particular large social networks on the Web, highlights the urgency of
developing a native graph database. The embracing of the graph model is anything but surprising. After
all, data management is about modeling the data and the relationships between the data. Can there be
a more natural model than the graph itself? The question, however, is not new. About three or four
decades ago, network and hierarchical models, which are also graph based, fought and lost the battle
against the relational model. Today, however, we are facing many new challenges that the relational
model is not designed for. For example, as graphs grow ever larger, storing a graph in
a relational table and using relational self-joins to traverse it is simply too costly.
Over the last decade, much research has been conducted on graphs. In particular, the study of large-scale
social networks has been made possible, and many interesting and even surprising results have been
published. However, the majority of research focuses on graph analytics using specific graph algorithms
(for example, graph reachability, sub-graph homomorphism and matching), and not enough effort has
been devoted to developing a new data model to better support graph analytics and applications on graphs.
This book is the first that approaches the challenges associated with graphs from a data management
point of view; it connects the dots. As I am currently involved in building a native graph database engine,
I encounter problems that arise from every possible aspect: data representation, indexing, transaction
support, parallel query processing, and many others. All of them sound familiar to a database researcher,
but the inherent change is fundamental as they originate from a new foundation. I found that this book
contains a lot of timely information, aiding my efforts. To be clear, it does not offer the blueprint for
building a graph database system, but it contains a bag of diamonds, enlightening the readers as they
start exploring a field that may fundamentally change data management in the future.
Haixun Wang
Microsoft Research Asia
Haixun Wang, who earned a PhD in Computer Science from UCLA in 2000, joined Microsoft Research
Asia in 2009. Before then, he spent nine years as a research staff member at IBM Research Center,
where he was a technical assistant to Stuart Feldman, then-vice president of Computer Science, and to
Mark Wegman, head of Computer Science. Wang’s interest lies in data management and mining, and he
is working on building a large-scale knowledge base for advanced applications, including search. Wang
is on the editorial board of IEEE Transactions on Knowledge and Data Engineering, Knowledge and
Information Systems, and the Journal of Computer Science and Technology. He has held key leadership
roles in various top conferences in his field.
Preface
The graph is a powerful tool for representing and understanding objects and their relationships in various
application domains. Recently, graphs have been widely used to model many kinds of complex structured
and schemaless data, such as the semantic web, social networks, biological networks, protein networks,
chemical compounds and business process models. The growing popularity of graph databases has
generated interesting data management problems. Therefore, the domain of graph databases has attracted
a lot of attention from the research community, and different challenges have been discussed, such
as subgraph search queries, supergraph search queries, approximate subgraph matching, shortest path
queries and graph mining techniques.
This book is designed for studying various fundamental challenges of storing and querying graph
databases. In addition, it discusses the applications of graph databases in various domains. In particular,
the book is divided into three main sections.
The first section discusses the basic definitions of graph data models, graph representations and graph
traversal patterns. It also provides an overview of different graph indexing techniques and evaluation
mechanisms for the main types of graph queries. The second section further discusses advanced querying
aspects of graph databases and different mining techniques of graph databases. It should be noted
that many graph querying algorithms are sensitive to the application scenario for which they are designed
and cannot be generalized for all domains. Therefore, the third section focuses on presenting the usage
of graph database techniques in different practical domains such as the semantic web, chemoinformatics,
bioinformatics, business process models and transportation networks.
In a nutshell, the book provides a comprehensive summary from both the algorithmic and the
applied perspectives. It will provide the reader with a better understanding of how graph databases can
be effectively utilized in different scenarios.
Sherif Sakr
University of New South Wales, Australia
Eric Pardede
La Trobe University, Australia
Acknowledgment
We would like to thank the editorial advisory board of the book for their efforts towards the successful
completion of this project.
We also would like to thank all contributing authors of the chapters of the book. The efforts of all
reviewers in advancing the material of this book are highly appreciated!
Thank you all.
Sherif Sakr
University of New South Wales, Australia
Eric Pardede
La Trobe University, Australia
Section 1
Basic Challenges of Data
Management in Graph
Databases
Chapter 1
Graph Representation
D. Dominguez-Sal
Universitat Politècnica de Catalunya, Spain
V. Muntés-Mulero
Universitat Politècnica de Catalunya, Spain
N. Martínez-Bazán
Universitat Politècnica de Catalunya, Spain
J. Larriba-Pey
Universitat Politècnica de Catalunya, Spain
ABSTRACT
In this chapter, we review different graph implementation alternatives that have been proposed in the
literature. Our objective is to provide the readers with a broad set of alternatives to implement a graph,
according to their needs. We pay special attention to the techniques that enable the management of large
graphs. We also include a description of the most representative libraries available for representing graphs.
DOI: 10.4018/978-1-61350-053-8.ch001
Copyright © 2012, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Figure 1. (a) Sample graph represented as (b) adjacency matrix, (c) adjacency lists, (d) incidence matrix,
(e) laplacian matrix
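The four representations named in the caption can be sketched in a few lines of Python. This is only an illustrative sketch: the sample graph below is hypothetical (it is not the graph drawn in Figure 1), it is assumed to be undirected, and nodes are assumed to be numbered 0..n-1. The incidence matrix is taken as an n x e boolean matrix, and the laplacian holds the node degree on the diagonal and -1 for each connected pair.

```python
# Hypothetical undirected sample graph (NOT the graph drawn in Figure 1).
# Nodes are assumed to be the integers 0..n-1.
NODES = [0, 1, 2, 3]
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3)]


def adjacency_matrix(nodes, edges):
    """n x n matrix holding a 1 wherever two nodes are connected."""
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1  # undirected graph: mirror the entry
    return matrix


def adjacency_lists(nodes, edges):
    """Map each node to the sorted list of its neighbors."""
    adj = {u: [] for u in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return {u: sorted(neighbors) for u, neighbors in adj.items()}


def incidence_matrix(nodes, edges):
    """n x e boolean matrix: row = node, column = edge."""
    matrix = [[False] * len(edges) for _ in nodes]
    for j, (u, v) in enumerate(edges):
        matrix[u][j] = True
        matrix[v][j] = True
    return matrix


def laplacian_matrix(nodes, edges):
    """Degree on the diagonal, -1 for connected pairs, 0 otherwise."""
    adj = adjacency_matrix(nodes, edges)
    n = len(nodes)
    return [
        [sum(adj[i]) if i == j else -adj[i][j] for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    print(adjacency_lists(NODES, EDGES)[2])   # [0, 1, 3]
    print(laplacian_matrix(NODES, EDGES)[2])  # [-1, -1, 3, -1]
```

In this sketch every column of the incidence matrix has exactly two entries set, and every row of the laplacian sums to zero, which is a quick sanity check for both structures.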
ent, then the algorithm scans the adjacency list of neighbors, which can be expensive for nodes with a large number of connections. An alternative implementation is to keep the adjacency lists sorted by node identifier, which reduces the search time to logarithmic. However, if the lists are sorted, the insertion time is logarithmic, too.
Incidence Matrix
An incidence matrix is a data structure that represents the graph as a bidimensional boolean matrix of n rows and e columns (n·e positions), where n=|V| and e=|E|. Each column represents an edge, and each row represents a node. Therefore, each column indicates the nodes that are connected by a certain edge. Conversely, each row indicates all the edges that are connected to a node.
Generally, an incidence matrix is not efficient for implementing graphs because it requires n·e bits, and for most graphs, the number of edges is significantly larger than the number of nodes. The most significant application of incidence matrices is for representing hypergraphs, in which one edge connects an arbitrary number of nodes. We note that although for regular graphs the columns of the incidence matrix always have two positions set to true, for hypergraphs the number of positions set in one column may be arbitrarily large.
Laplacian Matrix
The laplacian matrix is a variant of the matrix representation. The laplacian matrix is a bidimensional array of n·n integers, where n=|V|. The diagonal of the laplacian matrix indicates the degree of each node. The rest of the positions are set to -1 if the two vertices are connected and 0 otherwise.
The main advantage of the laplacian representation of a graph is that it allows analyzing the graph structure by means of spectral analysis. Spectral analysis calculates the eigenvalues of the laplacian matrix, which can be interpreted and applied to extract properties of the graph. Later in this chapter, we show an application of spectral analysis that improves the cache locality of graph accesses. For the interested reader, F. Chung (1994) provides a comprehensive description of the mathematical foundations of this topic.
COMPRESSED ADJACENCY LISTS
Although many real graphs are not homogeneous, as shown by Leskovec et al. (2008), they have regularities that can be exploited for compressing data. The characteristic functions of these large real graphs (e.g. the degree distributions of the nodes or even the number of edges with respect to the nodes) fit power law distributions very well. Boldi and Vigna (2004) exploited such features for compressing the adjacency lists of large graphs, in their case the set of links among web pages. They sort the web addresses lexicographically and map each address to a node in the graph. The edges are the links between the nodes. Starting from adjacency lists such as those described in the previous section, we describe four compression techniques, proposed by Boldi and Vigna (2004), which can be cascaded in sequence. We show an example of this in Figure 2.
Gap encoding: It represents the adjacency lists by differential encoding of the node identifiers. Differential encoding substitutes the original node identifiers by the difference between two consecutive nodes, which reduces the overall size of the lists. The adjacency list for node x, $A(x) = (a_1, a_2, a_3, \ldots, a_n)$, is encoded as $(v(a_1 - x),\ a_2 - a_1 - 1,\ a_3 - a_2 - 1,\ \ldots,\ a_n - a_{n-1} - 1)$. Since the adjacency list A(x) is originally sorted, all the elements but the first are non-negative numbers. In order to keep the first element non-negative as well, the mapping for the first node is the following:
$$v(x) = \begin{cases} 2x, & \text{if } x \ge 0 \\ 2|x| - 1, & \text{if } x < 0 \end{cases}$$
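Gap encoding and its inverse can be sketched as follows. This is a minimal illustration rather than Boldi and Vigna's implementation: the function names are hypothetical, the adjacency list is assumed to be sorted and free of duplicates, and the result is kept as a plain list of integers (a real system would go on to store these small integers with a variable-length bit code).

```python
def v(x):
    """Map a possibly negative integer to a non-negative one."""
    return 2 * x if x >= 0 else 2 * abs(x) - 1


def v_inverse(code):
    """Undo v: even codes are non-negative values, odd codes are negative."""
    return code // 2 if code % 2 == 0 else -((code + 1) // 2)


def gap_encode(x, neighbors):
    """Gap-encode the sorted, duplicate-free adjacency list of node x."""
    if not neighbors:
        return []
    encoded = [v(neighbors[0] - x)]  # first entry is relative to x itself
    for previous, current in zip(neighbors, neighbors[1:]):
        encoded.append(current - previous - 1)  # gap between consecutive ids
    return encoded


def gap_decode(x, encoded):
    """Rebuild the adjacency list of node x from its gap encoding."""
    if not encoded:
        return []
    neighbors = [x + v_inverse(encoded[0])]
    for gap in encoded[1:]:
        neighbors.append(neighbors[-1] + gap + 1)
    return neighbors


if __name__ == "__main__":
    print(gap_encode(21, [20, 28, 58]))  # [1, 7, 29]
    print(gap_decode(21, [1, 7, 29]))    # [20, 28, 58]
```

For node 21 with the sorted list (20, 28, 58), the first entry becomes v(20 - 21) = 1 and the remaining entries are the gaps 7 and 29, so the three identifiers are replaced by three much smaller numbers.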
Reference compression: Since nodes may share a large set of neighbors, it is more compact to represent a node with respect to the differences from the adjacency lists of previous nodes. Reference compression splits the adjacency list into three values: the reference, the copy list and the extra nodes. The reference is an offset to the node whose adjacency list serves as the basis for the compression. The node selected as reference is the one among the previous nodes for which the compression is the largest (it has been tested that selecting the node which compresses the best among the three previous node identifiers yields good compression ratios). The copy list for node x with reference y is a bitmap of |A(y)| positions where the i-th bit is set if the i-th entry of A(y) is in A(x). The extra nodes are a regular adjacency list that contains the node identifiers in A(x) – A(y).
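The split into reference, copy list and extra nodes can be sketched as below. The function names and the sample lists are hypothetical, and the selection rule is a simplifying assumption: it scores a candidate reference only by how few extra nodes remain, whereas a full implementation would compare the actual encoded sizes. An offset of 0 is used here to mean that no reference is taken.

```python
def reference_compress(neighbors_x, neighbors_y):
    """Encode A(x) against the reference list A(y).

    Returns (copy_list, extra_nodes): copy_list has one bit per entry of
    A(y), set when that entry also appears in A(x); extra_nodes holds the
    identifiers of A(x) that are missing from A(y).
    """
    in_x = set(neighbors_x)
    in_y = set(neighbors_y)
    copy_list = [1 if node in in_x else 0 for node in neighbors_y]
    extra_nodes = [node for node in neighbors_x if node not in in_y]
    return copy_list, extra_nodes


def reference_decompress(copy_list, extra_nodes, neighbors_y):
    """Rebuild the sorted list A(x) from its reference-compressed form."""
    copied = [node for bit, node in zip(copy_list, neighbors_y) if bit]
    return sorted(copied + extra_nodes)


def choose_reference(x, adjacency, window=3):
    """Return the offset (1..window) of the best previous node, or 0.

    Assumption of this sketch: "best" means fewest extra nodes left over.
    """
    best_offset, best_cost = 0, len(adjacency[x])
    for offset in range(1, window + 1):
        y = x - offset
        if y not in adjacency:
            continue
        _, extra = reference_compress(adjacency[x], adjacency[y])
        if len(extra) < best_cost:
            best_offset, best_cost = offset, len(extra)
    return best_offset


if __name__ == "__main__":
    # Hypothetical adjacency lists, not the ones shown in Figure 2.
    adjacency = {14: [13, 15, 16, 17, 18], 15: [13, 15, 16, 22]}
    print(choose_reference(15, adjacency))                   # 1
    print(reference_compress(adjacency[15], adjacency[14]))  # ([1, 1, 1, 0, 0], [22])
```

With these sample lists, node 15 shares three of its four neighbors with node 14, so only the bitmap and the single extra node 22 need to be stored.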
Differential compression: The copy list, obtained by reference compression, is divided into blocks of consecutive 0’s and 1’s. This step compresses the copy lists into two values: a block counter and a copy block. The block counter accounts for the number of blocks in the copy block. The copy block is a list of integers that indicate the length of each block of 0’s and 1’s. By agreement, it is assumed that the first block is a block of 1’s, and thus if the first block is a 0’s block the first element of the copy block is a 0. Since only the first number may be a 0, each element of the copy block can be decremented by one for all the blocks except for the first one. In order to generate a more compact copy block, the last element can be deduced from the block counter and the outdegree of the node.
Interval representation: This procedure aims at compacting the extra nodes lists. If the list has at least two consecutive integers, then they are encoded as an interval, which is represented as two values: the left extreme and the length of the interval. The left extreme is coded as the difference between the left extreme of the current interval and the right extreme of the previous interval, minus two (because at least there are two elements in the interval). For the case of the first interval, the node identifier is subtracted from the left extreme and the result is represented with the v(x) function. The length of the interval is computed as the number of elements minus the minimum number of elements in an interval. The node sequences that are too short to become an interval are called residuals and are encoded with gap encoding.
We summarize the previously described compression mechanism in Table 1.
Example: Figure 2 depicts an example of the compression scheme described previously. In Figure 2(a), we enumerate the neighbors of nodes 21 and 22 with their respective adjacency lists. In the first step, gap encoding reduces the magnitude of the node identifiers, so that fewer bits are needed per identifier (Figure 2(b)). The reference column in Figure 2(c) indicates the offset from where the copy list is compressed. For node 21, the zero indicates that no compression is performed, and the one for the second node indicates that the reference points to the previous node identifier, which is node 21. The copy list for node 22 is a bitmap that indicates the elements that are shared with node 21, such as identifiers 23 or 24. Differential compression in Figure 2(d) compresses the previously defined copy lists: the copy list of node 22 contained four sections of consecutive bits of length 1, 1, 3 and 2 (given that a section of consecutive bits has at least length one, these numbers can be decreased by one unit to reduce the space). Finally, interval representation compresses the two ranges of consecutive residuals of node 21: from 23 to 26 and 30 to 31. For the former, the left extreme is encoded as $v(E_1 - x) = 2 \cdot (23 - 21) = 4$, and its length as $L_1 - 2 = 4 - 2 = 2$. Similarly, for the latter range, the left extreme is encoded as $E_2 - E_1 - L_1 - 1 = 30 - 23 - 4 - 1 = 2$ and the length as $L_2 - 2 = 2 - 2 = 0$. The remaining residuals 20, 28, 58 are encoded as $v(R_1 - x) = 2 \cdot |20 - 21| - 1 = 1$, $R_2 - R_1 - 1 = 28 - 20 - 1 = 7$ and $R_3 - R_2 - 1 = 58 - 28 - 1 = 29$.
In particular, the previously stated compression techniques are very effective because many graphs, and particularly web graphs, have the following characteristics (Boldi and Vigna (2004)):
• Some nodes act as a hub with a number of connections significantly over the average degree of the graph.
• Many relations are found in a community. In the case of web graphs it is likely that a large portion of websites are linking to other sections of the same website.
• There are many nodes with shared neighbors. In the case of web graphs, two pages have either a very similar set of links, or they are completely disjoint.
IMPROVING THE DATA LOCALITY OF A GRAPH
In addition to designing the data structures to reach good performance, it is important to take into account the impact of the computer architecture on the data structures. Computers are prepared to take advantage of both spatial and temporal locality thanks to the design of the memory hierarchy. Thus, the way data is laid out physically in memory determines the locality to be obtained. In this section, we review techniques suited to improving the use of the L1 and L2 cache levels in graph management. We describe several techniques that relabel the sequence of node identifiers in order to improve the data locality. In other words, in a graph adjacency matrix representation, these techniques would exchange rows and columns of the graph to improve the cache hit ratio.
We observe that operations on graphs exhibit an important component of spatial locality. Spatial locality arises when, once a certain data item has been accessed, the nearby data items are likely to be accessed in the following computations. This is particularly evident in graph algorithms, such as traversals, in which once a node is accessed it is likely that its neighbors are explored in the near future. Bearing this in mind, a graph whose data items are mapped taking into account the accesses will be faster than one with the nodes randomly mapped.
Breadth First Search Layout (BFSL)
Al-Furaih and Ranka (1998) proposed a method to augment the space locality of large graphs based on breadth first search traversals. This algorithm takes as input the sequence of vertices of a graph, and generates a permutation of the vertices which
obtains better cache performance for graph traversals. The algorithm selects a node at random as the origin of the traversal. Then, the graph is traversed following a breadth first search algorithm, generating a list of vertex identifiers in the order in which they are visited. Finally, BFSL takes the generated list and assigns the node identifiers sequentially.
The rationale of the algorithm is that when a node is explored using a BFS algorithm, the neighboring nodes are queued to be explored together. Thus, BFSL packs the neighboring vertices of a node together. BFSL is the optimal policy for the BFS traversal starting from the selected origin node, but for traversals starting from a different node it might be far from optimal. This difference is more severe for traversals that start from nodes distant from the origin node.
Cuthill-McKee
Since graphs may be seen as matrices, we can map the locality problem to a similar one from matrix reorganization that is called the minimum bandwidth problem. The bandwidth of a row in a matrix is the maximum distance between non-zero elements, with the condition that one is on the left of the diagonal and the other on the right of the diagonal. The bandwidth of a matrix is the maximum of the bandwidths of its rows. The bandwidth minimization problem is known to be NP-hard, and thus for large matrices (or graphs) the solutions are only approximated. Matrices with low bandwidths are more cache friendly, because the non-zero elements (i.e. the edges of the graph) are clustered around the diagonal of the matrix. Therefore, if the graph matrix has a low bandwidth, the neighbors of a node are close to the diagonal of the matrix and are all clustered.
Cuthill and McKee (1969) proposed one of the most popular bandwidth minimization techniques for sparse matrices. This method relabels the vertices of a matrix according to a sequence, with the aid of a heuristically guided traversal. The node with the first identifier, which is the node from where the traversal starts, is the node with the smallest degree in the whole graph. Then, a BFS traversal starts, applying a heuristic that selects those nodes that have the smallest degree and have not been visited yet. The nodes are labeled sequentially as they are visited by the traversal.
Recursive Spectral Bisection
Recursive Spectral Bisection (RSB) is a graph layout method designed by Barnard and Simon (1993) that takes advantage of the spectral analysis of a graph. Spectral methods are based on the mathematical properties of the eigenvectors of a matrix. If we represent a graph G using its laplacian representation L(G), the eigenvectors of the graph are the vectors x that satisfy the following relation:
$$L(G) \cdot x = \lambda \cdot x,$$
where λ is a scalar value. RSB orders the set of eigenvalues of a graph by increasing value, and selects the eigenvector $x_2$, which corresponds to the second smallest eigenvalue $\lambda_2$. The eigenvector associated with the second smallest eigenvalue is known as the Fiedler vector, which is an indicator of how close two vertices are in a graph. Two nodes are close in a graph if the difference between their components in the Fiedler vector is small. RSB applies this to keep nodes that are close in the graph also close in memory, to improve the spatial locality of the graph. Therefore, RSB sorts the nodes by their value in the Fiedler vector, and labels the vertices sequentially.
Although the Fiedler vector may be computed by standard numerical techniques, such as the Lanczos (1950) algorithm, its direct application to large graphs is unfeasible. RSB provides an approximated procedure to find the eigenvalues of a graph. RSB is divided in a three step recursive process that contracts the graph until
it can be computed efficiently with the Lanczos algorithm, and then interpolates the results for the
uncontracted graph using the Rayleigh quotient iteration:

• Contraction: The contraction phase finds a maximal independent set of the original graph. An
independent set of vertices is a selection of vertices such that no two of them are connected by
an edge. This set is maximal if no further vertices can be added to it. Note that the maximal
independent set may not be unique. The new contracted graph then has as vertices those that are
part of the maximal independent set. Two nodes of the contracted graph are connected if their
neighborhoods in the original graph share nodes. Therefore, the contracted graph has fewer
nodes than the original graph, but it still keeps the original graph structure, and properties
inferred on the contracted graph can be extrapolated to the uncontracted graph. Once the graph
is small enough to be handled by the Lanczos algorithm, the Fiedler vector is computed, which
ends the recursion.
• Interpolation: Given a Fiedler vector of the contracted graph, RSB estimates the Fiedler vector
of the uncontracted graph. The value of the Fiedler vector is copied from the contracted graph to
the uncontracted one for the nodes that were part of the maximal independent set (i.e. the nodes
that are in both graphs). For the remaining nodes, the value in the Fiedler vector is obtained by
averaging the components of their neighboring nodes. Since every interpolated node must have
at least one neighbor in the contracted graph, due to the maximal independent set construction,
the value of the Fiedler vector is defined for all components.
• Refinement: In order to increase the quality of the approximation, RSB applies a refinement
step that uses the Rayleigh quotient iteration (RQI) (Parlett (1974)), an iterative algorithm that
in each iteration estimates an approximation of the eigenvalues of the matrix and converges to
the exact solution. Since the Fiedler vector obtained in the interpolation phase is already a good
approximation to the exact solution, the RQI process converges fast.

GRAPH PARTITIONING IN DISTRIBUTED REPRESENTATIONS

As discussed in previous sections, an important aspect of network/graph information is its size.
Some graphs are too large to be fully loaded into the main memory of a single computer, which
implies an intensive use of secondary storage that degrades the performance of graph
applications. A scalable solution consists of distributing the graph over multiple computers in
order to aggregate their computing resources.
Most of the initial work on handling graph-like data in parallel or distributed data management
was done in the context of the relational model, where flat tables are easy to partition. In the late
1980s, there were interesting attempts to deal with complex objects (mostly trees) in the context
of object-oriented databases (Khoshafian, Valduriez and Copeland (1988); Valduriez, Khoshafian
and Copeland (1986)). This work can be useful for dealing with XML documents (trees) in
parallel/distributed systems but requires major rethinking to deal with more general graphs.
More recent work has focused on the parallel aspects of algorithms that solve important
problems over graphs, such as the shortest path problem (Crobak et al. (2007); Madduri et al.
(2007)) or finding frequent patterns, as in the work presented by Reinhardt and Karypis (2007).
Other graph algorithms like the shortest path or graph
isomorphism have been used to assess their scalability on new parallel architectures (Bader, Cong
and Feo (2005) and Underwood et al. (2007)). Parallelism in graph algorithms has also been
widely exploited for other typical graph operations such as BFS, graph partitioning, spanning
tree search, graph coloring, etc. Some examples may be found in (Bader and Cong (2005); Devine
et al. (2006); Yoo et al. (2005)).

One and Two Dimensional Graph Partitioning

Yoo et al. (2005) proposed partitioning the graph at random into parts and distributing them
among the nodes of a shared-nothing parallel system in order to solve BFS more efficiently. More
precisely, they present two approaches for distributing BFS: BFS with 1D and 2D partitionings,
which are two different ways of partitioning the adjacency matrix of a graph. Although adjacency
matrices can grow enormously if the graph has many nodes, as we saw previously, they handled
this problem by deploying their system on a huge supercomputer: the BlueGene/L system with
over 30000 computing nodes. In the following, we compare the two partitioning strategies in
relation to the distributed BFS algorithm (DBFS) that Yoo et al. used for their work.
With 1D partitioning (1D), matrix rows are randomly assigned to the p nodes in the system
(Figure 3(a)). Therefore, each row contains the complete list of neighbors of the corresponding
vertex. With 2D partitioning (2D), the idea is to benefit from the fact that nodes are organized in
R x C = p processor meshes in the architecture of the system that they are using, i.e. in a
two-dimensional grid such that each node is connected to its four immediate neighbors. The
adjacency matrix is divided into R x C blocks of rows and C blocks of columns such that each
node is assigned those values in the adjacency matrix corresponding to the intersection between
C blocks of rows and a block of columns, as shown in Figure 3(b). Therefore, the list of
adjacencies of a given vertex is divided among R nodes, and the set of vertices in the graph is
assigned to C groups of R nodes. Thanks to this, the number of nodes to which a certain node
sends data can be reduced drastically, since each node communicates with at most R + C - 2
nodes instead of p nodes as in the case of 1D partitioning.
However, these previous attempts to partition a graph and parallelize BFS ignore the possible
effect that different partitioning techniques might have on the amount of inter-node
communication, which could degrade system performance. How critical this problem becomes
depends on the query. For instance, the scalability of communications might be a problem for all
those types of queries that visit the graph globally and need to traverse it using operations such
as BFS repeatedly from multiple source vertices.

Other Partitioning Techniques

Comprehensive studies on data partitioning strategies have also been conducted for other
database models. For instance, in OODBMSs, the vertices could be viewed as the objects and the
edges could be understood as the relationships between objects. A comprehensive survey of
partitioning techniques for OODBMSs was written by Özsu and Valduriez (1999). However,
typical graph database queries are completely different from those in OODBMSs. For example,
structural queries such as finding the shortest distance between two vertices in a very large graph
might be crucial in graph databases but irrelevant in OODBMSs, making previous partitioning
strategies unsuitable for these new requirements.
Graph partitioning. In the light of all these analyses, a possible solution might be partitioning
the graph so that the number of edges between partitions is minimal or as small as possible.
Much work has been devoted to the general problem of balanced graph partitioning in the last
four decades.
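The 1D and 2D schemes described above, and the inter-node communication that a partitioning induces, can be illustrated with a minimal Python sketch. All function names here are hypothetical, and the 2D mapping is a simplified checkerboard layout rather than the exact block layout used by Yoo et al.; the partner counts assume that, on an R x C mesh, a node exchanges data only with the other nodes in its mesh row and mesh column.

```python
import random

def assign_rows_1d(num_vertices, p, seed=0):
    """1D partitioning: each adjacency-matrix row (one vertex) goes to a random node."""
    rng = random.Random(seed)
    return [rng.randrange(p) for _ in range(num_vertices)]

def owner_2d(u, v, num_vertices, R, C):
    """Simplified checkerboard mapping of matrix entry (u, v) to a mesh position.

    Assumption: this is an illustrative layout, not the block layout of Yoo et al.
    """
    rows_per_block = -(-num_vertices // R)  # ceiling division
    cols_per_block = -(-num_vertices // C)
    return (u // rows_per_block, v // cols_per_block)

def max_partners_1d(p):
    """With 1D partitioning a node may have to send data to every other node."""
    return p - 1

def max_partners_2d(R, C):
    """With 2D partitioning a node talks to its mesh row and mesh column only."""
    return (R - 1) + (C - 1)

def cut_edges(edges, owner):
    """Edges whose endpoints live on different nodes (each implies communication)."""
    return sum(1 for u, v in edges if owner[u] != owner[v])

# A ring of 8 vertices split over 2 nodes, contiguously and at random.
ring = [(i, (i + 1) % 8) for i in range(8)]
contiguous = [0, 0, 0, 0, 1, 1, 1, 1]
random_owner = assign_rows_1d(8, 2)
print(cut_edges(ring, contiguous), cut_edges(ring, random_owner))
print(max_partners_1d(16), max_partners_2d(4, 4))
```

On the ring, the contiguous assignment cuts only two edges, while a random assignment will usually cut more, which is the effect on inter-node communication that motivates the partitioning techniques discussed in this section.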
Although most of the work has been generalized to hypergraphs, we will focus on the particular
case of graphs, where edges only connect two vertices. Given a graph, the objective is to divide it
into k partitions such that each partition contains approximately the same number of vertices and
the number of edges connecting vertices located in different partitions is minimal. In particular,
when k = 2, the problem is called minimum graph bisection. This problem is also known as
finding the minimum edge-separator, i.e. the minimal set of edges whose removal from the graph
leaves two disconnected subgraphs of equal size (containing the same number of vertices), and
has been deeply studied in many different fields, such as VLSI Computer-Aided Design (see the
work by Alpert and Kahng (1995), for example) or communication networks (Arora, Rao and
Vazirani (2004)). Finding an optimal graph bisection has been proven to be NP-hard by Garey
and Johnson (1990) and, as a consequence, finding an optimal k-way partition is also at least
NP-hard. Because of this, most of the work presented in the literature has focused on developing
polynomial-time heuristic algorithms that give good sub-optimal solutions.
First steps on Balanced Graph Bisection. The proposals on graph partitioning presented in the
literature can be classified into two different categories (Ashcraft and Liu (1997)): (i) direct
approaches which construct partitions, such as the nested dissection algorithm presented by
George and Liu (1978), based on alternating level structures and spectral methods (Pothen,
Simon and Liou (1990)), and (ii) iterative approaches which improve these partitions, such as
the algorithm presented by Kernighan and Lin (1970) (KL) or its faster version presented by
Fiduccia and Mattheyses (1982) (FM), which iteratively swap vertices between the existing
partitions or move a vertex from one partition to the other, respectively. Iterative improvement
algorithms have generally been preferred to other well-known optimization techniques such as
simulated annealing and genetic algorithms because they have the potential to combine good
sub-optimal solutions with fast run times. These algorithms rely on a priority queue of vertex
moves to greedily select the best vertex move (in
the case of the FM algorithm) or the best vertex swap (in the case of the KL algorithm) in terms
of the objective function. They proceed in steps, during each of which each vertex is moved at
most once. Vertex moves resulting in negative gain are also possible, provided that they represent
the best feasible move at that point. A step terminates when none of the remaining vertex moves
are feasible. The gain of a step in terms of the objective function is then computed as the best
partial sum of the gains of the individual vertex moves that are made during that step. The
algorithm terminates when the last completed step does not yield a gain in the objective function.
However, the quality of the solution of KL or FM is not stable, which is a common weakness of
the iterative improvement approaches based on moves. Because of this, random multi-start
approaches are used to alleviate the problem: the algorithm is applied repeatedly starting from
random initial partitions, and the best solution found is returned (Alpert and Kahng (1995)). A
complete overview of stochastic methods is provided by Battiti and Bertossi (1999).
The main disadvantage of the above algorithms is that they make vertex moves based solely on
local information (the immediate gain of the vertex move). This has led to several enhancements
to the basic algorithms that attempt to capture global properties or that incorporate look-ahead.
In addition, flat partitioning algorithms, i.e. algorithms that work directly on the original graph,
suffer significant degradation in terms of quality and running time when the size of the graph
increases.
Thus, algorithms based on the multilevel paradigm have been proposed as an alternative to
mitigate this problem. Some examples are the work presented by Hendrickson and Leland
(1995), Karypis et al. (1997), Karypis and Kumar (1998) or Trifunovic and Knottenbelt (2008).
In this type of algorithm, the original graph is successively approximated by smaller graphs
during the coarsening phase, where each vertex in the coarse graph is usually a connected subset
of vertices in the original one. Next, the smallest graph in this sequence is partitioned. Finally,
during the uncoarsening phase, this partition is projected back through the sequence of
successively larger graphs onto the original graph, with optional heuristic refinements applied at
each step.
Multiple-Way Graph Partitioning. Much effort has been devoted to extending the graph bisection
problem to find a near-optimum multiple-way separator that splits the graph into k parts. A
k-way partition of a graph G is constructed either directly or by the recursive bisection of G. A
comprehensive survey of different heuristic approaches to multiple-way graph partitioning is
presented in (Alpert and Kahng (1995)). As an example, the FM algorithm has been extended in
order to compute a k-way partition directly by Sanchis (1989) (a k-way extension to the KL
algorithm was first proposed in the original paper by Kernighan and Lin). However, this first
k-way formulation of the FM algorithm is dominated by the FM algorithm implemented in a
recursive bisection framework. Hence an enhanced k-way algorithm based on a pair-wise
application of the FM algorithm was proposed by Cong and Lim (1998). Another randomized
greedy refinement algorithm presented by Karypis and Kumar (1999) has been shown to yield
partitions of good quality with fast run times.
Although the first papers on multi-way graph partitioning were written two decades ago,
research on this topic is still very active, as shown by recent publications such as that by
Trifunovic et al. (2008), where the authors propose new parallel algorithms for the hypergraph
partitioning problem (only scalable when the maximum vertex and hyperedge degrees are small,
which may not be the case in real graphs). Although this algorithm allows for easy parallelization
and improves the quality of results with respect to previous approaches, this is done at the cost
of increasing the execution time, which might make it unsuitable for very large graphs.
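The iterative improvement steps described earlier in this section (each vertex moved at most once per step, negative-gain moves allowed, and the best partial sum of gains kept) can be sketched in Python for the bisection case. This is a simplified illustration under stated assumptions, not the FM algorithm as published: the names `edge_cut`, `move_gain`, `fm_pass` and `refine` are hypothetical, a linear scan replaces the priority queue of vertex moves, and a move is treated as feasible when the two sides differ by at most `max_imbalance` vertices afterwards.

```python
def edge_cut(adj, side):
    """Number of edges whose endpoints lie on different sides."""
    return sum(1 for u in adj for v in adj[u] if u < v and side[u] != side[v])

def move_gain(adj, side, v):
    """Reduction of the edge cut obtained by moving v to the other side."""
    external = sum(1 for w in adj[v] if side[w] != side[v])
    return external - (len(adj[v]) - external)

def fm_pass(adj, side, max_imbalance=2):
    """One step: move each vertex at most once, then keep the best prefix of moves."""
    side = dict(side)
    sizes = [0, 0]
    for s in side.values():
        sizes[s] += 1
    locked, moves = set(), []
    total = best_total = best_len = 0
    while True:
        # Feasible moves: unlocked vertices whose move keeps the sides balanced.
        feasible = [v for v in adj if v not in locked
                    and abs((sizes[side[v]] - 1) - (sizes[1 - side[v]] + 1)) <= max_imbalance]
        if not feasible:
            break
        # Linear scan instead of a priority queue (simplification).
        v = max(feasible, key=lambda x: move_gain(adj, side, x))
        total += move_gain(adj, side, v)  # may be negative
        sizes[side[v]] -= 1
        side[v] = 1 - side[v]
        sizes[side[v]] += 1
        locked.add(v)
        moves.append(v)
        if total > best_total:
            best_total, best_len = total, len(moves)
    # Undo every move made after the best partial sum was reached.
    for v in moves[best_len:]:
        side[v] = 1 - side[v]
    return side, best_total

def refine(adj, side):
    """Repeat steps until the last completed step yields no gain."""
    while True:
        side, gained = fm_pass(adj, side)
        if gained <= 0:
            return side

# Two triangles joined by the edge 2-3, starting from a poor bisection.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
refined = refine(adj, start)
print(edge_cut(adj, start), edge_cut(adj, refined))
```

In this small example the starting bisection cuts five edges and the refined one cuts only the edge joining the two triangles. A multi-start variant would call `refine` from several random initial bisections and keep the result with the smallest `edge_cut`.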
to be regarded in relation to the first volume", is reprinted ahead of it. It reads:
The art of war has had so essential a share in the present development of the
fate of the states of Europe that, for the friend of history in general as for
the military expert in particular, it has become a scholarly necessity to
collect from eyewitnesses individual observations and experiences that are
often not at all suited to larger works and yet are important for theory as
well as for practice, or for general history, and indeed everything that
concerns the history of the art of war in the 19th century and is new, and to
bring together the views of knowledgeable men on memorable events of war in an
archive devoted exclusively to this purpose.
The most valuable contributions to the writings of von Bülow, von Scharnhorst
and others lie hidden in the diaries of distinguished officers, diaries that
would occupy a place of honor in a periodical such as von Rouvroy's
»Militärische Minerva«, von Rühl's »Pallas«, the »Oesterreichische militärische
Zeitschrift« and similar archives of military history once were. Are these
handwritten remarks and reports to be lost to scholarship and forgotten, or
should one wait until they appear late, after the death of the eyewitnesses, in
scattered memoirs, where they are less subject to public examination and to
comparison with other facts?
Now that the weapons are at rest and the laurel-wreathed field diaries are
being put in order, now the memory of all that has happened is as vivid and
fresh as the need to inquire and to know is keen. Should our brave
contemporaries therefore not wish to exchange among themselves, and to examine
together as military experts, what they have observed, done and experienced,
and what they themselves have gathered that is of value for art and science?
The wars since 1792 offer so rich a yield for the history of the art of war
that a periodical of military history appearing in an informal series of
volumes, as ours is to be, will not lack new material of scholarly value, if
the discerning military men from all the armies that since 1792 have fought in
most countries of Europe according to almost the same principles of military
training are willing to join with us for our purpose.
We invite them, as the most authoritative witnesses of the ever-memorable
history of our time, to do so, with the confidence inspired in us, not without
cause, by our conviction of the intellectual coherence and the public spirit
that now lead all educated people toward scholarly activity. For we already
enjoy the promise of several worthy men, and we can promise the public that in
our monographs on military history it will find only narratives and
characterizations of significant or lesser-known memorable events of war,
chiefly from the most recent period, written with military expertise by
eyewitnesses and participants, or selected critically from less accessible
sources, and illustrated by maps and plans where scholarship requires it,
without any admixture of politics or of extraneous matters.
Each volume of 24-30 sheets is to contain six or more narratives or accounts of
this kind. The first will appear at the Easter fair of next year, and the
continuation of our undertaking can, as we may hope given the measures taken,
only gain in novelty and interest.
All contributions, which are urgently invited and which will be suitably
remunerated on request, are to be sent to the undersigned publisher.