Graph
Algorithms
Practical Examples in Apache Spark & Neo4j
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Graph Algorithms, the cover image of a
European garden spider, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Neo4j. See our statement of editorial independence.
978-1-492-05781-9
[LSI]
Table of Contents
Preface
Foreword
1. Introduction
What Are Graphs?
What Are Graph Analytics and Algorithms?
Graph Processing, Databases, Queries, and Algorithms
OLTP and OLAP
Why Should We Care About Graph Algorithms?
Graph Analytics Use Cases
Conclusion

Summary

5. Centrality Algorithms
Example Graph Data: The Social Graph
Importing the Data into Apache Spark
Importing the Data into Neo4j
Degree Centrality
Reach
When Should I Use Degree Centrality?
Degree Centrality with Apache Spark
Closeness Centrality
When Should I Use Closeness Centrality?
Closeness Centrality with Apache Spark
Closeness Centrality with Neo4j
Closeness Centrality Variation: Wasserman and Faust
Closeness Centrality Variation: Harmonic Centrality
Betweenness Centrality
When Should I Use Betweenness Centrality?
Betweenness Centrality with Neo4j
Betweenness Centrality Variation: Randomized-Approximate Brandes
PageRank
Influence
The PageRank Formula
Iteration, Random Surfers, and Rank Sinks
When Should I Use PageRank?
PageRank with Apache Spark
PageRank with Neo4j
PageRank Variation: Personalized PageRank
Summary

Strongly Connected Components with Neo4j
Connected Components
When Should I Use Connected Components?
Connected Components with Apache Spark
Connected Components with Neo4j
Label Propagation
Semi-Supervised Learning and Seed Labels
When Should I Use Label Propagation?
Label Propagation with Apache Spark
Label Propagation with Neo4j
Louvain Modularity
When Should I Use Louvain?
Louvain with Neo4j
Validating Communities
Summary

The Coauthorship Graph
Creating Balanced Training and Testing Datasets
How We Predict Missing Links
Creating a Machine Learning Pipeline
Predicting Links: Basic Graph Features
Predicting Links: Triangles and the Clustering Coefficient
Predicting Links: Community Detection
Summary
Wrapping Things Up
Index
graph algorithms are used within workflows: one for general analysis and one for
machine learning.
At the beginning of each category of algorithms, there is a reference table to help you
quickly jump to the relevant algorithm. For each algorithm, you’ll find:
This element indicates a warning or caution.
Our unique network of experts and innovators share their knowledge and expertise
through books, articles, conferences, and our online learning platform. O’Reilly’s
online learning platform gives you on-demand access to live training courses, in-
depth learning paths, interactive coding environments, and a vast collection of text
and video from O’Reilly and 200+ other publishers. For more information, please
visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/graph-algorithms.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our
website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
We’ve thoroughly enjoyed putting together the material for this book and thank all
those who assisted. We’d especially like to thank Michael Hunger for his guidance, Jim
Webber for his invaluable edits, and Tomaz Bratanic for his keen research. Finally, we
greatly appreciate Yelp permitting us to use its rich dataset for powerful examples.
Foreword
What do the following things all have in common: marketing attribution analysis,
anti-money laundering (AML) analysis, customer journey modeling, safety incident
causal factor analysis, literature-based discovery, fraud network detection, internet
search node analysis, map application creation, disease cluster analysis, and analyzing
the performance of a William Shakespeare play? As you might have guessed, what
these all have in common is the use of graphs, proving that Shakespeare was right
when he declared, “All the world’s a graph!”
Okay, the Bard of Avon did not actually write graph in that sentence; he wrote stage.
However, notice that the examples listed above all involve entities and the relationships
between them, including both direct and indirect (transitive) relationships.
Entities are the nodes in the graph—these can be people, events, objects, concepts, or
places. The relationships between the nodes are the edges in the graph. Therefore,
isn’t the very essence of a Shakespearean play the active portrayal of entities (the
nodes) and their relationships (the edges)? Consequently, maybe Shakespeare could
have written graph in his famous declaration.
What makes graph algorithms and graph databases so interesting and powerful isn’t
the simple relationship between two entities, with A being related to B. After all, the
standard relational model of databases instantiated these types of relationships in its
foundation decades ago, in the entity relationship diagram (ERD). What makes
graphs so remarkably important are directional relationships and transitive relationships.
In directional relationships, A may cause B, but not the opposite. In transitive
relationships, A can be directly related to B and B can be directly related to C, while A
is not directly related to C, so that consequently A is transitively related to C.
With these transitive relationships (particularly when they are numerous and
diverse, with many possible relationship/network patterns and degrees of separation
between the entities), the graph model uncovers relationships between entities that
otherwise may seem disconnected or unrelated, and are undetected by a relational
database. Hence, the graph model can be applied productively and effectively in many
network analysis use cases.
Consider this marketing attribution use case: person A sees the marketing campaign;
person A talks about it on social media; person B is connected to person A and sees
the comment; and, subsequently, person B buys the product. From the marketing
campaign manager’s perspective, the standard relational model fails to identify the
attribution, since B did not see the campaign and A did not respond to the campaign.
The campaign looks like a failure, but its actual success (and positive ROI) is discovered
by the graph analytics algorithm through the transitive relationship between the
marketing campaign and the final customer purchase, through an intermediary
(entity in the middle).
Next, consider an anti-money laundering (AML) analysis case: persons A and C are
suspected of illicit trafficking. Any interaction between the two (e.g., a financial transaction
in a financial database) would be flagged by the authorities, and heavily scrutinized.
However, if A and C never transact business together, but instead conduct
financial dealings through safe, respected, and unflagged financial authority B, what
could pick up on the transaction? The graph analytics algorithm! The graph engine
would discover the transitive relationship between A and C through intermediary B.
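
The A-to-C discovery described above can be sketched as a breadth-first search over directed transaction edges. This is only a minimal illustration: the `transactions` list, the extra party `D`, and the `find_path` helper are hypothetical names introduced here, not part of any particular graph engine.

```python
from collections import deque


def find_path(edges, start, goal):
    """Breadth-first search over directed edges.

    Returns a shortest path from start to goal as a list of nodes,
    or None when goal cannot be reached from start.
    """
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical transactions: A and C never deal with each other directly.
transactions = [("A", "B"), ("B", "C"), ("D", "B")]

direct = ("A", "C") in transactions            # False: nothing to flag directly
path = find_path(transactions, "A", "C")       # ['A', 'B', 'C']
print(direct, path)
```

A lookup limited to direct pairs finds nothing, while the traversal surfaces intermediary B. Because the edges are directed, searching from C back to A returns `None`.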
In internet searches, major search engines use a hyperlinked network (graph-based)
algorithm to find the central authoritative node across the entire internet for any
given set of search words. The directionality of the edge is vital in this case, since the
authoritative node in the network is the one that many other nodes point at.
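
As a rough illustration of why edge direction matters here, the sketch below counts incoming links per node. The page names and the `in_degree` helper are assumptions made for the example, and counting incoming links is only the simplest proxy for authority, not how any real search engine ranks pages.

```python
from collections import Counter


def in_degree(edges):
    """Count how many directed edges point at each node."""
    return Counter(dst for _, dst in edges)


# Hypothetical hyperlinks as (source page, target page) pairs.
links = [
    ("page1", "hub"),
    ("page2", "hub"),
    ("page3", "hub"),
    ("hub", "page1"),
]

scores = in_degree(links)
most_pointed_at = scores.most_common(1)[0]
print(most_pointed_at)  # ('hub', 3)
```

Reversing every edge would change the answer, which is the sense in which directionality is vital.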
With literature-based discovery (LBD)—a knowledge network (graph-based) applica‐
tion enabling significant discoveries across the knowledge base of thousands (or even
millions) of research journal articles—“hidden knowledge” is discovered only
through the connection between published research results that may have many
degrees of separation (transitive relationships) between them. LBD is being applied to
cancer research studies, where the massive semantic medical knowledge base of
symptoms, diagnoses, treatments, drug interactions, genetic markers, short-term
results, and long-term consequences could be “hiding” previously unknown cures or
beneficial treatments for the most impenetrable cases. The knowledge could already
be in the network, but we need to connect the dots to find it.
Similar descriptions of the power of graphing can be given for the other use cases lis‐
ted earlier, all examples of network analysis through graph algorithms. Each case
deeply involves entities (people, objects, events, actions, concepts, and places) and
their relationships (touch points, both causal and simple associations).
When considering the power of graphing, we should keep in mind that perhaps the
most powerful node in a graph model for real-world use cases might be “context.”
Context may include time, location, related events, nearby entities, and more. Incorporating
context into the graph (as nodes and as edges) can thus yield impressive predictive
analytics and prescriptive analytics capabilities.
Mark Needham and Amy Hodler’s Graph Algorithms aims to broaden our knowledge
and capabilities around these important types of graph analyses, including algorithms,
concepts, and practical machine learning applications of the algorithms.
From basic concepts to fundamental algorithms to processing platforms and practical
use cases, the authors have compiled an instructive and illustrative guide to the
wonderful world of graphs.
CHAPTER 1
Introduction
Graphs are one of the unifying themes of computer science—an abstract representation that
describes the organization of transportation systems, human interactions, and telecommunication
networks. That so many different structures can be modeled using a single formalism
is a source of great power to the educated programmer.
—The Algorithm Design Manual, by Steven S. Skiena (Springer), Distinguished Teaching
Professor of Computer Science at Stony Brook University
Today’s most pressing data challenges center around relationships, not just tabulating
discrete data. Graph technologies and analytics provide powerful tools for connected
data that are used in research, social initiatives, and business solutions such as:
What Are Graphs?
Graphs have a history dating back to 1736, when Leonhard Euler solved the “Seven
Bridges of Königsberg” problem. The problem asked whether it was possible to visit
all four areas of a city connected by seven bridges, while only crossing each bridge
once. It wasn’t.
With the insight that only the connections themselves were relevant, Euler set the
groundwork for graph theory and its mathematics. Figure 1-1 depicts Euler’s progression
with one of his original sketches, from the paper “Solutio problematis ad
geometriam situs pertinentis”.
Figure 1-1. The origins of graph theory. The city of Königsberg included two large islands
connected to each other and the two mainland portions of the city by seven bridges. The
puzzle was to create a walk through the city, crossing each bridge once and only once.
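
Euler's negative answer can be checked with a short sketch. It relies on the standard result from graph theory that a connected graph has a walk crossing every edge exactly once only if zero or two of its nodes have an odd number of edges. The land-mass names (`north`, `south`, `island`, `east`) are labels chosen for this example, and the code assumes the graph is connected rather than testing for it.

```python
from collections import Counter

# The seven bridges as undirected edges between the four land masses.
# Node names are illustrative labels, not historical place names.
bridges = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("north", "east"),
    ("south", "east"),
    ("island", "east"),
]


def odd_degree_nodes(edges):
    """Return the nodes touched by an odd number of edges, sorted by name."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)


odd = odd_degree_nodes(bridges)
# Assumes a connected graph: a walk over every edge exactly once
# requires zero or two odd-degree nodes.
walk_possible = len(odd) in (0, 2)
print(odd, walk_possible)  # ['east', 'island', 'north', 'south'] False
```

All four land masses have an odd number of bridges, so no such walk exists. Only the connections matter in this check, not the geography.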
While graphs originated in mathematics, they are also a pragmatic and high-fidelity
way of modeling and analyzing data. The objects that make up a graph are called
nodes or vertices and the links between them are known as relationships, links, or
edges. We use the terms nodes and relationships in this book: you can think of nodes
as the nouns in sentences, and relationships as verbs giving context to the nodes. To
avoid any confusion, the graphs we talk about in this book have nothing to do with
graphing equations or charts as in Figure 1-2.
Figure 1-2. A graph is a representation of a network, often illustrated with circles to represent
entities which we call nodes, and lines to represent relationships.
Looking at the person graph in Figure 1-2, we can easily construct several sentences
which describe it. For example, person A lives with person B who owns a car, and
person A drives a car that person B owns. This modeling approach is compelling
because it maps easily to the real world and is very “whiteboard friendly.” This helps
align data modeling and analysis.
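
The "nouns and verbs" reading of a graph can be sketched by storing each relationship as a (node, relationship, node) triple and turning it back into a sentence. The relationship type names (`LIVES_WITH`, `OWNS`, `DRIVES`) and the `describe` helper are assumptions chosen to mirror the sentences above, not a schema taken from the figure.

```python
# Each relationship is a (start node, relationship type, end node) triple.
# The type names are illustrative assumptions.
relationships = [
    ("Person A", "LIVES_WITH", "Person B"),
    ("Person B", "OWNS", "Car"),
    ("Person A", "DRIVES", "Car"),
]


def describe(triples):
    """Render each triple as a simple noun-verb-noun sentence."""
    return [
        f"{start} {rel_type.lower().replace('_', ' ')} {end}"
        for start, rel_type, end in triples
    ]


for sentence in describe(relationships):
    print(sentence)
# Person A lives with Person B
# Person B owns Car
# Person A drives Car
```

The sentences fall straight out of the data, which is the sense in which the model maps easily to a whiteboard sketch.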
But modeling graphs is only half the story. We might also want to process them to
reveal insight that isn’t immediately obvious. This is the domain of graph algorithms.
Network Science
Network science is an academic field strongly rooted in graph theory that is concerned
with mathematical models of the relationships between objects. Network scientists
rely on graph algorithms and database management systems because of the size,
connectedness, and complexity of their data.
There are many fantastic resources for complexity and network science. Here are a
few references for you to explore.
Graph algorithms have widespread potential, from preventing fraud and optimizing
call routing to predicting the spread of the flu. For instance, we might want to score
particular nodes that could correspond to overload conditions in a power system. Or
we might like to discover groupings in the graph which correspond to congestion in a
transport system.
In fact, in 2010 US air travel systems experienced two serious events involving multiple
congested airports that were later studied using graph analytics. Network scientists
P. Fleurquin, J. J. Ramasco, and V. M. Eguíluz used graph algorithms to confirm
the events as part of systematic cascading delays and used this information for corrective
advice, as described in their paper, “Systemic Delay Propagation in the US Airport
Network”.
Figure 1-3, created by Martin Grandjean for his article “Connected World: Untangling
the Air Traffic Network”, visualizes the network underpinning air transportation.
This illustration clearly shows the highly connected structure of air transportation
clusters. Many transportation systems exhibit a concentrated distribution of
links with clear hub-and-spoke patterns that influence delays.
Figure 1-3. Air transportation networks illustrate hub-and-spoke structures that evolve
over multiple scales. These structures contribute to how travel flows.
Graphs also help uncover how very small interactions and dynamics lead to global
mutations. They tie together the micro and macro scales by representing exactly
which things are interacting within global structures. These associations are used to
forecast behavior and determine missing links. Figure 1-4 shows a food web of grassland
species interactions; graph analysis was used to evaluate the hierarchical organization
and species interactions and then predict missing relationships, as detailed in the
paper by A. Clauset, C. Moore, and M. E. J. Newman, “Hierarchical Structure and the
Prediction of Missing Links in Networks”.
drives smarter transactions, which creates new data and opportunities for further
analysis. More recently there’s been a trend to integrate these silos for more real-time
decision making.