Michael Kaufmann
Andreas Meier

SQL and NoSQL Databases
Modeling, Languages, Security and Architectures for Big Data Management
Second Edition
Michael Kaufmann
Informatik
Hochschule Luzern
Rotkreuz, Switzerland

Andreas Meier
Institute of Informatics
Universität Fribourg
Fribourg, Switzerland

ISBN 978-3-031-27907-2 ISBN 978-3-031-27908-9 (eBook)


https://doi.org/10.1007/978-3-031-27908-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
The first edition of this book was published by Springer Vieweg in 2019
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword

The term database has long since become part of people’s everyday vocabulary, for
managers and clerks as well as students of most subjects. They use it to describe a
logically organized collection of electronically stored data that can be directly
searched and viewed. However, they are generally more than happy to leave the
whys and hows of its inner workings to the experts.
Users of databases are rarely aware of the immaterial and concrete business
values contained in any individual database. This applies as much to a car importer’s
spare parts inventory as the IT solution containing all customer depots at a bank or
the patient information system of a hospital. Yet failure of these systems, or even
cumulative errors, can threaten the very existence of the respective company or
institution. For that reason, it is important for a much larger audience than just the
“database specialists” to be well-informed about what is going on. Anyone involved
with databases should understand what these tools are effectively able to do and
which conditions must be created and maintained for them to do so.
Probably the most important aspect concerning databases involves (a) the
distinction between their administration and the data stored in them (user data) and
(b) the economic magnitude of these two areas. Database administration consists of
various technical and administrative factors, from computers, database systems, and
additional storage to the experts setting up and maintaining all these components—
the aforementioned database specialists. It is crucial to keep in mind that the
administration is by far the smaller part of standard database operation, constituting
only about a quarter of the entire efforts.
Most of the work and expenses concerning databases lie in gathering,
maintaining, and utilizing the user data. This includes the labor costs for all
employees who enter data into the database, revise it, retrieve information from
the database, or create files using this information. In the above examples, this means
warehouse employees, bank tellers, or hospital personnel in a wide variety of
fields—usually for several years.
In order to be able to properly evaluate the importance of the tasks connected with
data maintenance and utilization on the one hand and database administration on the
other hand, it is vital to understand and internalize this difference in the effort
required for each of them. Database administration starts with the design of the
database, which already touches on many specialized topics such as determining the
consistency checks for data manipulation or regulating data redundancies, which are
as undesirable on the logical level as they are essential on the storage level. The
development of database solutions is always geared toward their later use, so
ill-considered decisions in the development process may have a permanent impact
on everyday operations. Finding ideal solutions, such as the golden mean between
too strict and too flexible when determining consistency conditions, may require
some experience. Unduly strict conditions will interfere with regular operations,
while excessively lax rules will entail a need for repeated expensive data repairs.
To avoid such issues, it is invaluable for anyone concerned with database
development and operation, whether in management or as a database specialist, to
gain systematic insight into this field of computer science. The table of contents
gives an overview of the wide variety of topics covered in this book. The title already
shows that, in addition to an in-depth explanation of the field of conventional
databases (relational model, SQL), the book also provides highly educational information
about current advancements and related fields, the keywords being NoSQL
and Big Data. I am confident that the newest edition of this book will once again be
well-received by both students and professionals—its authors are quite familiar with
both groups.

Carl August Zehnder
Professor Emeritus for Databases
ETH Zürich
Zürich, Switzerland
Preface

It is remarkable how stable some concepts are in the field of databases. Information
technology is generally known to be subject to rapid development, bringing forth
new technologies at an unbelievable pace. However, this is only superficially the
case. Many aspects of computer science do not essentially change. This includes not
only the basics, such as the functional principles of universal computing machines,
processors, compilers, operating systems, databases and information systems, and
distributed systems, but also computer language technologies such as C, TCP/IP, or
HTML that are decades old but in many ways provide a stable foundation for the
global, earth-spanning information system known as the World Wide Web. Likewise,
the SQL language (Structured Query Language) has been in use for almost five
decades and will remain so in the foreseeable future. The theory of relational
database systems was initiated in the 1970s by Codd (relational model) and
Chamberlin and Boyce (SEQUEL). Nevertheless, these technologies have a major
impact on the practice of data management today. Especially with the Big Data
revolution and the widespread use of data science methods for decision support,
relational databases and the use of SQL for data analysis are actually becoming more
important. Even though sophisticated statistics and machine learning are enhancing
the possibilities for knowledge extraction from data, many if not most data analyses
for decision support rely on descriptive statistics using SQL for grouped aggregation.
SQL is also used in the field of Big Data with MapReduce technology. In this
sense, although SQL database technology is quite mature, it is more relevant today
than ever.
Nevertheless, the developments in the Big Data ecosystem brought new
technologies into the world of databases, to which we also pay due attention.
Non-relational database technologies, which find more and more fields of application
under the generic term NoSQL, differ not only superficially from the classical
relational databases but also in the underlying principles. Relational databases were
developed in the twentieth century with the purpose of tightly organized, operational
forms of data management, which provided stability but limited flexibility. In
contrast, the NoSQL database movement emerged in the beginning of the new
century, focusing on horizontal partitioning, schema flexibility, and index-free
adjacency with the goal of solving the Big Data problems of volume, variety,
and velocity, especially in Web-scale data systems. This has far-reaching
consequences and leads to a new approach in data management, which deviates
significantly from the previous theories on the basic concept of databases: the way
data is modeled, how data is queried and manipulated, how data consistency is
handled, and how data is stored and made accessible. That is why in all chapters we
compare these two worlds, SQL and NoSQL databases.
In the first five chapters, we analyze in detail the management, modeling,
languages, security, and architecture of SQL databases, graph databases, and, in
the second English edition, new document databases. In Chaps. 6 and 7, we provide
an overview of other SQL- and NoSQL-based database approaches.
In addition to classic concepts such as the entity and relationship model and its
mapping in SQL or NoSQL database schemas, query languages, or transaction
management, we explain aspects of NoSQL databases such as the MapReduce
procedure, distribution options (fragments, replication), or the CAP theorem (consistency,
availability, partition tolerance).
In the second English edition, we offer a new in-depth introduction to document
databases with a method for modeling document structures, an overview of the
database language MQL, as well as security and architecture aspects. The new
edition also takes into account new developments in the Cypher language. The
topic of database security is newly introduced as a separate chapter and analyzed
in detail with regard to data protection, integrity, and transactions. Texts on data
management, database programming, and data warehousing and data lakes have
been updated. In addition, the second English edition explains the concepts of JSON,
JSON Schema, BSON, index-free adjacency, cloud databases, search engines,
and time series databases.
We have launched a Website called sql-nosql.org, where we share teaching and
tutoring materials such as slides, tutorials for SQL and Cypher, case studies, and a
workbench for MySQL and Neo4j, so that language training can be done either with
SQL or with Cypher, the graph-oriented query language of the NoSQL database
Neo4j.
We thank Alexander Denzler and Marcel Wehrle for the development of the
workbench for relational and graph-oriented databases. For the redesign of the
graphics, we were able to work with Thomas Riediker. We thank him for his tireless
efforts. He has succeeded in giving the pictures a modern style and an individual
touch. In the ninth edition, we have tried to keep his style in our new graphics. For
the further development of the tutorials and case studies, which are available on the
website sql-nosql.org, we thank the computer science students Andreas Waldis,
Bettina Willi, Markus Ineichen, and Simon Studer for their contributions to the
tutorial in Cypher and to the case study Travelblitz with OpenOffice Base
and with Neo4j. For the feedback on the manuscript, we thank Alexander Denzler,
Daniel Fasel, Konrad Marfurt, Thomas Olnhoff, and Stefan Edlich for their willingness
to contribute to the quality of our work by reading our manuscript and
providing valuable feedback. A heartfelt thank-you goes out to Michael Kaufmann’s
wife Melody Reymond for proofreading our manuscript. Special thanks to Andy
Oppel of the University of California, Berkeley, for grammatical and technological
review of the English text. A big thank-you goes to Leonardo Milla of Springer, who has
supported us with patience and expertise.

Michael Kaufmann, Rotkreuz, Switzerland
Andreas Meier, Fribourg, Switzerland
October 2022
Contents

1 Database Management
1.1 Information Systems and Databases
1.2 SQL Databases
1.2.1 Relational Model
1.2.2 Structured Query Language SQL
1.2.3 Relational Database Management System
1.3 Big Data and NoSQL Databases
1.3.1 Big Data
1.3.2 NoSQL Database Management System
1.4 Graph Databases
1.4.1 Graph-Based Model
1.4.2 Graph Query Language Cypher
1.5 Document Databases
1.5.1 Document Model
1.5.2 Document-Oriented Database Language MQL
1.6 Organization of Data Management
Bibliography
2 Database Modeling
2.1 From Requirements Analysis to Database
2.2 The Entity-Relationship Model
2.2.1 Entities and Relationships
2.2.2 Associations and Association Types
2.2.3 Generalization and Aggregation
2.3 Implementation in the Relational Model
2.3.1 Dependencies and Normal Forms
2.3.2 Mapping Rules for Relational Databases
2.4 Implementation in the Graph Model
2.4.1 Graph Properties
2.4.2 Mapping Rules for Graph Databases
2.5 Implementation in the Document Model
2.5.1 Document-Oriented Database Modeling
2.5.2 Mapping Rules for Document Databases
2.6 Formula for Database Design
Bibliography
3 Database Languages
3.1 Interacting with Databases
3.2 Relational Algebra
3.2.1 Overview of Operators
3.2.2 Set Operators
3.2.3 Relation Operators
3.2.4 Relationally Complete Languages
3.3 Relational Language SQL
3.3.1 Creating and Populating the Database Schema
3.3.2 Relational Operators
3.3.3 Built-In Functions
3.3.4 Null Values
3.4 Graph-Based Language Cypher
3.4.1 Creating and Populating the Database Schema
3.4.2 Relation Operators
3.4.3 Built-In Functions
3.4.4 Graph Analysis
3.5 Document-Oriented Language MQL
3.5.1 Creating and Filling the Database Schema
3.5.2 Relation Operators
3.5.3 Built-In Functions
3.5.4 Null Values
3.6 Database Programming with Cursors
3.6.1 Embedding of SQL in Procedural Languages
3.6.2 Embedding Graph-Based Languages
3.6.3 Embedding Document Database Languages
Bibliography
4 Database Security
4.1 Security Goals and Measures
4.2 Access Control
4.2.1 Authentication and Authorization in SQL
4.2.2 Authentication in Cypher
4.2.3 Authentication and Authorization in MQL
4.3 Integrity Constraints
4.3.1 Relational Integrity Constraints
4.3.2 Integrity Constraints for Graphs in Cypher
4.3.3 Integrity Constraints in Document Databases with MQL
4.4 Transaction Consistency
4.4.1 Multi-user Operation
4.4.2 ACID
4.4.3 Serializability
4.4.4 Pessimistic Methods
4.4.5 Optimistic Methods
4.4.6 Recovery
4.5 Soft Consistency in Massive Distributed Data
4.5.1 BASE and the CAP Theorem
4.5.2 Nuanced Consistency Settings
4.5.3 Vector Clocks for the Serialization of Distributed Events
4.5.4 Comparing ACID and BASE
4.6 Transaction Control Language Elements
4.6.1 Transaction Control in SQL
4.6.2 Transaction Management in the Graph Database Neo4j and in the Cypher Language
4.6.3 Transaction Management in MongoDB and MQL
Bibliography
5 System Architecture
5.1 Processing of Homogeneous and Heterogeneous Data
5.2 Storage and Access Structures
5.2.1 Indexes
5.2.2 Tree Structures
5.2.3 Hashing Methods
5.2.4 Consistent Hashing
5.2.5 Multi-dimensional Data Structures
5.2.6 Binary JavaScript Object Notation BSON
5.2.7 Index-Free Adjacency
5.3 Translation and Optimization of Relational Queries
5.3.1 Creation of Query Trees
5.3.2 Optimization by Algebraic Transformation
5.3.3 Calculation of Join Operators
5.3.4 Cost-Based Optimization of Access Paths
5.4 Parallel Processing with MapReduce
5.5 Layered Architecture
5.6 Use of Different Storage Structures
5.7 Cloud Databases
Bibliography
6 Post-relational Databases
6.1 The Limits of SQL and What Lies Beyond
6.2 Federated Databases
6.3 Temporal Databases
6.4 Multi-dimensional Databases
6.5 Data Warehouse and Data Lake Systems
6.6 Object-Relational Databases
6.7 Knowledge Databases
6.8 Fuzzy Databases
Bibliography
7 NoSQL Databases
7.1 Development of Non-relational Technologies
7.2 Key-Value Stores
7.3 Column-Family Stores
7.4 Document Databases
7.5 XML Databases
7.6 Graph Databases
7.7 Search Engine Databases
7.8 Time Series Databases
Bibliography

Glossary
Index
1 Database Management

1.1 Information Systems and Databases

The evolution from the industrial society via the service society to the information
and knowledge society is reflected in the recognition of information as a factor in
production. The following characteristics distinguish information from material
goods:

• Representation: Information is specified by data (signs, signals, messages, or
language elements).
• Processing: Information can be transmitted, stored, categorized, found, or
converted into other representation formats using algorithms and data structures
(calculation rules).
• Combination: Information can be freely combined. The origin of individual parts
cannot be traced. Manipulation is possible at any point.
• Age: Information is not subject to physical aging processes.
• Original: Information can be copied without limit and does not distinguish
between original and copy.
• Vagueness: Information can be imprecise and of differing validity (quality).
• Medium: Information does not require a fixed medium and is therefore independent
of location.

These properties clearly show that digital goods (information, software, multimedia,
etc.), i.e., data, are vastly different from material goods in both handling and
economic or legal evaluation. A good example is the loss in value that physical
products often experience when they are used—the shared use of information, on the
other hand, may increase its value. Another difference lies in the potentially high
production costs for material goods, while information can be multiplied easily and
at significantly lower costs (only computing power and storage medium). This
causes difficulties in determining property rights and ownership, even though digital
watermarks and other privacy and security measures are available.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Kaufmann, A. Meier, SQL and NoSQL Databases,
https://doi.org/10.1007/978-3-031-27908-9_1

[Figure 1.1: An information system made up of a database system (database management and database storage) and application software (user guidance, dialog design, business logic, data querying, data manipulation, access permissions, data protection). Users reach it through a communication network or the WWW by sending requests and receiving responses.]

Fig. 1.1 Architecture and components of information systems

Considering data as the basis of information as a production factor in a company
has significant consequences:

• Basis for decision-making: Data allows well-informed decisions, making it vital
for all organizational functions.
• Quality level: Data can be available from different sources; information quality
depends on the availability, correctness, and completeness of the data.
• Need for investments: Data gathering, storage, and processing cause work and
expenses.
• Degree of integration: Fields and holders of duties within any organization are
connected by informational relations, meaning that the fulfillment of the said
duties largely depends on the degree of data integration.

Once data is viewed as a factor in production, it must be planned, governed,
monitored, and controlled. This makes it necessary to see data management as a task
for the executive level, inducing a major change within the company. In addition to
the technical function of operating the information and communication infrastructure
(production), planning and design of data flows (application portfolio) is crucial.
As shown in Fig. 1.1, an information system enables users to store and connect
information interactively, to ask questions, and to get answers. Depending on the
type of information system, the acceptable questions may be limited. There are,
however, open information systems and online platforms on the World Wide Web
that use search engines to process arbitrary queries.
The computer-based information system in Fig. 1.1 is connected to a communication
network such as the World Wide Web in order to allow for online interaction
and global information exchange in addition to company-specific analyses. Any
information system of a certain size uses database systems to avoid the necessity to
redevelop database management, querying, and analysis every time it is used.
Database systems are software for application-independently describing, storing,
and querying data. All database systems contain a storage and a management
component. The storage component called the database includes all data stored in
organized form plus their description. The management component called the
database management system (DBMS) contains a query and data manipulation
language for evaluating and editing the data and information. This component not
only serves the user interface but also manages all access and editing
permissions for users and applications.
SQL databases (SQL = Structured Query Language, cf. Sect. 1.2) are the most
common in practical use. However, providing real-time Web-based services
referencing heterogeneous data sets is especially challenging (cf. Sect. 1.3 on Big
Data) and has called for new solutions such as NoSQL approaches (cf. Sect. 1.4).
When deciding whether to use relational or non-relational technologies, pros and
cons have to be considered carefully—in some use cases, it may even be ideal to
combine different technologies (cf. operating a Web shop in Sect. 5.6). Modern
hybrid DBMS approaches combine SQL with non-relational aspects, either by
providing NoSQL features in relational databases or by exposing an SQL querying
interface to non-relational databases. Depending on the database architecture of
choice, data management within the company must be established and developed
with the support of qualified experts (Sect. 1.5). Further reading is listed in Sect. 1.6.

1.2 SQL Databases

1.2.1 Relational Model

One of the simplest and most intuitive ways to collect and present data is in a table.
Most tabular data sets can be read and understood without additional explanations.
To collect information about employees, a table structure as shown in Fig. 1.2 can
be used. The all-capitalized table name EMPLOYEE refers to the entire table, while
the individual columns are given the desired attribute names as headers, for example,
the employee number “E#,” the employee’s name “Name,” and their city of resi-
dence “City.”
An attribute assigns a specific data value from a predefined value range called
domain as a property to each entry in the table. In the EMPLOYEE table, the
attribute E# uniquely identifies individual employees, making it the key of
the table. To mark key attributes more clearly, they will be written in italics in the
table headers throughout this book.1 The attribute City is used to label the respective
places of residence and the attribute Name for the names of the respective employees
(Fig. 1.3).

1
Some major works of database literature mark key attributes by underlining.
4 1 Database Management

Table name: EMPLOYEE
Attributes: E#, Name, City
Key attribute: E#

Fig. 1.2 Table structure for an EMPLOYEE table

EMPLOYEE
E#   Name     City
E19  Stewart  Stow
E4   Bell     Kent
E1   Murphy   Kent
E7   Howard   Cleveland

Each column is an attribute, each row is a record (row or tuple), and each cell holds a data value.

Fig. 1.3 EMPLOYEE table with manifestations

The required information of the employees can now easily be entered row by row.
In the columns, values may appear more than once. In our example, Kent is listed as
the place of residence of two employees. This is an important fact, telling us that both
employee Murphy and employee Bell are living in Kent. In our EMPLOYEE table,
not only cities but also employee names may exist multiple times. For that reason,
the aforementioned key attribute E# is required to uniquely identify each employee
in the table.
Identification Key
An identification key or just key of a table is one attribute or a minimal combination
of attributes whose values uniquely identify the records (called rows or tuples)
within the table. If there are multiple keys, one of them can be chosen as the primary
key. This short definition lets us infer two important properties of keys:

• Uniqueness: Each key value uniquely identifies one record within the table, i.e.,
different tuples must not have identical keys.
• Minimality: If the key is a combination of attributes, this combination must be
minimal, i.e., no attribute can be removed from the combination without
eliminating the unique identification.

The requirements of uniqueness and minimality fully characterize an identification
key. However, keys are also commonly used to reference tables among
themselves.
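The two key properties can be checked mechanically against a given set of rows. The following is a minimal sketch in Python, assuming the EMPLOYEE rows of Fig. 1.3; the function names is_unique and is_key are hypothetical and chosen only for this illustration. Note that such a check only shows whether a combination of attributes behaves like a key for the rows currently stored; whether it is a key in general is a design decision.

```python
from itertools import combinations

# Rows of the EMPLOYEE table from Fig. 1.3, as attribute -> value mappings.
EMPLOYEE = [
    {"E#": "E19", "Name": "Stewart", "City": "Stow"},
    {"E#": "E4", "Name": "Bell", "City": "Kent"},
    {"E#": "E1", "Name": "Murphy", "City": "Kent"},
    {"E#": "E7", "Name": "Howard", "City": "Cleveland"},
]


def is_unique(rows, attributes):
    """Uniqueness: no two rows share the same values for `attributes`."""
    projected = [tuple(row[a] for a in attributes) for row in rows]
    return len(set(projected)) == len(projected)


def is_key(rows, attributes):
    """Key test: unique, and no proper subset of the attributes is unique."""
    if not is_unique(rows, attributes):
        return False
    for size in range(1, len(attributes)):
        for subset in combinations(attributes, size):
            if is_unique(rows, subset):
                return False  # an attribute could be removed: not minimal
    return True
```

For these rows, ["E#"] passes the test, ["City"] fails on uniqueness because Kent appears twice, and ["E#", "Name"] fails on minimality because E# alone already identifies each row.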
Instead of a natural attribute or a combination of natural attributes, an artificial
attribute can be introduced into the table as key. The employee number E# in our
example is an artificial attribute, as it is not a natural characteristic of the employees.
While we are hesitant to include artificial keys or numbers as identifying
attributes, especially when the information in question is personal, natural keys
often result in issues with uniqueness and/or privacy. For example, if a key is
constructed from parts of the name and the date of birth, it may not necessarily be
unique. Moreover, natural or intelligent keys divulge information about the respec-
tive person, potentially infringing on their privacy.
Due to these considerations, artificial keys should be defined independently of any
application and without semantics (meaning, informational value). As soon as any
information can be deduced from the data values of a key, there is room for
interpretation. Additionally, it is quite possible that the originally well-defined
principle behind the key values changes or is lost over time.

Table Definition
To summarize, a table is a set of rows presented in tabular form. The data records
stored in the table rows, also called tuples, establish a relation between singular data
values. According to this definition, the relational model considers each table as a set
of unordered tuples. Tables in this sense meet the following requirements:

• Table name: A table has a unique table name.
• Attribute name: All attribute names are unique within a table and label one
specific column with the required property.
• No column order: The number of attributes is not set, and the order of the
columns within the table does not matter.
• No row order: The number of tuples is not set, and the order of the rows within
the table does not matter.
• Identification key: Strictly speaking, tables represent relations in the mathematical
sense only if there are no duplicate rows. Therefore, one attribute or a
combination of attributes can uniquely identify the tuples within the table and is
declared the identification key.

EMPLOYEE
E#   Name     City
E19  Stewart  Stow
E4   Bell     Kent
E1   Murphy   Kent
E7   Howard   Cleveland

Example query:
“Select the names of the employees living in Kent.”

Formulation with SQL:
SELECT Name
FROM EMPLOYEE
WHERE City = 'Kent'

Results table:
Name
Bell
Murphy

Fig. 1.4 Formulating a query in SQL
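The view of a table as a set of unordered tuples can be illustrated with a short Python sketch. The class name Employee and the field name e_no (standing in for E#) are assumptions made for this example only.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    e_no: str  # stands in for the attribute E#
    name: str
    city: str


# Two tables with the same tuples, written in a different row order.
table_a = {
    Employee("E19", "Stewart", "Stow"),
    Employee("E4", "Bell", "Kent"),
}
table_b = {
    Employee("E4", "Bell", "Kent"),
    Employee("E19", "Stewart", "Stow"),
    Employee("E4", "Bell", "Kent"),  # a duplicate row collapses in a set
}

# Row order does not matter, and duplicates cannot exist in a set.
same_relation = table_a == table_b
```

Because Python sets ignore order and duplicates, both variables describe the same relation, which mirrors the requirements of no row order and no duplicate rows.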

1.2.2 Structured Query Language SQL

As explained, the relational model presents information in tabular form, where each
table is a set of tuples (or records) of the same type. Seeing all the data as sets makes
it possible to offer query and manipulation options based on sets.
The result of a selective operation, for example, is a set, i.e., each search result is
returned by the database management system as a table. If no tuples of the scanned
table show the respective properties, the user gets a blank result table. Manipulation
operations similarly target sets and affect an entire table or individual table sections.
The primary query and data manipulation language for tables is called Structured
Query Language, usually shortened to SQL (see Fig. 1.4). It was standardized by
ANSI (American National Standards Institute) and ISO (International Organization
for Standardization).2
SQL is a descriptive language, as the statements describe the desired result
instead of the necessary computing steps. SQL queries follow a basic pattern as
illustrated by the query from Fig. 1.4:
“SELECT the attribute Name FROM the EMPLOYEE table WHERE the city is
Kent.”
A SELECT-FROM-WHERE query can apply to one or several tables and always
generates a table as a result. In our example, the query would yield a results table
with the names Bell and Murphy, as desired.
The set-based method offers users a major advantage, since a single SQL query
can trigger multiple actions within the database management system. Relational
query and data manipulation languages are descriptive. Users get the desired results
by merely setting the requested properties in the SELECT expression. They do not
have to provide the procedure for computing the required records. The database
management system takes on this task, processes the query or manipulation with its
own search and access methods, and generates the results table.
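To try the query from Fig. 1.4, the EMPLOYEE table can be loaded into an in-memory SQLite database with Python's built-in sqlite3 module. This is only a sketch: SQLite is used here as a convenient stand-in for any relational database management system, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The attribute name E# must be quoted because of the special character.
conn.execute('CREATE TABLE EMPLOYEE ("E#" TEXT PRIMARY KEY, Name TEXT, City TEXT)')
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [
        ("E19", "Stewart", "Stow"),
        ("E4", "Bell", "Kent"),
        ("E1", "Murphy", "Kent"),
        ("E7", "Howard", "Cleveland"),
    ],
)

# The query only states the desired properties; the DBMS finds the rows.
rows = conn.execute("SELECT Name FROM EMPLOYEE WHERE City = 'Kent'").fetchall()
names = sorted(name for (name,) in rows)  # sorted: a result set has no order

# A condition that no tuple satisfies yields an empty result table.
no_match = conn.execute("SELECT Name FROM EMPLOYEE WHERE City = 'Akron'").fetchall()
```

The result is again a table (here a list of one-column rows), and the second query returns an empty result rather than an error.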
With procedural database languages on the other hand, the methods for retrieving
the requested information must be programmed by the user. In that case, each query
yields only one record, not a set of tuples.
With its descriptive query formula, SQL requires only the specification of the
desired selection conditions in the WHERE clause, while procedural languages
require the user to specify an algorithm for finding the individual records. As an
example, let us take a look at a query language for hierarchical databases (see
Fig. 1.5): For our initial operation, we use GET_FIRST to search for the first record
that meets our search criteria. Next, we access all other corresponding records
individually with the command GET_NEXT until we reach the end of the file or a
new hierarchy level within the database.
Overall, we can conclude that procedural database management languages use
record-based or navigating commands to manipulate collections of data, requiring
some experience and knowledge of the database’s inner structure from the users.
Occasional users basically cannot independently access and use the contents of a
database. Unlike procedural languages, relational query and manipulation languages
do not require the specification of access paths, processing procedures, or naviga-
tional routes, which significantly reduces the development effort for database
utilization.
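The contrast between record-based navigation and a descriptive formulation can be sketched in Python. The class RecordFile with its methods get_first and get_next is a hypothetical stand-in for a procedural database interface, loosely modeled on the commands named above; it is not the API of any real system.

```python
class RecordFile:
    """Hypothetical record-at-a-time interface (status 0 = record available)."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = -1
        self.status = 1

    def get_first(self):
        self._pos = -1
        return self.get_next()

    def get_next(self):
        self._pos += 1
        if self._pos < len(self._records):
            self.status = 0
            return self._records[self._pos]
        self.status = 1  # end of file reached
        return None


EMPLOYEES = [
    {"Name": "Stewart", "City": "Stow"},
    {"Name": "Bell", "City": "Kent"},
    {"Name": "Murphy", "City": "Kent"},
    {"Name": "Howard", "City": "Cleveland"},
]


def names_procedural(records, city):
    """The user programs the navigation: one record per call."""
    employee_file = RecordFile(records)
    result = []
    record = employee_file.get_first()
    while employee_file.status == 0:
        if record["City"] == city:
            result.append(record["Name"])
        record = employee_file.get_next()
    return result


def names_descriptive(records, city):
    """Only the selection condition is stated; iteration is implicit."""
    return [r["Name"] for r in records if r["City"] == city]
```

Both functions return the same names, but only the first one requires the user to know how to move through the records.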
If database queries and analyses are to be done by end users instead of IT
professionals, the descriptive approach is extremely useful. Research on descriptive
database interfaces has shown that even occasional users have a high probability of
successfully executing the desired analyses using descriptive language elements.
Figure 1.5 also illustrates the similarities between SQL and natural language. In
fact, there are modern relational database management systems that can be accessed
with natural language.

2
ANSI is the national standards organization of the USA. The national standardization
organizations are part of ISO.

Natural language:
“Select the names of the employees living in Kent.”

Descriptive language:
SELECT Name
FROM EMPLOYEE
WHERE City = 'Kent'

Procedural language:
get first EMPLOYEE
while status = 0 do
begin
  if City = 'Kent' then print(Name)
  get next EMPLOYEE
end

Fig. 1.5 The difference between descriptive and procedural languages

1.2.3 Relational Database Management System

Databases are used in the development and operation of information systems in order
to store data centrally, permanently, and in a structured manner.
As shown in Fig. 1.6, relational database management systems are integrated
systems for the consistent management of tables. They offer service functionalities
and the descriptive language SQL for data description, selection, and manipulation.
Every relational database management system consists of a storage and a man-
agement component. The storage component stores both data and the relationships
between pieces of information in tables. In addition to tables with user data from
various applications, it contains predefined system tables necessary for database
operation. These contain descriptive information and can be queried, but not
manipulated, by users.
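System tables can be observed in practice. As one example (an assumption of this sketch, since the text does not name a product), SQLite keeps its schema description in a system table called sqlite_master, which can be queried with ordinary SQL but is protected against direct changes by default. The table and column names below are chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (E_NO TEXT PRIMARY KEY, Name TEXT, City TEXT)")

# Querying the system table returns descriptive information (metadata).
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# Manipulating the system table directly is rejected by default.
try:
    conn.execute("DELETE FROM sqlite_master")
    system_table_writable = True
except sqlite3.Error:
    system_table_writable = False
```

Other relational systems expose comparable metadata under different names, so the query itself is specific to SQLite.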
The management component’s most important part is the language SQL for
relational data definition, selection, and manipulation. This component also contains
service functions for data restoration after errors, for data protection, and for backup.

Relational Database System
• Data and relationships in tables
• Metadata in system tables
+
• Query and manipulation language SQL
• Special functions (recovery, reorganization, security, data protection, etc.) with SQL

Fig. 1.6 Basic structure of a relational database management system

Relational database management systems (RDBMS) have the following properties:

• Model: The database model follows the relational model, i.e., all data and data
relations are represented in tables. Dependencies between attribute values of
tuples or multiple instances of data can be discovered (cf. normal forms in Sect.
2.3.1).
• Schema: The definitions of tables and attributes are stored in the relational
database schema. The schema further contains the definition of the identification
keys and rules for integrity assurance.
• Language: The database system includes SQL for data definition, selection, and
manipulation. The language component is descriptive and facilitates analyses and
programming tasks for users.
• Architecture: The system ensures extensive data independence, i.e., data and
applications are mostly segregated. This independence is reached by separating
the actual storage component from the user side using the management compo-
nent. Ideally, physical changes to relational databases are possible without having
to adjust related applications.
• Multi-user operation: The system supports multi-user operation (cf. Sect. 4.1),
i.e., several users can query or manipulate the same database at the same time.
The RDBMS ensures that parallel transactions in one database do not interfere
with each other or, worse, with the correctness of data (Sect. 4.2).
• Consistency assurance: The database management system provides tools for
ensuring data integrity, i.e., the correct and uncompromised storage of data.
• Data security and data protection: The database management system provides
mechanisms to protect data from destruction, loss, or unauthorized access.

NoSQL database management systems meet these criteria only partially (see
Chaps. 4 and 7). For that reason, most corporations, organizations, and especially
SMEs (small and medium enterprises) rely heavily on relational database management
systems. However, for widely distributed Web applications or applications handling
Big Data, relational database technology must be augmented with NoSQL technology
in order to ensure uninterrupted global access to these services.

1.3 Big Data and NoSQL Databases

1.3.1 Big Data

The term Big Data is used to label large volumes of data that push the limits of
conventional software. This data can be unstructured (see Sect. 5.1) and may
originate from a wide variety of sources: social media postings; e-mails; electronic
archives with multimedia content; search engine queries; document repositories of
content management systems; sensor data of various kinds; rate developments at
stock exchanges; traffic flow data and satellite images; smart meters in household
appliances; order, purchase, and payment processes in online stores; e-health
applications; monitoring systems; etc.
There is no binding definition for Big Data yet, but most data specialists will
agree on three Vs: volume (extensive amounts of data), variety (multiple formats:
structured, semi-structured, and unstructured data; see Fig. 1.7), and velocity (high-
speed and real-time processing). Gartner Group’s IT glossary offers the following
definition:

Big Data
“Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.”

Multimedia
Text: continuous text, structured text, collection of texts, tags, etc.
Graphics: city map, road map, technical drawing, 3D graphics, etc.
Image: photograph, satellite image, X-ray image, etc.
Audio: language, music, sounds, animal sounds, synthetic sounds, etc.
Video: film, animation, ads, video conference, etc.

Fig. 1.7 Variety of sources for Big Data


With this definition, Big Data are information assets for companies. It is indeed
vital for companies and organizations to generate decision-relevant knowledge in
order to survive. In addition to internal information systems, they increasingly utilize
the numerous resources available online to better anticipate economic, ecologic, and
social developments on the markets.
Big Data is a challenge faced not only by profit-oriented companies in digital
markets but also governments, public authorities, NGOs (non-governmental
organizations), and NPOs (nonprofit organizations).
Good examples are programs to create smart or ubiquitous cities, i.e., the use of
Big Data technologies in cities and urban agglomerations for the sustainable development
of social and ecologic aspects of human living spaces. They include projects
facilitating mobility, the use of intelligent systems for water and energy supply, the
promotion of social networks, expansion of political participation, encouragement of
entrepreneurship, protection of the environment, and an increase of security and
quality of life.
All use of Big Data applications requires successful management of the three Vs
mentioned above:

• Volume: There are massive amounts of data involved, ranging from giga- to
zettabytes (megabyte, 10^6 bytes; gigabyte, 10^9 bytes; terabyte, 10^12 bytes;
petabyte, 10^15 bytes; exabyte, 10^18 bytes; zettabyte, 10^21 bytes).
• Variety: Big Data involves storing structured, semi-structured, and unstructured
multimedia data (text, graphics, images, audio, and video; cf. Fig. 1.7).
• Velocity: Applications must be able to process and analyze data streams in real
time as the data is gathered.

Big Data can be considered an information asset, which is why sometimes
another V is added:

• Value: Big Data applications are meant to increase the enterprise value, so
investments in personnel and technical infrastructure are made where they will
bring leverage or generate added value.

To complete our discussion of the concept of Big Data, we will look at another V:

• Veracity: Since much data is vague or inaccurate, specific algorithms evaluating
the validity and assessing result quality are needed. Large amounts of data do not
automatically mean better analyses.

Veracity is an important factor in Big Data, where the available data is of variable
quality, which must be taken into consideration in analyses. Aside from statistical
methods, there are fuzzy methods of soft computing which assign a truth value
between 0 (false) and 1 (true) to any result or statement.
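The idea of a truth value between 0 and 1 can be sketched with a simple linear membership function. The function names and the chosen thresholds are assumptions for illustration; real fuzzy methods offer many other membership shapes and operators.

```python
def truth_of_high(value, low, high):
    """Degree in [0, 1] to which `value` counts as 'high' (linear ramp)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def fuzzy_and(truth_a, truth_b):
    """A common choice for the fuzzy AND: the minimum of both truth values."""
    return min(truth_a, truth_b)
```

With thresholds of 20 and 30, a reading of 25 is "high" to the degree 0.5, while readings at or below 20 and at or above 30 map to 0 and 1, respectively.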
1.3.2 NoSQL Database Management System

Before Ted Codd’s introduction of the relational model, non-relational databases
such as hierarchical or network-like databases existed. After the development of
relational database management systems, non-relational models were still used in
technical or scientific applications. For instance, running CAD (computer-aided
design) systems for structural or machine components on relational technology is
rather difficult. Splitting technical objects across a multitude of tables proved
problematic, as geometric, topological, and graphical manipulations all had to be
executed in real time.
The omnipresence of the Internet and numerous Web-based and mobile
applications has provided quite a boost to the relevance of non-relational data
concepts vs. relational ones, as managing Big Data applications with relational
database technology is hard to impossible.
While “non-relational” would be a better description than NoSQL, the latter has
become established with database researchers and providers on the market over the
last few years.

NoSQL
The term NoSQL is used for any non-relational database management approach
meeting at least one of two criteria:

• The data is not stored in tables.
• The database language is not SQL.

NoSQL technologies are in demand, especially where the applications in the
framework for Big Data (speed, volume, variety) are in the foreground, because
non-relational structures are often better suited for this. Sometimes, the term NoSQL
is translated to “Not only SQL.” This is to express that non-relational storage and
language functions are used in addition to SQL in an application. For example, there
are SQL language interfaces for non-relational systems, either native or as
middleware; and relational databases today also offer NoSQL functions, e.g., docu-
ment data types or graph analyses.
The basic structure of a NoSQL database system is outlined in Fig. 1.8. Mostly, a
NoSQL database system is subject to a massively distributed data storage architec-
ture. Data is stored in alternative non-tabular structures depending on the type of
NoSQL database. As an example, Fig. 1.9 shows key/value stores, document
databases, and graph databases. To ensure high availability and to protect the
NoSQL database system against failures, different replication concepts are
supported. With a massively distributed and replicated computer architecture, paral-
lel evaluation procedures can be used. The analysis of extensive data volumes or the
search for specific facts can be accelerated with distributed computation procedures.
In the MapReduce procedure, subtasks are distributed to various computer nodes,
and simple key-value pairs are extracted (Map) before the partial results are com-
bined and output (Reduce).
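The MapReduce procedure can be sketched on a single machine in Python. In this simplified illustration, each inner list stands for the data held by one computer node, and the task is to count employees per city; the function names and the intermediate grouping step are assumptions of the sketch, and no real distribution takes place.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: emit one simple (key, value) pair per record of one partition."""
    return [(record["City"], 1) for record in partition]


def group_by_key(pairs):
    """Collect all values that belong to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the partial results per key into the final output."""
    return {key: sum(values) for key, values in groups.items()}


# Each inner list stands for the records stored on one node.
partitions = [
    [{"Name": "Stewart", "City": "Stow"}, {"Name": "Bell", "City": "Kent"}],
    [{"Name": "Murphy", "City": "Kent"}, {"Name": "Howard", "City": "Cleveland"}],
]
mapped = [pair for part in partitions for pair in map_phase(part)]
employees_per_city = reduce_phase(group_by_key(mapped))
```

Since map_phase looks at one partition at a time, the map calls are independent of each other and could run in parallel on different nodes.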

NoSQL Database System
• Data in columns, documents, or graphs
• Distributed data replicas
• Parallel execution
• Weak to strong consistency

Fig. 1.8 Basic structure of a NoSQL database management system

Key-value store: Key = Session-ID 1, Value = Order-Nr 1; Key = Session-ID 2, Value = Order-Nr 2; Key = Session-ID 3, Value = Order-Nr 3
Document store: Documents A to E; Document A shown with Key = Order-Nr, customer profile, and shopping cart (Item 1, Item 2, Item 3)
Graph database: nodes MOVIE, DIRECTOR, and ACTOR; edges DIRECTED_BY and ACTED_IN

Fig. 1.9 Three different NoSQL databases

In massively distributed computer networks, differentiated consistency concepts
are also offered. Strong consistency means that the NoSQL database system
guarantees consistency at all times. Weak consistency means that changes to
replicated nodes are tolerated with a delay and can lead to short-term inconsistencies.
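The difference between the two consistency concepts can be made concrete with a toy model in Python. The class ReplicatedStore and its methods are hypothetical and greatly simplified: there is one primary copy and one replica, and no network or failure handling.

```python
class ReplicatedStore:
    """Toy model: one primary copy and one replica of a key-value map."""

    def __init__(self, strong):
        self.strong = strong
        self.primary = {}
        self.replica = {}
        self._pending = []  # updates not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        if self.strong:
            self.replica[key] = value  # replica updated before write returns
        else:
            self._pending.append((key, value))  # replica updated later

    def read_from_replica(self, key):
        return self.replica.get(key)

    def synchronize(self):
        """Apply all delayed updates to the replica."""
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()
```

With strong=True, a read from the replica sees every completed write. With strong=False, a read may return an outdated value until synchronize has run, which corresponds to the short-term inconsistencies described above.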

NoSQL Database System
Storage systems are considered NoSQL database systems if they meet some of the
following requirements:

• Model: The underlying database model is not relational.
• Data: The database system includes a large amount of data (volume), flexible
data structures (variety), and real-time processing (velocity).
• Schema: The database management system is not bound by a fixed database
schema.
• Architecture: The database architecture supports massively scalable
applications.
• Replication: The database management system supports data replication.
• Consistency assurance: According to the CAP theorem, consistency may be
ensured with a delay to prioritize high availability and partition tolerance.

Figure 1.9 shows three different NoSQL database management systems. Key-
value stores (see also Sect. 7.2) are the simplest version. Data is stored as an
identification key <key = "key"> and a list of values <value = "value 1", "value
2", . . .>. A good example is an online store with session management and shopping
basket. The session ID is the identification key; the order number is the value stored
in the cache. In document stores, records are managed as documents within the
NoSQL database. These documents are structured files which describe an entire
subject matter in a self-contained manner. For instance, together with an order
number, the individual items from the basket are stored as values in addition to the
customer profile. The third example shows a graph database on movies and actors
discussed in the next section.
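The online store example for key-value and document stores can be sketched with plain Python dictionaries. This is only an illustration of the data shapes, not of any particular NoSQL product; the field names customer_profile and shopping_cart and the customer data are assumptions.

```python
import json

# Key-value store: the session ID is the key, the order number the value.
session_cache = {
    "Session-ID 1": "Order-Nr 1",
    "Session-ID 2": "Order-Nr 2",
}

# Document store: one self-contained document per order number.
order_documents = {
    "Order-Nr 1": {
        "customer_profile": {"name": "hypothetical customer"},
        "shopping_cart": ["Item 1", "Item 2", "Item 3"],
    },
}


def cart_for_session(session_id):
    """Look up the order number by key, then read the order document."""
    order_number = session_cache[session_id]
    return order_documents[order_number]["shopping_cart"]


# Documents are structured files; JSON is one common serialization.
serialized = json.dumps(order_documents["Order-Nr 1"])
```

The key-value store only supports lookups by key, whereas the document keeps the customer profile and the basket items together in one record.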

1.4 Graph Databases

1.4.1 Graph-Based Model

NoSQL databases support various database models (see Fig. 1.9). Here, we take
graph databases as a first example and discuss their characteristics.

Property Graph
Property graphs consist of nodes (concepts, objects) and directed edges
(relationships) connecting the nodes. Both nodes and edges are given a label and
can have properties. Properties are given as attribute-value pairs with the names of
attributes and the respective values.

Nodes: MOVIE (Title, Year); GENRE (Type); ACTOR (Name, Birthyear); DIRECTOR (Name, Nationality)
Edges: ACTED_IN (Role) from ACTOR to MOVIE; HAS from MOVIE to GENRE; DIRECTED_BY from MOVIE to DIRECTOR

Fig. 1.10 Section of a property graph on movies

A graph abstractly presents the nodes and edges with their properties. Figure 1.10
shows part of a movie collection as an example. It contains the nodes MOVIE with
attributes Title and Year (of release), GENRE with the respective Type (e.g., crime,
mystery, comedy, drama, thriller, western, science fiction, documentary, etc.),
ACTOR with Name and Year of Birth, and DIRECTOR with Name and Nationality.
The example uses three directed edges: The edge ACTED_IN shows which artist
from the ACTOR node starred in which film from the MOVIE node. This edge also
has a property, the Role of the actor in the movie. The other two edges, HAS and
DIRECTED_BY, go from the MOVIE node to the GENRE and DIRECTOR node,
respectively.
At the manifestation level, i.e., the graph database, the property graph contains the
concrete values (Fig. 1.11). For each node and for each edge, a separate record is
stored. Thus, in contrast to relational databases, the connections between the data are
not stored and indexed as key references, but as separate records. This leads to
efficient processing of network analyses.
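The storage of nodes and edges as separate records can be sketched with two small Python data classes. The class and field names are assumptions for illustration, and the values follow Fig. 1.11; a real graph database adds indexes and storage structures that are not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: int  # node_id where the directed edge starts
    target: int  # node_id where the directed edge ends
    label: str
    properties: dict = field(default_factory=dict)


nodes = [
    Node(1, "ACTOR", {"Name": "Keanu Reeves", "Birthyear": 1964}),
    Node(2, "MOVIE", {"Title": "The Matrix", "Year": 1999}),
    Node(3, "MOVIE", {"Title": "Man of Tai Chi", "Year": 2013}),
]
edges = [
    Edge(1, 2, "ACTED_IN", {"Role": "Neo"}),
    Edge(1, 3, "ACTED_IN", {"Role": "Donaka Mark"}),
    Edge(3, 1, "DIRECTED_BY"),
]


def outgoing(node_id, label):
    """All edge records with the given label that start at the given node."""
    return [e for e in edges if e.source == node_id and e.label == label]
```

Each relationship is a record of its own with its own properties, so following connections means reading edge records rather than comparing key values in tables.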

1.4.2 Graph Query Language Cypher

Cypher is a declarative query language for extracting patterns from graph databases.
ISO plans to extend Cypher to become the international standard for graph-based
database languages as Graph Query Language (GQL) by 2023.
Users define their query by specifying nodes and edges. The database management
system then calculates all patterns meeting the criteria by analyzing the
possible paths (connections between nodes via edges). The user declares the structure
of the desired pattern, and the database management system’s algorithms
traverse all necessary connections (paths) and assemble the results.

ACTOR (Name: Keanu Reeves, Birthyear: 1964)
MOVIE (Title: Man of Tai Chi, Year: 2013)
MOVIE (Title: The Matrix, Year: 1999)
ACTED_IN (Role: Neo): Keanu Reeves to The Matrix
ACTED_IN (Role: Donaka Mark): Keanu Reeves to Man of Tai Chi
DIRECTED_BY: Man of Tai Chi to Keanu Reeves

Fig. 1.11 Section of a graph database on movies

As described in Sect. 1.4.1, the data model of a graph database consists of nodes
(concepts, objects) and directed edges (relationships between nodes). In addition to
their name, both nodes and edges can have a set of properties (see Property Graph in
Sect. 1.4.1). These properties are represented by attribute-value pairs.
Figure 1.11 shows a segment of a graph database on movies and actors. To keep
things simple, only two types of nodes are shown: ACTOR and MOVIE. ACTOR
nodes contain two attribute-value pairs, specifically (Name: FirstName LastName)
and (Birthyear: Year).
The segment in Fig. 1.11 includes different types of edges: The ACTED_IN
relationship represents which actors starred in which movies. Edges can also have
properties if attribute-value pairs are added to them. For the ACTED_IN relation-
ship, the respective roles of the actors in the movies are listed. For example, Keanu
Reeves is the hacker Neo in “The Matrix.”
Nodes can be connected by multiple relationship edges. The movie “Man of Tai
Chi” and actor Keanu Reeves are linked not only by the actor’s role (ACTED_IN)
but also by the director position (DIRECTED_BY). The diagram therefore shows
that Keanu Reeves both directed the movie “Man of Tai Chi” and starred in it as
Donaka Mark.
If we want to analyze this graph database on movies, we can use Cypher. It uses
the following basic query elements:

• MATCH: Specification of nodes and edges, as well as declaration of search
patterns
• WHERE: Conditions for filtering results
• RETURN: Specification of the desired search result, aggregated if necessary

For instance, the Cypher query for the year the movie “The Matrix” was released
would be:

MATCH (m: Movie {Title: "The Matrix"})
RETURN m.Year

The query sends out the variable m for the movie “The Matrix” to return the
movie’s year of release by m.Year. In Cypher, parentheses always indicate nodes,
i.e., (m: Movie) declares the control variable m for the MOVIE node. In addition to
control variables, individual attribute-value pairs can be included in curly brackets.
Since we are specifically interested in the movie “The Matrix,” we can add {Title:
"The Matrix"} to the node (m: Movie).
Queries regarding the relationships within the graph database are a bit more
complicated. Relationships between two arbitrary nodes (a) and (b) are expressed
in Cypher by the arrow symbol “-->,” i.e., the path from (a) to (b) is declared as
“(a)-->(b).” If the specific relationship between (a) and (b) is of importance, the edge
[r] can be inserted in the middle of the arrow, as in “(a)-[r]->(b).” The square
brackets represent edges, and r is our variable for relationships.
Now, if we want to find out who played Neo in “The Matrix,” we use the
following query to analyze the ACTED_IN path between ACTOR and MOVIE:

MATCH (a: Actor)-[: Acted_In {Role: "Neo"}]->
      (: Movie {Title: "The Matrix"})
RETURN a.Name

Cypher will return the result Keanu Reeves. For a list of movie titles (m), actor
names (a), and respective roles (r), the query would have to be:

MATCH (a: Actor)-[r: Acted_In]->(m: Movie)
RETURN m.Title, a.Name, r.Role

Since our example graph database only contains one actor and two movies, the
result would be the movie “Man of Tai Chi” with actor Keanu Reeves in the role of
Donaka Mark and the movie “The Matrix” with Keanu Reeves as Neo.
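To make the evaluation of such patterns more tangible, the three queries above can be imitated in plain Python over the example data. This is a sketch of the matching idea only, with a hypothetical helper function match; it is not how a graph database management system is implemented, and it does not use Cypher itself.

```python
# Node records: id -> (label, properties); values follow Fig. 1.11.
nodes = {
    "a1": ("Actor", {"Name": "Keanu Reeves", "Birthyear": 1964}),
    "m1": ("Movie", {"Title": "The Matrix", "Year": 1999}),
    "m2": ("Movie", {"Title": "Man of Tai Chi", "Year": 2013}),
}
# Edge records: (source id, label, target id, properties).
edges = [
    ("a1", "Acted_In", "m1", {"Role": "Neo"}),
    ("a1", "Acted_In", "m2", {"Role": "Donaka Mark"}),
]


def match(source_label, edge_label, target_label):
    """Yield (source, edge, target) properties for every matching path."""
    for source_id, label, target_id, edge_props in edges:
        s_label, s_props = nodes[source_id]
        t_label, t_props = nodes[target_id]
        if (s_label, label, t_label) == (source_label, edge_label, target_label):
            yield s_props, edge_props, t_props


# MATCH (m: Movie {Title: "The Matrix"}) RETURN m.Year
matrix_year = [
    props["Year"]
    for label, props in nodes.values()
    if label == "Movie" and props["Title"] == "The Matrix"
]

# Who played Neo in "The Matrix"?
neo_actor = [
    a["Name"]
    for a, r, m in match("Actor", "Acted_In", "Movie")
    if r["Role"] == "Neo" and m["Title"] == "The Matrix"
]

# List of movie titles, actor names, and roles.
listing = [(m["Title"], a["Name"], r["Role"]) for a, r, m in match("Actor", "Acted_In", "Movie")]
```

In Cypher, the user writes only the pattern; the loop over edges and the label comparison shown in match are the kind of work that the database management system takes over.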