Data Science in Theory and Practice: Techniques for Big Data Analytics and Complex Data Sets
Maria C. Mariani
Data Science in Theory and Practice
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or
otherwise, except as permitted by law. Advice on how to obtain permission to reuse material
from this title is available at http://www.wiley.com/go/permissions
The right of Maria Cristina Mariani, Osei Kofi Tweneboah, and Maria Pia Beccar-Varela to be
identified as the authors of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley
products visit us at www.wiley.com
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some
content that appears in standard print versions of this book may not be available in other
formats.
Contents
3 Multivariate Analysis
3.1 Introduction
3.2 Multivariate Analysis: Overview
3.3 Mean Vectors
3.4 Variance–Covariance Matrices
3.5 Correlation Matrices
5 Introduction to R
5.1 Introduction
5.2 Basic Data Types
5.2.1 Numeric Data Type
5.2.2 Integer Data Type
5.2.3 Character
5.2.4 Complex Data Types
5.2.5 Logical Data Types
5.3 Simple Manipulations – Numbers and Vectors
5.3.1 Vectors and Assignment
6 Introduction to Python
6.1 Introduction
6.2 Basic Data Types
6.2.1 Number Data Type
6.2.1.1 Integer
6.2.1.2 Floating-Point Numbers
6.2.1.3 Complex Numbers
6.2.2 Strings
6.2.3 Lists
6.2.4 Tuples
6.2.5 Dictionaries
6.3 Number Type Conversion
6.4 Python Conditions
6.4.1 If Statements
6.4.2 The Else and Elif Clauses
6.4.3 The While Loop
6.4.3.1 The Break Statement
6.4.3.2 The Continue Statement
6.4.4 For Loops
7 Algorithms
7.1 Introduction
7.2 Algorithm – Definition
7.3 How to Write an Algorithm
7.3.1 Algorithm Analysis
7.3.2 Algorithm Complexity
7.3.3 Space Complexity
7.3.4 Time Complexity
7.4 Asymptotic Analysis of an Algorithm
7.4.1 Asymptotic Notations
7.4.1.1 Big O Notation
7.4.1.2 The Omega Notation, Ω
7.4.1.3 The Θ Notation
7.5 Examples of Algorithms
7.6 Flowchart
7.7 Problems
Bibliography
Index
List of Figures
Figure 16.5 Two class problem when data is not linearly separable.
Figure 16.6 ROC curve for linear SVM.
Figure 16.7 ROC curve for nonlinear SVM.
Figure 17.1 Single hidden layer feed-forward neural networks.
Figure 17.2 Simple recurrent neural network.
Figure 17.3 Long short-term memory unit.
Figure 17.4 Philippines (PSI). (a) Basic RNN. (b) LSTM.
Figure 17.5 Thailand (SETI). (a) Basic RNN. (b) LSTM.
Figure 17.6 United States (NASDAQ). (a) Basic RNN. (b) LSTM.
Figure 17.7 JPMorgan Chase & Co. (JPM). (a) Basic RNN. (b) LSTM.
Figure 17.8 Walmart (WMT). (a) Basic RNN. (b) LSTM.
Figure 18.1 3D power spectra of the daily returns from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase.
Figure 18.2 3D power spectra of the returns (generated per minute) from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase.
Figure 19.1 Time-frequency image of explosion 1 recorded by ANMO (Table 19.2).
Figure 19.2 Time-frequency image of earthquake 1 recorded by ANMO (Table 19.2).
Figure 19.3 Three-dimensional graphic information of explosion 1 recorded by ANMO (Table 19.2).
Figure 19.4 Three-dimensional graphic information of earthquake 1 recorded by ANMO (Table 19.2).
Figure 19.5 Time-frequency image of explosion 2 recorded by TUC (Table 19.3).
Figure 19.6 Time-frequency image of earthquake 2 recorded by TUC (Table 19.3).
Figure 19.7 Three-dimensional graphic information of explosion 2 recorded by TUC (Table 19.3).
Figure 19.8 Three-dimensional graphic information of earthquake 2 recorded by TUC (Table 19.3).
Figure 21.1 R/S for volcanic eruptions 1 and 2.
Figure 21.2 DFA for volcanic eruptions 1 and 2.
Figure 21.3 DEA for volcanic eruptions 1 and 2.
List of Tables
Preface
We conclude this book with a discussion of ethics in data science: With great
power comes great responsibility.
The authors express their deepest gratitude to Wiley for making the publication
a reality.
1.1 Introduction
Data science is one of the most promising and high-demand career paths for skilled
professionals in the 21st century. Currently, successful data professionals understand
that they must advance past the traditional skills of analyzing large amounts of data,
statistical learning, and programming. In order to explore and discover useful
information for their companies or organizations, data scientists must have a good grasp
of the full spectrum of the data science life cycle and have a level of flexibility and
understanding to maximize returns at each phase of the process.
Data science is a “concept to unify statistics, mathematics, computer science,
data analysis, machine learning and their related methods” in order to find trends,
understand, and analyze actual phenomena with data. Due to the coronavirus disease
(COVID-19), many colleges, institutions, and large organizations asked their
nonessential employees to work virtually. The virtual meetings have provided colleges
and companies with plenty of data. Some aspects of the data suggest that virtual fatigue
is on the rise. Virtual fatigue is defined as the burnout associated with overdependence
on virtual platforms for communication. Data science provides tools to explore and
reveal the best and worst aspects of virtual work.
In the past decade, data scientists have become necessary assets and are present
in almost all institutions and organizations. These professionals are data-driven
individuals with high-level technical skills who are capable of building complex
quantitative algorithms to organize and synthesize large amounts of information
used to answer questions and drive strategy in their organization. This is coupled
with the experience in communication and leadership needed to deliver tangible
results to various stakeholders across an organization or business.
Data scientists need to be curious and result-oriented, with good domain-specific
knowledge and communication skills that allow them to explain very technical results
to their nontechnical counterparts. They possess a strong quantitative background in
statistics and mathematics as well as programming knowledge with a focus on data
warehousing, mining, and modeling to build and analyze algorithms. In fact, data
scientists are a group of analytical data experts who have the technical skills to solve
complex problems and the curiosity to explore how problems need to be solved.
Data Science in Theory and Practice: Techniques for Big Data Analytics and Complex Data Sets,
First Edition. Maria Cristina Mariani, Osei Kofi Tweneboah, and Maria Pia Beccar-Varela.
© 2022 John Wiley & Sons, Inc. Published 2022 by John Wiley & Sons, Inc.
1 Background of Data Science
accurate, and useful information than that provided by any individual data
source. Veracity: Veracity describes the quality of data and the data value. The
quality of data obtained can greatly affect the accuracy of the analyzed results. In
the next subsection we will discuss some big data architectures. A comprehensive
study of this topic can be found in the application architecture guide of the
Microsoft technical documentation.
● Analytical data store: Several big data solutions prepare data for analysis and
then serve the processed data in a structured format that can be queried using
analytical tools. The analytical data store used to serve these queries can be a
Kimball-style relational data warehouse, as observed in most classical business
intelligence (BI) solutions. Alternatively, the data could be presented through a
low-latency NoSQL technology, such as HBase, or an interactive Hive database
that provides a metadata abstraction over data files in the distributed data store.
● Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. Users can analyze the data using
mathematical and statistical models as well as data visualization techniques.
Analysis and reporting can also take the form of interactive data exploration by
data scientists or data analysts.
● Orchestration: Several big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an analytical
data store, or move the results to a report or dashboard.
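The orchestration idea in the last bullet can be illustrated with a very small sketch. It is not taken from the book or from any particular orchestration product: the step names (`drop_incomplete`, `add_squared`, `load_to_store`), the in-memory `analytical_store`, and the sample records are all hypothetical and stand in for real sources, sinks, and data stores.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]

def run_workflow(source: Iterable[Record], steps: List[Step]) -> List[Record]:
    """Run each processing step in order, feeding its output to the next step."""
    data = list(source)
    for step in steps:
        data = step(data)
    return data

def drop_incomplete(records: List[Record]) -> List[Record]:
    """Transform step: discard records whose 'value' field is missing."""
    return [r for r in records if r.get("value") is not None]

def add_squared(records: List[Record]) -> List[Record]:
    """Transform step: derive a new field from an existing one."""
    return [{**r, "squared": r["value"] ** 2} for r in records]

# Hypothetical stand-in for an analytical data store.
analytical_store: List[Record] = []

def load_to_store(records: List[Record]) -> List[Record]:
    """Load step: append the processed records to the store."""
    analytical_store.extend(records)
    return records

raw = [{"id": 1, "value": 2}, {"id": 2, "value": None}, {"id": 3, "value": 4}]
result = run_workflow(raw, [drop_incomplete, add_squared, load_to_store])
print(result)
```

A production workflow would add scheduling, retries, and monitoring; the sketch only shows the chaining of repeated processing operations.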
2.1 Introduction
The matrix algebra and random vectors presented in this chapter will enable us to
precisely state statistical models. We will begin by discussing some basic concepts
that will be essential throughout this chapter. For more details on matrix algebra
please consult (Axler 2015).
2 Matrix Algebra and Random Vectors
Definition 2.3 (Vector addition) The sum of two vectors of the same size is
the vector obtained by adding corresponding entries in the vectors:
\[
x + y =
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
+
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix}.
\]
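As a small illustration of Definition 2.3, the following sketch adds two vectors stored as plain Python lists. The function name `vector_add` and the sample values are chosen here for illustration only; in practice a numerical library would usually be used.

```python
from typing import List

def vector_add(x: List[float], y: List[float]) -> List[float]:
    """Add two vectors entry by entry; the vectors must have the same size."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same size")
    return [xi + yi for xi, yi in zip(x, y)]

x = [1, 2, 3]
y = [10, 20, 30]
print(vector_add(x, y))  # [11, 22, 33]
```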
2.2.2 Matrices
The notation $A_{i,j}$ denotes the entry in row $i$, column $j$ of $A$. In other words,
the first index refers to the row number and the second index refers to the column
number.
Example 2.1
If
\[
A = \begin{pmatrix} 1 & 4 & 8 \\ 0 & 4 & 9 \\ 7 & -1 & 7 \end{pmatrix},
\]
then $A_{3,1} = 7$.
Example 2.2
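If
\[
A_{2 \times 3} = \begin{pmatrix} 1 & 4 & 8 \\ 0 & 4 & 9 \end{pmatrix},
\]
then
\[
A^{T}_{3 \times 2} = \begin{pmatrix} 1 & 0 \\ 4 & 4 \\ 8 & 9 \end{pmatrix}.
\]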
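Examples 2.1 and 2.2 can be reproduced with a short sketch that stores a matrix as a list of rows. The helper names `entry` and `transpose` are illustrative. Note that the text counts rows and columns from 1, while Python lists are indexed from 0, so `entry` subtracts 1 from each index.

```python
from typing import List

Matrix = List[List[float]]

def entry(A: Matrix, i: int, j: int) -> float:
    """Return the entry in row i, column j, using 1-based indices as in the text."""
    return A[i - 1][j - 1]

def transpose(A: Matrix) -> Matrix:
    """Swap rows and columns: the result has A's columns as its rows."""
    return [list(column) for column in zip(*A)]

# Example 2.1
A = [[1, 4, 8],
     [0, 4, 9],
     [7, -1, 7]]
print(entry(A, 3, 1))  # 7

# Example 2.2
A23 = [[1, 4, 8],
       [0, 4, 9]]
print(transpose(A23))  # [[1, 0], [4, 4], [8, 9]]
```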
by the scalar:
\[
cA = \begin{bmatrix}
cA_{1,1} & \cdots & cA_{1,n} \\
\vdots & & \vdots \\
cA_{m,1} & \cdots & cA_{m,n}
\end{bmatrix}.
\]
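Definition 2.7 (Matrix addition) The sum of two matrices of the same size is
the matrix obtained by adding corresponding entries in the matrices:
\[
A + B =
\begin{bmatrix}
A_{1,1} & \cdots & A_{1,n} \\
\vdots & & \vdots \\
A_{m,1} & \cdots & A_{m,n}
\end{bmatrix}
+
\begin{bmatrix}
B_{1,1} & \cdots & B_{1,n} \\
\vdots & & \vdots \\
B_{m,1} & \cdots & B_{m,n}
\end{bmatrix}
=
\begin{bmatrix}
A_{1,1} + B_{1,1} & \cdots & A_{1,n} + B_{1,n} \\
\vdots & & \vdots \\
A_{m,1} + B_{m,1} & \cdots & A_{m,n} + B_{m,n}
\end{bmatrix}.
\]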
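The two entrywise operations above, multiplying a matrix by a scalar and adding two matrices of the same size, can be sketched as follows. The function names are illustrative, and the matrix `B` is a hypothetical example that does not appear in the text.

```python
from typing import List

Matrix = List[List[float]]

def scalar_multiply(c: float, A: Matrix) -> Matrix:
    """Multiply every entry of A by the scalar c."""
    return [[c * a for a in row] for row in A]

def matrix_add(A: Matrix, B: Matrix) -> Matrix:
    """Add two matrices of the same size entry by entry."""
    same_size = len(A) == len(B) and all(len(ra) == len(rb) for ra, rb in zip(A, B))
    if not same_size:
        raise ValueError("matrices must have the same size")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[1, 4],
     [0, 4],
     [7, -1]]
B = [[1, 1],   # hypothetical 3 x 2 matrix, chosen only for this illustration
     [2, 1],
     [0, 3]]

print(scalar_multiply(2, A))  # [[2, 8], [0, 8], [14, -2]]
print(matrix_add(A, B))       # [[2, 5], [2, 5], [7, 2]]
```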
Example 2.3
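If
\[
A = \begin{bmatrix} 1 & 4 \\ 0 & 4 \\ 7 & -1 \end{bmatrix}
\quad \text{and} \quad
B = \begin{bmatrix} 1 & 1 \\ 2 & 1 \end{bmatrix},
\]
then
\[
AB =
\begin{bmatrix} 1 & 4 \\ 0 & 4 \\ 7 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 2 & 1 \end{bmatrix}
=
\begin{bmatrix}
1(1) + 4(2) & 1(1) + 4(1) \\
0(1) + 4(2) & 0(1) + 4(1) \\
7(1) + (-1)(2) & 7(1) + (-1)(1)
\end{bmatrix}
=
\begin{bmatrix} 9 & 5 \\ 8 & 4 \\ 5 & 6 \end{bmatrix}.
\]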
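The row-by-column computation in Example 2.3 can be checked with a minimal sketch. The function name `matmul` is illustrative; each entry of the product is the sum of products of a row of A with a column of B, which requires the number of columns of A to equal the number of rows of B.

```python
from typing import List

Matrix = List[List[float]]

def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Multiply an m x p matrix A by a p x n matrix B."""
    p = len(B)
    if any(len(row) != p for row in A):
        raise ValueError("columns of A must match rows of B")
    n = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(n)]
            for i in range(len(A))]

A = [[1, 4],
     [0, 4],
     [7, -1]]
B = [[1, 1],
     [2, 1]]
print(matmul(A, B))  # [[9, 5], [8, 4], [5, 6]]
```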
Example 2.4 The matrix $A = \begin{bmatrix} 1 & 4 \\ 4 & 4 \end{bmatrix}$ is symmetric;
the matrix $B = \begin{bmatrix} 1 & 6 \\ 4 & -4 \end{bmatrix}$ is not symmetric.
Definition 2.11 (Trace) For any square matrix A, the trace of A denoted
by tr(A) is defined as the sum of the diagonal elements, i.e.
\[
\operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn}.
\]
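A short sketch can check the symmetry claim of Example 2.4 and compute the trace of Definition 2.11. The helper names `is_symmetric` and `trace` are illustrative. Symmetry is tested by comparing each entry with its mirror image across the diagonal.

```python
from typing import List

Matrix = List[List[float]]

def is_square(A: Matrix) -> bool:
    """A matrix is square when every row has as many entries as there are rows."""
    return all(len(row) == len(A) for row in A)

def is_symmetric(A: Matrix) -> bool:
    """A square matrix is symmetric when entry (i, j) equals entry (j, i)."""
    n = len(A)
    return is_square(A) and all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

def trace(A: Matrix) -> float:
    """Sum of the diagonal entries of a square matrix."""
    if not is_square(A):
        raise ValueError("trace is defined only for square matrices")
    return sum(A[i][i] for i in range(len(A)))

A = [[1, 4],
     [4, 4]]
B = [[1, 6],
     [4, -4]]
print(is_symmetric(A), is_symmetric(B))  # True False
print(trace(A), trace(B))                # 5 -3
```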
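1. $\emptyset \in \mathcal{F}$.
2. If $F \in \mathcal{F}$ then its complement $F^{c} \in \mathcal{F}$.
3. If $F_1, F_2, \ldots$ is a countable collection of sets in $\mathcal{F}$ then their union $\bigcup_{n=1}^{\infty} F_n \in \mathcal{F}$.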
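The three conditions above require the collection to contain the empty set and to be closed under complements and countable unions. On a finite sample space they can be checked directly, as in the sketch below. This is only an illustration under that finiteness assumption: with finitely many sets, closure under pairwise unions already gives closure under countable unions. The names `omega`, `family`, and `is_sigma_algebra` are chosen here and do not come from the text.

```python
from itertools import combinations
from typing import FrozenSet, Set

def is_sigma_algebra(omega: FrozenSet[int], family: Set[FrozenSet[int]]) -> bool:
    """Check the three closure conditions for a finite family of subsets of omega."""
    # Every member must be a subset of the sample space.
    if any(not F <= omega for F in family):
        return False
    # Condition 1: the empty set belongs to the family.
    if frozenset() not in family:
        return False
    # Condition 2: closed under complements.
    if any(omega - F not in family for F in family):
        return False
    # Condition 3: closed under unions (pairwise suffices for a finite family).
    return all(F | G in family for F, G in combinations(family, 2))

omega = frozenset({1, 2, 3, 4})
good = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}
bad = {frozenset(), frozenset({1}), omega}  # complement {2, 3, 4} is missing

print(is_sigma_algebra(omega, good))  # True
print(is_sigma_algebra(omega, bad))   # False
```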
Language: English
JULY-DECEMBER 1915.
By N. BOHR,
This expression is also equal to the mean value of the kinetic energy
of the system. Since is equal to the total energy of the
system we get from (4) and (5)
If we compare (6) with the relation (1), we see that the connexion
with ordinary mechanics in the region of slow vibration, mentioned in
the former section, is satisfied.
Putting in (3) we get the ordinary series spectrum of
hydrogen. Putting we get a spectrum which, on the theory,
should be expected to be emitted by an electron rotating round a
helium nucleus. The formula is found very closely to represent some
series of lines observed by Fowler[9] and Evans[10]. These series
correspond to and [11]. The theoretical value for the
ratio between the second factor in (3) for this spectrum and for the
hydrogen spectrum is 1.000409; the value calculated from Fowler’s
measurements is 1.000408[12]. Some of the lines under consideration
have been observed earlier in star spectra, and have been ascribed to
hydrogen not only on account of the close numerical relation with the
lines of the Balmer series, but also on account of the fact that the
lines observed, together with the lines of the Balmer series,
constitute a spectrum which shows a marked analogy with the
spectra of the alkali metals. This analogy, however, has been
completely disturbed by Fowler’s and Evans’ observations, that the
two new series contain twice as many lines as is to be expected on
this analogy. In addition, Evans has succeeded in obtaining the lines
in such pure helium that no trace of the ordinary hydrogen lines
could be observed[13]. The great difference between the conditions
for the production of the Balmer series and the series under
consideration is also brought out very strikingly by some recent
experiments of Rau[14] on the minimum voltage necessary for the
production of spectral lines. While about 13 volts was sufficient to
excite the lines of the Balmer series, about 80 volts was found
necessary to excite the other series. These values agree closely with
the values calculated from the assumption E for the energies
necessary to remove the electron from the hydrogen atom and to
remove both electrons from the helium atom, viz. 13.6 and 81.3 volts
respectively. It has recently been argued[15] that the lines are not so
sharp as should be expected from the atomic weight of helium on
Lord Rayleigh’s theory of the width of spectral lines. This might,
however, be explained by the fact that the systems emitting the
spectrum, in contrast to those emitting the hydrogen spectrum, are
supposed to carry an excess positive charge, and therefore must be
expected to acquire great velocities in the electric field in the
discharge-tube.
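The ratio of about 1.000409 quoted above can be checked roughly with a short calculation. Formula (3) is not reproduced in this excerpt, so the sketch assumes that its second factor depends on the nuclear mass M only through the standard finite-nuclear-mass correction M/(M + m), with m the electron mass. The two mass ratios are modern approximate values supplied here as assumptions, not figures from the paper.

```python
# Approximate modern mass ratios (assumptions, not values from the 1915 paper).
PROTON_TO_ELECTRON_MASS = 1836.15267   # hydrogen nucleus / electron
ALPHA_TO_ELECTRON_MASS = 7294.29954    # helium nucleus / electron

def nuclear_mass_factor(mass_ratio: float) -> float:
    """Return M / (M + m), written in terms of the ratio M / m."""
    return mass_ratio / (mass_ratio + 1.0)

# Ratio of the assumed mass-dependent factor for helium to that for hydrogen.
ratio = (nuclear_mass_factor(ALPHA_TO_ELECTRON_MASS)
         / nuclear_mass_factor(PROTON_TO_ELECTRON_MASS))
print(f"{ratio:.6f}")
```

Under these assumptions the result is about 1.00041, in line with both the theoretical and the measured values quoted in the text to within a few parts in a million.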
In paper IV. an attempt was made on the basis of the present
theory to explain the characteristic effect of an electric field on the
hydrogen spectrum recently discovered by Stark. This author
observed that if luminous hydrogen is placed in an intense electric
field, each of the lines of the Balmer series is split up into a number
of homogeneous components. These components are situated
symmetrically with regard to the original lines, and their distance
apart is proportional to the intensity of the external electric field. By
spectroscopic observation in a direction perpendicular to the field, the
components are linearly polarized, some parallel and some
perpendicular to the field. Further experiments have shown that the
phenomenon is even more complex than was at first expected. By
applying greater dispersion, the number of components observed has
been greatly increased, and the numbers as well as the intensities of
the components are found to vary in a complex manner from line to
line[16]. Although the present development of the theory does not
allow us to account in detail for the observations, it seems that the
considerations in paper IV. offer a simple interpretation of several
characteristic features of the phenomenon.
The calculation can be made considerably simpler than in the
former paper by an application of Hamilton’s principle. Consider a
particle moving in a closed orbit in a stationary field. Let be the
frequency of revolution, the mean value of the kinetic energy
during the revolution, and the mean value of the sum of the
kinetic energy and the potential energy of the particle relative to the
stationary field. We have then for a small arbitrary variation of the
orbit
This equation was used in paper IV. to prove the equivalence of the
formulæ (2) and (6) for any system governed by ordinary mechanics.
The equation (7) further shows that if the relations (2) and (6) hold
for a system of orbits, they will hold also for any small variation of
these orbits for which the value of is unaltered. If a hydrogen
atom in one of its stationary states is placed in an external electric
field and the electron rotates in a closed orbit, we shall therefore
expect that is not altered by the introduction of the atom in the
field, and that the only variation of the total energy of the system will
be due to the variation of the mean value of the potential energy
relative to the external field.
In the former paper it was pointed out that the orbit of the
electron will be deformed by the external field. This deformation will
in course of time be considerable even if the external electric force is
very small compared with the force of attraction between the
particles. The orbit of the electron may at any moment be considered
as an ellipse with the nucleus in the focus, and the length of the
major axis will approximately remain constant, but the effect of the
field will consist in a gradual variation of the direction of the major
axis as well as the excentricity of the orbit. A detailed investigation of
the very complicated motion of the electron was not attempted, but it
was simply pointed out that the problem allows of two stationary
orbits of the electron, and that these may be taken as representing
two possible stationary states. In these orbits the excentricity is equal
to 1, and the major axis parallel to the external force; the orbits
simply consisting of a straight line through the nucleus parallel to the
axis of the field, one on each side of it. It can very simply be shown
that the mean value of the potential energy relative to the field for
these rectilinear orbits is equal to , where is the external
electric force and the major axis of the orbit, and the two signs
correspond to orbits in which the direction of the major axis from the
nucleus is the same or opposite to that of the electric force
respectively. Using the formulæ (4) and (5) and neglecting the mass
of the electron compared with that of the nucleus, we get, therefore,
for the energy of the system in the two states
where should be a constant for all the lines and equal to unity.
28500 volts. per cm. 74000 volts. per cm.
2 3 0.46 0.83
28500 volts. per cm. 74000 volts. per cm.
where is the Rydberg constant in the hydrogen spectrum. It will be seen that this
result is in approximate agreement with the calculation mentioned above if we
assume that the radiation is emitted as a quantum .
Moseley pointed out the analogy between the formula (14) and the formula (3) in
section 2, and remarked that the constant was equal to the last factor in this
formula, if we put and . He therefore proposed the explanation of
the formula (14), that the line was emitted during a transition of the innermost ring
between two states in which the angular momentum of each electron was equal to
and respectively. From the replacement of by he deduced
that the number of electrons in the ring was equal to 4. This view, however, can
hardly be maintained. The approximate agreement mentioned above with
Whiddington’s measurements for the energy necessary to produce the characteristic
radiation indicates very strongly that the spectrum is due to a displacement of a
single electron, and not to a whole ring. In the latter case the energy should be
several times larger. It is also pointed out by Nicholson[30] that Moseley’s explanation
would imply the emission of several quanta at the same time; but this assumption is
apparently not necessitated for the explanation of other phenomena. At present it
seems impossible to obtain a detailed interpretation of Moseley’s results, but much
light seems to be thrown on the whole problem by some recent interesting
considerations by W. Kossel[31].
Kossel takes the view of the nucleus atom and assumes that the electrons are
arranged in rings, the one outside the other. As in the present theory, it is assumed
that any radiation emitted from the atom is due to a transition of the system
between two steady states, and that the frequency of the radiation is determined by
the relation (1). He considers now the radiation which results from the removal of an
electron from one of the rings, assuming that the radiation is emitted when the atom
settles down in its original state. The latter process may take place in different ways.
The vacant place in the ring may be taken by an electron coming directly from
outside the whole system, but it may also be taken by an electron jumping from one
of the outer rings. In the latter case a vacant place will be left in that ring to be
replaced in turn by another electron, etc. For the sake of brevity, we shall refer to
the innermost ring as ring 1, the next one as ring 2, and so on. Kossel now assumes
that the radiation results from the removal of an electron from ring 1, and makes
the interesting suggestion that the line denoted by Moseley as corresponds to
the radiation emitted when an electron jumps from ring 2 to ring 1, and that the line
, corresponds to a jump from ring 3 to ring 1. On this view, we should expect
that the radiation consists of as many lines as there are rings in the atom, the
lines forming a series of rapidly increasing intensities. For the radiation, Kossel
makes assumptions analogous to those for the radiation, with the distinction that
the radiation is ascribed to the removal of an electron from ring 2 instead of ring 1.
A possible radiation is ascribed to ring 3, and so on. The interest of these
considerations is that they lead to the prediction of some simple relations between
the frequencies of the different lines. Thus it follows as an immediate consequence
of the assumption used that we must have
It will be seen that these relations correspond exactly to the ordinary principle of
combination of spectral lines. By using Moseley’s measurements for and and
extrapolating for the values of by the help of Moseley’s empirical formula, Kossel
showed that the first relation was closely satisfied for the elements from calcium to
zinc. Recently T. Malmer[32] has measured the wave-length of and for a
number of elements of higher atomic weight, and it is therefore possible to test the
relation over a wider range and without extrapolation. The table gives Malmer’s
values for and Moseley’s values for , all values being multiplied by
.