Transactions on Computer Systems and Networks
Gyanendra K. Verma
Badal Soni
Salah Bourennane
Alexandre C. B. Ramos Editors
Data Science
Theory, Algorithms, and Applications
Transactions on Computer Systems
and Networks
Series Editor
Amlan Chakrabarti, Director and Professor, A. K. Choudhury School of
Information Technology, Kolkata, West Bengal, India
Transactions on Computer Systems and Networks is a unique series that aims
to capture advances in the evolution of computer hardware and software systems
and progress in computer networks. Computing systems today span from miniature
IoT nodes and embedded computing systems to large-scale cloud infrastructures,
which necessitates developing systems architecture, storage infrastructure, and
process management that work at various scales. Present-day networking
technologies provide pervasive global coverage and enable a multitude of
transformative technologies. The new landscape of computing comprises self-aware
autonomous systems, which are built upon a software-hardware collaborative
framework. These systems are designed to execute critical and non-critical tasks
involving a variety of processing resources such as multi-core CPUs, reconfigurable
hardware, GPUs, and TPUs, which are managed through virtualisation, real-time
process management, and fault tolerance. While AI, machine learning, and deep
learning tasks are predominantly increasing in the application space, computing
systems research aims toward efficient means of data processing, memory
management, real-time task scheduling, and scalable, secure, and energy-aware
computing. The paradigm of computer networks also extends its support to this
evolving application scenario through various advanced protocols, architectures,
and services. This series aims to present leading works on advances in the theory,
design, behaviour, and applications of computing systems and networks.
The series accepts research monographs, introductory and advanced textbooks,
professional books, reference works, and select conference proceedings.
Data Science
Theory, Algorithms, and Applications

Editors

Gyanendra K. Verma
Department of Computer Engineering
National Institute of Technology Kurukshetra
Kurukshetra, India

Badal Soni
Department of Computer Science and Engineering
National Institute of Technology Silchar
Silchar, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
We dedicate this work to all those who directly or
indirectly contributed to its accomplishment.
Preface
Digital information influences our everyday lives in various ways. Data science
provides the tools and techniques to comprehend and analyze data. It is one of
the fastest-growing multidisciplinary fields and deals with the acquisition,
analysis, integration, modeling, visualization, and interaction of large amounts
of data.
Currently, each sector of the economy produces a huge amount of data in
unstructured formats. Data is available from various sources such as web services,
databases, and online repositories; however, preprocessing this data and
extracting meaningful information from it remain challenging tasks. Artificial
intelligence plays a pivotal role in the analysis of such data.
With the evolution of artificial intelligence, it has become possible to analyze
and interpret information in real time. Deep learning models are widely used in
the analysis of big data for various applications, particularly in the area of
image processing.
This book aims to develop an understanding of data science theory and concepts
and of data modeling using various machine learning algorithms for a wide range
of real-world applications. In addition to providing basic principles of data
processing, the book teaches standard models and algorithms for data analysis.
Acknowledgements
We are thankful to all the contributors who have generously given time and material
to this book. We would also like to extend our appreciation to those who have
played their part in continuously inspiring us.
We are extremely thankful to the reviewers, who have carried out the most important
and critical part of any technical book: the evaluation of each of the submitted
chapters assigned to them.
We also express our sincere gratitude to our publication partner, Springer,
especially Ms. Kamiya Khatter and the Springer book production team, for
continuous support and guidance in completing this book project.
Thank you.
Introduction
This book aims to provide readers with an understanding of data science, its
architectures, and its applications in various domains. Data science helps in the
extraction of meaningful information from unstructured data; its major aspects
are data modeling, analysis, and visualization. This book covers major models,
algorithms, and prominent applications of data science for solving real-world
problems. By the end of the book, we hope that our readers will have an
understanding of the concepts, different approaches, and models, and familiarity
with the implementation of data science tools and libraries.
Artificial intelligence has had a major impact on research and has raised the
performance bar substantially in many of the standard evaluations. Moreover, new
challenges can be tackled using artificial intelligence in the decision-making
process. However, it is very difficult to comprehend, let alone guide, the process
of learning in deep learning. There is an air of uncertainty about exactly what
and how these models learn, and this book is an effort to fill those gaps.
Target Audience
The book is divided into three parts comprising a total of 27 chapters. Parts,
distinct groups of chapters, as well as single chapters are meant to be fairly
independent and self-contained, and the reader is encouraged to study only
relevant parts or chapters. This book is intended for a broad readership. The
first part provides the theory and concepts of learning and thus addresses readers
wishing to gain an overview of learning frameworks. Subsequent parts delve deeper
into research topics and are aimed at the more advanced reader, in particular
graduate and PhD students as well as junior researchers. The target audience of
this book includes academicians, professionals, researchers, and students at
engineering and medical institutions working in the areas of data science and
artificial intelligence.
Book Organization
This book is organized into three parts. Part I includes eight chapters that deal
with the theory and concepts of data science, Part II deals with data design and
analysis, and finally, Part III is based on the major applications of data science.
This book contains invited as well as contributed chapters.
The first part of the book exclusively focuses on the fundamentals of data science.
The chapters in this part cover active learning and ensemble learning concepts,
along with language processing concepts.
Chapter 1 describes a general active learning framework proposed for network
intrusion detection. The authors have experimented with different learning and
sampling strategies on the KDD Cup 1999 dataset. The results show that complex
learning models outperform relatively simple ones, and that uncertainty and
entropy sampling outperform random sampling. Chapter 2 describes a bagging
classifier, an ensemble learning approach, for student outcome prediction
employing base and meta-classifiers. Additionally, performance analysis of various
classifiers has been carried out with an oversampling approach using SMOTE and an
undersampling approach using spread subsampling. Chapter 3 presents patients'
medical data security via bi-chaos, bi-order Fourier transform. In this work, the
authors have used three techniques for medical or clinical image encryption,
i.e., FRFT, logistic map, and Arnold map. The results suggest that the complex
hybrid combination makes the system more robust and secure against different
cryptographic attacks than these methods alone. In Chap. 4, word-sense
disambiguation (WSD) for the Nepali language is performed using variants of the
Lesk algorithm such as direct overlap, frequency-based scoring, and
frequency-based scoring after dropping of the target word. Performance analysis
based on the elimination of stop words, the number of senses, and context window
size has been carried out. Chapter 5 presents a performance analysis of different
branch prediction schemes incorporated in the ARM big.LITTLE architecture. The
comparison of these branch predictors has been carried out based on performance,
power dissipation, conditional branch mispredictions, IPC, execution time, power
consumption, etc. The results show that TAGE-LSC and perceptron achieve the
highest accuracy among the simulated predictors. Chapter 6 presents global feature
representation using a new architecture, SEANet, built over SENet. An aggregate
block implemented after the SE block aids in global feature representation and in
reducing redundancies. SEANet has been found to outperform ResNet and SENet on
two benchmark datasets: CIFAR-10 and CIFAR-100.
The subsequent chapters in this part are devoted to analyzing images. Chapter 7
presents improved super-resolution of a single image through external dictionary
formation for training and a neighbor embedding technique for reconstruction.
Dictionary formation is carried out so as to capture maximum structural variation
with a minimal number of images. The reconstruction stage is carried out by
selection of overlapping pixels at a particular location. In Chap. 8, single-step
image super-resolution and denoising of SAR images are proposed using a generative
adversarial network (GAN) model. The model shows improvement in VGG16 loss as it
preserves relevant features and reduces noise in the image. The quality of results
produced by the proposed approach is compared with a two-step upscaling and
denoising model and the baseline method.
The second part of the book focuses on models and algorithms for data science.
Deep learning models, discrete wavelet transforms, principal component analysis,
SenLDA, a color-based classification model, and the gray-level co-occurrence
matrix (GLCM) are used to model real-world problems.
Chapter 9 explores a deep learning technique based on OCR-SSD for car detection
and tracking in images. It also presents a solution for real-time license plate
recognition on a quadcopter in autonomous flight. Chapter 10 describes an
algorithm for gender identification from biometric palm prints using binarized
statistical image features. The filter size is varied with a fixed length of
8 bits to capture information from the ROI palm prints. The proposed method
outperforms baseline approaches with an accuracy of 98%. Chapter 11 describes a
Sudoku puzzle recognition and solution study. Puzzle recognition is carried out
using a deep belief network for feature extraction. The puzzle solution is given
by serialization of two approaches: parallel rule-based methods and ant colony
optimization. Chapter 12 describes a novel profile generation approach for human
action recognition, in which DWT & PC is proposed to detect energy variation for
feature extraction in video frames. The proposed method is applied to various
existing classifiers and tested on the Weizmann dataset. The results outperform
baselines like the MACH filter.
The subsequent chapters in this part are devoted to more research-oriented models
and algorithms. Chapter 13 presents a novel filter and color-based classification
model to assess the ripeness of tobacco leaves for harvesting. Ripeness detection
is performed by a spot detection approach using a first-order edge extractor and
second-order high-pass filtering. A simple thresholding classifier is then
proposed for the classification task. Chapter 14 proposes an automatic deep
learning framework for breast cancer detection and classification from hematoxylin
and eosin (H&E)-stained breast histopathology images, with 80.4% accuracy, for
supplementing the analysis of medical professionals to prevent false negatives.
Experimental results show that the proposed architecture provides better
classification results than benchmark methods. Chapter 15 specifies a technique
for indoor flying of autonomous drones using image processing and neural networks.
The route for the drone is determined through the location of the detected object
in the captured image. The first detection technique relies on image-based
filters, while the second focuses on the use of a CNN to replicate a real
environment. Chapter 16 describes the use of the gray-level co-occurrence matrix
(GLCM) for feature detection in SAR images. The features detected in SAR images
by GLCM find many applications, as they identify various terrain types such as
water, urban areas, and forests, and any changes in these areas.
The third part of the book covers major applications of data science in various
fields like biometrics, robotics, medical imaging, affective computing, and
security.
Chapter 17 deals with signature verification using a Galois field operator. The
features are obtained by building a normalized cumulative histogram. Offline
signature verification is also implemented using the K-NN classifier. Chapter 18
details a face recognition approach for videos using 3D residual networks,
comparing the accuracy for different depths of residual networks. A CVBL video
dataset has been developed for the purpose of experimentation. The proposed
approach achieves the highest accuracy of 97% with DenseNets on the CVBL dataset.
In Chap. 19, microcontroller units (MCUs) with auto firmware communicate with the
fog layer through a smart edge node. The robot employs approaches such as
simultaneous localization and mapping (SLAM) and other path-finding algorithms,
and IR sensors for obstacle detection. ML techniques and FastAi aid in the
classification of the dataset. Chapter 20 describes an automatic tumor
identification approach to classify brain MRI scans. An advanced CNN model
consisting of convolution and dense layers is employed to correctly classify
brain tumors. The results exhibit the proposed model's effectiveness in brain
tumor image classification. Chapter 21 presents a vision-based sensor mechanism
for phase lane detection in IVS. The lane markings on a structured road are
detected using image processing techniques such as edge detection and Hough space
transformation on KITTI data. Qualitative and quantitative analysis shows
satisfactory results. Chapter 22 proposes an implementation of a deep
convolutional neural network (DCNN) for micro-expression recognition, as DCNNs
have established their presence in different image processing applications.
CASME-II, a benchmark database for micro-expression recognition, has been used
for experimentation. The experimental results reveal recognition accuracies of
90% and 88% for four and six classes, respectively, which is beyond the regular
methods.
In Chapter 23, the proposed semantic classification model employs modern
embedding and aggregating methods which considerably enhance feature
discriminability and boost the performance of CNNs. The performance of this
framework is exhaustively tested across a wide dataset. The intuitive and robust
systems that use these techniques play a vital role in various sectors like
security and military.
Editors and Contributors
Salah Bourennane received his Ph.D. degree from the Institut National
Polytechnique de Grenoble, France. Currently, he is a Full Professor at the Ecole
Centrale Marseille, France. He is the head of the Multidimensional Signal
Processing Group of the Fresnel Institute. His research interests are statistical
signal processing, remote sensing,
Part I
Theory and Concepts
Chapter 1
Active Learning for Network Intrusion Detection
Amir Ziai
Abstract Network operators are generally aware of common attack vectors that they
defend against. For most networks, the vast majority of traffic is legitimate.
However, new attack vectors are continually designed and attempted by bad actors,
which bypass detection and go unnoticed due to low volume. One strategy for
finding such activity is to look for anomalous behavior. Investigating anomalous
behavior requires significant time and resources. Collecting a large number of
labeled examples for training supervised models is both prohibitively expensive
and subject to obsolescence as new attacks surface. A purely unsupervised
methodology would be ideal; however, research has shown that even a very small
number of labeled examples can significantly improve the quality of anomaly
detection. A methodology that minimizes the number of required labels while
maximizing the quality of detection is desirable. False positives in this context
result in wasted effort or blockage of legitimate traffic, and false negatives
translate to undetected attacks. We propose a general active learning framework
and experiment with different choices of learners and sampling strategies.
1.1 Introduction
Detecting anomalous activity is an active area of research in the security space.
Tuor et al. use an online anomaly detection method based on deep learning to
detect anomalies. This methodology is compared to traditional anomaly detection
algorithms such as isolation forest (IF) and a principal component analysis
(PCA)-based approach and found to be superior. However, no comparison is provided
with semi-supervised or active learning approaches, which leverage a small amount
of labeled data (Tuor et al. 2017). The authors later propose another unsupervised
methodology leveraging recurrent neural networks (RNNs) to ingest log-level event
data as opposed to aggregated data (Tuor et al. 2018). Pimentel et al. propose a
generalized framework for unsupervised anomaly detection (Pimentel et al. 2018).
They argue that purely unsupervised anomaly
A. Ziai (B)
Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
e-mail: amirziai@stanford.edu
Table 1.1 Prevalence and number of attacks for each of the 10 attack types

Label         Attacks   Prevalence   Prevalence (overall)   Records
smurf.        280,790   0.742697     0.568377               378,068
neptune.      107,201   0.524264     0.216997               204,479
back.         2,203     0.022145     0.004459               99,481
satan.        1,589     0.016072     0.003216               98,867
ipsweep.      1,247     0.012657     0.002524               98,525
portsweep.    1,040     0.010578     0.002105               98,318
warezclient.  1,020     0.010377     0.002065               98,298
teardrop.     979       0.009964     0.001982               98,257
pod.          264       0.002707     0.000534               97,542
nmap.         231       0.002369     0.000468               97,509
1.2 Dataset
We have used the KDD Cup 1999 dataset, which consists of about 500K records
representing network connections in a military environment. Each record is either
"normal" or one of 22 different types of intrusion such as smurf, IP sweep, and
teardrop. Out of these 22 categories, only 10 have at least 100 occurrences, and
the rest were removed. Each record has 41 features including duration, protocol,
and bytes exchanged. Prevalence of attack types varies substantially, with smurf
being the most pervasive at about 57% of total records and Nmap at less than
0.05% of total records (Table 1.1).
We generated 10 separate datasets consisting of normal traffic and each of the attack
vectors. This way we can study the proposed approach over 10 different attack vectors
with varying prevalence and ease of detection. Each dataset is then split into train,
development, and test partitions with 80%, 10%, and 10% proportions. All algorithms
are trained on the train set and evaluated on the development set. The winning strategy
is tested on the test set to generate an unbiased estimate of generalization. Categorical
features are one-hot encoded, and missing values are filled with zero.
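As a concrete illustration, the preparation steps above could look as follows; this is a minimal sketch assuming pandas and scikit-learn, and the file name and the categorical column names (the standard KDD Cup 1999 ones) are illustrative rather than taken from the chapter.

```python
# Sketch of the dataset preparation described above. Assumes pandas and
# scikit-learn; file name and column names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("kddcup99.csv")   # ~500K connection records, 41 features
df = df.fillna(0)                  # missing values are filled with zero

# one-hot encode the categorical features
df = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])

# 80/10/10 train/development/test split
train, rest = train_test_split(df, test_size=0.2, random_state=0)
dev, test = train_test_split(rest, test_size=0.5, random_state=0)
```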
1.3 Approach
Since labeled data is very hard to come by in this space, we have decided to treat this
problem as an active learning one. Therefore, the machine learning model receives a
subset of the labeled data. We will use the F1 score to capture the trade-off between
precision and recall:
F1 = 2PR / (P + R)    (1.1)

A detector can achieve high precision simply by being very selective about what
it flags; however, this usually comes at the cost of being overly conservative
and not catching anomalous activity that is indeed an intrusion.
Labeling effort is a major factor in this analysis and a dimension along which we
will define the upper and lower bounds of the quality of our detection systems. A
purely unsupervised approach would be ideal, as there is no labeling involved. We
will use an isolation forest (Zhou et al. 2004) to establish our baseline.
Isolation forests (IFs) are widely, and very successfully, used for anomaly
detection. An IF consists of a number of isolation trees, each of which is
constructed by selecting random features to split on and then selecting a random
split value (a random value in the range of a continuous variable, or a random
category for a categorical variable). Only a small random subset of the data is
used for growing the trees, and usually a maximum allowable depth is enforced to
curb computational cost. We have used 10 trees for each IF. Intuitively, anomalous
data points are easier to isolate with a smaller average number of splits and
therefore tend to be closer to the root. The average closeness to the root is
proportional to the anomaly score (i.e., the lower this score, the more anomalous
the data point).
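A minimal sketch of this unsupervised baseline, assuming scikit-learn's IsolationForest; the variable names (X_train, X_dev) are illustrative:

```python
# Unsupervised baseline: an isolation forest with 10 trees, as in the text.
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=10, random_state=0)
iforest.fit(X_train)                      # X_train: training feature matrix

# score_samples is higher for normal points, so lower (more negative)
# scores correspond to more anomalous records
scores = iforest.score_samples(X_dev)
predicted_attack = iforest.predict(X_dev) == -1   # -1 flags anomalies
```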
A completely supervised approach would incur maximum cost, as we would have to
label every data point. We have used a random forest classifier with 10 estimators
trained on the entire training dataset to establish the upper bound (i.e., the
Oracle). In Table 1.3, the F1 scores are reported for evaluation on the
development set.
The proposed approach starts with training a classifier on a small random subset
of the data (i.e., 1000 samples) and then continually queries a security analyst
for the next record to label. There is a maximum budget of 100 queries (Fig. 1.1).
This approach is highly flexible. The choice of classifier can range from logistic
regression all the way up to deep networks, as well as any ensemble of those
models. Moreover, the hyper-parameters of the classifier can be tuned on every
round of training to improve the quality of predictions. The sampling strategy
can range from simply picking random records to using classifier uncertainty or
other elaborate schemes. Once a record is labeled, it is removed from the pool of
unlabeled data and placed into the labeled record database. We are assuming that
labels are trustworthy, which may not necessarily be true. In other words, the
analyst might make a mistake in labeling, or there may be low consensus among
analysts around labeling. In the presence of those issues, we would need to
extend this approach to query multiple analysts and to build label consensus into
the framework. A sketch of the resulting query loop is given below.
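The sketch assumes a pool-based setup with numpy arrays, a random forest learner, and uncertainty sampling; all names are illustrative, and the analyst's answer is stood in for by indexing into y_pool:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def active_learning_loop(X_pool, y_pool, n_initial=1000, budget=100):
    """Seed with a random labeled subset, then query the most
    informative unlabeled record each round (uncertainty sampling)."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), size=n_initial, replace=False))
    unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

    clf = RandomForestClassifier(n_estimators=10)
    for _ in range(budget):
        clf.fit(X_pool[labeled], y_pool[labeled])   # retrain every round
        # assumes both classes appear in the labeled seed set
        p = clf.predict_proba(X_pool[unlabeled])[:, 1]
        # query the point closest to the decision boundary
        query = unlabeled[int(np.argmin(np.abs(p - 0.5)))]
        labeled.append(query)       # its label comes from the analyst
        unlabeled.remove(query)
    return clf
```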
1.4 Experiments
We used a logistic regression (LR) classifier with an L2 penalty, as well as a
random forest (RF) classifier with 10 estimators, Gini impurity as the splitting
criterion, and unlimited depth, for our choice of learners. We also chose three
sampling strategies. The first is a random strategy that selects a data point
from the unlabeled pool uniformly at random. The second is uncertainty sampling,
which scores the entire pool of unlabeled data and selects the data point with
the highest uncertainty. The third is entropy sampling, which calculates the
entropy over the positive and negative classes and selects the highest-entropy
data point. Ties are broken randomly for both uncertainty and entropy sampling.
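All three strategies can be expressed as scores over the learner's predicted probability p of the positive class, with the maximal-score point queried next; a minimal illustrative sketch (the epsilon guard against log(0) is our addition):

```python
import numpy as np

def sampling_score(p, strategy, rng=np.random.default_rng(0)):
    """p: predicted probability of the positive class per unlabeled point."""
    if strategy == "random":
        return rng.random(len(p))
    if strategy == "uncertainty":
        return -np.abs(p - 0.5)        # highest near the decision boundary
    if strategy == "entropy":
        eps = 1e-12                    # guard against log(0)
        return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
    raise ValueError(strategy)
```

Note that for a binary problem, the entropy score is a monotonic function of the distance from 0.5, so uncertainty and entropy sampling rank points almost identically; this is consistent with their near-identical rows in Table 1.4.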
Table 1.4 shows the F1 score immediately after the initial training (F1 initial),
followed by the F1 score after 10, 50, and 100 queries to the analyst, across
different learners and sampling strategies, aggregated over the 10 attack types:

Table 1.4 Effects of learner and sampling strategy on detection quality and latency

Learner  Sampling strategy  F1 initial   F1 after 10  F1 after 50  F1 after 100  Train time (s)  Query time (s)
LR       Random             0.76±0.32    0.76±0.32    0.79±0.31    0.86±0.17     0.05±0.01       0.09±0.08
LR       Uncertainty                     0.83±0.26    0.85±0.31    0.88±0.20                     0.10±0.08
LR       Entropy                         0.83±0.26    0.85±0.31    0.88±0.20                     0.08±0.08
RF       Random             0.90±0.14    0.91±0.12    0.84±0.31    0.95±0.07     0.11±0.00       0.09±0.07
RF       Uncertainty                     0.98±0.03    0.99±0.03    0.99±0.03                     0.16±0.06
RF       Entropy                         0.98±0.04    0.98±0.03    0.99±0.03                     0.12±0.08

Random forests are strictly superior to logistic regression from a detection
perspective, regardless of the sampling strategy. It is also clear that
uncertainty and entropy sampling are superior to random sampling, which suggests
that judiciously sampling the unlabeled dataset can have a significant impact on
detection quality, especially in the earlier queries (F1 goes from 0.90 to 0.98
with just 10 queries). It is important to note that the query time might become a
bottleneck. In our experiments, the unlabeled pool of data is not very large, but
as this set grows, these sampling strategies have to scale accordingly. The good
news is that scoring is embarrassingly parallelizable.
Figure 1.2 depicts the evolution of detection quality as the system makes queries
to the analyst for an attack with high prevalence (i.e., the majority of traffic
is an attack).
The random forest learner combined with an entropy sampler can get to perfect
detection within 5 queries, which suggests high data efficiency (Mussmann and
Liang 2018). We will compare this to the Nmap attack with significantly lower
prevalence (i.e., less than 0.05% of the dataset is an attack) (Fig. 1.3).
We know from our Oracle evaluations that a random forest model can achieve
perfect detection for this attack type; however, we see that an entropy sampler
is not guaranteed to query the optimal sequence of data points. The fact that the
prevalence of attacks is very low means that the initial training dataset
probably does not have a representative set of positive labels that can be
exploited by the model to generalize. The failure of uncertainty sampling has
been documented (Zhu et al. 2008), and more elaborate schemes can be designed to
exploit other information about the unlabeled dataset that the sampling strategy
is ignoring. To gain some intuition into these deficiencies, we will unpack a
step of entropy sampling for the Nmap attack. Figure 1.4 compares (a) the relative
feature importance after the initial training to (b) the Oracle (Fig. 1.5).
The Oracle graph suggests that "src_bytes" is a feature that the model relies on
heavily for prediction. However, our initial training is not reflecting this; we
will compute the z-score for each of the positive labels in our development set:

z_{f_i} = |μ_{R,f_i} − μ_{W,f_i}| / σ_{R,f_i}    (1.2)

where μ_{R,f_i} is the average value of the true positives for feature i (i.e.,
f_i), μ_{W,f_i} is the average value of the false positives or false negatives,
and σ_{R,f_i} is the standard deviation of the values in the case of true
positives.
The higher this value is for a feature, the more our learner needs to learn about
it to correct the discrepancy; a sketch of the computation follows below. The
score for "src_bytes" is an order of magnitude larger than for other features,
yet we see that the next query made by the strategy does not involve a decision
around this fact. The model continues to make uncertainty queries while staying
oblivious to information about specific features that it needs to correct for.
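A sketch of Eq. (1.2) in code, with X_dev, y_true, and y_pred as illustrative names for the development features, labels, and model predictions (the epsilon is our addition to avoid division by zero):

```python
import numpy as np

# R: true positives; W: mispredicted points (false positives and negatives)
R = X_dev[(y_true == 1) & (y_pred == 1)]
W = X_dev[y_true != y_pred]
# per-feature z-score from Eq. (1.2)
z = np.abs(R.mean(axis=0) - W.mean(axis=0)) / (R.std(axis=0) + 1e-12)
```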
Fig. 1.4 Random forest feature importance for a initial training and b Oracle

We also explored whether an ensemble of learners improves detection quality,
using a weighted majority vote over the individual predictions:

Prediction_Ensemble = I[ Σ_{e∈E} w_e · Prediction_e > (1/2) Σ_{e∈E} w_e ]    (1.3)

where Prediction_e ∈ {0, 1} is the binary prediction associated with classifier
e ∈ E = {RF, GB, LR, IF} and w_e is the weight of the classifier in the ensemble.
The weights are proportional to the level of confidence we have in each of the
learners. We have added a gradient boosting classifier with 10 estimators.
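Eq. (1.3) amounts to a confidence-weighted majority vote; a minimal sketch, with the weights purely illustrative:

```python
import numpy as np

def ensemble_vote(predictions, weights):
    """predictions: dict name -> binary 0/1 prediction array;
    weights: dict name -> confidence weight w_e from Eq. (1.3)."""
    total = sum(weights[e] * predictions[e] for e in predictions)
    return (total > sum(weights.values()) / 2).astype(int)

# e.g. preds = {"RF": rf_pred, "GB": gb_pred, "LR": lr_pred, "IF": if_pred}
# y_hat = ensemble_vote(preds, {"RF": 2.0, "GB": 1.0, "LR": 1.0, "IF": 0.5})
```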
Unfortunately, the results of this experiment suggest that this particular
ensemble is not adding any additional value. Figure 1.6 shows that at best the
results match those of the random forest alone (a), and in the worst case they
can be significantly worse (b).
The majority of the error associated with this ensemble approach relative to only
using random forests can be attributed to a high false negative rate. The other four
algorithms are in most cases conspiring to generate a negative class prediction which
overrides the positive prediction of the random forest.
Finally, we explore whether we can use an unsupervised method for finding the
most anomalous data points to query. If this methodology is successful, the
sampling strategy is decoupled from active learning, and we can simply precompute
and cache the most anomalous data points for the analyst to label.
We compared a sampling strategy based on isolation forest with entropy sampling
(Table 1.5). In both cases, we are using a random forest learner. The results
suggest that entropy sampling is superior, since it samples the most uncertain
data points in the context of the current learner rather than the global notion
of anomaly that the isolation forest provides.
1.5 Conclusion
We have proposed a general active learning framework for network intrusion
detection. We experimented with different learners and observed that more complex
learners can achieve higher detection quality with significantly less labeling
effort for most attack types. We did not explore other complex models such as
deep neural networks and did not attempt to tune the hyper-parameters of our
models. Since the bottleneck associated with this task is the labeling effort,
we can add model tuning while staying within acceptable latency requirements.
We then explored a few sampling strategies and discovered that uncertainty and
entropy sampling can have a significant benefit over unsupervised or random
sampling. However, we also realized that these strategies are not optimal, and we
can extend them to incorporate available information about the distribution of
the features for mispredicted data points. We attempted a semi-supervised
approach called label spreading, which builds an affinity matrix over the
normalized graph Laplacian and can be used to create pseudo-labels for unlabeled
data points (Zhou et al. 2004). However, this methodology is very
memory-intensive, and we could not successfully train and evaluate it on all of
the attack types.
References
Mussmann S, Liang P (2018) On the relationship between data efficiency and error
for uncertainty sampling. arXiv preprint arXiv:1806.06123
Pimentel T, Monteiro M, Viana J, Veloso A, Ziviani N (2018) A generalized active learning approach
for unsupervised anomaly detection. arXiv preprint arXiv:1805.09411
Tuor A, Kaplan S, Hutchinson B, Nichols N, Robinson S (2017) Deep learning for unsupervised
insider threat detection in structured cybersecurity data streams. arXiv preprint arXiv:1710.00811
Tuor A, Baerwolf R, Knowles N, Hutchinson B, Nichols N, Jasper R (2018) Recurrent neural
network language models for open vocabulary event-level cyber anomaly detection. Workshops
at the thirty-second AAAI conference on artificial intelligence
Veeramachaneni K, Arnaldo I, Korrapati V, Bassias C, Li K (2016) AI: training a big data machine
to defend. Big Data Security on Cloud (BigDataSecurity), IEEE international conference on high
performance and smart computing (HPSC), and IEEE international conference on intelligent data
and security (IDS), IEEE 2nd international conference, pp 49–54
Zainal A, Maarof MA, Shamsuddin SM (2009) Ensemble classifiers for network intrusion detection
system. J Inf Assur Secur 4(3):217–225
Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global
consistency. In: Advances in neural information processing systems, pp 321–328
Zhu J, Wang H, Yao T, Tsou BK (2008) Active learning with sampling by uncertainty and density
for word sense disambiguation and text classification. In: Proceedings of the 22nd international
conference on computational linguistics, vol 1, pp 1137–1144
Chapter 2
Educational Data Mining Using Base (Individual) and Ensemble Learning Approaches to Predict the Performance of Students
2.1 Introduction
M. Ashraf (B)
School of CS and IT, Jain University, Bangalore 190006, India
Y. K. Salal · S. M. Abdullaev
Department of System Programming, South Ural State University, Chelyabinsk, Russia
e-mail: yasskhudheirsalal@gmail.com
S. M. Abdullaev
e-mail: abdullaevsm@susu.ru
In this study, we primarily applied four learning classifiers, namely j48, random
tree, naïve bayes, and knn, to the academic dataset. Thereafter, the dataset was
subjected to oversampling and undersampling methods to verify whether either
improves the prediction of student outcomes. Correspondingly, the analogous
procedure was applied to ensemble methodologies, including bagging and boosting,
to establish which learning classifiers, base or meta, demonstrate the more
compelling results.
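A minimal sketch of one arm of this pipeline (SMOTE oversampling followed by bagging with naïve bayes), assuming scikit-learn and imbalanced-learn; the file name, label column, and numeric-feature assumption are illustrative, not taken from the chapter:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

df = pd.read_csv("academic_records.csv")             # hypothetical dataset
X, y = df.drop(columns=["outcome"]), df["outcome"]   # hypothetical label;
                                                     # features assumed numeric

# oversample the minority class with SMOTE
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

# bagging with naïve bayes as the base classifier
bag = BaggingClassifier(GaussianNB(), n_estimators=10, random_state=0)
print(cross_val_score(bag, X_res, y_res, scoring="accuracy").mean())
```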
Table 2.1 portrays the outcomes accomplished after running these machine learning
classifiers on the educational dataset. It is unequivocal that naïve bayes
achieved a notable prediction accuracy of 95.50% in classifying the actual
occurrences, an incorrect classification error of 4.45%, and a minimum relative
absolute error of 7.94%, in contrast to the remaining classifiers. The
supplementary measures associated with the learning algorithm, such as TP rate,
FP rate, precision, recall, f-measure, and ROC area, were also found to be
significant. Conversely, random tree produced a substantial classification
accuracy of 90.03%, with 9.69% incorrectly classified instances, a relative
absolute error (RAE) of 15.46%, and the supplementary parameters connected with
the algorithm were found
Under this subsection, bagging has been utilized with the various classifiers
highlighted in Table 2.4. After employing bagging, the prediction accuracy
demonstrated paramount success over the base learning mechanism. The correctly
classified rates in Table 2.4, when contrasted with the initial prediction rates
of the different classifiers in Table 2.1, show substantial improvement for three
learning algorithms: j48 (92.20% to 94.87%), random tree (90.30% to 94.76%), and
knn (91.80% to 93.81%).
In addition, the incorrectly classified instances have come down to a
considerable degree for these classifiers, and as a consequence, the
supplementary parameters, viz. TP rate, FP rate, precision, recall, ROC area, and
f-measure, associated with these classifiers have also rendered admirable
results. However, naïve bayes has not revealed any significant gain in prediction
accuracy with the bagging approach, and moreover, the relative absolute error
associated with each meta classifier has increased while synthesizing different
classifiers.
2.4 Conclusion
In this research study, the central focus has been the early prediction of
students' outcomes using various individual (base) and meta classifiers to
provide timely guidance for weak students. The individual learning algorithms
employed on the pedagogical data, including j48, random tree, naïve bayes, and
knn, have evidenced phenomenal prediction accuracy of students' final outcomes.
Among the base learning algorithms, naïve bayes attained the highest accuracy of
95.50%. As the dataset in this investigation was imbalanced, which could have
otherwise culminated in inaccurate and biased outcomes, the academic dataset was
subjected to filtering approaches, namely the synthetic minority oversampling
technique (SMOTE) and spread subsampling.
In this study, a comparative analysis was conducted with base and meta learning
algorithms, followed by oversampling (SMOTE) and undersampling (spread
subsampling) techniques, to gain comprehensive knowledge of which classifiers can
be more precise and decisive in generating predictions. The above-mentioned base
learning algorithms were subjected to the oversampling and undersampling methods.
Naïve bayes yet again demonstrated noteworthy improvement, reaching 97.15% with
the oversampling technique. With the undersampling technique, knn showed
exceptional improvement, at 93.94% prediction accuracy, over the other base
learning algorithms. In the case of ensemble learning, bagging with naïve bayes
accomplished a convincing correctness of 95.32% in predicting the exact instances.
When the bagging algorithm was put into effect together with oversampling and
undersampling, the ensembles generated from j48 and naïve bayes demonstrated
significant accuracy and the least classification error (95.21% for bagging with
j48 and 96.07% for bagging with naïve bayes, respectively).