Feature Engineering for Machine Learning and Data Analytics First Edition Dong - Quickly download the ebook in PDF format for unlimited reading
RapidMiner
Data Mining Use Cases and Business Analytics Applications
Markus Hofmann and Ralf Klinkenberg
Computational Business Analytics
Subrata Das
Data Classification
Algorithms and Applications
Charu C. Aggarwal
Healthcare Data Analytics
Chandan K. Reddy and Charu C. Aggarwal
Accelerating Discovery
Mining Unstructured Information for Hypothesis Generation
Scott Spangler
Event Mining
Algorithms and Applications
Tao Li
Text Mining and Visualization
Case Studies Using Open-Source Tools
Markus Hofmann and Andrew Chisholm
Graph-Based Social Media Analysis
Ioannis Pitas
Data Mining
A Tutorial-Based Primer, Second Edition
Richard J. Roiger
Data Mining with R
Learning with Case Studies, Second Edition
Luís Torgo
Social Networks with Rich Edge Semantics
Quan Zheng and David Skillicorn
Large-Scale Machine Learning in the Earth Sciences
Ashok N. Srivastava, Ramakrishna Nemani, and Karsten Steinhaeuser
Data Science and Analytics with Python
Jesus Rogel-Salazar
Feature Engineering for Machine Learning and Data Analytics
Guozhu Dong and Huan Liu
Edited by
Guozhu Dong and Huan Liu
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks
does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion
of MATLAB® software or related products does not constitute endorsement or sponsorship by The
MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To my family, especially baby Hazel [G. D.]
Preface
Contributors
Index
Preface
Feature engineering plays a vital role in big data analytics. Machine learning
and data mining algorithms cannot work without data. Little can be achieved
if there are few features to represent the underlying data objects, and the
quality of results of those algorithms largely depends on the quality of the
available features. Data can exist in various forms, such as images, text, graphs,
sequences, and time series. A common way to represent data for data analytics
is to use feature vectors. Feature engineering addresses the need to generate
and select useful features, as well as several related issues.
This book is devoted to feature engineering. It covers various aspects
of feature engineering, including feature generation, feature extraction,
feature transformation, feature selection, and feature analysis and evaluation. It
presents concepts, methods, and examples, as well as applications.
Feature engineering is often data type specific and application dependent.
This calls for multiple chapters on different data types that require specialized
feature engineering techniques to meet various data analytic needs. Hence, this
book contains chapters on feature engineering for major data types such as
texts, images, sequences, time series, graphs, streaming data, software engi-
neering data, Twitter data, and social media data. It also contains generic
feature generation approaches, as well as methods for generating tried-and-
tested, hand-crafted, domain-specific features.
This book contains many useful feature engineering concepts and tech-
niques, which are an important part of machine learning and data analytics.
They can help readers meet their needs in multiple scenarios: (a) generating
features to represent the data when there are no features, (b) generating
effective features when existing features may not be good or competitive
enough, (c) selecting features when there are too many features,
(d) generating and selecting effective features for specific types of
applications, and (e) understanding the challenges associated with, and the
approaches needed to handle, various data types. This list is certainly not exhaustive.
The first chapter is an introduction, which defines the concepts of fea-
tures and feature engineering, offers an overview of the book, and provides
pointers to topics not covered in this book. The next six chapters are devoted
to feature engineering, including feature generation, for specific data types,
namely texts, images, sequences, time series, graphs, and streaming data. The
subsequent four chapters cover generic approaches for feature engineering,
namely feature selection, feature transformation-based feature engineering,
Guozhu Dong
Wright State University, Dayton, Ohio, USA
Huan Liu
Arizona State University, Phoenix, Arizona, USA
1.1 Preliminaries
1.1.1 Features
1.1.2 Feature Engineering
1.1.3 Machine Learning and Data Analytic Tasks
1.2 Overview of the Chapters
1.3 Beyond this Book
1.3.1 Feature Engineering for Specific Data Types
1.3.2 Feature Engineering on Non-Data-Specific Topics
Bibliography
1.1 Preliminaries
1.1.1 Features
In machine learning, data mining, and data analytics, a feature is an
attribute or variable used to describe some aspect of individual data objects.
Example features include age and eye color for persons, and major and grade
point average for students.
Informative features are the basis for data analytics. They are useful for
describing the underlying objects, and for distinguishing and characterizing
different (explicit or latent) groups of objects. They are also vital for producing
accurate and easy-to-explain predictive models, and yielding good results in
various data analytic tasks. “Feature,” “variable,” and “attribute” are often
used as synonyms.
For a given application and a fixed point in time, often a fixed set of
features is implicitly chosen to describe all underlying data objects; each object
takes a particular value for each of those features. This results in a feature-
vector-based representation of the data objects.
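The feature-vector representation described above can be sketched in a few lines of Python. This is only an illustration: the feature names reuse the examples from the text (age, eye color, major, grade point average), while the `FEATURES` list, the `to_feature_vector` helper, and the sample records are hypothetical.

```python
# Fixed, ordered set of features chosen for the (hypothetical) application.
FEATURES = ["age", "eye_color", "major", "gpa"]


def to_feature_vector(obj, features=FEATURES):
    """Map one data object (a dict) to a vector with one value per feature.

    Missing features are represented as None in this sketch.
    """
    return [obj.get(name) for name in features]


# Made-up data objects, for illustration only.
students = [
    {"age": 20, "eye_color": "brown", "major": "math", "gpa": 3.6},
    {"age": 22, "eye_color": "blue", "major": "physics", "gpa": 3.1},
]

# Every object is described by the same features, in the same order.
vectors = [to_feature_vector(s) for s in students]
```

Because every object takes a value for each feature in the same order, the resulting vectors can be stacked into a table with one row per object and one column per feature.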
Features are divided into several feature types, including categorical, ordi-
nal, and numerical. Different feature types require different kinds of analysis,
due to structural differences in their domains.
• The domain of a categorical feature is a set of discrete val-
ues. For example, color is a categorical feature whose domain is
{black, blue, brown, green, red, white, yellow}.
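As an illustration of how the domain of a categorical feature affects its handling, the sketch below encodes a color value as a one-hot vector over the domain listed above. One-hot encoding is an assumption introduced here as a common choice, not a method named in the text; `COLOR_DOMAIN` and `one_hot` are hypothetical names.

```python
# Domain of the categorical feature "color", taken from the example above.
COLOR_DOMAIN = ["black", "blue", "brown", "green", "red", "white", "yellow"]


def one_hot(value, domain=COLOR_DOMAIN):
    """Encode a categorical value as a 0/1 vector with one position per domain value."""
    if value not in domain:
        raise ValueError(f"{value!r} is not in the feature's domain")
    return [1 if value == candidate else 0 for candidate in domain]


encoded = one_hot("green")
```

The encoding treats the domain values as unordered, which matches the discrete-set view of a categorical feature.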
(2) Feature generation is about generating new features that are often not
the result of feature transformations. For example, assuming that one
does not view a pixel in an image as a feature, one generates new features
for images. Moreover, it makes sense to say that features defined from
patterns are generated features. Many domain-specific ways for defining
features also belong in the feature generation category. Sometimes the
term feature extraction is used for feature generation.
(3) Feature selection is about selecting a small set of features from a very
large pool of features. The reduced feature set size makes it computationally
feasible to use certain algorithms. Feature selection may also
improve the quality of the results of those algorithms.
(4) Feature analysis and evaluation is about concepts, methods, and mea-
sures for evaluating the usefulness of features and feature sets. This is
often included as part of feature selection.
(5) General automatic feature engineering methodology is about generic ap-
proaches for automatically generating a large number of features and
selecting an effective subset of the generated features.
(6) Feature engineering applications involve feature engineering but the fo-
cus is to solve some other data analytic tasks in specific contexts. Ex-
amples include analyzing Twitter data to improve the quality of disaster
response and relief efforts.
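The generate-then-select idea in items (3) and (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not a method from the book: candidate features are generated as pairwise products of the original columns, and the subset is chosen by absolute Pearson correlation with a target. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def generate_features(columns):
    """Return the original columns plus all pairwise products (assumed generator)."""
    generated = dict(columns)
    for a, b in combinations(sorted(columns), 2):
        generated[f"{a}*{b}"] = [u * v for u, v in zip(columns[a], columns[b])]
    return generated


def select_top_k(features, target, k):
    """Rank feature names by |correlation with target| and keep the k best."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Toy data: the target is the product of the two original columns.
columns = {
    "x0": [1, 2, 3, 4, 5, 6],
    "x1": [6, 1, 5, 2, 4, 3],
}
target = [a * b for a, b in zip(columns["x0"], columns["x1"])]

candidates = generate_features(columns)
selected = select_top_k(candidates, target, k=1)
```

On this toy data the generated product feature ranks first, since it reproduces the target exactly. A correlation filter is only one of many possible evaluation measures.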
discusses (a) the dominant bag of words–based text representation, (b) ap-
proaches that use multiple words as features, and (c) structural features that
require natural language processing techniques or statistical pattern analysis
methods. It further describes how to learn latent semantic representations us-
ing methods such as probabilistic topic models and neural networks, and how
text data can be analyzed together with non-textual context data to extract
contextualized text representations.
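A minimal sketch of the first two representations mentioned above, bag of words and multi-word features, is given below. The regular-expression tokenizer and the function names are simplifying assumptions made for illustration; real systems typically use more careful tokenization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(tokenize(text))


def word_ngrams(text, n=2):
    """Count contiguous runs of n words, so multiple words act as one feature."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


doc = "Feature engineering helps machine learning, and machine learning needs features."
unigrams = bag_of_words(doc)
bigrams = word_ngrams(doc, n=2)
```

Here `unigrams` discards word order entirely, while `bigrams` keeps local order, so "machine learning" is counted as a single feature.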
A majority of visual computing tasks involve prediction, regression or deci-
sion making using features extracted from the original, raw visual data (images
or videos). Chapter 3 presents a hierarchy of feature representations for im-
age data, starting with classic, hand-crafted features. The classic features are
designed by human experts and they are based on task-specific prior knowl-
edge. They are easily interpretable and characterize fundamental aspects of
images such as color, texture and shape. The features at the next level are
latent feature representations. Such features represent task-specific structures
in the feature space such as sparsity, decorrelation of reduced dimension, low
rank, etc.
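As one example of a classic hand-crafted color feature, the sketch below computes a normalized per-channel color histogram. The choice of a histogram, the image format (rows of `(r, g, b)` tuples with values from 0 to 255), and the function name are assumptions made for illustration; the chapter itself may use different descriptors.

```python
def color_histogram(image, bins=4):
    """Concatenate normalized R, G, and B histograms into one feature vector.

    `image` is a list of rows, each a list of (r, g, b) tuples in 0..255.
    """
    bin_width = 256 / bins
    hist = [0.0] * (3 * bins)
    n_pixels = 0
    for row in image:
        for pixel in row:
            n_pixels += 1
            for channel, value in enumerate(pixel):
                # Clamp so that value 255 falls in the last bin.
                index = min(int(value / bin_width), bins - 1)
                hist[channel * bins + index] += 1
    if n_pixels == 0:
        raise ValueError("image has no pixels")
    return [count / n_pixels for count in hist]


# Tiny 2x2 toy image: two red pixels, one blue, one green.
image = [
    [(255, 0, 0), (255, 0, 0)],
    [(0, 0, 255), (0, 255, 0)],
]
features = color_histogram(image, bins=2)
```

The resulting vector has a fixed length of `3 * bins` regardless of image size, which is what makes it usable as a feature vector.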
Time series are an important type of data frequently encountered
in data analytics. Chapter 4 provides an overview of the vast literature on
representations and analysis methods for time series. It first discusses
global distances between time series, including Euclidean distance and elastic
distance measures such as dynamic time warping (DTW). It then discusses three kinds of features, namely
subsequences that provide more localized shape-based information, global fea-
tures that capture higher order structure, and interval features that capture
discriminative properties in time-series subsequences. It also discusses factors
that influence the selection of the most useful method for a given task.
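The difference between a Euclidean and an elastic distance can be seen in a short sketch. The DTW implementation below is the textbook dynamic-programming formulation with absolute difference as the local cost and no warping-window constraint; these choices, the function names, and the toy series are assumptions for illustration.

```python
from math import inf, sqrt


def euclidean(a, b):
    """Point-by-point distance; requires series of equal length."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Dynamic time warping distance with |x - y| as the local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matches an earlier point of b
                cost[i][j - 1],      # b[j-1] also matches an earlier point of a
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


# The same bump, shifted by one time step.
a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]

d_euclidean = euclidean(a, b)
d_dtw = dtw(a, b)
```

Because DTW may stretch or compress the time axis, it aligns the shifted bump and reports a smaller distance than the point-by-point Euclidean comparison.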
Chapter 5 provides an overview of feature engineering for streaming data,
with a focus on streaming feature construction and selection. It first summa-
rizes the typical streaming settings and their corresponding formal defini-
tions. Then it reviews automated feature construction algorithms including
linear and non-linear methods. Next it gives an overview of feature selection
algorithms with different streaming settings. Finally it discusses some open
questions and possible research directions about feature engineering for data
streams.
Sequence data occur in many applications including bioinformatics, mu-
sic, literature, health care, and security. Chapter 6 first discusses the basic
concepts for sequence data. It then discusses three major classes of sequence
features, namely traditional pattern-based sequence features, general pattern-
based features, and sequence features that do not involve the use of patterns. It
presents several approaches for using sequence patterns as sequence features,
and it provides an overview of sequence pattern types as well as methods to
mine such patterns. It also considers factors that are important for selecting
patterns as features.
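One simple way to use sequence patterns as features is to create one binary feature per pattern, indicating whether the pattern occurs in a sequence. The sketch below assumes patterns are subsequences that may match with gaps; the patterns, sequences, and function names are hypothetical, and in practice the patterns would come from a pattern mining step.

```python
def contains_subsequence(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later pattern items must occur after earlier ones.
    return all(item in remaining for item in pattern)


def pattern_features(sequence, patterns):
    """One 0/1 feature per pattern: does the sequence contain it?"""
    return [int(contains_subsequence(sequence, p)) for p in patterns]


# Hypothetical patterns and toy sequences.
patterns = ["AB", "CA", "BBB"]
sequences = ["ACBDA", "CCAB", "BAB"]

matrix = [pattern_features(s, patterns) for s in sequences]
```

Each row of `matrix` is a fixed-length feature vector for one sequence, even though the sequences themselves differ in length.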
Graph and network data are essential for various graph analysis tasks
such as social network analysis, protein–protein interaction analysis, and
how they are used for hierarchical and disentangled representation learning,
and how they can be applied to various domains.
Growing evidence suggests that social platforms like Twitter host an
increasing number of autonomous entities known as social bots, which
are controlled by software that generates content and establishes interactions
with other accounts. Chapter 12 considers feature engineering for social bot
detection in the context of social media. It describes the detection setting
and presents various kinds of features, some of which are unique
to social media, including their definition, selection, and usefulness for social
bot detection. It also describes a system called Botometer that analyzes pub-
lic information about a Twitter account, extracting over a thousand features
describing the account and its neighbors, and discusses experiments where the
extracted features were used to build classifiers for bot detection.
Chapter 13 considers feature generation and engineering for software
analytics. It shows how domain-specific features can be designed and used
to automate three software engineering tasks: (1) detecting defective software
modules, (2) identifying a crashing mobile app release, and (3) predicting who
will leave a software team. For each task, different sets of features are extracted
from a diverse set of software artifacts, and used to build predictive models.
The chapter also discusses recent advances as well as their potential.
Chapter 14 presents studies concerning feature engineering for Twitter-
based applications. It first discusses how Twitter data can be downloaded
from the Twitter Application Programming Interface (API) and the kinds of
data available in the downloaded tweets. Then, it discusses various textual
features, image and video features, Twitter metadata-related features, and
network features that can be extracted. Next, it discusses the uses of different
feature types along with an analysis of why certain features perform well
in the context of informal short text messages typically found in tweets. It
then presents five real-world Twitter applications that utilize different feature
types. For each application, it also highlights the features that perform well
in the corresponding application setting. Finally, it concludes the chapter by
discussing Twitris, a real-time semantic social web analytics platform that has
already been commercialized, and its use of Twitter features.
Sheldon wondered to himself who this Von might be; was he perhaps the
very one who, two years earlier, had convinced Joan that it would be
best for her to marry?
"That is quite true, unfortunately, for that is exactly why I am
here."
"Very, very alone. I had no brothers or sisters, and all of my father's
relatives had been killed in a cyclone in Kansas. Father had been
just a small boy then. Of course I could have gone back to Von.
I am welcome there at any time, as if it were my own home. But why
would I have gone there? Besides, Father had entrusted his plans to
me, and I felt that this obliged me to carry them out. It seemed to
me a great task. And I wanted to carry them out. And, well, here I
am.
"Never go to Tahiti; that is my advice to you. The place is charming
and the natives are pleasant. But the white people! Thieves,
robbers, and liars, every one of them! There are not enough honest men
to need five fingers to count them. The fact that I was a woman
only made everything simpler for them. They robbed me of all my
property on every sort of pretext, and lied even without a pretext,
whether it was necessary or not. Poor Mr. Ericsen they managed to
bribe. He went into the robbers' service and certified all their bills
as correct, even when they were a thousand percent too high.
When they swindled me out of ten francs, three of them went to
him. When I paid a bill of fifteen hundred francs, it brought him
five hundred. All this, of course, I learned afterwards. But the
'Miele' was an old vessel, repairs were needed, and I had to pay
seven times the price.
"Of course I will never know how much Ericsen made. He had settled
ashore in a finely furnished house. The shipbuilders had given it
to him rent free. Fruit, vegetables, fish, meat, and ice were brought
to him every day, and he did not have to pay anything. The merchants
paid him part of the agreed fee in that form. And all the while,
with tears in his eyes, he lamented that I had fallen victim to such
wretched treatment. No, I had not come to a den of professional
thieves; I had merely come to Tahiti.
The Storm.