Java for Data Science
Richard M. Reese
Jennifer L. Reese
BIRMINGHAM - MUMBAI
Java for Data Science
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the authors, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
Richard has written several Java books and a C Pointer book. He uses a concise and easy-to-
follow approach to topics at hand. His Java books have addressed EJB 3.1, updates to Java 7
and 8, certification, jMonkeyEngine, natural language processing, functional programming,
and networks.
Richard would like to thank his wife, Karla, for her continued support, and the staff of
Packt Publishing for their work in making this a better book.
Jennifer L. Reese studied computer science at Tarleton State University. She also earned her
M.Ed. from Tarleton in December 2016. She currently teaches computer science to high-
school students. Her research interests include the integration of computer science concepts
with other academic disciplines, increasing diversity in computer science courses, and the
application of data science to the field of education.
She previously worked as a software engineer developing software for county- and district-
level government offices throughout Texas. In her free time she enjoys reading, cooking,
and traveling—especially to any destination with a beach. She is a musician and appreciates
a variety of musical genres.
I would like to thank Dad for his inspiration and guidance, Mom for her patience and
perspective, and Jace for his support and always believing in me.
About the Reviewers
Walter Molina is a UI and UX developer from Villa Mercedes, San Luis, Argentina. His
skills include, but are not limited to, HTML5, CSS3, and JavaScript. He uses these
technologies at a Jedi/ninja level (along with a plethora of JavaScript libraries) in his daily
work as a frontend developer at Tachuso, a creative content agency. He holds a bachelor's
degree in computer science and is a member of the School of Engineering at local National
University, where he teaches programming skills to second- and third-year students. His
LinkedIn profile is https://ar.linkedin.com/in/waltermolina.
Shilpi Saxena is an IT professional and also a technology evangelist. She is an engineer who
has had exposure to various domains (IOT and cloud computing space, healthcare, telecom,
hiring, and manufacturing). She has experience in all the aspects of conception and
execution of enterprise solutions. She has been architecting, managing, and delivering
solutions in the big data space for the last 3 years; she also handles a high-performance and
geographically distributed team of elite engineers.
Shilpi has more than 14 years (3 years in the big data space) of experience in the
development and execution of various facets of enterprise solutions both in the products
and services dimensions of the software industry. An engineer by degree and profession,
she has worn various hats, such as developer, technical leader, product owner, tech
manager, and so on, and has seen all the flavors that the industry has to offer. She has
architected and worked through some of the pioneers' production implementations in big
data on Storm and Impala with autoscaling in AWS.
Shilpi has also authored Real-time Analytics with Storm and Cassandra
(https://www.packtpub.com/big-data-and-business-intelligence/learning-real-time-analytics-storm-and-cassandra)
and Real time Big Data Analytics
(https://www.packtpub.com/big-data-and-business-intelligence/real-time-big-data-analytics)
with Packt Publishing.
www.PacktPub.com
eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with us
at customercare@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters and receive exclusive discounts and offers on Packt books and
eBooks.
https://www.packtpub.com/mapt
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thank you for purchasing this Packt book. We take our commitment to improving our
content and products to meet your needs seriously—that's why your feedback is so
valuable. Whatever your feelings about your purchase, please consider leaving a review on
this book's Amazon page. Not only will this help us, more importantly it will also help
others in the community to make an informed decision about the resources that they invest
in to learn.
You can also review for us on a regular basis by joining our reviewers' club. If you're
interested in joining, or would like to learn more about the benefits we offer, please
contact us: customerreviews@packtpub.com.
Table of Contents
Preface
Chapter 1: Getting Started with Data Science
  Problems solved using data science
  Understanding the data science problem-solving approach
  Using Java to support data science
  Acquiring data for an application
  The importance and process of cleaning data
  Visualizing data to enhance understanding
  The use of statistical methods in data science
  Machine learning applied to data science
  Using neural networks in data science
  Deep learning approaches
  Performing text analysis
  Visual and audio analysis
  Improving application performance using parallel techniques
  Assembling the pieces
  Summary
Chapter 2: Data Acquisition
  Understanding the data formats used in data science applications
  Overview of CSV data
  Overview of spreadsheets
  Overview of databases
  Overview of PDF files
  Overview of JSON
  Overview of XML
  Overview of streaming data
  Overview of audio/video/images in Java
  Data acquisition techniques
  Using the HttpUrlConnection class
  Web crawlers in Java
  Creating your own web crawler
  Using the crawler4j web crawler
  Web scraping in Java
  Using API calls to access common social media sites
  Using OAuth to authenticate users
  Handling Twitter
  Handling Wikipedia
  Handling Flickr
  Handling YouTube
  Searching by keyword
  Summary
Chapter 3: Data Cleaning
  Handling data formats
  Handling CSV data
  Handling spreadsheets
  Handling Excel spreadsheets
  Handling PDF files
  Handling JSON
  Using JSON streaming API
  Using the JSON tree API
  The nitty gritty of cleaning text
  Using Java tokenizers to extract words
  Java core tokenizers
  Third-party tokenizers and libraries
  Transforming data into a usable form
  Simple text cleaning
  Removing stop words
  Finding words in text
  Finding and replacing text
  Data imputation
  Subsetting data
  Sorting text
  Data validation
  Validating data types
  Validating dates
  Validating e-mail addresses
  Validating ZIP codes
  Validating names
  Cleaning images
  Changing the contrast of an image
  Smoothing an image
  Brightening an image
  Resizing an image
  Converting images to different formats
  Summary
Chapter 4: Data Visualization
  Understanding plots and graphs
  Visual analysis goals
  Creating index charts
  Creating bar charts
  Using country as the category
  Using decade as the category
  Creating stacked graphs
  Creating pie charts
  Creating scatter charts
  Creating histograms
  Creating donut charts
  Creating bubble charts
  Summary
Chapter 5: Statistical Data Analysis Techniques
  Working with mean, mode, and median
  Calculating the mean
  Using simple Java techniques to find mean
  Using Java 8 techniques to find mean
  Using Google Guava to find mean
  Using Apache Commons to find mean
  Calculating the median
  Using simple Java techniques to find median
  Using Apache Commons to find the median
  Calculating the mode
  Using ArrayLists to find multiple modes
  Using a HashMap to find multiple modes
  Using Apache Commons to find multiple modes
  Standard deviation
  Sample size determination
  Hypothesis testing
  Regression analysis
  Using simple linear regression
  Using multiple regression
  Summary
Chapter 6: Machine Learning
  Supervised learning techniques
  Decision trees
  Decision tree types
  Decision tree libraries
  Using a decision tree with a book dataset
  Testing the book decision tree
  Support vector machines
  Using an SVM for camping data
  Testing individual instances
  Bayesian networks
  Using a Bayesian network
  Unsupervised machine learning
  Association rule learning
  Using association rule learning to find buying relationships
  Reinforcement learning
  Summary
Chapter 7: Neural Networks
  Training a neural network
  Getting started with neural network architectures
  Understanding static neural networks
  A basic Java example
  Understanding dynamic neural networks
  Multilayer perceptron networks
  Building the model
  Evaluating the model
  Predicting other values
  Saving and retrieving the model
  Learning vector quantization
  Self-Organizing Maps
  Using a SOM
  Displaying the SOM results
  Additional network architectures and algorithms
  The k-Nearest Neighbors algorithm
  Instantaneously trained networks
  Spiking neural networks
  Cascading neural networks
  Holographic associative memory
  Backpropagation and neural networks
  Summary
Chapter 8: Deep Learning
  Deeplearning4j architecture
  Acquiring and manipulating data
  Reading in a CSV file
  Configuring and building a model
  Using hyperparameters in ND4J
  Instantiating the network model
  Training a model
  Testing a model
  Deep learning and regression analysis
  Preparing the data
  Setting up the class
  Reading and preparing the data
  Building the model
  Evaluating the model
  Restricted Boltzmann Machines
  Reconstruction in an RBM
  Configuring an RBM
  Deep autoencoders
  Building an autoencoder in DL4J
  Configuring the network
  Building and training the network
  Saving and retrieving a network
  Specialized autoencoders
  Convolutional networks
  Building the model
  Evaluating the model
  Recurrent Neural Networks
  Summary
Chapter 9: Text Analysis
  Implementing named entity recognition
  Using OpenNLP to perform NER
  Identifying location entities
  Classifying text
  Word2Vec and Doc2Vec
  Classifying text by labels
  Classifying text by similarity
  Understanding tagging and POS
  Using OpenNLP to identify POS
  Understanding POS tags
  Extracting relationships from sentences
  Using OpenNLP to extract relationships
  Sentiment analysis
  Downloading and extracting the Word2Vec model
  Building our model and classifying text
  Summary
Chapter 10: Visual and Audio Analysis
  Text-to-speech
  Using FreeTTS
  Getting information about voices
  Gathering voice information
  Understanding speech recognition
  Using CMUSphinx to convert speech to text
  Obtaining more detail about the words
  Extracting text from an image
  Using Tess4j to extract text
  Identifying faces
  Using OpenCV to detect faces
  Classifying visual data
  Creating a Neuroph Studio project for classifying visual images
  Training the model
  Summary
Chapter 11: Mathematical and Parallel Techniques for Data Analysis
  Implementing basic matrix operations
  Using GPUs with DeepLearning4j
  Using map-reduce
  Using Apache's Hadoop to perform map-reduce
  Writing the map method
  Writing the reduce method
  Creating and executing a new Hadoop job
  Various mathematical libraries
  Using the jblas API
  Using the Apache Commons math API
  Using the ND4J API
  Using OpenCL
  Using Aparapi
  Creating an Aparapi application
  Using Aparapi for matrix multiplication
  Using Java 8 streams
  Understanding Java 8 lambda expressions and streams
  Using Java 8 to perform matrix multiplication
  Using Java 8 to perform map-reduce
  Summary
Chapter 12: Bringing It All Together
  Defining the purpose and scope of our application
  Understanding the application's architecture
  Data acquisition using Twitter
  Understanding the TweetHandler class
  Extracting data for a sentiment analysis model
  Building the sentiment model
  Processing the JSON input
  Cleaning data to improve our results
  Removing stop words
  Performing sentiment analysis
  Analysing the results
  Other optional enhancements
  Summary
Index
Preface
In this book, we examine Java-based approaches to the field of data science. Data science is
a broad topic and includes such subtopics as data mining, statistical analysis, audio and
video analysis, and text analysis. A number of Java APIs provide support for these topics.
The ability to apply these specific techniques allows for the creation of new, innovative
applications able to handle the vast amounts of data available for analysis.
This book takes an expansive yet cursory approach to various aspects of data science. A
brief introduction to the field is presented in the first chapter. Subsequent chapters cover
significant aspects of data science, such as data cleaning and the application of neural
networks. The last chapter combines topics discussed throughout the book to create a
comprehensive data science application.
Chapter 2, Data Acquisition, demonstrates how to acquire data from a number of sources,
including Twitter, Wikipedia, and YouTube. The first step of a data science application is to
acquire data.
Chapter 3, Data Cleaning, explains that once data has been acquired, it needs to be cleaned.
This can involve such activities as removing stop words, validating the data, and data
conversion.
Chapter 4, Data Visualization, shows that while numerical processing is a critical step in
many data science tasks, people often prefer visual depictions of the results of analysis. This
chapter demonstrates various Java approaches to this task.
Chapter 5, Statistical Data Analysis Techniques, reviews basic statistical techniques, including
regression analysis, and demonstrates how various Java APIs provide statistical support.
Statistical analysis is key to many data analysis tasks.
Chapter 7, Neural Networks, explains that neural networks can be applied to solve a variety
of data science problems. In this chapter, we explain how they work and demonstrate the
use of several different types of neural networks.
Chapter 8, Deep Learning, shows that deep learning algorithms are often described as
multilevel neural networks. Java provides significant support in this area, and we will
illustrate the use of this approach.
Chapter 9, Text Analysis, explains that significant portions of available datasets exist in
textual formats. The field of natural language processing has advanced considerably and is
frequently used in data science applications. We demonstrate various Java APIs used to
support this type of analysis.
Chapter 10, Visual and Audio Analysis, tells us that data science is not restricted to text
processing. Many social media sites use visual data extensively. This chapter illustrates the
Java support available for this type of analysis.
Chapter 11, Mathematical and Parallel Techniques for Data Analysis, investigates the support
provided for low-level math operations and how they can be supported in a multiple
processor environment. Data analysis, at its heart, necessitates the ability to manipulate and
analyze large quantities of numeric data.
Chapter 12, Bringing It All Together, examines how the integration of the various
technologies introduced in this book can be used to create a data science application. This
chapter begins with data acquisition and incorporates many of the techniques used in
subsequent chapters to build a complete application.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text are shown as follows: "The getResult method returns a
SpeechResult instance which holds the result of the processing." Database table names,
folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter
handles are shown as follows: "The KevinVoiceDirectory contains two voices: kevin
and kevin16."
New terms and important words are shown in bold. Words that you see on the screen, for
example, in menus or dialog boxes, appear in the text like this: "Select the Images category
and then filter for Labeled for reuse."
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of. To send us general feedback, simply
e-mail feedback@packtpub.com, and mention the book's title in the subject of your
message. If there is a topic that you have expertise in and you are interested in either
writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.
1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Java-for-Data-Science. We also have other code bundles
from our rich catalog of books and videos available at https://github.com/PacktPublishing/.
Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the code),
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy.
We appreciate your help in protecting our authors and our ability to bring you valuable
content.
Questions
If you have a problem with any aspect of this book, you can contact us
at questions@packtpub.com, and we will do our best to address the problem.
Chapter 1: Getting Started with Data Science
Data science is not a single science as much as it is a collection of various scientific
disciplines integrated for the purpose of analyzing data. In addition to various
statistical and mathematical techniques, these disciplines include:
Computer science
Data engineering
Visualization
Domain-specific knowledge and approaches
With the advent of cheaper storage technology, more and more data has been collected and
stored, permitting previously infeasible processing and analysis of data. With this analysis
came the need for various techniques to make sense of the data. These large sets of data,
when analyzed to identify trends and patterns, became known as big data.
This in turn gave rise to cloud computing and concurrent techniques such as map-reduce,
which distributed the analysis process across a large number of processors, taking
advantage of the power of parallel processing.
The process of analyzing big data is not simple, and it has led to the emergence of
specialized developers known as data scientists. Drawing upon a myriad of technologies
and expertise, they are able to analyze data to solve problems that previously were
either not envisioned or were too difficult to solve.
Early big data applications were typified by the emergence of search engines capable of
more powerful and accurate searches than their predecessors. For example, AltaVista was
an early popular search engine that was eventually superseded by Google. While big data
applications were not limited to these search engine functionalities, these applications laid
the groundwork for future work in big data.
The term, data science, has been used since 1974 and evolved over time to include statistical
analysis of data. The concepts of data mining and data analytics have been associated with
data science. Around 2008, the term data scientist appeared and was used to describe a
person who performs data analysis. A more in-depth discussion of the history of data
science can be found at
http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#3d9ea08369fd.
This book aims to take a broad look at data science using Java and will briefly touch on
many topics. It is likely that the reader may find topics of interest and pursue these at
greater depth independently. The purpose of this book, however, is simply to introduce the
reader to the significant data science topics and to illustrate how they can be addressed
using Java.
There are many algorithms used in data science. In this book, we do not attempt to explain
how they work except at an introductory level. Rather, we are more interested in explaining
how they can be used to solve problems. Specifically, we are interested in knowing how
they can be used with Java.
Data mining is a popular application area for data science. In this activity, large quantities
of data are processed and analyzed to glean information about the dataset, to provide
meaningful insights, and to develop meaningful conclusions and predictions. It has been
used to analyze customer behavior, to detect relationships between what may appear to be
unrelated events, and to make predictions about future behavior.
Machine learning is an important aspect of data science. This technique allows the
computer to solve various problems without needing to be explicitly programmed. It has
been used in self-driving cars, speech recognition, and in web searches. In data mining, the
data is extracted and processed. With machine learning, computers use the data to take
some sort of action.
Acquiring the data: Before we can process the data, it must be acquired. The data
is frequently stored in a variety of formats and will come from a wide range of
data sources.
Cleaning the data: Once the data has been acquired, it often needs to be
converted to a different format before it can be used. In addition, the data needs
to be processed, or cleaned, so as to remove errors, resolve inconsistencies, and
otherwise put it in a form ready for analysis.
Analyzing the data: This can be performed using a number of techniques
including:
    Statistical analysis: This uses a multitude of statistical approaches
    to provide insight into data. It includes simple techniques and
    more advanced techniques such as regression analysis.
    AI analysis: These can be grouped as machine learning, neural
    networks, and deep learning techniques:
        Machine learning approaches are characterized by
        programs that can learn without being specifically
        programmed to complete a specific task
        Neural networks are built around models patterned
        after the neural connection of the brain
        Deep learning attempts to identify higher levels of
        abstraction within a set of data
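The acquire, clean, and analyze steps listed above can be sketched end to end in a few lines of plain Java. This is only an illustrative outline, not code from the book: the class name PipelineSketch, the hard-coded sample values, and the choice of the arithmetic mean as the "analysis" are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;
import java.util.stream.Collectors;

class PipelineSketch {

    // Step 1 (acquire): a stand-in for reading from a file, database, or web API.
    static List<String> acquire() {
        return Arrays.asList(" 12.5", "n/a", "7.0 ", "", "10.5");
    }

    // Step 2 (clean): trim whitespace and drop entries that are not plain decimals.
    static List<Double> clean(List<String> raw) {
        return raw.stream()
                .map(String::trim)
                .filter(s -> s.matches("-?\\d+(\\.\\d+)?"))
                .map(Double::valueOf)
                .collect(Collectors.toList());
    }

    // Step 3 (analyze): a simple statistical measure, the arithmetic mean.
    static OptionalDouble analyze(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average();
    }

    public static void main(String[] args) {
        List<Double> cleaned = clean(acquire());
        System.out.println("Cleaned values: " + cleaned);
        System.out.println("Mean: " + analyze(cleaned).orElse(Double.NaN));
    }
}
```

In a real application each step would be far larger, but keeping them as separate methods mirrors the way the later chapters treat acquisition, cleaning, and analysis as distinct concerns.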
Complementing this set of tasks is the need to develop applications that are efficient. The
introduction of machines with multiple processors and GPUs contributes significantly to
the end result.
While the exact steps used will vary by application, understanding these basic steps
provides the basis for constructing solutions to many data science problems.
There is ample support provided for the basic data science tasks. These include multiple
ways of acquiring data, libraries for cleaning data, and a wide variety of analysis
approaches for tasks such as natural language processing and statistical analysis. There is
also a myriad of libraries supporting neural network types of analysis.
Java can be a very good choice for data science problems. The language provides both
object-oriented and functional support for solving problems. There is a large developer
community to draw upon and there exist multiple APIs that support data science tasks.
These are but a few reasons as to why Java should be used.
The remainder of this chapter will provide an overview of the data science tasks and Java
support demonstrated in the book. Each section is only able to present a brief introduction
to the topics and the available support. Subsequent chapters go into considerably
more depth regarding these topics.
Data may be stored in a variety of formats. Popular formats for text data include HTML,
Comma Separated Values (CSV), JavaScript Object Notation (JSON), and XML. Image
and audio data are stored in a number of formats. However, it is frequently necessary to
convert one data format into another format, typically plain text.
With many popular media sites, it is necessary to acquire a user ID and password to access
data. A commonly used technique is OAuth, an open standard for delegated authorization:
rather than handing a password to an application, the user authorizes the site to issue the
application a token that grants limited access to a server resource. OAuth works over HTTPS.
Several companies use OAuth 2.0, including PayPal, Facebook, Twitter, and Yelp.
When data is cleaned, there are several tasks that often need to be performed, including
checking its validity, accuracy, completeness, consistency, and uniformity. For example,
when the data is incomplete, it may be necessary to provide substitute values.
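As one possible illustration of providing substitute values, the sketch below fills missing numeric entries with the mean of the values that are present. The class name MeanImputation, the use of null to mark a missing value, and the fallback of 0.0 when nothing is present are assumptions made for this example; other strategies (median, a constant, or dropping the record) may suit a given dataset better.

```java
import java.util.Arrays;

class MeanImputation {

    // Replaces each null entry with the mean of the non-null entries.
    static double[] imputeWithMean(Double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> v != null)
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0); // assumption: fall back to 0.0 if every value is missing
        return Arrays.stream(values)
                .mapToDouble(v -> v == null ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        Double[] ages = {30.0, null, 40.0, null, 50.0};
        System.out.println(Arrays.toString(imputeWithMean(ages)));
    }
}
```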
Consider CSV data. It can be handled in one of several ways. We can use simple Java
techniques such as the String class' split method. In the following sequence, a string
array, csvArray, is assumed to hold comma-delimited data, one record per element. The
split method populates a second, two-dimensional array, tokenArray.
// One row of tokens per CSV record
String[][] tokenArray = new String[csvArray.length][];
for (int i = 0; i < csvArray.length; i++) {
    tokenArray[i] = csvArray[i].split(",");
}
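A detail worth knowing when splitting CSV records this way is that split(",") silently drops trailing empty fields, so a record ending in a comma yields fewer tokens than expected. The following sketch, with the hypothetical class name CsvSplitSketch and made-up sample records, passes a limit of -1 to keep those fields and trims each token. It still assumes that no field contains a quoted comma; data of that kind generally calls for a dedicated CSV library.

```java
import java.util.Arrays;

class CsvSplitSketch {

    // Splits each comma-delimited record into trimmed fields.
    // The -1 limit keeps trailing empty fields, which split(",") would drop.
    static String[][] tokenize(String[] csvLines) {
        String[][] tokens = new String[csvLines.length][];
        for (int i = 0; i < csvLines.length; i++) {
            String[] fields = csvLines[i].split(",", -1);
            for (int j = 0; j < fields.length; j++) {
                fields[j] = fields[j].trim();
            }
            tokens[i] = fields;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] csvArray = {"Peter, Smith, 34", "Ann,Jones,"};
        for (String[] row : tokenize(csvArray)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```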
More complex data types require APIs to retrieve the data. For example, in Chapter 3, Data
Cleaning, we will use the Jackson Project (https://github.com/FasterXML/jackson) to
retrieve fields from a JSON file. The example uses a file containing a JSON-formatted
presentation of a person, as shown next:
{
"firstname":"Smith",
"lastname":"Peter",
"phone":8475552222,
"address":["100 Main Street","Corpus","Oklahoma"]
}
The code sequence that follows shows how to extract the values for fields of a person. A
parser is created, which uses getCurrentName to retrieve a field name. If the name is
firstname, then the getText method returns the value for that field. The other fields are
handled in a similar manner.
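import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.File;
import java.io.IOException;
import static java.lang.System.out;

try {
    JsonFactory jsonfactory = new JsonFactory();
    JsonParser parser = jsonfactory.createParser(
        new File("Person.json"));
    // Walk the token stream until the end of the top-level object
    while (parser.nextToken() != JsonToken.END_OBJECT) {
        String token = parser.getCurrentName();
        if ("firstname".equals(token)) {
            // Advance from the field name to its value
            parser.nextToken();
            String fname = parser.getText();
            out.println("firstname : " + fname);
        }
        ...
    }
    parser.close();
} catch (IOException ex) {
    // Handle exceptions
}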
Simple data cleaning may involve converting the text to lowercase, replacing certain text
with blanks, and replacing runs of whitespace characters with a single blank. One way of
doing this is shown next, where a combination of the String class' toLowerCase,
replaceAll, and trim methods is used. Here, a string containing dirty text is processed:
dirtyText = dirtyText
    .toLowerCase()
    .replaceAll("[\\d[^\\w\\s]]+", " ")
    .trim();
// Collapse any remaining double blanks into single blanks
while (dirtyText.contains("  ")) {
    dirtyText = dirtyText.replaceAll("  ", " ");
}
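The loop that collapses repeated blanks can also be expressed as one more regular-expression pass. The sketch below is an alternative formulation rather than the book's code: the class name TextCleaningSketch and the sample sentence are assumptions, and the pattern \\s+ collapses any run of whitespace (including tabs and newlines), which is slightly broader than collapsing blanks alone.

```java
class TextCleaningSketch {

    // Lowercases the text, replaces digits and punctuation with blanks,
    // then collapses any run of whitespace into a single blank.
    static String clean(String dirtyText) {
        return dirtyText
                .toLowerCase()
                .replaceAll("[\\d[^\\w\\s]]+", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("Call me Ishmael.  Some years ago- 1851!"));
    }
}
```

Because String is immutable, each call returns a new string; for very large inputs a precompiled java.util.regex.Pattern may be worth considering.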
Stop words are words such as the, and, or, and but that do not always contribute to the
analysis of text. Removing these stop words can often improve the results and speed up the
processing.
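The LingPipe API can be used to remove stop words. In the next code sequence,
a TokenizerFactory class instance is used to tokenize text. Tokenization is the process of
splitting text into individual words. The EnglishStopTokenizerFactory class is a special
tokenizer factory that removes common English stop words.
import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
import static java.lang.System.out;

text = text.toLowerCase().trim();
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE;
// Wrap the base factory so that English stop words are filtered out
fact = new EnglishStopTokenizerFactory(fact);
Tokenizer tok = fact.tokenizer(
    text.toCharArray(), 0, text.length());
for (String word : tok) {
    out.print(word + " ");
}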
Consider the following text, which was pulled from the book, Moby Dick:
Call me Ishmael. Some years ago- never mind how long precisely - having
little or no money in my purse, and nothing particular to interest me
on shore, I thought I would sail about a little and see the watery part
of the world.
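The idea behind stop-word removal can also be sketched in plain Java, without LingPipe. This is only an illustrative approximation: the class StopWordFilter, the method removeStopWords, and the tiny STOP_WORDS list are all hypothetical. The list is not the one used by EnglishStopTokenizerFactory, and the whitespace split is much cruder than LingPipe's tokenizer, so the results will differ from the library's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {

    // Hypothetical, deliberately tiny stop list for illustration only;
    // it is NOT the list used by LingPipe's EnglishStopTokenizerFactory.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "but", "in", "of", "or", "the", "to"));

    // Lowercases the text, splits it on whitespace, and keeps only the
    // tokens that are not in STOP_WORDS.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (String word : words) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "I thought I would sail about a little and see "
                + "the watery part of the world";
        System.out.println(String.join(" ", removeStopWords(text)));
    }
}
```

A HashSet keeps each lookup at roughly constant time, which matters when the filter runs over every token of a large corpus.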
These are just a couple of the data cleaning tasks discussed in Chapter 3, Data Cleaning.
The human mind is often good at seeing patterns, trends, and outliers in visual
representations. The large amount of data present in many data science problems can be
analyzed using visualization techniques. Visualization is appropriate for a wide range of
audiences ranging from analysts to upper-level management to clientele. In this chapter, we
present various visualization techniques and demonstrate how they are supported in Java.