Creating Good Data: A Guide to Dataset Structure and Data Representation
Harry J. Foxwell
Fairfax, VA, USA
About the Author
Dr. Harry J. Foxwell teaches graduate data analytics courses
at George Mason University’s Department of Information
Sciences and Technology. He draws on his decades of
prior experience as a Principal System Engineer for Oracle
and for other major IT companies to help his students
understand the concepts, tools, and practices of big data
projects. He is a coauthor of several books on operating
systems administration and is a designer of the data analytics
curricula for his university courses. He is also a US Army
combat veteran, having served in Vietnam as a Platoon
Sergeant in the 1st Infantry Division. He lives in Fairfax,
Virginia, with his wife Eileen and two bothersome cats. Find out more about him at
https://cs.gmu.edu/~hfoxwell/.
About the Technical Reviewer
Thomas Plunkett has extensive experience with big data and data analytics. He has
taught university courses on related technical topics.
Acknowledgments
I have benefited greatly from valuable encouragement and support for this work from
numerous colleagues at George Mason University. Dr. James Baldo, Director of the Data
Analytics Engineering program, provided helpful early advice and focus suggestions.
And special thanks to Ms. Vidhyasri Ganapathi, Teaching Assistant for several of my data
analytics courses, for identifying students’ challenges in learning and practicing data
science and for confirming their need for this guidance in preparing good datasets.
Introduction
Extracting actionable knowledge from data is a major ongoing challenge of modern IT in
corporations, governments, and academia. Creating effectively usable datasets requires
an understanding of data quality issues and of data types and the related analytics which
can properly be applied. There are numerous data analytics resources – books, articles,
blogs, and even commercial software – describing how to clean up and transform
data after it has been collected, yet there is little practical guidance on how to avoid or
minimize the typical “data cleaning” tasks beforehand. Such guidance and best practices
are needed to eliminate or reduce lengthy dataset preparation.
Data analysts are often simply presented with poorly designed datasets for exploration
and study, leading to difficulties in interpretation and to delays in
producing usable results. In fact, some analysts report spending up to 80% of their
time just getting data ready to be explored so that it can be effectively interpreted.
Moreover, much data analytics training, and many published resources, focus on how to clean and
transform datasets before serious analyses can even begin. Inappropriate or confusing
representations, poor unit of measurement choices, coding errors, missing values, outliers,
and similar problems can be avoided through good data item selection, good dataset design and
collection, and an understanding of how data types determine the kinds of analyses that
can be performed.
Why not create good data from the start, keeping in mind how it will be used,
rather than fixing it after it is collected?
Creating Good Data discusses the principles and best practices of dataset creation
and covers basic data types and their related appropriate statistics and visualizations.
Following these guidelines results in more effective analyses and presentations of
your research data. A key focus of this book is why certain data types and structures
are chosen for representing concepts and measurements, in contrast to the usual
discussions of how to analyze a specific data type once it has been selected.
CHAPTER 1
Learning about data analytics tools and methods typically begins with discussions of
how to prepare a given dataset for analysis. The reason for this is that many datasets
have problems – defects in design, missing or incorrect data items, and non-standard
file formats. This often leads to lengthy and complex tasks required to produce datasets
ready for efficient analysis. Unfortunately, the critical first step – understanding the
nature of data representation – is frequently missing or not sufficiently addressed in
resources about data analytics, especially for practitioners just starting their technical
careers. Thus, in this chapter, we start with a detailed understanding of data – what it
is, how it is expressed, and what we mean by “good” and “bad” data. Only by basing your
analyses on good data will you produce trustworthy interpretations of your research,
leading to good decisions and knowledge-based actions. Let’s get started.
© Harry J. Foxwell 2020
H. J. Foxwell, Creating Good Data, https://doi.org/10.1007/978-1-4842-6103-3_1
Chapter 1 The Need for Good Data
• Students who are learning methods and tools for exploring data
Assumptions
We assume you have a basic knowledge of statistical methods and tools for summarizing
and visualizing datasets, including using tools such as R, Python, and SQL, and perhaps
some familiarity with commercial software such as SAS, SPSS, and Tableau. Many of you
likely already have a library of data analytics texts and other resources that cover data
cleaning and presentation but would like “early intervention” in dataset design.
All professionals in the rapidly growing data analytics field can benefit from
instruction on creating data themselves or on guiding others who will create datasets
for their analyses. Data analysts who are called upon to explore and explain other
researchers’ data can thus guide and encourage the creation of better datasets.
Practitioners, as well as students taking data analytics courses, can use Creating Good
Data regularly as a reference. The book can also serve as a
supplementary textbook for such courses.
By the end of Creating Good Data, you will understand
• Dataset formats and best practices for creating and sharing datasets
• Examples and use cases (good and bad)
Brief code examples from R, Python, and SQL will be included, but this book is not
intended to be a complete tutorial for data analysis coding in those languages – there are
plenty of those [2,3,4]. Our focus will be on dataset format and data representation using
those programming tools.
Figure 1-2. Data generated during a single Internet minute in 2018 [7]
www.visualcapitalist.com/internet-minute-2018/
• Accuracy
• Relevance
• Representative
• Well-defined
• Data items’ meanings must be unambiguously defined in a
schema, metadata, or data dictionary.
• Complete
• Granular
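The requirement that data items be well-defined can be made concrete with a small machine-readable data dictionary. The sketch below is illustrative only: the item names, descriptions, allowed values, and ranges are assumptions invented for this example, not definitions taken from any dataset in this book.

```python
# A minimal, hypothetical data dictionary: each data item gets a description,
# an expected storage type, and either a numeric range or a set of allowed labels.
DATA_DICTIONARY = {
    "age": {
        "description": "Age in completed years",
        "type": int,
        "min": 0,
        "max": 120,
    },
    "degree": {
        "description": "Highest degree earned",
        "type": str,
        "allowed": {"BS", "MS", "PhD"},
    },
}


def validate(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (empty list means none)."""
    problems = []
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "allowed" in spec and value not in spec["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
        if "min" in spec and not spec["min"] <= value <= spec["max"]:
            problems.append(f"{name}: {value} is outside {spec['min']}..{spec['max']}")
    return problems


print(validate({"age": 40, "degree": "BS"}))
print(validate({"age": 400, "degree": "BA"}))
print(validate({"degree": "MS"}))
```

Checking records against such a dictionary at collection time is one way to catch ambiguous or out-of-range values before they reach the analysis stage.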
“Data are people” [8]. Getting data “right” can have important, at times life-critical,
consequences – like data from testing the effectiveness of the Ebola vaccine, calculating
consumer financial decisions based on credit scores, or determining sentences for
crimes. Awareness and ethical practices concerning human-relevant data should always
be implemented in data selection and dataset management.
Good data even has the potential for changing fundamental beliefs. The astronomer
Kepler was taught and strongly believed that planetary orbits must be perfect circles;
his Mars data proved otherwise and led to his famous formulation of the laws of motion
for the planets. And today, climate scientists produce and publish data with the hope of
convincing the world about the dangers of climate change. Bad climate study data and
analysis simply encourage dangerous climate change denial; good climate data has the
potential for changing minds.
in, garbage out” succinctly describes this situation. The sources of bad data include pre-
collection design decisions, collection errors, and post-collection interpretation errors.
Understanding these sources and planning to address them is essential to effective and
accurate dataset analysis.
• Collection errors
Preventive Action
Some data analysts report spending the majority of their time on a project cleaning,
transforming, and preparing their assigned datasets [9]. Obviously, this is costly in time,
money, and technical resources. And as Deming also points out, trying to solve this
problem after the data have been created is not an effective solution:
Inspection does not improve the quality, nor guarantee quality. Inspection
is too late. The quality, good or bad, is already in the product. —W. Edwards
Deming [1]
That is, the “product” – data – needs to be created from the start using practices and
components that at least minimize the “bugs” in your datasets. Of course, eliminating
all such problems is probably not possible, but if you can get off to a good start with your
analytical projects’ data, you will produce better and more trusted results.
Summary
In this introductory chapter, we learned about the need for good data, what we mean by
“good” and “bad” data, and the origins of potential dataset problems. Minimizing such
problems requires awareness of how data collection can fail and the use of procedures
that ensure quality project design and execution.
The next chapter examines the numerous data types and formats which can be used
to represent observations. Thoroughly understanding these data characteristics and
using them appropriately will help you significantly in designing and executing your
research.
Chapter References
[1] W. Edwards Deming Quotes, https://quotes.deming.org/
[3] Hui, Eric. Learn R for Applied Statistics. New York, NY: Apress, 2019.
[4] Nelli, Fabio. Python Data Analytics. New York, NY: Apress, 2018.
[8] Ten simple rules for responsible big data research, https://dash.harvard.edu/bitstream/handle/1/32630692/5373508.pdf
CHAPTER 2
Decisions about how to represent data measurements for your research projects
have important consequences – they directly determine what kinds of statistical and
visualization methods can ultimately be used for analysis and presentation of your
results. This means you need to select representation types thoughtfully with your
analytical goals in mind while at the same time trying to avoid any form of bias in what
you decide to measure and what you anticipate your data exploration and analysis tasks
will look like.
When we consider a “type” for a data item, we need to specify its context and
purpose for our discussion. For programming languages, we define computational
data types (storage formats), such as integer (short and long), floating point (single and
double precision), character (single and multicharacter strings), Boolean (0/1, T/F),
and derived types such as pointers. These definitions specify how many bits or bytes
each type uses and how it is referenced by the syntax of the language.
For data analytics, however, we focus on how a data item is to be used and
interpreted and so refer to analytical data types. Additionally, we categorize such data
items as qualitative or quantitative, and then we discuss how varieties within these two
categories represent specific measurement requirements.
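The difference between a computational and an analytical data type can be seen in a short Python sketch. The integer coding scheme and the sample values here are hypothetical, chosen only to show that a value stored as an integer may still be a qualitative label.

```python
from statistics import mean, mode

# Hypothetical coding scheme: integer codes standing in for a field of study.
FIELD_CODES = {1: "Engr", 2: "Comp", 3: "Biol", 4: "Chem"}

# Computational type: int. Analytical type: qualitative (a label, not a quantity).
coded_fields = [1, 1, 2, 2, 2, 3, 2, 4, 4, 2]

# Python will average the codes without complaint, because they are stored as integers...
meaningless_mean = mean(coded_fields)

# ...but the result does not correspond to any field of study.
# The most frequent code, in contrast, is a meaningful summary of labels.
most_common_field = FIELD_CODES[mode(coded_fields)]

print(type(coded_fields[0]).__name__)
print(meaningless_mean)
print(most_common_field)
```

The language cannot tell that the average is meaningless; only the analytical type, recorded by the dataset designer, carries that information.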
© Harry J. Foxwell 2020
H. J. Foxwell, Creating Good Data, https://doi.org/10.1007/978-1-4842-6103-3_2
Chapter 2 Basic Data Types and When to Use Them
We will now discuss the four generally accepted data types, Nominal, Ordinal,
Interval, and Ratio (often abbreviated NOIR), and include brief Python or R code
examples to illustrate basic statistical and visualization methods for the data types
presented.
Nominal/Categorical Data
Nominal (also called categorical) measures are used to capture qualitative attributes that
have no size or extent characteristics. They are simply labels or names for some observed
attribute such as country of birth, occupation, manufacturing brand, or language
spoken. This also includes binary (or dichotomous) characteristics such as true/
false, yes/no, or agree/disagree. Such measures explicitly have no implication of order
among the labels. For example, for a person’s primary spoken language (English, French,
Spanish, Farsi, Chinese, etc.), there is no implied order among the languages; you
can’t claim that French is “bigger” than English in any linguistic or mathematical sense
(unless perhaps if you are from France!). Moreover, you can’t do any kind of arithmetical
operations among the category members – there is no concept of a “mean language.”
A nominal data representation must have two important characteristics: its
categories must be exhaustive and mutually exclusive. Exhaustive means that the
categories cover all possible values in some manner, although there might be a generic
“other” category that encompasses multiple cases of low frequency or importance.
Mutually exclusive means there cannot be any cases that belong to more than one
category. Nominal data items might have many categories represented (e.g., country of
birth, where there are nearly 200) or only a few such as gender (male or female).
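Both requirements can be checked mechanically when categories are defined as groups of raw values. The following is a minimal sketch, assuming a hypothetical grouping scheme and a helper named check_scheme; the scheme is deliberately flawed so that both kinds of problem appear.

```python
from itertools import combinations


def check_scheme(scheme, observed):
    """Check a category scheme (category name -> set of raw values).

    Returns (overlaps, uncovered):
      overlaps  - raw values claimed by more than one category
                  (violates mutual exclusivity)
      uncovered - observed raw values that no category claims
                  (violates exhaustiveness)
    """
    overlaps = {}
    for (name_a, members_a), (name_b, members_b) in combinations(scheme.items(), 2):
        shared = members_a & members_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    covered = set().union(*scheme.values())
    uncovered = set(observed) - covered
    return overlaps, uncovered


# Hypothetical, deliberately flawed scheme: "Comp" appears in two categories,
# and no category accounts for "Chem".
scheme = {
    "Engineering": {"Engr", "Comp"},
    "Computing": {"Comp"},
    "Science": {"Biol"},
}
observed = ["Engr", "Comp", "Biol", "Chem"]

overlaps, uncovered = check_scheme(scheme, observed)
print(overlaps)
print(uncovered)
```

An empty result for both values indicates that, for the observed data at least, every value falls into exactly one category. Adding a generic "Other" category is one way to absorb uncovered values.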
Tip Many studies have used gender as a qualifying nominal variable, but such
classification is not always well-defined and recent usage might include “other” or
specific “non-binary” values. Be aware of any such relevant ambiguities in the data
items you select for your analysis.
Because there is no implied quantity for a nominal data item, this limits the kinds
of summary statistics, visualizations, and comparisons that are allowable for analysis.
Nominal data item collections don’t have means, maxima or minima, or measures of
variability like standard deviations. All that can be used to characterize such data are
frequencies of occurrence – how many there are in each category, which can be expressed
as absolute counts or as percentages or proportions of the total. Visual summaries of
such data include various forms of bar charts and pie charts. And when counting the
frequency of items in each possible category, the category with the largest number
of items is the mode (although there might be several categories with relatively high
frequencies, referred to as multimodal).
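These permissible summaries can be computed with the Python standard library alone. The language labels below are made-up sample values, used only to illustrate counts, proportions, and the mode.

```python
from collections import Counter
from statistics import multimode

# Made-up sample of primary spoken languages (nominal labels).
languages = [
    "English", "French", "Spanish", "English", "Farsi",
    "Chinese", "Spanish", "English", "Spanish",
]

# Absolute counts per category.
counts = Counter(languages)

# Proportions of the total.
total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# multimode returns every label tied for the highest frequency.
modes = multimode(languages)

for label, n in counts.most_common():
    print(f"{label:<8} {n}  {proportions[label]:.1%}")
print("mode(s):", modes)
```

Note that multimode reports only exact ties for the highest count; deciding whether several categories with merely similar frequencies should be called multimodal remains a judgment for the analyst.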
To illustrate nominal data (and subsequent data types), let’s examine a sample
synthetic dataset (constructed for illustration) of hypothetical college graduates, GD-
Data.csv [3]. Only the first ten records of 500 are shown:
gender;age;degree;field;wrkfld;annsal;payfair;jobsat
Female;40;BS;Engr;Yes;78.0;4;4
Male;39;MS;Engr;Yes;64.0;4;4
Male;36;MS;Comp;No;70.0;3;4
Male;42;MS;Comp;Yes;85.0;5;3
NotSay;39;BS;Comp;Yes;71.0;5;3
Male;38;MS;Biol;Yes;113.0;3;4
Male;38;MS;Comp;Yes;84.0;3;3
Male;37;MS;Chem;Yes;61.0;3;2
Other;28;MS;Chem;Yes;72.5;3;2
Female;31;MS;Comp;Yes;73.5;4;3
...
As we see from Listing 2-1, field is a character string representing a category, so the
permissible analytics for this data item include counts, relative frequencies, and bar
chart visualizations, as shown in Figure 2-1.
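As a rough illustration, the counts and relative frequencies for field can be computed from the ten records shown above using only the Python standard library. This sketch embeds those ten records as a string rather than reading the full 500-record file, and it prints a simple text bar chart in place of a plotted figure; it is not the code used to produce Figure 2-1.

```python
import csv
import io
from collections import Counter

# The ten sample records shown in the listing (semicolon-delimited).
SAMPLE = """gender;age;degree;field;wrkfld;annsal;payfair;jobsat
Female;40;BS;Engr;Yes;78.0;4;4
Male;39;MS;Engr;Yes;64.0;4;4
Male;36;MS;Comp;No;70.0;3;4
Male;42;MS;Comp;Yes;85.0;5;3
NotSay;39;BS;Comp;Yes;71.0;5;3
Male;38;MS;Biol;Yes;113.0;3;4
Male;38;MS;Comp;Yes;84.0;3;3
Male;37;MS;Chem;Yes;61.0;3;2
Other;28;MS;Chem;Yes;72.5;3;2
Female;31;MS;Comp;Yes;73.5;4;3
"""

# The delimiter must be given explicitly because the file uses semicolons, not commas.
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter=";")
field_counts = Counter(row["field"] for row in reader)
n = sum(field_counts.values())

# Text bar chart: one '#' per record, with the count and relative frequency.
for field, count in field_counts.most_common():
    print(f"{field:<5} {'#' * count:<6} {count} ({count / n:.0%})")
```

With the full file, the same approach would apply by opening the CSV file in place of the embedded string.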