Statistics for Data Science and Analytics 1st Edition Peter C. Bruce instant download
Downloaded from https://onlinelibrary.wiley.com/doi/ by ibrahim ragab - Universita Di Firenze Sistema , Wiley Online Library on [13/08/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Statistics for Data Science and Analytics
Statistics for Data Science and Analytics
Peter Gedeck
Copyright © 2025 by John Wiley & Sons, Inc. All rights reserved, including rights for text and
data mining and training of artificial technologies or similar technologies.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers,
MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to
the Publisher for permission should be addressed to the Permissions Department, John Wiley &
Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley &
Sons, Inc. and/or its affiliates in the United States and other countries and may not be used
without written permission. All other trademarks are the property of their respective owners.
John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created or
extended by sales representatives or written sales materials. The advice and strategies contained
herein may not be suitable for your situation. You should consult with a professional where
appropriate. Further, readers should be aware that websites listed in this work may have
changed or disappeared between when this work was written and when it is read. Neither the
publisher nor authors shall be liable for any loss of profit or any other commercial damages,
including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside the
United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our web site at www.wiley.com.
To my friend Peter, whose guidance on my data science journey has been invaluable,
To my parents, Helga and Erhard Gedeck, and my sister, Heike, who have always
Contents
5 Probability
5.1 What Is Probability
5.2 Simple Probability
5.2.1 Venn Diagrams
5.3 Probability Distributions
5.3.1 Binomial Distribution
5.3.1.1 Example
5.4 From Binomial to Normal Distribution
5.4.1 Standardization (Normalization)
5.4.2 Standard Normal Distribution
5.4.2.1 z-Tables
5.4.3 The 95 Percent Rule
5.5 Appendix: Binomial Formula and Normal Approximation
5.5.1 Normal Approximation
5.6 Python: Probability
9 Correlation
9.1 Example: Delta Wire
9.2 Example: Cotton Dust and Lung Disease
9.3 The Vector Product Sum Test
9.3.1 Example: Baseball Payroll
9.3.1.1 Resampling Procedure
9.4 Correlation Coefficient
9.4.1 Inference for the Correlation Coefficient: Resampling
10 Regression
10.1 Finding the Regression Line by Eye
10.1.1 Making Predictions Based on the Regression Line
10.2 Finding the Regression Line by Minimizing Residuals
10.2.1 The “Loss Function”
10.3 Linear Relationships
10.3.1 Example: Workplace Exposure and PEFR
10.3.2 Residual Plots
10.3.2.1 How to Read the Payroll Residual Plot
10.4 Prediction vs. Explanation
10.4.1 Research Studies: Regression for Explanation
10.4.2 Assessing the Performance of Regression for Explanation
10.4.3 Big Data: Regression for Prediction
10.4.4 Assessing the Performance of Regression for Prediction
10.5 Python: Linear Regression
10.5.1 Linear Regression Using Statsmodels
10.5.2 Using the Non-formula Interface to statsmodels
10.5.3 Linear Regression Using scikit-learn
10.5.4 Splitting Datasets and Evaluating Model Performance
Exercises
Index
Peter C. Bruce is the Founder of the Institute for Statistics Education at Statis-
tics.com. Founded in 2002, Statistics.com was the first educational institution
devoted solely to online education in statistics and data science.
Janet Dobbins is a leading voice in the Washington, DC, data science commu-
nity. She is Chair of the Board of Directors for Data Community DC (DC2) and
co-organizes the popular Data Science DC meetups. She previously worked at the
Institute for Statistics Education at Statistics.com.
Acknowledgments
Julian Simon, an early resampling pioneer, first kindled Peter C. Bruce’s interest
in statistics with his permutation and bootstrap approach to statistics, his Resam-
pling Stats software (first released in the late 1970s), and his statistics text on the
same subject. Robert Hayden, who co-authored early versions of parts of the text
you are reading, was instrumental in getting this project launched.
Michelle Everson has taught many sessions of introductory statistics using ver-
sions of this book and has been vigilant in pointing out ambiguities and omissions.
Her active participation in the statistics education community has been an asset
as we have strived to improve and perfect this text. Meena Badade also teaches
using this text and has also been very helpful in bringing to our attention errors
and points requiring clarification. Diane Murphy reviewed the latest version of the
book with care and contributed many useful corrections and suggestions.
Many students at the Institute for Statistics Education at Statistics.com have
helped clarify confusing points and refine this book over the years.
We also thank our editor at Wiley, Brett Kurzman, who shepherded this book
through the acceptance and launch process quickly and smoothly. Nandhini
Karuppiah, the Managing Editor, helped guide us through the production process,
and Govind Nagaraj managed the copyediting.
www.wiley.com/go/Wiley_Statistics_for_Data
We are happy that you have chosen our book for your course. For instructors that
adopt the book, we provide these supplemental materials:
● Short answers and exercises in the text.
● Datasets and Python examples
● Videos mentioned in the text
● Link to GitHub repository and Jupyter notebook
Introduction
Python
With some exceptions, this book presents relevant Python code in the second part
of each chapter. The book is not an in-depth step-by-step introduction to computer
programming as a discipline, but rather it provides the tools you need to imple-
ment the statistical procedures that are discussed in this book. Because many of
these procedures are based on iterative resampling, rather than simply calculating
formulas, you will get useful practice with the data handling and manipulation
that is a Python strength. No specific level of Python ability is required to get
started. If you are completely new to Python, you could consider launching your-
self with a quick self-study guide (easily found on the web), but, in general, you
should be able to follow along.
Statistical methods first came into use before homes had electricity, and had
several phases of rapid growth:
● The first big boost came from manufacturers and farmers who were able to
decrease costs, produce better products, and improve crop yields via statistical
experiments.
● Similar experiments helped drug companies graduate from snake oil purveyors
to makers of scientifically proven remedies.
● In the late 20th century, computing power enabled a new class of computationally
intensive methods, like the resampling methods that we will study.
● In the early decades of the current millennium, organizations discovered that
the rapidly growing repositories of data they were collecting (“big data”) could
be mined for useful insights.
As with any powerful tool, the more you know about it, the better you can apply
it and the less likely you are to go astray. The lurking dangers are illustrated when
you type the phrase “How to lie with...” into a web search engine. The likely
autocompletion is “statistics.”
Much of the book that follows deals with important issues that can determine
whether data yields meaningful information or not:
● How to assess the role that random chance can play in creating apparently inter-
esting results or patterns in data
● How to design experiments and surveys to get useful and reliable information
● How to formulate simple statistical models to describe relationships between
one variable and another
We will start our study in the next chapter with a look at how to design exper-
iments, but, before we dive in, let’s look at some statistical wins and losses from
different arenas.
Statistics for Data Science and Analytics, First Edition. Peter C. Bruce, Peter Gedeck, and Janet Dobbins.
© 2025 John Wiley & Sons, Inc. Published 2025 by John Wiley & Sons, Inc.
Companion website: www.wiley.com/go/Wiley_Statistics_for_Data
2 1 Statistics and Data Science
took vitamin E, if there were a vitamin E effect, it would have shown up. Several
meta-analyses, which are consolidated reviews of the results of multiple published
studies, have reached the same conclusion. One found that vitamin E at the above
dosage might even increase mortality.
What happened to make the researchers in 1993 think they had found a link
between vitamin E and disease inhibition? In reviewing a vast quantity of data,
researchers thought they saw an interesting association. In retrospect, with the
benefit of a well-designed experiment, it appears that this association was merely
a chance coincidence. Unfortunately, coincidences happen all the time in life. In
fact, they happen to a greater extent than we think possible.
person is a terrorist threat, but the theory is that outliers may have a higher threat
probability. Turning the work over to a statistical algorithm mitigates some of the
controversy around profiling, since security officers would lack the authority to
make discretionary decisions.
Defining “different” requires a statistical measure of distance, which we will
learn more about later.
All of these tasks involve examining data, but the number of records is likely to
be in the hundreds or thousands at most, and the challenge of obtaining the data
and preparing it for analysis was not overwhelming. So the task of obtaining the
data could safely be left to others.
Try It Yourself
1) Write down a series of 50 random coin flips without actually flipping the
coins. That is, write down a series of 50 made-up H’s and T’s selected in
such a way that they appear random.
2) Now actually flip a coin 50 times.
1.6 Big Data and Statisticians 7
If you look at the series of made-up tosses and compare them to the real tosses,
the longest streaks of either H or T generally occur in the ACTUAL tosses. When
a person is asked to make up random tosses, they will rarely “allow” more than
four H’s or T’s in a row. By the time they have written down four H’s in a row, they
think it is time to switch over to T, or else the series would not appear random. By
contrast, instructors who teach this exercise in class often see a streak of 8 T’s or
H’s in a row. Most people think that this is not random, and yet it clearly is.
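If you would rather let the computer do the flipping, the exercise can be sketched in a few lines of Python. This is only an illustration: the function names, the number of repetitions (10,000), and the random seed are our own choices, and a fair coin with independent flips is assumed.

```python
import random


def longest_streak(flips):
    """Return the length of the longest run of identical consecutive outcomes."""
    longest = 0
    current = 0
    previous = None
    for flip in flips:
        # Extend the current run if the outcome repeats, otherwise start a new run
        current = current + 1 if flip == previous else 1
        longest = max(longest, current)
        previous = flip
    return longest


def simulate_streaks(n_flips=50, n_trials=10_000, seed=1):
    """Simulate many series of fair coin flips and record each longest streak."""
    rng = random.Random(seed)
    return [longest_streak(rng.choices("HT", k=n_flips)) for _ in range(n_trials)]


streaks = simulate_streaks()
average_streak = sum(streaks) / len(streaks)
share_5_plus = sum(s >= 5 for s in streaks) / len(streaks)
print(f"Average longest streak in 50 flips: {average_streak:.2f}")
print(f"Share of series with a streak of 5 or more: {share_5_plus:.2f}")
```

Comparing the printed share with your made-up series gives a quick check on how often real randomness produces the streaks that people tend to avoid writing down.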
In 1913, a roulette wheel at the Monte Carlo casino landed on black 26 times in a
row. As the streak developed, gamblers, convinced that the wheel would most cer-
tainly have to end the streak, increasingly bet heavily on red—they lost millions.
The message here is that random variation reliably produces patterns that
appear non-random.
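To put a number on the roulette streak, here is a minimal calculation. It assumes a single-zero wheel with 18 black pockets out of 37 and independent spins; both are assumptions made for this sketch rather than details given above.

```python
from fractions import Fraction

# Assumption: single-zero wheel, 18 black pockets out of 37, independent spins
p_black = Fraction(18, 37)

# Probability that 26 specified spins in a row all land on black
p_streak = p_black ** 26
print(f"P(26 blacks in a row) = {float(p_streak):.3g}")
print(f"Roughly 1 in {round(1 / p_streak):,}")

# The wheel has no memory: after any streak, the chance of red on the
# next spin is the same as on every other spin.
p_red_next = Fraction(18, 37)
print(f"P(red on the next spin) = {float(p_red_next):.3f}")
```

Under these assumptions the streak is extremely unlikely for any particular run of 26 spins, yet the chance of red on the next spin never changes, which is why betting against the streak did not help the gamblers.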
Why is this significant? Just as with coin tosses, there is a significant component
of random variation (engineers call it noise) in the data that routinely flow through
life—whether business life, government affairs, the education world, or personal
life. So much data…so much random variation…how do we know what is real and
what is random?
We can’t know for certain, though we do know that random behavior can appear
to be real. One purpose of this book is to teach you about probability, help you
evaluate the potential random component in data, and provide ways of modeling it.
This gives us a benchmark against which to measure patterns and form educated
guesses about whether observed events or patterns of interest might really be due
to chance. When we understand randomness better, we can curb the tendency to
chase after random patterns and produce more reliable analyses of data.
In this chapter, we study random behavior and how it can fool us, and we learn
how to design studies to gain useful and reliable information. After completing
this chapter, you should be able to
1) Use coin flips to replicate random processes, and interpret the results of
coin-flipping experiments
2) Use an informal working definition of probability
3) Define, intuitively, p-value
4) Describe the different data formats you will encounter, including relational
database and flat file formats
5) Describe the difference between data encountered in traditional statistical
research, and “big data”
6) Explain the use of treatment and control groups in experiments
7) Define statistical bias
8) Explain the role of randomization and blinding in assigning subjects in a study
9) Explain the difference between observational studies and experiments
10) Design a statistical study following basic principles
10 2 Designing and Carrying Out a Statistical Study
Nearly all large organizations now have huge stores of customer and other
data that they mine for insight, in hopes of boosting revenue or reducing costs.
In the academic world, over five million research articles are published per year
in scholarly and scientific journals. These activities afford ample opportunity to
dive into the data, and discover things that aren’t true, particularly when the
diving is done automatically and at a large scale. Statistical methods play a large
role in this extraction of meaning from data. However, the science of statistics
also provides tools to study data more carefully, and distinguish what’s true from
what ain’t so.
The terms big data, machine learning, data science, and artificial intelligence
(AI) often go together, and bring different things to mind for different people.
The term artificial intelligence, particularly with the advent of ChatGPT and
generative AI, suggests almost magical methods that approach human-like cognition
capabilities. Privacy-minded individuals may think of large corporations or spy
agencies combing through petabytes of personal data in hopes of locating tidbits
of information that are interesting or useful. Analysts may focus on statistical and
machine learning models that can predict an unknown value of interest (loan
default, acceptance of a sales offer, filing a fraudulent insurance claim or tax
return, for example).
2.4 Example: Hospital Errors 11
Statistical science, by contrast, has well over a century of history, and its meth-
ods were originally tailored to data that were small and well-structured. However,
it is an important contributor to the field of data science which, when it is well
practiced, is not just aimless trolling for patterns, but starts out with questions of
interest:
● What additional product should we recommend to a customer?
● Which price will generate more revenue?
● Does the MRI show a malignancy?
● Is a customer likely to terminate a subscription?
All these questions require some understanding of random behavior and all benefit
from an understanding of the principles of well-designed statistical studies, so this
is where we will start.
2.5 Experiment
To tie together our study of statistics we will look at an experiment designed to test
whether no-fault reporting of all hospital errors reduces major errors in hospitals
(errors resulting in further hospitalization, serious complications, or even death).
An experiment like this was conducted by a hospital in Quebec, Canada, but it was
too small to provide definitive conclusions. For illustrative purposes, we will look
at hypothetical data that a larger study might have produced.
Experiments are used in industry, medicine, social science, and data science.
The ubiquitous A/B test (more on that later) is an experiment. The key feature of
an experiment is that the investigator manipulates some variable that is believed to
affect an outcome of interest, in order to demonstrate the importance and effect (or
lack thereof) of the variable. This stands in contrast to a survey or other analysis
of existing data, where the analyst simply collects and analyzes data. For example,
in a web experiment, the marketing investigator might try out a new product price
to see how it affects sales.
Experiments can be uncontrolled or controlled. In an uncontrolled experiment,
the investigator collects data on the group or time period for which the variable of
interest has been changed. In a web experiment, for example, the price of a product
might be increased by 25%, and then sales compared to prior sales.
2.6 Designing an Experiment 13
treatment you wish to study (here, the no-fault plan) is called the treatment group.
The group that gets no treatment or the standard treatment is called the control
group.
An experiment like this, testing a control group vs. a treatment group, is also
called an A/B test, particularly in the field of marketing, where one web treatment
might be tested against another. Sometimes, particularly in marketing, there might
not be an established control scenario and we are simply comparing one proposed
new treatment against another proposed new treatment (e.g. two different web
pages).
How do you decide which hospitals go into which group?
You would like the two groups to be similar to one another, except for the treat-
ment/control difference. That way, if the treatment group does turn out to have
fewer errors, you can be confident that it was due to the treatment. One way to do
this would be to study all the hospitals in detail, examine all their relevant charac-
teristics, and assign them to treatment/control in such a way that the two groups
end up being similar across all these attributes. There are two problems with this
approach.
1) It is usually not possible to think of all the relevant characteristics that might
affect the outcome. Research is replete with the discovery of factors that were
unknown prior to the study or thought to be unimportant.
2) The researcher, who has a stake in the outcome of the experiment, may con-
sciously or unconsciously assign hospitals in a way that enhances the chances
of the success of their pet theory.
Oddly enough, the best strategy is to assign hospitals randomly: for example,
by tossing a coin.
2.6.2 Randomizing
True random assignment eliminates both conscious and unconscious bias in
the assignment to groups. It does not guarantee that the groups will be equal in
all respects. However, it does guarantee that any departure from equality will
be due simply to the chance allocation, and the larger the samples, the
smaller those chance differences will tend to be. With extremely large samples,
differences due to chance virtually disappear, and you are left with differences
that are real, provided the assignment to groups is really random.
Random assignment lets us make the claim that any difference in group out-
comes that is more than might reasonably happen by chance is, in fact, due to
the different treatment of the groups. The study of probability in this book lets us
quantify the role that chance can play and take it into account.
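The way chance differences between randomly assigned groups shrink as samples grow can be checked with a small simulation. The sketch below is illustrative only: the baseline characteristic (drawn from a normal distribution with mean 100 and standard deviation 15), the sample sizes, and the function names are all assumptions, not data from the hospital study.

```python
import random
import statistics


def mean_gap_after_random_split(n_units, rng):
    """Randomly split n_units into two halves; return the gap in group means."""
    # Hypothetical baseline characteristic for each unit (an assumption)
    values = [rng.gauss(100, 15) for _ in range(n_units)]
    rng.shuffle(values)  # random assignment: first half vs. second half
    half = n_units // 2
    return abs(statistics.fmean(values[:half]) - statistics.fmean(values[half:]))


def typical_gap(n_units, n_repeats=500, seed=0):
    """Average the chance gap between groups over many random assignments."""
    rng = random.Random(seed)
    gaps_for_n = [mean_gap_after_random_split(n_units, rng) for _ in range(n_repeats)]
    return statistics.fmean(gaps_for_n)


gaps = {n: typical_gap(n) for n in (20, 200, 2000)}
for n, gap in gaps.items():
    print(f"{n:>5} units: typical chance difference between groups = {gap:.2f}")
```

No treatment is applied here, so every difference between the two groups is due to the random allocation alone, and the typical difference gets smaller as the number of units increases.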
We can imagine an experiment in which both groups got the same treatment.
We would expect to see some differences from one hospital to another. An everyday
example of this might be tossing a coin. If you toss a coin 10 times you will get a
certain number of heads. Do it again and you will probably get a different number
of heads.
Though the results vary, there are laws of chance that allow you to calculate
things like how many heads you would expect on average or how much the results
would vary from one set of 10 (or 100 or 1000) tosses to the next. If we assign sub-
jects at random, we can use these same laws of chance—or a lot of coin tosses—to
analyze our results.
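A lot of coin tosses is easy to arrange on a computer. The following sketch repeats sets of 10, 100, and 1000 tosses and summarizes how the number of heads varies from one set to the next. The number of sets (2000), the seed, and the names are arbitrary choices for illustration, and a fair coin is assumed. For a fair coin, the laws of chance give an expected count of n/2 heads with a standard deviation of sqrt(n)/2, which the simulated values should approximate.

```python
import random
import statistics


def heads_counts(n_tosses, n_sets=2000, seed=42):
    """Count the heads in each of n_sets repeated sets of n_tosses fair tosses."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n_tosses)) for _ in range(n_sets)]


summary = {}
for n in (10, 100, 1000):
    counts = heads_counts(n)
    summary[n] = (statistics.fmean(counts), statistics.pstdev(counts))
    mean_heads, sd_heads = summary[n]
    print(f"{n:>4} tosses: average heads = {mean_heads:.2f}, "
          f"standard deviation = {sd_heads:.2f} "
          f"(theory: {n / 2:.1f} and {n ** 0.5 / 2:.2f})")
```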
If we have Doctor Jones assign subjects using her own best judgment, we will
have no knowledge of the (often subconscious) factors that influence assignment.
These factors may bias assignment so that we can no longer say that the only thing
(besides chance assignment) distinguishing the treatment and control groups is
the treatment. Random assignment is not always possible. For example, it is hard
to randomly assign elementary school students to two different teaching methods
while keeping everything else in the education setting the same.
Randomization can be difficult or impossible in some situations, but it is rela-
tively easy in the A/B testing that is popular in digital marketing. Web visitors can
be easily randomized to one web page or another; email recipients can easily be
assigned randomly to one version or another of an email.
2.6.3 Planning
You need some hospitals and you estimate you can find about 100 within a rea-
sonable distance. You will probably need to present a plan for your study to the
hospitals to get their approval. That seems like a nuisance, but they cannot let just
anyone do any study they please on the patients.1 In addition to writing a plan to
get approval, you know that one of the biggest problems in interpreting studies
is that many are poorly designed. You want to avoid that, so you think carefully
about your plan and ask others for advice. It would be good to talk to a statistician
with experience in medical work. Your plan is to ask the 100 or so available hos-
pitals if they are willing to join your study. They have a right to say no. You hope
quite a few will say yes. In particular, you hope to recruit 50 willing hospitals and
randomly assign them to treatment and control.
Try It Yourself
How exactly would you assign hospitals randomly? Think about options for
the scenario where it doesn’t matter if the groups are exactly equal-sized, and
for the scenario where you want two groups of equal size.
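One possible approach to both scenarios, sketched in Python: a coin flip per hospital when group sizes need not match, and a shuffle followed by a split down the middle when they must. The hospital names, function names, and seed are hypothetical and used only for illustration.

```python
import random


def assign_by_coin_flip(units, seed=None):
    """Assign each unit independently; group sizes may differ by chance."""
    rng = random.Random(seed)
    groups = {"treatment": [], "control": []}
    for unit in units:
        groups[rng.choice(["treatment", "control"])].append(unit)
    return groups


def assign_equal_groups(units, seed=None):
    """Shuffle the units and split them in half, giving equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical list of the 50 hospitals that agreed to participate
hospitals = [f"Hospital {i:02d}" for i in range(1, 51)]

coin = assign_by_coin_flip(hospitals, seed=7)
equal = assign_equal_groups(hospitals, seed=7)
print("Coin flip:", len(coin["treatment"]), "treatment,", len(coin["control"]), "control")
print("Shuffle and split:", len(equal["treatment"]), "treatment,", len(equal["control"]), "control")
```

Fixing the seed makes the assignment reproducible, which can be useful when a study plan has to document exactly how the groups were formed.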
1 This effort is modest, in comparison to that required to gain approval for new drug therapies.
Clinical trials to establish drug efficacy and safety can take years and cost billions of dollars.
2.6.4 Bias
Randomization is used to try to make the two groups similar at the beginning.
It is important to keep them as similar as possible during the experiment. We want
to be sure the treatment is the only difference between them. Any difference in
outcome due to non-random extraneous factors is a form of bias. Statistical bias is a
technical concept but can include the lay definition that refers to people’s opinions
or states of mind.
Definitions: Bias
Statistical bias is the tendency for an estimate, model, or procedure to yield
results that are consistently off-target for a specific purpose (as in Fig. 2.1).
For example, the mean (average) income for a region might not be a good estimate
for the income of a typical resident, if part of the region is home to a small
number of very high-income residents. Their incomes would likely raise the average
above that of most typical residents (i.e. ones selected at random). Another
example is the gun sight on a long-range rifle. Gravity pulls a bullet downward,
and the more distant the target, the farther the bullet drops. The coordinates of
the target in the sights will be biased upward compared to where the bullet lands.
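The income example can be illustrated with a small sketch. The incomes below are invented for illustration, and comparing the mean with the median is one assumed way of representing a "typical" resident, not necessarily the comparison the text has in mind.

```python
from statistics import mean, median

# Hypothetical annual incomes (in dollars) for a small region:
# eight typical residents and two very high earners.
incomes = [40_000, 42_000, 45_000, 47_000, 50_000, 52_000, 55_000, 58_000,
           900_000, 1_500_000]

avg = mean(incomes)
mid = median(incomes)

# Count how many residents earn less than the regional mean.
below_avg = sum(x < avg for x in incomes)

print(f"mean:   {avg:,.0f}")
print(f"median: {mid:,.0f}")
print(f"{below_avg} of {len(incomes)} residents earn less than the mean")
```

With these made-up figures, the two high earners pull the mean well above what most residents earn, so the mean is a biased estimate for that particular purpose.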
Bias can often creep into a study when humans are involved, either as subjects or
experimenters. For one thing, subject behavior can be changed by the fact that they
are participating in a study. Experience has also shown that people respond
positively to attention, and just being part of a study may cause subjects to change. A
positive response to the attention of being in a study is called the Hawthorne effect.
Awareness of an issue can significantly affect perceptions, which is why potential
jurors in a trial are asked if they have seen news coverage of a case at issue.
Out-of-Control Toyotas?
In the fall of 2009, the National Highway Traffic Safety Administration
(NHTSA) was receiving several dozen complaints per month about Toyota
cars speeding out of control. The rate of complaint was not that different
from the rates of complaint for other car companies. Then, in November of
2009, Toyota recalled 3.8 million vehicles to check for sticking gas pedals.
By February, the complaint rate had risen from several dozen per month to
over 1500 per month of alleged cases of unintended acceleration. Attention
turned to the electronic throttle.
Clearly what changed was not the actual condition of cars: the stock of Toyotas
on the road in February of 2010 was not that different from November of
2009. What changed was car owners’ awareness and perception as a result of
the headlines surrounding the recall. Acceleration problems, whether real or
illusory, that escaped notice prior to November 2009 became causes for worry
and a trip to the dealer. Later, the NHTSA examined a number of engine data
recorders from accidents where the driver claimed to experience acceleration
despite applying the brakes. In all cases, the data recorder showed that the
brakes were not applied.
In February 2011, the US Department of Transportation announced that a
10-month investigation of the electronic throttle showed no problems. Public
awareness of the problem boosted the rate of complaint far out of proportion
to its true scope.
Lesson: Your perception of whether you personally experience a problem or
benefit is substantially affected by your prior awareness of others’ problems
and benefits.
Sources: Wall Street Journal, July 14, 2010; The Analysis Group
(http://www.analysisgroup.com, accessed July 14, 2010); USA Today online, April 2, 2011.
In some situations, we can avoid telling people they are participating in a study.
For example, a marketing study might try different ads or products in various
regions without publicizing that they are doing so for research purposes. In other
situations, we may not be able to avoid letting subjects know they are being
studied, but we may be able to conceal whether they are in the treatment or
control group.
2.6.4.1 Placebo
A placebo is a dummy treatment imposed on the control group to render their
experience similar to that of the treatment group. In this way, we can distinguish
the real effect of the treatment from any response that is simply the result of
perceiving that you are being treated. Experience has shown that subjects will
often experience and report good results even for dummy treatments. This positive
response to the perception that you are being treated is called the placebo effect.
In medicine, the placebo effect on the brain can be powerful for mitigating
symptoms, and explains the popularity of many remedies that have no rigorous
scientific basis for their effectiveness. In one study, a tablet labeled “placebo”
was found to be 50% as effective as actual migraine medicine in relieving migraine
symptoms. Thus, the net beneficial effect of many therapies is made up both of a
real scientific component, and a placebo component.
2.6.4.2 Blinding
The process of concealing who is in the treatment group and who is in the
control group is termed blinding. We say
a study is single-blind when the subjects—the hospitals in our medical errors
example—do not know whether they are getting the treatment. It is double-blind
if the staff in contact with the subjects also do not know which group is getting
the real treatment. It is triple-blind if the people who evaluate the results do not
know, either.
Blinding, particularly full blinding, is not always feasible. The very nature of
the treatment (for example, no-fault error reporting for the hospital study) may
require participant awareness. The control side may need to be more than simply
“do nothing.” Drug trials typically include a sugar pill for the control arm. For
the hospital study, our control might be a standard set of best practices (excluding
no-fault reporting) shared among the control hospitals.
In marketing experiments, participant blinding usually happens automatically—
web viewers or message recipients are typically unaware there might be another
group seeing a different message. Bias from the analyst side, though, is common.
It often occurs when analysts have a favored outcome (perhaps subconsciously)
and can choose when to end an experiment, or elect to look at a subset of
results that seem more reasonable to them. Blinding of the analyst can mitigate
this bias.
In addition to the various forms of blinding, we try to keep all other aspects of
each subject’s environment the same. In the hospital experiment, we need to agree
on common treatment and control error-handling methods that will be applied
to all hospitals in each group. By keeping the two groups the same in every way
except the treatment, we can be confident that any differences in the results are
due to the treatment.
2.7 The Data
Row  Hospital#  Treat?  Errors Before  Errors After
1    239        0       27             24
2    1126       0       17             16
3    1161       0       31             29
4    1293       1       38             32
5    1462       1       25             23
(45 more rows)
Row  Hospital#  Treat?  Reduction in Errors    Row  Hospital#  Treat?  Reduction in Errors
1    239        0       3                      26   2795       1       3
2    1126       0       1                      27   2889       0       5
3    1161       0       2                      28   2892       1       9
4    1293       1       2                      29   2991       1       2
5    1462       1       2                      30   3166       1       2
6    1486       0       2                      31   3190       0       1
7    1698       1       5                      32   3254       0       4
8    1710       0       1                      33   3312       1       2
9    1807       0       1                      34   3373       1       2
10   1936       1       2                      35   3403       1       3
11   1965       1       2                      36   3403       0       1
12   2021       1       2                      37   3429       1       2
13   2026       0       1                      38   3441       1       6
14   202        0       3                      39   3520       0       1
15   208        1       4                      40   3568       1       2
16   2269       1       2                      41   3580       0       2
17   2381       1       2                      42   3599       0       2
18   2388       0       1                      43   3660       1       2
19   2400       1       2                      44   3985       0       2
20   2475       0       4                      45   4014       1       2
21   2548       0       1                      46   4060       0       1
22   2551       0       2                      47   4076       1       2
23   2661       0       1                      48   4093       0       1
24   2677       1       4                      49   4230       0       2
25   2739       1       2                      50   5633       0       2
1) Each row contains all the information for one and only one case.
2) All data for a given variable is in a single column.
2.8 Variables and Their Flavors
To load data from a .csv file2 into a dataframe named “data,” you would use
pandas with the syntax data = pd.read_csv("hospitalerrors.csv").
We will cover this in more detail in Section 2.19.3.
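As a minimal sketch of what this looks like in practice, the example below reads CSV text from an in-memory string instead of a file, so it can be run without hospitalerrors.csv being present. The column names and the five rows (taken from the start of the reduction-in-errors table) are assumptions about how such a file might be laid out.

```python
import io

import pandas as pd

# Stand-in for the contents of hospitalerrors.csv. The column names are
# assumed for illustration; the actual file may use different ones.
csv_text = """Row,Hospital,Treat,Reduction
1,239,0,3
2,1126,0,1
3,1161,0,2
4,1293,1,2
5,1462,1,2
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (number of rows, number of columns)
print(data.head())  # first few rows of the dataframe
```

With a real file on disk, the io.StringIO(...) argument would simply be replaced by the file name, as in the syntax quoted above.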
Let’s look at the hospital data.
Column 1 is simply the row number.
Column 2, hospital, contains case labels. These are arbitrary labels for the
experimental units, a unique number for each unit. Case labels keep track of the
data. For example, if we find a mistake in the data, we would need to know
which hospital that came from so we could investigate the cause and correct
the mistake. Numerical codes are preferred to more informative labels when
we wish to conceal the group to which subjects were assigned.
Column 3 labels observations from the treatment group with a one and those from
the control group with a zero.
Column 4 is the number of major medical errors in the year before the study
minus the number from the following year. A positive number represents a
reduction in medical errors. Note that all the numbers are positive—things
got better whether subjects got the treatment or not! This could be due to
the Hawthorne effect, or to any extra care the subjects got from being in the
experiment, or to other factors that may have changed globally between the
two years.
2 “Comma separated values,” or .csv, is a simple file format that can be read by a wide variety of
programming languages and software.
With binary data in which one class is much more scarce than the other (e.g.
fraud/no-fraud), the scarce class is often designated “1.” Surveys and observational
studies often produce categorical data. People might be asked which candidate
they plan to vote for; what toothpaste they use; or whether they live in a house,
apartment, or mobile home. Observational studies often use similar variables that
can be evaluated at a glance or found in existing records. The basic statistical
summary method for displaying such categorical data is a frequency table (see
Table 3.5 for an example).
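A frequency table simply counts how often each category occurs. The sketch below builds one with Python's standard library; the survey responses are invented for illustration and are not data from the book.

```python
from collections import Counter

# Hypothetical survey responses to "What type of home do you live in?"
responses = ["house", "apartment", "house", "mobile home", "apartment",
             "house", "house", "apartment", "house", "mobile home"]

counts = Counter(responses)
total = len(responses)

# Print each category with its count and its share of all responses,
# most frequent category first.
print(f"{'Category':<12} {'Count':>5} {'Share':>6}")
for category, count in counts.most_common():
    print(f"{category:<12} {count:>5} {count / total:>6.0%}")
```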
With 0/1 data, for example, “fraudulent” or “OK,” the rarer and more interesting category
(e.g. a fraudulent credit card charge) is designated “1.” Source: energepic.com/Pexels.
Treatment Control
2 3
2 1
5 2
2 2
2 1
2 1
4 1
2 3
2 1
2 4
4 1
2 2
3 1
9 5
2 1
2 4
2 1
2 1
3 2
2 2
6 2
2 1
2 1
2 2
2 2
Mean: 2.80 1.88
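The two means in the table can be checked with a few lines of Python. The lists below are copied from the Treatment and Control columns above; the variable names are chosen for illustration.

```python
from statistics import mean

# Reduction in errors for each hospital, copied from the table above.
treatment = [2, 2, 5, 2, 2, 2, 4, 2, 2, 2, 4, 2, 3,
             9, 2, 2, 2, 2, 3, 2, 6, 2, 2, 2, 2]
control = [3, 1, 2, 2, 1, 1, 1, 3, 1, 4, 1, 2, 1,
           5, 1, 4, 1, 1, 2, 2, 2, 1, 1, 2, 2]

treatment_mean = mean(treatment)
control_mean = mean(control)

print(f"Treatment mean: {treatment_mean:.2f}")
print(f"Control mean:   {control_mean:.2f}")
print(f"Difference:     {treatment_mean - control_mean:.2f}")
```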
On average, the control group had 1.88 fewer errors in the second year, compared
with 2.80 fewer for the treatment group. Both groups reduced errors, but the
treatment does appear to have reduced errors to a greater extent than the control.
2.9 Python: Data Structures and Operations
Data structures and their manipulation are key features of every programming
language. This section introduces standard data types and more complex data
structures as well as some basic operations to manipulate them.
2.9.2 Comments
Every programming language has a way to add comments to code. Comments are
used to add information about the code, but will not be executed. Comments are
started with a hash (#) and continue until the end of the line.
# This is a comment
print("Hello, World!") # This is also a comment
print("Inside a string # this is not a comment")