
Downloaded from https://onlinelibrary.wiley.com/doi/ by ibrahim ragab - Universita Di Firenze Sistema , Wiley Online Library on [13/08/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
Statistics for Data Science and Analytics

Peter C. Bruce
Founder, Institute for Statistics Education at Statistics.com

Peter Gedeck
Senior Data Scientist, Collaborative Drug Discovery

Janet Dobbins
Chair, Data Community DC

Copyright © 2025 by John Wiley & Sons, Inc. All rights reserved, including rights for text and
data mining and training of artificial intelligence technologies or similar technologies.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.


Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers,
MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to
the Publisher for permission should be addressed to the Permissions Department, John Wiley &
Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.

Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley &
Sons, Inc. and/or its affiliates in the United States and other countries and may not be used
without written permission. All other trademarks are the property of their respective owners.
John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created or
extended by sales representatives or written sales materials. The advice and strategies contained
herein may not be suitable for your situation. You should consult with a professional where
appropriate. Further, readers should be aware that websites listed in this work may have
changed or disappeared between when this work was written and when it is read. Neither the
publisher nor authors shall be liable for any loss of profit or any other commercial damages,
including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside the
United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data Applied for:

Hardback ISBN: 9781394253807

Cover Design: Wiley


Cover Image: © gremlin/Getty Images

Set in 9.5/12.5pt STIXTwoText by Straive, Chennai, India

To my friend Peter, whose guidance on my data science journey has been invaluable,
and to my husband Peter, whose encouragement adds joy to every step—JD

To my wife, Liz, whose editorial help and judgment is impeccable—PB

To my parents, Helga and Erhard Gedeck, and my sister, Heike, who have always
been supportive of my education—PG


Contents

About the Authors xvii


Acknowledgments xix
About the Companion Website xxi
Introduction xxiii

1 Statistics and Data Science 1


1.1 Big Data: Predicting Pregnancy 2
1.2 Phantom Protection from Vitamin E 2
1.3 Statistician, Heal Thyself 3
1.4 Identifying Terrorists in Airports 4
1.5 Looking Ahead 5
1.6 Big Data and Statisticians 5
1.6.1 Data Scientists 6

2 Designing and Carrying Out a Statistical Study 9


2.1 Statistical Science 9
2.2 Big Data 10
2.3 Data Science 10
2.4 Example: Hospital Errors 11
2.5 Experiment 12
2.6 Designing an Experiment 13
2.6.1 A/B Tests; A Controlled Experiment for the Hospital Plans 13
2.6.2 Randomizing 14
2.6.3 Planning 15
2.6.4 Bias 16
2.6.4.1 Placebo 18
2.6.4.2 Blinding 18
2.6.4.3 Before-after Pairing 19
2.7 The Data 19

2.7.1 Dataframe Format 19


2.8 Variables and Their Flavors 21
2.8.1 Numeric Variables 21
2.8.2 Categorical Variables 22
2.8.3 Binary Variables 22
2.8.4 Text Data 23
2.8.5 Random Variables 23
2.8.6 Simplified Columnar Format 24
2.9 Python: Data Structures and Operations 25
2.9.1 Primary Data Types 25
2.9.2 Comments 25
2.9.3 Variables 26
2.9.4 Operations on Data 26
2.9.4.1 Converting Data Types 27
2.9.5 Advanced Data Structures 28
2.9.5.1 Classes and Objects 32
2.9.5.2 Data Types and Their Declaration 33
2.10 Are We Sure We Made a Difference? 34
2.11 Is Chance Responsible? The Foundation of Hypothesis Testing 34
2.11.1 Looking at Just One Hospital 34
2.12 Probability 36
2.12.1 Interpreting Our Result 37
2.13 Significance or Alpha Level 38
2.13.1 Increasing the Sample Size 38
2.13.2 Simulating Probabilities with Random Numbers 40
2.14 Other Kinds of Studies 40
2.15 When to Use Hypothesis Tests 42
2.16 Experiments Falling Short of the Gold Standard 42
2.17 Summary 43
2.18 Python: Iterations and Conditional Execution 44
2.18.1 if Statements 44
2.18.2 for Statements 45
2.18.3 while Statements 45
2.18.4 break and continue Statements 46
2.18.5 Example: Calculate Mean, Standard Deviation, Subsetting 47
2.18.6 List Comprehensions 48
2.19 Python: Numpy, scipy, and pandas—The Workhorses of Data Science 50
2.19.1 Numpy 50
2.19.2 Scipy 53
2.19.3 Pandas 53

2.19.3.1 Reading and Writing Data 53


2.19.3.2 Accessing Data 54
2.19.3.3 Manipulating Data 55
2.19.3.4 Iterating Over a DataFrame 56
2.19.3.5 And a Lot More 56
Exercises 56

3 Exploring and Displaying the Data 61


3.1 Exploratory Data Analysis 61
3.2 What to Measure—Central Location 62
3.2.1 Mean 62
3.2.2 Median 62
3.2.3 Mode 64
3.2.4 Expected Value 64
3.2.5 Proportions for Binary Data 65
3.2.5.1 Percents 65
3.3 What to Measure—Variability 65
3.3.1 Range 65
3.3.2 Percentiles 66
3.3.3 Interquartile Range 66
3.3.4 Deviations and Residuals 67
3.3.5 Mean Absolute Deviation 67
3.3.6 Variance and Standard Deviation 67
3.3.6.1 Denominator of N or N–1? 68
3.3.7 Population Variance 69
3.3.8 Degrees of Freedom 69
3.4 What to Measure—Distance (Nearness) 69
3.5 Test Statistic 71
3.5.1 Test Statistic for this Study 71
3.6 Examining and Displaying the Data 72
3.6.1 Frequency Tables 72
3.6.2 Histograms 73
3.6.3 Bar Chart 75
3.6.4 Box Plots 75
3.6.5 Tails and Skew 78
3.6.6 Errors and Outliers Are Not the Same Thing! 78
3.7 Python: Exploratory Data Analysis/Data Visualization 80
3.7.1 Matplotlib 80
3.7.2 Data Visualization Using Pandas and Seaborn 83
Exercises 88

4 Accounting for Chance—Statistical Inference 91


4.1 Avoid Being Fooled by Chance 91
4.2 The Null Hypothesis 92
4.3 Repeating the Experiment 93
4.3.1 Shuffling and Picking Numbers from a Hat or Box 93
4.3.2 How Many Reshuffles? 94
4.3.3 The t-Test 95
4.3.4 Conclusion 96
4.4 Statistical Significance 99
4.4.1 Bottom Line 99
4.4.1.1 Statistical Significance as a Screening Device 100
4.4.2 Torturing the Data 101
4.4.3 Practical Significance 102
4.5 Power 103
4.6 The Normal Distribution 103
4.6.1 The Exact Test 104
4.7 Summary 105
4.8 Python: Random Numbers 105
4.8.1 Generating Random Numbers Using the random Package 105
4.8.2 Random Numbers in numpy and scipy 107
4.8.3 Using Random Numbers in Other Packages 108
4.8.4 Example: Implement a Resampling Experiment 109
4.8.5 Write Functions for Code Reuse 112
4.8.6 Organizing Code into Modules 113
Exercises 115

5 Probability 121
5.1 What Is Probability 121
5.2 Simple Probability 122
5.2.1 Venn Diagrams 124
5.3 Probability Distributions 126
5.3.1 Binomial Distribution 126
5.3.1.1 Example 128
5.4 From Binomial to Normal Distribution 129
5.4.1 Standardization (Normalization) 130
5.4.2 Standard Normal Distribution 131
5.4.2.1 z-Tables 132
5.4.3 The 95 Percent Rule 133
5.5 Appendix: Binomial Formula and Normal Approximation 133
5.5.1 Normal Approximation 134
5.6 Python: Probability 134

5.6.1 Converting Counts to Probabilities 134


5.6.2 Probability Distributions in Python 135
5.6.3 Probability Distributions in random 136
5.6.4 Probability Distributions in the scipy Package 137
5.6.4.1 Continuous Distributions 137
5.6.4.2 Discrete Distributions 139
Exercises 141

6 Categorical Variables 143


6.1 Two-way Tables 143
6.2 Conditional Probability 144
6.2.1 From Numbers to Percentages to Conditional Probabilities 146
6.3 Bayesian Estimates 147
6.3.1 Let’s Review the Different Probabilities 148
6.3.2 Bayesian Calculations 149
6.4 Independence 150
6.4.1 Chi-square Test 151
6.4.1.1 Sensor Calibration 151
6.4.1.2 Standardizing Departure from Expected 153
6.5 Multiplication Rule 154
6.6 Simpson’s Paradox 156
6.7 Python: Counting and Contingency Tables 157
6.7.1 Counting in Python 157
6.7.2 Counting in Pandas 158
6.7.3 Two-way Tables Using Pandas 160
6.7.4 Chi-square Test 162
Exercises 163

7 Surveys and Sampling 167


7.1 Literary Digest—Sampling Trumps “All Data” 167
7.2 Simple Random Samples 170
7.3 Margin of Error: Sampling Distribution for a Proportion 172
7.3.1 The Confidence Interval 173
7.3.2 A More Manageable Box: Sampling with Replacement 174
7.3.3 Summing Up 174
7.4 Sampling Distribution for a Mean 174
7.4.1 Simulating the Behavior of Samples from a Hypothetical Population 176
7.5 The Bootstrap 176
7.5.1 Resampling Procedure (Bootstrap) 177
7.6 Rationale for the Bootstrap 177

7.6.1 Let’s Recap 180


7.6.2 Formula-based Counterparts to Resampling 181
7.6.2.1 FORMULA: The Z-interval 182
7.6.2.2 Proportions 182
7.6.3 For a Mean: T-interval 183
7.6.4 Example—Manual Calculations 183
7.6.5 Example—Software 184
7.6.6 A Bit of History—1906 at Guinness Brewery 185
7.6.7 The Bootstrap Today 186
7.6.8 Central Limit Theorem 187
7.7 Standard Error 188
7.7.1 Standard Error via Formula 188
7.8 Other Sampling Methods 188
7.8.1 Stratified Sampling 188
7.8.2 Cluster Sampling 189
7.8.3 Systematic Sampling 190
7.8.4 Multistage Sampling 190
7.8.5 Convenience Sampling 190
7.8.6 Self-selection 191
7.8.7 Nonresponse Bias 191
7.9 Absolute vs. Relative Sample Size 192
7.10 Python: Random Sampling Strategies 192
7.10.1 Implement Simple Random Sample (SRS) 192
7.10.2 Determining Confidence Intervals 194
7.10.3 Bootstrap Sampling to Determine Confidence Intervals for a Mean 196
7.10.4 Advanced Sampling Techniques 198
7.10.4.1 Stratified Sampling for Categorical Variables 198
7.10.4.2 Stratified Sampling of Continuous Variables 201
Exercises 202

8 More than Two Samples or Categories 207


8.1 Count Data—R × C Tables 207
8.2 The Role of Experiments (Many Are Costly) 208
8.2.1 Example: Marriage Therapy 208
8.3 Chi-Square Test 210
8.3.1 Alternate Option 210
8.3.2 Testing for the Role of Chance 210
8.3.3 Standardization to the Chi-Square Statistic 213
8.3.4 Chi-Square Example on the Computer 214
8.4 Single Sample—Goodness-of-Fit 215

8.4.1 Resampling Procedure 216


8.5 Numeric Data: ANOVA 217
8.6 Components of Variance 222
8.6.1 From ANOVA to Regression 224
8.7 Factorial Design 224
8.7.1 Stratification and Blocking 225
8.7.2 Blocking 226
8.8 The Problem of Multiple Inference 226
8.9 Continuous Testing 228
8.9.1 Medicine 228
8.9.2 Business 229
8.10 Bandit Algorithms 229
8.10.1 Web Testing 230
8.11 Appendix: ANOVA, the Factor Diagram, and the F-Statistic 230
8.11.1 Decomposition: The Factor Diagram 231
8.11.2 Constructing the ANOVA Table 232
8.11.3 Inference Using the ANOVA Table 233
8.11.4 The F-Distribution 234
8.11.5 Different Sized Groups 236
8.11.5.1 Resampling Method 236
8.11.5.2 Formula Method 236
8.11.6 Caveats and Assumptions 236
8.12 More than One Factor or Variable—From ANOVA to Statistical Models 237
8.13 Python: Contingency Tables and Chi-square Test 237
8.13.1 Example: Marriage Therapy 237
8.13.2 Example: Imanishi-Kari Data 240
8.14 Python: ANOVA 241
8.14.1 Visual Comparison of Groups 241
8.14.2 ANOVA Using Resampling Test 243
8.14.3 ANOVA Using the F-Statistic 244
Exercises 246

9 Correlation 249
9.1 Example: Delta Wire 249
9.2 Example: Cotton Dust and Lung Disease 251
9.3 The Vector Product Sum Test 252
9.3.1 Example: Baseball Payroll 254
9.3.1.1 Resampling Procedure 254
9.4 Correlation Coefficient 256
9.4.1 Inference for the Correlation Coefficient—Resampling 257

9.4.1.1 Hypothesis Test—Resampling 258


9.4.1.2 Example: Baseball Again 259
9.4.1.3 Inference for the Correlation Coefficient: Formulas 259
9.5 Correlation is not Causation 260
9.5.1 A Lurking External Cause 260
9.5.2 Coincidence 260
9.6 Other Forms of Association 261
9.7 Python: Correlation 262
9.7.1 Vector Operations 262
9.7.2 Resampling Test for Vector Product Sums 263
9.7.3 Calculating Correlation Coefficient 264
9.7.4 Calculate Correlation with numpy, pandas 266
9.7.5 Hypothesis Tests for Correlation 266
9.7.6 Using the t Statistic 267
9.7.7 Visualizing Correlation 268
Exercises 269

10 Regression 271
10.1 Finding the Regression Line by Eye 272
10.1.1 Making Predictions Based on the Regression Line 274
10.2 Finding the Regression Line by Minimizing Residuals 274
10.2.1 The “Loss Function” 275
10.3 Linear Relationships 276
10.3.1 Example: Workplace Exposure and PEFR 276
10.3.2 Residual Plots 277
10.3.2.1 How to Read the Payroll Residual Plot 278
10.4 Prediction vs. Explanation 280
10.4.1 Research Studies: Regression for Explanation 280
10.4.2 Assessing the Performance of Regression for Explanation 281
10.4.3 Big Data: Regression for Prediction 282
10.4.4 Assessing the Performance of Regression for Prediction 283
10.5 Python: Linear Regression 284
10.5.1 Linear Regression Using Statsmodels 284
10.5.2 Using the Non-formula Interface to statsmodels 287
10.5.3 Linear Regression Using scikit-learn 289
10.5.4 Splitting Datasets and Evaluating Model Performance 290
Exercises 293

11 Multiple Linear Regression 295


11.1 Terminology 295
11.2 Example—Housing Prices 296
11.2.1 Explaining Home Prices 297

11.2.2 House Prices in Boston 298


11.2.3 Explore the Data 298
11.2.3.1 Performing and Interpreting a Regression Analysis 299
11.2.4 Using the Regression Equation 300
11.3 Interaction 301
11.3.1 Original Regression with No Interaction Term 302
11.3.2 The Regression with an Interaction Term 303
11.3.3 Does Crime Pay? 303
11.4 Regression Assumptions 304
11.4.1 Violation of Assumptions—Is the Model Useless? 305
11.5 Assessing Explanatory Regression Models 306
11.5.1 Overall Model Strength R2 307
11.5.2 Assessing Individual Coefficients 307
11.5.3 Resampling Procedure to Test Statistical Significance 307
11.5.4 Resampling Procedure for a Confidence Interval (the Pulmonary Data) 308
11.5.4.1 Interpretation 309
11.5.5 Formula-based Inference 310
11.5.6 Interpreting Software Output 310
11.5.7 More Practice: Bootstrapping the Boston Housing Model 312
11.5.8 Inference for Regression—Hypothesis Tests 312
11.6 Assessing Regression for Prediction 314
11.6.1 Separate Training and Holdout Data 314
11.6.2 Root Mean Squared Error—RMSE 314
11.6.3 Tayko 315
11.6.4 Binary and Categorical Variables in Regression 316
11.6.5 Multicollinearity 317
11.6.6 Tayko—Building the Model 318
11.6.7 Reviewing the Output 319
11.6.8 Scoring the Model to the Validation Partition 320
11.6.9 The Naive Rule 321
11.7 Python: Multiple Linear Regression 324
11.7.1 Using Statsmodels 324
11.7.1.1 Adding Interaction Terms 325
11.7.2 Diagnostic Plots 327
11.7.3 Using Scikit-learn 328
11.7.3.1 Adding Interaction Terms 329
11.7.4 Resampling Procedures 329
11.7.4.1 Estimating the Significance of the Coefficients 329
11.7.4.2 Estimating Confidence Intervals—The Bootstrap 330
Exercises 332

12 Predicting Binary Outcomes 337


12.1 K-Nearest-Neighbors 337
12.1.1 Predicting Which Customers Might be Pregnant 339
12.1.2 Small Hypothetical Example 340
12.1.3 Setting k 342
12.1.4 K-Nearest-Neighbors and Numerical Outcomes 342
12.1.5 Explanatory Modeling 343
12.2 Python: Classification 343
12.2.1 Classification Using scikit-learn 343
12.2.2 Evaluating the Model 344
12.2.3 Streamlining Model Fitting Using Pipelines 345
Exercises 346

Index 349

About the Authors

Peter C. Bruce is the Founder of the Institute for Statistics Education at Statistics.com. Founded in 2002, Statistics.com was the first educational institution devoted solely to online education in statistics and data science.

Peter Gedeck is a senior data scientist at Collaborative Drug Discovery. He specializes in the development of cloud-based software for managing data in the drug discovery process. In addition, he teaches data science at the University of Virginia.

Janet Dobbins is a leading voice in the Washington, DC, data science community. She is Chair of the Board of Directors for Data Community DC (DC2) and co-organizes the popular Data Science DC meetups. She previously worked at the Institute for Statistics Education at Statistics.com.

Acknowledgments

Julian Simon, an early resampling pioneer, first kindled Peter C. Bruce's interest in statistics with his permutation and bootstrap approach to statistics, his Resampling Stats software (first released in the late 1970s), and his statistics text on the same subject. Robert Hayden, who co-authored early versions of parts of the text you are reading, was instrumental in getting this project launched.

Michelle Everson has taught many sessions of introductory statistics using versions of this book and has been vigilant in pointing out ambiguities and omissions. Her active participation in the statistics education community has been an asset as we have strived to improve and perfect this text. Meena Badade also teaches using this text and has been very helpful in bringing to our attention errors and points requiring clarification. Diane Murphy reviewed the latest version of the book with care and contributed many useful corrections and suggestions.

Many students at the Institute for Statistics Education at Statistics.com have helped clarify confusing points and refine this book over the years.

We also thank our editor at Wiley, Brett Kurzman, who shepherded this book through the acceptance and launch process quickly and smoothly. Nandhini Karuppiah, the Managing Editor, helped guide us through the production process, and Govind Nagaraj managed the copyediting.

About the Companion Website

This book is accompanied by a companion website:

www.wiley.com/go/Wiley_Statistics_for_Data

We are happy that you have chosen our book for your course. For instructors who
adopt the book, we provide these supplemental materials:
● Short answers to exercises in the text
● Datasets and Python examples
● Videos mentioned in the text
● Link to the GitHub repository and Jupyter notebooks

Introduction

Statistics and Data Science


As of the writing of this book, the fields of statistics and data science are evolving rapidly to meet the changing needs of business, government, and research organizations. It is an oversimplification, but still useful, to think of two distinct communities as you proceed:

1) The traditional academic and medical research communities that typically conduct extended research projects adhering to rigorous regulatory or publication standards, and
2) Businesses and large organizations that use statistical methods to extract value from their data, often on the fly. Reliability and value are more important than academic rigor to this data science community.

Most users of statistical methods now fall in the second category, as those methods are a basic component of what is now called artificial intelligence (AI). However, most of the specific techniques, as well as the language of statistics, had their origin in the first group. As a result, there is a certain amount of "baggage" that is not truly relevant to the data science community. That baggage can sometimes be obscure or confusing and, in this book, we provide guidance on what is or is not important to data science. Another feature of this book is the use of resampling/simulation methods to develop the underpinnings of statistical inference (the most difficult topic in an introductory course) in a transparent and understandable fashion.

We start off with some examples of statistics in action (including two of statistics gone wrong), then dive right in to look at the proper design of studies and account for the possible role of chance. All the standard topics of introductory statistics are here (probability, descriptive statistics, inference, sampling, correlation, etc.), but sometimes they are introduced not as separate standalone topics but rather in the context of the situation in which they are needed.
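To give a flavor of the resampling/simulation approach, here is a small, self-contained Python sketch (a hypothetical illustration of the general idea, not code from the book): it estimates, by simulation rather than by formula, the probability of seeing eight or more heads in ten tosses of a fair coin.

```python
import random

random.seed(1)  # fix the seed so the simulation is repeatable

n_trials = 10_000
count = 0
for _ in range(n_trials):
    # one trial: toss a fair coin 10 times and count the heads
    heads = sum(random.randint(0, 1) for _ in range(10))
    if heads >= 8:
        count += 1

# the fraction of trials with 8+ heads estimates the probability
estimate = count / n_trials
print(f"Estimated P(8 or more heads in 10 tosses): {estimate:.3f}")
```

The same pattern, repeating a chance process many times and counting how often an outcome of interest occurs, underlies the resampling tests developed in later chapters.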

Accompanying Web Resources


Python code, datasets, some solutions, and other material accompanying this book
can be found at https://introductorystatistics.com/.

Python

Python is a general programming language that can be used in many different areas. It is especially popular in the machine learning and data science communities. A wide range of libraries provide efficient solutions for almost every need, from simple one-off scripts to web servers and highly complex scientific applications. As we will see throughout this book, it also has great support for statistics.

You can use Python in many different ways. For most people new to the language, the easiest way to get started is to use Python in Jupyter notebooks (see https://jupyter.org). Jupyter notebooks are documents that contain both code and rich text elements, such as figures, links, equations, etc. Because of this, they are an ideal environment to learn Python and to present your work. You will find notebooks with the example code of this book on our website (https://introductorystatistics.com/).

A great way to get started with Python is to run code on one of the freely accessible cloud computing platforms. Google Colab (https://colab.research.google.com/) has a free tier that is sufficient for all the examples in this book.

An alternative to cloud computing platforms is to install Python locally on your computer. You can download and install different versions of Python from https://www.python.org. However, it is more convenient to use Anaconda (https://www.anaconda.com). Anaconda is a free package manager for the Python and R programming languages, focused on scientific computing. It distributes the most popular Python packages for science, mathematics, engineering, and data analysis. We provide detailed installation instructions on our website at https://introductorystatistics.com/.

Using Python with this Book

With some exceptions, this book presents relevant Python code in the second part
of each chapter. The book is not an in-depth step-by-step introduction to computer
programming as a discipline, but rather it provides the tools you need to imple-
ment the statistical procedures that are discussed in this book. Because many of

these procedures are based on iterative resampling, rather than simply calculating
formulas, you will get useful practice with the data handling and manipulation
that is a Python strength. No specific level of Python ability is required to get
started. If you are completely new to Python, you could consider launching your-
self with a quick self-study guide (easily found on the web), but, in general, you
should be able to follow along.

1

Statistics and Data Science

Statistical methods first came into use before homes had electricity, and had
several phases of rapid growth:
● The first big boost came from manufacturers and farmers who were able to
decrease costs, produce better products, and improve crop yields via statistical
experiments.
● Similar experiments helped drug companies graduate from snake oil purveyors
to makers of scientifically proven remedies.
● In the late 20th century, computing power enabled a new class of computation-
ally intensive methods, like the resampling methods that we will study.
● In the early decades of the current millennium, organizations discovered that
the rapidly growing repositories of data they were collecting (“big data”) could
be mined for useful insights.
As with any powerful tool, the more you know about it the better you can apply
it and the less likely you will go astray. The lurking dangers are illustrated when
you type the phrase “How to lie with...” into a web search engine. The likely auto-
completion is “statistics.”
Much of the book that follows deals with important issues that can determine
whether data yields meaningful information or not:
● How to assess the role that random chance can play in creating apparently inter-
esting results or patterns in data
● How to design experiments and surveys to get useful and reliable information
● How to formulate simple statistical models to describe relationships between
one variable and another
We will start our study in the next chapter with a look at how to design exper-
iments, but, before we dive in, let’s look at some statistical wins and losses from
different arenas.

Statistics for Data Science and Analytics, First Edition. Peter C. Bruce, Peter Gedeck, and Janet Dobbins.
© 2025 John Wiley & Sons, Inc. Published 2025 by John Wiley & Sons, Inc.
Companion website: www.wiley.com/go/Wiley_Statistics_for_Data


1.1 Big Data: Predicting Pregnancy


In 2010, a statistician from Target described how the company used customer
transaction data to make educated guesses about whether customers are pregnant
or not. On the strength of these guesses, Target sent out advertising flyers to likely
prospects, centered around the needs of pregnant women.
How did Target use data to make those guesses? The key was data used to
“train” a statistical model: data in which the outcome of interest—pregnant/not
pregnant—was known in advance. Where did Target get such data? The “not
pregnant” data was easy—the vast majority of customers are not pregnant, so
data on their purchases is easy to come by. The “pregnant” data came from a
baby shower registry. Both datasets were quite large, containing lists of items
purchased by thousands of customers.
Some clues are obvious—the purchase of a crib and baby clothes is a dead
giveaway. But, from Target’s perspective, by the time a customer purchased these
obvious big-ticket items, it was too late—they had already chosen their shopping
venue. Target wanted to reach customers earlier, before they decided where to
do their shopping for the big day. For that, Target used statistical modeling to
make use of non-obvious patterns in the data that distinguish pregnant from
non-pregnant customers. One clue that emerged was shifts in the pattern of
supplement purchases—e.g. a customer who was not buying supplements 60 days
ago but is buying them now.

1.2 Phantom Protection from Vitamin E


In 1993, researchers examining a database on nurses’ health found that nurses
who took vitamin E supplements had 30% to 40% fewer heart attacks than those
who didn’t. These data fit with theories that antioxidants such as vitamins E and
C could slow damaging processes within the body. Linus Pauling, winner of the
Nobel Prize in Chemistry in 1954, was a major proponent of these theories, which
were one driver of the nutritional supplements industry.
However, the heart health benefits of vitamin E turned out to be illusory. A study
completed in 2007 divided 14,641 male physicians randomly into four groups:
1) Take 268 mg of vitamin E every other day
2) Take 500 mg of vitamin C every day
3) Take both vitamin E and C
4) Take a placebo.
Those who took vitamin E fared no better than those who did not take vitamin E.
Since the only difference between the two groups was whether or not they

took vitamin E, if there were a vitamin E effect, it would have shown up. Several
meta-analyses, which are consolidated reviews of the results of multiple published
studies, have reached the same conclusion. One found that vitamin E at the above
dosage might even increase mortality.
What happened to make the researchers in 1993 think they had found a link
between vitamin E and disease inhibition? In reviewing a vast quantity of data,
researchers thought they saw an interesting association. In retrospect, with the
benefit of a well-designed experiment, it appears that this association was merely
a chance coincidence. Unfortunately, coincidences happen all the time in life. In
fact, they happen to a greater extent than we think possible.

1.3 Statistician, Heal Thyself


In 1993, Mathsoft Corp., the developer of Mathcad mathematical software,
acquired StatSci, the developer of S-PLUS statistical software, the precursor
to R. Mathcad was an affordable tool popular with engineers—prices were
in the hundreds of dollars and the number of users was in the hundreds of
thousands. S-PLUS was a high-end graphical and statistical tool used primarily
by statisticians—prices were in the thousands of dollars and the number of users
was in the thousands.
In looking to boost revenues, Mathsoft turned to an established marketing
principle—cross-selling. In other words, try to convince the people who bought
product A to buy product B. With the acquisition of a highly regarded niche
product, S-PLUS, and an existing large customer base for Mathcad, Mathsoft
decided that the logical thing to do would be to ramp up S-PLUS sales via direct
mail to its installed Mathcad user base. It also decided to purchase lists of similar
prospective customers for both Mathcad and S-PLUS.
This major mailing program boosted revenues, but it boosted expenses even
more. The company lost over $13 million in 1993 and 1994 combined—significant
numbers for a company that had only $11 million in 1992 revenue.
What happened?
In retrospect, it was clear that the mailings were not well targeted. The costs of
the unopened mail exceeded the revenue from the few recipients who did respond.
Mathcad users turned out not to be likely users of S-PLUS. The huge losses could
have been avoided through the use of two common statistical techniques:
1) Doing a test mailing to the various lists being considered to (1) determine
whether the list is productive and (2) test different headlines, copy, pricing,
etc., to see what works best.
2) Using predictive modeling techniques to identify which names on a list are
most likely to turn into customers.
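Predictive modeling comes later in the book, but the arithmetic that makes a test mailing informative fits in a few lines. In this sketch, every number (piece cost, license price, list size, response) is invented for illustration, not taken from the Mathsoft case:

```python
# Back-of-envelope economics of a test mailing (all numbers invented).
cost_per_piece = 1.50      # printing + postage per mailed piece
revenue_per_sale = 1500.0  # revenue from one software license

test_size = 2000           # pieces mailed in the test
test_sales = 1             # sales the test generated

response_rate = test_sales / test_size
breakeven_rate = cost_per_piece / revenue_per_sale

print(f"observed response rate:   {response_rate:.4%}")
print(f"break-even response rate: {breakeven_rate:.4%}")
print("roll out?", response_rate > breakeven_rate)  # False with these numbers
```

A small test like this, done first, reveals whether a full rollout could ever cover its costs.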

1.4 Identifying Terrorists in Airports


Since the September 11, 2001, Al Qaeda attacks in the United States and subse-
quent attacks elsewhere, security screening programs at airports have become a
major undertaking, costing billions of dollars per year in the United States alone.
Most of these resources are consumed in an exhaustive screening process. All pas-
sengers and their tickets are reviewed, their baggage is screened and individuals
pass through detectors of varying sophistication. An individual and his or her bag
can only receive a limited amount of attention in a screening process that is applied
to everyone. The process is largely the same for each individual. Potential terrorists
can see the process and its workings in detail and identify weaknesses.
To improve the effectiveness of the system, security officials have studied ways of
focusing more concentrated attention on a small number of travelers. In the years
after the attacks, one technique used enhanced screening for a limited number
of randomly selected travelers. While it adds some uncertainty to the screening
process, which acts as a deterrent to attackers, random selection does nothing to
focus attention on high-risk individuals.
Determining who is at high risk is, of course, the problem. How do you know
who the high-risk passengers are?
One method is passenger profiling—specifying some guidelines about what pas-
senger characteristics merit special attention. These characteristics were deter-
mined by a reasoned, logical approach. For example, purchasing a ticket for cash,
as the 2001 hijackers did, raises a red flag. The Transportation Security Admin-
istration trains a cadre of Behavior Detection Officers. The Administration also
maintains a specific no-fly list of individuals who trigger special screening.
There are several problems with the profiling and no-fly approaches.
● Profiling can generate backlash and controversy because it comes close to stereo-
typing. American National Public Radio commentator Juan Williams was fired
when he made an offhand comment to the effect that he would be nervous about
boarding an aircraft in the company of people in full Muslim garb.
● Profiling, since it does tend to merge with stereotyping and is based on logic and
reason, enables terrorist organizations to engineer attackers that do not meet
profile criteria.
● No-fly lists are imprecise (a name may match thousands of individuals) and
often erroneous. Senator Edward Kennedy was once pulled aside because he
supposedly showed up on a no-fly list.
An alternative or supplemental approach is a statistical one—separate out pas-
sengers who are “different” for additional screening, where “different” is defined
quantitatively across many variables that are not made known to the public.
The statistical term is “outlier.” Different does not necessarily prove that the

person is a terrorist threat, but the theory is that outliers may have a higher threat
probability. Turning the work over to a statistical algorithm mitigates some of the
controversy around profiling, since security officers would lack the authority to
make discretionary decisions.
Defining “different” requires a statistical measure of distance, which we will
learn more about later.
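To make the idea concrete, here is a toy sketch (not any real screening system): standardize a few made-up variables, then score each record by its distance, in standard-deviation units, from the average profile. The variables and numbers are invented for illustration:

```python
# Toy sketch of "different = far from typical": standardize each variable,
# then measure each record's distance from the average profile.
# All variables and values are invented; real systems are far more complex.
import math
import statistics

# each row: (bags checked, days before departure ticket was bought, prior flights)
passengers = [
    (1, 30, 12),
    (2, 21, 8),
    (1, 45, 20),
    (0, 2, 0),    # last-minute ticket, no bags, no history: stands out
    (1, 28, 15),
]

def z_distance(rows):
    """Distance of each row from the mean, in standard-deviation units."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [
        math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, sds)))
        for row in rows
    ]

distances = z_distance(passengers)
flagged = distances.index(max(distances))
print(flagged)  # index of the most "different" passenger
```

With these made-up data, the last-minute buyer is the largest outlier; a real system would weigh many more variables, kept confidential.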

1.5 Looking Ahead


We’ll be studying many things in this book, but several important themes will be
1) Learning more about random processes and statistical tools that will help
quantify the role of chance and distinguish real phenomena from chance
coincidence.
2) Learning how to design experiments and studies that can provide more defini-
tive answers to questions such as whether a medical therapy works, which
marketing message generates a better response, and which management
technique or industrial process produces fewer errors.
3) Learning how to specify and interpret statistical models that describe the rela-
tionship between two variables, or between a response variable and several
“predictor” variables, in order to:
• Explain/understand phenomena and answer research questions (“What fac-
tors contribute to a drug’s success, or the response to a marketing message?”).
• Make predictions (“Will a given subscriber leave this year?” “Is a given insur-
ance claim fraudulent?”)

1.6 Big Data and Statisticians


Before the turn of the millennium, by and large, statisticians did not have to be
too concerned with programming languages, SQL queries, and the management
of data. Database administration and data storage in general were someone else’s
job, and statisticians would obtain or get handed data to work on and analyze.
A statistician might, for example,
● Direct the design of a clinical trial to determine the efficacy of a new therapy
● Help a researcher determine how many subjects to enroll in a study
● Analyze data to prepare for legal testimony
● Conduct sample surveys and analyze the results
● Help a scientist analyze data that comes out of a study
● Help an engineer improve an industrial process

All of these tasks involve examining data, but the number of records is likely to
be in the hundreds or thousands at most, and the challenge of obtaining the data
and preparing it for analysis was not overwhelming. So the task of obtaining the
data could safely be left to others.

1.6.1 Data Scientists


The advent of big data has changed things. The explosion of data means that more
interesting things can be done with data, and they are often done in real time or
on a rapid turnaround schedule. FICO, the credit-scoring company, uses statistical
models to predict credit card fraud, collecting customer data, merchant data, and
transaction data 24 hours a day. FICO has more than two billion customer accounts
to protect, so it is easy to see that this statistical modeling is a massive undertaking.
The science of computer programming and details of database administration lie
beyond the scope of this book, but these fields now lie within the scope of statistical
work. The statistician must be conversant with the data, as well as how to get it
and work with it.
● Statisticians are increasingly asked to plug their statistical models into big data
environments, where the challenge of wrangling and preparing analyzable data
is paramount, and requires both programming and database skills.
● Programmers and database administrators are increasingly interested in adding
statistical methods to their toolkits, as companies realize that their databases
possess value that is strategic, not just administrative, and goes well beyond the
original reason for collecting the data.
Around 2010, the term data scientist came into use to describe analysts who com-
bined these two sets of skills. Job announcements now carry the term data scientist
with greater frequency than the term statistician, reflecting the importance that
organizations attach to managing, manipulating, and obtaining value out of their
vast and rapidly growing quantities of data.
We close this chapter with a probability experiment:

Try It Yourself
1) Write down a series of 50 random coin flips without actually flipping the
coins. That is, write down a series of 50 made-up H’s and T’s selected in
such a way that they appear random.
2) Now actually flip a coin 50 times.

If you look at the series of made-up tosses and compare them to the real tosses,
the longest streaks of either H or T generally occur in the ACTUAL tosses. When
a person is asked to make up random tosses, they will rarely “allow” more than
four H’s or T’s in a row. By the time they have written down four H’s in a row, they
think it is time to switch over to T, or else the series would not appear random. By
contrast, instructors who teach this exercise in class often see a streak of 8 T’s or
H’s in a row. Most people think that this is not random, and yet it clearly is.
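A few lines of Python make the comparison easy to repeat. This is a minimal sketch; your streaks will vary from run to run unless you fix the seed as done here:

```python
# Simulate 50 fair coin flips and find the longest run of identical outcomes.
# Run it a few times: streaks of 5-8 in a row are routine, though made-up
# "random" sequences rarely contain them.
import random

def longest_run(flips):
    """Length of the longest streak of consecutive identical flips."""
    best = run = 1
    for prev, cur in zip(flips, flips[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

random.seed(13)  # fix the seed so the example is reproducible
flips = [random.choice("HT") for _ in range(50)]
print("".join(flips))
print("longest streak:", longest_run(flips))
```

Try removing the `random.seed` line and rerunning: long streaks keep appearing, purely by chance.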
In 1913, a roulette wheel at the Monte Carlo casino landed on black 26 times in a
row. As the streak developed, gamblers, convinced that the wheel would most cer-
tainly have to end the streak, increasingly bet heavily on red—they lost millions.
The message here is that random variation reliably produces patterns that
appear non-random.
Why is this significant? Just as with coin tosses, there is a significant component
of random variation (engineers call it noise) in the data that routinely flow through
life—whether business life, government affairs, the education world, or personal
life. So much data…so much random variation…how do we know what is real and
what is random?
We can’t know for certain, though we do know that random behavior can appear
to be real. One purpose of this book is to teach you about probability, help you
evaluate the potential random component in data, and provide ways of modeling it.
This gives us a benchmark against which to measure patterns, and form educated
guesses about whether observed events or patterns of interest might really be due
to chance. When we understand randomness better, we can curb the tendency to
chase after random patterns, and produce more reliable analyses of data.

2

Designing and Carrying Out a Statistical Study

In this chapter, we study random behavior and how it can fool us, and we learn
how to design studies to gain useful and reliable information. After completing
this chapter, you should be able to
1) Use coin flips to replicate random processes, and interpret the results of
coin-flipping experiments
2) Use an informal working definition of probability
3) Define, intuitively, p-value
4) Describe the different data formats you will encounter, including relational
database and flat file formats
5) Describe the difference between data encountered in traditional statistical
research, and “big data”
6) Explain the use of treatment and control groups in experiments
7) Define statistical bias
8) Explain the role of randomization and blinding in assigning subjects in a study
9) Explain the difference between observational studies and experiments
10) Design a statistical study following basic principles

2.1 Statistical Science

“It’s not what you don’t know that hurts you, it’s what you know for sure
that ain’t so.” (Will Rogers, American humorist)




Nearly all large organizations now have huge stores of customer and other
data that they mine for insight, in hopes of boosting revenue or reducing costs.
In the academic world, over five million research articles are published per year
in scholarly and scientific journals. These activities afford ample opportunity to
dive into the data, and discover things that aren’t true, particularly when the
diving is done automatically and at a large scale. Statistical methods play a large
role in this extraction of meaning from data. However, the science of statistics
also provides tools to study data more carefully, and distinguish what’s true from
what ain’t so.

2.2 Big Data


In most organizations today, raw data are plentiful (often too plentiful), and this
is a two-edged sword.
● Huge amounts of data make prediction possible in circumstances where small
amounts of data don’t help. One type of recommendation system, for example,
needs to process millions of transactions to locate transactions with the same
item you are looking at—enough so that reliable information about associated
items can be deduced.
● On the other hand, huge data flows and incorrect data can obscure meaning-
ful patterns in the data, and generate false ones. Useful data are often difficult
and expensive to gather. We need to find ways to get the most information,
and the most accurate information, for each dollar spent in assembling and
preparing data.

2.3 Data Science

The terms big data, machine learning, data science, and artificial intelligence
(AI) often go together, and bring different things to mind for different people.
The term artificial intelligence, particularly with the advent of ChatGPT and gen-
erative AI, suggests almost magical methods that approach human-like cognition
capabilities. Privacy-minded individuals may think of large corporations or spy
agencies combing through petabytes of personal data in hopes of locating tidbits
of information that are interesting or useful. Analysts may focus on statistical and
machine learning models that can predict an unknown value of interest (loan
default, acceptance of a sales offer, filing a fraudulent insurance claim or tax
return, for example).

Statistical science, by contrast, has well over a century of history, and its meth-
ods were originally tailored to data that were small and well-structured. However,
it is an important contributor to the field of data science which, when it is well
practiced, is not just aimless trolling for patterns, but starts out with questions of
interest:
● What additional product should we recommend to a customer?
● Which price will generate more revenue?
● Does the MRI show a malignancy?
● Is a customer likely to terminate a subscription?
All these questions require some understanding of random behavior and all benefit
from an understanding of the principles of well-designed statistical studies, so this
is where we will start.

2.4 Example: Hospital Errors


Healthcare accounts for about 18% of the United States GDP (as of 2024), is a
regular subject of political controversy and proposals for reform, and produces
enormous amounts of data and analysis. One area of study is the problem of medi-
cal errors—violations of the Hippocratic oath’s “do no harm” provision. Millions of
hospitalized patients each year around the world are affected by treatment errors
(mostly medication errors). A 2017 report from the National Institutes of Health
(NIH) in the U.S. estimated that 250,000 deaths per year resulted from medical
errors. There are various approaches to dealing with the problem.




Clinical Decision Support systems (CDS) are used to guide practitioners in
diagnosis and treatment, and can provide rule-based alerts when standard
treatment protocols are violated. However, all those rules must be programmed
and kept up-to-date in an extremely complex medical environment. Many false
alarms result, which can cause practitioners to ignore the alerts. Recent advances
in machine learning have enabled systems that learn on their own to provide
alerts, without experts having to program rules. These systems allow for the
correction of errors once they occur, but what about identifying the causes of
errors and reducing their frequency?
One obvious and uncontroversial innovation has been to promote the use of
checklists to reduce errors of omission. Other ideas may not be so obvious. No-fault
error reporting has been proposed, in which staff are encouraged to report all
errors, both their own and those committed by others, without fear of punish-
ment. This could have the benefit of generating better information about errors
and their sources, but could also hinder accountability efforts. How could you find
out whether such a program really works? The answer: a well-designed statistical
study.

2.5 Experiment
To tie together our study of statistics we will look at an experiment designed to test
whether no-fault reporting of all hospital errors reduces major errors in hospitals
(errors resulting in further hospitalization, serious complications, or even death).
An experiment like this was conducted by a hospital in Quebec, Canada, but it was
too small to provide definitive conclusions. For illustrative purposes, we will look
at hypothetical data that a larger study might have produced.
Experiments are used in industry, medicine, social science, and data science.
The ubiquitous A/B test (more on that later) is an experiment. The key feature of
an experiment is that the investigator manipulates some variable that is believed to
affect an outcome of interest, in order to demonstrate the importance and effect (or
lack thereof) of the variable. This stands in contrast to a survey or other analysis
of existing data, where the analyst simply collects and analyzes data. For example,
in a web experiment, the marketing investigator might try out a new product price
to see how it affects sales.
Experiments can be uncontrolled or controlled. In an uncontrolled experiment,
the investigator collects data on the group or time period for which the variable of
interest has been changed. In a web experiment, for example, the price of a product
might be increased by 25%, and then sales compared to prior sales.

Experiment vs. Observational Study


In the fifth inning of the third game of the 1932 baseball World Series between
the NY Yankees and the Chicago Cubs, the great slugger Babe Ruth came to
bat and pointed towards center field, as if to indicate that he planned to hit
the next pitch there. On the next pitch, he indeed hit the ball for a home run
into the center field bleachers.a
A Babe Ruth home run was an impressive feat, but not that uncommon. He
hit one every 11.8 at-bats. What made this one so special is that he predicted it.
In statistical terms, he specified in advance a theory about a future event—the
next swing of the bat—and an outcome of interest—a home run to center field.
In statistics, we make an important distinction between studying
pre-existing data (an observational study) and collecting data to answer
a pre-specified question (an experiment or prospective study). The most
impressive and durable results in science come when the researcher specifies
a question in advance, then collects data in a well-designed experiment to
answer the question. Offering commentary on the past can be helpful, but is
no match for predicting the future.
a There is some controversy about whether he actually pointed to center field or to left field
and whether he was foreshadowing a prospective home run or taunting Cubs players. You
can Google the incident (“Babe Ruth called shot”) and study videos on YouTube, then judge
for yourself.

2.6 Designing an Experiment


The problem with an uncontrolled experiment is the uncertainty involved in the
comparison. Suppose sales drop 10% in the web experiment with the new price.
Can you be sure that nothing else has changed since the experiment started? Prob-
ably not—companies are making modifications and trying new things all the time.
Hence, the need for a control group.
In a controlled experiment, two groups are used and they are handled in the
same way, except that one is given the treatment (e.g. the increased price), and the
other is not given the treatment. In this way, we can eliminate the confounding
effect of other factors not being studied.

2.6.1 A/B Tests: A Controlled Experiment for the Hospital Plans


In our errors experiment, we could compare two groups of hospitals. One group
uses the no-fault plan and one does not. The group that gets the change in

treatment you wish to study (here, the no-fault plan) is called the treatment group.
The group that gets no treatment or the standard treatment is called the control
group.
An experiment like this, testing a control group vs. a treatment group, is also
called an A/B test, particularly in the field of marketing, where one web treatment
might be tested against another. Sometimes, particularly in marketing, there might
not be an established control scenario and we are simply comparing one proposed
new treatment against another proposed new treatment (e.g. two different web
pages).
How do you decide which hospitals go into which group?
You would like the two groups to be similar to one another, except for the treat-
ment/control difference. That way, if the treatment group does turn out to have
fewer errors, you can be confident that it was due to the treatment. One way to do
this would be to study all the hospitals in detail, examine all their relevant charac-
teristics, and assign them to treatment/control in such a way that the two groups
end up being similar across all these attributes. There are two problems with this
approach.
1) It is usually not possible to think of all the relevant characteristics that might
affect the outcome. Research is replete with the discovery of factors that were
unknown prior to the study or thought to be unimportant.
2) The researcher, who has a stake in the outcome of the experiment, may con-
sciously or unconsciously assign hospitals in a way that enhances the chances
of the success of their pet theory.
Oddly enough, the best strategy is to assign hospitals randomly: for example,
by tossing a coin.

2.6.2 Randomizing
True random assignment eliminates both conscious and unconscious bias in
the assignment to groups. It does not guarantee that the groups will be equal in
all respects. However, it does guarantee that any departure from equality will
be due simply to the chance allocation, and that the larger the samples, the
fewer differences the groups will have. With extremely large samples, differences
due to chance virtually disappear, and you are left with differences that are
real—provided the assignment to groups is really random.
Random assignment lets us make the claim that any difference in group out-
comes that is more than might reasonably happen by chance is, in fact, due to
the different treatment of the groups. The study of probability in this book lets us
quantify the role that chance can play and take it into account.
We can imagine an experiment in which both groups got the same treatment.
We would expect to see some differences from one hospital to another. An everyday

example of this might be tossing a coin. If you toss a coin 10 times you will get a
certain number of heads. Do it again and you will probably get a different number
of heads.
Though the results vary, there are laws of chance that allow you to calculate
things like how many heads you would expect on average or how much the results
would vary from one set of 10 (or 100 or 1000) tosses to the next. If we assign sub-
jects at random, we can use these same laws of chance—or a lot of coin tosses—to
analyze our results.
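These laws of chance are easy to explore by simulation. The following sketch, using Python's standard random module, repeats the 10-toss experiment several times and counts the heads in each set (the seed value is arbitrary, chosen only to make the illustration reproducible):

```python
import random

random.seed(1)  # arbitrary seed, for a reproducible illustration

# Toss a fair coin 10 times and count heads; repeat the whole experiment
# 5 times. The counts vary from one set of tosses to the next, even though
# the underlying process never changes.
for trial in range(5):
    heads = sum(random.choice(["H", "T"]) == "H" for _ in range(10))
    print(f"Set {trial + 1}: {heads} heads out of 10")
```

Running this a few times with different seeds gives a feel for how much variation pure chance produces.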
If we have Doctor Jones assign subjects using her own best judgment, we will
have no knowledge of the (often subconscious) factors that influence assignment.
These factors may bias assignment so that we can no longer say that the only thing
(besides chance assignment) distinguishing the treatment and control groups is
the treatment. Random assignment is not always possible—for example, randomly
assigning elementary school students to two different teaching methods, where
everything else in the education setting is the same.
Randomization can be difficult or impossible in some situations, but it is rela-
tively easy in the A/B testing that is popular in digital marketing. Web visitors can
be easily randomized to one web page or another; email recipients can easily be
assigned randomly to one version or another of an email.

2.6.3 Planning
You need some hospitals and you estimate you can find about 100 within a rea-
sonable distance. You will probably need to present a plan for your study to the
hospitals to get their approval. That seems like a nuisance, but they cannot let just
anyone do any study they please on the patients.1 In addition to writing a plan to
get approval, you know that one of the biggest problems in interpreting studies
is that many are poorly designed. You want to avoid that, so you think carefully
about your plan and ask others for advice. It would be good to talk to a statistician
with experience in medical work. Your plan is to ask the 100 or so available hos-
pitals if they are willing to join your study. They have a right to say no. You hope
quite a few will say yes. In particular, you hope to recruit 50 willing hospitals and
randomly assign them to treatment and control.

Try It Yourself
How exactly would you assign hospitals randomly? Think about options for
the scenario where it doesn’t matter if the groups are exactly equal-sized, and
for the scenario where you want two groups of equal size.
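One way to implement both options is sketched below. The hospital IDs and the seed are hypothetical placeholders, not values from the study:

```python
import random

random.seed(42)  # arbitrary seed for reproducibility
hospitals = list(range(1, 51))  # 50 willing hospitals, labeled 1-50 (hypothetical IDs)

# Option 1: toss a coin for each hospital -- group sizes will vary by chance.
coin_assignment = {h: random.choice(["treatment", "control"]) for h in hospitals}

# Option 2: shuffle the list and split it in half -- guarantees two groups of 25.
shuffled = hospitals[:]
random.shuffle(shuffled)
treatment, control = shuffled[:25], shuffled[25:]

print(len(treatment), len(control))  # 25 25
```

Both approaches are genuinely random; the shuffle-and-split version simply adds the constraint of equal group sizes.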

1 This effort is modest, in comparison to that required to gain approval for new drug therapies.
Clinical trials to establish drug efficacy and safety can take years and cost billions of dollars.

Figure 2.1 Dart throws off-target in consistent fashion (biased). Source: Peter Bruce (Book Author).

2.6.4 Bias
Randomization is used to try to make the two groups similar at the beginning.
It is important to keep them as similar as possible during the experiment. We want
to be sure the treatment is the only difference between them. Any difference in
outcome due to non-random extraneous factors is a form of bias. Statistical bias is a
technical concept but can include the lay definition that refers to people’s opinions
or states of mind.

Definitions: Bias Statistical bias is the tendency for an estimate, model, or pro-
cedure to yield results that are consistently off-target for a specific purpose (as in
Fig. 2.1).
For example, the mean (average) income for a region might not be a good esti-
mate for the income of a typical resident, if part of the region is home to a small
number of very high-income residents. Their incomes would likely raise the aver-
age above that of most typical residents (i.e. ones selected at random). Another
example is gun sights on a long-range rifle. Gravity will exert a downward pull on
a bullet, the more distant the target the greater the pull. The coordinates of the
target in the sights will be biased upward compared to where the bullet lands.
Bias can often creep into a study when humans are involved, either as subjects or
experimenters. For one thing, subject behavior can be changed by the fact that they

are participating in a study. Experience has also shown that people respond posi-
tively to attention, and just being part of a study may cause subjects to change. A
positive response to the attention of being in a study is called the Hawthorne effect.
Awareness of an issue can significantly affect perceptions, which is why potential
jurors in a trial are asked if they have seen news coverage of a case at issue.

Out-of-Control Toyotas?
In the fall of 2009, the National Highway Transportation Safety Agency
(NHTSA) was receiving several dozen complaints per month about Toyota
cars speeding out of control. The rate of complaint was not that different
from the rates of complaint for other car companies. Then, in November of
2009, Toyota recalled 3.8 million vehicles to check for sticking gas pedals.
By February, the complaint rate had risen from several dozen per month to
over 1500 per month of alleged cases of unintended acceleration. Attention
turned to the electronic throttle.
Clearly what changed was not the actual condition of cars—the stock of Toy-
otas on the road in February of 2010 was not that different from November of
2009. What changed was car owners’ awareness and perception as a result of
the headlines surrounding the recall. Acceleration problems, whether real or
illusory, that escaped notice prior to November 2009 became causes for worry
and a trip to the dealer. Later, the NHTSA examined a number of engine data
recorders from accidents where the driver claimed to experience acceleration
despite applying the brakes. In all cases, the data recorder showed that the
brakes were not applied.
In February 2011, the US Department of Transportation announced that a
10-month investigation of the electronic throttle showed no problems. Public
awareness of the problem boosted the rate of complaint far out of proportion
to its true scope.
Lesson: Your perception of whether you personally experience a problem or
benefit is substantially affected by your prior awareness of others’ problems
and benefits.
Sources: Wall Street Journal, July 14, 2010; The Analysis Group (http://www.analysisgroup.com—accessed July 14, 2010); USA Today online, April 2, 2011.

In some situations, we can avoid telling people they are participating in a study.
For example, a marketing study might try different ads or products in various
regions without publicizing that they are doing so for research purposes. In other
situations, we may not be able to avoid letting subjects know they are being
studied, but we may be able to conceal whether they are in the treatment or
control group.

2.6.4.1 Placebo
A placebo is a dummy treatment imposed on the control group to render their
experience similar to that of the treatment group. In this way, we can distinguish the
real effect of the treatment from any response that is simply the result of perceiving
that you are being treated. Experience has shown that subjects will often experi-
ence and report good results even for dummy treatments. This positive response to
the perception that you are being treated is called the placebo effect. In medicine,
the placebo effect on the brain can be powerful for mitigating symptoms, and
explains the popularity of many remedies that have no rigorous scientific basis for
their effectiveness. In one study, a tablet labeled “placebo” was found to be 50% as
effective as actual migraine medicine in relieving migraine symptoms. Thus, the
net beneficial effect of many therapies is made up both of a real scientific compo-
nent, and a placebo component.

2.6.4.2 Blinding
The process of concealing treatment from control is termed blinding. We say
a study is single-blind when the subjects—the hospitals in our medical errors
example—do not know whether they are getting the treatment. It is double-blind
if the staff in contact with the subjects also do not know which group is getting
the real treatment. It is triple-blind if the people who evaluate the results do not
know, either.
Blinding, particularly full blinding, is not always feasible. The very nature of
the treatment (for example, no-fault error reporting for the hospital study) may
require participant awareness. The control side may need to be more than simply
“do nothing.” Drug trials typically include a sugar pill for the control arm. For
the hospital study, our control might be a standard set of best practices (excluding
no-fault reporting) shared among the control hospitals.
In marketing experiments, participant blinding usually happens automatically—
web viewers or message recipients are typically unaware there might be another
group seeing a different message. Bias from the analyst side, though, is common.
It often occurs when analysts have a favored outcome (perhaps subconsciously)
and can choose when to end an experiment, or elect to look at a subset of
results that seem more reasonable to them. Blinding of the analyst can mitigate
this bias.
In addition to the various forms of blinding, we try to keep all other aspects of
each subject’s environment the same. In the hospital experiment, we need to agree
on common treatment and control error-handling methods that will be applied
to all hospitals in each group. By keeping the two groups the same in every way
except the treatment, we can be confident that any differences in the results were
due to it.

2.6.4.3 Before-after Pairing


We could run our study for a year and measure the total number of medical errors
each hospital had by the end of that period. A better strategy is to measure how
many errors they had the year before the study as well. Then we have paired data—
two measurements on each unit. This allows us to compare the treatment to no
treatment at the same hospitals.
Note that we still retain the control group. Having both a control group and
a treatment group allows us to separate out the improvement due to no-fault-
reporting from the improvement due to the more general best practices treatment.
Having a control group also controls for trends that affect all hospitals. For
example, the number of errors could be increasing due to an increased patient
load at hospitals generally.

2.7 The Data


Let’s look at the results—major errors in the year preceding the treatment, and
during the period when some hospitals implemented the treatment intervention
(the no-fault reporting plan). A “0” in the treatment column in Table 2.1 indicates
the hospital was not selected for the no-fault treatment program, a “1” indicates
it was.
Let’s now focus directly on the subject of interest—reduction in errors. Table 2.2
shows the full table, listing only the amount by which errors were reduced.

2.7.1 Dataframe Format


Tables 2.1 and 2.2 are examples of a standard database or tabular format, which all
database programs and most standard-purpose statistical software programs use.

Table 2.1 Hospital errors (partial): before and after treatment.

Row  Hospital#  Treat?  Errors Before  Errors After
1 239 0 27 24
2 1126 0 17 16
3 1161 0 31 29
4 1293 1 38 32
5 1462 1 25 23
45 more

Table 2.2 Reduction in major errors in hospitals.

Row  Hospital#  Treat?  Reduction in Errors    Row  Hospital#  Treat?  Reduction in Errors

1 239 0 3 26 2795 1 3
2 1126 0 1 27 2889 0 5
3 1161 0 2 28 2892 1 9
4 1293 1 2 29 2991 1 2
5 1462 1 2 30 3166 1 2
6 1486 0 2 31 3190 0 1
7 1698 1 5 32 3254 0 4
8 1710 0 1 33 3312 1 2
9 1807 0 1 34 3373 1 2
10 1936 1 2 35 3403 1 3
11 1965 1 2 36 3403 0 1
12 2021 1 2 37 3429 1 2
13 2026 0 1 38 3441 1 6
14 202 0 3 39 3520 0 1
15 208 1 4 40 3568 1 2
16 2269 1 2 41 3580 0 2
17 2381 1 2 42 3599 0 2
18 2388 0 1 43 3660 1 2
19 2400 1 2 44 3985 0 2
20 2475 0 4 45 4014 1 2
21 2548 0 1 46 4060 0 1
22 2551 0 2 47 4076 1 2
23 2661 0 1 48 4093 0 1
24 2677 1 4 49 4230 0 2
25 2739 1 2 50 5633 0 2

The standard object in data programming languages is a dataframe. Rows represent
records or cases—hospitals in this example. Columns represent variables,
which are data that change from case to case (hospital to hospital). The format has
two key features:

1) Each row contains all the information for one and only one case.
2) All data for a given variable is in a single column.
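As a sketch of this format, the first three rows of Table 2.1 could be entered as a pandas dataframe. The column names here are illustrative choices, not taken from the actual data file:

```python
import pandas as pd

# First three rows of Table 2.1 as a dataframe: each row is one hospital
# (a case), and each column is a variable.
data = pd.DataFrame({
    "Hospital": [239, 1126, 1161],
    "Treat": [0, 0, 0],
    "Errors_Before": [27, 17, 31],
    "Errors_After": [24, 16, 29],
})
print(data.shape)  # (3, 4): 3 cases, 4 variables
```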

To load data from a .csv file2 into a dataframe named “data,” you would use
pandas with the syntax data = pd.read_csv("hospitalerrors.csv").
We will cover this in more detail in Section 2.19.3.
Let’s look at the hospital data.
Column 1 is simply the row number.
Column 2, hospital, contains case labels. These are arbitrary labels for the exper-
imental units—a unique number for each unit. Case labels keep track of the
data. For example, if we find a mistake in the data, we would need to know
which hospital that came from so we could investigate the cause and correct
the mistake. Numerical codes are preferred to more informative labels when
we wish to conceal the group to which subjects were assigned.
Column 3 labels observations from the treatment group with a one and those from
the control group with a zero.
Column 4 is the number of major medical errors in the year before the study
minus the number from the following year. A positive number represents a
reduction in medical errors. Note that all the numbers are positive—things
got better whether subjects got the treatment or not! This could be due to
the Hawthorne effect, or to any extra care the subjects got from being in the
experiment, or to other factors that may have changed globally between the
two years.
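This before-minus-after arithmetic can be carried out column-wise in a dataframe. The sketch below uses the first three rows of Table 2.1 (the column names are illustrative):

```python
import pandas as pd

# Rows 1-3 of Table 2.1 (control hospitals).
data = pd.DataFrame({
    "Hospital": [239, 1126, 1161],
    "Errors_Before": [27, 17, 31],
    "Errors_After": [24, 16, 29],
})

# Reduction = errors before minus errors after; positive values mean improvement.
data["Reduction"] = data["Errors_Before"] - data["Errors_After"]
print(data["Reduction"].tolist())  # [3, 1, 2] -- matches rows 1-3 of Table 2.2
```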

2.8 Variables and Their Flavors

2.8.1 Numeric Variables


The third (Treat?) and fourth (Reduction in errors) columns in Table 2.2 contain
variables. These are things we observe, compute, or measure for each subject. Each
row represents an experimental unit or subject, while each column represents a
variable. Note that we have gone from listing the number of errors at the begin-
ning of the study and the number at the end to simply listing the reduction in
errors, which is what we are really interested in. They are an example of a numeric
variable, also called a quantitative variable. These are numbers with which you
can do meaningful arithmetic, and typically measure magnitudes—how much of
something. (The hospital error data is a special type of numerical variable—it is
“count” data.)

2 “Comma separated values,” or .csv, is a simple file format that can be read by a wide variety of
programming languages and software.

2.8.2 Categorical Variables


The other main type of variable is called categorical, or sometimes factor. Examples
that might be in the database (though not printed out above) are the city, county,
or province of each hospital, or whether it was a government, business, or char-
ity hospital. Categorical data is often recorded in text labels, for example: Male
or Female, Christian, Muslim, Hindu, Buddhist, Jew, or Other. But it is also com-
mon to code categories numerically. In the hospital data, treatment is a categorical
variable and was coded as one. Control is coded as zero.
A categorical variable must take one of a set of defined non-numerical values—
yes/no, low/medium/high, mammal/bird/reptile, etc. The categories might be
represented as numbers to accommodate software, so you need to be cautious
about doing routine arithmetic on those numbers, and be aware of what the
arithmetic really means. The choice of numbers is usually arbitrary.

2.8.3 Binary Variables


A special type of categorical variable is a binary, or two-value, variable. Much
data has a binary variable as an outcome, even some data that starts out with a
multi-category outcome. Examples are survive or die, purchase or no-purchase, click
or no-click, fraud or no-fraud. The prevalence of binary or yes-no data reflects
the fact that decision-making is often easier if situations, even complex ones,
can be boiled down to a choice between two alternatives. Binary data is typically
represented by 1’s and 0’s, with 1 representing the more unusual and interesting
category.

Coding categories as numbers does not make them numeric! Don't do
arithmetic on numerical codes for categorical data when it makes no
sense. If we code Christian, Muslim, Hindu, Buddhist, Jew, or Other
as 1, 2, 3, 4, 5, 6, respectively, then finding the total or average of these
codes is not meaningful. When the categories are ordered, such as the
degree of pain, some limited calculations may be possible.

With binary data in which one class is much more scarce than the other (e.g.
fraud/no-fraud), the scarce class is often designated “1.” Surveys and observational
studies often produce categorical data. People might be asked which candidate
they plan to vote for; what toothpaste they use; or whether they live in a house,
apartment, or mobile home. Observational studies often use similar variables that
can be evaluated at a glance or found in existing records. The basic statistical sum-
mary method for displaying such categorical data is a frequency table (see Table 3.5
for an example).
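As a sketch, pandas can produce such a frequency table with value_counts(); the housing responses below are hypothetical:

```python
import pandas as pd

# A small categorical variable: housing type for eight survey respondents
# (hypothetical responses).
housing = pd.Series(["house", "apartment", "house", "mobile home",
                     "apartment", "house", "house", "apartment"])

# value_counts() produces a frequency table -- the basic summary
# for categorical data: house 4, apartment 3, mobile home 1.
print(housing.value_counts())
```

Note that summing or averaging these categories would be meaningless; counting them is the appropriate summary.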


2.8.4 Text Data


Another type of data encountered is text data. Text data can appear in a wide variety
of formats: ordinary prose, posts on social media, a doctor’s notes, the contents of
labels, etc. Not usually thought of as a variable, text data typically must be prepro-
cessed before it can be subjected to statistical analysis, although some machine
learning methods incorporate the needed preprocessing and can deal with text
data directly.

2.8.5 Random Variables


A variable that takes on different values (e.g. heads/tails) as the result of a random
process is a random variable.
The distinction between what is random and what is not is often fuzzy. A coin
flip, you would think, is purely unpredictable, yet magicians are able to flip coins
so that they always appear to land heads. The quantity of acetaminophen in an
extra strength Tylenol you would consider to be invariant at 500 mg, yet there is
always some tiny uncontrolled variability in the exact amount of acetaminophen
per tablet. If we drill down to a sufficiently fine level of resolution, we usually
encounter an element of randomness in most measurements.

2.8.6 Simplified Columnar Format


An alternative is to present the error reduction for the control group in one col-
umn and the treatment group in the other (see Table 2.3). This provides the clear-
est presentation of how the two groups differ in the extent to which errors were
reduced. The treatment group had, on average, 2.80 fewer errors in the second year.

Table 2.3 Error reduction: compact table.

Treatment Control

2 3
2 1
5 2
2 2
2 1
2 1
4 1
2 3
2 1
2 4
4 1
2 2
3 1
9 5
2 1
2 4
2 1
2 1
3 2
2 2
6 2
2 1
2 1
2 2
2 2
Mean: 2.80 1.88

The control group had 1.88 fewer errors in the second year. Both groups reduced
errors, but the treatment does appear to have reduced errors to a greater extent
than the control.
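You can verify these group means directly from the values in Table 2.3:

```python
# Error reductions from Table 2.3, one list per group.
treatment = [2, 2, 5, 2, 2, 2, 4, 2, 2, 2, 4, 2, 3, 9, 2, 2, 2, 2, 3, 2, 6, 2, 2, 2, 2]
control = [3, 1, 2, 2, 1, 1, 1, 3, 1, 4, 1, 2, 1, 5, 1, 4, 1, 1, 2, 2, 2, 1, 1, 2, 2]

# The group means match the bottom row of Table 2.3.
print(sum(treatment) / len(treatment))  # 2.8
print(sum(control) / len(control))      # 1.88
```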

2.9 Python: Data Structures and Operations

Data structures and their manipulation are key features of every programming
language. This section introduces standard data types and more complex data
structures as well as some basic operations to manipulate them.

2.9.1 Primary Data Types


Python has the following primary data types:
● int: Represents integer values, such as 1, −5, or 1000.
● float: Represents floating-point or decimal values, such as 3.14, −0.5, or 2.0.
● string: Represents sequences of characters enclosed in single (') or double
quotes ("). For example, "Hello, World!" or 'Python'.
● bool: Represents the boolean values, True and False.
The Python standard library also has support to handle data describing events at
a specific time and/or date:
● time: Represents time values, e.g. 8 AM or 13:20.
● date: Represents dates, e.g. May 4th, 2023.
● datetime: Represents date and time together, e.g. 8 AM on May 4th, 2023.
We are not going to cover these three data types in this book. A special data type
is None, which represents the absence of a value.
You can use the type() function to check the data type of a value.
print(type(3.14)) # <class 'float'>
print(type("Python")) # <class 'str'>
print(type(True)) # <class 'bool'>

2.9.2 Comments
Every programming language has a way to add comments to code. Comments are
used to add information about the code, but will not be executed. Comments are
started with a hash (#) and continue until the end of the line.
# This is a comment
print("Hello, World!") # This is also a comment
print("Inside a string # this is not a comment")
Other documents randomly have
different content
e) Königreich Bayern.
Vom Regierungsbezirk Oberfranken die
Bezirksämter Kronach, Naila,
Stadtsteinach und Teile der
Bezirksämter Münchberg und Hof nebst
der Stadt Hof
etwa 1140 130000
Übersicht der geologischen
Formationen.
I. Archäische Formationsgruppe.
II. Paläozoische Formationsgruppe.
1. Kambrische Formation.
2. Silur.
3. Devon.
4. Karbon.
a) Unter-Karbon (Kulm).
b) Ober-Karbon (produktive Steinkohlenformation).

5. Permische Formation (Dyas).


a) Rotliegendes.
b) Zechstein.

III. Mesozoische Formationsgruppe.


1. Trias.
a) Buntsandstein.
b) Muschelkalk.
c) Keuper.

2. Jura.
a) Lias.
b) Dogger (brauner Jura).
c) Malm (weißer Jura).

3. Kreideformation.
IV. Känozoische Formationsgruppe.
1. Alt-Tertiär.
2. Jung-Tertiär.
V. Quartäre Formation.
a) Diluvium.
b) Alluvium.
Litteratur.
Nur einige der wichtigsten Werke sind hier genannt:
F. Regel: Thüringen. Ein geographisches Handbuch. 3 Bde. Jena,
1892–1896.
Behandelt im weitesten Umfange Land, Biogeographie und Kulturgeographie und ist
eine außerordentlich fleißige und verdienstvolle Arbeit.

Fr. Regel: Entwickelung der Ortschaften im Thüringerwald.


Ergänzungsheft Nr. 76 zu Petermanns Mitteilungen. Gotha.
1884.
Gibt eine Darstellung der Verkehrsentwickelung und der Siedelung unter Heranziehung
eines reichen Urkundenmaterials.

Fr. Regel: Forstwirtschaft in Thüringen. Geographische Blätter. Bd.


XV. Bremen, 1892.
Bespricht die Verbreitung und Kultur des Waldes im Thüringerwald.

F. Spieß: Physikalische Topographie von Thüringen. Weimar, 1875.


Ist eine eingehende und heute noch wertvolle Beschreibung aller physischen
Verhältnisse des ganzen thüringischen Gebiets.

H. Pröscholdt: Der Thüringerwald und seine nächste Umgebung.


Forschungen zur deutschen Landes- und Volkskunde. V. Bd.
Stuttgart, 1891.
Gibt eine auf dem neuesten Standpunkt stehende Darstellung der geologischen
Verhältnisse und ihrer Entwickelung.

C. Kaesemacher: Die Volksdichte der Thüringischen Triasmulde.


Forschungen zur deutschen Landes- und Volkskunde. VI. Bd.
Stuttgart, 1892.
Untersucht die Volksdichte mit Rücksicht auf die geologischen Verhältnisse des Bodens.

H. Leinhose: Bevölkerung und Siedelungen im Schwarzagebiet.


Inaugural-Dissertation. Halle a. d. S., 1890.
Untersucht die Volksdichte mit Rücksicht auf Höhenstufen.

J. Bühring und L. Hertel: Der Rennsteig des Thüringerwaldes.


Jena, 1896.
Das beste Buch über den Rennsteig mit wertvollen geschichtlichen Untersuchungen.

E. Sax: Die Hausindustrie in Thüringen (Sammlung


nationalökonomischer und statistischer Abhandlungen), 3 Hefte.
Jena, 1884–1888.
Eingehende wirtschaftsgeschichtliche Studien mit Angaben über Innungen u. s. w.

H. Gebhardt: Zur bäuerlichen Glaubens- und Sittenlehre. Gotha,


1895.
Schildert Leben und Denken des Flachlandbauern Thüringens, seine Sitten und Bräuche
und seine Stellung zu Glauben und Kirche.

H. Größler: Führer durch das Unstrutthal von Artern bis Naumburg.


2 Teile. Freiburg a. d. U., 1892 u. 1893.
Mit besonderer Rücksicht auf Ortsgeschichte.

A. Trinius: Thüringer Wanderbuch. 6 Bde. Minden, 1886–1896.


Gemütvolle Schilderungen mit Beziehungen auf Ortsgeschichte und Sage.

Anding und Radefeld: Thüringen (Meyers Reisebücher). 13. Aufl.


Leipzig, 1896.
Das handlichste und zuverlässigste Reisebuch über den größten Teil Thüringens.

Specialkarte des Deutschen Reichs, 1 : 100000, published by the Prussian, Saxon, and Bavarian general staffs.
Our region requires 28 sheets, which vary greatly in quality according to the date of their production.

F. Beyschlag: Geognostische Übersichtskarte des Thüringerwalds, 1 : 100000, published by the Kgl. Preuß. Geologische Landesanstalt. Berlin, 1897.
The best geological map, indispensable for study.

R. Lepsius: Geologische Karte des Deutschen Reiches in 27 sheets, 1 : 500000. Gotha, 1897.
Our region appears on sheets 13, 14, 18, and 19; the geological presentation is very clear.

Andrees Handatlas, sheet Thüringen, 1 : 500000. Bielefeld and Leipzig, 1898.
The best small-scale general map, rich in content.
Transcriber's Notes
Inconsistent spellings were retained where both forms were in common use, such as:

anderen — andern
andererseits — anderseits
angewandt — angewendet
Buntsandstein-Scholle — Buntsandsteinscholle
Eichsfeldes — Eichsfelds
Frankenwaldes — Frankenwalds
Gebietes — Gebiets
Gerbersteines — Gerbersteins
gewerbefleißig — gewerbfleißig
Kemenate — Kemnate
Komthurei — Komturei
Mainthales — Mainthals
Massemühle — Massenmühle
slavische — slawische
Stufenlandes — Stufenlands
Tabakpfeifen — Tabakspfeifen
Thüringerwaldes — Thüringerwalds

Punctuation was corrected without note. The following changes were made in the text:

P. 9: »Pophyre« changed to »Porphyre«.
P. 10: »Bischofsitzen« changed to »Bischofssitzen«.
P. 12: »Christenthums« changed to »Christentums«.
P. 27: »Orlaugaus« changed to »Orlagaus«.
P. 42: »Müncheberger Gneisgebiet« changed to »Münchberger Gneisgebiet«.
P. 46: »gewundenenen« changed to »gewundenen«.
P. 47: »Hockerode« changed to »Hockeroda«.
P. 50: »mitleren« changed to »mittleren«.
P. 50: »Adlerberg« changed to »Adlersberg«.
P. 50: »Linie Mengersreuth-Steinach« changed to »Linie Mengersgereuth-Steinach«.
P. 52: »verhältniswäßig« changed to »verhältnismäßig«.
P. 55: »Sigmundsburger« changed to »Siegmundsburger«.
P. 56: »Sigmundsburg« changed to »Siegmundsburg«.
P. 56: »Adlerberg« changed to »Adlersberg«.
P. 68: »wiedergespiegelt« changed to »widergespiegelt«.
P. 86: »Inselwasser« changed to »Inselswasser«.
P. 90: »Wahrscheinlichleit« changed to »Wahrscheinlichkeit«.
P. 94: »Co-Coburg« changed to »Coburg«.
P. 98: »Länder« changed to »Ländler«.
P. 110: »in natürlichen« changed to »in natürlichem«.
P. 113: »und seine Genossen« changed to »er und seine Genossen«.
P. 114: »ein herzogliche« changed to »eine herzogliche«.
P. 117: »bei Zerstörung« changed to »bei der Zerstörung«.
P. 118: »J. M. Kraus« changed to »G. M. Kraus«.
P. 125: »entlang der Muschelkalkzug« changed to »entlang dem Muschelkalkzug«.
P. 126: »Tennstädt« changed to »Tennstedt«.
P. 127: »ehemaliege« changed to »ehemalige«.
P. 128: »Holzthalleben« changed to »Holzthaleben«.
P. 129: »Jahrundert« changed to »Jahrhundert«.
P. 134: »angepflanzt wurden« changed to »angepflanzt worden«.
P. 139: »Das Obere Eichsfeld« changed to »Das obere Eichsfeld«.
P. 143: »zerstört wurde« changed to »zerstört wurden«.
P. 146: »569 m Heiligen Berges« changed to »569 m hohen Heiligen Berges«.
*** END OF THE PROJECT GUTENBERG EBOOK THÜRINGEN ***

Updated editions will replace the previous one—the old editions will
be renamed.

Creating the works from print editions not protected by U.S.
copyright law means that no one owns a United States copyright in
these works, so the Foundation (and you!) can copy and distribute it
in the United States without permission and without paying
copyright royalties. Special rules, set forth in the General Terms of
Use part of this license, apply to copying and distributing Project
Gutenberg™ electronic works to protect the PROJECT GUTENBERG™
concept and trademark. Project Gutenberg is a registered trademark,
and may not be used if you charge for an eBook, except by following
the terms of the trademark license, including paying royalties for use
of the Project Gutenberg trademark. If you do not charge anything
for copies of this eBook, complying with the trademark license is
very easy. You may use this eBook for nearly any purpose such as
creation of derivative works, reports, performances and research.
Project Gutenberg eBooks may be modified and printed and given
away—you may do practically ANYTHING in the United States with
eBooks not protected by U.S. copyright law. Redistribution is subject
to the trademark license, especially commercial redistribution.

START: FULL LICENSE

THE FULL PROJECT GUTENBERG LICENSE
PLEASE READ THIS BEFORE YOU DISTRIBUTE OR USE THIS WORK

To protect the Project Gutenberg™ mission of promoting the free
distribution of electronic works, by using or distributing this work (or
any other work associated in any way with the phrase “Project
Gutenberg”), you agree to comply with all the terms of the Full
Project Gutenberg™ License available with this file or online at
www.gutenberg.org/license.

Section 1. General Terms of Use and Redistributing Project Gutenberg™
electronic works

1.A. By reading or using any part of this Project Gutenberg™
electronic work, you indicate that you have read, understand, agree
to and accept all the terms of this license and intellectual property
(trademark/copyright) agreement. If you do not agree to abide by all
the terms of this agreement, you must cease using and return or
destroy all copies of Project Gutenberg™ electronic works in your
possession. If you paid a fee for obtaining a copy of or access to a
Project Gutenberg™ electronic work and you do not agree to be
bound by the terms of this agreement, you may obtain a refund
from the person or entity to whom you paid the fee as set forth in
paragraph 1.E.8.

1.B. “Project Gutenberg” is a registered trademark. It may only be
used on or associated in any way with an electronic work by people
who agree to be bound by the terms of this agreement. There are a
few things that you can do with most Project Gutenberg™ electronic
works even without complying with the full terms of this agreement.
See paragraph 1.C below. There are a lot of things you can do with
Project Gutenberg™ electronic works if you follow the terms of this
agreement and help preserve free future access to Project
Gutenberg™ electronic works. See paragraph 1.E below.

1.C. The Project Gutenberg Literary Archive Foundation (“the
Foundation” or PGLAF), owns a compilation copyright in the
collection of Project Gutenberg™ electronic works. Nearly all the
individual works in the collection are in the public domain in the
United States. If an individual work is unprotected by copyright law
in the United States and you are located in the United States, we do
not claim a right to prevent you from copying, distributing,
performing, displaying or creating derivative works based on the
work as long as all references to Project Gutenberg are removed. Of
course, we hope that you will support the Project Gutenberg™
mission of promoting free access to electronic works by freely
sharing Project Gutenberg™ works in compliance with the terms of
this agreement for keeping the Project Gutenberg™ name associated
with the work. You can easily comply with the terms of this
agreement by keeping this work in the same format with its attached
full Project Gutenberg™ License when you share it without charge
with others.

1.D. The copyright laws of the place where you are located also
govern what you can do with this work. Copyright laws in most
countries are in a constant state of change. If you are outside the
United States, check the laws of your country in addition to the
terms of this agreement before downloading, copying, displaying,
performing, distributing or creating derivative works based on this
work or any other Project Gutenberg™ work. The Foundation makes
no representations concerning the copyright status of any work in
any country other than the United States.

1.E. Unless you have removed all references to Project Gutenberg:

1.E.1. The following sentence, with active links to, or other
immediate access to, the full Project Gutenberg™ License must
appear prominently whenever any copy of a Project Gutenberg™
work (any work on which the phrase “Project Gutenberg” appears,
or with which the phrase “Project Gutenberg” is associated) is
accessed, displayed, performed, viewed, copied or distributed:
This eBook is for the use of anyone anywhere in the United
States and most other parts of the world at no cost and with
almost no restrictions whatsoever. You may copy it, give it away
or re-use it under the terms of the Project Gutenberg License
included with this eBook or online at www.gutenberg.org. If you
are not located in the United States, you will have to check the
laws of the country where you are located before using this
eBook.

1.E.2. If an individual Project Gutenberg™ electronic work is derived
from texts not protected by U.S. copyright law (does not contain a
notice indicating that it is posted with permission of the copyright
holder), the work can be copied and distributed to anyone in the
United States without paying any fees or charges. If you are
redistributing or providing access to a work with the phrase “Project
Gutenberg” associated with or appearing on the work, you must
comply either with the requirements of paragraphs 1.E.1 through
1.E.7 or obtain permission for the use of the work and the Project
Gutenberg™ trademark as set forth in paragraphs 1.E.8 or 1.E.9.

1.E.3. If an individual Project Gutenberg™ electronic work is posted
with the permission of the copyright holder, your use and distribution
must comply with both paragraphs 1.E.1 through 1.E.7 and any
additional terms imposed by the copyright holder. Additional terms
will be linked to the Project Gutenberg™ License for all works posted
with the permission of the copyright holder found at the beginning
of this work.

1.E.4. Do not unlink or detach or remove the full Project
Gutenberg™ License terms from this work, or any files containing a
part of this work or any other work associated with Project
Gutenberg™.

1.E.5. Do not copy, display, perform, distribute or redistribute this
electronic work, or any part of this electronic work, without
prominently displaying the sentence set forth in paragraph 1.E.1
with active links or immediate access to the full terms of the Project
Gutenberg™ License.

1.E.6. You may convert to and distribute this work in any binary,
compressed, marked up, nonproprietary or proprietary form,
including any word processing or hypertext form. However, if you
provide access to or distribute copies of a Project Gutenberg™ work
in a format other than “Plain Vanilla ASCII” or other format used in
the official version posted on the official Project Gutenberg™ website
(www.gutenberg.org), you must, at no additional cost, fee or
expense to the user, provide a copy, a means of exporting a copy, or
a means of obtaining a copy upon request, of the work in its original
“Plain Vanilla ASCII” or other form. Any alternate format must
include the full Project Gutenberg™ License as specified in
paragraph 1.E.1.

1.E.7. Do not charge a fee for access to, viewing, displaying,
performing, copying or distributing any Project Gutenberg™ works
unless you comply with paragraph 1.E.8 or 1.E.9.

1.E.8. You may charge a reasonable fee for copies of or providing
access to or distributing Project Gutenberg™ electronic works
provided that:

• You pay a royalty fee of 20% of the gross profits you derive
from the use of Project Gutenberg™ works calculated using the
method you already use to calculate your applicable taxes. The
fee is owed to the owner of the Project Gutenberg™ trademark,
but he has agreed to donate royalties under this paragraph to
the Project Gutenberg Literary Archive Foundation. Royalty
payments must be paid within 60 days following each date on
which you prepare (or are legally required to prepare) your
periodic tax returns. Royalty payments should be clearly marked
as such and sent to the Project Gutenberg Literary Archive
Foundation at the address specified in Section 4, “Information
about donations to the Project Gutenberg Literary Archive
Foundation.”

• You provide a full refund of any money paid by a user who
notifies you in writing (or by e-mail) within 30 days of receipt
that s/he does not agree to the terms of the full Project
Gutenberg™ License. You must require such a user to return or
destroy all copies of the works possessed in a physical medium
and discontinue all use of and all access to other copies of
Project Gutenberg™ works.

• You provide, in accordance with paragraph 1.F.3, a full refund of
any money paid for a work or a replacement copy, if a defect in
the electronic work is discovered and reported to you within 90
days of receipt of the work.

• You comply with all other terms of this agreement for free
distribution of Project Gutenberg™ works.

1.E.9. If you wish to charge a fee or distribute a Project Gutenberg™
electronic work or group of works on different terms than are set
forth in this agreement, you must obtain permission in writing from
the Project Gutenberg Literary Archive Foundation, the manager of
the Project Gutenberg™ trademark. Contact the Foundation as set
forth in Section 3 below.

1.F.

1.F.1. Project Gutenberg volunteers and employees expend
considerable effort to identify, do copyright research on, transcribe
and proofread works not protected by U.S. copyright law in creating
the Project Gutenberg™ collection. Despite these efforts, Project
Gutenberg™ electronic works, and the medium on which they may
be stored, may contain “Defects,” such as, but not limited to,
incomplete, inaccurate or corrupt data, transcription errors, a
copyright or other intellectual property infringement, a defective or
damaged disk or other medium, a computer virus, or computer
codes that damage or cannot be read by your equipment.

1.F.2. LIMITED WARRANTY, DISCLAIMER OF DAMAGES - Except for
the “Right of Replacement or Refund” described in paragraph 1.F.3,
the Project Gutenberg Literary Archive Foundation, the owner of the
Project Gutenberg™ trademark, and any other party distributing a
Project Gutenberg™ electronic work under this agreement, disclaim
all liability to you for damages, costs and expenses, including legal
fees. YOU AGREE THAT YOU HAVE NO REMEDIES FOR
NEGLIGENCE, STRICT LIABILITY, BREACH OF WARRANTY OR
BREACH OF CONTRACT EXCEPT THOSE PROVIDED IN PARAGRAPH
1.F.3. YOU AGREE THAT THE FOUNDATION, THE TRADEMARK
OWNER, AND ANY DISTRIBUTOR UNDER THIS AGREEMENT WILL
NOT BE LIABLE TO YOU FOR ACTUAL, DIRECT, INDIRECT,
CONSEQUENTIAL, PUNITIVE OR INCIDENTAL DAMAGES EVEN IF
YOU GIVE NOTICE OF THE POSSIBILITY OF SUCH DAMAGE.

1.F.3. LIMITED RIGHT OF REPLACEMENT OR REFUND - If you
discover a defect in this electronic work within 90 days of receiving
it, you can receive a refund of the money (if any) you paid for it by
sending a written explanation to the person you received the work
from. If you received the work on a physical medium, you must
return the medium with your written explanation. The person or
entity that provided you with the defective work may elect to provide
a replacement copy in lieu of a refund. If you received the work
electronically, the person or entity providing it to you may choose to
give you a second opportunity to receive the work electronically in
lieu of a refund. If the second copy is also defective, you may
demand a refund in writing without further opportunities to fix the
problem.

1.F.4. Except for the limited right of replacement or refund set forth
in paragraph 1.F.3, this work is provided to you ‘AS-IS’, WITH NO
OTHER WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PURPOSE.

1.F.5. Some states do not allow disclaimers of certain implied
warranties or the exclusion or limitation of certain types of damages.
If any disclaimer or limitation set forth in this agreement violates the
law of the state applicable to this agreement, the agreement shall be
interpreted to make the maximum disclaimer or limitation permitted
by the applicable state law. The invalidity or unenforceability of any
provision of this agreement shall not void the remaining provisions.

1.F.6. INDEMNITY - You agree to indemnify and hold the Foundation,
the trademark owner, any agent or employee of the Foundation,
anyone providing copies of Project Gutenberg™ electronic works in
accordance with this agreement, and any volunteers associated with
the production, promotion and distribution of Project Gutenberg™
electronic works, harmless from all liability, costs and expenses,
including legal fees, that arise directly or indirectly from any of the
following which you do or cause to occur: (a) distribution of this or
any Project Gutenberg™ work, (b) alteration, modification, or
additions or deletions to any Project Gutenberg™ work, and (c) any
Defect you cause.

Section 2. Information about the Mission of Project Gutenberg™

Project Gutenberg™ is synonymous with the free distribution of
electronic works in formats readable by the widest variety of
computers including obsolete, old, middle-aged and new computers.
It exists because of the efforts of hundreds of volunteers and
donations from people in all walks of life.

Volunteers and financial support to provide volunteers with the
assistance they need are critical to reaching Project Gutenberg™’s
goals and ensuring that the Project Gutenberg™ collection will
remain freely available for generations to come. In 2001, the Project
Gutenberg Literary Archive Foundation was created to provide a
secure and permanent future for Project Gutenberg™ and future
generations. To learn more about the Project Gutenberg Literary
Archive Foundation and how your efforts and donations can help,
see Sections 3 and 4 and the Foundation information page at
www.gutenberg.org.

Section 3. Information about the Project Gutenberg Literary Archive Foundation

The Project Gutenberg Literary Archive Foundation is a non-profit
501(c)(3) educational corporation organized under the laws of the
state of Mississippi and granted tax exempt status by the Internal
Revenue Service. The Foundation’s EIN or federal tax identification
number is 64-6221541. Contributions to the Project Gutenberg
Literary Archive Foundation are tax deductible to the full extent
permitted by U.S. federal laws and your state’s laws.

The Foundation’s business office is located at 809 North 1500 West,
Salt Lake City, UT 84116, (801) 596-1887. Email contact links and up
to date contact information can be found at the Foundation’s website
and official page at www.gutenberg.org/contact

Section 4. Information about Donations to the Project Gutenberg Literary Archive Foundation

Project Gutenberg™ depends upon and cannot survive without
widespread public support and donations to carry out its mission of
increasing the number of public domain and licensed works that can
be freely distributed in machine-readable form accessible by the
widest array of equipment including outdated equipment. Many
small donations ($1 to $5,000) are particularly important to
maintaining tax exempt status with the IRS.

The Foundation is committed to complying with the laws regulating
charities and charitable donations in all 50 states of the United
States. Compliance requirements are not uniform and it takes a
considerable effort, much paperwork and many fees to meet and
keep up with these requirements. We do not solicit donations in
locations where we have not received written confirmation of
compliance. To SEND DONATIONS or determine the status of
compliance for any particular state visit www.gutenberg.org/donate.

While we cannot and do not solicit contributions from states where
we have not met the solicitation requirements, we know of no
prohibition against accepting unsolicited donations from donors in
such states who approach us with offers to donate.

International donations are gratefully accepted, but we cannot make
any statements concerning tax treatment of donations received from
outside the United States. U.S. laws alone swamp our small staff.

Please check the Project Gutenberg web pages for current donation
methods and addresses. Donations are accepted in a number of
other ways including checks, online payments and credit card
donations. To donate, please visit: www.gutenberg.org/donate.

Section 5. General Information About Project Gutenberg™ electronic works

Professor Michael S. Hart was the originator of the Project
Gutenberg™ concept of a library of electronic works that could be
freely shared with anyone. For forty years, he produced and
distributed Project Gutenberg™ eBooks with only a loose network of
volunteer support.

Project Gutenberg™ eBooks are often created from several printed
editions, all of which are confirmed as not protected by copyright in
the U.S. unless a copyright notice is included. Thus, we do not
necessarily keep eBooks in compliance with any particular paper
edition.

Most people start at our website which has the main PG search
facility: www.gutenberg.org.

This website includes information about Project Gutenberg™,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how
to subscribe to our email newsletter to hear about new eBooks.