Python for Data Analysis Data Wrangling with Pandas NumPy and IPython 1st Edition Wes Mckinney pdf download

The document provides information about the book 'Python for Data Analysis' by Wes McKinney, focusing on data wrangling using libraries like Pandas, NumPy, and IPython. It includes links to download the book and other related resources, along with a detailed table of contents outlining various topics covered in the book. The book is published by O'Reilly Media and is aimed at those interested in data analysis using Python.



Python for Data Analysis

Wes McKinney

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo


Python for Data Analysis
by Wes McKinney

Copyright © 2013 Wes McKinney. All rights reserved.


Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://my.safaribooksonline.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Julie Steele and Meghan Blanchette
Production Editor: Melanie Yarbrough
Copyeditor: Teresa Exley
Proofreader: BIM Publishing Services
Indexer: BIM Publishing Services
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest

October 2012: First Edition.

Revision History for the First Edition:


2012-10-05 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449319793 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Python for Data Analysis, the cover image of a golden-tailed tree shrew, and related
trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a
trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume
no responsibility for errors or omissions, or for damages resulting from the use of the information
contained herein.

ISBN: 978-1-449-31979-3

[LSI]

1349356084
Table of Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

1. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
What Is This Book About? 1
Why Python for Data Analysis? 2
Python as Glue 2
Solving the “Two-Language” Problem 2
Why Not Python? 3
Essential Python Libraries 3
NumPy 4
pandas 4
matplotlib 5
IPython 5
SciPy 6
Installation and Setup 6
Windows 7
Apple OS X 9
GNU/Linux 10
Python 2 and Python 3 11
Integrated Development Environments (IDEs) 11
Community and Conferences 12
Navigating This Book 12
Code Examples 13
Data for Examples 13
Import Conventions 13
Jargon 13
Acknowledgements 14

2. Introductory Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.usa.gov data from bit.ly 17
Counting Time Zones in Pure Python 19

Counting Time Zones with pandas 21
MovieLens 1M Data Set 26
Measuring rating disagreement 30
US Baby Names 1880-2010 32
Analyzing Naming Trends 36
Conclusions and The Path Ahead 43

3. IPython: An Interactive Computing and Development Environment . . . . . . . . . . . . 45


IPython Basics 46
Tab Completion 47
Introspection 48
The %run Command 49
Executing Code from the Clipboard 50
Keyboard Shortcuts 52
Exceptions and Tracebacks 53
Magic Commands 54
Qt-based Rich GUI Console 55
Matplotlib Integration and Pylab Mode 56
Using the Command History 58
Searching and Reusing the Command History 58
Input and Output Variables 58
Logging the Input and Output 59
Interacting with the Operating System 60
Shell Commands and Aliases 60
Directory Bookmark System 62
Software Development Tools 62
Interactive Debugger 62
Timing Code: %time and %timeit 67
Basic Profiling: %prun and %run -p 68
Profiling a Function Line-by-Line 70
IPython HTML Notebook 72
Tips for Productive Code Development Using IPython 72
Reloading Module Dependencies 74
Code Design Tips 74
Advanced IPython Features 76
Making Your Own Classes IPython-friendly 76
Profiles and Configuration 77
Credits 78

4. NumPy Basics: Arrays and Vectorized Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 79


The NumPy ndarray: A Multidimensional Array Object 80
Creating ndarrays 81
Data Types for ndarrays 83

Operations between Arrays and Scalars 85
Basic Indexing and Slicing 86
Boolean Indexing 89
Fancy Indexing 92
Transposing Arrays and Swapping Axes 93
Universal Functions: Fast Element-wise Array Functions 95
Data Processing Using Arrays 97
Expressing Conditional Logic as Array Operations 98
Mathematical and Statistical Methods 100
Methods for Boolean Arrays 101
Sorting 101
Unique and Other Set Logic 102
File Input and Output with Arrays 103
Storing Arrays on Disk in Binary Format 103
Saving and Loading Text Files 104
Linear Algebra 105
Random Number Generation 106
Example: Random Walks 108
Simulating Many Random Walks at Once 109

5. Getting Started with pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111


Introduction to pandas Data Structures 112
Series 112
DataFrame 115
Index Objects 120
Essential Functionality 122
Reindexing 122
Dropping entries from an axis 125
Indexing, selection, and filtering 125
Arithmetic and data alignment 128
Function application and mapping 132
Sorting and ranking 133
Axis indexes with duplicate values 136
Summarizing and Computing Descriptive Statistics 137
Correlation and Covariance 139
Unique Values, Value Counts, and Membership 141
Handling Missing Data 142
Filtering Out Missing Data 143
Filling in Missing Data 145
Hierarchical Indexing 147
Reordering and Sorting Levels 149
Summary Statistics by Level 150
Using a DataFrame’s Columns 150

Other pandas Topics 151
Integer Indexing 151
Panel Data 152

6. Data Loading, Storage, and File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155


Reading and Writing Data in Text Format 155
Reading Text Files in Pieces 160
Writing Data Out to Text Format 162
Manually Working with Delimited Formats 163
JSON Data 165
XML and HTML: Web Scraping 166
Binary Data Formats 171
Using HDF5 Format 171
Reading Microsoft Excel Files 172
Interacting with HTML and Web APIs 173
Interacting with Databases 174
Storing and Loading Data in MongoDB 176

7. Data Wrangling: Clean, Transform, Merge, Reshape . . . . . . . . . . . . . . . . . . . . . . . . 177


Combining and Merging Data Sets 177
Database-style DataFrame Merges 178
Merging on Index 182
Concatenating Along an Axis 185
Combining Data with Overlap 188
Reshaping and Pivoting 189
Reshaping with Hierarchical Indexing 190
Pivoting “long” to “wide” Format 192
Data Transformation 194
Removing Duplicates 194
Transforming Data Using a Function or Mapping 195
Replacing Values 196
Renaming Axis Indexes 197
Discretization and Binning 199
Detecting and Filtering Outliers 201
Permutation and Random Sampling 202
Computing Indicator/Dummy Variables 203
String Manipulation 205
String Object Methods 206
Regular expressions 207
Vectorized string functions in pandas 210
Example: USDA Food Database 212

8. Plotting and Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
A Brief matplotlib API Primer 219
Figures and Subplots 220
Colors, Markers, and Line Styles 224
Ticks, Labels, and Legends 225
Annotations and Drawing on a Subplot 228
Saving Plots to File 231
matplotlib Configuration 231
Plotting Functions in pandas 232
Line Plots 232
Bar Plots 235
Histograms and Density Plots 238
Scatter Plots 239
Plotting Maps: Visualizing Haiti Earthquake Crisis Data 241
Python Visualization Tool Ecosystem 247
Chaco 248
mayavi 248
Other Packages 248
The Future of Visualization Tools? 249

9. Data Aggregation and Group Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251


GroupBy Mechanics 252
Iterating Over Groups 255
Selecting a Column or Subset of Columns 256
Grouping with Dicts and Series 257
Grouping with Functions 258
Grouping by Index Levels 259
Data Aggregation 259
Column-wise and Multiple Function Application 262
Returning Aggregated Data in “unindexed” Form 264
Group-wise Operations and Transformations 264
Apply: General split-apply-combine 266
Quantile and Bucket Analysis 268
Example: Filling Missing Values with Group-specific Values 270
Example: Random Sampling and Permutation 271
Example: Group Weighted Average and Correlation 273
Example: Group-wise Linear Regression 274
Pivot Tables and Cross-Tabulation 275
Cross-Tabulations: Crosstab 277
Example: 2012 Federal Election Commission Database 278
Donation Statistics by Occupation and Employer 280
Bucketing Donation Amounts 283
Donation Statistics by State 285



10. Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Date and Time Data Types and Tools 290
Converting between string and datetime 291
Time Series Basics 293
Indexing, Selection, Subsetting 294
Time Series with Duplicate Indices 296
Date Ranges, Frequencies, and Shifting 297
Generating Date Ranges 298
Frequencies and Date Offsets 299
Shifting (Leading and Lagging) Data 301
Time Zone Handling 303
Localization and Conversion 304
Operations with Time Zone−aware Timestamp Objects 305
Operations between Different Time Zones 306
Periods and Period Arithmetic 307
Period Frequency Conversion 308
Quarterly Period Frequencies 309
Converting Timestamps to Periods (and Back) 311
Creating a PeriodIndex from Arrays 312
Resampling and Frequency Conversion 312
Downsampling 314
Upsampling and Interpolation 316
Resampling with Periods 318
Time Series Plotting 319
Moving Window Functions 320
Exponentially-weighted functions 324
Binary Moving Window Functions 324
User-Defined Moving Window Functions 326
Performance and Memory Usage Notes 327

11. Financial and Economic Data Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329


Data Munging Topics 329
Time Series and Cross-Section Alignment 330
Operations with Time Series of Different Frequencies 332
Time of Day and “as of” Data Selection 334
Splicing Together Data Sources 336
Return Indexes and Cumulative Returns 338
Group Transforms and Analysis 340
Group Factor Exposures 342
Decile and Quartile Analysis 343
More Example Applications 345
Signal Frontier Analysis 345
Future Contract Rolling 347



Rolling Correlation and Linear Regression 350

12. Advanced NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353


ndarray Object Internals 353
NumPy dtype Hierarchy 354
Advanced Array Manipulation 355
Reshaping Arrays 355
C versus Fortran Order 356
Concatenating and Splitting Arrays 357
Repeating Elements: Tile and Repeat 360
Fancy Indexing Equivalents: Take and Put 361
Broadcasting 362
Broadcasting Over Other Axes 364
Setting Array Values by Broadcasting 367
Advanced ufunc Usage 367
ufunc Instance Methods 368
Custom ufuncs 370
Structured and Record Arrays 370
Nested dtypes and Multidimensional Fields 371
Why Use Structured Arrays? 372
Structured Array Manipulations: numpy.lib.recfunctions 372
More About Sorting 373
Indirect Sorts: argsort and lexsort 374
Alternate Sort Algorithms 375
numpy.searchsorted: Finding elements in a Sorted Array 376
NumPy Matrix Class 377
Advanced Array Input and Output 379
Memory-mapped Files 379
HDF5 and Other Array Storage Options 380
Performance Tips 380
The Importance of Contiguous Memory 381
Other Speed Options: Cython, f2py, C 382

Appendix: Python Language Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433

Preface

The scientific Python ecosystem of open source libraries has grown substantially over
the last 10 years. By late 2011, I had long felt that the lack of centralized learning
resources for data analysis and statistical applications was a stumbling block for new
Python programmers engaged in such work. Key projects for data analysis (especially
NumPy, IPython, matplotlib, and pandas) had also matured enough that a book written
about them would likely not go out-of-date very quickly. Thus, I mustered the nerve
to embark on this writing project. This is the book that I wish existed when I started
using Python for data analysis in 2007. I hope you find it useful and are able to apply
these tools productively in your work.

Conventions Used in This Book


The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values
determined by context.

This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples


This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Python for Data Analysis by William
Wesley McKinney (O’Reilly). Copyright 2012 William McKinney, 978-1-449-31979-3.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.

Safari® Books Online


Safari Books Online (www.safaribooksonline.com) is an on-demand digital
library that delivers expert content in both book and video form from the
world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and
creative professionals use Safari Books Online as their primary resource for research,
problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for
organizations, government agencies, and individuals. Subscribers have access to thousands
of books, training videos, and prepublication manuscripts in one fully searchable
database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley
Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John
Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT
Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course
Technology, and dozens more. For more information about Safari Books Online, please visit
us online.

How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://oreil.ly/python_for_data_analysis.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

CHAPTER 1
Preliminaries

What Is This Book About?


This book is concerned with the nuts and bolts of manipulating, processing, cleaning,
and crunching data in Python. It is also a practical, modern introduction to scientific
computing in Python, tailored for data-intensive applications. This is a book about the
parts of the Python language and libraries you’ll need to effectively solve a broad set of
data analysis problems. This book is not an exposition on analytical methods using
Python as the implementation language.
When I say “data”, what am I referring to exactly? The primary focus is on structured
data, a deliberately vague term that encompasses many different common forms of
data, such as
• Multidimensional arrays (matrices)
• Tabular or spreadsheet-like data in which each column may be a different type
(string, numeric, date, or otherwise). This includes most kinds of data commonly
stored in relational databases or tab- or comma-delimited text files
• Multiple tables of data interrelated by key columns (what would be primary or
foreign keys for a SQL user)
• Evenly or unevenly spaced time series
This is by no means a complete list. Even though it may not always be obvious, a large
percentage of data sets can be transformed into a structured form that is more suitable
for analysis and modeling. If not, it may be possible to extract features from a data set
into a structured form. As an example, a collection of news articles could be processed
into a word frequency table which could then be used to perform sentiment analysis.
Most users of spreadsheet programs like Microsoft Excel, perhaps the most widely used
data analysis tool in the world, will not be strangers to these kinds of data.
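
The news-article example above can be sketched with the standard library alone. This is a minimal illustration, not a method taken from the book: the `articles` list and the `word_frequencies` helper are hypothetical names invented here, and the tokenization (lowercase runs of letters and apostrophes) is an assumption chosen for brevity.

```python
import re
from collections import Counter

# Hypothetical stand-ins for a collection of news articles.
articles = [
    "Markets rally as tech stocks surge",
    "Tech stocks slide after weak earnings",
    "Local markets steady; earnings season begins",
]


def word_frequencies(texts):
    """Count how often each lowercase word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


freq = word_frequencies(articles)

# Each (word, count) pair is one row of a two-column frequency table.
for word, count in freq.most_common(3):
    print(word, count)
```

The resulting word-to-count mapping is structured data in the sense used here: it could be loaded into a two-column table and passed on to further analysis.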

Why Python for Data Analysis?
For many people (myself among them), the Python language is easy to fall in love with.
Since its first appearance in 1991, Python has become one of the most popular dynamic
programming languages, along with Perl, Ruby, and others. Python and Ruby have
become especially popular in recent years for building websites using their numerous
web frameworks, like Rails (Ruby) and Django (Python). Such languages are often
called scripting languages as they can be used to write quick-and-dirty small programs,
or scripts. I don’t like the term “scripting language” as it carries a connotation that they
cannot be used for building mission-critical software. Among interpreted languages,
Python is distinguished by its large and active scientific computing community.
Adoption of Python for scientific computing in both industry applications and academic
research has increased significantly since the early 2000s.
For data analysis and interactive, exploratory computing and data visualization, Python
will inevitably draw comparisons with the many other domain-specific open source
and commercial programming languages and tools in wide use, such as R, MATLAB,
SAS, Stata, and others. In recent years, Python’s improved library support (primarily
pandas) has made it a strong alternative for data manipulation tasks. Combined with
Python’s strength in general purpose programming, it is an excellent choice as a single
language for building data-centric applications.

Python as Glue
Part of Python’s success as a scientific computing platform is the ease of integrating C,
C++, and FORTRAN code. Most modern computing environments share a similar set
of legacy FORTRAN and C libraries for doing linear algebra, optimization, integration,
fast Fourier transforms, and other such algorithms. The same story has held true for
many companies and national labs that have used Python to glue together 30 years’
worth of legacy software.
Most programs consist of small portions of code where most of the time is spent, with
large amounts of “glue code” that doesn’t run often. In many cases, the execution time
of the glue code is insignificant; effort is most fruitfully invested in optimizing the
computational bottlenecks, sometimes by moving the code to a lower-level language
like C.
In the last few years, the Cython project (http://cython.org) has become one of the
preferred ways of both creating fast compiled extensions for Python and also interfacing
with C and C++ code.

Solving the “Two-Language” Problem


In many organizations, it is common to research, prototype, and test new ideas using
a more domain-specific computing language like MATLAB or R, then later port those
ideas to be part of a larger production system written in, say, Java, C#, or C++. What
people are increasingly finding is that Python is a suitable language not only for doing
research and prototyping but also for building the production systems. I believe that
more and more companies will go down this path, as there are often significant
organizational benefits to having both scientists and technologists using the same set of
programmatic tools.

Why Not Python?


While Python is an excellent environment for building computationally intensive
scientific applications and most kinds of general purpose systems, there are a
number of uses for which Python may be less suitable.
As Python is an interpreted programming language, in general most Python code will
run substantially slower than code written in a compiled language like Java or C++. As
programmer time is typically more valuable than CPU time, many are happy to make
this tradeoff. However, in an application with very low latency requirements (for
example, a high-frequency trading system), the time spent programming in a lower-level,
lower-productivity language like C++ to achieve the maximum possible performance
might be time well spent.
Python is not an ideal language for highly concurrent, multithreaded applications,
particularly applications with many CPU-bound threads. The reason for this is that it has
what is known as the global interpreter lock (GIL), a mechanism which prevents the
interpreter from executing more than one Python bytecode instruction at a time. The
technical reasons for why the GIL exists are beyond the scope of this book, but as of
this writing it does not seem likely that the GIL will disappear anytime soon. While it
is true that in many big data processing applications, a cluster of computers may be
required to process a data set in a reasonable amount of time, there are still situations
where a single-process, multithreaded system is desirable.
This is not to say that Python cannot execute truly multithreaded, parallel code; that
code just cannot be executed in a single Python process. As an example, the Cython
project features easy integration with OpenMP, a C framework for parallel computing,
in order to parallelize loops and thus significantly speed up numerical algorithms.
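
To make the GIL discussion concrete, here is a small sketch that runs the same CPU-bound, pure-Python function serially and then across four threads. The `count_primes` function and the `limits` values are hypothetical, chosen only for illustration. On a standard CPython build with the GIL, the threaded version should return identical results but is not expected to run meaningfully faster, since only one thread executes Python bytecode at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound pure-Python work: count the primes below `limit`."""
    count = 0
    for n in range(2, limit):
        # all() over an empty range is True, so 2 and 3 count as prime.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


limits = [2000, 3000, 4000, 5000]

# Serial baseline: one call after another in the main thread.
serial = [count_primes(limit) for limit in limits]

# The same work spread over four threads. Under the GIL these threads
# take turns running bytecode rather than running it simultaneously.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(count_primes, limits))

print(serial == threaded)
```

One common workaround, assuming the work can be split into independent pieces, is process-based parallelism with the standard library's `multiprocessing` module or `concurrent.futures.ProcessPoolExecutor`, where each worker process has its own interpreter and its own GIL.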

Essential Python Libraries


For those who are less familiar with the scientific Python ecosystem and the libraries
used throughout the book, I present the following overview of each library.



NumPy
NumPy, short for Numerical Python, is the foundational package for scientific
computing in Python. The majority of this book will be based on NumPy and libraries built
on top of NumPy. It provides, among other things:
• A fast and efficient multidimensional array object ndarray
• Functions for performing element-wise computations with arrays or mathematical
operations between arrays
• Tools for reading and writing array-based data sets to disk
• Linear algebra operations, Fourier transform, and random number generation
• Tools for integrating C, C++, and Fortran code with Python
Beyond the fast array-processing capabilities that NumPy adds to Python, one of its
primary purposes with regard to data analysis is as the primary container for data to
be passed between algorithms. For numerical data, NumPy arrays are a much more
efficient way of storing and manipulating data than Python’s built-in data
structures. Also, libraries written in a lower-level language, such as C or Fortran, can
operate on the data stored in a NumPy array without copying any data.
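
A short sketch of the features listed above, assuming NumPy is installed and imported under the conventional alias `np`. The `prices` and `quantities` arrays are made-up illustration data. The last step uses a slice as a small in-Python analogue of the no-copy behavior described above: basic slicing produces a view onto the same memory rather than a copy.

```python
import numpy as np

# Made-up illustration data.
prices = np.array([16.99, 10.34, 21.01, 23.68])
quantities = np.array([2, 3, 3, 2])

# Element-wise arithmetic between two arrays, with no explicit Python loop.
totals = prices * quantities

# A reduction, and the same figure computed as a dot product.
grand_total = totals.sum()
dot_total = prices @ quantities

# Basic slicing returns a view on the same memory, not a copy.
first_two = prices[:2]
print(np.shares_memory(prices, first_two))
```

Because `first_two` is a view, assigning into it would also change `prices`; call `.copy()` when an independent array is needed.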

pandas
pandas provides rich data structures and functions designed to make working with
structured data fast, easy, and expressive. It is, as you will see, one of the critical
ingredients enabling Python to be a powerful and productive data analysis environment.
The primary object in pandas that will be used in this book is the DataFrame, a
two-dimensional tabular, column-oriented data structure with both row and column labels:
>>> frame
    total_bill   tip     sex  smoker  day    time  size
 1       16.99  1.01  Female      No  Sun  Dinner     2
 2       10.34  1.66    Male      No  Sun  Dinner     3
 3       21.01   3.5    Male      No  Sun  Dinner     3
 4       23.68  3.31    Male      No  Sun  Dinner     2
 5       24.59  3.61  Female      No  Sun  Dinner     4
 6       25.29  4.71    Male      No  Sun  Dinner     4
 7        8.77     2    Male      No  Sun  Dinner     2
 8       26.88  3.12    Male      No  Sun  Dinner     4
 9       15.04  1.96    Male      No  Sun  Dinner     2
10       14.78  3.23    Male      No  Sun  Dinner     2

pandas combines the high performance array-computing features of NumPy with the
flexible data manipulation capabilities of spreadsheets and relational databases (such
as SQL). It provides sophisticated indexing functionality to make it easy to reshape,
slice and dice, perform aggregations, and select subsets of data. pandas is the primary
tool that we will use in this book.
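
As a rough sketch of the selection and aggregation operations just mentioned, the first five rows of the listing above can be retyped by hand into a DataFrame. This assumes pandas is installed and imported under the conventional alias `pd`; only four of the columns are kept for brevity, and the derived `tip_pct` column is a name introduced here for illustration.

```python
import pandas as pd

# First five rows of the listing above, retyped by hand (four columns only).
frame = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
        "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
        "sex": ["Female", "Male", "Male", "Male", "Female"],
        "size": [2, 3, 3, 2, 4],
    },
    index=[1, 2, 3, 4, 5],
)

# Select a subset of rows with a boolean condition on one column.
big_parties = frame[frame["size"] >= 3]

# Aggregate: mean tip within each group.
mean_tip_by_sex = frame.groupby("sex")["tip"].mean()

# Derive a new column from existing ones (hypothetical column name).
frame["tip_pct"] = frame["tip"] / frame["total_bill"]

print(mean_tip_by_sex)
```

Note that the `size` column is accessed as `frame["size"]` rather than `frame.size`, because `size` is also the name of a DataFrame attribute.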

For financial users, pandas features rich, high-performance time series functionality
and tools well-suited for working with financial data. In fact, I initially designed pandas
as an ideal tool for financial data analysis applications.
For users of the R language for statistical computing, the DataFrame name will be
familiar, as the object was named after the similar R data.frame object. They are not
the same, however; the functionality provided by data.frame in R is essentially a strict
subset of that provided by the pandas DataFrame. While this is a book about Python, I
will occasionally draw comparisons with R as it is one of the most widely-used open
source data analysis environments and will be familiar to many readers.
The pandas name itself is derived from panel data, an econometrics term for
multidimensional structured data sets, and Python data analysis itself.

matplotlib
matplotlib is the most popular Python library for producing plots and other 2D data
visualizations. It was originally created by John D. Hunter (JDH) and is now maintained
by a large team of developers. It is well-suited for creating plots suitable for publication.
It integrates well with IPython (see below), thus providing a comfortable interactive
environment for plotting and exploring data. The plots are also interactive; you can
zoom in on a section of the plot and pan around the plot using the toolbar in the plot
window.

IPython
IPython is the component in the standard scientific Python toolset that ties everything
together. It provides a robust and productive environment for interactive and
exploratory computing. It is an enhanced Python shell designed to accelerate the writing,
testing, and debugging of Python code. It is particularly useful for interactively working
with data and visualizing data with matplotlib. IPython is usually involved with the
majority of my Python work, including running, debugging, and testing code.
Aside from the standard terminal-based IPython shell, the project also provides
• A Mathematica-like HTML notebook for connecting to IPython through a web
browser (more on this later).
• A Qt framework-based GUI console with inline plotting, multiline editing, and
syntax highlighting
• An infrastructure for interactive parallel and distributed computing
I will devote a chapter to IPython and how to get the most out of its features. I strongly
recommend using it while working through this book.

SciPy
SciPy is a collection of packages addressing a number of different standard problem
domains in scientific computing. Here is a sampling of the packages included:
• scipy.integrate: numerical integration routines and differential equation solvers
• scipy.linalg: linear algebra routines and matrix decompositions extending be-
yond those provided in numpy.linalg.
• scipy.optimize: function optimizers (minimizers) and root finding algorithms
• scipy.signal: signal processing tools
• scipy.sparse: sparse matrices and sparse linear system solvers
• scipy.special: wrapper around SPECFUN, a Fortran library implementing many
common mathematical functions, such as the gamma function
• scipy.stats: standard continuous and discrete probability distributions (density
functions, samplers, cumulative distribution functions), various statistical tests,
and more descriptive statistics
• scipy.weave: tool for using inline C++ code to accelerate array computations
Together NumPy and SciPy form a reasonably complete computational replacement
for much of MATLAB along with some of its add-on toolboxes.

Installation and Setup


Since everyone uses Python for different applications, there is no single solution for
setting up Python and required add-on packages. Many readers will not have a complete
scientific Python environment suitable for following along with this book, so here I will
give detailed instructions to get set up on each operating system. I recommend using
one of the following base Python distributions:
• Enthought Python Distribution: a scientific-oriented Python distribution from En-
thought (http://www.enthought.com). This includes EPDFree, a free base scientific
distribution (with NumPy, SciPy, matplotlib, Chaco, and IPython) and EPD Full,
a comprehensive suite of more than 100 scientific packages across many domains.
EPD Full is free for academic use but has an annual subscription for non-academic
users.
• Python(x,y) (http://pythonxy.googlecode.com): A free scientific-oriented Python
distribution for Windows.
I will be using EPDFree for the installation guides, though you are welcome to take
another approach depending on your needs. At the time of this writing, EPD includes
Python 2.7, though this might change at some point in the future. After installing, you
will have the following packages installed and importable:

• Scientific Python base: NumPy, SciPy, matplotlib, and IPython. These are all in-
cluded in EPDFree.
• IPython Notebook dependencies: tornado and pyzmq. These are included in EPD-
Free.
• pandas (version 0.8.2 or higher).
At some point while reading you may wish to install one or more of the following
packages: statsmodels, PyTables, PyQt (or equivalently, PySide), xlrd, lxml, basemap,
pymongo, and requests. These are used in various examples. Installing these optional
libraries is not necessary, and I would suggest waiting until you need them. For
example, installing PyQt or PyTables from source on OS X or Linux can be rather
arduous. For now, it’s most important to get up and running with the bare minimum:
EPDFree and pandas.
For information on each Python package and links to binary installers or other help,
see the Python Package Index (PyPI, http://pypi.python.org). This is also an excellent
resource for finding new Python packages.

To avoid confusion and to keep things simple, I am avoiding discussion
of more complex environment management tools like pip and virtualenv.
There are many excellent guides available for these tools on the
Internet.

Some users may be interested in alternate Python implementations, such
as IronPython, Jython, or PyPy. To make use of the tools presented in
this book, it is (currently) necessary to use the standard C-based Python
interpreter, known as CPython.

Windows
To get started on Windows, download the EPDFree installer from
http://www.enthought.com, which should be an MSI installer named like
epd_free-7.3-1-win-x86.msi. Run the installer and accept the default installation
location C:\Python27. If
you had previously installed Python in this location, you may want to delete it manually
first (or using Add/Remove Programs).
Next, you need to verify that Python has been successfully added to the system path
and that there are no conflicts with any prior-installed Python versions. First, open a
command prompt by going to the Start Menu and starting the Command Prompt ap-
plication, also known as cmd.exe. Try starting the Python interpreter by typing
python. You should see a message that matches the version of EPDFree you installed:
C:\Users\Wes>python
Python 2.7.3 |EPD_free 7.3-1 (32-bit)| (default, Apr 12 2012, 14:30:37) on win32
Type "credits", "demo" or "enthought" for more information.
>>>

If you see a message for a different version of EPD or it doesn’t work at all, you will
need to clean up your Windows environment variables. On Windows 7 you can start
typing “environment variables” in the programs search field and select Edit environment
variables for your account. On Windows XP, you will have to go to Control
Panel > System > Advanced > Environment Variables. On the window that pops up,
you are looking for the Path variable. It needs to contain the following two directory
paths, separated by semicolons:
C:\Python27;C:\Python27\Scripts

If you installed other versions of Python, be sure to delete any other Python-related
directories from both the system and user Path variables. After making a path alteration,
you have to restart the command prompt for the changes to take effect.
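The Path check described above can also be scripted. The following is a minimal sketch, not part of the EPD tooling: the helper name missing_path_entries is hypothetical, and the sample Path value is made up for illustration. On a real machine you would pass os.environ.get('PATH', '') instead of the hard-coded string.

```python
def missing_path_entries(path_value, required, sep=';'):
    """Return the required directories that are absent from a PATH-style string."""
    def norm(p):
        # Windows paths are case-insensitive and may carry a trailing slash.
        return p.strip().rstrip('\\/').lower()

    present = {norm(p) for p in path_value.split(sep) if p.strip()}
    return [d for d in required if norm(d) not in present]


# The two directories the text says must be on the Path.
required = [r'C:\Python27', r'C:\Python27\Scripts']

# Made-up Path value that is missing the Scripts directory.
sample = r'C:\Windows\system32;C:\Python27;C:\Tools'
print(missing_path_entries(sample, required))
```

An empty list means both directories are present; anything else names what still has to be added.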
Once you can launch Python successfully from the command prompt, you need to
install pandas. The easiest way is to download the appropriate binary installer from
http://pypi.python.org/pypi/pandas. For EPDFree, this should be pandas-0.9.0.win32-
py2.7.exe. After you run this, let’s launch IPython and check that things are installed
correctly by importing pandas and making a simple matplotlib plot:
C:\Users\Wes>ipython --pylab
Python 2.7.3 |EPD_free 7.3-1 (32-bit)|
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.


? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

Welcome to pylab, a matplotlib-based Python environment [backend: WXAgg].


For more information, type 'help(pylab)'.

In [1]: import pandas

In [2]: plot(arange(10))

If successful, there should be no error messages and a plot window will appear. You
can also check that the IPython HTML notebook can be successfully run by typing:
$ ipython notebook --pylab=inline

If you use the IPython notebook application on Windows and normally
use Internet Explorer, you will likely need to install and run Mozilla
Firefox or Google Chrome instead.

EPDFree on Windows contains only 32-bit executables. If you want or need a 64-bit
setup on Windows, using EPD Full is the most painless way to accomplish that. If you
would rather install from scratch and not pay for an EPD subscription, Christoph
Gohlke at the University of California, Irvine, publishes unofficial binary installers for
all of the book’s necessary packages (http://www.lfd.uci.edu/~gohlke/pythonlibs/) for 32-
and 64-bit Windows.

Apple OS X
To get started on OS X, you must first install Xcode, which includes Apple’s suite of
software development tools. The necessary component for our purposes is the gcc C
and C++ compiler suite. The Xcode installer can be found on the OS X install DVD
that came with your computer or downloaded from Apple directly.
Once you’ve installed Xcode, launch the terminal (Terminal.app) by navigating to
Applications > Utilities. Type gcc and press enter. You should hopefully see some-
thing like:
$ gcc
i686-apple-darwin10-gcc-4.2.1: no input files

Now you need to install EPDFree. Download the installer which should be a disk image
named something like epd_free-7.3-1-macosx-i386.dmg. Double-click the .dmg file to
mount it, then double-click the .mpkg file inside to run the installer.
When the installer runs, it automatically appends the EPDFree executable path to
your .bash_profile file. This is located at /Users/your_uname/.bash_profile:
# Setting PATH for EPD_free-7.3-1
PATH="/Library/Frameworks/Python.framework/Versions/Current/bin:${PATH}"
export PATH

Should you encounter any problems in the following steps, you’ll want to inspect
your .bash_profile and potentially add the above directory to your path.
Now, it’s time to install pandas. Execute this command in the terminal:
$ sudo easy_install pandas
Searching for pandas
Reading http://pypi.python.org/simple/pandas/
Reading http://pandas.pydata.org
Reading http://pandas.sourceforge.net
Best match: pandas 0.9.0
Downloading http://pypi.python.org/packages/source/p/pandas/pandas-0.9.0.zip
Processing pandas-0.9.0.zip
Writing /tmp/easy_install-H5mIX6/pandas-0.9.0/setup.cfg
Running pandas-0.9.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-H5mIX6/
pandas-0.9.0/egg-dist-tmp-RhLG0z
Adding pandas 0.9.0 to easy-install.pth file

Installed /Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/
site-packages/pandas-0.9.0-py2.7-macosx-10.5-i386.egg
Processing dependencies for pandas
Finished processing dependencies for pandas

To verify everything is working, launch IPython in Pylab mode and test importing pan-
das then making a plot interactively:

$ ipython --pylab
Python 2.7.3 |EPD_free 7.3-1 (32-bit)| (default, Apr 12 2012, 11:28:34)
Type "copyright", "credits" or "license" for more information.

IPython 0.12.1 -- An enhanced Interactive Python.


? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.

Welcome to pylab, a matplotlib-based Python environment [backend: WXAgg].


For more information, type 'help(pylab)'.

In [1]: import pandas

In [2]: plot(arange(10))

If this succeeds, a plot window with a straight line should pop up.

GNU/Linux

Some, but not all, Linux distributions include sufficiently up-to-date
versions of all the required Python packages, which can be installed using
the built-in package management tool, such as apt. I detail setup using
EPDFree as it's easily reproducible across distributions.

Linux details will vary a bit depending on your Linux flavor, but here I give details for
Debian-based GNU/Linux systems like Ubuntu and Mint. Setup is similar to OS X with
the exception of how EPDFree is installed. The installer is a shell script that must be
executed in the terminal. Depending on whether you have a 32-bit or 64-bit system,
you will either need to install the x86 (32-bit) or x86_64 (64-bit) installer. You will then
have a file named something similar to epd_free-7.3-1-rh5-x86_64.sh. To install it,
execute this script with bash:
$ bash epd_free-7.3-1-rh5-x86_64.sh

After accepting the license, you will be presented with a choice of where to put the
EPDFree files. I recommend installing the files in your home directory, say /home/wesm/
epd (substituting your own username for wesm).
Once the installer has finished, you need to add EPDFree’s bin directory to your
$PATH variable. If you are using the bash shell (the default in Ubuntu, for example), this
means adding the following path addition in your .bashrc:
export PATH=/home/wesm/epd/bin:$PATH

Obviously, substitute the installation directory you used for /home/wesm/epd/. After
doing this you can either start a new terminal process or execute your .bashrc again
with source ~/.bashrc.

You need a C compiler such as gcc to move forward; many Linux distributions include
gcc, but others may not. On Debian systems, you can install gcc by executing:
sudo apt-get install gcc

If you type gcc on the command line it should say something like:
$ gcc
gcc: no input files

Now, time to install pandas:


$ easy_install pandas

If you installed EPDFree as root, you may need to add sudo to the command and enter
the sudo or root password. To verify things are working, perform the same checks as
in the OS X section.
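If you would rather check the installation without launching IPython, a short script can report which of the required packages are importable. This is a sketch under two assumptions: it uses importlib.util.find_spec, which exists only in Python 3.4 and later (not in the Python 2.7 that EPDFree ships), and the helper name check_packages is hypothetical.

```python
import importlib.util


def check_packages(names):
    """Map each package name to True if Python can locate it for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# The minimum set of packages the text asks for.
required = ['numpy', 'scipy', 'matplotlib', 'IPython', 'pandas']

for name, found in check_packages(required).items():
    print('%-12s %s' % (name, 'ok' if found else 'MISSING'))
```

Any package reported as MISSING needs to be installed before the examples that use it will run.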

Python 2 and Python 3


The Python community is currently undergoing a drawn-out transition from the Python
2 series of interpreters to the Python 3 series. Until the appearance of Python 3.0, all
Python code was backwards compatible. The community decided that in order to move
the language forward, certain backwards incompatible changes were necessary.
I am writing this book with Python 2.7 as its basis, as the majority of the scientific
Python community has not yet transitioned to Python 3. The good news is that, with
a few exceptions, you should have no trouble following along with the book if you
happen to be using Python 3.2.
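Two of the incompatible changes are likely to surface early when following along on Python 3: print became a function, and / between two integers now returns a float. The sketch below is illustrative only and runs unchanged on either series; the variable names are my own.

```python
import sys

# True on the Python 3 series, False on Python 2.
PY3 = sys.version_info[0] >= 3

# Calling print with a single parenthesized argument works on both series.
print('Running Python %d.%d' % sys.version_info[:2])

half = 1 / 2         # 0 on Python 2, 0.5 on Python 3
floor_half = 1 // 2  # 0 on both: // is always floor division
```

Code in this book that uses the print statement, such as print x, needs to be written as print(x) on Python 3.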

Integrated Development Environments (IDEs)


When asked about my standard development environment, I almost always say “IPy-
thon plus a text editor”. I typically write a program and iteratively test and debug each
piece of it in IPython. It is also useful to be able to play around with data interactively
and visually verify that a particular set of data manipulations are doing the right thing.
Libraries like pandas and NumPy are designed to be easy-to-use in the shell.
However, some will still prefer to work in an IDE instead of a text editor. They do
provide many nice “code intelligence” features like completion or quickly pulling up
the documentation associated with functions and classes. Here are some that you can
explore:
• Eclipse with PyDev Plugin
• Python Tools for Visual Studio (for Windows users)
• PyCharm
• Spyder
• Komodo IDE

Community and Conferences
Outside of an Internet search, the scientific Python mailing lists are generally helpful
and responsive to questions. Some to take a look at are:
• pydata: a Google Group list for questions related to Python for data analysis and
pandas
• pystatsmodels: for statsmodels or pandas-related questions
• numpy-discussion: for NumPy-related questions
• scipy-user: for general SciPy or scientific Python questions
I deliberately did not post URLs for these in case they change. They can be easily located
via Internet search.
Each year many conferences are held all over the world for Python programmers. PyCon
and EuroPython are the two main general Python conferences in the United States and
Europe, respectively. SciPy and EuroSciPy are scientific-oriented Python conferences
where you will likely find many “birds of a feather” if you become more involved with
using Python for data analysis after reading this book.

Navigating This Book


If you have never programmed in Python before, you may actually want to start at the
end of the book, where I have placed a condensed tutorial on Python syntax, language
features, and built-in data structures like tuples, lists, and dicts. These things are con-
sidered prerequisite knowledge for the remainder of the book.
The book starts by introducing you to the IPython environment. Next, I give a short
introduction to the key features of NumPy, leaving more advanced NumPy use for
another chapter at the end of the book. Then, I introduce pandas and devote the rest
of the book to data analysis topics applying pandas, NumPy, and matplotlib (for vis-
ualization). I have structured the material in the most incremental way possible, though
there is occasionally some minor cross-over between chapters.
Data files and related material for each chapter are hosted as a git repository on GitHub:
http://github.com/pydata/pydata-book

I encourage you to download the data and use it to replicate the book’s code examples
and experiment with the tools presented in each chapter. I will happily accept contri-
butions, scripts, IPython notebooks, or any other materials you wish to contribute to
the book's repository for all to enjoy.

Code Examples
Most of the code examples in the book are shown with input and output as it would
appear executed in the IPython shell.
In [5]: code
Out[5]: output

At times, for clarity, multiple code examples will be shown side by side. These should
be read left to right and executed separately.
In [5]: code            In [6]: code2
Out[5]: output          Out[6]: output2

Data for Examples


Data sets for the examples in each chapter are hosted in a repository on GitHub:
http://github.com/pydata/pydata-book. You can download this data either by using the git
revision control command-line program or by downloading a zip file of the repository
from the website.
I have made every effort to ensure that it contains everything necessary to reproduce
the examples, but I may have made some mistakes or omissions. If so, please send me
an e-mail: wesmckinn@gmail.com.

Import Conventions
The Python community has adopted a number of naming conventions for commonly-
used modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

This means that when you see np.arange, this is a reference to the arange function in
NumPy. This is done as it’s considered bad practice in Python software development
to import everything (from numpy import *) from a large package like NumPy.
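To see why star imports cause trouble, it helps to look at two modules that define the same names. The sketch below uses the standard library's math and cmath as small stand-ins for a large package like NumPy; the helper public_names is hypothetical and assumes the modules do not define __all__, which is the case for these two.

```python
import math
import cmath


def public_names(module):
    """Names that a star import would copy out of a module without __all__."""
    return {name for name in dir(module) if not name.startswith('_')}


# Every name in this list would be silently overwritten if both
# modules were star-imported one after the other.
clashes = sorted(public_names(math) & public_names(cmath))
print(clashes)

# With explicit namespaces there is never any doubt which function runs.
print(math.sqrt(4))   # real-valued square root
print(cmath.sqrt(4))  # complex-valued square root
```

The np, pd, and plt prefixes serve the same purpose: the origin of every function stays visible at the point where it is called.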

Jargon
I’ll use some terms common both to programming and data science that you may not
be familiar with. Thus, here are some brief definitions:
Munge/Munging/Wrangling
Describes the overall process of manipulating unstructured and/or messy data into
a structured or clean form. The word has snuck its way into the jargon of many
modern day data hackers. Munge rhymes with “lunge”.

Pseudocode
A description of an algorithm or process that takes a code-like form while likely
not being actual valid source code.
Syntactic sugar
Programming syntax which does not add new features, but makes something more
convenient or easier to type.

Acknowledgements
It would have been difficult for me to write this book without the support of a large
number of people.
On the O’Reilly staff, I’m very grateful for my editors Meghan Blanchette and Julie
Steele who guided me through the process. Mike Loukides also worked with me in the
proposal stages and helped make the book a reality.
I received a wealth of technical review from a large cast of characters. In particular,
Martin Blais and Hugh White were incredibly helpful in improving the book’s exam-
ples, clarity, and organization from cover to cover. James Long, Drew Conway, Fer-
nando Pérez, Brian Granger, Thomas Kluyver, Adam Klein, Josh Klein, Chang She, and
Stéfan van der Walt each reviewed one or more chapters, providing pointed feedback
from many different perspectives.
I got many great ideas for examples and data sets from friends and colleagues in the
data community, among them: Mike Dewar, Jeff Hammerbacher, James Johndrow,
Kristian Lum, Adam Klein, Hilary Mason, Chang She, and Ashley Williams.
I am of course indebted to the many leaders in the open source scientific Python com-
munity who’ve built the foundation for my development work and gave encouragement
while I was writing this book: the IPython core team (Fernando Pérez, Brian Granger,
Min Ragan-Kelly, Thomas Kluyver, and others), John Hunter, Skipper Seabold, Travis
Oliphant, Peter Wang, Eric Jones, Robert Kern, Josef Perktold, Francesc Alted, Chris
Fonnesbeck, and too many others to mention. Several other people provided a great
deal of support, ideas, and encouragement along the way: Drew Conway, Sean Taylor,
Giuseppe Paleologo, Jared Lander, David Epstein, John Krowas, Joshua Bloom, Den
Pilsworth, John Myles-White, and many others I’ve forgotten.
I’d also like to thank a number of people from my formative years. First, my former
AQR colleagues who’ve cheered me on in my pandas work over the years: Alex Reyf-
man, Michael Wong, Tim Sargen, Oktay Kurbanov, Matthew Tschantz, Roni Israelov,
Michael Katz, Chris Uga, Prasad Ramanan, Ted Square, and Hoon Kim. Lastly, my
academic advisors Haynes Miller (MIT) and Mike West (Duke).
On the personal side, Casey Dinkin provided invaluable day-to-day support during the
writing process, tolerating my highs and lows as I hacked together the final draft on
top of an already overcommitted schedule. Lastly, my parents, Bill and Kim, taught me
to always follow my dreams and to never settle for less.

CHAPTER 2
Introductory Examples

This book teaches you the Python tools to work productively with data. While readers
may have many different end goals for their work, the tasks required generally fall into
a number of different broad groups:
Interacting with the outside world
Reading and writing with a variety of file formats and databases.
Preparation
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and
transforming data for analysis.
Transformation
Applying mathematical and statistical operations to groups of data sets to derive
new data sets. For example, aggregating a large table by group variables.
Modeling and computation
Connecting your data to statistical models, machine learning algorithms, or other
computational tools
Presentation
Creating interactive or static graphical visualizations or textual summaries
In this chapter I will show you a few data sets and some things we can do with them.
These examples are just intended to pique your interest and thus will only be explained
at a high level. Don’t worry if you have no experience with any of these tools; they will
be discussed in great detail throughout the rest of the book. In the code examples you’ll
see input and output prompts like In [15]:; these are from the IPython shell.

1.usa.gov data from bit.ly


In 2011, URL shortening service bit.ly partnered with the United States government
website usa.gov to provide a feed of anonymous data gathered from users who shorten
links ending with .gov or .mil. As of this writing, in addition to providing a live feed,
hourly snapshots are available as downloadable text files.1

In the case of the hourly snapshots, each line in each file contains a common form of
web data known as JSON, which stands for JavaScript Object Notation. For example,
if you read just the first line of a file, you may see something like:
In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'

In [16]: open(path).readline()
Out[16]: '{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11
(KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1,
"tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l":
"orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r":
"http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u":
"http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc":
1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'

Python has numerous built-in and third-party modules for converting a JSON string into
a Python dictionary object. Here I’ll use the json module and its loads function invoked
on each line in the sample file I downloaded:
import json
path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]

If you’ve never programmed in Python before, the last expression here is called a list
comprehension, which is a concise way of applying an operation (like json.loads) to a
collection of strings or other objects. Conveniently, iterating over an open file handle
gives you a sequence of its lines. The resulting object records is now a list of Python
dicts:
In [18]: records[0]
Out[18]:
{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like
Gecko) Chrome/17.0.963.78 Safari/535.11',
u'al': u'en-US,en;q=0.8',
u'c': u'US',
u'cy': u'Danvers',
u'g': u'A6qOVH',
u'gr': u'MA',
u'h': u'wfLQtf',
u'hc': 1331822918,
u'hh': u'1.usa.gov',
u'l': u'orofrog',
u'll': [42.576698, -70.954903],
u'nk': 1,
u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
u't': 1331923247,
u'tz': u'America/New_York',
u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}
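The line-by-line parsing used to build records can be tried without the sample file. In this sketch an in-memory io.StringIO object stands in for the open file, and the two JSON lines are made up for illustration; only the field names tz and c come from the real data.

```python
import io
import json

# Two made-up lines standing in for the sample file; each line is one JSON document.
fake_file = io.StringIO(
    u'{"tz": "America/New_York", "c": "US"}\n'
    u'{"tz": "Europe/London", "c": "GB"}\n'
)

# Iterating over a file-like object yields one line at a time,
# so the comprehension calls json.loads once per line.
records = [json.loads(line) for line in fake_file]

print(len(records))
print(records[0]['tz'])
```

With a real file, wrapping the comprehension in a with open(path) as f: block has the added benefit of closing the file as soon as the records are loaded.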

1. http://www.usa.gov/About/developer-resources/1usagov.shtml

Note that Python indices start at 0 and not 1 like some other languages (like R). It’s
now easy to access individual values within records by passing a string for the key you
wish to access:
In [19]: records[0]['tz']
Out[19]: u'America/New_York'

The u here in front of the quotation stands for unicode, a standard form of string en-
coding. Note that IPython shows the time zone string object representation here rather
than its print equivalent:
In [20]: print records[0]['tz']
America/New_York
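The difference between the two displays corresponds to the built-in functions repr and str, which you can call directly. A small sketch follows; note that on Python 3 every string is already unicode, so the representation carries no u prefix.

```python
tz = 'America/New_York'

# repr() gives the form the interactive shell echoes: quoted, with escapes visible.
shown_by_shell = repr(tz)

# str() gives the form that print writes: just the characters themselves.
shown_by_print = str(tz)

print(shown_by_shell)
print(shown_by_print)
```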

Counting Time Zones in Pure Python


Suppose we were interested in the most often-occurring time zones in the data set (the
tz field). There are many ways we could do this. First, let’s extract a list of time zones
again using a list comprehension:
In [25]: time_zones = [rec['tz'] for rec in records]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/wesm/book_scripts/whetting/<ipython> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]

KeyError: 'tz'

Oops! Turns out that not all of the records have a time zone field. This is easy to handle
as we can add the check if 'tz' in rec at the end of the list comprehension:
In [26]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]

In [27]: time_zones[:10]
Out[27]:
[u'America/New_York',
u'America/Denver',
u'America/New_York',
u'America/Sao_Paulo',
u'America/New_York',
u'America/New_York',
u'Europe/Warsaw',
u'',
u'',
u'']
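Filtering with if 'tz' in rec drops the records that lack the key. If you would rather keep one entry per record, dict.get with a default value is an alternative. The records below are made up to mirror the three cases in the data: a known time zone, a record without the field, and an empty value.

```python
records = [
    {'tz': 'America/New_York'},
    {'_heartbeat_': 1331923249},  # no 'tz' key at all
    {'tz': ''},                   # key present, value unknown
]

# Approach from the text: skip records that have no 'tz' key.
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

# Alternative: keep every record and substitute a placeholder for a missing key.
with_default = [rec.get('tz', 'Missing') for rec in records]

print(time_zones)
print(with_default)
```

Which one is preferable depends on whether the positions in the result need to line up with the original records.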

Just looking at the first 10 time zones we see that some of them are unknown (empty).
You can filter these out also but I’ll leave them in for now. Now, to produce counts by
time zone I’ll show two approaches: the harder way (using just the Python standard
library) and the easier way (using pandas). One way to do the counting is to use a dict
to store counts while we iterate through the time zones:
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

If you know a bit more about the Python standard library, you might prefer to write
the same thing more briefly:
from collections import defaultdict

def get_counts2(sequence):
    counts = defaultdict(int) # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts

I put this logic in a function just to make it more reusable. To use it on the time zones,
just pass the time_zones list:


In [31]: counts = get_counts(time_zones)

In [32]: counts['America/New_York']
Out[32]: 1251

In [33]: len(time_zones)
Out[33]: 3440

If we wanted the top 10 time zones and their counts, we have to do a little bit of dic-
tionary acrobatics:
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

We have then:
In [35]: top_counts(counts)
Out[35]:
[(33, u'America/Sao_Paulo'),
(35, u'Europe/Madrid'),
(36, u'Pacific/Honolulu'),
(37, u'Asia/Tokyo'),
(74, u'Europe/London'),
(191, u'America/Denver'),
(382, u'America/Los_Angeles'),
(400, u'America/Chicago'),
(521, u''),
(1251, u'America/New_York')]
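The swap-and-sort step in top_counts can be avoided by telling sorted which element of each pair to compare. This is one possible variation, not a replacement the text prescribes; the name top_counts_sorted is hypothetical, and the small dict reuses four of the counts shown above.

```python
from operator import itemgetter

counts = {
    'America/New_York': 1251,
    '': 521,
    'America/Chicago': 400,
    'Europe/London': 74,
}


def top_counts_sorted(count_dict, n=10):
    """Return the n largest (key, count) pairs, largest first."""
    # itemgetter(1) makes sorted compare on the count, not the key.
    return sorted(count_dict.items(), key=itemgetter(1), reverse=True)[:n]


print(top_counts_sorted(counts, n=2))
```

Unlike top_counts, this version keeps the pairs in (key, count) order and returns the largest count first.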

If you search the Python standard library, you may find the collections.Counter class,
which makes this task a lot easier:
In [49]: from collections import Counter

In [50]: counts = Counter(time_zones)

In [51]: counts.most_common(10)
Out[51]:
[(u'America/New_York', 1251),
(u'', 521),
(u'America/Chicago', 400),
(u'America/Los_Angeles', 382),
(u'America/Denver', 191),
(u'Europe/London', 74),
(u'Asia/Tokyo', 37),
(u'Pacific/Honolulu', 36),
(u'Europe/Madrid', 35),
(u'America/Sao_Paulo', 33)]
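The three counting approaches shown so far (a plain dict, a defaultdict, and a Counter) produce the same mapping. The following sketch checks this on a short made-up list of time zones.

```python
from collections import Counter, defaultdict

zones = ['America/New_York', '', 'America/New_York', 'Europe/London']

# Plain dict: supply the starting value by hand.
by_hand = {}
for z in zones:
    by_hand[z] = by_hand.get(z, 0) + 1

# defaultdict(int): missing keys start at 0 automatically.
with_default = defaultdict(int)
for z in zones:
    with_default[z] += 1

# Counter: the loop is built in.
with_counter = Counter(zones)

# All three hold the same key-to-count mapping.
same = by_hand == dict(with_default) == dict(with_counter)
print(same)
print(with_counter.most_common(1))
```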

Counting Time Zones with pandas


The main pandas data structure is the DataFrame, which you can think of as repre-
senting a table or spreadsheet of data. Creating a DataFrame from the original set of
records is simple:
In [289]: from pandas import DataFrame, Series

In [290]: import pandas as pd

In [291]: frame = DataFrame(records)

In [292]: frame
Out[292]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3560 entries, 0 to 3559
Data columns:
_heartbeat_ 120 non-null values
a 3440 non-null values
al 3094 non-null values
c 2919 non-null values
cy 2919 non-null values
g 3440 non-null values
gr 2919 non-null values
h 3440 non-null values
hc 3440 non-null values
hh 3440 non-null values
kw 93 non-null values
l 3440 non-null values
ll 2919 non-null values
nk 3440 non-null values
r 3440 non-null values
t 3440 non-null values
tz 3440 non-null values
u 3440 non-null values
dtypes: float64(4), object(14)

In [293]: frame['tz'][:10]
Out[293]:
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7
8
9
Name: tz

The output shown for the frame is the summary view, shown for large DataFrame ob-
jects. The Series object returned by frame['tz'] has a method value_counts that gives
us what we’re looking for:
In [294]: tz_counts = frame['tz'].value_counts()

In [295]: tz_counts[:10]
Out[295]:
America/New_York 1251
521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33

Then, we might want to make a plot of this data using the plotting library, matplotlib. You
can do a bit of munging to fill in a substitute value for unknown and missing time zone
data in the records. The fillna function can replace missing (NA) values and unknown
(empty strings) values can be replaced by boolean array indexing:
In [296]: clean_tz = frame['tz'].fillna('Missing')

In [297]: clean_tz[clean_tz == ''] = 'Unknown'

In [298]: tz_counts = clean_tz.value_counts()

In [299]: tz_counts[:10]
Out[299]:
America/New_York 1251
Unknown 521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Missing 120
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
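The two cleaning steps are easier to follow on a handful of values. This sketch assumes a reasonably current pandas installation and uses a made-up Series in which None plays the role of a missing value and the empty string plays the role of an unknown one.

```python
import pandas as pd

tz = pd.Series(['America/New_York', '', None, 'Europe/London', ''])

# Step 1: fillna replaces the missing (NA) entry and returns a new Series.
clean_tz = tz.fillna('Missing')

# Step 2: the comparison builds a boolean mask, and assigning through
# the mask overwrites only the entries where it is True.
clean_tz[clean_tz == ''] = 'Unknown'

counts = clean_tz.value_counts()
print(counts)
```

The order of the steps matters here only for readability; the missing entry and the empty strings are distinct, so neither step can affect the other's targets.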

Making a horizontal bar plot can be accomplished using the plot method on the
counts objects:
In [301]: tz_counts[:10].plot(kind='barh', rot=0)

See Figure 2-1 for the resulting figure. We’ll explore more tools for working with this
kind of data. For example, the a field contains information about the browser, device,
or application used to perform the URL shortening:
In [302]: frame['a'][1]
Out[302]: u'GoogleMaps/RochesterNY'

In [303]: frame['a'][50]
Out[303]: u'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'

In [304]: frame['a'][51]
Out[304]: u'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (K

Figure 2-1. Top time zones in the 1.usa.gov sample data

Parsing all of the interesting information in these “agent” strings may seem like a
daunting task. Luckily, once you have mastered Python’s built-in string functions and
regular expression capabilities, it is really not so bad. For example, we could split off
the first token in the string (corresponding roughly to the browser capability) and make
another summary of the user behavior:
In [305]: results = Series([x.split()[0] for x in frame.a.dropna()])

In [306]: results[:5]
Out[306]:
0 Mozilla/5.0
1 GoogleMaps/RochesterNY
2 Mozilla/4.0
3 Mozilla/5.0
4 Mozilla/5.0

In [307]: results.value_counts()[:8]
Out[307]:
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Opera/9.80 34
TEST_INTERNET_AGENT 24
GoogleProducer 21
Mozilla/6.0 5
BlackBerry8520/5.0.0.681 4
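
As a rough sketch of the token-splitting step, the following uses a few agent strings (the third is made up for illustration) and shows the list comprehension next to an equivalent formulation with the vectorized .str accessor:

```python
import pandas as pd

# A few agent strings; None plays the role of a record with no 'a' field
agents = pd.Series([
    'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2',
    'GoogleMaps/RochesterNY',
    None,
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)',  # invented example
])

# List comprehension, as in the session above: drop NA, split on
# whitespace, keep the first token
results = pd.Series([x.split()[0] for x in agents.dropna()])

# The same idea expressed with pandas string methods
same = agents.dropna().str.split().str[0]

print(results.value_counts())
```

The two results hold the same tokens; the .str version keeps the original index labels, while the list comprehension produces a fresh 0..n-1 index.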

Now, suppose you wanted to decompose the top time zones into Windows and non-
Windows users. As a simplification, let’s say that a user is on Windows if the string
'Windows' is in the agent string. Since some of the agents are missing, I’ll exclude these
from the data:
In [308]: cframe = frame[frame.a.notnull()]

We then want to compute a value indicating whether each row is Windows or not:
In [309]: operating_system = np.where(cframe['a'].str.contains('Windows'),
.....: 'Windows', 'Not Windows')

In [310]: operating_system[:5]
Out[310]:
0 Windows
1 Not Windows
2 Windows
3 Not Windows
4 Windows
Name: a
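
Here is a minimal sketch of the same labeling step on three invented agent strings. In current NumPy and pandas releases, np.where returns a plain NumPy array rather than a Series:

```python
import numpy as np
import pandas as pd

# Hypothetical agent strings (no missing values, mirroring cframe)
agents = pd.Series(['Mozilla/5.0 (Windows NT 5.1)',
                    'GoogleMaps/RochesterNY',
                    'Mozilla/5.0 (Macintosh)'])

# str.contains yields one boolean per row
is_windows = agents.str.contains('Windows')

# np.where picks the second argument where the condition is True
# and the third where it is False
operating_system = np.where(is_windows, 'Windows', 'Not Windows')
print(operating_system)
```

Because the substring test is so simple, it will also match agents that merely mention Windows; that is the simplification acknowledged in the text.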

Then, you can group the data by its time zone column and this new list of operating
systems:
In [311]: by_tz_os = cframe.groupby(['tz', operating_system])

The group counts, analogous to the value_counts function above, can be computed
using size. This result is then reshaped into a table with unstack:
In [312]: agg_counts = by_tz_os.size().unstack().fillna(0)

In [313]: agg_counts[:10]
Out[313]:
a Not Windows Windows
tz
245 276
Africa/Cairo 0 3
Africa/Casablanca 0 1
Africa/Ceuta 0 2
Africa/Johannesburg 0 1
Africa/Lusaka 0 1
America/Anchorage 4 1
America/Argentina/Buenos_Aires 1 0
America/Argentina/Cordoba 0 1
America/Argentina/Mendoza 0 1
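
The group, count, and reshape sequence can be sketched on a toy table. For simplicity this example stores the operating system label in a column named os (an assumption; the session above passes a separate array to groupby instead):

```python
import pandas as pd

# Hypothetical records: one row per shortened link
cframe = pd.DataFrame({
    'tz': ['America/New_York', 'America/New_York',
           'Europe/London', 'America/New_York'],
    'os': ['Windows', 'Not Windows', 'Windows', 'Windows'],
})

by_tz_os = cframe.groupby(['tz', 'os'])

# size() gives one count per (tz, os) pair; unstack() pivots the inner
# level (os) into columns; fillna(0) fills pairs that never occurred
agg_counts = by_tz_os.size().unstack().fillna(0)
print(agg_counts)
```

Combinations that never occur, such as Europe/London with Not Windows here, come out of unstack as NA, which is why the fillna(0) step is needed.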

Finally, let’s select the top overall time zones. To do so, I construct an indirect index
array from the row counts in agg_counts:
# Use to sort in ascending order
In [314]: indexer = agg_counts.sum(1).argsort()

In [315]: indexer[:10]
Out[315]:
tz
24
Africa/Cairo 20
Africa/Casablanca 21
Africa/Ceuta 92
Africa/Johannesburg 87
Africa/Lusaka 53
America/Anchorage 54
America/Argentina/Buenos_Aires 57
America/Argentina/Cordoba 26
America/Argentina/Mendoza 55

I then use take to select the rows in that order, then slice off the last 10 rows:
In [316]: count_subset = agg_counts.take(indexer)[-10:]

In [317]: count_subset
Out[317]:
a Not Windows Windows
tz
America/Sao_Paulo 13 20
Europe/Madrid 16 19
Pacific/Honolulu 0 36
Asia/Tokyo 2 35
Europe/London 43 31
America/Denver 132 59
America/Los_Angeles 130 252
America/Chicago 115 285
245 276
America/New_York 339 912
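
The indirect sort may be easier to follow on four rows. This sketch reuses a few counts from the tables above and converts the row totals to a NumPy array before calling argsort, so that the indexer is a plain array of positions:

```python
import pandas as pd

agg_counts = pd.DataFrame(
    {'Not Windows': [13, 339, 43, 0], 'Windows': [20, 912, 31, 3]},
    index=['America/Sao_Paulo', 'America/New_York',
           'Europe/London', 'Africa/Cairo'])

# Row totals: 33, 1251, 74, 3
totals = agg_counts.sum(axis=1)

# argsort returns the positions that would sort the totals ascending
indexer = totals.to_numpy().argsort()

# take reorders rows by position; the last rows are the largest groups
count_subset = agg_counts.take(indexer)[-2:]
print(count_subset)

# One possible alternative: select the labels of the largest totals
top = agg_counts.loc[totals.nlargest(2).index]
```

Note that the nlargest variant lists the largest group first, whereas the argsort variant leaves it last, which suits a horizontal bar plot where the final row is drawn at the top.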

As with the earlier time zone counts, this can be plotted as a bar plot; I’ll make
it a stacked bar plot by passing stacked=True (see Figure 2-2):
In [319]: count_subset.plot(kind='barh', stacked=True)

The plot doesn’t make it easy to see the relative percentage of Windows users in the
smaller groups, but the rows can easily be normalized to sum to 1 and then plotted again
(see Figure 2-3):
In [321]: normed_subset = count_subset.div(count_subset.sum(1), axis=0)

In [322]: normed_subset.plot(kind='barh', stacked=True)
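
The normalization itself is just a row-wise division. A small sketch, using two rows of counts from the table above:

```python
import pandas as pd

count_subset = pd.DataFrame(
    {'Not Windows': [43, 339], 'Windows': [31, 912]},
    index=['Europe/London', 'America/New_York'])

# One total per row
row_totals = count_subset.sum(axis=1)

# axis=0 aligns row_totals with the row labels, so each row is divided
# by its own total
normed_subset = count_subset.div(row_totals, axis=0)
print(normed_subset)
```

Passing axis=0 matters here: a plain count_subset / row_totals would try to align the totals with the column labels instead of the row labels.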


Figure 2-2. Top time zones by Windows and non-Windows users

Figure 2-3. Percentage Windows and non-Windows users in top-occurring time zones

All of the methods employed here will be examined in great detail throughout the rest
of the book.

MovieLens 1M Data Set


GroupLens Research (http://www.grouplens.org/node/73) provides a number of collections
of movie ratings data collected from users of MovieLens in the late 1990s and
early 2000s. The data provide movie ratings, movie metadata (genres and year), and
demographic data about the users (age, zip code, gender, and occupation). Such data
is often of interest in the development of recommendation systems based on machine
learning algorithms. While I will not be exploring machine learning techniques in great
detail in this book, I will show you how to slice and dice data sets like these into the
exact form you need.
The MovieLens 1M data set contains 1 million ratings collected from 6000 users on
4000 movies. It’s spread across 3 tables: ratings, user information, and movie information.
After extracting the data from the zip file, each table can be loaded into a pandas
DataFrame object using pandas.read_table:
import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None,
                      names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None,
                        names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None,
                       names=mnames)
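
If you do not have the files at hand, the parsing can be tried on an in-memory string. This is a sketch under a few assumptions: it uses pd.read_csv with engine='python' (multi-character separators are handled by the Python parsing engine in recent pandas releases), and it reads zip as a string column so that leading zeros survive:

```python
import io
import pandas as pd

# Two lines in the same '::'-delimited layout as users.dat
raw = io.StringIO("1::F::1::10::48067\n"
                  "4::M::45::7::02460\n")

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

users = pd.read_csv(raw, sep='::', header=None, names=unames,
                    engine='python',      # needed for the 2-character separator
                    dtype={'zip': str})   # assumption: keep zip codes as text
print(users)
```

With header=None and names=..., the first line of the input is treated as data rather than as column headers.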

You can verify that everything succeeded by looking at the first few rows of each
DataFrame with Python's slice syntax:
In [334]: users[:5]
Out[334]:
user_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455

In [335]: ratings[:5]
Out[335]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291

In [336]: movies[:5]
Out[336]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy

In [337]: ratings
Out[337]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
dtypes: int64(4)

Note that ages and occupations are coded as integers indicating groups described in
the data set’s README file. Analyzing the data spread across three tables is not a simple
task; for example, suppose you wanted to compute mean ratings for a particular movie
by sex and age. As you will see, this is much easier to do with all of the data merged
together into a single table. Using pandas’s merge function, we first merge ratings with
users, then merge that result with the movies data. pandas infers which columns to
use as the merge (or join) keys based on overlapping names:
In [338]: data = pd.merge(pd.merge(ratings, users), movies)

In [339]: data
Out[339]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
gender 1000209 non-null values
age 1000209 non-null values
occupation 1000209 non-null values
zip 1000209 non-null values
title 1000209 non-null values
genres 1000209 non-null values
dtypes: int64(6), object(4)

In [340]: data.ix[0]
Out[340]:
user_id 1
movie_id 1
rating 5
timestamp 978824268
gender F
age 1
occupation 10
zip 48067
title Toy Story (1995)
genres Animation|Children's|Comedy
Name: 0
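
To make the key inference concrete, here is a small sketch with three invented tables whose column names overlap in the same way as the MovieLens tables. It compares the inferred merge with one where the keys are spelled out:

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
ratings = pd.DataFrame({'user_id': [1, 1, 2],
                        'movie_id': [10, 20, 10],
                        'rating': [5, 3, 4]})
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies = pd.DataFrame({'movie_id': [10, 20],
                       'title': ['Movie A (1995)', 'Movie B (1996)']})

# Keys inferred from the overlapping names: user_id, then movie_id
data = pd.merge(pd.merge(ratings, users), movies)

# The same joins with the keys stated explicitly
explicit = ratings.merge(users, on='user_id').merge(movies, on='movie_id')
print(data)
```

Naming the keys with on= is a reasonable habit when tables share more columns than you intend to join on, since every overlapping name is otherwise used as a key.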


In this form, aggregating the ratings grouped by one or more user or movie attributes
is straightforward once you build some familiarity with pandas. To get mean movie
ratings for each film grouped by gender, we can use the pivot_table method:
In [341]: mean_ratings = data.pivot_table('rating', rows='title',
.....: cols='gender', aggfunc='mean')

In [342]: mean_ratings[:5]
Out[342]:
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
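
A pivot table of this kind can be checked by hand on a few rows. The sketch below uses invented ratings; note that recent pandas releases name the row and column arguments index and columns, where the session above uses rows and cols:

```python
import pandas as pd

# Hypothetical merged data: one row per rating
data = pd.DataFrame({
    'title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 2, 4],
})

# One row per title, one column per gender, each cell the mean rating
mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')
print(mean_ratings)
```

For Movie A the F cell is the mean of 4 and 5, that is 4.5, and the M cell is the single rating 3.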

This produced another DataFrame containing mean ratings with movie titles as row
labels and gender as column labels. First, I’m going to filter down to movies that
received at least 250 ratings (a completely arbitrary number); to do this, I group the data
by title and use size() to get a Series of group sizes for each title:
In [343]: ratings_by_title = data.groupby('title').size()

In [344]: ratings_by_title[:10]
Out[344]:
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616

In [345]: active_titles = ratings_by_title.index[ratings_by_title >= 250]

In [346]: active_titles
Out[346]:
Index(['burbs, The (1989), 10 Things I Hate About You (1999),
101 Dalmatians (1961), ..., Young Sherlock Holmes (1985),
Zero Effect (1998), eXistenZ (1999)], dtype=object)
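
The filtering pattern is sketched here on invented data with a threshold of 2 instead of 250. The final selection uses .loc, since the .ix indexer has been removed from recent pandas releases:

```python
import pandas as pd

# Hypothetical merged data: titles A, B, C with 3, 1, and 2 ratings
data = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'rating': [4, 3, 5, 2, 5, 1],
})
mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')

# Number of ratings per title
ratings_by_title = data.groupby('title').size()

# Keep only the index labels whose count meets the threshold
active_titles = ratings_by_title.index[ratings_by_title >= 2]

# Label-based row selection with the surviving titles
active_means = mean_ratings.loc[active_titles]
print(active_means)
```

Title B drops out because it has a single rating; A and C remain.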

The index of titles receiving at least 250 ratings can then be used to select rows from
mean_ratings above:
In [347]: mean_ratings = mean_ratings.ix[active_titles]

In [348]: mean_ratings
Out[348]:
<class 'pandas.core.frame.DataFrame'>
Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999)

FOSSIL ELEPHANTS AND THEIR ALLIES.
The Proboscidea, represented, as we have already seen, by two
species only among living animals, both of which are met with in and
near the tropical regions of the Old World, in the fossil state are met
with over nearly the whole of the Old World, and of the New; and
are divided into three genera—Elephas, Mastodon, and Dinotherium.
The teeth and bones of these creatures found in Europe were
assigned in the sixteenth, seventeenth, and eighteenth centuries to
giants, and many are the stories which were commonly reported
about them—as, for example, that of the giant of Dauphiné, in the
reign of Louis XIV. His remains were discovered by a surgeon, who
stated that they were enclosed in an enormous sepulchre covered
with a stone slab, bearing the inscription Teutobochus rex; and that
in the vicinity there were also found coins or medals, all of which
showed the remains to be those of a giant king of the Cimbri, who
fought against Marius. However, the original owner of these bones,
though not of the coins, was proved to have been an Elephant.
The story of Teutobochus is even excelled by that of another
giant, called the giant of Lucerne, whose remains when dug up were
examined by a celebrated Professor of Basle, who described them as
of human origin, and was skilful enough to put them together so as
to resemble a giant no less than twenty-six feet high. For some time
the deluded people of Lucerne paid homage to this Elephantine
prodigy, until the scales were removed from their eyes by
Blumenbach, who pronounced to their astonished senses that the
giant, as it lay in state at the Jesuits’ College, was but the skeleton
of an Elephant.
The Tertiary or third great period into which the geologists divide
the life history of the earth consists of the following divisions:—
Eocene, Miocene, Pliocene, Pleistocene, Prehistoric, and Historic, and
it is in the Pliocene stage that the Elephant first appears in Europe
and America.
The large, straight-tusked Elephant (E. meridionalis), with large
grinders composed of thick and coarse plates, is found ranged over
the whole of France, Italy, Britain, and Germany in those times, in
company with another narrow-toothed species, also with straight
tusks, described by Dr. Falconer under the name of Elephas
antiquus.
By far the best known and most important of these huge
creatures is the far-famed MAMMOTH (Elephas primigenius). This
Elephant has been found frozen in Siberian soil beautifully
preserved, with the hair and tissues in so good a condition that
microscopical sections have been made of them.
The story of finding the first Mammoth imbedded in ice has been
often told, but is still of sufficient interest to be related again. A
Tungoosian fisherman, named Schumachoff, about the year 1799,
was proceeding, as is the custom of fishermen in those parts when
fishing proves a failure, along the shores of the Lena in quest of
Mammoth tusks, which have been there found in considerable
abundance. During his rambles, having gone farther than he had
done before, he suddenly came face to face with a huge Mammoth
imbedded in clear ice. This extraordinary sight seems to have filled
him with astonishment and awe; for instead of at once profiting by
the fortunate discovery, he allowed several years to roll on before he
summoned courage to approach it closely, although it was his habit
to make stealthy journeys occasionally to the object of his wonder.
At length, seeing, it is presumed, the terrific monster made no signs
of eating him up, and that its tusks would bring him a considerable
sum of money, he allowed the hope of gain to overcome his
superstitious scruples. He boldly broke the barrier of ice, chopped off
the tusks, and left the carcass to the mercy of the Wolves and Bears,
who, finding it palatable, soon reduced the huge creature to a
skeleton. Some two years afterwards a man of science was on the
scent, and although so late in at the death, found a huge skeleton
with three legs, the eyes still in the orbits, and the brain uninjured in
the skull.

SKELETON OF MAMMOTH.

In addition to the peculiarity of the Mammoth having its body
covered with long woolly hair, it was also remarkable for the
extraordinary formation of its enormous tusks, which curved
upwards, forming a spiral.
The eminent Siberian explorer, Dr. Middendorf, in 1843, met with
a second instance of the Mammoth being preserved to such a
degree that the bulb of the eye is now in the same museum as the
skeleton of a Mammoth found by Mr. Adams in 1803. Middendorf
found it in latitude 66° 30′ N., between the Obi and the Yenisei near
the Arctic Circle. In the same year he also found a young animal of
the same species in beds of sand and gravel, at about fifteen feet
above the level of the sea, near the river Taimyr, in latitude 75° 15′,
associated with marine shells of living Arctic species, as well as with
the trunk of the larch. But the fourth, and by far the most important,
discovery of a Mammoth is described by an eye-witness of its
unearthing, and the record is so valuable in its bearings that we give
it at some length. A young Russian engineer, Benkendorf by name,
employed by the Government in a survey of the coast of the mouth
of the Lena and Indighirka, was despatched up the latter stream, in
1846, in command of a small iron steam-cutter. He writes the
following account, which we translate, to a friend in Germany:—
“In 1846 there was uncommon warm weather in the north of
Siberia. Already in May unusual rains poured over the moors and
bogs, storms shook the earth, and the streams carried not only ice
to the sea, but also large tracts of land, thawed by the masses of
warm water fed by the southern rains.... We steamed on the first
favourable day up the Indighirka; but there were no thoughts of
land. We saw around us only a sea of dirty brown water, and knew
the river only by the rushing and roaring of the stream. The river
rolled against us trees, moss, and large masses of peat, so that it
was only with great trouble and danger we could proceed. At the
end of the second day, we were only about forty versts [one verst =
1,166½ yards English] up the stream. Some one had to stand with
the sounding-rod in hand continually, and the boat received so many
shocks that it shuddered to the keel. A wooden vessel would have
been smashed. Around us we saw nothing but the flooded land. For
eight days we met with the like hindrances, until at last we reached
the place where our Yakuts were to have met us. Farther up was a
place called Ujandina, whence the people were to have come to us,
but they were not there, prevented evidently by the floods. As we
had been here in former years we knew the place. But how it had
changed! The Indighirka, here about three versts wide, had torn up
the land and worn itself a fresh channel, and when the waters sank
we saw to our astonishment that the old river-bed had become
merely that of an insignificant stream. This allowed me to cut
through the soft earth, and we went reconnoitring up the new
stream which had worn its way westwards. Afterwards we landed on
the new shore, and surveyed the undermining and destructive
operation of the wild waters, that carried away with extraordinary
rapidity masses of soft peat and loam. It was then that we made a
wonderful discovery. The land on which we were treading was
moorland, covered thickly with young plants. Many lovely flowers
rejoiced the eye in the warm beams of the sun, that shone for
twenty-two out of the twenty-four hours. The stream rolled over and
tore up the soft wet ground like chaff, so that it was dangerous to go
near the brink. While we were all quiet, we suddenly heard under
our feet a sudden gurgling and stirring, which betrayed the working
of the disturbed water. Suddenly our jäger [hunter], ever on the
look-out, called loudly, and pointed to a singular and unshapely
object, which rose and sank through the disturbed waters. I had
already remarked it, but not given it my attention, considering it only
drift wood. Now we all hastened to the spot on the shore, had the
boat drawn near, and waited until the mysterious thing should again
show itself. Our patience was tried, but at last, a black, horrible,
giant-like mass was thrust out of the water, and we beheld a colossal
Elephant’s head, armed with mighty tusks, with its long trunk
moving in the water, in an unearthly manner, as though seeking for
something lost therein. Breathless with astonishment, I beheld the
monster hardly twelve feet from me, with his half-open eyes yet
showing the whites. It was still in good preservation.
“‘A Mammoth! a Mammoth!’ broke out the Tschermomori, and I
shouted ‘Here quickly! chains and ropes!’ I will pass over our
preparations for securing the giant animal, whose body the water
was trying to bear from us. As the animal again sank we waited for
an opportunity to throw the ropes over his neck. This was only
accomplished after many efforts. For the rest we had no cause for
anxiety, for after examining the ground I satisfied myself that the
hind legs of the Mammoth still stuck in the earth, and that the water
would work for us to unloosen them. We therefore fastened a rope
round his neck, threw a chain round his tusks, that were eight feet
long, drove a stake into the ground about twenty feet from the
shore, and made chain and rope fast to it. The day went by quicker
than I thought for, but still the time seemed long before the animal
was secured, as it was only after the lapse of twenty-four hours that
the waters had loosened it. But the position of the animal was
interesting to me; it was standing in the earth, and not lying on its
side or back as a dead animal naturally would, indicating by this the
manner of its destruction. The soft peat or marsh land on which he
stepped thousands of years ago gave way by the weight of the
giant, and he sank as he stood on it feet foremost, incapable of
saving himself, and a severe frost came and turned him into ice, as
well as the moor which had buried him; the latter, however, grew
and flourished, every summer renewing itself; possibly the
neighbouring stream had heaped plants and sand over the dead
body. God only knows what causes had worked for its preservation;
now, however, the stream had once more brought it to the light of
day, and I, an ephemera of life compared with this primeval giant,
was sent here by heaven just at the right time to welcome him. You
can imagine how I jumped for joy.... Picture to yourself an Elephant
with a body covered with thick fur, about thirteen feet in height, and
fifteen in length, with tusks eight feet long, thick and curving
outwards at their ends, a stout trunk of six feet in length, colossal
limbs of one foot and a half in thickness, and a tail naked up to the
end, which was covered with thick tufty hair. The animal was fat and
well grown; death had overtaken him in the fulness of his powers.
His parchment-like, large, naked ears lay fearfully turned up over the
head; about the shoulders and the back he had stiff hair, about a
foot in length, like a mane. The long outer hair was deep brown and
coarsely rooted. The top of the head looked so wild, and was so
penetrated with pitch, that it resembled the rind of an old oak-tree.
On the sides it was cleaner, and under the outer hair there appeared
everywhere a wool, very soft, warm, and thick, and of a yellow
brown colour. The giant was well protected against the cold. The
whole appearance of the animal was fearfully wild and strange. It
had not the shape of our present Elephants. As compared with our
Indian Elephants, its head was rough, the brain case low and
narrow, but the trunk and mouth were much larger. The teeth were
very powerful. Our Elephant is an awkward animal, but compared
with this Mammoth it is as an Arabian steed to a coarse ugly Dray-
horse. I could not divest myself of a feeling of fear as I approached
the head; the broken, widely-opened eyes gave the animal an
appearance of life, as though it might move in a moment and
destroy us with a roar.... The bad smell of the body warned us that it
was time to save of it what we could, and the swelling flood, too,
bade us hasten. First of all we cut off the tusks, and sent them to
the cutter. Then the people tried to hew the head off, but,
notwithstanding their good will, this was slow work. As the belly of
the animal was cut open the intestines rolled out, and then the smell
was so dreadful that I could not overcome my nauseousness, and
was obliged to turn away. But I had the stomach separated and
brought on one side. It was well filled, and the contents instructive
and well preserved. The principal were young shoots of the fir and
pine; a quantity of young fir cones also, in a chewed state, were
mixed with the mass.”
This most graphic account affords a key for the solution of
several problems hitherto unknown. It is clear that the animal must
have been buried where it died, and that it was not transported from
any place farther up stream to the south, where the climate is
comparatively temperate. The presence of fir-spikes in the stomach
proves that it fed on the vegetation which is now found at the
northern part of the woods, as they join the low, desolate, treeless,
moss-covered tundra, in which the body lay buried, a fact that would
necessarily involve the conclusion that the climate of Siberia, in
those ancient days, differed but slightly from that of the present
time. Before this discovery, the food of the Mammoth had not been
known by direct evidence. The circumstances under which it was
brought to light enable us to see how animal remains could be
entombed in the frozen soil without undergoing decomposition,
which Baron Cuvier and Dr. Buckland agreed in accounting for by a
sudden cataclysm, and Sir Charles Lyell by the hypothesis of their
having been swept down by floods from the temperate into the
arctic zone. In this particular case, the marsh must have been
sufficiently soft to admit of the Mammoth sinking in; while shortly
after death the temperature must have been lowered, so as to arrest
decomposition, up to the very day on which the body arose under
the eyes of M. Benkendorf, in the exceptionally warm year of 1846,
when the tundra was thawed to a most unusual depth, and
converted into a morass permeable by water. Had any Mammoths
been alive in that year, and had they strayed beyond the limits of the
woods into the tundra, some would in all human probability have
been engulfed; and, when once covered up, the normal cold of
winter would suffice to prevent the thaw of the carcases, except in
extraordinary seasons, such as that in which this one was
discovered. Probably many such warm summers intervened since its
death, but as it was preserved from the air, they would not
accelerate putrefaction to any great degree. In this way the problem
of its entombment and preservation may be solved by an appeal to
the present climatal conditions of Siberia. The difficulty of accounting
for such vast quantities of remains in the Arctic Ocean, especially in
the Läckhow Islands off the mouth of the Lena, is also explained by
this discovery, as well as the association of marine shells with the
remains of the Mammoth. The body was swept away by the swollen
flood of the Indighirka, along with many other waifs and strays, and
no doubt by this time is adding to the vast accumulation in the Arctic
Sea. It was seen by a mere chance, and must be viewed as an
example of the method by which animal remains are swept seaward.
In all probability, the frozen morass in which it was discovered is as
full of Mammoths as the peat-bogs of Ireland are of Irish Elk, and
has been the main source from which the Arctic rivers have
obtained their supply of animal remains. The remains of the
Mammoth are met with in incredible numbers in the river deposits of
Middle and Northern Europe, as well as in those of North America,
showing that in ancient times the animal ranged over a tract of land
extending from the Mediterranean to the Arctic Sea, and from
Behring Strait to the Gulf of Mexico. It is also met with in the caves
in Middle Europe, having been dragged into them by the Hyænas, or
having fallen a prey to the ancient hunter. We owe, indeed, to the
skill of the latter an incisive sketch of the animal as he appeared to
the inhabitants of Auvergne, in the remote geological period known
as Pleistocene; the long, hairy mane, and spirally-curved tusks, are
faithfully depicted by the artist, and, were it not for the strange
chance which has preserved to us the whole animal in the frozen
ice-cliffs of Siberia, would have seemed to us merely imaginative
details. In another example, also from the caves of Auvergne, the
Mammoth is represented with his mouth open, and his trunk lifted
up in the attitude of charging.

MAMMOTH (Restored).

Remains of other extinct species of Elephants are found; one,
which is of exceedingly small stature, standing not much higher than
from two and a half to three feet, has been discovered in the bone-
caves of Malta. The genus MASTODON, which in many respects
resembles the true Elephants, differs from them in the formation of
the teeth, the grinders being much simpler, more tubercular, and
with crowns free from cement. In most cases, also, there were two
small tusks in the lower jaw, as well as those in the upper. In Europe
they appear in the Miocene and Pliocene strata, and in America they
survived into the Pleistocene. The most extraordinary-looking,
perhaps, of the fossil Proboscidea, and that furthest removed from
the living Elephants, is the DINOTHERIUM, of the Miocene age. It
possessed no tusks in the upper jaw, but its lower jaw was armed
with two long curved tusks, projecting downwards. It probably
possessed the habits of the Elephant, and these tusks may have
been used for uprooting trees, or hooking down boughs, so as to
obtain the leaves and shoots for food.

W. BOYD DAWKINS.
H. W. OAKLEY.

ORDER HYRACOIDEA (CONIES).
What is the Coney?—Mention in the Bible—General Appearance—Real
Place—Range—Varieties—Coney of the Bible—Cape Coney—Ashkoko
of Abyssinia—Mr. Winwood Reade’s Account of the Habits of the Cape
Coney—Skull, Dentition, Ribs, &c.

THE order of animals known to naturalists as Hyracoidea (derived
from the Greek ὕραξ, a Shrew, and εἶδος, form) contains but one
genus, called Hyrax. Belonging to this genus are but two or three
species of small animals, which, however, are of considerable
interest, both from their peculiar organisation, and from their
mention four times in the Bible under the name of Shaphan,
improperly translated Coney, which has given rise to considerable
controversy, as to what animal was meant. Some persons
considered, and naturally enough, that Coney meant nothing more
or less than the Rabbit; but now no doubt exists, as has been shown
from its characters and habits, that the animal referred to is the
Daman, or Hyrax syriacus.
The following are the passages literally rendered, in which the
Hyrax is mentioned in the Bible: “Likewise the Coney, because he
cheweth the cud, and divideth not the hoof; he shall be unclean
unto you” (Leviticus xi. 5). “But these ye shall not eat of them that
chew the cud, and of them that divide and cleave the hoof only; the
Camel, nor the Hare, nor the Coney; for they chew the cud, but
divide not the hoof; therefore they shall be unclean unto you”
(Deuteronomy xiv. 7). “The high mountains are for the Goats; the
rocks are a refuge for the Conies” (Psalms civ. 18). “The Conies are
but a feeble folk, yet make they their houses in the rocks” (Proverbs
xxx. 26). With regard to the first passage, although the Hyrax
certainly does not chew the cud, the peculiar way in which it moves
its jaws, as it sits perched in a ruminating manner, so to speak, on
some ledge of rock, would naturally suggest to the ignorant that it
really was chewing the cud. In the third quotation, we read “the
rocks are a refuge for the Conies.” This exactly suits the Hyrax,
which is always found inhabiting rocky situations. The last extract
also agrees with the known habits of the Hyrax. Here it is alluded to
as being one of the four animals on earth who are small, but very
wise. These four are the Ant, the Locust, the Spider, and the Coney.
All travellers who have noticed the Hyrax are agreed that it is a most
wary and crafty animal, and that the utmost caution is required even
to obtain a view of it; and to kill one requires a most skilful and
practised sportsman.
The Hyrax is a little animal clothed with a brownish fur, of about
the size of an ordinary Rabbit, to which, indeed, it has some
resemblance. It is allied to the Rhinoceros, the Tapir, and Rodents;
but the whole form of the skeleton approaches more nearly to that
of the two former than it does to any known species of the latter.
Linnæus, however, and other authors, classed it with the Rodents;
but Cuvier, seeing that it more nearly approached the characters of
the old group of animals called Pachydermata (thick-skinned
animals), placed it with them. Now, however, it is assigned by Prof.
Huxley to an order of its own named Hyracoidea; but it still is a
doubtful question as to what should be done with it.
Of the several animals forming the genus, one, the Hyrax
syriacus, the Coney of the Bible, is found from the coast of the Red
Sea northwards through Syria, by Lebanon, and southwards into
Arabia and Ethiopia. Another species, Hyrax capensis, the Cape
Coney, is found at the Cape and east coast of Africa, extending from
Abyssinia down the east coast southwards. Two other species are
described from West Africa; but both probably belong to one genus.
Bruce, in his “Travels in Abyssinia,” tells us that the Ashkoko,
which is understood to be the same as the Daman (Hyrax syriacus),
is found in Ethiopia, in the caverns of the rocks, and under the large
stones in the Mountain of the Sun, behind the queen’s palace at
Koscam. He also informs us that it is of common occurrence in many
other rocky places of Abyssinia, and he says that it does not make
holes like Rabbits or Rats, because its toes are not adapted for so
doing, and that it is a very timid and gentle creature, stealing along
a few paces, and then stopping, as if to see that the coast is clear.
Bruce also states that apparently the same species inhabits
Mount Libanus, and the rocks of Cape Mohammed, which divides the
Elanitic from the Heroopolitic Gulf, or Gulf of Suez from that of
Akabah, and that the only difference he saw was in the greater size
and fatness of those of the Mountain of the Sun.

CONIES.

“The Hyrax capensis,” writes Mr. Reade, “is found living at the
Cape of Good Hope, inhabiting the hollows and caves of the rocks,
both on the hill-sides and on the sea shore, a little above high-water
mark. It seems to live in families, and in its wild state is remarkably
shy. In the cold weather it is fond of coming out of its hole and
warming itself in the sun on the side of a rock, and in summer it
enjoys the breeze on the top of the hills, but in both instances, as
well as when it feeds, a sentinel is always placed on the look-out,
generally an old male, which gives notice of any approach of danger
by a long shrill cry.
“Its principal food is the young tops of shrubs, especially those
which are aromatic, but it also eats herbs, grass, and the tops of
flowers. To eat it tastes much like a Rabbit. It is recorded that one
gentleman caught two young ones which he kept for some time.
They became very tame, and as they were allowed the run of the
house would follow him about, jump on to his lap, or creep into his
bed for the sake of the warmth. One brought home by Mr. Hennah
would also run inquisitively about the cabins, climbing up and
examining every person and thing, but startled by any noise, it
would run away and hide itself. When shut up for long, it became
savage and snarled and tried to bite at everything that came in its
way. This animal, both when wild as well as when tame, is very
cleanly in its habits. From its faintly crying in its sleep it may be
supposed that it dreams. It has also been heard to chew its food at
night. When tame it will eat a variety of things, the leaves of plants,
bruised Indian corn, raw potatoes, bread, and onions, and will
greedily lick up salt. The one brought home by Mr. Hennah was very
sensible of the cold, for when a candle was placed near its cage, it
would come as close as possible to the bars, and sit still to receive
as much warmth as it could. I am inclined to think that the female
does not produce more than two young ones at a time, from having
observed in several instances but two following the old ones. Its
name at the Cape is the Dasse, which is, I believe, the Dutch for a
Badger.”
SKULL OF CONEY.

In structure, the skull of the Hyrax approaches more nearly to
that of the Ungulata (animals with hoofs), especially to that of the
Rhinoceros, than it does to that of any of the Rodents. The nose of
the Hyrax, however, not having any horn to support, the nasal bones
are not thickened, as they are in the Rhinoceros. There is a marked
distinction between the maxillary, or upper jaw-bones of the Hyrax
and those of the Rodents, the extent of the former being much
smaller. In the former, also, there are two parietal bones, as
compared with one in the latter. The joint, or condyle of the lower
jaw, differs from that of the Rodents, in which it is compressed
longitudinally, while in the Hyrax it is compressed transversely, as in
the Ungulata, being also applied to a plane surface of the temporal
bone, whereby a motion more or less horizontal is permitted. The
Hyrax has no canine teeth. The upper incisors resemble those of
Rabbits and Hares in number, which are four in the adult, and those
of Rodents generally in the possession of persistent pulps. In shape
they approach more to the form of the canines of the Hippopotamus
by terminating in a point. The number of lower incisors is also four,
and they are procumbent somewhat like those of the Hog. The
grinders, both in number and form, resemble those of the
Rhinoceros.
DENTITION OF CONEY.

With regard to the number of ribs, the Hyrax approaches nearer
to the Ungulata and Proboscidea than it does to the Rodents. It
departs from the former in the number of the vertebræ and form of
the pelvis; but again approaches them in the form of the femora
(thigh bones), and also in the formation of the feet; the toes are
four in front and three behind, as in the Tapir, and they are supplied
with hoofs, or rounded hoof-like nails. They are without collar-bones
(clavicles). The body of the Hyrax is covered with thick hair, which is
here and there beset with bristles, and the tail is represented by a
mere tubercle. No remains of the Hyrax have yet been found in a
fossil state.

W. BOYD DAWKINS.
H. W. OAKLEY.
KIANG, OR WILD ASS OF TIBET.

ORDER UNGULATA (HOOFED QUADRUPEDS).

CHAPTER I.
PERISSODACTYLA—THE EQUIDÆ, OR HORSE FAMILY.
Order UNGULATA—Divisions—PERISSODACTYLA—Characteristics—EQUIDÆ
—Species—Descent—First Domestic Horses in Europe—Used for Food
—Mention of the Horse in the Bible—War-Chariots—The Horse among
the Greeks and Romans—In Britain—Attempts to Improve the Breed
—Colour—Teeth—“The Mark”—The Foot—Skull—Disease from the
Gad-fly—RACE-HORSE—TROTTING HORSE OF AMERICA—DRAY HORSE—
SHETLAND PONY—ARAB AND BARB—PERSIAN HORSE—WILD HORSES IN AMERICA
—Habits—Byron’s “Mazeppa”—Capture and Breaking in—WILD HORSES
IN AUSTRALIA—THE ASS—Species—Stripes—Characteristics—MULE AND
HINNY—WILD ASS OF TIBET—ONAGER—WILD ASS OF ABYSSINIA—ZEBRAS—
BURCHELL’S ZEBRA—QUAGGA—FOSSIL EQUIDÆ—Distribution—HIPPARION.

THE hoofed quadrupeds are so called because they possess hoofs,
from which fact the order Ungulata takes its name,[260] and they
include animals of widely different appearance, such as the Horse,
Rhinoceros, Giraffe, Camel, and the like. They are classified into two
sub-orders, according to the odd or even number of toes, those
having an odd number on the hind foot being termed the
Perissodactyla,[261] such as the Horse, Tapir, and Rhinoceros; and
the Artiodactyla,[262] or animals with an even number of toes on
their hind feet, such as the Pig, Hippopotamus, Sheep, Ox, Deer, and
the like. All the animals belonging to the order feed upon vegetables,
with the exception of the Pig and Peccary, which are omnivorous;
and none of them are provided with sharp-edged cutting back teeth,
adapted for dividing flesh, such as are found in the Carnivora—Lions,
Tigers, Wolves, and Hyænas. The odd-toed Ungulates come first.
SUB-ORDER PERISSODACTYLA.
The odd-toed animals consist of three living families—(1) The
Equidæ, or Horses; (2) the Tapiridæ, or Tapirs; (3) the
Rhinocerotidæ, or Rhinoceroses; and two extinct families—(1) the
Palæotheridæ, or Palæotheres (παλαιός, old; θηρίον, beast); and (2)
the Macraucheniadæ (μακρός, long; αὐχήν, neck). In all the animals
belonging to the group, the number of dorso-lumbar vertebræ is not
fewer than twenty-two, the third or middle digit of each foot is
symmetrical, the femur or thigh-bone has a third trochanter, or knob
of bone on the outer side, and the two facets on the front of the
astragalus or ankle-bone are very unequal. When the head is
provided with horns, they are skin-deep only, without a core of
bone, and they are always placed in the middle line of the skull, as
in the Rhinoceros.
In the Perissodactyla the number of toes is reduced to a
minimum. Supposing, for example, we compare the foot of a Horse
with one of our own hands, we shall see that those parts which
correspond with the thumb and little finger are altogether absent,
while that which corresponds with the middle finger is largely
developed, and with its hoof, the equivalent of our nail, constitutes
the whole foot. The small splint bones, however, resting behind the
principal bone of the foot represent those portions (metacarpals) of
the second and fourth digits which extend from the wrist to the
fingers properly so called, and are to be viewed as traces of a foot
composed of three toes in an ancestral form of the Horse, which we
shall discuss presently. In the Tapir, the hind foot is composed of
three well-developed toes, corresponding to the three middle toes in
man, and in the Rhinoceros both feet are provided with three toes
formed of the same three digits. In the extinct Palæotherium also,
the foot is constituted very much as in the Rhinoceros.
FAMILY I.—EQUIDÆ, THE HORSE-TRIBE.

TARPAN.

The Equidæ, or Horse-tribe, comprise several living and many
extinct species. Three living members are restricted, in a state of
nature, to Asia and Africa, and are divided into the true Horses,
which have horny patches or callosities on the inner sides of both
pairs of limbs—above the wrist in the fore, and on the inner side of
the metatarsus on the hind limbs—and the Asses, which possess
such callosities only on the fore-limb. With the latter are classed the
Zebras and the Quaggas. All the existing and some of the extinct
members of the family, are characterised by the feet being formed of
one perfectly developed digit or toe only, the others being present in
a rudimentary shape as the splint-bones. In the extinct Hipparion,
however, and Anchithere, as we shall see presently, the accessory
toes are well developed.

WILD HORSE OF TARTARY.


The true Horses are represented by one well-established species,
Equus caballus, from which all the other races, or varieties, are
descended, by a process of selection under the care of man, and
these vary in size, proportion of parts, and colour, as much as any
two closely-allied species of wild animals can be said to be defined
from each other. According to Mr. Darwin, no aboriginal or truly Wild
Horse is positively known to exist, for the Wild Horses of the East
may probably be descended from those which have escaped from
the service of man. In all probability the wild animal has been
exterminated by the hand of man in those countries which it
formerly inhabited, and in which it has left its remains to attest its
former presence.
The Tarpan and Wild Horse of Tartary, which are to be found in
thousands in the great treeless plains, present us with the nearest
examples of the stock from which the Domestic Horses were
probably derived. They are mouse-coloured, with a stripe along
the back. The best and strongest of these are caught by the Tartars
by the aid of the lasso, and by the help of Falcons, which are trained
to settle on the Horse’s head, and flutter their wings, so as to take
its attention away from the approaching hunter.
The first Domestic Horses known in Europe were introduced at a
very early period, long before the dawn of history, in the period
known by the archæologists as that of polished stone, or that in
which man had not yet acquired a knowledge of the metals bronze
or iron. They are met with in the ruins of those wonderful pile
dwellings, which lie at the bottom of the Swiss lakes, in association
with the remains of the Pig, Sheep, Goat, Short-horned Ox, large Ox
of the Urus type, and Dog, and evidently belonged to a race of
farmers, by whom they were introduced into Europe. Bones occur in
the camps, sepulchres, and habitations of this age, throughout the
whole of the Continent, and of Great Britain and Ireland. In all
probability they were used at this time not for riding or for driving,
but for food. In the succeeding, or bronze age, however, they were
employed for purposes of riding, as may be seen from the discovery
of the bronze bits, which have been met with in France and Italy.
They were probably introduced by a race of nomads, who no doubt
brought Horses with them from the steppes of Central Asia.
According to Colonel Hamilton Smith, “so little is known of the
primitive seat of civilisation—the original centre, perhaps, in Bactria,
in the higher valleys of the Oxus, or in Cashmere, whence
knowledge radiated to China, India, and Egypt—that it may be
surmised that the first domestication of the post-diluvian Horse was
achieved in Central Asia, or commenced nearly simultaneously in
several regions where the wild animals of the Horse form existed.”
The Horse was universally used for food by man before the
historic period, and would be used now in Europe more generally
than it is, were it not for an edict of the Church in the eighth
century. During the Roman occupation of Britain, it formed a large
part of the diet of the inhabitants; by the Scandinavians it was eaten
in honour of Odin. As Christianity prevailed over the heathen
worship, it was banished from the table. It appears, however, that it
was used in England as late as the year 787, after it had been
prohibited in Eastern Europe. The ecclesiastic rule, however, was not
always obeyed, for the monks of St. Gall, in Switzerland, not only ate
Horse-flesh in the eleventh century, but returned thanks for it in a
metrical grace, which has survived to our times on account of its
elegance and beauty.
It is somewhat remarkable that the Horse is, with few
exceptions, mentioned in the Bible only in connection with war, and
that there is a wonderful absence of detail with regard to its nature
and habits otherwise than for the purposes it served in battle. That
the Horse spoken of in Scripture was nearly identical with the Arab
Horse of to-day there can be little doubt, if we examine the various
sculptures and paintings which are handed down to us, and which
speak of the faded glories of Egypt and Assyria. The first account we
have of the Horse is during the famine in Egypt that was foretold by
Joseph, and here we find that it was evidently an Egyptian animal.
“And they brought their cattle unto Joseph; and Joseph gave them
bread in exchange for Horses, and for the flocks, and for the cattle
of the herds, and for the Asses; and he fed them with bread for all
their cattle for that year.”
The courage and fiery nature of the Arab Horse, particularly
fitting it for use in the wars of ancient times, were evidently well
understood. In the Book of Job (xxxix. 19–25) we read:—“Hast thou
given the Horse strength? hast thou clothed his neck with thunder?
Canst thou make him afraid as a grasshopper? the glory of his
nostrils is terrible. He paweth in the valley, and rejoiceth in his
strength: he goeth on to meet the armed men. He mocketh at fear,
and is not affrighted; neither turneth he back from the sword. The
quiver rattleth against him, the glittering spear and the shield. He
swalloweth the ground with fierceness and rage: neither believeth
he that it is the sound of the trumpet. He saith among the trumpets,
Ha, ha! and he smelleth the battle afar off, the thunder of the
captains, and the shouting.”
The Hebrews in the patriarchal age did not require Horses, and
for a long time after their settlement in Canaan did not use them,
probably partly on account of the nature of the country, which was
hilly, and partly because they were prohibited on account of their
hostility to the Egyptians. The Horses of the kings David and
Solomon were derived from Egypt. In the reign of the latter, a Horse
was worth 150 shekels of silver, and a chariot six hundred. The
former was the first to establish a force of cavalry and chariots.
From the very earliest ages known to the historian in Egypt and
Assyria, Horses were used for purposes of war, and were yoked in
pairs, and sometimes in threes, to the war-chariots in which the
kings and great captains rode. They are generally depicted with
upright or Hog manes. Horsemen were also employed by both
nations, but they were evidently not thought so important as Horses
and chariots for warlike purposes.
In the earlier period of Greek history, and in Homeric times, the
art of riding was utterly unknown to the Greeks, for if a man was
seen on horseback he was supposed to be a Centaur. Down to 500
B.C. riding was not practised by the Greeks, although it was well
known to the Barbarians. As we get close to the year mentioned, we
hear of Persian cavalry; for instance, the great question with regard
to the battle of Marathon (490 B.C.) is, What were the Persian cavalry
doing? And at the same period we find that cavalry had become an
important arm in Northern Greece. Throughout all the times of
Greek pre-eminence, Horses were mainly used for the purpose of
the chariot. The utmost care and attention were devoted to their
breeding, and the greatest expense incurred in the maintenance of a
stud, which was a luxury possible only to the very richest persons,
and almost entirely beyond the means of private individuals. The
greatest horsekeepers, and consequently winners in the chariot-
races, were almost entirely princes and ruling families.
After 450 B.C. we begin to hear of riding and of cavalry in Greece
proper, side by side with charioteers. Books were written on the art,
one of which, from the pen of Xenophon, is still extant.
The case is totally different when we turn to the history of Rome
during the same period. In the early regal times, and in the first
centuries of the Republic, cavalry was the most important arm of the
military service. It was naturally composed of the aristocracy, who
alone could bear the expense of a Horse. It was only when a rich
middle class had sprung up, and were denied the aristocratic
privilege of serving on horseback, that the heavy-armed infantry,
which in later times won all the great Roman victories, came first
into existence. As they increased, the cavalry decreased in
importance, and the typical Roman soldier was what was called in
mediæval times a man-at-arms.
The native breeds of Horses in Britain, before the Roman
conquest, are known to us merely from a reference to them by
Cæsar, that they were powerful and well suited for purposes of war
by their stature and training. They were used in the battles of the
Romans, yoked to the chariots. They were evidently considered of
great importance, since they appear on some of the early British
coins—as, for example, those of Cunobelin. After the conquest of
Britain by the Angles, Jutes, and Saxons, the Horses demanded
more attention than before. Athelstan thought the preservation of
the native breed of sufficient importance to call for a legal
enactment to prevent the export of Horses, excepting as presents.
Saddle-horses were employed, according to the testimony of Bede,
in England in the early part of the seventh century, and from the
notices in the Anglo-Saxon Chronicle it is evident that they were
frequently used by the Danes for purposes of transport from one
part of the country to another; and in the song of the fight of
Maldon, we read of Goderic flying from the field on a Horse, on
which his lord had ridden down to the battle.
The first attempt on record to improve the native breed, by the
introduction of foreign blood, was by the importation of “running
Horses” from Germany in the time of Athelstan; in whose reign also
many Spanish Horses were imported. William the Conqueror, who
owed his success in the Battle of Hastings to his cavalry, paid great
attention to the English breed. In his time, Professor Bell tells us,
“Roger de Belesme, Earl of Shrewsbury, imported the elegant and
docile Spanish Horse, and bred from it on his estates in Powisland;
and it is recorded that the Horses of that part of Wales were long
celebrated for their swiftness, a quality which they had doubtless
derived from this happy mixture of blood. The heavy panoply of
mail, however, with which the warriors of this and of succeeding
ages at once protected and loaded both themselves and their
steeds, sufficiently attests that the cavalry must have been mounted
on Horses of great strength and size; and there is no doubt that,
until the universal employment of firearms rendered such a
protection in a great measure unavailable, the speed and figure of
the War Horse must have been sacrificed to the qualities of power
and endurance. The beautiful Horses on which many of our light
cavalry regiments are now mounted, although endowed with
considerable strength, would have been crushed beneath the weight
of metal by which both the knight of olden time and his Horse were
so heavily laden.”
King John paid great attention to the improvement of the breed
for agricultural purposes; and to him, according to Youatt, we are
