Web Scraping with Python, 1st Edition
Ryan Mitchell
Boston
Web Scraping with Python
by Ryan Mitchell
Copyright © 2015 Ryan Mitchell. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Simon St. Laurent and Allyson MacDonald
Production Editor: Shiny Kalapurakkel
Copyeditor: Jasmine Kwityn
Proofreader: Carla Thornton
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Web Scraping with Python, the cover
image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the author disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-491-91027-6
[LSI]
Table of Contents
Preface
3. Starting to Crawl
  Traversing a Single Domain
  Crawling an Entire Site
  Collecting Data Across an Entire Site
  Crawling Across the Internet
  Crawling with Scrapy
4. Using APIs
  How APIs Work
  Common Conventions
  Methods
  Authentication
  Responses
  API Calls
  Echo Nest
  A Few Examples
  Twitter
  Getting Started
  A Few Examples
  Google APIs
  Getting Started
  A Few Examples
  Parsing JSON
  Bringing It All Back Home
  More About APIs
5. Storing Data
  Media Files
  Storing Data to CSV
  MySQL
  Installing MySQL
  Some Basic Commands
  Integrating with Python
  Database Techniques and Good Practice
  “Six Degrees” in MySQL
  Email
6. Reading Documents
  Document Encoding
  Text
  Text Encoding and the Global Internet
  CSV
  Reading CSV Files
  PDF
  Microsoft Word and .docx

  Data Normalization
  Cleaning After the Fact
  OpenRefine
12. Avoiding Scraping Traps
  A Note on Ethics
  Looking Like a Human
  Adjust Your Headers
  Handling Cookies
  Timing Is Everything
  Common Form Security Features
  Hidden Input Field Values
  Avoiding Honeypots
  The Human Checklist
Index
Preface
To those who have not developed the skill, computer programming can seem like a
kind of magic. If programming is magic, then web scraping is wizardry; that is, the
application of magic for particularly impressive and useful—yet surprisingly effortless
—feats.
In fact, in my years as a software engineer, I’ve found that very few programming
practices capture the excitement of both programmers and laymen alike quite like
web scraping. The ability to write a simple bot that collects data and streams it down
a terminal or stores it in a database, while not difficult, never fails to provide a certain
thrill and sense of possibility, no matter how many times you might have done it
before.
It’s unfortunate that when I speak to other programmers about web scraping, there’s a
lot of misunderstanding and confusion about the practice. Some people aren’t sure if
it’s legal (it is), or how to handle the modern Web, with all its JavaScript, multimedia,
and cookies. Some get confused about the distinction between APIs and web scrapers.
This book seeks to put an end to many of these common questions and misconceptions
about web scraping, while providing a comprehensive guide to most common
web-scraping tasks.
Beginning in Chapter 1, I’ll provide code samples periodically to demonstrate concepts.
These code samples are in the public domain, and can be used with or without
attribution (although acknowledgment is always appreciated). All code samples also
will be available on the website for viewing and downloading.
What Is Web Scraping?
The automated gathering of data from the Internet is nearly as old as the Internet
itself. Although web scraping is not a new term, in years past the practice has been
more commonly known as screen scraping, data mining, web harvesting, or similar
variations. General consensus today seems to favor web scraping, so that is the term
I’ll use throughout the book, although I will occasionally refer to the web-scraping
programs themselves as bots.
In theory, web scraping is the practice of gathering data through any means other
than a program interacting with an API (or, obviously, through a human using a web
browser). This is most commonly accomplished by writing an automated program
that queries a web server, requests data (usually in the form of the HTML and other
files that comprise web pages), and then parses that data to extract needed information.
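As a minimal sketch of that query, request, and parse cycle, the following uses only Python's standard library (`urllib.request` and `html.parser`). The names `TitleAndLinkParser`, `fetch_html`, and `extract` are illustrative assumptions, not code from the book's chapters. The network call is isolated in `fetch_html` so that the parsing step can be tried on any HTML string.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleAndLinkParser(HTMLParser):
    """Collects the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title>...</title> is accumulated.
        if self._in_title:
            self.title += data


def fetch_html(url):
    """Query the web server and return the response body as text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html):
    """Parse HTML text and return (title, links)."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

A call such as `extract(fetch_html("http://example.com"))` (a placeholder URL) would chain the two steps. Real pages are often messier than this parser assumes, which is one reason dedicated parsing libraries exist.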
In practice, web scraping encompasses a wide variety of programming techniques
and technologies, such as data analysis and information security. This book will cover
the basics of web scraping and crawling (Part I), and delve into some of the advanced
topics in Part II.
want to use such as Twitter posts or Wikipedia pages. In general, it is preferable to use
an API (if one exists), rather than build a bot to get the same data. However, there are
several reasons why an API might not exist:
• You are gathering data across a collection of sites that do not have a cohesive API.
• The data you want is a fairly small, finite set that the webmaster did not think
warranted an API.
• The source does not have the infrastructure or technical ability to create an API.
Even when an API does exist, request volume and rate limits, the types of data, or the
format of data that it provides might be insufficient for your purposes.
This is where web scraping steps in. With few exceptions, if you can view it in your
browser, you can access it via a Python script. If you can access it in a script, you can
store it in a database. And if you can store it in a database, you can do virtually anything
with that data.
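The last link in that chain, storing what a script has retrieved, can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions: SQLite is chosen here only because it needs no server (the table of contents lists MySQL as the database covered in Chapter 5), and the table name `pages`, its columns, and the example URLs are hypothetical.

```python
import sqlite3


def save_pages(conn, pages):
    """Insert (url, title) pairs, replacing rows for URLs already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", pages
    )
    conn.commit()


def load_titles(conn):
    """Return all stored (url, title) rows ordered by URL."""
    return conn.execute("SELECT url, title FROM pages ORDER BY url").fetchall()


# An in-memory database keeps the sketch self-contained;
# pass a filename instead to persist the data between runs.
conn = sqlite3.connect(":memory:")
save_pages(conn, [
    ("http://example.com/a", "Page A"),
    ("http://example.com/b", "Page B"),
])
# Saving the same URL again overwrites the earlier row.
save_pages(conn, [("http://example.com/a", "Page A (updated)")])
rows = load_titles(conn)
```

Keying the table on the URL is one simple way to keep repeated crawls from piling up duplicate rows.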
There are obviously many extremely practical applications of having access to nearly
unlimited data: market forecasting, machine language translation, and even medical
diagnostics have benefited tremendously from the ability to retrieve and analyze data
from news sites, translated texts, and health forums, respectively.
Even in the art world, web scraping has opened up new frontiers for creation. The
2006 project “We Feel Fine” by Jonathan Harris and Sep Kamvar scraped a variety of
English-language blog sites for phrases starting with “I feel” or “I am feeling.” This led
to a popular data visualization, describing how the world was feeling day by day and
minute by minute.
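Phrase matching of this kind can be approximated with a regular expression. This is only a sketch under assumptions: the pattern and the name `find_feelings` are hypothetical and are not taken from the We Feel Fine project, whose actual implementation is not described here.

```python
import re

# Matches a sentence that begins with "I feel" or "I am feeling",
# captured up to (but not including) the next '.', '!' or '?'.
# A sentence start is either the start of the text or the position
# right after sentence punctuation followed by one whitespace character.
FEELING_PATTERN = re.compile(
    r"(?:^|(?<=[.!?]\s))(I (?:feel|am feeling)\b[^.!?]*)",
    re.IGNORECASE,
)


def find_feelings(text):
    """Return the matching sentences with surrounding whitespace removed."""
    return [match.strip() for match in FEELING_PATTERN.findall(text)]
```

The pattern deliberately ignores "I feel" in the middle of a sentence, since the project is described as collecting phrases that start with those words.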
Regardless of your field, there is almost always a way web scraping can guide business
practices more effectively, improve productivity, or even branch off into a brand-new
field entirely.
If you’re looking for a more comprehensive Python resource, the book Introducing
Python by Bill Lubanovic is a very good, if lengthy, guide. For those with shorter
attention spans, the video series Introduction to Python by Jessica McKellar is an
excellent resource.
Appendix C includes case studies, as well as a breakdown of key issues that might
affect how you can legally run scrapers in the United States and use the data that they
produce.
Technical books are often able to focus on a single language or technology, but web
scraping is a relatively disparate subject, with practices that require the use of databases,
web servers, HTTP, HTML, Internet security, image processing, data science, and
other tools. This book attempts to cover all of these to an extent for the purpose of
gathering data from remote sources across the Internet.
Part I covers the subject of web scraping and web crawling in depth, with a strong
focus on a small handful of libraries used throughout the book. Part I can easily be
used as a comprehensive reference for these libraries and techniques (with certain
exceptions, where additional references will be provided).
Part II covers additional subjects that the reader might find useful when writing web
scrapers. These subjects are, unfortunately, too broad to be neatly wrapped up in a
single chapter. Because of this, frequent references will be made to other resources
for additional information.
This book is structured so that you can easily jump among chapters to find only the
web-scraping technique or information that you are looking for. When a concept or
piece of code builds on another mentioned in a previous chapter, I will explicitly
reference the section in which it was addressed.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined
by context.
Safari® Books Online
Safari Books Online is an on-demand digital library that delivers
expert content in both book and video form from the
world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative
professionals use Safari Books Online as their primary resource for research,
problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations,
government agencies, and individuals. Subscribers have access to thousands
of books, training videos, and prepublication manuscripts in one fully searchable
database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-
Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco
Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt,
Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett,
Course Technology, and dozens more. For more information about Safari Books
Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://oreil.ly/1ePG2Uj.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
Just like some of the best products arise out of a sea of user feedback, this book could
have never existed in any useful form without the help of many collaborators, cheerleaders,
and editors. Thank you to the O’Reilly staff and their amazing support for
this somewhat unconventional subject, to my friends and family who have offered
advice and put up with impromptu readings, and to my coworkers at LinkeDrive who
I now likely owe many hours of work to.
Thank you, in particular, to Allyson MacDonald, Brian Anderson, Miguel Grinberg,
and Eric VanWyk for their feedback, guidance, and occasional tough love. Quite a few
sections and code samples were written as a direct result of their inspirational suggestions.
Thank you to Yale Specht for his limitless patience throughout the past nine months,
providing the initial encouragement to pursue this project, and stylistic feedback during
the writing process. Without him, this book would have been written in half the
time but would not be nearly as useful.
Finally, thanks to Jim Waldo, who really started this whole thing many years ago
when he mailed a Linux box and The Art and Science of C to a young and impressionable
teenager.