The document discusses the basics of web scraping including how it works, HTML structure and tags, navigating HTML trees, extracting data using Beautiful Soup and Pandas, and concludes with an introduction to responsible web scraping.

Uploaded by

SAIFUR RAHMAN


2/10/24, 11:01 PM about:blank

Web Scraping and HTML Basics

Estimated time: 10 mins

Lab Objectives:
By the end of this reading, you should be able to:

1. Understand key concepts related to HTML structure.
2. Learn about HTML tag composition.
3. Explore the concept of HTML document trees.
4. Familiarize yourself with HTML tables.
5. Gain insight into the basics of web scraping using Python and BeautifulSoup.

Introduction to Web Scraping

Web scraping, also known as web harvesting or web data extraction, is the process of extracting information from websites or web pages. It involves automated retrieval
of data from web sources and is commonly used for a wide range of applications such as data analysis, data mining, price comparison, content aggregation, and more.

How web scraping works:

HTTP Request:

The process typically begins with an HTTP request. A web scraper sends an HTTP request to a specific URL, similar to how a web browser would when you visit a
website. The request is usually an HTTP GET request, which retrieves the content of the web page.

Web Page Retrieval:

The web server hosting the website responds to the request by sending back the requested web page's HTML content. This content includes not only the visible text and
media elements but also the underlying HTML structure that defines the page's layout.

HTML Parsing:

Once the HTML content is received, it needs to be parsed. Parsing involves breaking down the HTML structure into its individual components, such as tags, attributes,
and text content. This is where a library like BeautifulSoup in Python is commonly used. It creates a structured representation of the HTML content that can be easily
navigated and manipulated.
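
To make "breaking HTML into tags, attributes, and text" concrete, here is a minimal sketch that uses Python's standard-library html.parser module as a stand-in for BeautifulSoup. The ComponentCollector class and the sample snippet are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class ComponentCollector(HTMLParser):
    """Record each start tag with its attributes, and each non-empty text run."""

    def __init__(self):
        super().__init__()
        self.components = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.components.append(("tag", tag, dict(attrs)))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.components.append(("text", text))


# Hypothetical snippet, not taken from any real page
sample_html = '<p class="intro">Hello, <a href="/about">about us</a></p>'

collector = ComponentCollector()
collector.feed(sample_html)
collector.close()

for component in collector.components:
    print(component)
```

BeautifulSoup performs the same kind of breakdown but returns a navigable tree rather than a flat list of events.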

Data Extraction:

With the HTML content parsed, web scrapers can now identify and extract the specific data they need. This data can include text, links, images, tables, product prices,
news articles, and more. Scrapers locate the data by searching for relevant HTML tags, attributes, and patterns in the HTML structure.

Data Transformation:

Extracted data may need further processing and transformation. For instance, removing HTML tags from text, converting data formats, or cleaning up messy data. This
step ensures that the data is ready for analysis or other use cases.
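
As a rough illustration of this step, the sketch below strips tags from a small fragment and converts a price string to a number. The helper names strip_tags and parse_price and the sample values are assumptions made for this example. The regular expression is adequate only for small, well-formed fragments; a real parser is the safer choice for whole pages.

```python
import html
import re


def strip_tags(fragment):
    """Remove tags with a simple regex, decode entities, collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", fragment)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


def parse_price(text):
    """Convert a string such as '$1,299.50' to a float."""
    return float(text.replace("$", "").replace(",", "").strip())


# Hypothetical raw values as they might come out of a table cell
raw_name = "<td>  <b>Widget</b> &amp; Co. </td>"
raw_price = " $1,299.50 "

print(strip_tags(raw_name))
print(parse_price(raw_price))
```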

Storage:

After extraction and transformation, the scraped data can be stored in various formats, such as databases, spreadsheets, or even JSON or CSV files. The choice of storage
format depends on the specific project's requirements.
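
A minimal sketch of two of these formats, CSV and JSON, using only the standard library. The records are invented for illustration, and an in-memory buffer stands in for a file so the example has no side effects.

```python
import csv
import io
import json

# Hypothetical records, as they might look after extraction and cleaning
records = [
    {"name": "Widget", "price": 19.99},
    {"name": "Gadget", "price": 4.5},
]

# CSV: one header row, then one row per record
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: the same records as a list of objects
json_text = json.dumps(records, indent=2)

print(csv_text)
print(json_text)
```

To write real files, replace the buffer with open("data.csv", "w", newline="") and write json_text to a .json file; the file names are placeholders.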

Automation:

In many cases, web scraping is automated using scripts or programs. These automation tools allow for recurring data extraction from multiple web pages or websites.
Automated scraping is especially useful for collecting data from dynamic websites that regularly update their content.
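
One possible shape for such a script is sketched below. The scrape_many function, the fake_fetch stand-in, and the example.com URLs are all hypothetical; the fetcher is passed in as a parameter so the sketch runs without network access, and the pause between requests is a common courtesy rather than something this reading prescribes.

```python
import time


def scrape_many(urls, fetch, delay_seconds=1.0):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any callable that takes a URL and returns the page's HTML.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause so the server is not flooded
        pages[url] = fetch(url)
    return pages


def fake_fetch(url):
    """Stand-in fetcher so the sketch runs offline."""
    return f"<html><body>Content of {url}</body></html>"


pages = scrape_many(
    ["https://example.com/a", "https://example.com/b"],
    fetch=fake_fetch,
    delay_seconds=0.1,
)
print(list(pages))
```

In a real script, fetch could be a small function that calls requests.get(url, timeout=10) and returns the response's text.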


HTML Structure
HTML (Hypertext Markup Language) serves as the foundation of web pages. Understanding its structure is crucial for web scraping.

<html> is the root element of an HTML page.
<head> contains meta-information about the HTML page.
<body> holds the content displayed on the web page, which is often the data of interest.
<h3> tags are level-3 headings, rendered larger and bold; in this reading's example page they hold player names.
<p> tags represent paragraphs; in the same example they contain player salary information.

Composition of an HTML Tag

HTML tags define the structure of web content and can contain attributes.

An HTML tag consists of an opening (start) tag and a closing (end) tag.
Tags have names (e.g., <a> for an anchor tag).
Tags may contain attributes with an attribute name and value, providing additional information to the tag.
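
The parts of a tag listed above can be inspected directly with Beautiful Soup. This is a small sketch on a made-up anchor tag; the URL and id value are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical anchor tag used only for illustration
html = '<a href="https://www.ibm.com" id="home-link">IBM webpage</a>'

soup = BeautifulSoup(html, "html.parser")
tag = soup.a               # the first <a> tag in the document

print(tag.name)            # the tag name
print(tag.attrs)           # all attributes as a dictionary
print(tag["href"])         # the value of a single attribute
print(tag.string)          # the text between the start and end tags
```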

HTML Document Tree

HTML documents can be visualized as trees with tags as nodes.

Tags can contain strings and other tags, making them the tag's children.
Tags within the same parent tag are considered siblings.
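For example, the <html> tag contains both the <head> and <body> tags, making them children (and therefore descendants) of <html>. <head> and <body> are siblings. Tags nested deeper, such as a <p> inside <body>, are descendants of <html> but not its children.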
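
The parent, child, and sibling relationships can be checked with Beautiful Soup's navigation attributes. The sketch below assumes a tiny hand-written page, written without whitespace between tags so that no whitespace-only text nodes appear among the children.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical page, written without whitespace between tags
html = (
    "<html>"
    "<head><title>Page Title</title></head>"
    "<body><h3>Player A</h3><p>Salary: 100</p></body>"
    "</html>"
)
soup = BeautifulSoup(html, "html.parser")
root = soup.html

# Direct children of <html>
child_names = [node.name for node in root.children if isinstance(node, Tag)]

# Every tag nested anywhere below <html>
descendant_names = [node.name for node in root.descendants if isinstance(node, Tag)]

head = soup.head
print(child_names)
print(descendant_names)
print(head.parent.name)          # the parent of <head>
print(head.next_sibling.name)    # the sibling that follows <head>
```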

HTML Tables
HTML tables are essential for presenting structured data.

Define an HTML table using the <table> tag.
Each table row is defined with a <tr> tag.
The first row often holds the column headers, each defined with a <th> (table header) tag.
Each remaining cell in a row is defined with a <td> (table data) tag.
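
As a sketch of how these tags map onto rows and cells, the example below walks a small hand-written table with Beautiful Soup. The player names and salary figures are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table; names and figures are made up
html = """
<table>
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>100</td></tr>
  <tr><td>Player B</td><td>200</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    # A row may contain header cells (<th>) or data cells (<td>)
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

header, *data = rows
print(header)
print(data)
```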

Web Scraping
Web scraping involves extracting information from web pages using Python. It can save time and automate data collection.

Required Tools:

Web scraping requires Python code and two essential modules: Requests and Beautiful Soup. Ensure you have both modules installed in your Python environment.

    # Import Requests to fetch web pages over HTTP
    import requests
    # Import Beautiful Soup to parse web page content
    from bs4 import BeautifulSoup

Fetching and Parsing HTML:

To start web scraping, you need to fetch the HTML content of a webpage and parse it using Beautiful Soup. Here's a step-by-step example:

    import requests
    from bs4 import BeautifulSoup

    # Specify the URL of the webpage you want to scrape
    url = 'https://en.wikipedia.org/wiki/IBM'

    # Send an HTTP GET request to the webpage
    response = requests.get(url)

    # Store the HTML content in a variable
    html_content = response.text

    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(html_content, 'html.parser')

    # Display a snippet of the HTML content
    print(html_content[:500])

Navigating the HTML Structure:

BeautifulSoup represents HTML content as a tree-like structure, allowing for easy navigation. You can use methods like find_all to filter and extract specific HTML
elements. For example, to find all anchor tags (<a>) and print their text:

    # Find all <a> tags (anchor tags) in the HTML
    links = soup.find_all('a')

    # Iterate through the list of links and print their text
    for link in links:
        print(link.text)

Custom Data Extraction:

Web scraping allows you to navigate the HTML structure and extract specific information based on your requirements. This may involve finding specific tags, attributes,
or text content within the HTML document.
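
A few common ways to narrow a search are sketched below: by tag name plus class, by the presence of an attribute, and by id. The HTML fragment, class names, and id are hypothetical and chosen only to exercise each filter.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; class names and id are illustrative
html = """
<div id="content">
  <h3 class="player">Player A</h3><p class="salary">Salary: 100</p>
  <h3 class="player">Player B</h3><p class="salary">Salary: 200</p>
  <a href="/team">Team page</a>
  <a>Anchor without a link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# By tag name and CSS class (class_ avoids the Python keyword `class`)
players = [h3.get_text() for h3 in soup.find_all("h3", class_="player")]

# Only anchors that actually carry an href attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# By id, then a search scoped to that element
content = soup.find(id="content")
first_salary = content.find("p", class_="salary").get_text()

print(players)
print(hrefs)
print(first_salary)
```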

Using BeautifulSoup for HTML Parsing

Beautiful Soup is a powerful tool for navigating and extracting specific parts of a web page. It allows you to find elements based on their tags, attributes, or text, making it
easier to extract the information you're interested in.

Using pandas read_html for Table Extraction

On many websites, data is neatly organized in tables. Pandas, a Python library, provides a function called read_html, which can automatically extract data from these
tables and present it in a format suitable for analysis. It's similar to taking a table from a webpage and importing it into a spreadsheet for further analysis.
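
A minimal sketch of read_html on an in-memory table follows. The table contents are invented, and the example assumes that pandas and an HTML parser it can use (such as lxml) are installed. read_html returns a list of DataFrames, one per table it finds.

```python
import io

import pandas as pd

# Hypothetical table; names and figures are made up
html = """
<table>
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>100</td></tr>
  <tr><td>Player B</td><td>200</td></tr>
</table>
"""

# Wrapping the string in StringIO makes pandas treat it as file-like HTML
tables = pd.read_html(io.StringIO(html))
df = tables[0]

print(len(tables))
print(df)
```

read_html also accepts a URL in place of the buffer, in which case pandas fetches the page itself.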

Conclusion
In summary, this reading introduces web scraping with BeautifulSoup and Pandas, emphasizing extracting elements and tables. BeautifulSoup facilitates
HTML parsing, while Pandas' read_html streamlines table extraction. Responsible web scraping is highlighted, ensuring adherence to website terms. Armed with this
knowledge, you can confidently engage in precise data extraction.

Author
Akansha Yadav

Changelog
Date Version Changed by Change Description
2023-04-11 0.1 Akansha Yadav Initial version created
