I use YouTube and Reddit every day, and I often wonder how I could gather information about a topic automatically. For example, if I search “FFXIV” on YouTube, I get a long list of videos. Is there a way to collect each title, view count, and channel name into an Excel file automatically? The solution is web scraping: collecting data from a webpage automatically.
- What is Beautiful Soup
- How to use it
- Try to get information from a YouTube video
1. What is Beautiful Soup
Beautiful Soup is one of the most common Python packages for web scraping. It parses HTML and XML documents and lets you pull data out of them, so you can strip away the markup and save just the information you need.
2. How to use it
pip install BeautifulSoup4
Assume we have requested a webpage and the response looks like this:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
First, parse the document to create a BeautifulSoup object, which represents the whole page as a nested data structure.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
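To see what "removing the HTML markup" looks like in practice, here is a minimal sketch that pulls only the visible text out of a parsed document. The `snippet` string is a made-up example, not part of the page above, and the block builds its own `soup` so it can run on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this illustration
snippet = "<div><h1>Hello</h1><p>Web <b>scraping</b> demo</p></div>"

soup = BeautifulSoup(snippet, "html.parser")

# get_text() drops the tags and keeps the text.
# separator is placed between text pieces; strip=True trims whitespace around each piece.
text = soup.get_text(separator=" ", strip=True)
print(text)
# Hello Web scraping demo
```

Calling `get_text()` with no arguments also works; the `separator` and `strip` arguments just make the result easier to read.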
You can use different methods to get information wrapped by specific tags.
soup.title
# <title>The Dormouse's story</title>
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
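The tags returned by find_all() and find() are objects you can read text and attributes from. As a rough sketch of how that connects to the goal of saving scraped data to a spreadsheet, the block below turns each link into a row and writes the rows as CSV, which Excel can open. It uses a shortened copy of the sample document so it runs on its own; the column names and the in-memory buffer are my own choices for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the sample document, so this block is self-contained
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One row per <a> tag: visible text, href attribute, id attribute.
# tag["href"] raises KeyError if the attribute is missing; tag.get("id") returns None instead.
rows = [(a.get_text(), a["href"], a.get("id")) for a in soup.find_all("a")]

# Write to an in-memory buffer here; for a real file you could use
# open("links.csv", "w", newline="") instead (the filename is hypothetical).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "url", "id"])  # header row, names chosen for this example
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The same pattern (find the tags, read text and attributes, write rows) should carry over to other pages, as long as you know which tags hold the data you want.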