Web Scraping
=====================================================
Web scraping is the process of extracting data from websites, usually with an automated script rather than by hand in a web browser. In this summary, we'll cover how to use BeautifulSoup, a popular Python library, to scrape websites.
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It builds a parse tree from the page source, which can then be searched and navigated hierarchically and in a more readable manner.
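As a rough illustration of the parse tree idea, the sketch below parses a small hand-written HTML snippet and walks it with tag attributes, `find()`, and `find_all()`. The snippet, its element names, and the variable names are assumptions made up for this illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, invented for this illustration.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 class="firstHeading">Main heading</h1>
    <ul id="topics">
      <li>Parsing</li>
      <li>Searching</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tags are reachable as attributes of the tree.
page_title = soup.title.string

# find() returns the first match, or None if nothing matches.
heading = soup.find("h1", class_="firstHeading").get_text(strip=True)

# find_all() returns every matching element.
topics = [li.get_text(strip=True) for li in soup.find_all("li")]

# Each node knows its parent, which reflects the document hierarchy.
list_id = soup.find("li").parent["id"]

print(page_title, heading, topics, list_id)
```

Working against an inline string like this is a convenient way to try out selectors before pointing the same calls at a downloaded page.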
In the example below, we'll scrape the main heading of Wikipedia's main page, that is, the first `<h1>` element with the class `firstHeading`.
### Install BeautifulSoup and requests libraries
```bash
pip install beautifulsoup4 requests
```

### Python Code
```python
import requests
from bs4 import BeautifulSoup

# Send a GET request to Wikipedia's main page
url = "https://en.wikipedia.org/"
response = requests.get(url, timeout=10)

# If the GET request is successful, the status code will be 200
if response.status_code == 200:
    # Get the content of the response
    page_content = response.content
    # Create a BeautifulSoup object and specify the parser
    soup = BeautifulSoup(page_content, 'html.parser')
    # Find the first <h1> element with the class 'firstHeading'
    heading = soup.find('h1', class_='firstHeading')
    if heading is not None:
        title = heading.text.strip()
        print("Title:", title)
    else:
        # find() returns None when no element matches
        print("No <h1> element with class 'firstHeading' found")
else:
    print("Failed to retrieve Wikipedia's main page")
```
### How it Works
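- The script sends a GET request with `requests` and parses the response with BeautifulSoup (using `html.parser`).
- It then calls `soup.find()` to find the first `<h1>` element with the class `firstHeading`, which contains the title of the page.
- Other parsers such as `lxml` or `html5lib` can be used instead of `html.parser`.
- To retrieve every matching element rather than only the first, use `soup.find_all()`.
- Check the site's robots file (`/robots.txt`) to ensure web scraping is allowed.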
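The robots.txt check can be sketched with Python's standard `urllib.robotparser` module. This is a minimal sketch, not a full compliance check: the helper name `is_allowed`, the user agent string `my-scraper`, the `example.com` URLs, and the sample rules are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, invented for this illustration.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "my-scraper", "https://example.com/articles/1"))
print(is_allowed(sample_rules, "my-scraper", "https://example.com/private/data"))
```

For a live site, the same parser can instead load the rules over the network by calling `set_url()` with the site's `/robots.txt` address followed by `read()`. Parsing a list of lines, as above, keeps the sketch runnable offline.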