Scraping Data on the Web with BeautifulSoup

How to use Beautiful Soup for Web Scraping?

Any discipline of study or area of personal interest can benefit greatly from the enormous volume of data on the Internet, but collecting that data properly requires some skill at web scraping. Because so much data is available online, web scraping has become a crucial method for understanding your business and producing datasets for your decision algorithms. Many web scrapers are available; in this post, we will tell you about Beautiful Soup and how to extract data with it.

 

What is web scraping?

 

Most websites let users view their data through a web browser, but a browser does not offer a way to save that data in a user-friendly format. Usually the only option is to save the page itself, and most web pages leave the user no choice but to manually copy and paste the information.

 

Web scraping is a technique for harvesting large amounts of data from target websites. The extracted data can then be saved to a spreadsheet or a local file on your computer. A key benefit of web scraping is that scripts can automate the steps involved in collecting data from websites.

 

What are the benefits of web scraping?

Save money

 

Web scraping services decrease the time and cost associated with data extraction. Once built, these tools can run automatically, reducing reliance on a human workforce.

 

Precision of results

 

Automated scraping easily outperforms manual data collection, delivering fast, dependable results at a scale that is not humanly achievable.

 

Market forecasting advantage

 

Accurate results save businesses time, money, and labor, giving you a time-to-market edge over your rivals.

 

High quality

 

Through scraping APIs, web scraping delivers clean, well-organized, high-quality data, keeping downstream systems supplied with fresh, up-to-date information.

 

What is Beautiful Soup?

 

Beautiful Soup is a Python library that makes it simple to gather data from websites. It sits on top of an HTML or XML parser and provides Pythonic methods for iterating over, searching, and altering the parse tree. The library makes it easy to extract things like page titles and links, pull all of the text out of HTML tags, and even modify the HTML of the document we are working on.
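For a quick taste of the API, here is a minimal sketch (the HTML string and names are purely illustrative) that parses a small document and pulls out its title and links:

from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><a href='/a'>A</a> <a href='/b'>B</a></body></html>"
soup = BeautifulSoup(html, 'html.parser')  # html.parser ships with Python

print(soup.title.text)                          # Demo
print([a['href'] for a in soup.find_all('a')])  # ['/a', '/b']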

 

Web scraping with Beautiful Soup

 

Step 1: Installing the necessary external (third-party) libraries

 

The simplest way to install external libraries in Python is with pip, the package management system used to install and manage Python software packages. All that is required is:

 

pip install requests

pip install html5lib

pip install bs4
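If you want to confirm the installation worked, an optional sanity check is simply importing each package:

import requests   # HTTP library
import html5lib   # lenient HTML parser
import bs4        # Beautiful Soup

print(requests.__version__, bs4.__version__)  # no ImportError means you are set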

Step 2: Accessing the website’s HTML content

import requests

URL = "https://www.geeksforgeeks.org/data-structures/"

r = requests.get(URL)

print(r.content)

 

Let's break this bit of code down.

 

Import the requests library first.

Then specify the URL of the website you wish to scrape.

Send an HTTP request to the specified URL and save the server's reply in a response object (r).

Finally, print r.content to obtain the webpage's raw HTML. Note that r.content holds the raw response as bytes; r.text gives the same content decoded as a string.
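Before parsing, it is often worth confirming the request actually succeeded. A small optional sketch using standard requests attributes:

import requests

URL = "https://www.geeksforgeeks.org/data-structures/"
r = requests.get(URL)

print(r.status_code)                   # 200 means the request succeeded
print(r.headers.get('Content-Type'))   # e.g. text/html; charset=utf-8

if r.status_code != 200:
    raise RuntimeError(f"Request failed with status {r.status_code}")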

 

Step 3: Parsing the HTML content

 

# This will not run on an online IDE

import requests

from bs4 import BeautifulSoup

 

URL = "http://www.values.com/inspirational-quotes"

r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')  # if this line causes an error, run 'pip install html5lib'

print(soup.prettify())

 

A pretty wonderful feature of Beautiful Soup is that it is built on top of HTML parsing libraries like html5lib, lxml, and html.parser, so you can provide a parser library at the same time you build a Beautiful Soup object. In the example above,

 

soup = BeautifulSoup(r.content, 'html5lib')

We pass two arguments to construct a BeautifulSoup object:

r.content: It is the actual HTML code in its raw form.

html5lib: the HTML parser we want to use.

 

Printing soup.prettify() gives a visual representation of the parse tree built from the raw HTML content.
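If you would rather not install html5lib, Python's built-in html.parser also works, though it is stricter with malformed HTML. A sketch of the swap (not a required change):

from bs4 import BeautifulSoup

# html.parser is bundled with Python; lxml and html5lib are third-party
soup = BeautifulSoup('<p>Hello <b>world</b>', 'html.parser')
print(soup.prettify())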

 

Step 4: Navigating and searching the parse tree

 

We now want to extract some useful information from the HTML content. The soup object contains all of the data in the hierarchical structure, ready to be extracted programmatically. In our example, we are scraping a webpage that contains quotes, so we will write a program to save those quotes (and all relevant information about them).

 

# Python program to scrape a website
# and save its quotes
import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

quotes = []  # a list to store quotes

table = soup.find('div', attrs={'id': 'all_quotes'})

for row in table.findAll('div', attrs={'class': 'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)

filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)

 

Before moving on, we advise you to look through the HTML we printed with soup.prettify() and find a pattern or route for reaching the quotes.

 

As you can see, all of the quotes are contained in a div container with the id all_quotes. We therefore use the find() method to locate that div element (referred to as table in the code above):

 

table = soup.find('div', attrs={'id': 'all_quotes'})

 

The first argument is the HTML tag you want to search for, and the second is a dictionary of additional attributes associated with that tag. The find() method returns the first matching element. You can try printing table.prettify() to see what this piece of code selects.
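A minimal sketch of the difference between find() and findAll(), using made-up markup:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>one</li><li>two</li></ul>', 'html.parser')

print(soup.find('li').text)                    # 'one' -- first match only
print([li.text for li in soup.findAll('li')])  # ['one', 'two'] -- all matches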

 

Looking inside the table element, each quote sits in its own div container with a long utility class (col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top). So we iterate over each of those div containers using the findAll() method, which takes arguments similar to find() but returns a list of all matching elements. Each quote is then visited in turn through the row variable. To help you understand, sample HTML content is shown below.
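(The markup below is reconstructed from the fields our code reads; the exact attributes on the live page may differ.)

<div class="col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top">
  <a href="/inspirational-quotes/some-quote">
    <img src="/images/some-quote.jpg" alt="Some inspiring line. #AuthorName">
  </a>
  <h5>Theme Name</h5>
</div>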

 

 

Take a look at this piece of code now.

 

for row in table.findAll('div', attrs={'class': 'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)

 

We build a dictionary to save all relevant information about a quote. The hierarchical structure can be accessed with dot notation, and .text retrieves the text contained within an HTML element:

 

quote['theme'] = row.h5.text

A tag's attributes can be accessed, edited, and added by treating the tag as a dictionary:

quote['url'] = row.a['href']

Finally, each quote is appended to a list called quotes. In the end, we want to save all of our data in a CSV file.

filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)

 

Here, all the quotes are saved in a CSV file called inspirational_quotes.csv for future use.
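To double-check the output, you can read the file back with csv.DictReader; this is an optional sanity check, not part of the scraper itself:

import csv

with open('inspirational_quotes.csv', newline='') as f:
    for record in csv.DictReader(f):
        print(record['theme'], '->', record['lines'])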

 

This was a simple demonstration of how to build a web scraper in Python. Once you understand this tutorial, you can try scraping any other website of your choice.

 

So, what we have learned so far:

 

Web scraping involves these steps:

 

First, send an HTTP request to the URL of the website you want to access. The server responds to the request by returning the HTML of the webpage. We used requests, a third-party HTTP library for Python, for this purpose.

 

The next step is to parse the HTML content we have accessed. Since HTML data is hierarchical, simple string processing is not enough to extract data from it; we need a parser that organizes the HTML into a hierarchical or tree structure. Although several HTML parser libraries are available, html5lib is one of the most lenient, parsing pages much the way a web browser does.

 

All that remains is tree traversal: navigating and searching the parse tree we built. We used Beautiful Soup, a third-party Python library, for this job; it lets users extract data from HTML and XML files.
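To recap a few common ways of traversing a parse tree (standard Beautiful Soup calls on a made-up snippet):

from bs4 import BeautifulSoup

html = "<html><body><h1>Title</h1><p class='intro'>Hi <a href='/x'>link</a></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)                                   # dot notation: first <h1>
print(soup.find('p', attrs={'class': 'intro'}).text)  # search by tag and attribute
print(soup.a['href'])                                 # attribute access like a dictionary
print(soup.p.parent.name)                             # move up the tree: 'body'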

 

Conclusion

 

If you want your product or service to endure, you must act quickly and get to market, and web scraping is crucial to succeeding and growing the business along the way. Web scraping is the process of extracting data from websites: software replicates human activity to extract the targeted information from a site. When it comes to choosing a web scraper, Beautiful Soup is a particularly good tool because of its fundamental components. A programmer can use it to swiftly extract data from a given web page, and as we showed above, the library lets us pull data out of HTML and XML files.
