How to scrape data with Perl

Everyone is looking for ways to innovate and take advantage of new technologies in today's competitive marketplace. For those who need an automated way to acquire structured web data, web scraping, also known as web data extraction or data scraping, offers a solution. Web scraping is especially helpful when the public website you want to collect data from has no API, or has one that only grants restricted access to the data. With the sheer amount of data available on the internet, it has become a crucial method for gathering information about different businesses and turning it into datasets that feed your decision engine.

 

For web scraping, you can find numerous web scrapers and tools; however, in this post we will learn about one of the most popular languages for the job, Perl, and how we can scrape data using it. So, let's dig in.

 

What are web scraping tools?

 

Web scraping tools are computer programs, or "bots," developed to crawl websites and retrieve data. Many of these bots can be fully customized to:

 

  1. Identify distinctive HTML site structures
  2. Obtain and modify content
  3. Archive scraped data
  4. Extract data from APIs

 

What are the benefits of web scraping?

 

Web scraping is a method for obtaining information from websites. Although it can be done manually, it is more commonly automated with code. There are several reasons why someone might scrape a website, including:

 

  • Lead generation for marketing
  • Monitoring prices on a page (and purchasing when the price drops)
  • Academic analysis
  • Arbitrage betting

 

What is Perl?

 

Perl is a general-purpose programming language. It was initially created for text manipulation and is now used for many different tasks, including system administration, web development, network programming, and GUI creation. It is a robust, cross-platform language. Although "Practical Extraction and Report Language" is not an official expansion of the name, many people still refer to Perl that way. Both the public and private sectors use it for mission-critical projects.

 

Why is Perl so popular?

 

  • Perl borrows the best features from a variety of other programming languages, including C, awk, sed, sh, and BASIC.
  • Perl's database integration interface, DBI, supports third-party databases such as Oracle, Sybase, Postgres, and MySQL (see the sketch after this list).
  • Perl works well with HTML, XML, and other mark-up languages.
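
To make the DBI point above concrete, here is a minimal sketch of storing scraped records in a MySQL database via DBI; the database name, table, and credentials are hypothetical placeholders:

use strict;
use warnings;
use DBI;

# connect to a hypothetical local MySQL database (placeholder credentials)
my $dbh = DBI->connect(
    "dbi:mysql:database=scraper_db;host=localhost",
    "user", "secret",
    { RaiseError => 1, AutoCommit => 1 },
);

# store a scraped record and read it back
$dbh->do(
    "INSERT INTO pages (title, url) VALUES (?, ?)",
    undef, "Six Days", "https://genius.com/DJ-Shadow-Six-Days-lyrics",
);
my $rows = $dbh->selectall_arrayref("SELECT title, url FROM pages");
print "$_->[0] => $_->[1]\n" for @$rows;

$dbh->disconnect;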

 

Web scraping applications

 

We've previously covered several advantages of scraping, though that list was far from exhaustive.

 

Web scrapers can help fill in the gaps in API response data and retrieve information that the API maintainer has left out. For instance, the Genius API doesn't include lyrics in its responses; this post demonstrates how to work around that limitation with scraping.

 

How to scrape data with Perl?

 

The scraper you are about to develop retrieves song lyrics from Genius for a given song. This is useful because lyrics are missing from the song resource in the Genius REST API. To do this, you will install the HTML::TreeBuilder module for Perl and use it together with the Library for WWW in Perl (LWP) module.

 

Step 1: Understand Perl libraries LWP

 

Perl's Library for WWW in Perl (LWP) provides a collection of language APIs (classes and functions) for building HTTP clients that fetch data from the web. Most Perl installations include the library out of the box, and it supports a range of HTTP capabilities, such as requests using the different HTTP methods and document and file downloads. The library has also been used to power tools such as the CPAN client.

 

You can find the complete LWP API documentation on the Comprehensive Perl Archive Network (CPAN), or browse it locally with perldoc by entering the following in your preferred console:

 

$ perldoc LWP
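
To make those capabilities concrete, here is a minimal LWP::UserAgent sketch; the example URLs and the output file name are placeholders:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 10);

# a simple GET request
my $res = $ua->get("https://example.com/");
print $res->status_line, "\n";

# a POST request carrying form data
my $search = $ua->post("https://example.com/search", { q => "perl scraping" });

# download a document to a local file (re-fetched only if it has changed)
$ua->mirror("https://example.com/report.pdf", "report.pdf");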

 

Step 2: Parsing with TreeBuilder

 

HTML::TreeBuilder is a Perl module available on CPAN whose primary job is to build HTML trees for further selective parsing. Internally it uses the HTML::Parser and HTML::Element packages, which provide a variety of ways to parse HTML pages and work with the elements of markup-interspersed text.
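
As a quick, self-contained illustration of what TreeBuilder does (the HTML snippet here is made up for the example):

use strict;
use warnings;
use HTML::TreeBuilder;

# build a tree from an HTML string and pull out a single element
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><div id="greeting">Hello, Perl!</div></body></html>'
);
my $div = $tree->look_down(_tag => "div", id => "greeting");
print $div->as_text, "\n";   # prints "Hello, Perl!"
$tree->delete;               # free the tree when finished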

 

Note:

 

As described on the official CPAN website, you can install TreeBuilder with either cpan or cpanm, both of which are available on all major operating systems. In the cpan REPL, the following command should be sufficient. (On first use, the REPL may ask for some setup, such as enabling readline support to make console input easier and choosing a base directory for modules installed through it.)

 

$ cpan

cpan[1]> install HTML::TreeBuilder

 

Alternatively, you can install TreeBuilder using cpanm, a streamlined CPAN client that works much like npm for JavaScript or Composer for PHP.

 

$ cpanm HTML::TreeBuilder
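
Either way, you can check that the module installed correctly with a one-liner that loads it and prints its version:

$ perl -MHTML::TreeBuilder -e 'print $HTML::TreeBuilder::VERSION, "\n"'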

Step 3: Scraper coding

 

In this example, we're going to get the lyrics of the song "Six Days" by the American producer DJ Shadow.

 

First, load the LWP and TreeBuilder modules, then make the HTTP GET request that grabs the lyrics page from Genius and set up the scraper to interpret the resulting HTML.

 

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use LWP::UserAgent;
use HTML::TreeBuilder;

my $ua = LWP::UserAgent->new;
$ua->agent("Genius Scraper");
my $url = "https://genius.com/DJ-Shadow-Six-Days-lyrics";
my $root = HTML::TreeBuilder->new();

# perform HTTP GET request
my $request = $ua->get($url) or die "Cannot contact Genius $!\n";

 

Step 4: Parsing the markup

 

The next step is parsing the markup returned by the servers hosting the Genius app. In this context, parsing means turning the markup into a tree structure that can be traversed. If the request is successful, pass the parser's parse method a UTF-8 decoded version of the returned markup (using decode_utf8 from the Encode module loaded above) to avoid any Perl encoding issues.

 

if ($request->is_success) {
    # decode the body as UTF-8 and feed it to the parser
    $root->parse(decode_utf8 $request->content);
    $root->eof;
} else {
    # print error message
    print "Cannot display the lyrics.\n";
}

Step 5: Invoke look_down method

 

When parsing is done, you will use the look_down method from the TreeBuilder API to navigate the generated tree and extract the lyrics. On the Genius platform, the lyrics are placed inside a div element whose id is lyrics-root. To continue, you need to express that requirement in code resembling the sample below.

 

my $data = $root->look_down(
    _tag => "div",
    id   => "lyrics-root"
);

 

Although TreeBuilder's dumping methods (inherited from the HTML::Element module) are very helpful for debugging tree traversals, they are not ideal for presenting the markup nicely. For neatly displaying the markup that comes back in the HTTP response, the primitives in the HTML::FormatText module are a better fit.
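
For example, while debugging you can print an indented outline of the subtree that look_down returned; HTML::Element's dump method writes it to standard error:

# quick debugging aid: inspect the selected subtree
$data->dump if $data;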

 

Step 6: Printing the HTML subtree

 

If you want to print the resulting HTML subtree as tidy string output, simply create an HTML::FormatText object and call its format method with $data as the only argument.

 

use HTML::FormatText;   # add this with the other use statements at the top of the script

my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
print $formatter->format($data);

The scraper can now be used by entering the following into a console:

$ chmod a+x scraper.pl && ./scraper.pl

The output in the console should look like the example below:

Six Days Lyrics

—————

[Verse 1]

At the starting of the week

At summit talks you'll hear them speak

It's only Monday

Negotiations breaking down

See those leaders start to frown

It's sword and gun day

[Hook]

Tomorrow never comes until it's too late

… etc. etc.

As written, the scraper only fetches that one page to get the lyrics to "Six Days."

To give the script some polish and let it handle any song hosted on the Genius platform, you can drive each run with a single command-line argument: the song to look up.

if (@ARGV != 1) {
    die "Please provide song input\n";
}

Then modify the URL as follows to complete the scraper:

my $url = "https://genius.com/$ARGV[0]";

Now that most of the work is done, type the following to run the Genius scraper and extract the lyrics to "Six Days":

$ ./scraper.pl DJ-Shadow-Six-Days-lyrics
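
For reference, here is one way the snippets above might be assembled into a complete scraper.pl. This is a sketch that assumes the Genius markup still keeps the lyrics inside the div with the id lyrics-root, as described earlier:

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;

# require exactly one argument: the Genius URL slug of the song
die "Please provide song input\n" if @ARGV != 1;
my $url = "https://genius.com/$ARGV[0]";

my $ua = LWP::UserAgent->new;
$ua->agent("Genius Scraper");
my $root = HTML::TreeBuilder->new();

# perform the HTTP GET request
my $request = $ua->get($url);

if ($request->is_success) {
    # decode the body as UTF-8 and build the HTML tree
    $root->parse(decode_utf8 $request->content);
    $root->eof;

    # the lyrics live inside the div with the id "lyrics-root"
    my $data = $root->look_down(_tag => "div", id => "lyrics-root");

    if ($data) {
        my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
        print $formatter->format($data);
    } else {
        print "Cannot find the lyrics on the page.\n";
    }
} else {
    print "Cannot display the lyrics.\n";
}

$root->delete;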

Conclusion

Indeed, web scraping is gaining importance with each passing day. Scraping the web means obtaining the content of a digital resource, most often a web page. It can be applied to a variety of tasks, from boosting business analytics to filling in gaps in API response data, for both small corporate projects and independent developers. Because scraping mainly relies on the markup transmitted over HTTP, any language that can make HTTP client requests and analyze the resulting HTML can be used to collect data this way, and Perl is one of the most capable at doing so.

 

Perl has been at the core of web development since the early days of the web; even Amazon's early website relied heavily on Perl. These days, Perl provides access to tens of thousands of mature modules covering just about anything via CPAN. Thanks to its ease of text manipulation and quick development cycle, Perl was once the most widely used web programming language. In short, web scraping with Perl is efficient and covers a broad spectrum of use cases.
