How to scrape data with Scrapy?

Data is often described as a resource that will eventually rival oil in value. With the growth of internet usage there is now a tremendous amount of data on websites, and using an API or a web scraping strategy is one way to obtain it. Web crawlers and scrapers scan a website's pages and feeds, looking for hints in the markup language and site structure from which to extract data. The information gathered by scraping may then be fed into other applications for validation, cleansing, and loading into a data store, or it may serve as input for other processes such as machine learning (ML) models or natural language processing (NLP) toolchains.

 

Beautiful Soup and Scrapy are two Python libraries you can use for web scraping. In this post we focus on Scrapy, which lets us prototype and build web scrapers quickly.

 

What is web scraping?

 

Web scraping is the practice of deploying bots to gather information and content from a website. It collects the underlying HTML code and, with it, the data stored in the site's database. This is different from screen scraping, which only copies the pixels displayed on screen. Once collected, the content of an entire website can be reproduced elsewhere.

 

Many digital firms that rely on data harvesting use web scraping. Legitimate use cases include the following:

 

Search engine bots crawl a website, examine its content, and assign it a ranking.

Price comparison websites use bots to automatically obtain product prices and descriptions from affiliated seller websites.

Companies that conduct market research use scrapers to gather information from social media and forums (e.g., for sentiment analysis).

 

What is Scrapy?

 

Scrapy is a Python framework for large-scale web scraping. It provides all the tools you need to extract data from websites efficiently, process it however you want, and save it in the structure and format of your choice.

 

Given how diverse the internet is, there is no "one size fits all" method for gathering data from websites. Ad hoc approaches are common, but if you start writing code for every small operation you perform, you will eventually end up building your own scraping framework. Scrapy is that framework.

 

Features of Scrapy

 

1 Spider

 

Spiders are classes that define a set of instructions for scraping a specific website. These customizable classes, built into Scrapy, provide an effective method for web scraping.

 

2 Selectors

 

In Scrapy, selectors are used to pick out particular HTML elements based on XPath or CSS expressions. Regular expressions can also be applied to a selector through its re() method.
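As a minimal sketch of how selectors work (the HTML fragment below is made up purely for illustration), a standalone Selector can be queried with CSS, XPath, and re():

from scrapy.selector import Selector

# a small, made-up HTML fragment for illustration
html = '<html><body><span class="price">Price: $19.99</span></body></html>'
sel = Selector(text=html)

print(sel.css("span.price::text").get())                  # 'Price: $19.99' via a CSS expression
print(sel.xpath('//span[@class="price"]/text()').get())   # the same element via XPath
print(sel.css("span.price::text").re(r"\$([\d.]+)"))      # ['19.99'] via the re() method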

 

3 Items

 

Spider-extracted data is returned as items. The itemadapter library supports the following item types: dictionaries, Item objects, dataclass objects, and attrs objects.
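As an example, a small Item class matching the movie data scraped later in this post could look like the sketch below (the field names are simply the ones we extract in the IMDb example):

import scrapy

class MovieItem(scrapy.Item):
    # one declared field per piece of scraped data
    title = scrapy.Field()
    year = scrapy.Field()
    rating = scrapy.Field()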

 

4 Item Pipeline

 

An item pipeline is a Python class that cleans, validates, and stores the scraped data, for example in a database. It can also check for duplicates.
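Here is a minimal sketch of such a pipeline, assuming the items carry a "title" key as in the IMDb example later in this post; remember that a pipeline only runs once it has been enabled through the ITEM_PIPELINES setting in settings.py:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        # drop any item whose title has already been seen
        if item["title"] in self.seen_titles:
            raise DropItem("Duplicate item found: %r" % item["title"])
        self.seen_titles.add(item["title"])
        return item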

 

5 Request and response

 

The spider generates Request objects, which Scrapy sends to the target site for execution; the downloaded page then comes back to the spider wrapped in a Response object.
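The sketch below illustrates this request/response cycle; the CSS selector is an assumption based on the IMDb Top 250 markup used later in this post:

import scrapy

class FollowSpider(scrapy.Spider):
    name = "follow_example"
    allowed_domains = ["imdb.com"]
    start_urls = ["http://www.imdb.com/chart/top"]

    def parse(self, response):
        # 'response' wraps the page downloaded for the initial request
        for href in response.css("td.titleColumn a::attr(href)").getall():
            # yield a new Request; Scrapy executes it and calls the callback with the response
            yield scrapy.Request(response.urljoin(href), callback=self.parse_movie)

    def parse_movie(self, response):
        yield {"url": response.url}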

 

6 Link extractors

 

A powerful utility for extracting the links you want to follow from the pages (responses) a spider downloads.
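Link extractors are most often used inside the rules of a CrawlSpider; a minimal sketch (the allow pattern is only illustrative) looks like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ChartLinksSpider(CrawlSpider):
    name = "chart_links"
    allowed_domains = ["imdb.com"]
    start_urls = ["http://www.imdb.com/chart/top"]

    # follow only links whose URL contains /title/ and hand each page to parse_title
    rules = (
        Rule(LinkExtractor(allow=r"/title/"), callback="parse_title"),
    )

    def parse_title(self, response):
        yield {"url": response.url}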

 

Installing Scrapy

 

Step 1: Create a virtual environment

 

It is advisable to create a new virtual environment for Scrapy, because doing so isolates it and ensures it is unaffected by any other packages installed on the computer.

 

First, install virtualenv using the command below.

 

$ pip install virtualenv

 

Now create a virtual environment:

$ virtualenv scrapyvenv

 

On Linux and Mac, you can additionally specify which Python version the virtual environment should be created for:

$ virtualenv -p python3 scrapyvenv

 


 

Activate the virtual environment you just created.

 

For Windows

$ cd scrapyvenv

$ .\Scripts\activate

For Linux/Mac

$ cd scrapyvenv

$ source bin/activate

 

Step 2: Install Scrapy

 

Most of Scrapy's dependencies are installed automatically. Note that while older releases supported Python 2.7, recent Scrapy releases require Python 3.

 

Pip install: Run the following command in the terminal to install using pip:

 

$ pip install scrapy

Conda Install: Run the following command in the terminal to install using conda:

$ conda install -c anaconda scrapy
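
Once the installation finishes, you can confirm that Scrapy is available by printing its version:

$ scrapy version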

 

If the installation fails on the Twisted dependency, you can download the Twisted library and install it locally.

 

How to create a Scrapy project?

 

Step 1: Using the startproject command

 

Because Scrapy is a framework, we must follow its conventions. To start a new project, use the startproject command. The project in this example is called webscrapy.

$ scrapy startproject webscrapy

This will produce a webscrapy directory with the following structure:

webscrapy
├── scrapy.cfg          — deploy configuration file of the Scrapy project
└── webscrapy           — your Scrapy project module
    ├── __init__.py     — module initializer (empty file)
    ├── items.py        — project item definitions
    ├── middlewares.py  — project middlewares
    ├── pipelines.py    — project pipelines
    ├── settings.py     — project settings
    └── spiders         — directory where spiders are kept
        └── __init__.py

 

Step 2: Creating a spider

 

Let's make our first spider right away. Use the genspider command, which takes the name of the spider and the domain to crawl:

 

$ cd webscrapy

$ scrapy genspider imdb www.imdb.com

 

As soon as you issue this command, Scrapy creates a Python file named imdb.py in the spiders folder. When you open imdb.py, you will see a class called ImdbSpider that subclasses scrapy.Spider.

 

It contains a parse() method, as shown below.

 

import scrapy


class ImdbSpider(scrapy.Spider):
    name = 'imdb'
    allowed_domains = ['www.imdb.com']
    start_urls = ['http://www.imdb.com/']

    def parse(self, response):
        pass

 

Note a few points in this case:

 

name: The spider's name, 'imdb' in this instance. When you have to maintain hundreds of spiders, naming them appropriately becomes a big relief.

allowed_domains: An optional list of strings containing the domains the spider is allowed to crawl. Requests for URLs that do not belong to one of the domains listed here will not be followed.

parse(self, response): The callback that Scrapy invokes with the response once a URL has been successfully crawled.

 

Use the command below to launch this spider. Make sure you are in the project directory before running it.

 

$ scrapy crawl imdb

Take note that the name of the spider is an argument to the above command.

Step 3: Scraping IMDb

Let's now extract every entry from IMDb's Top 250 movies table, including the title, year, and rating.

Open the spider imdb.py we generated earlier and update it as follows.

# importing scrapy
import scrapy


class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    def parse(self, response):
        # one <tr> row per movie in the Top 250 table
        rows = response.css('table[data-caller-name="chart-top250movie"] tbody[class="lister-list"] tr')
        for row in rows:
            # get the required text from each element
            yield {
                "title": row.css("td[class='titleColumn'] a::text").extract_first(),
                "year": row.css("td[class='titleColumn'] span::text").extract_first().strip("() "),
                "rating": row.css("td[class='ratingColumn imdbRating'] strong::text").extract_first(),
            }

Run the IMDb spider shown above:

$ scrapy crawl imdb

The output will be as follows:

{'title': 'The Shawshank Redemption', 'year': '1994', 'rating': '9.2'}
{'title': 'The Godfather', 'year': '1972', 'rating': '9.1'}
…
{'title': 'Swades', 'year': '2004', 'rating': '8.0'}
{'title': 'Song of the Sea', 'year': '2014', 'rating': '8.0'}
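
If you want to persist these items instead of only printing them to the console, Scrapy's feed exports can write them straight to a file, for example as JSON:

$ scrapy crawl imdb -o movies.json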

 

Conclusion

 

For those who enjoy data science, the internet's rapid growth has been a boon. The diversity and volume of data now accessible online is like a hidden treasure trove of puzzles and mysteries. Web scraping is a method for obtaining data from online sources, and you can store the structured data it gives you in any format. AI and ML systems can then use this data; web scraping suits them well because it can deliver large volumes of clean data in bulk. A variety of tools and libraries can be used for web scraping, and if you wish to learn how to scrape data from the web, Scrapy, one of the most popular web scraping frameworks, is an excellent option.

 

Scrapy is a web crawling and scraping framework created to collect structured data from websites, though it can also be used to automatically test and monitor web applications. Created in 2008 and written entirely in Python, Scrapy handles numerous requests concurrently thanks to its asynchronous design.

 

Scrapy's architecture is built around "spiders," autonomous crawlers that are given a set of instructions. With the framework's robust features, such as AutoThrottle, rotating proxies, and user agents, you can keep your footprint low while scraping the web. Scrapy also offers an interactive crawling shell that programmers can use to test their assumptions about a website's behavior.
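
As a rough illustration of how such features are switched on, the AutoThrottle extension is enabled through a handful of entries in the project's settings.py (the values below are only illustrative, not recommendations):

# settings.py - illustrative AutoThrottle configuration
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # maximum delay when the server responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average number of requests to send in parallel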

 
