How to Scrape Amazon Product Data?
A web scraper automates data extraction projects that gather information from websites. In this blog, we focus on extracting Amazon product data via web scraping, mainly product details and product price listings. We will build a simple web scraper using SelectorLib and Python, and then test it by collecting Amazon product data.
We will use Python 3 to create the Amazon scraper, since the code will not run on Python 2.7. To begin the data scraping project, first install Python 3 and pip on your computer, then install the required packages with pip3:
pip3 install requests selectorlib
Scraping Product Details from the Amazon Product Page
The product page scraper for Amazon will scrape the following mentioned elements:
Product Name
Product Price
Product Description (Short/Full)
Product Image URLs
Product Rating
Product Reviews
Product Sales Ranking
The Code
For this, create a folder called Amazon Scraper and paste your SelectorLib YAML template file into it as selectors.yml. Then create a file named amazon.py and place the code below in it. It reads the Amazon product URLs from the urls.txt file, scrapes each page, and stores the data in JSON Lines format.
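The full product page template is not reproduced in this post, so here is a minimal sketch of what a selectors.yml might look like. The specific CSS selectors (such as #productTitle) are assumptions based on Amazon's markup at the time of writing and will need updating whenever Amazon changes its page layout:

```yaml
name:
    css: '#productTitle'
    type: Text
price:
    css: 'span.a-price span.a-offscreen'
    type: Text
short_description:
    css: '#featurebullets_feature_div'
    type: Text
```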
from selectorlib import Extractor
import requests
import json
from time import sleep

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('selectors.yml')

def scrape(url):
    headers = {
        'authority': 'www.amazon.com',
        'pragma': 'no-cache',
        'cache-control': 'no-cache',
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)
    # Simple check to detect whether the page was blocked (usually a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None
    # Pass the HTML of the page to the extractor
    return e.extract(r.text)

with open("urls.txt", 'r') as urllist, open('output.jsonl', 'w') as outfile:
    for url in urllist.read().splitlines():
        data = scrape(url)
        if data:
            json.dump(data, outfile)
            outfile.write("\n")
            # sleep(5)
Running the Amazon Product Page Scraper
Just start your scraper with the following command:
python3 amazon.py
Once the scraping run completes, you should find a file named output.jsonl with the data, for example:
{
    "name": "2020 HP 15.6\" Laptop Computer, 10th Gen Intel Quard-Core i7 1065G7 up to 3.9GHz, 16GB DDR4 RAM, 512GB PCIe SSD, 802.11ac WiFi, Bluetooth 4.2, Silver, Windows 10, YZAKKA USB External DVD + Accessories",
    "price": "$959.00",
    "short_description": "Powered by latest 10th Gen Intel Core i7-1065G7 Processor @ 1.30GHz (4 Cores, 8M Cache, up to 3.90 GHz); Ultra-low-voltage platform. Quad-core, eight-way processing provides maximum high-efficiency power to go.\n15.6\" diagonal HD SVA BrightView micro-edge WLED-backlit, 220 nits, 45% NTSC (1366 x 768) Display; Intel Iris Plus Graphics\n16GB 2666MHz DDR4 Memory for full-power multitasking; 512GB Solid State Drive (PCI-e), Save files fast and store more data. With massive amounts of storage and advanced communication power, PCI-e SSDs are great for major gaming applications, multiple servers, daily backups, and more.\nRealtek RTL8821CE 802.11b/g/n/ac (1×1) Wi-Fi and Bluetooth 4.2 Combo; 1 USB 3.1 Gen 1 Type-C (Data Transfer Only, 5 Gb/s signaling rate); 2 USB 3.1 Gen 1 Type-A (Data Transfer Only); 1 AC smart pin; 1 HDMI 1.4b; 1 headphone/microphone combo\nWindows 10 Home, 64-bit, English; Natural silver; YZAKKA USB External DVD drive + USB extension cord 6ft, HDMI cable 6ft and Mouse Pad\n› See more product details",
    "images": "{\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX425_.jpg\":[425,425],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX466_.jpg\":[466,466],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY355_.jpg\":[355,355],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX569_.jpg\":[569,569],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SY450_.jpg\":[450,450],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX679_.jpg\":[679,679],\"https://images-na.ssl-images-amazon.com/images/I/61CBqERgZ7L._AC_SX522_.jpg\":[522,522]}",
    "variants": [
        {
            "name": "Click to select 4GB DDR4 RAM, 128GB PCIe SSD",
            "asin": "B01MCZ4LH1"
        },
        {
            "name": "Click to select 8GB DDR4 RAM, 256GB PCIe SSD",
            "asin": "B08537NR9D"
        },
        {
            "name": "Click to select 12GB DDR4 RAM, 512GB PCIe SSD",
            "asin": "B08537ZDYH"
        },
        {
            "name": "Click to select 16GB DDR4 RAM, 512GB PCIe SSD",
            "asin": "B085383P7M"
        },
        {
            "name": "Click to select 20GB DDR4 RAM, 1TB PCIe SSD",
            "asin": "B08537NDVZ"
        }
    ]
}
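Since each line of output.jsonl is an independent JSON object, you can load the results back into Python with a few lines of code. A small sketch (the helper name load_jsonl is our own, not part of the scripts above):

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per line, skipping blank lines."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

This makes it easy to post-process the scraped data, for example to compare prices across products.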
Scraping Amazon Products from the Search Results Page
Now it is time to scrape products from the Amazon search results page. This search results scraper extracts the key fields of each product, such as its name, price, and rating. The code is quite similar to the product page scraper above.
Mark Up the Data Fields Using SelectorLib
To keep it separate from the previous template, save this SelectorLib YAML file as search_results.yml:
products:
    css: 'div[data-component-type="s-search-result"]'
    xpath: null
    multiple: true
    type: Text
    children:
        title:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Text
        url:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Link
        rating:
            css: 'div.a-row.a-size-small span:nth-of-type(1)'
            xpath: null
            type: Attribute
            attribute: aria-label
        reviews:
            css: 'div.a-row.a-size-small span:nth-of-type(2)'
            xpath: null
            type: Attribute
            attribute: aria-label
        price:
            css: 'span.a-price:nth-of-type(1) span.a-offscreen'
            xpath: null
            type: Text
The Code
The code is almost identical to the previous scraper; the difference is that each product is saved as a separate line. Create a file named searchresults.py and place the code below in it. It reads URLs from the file search_results_urls.txt and stores the extracted data in a JSON Lines file named search_results_output.jsonl.
from selectorlib import Extractor
import requests
import json
from time import sleep

# Create an Extractor by reading from the YAML file
e = Extractor.from_yaml_file('search_results.yml')

def scrape(url):
    headers = {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.com/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }

    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)
    # Simple check to detect whether the page was blocked (usually a 503)
    if r.status_code > 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None
    # Pass the HTML of the page to the extractor
    return e.extract(r.text)

with open("search_results_urls.txt", 'r') as urllist, open('search_results_output.jsonl', 'w') as outfile:
    for url in urllist.read().splitlines():
        data = scrape(url)
        if data:
            for product in data['products']:
                product['search_url'] = url
                print("Saving Product: %s" % product['title'])
                json.dump(product, outfile)
                outfile.write("\n")
            # sleep(5)
Running the Amazon Scraper to Scrape Search Results
You can simply start with your scraper by typing the given command:
python3 searchresults.py
After the scraping procedure completes, you should find a file named search_results_output.jsonl with the data. An example of a search results URL:
https://www.amazon.com/s?k=laptops
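If you want to scrape several keywords or several result pages, you can generate search_results_urls.txt programmatically instead of writing it by hand. A small sketch; the search_url helper and the keyword list are our own illustrative choices, while k and page are the query parameters Amazon search URLs use:

```python
from urllib.parse import quote_plus

def search_url(keyword, page=1):
    """Build an Amazon search results URL for a keyword and page number."""
    url = "https://www.amazon.com/s?k=%s" % quote_plus(keyword)
    if page > 1:
        url += "&page=%d" % page
    return url

# Example: two keywords, first two result pages each
keywords = ["laptops", "wireless mouse"]
with open("search_results_urls.txt", "w") as f:
    for kw in keywords:
        for page in range(1, 3):
            f.write(search_url(kw, page) + "\n")
```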
Retry, Retry, Retry
If you face scraping challenges or are blocked by Amazon, make sure you retry the request rather than giving up on the URL. The scripts above simply skip a blocked page, so in practice you should add retry logic, for example by keeping a queue of failed URLs and attempting them again after a delay. Retrying this way considerably improves your chances of scraping Amazon product data successfully.
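A minimal retry wrapper can be sketched as follows; the function name, retry count, and delay are illustrative choices, not part of the scripts above:

```python
import time
import requests

def get_with_retries(url, headers=None, max_retries=3, backoff=5):
    """Fetch a URL, retrying a few times with a fixed delay between attempts."""
    for attempt in range(max_retries):
        try:
            r = requests.get(url, headers=headers)
            if r.status_code == 200:
                return r
            print("Attempt %d failed with status %d" % (attempt + 1, r.status_code))
        except requests.exceptions.RequestException as exc:
            print("Attempt %d raised %s" % (attempt + 1, exc))
        time.sleep(backoff)
    return None  # give up after max_retries attempts
```

You could then replace the `requests.get(url, headers=headers)` call in either scraper with `get_with_retries(url, headers=headers)` and treat a `None` return as a permanently failed URL.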
How Can ITS Help You With Web Scraping Services?
Information Transformation Service (ITS) offers a variety of professional web scraping services delivered by experienced crew members and technical software. If you are interested in ITS web scraping services, you can ask for a free quote!