
How to Scrape Pinterest Using a Web Scraper?

Pinterest is one of the most popular social media platforms, a place where users discover new ideas and images by searching for a specific keyword or phrase. It is an excellent platform for sharing images in almost every niche, as its more than 400 million active users confirm. If you are looking for visual content in particular, Pinterest is the right place for you: it is a leading image-sharing platform sitting on a large dataset, provided you have the right tools to extract it.

If you like any Pinterest posts and pins, you can collect their information through web scraping. In this blog, we will show you how to scrape Pinterest successfully using simple coding skills. You will also learn what Pinterest scraping means, how to build a useful Pinterest scraper, and which premade Pinterest scrapers are already on the market.

Our prime focus is on using a scraper, because manual extraction cannot capture all the information on Pinterest without wasting time, effort, and money, and manually collected data is prone to error. You should also consider that Pinterest does not provide a public API suited to bulk data collection, so there is only one reliable option left: you guessed it, web scraping. Be aware of Pinterest's anti-spam system, however; getting around it takes some practice. Let us go over what Pinterest web scraping is all about!

What Is Pinterest Scraping?

The idea is to access Pinterest data using a computer bot, typically known as a web scraper. A web scraper can gather both textual and visual data for your business, and it is the best way to extract information from websites that do not offer a usable data API, which is exactly the situation here. Pinterest does not encourage web scrapers, but that does not make scraping the site illegal, since the data is publicly available. One challenge is that a large share of the visual data is copyrighted. Another is the platform's resistance to outside bots: Pinterest runs an anti-spam system that detects web scrapers and will eventually block you if it discovers yours.

Pinterest can easily track you through your IP address and the cookie it drops in your browser. The cookie is the smaller problem, since you can simply refuse or delete it; to beat the IP tracking, you will need proxies. You will also need to gear up against CAPTCHAs.
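
Since the scraper later in this article drives Firefox through Selenium, one way to route its traffic through a proxy is to set the proxy preferences on the Firefox profile before starting the browser. This is a minimal sketch assuming Selenium 3.x; proxy.example.com and port 8080 are placeholders for whatever proxy provider you use:

# Sketch: route the Firefox driver through an HTTP proxy.
# PROXY_HOST and PROXY_PORT are placeholders for your own proxy endpoint.
from selenium import webdriver

PROXY_HOST = "proxy.example.com"  # hypothetical proxy server
PROXY_PORT = 8080                 # hypothetical port

profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
profile.set_preference("network.proxy.http", PROXY_HOST)
profile.set_preference("network.proxy.http_port", PROXY_PORT)
profile.set_preference("network.proxy.ssl", PROXY_HOST)  # proxy HTTPS traffic too
profile.set_preference("network.proxy.ssl_port", PROXY_PORT)

browser = webdriver.Firefox(firefox_profile=profile)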

 

How to Scrape Pinterest Using Selenium and Python?

Even if you are not a seasoned coder, the full-length code below can help you sharpen your skills and build a customized Pinterest web scraper. Once you have decided to scrape Pinterest, the first thing to do is check whether you can access the information with JavaScript turned off, because that determines the framework you will use. Pinterest renders its content with JavaScript, so in Python you will need Selenium rather than BeautifulSoup and Requests combined: Requests cannot render JavaScript, while Selenium automates a real browser. With Selenium you can open the Pinterest page, let it render, then use Selenium's API to locate the information you are interested in and save it to a file. The quick check below shows why Requests alone is not enough; after that, you can run the full scraper code.
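
As a quick illustration of the JavaScript problem, here is a sketch that fetches a Pinterest search page with plain Requests and counts the img tags in the raw HTML; the search URL and User-Agent header are just examples. With no JavaScript executed, you will see far fewer images than a real browser renders:

# Sketch: inspect what Requests sees without JavaScript. The URL is an example.
import requests

resp = requests.get(
    "https://www.pinterest.com/search/pins/?q=landscape",
    headers={"User-Agent": "Mozilla/5.0"},  # browser-like UA to avoid an instant block
    timeout=30,
)
resp.raise_for_status()

# Count image tags in the unrendered HTML; a real browser shows many more
# once Pinterest's JavaScript has populated the grid.
print(resp.text.count("<img"))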

 

The Code

# NOTE: this code targets Python 3 with Selenium 3.x. Selenium 4 removed the
# find_element_by_* helpers (use find_element(By.NAME, ...) instead) and
# dropped PhantomJS support, so pin your versions accordingly.
from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.keys import Keys
import time, random, socket, unicodedata
import copy, os
import pandas as pd
import requests

try:
    from urlparse import urlparse                 # Python 2
except ImportError:
    from six.moves.urllib.parse import urlparse   # Python 3

def download(myinput, mydir="./"):
    # Download a single image URL, or a list of them, into mydir.
    if isinstance(myinput, bytes):
        myinput = myinput.decode("utf-8")
    if isinstance(myinput, str):
        # http://automatetheboringstuff.com/chapter11/
        res = requests.get(myinput)
        res.raise_for_status()
        # https://stackoverflow.com/questions/18727347/how-to-extract-a-filename-from-a-url-append-a-word-to-it
        outfile = mydir + "/" + os.path.basename(urlparse(myinput).path)
        with open(outfile, "wb") as playFile:
            for chunk in res.iter_content(100000):
                playFile.write(chunk)
    elif isinstance(myinput, list):
        for i in myinput:
            download(i, mydir)

def phantom_noimages():
    # Headless PhantomJS driver with a random user agent and image loading off.
    from fake_useragent import UserAgent
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    ua = UserAgent()
    # https://stackoverflow.com/questions/29916054/change-user-agent-for-selenium-driver
    caps = DesiredCapabilities.PHANTOMJS
    caps["phantomjs.page.settings.userAgent"] = ua.random
    return webdriver.PhantomJS(service_args=["--load-images=no"],
                               desired_capabilities=caps)

def randdelay(a, b):
    # Sleep a random interval to make the scraper look less robotic.
    time.sleep(random.uniform(a, b))

def u_to_s(uni):
    # Normalize a URL down to a plain ASCII string.
    return unicodedata.normalize("NFKD", uni).encode("ascii", "ignore").decode("ascii")

class Pinterest_Helper(object):

    def __init__(self, login, pw, browser=None):
        if browser is None:
            # http://tarunlalwani.com/post/selenium-disable-image-loading-different-browsers/
            profile = webdriver.FirefoxProfile()
            profile.set_preference("permissions.default.image", 2)  # skip image loading
            self.browser = webdriver.Firefox(firefox_profile=profile)
        else:
            self.browser = browser
        # Log in to Pinterest with the supplied credentials.
        self.browser.get("https://www.pinterest.com")
        emailElem = self.browser.find_element_by_name("id")
        emailElem.send_keys(login)
        passwordElem = self.browser.find_element_by_name("password")
        passwordElem.send_keys(pw)
        passwordElem.send_keys(Keys.RETURN)
        randdelay(2, 4)

    def getURLs(self, urlcsv, threshold=500):
        # Scrape every page URL listed in a CSV and pool the image URLs.
        tmp = self.read(urlcsv)
        results = []
        for t in tmp:
            tmp3 = self.runme(t, threshold)
            results = list(set(results + tmp3))  # de-duplicate as we go
        random.shuffle(results)
        return results

    def write(self, myfile, mylist):
        # Save a list of image URLs to a CSV file, one per row.
        tmp = pd.DataFrame(mylist)
        tmp.to_csv(myfile, index=False, header=False)

    def read(self, myfile):
        # Read a one-column CSV of URLs back into a list.
        tmp = pd.read_csv(myfile, header=None).values.tolist()
        return [row[0] for row in tmp]

    def runme(self, url, threshold=500, persistence=120, debug=False):
        # Scroll the page with PAGE_DOWN keystrokes, collecting pin image URLs
        # until `threshold` iterations pass or nothing new loads for
        # `persistence` consecutive iterations.
        final_results = []
        previmages = []
        tries = 0
        try:
            self.browser.get(url)
            while threshold > 0:
                try:
                    results = []
                    images = self.browser.find_elements_by_tag_name("img")
                    if images == previmages:   # nothing new loaded this pass
                        tries += 1
                    else:
                        tries = 0
                    if tries > persistence:
                        if debug:
                            print("Exiting: persistence exceeded")
                        return final_results
                    for i in images:
                        src = i.get_attribute("src")
                        if src and src.find("/236x/") != -1:
                            # Swap the 236px thumbnail path for the 736px version.
                            src = src.replace("/236x/", "/736x/")
                            results.append(u_to_s(src))
                    previmages = copy.copy(images)
                    final_results = list(set(final_results + results))
                    # Send PAGE_DOWN to trigger Pinterest's infinite scroll.
                    dummy = self.browser.find_element_by_tag_name("a")
                    dummy.send_keys(Keys.PAGE_DOWN)
                    randdelay(1, 2)
                    threshold -= 1
                except StaleElementReferenceException:
                    if debug:
                        print("StaleElementReferenceException")
                    threshold -= 1
        except (socket.error, socket.timeout):
            if debug:
                print("Socket Error")
        except KeyboardInterrupt:
            return final_results
        if debug:
            print("Exiting at end")
        return final_results

    def runme_alt(self, url, threshold=500, tol=10, minwait=1, maxwait=2, debug=False):
        # Alternative collector: scroll via JavaScript and stop once the page
        # height has not grown for `tol` consecutive iterations.
        final_results = []
        heights = []
        dwait = 0
        try:
            self.browser.get(url)
            while threshold > 0:
                try:
                    results = []
                    images = self.browser.find_elements_by_tag_name("img")
                    cur_height = self.browser.execute_script("return document.documentElement.scrollTop")
                    page_height = self.browser.execute_script("return document.body.scrollHeight")
                    heights.append(int(page_height))
                    if debug:
                        print("Current Height: " + str(cur_height))
                        print("Page Height: " + str(page_height))
                    if len(heights) > tol:
                        if heights[-tol:] == [heights[-1]] * tol:
                            if debug:
                                print("No more elements")
                            return final_results
                        elif debug:
                            print("Min element: {}".format(min(heights[-tol:])))
                            print("Max element: {}".format(max(heights[-tol:])))
                    for i in images:
                        src = i.get_attribute("src")
                        if src and src.find("/236x/") != -1:
                            src = src.replace("/236x/", "/736x/")
                            results.append(u_to_s(src))
                    final_results = list(set(final_results + results))
                    self.browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                    randdelay(minwait, maxwait)
                    threshold -= 1
                except StaleElementReferenceException:
                    if debug:
                        print("StaleElementReferenceException")
                    threshold -= 1
                except (socket.error, socket.timeout):
                    # Back off a little longer after every network hiccup.
                    if debug:
                        print("Socket Error. Waiting {} seconds.".format(dwait))
                    time.sleep(dwait)
                    dwait += 1
        except KeyboardInterrupt:
            return final_results
        if debug:
            print("Exiting at end")
        return final_results

    def scrape_old(self, url):
        # One-shot scrape of whatever images are visible without scrolling.
        results = []
        self.browser.get(url)
        images = self.browser.find_elements_by_tag_name("img")
        for i in images:
            src = i.get_attribute("src")
            if src and src.find("/236x/") != -1:
                src = src.replace("/236x/", "/736x/")
                results.append(u_to_s(src))
        return results

    def close(self):
        self.browser.close()
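
Here is a minimal usage sketch of the helper class above. The email, password, board URL, and file names are placeholders, and Firefox plus geckodriver must be installed for Selenium to start the browser:

# Usage sketch: every credential, URL, and file name here is a placeholder.
if __name__ == "__main__":
    p = Pinterest_Helper("you@example.com", "your-password")
    try:
        # Scroll a board for up to 300 iterations and collect image URLs.
        urls = p.runme("https://www.pinterest.com/SOME_USER/SOME_BOARD/",
                       threshold=300)
        p.write("pins.csv", urls)   # save the URL list to CSV
        download(urls, mydir=".")   # then fetch the full-size images
    finally:
        p.close()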

 

Conclusion

By using this code you can access both the textual and visual content available on Pinterest. If you are interested in ITS Web Scraping Services, ask for a free quote and our data scientists will get back to you within 24 business hours!