
Crawl Bilingual News with Scrapy and Splash

For background on the concepts, advantages, and disadvantages of client-side and server-side rendering, see the article Client-Side and Server-Side Rendering.

Some websites use client-side rendering: their HTML is generated in the user’s browser. When crawling such a site, even though F12 shows all of the page’s elements, the response we download is mostly JavaScript, and the elements we need are nowhere to be found.

Browsers can handle this because they have a JavaScript engine. After downloading the page, the engine executes the JavaScript, which fetches the necessary data, generates the HTML, and renders it for the user. For websites rendered on the client side like this, we can collect data in two ways:

    1. Use the API. Dynamically rendered websites usually fetch their data from an API. Press F12 and watch the Network tab to find the API the site calls to get its data.
    2. Do what the browser does: run the JavaScript through an engine to render the HTML before parsing it. The two most popular tools are Selenium and Splash; Scrapy is usually paired with Splash. For a comparison of Selenium and Splash, see https://webscraping.fyi/lib/compare/python-selenium-vs-python-splash/
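Method 1 boils down to replaying the request you find in the Network tab and parsing the JSON it returns. A minimal sketch of the parsing step, using a made-up payload (the field names here are hypothetical, purely for illustration):

```python
import json

# Hypothetical JSON payload, similar in shape to what an API
# discovered in the Network tab might return.
payload = '{"items": [{"en": "Hello", "vi": "Xin chào"}]}'

data = json.loads(payload)
for item in data["items"]:
    print(item["en"], "-", item["vi"])
```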

In this article, we’ll collect data from some bilingual websites that are dynamically generated by Javascript. The demonstration website is https://toomva.com. You can view the entire project at https://github.com/trannguyenhan/bilingualcrawl-vietnamese-english.

Installing scrapy-splash Library

pip install scrapy scrapy-splash

Installing and Running Splash

sudo docker pull scrapinghub/splash
sudo docker run -p 8050:8050 scrapinghub/splash
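With the container listening on port 8050, Splash's render.html endpoint takes the target page as a url query parameter (wait is an optional render delay in seconds). A quick sketch of how such a request URL is built, using toomva.com as the target:

```python
from urllib.parse import urlencode

SPLASH = "http://localhost:8050/render.html"
params = {"url": "https://toomva.com", "wait": 2}

# Splash will load the page, wait 2 seconds, and return rendered HTML.
render_url = f"{SPLASH}?{urlencode(params)}"
print(render_url)
# http://localhost:8050/render.html?url=https%3A%2F%2Ftoomva.com&wait=2
```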

Adding Splash Middleware in settings.py

SPIDER_MIDDLEWARES = {
#    'bilingualcrawl.middlewares.BilingualcrawlSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
#    'bilingualcrawl.middlewares.BilingualcrawlDownloaderMiddleware': 543,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
}

Adding Required Splash Configuration

SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
SPLASH_COOKIES_DEBUG = False

Using Scrapy Splash in Spider

In our spiders, instead of scrapy.Request we use SplashRequest, so that requests go through Splash and the HTML is rendered before parsing:

# at the top of the spider module: from scrapy_splash import SplashRequest
def parse_website(self, response):
    lst = response.css('.grid-search-video').css('a').xpath('@href').getall()
    for itm in lst:
        yield SplashRequest(url=response.urljoin(itm), callback=self.parse)
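The hrefs pulled out of @href can be site-relative, so they should be absolutized before being requested. urllib.parse.urljoin (which Scrapy's response.urljoin wraps) handles both relative and already-absolute links:

```python
from urllib.parse import urljoin

# Hypothetical listing-page URL; the extracted hrefs may be
# relative or absolute, and urljoin handles both cases.
base = "https://toomva.com/video/search?q=news"
for href in ["/video/12", "https://toomva.com/video/13"]:
    print(urljoin(base, href))
# https://toomva.com/video/12
# https://toomva.com/video/13
```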

Testing Splash with curl or Postman

curl --location --request GET 'http://localhost:8050/render.html?url=https://demanejar.github.io/posts/add-proxy-to-scrapy-project/'

Read more details about Splash at https://splash.readthedocs.io/.

This post is licensed under CC BY 4.0 by the author.