
Some Small Tricks When Crawling Data

JavaScript Variables Can Contain the Necessary Data When Crawling JavaScript-Rendered Websites

When crawling JavaScript-rendered websites, as in the previous article, we have to use an engine that executes the JavaScript so it renders HTML before we can analyze anything. However, many websites already embed the data we need in the HTML they first load. For example, when YouTube loads its HTML, the header contains code like the below:

var ytInitialData = {"responseContext":{"serviceTrackingParams":[{"service":"GFEEDBACK","params":[{"key":"route","value":"channel."},{"key":"is_owner","value":"false"},{"key":"is_alc_surface","value":"false"},{"key":"browse_id","value":"UCHP_xaIhyLPZv90otD2SehA"},{"key":"browse_id_prefix","value":""},{"key":"logged_in","value":"1"},{"key":"e","value":"9405957,23804281,23966208,23986031,24004644,24077241,24108448,24166867,24181174,24241378,24290971,24425061,24439361,24453989,24495712,24542367,24548629,24566687,24699899,39325798,39325815,39325854,39326587,39326596,39326613,39326617,39326681,39326965,39327050,39327093,39327367,39327561,39327571,39327591,39327598,39327635,51009781,51010235,51017346,51020570,51025415,51030101,51037344,51037349,51041512,51050361,51053689,51057842,51057855,51063643,51064835,51072748,51091058,51095478,51098297,51098299,51101169,51111738,51112978,51115184,51124104,51125020,51133103,51134507,51141472,51144926,51145218,51151423,51152050,51153490,51157411,51157430,51157432,51157841,51157895,51158514,51160545,51162170,51165467,51169118,51176511,51177818,51178310,51178329,51178344,51178353,51178982,51183909,51184990,51186528,51190652,51194136,51195231,51204329,51209050,5..

This variable holds a great deal of data, much of which we would otherwise have to extract from somewhere else: the video's name, description, endpoint, view count, publication date, and so on. The snippet above is only a small part of ytInitialData that I captured; the full object is far too large to show here.
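
As a minimal sketch (the channel URL is just an example, and the regex is a simplification that may need adjusting if YouTube changes its markup), the variable can be pulled straight out of the raw HTML and parsed as JSON:

import json
import re

import requests

# Fetch the raw HTML; no JavaScript execution is needed
html = requests.get("https://www.youtube.com/@YouTube").text

# ytInitialData is assigned as a JSON object literal inside a script tag
match = re.search(r'var ytInitialData = ({.*?});', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(list(data.keys()))  # top-level keys, e.g. 'responseContext', 'contents'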

Using Cached Copies of Websites

Cache services like Google Cache or the Web Archive have already visited and crawled these websites once and saved snapshots of them. We can crawl the information from those snapshots instead of the live site.
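
For example, the Wayback Machine exposes a public availability API that returns the closest saved snapshot of a URL; a minimal sketch:

import requests

# Ask the Wayback Machine for the closest archived snapshot of a URL
resp = requests.get("https://archive.org/wayback/available",
                    params={"url": "example.com"})
snapshot = resp.json().get("archived_snapshots", {}).get("closest")
if snapshot:
    archived_html = requests.get(snapshot["url"]).text  # crawl this instead of the live page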

Scrapy Getting Forbidden by robots.txt

Since Scrapy 1.1 (released 2016-05-11), the downloader fetches the robots.txt file before crawling starts and obeys it by default. To change this behavior, edit settings.py:

ROBOTSTXT_OBEY = False

=> This touches on copyright and terms-of-service issues: robots.txt is how sites declare what may be crawled. Setting ROBOTSTXT_OBEY = False lets the crawler continue on websites whose robots.txt would otherwise forbid it.

Types of Facebook Pages

Facebook serves several versions of its pages (www, m, mbasic, …). Depending on the needs and purpose of the data extraction, we can use the appropriate one. For example, Jsoup in Java cannot crawl JavaScript-generated pages, so crawling the mbasic version directly, which is plain server-rendered HTML, is very suitable.
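
As an illustration, and assuming the mbasic front end is still served for your account and region, the static HTML can be fetched without any JavaScript engine; the cookie values here are placeholders you would copy from a logged-in browser session:

import requests
from bs4 import BeautifulSoup

# Placeholder session cookies taken from a logged-in browser
cookies = {"c_user": "YOUR_USER_ID", "xs": "YOUR_XS_TOKEN"}

html = requests.get("https://mbasic.facebook.com/", cookies=cookies).text
soup = BeautifulSoup(html, "html.parser")  # plain server-rendered HTML, no JS needed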

Small Trick to Crawl Facebook

Anything that runs too fast or operates continuously in an abnormal pattern will eventually be checkpointed, so we need to combine several measures to minimize the risk (a combined sketch follows the list):

  • Rotate between many cookies, combined with rotating IPs
  • Use cookies instead of tokens
  • Slow the program down with a delay after each operation
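
A rough sketch of combining these measures (the cookie values, proxy addresses, and target URLs are placeholders, and the right delays depend on the site):

import itertools
import random
import time

import requests

# Placeholder pools of session cookies and proxies to rotate through
cookie_pool = itertools.cycle([{"c_user": "id1", "xs": "xs1"},
                               {"c_user": "id2", "xs": "xs2"}])
proxy_pool = itertools.cycle(["http://proxy1:8080", "http://proxy2:8080"])
urls_to_crawl = ["https://example.com/page1", "https://example.com/page2"]

for url in urls_to_crawl:
    proxy = next(proxy_pool)
    resp = requests.get(url, cookies=next(cookie_pool),
                        proxies={"http": proxy, "https": proxy})
    time.sleep(random.uniform(3, 10))  # slow down after each operation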

Bypass Cloudflare

Use the library https://pypi.org/project/undetected-chromedriver/2.1.1/. I’ve tried it and it’s very effective.
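
A minimal usage sketch (API as in recent versions of the library; the target URL is a placeholder):

import undetected_chromedriver as uc

driver = uc.Chrome()  # a patched Chrome that avoids common automation checks
driver.get("https://site-behind-cloudflare.example")
print(driver.page_source[:200])
driver.quit()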

User Agent

The User-Agent header matters when crawling, especially on large websites like Facebook and YouTube. These platforms all build separate versions of their pages for different User-Agents, mobile or PC. Choosing an appropriate User-Agent makes pages much easier to parse (try the User-Agents of Googlebot, Bingbot, or an old mobile browser).
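
For example, with requests (note that some sites verify crawler User-Agents against known IP ranges, so Googlebot's will not always be honored):

import requests

headers = {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}
html = requests.get("https://example.com", headers=headers).text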

Headless Mode of Selenium

Headless mode is not always effective when crawling with Selenium.

Selenium simulates human operations almost completely, which helps bypass many websites' blocking. However, crawling with Selenium normally means opening a browser, which causes errors when deploying the application to a server without a UI. The obvious fix is Selenium's --headless flag, which lets the application run without opening a browser window, but it doesn't completely solve the problem: some websites still detect --headless mode and block you immediately.

The methods I propose are:

  • Use a Remote WebDriver:

from selenium import webdriver

options = webdriver.ChromeOptions()
# server is the address of a running Selenium server, e.g. "http://127.0.0.1:4444/wd/hub"
driver = webdriver.Remote(command_executor=server, options=options)

  • Use an xvfb virtual display:

DISPLAY=:1 xvfb-run --auto-servernum --server-num=1 python sele.py

This article is just a collection of small crawling tricks, so I've only described each one briefly. In dedicated articles on each topic, I'll explain usage and installation more clearly.

Removing HTML Tags

If, after parsing an HTML page and extracting data, some HTML tags still remain in the text, use the following function to remove them.

import re

def remove_html_tag(html_content):
    # Pass None through unchanged so callers don't have to pre-check
    if html_content is None:
        return None

    # Drop everything that looks like an HTML tag: '<', non-'>' characters, '>'
    text = re.sub(r'<[^>]+>', '', html_content)
    return text
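
For example:

print(remove_html_tag("<p>Hello <b>world</b>!</p>"))  # -> Hello world!
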
This post is licensed under CC BY 4.0 by the author.