Some initial research about k8s. Relationship between containerd and Docker: the main low-level runtime today is runc, and the main high-level runtime is containerd. graph TD;...
Installing Selenium Middleware for Scrapy
The scrapy-selenium library for Scrapy lets Selenium load a website and extract its data before the response is returned to the spider for processing. However, in some cases, for example, I don’t want to us...
Some Small Tricks When Crawling Data
JavaScript Variables Can Contain Necessary Data When Crawling JavaScript-Rendered Websites. When crawling JavaScript-rendered websites, as in the previous article, we have to use a front-end compi...
Spline | Data Lineage Tracking And Visualization Solution
Spline is an open-source tool that automatically tracks data lineage and data pipeline structure. Its most common use is tracking and visualizing data lineage for Spark. Spline Overview Sp...
Airflow HA Guide
In this article, I will walk through installing Airflow and setting up high availability (HA) for it. The environment used is VirtualBox virtual machines. Create 2 virtual machines with fixed addresses, add user and gran...
Crawl Bilingual News with Scrapy and Splash
Learn more about the concepts, advantages, and disadvantages of client-side and server-side rendering in the article Client-Side and Server-Side Rendering. We’ll see that some websites use cli...
Introduction to Scrapy Shell
Every time we write a spider, we have to write many CSS selectors and XPath expressions to extract information, and often we don’t know whether they’re correct. Each time that happens, we have to r...
PHP Scraper
When people talk about crawling, the focus is usually on Python and Python-based tools such as Scrapy, Beautiful Soup, or Selenium. In this article, let’s go a bit off-topic...
Crawl 1000 News Websites with Scrapy and MySQL
If we write one spider per website to extract information, it will be very time-consuming, especially for news websites. There are thousands of different news websites and they’re still growing ...
Configuring Proxy for Scrapy Project
Proxies are probably a familiar concept to everyone by now. For people who crawl data, a proxy is like an inseparable companion. In this article, I will show how to configure prox...