Websites today are harder to extract data from than they used to be, because they are structured very differently. Pages no longer have clearly delimited parts that are quick to parse, and many medium and large websites have also added measures to keep their data from being crawled.
However, scraping can never be avoided completely. Part of the reason is that website owners don’t actually want to block 100% of crawling. Some of the reasons are listed below:
There’s no way to completely prevent web scraping, so many people who understand this don’t try too hard to prevent it: short of shutting down the server, as long as your website is public on the Internet and people can access it, it can be scraped.
Impact on user experience: A very common way to block bots is a captcha that users must pass every time they enter the website. This keeps a large number of bots out, but it also affects real users a lot. Think about it: if a website forced you to solve a captcha too many times, how would you feel? I would close it immediately.
Protection costs a lot: Using a third-party service like Cloudflare to protect your website from bots can cost a significant amount that not everyone is willing to pay. And if you don’t use a third-party service, you have to spend a fairly large amount of money building the protection yourself and then testing it.
Impact on search ranking: As I said, most website owners don’t like us scraping their data, but there is one kind of bot they really like and even have to clear the way for: Google’s search engine crawler (and the crawlers of other search engines). If you use so many blocking techniques that Google can’t read your website’s data, your website is unlikely to rank high.
Legal issues: I don’t want to say too much about this. The general idea is that, as with many other activities on the Internet, there’s no simple answer about the legality of crawling data from websites.
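To make the search ranking point more concrete, here is a minimal sketch of one common way a website clears the way for Google while asking other bots to stay out: a robots.txt file. The robots.txt content, the example.com URL, the "MyScraper" user agent, and the can_crawl helper are all assumptions made up for illustration, not taken from any real site. Keep in mind that robots.txt is only a request that well-behaved crawlers honor; it does not technically block anyone.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything,
# every other user agent is asked to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    print("Googlebot allowed:", can_crawl("Googlebot", page))
    print("MyScraper allowed:", can_crawl("MyScraper", page))
```

With these assumed rules, the check passes for Googlebot and fails for the made-up MyScraper agent, which is the trade-off described above: the stricter the blocking, the more careful the owner has to be not to shut out search engines too.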
Reference: https://finddatalab.com/
Since the website doesn’t have a comment section under articles, you can discuss and send me feedback in this GitHub Discussion: https://github.com/orgs/demanejar/discussions/1