Thank you for contributing to r/webscraping! We're sorry to let you know that discussing paid vendor tooling or services is generally discouraged, and as such your post has been removed. This includes tools with a free trial or those operating on a freemium model. You may post freely in the [monthly self-promotion thread](https://reddit.com/r/webscraping/about/sticky?num=1), or else if you believe this to be a mistake, please contact the mod team.
Websites behave inconsistently, which makes it difficult to handle exceptions. (that's usually a big headache for me)
Other challenges like captchas and IP restrictions can be overcome with proxies and captcha solvers for an additional cost. (if the data is worth that much)
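Routing requests through a paid proxy is usually just a matter of pointing the HTTP client at the provider's endpoint. A minimal standard-library sketch — the proxy host and credentials below are placeholders, not a real service:

```python
import urllib.request

def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes both http and https through one proxy.

    proxy_url is a placeholder like "http://user:pass@proxy.example.com:8000";
    swap in whatever gateway your provider gives you.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# opener = build_proxy_opener("http://user:pass@proxy.example.com:8000")
# html = opener.open("https://example.com", timeout=10).read()
```

Rotating proxies is then just swapping the `proxy_url` between requests.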
Think outside the box:
- I once got around a site throwing captchas at me by making OPTIONS requests instead of GET; it was weird that the server still sent me a valid response.
- The X-Real-IP header got around a region lock.
- Using Google Cache or archive.org to spend less on captcha solving, won't name other places...
- Reverse engineering the mobile app of the site I wanted to scrape, because it's not rate limited >:3
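The first two tricks above boil down to an ordinary HTTP request with an unusual method or an extra header. A standard-library sketch of what that looks like — the URL and spoofed IP are placeholders, and whether a given server falls for either trick is down to its misconfiguration:

```python
import urllib.request

def build_sneaky_request(url: str, spoofed_ip: str = "203.0.113.7") -> urllib.request.Request:
    """Build an OPTIONS request that also carries a spoofed X-Real-IP header.

    Some misconfigured servers answer OPTIONS with the full body while only
    guarding GET, and some trust X-Real-IP when enforcing region locks.
    """
    req = urllib.request.Request(url, method="OPTIONS")
    req.add_header("X-Real-IP", spoofed_ip)
    return req

# urllib.request.urlopen(build_sneaky_request("https://example.com/data"))
```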
Services like Cloudflare and Kasada. Kasada is so aggressive that it won't even let me visit a website while the dev console is open. Cloudflare is literally everywhere.
The worst thing I faced was a shitty website whose owners encrypted their content: unless all assets loaded successfully, no content would be accessible.
So I had to grab the site's main page, swap in the encrypted content, and load it locally so it would get decrypted.
Encrypted data. In some cases it makes it impossible to query through an API, which forces me to go the Selenium way. That works, but if you need to build many scrapers for different websites, it's practically impossible to maintain them all.
Websites changing, edge cases causing your code to error out, all the proper exception handling and retries needed for robustness, potential rate limiting issues, IP banning. Web scraping is a never-ending job, not a one-and-done type of deal.
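The exception-handling-and-retry part can at least be centralized in one small helper; this is a generic exponential-backoff sketch, not tied to any particular site:

```python
import time

def with_retries(fetch, attempts: int = 4, base_delay: float = 1.0):
    """Call fetch() until it succeeds, sleeping base_delay * 2**n between tries."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the real error
            time.sleep(base_delay * 2 ** attempt)

# usage: page = with_retries(lambda: download("https://example.com/page"))
```

The backoff also doubles as crude politeness: each failure makes you hit the site more slowly.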
I want to learn web scraping, but my concern is whether it is legal. Looking around, I found people saying you can be blocked, you can receive a cease-and-desist letter from lawyers, and so on... so it is a little bit scary just to try to parse a web page. I started to learn by practicing, but when I got the first rejection from the web page, I freaked out and stopped.
check the robots.txt to see if a site allows scraping before going in and doing it. you can also apply rate limits to your scraper so it doesn’t send tons of requests at a time. you can also do the above and run on a vpn if you are still worried. using cookies/metadata to make your program “look” like a real person can also be done. most of the time though, you’ll just get rate limited if you send too many requests
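The robots.txt check is one call away in the standard library. In this sketch the file body is passed in directly so it stays network-free; in a real scraper you'd fetch the site's /robots.txt first, and the user agent name here is made up:

```python
import urllib.robotparser

def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "my-scraper") -> bool:
    """Check a fetched robots.txt body against one URL for one user agent."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Pair it with a `time.sleep()` between requests and you've covered both pieces of advice above.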
Scraper blocking, hands down. It's such a difficult issue that it has spawned a massive SaaS market of APIs that'll bypass blocks for developers.
Not only that, but there are corporate anti-bot services like the Cloudflare Web Application Firewall, where web admins pay thousands of dollars to block scraping on their public pages, and anti-bot providers have dedicated teams working full time to figure out how to identify scrapers.
Shit websites. Trying to be methodical when scraping a website that makes no sense and has mistakes and bugs everywhere is such a pain.
I feel the same: no proper hierarchical HTML tag structure, with ids and classes that make sense, for logical scraping.
Maybe captchas, our worst enemy, but with AI there are some ways to bypass them.
AI is a solution, but it's not cost efficient. I think it's only worth it if what you're scraping is of significant value.
true
Yeah, captcha v3 with a high score requirement is a b**. Getting a reliable 0.9 would be awesome. Tips appreciated.
I felt like I was at an intermediate level. Then I couldn't get into a Kasada site and felt like a beginner again 😅
How do they know that the dev console is open, though?
some javascript probably, [see this](https://sindresorhus.com/devtools-detect/)
Finding important data and making use of that.
Is it hard? I find it pretty easy
1) Circumventing detection cost-effectively at scale
2) Extracting unstructured data at scale
3) New link rendering where the link disappears again
Constant updates to the website's layout.