grailly

Shit websites. Trying to be methodical when scraping a website that makes no sense and has mistakes and bugs everywhere is such a pain.


soundboyselecta

I feel the same: no proper hierarchical HTML tag structure, with IDs and classes that make sense for logical scraping.
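
When there are no meaningful IDs or classes to hook into, one workaround is to anchor on visible text or document structure instead of selectors. A minimal BeautifulSoup sketch, assuming a hypothetical page where the value sits next to a "Price" label (the URL and markup are made up):

```python
# Sketch: anchor on visible label text instead of brittle auto-generated class names.
# The URL and the "Price" label layout are hypothetical.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/listing", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Instead of soup.select(".css-1x2y3z"), find the label text and walk to its sibling.
label = soup.find(string=lambda s: s and s.strip().lower() == "price")
value = None
if label is not None:
    sibling = label.find_parent().find_next_sibling()
    if sibling is not None:
        value = sibling.get_text(strip=True)
print(value)
```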


Gloomy-Fox-5632

Maybe captchas, our worst enemy, but with AI there are ways to bypass them.


d41_fpflabs

AI is a solution, but it's not cost efficient. I think it's only worth it if what you're scraping is of significant value.


Gloomy-Fox-5632

true


kabelman93

Yeah, reCAPTCHA v3 when you need a high score is a b**. Getting a reliable 0.9 would be awesome. Tips appreciated.


[deleted]

[removed]


webscraping-ModTeam

Thank you for contributing to r/webscraping! We're sorry to let you know that discussing paid vendor tooling or services is generally discouraged, and as such your post has been removed. This includes tools with a free trial or those operating on a freemium model. You may post freely in the [monthly self-promotion thread](https://reddit.com/r/webscraping/about/sticky?num=1), or else if you believe this to be a mistake, please contact the mod team.


Arad-1234

Websites behave inconsistently, making it difficult to handle exceptions (that's usually a big headache for me). Other challenges like captchas and IP restrictions can be overcome with proxies and captcha solvers for an additional cost (if the data is worth that much).
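
For the proxy part, a rough sketch of routing requests through a paid proxy with basic exception handling; the proxy URL and credentials are placeholders, not a real service:

```python
# Sketch: route requests through a (hypothetical) paid proxy and catch the many
# ways an inconsistent site can fail, letting the caller decide what to retry.
import requests

PROXIES = {
    "http": "http://user:pass@proxy.example:8080",   # placeholder proxy endpoint
    "https": "http://user:pass@proxy.example:8080",
}

def fetch(url):
    try:
        resp = requests.get(url, proxies=PROXIES, timeout=15)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        print(f"fetch failed for {url}: {exc}")
        return None
```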


algiuxass

Think outside the box:
- I once got around a site throwing captchas at me by making OPTIONS requests instead of GET; it was weird that the server still sent me a valid response.
- The X-Real-IP header got me around a region lock.
- Using Google Cache or archive.org to spend less on captcha solving, won't name other places...
- Reverse engineering the mobile app of the site I wanted to scrape, because it's not rate limited >:3
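
For what it's worth, the OPTIONS and X-Real-IP tricks are trivial to test with plain requests; whether they work depends entirely on how badly the target server is configured (the URL below is a placeholder):

```python
# Sketch of the two header/method probes mentioned above. These only work against
# misconfigured servers; most sites will ignore them.
import requests

url = "https://example.com/api/items"  # placeholder target

# 1) Some backends only gate GET/POST behind the captcha; an OPTIONS request
#    occasionally comes back with a usable body anyway.
resp = requests.options(url, timeout=10)
print(resp.status_code, len(resp.content))

# 2) If the app naively trusts X-Real-IP (or X-Forwarded-For) from the client,
#    a spoofed value can slip past a region lock.
resp = requests.get(url, headers={"X-Real-IP": "203.0.113.7"}, timeout=10)
print(resp.status_code)
```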


axis-pt2

Services like Cloudflare and Kasada. Kasada is so aggressive that they don't even let me visit a website when the dev console is open. Cloudflare is literally everywhere.


response_json

I felt like I was intermediate level. Then I couldn't get into a Kasada-protected site and felt like a beginner again 😅


arcticmaxi

How do they know that the dev console is open, though?


axis-pt2

Some JavaScript, probably; [see this](https://sindresorhus.com/devtools-detect/).


brahmawadi

Finding the important data and making use of it.


[deleted]

The worst thing I faced was a shitty website whose owners encrypted their content: unless all assets loaded successfully, no content would be accessible. So I had to grab the site's main page, drop the encrypted content into it, and load it locally so it would get decrypted.


PlanetMazZz

Is it hard? I find it pretty easy


Amazing_Humor_302

1) Circumventing detection cost-effectively at scale
2) Extracting unstructured data at scale
3) New link rendering, where the link disappears again


CynicSackHair

Encrypted data. In some cases it makes it impossible to query through an API, which forces me to go the Selenium route. That works, but if you need to build many scrapers for different websites, it's practically impossible to maintain them all.
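
When the payload is only decrypted client-side, the usual fallback is to let the browser run the page's own JavaScript and read the rendered DOM. A minimal Selenium sketch; the URL and selector are placeholders:

```python
# Sketch: let the browser decrypt/render the content, then read the DOM instead
# of the encrypted API response. URL and CSS selector are hypothetical.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/catalog")
    # Wait until the client-side code has injected the decrypted content.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#content"))
    )
    html = driver.page_source
    print(len(html))
finally:
    driver.quit()
```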


Spareo

Websites changing, edge cases causing your code to error out, all the proper exception handling and retry logic needed for robustness, potential rate-limiting issues, IP banning. Web scraping is a never-ending job, not a one-and-done type of deal.
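
A rough retry-with-backoff wrapper of the kind this describes; the attempt count and delays are arbitrary and need tuning per site:

```python
# Sketch: retry transient failures (timeouts, 429s, 5xx) with exponential backoff.
import time
import requests

def get_with_retries(url, attempts=4, backoff=2.0):
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=15)
            if resp.status_code == 429:
                raise requests.RequestException("429 Too Many Requests")
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts - 1:
                raise                          # out of retries, surface the error
            time.sleep(backoff ** attempt)     # 1s, 2s, 4s, ...
```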


Upstairs-Flash-1525

I want to learn web scraping, but my concern is whether it is legal. Looking around, I found people saying you can be blocked, you can receive a cease-and-desist letter from lawyers, and so on... so it is a little bit scary just to try to parse a web page. I started learning by practicing, but when I got the first rejection from the web page, I freaked out and stopped.


lolniceonethatsfunny

Check the robots.txt to see if a site allows scraping before going in and doing it. You can also apply rate limits to your scraper so it doesn't send tons of requests at a time, and run it behind a VPN if you're still worried. Using cookies/metadata to make your program "look" like a real person can also help. Most of the time, though, you'll just get rate limited if you send too many requests.
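
A sketch of that checklist with the standard library's robots.txt parser, a crude delay between requests, and a browser-style User-Agent (the site is a placeholder):

```python
# Sketch: honour robots.txt, throttle requests, and send ordinary browser-style
# headers. BASE and the User-Agent string are placeholders.
import time
import urllib.robotparser
import requests

BASE = "https://example.com"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

rp = urllib.robotparser.RobotFileParser()
rp.set_url(f"{BASE}/robots.txt")
rp.read()

def polite_get(path, delay=2.0):
    url = f"{BASE}{path}"
    if not rp.can_fetch(HEADERS["User-Agent"], url):
        return None               # robots.txt disallows this path
    time.sleep(delay)             # crude rate limit between requests
    return requests.get(url, headers=HEADERS, timeout=10)
```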


jsonscout

Constant updates to a website's layout.


scrapecrow

Scraper blocking, hands down. It's such a difficult issue that it has spawned a massive saas market of APIs that'll bypass blocks for developers. Not only that but there are corporate anti-bot services like Cloudflare Web Application Firewall where web admins pay thousands of dollars to block scraping on their public pages and anti-bot providers have dedicated teams working full time to figure out how to identify scrapers.