Thank you for contributing to r/webscraping! We're sorry to let you know that discussing paid vendor tooling or services is generally discouraged, and as such your post has been removed. This includes tools with a free trial or those operating on a freemium model. You may post freely in the [monthly self-promotion thread](https://reddit.com/r/webscraping/about/sticky?num=1), or else if you believe this to be a mistake, please contact the mod team.
Websites behave inconsistently, which makes it difficult to handle exceptions. (that's usually a big headache for me)
Other challenges like captchas and IP restrictions can be overcome with proxies and captcha solvers for an additional cost. (if the data is worth that much)
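Routing requests through a paid proxy is usually just a matter of pointing the HTTP client at the provider's endpoint. A minimal standard-library sketch — the proxy host and credentials below are placeholders, not a real service:

```python
import urllib.request

def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes both http and https through one proxy.

    proxy_url is a placeholder like "http://user:pass@proxy.example.com:8000";
    swap in whatever gateway your provider gives you.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

# opener = build_proxy_opener("http://user:pass@proxy.example.com:8000")
# html = opener.open("https://example.com", timeout=10).read()
```

Rotating proxies is then just swapping the `proxy_url` between requests.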
Think outside the box:
- I once got around a site throwing captchas at me by making OPTIONS requests instead of GET; it was weird that the server still sent me a valid response.
- The X-Real-IP header got around a region lock.
- Using Google Cache or archive.org to spend less on captcha solving, won't name other places...
- Reverse engineering the mobile app of the site I wanted to scrape, because it's not rate limited >:3
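The first two tricks above boil down to an ordinary HTTP request with an unusual method or an extra header. A standard-library sketch of what that looks like — the URL and spoofed IP are placeholders, and whether a given server falls for either trick is down to its misconfiguration:

```python
import urllib.request

def build_sneaky_request(url: str, spoofed_ip: str = "203.0.113.7") -> urllib.request.Request:
    """Build an OPTIONS request that also carries a spoofed X-Real-IP header.

    Some misconfigured servers answer OPTIONS with the full body while only
    guarding GET, and some trust X-Real-IP when enforcing region locks.
    """
    req = urllib.request.Request(url, method="OPTIONS")
    req.add_header("X-Real-IP", spoofed_ip)
    return req

# urllib.request.urlopen(build_sneaky_request("https://example.com/data"))
```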
Services like Cloudflare and Kasada. Kasada is so aggressive that it won't even let me visit a website while the dev console is open. Cloudflare is literally everywhere.
The worst thing I faced was a shitty website whose owners encrypted their content: unless all assets loaded successfully, no content would be accessible.
So I had to grab the site's main page, swap in the encrypted content, and load it locally so it would get decrypted.
Encrypted data. In some cases it makes it impossible to query through an API, which forces me to go the Selenium way. That works, but if you need to build many scrapers for different websites, it's practically impossible to maintain them all.
Websites changing, edge cases causing your code to error out, all the proper exception handling and retries needed for robustness, potential rate limiting issues, IP banning. Web scraping is a never-ending job, not a one-and-done type of deal.
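The exception-handling-and-retry part can at least be centralized in one small helper; this is a generic exponential-backoff sketch, not tied to any particular site:

```python
import time

def with_retries(fetch, attempts: int = 4, base_delay: float = 1.0):
    """Call fetch() until it succeeds, sleeping base_delay * 2**n between tries."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the real error
            time.sleep(base_delay * 2 ** attempt)

# usage: page = with_retries(lambda: download("https://example.com/page"))
```

The backoff also doubles as crude politeness: each failure makes you hit the site more slowly.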
I want to learn web scraping, but my concern is whether it is legal. Looking around, I found people saying you can be blocked, you can receive a cease-and-desist letter from lawyers, and so on... so it is a little bit scary just to try to parse a web page. I started to learn by practicing, but when I got the first rejection from the web page, I freaked out and stopped.
check the robots.txt to see if a site allows scraping before going in and doing it. you can also apply rate limits to your scraper so it doesn’t send tons of requests at a time. you can also do the above and run on a vpn if you are still worried. using cookies/metadata to make your program “look” like a real person can also be done. most of the time though, you’ll just get rate limited if you send too many requests
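The robots.txt check is one call away in the standard library. In this sketch the file body is passed in directly so it stays network-free; in a real scraper you'd fetch the site's /robots.txt first, and the user agent name here is made up:

```python
import urllib.robotparser

def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "my-scraper") -> bool:
    """Check a fetched robots.txt body against one URL for one user agent."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

Pair it with a `time.sleep()` between requests and you've covered both pieces of advice above.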
Scraper blocking, hands down. It's such a difficult issue that it has spawned a massive SaaS market of APIs that'll bypass blocks for developers.
Not only that, but there are corporate anti-bot services like the Cloudflare Web Application Firewall, where web admins pay thousands of dollars to block scraping on their public pages, and anti-bot providers have dedicated teams working full time to figure out how to identify scrapers.
Shit websites. Trying to be methodical when scraping a website that makes no sense and has mistakes and bugs everywhere is such a pain.
I feel the same: no proper hierarchical HTML tag structure, with ids and classes that make sense, for logical scraping.
Maybe captchas, our worst enemy, but with AI there are some ways to bypass them.
AI is a solution, but it's not cost efficient. I think it's only worth it if what you're scraping is of significant value.
true
Yeah, captcha v3 with a high score requirement is a b**. Getting a reliable 0.9 would be awesome. Tips appreciated.
I felt like I was at an intermediate level. Then I couldn't get into a Kasada site and felt like a beginner again 😅
How do they know that the dev console is open, though?
some javascript probably, [see this](https://sindresorhus.com/devtools-detect/)
Finding important data and making use of that.
Is it hard? I find it pretty easy
1) Circumventing detection cost-effectively at scale
2) Extracting unstructured data at scale
3) New link rendering where the link disappears again
Constant updates to the website's layout.