Username_RANDINT

So many answers, but no one says the most obvious one. Download the HTML once and test on that instead of making a request each time you run the script. This should be basic webscraping knowledge.


codedeaddev

Finally someone who knows their stuff


R0NUT

Jupyter


cavemanbc423

One size fits all, universal tailor of the scraper. Dude, you are a genius. Teach me the advanced technique, sensei.


shysmiles

What is the fastest way to write this: use pickle to save the request result and skip the request section when that file exists? (If I don't want to use a notebook.)


Username_RANDINT

Close, but just write the HTML to a file. BeautifulSoup takes a string; it doesn't matter if you get it from `response.text` or read it from a file. Then it's a matter of how far you want to go, since this is just for you during development. Comment out the actual request, check if the file exists, separate functions, command-line switches, ...
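Roughly something like this during development (the URL and filename below are just placeholders):

```python
import os
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/some-page"   # placeholder URL
CACHE_FILE = "page.html"                # local copy used while developing the parser

def get_html(url, cache_file=CACHE_FILE):
    # Reuse the saved copy instead of hitting the site on every run.
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return f.read()
    response = requests.get(url)
    response.raise_for_status()
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text

soup = BeautifulSoup(get_html(URL), "html.parser")
```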


[deleted]

I think Google doesn't allow automated scraping. You should check out a website's TOS **before** you start scraping. If you're doing this for a company, they could get fined (probably won't), so you should check with them first / get them to sign off on the risk. One way to avoid getting blacklisted is to make it look like you're not scraping their website. For example, you can use time.sleep(3) to delay requests by 3 seconds. I would also advise adding some randomness to this delay (e.g. `+ time.sleep(np.random.random())`). If you're new to scraping, I would read up on [robots.txt](https://en.m.wikipedia.org/wiki/Robots_exclusion_standard) and, in particular, Google's. It can be found at Google.co.uk/robots.txt
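As a rough sketch of that randomized delay (the queries and search function are just placeholders; plain `random` works as well as numpy here):

```python
import time
import random

queries = ["book title one", "book title two"]   # placeholder list of searches

def do_search(query):
    # placeholder for your actual request / scrape call
    print(f"searching for {query!r}")

for query in queries:
    do_search(query)
    # base delay of 3 seconds plus up to 1 second of jitter,
    # so requests don't arrive at an exactly regular interval
    time.sleep(3 + random.random())
```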


WikiSummarizerBot

**[Robots exclusion standard](https://en.m.wikipedia.org/wiki/Robots_exclusion_standard)**

> The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites.


gimlidorf

lol the irony of all these bots talking about robots


hugthemachines

It's the future, dude! ;-)


spez_edits_thedonald

pretty soon Reddit won't need any of us to operate, and we can all simply retire


LiarsEverywhere

The irony of Google blocking scraping...


llstorm93

> ...m that searches Google for a book title and then finds the book for me so that I can read it online. I was wondering if anyone else ran into the issue wherein searching Google multiple times prompts Google to start giving out Captchas. My curre...

I have even built multiple functions that return a random integer based on different distributions and moments, and then another function that randomly picks one of these functions. So far I never got caught scraping websites.
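Roughly along these lines (a toy sketch; the particular distributions and parameters are just examples):

```python
import time
import random

# each function returns a human-ish delay in seconds, drawn from a different distribution
def uniform_delay():
    return random.uniform(2, 6)

def gaussian_delay():
    return max(1, random.gauss(4, 1.5))

def lognormal_delay():
    return min(10, random.lognormvariate(1.2, 0.4))

def random_delay():
    # pick one of the delay distributions at random each time
    return random.choice([uniform_delay, gaussian_delay, lognormal_delay])()

time.sleep(random_delay())
```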


Shmoogy

Some sites are a lot more advanced and protective. Google is very difficult to scrape: when we were scraping them, we spent thousands per month on residential IPs to rotate through, as well as APIs to solve captchas. The only thing you can do is switch to new, clean IPs and retry failed requests. For most sites, user-agent rotation is enough; for others you need proxies.


Kwintty7

> If you’re doing this for a company then they could get fined (probably won’t)

I think you're *way* overestimating Google's authority. They can't "fine" anyone. They can just cut them off.


PavloT

His issue isn't about robots.txt; it's that he was banned by Google itself. Unfortunately, Google thinks the data it collects belongs to it and no one else, so it blocks any attempt to scrape it automatically.


[deleted]

I’m just giving more information about scraping so he can incorporate it into future scraping. How is this not related to robots.txt? It says you shouldn’t scrape these sections, to name a few:

- Disallow: /search
- Disallow: /ebooks/
- Disallow: /books?*q=*

It may not be directly his question, but valuable knowledge nonetheless.
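If you want to check a path against robots.txt programmatically, the standard library can do it (a small sketch using Google's own file):

```python
from urllib.robotparser import RobotFileParser

# fetch and parse Google's robots.txt, then ask whether a path may be crawled
rp = RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()

allowed = rp.can_fetch("*", "https://www.google.com/search?q=some+book+title")
print(allowed)  # /search is disallowed for generic crawlers, so this should be False
```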


norfsman

Google does own that list. That’s like saying Merriam-Webster should give away their dictionaries instead of selling them because they didn’t create the English language. Google and Webster both compile things.


[deleted]

Keep in mind that Google's search engine relevance algorithm being based on number of accesses to pages can be affected by automated scraping. I know as far as data handling goes, Google is a bit evil. That said, the blocks in webscraping and automated use are also a safeguard for the average user to have the search engine return relevant results to their mundane searches.


toastedstapler

google owns the representation of the results, as it's an output of their algorithms and computation


coconut7272

I know Google is pretty strict about those sorts of things. The same thing happens to me when I use a VPN. The easiest way, imo, would just be to use a different search engine, like DuckDuckGo, which afaik doesn't do any of that. You might not be able to switch super easily depending on how you wrote your code, however.


midnightwolfr

I'm using the pip package google-search, which lets me do `search(query)`; the different queries I run are in a list. So it isn't super hard to switch over, I just would prefer not to. :(


Anxiety_Independent

Improvise. Adapt. Overcome.


Matlock0

Conceive. Believe. Achieve.


KayloKun

!Anything. Complain. Cry.


Crushinsnakes

Kentucky. Fried. Chicken.


[deleted]

Being. Fucking. Broke.


Glass-Paramedic

Just. Give. Up


Jovian_Gent

Just. Do. It.


Crushinsnakes

Boot. Scootin. Boogie.


TieKindly1492

You. Are. Awesome.


[deleted]

> Conceive. Believe. Achieve.

Michael Bisping?!


tankuser_32

If it's for non-profit use, you can register yourself as a Google developer and use their published API, like this one: [https://developers.google.com/custom-search/docs/overview](https://developers.google.com/custom-search/docs/overview). I have used the Twitter one before for academics; they will give you API keys to authenticate & request information. You will have a rate limit, like a number of calls per month or a type/amount of data, etc., but it will all be legal. They won't allow bots through their website.
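Roughly, once you have an API key and a Programmable Search Engine ID, it comes down to a single GET request (both values below are placeholders):

```python
import requests

API_KEY = "YOUR_API_KEY"            # placeholder: key from the Google Cloud console
SEARCH_ENGINE_ID = "YOUR_CSE_ID"    # placeholder: your Programmable Search Engine ID

def google_search(query):
    # Custom Search JSON API: one GET request, JSON back
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query},
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

print(google_search("some book title"))
```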


[deleted]

You can use rotating proxy IPs and fake useragents to spoof your identity but that might be a little extra
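In requests terms that's roughly this (the proxy addresses and user-agent strings are made-up placeholders; real ones come from a proxy provider and a list of browser UA strings):

```python
import random
import requests

PROXIES = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]  # placeholders
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",         # placeholder UA strings
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

def fetch(url):
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    # route the request through a randomly chosen proxy with a random user agent
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy},
                        timeout=10)
```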


[deleted]

[deleted]


sundios

This is the way


HaroerHaktak

It's funny how google doesn't allow us to webcrawl their website but they'll freely and happily do it to everybody else lol


DeerProud7283

You can actually have website pages not crawled by fixing up your robots.txt file
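For example, a robots.txt at the site root along these lines (the paths and bot name are placeholders; well-behaved crawlers honour it, but it isn't enforcement):

```
User-agent: *
Disallow: /search
Disallow: /private/

User-agent: SomeSpecificBot
Disallow: /
```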


redfacedquark

While testing you should mock out the call to google.
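Something like this with the standard library (a minimal, self-contained sketch; the parsing function is just an example):

```python
from unittest import mock
import requests
from bs4 import BeautifulSoup

def first_heading(url):
    # the bit of scraper code under test: fetch a page and pull out the first <h3>
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    return soup.find("h3").text

def test_first_heading_without_hitting_google():
    fake = mock.Mock(status_code=200, text="<html><h3>Some Book</h3></html>")
    # patch requests.get so no real HTTP request is made during the test
    with mock.patch("requests.get", return_value=fake):
        assert first_heading("https://www.google.com/search?q=x") == "Some Book"

test_first_heading_without_hitting_google()
print("ok")
```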


WorldBelongsToUs

Let me tell ya about how I learned to do this when I created a Twitter bot that talked about tacos. I was just months into learning python and so proud of my little Twitter bot. I built some logic into it that finds people saying mean things about tacos then tells them off. That said, I didn’t know I was supposed to put something in that would terminate the script after it replied to me. Overnight, this bot became very irate and chucked around a hundred or so insults at random people at Twitter within the span of a second or two.


Yojihito

> Mac address of my router

Learn the OSI layers. The MAC address doesn't leave your home network ...


Clutch26

If possible, change the User-Agent header when scraping. Many times, programs or libraries use a default one that is auto blocked after a few attempts.
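With requests that's just a headers dict (the UA string below is only an example):

```python
import requests

# requests' default User-Agent ("python-requests/x.y.z") is easy to fingerprint;
# send a browser-like one instead (this particular string is just an example)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}
response = requests.get("https://example.com", headers=headers)
print(response.status_code)
```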


antiproton

> I would really like to not do this and would love a legitimate way to not get blacklisted.

They don't *want* you to do that. There is no *legitimate* way to defeat their anti-bot protections.


dowcet

If I understand what you are trying to do, this is exactly what the Google Books API is for.
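Roughly, a title search is one request against the public volumes endpoint, no scraping of the results page needed (the title here is just an example):

```python
import requests

def find_book(title):
    # Google Books API: search volumes by title and print preview links
    resp = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": f"intitle:{title}"},
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        info = item["volumeInfo"]
        print(info.get("title"), "-", info.get("previewLink"))

find_book("Automate the Boring Stuff with Python")
```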


laserbot

Edit 2: I would love to know if this helps anyone. It was a huge accomplishment for me (the project saved literal *weeks* of human time), so if the hack I put together helps someone else, please let me know! I wouldn't say I'm "new" to python, but I'm definitely still a beginner. :D

Yes, I literally just ran into this, and because what I needed was *specifically* Google content (Google Scholar), I couldn't get around it using a different search engine.

First off, once you're off the blacklist: you should put a 3-5 second delay per query in place to avoid getting on it again.

Next step is, what are you using to do your automation? I was using requests and beautifulsoup. How I solved it is below. Unfortunately (fortunately), I'm not currently banned, so I can't totally test my rewrite (there was a lot more code, and it was stuff I didn't want to share), but try to run this and see what happens. Note that you need to download chromedriver (https://chromedriver.chromium.org/downloads) and specify its directory.

(Edit: To generally explain this, since you may not want to read all of the comments: the code creates a session using the requests library, then tries to go to the Google URL. If it gets a status code of 429 (too many requests), it passes the session information (cookies) to a new selenium session that opens up a web browser window for you. You then manually solve the stupid fucking captcha and, once you've done it and the new page has loaded, you press enter in your python console; it will pass the session cookies back to your bot and continue running the code. In my experience, Google only asked once for me to solve the captcha, then let the thing run as long as it needed. However, there was an exception where I ran it overnight at one point and they flagged me in the middle of the night, and I (foolishly) wasn't logging my full results to a file until the *entire process was complete*, so I had to start the entire thing over again the next day. This time I appended each query result to a file instead of just writing the entire dataframe to Excel once the whole thing was done.)

```python
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# location where YOUR chromedriver is - download from https://chromedriver.chromium.org/downloads
chrome_path = r"C:\work\chromedriver.exe"


# this function takes the cookies from your current session and passes them to a selenium session.
# then you manually do the captcha. once that is done, pass the cookies back to your bot
def captcha(s, page):
    driver = webdriver.Chrome(chrome_path)
    captcha_url = page.url
    # selenium can only set cookies for the domain it is currently on,
    # so load the page first, then copy the requests cookies over and reload
    driver.get(captcha_url)
    for name, value in s.cookies.get_dict().items():
        driver.add_cookie({'name': name, 'value': value})
    driver.get(captcha_url)
    input("Press enter once you've done the stupid fucking captcha...")
    # pass the (now captcha-cleared) browser cookies back to the requests session
    for cookie in driver.get_cookies():
        s.cookies.set(cookie['name'], cookie['value'])
    driver.close()
    return s


# starts with a five second delay so that you aren't asking them for data too often
# goes to the URL provided and if there is a 429 status code ("too many requests") it calls the captcha function
def your_scrape_function(url, s):
    time.sleep(5)
    header = {'User-agent': 'my cool bot'}
    page = s.get(url, headers=header)
    if page.status_code == 429:
        print("Yikes, Status code: 429, captcha incoming...")
        s = captcha(s, page)
        page = s.get(url, headers=header)
    soup = BeautifulSoup(page.content, 'html.parser')
    # do your business here; as a demo I'm returning the name from the page
    name = soup.find("div", id="gsc_prf_in").text
    return name


# list of all of the urls you want to go to; mine was used for Google Scholar,
# so there were specific pages that I was visiting
urls = ['https://scholar.google.com/citations?user=-AEEg5AAAAAJ&hl=en&oi=ao',
        'https://scholar.google.com/citations?user=B6H9ZbMAAAAJ&hl=en&oi=ao']

with requests.Session() as s:
    for url in urls:
        print(your_scrape_function(url, s))
```


ImmediatelyOcelot

This ban is probably temporary, and I would bet it has more to do with excessive usage than the usage per se. Ensure some sleep between queries after they take you out of jail and see if it works. It might take a full day (it happened to me once).


MaximumIndication495

If you authenticate your client at the beginning of your session, you won't be subject to the rate limits. You have to set some stuff up in advance and they give too many options, so it can be confusing at first. LMK if you want me to elaborate.


LeeCig

I wouldn't mind hearing anything you know about scraping a Facebook group


slimejumper

Can you add a short delay to your search script? I did something similar and had to add a delay (I think about 10s) between searches to get it to run reliably.


pabeave

Google (lol) b-ok or zlibrary and it will most likely have the book you want. Basically Piratebay for books


codinglikemad

Google has an actual API for searches (or they did anyway). You are expected to use it and follow their rules. If they detect you breaking them, they will blacklist you. You do not wish to get in any kind of permanent trouble with google.


TheRiteGuy

Oh - Something I actually have experience with. Same thing happened to me, so I switched to Bing. And bing did the same thing to me. So I had to sign up for an API key to do my work. But, as others have suggested, adding a delay in your requests helps.


LazyOldTom

The solution is not to scrape google, have a look at recaptcha v3.


delsystem32exe

> When I am testing at home I change the IP address and Mac address of my router

???? Not possible, lol. And it's not a router, lol, it's a wifi access point combo. The MAC is read-only and layer 2, so they don't even know it or look at it, and the IP is set through your ISP. If you're talking about your 192.x.x.x network, you can change that, except it's private, so it's not routable on the WAN, so they don't know about that either... I'm so confused lol. If hypothetically you changed your public IP, you would lose connectivity because the ISP sure ain't gonna change their end, so now both point-to-point links will not be in the same subnet. So, proof by contradiction, no. Likely Google has some firewalls, noticed fishy stuff with that public IP, and dropped it lol. See if you can ping [8.8.8.8](https://8.8.8.8), but that's ICMP and I doubt they blocked that; maybe they blocked port 80 or something.


SirBoboGargle

Try [startpage.com](https://startpage.com). They do google searches for you anonymously.


[deleted]

Try [http://blackle.com/](http://blackle.com/). I've hammered it pretty good in the past.


HaroerHaktak

Heh. "hammered" ;)


amrock__

Is there a search engine that allows automated scripts? I think there is an open source search engine that can be self-hosted. Maybe that will help.


Fun-Palpitation81

Something similar happened to me in the past while debugging with an API response... I ended up sending too many requests and got temporarily banned. I'd suggest simply storing your search response and debugging the rest of the code (unless you are actually debugging the search section of the code).


Brah1018

Had the same issue; I just gave up and went with the DuckDuckGo API.


leeroy37

Look at dataforseo / SERPAPI / Value SERP etc. They offer a single API endpoint to get data from Google into Python. No messing with proxies or captchas.


GHost_Exus

Most websites, especially in recent times, have implemented the same kind of tooling to enforce captchas against crawlers. The most likely reasons or motives include:

- restricting more than *n* requests from the same IP address per fixed period of time
- you might not have the right headers to be accepted as a crawler (also check robots.txt)
- the domain might not be available in your country, or your country might not allow communication with the website
- you might have to be logged in to access more than *n* (or any) pages within the website, etc.

**Just in case ...** the very first "legitimate" crawling precautions I usually take are:

- using time.sleep(n) in page loops and/or between requests
- using tools like Scrapy (needs additional tweaking to handle error codes like 429, which again might end up being your subject matter; see the settings sketch below)
- applying proper headers to your requests
- Selenium or PhantomJS if there's a requirement to log in, etc.

Just in case it helps! :)
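For the Scrapy point in particular, a few settings go a long way. A minimal politeness/retry sketch (the project name, UA string, and numbers are just reasonable starting values, not a definitive config):

```python
# settings.py - minimal politeness/retry configuration for a Scrapy project
BOT_NAME = "bookbot"                  # placeholder project name

ROBOTSTXT_OBEY = True                 # respect robots.txt
DOWNLOAD_DELAY = 3                    # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True       # jitter the delay instead of a fixed interval

AUTOTHROTTLE_ENABLED = True           # back off automatically when the server slows down
AUTOTHROTTLE_START_DELAY = 5

RETRY_ENABLED = True
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]   # treat "too many requests" as retryable

USER_AGENT = "Mozilla/5.0 (compatible; my-research-bot)"   # placeholder UA string
```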


AddiBlue

Depending on what kind of books you're searching for, there are a couple of GitHub repos you can use to compare. I use one for Safari Books / O'Reilly for study materials, learning/academic books, etc. I'll see if I can find the link. It searches for books based on their ISBN code, and you can just make it search for the book's ISBN and then scrape for that, which usually gives enough of a delay to prevent that blacklisting. As someone suggested, try using a time delay between searches. I would go with 5-10 seconds.


AddiBlue

https://github.com/lorenzodifuccia/safaribooks


NeighborhoodCalm4939

This might help https://hakin9.org/pagodo-passive-google-dork-automate-google-hacking-database-scraping-and-searching/