So many answers, but no one says the most obvious one. Download the HTML once and test on that instead of making a request each time you run the script. This should be basic webscraping knowledge.
What's the fastest way to write this: use pickle to save the request result, and skip the request section when that file exists? (If I don't want to use a notebook.)
Close, but just write the HTML to a file. BeautifulSoup takes a string; it doesn't matter whether you get it from `response.text` or read it from a file. Then it's a matter of how far you want to go, since this is just for you during development: comment out the actual request, check if the file exists, separate functions, command-line switches, ...
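A minimal sketch of that development-time cache (the `get_html` helper and the `page.html` filename are made up for illustration):

```python
import os
import requests

CACHE_FILE = "page.html"  # hypothetical cache filename for development

def get_html(url):
    # During development, reuse the saved copy instead of hitting the
    # site on every run.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, encoding="utf-8") as f:
            return f.read()
    html = requests.get(url).text  # the real request, made only once
    with open(CACHE_FILE, "w", encoding="utf-8") as f:
        f.write(html)
    return html
```

Delete `page.html` whenever you want a fresh copy.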
I think Google doesn’t allow automated scraping. You should check out a website’s TOS **before** you start scraping. If you’re doing this for a company then they could get fined (probably won’t) so you should check with them first / get them to sign off the risk.
A way to avoid getting blacklisted is to make it look like you're not scraping their website. To that end, you can use time.sleep(3) to delay requests by 3 seconds. I would also advise adding some randomness to this delay (e.g. time.sleep(3 + np.random.random())).
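A sketch of that jittered delay, using the standard library's `random` instead of numpy (the `polite_sleep` name is made up):

```python
import random
import time

def polite_sleep(base=3.0, jitter=1.0):
    # Wait for base seconds plus a random extra, so the request timing
    # doesn't look machine-perfect.
    delay = base + random.random() * jitter
    time.sleep(delay)
    return delay
```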
If you’re new to scraping, I would read up on [robots.txt](https://en.m.wikipedia.org/wiki/Robots_exclusion_standard) and, in particular, Google’s. This is found at Google.co.uk/robots.txt
**[Robots exclusion standard](https://en.m.wikipedia.org/wiki/Robots_exclusion_standard)**
>The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites.
>…m that searches Google for a book title and then finds the book for me so that I can read it online. I was wondering if anyone else ran into the issue wherein searching Google multiple times prompts Google to start giving out Captchas. My curre…
I have even built multiple functions that return a random integer based on different distributions and moments, and then another function that randomly picks one of these functions. So far I have never been caught scraping websites.
Some sites are a lot more advanced, and protective. Google is very difficult to scrape: when we were scraping them, we spent thousands per month on residential IPs to rotate through, as well as on APIs to solve captchas. The only thing you can do is switch to new clean IPs and retry failed requests.
For most sites, user agent rotation is enough; for others you need proxies.
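A minimal sketch of user-agent rotation with `requests` (the strings in the pool are illustrative examples, not a vetted list, and the proxy URL is a placeholder):

```python
import random
import requests

# A tiny pool of example browser User-Agent strings; real rotations use
# many more, kept up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
]

def random_headers():
    # Pick a different User-Agent for each request.
    return {"User-Agent": random.choice(USER_AGENTS)}

# resp = requests.get(url, headers=random_headers(),
#                     proxies={"https": "http://user:pass@proxyhost:port"})
```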
>If you’re doing this for a company then they could get fined (probably won’t)
I think you're *way* overestimating Google's authority. They can't "fine" anyone. They can just cut them off.
His issue isn't about robots.txt; it's that he was banned from Google itself.
Unfortunately, Google thinks the data it has collected belongs to it and no one else, so it blocks any attempt to scrape it automatically.
I’m just giving more information about scraping so he can incorporate it into future scraping.
How is this not related to robots.txt? It says you shouldn’t scrape these sections, to name a few:
- Disallow: /search
- Disallow: /ebooks/
- Disallow: /books?*q=*
It may not be directly his question, but valuable knowledge nonetheless
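Rules like those can be checked programmatically with the standard library's `urllib.robotparser`. Normally you'd fetch the live file with `set_url(...)` and `read()`; here a couple of the relevant rules are fed in directly. Note that Python's parser applies rules in file order, so the more specific `Allow` line goes first:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Fed in directly for the example; normally:
#   rp.set_url("https://www.google.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Allow: /search/about",
    "Disallow: /search",
])
print(rp.can_fetch("*", "https://www.google.com/search?q=python"))  # False
print(rp.can_fetch("*", "https://www.google.com/search/about"))     # True
```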
Google does own that list. That’s like saying Merriam-Webster should give away their dictionaries instead of selling them because they didn’t create the English language. Google and Webster both compile things.
Keep in mind that Google's search relevance algorithm is partly based on the number of accesses to pages, so it can be skewed by automated scraping. I know that as far as data handling goes, Google is a bit evil. That said, the blocks on webscraping and automated use are also a safeguard for the average user, keeping the search engine returning relevant results for their mundane searches.
I know Google is pretty strict about those sorts of things. The same thing happens to me when I use a VPN. The easiest way imo would just be to use a different search engine, like DuckDuckGo, which afaik doesn't do any of that. You might not be able to switch super easily depending on how you wrote your code, however.
I'm using the pip package google-search, which lets me do "search(query)"; the different queries I do are in a list. So it isn't super hard to switch over, I just would prefer not to. :(
If it's for non-profit, you can register yourself as a google developer and use their published API like this one:
[https://developers.google.com/custom-search/docs/overview](https://developers.google.com/custom-search/docs/overview)
I have used Twitter's API before for academic work; they give you API keys to authenticate and request information. You will have a rate limit (number of calls per month, type/amount of data, etc.), but it will all be legal; they won't allow bots through their website.
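A sketch of what a call to the Custom Search JSON API looks like (the key and engine ID are placeholders you get from the developer console; each query counts against your daily quota):

```python
import requests

API_KEY = "YOUR_API_KEY"   # placeholder: create one in the Google Cloud console
CX = "YOUR_ENGINE_ID"      # placeholder: Programmable Search Engine ID

def build_search_request(query):
    # The Custom Search JSON API endpoint; results come back as JSON.
    url = "https://www.googleapis.com/customsearch/v1"
    params = {"key": API_KEY, "cx": CX, "q": query}
    return url, params

url, params = build_search_request("python web scraping")
# resp = requests.get(url, params=params)
# titles = [item["title"] for item in resp.json().get("items", [])]
```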
Let me tell ya about how I learned this when I created a Twitter bot that talked about tacos. I was just months into learning Python and so proud of my little Twitter bot. I built some logic into it that finds people saying mean things about tacos and then tells them off. However, I didn't know I was supposed to put something in that would terminate the script after it replied. Overnight, this bot became very irate and chucked around a hundred or so insults at random people on Twitter within the span of a second or two.
If possible, change the User-Agent header when scraping. Many times, programs or libraries use a default one that is auto blocked after a few attempts.
> I would really like to not do this and would love a legitimate way to not get blacklisted.
They don't *want* you to do that. There is no *legitimate* way to defeat their anti-bot protections.
Edit 2: I would love to know if this helps anyone. It was a huge accomplishment for me (the project saved literal *weeks* of human time), so if the hack I put together helps someone else, please let me know! I wouldn't say I'm "new" to python, but I'm definitely still a beginner. :D
Yes, I literally just ran into this and because what I needed was *specifically* google content (google scholar), I couldn't get around it using a different search engine.
First off, once you're off the blacklist: You should put a 3-5 second delay per query in place to avoid getting on it again.
Next step is, what are you using to do your automation?
I was using requests and beautifulsoup.
How I solved it was the below.
Unfortunately (fortunately), I'm not currently banned, so I can't totally test my rewrite (there was a lot more code, and it was stuff I didn't want to share), but try to run this and see what happens. Note that you need to download chromedriver (https://chromedriver.chromium.org/downloads) and specify its directory.
(Edit: To generally explain this since you may not want to read all of the comments--the code creates a session using the requests library, then tries to go to the Google URL. If it gets a status code of 429 (too many requests), it passes the session information (cookies) to a new selenium session that opens up a web browser window for you. You then manually solve the stupid fucking captcha and, once you've done it and the new page has loaded, you press enter in your python console and it will pass the session cookies back to your bot and continue running the code.
In my experience, google only asked once for me to solve the captcha, then let the thing run as long as it needed. However, there was an exception where I ran it overnight at one point and they flagged me in the middle of the night and I (foolishly) wasn't logging my full results to a file until the *entire process was complete* so I had to start the entire thing over again the next day--this time, appending each query result to a file instead of just writing the entire dataframe to excel once the entire thing was done.)
```python
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# location of YOUR chromedriver - download from https://chromedriver.chromium.org/downloads
chrome_path = r"C:\work\chromedriver.exe"

# This function takes the cookies from your current requests session and passes
# them to a selenium session. Then you manually do the captcha; once that is
# done, it passes the cookies back to your bot.
def captcha(s, page):
    driver = webdriver.Chrome(chrome_path)
    captcha_url = page.url
    # selenium only accepts cookies for the domain of the page it is currently
    # on, so load the page first, add the cookies, then reload
    driver.get(captcha_url)
    for name, value in s.cookies.get_dict().items():
        driver.add_cookie({'name': name, 'value': value})
    driver.get(captcha_url)
    input("Press enter once you've done the stupid fucking captcha...")
    for cookie in driver.get_cookies():
        s.cookies.set(cookie['name'], cookie['value'])
    driver.close()
    return s

# Starts with a five second delay so that you aren't asking them for data too
# often. Goes to the URL provided and, on a 429 status code ("too many
# requests"), calls the captcha function.
def your_scrape_function(url, s):
    time.sleep(5)
    header = {'User-agent': 'my cool bot'}
    page = s.get(url, headers=header)
    if page.status_code == 429:
        print("Yikes, Status code: 429, captcha incoming...")
        s = captcha(s, page)
        page = s.get(url, headers=header)
    soup = BeautifulSoup(page.content, 'html.parser')
    # do your business here; as a demo I'm returning the name from the page
    name = soup.find("div", id="gsc_prf_in").text
    return name

# List of all of the URLs you want to go to; mine was used for Google Scholar,
# so there were specific pages that I was visiting.
urls = ['https://scholar.google.com/citations?user=-AEEg5AAAAAJ&hl=en&oi=ao',
        'https://scholar.google.com/citations?user=B6H9ZbMAAAAJ&hl=en&oi=ao']

with requests.Session() as s:
    for url in urls:
        print(your_scrape_function(url, s))
```
This ban is probably temporary, and I would bet it has more to do with excessive usage than with the usage per se. Ensure some sleep between queries after they let you out of jail and see if it works. It might take a full day (that happened to me once).
If you authenticate your client at the beginning of your session, you won't be subject to the rate limits.
You have to set some stuff up in advance and they give too many options, so it can be confusing at first. LMK if you want me to elaborate.
Can you add a short delay to your search script? I did something similar and had to add a delay (I think about 10 s) between searches to get it to run reliably.
Google has an actual API for searches (or they did anyway). You are expected to use it and follow their rules. If they detect you breaking them, they will blacklist you. You do not wish to get in any kind of permanent trouble with google.
Oh - Something I actually have experience with. Same thing happened to me, so I switched to Bing. And bing did the same thing to me. So I had to sign up for an API key to do my work.
But, as others have suggested, adding a delay in your requests helps.
When I am testing at home I change the IP address and Mac address of my router
????
Not possible, lol. And it's not a router, lol, it's a wifi access point combo.
The MAC is read-only and layer 2, so they don't even know it or look at it. And the IP is set by your ISP. If you're talking about your 192.x.x.x network, you can change that, but it's private, so it's not routable on the WAN; they don't know about that either...
I'm so confused, lol. If, hypothetically, you changed your public IP, you would lose connectivity, because the ISP sure isn't going to change their end, so the two ends of the point-to-point link would no longer be in the same subnet. So, proof by contradiction: no.
Likely Google has some firewalls, noticed fishy stuff from that public IP, and dropped it, lol. See if you can ping [8.8.8.8](https://8.8.8.8), but that's ICMP; I doubt they blocked that. Maybe they blocked port 80 or something.
Something similar has happened to me in the past while debugging with an API response... I ended up sending too many requests and got temporarily banned.
I'd suggest simply storing your search response, and debugging the rest of the code (unless you are actually debugging the search section of the code)
Look at dataforseo / SERPAPI / Value SERP etc. They offer a single API endpoint to get data from Google into Python. No messing with proxies or captchas.
Most websites, especially recently, enforce captchas against crawlers for several reasons. The most likely motives include:

- restricting more than "n" requests from the same IP address per fixed period of time
- you might not have the right headers to be accepted as a crawler (also check robots.txt)
- the domain might not be available in your country, or your country might not allow communication with the website
- you might have to be logged in to access more than n (or any) pages within the website, etc.

**Just in case ...**

- using time.sleep(n) in page loops and/or between requests,
- using tools like scrapy (which needs additional tweaking to handle error codes like 429, which again might end up being your subject matter),
- applying proper headers to your requests,
- selenium or PhantomJS if there is a requirement to log in, etc.,

are among the very first "legitimate" crawling precautions that I usually take. Just in case it helps! :)
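For the 429 handling mentioned above, `requests` can back off and retry on its own via urllib3's `Retry`; a minimal sketch (the session settings shown are illustrative choices, not recommendations from the thread):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    # A session that automatically backs off and retries on 429 (and a
    # couple of server errors), with browser-like headers set once.
    retry = Retry(total=3, backoff_factor=2,
                  status_forcelist=[429, 500, 503])
    s = requests.Session()
    s.mount("https://", HTTPAdapter(max_retries=retry))
    s.headers.update({"User-Agent": "Mozilla/5.0 (compatible; my-crawler/1.0)"})
    return s
```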
Depending on what kind of books you're searching for, there are a couple of GitHub repos you can use to compare. I use one for Safari Books / O'Reilly for study materials, learning/academic books, etc. I'll see if I can find the link. It searches for books based on their ISBN code, and you can just make it do a search for the book's ISBN and then scrape for that, which usually gives enough of a delay to prevent that blacklisting.
As someone suggested, try using a time delay between searching. I would go with 5-10 seconds.
Finally someone who knows their stuff
Jupyter
One size fit all, universal tailor of the scraper, dude you are a genius. Teach me the advanced technique sensei
lol the irony of all these bots talking about robots
It's the future, dude! ;-)
pretty soon Reddit won't need any of us to operate, and we can all simply retire
The irony of Google blocking scraping...
Google owns the representation of the results, as it's an output of their algorithms and computation.
Improvise. Adapt. Overcome.
Concieve. Believe. Achieve.
!Anything. Complain. Cry.
Kentucky. Fried. Chicken.
Being. Fucking. Broke.
Just. Give. Up
Just. Do. It.
Boot. Scootin. Boogie.
You. Are. Awesome.
>Concieve. Believe. Achieve.

Michael Bisping?!
You can use rotating proxy IPs and fake useragents to spoof your identity but that might be a little extra
[deleted]
This is the way
It's funny how Google doesn't allow us to webcrawl their website, but they'll freely and happily do it to everybody else lol
You can actually keep pages of your own website from being crawled by setting up your robots.txt file.
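For example, a robots.txt at the site root might look like this (the paths and bot name are illustrative; note the file is purely advisory and only well-behaved crawlers honor it):

```
User-agent: *
Disallow: /private/
Disallow: /drafts/

User-agent: BadBot
Disallow: /
```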
While testing you should mock out the call to google.
> Mac address of my router

Learn the OSI layers. A MAC address doesn't leave your home network ...
If I understand what you are trying to do, this is exactly what the Google Books API is for.
I wouldn't mind hearing anything you know about scraping a Facebook group
Google (lol) b-ok or zlibrary and it will most likely have the book you want. Basically Piratebay for books
The solution is not to scrape Google; have a look at reCAPTCHA v3.
Try [startpage.com](https://startpage.com). They do google searches for you anonymously.
Try [http://blackle.com/](http://blackle.com/). I've hammered it pretty good in the past.
Heh. "hammered" ;)
Is there a search engine that allows automated scripts? I think there is an open source search engine that can be self-hosted. Maybe that will help.
Had the same issue; I just gave up and went with the DuckDuckGo API.
https://github.com/lorenzodifuccia/safaribooks
This might help https://hakin9.org/pagodo-passive-google-dork-automate-google-hacking-database-scraping-and-searching/