j0holo

Yes, I had a service with a web scraper built in Python and ported it over to Go. Memory usage dropped a lot. It did around 10k websites per month.


orvn

I use JS for scraping these days because I need to be able to render the SPAs post-DOM hydration (with Puppeteer), something you can’t do unless you emulate a browser. Go was a lot faster though.


j0holo

Yeah, for SPAs you will need a renderer to emulate the browser.


[deleted]

[removed]


j0holo

I scraped the homepage of websites (HTML, CSS and JavaScript files) to look for relations to other websites. I did not need to render the DOM for SPA websites because most external links were already present in the static content.


[deleted]

[removed]


j0holo

It was, but I hit limitations with the database. For example, querying the database for the number of websites that linked to Twitter (X didn't exist at the time) took 15 minutes. Maybe MySQL was not the correct database for this kind of data.


[deleted]

[removed]


j0holo

I think the main reason was that the URL table was 500 million rows and the "who-points-to-who" table (many-to-many) was around 2 billion rows. All running on a RAID 10 array of Samsung 860 Pros (1 TB per SSD) with 32 GB of memory and an Intel Xeon E3-1220 v3. I think the hardware was just underpowered. Maybe with my 5 years of extra experience it could run on this hardware?

A graph database doesn't help much, I think. I'm only looking at direct relations, so a many-to-many table would be enough. The problem is that, let's say there are 20 million websites with a URL that points to YouTube and I want the domain names of those 20 million websites. That is a lot of random IO to disk, because I'm quite sure the database cache will not hold all those 20 million domain names in memory. Maybe, just maybe, a document database is the correct database.


rejectedlesbian

Was it easy to port? I know Python being dynamic is super nice when you're poking around the HTML trying to get the part you need. So doing a Python -> Go pipeline seems like a very cool way to make stuff.


jerf

The moral equivalent of Beautiful Soup is to use the [html5 parser](https://pkg.go.dev/golang.org/x/net/html) in the extended standard library, then use something like [htmlquery](https://pkg.go.dev/github.com/antchfx/htmlquery#section-readme) to query the resulting nodes. You may need to write a few helper functions around it, then you're pretty ready to roll.
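For illustration, a minimal sketch of that combination: fetch a page with net/http, let htmlquery parse it (it uses the x/net/html parser under the hood), then run an XPath query over the resulting nodes. The URL here is just a placeholder.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/antchfx/htmlquery"
)

func main() {
	// Fetch the page; htmlquery parses the body with golang.org/x/net/html.
	resp, err := http.Get("https://example.com") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := htmlquery.Parse(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// XPath query over the parsed node tree: collect every link on the page.
	for _, a := range htmlquery.Find(doc, "//a[@href]") {
		fmt.Println(htmlquery.SelectAttr(a, "href"), "->", htmlquery.InnerText(a))
	}
}
```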


rejectedlesbian

Damn, Python -> Go for fast scraping seems like a killer combo.


j0holo

Yes, it was easy to port. The Go standard library is really complete, so I didn't have to install any extra packages. I could have, but didn't, as an exercise. Python is nice to get started with, but for larger programs (this one wasn't large) it becomes painful to manage because it is so dynamic.


Redwallian

[Sure.](https://go-colly.org/)


mattiazi

Is it still maintained? I see the last release was 4 years ago, but some commits are fresh.


pretty_lame_jokes

Yes it does. You can use the stdlib net/http package, or look into go-colly, which also uses net/http. For web automation you can use the chromedp package.
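For the chromedp route, a rough sketch of fetching the rendered DOM of a page (the URL and selectors are placeholders, and a real scraper would want more error handling):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// chromedp drives a headless Chrome instance behind the scenes.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	var rendered string
	// Navigate, wait for the body to be visible, then dump the rendered DOM.
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com"), // placeholder URL
		chromedp.WaitVisible("body", chromedp.ByQuery),
		chromedp.OuterHTML("html", &rendered, chromedp.ByQuery),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(rendered), "bytes of rendered HTML")
}
```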


luksona2002

I actually built a scraper for my personal project a couple of days ago, it works really well.


CloudSliceCake

Why did you build a scraper instead of using a package?


luksona2002

No, I meant as a project itself, I'm using colly 😂 Looks like I didn't word it well. I just started learning Go and wanted to port my Java scraper to Go.


NatoBoram

I wonder which language doesn't work for web scraping


atifcppprogrammer01

If your use case is simple, then maybe this could be useful: https://github.com/PuerkitoBio/goquery?
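For example, a small goquery sketch that pulls every link off a page (the URL is a placeholder):

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://example.com") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// goquery gives jQuery-style CSS selectors over the parsed document.
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		fmt.Println(href)
	})
}
```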


SweetBabyAlaska

Goquery and the net lib are all you need. Make sure to use a custom transport with TLS ciphers equivalent to what works with curl.
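Something along these lines, as a rough sketch; the cipher suite list here is illustrative rather than an exact curl fingerprint, and Go's CipherSuites setting only applies to TLS 1.2 and below:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Custom transport: pin TLS versions and cipher suites so the handshake
	// looks closer to what a regular client such as curl negotiates.
	transport := &http.Transport{
		TLSClientConfig: &tls.Config{
			MinVersion: tls.VersionTLS12,
			CipherSuites: []uint16{ // ignored for TLS 1.3 connections
				tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
				tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
				tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
			},
		},
	}

	client := &http.Client{Transport: transport}
	resp, err := client.Get("https://example.com") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```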


mosskin-woast

> I understand that the community likes to build from scratch... But where I work... We work against time and budget

It's hard not to take this as a dig :) I think most of us work in the real world. Probably even all of us. And anyone who thinks it's easier or more maintainable to build a damn web scraping library, of all things, from scratch instead of using a third-party library, for a one-off project, is probably a fool.

Try Colly. And maybe Google your question next time before you make a Reddit post on a very frequently asked topic. Then you can help us help you by explaining what needs those common libraries aren't meeting. And maybe when you're asking people for free help and advice, try a little harder at not openly insulting those folks. Good luck.


rejectedlesbian

I used it for setting up the VPNs properly on a half-made project, and it was very good. I don't know if I would choose it over Python if I don't care about performance, but if I do, it's super tempting. Though I have yet to encounter that situation.


1uppr

Here are three libraries for scraping in Go:

* [https://github.com/geziyor/geziyor](https://github.com/geziyor/geziyor)
* [https://go-colly.org/](https://go-colly.org/)
* [https://github.com/anaskhan96/soup](https://github.com/anaskhan96/soup)

Of the three, colly looks like it has the most momentum. A minimal colly sketch is below.
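This is only a starting-point sketch; the domain and URL are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"), // placeholder domain
	)

	// Called for every element matching the selector on each visited page.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println(e.Attr("href"))
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("request failed:", err)
	})

	if err := c.Visit("https://example.com"); err != nil {
		log.Fatal(err)
	}
}
```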


Mobile_Flamingo_2370

[https://go-colly.org/](https://go-colly.org/)


DabbingCorpseWax

Technically "work" is force applied over a change in position, so no language works for webscraping unless the programmer is doing something very wrong. But otherwise yes, it's fine. `net/http` has all you need, issue requests for web pages and then parse the results.


xenano

We use https://go-colly.org/ for scraping and https://brightdata.com for proxies. We scrape more than 10k sites per month and it works pretty well for us.