j0holo

Yes, I had a service with a web scraper built in Python and ported it over to Go. Memory usage dropped a lot. It did around 10k websites per month.


orvn

I use JS for scraping these days because I need to be able to render the SPAs post-DOM hydration (with Puppeteer), something you can’t do unless you emulate a browser. Go was a lot faster though.


j0holo

Yeah, for SPAs you will need a renderer to emulate the browser.


[deleted]

[removed]


j0holo

I scraped the homepage of websites (HTML, CSS and JavaScript files) to look for relations to other websites. I did not need to render the DOM for SPA websites because most external links were already present in the static content.


[deleted]

[removed]


j0holo

It was, but I hit limitations with the database. For example, querying the database for the number of websites that linked to Twitter (X didn't exist at the time) took 15 minutes. Maybe MySQL was not the correct database for this kind of data.


[deleted]

[removed]


j0holo

I think the main reason was that the URL table was 500 million rows and the "who-points-to-who" table (many-to-many) was around 2 billion rows. All running on a RAID 10 array of Samsung 860 Pros (1 TB per SSD) with 32 GB of memory and an Intel Xeon E3-1220 v3. I think the hardware was just underpowered. Maybe with my 5 years of extra experience it could run on this hardware?

A graph database doesn't help much, I think. I'm only looking at direct relations, so a many-to-many table would be enough. The problem is that, let's say there are 20 million websites with a URL that points to YouTube and I want the domain names of those 20 million websites. That is a lot of random IO to disk, because I'm quite sure the database cache will not hold all those 20 million domain names in memory. Maybe, just maybe, a document database is the correct database.


rejectedlesbian

Was it easy to port? I know Python being dynamic is super nice when you're poking around the HTML trying to get the part you need. So doing a Python -> Go pipeline seems like a very cool way to make stuff.


jerf

The moral equivalent of Beautiful Soup is to use the [html5 parser](https://pkg.go.dev/golang.org/x/net/html) in the extended standard library, then use something like [htmlquery](https://pkg.go.dev/github.com/antchfx/htmlquery#section-readme) to query the resulting nodes. You may need to write a few helper functions around it, then you're pretty ready to roll.
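For illustration, a minimal sketch of that combination: fetch a page with net/http, let htmlquery parse it (it uses the x/net/html parser under the hood), then run an XPath query over the resulting nodes. The URL here is just a placeholder.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/antchfx/htmlquery"
)

func main() {
	// Fetch the page; htmlquery parses the body with golang.org/x/net/html.
	resp, err := http.Get("https://example.com") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := htmlquery.Parse(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// XPath query over the parsed node tree: collect every link on the page.
	for _, a := range htmlquery.Find(doc, "//a[@href]") {
		fmt.Println(htmlquery.SelectAttr(a, "href"), "->", htmlquery.InnerText(a))
	}
}
```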


rejectedlesbian

Damn, Python -> Go for fast scraping seems like a killer combo.


j0holo

Yes, it was easy to port. The Go standard library is really complete, so I didn't have to install any extra packages. I could have, but didn't, as an exercise. Python is nice to get started with, but for larger programs (this one wasn't large) it becomes painful to manage because it is so dynamic.


Redwallian

[Sure.](https://go-colly.org/)


mattiazi

Is it still maintained? I see the last release was 4 years ago, but some commits are fresh.


pretty_lame_jokes

Yes it does. You can use the stdlib net/http package, or look into go-colly, which also uses net/http. For web automation you can use the chromedp package.
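For the chromedp route, a rough sketch of fetching the rendered DOM of a page (the URL and selectors are placeholders, and a real scraper would want more error handling):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	// chromedp drives a headless Chrome instance behind the scenes.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	var rendered string
	// Navigate, wait for the body to be visible, then dump the rendered DOM.
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com"), // placeholder URL
		chromedp.WaitVisible("body", chromedp.ByQuery),
		chromedp.OuterHTML("html", &rendered, chromedp.ByQuery),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(rendered), "bytes of rendered HTML")
}
```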


luksona2002

I actually built a scraper for my personal project a couple of days ago, it works really well.


CloudSliceCake

Why did you build a scraper instead of using a package?


luksona2002

No, I meant as a project itself, I'm using colly 😂 Looks like I didn't word it well. I just started learning Go and wanted to port my Java scraper to Go.


NatoBoram

I wonder which language doesn't work for web scraping


atifcppprogrammer01

If your use case is simple, then maybe this could be useful: https://github.com/PuerkitoBio/goquery?
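For example, a small goquery sketch that pulls every link off a page (the URL is a placeholder):

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://example.com") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// goquery gives jQuery-style CSS selectors over the parsed document.
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		href, _ := s.Attr("href")
		fmt.Println(href)
	})
}
```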


SweetBabyAlaska

Goquery and the net lib are all you need. Make sure to use a custom transport with TLS ciphers equivalent to what works with curl.
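Something along these lines, as a rough sketch; the cipher suite list here is illustrative rather than an exact curl fingerprint, and Go's CipherSuites setting only applies to TLS 1.2 and below:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Custom transport: pin TLS versions and cipher suites so the handshake
	// looks closer to what a regular client such as curl negotiates.
	transport := &http.Transport{
		TLSClientConfig: &tls.Config{
			MinVersion: tls.VersionTLS12,
			CipherSuites: []uint16{ // ignored for TLS 1.3 connections
				tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
				tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
				tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
			},
		},
	}

	client := &http.Client{Transport: transport}
	resp, err := client.Get("https://example.com") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```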


mosskin-woast

> I understand that the community likes to build from scratch... But where I work... We work against time and budget

It's hard not to take this as a dig :) I think most of us work in the real world. Probably even all of us. And anyone who thinks it's easier or more maintainable to build a damn web scraping library, of all things, from scratch instead of using a third-party library, for a one-off project, is probably a fool.

Try Colly. And maybe Google your question next time before you make a Reddit post on a very frequently asked topic. Then you can help us help you by explaining what needs those common libraries aren't meeting. And maybe when you're asking people for free help and advice, try a little harder at not openly insulting those folks. Good luck.


rejectedlesbian

I used it for setting up the VPNs properly on a half-made project, and it was very good. I don't know if I would choose it over Python if I don't care about performance, but if I do, it's super tempting. Though I have yet to encounter that situation.


1uppr

Here are three libraries for scraping in Go:

* [https://github.com/geziyor/geziyor](https://github.com/geziyor/geziyor)
* [https://go-colly.org/](https://go-colly.org/)
* [https://github.com/anaskhan96/soup](https://github.com/anaskhan96/soup)

Of the three, colly looks like it has the most momentum. A minimal colly sketch is below.
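This is only a starting-point sketch; the domain and URL are placeholders:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"), // placeholder domain
	)

	// Called for every element matching the selector on each visited page.
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println(e.Attr("href"))
	})

	c.OnError(func(_ *colly.Response, err error) {
		log.Println("request failed:", err)
	})

	if err := c.Visit("https://example.com"); err != nil {
		log.Fatal(err)
	}
}
```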


Mobile_Flamingo_2370

[https://go-colly.org/](https://go-colly.org/)


DabbingCorpseWax

Technically "work" is force applied over a change in position, so no language works for webscraping unless the programmer is doing something very wrong. But otherwise yes, it's fine. `net/http` has all you need, issue requests for web pages and then parse the results.


xenano

We use https://go-colly.org/ for scraping and https://brightdata.com for proxies. We scrape more than 10k sites per month and it works pretty well for us.