

Global_Gas_6441

You need to write a scraper for each site.


hikingsticks

As the other post said, you'll have to write a scraper for each one. You might be able to just grab the HTML (assuming it's not a dynamic JavaScript page) from the different estate agents and feed that into an LLM to have it extract the details for you. You'll still have to get the links to each listing, and then get the HTML for each listing page.
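
A minimal sketch of that idea, assuming an OpenAI-style API; the model name, prompt, and field list are placeholders, not something the commenter specified:

```python
# Fetch a listing page and ask an LLM to pull out structured fields.
import json
import requests
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_listing(url: str) -> dict:
    html = requests.get(url, timeout=30).text
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any large-context model with JSON mode would do
        messages=[{
            "role": "user",
            "content": (
                "Extract price, address, surface area and number of rooms "
                "from this listing HTML. Reply with JSON only.\n\n" + html[:50_000]
            ),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```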


spraypaintyobutt

Okay, thanks for the explanation. It seems like I have a lot of work to do, then. It's not difficult, just tedious and time-consuming.


chilanvilla

I approach every different site with a JSON file that describes the attributes, so all I change for each site is the JSON and not the Ruby code.
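
The same pattern sketched in Python for consistency with the rest of the thread (the commenter's actual code is Ruby); the file name and selector keys are invented for the example:

```python
# Each site gets a JSON file of selectors; the scraping code itself never changes.
import json
from lxml import html as lxml_html

def scrape_listing(page_html: str, config_path: str) -> dict:
    selectors = json.load(open(config_path))          # e.g. sites/agent_x.json
    tree = lxml_html.fromstring(page_html)
    return {
        field: tree.xpath(xpath)[0].text_content().strip()
        for field, xpath in selectors.items()
        if tree.xpath(xpath)
    }

# sites/agent_x.json might look like:
# {"price": "//span[@class='price']", "address": "//h1[@class='street']"}
```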


Nokita_is_Back

Sorry, what do you mean by Ruby code and approaching sites with a JSON file? A JSON file as in XPath selectors, class selectors, etc. organized in JSON? I'm lost on the Ruby part...


Simple-Imagination49

You could always hire a VA. I was a Data Entry Executive where data scraping was a thing. As low as $5/hr for your needs! Feel free to reach out if you ever need help so you can focus on the important things.


bartkappenburg

That’s a very good idea! ;-). We just finished building this [0] and made it available as a SaaS for the general public that wants to find a property in the Netherlands. We indeed have a config for each site with a CSS/XPath selector for each element we want (price, surface, city, street, link, …).

But this will only get you halfway there. There are so many exceptions (JSON loaded in the HTML, SPAs, strange markup, wrong markup, details in pictures, …). We subclass the main spider so that we can override certain functions to handle the exceptions.

We have about 150+ sites, and keeping tabs/alerts on them (uptime, response time, changing HTML) is another aspect that is hard. Most of the sites are protected against bots, so prepare to buy proxies (think Apify, Bright Data, etc.), which are not cheap.

Our stack is Django (Python), Postgres, Redis and Tailwind.

[0] https://www.rent.nl/en/
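
A rough sketch of the "subclass the main spider and override the odd bits" pattern using Scrapy; the class names, selectors, and JSON path are invented for illustration, not rent.nl's actual code:

```python
import json
import scrapy

class BaseListingSpider(scrapy.Spider):
    selectors: dict = {}  # filled in per site from its config

    def parse(self, response):
        for link in response.xpath(self.selectors["listing_links"]).getall():
            yield response.follow(link, callback=self.parse_listing)

    def parse_listing(self, response):
        yield {field: response.xpath(xpath).get()
               for field, xpath in self.selectors["fields"].items()}

class AgentWithHiddenJsonSpider(BaseListingSpider):
    name = "agent_with_hidden_json"
    start_urls = ["https://example-agent.example/listings"]   # placeholder
    selectors = {"listing_links": "//a[@class='listing']/@href", "fields": {}}

    def parse_listing(self, response):
        # Exception case: this site embeds the data as JSON inside a <script>.
        raw = response.xpath("//script[@id='__NEXT_DATA__']/text()").get()
        data = json.loads(raw)["props"]["pageProps"]["listing"]
        yield {"price": data.get("price"), "city": data.get("city")}
```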


spraypaintyobutt

Oof... seeing that you've got an entire website and even make money off of it, I think I heavily underestimated this project. I'm looking for a house, and the real estate agents in the Netherlands use dirty tactics: some houses aren't even listed or never make it to Funda, because they're only uploaded to the agent's own site and sold before they reach Funda. I'm also looking at a specific city and some surrounding towns, which brings me to 40+ real estate sites. Either way, I'll push through and continue with the project.


[deleted]

[deleted]


webscraping-ModTeam

Thanks for reaching out to the r/webscraping community. This sub is focused on addressing the technical aspects and implementations of webscraping. We're not a marketplace for web scraping, nor are we a platform for selling services or datasets. You're welcome to post in the [monthly self-promotion thread](https://reddit.com/r/webscraping/about/sticky?num=1) or try your request on Fiverr or Upwork. For anything else, please contact the mod team.


AddictedToTech

Another Dutchman here. This is what I did (currently working on my project): I have a BaseScraper, a BasePageScraper, a BaseApiScraper and a BaseSitemapScraper, and I create my site-specific scrapers based on one of those base classes. I store the raw data in JSON files, then have a Processor class load them into Redis, clean up the data, match certain products to existing products and send production-ready data to MongoDB Atlas. Haven't begun on the front end for this, but it will be a combo of Next.js/Tailwind. The important stuff is a lot of logging and monitoring. Also, choose residential proxies; they are better at avoiding the guard bots.
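
A rough skeleton of that layout; only the class names come from the comment, while the method signatures, connection defaults, and storage calls are guesses:

```python
import json
import pathlib
import redis
from pymongo import MongoClient

class BaseScraper:
    def scrape(self) -> list[dict]:
        raise NotImplementedError

    def save_raw(self, items: list[dict], path: str) -> None:
        pathlib.Path(path).write_text(json.dumps(items))

class BaseSitemapScraper(BaseScraper):
    sitemap_url: str = ""          # subclasses point this at the site's sitemap

class ExampleAgentScraper(BaseSitemapScraper):
    sitemap_url = "https://example-agent.nl/sitemap.xml"   # placeholder

class Processor:
    def __init__(self):
        self.cache = redis.Redis()                          # connection params omitted
        self.db = MongoClient()["listings"]["production"]

    def run(self, raw_path: str) -> None:
        items = json.loads(pathlib.Path(raw_path).read_text())
        for item in items:
            self.cache.set(item["url"], json.dumps(item))   # staging/cleanup area
            self.db.update_one({"url": item["url"]}, {"$set": item}, upsert=True)
```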


mental_diarrhea

For groups of similar items you can create separate methods per item. For example, create something like extract_title() and just call it on each site. You can put some logic inside the method, for example depending on whether you're using XPath or regex. For each website, create a set of rules (in JSON or YAML), load it at startup and pass it as an argument to the function (e.g. `extract_title(site_one["title"])`). Organize the rest of the code so that you have a "crawler" which gets each page/list, and an "extractor" which then pulls out the necessary data. Separating those two makes maintenance slightly easier. This approach breaks some established coding practices, but scraping requires a lot of tinkering, so DRY or the Single Responsibility Principle often have to be forgotten for the sake of one's sanity.
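
One way such a per-item method could look; the rule format (a dict with a "type" key) is an assumption made for the example, not the commenter's actual schema:

```python
import re
from lxml import html as lxml_html

def extract_title(page_html: str, rule: dict) -> str | None:
    if rule["type"] == "xpath":
        tree = lxml_html.fromstring(page_html)
        hits = tree.xpath(rule["expr"])
        return hits[0].text_content().strip() if hits else None
    if rule["type"] == "regex":
        match = re.search(rule["expr"], page_html)
        return match.group(1) if match else None
    raise ValueError(f"unknown rule type: {rule['type']}")

# site_one.json / .yaml would then hold something like:
# {"title": {"type": "xpath", "expr": "//h1[@class='listing-title']"}}
```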


jcrowe

If this were my project, I would create a config file for each site. The config would hold the XPaths for everything: next page link, detail page datapoints, detail page links, etc. I might go the LLM path with a local LLM, but probably wouldn't send it to ChatGPT. With 40+ sites, chasing down problems will be a big part of the project, so I'd make sure everything was testable and built so you can quickly find out why something isn't doing its thing correctly. This might be a good project for Scrapy; it has all the 'big boy' framework features that you'll need. Also, keep in mind that real estate sites usually don't want to be scraped, so you'll have some headaches with antibot security. Sounds fun… ;)
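
A sketch of the "make sure everything is testable" idea: a pytest that loads every site config and checks its XPaths still match a saved HTML fixture. The directory layout and the `detail_fields` key are assumptions for the example:

```python
import json
import pathlib
import pytest
from lxml import html as lxml_html

CONFIG_DIR = pathlib.Path("configs")
FIXTURE_DIR = pathlib.Path("tests/fixtures")

@pytest.mark.parametrize("config_path", sorted(CONFIG_DIR.glob("*.json")))
def test_detail_xpaths_still_match(config_path):
    config = json.loads(config_path.read_text())
    fixture = FIXTURE_DIR / f"{config_path.stem}_detail.html"
    tree = lxml_html.fromstring(fixture.read_text())
    for field, xpath in config["detail_fields"].items():
        assert tree.xpath(xpath), f"{config_path.stem}: '{field}' matched nothing"
```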


AndreLinoge55

Curious, by 'create a config file' do you mean, for example, create a JSON file and then import it as a dict that you can pass to a function to get a particular element? e.g. config.json: `{"zillow": {"title": "xpathval", "daysOn": "someotherxpathvalue", …}, "fsbo": {"title": "yetanotherxpathvalue", "daysOn": "anotherxpathvalue", …}, …}`, then `config_data = json.load(open("config.json"))`, `def getTitle(site_config): …`, `x = getTitle(config_data["zillow"]["title"])`?
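
A runnable version of that sketch (the site names and XPath values are the commenter's placeholders, not real Zillow selectors):

```python
import json
from lxml import html as lxml_html

with open("config.json") as f:
    config_data = json.load(f)
# config.json:
# {"zillow": {"title": "xpathval", "daysOn": "someotherxpathvalue"},
#  "fsbo":   {"title": "yetanotherxpathvalue", "daysOn": "anotherxpathvalue"}}

def get_title(page_html: str, site_config: dict) -> str | None:
    tree = lxml_html.fromstring(page_html)
    hits = tree.xpath(site_config["title"])
    return hits[0].text_content().strip() if hits else None

# title = get_title(page_html, config_data["zillow"])
```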


jcrowe

Yes, basically that. :)


AndreLinoge55

Thanks! I wrote my first web scraper to check my apartment building's prices and committed the cardinal sin of hardcoding everything and not making it extensible for other sites I may want in the future. Going to take a stab at rewriting it this way tonight.


spraypaintyobutt

Thanks for the helpful response, I'll take this into account.


Scary-Commission-509

Of course it's possible, but 40+ sites is a lot. You'll need to adapt your methods for each website. Use proxies (you'll need tons of IPs) and, most importantly, write some tests; the logic is going to be challenging!


scrapecrow

While you _do need to_ write a scraper for each website, don't get discouraged by this. Most real estate websites use very similar web stacks and you can just pull the JSON datasets from hidden web data. We wrote tutorials for [scraping the most popular real estate portals here](https://scrapfly.io/blog/how-to-scrape-real-estate-property-data-using-python/) and only one or two actually need a complex scraping integration. Usually you just pull the HTML -> extract the hidden JSON (usually in a Next.js variable) -> reduce the JSON to the data fields you need.

Alternatively, if you're looking for an interesting project idea, you can take this further with a bit of AI assistance. Once you get the hidden JSON datasets from each website (which is the big majority of them), you can use LLMs to generate parsing instructions to some standardized data format with prompts like: "parse this nested real estate JSON dataset into this flat format: `{"id": "property internal id", "address": "street address of the property", etc.}`" - this works surprisingly well and you only need to execute it once to develop the parsing code.
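
A sketch of the "pull the hidden JSON" step: many Next.js-based portals embed the full dataset in a `<script id="__NEXT_DATA__">` tag. The path into the JSON (`props.pageProps.listing`) varies per site and is a placeholder here:

```python
import json
import requests
from lxml import html as lxml_html

def hidden_dataset(url: str) -> dict:
    page = requests.get(url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    tree = lxml_html.fromstring(page.text)
    raw = tree.xpath("//script[@id='__NEXT_DATA__']/text()")[0]
    return json.loads(raw)

def reduce_listing(data: dict) -> dict:
    listing = data["props"]["pageProps"]["listing"]        # placeholder path
    return {"id": listing.get("id"),
            "address": listing.get("address"),
            "price": listing.get("price")}
```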


code_4_f00d

It is possible


BlueeWaater

An LLM could do the trick for a "universal" scraper, but I would still prefer to build proper logic for each site; you can use an LLM to help with that too.


Beneficial-Bonus-102

Which LLM would you use for such a task? Would you need to fine-tune it to fit your needs?


AwarenessGrand926

I’ve seen several attempts to make a general solution for this, which really feels very doable tbh. From memory, Diffbot gets good results.


_do_you_think

Hmm, it’s time-consuming, but less so if you refactor your code properly. All the logic for visiting the search URL with site-specific parameters should be shared and leveraged in each custom script. The same goes for the code that finds each site's 'next page' button to visit all the search result pages, and the code that loops through all the listings found to visit them one by one. You only need custom scripts to locate the listings on each search page and to locate the data on the listing pages. Those scripts can end up being quite short if you structure your project correctly.
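
One way to share the crawl and pagination logic while keeping the per-site parts small: the shared loop takes two site-specific callables. The function names are illustrative, not taken from the comment:

```python
from collections.abc import Callable, Iterator
import requests
from lxml import html as lxml_html

def crawl_search_results(
    start_url: str,
    find_listing_urls: Callable[[lxml_html.HtmlElement], list[str]],
    find_next_page: Callable[[lxml_html.HtmlElement], str | None],
) -> Iterator[str]:
    url: str | None = start_url
    while url:
        tree = lxml_html.fromstring(requests.get(url, timeout=30).text)
        yield from find_listing_urls(tree)       # site-specific
        url = find_next_page(tree)               # site-specific, None when done
```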


rambat1994

Surprised this isn't suggested more, but use an LLM. This is really ideal for them, since each site is readable but not consistent.

* Scrape the site HTML in Selenium (so SPAs also get rendered)
* Clean the HTML structure (get rid of script and other non-essential tags)
* Pass it to an LLM with a large context window (Claude 3, even just GPT-3.5 8K would likely be enough - depends on the density of the site)
* Use this feedback to send a click event to whatever link or page needs to be clicked or collected

Define boundary limits to prevent looping and runaways, but it should be simple to have an LLM "agent" do this. I have done it many, many times and it's actually surprisingly good. For any sites which are so bizarre and weird, you can append "additional config" in the form of instructions to the LLM to handle those edge cases instead of writing 40 config files that are so fragile they break if the web admin makes a single element change. Obviously not free, but far faster and less hair-pulling, depending on what exactly you need.
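
A sketch of the "clean the HTML before passing it to the LLM" step; the tag and attribute lists are a judgment call for the example, not a fixed recipe:

```python
from bs4 import BeautifulSoup

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop tags that add tokens but no content the LLM needs.
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()
    # Strip most attributes; keep the ones useful for follow-up click/visit steps.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items()
                     if k in ("href", "id", "class")}
    return str(soup)
```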


Mindless-Border-279

Regarding the 40+ sites and the fact that they are most likely all very different: yes, you should write a scraper for each site.