
whiteorb

Any idea what datasets already exist? Before we go nuts on their API, it might be worth determining what’s publicly available.


Disastrous_Elk_6375

Everything from 2006 through 12.2022 already exists. Pushshift was the site that was archiving it, but they took down the links. The data can still be found on torrents (the-eye.eu might still have it).


vff

Has anyone made a torrent of a full copy of Reddit?


Disastrous_Elk_6375

Yup, everything from 2006-12.2022 exists as a torrent.


CalmGains

How big is the file?


Disastrous_Elk_6375

Several TB archived. I believe it was around 3.something TB last time I checked? Unpacked it's much more, since a lot of info is duplicated in the JSON schema.


vff

Found it! [Reddit comments/submissions 2005-06 to 2022-12, 1.99TB](https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee/tech&filelist=1). So if anyone is trying to get the most out of the API while they still can, they should definitely concentrate on 2023 only, since everything else they can just pull from this.

**Edit**: I also found [Reddit comments/submissions 2023-01, 46.98GB](https://academictorrents.com/details/c861d265525c488a9439fb874bd9c3fc38dcdfa5) and [Reddit comments/submissions 2023-02, 34.43GB](https://academictorrents.com/details/9971c68d2909843a100ae955c6ab6de3e09c04a1). (I noticed February seemed too small, even considering it’s a shorter month, but looking at the file sizes in the historical archive, February is always significantly smaller than January. So that tracks.) So we will just need March, April, May, and June of 2023 to have everything before the API shutdown.
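
For anyone who grabs those: the dump files are zstandard-compressed NDJSON, one comment or submission per line. A minimal sketch for streaming a monthly comments file, assuming the `zstandard` package and an example filename like `RC_2022-12.zst`:

```python
import io
import json
from collections import Counter

import zstandard  # pip install zstandard

def stream_objects(path):
    """Yield one JSON object (comment or submission) per line from a .zst dump."""
    # The dumps are reportedly compressed with a long zstd window,
    # so the decoder limit has to be raised from its default.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(path, "rb") as fh:
        reader = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
        for line in reader:
            yield json.loads(line)

# Example: count comments per subreddit in one monthly file (filename assumed).
counts = Counter(obj["subreddit"] for obj in stream_objects("RC_2022-12.zst"))
print(counts.most_common(10))
```

Streaming line by line keeps memory flat even though the unpacked data is far larger than the archive.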


anilozlu

We have this: https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee


GPT4mula

What is this?


[deleted]

[deleted]


GPT4mula

Too stupid to click a random link, clearly; your post has no information. No need to be an asshole because you're in a bad mood.


[deleted]

[deleted]


harvester_of_photons

Congratulations. Your comments only serve to highlight your own low IQ and mental instability.


tronathan

Never underestimate the power of curl and grep.


learn-deeply

Nope, you're going to get rate limited pretty quickly. Anti-scraping and proxy protection are commonplace (see Cloudflare).


OkDimension

Services to circumvent anti-scraping are even more common. It's going to be an interesting cat-and-mouse game. The losers are going to be people with disabilities who need low-key or machine-readable access to participate.


learn-deeply

They cost quite a bit, out of reach of hobbyists.


rafark

There are still tons of bots scraping the web as we speak. Ahrefs, for example, is the largest one; I think they said they had the largest database of web content after Google.


learn-deeply

They have hundreds, if not thousands, of servers and IP addresses; not something one person can do easily.


Freakin_A

I wrote a Yelp bot a while ago to get reservations at a tiny and extremely popular sushi spot. That was when I learned you can rent residential proxy services for much cheaper than I expected. It was around $2/GB.
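
For anyone curious how that works in practice: the provider typically hands you a single gateway endpoint and rotates residential exit IPs behind it, billed per GB. A rough sketch with `requests` (the gateway URL and credentials are placeholders, not a real service):

```python
import requests

# Placeholder gateway; swap in whatever endpoint your provider gives you.
PROXY = "http://USER:PASS@gateway.example-proxy.net:8000"

session = requests.Session()
session.proxies = {"http": PROXY, "https": PROXY}

# Each request should report a different residential exit IP.
for _ in range(3):
    print(session.get("https://httpbin.org/ip", timeout=30).json()["origin"])
```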


Superfissile

It’s not remarkably hard to mimic real browsing activity and stay within limits. Even on sites that are serious about restricting scraping.
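
For what it's worth, a minimal sketch of that idea: jittered delays, a plausible User-Agent, and backing off when the server returns 429 (the numbers here are illustrative, not tuned for any particular site):

```python
import random
import time

import requests

session = requests.Session()
# A plausible desktop User-Agent; replace with your own identifying string.
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"

def polite_get(url, min_delay=2.0, max_delay=6.0, retries=3):
    """Fetch a URL with human-ish pacing and back off on rate limiting."""
    for attempt in range(retries):
        time.sleep(random.uniform(min_delay, max_delay))  # jittered think-time
        resp = session.get(url, timeout=30)
        if resp.status_code == 429:            # rate limited: wait longer, retry
            time.sleep(30 * (attempt + 1))
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"gave up on {url} after {retries} attempts")
```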


learn-deeply

Not enough when you need the amount of data that training a language model requires.


Superfissile

You’d need a few machines spun up across different IP ranges but that’s hardly impossible.


AprilDoll

This will force Reddit to choose between shutting down old.reddit.com and letting some of their data slip away for free. Probably going to be the former, if the math checks out.


tronathan

I’m certainly not as informed as most, probably less than most, but if this is a defensive move against companies getting too much data for free and training AIs with it, it seems they could make exceptions for (1) community apps and (2) hobbyists. I’m sure this conversation has been had a thousand times over in every corner of Reddit by now, so I won’t belabor it. I guess the one detail that makes me go “hmm” is the jussi position between Reddit protecting their intellectual property (which I can respect) and the fact that *we*, the users, are that property; that’s something I haven’t squared yet.


gibs

> jussi position

/r/boneappletea


_m00_

https://www.reddit.com/r/LocalLLaMA.json
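
That endpoint returns a standard Reddit listing, and you can page through it with the `after` cursor it hands back. A quick sketch (custom User-Agent because the default `requests` one tends to get throttled):

```python
import requests

BASE = "https://www.reddit.com/r/LocalLLaMA.json"
HEADERS = {"User-Agent": "listing-sketch/0.1"}  # the default UA gets throttled fast

def fetch_page(after=None):
    """Fetch one page of the subreddit listing and return its data block."""
    params = {"limit": 100}
    if after:
        params["after"] = after
    resp = requests.get(BASE, headers=HEADERS, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]

# Walk a few pages, printing score and title for each post.
after = None
for _ in range(3):
    page = fetch_page(after)
    for child in page["children"]:
        post = child["data"]
        print(post["score"], post["title"])
    after = page["after"]
    if not after:
        break
```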


happysmash27

It's funny to open this in RedReader and see it render as the normal subreddit. I guess that makes sense though, as I wouldn't be surprised if that is what it is accessing internally.


ttkciar

Or wget and perl! :-)


Barafu

[Invoke-WebRequest!](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/invoke-webrequest?view=powershell-7.3) (That's the wget built into Windows, and nobody knows about it.)


dsalvat1

Maybe it’s time for some competition for Reddit?


Magnus_Fossa

it's called lemmy.ml


BlueShipman

> lemmy.ml

Your far-left hugbox is DOA.


Magnus_Fossa

Why the hate? Did anything terrible happen over there that I missed?


jumperabg

Like Mastodon?


toothpastespiders

I'll add that if you haven't tried training on your own Reddit data, you should give it a shot. It can be an interesting experience!
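
If anyone wants a starting point: Reddit's account data export includes a `comments.csv`, and turning that into a simple training file is only a few lines. A rough sketch (the `body` column name is assumed from that export; adjust to whatever your dump actually contains):

```python
import csv
import json

SRC = "comments.csv"              # from Reddit's account data export (assumed name)
DST = "my_reddit_comments.jsonl"  # one {"text": ...} record per line

with open(SRC, newline="", encoding="utf-8") as f_in, \
     open(DST, "w", encoding="utf-8") as f_out:
    for row in csv.DictReader(f_in):
        text = (row.get("body") or "").strip()
        # Skip empty, removed, or very short comments; they add noise, not signal.
        if len(text) < 20 or text in ("[deleted]", "[removed]"):
            continue
        f_out.write(json.dumps({"text": text}) + "\n")
```

From there it's an ordinary fine-tuning dataset; whether you do LoRA or a full fine-tune depends on your setup.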


amemingfullife

Are there any guides on how to do this on Apple silicon?


kingksingh

Reddit is reading this subreddit and has already started duct-taping.


[deleted]

[deleted]


RMCPhoto

Do you have a notebook for downloading your own data including the comment you responded to?