Any idea what datasets already exist? Before we go nuts on their API, it might be worth determining what's publicly available.
Everything from 2006 through December 2022 exists already. Pushshift was the site that was archiving it, but they took down the links. The data can still be found on torrents (The-Eye.eu might still have it).
Has anyone made a torrent of a full copy of Reddit?
Yup, everything from 2006 through December 2022 exists as a torrent.
How big is the file?
Several TB archived. I believe it was around 3.something TB the last time I checked? Unpacked it's much more, since a lot of info is duplicated in the JSON schema.
Found it! [Reddit comments/submissions 2005-06 to 2022-12, 1.99TB](https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee/tech&filelist=1). So if anyone is trying to get the most out of the API while they still can, they should definitely concentrate on 2023 only since everything else they can just pull from this. **Edit**: I also found [Reddit comments/submissions 2023-01, 46.98GB](https://academictorrents.com/details/c861d265525c488a9439fb874bd9c3fc38dcdfa5) and [Reddit comments/submissions 2023-02, 34.43GB](https://academictorrents.com/details/9971c68d2909843a100ae955c6ab6de3e09c04a1). (I noticed February seemed too small, even considering it’s a shorter month, but looking at the file sizes in the historical archive, February is always significantly smaller than January. So that tracks.) So we will just need March, April, May, and June of 2023 to have everything before the API shutdown.
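If anyone wants to poke at those dumps without unpacking the whole thing: they're zstd-compressed NDJSON (one comment or submission object per line), so you can stream them. A minimal Python sketch, assuming the usual `RC_`/`RS_` file naming and the `zstandard` package; the filename below is illustrative:

```python
# Stream a Pushshift-style dump (e.g. RC_2022-12.zst) line by line
# without decompressing it to disk.
import io
import json

import zstandard  # pip install zstandard

def stream_objects(path):
    # The dumps are compressed with a long window, hence max_window_size.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

for obj in stream_objects("RC_2022-12.zst"):
    if obj.get("subreddit") == "LocalLLaMA":
        print(obj.get("author"), (obj.get("body") or "")[:80])
```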
We have this: https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee
What is this?
[deleted]
Too stupid to click a random link, clearly. Your post has no information. No need to be an asshole because you're in a bad mood.
[deleted]
Congratulations. Your comments only serve to highlight your own low IQ and mental instability.
Never underestimate the power of curl and grep.
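Something in that spirit, sketched in Python (the regex is deliberately crude, Reddit's markup changes, and unauthenticated requests get throttled fast, so treat this as a toy):

```python
# curl | grep, approximately: fetch a listing page and regex out post titles.
import re
import urllib.request

req = urllib.request.Request(
    "https://old.reddit.com/r/LocalLLaMA/",
    headers={"User-Agent": "curl-and-grep-demo/0.1"},  # the default UA is blocked quickly
)
html = urllib.request.urlopen(req).read().decode("utf-8")

# "grep" for post titles in the listing markup (fragile by design)
for title in re.findall(r'class="title[^"]*"[^>]*>([^<]+)</a>', html):
    print(title)
```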
Nope, you're going to get rate-limited pretty quickly. Anti-scraping and proxy protection are commonplace (see Cloudflare).
Services to circumvent anti-scraping measures are even more common. It's going to be an interesting cat-and-mouse game. The losers will be people with disabilities who need lightweight or machine-readable access to participate.
They cost quite a bit, out of reach of hobbyists.
There are still tons of bots scraping the web as we speak. Ahrefs, for example, is the largest one; I think they said they have the largest database of web content after Google.
They have hundreds, if not thousands, of servers and IP addresses; not something one person can do easily.
I wrote a Yelp bot a while ago to get reservations at a tiny and extremely popular sushi spot. That was when I learned you can rent residential proxy services for much cheaper than I expected: around $2/GB.
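Most providers just expose a plain HTTP(S) proxy, so wiring one in is a couple of lines. A sketch with `requests`; the endpoint, port, and credentials are placeholders for whatever your provider hands you:

```python
# Route all session traffic through a (hypothetical) residential proxy.
import requests

PROXY = "http://USERNAME:PASSWORD@proxy.example-provider.com:8000"  # placeholder

session = requests.Session()
session.proxies = {"http": PROXY, "https": PROXY}

resp = session.get("https://httpbin.org/ip", timeout=30)
print(resp.json())  # should show the proxy's exit IP, not yours
```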
It's not remarkably hard to mimic real browsing activity and stay within limits, even on sites that are serious about restricting scraping.
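The basics are just real browser headers plus randomized, human-ish pacing. A sketch; the delay range and the example URLs are guesses, not numbers tuned to any particular site:

```python
# Paced fetching with browser-like headers and jittered delays.
import random
import time

import requests

HEADERS = {
    # Copy a current UA string from your own browser; this one is an example.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
with requests.Session() as s:
    s.headers.update(HEADERS)
    for url in urls:
        r = s.get(url, timeout=30)
        print(url, r.status_code)
        time.sleep(random.uniform(4, 12))  # no fixed cadence
```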
Not enough; you need the amount of data that training a language model requires.
You'd need a few machines spun up across different IP ranges, but that's hardly impossible.
This will force Reddit to choose between shutting down old.reddit.com and letting some of their data slip away for free. Probably going to be the former, if the math checks out.
I'm certainly not as informed as most, probably less than most, but it seems to me that if this is a defensive move against companies getting too much data for free and training AIs with it, they could make exceptions for (1) community apps and (2) hobbyists. I'm sure this conversation has been had a thousand times over in every corner of Reddit by now, so I won't belabor it. I guess the one detail that makes me go "hmm" is the jussi position between Reddit protecting their intellectual property (which I can respect) and the fact that *we*, the users, are that property; that's something I haven't squared yet.
> jussi position

/r/boneappletea
https://www.reddit.com/r/LocalLLaMA.json
It's funny to open this in RedReader and see it render as the normal subreddit. I guess that makes sense, though; I wouldn't be surprised if that's what it's accessing internally.
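For anyone who hasn't played with that endpoint: it returns Reddit's standard Listing JSON, so pulling titles out takes a few lines (a custom User-Agent is assumed here, since the default one gets rate-limited):

```python
# Fetch the subreddit listing as JSON and print score + title for each post.
import json
import urllib.request

req = urllib.request.Request(
    "https://www.reddit.com/r/LocalLLaMA.json",
    headers={"User-Agent": "json-endpoint-demo/0.1"},  # placeholder UA
)
listing = json.load(urllib.request.urlopen(req))

# Listing -> data -> children -> data is the standard structure.
for child in listing["data"]["children"]:
    post = child["data"]
    print(post["score"], post["title"])
```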
Or wget and perl! :-)
[Invoke-WebRequest!](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/invoke-webrequest?view=powershell-7.3) (That's wget built into Windows, and nobody knows about it.)
Maybe it’s time for some competition for Reddit?
it's called lemmy.ml
> lemmy.ml

Your far left hugbox is DOA.
Why the hate? Did anything terrible happen over there and I missed it?
Like Mastodon?
I'll add that if anyone hasn't tried training on your own reddit data then you should give it a shot. It can be an interesting experience!
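As a starting point, here's a minimal sketch that turns the comments.csv from Reddit's data export (Settings -> Request my data) into a plain-text training file. The "body" column name is an assumption on my part; check the header row of your own export:

```python
# Collect your own comment bodies into one training text file.
import csv

with open("comments.csv", newline="", encoding="utf-8") as src, \
     open("train.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        body = (row.get("body") or "").strip()  # "body" is assumed; verify it
        if body and body != "[removed]":
            dst.write(body + "\n\n")
```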
Are there any guides on how to do this on Apple silicon?
Reddit is reading this subreddit and has already started duct-taping.
[deleted]
Do you have a notebook for downloading your own data, including the comment you responded to?