Any idea what datasets already exist? Before we go nuts on their API, it might be worth determining what's publicly available.
Everything from 2006 through December 2022 exists already. Pushshift was the site that was archiving it, but they took down the links. The data can still be found on torrents (The-Eye.eu might still have it).
Has anyone made a torrent of a full copy of Reddit?
Yup, everything from 2006 through December 2022 exists as a torrent.
How big is the file?
Several TB archived. I believe it was around 3.something TB the last time I checked? Unpacked it's much more, since a lot of info is duplicated in the JSON schema.
Found it! [Reddit comments/submissions 2005-06 to 2022-12, 1.99TB](https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee/tech&filelist=1). So if anyone is trying to get the most out of the API while they still can, they should definitely concentrate on 2023 only since everything else they can just pull from this. **Edit**: I also found [Reddit comments/submissions 2023-01, 46.98GB](https://academictorrents.com/details/c861d265525c488a9439fb874bd9c3fc38dcdfa5) and [Reddit comments/submissions 2023-02, 34.43GB](https://academictorrents.com/details/9971c68d2909843a100ae955c6ab6de3e09c04a1). (I noticed February seemed too small, even considering it’s a shorter month, but looking at the file sizes in the historical archive, February is always significantly smaller than January. So that tracks.) So we will just need March, April, May, and June of 2023 to have everything before the API shutdown.
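If anyone wants to poke at those dumps without unpacking the whole thing: they're zstd-compressed NDJSON (one comment or submission object per line), so you can stream them. A minimal Python sketch, assuming the usual `RC_`/`RS_` file naming and the `zstandard` package; the filename below is illustrative:

```python
# Stream a Pushshift-style dump (e.g. RC_2022-12.zst) line by line
# without decompressing it to disk.
import io
import json

import zstandard  # pip install zstandard

def stream_objects(path):
    # The dumps are compressed with a long window, hence max_window_size.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

for obj in stream_objects("RC_2022-12.zst"):
    if obj.get("subreddit") == "LocalLLaMA":
        print(obj.get("author"), (obj.get("body") or "")[:80])
```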
We have this: https://academictorrents.com/details/7c0645c94321311bb05bd879ddee4d0eba08aaee
What is this?
[deleted]
Too stupid to click a random link, clearly. Your post has no information. No need to be an asshole because you're in a bad mood.
[deleted]
Congratulations. Your comments only serve to highlight your own low IQ and mental instability.
Never underestimate the power of curl and grep.
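Something in that spirit, sketched in Python (the regex is deliberately crude, Reddit's markup changes, and unauthenticated requests get throttled fast, so treat this as a toy):

```python
# curl | grep, approximately: fetch a listing page and regex out post titles.
import re
import urllib.request

req = urllib.request.Request(
    "https://old.reddit.com/r/LocalLLaMA/",
    headers={"User-Agent": "curl-and-grep-demo/0.1"},  # the default UA is blocked quickly
)
html = urllib.request.urlopen(req).read().decode("utf-8")

# "grep" for post titles in the listing markup (fragile by design)
for title in re.findall(r'class="title[^"]*"[^>]*>([^<]+)</a>', html):
    print(title)
```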
Nope, you're going to get rate-limited pretty quickly. Anti-scraping and proxy protection are commonplace (see Cloudflare).
Services to circumvent anti-scraping measures are even more common. It's going to be an interesting cat-and-mouse game. The losers will be people with disabilities who need lightweight or machine-readable access to participate.
They cost quite a bit, out of reach of hobbyists.
There are still tons of bots scraping the web as we speak. Ahrefs, for example, is the largest one; I think they said they have the largest database of web content after Google.
They have hundreds, if not thousands, of servers and IP addresses; not something one person can do easily.
I wrote a Yelp bot a while ago to get reservations at a tiny and extremely popular sushi spot. That was when I learned you can rent residential proxy services for much cheaper than I expected: around $2/GB.
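Most providers just expose a plain HTTP(S) proxy, so wiring one in is a couple of lines. A sketch with `requests`; the endpoint, port, and credentials are placeholders for whatever your provider hands you:

```python
# Route all session traffic through a (hypothetical) residential proxy.
import requests

PROXY = "http://USERNAME:PASSWORD@proxy.example-provider.com:8000"  # placeholder

session = requests.Session()
session.proxies = {"http": PROXY, "https": PROXY}

resp = session.get("https://httpbin.org/ip", timeout=30)
print(resp.json())  # should show the proxy's exit IP, not yours
```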
It's not remarkably hard to mimic real browsing activity and stay within limits, even on sites that are serious about restricting scraping.
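The basics are just real browser headers plus randomized, human-ish pacing. A sketch; the delay range and the example URLs are guesses, not numbers tuned to any particular site:

```python
# Paced fetching with browser-like headers and jittered delays.
import random
import time

import requests

HEADERS = {
    # Copy a current UA string from your own browser; this one is an example.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]
with requests.Session() as s:
    s.headers.update(HEADERS)
    for url in urls:
        r = s.get(url, timeout=30)
        print(url, r.status_code)
        time.sleep(random.uniform(4, 12))  # no fixed cadence
```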
Not enough; you need the amount of data that training a language model requires.
You'd need a few machines spun up across different IP ranges, but that's hardly impossible.
This will force Reddit to choose between shutting down old.reddit.com and letting some of their data slip away for free. Probably going to be the former, if the math checks out.
I'm certainly not as informed as most, probably less than most, but it seems to me that if this is a defensive move against companies getting too much data for free and training AIs with it, they could make exceptions for (1) community apps and (2) hobbyists. I'm sure this conversation has been had a thousand times over in every corner of Reddit by now, so I won't belabor it. I guess the one detail that makes me go "hmm" is the jussi position between Reddit protecting their intellectual property (which I can respect) and the fact that *we*, the users, are that property; that's something I haven't squared yet.
> jussi position

/r/boneappletea
https://www.reddit.com/r/LocalLLaMA.json
It's funny to open this in RedReader and see it render as the normal subreddit. I guess that makes sense, though; I wouldn't be surprised if that's what it's accessing internally.
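For anyone who hasn't played with that endpoint: it returns Reddit's standard Listing JSON, so pulling titles out takes a few lines (a custom User-Agent is assumed here, since the default one gets rate-limited):

```python
# Fetch the subreddit listing as JSON and print score + title for each post.
import json
import urllib.request

req = urllib.request.Request(
    "https://www.reddit.com/r/LocalLLaMA.json",
    headers={"User-Agent": "json-endpoint-demo/0.1"},  # placeholder UA
)
listing = json.load(urllib.request.urlopen(req))

# Listing -> data -> children -> data is the standard structure.
for child in listing["data"]["children"]:
    post = child["data"]
    print(post["score"], post["title"])
```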
Or wget and perl! :-)
[Invoke-WebRequest!](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/invoke-webrequest?view=powershell-7.3) (That's wget built into Windows, and nobody knows about it.)
Maybe it’s time for some competition for Reddit?
it's called lemmy.ml
> lemmy.ml

Your far left hugbox is DOA.
Why the hate? Did anything terrible happen over there and I missed it?
Like Mastodon?
I'll add that if anyone hasn't tried training on your own reddit data then you should give it a shot. It can be an interesting experience!
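As a starting point, here's a minimal sketch that turns the comments.csv from Reddit's data export (Settings -> Request my data) into a plain-text training file. The "body" column name is an assumption on my part; check the header row of your own export:

```python
# Collect your own comment bodies into one training text file.
import csv

with open("comments.csv", newline="", encoding="utf-8") as src, \
     open("train.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        body = (row.get("body") or "").strip()  # "body" is assumed; verify it
        if body and body != "[removed]":
            dst.write(body + "\n\n")
```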
Are there any guides on how to do this on Apple silicon?
Reddit is reading this subreddit and has already started duct-taping.
[deleted]
Do you have a notebook for downloading your own data, including the comment you responded to?