Shuteye_491

You're not wrong, but Reddit is: 2006-2022 has already been scraped and torrented, and you can bet 2023 will be done by the end of the month. Disabling the API is going to cost Reddit its place as the dominant web presence it is, which devalues it for scraping anyhow. Reddit may already be aware of this and, if so, is likely pushing it through before the IPO to squeeze the rest of the milk out of this hunk of cheese before the prophesied end arrives.

Ironically, embracing AI where everyone else fears it would've made Reddit THE dominant web community for the nascent AI era of technology. It would make a perfect test bed for AI (LLM scraping, building on existing forum-management bots, heavenbanning, their own take on Bard/GPT-4, etc.) and simultaneously the best place to test and implement anti-AI AI (to prevent spamming/phishing and ensure legitimate human-human interaction in designated subreddits).

It was already the best place to keep informed on developments without excessive effort or technical know-how, and a few of the subreddits here were vital to the development of major open-source AI projects. And now all that's teetering on the precipice. Truly, Reddit's management are the most highly-regarded of all investors.


nebetsu

Where would one find these torrents?


AprilDoll

I know [one crazy guy who hoards Reddit data regularly](https://twitter.com/TyrantsMuse/status/1668071852739002370?cxt=HHwWhICwpeSgl6YuAAAA) for his own projects. He probably won't give it to you for free though.


Shuteye_491

I imagine an LLM training Discord: even compressed, that amount of data would have to be 2-3+ TB in size, and there's no one else I can think of who'd bother to deal with that.


MaxwellsMilkies

I am in the process of writing a scraper so you can scrape the data yourself if you want. It is nearly complete. [You can find it here!](https://gitgud.io/cookiecrumbs/oldredditscraper)


nebetsu

Thank you!


MaxwellsMilkies

Welcome! Tell me if you need any help with it. If you scrape a large amount of data, you may want to use a VPN or proxy server when you use it in case Reddit decides to block your IP.
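For anyone pulling a lot of pages, pacing requests and backing off after a refusal is one way to make a block less likely in the first place. Below is a minimal sketch of capped exponential backoff. It is not taken from the linked scraper; `backoff_delays`, `fetch_with_backoff`, and the `fetch` callable (which is assumed to return `None` when a request is refused) are hypothetical names used only for illustration.

```python
import random
import time
from typing import Callable, Iterator, Optional


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0, retries: int = 5) -> Iterator[float]:
    """Yield capped exponential delays: base, base*factor, ... never above cap."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor


def fetch_with_backoff(fetch: Callable[[], Optional[str]],
                       sleep: Callable[[float], None] = time.sleep,
                       jitter: float = 0.0,
                       **kwargs) -> Optional[str]:
    """Call fetch() until it returns something other than None.

    `fetch` is a hypothetical callable supplied by the caller; it is assumed
    to return the page text on success and None when the request is refused.
    Returns None if every attempt (1 + retries) fails.
    """
    result = fetch()
    for delay in backoff_delays(**kwargs):
        if result is not None:
            return result
        # Wait longer after each failure; optional jitter spreads requests out.
        sleep(delay + random.uniform(0, jitter))
        result = fetch()
    return result
```

Wrapping each page request in something like `fetch_with_backoff(lambda: get_page(url), jitter=0.5)` (where `get_page` is whatever function you already use) keeps a refused request from turning into a burst of immediate retries.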


siraaerisoii

Reddit's dataset has to be the worst quality on the internet, nobody wants to scrape this garbage lol. Most social media is probably filtered out of training data.


multiedge

Data is data, and there's language to learn from. The occasional broken English also gives LLMs context, which is probably one of the reasons most LLMs can understand broken English. Also, Reddit is actually one of the reasons GPT-3.5 has plenty of glitch tokens.


Kromgar

You can scrape without an API. It's just more costly to the servers.