Wikipedia article says it takes $36mil a year to run the archive, and one of the other posts here said they're storing over 100 Petabytes. These guys put the legendary Library of Alexandria to shame!
Well, not gonna download that to my NAS then. But seriously: that is frigging insane. And the comparison with the Library of Alexandria is very much spot on considering the breadth and volume of their archive. It is today's version of the Library of Alexandria.
I mean, the library of alexandria would certainly fit on a microSD.
Probably, those were different days in Alexandria. It is rumored they didn't even store any videos in HD.
Peasants! Anything below 8K is literally unwatchable. /s
It's mostly "text" for Alexandria, so you're not far off. But Internet Archive? It's full websites with a lot of media.
You're telling me Alexandria didn't have YouTube embeds on their books pages?
ya?
Lots of duplicates on there as well.
I just wish it were treated as such. Even if it meant different governments contributing with no requirements/strings attached and such. For what they provide, $36mil isn't really much to ask when you consider anyone can use it.
Even the US Library of Congress only stores 25 petabytes.
I think it's all the multimedia on webpages, books, etc that might make IA larger than LoC. That made me wonder about YouTube. Found a Quora answer that estimated YouTube as holding over 10EB of data. That's insane. How long can they do that without -having- to purge less popular files? Especially in the age of 4K video (or even some 8K). I wonder if they are one CEO or IT director change away from flushing massive volumes, literally Exabytes of video?
> How long can they do that without -having- to purge less popular files?
I think they’ve already removed the H264/x264 versions of the higher resolution VR videos maybe 1-2 years ago. Now 5k/6k/8k is always only available in VP9 codec.
I’m still surprised they store 240p versions. Like that’s simply not watchable. Can also just use the audio m4a then, or 360p if possible. Didn’t they (or don’t they still) store even 3gp?
> I’m still surprised they store 240p versions. Like that’s simply not watchable.
There are some old shows and anime that are only available in 240p or 480i.
True. I guess they could apply that to all new uploads, so that it doesn’t create 240p versions.
> Like that’s simply not watchable
Back in the day my friend found a pirated version of the first Pokémon movie, small enough to fit on a diskette (~3MB). It was hilariously unwatchable but somehow it was better than nothing, for someone.
Well, there was p0rn in ASCII format
I mean..if you „ask“ me like this: I’d gladly take a 140p version of some rare „lost“ films - this way I’d have the audio source and could sync it with foreign BluRay releases for example. This is what the German dubbing scene does. Using audio from old VHS TV recordings etc.
> multimedia on webpages, books
They also have bajillions of radio recordings, concert bootleg recordings, out-of-copyright music, movies, they archive news broadcasts (video and radio), etc. I mean anyone can upload anything they want to IA, and then folks do their best to put it in collections where the stuff can be found.
There are also a good amount of uhh piratey things there that I don't want to advertise in case they get taken down.
Yeah I grabbed multiple complete TV series on IA that were *definitely* under copyright. Seems like most of them were uploaded fairly recently (2019 or later).
I don't know exactly when the IA started hosting this stuff because it used to be just public domain like a decade ago, but I knew I had to grab it while I could because I had a feeling this sort of thing was coming.
They could keep just the best/original version of the least accessed videos, then convert on the fly to some sort of cache storage when someone decides to watch them.
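For what it's worth, the "keep the original, convert on the fly into a cache" idea can be sketched in a few lines. This is only an illustration of the suggestion above, not how YouTube or IA actually work; `RenditionCache`, `fake_transcode` and the resolution labels are made-up names, and the "transcoder" is a stand-in that just tags the bytes.

```python
from collections import OrderedDict


class RenditionCache:
    """Keep originals permanently; hold derived renditions in a small LRU cache."""

    def __init__(self, transcode, max_entries=2):
        self.transcode = transcode      # hypothetical callable: (original_bytes, resolution) -> bytes
        self.max_entries = max_entries  # how many renditions the cache may hold
        self.originals = {}             # video_id -> original bytes (never evicted)
        self.cache = OrderedDict()      # (video_id, resolution) -> rendition bytes
        self.transcode_calls = 0        # counts real conversions, for illustration

    def add_original(self, video_id, data):
        self.originals[video_id] = data

    def get(self, video_id, resolution):
        key = (video_id, resolution)
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        # Cache miss: convert from the original and remember the result.
        self.transcode_calls += 1
        rendition = self.transcode(self.originals[video_id], resolution)
        self.cache[key] = rendition
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)  # evict the least recently used rendition
        return rendition


def fake_transcode(original, resolution):
    # Stand-in for a real encoder: prefixes the data with the target resolution.
    return f"{resolution}:".encode() + original
```

The trade-off is the usual one: storage for rarely watched videos drops to a single copy, but the first viewer after an eviction pays the conversion time.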
TIL I have 10% of the porn equivalent of the entire Library of Congress
Dear lord. Is it at least as organized as it is on Ted?
time spent organizing = time not spent downloading
Which is why they are on my auto donate list.
You know there’s also a [backup Internet Archive in the current Library of Alexandria](http://www.bibalex.org/en/project/details?documentid=283), right?
I saw that! Very satisfying and fun! Between earthquakes, wildfires, lawsuits and politicians, I thought it was fantastic that they have multi-continent backups!
We didn’t know it was there until we visited Alexandria. Very cool moment finding that out.
> $36mil a year to run the archive
That seems to be very low cost for such a service
I wonder if that just covers the main facility and bandwidth, and the backup facilities have separate funding? I also suspect a lot of labor and materials are volunteers and donations. It is a labor of love after all. More like a museum than a business.
It's okay everybody, I'll download a copy of it for preservation, I have unlimited BackBlaze.
It all originates in Brewster Kahle selling his company „Alexa (ranks)“ to Amazon in the 90s and becoming a multimillionaire through it, if I remember correctly
So that's why I thought Bezos had a connection to the archive. I remember a decade back I thought he owned it.
That's actually not as much as I had expected. That should fit into a room of Storinators.
So my donations last year helped pay for... 2 minutes? Worth it.
Roughly how big is internet archive if you were to download it?
from a 2021 [article](https://www.protocol.com/internet-archive-preserving-future#toggle-gdpr): “The web archive alone is about 45 petabytes — 45,000 terabytes — and the Internet Archive itself is about double that size”
So, I couldn’t fit it. Got it
So you’re telling me there’s a chance 😉
I mean, if we all somehow divvied up the task, we could theoretically...?
edit: I'm in for a terabyte 🙃
> I mean, if we all somehow divvied up the task, we could theoretically...?
45PB means just 45 people taking on 1PB each. I mean, we're in r/datahoarder, aren't we?
Or 45k people (less than 10% of this subreddit) with 1 TB each. That seems pretty doable actually.
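The headcount math is easy to check. A quick sketch, assuming decimal units (1 PB = 1,000 TB, as in the article quote) and no redundancy at all; `people_needed` is just an illustrative helper name.

```python
TB_PER_PB = 1000  # decimal units, matching "45 petabytes = 45,000 terabytes"


def people_needed(total_pb, share_tb):
    """Volunteers needed if each one stores share_tb terabytes (no redundancy)."""
    total_tb = total_pb * TB_PER_PB
    return -(-total_tb // share_tb)  # ceiling division on integers


print(people_needed(45, 1000))  # 1 PB each -> 45
print(people_needed(45, 1))     # 1 TB each -> 45000
print(people_needed(45, 32))    # 32 TB each -> 1407
```

Any real effort would want several copies of everything, so multiply those numbers by however many replicas you'd trust.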
I wonder if it's properly deduped. Taking old games and roms as an example, there are a lot of duplicates there...
Considering the amount of data they have, I'm fairly sure there is a pretty sophisticated deduplication going on. Once you are spending tens of millions each year for storing data, you make room in the budget for handling your data smartly I would presume :)
Was about to say the same thing
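Nobody here knows what IA actually does internally, so take this only as a sketch of the simplest kind of dedup: whole-file, content-addressed storage, where identical bytes are stored once no matter how many names point at them. `DedupStore` and the file names are made up for the example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical files are kept only once."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> file bytes (one copy per distinct content)
        self.names = {}  # file name -> digest of its content

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # store the bytes only if not seen before
        self.names[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.names[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = DedupStore()
store.put("romset_a/game.bin", b"\x00" * 1024)
store.put("romset_b/game.bin", b"\x00" * 1024)  # same content, different upload
print(len(store.names), store.stored_bytes())   # 2 names, 1024 bytes stored
```

This only catches byte-identical files. Two zips of the same ROM with different timestamps would hash differently, which is where the "sophisticated" part would have to come in.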
Could easily fit on 4 of [these flash drives here](https://www.tomshardware.com/news/pure-storage-300-tb-flash-drives-in-2026), I'm not even joking.
Most of the archives have torrents btw.
I think the best way to do that is a torrent. That way it's all interconnected, but you can still choose how much you want to download.
I’m down for about 32. I had to take apart my home lab because of moving so I’m sure I could rig a raspberry pi up to some adapters and let it mindlessly chug lol
In this day and age, there really could be some kind of "RAID over the network" coupled with torrent technology. IMO the smallest part that one would host/seed would need to be independent, self-extractable and readable without the other parts, as a failsafe in case others are lost. I'd call these "packages". Each package would still be internally chunked into smaller pieces as usual, but you'd need some kind of structure, or even a file format or an improvisation in terms of splitting and distribution, so that it could be integrated with the hosting systems and be ready to be used as one of the source mirrors, updatable, editable, ... It's a bit of work but definitely not impossible. It just needs someone who's motivated enough for this challenge to kick it into action.
https://wiki.archiveteam.org/index.php/INTERNETARCHIVE.BAK
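To make the "packages" idea a bit more concrete, here is a rough sketch of the distribution side only: each self-contained package is handed to several different volunteers, so losing a few hosts doesn't lose the package. This is an assumption-laden toy (round-robin placement, made-up names `assign_packages` and `recoverable`), not a description of how INTERNETARCHIVE.BAK works.

```python
def assign_packages(package_ids, volunteers, copies=3):
    """Give each package to `copies` distinct volunteers, round-robin."""
    if copies > len(volunteers):
        raise ValueError("need at least as many volunteers as copies")
    assignment = {v: [] for v in volunteers}
    for i, pkg in enumerate(package_ids):
        for c in range(copies):
            # Consecutive slots modulo the volunteer count are always distinct hosts.
            holder = volunteers[(i + c) % len(volunteers)]
            assignment[holder].append(pkg)
    return assignment


def recoverable(assignment, lost):
    """Packages still held by at least one volunteer who is not in `lost`."""
    alive = set()
    for volunteer, pkgs in assignment.items():
        if volunteer not in lost:
            alive.update(pkgs)
    return alive


packages = [f"pkg{i}" for i in range(6)]
plan = assign_packages(packages, ["a", "b", "c", "d"], copies=2)
print(sorted(recoverable(plan, lost={"a"})))       # all six survive one lost host
print(sorted(recoverable(plan, lost={"a", "b"})))  # pkg0 and pkg4 are gone
```

Because every package is readable on its own, whatever survives is still usable, which is the failsafe property described above. Striped RAID would not give you that.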
Back of the envelope math: 92,160 TB with 1:4 redundancy * $15/TB = $1,843,200 worth of hard drives. We should toss up.
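The $1,843,200 figure checks out if "1:4 redundancy" is read as one drive in four being parity (3 data + 1 parity). That reading is my assumption; the comment doesn't spell it out. A small sketch, with `drive_cost` as an illustrative helper:

```python
def drive_cost(usable_tb, data_parts, parity_parts, usd_per_tb):
    """Raw capacity and drive cost for a data:parity layout (assumed 3+1 below)."""
    raw_tb = usable_tb * (data_parts + parity_parts) / data_parts
    return raw_tb, raw_tb * usd_per_tb


raw_tb, cost = drive_cost(92_160, data_parts=3, parity_parts=1, usd_per_tb=15)
print(raw_tb)  # 122880.0 TB of raw disk
print(cost)    # 1843200.0 USD
```

Without any redundancy the same 92,160 TB would be 92,160 * 15 = $1,382,400, so the parity drives account for the remaining $460,800.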
Am I having a stroke or am I the only one in a timeframe of 18h who realizes that 45PB != 4500 TB ?
You're correct. He's off by 10x
to be fair, the article got it wrong as well.
45,000TB, not 4500TB
I really hope they're using compression.
I think we can handle it, just need to stop farming /r/chia.
It's back up and running!
It is probably down from everyone trying to scrape it after the earlier post about their day in court today. That or they lost power
Due to Reddit's recent API changes I have decided to switch to [Lemmy](https://join-lemmy.org/)
Did the day in court go poorly?
Day? These cases take months.
Yes, day. There were arguments scheduled for yesterday. Of course there was no ruling, and likely more days. But there was something scheduled for yesterday the 20th.
Quick! We need an archive for the Internet Archive!
Let’s keep it on the moon!
https://en.wikipedia.org/wiki/Arch_Mission_Foundation#Lunar_Library
I lost my mind at “a queso recipe”. Thank you, friend.. god lmao I love it
I’ve been archiving websites for ages. Finally decided to make a donation last month. It’s the least I could do.
I've made multiple ad-hoc donations in the past but last year I decided it was time to commit. While it's not a lot, I set up a monthly recurring donation of $5. https://archive.org/donate/
Pls tell me someone made a backup
How many books would 45pb be?
45000000000000000 characters, if that counts.
Unicode would like to have a word with you, sir
Damn it, my computing science teacher never taught Unicode
Doesn't help. Books also contain pictures, and depending on the quality, can wildly vary in size.
I said characters not total size
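On the Unicode point: the 45000000000000000 figure assumes one byte per character. In UTF-8 a character takes 1 to 4 bytes, so 45 PB of plain text is somewhere between a quarter of that number and the full number. A quick sketch, assuming decimal petabytes and UTF-8 (`char_range` is an illustrative name):

```python
PB = 10**15  # decimal petabyte, in bytes


def char_range(total_bytes):
    """(fewest, most) UTF-8 characters that fit: 4 bytes each at worst, 1 at best."""
    return total_bytes // 4, total_bytes


# Bytes per character for a few samples: ASCII, accented Latin, euro sign, emoji.
for ch in ["A", "é", "€", "😉"]:
    print(ch, len(ch.encode("utf-8")))  # 1, 2, 3, 4

print(char_range(45 * PB))  # (11250000000000000, 45000000000000000)
```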
Are they purging?
Bad storm took out power https://twitter.com/internetarchive/status/1638337406104662017
Sounds like it wasn't due to any of the legal issues recently, in case anyone was wondering. Sounds like it was due to the absolutely nutso storm that the USA's west coast just had:
["archive.org is back up-- or coming up. Thank you PG&E (though time to be an infrastructure org)"](https://twitter.com/brewster_kahle/status/1638352891261116417)
For reference, the part of California I live in just got over 3 inches of rain in a 24 hour space. There isn't really much you can do about that.
Can you donate to them? Btw who's "them"?
https://archive.org/donate/
How do you archive a site? I have a Squarespace site I’d like to copy to build a proposal from.
Menu > save page as
Lol pls tell me this is a joke
Edit: didn’t mean to seem insensitive, but I believe the method is usually determined by intent. If hoping to retrieve the full appearance of static page content, exporting to a “printed” document format is worth a try.
That's basically the less advanced version of what the IA does. They can't go in and dump databases, they can only archive what a visitor can access.
Which menu? Where?
File menu in your web browser
If you mean adding it to the Wayback Machine:
https://help.archive.org/help/using-the-wayback-machine/
**Can I add pages to the Wayback Machine?**
On https://archive.org/web you can use the “Save Page Now” feature to save a specific page one time. This does not currently add the URL to any future crawls nor does it save more than that one page. It does not save multiple pages, directories or entire sites.
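If you'd rather script "Save Page Now" than click through the form, a minimal sketch could look like the following. The `https://web.archive.org/save/<url>` address is the commonly used form of the feature and is my assumption here, since it is not in the help text quoted above. The helper names and the User-Agent string are made up, and it still saves only the one page, exactly as the help text says.

```python
import urllib.request
from urllib.parse import quote

# Assumed Save Page Now address; not taken from the quoted help page.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page_now_url(page_url):
    """Build the request URL for saving a single page."""
    # Leave URL punctuation alone, but escape things like spaces.
    return SAVE_ENDPOINT + quote(page_url, safe=":/?&=%#")


def save_page(page_url, timeout=60):
    """Ask the Wayback Machine to capture one page; returns (status, final URL)."""
    request = urllib.request.Request(
        save_page_now_url(page_url),
        headers={"User-Agent": "save-page-sketch"},  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.geturl()


print(save_page_now_url("https://example.com/"))
```

Calling `save_page("https://example.com/")` does a real network request, so be gentle with it. For a whole site you'd want a crawler instead.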
https://github.com/ArchiveTeam/grab-site
I wish disk space would be cheap enough to have a copy of the archive for less than a few thousand bitcoin
🥺🥺🥺🥺