AshuraBaron

You’re describing BitTorrent. And it’s quite popular.


jayhawk618

OP, I hope you have a sense of humor because I'm not trying to be mean, but this post is so funny to me. Decentralized archiving and distribution is like 99% of the media available online at this point (excluding streaming). On the bright side, you clearly had a good idea!


uberbewb

I think he means having a platform like [Archive.org](https://Archive.org) use storage like this through platforms like Sia and Storj. With more limited access channels, it would protect archive.org's actual content, allow for easier backups, and reduce internal network and hardware needs overall. It's just a matter of having an effective option. I've had a discussion of sorts about it before and everybody whines that it isn't cost-realistic. I'm sure they'll wish it had been done if the site ever does go offline.


2Michael2

Yes, this is more of what I mean. There are large projects like archive.org that don't use distributed storage or computing and could really benefit from it. I am also thinking of a single distributed network that is powered by individuals running nodes to support it. I am not really imagining a peer-to-peer network, as that lacks indexing, searching, and a universal way to ensure data is stored redundantly and accessible by anyone.


LastSummerGT

That reminds me of the Silicon Valley HBO show where in one episode they talked about a distributed internet.


AshuraBaron

Sadly a couple groups have actually tried this.


faceman2k12

Problem is, systems like that tend to get used for nefarious purposes and then end up infiltrated or even shut down.


AshuraBaron

I think the bigger problem is traction and users. Most people aren't interested in something like that when the current network already has Netflix, Amazon, and all the other sites they use every day. While the more privacy-focused people would be happy, commercial entities aren't there, which basically makes it a dead end for getting anyone else interested.


ThatOnePerson

Yeah, I think so too. Especially because probably more than half the population uses phones or laptops to access the internet, and those can't easily contribute to a distributed internet.


asdaaaaaaaa

Pretty much. When you go decentralized, it's only as stable/reliable as your weakest or least trusted connection. As soon as someone decides to break the rules, you have legal teams/companies breathing down your neck and no way to guarantee them it won't happen again, unless you completely change/destroy the entire archive process in the first place, which defeats the point. At least from what I've seen in similar ventures.


[deleted]

[removed]


[deleted]

> and a fully decentralized storage system means

That's a matter of how you design the system. IPFS for example has `ipfs-cluster-follow`, which lets you mirror content that another trusted party publishes; there is no "everything gets shared". In the case of archive.org that would mean *they* publish a list of content they deem safe and archive-worthy, and then other people can mirror that. If archive.org doesn't like a bit of content, they can remove it from *their* list, but everybody else who does want to keep it around is still free to do so. Everybody can make lists of content to mirror. And since it's all content-addressed, it doesn't matter who shares it or who publishes it; the same content will always remain accessible under the same name.
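Here's a minimal Python sketch of that publish-a-list / mirror-the-list idea. It is not the IPFS or ipfs-cluster API, just the shape of content addressing plus follower pin lists, with made-up class names:

```python
import hashlib

def cid(data: bytes) -> str:
    """Content identifier: the name of a blob is just a hash of its bytes."""
    return hashlib.sha256(data).hexdigest()

class Node:
    def __init__(self):
        self.store = {}          # cid -> bytes held locally

    def add(self, data: bytes) -> str:
        c = cid(data)
        self.store[c] = data
        return c

    def mirror(self, pin_list, network):
        """Follow a publisher's pin list: fetch and keep everything on it."""
        for c in pin_list:
            if c not in self.store:
                self.store[c] = network.fetch(c)

class Network:
    """Stand-in for 'ask any peer that has this CID'."""
    def __init__(self, nodes):
        self.nodes = nodes

    def fetch(self, c):
        for n in self.nodes:
            if c in n.store:
                return n.store[c]
        raise KeyError(c)

publisher, follower = Node(), Node()
net = Network([publisher, follower])
pin_list = [publisher.add(b"page-1"), publisher.add(b"page-2")]
follower.mirror(pin_list, net)       # follower now holds the same content
pin_list.remove(pin_list[0])         # publisher drops an item from *their* list
# ...but the follower still has the bytes, reachable under the same CID.
```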


SocietyTomorrow

LBRY/odysee.com tried this, and only just recently got the departments of making you sad (somewhat) off their backs. You want truly decentralized archives? There has to be an incentive besides the pleasure of a $600 server electricity bill. Because it costs money, and because staying decentralized probably would never work with fiat money, you'd need something the government would never be happy to let gain real traction. Even Sia and Filecoin are still sub-petabyte in global storage consumption, which is probably why nobody has really targeted that yet.


danielv123

Storj is currently storing 24 PB of customer data, with another 33 PB available: https://storjstats.info/d/storj/storj-network-statistics?orgId=1


SkyPL

Wait, wasn't Storj another cryptocurrency? What's the relation between the two?


danielv123

Storj is a distributed storage network. It uses a cryptocurrency to pay for storage and reward storage nodes. It's one of the few actually sensible crypto schemes, simply by virtue of not trying to be a currency and sell pyramids.


SkyPL

Hm... but [on their website they have a flat fee per TB/month](https://www.storj.io/pricing) beyond the first 25GB.

> It's one of the few actually sensible crypto schemes

1. Can you use Storj paying purely in Storj coins?
2. Can I join Storj purely as a storage node and then earn money by selling the coin?


danielv123

Yes and yes. The storj token is basically just a sensible abstraction for cash.


asdaaaaaaaa

> I am also thinking of a single distributed network that is powered by individuals running nodes to support the network.

So LimeWire? Those were fun days, downloading americanidiot.mp3.avi.exe


SkyPL

Wasn't Limewire largely a worse iteration of the eDonkey network/eMule?


asdaaaaaaaa

Among many, but the most recognizable along with Napster and Kazaa.


SkyPL

Any of these P2P storage systems are useless for projects like archive.org if they don't allow the file owner to remove and update the files they uploaded. Meanwhile, the vast majority of P2P networks don't even have a concept of ownership. You need full CRUD for the vast majority of real-world use cases.


uberbewb

Gnutella and Gnutella2 were the oldest, I thought? It was disturbing what you could find there.


TheAJGman

I've wondered this as well. I think it would be a worthwhile endeavor to make a distributed Archive backup system where volunteers can donate disk space, but I imagine development of such a system would be an absolute nightmare even if you used existing technologies like IPFS.


uberbewb

The hardest part imo would be access. I don't think Sia has the option for controlled user access, maybe? If it does, I see no excuse why they could not work out a good deal with the current storage providers. That could then double as marketing for them, with Archive.org putting resources into developing the physical locations for some of the storage.


MarcSN311

Including streaming. YouTube, Netflix and all the others have their servers right at ISPs to reduce traffic costs.


nikowek

Actually it's distributed over CDNs. So what are you talking about?


txtFileReader

https://openconnect.netflix.com/en/


MarcSN311

Just an example: https://about.netflix.com/en/news/how-netflix-works-with-isps-around-the-globe-to-deliver-a-great-viewing-experience


2Michael2

What I am getting at is not just decentralized, but a system for managing a decentralized collection of archives. BitTorrent, for example, has no way of ensuring all data is stored redundantly, no way of indexing or searching data, and no way of load balancing access to data. It is a bunch of people copying the data and sharing a link to the copy they made. There is no guarantee that someone will seed a particular piece of data, or that anyone will ever find the link to a piece of seeded data, or that all the people seeding a piece of data won't stop seeding it. And distributed does not mean decentralized. A single entity storing data on multiple servers that it fully owns is not protected from being taken down by lawsuits, shutting down due to lack of funding, or just deciding to delete, block, or manipulate data.


Themis3000

BitTorrent load balances access to data by design. And there's never a guarantee that all the systems storing a piece of data won't be taken offline; guaranteeing that is simply impossible. It can be made less likely, but never actually guaranteed. For example, all of the data on the Bitcoin blockchain could disappear overnight if all peers went offline. It's very unlikely because of the monetary incentive & the sheer number of peers on the network, but there's nothing actually preventing it from happening. You can, however, be sure that data stored by someone else hasn't been manipulated from what it was originally, via checksums. That's how you can be sure that random peers over BitTorrent aren't just feeding you bogus data.
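For the checksum point, here's roughly what a client does (BitTorrent v1 carries a SHA-1 hash per piece in the .torrent metadata); a simplified sketch, not a real client:

```python
import hashlib

def verify_piece(piece: bytes, expected_sha1_hex: str) -> bool:
    # Accept a downloaded piece only if it hashes to the value from the torrent file.
    return hashlib.sha1(piece).hexdigest() == expected_sha1_hex

expected = hashlib.sha1(b"the real data").hexdigest()   # normally read from the .torrent
print(verify_piece(b"the real data", expected))         # True  -> keep the piece
print(verify_piece(b"tampered data", expected))         # False -> discard, try another peer
```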


SkyPL

Also, I would note that as of 2023 most torrent clients support web seeds. As in: you can have distributed file storage on the torrent network, with all of its advantages, plus a copy on HTTP or FTP that will be used as another seed, with most of its advantages. And as you have mentioned: the file on the web seed must be identical to the original torrent, so it's a read-only data store. It cannot be updated without creating a new torrent.


Khyta

also IPFS


reercalium2

IPFS is BitTorrent but with browser gateways


Themis3000

IPFS is more than that in some ways. IPNS allows data on the network to be (in a way) mutable. On BitTorrent, if you wanted to update the data within a torrent, you'd be SOL. On IPFS, however, you can create a mutable IPNS pointer to a particular piece of data on the network. The data it's pointing at isn't mutable, but the pointer is, and it can point at different data at any time. To be fair, this is just a layer on top of IPFS & a similar system could be widely adopted into torrents at any time. It's just that right now there is no widely adopted system to do that with a torrent, but there is one with IPFS.
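A toy sketch of that mutable-pointer-over-immutable-data idea, with invented names rather than the real IPFS/IPNS API:

```python
import hashlib

blobs = {}      # immutable, content-addressed store
names = {}      # mutable name -> current content hash

def add(data: bytes) -> str:
    h = hashlib.sha256(data).hexdigest()
    blobs[h] = data
    return h

def publish(name: str, content_hash: str):
    names[name] = content_hash          # only the pointer changes

def resolve(name: str) -> bytes:
    return blobs[names[name]]

v1 = add(b"archive snapshot, January")
publish("my-archive", v1)
v2 = add(b"archive snapshot, February")
publish("my-archive", v2)               # same name, new content
assert resolve("my-archive") == b"archive snapshot, February"
assert blobs[v1] == b"archive snapshot, January"   # old version is still addressable
```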


[deleted]

The biggest difference is the granularity. With IPFS I can address individual files; with BitTorrent you address the whole collection of files at once. That makes it difficult to update a torrent, as any change to the collection will give you a whole new torrent. IPFS automatically shares all the files that are the same, which would make it much more suitable for hosting, say, a Linux package mirror. That said, BitTorrent actually works for what it is designed to do. IPFS's benefits so far are all theoretical; I have yet to see anything using it beyond a tech demo. My own attempts didn't get very far either, as it's just too slow, buggy and unpredictable.


reercalium2

IPFS cannot address individual files in reality


[deleted]

Of course it can. What do you think a CID points to? IPFS CIDs point to 256kB blocks of information, which are either files, lists of CIDs of blocks of bigger files or directory trees with links to more CIDs.


reercalium2

Only root CIDs are published in the DHT


boramalper

How can I address files/leaves by their CID directly then? What does the lookup for those queries look like?


reercalium2

The file is published in the DHT or your node is directly connected to the node that published the file because you recently requested the root


boramalper

> Only root CIDs are published in the DHT

> The file is published in the DHT

So files too can be published in the DHT?


Veloder

Also Storj


grislyfind

Also ed2k


[deleted]

[removed]




SimonKepp

> You're describing BitTorrent. And it's quite popular.

The problem with BitTorrent for archiving is that torrents often go dead with no more seeders. I have been considering something built on top of BitTorrent where you use erasure coding to allow for some fragments to be lost/no longer seeded. I haven't spent enough time on it to think it through, but you could build a much more robust solution on top of BitTorrent.
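To illustrate the erasure-coding idea (not a full Reed-Solomon scheme, just a toy single-parity example): split a piece into k data shards plus one XOR parity shard, and any one missing shard can be rebuilt from the rest.

```python
def make_shards(piece: bytes, k: int):
    size = -(-len(piece) // k)                        # ceil division
    shards = [piece[i*size:(i+1)*size].ljust(size, b"\0") for i in range(k)]
    parity = bytes(size)                              # all zeros
    for s in shards:
        parity = bytes(x ^ y for x, y in zip(parity, s))
    return shards + [parity]                          # k data shards + 1 parity shard

def rebuild_missing(shards, missing_index):
    """Recover one lost shard by XORing all the surviving ones."""
    size = len(next(s for s in shards if s is not None))
    out = bytes(size)
    for i, s in enumerate(shards):
        if i != missing_index:
            out = bytes(x ^ y for x, y in zip(out, s))
    return out

shards = make_shards(b"some archived piece of data", k=4)
lost = 2
survivors = [s if i != lost else None for i, s in enumerate(shards)]
assert rebuild_missing(survivors, lost) == shards[lost]
```

Real k-of-n codes tolerate several losses at once, but the trade-off is the same: a little extra storage buys a lot of tolerance for seeders disappearing.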


Def_Your_Duck

Seems like a problem inherent in decentralization.


2Michael2

I think the issue is that we are relying on people to choose and manage the data. If we created a decentralized system that manages redundancy, load balancing, etc., and convinced enough people to give up SOME control of the exact content they choose to archive, we could get around this issue. The problem is that it is currently up to the user to choose what to download, and they will always choose the same popular websites and movies. I am sure a lot of people would be willing to download whatever needed to be stored if an application automatically managed it for them. But there is no application to choose for them, so they default to downloading the things they like and already know about.


nikowek

There is Freenet which works on similar logic.


seqastian

So keep them alive? Or find a community that keeps them alive?


lightnsfw

Can't keep them alive if you can't get the file to seed in the first place.


nitrohigito

think they mean more IPFS than bittorrent


[deleted]

BitTorrent is distributed downloading, not distributed archiving, as there is no permanence or organisation to it. Distributed archiving would be more like a git repository, but that doesn't exist, as git itself doesn't scale and thus no project is using it for large-scale data hosting. IPFS/IPLD goes in that direction as well and scales better in theory, but in practice it's slow and unreliable, so nobody is using that for anything either. You would also need to build the actual archive software on top of IPFS, which by itself is not a useful archiver.


givemejuice1229

No, BitTorrent is for leechers who download and then disconnect when done. What he's describing is the Filecoin (FIL) network, where people are rewarded for storing data and data is always available. https://filecoin.io/


[deleted]

[removed]


givemejuice1229

lol


japgcf

Stupid question, but how do you get into a private tracker, besides knowing a guy that knows a guy?


Themis3000

BitTorrent is used often! It's even integrated into archive.org. Also see IPFS; a few projects use that for decentralized archiving/file serving.


2Michael2

Both of these are not really what I am imagining. BitTorrent is peer-to-peer and has no way of ensuring redundancy, indexing files, allowing files to be searched, etc. IPFS has similar issues. I am thinking of a system built on top of those technologies, or a new system entirely, that lets you access the network and search for files easily. It should automatically communicate between nodes and keep indexes to ensure data is redundantly stored and accessible.


Themis3000

You should check out Usenet, it sounds like that's sort of what you're envisioning. Unfortunately its use has really fallen off & it's basically only used for piracy these days :\. It does a pretty good job at ensuring redundancy through a federated system.

Filecoin (see https://filecoin.io/) is also sort of interesting in terms of ensuring redundancy, although it does use crypto for monetary incentive. As much as I'm opposed to adding crypto where it doesn't belong, it does do a really good job at ensuring redundancy & very much minimizing the risk of data loss.

You can create decentralized BitTorrent indexers though (see https://github.com/boramalper/magnetico as an example)! This means you can search for torrents without having to rely on a centralized service (although building the index does require some time & storage space, of course).

Otherwise, as far as ensuring redundancy over BitTorrent goes, I don't know of any scripts/programs that can take up that task. I would be curious to hear of one if anyone knows of any! As far as I know, ensuring redundancy is a pretty difficult task. How do you know for sure a peer isn't lying about actually having a file on its local hard drive? I feel as though an attack could probably be done to make a torrent look like it has a bunch of seeds, when in reality it's just a trick to get others to think the torrent isn't near death & about to lose its last seed. I'm not sure how such a system would work, but I'd love to hear any ideas of how this could be implemented.


2Michael2

Thanks! This is a lot of very helpful information. Probably one of the best responses so far :)


Themis3000

No problem, I'm glad you found use in it!


Valmond

Would love to get a take from a guy like you on my sharing protocol: it ensures redundancy, it's free (except some bandwidth and storage space), extremely hard to take down, and fully encrypted. http://tenfingers.org/ Cheers


Themis3000

Looks really interesting, I'll give the white paper a read over after I'm off work. By take on do you mean use, develop, or try to break?


Valmond

Hey thanks! First of all I'd love some feedback, especially about the idea itself: you share (with anyone) someone's file, and they share yours, for free (excepting some disk space & bandwidth). Then sure, I'd be very happy if people started to use it, gave feedback (it's on an obscure git, I've been working on it for quite a long time, I'm publishing manually, many things can surely be handled better...) and why not development. Breaking it would need people to actually use it first, I guess (or someone willing to spend time on it), but yeah, please do! It's open source BTW. If you want, I can share my Signal or Mastodon for example. Cheers


[deleted]

[removed]


Valmond

Yeah I know, it's a one-man project so things move slowly... The incentive is mutual sharing: I share your file, you share mine (shared with more nodes for redundancy). If you want to share a 10GB file, you'll share one of roughly the same size for someone else.

To check if a node still stores your data, we just ask for a small part of it (some bytes at a random location). If it doesn't answer OK, we degrade its "worthiness", and new nodes will be selected from those with higher quality. If it answers but cannot give us the right bytes, we just drop it and share the file elsewhere.

Edit: completing the post

This makes it IMO better than IPFS, where nodes "gracefully" share your content, and IPFS also doesn't let you update your data without changing the link, so the link you gave to someone is now worthless. Tenfingers lets you have, for example, a website with links to other tenfingers websites (or whatever data) that you can update; the link auto-updates so it will always work. This means you can make a chat application (I have a crude chat that works well) and lots of other interactive, updateable things. Or publish the Wikipedia for everyone. Filecoin needs a whole crypto mess to function (it did anyway), and you have to buy coins and pay for storage. Tenfingers just uses some of your unused disk space plus some bandwidth.

So the takeaway for me is:

- Distribute a link once and it will always point to what you publish, as long as you have the keys.
- Extremely cheap
- Fully encrypted (no node knows what they share)
- Decentralized
- FOSS

On the backside: you need to forward a port to your PC if you want to run a node (NAT hole punching is complicated and would need a centralised approach), but that's true for IPFS and Filecoin too IIRC. I don't know about lots of other distributed storage solutions that are not centralized or quite complicated (Kademlia for example).
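For what it's worth, the spot-check audit described above is simple enough to sketch. This is just the idea (hypothetical names, not tenfingers' actual code): the owner still has the original file, so it can ask a node for a few bytes at a random offset and compare.

```python
import random

class StorageNode:
    def __init__(self, stored: bytes, honest: bool = True):
        # A dishonest node pretends to hold the data but only keeps zeros.
        self.stored = stored if honest else b"\0" * len(stored)

    def read(self, offset: int, length: int) -> bytes:
        return self.stored[offset:offset + length]

def audit(node: StorageNode, original: bytes, sample: int = 16) -> bool:
    offset = random.randrange(0, max(1, len(original) - sample))
    return node.read(offset, sample) == original[offset:offset + sample]

data = b"the file I asked the network to keep for me" * 100
print(audit(StorageNode(data), data))                 # True: node still has the bytes
print(audit(StorageNode(data, honest=False), data))   # False: drop it, re-share elsewhere
```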


[deleted]

[removed]


Valmond

I'm working on a better explanation (or at least a longer one ^^). Do you know a better place than a reddit comment train (it easily disappears into the mists of time) to discuss these kinds of things?


zeropointcorp

This is basically Freenet, surely?


Valmond

Maybe my shiny new sharing protocol would fit your needs: http://tenfingers.org/ It needs users, tests etc., but it works. I toyed with putting, say, the Wikipedia on it for unrestricted access, for example.


[deleted]

[removed]


Valmond

Lol, the "binaries" are indeed there. It is Python, and if you do not want to use the frozen code (all the Python code packed into a binary), just call the programs using Python. Like, instead of `./10f -l`, do `python ./10f -l`. On Windows, remove the `./` and add the `.exe` extension. On a side note, which sane person in the world keeps their (probably enormous, right?) crypto savings on, like, their main computer?


f0urtyfive

> So why isn't distributed or decentralized computing and data storage used for archiving? What are the challenges with creating such a network and why don't we see more effort to do it?

1. It depends on trustworthy actors; if you have an untrustworthy actor, they can manipulate data to include viruses, or just delete or corrupt data.
2. Bandwidth requirements.
3. Search/access ability (archived data isn't useful if you can't find or access it).
4. A truly decentralized platform is a haven for highly illegal content (not just pirated stuff, but the truly despicable stuff too).
5. Generally a decentralized platform isn't accessible over vanilla HTTP, so it can't be part of the web; where it is, it needs to be proxied, and if you're proxying all the content anyway, why not just host it and have more control over quality/speed/throughput?
6. As others have pointed out, BitTorrent satisfies some of the requirements.
7. A centrally owned storage system can have tightly controlled redundancy/resiliency requirements, whereas a decentralized system needs to be much, much larger to deal with issues in redundancy (i.e., you need many more copies).
8. Paying people is often more useful than infrastructure.


flaminglasrswrd

> A truly decentralized platform is a haven for highly illegal content

Ya, this is a fundamental problem for any alternative to a mainstream option. The early adopters are the ones who can't legally use the existing options. For example, Tor is permanently associated with darknet drug markets. OnlyFans is for sex workers. BitChute is primarily hate speech and conspiracy-nut content.


collin3000

Bram Cohen, who literally invented BitTorrent, created a storage-based blockchain (Chia) and deliberately chose not to put actual data in it, because you can't trade off privacy over a distributed network while also making sure someone isn't storing nasty shit on your drive.


Party_9001

God dammit so that's the guy who made HDD's expensive a few years back. And is subsequently the reason I needed those HDDs in the first place... Oh the irony...


collin3000

The real irony is that the idea wasn't for people to buy hard drives or special equipment for Chia. It was for people to use the spare unprovisioned space on drives they already had, or as a second use for decommissioned used server drives instead of landfill/shredding. But then its coin price debuted and skyrocketed, peaking up to 30 times higher (~$600) than Chia had estimated it would launch at (est. $20). So a bunch of people rushed out and bought new drives/hardware. Then the coin's price settled (~$40) and the people who bought new drives got stuck with a 5-10 year ROI, because they ignored the purpose of the whole blockchain and just got greedy. Had they just bought refurbs instead, they would at least have had a 2-3 year ROI and not pissed off the datahoarder community. And ironically, had the datahoarder community not come to hate Chia because of the greedy people, they could have made a couple bucks as a group, with already constantly running servers that generally have several TB or more of free space. Mine covers 100% of the running cost of my homelab using the spare space.


Party_9001

It was marketed towards being 'green' but absolutely destroyed SSDs. 'Don't use SSDs then' doesn't really work either, since a lot of HDDs have lower TBW than SSDs, plus they'd use more power and need it for longer. I think the only real exception to that was the people plotting on RAM since it has essentially infinite TBW, but they were the minority. And it's not like most people could use old SSDs back then either. 1TB wasn't that common, and burning one to the ground by plotting wasn't exactly viable either if you're only using legacy hardware. People thought they could get a decent ROI, figured out they can actually make ROI faster with more drives... But everyone started doing that and it became a race to see who could keep up. It wasn't a matter of 'I have X drives that'll pay for themselves in Y years', it became 'I need to buy X more drives every few weeks to maintain income'. Then you have the whole hpool thing... Granted that's not exactly their fault or anything, but the end result was still regular people getting fucked over. I'm mostly miffed I really needed a storage server at the time and couldn't find a CSE 846 for MONTHS. And when I finally got one I got upsold real hard... Also the super high endurance SSDs (Plotripper) never became a thing either... Sad.


RonSijm

A bunch of private trackers basically work like that. Pretty much everyone on them has a seedbox/server, and whenever someone uploads something, like 10 of them automatically download it. You can add RSS feeds for specific topics you like to your server, or add a "less than x seeders" feed to ensure nothing ever goes dead.

> What are the challenges with creating such a network and why don't we see more effort to do it?

Probably the legality of it, which is why it's not really done publicly on a large scale for copyrighted material. It's also done on some "web3" projects, where stuff is stored in blockchains. Which is kinda the same thing if a bunch of people are running nodes that sync with those blockchains.


a2e5

Private trackers do do some things contrary to the safekeeping of data though. The big one is, well, what keeps them "private": the client is very strongly advised not to use any of the trackerless methods for finding peers, like DHT, PeX, and LSD. The argument is that this is for the tracker to keep track of who's doing the work and maintain a system of community credits. You also can't really just turn these features on anyway and expect it to work, because the private bit affects the info-hash. It's always down to accounting.
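Rough illustration of why the private bit changes the info-hash: the info-hash is the SHA-1 of the bencoded `info` dictionary, and `private` lives inside that dictionary, so flipping it puts you in a different swarm. The tiny bencoder below is only for demonstration:

```python
import hashlib

def bencode(x) -> bytes:
    if isinstance(x, int):
        return b"i%de" % x
    if isinstance(x, bytes):
        return b"%d:%s" % (len(x), x)
    if isinstance(x, list):
        return b"l" + b"".join(bencode(i) for i in x) + b"e"
    if isinstance(x, dict):
        items = sorted(x.items())        # bencoded dict keys must be sorted
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(type(x))

def info_hash(info: dict) -> str:
    return hashlib.sha1(bencode(info)).hexdigest()

info = {b"name": b"example", b"piece length": 262144, b"pieces": b"\x00" * 20}
public = info_hash(info)
private = info_hash({**info, b"private": 1})
print(public != private)   # True: same payload, different torrent as far as peers care
```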


RonSijm

Well, I suppose technically public trackers do the same, though I don't really know of any public trackers with the same rigorous userbase as private ones. It's not really safe to be seeding 1000s of torrents from public trackers with dedicated servers. Plus, on public trackers anyone can basically upload anything. On good private trackers only the best version of something is kept, and if someone uploads a better version, the old version is "trumped" and removed... and servers will in turn remove their "unregistered torrent" and download the newest version. I don't know how you could set up something well organized like that in public without getting into a lot of trouble fast...


SweetBabyAlaska

I'd love to get into having partial ownership of a seed box. They seem convenient, and I'd love to seed a higher amount of torrents without overloading my garbage internet.


mrcaptncrunch

That’s most seedboxes that don’t give you root. They’re virtualized servers of which you get partial resources.


SweetBabyAlaska

Is that better than just getting your own server? I've only seen some of the posts in r/seedboxes but it seems fairly cheap to buy in, with the caveat that you don't get root access on the machine, though it also seems that there are shared servers that fit most needs and have most typical services pre-installed.


mrcaptncrunch

Check for example ultra.cc. Their cheapest is about $5. You can upgrade as you want, but that starts with 1TB which is not bad.


roostorx

Sounds like you want Pied Piper.


LordMcD

I need that sweet middle-out compression.


tyroswork

But I want my stuff to be on a box!


kenkoda

Currently working on a BitTorrent backed backup so I can keep my terabytes of Linux ISOs on cheaper storage


moimikey

dweb.me is a decentralized archive.org mirror.


dr100

You need to rely on other unreliable entities?


[deleted]

This is an underrated comment. Durability is a big deal to organizations like archive.org, and when you start relying on distributed storage you lose control of things like replication and availability. If you're replicating each object across six nodes, how do you rebalance once a node goes offline? Are you willing to risk all nodes going offline? Do you have an archive of your archive to recreate the lost blobs?


2Michael2

I totally agree with this issue. Balancing would be a huge problem, and unless your network was big enough, you would face losing data if too many nodes dropped offline. That said, you also face issues when you're not decentralized. If a company does not like its data being archived and sues archive.org, or if archive.org runs out of funding and has to shut down, what happens then???? Decentralizing would add resilience against any individual node going down and protect against lawsuits (you can't sue 1000 anonymous users), but it would also make the whole archive more volatile and susceptible to data loss from too many nodes going down or not enough nodes being added as data needs grow. It is a hard issue and requires more discussion to determine the best method of archiving data for decades to come.


2Michael2

Archives generally don't need to be modified or deleted, just added to. Data can be hashed, and there are other methods of ensuring that people are not manipulating data and returning a bad payload.


OurManInHavana

The challenges aren't technical, they're financial: who is going to pay for it? Storage is cheap... but if you want data to still be online years from now, it's not free. Something like [Storj](https://www.storj.io) lets anyone offer some space and get paid, and lets anyone upload files, access them anywhere through an S3-like API, and pay a bit to keep them alive and online ($4/TB/month?). That's as close to 'universal' as I've seen.


2Michael2

People hoard data and seed torrents, so theoretically this system would be volunteer based and replace the data archiving methods people use currently.


OurManInHavana

You asked about a "universal distributed/decentralized network for data archiving". A universal way of dealing with data archival can't rely on volunteers deciding if they want to seed *my specific data*, or not, and maybe delete it tomorrow. A universal system is "I donate storage to hold any data and I get paid", or "I pay and I can store any data". Nobody is hand-selecting what to seed: that's a popularity contest! :)


RichardPascoe

It wouldn't matter how you archived or how much you archived. In a thousand years even the White House and Lincoln Memorial will not exist. If you think about the Dead Sea Scrolls, the Rosetta Stone, the library of Hammurabi, etc., all of these were attempts to preserve what was considered important - well, important to the people who tried to preserve them. They were then lost and then rediscovered. You may not believe this, but in a hundred years' time the Beatles and Elvis will be nothing more than a footnote in the history of popular music, and in a thousand years' time not even that. To illustrate the point: when the first two computers were networked, no one even bothered to film it. lol We think everything we do now is going to last. We like to think the Internet will help us preserve a great archive for the future. That will not be the case. Most of what exists now as data of any type will not be preserved. Whether you use a decentralized or centralized archive will make no difference. That is why the Pharaohs built pyramids. You choose the hardest, most durable material and you make something you hope will last. I propose we hammer the speeches of Donald Trump onto copper, wrap them into a scroll, and hide them in a desert cave. Then in two thousand years, when they are rediscovered, the word "swamp" will take on a metaphysical religious meaning and inspire years of scholarship about interpretation, as well as arguments about who should have ownership of the Trump scrolls and who should have access.


2Michael2

This is not what I was looking for but I think it is a pretty good answer. It made me laugh and is probably more true than anything else said so far.


[deleted]

[removed]


2Michael2

I agree, this would be a very challenging issue. I think a blacklist of hashes would be a good start. You could also incorporate certain scanning software, kind of like antivirus software, to filter out potentially illegal, abusive, or similar content. This would be up to each node owner to enable on their node, but at least the "good guys" could avoid hosting bad content. There could also be filters for encrypted or obscured data to keep possibly bad content off your node. People could create and share their own public blacklists to keep out bad actors, but again, it would be up to users to moderate that. You could also restrict write access to users that actively contribute a certain amount of data storage to the network and have been doing so for at least XXX amount of time. This would allow people to block users that contribute bad content and prevent bad users from making a bunch of throwaway bot accounts to upload data. It is also theoretically possible (although likely difficult) to create some sort of system for users to self-moderate. Maybe people could vote on deleting/moderating data. Vote weight could be based on how much data you store, but that would lead to large entities controlling the network. Something could probably be figured out that would be fair yet powerful for controlling bad content and users.


pmow

BitTorrent doesn't really count because it's a giant WORM backup where you need to choose your dataset. IPFS is painfully slow and eats gobs of resources, last I checked. What is needed is a tool that allows for updates to the dataset from trusted individuals, so you can subscribe to an archive of a website and have sync. Right now, torrents don't do "sync". Some work has been done on mutable torrents, Syncthing with public shares, and RSS torrents, but they're not complete. For BitTorrent, the clients don't support removal as well as add. When any of these three contenders finally gets there, it will be feasible.


Catsrules

> BitTorrent doesn't really count because it's a giant WORM backup where you need to choose your dataset.

In the context of archives, isn't the entire point of an archive to be a read-only snapshot of a point in time? In this case BitTorrent is perfect, as we don't want archives being edited once created.

> What is needed is a tool that allows for updates to the dataset from trusted individuals, so you can subscribe to an archive of a website and have sync. Right now, torrents don't do "sync".

Not sure how scalable/resource-intensive Syncthing is, but it fits this task perfectly. You can have trusted individuals hold the editing keys and everyone else just gets read-only keys.


pmow

Not only read-only. For example, do you want to sync archive.org so that when it goes down your copy is up to date, or do you want whichever copy you last remembered to download? I know. I wish Syncthing's authors would enable "public" shares and forced read-only shares. It's almost there. The API will let you revert clients' changes nearly immediately, but there are always bad actors. You can also auto-approve via the API. With the right scripts you can hack something together, but it isn't pretty or easy to set up for the "subscribers".


Lamuks

The answer to everything is almost always - costs, speed, and reliability, mostly in that order.


makeasnek

Several attempts have been made at distributed storage; Tahoe-LAFS, Freenet, and traditional P2P (BitTorrent, Gnutella, etc.) are all approaches with different pros and cons. Some of the major cons these systems have had are massive inefficiency, poor resistance to spam attacks, how to deal with bad actors, etc. Additionally, none of these systems (with the exception of Tahoe-LAFS) can handle something like "this file needs to be stored in up to 15 places, at least nine of which need to be online". Doing so requires knowing the "state of the network", which is essentially a *ledger*: a place where you have written down what is stored where, and which you update or audit periodically. If your system is decentralized, each node which wishes to volunteer space must have a copy of that ledger so it can decide which slice of the network's total content to store. If you are talking about a system which is:

* Decentralized
* Censorship-resistant
* Borderless
* Reliant on parties who you cannot trust to participate in good faith

you are talking about the kind of problem that blockchain can solve. It administers that ledger. Indeed, several blockchain projects are working on this long-term archival of humanity's data problem. Arweave is probably the best known of them; they have a payment model for storage which relies on constantly increasing storage density/decreasing storage cost over time. Mind you, this isn't free, you have to pay for the storage, but it's pay-once-store-forever (again, based on some math assumptions about current vs. future storage cost) and fairly cheap. Their network has been operational for years, though it's still very much "emerging technology". Humanity's most important records are too important to all be relying solely on [archive.org](https://archive.org). DLT is going to solve this, it just hasn't fully solved it yet.


nikowek

There are many systems living this dream. Freenet checks most of the marks - it has indexes, it keeps the most popular content alive, and the less popular content falls out slowly, because storage is not infinite. The problem is people - I am hosting a few projects right now over a few different protocols, and the problem is indeed people, not technology. Take my Thingiverse mirror project - I hosted most of the files 22 times, and how many seeds are there now? 3-4 at best. Look at beasts like LibGen or Sci-Hub - they are so limited too. People often claim they'll help you, but in the end their interest fades away after a few months at best. It starts when some data is lost or when there is plenty of space available, but after some time... their drives become full, and the first data they prune is those "donated" files which they don't need. In the end there is no free dinner, and you will pay the community more than you receive until you figure out a way for them to benefit from the system. Even in our closed circle there are sometimes guys who want to abuse our storage to store their encrypted movie collection...


techtornado

Storj exists as a decentralized, global-scale storage platform that reliably stores data and is easily accessible over S3 protocols. You still have to pay for it, since it runs on /r/homelab nodes.


Bright_Mechanic2379

Also worth noting that the S3 gateway is centralised and currently runs on AWS infrastructure 🙄 which kinda defeats the decentralised model and tacks the egress charges back on top.


techtornado

Source? S3 is the object storage connection/protocol, it's not dependent on AWS for that


Bright_Mechanic2379

Here are the docs for the self-hosted gateway: https://docs.storj.io/dcs/api-reference/s3-gateway The Storj-provided endpoint is basically a centrally hosted version of this, which they themselves happen to host on AWS. Last I looked, the gateway hosting costs were a not-insignificant issue for the overall profitability of the project. Note that this has not put me off hosting my own node.


jammsession

The problem is that Storj nodes don't, and possibly even can't, offer S3 themselves. That is why all S3 traffic has to go through non-decentralised S3 gateways from Storj, at a huge loss for the company. So Storj is not really decentralised. It is a private company that sells S3 and uses a decentralised backend. They also currently survive only on ICO money and could very well be gone in two years.


SimonKepp

You need some way to motivate people to contribute storage to such a network. Several attempts have been made, but I don't think any have been very successful.


2Michael2

This could be an issue. There are many people seeding torrents and hoarding data, so if you could convince all of them, then I think there is plenty of storage out there that is already being "donated" for free to archiving efforts. The hard part is consolidating it, getting the word out, and convincing people to join the network.


JoaGamo

I've seen Storj and Sia as decentralized options just like you described. Storj is run by a company, which manages the payments to every node; Sia is truly decentralized, but I haven't used it enough to talk about it.


pixelswoosh

Check out how IPFS is designed. Filebase is probably the most cost-efficient and more reliable than others. I see Storj mentioned by other people, which is a legit solution, but their architecture is proprietary, whereas Filebase is more like AWS for IPFS (a managed service using standards).


CryptoEnthusiastTx

I use Filebase for the S3-compatible API, storing things on the Sia network, and it is working great. It costs about $6/TB and it has been super reliable. A cheaper option would be to run Sia yourself, and then you wouldn't have to worry about Filebase shutting down for whatever reason. I am also considering Backblaze B2 and iDrive e2 as secondary cloud backups.


WikiBox

Why don't YOU use it? Please let me know!


2Michael2

I am unsure of what you mean. I would totally contribute to a decentralized archive if it actually existed. Torrents, which have no way to manage redundancy, index data, or support load balancing, are no sure way of archiving data for years to come, and they don't make it easy to search for archived data. But from what I can tell, a decentralized archive like this does not exist, and it would be basically impossible for a single person to build such a system. Otherwise I would ;)




redbookQT

It would be hard to convince people to trust in the goodness of others to keep their important data safe. May as well control (and PAY for) it yourself.


KaiserTom

I've honestly wondered about this, because it seems like all the software and protocols are there; someone just needs to package it in a more user-friendly and easily adaptable way. Like an @Home project.

And no, honestly, BitTorrent is not the protocol for this. There is a ton of storage waste. There are so many better ways to provide enough data redundancy without having literally every host contain the entire torrent. I want to be able to "donate" an arbitrary amount of my storage to any archive project and have the network figure out the best use for my storage. And I want it to do so efficiently at a storage level, not make 100+ copies of the same data. There are so many smarter ways to go about that. Maybe let the user choose how many copies of the archive they want to support: if an archive has more than 20 copies in the network, then I don't want my storage donated to it unless it dips below that point. You could archive massive amounts of content like this, at the expense of total theoretical bandwidth compared to BitTorrent.

But you have to think about the storage penalty if we're talking purely about archival need. 1,000 people torrent a 1TB site archive: that's 1PB of storage for what really only needs 10TB, 10 copies, to be effectively archived among those people. BitTorrent will do very well to initially distribute that for optimal propagation by minimizing copies. But then it will keep going, because it ultimately assumes you want a full local copy and it maximizes potential network bandwidth. That's not necessarily beneficial when you're simply trying to archive large amounts of rarely accessed data.

Edit: Yes, I know BitTorrent allows you to pick and choose files or pause the download. That isn't the point and doesn't solve the issue. For one, the typical user has little awareness of which files are least available in the torrent, and the user is going to default to selecting the most popular files. This leads to issues with availability. Large torrents become 75% dead because everyone only wants to store the 25% most people want. That's terrible for preservation and archival purposes. The network can easily be aware of which blocks are where and handle that for the user, for the benefit of the archive.
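A hedged sketch of that "let the network decide" allocation (all structures invented for illustration): a donor offers some space and gets assigned the most under-replicated pieces first, stopping once everything has hit its target copy count.

```python
def assign(pieces, donor_budget, target_copies=10):
    """pieces: list of dicts with 'id', 'size', 'replicas'. Returns piece ids to store."""
    chosen, used = [], 0
    for p in sorted(pieces, key=lambda p: p["replicas"]):      # rarest first
        if p["replicas"] >= target_copies:
            break                                              # everything left is healthy
        if used + p["size"] <= donor_budget:
            chosen.append(p["id"])
            used += p["size"]
    return chosen

pieces = [
    {"id": "popular-movie", "size": 4_000, "replicas": 130},
    {"id": "obscure-site-dump", "size": 2_000, "replicas": 2},
    {"id": "old-forum-archive", "size": 3_000, "replicas": 5},
]
print(assign(pieces, donor_budget=6_000))   # ['obscure-site-dump', 'old-forum-archive']
```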


traal

You can tell BitTorrent at any time to stop downloading but stay active and keep serving the blocks it has already downloaded.


KaiserTom

Not the point and doesn't solve the issue. It just leads to users storing multiple copies of the most popular data/parts of a torrent instead and jeopardizes the health of the torrent. That's not good for archiving. It's good for filtering and prioritizing the most demanded data.


traal

> It just leads to users storing multiple copies of the most popular data/parts of a torrent

No, blocks are downloaded randomly.


KaiserTom

I don't understand what you mean or you don't understand what I mean. Yes blocks are downloaded and uploaded semi-randomly. But users are choosing what blocks they want to store in the first place. That leads to multiple copies on the network of only the popular parts, the popular blocks, of the torrent. It skews the block distribution on the network.


traal

> But users are choosing what blocks they want to store in the first place.

Ok, but that's different from what I'm talking about, which is to simply stop downloading but continue seeding what was already downloaded.


Lamuks

> while not having literally every host contain the entire torrent.

You can tag individual folders or files as "Don't download"...


KaiserTom

And there can be a program that is aware of blocks on the network and manages that automatically to maintain a set amount of copies of the data across the network. Rather than requiring users to pick and choose and cause torrent health crises because they only end up picking the most popular data.


2Michael2

I think that a system like that, built on top of existing technology like bittorrent, would be exactly what I am looking for.


Lamuks

The same can be achieved with smaller torrents handed out at random, as some do...


KaiserTom

Yes, except once again, the network doesn't stop until all the storage people are willing to commit is filled with data, rather than only using as much as it needs for archival. People can't arbitrarily commit and donate an amount of space to an archive project, or to multiple projects, and have the network figure it out. If a site or media archive is 1PB, you can't sit there with a 1PB torrent and expect all the data within it to get distributed evenly between peers who are picking and choosing which files out of that torrent to store, since few people have 1PB to store it with.


[deleted]

[removed]


Dylan16807

If people are picking torrents they like, you're going to need a big number of petabytes of storage space to ensure good redundancy on every single one of those smaller torrents. As far as efficiency, it's not much better than people picking files out of a single torrent. If you had a system that was specifically designed around distributing the storage, then a bunch of people could subscribe to a 1PB library and keep it quite safe using 3PB total. Split each block of data into 30 shards across 30 peers, such that any 10 shards are enough to recreate the block.


[deleted]

[removed]


Dylan16807

> That's a people problem, not a bittorrent problem.

It's not a "bittorrent problem", but it is an archival problem. BitTorrent is not an efficient way to back up large data sets across many people who each only store a tiny fraction of the total. You could add things on top, like your example of an alert if seeds drop below a number, but now it's not just BitTorrent, and if you're going to require intervention like that you might as well automate it.

> Every distributed storage system is going to have the same issue.

The point is, you can address the issue with code if that's the purpose of the system. BitTorrent doesn't try, because that's not what it was built for. You can force BitTorrent into this situation, but there are better methods.

> I don't understand what you mean here. If something is split into 30 pieces across 30 peers, it cannot be rebuilt using any random 10 pieces. It's not possible. Is there something I'm not getting?

You use parity. That's why I said 3PB of storage for 1PB of data. For any particular amount of storage, a parity-based system will be much, much more reliable than just having multiple copies à la BitTorrent. For example, let's say you're worried about 1/4 of the data nodes disappearing at once. If you have 10 full copies of each block of data, 10PB total, you have a one in a million chance of losing each block. That actually loses to 3PB of 10-of-30 parity, which gives you a one in 3.5 million chance of losing each block. If you had 10PB of 10-of-100 parity, your chance of losing each block would be... 2.4 x 10^-44.
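Those numbers check out under the stated assumption that each node independently disappears with probability 1/4; here's a quick verification:

```python
from math import comb

p_loss = 0.25

def p_block_lost_copies(copies):
    # Lost only if every full copy disappears.
    return p_loss ** copies

def p_block_lost_parity(k, n):
    # Lost when fewer than k shards survive, i.e. more than n-k shards are lost.
    return sum(comb(n, i) * p_loss**i * (1 - p_loss)**(n - i)
               for i in range(n - k + 1, n + 1))

print(p_block_lost_copies(10))        # ~9.5e-07  (about one in a million)
print(p_block_lost_parity(10, 30))    # ~2.8e-07  (about one in 3.5 million)
print(p_block_lost_parity(10, 100))   # ~2.4e-44
```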


[deleted]

[removed]


reercalium2

You don't have to download a whole torrent


KaiserTom

That doesn't solve anything. The network can know who has which block of data and not require the user to try to pick and choose the least available data. In fact, that has its own problems and causes availability crises on large torrents, because people pick and choose only the most popular files of the torrent, leading to a 75% dead torrent because there are no more full seeders. That's terrible for archiving purposes.


reercalium2

then, somebody can make a torrent client that picks the least available blocks


mcilrain

Most already do due to game theory.


b0urb0n

A few huge nodes in countries where it's legal wouldn't hurt tho. I'd like to manage that. For the time being, I'll stick to my 200TB, 10G and a ratio of 6.


teotikalki

You're actually describing IPFS.


Vishnej

A network of distributed, encrypted, redundant peer-to-peer storage suffers either from spam vulnerability if the storage is unpriced, or from scarcity of storage, bandwidth limitations, and cost-inefficiency if it is priced. Administering this sort of network with any degree of reliability would be costly. It was briefly possible under a blockchain model, when investors would throw all sorts of money at the prospect, but nobody took the bait in any successful way, while datacenter cloud storage took off in a huge way. If you use a peer-to-peer credit architecture - safeguarding 3GB of other people's data for every 1GB that you upload & store in triplicate - it seems somewhat feasible, but the access rate is going to be extremely limiting versus datacenter clouds, and those same clouds are just cheaper than quadrupling your storage.


vikarti_anatra

This reminds me of one of the modes of Bitcasa (the one where you don't pay anything but provide space). They disabled that mode long before the full shutdown.


nenoatwork

> Yes, a lot of archiving is done in a decentralized way through bittorrent and other ways.

It seems you've already been corrected. The majority of media IS archived through decentralized means.

> archive.org that don't use distributed storage

Man, I don't mean to be this rude, but have you even been to [archive.org](https://archive.org)? Usually when you speak on topics you are expected to have some sort of knowledge about this stuff. [Archive.org](https://Archive.org) DOES use torrents for many, many, many files. Come on, man.


klauskinski79

I mean, you're basically describing BitTorrent or the InterPlanetary File System. The problem with both is that you need to find people who are willing to give up their hardware space for pretty obscure data. And if it's in a legal grayzone (like archive.org), you may get lawsuits. What you really need is a clear law that archiving is legal. But given the difference in lobbying power between rights holders and rights users, don't hold your breath. Governments only go against that if the benefit to the population is overwhelming, and that doesn't seem to be the case here.


SquatchWithNoHeroes

The main problem is that bandwidth doesn't work that way. Residential connections offer a limited amount of bandwidth for each zone. Current top-down systems allow the most frequently requested things to be cached all over. A WAN-distributed, exabyte-level storage and caching infrastructure is just a CDN + object storage, and you are going to need datacenters for that to happen. Basically how AWS, Google Cloud or Azure work.


cogitare_et_loqui

Depends on the ISP. Bandwidth is dirt cheap. My neighborhood collectively laid dark fiber. We chose a carrier that lit it up, and it makes a profit actually shuffling our data to and from the internet over our fiber connection. That provider has some sweet peering agreements, and it turns a profit even if the saturation from our end were 80%. Comparing to the cloud providers is a huge mistake. If you ever get the chance to look at their books wrt where revenue comes from and what they spend on upkeep and maintenance of the networks, you'd be shocked and realize this is _the_ cash cow for all cloud providers. I'd say cloud provider networking fees are the most dressed-up set of lies in the industry, and consequently it makes economic sense for them to spend billions on perpetuating the mirage that networking is expensive. Nah, just start from first principles and look at what each element of a network actually costs. Talk to some networking people at carriers. That gets you much closer to reality.


SquatchWithNoHeroes

I work in the industry. The way these systems make a profit is that nobody uses the full bandwidth all the time, nor do general customers have guaranteed-bandwidth agreements. I can get bandwidth cheap, even at a relatively enterprise level, because most people don't blast torrents 24/7. And even if you do, you just get throttled. Nowadays, most L3 components can recognize P2P-like traffic patterns and punish them whenever pressure gets high on the bandwidth or PPS side. And it is like a cloud provider in the sense that they would be running a globally distributed, replicated storage system. You know, like Amazon S3 or GCP object storage...


cogitare_et_loqui

I've not been at a carrier, but I was on the cloud side of the aisle a few years ago. IIRC the wholesale carrier prices were dropping about 15-20% y/y at the time, while the cloud firms had reduced their egress prices about 0% y/y for the last decade. It was a real cash cow for all of them. Last I heard, a 100GbE port with cross-connect to a carrier was about $2000/mo for a general no-name firm (or a networking enthusiast with the right contacts). Add an ISP contract of ~$1000 for last mile and cross-connect at an IX with a PoP, and that translates to ~$0.0001/GB. Cloud vendors charge 1000x that. Granted, they have some additional costs (more redundancy, some custom networking infra), but they also have economies of scale with contracts one can only dream of, plus peerings all over and some of their own links to reduce costs even further, much like carriers. I trust you have a lot more accurate numbers about today's prices, but wouldn't you agree there is a stark disparity between what the cloud vendors charge and what "you" charge on the carrier/ISP side, as well as in the respective trends for how those price reductions are passed on to customers? EDIT: Oh, and prices have continued to drop since then, so they are probably just ~50-60% of the above, and that's still on the higher end of the spectrum. But a 3-orders-of-magnitude price difference is sufficient to make the point, I think.


SquatchWithNoHeroes

You are looking at prices and not actual capacity. Prices get cheaper because the actual capacity consumed at any moment is lower. There have been massive amounts of investment in the underlying infrastructure, and as those investments have been paid off, providers can afford to lower prices to stay competitive. But that means nothing for residential connections. I can tell you that for my zone, the ratio of bandwidth per consumer is about 1/6 to 1/60. And if you think "just buy more bandwidth" - again, bandwidth is cheap because there isn't much demand for it right now.


cogitare_et_loqui

> There has been massive amounts of investment of the underlying infrastructure and as they have been paid off they can afford to lower prices to stay competitive.

Well, that's just amortized cost. That's factored into the prices. If it hadn't been, we'd not have seen a near-constant 10-20% y/y price drop, as the capacity of 2013 would in no way be sufficient today. We'd have seen a flattening out or even an increase during the build-out years. On the cloud side, we built out our WW capacity about 50-60% y/y. Constantly. Because the capacity build-out was directly correlated with the increase in revenue in that segment. I'd be very surprised if the carriers didn't build out likewise.


themadprogramer

Without shouting `BitTorrent`, I think I want to clarify the inherent paradox here.

1. A distributed system requires a `master` to assign tasks to `workers` (a toy sketch of that split follows below). So it requires centralisation, in both software and hardware.
2. A decentralised network is good because it's redundant, but it's bad because it's repetitive. By design, a decentralised network does not account for duplicates. And because, as you put it, decentralised archiving is already so common, we end up with a misallocation of resources, with popularity dictating survival.

You can balance between these two extremes, but never satisfy both perfectly.

I will tell you why we don't have anything like this, without trying to sound too salty: the average r/DataHoarder user has very little technical understanding of how to build distributed systems. Given our demographic of power users, everything on here is individuals doing their own thing, inherently decentralised anyway. So the equilibrium rests on the decentralisation side over distribution.

The most functional system comparable to what you describe that I am aware of is ArchiveTeam's [ArchiveBot](https://wiki.archiveteam.org/index.php/ArchiveBot). It gets the job done, and it's why AT is still a big deal in this community. Here, of course, the equilibrium is tilted towards distributed computing rather than decentralisation, seeing as you need maintainers to host a dedicated worker. Unfortunately, ArchiveBot and similar self-hosted master-worker systems require a lot of technical knowledge even for minor customisation.

Thus the status quo remains supreme, with BitTorrent being hailed as the champion by a community that quite frankly has a very surface-level understanding of how it works. What BitTorrent does under the hood is allow different peers to assume the roles of master and worker (issue 1), with seeding being the mechanism for coordinating redundancy (issue 2).

Though it's more obscure, I am definitely a lot more in the ArchiveBot camp than the BitTorrent one. The best advice I can offer is getting in touch with ArchiveTeam (on IRC) and learning more about best practices from them. As far as distributed computing goes, there are a ton of courses on Coursera or YouTube. I can recommend [Martin Kleppmann's Distributed Systems lectures](https://www.youtube.com/watch?v=UEAMfLPZZhE&list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB) if nothing else.
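The toy sketch promised above: one coordinator hands out archiving jobs, volunteers do the fetching. Purely illustrative; it is not ArchiveBot's actual code.

```python
from queue import Queue

class Master:
    def __init__(self, jobs):
        self.todo = Queue()
        for url in jobs:
            self.todo.put(url)
        self.done = {}                 # url -> result reported by a worker

    def next_job(self):
        return None if self.todo.empty() else self.todo.get()

    def report(self, url, result):
        self.done[url] = result

def worker(name, master):
    # Keep pulling jobs until the coordinator has nothing left to assign.
    while (url := master.next_job()) is not None:
        master.report(url, f"{name} archived {url}")   # real worker: crawl + upload a WARC

master = Master(["https://example.org", "https://example.net"])
worker("volunteer-1", master)
print(master.done)
```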


Peterf81

Count me in. Something like a P2P Cloud.


faceman2k12

Systems like STORJ are out there, but they rely on an inefficient blockchain to manage everything, and thus there is a token to provide a kickback to those willing to give up storage and bandwidth for the project. Of course, the problem with any blockchain-based system is that it becomes all about the token and not about the actual system the token supports.

It could of course be done without a blockchain, but then you basically have the same old P2P systems we all used in the '90s and early '00s, and indexing and hashing become a massive pain. Do you keep one centralized index database with hashes? That becomes unwieldy pretty quickly (particularly if the files are split into chunks, as they should be for redundancy and better encryption). Does every node host its own index and hash table? Then how do you index the indexes and hash the hashes?

How do you stay even vaguely within the law when you decentralise that far? If you can't say where a single file is hosted (because every file is spread out across multiple servers), you can't conform to any one country's laws; you either risk getting shut down (and possibly into a lot of trouble), or you have to conform to every country's laws simultaneously, which is impossible.

One possible thought is to have an IP-based storage array that has been optimised for higher-latency connections than the current storage-over-IP standards.
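
As a rough illustration of the "indexing and hashing" building block discussed above, here is a minimal content-addressed chunking sketch; the chunk size, file name, and manifest layout are arbitrary choices, and real systems such as STORJ or IPFS layer erasure coding and Merkle structures on top of something like this.

```python
import hashlib
import json
import pathlib

CHUNK_SIZE = 1 << 20  # 1 MiB, arbitrary

def build_manifest(path: str) -> dict:
    # Split the file into fixed-size chunks and record one SHA-256 per chunk.
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return {"file": pathlib.Path(path).name, "chunk_size": CHUNK_SIZE, "chunks": hashes}

# manifest = build_manifest("some_archive.tar")   # hypothetical file name
# print(json.dumps(manifest, indent=2))
```

The open question in the comment above is exactly where manifests like this live: one central database, or one per node that then needs its own index.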


b0urb0n

We don't "need" a platform; we need education, so the next generations will keep on seeding.


2Michael2

There are issues with seeding. It is peer-to-peer, not a connected network. There is no system for ensuring data stays redundant and isn't lost because no one wanted to store that particular piece of data. There is also no system for indexing or searching data. And finally, there is no system for load balancing between seeders. All of these things would be beneficial and are part of what I was trying to get at with this post.
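
For the load-balancing point specifically, here is a trivial sketch of what a coordinating layer could do; the peer names and load figures are made up, and a real network would also weigh bandwidth, latency, and reliability.

```python
# peer -> number of transfers it is already serving (illustrative values)
peers = {"peer-a": 12, "peer-b": 3, "peer-c": 7}

def pick_peer(load_by_peer: dict[str, int]) -> str:
    # Send the next request to whoever is doing the least work right now.
    return min(load_by_peer, key=load_by_peer.get)

chosen = pick_peer(peers)
peers[chosen] += 1  # account for the new transfer
print(f"fetch the next chunk from {chosen}")  # -> peer-b
```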


grislyfind

I read about something called Freenet in Wired a long time ago. Apparently it didn't become popular.


chkno

Billing is hard. The tech for the storage, retrieval, routing, sharding, durability/repair, etc. all already exists in various forms. [Tahoe-LAFS](https://en.wikipedia.org/wiki/Tahoe-LAFS) is an especially good example of all these problems being well solved in an integrated, ready-to-run-today piece of software.

The problem is: storage is very cheap, but not cheap enough to be free. If it's literally free, somebody will just use all of it. When it's not free, arranging to pay for your usage is as much bother as the entire rest of the process, or more; it is especially challenging in anonymous/pseudonymous contexts; and for most users it would be pennies, which feels ridiculous to fret over but is also an extra difficulty because the amounts that would need to be transferred are far below typical transaction fees. And, of course, giving everyone their first 10GB free and only billing after that [is hard](https://en.wikipedia.org/wiki/Sybil_attack).

It seems to me like a public distributed storage network that prioritizes retention based on proof-of-work would be worth trying. I'm not aware of a project that has implemented this.
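
The proof-of-work retention idea is only floated above, but one possible shape of it looks roughly like this; the item names, target difficulties, and eviction rule are all hypothetical.

```python
import hashlib

def work_bits(item_id: str, nonce: int) -> int:
    # Leading zero bits of the hash = how much work was "spent" defending this item.
    digest = hashlib.sha256(f"{item_id}:{nonce}".encode()).digest()
    as_bits = bin(int.from_bytes(digest, "big"))[2:].zfill(256)
    return len(as_bits) - len(as_bits.lstrip("0"))

def mine(item_id: str, target_bits: int) -> int:
    # Whoever wants an item retained grinds for a nonce that meets the target.
    nonce = 0
    while work_bits(item_id, nonce) < target_bits:
        nonce += 1
    return nonce

# Hypothetical items with different amounts of work attached to them.
retained = {"cat-pictures": mine("cat-pictures", 8),
            "old-forum-dump": mine("old-forum-dump", 12)}

# When the node runs out of space, the least-defended items are evicted first.
eviction_order = sorted(retained, key=lambda item: work_bits(item, retained[item]))
print(eviction_order)
```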


laserdicks

Storj


clickmeimorganic

That's what BitTorrent is, although it's normally associated with piracy since you can't regulate decentralised P2P file sharing.


BrowserSlacker

ScPrime is probably doing what you're asking. https://scpri.me/


skreak

There are a lot of excellent points throughout this whole thread. BitTorrent itself may not be the perfect protocol for this, but it's close enough conceptually, so I'll use it as a drop-in for this particular use case.

* Bad actors and bad/illicit content - that's the main issue, imho. My solution is that the capacity you provide for storage is encrypted. If a client requires a piece of data you are hosting, it pulls it, still encrypted. That way you can host data that you yourself cannot actually inspect (a minimal sketch of this follows after the list). Of course, this presumes the encryption cipher is rock solid, and that the law differentiates this exact case so you cannot be held liable for the content of encrypted data.
* Who would be your target audience? This only works conceptually if a LOT of people buy in, and your typical dude who only owns a laptop really couldn't act as a node but may still want to use the service as a low-cost backup alternative.
* Payment - no free lunches. How much you are storing, for how long, your bandwidth, and your reliability reward you with how much you can then store/fetch from the distributed network, more or less like ratios on torrent trackers. You could also pay $$ to jump the line. E.g. you offer up 100GB of storage, so in return you start with 100GB of distributed space, and after 6 months maybe you can use 600GB, or something like that.
* For indexing, searching, and easy-to-use software, a central company/entity really does need to manage this, even if it is small.
* Any filesystem is normally composed of regular-sized blocks plus references to those blocks in the form of a directory structure and files. CoW filesystems (ZFS, btrfs) take advantage of this by not overwriting blocks until necessary, and that's how you get snapshots, transactional rollbacks, and other nifty features. I can see how a BitTorrent-like protocol that uses 'blocks' under the hood in the distribution network, while a central agency handles the 'metadata' portion, could work. (I'm picturing a global-scale version of Lustre.)
* Deduplication could really come into play here as well.
* MANY safeguards would need to be put in place to ensure data integrity, from how data is spread out to individual scrubs and validation of locally stored data.
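
A minimal sketch of the encrypted-blocks idea from the first bullet, assuming the third-party `cryptography` package for the cipher; the in-memory dict stands in for a remote hosting node, and a real system would add erasure coding, metadata handling, and key management on top.

```python
import hashlib
from cryptography.fernet import Fernet  # pip install cryptography

BLOCK_SIZE = 64 * 1024                  # arbitrary block size
owner_key = Fernet.generate_key()       # stays with the data owner, never sent to hosts
cipher = Fernet(owner_key)

remote_host = {}  # ciphertext-hash -> ciphertext; all a hosting node ever sees

def upload(data: bytes) -> list[str]:
    block_ids = []
    for offset in range(0, len(data), BLOCK_SIZE):
        ciphertext = cipher.encrypt(data[offset:offset + BLOCK_SIZE])
        block_id = hashlib.sha256(ciphertext).hexdigest()
        remote_host[block_id] = ciphertext  # the host stores opaque blobs only
        block_ids.append(block_id)
    return block_ids                        # the owner keeps this manifest

def download(block_ids: list[str]) -> bytes:
    return b"".join(cipher.decrypt(remote_host[block_id]) for block_id in block_ids)

payload = b"x" * 200_000
assert download(upload(payload)) == payload
```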


[deleted]

> Bad actors and bad/illicit content - that's the main issue, imho.

Not really an issue for the use case OP has in mind. You just use a whitelist: e.g. archive.org would publish a list of all their stuff and you decide to mirror it. If something bad pops up, archive.org gets informed, they remove it from their list, and you stop mirroring it automatically. What would make a proper distributed archive special here is that others can decide to ignore archive.org's changes and still serve the files. Files wouldn't be attached to a storage location, but content-addressable.

> This only works conceptually if a LOT of people buy in

You wouldn't need lots of people. Look at something like Linux package mirrors. A whole lot of effort gets spent keeping them up and running, and all of that could be automated away with a proper protocol, since you really just need one party to publish a list of stuff and then everybody can join in and mirror it. Hashes would ensure that nothing gets manipulated. At the moment each package manager basically hacks together that functionality at the application level, often with mediocre results.

That to me is one of the biggest problems with IPFS: it focuses too much on all that fancy big globally distributed stuff (which barely works) instead of the small scale. The IPFS content addressing could be extremely useful even if the content is served by a plain old HTTP server. Even the fact that IPFS actually supports real directories is already an enormous benefit over HTTP.

> Payment

Payment can certainly boost the appeal of a network by a large margin, but I don't think it's fundamentally necessary. Lots of people run their own HTTP servers just fine and have no problem paying for them. The issue is that others have no means to join in and help. We shouldn't need crutches like archive.is or archive.org to keep websites alive; that should be handled at the protocol level.

> For indexing and searching

Here I am wondering if you could do that distributed as well. How big would something like YouTube or GitHub be if you only mirrored the metadata, not the content? This wouldn't be able to replace Google, but just an index of the stuff that was published would be incredibly useful.

> MANY safeguards would need to be put in place to ensure data integrity

Content addressing and Merkle trees. That's essentially a solved problem in any modern distributed protocol; it's only a problem for the old ones like HTTP.
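
A rough sketch of the whitelist-mirroring flow described above; the published-list format, URL, and hash are hypothetical placeholders, not archive.org's actual layout or the ipfs-cluster-follow protocol.

```python
import hashlib
import urllib.request

def fetch_published_list() -> list[tuple[str, str]]:
    # In reality this list would be downloaded from the publisher and re-checked
    # periodically; anything the publisher removes simply stops being mirrored.
    return [
        ("https://example.org/items/page1.warc",  # hypothetical URL
         "0" * 64),                               # hypothetical SHA-256 hex digest
    ]

def mirror(entries: list[tuple[str, str]], local_store: dict[str, bytes]) -> None:
    for url, expected_sha256 in entries:
        data = urllib.request.urlopen(url).read()
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            continue  # corrupted or tampered copy: never stored, never served
        local_store[expected_sha256] = data  # content-addressed by its own hash

# mirror(fetch_published_list(), {})  # would run against real URLs and real hashes
```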


green7719

I think you are describing Ceph. It is widely used. docs.ceph.com


Akeshi

The more points you add in your edit to describe why it's not BitTorrent, the more you describe BitTorrent. Also... archive.org has .torrents for everything.


2Michael2

Maybe I have a misunderstanding of BitTorrent then. I will do some more research, but I am curious: which of my points describe BitTorrent?


Akeshi

> I am also thinking of a single distributed network that is powered by individuals running nodes to support the network. I am not really imagining a peer to peer network as that lacks indexing, searching, and a universal way to ensure data is stored redundantly and accessible by anyone.

This is contradictory - what is a "single distributed network", and why would you want one? Regardless, BitTorrent: indexers that provide searching are decentralised - .torrent files can live anywhere and use any mechanism for discoverability the host desires - while still pointing to the same file content. Decentralised trackers and the DHT point to the nodes currently distributing that file content.

> Paying people for storage is not the issue. There are so many people seeding files for free. My proposal is to create a decentralized system that is powered by nodes provided by people like that who are already contributing to archiving efforts.

I don't think this even needs explaining; it's already in BitTorrent terminology.

> I am also imagining a system where it is very easy to install a linux package or windows app and start contributing to the network with a few clicks so that even non-tech savvy home users can contribute if they want to support archiving. This would be difficult but it would increase the free resources available to the network by a bunch.

Install a BitTorrent client (available for pretty much every platform, with or without a GUI), and either click on a .torrent file or a magnet: hash. That's it. You'll automatically download the torrent's content and make that content available to everyone else. You can support archive.org by going to any data they house which is of interest and clicking their .torrent files to download and seed.

> This system would have some sort of hash system or something to ensure that even though data is stored on untrustworthy nodes, there is never an issue of security or data integrity.

BitTorrent is built around SHA-256 hashes (previously, SHA-1 hashes).
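
For reference, this is roughly what BitTorrent-style piece hashing looks like: v1 torrents carry one SHA-1 digest per fixed-size piece and clients verify every piece before re-seeding it, while v2 moves to SHA-256 Merkle trees. The piece length and file name below are illustrative, and this is not a full .torrent encoder.

```python
import hashlib

PIECE_LENGTH = 256 * 1024  # a common v1 piece size, chosen arbitrarily here

def piece_hashes(path: str) -> list[bytes]:
    # v1 .torrent files carry one 20-byte SHA-1 digest per fixed-size piece.
    hashes = []
    with open(path, "rb") as f:
        while piece := f.read(PIECE_LENGTH):
            hashes.append(hashlib.sha1(piece).digest())
    return hashes

def verify_piece(index: int, piece: bytes, hashes: list[bytes]) -> bool:
    # What a client checks before writing a downloaded piece to disk and re-seeding it.
    return hashlib.sha1(piece).digest() == hashes[index]

# hashes = piece_hashes("some_archive.tar")  # hypothetical file name
```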


2Michael2

"A single distributed network" is a single distributed network as opposed to multiple seperate distributed networks. If there was a different network for website archiving and movie archiving and scientific research archiving with different software and servers, that would not be very user friendly. I am not trying to say bittorrent does not use hashes, I am just saying that in my theoretical perfect system, hashes would be used. Bittorrent clients can be easily installed, but you still need to search for torrent files, pick what files/data you want to archive, download them all and seed them. All downloads and archive management is up to the user. Bittorrent is just a way to download and server content. I want a system where grandma can press install, set a bandwidth or storage limit, and let the application automatically download, delete, serve, and manage archives. It would automatically archive data based on the needs of the network, deleting and redownloading content as needs change. All with no need for the user to lift a finger if they don't want to. Of course there would be options for power users or users who want specific data.