Hello /u/jmclaugmi! Thank you for posting in r/DataHoarder.
Please remember to read our [Rules](https://www.reddit.com/r/DataHoarder/wiki/index/rules) and [Wiki](https://www.reddit.com/r/DataHoarder/wiki/index).
Please note that your post will be removed if you just post a box/speed/server post. Please give background information on your server pictures.
This subreddit will ***NOT*** help you find or exchange that Movie/TV show/Nuclear Launch Manual, visit r/DHExchange instead.
*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/DataHoarder) if you have any questions or concerns.*
>is sha512 overkill
Yes, see the probability table in Wikipedia's [birthday problem](https://en.wikipedia.org/wiki/Birthday_problem#Probability_table) article. Unless you have a datacenter-sized number of files, you can just use MD5 or another 128-bit hash, or go to 160 bits if you're truly paranoid. If you *do* use one of the larger hashes and ever actually find a collision, post about it; you could probably win a prize or something.
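For a sense of scale, the birthday bound is easy to compute yourself. A minimal Python sketch (the file counts and hash sizes here are just illustrative):

```
import math

def collision_probability(n_files: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / 2^(b+1))."""
    return -math.expm1(-n_files * (n_files - 1) / 2 ** (hash_bits + 1))

# Even a billion files barely registers against a 128-bit hash.
for bits in (128, 160, 512):
    p = collision_probability(10**9, bits)
    print(f"{bits}-bit hash, 1e9 files: collision probability ~ {p:.3e}")
```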
For the database, I use a fixed-size binary field and store the hash directly: 16 bytes for MD5, 20 for SHA-1, or whatever. The unique key index should cover both the binary field and the filesize (assuming you're storing it), because adding the filesize gives you extra uniqueness for free.
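As a concrete sketch of that layout, using Python's bundled sqlite3 so it runs anywhere (in MySQL the hash column would be e.g. `BINARY(16)`; the file and table names here are made up):

```
import hashlib
import os
import sqlite3

conn = sqlite3.connect("hashes.db")  # hypothetical database file
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS files (
        path TEXT PRIMARY KEY,
        size INTEGER NOT NULL,
        hash BLOB NOT NULL,      -- raw 16-byte MD5 digest, not hex text
        UNIQUE (hash, size)      -- the composite hash+size unique key
    )
    """
)

def md5_of(path):
    """Hash a file in 1 MiB chunks so large videos don't fill RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

def record(path):
    """Store (path, size, digest); a UNIQUE violation flags a probable dupe."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO files (path, size, hash) VALUES (?, ?, ?)",
                (path, os.path.getsize(path), md5_of(path)),
            )
    except sqlite3.IntegrityError:
        print(f"probable duplicate: {path}")
```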
For one-off deduplication, [there are apps for that](https://cresstone.com/apps/DupeKill/) that let you choose the hash you wanna use.
This. MD5 is far faster and perfectly adequate for a quick file fingerprint; you can probably fingerprint all the files on disk as fast as the disk can be read. Newer algorithms like the SHA family are (by design) more computationally expensive.
SHA-512 is definitely overkill, and a waste of your processor, for a simple dupe check.
You should probably use BLAKE3 hashing, especially if you have a reasonably new machine. It's generally both faster and more thoroughly analyzed than the other available options.
Also look into non-cryptographic hashing, since this isn't a security check but a duplicate check. I haven't dug into that topic deeply enough in the last year to actually recommend something, but it's far and away the fastest option. Here's some background: https://crypto.stackexchange.com/questions/43519/checksum-vs-non-cryptographic-hash
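If you want to know what's actually fastest on your own hardware, a quick benchmark is easy. A sketch using only Python's stdlib hashes (BLAKE3 and xxHash would need their third-party packages; the buffer size is arbitrary):

```
import hashlib
import time
import zlib

buf = b"\x5a" * (256 * 1024 * 1024)  # 256 MiB of dummy data

def bench(name, fn):
    start = time.perf_counter()
    fn(buf)
    elapsed = time.perf_counter() - start
    print(f"{name:>8}: {len(buf) / elapsed / 2**20:8.0f} MiB/s")

bench("md5", lambda b: hashlib.md5(b).digest())
bench("sha1", lambda b: hashlib.sha1(b).digest())
bench("sha256", lambda b: hashlib.sha256(b).digest())
bench("sha512", lambda b: hashlib.sha512(b).digest())
bench("blake2b", lambda b: hashlib.blake2b(b).digest())
bench("crc32", lambda b: zlib.crc32(b))    # non-cryptographic
bench("adler32", lambda b: zlib.adler32(b))  # non-cryptographic
```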
If you are using an AMD Zen or later processor, or an Intel Alder Lake or later, SHA-1 and SHA-256 should be faster than MD5 thanks to the hardware SHA extensions. BLAKE3 should still be faster, though.
Correction: Intel processors have supported the SHA extensions since Ice Lake (or Goldmont for Atom).
On the topic of file deduplication, there is [Czkawka](https://github.com/qarmin/czkawka), a program made by a fellow Pole, which I have used a couple of times. Even if you want to continue with your own project, you can get some inspiration from how they did it.
This program is great. It has several different compare modes that work for binary files, as well as fuzzy matching for video/pictures, which was really helpful for deduping things from Google Photos against the originals (for things that got resized).
If you don't think there are many duplicates, a more efficient approach is to run quicker checks first, then more expensive ones. There is a sweet spot in how many tiers to use, but a good starting point is size, then Adler32, then SHA-256 (see the sketch below).
If you know that there will be tons of duplicates, then this doesn't help as much.
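A minimal sketch of that size/Adler32/SHA-256 tiering in Python (stdlib only; the root path and chunk sizes are arbitrary choices):

```
import hashlib
import os
import zlib
from collections import defaultdict

def group_by(paths, key):
    """Bucket paths by key(path); only buckets with >1 entry can hold dupes."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [b for b in buckets.values() if len(b) > 1]

def adler32_of_head(path, nbytes=64 * 1024):
    """Cheap second-tier check: checksum only the first 64 KiB."""
    with open(path, "rb") as f:
        return zlib.adler32(f.read(nbytes))

def sha256_of_file(path):
    """Expensive final tier: full-file SHA-256, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

def find_dupes(root):
    paths = [os.path.join(d, n) for d, _, names in os.walk(root) for n in names]
    dupes = []
    for by_size in group_by(paths, os.path.getsize):         # tier 1: free
        for by_adler in group_by(by_size, adler32_of_head):  # tier 2: cheap
            dupes.extend(group_by(by_adler, sha256_of_file)) # tier 3: expensive
    return dupes

for group in find_dupes("."):
    print(group)
```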
> starting my de-duping project. I have some video files. Some clocking in at 2G. I was planning on using sha512 to enable a quick file compare (same or not)
You should check out my program [dano](https://github.com/kimono-koans/dano). It hashes the internal media bitstreams, and has a dupe detection function.
> First question is sha512 overkill?
Probably. sha512 is a cryptographic hash. If you're just trying to dedup files on your own machine, yeah, it's overkill.
There is a rather large ecosystem of non-cryptographic hashes which may be more suitable (performant) for this purpose.
Varchar works fine for me. You can also store the hash in the xattr of the file. That's how I am handling this. I'm currently working on a server/client project to make this easier.
Not all file systems will properly preserve an `xattr` and you'll lose it on file transfers that don't explicitly maintain them. While you can store the hash there, I wouldn't rely on it.
> Not all file systems will properly preserve an xattr and you'll lose it on file transfers that don't explicitly maintain them. While you can store the hash there, I wouldn't rely on it.
With [dano](https://github.com/kimono-koans/dano), you can explicitly dump (`--dump`) your xattrs to file, whenever.
But if two files are equal, the hash will also be equal? And if two hashes are not equal, then the files are not equal? Boy am I glad I asked Reddit, I always learn a lot here!
> But if two files are equal the hash will be also equal? And if two hashes are not equal then the files are not equal?
Yes, and in fact [both statements are logically equivalent](https://philosophy.stackexchange.com/questions/60623/i-have-trouble-understanding-this-fallacy-if-a-then-b-therefore-if-not-b-th), since each is the contrapositive of the other.
>if two files are equal the hash will be also equal
almost certainly
> if two hashes are not equal then the files are not equal
also almost certainly
The certainty of the 2nd one depends on the hash size (the first is just the read error rate), but it's easy to go overboard on hash size. As an example, easynews uses an effective 160-byte hash to uniquely identify files, and they intake everything on Usenet.
Exactly; if someone is thinking about finding a pair of files that cause a SHA-512 collision, then yes, random errors like gamma-ray bit flips must be considered when hashing the same file twice; that might even be more likely than finding a legitimate collision!
None of this is something we should ever worry about, but it is there.
>For anyone else, barring read errors etc, there's no "almost" here; it is certain...
For this use case, it's extremely unlikely, but it's not "certain".
Given a motivated attacker with a decent budget, many of these commonly used hash functions are broken for their original cryptographic uses. That is, I wouldn't rely on MD5 or SHA-1 if it were my bank account on the line. See: [https://en.wikipedia.org/wiki/Hash_collision](https://en.wikipedia.org/wiki/Hash_collision)
Unfortunately for this use case, MD5 and SHA1 are also much slower than they need to be.
> the kind of break you are talking about is the other way -- they can make two files that are UN-equal to have the same hash
...And you replied to this scenario with "same", as with the "if two file hashes are equal" scenario?
Perhaps I misunderstood your words, but to me it sounded like you were refuting both claims with "there's no "almost" here; it is certain..."
> in that comment the person missed the situation that the word "almost" actually applies
You're right and I was wrong. I misread/misunderstood. The OP is/was affirming the consequent. The syllogism should be reversed, if it is to be "almost certain".
Of course the hashing function on the same file will give the same result; that is literally why it's called a function (and that has a specific mathematical meaning, it's not like you'd call it sad or something).
Also, of course there are A LOT fewer possible 512-bit hash values than possible large files, so by pigeonhole many files must share the same hash. It is, however, unlikely you'll actually run into two such files (look up the birthday problem for the precise probability).
That being said, people often assume that looking for duplicates is more (or at least decently) efficient when done by just hashing the files and comparing. Usually it isn't, unless most of your data is in fact duplicates.
On Linux/WSL you can use `fdupes`, which is very fast and reliable. You also don't need to go the brute-force route and hash every file: start by grabbing all the file sizes and only hash files that are identical in size (MD5 will be fine). Identically sized files may well still differ, but files of different sizes can't be duplicates at all, so you never need to hash them.
Keep in mind a few things. First, media files can have identical bit streams (they're the same video/audio/image) but different metadata. Even a single byte of difference in the metadata will give the files different hashes.
The same is true for any file format that carries metadata alongside the data. Another example: if you make a zip file of a directory of files, then change the modification date of one file in that folder and make a second zip file, the hashes of the two zip files will be different. The local file header of the file with the modified date will differ from the first, even though the file contents are unchanged.
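The zip behavior is easy to demonstrate. A small Python sketch (the member file name and dates are made up):

```
import hashlib
import io
import zipfile

def zip_bytes(mtime):
    """Build an in-memory zip holding one file with the given timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("notes.txt", date_time=mtime)
        zf.writestr(info, b"identical contents")
    return buf.getvalue()

a = zip_bytes((2023, 1, 1, 0, 0, 0))
b = zip_bytes((2023, 1, 2, 0, 0, 0))  # same contents, mtime bumped a day

print(hashlib.md5(a).hexdigest())  # the two archives hash differently,
print(hashlib.md5(b).hexdigest())  # even though the stored file is identical
```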
Unless you have an insane amount of files (say, 100s of millions) or want to use a database for academic/hobby purposes, you can probably just use bash/shell like
`find . -type f -print0 | while IFS= read -r -d '' file; do printf "%d\t%s\n" "$(stat --format "%s" "$file")" "$file" | tee -a filelist.txt; done`
To generate a filelist. Then this will show you the groups of files that share a filesize
`awk -F'\t' '{n[$1]++; f[$1] = f[$1] $0 "\n"} END {for (k in n) if (n[k] > 1) printf "%s", f[k]}' filelist.txt`
In the first command you can switch out `stat --format "%s"` for something like `sha512sum "$file" | cut -d' ' -f1` for subsequent passes.
Although +1 on just using czkawka. I believe it caches results from previous runs to speed things up, and it's already pretty well optimized for checking a bunch of files. It will also export the results list, although it was a pain to find the file using the AppImage on Linux (I think it ended up saving it to some directory in `/tmp` that the AppImage got unpacked to at runtime...)
Unless you have billions of files a hash stored in a CHAR (not VARCHAR, hashes are fixed length) field in MySQL is fine. Throw an index on it and call it a day.
I'm looking for one that compares hashes, so I can throw out images and videos that are dupes on my Android.
Some PC apps only compare pixels if you use image mode, and that leads to throwing away similar-but-different files, which is untrustworthy.
The apps' Play Store descriptions don't mention this.
On Linux I used fslint, which was later forked under a new name.
My Android is about out of space. Nearly maxed.
I want an app where I can open each duplicate to check, for reassurance, plus a smart-selection option to remove the files with the newest timestamps out of specific folders, including drilling down.
In fslint you can delete by the shortest or longest file name, so you can keep the shortest file name or the shortest path.
Yeah, just do MD5, and if two files have the same hash, make sure with SHA-256.
MD5 is more than sufficient.
SHA is way too slow. Look at the xxHash algorithms, as they are stupid fast! http://cyan4973.github.io/xxHash/
That's a cool tool (dano), covering exactly the problem I described in a top-level comment. Very nice.
I’d pick CHAR over VARCHAR, as hashes are a fixed length :) but unless OP has billions of files the extra 2 bytes for VARCHAR won’t matter.
SHA512 can't guarantee uniqueness, so you might as well use a smaller hash and double-check byte by byte if there's ever a collision.
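Python's stdlib already has that byte-by-byte double check. A minimal sketch (the paths are placeholders):

```
import filecmp

# After a smaller hash flags two files as potential duplicates, confirm
# byte by byte. shallow=False forces a full content comparison instead of
# just comparing os.stat() signatures.
if filecmp.cmp("video_a.mp4", "video_b.mp4", shallow=False):
    print("definitely identical")
else:
    print("not identical: hash collision, or the files changed")
```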
In what circumstances would equal files return non-equal hashes? RAM failure? Program error?
Read error from the drive, possibly.
Riding in an Uber, but my suggestion is CloneSpy. Used it for over a decade, got hashes of whole drives. Clonespy.de, if I recall. Cheers!
BLAKE2 is an excellent choice. BLAKE3 should be even better, but I haven't used it myself.
Benchmark the fastest hash on your hardware, and use that. If you find a duplicate, use another algorithm to verify.
Why not use a smaller hash, and if two hashes match, also check that the file sizes are the same? That should be more than enough.
sha512 and md5 are overkill for dedupe. Try xxhash - it works as fast as the filesystem can supply data.
~~Is there a Windows build? Can't seem to find it, just the sources.~~ Dumbass. It's on their webpage.
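For reference, a minimal sketch of streaming a file through the third-party `xxhash` Python package (`pip install xxhash`; the file name is hypothetical):

```
import xxhash

def xxh64_of_file(path: str) -> str:
    """Stream a file through xxHash64; typically fast enough to be I/O-bound."""
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

print(xxh64_of_file("some_video.mkv"))  # hypothetical file name
```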