From: piotr@larroy.com (Pedro Larroy)
To: linux-kernel@vger.kernel.org
Subject: Call to atention about using hash functions as content indexers (SCM saga)
Date: Tue, 12 Apr 2005 00:40:21 +0200 [thread overview]
Message-ID: <20050411224021.GA25106@larroy.com> (raw)
Hi
I had a quick look at the source of GIT tonight, I'd like to warn you
about the use of hash functions as content indexers.
As probably you are aware, hash functions such as SHA-1 are surjective not
bijective (1-to-1 map), so they have collisions. Here one can argue
about the low probability of such a collision, I won't get into
subjetive valorations of what probability of collision is tolerable for
me and what is not.
I my humble opinion, choosing deliberately, as a design decision, a
method such as this one, that in some cases could corrupt data in a
silent and very hard to detect way, is not very good. One can also argue
that the probability of data getting corrupted in the disk, or whatever
could be higher than that of the collision, again this is not valid
comparison, since the fact that indexing by hash functions without
additional checking could make data corruption legal between the
reasonable working parameters of the program is very dangerous.
And by the way, I've read
http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/node15.html
and I don't agree with it. Asides from the "avalanche effect" the
properties of a good hash function is that distributes well the entropy
between the output bits, so probably, although the inputs are not
random, the pdf change in the outputs would be small, otherwise it would
be easier to find collision by differential or statistical methods.
Last, hash functions should be only used to check data integrity, and
that's the reason of the "avalanche effect", so even single bit errors
are propagated and it's effect is very noticeable at the output.
Regards.
--
Pedro Larroy Tovar | Linux & Network consultant | pedro%larroy.com
Make debian mirrors with debian-multimirror: http://pedro.larroy.com/deb_mm/
* Las patentes de programación son nocivas para la innovación *
http://proinnova.hispalinux.es/
next reply other threads:[~2005-04-11 22:41 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2005-04-11 22:40 Pedro Larroy [this message]
2005-04-11 22:51 ` Call to atention about using hash functions as content indexers (SCM saga) Petr Baudis
2005-04-11 23:23 ` Magnus Damm
2005-04-12 10:02 ` Catalin Marinas
2005-04-12 9:59 ` Barry K. Nathan
2005-04-12 12:10 ` Richard B. Johnson
2005-04-12 15:29 ` Theodore Ts'o
2005-04-14 15:54 ` Eric D. Mudama
2005-04-12 16:35 ` Eric Rannaud
2005-04-14 8:30 ` Andy Isaacson
2005-04-14 13:35 ` Eric Rannaud
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20050411224021.GA25106@larroy.com \
--to=piotr@larroy.com \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox