public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Call to atention about using hash functions as content indexers (SCM saga)
@ 2005-04-11 22:40 Pedro Larroy
  2005-04-11 22:51 ` Petr Baudis
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Pedro Larroy @ 2005-04-11 22:40 UTC (permalink / raw)
  To: linux-kernel

Hi

I had a quick look at the source of GIT tonight, I'd like to warn you
about the use of hash functions as content indexers.

As probably you are aware, hash functions such as SHA-1 are surjective not
bijective (1-to-1 map), so they have collisions. Here one can argue
about the low probability of such a collision, I won't get into
subjetive valorations of what probability of collision is tolerable for
me and what is not. 

I my humble opinion, choosing deliberately, as a design decision, a
method such as this one, that in some cases could corrupt data in a
silent and very hard to detect way, is not very good. One can also argue
that the probability of data getting corrupted in the disk, or whatever
could be higher than that of the collision, again this is not valid
comparison, since the fact that indexing by hash functions without
additional checking could make data corruption legal between the
reasonable working parameters of the program is very dangerous.

And by the way, I've read
http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/node15.html

and I don't agree with it. Asides from the "avalanche effect" the
properties of a good hash function is that distributes well the entropy
between the output bits, so probably, although the inputs are not
random, the pdf change in the outputs would be small, otherwise it would
be easier to find collision by differential or statistical methods.

Last, hash functions should be only used to check data integrity, and
that's the reason of the "avalanche effect", so even single bit errors
are propagated and it's effect is very noticeable at the output.

Regards.

-- 
Pedro Larroy Tovar | Linux & Network consultant |  pedro%larroy.com 
Make debian mirrors with debian-multimirror: http://pedro.larroy.com/deb_mm/
	* Las patentes de programación son nocivas para la innovación * 
				http://proinnova.hispalinux.es/

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2005-04-14 15:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-04-11 22:40 Call to atention about using hash functions as content indexers (SCM saga) Pedro Larroy
2005-04-11 22:51 ` Petr Baudis
2005-04-11 23:23   ` Magnus Damm
2005-04-12 10:02     ` Catalin Marinas
2005-04-12  9:59   ` Barry K. Nathan
2005-04-12 12:10   ` Richard B. Johnson
2005-04-12 15:29 ` Theodore Ts'o
2005-04-14 15:54   ` Eric D. Mudama
2005-04-12 16:35 ` Eric Rannaud
2005-04-14  8:30   ` Andy Isaacson
2005-04-14 13:35     ` Eric Rannaud

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox