SHA1 hash safety

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* SHA1 hash safety
@ 2005-04-16 12:24 David Lang
  2005-04-16 12:31 ` Ingo Molnar
  0 siblings, 1 reply; 30+ messages in thread
From: David Lang @ 2005-04-16 12:24 UTC (permalink / raw)
  To: git

this issue was raised a few days ago in the context of someone tampering 
with the files and it was decided that the extra checks were good enough 
to prevent this (at least for now), but what about accidental collisions?

if I am understanding things right the objects get saved in the filesystem 
in filenames that are the SHA1 hash. of two legitimate files have the same 
hash I  don't see any way for both of them to exist.

yes the risk of any two files having the same has is low, but in the 
earlier thread someone chimed in and said that they had two files on their 
system that had the same hash..

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 12:24 SHA1 hash safety David Lang
@ 2005-04-16 12:31 ` Ingo Molnar
  2005-04-16 12:48   ` David Lang
  0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2005-04-16 12:31 UTC (permalink / raw)
  To: David Lang; +Cc: git


* David Lang <david.lang@digitalinsight.com> wrote:

> this issue was raised a few days ago in the context of someone 
> tampering with the files and it was decided that the extra checks were 
> good enough to prevent this (at least for now), but what about 
> accidental collisions?
> 
> if I am understanding things right the objects get saved in the 
> filesystem in filenames that are the SHA1 hash. of two legitimate 
> files have the same hash I don't see any way for both of them to 
> exist.
> 
> yes the risk of any two files having the same has is low, but in the 
> earlier thread someone chimed in and said that they had two files on 
> their system that had the same hash..

you can add -DCOLLISION_CHECK to Makefile:CFLAGS to turn on collision 
checking (disabled currently). If there indeed exist two files that have 
different content but the same hash, could someone send those two files?

	Ingo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 12:31 ` Ingo Molnar
@ 2005-04-16 12:48   ` David Lang
  2005-04-16 13:29     ` Brian O'Mahoney
  0 siblings, 1 reply; 30+ messages in thread
From: David Lang @ 2005-04-16 12:48 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: git

On Sat, 16 Apr 2005, Ingo Molnar wrote:

> * David Lang <david.lang@digitalinsight.com> wrote:
>
>> this issue was raised a few days ago in the context of someone
>> tampering with the files and it was decided that the extra checks were
>> good enough to prevent this (at least for now), but what about
>> accidental collisions?
>>
>> if I am understanding things right the objects get saved in the
>> filesystem in filenames that are the SHA1 hash. of two legitimate
>> files have the same hash I don't see any way for both of them to
>> exist.
>>
>> yes the risk of any two files having the same has is low, but in the
>> earlier thread someone chimed in and said that they had two files on
>> their system that had the same hash..
>
> you can add -DCOLLISION_CHECK to Makefile:CFLAGS to turn on collision
> checking (disabled currently). If there indeed exist two files that have
> different content but the same hash, could someone send those two files?

remember that the flap over SHA1 being 'broken' a couple weeks ago was not 
from researchers finding multiple files with the same hash, but finding 
that it was more likly then expected that files would have the same hash.

there was qa discussion on LKML within the last year about useing MD5 
hashes for identifying unique filesystem blocks (with the idea of being 
able to merge identical blocks) and in that discussion it was pointed out 
that collisions are a known real-life issue.

so if collision detection is turned on in git, does that make it error out 
if it runs into a second file with the same hash, or does it do something 
else?

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 12:48   ` David Lang
@ 2005-04-16 13:29     ` Brian O'Mahoney
  2005-04-16 14:58       ` C. Scott Ananian
                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Brian O'Mahoney @ 2005-04-16 13:29 UTC (permalink / raw)
  To: David Lang; +Cc: Ingo Molnar, git

Three points:
(1) I _have_ seen real-life collisions with MD5, in the context of
    Document management systems containing ~10^6 ms-WORD documents.
(2) The HMAC (ethernet-harware-address) of any interface _should_
    help to make a unique Id.
(3) While I havn't looked at the details of the plumbing, this is
    the time to make sure we can, easily, drop in SHA-160, SHA-256
    (or whatever comes from NIST) when needed.


David Lang wrote:
> On Sat, 16 Apr 2005, Ingo Molnar wrote:
> 
>> * David Lang <david.lang@digitalinsight.com> wrote:
>>
>>> this issue was raised a few days ago in the context of someone
>>> tampering with the files and it was decided that the extra checks were
>>> good enough to prevent this (at least for now), but what about
>>> accidental collisions?
>>>
>>> if I am understanding things right the objects get saved in the
>>> filesystem in filenames that are the SHA1 hash. of two legitimate
>>> files have the same hash I don't see any way for both of them to
>>> exist.
>>>
>>> yes the risk of any two files having the same has is low, but in the
>>> earlier thread someone chimed in and said that they had two files on
>>> their system that had the same hash..
>>
>>
>> you can add -DCOLLISION_CHECK to Makefile:CFLAGS to turn on collision
>> checking (disabled currently). If there indeed exist two files that have
>> different content but the same hash, could someone send those two files?
> 
> 
> remember that the flap over SHA1 being 'broken' a couple weeks ago was
> not from researchers finding multiple files with the same hash, but
> finding that it was more likly then expected that files would have the
> same hash.
> 
> there was qa discussion on LKML within the last year about useing MD5
> hashes for identifying unique filesystem blocks (with the idea of being
> able to merge identical blocks) and in that discussion it was pointed
> out that collisions are a known real-life issue.
> 
> so if collision detection is turned on in git, does that make it error
> out if it runs into a second file with the same hash, or does it do
> something else?
> 
> David Lang
> 

-- 
Brian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 13:29     ` Brian O'Mahoney
@ 2005-04-16 14:58       ` C. Scott Ananian
  2005-04-16 15:11         ` Petr Baudis
                           ` (4 more replies)
  2005-04-16 22:33       ` David Lang
  2005-04-17  3:23       ` Tkil
  2 siblings, 5 replies; 30+ messages in thread
From: C. Scott Ananian @ 2005-04-16 14:58 UTC (permalink / raw)
  To: omb; +Cc: David Lang, Ingo Molnar, git

On Sat, 16 Apr 2005, Brian O'Mahoney wrote:

> (1) I _have_ seen real-life collisions with MD5, in the context of
>    Document management systems containing ~10^6 ms-WORD documents.

Dude!  You could have been *famous*!  Why the 
aitch-ee-double-hockey-sticks didn't you publish this when you found it?
Seriously, man.

Even given the known weaknesses in MD5, it would take much more than a 
million documents to find MD5 collisions.  I can only conclude that the 
hash was being used incorrectly; most likely truncated (my wild-ass guess 
would be to 32 bits; a collision is likely with > 50% probability in a 
million document store for a hash of less than 40 bits).

I know the current state of the art here.  It's going to take more than 
just hearsay to convince me that full 128-bit MD5 collisions are likely. 
I believe there are only two or so known to exist so far, and those were 
found by a research team in China (which, yes, is fairly famous among the 
cryptographic community now after publishing a paper consisting of little 
apart from the two collisions themselves).

Please, let's talk about hash collisions responsibly.  I posted earlier 
about the *actual computed probability* of finding two files with an SHA-1 
collision before the sun goes supernova.  It's 10^28 to 1 against.
The recent cryptographic works has shown that there are certain situations 
where a decent amount of computer work (2^69 operations) can produce two 
sequences with the same hash, but these sequences are not freely chosen; 
they've got very specific structure.  This attack does not apply to 
(effectively) random files sitting in a SCM.
   http://www.schneier.com/blog/archives/2005/02/sha1_broken.html

That said, Linux's widespread use means that it may not be unimaginable 
for an attacker to devote this amount of resources to an attack, which 
would probably involve first committing some specially structured file to 
the SCM (but would Linus accept it?) and then silently corrupting said 
file via a SHA1 collision to toggle some bits (which would presumably Do 
Evil).  Thus hashes other than SHA1 really ought to be considered...

...but the cryptographic community has not yet come to a conclusion on 
what the replacement ought to be.  These attacks are so new that we don't 
really understand what it is about the structure of SHA1 which makes them 
possible, which makes it hard to determine which other hashes are 
similarly vulnerable.  It will take time.

I believe Linus has already stated on this list that his plan is to 
eventually provide a tool for bulk migration of an existing SHA1 git 
repository to a new hash type.   Basically munging through the repository 
in bulk, replacing all the hashes.  This seems a perfectly adequate 
strategy at the moment.
  --scott

WASHTUB Panama Minister Moscow explosives KUGOWN hack Marxist LPMEDLEY 
genetic immediate radar SCRANTON COBRA JANE KGB Shoal Bay atomic Bejing
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Re: SHA1 hash safety
  2005-04-16 14:58       ` C. Scott Ananian
@ 2005-04-16 15:11         ` Petr Baudis
  2005-04-16 15:36           ` C. Scott Ananian
  2005-04-16 15:49         ` ross
                           ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Petr Baudis @ 2005-04-16 15:11 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: omb, David Lang, Ingo Molnar, git

Dear diary, on Sat, Apr 16, 2005 at 04:58:15PM CEST, I got a letter
where "C. Scott Ananian" <cscott@cscott.net> told me that...
> On Sat, 16 Apr 2005, Brian O'Mahoney wrote:
> 
> >(1) I _have_ seen real-life collisions with MD5, in the context of
> >   Document management systems containing ~10^6 ms-WORD documents.
> 
> Dude!  You could have been *famous*!  Why the 
> aitch-ee-double-hockey-sticks didn't you publish this when you found it?
> Seriously, man.
> 
> Even given the known weaknesses in MD5, it would take much more than a 
> million documents to find MD5 collisions.  I can only conclude that the 
> hash was being used incorrectly; most likely truncated (my wild-ass guess 
> would be to 32 bits; a collision is likely with > 50% probability in a 
> million document store for a hash of less than 40 bits).
> 
> I know the current state of the art here.  It's going to take more than 
> just hearsay to convince me that full 128-bit MD5 collisions are likely. 
> I believe there are only two or so known to exist so far, and those were 
> found by a research team in China (which, yes, is fairly famous among the 
> cryptographic community now after publishing a paper consisting of little 
> apart from the two collisions themselves).

http://cryptography.hyperlink.cz/MD5_collisions.html

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Re: SHA1 hash safety
  2005-04-16 15:11         ` Petr Baudis
@ 2005-04-16 15:36           ` C. Scott Ananian
  2005-04-16 22:56             ` David Lang
  0 siblings, 1 reply; 30+ messages in thread
From: C. Scott Ananian @ 2005-04-16 15:36 UTC (permalink / raw)
  To: Petr Baudis; +Cc: omb, David Lang, Ingo Molnar, git

On Sat, 16 Apr 2005, Petr Baudis wrote:

>> I know the current state of the art here.  It's going to take more than
>> just hearsay to convince me that full 128-bit MD5 collisions are likely.
>
> http://cryptography.hyperlink.cz/MD5_collisions.html

OK, OK, I spoke too sloppily.  Let me rephrase:
   It's going to take more than just hearsay to convince me that full
   128-bit MD5 collisions *IN ARBITRARILY CHOSEN DOCUMENTS* are likely.

I could add, "WITHOUT SPECIAL EFFORT BY AN ATTACKER".

But you're right, I was too busy thrashing around with the basic 
probability cluestick to carefully distinguish MD5 (in which *collisions* 
can be found fairly easily now by an attacker, although not *preimages*) 
and SHA1 (which is what git is actually using, and still requires 2^69 
hash computations to collide).

And note again that these are not preimage attacks.  Even with MD5, an 
attacker can't arbitrarily change existing code in the Linux kernel by 
creating a malicious file with the same MD5 hash.

But extreme caution is necessary, because both of these hash mechanisms 
have been shown to be weak, and algorithms grow weaker with time, not 
stronger.

I think the only conclusion that can be made is that "one should not rely 
on the hash for security".  And I don't believe that we are.  We should be
careful to continue saying "branch 46f<mumble> *in Linus' tree*" instead 
of just "branch 46f<mumble>" and assuming that that is unique.  The 
security is provided by Linus' control over his repository, not by the 
hash.
   --scott

[The 'MD5 collisions in 15 minutes on a laptop' paper did surprise me.  I 
vaguely remember hearing about this before, but I'd forgotten just how 
broken MD5 is.  It's still a fine *hash* function; just not a terribly 
good *cryptographically secure* hash function.]

Israel PBSUCCESS $400 million in gold bullion President Nader jihad 
RNC LPMEDLEY agent HTKEEPER Cheney SEQUIN SARANAC Clinton biowarfare
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 14:58       ` C. Scott Ananian
  2005-04-16 15:11         ` Petr Baudis
@ 2005-04-16 15:49         ` ross
  2005-04-17  6:35           ` Horst von Brand
  2005-04-16 19:16         ` Paul Jackson
                           ` (2 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: ross @ 2005-04-16 15:49 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: omb, David Lang, Ingo Molnar, git

On Sat, Apr 16, 2005 at 10:58:15AM -0400, C. Scott Ananian wrote:
> Even given the known weaknesses in MD5, it would take much more than a 
> million documents to find MD5 collisions.  I can only conclude that the 
> hash was being used incorrectly; most likely truncated (my wild-ass guess 
> would be to 32 bits; a collision is likely with > 50% probability in a 
> million document store for a hash of less than 40 bits).

I've also seen non thread-safe GUID generation, using MD5m hit collisions:
but of course that was due to the fact that the code had thread safety
issues, not because anyone actually ever hit a MD5 collision...

Of course there are constructed cases of MD5 collision, but those are
pretty disinteresting.  Give me two files that have useful content and
the same hash, and then I'll be impressed.

Linus has already weighed in that he doesn't give a crap.  All the
crypto-babble about collision whitepapers is uninteresting without a
repo that has real collisions.  git is far too cool as is - prove I
should be concerned.

-- 
Ross Vandegrift
ross@lug.udel.edu

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
	--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 14:58       ` C. Scott Ananian
  2005-04-16 15:11         ` Petr Baudis
  2005-04-16 15:49         ` ross
@ 2005-04-16 19:16         ` Paul Jackson
  2005-04-16 21:35         ` Brian O'Mahoney
  2005-04-16 22:46         ` David Lang
  4 siblings, 0 replies; 30+ messages in thread
From: Paul Jackson @ 2005-04-16 19:16 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: omb, david.lang, mingo, git

Scott wrote:
> Please, let's talk about hash collisions responsibly.

Agreed.

Chasing down links from the one Petr provided:

  http://cryptography.hyperlink.cz/MD5_collisions.html

the best read I found was:

  MD5 To Be Considered Harmful Someday
  http://eprint.iacr.org/2004/357.pdf

As the author, Dan Kaminsky, states:

> it is far too easy to overestimate the risks described in this paper.

This paper does a good job of explaining the vulnerabilities
that MD5 has, currently (and yes, git uses SHA1 ...).

We have far greater vulnerabilities from intentional or accidental
coding errors, inadequately audited code, root exploits of user
(non-kernel) code, compilation and build tools, unreliable hardware
(how many of us use non-ECC memory - I do), poorly administered
systems, ...

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 14:58       ` C. Scott Ananian
                           ` (2 preceding siblings ...)
  2005-04-16 19:16         ` Paul Jackson
@ 2005-04-16 21:35         ` Brian O'Mahoney
  2005-04-18  7:43           ` Andy Isaacson
  2005-04-16 22:46         ` David Lang
  4 siblings, 1 reply; 30+ messages in thread
From: Brian O'Mahoney @ 2005-04-16 21:35 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: omb, David Lang, Ingo Molnar, git

Please see below:

C. Scott Ananian wrote:
> On Sat, 16 Apr 2005, Brian O'Mahoney wrote:
> 
>> (1) I _have_ seen real-life collisions with MD5, in the context of
>>    Document management systems containing ~10^6 ms-WORD documents.
> 
> 
> Dude!  You could have been *famous*!  Why the
> aitch-ee-double-hockey-sticks didn't you publish this when you found it?
> Seriously, man.

The MD5 has was fine, or at least the code (a) produced the correct
results on the published test cases, and, (b) was properly applied to
all bytes of the file(s). I was surprised when it happened, which is why
I bothered to post to this list at this time, so I make two more points

(1) These hashes were designed, to assist in the construction of digital
signatures, ie so it is hard to produce a document to hash to a known
hash value, and that with a defined document format so they are designed
(i) hash similar documents far apart, and (ii) be hard to reverse;

it says nothing about naturally ocurring collisions, ie where the
document is not constrained to be similar,

> 
> Even given the known weaknesses in MD5, it would take much more than a
> million documents to find MD5 collisions.  I can only conclude that the
> hash was being used incorrectly; most likely truncated (my wild-ass
> guess would be to 32 bits; a collision is likely with > 50% probability
> in a million document store for a hash of less than 40 bits).
> 
> I know the current state of the art here.  It's going to take more than
> just hearsay to convince me that full 128-bit MD5 collisions are likely.
> I believe there are only two or so known to exist so far, and those were
> found by a research team in China (which, yes, is fairly famous among
> the cryptographic community now after publishing a paper consisting of
> little apart from the two collisions themselves).

(2) I am not concerned with cryptography here, merely sound engineering
tradeoffs and the avoidance of _pain_in_the_ass_ when we do see a
random collision, [NB the 2^69 is to 'cause a collision in SHA1' not the
odds against such a collision] ... (below)

> 
> Please, let's talk about hash collisions responsibly.  I posted earlier
> about the *actual computed probability* of finding two files with an
> SHA-1 collision before the sun goes supernova.  It's 10^28 to 1 against.
> The recent cryptographic works has shown that there are certain
> situations where a decent amount of computer work (2^69 operations) can
> produce two sequences with the same hash, but these sequences are not
> freely chosen; they've got very specific structure.  This attack does
> not apply to (effectively) random files sitting in a SCM.
>   http://www.schneier.com/blog/archives/2005/02/sha1_broken.html
> 
> That said, Linux's widespread use means that it may not be unimaginable
> for an attacker to devote this amount of resources to an attack, which
> would probably involve first committing some specially structured file
> to the SCM (but would Linus accept it?) and then silently corrupting
> said file via a SHA1 collision to toggle some bits (which would
> presumably Do Evil).  Thus hashes other than SHA1 really ought to be
> considered...
>
> ..but the cryptographic community has not yet come to a conclusion on
> what the replacement ought to be.  These attacks are so new that we
> don't really understand what it is about the structure of SHA1 which
> makes them possible, which makes it hard to determine which other hashes
> are similarly vulnerable.  It will take time.
> 
> I believe Linus has already stated on this list that his plan is to
> eventually provide a tool for bulk migration of an existing SHA1 git
> repository to a new hash type.   Basically munging through the
> repository in bulk, replacing all the hashes.  This seems a perfectly
> adequate strategy at the moment.

... [I say again, the problem here is NOT forgery of hashes, though SCO
like paranoia ...] ... but the hashes are a tiny part of the total
space, even for trivial patches, so that, providing _NOW_ for a longer
hash (and then why not use, say, SHA-256 for now as well) is prudent.

We do not want to revisit the plumbing, in the next 3-10 years, for 16
bytes per hash.

Finally I can do no more than quote Schneier:

"SHA-1 has been broken. Not a reduced-round version. Not a simplified
version. The real thing. ...

It's time for us all to migrate away from SHA-1.

Luckily, there are alternatives. The National Institute of Standards and
Technology already has standards for longer -- and harder to break --
hash functions: SHA-224, SHA-256, SHA-384, and SHA-512. They're already
government standards, and can already be used." and there are FOSS
implementations.

Or, put more simply by Jon Callas, PGP's CTO: "It's time to walk, but
not run, to the fire exits. You don't see smoke, but the fire alarms
have gone off." That's basically what he said last August [2004].

>  --scott
> 
> WASHTUB Panama Minister Moscow explosives KUGOWN hack Marxist LPMEDLEY
> genetic immediate radar SCRANTON COBRA JANE KGB Shoal Bay atomic Bejing
>                          ( http://cscott.net/ )
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 

-- 
mit freundlichen Grüßen, Brian.

Dr. Brian O'Mahoney
Mobile +41 (0)79 334 8035 Email: omb@bluewin.ch
Bleicherstrasse 25, CH-8953 Dietikon, Switzerland
PGP Key fingerprint = 33 41 A2 DE 35 7C CE 5D  F5 14 39 C9 6D 38 56 D5

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 13:29     ` Brian O'Mahoney
  2005-04-16 14:58       ` C. Scott Ananian
@ 2005-04-16 22:33       ` David Lang
  2005-04-17  3:23       ` Tkil
  2 siblings, 0 replies; 30+ messages in thread
From: David Lang @ 2005-04-16 22:33 UTC (permalink / raw)
  To: omb; +Cc: Ingo Molnar, git

On Sat, 16 Apr 2005, Brian O'Mahoney wrote:

> Three points:
> (1) I _have_ seen real-life collisions with MD5, in the context of
>    Document management systems containing ~10^6 ms-WORD documents.
> (2) The HMAC (ethernet-harware-address) of any interface _should_
>    help to make a unique Id.

you want a unique ID that can be computed directly from the file contents.

what file integrety programa (ala tripwire) do is to use multiple 
identification routines (aide uses MD4+MD5+filesize IIRC)

>
> David Lang wrote:
>> On Sat, 16 Apr 2005, Ingo Molnar wrote:
>>
>>> * David Lang <david.lang@digitalinsight.com> wrote:
>>>
>>>> this issue was raised a few days ago in the context of someone
>>>> tampering with the files and it was decided that the extra checks were
>>>> good enough to prevent this (at least for now), but what about
>>>> accidental collisions?
>>>>
>>>> if I am understanding things right the objects get saved in the
>>>> filesystem in filenames that are the SHA1 hash. of two legitimate
>>>> files have the same hash I don't see any way for both of them to
>>>> exist.
>>>>
>>>> yes the risk of any two files having the same has is low, but in the
>>>> earlier thread someone chimed in and said that they had two files on
>>>> their system that had the same hash..
>>>
>>>
>>> you can add -DCOLLISION_CHECK to Makefile:CFLAGS to turn on collision
>>> checking (disabled currently). If there indeed exist two files that have
>>> different content but the same hash, could someone send those two files?
>>
>>
>> remember that the flap over SHA1 being 'broken' a couple weeks ago was
>> not from researchers finding multiple files with the same hash, but
>> finding that it was more likly then expected that files would have the
>> same hash.
>>
>> there was qa discussion on LKML within the last year about useing MD5
>> hashes for identifying unique filesystem blocks (with the idea of being
>> able to merge identical blocks) and in that discussion it was pointed
>> out that collisions are a known real-life issue.
>>
>> so if collision detection is turned on in git, does that make it error
>> out if it runs into a second file with the same hash, or does it do
>> something else?
>>
>> David Lang
>>
>
> -- 
> Brian
>

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 14:58       ` C. Scott Ananian
                           ` (3 preceding siblings ...)
  2005-04-16 21:35         ` Brian O'Mahoney
@ 2005-04-16 22:46         ` David Lang
  2005-04-16 23:14           ` Paul Jackson
  4 siblings, 1 reply; 30+ messages in thread
From: David Lang @ 2005-04-16 22:46 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: omb, Ingo Molnar, git

that's the difference between CS researchers and sysadmins.

sysadmins realize that there are an infinante number of files that map to 
the same hash value and plan accordingly (becouse we KNOW we will run 
across them eventually), and don't see it as a big deal when we finally 
do.

CS researches quote statistics that show how hard it is to intentiallly 
create two files with the same hash and insist it just doesn't happen 
until presented by the proof, at which point it is a big deal.

a difference in viewpoints.

David Lang


  On Sat, 16 Apr 2005, C. Scott Ananian wrote:

> Date: Sat, 16 Apr 2005 10:58:15 -0400 (EDT)
> From: C. Scott Ananian <cscott@cscott.net>
> To: omb@bluewin.ch
> Cc: David Lang <david.lang@digitalinsight.com>, Ingo Molnar <mingo@elte.hu>,
>     git@vger.kernel.org
> Subject: Re: SHA1 hash safety
> 
> On Sat, 16 Apr 2005, Brian O'Mahoney wrote:
>
>> (1) I _have_ seen real-life collisions with MD5, in the context of
>>    Document management systems containing ~10^6 ms-WORD documents.
>
> Dude!  You could have been *famous*!  Why the aitch-ee-double-hockey-sticks 
> didn't you publish this when you found it?
> Seriously, man.
>
> Even given the known weaknesses in MD5, it would take much more than a 
> million documents to find MD5 collisions.  I can only conclude that the hash 
> was being used incorrectly; most likely truncated (my wild-ass guess would be 
> to 32 bits; a collision is likely with > 50% probability in a million 
> document store for a hash of less than 40 bits).
>
> I know the current state of the art here.  It's going to take more than just 
> hearsay to convince me that full 128-bit MD5 collisions are likely. I believe 
> there are only two or so known to exist so far, and those were found by a 
> research team in China (which, yes, is fairly famous among the cryptographic 
> community now after publishing a paper consisting of little apart from the 
> two collisions themselves).
>
> Please, let's talk about hash collisions responsibly.  I posted earlier about 
> the *actual computed probability* of finding two files with an SHA-1 
> collision before the sun goes supernova.  It's 10^28 to 1 against.
> The recent cryptographic works has shown that there are certain situations 
> where a decent amount of computer work (2^69 operations) can produce two 
> sequences with the same hash, but these sequences are not freely chosen; 
> they've got very specific structure.  This attack does not apply to 
> (effectively) random files sitting in a SCM.
>  http://www.schneier.com/blog/archives/2005/02/sha1_broken.html
>
> That said, Linux's widespread use means that it may not be unimaginable for 
> an attacker to devote this amount of resources to an attack, which would 
> probably involve first committing some specially structured file to the SCM 
> (but would Linus accept it?) and then silently corrupting said file via a 
> SHA1 collision to toggle some bits (which would presumably Do Evil).  Thus 
> hashes other than SHA1 really ought to be considered...
>
> ...but the cryptographic community has not yet come to a conclusion on what 
> the replacement ought to be.  These attacks are so new that we don't really 
> understand what it is about the structure of SHA1 which makes them possible, 
> which makes it hard to determine which other hashes are similarly vulnerable. 
> It will take time.
>
> I believe Linus has already stated on this list that his plan is to 
> eventually provide a tool for bulk migration of an existing SHA1 git 
> repository to a new hash type.   Basically munging through the repository in 
> bulk, replacing all the hashes.  This seems a perfectly adequate strategy at 
> the moment.
> --scott
>
> WASHTUB Panama Minister Moscow explosives KUGOWN hack Marxist LPMEDLEY 
> genetic immediate radar SCRANTON COBRA JANE KGB Shoal Bay atomic Bejing
>                         ( http://cscott.net/ )
>

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Re: SHA1 hash safety
  2005-04-16 15:36           ` C. Scott Ananian
@ 2005-04-16 22:56             ` David Lang
  2005-04-16 23:11               ` Paul Jackson
  0 siblings, 1 reply; 30+ messages in thread
From: David Lang @ 2005-04-16 22:56 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: Petr Baudis, omb, Ingo Molnar, git

On Sat, 16 Apr 2005, C. Scott Ananian wrote:

> Date: Sat, 16 Apr 2005 11:36:28 -0400 (EDT)
> From: C. Scott Ananian <cscott@cscott.net>
> To: Petr Baudis <pasky@ucw.cz>
> Cc: omb@bluewin.ch, David Lang <david.lang@digitalinsight.com>,
>     Ingo Molnar <mingo@elte.hu>, git@vger.kernel.org
> Subject: Re: Re: SHA1 hash safety
> 
> On Sat, 16 Apr 2005, Petr Baudis wrote:
>
>>> I know the current state of the art here.  It's going to take more than
>>> just hearsay to convince me that full 128-bit MD5 collisions are likely.
>> 
>> http://cryptography.hyperlink.cz/MD5_collisions.html
>
> OK, OK, I spoke too sloppily.  Let me rephrase:
>  It's going to take more than just hearsay to convince me that full
>  128-bit MD5 collisions *IN ARBITRARILY CHOSEN DOCUMENTS* are likely.
>
> I could add, "WITHOUT SPECIAL EFFORT BY AN ATTACKER".

you are missing the point.

I'm not talking about takeing one document (sched.c) and finding a 
replacement that can drop in without being noticed.

what I'm talking about is the chance that somewhere, sometime there will 
be two different documents that end up with the same hash

what git is doing (in very crude sysadminish terms) is to take all the 
files you care about, move them into a new directory where they are named 
by their hash with a symlink that replaces the origional file (and then a 
bunch of stuff to manage multiple versions of those symlinks)

if you are taking every file that you ever care about and loosing all 
refrence to it except by it's hash then when you get a second file that 
has the same hash you loose the contents of one of the two files (race 
condition over which file gets written into the storage directory last)

anywhere else that hashing algorithms are used people realize that there 
will be hash collisions and plan accordingly, however people tend to put 
blinders on when you say SHA1 or MD5 and decide that somehow the same 
thing cannot happen with these types of hashes.

they can, and eventually they will so you need to plan accordingly.

David Lang

-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 22:56             ` David Lang
@ 2005-04-16 23:11               ` Paul Jackson
  2005-04-16 23:18                 ` Martin Mares
  2005-04-17  4:38                 ` David A. Wheeler
  0 siblings, 2 replies; 30+ messages in thread
From: Paul Jackson @ 2005-04-16 23:11 UTC (permalink / raw)
  To: David Lang; +Cc: cscott, pasky, omb, mingo, git

> what I'm talking about is the chance that somewhere, sometime there will 
> be two different documents that end up with the same hash

I have vastly greater chance of a file colliding due to hardware or
software glitch than a random message digest collision of two legitimate
documents.

I've lost quite a few files in 25 years of computing to just
such glitches, sometimes without knowing it until months or years
later.

We've already computed the chances of a random pure hash collision
with SHA1 - it's something like an average of 1 collision every
10 billion years if we have 10,000 coders generating 1 new file
version every minute, non-stop, 24 hours a day, 365 days a year.

Get real.  There are _many_ sources of random error in our
tools.  When some sources are billions of billions times
more likely to occur, it makes sense to worry about them first.

Reminds me of the drunk looking under the lamp post for the
house keys he dropped - because that's where the light is.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 22:46         ` David Lang
@ 2005-04-16 23:14           ` Paul Jackson
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Jackson @ 2005-04-16 23:14 UTC (permalink / raw)
  To: David Lang; +Cc: cscott, omb, mingo, git

> sysadmins realize that there are an infinante number of files that map to 

Sysadmins know that there are an infinite ways for their
systems to crap out, and try to cover for the ones that
there is a snow balls chance in Hades of them seeing in
their lifetime.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 23:11               ` Paul Jackson
@ 2005-04-16 23:18                 ` Martin Mares
  2005-04-17  4:38                 ` David A. Wheeler
  1 sibling, 0 replies; 30+ messages in thread
From: Martin Mares @ 2005-04-16 23:18 UTC (permalink / raw)
  To: Paul Jackson; +Cc: git

Hi!

> We've already computed the chances of a random pure hash collision
> with SHA1 - it's something like an average of 1 collision every
> 10 billion years if we have 10,000 coders generating 1 new file
> version every minute, non-stop, 24 hours a day, 365 days a year.

GIT is safe even for the millions of monkeys writing Shakespeare :-)

				Have a nice fortnight
-- 
Martin `MJ' Mares   <mj@ucw.cz>   http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
Homo homini lupus, frater fratri lupior, bohemus bohemo lupissimus.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 13:29     ` Brian O'Mahoney
  2005-04-16 14:58       ` C. Scott Ananian
  2005-04-16 22:33       ` David Lang
@ 2005-04-17  3:23       ` Tkil
  2005-04-17  4:09         ` Paul Jackson
  2 siblings, 1 reply; 30+ messages in thread
From: Tkil @ 2005-04-17  3:23 UTC (permalink / raw)
  To: omb; +Cc: David Lang, Ingo Molnar, git

>>>>> "Brian" == Brian O'Mahoney <omb@khandalf.com> writes:

Brian> (1) I _have_ seen real-life collisions with MD5, in the context
Brian>     of Document management systems containing ~10^6 ms-WORD
Brian>     documents.

Was this whole-document based, or was it blocked or otherwise chunked?

I'm wondering, because (SFAIK) the MS word on-disk format is some
serialized version of one or more containers, possibly nested.  If
you're blocks are sized so that the first block is the same across
multiple files, this could cause collisions -- but they're the good
kind, that allow us to save disk space, so they're not a problem.

Are you saying that, within 1e7 documents, that you found two
documents with the same MD5 hash yet different contents?

That's not an accusation, btw; I'm just trying to get clarity on the
terminology.  I'm fascinated by the idea of using this sort of
content-addressable filesystem, but the chance of any collision at all
wigs me out.  I look at the probabilities, but still.

Thanks,
t.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-17  3:23       ` Tkil
@ 2005-04-17  4:09         ` Paul Jackson
  2005-04-17  4:43           ` Tkil
  0 siblings, 1 reply; 30+ messages in thread
From: Paul Jackson @ 2005-04-17  4:09 UTC (permalink / raw)
  To: Tkil; +Cc: omb, david.lang, mingo, git

> but the chance of any collision at all wigs me out.

Guess you're just going to get wigged out then.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 23:11               ` Paul Jackson
  2005-04-16 23:18                 ` Martin Mares
@ 2005-04-17  4:38                 ` David A. Wheeler
  2005-04-18  0:09                   ` Theodore Ts'o
  1 sibling, 1 reply; 30+ messages in thread
From: David A. Wheeler @ 2005-04-17  4:38 UTC (permalink / raw)
  To: Paul Jackson; +Cc: David Lang, cscott, pasky, omb, mingo, git

Paul Jackson wrote:
>>what I'm talking about is the chance that somewhere, sometime there will 
>>be two different documents that end up with the same hash
> 
> I have vastly greater chance of a file colliding due to hardware or
> software glitch than a random message digest collision of two legitimate
> documents.

The probability of an accidental overlap for SHA-1 for two
different files is absurdly remote; it's just not worth worrying about.

However, the possibility of an INTENTIONAL overlap is a completely
different matter.  I think the hash algorithm should change in the
future; I have a proposal below.

Someone has ALREADY broken into a server to modify the Linux kernel
code already, so the idea of an attack on kernel code
is not an idle fantasy. MD5 is dead, and SHA-1's work factor has
already been sufficiently broken that people have already been told
"walk to the exits" (i.e., DO NOT USE SHA-1 for new programs like git).

The fact that blobs are compressed first, with a length header
in front, _may_ make it harder to attack.  But maybe not.
I haven't checked for this case, but most decompression algorithms
I know of have a "don't change" mode that essentially just copies the
data behind it.  If the one used in git has such a mode
(I bet it does!), an attacker could use that mode to
make it MUCH easier to create an attack vector than it would
appear at first.  Now the attacker just needs to create a collision
(hmmm, where was that paper?).  Remember, you don't need to
run a hash algorithm over an entire file; you can precompute
to near the end, and then try your iterations from there.
A little hardware (inc. FPGAs) would speed the attack.

Of course, that assumes you actually
check everything to make sure that an attacker can't slip
in something different. After each rsync, are all new files'
hash values checked?  Do they uncompress to right length?
Do they have excess data after the decompression?
I'm hoping that sort of input-checking (since the data
might be from an attacker, if indirectly!) is already going on,
though I haven't reviewed the git source code.

While the jury's still out, the current belief by most folks
I talk to is that SHA-1 variants with more bits, such as SHA-256,
are the way to go now.  The SHA-1 attack simply reduces
the work factor (it's not a COMPLETE break), so adding
more bits is believed to increase the work factor
enough to counter it.

Adding more information to the hash can make attacking even harder.
Here's one idea: whenever that hash algorithm
switch occurs, create a new "hash" value as this:
   SHA-256 "+" uncompressed-length
Where SHA-256 is computed just like SHA-1 is now, e.g.,
SHA-256(file) where file = typecode + length + compressed data.
Leave the internal format as-is (with the length embedded as well).
This means that an attacker has to come up with an attack
that creates the same length uncompressed, yet has the same hash
of the compressed result. That's harder to do.
Length is also really, really cheap to compute :-).
That also might help the convince the "what happens if there's
an accidental collision" crowd: now, if the file lengths
are different, you're GUARANTEED that the hash values are different,
though that's not the best reason to do that.

One reason to think about switching sooner rather than later
is that it'd be really nice if the object store also included
signatures, so that in one fell swoop you could check who signed what
(and thus you could later on CONFIRM with much more certainty who
REALLY submitted a given change... say if it was clearly malicious).
If you switch hash algorithms, the signatures might not work,
depending on how you do it.

--- David A. Wheeler

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-17  4:09         ` Paul Jackson
@ 2005-04-17  4:43           ` Tkil
  2005-04-17  5:09             ` Paul Jackson
  0 siblings, 1 reply; 30+ messages in thread
From: Tkil @ 2005-04-17  4:43 UTC (permalink / raw)
  To: Paul Jackson; +Cc: omb, david.lang, mingo, git

>>>>> "Tkil" == Tkil <tkil@scrye.com> writes:

Tkil> but the chance of any collision at all wigs me out.

>>>>> "Paul" == Paul Jackson <pj@sgi.com> writes:

Paul> Guess you're just going to get wigged out then.

Wig wig.  :)

I didn't mean "wigs me out to the point I won't use it" but more of
"wigs me out so that I'm curious whether there are backup schemes
worth considering".

In particular, the comparisons between hash collisions and hardware
failure seem contrived -- if I have bad RAM, or a bad block on my HD,
I can recover it from known good sources.  But if the actual known
good source is structured in such a way that a particular set of data
cannot be represented, that bothers me.

In this case, the fact that it has to be the same length, same SHA-1,
correct C, and functionally similar C at that, makes for a comforting
cushion.  Further, git wouldn't be the only representation; there
would be periodic tarballs, different trees, etc.

On the other paw, if "effectively random" MS Word docs gave true MD5
collisions (when we have a proper MD5 hash computed over the entire
document) in a "mere" 1e7 space, that is interesting/scary.

(I was also trying to add a few factoids to the MSW comment, as their
structure could lead to collisions if (say) only the first 512 bytes
were considered -- it's possible that nothing but size and date might
change in that, and /those/ I can see colliding in 1e7 documents.)

Finally, I apologize for taking your time.  I'm just watching this
from the sidelines, and the questions above are just intellectual
curiosity.  :-/

(The only other thread I'm really following is people trying to chunk
files in a way that would increase storage efficiency; reading the
Venti paper, I was wondering how efficient it would be if a one-byte
addition at the top of the file would generate all-new blocks, while
the rsync-ish protocol seems to offer substantial relief.  But if the
"interesting history" fits in 10USD worth of HD, that might be enough.
Babble.)

Thanks,
t.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-17  4:43           ` Tkil
@ 2005-04-17  5:09             ` Paul Jackson
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Jackson @ 2005-04-17  5:09 UTC (permalink / raw)
  To: Tkil, David A. Wheeler; +Cc: omb, david.lang, mingo, git

I have nothing further to contribute to this subtopic.
Good luck with it.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 15:49         ` ross
@ 2005-04-17  6:35           ` Horst von Brand
  2005-04-18  2:07             ` Brian O'Mahoney
  2005-04-18 16:50             ` C. Scott Ananian
  0 siblings, 2 replies; 30+ messages in thread
From: Horst von Brand @ 2005-04-17  6:35 UTC (permalink / raw)
  To: ross; +Cc: C. Scott Ananian, omb, David Lang, Ingo Molnar, git

ross@jose.lug.udel.edu said:

[...]

> Linus has already weighed in that he doesn't give a crap.  All the
> crypto-babble about collision whitepapers is uninteresting without a
> repo that has real collisions.  git is far too cool as is - prove I
> should be concerned.

Just copy over a file (might be the first step in splitting it, or a
header file that is duplicated for convenience, ...)
-- 
Dr. Horst H. von Brand                   User #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-17  4:38                 ` David A. Wheeler
@ 2005-04-18  0:09                   ` Theodore Ts'o
  0 siblings, 0 replies; 30+ messages in thread
From: Theodore Ts'o @ 2005-04-18  0:09 UTC (permalink / raw)
  To: David A. Wheeler; +Cc: Paul Jackson, David Lang, cscott, pasky, omb, mingo, git

On Sun, Apr 17, 2005 at 12:38:37AM -0400, David A. Wheeler wrote:
> The probability of an accidental overlap for SHA-1 for two
> different files is absurdly remote; it's just not worth worrying about.
> 
> However, the possibility of an INTENTIONAL overlap is a completely
> different matter.  I think the hash algorithm should change in the
> future; I have a proposal below.
> 
> Someone has ALREADY broken into a server to modify the Linux kernel
> code already, so the idea of an attack on kernel code
> is not an idle fantasy. MD5 is dead, and SHA-1's work factor has
> already been sufficiently broken that people have already been told
> "walk to the exits" (i.e., DO NOT USE SHA-1 for new programs like git).

We're very clearly going to need a FAQ for git.

SHA-1's work factor has been decreased to 2**69 from 2**80 for
generating two messages that have the same hash value, WHERE THE HASH
VALUE AND THE MESSAGES ARE NOT UNDER THE ATTACKER'S CONTROL.  This is
not the same as a pre-image attack, where given a message M1 which
hashes to value H, the attacker can find another message M2 which also
hashes to value H.  In even if the attacker can do this, the result
has to have valid git metadata format, and also be valid C code.

So the the recent result which has weakened (but not broken) SHA-1's
use in digital signatures, and which has resulted in the advice to
"walk not run" for the exits, do not apply to git.

Can we guarantee that there won't be further innovations that may
break SHA-1?  Of course not.  But an attacker who wants to introduced
a trojan into the Linux kernel would have a much easier time doing a
"black bag job" --- i.e., breaking into Linus's house in Portland, and
then inserting a buggered patch into his master source tree.

If you're going to be a professional paranoid, it's best to worry
about the realistic attacks before stressing out over the unrealistic
ones.

						- Ted

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-17  6:35           ` Horst von Brand
@ 2005-04-18  2:07             ` Brian O'Mahoney
  2005-04-18 16:50             ` C. Scott Ananian
  1 sibling, 0 replies; 30+ messages in thread
From: Brian O'Mahoney @ 2005-04-18  2:07 UTC (permalink / raw)
  To: Horst von Brand; +Cc: ross, C. Scott Ananian, omb, David Lang, Ingo Molnar, git

Linus wants to drive ahead, and ignore the collision issue for now,
and has been dismissive of the risks, he wants a result not heart
searching, and the list comments exhibit a confusion with the
engineering problem of avoiding accidental collisions v deliberate sabotage.

Since this is not a show-stopper, and getting the BK replacement in place
is time critical, and if you look at the code it is easy to extend the
content key, LET US just leave this issue for now.

Horst von Brand wrote:
> ross@jose.lug.udel.edu said:
> 
> [...]
> 
> 
>>Linus has already weighed in that he doesn't give a crap.  All the
>>crypto-babble about collision whitepapers is uninteresting without a
>>repo that has real collisions.  git is far too cool as is - prove I
>>should be concerned.
> 
> 
> Just copy over a file (might be the first step in splitting it, or a
> header file that is duplicated for convenience, ...)

-- 
mit freundlichen Grüßen, Brian.

Dr. Brian O'Mahoney
Mobile +41 (0)79 334 8035 Email: omb@bluewin.ch
Bleicherstrasse 25, CH-8953 Dietikon, Switzerland
PGP Key fingerprint = 33 41 A2 DE 35 7C CE 5D  F5 14 39 C9 6D 38 56 D5

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-16 21:35         ` Brian O'Mahoney
@ 2005-04-18  7:43           ` Andy Isaacson
  2005-04-18 17:04             ` C. Scott Ananian
  2005-04-19 22:30             ` David Meybohm
  0 siblings, 2 replies; 30+ messages in thread
From: Andy Isaacson @ 2005-04-18  7:43 UTC (permalink / raw)
  To: omb; +Cc: git

[trimmed cc list, nobody wants to read this noise]

On Sat, Apr 16, 2005 at 11:35:39PM +0200, Brian O'Mahoney wrote:
> >> (1) I _have_ seen real-life collisions with MD5, in the context of
> >>    Document management systems containing ~10^6 ms-WORD documents.
> > 
> > Dude!  You could have been *famous*!  Why the
> > aitch-ee-double-hockey-sticks didn't you publish this when you found it?
> > Seriously, man.
> 
> The MD5 has was fine, or at least the code (a) produced the correct
> results on the published test cases, and, (b) was properly applied to
> all bytes of the file(s). I was surprised when it happened, which is why
> I bothered to post to this list at this time, so I make two more points

OK, I guess it's time for some remedial math.

There are 2^128 = 340282366920938463463374607431768211456 different MD5
hashes.

You are suggesting that you found a collision using ~1e6 = ~1,000,000
plaintexts.

Let's suppose there were actually 100,000,000 = 1e8 plaintexts, just in
case you underestimated the number.

Applying the birthday paradox, we have a 50% probability that you'd find
one collision if there were ~7,213,475,309,916,173 possible hash values.
If you extend the birthday argument ("what is the probability of a
collision if you take N samples from a set of size M?") you get the
following results, with N = 1e8:

50% (1 in 2) probability of collision in           7,213,475,309,916,173.
1% (1 in 100) probability of collision in        497,496,027,172,833,194.
.05% (1 in 1845) probability of collision in   9,223,372,036,854,775,806.

That's where my quick-and-dirty solver craps out, but we're still a
really long ways from

                     340,282,366,920,938,463,463,374,607,431,768,211,456.

A simple linear extrapolation (certainly wrong, but not by more than a
few dozen orders of magnitude) says that the odds would be
1 in 68,056,473,384,187,692,692 for the full MD5 hash (I'm not even
going to dignify that with a percentage).

I'm not going to do the sums, but I would hazard a guess that it's more
likely your PC suffered a cosmic-ray-induced memory fault - EACH OF THE
FOUR TIMES YOU TESTED IT - causing it to report the same MD5, than that
you actually discovered a collision with a measly million (or even
hundred million) plaintexts.

(Of course, I don't know how many tests of the hash you actually did.
But the point stands.)

Hell, if you're *that* lucky, what are you doing in IT?  You could be
making a killing at the roulette table.

Or, even more likely, there was some other factor in the system (most
likely that it was only using a few bits, probably 32, of the hash
when looking for collisions) that resulted in a false alarm.

If you had actual evidence of a collision, I'd love to see it - even if
it's just the equivalent of
% md5 foo
d3b07384d113edec49eaa6238ad5ff00 foo
% md5 bar
d3b07384d113edec49eaa6238ad5ff00 bar
% cmp foo bar
foo bar differ: byte 25, line 1
%

But in the absence of actual evidence, we have to assume (just based on
the probabilities) that there was some error in your testing.

-andy

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-17  6:35           ` Horst von Brand
  2005-04-18  2:07             ` Brian O'Mahoney
@ 2005-04-18 16:50             ` C. Scott Ananian
  1 sibling, 0 replies; 30+ messages in thread
From: C. Scott Ananian @ 2005-04-18 16:50 UTC (permalink / raw)
  To: Horst von Brand; +Cc: ross, omb, David Lang, Ingo Molnar, git

On Sun, 17 Apr 2005, Horst von Brand wrote:

>> crypto-babble about collision whitepapers is uninteresting without a
>> repo that has real collisions.  git is far too cool as is - prove I
> Just copy over a file (might be the first step in splitting it, or a
> header file that is duplicated for convenience, ...)

This is not a collision.  This is a *feature*.
  --scott

payment UKUSA ODOATH AVBLIMP ESSENCE JUBILIST ASW AK-47 CABOUNCE Ortega 
PBPRIME North Korea anthrax Milosevic bomb Soviet  QKFLOWAGE Yeltsin
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-18  7:43           ` Andy Isaacson
@ 2005-04-18 17:04             ` C. Scott Ananian
  2005-04-19 22:30             ` David Meybohm
  1 sibling, 0 replies; 30+ messages in thread
From: C. Scott Ananian @ 2005-04-18 17:04 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: omb, git

On Mon, 18 Apr 2005, Andy Isaacson wrote:

> If you had actual evidence of a collision, I'd love to see it - even if
> it's just the equivalent of
> % md5 foo
> d3b07384d113edec49eaa6238ad5ff00 foo
> % md5 bar
> d3b07384d113edec49eaa6238ad5ff00 bar
> % cmp foo bar
> foo bar differ: byte 25, line 1
> %
>
> But in the absence of actual evidence, we have to assume (just based on
> the probabilities) that there was some error in your testing.

I've already had a long correspondence with this poster.  He claims that 
"this happened 7 years ago", involved a "commercial contract covered by 
Swiss Banking Law" (with caps!) and that, of course, he "certainly doesn't 
retain [his] client's documents", and even if he *did*, he wouldn't show 
them to *me*.

And then he was unable to comprehend that I couldn't accept his word alone 
as prima facie evidence that the laws of probability did not apply to him or 
his clients.

I've been a coder far too long to attribute to "The Mysterious Hand Of 
God" what can adequately be described by subtle programmer error.

The most reasonable explanation, given the (lack of) evidence, is that 
the programmer involved quickly took refuge in a (wildly improbable, but 
his clients'll never know) "MD5 collision" instead of buckling down and 
finding the bug in his code.
  --scott

ODOATH Ortega FBI SGUAT AEBARMAN India Peking ODACID operation RYBAT 
[Hello to all my fans in domestic surveillance] for Dummies KUCLUB
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-18  7:43           ` Andy Isaacson
  2005-04-18 17:04             ` C. Scott Ananian
@ 2005-04-19 22:30             ` David Meybohm
  2005-04-19 22:48               ` C. Scott Ananian
  1 sibling, 1 reply; 30+ messages in thread
From: David Meybohm @ 2005-04-19 22:30 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: omb, git

On Mon, Apr 18, 2005 at 12:43:23AM -0700, Andy Isaacson wrote:
> 
> I'm not going to do the sums, but I would hazard a guess that it's more
> likely your PC suffered a cosmic-ray-induced memory fault - EACH OF THE
> FOUR TIMES YOU TESTED IT - causing it to report the same MD5, than that
> you actually discovered a collision with a measly million (or even
> hundred million) plaintexts.

But doesn't this require assuming the distribution of MD5 is uniform,
and don't the papers finding collisions in less show it's not? So, your
birthday-argument for calculating the probability wouldn't apply, because
it rests on the assumption MD5 is uniform, and it isn't.

For example, say most people are married in June, get pregnant, and
there are more births around March, 9 months later, than in other
months. Then if you are born in March you have a higher chance of seeing
a collision of your birthday with someone else's. The same is true for
someone else born in March too, and this makes the chances of seeing a
collision for the whole function higher.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-19 22:30             ` David Meybohm
@ 2005-04-19 22:48               ` C. Scott Ananian
  2005-04-20 18:56                 ` David Meybohm
  0 siblings, 1 reply; 30+ messages in thread
From: C. Scott Ananian @ 2005-04-19 22:48 UTC (permalink / raw)
  To: David Meybohm; +Cc: Andy Isaacson, omb, git

On Tue, 19 Apr 2005, David Meybohm wrote:

> But doesn't this require assuming the distribution of MD5 is uniform,
> and don't the papers finding collisions in less show it's not? So, your
> birthday-argument for calculating the probability wouldn't apply, because
> it rests on the assumption MD5 is uniform, and it isn't.

No, the collision papers don't show this at all.
  --scott
atomic strategic HBDRILL SARANAC COBRA JUDY Ft. Meade assassination politics 
Mossad HOPEFUL ZPSEMANTIC DTFROGS HTKEEPER LITEMPO LIONIZER operation
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: SHA1 hash safety
  2005-04-19 22:48               ` C. Scott Ananian
@ 2005-04-20 18:56                 ` David Meybohm
  0 siblings, 0 replies; 30+ messages in thread
From: David Meybohm @ 2005-04-20 18:56 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: Andy Isaacson, omb, git

On Tue, Apr 19, 2005 at 06:48:57PM -0400, C. Scott Ananian wrote:
> On Tue, 19 Apr 2005, David Meybohm wrote:
> 
> >But doesn't this require assuming the distribution of MD5 is uniform,
> >and don't the papers finding collisions in less show it's not? So, your
> >birthday-argument for calculating the probability wouldn't apply, because
> >it rests on the assumption MD5 is uniform, and it isn't.
> 
> No, the collision papers don't show this at all.

I didn't mean they showed it directly. I meant by finding collisions in
MD5 quickly, MD5 would have to have some non-uniformity. But that's
nevertheless wrong because uniformness and collision finding ability
aren't related. Sorry to have wasted everyone's time.

Dave

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2005-04-20 18:52 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-04-16 12:24 SHA1 hash safety David Lang
2005-04-16 12:31 ` Ingo Molnar
2005-04-16 12:48   ` David Lang
2005-04-16 13:29     ` Brian O'Mahoney
2005-04-16 14:58       ` C. Scott Ananian
2005-04-16 15:11         ` Petr Baudis
2005-04-16 15:36           ` C. Scott Ananian
2005-04-16 22:56             ` David Lang
2005-04-16 23:11               ` Paul Jackson
2005-04-16 23:18                 ` Martin Mares
2005-04-17  4:38                 ` David A. Wheeler
2005-04-18  0:09                   ` Theodore Ts'o
2005-04-16 15:49         ` ross
2005-04-17  6:35           ` Horst von Brand
2005-04-18  2:07             ` Brian O'Mahoney
2005-04-18 16:50             ` C. Scott Ananian
2005-04-16 19:16         ` Paul Jackson
2005-04-16 21:35         ` Brian O'Mahoney
2005-04-18  7:43           ` Andy Isaacson
2005-04-18 17:04             ` C. Scott Ananian
2005-04-19 22:30             ` David Meybohm
2005-04-19 22:48               ` C. Scott Ananian
2005-04-20 18:56                 ` David Meybohm
2005-04-16 22:46         ` David Lang
2005-04-16 23:14           ` Paul Jackson
2005-04-16 22:33       ` David Lang
2005-04-17  3:23       ` Tkil
2005-04-17  4:09         ` Paul Jackson
2005-04-17  4:43           ` Tkil
2005-04-17  5:09             ` Paul Jackson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).