From mboxrd@z Thu Jan 1 00:00:00 1970 From: "David A. Wheeler" Subject: Re: SHA1 hash safety Date: Sun, 17 Apr 2005 00:38:37 -0400 Message-ID: <4261E84D.6040208@dwheeler.com> References: <20050416123155.GA19908@elte.hu> <4261132A.3090907@khandalf.com> <20050416151116.GC19099@pasky.ji.cz> <20050416161153.534b47d5.pj@sgi.com> Reply-To: dwheeler@dwheeler.com Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: David Lang , cscott@cscott.net, pasky@ucw.cz, omb@bluewin.ch, mingo@elte.hu, git@vger.kernel.org X-From: git-owner@vger.kernel.org Sun Apr 17 06:33:38 2005 Return-path: Received: from vger.kernel.org ([12.107.209.244]) by ciao.gmane.org with esmtp (Exim 4.43) id 1DN1TE-0004ZS-Fy for gcvg-git@gmane.org; Sun, 17 Apr 2005 06:33:20 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261253AbVDQEg5 (ORCPT ); Sun, 17 Apr 2005 00:36:57 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261255AbVDQEg5 (ORCPT ); Sun, 17 Apr 2005 00:36:57 -0400 Received: from cujo.runbox.com ([193.71.199.138]:12255 "EHLO cujo.runbox.com") by vger.kernel.org with ESMTP id S261253AbVDQEgx (ORCPT ); Sun, 17 Apr 2005 00:36:53 -0400 Received: from [10.9.9.110] (helo=snoopy.runbox.com) by greyhound.runbox.com with esmtp (Exim 4.34) id 1DN1WW-0003OU-3f; Sun, 17 Apr 2005 06:36:44 +0200 Received: from [70.17.101.238] (helo=[192.168.2.73]) by snoopy.runbox.com with asmtp (uid:258406) (Exim 4.34) id 1DN1WV-0005La-Ku; Sun, 17 Apr 2005 06:36:44 +0200 User-Agent: Mozilla Thunderbird 1.0.2-1.3.2 (X11/20050324) X-Accept-Language: en-us, en To: Paul Jackson In-Reply-To: <20050416161153.534b47d5.pj@sgi.com> X-Sender: 258406@vger.kernel.org Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Paul Jackson wrote: >>what I'm talking about is the chance that somewhere, sometime there will >>be two different documents that end up with the same hash > > I have vastly greater chance of a file colliding due to hardware or > software glitch than a random message digest collision of two legitimate > documents. The probability of an accidental overlap for SHA-1 for two different files is absurdly remote; it's just not worth worrying about. However, the possibility of an INTENTIONAL overlap is a completely different matter. I think the hash algorithm should change in the future; I have a proposal below. Someone has ALREADY broken into a server to modify the Linux kernel code already, so the idea of an attack on kernel code is not an idle fantasy. MD5 is dead, and SHA-1's work factor has already been sufficiently broken that people have already been told "walk to the exits" (i.e., DO NOT USE SHA-1 for new programs like git). The fact that blobs are compressed first, with a length header in front, _may_ make it harder to attack. But maybe not. I haven't checked for this case, but most decompression algorithms I know of have a "don't change" mode that essentially just copies the data behind it. If the one used in git has such a mode (I bet it does!), an attacker could use that mode to make it MUCH easier to create an attack vector than it would appear at first. Now the attacker just needs to create a collision (hmmm, where was that paper?). Remember, you don't need to run a hash algorithm over an entire file; you can precompute to near the end, and then try your iterations from there. A little hardware (inc. FPGAs) would speed the attack. Of course, that assumes you actually check everything to make sure that an attacker can't slip in something different. After each rsync, are all new files' hash values checked? Do they uncompress to right length? Do they have excess data after the decompression? I'm hoping that sort of input-checking (since the data might be from an attacker, if indirectly!) is already going on, though I haven't reviewed the git source code. While the jury's still out, the current belief by most folks I talk to is that SHA-1 variants with more bits, such as SHA-256, are the way to go now. The SHA-1 attack simply reduces the work factor (it's not a COMPLETE break), so adding more bits is believed to increase the work factor enough to counter it. Adding more information to the hash can make attacking even harder. Here's one idea: whenever that hash algorithm switch occurs, create a new "hash" value as this: SHA-256 "+" uncompressed-length Where SHA-256 is computed just like SHA-1 is now, e.g., SHA-256(file) where file = typecode + length + compressed data. Leave the internal format as-is (with the length embedded as well). This means that an attacker has to come up with an attack that creates the same length uncompressed, yet has the same hash of the compressed result. That's harder to do. Length is also really, really cheap to compute :-). That also might help the convince the "what happens if there's an accidental collision" crowd: now, if the file lengths are different, you're GUARANTEED that the hash values are different, though that's not the best reason to do that. One reason to think about switching sooner rather than later is that it'd be really nice if the object store also included signatures, so that in one fell swoop you could check who signed what (and thus you could later on CONFIRM with much more certainty who REALLY submitted a given change... say if it was clearly malicious). If you switch hash algorithms, the signatures might not work, depending on how you do it. --- David A. Wheeler