* Re: [PATCH] Implement fast hash-collision detection [not found] <1322546563.1719.22.camel@yos> @ 2011-11-29 9:07 ` Jeff King 2011-11-29 10:24 ` Ævar Arnfjörð Bjarmason ` (3 more replies) 2011-11-29 17:08 ` Shawn Pearce 1 sibling, 4 replies; 20+ messages in thread From: Jeff King @ 2011-11-29 9:07 UTC (permalink / raw) To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds [Your original message was almost certainly bounced by git@vger because it surpassed the 100K limit; I'll try to quote liberally for those who missed the original. You may want to split your content and repost.] On Mon, Nov 28, 2011 at 10:02:43PM -0800, Bill Zaumen wrote: > Maintains a database of CRCs of Git objects to allow SHA-1 hash > collisions to be detected with high probability (1 - 1/2^32) and with > little computational overhead. The CRCs cover the content of Git > objects, but not the header. For loose objects, these are stored in > subdirectories of GIT_DIRECTORY/objects/crcs, with each subdirectory's > name consisting of the first two hexadecimal digits of the > corresponding object's SHA-1 hash. For each pack file, FILE.pack, the > CRCs are stored in a FILE.mds, in the same order as the SHA-1 hashes > that appear in FILE.idx. Checks for hash collisions are made whenever > a new loose object is created. I'm confused. Is this about avoiding accidental collisions, or about avoiding malicious collision attacks? Let's assume the former for a minute. The usual way of considering collision likelihood is not by probabilities, but by the number of items that must be selected to achieve a >50% probability that there is a collision. Which is the square root of the number of total items, or half the bit-space. So we expect to see a collision in 160-bit SHA-1 after writing about 2^80 objects. The linux-2.6 repository has about 2.2 million objects after 6.5 years of development. If development continues at this pace, we would expect a collision in a mere 10^18 years. Assuming your 32-bit CRC is statistically independent of the SHA-1 value, it adds 32 bits to the hash, or 16 bits to the number of objects we expect to generate (i.e., 2^96). That bumps us up to 10^23 years. However, it may be of interest that the Sun is expected to burn out in a mere 10^10 years[1]. So I'm not sure there is really any point in adding a few bits to the hash length to detect an accidental collision. It's already fantastically unlikely. Adding another probability on top does make it less likely, but in the same ballpark of fantastic. You can argue about whether linux-2.6 is a representative sample, or whether the pace of object creation might increase. But the simple answer is that we're many orders of magnitude away from having to care. However, in your patch you write: > +Security-Issue Details > +---------------------- > + > +Without hash-collision detection, Git has a higher risk of data > +corruption due to the obvious hash-collision vulnerability, so the > +issue is really whether a usable vulnerability exists. Recent research > +has shown that SHA-1 collisions can be found in 2^63 operations or > +less. While one result claimed 2^53 operations, the paper claiming > +that value was withdrawn from publication due to an error in the > +estimate. Another result claimed a complexity of between 2^51 and 2^57 > +operations, and still another claimed a complexity of 2^57.5 SHA-1 > +computations. A summary is available at > +<http://hackipedia.org/Checksums/SHA/html/SHA-1.htm#SHA-1>. 
Given the > +number of recent attacks, possibly by governments or large-scale > +criminal enterprises > +(<http://www.csmonitor.com/World/terrorism-security/2011/0906/Iranian-government-may-be-behind-hack-of-Dutch-security-firm>, > +<http://en.wikipedia.org/wiki/Operation_Aurora>, > +<http://en.wikipedia.org/wiki/Botnet#Historical_list_of_botnets>), > +which include botnets with an estimated 30 million computers, there is > +reason for some concern: while generating a SHA-1 collision for > +purposes of damaging a Git repository is extremely expensive > +computationally, it is possibly within reach of very well funded > +organizations. 2^32 operations, even if the operations are as > +expensive as computing a SHA-1 hash of a modest source-code file, can > +be performed in a reasonably short period of time on the type of > +hardware widely used in desktop or laptop computers at present. With > +sufficient parallelism, 30 million personal computers sufficient for > +playing the latest video games could perform 2^56 operations in a > +reasonable time. ...which makes me think that you do care about malicious collisions. All of what you wrote above seems fairly accurate. Let's leave aside that those numbers are for a collision attack as opposed to a pre-image attack (collision attacks are hard to execute, as they require you to generate two "matched" objects, one good and one evil, and then convince somebody to first accept your good object, only to later replace it with the evil one). I have two concerns with this as a security mechanism: First, let us assume that the implementation details of your scheme work, and that git users will effectively be checking not just a SHA-1, but now a SHA-1 plus your additional digest. In that case, why use crc32 as the digest? It seems like a terrible choice. Assuming it's cryptographically secure, then it's adding a mere 16 bits to the numbers you mentioned above. IOW, it's at best a band-aid that pushes attacks off for a few years. But more importantly, it's _not_ secure, and can be trivially forged. But that's a relatively simple problem. crc32 could be replaced in your scheme with any of the SHA-2 family, or the upcoming SHA-3, or whatever. That brings me to my second concern: how does this alternative message digest have any authority? For example, your patch teaches the git protocol a new extension to pass these digests along with the object sha1s. But how do we know the server isn't feeding us a bad digest along with the bad object? The "usual" security model discussed in git is that of verifying a signed tag. Linus signs a tag and pushes it to a server. I fetch the tag, and can verify the signature on the tag. I want to know that I have the exact same objects that Linus had. But I can't assume the server is trustworthy; it may have fed me a bad object with a collided sha1. But what's in the signed part of the tag object? Only the sha1 of the commit the tag points to, but not the new digest. So an attacker can replace the commit with one that collides, and it can in turn point to arbitrary trees and blobs. You can fix this by including an extra header in the signed part of the tag that says "also, the digest of the commit I point to is X". Then you know you have the same commit that Linus had. But the commit points to a tree by its sha1. So you have to add a similar header in the commit object that says "also, the digest of the tree I point to is X". And ditto for all of the parent pointers, if you want to care about signing history. 
And then you have the same problem in the tree: each sub-tree and blob is referenced by its sha1. In other words, authority flows from the gpg-signed tag portion, and links in the chain to each object are made by referencing sha1s. Every time such a link is made, you need to also include the digest of the object, or you are giving the attacker a place to insert a collided object. For tag and commit objects, this actually isn't that hard; they have a relatively small number of links, and they have room for arbitrary headers. I.e., add a "tree-md-sha256" header that gives the expected sha-256 of the tree object referenced by sha-1 in the "tree" header. Older versions of git will skip over this header (though obviously they won't be smart enough to do the more-secure verification). However, it's harder for trees. Each entry needs to have the new digest added, but there simply isn't room in the format. So we would have to change the tree format, breaking interoperability with older versions of git. And all of your tree sha1's would change as soon as you wrote them with the new format. That's only slightly better than just swapping sha1 out for a better algorithm. One trick you could do is to include the digest of each blob in the commit object itself. This really bloats the size of commit objects, though. You can store a digest of their digests instead (which your patch actually calculates, but AFAICT does not actually insert into the commit object). That is small and relatively cheap (it turns commit from an O(1) operation into an O(number of files) operation, but the per-file constant is pretty inexpensive). But it comes at the expense of not being able to tell which of the constituent blobs was actually attacked. So I think all of that would work for verification starting at a signed tag that points to a commit or a blob. For a tag pointing straight to a tree, it could include the same "digest of digests" that the commit would. But it can never handle direct sha1 references outside of git. For example, a bug tracker or CI system that references a commit in git will do so by sha1. But that's an insecure link in the chain; you really need it to refer to both the sha1 _and_ the digest, and then you can verify that the object you have under that sha1 matches the digest. So you'd have to teach a whole generation of tools that they can't trust git sha1 ids, and that you need an "extended" id that includes the digest. At that point, I really wonder if a flag day to switch to a new repository format is all that bad. You'd have to either rewrite existing history (which means re-signing tags if you care about them!), or just decide to leave the new and old histories disconnected. Whereas a scheme based around an added-on digest could keep the old history connected. But at the same time, the old history will have zero value from a security perspective. If sha1 is broken, then those old signatures are worthless, anyway. And just switching to a new algorithm means the implementation remains very simple. -Peff [1] Fun fact of the day: if linux development continues at the same rate until the Sun burns out, there is a 1/2^60 chance of a collision! ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 9:07 ` [PATCH] Implement fast hash-collision detection Jeff King @ 2011-11-29 10:24 ` Ævar Arnfjörð Bjarmason 2011-11-29 10:29 ` Jeff King 2011-11-29 13:17 ` Nguyen Thai Ngoc Duy ` (2 subsequent siblings) 3 siblings, 1 reply; 20+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2011-11-29 10:24 UTC (permalink / raw) To: Jeff King; +Cc: Bill Zaumen, git, gitster, pclouds, spearce, torvalds On Tue, Nov 29, 2011 at 10:07, Jeff King <peff@peff.net> wrote: > However, it may be of interest that the Sun is expected to burn out in a > mere 10^10 years[1]. Off topic, but it's a lot sooner than that. The total age of the sun is around 10^10 (10 billion) years, but we're already ~4.6 billion years along that line. However, the Sun is currently in a stage of gradual heating until it turns into a red giant in ~5 billion years. In around 500 million years the earth will be uninhabitable as we know it, and in around 1 billion years the surface will be hot enough to have boiled all the oceans. In other words, the earth in a billion years will probably look similar to how Venus looks now. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 10:24 ` Ævar Arnfjörð Bjarmason @ 2011-11-29 10:29 ` Jeff King 0 siblings, 0 replies; 20+ messages in thread From: Jeff King @ 2011-11-29 10:29 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason; +Cc: git On Tue, Nov 29, 2011 at 11:24:18AM +0100, Ævar Arnfjörð Bjarmason wrote: > On Tue, Nov 29, 2011 at 10:07, Jeff King <peff@peff.net> wrote: > > > However, it may be of interest that the Sun is expected to burn out in a > > mere 10^10 years[1]. > > Off topic, but it's a lot sooner than that. The total age of the sun > is around 10^10 (10 billion) years, but we're already ~4.6 billion > years along that line. Yeah, I checked Wikipedia, but rounded up for simplicity. I did use 5 billion for my fun fact at the end of the email, which is close to accurate. > However, the Sun is currently in a stage of gradual heating until it > turns into a red giant in ~5 billion years. In around 500 million > years the earth will be uninhabitable as we know it, and in around 1 > billion years the surface will be hot enough to have boiled all the > oceans. In other words, the earth in a billion years will probably look > similar to how Venus looks now. Good point. If we want an accidental collision in the next 500 million years, we'd better step up the pace of development! -Peff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 9:07 ` [PATCH] Implement fast hash-collision detection Jeff King 2011-11-29 10:24 ` Ævar Arnfjörð Bjarmason @ 2011-11-29 13:17 ` Nguyen Thai Ngoc Duy 2011-11-29 15:23 ` Shawn Pearce 2011-11-29 14:04 ` Nguyen Thai Ngoc Duy 2011-11-29 21:56 ` Bill Zaumen 3 siblings, 1 reply; 20+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-11-29 13:17 UTC (permalink / raw) To: Jeff King; +Cc: Bill Zaumen, git, gitster, spearce, torvalds On Tue, Nov 29, 2011 at 4:07 PM, Jeff King <peff@peff.net> wrote: > However, it's harder for trees. Each entry needs to have the new digest > added, but there simply isn't room in the format. So we would have to > change the tree format, breaking interoperability with older versions of > git. And all of your tree sha1's would change as soon as you wrote them > with the new format. That's only slightly better than just swapping > sha1 out for a better algorithm. I think we could hide the new digest after the NUL at the end of the path name. C Git at least does not seem to care what comes after the NUL. -- Duy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 13:17 ` Nguyen Thai Ngoc Duy @ 2011-11-29 15:23 ` Shawn Pearce 0 siblings, 0 replies; 20+ messages in thread From: Shawn Pearce @ 2011-11-29 15:23 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Jeff King, Bill Zaumen, git, gitster, torvalds On Tue, Nov 29, 2011 at 05:17, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote: > On Tue, Nov 29, 2011 at 4:07 PM, Jeff King <peff@peff.net> wrote: >> However, it's harder for trees. Each entry needs to have the new digest >> added, but there simply isn't room in the format. So we would have to >> change the tree format, breaking interoperability with older versions of >> git. And all of your tree sha1's would change as soon as you wrote them >> with the new format. That's only slightly better than just swapping >> sha1 out for a better algorithm. > > I think we could hide the new digest after the NUL at the end of the path > name. C Git at least does not seem to care what comes after the NUL. No, you can't. The next byte after the NUL at the end of a path name is the SHA-1 of that entry. After those 20 bytes of SHA-1 data are consumed, the "mode SP name\0" of the next entry is parsed. There is not room in the tree format for any additional data. Period. At best you can modify the mode value to be a new octal code that is not recognized by current Git implementations. But that doesn't get you a whole lot of data, and it's a pretty risky change because it's rather incompatible. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 9:07 ` [PATCH] Implement fast hash-collision detection Jeff King 2011-11-29 10:24 ` Ævar Arnfjörð Bjarmason 2011-11-29 13:17 ` Nguyen Thai Ngoc Duy @ 2011-11-29 14:04 ` Nguyen Thai Ngoc Duy 2011-11-29 20:59 ` Jeff King 2011-11-29 21:56 ` Bill Zaumen 3 siblings, 1 reply; 20+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-11-29 14:04 UTC (permalink / raw) To: Jeff King; +Cc: Bill Zaumen, git, gitster, spearce, torvalds On Tue, Nov 29, 2011 at 4:07 PM, Jeff King <peff@peff.net> wrote: > That brings me to my second concern: how does this alternative message > digest have any authority? > > For example, your patch teaches the git protocol a new extension to pass > these digests along with the object sha1s. But how do we know the server > isn't feeding us a bad digest along with the bad object? > > The "usual" security model discussed in git is that of verifying a > signed tag. Linus signs a tag and pushes it to a server. I fetch the > tag, and can verify the signature on the tag. I want to know that I have > the exact same objects that Linus had. But I can't assume the server is > trustworthy; it may have fed me a bad object with a collided sha1. > > But what's in the signed part of the tag object? Only the sha1 of the > commit the tag points to, but not the new digest. So an attacker can > replace the commit with one that collides, and it can in turn point to > arbitrary trees and blobs. > > You can fix this by including an extra header in the signed part of the > tag that says "also, the digest of the commit I point to is X". Then you > know you have the same commit that Linus had. But the commit points to a > tree by its sha1. So you have to add a similar header in the commit > object that says "also, the digest of the tree I point to is X". And > ditto for all of the parent pointers, if you want to care about signing > history. And then you have the same problem in the tree: each sub-tree > and blob is referenced by its sha1. > > .. (Security ignorant speaking) Can we just hash all objects in a pack from bottom up, (replacing sha-1 in trees/commits/tags with the new digest in memory before hashing), then attach the new top digest to tag's content? The sender is required by the receiver to send new digests for all objects in the pack together with the pack. The receiver can then go through the same process to produce the top digest and match it with one saved in tag. I do agree that we should use stronger digests, not weaker, preferably a combination of them. -- Duy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 14:04 ` Nguyen Thai Ngoc Duy @ 2011-11-29 20:59 ` Jeff King 2011-11-30 13:35 ` Nguyen Thai Ngoc Duy 0 siblings, 1 reply; 20+ messages in thread From: Jeff King @ 2011-11-29 20:59 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Bill Zaumen, git, gitster, spearce, torvalds On Tue, Nov 29, 2011 at 09:04:06PM +0700, Nguyen Thai Ngoc Duy wrote: > > You can fix this by including an extra header in the signed part of the > > tag that says "also, the digest of the commit I point to is X". Then you > > know you have the same commit that Linus had. But the commit points to a > > tree by its sha1. So you have to add a similar header in the commit > > object that says "also, the digest of the tree I point to is X". And > > ditto for all of the parent pointers, if you want to care about signing > > history. And then you have the same problem in the tree: each sub-tree > > and blob is referenced by its sha1. > > > Can we just hash all objects in a pack from bottom up, (replacing > sha-1 in trees/commits/tags with the new digest in memory before > hashing), then attach the new top digest to tag's content? The sender > is required by the receiver to send new digests for all objects in the > pack together with the pack. The receiver can then go through the same > process to produce the top digest and match it with one saved in tag. I think that is conflating two different layers of git. The security for tags happens at the conceptual object db layer: you sign a tag, and that points to a commit, which points to a tree, and so on. The authenticity comes from the tag signature, but the integrity of each link in the chain is verifiable because of the hash property. The pack layer, on the other hand, is just an implementation detail about how those conceptual objects are stored. More than just your tag will be in a pack, and the contents of your tag may be spread across several packs (or even loose objects). So I don't think it's right to talk about packs at all in the signature model. If you wanted to say "make a digest of all of the sub-objects pointed to by the tag", then yes, that does work (security-wise). But it's expensive to calculate. Instead, you want to use a "digest of digests" as much as possible. Which is what git already does, of course; you hash the tree object, which contains the sha1s of the blobs. Git's conceptual model is fine. The only problem is that sha1 is potentially going to lose its security properties, weakening the links in the chain. So as much as possible, we want to insert additional links at the exact same places, but using a stronger algorithm. Does that make sense? -Peff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 20:59 ` Jeff King @ 2011-11-30 13:35 ` Nguyen Thai Ngoc Duy 2011-11-30 18:05 ` Junio C Hamano 2011-11-30 19:00 ` Bill Zaumen 0 siblings, 2 replies; 20+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-11-30 13:35 UTC (permalink / raw) To: Jeff King; +Cc: Bill Zaumen, git, gitster, spearce, torvalds On Wed, Nov 30, 2011 at 3:59 AM, Jeff King <peff@peff.net> wrote: > If you wanted to say "make a digest of all of the sub-objects pointed to > by the tag", then yes, that does work (security-wise). But it's > expensive to calculate. Instead, you want to use a "digest of digests" > as much as possible. Which is what git already does, of course; you hash > the tree object, which contains the sha1s of the blobs. Git's > conceptual model is fine. The only problem is that sha1 is potentially > going to lose its security properties, weakening the links in the chain. > So as much as possible, we want to insert additional links at the exact > same places, but using a stronger algorithm. What I'm thinking is whether it's possible to decouple two sha-1 roles in git, as object identifier and digest, separately. Each sha-1 identifies an object and an extra set of digests on the "same" object. The object database is extended to store all these new digests and the mapping between sha-1 and them. When we need to verify an object, given a sha-1, we rehash that object and check the resulting digest against the ones linked to the sha-1. These new digests would be "digests of digests", just like how we use sha-1. In fact, this new digest set could be just sha-1. We just don't hash trees/commits/tags as-is when computing these new digests. When a tree object is hashed, for example, a new tree object with new digests will be created for hashing (we keep the sha-1 <-> new digest mapping on disk). Think of duplicating the entire DAG with new digests as links instead of sha-1; we then keep the digests and drop the duplicate DAG. The day sha-1 is broken, a project can generate new digests from its old good repo and enforce developers to use new digests for verification instead of sha-1. sha-1 is still used by git as identifier after that day. Computing these digests is expensive, but for local, day-to-day use, we only need sha-1 as an identifier (correct me if I'm wrong here), so we can delay computing/storing these new digests until we absolutely need them (e.g. push/fetch). There is also an interesting case: assuming these digests are strong enough, we could replace sha-1 as the identifier in git with a cheaper hash. A new hash must fit in the 160-bit space that sha-1 takes and should have good distribution, of course. Projects with a lot of data may like it that way. -- Duy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-30 13:35 ` Nguyen Thai Ngoc Duy @ 2011-11-30 18:05 ` Junio C Hamano 2011-12-01 4:43 ` Nguyen Thai Ngoc Duy 2011-11-30 19:00 ` Bill Zaumen 1 sibling, 1 reply; 20+ messages in thread From: Junio C Hamano @ 2011-11-30 18:05 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy Cc: Jeff King, Bill Zaumen, git, gitster, spearce, torvalds Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > What I'm thinking is whether it's possible to decouple two sha-1 roles > in git, as object identifier and digest, separately. Why would it be a good thing? If you have a collided identifier, somebody has to choose which blob a particular tree wants to have at the path, and because the tree would not record anything but the identifier, you cannot. > ... > The day sha-1 is broken, a project can generate new digests from its > old good repo and enforce developers to use new digests for > verification instead of sha-1. sha-1 is still used by git as > identifier after that day. And an old blob that is identified with a SHA-1 now has a new blob that has different contents but happens to have the same SHA-1. How does Git decide which blob to use when a particular object is named by the SHA-1? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-30 18:05 ` Junio C Hamano @ 2011-12-01 4:43 ` Nguyen Thai Ngoc Duy 0 siblings, 0 replies; 20+ messages in thread From: Nguyen Thai Ngoc Duy @ 2011-12-01 4:43 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jeff King, Bill Zaumen, git, spearce, torvalds On Thu, Dec 1, 2011 at 1:05 AM, Junio C Hamano <gitster@pobox.com> wrote: > Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > >> What I'm thinking is whether it's possible to decouple two sha-1 roles >> in git, as object identifier and digest, separately. > > Why would it be a good thing? If you have a collided identifier, somebody > has to choose which blob a particular tree wants to have at the path, and > because the tree would not record anything but the identifier, you cannot. Accidental collision likelihood is small enough that we don't have to care about it. >> ... >> The day sha-1 is broken, a project can generate new digests from its >> old good repo and enforce developers to use new digests for >> verification instead of sha-1. sha-1 is still used by git as >> identifier after that day. > > And an old blob that is identified with a SHA-1 now has a new blob that > has different contents but happens to have the same SHA-1. How does Git > decide which blob to use when a particular object is named by the SHA-1? Again, I assume the likelihood that one content happens to have the same sha-1 as another is too low to care about. If two contents do collide, it must be an attack. Using the other digests, we do not allow malicious objects to enter in the first place. Once objects are in, they are safe to use. -- Duy ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-30 13:35 ` Nguyen Thai Ngoc Duy 2011-11-30 18:05 ` Junio C Hamano @ 2011-11-30 19:00 ` Bill Zaumen 1 sibling, 0 replies; 20+ messages in thread From: Bill Zaumen @ 2011-11-30 19:00 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Jeff King, git, gitster, spearce, torvalds [Will send a reply to Jeff's comment from last night with some clarifications and explanations later]. > What I'm thinking is whether it's possible to decouple two sha-1 roles > in git, as object identifier and digest, separately. Each sha-1 > identifies an object and an extra set of digests on the "same" object. > Object database is extended to store all these new digests and mapping > between sha-1 and them. When we need to verify an object, given an > sha-1, we rehash that object and check the result digest with the ones > linked to the sha-1. The patch I created (at least, a reasonable chunk of the code) kind of does that: it is very easy to change the CRC to whatever message digest one wants. I used a CRC primarily because I had the impression that people were very concerned about speed, but it is easy to change that to the message digest of your choice. In any case, it might be a good starting point if you want to try something in a different direction. Basically, when you create a loose object, in addition to getting a SHA-1 ID, you get a message digest that gets stored as well (in a separate file). When you index a pack file, you get an IDX file containing the SHA-1 ID plus a corresponding MDS file containing the message digest. Index-pack calculates the SHA-1 value from the object stored in the pack file, and the (additional) message digest is computed at the same time using the same data. Commands like verify-pack check both the IDX file and the MDS file for consistency with the matching pack file. The new message digest (the CRC in the patch) is used only in cases where a repository is being altered (e.g., a loose object or pack file is being created or a fetch, push, or pull operation) or some explicit verification operation is running (e.g., git verify-pack). Adding an additional header to the commit message is a good idea (I had actually tried that, but something went wrong, although one of you suggested what the problem might have been --- I can try again if there is some interest in pursuing that). It might be worth pointing out that you can use the SHA-1 hash of the contents of objects (e.g., without the Git object header) as an additional digest: I tried a test using two 128-byte files with the same MD5 hash, differing past the 20th byte, and deleted the first four bytes of each. With those bytes deleted, the hash collision went away. I doubt if there is a known efficient algorithm that can generate a hash collision for two files and for two other files that differ from the first set by deleting N bytes from both. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 9:07 ` [PATCH] Implement fast hash-collision detection Jeff King ` (2 preceding siblings ...) 2011-11-29 14:04 ` Nguyen Thai Ngoc Duy @ 2011-11-29 21:56 ` Bill Zaumen 2011-11-30 6:25 ` Jeff King 3 siblings, 1 reply; 20+ messages in thread From: Bill Zaumen @ 2011-11-29 21:56 UTC (permalink / raw) To: Jeff King; +Cc: git, gitster, pclouds, spearce, torvalds Thanks for mentioning the 100K limit, which I didn't know about. I'll have to see how to split it into two patches. The intent is to increase the cost of a malicious attack, which requires generating two different files with the same SHA-1 value, to detect such an attack early, and to slow such an attack down - because of Git's rule that the first object with a SHA-1 value is the one the repository has, if it takes longer to generate a collision than the time it takes to get the original object into all repositories (which is done manually by multiple individuals), the forged file will never appear in any "official" repository. The additional CRC (easily changed to whatever message digest one might prefer) makes a malicious attack far more difficult: the modified file has to have both the same SHA-1 hash (including the Git header) and the same CRC (not including the Git header). An efficient algorithm to do both simultaneously does not yet exist. So, if we could generate a SHA-1 collision in one second, it would presumably take billions of seconds (many decades of continuous computation) to generate a SHA-1 hash with the same CRC, and well before a year has elapsed, the original object should have been in all the repositories, preventing a forged object from being inserted. Of course, eventually you might need a real message digest. The weakness of a CRC as an integrity check is not an issue since it is never used alone: its use is more analogous to the few extra bits added to a data stream when error-detecting codes are used. I used a CRC in the initial implementation rather than a message digest because it is faster, and because the initial goal was to get things to work correctly. In any case, the patch does not eliminate any code in which Git already does a byte-by-byte comparison. In cases where Git currently assumes that two objects are the same because the SHA-1 hashes are the same, the patch compares CRCs as an additional test. Regarding your [Jeff's] second concern, "how does this alternative digest have any authority?" there are two things to keep in mind. First, it is a supplement to the existing digest. Second, any value of the CRC that is stored permanently (barring bugs, in my implementation, of course) is computed locally - when a loose object is created or when a pack file's index is created. At no point is a CRC that was obtained from another repository trusted. While the patch modifies Git so that it can send CRCs when using the git protocol, these CRCs are never stored, but are instead used only for cross checks. If one side or the other "lies", you get an error. To give a concrete example, during a fetch, the git protocol currently sends "have" messages that contain the SHA-1 hashes of commits. The extension allows two CRCs to be sent along with each hash.
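To illustrate the shape of the exchange (this is only an illustration - the exact field layout on the wire is whatever the patch defines, and the field names here are invented), a negotiation line that today looks like "have <40-hex commit SHA-1>" would instead carry the two locally computed values appended to it, roughly "have <40-hex commit SHA-1> <crc-of-commit-content> <crc-of-the-tree's-blob-CRCs>".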
If these do not match the local values (tested only if the local values exist), something is wrong and the server sends an error report to the client; the server never uses these CRCs for any other purpose, and it never sends its own CRCs to the client, because of backwards-compatibility issues. For objects that are transferred, you end up with a pack file, with index-pack called to build the index (and with the patch, the corresponding MDS file), but index-pack already does a byte-by-byte comparison to detect collisions - the comparison is much faster than the SHA-1 computation index-pack has to do anyway. Where this helps is when one is using multiple repositories. If you fetch a commit from repository B, which we'll assume has a forged blob (different content, but the original SHA-1 hash), and then run fetch using repository A, which has the same commit with the original blob, the original blob will not be transferred from repository A and the client will not be notified that there is an inconsistency - the protocol is "smart" enough to know that the client already has the commit and assumes there is nothing to do regarding it. BTW, regarding your [Jeff's] discussion about putting an additional header in commit messages - I tried that. The existing versions of Git didn't like it: barring a bug in my test code, it seems that Git expects headers in commit messages to be in a particular order and treats deviations from that to be an error. I even tried appending blank lines at the end of a commit, with spaces and tabs encoding an additional CRC, and that didn't work either - at least it never got through all the test programs, failing in places like the tests involving notes. In any case, you'd have to phase in such a change gradually, first putting in the code to read the new header if it is there, and subsequently (after ample time so that everyone is running a sufficiently new version) enabling the code to create the new header. Also, regarding "At that point, I really wonder if a flag day to switch to a new repository format is all that bad," if that turns out to be the decision, I'd recommend doing it sooner rather than later. The reason is cost, which grows with the number of git users and the number and size of Git repositories. Bill ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 21:56 ` Bill Zaumen @ 2011-11-30 6:25 ` Jeff King 2011-12-01 0:41 ` Bill Zaumen 0 siblings, 1 reply; 20+ messages in thread From: Jeff King @ 2011-11-30 6:25 UTC (permalink / raw) To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds On Tue, Nov 29, 2011 at 01:56:28PM -0800, Bill Zaumen wrote: > The additional CRC (easily changed to whatever message digest one might > prefer) makes a malicious attack far more difficult: the modified file > has to have both the same SHA-1 hash (including the Git header) and > the same CRC (not including the Git header). Only if the attack actually involves creating a collision on both. But I think the important attacks bypass your CRC anyway. Consider this attack scenario: 1. Linus signs a tag (or a commit) and pushes it to kernel.org. 2. kernel.org gets hacked, and the attacker replaces an object with an evil colliding version[1]. 3. I clone from kernel.org, and run "git tag --verify". Git says it's OK, because the signature checks out, but I have a bogus object. How does your CRC help? If I understand your scheme correctly, kernel.org will have told me the CRC of all of the objects during the clone. But that isn't part of what Linus signed, so the attacker in step 2 could just as easily have overwritten kernel.org's crc file, and the signature will remain valid. [1] This is an over-simplification, of course. Because the only even remotely feasible attacks on sha1 are birthday attacks, not pre-image attacks, there is a step 0 in which the attacker generates a colliding pair, convinces Linus to commit it, and then waits. Which is probably really hard, but for the purposes of this discussion, we assume the attacker is capable of inserting a colliding object maliciously into a repo you will fetch from. Otherwise, the integrity of sha1 isn't an issue at all. > An efficient algorithm to do both simultaneously does not yet exist. > So, if we could generate a SHA-1 collision in one second, it would > presumably take billions of seconds (many decades of continuous > computation) to generate a SHA-1 hash with the same CRC, and well > before a year has elapsed, the original object should have been in all > the repositories, preventing a forged object from being inserted. Of > course, eventually you might need a real message digest. This is wrong, for two reasons. 1. The method for generating an object that collides in both sha-1 and CRC is not necessarily to generate a colliding sha-1 and then do a pre-image attack on the CRC. It is to do a birthday attack on the sha-1 and the CRC together. Which halves the bit-strength of the CRC to 16 bits (just as we can generally find collisions in 160-bit sha1s in 2^80). 16 bits isn't a lot to add when you are trying to fix a broken cryptosystem (it's not broken yet, obviously, but when it does get broken, will it be because computing power reaches the 2^57 or so at which sha1 is broken, or will it be because a new weakness is found that drops sha1's bit-strength to something much lower?). This assumes that you can combine the two in a birthday attack. Certainly this analysis works against brute-force 2^80 sha1 collision attacks. But I haven't actually read the details of the sha1 attacks, so maybe some of the tweaking they do to get those results makes it harder. On the other hand, attacking CRC is far from hard, so I certainly wouldn't stake money that sha1 researchers couldn't tweak their attacks in a way that also allows finding CRC collisions.
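(To spell out the arithmetic behind the "16 bits" above: a brute-force birthday search over the concatenation of a 160-bit SHA-1 and a 32-bit CRC is a search over a 192-bit value, so it takes on the order of 2^(192/2) = 2^96 attempts - only a factor of 2^16 more than the 2^80 needed to brute-force a SHA-1 collision alone.)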
You say that an algorithm to do both simultaneously does not yet exist. But is that because it's hard, or simply because nobody has bothered trying? Anyway, all of that is just reiterating that CRC should not be used as a security function. It can easily be replaced in your scheme by sha-256, which does have the desired properties. 2. Your attack seems to be "find the sha-1 collision, publish one of your colliding objects (i.e., the innocent-looking half), then try to break the CRC". And then you claim that by the time you find the CRC, everybody will already have the object. But wouldn't a smarter attack be to first find the collision, including the CRC, and only _then_ start the attack? Then nobody will have the object. Moreover, it's not true that after a year everyone will have the object. People still run "git clone" against kernel.org. Those repos do not have the object. > The weakness of a CRC as an integrity check is not an issue since it > is never used alone: it's use is more analogous to the few extra bits > added to a data stream when error-detecting codes are used. I used a > CRC in the initial implementation rather than a message digest because > it is faster, and because the initial goal was to get things to work > correctly. In any case, the patch does not eliminate any code in > which Git already does a byte-by-byte comparison. In cases where Git > currently assumes that two objects are the same because the SHA-1 > hashes are the same, the patch compares CRCs as an additional test. Right. I don't claim that your scheme makes git any weaker. I just claim that it fails to solve the problems people are actually concerned about, and it adds a lot of complexity while doing so. > Regarding your [Jeff's] second concern, "how does this alternative > digest have any authority?" there are two things to keep in mind. > First, it is a supplement to the existing digest. Right, but we are assuming that sha1 is broken. That's the whole security problem. So the existing digest is not worth much. > Second, any value of the CRC that is stored permanently (baring bugs, > in my implementation, of course) is computed locally - when a loose > object is created or when a pack file's index is created. At no point > is a CRC that was obtained from another repository trusted. While the > patch modifies Git so that it can send CRCs when using the git > protocol, these CRCs are never stored, but are instead used only for > cross checks. If one side or the other "lies", you get an error. But if I don't already have the object, then I have nothing to compare against. So when I get it from kernel.org, I have to simply accept that the object I'm getting is good, and write it into my object db. > BTW, regarding your [Jeff's] discussion about putting an additional > header in commit messages - I tried that. The existing versions of > Git didn't like it: barring a bug in my test code, it seems that Git > expects headers in commit messages to be in a particular order and > treats deviations from that to be an error. Yes, the header has to go at the end of the existing headers. But I don't see any reason that would be a problem for the scheme I described. > I even tried appending blank lines at the end of a commit, with spaces > and tabs encoding an additional CRC, and that didn't work either - at > least it never got through all the test programs, failing in places > like the tests involving notes. Yes, git will helpfully trim whitespace in commit messages. 
With the current code, you can hide arbitrary bytes in a commit message after a NUL, but don't do that. It's not guaranteed to stay that way, and the appropriate place to add new information is in a header. > In any case, you'd have to phase in such a change gradually, first > putting in the code to read the new header if it is there, and > subsequently (after ample time so that everyone is running a > sufficiently new version) enabling the code to create the new header. Current git should ignore headers that it doesn't understand. I haven't tested this, but Junio recently has been experimenting with gpg-signature lines in commits, and I'm pretty sure he checked that older gits properly ignore them. -Peff ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-30 6:25 ` Jeff King @ 2011-12-01 0:41 ` Bill Zaumen 2011-12-01 5:26 ` Jeff King 0 siblings, 1 reply; 20+ messages in thread From: Bill Zaumen @ 2011-12-01 0:41 UTC (permalink / raw) To: Jeff King; +Cc: git, gitster, pclouds, spearce, torvalds On Wed, 2011-11-30 at 01:25 -0500, Jeff King wrote: > On Tue, Nov 29, 2011 at 01:56:28PM -0800, Bill Zaumen wrote: > But I > think the important attacks bypass your CRC anyway. Consider this attack > scenario: > > 1. Linus signs a tag (or a commit) and pushes it to kernel.org. > > 2. kernel.org gets hacked, and the attacker replaces an object with > an evil colliding version[1]. > > 3. I clone from kernel.org, and run "git tag --verify". Git says it's > OK, because the signature checks out, but I have a bogus object. > > How does your CRC help? If I understand your scheme correctly, > kernel.org will have told me the CRC of all of the objects during the > clone. But that isn't part of what Linus signed, so the attacker in step > 2 could just as easily have overwritten kernel.org's crc file, and the > signature will remain valid. First, there is a misconception - the server will not tell you the CRC. The CRC will be computed locally by the client instead. Aside from that, suppose the attacker does what you suggest (providing a valid CRC so that git commands like verify-pack don't have an error to detect). You can't tell that something is wrong, but Linus can - the next time he does a fetch. If he fetches, the server sends some SHA-1 hashes and the client responds with 'have' or 'want' in a reply. If the client wants it, the client doesn't have a CRC; if the client sends 'have', the CRCs are available, so those get sent. The server then checks, notices a mismatch (probably in the CRC of the CRCs of all the blobs in the commit's tree), and generates an error. It's also possible to write some additional commands to (for example) fetch the SHA-1 hashes and CRCs from all remote repositories you use and compare these to make sure they are all consistent, something that can be run occasionally. > > An efficient algorithm to do both simultaneously does not yet exist. > > So, if we could generate a SHA-1 collision in one second, it would > > presumably take billions of seconds (many decades of continuous > > computation) to generate a SHA-1 hash with the same CRC, and well > > before a year has elapsed, the original object should have been in all > > the repositories, preventing a forged object from being inserted. Of > > course, eventually you might need a real message digest. > > This is wrong, for two reasons. > > 1. The method for generating an object that collides in both sha-1 and > CRC is not necessarily to generate a colliding sha-1 and then do a > pre-image attack on the CRC. It is to do a birthday attack on the > sha-1 and the CRC together. Which halves the bit-strength of the > CRC to 16 bits (just as we can generally find collisions in 160-bit > sha1s in 2^80). That result for a birthday attack assumes the goal is to find two files that will have the same SHA-1 value (or SHA-1 + CRC). The case I was talking about (apologies if I did not state this clearly) is that you have an existing object and an existing CRC and you want to generate a second object with the same SHA-1 and same CRC as the first. A birthday attack doesn't work so well in that case - the number of tries is much higher than 2^(half the number of bits in the digest).
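(For reference, the generic numbers: against an ideal n-bit digest, finding some colliding pair takes on the order of 2^(n/2) evaluations, while matching one fixed, already-published value - a second pre-image - takes on the order of 2^n. For SHA-1 that is roughly 2^80 versus 2^160, ignoring any structural weaknesses in the hash itself.)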
http://en.wikipedia.org/wiki/Birthday_attack#Digital_signature_susceptibility has a discussion regarding digital signatures. The trick is for a person to create a large number of variations of a "fair" and "unfair" contract, and use a birthday attack to find a pair that have the same hash. The variations are typically inconsequential changes (extra spacing, commas, etc.). In the case I was discussing, a developer creates some code, commits it and pushes it to a shared repository - the developer is not given code by the attacker. The attacker can, however, see the code by fetching it. An attack then consists of generating a collision, changing the object in the attacker's local repository, and then pushing the original developer's commit (with the modified object) to another shared repository before someone else puts the correct objects into that repository. A birthday attack does not work in this case. There is one issue that this suggests, however - it is not clear if the 2^57 number given for the best SHA-1 attacks was for attempts to generate a new file with the same SHA-1 hash as an existing file or for a pair of files that have the same SHA-1 hash. If it is the latter, then an attack is significantly harder than I assumed as a worst case, but still possibly much, much better than brute force. http://en.wikipedia.org/wiki/Birthday_problem has an analysis of the birthday problem (the basis of a birthday attack) and clearly notes that this is different from the "same birthday as you" variation - you don't do nearly so well in that case. > Anyway, all of that is just reiterating that CRC should not be used > as a security function. It can easily be replaced in your scheme by > sha-256, which does have the desired properties. Oh, I'm perfectly happy with sha-256 (and indicated that in the documentation that is in the patch) - there's a tradeoff between error detection and speed, and I simply guessed that the community was more concerned with speed. > 2. Your attack seems to be "find the sha-1 collision, publish one of > your colliding objects (i.e., the innocent-looking half), then try > to break the CRC". And then you claim that by the time you find the > CRC, everybody will already have the object. > > But wouldn't a smarter attack be to first find the collision, including > the CRC, and only _then_ start the attack? Then nobody will have > the object. My assumption was that a legitimate developer wrote some code, put it in one of the remote repositories, and then an attacker downloaded that code and tried to get a modified version into a different remote repository. > > Moreover, it's not true that after a year everyone will have the > object. People still run "git clone" against kernel.org. Those > repos do not have the object. This isn't the problem - a clone is going to be an exact copy of an existing repository. I was referring to the case where a commit made it into one remote repository but not another - to get into the second one, someone else would have had to review the code, creating a window where an attacker with write access to the second repository could put the modified object there. > Right, but we are assuming that sha1 is broken. That's the whole > security problem. So the existing digest is not worth much.
The assumption was more that "J Random Hacker's" code would be trusted far less than code submitted by Linus, so if "J Random Hacker" can create a replacement file with the same SHA-1 hash as one in one of Linus' commits, others will initially assume that Linus wrote that code. But, Git uses a "first in wins" rule so the bad guy has to generate the replacement file and get it inserted before Linus' commit reaches all the shared repositories for the project. A SHA-1 collision is harmless if computed too late to get in. > > Second, any value of the CRC that is stored permanently (barring bugs, > > in my implementation, of course) is computed locally > But if I don't already have the object, then I have nothing to compare > against. So when I get it from kernel.org, I have to simply accept that > the object I'm getting is good, and write it into my object db. The value is computed locally but can be compared remotely. You (specifically) won't find anything while talking to kernel.org in this example, but when you try to fetch from a different repository, instead of just noting that you have the object, you'll get a notification that there was a collision, so you'll find the problem earlier than you would otherwise. Also, anyone who had the right object from kernel.org (the object before the repository was 'hacked') would get a warning when trying to fetch from kernel.org after the change. > Yes, the header has to go at the end of the existing headers. But I > don't see any reason that would be a problem for the scheme I described. Thanks for mentioning it - I tried putting the header in multiple locations and probably missed the one spot where it would work. > Current git should ignore headers that it doesn't understand. I haven't > tested this, but Junio recently has been experimenting with > gpg-signature lines in commits, and I'm pretty sure he checked that > older gits properly ignore them. Possibly there was another bug in the test I tried as well - something was preventing the directory cleanup in one of the 'notes' tests. Anyway, thanks for the comments. Bill ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-12-01 0:41 ` Bill Zaumen @ 2011-12-01 5:26 ` Jeff King 2011-12-02 2:59 ` Bill Zaumen 0 siblings, 1 reply; 20+ messages in thread From: Jeff King @ 2011-12-01 5:26 UTC (permalink / raw) To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds On Wed, Nov 30, 2011 at 04:41:15PM -0800, Bill Zaumen wrote: > Aside from that, suppose the attacker does what you suggest (providing > a valid CRC so that git commands like verify-pack don't have an error > to detect). You can't tell that something is wrong, but Linus can - > the next time he does a fetch. If he fetches, the server sends > some SHA-1 hashes and the client responds with 'have' or 'want' in a > reply. If the client wants it, the client doesn't have a CRC; if the > client sends 'have', the CRCs are available, so those get sent. The > server then checks, notices a mismatch (probably in the CRC of the CRCs > of all the blobs in the commit's tree), and generates an error. OK. I thought your claim was that this would help detect collision-related forgeries in the chain of hashes while verifying a signed tag. But your claim is only that your scheme gives people who already have the objects an additional chance to notice discrepancies between what the server is serving and what they have. And then they can notify other people of the problem in an out-of-band way (i.e., telling kernel.org admins that the system is hacked, or telling users not to fetch from there). Cryptographically speaking, I think that claim is sound, and you can certainly construct attack scenarios where this detection would help. However, quantifying the effectiveness is difficult. What is the likelihood of a malicious collided-object replacement being detected without your scheme? What is it with it? There are many questions that factor into guessing the latter. How often does Linus fetch from his own kernel.org repository? He would usually push, I would think. Even if he does fetch, he wouldn't be getting the old objects that he already has. I guess this is the reason for your digest-of-digests for each commit object sent? But what about objects that are no longer in current commits, but are in older ones? E.g., v1.0 of foo.c has sha1 X, and an attacker finds a collision and replaces it. Meanwhile, the tip of development now has replaced foo.c with X'. When Linus talks to the server, git will never care about X (it is neither being transmitted, nor part of a commit that is being transmitted). But people may still be cloning and using the v1.0 tag, assuming it is valid. What about the server being more clever about hiding the replacement object? E.g., instead of just breaking into kernel.org and inserting a replacement object, the attacker runs a malicious git-daemon that returns the bogus object to cloners, but the real object to fetchers. Thus there is nothing for Linus to detect, but new cloners get the malicious object. Or you could give the bogus object to people who are getting the object for the first time (since they presumably have no crc for that object), but otherwise use the real object. You could get around this deception by pretending to be a "victim"; i.e., cloning fresh and seeing what the server gives you, comparing it to your known-good repository.
Which is similar to what you suggest here: > It's also possible to write some additional commands to (for example) > fetch the SHA-1 hashes and CRCs from all remote repositories you use > and compare these to make sure they are all consistent, something that > can be run ocassionally. But we can already do that. Assume you have an existing repo "foo". To verify the copy at git://example.com/foo.git, do a fresh clone to "bar", and then compare the objects in "foo" to "bar", either byte-wise or by digest. > That result for birthday attack assumes the goal is to find two files > that will have the same SHA-1 value (or SHA-1 + CRC). The case I was > talking about (apologies if I did not state this clearly) is that you > have an existing object and an existing CRC and you want to generate > a second object with the same SHA-1 and same CRC as the first. A > birthday attack doesn't work so well in that case - the number of tries > is much higher than half the number of bits in the digest. Yes. This is called a "pre-image" attack (to be more specific, it is a "second pre-image attack", since you have the original object). And AFAIK, it's not feasible against SHA-1, nor will it be in the near future (even MD5, which is considered totally broken, does not have feasible pre-image attacks). > http://en.wikipedia.org/wiki/Birthday_attack#Digital_signature_susceptibility > has a discussion regarding digital signatures. The trick is for a > person to create a large number of variations of a "fair" and "unfair" > contract, and use a birthday attack to find a pair that have the same > hash. The variations are typically inconsequential changes (extra > spacing, commas, etc.) Right. This is the classic birthday attack. I don't keep up with the state of the art in collision attacks these days, but I think in practice they are usually executed by adding arbitrary binary garbage into a "hidden" spot in a file. Which makes it much harder to execute against text files like source code (as opposed to, say, a binary firmware blob). IIRC, the practical MD5 attacks involved postscript documents (with non-printing garbage sections that printers would ignore). > In the case I was discussing, a developer creates some code, commits > it and pushes it to a shared repository - the developer is not given > code by the attacker. The attacker can, however, see the code by > fetching it. An attack then consists of generating a collision, > change the object in the attacker's local repository, and then push > the original developer's commit (with the modified object) to another > shared repository before someone else puts the correct objects into > that repository. A birthday attack does not work in this case. Yeah, that is simply not feasible, and is not likely to become so anytime soon. > There one issue that this suggests however - it is not clear if the > 2^57 number given for the best SHA-1 attacks were attempts to generate > a new file with the same SHA-1 hash as an existing file or a pair of > files that have the same SHA-1 hash. If it is the latter, then an > attack is significantly harder than I assumed as a worst case, but > still possibly much, much better than brute force. The 2^57 number is for a collision of two new objects, not for a pre-image attack. AFAIK, there are currently no pre-image attacks on SHA-1 at all. 
I don't think there's a need to respond individually to the points in the rest of your email; they're all based on the assumption that the attacker is replacing a known-good written-by-Linus commit with one found in a pre-image attack.

-Peff

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-12-01 5:26 ` Jeff King @ 2011-12-02 2:59 ` Bill Zaumen 2011-12-02 17:00 ` Jeff King 0 siblings, 1 reply; 20+ messages in thread From: Bill Zaumen @ 2011-12-02 2:59 UTC (permalink / raw) To: Jeff King; +Cc: git, gitster, pclouds, spearce, torvalds

On Thu, 2011-12-01 at 00:26 -0500, Jeff King wrote:

> Cryptographically speaking, I think that claim is sound, and you can
> certainly construct attack scenarios where this detection would help.
> However, quantifying the effectiveness is difficult. What is the
> likelihood of a malicious collided-object replacement being detected
> without your scheme? What is it with it?
>
> There are many questions that factor into guessing the latter.
>
> How often does Linus fetch from his own kernel.org repository? He would
> usually push, I would think. Even if he does fetch, he wouldn't be
> getting the old objects that he already has. I guess this is the reason
> for your digest-of-digests for each commit object sent?

Yes - the digest of digests is used to check things that would not be sent.

> But what about
> objects that are no longer in current commits, but are in older ones?

That's a good question, of course. After Linus pushes a commit, he'll notify others, and if some fetch before the repository is hacked, they will detect an error on a subsequent fetch.

For fetches, the server reports a series of commits, and the client responds with 'have' for those it has, with the CRCs added so the server can check for a mismatch. I made minimal changes to fetch-pack.c and upload-pack.c: just adding the CRC fields to the messages already sent. The server asks about more commits than it actually transfers, but all the ones it asks about are tested. One could send additional 'have' replies if necessary (for ones the server didn't mention) but I didn't do that, partly for simplicity but also because I was looking at the fetch-pack.c and upload-pack.c code for the first time. If desired, such changes can be added.

I also do some similar checking when a commit is pushed - the server tells the client the last commit it has and the client will send the CRCs in its reply to allow the server to cross-check those. I didn't mention that before because only the latest is really checked. Again, I just changed a message format (backwards compatible, of course), but additional checking could be added if desired. You could also add options to check tips of branches and all commits that have tags (e.g., a v1.0 tag). All of that simply requires more work on commands such as fetch-pack, upload-pack, send-pack and receive-pack.

> What about the server being more clever about hiding the replacement
> object? E.g., instead of just breaking into kernel.org and inserting a
> replacement object, the attacker runs a malicious git-daemon that
> returns the bogus object to cloners, but the real object to fetchers.

That's really a server-security issue, not a git one. Perhaps repositories should be configured so that all the executables are on read-only partitions. It's an important question in general of course, but it is probably useful to distinguish attacks that put bad data on a server from ones that install new software.

> > It's also possible to write some additional commands to (for example)
> > fetch the SHA-1 hashes and CRCs from all remote repositories you use
> > and compare these to make sure they are all consistent, something that
> > can be run occasionally.
>
> But we can already do that.
> Assume you have an existing repo "foo". To verify the copy at
> git://example.com/foo.git, do a fresh clone to "bar", and then compare
> the objects in "foo" to "bar", either byte-wise or by digest.

Of course, but that is an expensive operation - in the case of Git transferring some 50 MBytes of data per repository. A command to fetch the SHA-1 ID and a CRC or message digest for each object would not only run faster, but should put a much lower load on the server.

Getting back to the birthday attack question (this is an area where your comments were very useful for me), there's a case I didn't consider. Suppose two developers bounce code back and forth by email and one of them commits it, but the other developer is a bad guy. The bad guy would then have had an opportunity to use a birthday attack by sending back subtly modified code (e.g., changes to how comments are formatted, additional blank lines, etc.) He can even put a humorous comment at the end of the file such as "the first 200 Chinese characters I learned" and then include the Chinese characters (I've tested this with gcc - the Chinese characters, represented in Unicode, print in an editor and are ignored by the compiler.) Unicode is a lot closer to binary data, so you have a lot of bits you can alter in a small amount of space, with each character requiring multiple bytes to represent it. The comment will look silly but innocuous. I think Linus Torvalds once suggested being suspicious of anything that looked like "line noise" in a patch. Non-western Unicode characters can serve the same function but look legitimate, at least to people who don't know the language and when coupled with some "social engineering" to set expectations.

As an example of how this attack might work, without breaking into a system, assume two programmers collaborating on a project both have write-access to the same shared repository.

1. The project is using Java, with a rule that all classes and methods that are protected or public be documented (so javadoc can create API documentation).

2. Programmer A emails some Java source code to programmer B with a request to edit the comments to improve them or fix any obvious mistakes.

3. Programmer B fixes the comments, but also creates a modified file with the same SHA-1 hash as the correct file in order to add some bugs or security flaws.

4. Programmer B creates a branch from an earlier version, adds some tests, puts the contents of the modified file into the directory tree under an obscure name, adds it and does a commit. B then pushes it, creating a new remote branch.

5. Programmer B then tells Programmer A that he'll have the modified file back quickly, but could he please fetch his new remote branch and run a test, and answer some questions about what happens, as B needs that information to finish his review of the documentation.

6. Programmer A tells B the results, so B knows that A has fetched the remote branch, and B then sends A the corrected file (not the modified one).

7. Programmer A reviews the file, notes that everything seems OK, specifically that only comments were changed, and runs commit -a followed by a push. Because Git tries to be smart, it will (I think) notice from the SHA-1 hashes and from the remote branches that the server already has the object, so there is no need to send it.

8. Programmer B fetches the changes and deletes his temporary branch, both locally and on the shared repository. He tells A that the temporary branch is deleted, so that A should run "git remote update --prune ..."
So what happens? Hopefully someone finds the problem, either through a source-code review or some QA testing, but regardless Programmer A may get the blame, as the evidence of any tampering has pretty much been erased. In the worst case, a release with a security hole goes out.

Why would Programmer B do that? Maybe he's leaving the company because he's hard to work with and is blaming Programmer A for the problem, and wants to "get back" at Programmer A by harming his career. But in any case he didn't have to break into the repository to get the effect he wanted. At least it is extremely hard to do in terms of computational resources.

Bill

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-12-02 2:59 ` Bill Zaumen @ 2011-12-02 17:00 ` Jeff King 0 siblings, 0 replies; 20+ messages in thread From: Jeff King @ 2011-12-02 17:00 UTC (permalink / raw) To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds

On Thu, Dec 01, 2011 at 06:59:04PM -0800, Bill Zaumen wrote:

> > What about the server being more clever about hiding the replacement
> > object? E.g., instead of just breaking into kernel.org and inserting a
> > replacement object, the attacker runs a malicious git-daemon that
> > returns the bogus object to cloners, but the real object to fetchers.
>
> That's really a server-security issue, not a git one. Perhaps
> repositories should be configured so that all the executables are on
> read-only partitions. It's an important question in general of
> course, but it is probably useful to distinguish attacks that put
> bad data on a server from ones that install new software.

I don't agree here. You have to assume that the attacker will ignore attacks you have blocked, but continue with ones you haven't (just to counter your example, why not replace the running git-daemon in-memory?). You can target the narrow window of attacks that compromise the on-disk repository without being able to execute arbitrary code. But I don't see a point. After the kernel.org hack, yes, people are interested in hardening kernel.org. But they are much more interested in cryptographic sources of authority that let us not have to trust kernel.org at all. Having some weird half-way trust just complicates things.

> > But we can already do that. Assume you have an existing repo "foo". To
> > verify the copy at git://example.com/foo.git, do a fresh clone to
> > "bar", and then compare the objects in "foo" to "bar", either byte-wise
> > or by digest.
>
> Of course, but that is an expensive operation - in the case of Git
> transferring some 50 MBytes of data per repository. A command to
> fetch the SHA-1 ID and a CRC or message digest for each object would
> not only run faster, but should put a much lower load on the server.

Yes, it is more expensive. But again, my threat model is that the server is not trusted to serve data accurately or consistently. So you can't come to the server and say "Hey, I'm doing a security verification. Can you send me the CRCs?" You _have_ to present yourself as one of the victims to be infected by the bad object, or a smart attacker will send you the unmodified data.

> Getting back to the birthday attack question (this is an area where
> your comments were very useful for me), there's a case I didn't
> consider.
> [elaborate birthday attack scenario]

From my quick reading of your scenario, yes, that is a possible attack. To me, though, it just highlights the need for either a non-colliding algorithm, or for better trust verification about the authors of objects (i.e., cryptographically strong trust).

-Peff

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection [not found] <1322546563.1719.22.camel@yos> 2011-11-29 9:07 ` [PATCH] Implement fast hash-collision detection Jeff King @ 2011-11-29 17:08 ` Shawn Pearce 2011-11-29 22:05 ` Jeff King 2011-11-30 4:01 ` Bill Zaumen 1 sibling, 2 replies; 20+ messages in thread From: Shawn Pearce @ 2011-11-29 17:08 UTC (permalink / raw) To: Bill Zaumen; +Cc: git, gitster, pclouds, peff, torvalds

On Mon, Nov 28, 2011 at 22:02, Bill Zaumen <bill.zaumen+git@gmail.com> wrote:

> Several years ago (in 2006) there was a discussion about whether
> SHA-1 is adequate given the very small but non-zero probability of a
> hash collision, particularly given the possibility of a malicious
> attempt to generate a collision. At roughly the same time, Git was
> modified to support "thin packs" during data transfers. These allow
> one to send deltas based on objects that are not in the pack file
> being transferred. As a result a previously undetected hash collision
> could result in a corrupted repository when (for example) the same
> delta is applied to different objects that have the same SHA-1 hash.

I don't think you understand how these thin packs are processed.

If the pack contains <100 objects, it is inflated to loose objects. If the receiving side (so client in fetch, server in push) already has an object by that SHA-1, the new object is discarded.

If the pack contains >=100 objects, and the receiving side already has the object, it is compared byte-for-byte to ensure the incoming copy exactly matches the already existing copy.

Either way the first object to arrive always wins.

The recipient has to trust that the remote side is providing it something reasonable. If the recipient has *ZERO* trust in the sender, then s/he should read the content of all newly arrived objects before compiling or executing them. This is one reason why Git does not run hooks that are transported as part of the repository.

If the recipient thinks reading the content is too onerous or impossible, then they have to make a trust assertion on the remote side. This trust assertion should be derived from the community, and not so much from the machine claiming the content is what it says it is.

We have yet to get around the halting problem, so we have yet to construct a machine that can verify those Linux kernel sources you downloaded don't contain a local root exploit (for example). Instead we have to trust the community of developers and users who work on and run that code to have confidence that the code works as expected, etc. We base our trust off reputable people making statements like "Linus' kernel at git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git is pretty good", and "kernel.org is where Linus pushes directly to, so it's reasonable to trust the kernel.org server".

The recipient should have some understanding of the remote server's security policies, or pay attention to notices posted by others who are also fetching and reviewing content from the same repository. At some level, the community using a repository from a given site should be policing that site and establishing trust that the host is not providing garbage content.

After an incident, it is possible to pick up again by rebuilding the environment from an already known repository that people trust. In the case of the recent kernel.org environment rebuild, that is exactly what they did: the community picked up again from Linus' personal repository.
DNS could be abused to send you the wrong IP for a site, but most people don't use random DNS servers; they have some level of trust in their DNS provider. DNSSEC is helping to improve the security of the name->IP translation process, and using protocols like HTTPS with SSL certificate verification can help to reduce the chances that a forged DNS entry sends you to an evil source, rather than the community-trusted one. (Although SSL certificates seem to be getting forged left and right these days, so again you can't really rely on strong cryptography to magically solve security problems when the attacker holds the private key you have decided to trust with no further verification.)

But trust aside, consider an object C that is sent as a delta to the remote side. The delta base B is not included in the pack, and is referenced by SHA-1. When the remote side processes delta C, it looks up a copy of base B from its own repository. We assume this content of B is correct, due to the "first to arrive wins" rule, and the community review/trust/notification process.

The inflated length of B is checked against a size that is stored in the front of the delta instruction stream that describes C. These lengths must match exactly; if they do not match then the delta application aborts, the pack is rejected, and any temporary data is removed from the filesystem.

As Peff pointed out elsewhere in this thread, the odds of a SHA-1 collision in a project are low, on the order of 1/(2^80). Although there are some attacks on message digest functions like MD5 or SHA-1 that might make it possible to generate a duplicate in 2^57 time, any that I have read require producing content of a different length than the original you are trying to replace. Assuming the copy of B on the remote system actually inflates and computes to the correct SHA-1 B, it probably does not also have the correct length if it is an object with the correct hash but the wrong content. So delta application should still be checking for collision with a 1/(2^80) probability.

Assuming the remote's copy of B passed the size check, the delta is applied on top, and the SHA-1 of the result buffer is computed. The attacker must craft the delta such that the SHA-1 of C is the result; otherwise connectivity checks will fail.

Assuming the attacker successfully stores a C' that has the right SHA-1, but wrong content... the community around that repository will eventually notice this and message that the source site cannot be trusted. I refer you back to the statement above about trusting the site you pull from, or trusting the users that you authorize to push into a repository.

But thin pack aside, this problem exists in any form of a packed object. An attacker can try to send object P' (as a non-delta) in place of P. SHA1(P') = SHA1(P), but the content differs. This is far easier to construct than the thin pack delta case you think is a problem, and is the most likely approach for an attacker to take. I refer you back to the statement above about trusting the site you pull from, or trusting the users that you authorize to push into a repository, or reading every object you receive.

Even if you magically fix the hash function somehow to decrease the odds of collision (e.g. by switching to a member of the SHA-2 or SHA-3 family), there is no way to prevent a bug or root exploit from entering the project except by never adding new code, or by carefully reviewing everything that is submitted, and building up a basis of trust around that review method.
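For what it's worth, the base-size and result-hash checks described above amount to roughly the following (an untested Python sketch of the idea, not the actual C code; apply_instructions stands in for the delta's copy/insert opcodes, and the "type size\0" prefix is the standard header git hashes in front of every object):

  import hashlib

  def apply_thin_delta(local_base, base_size_from_delta, obj_type,
                       expected_sha1, apply_instructions):
      # The delta header records the size its base must have; if our
      # local copy of B differs, the whole pack is rejected.
      if len(local_base) != base_size_from_delta:
          raise ValueError("base size mismatch: reject the pack")
      result = apply_instructions(local_base)
      # The reconstructed object must hash (with git's "type size\0"
      # header) to the object name the pack claimed.
      header = ("%s %d\0" % (obj_type, len(result))).encode()
      if hashlib.sha1(header + result).hexdigest() != expected_sha1:
          raise ValueError("result does not hash to the expected name")
      return result

A forged base with the right SHA-1 but a different length trips the first check; a forged base that survives it still has to produce a result that hashes to C.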
It is far more likely for an attacker to try and submit a plain-text patch to the Linux kernel mailing list that reads as completely correct, hashes to the correct SHA-1s when applied in Git, etc... but just "happens" to contain an off-by-one pointer bug in some weird case that allows the attacker to overflow a critical memory buffer and inject some code that can later be used to create an exploit. If they are ever "caught" they may just claim "I AM A MORON I AM SORRY I MISSED THAT BUFFER CHECK AND SO DID YOU DURING CODE REVIEW SO ITS NOT ALL MY FAULT LEAVE ME ALONE" and get away with it.

Trust. Review. Verify. I don't know about you, but I don't just pull random code from arbitrary sites on the Internet. Nor do I compile or execute that code on my workstation. I do trust some individuals based on their reputation on the Internet, or my past experiences working with their code. And I also trust some hosting environments like kernel.org, or GitHub, or code.google.com, to provide reasonably secure hosting, and to aggressively react to any event that might make it harder for me to trust the content they supply. And I also read a lot of code that I pull.

It really isn't the problem you try to claim it is.

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 17:08 ` Shawn Pearce @ 2011-11-29 22:05 ` Jeff King 2011-11-30 4:01 ` Bill Zaumen 1 sibling, 0 replies; 20+ messages in thread From: Jeff King @ 2011-11-29 22:05 UTC (permalink / raw) To: Shawn Pearce; +Cc: Bill Zaumen, git, gitster, pclouds, torvalds

On Tue, Nov 29, 2011 at 09:08:27AM -0800, Shawn O. Pearce wrote:

> As Peff pointed out elsewhere in this thread, the odds of a SHA-1
> collision in a project are low, on the order of 1/(2^80).

Minor nit: it's actually way less than that. You have to do on the order of 2^80 operations to get a 50% chance of a collision. But that's not the probability for a collision given a particular number of operations[1]. The probability of a SHA-1 collision among 10 million hashes (where linux-2.6 will be in a decade or two) is about 1/(2^115).

That doesn't change the validity of any of your points, of course. 1 in 2^80 and 1 in 2^115 are both in the range of "impossibly small enough not to care about". To continue our astronomy analogies, NASA estimates[2] the impact probability of most tracked asteroids to be in the 1-in-10^6 range (around 1 in 2^20). So getting a collision in linux-2.6 in the next decade has roughly the same odds as the Earth being hit by 5 or 6 large asteroids.

-Peff

[1] http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem
[2] http://neo.jpl.nasa.gov/risk/

^ permalink raw reply [flat|nested] 20+ messages in thread
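As a quick check of those figures, the standard birthday approximation behind [1], p ~= n^2 / 2^(b+1) for n inputs to a b-bit hash, reproduces both numbers; a small, untested sketch:

  import math

  def birthday_p(n, bits=160):
      # Approximate collision probability for n items in a 2^bits space.
      return n * n / 2.0 ** (bits + 1)

  for n in (10 * 1000 * 1000, 2 ** 80):
      p = birthday_p(n)
      print("n = 2^%.1f  ->  p ~= 1 / 2^%.1f" % (math.log2(n), -math.log2(p)))

  # n = 2^23.3  ->  p ~= 1 / 2^114.5   (the "about 1/(2^115)" above)
  # n = 2^80.0  ->  p ~= 1 / 2^1.0     (i.e., the ~50% point)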
* Re: [PATCH] Implement fast hash-collision detection 2011-11-29 17:08 ` Shawn Pearce 2011-11-29 22:05 ` Jeff King @ 2011-11-30 4:01 ` Bill Zaumen 1 sibling, 0 replies; 20+ messages in thread From: Bill Zaumen @ 2011-11-30 4:01 UTC (permalink / raw) To: Shawn Pearce; +Cc: git, gitster, pclouds, peff, torvalds

Note: for some reason my email is not showing up on the mailing list. I'm trying a different email address - previously my 'From' field contained a subaddress "+git" but gmail won't put that in the 'Sender' field, so possibly the email is being filtered for that reason.

On Tue, 2011-11-29 at 09:08 -0800, Shawn Pearce wrote:

> I don't think you understand how these thin packs are processed.

I think the confusion was due to me being a bit too terse. The documentation clearly states that thin packs allow deltas to be sent when the delta is based on an object that the server and client both have in common, given the commits each already has. If there is one server and one client, there isn't an issue.

The case I meant is the one in which a user does a fetch from one server, gets a forged blob, and then fetches from another server with the original blob, and with additional commits along the same branch. If a server bases the delta off of the original blob, and the client applies the delta to the forged blob, the client will most likely end up with a blob with a different SHA-1 hash than the one expected. Since an object in a tree is then missing (no object with the expected SHA-1 hash), the repository is corrupted.

The "first to arrive wins" policy isn't sufficient in one specific case: multiple remote repositories where new commits are added asynchronously, with the repositories out of sync possibly for days at a time (e.g., over a 3-day weekend). In this case, the first to arrive at one repository may not be the first to arrive at another, so what happens at a particular client in the presence of hash collisions is dependent on the sequence of remotes from which updates were fetched. The risk occurs in the window where the repositories are out of sync.

Regarding the kernel.org problem that you used as a separate example, while it was fortunately possible to rebuild things (and git provided significant advantages), earlier detection of the problem might have reduced the time for which kernel.org was down. Early detection of errors in general is a good practice if it can be done at a reasonable cost.

> Trust. Review. Verify.

While good advice in principle, you should keep in mind that there are a lot of people out there working at various companies who are not as capable as you are. Some of them are overworked and make mistakes because they've been working 16-hour days for weeks trying to meet a deadline. Given that, extra checks to catch problems early are probably a good idea if they don't impact performance significantly.

^ permalink raw reply [flat|nested] 20+ messages in thread