* parse_object does check_sha1_signature but not parse_object_buffer?
@ 2016-02-02  1:57 UTC
From: Mike Hommey <mh@glandium.org>
To: git

Hi,

You might or might not be aware of this thread:
https://groups.google.com/forum/#!topic/binary-transparency/f-BI4o8HZW0

Anyway, this got me to take a look around, and I noticed that
parse_object does SHA-1 validation through check_sha1_signature.
What surprised me is that parse_object_buffer doesn't, so we end up
with inconsistent behavior across commands:

    $ git init
    $ echo a > a ; echo b > b
    $ git add a b
    $ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
    a
    $ cp -f .git/objects/61/780798228d17af2d34fce4cfbdf35556832472 .git/objects/78/981922613b2afb6025042ff6bd878ac1994e85
    $ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
    b
    $ git show 78981922613b2afb6025042ff6bd878ac1994e85
    error: sha1 mismatch 78981922613b2afb6025042ff6bd878ac1994e85
    fatal: bad object 78981922613b2afb6025042ff6bd878ac1994e85

Shouldn't parse_object_buffer also do check_sha1_signature?

Mike
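[Editorial note: the "sha1 mismatch" in the transcript above is git recomputing the object name over a header plus the content and comparing it with the name used for lookup. A minimal sketch in Python, for reference; `git_hash_object` is an illustrative name, not a git API.]

```python
import hashlib

def git_hash_object(data, obj_type="blob"):
    # Git object names are SHA-1 over "<type> <size>\0<content>",
    # not over the raw content alone.
    header = f"{obj_type} {len(data)}".encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# The two blobs from the transcript above ("a\n" and "b\n"):
print(git_hash_object(b"a\n"))  # 78981922613b2afb6025042ff6bd878ac1994e85
print(git_hash_object(b"b\n"))  # 61780798228d17af2d34fce4cfbdf35556832472
```

After the `cp -f` in the transcript, the file named after the first hash contains data hashing to the second, which is exactly what check_sha1_signature catches.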
* Re: parse_object does check_sha1_signature but not parse_object_buffer?
@ 2016-02-02  2:06 UTC
From: Mike Hommey <mh@glandium.org>
To: git

On Tue, Feb 02, 2016 at 10:57:01AM +0900, Mike Hommey wrote:
> Hi,
>
> You might or might not be aware of this thread:
> https://groups.google.com/forum/#!topic/binary-transparency/f-BI4o8HZW0
>
> Anyway, this got me to take a look around, and I noticed that
> parse_object does SHA-1 validation through check_sha1_signature.
> What surprised me is that parse_object_buffer doesn't, so we end up
> with inconsistent behavior across commands:
>
> $ git init
> $ echo a > a ; echo b > b
> $ git add a b
> $ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
> a
> $ cp -f .git/objects/61/780798228d17af2d34fce4cfbdf35556832472 .git/objects/78/981922613b2afb6025042ff6bd878ac1994e85
> $ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
> b
> $ git show 78981922613b2afb6025042ff6bd878ac1994e85
> error: sha1 mismatch 78981922613b2afb6025042ff6bd878ac1994e85
> fatal: bad object 78981922613b2afb6025042ff6bd878ac1994e85
>
> Shouldn't parse_object_buffer also do check_sha1_signature?

Well, except cat-file doesn't use parse_object_buffer either...

Mike
* Re: parse_object does check_sha1_signature but not parse_object_buffer?
@ 2016-02-02  3:10 UTC
From: Junio C Hamano
To: Mike Hommey; Cc: git

Mike Hommey <mh@glandium.org> writes:

> Shouldn't parse_object_buffer also do check_sha1_signature?

In general, it shouldn't; its callers are supposed to do it as an
additional check when/if needed. Callers like the one in fsck.c do
not want to die after seeing one bad object; we want to report it
and keep checking other things.
* Re: parse_object does check_sha1_signature but not parse_object_buffer?
@ 2016-02-02  4:36 UTC
From: Mike Hommey <mh@glandium.org>
To: Junio C Hamano; Cc: git

On Mon, Feb 01, 2016 at 07:10:04PM -0800, Junio C Hamano wrote:
> Mike Hommey <mh@glandium.org> writes:
>
> > Shouldn't parse_object_buffer also do check_sha1_signature?
>
> In general, it shouldn't; its callers are supposed to do it as an
> additional check when/if needed. Callers like the one in fsck.c do
> not want to die after seeing one bad object; we want to report it
> and keep checking other things.

Shouldn't some things like, at least, `git checkout`, still check
the sha1s, though?

Mike
* Re: parse_object does check_sha1_signature but not parse_object_buffer?
@ 2016-02-02 19:25 UTC
From: Junio C Hamano
To: Mike Hommey; Cc: git

Mike Hommey <mh@glandium.org> writes:

> On Mon, Feb 01, 2016 at 07:10:04PM -0800, Junio C Hamano wrote:
>> Mike Hommey <mh@glandium.org> writes:
>>
>> > Shouldn't parse_object_buffer also do check_sha1_signature?
>>
>> In general, it shouldn't; its callers are supposed to do it as an
>> additional check when/if needed. Callers like the one in fsck.c do
>> not want to die after seeing one bad object; we want to report it
>> and keep checking other things.
>
> Shouldn't some things like, at least, `git checkout`, still check
> the sha1s, though?

That is a different issue--my answer was about the quoted question
regarding parse_object_buffer(). Its callers are supposed to do an
additional check when/if needed, and there may be codepaths that
currently use parse_object_buffer() that want to do their own check,
or call a different function that does the check for them.

Having said that, I do not necessarily think "git checkout" should
revalidate the object name. The repository that you use for your
daily work would have the same error/corruption rate as your working
tree files, and I do not think you constantly "validate" what is in
your working tree by comparing its contents with what you think
ought to be there.

If you are working on extremely poor quality disks and SSDs, it
might make sense to constantly revalidate the object data to catch
corruption early, as that is something we can do (as opposed to the
working tree files, for which you probably do not have anything to
catch bitflipping).

Unless you are placing your working tree on a filesystem with
checksums--but then your object data would also be protected against
corruption, so an extra revalidation at "git checkout" time would
not buy you much, if anything at all.

http://article.gmane.org/gmane.comp.version-control.git/283380 (not
necessarily the entire thread, but that exact article) is a
reasonable summary that illustrates the way we view object
integrity:

    So "index-pack" is the enforcement point, and the rest of the
    git commands generally assume that we can trust what is on disk
    (as it has either been generated by us, or checked by
    index-pack). The rest of the commands do not spend time
    checking that the on-disk contents are sane (though you can run
    git-fsck if you want to do that).

If anything, we may want to reduce the number of codepaths that call
check_sha1_signature() on data that we know we have read from our
own repository.

Even though I am not opposed to the idea of a "paranoid" mode that
revalidates the object name every time (and, if "git checkout" does
not do so currently, allowing it to revalidate when we are operating
under that mode), I do not think it should be on by default.

In fact, I suspect that the original justification for the call to
check_sha1_signature() in parse_object() might have been invalidated
by the restructuring of the code over the past 10 years. e9eefa67
([PATCH] Add function to parse an object of unspecified type (take
2), 2005-04-28) says:

    It also checks the hash of the file, due to its heritage as
    part of fsck-cache.

I.e. we do not need this call here, as long as we make sure that the
fsck codepath does not depend on parse_object() calling
check_sha1_signature() to validate the consistency between the data
and the object name.

In fact, we do this, which is quite suboptimal:

    static int fsck_sha1(const unsigned char *sha1)
    {
            struct object *obj = parse_object(sha1);
            if (!obj) {
                    errors_found |= ERROR_OBJECT;
                    return error("%s: object corrupt or missing",
                                 sha1_to_hex(sha1));
            }
            obj->flags |= HAS_OBJ;
            return fsck_obj(obj);
    }

This function is called for each loose object file we find in
fsck_object_dir(), and there are a few problems:

 * The parse_object() called from here issues its own error message
   and returns NULL; then you get another "corrupt or missing" error
   message, because this code cannot tell from the NULL which is the
   case.

 * The intent of the callchain to fsck_sha1() is to iterate over
   loose object files xx/x{38} and validate what is contained in
   them, but that behaviour is not guaranteed, because it calls
   parse_object(), which may get the object data from a packfile if
   the loose object is also in a packfile.

This function should instead take "path" (the only caller of this
function, fsck_loose(), has it), read the data in the file, hash the
data to validate that it matches "sha1", and then create the object
out of the data it read by calling parse_object_buffer().

I didn't check other callers of parse_object(), but I doubt that
they need or want a check_sha1_signature() call in the function.
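[Editorial note: the rework Junio proposes -- take the path, read the file, hash the decompressed data, and compare against the name derived from the path -- can be sketched as follows. `validate_loose_object` is a hypothetical helper for illustration; the real fix would live in fsck.c's C code.]

```python
import hashlib
import zlib
from pathlib import Path

def validate_loose_object(path):
    """Read a loose object file (.git/objects/xx/x{38}), decompress it,
    and recompute the SHA-1 of the "<type> <size>\\0<content>" payload.
    Returns (expected, actual) hex digests for comparison."""
    p = Path(path)
    expected = p.parent.name + p.name       # "xx" dir + 38-char filename
    data = zlib.decompress(p.read_bytes())  # loose objects are zlib streams
    actual = hashlib.sha1(data).hexdigest()
    return expected, actual
```

This reads the file exactly once and, unlike the parse_object() route, cannot be silently satisfied by a packed copy of the object.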
* Re: parse_object does check_sha1_signature but not parse_object_buffer?
@ 2016-02-04  7:19 UTC
From: Jeff King <peff@peff.net>
To: Junio C Hamano; Cc: Mike Hommey, git

On Tue, Feb 02, 2016 at 11:25:44AM -0800, Junio C Hamano wrote:

> Having said that, I do not necessarily think "git checkout" should
> revalidate the object name. The repository that you use for your
> daily work would have the same error/corruption rate as your working
> tree files, and I do not think you constantly "validate" what is in
> your working tree by comparing its contents with what you think
> ought to be there.

No, but there is a question of how much it costs to do so. In the
past, I've spent a lot of time trying to speed up object access, but
it was usually about replacing `parse_object` with `lookup_object`
or similar to avoid loading from disk in the first place, as most of
the time goes to zlib.

If we are accessing the object already (and obviously we have to for
something like checkout), I'm not sure what the marginal cost is of
computing the sha1 on the data as it passes through. It might be
significant, but I don't have numbers. If it's not, then I think it's
a nice feature that we would notice problems earlier rather than
later.

> If you are working on extremely poor quality disks and SSDs, it
> might make sense to constantly revalidate the object data to catch
> corruption early, as that is something we can do (as opposed to the
> working tree files, for which you probably do not have anything to
> catch bitflipping).

Hopefully if your working tree files bit-flip, then you notice via
`git diff`. I guess you wouldn't for stat-clean ones. But if a bit
flips in the forest, and it's not stat-dirty enough for anyone to
hear it, does it make a sound?

> http://article.gmane.org/gmane.comp.version-control.git/283380 (not
> necessarily the entire thread, but that exact article) is a
> reasonable summary that illustrates the way we view object
> integrity:
>
>     So "index-pack" is the enforcement point, and the rest of the
>     git commands generally assume that we can trust what is on disk
>     (as it has either been generated by us, or checked by
>     index-pack). The rest of the commands do not spend time
>     checking that the on-disk contents are sane (though you can run
>     git-fsck if you want to do that).

I think that's me you're quoting. I was specifically talking about
malicious tampering there. Which isn't to say I disagree with the
world-view you're proposing, but I think for random errors it's a
little more complicated.

We certainly should be checking incoming data (and we do).

For local repository operations, most of them are about reading
data, and there I generally favor performance over extra validation,
with the caveat that we should always be aware of the tradeoff. An
extra comparison to make sure we are not going out-of-bounds on a
pack .idx pointer is cheap. Loading a blob just to make sure its
sha1 is valid before we mention it in `diff --raw` output is stupid.
Checking the sha1 on objects we are otherwise accessing is somewhere
in between. :)

For local write operations, like repacking, we should err on the
careful side. And I think we do a good job of balancing performance
and validation there (e.g., we reuse deltas without reconstructing
the object, but _with_ a crc check on the delta data itself).

> In fact, we do this, which is quite suboptimal:
>
>     static int fsck_sha1(const unsigned char *sha1)
>     {
>             struct object *obj = parse_object(sha1);
>             if (!obj) {
>                     errors_found |= ERROR_OBJECT;
>                     return error("%s: object corrupt or missing",
>                                  sha1_to_hex(sha1));
>             }
>             obj->flags |= HAS_OBJ;
>             return fsck_obj(obj);
>     }
>
> This function is called for each loose object file we find in
> fsck_object_dir(), and there are a few problems: [...]
>
> This function should instead take "path" (the only caller of this
> function, fsck_loose(), has it), read the data in the file, hash the
> data to validate that it matches "sha1", and then create the object
> out of the data it read by calling parse_object_buffer().

Yeah, I agree. As it is written, we also end up loading the loose
objects twice (once in parse_object, and then later again in
fsck_object to do the real work). Your solution would fix that, too.

It looks like we _don't_ load loose commits twice, though. We never
turn off save_commit_buffer, so we happily cache the buffer for
every single commit, even though we won't need it again after
fsck_object returns.

I guess nobody has really noticed either issue, because repositories
large enough for it to make a difference will usually be packed.

-Peff
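[Editorial note: Peff's point above about the marginal cost of "computing the sha1 on the data as it passes through" can be illustrated with a small sketch: the hash is updated chunk by chunk during inflation, so validation adds a hash update per chunk rather than a separate pass over the data. `stream_and_hash` is an illustrative name, not git's implementation.]

```python
import hashlib
import zlib

def stream_and_hash(compressed, chunk_size=4096):
    """Inflate a zlib-compressed object in chunks, feeding each inflated
    chunk to SHA-1 as it passes through. Returns (content, hexdigest)."""
    sha = hashlib.sha1()
    chunks = []
    inflater = zlib.decompressobj()
    for i in range(0, len(compressed), chunk_size):
        piece = inflater.decompress(compressed[i:i + chunk_size])
        sha.update(piece)   # hash rides along with decompression
        chunks.append(piece)
    tail = inflater.flush()
    sha.update(tail)
    chunks.append(tail)
    return b"".join(chunks), sha.hexdigest()
```

Since zlib inflation dominates the cost of reading an object anyway, the interesting question Peff raises is whether the extra SHA-1 work on top of this loop is measurable in practice.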
* Re: parse_object does check_sha1_signature but not parse_object_buffer?
@ 2016-02-04 17:40 UTC
From: Junio C Hamano
To: Jeff King; Cc: Mike Hommey, git

Jeff King <peff@peff.net> writes:

> For local repository operations, most of them are about reading
> data, and there I generally favor performance over extra validation,
> with the caveat that we should always be aware of the tradeoff. An
> extra comparison to make sure we are not going out-of-bounds on a
> pack .idx pointer is cheap. Loading a blob just to make sure its
> sha1 is valid before we mention it in `diff --raw` output is stupid.
> Checking the sha1 on objects we are otherwise accessing is somewhere
> in between. :)
>
> For local write operations, like repacking, we should err on the
> careful side. And I think we do a good job of balancing performance
> and validation there (e.g., we reuse deltas without reconstructing
> the object, but _with_ a crc check on the delta data itself).

OK, that is a reasonable set of rules. We can say "checkout" is
writing out the contents to the filesystem, and move the "somewhere
in between" case closer to "err on the careful side".

>> In fact, we do this, which is quite suboptimal: [...]
>>
>> This function should instead take "path" (the only caller of this
>> function, fsck_loose(), has it), read the data in the file, hash the
>> data to validate that it matches "sha1", and then create the object
>> out of the data it read by calling parse_object_buffer().
>
> Yeah, I agree. As it is written, we also end up loading the loose
> objects twice (once in parse_object, and then later again in
> fsck_object to do the real work). Your solution would fix that, too.
> ...
> I guess nobody has really noticed either issue, because repositories
> large enough for it to make a difference will usually be packed.

I'll throw it in the leftover-bits list, then ;-)
Thread overview: 7+ messages

2016-02-02  1:57 parse_object does check_sha1_signature but not parse_object_buffer? Mike Hommey
2016-02-02  2:06 ` Mike Hommey
2016-02-02  3:10 ` Junio C Hamano
2016-02-02  4:36   ` Mike Hommey
2016-02-02 19:25     ` Junio C Hamano
2016-02-04  7:19       ` Jeff King
2016-02-04 17:40         ` Junio C Hamano