* sha-1 check in rev-list --verify-objects redundant? @ 2012-02-25 17:15 Nguyen Thai Ngoc Duy 2012-02-26 8:30 ` Junio C Hamano 0 siblings, 1 reply; 9+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-25 17:15 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List Hi, "git rev-list --verify-objects" calls parse_object() on all except commits. This function in turn does check_sha1_signature() which rehashes object content to verify the content matches its sha-1 signature. This is an expensive operation. I wonder if this is a redundant check. --verify-objects is called to verify new packs. index-pack is also (always?) called on new packs, and index-pack hashes all object content, saves the hashes as signature in pack index file. So they must match. Am I missing something here? -- Duy ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant? 2012-02-25 17:15 sha-1 check in rev-list --verify-objects redundant? Nguyen Thai Ngoc Duy @ 2012-02-26 8:30 ` Junio C Hamano 2012-02-26 9:11 ` Junio C Hamano 2012-02-26 11:11 ` Nguyen Thai Ngoc Duy 0 siblings, 2 replies; 9+ messages in thread From: Junio C Hamano @ 2012-02-26 8:30 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > ... I wonder if this is a > redundant check. --verify-objects is called to verify new packs. I do not think --verify-objects does not have anything to do with verifying the integrity of packs, whether new or old. The check is about the integrity of the *history* we _might_ already have on our side, when we find ourselves wanting to fetch up to a commit $X, whose reachability from the tips of our refs (i.e. the objects that are guaranteed to be present in our repository) is unknown, and we somehow already have the commit $X itself in the repository. We cannot just declare victory upon seeing commit $X and omit fetching the history leading to the commit, because we may or may not have its parent commit object, or the tree object that is recorded in it (it may be that we killed an HTTP walker after we fetched $X but not its parents or trees). We need to walk back from $X until we hit one of the tips of our refs, and while doing so, we also need to make sure the trees and blobs referenced from the walked commits are also healthy. As 5a48d24 (rev-list --verify-object, 2011-09-01) explains, we used to do this with --objects instead, but that check does not even make sure blobs exist [*1*] let alone checking to see if these blobs were healthy. The whole point of using --verify-objects instead of --objects is to make sure that we do not miss blob objects. [Footnote] *1* The --objects code reads the commits and trees in order to _list_ objects to the blob level, so implicitly, it validates that commits and trees reachable from the commit $X we happened to have in our repository, relying on the fact that we would error out if we fail to read them. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant? 2012-02-26 8:30 ` Junio C Hamano @ 2012-02-26 9:11 ` Junio C Hamano 2012-02-26 11:11 ` Nguyen Thai Ngoc Duy 1 sibling, 0 replies; 9+ messages in thread From: Junio C Hamano @ 2012-02-26 9:11 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List Junio C Hamano <gitster@pobox.com> writes: > Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > >> ... I wonder if this is a >> redundant check. --verify-objects is called to verify new packs. > > I do not think --verify-objects does not have anything to do with > verifying the integrity of packs, whether new or old. A typo/grammo that should be obvious from the context, sorry. The above should be "I do not think --verify-objects has anything to do with ..." ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant? 2012-02-26 8:30 ` Junio C Hamano 2012-02-26 9:11 ` Junio C Hamano @ 2012-02-26 11:11 ` Nguyen Thai Ngoc Duy 2012-02-26 21:37 ` Junio C Hamano 2012-02-27 13:01 ` Nguyen Thai Ngoc Duy 1 sibling, 2 replies; 9+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-26 11:11 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Sun, Feb 26, 2012 at 3:30 PM, Junio C Hamano <gitster@pobox.com> wrote: > Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > >> ... I wonder if this is a >> redundant check. --verify-objects is called to verify new packs. > > I do not think --verify-objects does not have anything to do with > verifying the integrity of packs, whether new or old. > > The check is about the integrity of the *history* we _might_ already have > on our side, when we find ourselves wanting to fetch up to a commit $X, > whose reachability from the tips of our refs (i.e. the objects that are > guaranteed to be present in our repository) is unknown, and we somehow > already have the commit $X itself in the repository. > > We cannot just declare victory upon seeing commit $X and omit fetching the > history leading to the commit, because we may or may not have its parent > commit object, or the tree object that is recorded in it (it may be that > we killed an HTTP walker after we fetched $X but not its parents or > trees). We need to walk back from $X until we hit one of the tips of our > refs, and while doing so, we also need to make sure the trees and blobs > referenced from the walked commits are also healthy. > > As 5a48d24 (rev-list --verify-object, 2011-09-01) explains, we used to do > this with --objects instead, but that check does not even make sure blobs > exist [*1*] let alone checking to see if these blobs were healthy. The > whole point of using --verify-objects instead of --objects is to make sure > that we do not miss blob objects. "rev-list --objects" does check for blob existence, in finish_object(). On the well-formedness, unless I'm mistaken, --verify-objects is _always_ used in conjunction with index-pack. --verify-object is not even documented. index-pack makes sure all (new) object signatures reflect their content. Commit and tree content are validated by rev-list walking them. So at least when --verify-objects is used with index-pack, I don't see the point rehashing every new object _again_. > [Footnote] > > *1* The --objects code reads the commits and trees in order to _list_ > objects to the blob level, so implicitly, it validates that commits and > trees reachable from the commit $X we happened to have in our repository, > relying on the fact that we would error out if we fail to read them. -- Duy ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant? 2012-02-26 11:11 ` Nguyen Thai Ngoc Duy @ 2012-02-26 21:37 ` Junio C Hamano 2012-02-27 0:48 ` Nguyen Thai Ngoc Duy 2012-02-27 13:01 ` Nguyen Thai Ngoc Duy 1 sibling, 1 reply; 9+ messages in thread From: Junio C Hamano @ 2012-02-26 21:37 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > On the well-formedness, unless I'm mistaken, --verify-objects is > _always_ used in conjunction with index-pack. Hmm, you are making my head hurt. Is the above "always" a typo of "never"? The static check_everything_connected() function in builtin/fetch.c is a direct callsite of "rev-list --verify-objects", and the function is used in two codepaths: * store_updated_refs() that is used after we receive and store objects from the other end. We may or may not have run index-pack in this codepath; in either case we need to make sure the other side did send everything that is needed to complete the history between what we used to have and what they claimed to supply us, to protect us from a broken remote side. * quickfetch() that is called even before we get any object from the other end, to optimize the transfer when we already have what we need. The latter is the original use to protect against unconnected island of chain I explained in the previous message, but the former is also abot the same protection, in a different callchain. In both cases, the check by --verify-objects is about completeness of the history (is everything connected to the tips of refs we have?), and is different from integrity of individual objects (is each individual object well formed and hash correctly?). Both kinds of sanity need to be checked, as they are orthogonal concepts. In order to check the history completeness, we need to read the objects that we walk during the check. I wouldn't be surprised if the codepath to do this is written overly defensive, taking a belt-and-suspender approach, and check the well-formedness of an object before it reads it to find out the other objects pointed by it. If we _know_ that we have checked the integrity of all the necessary individual objects before we start reading them in order to check the completeness of the history, there is an opportunity to optimize by teaching --verify-objects paths to optionally be looser than it currently is, to avoid checking the object integrity twice. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant? 2012-02-26 21:37 ` Junio C Hamano @ 2012-02-27 0:48 ` Nguyen Thai Ngoc Duy 2012-02-27 1:23 ` Junio C Hamano 0 siblings, 1 reply; 9+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-27 0:48 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Mon, Feb 27, 2012 at 4:37 AM, Junio C Hamano <gitster@pobox.com> wrote: > Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > >> On the well-formedness, unless I'm mistaken, --verify-objects is >> _always_ used in conjunction with index-pack. > > Hmm, you are making my head hurt. Is the above "always" a typo of > "never"? > > The static check_everything_connected() function in builtin/fetch.c is a > direct callsite of "rev-list --verify-objects", and the function is used > in two codepaths: > > * store_updated_refs() that is used after we receive and store objects > from the other end. We may or may not have run index-pack in this > codepath; in either case we need to make sure the other side did send > everything that is needed to complete the history between what we used > to have and what they claimed to supply us, to protect us from a broken > remote side. I stand corrected. --verify-objects is _usually_ used in conjunction with index-pack, when the media is a pack (i.e. no remote helpers) > * quickfetch() that is called even before we get any object from the > other end, to optimize the transfer when we already have what we need. > > The latter is the original use to protect against unconnected island of > chain I explained in the previous message, but the former is also abot the > same protection, in a different callchain. I think we can trust what we already have, so in the latter case (and the former when the medium is a pack), --objects should suffice. > In both cases, the check by --verify-objects is about completeness of the > history (is everything connected to the tips of refs we have?), and is > different from integrity of individual objects (is each individual object > well formed and hash correctly?). Both kinds of sanity need to be > checked, as they are orthogonal concepts. > > In order to check the history completeness, we need to read the objects > that we walk during the check. I wouldn't be surprised if the codepath to > do this is written overly defensive, taking a belt-and-suspender approach, > and check the well-formedness of an object before it reads it to find out > the other objects pointed by it. > > If we _know_ that we have checked the integrity of all the necessary > individual objects before we start reading them in order to check the > completeness of the history, there is an opportunity to optimize by > teaching --verify-objects paths to optionally be looser than it currently > is, to avoid checking the object integrity twice. Ok, will cook something. The reason I raised it is because --verify-objects --all on git.git could take ~1m10s, but if we assume object integrity is fine and skip it, it could drop to 10s (I suspect --objects gives the same number). -- Duy ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant? 2012-02-27 0:48 ` Nguyen Thai Ngoc Duy @ 2012-02-27 1:23 ` Junio C Hamano 0 siblings, 0 replies; 9+ messages in thread From: Junio C Hamano @ 2012-02-27 1:23 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: >> * quickfetch() that is called even before we get any object from the >> other end, to optimize the transfer when we already have what we need. >> >> The latter is the original use to protect against unconnected island of >> chain I explained in the previous message, but the former is also abot the >> same protection, in a different callchain. > > I think we can trust what we already have, so in the latter case (and > the former when the medium is a pack), --objects should suffice. Hrm, what you seem to be missing is that in the latter case, the objects we are walking to make sure the connectivity are *not* yet considered as part of "what we already have" (hence we can trust). The whole point of quickfetch is to turn these untrustworthy bits that happen to exist in our repository into a part of trustable history without having to fetch from the other side to complete them. In the former case, when we know that pack has been indexed with recent index-pack/unpack-objects that are more picky (didn't Shawn and I recently tighten them?), we may be able to do without object integrity checks. >> If we _know_ that we have checked the integrity of all the necessary >> individual objects before we start reading them in order to check the >> completeness of the history, there is an opportunity to optimize by >> teaching --verify-objects paths to optionally be looser than it currently >> is, to avoid checking the object integrity twice. > > Ok, will cook something. The reason I raised it is because > --verify-objects --all on git.git could take ~1m10s, but if we assume > object integrity is fine and skip it, it could drop to 10s (I suspect > --objects gives the same number). That is a big assumption that I personally do not want to sacrifice the integrity of my history on. I would 100% agree with the above with s/if we assume/if we know/, though. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant? 2012-02-26 11:11 ` Nguyen Thai Ngoc Duy 2012-02-26 21:37 ` Junio C Hamano @ 2012-02-27 13:01 ` Nguyen Thai Ngoc Duy 2012-02-27 18:41 ` Junio C Hamano 1 sibling, 1 reply; 9+ messages in thread From: Nguyen Thai Ngoc Duy @ 2012-02-27 13:01 UTC (permalink / raw) To: Junio C Hamano; +Cc: Git Mailing List On Sun, Feb 26, 2012 at 06:11:30PM +0700, Nguyen Thai Ngoc Duy wrote: > "rev-list --objects" does check for blob existence, in finish_object(). Eck.. I think "--quiet --verify-objects" becomes "--quiet --objects" because of this code: -- 8< -- traverse_commit_list(&revs, quiet ? finish_commit : show_commit, quiet ? finish_object : show_object, &info); -- 8< -- Unless that's intentional, shouldn't we apply this patch? --quiet's interfering with rev-list's business sounds weird to me. -- 8< -- diff --git a/builtin/rev-list.c b/builtin/rev-list.c index 264e3ae..95fb605 100644 --- a/builtin/rev-list.c +++ b/builtin/rev-list.c @@ -172,19 +172,18 @@ static void finish_object(struct object *obj, const struct name_path *path, const char *name, void *cb_data) { + struct rev_list_info *info = cb_data; if (obj->type == OBJ_BLOB && !has_sha1_file(obj->sha1)) die("missing blob object '%s'", sha1_to_hex(obj->sha1)); + if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT) + parse_object(obj->sha1); } static void show_object(struct object *obj, const struct name_path *path, const char *component, void *cb_data) { - struct rev_list_info *info = cb_data; - finish_object(obj, path, component, cb_data); - if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT) - parse_object(obj->sha1); show_object_with_name(stdout, obj, path, component); } -- 8< -- -- Duy ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant? 2012-02-27 13:01 ` Nguyen Thai Ngoc Duy @ 2012-02-27 18:41 ` Junio C Hamano 0 siblings, 0 replies; 9+ messages in thread From: Junio C Hamano @ 2012-02-27 18:41 UTC (permalink / raw) To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes: > On Sun, Feb 26, 2012 at 06:11:30PM +0700, Nguyen Thai Ngoc Duy wrote: >> "rev-list --objects" does check for blob existence, in finish_object(). > > Eck.. I think "--quiet --verify-objects" becomes "--quiet --objects" > because of this code: > > -- 8< -- > traverse_commit_list(&revs, > quiet ? finish_commit : show_commit, > quiet ? finish_object : show_object, > &info); > -- 8< -- > > Unless that's intentional, shouldn't we apply this patch? --quiet's > interfering with rev-list's business sounds weird to me. Good thinking. Anything we are missing by calling finish_* other than printing is a similar bug waiting to happen. Can't we push the quiet bit in the info structure and have a single pair of callback functions, so that we can make sure this kind of glitch would never happen? ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2012-02-27 18:41 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-02-25 17:15 sha-1 check in rev-list --verify-objects redundant? Nguyen Thai Ngoc Duy 2012-02-26 8:30 ` Junio C Hamano 2012-02-26 9:11 ` Junio C Hamano 2012-02-26 11:11 ` Nguyen Thai Ngoc Duy 2012-02-26 21:37 ` Junio C Hamano 2012-02-27 0:48 ` Nguyen Thai Ngoc Duy 2012-02-27 1:23 ` Junio C Hamano 2012-02-27 13:01 ` Nguyen Thai Ngoc Duy 2012-02-27 18:41 ` Junio C Hamano
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).