* sha-1 check in rev-list --verify-objects redundant?
@ 2012-02-25 17:15 Nguyen Thai Ngoc Duy
2012-02-26 8:30 ` Junio C Hamano
0 siblings, 1 reply; 9+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-25 17:15 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
Hi,
"git rev-list --verify-objects" calls parse_object() on all except
commits. This function in turn does check_sha1_signature() which
rehashes object content to verify the content matches its sha-1
signature. This is an expensive operation. I wonder if this is a
redundant check. --verify-objects is called to verify new packs.
index-pack is also (always?) called on new packs, and index-pack
hashes all object content, saves the hashes as signature in pack index
file. So they must match. Am I missing something here?
--
Duy
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant?
2012-02-25 17:15 sha-1 check in rev-list --verify-objects redundant? Nguyen Thai Ngoc Duy
@ 2012-02-26 8:30 ` Junio C Hamano
2012-02-26 9:11 ` Junio C Hamano
2012-02-26 11:11 ` Nguyen Thai Ngoc Duy
0 siblings, 2 replies; 9+ messages in thread
From: Junio C Hamano @ 2012-02-26 8:30 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List
Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
> ... I wonder if this is a
> redundant check. --verify-objects is called to verify new packs.
I do not think --verify-objects does not have anything to do with
verifying the integrity of packs, whether new or old.
The check is about the integrity of the *history* we _might_ already have
on our side, when we find ourselves wanting to fetch up to a commit $X,
whose reachability from the tips of our refs (i.e. the objects that are
guaranteed to be present in our repository) is unknown, and we somehow
already have the commit $X itself in the repository.
We cannot just declare victory upon seeing commit $X and omit fetching the
history leading to the commit, because we may or may not have its parent
commit object, or the tree object that is recorded in it (it may be that
we killed an HTTP walker after we fetched $X but not its parents or
trees). We need to walk back from $X until we hit one of the tips of our
refs, and while doing so, we also need to make sure the trees and blobs
referenced from the walked commits are also healthy.
As 5a48d24 (rev-list --verify-object, 2011-09-01) explains, we used to do
this with --objects instead, but that check does not even make sure blobs
exist [*1*] let alone checking to see if these blobs were healthy. The
whole point of using --verify-objects instead of --objects is to make sure
that we do not miss blob objects.
[Footnote]
*1* The --objects code reads the commits and trees in order to _list_
objects to the blob level, so implicitly, it validates that commits and
trees reachable from the commit $X we happened to have in our repository,
relying on the fact that we would error out if we fail to read them.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant?
2012-02-26 8:30 ` Junio C Hamano
@ 2012-02-26 9:11 ` Junio C Hamano
2012-02-26 11:11 ` Nguyen Thai Ngoc Duy
1 sibling, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2012-02-26 9:11 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List
Junio C Hamano <gitster@pobox.com> writes:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> ... I wonder if this is a
>> redundant check. --verify-objects is called to verify new packs.
>
> I do not think --verify-objects does not have anything to do with
> verifying the integrity of packs, whether new or old.
A typo/grammo that should be obvious from the context, sorry. The above
should be "I do not think --verify-objects has anything to do with ..."
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant?
2012-02-26 8:30 ` Junio C Hamano
2012-02-26 9:11 ` Junio C Hamano
@ 2012-02-26 11:11 ` Nguyen Thai Ngoc Duy
2012-02-26 21:37 ` Junio C Hamano
2012-02-27 13:01 ` Nguyen Thai Ngoc Duy
1 sibling, 2 replies; 9+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-26 11:11 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
On Sun, Feb 26, 2012 at 3:30 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> ... I wonder if this is a
>> redundant check. --verify-objects is called to verify new packs.
>
> I do not think --verify-objects does not have anything to do with
> verifying the integrity of packs, whether new or old.
>
> The check is about the integrity of the *history* we _might_ already have
> on our side, when we find ourselves wanting to fetch up to a commit $X,
> whose reachability from the tips of our refs (i.e. the objects that are
> guaranteed to be present in our repository) is unknown, and we somehow
> already have the commit $X itself in the repository.
>
> We cannot just declare victory upon seeing commit $X and omit fetching the
> history leading to the commit, because we may or may not have its parent
> commit object, or the tree object that is recorded in it (it may be that
> we killed an HTTP walker after we fetched $X but not its parents or
> trees). We need to walk back from $X until we hit one of the tips of our
> refs, and while doing so, we also need to make sure the trees and blobs
> referenced from the walked commits are also healthy.
>
> As 5a48d24 (rev-list --verify-object, 2011-09-01) explains, we used to do
> this with --objects instead, but that check does not even make sure blobs
> exist [*1*] let alone checking to see if these blobs were healthy. The
> whole point of using --verify-objects instead of --objects is to make sure
> that we do not miss blob objects.
"rev-list --objects" does check for blob existence, in finish_object().
On the well-formedness, unless I'm mistaken, --verify-objects is
_always_ used in conjunction with index-pack. --verify-object is not
even documented. index-pack makes sure all (new) object signatures
reflect their content. Commit and tree content are validated by
rev-list walking them. So at least when --verify-objects is used with
index-pack, I don't see the point rehashing every new object _again_.
> [Footnote]
>
> *1* The --objects code reads the commits and trees in order to _list_
> objects to the blob level, so implicitly, it validates that commits and
> trees reachable from the commit $X we happened to have in our repository,
> relying on the fact that we would error out if we fail to read them.
--
Duy
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant?
2012-02-26 11:11 ` Nguyen Thai Ngoc Duy
@ 2012-02-26 21:37 ` Junio C Hamano
2012-02-27 0:48 ` Nguyen Thai Ngoc Duy
2012-02-27 13:01 ` Nguyen Thai Ngoc Duy
1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2012-02-26 21:37 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List
Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
> On the well-formedness, unless I'm mistaken, --verify-objects is
> _always_ used in conjunction with index-pack.
Hmm, you are making my head hurt. Is the above "always" a typo of
"never"?
The static check_everything_connected() function in builtin/fetch.c is a
direct callsite of "rev-list --verify-objects", and the function is used
in two codepaths:
* store_updated_refs() that is used after we receive and store objects
from the other end. We may or may not have run index-pack in this
codepath; in either case we need to make sure the other side did send
everything that is needed to complete the history between what we used
to have and what they claimed to supply us, to protect us from a broken
remote side.
* quickfetch() that is called even before we get any object from the
other end, to optimize the transfer when we already have what we need.
The latter is the original use to protect against unconnected island of
chain I explained in the previous message, but the former is also abot the
same protection, in a different callchain.
In both cases, the check by --verify-objects is about completeness of the
history (is everything connected to the tips of refs we have?), and is
different from integrity of individual objects (is each individual object
well formed and hash correctly?). Both kinds of sanity need to be
checked, as they are orthogonal concepts.
In order to check the history completeness, we need to read the objects
that we walk during the check. I wouldn't be surprised if the codepath to
do this is written overly defensive, taking a belt-and-suspender approach,
and check the well-formedness of an object before it reads it to find out
the other objects pointed by it.
If we _know_ that we have checked the integrity of all the necessary
individual objects before we start reading them in order to check the
completeness of the history, there is an opportunity to optimize by
teaching --verify-objects paths to optionally be looser than it currently
is, to avoid checking the object integrity twice.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant?
2012-02-26 21:37 ` Junio C Hamano
@ 2012-02-27 0:48 ` Nguyen Thai Ngoc Duy
2012-02-27 1:23 ` Junio C Hamano
0 siblings, 1 reply; 9+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-27 0:48 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
On Mon, Feb 27, 2012 at 4:37 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> On the well-formedness, unless I'm mistaken, --verify-objects is
>> _always_ used in conjunction with index-pack.
>
> Hmm, you are making my head hurt. Is the above "always" a typo of
> "never"?
>
> The static check_everything_connected() function in builtin/fetch.c is a
> direct callsite of "rev-list --verify-objects", and the function is used
> in two codepaths:
>
> * store_updated_refs() that is used after we receive and store objects
> from the other end. We may or may not have run index-pack in this
> codepath; in either case we need to make sure the other side did send
> everything that is needed to complete the history between what we used
> to have and what they claimed to supply us, to protect us from a broken
> remote side.
I stand corrected. --verify-objects is _usually_ used in conjunction
with index-pack, when the media is a pack (i.e. no remote helpers)
> * quickfetch() that is called even before we get any object from the
> other end, to optimize the transfer when we already have what we need.
>
> The latter is the original use to protect against unconnected island of
> chain I explained in the previous message, but the former is also abot the
> same protection, in a different callchain.
I think we can trust what we already have, so in the latter case (and
the former when the medium is a pack), --objects should suffice.
> In both cases, the check by --verify-objects is about completeness of the
> history (is everything connected to the tips of refs we have?), and is
> different from integrity of individual objects (is each individual object
> well formed and hash correctly?). Both kinds of sanity need to be
> checked, as they are orthogonal concepts.
>
> In order to check the history completeness, we need to read the objects
> that we walk during the check. I wouldn't be surprised if the codepath to
> do this is written overly defensive, taking a belt-and-suspender approach,
> and check the well-formedness of an object before it reads it to find out
> the other objects pointed by it.
>
> If we _know_ that we have checked the integrity of all the necessary
> individual objects before we start reading them in order to check the
> completeness of the history, there is an opportunity to optimize by
> teaching --verify-objects paths to optionally be looser than it currently
> is, to avoid checking the object integrity twice.
Ok, will cook something. The reason I raised it is because
--verify-objects --all on git.git could take ~1m10s, but if we assume
object integrity is fine and skip it, it could drop to 10s (I suspect
--objects gives the same number).
--
Duy
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant?
2012-02-27 0:48 ` Nguyen Thai Ngoc Duy
@ 2012-02-27 1:23 ` Junio C Hamano
0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2012-02-27 1:23 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List
Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>> * quickfetch() that is called even before we get any object from the
>> other end, to optimize the transfer when we already have what we need.
>>
>> The latter is the original use to protect against unconnected island of
>> chain I explained in the previous message, but the former is also abot the
>> same protection, in a different callchain.
>
> I think we can trust what we already have, so in the latter case (and
> the former when the medium is a pack), --objects should suffice.
Hrm, what you seem to be missing is that in the latter case, the objects
we are walking to make sure the connectivity are *not* yet considered as
part of "what we already have" (hence we can trust). The whole point of
quickfetch is to turn these untrustworthy bits that happen to exist in our
repository into a part of trustable history without having to fetch from
the other side to complete them. In the former case, when we know that
pack has been indexed with recent index-pack/unpack-objects that are more
picky (didn't Shawn and I recently tighten them?), we may be able to do
without object integrity checks.
>> If we _know_ that we have checked the integrity of all the necessary
>> individual objects before we start reading them in order to check the
>> completeness of the history, there is an opportunity to optimize by
>> teaching --verify-objects paths to optionally be looser than it currently
>> is, to avoid checking the object integrity twice.
>
> Ok, will cook something. The reason I raised it is because
> --verify-objects --all on git.git could take ~1m10s, but if we assume
> object integrity is fine and skip it, it could drop to 10s (I suspect
> --objects gives the same number).
That is a big assumption that I personally do not want to sacrifice the
integrity of my history on. I would 100% agree with the above with s/if
we assume/if we know/, though.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant?
2012-02-26 11:11 ` Nguyen Thai Ngoc Duy
2012-02-26 21:37 ` Junio C Hamano
@ 2012-02-27 13:01 ` Nguyen Thai Ngoc Duy
2012-02-27 18:41 ` Junio C Hamano
1 sibling, 1 reply; 9+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2012-02-27 13:01 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
On Sun, Feb 26, 2012 at 06:11:30PM +0700, Nguyen Thai Ngoc Duy wrote:
> "rev-list --objects" does check for blob existence, in finish_object().
Eck.. I think "--quiet --verify-objects" becomes "--quiet --objects"
because of this code:
-- 8< --
traverse_commit_list(&revs,
quiet ? finish_commit : show_commit,
quiet ? finish_object : show_object,
&info);
-- 8< --
Unless that's intentional, shouldn't we apply this patch? --quiet's
interfering with rev-list's business sounds weird to me.
-- 8< --
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 264e3ae..95fb605 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -172,19 +172,18 @@ static void finish_object(struct object *obj,
const struct name_path *path, const char *name,
void *cb_data)
{
+ struct rev_list_info *info = cb_data;
if (obj->type == OBJ_BLOB && !has_sha1_file(obj->sha1))
die("missing blob object '%s'", sha1_to_hex(obj->sha1));
+ if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT)
+ parse_object(obj->sha1);
}
static void show_object(struct object *obj,
const struct name_path *path, const char *component,
void *cb_data)
{
- struct rev_list_info *info = cb_data;
-
finish_object(obj, path, component, cb_data);
- if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT)
- parse_object(obj->sha1);
show_object_with_name(stdout, obj, path, component);
}
-- 8< --
--
Duy
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: sha-1 check in rev-list --verify-objects redundant?
2012-02-27 13:01 ` Nguyen Thai Ngoc Duy
@ 2012-02-27 18:41 ` Junio C Hamano
0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2012-02-27 18:41 UTC (permalink / raw)
To: Nguyen Thai Ngoc Duy; +Cc: Git Mailing List
Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
> On Sun, Feb 26, 2012 at 06:11:30PM +0700, Nguyen Thai Ngoc Duy wrote:
>> "rev-list --objects" does check for blob existence, in finish_object().
>
> Eck.. I think "--quiet --verify-objects" becomes "--quiet --objects"
> because of this code:
>
> -- 8< --
> traverse_commit_list(&revs,
> quiet ? finish_commit : show_commit,
> quiet ? finish_object : show_object,
> &info);
> -- 8< --
>
> Unless that's intentional, shouldn't we apply this patch? --quiet's
> interfering with rev-list's business sounds weird to me.
Good thinking. Anything we are missing by calling finish_* other than
printing is a similar bug waiting to happen.
Can't we push the quiet bit in the info structure and have a single pair
of callback functions, so that we can make sure this kind of glitch would
never happen?
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2012-02-27 18:41 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-02-25 17:15 sha-1 check in rev-list --verify-objects redundant? Nguyen Thai Ngoc Duy
2012-02-26 8:30 ` Junio C Hamano
2012-02-26 9:11 ` Junio C Hamano
2012-02-26 11:11 ` Nguyen Thai Ngoc Duy
2012-02-26 21:37 ` Junio C Hamano
2012-02-27 0:48 ` Nguyen Thai Ngoc Duy
2012-02-27 1:23 ` Junio C Hamano
2012-02-27 13:01 ` Nguyen Thai Ngoc Duy
2012-02-27 18:41 ` Junio C Hamano
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).