* [RFC PATCH 0/4] Some patches for fsck for missing objects
@ 2017-07-26 23:29 Jonathan Tan
2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan
` (5 more replies)
0 siblings, 6 replies; 40+ messages in thread
From: Jonathan Tan @ 2017-07-26 23:29 UTC (permalink / raw)
To: git; +Cc: Jonathan Tan, peartben, christian.couder
This is an updated version of my previous patch [1], but without the
list of promises. It is somewhat different now because fsck cannot just
mark all promised objects as HAS_OBJ.
This could be adapted and incorporated into patch sets that have objects
missing from the local repo.
I split this up into 4 patches so that you can see how the changes in
fsck are being tested, but eventually these should probably be combined
into 1 patch.
[1] https://public-inbox.org/git/3420d9ae9ef86b78af1abe721891233e3f5865a2.1500508695.git.jonathantanmy@google.com/
Jonathan Tan (4):
environment, fsck: introduce lazyobject extension
fsck: support refs pointing to lazy objects
fsck: support referenced lazy objects
fsck: support lazy objects as CLI argument
Documentation/technical/repository-version.txt | 7 ++
builtin/fsck.c | 23 ++++++-
cache.h | 2 +
environment.c | 1 +
setup.c | 7 +-
t/t0410-lazy-object.sh | 95 ++++++++++++++++++++++++++
6 files changed, 133 insertions(+), 2 deletions(-)
create mode 100755 t/t0410-lazy-object.sh
--
2.14.0.rc0.400.g1c36432dff-goog
^ permalink raw reply [flat|nested] 40+ messages in thread* [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension 2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan @ 2017-07-26 23:29 ` Jonathan Tan 2017-07-27 18:55 ` Junio C Hamano 2017-07-26 23:30 ` [RFC PATCH 2/4] fsck: support refs pointing to lazy objects Jonathan Tan ` (4 subsequent siblings) 5 siblings, 1 reply; 40+ messages in thread From: Jonathan Tan @ 2017-07-26 23:29 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, peartben, christian.couder Currently, Git does not support repos with very large numbers of objects or repos that wish to minimize manipulation of certain blobs (for example, because they are very large) very well, even if the user operates mostly on part of the repo, because Git is designed on the assumption that every referenced object is available somewhere in the repo storage. Introduce a new repository extension option "extensions.lazyobject", of data type string. If this is set in a repository, Git will tolerate objects being missing in that repository. When Git needs those objects, it will invoke the command in that option. Teach fsck about the new state of affairs. In this commit, teach fsck that missing objects referenced from the reflog are not an error case; in future commits, fsck will be taught about other cases. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- Documentation/technical/repository-version.txt | 7 ++++++ builtin/fsck.c | 2 +- cache.h | 2 ++ environment.c | 1 + setup.c | 7 +++++- t/t0410-lazy-object.sh | 32 ++++++++++++++++++++++++++ 6 files changed, 49 insertions(+), 2 deletions(-) create mode 100755 t/t0410-lazy-object.sh diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt index 00ad37986..39e362543 100644 --- a/Documentation/technical/repository-version.txt +++ b/Documentation/technical/repository-version.txt @@ -86,3 +86,10 @@ for testing format-1 compatibility. When the config key `extensions.preciousObjects` is set to `true`, objects in the repository MUST NOT be deleted (e.g., by `git-prune` or `git repack -d`). + +`lazyObject` +~~~~~~~~~~~~~~~~~ + +When the config key `extensions.lazyObject` is set to a string, Git +tolerates objects being missing in the repository. This string contains +the command to be run whenever a missing object is needed. diff --git a/builtin/fsck.c b/builtin/fsck.c index fb0947009..1cfb8d98c 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -402,7 +402,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid, xstrfmt("%s@{%"PRItime"}", refname, timestamp)); obj->flags |= USED; mark_object_reachable(obj); - } else { + } else if (!repository_format_lazy_object) { error("%s: invalid reflog entry %s", refname, oid_to_hex(oid)); errors_found |= ERROR_REACHABLE; } diff --git a/cache.h b/cache.h index 6c8242340..9e6bc0a21 100644 --- a/cache.h +++ b/cache.h @@ -853,10 +853,12 @@ extern int grafts_replace_parents; #define GIT_REPO_VERSION 0 #define GIT_REPO_VERSION_READ 1 extern int repository_format_precious_objects; +extern char *repository_format_lazy_object; struct repository_format { int version; int precious_objects; + char *lazy_object; int is_bare; char *work_tree; struct string_list unknown_extensions; diff --git a/environment.c b/environment.c index 3fd4b1084..cd8ef2897 100644 --- a/environment.c +++ b/environment.c @@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1; int warn_on_object_refname_ambiguity = 1; int ref_paranoia = -1; int repository_format_precious_objects; +char *repository_format_lazy_object; const char *git_commit_encoding; const char *git_log_output_encoding; const char *apply_default_whitespace; diff --git a/setup.c b/setup.c index 860507e1f..94cfde3cc 100644 --- a/setup.c +++ b/setup.c @@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata) ; else if (!strcmp(ext, "preciousobjects")) data->precious_objects = git_config_bool(var, value); - else + else if (!strcmp(ext, "lazyobject")) { + if (!value) + return config_error_nonbool(var); + data->lazy_object = xstrdup(value); + } else string_list_append(&data->unknown_extensions, ext); } else if (strcmp(var, "core.bare") == 0) { data->is_bare = git_config_bool(var, value); @@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok) } repository_format_precious_objects = candidate.precious_objects; + repository_format_lazy_object = candidate.lazy_object; string_list_clear(&candidate.unknown_extensions, 0); if (!has_common) { if (candidate.is_bare != -1) { diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh new file mode 100755 index 000000000..36442531f --- /dev/null +++ b/t/t0410-lazy-object.sh @@ -0,0 +1,32 @@ +#!/bin/sh + +test_description='lazy object' + +. ./test-lib.sh + +delete_object () { + rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) +} + +test_expect_success 'fsck fails on lazy object in reflog' ' + test_create_repo repo && + test_commit -C repo 1 && + + A=$(git -C repo commit-tree -m a HEAD^{tree}) && + B=$(git -C repo commit-tree -m b HEAD^{tree}) && + + # Reference $A only from reflog, and delete it + git -C repo branch mybranch "$A" && + git -C repo branch -f mybranch "$B" && + delete_object repo "$A" && + + test_must_fail git -C repo fsck +' + +test_expect_success '...but succeeds if lazyobject is set' ' + git -C repo config core.repositoryformatversion 1 && + git -C repo config extensions.lazyobject "arbitrary string" && + git -C repo fsck +' + +test_done -- 2.14.0.rc0.400.g1c36432dff-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension 2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan @ 2017-07-27 18:55 ` Junio C Hamano 2017-07-28 13:20 ` Ben Peart 2017-07-28 23:50 ` Jonathan Tan 0 siblings, 2 replies; 40+ messages in thread From: Junio C Hamano @ 2017-07-27 18:55 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder Jonathan Tan <jonathantanmy@google.com> writes: > Currently, Git does not support repos with very large numbers of objects > or repos that wish to minimize manipulation of certain blobs (for > example, because they are very large) very well, even if the user > operates mostly on part of the repo, because Git is designed on the > assumption that every referenced object is available somewhere in the > repo storage. > > Introduce a new repository extension option "extensions.lazyobject", of > data type string. If this is set in a repository, Git will tolerate > objects being missing in that repository. When Git needs those objects, > it will invoke the command in that option. My reading hiccupped after the first sentence, as the problem description made it sound like this was a boolean ("are we using lazy object feature?"), after reading "data type string". And then "the command in that option" made me hiccup one more time, as it did not "click" that "in that option" was trying to say that the string is used as the command name (or is it a whole command line? The leading part of the command line to which some arguments are appended before it gets invoked as a command? or what?). Logically, I think it is more like - extensions.lazyobject can be set to tell Git to consider missing objects in certain cases are not errors; - the value of extensions.lazyobject variable must be a string, which is used to name the command to lazily make the object "appear" in the repository on demand. > Teach fsck about the new state of affairs. In this commit, teach fsck > that missing objects referenced from the reflog are not an error case; > in future commits, fsck will be taught about other cases. In any case, sounds like a small and good first step. > + > +`lazyObject` > +~~~~~~~~~~~~~~~~~ > + > +When the config key `extensions.lazyObject` is set to a string, Git > +tolerates objects being missing in the repository. This string contains > +the command to be run whenever a missing object is needed. This has the same issue but to a lessor degree as there is "string contains". What the command will do (e.g. "makes the object magically appear in the object store" or "emits the contents of the object to its standard output, so that Git can hash it to make it appear in the object store"), and how it is used (e.g. "this is a single command name and nothing else", "this is a leading part of command line and arguments are appended at the end, before the whole thing is passed to the shell to be executed", etc.) will need to be clarified in the final version of the series (not necessarily at this step---the series can elaborate in the later patches). > diff --git a/builtin/fsck.c b/builtin/fsck.c > index fb0947009..1cfb8d98c 100644 > --- a/builtin/fsck.c > +++ b/builtin/fsck.c > @@ -402,7 +402,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid, > xstrfmt("%s@{%"PRItime"}", refname, timestamp)); > obj->flags |= USED; > mark_object_reachable(obj); > - } else { > + } else if (!repository_format_lazy_object) { > error("%s: invalid reflog entry %s", refname, oid_to_hex(oid)); > errors_found |= ERROR_REACHABLE; > } > diff --git a/cache.h b/cache.h > index 6c8242340..9e6bc0a21 100644 > --- a/cache.h > +++ b/cache.h > @@ -853,10 +853,12 @@ extern int grafts_replace_parents; > #define GIT_REPO_VERSION 0 > #define GIT_REPO_VERSION_READ 1 > extern int repository_format_precious_objects; > +extern char *repository_format_lazy_object; This is not a new problem, but I think these two should be called repository_extension_$NAME not repository_format_$NAME. > diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh > new file mode 100755 > index 000000000..36442531f > --- /dev/null > +++ b/t/t0410-lazy-object.sh > @@ -0,0 +1,32 @@ > +#!/bin/sh > + > +test_description='lazy object' > + > +. ./test-lib.sh > + > +delete_object () { > + rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) > +} > + > +test_expect_success 'fsck fails on lazy object in reflog' ' > + test_create_repo repo && > + test_commit -C repo 1 && > + > + A=$(git -C repo commit-tree -m a HEAD^{tree}) && > + B=$(git -C repo commit-tree -m b HEAD^{tree}) && > + > + # Reference $A only from reflog, and delete it > + git -C repo branch mybranch "$A" && > + git -C repo branch -f mybranch "$B" && > + delete_object repo "$A" && > + > + test_must_fail git -C repo fsck > +' > +test_expect_success '...but succeeds if lazyobject is set' ' > + git -C repo config core.repositoryformatversion 1 && > + git -C repo config extensions.lazyobject "arbitrary string" && > + git -C repo fsck > +' > + > +test_done ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension 2017-07-27 18:55 ` Junio C Hamano @ 2017-07-28 13:20 ` Ben Peart 2017-07-28 23:50 ` Jonathan Tan 1 sibling, 0 replies; 40+ messages in thread From: Ben Peart @ 2017-07-28 13:20 UTC (permalink / raw) To: Junio C Hamano, Jonathan Tan; +Cc: git, christian.couder On 7/27/2017 2:55 PM, Junio C Hamano wrote: > Jonathan Tan <jonathantanmy@google.com> writes: > >> Currently, Git does not support repos with very large numbers of objects >> or repos that wish to minimize manipulation of certain blobs (for >> example, because they are very large) very well, even if the user >> operates mostly on part of the repo, because Git is designed on the >> assumption that every referenced object is available somewhere in the >> repo storage. >> I very much like the direction this is taking. Handling missing objects gracefully is an important part of the overall partial clone support. >> Introduce a new repository extension option "extensions.lazyobject", of >> data type string. If this is set in a repository, Git will tolerate >> objects being missing in that repository. When Git needs those objects, >> it will invoke the command in that option. > I'm very glad you are doing this. Having never used precious objects I was unaware of the extensions concept but it looks like a great way to deal with this repo specific option. > My reading hiccupped after the first sentence, as the problem > description made it sound like this was a boolean ("are we using > lazy object feature?"), after reading "data type string". And then > "the command in that option" made me hiccup one more time, as it did > not "click" that "in that option" was trying to say that the string > is used as the command name (or is it a whole command line? The > leading part of the command line to which some arguments are > appended before it gets invoked as a command? or what?). > > Logically, I think it is more like > > - extensions.lazyobject can be set to tell Git to consider missing > objects in certain cases are not errors; > > - the value of extensions.lazyobject variable must be a string, > which is used to name the command to lazily make the object > "appear" in the repository on demand. > >> Teach fsck about the new state of affairs. In this commit, teach fsck >> that missing objects referenced from the reflog are not an error case; >> in future commits, fsck will be taught about other cases. > > In any case, sounds like a small and good first step. > I agree completely with the feedback on the description. That is quite the run on sentence. :) >> + >> +`lazyObject` >> +~~~~~~~~~~~~~~~~~ >> + >> +When the config key `extensions.lazyObject` is set to a string, Git >> +tolerates objects being missing in the repository. This string contains >> +the command to be run whenever a missing object is needed. > > This has the same issue but to a lessor degree as there is "string > contains". What the command will do (e.g. "makes the object > magically appear in the object store" or "emits the contents of the > object to its standard output, so that Git can hash it to make it > appear in the object store"), and how it is used (e.g. "this is a > single command name and nothing else", "this is a leading part of > command line and arguments are appended at the end, before the whole > thing is passed to the shell to be executed", etc.) will need to be > clarified in the final version of the series (not necessarily at > this step---the series can elaborate in the later patches). > >> diff --git a/builtin/fsck.c b/builtin/fsck.c >> index fb0947009..1cfb8d98c 100644 >> --- a/builtin/fsck.c >> +++ b/builtin/fsck.c >> @@ -402,7 +402,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid, >> xstrfmt("%s@{%"PRItime"}", refname, timestamp)); >> obj->flags |= USED; >> mark_object_reachable(obj); >> - } else { >> + } else if (!repository_format_lazy_object) { >> error("%s: invalid reflog entry %s", refname, oid_to_hex(oid)); >> errors_found |= ERROR_REACHABLE; >> } >> diff --git a/cache.h b/cache.h >> index 6c8242340..9e6bc0a21 100644 >> --- a/cache.h >> +++ b/cache.h >> @@ -853,10 +853,12 @@ extern int grafts_replace_parents; >> #define GIT_REPO_VERSION 0 >> #define GIT_REPO_VERSION_READ 1 >> extern int repository_format_precious_objects; >> +extern char *repository_format_lazy_object; > > This is not a new problem, but I think these two should be > called repository_extension_$NAME not repository_format_$NAME. > >> diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh >> new file mode 100755 >> index 000000000..36442531f >> --- /dev/null >> +++ b/t/t0410-lazy-object.sh >> @@ -0,0 +1,32 @@ >> +#!/bin/sh >> + >> +test_description='lazy object' >> + >> +. ./test-lib.sh >> + >> +delete_object () { >> + rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) >> +} >> + >> +test_expect_success 'fsck fails on lazy object in reflog' ' >> + test_create_repo repo && >> + test_commit -C repo 1 && >> + >> + A=$(git -C repo commit-tree -m a HEAD^{tree}) && >> + B=$(git -C repo commit-tree -m b HEAD^{tree}) && >> + >> + # Reference $A only from reflog, and delete it >> + git -C repo branch mybranch "$A" && >> + git -C repo branch -f mybranch "$B" && >> + delete_object repo "$A" && >> + >> + test_must_fail git -C repo fsck >> +' >> +test_expect_success '...but succeeds if lazyobject is set' ' >> + git -C repo config core.repositoryformatversion 1 && >> + git -C repo config extensions.lazyobject "arbitrary string" && >> + git -C repo fsck >> +' >> + >> +test_done ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension 2017-07-27 18:55 ` Junio C Hamano 2017-07-28 13:20 ` Ben Peart @ 2017-07-28 23:50 ` Jonathan Tan 2017-07-29 0:21 ` Junio C Hamano 1 sibling, 1 reply; 40+ messages in thread From: Jonathan Tan @ 2017-07-28 23:50 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peartben, christian.couder On Thu, 27 Jul 2017 11:55:46 -0700 Junio C Hamano <gitster@pobox.com> wrote: > My reading hiccupped after the first sentence, as the problem > description made it sound like this was a boolean ("are we using > lazy object feature?"), after reading "data type string". And then > "the command in that option" made me hiccup one more time, as it did > not "click" that "in that option" was trying to say that the string > is used as the command name (or is it a whole command line? The > leading part of the command line to which some arguments are > appended before it gets invoked as a command? or what?). > > Logically, I think it is more like > > - extensions.lazyobject can be set to tell Git to consider missing > objects in certain cases are not errors; > > - the value of extensions.lazyobject variable must be a string, > which is used to name the command to lazily make the object > "appear" in the repository on demand. OK, I'll update the commit message in the next reroll. > > extern int repository_format_precious_objects; > > +extern char *repository_format_lazy_object; > > This is not a new problem, but I think these two should be > called repository_extension_$NAME not repository_format_$NAME. Looking at the original commit 067fbd4 ("introduce "preciousObjects" repository extension", 2015-06-24), it seems that this was so named to be analogous to the existing "struct repository_format { int version; ...}" => "int repository_format_version;". The existing repository_format_$NAME thus seems reasonable to me. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension 2017-07-28 23:50 ` Jonathan Tan @ 2017-07-29 0:21 ` Junio C Hamano 0 siblings, 0 replies; 40+ messages in thread From: Junio C Hamano @ 2017-07-29 0:21 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder Jonathan Tan <jonathantanmy@google.com> writes: >> > extern int repository_format_precious_objects; >> > +extern char *repository_format_lazy_object; >> >> This is not a new problem, but I think these two should be >> called repository_extension_$NAME not repository_format_$NAME. > > Looking at the original commit 067fbd4 ("introduce "preciousObjects" > repository extension", 2015-06-24), it seems that this was so named to > be analogous to the existing "struct repository_format { int version; > ...}" => "int repository_format_version;". The existing > repository_format_$NAME thus seems reasonable to me. OK. They smell like "repository extension" to me, but probably the fully spelled name of the concept is "repository format extension", so using the word "format" out of that phrase sounds OK to me. Thanks. ^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC PATCH 2/4] fsck: support refs pointing to lazy objects 2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan 2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan @ 2017-07-26 23:30 ` Jonathan Tan 2017-07-27 18:59 ` Junio C Hamano 2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan ` (3 subsequent siblings) 5 siblings, 1 reply; 40+ messages in thread From: Jonathan Tan @ 2017-07-26 23:30 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, peartben, christian.couder Teach fsck to not treat refs with missing targets as an error when extensions.lazyobject is set. For the purposes of warning about no default refs, such refs are still treated as legitimate refs. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- builtin/fsck.c | 8 ++++++++ t/t0410-lazy-object.sh | 20 ++++++++++++++++++++ 2 files changed, 28 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index 1cfb8d98c..e29ff760b 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid, obj = parse_object(oid); if (!obj) { + if (repository_format_lazy_object) { + /* + * Increment default_refs anyway, because this is a + * valid ref. + */ + default_refs++; + return 0; + } error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid)); errors_found |= ERROR_REACHABLE; /* We'll continue with the rest despite the error.. */ diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh index 36442531f..00e1b4a88 100755 --- a/t/t0410-lazy-object.sh +++ b/t/t0410-lazy-object.sh @@ -29,4 +29,24 @@ test_expect_success '...but succeeds if lazyobject is set' ' git -C repo fsck ' +test_expect_success 'fsck fails on lazy object pointed to by ref' ' + rm -rf repo && + test_create_repo repo && + test_commit -C repo 1 && + + A=$(git -C repo commit-tree -m a HEAD^{tree}) && + + # Reference $A only from ref, and delete it + git -C repo branch mybranch "$A" && + delete_object repo "$A" && + + test_must_fail git -C repo fsck +' + +test_expect_success '...but succeeds if lazyobject is set' ' + git -C repo config core.repositoryformatversion 1 && + git -C repo config extensions.lazyobject "arbitrary string" && + git -C repo fsck +' + test_done -- 2.14.0.rc0.400.g1c36432dff-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 2/4] fsck: support refs pointing to lazy objects 2017-07-26 23:30 ` [RFC PATCH 2/4] fsck: support refs pointing to lazy objects Jonathan Tan @ 2017-07-27 18:59 ` Junio C Hamano 2017-07-27 23:50 ` Jonathan Tan 0 siblings, 1 reply; 40+ messages in thread From: Junio C Hamano @ 2017-07-27 18:59 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder Jonathan Tan <jonathantanmy@google.com> writes: > Teach fsck to not treat refs with missing targets as an error when > extensions.lazyobject is set. > > For the purposes of warning about no default refs, such refs are still > treated as legitimate refs. > > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > --- > builtin/fsck.c | 8 ++++++++ > t/t0410-lazy-object.sh | 20 ++++++++++++++++++++ > 2 files changed, 28 insertions(+) > > diff --git a/builtin/fsck.c b/builtin/fsck.c > index 1cfb8d98c..e29ff760b 100644 > --- a/builtin/fsck.c > +++ b/builtin/fsck.c > @@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid, > > obj = parse_object(oid); > if (!obj) { > + if (repository_format_lazy_object) { > + /* > + * Increment default_refs anyway, because this is a > + * valid ref. > + */ > + default_refs++; > + return 0; > + } > error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid)); > errors_found |= ERROR_REACHABLE; At this point, do we know (or can we tell) if this is a missing object or a file exists as a loose object but is corrupt? If we could, it would be nice to do this only for the former to avoid sweeping a real corruption that is unrelated to the lazy fetch under the rug. > +test_expect_success 'fsck fails on lazy object pointed to by ref' ' > + rm -rf repo && > + test_create_repo repo && > + test_commit -C repo 1 && > + > + A=$(git -C repo commit-tree -m a HEAD^{tree}) && > + > + # Reference $A only from ref, and delete it > + git -C repo branch mybranch "$A" && > + delete_object repo "$A" && > + > + test_must_fail git -C repo fsck > +' And a new test that uses a helper different from delete_object (perhaps call it corrupt_object?) can be used to make sure that we complain in that case here. > +test_expect_success '...but succeeds if lazyobject is set' ' > + git -C repo config core.repositoryformatversion 1 && > + git -C repo config extensions.lazyobject "arbitrary string" && > + git -C repo fsck > +' > + > test_done ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 2/4] fsck: support refs pointing to lazy objects 2017-07-27 18:59 ` Junio C Hamano @ 2017-07-27 23:50 ` Jonathan Tan 2017-07-28 13:29 ` Ben Peart 0 siblings, 1 reply; 40+ messages in thread From: Jonathan Tan @ 2017-07-27 23:50 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peartben, christian.couder On Thu, 27 Jul 2017 11:59:47 -0700 Junio C Hamano <gitster@pobox.com> wrote: > > @@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid, > > > > obj = parse_object(oid); > > if (!obj) { > > + if (repository_format_lazy_object) { > > + /* > > + * Increment default_refs anyway, because this is a > > + * valid ref. > > + */ > > + default_refs++; > > + return 0; > > + } > > error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid)); > > errors_found |= ERROR_REACHABLE; > > At this point, do we know (or can we tell) if this is a missing > object or a file exists as a loose object but is corrupt? If we > could, it would be nice to do this only for the former to avoid > sweeping a real corruption that is unrelated to the lazy fetch under > the rug. Before all this is run, there is a check over all loose and packed objects and I've verified that this check reports failure in corrupt-object situations (see test below). It is true that parse_object() cannot report the difference, but since fsck has already verified non-corruptness, I don't think it needs to know the difference at this point. > > +test_expect_success 'fsck fails on lazy object pointed to by ref' ' > > + rm -rf repo && > > + test_create_repo repo && > > + test_commit -C repo 1 && > > + > > + A=$(git -C repo commit-tree -m a HEAD^{tree}) && > > + > > + # Reference $A only from ref, and delete it > > + git -C repo branch mybranch "$A" && > > + delete_object repo "$A" && > > + > > + test_must_fail git -C repo fsck > > +' > > And a new test that uses a helper different from delete_object > (perhaps call it corrupt_object?) can be used to make sure that we > complain in that case here. I agree that object corruption can cause this specific part of the production code to falsely work. But I think that this specific part of the code can and should rely on object corruption being checked elsewhere. (I usually don't like to assume that other components work and will continue to work, but in this case, I think that fsck checking for object corruption is very foundational and should be relied upon.) But if we think that defense "in depth" is a good idea, I have no problem adding such tests (like the one below). --- delete_object () { rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) } corrupt_object () { chmod a+w $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) && echo CORRUPT >$1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) } setup_object_in_reflog () { rm -rf repo && test_create_repo repo && test_commit -C repo 1 && A=$(git -C repo commit-tree -m a HEAD^{tree}) && B=$(git -C repo commit-tree -m b HEAD^{tree}) && # Reference $A only from reflog git -C repo branch mybranch "$A" && git -C repo branch -f mybranch "$B" } test_expect_success 'lazy object in reflog' ' setup_object_in_reflog && delete_object repo "$A" && test_must_fail git -C repo fsck && git -C repo config core.repositoryformatversion 1 && git -C repo config extensions.lazyobject "arbitrary string" && git -C repo fsck ' test_expect_success 'corrupt loose object in reflog' ' setup_object_in_reflog && corrupt_object repo "$A" && test_must_fail git -C repo fsck && git -C repo config core.repositoryformatversion 1 && git -C repo config extensions.lazyobject "arbitrary string" && test_must_fail git -C repo fsck ' test_expect_success 'missing packed object in reflog' ' setup_object_in_reflog && git -C repo repack -a && delete_object repo "$A" && chmod a+w repo/.git/objects/pack/*.pack && echo CORRUPT >"$(echo repo/.git/objects/pack/*.pack)" && test_must_fail git -C repo fsck && git -C repo config core.repositoryformatversion 1 && git -C repo config extensions.lazyobject "arbitrary string" && test_must_fail git -C repo fsck ' ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 2/4] fsck: support refs pointing to lazy objects 2017-07-27 23:50 ` Jonathan Tan @ 2017-07-28 13:29 ` Ben Peart 2017-07-28 20:08 ` [PATCH] tests: ensure fsck fails on corrupt packfiles Jonathan Tan 0 siblings, 1 reply; 40+ messages in thread From: Ben Peart @ 2017-07-28 13:29 UTC (permalink / raw) To: Jonathan Tan, Junio C Hamano; +Cc: git, christian.couder On 7/27/2017 7:50 PM, Jonathan Tan wrote: > On Thu, 27 Jul 2017 11:59:47 -0700 > Junio C Hamano <gitster@pobox.com> wrote: > >>> @@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid, >>> >>> obj = parse_object(oid); >>> if (!obj) { >>> + if (repository_format_lazy_object) { >>> + /* >>> + * Increment default_refs anyway, because this is a >>> + * valid ref. >>> + */ >>> + default_refs++; >>> + return 0; >>> + } >>> error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid)); >>> errors_found |= ERROR_REACHABLE; >> >> At this point, do we know (or can we tell) if this is a missing >> object or a file exists as a loose object but is corrupt? If we >> could, it would be nice to do this only for the former to avoid >> sweeping a real corruption that is unrelated to the lazy fetch under >> the rug. > > Before all this is run, there is a check over all loose and packed > objects and I've verified that this check reports failure in > corrupt-object situations (see test below). > > It is true that parse_object() cannot report the difference, but since > fsck has already verified non-corruptness, I don't think it needs to > know the difference at this point. > >>> +test_expect_success 'fsck fails on lazy object pointed to by ref' ' >>> + rm -rf repo && >>> + test_create_repo repo && >>> + test_commit -C repo 1 && >>> + >>> + A=$(git -C repo commit-tree -m a HEAD^{tree}) && >>> + >>> + # Reference $A only from ref, and delete it >>> + git -C repo branch mybranch "$A" && >>> + delete_object repo "$A" && >>> + >>> + test_must_fail git -C repo fsck >>> +' >> >> And a new test that uses a helper different from delete_object >> (perhaps call it corrupt_object?) can be used to make sure that we >> complain in that case here. > > I agree that object corruption can cause this specific part of the > production code to falsely work. But I think that this specific part of > the code can and should rely on object corruption being checked > elsewhere. (I usually don't like to assume that other components work > and will continue to work, but in this case, I think that fsck checking > for object corruption is very foundational and should be relied upon.) > > But if we think that defense "in depth" is a good idea, I have no > problem adding such tests (like the one below). > It's good to know that object corruption is already being checked elsewhere in fsck. I agree that you should be able to rely that as long as it is guaranteed to run. I'd rather see a single location in the code that has the logic to detect corrupted objects rather than two copies (or worse, two different versions). Are there also tests for validating the object corruption detection code? If not, adding some (like below) would be great. > --- > delete_object () { > rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) > } > > corrupt_object () { > chmod a+w $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) && > echo CORRUPT >$1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) > } > > setup_object_in_reflog () { > rm -rf repo && > test_create_repo repo && > test_commit -C repo 1 && > > A=$(git -C repo commit-tree -m a HEAD^{tree}) && > B=$(git -C repo commit-tree -m b HEAD^{tree}) && > > # Reference $A only from reflog > git -C repo branch mybranch "$A" && > git -C repo branch -f mybranch "$B" > } > > test_expect_success 'lazy object in reflog' ' > setup_object_in_reflog && > delete_object repo "$A" && > test_must_fail git -C repo fsck && > git -C repo config core.repositoryformatversion 1 && > git -C repo config extensions.lazyobject "arbitrary string" && > git -C repo fsck > ' > > test_expect_success 'corrupt loose object in reflog' ' > setup_object_in_reflog && > corrupt_object repo "$A" && > test_must_fail git -C repo fsck && > git -C repo config core.repositoryformatversion 1 && > git -C repo config extensions.lazyobject "arbitrary string" && > test_must_fail git -C repo fsck > ' > > test_expect_success 'missing packed object in reflog' ' > setup_object_in_reflog && > git -C repo repack -a && > delete_object repo "$A" && > chmod a+w repo/.git/objects/pack/*.pack && > echo CORRUPT >"$(echo repo/.git/objects/pack/*.pack)" && > test_must_fail git -C repo fsck && > git -C repo config core.repositoryformatversion 1 && > git -C repo config extensions.lazyobject "arbitrary string" && > test_must_fail git -C repo fsck > ' > ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH] tests: ensure fsck fails on corrupt packfiles 2017-07-28 13:29 ` Ben Peart @ 2017-07-28 20:08 ` Jonathan Tan 0 siblings, 0 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-28 20:08 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, peartben, gitster t1450-fsck.sh does not have a test that checks fsck's behavior when a packfile is invalid. It does have a test for when an object in a packfile is invalid, but in that test, the packfile itself is valid. Add such a test. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- I think the existing packfile corruption test (t5303) is good enough that we would notice such things, but just in case we want to test fsck specifically, here's a patch. --- t/t1450-fsck.sh | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index bb89e1a5d..4087150db 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -608,6 +608,22 @@ test_expect_success 'fsck errors in packed objects' ' ! grep corrupt out ' +test_expect_success 'fsck fails on corrupt packfile' ' + hsh=$(git commit-tree -m mycommit HEAD^{tree}) && + pack=$(echo $hsh | git pack-objects .git/objects/pack/pack) && + + # Corrupt the first byte of the first object. (It contains 3 type bits, + # at least one of which is not zero, so setting the first byte to 0 is + # sufficient.) + chmod a+w .git/objects/pack/pack-$pack.pack && + printf '\0' | dd of=.git/objects/pack/pack-$pack.pack bs=1 conv=notrunc seek=12 && + + test_when_finished "rm -f .git/objects/pack/pack-$pack.*" && + remove_object $hsh && + test_must_fail git fsck 2>out && + test_i18ngrep "checksum mismatch" out +' + test_expect_success 'fsck finds problems in duplicate loose objects' ' rm -rf broken-duplicate && git init broken-duplicate && -- 2.14.0.rc0.400.g1c36432dff-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [RFC PATCH 3/4] fsck: support referenced lazy objects 2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan 2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan 2017-07-26 23:30 ` [RFC PATCH 2/4] fsck: support refs pointing to lazy objects Jonathan Tan @ 2017-07-26 23:30 ` Jonathan Tan 2017-07-27 19:17 ` Junio C Hamano 2017-07-29 16:04 ` Junio C Hamano 2017-07-26 23:30 ` [RFC PATCH 4/4] fsck: support lazy objects as CLI argument Jonathan Tan ` (2 subsequent siblings) 5 siblings, 2 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-26 23:30 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, peartben, christian.couder Teach fsck to not treat missing objects indirectly pointed to by refs as an error when extensions.lazyobject is set. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- builtin/fsck.c | 11 +++++++++++ t/t0410-lazy-object.sh | 27 +++++++++++++++++++++++++++ 2 files changed, 38 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index e29ff760b..238532cc2 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -149,6 +149,15 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt return 0; obj->flags |= REACHABLE; if (!(obj->flags & HAS_OBJ)) { + if (repository_format_lazy_object) + /* + * Return immediately; this is not an error, and further + * recursion does not need to be performed on this + * object since it is missing (so it does not need to be + * added to "pending"). + */ + return 0; + if (parent && !has_object_file(&obj->oid)) { printf("broken link from %7s %s\n", printable_type(parent), describe_object(parent)); @@ -212,6 +221,8 @@ static void check_reachable_object(struct object *obj) * do a full fsck */ if (!(obj->flags & HAS_OBJ)) { + if (repository_format_lazy_object) + return; if (has_sha1_pack(obj->oid.hash)) return; /* it is in pack - forget about it */ printf("missing %s %s\n", printable_type(obj), diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh index 00e1b4a88..45f665a15 100755 --- a/t/t0410-lazy-object.sh +++ b/t/t0410-lazy-object.sh @@ -49,4 +49,31 @@ test_expect_success '...but succeeds if lazyobject is set' ' git -C repo fsck ' +test_expect_success 'fsck fails on lazy object indirectly pointed to by ref' ' + rm -rf repo && + test_create_repo repo && + test_commit -C repo 1 && + test_commit -C repo 2 && + test_commit -C repo 3 && + git -C repo tag -a annotated_tag -m "annotated tag" && + + C=$(git -C repo rev-parse 1) && + T=$(git -C repo rev-parse 2^{tree}) && + B=$(git hash-object repo/3.t) && + AT=$(git -C repo rev-parse annotated_tag) && + + # missing commit, tree, blob, and tag + delete_object repo "$C" && + delete_object repo "$T" && + delete_object repo "$B" && + delete_object repo "$AT" && + test_must_fail git -C repo fsck +' + +test_expect_success '...but succeeds if lazyobject is set' ' + git -C repo config core.repositoryformatversion 1 && + git -C repo config extensions.lazyobject "arbitrary string" && + git -C repo fsck +' + test_done -- 2.14.0.rc0.400.g1c36432dff-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 3/4] fsck: support referenced lazy objects 2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan @ 2017-07-27 19:17 ` Junio C Hamano 2017-07-27 23:50 ` Jonathan Tan 2017-07-29 16:04 ` Junio C Hamano 1 sibling, 1 reply; 40+ messages in thread From: Junio C Hamano @ 2017-07-27 19:17 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder Jonathan Tan <jonathantanmy@google.com> writes: > Teach fsck to not treat missing objects indirectly pointed to by refs as > an error when extensions.lazyobject is set. > > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > --- > builtin/fsck.c | 11 +++++++++++ > t/t0410-lazy-object.sh | 27 +++++++++++++++++++++++++++ > 2 files changed, 38 insertions(+) > > diff --git a/builtin/fsck.c b/builtin/fsck.c > index e29ff760b..238532cc2 100644 > --- a/builtin/fsck.c > +++ b/builtin/fsck.c > @@ -149,6 +149,15 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt > return 0; > obj->flags |= REACHABLE; > if (!(obj->flags & HAS_OBJ)) { > + if (repository_format_lazy_object) > + /* > + * Return immediately; this is not an error, and further > + * recursion does not need to be performed on this > + * object since it is missing (so it does not need to be > + * added to "pending"). > + */ > + return 0; > + The same comment as 2/4 applies here. > @@ -212,6 +221,8 @@ static void check_reachable_object(struct object *obj) > * do a full fsck > */ > if (!(obj->flags & HAS_OBJ)) { > + if (repository_format_lazy_object) > + return; > if (has_sha1_pack(obj->oid.hash)) > return; /* it is in pack - forget about it */ > printf("missing %s %s\n", printable_type(obj), Also this reminds as a related issue. Imagine: - An object X was once retrieved, perhaps but not necessarily lazily, together with another object Y that is referred to by X (e.g. X is a tree, Y is a blob in the directory at path D, which is represented by X). - The same blob Y is added to the index in a different directory at path E. - The user decides to make this a slimmed-down "narrow clone" style repository and tells Git that path D is not interesting. We lose X, but not Y because Y is still referenced from the index. - "git reset --hard" happens, and there no longer is any reference to Y. Now, when we run fsck, should we diagnose Y as "unreachable and/or dangling"? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 3/4] fsck: support referenced lazy objects 2017-07-27 19:17 ` Junio C Hamano @ 2017-07-27 23:50 ` Jonathan Tan 0 siblings, 0 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-27 23:50 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peartben, christian.couder On Thu, 27 Jul 2017 12:17:37 -0700 Junio C Hamano <gitster@pobox.com> wrote: > The same comment as 2/4 applies here. Noted - whatever the resolution is, I'll apply it to all the patches. > > > @@ -212,6 +221,8 @@ static void check_reachable_object(struct object *obj) > > * do a full fsck > > */ > > if (!(obj->flags & HAS_OBJ)) { > > + if (repository_format_lazy_object) > > + return; > > if (has_sha1_pack(obj->oid.hash)) > > return; /* it is in pack - forget about it */ > > printf("missing %s %s\n", printable_type(obj), > > Also this reminds as a related issue. Imagine: > > - An object X was once retrieved, perhaps but not necessarily > lazily, together with another object Y that is referred to by X > (e.g. X is a tree, Y is a blob in the directory at path D, which > is represented by X). > > - The same blob Y is added to the index in a different directory at > path E. > > - The user decides to make this a slimmed-down "narrow clone" style > repository and tells Git that path D is not interesting. We lose > X, but not Y because Y is still referenced from the index. > > - "git reset --hard" happens, and there no longer is any reference > to Y. > > Now, when we run fsck, should we diagnose Y as "unreachable and/or > dangling"? I would say yes (or whatever happens in the case where we re-fetch into a shallow clone). Come to think of it..."git reset --hard" always has the potential to create unreachable objects, right (regardless of whether it's a "shallow clone" or "narrow clone" or ordinary clone)? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 3/4] fsck: support referenced lazy objects 2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan 2017-07-27 19:17 ` Junio C Hamano @ 2017-07-29 16:04 ` Junio C Hamano 1 sibling, 0 replies; 40+ messages in thread From: Junio C Hamano @ 2017-07-29 16:04 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder Jonathan Tan <jonathantanmy@google.com> writes: > Teach fsck to not treat missing objects indirectly pointed to by refs as > an error when extensions.lazyobject is set. I forgot to mention a potential flaw in this approach in my previous message. If you are a pure sightseer, then this is perfectly fine. The object store in your local Git client working in that mode is purely a cache, lazily populated while browsing the object store backed by the source of what lazy-object "hook" talks with. As long as that cache does not give us a corrupt object, we are OK, because missing objects do not matter. But once you start using the repository as more than a sightseer, you will have objects that only exist in your local "cache" and are not yet in that backing store behind the lazy-object "hook". You need to notice it when any of them goes corrupt or missing, or your next "git push" to send them over to a remote location will fail by definition because you are the only one with these objects. If we had the "promise" thing, then we could say that it is OK if traversal terminated at a "promised but not fetched yet" boundary, but we cannot afford the "promise", and more importantly, I do not think "promise" has to be the only approach to ensure that the objects that exist only in the local repository are all connected. For example, if we know that the remote 'origin' is the actual backing store lazy-object "hook" talks with, a validation rule to ensure that we haven't lost any local commit is to ensure that a traversal from our local branch tips down to remote-tracking branches taken from 'origin' must not hit _any_ missing commit. That covers only the commit objects. I do not know offhand if we can and how we extend this concept to protect the tags, trees and blobs we have locally generated and haven't pushed out, but you and Ben hopefully can come up with ways to cover them. ^ permalink raw reply [flat|nested] 40+ messages in thread
* [RFC PATCH 4/4] fsck: support lazy objects as CLI argument 2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan ` (2 preceding siblings ...) 2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan @ 2017-07-26 23:30 ` Jonathan Tan 2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan 5 siblings, 0 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-26 23:30 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, peartben, christian.couder Teach fsck to not treat missing objects provided on the CLI as an error when extensions.lazyobject is set. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- builtin/fsck.c | 2 ++ t/t0410-lazy-object.sh | 16 ++++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index 238532cc2..592e64172 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -755,6 +755,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) struct object *obj = lookup_object(oid.hash); if (!obj || !(obj->flags & HAS_OBJ)) { + if (repository_format_lazy_object) + continue; error("%s: object missing", oid_to_hex(&oid)); errors_found |= ERROR_OBJECT; continue; diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh index 45f665a15..3ac61c1c5 100755 --- a/t/t0410-lazy-object.sh +++ b/t/t0410-lazy-object.sh @@ -76,4 +76,20 @@ test_expect_success '...but succeeds if lazyobject is set' ' git -C repo fsck ' +test_expect_success 'fsck fails on lazy object directly given in command-line' ' + rm -rf repo && + test_create_repo repo && + test_commit -C repo 1 && + HASH=$(git hash-object repo/1.t) && + delete_object repo "$HASH" && + + test_must_fail git -C repo fsck "$HASH" +' + +test_expect_success '...but succeeds if lazyobject is set' ' + git -C repo config core.repositoryformatversion 1 && + git -C repo config extensions.lazyobject "arbitrary string" && + git -C repo fsck "$HASH" +' + test_done -- 2.14.0.rc0.400.g1c36432dff-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4] Some patches for fsck for missing objects 2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan ` (3 preceding siblings ...) 2017-07-26 23:30 ` [RFC PATCH 4/4] fsck: support lazy objects as CLI argument Jonathan Tan @ 2017-07-26 23:42 ` brian m. carlson 2017-07-27 0:24 ` Stefan Beller 2017-07-27 17:25 ` Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan 5 siblings, 2 replies; 40+ messages in thread From: brian m. carlson @ 2017-07-26 23:42 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder [-- Attachment #1: Type: text/plain, Size: 1234 bytes --] On Wed, Jul 26, 2017 at 04:29:58PM -0700, Jonathan Tan wrote: > This is an updated version of my previous patch [1], but without the > list of promises. It is somewhat different now because fsck cannot just > mark all promised objects as HAS_OBJ. > > This could be adapted and incorporated into patch sets that have objects > missing from the local repo. > > I split this up into 4 patches so that you can see how the changes in > fsck are being tested, but eventually these should probably be combined > into 1 patch. I looked at this and I like the direction it's going. It's pretty simple and straightforward, which I also like. What I'd recommend is that instead of making lazyObject a string, we make it an integer representing a protocol version. We then add a different config setting that is the actual program to invoke, using the given protocol version. This lets us change the way we speak to the tool without breaking backwards compatibility, and it also allows us to simply check the lazyObject script for supported protocols up front. -- brian m. carlson / brian with sandals: Houston, Texas, US https://www.crustytoothpaste.net/~bmc | My opinion only OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 868 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4] Some patches for fsck for missing objects 2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson @ 2017-07-27 0:24 ` Stefan Beller 2017-07-27 17:25 ` Jonathan Tan 1 sibling, 0 replies; 40+ messages in thread From: Stefan Beller @ 2017-07-27 0:24 UTC (permalink / raw) To: brian m. carlson, Jonathan Tan, git@vger.kernel.org, Ben Peart, Christian Couder On Wed, Jul 26, 2017 at 4:42 PM, brian m. carlson <sandals@crustytoothpaste.net> wrote: > On Wed, Jul 26, 2017 at 04:29:58PM -0700, Jonathan Tan wrote: >> This is an updated version of my previous patch [1], but without the >> list of promises. It is somewhat different now because fsck cannot just >> mark all promised objects as HAS_OBJ. >> >> This could be adapted and incorporated into patch sets that have objects >> missing from the local repo. >> >> I split this up into 4 patches so that you can see how the changes in >> fsck are being tested, but eventually these should probably be combined >> into 1 patch. > > I looked at this and I like the direction it's going. It's pretty > simple and straightforward, which I also like. I agree. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4] Some patches for fsck for missing objects 2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson 2017-07-27 0:24 ` Stefan Beller @ 2017-07-27 17:25 ` Jonathan Tan 2017-07-28 13:40 ` Ben Peart 1 sibling, 1 reply; 40+ messages in thread From: Jonathan Tan @ 2017-07-27 17:25 UTC (permalink / raw) To: brian m. carlson; +Cc: git, peartben, christian.couder On Wed, 26 Jul 2017 23:42:38 +0000 "brian m. carlson" <sandals@crustytoothpaste.net> wrote: > I looked at this and I like the direction it's going. It's pretty > simple and straightforward, which I also like. Thanks. > What I'd recommend is that instead of making lazyObject a string, we > make it an integer representing a protocol version. We then add a > different config setting that is the actual program to invoke, using the > given protocol version. This lets us change the way we speak to the > tool without breaking backwards compatibility, and it also allows us to > simply check the lazyObject script for supported protocols up front. That's possible too. As for version negotiation, I think we'll end up using a protocol similar to the clean/smudge long-running process protocol (as documented as gitattributes), so that does not need to be taken care of here, but making lazyObject be the version integer is fine too. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [RFC PATCH 0/4] Some patches for fsck for missing objects 2017-07-27 17:25 ` Jonathan Tan @ 2017-07-28 13:40 ` Ben Peart 0 siblings, 0 replies; 40+ messages in thread From: Ben Peart @ 2017-07-28 13:40 UTC (permalink / raw) To: Jonathan Tan, brian m. carlson; +Cc: git, christian.couder On 7/27/2017 1:25 PM, Jonathan Tan wrote: > On Wed, 26 Jul 2017 23:42:38 +0000 > "brian m. carlson" <sandals@crustytoothpaste.net> wrote: > >> I looked at this and I like the direction it's going. It's pretty >> simple and straightforward, which I also like. > ditto, simple is good! > Thanks. > >> What I'd recommend is that instead of making lazyObject a string, we >> make it an integer representing a protocol version. We then add a >> different config setting that is the actual program to invoke, using the >> given protocol version. This lets us change the way we speak to the >> tool without breaking backwards compatibility, and it also allows us to >> simply check the lazyObject script for supported protocols up front. > > That's possible too. As for version negotiation, I think we'll end up > using a protocol similar to the clean/smudge long-running process > protocol (as documented as gitattributes), so that does not need to be > taken care of here, but making lazyObject be the version integer is fine > too. > I was also thinking the way to retrieve the missing objects be a versioned negotiation via the long-running process protocol. That said, I'm a fan of including versions to make our life easier later if we decide to adopt an entirely different model for retrieving the missing objects. ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan ` (4 preceding siblings ...) 2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson @ 2017-07-31 21:02 ` Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 1/5] environment, fsck: introduce lazyobject extension Jonathan Tan ` (6 more replies) 5 siblings, 7 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder Besides review changes, this patch set now includes my rewritten lazy-loading sha1_file patch, so you can now do this (excerpted from one of the tests): test_create_repo server test_commit -C server 1 1.t abcdefgh HASH=$(git hash-object server/1.t) test_create_repo client test_must_fail git -C client cat-file -p "$HASH" git -C client config core.repositoryformatversion 1 git -C client config extensions.lazyobject \ "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\"" git -C client cat-file -p "$HASH" with fsck still working. Also, there is no need for a list of promised blobs, and the long-running process protocol is being used. Changes from v1: - added last patch that supports lazy loading - clarified documentation in "introduce lazyobject extension" patch (following Junio's comments [1]) As listed in the changes above, I have rewritten my lazy-loading sha1_file patch to no longer use the list of promises. Also, I have added documentation about the protocol used to (hopefully) the appropriate places. This is a minimal implementation, hopefully enough of a foundation to be built upon. In particular, I haven't added the environment variable to suppress lazy loading, and the lazy loading protocol only supports one object at a time. Other work ---------- This differs slightly from Ben Peart's patch [2] in that the lazy-loading functionality is provided through a configured shell command instead of a hook shell script. I envision commands like "git clone", in the future, needing to pre-configure lazy loading, and I think that it will be less surprising to the user if "git clone" wrote a default configuration instead of a default hook. This also differs from Christian Couder's patch set [3] that implement a larger-scale object database, in that (i) my patch set does not support putting objects into external databases, and (ii) my patch set requires the lazy loader to make the objects available in the local repo, instead of allowing the objects to only be stored in the external database. [1] https://public-inbox.org/git/xmqqzibpn1zh.fsf@gitster.mtv.corp.google.com/ [2] https://public-inbox.org/git/20170714132651.170708-2-benpeart@microsoft.com/ [3] https://public-inbox.org/git/20170620075523.26961-1-chriscool@tuxfamily.org/ Jonathan Tan (5): environment, fsck: introduce lazyobject extension fsck: support refs pointing to lazy objects fsck: support referenced lazy objects fsck: support lazy objects as CLI argument sha1_file: support loading lazy objects Documentation/Makefile | 1 + Documentation/gitattributes.txt | 54 ++-------- Documentation/gitrepository-layout.txt | 3 + .../technical/long-running-process-protocol.txt | 50 +++++++++ Documentation/technical/repository-version.txt | 23 +++++ Makefile | 1 + builtin/cat-file.c | 2 + builtin/fsck.c | 25 ++++- cache.h | 4 + environment.c | 1 + lazy-object.c | 80 +++++++++++++++ lazy-object.h | 12 +++ object.c | 7 ++ object.h | 13 +++ setup.c | 7 +- sha1_file.c | 44 +++++--- t/t0410-lazy-object.sh | 113 +++++++++++++++++++++ t/t0410/lazy-object | 102 +++++++++++++++++++ 18 files changed, 478 insertions(+), 64 deletions(-) create mode 100644 Documentation/technical/long-running-process-protocol.txt create mode 100644 lazy-object.c create mode 100644 lazy-object.h create mode 100755 t/t0410-lazy-object.sh create mode 100755 t/t0410/lazy-object -- 2.14.0.rc1.383.gd1ce394fe2-goog ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH v2 1/5] environment, fsck: introduce lazyobject extension 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan @ 2017-07-31 21:02 ` Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 2/5] fsck: support refs pointing to lazy objects Jonathan Tan ` (5 subsequent siblings) 6 siblings, 0 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder Currently, Git does not support repos with very large numbers of objects or repos that wish to minimize manipulation of certain blobs (for example, because they are very large) very well, even if the user operates mostly on part of the repo, because Git is designed on the assumption that every referenced object is available somewhere in the repo storage. Introduce a new repository extension option `extensions.lazyObject`. When it is set, Git does not treat missing objects as errors. The value of `extensions.lazyObject` must be a string, naming the command used to make a missing object appear whenever it is needed. Teach fsck about the new state of affairs. In this commit, teach fsck that missing objects referenced from the reflog are not an error case; in future commits, fsck will be taught about other cases. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- Documentation/technical/repository-version.txt | 7 ++++++ builtin/fsck.c | 2 +- cache.h | 2 ++ environment.c | 1 + setup.c | 7 +++++- t/t0410-lazy-object.sh | 32 ++++++++++++++++++++++++++ 6 files changed, 49 insertions(+), 2 deletions(-) create mode 100755 t/t0410-lazy-object.sh diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt index 00ad37986..f0482cae4 100644 --- a/Documentation/technical/repository-version.txt +++ b/Documentation/technical/repository-version.txt @@ -86,3 +86,10 @@ for testing format-1 compatibility. When the config key `extensions.preciousObjects` is set to `true`, objects in the repository MUST NOT be deleted (e.g., by `git-prune` or `git repack -d`). + +`lazyObject` +~~~~~~~~~~~~~~~~~ + +When the config key `extensions.lazyObject` is set, Git does not treat missing +objects as errors. The value of `extensions.lazyObject` must be a string, +naming the command used to make a missing object appear whenever it is needed. diff --git a/builtin/fsck.c b/builtin/fsck.c index a92f44818..b7e245654 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -402,7 +402,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid, xstrfmt("%s@{%"PRItime"}", refname, timestamp)); obj->flags |= USED; mark_object_reachable(obj); - } else { + } else if (!repository_format_lazy_object) { error("%s: invalid reflog entry %s", refname, oid_to_hex(oid)); errors_found |= ERROR_REACHABLE; } diff --git a/cache.h b/cache.h index 6c8242340..9e6bc0a21 100644 --- a/cache.h +++ b/cache.h @@ -853,10 +853,12 @@ extern int grafts_replace_parents; #define GIT_REPO_VERSION 0 #define GIT_REPO_VERSION_READ 1 extern int repository_format_precious_objects; +extern char *repository_format_lazy_object; struct repository_format { int version; int precious_objects; + char *lazy_object; int is_bare; char *work_tree; struct string_list unknown_extensions; diff --git a/environment.c b/environment.c index 3fd4b1084..cd8ef2897 100644 --- a/environment.c +++ b/environment.c @@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1; int warn_on_object_refname_ambiguity = 1; int ref_paranoia = -1; int repository_format_precious_objects; +char *repository_format_lazy_object; const char *git_commit_encoding; const char *git_log_output_encoding; const char *apply_default_whitespace; diff --git a/setup.c b/setup.c index 860507e1f..94cfde3cc 100644 --- a/setup.c +++ b/setup.c @@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata) ; else if (!strcmp(ext, "preciousobjects")) data->precious_objects = git_config_bool(var, value); - else + else if (!strcmp(ext, "lazyobject")) { + if (!value) + return config_error_nonbool(var); + data->lazy_object = xstrdup(value); + } else string_list_append(&data->unknown_extensions, ext); } else if (strcmp(var, "core.bare") == 0) { data->is_bare = git_config_bool(var, value); @@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok) } repository_format_precious_objects = candidate.precious_objects; + repository_format_lazy_object = candidate.lazy_object; string_list_clear(&candidate.unknown_extensions, 0); if (!has_common) { if (candidate.is_bare != -1) { diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh new file mode 100755 index 000000000..36442531f --- /dev/null +++ b/t/t0410-lazy-object.sh @@ -0,0 +1,32 @@ +#!/bin/sh + +test_description='lazy object' + +. ./test-lib.sh + +delete_object () { + rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40) +} + +test_expect_success 'fsck fails on lazy object in reflog' ' + test_create_repo repo && + test_commit -C repo 1 && + + A=$(git -C repo commit-tree -m a HEAD^{tree}) && + B=$(git -C repo commit-tree -m b HEAD^{tree}) && + + # Reference $A only from reflog, and delete it + git -C repo branch mybranch "$A" && + git -C repo branch -f mybranch "$B" && + delete_object repo "$A" && + + test_must_fail git -C repo fsck +' + +test_expect_success '...but succeeds if lazyobject is set' ' + git -C repo config core.repositoryformatversion 1 && + git -C repo config extensions.lazyobject "arbitrary string" && + git -C repo fsck +' + +test_done -- 2.14.0.rc1.383.gd1ce394fe2-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v2 2/5] fsck: support refs pointing to lazy objects 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 1/5] environment, fsck: introduce lazyobject extension Jonathan Tan @ 2017-07-31 21:02 ` Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 3/5] fsck: support referenced " Jonathan Tan ` (4 subsequent siblings) 6 siblings, 0 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder Teach fsck to not treat refs with missing targets as an error when extensions.lazyobject is set. For the purposes of warning about no default refs, such refs are still treated as legitimate refs. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- builtin/fsck.c | 8 ++++++++ t/t0410-lazy-object.sh | 20 ++++++++++++++++++++ 2 files changed, 28 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index b7e245654..38ed630d8 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -438,6 +438,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid, obj = parse_object(oid); if (!obj) { + if (repository_format_lazy_object) { + /* + * Increment default_refs anyway, because this is a + * valid ref. + */ + default_refs++; + return 0; + } error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid)); errors_found |= ERROR_REACHABLE; /* We'll continue with the rest despite the error.. */ diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh index 36442531f..00e1b4a88 100755 --- a/t/t0410-lazy-object.sh +++ b/t/t0410-lazy-object.sh @@ -29,4 +29,24 @@ test_expect_success '...but succeeds if lazyobject is set' ' git -C repo fsck ' +test_expect_success 'fsck fails on lazy object pointed to by ref' ' + rm -rf repo && + test_create_repo repo && + test_commit -C repo 1 && + + A=$(git -C repo commit-tree -m a HEAD^{tree}) && + + # Reference $A only from ref, and delete it + git -C repo branch mybranch "$A" && + delete_object repo "$A" && + + test_must_fail git -C repo fsck +' + +test_expect_success '...but succeeds if lazyobject is set' ' + git -C repo config core.repositoryformatversion 1 && + git -C repo config extensions.lazyobject "arbitrary string" && + git -C repo fsck +' + test_done -- 2.14.0.rc1.383.gd1ce394fe2-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v2 3/5] fsck: support referenced lazy objects 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 1/5] environment, fsck: introduce lazyobject extension Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 2/5] fsck: support refs pointing to lazy objects Jonathan Tan @ 2017-07-31 21:02 ` Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 4/5] fsck: support lazy objects as CLI argument Jonathan Tan ` (3 subsequent siblings) 6 siblings, 0 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder Teach fsck to not treat missing objects indirectly pointed to by refs as an error when extensions.lazyobject is set. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- builtin/fsck.c | 11 +++++++++++ t/t0410-lazy-object.sh | 27 +++++++++++++++++++++++++++ 2 files changed, 38 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index 38ed630d8..19681c5b3 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -149,6 +149,15 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt return 0; obj->flags |= REACHABLE; if (!(obj->flags & HAS_OBJ)) { + if (repository_format_lazy_object) + /* + * Return immediately; this is not an error, and further + * recursion does not need to be performed on this + * object since it is missing (so it does not need to be + * added to "pending"). + */ + return 0; + if (parent && !has_object_file(&obj->oid)) { printf("broken link from %7s %s\n", printable_type(parent), describe_object(parent)); @@ -212,6 +221,8 @@ static void check_reachable_object(struct object *obj) * do a full fsck */ if (!(obj->flags & HAS_OBJ)) { + if (repository_format_lazy_object) + return; if (has_sha1_pack(obj->oid.hash)) return; /* it is in pack - forget about it */ printf("missing %s %s\n", printable_type(obj), diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh index 00e1b4a88..45f665a15 100755 --- a/t/t0410-lazy-object.sh +++ b/t/t0410-lazy-object.sh @@ -49,4 +49,31 @@ test_expect_success '...but succeeds if lazyobject is set' ' git -C repo fsck ' +test_expect_success 'fsck fails on lazy object indirectly pointed to by ref' ' + rm -rf repo && + test_create_repo repo && + test_commit -C repo 1 && + test_commit -C repo 2 && + test_commit -C repo 3 && + git -C repo tag -a annotated_tag -m "annotated tag" && + + C=$(git -C repo rev-parse 1) && + T=$(git -C repo rev-parse 2^{tree}) && + B=$(git hash-object repo/3.t) && + AT=$(git -C repo rev-parse annotated_tag) && + + # missing commit, tree, blob, and tag + delete_object repo "$C" && + delete_object repo "$T" && + delete_object repo "$B" && + delete_object repo "$AT" && + test_must_fail git -C repo fsck +' + +test_expect_success '...but succeeds if lazyobject is set' ' + git -C repo config core.repositoryformatversion 1 && + git -C repo config extensions.lazyobject "arbitrary string" && + git -C repo fsck +' + test_done -- 2.14.0.rc1.383.gd1ce394fe2-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v2 4/5] fsck: support lazy objects as CLI argument 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan ` (2 preceding siblings ...) 2017-07-31 21:02 ` [PATCH v2 3/5] fsck: support referenced " Jonathan Tan @ 2017-07-31 21:02 ` Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan ` (2 subsequent siblings) 6 siblings, 0 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder Teach fsck to not treat missing objects provided on the CLI as an error when extensions.lazyobject is set. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- builtin/fsck.c | 2 ++ t/t0410-lazy-object.sh | 16 ++++++++++++++++ 2 files changed, 18 insertions(+) diff --git a/builtin/fsck.c b/builtin/fsck.c index 19681c5b3..20415902f 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -754,6 +754,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) struct object *obj = lookup_object(oid.hash); if (!obj || !(obj->flags & HAS_OBJ)) { + if (repository_format_lazy_object) + continue; error("%s: object missing", oid_to_hex(&oid)); errors_found |= ERROR_OBJECT; continue; diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh index 45f665a15..3ac61c1c5 100755 --- a/t/t0410-lazy-object.sh +++ b/t/t0410-lazy-object.sh @@ -76,4 +76,20 @@ test_expect_success '...but succeeds if lazyobject is set' ' git -C repo fsck ' +test_expect_success 'fsck fails on lazy object directly given in command-line' ' + rm -rf repo && + test_create_repo repo && + test_commit -C repo 1 && + HASH=$(git hash-object repo/1.t) && + delete_object repo "$HASH" && + + test_must_fail git -C repo fsck "$HASH" +' + +test_expect_success '...but succeeds if lazyobject is set' ' + git -C repo config core.repositoryformatversion 1 && + git -C repo config extensions.lazyobject "arbitrary string" && + git -C repo fsck "$HASH" +' + test_done -- 2.14.0.rc1.383.gd1ce394fe2-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH v2 5/5] sha1_file: support loading lazy objects 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan ` (3 preceding siblings ...) 2017-07-31 21:02 ` [PATCH v2 4/5] fsck: support lazy objects as CLI argument Jonathan Tan @ 2017-07-31 21:02 ` Jonathan Tan 2017-07-31 21:29 ` Junio C Hamano 2017-08-08 20:20 ` Ben Peart 2017-07-31 21:21 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Junio C Hamano 2017-08-08 17:13 ` Ben Peart 6 siblings, 2 replies; 40+ messages in thread From: Jonathan Tan @ 2017-07-31 21:02 UTC (permalink / raw) To: git; +Cc: Jonathan Tan, gitster, peartben, christian.couder Teach sha1_file to invoke the command configured in extensions.lazyObject whenever an object is requested and unavailable. The usage of the hook can be suppressed through a flag when invoking has_object_file_with_flags() and other similar functions. This is meant as a temporary measure to ensure that all Git commands work in such a situation. Future patches will update some commands to either tolerate missing objects (without invoking the command) or be more efficient in invoking this command. In order to determine the code changes in sha1_file.c necessary, I investigated the following: (1) functions in sha1_file that take in a hash, without the user regarding how the object is stored (loose or packed) (2) functions in sha1_file that operate on packed objects (because I need to check callers that know about the loose/packed distinction and operate on both differently, and ensure that they can handle the concept of objects that are neither loose nor packed) (1) is handled by the modification to sha1_object_info_extended(). For (2), I looked at for_each_packed_object and at the packed-related functions that take in a hash. For for_each_packed_object, the callers either already work or are fixed in this patch: - reachable - only to find recent objects - builtin/fsck - already knows about missing objects - builtin/cat-file - warning message added in this commit Callers of the other functions do not need to be changed: - parse_pack_index - http - indirectly from http_get_info_packs - find_pack_entry_one - this searches a single pack that is provided as an argument; the caller already knows (through other means) that the sought object is in a specific pack - find_sha1_pack - fast-import - appears to be an optimization to not store a file if it is already in a pack - http-walker - to search through a struct alt_base - http-push - to search through remote packs - has_sha1_pack - builtin/fsck - already knows about promised objects - builtin/count-objects - informational purposes only (check if loose object is also packed) - builtin/prune-packed - check if object to be pruned is packed (if not, don't prune it) - revision - used to exclude packed objects if requested by user - diff - just for optimization An alternative design that I considered but rejected: - Adding a hook whenever a packed object is requested, not on any object. That is, whenever we attempt to search the packfiles for an object, if it is missing (from the packfiles and from the loose object storage), to invoke the hook (which must then store it as a packfile), open the packfile the hook generated, and report that the object is found in that new packfile. This reduces the amount of analysis needed (in that we only need to look at how packed objects are handled), but requires that the hook generate packfiles (or for sha1_file to pack whatever loose objects are generated), creating one packfile for each missing object and potentially very many packfiles that must be linearly searched. This may be tolerable now for repos that only have a few missing objects (for example, repos that only want to exclude large blobs), and might be tolerable in the future if we have batching support for the most commonly used commands, but is not tolerable now for repos that exclude a large amount of objects. Helped-by: Ben Peart <benpeart@microsoft.com> Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- Documentation/Makefile | 1 + Documentation/gitattributes.txt | 54 ++--------- Documentation/gitrepository-layout.txt | 3 + .../technical/long-running-process-protocol.txt | 50 ++++++++++ Documentation/technical/repository-version.txt | 16 ++++ Makefile | 1 + builtin/cat-file.c | 2 + builtin/fsck.c | 2 +- cache.h | 2 + lazy-object.c | 80 ++++++++++++++++ lazy-object.h | 12 +++ object.c | 7 ++ object.h | 13 +++ sha1_file.c | 44 ++++++--- t/t0410-lazy-object.sh | 18 ++++ t/t0410/lazy-object | 102 +++++++++++++++++++++ 16 files changed, 345 insertions(+), 62 deletions(-) create mode 100644 Documentation/technical/long-running-process-protocol.txt create mode 100644 lazy-object.c create mode 100644 lazy-object.h create mode 100755 t/t0410/lazy-object diff --git a/Documentation/Makefile b/Documentation/Makefile index 2415e0d65..5657d49c2 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -69,6 +69,7 @@ SP_ARTICLES += $(API_DOCS) TECH_DOCS += technical/http-protocol TECH_DOCS += technical/index-format +TECH_DOCS += technical/long-running-process-protocol TECH_DOCS += technical/pack-format TECH_DOCS += technical/pack-heuristics TECH_DOCS += technical/pack-protocol diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index c4f2be254..ed5e8204e 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -387,46 +387,14 @@ Long Running Filter Process If the filter command (a string value) is defined via `filter.<driver>.process` then Git can process all blobs with a single filter invocation for the entire life of a single Git -command. This is achieved by using a packet format (pkt-line, -see technical/protocol-common.txt) based protocol over standard -input and standard output as follows. All packets, except for the -"*CONTENT" packets and the "0000" flush packet, are considered -text and therefore are terminated by a LF. - -Git starts the filter when it encounters the first file -that needs to be cleaned or smudged. After the filter started -Git sends a welcome message ("git-filter-client"), a list of supported -protocol version numbers, and a flush packet. Git expects to read a welcome -response message ("git-filter-server"), exactly one protocol version number -from the previously sent list, and a flush packet. All further -communication will be based on the selected version. The remaining -protocol description below documents "version=2". Please note that -"version=42" in the example below does not exist and is only there -to illustrate how the protocol would look like with more than one -version. - -After the version negotiation Git sends a list of all capabilities that -it supports and a flush packet. Git expects to read a list of desired -capabilities, which must be a subset of the supported capabilities list, -and a flush packet as response: ------------------------- -packet: git> git-filter-client -packet: git> version=2 -packet: git> version=42 -packet: git> 0000 -packet: git< git-filter-server -packet: git< version=2 -packet: git< 0000 -packet: git> capability=clean -packet: git> capability=smudge -packet: git> capability=not-yet-invented -packet: git> 0000 -packet: git< capability=clean -packet: git< capability=smudge -packet: git< 0000 ------------------------- -Supported filter capabilities in version 2 are "clean", "smudge", -and "delay". +command. This is achieved by using the long-running process protocol +(described in technical/long-running-process-protocol.txt). + +When Git encounters the first file that needs to be cleaned or smudged, +it starts the filter and performs the handshake. In the handshake, the +welcome message sent by Git is "git-filter-client", only version 2 is +suppported, and the supported capabilities are "clean", "smudge", and +"delay". Afterwards Git sends a list of "key=value" pairs terminated with a flush packet. The list will contain at least the filter command @@ -512,12 +480,6 @@ the protocol then Git will stop the filter process and restart it with the next file that needs to be processed. Depending on the `filter.<driver>.required` flag Git will interpret that as error. -After the filter has processed a command it is expected to wait for -a "key=value" list containing the next command. Git will close -the command pipe on exit. The filter is expected to detect EOF -and exit gracefully on its own. Git will wait until the filter -process has stopped. - Delay ^^^^^ diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt index f51ed4e37..a690fdcd5 100644 --- a/Documentation/gitrepository-layout.txt +++ b/Documentation/gitrepository-layout.txt @@ -47,6 +47,9 @@ use with dumb transports but otherwise is OK as long as `objects/info/alternates` points at the object stores it borrows from. + +. If `extensions.lazyObject` is set, some objects could be missing, to +be lazily loaded by Git when required. ++ This directory is ignored if $GIT_COMMON_DIR is set and "$GIT_COMMON_DIR/objects" will be used instead. diff --git a/Documentation/technical/long-running-process-protocol.txt b/Documentation/technical/long-running-process-protocol.txt new file mode 100644 index 000000000..aa0aa9af1 --- /dev/null +++ b/Documentation/technical/long-running-process-protocol.txt @@ -0,0 +1,50 @@ +Long-running process protocol +============================= + +This protocol is used when Git needs to communicate with an external +process throughout the entire life of a single Git command. All +communication is in pkt-line format (see technical/protocol-common.txt) +over standard input and standard output. + +Handshake +--------- + +Git starts by sending a welcome message (for example, +"git-filter-client"), a list of supported protocol version numbers, and +a flush packet. Git expects to read the welcome message with "server" +instead of "client" (for example, "git-filter-server"), exactly one +protocol version number from the previously sent list, and a flush +packet. All further communication will be based on the selected version. +The remaining protocol description below documents "version=2". Please +note that "version=42" in the example below does not exist and is only +there to illustrate how the protocol would look like with more than one +version. + +After the version negotiation Git sends a list of all capabilities that +it supports and a flush packet. Git expects to read a list of desired +capabilities, which must be a subset of the supported capabilities list, +and a flush packet as response: +------------------------ +packet: git> git-filter-client +packet: git> version=2 +packet: git> version=42 +packet: git> 0000 +packet: git< git-filter-server +packet: git< version=2 +packet: git< 0000 +packet: git> capability=clean +packet: git> capability=smudge +packet: git> capability=not-yet-invented +packet: git> 0000 +packet: git< capability=clean +packet: git< capability=smudge +packet: git< 0000 +------------------------ + +Shutdown +-------- + +Git will close +the command pipe on exit. The filter is expected to detect EOF +and exit gracefully on its own. Git will wait until the filter +process has stopped. diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt index f0482cae4..38c8f966e 100644 --- a/Documentation/technical/repository-version.txt +++ b/Documentation/technical/repository-version.txt @@ -93,3 +93,19 @@ objects in the repository MUST NOT be deleted (e.g., by `git-prune` or When the config key `extensions.lazyObject` is set, Git does not treat missing objects as errors. The value of `extensions.lazyObject` must be a string, naming the command used to make a missing object appear whenever it is needed. + +Git communicates with this command using the long-running process +protocol described in technical/long-running-process-protocol.txt. +During the handshake, the welcome message used is "git-lazy-object", the +only supported version is 1, and the only supported capability is "get". +After the handshake, for each object to be loaded, Git will send +"command=get\n", "sha1=<sha1>\n", and a flush packet, and will expect +"status=success\n" or "status=error\n" and a flush packet. For example: + +------------------------ +packet: git> command=get +packet: git> sha1=1234567890123456789012345678901234567890 +packet: git> 0000 +packet: git< status=success +packet: git< 0000 +------------------------ diff --git a/Makefile b/Makefile index cd7d36c7d..3cac24775 100644 --- a/Makefile +++ b/Makefile @@ -800,6 +800,7 @@ LIB_OBJS += help.o LIB_OBJS += hex.o LIB_OBJS += ident.o LIB_OBJS += kwset.o +LIB_OBJS += lazy-object.o LIB_OBJS += levenshtein.o LIB_OBJS += line-log.o LIB_OBJS += line-range.o diff --git a/builtin/cat-file.c b/builtin/cat-file.c index 62c8cf0eb..ef016b84e 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -473,6 +473,8 @@ static int batch_objects(struct batch_options *opt) for_each_loose_object(batch_loose_object, &sa, 0); for_each_packed_object(batch_packed_object, &sa, 0); + if (repository_format_lazy_object) + warning("This repository has extensions.lazyObject set. Some objects may not be loaded."); cb.opt = opt; cb.expand = &data; diff --git a/builtin/fsck.c b/builtin/fsck.c index 20415902f..327b087c2 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -447,7 +447,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid, { struct object *obj; - obj = parse_object(oid); + obj = parse_maybe_missing_object(oid); if (!obj) { if (repository_format_lazy_object) { /* diff --git a/cache.h b/cache.h index 9e6bc0a21..375f5b700 100644 --- a/cache.h +++ b/cache.h @@ -1837,6 +1837,8 @@ struct object_info { #define OBJECT_INFO_SKIP_CACHED 4 /* Do not retry packed storage after checking packed and loose storage */ #define OBJECT_INFO_QUICK 8 +/* Never invoke extensions.lazyObject, even if object is missing */ +#define OBJECT_INFO_NO_LAZY 16 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags); extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *); diff --git a/lazy-object.c b/lazy-object.c new file mode 100644 index 000000000..d4b271a4e --- /dev/null +++ b/lazy-object.c @@ -0,0 +1,80 @@ +#include "cache.h" +#include "lazy-object.h" +#include "sha1-lookup.h" +#include "strbuf.h" +#include "run-command.h" +#include "sha1-array.h" +#include "config.h" +#include "sigchain.h" +#include "sub-process.h" +#include "pkt-line.h" + +#define CAP_GET (1u<<0) + +static struct hashmap subprocess_map; + +int start_lazy_object_fn(struct subprocess_entry *subprocess) +{ + static int versions[] = {1, 0}; + static struct subprocess_capability capabilities[] = { + { "get", CAP_GET }, + { NULL, 0 } + }; + + unsigned int supported_capabilities = 0; + int ret = subprocess_handshake(subprocess, "git-lazy-object", versions, + NULL, capabilities, + &supported_capabilities); + if (ret) + return ret; + return supported_capabilities != CAP_GET; +} + +void load_lazy_object(const unsigned char *sha1) +{ + struct subprocess_entry *entry; + struct child_process *process; + struct strbuf status = STRBUF_INIT; + + if (!repository_format_lazy_object) + die("BUG: extensions.lazyObject is not set"); + + if (!subprocess_map.tablesize) { + hashmap_init(&subprocess_map, + (hashmap_cmp_fn) cmd2process_cmp, NULL, 0); + entry = NULL; + } else { + entry = subprocess_find_entry(&subprocess_map, + repository_format_lazy_object); + } + if (!entry) { + entry = xmalloc(sizeof(*entry)); + + if (subprocess_start(&subprocess_map, entry, + repository_format_lazy_object, + start_lazy_object_fn)) { + free(entry); + die("could not start '%s'", repository_format_lazy_object); + } + } + process = subprocess_get_child_process(entry); + + packet_write_fmt(process->in, "command=get\n"); + packet_write_fmt(process->in, "sha1=%s\n", sha1_to_hex(sha1)); + packet_flush(process->in); + if (subprocess_read_status(process->out, &status)) + die("Could not read status from '%s'", repository_format_lazy_object); + + if (!strcmp(status.buf, "success")) + return; + if (!strcmp(status.buf, "error")) + die("external process '%s' returned error", + repository_format_lazy_object); + die("external process '%s' failed", repository_format_lazy_object); + + /* + * The command above may have updated packfiles, so update our record + * of them. + */ + reprepare_packed_git(); +} diff --git a/lazy-object.h b/lazy-object.h new file mode 100644 index 000000000..1ec6ff203 --- /dev/null +++ b/lazy-object.h @@ -0,0 +1,12 @@ +#ifndef LAZY_OBJECT_H +#define LAZY_OBJECT_H + +#include "cache.h" +#include "sha1-array.h" + +/* + * Invokes repository_format_lazy_object to load the given missing object. + */ +void load_lazy_object(const unsigned char *sha1); + +#endif diff --git a/object.c b/object.c index 321d7e920..e3e734f67 100644 --- a/object.c +++ b/object.c @@ -279,6 +279,13 @@ struct object *parse_object(const struct object_id *oid) return NULL; } +struct object *parse_maybe_missing_object(const struct object_id *oid) +{ + if (has_object_file_with_flags(oid, OBJECT_INFO_NO_LAZY)) + return parse_object(oid); + return NULL; +} + struct object_list *object_list_insert(struct object *item, struct object_list **list_p) { diff --git a/object.h b/object.h index 0a419ba8d..b98e73572 100644 --- a/object.h +++ b/object.h @@ -87,10 +87,23 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet); /* * Returns the object, having parsed it to find out what it is. * + * This function automatically invokes the command in extensions.lazyObject if + * the object is missing and extensions.lazyObject is set. + * * Returns NULL if the object is missing or corrupt. */ struct object *parse_object(const struct object_id *oid); +/* + * Returns the object, having parsed it to find out what it is. + * + * This function never invokes the command in extensions.lazyObject, unlike + * parse_object(). + * + * Returns NULL if the object is missing or corrupt. + */ +struct object *parse_maybe_missing_object(const struct object_id *oid); + /* * Like parse_object, but will die() instead of returning NULL. If the * "name" parameter is not NULL, it is included in the error message diff --git a/sha1_file.c b/sha1_file.c index b60ae15f7..1785c61d8 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -28,6 +28,11 @@ #include "list.h" #include "mergesort.h" #include "quote.h" +#include "iterator.h" +#include "dir-iterator.h" +#include "sha1-lookup.h" +#include "lazy-object.h" +#include "sha1-array.h" #define SZ_FMT PRIuMAX static inline uintmax_t sz_fmt(size_t s) { return s; } @@ -2984,6 +2989,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ? lookup_replace_object(sha1) : sha1; + int already_retried = 0; if (!oi) oi = &blank_oi; @@ -3008,30 +3014,38 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, } } - if (!find_pack_entry(real, &e)) { - /* Most likely it's a loose object. */ - if (!sha1_loose_object_info(real, oi, flags)) { - oi->whence = OI_LOOSE; - return 0; - } +retry: + if (find_pack_entry(real, &e)) + goto found_packed; - /* Not a loose object; someone else may have just packed it. */ - if (flags & OBJECT_INFO_QUICK) { - return -1; - } else { - reprepare_packed_git(); - if (!find_pack_entry(real, &e)) - return -1; - } + /* Most likely it's a loose object. */ + if (!sha1_loose_object_info(real, oi, flags)) { + oi->whence = OI_LOOSE; + return 0; } + /* Not a loose object; someone else may have just packed it. */ + reprepare_packed_git(); + if (find_pack_entry(real, &e)) + goto found_packed; + + /* Check if it is a missing object */ + if (repository_format_lazy_object && !already_retried && + !(flags & OBJECT_INFO_NO_LAZY)) { + load_lazy_object(real); + already_retried = 1; + goto retry; + } + + return -1; + +found_packed: if (oi == &blank_oi) /* * We know that the caller doesn't actually need the * information below, so return early. */ return 0; - rtype = packed_object_info(e.p, e.offset, oi); if (rtype < 0) { mark_bad_packed_object(e.p, real); diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh index 3ac61c1c5..d8888baa0 100755 --- a/t/t0410-lazy-object.sh +++ b/t/t0410-lazy-object.sh @@ -92,4 +92,22 @@ test_expect_success '...but succeeds if lazyobject is set' ' git -C repo fsck "$HASH" ' +test_expect_success 'sha1_object_info_extended (through git cat-file)' ' + rm -rf repo client && + + test_create_repo server && + test_commit -C server 1 1.t abcdefgh && + HASH=$(git hash-object server/1.t) && + + test_create_repo client && + test_must_fail git -C client cat-file -p "$HASH" +' + +test_expect_success '...but succeeds if lazyobject is set' ' + git -C client config core.repositoryformatversion 1 && + git -C client config extensions.lazyobject \ + "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\"" && + git -C client cat-file -p "$HASH" +' + test_done diff --git a/t/t0410/lazy-object b/t/t0410/lazy-object new file mode 100755 index 000000000..4f4a9c38a --- /dev/null +++ b/t/t0410/lazy-object @@ -0,0 +1,102 @@ +#!/usr/bin/perl +# +# Example implementation for the Git lazyObject protocol version 1. See +# the documentation for extensions.lazyObject in +# Documentation/technical/repository-version.txt +# +# Allows you to test the ability for blobs to be pulled from a host git repo +# "on demand." Called when git needs a blob it couldn't find locally due to +# a lazy clone that only cloned the commits and trees. +# +# Please note, this sample is a minimal skeleton. No proper error handling +# was implemented. + +use strict; +use warnings; + +# +# Point $DIR to the folder where your host git repo is located so we can pull +# missing objects from it +# +my $DIR = $ARGV[0]; + +sub packet_bin_read { + my $buffer; + my $bytes_read = read STDIN, $buffer, 4; + if ( $bytes_read == 0 ) { + + # EOF - Git stopped talking to us! + exit(); + } + elsif ( $bytes_read != 4 ) { + die "invalid packet: '$buffer'"; + } + my $pkt_size = hex($buffer); + if ( $pkt_size == 0 ) { + return ( 1, "" ); + } + elsif ( $pkt_size > 4 ) { + my $content_size = $pkt_size - 4; + $bytes_read = read STDIN, $buffer, $content_size; + if ( $bytes_read != $content_size ) { + die "invalid packet ($content_size bytes expected; $bytes_read bytes read)"; + } + return ( 0, $buffer ); + } + else { + die "invalid packet size: $pkt_size"; + } +} + +sub packet_txt_read { + my ( $res, $buf ) = packet_bin_read(); + unless ( $buf =~ s/\n$// ) { + die "A non-binary line MUST be terminated by an LF."; + } + return ( $res, $buf ); +} + +sub packet_bin_write { + my $buf = shift; + print STDOUT sprintf( "%04x", length($buf) + 4 ); + print STDOUT $buf; + STDOUT->flush(); +} + +sub packet_txt_write { + packet_bin_write( $_[0] . "\n" ); +} + +sub packet_flush { + print STDOUT sprintf( "%04x", 0 ); + STDOUT->flush(); +} + +( packet_txt_read() eq ( 0, "git-lazy-object-client" ) ) || die "bad initialize"; +( packet_txt_read() eq ( 0, "version=1" ) ) || die "bad version"; +( packet_bin_read() eq ( 1, "" ) ) || die "bad version end"; + +packet_txt_write("git-lazy-object-server"); +packet_txt_write("version=1"); +packet_flush(); + +( packet_txt_read() eq ( 0, "capability=get" ) ) || die "bad capability"; +( packet_bin_read() eq ( 1, "" ) ) || die "bad capability end"; + +packet_txt_write("capability=get"); +packet_flush(); + +while (1) { + my ($command) = packet_txt_read() =~ /^command=([^=]+)$/; + + if ( $command eq "get" ) { + my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/; + packet_bin_read(); + + system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c extensions.lazyobject=false hash-object -w --stdin >/dev/null 2>&1'); + packet_txt_write(($?) ? "status=error" : "status=success"); + packet_flush(); + } else { + die "bad command '$command'"; + } +} -- 2.14.0.rc1.383.gd1ce394fe2-goog ^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH v2 5/5] sha1_file: support loading lazy objects 2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan @ 2017-07-31 21:29 ` Junio C Hamano 2017-08-08 20:20 ` Ben Peart 1 sibling, 0 replies; 40+ messages in thread From: Junio C Hamano @ 2017-07-31 21:29 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder Jonathan Tan <jonathantanmy@google.com> writes: > Teach sha1_file to invoke the command configured in > extensions.lazyObject whenever an object is requested and unavailable. > > The usage of the hook can be suppressed through a flag when invoking > has_object_file_with_flags() and other similar functions. > > This is meant as a temporary measure to ensure that all Git commands > work in such a situation. Future patches will update some commands to > either tolerate missing objects (without invoking the command) or be > more efficient in invoking this command. > > In order to determine the code changes in sha1_file.c necessary, I > investigated the following: > (1) functions in sha1_file that take in a hash, without the user > regarding how the object is stored (loose or packed) > (2) functions in sha1_file that operate on packed objects (because I > need to check callers that know about the loose/packed distinction > and operate on both differently, and ensure that they can handle > the concept of objects that are neither loose nor packed) > > (1) is handled by the modification to sha1_object_info_extended(). > > For (2), I looked at for_each_packed_object and at the packed-related > functions that take in a hash. For for_each_packed_object, the callers > either already work or are fixed in this patch: > - reachable - only to find recent objects > - builtin/fsck - already knows about missing objects > - builtin/cat-file - warning message added in this commit > > Callers of the other functions do not need to be changed: > - parse_pack_index > - http - indirectly from http_get_info_packs > - find_pack_entry_one > - this searches a single pack that is provided as an argument; the > caller already knows (through other means) that the sought object > is in a specific pack > - find_sha1_pack > - fast-import - appears to be an optimization to not store a > file if it is already in a pack > - http-walker - to search through a struct alt_base > - http-push - to search through remote packs > - has_sha1_pack > - builtin/fsck - already knows about promised objects > - builtin/count-objects - informational purposes only (check if loose > object is also packed) > - builtin/prune-packed - check if object to be pruned is packed (if > not, don't prune it) > - revision - used to exclude packed objects if requested by user > - diff - just for optimization > > An alternative design that I considered but rejected: > > - Adding a hook whenever a packed object is requested, not on any > object. That is, whenever we attempt to search the packfiles for an > object, if it is missing (from the packfiles and from the loose > object storage), to invoke the hook (which must then store it as a > packfile), open the packfile the hook generated, and report that the > object is found in that new packfile. This reduces the amount of > analysis needed (in that we only need to look at how packed objects > are handled), but requires that the hook generate packfiles (or for > sha1_file to pack whatever loose objects are generated), creating one > packfile for each missing object and potentially very many packfiles > that must be linearly searched. This may be tolerable now for repos > that only have a few missing objects (for example, repos that only > want to exclude large blobs), and might be tolerable in the future if > we have batching support for the most commonly used commands, but is > not tolerable now for repos that exclude a large amount of objects. > > Helped-by: Ben Peart <benpeart@microsoft.com> > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > --- Even though I said a hugely negative thing about the "missing objects are always OK" butchering of fsck, I do like what this patch does. The interface is reasonably well isolated, and moving of the long-running-process documentation to a standalone file is very sensible. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 5/5] sha1_file: support loading lazy objects 2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan 2017-07-31 21:29 ` Junio C Hamano @ 2017-08-08 20:20 ` Ben Peart 1 sibling, 0 replies; 40+ messages in thread From: Ben Peart @ 2017-08-08 20:20 UTC (permalink / raw) To: Jonathan Tan, git; +Cc: gitster, christian.couder On 7/31/2017 5:02 PM, Jonathan Tan wrote: > Teach sha1_file to invoke the command configured in > extensions.lazyObject whenever an object is requested and unavailable. > > The usage of the hook can be suppressed through a flag when invoking > has_object_file_with_flags() and other similar functions. > > This is meant as a temporary measure to ensure that all Git commands > work in such a situation. Future patches will update some commands to > either tolerate missing objects (without invoking the command) or be > more efficient in invoking this command. To prevent fetch from downloading all missing objects, you will also need to add logic in check_connected. The simplest model is to simply return 0 if repository_format_lazy_object is set. /* * Running a with lazy_objects there will be objects that are * missing locally and we don't want to download a bunch of * commits, trees, and blobs just to make sure everything is * reachable locally so this option will skip reachablility * checks below that use rev-list. This will stop the check * before uploadpack runs to determine if there is anything to * fetch. Returning zero for the first check will also prevent the * uploadpack from happening. It will also skip the check after * the fetch is finished to make sure all the objects where * downloaded in the pack file. This will allow the fetch to * run and get all the latest tip commit ids for all the branches * in the fetch but not pull down commits, trees, or blobs via * upload pack. */ if (repository_format_lazy_object) return 0; [...] > diff --git a/sha1_file.c b/sha1_file.c > index b60ae15f7..1785c61d8 100644 > --- a/sha1_file.c > +++ b/sha1_file.c > @@ -28,6 +28,11 @@ > #include "list.h" > #include "mergesort.h" > #include "quote.h" > +#include "iterator.h" > +#include "dir-iterator.h" > +#include "sha1-lookup.h" > +#include "lazy-object.h" > +#include "sha1-array.h" > > #define SZ_FMT PRIuMAX > static inline uintmax_t sz_fmt(size_t s) { return s; } > @@ -2984,6 +2989,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, > const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ? > lookup_replace_object(sha1) : > sha1; > + int already_retried = 0; > > if (!oi) > oi = &blank_oi; > @@ -3008,30 +3014,38 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, > } > } > > - if (!find_pack_entry(real, &e)) { > - /* Most likely it's a loose object. */ > - if (!sha1_loose_object_info(real, oi, flags)) { > - oi->whence = OI_LOOSE; > - return 0; > - } > +retry: > + if (find_pack_entry(real, &e)) > + goto found_packed; > > - /* Not a loose object; someone else may have just packed it. */ > - if (flags & OBJECT_INFO_QUICK) { > - return -1; > - } else { > - reprepare_packed_git(); > - if (!find_pack_entry(real, &e)) > - return -1; > - } > + /* Most likely it's a loose object. */ > + if (!sha1_loose_object_info(real, oi, flags)) { > + oi->whence = OI_LOOSE; > + return 0; > } > > + /* Not a loose object; someone else may have just packed it. */ > + reprepare_packed_git(); > + if (find_pack_entry(real, &e)) > + goto found_packed; Same feedback as before. I like to avoid using goto's as flow control other than in error handling. Also, this patch looses the OBJECT_INFO_QUICK logic which could be restored. [...] > diff --git a/t/t0410/lazy-object b/t/t0410/lazy-object > new file mode 100755 > index 000000000..4f4a9c38a > --- /dev/null > +++ b/t/t0410/lazy-object > @@ -0,0 +1,102 @@ > +#!/usr/bin/perl > +# > +# Example implementation for the Git lazyObject protocol version 1. See > +# the documentation for extensions.lazyObject in > +# Documentation/technical/repository-version.txt > +# > +# Allows you to test the ability for blobs to be pulled from a host git repo > +# "on demand." Called when git needs a blob it couldn't find locally due to > +# a lazy clone that only cloned the commits and trees. > +# > +# Please note, this sample is a minimal skeleton. No proper error handling > +# was implemented. > + > +use strict; > +use warnings; > + > +# > +# Point $DIR to the folder where your host git repo is located so we can pull > +# missing objects from it > +# > +my $DIR = $ARGV[0]; > + At some point, this should be based on the refactored pkt_* functions currently contained in the ObjectDB patch series. > +sub packet_bin_read { > + my $buffer; > + my $bytes_read = read STDIN, $buffer, 4; > + if ( $bytes_read == 0 ) { > + > + # EOF - Git stopped talking to us! > + exit(); > + } > + elsif ( $bytes_read != 4 ) { > + die "invalid packet: '$buffer'"; > + } > + my $pkt_size = hex($buffer); > + if ( $pkt_size == 0 ) { > + return ( 1, "" ); > + } > + elsif ( $pkt_size > 4 ) { > + my $content_size = $pkt_size - 4; > + $bytes_read = read STDIN, $buffer, $content_size; > + if ( $bytes_read != $content_size ) { > + die "invalid packet ($content_size bytes expected; $bytes_read bytes read)"; > + } > + return ( 0, $buffer ); > + } > + else { > + die "invalid packet size: $pkt_size"; > + } > +} > + > +sub packet_txt_read { > + my ( $res, $buf ) = packet_bin_read(); > + unless ( $buf =~ s/\n$// ) { > + die "A non-binary line MUST be terminated by an LF."; > + } > + return ( $res, $buf ); > +} > + > +sub packet_bin_write { > + my $buf = shift; > + print STDOUT sprintf( "%04x", length($buf) + 4 ); > + print STDOUT $buf; > + STDOUT->flush(); > +} > + > +sub packet_txt_write { > + packet_bin_write( $_[0] . "\n" ); > +} > + > +sub packet_flush { > + print STDOUT sprintf( "%04x", 0 ); > + STDOUT->flush(); > +} > + > +( packet_txt_read() eq ( 0, "git-lazy-object-client" ) ) || die "bad initialize"; > +( packet_txt_read() eq ( 0, "version=1" ) ) || die "bad version"; > +( packet_bin_read() eq ( 1, "" ) ) || die "bad version end"; > + > +packet_txt_write("git-lazy-object-server"); > +packet_txt_write("version=1"); > +packet_flush(); > + > +( packet_txt_read() eq ( 0, "capability=get" ) ) || die "bad capability"; > +( packet_bin_read() eq ( 1, "" ) ) || die "bad capability end"; > + > +packet_txt_write("capability=get"); > +packet_flush(); > + > +while (1) { > + my ($command) = packet_txt_read() =~ /^command=([^=]+)$/; > + > + if ( $command eq "get" ) { > + my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/; > + packet_bin_read(); > + > + system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c extensions.lazyobject=false hash-object -w --stdin >/dev/null 2>&1'); > + packet_txt_write(($?) ? "status=error" : "status=success"); > + packet_flush(); > + } else { > + die "bad command '$command'"; > + } > +} > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan ` (4 preceding siblings ...) 2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan @ 2017-07-31 21:21 ` Junio C Hamano 2017-07-31 23:05 ` Jonathan Tan 2017-08-08 17:13 ` Ben Peart 6 siblings, 1 reply; 40+ messages in thread From: Junio C Hamano @ 2017-07-31 21:21 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder Jonathan Tan <jonathantanmy@google.com> writes: > Besides review changes, this patch set now includes my rewritten > lazy-loading sha1_file patch, so you can now do this (excerpted from one > of the tests): > > test_create_repo server > test_commit -C server 1 1.t abcdefgh > HASH=$(git hash-object server/1.t) > > test_create_repo client > test_must_fail git -C client cat-file -p "$HASH" > git -C client config core.repositoryformatversion 1 > git -C client config extensions.lazyobject \ > "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\"" > git -C client cat-file -p "$HASH" > > with fsck still working. Also, there is no need for a list of promised > blobs, and the long-running process protocol is being used. I do not think I read your response to my last comment on this series, so I could be missing something large, but I am afraid that the resulting fsck is only half as useful as the normal fsck. I do not see it any better than a hypothetical castrated version of fsck that only checks the integrity of objects that appear in the local object store, without doing any connectivity checks. Don't get me wrong. The integrity check on local objects you still do is important---that is what allows us to make sure that the local "cache" does not prevent us from going to the real source of the remote object store by having a corrupt copy. But not being able to tell if a missing object is OK to be missing (because we can get them if/as needed from elsewhere) or we lost the sole copy of an object that we created and have not pushed out (hence we are in deep yogurt) makes it pretty much pointless to run "fsck", doesn't it? It does not give us any guarantee that our repository plus perfect network connectivity gives us an environment to build further work on. Or am I missing something fundamental? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-07-31 21:21 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Junio C Hamano @ 2017-07-31 23:05 ` Jonathan Tan 2017-08-01 17:11 ` Junio C Hamano 0 siblings, 1 reply; 40+ messages in thread From: Jonathan Tan @ 2017-07-31 23:05 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peartben, christian.couder On Mon, 31 Jul 2017 14:21:56 -0700 Junio C Hamano <gitster@pobox.com> wrote: > Jonathan Tan <jonathantanmy@google.com> writes: > > > Besides review changes, this patch set now includes my rewritten > > lazy-loading sha1_file patch, so you can now do this (excerpted from one > > of the tests): > > > > test_create_repo server > > test_commit -C server 1 1.t abcdefgh > > HASH=$(git hash-object server/1.t) > > > > test_create_repo client > > test_must_fail git -C client cat-file -p "$HASH" > > git -C client config core.repositoryformatversion 1 > > git -C client config extensions.lazyobject \ > > "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\"" > > git -C client cat-file -p "$HASH" > > > > with fsck still working. Also, there is no need for a list of promised > > blobs, and the long-running process protocol is being used. > > I do not think I read your response to my last comment on this > series, so I could be missing something large, but I am afraid that > the resulting fsck is only half as useful as the normal fsck. I do > not see it any better than a hypothetical castrated version of fsck > that only checks the integrity of objects that appear in the local > object store, without doing any connectivity checks. Sorry, I haven't replied to your last response [1]. That does sound like a good idea, though, and probably can be extended to trees and blobs in that we need to make sure that any object referenced from local-only commits (calculated as you describe in [1]) can be obtained through an object walk from a remote-tracking branch. I haven't fully thought of the implications of things like building commits on top of an arbitrary upstream commit (so since our upstream commit is not a tip, the object walk from all remote-tracking branches might not reach our upstream commit). To try to solve that, we could use an alternate object store to store remote objects in order to be able to find remote objects quickly without doing a traversal, but that does not fully solve the problem, because some information about remote object possession lies only in their parents (for example, if we don't have a remote blob, sometimes the only way to know that the remote has it is by having a tree containing that blob). In addition, this also couples the lazy object loading with either a remote ref (or all remote refs, if we decide to consider objects from all remote refs as potentially loadable). I'll think about this further. [1] https://public-inbox.org/git/xmqq379fkz4x.fsf@gitster.mtv.corp.google.com/ > Don't get me wrong. The integrity check on local objects you still > do is important---that is what allows us to make sure that the local > "cache" does not prevent us from going to the real source of the > remote object store by having a corrupt copy. > > But not being able to tell if a missing object is OK to be missing > (because we can get them if/as needed from elsewhere) or we lost the > sole copy of an object that we created and have not pushed out > (hence we are in deep yogurt) makes it pretty much pointless to run > "fsck", doesn't it? It does not give us any guarantee that our > repository plus perfect network connectivity gives us an environment > to build further work on. > > Or am I missing something fundamental? Well, the fsck can still detect issues like corrupt objects (as you mention above) and dangling heads, which might be real issues. But it is true that it does not give you the guarantee you describe. From a user standpoint, this might be able to be worked around by providing a network-requiring object connectivity checking tool or by just having the user running a build to ensure that all necessary files are present. Having said that, this feature will be very nice to have. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-07-31 23:05 ` Jonathan Tan @ 2017-08-01 17:11 ` Junio C Hamano 2017-08-01 17:45 ` Jonathan Nieder 2017-08-02 0:19 ` Jonathan Tan 0 siblings, 2 replies; 40+ messages in thread From: Junio C Hamano @ 2017-08-01 17:11 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder Jonathan Tan <jonathantanmy@google.com> writes: > Well, the fsck can still detect issues like corrupt objects (as you > mention above) and dangling heads, which might be real issues. But it is > true that it does not give you the guarantee you describe. Which makes it pretty much useless. The whole point of running "fsck" is to make sure that we won't waste work by not finding a corruption long after it was introduced and spent a lot of effort building on top of a state that nobody can reproduce. > From a user standpoint, this might be able to be worked around by > providing a network-requiring object connectivity checking tool or by > just having the user running a build to ensure that all necessary files > are present. I actually was hoping that you do not have to go to the network for the checking. And I have to say that "only the tip matters" is a horrible cop-out that is not even a workaround. Your users would be served better if you honestly admit that your fsck will not be useful when this feature is used---at least they won't be harmed by a false expectation that "fsck" would give them some assurance, which is not the case. Let's step back a bit and think what already happens in the pre- lazy-object world. We record cut-off commits when a depth limited clone is created in "shallow". These essentially are promises, saying something like: Rest assured that everything in the history behind these commits are on the other side and you can retrieve them by unshallowing. If you traverse from your local tips and find no missing objects before reaching one of these commits, then you do not have any local corruption you need to worry about. the other end made to us, when the shallow clone was made. And we take this promise and build more commits on top, and then we adjust these cut-off commits incrementally as we deepen our clone or make it even shallower. For this assurance to work, we of course need to assume a bit more than what we assume for a complete clone, namely, the "other side" will hold onto the history behind these, i.e. does not remind the tips it already has shown to us, or even if it does, the objects that are reachable from these cut-off points will somehow always be available to us on demand. Can we do something similar, i.e. maintain minimum set of cut-off points and adjust that set incrementally, just sufficient to ensure the integrity of objects locally created and not yet safely stored away by pushing them the "other side"? I haven't thought things through (and I know you, Ben and others have thought much longer and harder), but I would imagine if we have a commit object [*1*], some of whose parent commits, trees and blobs are locally missing, and know that the commit exists on the "other side", we know that all of these "missing" objects that are referenced by the commit are also available from the "other side". IOW, I suspect that the same principle "shallow" uses to give us the integrity guarantee can be naturally extended to allow us to see if a broken connectivity is OK. [Footnote] *1* The same can be said for a tag or a tree object that we know exist on the "other side"; they may refer, directly or indirectly through objects we locally have, to objects that that are missing locally, and as long as the starting point object are known to be available on the "other side", it is OK for them to be missing locally. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-08-01 17:11 ` Junio C Hamano @ 2017-08-01 17:45 ` Jonathan Nieder 2017-08-01 20:15 ` Junio C Hamano 2017-08-02 0:19 ` Jonathan Tan 1 sibling, 1 reply; 40+ messages in thread From: Jonathan Nieder @ 2017-08-01 17:45 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jonathan Tan, git, peartben, christian.couder Hi, Junio C Hamano wrote: > Can we do something similar, i.e. maintain minimum set of cut-off > points and adjust that set incrementally, just sufficient to ensure > the integrity of objects locally created and not yet safely stored > away by pushing them the "other side"? This sounds like a variant on the "promises" idea (maintaining a list of objects at the frontier) described before. Instead of listing blobs that the server promised, you are proposing listing trees that the server has promised to handle all references from. > I haven't thought things through (and I know you, Ben and others > have thought much longer and harder), but I would imagine if we have > a commit object [*1*], some of whose parent commits, trees and blobs > are locally missing, and know that the commit exists on the "other > side", we know that all of these "missing" objects that are > referenced by the commit are also available from the "other side". > IOW, I suspect that the same principle "shallow" uses to give us the > integrity guarantee can be naturally extended to allow us to see if > a broken connectivity is OK. If we are deeply worried about this kind of broken connectivity, there is another case to care about: the server can "promise" to serve requests for some object (e.g., the tree pointed to by the server's "master") and then decide it does not want to fulfill that promise (e.g., that tree pointed to private key material and "master" was rewound to avoid it). In the promises model, how we do we get a fresh understanding of what the server wants to promise now? Earlier in this discussion of fsck, I thought you were proposing a slightly different idea. The idea I heard is that you want to check connectivity for whatever you have built locally, while accepting a relaxed guarantee for objects from upstream. If an object is missing, the idea would be that at least this way you know whose fault it is. :) (Not that there's much to do with that knowledge.) Implementing that by treating everything reachable from a remote-tracking branch as "from upstream" seems natural. But that implementation suffers from the same problems: not all objects from upstream need be reachable from a remote-tracking branch (e.g. after a fetch-by-object-id, or because a remote branch can be rewound). Both variants proposed of the promises idea also hold some promise, but my understanding was that the cost of maintaining the promises file (getting data to fill it, locking on update, merging in new objects into it on update), for little benefit wasn't palatable to Microsoft, who has been coping fine without such a file. Jonathan ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-08-01 17:45 ` Jonathan Nieder @ 2017-08-01 20:15 ` Junio C Hamano 0 siblings, 0 replies; 40+ messages in thread From: Junio C Hamano @ 2017-08-01 20:15 UTC (permalink / raw) To: Jonathan Nieder; +Cc: Jonathan Tan, git, peartben, christian.couder Jonathan Nieder <jrnieder@gmail.com> writes: > If we are deeply worried about this kind of broken connectivity, there > is another case to care about: the server can "promise" to serve > requests for some object (e.g., the tree pointed to by the server's > "master") and then decide it does not want to fulfill that promise > (e.g., that tree pointed to private key material and "master" was > rewound to avoid it). I think I've already covered that in my message (i.e. "we need to assume more than the normal Git"). In short, it is not our problem, but the "lazy-object" service's problem. If developers cannot trust the "central server", most likely owned by the organization that employs them and forces them to offload the access to these objects to the "central server", I think there is much larger problem there. > In the promises model, how we do we get a fresh > understanding of what the server wants to promise now? Yes, that is one of the things that needs to be designed if we want to officially support lazy-objects like structure. We need a way to incrementally adjust the cut-off point, below which it is the responsibility of the "other side" to ensure that necessary objects are available (on demand), and above which it is a local repository's responsibility to notice corrupted and/or missing objects. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-08-01 17:11 ` Junio C Hamano 2017-08-01 17:45 ` Jonathan Nieder @ 2017-08-02 0:19 ` Jonathan Tan 2017-08-02 16:20 ` Junio C Hamano 1 sibling, 1 reply; 40+ messages in thread From: Jonathan Tan @ 2017-08-02 0:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, peartben, christian.couder On Tue, 01 Aug 2017 10:11:38 -0700 Junio C Hamano <gitster@pobox.com> wrote: > Let's step back a bit and think what already happens in the pre- > lazy-object world. We record cut-off commits when a depth limited > clone is created in "shallow". These essentially are promises, > saying something like: > > Rest assured that everything in the history behind these commits > are on the other side and you can retrieve them by unshallowing. > > If you traverse from your local tips and find no missing objects > before reaching one of these commits, then you do not have any > local corruption you need to worry about. > > the other end made to us, when the shallow clone was made. And we > take this promise and build more commits on top, and then we adjust > these cut-off commits incrementally as we deepen our clone or make > it even shallower. For this assurance to work, we of course need to > assume a bit more than what we assume for a complete clone, namely, > the "other side" will hold onto the history behind these, i.e. does > not remind the tips it already has shown to us, or even if it does, > the objects that are reachable from these cut-off points will > somehow always be available to us on demand. > > Can we do something similar, i.e. maintain minimum set of cut-off > points and adjust that set incrementally, just sufficient to ensure > the integrity of objects locally created and not yet safely stored > away by pushing them the "other side"? This suggestion (the "frontier" of what we have) does seem to incur less overhead than the original promise suggestion (the "frontier" of what we don't have), but after some in-office discussion, I'm convinced that it might not be the case - for example, one tree (that we have) might reference many blobs (that we don't have), but at the same time, many trees (that we have) might have the same blob (that we don't have). And the promise overhead was already decided to be too much - which is why we moved away from it. One possibility to conceptually have the same thing without the overhead of the list is to put the obtained-from-elsewhere objects into its own alternate object store, so that we can distinguish the two. I mentioned this in my e-mail but rejected it, but after some more thought, this might be sufficient - we might still need to iterate through every object to know exactly what we can assume the remote to have, but the "frontier" solution also needs this iteration, so we are no worse off. Going back to the original use cases that motivated this (the monorepo like Microsoft's repo and the large-blob repo like Android's repo), it might be better just to disable the connectivity check when extensions.lazyObject is set (as you mentioned). This does change the meaning of fsck, but it may be fine since the "meaning" of the repo (a view of another repo, and no longer a full repo) has changed too. Then this patch set will be more about ensuring that the lazy object loader is not inadvertently run. As future work, we could add diagnostics that, for example, attempt a walk anyway and print a list of missing SHA-1s. (I suspect that we will also need to disable the connectivity check for things like "git fetch", which means that we won't be able to tell locally if the server sent us all the objects that we requested for. This might not be a problem, though, since the local repo already has some measure of trust for the server.) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-08-02 0:19 ` Jonathan Tan @ 2017-08-02 16:20 ` Junio C Hamano 2017-08-02 17:38 ` Jonathan Nieder 0 siblings, 1 reply; 40+ messages in thread From: Junio C Hamano @ 2017-08-02 16:20 UTC (permalink / raw) To: Jonathan Tan; +Cc: git, peartben, christian.couder Jonathan Tan <jonathantanmy@google.com> writes: > One possibility to conceptually have the same thing without the overhead > of the list is to put the obtained-from-elsewhere objects into its own > alternate object store, so that we can distinguish the two. Now you are talking. Either a separate object store, or a packfile that is specially marked as such, would work. "Maintaining a list of object names in a flat file is too costly" is not a valid excuse to discard the integrity of locally created objects, without which Git will no longer be a version control system, and your "We have to trust the sender of objects on the other side anyway when operating in lazy-objects mode, so devise a way that we can use to tell which objects we _know_ the other side has" that leads to the above idea is a good line of thought. > I mentioned > this in my e-mail but rejected it, but after some more thought, this > might be sufficient - we might still need to iterate through every > object to know exactly what we can assume the remote to have, but the > "frontier" solution also needs this iteration, so we are no worse off. Most importantly, this is allowed to be costly---we are doing this not at runtime all the time, but when the user says "make sure that I haven't lost objects and it is safe for me to build further on what I created locally so far" by running "git fsck". ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-08-02 16:20 ` Junio C Hamano @ 2017-08-02 17:38 ` Jonathan Nieder 2017-08-02 20:51 ` Junio C Hamano 0 siblings, 1 reply; 40+ messages in thread From: Jonathan Nieder @ 2017-08-02 17:38 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jonathan Tan, git, peartben, christian.couder Hi, Junio C Hamano wrote: > Jonathan Tan <jonathantanmy@google.com> writes: >> One possibility to conceptually have the same thing without the overhead >> of the list is to put the obtained-from-elsewhere objects into its own >> alternate object store, so that we can distinguish the two. > > Now you are talking. Either a separate object store, or a packfile > that is specially marked as such, would work. Jonathan's not in today, so let me say a few more words about this approach. This approach implies a relaxed connectivity guarantee, by creating two classes of objects: 1. Objects that I made should satisfy the connectivity check. They can point to other objects I made, objects I fetched, or (*) objects pointed to directly by objects I fetched. More on (*) below. 2. Objects that I fetched do not need to satisfy a connectivity check. I can trust the server to provide objects that they point to when I ask for them, except in extraordinary cases like a credit card number that was accidentally pushed to the server and prompted a rewriting of history to remove it (**). The guarantee (1) looks like it should be easy to satisfy (just like the current connectivity guarantee where all objects are in class (1)). I have to know about an object to point to it --- that means the pointed-to object has to be in the object store or pointed to by something in the object store. The complication is in the "git gc" operation for the case (*). Today, "git gc" uses a reachability walk to decide which objects to remove --- an object referenced by no other object is fair game to remove. With (*), there is another kind of object that must not be removed: if an object that I made, M, points to a missing/promised object, O, pointed to by a an object I fetched, F, then I cannot prune F unless there is another fetched object present to anchor O. For example: suppose I have a sparse checkout and run git fetch origin refs/pulls/x git checkout -b topic FETCH_HEAD echo "Some great modification" >> README git add README git commit --amend When I run "git gc", there is nothing pointing to the commit that was pointed to by the remote ref refs/pulls/x, so it can be pruned. I would naively also expect that the tree pointed to by that commit could be pruned. But pruning it means pruning the promise that made it permissible to lack various blobs that my topic branch refers to that are outside the sparse checkout area. So "git gc" must notice that it is not safe to prune that tree. This feels hacky. I prefer the promised object list over this approach. > "Maintaining a list > of object names in a flat file is too costly" is not a valid excuse > to discard the integrity of locally created objects, without which > Git will no longer be a version control system, I am confused by this: I think that Git without a "git fsck" command at all would still be a version control system, just not as good of one. Can you spell this out more? To be clear, are you speaking as a reviewer or as the project maintainer? In other words, if other reviewers are able to settle on a design that involves a relaxed guarantee for fsck in this mode that they can agree on, does this represent a veto meaning the patch can still not go through? On one hand I'm grateful for the help exploring the design space, and I think it has helped get a better understanding of the issues involved. On the other hand, if this is a veto then it feels very black and white and a hard starting point to build a consensus from. I am worried. [...] >> I mentioned >> this in my e-mail but rejected it, but after some more thought, this >> might be sufficient - we might still need to iterate through every >> object to know exactly what we can assume the remote to have, but the >> "frontier" solution also needs this iteration, so we are no worse off. > > Most importantly, this is allowed to be costly---we are doing this > not at runtime all the time, but when the user says "make sure that > I haven't lost objects and it is safe for me to build further on > what I created locally so far" by running "git fsck". check_everything_connected is also used in some other circumstances: e.g. when running a fetch, and when receiving a push in git receive-pack. Thanks, Jonathan ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-08-02 17:38 ` Jonathan Nieder @ 2017-08-02 20:51 ` Junio C Hamano 2017-08-02 22:13 ` Jonathan Nieder 2017-08-03 19:08 ` Jonathan Tan 0 siblings, 2 replies; 40+ messages in thread From: Junio C Hamano @ 2017-08-02 20:51 UTC (permalink / raw) To: Jonathan Nieder; +Cc: Jonathan Tan, git, peartben, christian.couder Jonathan Nieder <jrnieder@gmail.com> writes: > Junio C Hamano wrote: >> Jonathan Tan <jonathantanmy@google.com> writes: > >>> One possibility to conceptually have the same thing without the overhead >>> of the list is to put the obtained-from-elsewhere objects into its own >>> alternate object store, so that we can distinguish the two. >> >> Now you are talking. Either a separate object store, or a packfile >> that is specially marked as such, would work. > > Jonathan's not in today, so let me say a few more words about this > approach. > > This approach implies a relaxed connectivity guarantee, by creating > two classes of objects: > > 1. Objects that I made should satisfy the connectivity check. They > can point to other objects I made, objects I fetched, or (*) objects > pointed to directly by objects I fetched. More on (*) below. Or objects that are referred to by objects I fetched. If you narrowly clone while omitting a subdirectory, updated a file that is outside the subdirectory, and created a new commit, while recording the same tree object name for the directory you do not know its contents (becaues you didn't fetch), then it is OK for the top-level tree of the resulting commit you created to be pointing at the tree that represents the subdirectory you never touched. > The complication is in the "git gc" operation for the case (*). > Today, "git gc" uses a reachability walk to decide which objects to > remove --- an object referenced by no other object is fair game to > remove. With (*), there is another kind of object that must not be > removed: if an object that I made, M, points to a missing/promised > object, O, pointed to by a an object I fetched, F, then I cannot prune > F unless there is another fetched object present to anchor O. Absolutely. Lazy-objects support comes with certain cost and this is one of them. But I do not think it is realistic to expect that you can prune anything you fetched from the "other place" (i.e. the source 'lazy-objects' hook reads from). After all, once they give out objects to their clients (like us in this case), they cannot prune it, if we take the "implicit promise" approach to avoid the cost to transmit and maintain a separate "object list". > For example: suppose I have a sparse checkout and run > > git fetch origin refs/pulls/x > git checkout -b topic FETCH_HEAD > echo "Some great modification" >> README > git add README > git commit --amend > > When I run "git gc", there is nothing pointing to the commit that was > pointed to by the remote ref refs/pulls/x, so it can be pruned. I > would naively also expect that the tree pointed to by that commit > could be pruned. But pruning it means pruning the promise that made > it permissible to lack various blobs that my topic branch refers to > that are outside the sparse checkout area. So "git gc" must notice > that it is not safe to prune that tree. > > This feels hacky. I prefer the promised object list over this > approach. I think they are moral equivalents implemented differently with different assumptions. The example we are discussing makes an extra assumption: In order to reduce the cost of transferring and maintaining the list, we assume that all objects that came during that transfer are implicitly "promised", i.e. everything behind each of these objects will later be available on demand. How these objects are marked is up to the exact mechanism (my preference is to mark the resulting packfile as special; Jon Tan's message to which my message was a resopnse alluded to using an alternate object store). If you choose to maintain a separate "object list" and have the "other side" explicitly give it, perhaps you can lift that assumption and replace it with some other assumption that assumes less. > Can you spell this out more? To be clear, are you speaking as a > reviewer or as the project maintainer? In other words, if other > reviewers are able to settle on a design that involves a relaxed > guarantee for fsck in this mode that they can agree on, does this > represent a veto meaning the patch can still not go through? Consider it a veto over punting without making sure that we can later come up with a solution to give such a guarantee. I am not getting a feeling that "other reviewers" are even seeking a "relaxed guarantee"---all I've seen in the thread is to give up any guarantee and to hope for the best. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-08-02 20:51 ` Junio C Hamano @ 2017-08-02 22:13 ` Jonathan Nieder 2017-08-03 19:08 ` Jonathan Tan 1 sibling, 0 replies; 40+ messages in thread From: Jonathan Nieder @ 2017-08-02 22:13 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jonathan Tan, git, peartben, christian.couder Junio C Hamano wrote: > Jonathan Nieder <jrnieder@gmail.com> writes: >> Can you spell this out more? To be clear, are you speaking as a >> reviewer or as the project maintainer? In other words, if other >> reviewers are able to settle on a design that involves a relaxed >> guarantee for fsck in this mode that they can agree on, does this >> represent a veto meaning the patch can still not go through? > > Consider it a veto over punting without making sure that we can > later come up with a solution to give such a guarantee. I am not > getting a feeling that "other reviewers" are even seeking a "relaxed > guarantee"---all I've seen in the thread is to give up any guarantee > and to hope for the best. Thank you. That makes sense. In my defense, one reason I had for being okay with dropping the connectivity check in the "lazy object" setup (at a higher level than this patch currently does it, to avoid wasted work) is that this patch series does not include the required components to do it more properly and previous discussions on list had pointed to some of those components that will arrive later (the "object size cache", which doubles as an incomplete list of promises). But that doesn't put the project in a good position because it isn't an explicitly spelled out plan. The set of other reviewers that I was hoping will weigh in at some point is the GVFS team at Microsoft. I'll write up a summary of the ideas discussed so far to try to get this unblocked. Sincerely, Jonathan ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-08-02 20:51 ` Junio C Hamano 2017-08-02 22:13 ` Jonathan Nieder @ 2017-08-03 19:08 ` Jonathan Tan 1 sibling, 0 replies; 40+ messages in thread From: Jonathan Tan @ 2017-08-03 19:08 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jonathan Nieder, git, peartben, christian.couder On Wed, 02 Aug 2017 13:51:37 -0700 Junio C Hamano <gitster@pobox.com> wrote: > > The complication is in the "git gc" operation for the case (*). > > Today, "git gc" uses a reachability walk to decide which objects to > > remove --- an object referenced by no other object is fair game to > > remove. With (*), there is another kind of object that must not be > > removed: if an object that I made, M, points to a missing/promised > > object, O, pointed to by a an object I fetched, F, then I cannot prune > > F unless there is another fetched object present to anchor O. > > Absolutely. Lazy-objects support comes with certain cost and this > is one of them. > > But I do not think it is realistic to expect that you can prune > anything you fetched from the "other place" (i.e. the source > 'lazy-objects' hook reads from). After all, once they give out > objects to their clients (like us in this case), they cannot prune > it, if we take the "implicit promise" approach to avoid the cost to > transmit and maintain a separate "object list". By this, do you mean that the client cannot prune anything lazily loaded from the server? If yes, I understand that the server cannot prune anything (for the reasons that you describe), but I don't see how that applies to the client. > > For example: suppose I have a sparse checkout and run > > > > git fetch origin refs/pulls/x > > git checkout -b topic FETCH_HEAD > > echo "Some great modification" >> README > > git add README > > git commit --amend > > > > When I run "git gc", there is nothing pointing to the commit that was > > pointed to by the remote ref refs/pulls/x, so it can be pruned. I > > would naively also expect that the tree pointed to by that commit > > could be pruned. But pruning it means pruning the promise that made > > it permissible to lack various blobs that my topic branch refers to > > that are outside the sparse checkout area. So "git gc" must notice > > that it is not safe to prune that tree. > > > > This feels hacky. I prefer the promised object list over this > > approach. > > I think they are moral equivalents implemented differently with > different assumptions. The example we are discussing makes an extra > assumption: In order to reduce the cost of transferring and > maintaining the list, we assume that all objects that came during > that transfer are implicitly "promised", i.e. everything behind each > of these objects will later be available on demand. How these > objects are marked is up to the exact mechanism (my preference is to > mark the resulting packfile as special; Jon Tan's message to which > my message was a resopnse alluded to using an alternate object > store). If you choose to maintain a separate "object list" and have > the "other side" explicitly give it, perhaps you can lift that > assumption and replace it with some other assumption that assumes > less. Marking might be an issue if we expect the lazy loader to emit an object after every hash, like in the current design. There would thus be one mark per object, with overhead similar to the promise list. (Having said that, batching is possible - I plan to optimize common cases like checkout, and have such a patch online for an older "promised blob" design [1].) Overhead could be reduced by embedding the mark in both the packed and loose objects, requiring a different format (instead of having a separate "catalog" of marked files). This seems more complicated than using an alternate object store, hence my suggestion. [1] https://github.com/jonathantanmy/git/commit/14f07d3f06bc3a1a2c9bca85adc8c42b230b9143 ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan ` (5 preceding siblings ...) 2017-07-31 21:21 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Junio C Hamano @ 2017-08-08 17:13 ` Ben Peart 6 siblings, 0 replies; 40+ messages in thread From: Ben Peart @ 2017-08-08 17:13 UTC (permalink / raw) To: Jonathan Tan, git; +Cc: gitster, christian.couder On 7/31/2017 5:02 PM, Jonathan Tan wrote: > Besides review changes, this patch set now includes my rewritten > lazy-loading sha1_file patch, so you can now do this (excerpted from one > of the tests): > > test_create_repo server > test_commit -C server 1 1.t abcdefgh > HASH=$(git hash-object server/1.t) > > test_create_repo client > test_must_fail git -C client cat-file -p "$HASH" > git -C client config core.repositoryformatversion 1 > git -C client config extensions.lazyobject \ > "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\"" > git -C client cat-file -p "$HASH" > > with fsck still working. Also, there is no need for a list of promised > blobs, and the long-running process protocol is being used. > > Changes from v1: > - added last patch that supports lazy loading > - clarified documentation in "introduce lazyobject extension" patch > (following Junio's comments [1]) > > As listed in the changes above, I have rewritten my lazy-loading > sha1_file patch to no longer use the list of promises. Also, I have > added documentation about the protocol used to (hopefully) the > appropriate places. Glad to see the removal of the promises. Given the ongoing conversation, I'm interested to see how you are detecting locally create objects vs those downloaded from a server. > > This is a minimal implementation, hopefully enough of a foundation to be > built upon. In particular, I haven't added the environment variable to > suppress lazy loading, and the lazy loading protocol only supports one > object at a time. We can add multiple object support to the protocol when we get to the point that we have code that will utilize it. > > Other work > ---------- > > This differs slightly from Ben Peart's patch [2] in that the > lazy-loading functionality is provided through a configured shell > command instead of a hook shell script. I envision commands like "git > clone", in the future, needing to pre-configure lazy loading, and I > think that it will be less surprising to the user if "git clone" wrote a > default configuration instead of a default hook. This was on my "todo" list to investigate as I've been told it can enable people to use taskset to set CPU affinity and get some significant performance wins. I'd be interested to see if it actually helps here at all. > > This also differs from Christian Couder's patch set [3] that implement a > larger-scale object database, in that (i) my patch set does not support > putting objects into external databases, and (ii) my patch set requires > the lazy loader to make the objects available in the local repo, instead > of allowing the objects to only be stored in the external database. This is the model we're using today so I'm confident it will meet our requirements. > > [1] https://public-inbox.org/git/xmqqzibpn1zh.fsf@gitster.mtv.corp.google.com/ > [2] https://public-inbox.org/git/20170714132651.170708-2-benpeart@microsoft.com/ > [3] https://public-inbox.org/git/20170620075523.26961-1-chriscool@tuxfamily.org/ > > Jonathan Tan (5): > environment, fsck: introduce lazyobject extension > fsck: support refs pointing to lazy objects > fsck: support referenced lazy objects > fsck: support lazy objects as CLI argument > sha1_file: support loading lazy objects > > Documentation/Makefile | 1 + > Documentation/gitattributes.txt | 54 ++-------- > Documentation/gitrepository-layout.txt | 3 + > .../technical/long-running-process-protocol.txt | 50 +++++++++ > Documentation/technical/repository-version.txt | 23 +++++ > Makefile | 1 + > builtin/cat-file.c | 2 + > builtin/fsck.c | 25 ++++- > cache.h | 4 + > environment.c | 1 + > lazy-object.c | 80 +++++++++++++++ > lazy-object.h | 12 +++ > object.c | 7 ++ > object.h | 13 +++ > setup.c | 7 +- > sha1_file.c | 44 +++++--- > t/t0410-lazy-object.sh | 113 +++++++++++++++++++++ > t/t0410/lazy-object | 102 +++++++++++++++++++ > 18 files changed, 478 insertions(+), 64 deletions(-) > create mode 100644 Documentation/technical/long-running-process-protocol.txt > create mode 100644 lazy-object.c > create mode 100644 lazy-object.h > create mode 100755 t/t0410-lazy-object.sh > create mode 100755 t/t0410/lazy-object > ^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2017-08-08 20:20 UTC | newest] Thread overview: 40+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-07-26 23:29 [RFC PATCH 0/4] Some patches for fsck for missing objects Jonathan Tan 2017-07-26 23:29 ` [RFC PATCH 1/4] environment, fsck: introduce lazyobject extension Jonathan Tan 2017-07-27 18:55 ` Junio C Hamano 2017-07-28 13:20 ` Ben Peart 2017-07-28 23:50 ` Jonathan Tan 2017-07-29 0:21 ` Junio C Hamano 2017-07-26 23:30 ` [RFC PATCH 2/4] fsck: support refs pointing to lazy objects Jonathan Tan 2017-07-27 18:59 ` Junio C Hamano 2017-07-27 23:50 ` Jonathan Tan 2017-07-28 13:29 ` Ben Peart 2017-07-28 20:08 ` [PATCH] tests: ensure fsck fails on corrupt packfiles Jonathan Tan 2017-07-26 23:30 ` [RFC PATCH 3/4] fsck: support referenced lazy objects Jonathan Tan 2017-07-27 19:17 ` Junio C Hamano 2017-07-27 23:50 ` Jonathan Tan 2017-07-29 16:04 ` Junio C Hamano 2017-07-26 23:30 ` [RFC PATCH 4/4] fsck: support lazy objects as CLI argument Jonathan Tan 2017-07-26 23:42 ` [RFC PATCH 0/4] Some patches for fsck for missing objects brian m. carlson 2017-07-27 0:24 ` Stefan Beller 2017-07-27 17:25 ` Jonathan Tan 2017-07-28 13:40 ` Ben Peart 2017-07-31 21:02 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 1/5] environment, fsck: introduce lazyobject extension Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 2/5] fsck: support refs pointing to lazy objects Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 3/5] fsck: support referenced " Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 4/5] fsck: support lazy objects as CLI argument Jonathan Tan 2017-07-31 21:02 ` [PATCH v2 5/5] sha1_file: support loading lazy objects Jonathan Tan 2017-07-31 21:29 ` Junio C Hamano 2017-08-08 20:20 ` Ben Peart 2017-07-31 21:21 ` [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader Junio C Hamano 2017-07-31 23:05 ` Jonathan Tan 2017-08-01 17:11 ` Junio C Hamano 2017-08-01 17:45 ` Jonathan Nieder 2017-08-01 20:15 ` Junio C Hamano 2017-08-02 0:19 ` Jonathan Tan 2017-08-02 16:20 ` Junio C Hamano 2017-08-02 17:38 ` Jonathan Nieder 2017-08-02 20:51 ` Junio C Hamano 2017-08-02 22:13 ` Jonathan Nieder 2017-08-03 19:08 ` Jonathan Tan 2017-08-08 17:13 ` Ben Peart
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.