* [RFC PATCH v2 1/4] object: remove "used" field from struct object
2017-07-20 0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
@ 2017-07-20 0:21 ` Jonathan Tan
2017-07-20 0:36 ` Stefan Beller
2017-07-20 21:20 ` Junio C Hamano
2017-07-20 0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
` (2 subsequent siblings)
3 siblings, 2 replies; 45+ messages in thread
From: Jonathan Tan @ 2017-07-20 0:21 UTC (permalink / raw)
To: git; +Cc: Jonathan Tan, jrnieder, sbeller, git, peartben, philipoakley
The "used" field in struct object is only used by builtin/fsck. Remove
that field and modify builtin/fsck to use a flag instead.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
builtin/fsck.c | 24 ++++++++++++++----------
object.c | 1 -
object.h | 2 +-
3 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 4ba311cda..462b8643b 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -19,6 +19,8 @@
#define REACHABLE 0x0001
#define SEEN 0x0002
#define HAS_OBJ 0x0004
+/* This flag is set if something points to this object. */
+#define USED 0x0008
static int show_root;
static int show_tags;
@@ -195,7 +197,7 @@ static int mark_used(struct object *obj, int type, void *data, struct fsck_optio
{
if (!obj)
return 1;
- obj->used = 1;
+ obj->flags |= USED;
return 0;
}
@@ -244,7 +246,7 @@ static void check_unreachable_object(struct object *obj)
}
/*
- * "!used" means that nothing at all points to it, including
+ * "!USED" means that nothing at all points to it, including
* other unreachable objects. In other words, it's the "tip"
* of some set of unreachable objects, usually a commit that
* got dropped.
@@ -255,7 +257,7 @@ static void check_unreachable_object(struct object *obj)
* deleted a branch by mistake, this is a prime candidate to
* start looking at, for example.
*/
- if (!obj->used) {
+ if (!(obj->flags & USED)) {
if (show_dangling)
printf("dangling %s %s\n", printable_type(obj),
describe_object(obj));
@@ -379,7 +381,8 @@ static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
errors_found |= ERROR_OBJECT;
return error("%s: object corrupt or missing", oid_to_hex(oid));
}
- obj->flags = HAS_OBJ;
+ obj->flags &= ~(REACHABLE | SEEN);
+ obj->flags |= HAS_OBJ;
return fsck_obj(obj);
}
@@ -397,7 +400,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
add_decoration(fsck_walk_options.object_names,
obj,
xstrfmt("%s@{%"PRItime"}", refname, timestamp));
- obj->used = 1;
+ obj->flags |= USED;
mark_object_reachable(obj);
} else {
error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
@@ -445,7 +448,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
errors_found |= ERROR_REFS;
}
default_refs++;
- obj->used = 1;
+ obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj, xstrdup(refname));
@@ -513,7 +516,8 @@ static int fsck_loose(const struct object_id *oid, const char *path, void *data)
return 0; /* keep checking other objects */
}
- obj->flags = HAS_OBJ;
+ obj->flags &= ~(REACHABLE | SEEN);
+ obj->flags |= HAS_OBJ;
if (fsck_obj(obj))
errors_found |= ERROR_OBJECT;
return 0;
@@ -595,7 +599,7 @@ static int fsck_cache_tree(struct cache_tree *it)
errors_found |= ERROR_REFS;
return 1;
}
- obj->used = 1;
+ obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj, xstrdup(":"));
@@ -737,7 +741,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
continue;
}
- obj->used = 1;
+ obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj, xstrdup(arg));
@@ -774,7 +778,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
if (!blob)
continue;
obj = &blob->object;
- obj->used = 1;
+ obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj,
diff --git a/object.c b/object.c
index f81877741..321d7e920 100644
--- a/object.c
+++ b/object.c
@@ -141,7 +141,6 @@ void *create_object(const unsigned char *sha1, void *o)
struct object *obj = o;
obj->parsed = 0;
- obj->used = 0;
obj->flags = 0;
hashcpy(obj->oid.hash, sha1);
diff --git a/object.h b/object.h
index 33e5cc994..0a419ba8d 100644
--- a/object.h
+++ b/object.h
@@ -38,6 +38,7 @@ struct object_array {
* http-push.c: 16-----19
* commit.c: 16-----19
* sha1_name.c: 20
+ * builtin/fsck.c: 0--3
*/
#define FLAG_BITS 27
@@ -46,7 +47,6 @@ struct object_array {
*/
struct object {
unsigned parsed : 1;
- unsigned used : 1;
unsigned type : TYPE_BITS;
unsigned flags : FLAG_BITS;
struct object_id oid;
--
2.14.0.rc0.284.gd933b75aa4-goog
^ permalink raw reply related [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 1/4] object: remove "used" field from struct object
2017-07-20 0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
@ 2017-07-20 0:36 ` Stefan Beller
2017-07-20 0:55 ` Jonathan Tan
2017-07-20 21:20 ` Junio C Hamano
1 sibling, 1 reply; 45+ messages in thread
From: Stefan Beller @ 2017-07-20 0:36 UTC (permalink / raw)
To: Jonathan Tan
Cc: git@vger.kernel.org, Jonathan Nieder, Jeff Hostetler, Ben Peart,
Philip Oakley
On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> The "used" field in struct object is only used by builtin/fsck. Remove
> that field and modify builtin/fsck to use a flag instead.
The patch looks good to me (I would even claim this could
go in as an independent cleanup, not tied to the RFCish nature
of the later patches), though I have a question:
How did you select 0x0008 for USED, i.e. does it
collide with other flags (theoretically?), and if so
how do we make sure to avoid the collusion in
the future?
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH v2 1/4] object: remove "used" field from struct object
2017-07-20 0:36 ` Stefan Beller
@ 2017-07-20 0:55 ` Jonathan Tan
2017-07-20 17:44 ` Ben Peart
0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Tan @ 2017-07-20 0:55 UTC (permalink / raw)
To: Stefan Beller
Cc: git@vger.kernel.org, Jonathan Nieder, Jeff Hostetler, Ben Peart,
Philip Oakley
On Wed, 19 Jul 2017 17:36:39 -0700
Stefan Beller <sbeller@google.com> wrote:
> On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> > The "used" field in struct object is only used by builtin/fsck. Remove
> > that field and modify builtin/fsck to use a flag instead.
>
> The patch looks good to me (I would even claim this could
> go in as an independent cleanup, not tied to the RFCish nature
> of the later patches), though I have a question:
> How did you select 0x0008 for USED, i.e. does it
> collide with other flags (theoretically?), and if so
> how do we make sure to avoid the collusion in
> the future?
Thanks. 0x0008 was the next one in the series (as you can see in the
context). As for whether it collides with other flags, that is what the
chart in object.h is for (which I have added to in this patch), I
presume. As far as I can tell, each component must make sure not to
overlap with any other component running concurrently.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH v2 1/4] object: remove "used" field from struct object
2017-07-20 0:55 ` Jonathan Tan
@ 2017-07-20 17:44 ` Ben Peart
0 siblings, 0 replies; 45+ messages in thread
From: Ben Peart @ 2017-07-20 17:44 UTC (permalink / raw)
To: Jonathan Tan, Stefan Beller
Cc: git@vger.kernel.org, Jonathan Nieder, Jeff Hostetler,
Philip Oakley
On 7/19/2017 8:55 PM, Jonathan Tan wrote:
> On Wed, 19 Jul 2017 17:36:39 -0700
> Stefan Beller <sbeller@google.com> wrote:
>
>> On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
>>> The "used" field in struct object is only used by builtin/fsck. Remove
>>> that field and modify builtin/fsck to use a flag instead.
>>
>> The patch looks good to me (I would even claim this could
>> go in as an independent cleanup, not tied to the RFCish nature
>> of the later patches), though I have a question:
>> How did you select 0x0008 for USED, i.e. does it
>> collide with other flags (theoretically?), and if so
>> how do we make sure to avoid the collusion in
>> the future?
>
> Thanks. 0x0008 was the next one in the series (as you can see in the
> context). As for whether it collides with other flags, that is what the
> chart in object.h is for (which I have added to in this patch), I
> presume. As far as I can tell, each component must make sure not to
> overlap with any other component running concurrently.
>
This patch seems reasonable to me. I agree it could go in separately as
a general cleanup.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH v2 1/4] object: remove "used" field from struct object
2017-07-20 0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
2017-07-20 0:36 ` Stefan Beller
@ 2017-07-20 21:20 ` Junio C Hamano
1 sibling, 0 replies; 45+ messages in thread
From: Junio C Hamano @ 2017-07-20 21:20 UTC (permalink / raw)
To: Jonathan Tan; +Cc: git, jrnieder, sbeller, git, peartben, philipoakley
Jonathan Tan <jonathantanmy@google.com> writes:
> The "used" field in struct object is only used by builtin/fsck. Remove
> that field and modify builtin/fsck to use a flag instead.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> builtin/fsck.c | 24 ++++++++++++++----------
> object.c | 1 -
> object.h | 2 +-
> 3 files changed, 15 insertions(+), 12 deletions(-)
I vaguely recall trying to do this myself a few years ago. We can
easily spot places within this file that does things like this:
> - obj->flags = HAS_OBJ;
and correctly update it to this:
> + obj->flags &= ~(REACHABLE | SEEN);
> + obj->flags |= HAS_OBJ;
but I didn't do so because I was hesitant having to validate, and
having to maintain the invariant forever, that anything called from
these codepaths is always careful not to clobber the USED bit.
Looks like a reasonable preparatory clean-up that probably should be
doable and should be done way before the main part of the series to
me. Will queue, together with your other fsck change.
Thanks.
^ permalink raw reply [flat|nested] 45+ messages in thread
* [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
2017-07-20 0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
2017-07-20 0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
@ 2017-07-20 0:21 ` Jonathan Tan
2017-07-20 18:07 ` Stefan Beller
2017-07-20 19:58 ` Ben Peart
2017-07-20 0:21 ` [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash Jonathan Tan
2017-07-20 0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
3 siblings, 2 replies; 45+ messages in thread
From: Jonathan Tan @ 2017-07-20 0:21 UTC (permalink / raw)
To: git; +Cc: Jonathan Tan, jrnieder, sbeller, git, peartben, philipoakley
Currently, Git does not support repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage.
As a first step to reducing this problem, introduce the concept of
promised objects. Each Git repo can contain a list of promised objects
and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
contains functions to query them; functions for creating and modifying
that file will be introduced in later patches.
A repository that is missing an object but has that object promised is not
considered to be in error, so also teach fsck this. As part of doing
this, object.{h,c} has been modified to generate "struct object" based
on only the information available to promised objects, without requiring
the object itself.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Documentation/technical/repository-version.txt | 6 ++
Makefile | 1 +
builtin/fsck.c | 18 +++-
cache.h | 2 +
environment.c | 1 +
fsck.c | 6 +-
object.c | 19 ++++
object.h | 19 ++++
promised-object.c | 130 +++++++++++++++++++++++++
promised-object.h | 22 +++++
setup.c | 7 +-
t/t3907-promised-object.sh | 41 ++++++++
t/test-lib-functions.sh | 6 ++
13 files changed, 273 insertions(+), 5 deletions(-)
create mode 100644 promised-object.c
create mode 100644 promised-object.h
create mode 100755 t/t3907-promised-object.sh
diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad37986..f8b82c1c7 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,9 @@ for testing format-1 compatibility.
When the config key `extensions.preciousObjects` is set to `true`,
objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
`git repack -d`).
+
+`promisedObjects`
+~~~~~~~~~~~~~~~~~
+
+(Explain this - basically a string containing a command to be run
+whenever a missing object needs to be fetched.)
diff --git a/Makefile b/Makefile
index 9c9c42f8f..c1446d5ef 100644
--- a/Makefile
+++ b/Makefile
@@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
LIB_OBJS += pretty.o
LIB_OBJS += prio-queue.o
LIB_OBJS += progress.o
+LIB_OBJS += promised-object.o
LIB_OBJS += prompt.o
LIB_OBJS += quote.o
LIB_OBJS += reachable.o
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 462b8643b..49e21f361 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -15,6 +15,7 @@
#include "progress.h"
#include "streaming.h"
#include "decorate.h"
+#include "promised-object.h"
#define REACHABLE 0x0001
#define SEEN 0x0002
@@ -44,6 +45,7 @@ static int name_objects;
#define ERROR_REACHABLE 02
#define ERROR_PACK 04
#define ERROR_REFS 010
+#define ERROR_PROMISED_OBJECT 011
static const char *describe_object(struct object *obj)
{
@@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
{
struct object *obj;
- obj = parse_object(oid);
+ obj = parse_or_promise_object(oid);
if (!obj) {
error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
@@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
fprintf(stderr, "Checking cache tree\n");
if (0 <= it->entry_count) {
- struct object *obj = parse_object(&it->oid);
+ struct object *obj = parse_or_promise_object(&it->oid);
if (!obj) {
error("%s: invalid sha1 pointer in cache-tree",
oid_to_hex(&it->oid));
@@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
return 0;
}
+static int mark_have_promised_object(const struct object_id *oid, void *data)
+{
+ mark_object_for_connectivity(oid);
+ return 0;
+}
+
static char const * const fsck_usage[] = {
N_("git fsck [<options>] [<object>...]"),
NULL
@@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
git_config(fsck_config, NULL);
+ if (fsck_promised_objects()) {
+ error("Errors found in promised object list");
+ errors_found |= ERROR_PROMISED_OBJECT;
+ }
+
fsck_head_link();
if (connectivity_only) {
for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
@@ -727,6 +740,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
stop_progress(&progress);
}
}
+ for_each_promised_object(mark_have_promised_object, NULL);
heads = 0;
for (i = 0; i < argc; i++) {
diff --git a/cache.h b/cache.h
index 71fe09264..dd94b5ffc 100644
--- a/cache.h
+++ b/cache.h
@@ -853,10 +853,12 @@ extern int grafts_replace_parents;
#define GIT_REPO_VERSION 0
#define GIT_REPO_VERSION_READ 1
extern int repository_format_precious_objects;
+extern char *repository_format_promised_objects;
struct repository_format {
int version;
int precious_objects;
+ char *promised_objects;
int is_bare;
char *work_tree;
struct string_list unknown_extensions;
diff --git a/environment.c b/environment.c
index 3fd4b1084..82658470d 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
int warn_on_object_refname_ambiguity = 1;
int ref_paranoia = -1;
int repository_format_precious_objects;
+char *repository_format_promised_objects;
const char *git_commit_encoding;
const char *git_log_output_encoding;
const char *apply_default_whitespace;
diff --git a/fsck.c b/fsck.c
index b4204d772..f08ff0675 100644
--- a/fsck.c
+++ b/fsck.c
@@ -460,8 +460,10 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
if (!obj)
return -1;
- if (obj->type == OBJ_NONE)
- parse_object(&obj->oid);
+ if (!obj->parsed && !obj->promised)
+ parse_or_promise_object(&obj->oid);
+ if (obj->promised)
+ return 0;
switch (obj->type) {
case OBJ_BLOB:
diff --git a/object.c b/object.c
index 321d7e920..0aeb95084 100644
--- a/object.c
+++ b/object.c
@@ -4,6 +4,7 @@
#include "tree.h"
#include "commit.h"
#include "tag.h"
+#include "promised-object.h"
static struct object **obj_hash;
static int nr_objs, obj_hash_size;
@@ -141,6 +142,7 @@ void *create_object(const unsigned char *sha1, void *o)
struct object *obj = o;
obj->parsed = 0;
+ obj->promised = 0;
obj->flags = 0;
hashcpy(obj->oid.hash, sha1);
@@ -279,6 +281,23 @@ struct object *parse_object(const struct object_id *oid)
return NULL;
}
+struct object *parse_or_promise_object(const struct object_id *oid)
+{
+ enum object_type type;
+
+ if (has_object_file(oid))
+ return parse_object(oid);
+
+ if (is_promised_object(oid, &type, NULL)) {
+ struct object *obj = lookup_unknown_object(oid->hash);
+ obj->promised = 1;
+ obj->type = type;
+ return obj;
+ }
+
+ return NULL;
+}
+
struct object_list *object_list_insert(struct object *item,
struct object_list **list_p)
{
diff --git a/object.h b/object.h
index 0a419ba8d..640c8bff1 100644
--- a/object.h
+++ b/object.h
@@ -46,7 +46,17 @@ struct object_array {
* The object type is stored in 3 bits.
*/
struct object {
+ /*
+ * Set if this object is parsed. If set, "type" is populated and this
+ * object can be casted to "struct commit" or an equivalent.
+ */
unsigned parsed : 1;
+ /*
+ * Set if this object is not in the repo but is promised. If set,
+ * "type" is populated, but this object cannot be casted to "struct
+ * commit" or an equivalent.
+ */
+ unsigned promised : 1;
unsigned type : TYPE_BITS;
unsigned flags : FLAG_BITS;
struct object_id oid;
@@ -104,6 +114,15 @@ struct object *parse_object_or_die(const struct object_id *oid, const char *name
*/
struct object *parse_object_buffer(const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p);
+/*
+ * Returns the object, having parsed it if possible. In the returned object,
+ * either "parsed" or "promised" will be set.
+ *
+ * Returns NULL if the object is missing and not promised, or if the object is
+ * corrupt.
+ */
+struct object *parse_or_promise_object(const struct object_id *oid);
+
/** Returns the object, with potentially excess memory allocated. **/
struct object *lookup_unknown_object(const unsigned char *sha1);
diff --git a/promised-object.c b/promised-object.c
new file mode 100644
index 000000000..487ade437
--- /dev/null
+++ b/promised-object.c
@@ -0,0 +1,130 @@
+#include "cache.h"
+#include "promised-object.h"
+#include "sha1-lookup.h"
+#include "strbuf.h"
+
+#define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
+/*
+ * A mmap-ed byte array of size (promised_object_nr * ENTRY_SIZE). Each
+ * ENTRY_SIZE-sized entry consists of the SHA-1 of the promised object, its
+ * 8-bit object type, and its 64-bit size in network byte order. The entries
+ * are sorted in ascending SHA-1 order.
+ */
+static char *promised_objects;
+static int64_t promised_object_nr = -1;
+
+static void prepare_promised_objects(void)
+{
+ char *filename;
+ int fd;
+ struct stat st;
+
+ if (promised_object_nr >= 0)
+ return;
+
+ if (!repository_format_promised_objects ||
+ getenv("GIT_IGNORE_PROMISED_OBJECTS")) {
+ promised_object_nr = 0;
+ return;
+ }
+
+ filename = xstrfmt("%s/promised", get_object_directory());
+ fd = git_open(filename);
+ if (fd < 0) {
+ if (errno == ENOENT) {
+ promised_object_nr = 0;
+ goto cleanup;
+ }
+ perror("prepare_promised_objects");
+ die("Could not open %s", filename);
+ }
+ if (fstat(fd, &st)) {
+ perror("prepare_promised_objects");
+ die("Could not stat %s", filename);
+ }
+ if (st.st_size == 0) {
+ promised_object_nr = 0;
+ goto cleanup;
+ }
+ if (st.st_size % ENTRY_SIZE) {
+ die("Size of %s is not a multiple of %d", filename, ENTRY_SIZE);
+ }
+
+ promised_objects = xmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
+ promised_object_nr = st.st_size / ENTRY_SIZE;
+
+cleanup:
+ free(filename);
+ if (fd >= 0)
+ close(fd);
+}
+
+int is_promised_object(const struct object_id *oid, enum object_type *type,
+ unsigned long *size)
+{
+ int result;
+
+ prepare_promised_objects();
+ result = sha1_entry_pos(promised_objects, ENTRY_SIZE, 0, 0,
+ promised_object_nr, promised_object_nr,
+ oid->hash);
+ if (result >= 0) {
+ if (type) {
+ char *ptr = promised_objects +
+ result * ENTRY_SIZE + GIT_SHA1_RAWSZ;
+ *type = *ptr;
+ }
+ if (size) {
+ uint64_t size_nbo;
+ char *ptr = promised_objects +
+ result * ENTRY_SIZE + GIT_SHA1_RAWSZ + 1;
+ memcpy(&size_nbo, ptr, sizeof(size_nbo));
+ *size = ntohll(size_nbo);
+ }
+ return 1;
+ }
+ return 0;
+}
+
+int for_each_promised_object(each_promised_object_fn cb, void *data)
+{
+ struct object_id oid;
+ int i, r;
+
+ prepare_promised_objects();
+ for (i = 0; i < promised_object_nr; i++) {
+ memcpy(oid.hash, &promised_objects[i * ENTRY_SIZE],
+ GIT_SHA1_RAWSZ);
+ r = cb(&oid, data);
+ if (r)
+ return r;
+ }
+ return 0;
+}
+
+int fsck_promised_objects(void)
+{
+ int i;
+ prepare_promised_objects();
+ for (i = 0; i < promised_object_nr; i++) {
+ enum object_type type;
+ if (i != 0 && memcmp(&promised_objects[(i - 1) * ENTRY_SIZE],
+ &promised_objects[i * ENTRY_SIZE],
+ GIT_SHA1_RAWSZ) >= 0)
+ return error("Error in list of promised objects: not "
+ "in ascending order of object name");
+ type = promised_objects[i * ENTRY_SIZE + GIT_SHA1_RAWSZ];
+ switch (type) {
+ case OBJ_BLOB:
+ case OBJ_TREE:
+ case OBJ_COMMIT:
+ case OBJ_TAG:
+ break;
+ default:
+ return error("Error in list of promised "
+ "objects: found object of type %d",
+ type);
+ }
+ }
+ return 0;
+}
diff --git a/promised-object.h b/promised-object.h
new file mode 100644
index 000000000..7eaedff17
--- /dev/null
+++ b/promised-object.h
@@ -0,0 +1,22 @@
+#ifndef PROMISED_OBJECT_H
+#define PROMISED_OBJECT_H
+
+#include "cache.h"
+
+/*
+ * Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
+ * reported as their size.
+ */
+int is_promised_object(const struct object_id *oid, enum object_type *type,
+ unsigned long *size);
+
+typedef int each_promised_object_fn(const struct object_id *oid, void *data);
+int for_each_promised_object(each_promised_object_fn, void *);
+
+/*
+ * Returns 0 if there is no list of promised objects or if the list of promised
+ * objects is valid.
+ */
+int fsck_promised_objects(void);
+
+#endif
diff --git a/setup.c b/setup.c
index 860507e1f..36022452f 100644
--- a/setup.c
+++ b/setup.c
@@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
;
else if (!strcmp(ext, "preciousobjects"))
data->precious_objects = git_config_bool(var, value);
- else
+ else if (!strcmp(ext, "promisedobjects")) {
+ if (!value)
+ return config_error_nonbool(var);
+ data->promised_objects = xstrdup(value);
+ } else
string_list_append(&data->unknown_extensions, ext);
} else if (strcmp(var, "core.bare") == 0) {
data->is_bare = git_config_bool(var, value);
@@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
}
repository_format_precious_objects = candidate.precious_objects;
+ repository_format_promised_objects = candidate.promised_objects;
string_list_clear(&candidate.unknown_extensions, 0);
if (!has_common) {
if (candidate.is_bare != -1) {
diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
new file mode 100755
index 000000000..3e0caf4f9
--- /dev/null
+++ b/t/t3907-promised-object.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+
+test_description='promised objects'
+
+. ./test-lib.sh
+
+test_expect_success 'fsck fails on missing objects' '
+ test_create_repo repo &&
+
+ test_commit -C repo 1 &&
+ test_commit -C repo 2 &&
+ test_commit -C repo 3 &&
+ git -C repo tag -a annotated_tag -m "annotated tag" &&
+ C=$(git -C repo rev-parse 1) &&
+ T=$(git -C repo rev-parse 2^{tree}) &&
+ B=$(git hash-object repo/3.t) &&
+ AT=$(git -C repo rev-parse annotated_tag) &&
+
+ # missing commit, tree, blob, and tag
+ rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
+ rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
+ rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
+ rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&
+ test_must_fail git -C repo fsck
+'
+
+test_expect_success '...but succeeds if they are promised objects' '
+ printf "%s01%016x\n%s02%016x\n%s03%016x\n%s04%016x" \
+ "$C" 0 "$T" 0 "$B" "$(wc -c <repo/3.t)" "$AT" 0 |
+ sort | tr -d "\n" | hex_pack >repo/.git/objects/promised &&
+ git -C repo config core.repositoryformatversion 1 &&
+ git -C repo config extensions.promisedobjects "arbitrary string" &&
+ git -C repo fsck
+'
+
+test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
+ GIT_IGNORE_PROMISED_OBJECTS=1 test_must_fail git -C repo fsck &&
+ unset GIT_IGNORE_PROMISED_OBJECTS
+'
+
+test_done
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index db622c355..1ebdd2d04 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1018,3 +1018,9 @@ nongit () {
"$@"
)
}
+
+# Converts big-endian pairs of hexadecimal digits into bytes. For example,
+# "printf 61620d0a | hex_pack" results in "ab\r\n".
+hex_pack () {
+ perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
+}
--
2.14.0.rc0.284.gd933b75aa4-goog
^ permalink raw reply related [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
2017-07-20 0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
@ 2017-07-20 18:07 ` Stefan Beller
2017-07-20 19:17 ` Jonathan Tan
2017-07-20 19:58 ` Ben Peart
1 sibling, 1 reply; 45+ messages in thread
From: Stefan Beller @ 2017-07-20 18:07 UTC (permalink / raw)
To: Jonathan Tan
Cc: git@vger.kernel.org, Jonathan Nieder, Jeff Hostetler, Ben Peart,
Philip Oakley
On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> Currently, Git does not support repos with very large numbers of objects
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every referenced object is available somewhere in the
> repo storage.
>
> As a first step to reducing this problem, introduce the concept of
> promised objects. Each Git repo can contain a list of promised objects
> and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> contains functions to query them; functions for creating and modifying
> that file will be introduced in later patches.
>
> A repository that is missing an object but has that object promised is not
> considered to be in error, so also teach fsck this. As part of doing
> this, object.{h,c} has been modified to generate "struct object" based
> on only the information available to promised objects, without requiring
> the object itself.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> Documentation/technical/repository-version.txt | 6 ++
> Makefile | 1 +
> builtin/fsck.c | 18 +++-
> cache.h | 2 +
> environment.c | 1 +
> fsck.c | 6 +-
> object.c | 19 ++++
> object.h | 19 ++++
> promised-object.c | 130 +++++++++++++++++++++++++
> promised-object.h | 22 +++++
> setup.c | 7 +-
> t/t3907-promised-object.sh | 41 ++++++++
> t/test-lib-functions.sh | 6 ++
> 13 files changed, 273 insertions(+), 5 deletions(-)
> create mode 100644 promised-object.c
> create mode 100644 promised-object.h
> create mode 100755 t/t3907-promised-object.sh
>
> diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
> index 00ad37986..f8b82c1c7 100644
> --- a/Documentation/technical/repository-version.txt
> +++ b/Documentation/technical/repository-version.txt
> @@ -86,3 +86,9 @@ for testing format-1 compatibility.
> When the config key `extensions.preciousObjects` is set to `true`,
> objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
> `git repack -d`).
> +
> +`promisedObjects`
> +~~~~~~~~~~~~~~~~~
> +
> +(Explain this - basically a string containing a command to be run
> +whenever a missing object needs to be fetched.)
> diff --git a/Makefile b/Makefile
> index 9c9c42f8f..c1446d5ef 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
> LIB_OBJS += pretty.o
> LIB_OBJS += prio-queue.o
> LIB_OBJS += progress.o
> +LIB_OBJS += promised-object.o
> LIB_OBJS += prompt.o
> LIB_OBJS += quote.o
> LIB_OBJS += reachable.o
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 462b8643b..49e21f361 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -15,6 +15,7 @@
> #include "progress.h"
> #include "streaming.h"
> #include "decorate.h"
> +#include "promised-object.h"
>
> #define REACHABLE 0x0001
> #define SEEN 0x0002
> @@ -44,6 +45,7 @@ static int name_objects;
> #define ERROR_REACHABLE 02
> #define ERROR_PACK 04
> #define ERROR_REFS 010
> +#define ERROR_PROMISED_OBJECT 011
>
> static const char *describe_object(struct object *obj)
> {
> @@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
> {
> struct object *obj;
>
> - obj = parse_object(oid);
> + obj = parse_or_promise_object(oid);
> if (!obj) {
> error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
> errors_found |= ERROR_REACHABLE;
> @@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
> fprintf(stderr, "Checking cache tree\n");
>
> if (0 <= it->entry_count) {
> - struct object *obj = parse_object(&it->oid);
> + struct object *obj = parse_or_promise_object(&it->oid);
> if (!obj) {
> error("%s: invalid sha1 pointer in cache-tree",
> oid_to_hex(&it->oid));
> @@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
> return 0;
> }
>
> +static int mark_have_promised_object(const struct object_id *oid, void *data)
> +{
> + mark_object_for_connectivity(oid);
> + return 0;
> +}
> +
> static char const * const fsck_usage[] = {
> N_("git fsck [<options>] [<object>...]"),
> NULL
> @@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>
> git_config(fsck_config, NULL);
>
> + if (fsck_promised_objects()) {
> + error("Errors found in promised object list");
> + errors_found |= ERROR_PROMISED_OBJECT;
> + }
This got me thinking: It is an error if we do not have an object
and also do not promise it, but what about the other case:
having and object and promising it, too?
That is not harmful to the operation, except that the promise
machinery may be slower due to its size. (Should that be a soft
warning then? Do we have warnings in fsck?)
> * The object type is stored in 3 bits.
> */
We may want to remove this comment while we're here as it
sounds stale despite being technically correct.
1974632c66 (Remove TYPE_* constant macros and use
object_type enums consistently., 2006-07-11)
> struct object {
> + /*
> + * Set if this object is parsed. If set, "type" is populated and this
> + * object can be casted to "struct commit" or an equivalent.
> + */
> unsigned parsed : 1;
> + /*
> + * Set if this object is not in the repo but is promised. If set,
> + * "type" is populated, but this object cannot be casted to "struct
> + * commit" or an equivalent.
> + */
> + unsigned promised : 1;
Would it make sense to have a bit field instead:
#define STATE_BITS 2
#define STATE_PARSED (1<<0)
#define STATE_PROMISED (1<<1)
unsigned state : STATE_BITS
This would be similar to the types and flags?
> +test_expect_success 'fsck fails on missing objects' '
> + test_create_repo repo &&
> +
> + test_commit -C repo 1 &&
> + test_commit -C repo 2 &&
> + test_commit -C repo 3 &&
> + git -C repo tag -a annotated_tag -m "annotated tag" &&
> + C=$(git -C repo rev-parse 1) &&
> + T=$(git -C repo rev-parse 2^{tree}) &&
> + B=$(git hash-object repo/3.t) &&
> + AT=$(git -C repo rev-parse annotated_tag) &&
> +
> + # missing commit, tree, blob, and tag
> + rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
> + rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
> + rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
> + rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&
This is a pretty cool test as it promises all sorts of objects
from different parts of the graph.
^ permalink raw reply [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
2017-07-20 18:07 ` Stefan Beller
@ 2017-07-20 19:17 ` Jonathan Tan
0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2017-07-20 19:17 UTC (permalink / raw)
To: Stefan Beller
Cc: git@vger.kernel.org, Jonathan Nieder, Jeff Hostetler, Ben Peart,
Philip Oakley
On Thu, 20 Jul 2017 11:07:29 -0700
Stefan Beller <sbeller@google.com> wrote:
> > + if (fsck_promised_objects()) {
> > + error("Errors found in promised object list");
> > + errors_found |= ERROR_PROMISED_OBJECT;
> > + }
>
> This got me thinking: It is an error if we do not have an object
> and also do not promise it, but what about the other case:
> having and object and promising it, too?
> That is not harmful to the operation, except that the promise
> machinery may be slower due to its size. (Should that be a soft
> warning then? Do we have warnings in fsck?)
Good question - having an object and also having it promised is not an
error condition (and I don't think it's a good idea to make it so,
because objects can appear quite easily from various sources). In the
future, I expect "git gc" to be extended to remove such redundant lines
from the "promised" list.
> > * The object type is stored in 3 bits.
> > */
>
> We may want to remove this comment while we're here as it
> sounds stale despite being technically correct.
> 1974632c66 (Remove TYPE_* constant macros and use
> object_type enums consistently., 2006-07-11)
I agree that the comment is unnecessary, but in this commit I didn't
modify anything to do with the type, so I left it there.
> > struct object {
> > + /*
> > + * Set if this object is parsed. If set, "type" is populated and this
> > + * object can be casted to "struct commit" or an equivalent.
> > + */
> > unsigned parsed : 1;
> > + /*
> > + * Set if this object is not in the repo but is promised. If set,
> > + * "type" is populated, but this object cannot be casted to "struct
> > + * commit" or an equivalent.
> > + */
> > + unsigned promised : 1;
>
> Would it make sense to have a bit field instead:
>
> #define STATE_BITS 2
> #define STATE_PARSED (1<<0)
> #define STATE_PROMISED (1<<1)
>
> unsigned state : STATE_BITS
>
> This would be similar to the types and flags?
Both type and flag have to be bit fields (for different reasons), but
since we don't need such a combined field for "parsed" and "promised", I
prefer separating them each into their own field.
> > +test_expect_success 'fsck fails on missing objects' '
> > + test_create_repo repo &&
> > +
> > + test_commit -C repo 1 &&
> > + test_commit -C repo 2 &&
> > + test_commit -C repo 3 &&
> > + git -C repo tag -a annotated_tag -m "annotated tag" &&
> > + C=$(git -C repo rev-parse 1) &&
> > + T=$(git -C repo rev-parse 2^{tree}) &&
> > + B=$(git hash-object repo/3.t) &&
> > + AT=$(git -C repo rev-parse annotated_tag) &&
> > +
> > + # missing commit, tree, blob, and tag
> > + rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
> > + rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
> > + rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
> > + rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&
>
> This is a pretty cool test as it promises all sorts of objects
> from different parts of the graph.
Thanks.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
2017-07-20 0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
2017-07-20 18:07 ` Stefan Beller
@ 2017-07-20 19:58 ` Ben Peart
2017-07-20 21:13 ` Jonathan Tan
1 sibling, 1 reply; 45+ messages in thread
From: Ben Peart @ 2017-07-20 19:58 UTC (permalink / raw)
To: Jonathan Tan, git; +Cc: jrnieder, sbeller, git, philipoakley
On 7/19/2017 8:21 PM, Jonathan Tan wrote:
> Currently, Git does not support repos with very large numbers of objects
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every referenced object is available somewhere in the
> repo storage.
>
Great to see this idea making progress. Making git able to gracefully
handle partial clones (beyond the existing shallow clone support) is a
key piece of dealing with very large objects and repos.
> As a first step to reducing this problem, introduce the concept of
> promised objects. Each Git repo can contain a list of promised objects
> and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> contains functions to query them; functions for creating and modifying
> that file will be introduced in later patches.
If I'm reading this patch correctly, for a repo to successfully pass
"git fsck" either the object or a promise must exist for everything fsck
checks. From the documentation for fsck it says "git fsck defaults to
using the index file, all SHA-1 references in refs namespace, and all
reflogs (unless --no-reflogs is given) as heads." Doesn't this then
imply objects or promises must exist for all objects referenced by any
of these locations?
We're currently in the hundreds of millions of objects on some of our
repos so even downloading the promises for all the objects in the index
is unreasonable as it is gigabytes of data and growing.
I think we should have a flag (off by default) that enables someone to
say that promised objects are optional. If the flag is set,
"is_promised_object" will return success and pass the OBJ_ANY type and a
size of -1.
Nothing today is using the size and in the two places where the object
type is being checked for consistency (fsck_cache_tree and
fsck_handle_ref) the test can add a test for OBJ_ANY as well.
This will enable very large numbers of objects to be omitted from the
clone without triggering a download of the corresponding number of
promised objects.
>
> A repository that is missing an object but has that object promised is not
> considered to be in error, so also teach fsck this. As part of doing
> this, object.{h,c} has been modified to generate "struct object" based
> on only the information available to promised objects, without requiring
> the object itself.
In your work on this, did you investigate if there are other commands
(ie repack/gc) that will need to learn about promised objects? Have you
had a chance (or have plans) to hack up the test suite so that it runs
all tests with promised objects and see what (if anything) breaks?
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> Documentation/technical/repository-version.txt | 6 ++
> Makefile | 1 +
> builtin/fsck.c | 18 +++-
> cache.h | 2 +
> environment.c | 1 +
> fsck.c | 6 +-
> object.c | 19 ++++
> object.h | 19 ++++
> promised-object.c | 130 +++++++++++++++++++++++++
> promised-object.h | 22 +++++
> setup.c | 7 +-
> t/t3907-promised-object.sh | 41 ++++++++
> t/test-lib-functions.sh | 6 ++
> 13 files changed, 273 insertions(+), 5 deletions(-)
> create mode 100644 promised-object.c
> create mode 100644 promised-object.h
> create mode 100755 t/t3907-promised-object.sh
>
> diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
> index 00ad37986..f8b82c1c7 100644
> --- a/Documentation/technical/repository-version.txt
> +++ b/Documentation/technical/repository-version.txt
> @@ -86,3 +86,9 @@ for testing format-1 compatibility.
> When the config key `extensions.preciousObjects` is set to `true`,
> objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
> `git repack -d`).
> +
> +`promisedObjects`
> +~~~~~~~~~~~~~~~~~
> +
> +(Explain this - basically a string containing a command to be run
> +whenever a missing object needs to be fetched.)
> diff --git a/Makefile b/Makefile
> index 9c9c42f8f..c1446d5ef 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -828,6 +828,7 @@ LIB_OBJS += preload-index.o
> LIB_OBJS += pretty.o
> LIB_OBJS += prio-queue.o
> LIB_OBJS += progress.o
> +LIB_OBJS += promised-object.o
> LIB_OBJS += prompt.o
> LIB_OBJS += quote.o
> LIB_OBJS += reachable.o
> diff --git a/builtin/fsck.c b/builtin/fsck.c
> index 462b8643b..49e21f361 100644
> --- a/builtin/fsck.c
> +++ b/builtin/fsck.c
> @@ -15,6 +15,7 @@
> #include "progress.h"
> #include "streaming.h"
> #include "decorate.h"
> +#include "promised-object.h"
>
> #define REACHABLE 0x0001
> #define SEEN 0x0002
> @@ -44,6 +45,7 @@ static int name_objects;
> #define ERROR_REACHABLE 02
> #define ERROR_PACK 04
> #define ERROR_REFS 010
> +#define ERROR_PROMISED_OBJECT 011
>
> static const char *describe_object(struct object *obj)
> {
> @@ -436,7 +438,7 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
> {
> struct object *obj;
>
> - obj = parse_object(oid);
> + obj = parse_or_promise_object(oid);
> if (!obj) {
> error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
> errors_found |= ERROR_REACHABLE;
> @@ -592,7 +594,7 @@ static int fsck_cache_tree(struct cache_tree *it)
> fprintf(stderr, "Checking cache tree\n");
>
> if (0 <= it->entry_count) {
> - struct object *obj = parse_object(&it->oid);
> + struct object *obj = parse_or_promise_object(&it->oid);
> if (!obj) {
> error("%s: invalid sha1 pointer in cache-tree",
> oid_to_hex(&it->oid));
> @@ -635,6 +637,12 @@ static int mark_packed_for_connectivity(const struct object_id *oid,
> return 0;
> }
>
> +static int mark_have_promised_object(const struct object_id *oid, void *data)
> +{
> + mark_object_for_connectivity(oid);
> + return 0;
> +}
> +
> static char const * const fsck_usage[] = {
> N_("git fsck [<options>] [<object>...]"),
> NULL
> @@ -690,6 +698,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
>
> git_config(fsck_config, NULL);
>
> + if (fsck_promised_objects()) {
> + error("Errors found in promised object list");
> + errors_found |= ERROR_PROMISED_OBJECT;
> + }
> +
> fsck_head_link();
> if (connectivity_only) {
> for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
> @@ -727,6 +740,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
> stop_progress(&progress);
> }
> }
> + for_each_promised_object(mark_have_promised_object, NULL);
>
> heads = 0;
> for (i = 0; i < argc; i++) {
> diff --git a/cache.h b/cache.h
> index 71fe09264..dd94b5ffc 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -853,10 +853,12 @@ extern int grafts_replace_parents;
> #define GIT_REPO_VERSION 0
> #define GIT_REPO_VERSION_READ 1
> extern int repository_format_precious_objects;
> +extern char *repository_format_promised_objects;
>
> struct repository_format {
> int version;
> int precious_objects;
> + char *promised_objects;
> int is_bare;
> char *work_tree;
> struct string_list unknown_extensions;
> diff --git a/environment.c b/environment.c
> index 3fd4b1084..82658470d 100644
> --- a/environment.c
> +++ b/environment.c
> @@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
> int warn_on_object_refname_ambiguity = 1;
> int ref_paranoia = -1;
> int repository_format_precious_objects;
> +char *repository_format_promised_objects;
> const char *git_commit_encoding;
> const char *git_log_output_encoding;
> const char *apply_default_whitespace;
> diff --git a/fsck.c b/fsck.c
> index b4204d772..f08ff0675 100644
> --- a/fsck.c
> +++ b/fsck.c
> @@ -460,8 +460,10 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
> if (!obj)
> return -1;
>
> - if (obj->type == OBJ_NONE)
> - parse_object(&obj->oid);
> + if (!obj->parsed && !obj->promised)
> + parse_or_promise_object(&obj->oid);
> + if (obj->promised)
> + return 0;
>
> switch (obj->type) {
> case OBJ_BLOB:
> diff --git a/object.c b/object.c
> index 321d7e920..0aeb95084 100644
> --- a/object.c
> +++ b/object.c
> @@ -4,6 +4,7 @@
> #include "tree.h"
> #include "commit.h"
> #include "tag.h"
> +#include "promised-object.h"
>
> static struct object **obj_hash;
> static int nr_objs, obj_hash_size;
> @@ -141,6 +142,7 @@ void *create_object(const unsigned char *sha1, void *o)
> struct object *obj = o;
>
> obj->parsed = 0;
> + obj->promised = 0;
> obj->flags = 0;
> hashcpy(obj->oid.hash, sha1);
>
> @@ -279,6 +281,23 @@ struct object *parse_object(const struct object_id *oid)
> return NULL;
> }
>
> +struct object *parse_or_promise_object(const struct object_id *oid)
> +{
> + enum object_type type;
> +
> + if (has_object_file(oid))
> + return parse_object(oid);
> +
> + if (is_promised_object(oid, &type, NULL)) {
> + struct object *obj = lookup_unknown_object(oid->hash);
> + obj->promised = 1;
> + obj->type = type;
> + return obj;
> + }
> +
> + return NULL;
> +}
> +
> struct object_list *object_list_insert(struct object *item,
> struct object_list **list_p)
> {
> diff --git a/object.h b/object.h
> index 0a419ba8d..640c8bff1 100644
> --- a/object.h
> +++ b/object.h
> @@ -46,7 +46,17 @@ struct object_array {
> * The object type is stored in 3 bits.
> */
> struct object {
> + /*
> + * Set if this object is parsed. If set, "type" is populated and this
> + * object can be casted to "struct commit" or an equivalent.
> + */
> unsigned parsed : 1;
> + /*
> + * Set if this object is not in the repo but is promised. If set,
> + * "type" is populated, but this object cannot be casted to "struct
> + * commit" or an equivalent.
> + */
> + unsigned promised : 1;
> unsigned type : TYPE_BITS;
> unsigned flags : FLAG_BITS;
> struct object_id oid;
> @@ -104,6 +114,15 @@ struct object *parse_object_or_die(const struct object_id *oid, const char *name
> */
> struct object *parse_object_buffer(const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p);
>
> +/*
> + * Returns the object, having parsed it if possible. In the returned object,
> + * either "parsed" or "promised" will be set.
> + *
> + * Returns NULL if the object is missing and not promised, or if the object is
> + * corrupt.
> + */
> +struct object *parse_or_promise_object(const struct object_id *oid);
> +
> /** Returns the object, with potentially excess memory allocated. **/
> struct object *lookup_unknown_object(const unsigned char *sha1);
>
> diff --git a/promised-object.c b/promised-object.c
> new file mode 100644
> index 000000000..487ade437
> --- /dev/null
> +++ b/promised-object.c
> @@ -0,0 +1,130 @@
> +#include "cache.h"
> +#include "promised-object.h"
> +#include "sha1-lookup.h"
> +#include "strbuf.h"
> +
> +#define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
> +/*
> + * A mmap-ed byte array of size (promised_object_nr * ENTRY_SIZE). Each
> + * ENTRY_SIZE-sized entry consists of the SHA-1 of the promised object, its
> + * 8-bit object type, and its 64-bit size in network byte order. The entries
> + * are sorted in ascending SHA-1 order.
> + */
> +static char *promised_objects;
> +static int64_t promised_object_nr = -1;
> +
> +static void prepare_promised_objects(void)
> +{
> + char *filename;
> + int fd;
> + struct stat st;
> +
> + if (promised_object_nr >= 0)
> + return;
> +
> + if (!repository_format_promised_objects ||
> + getenv("GIT_IGNORE_PROMISED_OBJECTS")) {
> + promised_object_nr = 0;
> + return;
> + }
> +
> + filename = xstrfmt("%s/promised", get_object_directory());
> + fd = git_open(filename);
> + if (fd < 0) {
> + if (errno == ENOENT) {
> + promised_object_nr = 0;
> + goto cleanup;
> + }
> + perror("prepare_promised_objects");
> + die("Could not open %s", filename);
> + }
> + if (fstat(fd, &st)) {
> + perror("prepare_promised_objects");
> + die("Could not stat %s", filename);
> + }
> + if (st.st_size == 0) {
> + promised_object_nr = 0;
> + goto cleanup;
> + }
> + if (st.st_size % ENTRY_SIZE) {
> + die("Size of %s is not a multiple of %d", filename, ENTRY_SIZE);
> + }
> +
> + promised_objects = xmmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
> + promised_object_nr = st.st_size / ENTRY_SIZE;
> +
> +cleanup:
> + free(filename);
> + if (fd >= 0)
> + close(fd);
> +}
> +
> +int is_promised_object(const struct object_id *oid, enum object_type *type,
> + unsigned long *size)
> +{
> + int result;
> +
> + prepare_promised_objects();
> + result = sha1_entry_pos(promised_objects, ENTRY_SIZE, 0, 0,
> + promised_object_nr, promised_object_nr,
> + oid->hash);
> + if (result >= 0) {
> + if (type) {
> + char *ptr = promised_objects +
> + result * ENTRY_SIZE + GIT_SHA1_RAWSZ;
> + *type = *ptr;
> + }
> + if (size) {
> + uint64_t size_nbo;
> + char *ptr = promised_objects +
> + result * ENTRY_SIZE + GIT_SHA1_RAWSZ + 1;
> + memcpy(&size_nbo, ptr, sizeof(size_nbo));
> + *size = ntohll(size_nbo);
> + }
> + return 1;
> + }
> + return 0;
> +}
> +
> +int for_each_promised_object(each_promised_object_fn cb, void *data)
> +{
> + struct object_id oid;
> + int i, r;
> +
> + prepare_promised_objects();
> + for (i = 0; i < promised_object_nr; i++) {
> + memcpy(oid.hash, &promised_objects[i * ENTRY_SIZE],
> + GIT_SHA1_RAWSZ);
> + r = cb(&oid, data);
> + if (r)
> + return r;
> + }
> + return 0;
> +}
> +
> +int fsck_promised_objects(void)
> +{
> + int i;
> + prepare_promised_objects();
> + for (i = 0; i < promised_object_nr; i++) {
> + enum object_type type;
> + if (i != 0 && memcmp(&promised_objects[(i - 1) * ENTRY_SIZE],
> + &promised_objects[i * ENTRY_SIZE],
> + GIT_SHA1_RAWSZ) >= 0)
> + return error("Error in list of promised objects: not "
> + "in ascending order of object name");
> + type = promised_objects[i * ENTRY_SIZE + GIT_SHA1_RAWSZ];
> + switch (type) {
> + case OBJ_BLOB:
> + case OBJ_TREE:
> + case OBJ_COMMIT:
> + case OBJ_TAG:
> + break;
> + default:
> + return error("Error in list of promised "
> + "objects: found object of type %d",
> + type);
> + }
> + }
> + return 0;
> +}
> diff --git a/promised-object.h b/promised-object.h
> new file mode 100644
> index 000000000..7eaedff17
> --- /dev/null
> +++ b/promised-object.h
> @@ -0,0 +1,22 @@
> +#ifndef PROMISED_OBJECT_H
> +#define PROMISED_OBJECT_H
> +
> +#include "cache.h"
> +
> +/*
> + * Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
> + * reported as their size.
> + */
> +int is_promised_object(const struct object_id *oid, enum object_type *type,
> + unsigned long *size);
> +
> +typedef int each_promised_object_fn(const struct object_id *oid, void *data);
> +int for_each_promised_object(each_promised_object_fn, void *);
> +
> +/*
> + * Returns 0 if there is no list of promised objects or if the list of promised
> + * objects is valid.
> + */
> +int fsck_promised_objects(void);
> +
> +#endif
> diff --git a/setup.c b/setup.c
> index 860507e1f..36022452f 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
> ;
> else if (!strcmp(ext, "preciousobjects"))
> data->precious_objects = git_config_bool(var, value);
> - else
> + else if (!strcmp(ext, "promisedobjects")) {
> + if (!value)
> + return config_error_nonbool(var);
> + data->promised_objects = xstrdup(value);
> + } else
> string_list_append(&data->unknown_extensions, ext);
> } else if (strcmp(var, "core.bare") == 0) {
> data->is_bare = git_config_bool(var, value);
> @@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
> }
>
> repository_format_precious_objects = candidate.precious_objects;
> + repository_format_promised_objects = candidate.promised_objects;
> string_list_clear(&candidate.unknown_extensions, 0);
> if (!has_common) {
> if (candidate.is_bare != -1) {
> diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
> new file mode 100755
> index 000000000..3e0caf4f9
> --- /dev/null
> +++ b/t/t3907-promised-object.sh
> @@ -0,0 +1,41 @@
> +#!/bin/sh
> +
> +test_description='promised objects'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'fsck fails on missing objects' '
> + test_create_repo repo &&
> +
> + test_commit -C repo 1 &&
> + test_commit -C repo 2 &&
> + test_commit -C repo 3 &&
> + git -C repo tag -a annotated_tag -m "annotated tag" &&
> + C=$(git -C repo rev-parse 1) &&
> + T=$(git -C repo rev-parse 2^{tree}) &&
> + B=$(git hash-object repo/3.t) &&
> + AT=$(git -C repo rev-parse annotated_tag) &&
> +
> + # missing commit, tree, blob, and tag
> + rm repo/.git/objects/$(echo $C | cut -c1-2)/$(echo $C | cut -c3-40) &&
> + rm repo/.git/objects/$(echo $T | cut -c1-2)/$(echo $T | cut -c3-40) &&
> + rm repo/.git/objects/$(echo $B | cut -c1-2)/$(echo $B | cut -c3-40) &&
> + rm repo/.git/objects/$(echo $AT | cut -c1-2)/$(echo $AT | cut -c3-40) &&
> + test_must_fail git -C repo fsck
> +'
> +
> +test_expect_success '...but succeeds if they are promised objects' '
> + printf "%s01%016x\n%s02%016x\n%s03%016x\n%s04%016x" \
> + "$C" 0 "$T" 0 "$B" "$(wc -c <repo/3.t)" "$AT" 0 |
> + sort | tr -d "\n" | hex_pack >repo/.git/objects/promised &&
> + git -C repo config core.repositoryformatversion 1 &&
> + git -C repo config extensions.promisedobjects "arbitrary string" &&
> + git -C repo fsck
> +'
> +
> +test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
> + GIT_IGNORE_PROMISED_OBJECTS=1 test_must_fail git -C repo fsck &&
> + unset GIT_IGNORE_PROMISED_OBJECTS
> +'
> +
> +test_done
> diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
> index db622c355..1ebdd2d04 100644
> --- a/t/test-lib-functions.sh
> +++ b/t/test-lib-functions.sh
> @@ -1018,3 +1018,9 @@ nongit () {
> "$@"
> )
> }
> +
> +# Converts big-endian pairs of hexadecimal digits into bytes. For example,
> +# "printf 61620d0a | hex_pack" results in "ab\r\n".
> +hex_pack () {
> + perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
> +}
>
^ permalink raw reply [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
2017-07-20 19:58 ` Ben Peart
@ 2017-07-20 21:13 ` Jonathan Tan
2017-07-21 16:24 ` Ben Peart
0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Tan @ 2017-07-20 21:13 UTC (permalink / raw)
To: Ben Peart; +Cc: git, jrnieder, sbeller, git, philipoakley
On Thu, 20 Jul 2017 15:58:51 -0400
Ben Peart <peartben@gmail.com> wrote:
> On 7/19/2017 8:21 PM, Jonathan Tan wrote:
> > Currently, Git does not support repos with very large numbers of objects
> > or repos that wish to minimize manipulation of certain blobs (for
> > example, because they are very large) very well, even if the user
> > operates mostly on part of the repo, because Git is designed on the
> > assumption that every referenced object is available somewhere in the
> > repo storage.
> >
>
> Great to see this idea making progress. Making git able to gracefully
> handle partial clones (beyond the existing shallow clone support) is a
> key piece of dealing with very large objects and repos.
Thanks.
> > As a first step to reducing this problem, introduce the concept of
> > promised objects. Each Git repo can contain a list of promised objects
> > and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
> > contains functions to query them; functions for creating and modifying
> > that file will be introduced in later patches.
>
> If I'm reading this patch correctly, for a repo to successfully pass
> "git fsck" either the object or a promise must exist for everything fsck
> checks. From the documentation for fsck it says "git fsck defaults to
> using the index file, all SHA-1 references in refs namespace, and all
> reflogs (unless --no-reflogs is given) as heads." Doesn't this then
> imply objects or promises must exist for all objects referenced by any
> of these locations?
>
> We're currently in the hundreds of millions of objects on some of our
> repos so even downloading the promises for all the objects in the index
> is unreasonable as it is gigabytes of data and growing.
For the index to contain all the files, the repo must already have
downloaded all the trees for HEAD (at least). The trees collectively
contain entries for all the relevant blobs. We need one promise for each
blob, and the size of a promise is comparable to the size of a tree
entry, so the size (of download and storage) needed would be just double
of what we would need if we didn't need promises. This is still only
linear growth, unless you have found that the absolute numbers are too
large?
Also, if the index is ever changed to not have one entry for every file,
we also wouldn't need one promise for every file.
> I think we should have a flag (off by default) that enables someone to
> say that promised objects are optional. If the flag is set,
> "is_promised_object" will return success and pass the OBJ_ANY type and a
> size of -1.
>
> Nothing today is using the size and in the two places where the object
> type is being checked for consistency (fsck_cache_tree and
> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
>
> This will enable very large numbers of objects to be omitted from the
> clone without triggering a download of the corresponding number of
> promised objects.
Eventually I plan to use the size when implementing parameters for
history-searching commands (e.g. "git log -S"), but it's true that
that's in the future.
Allowing promised objects to be optional would indeed solve the issue of
downloading too many promises. It would make the code more complicated,
but I'm not sure by how much.
For example, in this fsck patch, the easiest way I could think of to
have promised objects was to introduce a 3rd state, called "promised",
of "struct object" - one in which the type is known, but we don't have
access to the full "struct commit" or equivalent. And thus fsck could
assume that if the "struct object" is "parsed" or "promised", the type
is known. Having optional promised objects would require that we let
this "promised" state have a type of OBJ_UNKNOWN (or something like
that) - maybe that would be fine, but I haven't looked into this in
detail.
> > A repository that is missing an object but has that object promised is not
> > considered to be in error, so also teach fsck this. As part of doing
> > this, object.{h,c} has been modified to generate "struct object" based
> > on only the information available to promised objects, without requiring
> > the object itself.
>
> In your work on this, did you investigate if there are other commands
> (ie repack/gc) that will need to learn about promised objects? Have you
> had a chance (or have plans) to hack up the test suite so that it runs
> all tests with promised objects and see what (if anything) breaks?
In one of the subsequent patches, I tried to ensure that all
object-reading functions in sha1_file.c somewhat works (albeit slowly)
in the presence of promised objects - that would cover the functionality
of the other commands. As for hacking up the test suite to run with
promised objects, that would be ideal, but I haven't figured out how to
do that yet.
^ permalink raw reply [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
2017-07-20 21:13 ` Jonathan Tan
@ 2017-07-21 16:24 ` Ben Peart
2017-07-21 20:33 ` Jonathan Tan
0 siblings, 1 reply; 45+ messages in thread
From: Ben Peart @ 2017-07-21 16:24 UTC (permalink / raw)
To: Jonathan Tan; +Cc: git, jrnieder, sbeller, git, philipoakley
On 7/20/2017 5:13 PM, Jonathan Tan wrote:
> On Thu, 20 Jul 2017 15:58:51 -0400
> Ben Peart <peartben@gmail.com> wrote:
>
>> On 7/19/2017 8:21 PM, Jonathan Tan wrote:
>>> Currently, Git does not support repos with very large numbers of objects
>>> or repos that wish to minimize manipulation of certain blobs (for
>>> example, because they are very large) very well, even if the user
>>> operates mostly on part of the repo, because Git is designed on the
>>> assumption that every referenced object is available somewhere in the
>>> repo storage.
>>>
>>
>> Great to see this idea making progress. Making git able to gracefully
>> handle partial clones (beyond the existing shallow clone support) is a
>> key piece of dealing with very large objects and repos.
>
> Thanks.
>
>>> As a first step to reducing this problem, introduce the concept of
>>> promised objects. Each Git repo can contain a list of promised objects
>>> and their sizes (if blobs) at $GIT_DIR/objects/promised. This patch
>>> contains functions to query them; functions for creating and modifying
>>> that file will be introduced in later patches.
>>
>> If I'm reading this patch correctly, for a repo to successfully pass
>> "git fsck" either the object or a promise must exist for everything fsck
>> checks. From the documentation for fsck it says "git fsck defaults to
>> using the index file, all SHA-1 references in refs namespace, and all
>> reflogs (unless --no-reflogs is given) as heads." Doesn't this then
>> imply objects or promises must exist for all objects referenced by any
>> of these locations?
>>
>> We're currently in the hundreds of millions of objects on some of our
>> repos so even downloading the promises for all the objects in the index
>> is unreasonable as it is gigabytes of data and growing.
>
> For the index to contain all the files, the repo must already have
> downloaded all the trees for HEAD (at least). The trees collectively
> contain entries for all the relevant blobs. We need one promise for each
> blob, and the size of a promise is comparable to the size of a tree
> entry, so the size (of download and storage) needed would be just double
> of what we would need if we didn't need promises. This is still only
> linear growth, unless you have found that the absolute numbers are too
> large?
>
Today we have 3.5 million objects * 30 bytes per entry = 105 MB of
promises. Given the average developer only hydrates 56K files (2 MB
promises) that is 103 MB to download that no one will ever need. We
would like to avoid that if possible as this would be a significant
regression in clone times from where we are today.
I'm also concerned about the performance of merging in promises given we
have 100M objects today and growing so the number of promises over time
could get pretty large.
> Also, if the index is ever changed to not have one entry for every file,
> we also wouldn't need one promise for every file.
>
>> I think we should have a flag (off by default) that enables someone to
>> say that promised objects are optional. If the flag is set,
>> "is_promised_object" will return success and pass the OBJ_ANY type and a
>> size of -1.
>>
>> Nothing today is using the size and in the two places where the object
>> type is being checked for consistency (fsck_cache_tree and
>> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
>>
>> This will enable very large numbers of objects to be omitted from the
>> clone without triggering a download of the corresponding number of
>> promised objects.
>
> Eventually I plan to use the size when implementing parameters for
> history-searching commands (e.g. "git log -S"), but it's true that
> that's in the future.
>
> Allowing promised objects to be optional would indeed solve the issue of
> downloading too many promises. It would make the code more complicated,
> but I'm not sure by how much.
>
> For example, in this fsck patch, the easiest way I could think of to
> have promised objects was to introduce a 3rd state, called "promised",
> of "struct object" - one in which the type is known, but we don't have
> access to the full "struct commit" or equivalent. And thus fsck could
> assume that if the "struct object" is "parsed" or "promised", the type
> is known. Having optional promised objects would require that we let
> this "promised" state have a type of OBJ_UNKNOWN (or something like
> that) - maybe that would be fine, but I haven't looked into this in
> detail.
>
Caveats apply as I only did a quick look but I only found the two
locations that were checking the object type for consistency.
>>> A repository that is missing an object but has that object promised is not
>>> considered to be in error, so also teach fsck this. As part of doing
>>> this, object.{h,c} has been modified to generate "struct object" based
>>> on only the information available to promised objects, without requiring
>>> the object itself.
>>
>> In your work on this, did you investigate if there are other commands
>> (ie repack/gc) that will need to learn about promised objects? Have you
>> had a chance (or have plans) to hack up the test suite so that it runs
>> all tests with promised objects and see what (if anything) breaks?
>
> In one of the subsequent patches, I tried to ensure that all
> object-reading functions in sha1_file.c somewhat works (albeit slowly)
> in the presence of promised objects - that would cover the functionality
> of the other commands. As for hacking up the test suite to run with
> promised objects, that would be ideal, but I haven't figured out how to
> do that yet.
>
^ permalink raw reply [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
2017-07-21 16:24 ` Ben Peart
@ 2017-07-21 20:33 ` Jonathan Tan
2017-07-25 15:10 ` Ben Peart
0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Tan @ 2017-07-21 20:33 UTC (permalink / raw)
To: Ben Peart; +Cc: git, jrnieder, sbeller, git, philipoakley
On Fri, 21 Jul 2017 12:24:52 -0400
Ben Peart <peartben@gmail.com> wrote:
> Today we have 3.5 million objects * 30 bytes per entry = 105 MB of
> promises. Given the average developer only hydrates 56K files (2 MB
> promises) that is 103 MB to download that no one will ever need. We
> would like to avoid that if possible as this would be a significant
> regression in clone times from where we are today.
>
> I'm also concerned about the performance of merging in promises given we
> have 100M objects today and growing so the number of promises over time
> could get pretty large.
After some thought, maybe a hybrid solution is best, in which it is
permissible but optional for some missing objects to have promises. In
that case, it is more of a "size cache" (which stores the type as well)
rather than a true promise. When fetching, the client can optionally
request for the sizes and types of missing objects.
This is good for the large-blob case, in which we can always have size
information of missing blobs, and we can subsequently add blob-size
filtering (as a parameter) to "git log -S" and friends to avoid needing
to resolve a missing object. And this is, as far as I can tell, also
good for the many-blob case - just have an empty size cache all the
time. (And in the future, use cases could come up that desire non-empty
but non-comprehensive caches - for example, a directory lister working
on a partial clone that only needs to cache the sizes of frequently
accessed directories.)
Another option is to have a repo-wide option that toggles between
mandatory entries in the "size cache" and prohibited entries. Switching
to mandatory provides stricter fsck and negative lookups, but I think
it's not worth it for both the developers and users of Git to have to
know about these two modes.
> >> I think we should have a flag (off by default) that enables someone to
> >> say that promised objects are optional. If the flag is set,
> >> "is_promised_object" will return success and pass the OBJ_ANY type and a
> >> size of -1.
> >>
> >> Nothing today is using the size and in the two places where the object
> >> type is being checked for consistency (fsck_cache_tree and
> >> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
> >>
> >> This will enable very large numbers of objects to be omitted from the
> >> clone without triggering a download of the corresponding number of
> >> promised objects.
> >
> > Eventually I plan to use the size when implementing parameters for
> > history-searching commands (e.g. "git log -S"), but it's true that
> > that's in the future.
> >
> > Allowing promised objects to be optional would indeed solve the issue of
> > downloading too many promises. It would make the code more complicated,
> > but I'm not sure by how much.
> >
> > For example, in this fsck patch, the easiest way I could think of to
> > have promised objects was to introduce a 3rd state, called "promised",
> > of "struct object" - one in which the type is known, but we don't have
> > access to the full "struct commit" or equivalent. And thus fsck could
> > assume that if the "struct object" is "parsed" or "promised", the type
> > is known. Having optional promised objects would require that we let
> > this "promised" state have a type of OBJ_UNKNOWN (or something like
> > that) - maybe that would be fine, but I haven't looked into this in
> > detail.
> >
>
> Caveats apply as I only did a quick look but I only found the two
> locations that were checking the object type for consistency.
I haven't looked into detail, but you are probably right.
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
2017-07-21 20:33 ` Jonathan Tan
@ 2017-07-25 15:10 ` Ben Peart
2017-07-29 13:26 ` Philip Oakley
0 siblings, 1 reply; 45+ messages in thread
From: Ben Peart @ 2017-07-25 15:10 UTC (permalink / raw)
To: Jonathan Tan; +Cc: git, jrnieder, sbeller, git, philipoakley
On 7/21/2017 4:33 PM, Jonathan Tan wrote:
> On Fri, 21 Jul 2017 12:24:52 -0400
> Ben Peart <peartben@gmail.com> wrote:
>
>> Today we have 3.5 million objects * 30 bytes per entry = 105 MB of
>> promises. Given the average developer only hydrates 56K files (2 MB
>> promises) that is 103 MB to download that no one will ever need. We
>> would like to avoid that if possible as this would be a significant
>> regression in clone times from where we are today.
>>
>> I'm also concerned about the performance of merging in promises given we
>> have 100M objects today and growing so the number of promises over time
>> could get pretty large.
>
> After some thought, maybe a hybrid solution is best, in which it is
> permissible but optional for some missing objects to have promises. In
> that case, it is more of a "size cache" (which stores the type as well)
> rather than a true promise. When fetching, the client can optionally
> request for the sizes and types of missing objects.
>
In our GVFS solution today we do not download any size or object type
information at clone as the number of objects and the resulting file
would be too large. Instead, we have a new sizes endpoint
(https://github.com/Microsoft/GVFS/blob/master/Protocol.md) that enables
us to retrieve object sizes "on demand" much like we are enabling for
the actual object content.
This protocol could easily be extended to return both size and type so
that it could be used to retrieve "promise" data for objects as they are
needed. Having a way to "cache" that data locally so that both git and
other code could share it would be great.
At a minimum, we should ensure the data stream passed back is the same
whether at clone time or when hitting a "promises" end point. I think it
would also be helpful to enable promises to be downloaded on demand much
like we are doing for the object data itself.
> This is good for the large-blob case, in which we can always have size
> information of missing blobs, and we can subsequently add blob-size
> filtering (as a parameter) to "git log -S" and friends to avoid needing
> to resolve a missing object. And this is, as far as I can tell, also
> good for the many-blob case - just have an empty size cache all the
> time. (And in the future, use cases could come up that desire non-empty
> but non-comprehensive caches - for example, a directory lister working
> on a partial clone that only needs to cache the sizes of frequently
> accessed directories.)
>
> Another option is to have a repo-wide option that toggles between
> mandatory entries in the "size cache" and prohibited entries. Switching
> to mandatory provides stricter fsck and negative lookups, but I think
> it's not worth it for both the developers and users of Git to have to
> know about these two modes.
>
>>>> I think we should have a flag (off by default) that enables someone to
>>>> say that promised objects are optional. If the flag is set,
>>>> "is_promised_object" will return success and pass the OBJ_ANY type and a
>>>> size of -1.
>>>>
>>>> Nothing today is using the size and in the two places where the object
>>>> type is being checked for consistency (fsck_cache_tree and
>>>> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
>>>>
>>>> This will enable very large numbers of objects to be omitted from the
>>>> clone without triggering a download of the corresponding number of
>>>> promised objects.
>>>
>>> Eventually I plan to use the size when implementing parameters for
>>> history-searching commands (e.g. "git log -S"), but it's true that
>>> that's in the future.
>>>
>>> Allowing promised objects to be optional would indeed solve the issue of
>>> downloading too many promises. It would make the code more complicated,
>>> but I'm not sure by how much.
>>>
>>> For example, in this fsck patch, the easiest way I could think of to
>>> have promised objects was to introduce a 3rd state, called "promised",
>>> of "struct object" - one in which the type is known, but we don't have
>>> access to the full "struct commit" or equivalent. And thus fsck could
>>> assume that if the "struct object" is "parsed" or "promised", the type
>>> is known. Having optional promised objects would require that we let
>>> this "promised" state have a type of OBJ_UNKNOWN (or something like
>>> that) - maybe that would be fine, but I haven't looked into this in
>>> detail.
>>>
>>
>> Caveats apply as I only did a quick look but I only found the two
>> locations that were checking the object type for consistency.
>
> I haven't looked into detail, but you are probably right.
>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
2017-07-25 15:10 ` Ben Peart
@ 2017-07-29 13:26 ` Philip Oakley
0 siblings, 0 replies; 45+ messages in thread
From: Philip Oakley @ 2017-07-29 13:26 UTC (permalink / raw)
To: Jonathan Tan, Ben Peart; +Cc: git, jrnieder, sbeller, git
From: "Ben Peart" <peartben@gmail.com>
Sent: Tuesday, July 25, 2017 4:10 PM
> On 7/21/2017 4:33 PM, Jonathan Tan wrote:
>> On Fri, 21 Jul 2017 12:24:52 -0400
>> Ben Peart <peartben@gmail.com> wrote:
>>
>>> Today we have 3.5 million objects * 30 bytes per entry = 105 MB of
>>> promises. Given the average developer only hydrates 56K files (2 MB
>>> promises) that is 103 MB to download that no one will ever need. We
>>> would like to avoid that if possible as this would be a significant
>>> regression in clone times from where we are today.
>>>
A question in the broader context of Narrow clones that I'd be interested
in.
How narrow are the tree levels that contain the hydrated files? The question
splits a couple of ways:
A. If one goes to the highest tree that contains all the 56K files, how many
files and sub trees would be in that complete tree (i.e. what fraction of
that 'top' tree is inflated).
A2. Is there some deep/wide metric that indicates how tightly together the
inflated files tend to cluster?
B. If instead we look at just the trees in the paths of those inflated
files, those trees will also reference other trees/blobs that are not
inflated, how big is that list (it would indicate the size of a narrow
repo's object store that holds the oid stubs)
I would quess / expect that the typical inflation only has a few clustered
areas of interest, but it maybe that in such a big reality (*) the inflated
files are actually spread very widely. (*) as per various blog posts saying
there was no realistic way of partitioning the BigWin repo!
I'd be interested in any such sparsity metric (apologies if I've missed
previous reports).
--
Philip
>>> I'm also concerned about the performance of merging in promises given we
>>> have 100M objects today and growing so the number of promises over time
>>> could get pretty large.
>>
>> After some thought, maybe a hybrid solution is best, in which it is
>> permissible but optional for some missing objects to have promises. In
>> that case, it is more of a "size cache" (which stores the type as well)
>> rather than a true promise. When fetching, the client can optionally
>> request for the sizes and types of missing objects.
>>
>
> In our GVFS solution today we do not download any size or object type
> information at clone as the number of objects and the resulting file would
> be too large. Instead, we have a new sizes endpoint
> (https://github.com/Microsoft/GVFS/blob/master/Protocol.md) that enables
> us to retrieve object sizes "on demand" much like we are enabling for the
> actual object content.
>
> This protocol could easily be extended to return both size and type so
> that it could be used to retrieve "promise" data for objects as they are
> needed. Having a way to "cache" that data locally so that both git and
> other code could share it would be great.
>
> At a minimum, we should ensure the data stream passed back is the same
> whether at clone time or when hitting a "promises" end point. I think it
> would also be helpful to enable promises to be downloaded on demand much
> like we are doing for the object data itself.
>
>> This is good for the large-blob case, in which we can always have size
>> information of missing blobs, and we can subsequently add blob-size
>> filtering (as a parameter) to "git log -S" and friends to avoid needing
>> to resolve a missing object. And this is, as far as I can tell, also
>> good for the many-blob case - just have an empty size cache all the
>> time. (And in the future, use cases could come up that desire non-empty
>> but non-comprehensive caches - for example, a directory lister working
>> on a partial clone that only needs to cache the sizes of frequently
>> accessed directories.)
>>
>> Another option is to have a repo-wide option that toggles between
>> mandatory entries in the "size cache" and prohibited entries. Switching
>> to mandatory provides stricter fsck and negative lookups, but I think
>> it's not worth it for both the developers and users of Git to have to
>> know about these two modes.
>>
>>>>> I think we should have a flag (off by default) that enables someone to
>>>>> say that promised objects are optional. If the flag is set,
>>>>> "is_promised_object" will return success and pass the OBJ_ANY type and
>>>>> a
>>>>> size of -1.
>>>>>
>>>>> Nothing today is using the size and in the two places where the object
>>>>> type is being checked for consistency (fsck_cache_tree and
>>>>> fsck_handle_ref) the test can add a test for OBJ_ANY as well.
>>>>>
>>>>> This will enable very large numbers of objects to be omitted from the
>>>>> clone without triggering a download of the corresponding number of
>>>>> promised objects.
>>>>
>>>> Eventually I plan to use the size when implementing parameters for
>>>> history-searching commands (e.g. "git log -S"), but it's true that
>>>> that's in the future.
>>>>
>>>> Allowing promised objects to be optional would indeed solve the issue
>>>> of
>>>> downloading too many promises. It would make the code more complicated,
>>>> but I'm not sure by how much.
>>>>
>>>> For example, in this fsck patch, the easiest way I could think of to
>>>> have promised objects was to introduce a 3rd state, called "promised",
>>>> of "struct object" - one in which the type is known, but we don't have
>>>> access to the full "struct commit" or equivalent. And thus fsck could
>>>> assume that if the "struct object" is "parsed" or "promised", the type
>>>> is known. Having optional promised objects would require that we let
>>>> this "promised" state have a type of OBJ_UNKNOWN (or something like
>>>> that) - maybe that would be fine, but I haven't looked into this in
>>>> detail.
>>>>
>>>
>>> Caveats apply as I only did a quick look but I only found the two
>>> locations that were checking the object type for consistency.
>>
>> I haven't looked into detail, but you are probably right.
>>
^ permalink raw reply [flat|nested] 45+ messages in thread
* [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash
2017-07-20 0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
2017-07-20 0:21 ` [RFC PATCH v2 1/4] object: remove "used" field from struct object Jonathan Tan
2017-07-20 0:21 ` [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects Jonathan Tan
@ 2017-07-20 0:21 ` Jonathan Tan
2017-07-20 0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
3 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2017-07-20 0:21 UTC (permalink / raw)
To: git; +Cc: Jonathan Tan, jrnieder, sbeller, git, peartben, philipoakley
In a subsequent patch, sha1_file will need to append object names in the
form of "unsigned char *" to oid arrays. Teach sha1-array support for
that.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
sha1-array.c | 7 +++++++
sha1-array.h | 1 +
2 files changed, 8 insertions(+)
diff --git a/sha1-array.c b/sha1-array.c
index 838b3bf84..6e0e35391 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -9,6 +9,13 @@ void oid_array_append(struct oid_array *array, const struct object_id *oid)
array->sorted = 0;
}
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1)
+{
+ ALLOC_GROW(array->oid, array->nr + 1, array->alloc);
+ hashcpy(array->oid[array->nr++].hash, sha1);
+ array->sorted = 0;
+}
+
static int void_hashcmp(const void *a, const void *b)
{
return oidcmp(a, b);
diff --git a/sha1-array.h b/sha1-array.h
index 04b075633..3479959e4 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -11,6 +11,7 @@ struct oid_array {
#define OID_ARRAY_INIT { NULL, 0, 0, 0 }
void oid_array_append(struct oid_array *array, const struct object_id *oid);
+void oid_array_append_sha1(struct oid_array *array, const unsigned char *sha1);
int oid_array_lookup(struct oid_array *array, const struct object_id *oid);
void oid_array_clear(struct oid_array *array);
--
2.14.0.rc0.284.gd933b75aa4-goog
^ permalink raw reply related [flat|nested] 45+ messages in thread* [RFC PATCH v2 4/4] sha1_file: support promised object hook
2017-07-20 0:21 ` [RFC PATCH v2 0/4] Partial clone: promised objects (not only blobs) Jonathan Tan
` (2 preceding siblings ...)
2017-07-20 0:21 ` [RFC PATCH v2 3/4] sha1-array: support appending unsigned char hash Jonathan Tan
@ 2017-07-20 0:21 ` Jonathan Tan
2017-07-20 18:23 ` Stefan Beller
3 siblings, 1 reply; 45+ messages in thread
From: Jonathan Tan @ 2017-07-20 0:21 UTC (permalink / raw)
To: git; +Cc: Jonathan Tan, jrnieder, sbeller, git, peartben, philipoakley
Teach sha1_file to invoke a hook whenever an object is requested and
unavailable but is promised. The hook is a shell command that can be
configured through "git config"; this hook takes in a list of hashes and
writes (if successful) the corresponding objects to the repo's local
storage.
The usage of the hook can be suppressed through a flag when invoking
has_object_file_with_flags() and other similar functions.
parse_or_promise_object() in object.c requires this functionality, and
has been modified to use it.
This is meant as a temporary measure to ensure that all Git commands
work in such a situation. Future patches will update some commands to
either tolerate promised objects (without invoking the hook) or be more
efficient in invoking the promised objects hook.
In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
(1) functions in sha1_file that take in a hash, without the user
regarding how the object is stored (loose or packed)
(2) functions in sha1_file that operate on packed objects (because I
need to check callers that know about the loose/packed distinction
and operate on both differently, and ensure that they can handle
the concept of objects that are neither loose nor packed)
(1) is handled by the modification to sha1_object_info_extended().
For (2), I looked at for_each_packed_object and at the packed-related
functions that take in a hash. For for_each_packed_object, the callers
either already work or are fixed in this patch:
- reachable - only to find recent objects
- builtin/fsck - already knows about promised objects
- builtin/cat-file - fixed in this commit
Callers of the other functions do not need to be changed:
- parse_pack_index
- http - indirectly from http_get_info_packs
- find_pack_entry_one
- this searches a single pack that is provided as an argument; the
caller already knows (through other means) that the sought object
is in a specific pack
- find_sha1_pack
- fast-import - appears to be an optimization to not store a
file if it is already in a pack
- http-walker - to search through a struct alt_base
- http-push - to search through remote packs
- has_sha1_pack
- builtin/fsck - already knows about promised objects
- builtin/count-objects - informational purposes only (check if loose
object is also packed)
- builtin/prune-packed - check if object to be pruned is packed (if
not, don't prune it)
- revision - used to exclude packed objects if requested by user
- diff - just for optimization
An alternative design that I considered but rejected:
- Adding a hook whenever a packed object is requested, not on any
object. That is, whenever we attempt to search the packfiles for an
object, if it is missing (from the packfiles and from the loose
object storage), to invoke the hook (which must then store it as a
packfile), open the packfile the hook generated, and report that the
object is found in that new packfile. This reduces the amount of
analysis needed (in that we only need to look at how packed objects
are handled), but requires that the hook generate packfiles (or for
sha1_file to pack whatever loose objects are generated), creating one
packfile for each missing object and potentially very many packfiles
that must be linearly searched. This may be tolerable now for repos
that only have a few missing objects (for example, repos that only
want to exclude large blobs), and might be tolerable in the future if
we have batching support for the most commonly used commands, but is
not tolerable now for repos that exclude a large amount of objects.
Helped-by: Ben Peart <benpeart@microsoft.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Documentation/config.txt | 8 +
Documentation/gitrepository-layout.txt | 8 +
Documentation/technical/read-object-protocol.txt | 102 ++++++++++++
builtin/cat-file.c | 9 ++
cache.h | 2 +
object.c | 3 +-
promised-object.c | 194 +++++++++++++++++++++++
promised-object.h | 12 ++
sha1_file.c | 44 +++--
t/t3907-promised-object.sh | 32 ++++
t/t3907/read-object | 114 +++++++++++++
11 files changed, 513 insertions(+), 15 deletions(-)
create mode 100644 Documentation/technical/read-object-protocol.txt
create mode 100755 t/t3907/read-object
diff --git a/Documentation/config.txt b/Documentation/config.txt
index d5c9c4cab..c293ac921 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
will probe and set core.ignoreCase true if appropriate when the repository
is created.
+core.promisedBlobCommand::
+ If set, whenever a blob in the local repo is attempted to be read, but
+ is both missing and a promised blob, invoke this shell command to
+ generate or obtain that blob before reporting an error. This shell
+ command should take one or more hashes, each terminated by a newline,
+ as standard input, and (if successful) should write the corresponding
+ objects to the local repo (packed or loose).
+
core.precomposeUnicode::
This option is only used by Mac OS implementation of Git.
When core.precomposeUnicode=true, Git reverts the unicode decomposition
diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
index f51ed4e37..7dea7fe6b 100644
--- a/Documentation/gitrepository-layout.txt
+++ b/Documentation/gitrepository-layout.txt
@@ -47,6 +47,10 @@ use with dumb transports but otherwise is OK as long as
`objects/info/alternates` points at the object stores it
borrows from.
+
+. You could have objects that are merely promised by another source.
+When Git requires those objects, it will invoke the command in the
+`extensions.promisedObjects` configuration variable.
++
This directory is ignored if $GIT_COMMON_DIR is set and
"$GIT_COMMON_DIR/objects" will be used instead.
@@ -91,6 +95,10 @@ objects/info/http-alternates::
this object store borrows objects from, to be used when
the repository is fetched over HTTP.
+objects/promised::
+ This file records the sha1 object names, types, and sizes of
+ promised objects.
+
refs::
References are stored in subdirectories of this
directory. The 'git prune' command knows to preserve
diff --git a/Documentation/technical/read-object-protocol.txt b/Documentation/technical/read-object-protocol.txt
new file mode 100644
index 000000000..a893b46e7
--- /dev/null
+++ b/Documentation/technical/read-object-protocol.txt
@@ -0,0 +1,102 @@
+Read Object Process
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The read-object process enables Git to read all missing blobs with a
+single process invocation for the entire life of a single Git command.
+This is achieved by using a packet format (pkt-line, see technical/
+protocol-common.txt) based protocol over standard input and standard
+output as follows. All packets, except for the "*CONTENT" packets and
+the "0000" flush packet, are considered text and therefore are
+terminated by a LF.
+
+Git starts the process when it encounters the first missing object that
+needs to be retrieved. After the process is started, Git sends a welcome
+message ("git-read-object-client"), a list of supported protocol version
+numbers, and a flush packet. Git expects to read a welcome response
+message ("git-read-object-server"), exactly one protocol version number
+from the previously sent list, and a flush packet. All further
+communication will be based on the selected version.
+
+The remaining protocol description below documents "version=1". Please
+note that "version=42" in the example below does not exist and is only
+there to illustrate how the protocol would look with more than one
+version.
+
+After the version negotiation Git sends a list of all capabilities that
+it supports and a flush packet. Git expects to read a list of desired
+capabilities, which must be a subset of the supported capabilities list,
+and a flush packet as response:
+------------------------
+packet: git> git-read-object-client
+packet: git> version=1
+packet: git> version=42
+packet: git> 0000
+packet: git< git-read-object-server
+packet: git< version=1
+packet: git< 0000
+packet: git> capability=get
+packet: git> capability=have
+packet: git> capability=put
+packet: git> capability=not-yet-invented
+packet: git> 0000
+packet: git< capability=get
+packet: git< 0000
+------------------------
+The only supported capability in version 1 is "get".
+
+Afterwards Git sends a list of "key=value" pairs terminated with a flush
+packet. The list will contain at least the command (based on the
+supported capabilities) and the sha1 of the object to retrieve. Please
+note, that the process must not send any response before it received the
+final flush packet.
+
+When the process receives the "get" command, it should make the requested
+object available in the git object store and then return success. Git will
+then check the object store again and this time find it and proceed.
+------------------------
+packet: git> command=get
+packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
+packet: git> 0000
+------------------------
+
+The process is expected to respond with a list of "key=value" pairs
+terminated with a flush packet. If the process does not experience
+problems then the list must contain a "success" status.
+------------------------
+packet: git< status=success
+packet: git< 0000
+------------------------
+
+In case the process cannot or does not want to process the content, it
+is expected to respond with an "error" status.
+------------------------
+packet: git< status=error
+packet: git< 0000
+------------------------
+
+In case the process cannot or does not want to process the content as
+well as any future content for the lifetime of the Git process, then it
+is expected to respond with an "abort" status at any point in the
+protocol.
+------------------------
+packet: git< status=abort
+packet: git< 0000
+------------------------
+
+Git neither stops nor restarts the process in case the "error"/"abort"
+status is set.
+
+If the process dies during the communication or does not adhere to the
+protocol then Git will stop the process and restart it with the next
+object that needs to be processed.
+
+After the read-object process has processed an object it is expected to
+wait for the next "key=value" list containing a command. Git will close
+the command pipe on exit. The process is expected to detect EOF and exit
+gracefully on its own. Git will wait until the process has stopped.
+
+A long running read-object process demo implementation can be found in
+`contrib/long-running-read-object/example.pl` located in the Git core
+repository. If you develop your own long running process then the
+`GIT_TRACE_PACKET` environment variables can be very helpful for
+debugging (see linkgit:git[1]).
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 96b786e48..33f636926 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -12,6 +12,7 @@
#include "streaming.h"
#include "tree-walk.h"
#include "sha1-array.h"
+#include "promised-object.h"
struct batch_options {
int enabled;
@@ -432,6 +433,13 @@ static int batch_packed_object(const struct object_id *oid,
return 0;
}
+static int batch_promised_object(const struct object_id *oid,
+ void *data)
+{
+ oid_array_append(data, oid);
+ return 0;
+}
+
static int batch_objects(struct batch_options *opt)
{
struct strbuf buf = STRBUF_INIT;
@@ -473,6 +481,7 @@ static int batch_objects(struct batch_options *opt)
for_each_loose_object(batch_loose_object, &sa, 0);
for_each_packed_object(batch_packed_object, &sa, 0);
+ for_each_promised_object(batch_promised_object, &sa);
cb.opt = opt;
cb.expand = &data;
diff --git a/cache.h b/cache.h
index dd94b5ffc..75b71f38b 100644
--- a/cache.h
+++ b/cache.h
@@ -1835,6 +1835,8 @@ struct object_info {
#define OBJECT_INFO_SKIP_CACHED 4
/* Do not retry packed storage after checking packed and loose storage */
#define OBJECT_INFO_QUICK 8
+/* Ignore list of promised objects */
+#define OBJECT_INFO_IGNORE_PROMISES 16
extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
diff --git a/object.c b/object.c
index 0aeb95084..23f2a6cbc 100644
--- a/object.c
+++ b/object.c
@@ -285,7 +285,8 @@ struct object *parse_or_promise_object(const struct object_id *oid)
{
enum object_type type;
- if (has_object_file(oid))
+ if (has_object_file_with_flags(oid, OBJECT_INFO_SKIP_CACHED |
+ OBJECT_INFO_IGNORE_PROMISES))
return parse_object(oid);
if (is_promised_object(oid, &type, NULL)) {
diff --git a/promised-object.c b/promised-object.c
index 487ade437..d8d95ebb2 100644
--- a/promised-object.c
+++ b/promised-object.c
@@ -2,6 +2,12 @@
#include "promised-object.h"
#include "sha1-lookup.h"
#include "strbuf.h"
+#include "run-command.h"
+#include "sha1-array.h"
+#include "config.h"
+#include "sigchain.h"
+#include "sub-process.h"
+#include "pkt-line.h"
#define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
/*
@@ -128,3 +134,191 @@ int fsck_promised_objects(void)
}
return 0;
}
+
+#define CAP_GET (1u<<0)
+
+static int subprocess_map_initialized;
+static struct hashmap subprocess_map;
+
+struct read_object_process {
+ struct subprocess_entry subprocess;
+ unsigned int supported_capabilities;
+};
+
+int start_read_object_fn(struct subprocess_entry *subprocess)
+{
+ int err;
+ struct read_object_process *entry = (struct read_object_process *)subprocess;
+ struct child_process *process;
+ struct string_list cap_list = STRING_LIST_INIT_NODUP;
+ char *cap_buf;
+ const char *cap_name;
+
+ process = subprocess_get_child_process(&entry->subprocess);
+
+ sigchain_push(SIGPIPE, SIG_IGN);
+
+ err = packet_writel(process->in, "git-read-object-client", "version=1", NULL);
+ if (err)
+ goto done;
+
+ err = strcmp(packet_read_line(process->out, NULL), "git-read-object-server");
+ if (err) {
+ error("external process '%s' does not support read-object protocol version 1", subprocess->cmd);
+ goto done;
+ }
+ err = strcmp(packet_read_line(process->out, NULL), "version=1");
+ if (err)
+ goto done;
+ err = packet_read_line(process->out, NULL) != NULL;
+ if (err)
+ goto done;
+
+ err = packet_writel(process->in, "capability=get", NULL);
+ if (err)
+ goto done;
+
+ for (;;) {
+ cap_buf = packet_read_line(process->out, NULL);
+ if (!cap_buf)
+ break;
+ string_list_split_in_place(&cap_list, cap_buf, '=', 1);
+
+ if (cap_list.nr != 2 || strcmp(cap_list.items[0].string, "capability"))
+ continue;
+
+ cap_name = cap_list.items[1].string;
+ if (!strcmp(cap_name, "get")) {
+ entry->supported_capabilities |= CAP_GET;
+ }
+ else {
+ warning(
+ "external process '%s' requested unsupported read-object capability '%s'",
+ subprocess->cmd, cap_name
+ );
+ }
+
+ string_list_clear(&cap_list, 0);
+ }
+
+done:
+ sigchain_pop(SIGPIPE);
+
+ if (err || errno == EPIPE)
+ return err ? err : errno;
+
+ return 0;
+}
+
+static int read_object_process(const unsigned char *sha1)
+{
+ int err;
+ struct read_object_process *entry;
+ struct child_process *process;
+ struct strbuf status = STRBUF_INIT;
+ uint64_t start;
+
+ start = getnanotime();
+
+ if (!repository_format_promised_objects)
+ die("BUG: if extensions.promisedObjects is not set, there "
+ "should not be any promised objects");
+
+ if (!subprocess_map_initialized) {
+ subprocess_map_initialized = 1;
+ hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp, 0);
+ entry = NULL;
+ } else {
+ entry = (struct read_object_process *)subprocess_find_entry(&subprocess_map, repository_format_promised_objects);
+ }
+ if (!entry) {
+ entry = xmalloc(sizeof(*entry));
+ entry->supported_capabilities = 0;
+
+ if (subprocess_start(&subprocess_map, &entry->subprocess, repository_format_promised_objects, start_read_object_fn)) {
+ free(entry);
+ return -1;
+ }
+ }
+ process = subprocess_get_child_process(&entry->subprocess);
+
+ if (!(CAP_GET & entry->supported_capabilities))
+ return -1;
+
+ sigchain_push(SIGPIPE, SIG_IGN);
+
+ err = packet_write_fmt_gently(process->in, "command=get\n");
+ if (err)
+ goto done;
+
+ err = packet_write_fmt_gently(process->in, "sha1=%s\n", sha1_to_hex(sha1));
+ if (err)
+ goto done;
+
+ err = packet_flush_gently(process->in);
+ if (err)
+ goto done;
+
+ err = subprocess_read_status(process->out, &status);
+ err = err ? err : strcmp(status.buf, "success");
+
+done:
+ sigchain_pop(SIGPIPE);
+
+ if (err || errno == EPIPE) {
+ err = err ? err : errno;
+ if (!strcmp(status.buf, "error")) {
+ /* The process signaled a problem with the file. */
+ }
+ else if (!strcmp(status.buf, "abort")) {
+ /*
+ * The process signaled a permanent problem. Don't try to read
+ * objects with the same command for the lifetime of the current
+ * Git process.
+ */
+ entry->supported_capabilities &= ~CAP_GET;
+ }
+ else {
+ /*
+ * Something went wrong with the read-object process.
+ * Force shutdown and restart if needed.
+ */
+ error("external process '%s' failed", repository_format_promised_objects);
+ subprocess_stop(&subprocess_map, (struct subprocess_entry *)entry);
+ free(entry);
+ }
+ }
+
+ trace_performance_since(start, "read_object_process");
+
+ return err;
+}
+
+int request_promised_objects(const struct oid_array *oids)
+{
+ int oids_requested = 0;
+ int i;
+
+ for (i = 0; i < oids->nr; i++) {
+ if (is_promised_object(&oids->oid[i], NULL, NULL))
+ break;
+ }
+
+ if (i == oids->nr)
+ /* Nothing to fetch */
+ return 0;
+
+ for (; i < oids->nr; i++) {
+ if (is_promised_object(&oids->oid[i], NULL, NULL)) {
+ read_object_process(oids->oid[i].hash);
+ oids_requested++;
+ }
+ }
+
+ /*
+ * The command above may have updated packfiles, so update our record
+ * of them.
+ */
+ reprepare_packed_git();
+ return oids_requested;
+}
diff --git a/promised-object.h b/promised-object.h
index 7eaedff17..8ad47aa4c 100644
--- a/promised-object.h
+++ b/promised-object.h
@@ -2,6 +2,7 @@
#define PROMISED_OBJECT_H
#include "cache.h"
+#include "sha1-array.h"
/*
* Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
@@ -19,4 +20,15 @@ int for_each_promised_object(each_promised_object_fn, void *);
*/
int fsck_promised_objects(void);
+/*
+ * If any of the given objects are promised objects, invokes
+ * core.promisedobjectcommand with those objects and returns the number of
+ * objects requested. No check is made as to whether the invocation actually
+ * populated the repository with the promised objects.
+ *
+ * If none of the given objects are promised objects, this function does not
+ * invoke anything and returns 0.
+ */
+int request_promised_objects(const struct oid_array *oids);
+
#endif
diff --git a/sha1_file.c b/sha1_file.c
index 5862386cd..ded0ef46b 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -28,6 +28,11 @@
#include "list.h"
#include "mergesort.h"
#include "quote.h"
+#include "iterator.h"
+#include "dir-iterator.h"
+#include "sha1-lookup.h"
+#include "promised-object.h"
+#include "sha1-array.h"
#define SZ_FMT PRIuMAX
static inline uintmax_t sz_fmt(size_t s) { return s; }
@@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
lookup_replace_object(sha1) :
sha1;
+ int already_retried = 0;
if (!oi)
oi = &blank_oi;
@@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
}
}
- if (!find_pack_entry(real, &e)) {
- /* Most likely it's a loose object. */
- if (!sha1_loose_object_info(real, oi, flags)) {
- oi->whence = OI_LOOSE;
- return 0;
- }
+retry:
+ if (find_pack_entry(real, &e))
+ goto found_packed;
- /* Not a loose object; someone else may have just packed it. */
- if (flags & OBJECT_INFO_QUICK) {
- return -1;
- } else {
- reprepare_packed_git();
- if (!find_pack_entry(real, &e))
- return -1;
+ /* Most likely it's a loose object. */
+ if (!sha1_loose_object_info(real, oi, flags)) {
+ oi->whence = OI_LOOSE;
+ return 0;
+ }
+
+ /* Not a loose object; someone else may have just packed it. */
+ reprepare_packed_git();
+ if (find_pack_entry(real, &e))
+ goto found_packed;
+
+ /* Check if it is a promised blob */
+ if (!already_retried && !(flags & OBJECT_INFO_IGNORE_PROMISES)) {
+ struct oid_array promised = OID_ARRAY_INIT;
+ oid_array_append_sha1(&promised, real);
+ if (request_promised_objects(&promised)) {
+ already_retried = 1;
+ goto retry;
}
}
+ return -1;
+
+found_packed:
if (oi == &blank_oi)
/*
* We know that the caller doesn't actually need the
* information below, so return early.
*/
return 0;
-
rtype = packed_object_info(e.p, e.offset, oi);
if (rtype < 0) {
mark_bad_packed_object(e.p, real);
diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
index 3e0caf4f9..d9e6a6486 100755
--- a/t/t3907-promised-object.sh
+++ b/t/t3907-promised-object.sh
@@ -38,4 +38,36 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
unset GIT_IGNORE_PROMISED_OBJECTS
'
+test_expect_success 'sha1_object_info_extended (through git cat-file)' '
+ test_create_repo server &&
+ test_commit -C server 1 1.t abcdefgh &&
+ HASH=$(git hash-object server/1.t) &&
+
+ test_create_repo client &&
+ test_must_fail git -C client cat-file -p "$HASH"
+'
+
+test_expect_success '...succeeds if it is a promised object' '
+ printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
+ hex_pack >client/.git/objects/promised &&
+ git -C client config core.repositoryformatversion 1 &&
+ git -C client config extensions.promisedobjects \
+ "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
+ git -C client cat-file -p "$HASH"
+'
+
+test_expect_success 'cat-file --batch-all-objects with promised objects' '
+ rm -rf client &&
+ test_create_repo client &&
+ git -C client config core.repositoryformatversion 1 &&
+ git -C client config extensions.promisedobjects \
+ "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
+ printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
+ hex_pack >client/.git/objects/promised &&
+
+ # Verify that the promised object is printed
+ git -C client cat-file --batch --batch-all-objects | tee out |
+ grep abcdefgh
+'
+
test_done
diff --git a/t/t3907/read-object b/t/t3907/read-object
new file mode 100755
index 000000000..9666ad597
--- /dev/null
+++ b/t/t3907/read-object
@@ -0,0 +1,114 @@
+#!/usr/bin/perl
+#
+# Example implementation for the Git read-object protocol version 1
+# See Documentation/technical/read-object-protocol.txt
+#
+# Allows you to test the ability for blobs to be pulled from a host git repo
+# "on demand." Called when git needs a blob it couldn't find locally due to
+# a lazy clone that only cloned the commits and trees.
+#
+# A lazy clone can be simulated via the following commands from the host repo
+# you wish to create a lazy clone of:
+#
+# cd /host_repo
+# git rev-parse HEAD
+# git init /guest_repo
+# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
+# cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
+# cd /guest_repo
+# git config core.virtualizeobjects true
+# git reset --hard <sha from rev-parse call above>
+#
+# Please note, this sample is a minimal skeleton. No proper error handling
+# was implemented.
+#
+
+use strict;
+use warnings;
+
+#
+# Point $DIR to the folder where your host git repo is located so we can pull
+# missing objects from it
+#
+my $DIR = $ARGV[0];
+
+sub packet_bin_read {
+ my $buffer;
+ my $bytes_read = read STDIN, $buffer, 4;
+ if ( $bytes_read == 0 ) {
+
+ # EOF - Git stopped talking to us!
+ exit();
+ }
+ elsif ( $bytes_read != 4 ) {
+ die "invalid packet: '$buffer'";
+ }
+ my $pkt_size = hex($buffer);
+ if ( $pkt_size == 0 ) {
+ return ( 1, "" );
+ }
+ elsif ( $pkt_size > 4 ) {
+ my $content_size = $pkt_size - 4;
+ $bytes_read = read STDIN, $buffer, $content_size;
+ if ( $bytes_read != $content_size ) {
+ die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
+ }
+ return ( 0, $buffer );
+ }
+ else {
+ die "invalid packet size: $pkt_size";
+ }
+}
+
+sub packet_txt_read {
+ my ( $res, $buf ) = packet_bin_read();
+ unless ( $buf =~ s/\n$// ) {
+ die "A non-binary line MUST be terminated by an LF.";
+ }
+ return ( $res, $buf );
+}
+
+sub packet_bin_write {
+ my $buf = shift;
+ print STDOUT sprintf( "%04x", length($buf) + 4 );
+ print STDOUT $buf;
+ STDOUT->flush();
+}
+
+sub packet_txt_write {
+ packet_bin_write( $_[0] . "\n" );
+}
+
+sub packet_flush {
+ print STDOUT sprintf( "%04x", 0 );
+ STDOUT->flush();
+}
+
+( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad initialize";
+( packet_txt_read() eq ( 0, "version=1" ) ) || die "bad version";
+( packet_bin_read() eq ( 1, "" ) ) || die "bad version end";
+
+packet_txt_write("git-read-object-server");
+packet_txt_write("version=1");
+packet_flush();
+
+( packet_txt_read() eq ( 0, "capability=get" ) ) || die "bad capability";
+( packet_bin_read() eq ( 1, "" ) ) || die "bad capability end";
+
+packet_txt_write("capability=get");
+packet_flush();
+
+while (1) {
+ my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
+
+ if ( $command eq "get" ) {
+ my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
+ packet_bin_read();
+
+ system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 2>&1');
+ packet_txt_write(($?) ? "status=error" : "status=success");
+ packet_flush();
+ } else {
+ die "bad command '$command'";
+ }
+}
--
2.14.0.rc0.284.gd933b75aa4-goog
^ permalink raw reply related [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 4/4] sha1_file: support promised object hook
2017-07-20 0:21 ` [RFC PATCH v2 4/4] sha1_file: support promised object hook Jonathan Tan
@ 2017-07-20 18:23 ` Stefan Beller
2017-07-20 20:58 ` Ben Peart
0 siblings, 1 reply; 45+ messages in thread
From: Stefan Beller @ 2017-07-20 18:23 UTC (permalink / raw)
To: Jonathan Tan, Lars Schneider
Cc: git@vger.kernel.org, Jonathan Nieder, Jeff Hostetler, Ben Peart,
Philip Oakley
On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> Teach sha1_file to invoke a hook whenever an object is requested and
> unavailable but is promised. The hook is a shell command that can be
> configured through "git config"; this hook takes in a list of hashes and
> writes (if successful) the corresponding objects to the repo's local
> storage.
>
> The usage of the hook can be suppressed through a flag when invoking
> has_object_file_with_flags() and other similar functions.
> parse_or_promise_object() in object.c requires this functionality, and
> has been modified to use it.
>
> This is meant as a temporary measure to ensure that all Git commands
> work in such a situation. Future patches will update some commands to
> either tolerate promised objects (without invoking the hook) or be more
> efficient in invoking the promised objects hook.
>
> In order to determine the code changes in sha1_file.c necessary, I
> investigated the following:
> (1) functions in sha1_file that take in a hash, without the user
> regarding how the object is stored (loose or packed)
> (2) functions in sha1_file that operate on packed objects (because I
> need to check callers that know about the loose/packed distinction
> and operate on both differently, and ensure that they can handle
> the concept of objects that are neither loose nor packed)
>
> (1) is handled by the modification to sha1_object_info_extended().
>
> For (2), I looked at for_each_packed_object and at the packed-related
> functions that take in a hash. For for_each_packed_object, the callers
> either already work or are fixed in this patch:
> - reachable - only to find recent objects
> - builtin/fsck - already knows about promised objects
> - builtin/cat-file - fixed in this commit
>
> Callers of the other functions do not need to be changed:
> - parse_pack_index
> - http - indirectly from http_get_info_packs
> - find_pack_entry_one
> - this searches a single pack that is provided as an argument; the
> caller already knows (through other means) that the sought object
> is in a specific pack
> - find_sha1_pack
> - fast-import - appears to be an optimization to not store a
> file if it is already in a pack
> - http-walker - to search through a struct alt_base
> - http-push - to search through remote packs
> - has_sha1_pack
> - builtin/fsck - already knows about promised objects
> - builtin/count-objects - informational purposes only (check if loose
> object is also packed)
> - builtin/prune-packed - check if object to be pruned is packed (if
> not, don't prune it)
> - revision - used to exclude packed objects if requested by user
> - diff - just for optimization
>
> An alternative design that I considered but rejected:
>
> - Adding a hook whenever a packed object is requested, not on any
> object. That is, whenever we attempt to search the packfiles for an
> object, if it is missing (from the packfiles and from the loose
> object storage), to invoke the hook (which must then store it as a
> packfile), open the packfile the hook generated, and report that the
> object is found in that new packfile. This reduces the amount of
> analysis needed (in that we only need to look at how packed objects
> are handled), but requires that the hook generate packfiles (or for
> sha1_file to pack whatever loose objects are generated), creating one
> packfile for each missing object and potentially very many packfiles
> that must be linearly searched. This may be tolerable now for repos
> that only have a few missing objects (for example, repos that only
> want to exclude large blobs), and might be tolerable in the future if
> we have batching support for the most commonly used commands, but is
> not tolerable now for repos that exclude a large amount of objects.
>
> Helped-by: Ben Peart <benpeart@microsoft.com>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> Documentation/config.txt | 8 +
> Documentation/gitrepository-layout.txt | 8 +
> Documentation/technical/read-object-protocol.txt | 102 ++++++++++++
> builtin/cat-file.c | 9 ++
> cache.h | 2 +
> object.c | 3 +-
> promised-object.c | 194 +++++++++++++++++++++++
> promised-object.h | 12 ++
> sha1_file.c | 44 +++--
> t/t3907-promised-object.sh | 32 ++++
> t/t3907/read-object | 114 +++++++++++++
> 11 files changed, 513 insertions(+), 15 deletions(-)
> create mode 100644 Documentation/technical/read-object-protocol.txt
> create mode 100755 t/t3907/read-object
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index d5c9c4cab..c293ac921 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
> will probe and set core.ignoreCase true if appropriate when the repository
> is created.
>
> +core.promisedBlobCommand::
> + If set, whenever a blob in the local repo is attempted to be read, but
> + is both missing and a promised blob, invoke this shell command to
> + generate or obtain that blob before reporting an error. This shell
> + command should take one or more hashes, each terminated by a newline,
> + as standard input, and (if successful) should write the corresponding
> + objects to the local repo (packed or loose).
> +
> core.precomposeUnicode::
> This option is only used by Mac OS implementation of Git.
> When core.precomposeUnicode=true, Git reverts the unicode decomposition
> diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
> index f51ed4e37..7dea7fe6b 100644
> --- a/Documentation/gitrepository-layout.txt
> +++ b/Documentation/gitrepository-layout.txt
> @@ -47,6 +47,10 @@ use with dumb transports but otherwise is OK as long as
> `objects/info/alternates` points at the object stores it
> borrows from.
> +
> +. You could have objects that are merely promised by another source.
> +When Git requires those objects, it will invoke the command in the
> +`extensions.promisedObjects` configuration variable.
> ++
> This directory is ignored if $GIT_COMMON_DIR is set and
> "$GIT_COMMON_DIR/objects" will be used instead.
>
> @@ -91,6 +95,10 @@ objects/info/http-alternates::
> this object store borrows objects from, to be used when
> the repository is fetched over HTTP.
>
> +objects/promised::
> + This file records the sha1 object names, types, and sizes of
> + promised objects.
> +
> refs::
> References are stored in subdirectories of this
> directory. The 'git prune' command knows to preserve
> diff --git a/Documentation/technical/read-object-protocol.txt b/Documentation/technical/read-object-protocol.txt
> new file mode 100644
> index 000000000..a893b46e7
> --- /dev/null
> +++ b/Documentation/technical/read-object-protocol.txt
> @@ -0,0 +1,102 @@
> +Read Object Process
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
This protocol reads very similar to the protocol that is used in
clean/smudge filtering, designed by Lars (cc'd)
> +
> +The read-object process enables Git to read all missing blobs with a
> +single process invocation for the entire life of a single Git command.
> +This is achieved by using a packet format (pkt-line, see technical/
> +protocol-common.txt) based protocol over standard input and standard
> +output as follows. All packets, except for the "*CONTENT" packets and
> +the "0000" flush packet, are considered text and therefore are
> +terminated by a LF.
> +
> +Git starts the process when it encounters the first missing object that
> +needs to be retrieved. After the process is started, Git sends a welcome
> +message ("git-read-object-client"), a list of supported protocol version
> +numbers, and a flush packet. Git expects to read a welcome response
> +message ("git-read-object-server"), exactly one protocol version number
> +from the previously sent list, and a flush packet. All further
> +communication will be based on the selected version.
> +
> +The remaining protocol description below documents "version=1". Please
> +note that "version=42" in the example below does not exist and is only
> +there to illustrate how the protocol would look with more than one
> +version.
> +
> +After the version negotiation Git sends a list of all capabilities that
> +it supports and a flush packet. Git expects to read a list of desired
> +capabilities, which must be a subset of the supported capabilities list,
> +and a flush packet as response:
> +------------------------
> +packet: git> git-read-object-client
> +packet: git> version=1
> +packet: git> version=42
> +packet: git> 0000
> +packet: git< git-read-object-server
> +packet: git< version=1
> +packet: git< 0000
> +packet: git> capability=get
> +packet: git> capability=have
> +packet: git> capability=put
> +packet: git> capability=not-yet-invented
> +packet: git> 0000
> +packet: git< capability=get
> +packet: git< 0000
> +------------------------
> +The only supported capability in version 1 is "get".
> +
> +Afterwards Git sends a list of "key=value" pairs terminated with a flush
> +packet. The list will contain at least the command (based on the
> +supported capabilities) and the sha1 of the object to retrieve. Please
> +note, that the process must not send any response before it received the
> +final flush packet.
> +
> +When the process receives the "get" command, it should make the requested
> +object available in the git object store and then return success. Git will
> +then check the object store again and this time find it and proceed.
> +------------------------
> +packet: git> command=get
> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
> +packet: git> 0000
> +------------------------
> +
> +The process is expected to respond with a list of "key=value" pairs
> +terminated with a flush packet. If the process does not experience
> +problems then the list must contain a "success" status.
> +------------------------
> +packet: git< status=success
> +packet: git< 0000
> +------------------------
> +
> +In case the process cannot or does not want to process the content, it
> +is expected to respond with an "error" status.
> +------------------------
> +packet: git< status=error
> +packet: git< 0000
> +------------------------
> +
> +In case the process cannot or does not want to process the content as
> +well as any future content for the lifetime of the Git process, then it
> +is expected to respond with an "abort" status at any point in the
> +protocol.
> +------------------------
> +packet: git< status=abort
> +packet: git< 0000
> +------------------------
> +
> +Git neither stops nor restarts the process in case the "error"/"abort"
> +status is set.
> +
> +If the process dies during the communication or does not adhere to the
> +protocol then Git will stop the process and restart it with the next
> +object that needs to be processed.
> +
> +After the read-object process has processed an object it is expected to
> +wait for the next "key=value" list containing a command. Git will close
> +the command pipe on exit. The process is expected to detect EOF and exit
> +gracefully on its own. Git will wait until the process has stopped.
> +
> +A long running read-object process demo implementation can be found in
> +`contrib/long-running-read-object/example.pl` located in the Git core
> +repository. If you develop your own long running process then the
> +`GIT_TRACE_PACKET` environment variables can be very helpful for
> +debugging (see linkgit:git[1]).
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 96b786e48..33f636926 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -12,6 +12,7 @@
> #include "streaming.h"
> #include "tree-walk.h"
> #include "sha1-array.h"
> +#include "promised-object.h"
>
> struct batch_options {
> int enabled;
> @@ -432,6 +433,13 @@ static int batch_packed_object(const struct object_id *oid,
> return 0;
> }
>
> +static int batch_promised_object(const struct object_id *oid,
> + void *data)
> +{
> + oid_array_append(data, oid);
> + return 0;
> +}
> +
> static int batch_objects(struct batch_options *opt)
> {
> struct strbuf buf = STRBUF_INIT;
> @@ -473,6 +481,7 @@ static int batch_objects(struct batch_options *opt)
>
> for_each_loose_object(batch_loose_object, &sa, 0);
> for_each_packed_object(batch_packed_object, &sa, 0);
> + for_each_promised_object(batch_promised_object, &sa);
>
> cb.opt = opt;
> cb.expand = &data;
> diff --git a/cache.h b/cache.h
> index dd94b5ffc..75b71f38b 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1835,6 +1835,8 @@ struct object_info {
> #define OBJECT_INFO_SKIP_CACHED 4
> /* Do not retry packed storage after checking packed and loose storage */
> #define OBJECT_INFO_QUICK 8
> +/* Ignore list of promised objects */
> +#define OBJECT_INFO_IGNORE_PROMISES 16
> extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
> extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
>
> diff --git a/object.c b/object.c
> index 0aeb95084..23f2a6cbc 100644
> --- a/object.c
> +++ b/object.c
> @@ -285,7 +285,8 @@ struct object *parse_or_promise_object(const struct object_id *oid)
> {
> enum object_type type;
>
> - if (has_object_file(oid))
> + if (has_object_file_with_flags(oid, OBJECT_INFO_SKIP_CACHED |
> + OBJECT_INFO_IGNORE_PROMISES))
> return parse_object(oid);
>
> if (is_promised_object(oid, &type, NULL)) {
> diff --git a/promised-object.c b/promised-object.c
> index 487ade437..d8d95ebb2 100644
> --- a/promised-object.c
> +++ b/promised-object.c
> @@ -2,6 +2,12 @@
> #include "promised-object.h"
> #include "sha1-lookup.h"
> #include "strbuf.h"
> +#include "run-command.h"
> +#include "sha1-array.h"
> +#include "config.h"
> +#include "sigchain.h"
> +#include "sub-process.h"
> +#include "pkt-line.h"
>
> #define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
> /*
> @@ -128,3 +134,191 @@ int fsck_promised_objects(void)
> }
> return 0;
> }
> +
> +#define CAP_GET (1u<<0)
> +
> +static int subprocess_map_initialized;
> +static struct hashmap subprocess_map;
> +
> +struct read_object_process {
> + struct subprocess_entry subprocess;
> + unsigned int supported_capabilities;
> +};
> +
> +int start_read_object_fn(struct subprocess_entry *subprocess)
> +{
> + int err;
> + struct read_object_process *entry = (struct read_object_process *)subprocess;
> + struct child_process *process;
> + struct string_list cap_list = STRING_LIST_INIT_NODUP;
> + char *cap_buf;
> + const char *cap_name;
> +
> + process = subprocess_get_child_process(&entry->subprocess);
> +
> + sigchain_push(SIGPIPE, SIG_IGN);
> +
> + err = packet_writel(process->in, "git-read-object-client", "version=1", NULL);
> + if (err)
> + goto done;
> +
> + err = strcmp(packet_read_line(process->out, NULL), "git-read-object-server");
> + if (err) {
> + error("external process '%s' does not support read-object protocol version 1", subprocess->cmd);
> + goto done;
> + }
> + err = strcmp(packet_read_line(process->out, NULL), "version=1");
> + if (err)
> + goto done;
> + err = packet_read_line(process->out, NULL) != NULL;
> + if (err)
> + goto done;
> +
> + err = packet_writel(process->in, "capability=get", NULL);
> + if (err)
> + goto done;
> +
> + for (;;) {
> + cap_buf = packet_read_line(process->out, NULL);
> + if (!cap_buf)
> + break;
> + string_list_split_in_place(&cap_list, cap_buf, '=', 1);
> +
> + if (cap_list.nr != 2 || strcmp(cap_list.items[0].string, "capability"))
> + continue;
> +
> + cap_name = cap_list.items[1].string;
> + if (!strcmp(cap_name, "get")) {
> + entry->supported_capabilities |= CAP_GET;
> + }
> + else {
> + warning(
> + "external process '%s' requested unsupported read-object capability '%s'",
> + subprocess->cmd, cap_name
> + );
> + }
> +
> + string_list_clear(&cap_list, 0);
> + }
> +
> +done:
> + sigchain_pop(SIGPIPE);
> +
> + if (err || errno == EPIPE)
> + return err ? err : errno;
> +
> + return 0;
> +}
> +
> +static int read_object_process(const unsigned char *sha1)
> +{
> + int err;
> + struct read_object_process *entry;
> + struct child_process *process;
> + struct strbuf status = STRBUF_INIT;
> + uint64_t start;
> +
> + start = getnanotime();
> +
> + if (!repository_format_promised_objects)
> + die("BUG: if extensions.promisedObjects is not set, there "
> + "should not be any promised objects");
> +
> + if (!subprocess_map_initialized) {
> + subprocess_map_initialized = 1;
> + hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp, 0);
> + entry = NULL;
> + } else {
> + entry = (struct read_object_process *)subprocess_find_entry(&subprocess_map, repository_format_promised_objects);
> + }
> + if (!entry) {
> + entry = xmalloc(sizeof(*entry));
> + entry->supported_capabilities = 0;
> +
> + if (subprocess_start(&subprocess_map, &entry->subprocess, repository_format_promised_objects, start_read_object_fn)) {
> + free(entry);
> + return -1;
> + }
> + }
> + process = subprocess_get_child_process(&entry->subprocess);
> +
> + if (!(CAP_GET & entry->supported_capabilities))
> + return -1;
> +
> + sigchain_push(SIGPIPE, SIG_IGN);
> +
> + err = packet_write_fmt_gently(process->in, "command=get\n");
> + if (err)
> + goto done;
> +
> + err = packet_write_fmt_gently(process->in, "sha1=%s\n", sha1_to_hex(sha1));
> + if (err)
> + goto done;
> +
> + err = packet_flush_gently(process->in);
> + if (err)
> + goto done;
> +
> + err = subprocess_read_status(process->out, &status);
> + err = err ? err : strcmp(status.buf, "success");
> +
> +done:
> + sigchain_pop(SIGPIPE);
> +
> + if (err || errno == EPIPE) {
> + err = err ? err : errno;
> + if (!strcmp(status.buf, "error")) {
> + /* The process signaled a problem with the file. */
> + }
> + else if (!strcmp(status.buf, "abort")) {
> + /*
> + * The process signaled a permanent problem. Don't try to read
> + * objects with the same command for the lifetime of the current
> + * Git process.
> + */
> + entry->supported_capabilities &= ~CAP_GET;
> + }
> + else {
> + /*
> + * Something went wrong with the read-object process.
> + * Force shutdown and restart if needed.
> + */
> + error("external process '%s' failed", repository_format_promised_objects);
> + subprocess_stop(&subprocess_map, (struct subprocess_entry *)entry);
> + free(entry);
> + }
> + }
> +
> + trace_performance_since(start, "read_object_process");
> +
> + return err;
> +}
> +
> +int request_promised_objects(const struct oid_array *oids)
> +{
> + int oids_requested = 0;
> + int i;
> +
> + for (i = 0; i < oids->nr; i++) {
> + if (is_promised_object(&oids->oid[i], NULL, NULL))
> + break;
> + }
> +
> + if (i == oids->nr)
> + /* Nothing to fetch */
> + return 0;
> +
> + for (; i < oids->nr; i++) {
> + if (is_promised_object(&oids->oid[i], NULL, NULL)) {
> + read_object_process(oids->oid[i].hash);
> + oids_requested++;
> + }
> + }
> +
> + /*
> + * The command above may have updated packfiles, so update our record
> + * of them.
> + */
> + reprepare_packed_git();
> + return oids_requested;
> +}
> diff --git a/promised-object.h b/promised-object.h
> index 7eaedff17..8ad47aa4c 100644
> --- a/promised-object.h
> +++ b/promised-object.h
> @@ -2,6 +2,7 @@
> #define PROMISED_OBJECT_H
>
> #include "cache.h"
> +#include "sha1-array.h"
>
> /*
> * Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
> @@ -19,4 +20,15 @@ int for_each_promised_object(each_promised_object_fn, void *);
> */
> int fsck_promised_objects(void);
>
> +/*
> + * If any of the given objects are promised objects, invokes
> + * core.promisedobjectcommand with those objects and returns the number of
> + * objects requested. No check is made as to whether the invocation actually
> + * populated the repository with the promised objects.
> + *
> + * If none of the given objects are promised objects, this function does not
> + * invoke anything and returns 0.
> + */
> +int request_promised_objects(const struct oid_array *oids);
> +
> #endif
> diff --git a/sha1_file.c b/sha1_file.c
> index 5862386cd..ded0ef46b 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -28,6 +28,11 @@
> #include "list.h"
> #include "mergesort.h"
> #include "quote.h"
> +#include "iterator.h"
> +#include "dir-iterator.h"
> +#include "sha1-lookup.h"
> +#include "promised-object.h"
> +#include "sha1-array.h"
>
> #define SZ_FMT PRIuMAX
> static inline uintmax_t sz_fmt(size_t s) { return s; }
> @@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
> const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
> lookup_replace_object(sha1) :
> sha1;
> + int already_retried = 0;
>
> if (!oi)
> oi = &blank_oi;
> @@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
> }
> }
>
> - if (!find_pack_entry(real, &e)) {
> - /* Most likely it's a loose object. */
> - if (!sha1_loose_object_info(real, oi, flags)) {
> - oi->whence = OI_LOOSE;
> - return 0;
> - }
> +retry:
> + if (find_pack_entry(real, &e))
> + goto found_packed;
>
> - /* Not a loose object; someone else may have just packed it. */
> - if (flags & OBJECT_INFO_QUICK) {
> - return -1;
> - } else {
> - reprepare_packed_git();
> - if (!find_pack_entry(real, &e))
> - return -1;
> + /* Most likely it's a loose object. */
> + if (!sha1_loose_object_info(real, oi, flags)) {
> + oi->whence = OI_LOOSE;
> + return 0;
> + }
> +
> + /* Not a loose object; someone else may have just packed it. */
> + reprepare_packed_git();
> + if (find_pack_entry(real, &e))
> + goto found_packed;
> +
> + /* Check if it is a promised blob */
> + if (!already_retried && !(flags & OBJECT_INFO_IGNORE_PROMISES)) {
> + struct oid_array promised = OID_ARRAY_INIT;
> + oid_array_append_sha1(&promised, real);
> + if (request_promised_objects(&promised)) {
> + already_retried = 1;
> + goto retry;
> }
> }
>
> + return -1;
> +
> +found_packed:
> if (oi == &blank_oi)
> /*
> * We know that the caller doesn't actually need the
> * information below, so return early.
> */
> return 0;
> -
> rtype = packed_object_info(e.p, e.offset, oi);
> if (rtype < 0) {
> mark_bad_packed_object(e.p, real);
> diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
> index 3e0caf4f9..d9e6a6486 100755
> --- a/t/t3907-promised-object.sh
> +++ b/t/t3907-promised-object.sh
> @@ -38,4 +38,36 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
> unset GIT_IGNORE_PROMISED_OBJECTS
> '
>
> +test_expect_success 'sha1_object_info_extended (through git cat-file)' '
> + test_create_repo server &&
> + test_commit -C server 1 1.t abcdefgh &&
> + HASH=$(git hash-object server/1.t) &&
> +
> + test_create_repo client &&
> + test_must_fail git -C client cat-file -p "$HASH"
> +'
> +
> +test_expect_success '...succeeds if it is a promised object' '
> + printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
> + hex_pack >client/.git/objects/promised &&
> + git -C client config core.repositoryformatversion 1 &&
> + git -C client config extensions.promisedobjects \
> + "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
> + git -C client cat-file -p "$HASH"
> +'
> +
> +test_expect_success 'cat-file --batch-all-objects with promised objects' '
> + rm -rf client &&
> + test_create_repo client &&
> + git -C client config core.repositoryformatversion 1 &&
> + git -C client config extensions.promisedobjects \
> + "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
> + printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
> + hex_pack >client/.git/objects/promised &&
> +
> + # Verify that the promised object is printed
> + git -C client cat-file --batch --batch-all-objects | tee out |
> + grep abcdefgh
> +'
> +
> test_done
> diff --git a/t/t3907/read-object b/t/t3907/read-object
> new file mode 100755
> index 000000000..9666ad597
> --- /dev/null
> +++ b/t/t3907/read-object
> @@ -0,0 +1,114 @@
> +#!/usr/bin/perl
> +#
> +# Example implementation for the Git read-object protocol version 1
> +# See Documentation/technical/read-object-protocol.txt
> +#
> +# Allows you to test the ability for blobs to be pulled from a host git repo
> +# "on demand." Called when git needs a blob it couldn't find locally due to
> +# a lazy clone that only cloned the commits and trees.
> +#
> +# A lazy clone can be simulated via the following commands from the host repo
> +# you wish to create a lazy clone of:
> +#
> +# cd /host_repo
> +# git rev-parse HEAD
> +# git init /guest_repo
> +# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
> +# cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
> +# cd /guest_repo
> +# git config core.virtualizeobjects true
> +# git reset --hard <sha from rev-parse call above>
> +#
> +# Please note, this sample is a minimal skeleton. No proper error handling
> +# was implemented.
> +#
> +
> +use strict;
> +use warnings;
> +
> +#
> +# Point $DIR to the folder where your host git repo is located so we can pull
> +# missing objects from it
> +#
> +my $DIR = $ARGV[0];
> +
> +sub packet_bin_read {
> + my $buffer;
> + my $bytes_read = read STDIN, $buffer, 4;
> + if ( $bytes_read == 0 ) {
> +
> + # EOF - Git stopped talking to us!
> + exit();
> + }
> + elsif ( $bytes_read != 4 ) {
> + die "invalid packet: '$buffer'";
> + }
> + my $pkt_size = hex($buffer);
> + if ( $pkt_size == 0 ) {
> + return ( 1, "" );
> + }
> + elsif ( $pkt_size > 4 ) {
> + my $content_size = $pkt_size - 4;
> + $bytes_read = read STDIN, $buffer, $content_size;
> + if ( $bytes_read != $content_size ) {
> + die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
> + }
> + return ( 0, $buffer );
> + }
> + else {
> + die "invalid packet size: $pkt_size";
> + }
> +}
> +
> +sub packet_txt_read {
> + my ( $res, $buf ) = packet_bin_read();
> + unless ( $buf =~ s/\n$// ) {
> + die "A non-binary line MUST be terminated by an LF.";
> + }
> + return ( $res, $buf );
> +}
> +
> +sub packet_bin_write {
> + my $buf = shift;
> + print STDOUT sprintf( "%04x", length($buf) + 4 );
> + print STDOUT $buf;
> + STDOUT->flush();
> +}
> +
> +sub packet_txt_write {
> + packet_bin_write( $_[0] . "\n" );
> +}
> +
> +sub packet_flush {
> + print STDOUT sprintf( "%04x", 0 );
> + STDOUT->flush();
> +}
> +
> +( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad initialize";
> +( packet_txt_read() eq ( 0, "version=1" ) ) || die "bad version";
> +( packet_bin_read() eq ( 1, "" ) ) || die "bad version end";
> +
> +packet_txt_write("git-read-object-server");
> +packet_txt_write("version=1");
> +packet_flush();
> +
> +( packet_txt_read() eq ( 0, "capability=get" ) ) || die "bad capability";
> +( packet_bin_read() eq ( 1, "" ) ) || die "bad capability end";
> +
> +packet_txt_write("capability=get");
> +packet_flush();
> +
> +while (1) {
> + my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
> +
> + if ( $command eq "get" ) {
> + my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
> + packet_bin_read();
> +
> + system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 2>&1');
> + packet_txt_write(($?) ? "status=error" : "status=success");
> + packet_flush();
> + } else {
> + die "bad command '$command'";
> + }
> +}
> --
> 2.14.0.rc0.284.gd933b75aa4-goog
>
^ permalink raw reply [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 4/4] sha1_file: support promised object hook
2017-07-20 18:23 ` Stefan Beller
@ 2017-07-20 20:58 ` Ben Peart
2017-07-20 21:18 ` Jonathan Tan
0 siblings, 1 reply; 45+ messages in thread
From: Ben Peart @ 2017-07-20 20:58 UTC (permalink / raw)
To: Stefan Beller, Jonathan Tan, Lars Schneider
Cc: git@vger.kernel.org, Jonathan Nieder, Jeff Hostetler,
Philip Oakley
On 7/20/2017 2:23 PM, Stefan Beller wrote:
> On Wed, Jul 19, 2017 at 5:21 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
>> Teach sha1_file to invoke a hook whenever an object is requested and
>> unavailable but is promised. The hook is a shell command that can be
>> configured through "git config"; this hook takes in a list of hashes and
>> writes (if successful) the corresponding objects to the repo's local
>> storage.
>>
>> The usage of the hook can be suppressed through a flag when invoking
>> has_object_file_with_flags() and other similar functions.
>> parse_or_promise_object() in object.c requires this functionality, and
>> has been modified to use it.
>>
>> This is meant as a temporary measure to ensure that all Git commands
>> work in such a situation. Future patches will update some commands to
>> either tolerate promised objects (without invoking the hook) or be more
>> efficient in invoking the promised objects hook.
I agree that making git more tolerant of promised objects if possible
and precomputing a list of promised objects required to complete a
particular command and downloading them with a single request are good
optimizations to add over time.
>>
>> In order to determine the code changes in sha1_file.c necessary, I
>> investigated the following:
>> (1) functions in sha1_file that take in a hash, without the user
>> regarding how the object is stored (loose or packed)
>> (2) functions in sha1_file that operate on packed objects (because I
>> need to check callers that know about the loose/packed distinction
>> and operate on both differently, and ensure that they can handle
>> the concept of objects that are neither loose nor packed)
>>
>> (1) is handled by the modification to sha1_object_info_extended().
>>
>> For (2), I looked at for_each_packed_object and at the packed-related
>> functions that take in a hash. For for_each_packed_object, the callers
>> either already work or are fixed in this patch:
>> - reachable - only to find recent objects
>> - builtin/fsck - already knows about promised objects
>> - builtin/cat-file - fixed in this commit
>>
>> Callers of the other functions do not need to be changed:
>> - parse_pack_index
>> - http - indirectly from http_get_info_packs
>> - find_pack_entry_one
>> - this searches a single pack that is provided as an argument; the
>> caller already knows (through other means) that the sought object
>> is in a specific pack
>> - find_sha1_pack
>> - fast-import - appears to be an optimization to not store a
>> file if it is already in a pack
>> - http-walker - to search through a struct alt_base
>> - http-push - to search through remote packs
>> - has_sha1_pack
>> - builtin/fsck - already knows about promised objects
>> - builtin/count-objects - informational purposes only (check if loose
>> object is also packed)
>> - builtin/prune-packed - check if object to be pruned is packed (if
>> not, don't prune it)
>> - revision - used to exclude packed objects if requested by user
>> - diff - just for optimization
>>
has_sha1_file also takes a hash "whether local or in an alternate object
database, and whether packed or loose" but never calls
sha1_object_info_extended. As a result, we had to add support in
check_and_freshen to download missing objects to get proper behavior in
all cases. I don't think this will work correctly without it.
>> An alternative design that I considered but rejected:
>>
>> - Adding a hook whenever a packed object is requested, not on any
>> object. That is, whenever we attempt to search the packfiles for an
>> object, if it is missing (from the packfiles and from the loose
>> object storage), to invoke the hook (which must then store it as a
>> packfile), open the packfile the hook generated, and report that the
>> object is found in that new packfile. This reduces the amount of
>> analysis needed (in that we only need to look at how packed objects
>> are handled), but requires that the hook generate packfiles (or for
>> sha1_file to pack whatever loose objects are generated), creating one
>> packfile for each missing object and potentially very many packfiles
>> that must be linearly searched. This may be tolerable now for repos
>> that only have a few missing objects (for example, repos that only
>> want to exclude large blobs), and might be tolerable in the future if
>> we have batching support for the most commonly used commands, but is
>> not tolerable now for repos that exclude a large amount of objects.
>>
>> Helped-by: Ben Peart <benpeart@microsoft.com>
>> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>> ---
>> Documentation/config.txt | 8 +
>> Documentation/gitrepository-layout.txt | 8 +
>> Documentation/technical/read-object-protocol.txt | 102 ++++++++++++
>> builtin/cat-file.c | 9 ++
>> cache.h | 2 +
>> object.c | 3 +-
>> promised-object.c | 194 +++++++++++++++++++++++
>> promised-object.h | 12 ++
>> sha1_file.c | 44 +++--
>> t/t3907-promised-object.sh | 32 ++++
>> t/t3907/read-object | 114 +++++++++++++
>> 11 files changed, 513 insertions(+), 15 deletions(-)
>> create mode 100644 Documentation/technical/read-object-protocol.txt
>> create mode 100755 t/t3907/read-object
>>
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> index d5c9c4cab..c293ac921 100644
>> --- a/Documentation/config.txt
>> +++ b/Documentation/config.txt
>> @@ -393,6 +393,14 @@ The default is false, except linkgit:git-clone[1] or linkgit:git-init[1]
>> will probe and set core.ignoreCase true if appropriate when the repository
>> is created.
>>
>> +core.promisedBlobCommand::
>> + If set, whenever a blob in the local repo is attempted to be read, but
>> + is both missing and a promised blob, invoke this shell command to
>> + generate or obtain that blob before reporting an error. This shell
>> + command should take one or more hashes, each terminated by a newline,
>> + as standard input, and (if successful) should write the corresponding
>> + objects to the local repo (packed or loose).
>> +
>> core.precomposeUnicode::
>> This option is only used by Mac OS implementation of Git.
>> When core.precomposeUnicode=true, Git reverts the unicode decomposition
>> diff --git a/Documentation/gitrepository-layout.txt b/Documentation/gitrepository-layout.txt
>> index f51ed4e37..7dea7fe6b 100644
>> --- a/Documentation/gitrepository-layout.txt
>> +++ b/Documentation/gitrepository-layout.txt
>> @@ -47,6 +47,10 @@ use with dumb transports but otherwise is OK as long as
>> `objects/info/alternates` points at the object stores it
>> borrows from.
>> +
>> +. You could have objects that are merely promised by another source.
>> +When Git requires those objects, it will invoke the command in the
>> +`extensions.promisedObjects` configuration variable.
>> ++
>> This directory is ignored if $GIT_COMMON_DIR is set and
>> "$GIT_COMMON_DIR/objects" will be used instead.
>>
>> @@ -91,6 +95,10 @@ objects/info/http-alternates::
>> this object store borrows objects from, to be used when
>> the repository is fetched over HTTP.
>>
>> +objects/promised::
>> + This file records the sha1 object names, types, and sizes of
>> + promised objects.
>> +
>> refs::
>> References are stored in subdirectories of this
>> directory. The 'git prune' command knows to preserve
>> diff --git a/Documentation/technical/read-object-protocol.txt b/Documentation/technical/read-object-protocol.txt
>> new file mode 100644
>> index 000000000..a893b46e7
>> --- /dev/null
>> +++ b/Documentation/technical/read-object-protocol.txt
>> @@ -0,0 +1,102 @@
>> +Read Object Process
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> This protocol reads very similar to the protocol that is used in
> clean/smudge filtering, designed by Lars (cc'd)
>
>> +
>> +The read-object process enables Git to read all missing blobs with a
>> +single process invocation for the entire life of a single Git command.
>> +This is achieved by using a packet format (pkt-line, see technical/
>> +protocol-common.txt) based protocol over standard input and standard
>> +output as follows. All packets, except for the "*CONTENT" packets and
>> +the "0000" flush packet, are considered text and therefore are
>> +terminated by a LF.
>> +
>> +Git starts the process when it encounters the first missing object that
>> +needs to be retrieved. After the process is started, Git sends a welcome
>> +message ("git-read-object-client"), a list of supported protocol version
>> +numbers, and a flush packet. Git expects to read a welcome response
>> +message ("git-read-object-server"), exactly one protocol version number
>> +from the previously sent list, and a flush packet. All further
>> +communication will be based on the selected version.
>> +
>> +The remaining protocol description below documents "version=1". Please
>> +note that "version=42" in the example below does not exist and is only
>> +there to illustrate how the protocol would look with more than one
>> +version.
>> +
>> +After the version negotiation Git sends a list of all capabilities that
>> +it supports and a flush packet. Git expects to read a list of desired
>> +capabilities, which must be a subset of the supported capabilities list,
>> +and a flush packet as response:
>> +------------------------
>> +packet: git> git-read-object-client
>> +packet: git> version=1
>> +packet: git> version=42
>> +packet: git> 0000
>> +packet: git< git-read-object-server
>> +packet: git< version=1
>> +packet: git< 0000
>> +packet: git> capability=get
>> +packet: git> capability=have
>> +packet: git> capability=put
>> +packet: git> capability=not-yet-invented
>> +packet: git> 0000
>> +packet: git< capability=get
>> +packet: git< 0000
>> +------------------------
>> +The only supported capability in version 1 is "get".
>> +
>> +Afterwards Git sends a list of "key=value" pairs terminated with a flush
>> +packet. The list will contain at least the command (based on the
>> +supported capabilities) and the sha1 of the object to retrieve. Please
>> +note, that the process must not send any response before it received the
>> +final flush packet.
>> +
>> +When the process receives the "get" command, it should make the requested
>> +object available in the git object store and then return success. Git will
>> +then check the object store again and this time find it and proceed.
>> +------------------------
>> +packet: git> command=get
>> +packet: git> sha1=0a214a649e1b3d5011e14a3dc227753f2bd2be05
>> +packet: git> 0000
>> +------------------------
>> +
>> +The process is expected to respond with a list of "key=value" pairs
>> +terminated with a flush packet. If the process does not experience
>> +problems then the list must contain a "success" status.
>> +------------------------
>> +packet: git< status=success
>> +packet: git< 0000
>> +------------------------
>> +
>> +In case the process cannot or does not want to process the content, it
>> +is expected to respond with an "error" status.
>> +------------------------
>> +packet: git< status=error
>> +packet: git< 0000
>> +------------------------
>> +
>> +In case the process cannot or does not want to process the content as
>> +well as any future content for the lifetime of the Git process, then it
>> +is expected to respond with an "abort" status at any point in the
>> +protocol.
>> +------------------------
>> +packet: git< status=abort
>> +packet: git< 0000
>> +------------------------
>> +
>> +Git neither stops nor restarts the process in case the "error"/"abort"
>> +status is set.
>> +
>> +If the process dies during the communication or does not adhere to the
>> +protocol then Git will stop the process and restart it with the next
>> +object that needs to be processed.
>> +
>> +After the read-object process has processed an object it is expected to
>> +wait for the next "key=value" list containing a command. Git will close
>> +the command pipe on exit. The process is expected to detect EOF and exit
>> +gracefully on its own. Git will wait until the process has stopped.
>> +
>> +A long running read-object process demo implementation can be found in
>> +`contrib/long-running-read-object/example.pl` located in the Git core
>> +repository. If you develop your own long running process then the
>> +`GIT_TRACE_PACKET` environment variables can be very helpful for
>> +debugging (see linkgit:git[1]).
>> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
>> index 96b786e48..33f636926 100644
>> --- a/builtin/cat-file.c
>> +++ b/builtin/cat-file.c
>> @@ -12,6 +12,7 @@
>> #include "streaming.h"
>> #include "tree-walk.h"
>> #include "sha1-array.h"
>> +#include "promised-object.h"
>>
>> struct batch_options {
>> int enabled;
>> @@ -432,6 +433,13 @@ static int batch_packed_object(const struct object_id *oid,
>> return 0;
>> }
>>
>> +static int batch_promised_object(const struct object_id *oid,
>> + void *data)
>> +{
>> + oid_array_append(data, oid);
>> + return 0;
>> +}
>> +
>> static int batch_objects(struct batch_options *opt)
>> {
>> struct strbuf buf = STRBUF_INIT;
>> @@ -473,6 +481,7 @@ static int batch_objects(struct batch_options *opt)
>>
>> for_each_loose_object(batch_loose_object, &sa, 0);
>> for_each_packed_object(batch_packed_object, &sa, 0);
>> + for_each_promised_object(batch_promised_object, &sa);
>>
>> cb.opt = opt;
>> cb.expand = &data;
>> diff --git a/cache.h b/cache.h
>> index dd94b5ffc..75b71f38b 100644
>> --- a/cache.h
>> +++ b/cache.h
>> @@ -1835,6 +1835,8 @@ struct object_info {
>> #define OBJECT_INFO_SKIP_CACHED 4
>> /* Do not retry packed storage after checking packed and loose storage */
>> #define OBJECT_INFO_QUICK 8
>> +/* Ignore list of promised objects */
>> +#define OBJECT_INFO_IGNORE_PROMISES 16
>> extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
>> extern int packed_object_info(struct packed_git *pack, off_t offset, struct object_info *);
>>
>> diff --git a/object.c b/object.c
>> index 0aeb95084..23f2a6cbc 100644
>> --- a/object.c
>> +++ b/object.c
>> @@ -285,7 +285,8 @@ struct object *parse_or_promise_object(const struct object_id *oid)
>> {
>> enum object_type type;
>>
>> - if (has_object_file(oid))
>> + if (has_object_file_with_flags(oid, OBJECT_INFO_SKIP_CACHED |
>> + OBJECT_INFO_IGNORE_PROMISES))
>> return parse_object(oid);
>>
>> if (is_promised_object(oid, &type, NULL)) {
>> diff --git a/promised-object.c b/promised-object.c
>> index 487ade437..d8d95ebb2 100644
>> --- a/promised-object.c
>> +++ b/promised-object.c
>> @@ -2,6 +2,12 @@
>> #include "promised-object.h"
>> #include "sha1-lookup.h"
>> #include "strbuf.h"
>> +#include "run-command.h"
>> +#include "sha1-array.h"
>> +#include "config.h"
>> +#include "sigchain.h"
>> +#include "sub-process.h"
>> +#include "pkt-line.h"
>>
>> #define ENTRY_SIZE (GIT_SHA1_RAWSZ + 1 + 8)
>> /*
>> @@ -128,3 +134,191 @@ int fsck_promised_objects(void)
>> }
>> return 0;
>> }
>> +
>> +#define CAP_GET (1u<<0)
>> +
>> +static int subprocess_map_initialized;
>> +static struct hashmap subprocess_map;
>> +
>> +struct read_object_process {
>> + struct subprocess_entry subprocess;
>> + unsigned int supported_capabilities;
>> +};
>> +
>> +int start_read_object_fn(struct subprocess_entry *subprocess)
>> +{
>> + int err;
>> + struct read_object_process *entry = (struct read_object_process *)subprocess;
>> + struct child_process *process;
>> + struct string_list cap_list = STRING_LIST_INIT_NODUP;
>> + char *cap_buf;
>> + const char *cap_name;
>> +
>> + process = subprocess_get_child_process(&entry->subprocess);
>> +
>> + sigchain_push(SIGPIPE, SIG_IGN);
>> +
>> + err = packet_writel(process->in, "git-read-object-client", "version=1", NULL);
>> + if (err)
>> + goto done;
>> +
>> + err = strcmp(packet_read_line(process->out, NULL), "git-read-object-server");
>> + if (err) {
>> + error("external process '%s' does not support read-object protocol version 1", subprocess->cmd);
>> + goto done;
>> + }
>> + err = strcmp(packet_read_line(process->out, NULL), "version=1");
>> + if (err)
>> + goto done;
>> + err = packet_read_line(process->out, NULL) != NULL;
>> + if (err)
>> + goto done;
>> +
>> + err = packet_writel(process->in, "capability=get", NULL);
>> + if (err)
>> + goto done;
>> +
>> + for (;;) {
>> + cap_buf = packet_read_line(process->out, NULL);
>> + if (!cap_buf)
>> + break;
>> + string_list_split_in_place(&cap_list, cap_buf, '=', 1);
>> +
>> + if (cap_list.nr != 2 || strcmp(cap_list.items[0].string, "capability"))
>> + continue;
>> +
>> + cap_name = cap_list.items[1].string;
>> + if (!strcmp(cap_name, "get")) {
>> + entry->supported_capabilities |= CAP_GET;
>> + }
>> + else {
>> + warning(
>> + "external process '%s' requested unsupported read-object capability '%s'",
>> + subprocess->cmd, cap_name
>> + );
>> + }
>> +
>> + string_list_clear(&cap_list, 0);
>> + }
>> +
>> +done:
>> + sigchain_pop(SIGPIPE);
>> +
>> + if (err || errno == EPIPE)
>> + return err ? err : errno;
>> +
>> + return 0;
>> +}
>> +
>> +static int read_object_process(const unsigned char *sha1)
>> +{
>> + int err;
>> + struct read_object_process *entry;
>> + struct child_process *process;
>> + struct strbuf status = STRBUF_INIT;
>> + uint64_t start;
>> +
>> + start = getnanotime();
>> +
>> + if (!repository_format_promised_objects)
>> + die("BUG: if extensions.promisedObjects is not set, there "
>> + "should not be any promised objects");
>> +
>> + if (!subprocess_map_initialized) {
>> + subprocess_map_initialized = 1;
>> + hashmap_init(&subprocess_map, (hashmap_cmp_fn)cmd2process_cmp, 0);
>> + entry = NULL;
>> + } else {
>> + entry = (struct read_object_process *)subprocess_find_entry(&subprocess_map, repository_format_promised_objects);
>> + }
>> + if (!entry) {
>> + entry = xmalloc(sizeof(*entry));
>> + entry->supported_capabilities = 0;
>> +
>> + if (subprocess_start(&subprocess_map, &entry->subprocess, repository_format_promised_objects, start_read_object_fn)) {
>> + free(entry);
>> + return -1;
>> + }
>> + }
>> + process = subprocess_get_child_process(&entry->subprocess);
>> +
>> + if (!(CAP_GET & entry->supported_capabilities))
>> + return -1;
>> +
>> + sigchain_push(SIGPIPE, SIG_IGN);
>> +
>> + err = packet_write_fmt_gently(process->in, "command=get\n");
>> + if (err)
>> + goto done;
>> +
>> + err = packet_write_fmt_gently(process->in, "sha1=%s\n", sha1_to_hex(sha1));
>> + if (err)
>> + goto done;
>> +
>> + err = packet_flush_gently(process->in);
>> + if (err)
>> + goto done;
>> +
>> + err = subprocess_read_status(process->out, &status);
>> + err = err ? err : strcmp(status.buf, "success");
>> +
>> +done:
>> + sigchain_pop(SIGPIPE);
>> +
>> + if (err || errno == EPIPE) {
>> + err = err ? err : errno;
>> + if (!strcmp(status.buf, "error")) {
>> + /* The process signaled a problem with the file. */
>> + }
>> + else if (!strcmp(status.buf, "abort")) {
>> + /*
>> + * The process signaled a permanent problem. Don't try to read
>> + * objects with the same command for the lifetime of the current
>> + * Git process.
>> + */
>> + entry->supported_capabilities &= ~CAP_GET;
>> + }
>> + else {
>> + /*
>> + * Something went wrong with the read-object process.
>> + * Force shutdown and restart if needed.
>> + */
>> + error("external process '%s' failed", repository_format_promised_objects);
>> + subprocess_stop(&subprocess_map, (struct subprocess_entry *)entry);
>> + free(entry);
>> + }
>> + }
>> +
>> + trace_performance_since(start, "read_object_process");
>> +
>> + return err;
>> +}
>> +
>> +int request_promised_objects(const struct oid_array *oids)
>> +{
>> + int oids_requested = 0;
>> + int i;
>> +
>> + for (i = 0; i < oids->nr; i++) {
>> + if (is_promised_object(&oids->oid[i], NULL, NULL))
>> + break;
>> + }
>> +
>> + if (i == oids->nr)
>> + /* Nothing to fetch */
>> + return 0;
>> +
>> + for (; i < oids->nr; i++) {
>> + if (is_promised_object(&oids->oid[i], NULL, NULL)) {
>> + read_object_process(oids->oid[i].hash);
>> + oids_requested++;
>> + }
>> + }
>> +
>> + /*
>> + * The command above may have updated packfiles, so update our record
>> + * of them.
>> + */
>> + reprepare_packed_git();
>> + return oids_requested;
>> +}
>> diff --git a/promised-object.h b/promised-object.h
>> index 7eaedff17..8ad47aa4c 100644
>> --- a/promised-object.h
>> +++ b/promised-object.h
>> @@ -2,6 +2,7 @@
>> #define PROMISED_OBJECT_H
>>
>> #include "cache.h"
>> +#include "sha1-array.h"
>>
>> /*
>> * Returns 1 if oid is the name of a promised object. For non-blobs, 0 is
>> @@ -19,4 +20,15 @@ int for_each_promised_object(each_promised_object_fn, void *);
>> */
>> int fsck_promised_objects(void);
>>
>> +/*
>> + * If any of the given objects are promised objects, invokes
>> + * core.promisedobjectcommand with those objects and returns the number of
>> + * objects requested. No check is made as to whether the invocation actually
>> + * populated the repository with the promised objects.
>> + *
>> + * If none of the given objects are promised objects, this function does not
>> + * invoke anything and returns 0.
>> + */
>> +int request_promised_objects(const struct oid_array *oids);
>> +
>> #endif
>> diff --git a/sha1_file.c b/sha1_file.c
>> index 5862386cd..ded0ef46b 100644
>> --- a/sha1_file.c
>> +++ b/sha1_file.c
>> @@ -28,6 +28,11 @@
>> #include "list.h"
>> #include "mergesort.h"
>> #include "quote.h"
>> +#include "iterator.h"
>> +#include "dir-iterator.h"
>> +#include "sha1-lookup.h"
>> +#include "promised-object.h"
>> +#include "sha1-array.h"
>>
>> #define SZ_FMT PRIuMAX
>> static inline uintmax_t sz_fmt(size_t s) { return s; }
>> @@ -2983,6 +2988,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>> const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
>> lookup_replace_object(sha1) :
>> sha1;
>> + int already_retried = 0;
>>
>> if (!oi)
>> oi = &blank_oi;
>> @@ -3007,30 +3013,40 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>> }
>> }
>>
>> - if (!find_pack_entry(real, &e)) {
>> - /* Most likely it's a loose object. */
>> - if (!sha1_loose_object_info(real, oi, flags)) {
>> - oi->whence = OI_LOOSE;
>> - return 0;
>> - }
>> +retry:
>> + if (find_pack_entry(real, &e))
>> + goto found_packed;
>>
>> - /* Not a loose object; someone else may have just packed it. */
>> - if (flags & OBJECT_INFO_QUICK) {
>> - return -1;
>> - } else {
>> - reprepare_packed_git();
>> - if (!find_pack_entry(real, &e))
>> - return -1;
>> + /* Most likely it's a loose object. */
>> + if (!sha1_loose_object_info(real, oi, flags)) {
>> + oi->whence = OI_LOOSE;
>> + return 0;
>> + }
>> +
>> + /* Not a loose object; someone else may have just packed it. */
>> + reprepare_packed_git();
>> + if (find_pack_entry(real, &e))
>> + goto found_packed;
>> +
>> + /* Check if it is a promised blob */
>> + if (!already_retried && !(flags & OBJECT_INFO_IGNORE_PROMISES)) {
>> + struct oid_array promised = OID_ARRAY_INIT;
>> + oid_array_append_sha1(&promised, real);
>> + if (request_promised_objects(&promised)) {
>> + already_retried = 1;
>> + goto retry;
>> }
>> }
>>
>> + return -1;
>> +
>> +found_packed:
>> if (oi == &blank_oi)
>> /*
>> * We know that the caller doesn't actually need the
>> * information below, so return early.
>> */
>> return 0;
>> -
>> rtype = packed_object_info(e.p, e.offset, oi);
>> if (rtype < 0) {
>> mark_bad_packed_object(e.p, real);
>> diff --git a/t/t3907-promised-object.sh b/t/t3907-promised-object.sh
>> index 3e0caf4f9..d9e6a6486 100755
>> --- a/t/t3907-promised-object.sh
>> +++ b/t/t3907-promised-object.sh
>> @@ -38,4 +38,36 @@ test_expect_success '...but fails again with GIT_IGNORE_PROMISED_OBJECTS' '
>> unset GIT_IGNORE_PROMISED_OBJECTS
>> '
>>
>> +test_expect_success 'sha1_object_info_extended (through git cat-file)' '
>> + test_create_repo server &&
>> + test_commit -C server 1 1.t abcdefgh &&
>> + HASH=$(git hash-object server/1.t) &&
>> +
>> + test_create_repo client &&
>> + test_must_fail git -C client cat-file -p "$HASH"
>> +'
>> +
>> +test_expect_success '...succeeds if it is a promised object' '
>> + printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
>> + hex_pack >client/.git/objects/promised &&
>> + git -C client config core.repositoryformatversion 1 &&
>> + git -C client config extensions.promisedobjects \
>> + "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
>> + git -C client cat-file -p "$HASH"
>> +'
>> +
>> +test_expect_success 'cat-file --batch-all-objects with promised objects' '
>> + rm -rf client &&
>> + test_create_repo client &&
>> + git -C client config core.repositoryformatversion 1 &&
>> + git -C client config extensions.promisedobjects \
>> + "\"$TEST_DIRECTORY/t3907/read-object\" \"$(pwd)/server/.git\"" &&
>> + printf "%s03%016x" "$HASH" "$(wc -c <server/1.t)" |
>> + hex_pack >client/.git/objects/promised &&
>> +
>> + # Verify that the promised object is printed
>> + git -C client cat-file --batch --batch-all-objects | tee out |
>> + grep abcdefgh
>> +'
>> +
>> test_done
>> diff --git a/t/t3907/read-object b/t/t3907/read-object
>> new file mode 100755
>> index 000000000..9666ad597
>> --- /dev/null
>> +++ b/t/t3907/read-object
>> @@ -0,0 +1,114 @@
>> +#!/usr/bin/perl
>> +#
>> +# Example implementation for the Git read-object protocol version 1
>> +# See Documentation/technical/read-object-protocol.txt
>> +#
>> +# Allows you to test the ability for blobs to be pulled from a host git repo
>> +# "on demand." Called when git needs a blob it couldn't find locally due to
>> +# a lazy clone that only cloned the commits and trees.
>> +#
>> +# A lazy clone can be simulated via the following commands from the host repo
>> +# you wish to create a lazy clone of:
>> +#
>> +# cd /host_repo
>> +# git rev-parse HEAD
>> +# git init /guest_repo
>> +# git cat-file --batch-check --batch-all-objects | grep -v 'blob' |
>> +# cut -d' ' -f1 | git pack-objects /guest_repo/.git/objects/pack/noblobs
>> +# cd /guest_repo
>> +# git config core.virtualizeobjects true
>> +# git reset --hard <sha from rev-parse call above>
>> +#
>> +# Please note, this sample is a minimal skeleton. No proper error handling
>> +# was implemented.
>> +#
>> +
>> +use strict;
>> +use warnings;
>> +
>> +#
>> +# Point $DIR to the folder where your host git repo is located so we can pull
>> +# missing objects from it
>> +#
>> +my $DIR = $ARGV[0];
>> +
>> +sub packet_bin_read {
>> + my $buffer;
>> + my $bytes_read = read STDIN, $buffer, 4;
>> + if ( $bytes_read == 0 ) {
>> +
>> + # EOF - Git stopped talking to us!
>> + exit();
>> + }
>> + elsif ( $bytes_read != 4 ) {
>> + die "invalid packet: '$buffer'";
>> + }
>> + my $pkt_size = hex($buffer);
>> + if ( $pkt_size == 0 ) {
>> + return ( 1, "" );
>> + }
>> + elsif ( $pkt_size > 4 ) {
>> + my $content_size = $pkt_size - 4;
>> + $bytes_read = read STDIN, $buffer, $content_size;
>> + if ( $bytes_read != $content_size ) {
>> + die "invalid packet ($content_size bytes expected; $bytes_read bytes read)";
>> + }
>> + return ( 0, $buffer );
>> + }
>> + else {
>> + die "invalid packet size: $pkt_size";
>> + }
>> +}
>> +
>> +sub packet_txt_read {
>> + my ( $res, $buf ) = packet_bin_read();
>> + unless ( $buf =~ s/\n$// ) {
>> + die "A non-binary line MUST be terminated by an LF.";
>> + }
>> + return ( $res, $buf );
>> +}
>> +
>> +sub packet_bin_write {
>> + my $buf = shift;
>> + print STDOUT sprintf( "%04x", length($buf) + 4 );
>> + print STDOUT $buf;
>> + STDOUT->flush();
>> +}
>> +
>> +sub packet_txt_write {
>> + packet_bin_write( $_[0] . "\n" );
>> +}
>> +
>> +sub packet_flush {
>> + print STDOUT sprintf( "%04x", 0 );
>> + STDOUT->flush();
>> +}
>> +
>> +( packet_txt_read() eq ( 0, "git-read-object-client" ) ) || die "bad initialize";
>> +( packet_txt_read() eq ( 0, "version=1" ) ) || die "bad version";
>> +( packet_bin_read() eq ( 1, "" ) ) || die "bad version end";
>> +
>> +packet_txt_write("git-read-object-server");
>> +packet_txt_write("version=1");
>> +packet_flush();
>> +
>> +( packet_txt_read() eq ( 0, "capability=get" ) ) || die "bad capability";
>> +( packet_bin_read() eq ( 1, "" ) ) || die "bad capability end";
>> +
>> +packet_txt_write("capability=get");
>> +packet_flush();
>> +
>> +while (1) {
>> + my ($command) = packet_txt_read() =~ /^command=([^=]+)$/;
>> +
>> + if ( $command eq "get" ) {
>> + my ($sha1) = packet_txt_read() =~ /^sha1=([0-9a-f]{40})$/;
>> + packet_bin_read();
>> +
>> + system ('git --git-dir="' . $DIR . '" cat-file blob ' . $sha1 . ' | git -c core.virtualizeobjects=false hash-object -w --stdin >/dev/null 2>&1');
>> + packet_txt_write(($?) ? "status=error" : "status=success");
>> + packet_flush();
>> + } else {
>> + die "bad command '$command'";
>> + }
>> +}
>> --
>> 2.14.0.rc0.284.gd933b75aa4-goog
>>
^ permalink raw reply [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 4/4] sha1_file: support promised object hook
2017-07-20 20:58 ` Ben Peart
@ 2017-07-20 21:18 ` Jonathan Tan
2017-07-21 16:27 ` Ben Peart
0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Tan @ 2017-07-20 21:18 UTC (permalink / raw)
To: Ben Peart
Cc: Stefan Beller, Lars Schneider, git@vger.kernel.org,
Jonathan Nieder, Jeff Hostetler, Philip Oakley
On Thu, 20 Jul 2017 16:58:16 -0400
Ben Peart <peartben@gmail.com> wrote:
> >> This is meant as a temporary measure to ensure that all Git commands
> >> work in such a situation. Future patches will update some commands to
> >> either tolerate promised objects (without invoking the hook) or be more
> >> efficient in invoking the promised objects hook.
>
> I agree that making git more tolerant of promised objects if possible
> and precomputing a list of promised objects required to complete a
> particular command and downloading them with a single request are good
> optimizations to add over time.
That's good to know!
> has_sha1_file also takes a hash "whether local or in an alternate object
> database, and whether packed or loose" but never calls
> sha1_object_info_extended. As a result, we had to add support in
> check_and_freshen to download missing objects to get proper behavior in
> all cases. I don't think this will work correctly without it.
Thanks for the attention to detail. Is this before or after commit
e83e71c ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26)?
^ permalink raw reply [flat|nested] 45+ messages in thread* Re: [RFC PATCH v2 4/4] sha1_file: support promised object hook
2017-07-20 21:18 ` Jonathan Tan
@ 2017-07-21 16:27 ` Ben Peart
0 siblings, 0 replies; 45+ messages in thread
From: Ben Peart @ 2017-07-21 16:27 UTC (permalink / raw)
To: Jonathan Tan
Cc: Stefan Beller, Lars Schneider, git@vger.kernel.org,
Jonathan Nieder, Jeff Hostetler, Philip Oakley
On 7/20/2017 5:18 PM, Jonathan Tan wrote:
> On Thu, 20 Jul 2017 16:58:16 -0400
> Ben Peart <peartben@gmail.com> wrote:
>
>>>> This is meant as a temporary measure to ensure that all Git commands
>>>> work in such a situation. Future patches will update some commands to
>>>> either tolerate promised objects (without invoking the hook) or be more
>>>> efficient in invoking the promised objects hook.
>>
>> I agree that making git more tolerant of promised objects if possible
>> and precomputing a list of promised objects required to complete a
>> particular command and downloading them with a single request are good
>> optimizations to add over time.
>
> That's good to know!
>
>> has_sha1_file also takes a hash "whether local or in an alternate object
>> database, and whether packed or loose" but never calls
>> sha1_object_info_extended. As a result, we had to add support in
>> check_and_freshen to download missing objects to get proper behavior in
>> all cases. I don't think this will work correctly without it.
>
> Thanks for the attention to detail. Is this before or after commit
> e83e71c ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26)?
>
Sorry, my bad. I missed the comment in the cover letter that said this
needed to be applied on top of your other patch series.
^ permalink raw reply [flat|nested] 45+ messages in thread