* [PATCH 01/17] path-walk: introduce an object walk by path
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:11 ` [PATCH 02/17] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
` (16 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.
There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
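Note for readers (not part of the commit): the batching idea can be sketched
outside of Git as a small self-contained C model. Everything in it is
hypothetical and for illustration only. The object table, `batch_fn`,
`add_to_path()`, `walk_by_path()` and the fixed-size arrays are assumptions
standing in for `struct object`, `path_fn`, `add_children()`,
`walk_objects_by_path()` and the strmap/string_list pair used by the patch.
It only aims to show objects grouped by path, a stack driving the depth-first
walk, and a seen flag that keeps each object in a single batch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum obj_type { T_BLOB, T_TREE };
struct entry { const char *name; int child; };	/* child: index into objs */
struct obj {
	enum obj_type type;
	int seen;			/* stand-in for the SEEN flag */
	int nr;				/* number of tree entries */
	struct entry entries[4];
};
struct batch { char path[64]; enum obj_type type; int ids[16]; int nr; };
struct walk {
	struct batch batches[16];	/* stand-in for the path-to-list map */
	int batch_nr;
	int stack[16], stack_nr;	/* batch indexes; drives the DFS */
};
typedef int (*batch_fn)(const char *path, const int *ids, int nr,
			enum obj_type type, void *data);

/* Add object 'id' to the batch for 'path', creating and stacking it if new. */
static void add_to_path(struct walk *w, const char *path,
			enum obj_type type, int id)
{
	struct batch *b = NULL;

	for (int i = 0; i < w->batch_nr && !b; i++)
		if (!strcmp(w->batches[i].path, path))
			b = &w->batches[i];
	if (!b) {
		assert(w->batch_nr < 16 && strlen(path) < sizeof(b->path));
		b = &w->batches[w->batch_nr];
		strcpy(b->path, path);
		b->type = type;
		w->stack[w->stack_nr++] = w->batch_nr++;
	}
	assert(b->nr < 16);
	b->ids[b->nr++] = id;
}

/* Returns 1 the first time an object is visited, 0 afterwards. */
static int first_visit(struct obj *o)
{
	int first = !o->seen;
	o->seen = 1;
	return first;
}

static int walk_by_path(struct obj *objs, const int *roots, int roots_nr,
			batch_fn fn, void *data)
{
	struct walk w = { 0 };
	int ret = 0;

	/* All root trees share the empty path, so they form one batch. */
	for (int i = 0; i < roots_nr; i++)
		if (first_visit(&objs[roots[i]]))
			add_to_path(&w, "", T_TREE, roots[i]);

	while (!ret && w.stack_nr) {
		struct batch *b = &w.batches[w.stack[--w.stack_nr]];

		ret = fn(b->path, b->ids, b->nr, b->type, data);
		for (int i = 0; b->type == T_TREE && i < b->nr; i++) {
			struct obj *t = &objs[b->ids[i]];

			for (int j = 0; j < t->nr; j++) {
				struct entry *e = &t->entries[j];
				char path[128];

				if (!first_visit(&objs[e->child]))
					continue;
				/* Trees get a trailing "/" so "x" and "x/" differ. */
				snprintf(path, sizeof(path), "%s%s%s", b->path, e->name,
					 objs[e->child].type == T_TREE ? "/" : "");
				add_to_path(&w, path, objs[e->child].type, e->child);
			}
		}
	}
	return ret;
}

struct counts { int calls, trees, blobs; };
static int count_batch(const char *path, const int *ids, int nr,
		       enum obj_type type, void *data)
{
	struct counts *c = data;

	for (int i = 0; i < nr; i++)
		printf("%s:%s:%d\n", type == T_TREE ? "TREE" : "BLOB", path, ids[i]);
	c->calls++;
	*(type == T_TREE ? &c->trees : &c->blobs) += nr;
	return 0;
}

/* Roots are 3 and 5; blob 1 is reachable as left/b and right/c. */
static struct obj sample_objs[] = {
	{ T_BLOB }, { T_BLOB }, { T_TREE, 0, 1, { { "b", 1 } } },
	{ T_TREE, 0, 2, { { "a", 0 }, { "left", 2 } } },
	{ T_TREE, 0, 1, { { "c", 1 } } },
	{ T_TREE, 0, 3, { { "a", 0 }, { "left", 2 }, { "right", 4 } } },
};
```

Calling `walk_by_path(sample_objs, roots, 2, count_batch, &c)` with roots 3
and 5 makes five callback calls in this model, for the paths "", "right/",
"right/c", "left/" and "a". The shared blob is reported only under
`right/c`, because that path is popped from the stack first. This mirrors
the "first path to see an object wins" behavior, which depends on walk order.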
Documentation/technical/api-path-walk.txt | 54 +++++
Makefile | 1 +
path-walk.c | 241 ++++++++++++++++++++++
path-walk.h | 43 ++++
4 files changed, 339 insertions(+)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
new file mode 100644
index 00000000000..6472222ae6d
--- /dev/null
+++ b/Documentation/technical/api-path-walk.txt
@@ -0,0 +1,54 @@
+Path-Walk API
+=============
+
+The path-walk API is used to walk reachable objects, but to visit objects
+in batches based on a common path they appear in, or by type.
+
+For example, all reachable commits are visited in a group. All tags are
+visited in a group. Then, all root trees are visited. At some point, all
+blobs reachable via a path `my/dir/to/A` are visited. When there are
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
+When walking a range of commits with some `UNINTERESTING` objects, the
+objects with the `UNINTERESTING` flag are included in these batches. To
+walk `UNINTERESTING` objects, the commit walk must use the `--boundary`
+option so that `UNINTERESTING` commits are visited.
+
+Basics
+------
+
+To use the path-walk API, include `path-walk.h` and call
+`walk_objects_by_path()` with a customized `path_walk_info` struct. The
+struct is used to set all of the options for how the walk should proceed.
+Let's dig into the different options and their use.
+
+`path_fn` and `path_fn_data`::
+ The most important option is the `path_fn` option, which is a
+ function pointer to the callback that can execute logic on the
+ object IDs for objects grouped by type and path. This function
+ also receives a `data` value that corresponds to the
+ `path_fn_data` member, for providing custom data structures to
+ this callback function.
+
+`revs`::
+ To configure the exact details of the reachable set of objects,
+ use the `revs` member and initialize it using the revision
+ machinery in `revision.h`. Initialize `revs` using calls such as
+ `setup_revisions()` or `parse_revision_opt()`. Do not call
+ `prepare_revision_walk()`, as that will be called within
+ `walk_objects_by_path()`.
++
+It is also important that you do not specify the `--objects` flag for the
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
++
+If you want the path-walk API to emit `UNINTERESTING` objects based on the
+commit walk's boundary, be sure to set `revs.boundary` so the boundary
+commits are emitted.
+
+Examples
+--------
+
+See example usages in future changes.
diff --git a/Makefile b/Makefile
index 7344a7f7257..d0d8d6888e3 100644
--- a/Makefile
+++ b/Makefile
@@ -1094,6 +1094,7 @@ LIB_OBJS += parse-options.o
LIB_OBJS += patch-delta.o
LIB_OBJS += patch-ids.o
LIB_OBJS += path.o
+LIB_OBJS += path-walk.o
LIB_OBJS += pathspec.o
LIB_OBJS += pkt-line.o
LIB_OBJS += preload-index.o
diff --git a/path-walk.c b/path-walk.c
new file mode 100644
index 00000000000..66840187e28
--- /dev/null
+++ b/path-walk.c
@@ -0,0 +1,241 @@
+/*
+ * path-walk.c: implementation for path-based walks of the object graph.
+ */
+#include "git-compat-util.h"
+#include "path-walk.h"
+#include "blob.h"
+#include "commit.h"
+#include "dir.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "object.h"
+#include "oid-array.h"
+#include "revision.h"
+#include "string-list.h"
+#include "strmap.h"
+#include "trace2.h"
+#include "tree.h"
+#include "tree-walk.h"
+
+struct type_and_oid_list
+{
+ enum object_type type;
+ struct oid_array oids;
+};
+
+#define TYPE_AND_OID_LIST_INIT { \
+ .type = OBJ_NONE, \
+ .oids = OID_ARRAY_INIT \
+}
+
+struct path_walk_context {
+ /**
+ * Repeats of data in 'struct path_walk_info' for
+ * access with fewer characters.
+ */
+ struct repository *repo;
+ struct rev_info *revs;
+ struct path_walk_info *info;
+
+ /**
+ * Map a path to a 'struct type_and_oid_list'
+ * containing the objects discovered at that
+ * path.
+ */
+ struct strmap paths_to_lists;
+
+ /**
+ * Store the current list of paths in a stack, to
+ * facilitate depth-first-search without recursion.
+ */
+ struct string_list path_stack;
+};
+
+static int add_children(struct path_walk_context *ctx,
+ const char *base_path,
+ struct object_id *oid)
+{
+ struct tree_desc desc;
+ struct name_entry entry;
+ struct strbuf path = STRBUF_INIT;
+ size_t base_len;
+ struct tree *tree = lookup_tree(ctx->repo, oid);
+
+ if (!tree) {
+ error(_("failed to walk children of tree %s: not found"),
+ oid_to_hex(oid));
+ return -1;
+ } else if (parse_tree_gently(tree, 1)) {
+ die("bad tree object %s", oid_to_hex(oid));
+ }
+
+ strbuf_addstr(&path, base_path);
+ base_len = path.len;
+
+ parse_tree(tree);
+ init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
+ while (tree_entry(&desc, &entry)) {
+ struct type_and_oid_list *list;
+ struct object *o;
+ /* Not actually true, but we will ignore submodules later. */
+ enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB;
+
+ /* Skip submodules. */
+ if (S_ISGITLINK(entry.mode))
+ continue;
+
+ if (type == OBJ_TREE) {
+ struct tree *child = lookup_tree(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else if (type == OBJ_BLOB) {
+ struct blob *child = lookup_blob(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else {
+ /* Wrong type? */
+ continue;
+ }
+
+ if (!o) /* report error?*/
+ continue;
+
+ /* Skip this object if already seen. */
+ if (o->flags & SEEN)
+ continue;
+ o->flags |= SEEN;
+
+ strbuf_setlen(&path, base_len);
+ strbuf_add(&path, entry.path, entry.pathlen);
+
+ /*
+ * Trees will end with "/" for concatenation and distinction
+ * from blobs at the same path.
+ */
+ if (type == OBJ_TREE)
+ strbuf_addch(&path, '/');
+
+ if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = type;
+ strmap_put(&ctx->paths_to_lists, path.buf, list);
+ string_list_append(&ctx->path_stack, path.buf);
+ }
+ oid_array_append(&list->oids, &entry.oid);
+ }
+
+ free_tree_buffer(tree);
+ strbuf_release(&path);
+ return 0;
+}
+
+/*
+ * Emit the batch of objects stored for 'path' and, if the batch holds
+ * trees, walk them one level deeper, adding each not-yet-seen child to
+ * the batch for its path.
+ */
+static int walk_path(struct path_walk_context *ctx,
+ const char *path)
+{
+ struct type_and_oid_list *list;
+ int ret = 0;
+
+ list = strmap_get(&ctx->paths_to_lists, path);
+
+ /* Evaluate function pointer on this data. */
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
+
+ /* Expand data for children. */
+ if (list->type == OBJ_TREE) {
+ for (size_t i = 0; i < list->oids.nr; i++) {
+ ret |= add_children(ctx,
+ path,
+ &list->oids.oid[i]);
+ }
+ }
+
+ oid_array_clear(&list->oids);
+ strmap_remove(&ctx->paths_to_lists, path, 1);
+ return ret;
+}
+
+static void clear_strmap(struct strmap *map)
+{
+ struct hashmap_iter iter;
+ struct strmap_entry *e;
+
+ hashmap_for_each_entry(&map->map, &iter, e, ent) {
+ struct type_and_oid_list *list = e->value;
+ oid_array_clear(&list->oids);
+ }
+ strmap_clear(map, 1);
+ strmap_init(map);
+}
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info)
+{
+ const char *root_path = "";
+ int ret = 0;
+ size_t commits_nr = 0, paths_nr = 0;
+ struct commit *c;
+ struct type_and_oid_list *root_tree_list;
+ struct path_walk_context ctx = {
+ .repo = info->revs->repo,
+ .revs = info->revs,
+ .info = info,
+ .path_stack = STRING_LIST_INIT_DUP,
+ .paths_to_lists = STRMAP_INIT
+ };
+
+ trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+
+ /* Insert a single list for the root tree into the paths. */
+ CALLOC_ARRAY(root_tree_list, 1);
+ root_tree_list->type = OBJ_TREE;
+ strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+
+ if (prepare_revision_walk(info->revs))
+ die(_("failed to setup revision walk"));
+
+ while ((c = get_revision(info->revs))) {
+ struct object_id *oid = get_commit_tree_oid(c);
+ struct tree *t = lookup_tree(info->revs->repo, oid);
+ commits_nr++;
+
+ if (t) {
+ if (t->object.flags & SEEN)
+ continue;
+ t->object.flags |= SEEN;
+ oid_array_append(&root_tree_list->oids, oid);
+ } else {
+ warning("could not find tree %s", oid_to_hex(oid));
+ }
+ }
+
+ trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
+ trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+
+ string_list_append(&ctx.path_stack, root_path);
+
+ trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
+ trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
+ clear_strmap(&ctx.paths_to_lists);
+ string_list_clear(&ctx.path_stack, 0);
+ return ret;
+}
diff --git a/path-walk.h b/path-walk.h
new file mode 100644
index 00000000000..c9e94a98bc8
--- /dev/null
+++ b/path-walk.h
@@ -0,0 +1,43 @@
+/*
+ * path-walk.h : Methods and structures for walking the object graph in batches
+ * by the paths that can reach those objects.
+ */
+#include "object.h" /* Required for 'enum object_type'. */
+
+struct rev_info;
+struct oid_array;
+
+/**
+ * The type of a function pointer for the method that is called on a list of
+ * objects reachable at a given path.
+ */
+typedef int (*path_fn)(const char *path,
+ struct oid_array *oids,
+ enum object_type type,
+ void *data);
+
+struct path_walk_info {
+ /**
+ * revs provides the definitions for the commit walk, including
+ * which commits are UNINTERESTING or not.
+ */
+ struct rev_info *revs;
+
+ /**
+ * The caller wishes to execute custom logic on objects reachable at a
+ * given path. Every reachable object will be visited exactly once, and
+ * the first path to see an object wins. This may not be a stable choice.
+ */
+ path_fn path_fn;
+ void *path_fn_data;
+};
+
+#define PATH_WALK_INFO_INIT { 0 }
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info);
--
gitgitgadget
* [PATCH 02/17] t6601: add helper for testing path-walk API
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 3 +-
Makefile | 1 +
t/helper/test-path-walk.c | 86 ++++++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t6601-path-walk.sh | 130 ++++++++++++++++++++++
6 files changed, 221 insertions(+), 1 deletion(-)
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/t6601-path-walk.sh
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 6472222ae6d..e588897ab8d 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -51,4 +51,5 @@ commits are emitted.
Examples
--------
-See example usages in future changes.
+See example usages in:
+ `t/helper/test-path-walk.c`
diff --git a/Makefile b/Makefile
index d0d8d6888e3..50413d96492 100644
--- a/Makefile
+++ b/Makefile
@@ -818,6 +818,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
TEST_BUILTINS_OBJS += test-partial-clone.o
TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-path-walk.o
TEST_BUILTINS_OBJS += test-pcre2-config.o
TEST_BUILTINS_OBJS += test-pkt-line.o
TEST_BUILTINS_OBJS += test-proc-receive.o
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
new file mode 100644
index 00000000000..3c48f017fa0
--- /dev/null
+++ b/t/helper/test-path-walk.c
@@ -0,0 +1,86 @@
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "test-tool.h"
+#include "environment.h"
+#include "hex.h"
+#include "object-name.h"
+#include "object.h"
+#include "pretty.h"
+#include "revision.h"
+#include "setup.h"
+#include "parse-options.h"
+#include "path-walk.h"
+#include "oid-array.h"
+
+static const char * const path_walk_usage[] = {
+ N_("test-tool path-walk <options> -- <revision-options>"),
+ NULL
+};
+
+struct path_walk_test_data {
+ uintmax_t tree_nr;
+ uintmax_t blob_nr;
+};
+
+static int emit_block(const char *path, struct oid_array *oids,
+ enum object_type type, void *data)
+{
+ struct path_walk_test_data *tdata = data;
+ const char *typestr;
+
+ switch (type) {
+ case OBJ_TREE:
+ typestr = "TREE";
+ tdata->tree_nr += oids->nr;
+ break;
+
+ case OBJ_BLOB:
+ typestr = "BLOB";
+ tdata->blob_nr += oids->nr;
+ break;
+
+ default:
+ BUG("we do not understand this type");
+ }
+
+ for (size_t i = 0; i < oids->nr; i++)
+ printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));
+
+ return 0;
+}
+
+int cmd__path_walk(int argc, const char **argv)
+{
+ int res;
+ struct rev_info revs = REV_INFO_INIT;
+ struct path_walk_info info = PATH_WALK_INFO_INIT;
+ struct path_walk_test_data data = { 0 };
+ struct option options[] = {
+ OPT_END(),
+ };
+
+ initialize_repository(the_repository);
+ setup_git_directory();
+ revs.repo = the_repository;
+
+ argc = parse_options(argc, argv, NULL,
+ options, path_walk_usage,
+ PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
+
+ if (argc > 1)
+ setup_revisions(argc, argv, &revs, NULL);
+ else
+ usage(path_walk_usage[0]);
+
+ info.revs = &revs;
+ info.path_fn = emit_block;
+ info.path_fn_data = &data;
+
+ res = walk_objects_by_path(&info);
+
+ printf("trees:%" PRIuMAX "\n"
+ "blobs:%" PRIuMAX "\n",
+ data.tree_nr, data.blob_nr);
+
+ return res;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 1ebb69a5dc4..43676e7b93a 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -52,6 +52,7 @@ static struct test_cmd cmds[] = {
{ "parse-subcommand", cmd__parse_subcommand },
{ "partial-clone", cmd__partial_clone },
{ "path-utils", cmd__path_utils },
+ { "path-walk", cmd__path_walk },
{ "pcre2-config", cmd__pcre2_config },
{ "pkt-line", cmd__pkt_line },
{ "proc-receive", cmd__proc_receive },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..9cfc5da6e57 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -45,6 +45,7 @@ int cmd__parse_pathspec_file(int argc, const char** argv);
int cmd__parse_subcommand(int argc, const char **argv);
int cmd__partial_clone(int argc, const char **argv);
int cmd__path_utils(int argc, const char **argv);
+int cmd__path_walk(int argc, const char **argv);
int cmd__pcre2_config(int argc, const char **argv);
int cmd__pkt_line(int argc, const char **argv);
int cmd__proc_receive(int argc, const char **argv);
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
new file mode 100755
index 00000000000..ca18b61c3f1
--- /dev/null
+++ b/t/t6601-path-walk.sh
@@ -0,0 +1,130 @@
+#!/bin/sh
+
+test_description='direct path-walk API tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repository' '
+ git checkout -b base &&
+
+ mkdir left &&
+ mkdir right &&
+ echo a >a &&
+ echo b >left/b &&
+ echo c >right/c &&
+ git add . &&
+ git commit -m "first" &&
+
+ echo d >right/d &&
+ git add right &&
+ git commit -m "second" &&
+
+ echo bb >left/b &&
+ git commit -a -m "third" &&
+
+ git checkout -b topic HEAD~1 &&
+ echo cc >right/c &&
+ git commit -a -m "topic"
+'
+
+test_expect_success 'all' '
+ test-tool path-walk -- --all >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE::$(git rev-parse base~2^{tree})
+ TREE:left/:$(git rev-parse base:left)
+ TREE:left/:$(git rev-parse base~2:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~2:right)
+ trees:9
+ BLOB:a:$(git rev-parse base~2:a)
+ BLOB:left/b:$(git rev-parse base~2:left/b)
+ BLOB:left/b:$(git rev-parse base:left/b)
+ BLOB:right/c:$(git rev-parse base~2:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ blobs:6
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_expect_success 'topic only' '
+ test-tool path-walk -- topic >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE::$(git rev-parse base~2^{tree})
+ TREE:left/:$(git rev-parse base~2:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~2:right)
+ trees:7
+ BLOB:a:$(git rev-parse base~2:a)
+ BLOB:left/b:$(git rev-parse base~2:left/b)
+ BLOB:right/c:$(git rev-parse base~2:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ blobs:5
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_expect_success 'topic, not base' '
+ test-tool path-walk -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE:left/:$(git rev-parse topic:left)
+ TREE:right/:$(git rev-parse topic:right)
+ trees:3
+ BLOB:a:$(git rev-parse topic:a)
+ BLOB:left/b:$(git rev-parse topic:left/b)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse topic:right/d)
+ blobs:4
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_expect_success 'topic, not base, boundary' '
+ test-tool path-walk -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE:left/:$(git rev-parse base~1:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ trees:5
+ BLOB:a:$(git rev-parse base~1:a)
+ BLOB:left/b:$(git rev-parse base~1:left/b)
+ BLOB:right/c:$(git rev-parse base~1:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ blobs:5
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_done
--
gitgitgadget
* [PATCH 03/17] path-walk: allow consumer to specify object types
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <derrickstolee@github.com>
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.
This adds the ability to ask for the commits in a list, as well. Future
changes will add the ability to visit annotated tags.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
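Note for readers (not part of the commit): the filtering rules in this patch
can be summarized in a small standalone C sketch. The names `walk_opts`,
`should_emit()` and `must_walk_trees()` are hypothetical. They only model the
`commits`, `trees` and `blobs` members and are not functions in path-walk.c.

```c
#include <assert.h>

enum obj_type { T_COMMIT, T_TREE, T_BLOB };

/* Hypothetical stand-in for the three new members of 'struct path_walk_info'. */
struct walk_opts {
	int commits;
	int trees;
	int blobs;
};

/* As in PATH_WALK_INFO_INIT, every type is enabled by default. */
#define WALK_OPTS_INIT { .commits = 1, .trees = 1, .blobs = 1 }

/* Should the callback run for a batch of this type? */
static int should_emit(const struct walk_opts *o, enum obj_type type)
{
	switch (type) {
	case T_COMMIT:
		return o->commits;
	case T_TREE:
		return o->trees;
	case T_BLOB:
		return o->blobs;
	}
	return 0;
}

/*
 * Trees are still expanded when only blobs are wanted, since blobs are
 * found through trees. The tree walk is skipped only when neither trees
 * nor blobs are requested.
 */
static int must_walk_trees(const struct walk_opts *o)
{
	return o->trees || o->blobs;
}
```

With `trees` and `commits` disabled, the model still walks trees to reach
blobs but only reports blob batches, which matches the 'only blobs' test
added in this patch.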
Documentation/technical/api-path-walk.txt | 9 +++
path-walk.c | 39 ++++++++++--
path-walk.h | 13 +++-
t/helper/test-path-walk.c | 17 +++++-
t/t6601-path-walk.sh | 72 +++++++++++++++++++++++
5 files changed, 141 insertions(+), 9 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index e588897ab8d..b7ae476ea0a 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -48,6 +48,15 @@ If you want the path-walk API to emit `UNINTERESTING` objects based on the
commit walk's boundary, be sure to set `revs.boundary` so the boundary
commits are emitted.
+`commits`, `blobs`, `trees`::
+ By default, these members are enabled and signal that the path-walk
+ API should call the `path_fn` on objects of these types. Specialized
+ applications could disable some options to make it simpler to walk
+ the objects or to have fewer calls to `path_fn`.
++
+While it is possible to walk only commits in this way, consumers would be
+better off using the revision walk API instead.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index 66840187e28..22e1aa13f31 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -84,6 +84,10 @@ static int add_children(struct path_walk_context *ctx,
if (S_ISGITLINK(entry.mode))
continue;
+ /* If the caller doesn't want blobs, then don't bother. */
+ if (!ctx->info->blobs && type == OBJ_BLOB)
+ continue;
+
if (type == OBJ_TREE) {
struct tree *child = lookup_tree(ctx->repo, &entry.oid);
o = child ? &child->object : NULL;
@@ -140,9 +144,11 @@ static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
- /* Evaluate function pointer on this data. */
- ret = ctx->info->path_fn(path, &list->oids, list->type,
- ctx->info->path_fn_data);
+ /* Evaluate function pointer on this data, if requested. */
+ if ((list->type == OBJ_TREE && ctx->info->trees) ||
+ (list->type == OBJ_BLOB && ctx->info->blobs))
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
/* Expand data for children. */
if (list->type == OBJ_TREE) {
@@ -184,6 +190,7 @@ int walk_objects_by_path(struct path_walk_info *info)
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
struct type_and_oid_list *root_tree_list;
+ struct type_and_oid_list *commit_list;
struct path_walk_context ctx = {
.repo = info->revs->repo,
.revs = info->revs,
@@ -194,19 +201,32 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+ CALLOC_ARRAY(commit_list, 1);
+ commit_list->type = OBJ_COMMIT;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
-
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
while ((c = get_revision(info->revs))) {
- struct object_id *oid = get_commit_tree_oid(c);
- struct tree *t = lookup_tree(info->revs->repo, oid);
+ struct object_id *oid;
+ struct tree *t;
commits_nr++;
+ if (info->commits)
+ oid_array_append(&commit_list->oids,
+ &c->object.oid);
+
+ /* If we only care about commits, then skip trees. */
+ if (!info->trees && !info->blobs)
+ continue;
+
+ oid = get_commit_tree_oid(c);
+ t = lookup_tree(info->revs->repo, oid);
+
if (t) {
if (t->object.flags & SEEN)
continue;
@@ -220,6 +240,13 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+ /* Track all commits. */
+ if (info->commits)
+ ret = info->path_fn("", &commit_list->oids, OBJ_COMMIT,
+ info->path_fn_data);
+ oid_array_clear(&commit_list->oids);
+ free(commit_list);
+
string_list_append(&ctx.path_stack, root_path);
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index c9e94a98bc8..6ef372d8942 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -30,9 +30,20 @@ struct path_walk_info {
*/
path_fn path_fn;
void *path_fn_data;
+ /**
+ * Initialize which object types the path_fn should be called on. This
+ * could also limit the walk to skip blobs if not set.
+ */
+ int commits;
+ int trees;
+ int blobs;
};
-#define PATH_WALK_INFO_INIT { 0 }
+#define PATH_WALK_INFO_INIT { \
+ .blobs = 1, \
+ .trees = 1, \
+ .commits = 1, \
+}
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 3c48f017fa0..37c5e3e31e8 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -18,6 +18,7 @@ static const char * const path_walk_usage[] = {
};
struct path_walk_test_data {
+ uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
};
@@ -29,6 +30,11 @@ static int emit_block(const char *path, struct oid_array *oids,
const char *typestr;
switch (type) {
+ case OBJ_COMMIT:
+ typestr = "COMMIT";
+ tdata->commit_nr += oids->nr;
+ break;
+
case OBJ_TREE:
typestr = "TREE";
tdata->tree_nr += oids->nr;
@@ -56,6 +62,12 @@ int cmd__path_walk(int argc, const char **argv)
struct path_walk_info info = PATH_WALK_INFO_INIT;
struct path_walk_test_data data = { 0 };
struct option options[] = {
+ OPT_BOOL(0, "blobs", &info.blobs,
+ N_("toggle inclusion of blob objects")),
+ OPT_BOOL(0, "commits", &info.commits,
+ N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "trees", &info.trees,
+ N_("toggle inclusion of tree objects")),
OPT_END(),
};
@@ -78,9 +90,10 @@ int cmd__path_walk(int argc, const char **argv)
res = walk_objects_by_path(&info);
- printf("trees:%" PRIuMAX "\n"
+ printf("commits:%" PRIuMAX "\n"
+ "trees:%" PRIuMAX "\n"
"blobs:%" PRIuMAX "\n",
- data.tree_nr, data.blob_nr);
+ data.commit_nr, data.tree_nr, data.blob_nr);
return res;
}
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index ca18b61c3f1..e4788664f93 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -31,6 +31,11 @@ test_expect_success 'all' '
test-tool path-walk -- --all >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base)
+ COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~2)
+ commits:4
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base^{tree})
TREE::$(git rev-parse base~1^{tree})
@@ -60,6 +65,10 @@ test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~2)
+ commits:3
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE::$(git rev-parse base~2^{tree})
@@ -86,6 +95,8 @@ test_expect_success 'topic, not base' '
test-tool path-walk -- topic --not base >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ commits:1
TREE::$(git rev-parse topic^{tree})
TREE:left/:$(git rev-parse topic:left)
TREE:right/:$(git rev-parse topic:right)
@@ -103,10 +114,71 @@ test_expect_success 'topic, not base' '
test_cmp expect.sorted out.sorted
'
+test_expect_success 'topic, not base, only blobs' '
+ test-tool path-walk --no-trees --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ trees:0
+ BLOB:a:$(git rev-parse topic:a)
+ BLOB:left/b:$(git rev-parse topic:left/b)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse topic:right/d)
+ blobs:4
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+# No, this doesn't make a lot of sense for the path-walk API,
+# but it is possible to do.
+test_expect_success 'topic, not base, only commits' '
+ test-tool path-walk --no-blobs --no-trees \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ commits:1
+ trees:0
+ blobs:0
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_expect_success 'topic, not base, only trees' '
+ test-tool path-walk --no-blobs --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ TREE::$(git rev-parse topic^{tree})
+ TREE:left/:$(git rev-parse topic:left)
+ TREE:right/:$(git rev-parse topic:right)
+ trees:3
+ blobs:0
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
test_expect_success 'topic, not base, boundary' '
test-tool path-walk -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base~1)
+ commits:2
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE:left/:$(git rev-parse base~1:left)
--
gitgitgadget
* [PATCH 04/17] path-walk: allow visiting tags
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (2 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 03/17] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:11 ` [PATCH 05/17] revision: create mark_trees_uninteresting_dense() Derrick Stolee via GitGitGadget
` (13 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In anticipation of using the path-walk API to analyze tags or include
them in a pack-file, add the ability to walk the tags that were included
in the revision walk.
When these tag objects point to blobs or trees, we need to make sure
those objects are also visited. Treat tagged trees as root trees, but
put the tagged blobs in their own category.
Be careful about objects that are referred to by multiple references.
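The routing described above can be sketched in isolation. This is a toy model under stated assumptions, not the code in this patch: `toy_object`, `toy_peel()` and `toy_bucket()` are hypothetical names, while the real change works on `struct object` and `oid_array`. It only shows the shape of the logic: peel nested tags while counting them, then pick a bucket by the type of the peeled object.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Git's object types. */
enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };

struct toy_object {
	enum toy_type type;
	const struct toy_object *tagged; /* target; only used for TOY_TAG */
};

/* Follow a chain of tags to the first non-tag, counting each tag seen. */
static const struct toy_object *toy_peel(const struct toy_object *obj,
					 size_t *tags_seen)
{
	assert(obj != NULL && tags_seen != NULL);
	while (obj->type == TOY_TAG) {
		(*tags_seen)++;
		obj = obj->tagged;
		assert(obj != NULL);
	}
	return obj;
}

/*
 * Pick the path bucket for a peeled object: tagged trees join the root
 * trees (""), tagged blobs get their own category, and commits are left
 * to the commit walk (NULL here).
 */
static const char *toy_bucket(const struct toy_object *peeled)
{
	switch (peeled->type) {
	case TOY_TREE:
		return "";
	case TOY_BLOB:
		return "/tagged-blobs";
	default:
		return NULL;
	}
}
```

A tag of a tag of a blob therefore counts two tags and lands the blob in the "/tagged-blobs" bucket. De-duplicating objects reached through several references is left out of the sketch.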
Co-authored-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 2 +-
path-walk.c | 76 ++++++++++++++++++++
path-walk.h | 2 +
t/helper/test-path-walk.c | 13 +++-
t/t6601-path-walk.sh | 85 +++++++++++++++++++++--
5 files changed, 170 insertions(+), 8 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index b7ae476ea0a..5fea1d1db17 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -48,7 +48,7 @@ If you want the path-walk API to emit `UNINTERESTING` objects based on the
commit walk's boundary, be sure to set `revs.boundary` so the boundary
commits are emitted.
-`commits`, `blobs`, `trees`::
+`commits`, `blobs`, `trees`, `tags`::
By default, these members are enabled and signal that the path-walk
API should call the `path_fn` on objects of these types. Specialized
applications could disable some options to make it simpler to walk
diff --git a/path-walk.c b/path-walk.c
index 22e1aa13f31..7cd461adf47 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -13,6 +13,7 @@
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
+#include "tag.h"
#include "trace2.h"
#include "tree.h"
#include "tree-walk.h"
@@ -204,13 +205,88 @@ int walk_objects_by_path(struct path_walk_info *info)
CALLOC_ARRAY(commit_list, 1);
commit_list->type = OBJ_COMMIT;
+ if (info->tags)
+ info->revs->tag_objects = 1;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+
+ /*
+ * Set these values before preparing the walk to catch
+ * lightweight tags pointing to non-commits.
+ */
+ info->revs->blob_objects = info->blobs;
+ info->revs->tree_objects = info->trees;
+
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ info->revs->blob_objects = info->revs->tree_objects = 0;
+
+ if (info->tags) {
+ struct oid_array tagged_blob_list = OID_ARRAY_INIT;
+ struct oid_array tags = OID_ARRAY_INIT;
+
+ trace2_region_enter("path-walk", "tag-walk", info->revs->repo);
+
+ /*
+ * Walk any pending objects at this point, but they should only
+ * be tags.
+ */
+ for (size_t i = 0; i < info->revs->pending.nr; i++) {
+ struct object_array_entry *pending = info->revs->pending.objects + i;
+ struct object *obj = pending->item;
+
+ if (obj->type == OBJ_COMMIT || obj->flags & SEEN)
+ continue;
+
+ obj->flags |= SEEN;
+
+ while (obj->type == OBJ_TAG) {
+ struct tag *tag = lookup_tag(info->revs->repo,
+ &obj->oid);
+ if (oid_array_lookup(&tags, &obj->oid) < 0)
+ oid_array_append(&tags, &obj->oid);
+ obj = tag->tagged;
+ }
+
+ switch (obj->type) {
+ case OBJ_TREE:
+ if (info->trees &&
+ oid_array_lookup(&root_tree_list->oids, &obj->oid) < 0)
+ oid_array_append(&root_tree_list->oids, &obj->oid);
+ break;
+
+ case OBJ_BLOB:
+ if (info->blobs &&
+ oid_array_lookup(&tagged_blob_list, &obj->oid) < 0)
+ oid_array_append(&tagged_blob_list, &obj->oid);
+ break;
+
+ case OBJ_COMMIT:
+ /* Make sure it is in the object walk */
+ add_pending_object(info->revs, obj, "");
+ break;
+
+ default:
+ BUG("should not see any other type here");
+ }
+ }
+
+ info->path_fn("", &tags, OBJ_TAG, info->path_fn_data);
+
+ if (tagged_blob_list.nr && info->blobs)
+ info->path_fn("/tagged-blobs", &tagged_blob_list, OBJ_BLOB,
+ info->path_fn_data);
+
+ trace2_data_intmax("path-walk", ctx.repo, "tags", tags.nr);
+ trace2_region_leave("path-walk", "tag-walk", info->revs->repo);
+ oid_array_clear(&tags);
+ oid_array_clear(&tagged_blob_list);
+ }
+
while ((c = get_revision(info->revs))) {
struct object_id *oid;
struct tree *t;
diff --git a/path-walk.h b/path-walk.h
index 6ef372d8942..3f3b63180ef 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -37,12 +37,14 @@ struct path_walk_info {
int commits;
int trees;
int blobs;
+ int tags;
};
#define PATH_WALK_INFO_INIT { \
.blobs = 1, \
.trees = 1, \
.commits = 1, \
+ .tags = 1, \
}
/**
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 37c5e3e31e8..c6c60d68749 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -21,6 +21,7 @@ struct path_walk_test_data {
uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
+ uintmax_t tag_nr;
};
static int emit_block(const char *path, struct oid_array *oids,
@@ -45,6 +46,11 @@ static int emit_block(const char *path, struct oid_array *oids,
tdata->blob_nr += oids->nr;
break;
+ case OBJ_TAG:
+ typestr = "TAG";
+ tdata->tag_nr += oids->nr;
+ break;
+
default:
BUG("we do not understand this type");
}
@@ -66,6 +72,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of blob objects")),
OPT_BOOL(0, "commits", &info.commits,
N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "tags", &info.tags,
+ N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
OPT_END(),
@@ -92,8 +100,9 @@ int cmd__path_walk(int argc, const char **argv)
printf("commits:%" PRIuMAX "\n"
"trees:%" PRIuMAX "\n"
- "blobs:%" PRIuMAX "\n",
- data.commit_nr, data.tree_nr, data.blob_nr);
+ "blobs:%" PRIuMAX "\n"
+ "tags:%" PRIuMAX "\n",
+ data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
return res;
}
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index e4788664f93..7758e2529ee 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -7,24 +7,55 @@ test_description='direct path-walk API tests'
test_expect_success 'setup test repository' '
git checkout -b base &&
+ # Make some objects that will only be reachable
+ # via non-commit tags.
+ mkdir child &&
+ echo file >child/file &&
+ git add child &&
+ git commit -m "will abandon" &&
+ git tag -a -m "tree" tree-tag HEAD^{tree} &&
+ echo file2 >file2 &&
+ git add file2 &&
+ git commit --amend -m "will abandon" &&
+ git tag tree-tag2 HEAD^{tree} &&
+
+ echo blob >file &&
+ blob_oid=$(git hash-object -t blob -w --stdin <file) &&
+ git tag -a -m "blob" blob-tag "$blob_oid" &&
+ echo blob2 >file2 &&
+ blob2_oid=$(git hash-object -t blob -w --stdin <file2) &&
+ git tag blob-tag2 "$blob2_oid" &&
+
+ rm -fr child file file2 &&
+
mkdir left &&
mkdir right &&
echo a >a &&
echo b >left/b &&
echo c >right/c &&
git add . &&
- git commit -m "first" &&
+ git commit --amend -m "first" &&
+ git tag -m "first" first HEAD &&
echo d >right/d &&
git add right &&
git commit -m "second" &&
+ git tag -a -m "second (under)" second.1 HEAD &&
+ git tag -a -m "second (top)" second.2 second.1 &&
+ # Set up file/dir collision in history.
+ rm a &&
+ mkdir a &&
+ echo a >a/a &&
echo bb >left/b &&
- git commit -a -m "third" &&
+ git add a left &&
+ git commit -m "third" &&
+ git tag -a -m "third" third &&
git checkout -b topic HEAD~1 &&
echo cc >right/c &&
- git commit -a -m "topic"
+ git commit -a -m "topic" &&
+ git tag -a -m "fourth" fourth
'
test_expect_success 'all' '
@@ -40,19 +71,35 @@ test_expect_success 'all' '
TREE::$(git rev-parse base^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE::$(git rev-parse base~2^{tree})
+ TREE::$(git rev-parse refs/tags/tree-tag^{})
+ TREE::$(git rev-parse refs/tags/tree-tag2^{})
+ TREE:a/:$(git rev-parse base:a)
TREE:left/:$(git rev-parse base:left)
TREE:left/:$(git rev-parse base~2:left)
TREE:right/:$(git rev-parse topic:right)
TREE:right/:$(git rev-parse base~1:right)
TREE:right/:$(git rev-parse base~2:right)
- trees:9
+ TREE:child/:$(git rev-parse refs/tags/tree-tag^{}:child)
+ trees:13
BLOB:a:$(git rev-parse base~2:a)
+ BLOB:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
BLOB:left/b:$(git rev-parse base~2:left/b)
BLOB:left/b:$(git rev-parse base:left/b)
BLOB:right/c:$(git rev-parse base~2:right/c)
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
- blobs:6
+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ BLOB:child/file:$(git rev-parse refs/tags/tree-tag^{}:child/file)
+ blobs:10
+ TAG::$(git rev-parse refs/tags/first)
+ TAG::$(git rev-parse refs/tags/second.1)
+ TAG::$(git rev-parse refs/tags/second.2)
+ TAG::$(git rev-parse refs/tags/third)
+ TAG::$(git rev-parse refs/tags/fourth)
+ TAG::$(git rev-parse refs/tags/tree-tag)
+ TAG::$(git rev-parse refs/tags/blob-tag)
+ tags:7
EOF
sort expect >expect.sorted &&
@@ -83,6 +130,7 @@ test_expect_success 'topic only' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
blobs:5
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -106,6 +154,7 @@ test_expect_success 'topic, not base' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse topic:right/d)
blobs:4
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -126,6 +175,7 @@ test_expect_success 'topic, not base, only blobs' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse topic:right/d)
blobs:4
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -145,6 +195,7 @@ test_expect_success 'topic, not base, only commits' '
commits:1
trees:0
blobs:0
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -164,6 +215,7 @@ test_expect_success 'topic, not base, only trees' '
TREE:right/:$(git rev-parse topic:right)
trees:3
blobs:0
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -191,6 +243,7 @@ test_expect_success 'topic, not base, boundary' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
blobs:5
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -199,4 +252,26 @@ test_expect_success 'topic, not base, boundary' '
test_cmp expect.sorted out.sorted
'
+test_expect_success 'trees are reported exactly once' '
+ test_when_finished "rm -rf unique-trees" &&
+ test_create_repo unique-trees &&
+ (
+ cd unique-trees &&
+ mkdir initial &&
+ test_commit initial/file &&
+
+ git switch -c move-to-top &&
+ git mv initial/file.t ./ &&
+ test_tick &&
+ git commit -m moved &&
+
+ git update-ref refs/heads/other HEAD
+ ) &&
+
+ test-tool -C unique-trees path-walk -- --all >out &&
+ tree=$(git -C unique-trees rev-parse HEAD:) &&
+ grep "$tree" out >out-filtered &&
+ test_line_count = 1 out-filtered
+'
+
test_done
--
gitgitgadget
* [PATCH 05/17] revision: create mark_trees_uninteresting_dense()
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (3 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 04/17] path-walk: allow visiting tags Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:11 ` [PATCH 06/17] path-walk: add prune_all_uninteresting option Derrick Stolee via GitGitGadget
` (12 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The sparse tree walk algorithm was created in d5d2e93577e (revision:
implement sparse algorithm, 2019-01-16) and involves using the
mark_trees_uninteresting_sparse() method. This method takes a repository
and an oidset of tree IDs, some of which have the UNINTERESTING flag and
some of which do not.
Create a method that has an equivalent set of preconditions but uses a
"dense" walk (recursively visits all reachable trees, as long as they
have not previously been marked UNINTERESTING). This is an important
difference from mark_tree_uninteresting(), which short-circuits if the
given tree has the UNINTERESTING flag.
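The difference can be sketched with a toy tree model. The names below (`toy_tree`, `toy_mark_contents()`, `toy_mark_dense()`) are hypothetical and only model the behavior described here; they are not Git's data structures. The point is that a root that already carries the flag still has its contents visited, while the recursion stops below children that were flagged earlier.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UNINTERESTING (1u << 0)

/* Hypothetical tree node: a flag word plus child subtrees. */
struct toy_tree {
	unsigned flags;
	struct toy_tree **children;
	size_t nr;
};

/* Flag every descendant of t, stopping below children already flagged. */
static void toy_mark_contents(struct toy_tree *t)
{
	for (size_t i = 0; i < t->nr; i++) {
		struct toy_tree *child = t->children[i];

		if (child->flags & TOY_UNINTERESTING)
			continue;
		child->flags |= TOY_UNINTERESTING;
		toy_mark_contents(child);
	}
}

/*
 * Dense pass: for each root that is already flagged, push the flag down
 * to everything reachable from it. Unflagged roots are left alone.
 */
static void toy_mark_dense(struct toy_tree **roots, size_t nr)
{
	assert(roots != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++)
		if (roots[i]->flags & TOY_UNINTERESTING)
			toy_mark_contents(roots[i]);
}
```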
A use of this method will be added in a later change, with a condition
that selects whether the sparse or the dense approach is used.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
revision.c | 15 +++++++++++++++
revision.h | 1 +
2 files changed, 16 insertions(+)
diff --git a/revision.c b/revision.c
index 2d7ad2bddff..bdc312f1538 100644
--- a/revision.c
+++ b/revision.c
@@ -219,6 +219,21 @@ static void add_children_by_path(struct repository *r,
free_tree_buffer(tree);
}
+void mark_trees_uninteresting_dense(struct repository *r,
+ struct oidset *trees)
+{
+ struct object_id *oid;
+ struct oidset_iter iter;
+
+ oidset_iter_init(trees, &iter);
+ while ((oid = oidset_iter_next(&iter))) {
+ struct tree *tree = lookup_tree(r, oid);
+
+ if (tree && (tree->object.flags & UNINTERESTING))
+ mark_tree_contents_uninteresting(r, tree);
+ }
+}
+
void mark_trees_uninteresting_sparse(struct repository *r,
struct oidset *trees)
{
diff --git a/revision.h b/revision.h
index 71e984c452b..8938b2db112 100644
--- a/revision.h
+++ b/revision.h
@@ -487,6 +487,7 @@ void put_revision_mark(const struct rev_info *revs,
void mark_parents_uninteresting(struct rev_info *revs, struct commit *commit);
void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_dense(struct repository *r, struct oidset *trees);
void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees);
void show_object_with_name(FILE *, struct object *, const char *);
--
gitgitgadget
* [PATCH 06/17] path-walk: add prune_all_uninteresting option
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (4 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 05/17] revision: create mark_trees_uninteresting_dense() Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:11 ` [PATCH 07/17] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
` (11 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
This option causes the path-walk API to act like the sparse tree-walk
algorithm implemented by mark_trees_uninteresting_sparse() in
revision.c.
Starting from the commits marked as UNINTERESTING, their root trees and
all objects reachable from those trees are UNINTERESTING, at least as we
walk path-by-path. When we reach a path where all objects associated
with that path are marked UNINTERESTING, then do not continue walking the
children of that path.
We need to be careful to pass the UNINTERESTING flag in a deep way on
the UNINTERESTING objects before we start the path-walk, or else the
depth-first search for the path-walk API may accidentally report some
objects as interesting.
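The pruning decision for a single path can be sketched as follows. This is a minimal illustration under assumptions: the flag array stands in for the objects collected at one path, and `toy_path_maybe_interesting()` and `toy_should_visit()` are hypothetical helpers, not functions added by this patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UNINTERESTING (1u << 0)

/* Returns 1 if at least one object at this path is still interesting. */
static int toy_path_maybe_interesting(const unsigned *flags, size_t nr)
{
	assert(flags != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++)
		if (!(flags[i] & TOY_UNINTERESTING))
			return 1;
	return 0;
}

/*
 * Returns 1 if the walk should emit this path and descend into it.
 * Without pruning, every path is visited; with pruning, a path whose
 * objects are all UNINTERESTING is skipped along with its children.
 */
static int toy_should_visit(int prune_all_uninteresting,
			    const unsigned *flags, size_t nr)
{
	if (!prune_all_uninteresting)
		return 1;
	return toy_path_maybe_interesting(flags, nr);
}
```

With pruning enabled, a path holding only boundary objects is dropped, while a path that mixes boundary and new objects is kept whole, so the boundary objects stay available alongside the new ones.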
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 8 +++
path-walk.c | 64 ++++++++++++++++++++++-
path-walk.h | 8 +++
t/helper/test-path-walk.c | 10 +++-
t/t6601-path-walk.sh | 40 +++++++++++---
5 files changed, 118 insertions(+), 12 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 5fea1d1db17..c51f92cd649 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -57,6 +57,14 @@ commits are emitted.
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.
+`prune_all_uninteresting`::
+ By default, all reachable paths are emitted by the path-walk API.
+ This option allows consumers to declare that they are not
+ interested in paths where all included objects are marked with the
+ `UNINTERESTING` flag. This requires using the `boundary` option in
+ the revision walk so that the walk emits commits marked with the
+ `UNINTERESTING` flag.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index 7cd461adf47..dce0840937e 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -22,6 +22,7 @@ struct type_and_oid_list
{
enum object_type type;
struct oid_array oids;
+ int maybe_interesting;
};
#define TYPE_AND_OID_LIST_INIT { \
@@ -124,6 +125,8 @@ static int add_children(struct path_walk_context *ctx,
strmap_put(&ctx->paths_to_lists, path.buf, list);
string_list_append(&ctx->path_stack, path.buf);
}
+ if (!(o->flags & UNINTERESTING))
+ list->maybe_interesting = 1;
oid_array_append(&list->oids, &entry.oid);
}
@@ -145,6 +148,40 @@ static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
+ if (ctx->info->prune_all_uninteresting) {
+ /*
+ * This is true if all objects were UNINTERESTING
+ * when added to the list.
+ */
+ if (!list->maybe_interesting)
+ return 0;
+
+ /*
+ * But it's still possible that the objects were set
+ * as UNINTERESTING after being added. Do a quick check.
+ */
+ list->maybe_interesting = 0;
+ for (size_t i = 0;
+ !list->maybe_interesting && i < list->oids.nr;
+ i++) {
+ if (list->type == OBJ_TREE) {
+ struct tree *t = lookup_tree(ctx->repo,
+ &list->oids.oid[i]);
+ if (t && !(t->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else {
+ struct blob *b = lookup_blob(ctx->repo,
+ &list->oids.oid[i]);
+ if (b && !(b->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ }
+ }
+
+ /* We have confirmed that all objects are UNINTERESTING. */
+ if (!list->maybe_interesting)
+ return 0;
+ }
+
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
(list->type == OBJ_BLOB && ctx->info->blobs))
@@ -187,7 +224,7 @@ static void clear_strmap(struct strmap *map)
int walk_objects_by_path(struct path_walk_info *info)
{
const char *root_path = "";
- int ret = 0;
+ int ret = 0, has_uninteresting = 0;
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
struct type_and_oid_list *root_tree_list;
@@ -199,6 +236,7 @@ int walk_objects_by_path(struct path_walk_info *info)
.path_stack = STRING_LIST_INIT_DUP,
.paths_to_lists = STRMAP_INIT
};
+ struct oidset root_tree_set = OIDSET_INIT;
trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
@@ -211,6 +249,7 @@ int walk_objects_by_path(struct path_walk_info *info)
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
+ root_tree_list->maybe_interesting = 1;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
/*
@@ -304,10 +343,16 @@ int walk_objects_by_path(struct path_walk_info *info)
t = lookup_tree(info->revs->repo, oid);
if (t) {
+ if ((c->object.flags & UNINTERESTING)) {
+ t->object.flags |= UNINTERESTING;
+ has_uninteresting = 1;
+ }
+
if (t->object.flags & SEEN)
continue;
t->object.flags |= SEEN;
- oid_array_append(&root_tree_list->oids, oid);
+ if (!oidset_insert(&root_tree_set, oid))
+ oid_array_append(&root_tree_list->oids, oid);
} else {
warning("could not find tree %s", oid_to_hex(oid));
}
@@ -323,6 +368,21 @@ int walk_objects_by_path(struct path_walk_info *info)
oid_array_clear(&commit_list->oids);
free(commit_list);
+ /*
+ * Before performing a DFS of our paths and emitting them as interesting,
+ * do a full walk of the trees to distribute the UNINTERESTING bit. Use
+ * the sparse algorithm if prune_all_uninteresting was set.
+ */
+ if (has_uninteresting) {
+ trace2_region_enter("path-walk", "uninteresting-walk", info->revs->repo);
+ if (info->prune_all_uninteresting)
+ mark_trees_uninteresting_sparse(ctx.repo, &root_tree_set);
+ else
+ mark_trees_uninteresting_dense(ctx.repo, &root_tree_set);
+ trace2_region_leave("path-walk", "uninteresting-walk", info->revs->repo);
+ }
+ oidset_clear(&root_tree_set);
+
string_list_append(&ctx.path_stack, root_path);
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index 3f3b63180ef..3e44c4b8a58 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -38,6 +38,14 @@ struct path_walk_info {
int trees;
int blobs;
int tags;
+
+ /**
+ * When 'prune_all_uninteresting' is set and a path has all objects
+ * marked as UNINTERESTING, then the path-walk will not visit those
+ * objects. It will not call path_fn on those objects and will not
+ * walk the children of such trees.
+ */
+ int prune_all_uninteresting;
};
#define PATH_WALK_INFO_INIT { \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index c6c60d68749..06b103d8760 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -55,8 +55,12 @@ static int emit_block(const char *path, struct oid_array *oids,
BUG("we do not understand this type");
}
- for (size_t i = 0; i < oids->nr; i++)
- printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));
+ for (size_t i = 0; i < oids->nr; i++) {
+ struct object *o = lookup_unknown_object(the_repository,
+ &oids->oid[i]);
+ printf("%s:%s:%s%s\n", typestr, path, oid_to_hex(&oids->oid[i]),
+ o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
+ }
return 0;
}
@@ -76,6 +80,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
+ OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
+ N_("toggle pruning of uninteresting paths")),
OPT_END(),
};
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 7758e2529ee..943adc6c8f1 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -229,19 +229,19 @@ test_expect_success 'topic, not base, boundary' '
cat >expect <<-EOF &&
COMMIT::$(git rev-parse topic)
- COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~1):UNINTERESTING
commits:2
TREE::$(git rev-parse topic^{tree})
- TREE::$(git rev-parse base~1^{tree})
- TREE:left/:$(git rev-parse base~1:left)
+ TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
TREE:right/:$(git rev-parse topic:right)
- TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
trees:5
- BLOB:a:$(git rev-parse base~1:a)
- BLOB:left/b:$(git rev-parse base~1:left/b)
- BLOB:right/c:$(git rev-parse base~1:right/c)
+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse base~1:right/d)
+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:5
tags:0
EOF
@@ -252,6 +252,30 @@ test_expect_success 'topic, not base, boundary' '
test_cmp expect.sorted out.sorted
'
+test_expect_success 'topic, not base, boundary with pruning' '
+ test-tool path-walk --prune -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base~1):UNINTERESTING
+ commits:2
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
+ trees:4
+ BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ blobs:2
+ tags:0
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
test_expect_success 'trees are reported exactly once' '
test_when_finished "rm -rf unique-trees" &&
test_create_repo unique-trees &&
--
gitgitgadget
* [PATCH 07/17] pack-objects: extract should_attempt_deltas()
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (5 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 06/17] path-walk: add prune_all_uninteresting option Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:11 ` [PATCH 08/17] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
` (10 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
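Extract the checks that decide whether an object entry should be
considered for delta compression out of prepare_pack() and into a new
should_attempt_deltas() helper. This will be helpful in a future change
that introduces a new way to compute deltas.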
Be careful to preserve the nr_deltas counting logic in the existing
method, but take the rest of the logic wholesale.
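The split between the predicate and the counter can be sketched with a simplified entry type. The struct and function names below are hypothetical, and the object-type lookup is omitted; only the 50-byte threshold and the order of the checks come from this patch. The predicate has no side effects, and the caller alone decides what to count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an object entry. */
struct toy_entry {
	int has_delta;      /* already reuses an existing delta */
	int type_valid;
	size_t size;
	int no_try_delta;
	int preferred_base;
};

/* Side-effect-free predicate: is this entry a delta candidate? */
static int toy_should_attempt_deltas(const struct toy_entry *e)
{
	assert(e != NULL);
	if (e->has_delta)
		return 0;
	if (!e->type_valid || e->size < 50)
		return 0;
	if (e->no_try_delta)
		return 0;
	return 1;
}

/*
 * Caller: collect candidates into out[] and count only those that are
 * not preferred bases, keeping the counter out of the predicate.
 */
static size_t toy_collect(const struct toy_entry *entries, size_t nr,
			  const struct toy_entry **out, size_t *nr_deltas)
{
	size_t n = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!toy_should_attempt_deltas(&entries[i]))
			continue;
		if (!entries[i].preferred_base)
			(*nr_deltas)++;
		out[n++] = &entries[i];
	}
	return n;
}
```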
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
builtin/pack-objects.c | 53 +++++++++++++++++++++++-------------------
1 file changed, 29 insertions(+), 24 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0fc0680b402..82f4ca04000 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3167,6 +3167,33 @@ static int add_ref_tag(const char *tag UNUSED, const char *referent UNUSED, cons
return 0;
}
+static int should_attempt_deltas(struct object_entry *entry)
+{
+ if (DELTA(entry))
+ return 0;
+
+ if (!entry->type_valid ||
+ oe_size_less_than(&to_pack, entry, 50))
+ return 0;
+
+ if (entry->no_try_delta)
+ return 0;
+
+ if (!entry->preferred_base) {
+ if (oe_type(entry) < 0)
+ die(_("unable to get type of object %s"),
+ oid_to_hex(&entry->idx.oid));
+ } else if (oe_type(entry) < 0) {
+ /*
+ * This object is not found, but we
+ * don't have to include it anyway.
+ */
+ return 0;
+ }
+
+ return 1;
+}
+
static void prepare_pack(int window, int depth)
{
struct object_entry **delta_list;
@@ -3197,33 +3224,11 @@ static void prepare_pack(int window, int depth)
for (i = 0; i < to_pack.nr_objects; i++) {
struct object_entry *entry = to_pack.objects + i;
- if (DELTA(entry))
- /* This happens if we decided to reuse existing
- * delta from a pack. "reuse_delta &&" is implied.
- */
- continue;
-
- if (!entry->type_valid ||
- oe_size_less_than(&to_pack, entry, 50))
+ if (!should_attempt_deltas(entry))
continue;
- if (entry->no_try_delta)
- continue;
-
- if (!entry->preferred_base) {
+ if (!entry->preferred_base)
nr_deltas++;
- if (oe_type(entry) < 0)
- die(_("unable to get type of object %s"),
- oid_to_hex(&entry->idx.oid));
- } else {
- if (oe_type(entry) < 0) {
- /*
- * This object is not found, but we
- * don't have to include it anyway.
- */
- continue;
- }
- }
delta_list[n++] = entry;
}
--
gitgitgadget
* [PATCH 08/17] pack-objects: add --path-walk option
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (6 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 07/17] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-28 19:54 ` Jonathan Tan
2024-10-08 14:11 ` [PATCH 09/17] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
` (9 subsequent siblings)
17 siblings, 1 reply; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.
This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.
The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.
After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.
The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.
This new walk is incompatible with some features and is ignored by
others:
* Object filters, such as sparse-checkout or tree depth, are not
currently integrated with the path-walk API. A blobless packfile could
be integrated easily, but that is deferred for later.
* Server-focused features such as delta islands, shallow packs, and
using a bitmap index are incompatible with the path-walk API.
* The path-walk API is only compatible with the --revs option, not
taking object lists or pack lists over stdin. These alternative ways
to specify the objects currently ignore the --path-walk option
without even a warning.
Future changes will create performance tests that demonstrate the power
of this approach.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/git-pack-objects.txt | 13 +-
Documentation/technical/api-path-walk.txt | 3 +-
builtin/pack-objects.c | 147 ++++++++++++++++++++--
t/t5300-pack-object.sh | 17 +++
4 files changed, 169 insertions(+), 11 deletions(-)
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index e32404c6aae..f2fda800a43 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -15,7 +15,8 @@ SYNOPSIS
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
[--cruft] [--cruft-expiration=<time>]
[--stdout [--filter=<filter-spec>] | <base-name>]
- [--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
+ [--shallow] [--keep-true-parents] [--[no-]sparse]
+ [--path-walk] < <object-list>
DESCRIPTION
@@ -345,6 +346,16 @@ raise an error.
Restrict delta matches based on "islands". See DELTA ISLANDS
below.
+--path-walk::
+ By default, `git pack-objects` walks objects in an order that
+ presents trees and blobs in an order unrelated to the path they
+ appear relative to a commit's root tree. The `--path-walk` option
+ enables a different walking algorithm that organizes trees and
+ blobs by path. This has the potential to improve delta compression
+ especially in the presence of filenames that cause collisions in
+ Git's default name-hash algorithm. Due to changing how the objects
+ are walked, this option is not compatible with `--delta-islands`,
+ `--shallow`, or `--filter`.
DELTA ISLANDS
-------------
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index c51f92cd649..2d25281774d 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -69,4 +69,5 @@ Examples
--------
See example usages in:
- `t/helper/test-path-walk.c`
+ `t/helper/test-path-walk.c`,
+ `builtin/pack-objects.c`
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 82f4ca04000..103263666f6 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -39,6 +39,9 @@
#include "promisor-remote.h"
#include "pack-mtimes.h"
#include "parse-options.h"
+#include "blob.h"
+#include "tree.h"
+#include "path-walk.h"
/*
* Objects we are going to pack are collected in the `to_pack` structure.
@@ -215,6 +218,7 @@ static int delta_search_threads;
static int pack_to_stdout;
static int sparse;
static int thin;
+static int path_walk;
static int num_preferred_base;
static struct progress *progress_state;
@@ -4143,6 +4147,105 @@ static void mark_bitmap_preferred_tips(void)
}
}
+static inline int is_oid_interesting(struct repository *repo,
+ struct object_id *oid)
+{
+ struct object *o = lookup_object(repo, oid);
+ return o && !(o->flags & UNINTERESTING);
+}
+
+static int add_objects_by_path(const char *path,
+ struct oid_array *oids,
+ enum object_type type,
+ void *data)
+{
+ struct object_entry **delta_list;
+ size_t oe_start = to_pack.nr_objects;
+ size_t oe_end;
+ unsigned int sub_list_size;
+ unsigned int *processed = data;
+
+ /*
+ * First, add all objects to the packing data, including the ones
+ * marked UNINTERESTING (translated to 'exclude') as they can be
+ * used as delta bases.
+ */
+ for (size_t i = 0; i < oids->nr; i++) {
+ int exclude;
+ struct object_info oi = OBJECT_INFO_INIT;
+ struct object_id *oid = &oids->oid[i];
+
+ /* Skip objects that do not exist locally. */
+ if (exclude_promisor_objects &&
+ oid_object_info_extended(the_repository, oid, &oi,
+ OBJECT_INFO_FOR_PREFETCH) < 0)
+ continue;
+
+ exclude = !is_oid_interesting(the_repository, oid);
+
+ if (exclude && !thin)
+ continue;
+
+ add_object_entry(oid, type, path, exclude);
+ }
+
+ oe_end = to_pack.nr_objects;
+
+ /* We can skip delta calculations if it is a no-op. */
+ if (oe_end == oe_start || !window)
+ return 0;
+
+ sub_list_size = 0;
+ ALLOC_ARRAY(delta_list, oe_end - oe_start);
+
+ for (size_t i = 0; i < oe_end - oe_start; i++) {
+ struct object_entry *entry = to_pack.objects + oe_start + i;
+
+ if (!should_attempt_deltas(entry))
+ continue;
+
+ delta_list[sub_list_size++] = entry;
+ }
+
+ /*
+ * Find delta bases among this list of objects that all match the same
+ * path. This causes the delta compression to be interleaved in the
+ * object walk, which can lead to confusing progress indicators. This is
+ * also incompatible with threaded delta calculations. In the future,
+ * consider creating a list of regions in the full to_pack.objects array
+ * that could be picked up by the threaded delta computation.
+ */
+ if (sub_list_size && window) {
+ QSORT(delta_list, sub_list_size, type_size_sort);
+ find_deltas(delta_list, &sub_list_size, window, depth, processed);
+ }
+
+ free(delta_list);
+ return 0;
+}
+
+static void get_object_list_path_walk(struct rev_info *revs)
+{
+ struct path_walk_info info = PATH_WALK_INFO_INIT;
+ unsigned int processed = 0;
+
+ info.revs = revs;
+ info.path_fn = add_objects_by_path;
+ info.path_fn_data = &processed;
+ revs->tag_objects = 1;
+
+ /*
+ * Allow the --[no-]sparse option to be interesting here, if only
+ * for testing purposes. Paths with no interesting objects will not
+ * contribute to the resulting pack, but only create noisy preferred
+ * base objects.
+ */
+ info.prune_all_uninteresting = sparse;
+
+ if (walk_objects_by_path(&info))
+ die(_("failed to pack objects via path-walk"));
+}
+
static void get_object_list(struct rev_info *revs, int ac, const char **av)
{
struct setup_revision_opt s_r_opt = {
@@ -4189,7 +4292,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
warn_on_object_refname_ambiguity = save_warning;
- if (use_bitmap_index && !get_object_list_from_bitmap(revs))
+ if (use_bitmap_index && !path_walk && !get_object_list_from_bitmap(revs))
return;
if (use_delta_islands)
@@ -4198,15 +4301,19 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
if (write_bitmap_index)
mark_bitmap_preferred_tips();
- if (prepare_revision_walk(revs))
- die(_("revision walk setup failed"));
- mark_edges_uninteresting(revs, show_edge, sparse);
-
if (!fn_show_object)
fn_show_object = show_object;
- traverse_commit_list(revs,
- show_commit, fn_show_object,
- NULL);
+
+ if (path_walk) {
+ get_object_list_path_walk(revs);
+ } else {
+ if (prepare_revision_walk(revs))
+ die(_("revision walk setup failed"));
+ mark_edges_uninteresting(revs, show_edge, sparse);
+ traverse_commit_list(revs,
+ show_commit, fn_show_object,
+ NULL);
+ }
if (unpack_unreachable_expiration) {
revs->ignore_missing_links = 1;
@@ -4404,6 +4511,8 @@ int cmd_pack_objects(int argc,
N_("use the sparse reachability algorithm")),
OPT_BOOL(0, "thin", &thin,
N_("create thin packs")),
+ OPT_BOOL(0, "path-walk", &path_walk,
+ N_("use the path-walk API to walk objects when possible")),
OPT_BOOL(0, "shallow", &shallow,
N_("create packs suitable for shallow fetches")),
OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
@@ -4484,7 +4593,27 @@ int cmd_pack_objects(int argc,
window = 0;
strvec_push(&rp, "pack-objects");
- if (thin) {
+
+ if (path_walk && filter_options.choice) {
+ warning(_("cannot use --filter with --path-walk"));
+ path_walk = 0;
+ }
+ if (path_walk && use_delta_islands) {
+ warning(_("cannot use delta islands with --path-walk"));
+ path_walk = 0;
+ }
+ if (path_walk && shallow) {
+ warning(_("cannot use --shallow with --path-walk"));
+ path_walk = 0;
+ }
+ if (path_walk) {
+ strvec_push(&rp, "--boundary");
+ /*
+ * We must disable the bitmaps because we are removing
+ * the --objects / --objects-edge[-aggressive] options.
+ */
+ use_bitmap_index = 0;
+ } else if (thin) {
use_internal_rev_list = 1;
strvec_push(&rp, shallow
? "--objects-edge-aggressive"
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 3b9dae331a5..5f6914acae7 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -674,4 +674,21 @@ do
'
done
+# Basic "repack everything" test
+test_expect_success '--path-walk pack everything' '
+ git -C server rev-parse HEAD >in &&
+ git -C server pack-objects --stdout --revs --path-walk <in >out.pack &&
+ git -C server index-pack --stdin <out.pack
+'
+
+# Basic "thin pack" test
+test_expect_success '--path-walk thin pack' '
+ cat >in <<-EOF &&
+ $(git -C server rev-parse HEAD)
+ ^$(git -C server rev-parse HEAD~2)
+ EOF
+ git -C server pack-objects --thin --stdout --revs --path-walk <in >out.pack &&
+ git -C server index-pack --fix-thin --stdin <out.pack
+'
+
test_done
--
gitgitgadget
* Re: [PATCH 08/17] pack-objects: add --path-walk option
2024-10-08 14:11 ` [PATCH 08/17] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
@ 2024-10-28 19:54 ` Jonathan Tan
2024-10-29 18:07 ` Taylor Blau
2024-10-31 2:12 ` Derrick Stolee
0 siblings, 2 replies; 55+ messages in thread
From: Jonathan Tan @ 2024-10-28 19:54 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: Jonathan Tan, git, gitster, johannes.schindelin, peff, ps, me,
johncai86, newren, christian.couder, kristofferhaugsbakk,
Derrick Stolee
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> This new walk is incompatible with some features and is ignored by
> others:
>
> * Object filters are not currently integrated with the path-walk API,
> such as sparse-checkout or tree depth. A blobless packfile could be
> integrated easily, but that is deferred for later.
>
> * Server-focused features such as delta islands, shallow packs, and
> using a bitmap index are incompatible with the path-walk API.
>
> * The path walk API is only compatible with the --revs option, not
> taking object lists or pack lists over stdin. These alternative ways
> to specify the objects currently ignore the --path-walk option
> without even a warning.
It might be better to declare --path-walk as "internal use only" and
only supporting what send-pack.c (used by "git push") and "git repack"
needs. (From this list, it seems that there is a lot of incompatibility,
some of which can happen without a warning to the user, so it sounds
better to be up-front and say that we only support what send-pack.c
needs. This also makes reviewing easier, as we don't have to think about
the possible interactions with every other rev-list feature - only what
is used by send-pack.c.)
Also from a reviewer perspective, it might be better to restrict this
patch set to what send-pack.c needs and leave "git repack" for a future
patch set. This means that we would not need features such as blob
and tree exclusions, and possibly not even bitmap use or delta reuse
(assuming that the user would typically push recently-created objects
that have not been repacked).
> + /* Skip objects that do not exist locally. */
> + if (exclude_promisor_objects &&
> + oid_object_info_extended(the_repository, oid, &oi,
> + OBJECT_INFO_FOR_PREFETCH) < 0)
> + continue;
This functionality is typically triggered by --missing=allow;
--exclude-promisor-objects means (among other things) that we allow
a missing link only if that object is known to be a promisor object
(because another promisor object refers to it). See
Documentation/rev-list-options.txt, and also get_reference() and
elsewhere in revision.c; notice how is_promisor_object() is paired with it.
Having said that, we should probably just fail outright on missing
objects, whether or not we have exclude_promisor_objects. If we have
computed that we need to push an object, that object needs to exist.
(Same for repack.)
* Re: [PATCH 08/17] pack-objects: add --path-walk option
2024-10-28 19:54 ` Jonathan Tan
@ 2024-10-29 18:07 ` Taylor Blau
2024-10-29 21:36 ` Jonathan Tan
2024-10-31 2:14 ` Derrick Stolee
2024-10-31 2:12 ` Derrick Stolee
1 sibling, 2 replies; 55+ messages in thread
From: Taylor Blau @ 2024-10-29 18:07 UTC (permalink / raw)
To: Jonathan Tan
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee
On Mon, Oct 28, 2024 at 12:54:04PM -0700, Jonathan Tan wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > This new walk is incompatible with some features and is ignored by
> > others:
> >
> > * Object filters are not currently integrated with the path-walk API,
> > such as sparse-checkout or tree depth. A blobless packfile could be
> > integrated easily, but that is deferred for later.
> >
> > * Server-focused features such as delta islands, shallow packs, and
> > using a bitmap index are incompatible with the path-walk API.
> >
> > * The path walk API is only compatible with the --revs option, not
> > taking object lists or pack lists over stdin. These alternative ways
> > to specify the objects currently ignore the --path-walk option
> > without even a warning.
>
> It might be better to declare --path-walk as "internal use only" and
> only supporting what send-pack.c (used by "git push") and "git repack"
> needs. (From this list, it seems that there is a lot of incompatibility,
> some of which can happen without a warning to the user, so it sounds
> better to be up-front and say that we only support what send-pack.c
> needs. This also makes reviewing easier, as we don't have to think about
> the possible interactions with every other rev-list feature - only what
> is used by send-pack.c.)
Is the thinking there that we care mostly about 'git push' and 'git
repack' on the client-side?
I don't think it's unreasonable necessarily, but I would add that
client-side users definitely do use bitmaps (though not delta islands),
either when working in a bare repository (where bitmaps are the default)
or when using 'git gc' (and/or through 'git maintenance') when
'repack.writeBitmaps' is enabled.
So I think the approach here would be to limit it to some cases of
client side behavior, but we should keep in mind that it will not cover
all cases.
My feeling is that it would be nice to pull on the incompatibility
string a little more and see if we can't make the two work together
without too much effort.
Thanks,
Taylor
* Re: [PATCH 08/17] pack-objects: add --path-walk option
2024-10-29 18:07 ` Taylor Blau
@ 2024-10-29 21:36 ` Jonathan Tan
2024-10-29 22:16 ` Taylor Blau
2024-10-31 2:14 ` Derrick Stolee
1 sibling, 1 reply; 55+ messages in thread
From: Jonathan Tan @ 2024-10-29 21:36 UTC (permalink / raw)
To: Taylor Blau
Cc: Jonathan Tan, Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee
Taylor Blau <me@ttaylorr.com> writes:
> Is the thinking there that we care mostly about 'git push' and 'git
> repack' on the client-side?
I would go further - for the initial patch set, we should only care
about "git push" on the client side. Stolee said [1] that the "primary
motivation for this feature is its use to shrink the packfile created
by 'git push' when there are many name-hash collisions", and in thinking
about how to reduce the patch set for easier review, I thought that to
be a good scope.
Subsequent patch set(s) can implement "git repack", useful for both
client and server.
[1] https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/
> I don't think it's unreasonable necessarily, but I would add that
> client-side users definitely do use bitmaps (though not delta islands),
> either when working in a bare repository (where bitmaps are the default)
> or when using 'git gc' (and/or through 'git maintenance') when
> 'repack.writeBitmaps' is enabled.
I was thinking that a typical use case would be to create the commits
(using the tool Stolee mentioned, "beachball") and then immediately push
them. In this case, I don't think there would be much opportunity for
a bitmap write to be triggered, meaning that the pushed commits are not
covered by bitmaps.
But in any case, this was motivated by a desire to reduce the patch set
- I don't have a fundamental objection to including support for bitmaps
in the first patch set.
> So I think the approach here would be to limit it to some cases of
> client side behavior, but we should keep in mind that it will not cover
> all cases.
Yeah, that was my approach too.
> My feeling is that it would be nice to pull on the incompatibility
> string a little more and see if we can't make the two work together
> without too much effort.
>
> Thanks,
> Taylor
By incompatibility, do you mean the incompatibility between bitmaps
and the overall --path-walk feature as implemented collectively by the
patches in Stolee's patch set? If so, I suspect that we will need a
parallel code path that takes in the "want" and "uninteresting" commits
and emits the list of objects (possibly before sorting the objects by
path hash), much like in builtin/pack-objects.c, so I think there will
be some effort involved in making the two work together.
* Re: [PATCH 08/17] pack-objects: add --path-walk option
2024-10-29 21:36 ` Jonathan Tan
@ 2024-10-29 22:16 ` Taylor Blau
2024-10-31 2:04 ` Derrick Stolee
0 siblings, 1 reply; 55+ messages in thread
From: Taylor Blau @ 2024-10-29 22:16 UTC (permalink / raw)
To: Jonathan Tan
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee
On Tue, Oct 29, 2024 at 02:36:57PM -0700, Jonathan Tan wrote:
> By incompatibility, do you mean the incompatibility between bitmaps
> and the overall --path-walk feature as implemented collectively by the
> patches in Stolee's patch set? If so, I suspect that we will need a
> parallel code path that takes in the "want" and "uninteresting" commits
> and emits the list of objects (possibly before sorting the objects by
> path hash), much like in builtin/pack-objects.c, so I think there will
> be some effort involved in making the two work together.
I am not sure yet, in all honesty, because I haven't had enough time to
spend yet reviewing these patches to have anything intelligent to say.
Thanks,
Taylor
* Re: [PATCH 08/17] pack-objects: add --path-walk option
2024-10-29 22:16 ` Taylor Blau
@ 2024-10-31 2:04 ` Derrick Stolee
0 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee @ 2024-10-31 2:04 UTC (permalink / raw)
To: Taylor Blau, Jonathan Tan
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk
On 10/29/24 6:16 PM, Taylor Blau wrote:
> On Tue, Oct 29, 2024 at 02:36:57PM -0700, Jonathan Tan wrote:
>> By incompatibility, do you mean the incompatibility between bitmaps
>> and the overall --path-walk feature as implemented collectively by the
>> patches in Stolee's patch set? If so, I suspect that we will need a
>> parallel code path that takes in the "want" and "uninteresting" commits
>> and emits the list of objects (possibly before sorting the objects by
>> path hash), much like in builtin/pack-objects.c, so I think there will
>> be some effort involved in making the two work together.
>
> I am not sure yet, in all honesty, because I haven't had enough time to
> spend yet reviewing these patches to have anything intelligent to say.
I think that the --path-walk option is fundamentally incompatible with
the --use-bitmap-index option (using reachability bitmaps to reduce how
many objects we parse to discover reachability) but is not necessarily
incompatible with writing bitmaps. But it would require testing to be
sure that there are no surprises due to something like an object order
changing or something like that.
The feature is also not currently integrated with delta islands, so
that would need integration and testing to make sure things are
grouped together and delta chains go in the right direction.
These things may be something possible to overcome in the future, but
the lack of current integration points and testing makes me want to
leave this version with guard rails that prevent users from getting
into a bind.
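To make the guard rails concrete, here is a minimal standalone sketch of the option reconciliation. The `struct pack_opts` type and the `apply_guard_rails()` helper are hypothetical names used only for illustration; the order of checks and the warning strings follow the hunk in cmd_pack_objects(), but this is not the code in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the static option variables in pack-objects. */
struct pack_opts {
	bool path_walk;
	bool has_filter;       /* models filter_options.choice != 0 */
	bool delta_islands;
	bool shallow;
	bool use_bitmap_index;
};

/*
 * Returns the warning to print when --path-walk had to be dropped,
 * or NULL when nothing was dropped. Only the first conflict is
 * reported, because later checks see path_walk already cleared.
 */
static const char *apply_guard_rails(struct pack_opts *o)
{
	const char *why = NULL;

	if (!o->path_walk)
		return NULL;

	if (o->has_filter)
		why = "cannot use --filter with --path-walk";
	else if (o->delta_islands)
		why = "cannot use delta islands with --path-walk";
	else if (o->shallow)
		why = "cannot use --shallow with --path-walk";

	if (why) {
		o->path_walk = false;
		return why;
	}

	/* The path walk survives, so reading bitmaps is turned off. */
	o->use_bitmap_index = false;
	return NULL;
}
```

The point of the sketch is that an incompatible option downgrades to the classic walk with a warning instead of failing, while a surviving --path-walk silently disables bitmap reads.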
Thanks,
-Stolee
* Re: [PATCH 08/17] pack-objects: add --path-walk option
2024-10-29 18:07 ` Taylor Blau
2024-10-29 21:36 ` Jonathan Tan
@ 2024-10-31 2:14 ` Derrick Stolee
2024-10-31 21:02 ` Taylor Blau
1 sibling, 1 reply; 55+ messages in thread
From: Derrick Stolee @ 2024-10-31 2:14 UTC (permalink / raw)
To: Taylor Blau, Jonathan Tan
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk
On 10/29/24 2:07 PM, Taylor Blau wrote:
> On Mon, Oct 28, 2024 at 12:54:04PM -0700, Jonathan Tan wrote:
> Is the thinking there that we care mostly about 'git push' and 'git
> repack' on the client-side?
>
> I don't think it's unreasonable necessarily, but I would add that
> client-side users definitely do use bitmaps (though not delta islands),
> either when working in a bare repository (where bitmaps are the default)
> or when using 'git gc' (and/or through 'git maintenance') when
> 'repack.writeBitmaps' is enabled.
I suppose some users do use bitmaps, but in my experience, client-side
pushes are slower with bitmaps because a typical target branch is
faster to compute by doing a commit walk, at least when the bitmaps are
older than the new commits in the topic branch. This may be outdated by
now, as it has been a few years since I did a client-side test of
bitmaps.
Thanks,
-Stolee
* Re: [PATCH 08/17] pack-objects: add --path-walk option
2024-10-31 2:14 ` Derrick Stolee
@ 2024-10-31 21:02 ` Taylor Blau
0 siblings, 0 replies; 55+ messages in thread
From: Taylor Blau @ 2024-10-31 21:02 UTC (permalink / raw)
To: Derrick Stolee
Cc: Jonathan Tan, Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk
On Wed, Oct 30, 2024 at 10:14:08PM -0400, Derrick Stolee wrote:
> On 10/29/24 2:07 PM, Taylor Blau wrote:
> > On Mon, Oct 28, 2024 at 12:54:04PM -0700, Jonathan Tan wrote:
>
> > Is the thinking there that we care mostly about 'git push' and 'git
> > repack' on the client-side?
> >
> > I don't think it's unreasonable necessarily, but I would add that
> > client-side users definitely do use bitmaps (though not delta islands),
> > either when working in a bare repository (where bitmaps are the default)
> > or when using 'git gc' (and/or through 'git maintenance') when
> > 'repack.writeBitmaps' is enabled.
> I suppose some users do use bitmaps, but in my experience, client-side
> pushes are slower with bitmaps because a typical target branch is
> faster to compute by doing a commit walk, at least when the bitmaps are
> older than the new commits in the topic branch. This may be outdated by
> now, as it has been a few years since I did a client-side test of
> bitmaps.
All true, though it's hard to estimate the size of "some". I share your
intuition that bitmaps are often a drag on performance for the
client-side because doing a pure commit walk is often faster, especially
if the client has a reasonably up-to-date commit graph.
Thanks,
Taylor
* Re: [PATCH 08/17] pack-objects: add --path-walk option
2024-10-28 19:54 ` Jonathan Tan
2024-10-29 18:07 ` Taylor Blau
@ 2024-10-31 2:12 ` Derrick Stolee
1 sibling, 0 replies; 55+ messages in thread
From: Derrick Stolee @ 2024-10-31 2:12 UTC (permalink / raw)
To: Jonathan Tan, Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, ps, me, johncai86,
newren, christian.couder, kristofferhaugsbakk
On 10/28/24 3:54 PM, Jonathan Tan wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> This new walk is incompatible with some features and is ignored by
>> others:
>>
>> * Object filters are not currently integrated with the path-walk API,
>> such as sparse-checkout or tree depth. A blobless packfile could be
>> integrated easily, but that is deferred for later.
>>
>> * Server-focused features such as delta islands, shallow packs, and
>> using a bitmap index are incompatible with the path-walk API.
>>
>> * The path walk API is only compatible with the --revs option, not
>> taking object lists or pack lists over stdin. These alternative ways
>> to specify the objects currently ignore the --path-walk option
>> without even a warning.
>
> It might be better to declare --path-walk as "internal use only" and
> only supporting what send-pack.c (used by "git push") and "git repack"
> needs. (From this list, it seems that there is a lot of incompatibility,
> some of which can happen without a warning to the user, so it sounds
> better to be up-front and say that we only support what send-pack.c
> needs. This also makes reviewing easier, as we don't have to think about
> the possible interactions with every other rev-list feature - only what
> is used by send-pack.c.)
I do wonder what the value of doing this would be. I consider 'git
pack-objects' to already be a plumbing command, so marking any option
as "internal use only" seems like overkill. It takes effort to combine
the options carefully for the right effect. The tests in p5313 are not
terribly simple, such as needing --no-reuse-delta to guarantee we are
using the desired delta algorithm.
> Also from a reviewer perspective, it might be better to restrict this
> patch set to what send-pack.c needs and leave "git repack" for a future
> patch set. This means that we would not need features such as blob
> and tree exclusions, and possibly not even bitmap use or delta reuse
> (assuming that the user would typically push recently-created objects
> that have not been repacked).
While I can understand that as being a potential place to split the
patch series, the integration to add 'git repack --path-walk' is actually
very simple. Repacking "everything" needs to happen to be able to push a
repo to an empty remote, after all.
There are some subtleties around indexed objects, reflogs, and the like
that add some complexity, but they also are handled in the path-walk API
layer. Some of that complexity was helpful to know about during repack
tests.
Finally, the 'git repack --path-walk' use case is a great one for
demonstrating the benefits of threading the 'git pack-objects
--path-walk' algorithm.
>> + /* Skip objects that do not exist locally. */
>> + if (exclude_promisor_objects &&
>> + oid_object_info_extended(the_repository, oid, &oi,
>> + OBJECT_INFO_FOR_PREFETCH) < 0)
>> + continue;
>
> This functionality is typically triggered by --missing=allow;
> --exclude-promisor-objects means (among other things) that we allow
> a missing link only if that object is known to be a promisor object
> (because another promisor object refers to it). See
> Documentation/rev-list-options.txt, and also get_reference() and
> elsewhere in revision.c; notice how is_promisor_object() is paired with it.
>
> Having said that, we should probably just fail outright on missing
> objects, whether or not we have exclude_promisor_objects. If we have
> computed that we need to push an object, that object needs to exist.
> (Same for repack.)
I don't think it is reasonable to assume that a hard fail should be
expected here. Someone could create a blobless clone with a
sparse-checkout and then add a new file outside their sparse-checkout
without ever having the previous version downloaded.
When pushing, they won't be able to use the previous version as a
delta base, but they would certainly be confused about an object
being downloaded during 'git push'.
While the example I bring up is somewhat contrived, I can easily
imagine cases where missing objects are part of the commit boundary
and could be marked as UNINTERESTING but would still be sent in the
batch to be considered as a delta base.
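To spell out the cases, here is a small standalone sketch of the per-object decision in the quoted hunk. The `struct candidate`, `struct walk_opts` and `decide()` names are hypothetical, and the `uninteresting` field is a simplification of the real lookup (which also treats an object that was never loaded as not interesting); only the order of the checks is taken from the patch.

```c
#include <stdbool.h>

enum action {
	SKIP_MISSING,        /* not present locally: dropped, no download */
	SKIP_UNINTERESTING,  /* boundary object, but no thin pack requested */
	ADD_AS_BASE,         /* added with exclude=1: usable as a delta base */
	ADD_TO_PACK          /* added with exclude=0: written to the pack */
};

/* Hypothetical summary of what is known about one object in a batch. */
struct candidate {
	bool present_locally;
	bool uninteresting;
};

/* Hypothetical stand-in for the relevant pack-objects settings. */
struct walk_opts {
	bool exclude_promisor_objects;
	bool thin;
};

static enum action decide(const struct candidate *c,
			  const struct walk_opts *o)
{
	/* Missing objects are skipped only in promisor-aware mode. */
	if (o->exclude_promisor_objects && !c->present_locally)
		return SKIP_MISSING;

	/* Boundary objects matter only as delta bases for thin packs. */
	if (c->uninteresting)
		return o->thin ? ADD_AS_BASE : SKIP_UNINTERESTING;

	return ADD_TO_PACK;
}
```

In the blobless-clone scenario above, the previous version of the file is a missing boundary object, so under these assumptions it lands in the first case and is dropped rather than fetched or treated as an error.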
Thanks,
-Stolee
* [PATCH 09/17] pack-objects: update usage to match docs
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (7 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 08/17] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:11 ` [PATCH 10/17] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
` (8 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/git-pack-objects.txt | 14 +++++++-------
builtin/pack-objects.c | 10 ++++++++--
t/t0450/txt-help-mismatches | 1 -
3 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index f2fda800a43..68d86ed8838 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -10,13 +10,13 @@ SYNOPSIS
--------
[verse]
'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
- [--no-reuse-delta] [--delta-base-offset] [--non-empty]
- [--local] [--incremental] [--window=<n>] [--depth=<n>]
- [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
- [--cruft] [--cruft-expiration=<time>]
- [--stdout [--filter=<filter-spec>] | <base-name>]
- [--shallow] [--keep-true-parents] [--[no-]sparse]
- [--path-walk] < <object-list>
+ [--no-reuse-delta] [--delta-base-offset] [--non-empty]
+ [--local] [--incremental] [--window=<n>] [--depth=<n>]
+ [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+ [--cruft] [--cruft-expiration=<time>]
+ [--stdout [--filter=<filter-spec>] | <base-name>]
+ [--shallow] [--keep-true-parents] [--[no-]sparse]
+ [--path-walk] < <object-list>
DESCRIPTION
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 103263666f6..77fb1217b2e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -185,8 +185,14 @@ static inline void oe_set_delta_size(struct packing_data *pack,
#define SET_DELTA_SIBLING(obj, val) oe_set_delta_sibling(&to_pack, obj, val)
static const char *pack_usage[] = {
- N_("git pack-objects --stdout [<options>] [< <ref-list> | < <object-list>]"),
- N_("git pack-objects [<options>] <base-name> [< <ref-list> | < <object-list>]"),
+ N_("git pack-objects [-q | --progress | --all-progress] [--all-progress-implied]\n"
+ " [--no-reuse-delta] [--delta-base-offset] [--non-empty]\n"
+ " [--local] [--incremental] [--window=<n>] [--depth=<n>]\n"
+ " [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]\n"
+ " [--cruft] [--cruft-expiration=<time>]\n"
+ " [--stdout [--filter=<filter-spec>] | <base-name>]\n"
+ " [--shallow] [--keep-true-parents] [--[no-]sparse]\n"
+ " [--path-walk] < <object-list>"),
NULL
};
diff --git a/t/t0450/txt-help-mismatches b/t/t0450/txt-help-mismatches
index 28003f18c92..285ae81a6b5 100644
--- a/t/t0450/txt-help-mismatches
+++ b/t/t0450/txt-help-mismatches
@@ -38,7 +38,6 @@ merge-one-file
multi-pack-index
name-rev
notes
-pack-objects
push
range-diff
rebase
--
gitgitgadget
* [PATCH 10/17] p5313: add performance tests for --path-walk
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (8 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 09/17] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:11 ` [PATCH 11/17] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
` (7 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.
In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.
Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.
Running on my copy of the Git repository results in this data:
Test                                     this tree
---------------------------------------------------------
5313.2: thin pack                        0.01(0.00+0.00)
5313.3: thin pack size                   1.1K
5313.4: thin pack with --path-walk       0.01(0.01+0.00)
5313.5: thin pack size with --path-walk  1.1K
5313.6: big pack                         2.52(6.59+0.38)
5313.7: big pack size                    14.1M
5313.8: big pack with --path-walk        4.90(5.76+0.26)
5313.9: big pack size with --path-walk   13.2M
Note that the timing is slower because there is no threading in the
--path-walk case (yet).
The case where the --path-walk option really shines is when the default
name-hash is overwhelmed with collisions. An open source example can be
found in the microsoft/fluentui repo [1] at a certain commit [2].
[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08
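To illustrate how such collisions arise, here is a minimal standalone sketch. The `name_hash()` function mirrors the logic of pack_name_hash() as I understand it from pack-objects.h; `paths_collide()` is a hypothetical helper, and the example paths in the note below are made up rather than taken from the fluentui repo.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of the default name-hash: every non-whitespace character is
 * placed in the top byte while the previous value is shifted down by
 * two bits. Characters more than roughly sixteen positions from the
 * end are shifted out almost entirely, so only the tail of the path
 * influences the result.
 */
static uint32_t name_hash(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;

	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}

/* Returns 1 when two paths get the same name-hash value. */
static int paths_collide(const char *a, const char *b)
{
	return name_hash(a) == name_hash(b);
}
```

For example, "packages/button/src/CHANGELOG.json" and "packages/slider/src/CHANGELOG.json" share their last nineteen characters and hash to the same value, so the default walk sorts them next to each other even though they are unrelated files. Grouping by full path, as --path-walk does, keeps the versions of each file together instead.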
Running the tests on this repo results in the following output:
Test                                     this tree
----------------------------------------------------------
5313.2: thin pack                        0.28(0.38+0.02)
5313.3: thin pack size                   1.2M
5313.4: thin pack with --path-walk       0.08(0.06+0.01)
5313.5: thin pack size with --path-walk  18.4K
5313.6: big pack                         4.05(29.62+0.43)
5313.7: big pack size                    20.0M
5313.8: big pack with --path-walk        5.99(9.06+0.24)
5313.9: big pack size with --path-walk   16.4M
Notice in particular that in the small thin pack, the time performance
has improved from 0.28s to 0.08s and this is likely due to the improved
size of the resulting pack: 18.4K instead of 1.2M.
Finally, running this on a copy of the Linux kernel repository results
in these data points:
Test                                     this tree
-----------------------------------------------------------
5313.2: thin pack                        0.00(0.00+0.00)
5313.3: thin pack size                   5.8K
5313.4: thin pack with --path-walk       0.00(0.01+0.00)
5313.5: thin pack size with --path-walk  5.8K
5313.6: big pack                         24.39(65.81+1.31)
5313.7: big pack size                    155.7M
5313.8: big pack with --path-walk        41.07(60.69+0.68)
5313.9: big pack size with --path-walk   150.8M
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
t/perf/p5313-pack-objects.sh | 59 ++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
create mode 100755 t/perf/p5313-pack-objects.sh
diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
new file mode 100755
index 00000000000..840075f5691
--- /dev/null
+++ b/t/perf/p5313-pack-objects.sh
@@ -0,0 +1,59 @@
+#!/bin/sh
+
+test_description='Tests pack performance using bitmaps'
+. ./perf-lib.sh
+
+GIT_TEST_PASSING_SANITIZE_LEAK=0
+export GIT_TEST_PASSING_SANITIZE_LEAK
+
+test_perf_large_repo
+
+test_expect_success 'create rev input' '
+ cat >in-thin <<-EOF &&
+ $(git rev-parse HEAD)
+ ^$(git rev-parse HEAD~1)
+ EOF
+
+ cat >in-big <<-EOF
+ $(git rev-parse HEAD)
+ ^$(git rev-parse HEAD~1000)
+ EOF
+'
+
+test_perf 'thin pack' '
+ git pack-objects --thin --stdout --no-reuse-delta \
+ --revs --sparse <in-thin >out
+'
+
+test_size 'thin pack size' '
+ test_file_size out
+'
+
+test_perf 'thin pack with --path-walk' '
+ git pack-objects --thin --stdout --no-reuse-delta \
+ --revs --sparse --path-walk <in-thin >out
+'
+
+test_size 'thin pack size with --path-walk' '
+ test_file_size out
+'
+
+test_perf 'big pack' '
+ git pack-objects --stdout --no-reuse-delta --revs \
+ --sparse <in-big >out
+'
+
+test_size 'big pack size' '
+ test_file_size out
+'
+
+test_perf 'big pack with --path-walk' '
+ git pack-objects --stdout --no-reuse-delta --revs \
+ --sparse --path-walk <in-big >out
+'
+
+test_size 'big pack size with --path-walk' '
+ test_file_size out
+'
+
+test_done
--
gitgitgadget
* [PATCH 11/17] pack-objects: introduce GIT_TEST_PACK_PATH_WALK
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (9 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 10/17] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:11 ` [PATCH 12/17] repack: add --path-walk option Derrick Stolee via GitGitGadget
` (6 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.
This was useful in testing the --path-walk implementation, especially in
conjunction with tests such as:
- t0411-clone-from-partial.sh : One test fetches from a repo that does
not have the boundary objects. This causes the path-based walk to
fail. Disable the variable for this test.
- t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
without a boundary object.
- t5310-pack-bitmaps.sh : One test compares the case when packing with
bitmaps to the case when packing without them. Since we disable the
test variable when writing bitmaps, this causes a difference in the
object list (the --path-walk option adds an extra object). Specify
--no-path-walk in both processes for the comparison. Another test
checks for a specific delta base, but when computing dynamically
without using bitmaps, the base object is too small to be considered
in the delta calculations so no base is used.
- t5316-pack-delta-depth.sh : This script cares about certain delta
choices and their chain lengths. The --path-walk option changes how
these chains are selected, and thus changes the results of this test.
- t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
the --sparse option and how it combines with --path-walk.
- t5332-multi-pack-reuse.sh : This test verifies that the preferred
pack is used for delta reuse when possible. The --path-walk option is
not currently aware of the preferred pack at all, so finds a
different delta base.
- t7406-submodule-update.sh : When using the variable, the --depth
option collides with the --path-walk feature, resulting in a warning
message. Disable the variable so this warning does not appear.
I want to call out one specific test change that is only temporary:
- t5530-upload-pack-error.sh : One test cares specifically about an
"unable to read" error message. Since the current implementation
performs delta calculations within the path-walk API callback, a
different "unable to get size" error message appears. When this
is changed in a future refactoring, this test change can be reverted.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
builtin/pack-objects.c | 12 ++++++++++--
ci/run-build-and-tests.sh | 1 +
t/README | 4 ++++
t/t0411-clone-from-partial.sh | 6 ++++++
t/t5306-pack-nobase.sh | 5 +++++
t/t5310-pack-bitmaps.sh | 13 +++++++++++--
t/t5316-pack-delta-depth.sh | 9 ++++++---
t/t5332-multi-pack-reuse.sh | 7 +++++++
t/t5530-upload-pack-error.sh | 6 ++++++
t/t7406-submodule-update.sh | 4 ++++
10 files changed, 60 insertions(+), 7 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 77fb1217b2e..b97bec5661e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -224,7 +224,7 @@ static int delta_search_threads;
static int pack_to_stdout;
static int sparse;
static int thin;
-static int path_walk;
+static int path_walk = -1;
static int num_preferred_base;
static struct progress *progress_state;
@@ -4182,7 +4182,7 @@ static int add_objects_by_path(const char *path,
struct object_id *oid = &oids->oid[i];
/* Skip objects that do not exist locally. */
- if (exclude_promisor_objects &&
+ if ((exclude_promisor_objects || arg_missing_action != MA_ERROR) &&
oid_object_info_extended(the_repository, oid, &oi,
OBJECT_INFO_FOR_PREFETCH) < 0)
continue;
@@ -4583,6 +4583,14 @@ int cmd_pack_objects(int argc,
if (pack_to_stdout != !base_name || argc)
usage_with_options(pack_usage, pack_objects_options);
+ if (path_walk < 0) {
+ if (use_bitmap_index > 0 ||
+ !use_internal_rev_list)
+ path_walk = 0;
+ else
+ path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
+ }
+
if (depth < 0)
depth = 0;
if (depth >= (1 << OE_DEPTH_BITS)) {
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 2e28d02b20f..7c75492f366 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -30,6 +30,7 @@ linux-TEST-vars)
export GIT_TEST_NO_WRITE_REV_INDEX=1
export GIT_TEST_CHECKOUT_WORKERS=2
export GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1
+ export GIT_TEST_PACK_PATH_WALK=1
;;
linux-clang)
export GIT_TEST_DEFAULT_HASH=sha1
diff --git a/t/README b/t/README
index 8dcb778e260..bec31955d2d 100644
--- a/t/README
+++ b/t/README
@@ -436,6 +436,10 @@ GIT_TEST_PACK_SPARSE=<boolean> if disabled will default the pack-objects
builtin to use the non-sparse object walk. This can still be overridden by
the --sparse command-line argument.
+GIT_TEST_PACK_PATH_WALK=<boolean> if enabled will default the pack-objects
+builtin to use the path-walk API for the object walk. This can still be
+overridden by the --no-path-walk command-line argument.
+
GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
by overriding the minimum number of cache entries required per thread.
diff --git a/t/t0411-clone-from-partial.sh b/t/t0411-clone-from-partial.sh
index 932bf2067da..342d8d2997c 100755
--- a/t/t0411-clone-from-partial.sh
+++ b/t/t0411-clone-from-partial.sh
@@ -63,6 +63,12 @@ test_expect_success 'pack-objects should fetch from promisor remote and execute
test_expect_success 'clone from promisor remote does not lazy-fetch by default' '
rm -f script-executed &&
+
+ # The --path-walk feature of "git pack-objects" is not
+ # compatible with this kind of fetch from an incomplete repo.
+ GIT_TEST_PACK_PATH_WALK=0 &&
+ export GIT_TEST_PACK_PATH_WALK &&
+
test_must_fail git clone evil no-lazy 2>err &&
test_grep "lazy fetching disabled" err &&
test_path_is_missing script-executed
diff --git a/t/t5306-pack-nobase.sh b/t/t5306-pack-nobase.sh
index 0d50c6b4bca..429be5ce724 100755
--- a/t/t5306-pack-nobase.sh
+++ b/t/t5306-pack-nobase.sh
@@ -60,6 +60,11 @@ test_expect_success 'indirectly clone patch_clone' '
git pull ../.git &&
test $(git rev-parse HEAD) = $B &&
+ # The --path-walk feature of "git pack-objects" is not
+ # compatible with this kind of fetch from an incomplete repo.
+ GIT_TEST_PACK_PATH_WALK=0 &&
+ export GIT_TEST_PACK_PATH_WALK &&
+
git pull ../patch_clone/.git &&
test $(git rev-parse HEAD) = $C
)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index a6de7c57643..881b3f9c8d1 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -128,8 +128,9 @@ test_bitmap_cases () {
ls .git/objects/pack/ | grep bitmap >output &&
test_line_count = 1 output &&
# verify equivalent packs are generated with/without using bitmap index
- packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
- packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+ # Be careful to not use the path-walk option in either case.
+ packasha1=$(git pack-objects --no-use-bitmap-index --no-path-walk --all packa </dev/null) &&
+ packbsha1=$(git pack-objects --use-bitmap-index --no-path-walk --all packb </dev/null) &&
list_packed_objects packa-$packasha1.idx >packa.objects &&
list_packed_objects packb-$packbsha1.idx >packb.objects &&
test_cmp packa.objects packb.objects
@@ -358,6 +359,14 @@ test_bitmap_cases () {
git init --bare client.git &&
(
cd client.git &&
+
+ # This test relies on reusing a delta, but if the
+ # path-walk machinery is engaged, the base object
+ # is considered too small to use during the
+ # dynamic computation, so is not used.
+ GIT_TEST_PACK_PATH_WALK=0 &&
+ export GIT_TEST_PACK_PATH_WALK &&
+
git config transfer.unpackLimit 1 &&
git fetch .. delta-reuse-old:delta-reuse-old &&
git fetch .. delta-reuse-new:delta-reuse-new &&
diff --git a/t/t5316-pack-delta-depth.sh b/t/t5316-pack-delta-depth.sh
index eb4ef3dda4d..12a6901fecb 100755
--- a/t/t5316-pack-delta-depth.sh
+++ b/t/t5316-pack-delta-depth.sh
@@ -90,15 +90,18 @@ max_chain() {
# adjusted (or scrapped if the heuristics have become too unreliable)
test_expect_success 'packing produces a long delta' '
# Use --window=0 to make sure we are seeing reused deltas,
- # not computing a new long chain.
- pack=$(git pack-objects --all --window=0 </dev/null pack) &&
+ # not computing a new long chain. (Also avoid the --path-walk
+ # option as it may break delta chains.)
+ pack=$(git pack-objects --all --window=0 --no-path-walk </dev/null pack) &&
echo 9 >expect &&
max_chain pack-$pack.pack >actual &&
test_cmp expect actual
'
test_expect_success '--depth limits depth' '
- pack=$(git pack-objects --all --depth=5 </dev/null pack) &&
+ # Avoid --path-walk to avoid breaking delta chains across path
+ # boundaries.
+ pack=$(git pack-objects --all --depth=5 --no-path-walk </dev/null pack) &&
echo 5 >expect &&
max_chain pack-$pack.pack >actual &&
test_cmp expect actual
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
index 955ea42769b..df7dcb4b487 100755
--- a/t/t5332-multi-pack-reuse.sh
+++ b/t/t5332-multi-pack-reuse.sh
@@ -8,6 +8,13 @@ TEST_PASSES_SANITIZE_LEAK=true
GIT_TEST_MULTI_PACK_INDEX=0
GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
+
+# The --path-walk option does not consider the preferred pack
+# at all for reusing deltas, so this variable changes the
+# behavior of this test, if enabled.
+GIT_TEST_PACK_PATH_WALK=0
+export GIT_TEST_PACK_PATH_WALK
+
objdir=.git/objects
packdir=$objdir/pack
diff --git a/t/t5530-upload-pack-error.sh b/t/t5530-upload-pack-error.sh
index 7172780d550..356b96cb741 100755
--- a/t/t5530-upload-pack-error.sh
+++ b/t/t5530-upload-pack-error.sh
@@ -35,6 +35,12 @@ test_expect_success 'upload-pack fails due to error in pack-objects packing' '
hexsz=$(test_oid hexsz) &&
printf "%04xwant %s\n00000009done\n0000" \
$(($hexsz + 10)) $head >input &&
+
+ # The current implementation of path-walk causes a different
+ # error message. This will be changed by a future refactoring.
+ GIT_TEST_PACK_PATH_WALK=0 &&
+ export GIT_TEST_PACK_PATH_WALK &&
+
test_must_fail git upload-pack . <input >/dev/null 2>output.err &&
test_grep "unable to read" output.err &&
test_grep "pack-objects died" output.err
diff --git a/t/t7406-submodule-update.sh b/t/t7406-submodule-update.sh
index 297c6c3b5cc..d2284e67d3d 100755
--- a/t/t7406-submodule-update.sh
+++ b/t/t7406-submodule-update.sh
@@ -1093,12 +1093,16 @@ test_expect_success 'submodule update --quiet passes quietness to fetch with a s
) &&
git clone super4 super5 &&
(cd super5 &&
+ # This test variable will create a "warning" message to stderr
+ GIT_TEST_PACK_PATH_WALK=0 \
git submodule update --quiet --init --depth=1 submodule3 >out 2>err &&
test_must_be_empty out &&
test_must_be_empty err
) &&
git clone super4 super6 &&
(cd super6 &&
+ # This test variable will create a "warning" message to stderr
+ GIT_TEST_PACK_PATH_WALK=0 \
git submodule update --init --depth=1 submodule3 >out 2>err &&
test_file_not_empty out &&
test_file_not_empty err
--
gitgitgadget
* [PATCH 12/17] repack: add --path-walk option
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (10 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 11/17] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:11 ` [PATCH 13/17] repack: update usage to match docs Derrick Stolee via GitGitGadget
` (5 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.
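The pass-through itself is small. As a rough illustration only, the sketch
below models how an option struct could be turned into child-process
arguments. The names `pack_args_sketch` and `build_pack_objects_args` are
hypothetical and are not part of builtin/repack.c, which uses the strvec
API instead of a fixed array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the options 'git repack' forwards. */
struct pack_args_sketch {
	int no_reuse_delta; /* set by -f */
	int path_walk;      /* set by --path-walk */
	int local;          /* set by -l */
};

/*
 * Fill out[] with the child's arguments and a terminating NULL.
 * Returns the number of arguments written, not counting the NULL.
 * The caller must provide room for at least 5 entries.
 */
static size_t build_pack_objects_args(const struct pack_args_sketch *args,
				      const char **out, size_t cap)
{
	size_t n = 0;

	assert(cap >= 5);
	out[n++] = "pack-objects";
	if (args->no_reuse_delta)
		out[n++] = "--no-reuse-delta";
	if (args->path_walk)
		out[n++] = "--path-walk";
	if (args->local)
		out[n++] = "--local";
	out[n] = NULL;
	return n;
}
```

With `no_reuse_delta` and `path_walk` set, the resulting vector corresponds
to the 'git repack -adf --path-walk' case measured in p5313.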
In my copy of the Git repository, the new tests in p5313 show these
results:
Test                                    this tree
-------------------------------------------------------------
5313.10: repack                         27.88(150.23+2.70)
5313.11: repack size                    228.2M
5313.12: repack with --path-walk        134.59(148.77+0.81)
5313.13: repack size with --path-walk   209.7M
Note that the 'git pack-objects --path-walk' feature is not integrated
with threads. Look forward to a future change that will introduce
threading to improve the time performance of this feature with
equivalent space performance.
The microsoft/fluentui repo [1] had some interesting aspects for the
previous tests in p5313, so here are the repack results:
Test                                    this tree
-------------------------------------------------------------
5313.10: repack                         91.76(680.94+2.48)
5313.11: repack size                    439.1M
5313.12: repack with --path-walk        110.35(130.46+0.74)
5313.13: repack size with --path-walk   155.3M
[1] https://github.com/microsoft/fluentui
Here, we see the significant improvement of a full repack using this
strategy. The name-hash collisions in this repo cause the space
problems. Those collisions also cause the repack command to spend a lot
of cycles trying to find delta bases among files that are not actually
very similar, so the lack of threading with the --path-walk feature is
less pronounced in the process time.
For the Linux kernel repository, we have these stats:
Test                                    this tree
---------------------------------------------------------------
5313.10: repack                         553.61(1929.41+30.31)
5313.11: repack size                    2.5G
5313.12: repack with --path-walk        1777.63(2044.16+7.47)
5313.13: repack size with --path-walk   2.5G
This demonstrates that the --path-walk feature does not always present
measurable improvements, especially in cases where the name-hash has
very few collisions.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/git-repack.txt | 17 ++++++++++++++++-
builtin/repack.c | 9 ++++++++-
t/perf/p5313-pack-objects.sh | 18 ++++++++++++++++++
3 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index c902512a9e8..4ec59cd27b1 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository
SYNOPSIS
--------
[verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
+ [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+ [--write-midx] [--path-walk]
DESCRIPTION
-----------
@@ -249,6 +251,19 @@ linkgit:git-multi-pack-index[1]).
Write a multi-pack index (see linkgit:git-multi-pack-index[1])
containing the non-redundant packs.
+--path-walk::
+ This option passes the `--path-walk` option to the underlying
+ `git pack-objects` process (see linkgit:git-pack-objects[1]).
+ By default, `git pack-objects` walks objects in an order that
+ presents trees and blobs in an order unrelated to the path they
+ appear relative to a commit's root tree. The `--path-walk` option
+ enables a different walking algorithm that organizes trees and
+ blobs by path. This has the potential to improve delta compression
+ especially in the presence of filenames that cause collisions in
+ Git's default name-hash algorithm. Due to changing how the objects
+ are walked, this option is not compatible with `--delta-islands`
+ or `--filter`.
+
CONFIGURATION
-------------
diff --git a/builtin/repack.c b/builtin/repack.c
index cb4420f0856..af3f218ced7 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -39,7 +39,9 @@ static int run_update_server_info = 1;
static char *packdir, *packtmp_name, *packtmp;
static const char *const git_repack_usage[] = {
- N_("git repack [<options>]"),
+ N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
+ "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
+ "[--write-midx] [--full-path-walk]"),
NULL
};
@@ -58,6 +60,7 @@ struct pack_objects_args {
int no_reuse_object;
int quiet;
int local;
+ int path_walk;
struct list_objects_filter_options filter_options;
};
@@ -289,6 +292,8 @@ static void prepare_pack_objects(struct child_process *cmd,
strvec_pushf(&cmd->args, "--no-reuse-delta");
if (args->no_reuse_object)
strvec_pushf(&cmd->args, "--no-reuse-object");
+ if (args->path_walk)
+ strvec_pushf(&cmd->args, "--path-walk");
if (args->local)
strvec_push(&cmd->args, "--local");
if (args->quiet)
@@ -1182,6 +1187,8 @@ int cmd_repack(int argc,
N_("pass --no-reuse-delta to git-pack-objects")),
OPT_BOOL('F', NULL, &po_args.no_reuse_object,
N_("pass --no-reuse-object to git-pack-objects")),
+ OPT_BOOL(0, "path-walk", &po_args.path_walk,
+ N_("pass --path-walk to git-pack-objects")),
OPT_NEGBIT('n', NULL, &run_update_server_info,
N_("do not run git-update-server-info"), 1),
OPT__QUIET(&po_args.quiet, N_("be quiet")),
diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index 840075f5691..b588066ddb0 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -56,4 +56,22 @@ test_size 'big pack size with --path-walk' '
test_file_size out
'
+test_perf 'repack' '
+ git repack -adf
+'
+
+test_size 'repack size' '
+ pack=$(ls .git/objects/pack/pack-*.pack) &&
+ test_file_size "$pack"
+'
+
+test_perf 'repack with --path-walk' '
+ git repack -adf --path-walk
+'
+
+test_size 'repack size with --path-walk' '
+ pack=$(ls .git/objects/pack/pack-*.pack) &&
+ test_file_size "$pack"
+'
+
test_done
--
gitgitgadget
* [PATCH 13/17] repack: update usage to match docs
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (11 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 12/17] repack: add --path-walk option Derrick Stolee via GitGitGadget
@ 2024-10-08 14:11 ` Derrick Stolee via GitGitGadget
2024-10-08 14:12 ` [PATCH 14/17] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
` (4 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:11 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The t0450 test script verifies that the builtin usage matches the
synopsis in the documentation. Update 'git repack' to match and remove
it from the exception list.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
builtin/repack.c | 2 +-
t/t0450/txt-help-mismatches | 1 -
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index af3f218ced7..50f208b48b4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -41,7 +41,7 @@ static char *packdir, *packtmp_name, *packtmp;
static const char *const git_repack_usage[] = {
N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
"[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
- "[--write-midx] [--full-path-walk]"),
+ "[--write-midx] [--path-walk]"),
NULL
};
diff --git a/t/t0450/txt-help-mismatches b/t/t0450/txt-help-mismatches
index 285ae81a6b5..06b469bdee2 100644
--- a/t/t0450/txt-help-mismatches
+++ b/t/t0450/txt-help-mismatches
@@ -44,7 +44,6 @@ rebase
remote
remote-ext
remote-fd
-repack
reset
restore
rev-parse
--
gitgitgadget
* [PATCH 14/17] pack-objects: enable --path-walk via config
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (12 preceding siblings ...)
2024-10-08 14:11 ` [PATCH 13/17] repack: update usage to match docs Derrick Stolee via GitGitGadget
@ 2024-10-08 14:12 ` Derrick Stolee via GitGitGadget
2024-10-08 14:12 ` [PATCH 15/17] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
` (3 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:12 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.
This should be limited to client repositories: the --path-walk option
disables bitmap walks, so it would be harmful on Git servers that serve
fetches and clones. It may still be helpful to consider when repacking the
repository, to take advantage of improved deltas across historical versions
of the same files.
Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".
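To make the resulting precedence easier to follow, here is a minimal
sketch of the decision that cmd_pack_objects() makes after this change. It
models only the hunk in this patch; the struct `path_walk_inputs` and the
function `resolve_path_walk` are hypothetical names used for illustration,
and the environment variable is assumed to be already parsed into a
boolean.

```c
#include <assert.h>

/* Hypothetical stand-in for the state cmd_pack_objects() consults. */
struct path_walk_inputs {
	int cli_value;             /* -1 unless --[no-]path-walk was given */
	int use_bitmap_index;      /* > 0 when a bitmap walk was requested */
	int use_internal_rev_list; /* nonzero with --revs style input */
	int have_gitdir;           /* nonzero when a repository is present */
	int config_value;          /* resolved pack.usePathWalk setting */
	int env_value;             /* parsed GIT_TEST_PACK_PATH_WALK, else 0 */
};

/* Returns 1 if the path-walk should be used, 0 otherwise. */
static int resolve_path_walk(const struct path_walk_inputs *in)
{
	assert(in);

	/* An explicit command-line choice is never overridden here. */
	if (in->cli_value >= 0)
		return in->cli_value;

	/* Bitmap walks and non-rev-list input disable the default. */
	if (in->use_bitmap_index > 0 || !in->use_internal_rev_list)
		return 0;

	/* Repository settings come next. */
	if (in->have_gitdir && in->config_value)
		return 1;

	/* Finally, fall back to the test environment variable. */
	return in->env_value;
}
```

The ordering means the config only ever supplies a default: it cannot
re-enable the path-walk once a bitmap walk has been requested.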
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/config/feature.txt | 4 ++++
Documentation/config/pack.txt | 8 ++++++++
builtin/pack-objects.c | 3 +++
repo-settings.c | 3 +++
repo-settings.h | 1 +
5 files changed, 19 insertions(+)
diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index f061b64b748..cb49ff2604a 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -20,6 +20,10 @@ walking fewer objects.
+
* `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
reusing objects from multiple packs instead of just one.
++
+* `pack.usePathWalk` may speed up packfile creation and make the packfiles be
+significantly smaller in the presence of certain filename collisions with Git's
+default name-hash.
feature.manyFiles::
Enable config options that optimize for repos with many files in the
diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index da527377faf..08d06271177 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -155,6 +155,14 @@ pack.useSparse::
commits contain certain types of direct renames. Default is
`true`.
+pack.usePathWalk::
+ When true, git will default to using the '--path-walk' option in
+ 'git pack-objects' when the '--revs' option is present. This
+ algorithm groups objects by path to maximize the ability to
+ compute delta chains across historical versions of the same
+ object. This may disable other options, such as using bitmaps to
+ enumerate objects.
+
pack.preferBitmapTips::
When selecting which commits will receive bitmaps, prefer a
commit at the tip of any reference that is a suffix of any value
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b97bec5661e..6805a55c60d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4587,6 +4587,9 @@ int cmd_pack_objects(int argc,
if (use_bitmap_index > 0 ||
!use_internal_rev_list)
path_walk = 0;
+ else if (the_repository->gitdir &&
+ the_repository->settings.pack_use_path_walk)
+ path_walk = 1;
else
path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
}
diff --git a/repo-settings.c b/repo-settings.c
index 4699b4b3650..d8123b9323d 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -45,11 +45,13 @@ void prepare_repo_settings(struct repository *r)
r->settings.fetch_negotiation_algorithm = FETCH_NEGOTIATION_SKIPPING;
r->settings.pack_use_bitmap_boundary_traversal = 1;
r->settings.pack_use_multi_pack_reuse = 1;
+ r->settings.pack_use_path_walk = 1;
}
if (manyfiles) {
r->settings.index_version = 4;
r->settings.index_skip_hash = 1;
r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
+ r->settings.pack_use_path_walk = 1;
}
/* Commit graph config or default, does not cascade (simple) */
@@ -64,6 +66,7 @@ void prepare_repo_settings(struct repository *r)
/* Boolean config or default, does not cascade (simple) */
repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
+ repo_cfg_bool(r, "pack.usepathwalk", &r->settings.pack_use_path_walk, 0);
repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
diff --git a/repo-settings.h b/repo-settings.h
index 51d6156a117..ae5c74ba60d 100644
--- a/repo-settings.h
+++ b/repo-settings.h
@@ -53,6 +53,7 @@ struct repo_settings {
enum untracked_cache_setting core_untracked_cache;
int pack_use_sparse;
+ int pack_use_path_walk;
enum fetch_negotiation_setting fetch_negotiation_algorithm;
int core_multi_pack_index;
--
gitgitgadget
* [PATCH 15/17] scalar: enable path-walk during push via config
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (13 preceding siblings ...)
2024-10-08 14:12 ` [PATCH 14/17] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
@ 2024-10-08 14:12 ` Derrick Stolee via GitGitGadget
2024-10-08 14:12 ` [PATCH 16/17] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
` (2 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:12 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
scalar.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/scalar.c b/scalar.c
index 73b79a5d4c9..7db9c5bbfe0 100644
--- a/scalar.c
+++ b/scalar.c
@@ -170,6 +170,7 @@ static int set_recommended_config(int reconfigure)
{ "core.autoCRLF", "false" },
{ "core.safeCRLF", "false" },
{ "fetch.showForcedUpdates", "false" },
+ { "pack.usePathWalk", "true" },
{ NULL, NULL },
};
int i;
--
gitgitgadget
* [PATCH 16/17] pack-objects: refactor path-walk delta phase
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (14 preceding siblings ...)
2024-10-08 14:12 ` [PATCH 15/17] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
@ 2024-10-08 14:12 ` Derrick Stolee via GitGitGadget
2024-10-08 14:12 ` [PATCH 17/17] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:12 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computing deltas.
Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.
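The bookkeeping can be sketched in isolation as follows. This is a
simplified model under stated assumptions, not the code in this patch: the
names `region_list`, `region_list_append`, `region_list_total` and
`region_list_count_objects` are hypothetical, plain realloc() stands in for
ALLOC_GROW, and regions are assumed to be appended in increasing order of
their start offset.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A contiguous run of objects that share a path. */
struct region {
	uint32_t start; /* index of the first object in the run */
	uint32_t nr;    /* number of objects in the run */
};

struct region_list {
	struct region *items;
	size_t nr, alloc;
};

/* Append a region, growing the array as needed. Returns 0 or -1. */
static int region_list_append(struct region_list *list,
			      uint32_t start, uint32_t nr)
{
	assert(list);
	if (list->nr == list->alloc) {
		size_t new_alloc = list->alloc ? 2 * list->alloc : 16;
		struct region *grown = realloc(list->items,
					       new_alloc * sizeof(*grown));
		if (!grown)
			return -1;
		list->items = grown;
		list->alloc = new_alloc;
	}
	list->items[list->nr].start = start;
	list->items[list->nr].nr = nr;
	list->nr++;
	return 0;
}

/* Progress total: one past the last object of the last region. */
static uint32_t region_list_total(const struct region_list *list)
{
	const struct region *last;

	if (!list->nr)
		return 0;
	last = &list->items[list->nr - 1];
	return last->start + last->nr;
}

/* Later pass: visit every region in order and count its objects. */
static uint32_t region_list_count_objects(const struct region_list *list)
{
	uint32_t total = 0;

	for (size_t i = 0; i < list->nr; i++)
		total += list->items[i].nr;
	return total;
}
```

Recording only (start, nr) pairs keeps the walk callback cheap; the
per-region sorting and delta search can then run in a separate phase with
its own progress indicator.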
This presents a new progress indicator that can be used in tests to
verify that this stage is happening.
The current implementation is not integrated with threads, but that
could be done in a future update.
Since we do not attempt to sort objects by size until after exploring
all trees, we can revert the earlier change to t5530, which was only
needed because a different error message appeared first.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
builtin/pack-objects.c | 81 +++++++++++++++++++++++++-----------
pack-objects.h | 12 ++++++
t/t5300-pack-object.sh | 8 +++-
t/t5530-upload-pack-error.sh | 6 ---
4 files changed, 74 insertions(+), 33 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6805a55c60d..5c413ac07e6 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3204,6 +3204,50 @@ static int should_attempt_deltas(struct object_entry *entry)
return 1;
}
+static void find_deltas_for_region(struct object_entry *list UNUSED,
+ struct packing_region *region,
+ unsigned int *processed)
+{
+ struct object_entry **delta_list;
+ uint32_t delta_list_nr = 0;
+
+ ALLOC_ARRAY(delta_list, region->nr);
+ for (uint32_t i = 0; i < region->nr; i++) {
+ struct object_entry *entry = to_pack.objects + region->start + i;
+ if (should_attempt_deltas(entry))
+ delta_list[delta_list_nr++] = entry;
+ }
+
+ QSORT(delta_list, delta_list_nr, type_size_sort);
+ find_deltas(delta_list, &delta_list_nr, window, depth, processed);
+ free(delta_list);
+}
+
+static void find_deltas_by_region(struct object_entry *list,
+ struct packing_region *regions,
+ uint32_t start, uint32_t nr)
+{
+ unsigned int processed = 0;
+ uint32_t progress_nr;
+
+ if (!nr)
+ return;
+
+ progress_nr = regions[nr - 1].start + regions[nr - 1].nr;
+
+ if (progress)
+ progress_state = start_progress(_("Compressing objects by path"),
+ progress_nr);
+
+ while (nr--)
+ find_deltas_for_region(list,
+ &regions[start++],
+ &processed);
+
+ display_progress(progress_state, progress_nr);
+ stop_progress(&progress_state);
+}
+
static void prepare_pack(int window, int depth)
{
struct object_entry **delta_list;
@@ -3228,6 +3272,10 @@ static void prepare_pack(int window, int depth)
if (!to_pack.nr_objects || !window || !depth)
return;
+ if (path_walk)
+ find_deltas_by_region(to_pack.objects, to_pack.regions,
+ 0, to_pack.nr_regions);
+
ALLOC_ARRAY(delta_list, to_pack.nr_objects);
nr_deltas = n = 0;
@@ -4165,10 +4213,8 @@ static int add_objects_by_path(const char *path,
enum object_type type,
void *data)
{
- struct object_entry **delta_list;
size_t oe_start = to_pack.nr_objects;
size_t oe_end;
- unsigned int sub_list_size;
unsigned int *processed = data;
/*
@@ -4201,32 +4247,17 @@ static int add_objects_by_path(const char *path,
if (oe_end == oe_start || !window)
return 0;
- sub_list_size = 0;
- ALLOC_ARRAY(delta_list, oe_end - oe_start);
+ ALLOC_GROW(to_pack.regions,
+ to_pack.nr_regions + 1,
+ to_pack.nr_regions_alloc);
- for (size_t i = 0; i < oe_end - oe_start; i++) {
- struct object_entry *entry = to_pack.objects + oe_start + i;
+ to_pack.regions[to_pack.nr_regions].start = oe_start;
+ to_pack.regions[to_pack.nr_regions].nr = oe_end - oe_start;
+ to_pack.nr_regions++;
- if (!should_attempt_deltas(entry))
- continue;
+ *processed += oids->nr;
+ display_progress(progress_state, *processed);
- delta_list[sub_list_size++] = entry;
- }
-
- /*
- * Find delta bases among this list of objects that all match the same
- * path. This causes the delta compression to be interleaved in the
- * object walk, which can lead to confusing progress indicators. This is
- * also incompatible with threaded delta calculations. In the future,
- * consider creating a list of regions in the full to_pack.objects array
- * that could be picked up by the threaded delta computation.
- */
- if (sub_list_size && window) {
- QSORT(delta_list, sub_list_size, type_size_sort);
- find_deltas(delta_list, &sub_list_size, window, depth, processed);
- }
-
- free(delta_list);
return 0;
}
diff --git a/pack-objects.h b/pack-objects.h
index b9898a4e64b..bde4ba19755 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -118,11 +118,23 @@ struct object_entry {
unsigned ext_base:1; /* delta_idx points outside packlist */
};
+/**
+ * A packing region is a section of the packing_data.objects array
+ * as given by a starting index and a number of elements.
+ */
+struct packing_region {
+ uint32_t start;
+ uint32_t nr;
+};
+
struct packing_data {
struct repository *repo;
struct object_entry *objects;
uint32_t nr_objects, nr_alloc;
+ struct packing_region *regions;
+ uint32_t nr_regions, nr_regions_alloc;
+
int32_t *index;
uint32_t index_size;
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 5f6914acae7..4f81613eab1 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -677,7 +677,9 @@ done
# Basic "repack everything" test
test_expect_success '--path-walk pack everything' '
git -C server rev-parse HEAD >in &&
- git -C server pack-objects --stdout --revs --path-walk <in >out.pack &&
+ GIT_PROGRESS_DELAY=0 git -C server pack-objects \
+ --stdout --revs --path-walk --progress <in >out.pack 2>err &&
+ grep "Compressing objects by path" err &&
git -C server index-pack --stdin <out.pack
'
@@ -687,7 +689,9 @@ test_expect_success '--path-walk thin pack' '
$(git -C server rev-parse HEAD)
^$(git -C server rev-parse HEAD~2)
EOF
- git -C server pack-objects --thin --stdout --revs --path-walk <in >out.pack &&
+ GIT_PROGRESS_DELAY=0 git -C server pack-objects \
+ --thin --stdout --revs --path-walk --progress <in >out.pack 2>err &&
+ grep "Compressing objects by path" err &&
git -C server index-pack --fix-thin --stdin <out.pack
'
diff --git a/t/t5530-upload-pack-error.sh b/t/t5530-upload-pack-error.sh
index 356b96cb741..7172780d550 100755
--- a/t/t5530-upload-pack-error.sh
+++ b/t/t5530-upload-pack-error.sh
@@ -35,12 +35,6 @@ test_expect_success 'upload-pack fails due to error in pack-objects packing' '
hexsz=$(test_oid hexsz) &&
printf "%04xwant %s\n00000009done\n0000" \
$(($hexsz + 10)) $head >input &&
-
- # The current implementation of path-walk causes a different
- # error message. This will be changed by a future refactoring.
- GIT_TEST_PACK_PATH_WALK=0 &&
- export GIT_TEST_PACK_PATH_WALK &&
-
test_must_fail git upload-pack . <input >/dev/null 2>output.err &&
test_grep "unable to read" output.err &&
test_grep "pack-objects died" output.err
--
gitgitgadget
* [PATCH 17/17] pack-objects: thread the path-based compression
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (15 preceding siblings ...)
2024-10-08 14:12 ` [PATCH 16/17] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
@ 2024-10-08 14:12 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-08 14:12 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.
This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending a batch based on
name-hash, the way ll_find_deltas() tries to keep sections of the object
entry array grouped. We re-use the 'list_size' and 'remaining' members to
let a thread that finishes its batch early borrow unprocessed work from
another "victim" thread.
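The split and the borrowing step can be pictured with the following
standalone sketch. It models only the arithmetic, with no locking or
threads, and 'struct slice', 'partition_regions' and 'steal_half' are
hypothetical names rather than anything in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-worker view of the regions array. */
struct slice {
	uint32_t first;     /* index of the next region to process */
	uint32_t remaining; /* regions not yet processed */
};

/*
 * Split 'nr' regions among 'threads' workers. Dividing what is still
 * unassigned by the number of workers still unassigned pushes any
 * remainder toward the later workers.
 */
void partition_regions(struct slice *s, int threads, uint32_t nr)
{
	uint32_t next = 0;

	assert(threads > 0);
	for (int i = 0; i < threads; i++) {
		uint32_t sub_size = nr / (uint32_t)(threads - i);

		s[i].first = next;
		s[i].remaining = sub_size;
		next += sub_size;
		nr -= sub_size;
	}
}

/*
 * Hand the back half of the victim's unprocessed regions to an idle
 * worker. Returns the number of regions moved.
 */
uint32_t steal_half(struct slice *victim, struct slice *idle)
{
	uint32_t sub_size = victim->remaining / 2;

	idle->first = victim->first + victim->remaining - sub_size;
	idle->remaining = sub_size;
	victim->remaining -= sub_size;
	return sub_size;
}
```

In the patch, a victim is only chosen when it has more than 2*window
regions remaining, and all of these updates happen under the progress
lock.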
Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)
Test                                 HEAD~1     HEAD
-------------------------------------------------------------
5313.2: thin pack                      0.01     0.01    +0.0%
5313.4: thin pack with --path-walk     0.01     0.01    +0.0%
5313.6: big pack                       2.54     2.60    +2.4%
5313.8: big pack with --path-walk      4.70     3.09   -34.3%
5313.10: repack                       28.75    28.55    -0.7%
5313.12: repack with --path-walk     108.55    46.14   -57.5%
On the microsoft/fluentui repo, where the --path-walk feature has been
shown to be more effective in space savings, we get these results:
Test                                 HEAD~1     HEAD
-------------------------------------------------------------
5313.2: thin pack                      0.39     0.40    +2.6%
5313.4: thin pack with --path-walk     0.08     0.07   -12.5%
5313.6: big pack                       4.15     4.15    +0.0%
5313.8: big pack with --path-walk      6.41     3.21   -49.9%
5313.10: repack                       90.69    90.83    +0.2%
5313.12: repack with --path-walk     108.23    49.09   -54.6%
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
builtin/pack-objects.c | 162 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 160 insertions(+), 2 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5c413ac07e6..443ce17063a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2935,6 +2935,7 @@ static void find_deltas(struct object_entry **list, unsigned *list_size,
struct thread_params {
pthread_t thread;
struct object_entry **list;
+ struct packing_region *regions;
unsigned list_size;
unsigned remaining;
int window;
@@ -3248,6 +3249,163 @@ static void find_deltas_by_region(struct object_entry *list,
stop_progress(&progress_state);
}
+static void *threaded_find_deltas_by_path(void *arg)
+{
+ struct thread_params *me = arg;
+
+ progress_lock();
+ while (me->remaining) {
+ while (me->remaining) {
+ progress_unlock();
+ find_deltas_for_region(to_pack.objects,
+ me->regions,
+ me->processed);
+ progress_lock();
+ me->remaining--;
+ me->regions++;
+ }
+
+ me->working = 0;
+ pthread_cond_signal(&progress_cond);
+ progress_unlock();
+
+ /*
+ * We must not set ->data_ready before we wait on the
+ * condition because the main thread may have set it to 1
+ * before we get here. In order to be sure that new
+ * work is available if we see 1 in ->data_ready, it
+ * was initialized to 0 before this thread was spawned
+ * and we reset it to 0 right away.
+ */
+ pthread_mutex_lock(&me->mutex);
+ while (!me->data_ready)
+ pthread_cond_wait(&me->cond, &me->mutex);
+ me->data_ready = 0;
+ pthread_mutex_unlock(&me->mutex);
+
+ progress_lock();
+ }
+ progress_unlock();
+ /* leave ->working 1 so that this doesn't get more work assigned */
+ return NULL;
+}
+
+static void ll_find_deltas_by_region(struct object_entry *list,
+ struct packing_region *regions,
+ uint32_t start, uint32_t nr)
+{
+ struct thread_params *p;
+ int i, ret, active_threads = 0;
+ unsigned int processed = 0;
+ uint32_t progress_nr;
+ init_threaded_search();
+
+ if (!nr)
+ return;
+
+ progress_nr = regions[nr - 1].start + regions[nr - 1].nr;
+ if (delta_search_threads <= 1) {
+ find_deltas_by_region(list, regions, start, nr);
+ cleanup_threaded_search();
+ return;
+ }
+
+ if (progress > pack_to_stdout)
+ fprintf_ln(stderr, _("Path-based delta compression using up to %d threads"),
+ delta_search_threads);
+ CALLOC_ARRAY(p, delta_search_threads);
+
+ if (progress)
+ progress_state = start_progress(_("Compressing objects by path"),
+ progress_nr);
+ /* Partition the work amongst work threads. */
+ for (i = 0; i < delta_search_threads; i++) {
+ unsigned sub_size = nr / (delta_search_threads - i);
+
+ p[i].window = window;
+ p[i].depth = depth;
+ p[i].processed = &processed;
+ p[i].working = 1;
+ p[i].data_ready = 0;
+
+ p[i].regions = regions;
+ p[i].list_size = sub_size;
+ p[i].remaining = sub_size;
+
+ regions += sub_size;
+ nr -= sub_size;
+ }
+
+ /* Start work threads. */
+ for (i = 0; i < delta_search_threads; i++) {
+ if (!p[i].list_size)
+ continue;
+ pthread_mutex_init(&p[i].mutex, NULL);
+ pthread_cond_init(&p[i].cond, NULL);
+ ret = pthread_create(&p[i].thread, NULL,
+ threaded_find_deltas_by_path, &p[i]);
+ if (ret)
+ die(_("unable to create thread: %s"), strerror(ret));
+ active_threads++;
+ }
+
+ /*
+ * Now let's wait for work completion. Each time a thread is done
+ * with its work, we steal half of the remaining work from the
+ * thread with the largest number of unprocessed objects and give
+ * it to that newly idle thread. This ensures good load balancing
+ * until the remaining object list segments are simply too short
+ * to be worth splitting anymore.
+ */
+ while (active_threads) {
+ struct thread_params *target = NULL;
+ struct thread_params *victim = NULL;
+ unsigned sub_size = 0;
+
+ progress_lock();
+ for (;;) {
+ for (i = 0; !target && i < delta_search_threads; i++)
+ if (!p[i].working)
+ target = &p[i];
+ if (target)
+ break;
+ pthread_cond_wait(&progress_cond, &progress_mutex);
+ }
+
+ for (i = 0; i < delta_search_threads; i++)
+ if (p[i].remaining > 2*window &&
+ (!victim || victim->remaining < p[i].remaining))
+ victim = &p[i];
+ if (victim) {
+ sub_size = victim->remaining / 2;
+ target->regions = victim->regions + victim->remaining - sub_size;
+ victim->list_size -= sub_size;
+ victim->remaining -= sub_size;
+ }
+ target->list_size = sub_size;
+ target->remaining = sub_size;
+ target->working = 1;
+ progress_unlock();
+
+ pthread_mutex_lock(&target->mutex);
+ target->data_ready = 1;
+ pthread_cond_signal(&target->cond);
+ pthread_mutex_unlock(&target->mutex);
+
+ if (!sub_size) {
+ pthread_join(target->thread, NULL);
+ pthread_cond_destroy(&target->cond);
+ pthread_mutex_destroy(&target->mutex);
+ active_threads--;
+ }
+ }
+ cleanup_threaded_search();
+ free(p);
+
+ display_progress(progress_state, progress_nr);
+ stop_progress(&progress_state);
+}
+
static void prepare_pack(int window, int depth)
{
struct object_entry **delta_list;
@@ -3273,8 +3431,8 @@ static void prepare_pack(int window, int depth)
return;
if (path_walk)
- find_deltas_by_region(to_pack.objects, to_pack.regions,
- 0, to_pack.nr_regions);
+ ll_find_deltas_by_region(to_pack.objects, to_pack.regions,
+ 0, to_pack.nr_regions);
ALLOC_ARRAY(delta_list, to_pack.nr_objects);
nr_deltas = n = 0;
--
gitgitgadget
* [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-08 14:11 [PATCH 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (16 preceding siblings ...)
2024-10-08 14:12 ` [PATCH 17/17] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 01/17] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
` (17 more replies)
17 siblings, 18 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee
This is a reviewable series introducing the path-walk API and its
application within the 'git pack-objects --path-walk' machinery. This API
was previously discussed in the path-walk RFC [1] and the patch series
around the --full-name-hash option (branch ds/pack-name-hash-tweak) [2].
This series conflicts with ds/pack-name-hash-tweak, but that was on hold
because it did not seem as critical based on community interest.
[1]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com
[2]
https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com
The primary motivation for this feature is its use to shrink the packfile
created by 'git push' when there are many name-hash collisions. This need
was discovered in several JavaScript repositories that use the beachball
tool [3] to generate CHANGELOG.md and CHANGELOG.json files. When a batch of
these files is created at the same time and pushed to a release branch, the
'git pack-objects' process has many collisions among these files and delta
bases are selected poorly.
[3] https://github.com/microsoft/beachball
In some cases, 'git push' is pushing 60-100 MB when the new path-walk
algorithm will identify better delta bases and pack the same objects into a
thin pack less than 1 MB. This was the most extreme example we could find
and is present in a private monorepo. However, the microsoft/fluentui repo
[4] is a good example for demonstrating similar improvements. The patch
descriptions frequently refer to this repo and which commit should be
checked out to reproduce this behavior.
[4] https://github.com/microsoft/fluentui
The path-walk API is a new way to walk objects. Traditionally, the revision
API walks objects by visiting a commit, then visiting its root tree and
recursing through trees to visit reachable objects that were not previously
visited. The path-walk API visits objects in batches based on the path
walked from a root tree to that object. (Only the first discovered path is
chosen; this avoids certain kinds of Git bombs.)
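To make the batching rule concrete, here is a toy model (not the
path-walk implementation): objects are small integers standing in for
object IDs, and each one is assigned to the first path at which it is
discovered. The names 'struct discovery', 'assign_first_paths' and
'batch_size' are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OBJECTS 64

/* One (path, object) pair as discovered while expanding trees. */
struct discovery {
	const char *path;
	int object; /* stand-in for an object ID, 0 <= object < MAX_OBJECTS */
};

/*
 * Assign each object to the first path at which it is discovered.
 * 'first_path[obj]' stays NULL for objects that never appear.
 * Returns the number of distinct objects.
 */
size_t assign_first_paths(const struct discovery *found, size_t nr,
			  const char *first_path[])
{
	size_t distinct = 0;

	for (int obj = 0; obj < MAX_OBJECTS; obj++)
		first_path[obj] = NULL;

	for (size_t i = 0; i < nr; i++) {
		int obj = found[i].object;

		assert(obj >= 0 && obj < MAX_OBJECTS);
		if (first_path[obj])
			continue; /* plays the role of a SEEN flag */
		first_path[obj] = found[i].path;
		distinct++;
	}
	return distinct;
}

/* Count the objects that ended up in the batch for 'path'. */
size_t batch_size(const char *first_path[], const char *path)
{
	size_t n = 0;

	for (int obj = 0; obj < MAX_OBJECTS; obj++)
		if (first_path[obj] && !strcmp(first_path[obj], path))
			n++;
	return n;
}
```

Because an object is claimed by exactly one path, the total work is
bounded by the number of distinct objects no matter how many paths lead
to each of them.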
This has an immediate application to 'git pack-objects'.
When using the traditional revision API to walk objects, each object is
emitted with an associated path. Since this path may appear for many objects
spread across the full list, a heuristic is used: the "name-hash" is stored
for that object instead of the full path name. This name-hash will group
objects at the same path together, but also has a form of "locality" to
group likely-similar objects together. When there are few collisions in the
name-hash function, this helps compress objects that appear at the same path
as well as help compress objects across different paths that have similar
suffixes. When there are many versions of the same path, then finding delta
bases across that family of objects is very important. When there are few
versions of the same path, then finding cross-name delta bases is also
important. The former is important for clones and repacks while the latter
is important for shallow clones. They can both be important for pushes. In
all cases, this approach is fraught when there are many name-hash collisions
as the window size becomes a limiting factor for finding quality delta
bases.
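For readers unfamiliar with the name-hash, the following standalone
sketch reproduces its logic as I understand the existing
pack_name_hash() helper; treat it as an approximation and consult
pack-objects.h for the authoritative version. The function name
'name_hash' is local to this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/*
 * Build a sortable number from roughly the last sixteen
 * non-whitespace characters of a path. Each new character lands in
 * the top byte and earlier ones are shifted down by two bits, so the
 * end of the name carries the most weight.
 */
uint32_t name_hash(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;

	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

Since the leading directories of a long path contribute little or
nothing to the value, many files that share a long suffix (such as
CHANGELOG.json files in sibling directories) tend to sort next to each
other and compete for the same delta window.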
When using the path-walk API to walk objects, we group objects by the same
path from the start. We don't need to store the path name, since we have the
objects already in a group. We can compute deltas within that group and then
use the name-hash approach to re-sort the object list and look for
opportunistic cross-path deltas. Thus, the path-walk approach allows finding
delta bases at least as good as the traditional revision API approach.
(Caveat: if we assume delta reuse and the existing deltas were computed with
the revision API approach, then the path-walk API approach may result in
slightly worse delta compression. The performance tests in this series use
--no-reuse-delta for this reason.)
Once 'git pack-objects --path-walk' exists, we have a few ways to take
advantage of it in other places:
* The new 'pack.usePathWalk' config option will assume the --path-walk
option. This allows 'git push' to assume this value and get the effect we
want. This is similar to the 'pack.useSparse' config option that uses a
similar path-based walk to limit the set of boundary objects.
* 'git repack' learns a '--path-walk' option to pass to its child 'git
pack-objects' process. This is also implied by 'pack.usePathWalk' but
allows for testing without modifying repository config.
I'll repeat the following table of repacking results when using 'git repack
-adf [--path-walk]' on a set of repositories with many name-hash collisions.
Only the microsoft/fluentui repository is publicly available for testing, so
the others are left as Repo B/C/D.
| Repo | Standard Repack | With --path-walk |
|----------|-----------------|------------------|
| fluentui | 438 MB | 148 MB |
| Repo B | 6,255 MB | 778 MB |
| Repo C | 37,737 MB | 6,158 MB |
| Repo D | 130,049 MB | 4,432 MB |
While this series is replacing ds/pack-name-hash-tweak and its introduction
of the --full-name-hash option, it is worth comparing that option to the
--path-walk option.
* The --full-name-hash option is a much smaller code change, as it drops
into the existing uses of the name-hash function.
* The --full-name-hash option is more likely to integrate with server-side
features such as delta islands and reachability bitmaps due to not
changing the object walk. It was already noted that the .bitmap files
store name-hash values, so there is some compatibility required to
integrate with bitmaps. The --path-walk option will be more difficult to
integrate with those options (at least during a repack), but maybe is not
impossible; the --path-walk option will not work when reading
reachability bitmaps, since we are avoiding walking trees entirely.
* The --full-name-hash option is good when there are many name-hash
collisions and many versions of the paths with those collisions. When
creating a shallow clone or certain kinds of pushes, the --full-name-hash
option is much worse at finding cross-path delta bases since it loses the
locality of the standard name-hash function. The --path-walk option
includes a second pass of delta computation using the standard name-hash
function and thus finds good cross-path delta bases when they improve
upon the same-path delta bases.
There are a few differences from the RFC version of this series:
1. The last two patches refactor the approach to perform delta calculations
by path after the object walk and then allow those delta calculations
to happen in a threaded manner.
2. Both 'git pack-objects' and 'git repack' are removed from the t0450
exclusion list.
3. The path-walk API has improved technical documentation that is extended
as its functionality is expanded.
4. Various bugs have been patched with matching tests. The new 'test-tool
path-walk' helper allows for careful testing of the API separately from
its use within other commands.
Updates in v2
=============
I'm sending this v2 to request some review feedback on the series. I'm sorry
it's so long.
There are two updates in this version:
* Fixed a performance issue in the presence of many annotated tags. This is
caught by p5313 when run on a repo with 10,000+ annotated tags.
* The Scalar config was previously wrong and should be pack.usePathWalk,
not push.usePathWalk.
Thanks, - Stolee
Derrick Stolee (17):
path-walk: introduce an object walk by path
t6601: add helper for testing path-walk API
path-walk: allow consumer to specify object types
path-walk: allow visiting tags
revision: create mark_trees_uninteresting_dense()
path-walk: add prune_all_uninteresting option
pack-objects: extract should_attempt_deltas()
pack-objects: add --path-walk option
pack-objects: update usage to match docs
p5313: add performance tests for --path-walk
pack-objects: introduce GIT_TEST_PACK_PATH_WALK
repack: add --path-walk option
repack: update usage to match docs
pack-objects: enable --path-walk via config
scalar: enable path-walk during push via config
pack-objects: refactor path-walk delta phase
pack-objects: thread the path-based compression
Documentation/config/feature.txt | 4 +
Documentation/config/pack.txt | 8 +
Documentation/git-pack-objects.txt | 23 +-
Documentation/git-repack.txt | 17 +-
Documentation/technical/api-path-walk.txt | 73 ++++
Makefile | 2 +
builtin/pack-objects.c | 410 ++++++++++++++++++++--
builtin/repack.c | 9 +-
ci/run-build-and-tests.sh | 1 +
pack-objects.h | 12 +
path-walk.c | 406 +++++++++++++++++++++
path-walk.h | 64 ++++
repo-settings.c | 3 +
repo-settings.h | 1 +
revision.c | 15 +
revision.h | 1 +
scalar.c | 1 +
t/README | 4 +
t/helper/test-path-walk.c | 114 ++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/perf/p5313-pack-objects.sh | 77 ++++
t/t0411-clone-from-partial.sh | 6 +
t/t0450/txt-help-mismatches | 2 -
t/t5300-pack-object.sh | 21 ++
t/t5306-pack-nobase.sh | 5 +
t/t5310-pack-bitmaps.sh | 13 +-
t/t5316-pack-delta-depth.sh | 9 +-
t/t5332-multi-pack-reuse.sh | 7 +
t/t6601-path-walk.sh | 301 ++++++++++++++++
t/t7406-submodule-update.sh | 4 +
31 files changed, 1565 insertions(+), 50 deletions(-)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/perf/p5313-pack-objects.sh
create mode 100755 t/t6601-path-walk.sh
base-commit: e9356ba3ea2a6754281ff7697b3e5a1697b21e24
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1813%2Fderrickstolee%2Fpath-walk-upstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1813/derrickstolee/path-walk-upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1813
Range-diff vs v1:
1: 98bdc94a773 = 1: 98bdc94a773 path-walk: introduce an object walk by path
2: a00ab0c62c9 = 2: a00ab0c62c9 t6601: add helper for testing path-walk API
3: 14375d19392 = 3: 14375d19392 path-walk: allow consumer to specify object types
4: 6f48cddadc0 ! 4: c321f58c62d path-walk: allow visiting tags
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
+ if (obj->type == OBJ_COMMIT || obj->flags & SEEN)
+ continue;
+
-+ obj->flags |= SEEN;
-+
+ while (obj->type == OBJ_TAG) {
+ struct tag *tag = lookup_tag(info->revs->repo,
+ &obj->oid);
-+ if (oid_array_lookup(&tags, &obj->oid) < 0)
++ if (!(obj->flags & SEEN)) {
++ obj->flags |= SEEN;
+ oid_array_append(&tags, &obj->oid);
++ }
+ obj = tag->tagged;
+ }
+
++ if ((obj->flags & SEEN))
++ continue;
++ obj->flags |= SEEN;
++
+ switch (obj->type) {
+ case OBJ_TREE:
-+ if (info->trees &&
-+ oid_array_lookup(&root_tree_list->oids, &obj->oid) < 0)
++ if (info->trees)
+ oid_array_append(&root_tree_list->oids, &obj->oid);
+ break;
+
+ case OBJ_BLOB:
-+ if (info->blobs &&
-+ oid_array_lookup(&tagged_blob_list, &obj->oid) < 0)
++ if (info->blobs)
+ oid_array_append(&tagged_blob_list, &obj->oid);
+ break;
+
5: cd98447f7c8 = 5: 6e89fb219b5 revision: create mark_trees_uninteresting_dense()
6: 214e10a9984 = 6: 238d7d95715 path-walk: add prune_all_uninteresting option
7: cd360ad1040 = 7: 3fdb57edbc5 pack-objects: extract should_attempt_deltas()
8: f8ee11d3003 = 8: a0475c7cba8 pack-objects: add --path-walk option
9: eaeb40980f4 = 9: 73c8b61e87b pack-objects: update usage to match docs
10: 3113ead1e01 = 10: 21dc3723c36 p5313: add performance tests for --path-walk
11: 211a16ae889 = 11: 6f96b1c227a pack-objects: introduce GIT_TEST_PACK_PATH_WALK
12: 507ed0f6f90 = 12: 834c9ea2709 repack: add --path-walk option
13: eae96e8214f = 13: 6ef8d67af4b repack: update usage to match docs
14: 2a9ae02217f = 14: 1db90e361ba pack-objects: enable --path-walk via config
15: adcb3167809 ! 15: 0f3040b4b90 scalar: enable path-walk during push via config
@@ scalar.c: static int set_recommended_config(int reconfigure)
{ "core.autoCRLF", "false" },
{ "core.safeCRLF", "false" },
{ "fetch.showForcedUpdates", "false" },
-+ { "push.usePathWalk", "true" },
++ { "pack.usePathWalk", "true" },
{ NULL, NULL },
};
int i;
16: c3f24dc3647 = 16: 030d8ec238e pack-objects: refactor path-walk delta phase
17: 264affbf058 = 17: fddc320eb0b pack-objects: thread the path-based compression
--
gitgitgadget
* [PATCH v2 01/17] path-walk: introduce an object walk by path
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 02/17] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
` (16 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.
There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 54 +++++
Makefile | 1 +
path-walk.c | 241 ++++++++++++++++++++++
path-walk.h | 43 ++++
4 files changed, 339 insertions(+)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
new file mode 100644
index 00000000000..6472222ae6d
--- /dev/null
+++ b/Documentation/technical/api-path-walk.txt
@@ -0,0 +1,54 @@
+Path-Walk API
+=============
+
+The path-walk API is used to walk reachable objects, but to visit objects
+in batches based on a common path they appear in, or by type.
+
+For example, all reachable commits are visited in a group. All tags are
+visited in a group. Then, all root trees are visited. At some point, all
+blobs reachable via a path `my/dir/to/A` are visited. When there are
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
+When walking a range of commits with some `UNINTERESTING` objects, the
+objects with the `UNINTERESTING` flag are included in these batches. In
+order to walk `UNINTERESTING` objects, the `--boundary` option must be
+used in the commit walk in order to visit `UNINTERESTING` commits.
+
+Basics
+------
+
+To use the path-walk API, include `path-walk.h` and call
+`walk_objects_by_path()` with a customized `path_walk_info` struct. The
+struct is used to set all of the options for how the walk should proceed.
+Let's dig into the different options and their use.
+
+`path_fn` and `path_fn_data`::
+ The most important option is the `path_fn` option, which is a
+ function pointer to the callback that can execute logic on the
+ object IDs for objects grouped by type and path. This function
+ also receives a `data` value that corresponds to the
+ `path_fn_data` member, for providing custom data structures to
+ this callback function.
+
+`revs`::
+ To configure the exact details of the reachable set of objects,
+ use the `revs` member and initialize it using the revision
+ machinery in `revision.h`. Initialize `revs` using calls such as
+ `setup_revisions()` or `parse_revision_opt()`. Do not call
+ `prepare_revision_walk()`, as that will be called within
+ `walk_objects_by_path()`.
++
+It is also important that you do not specify the `--objects` flag for the
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
++
+If you want the path-walk API to emit `UNINTERESTING` objects based on the
+commit walk's boundary, be sure to set `revs.boundary` so the boundary
+commits are emitted.
+
+Examples
+--------
+
+See example usages in future changes.
diff --git a/Makefile b/Makefile
index 7344a7f7257..d0d8d6888e3 100644
--- a/Makefile
+++ b/Makefile
@@ -1094,6 +1094,7 @@ LIB_OBJS += parse-options.o
LIB_OBJS += patch-delta.o
LIB_OBJS += patch-ids.o
LIB_OBJS += path.o
+LIB_OBJS += path-walk.o
LIB_OBJS += pathspec.o
LIB_OBJS += pkt-line.o
LIB_OBJS += preload-index.o
diff --git a/path-walk.c b/path-walk.c
new file mode 100644
index 00000000000..66840187e28
--- /dev/null
+++ b/path-walk.c
@@ -0,0 +1,241 @@
+/*
+ * path-walk.c: implementation for path-based walks of the object graph.
+ */
+#include "git-compat-util.h"
+#include "path-walk.h"
+#include "blob.h"
+#include "commit.h"
+#include "dir.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "object.h"
+#include "oid-array.h"
+#include "revision.h"
+#include "string-list.h"
+#include "strmap.h"
+#include "trace2.h"
+#include "tree.h"
+#include "tree-walk.h"
+
+struct type_and_oid_list
+{
+ enum object_type type;
+ struct oid_array oids;
+};
+
+#define TYPE_AND_OID_LIST_INIT { \
+ .type = OBJ_NONE, \
+ .oids = OID_ARRAY_INIT \
+}
+
+struct path_walk_context {
+ /**
+ * Repeats of data in 'struct path_walk_info' for
+ * access with fewer characters.
+ */
+ struct repository *repo;
+ struct rev_info *revs;
+ struct path_walk_info *info;
+
+ /**
+ * Map a path to a 'struct type_and_oid_list'
+ * containing the objects discovered at that
+ * path.
+ */
+ struct strmap paths_to_lists;
+
+ /**
+ * Store the current list of paths in a stack, to
+ * facilitate depth-first-search without recursion.
+ */
+ struct string_list path_stack;
+};
+
+static int add_children(struct path_walk_context *ctx,
+ const char *base_path,
+ struct object_id *oid)
+{
+ struct tree_desc desc;
+ struct name_entry entry;
+ struct strbuf path = STRBUF_INIT;
+ size_t base_len;
+ struct tree *tree = lookup_tree(ctx->repo, oid);
+
+ if (!tree) {
+ error(_("failed to walk children of tree %s: not found"),
+ oid_to_hex(oid));
+ return -1;
+ } else if (parse_tree_gently(tree, 1)) {
+ die("bad tree object %s", oid_to_hex(oid));
+ }
+
+ strbuf_addstr(&path, base_path);
+ base_len = path.len;
+
+ parse_tree(tree);
+ init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
+ while (tree_entry(&desc, &entry)) {
+ struct type_and_oid_list *list;
+ struct object *o;
+ /* Not actually true, but we will ignore submodules later. */
+ enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB;
+
+ /* Skip submodules. */
+ if (S_ISGITLINK(entry.mode))
+ continue;
+
+ if (type == OBJ_TREE) {
+ struct tree *child = lookup_tree(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else if (type == OBJ_BLOB) {
+ struct blob *child = lookup_blob(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else {
+ /* Wrong type? */
+ continue;
+ }
+
+ if (!o) /* report error?*/
+ continue;
+
+ /* Skip this object if already seen. */
+ if (o->flags & SEEN)
+ continue;
+ o->flags |= SEEN;
+
+ strbuf_setlen(&path, base_len);
+ strbuf_add(&path, entry.path, entry.pathlen);
+
+ /*
+ * Trees will end with "/" for concatenation and distinction
+ * from blobs at the same path.
+ */
+ if (type == OBJ_TREE)
+ strbuf_addch(&path, '/');
+
+ if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = type;
+ strmap_put(&ctx->paths_to_lists, path.buf, list);
+ string_list_append(&ctx->path_stack, path.buf);
+ }
+ oid_array_append(&list->oids, &entry.oid);
+ }
+
+ free_tree_buffer(tree);
+ strbuf_release(&path);
+ return 0;
+}
+
+/*
+ * Call 'path_fn' on the batch of objects stored for 'path'. If the batch
+ * holds trees, expand each tree one level and add every child that is not
+ * yet marked SEEN to the list for its own path.
+ */
+static int walk_path(struct path_walk_context *ctx,
+ const char *path)
+{
+ struct type_and_oid_list *list;
+ int ret = 0;
+
+ list = strmap_get(&ctx->paths_to_lists, path);
+
+ /* Evaluate function pointer on this data. */
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
+
+ /* Expand data for children. */
+ if (list->type == OBJ_TREE) {
+ for (size_t i = 0; i < list->oids.nr; i++) {
+ ret |= add_children(ctx,
+ path,
+ &list->oids.oid[i]);
+ }
+ }
+
+ oid_array_clear(&list->oids);
+ strmap_remove(&ctx->paths_to_lists, path, 1);
+ return ret;
+}
+
+static void clear_strmap(struct strmap *map)
+{
+ struct hashmap_iter iter;
+ struct strmap_entry *e;
+
+ hashmap_for_each_entry(&map->map, &iter, e, ent) {
+ struct type_and_oid_list *list = e->value;
+ oid_array_clear(&list->oids);
+ }
+ strmap_clear(map, 1);
+ strmap_init(map);
+}
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info)
+{
+ const char *root_path = "";
+ int ret = 0;
+ size_t commits_nr = 0, paths_nr = 0;
+ struct commit *c;
+ struct type_and_oid_list *root_tree_list;
+ struct path_walk_context ctx = {
+ .repo = info->revs->repo,
+ .revs = info->revs,
+ .info = info,
+ .path_stack = STRING_LIST_INIT_DUP,
+ .paths_to_lists = STRMAP_INIT
+ };
+
+ trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+
+ /* Insert a single list for the root tree into the paths. */
+ CALLOC_ARRAY(root_tree_list, 1);
+ root_tree_list->type = OBJ_TREE;
+ strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+
+ if (prepare_revision_walk(info->revs))
+ die(_("failed to setup revision walk"));
+
+ while ((c = get_revision(info->revs))) {
+ struct object_id *oid = get_commit_tree_oid(c);
+ struct tree *t = lookup_tree(info->revs->repo, oid);
+ commits_nr++;
+
+ if (t) {
+ if (t->object.flags & SEEN)
+ continue;
+ t->object.flags |= SEEN;
+ oid_array_append(&root_tree_list->oids, oid);
+ } else {
+ warning("could not find tree %s", oid_to_hex(oid));
+ }
+ }
+
+ trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
+ trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+
+ string_list_append(&ctx.path_stack, root_path);
+
+ trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
+ trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
+ clear_strmap(&ctx.paths_to_lists);
+ string_list_clear(&ctx.path_stack, 0);
+ return ret;
+}
diff --git a/path-walk.h b/path-walk.h
new file mode 100644
index 00000000000..c9e94a98bc8
--- /dev/null
+++ b/path-walk.h
@@ -0,0 +1,43 @@
+/*
+ * path-walk.h : Methods and structures for walking the object graph in batches
+ * by the paths that can reach those objects.
+ */
+#include "object.h" /* Required for 'enum object_type'. */
+
+struct rev_info;
+struct oid_array;
+
+/**
+ * The type of a function pointer for the method that is called on a list of
+ * objects reachable at a given path.
+ */
+typedef int (*path_fn)(const char *path,
+ struct oid_array *oids,
+ enum object_type type,
+ void *data);
+
+struct path_walk_info {
+ /**
+ * revs provides the definitions for the commit walk, including
+ * which commits are UNINTERESTING or not.
+ */
+ struct rev_info *revs;
+
+ /**
+ * The caller wishes to execute custom logic on objects reachable at a
+ * given path. Every reachable object will be visited exactly once, and
+ * the first path to see an object wins. This may not be a stable choice.
+ */
+ path_fn path_fn;
+ void *path_fn_data;
+};
+
+#define PATH_WALK_INFO_INIT { 0 }
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info);
--
gitgitgadget
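For readers who want to see the batching idea in isolation, the walk in
path-walk.c above can be sketched as a standalone toy. This is only an
illustration under simplifying assumptions, not the code from the patch:
objects are small integers in a fixed-size table instead of object IDs,
and every name starting with toy_ (and TOY_MAX) is hypothetical and does
not exist in Git.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX 16

/* Hypothetical stand-in for a Git object: a blob, or a tree with entries. */
struct toy_object {
	int is_tree, nr, seen;	/* 'seen' plays the role of the SEEN flag */
	struct { const char *name; int child; } entries[4];
};

/* All objects found at one path (the role of 'struct type_and_oid_list'). */
struct toy_batch {
	char path[64];
	int is_tree, nr, ids[TOY_MAX];
};

struct toy_walk {
	struct toy_object *objs;
	struct toy_batch batches[TOY_MAX];	/* role of paths_to_lists */
	int batch_nr;
	int stack[TOY_MAX], stack_nr;		/* role of path_stack */
};

typedef void (*toy_path_fn)(const char *path, const int *ids, int nr,
			    int is_tree, void *data);

/* Find the batch for 'path'; create it and push it on the stack if new. */
static struct toy_batch *toy_get_batch(struct toy_walk *w, const char *path,
				       int is_tree)
{
	struct toy_batch *b;

	for (int i = 0; i < w->batch_nr; i++)
		if (!strcmp(w->batches[i].path, path))
			return &w->batches[i];
	assert(w->batch_nr < TOY_MAX && strlen(path) < sizeof(b->path));
	b = &w->batches[w->batch_nr];
	strcpy(b->path, path);
	b->is_tree = is_tree;
	w->stack[w->stack_nr++] = w->batch_nr++;
	return b;
}

/* Add the not-yet-seen children of one tree to the batches for their paths. */
static void toy_add_children(struct toy_walk *w, const char *base, int tree_id)
{
	struct toy_object *tree = &w->objs[tree_id];

	for (int i = 0; i < tree->nr; i++) {
		int id = tree->entries[i].child, len;
		struct toy_object *child = &w->objs[id];
		struct toy_batch *b;
		char path[64];

		if (child->seen)
			continue;
		child->seen = 1;
		/* Tree paths end in "/" to keep them distinct from blobs. */
		len = snprintf(path, sizeof(path), "%s%s%s", base,
			       tree->entries[i].name, child->is_tree ? "/" : "");
		assert(len >= 0 && (size_t)len < sizeof(path));
		b = toy_get_batch(w, path, child->is_tree);
		assert(b->nr < TOY_MAX);
		b->ids[b->nr++] = id;
	}
}

/* Walk all objects reachable from the root trees; return the path count. */
static int toy_walk_by_path(struct toy_object *objs, const int *roots,
			    int roots_nr, toy_path_fn fn, void *data)
{
	struct toy_walk w = { .objs = objs };
	struct toy_batch *root = toy_get_batch(&w, "", 1);
	int paths_nr = 0;

	for (int i = 0; i < roots_nr; i++) {
		if (objs[roots[i]].seen)
			continue;
		objs[roots[i]].seen = 1;
		assert(root->nr < TOY_MAX);
		root->ids[root->nr++] = roots[i];
	}
	while (w.stack_nr) {
		struct toy_batch *b = &w.batches[w.stack[--w.stack_nr]];

		paths_nr++;
		fn(b->path, b->ids, b->nr, b->is_tree, data);
		for (int i = 0; b->is_tree && i < b->nr; i++)
			toy_add_children(&w, b->path, b->ids[i]);
	}
	return paths_nr;
}

/* Example callback: count batches and objects by type. */
struct toy_counts { int batches, trees, blobs; };

static void toy_count(const char *path, const int *ids, int nr, int is_tree,
		      void *data)
{
	struct toy_counts *c = data;

	(void)path, (void)ids;
	c->batches++;
	*(is_tree ? &c->trees : &c->blobs) += nr;
}
```

With two root trees that both contain a "dir" subtree, the callback sees
a single batch for "dir/" holding both subtrees. A blob shared by both
roots is reported once, at the first path that reaches it.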
* [PATCH v2 02/17] t6601: add helper for testing path-walk API
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 01/17] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 03/17] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
` (15 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Add a test-tool helper and tests that record the current behavior of the
path-walk API for different sets of branches, ranges, and the --boundary
option. This sets a baseline for the behavior that we can extend as new
options are introduced.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 3 +-
Makefile | 1 +
t/helper/test-path-walk.c | 86 ++++++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t6601-path-walk.sh | 130 ++++++++++++++++++++++
6 files changed, 221 insertions(+), 1 deletion(-)
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/t6601-path-walk.sh
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 6472222ae6d..e588897ab8d 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -51,4 +51,5 @@ commits are emitted.
Examples
--------
-See example usages in future changes.
+See example usages in:
+ `t/helper/test-path-walk.c`
diff --git a/Makefile b/Makefile
index d0d8d6888e3..50413d96492 100644
--- a/Makefile
+++ b/Makefile
@@ -818,6 +818,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
TEST_BUILTINS_OBJS += test-partial-clone.o
TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-path-walk.o
TEST_BUILTINS_OBJS += test-pcre2-config.o
TEST_BUILTINS_OBJS += test-pkt-line.o
TEST_BUILTINS_OBJS += test-proc-receive.o
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
new file mode 100644
index 00000000000..3c48f017fa0
--- /dev/null
+++ b/t/helper/test-path-walk.c
@@ -0,0 +1,86 @@
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "test-tool.h"
+#include "environment.h"
+#include "hex.h"
+#include "object-name.h"
+#include "object.h"
+#include "pretty.h"
+#include "revision.h"
+#include "setup.h"
+#include "parse-options.h"
+#include "path-walk.h"
+#include "oid-array.h"
+
+static const char * const path_walk_usage[] = {
+ N_("test-tool path-walk <options> -- <revision-options>"),
+ NULL
+};
+
+struct path_walk_test_data {
+ uintmax_t tree_nr;
+ uintmax_t blob_nr;
+};
+
+static int emit_block(const char *path, struct oid_array *oids,
+ enum object_type type, void *data)
+{
+ struct path_walk_test_data *tdata = data;
+ const char *typestr;
+
+ switch (type) {
+ case OBJ_TREE:
+ typestr = "TREE";
+ tdata->tree_nr += oids->nr;
+ break;
+
+ case OBJ_BLOB:
+ typestr = "BLOB";
+ tdata->blob_nr += oids->nr;
+ break;
+
+ default:
+ BUG("we do not understand this type");
+ }
+
+ for (size_t i = 0; i < oids->nr; i++)
+ printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));
+
+ return 0;
+}
+
+int cmd__path_walk(int argc, const char **argv)
+{
+ int res;
+ struct rev_info revs = REV_INFO_INIT;
+ struct path_walk_info info = PATH_WALK_INFO_INIT;
+ struct path_walk_test_data data = { 0 };
+ struct option options[] = {
+ OPT_END(),
+ };
+
+ initialize_repository(the_repository);
+ setup_git_directory();
+ revs.repo = the_repository;
+
+ argc = parse_options(argc, argv, NULL,
+ options, path_walk_usage,
+ PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
+
+ if (argc > 1)
+ setup_revisions(argc, argv, &revs, NULL);
+ else
+ usage(path_walk_usage[0]);
+
+ info.revs = &revs;
+ info.path_fn = emit_block;
+ info.path_fn_data = &data;
+
+ res = walk_objects_by_path(&info);
+
+ printf("trees:%" PRIuMAX "\n"
+ "blobs:%" PRIuMAX "\n",
+ data.tree_nr, data.blob_nr);
+
+ return res;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 1ebb69a5dc4..43676e7b93a 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -52,6 +52,7 @@ static struct test_cmd cmds[] = {
{ "parse-subcommand", cmd__parse_subcommand },
{ "partial-clone", cmd__partial_clone },
{ "path-utils", cmd__path_utils },
+ { "path-walk", cmd__path_walk },
{ "pcre2-config", cmd__pcre2_config },
{ "pkt-line", cmd__pkt_line },
{ "proc-receive", cmd__proc_receive },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..9cfc5da6e57 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -45,6 +45,7 @@ int cmd__parse_pathspec_file(int argc, const char** argv);
int cmd__parse_subcommand(int argc, const char **argv);
int cmd__partial_clone(int argc, const char **argv);
int cmd__path_utils(int argc, const char **argv);
+int cmd__path_walk(int argc, const char **argv);
int cmd__pcre2_config(int argc, const char **argv);
int cmd__pkt_line(int argc, const char **argv);
int cmd__proc_receive(int argc, const char **argv);
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
new file mode 100755
index 00000000000..ca18b61c3f1
--- /dev/null
+++ b/t/t6601-path-walk.sh
@@ -0,0 +1,130 @@
+#!/bin/sh
+
+test_description='direct path-walk API tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repository' '
+ git checkout -b base &&
+
+ mkdir left &&
+ mkdir right &&
+ echo a >a &&
+ echo b >left/b &&
+ echo c >right/c &&
+ git add . &&
+ git commit -m "first" &&
+
+ echo d >right/d &&
+ git add right &&
+ git commit -m "second" &&
+
+ echo bb >left/b &&
+ git commit -a -m "third" &&
+
+ git checkout -b topic HEAD~1 &&
+ echo cc >right/c &&
+ git commit -a -m "topic"
+'
+
+test_expect_success 'all' '
+ test-tool path-walk -- --all >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE::$(git rev-parse base~2^{tree})
+ TREE:left/:$(git rev-parse base:left)
+ TREE:left/:$(git rev-parse base~2:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~2:right)
+ trees:9
+ BLOB:a:$(git rev-parse base~2:a)
+ BLOB:left/b:$(git rev-parse base~2:left/b)
+ BLOB:left/b:$(git rev-parse base:left/b)
+ BLOB:right/c:$(git rev-parse base~2:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ blobs:6
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_expect_success 'topic only' '
+ test-tool path-walk -- topic >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE::$(git rev-parse base~2^{tree})
+ TREE:left/:$(git rev-parse base~2:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~2:right)
+ trees:7
+ BLOB:a:$(git rev-parse base~2:a)
+ BLOB:left/b:$(git rev-parse base~2:left/b)
+ BLOB:right/c:$(git rev-parse base~2:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ blobs:5
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_expect_success 'topic, not base' '
+ test-tool path-walk -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE:left/:$(git rev-parse topic:left)
+ TREE:right/:$(git rev-parse topic:right)
+ trees:3
+ BLOB:a:$(git rev-parse topic:a)
+ BLOB:left/b:$(git rev-parse topic:left/b)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse topic:right/d)
+ blobs:4
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_expect_success 'topic, not base, boundary' '
+ test-tool path-walk -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE:left/:$(git rev-parse base~1:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ trees:5
+ BLOB:a:$(git rev-parse base~1:a)
+ BLOB:left/b:$(git rev-parse base~1:left/b)
+ BLOB:right/c:$(git rev-parse base~1:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ blobs:5
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_done
--
gitgitgadget
* [PATCH v2 03/17] path-walk: allow consumer to specify object types
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 01/17] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 02/17] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 04/17] path-walk: allow visiting tags Derrick Stolee via GitGitGadget
` (14 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <derrickstolee@github.com>
Add the ability to filter by object type in the path-walk API so that the
callback function is called fewer times.
This also adds the ability to receive the walked commits as a single list.
Future changes will add the ability to visit annotated tags.
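As a rough illustration of how the three flags interact, here is a small
standalone sketch. It is not the code in this patch: the toy_ and TOY_
names are hypothetical, and only the decision logic is modeled. The
point it tries to show is that disabling `trees` suppresses the callback
for tree batches but does not stop the walk from descending through
trees when blobs are still wanted.

```c
#include <assert.h>

enum toy_filter_type { TOY_F_COMMIT, TOY_F_TREE, TOY_F_BLOB };

/* Hypothetical mirror of the three new 'struct path_walk_info' members. */
struct toy_filter {
	int commits;
	int trees;
	int blobs;
};

/* All types are enabled by default, as in PATH_WALK_INFO_INIT. */
#define TOY_FILTER_INIT { .commits = 1, .trees = 1, .blobs = 1 }

/* Is the callback invoked for a batch of the given type? */
static int toy_emits(const struct toy_filter *f, enum toy_filter_type type)
{
	switch (type) {
	case TOY_F_COMMIT:
		return f->commits;
	case TOY_F_TREE:
		return f->trees;
	case TOY_F_BLOB:
		return f->blobs;
	}
	assert(!"unknown type");
	return 0;
}

/*
 * Do root trees need to be collected and expanded? Trees are still
 * traversed when only blobs are wanted, because blobs are found through
 * them; only a commits-only walk skips trees entirely.
 */
static int toy_walks_trees(const struct toy_filter *f)
{
	return f->trees || f->blobs;
}
```

This matches the "only blobs" test in t6601, where `--no-trees
--no-commits` still reports every blob but reports zero trees.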
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 9 +++
path-walk.c | 39 ++++++++++--
path-walk.h | 13 +++-
t/helper/test-path-walk.c | 17 +++++-
t/t6601-path-walk.sh | 72 +++++++++++++++++++++++
5 files changed, 141 insertions(+), 9 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index e588897ab8d..b7ae476ea0a 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -48,6 +48,15 @@ If you want the path-walk API to emit `UNINTERESTING` objects based on the
commit walk's boundary, be sure to set `revs.boundary` so the boundary
commits are emitted.
+`commits`, `blobs`, `trees`::
+ By default, these members are enabled and signal that the path-walk
+ API should call the `path_fn` on objects of these types. Specialized
+ applications could disable some options to make it simpler to walk
+ the objects or to have fewer calls to `path_fn`.
++
+While it is possible to walk only commits in this way, consumers would be
+better off using the revision walk API instead.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index 66840187e28..22e1aa13f31 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -84,6 +84,10 @@ static int add_children(struct path_walk_context *ctx,
if (S_ISGITLINK(entry.mode))
continue;
+ /* If the caller doesn't want blobs, then don't bother. */
+ if (!ctx->info->blobs && type == OBJ_BLOB)
+ continue;
+
if (type == OBJ_TREE) {
struct tree *child = lookup_tree(ctx->repo, &entry.oid);
o = child ? &child->object : NULL;
@@ -140,9 +144,11 @@ static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
- /* Evaluate function pointer on this data. */
- ret = ctx->info->path_fn(path, &list->oids, list->type,
- ctx->info->path_fn_data);
+ /* Evaluate function pointer on this data, if requested. */
+ if ((list->type == OBJ_TREE && ctx->info->trees) ||
+ (list->type == OBJ_BLOB && ctx->info->blobs))
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
/* Expand data for children. */
if (list->type == OBJ_TREE) {
@@ -184,6 +190,7 @@ int walk_objects_by_path(struct path_walk_info *info)
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
struct type_and_oid_list *root_tree_list;
+ struct type_and_oid_list *commit_list;
struct path_walk_context ctx = {
.repo = info->revs->repo,
.revs = info->revs,
@@ -194,19 +201,32 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+ CALLOC_ARRAY(commit_list, 1);
+ commit_list->type = OBJ_COMMIT;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
-
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
while ((c = get_revision(info->revs))) {
- struct object_id *oid = get_commit_tree_oid(c);
- struct tree *t = lookup_tree(info->revs->repo, oid);
+ struct object_id *oid;
+ struct tree *t;
commits_nr++;
+ if (info->commits)
+ oid_array_append(&commit_list->oids,
+ &c->object.oid);
+
+ /* If we only care about commits, then skip trees. */
+ if (!info->trees && !info->blobs)
+ continue;
+
+ oid = get_commit_tree_oid(c);
+ t = lookup_tree(info->revs->repo, oid);
+
if (t) {
if (t->object.flags & SEEN)
continue;
@@ -220,6 +240,13 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+ /* Track all commits. */
+ if (info->commits)
+ ret = info->path_fn("", &commit_list->oids, OBJ_COMMIT,
+ info->path_fn_data);
+ oid_array_clear(&commit_list->oids);
+ free(commit_list);
+
string_list_append(&ctx.path_stack, root_path);
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index c9e94a98bc8..6ef372d8942 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -30,9 +30,20 @@ struct path_walk_info {
*/
path_fn path_fn;
void *path_fn_data;
+ /**
+ * Initialize which object types the path_fn should be called on. This
+ * could also limit the walk to skip blobs if not set.
+ */
+ int commits;
+ int trees;
+ int blobs;
};
-#define PATH_WALK_INFO_INIT { 0 }
+#define PATH_WALK_INFO_INIT { \
+ .blobs = 1, \
+ .trees = 1, \
+ .commits = 1, \
+}
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 3c48f017fa0..37c5e3e31e8 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -18,6 +18,7 @@ static const char * const path_walk_usage[] = {
};
struct path_walk_test_data {
+ uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
};
@@ -29,6 +30,11 @@ static int emit_block(const char *path, struct oid_array *oids,
const char *typestr;
switch (type) {
+ case OBJ_COMMIT:
+ typestr = "COMMIT";
+ tdata->commit_nr += oids->nr;
+ break;
+
case OBJ_TREE:
typestr = "TREE";
tdata->tree_nr += oids->nr;
@@ -56,6 +62,12 @@ int cmd__path_walk(int argc, const char **argv)
struct path_walk_info info = PATH_WALK_INFO_INIT;
struct path_walk_test_data data = { 0 };
struct option options[] = {
+ OPT_BOOL(0, "blobs", &info.blobs,
+ N_("toggle inclusion of blob objects")),
+ OPT_BOOL(0, "commits", &info.commits,
+ N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "trees", &info.trees,
+ N_("toggle inclusion of tree objects")),
OPT_END(),
};
@@ -78,9 +90,10 @@ int cmd__path_walk(int argc, const char **argv)
res = walk_objects_by_path(&info);
- printf("trees:%" PRIuMAX "\n"
+ printf("commits:%" PRIuMAX "\n"
+ "trees:%" PRIuMAX "\n"
"blobs:%" PRIuMAX "\n",
- data.tree_nr, data.blob_nr);
+ data.commit_nr, data.tree_nr, data.blob_nr);
return res;
}
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index ca18b61c3f1..e4788664f93 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -31,6 +31,11 @@ test_expect_success 'all' '
test-tool path-walk -- --all >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base)
+ COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~2)
+ commits:4
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base^{tree})
TREE::$(git rev-parse base~1^{tree})
@@ -60,6 +65,10 @@ test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~2)
+ commits:3
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE::$(git rev-parse base~2^{tree})
@@ -86,6 +95,8 @@ test_expect_success 'topic, not base' '
test-tool path-walk -- topic --not base >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ commits:1
TREE::$(git rev-parse topic^{tree})
TREE:left/:$(git rev-parse topic:left)
TREE:right/:$(git rev-parse topic:right)
@@ -103,10 +114,71 @@ test_expect_success 'topic, not base' '
test_cmp expect.sorted out.sorted
'
+test_expect_success 'topic, not base, only blobs' '
+ test-tool path-walk --no-trees --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ trees:0
+ BLOB:a:$(git rev-parse topic:a)
+ BLOB:left/b:$(git rev-parse topic:left/b)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse topic:right/d)
+ blobs:4
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+# No, this doesn't make a lot of sense for the path-walk API,
+# but it is possible to do.
+test_expect_success 'topic, not base, only commits' '
+ test-tool path-walk --no-blobs --no-trees \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ commits:1
+ trees:0
+ blobs:0
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
+test_expect_success 'topic, not base, only trees' '
+ test-tool path-walk --no-blobs --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ TREE::$(git rev-parse topic^{tree})
+ TREE:left/:$(git rev-parse topic:left)
+ TREE:right/:$(git rev-parse topic:right)
+ trees:3
+ blobs:0
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
test_expect_success 'topic, not base, boundary' '
test-tool path-walk -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base~1)
+ commits:2
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE:left/:$(git rev-parse base~1:left)
--
gitgitgadget
* [PATCH v2 04/17] path-walk: allow visiting tags
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (2 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 03/17] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 05/17] revision: create mark_trees_uninteresting_dense() Derrick Stolee via GitGitGadget
` (13 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In anticipation of using the path-walk API to analyze tags or include
them in a pack-file, add the ability to walk the tags that were included
in the revision walk.
When these tag objects point to blobs or trees, we need to make sure
those objects are also visited. Treat tagged trees as root trees, but
put the tagged blobs in their own category.
Be careful about objects that are referred to by multiple references.
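The handling of tag chains can be sketched in isolation as follows. This
is a simplified model with hypothetical names (toy_obj, toy_peel, and the
TOY_OBJ_ constants are not part of Git), assuming each tag records the
object it points to and carries a seen flag in the spirit of SEEN:

```c
#include <assert.h>
#include <stddef.h>

enum toy_obj_type { TOY_OBJ_COMMIT, TOY_OBJ_TREE, TOY_OBJ_BLOB, TOY_OBJ_TAG };

/* Hypothetical stand-in for 'struct object' plus the tag's target. */
struct toy_obj {
	enum toy_obj_type type;
	struct toy_obj *tagged;	/* only meaningful for tags */
	int seen;
};

/*
 * Follow a chain of tags down to the first non-tag object. Each tag is
 * appended to 'tags' the first time it is encountered, so a tag that is
 * reachable from several refs is recorded only once.
 */
static struct toy_obj *toy_peel(struct toy_obj *obj, struct toy_obj **tags,
				size_t *tags_nr, size_t tags_alloc)
{
	while (obj->type == TOY_OBJ_TAG) {
		if (!obj->seen) {
			obj->seen = 1;
			assert(*tags_nr < tags_alloc);
			tags[(*tags_nr)++] = obj;
		}
		assert(obj->tagged);
		obj = obj->tagged;
	}
	return obj;
}
```

In the test setup below, second.2 points at second.1, which points at a
commit. Peeling second.2 records both tags, and peeling second.1
afterwards records nothing new.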
Co-authored-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 2 +-
path-walk.c | 78 +++++++++++++++++++++
path-walk.h | 2 +
t/helper/test-path-walk.c | 13 +++-
t/t6601-path-walk.sh | 85 +++++++++++++++++++++--
5 files changed, 172 insertions(+), 8 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index b7ae476ea0a..5fea1d1db17 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -48,7 +48,7 @@ If you want the path-walk API to emit `UNINTERESTING` objects based on the
commit walk's boundary, be sure to set `revs.boundary` so the boundary
commits are emitted.
-`commits`, `blobs`, `trees`::
+`commits`, `blobs`, `trees`, `tags`::
By default, these members are enabled and signal that the path-walk
API should call the `path_fn` on objects of these types. Specialized
applications could disable some options to make it simpler to walk
diff --git a/path-walk.c b/path-walk.c
index 22e1aa13f31..55758f50abd 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -13,6 +13,7 @@
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
+#include "tag.h"
#include "trace2.h"
#include "tree.h"
#include "tree-walk.h"
@@ -204,13 +205,90 @@ int walk_objects_by_path(struct path_walk_info *info)
CALLOC_ARRAY(commit_list, 1);
commit_list->type = OBJ_COMMIT;
+ if (info->tags)
+ info->revs->tag_objects = 1;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+
+ /*
+ * Set these values before preparing the walk to catch
+ * lightweight tags pointing to non-commits.
+ */
+ info->revs->blob_objects = info->blobs;
+ info->revs->tree_objects = info->trees;
+
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ info->revs->blob_objects = info->revs->tree_objects = 0;
+
+ if (info->tags) {
+ struct oid_array tagged_blob_list = OID_ARRAY_INIT;
+ struct oid_array tags = OID_ARRAY_INIT;
+
+ trace2_region_enter("path-walk", "tag-walk", info->revs->repo);
+
+ /*
+ * Walk any pending objects at this point, but they should only
+ * be tags.
+ */
+ for (size_t i = 0; i < info->revs->pending.nr; i++) {
+ struct object_array_entry *pending = info->revs->pending.objects + i;
+ struct object *obj = pending->item;
+
+ if (obj->type == OBJ_COMMIT || obj->flags & SEEN)
+ continue;
+
+ while (obj->type == OBJ_TAG) {
+ struct tag *tag = lookup_tag(info->revs->repo,
+ &obj->oid);
+ if (!(obj->flags & SEEN)) {
+ obj->flags |= SEEN;
+ oid_array_append(&tags, &obj->oid);
+ }
+ obj = tag->tagged;
+ }
+
+ if ((obj->flags & SEEN))
+ continue;
+ obj->flags |= SEEN;
+
+ switch (obj->type) {
+ case OBJ_TREE:
+ if (info->trees)
+ oid_array_append(&root_tree_list->oids, &obj->oid);
+ break;
+
+ case OBJ_BLOB:
+ if (info->blobs)
+ oid_array_append(&tagged_blob_list, &obj->oid);
+ break;
+
+ case OBJ_COMMIT:
+ /* Make sure it is in the object walk */
+ add_pending_object(info->revs, obj, "");
+ break;
+
+ default:
+ BUG("should not see any other type here");
+ }
+ }
+
+ info->path_fn("", &tags, OBJ_TAG, info->path_fn_data);
+
+ if (tagged_blob_list.nr && info->blobs)
+ info->path_fn("/tagged-blobs", &tagged_blob_list, OBJ_BLOB,
+ info->path_fn_data);
+
+ trace2_data_intmax("path-walk", ctx.repo, "tags", tags.nr);
+ trace2_region_leave("path-walk", "tag-walk", info->revs->repo);
+ oid_array_clear(&tags);
+ oid_array_clear(&tagged_blob_list);
+ }
+
while ((c = get_revision(info->revs))) {
struct object_id *oid;
struct tree *t;
diff --git a/path-walk.h b/path-walk.h
index 6ef372d8942..3f3b63180ef 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -37,12 +37,14 @@ struct path_walk_info {
int commits;
int trees;
int blobs;
+ int tags;
};
#define PATH_WALK_INFO_INIT { \
.blobs = 1, \
.trees = 1, \
.commits = 1, \
+ .tags = 1, \
}
/**
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 37c5e3e31e8..c6c60d68749 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -21,6 +21,7 @@ struct path_walk_test_data {
uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
+ uintmax_t tag_nr;
};
static int emit_block(const char *path, struct oid_array *oids,
@@ -45,6 +46,11 @@ static int emit_block(const char *path, struct oid_array *oids,
tdata->blob_nr += oids->nr;
break;
+ case OBJ_TAG:
+ typestr = "TAG";
+ tdata->tag_nr += oids->nr;
+ break;
+
default:
BUG("we do not understand this type");
}
@@ -66,6 +72,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of blob objects")),
OPT_BOOL(0, "commits", &info.commits,
N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "tags", &info.tags,
+ N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
OPT_END(),
@@ -92,8 +100,9 @@ int cmd__path_walk(int argc, const char **argv)
printf("commits:%" PRIuMAX "\n"
"trees:%" PRIuMAX "\n"
- "blobs:%" PRIuMAX "\n",
- data.commit_nr, data.tree_nr, data.blob_nr);
+ "blobs:%" PRIuMAX "\n"
+ "tags:%" PRIuMAX "\n",
+ data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
return res;
}
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index e4788664f93..7758e2529ee 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -7,24 +7,55 @@ test_description='direct path-walk API tests'
test_expect_success 'setup test repository' '
git checkout -b base &&
+ # Make some objects that will only be reachable
+ # via non-commit tags.
+ mkdir child &&
+ echo file >child/file &&
+ git add child &&
+ git commit -m "will abandon" &&
+ git tag -a -m "tree" tree-tag HEAD^{tree} &&
+ echo file2 >file2 &&
+ git add file2 &&
+ git commit --amend -m "will abandon" &&
+ git tag tree-tag2 HEAD^{tree} &&
+
+ echo blob >file &&
+ blob_oid=$(git hash-object -t blob -w --stdin <file) &&
+ git tag -a -m "blob" blob-tag "$blob_oid" &&
+ echo blob2 >file2 &&
+ blob2_oid=$(git hash-object -t blob -w --stdin <file2) &&
+ git tag blob-tag2 "$blob2_oid" &&
+
+ rm -fr child file file2 &&
+
mkdir left &&
mkdir right &&
echo a >a &&
echo b >left/b &&
echo c >right/c &&
git add . &&
- git commit -m "first" &&
+ git commit --amend -m "first" &&
+ git tag -m "first" first HEAD &&
echo d >right/d &&
git add right &&
git commit -m "second" &&
+ git tag -a -m "second (under)" second.1 HEAD &&
+ git tag -a -m "second (top)" second.2 second.1 &&
+ # Set up file/dir collision in history.
+ rm a &&
+ mkdir a &&
+ echo a >a/a &&
echo bb >left/b &&
- git commit -a -m "third" &&
+ git add a left &&
+ git commit -m "third" &&
+ git tag -a -m "third" third &&
git checkout -b topic HEAD~1 &&
echo cc >right/c &&
- git commit -a -m "topic"
+ git commit -a -m "topic" &&
+ git tag -a -m "fourth" fourth
'
test_expect_success 'all' '
@@ -40,19 +71,35 @@ test_expect_success 'all' '
TREE::$(git rev-parse base^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE::$(git rev-parse base~2^{tree})
+ TREE::$(git rev-parse refs/tags/tree-tag^{})
+ TREE::$(git rev-parse refs/tags/tree-tag2^{})
+ TREE:a/:$(git rev-parse base:a)
TREE:left/:$(git rev-parse base:left)
TREE:left/:$(git rev-parse base~2:left)
TREE:right/:$(git rev-parse topic:right)
TREE:right/:$(git rev-parse base~1:right)
TREE:right/:$(git rev-parse base~2:right)
- trees:9
+ TREE:child/:$(git rev-parse refs/tags/tree-tag^{}:child)
+ trees:13
BLOB:a:$(git rev-parse base~2:a)
+ BLOB:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
BLOB:left/b:$(git rev-parse base~2:left/b)
BLOB:left/b:$(git rev-parse base:left/b)
BLOB:right/c:$(git rev-parse base~2:right/c)
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
- blobs:6
+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ BLOB:child/file:$(git rev-parse refs/tags/tree-tag^{}:child/file)
+ blobs:10
+ TAG::$(git rev-parse refs/tags/first)
+ TAG::$(git rev-parse refs/tags/second.1)
+ TAG::$(git rev-parse refs/tags/second.2)
+ TAG::$(git rev-parse refs/tags/third)
+ TAG::$(git rev-parse refs/tags/fourth)
+ TAG::$(git rev-parse refs/tags/tree-tag)
+ TAG::$(git rev-parse refs/tags/blob-tag)
+ tags:7
EOF
sort expect >expect.sorted &&
@@ -83,6 +130,7 @@ test_expect_success 'topic only' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
blobs:5
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -106,6 +154,7 @@ test_expect_success 'topic, not base' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse topic:right/d)
blobs:4
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -126,6 +175,7 @@ test_expect_success 'topic, not base, only blobs' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse topic:right/d)
blobs:4
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -145,6 +195,7 @@ test_expect_success 'topic, not base, only commits' '
commits:1
trees:0
blobs:0
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -164,6 +215,7 @@ test_expect_success 'topic, not base, only trees' '
TREE:right/:$(git rev-parse topic:right)
trees:3
blobs:0
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -191,6 +243,7 @@ test_expect_success 'topic, not base, boundary' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
blobs:5
+ tags:0
EOF
sort expect >expect.sorted &&
@@ -199,4 +252,26 @@ test_expect_success 'topic, not base, boundary' '
test_cmp expect.sorted out.sorted
'
+test_expect_success 'trees are reported exactly once' '
+ test_when_finished "rm -rf unique-trees" &&
+ test_create_repo unique-trees &&
+ (
+ cd unique-trees &&
+ mkdir initial &&
+ test_commit initial/file &&
+
+ git switch -c move-to-top &&
+ git mv initial/file.t ./ &&
+ test_tick &&
+ git commit -m moved &&
+
+ git update-ref refs/heads/other HEAD
+ ) &&
+
+ test-tool -C unique-trees path-walk -- --all >out &&
+ tree=$(git -C unique-trees rev-parse HEAD:) &&
+ grep "$tree" out >out-filtered &&
+ test_line_count = 1 out-filtered
+'
+
test_done
--
gitgitgadget
* [PATCH v2 05/17] revision: create mark_trees_uninteresting_dense()
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (3 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 04/17] path-walk: allow visiting tags Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 06/17] path-walk: add prune_all_uninteresting option Derrick Stolee via GitGitGadget
` (12 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The sparse tree walk algorithm was created in d5d2e93577e (revision:
implement sparse algorithm, 2019-01-16) and involves using the
mark_trees_uninteresting_sparse() method. This method takes a repository
and an oidset of tree IDs, some of which have the UNINTERESTING flag and
some of which do not.
Create a method that has an equivalent set of preconditions but uses a
"dense" walk (recursively visits all reachable trees, as long as they
have not previously been marked UNINTERESTING). This is an important
difference from mark_tree_uninteresting(), which short-circuits if the
given tree has the UNINTERESTING flag.
A use of this method will be added in a later change, with a condition
that selects whether the sparse or dense approach is used.
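
To illustrate the distinction drawn above, here is a toy model in
self-contained C. It is only a sketch and not Git's code: toy_tree,
toy_mark_tree() and toy_mark_dense() are hypothetical stand-ins for
struct tree, mark_tree_uninteresting() and
mark_trees_uninteresting_dense(), and a fixed child array replaces real
tree parsing.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UNINTERESTING 1u
#define TOY_MAX_CHILDREN 4

/* Hypothetical stand-in for a parsed tree: a flag word plus subtrees. */
struct toy_tree {
	unsigned flags;
	struct toy_tree *children[TOY_MAX_CHILDREN];
	size_t nr;
};

static void toy_mark_tree(struct toy_tree *t);

/* Push the flag to every child, whatever the flag on 't' itself is. */
static void toy_mark_contents(struct toy_tree *t)
{
	assert(t->nr <= TOY_MAX_CHILDREN);
	for (size_t i = 0; i < t->nr; i++)
		toy_mark_tree(t->children[i]);
}

/* Short-circuits: a tree that already has the flag is not descended. */
static void toy_mark_tree(struct toy_tree *t)
{
	if (!t || (t->flags & TOY_UNINTERESTING))
		return;
	t->flags |= TOY_UNINTERESTING;
	toy_mark_contents(t);
}

/*
 * "Dense" variant: for each root that already carries the flag,
 * recurse into its contents instead of returning early.
 */
static void toy_mark_dense(struct toy_tree **roots, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (roots[i] && (roots[i]->flags & TOY_UNINTERESTING))
			toy_mark_contents(roots[i]);
}
```

In this model, calling toy_mark_tree() on a root that is already flagged
leaves its children untouched, while toy_mark_dense() on the same root
flags them.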
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
revision.c | 15 +++++++++++++++
revision.h | 1 +
2 files changed, 16 insertions(+)
diff --git a/revision.c b/revision.c
index 2d7ad2bddff..bdc312f1538 100644
--- a/revision.c
+++ b/revision.c
@@ -219,6 +219,21 @@ static void add_children_by_path(struct repository *r,
free_tree_buffer(tree);
}
+void mark_trees_uninteresting_dense(struct repository *r,
+ struct oidset *trees)
+{
+ struct object_id *oid;
+ struct oidset_iter iter;
+
+ oidset_iter_init(trees, &iter);
+ while ((oid = oidset_iter_next(&iter))) {
+ struct tree *tree = lookup_tree(r, oid);
+
+ if (tree && (tree->object.flags & UNINTERESTING))
+ mark_tree_contents_uninteresting(r, tree);
+ }
+}
+
void mark_trees_uninteresting_sparse(struct repository *r,
struct oidset *trees)
{
diff --git a/revision.h b/revision.h
index 71e984c452b..8938b2db112 100644
--- a/revision.h
+++ b/revision.h
@@ -487,6 +487,7 @@ void put_revision_mark(const struct rev_info *revs,
void mark_parents_uninteresting(struct rev_info *revs, struct commit *commit);
void mark_tree_uninteresting(struct repository *r, struct tree *tree);
+void mark_trees_uninteresting_dense(struct repository *r, struct oidset *trees);
void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees);
void show_object_with_name(FILE *, struct object *, const char *);
--
gitgitgadget
* [PATCH v2 06/17] path-walk: add prune_all_uninteresting option
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (4 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 05/17] revision: create mark_trees_uninteresting_dense() Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 07/17] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
` (11 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
This option causes the path-walk API to act like the sparse tree-walk
algorithm implemented by mark_trees_uninteresting_sparse() in
revision.c.
Starting from the commits marked as UNINTERESTING, their root trees and
all objects reachable from those trees are UNINTERESTING, at least as we
walk path-by-path. When we reach a path where all objects associated
with that path are marked UNINTERESTING, then do not continue walking the
children of that path.
We need to be careful to pass the UNINTERESTING flag in a deep way on
the UNINTERESTING objects before we start the path-walk, or else the
depth-first search for the path-walk API may accidentally report some
objects as interesting.
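
The per-path pruning decision described above can be sketched in
isolation. This is a simplified model under stated assumptions, not the
patch's implementation: toy_path_list and toy_should_visit() are
hypothetical names, and object lookups are replaced by a plain array of
flag words.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UNINTERESTING 1u

/* Hypothetical stand-in for the list of objects collected at one path. */
struct toy_path_list {
	const unsigned *flags;   /* one flag word per object at this path */
	size_t nr;
	int maybe_interesting;   /* set when an interesting object was added */
};

/*
 * Returns 1 if the path should be visited (and its children walked),
 * 0 if it can be pruned because every object is UNINTERESTING.
 */
static int toy_should_visit(struct toy_path_list *list, int prune)
{
	assert(list);
	if (!prune)
		return 1;

	/* Every object was already UNINTERESTING when it was added. */
	if (!list->maybe_interesting)
		return 0;

	/* Flags may have changed since insertion, so check again. */
	list->maybe_interesting = 0;
	for (size_t i = 0; !list->maybe_interesting && i < list->nr; i++)
		if (!(list->flags[i] & TOY_UNINTERESTING))
			list->maybe_interesting = 1;

	return list->maybe_interesting;
}
```

A path whose objects are all flagged is skipped only when pruning is
requested; a single unflagged object keeps the path alive.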
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 8 +++
path-walk.c | 64 ++++++++++++++++++++++-
path-walk.h | 8 +++
t/helper/test-path-walk.c | 10 +++-
t/t6601-path-walk.sh | 40 +++++++++++---
5 files changed, 118 insertions(+), 12 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 5fea1d1db17..c51f92cd649 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -57,6 +57,14 @@ commits are emitted.
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.
+`prune_all_uninteresting`::
+ By default, all reachable paths are emitted by the path-walk API.
+ This option allows consumers to declare that they are not
+ interested in paths where all included objects are marked with the
+ `UNINTERESTING` flag. This requires using the `boundary` option in
+ the revision walk so that the walk emits commits marked with the
+ `UNINTERESTING` flag.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index 55758f50abd..910dfc6fdc9 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -22,6 +22,7 @@ struct type_and_oid_list
{
enum object_type type;
struct oid_array oids;
+ int maybe_interesting;
};
#define TYPE_AND_OID_LIST_INIT { \
@@ -124,6 +125,8 @@ static int add_children(struct path_walk_context *ctx,
strmap_put(&ctx->paths_to_lists, path.buf, list);
string_list_append(&ctx->path_stack, path.buf);
}
+ if (!(o->flags & UNINTERESTING))
+ list->maybe_interesting = 1;
oid_array_append(&list->oids, &entry.oid);
}
@@ -145,6 +148,40 @@ static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
+ if (ctx->info->prune_all_uninteresting) {
+ /*
+ * This is true if all objects were UNINTERESTING
+ * when added to the list.
+ */
+ if (!list->maybe_interesting)
+ return 0;
+
+ /*
+ * But it's still possible that the objects were set
+ * as UNINTERESTING after being added. Do a quick check.
+ */
+ list->maybe_interesting = 0;
+ for (size_t i = 0;
+ !list->maybe_interesting && i < list->oids.nr;
+ i++) {
+ if (list->type == OBJ_TREE) {
+ struct tree *t = lookup_tree(ctx->repo,
+ &list->oids.oid[i]);
+ if (t && !(t->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else {
+ struct blob *b = lookup_blob(ctx->repo,
+ &list->oids.oid[i]);
+ if (b && !(b->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ }
+ }
+
+ /* We have confirmed that all objects are UNINTERESTING. */
+ if (!list->maybe_interesting)
+ return 0;
+ }
+
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
(list->type == OBJ_BLOB && ctx->info->blobs))
@@ -187,7 +224,7 @@ static void clear_strmap(struct strmap *map)
int walk_objects_by_path(struct path_walk_info *info)
{
const char *root_path = "";
- int ret = 0;
+ int ret = 0, has_uninteresting = 0;
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
struct type_and_oid_list *root_tree_list;
@@ -199,6 +236,7 @@ int walk_objects_by_path(struct path_walk_info *info)
.path_stack = STRING_LIST_INIT_DUP,
.paths_to_lists = STRMAP_INIT
};
+ struct oidset root_tree_set = OIDSET_INIT;
trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
@@ -211,6 +249,7 @@ int walk_objects_by_path(struct path_walk_info *info)
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
+ root_tree_list->maybe_interesting = 1;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
/*
@@ -306,10 +345,16 @@ int walk_objects_by_path(struct path_walk_info *info)
t = lookup_tree(info->revs->repo, oid);
if (t) {
+ if ((c->object.flags & UNINTERESTING)) {
+ t->object.flags |= UNINTERESTING;
+ has_uninteresting = 1;
+ }
+
if (t->object.flags & SEEN)
continue;
t->object.flags |= SEEN;
- oid_array_append(&root_tree_list->oids, oid);
+ if (!oidset_insert(&root_tree_set, oid))
+ oid_array_append(&root_tree_list->oids, oid);
} else {
warning("could not find tree %s", oid_to_hex(oid));
}
@@ -325,6 +370,21 @@ int walk_objects_by_path(struct path_walk_info *info)
oid_array_clear(&commit_list->oids);
free(commit_list);
+ /*
+ * Before performing a DFS of our paths and emitting them as interesting,
+ * do a full walk of the trees to distribute the UNINTERESTING bit. Use
+ * the sparse algorithm if prune_all_uninteresting was set.
+ */
+ if (has_uninteresting) {
+ trace2_region_enter("path-walk", "uninteresting-walk", info->revs->repo);
+ if (info->prune_all_uninteresting)
+ mark_trees_uninteresting_sparse(ctx.repo, &root_tree_set);
+ else
+ mark_trees_uninteresting_dense(ctx.repo, &root_tree_set);
+ trace2_region_leave("path-walk", "uninteresting-walk", info->revs->repo);
+ }
+ oidset_clear(&root_tree_set);
+
string_list_append(&ctx.path_stack, root_path);
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index 3f3b63180ef..3e44c4b8a58 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -38,6 +38,14 @@ struct path_walk_info {
int trees;
int blobs;
int tags;
+
+ /**
+ * When 'prune_all_uninteresting' is set and a path has all objects
+ * marked as UNINTERESTING, then the path-walk will not visit those
+ * objects. It will not call path_fn on those objects and will not
+ * walk the children of such trees.
+ */
+ int prune_all_uninteresting;
};
#define PATH_WALK_INFO_INIT { \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index c6c60d68749..06b103d8760 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -55,8 +55,12 @@ static int emit_block(const char *path, struct oid_array *oids,
BUG("we do not understand this type");
}
- for (size_t i = 0; i < oids->nr; i++)
- printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));
+ for (size_t i = 0; i < oids->nr; i++) {
+ struct object *o = lookup_unknown_object(the_repository,
+ &oids->oid[i]);
+ printf("%s:%s:%s%s\n", typestr, path, oid_to_hex(&oids->oid[i]),
+ o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
+ }
return 0;
}
@@ -76,6 +80,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
+ OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
+ N_("toggle pruning of uninteresting paths")),
OPT_END(),
};
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 7758e2529ee..943adc6c8f1 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -229,19 +229,19 @@ test_expect_success 'topic, not base, boundary' '
cat >expect <<-EOF &&
COMMIT::$(git rev-parse topic)
- COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~1):UNINTERESTING
commits:2
TREE::$(git rev-parse topic^{tree})
- TREE::$(git rev-parse base~1^{tree})
- TREE:left/:$(git rev-parse base~1:left)
+ TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
TREE:right/:$(git rev-parse topic:right)
- TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
trees:5
- BLOB:a:$(git rev-parse base~1:a)
- BLOB:left/b:$(git rev-parse base~1:left/b)
- BLOB:right/c:$(git rev-parse base~1:right/c)
+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse base~1:right/d)
+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:5
tags:0
EOF
@@ -252,6 +252,30 @@ test_expect_success 'topic, not base, boundary' '
test_cmp expect.sorted out.sorted
'
+test_expect_success 'topic, not base, boundary with pruning' '
+ test-tool path-walk --prune -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base~1):UNINTERESTING
+ commits:2
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
+ trees:4
+ BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ blobs:2
+ tags:0
+ EOF
+
+ sort expect >expect.sorted &&
+ sort out >out.sorted &&
+
+ test_cmp expect.sorted out.sorted
+'
+
test_expect_success 'trees are reported exactly once' '
test_when_finished "rm -rf unique-trees" &&
test_create_repo unique-trees &&
--
gitgitgadget
* [PATCH v2 07/17] pack-objects: extract should_attempt_deltas()
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (5 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 06/17] path-walk: add prune_all_uninteresting option Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 08/17] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
` (10 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
This will be helpful in a future change that introduces a new way to
compute deltas.
Be careful to preserve the nr_deltas counting logic in the existing
method, but take the rest of the logic wholesale.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
builtin/pack-objects.c | 53 +++++++++++++++++++++++-------------------
1 file changed, 29 insertions(+), 24 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0fc0680b402..82f4ca04000 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3167,6 +3167,33 @@ static int add_ref_tag(const char *tag UNUSED, const char *referent UNUSED, cons
return 0;
}
+static int should_attempt_deltas(struct object_entry *entry)
+{
+ if (DELTA(entry))
+ return 0;
+
+ if (!entry->type_valid ||
+ oe_size_less_than(&to_pack, entry, 50))
+ return 0;
+
+ if (entry->no_try_delta)
+ return 0;
+
+ if (!entry->preferred_base) {
+ if (oe_type(entry) < 0)
+ die(_("unable to get type of object %s"),
+ oid_to_hex(&entry->idx.oid));
+ } else if (oe_type(entry) < 0) {
+ /*
+ * This object is not found, but we
+ * don't have to include it anyway.
+ */
+ return 0;
+ }
+
+ return 1;
+}
+
static void prepare_pack(int window, int depth)
{
struct object_entry **delta_list;
@@ -3197,33 +3224,11 @@ static void prepare_pack(int window, int depth)
for (i = 0; i < to_pack.nr_objects; i++) {
struct object_entry *entry = to_pack.objects + i;
- if (DELTA(entry))
- /* This happens if we decided to reuse existing
- * delta from a pack. "reuse_delta &&" is implied.
- */
- continue;
-
- if (!entry->type_valid ||
- oe_size_less_than(&to_pack, entry, 50))
+ if (!should_attempt_deltas(entry))
continue;
- if (entry->no_try_delta)
- continue;
-
- if (!entry->preferred_base) {
+ if (!entry->preferred_base)
nr_deltas++;
- if (oe_type(entry) < 0)
- die(_("unable to get type of object %s"),
- oid_to_hex(&entry->idx.oid));
- } else {
- if (oe_type(entry) < 0) {
- /*
- * This object is not found, but we
- * don't have to include it anyway.
- */
- continue;
- }
- }
delta_list[n++] = entry;
}
--
gitgitgadget
* [PATCH v2 08/17] pack-objects: add --path-walk option
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (6 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 07/17] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 09/17] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
` (9 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.
This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.
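
As a rough illustration of why name-hash collisions happen, the sketch
below models a shift-and-add hash in the style of Git's default
name-hash. The function toy_name_hash() is a hypothetical name, and the
assumption that it mirrors Git's behavior should be checked against
pack-objects.h before relying on it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/*
 * Toy name-hash: each character lands in the top byte and earlier
 * characters are shifted right by two bits per step, so the start of a
 * long path has little or no influence on the final value.
 */
static uint32_t toy_name_hash(const char *name)
{
	uint32_t c, hash = 0;

	assert(name);
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue;   /* whitespace does not contribute */
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

Under this model, "a/package-lock.json" and "b/package-lock.json" hash
to the same value: the two leading characters differ only in their low
bits, and those bits are shifted out before the end of the name. Objects
from both paths would then be sorted next to each other in the delta
window even though they belong to different files, which is the kind of
collision that grouping by full path avoids.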
The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.
After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.
The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.
This new walk is incompatible with some features and is ignored by
others:
* Object filters are not currently integrated with the path-walk API,
such as sparse-checkout or tree depth. A blobless packfile could be
integrated easily, but that is deferred for later.
* Server-focused features such as delta islands, shallow packs, and
using a bitmap index are incompatible with the path-walk API.
* The path-walk API is only compatible with the --revs option, not
taking object lists or pack lists over stdin. These alternative ways
to specify the objects currently ignore the --path-walk option
without even a warning.
Future changes will create performance tests that demonstrate the power
of this approach.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/git-pack-objects.txt | 13 +-
Documentation/technical/api-path-walk.txt | 3 +-
builtin/pack-objects.c | 147 ++++++++++++++++++++--
t/t5300-pack-object.sh | 17 +++
4 files changed, 169 insertions(+), 11 deletions(-)
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index e32404c6aae..f2fda800a43 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -15,7 +15,8 @@ SYNOPSIS
[--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
[--cruft] [--cruft-expiration=<time>]
[--stdout [--filter=<filter-spec>] | <base-name>]
- [--shallow] [--keep-true-parents] [--[no-]sparse] < <object-list>
+ [--shallow] [--keep-true-parents] [--[no-]sparse]
+ [--path-walk] < <object-list>
DESCRIPTION
@@ -345,6 +346,16 @@ raise an error.
Restrict delta matches based on "islands". See DELTA ISLANDS
below.
+--path-walk::
+ By default, `git pack-objects` walks objects in an order that
+ presents trees and blobs in an order unrelated to the path they
+ appear relative to a commit's root tree. The `--path-walk` option
+ enables a different walking algorithm that organizes trees and
+ blobs by path. This has the potential to improve delta compression
+ especially in the presence of filenames that cause collisions in
+ Git's default name-hash algorithm. Due to changing how the objects
+ are walked, this option is not compatible with `--delta-islands`,
+ `--shallow`, or `--filter`.
DELTA ISLANDS
-------------
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index c51f92cd649..2d25281774d 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -69,4 +69,5 @@ Examples
--------
See example usages in:
- `t/helper/test-path-walk.c`
+ `t/helper/test-path-walk.c`,
+ `builtin/pack-objects.c`
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 82f4ca04000..103263666f6 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -39,6 +39,9 @@
#include "promisor-remote.h"
#include "pack-mtimes.h"
#include "parse-options.h"
+#include "blob.h"
+#include "tree.h"
+#include "path-walk.h"
/*
* Objects we are going to pack are collected in the `to_pack` structure.
@@ -215,6 +218,7 @@ static int delta_search_threads;
static int pack_to_stdout;
static int sparse;
static int thin;
+static int path_walk;
static int num_preferred_base;
static struct progress *progress_state;
@@ -4143,6 +4147,105 @@ static void mark_bitmap_preferred_tips(void)
}
}
+static inline int is_oid_interesting(struct repository *repo,
+ struct object_id *oid)
+{
+ struct object *o = lookup_object(repo, oid);
+ return o && !(o->flags & UNINTERESTING);
+}
+
+static int add_objects_by_path(const char *path,
+ struct oid_array *oids,
+ enum object_type type,
+ void *data)
+{
+ struct object_entry **delta_list;
+ size_t oe_start = to_pack.nr_objects;
+ size_t oe_end;
+ unsigned int sub_list_size;
+ unsigned int *processed = data;
+
+ /*
+ * First, add all objects to the packing data, including the ones
+ * marked UNINTERESTING (translated to 'exclude') as they can be
+ * used as delta bases.
+ */
+ for (size_t i = 0; i < oids->nr; i++) {
+ int exclude;
+ struct object_info oi = OBJECT_INFO_INIT;
+ struct object_id *oid = &oids->oid[i];
+
+ /* Skip objects that do not exist locally. */
+ if (exclude_promisor_objects &&
+ oid_object_info_extended(the_repository, oid, &oi,
+ OBJECT_INFO_FOR_PREFETCH) < 0)
+ continue;
+
+ exclude = !is_oid_interesting(the_repository, oid);
+
+ if (exclude && !thin)
+ continue;
+
+ add_object_entry(oid, type, path, exclude);
+ }
+
+ oe_end = to_pack.nr_objects;
+
+ /* We can skip delta calculations if it is a no-op. */
+ if (oe_end == oe_start || !window)
+ return 0;
+
+ sub_list_size = 0;
+ ALLOC_ARRAY(delta_list, oe_end - oe_start);
+
+ for (size_t i = 0; i < oe_end - oe_start; i++) {
+ struct object_entry *entry = to_pack.objects + oe_start + i;
+
+ if (!should_attempt_deltas(entry))
+ continue;
+
+ delta_list[sub_list_size++] = entry;
+ }
+
+ /*
+ * Find delta bases among this list of objects that all match the same
+ * path. This causes the delta compression to be interleaved in the
+ * object walk, which can lead to confusing progress indicators. This is
+ * also incompatible with threaded delta calculations. In the future,
+ * consider creating a list of regions in the full to_pack.objects array
+ * that could be picked up by the threaded delta computation.
+ */
+ if (sub_list_size && window) {
+ QSORT(delta_list, sub_list_size, type_size_sort);
+ find_deltas(delta_list, &sub_list_size, window, depth, processed);
+ }
+
+ free(delta_list);
+ return 0;
+}
+
+static void get_object_list_path_walk(struct rev_info *revs)
+{
+ struct path_walk_info info = PATH_WALK_INFO_INIT;
+ unsigned int processed = 0;
+
+ info.revs = revs;
+ info.path_fn = add_objects_by_path;
+ info.path_fn_data = &processed;
+ revs->tag_objects = 1;
+
+ /*
+ * Allow the --[no-]sparse option to be interesting here, if only
+ * for testing purposes. Paths with no interesting objects will not
+ * contribute to the resulting pack, but only create noisy preferred
+ * base objects.
+ */
+ info.prune_all_uninteresting = sparse;
+
+ if (walk_objects_by_path(&info))
+ die(_("failed to pack objects via path-walk"));
+}
+
static void get_object_list(struct rev_info *revs, int ac, const char **av)
{
struct setup_revision_opt s_r_opt = {
@@ -4189,7 +4292,7 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
warn_on_object_refname_ambiguity = save_warning;
- if (use_bitmap_index && !get_object_list_from_bitmap(revs))
+ if (use_bitmap_index && !path_walk && !get_object_list_from_bitmap(revs))
return;
if (use_delta_islands)
@@ -4198,15 +4301,19 @@ static void get_object_list(struct rev_info *revs, int ac, const char **av)
if (write_bitmap_index)
mark_bitmap_preferred_tips();
- if (prepare_revision_walk(revs))
- die(_("revision walk setup failed"));
- mark_edges_uninteresting(revs, show_edge, sparse);
-
if (!fn_show_object)
fn_show_object = show_object;
- traverse_commit_list(revs,
- show_commit, fn_show_object,
- NULL);
+
+ if (path_walk) {
+ get_object_list_path_walk(revs);
+ } else {
+ if (prepare_revision_walk(revs))
+ die(_("revision walk setup failed"));
+ mark_edges_uninteresting(revs, show_edge, sparse);
+ traverse_commit_list(revs,
+ show_commit, fn_show_object,
+ NULL);
+ }
if (unpack_unreachable_expiration) {
revs->ignore_missing_links = 1;
@@ -4404,6 +4511,8 @@ int cmd_pack_objects(int argc,
N_("use the sparse reachability algorithm")),
OPT_BOOL(0, "thin", &thin,
N_("create thin packs")),
+ OPT_BOOL(0, "path-walk", &path_walk,
+ N_("use the path-walk API to walk objects when possible")),
OPT_BOOL(0, "shallow", &shallow,
N_("create packs suitable for shallow fetches")),
OPT_BOOL(0, "honor-pack-keep", &ignore_packed_keep_on_disk,
@@ -4484,7 +4593,27 @@ int cmd_pack_objects(int argc,
window = 0;
strvec_push(&rp, "pack-objects");
- if (thin) {
+
+ if (path_walk && filter_options.choice) {
+ warning(_("cannot use --filter with --path-walk"));
+ path_walk = 0;
+ }
+ if (path_walk && use_delta_islands) {
+ warning(_("cannot use delta islands with --path-walk"));
+ path_walk = 0;
+ }
+ if (path_walk && shallow) {
+ warning(_("cannot use --shallow with --path-walk"));
+ path_walk = 0;
+ }
+ if (path_walk) {
+ strvec_push(&rp, "--boundary");
+ /*
+ * We must disable the bitmaps because we are removing
+ * the --objects / --objects-edge[-aggressive] options.
+ */
+ use_bitmap_index = 0;
+ } else if (thin) {
use_internal_rev_list = 1;
strvec_push(&rp, shallow
? "--objects-edge-aggressive"
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 3b9dae331a5..5f6914acae7 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -674,4 +674,21 @@ do
'
done
+# Basic "repack everything" test
+test_expect_success '--path-walk pack everything' '
+ git -C server rev-parse HEAD >in &&
+ git -C server pack-objects --stdout --revs --path-walk <in >out.pack &&
+ git -C server index-pack --stdin <out.pack
+'
+
+# Basic "thin pack" test
+test_expect_success '--path-walk thin pack' '
+ cat >in <<-EOF &&
+ $(git -C server rev-parse HEAD)
+ ^$(git -C server rev-parse HEAD~2)
+ EOF
+ git -C server pack-objects --thin --stdout --revs --path-walk <in >out.pack &&
+ git -C server index-pack --fix-thin --stdin <out.pack
+'
+
test_done
--
gitgitgadget
* [PATCH v2 09/17] pack-objects: update usage to match docs
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (7 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 08/17] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 10/17] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
` (8 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/git-pack-objects.txt | 14 +++++++-------
builtin/pack-objects.c | 10 ++++++++--
t/t0450/txt-help-mismatches | 1 -
3 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index f2fda800a43..68d86ed8838 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -10,13 +10,13 @@ SYNOPSIS
--------
[verse]
'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
- [--no-reuse-delta] [--delta-base-offset] [--non-empty]
- [--local] [--incremental] [--window=<n>] [--depth=<n>]
- [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
- [--cruft] [--cruft-expiration=<time>]
- [--stdout [--filter=<filter-spec>] | <base-name>]
- [--shallow] [--keep-true-parents] [--[no-]sparse]
- [--path-walk] < <object-list>
+ [--no-reuse-delta] [--delta-base-offset] [--non-empty]
+ [--local] [--incremental] [--window=<n>] [--depth=<n>]
+ [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]
+ [--cruft] [--cruft-expiration=<time>]
+ [--stdout [--filter=<filter-spec>] | <base-name>]
+ [--shallow] [--keep-true-parents] [--[no-]sparse]
+ [--path-walk] < <object-list>
DESCRIPTION
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 103263666f6..77fb1217b2e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -185,8 +185,14 @@ static inline void oe_set_delta_size(struct packing_data *pack,
#define SET_DELTA_SIBLING(obj, val) oe_set_delta_sibling(&to_pack, obj, val)
static const char *pack_usage[] = {
- N_("git pack-objects --stdout [<options>] [< <ref-list> | < <object-list>]"),
- N_("git pack-objects [<options>] <base-name> [< <ref-list> | < <object-list>]"),
+ N_("git pack-objects [-q | --progress | --all-progress] [--all-progress-implied]\n"
+ " [--no-reuse-delta] [--delta-base-offset] [--non-empty]\n"
+ " [--local] [--incremental] [--window=<n>] [--depth=<n>]\n"
+ " [--revs [--unpacked | --all]] [--keep-pack=<pack-name>]\n"
+ " [--cruft] [--cruft-expiration=<time>]\n"
+ " [--stdout [--filter=<filter-spec>] | <base-name>]\n"
+ " [--shallow] [--keep-true-parents] [--[no-]sparse]\n"
+ " [--path-walk] < <object-list>"),
NULL
};
diff --git a/t/t0450/txt-help-mismatches b/t/t0450/txt-help-mismatches
index 28003f18c92..285ae81a6b5 100644
--- a/t/t0450/txt-help-mismatches
+++ b/t/t0450/txt-help-mismatches
@@ -38,7 +38,6 @@ merge-one-file
multi-pack-index
name-rev
notes
-pack-objects
push
range-diff
rebase
--
gitgitgadget
* [PATCH v2 10/17] p5313: add performance tests for --path-walk
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (8 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 09/17] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 11/17] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
` (7 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.
In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.
Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.
Running on my copy of the Git repository results in this data:
Test                                      this tree
---------------------------------------------------------
5313.2: thin pack                         0.01(0.00+0.00)
5313.3: thin pack size                    1.1K
5313.4: thin pack with --path-walk        0.01(0.01+0.00)
5313.5: thin pack size with --path-walk   1.1K
5313.6: big pack                          2.52(6.59+0.38)
5313.7: big pack size                     14.1M
5313.8: big pack with --path-walk         4.90(5.76+0.26)
5313.9: big pack size with --path-walk    13.2M
Note that the timing is slower because there is no threading in the
--path-walk case (yet).
The --path-walk option really shines when the default name-hash is
overwhelmed with collisions. An open source example can be found in
the microsoft/fluentui repo [1] at a certain commit [2].
[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08
Running the tests on this repo results in the following output:
Test                                      this tree
----------------------------------------------------------
5313.2: thin pack                         0.28(0.38+0.02)
5313.3: thin pack size                    1.2M
5313.4: thin pack with --path-walk        0.08(0.06+0.01)
5313.5: thin pack size with --path-walk   18.4K
5313.6: big pack                          4.05(29.62+0.43)
5313.7: big pack size                     20.0M
5313.8: big pack with --path-walk         5.99(9.06+0.24)
5313.9: big pack size with --path-walk    16.4M
Notice in particular that in the small thin pack, the time performance
has improved from 0.28s to 0.08s and this is likely due to the improved
size of the resulting pack: 18.4K instead of 1.2M.
Finally, running this on a copy of the Linux kernel repository results
in these data points:
Test                                      this tree
-----------------------------------------------------------
5313.2: thin pack                         0.00(0.00+0.00)
5313.3: thin pack size                    5.8K
5313.4: thin pack with --path-walk        0.00(0.01+0.00)
5313.5: thin pack size with --path-walk   5.8K
5313.6: big pack                          24.39(65.81+1.31)
5313.7: big pack size                     155.7M
5313.8: big pack with --path-walk         41.07(60.69+0.68)
5313.9: big pack size with --path-walk    150.8M
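
To put the "big pack" sizes from the three tables on a common footing, the relative savings can be worked out with a small helper. This is only a sketch: `pct_saved` is a hypothetical name, not part of the perf suite, and it assumes both sizes are given in the same unit.

```shell
#!/bin/sh
# Hypothetical helper: percentage by which <after> is smaller than
# <before>. Both arguments must use the same unit (here: megabytes).
pct_saved () {
	LC_ALL=C awk -v before="$1" -v after="$2" \
		'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# "big pack size" vs. "big pack size with --path-walk" from the tables.
echo "git.git:  $(pct_saved 14.1 13.2)% smaller"
echo "fluentui: $(pct_saved 20.0 16.4)% smaller"
echo "linux:    $(pct_saved 155.7 150.8)% smaller"
```

With the numbers above this works out to roughly 6.4%, 18.0%, and 3.1%, which matches the observation that the gain is largest where the name-hash collides most.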
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
t/perf/p5313-pack-objects.sh | 59 ++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
create mode 100755 t/perf/p5313-pack-objects.sh
diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
new file mode 100755
index 00000000000..840075f5691
--- /dev/null
+++ b/t/perf/p5313-pack-objects.sh
@@ -0,0 +1,59 @@
+#!/bin/sh
+
+test_description='Tests pack performance with and without --path-walk'
+. ./perf-lib.sh
+
+GIT_TEST_PASSING_SANITIZE_LEAK=0
+export GIT_TEST_PASSING_SANITIZE_LEAK
+
+test_perf_large_repo
+
+test_expect_success 'create rev input' '
+ cat >in-thin <<-EOF &&
+ $(git rev-parse HEAD)
+ ^$(git rev-parse HEAD~1)
+ EOF
+
+ cat >in-big <<-EOF
+ $(git rev-parse HEAD)
+ ^$(git rev-parse HEAD~1000)
+ EOF
+'
+
+test_perf 'thin pack' '
+ git pack-objects --thin --stdout --no-reuse-delta \
+ --revs --sparse <in-thin >out
+'
+
+test_size 'thin pack size' '
+ test_file_size out
+'
+
+test_perf 'thin pack with --path-walk' '
+ git pack-objects --thin --stdout --no-reuse-delta \
+ --revs --sparse --path-walk <in-thin >out
+'
+
+test_size 'thin pack size with --path-walk' '
+ test_file_size out
+'
+
+test_perf 'big pack' '
+ git pack-objects --stdout --no-reuse-delta --revs \
+ --sparse <in-big >out
+'
+
+test_size 'big pack size' '
+ test_file_size out
+'
+
+test_perf 'big pack with --path-walk' '
+ git pack-objects --stdout --no-reuse-delta --revs \
+ --sparse --path-walk <in-big >out
+'
+
+test_size 'big pack size with --path-walk' '
+ test_file_size out
+'
+
+test_done
--
gitgitgadget
* [PATCH v2 11/17] pack-objects: introduce GIT_TEST_PACK_PATH_WALK
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (9 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 10/17] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 12/17] repack: add --path-walk option Derrick Stolee via GitGitGadget
` (6 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.
This was useful in testing the --path-walk implementation, especially
in conjunction with tests such as:
- t0411-clone-from-partial.sh : One test fetches from a repo that does
not have the boundary objects. This causes the path-based walk to
fail. Disable the variable for this test.
- t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
without a boundary object.
- t5310-pack-bitmaps.sh : One test compares the case when packing with
bitmaps to the case when packing without them. Since we disable the
test variable when writing bitmaps, this causes a difference in the
object list (the --path-walk option adds an extra object). Specify
--no-path-walk in both processes for the comparison. Another test
checks for a specific delta base, but when computing dynamically
without using bitmaps, the base object is too small to be considered
in the delta calculations so no base is used.
- t5316-pack-delta-depth.sh : This script cares about certain delta
choices and their chain lengths. The --path-walk option changes how
these chains are selected, and thus changes the results of this test.
- t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
the --sparse option and how it combines with --path-walk.
- t5332-multi-pack-reuse.sh : This test verifies that the preferred
pack is used for delta reuse when possible. The --path-walk option is
not currently aware of the preferred pack at all, so finds a
different delta base.
- t7406-submodule-update.sh : When using the variable, the --depth
option collides with the --path-walk feature, resulting in a warning
message. Disable the variable so this warning does not appear.
I want to call out one specific test change that is only temporary:
- t5530-upload-pack-error.sh : One test cares specifically about an
"unable to read" error message. Since the current implementation
performs delta calculations within the path-walk API callback, a
different "unable to get size" error message appears. When this
is changed in a future refactoring, this test change can be reverted.
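
The test changes in this patch disable the variable in two different ways: as a one-shot prefix on a single command (t7406), or by assigning and exporting it (t0411, t5306, t5310, t5332). The following sketch, which is not taken from the test suite, illustrates how far each form reaches. `child_view` is a hypothetical helper, and the command substitutions stand in for the `( ... )` subshells used by some of the tests.

```shell
#!/bin/sh
# Hypothetical helper: print the variable as a child process sees it.
child_view () {
	sh -c 'printf "%s\n" "${GIT_TEST_PACK_PATH_WALK-unset}"'
}

unset GIT_TEST_PACK_PATH_WALK

# Form 1: one-shot prefix. Only this one child process sees the value.
# (A real command is used here because prefix assignments in front of
# shell functions behave differently across shells.)
one_shot=$(GIT_TEST_PACK_PATH_WALK=0 \
	sh -c 'printf "%s\n" "${GIT_TEST_PACK_PATH_WALK-unset}"')
after_one_shot=$(child_view)

# Form 2: assign and export. Every later child of the same (sub)shell
# sees the value, but nothing leaks out of the subshell.
in_subshell=$(
	GIT_TEST_PACK_PATH_WALK=0 &&
	export GIT_TEST_PACK_PATH_WALK &&
	child_view
)
after_subshell=$(child_view)

echo "one-shot: $one_shot, afterwards: $after_one_shot"
echo "exported in subshell: $in_subshell, afterwards: $after_subshell"
```

When the export is done outside of any subshell, as at the top of t5332, it stays in effect for the rest of the script.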
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
builtin/pack-objects.c | 12 ++++++++++--
ci/run-build-and-tests.sh | 1 +
t/README | 4 ++++
t/t0411-clone-from-partial.sh | 6 ++++++
t/t5306-pack-nobase.sh | 5 +++++
t/t5310-pack-bitmaps.sh | 13 +++++++++++--
t/t5316-pack-delta-depth.sh | 9 ++++++---
t/t5332-multi-pack-reuse.sh | 7 +++++++
t/t5530-upload-pack-error.sh | 6 ++++++
t/t7406-submodule-update.sh | 4 ++++
10 files changed, 60 insertions(+), 7 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 77fb1217b2e..b97bec5661e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -224,7 +224,7 @@ static int delta_search_threads;
static int pack_to_stdout;
static int sparse;
static int thin;
-static int path_walk;
+static int path_walk = -1;
static int num_preferred_base;
static struct progress *progress_state;
@@ -4182,7 +4182,7 @@ static int add_objects_by_path(const char *path,
struct object_id *oid = &oids->oid[i];
/* Skip objects that do not exist locally. */
- if (exclude_promisor_objects &&
+ if ((exclude_promisor_objects || arg_missing_action != MA_ERROR) &&
oid_object_info_extended(the_repository, oid, &oi,
OBJECT_INFO_FOR_PREFETCH) < 0)
continue;
@@ -4583,6 +4583,14 @@ int cmd_pack_objects(int argc,
if (pack_to_stdout != !base_name || argc)
usage_with_options(pack_usage, pack_objects_options);
+ if (path_walk < 0) {
+ if (use_bitmap_index > 0 ||
+ !use_internal_rev_list)
+ path_walk = 0;
+ else
+ path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
+ }
+
if (depth < 0)
depth = 0;
if (depth >= (1 << OE_DEPTH_BITS)) {
diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 2e28d02b20f..7c75492f366 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -30,6 +30,7 @@ linux-TEST-vars)
export GIT_TEST_NO_WRITE_REV_INDEX=1
export GIT_TEST_CHECKOUT_WORKERS=2
export GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL=1
+ export GIT_TEST_PACK_PATH_WALK=1
;;
linux-clang)
export GIT_TEST_DEFAULT_HASH=sha1
diff --git a/t/README b/t/README
index 8dcb778e260..bec31955d2d 100644
--- a/t/README
+++ b/t/README
@@ -436,6 +436,10 @@ GIT_TEST_PACK_SPARSE=<boolean> if disabled will default the pack-objects
builtin to use the non-sparse object walk. This can still be overridden by
the --sparse command-line argument.
+GIT_TEST_PACK_PATH_WALK=<boolean> if enabled will default the pack-objects
+builtin to use the path-walk API for the object walk. This can still be
+overridden by the --no-path-walk command-line argument.
+
GIT_TEST_PRELOAD_INDEX=<boolean> exercises the preload-index code path
by overriding the minimum number of cache entries required per thread.
diff --git a/t/t0411-clone-from-partial.sh b/t/t0411-clone-from-partial.sh
index 932bf2067da..342d8d2997c 100755
--- a/t/t0411-clone-from-partial.sh
+++ b/t/t0411-clone-from-partial.sh
@@ -63,6 +63,12 @@ test_expect_success 'pack-objects should fetch from promisor remote and execute
test_expect_success 'clone from promisor remote does not lazy-fetch by default' '
rm -f script-executed &&
+
+ # The --path-walk feature of "git pack-objects" is not
+ # compatible with this kind of fetch from an incomplete repo.
+ GIT_TEST_PACK_PATH_WALK=0 &&
+ export GIT_TEST_PACK_PATH_WALK &&
+
test_must_fail git clone evil no-lazy 2>err &&
test_grep "lazy fetching disabled" err &&
test_path_is_missing script-executed
diff --git a/t/t5306-pack-nobase.sh b/t/t5306-pack-nobase.sh
index 0d50c6b4bca..429be5ce724 100755
--- a/t/t5306-pack-nobase.sh
+++ b/t/t5306-pack-nobase.sh
@@ -60,6 +60,11 @@ test_expect_success 'indirectly clone patch_clone' '
git pull ../.git &&
test $(git rev-parse HEAD) = $B &&
+ # The --path-walk feature of "git pack-objects" is not
+ # compatible with this kind of fetch from an incomplete repo.
+ GIT_TEST_PACK_PATH_WALK=0 &&
+ export GIT_TEST_PACK_PATH_WALK &&
+
git pull ../patch_clone/.git &&
test $(git rev-parse HEAD) = $C
)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index a6de7c57643..881b3f9c8d1 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -128,8 +128,9 @@ test_bitmap_cases () {
ls .git/objects/pack/ | grep bitmap >output &&
test_line_count = 1 output &&
# verify equivalent packs are generated with/without using bitmap index
- packasha1=$(git pack-objects --no-use-bitmap-index --all packa </dev/null) &&
- packbsha1=$(git pack-objects --use-bitmap-index --all packb </dev/null) &&
+ # Be careful to not use the path-walk option in either case.
+ packasha1=$(git pack-objects --no-use-bitmap-index --no-path-walk --all packa </dev/null) &&
+ packbsha1=$(git pack-objects --use-bitmap-index --no-path-walk --all packb </dev/null) &&
list_packed_objects packa-$packasha1.idx >packa.objects &&
list_packed_objects packb-$packbsha1.idx >packb.objects &&
test_cmp packa.objects packb.objects
@@ -358,6 +359,14 @@ test_bitmap_cases () {
git init --bare client.git &&
(
cd client.git &&
+
+ # This test relies on reusing a delta, but if the
+ # path-walk machinery is engaged, the base object
+ # is considered too small to use during the
+ # dynamic computation, so is not used.
+ GIT_TEST_PACK_PATH_WALK=0 &&
+ export GIT_TEST_PACK_PATH_WALK &&
+
git config transfer.unpackLimit 1 &&
git fetch .. delta-reuse-old:delta-reuse-old &&
git fetch .. delta-reuse-new:delta-reuse-new &&
diff --git a/t/t5316-pack-delta-depth.sh b/t/t5316-pack-delta-depth.sh
index eb4ef3dda4d..12a6901fecb 100755
--- a/t/t5316-pack-delta-depth.sh
+++ b/t/t5316-pack-delta-depth.sh
@@ -90,15 +90,18 @@ max_chain() {
# adjusted (or scrapped if the heuristics have become too unreliable)
test_expect_success 'packing produces a long delta' '
# Use --window=0 to make sure we are seeing reused deltas,
- # not computing a new long chain.
- pack=$(git pack-objects --all --window=0 </dev/null pack) &&
+ # not computing a new long chain. (Also avoid the --path-walk
+ # option as it may break delta chains.)
+ pack=$(git pack-objects --all --window=0 --no-path-walk </dev/null pack) &&
echo 9 >expect &&
max_chain pack-$pack.pack >actual &&
test_cmp expect actual
'
test_expect_success '--depth limits depth' '
- pack=$(git pack-objects --all --depth=5 </dev/null pack) &&
+ # Avoid --path-walk to avoid breaking delta chains across path
+ # boundaries.
+ pack=$(git pack-objects --all --depth=5 --no-path-walk </dev/null pack) &&
echo 5 >expect &&
max_chain pack-$pack.pack >actual &&
test_cmp expect actual
diff --git a/t/t5332-multi-pack-reuse.sh b/t/t5332-multi-pack-reuse.sh
index 955ea42769b..df7dcb4b487 100755
--- a/t/t5332-multi-pack-reuse.sh
+++ b/t/t5332-multi-pack-reuse.sh
@@ -8,6 +8,13 @@ TEST_PASSES_SANITIZE_LEAK=true
GIT_TEST_MULTI_PACK_INDEX=0
GIT_TEST_MULTI_PACK_INDEX_WRITE_INCREMENTAL=0
+
+# The --path-walk option does not consider the preferred pack
+# at all for reusing deltas, so this variable changes the
+# behavior of this test, if enabled.
+GIT_TEST_PACK_PATH_WALK=0
+export GIT_TEST_PACK_PATH_WALK
+
objdir=.git/objects
packdir=$objdir/pack
diff --git a/t/t5530-upload-pack-error.sh b/t/t5530-upload-pack-error.sh
index 7172780d550..356b96cb741 100755
--- a/t/t5530-upload-pack-error.sh
+++ b/t/t5530-upload-pack-error.sh
@@ -35,6 +35,12 @@ test_expect_success 'upload-pack fails due to error in pack-objects packing' '
hexsz=$(test_oid hexsz) &&
printf "%04xwant %s\n00000009done\n0000" \
$(($hexsz + 10)) $head >input &&
+
+ # The current implementation of path-walk causes a different
+ # error message. This will be changed by a future refactoring.
+ GIT_TEST_PACK_PATH_WALK=0 &&
+ export GIT_TEST_PACK_PATH_WALK &&
+
test_must_fail git upload-pack . <input >/dev/null 2>output.err &&
test_grep "unable to read" output.err &&
test_grep "pack-objects died" output.err
diff --git a/t/t7406-submodule-update.sh b/t/t7406-submodule-update.sh
index 297c6c3b5cc..d2284e67d3d 100755
--- a/t/t7406-submodule-update.sh
+++ b/t/t7406-submodule-update.sh
@@ -1093,12 +1093,16 @@ test_expect_success 'submodule update --quiet passes quietness to fetch with a s
) &&
git clone super4 super5 &&
(cd super5 &&
+ # Enabling this test variable would emit a "warning" message on stderr
+ GIT_TEST_PACK_PATH_WALK=0 \
git submodule update --quiet --init --depth=1 submodule3 >out 2>err &&
test_must_be_empty out &&
test_must_be_empty err
) &&
git clone super4 super6 &&
(cd super6 &&
+ # Enabling this test variable would emit a "warning" message on stderr
+ GIT_TEST_PACK_PATH_WALK=0 \
git submodule update --init --depth=1 submodule3 >out 2>err &&
test_file_not_empty out &&
test_file_not_empty err
--
gitgitgadget
* [PATCH v2 12/17] repack: add --path-walk option
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (10 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 11/17] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 13/17] repack: update usage to match docs Derrick Stolee via GitGitGadget
` (5 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.
In my copy of the Git repository, the new tests in p5313 show these
results:
Test                                    this tree
-------------------------------------------------------------
5313.10: repack                         27.88(150.23+2.70)
5313.11: repack size                    228.2M
5313.12: repack with --path-walk        134.59(148.77+0.81)
5313.13: repack size with --path-walk   209.7M
Note that the 'git pack-objects --path-walk' feature is not integrated
with threads. Look forward to a future change that will introduce
threading to improve the time performance of this feature with
equivalent space performance.
The microsoft/fluentui repo [1] showed some interesting behavior in the
previous tests in p5313, so here are the repack results:
Test                                    this tree
-------------------------------------------------------------
5313.10: repack                         91.76(680.94+2.48)
5313.11: repack size                    439.1M
5313.12: repack with --path-walk        110.35(130.46+0.74)
5313.13: repack size with --path-walk   155.3M
[1] https://github.com/microsoft/fluentui
Here, we see the significant improvement of a full repack using this
strategy. The name-hash collisions in this repo cause the space
problems. Those collisions also cause the repack command to spend a lot
of cycles trying to find delta bases among files that are not actually
very similar, so the lack of threading with the --path-walk feature is
less pronounced in the process time.
For the Linux kernel repository, we have these stats:
Test                                    this tree
---------------------------------------------------------------
5313.10: repack                         553.61(1929.41+30.31)
5313.11: repack size                    2.5G
5313.12: repack with --path-walk        1777.63(2044.16+7.47)
5313.13: repack size with --path-walk   2.5G
This demonstrates that the --path-walk feature does not always present
measurable improvements, especially in cases where the name-hash has
very few collisions.
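
The effect of the missing threading can be read off the timings themselves. Assuming the perf output is formatted as elapsed(user+sys), the ratio of CPU time to elapsed time approximates how many cores were kept busy. `cpu_per_wall` below is a hypothetical helper written for this sketch, not part of the perf library.

```shell
#!/bin/sh
# Hypothetical helper: (user + sys) / elapsed, i.e. roughly the number
# of cores kept busy during the run.
# Usage: cpu_per_wall <elapsed> <user> <sys>
cpu_per_wall () {
	LC_ALL=C awk -v elapsed="$1" -v user="$2" -v sys="$3" \
		'BEGIN { printf "%.1f\n", (user + sys) / elapsed }'
}

echo "git.git  repack:             $(cpu_per_wall 27.88 150.23 2.70)"
echo "git.git  repack --path-walk: $(cpu_per_wall 134.59 148.77 0.81)"
echo "fluentui repack:             $(cpu_per_wall 91.76 680.94 2.48)"
echo "fluentui repack --path-walk: $(cpu_per_wall 110.35 130.46 0.74)"
echo "linux    repack:             $(cpu_per_wall 553.61 1929.41 30.31)"
echo "linux    repack --path-walk: $(cpu_per_wall 1777.63 2044.16 7.47)"
```

Under that reading, the default repack keeps between about 3.5 and 7.4 cores busy, while the --path-walk runs stay close to a single core (about 1.1 to 1.2). In the fluentui case the total CPU time drops so much that the elapsed time stays comparable despite the lack of threads.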
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/git-repack.txt | 17 ++++++++++++++++-
builtin/repack.c | 9 ++++++++-
t/perf/p5313-pack-objects.sh | 18 ++++++++++++++++++
3 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index c902512a9e8..4ec59cd27b1 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -9,7 +9,9 @@ git-repack - Pack unpacked objects in a repository
SYNOPSIS
--------
[verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m] [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>] [--write-midx]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
+ [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
+ [--write-midx] [--path-walk]
DESCRIPTION
-----------
@@ -249,6 +251,19 @@ linkgit:git-multi-pack-index[1]).
Write a multi-pack index (see linkgit:git-multi-pack-index[1])
containing the non-redundant packs.
+--path-walk::
+ This option passes the `--path-walk` option to the underlying
+ `git pack-objects` process (see linkgit:git-pack-objects[1]).
+ By default, `git pack-objects` walks objects in an order that
+ is unrelated to the paths at which trees and blobs appear
+ relative to a commit's root tree. The `--path-walk` option
+ enables a different walking algorithm that organizes trees and
+ blobs by path. This has the potential to improve delta compression
+ especially in the presence of filenames that cause collisions in
+ Git's default name-hash algorithm. Due to changing how the objects
+ are walked, this option is not compatible with `--delta-islands`
+ or `--filter`.
+
CONFIGURATION
-------------
diff --git a/builtin/repack.c b/builtin/repack.c
index cb4420f0856..af3f218ced7 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -39,7 +39,9 @@ static int run_update_server_info = 1;
static char *packdir, *packtmp_name, *packtmp;
static const char *const git_repack_usage[] = {
- N_("git repack [<options>]"),
+ N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
+ "[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
+ "[--write-midx] [--full-path-walk]"),
NULL
};
@@ -58,6 +60,7 @@ struct pack_objects_args {
int no_reuse_object;
int quiet;
int local;
+ int path_walk;
struct list_objects_filter_options filter_options;
};
@@ -289,6 +292,8 @@ static void prepare_pack_objects(struct child_process *cmd,
strvec_pushf(&cmd->args, "--no-reuse-delta");
if (args->no_reuse_object)
strvec_pushf(&cmd->args, "--no-reuse-object");
+ if (args->path_walk)
+ strvec_pushf(&cmd->args, "--path-walk");
if (args->local)
strvec_push(&cmd->args, "--local");
if (args->quiet)
@@ -1182,6 +1187,8 @@ int cmd_repack(int argc,
N_("pass --no-reuse-delta to git-pack-objects")),
OPT_BOOL('F', NULL, &po_args.no_reuse_object,
N_("pass --no-reuse-object to git-pack-objects")),
+ OPT_BOOL(0, "path-walk", &po_args.path_walk,
+ N_("pass --path-walk to git-pack-objects")),
OPT_NEGBIT('n', NULL, &run_update_server_info,
N_("do not run git-update-server-info"), 1),
OPT__QUIET(&po_args.quiet, N_("be quiet")),
diff --git a/t/perf/p5313-pack-objects.sh b/t/perf/p5313-pack-objects.sh
index 840075f5691..b588066ddb0 100755
--- a/t/perf/p5313-pack-objects.sh
+++ b/t/perf/p5313-pack-objects.sh
@@ -56,4 +56,22 @@ test_size 'big pack size with --path-walk' '
test_file_size out
'
+test_perf 'repack' '
+ git repack -adf
+'
+
+test_size 'repack size' '
+ pack=$(ls .git/objects/pack/pack-*.pack) &&
+ test_file_size "$pack"
+'
+
+test_perf 'repack with --path-walk' '
+ git repack -adf --path-walk
+'
+
+test_size 'repack size with --path-walk' '
+ pack=$(ls .git/objects/pack/pack-*.pack) &&
+ test_file_size "$pack"
+'
+
test_done
--
gitgitgadget
* [PATCH v2 13/17] repack: update usage to match docs
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (11 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 12/17] repack: add --path-walk option Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 14/17] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
` (4 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The t0450 test script verifies that the builtin usage matches the
synopsis in the documentation. Update 'git repack' to match and remove
it from the exception list.
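
As an illustration of the kind of check involved, the sketch below compares a documented synopsis against a usage string after collapsing whitespace. It is not the actual t0450 logic: `normalize_usage` and `usage_matches` are hypothetical helpers, and the strings are shortened to the last line of the 'git repack' usage.

```shell
#!/bin/sh
# Hypothetical helper: collapse runs of whitespace so that line
# wrapping and indentation do not count as differences.
normalize_usage () {
	printf '%s' "$1" | tr -s '[:space:]' ' ' | sed -e 's/^ //' -e 's/ $//'
}

# Succeeds (exit 0) only if both strings agree after normalization.
usage_matches () {
	test "$(normalize_usage "$1")" = "$(normalize_usage "$2")"
}

doc='[--write-midx] [--path-walk]'
old_usage='[--write-midx]
	[--full-path-walk]'
new_usage='[--write-midx]   [--path-walk]'

if usage_matches "$doc" "$old_usage"
then
	echo "old usage: match"
else
	echo "old usage: mismatch"
fi

if usage_matches "$doc" "$new_usage"
then
	echo "new usage: match"
else
	echo "new usage: mismatch"
fi
```

The old string fails the comparison only because of the `--full-path-walk` spelling, which is the one-word difference this patch corrects.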
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
builtin/repack.c | 2 +-
t/t0450/txt-help-mismatches | 1 -
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/builtin/repack.c b/builtin/repack.c
index af3f218ced7..50f208b48b4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -41,7 +41,7 @@ static char *packdir, *packtmp_name, *packtmp;
static const char *const git_repack_usage[] = {
N_("git repack [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]\n"
"[--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]\n"
- "[--write-midx] [--full-path-walk]"),
+ "[--write-midx] [--path-walk]"),
NULL
};
diff --git a/t/t0450/txt-help-mismatches b/t/t0450/txt-help-mismatches
index 285ae81a6b5..06b469bdee2 100644
--- a/t/t0450/txt-help-mismatches
+++ b/t/t0450/txt-help-mismatches
@@ -44,7 +44,6 @@ rebase
remote
remote-ext
remote-fd
-repack
reset
restore
rev-parse
--
gitgitgadget
* [PATCH v2 14/17] pack-objects: enable --path-walk via config
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (12 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 13/17] repack: update usage to match docs Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 15/17] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
` (3 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.
This should be limited to client repositories: the --path-walk option
disables bitmap walks, so it would be a poor fit for Git servers when
serving fetches and clones. It may still be helpful when repacking a
repository, to take advantage of improved deltas across historical
versions of the same files.
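
The interaction with bitmaps is visible in how the default gets chosen. The sketch below models, in shell, the order of checks that cmd_pack_objects() applies after this patch. `resolve_path_walk` is a hypothetical name, all inputs are simplified to integers, and the check for a valid gitdir is left out.

```shell
#!/bin/sh
# Hypothetical model of the default selection in cmd_pack_objects().
# Usage: resolve_path_walk <cli> <bitmap> <internal> <config> <test_env>
#   cli:      -1 if neither --path-walk nor --no-path-walk was given,
#             otherwise 0 or 1
#   bitmap:   1 if the bitmap index was explicitly requested
#   internal: 1 if the internal rev-list walk is in use (e.g. --revs)
#   config:   1 if the pack.usePathWalk setting is enabled
#   test_env: boolean value of GIT_TEST_PACK_PATH_WALK (0 or 1)
resolve_path_walk () {
	cli=$1 bitmap=$2 internal=$3 config=$4 test_env=$5

	if test "$cli" -ge 0
	then
		# An explicit command-line choice always wins.
		echo "$cli"
	elif test "$bitmap" -eq 1 || test "$internal" -eq 0
	then
		# Bitmaps or an external object list rule out the path walk.
		echo 0
	elif test "$config" -eq 1
	then
		echo 1
	else
		# Fall back to the test environment variable.
		echo "$test_env"
	fi
}

echo "config on, bitmaps requested: $(resolve_path_walk -1 1 1 1 0)"
echo "config on, no bitmaps:        $(resolve_path_walk -1 0 1 1 0)"
echo "config on, --no-path-walk:    $(resolve_path_walk 0 0 1 1 0)"
```

Even when this resolves to 1, the later checks for --filter, delta islands, and --shallow can still turn the feature off with a warning.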
Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/config/feature.txt | 4 ++++
Documentation/config/pack.txt | 8 ++++++++
builtin/pack-objects.c | 3 +++
repo-settings.c | 3 +++
repo-settings.h | 1 +
5 files changed, 19 insertions(+)
diff --git a/Documentation/config/feature.txt b/Documentation/config/feature.txt
index f061b64b748..cb49ff2604a 100644
--- a/Documentation/config/feature.txt
+++ b/Documentation/config/feature.txt
@@ -20,6 +20,10 @@ walking fewer objects.
+
* `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
reusing objects from multiple packs instead of just one.
++
+* `pack.usePathWalk` may speed up packfile creation and make the packfiles
+significantly smaller in the presence of certain filename collisions with Git's
+default name-hash.
feature.manyFiles::
Enable config options that optimize for repos with many files in the
diff --git a/Documentation/config/pack.txt b/Documentation/config/pack.txt
index da527377faf..08d06271177 100644
--- a/Documentation/config/pack.txt
+++ b/Documentation/config/pack.txt
@@ -155,6 +155,14 @@ pack.useSparse::
commits contain certain types of direct renames. Default is
`true`.
+pack.usePathWalk::
+ When true, git will default to using the '--path-walk' option in
+ 'git pack-objects' when the '--revs' option is present. This
+ algorithm groups objects by path to maximize the ability to
+ compute delta chains across historical versions of the same
+ object. This may disable other options, such as using bitmaps to
+ enumerate objects.
+
pack.preferBitmapTips::
When selecting which commits will receive bitmaps, prefer a
commit at the tip of any reference that is a suffix of any value
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b97bec5661e..6805a55c60d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -4587,6 +4587,9 @@ int cmd_pack_objects(int argc,
if (use_bitmap_index > 0 ||
!use_internal_rev_list)
path_walk = 0;
+ else if (the_repository->gitdir &&
+ the_repository->settings.pack_use_path_walk)
+ path_walk = 1;
else
path_walk = git_env_bool("GIT_TEST_PACK_PATH_WALK", 0);
}
diff --git a/repo-settings.c b/repo-settings.c
index 4699b4b3650..d8123b9323d 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -45,11 +45,13 @@ void prepare_repo_settings(struct repository *r)
r->settings.fetch_negotiation_algorithm = FETCH_NEGOTIATION_SKIPPING;
r->settings.pack_use_bitmap_boundary_traversal = 1;
r->settings.pack_use_multi_pack_reuse = 1;
+ r->settings.pack_use_path_walk = 1;
}
if (manyfiles) {
r->settings.index_version = 4;
r->settings.index_skip_hash = 1;
r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE;
+ r->settings.pack_use_path_walk = 1;
}
/* Commit graph config or default, does not cascade (simple) */
@@ -64,6 +66,7 @@ void prepare_repo_settings(struct repository *r)
/* Boolean config or default, does not cascade (simple) */
repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1);
+ repo_cfg_bool(r, "pack.usepathwalk", &r->settings.pack_use_path_walk, r->settings.pack_use_path_walk);
repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1);
repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0);
repo_cfg_bool(r, "index.skiphash", &r->settings.index_skip_hash, r->settings.index_skip_hash);
diff --git a/repo-settings.h b/repo-settings.h
index 51d6156a117..ae5c74ba60d 100644
--- a/repo-settings.h
+++ b/repo-settings.h
@@ -53,6 +53,7 @@ struct repo_settings {
enum untracked_cache_setting core_untracked_cache;
int pack_use_sparse;
+ int pack_use_path_walk;
enum fetch_negotiation_setting fetch_negotiation_algorithm;
int core_multi_pack_index;
--
gitgitgadget
* [PATCH v2 15/17] scalar: enable path-walk during push via config
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (13 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 14/17] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 16/17] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
` (2 subsequent siblings)
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
scalar.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/scalar.c b/scalar.c
index 73b79a5d4c9..daec7a4d08e 100644
--- a/scalar.c
+++ b/scalar.c
@@ -170,6 +170,7 @@ static int set_recommended_config(int reconfigure)
{ "core.autoCRLF", "false" },
{ "core.safeCRLF", "false" },
{ "fetch.showForcedUpdates", "false" },
+ { "pack.usePathWalk", "true" },
{ NULL, NULL },
};
int i;
--
gitgitgadget
* [PATCH v2 16/17] pack-objects: refactor path-walk delta phase
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (14 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 15/17] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-20 13:43 ` [PATCH v2 17/17] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2024-10-21 21:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Taylor Blau
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This made the progress
indicator look like it was taking a long time to enumerate objects and
then computing deltas very quickly.
Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.
This presents a new progress indicator that can be used in tests to
verify that this stage is happening.
The current implementation is not integrated with threads, but that
could be done in a future update.
Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
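The two-phase idea from the commit message (record regions during the
walk, run the delta search over them later) can be sketched on its own.
This is a simplified illustration, not the code in this patch: the
region_list type and both helpers are hypothetical, and the callback
stands in for the per-region delta search.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A contiguous slice [start, start + nr) of some object array. */
struct region {
	uint32_t start;
	uint32_t nr;
};

struct region_list {
	struct region *items;
	uint32_t nr, alloc;
};

/* Phase 1: remember a batch found by the walk; no delta work happens here. */
static int region_list_append(struct region_list *list,
			      uint32_t start, uint32_t nr)
{
	if (!nr)
		return 0; /* empty batches are not worth recording */

	if (list->nr == list->alloc) {
		uint32_t new_alloc = list->alloc ? 2 * list->alloc : 16;
		struct region *grown = realloc(list->items,
					       new_alloc * sizeof(*grown));
		if (!grown)
			return -1;
		list->items = grown;
		list->alloc = new_alloc;
	}

	list->items[list->nr].start = start;
	list->items[list->nr].nr = nr;
	list->nr++;
	return 0;
}

/*
 * Phase 2: visit every recorded region in order and hand it to 'fn'
 * (if any). Returns the total number of objects covered.
 */
static uint32_t region_list_process(const struct region_list *list,
				    void (*fn)(uint32_t start, uint32_t nr,
					       void *data),
				    void *data)
{
	uint32_t processed = 0;

	for (uint32_t i = 0; i < list->nr; i++) {
		if (fn)
			fn(list->items[i].start, list->items[i].nr, data);
		processed += list->items[i].nr;
	}
	return processed;
}
```

Separating the phases means the enumeration progress meter only counts
objects, while the later pass can report its own progress.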
builtin/pack-objects.c | 81 +++++++++++++++++++++++++-----------
pack-objects.h | 12 ++++++
t/t5300-pack-object.sh | 8 +++-
t/t5530-upload-pack-error.sh | 6 ---
4 files changed, 74 insertions(+), 33 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6805a55c60d..5c413ac07e6 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3204,6 +3204,50 @@ static int should_attempt_deltas(struct object_entry *entry)
return 1;
}
+static void find_deltas_for_region(struct object_entry *list UNUSED,
+ struct packing_region *region,
+ unsigned int *processed)
+{
+ struct object_entry **delta_list;
+ uint32_t delta_list_nr = 0;
+
+ ALLOC_ARRAY(delta_list, region->nr);
+ for (uint32_t i = 0; i < region->nr; i++) {
+ struct object_entry *entry = to_pack.objects + region->start + i;
+ if (should_attempt_deltas(entry))
+ delta_list[delta_list_nr++] = entry;
+ }
+
+ QSORT(delta_list, delta_list_nr, type_size_sort);
+ find_deltas(delta_list, &delta_list_nr, window, depth, processed);
+ free(delta_list);
+}
+
+static void find_deltas_by_region(struct object_entry *list,
+ struct packing_region *regions,
+ uint32_t start, uint32_t nr)
+{
+ unsigned int processed = 0;
+ uint32_t progress_nr;
+
+ if (!nr)
+ return;
+
+ progress_nr = regions[nr - 1].start + regions[nr - 1].nr;
+
+ if (progress)
+ progress_state = start_progress(_("Compressing objects by path"),
+ progress_nr);
+
+ while (nr--)
+ find_deltas_for_region(list,
+ &regions[start++],
+ &processed);
+
+ display_progress(progress_state, progress_nr);
+ stop_progress(&progress_state);
+}
+
static void prepare_pack(int window, int depth)
{
struct object_entry **delta_list;
@@ -3228,6 +3272,10 @@ static void prepare_pack(int window, int depth)
if (!to_pack.nr_objects || !window || !depth)
return;
+ if (path_walk)
+ find_deltas_by_region(to_pack.objects, to_pack.regions,
+ 0, to_pack.nr_regions);
+
ALLOC_ARRAY(delta_list, to_pack.nr_objects);
nr_deltas = n = 0;
@@ -4165,10 +4213,8 @@ static int add_objects_by_path(const char *path,
enum object_type type,
void *data)
{
- struct object_entry **delta_list;
size_t oe_start = to_pack.nr_objects;
size_t oe_end;
- unsigned int sub_list_size;
unsigned int *processed = data;
/*
@@ -4201,32 +4247,17 @@ static int add_objects_by_path(const char *path,
if (oe_end == oe_start || !window)
return 0;
- sub_list_size = 0;
- ALLOC_ARRAY(delta_list, oe_end - oe_start);
+ ALLOC_GROW(to_pack.regions,
+ to_pack.nr_regions + 1,
+ to_pack.nr_regions_alloc);
- for (size_t i = 0; i < oe_end - oe_start; i++) {
- struct object_entry *entry = to_pack.objects + oe_start + i;
+ to_pack.regions[to_pack.nr_regions].start = oe_start;
+ to_pack.regions[to_pack.nr_regions].nr = oe_end - oe_start;
+ to_pack.nr_regions++;
- if (!should_attempt_deltas(entry))
- continue;
+ *processed += oids->nr;
+ display_progress(progress_state, *processed);
- delta_list[sub_list_size++] = entry;
- }
-
- /*
- * Find delta bases among this list of objects that all match the same
- * path. This causes the delta compression to be interleaved in the
- * object walk, which can lead to confusing progress indicators. This is
- * also incompatible with threaded delta calculations. In the future,
- * consider creating a list of regions in the full to_pack.objects array
- * that could be picked up by the threaded delta computation.
- */
- if (sub_list_size && window) {
- QSORT(delta_list, sub_list_size, type_size_sort);
- find_deltas(delta_list, &sub_list_size, window, depth, processed);
- }
-
- free(delta_list);
return 0;
}
diff --git a/pack-objects.h b/pack-objects.h
index b9898a4e64b..bde4ba19755 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -118,11 +118,23 @@ struct object_entry {
unsigned ext_base:1; /* delta_idx points outside packlist */
};
+/**
+ * A packing region is a section of the packing_data.objects array
+ * as given by a starting index and a number of elements.
+ */
+struct packing_region {
+ uint32_t start;
+ uint32_t nr;
+};
+
struct packing_data {
struct repository *repo;
struct object_entry *objects;
uint32_t nr_objects, nr_alloc;
+ struct packing_region *regions;
+ uint32_t nr_regions, nr_regions_alloc;
+
int32_t *index;
uint32_t index_size;
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 5f6914acae7..4f81613eab1 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -677,7 +677,9 @@ done
# Basic "repack everything" test
test_expect_success '--path-walk pack everything' '
git -C server rev-parse HEAD >in &&
- git -C server pack-objects --stdout --revs --path-walk <in >out.pack &&
+ GIT_PROGRESS_DELAY=0 git -C server pack-objects \
+ --stdout --revs --path-walk --progress <in >out.pack 2>err &&
+ grep "Compressing objects by path" err &&
git -C server index-pack --stdin <out.pack
'
@@ -687,7 +689,9 @@ test_expect_success '--path-walk thin pack' '
$(git -C server rev-parse HEAD)
^$(git -C server rev-parse HEAD~2)
EOF
- git -C server pack-objects --thin --stdout --revs --path-walk <in >out.pack &&
+ GIT_PROGRESS_DELAY=0 git -C server pack-objects \
+ --thin --stdout --revs --path-walk --progress <in >out.pack 2>err &&
+ grep "Compressing objects by path" err &&
git -C server index-pack --fix-thin --stdin <out.pack
'
diff --git a/t/t5530-upload-pack-error.sh b/t/t5530-upload-pack-error.sh
index 356b96cb741..7172780d550 100755
--- a/t/t5530-upload-pack-error.sh
+++ b/t/t5530-upload-pack-error.sh
@@ -35,12 +35,6 @@ test_expect_success 'upload-pack fails due to error in pack-objects packing' '
hexsz=$(test_oid hexsz) &&
printf "%04xwant %s\n00000009done\n0000" \
$(($hexsz + 10)) $head >input &&
-
- # The current implementation of path-walk causes a different
- # error message. This will be changed by a future refactoring.
- GIT_TEST_PACK_PATH_WALK=0 &&
- export GIT_TEST_PACK_PATH_WALK &&
-
test_must_fail git upload-pack . <input >/dev/null 2>output.err &&
test_grep "unable to read" output.err &&
test_grep "pack-objects died" output.err
--
gitgitgadget
* [PATCH v2 17/17] pack-objects: thread the path-based compression
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (15 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 16/17] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
@ 2024-10-20 13:43 ` Derrick Stolee via GitGitGadget
2024-10-21 21:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Taylor Blau
17 siblings, 0 replies; 55+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-20 13:43 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee,
Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.
This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending a batch based on
name-hash, which is how ll_find_deltas() tries to keep related sections
of the object entry array together. We re-use the 'list_size' and
'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.
Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)
Test                                  HEAD~1      HEAD
--------------------------------------------------------------
5313.2: thin pack                       0.01      0.01   +0.0%
5313.4: thin pack with --path-walk      0.01      0.01   +0.0%
5313.6: big pack                        2.54      2.60   +2.4%
5313.8: big pack with --path-walk       4.70      3.09  -34.3%
5313.10: repack                        28.75     28.55   -0.7%
5313.12: repack with --path-walk      108.55     46.14  -57.5%
On the microsoft/fluentui repo, where the --path-walk feature has been
shown to be more effective in space savings, we get these results:
Test                                  HEAD~1      HEAD
--------------------------------------------------------------
5313.2: thin pack                       0.39      0.40   +2.6%
5313.4: thin pack with --path-walk      0.08      0.07  -12.5%
5313.6: big pack                        4.15      4.15   +0.0%
5313.8: big pack with --path-walk       6.41      3.21  -49.9%
5313.10: repack                        90.69     90.83   +0.2%
5313.12: repack with --path-walk      108.23     49.09  -54.6%
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
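The job-splitting arithmetic from the commit message can be tried out
without any threads. This is a minimal sketch, assuming the split is
purely by region count; the slice type and both helpers are hypothetical
names, and the "only steal when enough work remains" threshold used in
the patch is left out.

```c
#include <assert.h>

/* A worker's share of the region array: 'remaining' regions from 'start'. */
struct slice {
	unsigned start;
	unsigned remaining;
};

/*
 * Initial split: worker i of n takes (regions left) / (n - i), so any
 * remainder is spread over the later workers.
 */
static void partition_regions(unsigned nr, unsigned nr_workers,
			      struct slice *out)
{
	unsigned next = 0;

	for (unsigned i = 0; i < nr_workers; i++) {
		unsigned sub_size = nr / (nr_workers - i);

		out[i].start = next;
		out[i].remaining = sub_size;
		next += sub_size;
		nr -= sub_size;
	}
}

/*
 * Work stealing: an idle worker takes the back half of the victim's
 * remaining regions, and the victim keeps the front half.
 */
static struct slice steal_half(struct slice *victim)
{
	struct slice taken;

	taken.remaining = victim->remaining / 2;
	taken.start = victim->start + victim->remaining - taken.remaining;
	victim->remaining -= taken.remaining;
	return taken;
}
```

For example, 10 regions over 4 workers split as 2, 2, 3, 3, and stealing
from a worker with 7 regions left hands over the last 3 of them.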
builtin/pack-objects.c | 162 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 160 insertions(+), 2 deletions(-)
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 5c413ac07e6..443ce17063a 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2935,6 +2935,7 @@ static void find_deltas(struct object_entry **list, unsigned *list_size,
struct thread_params {
pthread_t thread;
struct object_entry **list;
+ struct packing_region *regions;
unsigned list_size;
unsigned remaining;
int window;
@@ -3248,6 +3249,163 @@ static void find_deltas_by_region(struct object_entry *list,
stop_progress(&progress_state);
}
+static void *threaded_find_deltas_by_path(void *arg)
+{
+ struct thread_params *me = arg;
+
+ progress_lock();
+ while (me->remaining) {
+ while (me->remaining) {
+ progress_unlock();
+ find_deltas_for_region(to_pack.objects,
+ me->regions,
+ me->processed);
+ progress_lock();
+ me->remaining--;
+ me->regions++;
+ }
+
+ me->working = 0;
+ pthread_cond_signal(&progress_cond);
+ progress_unlock();
+
+ /*
+ * We must not set ->data_ready before we wait on the
+ * condition because the main thread may have set it to 1
+ * before we get here. In order to be sure that new
+ * work is available if we see 1 in ->data_ready, it
+ * was initialized to 0 before this thread was spawned
+ * and we reset it to 0 right away.
+ */
+ pthread_mutex_lock(&me->mutex);
+ while (!me->data_ready)
+ pthread_cond_wait(&me->cond, &me->mutex);
+ me->data_ready = 0;
+ pthread_mutex_unlock(&me->mutex);
+
+ progress_lock();
+ }
+ progress_unlock();
+ /* leave ->working 1 so that this doesn't get more work assigned */
+ return NULL;
+}
+
+static void ll_find_deltas_by_region(struct object_entry *list,
+ struct packing_region *regions,
+ uint32_t start, uint32_t nr)
+{
+ struct thread_params *p;
+ int i, ret, active_threads = 0;
+ unsigned int processed = 0;
+ uint32_t progress_nr;
+
+ if (!nr)
+ return;
+
+ init_threaded_search();
+ progress_nr = regions[nr - 1].start + regions[nr - 1].nr;
+ if (delta_search_threads <= 1) {
+ find_deltas_by_region(list, regions, start, nr);
+ cleanup_threaded_search();
+ return;
+ }
+
+ if (progress > pack_to_stdout)
+ fprintf_ln(stderr, _("Path-based delta compression using up to %d threads"),
+ delta_search_threads);
+ CALLOC_ARRAY(p, delta_search_threads);
+
+ if (progress)
+ progress_state = start_progress(_("Compressing objects by path"),
+ progress_nr);
+ /* Partition the work amongst work threads. */
+ for (i = 0; i < delta_search_threads; i++) {
+ unsigned sub_size = nr / (delta_search_threads - i);
+
+ p[i].window = window;
+ p[i].depth = depth;
+ p[i].processed = &processed;
+ p[i].working = 1;
+ p[i].data_ready = 0;
+
+ p[i].regions = regions;
+ p[i].list_size = sub_size;
+ p[i].remaining = sub_size;
+
+ regions += sub_size;
+ nr -= sub_size;
+ }
+
+ /* Start work threads. */
+ for (i = 0; i < delta_search_threads; i++) {
+ if (!p[i].list_size)
+ continue;
+ pthread_mutex_init(&p[i].mutex, NULL);
+ pthread_cond_init(&p[i].cond, NULL);
+ ret = pthread_create(&p[i].thread, NULL,
+ threaded_find_deltas_by_path, &p[i]);
+ if (ret)
+ die(_("unable to create thread: %s"), strerror(ret));
+ active_threads++;
+ }
+
+ /*
+ * Now let's wait for work completion. Each time a thread is done
+ * with its work, we steal half of the remaining work from the
+ * thread with the largest number of unprocessed objects and give
+ * it to that newly idle thread. This ensures good load balancing
+ * until the remaining object list segments are simply too short
+ * to be worth splitting anymore.
+ */
+ while (active_threads) {
+ struct thread_params *target = NULL;
+ struct thread_params *victim = NULL;
+ unsigned sub_size = 0;
+
+ progress_lock();
+ for (;;) {
+ for (i = 0; !target && i < delta_search_threads; i++)
+ if (!p[i].working)
+ target = &p[i];
+ if (target)
+ break;
+ pthread_cond_wait(&progress_cond, &progress_mutex);
+ }
+
+ for (i = 0; i < delta_search_threads; i++)
+ if (p[i].remaining > 2*window &&
+ (!victim || victim->remaining < p[i].remaining))
+ victim = &p[i];
+ if (victim) {
+ sub_size = victim->remaining / 2;
+ target->regions = victim->regions + victim->remaining - sub_size;
+ victim->list_size -= sub_size;
+ victim->remaining -= sub_size;
+ }
+ target->list_size = sub_size;
+ target->remaining = sub_size;
+ target->working = 1;
+ progress_unlock();
+
+ pthread_mutex_lock(&target->mutex);
+ target->data_ready = 1;
+ pthread_cond_signal(&target->cond);
+ pthread_mutex_unlock(&target->mutex);
+
+ if (!sub_size) {
+ pthread_join(target->thread, NULL);
+ pthread_cond_destroy(&target->cond);
+ pthread_mutex_destroy(&target->mutex);
+ active_threads--;
+ }
+ }
+ cleanup_threaded_search();
+ free(p);
+
+ display_progress(progress_state, progress_nr);
+ stop_progress(&progress_state);
+}
+
static void prepare_pack(int window, int depth)
{
struct object_entry **delta_list;
@@ -3273,8 +3431,8 @@ static void prepare_pack(int window, int depth)
return;
if (path_walk)
- find_deltas_by_region(to_pack.objects, to_pack.regions,
- 0, to_pack.nr_regions);
+ ll_find_deltas_by_region(to_pack.objects, to_pack.regions,
+ 0, to_pack.nr_regions);
ALLOC_ARRAY(delta_list, to_pack.nr_objects);
nr_deltas = n = 0;
--
gitgitgadget
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-20 13:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Derrick Stolee via GitGitGadget
` (16 preceding siblings ...)
2024-10-20 13:43 ` [PATCH v2 17/17] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
@ 2024-10-21 21:43 ` Taylor Blau
2024-10-24 13:29 ` Derrick Stolee
2024-10-28 5:46 ` Patrick Steinhardt
17 siblings, 2 replies; 55+ messages in thread
From: Taylor Blau @ 2024-10-21 21:43 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk, Derrick Stolee
On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote:
> Updates in v2
> =============
>
> I'm sending this v2 to request some review feedback on the series. I'm sorry
> it's so long.
>
> There are two updates in this version:
>
> * Fixed a performance issue in the presence of many annotated tags. This is
> caught by p5313 when run on a repo with 10,000+ annotated tags.
> * The Scalar config was previously wrong and should be pack.usePathWalk,
> not push.usePathWalk.
Thanks. I queued the new round. As an aside, I would like to find the
time to review this series in depth, but haven't been able to do so
while also trying to keep up with the volume of the rest of the list.
I know that this topic was split out of a larger one. It may be worth
seeing if there is a way to split this topic out into multiple series
that are more easily review-able, but still sensible on their own.
I haven't looked in enough depth to know myself whether such a cut
exists, but it is worth thinking about if you haven't done so already.
Thanks,
Taylor
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-21 21:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Taylor Blau
@ 2024-10-24 13:29 ` Derrick Stolee
2024-10-24 15:52 ` Taylor Blau
2024-10-28 5:46 ` Patrick Steinhardt
1 sibling, 1 reply; 55+ messages in thread
From: Derrick Stolee @ 2024-10-24 13:29 UTC (permalink / raw)
To: Taylor Blau, Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk
On 10/21/24 5:43 PM, Taylor Blau wrote:
> On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote:
>> Updates in v2
>> =============
>>
>> I'm sending this v2 to request some review feedback on the series. I'm sorry
>> it's so long.
>>
>> There are two updates in this version:
>>
>> * Fixed a performance issue in the presence of many annotated tags. This is
>> caught by p5313 when run on a repo with 10,000+ annotated tags.
>> * The Scalar config was previously wrong and should be pack.usePathWalk,
>> not push.usePathWalk.
>
> Thanks. I queued the new round. As an aside, I would like to find the
> time to review this series in depth, but haven't been able to do so
> while also trying to keep up with the volume of the rest of the list.
>
> I know that this topic was split out of a larger one. It may be worth
> seeing if there is a way to split this topic out into multiple series
> that are more easily review-able, but still sensible on their own.
I'll see what I can do. I needed to re-roll after discovering an issue
when trying to integrate the algorithm with shallow clones. The solution
ends up simplifying the code, which is nice.
> I haven't looked in enough depth to know myself whether such a cut
> exists, but it is worth thinking about if you haven't done so already.
In the current series, there's a natural cut between patches 1-4
and the rest, if we want to put the API in without a non-test consumer.
I could also split out the 'git repack' changes into a third series.
Finally, the threading implementation could be done separately, but I
think it's not complicated enough to leave out from the first version
of the --path-walk option in 'git pack-objects'.
Thanks,
-Stolee
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-24 13:29 ` Derrick Stolee
@ 2024-10-24 15:52 ` Taylor Blau
0 siblings, 0 replies; 55+ messages in thread
From: Taylor Blau @ 2024-10-24 15:52 UTC (permalink / raw)
To: Derrick Stolee
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk
On Thu, Oct 24, 2024 at 09:29:02AM -0400, Derrick Stolee wrote:
> > I know that this topic was split out of a larger one. It may be worth
> > seeing if there is a way to split this topic out into multiple series
> > that are more easily review-able, but still sensible on their own.
>
> I'll see what I can do. I needed to re-roll after discovering an issue
> when trying to integrate the algorithm with shallow clones. The solution
> ends up simplifying the code, which is nice.
It's always nice when that happens :-).
Should we avoid reviewing the current round in anticipation of a
somewhat restructured series, or would you like us to review the current
round as well?
> > I haven't looked in enough depth to know myself whether such a cut
> > exists, but it is worth thinking about if you haven't done so already.
>
> In the current series, there's a natural cut between patches 1-4
> and the rest, if we want to put the API in without a non-test consumer.
>
> I could also split out the 'git repack' changes into a third series.
>
> Finally, the threading implementation could be done separately, but I
> think it's not complicated enough to leave out from the first version
> of the --path-walk option in 'git pack-objects'.
I'd suggest erring on the side of more smaller series rather than a
single large one. If you feel like there are cut points where we can
review them in isolation and still see some benefit, or at least
clearly how they each fit into the larger puzzle, I think that is worth
doing.
But I trust your judgement here, so if you think that the series is best
reviewed as a whole, then that's fine too. Just my $.02 :-).
Thanks,
Taylor
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-21 21:43 ` [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas Taylor Blau
2024-10-24 13:29 ` Derrick Stolee
@ 2024-10-28 5:46 ` Patrick Steinhardt
2024-10-28 16:47 ` Taylor Blau
1 sibling, 1 reply; 55+ messages in thread
From: Patrick Steinhardt @ 2024-10-28 5:46 UTC (permalink / raw)
To: Taylor Blau
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk, Derrick Stolee
On Mon, Oct 21, 2024 at 05:43:48PM -0400, Taylor Blau wrote:
> On Sun, Oct 20, 2024 at 01:43:13PM +0000, Derrick Stolee via GitGitGadget wrote:
> > Updates in v2
> > =============
> >
> > I'm sending this v2 to request some review feedback on the series. I'm sorry
> > it's so long.
> >
> > There are two updates in this version:
> >
> > * Fixed a performance issue in the presence of many annotated tags. This is
> > caught by p5313 when run on a repo with 10,000+ annotated tags.
> > * The Scalar config was previously wrong and should be pack.usePathWalk,
> > not push.usePathWalk.
>
> Thanks. I queued the new round. As an aside, I would like to find the
> time to review this series in depth, but haven't been able to do so
> while also trying to keep up with the volume of the rest of the list.
>
> I know that this topic was split out of a larger one. It may be worth
> seeing if there is a way to split this topic out into multiple series
> that are more easily review-able, but still sensible on their own.
>
> I haven't looked in enough depth to know myself whether such a cut
> exists, but it is worth thinking about if you haven't done so already.
I'm in the same boat -- I want to review this, but somehow never find
the time to sit down and do it. I definitely won't get to it this week
as I'll be out of office for most of it.
I've flagged this internally now at GitLab so that we can provide some
more data with some of the repos that are on the bigger side to check
whether we can confirm the findings and to prioritize its review.
Patrick
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-28 5:46 ` Patrick Steinhardt
@ 2024-10-28 16:47 ` Taylor Blau
2024-10-28 17:13 ` Derrick Stolee
0 siblings, 1 reply; 55+ messages in thread
From: Taylor Blau @ 2024-10-28 16:47 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk, Derrick Stolee
On Mon, Oct 28, 2024 at 06:46:07AM +0100, Patrick Steinhardt wrote:
> I've flagged this internally now at GitLab so that we can provide some
> more data with some of the repos that are on the bigger side to check
> whether we can confirm the findings and to prioritize its review.
I suspect that you'll end up measuring no change assuming that you
(AFAIK) use bitmaps and (I imagine) delta islands in your production
configuration? This series is not compatible with either of those
features to my knowledge.
Thanks,
Taylor
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-28 16:47 ` Taylor Blau
@ 2024-10-28 17:13 ` Derrick Stolee
2024-10-28 17:25 ` Taylor Blau
0 siblings, 1 reply; 55+ messages in thread
From: Derrick Stolee @ 2024-10-28 17:13 UTC (permalink / raw)
To: Taylor Blau, Patrick Steinhardt
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk
On 10/28/24 12:47 PM, Taylor Blau wrote:
> On Mon, Oct 28, 2024 at 06:46:07AM +0100, Patrick Steinhardt wrote:
>> I've flagged this internally now at GitLab so that we can provide some
>> more data with some of the repos that are on the bigger side to check
>> whether we can confirm the findings and to prioritize its review.
>
> I suspect that you'll end up measuring no change assuming that you
> (AFAIK) use bitmaps and (I imagine) delta islands in your production
> configuration? This series is not compatible with either of those
> features to my knowledge.
You are correct that this is not compatible with those features as-is.
_Maybe_ there is potential to integrate them in the future, but that
would require better understanding of whether the new compression
mechanism is valuable in enough cases (final storage size or maybe even
in repacking time).
At the very least, it would be helpful if some other large repos were
tested to see how commonly this could help client-side users. Are
there other aspects to a repo's structure that could be important to
how effective this approach is?
Thanks,
-Stolee
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-28 17:13 ` Derrick Stolee
@ 2024-10-28 17:25 ` Taylor Blau
2024-10-28 19:46 ` Derrick Stolee
0 siblings, 1 reply; 55+ messages in thread
From: Taylor Blau @ 2024-10-28 17:25 UTC (permalink / raw)
To: Derrick Stolee
Cc: Patrick Steinhardt, Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk
On Mon, Oct 28, 2024 at 01:13:15PM -0400, Derrick Stolee wrote:
> On 10/28/24 12:47 PM, Taylor Blau wrote:
> > On Mon, Oct 28, 2024 at 06:46:07AM +0100, Patrick Steinhardt wrote:
> > > I've flagged this internally now at GitLab so that we can provide some
> > > more data with some of the repos that are on the bigger side to check
> > > whether we can confirm the findings and to prioritize its review.
> >
> > I suspect that you'll end up measuring no change assuming that you
> > (AFAIK) use bitmaps and (I imagine) delta islands in your production
> > configuration? This series is not compatible with either of those
> > features to my knowledge.
> You are correct that this is not compatible with those features as-is.
> _Maybe_ there is potential to integrate them in the future, but that
> would require better understanding of whether the new compression
> mechanism is valuable in enough cases (final storage size or maybe even
> in repacking time).
I think the bitmap thing is not too big of a hurdle. The .bitmap file is
the only spot we store name-hash values on-disk in the "hashcache"
extension.
Unfortunately, there is no easy way to reuse the format of the existing
hashcache extension as-is to indicate to the reader whether they are
recording traditional name-hash values, or the new --path-walk hash
values.
I suspect that you could either add a new extension for --path-walk hash
values, or add a new variant of the hashcache extension that has a flag
to indicate what kind of hash value it's recording.
Of the two, I think the latter is preferred, since it would allow us to
grow new hash functions on paths in the future without needing to add an
additional extension (only a new bit in the existing one).
> At the very least, it would be helpful if some other large repos were
> tested to see how commonly this could help client-side users. Are
> there other aspects to a repo's structure that could be important to
> how effective this approach is?
What measurements are you looking for here? I thought that you had
already done an extensive job of measuring the client-side impact of
pushing smaller packs and faster local repacks, no?
Thanks,
Taylor
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-28 17:25 ` Taylor Blau
@ 2024-10-28 19:46 ` Derrick Stolee
2024-10-29 18:02 ` Taylor Blau
0 siblings, 1 reply; 55+ messages in thread
From: Derrick Stolee @ 2024-10-28 19:46 UTC (permalink / raw)
To: Taylor Blau
Cc: Patrick Steinhardt, Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk
On 10/28/24 1:25 PM, Taylor Blau wrote:
> On Mon, Oct 28, 2024 at 01:13:15PM -0400, Derrick Stolee wrote:
>> You are correct that this is not compatible with those features as-is.
>> _Maybe_ there is potential to integrate them in the future, but that
>> would require better understanding of whether the new compression
>> mechanism is valuable in enough cases (final storage size or maybe even
>> in repacking time).
>
> I think the bitmap thing is not too big of a hurdle. The .bitmap file is
> the only spot we store name-hash values on-disk in the "hashcache"
> extension.
>
> Unfortunately, there is no easy way to reuse the format of the existing
> hashcache extension as-is to indicate to the reader whether they are
> recording traditional name-hash values, or the new --path-walk hash
> values.
The --path-walk option does not mess with the name-hash. You're thinking
of the --full-name-hash feature [1] that was pulled out due to a lack of
interest (and better results with --path-walk).
[1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/
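For readers unfamiliar with the name-hash mentioned here, the following is
a minimal sketch modeled on pack_name_hash() in Git's pack-objects code. It
is written from memory of that function rather than copied from it, and the
names name_hash_sketch() and name_hash_demo() as well as the example paths
are made up for illustration. It shows why paths that share a long trailing
string hash to the same value.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of the traditional name-hash (assumed to mirror pack_name_hash()).
 * Each non-whitespace byte is added at the top of the 32-bit word and the
 * previous value is shifted right by two bits, so for ASCII paths only the
 * last sixteen non-whitespace bytes influence the result.
 */
static uint32_t name_hash_sketch(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace((int)c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}

/* Illustrative self-check; the paths are hypothetical. */
static void name_hash_demo(void)
{
	/* Identical trailing sixteen bytes: the leading directory is lost. */
	assert(name_hash_sketch("x/docs/CHANGELOG.md") ==
	       name_hash_sketch("y/docs/CHANGELOG.md"));
	/* A different final byte changes the hash. */
	assert(name_hash_sketch("a.c") != name_hash_sketch("a.h"));
}
```

Because the hash only reflects the tail of the path, unrelated files with
the same basename land next to each other in the delta window. The
--path-walk option sidesteps this by grouping objects by their full path
instead of changing the hash.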
>> At the very least, it would be helpful if some other large repos were
>> tested to see how commonly this could help client-side users. Are
>> there other aspects to a repo's structure that could be important to
>> how effective this approach is?
>
> What measurements are you looking for here? I thought that you had
> already done an extensive job of measuring the client-side impact of
> pushing smaller packs and faster local repacks, no?
I've done what I can with the repos I know about, but perhaps other
folks have other repos they like to test that might present new
aspects to the problem.
For example, a colleague was testing this in a variety of JavaScript
repos and found that the node repo [2] was slightly worse with the
--path-walk option. I've since discovered that this is only true when
using a checked-out copy and the .git/index file is iterated, as some
large source files with few versions become split across the boundary
of "in the index" or "in commit history". (I am fixing this aspect as
well in the next iteration, hence some reason for its delay.)
[2] https://github.com/nodejs/node
Thanks,
-Stolee
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-28 19:46 ` Derrick Stolee
@ 2024-10-29 18:02 ` Taylor Blau
2024-10-31 2:28 ` Derrick Stolee
0 siblings, 1 reply; 55+ messages in thread
From: Taylor Blau @ 2024-10-29 18:02 UTC (permalink / raw)
To: Derrick Stolee
Cc: Patrick Steinhardt, Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk
On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote:
> On 10/28/24 1:25 PM, Taylor Blau wrote:
> > On Mon, Oct 28, 2024 at 01:13:15PM -0400, Derrick Stolee wrote:
>
> > > You are correct that this is not compatible with those features as-is.
> > > _Maybe_ there is potential to integrate them in the future, but that
> > > would require better understanding of whether the new compression
> > > mechanism is valuable in enough cases (final storage size or maybe even
> > > in repacking time).
> >
> > I think the bitmap thing is not too big of a hurdle. The .bitmap file is
> > the only spot we store name-hash values on-disk in the "hashcache"
> > extension.
> >
> > Unfortunately, there is no easy way to reuse the format of the existing
> > hashcache extension as-is to indicate to the reader whether they are
> > recording traditional name-hash values, or the new --path-walk hash
> > values.
>
> The --path-walk option does not mess with the name-hash. You're thinking
> of the --full-name-hash feature [1] that was pulled out due to a lack of
> interest (and better results with --path-walk).
>
> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/
Ah, gotcha. Thanks for clarifying.
What is the incompatibility between the two, then? Is it just that
bitmaps give us the objects in pack- or pseudo-pack order, and we don't
have a way to permute that back into the order that --path-walk would
give us?
If so, a couple of thoughts:
- You could consider storing the path information for each blob and
tree object in the bitmap using a trie-like structure. This would
give you enough information to reconstruct the path-walk order (I
think) at read-time, but at significant cost in terms of the on-disk
size of the .bitmap.
- Alternatively, if you construct the bitmap from a pack or packs that
were generated in path-walk order, then you could store a
permutation between pack order and path-walk order in the bitmap
itself.
- Alternatively still: if the actual pack *order* were dictated solely
by path-walk, then neither of these would be necessary.
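The second option (a stored permutation between pack order and path-walk
order) could be sketched roughly as below. This is only an illustration
under stated assumptions: invert_order(), walk_to_pack and pack_to_walk are
hypothetical names, and nothing here corresponds to an existing .bitmap
extension or on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * walk_to_pack[i] is the pack position of the i-th object in path-walk
 * order (hypothetical input). Fill pack_to_walk so that
 * pack_to_walk[walk_to_pack[i]] == i for every i.
 *
 * Returns 0 on success, or -1 if the input is not a permutation of
 * 0..n-1 (a position is out of range or appears twice).
 */
static int invert_order(const uint32_t *walk_to_pack,
			uint32_t *pack_to_walk, size_t n)
{
	size_t i;

	/* UINT32_MAX is reserved as the "not yet assigned" sentinel. */
	assert(n < UINT32_MAX);

	for (i = 0; i < n; i++)
		pack_to_walk[i] = UINT32_MAX;

	for (i = 0; i < n; i++) {
		uint32_t pos = walk_to_pack[i];

		if (pos >= n || pack_to_walk[pos] != UINT32_MAX)
			return -1;
		pack_to_walk[pos] = (uint32_t)i;
	}
	return 0;
}
```

With both arrays available, a reader holding a bit position in pack order
could look up where that object falls in path-walk order and vice versa, at
a cost of one 32-bit entry per object in each direction.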
That all said, I'm still not sure that there is a compatibility blocker
here. Is the goal to ensure that packs generated with
--use-bitmap-index are still compact in the same way that they would be
with your new --path-walk option?
If so, I think matching the object order in a pack to the path walk
order would achieve that goal, since the chunks that you end up reusing
verbatim as a result of pack-reuse driven by the bitmap would already be
delta-ified according to the --path-walk rules, so the resulting pack
would appear similarly.
OTOH, the order in which we pack objects is extremely important to
performance, as you are no doubt aware. So changing that order to more
closely match the --path-walk option should be done with great care.
Anyway. All of that is to say that I want to better understand what does
and doesn't work together between bitmaps and path-walk. Given my
current understanding, it seems there are a couple of approaches to
unifying these two things together, so it would be nice to be able to
do so if possible.
Thanks,
Taylor
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-29 18:02 ` Taylor Blau
@ 2024-10-31 2:28 ` Derrick Stolee
2024-10-31 21:07 ` Taylor Blau
0 siblings, 1 reply; 55+ messages in thread
From: Derrick Stolee @ 2024-10-31 2:28 UTC (permalink / raw)
To: Taylor Blau
Cc: Patrick Steinhardt, Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk
On 10/29/24 2:02 PM, Taylor Blau wrote:
> On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote:
>> On 10/28/24 1:25 PM, Taylor Blau wrote:
>>> Unfortunately, there is no easy way to reuse the format of the existing
>>> hashcache extension as-is to indicate to the reader whether they are
>>> recording traditional name-hash values, or the new --path-walk hash
>>> values.
>>
>> The --path-walk option does not mess with the name-hash. You're thinking
>> of the --full-name-hash feature [1] that was pulled out due to a lack of
>> interest (and better results with --path-walk).
>>
>> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/
>
> Ah, gotcha. Thanks for clarifying.
>
> What is the incompatibility between the two, then? Is it just that
> bitmaps give us the objects in pack- or pseudo-pack order, and we don't
> have a way to permute that back into the order that --path-walk would
> give us?
The incompatibility of reading bitmaps and using the path-walk API is
that the path-walk API does not check a bitmap to see if an object is
already discovered. Thus, it does not use the reachability information
from the bitmap at all and would parse commits and trees to find the
objects that should be in the pack-file.
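As a rough illustration of what "checking a bitmap to see if an object is
already discovered" could look like, here is a minimal sketch. It assumes
each object has a bit position (for example its index in pack order) and
uses a plain uint64_t bitset rather than Git's EWAH-compressed bitmaps. The
names struct seen_bits, seen_test(), seen_set() and filter_unseen() are
hypothetical and are not part of the path-walk API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical uncompressed bitset, one bit per object position. */
struct seen_bits {
	uint64_t *words;
	size_t nr_bits;
};

static int seen_test(const struct seen_bits *s, size_t pos)
{
	assert(pos < s->nr_bits);
	return (int)((s->words[pos / 64] >> (pos % 64)) & 1);
}

static void seen_set(struct seen_bits *s, size_t pos)
{
	assert(pos < s->nr_bits);
	s->words[pos / 64] |= (uint64_t)1 << (pos % 64);
}

/*
 * Given the candidate object positions found at one path, copy into
 * 'out' only those not already marked in 's', marking them as we go.
 * Returns the number of positions written to 'out'.
 */
static size_t filter_unseen(struct seen_bits *s, const size_t *candidates,
			    size_t nr, size_t *out)
{
	size_t i, kept = 0;

	for (i = 0; i < nr; i++) {
		if (seen_test(s, candidates[i]))
			continue;
		seen_set(s, candidates[i]);
		out[kept++] = candidates[i];
	}
	return kept;
}
```

If the bitset were pre-populated from a reachability bitmap, a walk using
such a filter could skip objects the bitmap already covers instead of
parsing commits and trees to rediscover them. Whether that fits the actual
path-walk code is not something this sketch establishes.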
It is also worth noting that using something like 'git repack
--path-walk' does not mean that future 'git pack-objects' executions
from that packfile data need to use the --path-walk option. I expect
that it should be painless to write bitmaps on top of a packfile created
with 'git repack -adf --path-walk', but since most places doing so also
likely want delta islands, I have not explored this option thoroughly.
(Delta islands are their own challenge, since the path-walk API is not
spreading the reachability information across the objects it walks.
However, this could be remedied by doing a separate walk to identify
islands using the normal method. I believe Peff had an idea in that
direction in another thread. This requires some integration and testing
that I don't have the expertise to provide.)
> If so, a couple of thoughts:
> ...
Since the incompatibility is in a different direction, I don't think
these thoughts were relevant to the problem.
> OTOH, the order in which we pack objects is extremely important to
> performance as you no doubt are aware of. So changing that order to more
> closely match the --path-walk option should be done with great care.
This is a place where I'm unsure about how the --path-walk option adjusts
the object order within the pack. The packing list gets resorted to match
the typical method, at least for how the delta compression window works.
This would be another good reason to consider the --path-walk option in
server environments very carefully. My patch series puts up guard rails
specifically because it makes no claim to be effective in all of the
dimensions that matter for those scenarios. Hopefully, others will be
motivated enough to determine if the compression that's possible with
this algorithm could be achieved in a way that is compatible with server
needs.
> Anyway. All of that is to say that I want to better understand what does
> and doesn't work together between bitmaps and path-walk. Given my
> current understanding, it seems there are a couple of approaches to
> unifying these two things together, so it would be nice to be able to
> do so if possible.
I think this is an excellent opportunity for testing and debugging to
build up more intuition with how the path-walk API works. When I submit
the next version later tonight, the path-walk algorithm will be better
documented.
That said, I don't have any personal motivation to integrate the two
together, so I don't expect to be contributing that integration point
myself. I think that the results speak for themselves in the very
common environment of a Git client without bitmaps.
Thanks,
-Stolee
* Re: [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
2024-10-31 2:28 ` Derrick Stolee
@ 2024-10-31 21:07 ` Taylor Blau
0 siblings, 0 replies; 55+ messages in thread
From: Taylor Blau @ 2024-10-31 21:07 UTC (permalink / raw)
To: Derrick Stolee
Cc: Patrick Steinhardt, Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk
On Wed, Oct 30, 2024 at 10:28:22PM -0400, Derrick Stolee wrote:
> On 10/29/24 2:02 PM, Taylor Blau wrote:
> > On Mon, Oct 28, 2024 at 03:46:11PM -0400, Derrick Stolee wrote:
> >> On 10/28/24 1:25 PM, Taylor Blau wrote:
> >>> Unfortunately, there is no easy way to reuse the format of the existing
> >>> hashcache extension as-is to indicate to the reader whether they are
> >>> recording traditional name-hash values, or the new --path-walk hash
> >>> values.
> >>
> >> The --path-walk option does not mess with the name-hash. You're thinking
> >> of the --full-name-hash feature [1] that was pulled out due to a lack of
> >> interest (and better results with --path-walk).
> >>
> >> [1] https://lore.kernel.org/git/pull.1785.git.1725890210.gitgitgadget@gmail.com/
> >
> > Ah, gotcha. Thanks for clarifying.
> >
> > What is the incompatibility between the two, then? Is it just that
> > bitmaps give us the objects in pack- or pseudo-pack order, and we don't
> > have a way to permute that back into the order that --path-walk would
> > give us?
>
> The incompatibility of reading bitmaps and using the path-walk API is
> that the path-walk API does not check a bitmap to see if an object is
> already discovered. Thus, it does not use the reachability information
> from the bitmap at all and would parse commits and trees to find the
> objects that should be in the pack-file.
Sure, I think what I'm trying to understand here is whether this
"incompatibility" is a fundamental one, or just that we haven't implemented
checking bitmaps in the path-walk API yet.
Thanks,
Taylor