* [PATCH 1/6] path-walk: introduce an object walk by path
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
@ 2024-10-31 6:26 ` Derrick Stolee via GitGitGadget
2024-11-01 13:12 ` karthik nayak
[not found] ` <draft-87r07v14kl.fsf@archlinux.mail-host-address-is-not-set>
2024-10-31 6:26 ` [PATCH 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
` (7 subsequent siblings)
8 siblings, 2 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-31 6:26 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.
These batches are collected in 'struct type_and_oid_list' objects, which
store an object type and an oid_array of objects.
The data structures are documented in 'struct path_walk_context', but in
summary the most important are:
* 'paths_to_lists' is a strmap that connects a path to a
type_and_oid_list for that path. To avoid conflicts in path names,
we make sure that tree paths end in "/" (except the root path, which
is an empty string) and blob paths do not end in "/".
* 'path_stack' is a string list that is added to in an append-only
way. This stores the stack of our depth-first search on the heap
instead of using recursion.
* 'path_stack_pushed' is a strset that stores path names that were
already added to 'path_stack', to avoid repeating paths in the
stack. Mostly, this saves us from quadratic lookups from doing
unsorted checks into the string_list.
The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
push_to_stack() method. Call this instead of inserting into these
structures directly.
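As an illustration only (this is not code from the patch), the pairing of a stack with a "pushed" set can be sketched in standalone C. The names `toy_walk`, `toy_push`, `toy_pop`, `TOY_CAP` and `TOY_SLOTS` are hypothetical, and the sketch assumes a small fixed-size open-addressing table in place of Git's strset and stores caller-owned string pointers rather than duplicating them:

```c
#include <assert.h>
#include <string.h>

#define TOY_CAP 64                /* max distinct paths in this sketch */
#define TOY_SLOTS (TOY_CAP * 2)   /* hash slots; table stays at most half full */

struct toy_walk {
	const char *stack[TOY_CAP];    /* LIFO of paths still to visit */
	size_t stack_nr;
	const char *pushed[TOY_SLOTS]; /* set of every path ever pushed */
	size_t pushed_nr;
};

static size_t toy_hash(const char *s)
{
	size_t h = 5381;
	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h;
}

/*
 * Push 'path' unless it was pushed at any earlier point, even if it has
 * since been popped. Returns 1 if newly pushed, 0 if it was a repeat.
 * The string is not copied; the caller must keep it alive.
 */
static int toy_push(struct toy_walk *w, const char *path)
{
	size_t i = toy_hash(path) % TOY_SLOTS;

	/* Linear probing: expected constant time, no scan of the stack. */
	while (w->pushed[i]) {
		if (!strcmp(w->pushed[i], path))
			return 0;
		i = (i + 1) % TOY_SLOTS;
	}

	assert(w->pushed_nr < TOY_CAP);
	w->pushed[i] = path;
	w->pushed_nr++;
	w->stack[w->stack_nr++] = path;
	return 1;
}

/* Pop the most recently pushed path, or NULL if the stack is empty. */
static const char *toy_pop(struct toy_walk *w)
{
	return w->stack_nr ? w->stack[--w->stack_nr] : NULL;
}
```

A zero-initialized `struct toy_walk` is a valid empty state. Because the set is never shrunk on pop, a path that was already visited cannot be queued a second time, which is the property the single entry point is meant to protect.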
The walk_objects_by_path() method initializes these structures and
starts walking commits from the given rev_info struct. The commits are
used to find the list of root trees which populate the start of our
depth-first search.
The core of our depth-first search is in a while loop that continues
while we have not indicated an early exit and our 'path_stack' still has
entries in it. The loop body pops a path off of the stack and "visits"
the path via the walk_path() method.
The walk_path() method gets the list of OIDs from the 'paths_to_lists'
strmap and executes the callback method on that list with the given path
and type. If the OIDs correspond to tree objects, then iterate over all
trees in the list and run add_children() to add the child objects to
their own lists, adding new entries to the stack if necessary.
In testing, this depth-first search approach was the one that used the
least memory while iterating over the object lists. There is still a
chance that repositories with too-wide path patterns could cause memory
pressure issues. Limiting the stack size could be done in the future by
limiting how many objects are being considered in-progress, or by
visiting blob paths earlier than trees.
There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 45 ++++
Makefile | 1 +
path-walk.c | 260 ++++++++++++++++++++++
path-walk.h | 43 ++++
4 files changed, 349 insertions(+)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
new file mode 100644
index 00000000000..c550c77ca30
--- /dev/null
+++ b/Documentation/technical/api-path-walk.txt
@@ -0,0 +1,45 @@
+Path-Walk API
+=============
+
+The path-walk API is used to walk reachable objects, but to visit objects
+in batches based on a common path they appear in, or by type.
+
+For example, all reachable commits are visited in a group. All tags are
+visited in a group. Then, all root trees are visited. At some point, all
+blobs reachable via a path `my/dir/to/A` are visited. When there are
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
+Basics
+------
+
+To use the path-walk API, include `path-walk.h` and call
+`walk_objects_by_path()` with a customized `path_walk_info` struct. The
+struct is used to set all of the options for how the walk should proceed.
+Let's dig into the different options and their use.
+
+`path_fn` and `path_fn_data`::
+ The most important option is the `path_fn` option, which is a
+ function pointer to the callback that can execute logic on the
+ object IDs for objects grouped by type and path. This function
+ also receives a `data` value that corresponds to the
+ `path_fn_data` member, for providing custom data structures to
+ this callback function.
+
+`revs`::
+ To configure the exact details of the reachable set of objects,
+ use the `revs` member and initialize it using the revision
+ machinery in `revision.h`. Initialize `revs` using calls such as
+ `setup_revisions()` or `parse_revision_opt()`. Do not call
+ `prepare_revision_walk()`, as that will be called within
+ `walk_objects_by_path()`.
++
+It is also important that you do not specify the `--objects` flag for the
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
+
+Examples
+--------
+
+See example usages in future changes.
diff --git a/Makefile b/Makefile
index 7344a7f7257..d0d8d6888e3 100644
--- a/Makefile
+++ b/Makefile
@@ -1094,6 +1094,7 @@ LIB_OBJS += parse-options.o
LIB_OBJS += patch-delta.o
LIB_OBJS += patch-ids.o
LIB_OBJS += path.o
+LIB_OBJS += path-walk.o
LIB_OBJS += pathspec.o
LIB_OBJS += pkt-line.o
LIB_OBJS += preload-index.o
diff --git a/path-walk.c b/path-walk.c
new file mode 100644
index 00000000000..9dc56aff88c
--- /dev/null
+++ b/path-walk.c
@@ -0,0 +1,260 @@
+/*
+ * path-walk.c: implementation for path-based walks of the object graph.
+ */
+#include "git-compat-util.h"
+#include "path-walk.h"
+#include "blob.h"
+#include "commit.h"
+#include "dir.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "object.h"
+#include "oid-array.h"
+#include "revision.h"
+#include "string-list.h"
+#include "strmap.h"
+#include "trace2.h"
+#include "tree.h"
+#include "tree-walk.h"
+
+struct type_and_oid_list
+{
+ enum object_type type;
+ struct oid_array oids;
+};
+
+#define TYPE_AND_OID_LIST_INIT { \
+ .type = OBJ_NONE, \
+ .oids = OID_ARRAY_INIT \
+}
+
+struct path_walk_context {
+ /**
+ * Repeats of data in 'struct path_walk_info' for
+ * access with fewer characters.
+ */
+ struct repository *repo;
+ struct rev_info *revs;
+ struct path_walk_info *info;
+
+ /**
+ * Map a path to a 'struct type_and_oid_list'
+ * containing the objects discovered at that
+ * path.
+ */
+ struct strmap paths_to_lists;
+
+ /**
+ * Store the current list of paths in a stack, to
+ * facilitate depth-first-search without recursion.
+ *
+ * Use path_stack_pushed to indicate whether a path
+ * was previously added to path_stack.
+ */
+ struct string_list path_stack;
+ struct strset path_stack_pushed;
+};
+
+static void push_to_stack(struct path_walk_context *ctx,
+ const char *path)
+{
+ if (strset_contains(&ctx->path_stack_pushed, path))
+ return;
+
+ strset_add(&ctx->path_stack_pushed, path);
+ string_list_append(&ctx->path_stack, path);
+}
+
+static int add_children(struct path_walk_context *ctx,
+ const char *base_path,
+ struct object_id *oid)
+{
+ struct tree_desc desc;
+ struct name_entry entry;
+ struct strbuf path = STRBUF_INIT;
+ size_t base_len;
+ struct tree *tree = lookup_tree(ctx->repo, oid);
+
+ if (!tree) {
+ error(_("failed to walk children of tree %s: not found"),
+ oid_to_hex(oid));
+ return -1;
+ } else if (parse_tree_gently(tree, 1)) {
+ die("bad tree object %s", oid_to_hex(oid));
+ }
+
+ strbuf_addstr(&path, base_path);
+ base_len = path.len;
+
+ parse_tree(tree);
+ init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
+ while (tree_entry(&desc, &entry)) {
+ struct type_and_oid_list *list;
+ struct object *o;
+ /* Not actually true, but we will ignore submodules later. */
+ enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB;
+
+ /* Skip submodules. */
+ if (S_ISGITLINK(entry.mode))
+ continue;
+
+ if (type == OBJ_TREE) {
+ struct tree *child = lookup_tree(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else if (type == OBJ_BLOB) {
+ struct blob *child = lookup_blob(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else {
+ /* Wrong type? */
+ continue;
+ }
+
+ if (!o) /* report error?*/
+ continue;
+
+ strbuf_setlen(&path, base_len);
+ strbuf_add(&path, entry.path, entry.pathlen);
+
+ /*
+ * Trees will end with "/" for concatenation and distinction
+ * from blobs at the same path.
+ */
+ if (type == OBJ_TREE)
+ strbuf_addch(&path, '/');
+
+ if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = type;
+ strmap_put(&ctx->paths_to_lists, path.buf, list);
+ }
+ push_to_stack(ctx, path.buf);
+
+ /* Skip this object if already seen. */
+ if (o->flags & SEEN)
+ continue;
+ o->flags |= SEEN;
+ oid_array_append(&list->oids, &entry.oid);
+ }
+
+ free_tree_buffer(tree);
+ strbuf_release(&path);
+ return 0;
+}
+
+/*
+ * Visit a single path: run the callback on the objects collected at
+ * 'path' and, if they are trees, expand each one so that its children
+ * are added to their own lists (and their paths pushed to the stack).
+ */
+static int walk_path(struct path_walk_context *ctx,
+ const char *path)
+{
+ struct type_and_oid_list *list;
+ int ret = 0;
+
+ list = strmap_get(&ctx->paths_to_lists, path);
+
+ /* Evaluate function pointer on this data. */
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
+
+ /* Expand data for children. */
+ if (list->type == OBJ_TREE) {
+ for (size_t i = 0; i < list->oids.nr; i++) {
+ ret |= add_children(ctx,
+ path,
+ &list->oids.oid[i]);
+ }
+ }
+
+ oid_array_clear(&list->oids);
+ strmap_remove(&ctx->paths_to_lists, path, 1);
+ return ret;
+}
+
+static void clear_strmap(struct strmap *map)
+{
+ struct hashmap_iter iter;
+ struct strmap_entry *e;
+
+ hashmap_for_each_entry(&map->map, &iter, e, ent) {
+ struct type_and_oid_list *list = e->value;
+ oid_array_clear(&list->oids);
+ }
+ strmap_clear(map, 1);
+ strmap_init(map);
+}
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info)
+{
+ const char *root_path = "";
+ int ret = 0;
+ size_t commits_nr = 0, paths_nr = 0;
+ struct commit *c;
+ struct type_and_oid_list *root_tree_list;
+ struct path_walk_context ctx = {
+ .repo = info->revs->repo,
+ .revs = info->revs,
+ .info = info,
+ .path_stack = STRING_LIST_INIT_DUP,
+ .path_stack_pushed = STRSET_INIT,
+ .paths_to_lists = STRMAP_INIT
+ };
+
+ trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+
+ /* Insert a single list for the root tree into the paths. */
+ CALLOC_ARRAY(root_tree_list, 1);
+ root_tree_list->type = OBJ_TREE;
+ strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+ push_to_stack(&ctx, root_path);
+
+ if (prepare_revision_walk(info->revs))
+ die(_("failed to setup revision walk"));
+
+ while ((c = get_revision(info->revs))) {
+ struct object_id *oid = get_commit_tree_oid(c);
+ struct tree *t;
+ commits_nr++;
+
+ oid = get_commit_tree_oid(c);
+ t = lookup_tree(info->revs->repo, oid);
+
+ if (!t) {
+ warning("could not find tree %s", oid_to_hex(oid));
+ continue;
+ }
+
+ if (t->object.flags & SEEN)
+ continue;
+ t->object.flags |= SEEN;
+ oid_array_append(&root_tree_list->oids, oid);
+ }
+
+ trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
+ trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+
+ trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
+ trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
+ clear_strmap(&ctx.paths_to_lists);
+ strset_clear(&ctx.path_stack_pushed);
+ string_list_clear(&ctx.path_stack, 0);
+ return ret;
+}
diff --git a/path-walk.h b/path-walk.h
new file mode 100644
index 00000000000..c9e94a98bc8
--- /dev/null
+++ b/path-walk.h
@@ -0,0 +1,43 @@
+/*
+ * path-walk.h : Methods and structures for walking the object graph in batches
+ * by the paths that can reach those objects.
+ */
+#include "object.h" /* Required for 'enum object_type'. */
+
+struct rev_info;
+struct oid_array;
+
+/**
+ * The type of a function pointer for the method that is called on a list of
+ * objects reachable at a given path.
+ */
+typedef int (*path_fn)(const char *path,
+ struct oid_array *oids,
+ enum object_type type,
+ void *data);
+
+struct path_walk_info {
+ /**
+ * revs provides the definitions for the commit walk, including
+ * which commits are UNINTERESTING or not.
+ */
+ struct rev_info *revs;
+
+ /**
+ * The caller wishes to execute custom logic on objects reachable at a
+ * given path. Every reachable object will be visited exactly once, and
+ * the first path to see an object wins. This may not be a stable choice.
+ */
+ path_fn path_fn;
+ void *path_fn_data;
+};
+
+#define PATH_WALK_INFO_INIT { 0 }
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info);
--
gitgitgadget
* Re: [PATCH 1/6] path-walk: introduce an object walk by path
2024-10-31 6:26 ` [PATCH 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
@ 2024-11-01 13:12 ` karthik nayak
2024-11-01 13:44 ` Derrick Stolee
[not found] ` <draft-87r07v14kl.fsf@archlinux.mail-host-address-is-not-set>
1 sibling, 1 reply; 67+ messages in thread
From: karthik nayak @ 2024-11-01 13:12 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget, git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee
[-- Attachment #1: Type: text/plain, Size: 2499 bytes --]
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> In anticipation of a few planned applications, introduce the most basic form
> of a path-walk API. It currently assumes that there are no UNINTERESTING
> objects, and does not include any complicated filters. It calls a function
> pointer on groups of tree and blob objects as grouped by path. This only
> includes objects the first time they are discovered, so an object that
> appears at multiple paths will not be included in two batches.
>
> These batches are collected in 'struct type_and_oid_list' objects, which
> store an object type and an oid_array of objects.
>
> The data structures are documented in 'struct path_walk_context', but in
> summary the most important are:
>
> * 'paths_to_lists' is a strmap that connects a path to a
> type_and_oid_list for that path. To avoid conflicts in path names,
> we make sure that tree paths end in "/" (except the root path with
> is an empty string) and blob paths do not end in "/".
>
> * 'path_stack' is a string list that is added to in an append-only
> way. This stores the stack of our depth-first search on the heap
> instead of using recursion.
>
> * 'path_stack_pushed' is a strmap that stores path names that were
> already added to 'path_stack', to avoid repeating paths in the
> stack. Mostly, this saves us from quadratic lookups from doing
> unsorted checks into the string_list.
>
> The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
> push_to_stack() method. Call this instead of inserting into these
> structures directly.
>
> The walk_objects_by_path() method initializes these structures and
> starts walking commits from the given rev_info struct. The commits are
> used to find the list of root trees which populate the start of our
> depth-first search.
Isn't this more of a breadth-first search? Reading through the code, the
algorithm seems to be something like:
- For each commit in the list of commits (from rev_info)
  - Tackle each root tree, add the root path to the stack.
- For each path left in the stack
  - Call the callback provided by the client.
  - Find all its first-level children, add each to the stack.
So wouldn't this go through the tree on a level-by-level basis, making it
a BFS?
Apart from this, the patch itself looks solid. I ended up writing a
small client to play with this API, and was very pleased how quickly I
could get it running.
[snip]
* Re: [PATCH 1/6] path-walk: introduce an object walk by path
2024-11-01 13:12 ` karthik nayak
@ 2024-11-01 13:44 ` Derrick Stolee
0 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee @ 2024-11-01 13:44 UTC (permalink / raw)
To: karthik nayak, Derrick Stolee via GitGitGadget, git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy
On 11/1/24 9:12 AM, karthik nayak wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The walk_objects_by_path() method initializes these structures and
>> starts walking commits from the given rev_info struct. The commits are
>> used to find the list of root trees which populate the start of our
>> depth-first search.
>
> Isn't this more of breadth-first search? Reading through the code, the
> algorithm seems something like:
>
> - For each commit in list of commits (from rev_info)
> - Tackle each root tree, add root path to the stack.
> - For each path in stack left
> - Call the callback provided by client.
> - Find all its first level children, add each to the stack.
>
> So wouldn't this go through the tree in level by level basis? Making it
> a BFS?
While we are adding all children to the stack, we only pop off the top
of the stack, making it a DFS. (We do visit the paths in reverse-
lexicographic order, though.)
To make it a BFS, we would need to visit the paths in the order they
are added to the list. Instead, we visit them in Last-In First-Out
order.
I initially had built it as a BFS, but ran into memory issues when
running it on very large repos.
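To make the ordering concrete, here is a toy sketch (not code from this series) that walks a small hard-coded tree either Last-In First-Out or First-In First-Out. The names `toy_node`, `toy_tree` and `toy_visit` are hypothetical, and the tree shape is assumed for illustration, loosely mirroring a repository with `a`, `left/b`, `right/c` and `right/d`:

```c
#include <assert.h>
#include <string.h>

struct toy_node {
	const char *path;
	int children[3]; /* indexes into toy_tree; unused slots are -1 */
};

#define TOY_NR 7

static const struct toy_node toy_tree[TOY_NR] = {
	{ "",        {  1,  2,  3 } },
	{ "a",       { -1, -1, -1 } },
	{ "left/",   {  4, -1, -1 } },
	{ "right/",  {  5,  6, -1 } },
	{ "left/b",  { -1, -1, -1 } },
	{ "right/c", { -1, -1, -1 } },
	{ "right/d", { -1, -1, -1 } },
};

/*
 * Record the visit order into 'out' (room for TOY_NR entries) and
 * return the number of paths visited. With 'lifo' set, the newest
 * pending entry is taken next (depth-first); otherwise the oldest
 * pending entry is taken next (breadth-first).
 */
static size_t toy_visit(int lifo, const char **out)
{
	int pending[TOY_NR];
	size_t head = 0, tail = 0, nr = 0;

	pending[tail++] = 0; /* start at the root path */
	while (head < tail) {
		int cur = lifo ? pending[--tail] : pending[head++];

		out[nr++] = toy_tree[cur].path;
		for (int i = 0; i < 3 && toy_tree[cur].children[i] >= 0; i++)
			pending[tail++] = toy_tree[cur].children[i];
	}
	return nr;
}
```

With `lifo` set, the order is "", right/, right/d, right/c, left/, left/b, a: each directory is finished before its siblings are started, and siblings come out in reverse order of insertion. Without it, the order is "", a, left/, right/, left/b, right/c, right/d, which is level by level.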
Thanks,
-Stolee
[parent not found: <draft-87r07v14kl.fsf@archlinux.mail-host-address-is-not-set>]
* Re: [PATCH 1/6] path-walk: introduce an object walk by path
[not found] ` <draft-87r07v14kl.fsf@archlinux.mail-host-address-is-not-set>
@ 2024-11-01 13:42 ` karthik nayak
0 siblings, 0 replies; 67+ messages in thread
From: karthik nayak @ 2024-11-01 13:42 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget, git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee
[-- Attachment #1: Type: text/plain, Size: 2806 bytes --]
karthik nayak <karthik.188@gmail.com> writes:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> In anticipation of a few planned applications, introduce the most basic form
>> of a path-walk API. It currently assumes that there are no UNINTERESTING
>> objects, and does not include any complicated filters. It calls a function
>> pointer on groups of tree and blob objects as grouped by path. This only
>> includes objects the first time they are discovered, so an object that
>> appears at multiple paths will not be included in two batches.
>>
>> These batches are collected in 'struct type_and_oid_list' objects, which
>> store an object type and an oid_array of objects.
>>
>> The data structures are documented in 'struct path_walk_context', but in
>> summary the most important are:
>>
>> * 'paths_to_lists' is a strmap that connects a path to a
>> type_and_oid_list for that path. To avoid conflicts in path names,
>> we make sure that tree paths end in "/" (except the root path with
>> is an empty string) and blob paths do not end in "/".
>>
>> * 'path_stack' is a string list that is added to in an append-only
>> way. This stores the stack of our depth-first search on the heap
>> instead of using recursion.
>>
>> * 'path_stack_pushed' is a strmap that stores path names that were
>> already added to 'path_stack', to avoid repeating paths in the
>> stack. Mostly, this saves us from quadratic lookups from doing
>> unsorted checks into the string_list.
>>
>> The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
>> push_to_stack() method. Call this instead of inserting into these
>> structures directly.
>>
>> The walk_objects_by_path() method initializes these structures and
>> starts walking commits from the given rev_info struct. The commits are
>> used to find the list of root trees which populate the start of our
>> depth-first search.
>
> Isn't this more of breadth-first search? Reading through the code, the
> algorithm seems something like:
>
> - For each commit in list of commits (from rev_info)
> - Tackle each root tree, add root path to the stack.
> - For each path in stack left
> - Call the callback provided by client.
> - Find all its first level children, add each to the stack.
>
> So wouldn't this go through the tree in level by level basis? Making it
> a BFS?
My bad here, thinking more about it, it is DFS indeed. Although we add
all the children of a level to the stack, we pop each of them from the
stack and end up traversing down that level.
>
> Apart from this, the patch itself looks solid. I ended up writing a
> small client to play with this API, and was very pleased how quickly I
> could get it running.
>
> [snip]
* [PATCH 2/6] test-lib-functions: add test_cmp_sorted
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
2024-10-31 6:26 ` [PATCH 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
@ 2024-10-31 6:26 ` Derrick Stolee via GitGitGadget
2024-10-31 6:27 ` [PATCH 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
` (6 subsequent siblings)
8 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-31 6:26 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
This test helper reduces repeated logic in t6601-path-walk.sh, and may
be useful elsewhere, too.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
t/test-lib-functions.sh | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index fde9bf54fc3..16b70aebd60 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1267,6 +1267,16 @@ test_cmp () {
eval "$GIT_TEST_CMP" '"$@"'
}
+# test_cmp_sorted runs test_cmp on sorted versions of the two
+# input files. Uses "$1.sorted" and "$2.sorted" as temp files.
+
+test_cmp_sorted () {
+ sort <"$1" >"$1.sorted" &&
+ sort <"$2" >"$2.sorted" &&
+ test_cmp "$1.sorted" "$2.sorted" &&
+ rm "$1.sorted" "$2.sorted"
+}
+
# Check that the given config key has the expected value.
#
# test_cmp_config [-C <dir>] <expected-value>
--
gitgitgadget
* [PATCH 3/6] t6601: add helper for testing path-walk API
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
2024-10-31 6:26 ` [PATCH 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-10-31 6:26 ` [PATCH 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
@ 2024-10-31 6:27 ` Derrick Stolee via GitGitGadget
2024-11-01 13:46 ` karthik nayak
` (2 more replies)
2024-10-31 6:27 ` [PATCH 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
` (5 subsequent siblings)
8 siblings, 3 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-31 6:27 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.
It is important to mention that the behavior of the API will change soon as
we start to handle UNINTERESTING objects differently, but these tests will
demonstrate the change in behavior.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 3 +-
Makefile | 1 +
t/helper/test-path-walk.c | 86 ++++++++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t6601-path-walk.sh | 118 ++++++++++++++++++++++
6 files changed, 209 insertions(+), 1 deletion(-)
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/t6601-path-walk.sh
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index c550c77ca30..662162ec70b 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -42,4 +42,5 @@ commits.
Examples
--------
-See example usages in future changes.
+See example usages in:
+ `t/helper/test-path-walk.c`
diff --git a/Makefile b/Makefile
index d0d8d6888e3..50413d96492 100644
--- a/Makefile
+++ b/Makefile
@@ -818,6 +818,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
TEST_BUILTINS_OBJS += test-partial-clone.o
TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-path-walk.o
TEST_BUILTINS_OBJS += test-pcre2-config.o
TEST_BUILTINS_OBJS += test-pkt-line.o
TEST_BUILTINS_OBJS += test-proc-receive.o
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
new file mode 100644
index 00000000000..3c48f017fa0
--- /dev/null
+++ b/t/helper/test-path-walk.c
@@ -0,0 +1,86 @@
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "test-tool.h"
+#include "environment.h"
+#include "hex.h"
+#include "object-name.h"
+#include "object.h"
+#include "pretty.h"
+#include "revision.h"
+#include "setup.h"
+#include "parse-options.h"
+#include "path-walk.h"
+#include "oid-array.h"
+
+static const char * const path_walk_usage[] = {
+ N_("test-tool path-walk <options> -- <revision-options>"),
+ NULL
+};
+
+struct path_walk_test_data {
+ uintmax_t tree_nr;
+ uintmax_t blob_nr;
+};
+
+static int emit_block(const char *path, struct oid_array *oids,
+ enum object_type type, void *data)
+{
+ struct path_walk_test_data *tdata = data;
+ const char *typestr;
+
+ switch (type) {
+ case OBJ_TREE:
+ typestr = "TREE";
+ tdata->tree_nr += oids->nr;
+ break;
+
+ case OBJ_BLOB:
+ typestr = "BLOB";
+ tdata->blob_nr += oids->nr;
+ break;
+
+ default:
+ BUG("we do not understand this type");
+ }
+
+ for (size_t i = 0; i < oids->nr; i++)
+ printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));
+
+ return 0;
+}
+
+int cmd__path_walk(int argc, const char **argv)
+{
+ int res;
+ struct rev_info revs = REV_INFO_INIT;
+ struct path_walk_info info = PATH_WALK_INFO_INIT;
+ struct path_walk_test_data data = { 0 };
+ struct option options[] = {
+ OPT_END(),
+ };
+
+ initialize_repository(the_repository);
+ setup_git_directory();
+ revs.repo = the_repository;
+
+ argc = parse_options(argc, argv, NULL,
+ options, path_walk_usage,
+ PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
+
+ if (argc > 1)
+ setup_revisions(argc, argv, &revs, NULL);
+ else
+ usage(path_walk_usage[0]);
+
+ info.revs = &revs;
+ info.path_fn = emit_block;
+ info.path_fn_data = &data;
+
+ res = walk_objects_by_path(&info);
+
+ printf("trees:%" PRIuMAX "\n"
+ "blobs:%" PRIuMAX "\n",
+ data.tree_nr, data.blob_nr);
+
+ return res;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 1ebb69a5dc4..43676e7b93a 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -52,6 +52,7 @@ static struct test_cmd cmds[] = {
{ "parse-subcommand", cmd__parse_subcommand },
{ "partial-clone", cmd__partial_clone },
{ "path-utils", cmd__path_utils },
+ { "path-walk", cmd__path_walk },
{ "pcre2-config", cmd__pcre2_config },
{ "pkt-line", cmd__pkt_line },
{ "proc-receive", cmd__proc_receive },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..9cfc5da6e57 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -45,6 +45,7 @@ int cmd__parse_pathspec_file(int argc, const char** argv);
int cmd__parse_subcommand(int argc, const char **argv);
int cmd__partial_clone(int argc, const char **argv);
int cmd__path_utils(int argc, const char **argv);
+int cmd__path_walk(int argc, const char **argv);
int cmd__pcre2_config(int argc, const char **argv);
int cmd__pkt_line(int argc, const char **argv);
int cmd__proc_receive(int argc, const char **argv);
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
new file mode 100755
index 00000000000..1f277b88291
--- /dev/null
+++ b/t/t6601-path-walk.sh
@@ -0,0 +1,118 @@
+#!/bin/sh
+
+test_description='direct path-walk API tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repository' '
+ git checkout -b base &&
+
+ mkdir left &&
+ mkdir right &&
+ echo a >a &&
+ echo b >left/b &&
+ echo c >right/c &&
+ git add . &&
+ git commit -m "first" &&
+
+ echo d >right/d &&
+ git add right &&
+ git commit -m "second" &&
+
+ echo bb >left/b &&
+ git commit -a -m "third" &&
+
+ git checkout -b topic HEAD~1 &&
+ echo cc >right/c &&
+ git commit -a -m "topic"
+'
+
+test_expect_success 'all' '
+ test-tool path-walk -- --all >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE::$(git rev-parse base~2^{tree})
+ TREE:left/:$(git rev-parse base:left)
+ TREE:left/:$(git rev-parse base~2:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~2:right)
+ trees:9
+ BLOB:a:$(git rev-parse base~2:a)
+ BLOB:left/b:$(git rev-parse base~2:left/b)
+ BLOB:left/b:$(git rev-parse base:left/b)
+ BLOB:right/c:$(git rev-parse base~2:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ blobs:6
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic only' '
+ test-tool path-walk -- topic >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE::$(git rev-parse base~2^{tree})
+ TREE:left/:$(git rev-parse base~2:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~2:right)
+ trees:7
+ BLOB:a:$(git rev-parse base~2:a)
+ BLOB:left/b:$(git rev-parse base~2:left/b)
+ BLOB:right/c:$(git rev-parse base~2:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ blobs:5
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base' '
+ test-tool path-walk -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE:left/:$(git rev-parse topic:left)
+ TREE:right/:$(git rev-parse topic:right)
+ trees:3
+ BLOB:a:$(git rev-parse topic:a)
+ BLOB:left/b:$(git rev-parse topic:left/b)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse topic:right/d)
+ blobs:4
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, boundary' '
+ test-tool path-walk -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE:left/:$(git rev-parse base~1:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ trees:5
+ BLOB:a:$(git rev-parse base~1:a)
+ BLOB:left/b:$(git rev-parse base~1:left/b)
+ BLOB:right/c:$(git rev-parse base~1:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ blobs:5
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_done
--
gitgitgadget
* Re: [PATCH 3/6] t6601: add helper for testing path-walk API
2024-10-31 6:27 ` [PATCH 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
@ 2024-11-01 13:46 ` karthik nayak
2024-11-01 22:23 ` Jonathan Tan
2024-11-06 14:04 ` Patrick Steinhardt
2 siblings, 0 replies; 67+ messages in thread
From: karthik nayak @ 2024-11-01 13:46 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget, git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
[snip]
> diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
> new file mode 100755
> index 00000000000..1f277b88291
> --- /dev/null
> +++ b/t/t6601-path-walk.sh
> @@ -0,0 +1,118 @@
> +#!/bin/sh
> +
> +test_description='direct path-walk API tests'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup test repository' '
> + git checkout -b base &&
> +
> + mkdir left &&
> + mkdir right &&
> + echo a >a &&
> + echo b >left/b &&
> + echo c >right/c &&
> + git add . &&
> + git commit -m "first" &&
> +
> + echo d >right/d &&
> + git add right &&
> + git commit -m "second" &&
> +
> + echo bb >left/b &&
> + git commit -a -m "third" &&
> +
> + git checkout -b topic HEAD~1 &&
> + echo cc >right/c &&
> + git commit -a -m "topic"
> +'
> +
Nit: since the root-level tree is already special-cased, we only check
one level of path here. It would be nice to add another level of trees
to this.
> +test_expect_success 'all' '
> + test-tool path-walk -- --all >out &&
> +
> + cat >expect <<-EOF &&
> + TREE::$(git rev-parse topic^{tree})
> + TREE::$(git rev-parse base^{tree})
> + TREE::$(git rev-parse base~1^{tree})
> + TREE::$(git rev-parse base~2^{tree})
> + TREE:left/:$(git rev-parse base:left)
> + TREE:left/:$(git rev-parse base~2:left)
> + TREE:right/:$(git rev-parse topic:right)
> + TREE:right/:$(git rev-parse base~1:right)
> + TREE:right/:$(git rev-parse base~2:right)
> + trees:9
> + BLOB:a:$(git rev-parse base~2:a)
> + BLOB:left/b:$(git rev-parse base~2:left/b)
> + BLOB:left/b:$(git rev-parse base:left/b)
> + BLOB:right/c:$(git rev-parse base~2:right/c)
> + BLOB:right/c:$(git rev-parse topic:right/c)
> + BLOB:right/d:$(git rev-parse base~1:right/d)
> + blobs:6
> + EOF
> +
> + test_cmp_sorted expect out
> +'
Isn't the order deterministic? Why do we need to sort it?
* Re: [PATCH 3/6] t6601: add helper for testing path-walk API
2024-10-31 6:27 ` [PATCH 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-11-01 13:46 ` karthik nayak
@ 2024-11-01 22:23 ` Jonathan Tan
2024-11-04 15:56 ` Derrick Stolee
2024-11-06 14:04 ` Patrick Steinhardt
2 siblings, 1 reply; 67+ messages in thread
From: Jonathan Tan @ 2024-11-01 22:23 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: Jonathan Tan, git, gitster, johannes.schindelin, peff, ps, me,
johncai86, newren, christian.couder, kristofferhaugsbakk,
Derrick Stolee
I haven't looked thoroughly at the rest of the patches yet, but had a
comment about this test. Rearranging:
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> +test_expect_success 'all' '
> + test-tool path-walk -- --all >out &&
> +
> + cat >expect <<-EOF &&
> + TREE::$(git rev-parse topic^{tree})
> + TREE::$(git rev-parse base^{tree})
> + TREE::$(git rev-parse base~1^{tree})
> + TREE::$(git rev-parse base~2^{tree})
> + TREE:left/:$(git rev-parse base:left)
> + TREE:left/:$(git rev-parse base~2:left)
> + TREE:right/:$(git rev-parse topic:right)
> + TREE:right/:$(git rev-parse base~1:right)
> + TREE:right/:$(git rev-parse base~2:right)
> + trees:9
[snip rest of "expect"]
The way you're testing this, wouldn't the tests pass even if the OIDs
aren't emitted in path order? (E.g. if topic:right and base~1:right
were somehow grouped into two different groups, even though they have
the same path.)
I would have expected the test output to be something like:
TREE:right/ $(rp :right topic base~1 base~2)
where rp is a function that takes in a suffix and one or more prefixes -
I haven't figured out its contents yet, but
echo $(git rev-parse HEAD^^ HEAD^ HEAD | sort)
gives us a space-separated list, so it doesn't seem too difficult to
define such a function.
> +static int emit_block(const char *path, struct oid_array *oids,
> + enum object_type type, void *data)
> +{
> + struct path_walk_test_data *tdata = data;
> + const char *typestr;
> +
> + switch (type) {
> + case OBJ_TREE:
> + typestr = "TREE";
> + tdata->tree_nr += oids->nr;
> + break;
> +
> + case OBJ_BLOB:
> + typestr = "BLOB";
> + tdata->blob_nr += oids->nr;
> + break;
> +
> + default:
> + BUG("we do not understand this type");
> + }
> +
> + for (size_t i = 0; i < oids->nr; i++)
> + printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));
Then here, you would print typestr and path before the "for" loop. In
the "for" loop you would add oid_to_hex() results to a sorted string
list, have another "for" loop that prints each element preceded by a
space, then print a "\n" after both "for" loops.
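To make the suggestion above concrete, here is a minimal sketch of the
one-line-per-batch output. It is not the helper from the patch: hex strings
stand in for 'struct oid_array', the line is written to a caller-supplied
buffer instead of stdout, and the names format_batch() and cmp_hex() are
made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for an array of C strings. */
static int cmp_hex(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Hypothetical stand-in for emit_block(): write "TYPE:path" followed by
 * the sorted OIDs, each preceded by a space, and a final newline.
 * Returns 0 on success, -1 if allocation fails or 'out' is too small.
 */
static int format_batch(char *out, size_t outlen, const char *typestr,
			const char *path, const char **hex, size_t nr)
{
	const char **sorted;
	size_t used;
	int n, ret = -1;

	assert(out && nr > 0);

	/* Sort a copy so the caller's array keeps its order. */
	sorted = malloc(nr * sizeof(*sorted));
	if (!sorted)
		return -1;
	memcpy(sorted, hex, nr * sizeof(*sorted));
	qsort(sorted, nr, sizeof(*sorted), cmp_hex);

	/* Type and path are printed once, before the loop. */
	n = snprintf(out, outlen, "%s:%s", typestr, path);
	if (n < 0 || (size_t)n >= outlen)
		goto done;
	used = (size_t)n;

	for (size_t i = 0; i < nr; i++) {
		n = snprintf(out + used, outlen - used, " %s", sorted[i]);
		if (n < 0 || (size_t)n >= outlen - used)
			goto done;
		used += (size_t)n;
	}

	/* Terminate the batch line. */
	if (outlen - used < 2)
		goto done;
	out[used++] = '\n';
	out[used] = '\0';
	ret = 0;
done:
	free(sorted);
	return ret;
}
```

With this shape, two batches that share a path show up as two separate
lines, so a comparison of sorted output would notice the split.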
* Re: [PATCH 3/6] t6601: add helper for testing path-walk API
2024-11-01 22:23 ` Jonathan Tan
@ 2024-11-04 15:56 ` Derrick Stolee
2024-11-04 23:39 ` Jonathan Tan
0 siblings, 1 reply; 67+ messages in thread
From: Derrick Stolee @ 2024-11-04 15:56 UTC (permalink / raw)
To: Jonathan Tan, Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, ps, me, johncai86,
newren, christian.couder, kristofferhaugsbakk
On 11/1/24 6:23 PM, Jonathan Tan wrote:
> I haven't looked thoroughly at the rest of the patches yet, but had a
> comment about this test. Rearranging:
>
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +test_expect_success 'all' '
>> + test-tool path-walk -- --all >out &&
>> +
>> + cat >expect <<-EOF &&
>> + TREE::$(git rev-parse topic^{tree})
>> + TREE::$(git rev-parse base^{tree})
>> + TREE::$(git rev-parse base~1^{tree})
>> + TREE::$(git rev-parse base~2^{tree})
>> + TREE:left/:$(git rev-parse base:left)
>> + TREE:left/:$(git rev-parse base~2:left)
>> + TREE:right/:$(git rev-parse topic:right)
>> + TREE:right/:$(git rev-parse base~1:right)
>> + TREE:right/:$(git rev-parse base~2:right)
>> + trees:9
>
> [snip rest of "expect"]
>
> The way you're testing this, wouldn't the tests pass even if the OIDs
> aren't emitted in path order? (E.g. if topic:right and base~1:right
> were somehow grouped into two different groups, even though they have
> the same path.)
You are correct that if the path-walk API emitted multiple batches
with the same path name, then we would not detect that via the current
testing strategy.
The main reason to use the sort is to avoid adding a restriction on
the order in which objects appear within the batch.
Your recommendation to group a batch into a single line does not
strike me as a suitable approach, because long lines become hard to
read and difficult to parse diffs. (Also, the order within the batch
becomes baked in as a requirement.)
The biggest question I'd like to ask is this: do you see a risk of
a path being repeated? There are cases where it will happen, such as
indexed objects that are not reachable anywhere else.
The way I would consider modifying these tests to reflect the batching
would be to associate each batch with a number, causing the order of
the paths to become hard-coded in the test. Something like
0:COMMIT::$(git rev-parse ...)
0:COMMIT::$(git rev-parse ...)
1:TREE::$(git rev-parse ...)
1:TREE::$(git rev-parse ...)
2:TREE:right/:$(git rev-parse ...)
3:BLOB:right/a:$(...)
4:TREE:left/:$(git rev-parse ...)
5:BLOB:left/b:$(...)
This would imply some amount of order that maybe should become a
requirement of the API.
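A rough sketch of how the helper could produce that numbered output,
under some assumptions: hex strings stand in for 'struct oid_array', the
output goes to a fixed buffer rather than stdout, and the names
'struct numbered_output' and emit_numbered() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical test state: a running batch counter plus an output buffer. */
struct numbered_output {
	size_t batch_nr;
	size_t len;
	char buf[512];
};

/*
 * Stand-in for emit_block(): every OID in one batch gets the same
 * numeric prefix, and the counter advances once per call.
 * Returns 0 on success, -1 if the buffer would overflow.
 */
static int emit_numbered(struct numbered_output *out, const char *typestr,
			 const char *path, const char **hex, size_t nr)
{
	assert(out->len < sizeof(out->buf));

	for (size_t i = 0; i < nr; i++) {
		size_t room = sizeof(out->buf) - out->len;
		int n = snprintf(out->buf + out->len, room,
				 "%zu:%s:%s:%s\n", out->batch_nr,
				 typestr, path, hex[i]);
		if (n < 0 || (size_t)n >= room)
			return -1;
		out->len += (size_t)n;
	}

	/* One callback invocation corresponds to one batch. */
	out->batch_nr++;
	return 0;
}
```

If the same path were emitted in two separate batches, the lines would
carry different prefixes and the expected output would no longer match.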
Thanks,
-Stolee
* Re: [PATCH 3/6] t6601: add helper for testing path-walk API
2024-11-04 15:56 ` Derrick Stolee
@ 2024-11-04 23:39 ` Jonathan Tan
2024-11-08 14:53 ` Derrick Stolee
0 siblings, 1 reply; 67+ messages in thread
From: Jonathan Tan @ 2024-11-04 23:39 UTC (permalink / raw)
To: Derrick Stolee
Cc: Jonathan Tan, Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk
Derrick Stolee <stolee@gmail.com> writes:
> You are correct that if the path-walk API emitted multiple batches
> with the same path name, then we would not detect that via the current
> testing strategy.
>
> The main reason to use the sort is to avoid adding a restriction on
> the order in which objects appear within the batch.
>
> Your recommendation to group a batch into a single line does not
> strike me as a suitable approach, because long lines become hard to
> read and difficult to parse diffs. (Also, the order within the batch
> becomes baked in as a requirement.)
The hashes in a line can be abbreviated if line length is a concern.
Also, note that I am suggesting sorting the OIDs within a line (that is,
a batch), and also sorting the lines (batches) as a whole.
> The biggest question I'd like to ask is this: do you see a risk of
> a path being repeated? There are cases where it will happen, such as
> indexed objects that are not reachable anywhere else.
I was thinking that the whole point of this feature is that we group
objects by path, so it seems desirable to test that paths are not
repeated. (Or repeated as little as possible, if it is not possible
to avoid repetition e.g. in the case you describe.)
> The way I would consider modifying these tests to reflect the batching
> would be to associate each batch with a number, causing the order of
> the paths to become hard-coded in the test. Something like
>
> 0:COMMIT::$(git rev-parse ...)
> 0:COMMIT::$(git rev-parse ...)
> 1:TREE::$(git rev-parse ...)
> 1:TREE::$(git rev-parse ...)
> 2:TREE:right/:$(git rev-parse ...)
> 3:BLOB:right/a:$(...)
> 4:TREE:left/:$(git rev-parse ...)
> 5:BLOB:left/b:$(...)
>
> This would imply some amount of order that maybe should become a
> requirement of the API.
>
> Thanks,
> -Stolee
If we're willing to declare an order in which we will return paths to
the user, that would work too. (I'm not sure that we need to declare an
order, though.)
* Re: [PATCH 3/6] t6601: add helper for testing path-walk API
2024-11-04 23:39 ` Jonathan Tan
@ 2024-11-08 14:53 ` Derrick Stolee
0 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee @ 2024-11-08 14:53 UTC (permalink / raw)
To: Jonathan Tan
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk
On 11/4/24 6:39 PM, Jonathan Tan wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
>> The biggest question I'd like to ask is this: do you see a risk of
>> a path being repeated? There are cases where it will happen, such as
>> indexed objects that are not reachable anywhere else.
>
> I was thinking that the whole point of this feature is that we group
> objects by path, so it seems desirable to test that paths are not
> repeated. (Or repeated as little as possible, if it is not possible
> to avoid repetition e.g. in the case you describe.)
In addition to determining the order of the batches, it can be
helpful to demonstrate that we don't call the path_fn with an
empty batch! I discovered this while making the appropriate
changes today and putting the fixes in the right places.
Thanks,
-Stolee
* Re: [PATCH 3/6] t6601: add helper for testing path-walk API
2024-10-31 6:27 ` [PATCH 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
2024-11-01 13:46 ` karthik nayak
2024-11-01 22:23 ` Jonathan Tan
@ 2024-11-06 14:04 ` Patrick Steinhardt
2024-11-08 14:58 ` Derrick Stolee
2 siblings, 1 reply; 67+ messages in thread
From: Patrick Steinhardt @ 2024-11-06 14:04 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee
On Thu, Oct 31, 2024 at 06:27:00AM +0000, Derrick Stolee via GitGitGadget wrote:
[snip]
> +int cmd__path_walk(int argc, const char **argv)
> +{
> + int res;
> + struct rev_info revs = REV_INFO_INIT;
> + struct path_walk_info info = PATH_WALK_INFO_INIT;
> + struct path_walk_test_data data = { 0 };
> + struct option options[] = {
> + OPT_END(),
> + };
> +
> + initialize_repository(the_repository);
> + setup_git_directory();
> + revs.repo = the_repository;
> +
> + argc = parse_options(argc, argv, NULL,
> + options, path_walk_usage,
> + PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
> +
> + if (argc > 1)
> + setup_revisions(argc, argv, &revs, NULL);
> + else
> + usage(path_walk_usage[0]);
> +
> + info.revs = &revs;
> + info.path_fn = emit_block;
> + info.path_fn_data = &data;
> +
> + res = walk_objects_by_path(&info);
> +
> + printf("trees:%" PRIuMAX "\n"
> + "blobs:%" PRIuMAX "\n",
> + data.tree_nr, data.blob_nr);
> +
> + return res;
> +}
This function is leaking memory. I'd propose adding the patch below on top
to plug the leaks, which makes t6601 pass with the leak sanitizer enabled.
Patrick
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 06b103d876..fa3bfe46b5 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -85,7 +85,6 @@ int cmd__path_walk(int argc, const char **argv)
OPT_END(),
};
- initialize_repository(the_repository);
setup_git_directory();
revs.repo = the_repository;
@@ -110,5 +109,6 @@ int cmd__path_walk(int argc, const char **argv)
"tags:%" PRIuMAX "\n",
data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
+ release_revisions(&revs);
return res;
}
* Re: [PATCH 3/6] t6601: add helper for testing path-walk API
2024-11-06 14:04 ` Patrick Steinhardt
@ 2024-11-08 14:58 ` Derrick Stolee
0 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee @ 2024-11-08 14:58 UTC (permalink / raw)
To: Patrick Steinhardt, Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy
On 11/6/24 9:04 AM, Patrick Steinhardt wrote:
> On Thu, Oct 31, 2024 at 06:27:00AM +0000, Derrick Stolee via GitGitGadget wrote:
> [snip]
> This function is leaking memory. I'd propose adding the patch below on top
> to plug the leaks, which makes t6601 pass with the leak sanitizer enabled.
Thanks! Applied for the next version.
-Stolee
* [PATCH 4/6] path-walk: allow consumer to specify object types
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
` (2 preceding siblings ...)
2024-10-31 6:27 ` [PATCH 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
@ 2024-10-31 6:27 ` Derrick Stolee via GitGitGadget
2024-10-31 6:27 ` [PATCH 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
` (4 subsequent siblings)
8 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-31 6:27 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee, Derrick Stolee
From: Derrick Stolee <derrickstolee@github.com>
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.
This adds the ability to ask for the commits in a list, as well. We
re-use the empty string for this set of objects because these are passed
directly to the callback function instead of being part of the
'path_stack'.
Future changes will add the ability to visit annotated tags.
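As a self-contained illustration of the filtering described above, the
sketch below models the three flags and counts how many batches would reach
the callback. The 'sketch_' names are stand-ins invented for this example;
they are not the types from path-walk.h. Note that in the patch itself,
trees are still walked when only blobs are requested, and only the callback
is skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for 'enum object_type' and 'struct path_walk_info'. */
enum sketch_type { SKETCH_COMMIT, SKETCH_TREE, SKETCH_BLOB };

struct sketch_info {
	int commits;
	int trees;
	int blobs;
};

/* Mirrors the idea of PATH_WALK_INFO_INIT: every type is on by default. */
#define SKETCH_INFO_INIT { .commits = 1, .trees = 1, .blobs = 1 }

/* Nonzero if the callback should run for a batch of this type. */
static int wants_type(const struct sketch_info *info, enum sketch_type type)
{
	switch (type) {
	case SKETCH_COMMIT:
		return info->commits;
	case SKETCH_TREE:
		return info->trees;
	case SKETCH_BLOB:
		return info->blobs;
	}
	assert(!"unknown object type");
	return 0;
}

/* Count how many of the given batches would reach the callback. */
static size_t count_callbacks(const struct sketch_info *info,
			      const enum sketch_type *batches, size_t nr)
{
	size_t calls = 0;

	for (size_t i = 0; i < nr; i++)
		if (wants_type(info, batches[i]))
			calls++;
	return calls;
}
```

A consumer that only needs blobs would start from the default initializer
and clear 'commits' and 'trees', which is what the '--no-commits' and
'--no-trees' options of the test helper do.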
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 9 ++++
path-walk.c | 33 ++++++++++--
path-walk.h | 14 ++++-
t/helper/test-path-walk.c | 17 +++++-
t/t6601-path-walk.sh | 63 +++++++++++++++++++++++
5 files changed, 129 insertions(+), 7 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 662162ec70b..dce553b6114 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,6 +39,15 @@ It is also important that you do not specify the `--objects` flag for the
the objects will be walked in a separate way based on those starting
commits.
+`commits`, `blobs`, `trees`::
+ By default, these members are enabled and signal that the path-walk
+ API should call the `path_fn` on objects of these types. Specialized
+ applications could disable some options to make it simpler to walk
+ the objects or to have fewer calls to `path_fn`.
++
+While it is possible to walk only commits in this way, consumers would be
+better off using the revision walk API instead.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index 9dc56aff88c..14ad322bdd2 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -98,6 +98,10 @@ static int add_children(struct path_walk_context *ctx,
if (S_ISGITLINK(entry.mode))
continue;
+ /* If the caller doesn't want blobs, then don't bother. */
+ if (!ctx->info->blobs && type == OBJ_BLOB)
+ continue;
+
if (type == OBJ_TREE) {
struct tree *child = lookup_tree(ctx->repo, &entry.oid);
o = child ? &child->object : NULL;
@@ -154,9 +158,11 @@ static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
- /* Evaluate function pointer on this data. */
- ret = ctx->info->path_fn(path, &list->oids, list->type,
- ctx->info->path_fn_data);
+ /* Evaluate function pointer on this data, if requested. */
+ if ((list->type == OBJ_TREE && ctx->info->trees) ||
+ (list->type == OBJ_BLOB && ctx->info->blobs))
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
/* Expand data for children. */
if (list->type == OBJ_TREE) {
@@ -198,6 +204,7 @@ int walk_objects_by_path(struct path_walk_info *info)
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
struct type_and_oid_list *root_tree_list;
+ struct type_and_oid_list *commit_list;
struct path_walk_context ctx = {
.repo = info->revs->repo,
.revs = info->revs,
@@ -209,6 +216,9 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+ CALLOC_ARRAY(commit_list, 1);
+ commit_list->type = OBJ_COMMIT;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
@@ -219,10 +229,18 @@ int walk_objects_by_path(struct path_walk_info *info)
die(_("failed to setup revision walk"));
while ((c = get_revision(info->revs))) {
- struct object_id *oid = get_commit_tree_oid(c);
+ struct object_id *oid;
struct tree *t;
commits_nr++;
+ if (info->commits)
+ oid_array_append(&commit_list->oids,
+ &c->object.oid);
+
+ /* If we only care about commits, then skip trees. */
+ if (!info->trees && !info->blobs)
+ continue;
+
oid = get_commit_tree_oid(c);
t = lookup_tree(info->revs->repo, oid);
@@ -240,6 +258,13 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+ /* Track all commits. */
+ if (info->commits)
+ ret = info->path_fn("", &commit_list->oids, OBJ_COMMIT,
+ info->path_fn_data);
+ oid_array_clear(&commit_list->oids);
+ free(commit_list);
+
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
while (!ret && ctx.path_stack.nr) {
char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
diff --git a/path-walk.h b/path-walk.h
index c9e94a98bc8..2d2afc29b47 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -30,9 +30,21 @@ struct path_walk_info {
*/
path_fn path_fn;
void *path_fn_data;
+
+ /**
+ * Initialize which object types the path_fn should be called on. This
+ * could also limit the walk to skip blobs if not set.
+ */
+ int commits;
+ int trees;
+ int blobs;
};
-#define PATH_WALK_INFO_INIT { 0 }
+#define PATH_WALK_INFO_INIT { \
+ .blobs = 1, \
+ .trees = 1, \
+ .commits = 1, \
+}
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 3c48f017fa0..37c5e3e31e8 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -18,6 +18,7 @@ static const char * const path_walk_usage[] = {
};
struct path_walk_test_data {
+ uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
};
@@ -29,6 +30,11 @@ static int emit_block(const char *path, struct oid_array *oids,
const char *typestr;
switch (type) {
+ case OBJ_COMMIT:
+ typestr = "COMMIT";
+ tdata->commit_nr += oids->nr;
+ break;
+
case OBJ_TREE:
typestr = "TREE";
tdata->tree_nr += oids->nr;
@@ -56,6 +62,12 @@ int cmd__path_walk(int argc, const char **argv)
struct path_walk_info info = PATH_WALK_INFO_INIT;
struct path_walk_test_data data = { 0 };
struct option options[] = {
+ OPT_BOOL(0, "blobs", &info.blobs,
+ N_("toggle inclusion of blob objects")),
+ OPT_BOOL(0, "commits", &info.commits,
+ N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "trees", &info.trees,
+ N_("toggle inclusion of tree objects")),
OPT_END(),
};
@@ -78,9 +90,10 @@ int cmd__path_walk(int argc, const char **argv)
res = walk_objects_by_path(&info);
- printf("trees:%" PRIuMAX "\n"
+ printf("commits:%" PRIuMAX "\n"
+ "trees:%" PRIuMAX "\n"
"blobs:%" PRIuMAX "\n",
- data.tree_nr, data.blob_nr);
+ data.commit_nr, data.tree_nr, data.blob_nr);
return res;
}
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 1f277b88291..4b16a0a3c80 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -31,6 +31,11 @@ test_expect_success 'all' '
test-tool path-walk -- --all >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base)
+ COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~2)
+ commits:4
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base^{tree})
TREE::$(git rev-parse base~1^{tree})
@@ -57,6 +62,10 @@ test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~2)
+ commits:3
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE::$(git rev-parse base~2^{tree})
@@ -80,6 +89,8 @@ test_expect_success 'topic, not base' '
test-tool path-walk -- topic --not base >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ commits:1
TREE::$(git rev-parse topic^{tree})
TREE:left/:$(git rev-parse topic:left)
TREE:right/:$(git rev-parse topic:right)
@@ -94,10 +105,62 @@ test_expect_success 'topic, not base' '
test_cmp_sorted expect out
'
+test_expect_success 'topic, not base, only blobs' '
+ test-tool path-walk --no-trees --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ trees:0
+ BLOB:a:$(git rev-parse topic:a)
+ BLOB:left/b:$(git rev-parse topic:left/b)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse topic:right/d)
+ blobs:4
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+# No, this doesn't make a lot of sense for the path-walk API,
+# but it is possible to do.
+test_expect_success 'topic, not base, only commits' '
+ test-tool path-walk --no-blobs --no-trees \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ commits:1
+ trees:0
+ blobs:0
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only trees' '
+ test-tool path-walk --no-blobs --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ TREE::$(git rev-parse topic^{tree})
+ TREE:left/:$(git rev-parse topic:left)
+ TREE:right/:$(git rev-parse topic:right)
+ trees:3
+ blobs:0
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_expect_success 'topic, not base, boundary' '
test-tool path-walk -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base~1)
+ commits:2
TREE::$(git rev-parse topic^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE:left/:$(git rev-parse base~1:left)
--
gitgitgadget
* [PATCH 5/6] path-walk: visit tags and cached objects
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
` (3 preceding siblings ...)
2024-10-31 6:27 ` [PATCH 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
@ 2024-10-31 6:27 ` Derrick Stolee via GitGitGadget
2024-11-01 14:25 ` karthik nayak
2024-10-31 6:27 ` [PATCH 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
` (3 subsequent siblings)
8 siblings, 1 reply; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-31 6:27 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The rev_info that is specified for a path-walk traversal may specify
visiting tag refs (both lightweight and annotated) and also may specify
indexed objects (blobs and trees). Update the path-walk API to walk
these objects as well.
When walking tags, we need to peel the annotated objects until reaching
a non-tag object. If we reach a commit, then we can add it to the
pending objects to make sure we visit in the commit walk portion. If we
reach a tree, then we will assume that it is a root tree. If we reach a
blob, then we have no good path name and so add it to a new list of
"tagged blobs".
When the rev_info includes the "--indexed-objects" flag, then the
pending set includes blobs and trees found in the cache entries and
cache-tree. The cache entries are usually blobs, though they could be
trees in the case of a sparse index. The cache-tree stores
previously-hashed tree objects but these are cleared out when staging
objects below those paths. We add tests that demonstrate this.
The indexed objects come with a non-NULL 'path' value in the pending
item. This allows us to prepopulate the 'paths_to_lists' strmap with
lists for these paths.
The tricky thing about this walk is that we will want to combine the
indexed objects walk with the commit walk, especially in the future case
of walking objects during a command like 'git repack'.
Whenever possible, we want the objects from the index to be grouped with
similar objects in history. We don't want to miss any paths that appear
only in the index and not in the commit history.
Thus, we need to be careful to let the path stack be populated initially
with only the root tree path (and possibly tags and tagged blobs) and go
through the normal depth-first search. Afterwards, if there are other
paths that are remaining in the paths_to_lists strmap, we should then
iterate through the stack and visit those objects recursively.
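The tag-peeling step described earlier in this message can be sketched in
isolation. This is a simplified model, not the code from the patch:
'struct sketch_object' stands in for 'struct object', a plain 'seen' field
replaces the SEEN flag, and the function and enum names are hypothetical.
It only covers pending objects that arrive without a path.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_type { SKETCH_COMMIT, SKETCH_TREE, SKETCH_BLOB, SKETCH_TAG };

/* Where a peeled, path-less object ends up. */
enum sketch_dest { DEST_COMMIT_WALK, DEST_ROOT_TREES, DEST_TAGGED_BLOBS };

struct sketch_object {
	enum sketch_type type;
	int seen;
	struct sketch_object *tagged; /* only meaningful for tags */
};

/*
 * Follow a chain of annotated tags until a non-tag object is reached.
 * Each tag visited for the first time is counted in '*tags_nr'.
 * Returns NULL if the chain runs into a tag that was already seen.
 */
static struct sketch_object *peel_tags(struct sketch_object *obj,
				       size_t *tags_nr)
{
	while (obj->type == SKETCH_TAG) {
		if (obj->seen)
			return NULL;
		obj->seen = 1;
		(*tags_nr)++;
		assert(obj->tagged);
		obj = obj->tagged;
	}
	return obj;
}

/* Map a peeled object to the list that would receive it. */
static enum sketch_dest peeled_destination(const struct sketch_object *obj)
{
	switch (obj->type) {
	case SKETCH_COMMIT:
		return DEST_COMMIT_WALK;   /* picked up by the commit walk */
	case SKETCH_TREE:
		return DEST_ROOT_TREES;    /* assumed to be a root tree */
	case SKETCH_BLOB:
		return DEST_TAGGED_BLOBS;  /* no path name available */
	case SKETCH_TAG:
		break;
	}
	assert(!"tags must be peeled first");
	return DEST_COMMIT_WALK;
}
```

Marking each tag as seen keeps a tag that is reachable from several refs
from being reported more than once.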
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 2 +-
path-walk.c | 175 +++++++++++++++++++++-
path-walk.h | 2 +
t/helper/test-path-walk.c | 13 +-
t/t6601-path-walk.sh | 154 ++++++++++++++++++-
5 files changed, 336 insertions(+), 10 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index dce553b6114..6022c381b7c 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,7 +39,7 @@ It is also important that you do not specify the `--objects` flag for the
the objects will be walked in a separate way based on those starting
commits.
-`commits`, `blobs`, `trees`::
+`commits`, `blobs`, `trees`, `tags`::
By default, these members are enabled and signal that the path-walk
API should call the `path_fn` on objects of these types. Specialized
applications could disable some options to make it simpler to walk
diff --git a/path-walk.c b/path-walk.c
index 14ad322bdd2..eca0e5f3d5b 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -13,10 +13,13 @@
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
+#include "tag.h"
#include "trace2.h"
#include "tree.h"
#include "tree-walk.h"
+static const char *root_path = "";
+
struct type_and_oid_list
{
enum object_type type;
@@ -158,9 +161,13 @@ static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
+ if (!list)
+ BUG("provided path '%s' that had no associated list", path);
+
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
- (list->type == OBJ_BLOB && ctx->info->blobs))
+ (list->type == OBJ_BLOB && ctx->info->blobs) ||
+ (list->type == OBJ_TAG && ctx->info->tags))
ret = ctx->info->path_fn(path, &list->oids, list->type,
ctx->info->path_fn_data);
@@ -191,6 +198,134 @@ static void clear_strmap(struct strmap *map)
strmap_init(map);
}
+static void setup_pending_objects(struct path_walk_info *info,
+ struct path_walk_context *ctx)
+{
+ struct type_and_oid_list *tags = NULL;
+ struct type_and_oid_list *tagged_blobs = NULL;
+ struct type_and_oid_list *root_tree_list = NULL;
+
+ if (info->tags)
+ CALLOC_ARRAY(tags, 1);
+ if (info->blobs)
+ CALLOC_ARRAY(tagged_blobs, 1);
+ if (info->trees)
+ root_tree_list = strmap_get(&ctx->paths_to_lists, root_path);
+
+ /*
+ * Pending objects include:
+ * * Commits at branch tips.
+ * * Annotated tags at tag tips.
+ * * Any kind of object at lightweight tag tips.
+ * * Trees and blobs in the index (with an associated path).
+ */
+ for (size_t i = 0; i < info->revs->pending.nr; i++) {
+ struct object_array_entry *pending = info->revs->pending.objects + i;
+ struct object *obj = pending->item;
+
+ /* Commits will be picked up by revision walk. */
+ if (obj->type == OBJ_COMMIT)
+ continue;
+
+ /* Navigate annotated tag object chains. */
+ while (obj->type == OBJ_TAG) {
+ struct tag *tag = lookup_tag(info->revs->repo, &obj->oid);
+ if (!tag)
+ break;
+ if (tag->object.flags & SEEN)
+ break;
+ tag->object.flags |= SEEN;
+
+ if (tags)
+ oid_array_append(&tags->oids, &obj->oid);
+ obj = tag->tagged;
+ }
+
+ if (obj->type == OBJ_TAG)
+ continue;
+
+ /* We are now at a non-tag object. */
+ if (obj->flags & SEEN)
+ continue;
+ obj->flags |= SEEN;
+
+ switch (obj->type) {
+ case OBJ_TREE:
+ if (!info->trees)
+ continue;
+ if (pending->path) {
+ struct type_and_oid_list *list;
+ char *path = *pending->path ? xstrfmt("%s/", pending->path)
+ : xstrdup("");
+ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = OBJ_TREE;
+ strmap_put(&ctx->paths_to_lists, path, list);
+ }
+ oid_array_append(&list->oids, &obj->oid);
+ free(path);
+ } else {
+ /* assume a root tree, such as a lightweight tag. */
+ oid_array_append(&root_tree_list->oids, &obj->oid);
+ }
+ break;
+
+ case OBJ_BLOB:
+ if (!info->blobs)
+ continue;
+ if (pending->path) {
+ struct type_and_oid_list *list;
+ char *path = pending->path;
+ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = OBJ_BLOB;
+ strmap_put(&ctx->paths_to_lists, path, list);
+ }
+ oid_array_append(&list->oids, &obj->oid);
+ } else {
+ /* assume a root tree, such as a lightweight tag. */
+ oid_array_append(&tagged_blobs->oids, &obj->oid);
+ }
+ break;
+
+ case OBJ_COMMIT:
+ /* Make sure it is in the object walk */
+ if (obj != pending->item)
+ add_pending_object(info->revs, obj, "");
+ break;
+
+ default:
+ BUG("should not see any other type here");
+ }
+ }
+
+ /*
+ * Add tag objects and tagged blobs if they exist.
+ */
+ if (tagged_blobs) {
+ if (tagged_blobs->oids.nr) {
+ const char *tagged_blob_path = "/tagged-blobs";
+ tagged_blobs->type = OBJ_BLOB;
+ push_to_stack(ctx, tagged_blob_path);
+ strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
+ } else {
+ oid_array_clear(&tagged_blobs->oids);
+ free(tagged_blobs);
+ }
+ }
+ if (tags) {
+ if (tags->oids.nr) {
+ const char *tag_path = "/tags";
+ tags->type = OBJ_TAG;
+ push_to_stack(ctx, tag_path);
+ strmap_put(&ctx->paths_to_lists, tag_path, tags);
+ } else {
+ oid_array_clear(&tags->oids);
+ free(tags);
+ }
+ }
+}
+
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
* call 'info->path_fn' on each discovered path.
@@ -199,7 +334,6 @@ static void clear_strmap(struct strmap *map)
*/
int walk_objects_by_path(struct path_walk_info *info)
{
- const char *root_path = "";
int ret = 0;
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
@@ -219,15 +353,31 @@ int walk_objects_by_path(struct path_walk_info *info)
CALLOC_ARRAY(commit_list, 1);
commit_list->type = OBJ_COMMIT;
+ if (info->tags)
+ info->revs->tag_objects = 1;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
push_to_stack(&ctx, root_path);
+ /*
+ * Set these values before preparing the walk to catch
+ * lightweight tags pointing to non-commits and indexed objects.
+ */
+ info->revs->blob_objects = info->blobs;
+ info->revs->tree_objects = info->trees;
+
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ info->revs->blob_objects = info->revs->tree_objects = 0;
+
+ trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
+ setup_pending_objects(info, &ctx);
+ trace2_region_leave("path-walk", "pending-walk", info->revs->repo);
+
while ((c = get_revision(info->revs))) {
struct object_id *oid;
struct tree *t;
@@ -275,6 +425,27 @@ int walk_objects_by_path(struct path_walk_info *info)
free(path);
}
+
+ /* Are there paths remaining? Likely they are from indexed objects. */
+ if (!strmap_empty(&ctx.paths_to_lists)) {
+ struct hashmap_iter iter;
+ struct strmap_entry *entry;
+
+ strmap_for_each_entry(&ctx.paths_to_lists, &iter, entry) {
+ push_to_stack(&ctx, entry->key);
+ }
+
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ }
+
trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
trace2_region_leave("path-walk", "path-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index 2d2afc29b47..ca839f873e4 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -38,12 +38,14 @@ struct path_walk_info {
int commits;
int trees;
int blobs;
+ int tags;
};
#define PATH_WALK_INFO_INIT { \
.blobs = 1, \
.trees = 1, \
.commits = 1, \
+ .tags = 1, \
}
/**
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 37c5e3e31e8..c6c60d68749 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -21,6 +21,7 @@ struct path_walk_test_data {
uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
+ uintmax_t tag_nr;
};
static int emit_block(const char *path, struct oid_array *oids,
@@ -45,6 +46,11 @@ static int emit_block(const char *path, struct oid_array *oids,
tdata->blob_nr += oids->nr;
break;
+ case OBJ_TAG:
+ typestr = "TAG";
+ tdata->tag_nr += oids->nr;
+ break;
+
default:
BUG("we do not understand this type");
}
@@ -66,6 +72,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of blob objects")),
OPT_BOOL(0, "commits", &info.commits,
N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "tags", &info.tags,
+ N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
OPT_END(),
@@ -92,8 +100,9 @@ int cmd__path_walk(int argc, const char **argv)
printf("commits:%" PRIuMAX "\n"
"trees:%" PRIuMAX "\n"
- "blobs:%" PRIuMAX "\n",
- data.commit_nr, data.tree_nr, data.blob_nr);
+ "blobs:%" PRIuMAX "\n"
+ "tags:%" PRIuMAX "\n",
+ data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
return res;
}
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 4b16a0a3c80..5ed6c79fbd1 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -7,24 +7,55 @@ test_description='direct path-walk API tests'
test_expect_success 'setup test repository' '
git checkout -b base &&
+ # Make some objects that will only be reachable
+ # via non-commit tags.
+ mkdir child &&
+ echo file >child/file &&
+ git add child &&
+ git commit -m "will abandon" &&
+ git tag -a -m "tree" tree-tag HEAD^{tree} &&
+ echo file2 >file2 &&
+ git add file2 &&
+ git commit --amend -m "will abandon" &&
+ git tag tree-tag2 HEAD^{tree} &&
+
+ echo blob >file &&
+ blob_oid=$(git hash-object -t blob -w --stdin <file) &&
+ git tag -a -m "blob" blob-tag "$blob_oid" &&
+ echo blob2 >file2 &&
+ blob2_oid=$(git hash-object -t blob -w --stdin <file2) &&
+ git tag blob-tag2 "$blob2_oid" &&
+
+ rm -fr child file file2 &&
+
mkdir left &&
mkdir right &&
echo a >a &&
echo b >left/b &&
echo c >right/c &&
git add . &&
- git commit -m "first" &&
+ git commit --amend -m "first" &&
+ git tag -m "first" first HEAD &&
echo d >right/d &&
git add right &&
git commit -m "second" &&
+ git tag -a -m "second (under)" second.1 HEAD &&
+ git tag -a -m "second (top)" second.2 second.1 &&
+ # Set up file/dir collision in history.
+ rm a &&
+ mkdir a &&
+ echo a >a/a &&
echo bb >left/b &&
- git commit -a -m "third" &&
+ git add a left &&
+ git commit -m "third" &&
+ git tag -a -m "third" third &&
git checkout -b topic HEAD~1 &&
echo cc >right/c &&
- git commit -a -m "topic"
+ git commit -a -m "topic" &&
+ git tag -a -m "fourth" fourth
'
test_expect_success 'all' '
@@ -40,19 +71,104 @@ test_expect_success 'all' '
TREE::$(git rev-parse base^{tree})
TREE::$(git rev-parse base~1^{tree})
TREE::$(git rev-parse base~2^{tree})
+ TREE::$(git rev-parse refs/tags/tree-tag^{})
+ TREE::$(git rev-parse refs/tags/tree-tag2^{})
+ TREE:a/:$(git rev-parse base:a)
TREE:left/:$(git rev-parse base:left)
TREE:left/:$(git rev-parse base~2:left)
TREE:right/:$(git rev-parse topic:right)
TREE:right/:$(git rev-parse base~1:right)
TREE:right/:$(git rev-parse base~2:right)
- trees:9
+ TREE:child/:$(git rev-parse refs/tags/tree-tag^{}:child)
+ trees:13
BLOB:a:$(git rev-parse base~2:a)
+ BLOB:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
BLOB:left/b:$(git rev-parse base~2:left/b)
BLOB:left/b:$(git rev-parse base:left/b)
BLOB:right/c:$(git rev-parse base~2:right/c)
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
- blobs:6
+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ BLOB:child/file:$(git rev-parse refs/tags/tree-tag^{}:child/file)
+ blobs:10
+ TAG:/tags:$(git rev-parse refs/tags/first)
+ TAG:/tags:$(git rev-parse refs/tags/second.1)
+ TAG:/tags:$(git rev-parse refs/tags/second.2)
+ TAG:/tags:$(git rev-parse refs/tags/third)
+ TAG:/tags:$(git rev-parse refs/tags/fourth)
+ TAG:/tags:$(git rev-parse refs/tags/tree-tag)
+ TAG:/tags:$(git rev-parse refs/tags/blob-tag)
+ tags:7
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'indexed objects' '
+ test_when_finished git reset --hard &&
+
+ # stage change into index, adding a blob but
+ # also invalidating the cache-tree for the root
+ # and the "left" directory.
+ echo bogus >left/c &&
+ git add left &&
+
+ test-tool path-walk -- --indexed-objects >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ TREE:right/:$(git rev-parse topic:right)
+ trees:1
+ BLOB:a:$(git rev-parse HEAD:a)
+ BLOB:left/b:$(git rev-parse HEAD:left/b)
+ BLOB:left/c:$(git rev-parse :left/c)
+ BLOB:right/c:$(git rev-parse HEAD:right/c)
+ BLOB:right/d:$(git rev-parse HEAD:right/d)
+ blobs:5
+ tags:0
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'branches and indexed objects mix well' '
+ test_when_finished git reset --hard &&
+
+ # stage change into index, adding a blob but
+ # also invalidating the cache-tree for the root
+ # and the "right" directory.
+ echo fake >right/d &&
+ git add right &&
+
+ test-tool path-walk -- --indexed-objects --branches >out &&
+
+ cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base)
+ COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~2)
+ commits:4
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base^{tree})
+ TREE::$(git rev-parse base~1^{tree})
+ TREE::$(git rev-parse base~2^{tree})
+ TREE:a/:$(git rev-parse base:a)
+ TREE:left/:$(git rev-parse base:left)
+ TREE:left/:$(git rev-parse base~2:left)
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~2:right)
+ trees:10
+ BLOB:a:$(git rev-parse base~2:a)
+ BLOB:left/b:$(git rev-parse base:left/b)
+ BLOB:left/b:$(git rev-parse base~2:left/b)
+ BLOB:right/c:$(git rev-parse base~2:right/c)
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d)
+ BLOB:right/d:$(git rev-parse :right/d)
+ blobs:7
+ tags:0
EOF
test_cmp_sorted expect out
@@ -80,6 +196,7 @@ test_expect_success 'topic only' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
blobs:5
+ tags:0
EOF
test_cmp_sorted expect out
@@ -100,6 +217,7 @@ test_expect_success 'topic, not base' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse topic:right/d)
blobs:4
+ tags:0
EOF
test_cmp_sorted expect out
@@ -117,6 +235,7 @@ test_expect_success 'topic, not base, only blobs' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse topic:right/d)
blobs:4
+ tags:0
EOF
test_cmp_sorted expect out
@@ -133,6 +252,7 @@ test_expect_success 'topic, not base, only commits' '
commits:1
trees:0
blobs:0
+ tags:0
EOF
test_cmp_sorted expect out
@@ -149,6 +269,7 @@ test_expect_success 'topic, not base, only trees' '
TREE:right/:$(git rev-parse topic:right)
trees:3
blobs:0
+ tags:0
EOF
test_cmp_sorted expect out
@@ -173,9 +294,32 @@ test_expect_success 'topic, not base, boundary' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
blobs:5
+ tags:0
EOF
test_cmp_sorted expect out
'
+test_expect_success 'trees are reported exactly once' '
+ test_when_finished "rm -rf unique-trees" &&
+ test_create_repo unique-trees &&
+ (
+ cd unique-trees &&
+ mkdir initial &&
+ test_commit initial/file &&
+
+ git switch -c move-to-top &&
+ git mv initial/file.t ./ &&
+ test_tick &&
+ git commit -m moved &&
+
+ git update-ref refs/heads/other HEAD
+ ) &&
+
+ test-tool -C unique-trees path-walk -- --all >out &&
+ tree=$(git -C unique-trees rev-parse HEAD:) &&
+ grep "$tree" out >out-filtered &&
+ test_line_count = 1 out-filtered
+'
+
test_done
--
gitgitgadget
* Re: [PATCH 5/6] path-walk: visit tags and cached objects
2024-10-31 6:27 ` [PATCH 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
@ 2024-11-01 14:25 ` karthik nayak
2024-11-04 15:56 ` Derrick Stolee
0 siblings, 1 reply; 67+ messages in thread
From: karthik nayak @ 2024-11-01 14:25 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget, git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> The rev_info that is specified for a path-walk traversal may specify
> visiting tag refs (both lightweight and annotated) and also may specify
> indexed objects (blobs and trees). Update the path-walk API to walk
> these objects as well.
>
> When walking tags, we need to peel the annotated objects until reaching
> a non-tag object. If we reach a commit, then we can add it to the
> pending objects to make sure we visit in the commit walk portion. If we
Nit: s/in/it in/
[snip]
> + case OBJ_BLOB:
> + if (!info->blobs)
> + continue;
> + if (pending->path) {
> + struct type_and_oid_list *list;
> + char *path = pending->path;
> + if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
> + CALLOC_ARRAY(list, 1);
> + list->type = OBJ_BLOB;
> + strmap_put(&ctx->paths_to_lists, path, list);
> + }
> + oid_array_append(&list->oids, &obj->oid);
> + } else {
> + /* assume a root tree, such as a lightweight tag. */
Shouldn't this comment be for tagged blobs?
> + oid_array_append(&tagged_blobs->oids, &obj->oid);
> + }
> + break;
[snip]
* Re: [PATCH 5/6] path-walk: visit tags and cached objects
2024-11-01 14:25 ` karthik nayak
@ 2024-11-04 15:56 ` Derrick Stolee
0 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee @ 2024-11-04 15:56 UTC (permalink / raw)
To: karthik nayak, Derrick Stolee via GitGitGadget, git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy
On 11/1/24 10:25 AM, karthik nayak wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The rev_info that is specified for a path-walk traversal may specify
>> visiting tag refs (both lightweight and annotated) and also may specify
>> indexed objects (blobs and trees). Update the path-walk API to walk
>> these objects as well.
>>
>> When walking tags, we need to peel the annotated objects until reaching
>> a non-tag object. If we reach a commit, then we can add it to the
>> pending objects to make sure we visit in the commit walk portion. If we
>
> Nit: s/in/it in/
thanks!
>> + case OBJ_BLOB:
>> + if (!info->blobs)
>> + continue;
>> + if (pending->path) {
>> + struct type_and_oid_list *list;
>> + char *path = pending->path;
>> + if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
>> + CALLOC_ARRAY(list, 1);
>> + list->type = OBJ_BLOB;
>> + strmap_put(&ctx->paths_to_lists, path, list);
>> + }
>> + oid_array_append(&list->oids, &obj->oid);
>> + } else {
>> + /* assume a root tree, such as a lightweight tag. */
>
> Shouldn't this comment be for tagged blobs?
Yes. This is a copy-paste error.
Thanks for the careful reading.
-Stolee
* [PATCH 6/6] path-walk: mark trees and blobs as UNINTERESTING
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
` (4 preceding siblings ...)
2024-10-31 6:27 ` [PATCH 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
@ 2024-10-31 6:27 ` Derrick Stolee via GitGitGadget
2024-10-31 12:36 ` [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee
` (2 subsequent siblings)
8 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-10-31 6:27 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
When the input rev_info has UNINTERESTING starting points, we want to be
sure that the UNINTERESTING flag is passed appropriately through the
objects. To match how this is done in places such as 'git pack-objects', we
use the mark_edges_uninteresting() method.
This method has an option for using the "sparse" walk, which is similar in
spirit to the path-walk API's walk. To be sure to keep it independent, add a
new 'prune_all_uninteresting' option to the path_walk_info struct.
To check how the UNINTERESTING flag is spread through our objects, extend the
'test-tool path-walk' command to output whether or not an object has that
flag. This changes our tests significantly, including the removal of some
objects that were previously visited due to the incomplete implementation.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
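Reviewer note (below the '---', so not part of the commit message): the
two-phase check that walk_path() performs when 'prune_all_uninteresting' is
set can be sketched in isolation roughly as follows. This is only a minimal
standalone sketch under stated assumptions, not the code in path-walk.c. The
names 'struct sketch_object', 'struct sketch_list', 'SKETCH_UNINTERESTING',
'SKETCH_MAX', 'sketch_append()' and 'sketch_should_prune()' are hypothetical
stand-ins for the object flags, 'struct type_and_oid_list', and the
lookup_tree()/lookup_blob() calls used in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Git's UNINTERESTING object flag. */
#define SKETCH_UNINTERESTING (1u << 1)
/* Hypothetical fixed capacity; the real list is a growable oid_array. */
#define SKETCH_MAX 8

struct sketch_object {
	unsigned flags;
};

/* Stand-in for 'struct type_and_oid_list' with direct object pointers. */
struct sketch_list {
	struct sketch_object *objects[SKETCH_MAX];
	size_t nr;
	int maybe_interesting;
};

/*
 * Mirrors the bookkeeping added to add_children(): remember at
 * append time whether any object was not yet UNINTERESTING.
 */
static int sketch_append(struct sketch_list *list, struct sketch_object *o)
{
	assert(list && o);
	if (list->nr >= SKETCH_MAX)
		return -1;
	if (!(o->flags & SKETCH_UNINTERESTING))
		list->maybe_interesting = 1;
	list->objects[list->nr++] = o;
	return 0;
}

/*
 * Returns 1 if the path can be skipped, 0 if it must be visited.
 * Phase 1: nothing was interesting when appended, so skip cheaply.
 * Phase 2: flags may have changed since the append, so re-scan and
 * stop at the first object that is still interesting.
 */
static int sketch_should_prune(struct sketch_list *list)
{
	if (!list->maybe_interesting)
		return 1;

	list->maybe_interesting = 0;
	for (size_t i = 0; !list->maybe_interesting && i < list->nr; i++) {
		if (!(list->objects[i]->flags & SKETCH_UNINTERESTING))
			list->maybe_interesting = 1;
	}
	return !list->maybe_interesting;
}
```

The cached 'maybe_interesting' bit keeps the common case cheap, while the
re-scan covers objects that were marked UNINTERESTING after they were added
to the list. The sketch leaves out the tag case, which the patch treats as
always interesting.
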
Documentation/technical/api-path-walk.txt | 8 +++
path-walk.c | 73 +++++++++++++++++++++
path-walk.h | 8 +++
t/helper/test-path-walk.c | 10 ++-
t/t6601-path-walk.sh | 79 +++++++++++++++++------
5 files changed, 157 insertions(+), 21 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 6022c381b7c..7075d0d5ab5 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -48,6 +48,14 @@ commits.
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.
+`prune_all_uninteresting`::
+ By default, all reachable paths are emitted by the path-walk API.
+ This option allows consumers to declare that they are not
+ interested in paths where all included objects are marked with the
+ `UNINTERESTING` flag. This requires using the `boundary` option in
+ the revision walk so that the walk emits commits marked with the
+ `UNINTERESTING` flag.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index eca0e5f3d5b..6f658c28307 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -8,6 +8,7 @@
#include "dir.h"
#include "hashmap.h"
#include "hex.h"
+#include "list-objects.h"
#include "object.h"
#include "oid-array.h"
#include "revision.h"
@@ -24,6 +25,7 @@ struct type_and_oid_list
{
enum object_type type;
struct oid_array oids;
+ int maybe_interesting;
};
#define TYPE_AND_OID_LIST_INIT { \
@@ -140,6 +142,9 @@ static int add_children(struct path_walk_context *ctx,
if (o->flags & SEEN)
continue;
o->flags |= SEEN;
+
+ if (!(o->flags & UNINTERESTING))
+ list->maybe_interesting = 1;
oid_array_append(&list->oids, &entry.oid);
}
@@ -164,6 +169,43 @@ static int walk_path(struct path_walk_context *ctx,
if (!list)
BUG("provided path '%s' that had no associated list", path);
+ if (ctx->info->prune_all_uninteresting) {
+ /*
+ * This is true if all objects were UNINTERESTING
+ * when added to the list.
+ */
+ if (!list->maybe_interesting)
+ return 0;
+
+ /*
+ * But it's still possible that the objects were set
+ * as UNINTERESTING after being added. Do a quick check.
+ */
+ list->maybe_interesting = 0;
+ for (size_t i = 0;
+ !list->maybe_interesting && i < list->oids.nr;
+ i++) {
+ if (list->type == OBJ_TREE) {
+ struct tree *t = lookup_tree(ctx->repo,
+ &list->oids.oid[i]);
+ if (t && !(t->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else if (list->type == OBJ_BLOB) {
+ struct blob *b = lookup_blob(ctx->repo,
+ &list->oids.oid[i]);
+ if (b && !(b->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else {
+ /* Tags are always interesting if visited. */
+ list->maybe_interesting = 1;
+ }
+ }
+
+ /* We have confirmed that all objects are UNINTERESTING. */
+ if (!list->maybe_interesting)
+ return 0;
+ }
+
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
(list->type == OBJ_BLOB && ctx->info->blobs) ||
@@ -198,6 +240,26 @@ static void clear_strmap(struct strmap *map)
strmap_init(map);
}
+static struct repository *edge_repo;
+static struct type_and_oid_list *edge_tree_list;
+
+static void show_edge(struct commit *commit)
+{
+ struct tree *t = repo_get_commit_tree(edge_repo, commit);
+
+ if (!t)
+ return;
+
+ if (commit->object.flags & UNINTERESTING)
+ t->object.flags |= UNINTERESTING;
+
+ if (t->object.flags & SEEN)
+ return;
+ t->object.flags |= SEEN;
+
+ oid_array_append(&edge_tree_list->oids, &t->object.oid);
+}
+
static void setup_pending_objects(struct path_walk_info *info,
struct path_walk_context *ctx)
{
@@ -306,6 +368,7 @@ static void setup_pending_objects(struct path_walk_info *info,
if (tagged_blobs->oids.nr) {
const char *tagged_blob_path = "/tagged-blobs";
tagged_blobs->type = OBJ_BLOB;
+ tagged_blobs->maybe_interesting = 1;
push_to_stack(ctx, tagged_blob_path);
strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
} else {
@@ -317,6 +380,7 @@ static void setup_pending_objects(struct path_walk_info *info,
if (tags->oids.nr) {
const char *tag_path = "/tags";
tags->type = OBJ_TAG;
+ tags->maybe_interesting = 1;
push_to_stack(ctx, tag_path);
strmap_put(&ctx->paths_to_lists, tag_path, tags);
} else {
@@ -359,6 +423,7 @@ int walk_objects_by_path(struct path_walk_info *info)
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
+ root_tree_list->maybe_interesting = 1;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
push_to_stack(&ctx, root_path);
@@ -372,6 +437,14 @@ int walk_objects_by_path(struct path_walk_info *info)
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ /* Walk trees to mark them as UNINTERESTING. */
+ edge_repo = info->revs->repo;
+ edge_tree_list = root_tree_list;
+ mark_edges_uninteresting(info->revs, show_edge,
+ info->prune_all_uninteresting);
+ edge_repo = NULL;
+ edge_tree_list = NULL;
+
info->revs->blob_objects = info->revs->tree_objects = 0;
trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index ca839f873e4..de0db007dc9 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -39,6 +39,14 @@ struct path_walk_info {
int trees;
int blobs;
int tags;
+
+ /**
+ * When 'prune_all_uninteresting' is set and a path has all objects
+ * marked as UNINTERESTING, then the path-walk will not visit those
+ * objects. It will not call path_fn on those objects and will not
+ * walk the children of such trees.
+ */
+ int prune_all_uninteresting;
};
#define PATH_WALK_INFO_INIT { \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index c6c60d68749..06b103d8760 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -55,8 +55,12 @@ static int emit_block(const char *path, struct oid_array *oids,
BUG("we do not understand this type");
}
- for (size_t i = 0; i < oids->nr; i++)
- printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));
+ for (size_t i = 0; i < oids->nr; i++) {
+ struct object *o = lookup_unknown_object(the_repository,
+ &oids->oid[i]);
+ printf("%s:%s:%s%s\n", typestr, path, oid_to_hex(&oids->oid[i]),
+ o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
+ }
return 0;
}
@@ -76,6 +80,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
+ OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
+ N_("toggle pruning of uninteresting paths")),
OPT_END(),
};
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 5ed6c79fbd1..a561c21d484 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -209,13 +209,13 @@ test_expect_success 'topic, not base' '
COMMIT::$(git rev-parse topic)
commits:1
TREE::$(git rev-parse topic^{tree})
- TREE:left/:$(git rev-parse topic:left)
+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
TREE:right/:$(git rev-parse topic:right)
trees:3
- BLOB:a:$(git rev-parse topic:a)
- BLOB:left/b:$(git rev-parse topic:left/b)
+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse topic:right/d)
+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:4
tags:0
EOF
@@ -223,6 +223,29 @@ test_expect_success 'topic, not base' '
test_cmp_sorted expect out
'
+test_expect_success 'fourth, blob-tag2, not base' '
+ test-tool path-walk -- fourth blob-tag2 --not base >out &&
+
+ cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ commits:1
+ TREE::$(git rev-parse topic^{tree})
+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
+ TREE:right/:$(git rev-parse topic:right)
+ trees:3
+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ blobs:5
+ TAG:/tags:$(git rev-parse fourth)
+ tags:1
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_expect_success 'topic, not base, only blobs' '
test-tool path-walk --no-trees --no-commits \
-- topic --not base >out &&
@@ -230,10 +253,10 @@ test_expect_success 'topic, not base, only blobs' '
cat >expect <<-EOF &&
commits:0
trees:0
- BLOB:a:$(git rev-parse topic:a)
- BLOB:left/b:$(git rev-parse topic:left/b)
+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse topic:right/d)
+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:4
tags:0
EOF
@@ -265,7 +288,7 @@ test_expect_success 'topic, not base, only trees' '
cat >expect <<-EOF &&
commits:0
TREE::$(git rev-parse topic^{tree})
- TREE:left/:$(git rev-parse topic:left)
+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
TREE:right/:$(git rev-parse topic:right)
trees:3
blobs:0
@@ -280,19 +303,19 @@ test_expect_success 'topic, not base, boundary' '
cat >expect <<-EOF &&
COMMIT::$(git rev-parse topic)
- COMMIT::$(git rev-parse base~1)
+ COMMIT::$(git rev-parse base~1):UNINTERESTING
commits:2
TREE::$(git rev-parse topic^{tree})
- TREE::$(git rev-parse base~1^{tree})
- TREE:left/:$(git rev-parse base~1:left)
+ TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
TREE:right/:$(git rev-parse topic:right)
- TREE:right/:$(git rev-parse base~1:right)
+ TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
trees:5
- BLOB:a:$(git rev-parse base~1:a)
- BLOB:left/b:$(git rev-parse base~1:left/b)
- BLOB:right/c:$(git rev-parse base~1:right/c)
+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse base~1:right/d)
+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:5
tags:0
EOF
@@ -300,6 +323,27 @@ test_expect_success 'topic, not base, boundary' '
test_cmp_sorted expect out
'
+test_expect_success 'topic, not base, boundary with pruning' '
+ test-tool path-walk --prune -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ COMMIT::$(git rev-parse topic)
+ COMMIT::$(git rev-parse base~1):UNINTERESTING
+ commits:2
+ TREE::$(git rev-parse topic^{tree})
+ TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
+ TREE:right/:$(git rev-parse topic:right)
+ TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
+ trees:4
+ BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ BLOB:right/c:$(git rev-parse topic:right/c)
+ blobs:2
+ tags:0
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_expect_success 'trees are reported exactly once' '
test_when_finished "rm -rf unique-trees" &&
test_create_repo unique-trees &&
@@ -307,15 +351,12 @@ test_expect_success 'trees are reported exactly once' '
cd unique-trees &&
mkdir initial &&
test_commit initial/file &&
-
git switch -c move-to-top &&
git mv initial/file.t ./ &&
test_tick &&
git commit -m moved &&
-
git update-ref refs/heads/other HEAD
) &&
-
test-tool -C unique-trees path-walk -- --all >out &&
tree=$(git -C unique-trees rev-parse HEAD:) &&
grep "$tree" out >out-filtered &&
--
gitgitgadget
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
` (5 preceding siblings ...)
2024-10-31 6:27 ` [PATCH 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
@ 2024-10-31 12:36 ` Derrick Stolee
2024-11-01 19:23 ` Taylor Blau
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
8 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee @ 2024-10-31 12:36 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget, git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy
On 10/31/24 2:26 AM, Derrick Stolee via GitGitGadget wrote:
> This is a new series that rerolls the initial "path-walk API" patches of my
> RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
> path-walk.h) presents a new way to walk objects such that trees and blobs
> are walked in batches according to their path.
>
> This also replaces the previous version of ds/path-walk that was being
> reviewed in [2]. The consensus was that the series was too long/dense and
> could use some reduction in size. This series takes the first few patches,
> but also makes some updates (which will be described later).
>
> [1]
> https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
> [RFC] Path-walk API and applications
>
> [2]
> https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/
> [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
...
> I will include a full range diff relative to the previous versions of these
> patches in [2] in a reply to this cover letter.
Here is the promised range-diff:
1: 98bdc94a773 ! 1: c71f0a0e361 path-walk: introduce an object walk by path
@@ Commit message
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
- objects and does not include any complicated filters. It calls a function
+ objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.
+ These batches are collected in 'struct type_and_oid_list' objects, which
+ store an object type and an oid_array of objects.
+
+ The data structures are documented in 'struct path_walk_context', but in
+ summary the most important are:
+
+ * 'paths_to_lists' is a strmap that connects a path to a
+ type_and_oid_list for that path. To avoid conflicts in path names,
+ we make sure that tree paths end in "/" (except the root path which
+ is an empty string) and blob paths do not end in "/".
+
+ * 'path_stack' is a string list that is added to in an append-only
+ way. This stores the stack of our depth-first search on the heap
+ instead of using recursion.
+
+ * 'path_stack_pushed' is a strmap that stores path names that were
+ already added to 'path_stack', to avoid repeating paths in the
+ stack. Mostly, this saves us from quadratic lookups from doing
+ unsorted checks into the string_list.
+
+ The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
+ push_to_stack() method. Call this instead of inserting into these
+ structures directly.
+
+ The walk_objects_by_path() method initializes these structures and
+ starts walking commits from the given rev_info struct. The commits are
+ used to find the list of root trees which populate the start of our
+ depth-first search.
+
+ The core of our depth-first search is in a while loop that continues
+ while we have not indicated an early exit and our 'path_stack' still has
+ entries in it. The loop body pops a path off of the stack and "visits"
+ the path via the walk_path() method.
+
+ The walk_path() method gets the list of OIDs from the 'paths_to_lists'
+ strmap and executes the callback method on that list with the given path
+ and type. If the OIDs correspond to tree objects, then iterate over all
+ trees in the list and run add_children() to add the child objects to
+ their own lists, adding new entries to the stack if necessary.
+
+ In testing, this depth-first search approach was the one that used the
+ least memory while iterating over the object lists. There is still a
+ chance that repositories with too-wide path patterns could cause memory
+ pressure issues. Limiting the stack size could be done in the future by
+ limiting how many objects are being considered in-progress, or by
+ visiting blob paths earlier than trees.
+
There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.
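
The coupling between 'path_stack' and 'path_stack_pushed' described in the
commit message above can be modeled outside of Git. What follows is only a
rough POSIX shell sketch, not code from the series. The variable names are
borrowed from the patch for readability, while pop_from_stack, the "|" line
prefix and the newline-separated storage are assumptions made purely for
illustration (paths containing newlines are not handled).

```shell
#!/bin/sh
# Hypothetical stand-ins for 'path_stack' and 'path_stack_pushed'.
# Entries are stored one per line as "|<path>" so that the empty root
# path still occupies a non-empty line.
nl='
'
path_stack=
path_stack_pushed=

push_to_stack () {
	# Mirror the strset check: skip any path that was pushed before,
	# even if it has already been popped again.
	if printf '%s' "$path_stack_pushed" | grep -qxF -e "|$1"
	then
		return 0
	fi
	path_stack_pushed="$path_stack_pushed|$1$nl"
	path_stack="$path_stack|$1$nl"
}

pop_from_stack () {
	# Store the top path in $popped and drop it from the stack.
	popped=$(printf '%s' "$path_stack" | sed -n '$p')
	popped=${popped#"|"}
	path_stack=$(printf '%s' "$path_stack" | sed '$d')
	# Command substitution strips the trailing newline; restore it.
	path_stack=${path_stack:+$path_stack$nl}
}

push_to_stack ""        # root tree path
push_to_stack "left/"
push_to_stack "right/"
push_to_stack "left/"   # ignored: already pushed

visited=
while test -n "$path_stack"
do
	pop_from_stack
	visited="${visited}<${popped}>"
	# Rediscovering a path later in the walk does not re-push it.
	push_to_stack "left/"
done
echo "$visited"
```

In this model each path is visited once, in last-in-first-out order. The
repeated pushes of "left/" are dropped because the set remembers every path
that was ever pushed, not only those currently on the stack.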
@@ Documentation/technical/api-path-walk.txt (new)
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
-+When walking a range of commits with some `UNINTERESTING` objects, the
-+objects with the `UNINTERESTING` flag are included in these batches. In
-+order to walk `UNINTERESTING` objects, the `--boundary` option must be
-+used in the commit walk in order to visit `UNINTERESTING` commits.
-+
+Basics
+------
+
@@ Documentation/technical/api-path-walk.txt (new)
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
-++
-+If you want the path-walk API to emit `UNINTERESTING` objects based on the
-+commit walk's boundary, be sure to set `revs.boundary` so the boundary
-+commits are emitted.
+
+Examples
+--------
@@ path-walk.c (new)
+ /**
+ * Store the current list of paths in a stack, to
+ * facilitate depth-first-search without recursion.
++ *
++ * Use path_stack_pushed to indicate whether a path
++ * was previously added to path_stack.
+ */
+ struct string_list path_stack;
++ struct strset path_stack_pushed;
+};
+
++static void push_to_stack(struct path_walk_context *ctx,
++ const char *path)
++{
++ if (strset_contains(&ctx->path_stack_pushed, path))
++ return;
++
++ strset_add(&ctx->path_stack_pushed, path);
++ string_list_append(&ctx->path_stack, path);
++}
++
+static int add_children(struct path_walk_context *ctx,
+ const char *base_path,
+ struct object_id *oid)
@@ path-walk.c (new)
+ if (!o) /* report error?*/
+ continue;
+
-+ /* Skip this object if already seen. */
-+ if (o->flags & SEEN)
-+ continue;
-+ o->flags |= SEEN;
-+
+ strbuf_setlen(&path, base_len);
+ strbuf_add(&path, entry.path, entry.pathlen);
+
@@ path-walk.c (new)
+ CALLOC_ARRAY(list, 1);
+ list->type = type;
+ strmap_put(&ctx->paths_to_lists, path.buf, list);
-+ string_list_append(&ctx->path_stack, path.buf);
+ }
++ push_to_stack(ctx, path.buf);
++
++ /* Skip this object if already seen. */
++ if (o->flags & SEEN)
++ continue;
++ o->flags |= SEEN;
+ oid_array_append(&list->oids, &entry.oid);
+ }
+
@@ path-walk.c (new)
+ .revs = info->revs,
+ .info = info,
+ .path_stack = STRING_LIST_INIT_DUP,
++ .path_stack_pushed = STRSET_INIT,
+ .paths_to_lists = STRMAP_INIT
+ };
+
@@ path-walk.c (new)
+ CALLOC_ARRAY(root_tree_list, 1);
+ root_tree_list->type = OBJ_TREE;
+ strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
++ push_to_stack(&ctx, root_path);
+
+ if (prepare_revision_walk(info->revs))
+ die(_("failed to setup revision walk"));
+
+ while ((c = get_revision(info->revs))) {
+ struct object_id *oid = get_commit_tree_oid(c);
-+ struct tree *t = lookup_tree(info->revs->repo, oid);
++ struct tree *t;
+ commits_nr++;
+
-+ if (t) {
-+ if (t->object.flags & SEEN)
-+ continue;
-+ t->object.flags |= SEEN;
-+ oid_array_append(&root_tree_list->oids, oid);
-+ } else {
++ oid = get_commit_tree_oid(c);
++ t = lookup_tree(info->revs->repo, oid);
++
++ if (!t) {
+ warning("could not find tree %s", oid_to_hex(oid));
++ continue;
+ }
++
++ if (t->object.flags & SEEN)
++ continue;
++ t->object.flags |= SEEN;
++ oid_array_append(&root_tree_list->oids, oid);
+ }
+
+ trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
+ trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+
-+ string_list_append(&ctx.path_stack, root_path);
-+
+ trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
@@ path-walk.c (new)
+ trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
+ clear_strmap(&ctx.paths_to_lists);
++ strset_clear(&ctx.path_stack_pushed);
+ string_list_clear(&ctx.path_stack, 0);
+ return ret;
+}
5: 6e89fb219b5 ! 2: 4f9f898fec1 revision: create mark_trees_uninteresting_dense()
@@ Metadata
Author: Derrick Stolee <stolee@gmail.com>
## Commit message ##
- revision: create mark_trees_uninteresting_dense()
+ test-lib-functions: add test_cmp_sorted
- The sparse tree walk algorithm was created in d5d2e93577e (revision:
- implement sparse algorithm, 2019-01-16) and involves using the
- mark_trees_uninteresting_sparse() method. This method takes a repository
- and an oidset of tree IDs, some of which have the UNINTERESTING flag and
- some of which do not.
-
- Create a method that has an equivalent set of preconditions but uses a
- "dense" walk (recursively visits all reachable trees, as long as they
- have not previously been marked UNINTERESTING). This is an important
- difference from mark_tree_uninteresting(), which short-circuits if the
- given tree has the UNINTERESTING flag.
-
- A use of this method will be added in a later change, with a condition
- set whether the sparse or dense approach should be used.
+ This test helper will be helpful to reduce repeated logic in
+ t6601-path-walk.sh, but may be helpful elsewhere, too.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
- ## revision.c ##
-@@ revision.c: static void add_children_by_path(struct repository *r,
- free_tree_buffer(tree);
+ ## t/test-lib-functions.sh ##
+@@ t/test-lib-functions.sh: test_cmp () {
+ eval "$GIT_TEST_CMP" '"$@"'
}
-+void mark_trees_uninteresting_dense(struct repository *r,
-+ struct oidset *trees)
-+{
-+ struct object_id *oid;
-+ struct oidset_iter iter;
-+
-+ oidset_iter_init(trees, &iter);
-+ while ((oid = oidset_iter_next(&iter))) {
-+ struct tree *tree = lookup_tree(r, oid);
++# test_cmp_sorted runs test_cmp on sorted versions of the two
++# input files. Uses "$1.sorted" and "$2.sorted" as temp files.
+
-+ if (tree && (tree->object.flags & UNINTERESTING))
-+ mark_tree_contents_uninteresting(r, tree);
-+ }
++test_cmp_sorted () {
++ sort <"$1" >"$1.sorted" &&
++ sort <"$2" >"$2.sorted" &&
++ test_cmp "$1.sorted" "$2.sorted" &&
++ rm "$1.sorted" "$2.sorted"
+}
+
- void mark_trees_uninteresting_sparse(struct repository *r,
- struct oidset *trees)
- {
-
- ## revision.h ##
-@@ revision.h: void put_revision_mark(const struct rev_info *revs,
-
- void mark_parents_uninteresting(struct rev_info *revs, struct commit *commit);
- void mark_tree_uninteresting(struct repository *r, struct tree *tree);
-+void mark_trees_uninteresting_dense(struct repository *r, struct oidset *trees);
- void mark_trees_uninteresting_sparse(struct repository *r, struct oidset *trees);
-
- void show_object_with_name(FILE *, struct object *, const char *);
+ # Check that the given config key has the expected value.
+ #
+ # test_cmp_config [-C <dir>] <expected-value>
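
For readers who want to try test_cmp_sorted outside the test suite, here is a
small stand-alone sketch. The helper body is copied from the patch above. The
test_cmp stand-in is an assumption (plain "diff -u" instead of evaluating
"$GIT_TEST_CMP"), and the file contents are made up for the demonstration.

```shell
#!/bin/sh
# Stand-in for Git's test_cmp, which evaluates "$GIT_TEST_CMP"; plain
# "diff -u" is an assumption so that this runs without test-lib.sh.
test_cmp () {
	diff -u "$@"
}

# Same body as the helper added by the patch.
test_cmp_sorted () {
	sort <"$1" >"$1.sorted" &&
	sort <"$2" >"$2.sorted" &&
	test_cmp "$1.sorted" "$2.sorted" &&
	rm "$1.sorted" "$2.sorted"
}

# Two files with the same lines in a different order compare equal.
dir=$(mktemp -d) &&
printf '%s\n' 'BLOB:a:1' 'TREE::2' >"$dir/expect" &&
printf '%s\n' 'TREE::2' 'BLOB:a:1' >"$dir/out" &&
test_cmp_sorted "$dir/expect" "$dir/out" &&
echo "same lines, different order: equal"
```

Because the steps are chained with "&&", the ".sorted" temporary files are
removed only when the comparison succeeds. On a mismatch they stay behind,
which can help when inspecting a failing test.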
2: a00ab0c62c9 ! 3: 6f93dff88e7 t6601: add helper for testing path-walk API
@@ Commit message
sets a baseline for the behavior and we can extend it as new options are
introduced.
+ It is important to mention that the behavior of the API will change soon as
+ we start to handle UNINTERESTING objects differently, but these tests will
+ demonstrate the change in behavior.
+
Signed-off-by: Derrick Stolee <stolee@gmail.com>
## Documentation/technical/api-path-walk.txt ##
-@@ Documentation/technical/api-path-walk.txt: commits are emitted.
+@@ Documentation/technical/api-path-walk.txt: commits.
Examples
--------
@@ t/t6601-path-walk.sh (new)
+ blobs:6
+ EOF
+
-+ sort expect >expect.sorted &&
-+ sort out >out.sorted &&
-+
-+ test_cmp expect.sorted out.sorted
++ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic only' '
@@ t/t6601-path-walk.sh (new)
+ blobs:5
+ EOF
+
-+ sort expect >expect.sorted &&
-+ sort out >out.sorted &&
-+
-+ test_cmp expect.sorted out.sorted
++ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base' '
@@ t/t6601-path-walk.sh (new)
+ blobs:4
+ EOF
+
-+ sort expect >expect.sorted &&
-+ sort out >out.sorted &&
-+
-+ test_cmp expect.sorted out.sorted
++ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, boundary' '
@@ t/t6601-path-walk.sh (new)
+ blobs:5
+ EOF
+
-+ sort expect >expect.sorted &&
-+ sort out >out.sorted &&
-+
-+ test_cmp expect.sorted out.sorted
++ test_cmp_sorted expect out
+'
+
+test_done
3: 14375d19392 ! 4: f4bf8be30b5 path-walk: allow consumer to specify object types
@@ Commit message
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.
- This adds the ability to ask for the commits in a list, as well. Future
- changes will add the ability to visit annotated tags.
+ This adds the ability to ask for the commits in a list, as well. We
+ re-use the empty string for this set of objects because these are passed
+ directly to the callback function instead of being part of the
+ 'path_stack'.
+
+ Future changes will add the ability to visit annotated tags.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
## Documentation/technical/api-path-walk.txt ##
-@@ Documentation/technical/api-path-walk.txt: If you want the path-walk API to emit `UNINTERESTING` objects based on the
- commit walk's boundary, be sure to set `revs.boundary` so the boundary
- commits are emitted.
+@@ Documentation/technical/api-path-walk.txt: It is also important that you do not specify the `--objects` flag for the
+ the objects will be walked in a separate way based on those starting
+ commits.
+`commits`, `blobs`, `trees`::
+ By default, these members are enabled and signal that the path-walk
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
- strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
--
- if (prepare_revision_walk(info->revs))
+@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
die(_("failed to setup revision walk"));
while ((c = get_revision(info->revs))) {
- struct object_id *oid = get_commit_tree_oid(c);
-- struct tree *t = lookup_tree(info->revs->repo, oid);
+ struct object_id *oid;
-+ struct tree *t;
+ struct tree *t;
commits_nr++;
+ if (info->commits)
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
+ if (!info->trees && !info->blobs)
+ continue;
+
-+ oid = get_commit_tree_oid(c);
-+ t = lookup_tree(info->revs->repo, oid);
-+
- if (t) {
- if (t->object.flags & SEEN)
- continue;
+ oid = get_commit_tree_oid(c);
+ t = lookup_tree(info->revs->repo, oid);
+
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
+ oid_array_clear(&commit_list->oids);
+ free(commit_list);
+
- string_list_append(&ctx.path_stack, root_path);
-
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
## path-walk.h ##
@@ path-walk.h: struct path_walk_info {
*/
path_fn path_fn;
void *path_fn_data;
++
+ /**
+ * Initialize which object types the path_fn should be called on. This
+ * could also limit the walk to skip blobs if not set.
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
TREE:left/:$(git rev-parse topic:left)
TREE:right/:$(git rev-parse topic:right)
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
- test_cmp expect.sorted out.sorted
+ test_cmp_sorted expect out
'
+test_expect_success 'topic, not base, only blobs' '
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ blobs:4
+ EOF
+
-+ sort expect >expect.sorted &&
-+ sort out >out.sorted &&
-+
-+ test_cmp expect.sorted out.sorted
++ test_cmp_sorted expect out
+'
+
+# No, this doesn't make a lot of sense for the path-walk API,
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ blobs:0
+ EOF
+
-+ sort expect >expect.sorted &&
-+ sort out >out.sorted &&
-+
-+ test_cmp expect.sorted out.sorted
++ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only trees' '
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ blobs:0
+ EOF
+
-+ sort expect >expect.sorted &&
-+ sort out >out.sorted &&
-+
-+ test_cmp expect.sorted out.sorted
++ test_cmp_sorted expect out
+'
+
test_expect_success 'topic, not base, boundary' '
4: c321f58c62d ! 5: dfd00b2bf0c path-walk: allow visiting tags
@@ Metadata
Author: Derrick Stolee <stolee@gmail.com>
## Commit message ##
- path-walk: allow visiting tags
+ path-walk: visit tags and cached objects
- In anticipation of using the path-walk API to analyze tags or include
- them in a pack-file, add the ability to walk the tags that were included
- in the revision walk.
+ The rev_info that is specified for a path-walk traversal may specify
+ visiting tag refs (both lightweight and annotated) and also may specify
+ indexed objects (blobs and trees). Update the path-walk API to walk
+ these objects as well.
- When these tag objects point to blobs or trees, we need to make sure
- those objects are also visited. Treat tagged trees as root trees, but
- put the tagged blobs in their own category.
+ When walking tags, we need to peel the annotated objects until reaching
+ a non-tag object. If we reach a commit, then we can add it to the
+ pending objects to make sure we visit in the commit walk portion. If we
+ reach a tree, then we will assume that it is a root tree. If we reach a
+ blob, then we have no good path name and so add it to a new list of
+ "tagged blobs".
- Be careful about objects that are referred to by multiple references.
+ When the rev_info includes the "--indexed-objects" flag, then the
+ pending set includes blobs and trees found in the cache entries and
+ cache-tree. The cache entries are usually blobs, though they could be
+ trees in the case of a sparse index. The cache-tree stores
+ previously-hashed tree objects but these are cleared out when staging
+ objects below those paths. We add tests that demonstrate this.
+
+ The indexed objects come with a non-NULL 'path' value in the pending
+ item. This allows us to prepopulate the 'paths_to_lists' strmap with
+ lists for these paths.
+
+ The tricky thing about this walk is that we will want to combine the
+ indexed objects walk with the commit walk, especially in the future case
+ of walking objects during a command like 'git repack'.
+
+ Whenever possible, we want the objects from the index to be grouped with
+ similar objects in history. We don't want to miss any paths that appear
+ only in the index and not in the commit history.
+
+ Thus, we need to be careful to let the path stack be populated initially
+ with only the root tree path (and possibly tags and tagged blobs) and go
+ through the normal depth-first search. Afterwards, if there are other
+ paths that are remaining in the paths_to_lists strmap, we should then
+ iterate through the stack and visit those objects recursively.
- Co-authored-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
- Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
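
The tag peeling described in the commit message above can also be observed
from the command line. The sketch below is a hedged illustration only:
peel_tag_chain is a hypothetical helper name, and it walks the chain with
'git cat-file' rather than with lookup_tag() and 'tag->tagged' as the C code
in this patch does.

```shell
#!/bin/sh
# Peel a chain of annotated tags one level at a time. Print every tag
# object passed through, then the type and ID of the first non-tag
# object. (Hypothetical helper, not part of the series.)
peel_tag_chain () {
	obj=$(git rev-parse --verify "$1") || return 1
	while test "$(git cat-file -t "$obj")" = tag
	do
		echo "TAG $obj"
		# The first header line of a tag object is "object <oid>".
		obj=$(git cat-file -p "$obj" | sed -n '1s/^object //p')
	done
	echo "$(git cat-file -t "$obj") $obj"
}
```

Run inside a repository, 'peel_tag_chain refs/tags/<name>' prints one "TAG"
line per annotated tag in the chain and ends with a "commit", "tree" or
"blob" line. Those three outcomes correspond to the cases the commit message
distinguishes: a commit is left to the commit walk, a tree is assumed to be a
root tree, and a blob lands in the "tagged blobs" list.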
## Documentation/technical/api-path-walk.txt ##
-@@ Documentation/technical/api-path-walk.txt: If you want the path-walk API to emit `UNINTERESTING` objects based on the
- commit walk's boundary, be sure to set `revs.boundary` so the boundary
- commits are emitted.
+@@ Documentation/technical/api-path-walk.txt: It is also important that you do not specify the `--objects` flag for the
+ the objects will be walked in a separate way based on those starting
+ commits.
-`commits`, `blobs`, `trees`::
+`commits`, `blobs`, `trees`, `tags`::
@@ path-walk.c
#include "trace2.h"
#include "tree.h"
#include "tree-walk.h"
+
++static const char *root_path = "";
++
+ struct type_and_oid_list
+ {
+ enum object_type type;
+@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
+
+ list = strmap_get(&ctx->paths_to_lists, path);
+
++ if (!list)
++ BUG("provided path '%s' that had no associated list", path);
++
+ /* Evaluate function pointer on this data, if requested. */
+ if ((list->type == OBJ_TREE && ctx->info->trees) ||
+- (list->type == OBJ_BLOB && ctx->info->blobs))
++ (list->type == OBJ_BLOB && ctx->info->blobs) ||
++ (list->type == OBJ_TAG && ctx->info->tags))
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
+
+@@ path-walk.c: static void clear_strmap(struct strmap *map)
+ strmap_init(map);
+ }
+
++static void setup_pending_objects(struct path_walk_info *info,
++ struct path_walk_context *ctx)
++{
++ struct type_and_oid_list *tags = NULL;
++ struct type_and_oid_list *tagged_blobs = NULL;
++ struct type_and_oid_list *root_tree_list = NULL;
++
++ if (info->tags)
++ CALLOC_ARRAY(tags, 1);
++ if (info->blobs)
++ CALLOC_ARRAY(tagged_blobs, 1);
++ if (info->trees)
++ root_tree_list = strmap_get(&ctx->paths_to_lists, root_path);
++
++ /*
++ * Pending objects include:
++ * * Commits at branch tips.
++ * * Annotated tags at tag tips.
++ * * Any kind of object at lightweight tag tips.
++ * * Trees and blobs in the index (with an associated path).
++ */
++ for (size_t i = 0; i < info->revs->pending.nr; i++) {
++ struct object_array_entry *pending = info->revs->pending.objects + i;
++ struct object *obj = pending->item;
++
++ /* Commits will be picked up by revision walk. */
++ if (obj->type == OBJ_COMMIT)
++ continue;
++
++ /* Navigate annotated tag object chains. */
++ while (obj->type == OBJ_TAG) {
++ struct tag *tag = lookup_tag(info->revs->repo, &obj->oid);
++ if (!tag)
++ break;
++ if (tag->object.flags & SEEN)
++ break;
++ tag->object.flags |= SEEN;
++
++ if (tags)
++ oid_array_append(&tags->oids, &obj->oid);
++ obj = tag->tagged;
++ }
++
++ if (obj->type == OBJ_TAG)
++ continue;
++
++ /* We are now at a non-tag object. */
++ if (obj->flags & SEEN)
++ continue;
++ obj->flags |= SEEN;
++
++ switch (obj->type) {
++ case OBJ_TREE:
++ if (!info->trees)
++ continue;
++ if (pending->path) {
++ struct type_and_oid_list *list;
++ char *path = *pending->path ? xstrfmt("%s/", pending->path)
++ : xstrdup("");
++ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
++ CALLOC_ARRAY(list, 1);
++ list->type = OBJ_TREE;
++ strmap_put(&ctx->paths_to_lists, path, list);
++ }
++ oid_array_append(&list->oids, &obj->oid);
++ free(path);
++ } else {
++ /* assume a root tree, such as a lightweight tag. */
++ oid_array_append(&root_tree_list->oids, &obj->oid);
++ }
++ break;
++
++ case OBJ_BLOB:
++ if (!info->blobs)
++ continue;
++ if (pending->path) {
++ struct type_and_oid_list *list;
++ char *path = pending->path;
++ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
++ CALLOC_ARRAY(list, 1);
++ list->type = OBJ_BLOB;
++ strmap_put(&ctx->paths_to_lists, path, list);
++ }
++ oid_array_append(&list->oids, &obj->oid);
++ } else {
++ /* no path: assume a tagged blob, such as a lightweight tag. */
++ oid_array_append(&tagged_blobs->oids, &obj->oid);
++ }
++ break;
++
++ case OBJ_COMMIT:
++ /* Make sure it is in the object walk */
++ if (obj != pending->item)
++ add_pending_object(info->revs, obj, "");
++ break;
++
++ default:
++ BUG("should not see any other type here");
++ }
++ }
++
++ /*
++ * Add tag objects and tagged blobs if they exist.
++ */
++ if (tagged_blobs) {
++ if (tagged_blobs->oids.nr) {
++ const char *tagged_blob_path = "/tagged-blobs";
++ tagged_blobs->type = OBJ_BLOB;
++ push_to_stack(ctx, tagged_blob_path);
++ strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
++ } else {
++ oid_array_clear(&tagged_blobs->oids);
++ free(tagged_blobs);
++ }
++ }
++ if (tags) {
++ if (tags->oids.nr) {
++ const char *tag_path = "/tags";
++ tags->type = OBJ_TAG;
++ push_to_stack(ctx, tag_path);
++ strmap_put(&ctx->paths_to_lists, tag_path, tags);
++ } else {
++ oid_array_clear(&tags->oids);
++ free(tags);
++ }
++ }
++}
++
+ /**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+@@ path-walk.c: static void clear_strmap(struct strmap *map)
+ */
+ int walk_objects_by_path(struct path_walk_info *info)
+ {
+- const char *root_path = "";
+ int ret = 0;
+ size_t commits_nr = 0, paths_nr = 0;
+ struct commit *c;
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
CALLOC_ARRAY(commit_list, 1);
commit_list->type = OBJ_COMMIT;
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
-+
+ push_to_stack(&ctx, root_path);
+
+ /*
+ * Set these values before preparing the walk to catch
-+ * lightweight tags pointing to non-commits.
++ * lightweight tags pointing to non-commits and indexed objects.
+ */
+ info->revs->blob_objects = info->blobs;
+ info->revs->tree_objects = info->trees;
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
+ info->revs->blob_objects = info->revs->tree_objects = 0;
+
-+ if (info->tags) {
-+ struct oid_array tagged_blob_list = OID_ARRAY_INIT;
-+ struct oid_array tags = OID_ARRAY_INIT;
-+
-+ trace2_region_enter("path-walk", "tag-walk", info->revs->repo);
-+
-+ /*
-+ * Walk any pending objects at this point, but they should only
-+ * be tags.
-+ */
-+ for (size_t i = 0; i < info->revs->pending.nr; i++) {
-+ struct object_array_entry *pending = info->revs->pending.objects + i;
-+ struct object *obj = pending->item;
-+
-+ if (obj->type == OBJ_COMMIT || obj->flags & SEEN)
-+ continue;
-+
-+ while (obj->type == OBJ_TAG) {
-+ struct tag *tag = lookup_tag(info->revs->repo,
-+ &obj->oid);
-+ if (!(obj->flags & SEEN)) {
-+ obj->flags |= SEEN;
-+ oid_array_append(&tags, &obj->oid);
-+ }
-+ obj = tag->tagged;
-+ }
-+
-+ if ((obj->flags & SEEN))
-+ continue;
-+ obj->flags |= SEEN;
++ trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
++ setup_pending_objects(info, &ctx);
++ trace2_region_leave("path-walk", "pending-walk", info->revs->repo);
+
-+ switch (obj->type) {
-+ case OBJ_TREE:
-+ if (info->trees)
-+ oid_array_append(&root_tree_list->oids, &obj->oid);
-+ break;
-+
-+ case OBJ_BLOB:
-+ if (info->blobs)
-+ oid_array_append(&tagged_blob_list, &obj->oid);
-+ break;
+ while ((c = get_revision(info->revs))) {
+ struct object_id *oid;
+ struct tree *t;
+@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
+
+ free(path);
+ }
+
-+ case OBJ_COMMIT:
-+ /* Make sure it is in the object walk */
-+ add_pending_object(info->revs, obj, "");
-+ break;
++ /* Are there paths remaining? Likely they are from indexed objects. */
++ if (!strmap_empty(&ctx.paths_to_lists)) {
++ struct hashmap_iter iter;
++ struct strmap_entry *entry;
+
-+ default:
-+ BUG("should not see any other type here");
-+ }
++ strmap_for_each_entry(&ctx.paths_to_lists, &iter, entry) {
++ push_to_stack(&ctx, entry->key);
+ }
+
-+ info->path_fn("", &tags, OBJ_TAG, info->path_fn_data);
++ while (!ret && ctx.path_stack.nr) {
++ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
++ ctx.path_stack.nr--;
++ paths_nr++;
+
-+ if (tagged_blob_list.nr && info->blobs)
-+ info->path_fn("/tagged-blobs", &tagged_blob_list, OBJ_BLOB,
-+ info->path_fn_data);
++ ret = walk_path(&ctx, path);
+
-+ trace2_data_intmax("path-walk", ctx.repo, "tags", tags.nr);
-+ trace2_region_leave("path-walk", "tag-walk", info->revs->repo);
-+ oid_array_clear(&tags);
-+ oid_array_clear(&tagged_blob_list);
++ free(path);
++ }
+ }
+
- while ((c = get_revision(info->revs))) {
- struct object_id *oid;
- struct tree *t;
+ trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
+ trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
## path-walk.h ##
@@ path-walk.h: struct path_walk_info {
@@ t/t6601-path-walk.sh: test_expect_success 'all' '
+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ BLOB:child/file:$(git rev-parse refs/tags/tree-tag^{}:child/file)
+ blobs:10
-+ TAG::$(git rev-parse refs/tags/first)
-+ TAG::$(git rev-parse refs/tags/second.1)
-+ TAG::$(git rev-parse refs/tags/second.2)
-+ TAG::$(git rev-parse refs/tags/third)
-+ TAG::$(git rev-parse refs/tags/fourth)
-+ TAG::$(git rev-parse refs/tags/tree-tag)
-+ TAG::$(git rev-parse refs/tags/blob-tag)
++ TAG:/tags:$(git rev-parse refs/tags/first)
++ TAG:/tags:$(git rev-parse refs/tags/second.1)
++ TAG:/tags:$(git rev-parse refs/tags/second.2)
++ TAG:/tags:$(git rev-parse refs/tags/third)
++ TAG:/tags:$(git rev-parse refs/tags/fourth)
++ TAG:/tags:$(git rev-parse refs/tags/tree-tag)
++ TAG:/tags:$(git rev-parse refs/tags/blob-tag)
+ tags:7
++ EOF
++
++ test_cmp_sorted expect out
++'
++
++test_expect_success 'indexed objects' '
++ test_when_finished git reset --hard &&
++
++ # stage change into index, adding a blob but
++ # also invalidating the cache-tree for the root
++ # and the "left" directory.
++ echo bogus >left/c &&
++ git add left &&
++
++ test-tool path-walk -- --indexed-objects >out &&
++
++ cat >expect <<-EOF &&
++ commits:0
++ TREE:right/:$(git rev-parse topic:right)
++ trees:1
++ BLOB:a:$(git rev-parse HEAD:a)
++ BLOB:left/b:$(git rev-parse HEAD:left/b)
++ BLOB:left/c:$(git rev-parse :left/c)
++ BLOB:right/c:$(git rev-parse HEAD:right/c)
++ BLOB:right/d:$(git rev-parse HEAD:right/d)
++ blobs:5
++ tags:0
++ EOF
++
++ test_cmp_sorted expect out
++'
++
++test_expect_success 'branches and indexed objects mix well' '
++ test_when_finished git reset --hard &&
++
++ # stage change into index, adding a blob but
++ # also invalidating the cache-tree for the root
++ # and the "right" directory.
++ echo fake >right/d &&
++ git add right &&
++
++ test-tool path-walk -- --indexed-objects --branches >out &&
++
++ cat >expect <<-EOF &&
++ COMMIT::$(git rev-parse topic)
++ COMMIT::$(git rev-parse base)
++ COMMIT::$(git rev-parse base~1)
++ COMMIT::$(git rev-parse base~2)
++ commits:4
++ TREE::$(git rev-parse topic^{tree})
++ TREE::$(git rev-parse base^{tree})
++ TREE::$(git rev-parse base~1^{tree})
++ TREE::$(git rev-parse base~2^{tree})
++ TREE:a/:$(git rev-parse base:a)
++ TREE:left/:$(git rev-parse base:left)
++ TREE:left/:$(git rev-parse base~2:left)
++ TREE:right/:$(git rev-parse topic:right)
++ TREE:right/:$(git rev-parse base~1:right)
++ TREE:right/:$(git rev-parse base~2:right)
++ trees:10
++ BLOB:a:$(git rev-parse base~2:a)
++ BLOB:left/b:$(git rev-parse base:left/b)
++ BLOB:left/b:$(git rev-parse base~2:left/b)
++ BLOB:right/c:$(git rev-parse base~2:right/c)
++ BLOB:right/c:$(git rev-parse topic:right/c)
++ BLOB:right/d:$(git rev-parse base~1:right/d)
++ BLOB:right/d:$(git rev-parse :right/d)
++ blobs:7
++ tags:0
EOF
- sort expect >expect.sorted &&
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic only' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
@@ t/t6601-path-walk.sh: test_expect_success 'topic only' '
+ tags:0
EOF
- sort expect >expect.sorted &&
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse topic:right/d)
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ tags:0
EOF
- sort expect >expect.sorted &&
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse topic:right/d)
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' '
+ tags:0
EOF
- sort expect >expect.sorted &&
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only commits' '
commits:1
trees:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only commits' '
+ tags:0
EOF
- sort expect >expect.sorted &&
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' '
TREE:right/:$(git rev-parse topic:right)
trees:3
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' '
+ tags:0
EOF
- sort expect >expect.sorted &&
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
BLOB:right/c:$(git rev-parse topic:right/c)
BLOB:right/d:$(git rev-parse base~1:right/d)
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
+ tags:0
EOF
- sort expect >expect.sorted &&
-@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
- test_cmp expect.sorted out.sorted
+ test_cmp_sorted expect out
'
+test_expect_success 'trees are reported exactly once' '
6: 238d7d95715 ! 6: 5252076d556 path-walk: add prune_all_uninteresting option
@@ Metadata
Author: Derrick Stolee <stolee@gmail.com>
## Commit message ##
- path-walk: add prune_all_uninteresting option
+ path-walk: mark trees and blobs as UNINTERESTING
- This option causes the path-walk API to act like the sparse tree-walk
- algorithm implemented by mark_trees_uninteresting_sparse() in
- list-objects.c.
+ When the input rev_info has UNINTERESTING starting points, we want to be
+ sure that the UNINTERESTING flag is passed appropriately through the
+ objects. To match how this is done in places such as 'git pack-objects', we
+ use the mark_edges_uninteresting() method.
- Starting from the commits marked as UNINTERESTING, their root trees and
- all objects reachable from those trees are UNINTERESTING, at least as we
- walk path-by-path. When we reach a path where all objects associated
- with that path are marked UNINTERESTING, then do not continue walking the
- children of that path.
+ This method has an option for using the "sparse" walk, which is similar in
+ spirit to the path-walk API's walk. To be sure to keep it independent, add a
+ new 'prune_all_uninteresting' option to the path_walk_info struct.
- We need to be careful to pass the UNINTERESTING flag in a deep way on
- the UNINTERESTING objects before we start the path-walk, or else the
- depth-first search for the path-walk API may accidentally report some
- objects as interesting.
+ To check how the UNINTERESTING flag is spread through our objects, extend the
+ 'test-tool path-walk' command to output whether or not an object has that
+ flag. This changes our tests significantly, including the removal of some
+ objects that were previously visited due to the incomplete implementation.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
## Documentation/technical/api-path-walk.txt ##
-@@ Documentation/technical/api-path-walk.txt: commits are emitted.
+@@ Documentation/technical/api-path-walk.txt: commits.
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.
@@ Documentation/technical/api-path-walk.txt: commits are emitted.
## path-walk.c ##
+@@
+ #include "dir.h"
+ #include "hashmap.h"
+ #include "hex.h"
++#include "list-objects.h"
+ #include "object.h"
+ #include "oid-array.h"
+ #include "revision.h"
@@ path-walk.c: struct type_and_oid_list
{
enum object_type type;
@@ path-walk.c: struct type_and_oid_list
#define TYPE_AND_OID_LIST_INIT { \
@@ path-walk.c: static int add_children(struct path_walk_context *ctx,
- strmap_put(&ctx->paths_to_lists, path.buf, list);
- string_list_append(&ctx->path_stack, path.buf);
- }
+ if (o->flags & SEEN)
+ continue;
+ o->flags |= SEEN;
++
+ if (!(o->flags & UNINTERESTING))
+ list->maybe_interesting = 1;
oid_array_append(&list->oids, &entry.oid);
}
@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
-
- list = strmap_get(&ctx->paths_to_lists, path);
+ if (!list)
+ BUG("provided path '%s' that had no associated list", path);
+ if (ctx->info->prune_all_uninteresting) {
+ /*
@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
+ &list->oids.oid[i]);
+ if (t && !(t->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
-+ } else {
++ } else if (list->type == OBJ_BLOB) {
+ struct blob *b = lookup_blob(ctx->repo,
+ &list->oids.oid[i]);
+ if (b && !(b->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
++ } else {
++ /* Tags are always interesting if visited. */
++ list->maybe_interesting = 1;
+ }
+ }
+
@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
+
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
- (list->type == OBJ_BLOB && ctx->info->blobs))
+ (list->type == OBJ_BLOB && ctx->info->blobs)||
@@ path-walk.c: static void clear_strmap(struct strmap *map)
- int walk_objects_by_path(struct path_walk_info *info)
- {
- const char *root_path = "";
-- int ret = 0;
-+ int ret = 0, has_uninteresting = 0;
- size_t commits_nr = 0, paths_nr = 0;
- struct commit *c;
- struct type_and_oid_list *root_tree_list;
-@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
- .path_stack = STRING_LIST_INIT_DUP,
- .paths_to_lists = STRMAP_INIT
- };
-+ struct oidset root_tree_set = OIDSET_INIT;
-
- trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+ strmap_init(map);
+ }
++static struct repository *edge_repo;
++static struct type_and_oid_list *edge_tree_list;
++
++static void show_edge(struct commit *commit)
++{
++ struct tree *t = repo_get_commit_tree(edge_repo, commit);
++
++ if (!t)
++ return;
++
++ if (commit->object.flags & UNINTERESTING)
++ t->object.flags |= UNINTERESTING;
++
++ if (t->object.flags & SEEN)
++ return;
++ t->object.flags |= SEEN;
++
++ oid_array_append(&edge_tree_list->oids, &t->object.oid);
++}
++
+ static void setup_pending_objects(struct path_walk_info *info,
+ struct path_walk_context *ctx)
+ {
+@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info,
+ if (tagged_blobs->oids.nr) {
+ const char *tagged_blob_path = "/tagged-blobs";
+ tagged_blobs->type = OBJ_BLOB;
++ tagged_blobs->maybe_interesting = 1;
+ push_to_stack(ctx, tagged_blob_path);
+ strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
+ } else {
+@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info,
+ if (tags->oids.nr) {
+ const char *tag_path = "/tags";
+ tags->type = OBJ_TAG;
++ tags->maybe_interesting = 1;
+ push_to_stack(ctx, tag_path);
+ strmap_put(&ctx->paths_to_lists, tag_path, tags);
+ } else {
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
+ root_tree_list->maybe_interesting = 1;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+ push_to_stack(&ctx, root_path);
- /*
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
- t = lookup_tree(info->revs->repo, oid);
+ if (prepare_revision_walk(info->revs))
+ die(_("failed to setup revision walk"));
- if (t) {
-+ if ((c->object.flags & UNINTERESTING)) {
-+ t->object.flags |= UNINTERESTING;
-+ has_uninteresting = 1;
-+ }
++ /* Walk trees to mark them as UNINTERESTING. */
++ edge_repo = info->revs->repo;
++ edge_tree_list = root_tree_list;
++ mark_edges_uninteresting(info->revs, show_edge,
++ info->prune_all_uninteresting);
++ edge_repo = NULL;
++ edge_tree_list = NULL;
+
- if (t->object.flags & SEEN)
- continue;
- t->object.flags |= SEEN;
-- oid_array_append(&root_tree_list->oids, oid);
-+ if (!oidset_insert(&root_tree_set, oid))
-+ oid_array_append(&root_tree_list->oids, oid);
- } else {
- warning("could not find tree %s", oid_to_hex(oid));
- }
-@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
- oid_array_clear(&commit_list->oids);
- free(commit_list);
+ info->revs->blob_objects = info->revs->tree_objects = 0;
-+ /*
-+ * Before performing a DFS of our paths and emitting them as interesting,
-+ * do a full walk of the trees to distribute the UNINTERESTING bit. Use
-+ * the sparse algorithm if prune_all_uninteresting was set.
-+ */
-+ if (has_uninteresting) {
-+ trace2_region_enter("path-walk", "uninteresting-walk", info->revs->repo);
-+ if (info->prune_all_uninteresting)
-+ mark_trees_uninteresting_sparse(ctx.repo, &root_tree_set);
-+ else
-+ mark_trees_uninteresting_dense(ctx.repo, &root_tree_set);
-+ trace2_region_leave("path-walk", "uninteresting-walk", info->revs->repo);
-+ }
-+ oidset_clear(&root_tree_set);
-+
- string_list_append(&ctx.path_stack, root_path);
-
- trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+ trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
## path-walk.h ##
@@ path-walk.h: struct path_walk_info {
@@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
## t/t6601-path-walk.sh ##
+@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ COMMIT::$(git rev-parse topic)
+ commits:1
+ TREE::$(git rev-parse topic^{tree})
+- TREE:left/:$(git rev-parse topic:left)
++ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
+ TREE:right/:$(git rev-parse topic:right)
+ trees:3
+- BLOB:a:$(git rev-parse topic:a)
+- BLOB:left/b:$(git rev-parse topic:left/b)
++ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
++ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ BLOB:right/c:$(git rev-parse topic:right/c)
+- BLOB:right/d:$(git rev-parse topic:right/d)
++ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ blobs:4
+ tags:0
+ EOF
+@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ test_cmp_sorted expect out
+ '
+
++test_expect_success 'fourth, blob-tag2, not base' '
++ test-tool path-walk -- fourth blob-tag2 --not base >out &&
++
++ cat >expect <<-EOF &&
++ COMMIT::$(git rev-parse topic)
++ commits:1
++ TREE::$(git rev-parse topic^{tree})
++ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
++ TREE:right/:$(git rev-parse topic:right)
++ trees:3
++ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
++ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
++ BLOB:right/c:$(git rev-parse topic:right/c)
++ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
++ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
++ blobs:5
++ TAG:/tags:$(git rev-parse fourth)
++ tags:1
++ EOF
++
++ test_cmp_sorted expect out
++'
++
+ test_expect_success 'topic, not base, only blobs' '
+ test-tool path-walk --no-trees --no-commits \
+ -- topic --not base >out &&
+@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' '
+ cat >expect <<-EOF &&
+ commits:0
+ trees:0
+- BLOB:a:$(git rev-parse topic:a)
+- BLOB:left/b:$(git rev-parse topic:left/b)
++ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
++ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ BLOB:right/c:$(git rev-parse topic:right/c)
+- BLOB:right/d:$(git rev-parse topic:right/d)
++ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ blobs:4
+ tags:0
+ EOF
+@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' '
+ cat >expect <<-EOF &&
+ commits:0
+ TREE::$(git rev-parse topic^{tree})
+- TREE:left/:$(git rev-parse topic:left)
++ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
+ TREE:right/:$(git rev-parse topic:right)
+ trees:3
+ blobs:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
cat >expect <<-EOF &&
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
tags:0
EOF
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
- test_cmp expect.sorted out.sorted
+ test_cmp_sorted expect out
'
+-test_expect_success 'trees are reported exactly once' '
+- test_when_finished "rm -rf unique-trees" &&
+- test_create_repo unique-trees &&
+- (
+- cd unique-trees &&
+- mkdir initial &&
+- test_commit initial/file &&
+-
+- git switch -c move-to-top &&
+- git mv initial/file.t ./ &&
+- test_tick &&
+- git commit -m moved &&
+-
+- git update-ref refs/heads/other HEAD
+- ) &&
+-
+- test-tool -C unique-trees path-walk -- --all >out &&
+- tree=$(git -C unique-trees rev-parse HEAD:) &&
+- grep "$tree" out >out-filtered &&
+- test_line_count = 1 out-filtered
+test_expect_success 'topic, not base, boundary with pruning' '
+ test-tool path-walk --prune -- --boundary topic --not base >out &&
+
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
+ tags:0
+ EOF
+
-+ sort expect >expect.sorted &&
-+ sort out >out.sorted &&
-+
-+ test_cmp expect.sorted out.sorted
-+'
-+
- test_expect_success 'trees are reported exactly once' '
- test_when_finished "rm -rf unique-trees" &&
- test_create_repo unique-trees &&
++ test_cmp_sorted expect out
+ '
+
+ test_done
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
` (6 preceding siblings ...)
2024-10-31 12:36 ` [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee
@ 2024-11-01 19:23 ` Taylor Blau
2024-11-04 15:48 ` Derrick Stolee
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
8 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2024-11-01 19:23 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
Derrick Stolee
Hi Stolee,
On Thu, Oct 31, 2024 at 06:26:57AM +0000, Derrick Stolee via GitGitGadget wrote:
>
> Introduction and relation to prior series
> =========================================
>
> This is a new series that rerolls the initial "path-walk API" patches of my
> RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
> path-walk.h) presents a new way to walk objects such that trees and blobs
> are walked in batches according to their path.
>
> This also replaces the previous version of ds/path-walk that was being
> reviewed in [2]. The consensus was that the series was too long/dense and
> could use some reduction in size. This series takes the first few patches,
> but also makes some updates (which will be described later).
>
> [1]
> https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
> [RFC] Path-walk API and applications
>
> [2]
> https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/
> [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
I apologize for not having a better place to start discussing a topic
which pertains to more than just this immediate patch series, but I
figure here is as good a place as any to do so.
From our earlier discussion, it seems to stand that the path-walk API
is fundamentally incompatible with reachability bitmaps and
delta-islands, making the series a non-starter in environments that
rely significantly on one or both of those features. My understanding as a
result is that the path-walk API and feature are more targeted towards
improving client-side repacks and push performance, where neither of the
aforementioned two features are used quite as commonly.
I was discussing this a bit off-list with Peff (who I hope will join the
thread and share his own thoughts), but I wonder if it was a mistake to
discard your '--full-name-hash' idea (or something similar, which I'll
discuss in a bit below) from earlier.
(Repeating a few things that I am sure are obvious to you out loud so
that I can get a grasp on them for my own understanding):
It seems that the problems you've identified which result in poor repack
performance occur when you have files at the same path, but they get
poorly sorted in the delta selection window due to other paths having
the same final 16 characters, so Git doesn't see that much better delta
opportunities exist.
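To make the suffix sensitivity concrete, here is a minimal sketch modeled on Git's classic name-hash. The function names `name_hash_sketch` and `hash_distance` are illustrative only and are not taken from the series. Each character lands in the top byte while older input is shifted right by two bits, so the value is dominated by the last sixteen non-whitespace characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch modeled on Git's classic name-hash (illustrative name).
 * Every new character is added into the top byte and the previous
 * value is shifted right by two bits, so input more than sixteen
 * characters back has been shifted almost entirely out of the
 * 32-bit result.
 */
static uint32_t name_hash_sketch(const char *name)
{
	uint32_t hash = 0;
	unsigned char c;

	if (!name)
		return 0;

	while ((c = (unsigned char)*name++) != 0) {
		if (isspace(c))
			continue; /* whitespace does not contribute */
		hash = (hash >> 2) + ((uint32_t)c << 24);
	}
	return hash;
}

/* How far apart two paths land when sorted by the sketch hash. */
static uint32_t hash_distance(const char *a, const char *b)
{
	uint32_t ha, hb;

	assert(a && b);
	ha = name_hash_sketch(a);
	hb = name_hash_sketch(b);
	return ha > hb ? ha - hb : hb - ha;
}
```

With this sketch, two unrelated directories that both contain `components/widget/index.js` end up adjacent in hash order, which is the kind of collision described above.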
Your series takes into account the full name when hashing, which seems
to produce a clear win in many cases. I'm sure that there are some cases
where it presents a modest regression in pack sizes, but I think that's
fine and probably par for the course when making any changes like this,
as there is probably no easy silver bullet here that uniformly improves
all cases.
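For contrast, a hash that mixes every byte of the path could look like the sketch below. This uses 32-bit FNV-1a purely as a stand-in; it is an assumption for illustration and is not the hash function from the --full-name-hash series. The name `full_path_hash_sketch` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative full-path hash using 32-bit FNV-1a. This is a
 * stand-in, NOT the function from the --full-name-hash series.
 * Every byte of the path influences the result, so two paths that
 * differ only in an early directory component no longer collide
 * just because their endings match.
 */
static uint32_t full_path_hash_sketch(const char *path)
{
	uint32_t hash = 2166136261u; /* FNV-1a 32-bit offset basis */

	assert(path);
	for (; *path; path++) {
		hash ^= (unsigned char)*path;
		hash *= 16777619u; /* FNV-1a 32-bit prime */
	}
	return hash;
}
```

The trade-off is that such a hash has no locality at all: paths with similar endings are scattered across the sort order instead of being placed next to each other.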
I suspect that you could go even further and intern the full path at
which each object occurs, and sort lexically by that. Just stringing
together all of the paths in linux.git only takes 3.099 MiB on my clone.
(Of course, that's unbounded in the number of objects and length of
their pathnames, but you could at least bound the latter by taking only
the last, say, 128 characters, which would be more than good enough for
the kernel, whose longest path is only 102 characters).
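A rough sketch of that idea, assuming each to-pack object carries a pointer to its interned path: sort lexically on at most the trailing 128 characters. The struct and function names (`path_entry`, `path_sort_key`, `sort_by_path`) are hypothetical, and a real delta sort would also have to consider object type and size.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PATH_KEY_MAX 128 /* bound taken from the suggestion above */

/* Hypothetical per-object record; only the path matters here. */
struct path_entry {
	const char *path; /* full path at which the object was found */
};

/* Return the trailing part of a path, at most PATH_KEY_MAX characters. */
static const char *path_sort_key(const char *path)
{
	size_t len;

	assert(path);
	len = strlen(path);
	return len > PATH_KEY_MAX ? path + (len - PATH_KEY_MAX) : path;
}

/* qsort() comparator: lexical order on the bounded path suffix. */
static int path_entry_cmp(const void *a, const void *b)
{
	const struct path_entry *ea = a, *eb = b;

	return strcmp(path_sort_key(ea->path), path_sort_key(eb->path));
}

static void sort_by_path(struct path_entry *entries, size_t nr)
{
	qsort(entries, nr, sizeof(*entries), path_entry_cmp);
}
```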
Some of the repositories that you've tested on I don't have easy access
to, so I wonder if either doing (a) that, or (b) using some fancier
context-sensitive hash (like SimHash or MinHash) would be beneficial.
I realize that this is taking us back to an idea you've already
presented to the list, but I think (to me, at least) the benefit and
simplicity of that approach has only become clear to me in hindsight
when seeing some alternatives. I would like to apologize for the time
you spent reworking this series back and forth to have the response be
"maybe we should have just done the first thing you suggested". Like I
said, I think to me it was really only clear in hindsight.
In any event, the major benefit to doing --full-name-hash would be that
*all* environments could benefit from the size reduction, not just those
that don't rely on certain other features.
Perhaps just --full-name-hash isn't quite as good by itself as the
--path-walk implementation that this series starts us off implementing.
So in that sense, maybe we want both, which I understand was the
original approach. I see a couple of options here:
- We take both, because doing --path-walk on top represents a
significant enough improvement that we are collectively OK with
taking on more code to improve a more narrow (but common) use-case.
- Or we decide that either the benefit isn't significant enough to
warrant an additional and relatively complex implementation, or in
other words that --full-name-hash by itself is good enough.
Again, I apologize for not having a clearer picture of this all to start
with, and I want to tell you specifically and sincerely that I
appreciate your patience as I wrap my head around all of this. I think
the benefit of --full-name-hash is much clearer and appealing to me now
having had both more time and seeing the series approached in a couple
of different ways. Let me know what you think.
Thanks,
Taylor
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-11-01 19:23 ` Taylor Blau
@ 2024-11-04 15:48 ` Derrick Stolee
2024-11-04 17:25 ` Jeff King
0 siblings, 1 reply; 67+ messages in thread
From: Derrick Stolee @ 2024-11-04 15:48 UTC (permalink / raw)
To: Taylor Blau, Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy
On 11/1/24 3:23 PM, Taylor Blau wrote:
> Hi Stolee,
>
> On Thu, Oct 31, 2024 at 06:26:57AM +0000, Derrick Stolee via GitGitGadget wrote:
>>
>> Introduction and relation to prior series
>> =========================================
>>
>> This is a new series that rerolls the initial "path-walk API" patches of my
>> RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
>> path-walk.h) presents a new way to walk objects such that trees and blobs
>> are walked in batches according to their path.
>>
>> This also replaces the previous version of ds/path-walk that was being
>> reviewed in [2]. The consensus was that the series was too long/dense and
>> could use some reduction in size. This series takes the first few patches,
>> but also makes some updates (which will be described later).
>>
>> [1]
>> https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
>> [RFC] Path-walk API and applications
>>
>> [2]
>> https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/
>> [PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
>
> I apologize for not having a better place to start discussing a topic
> which pertains to more than just this immediate patch series, but I
> figure here is as good a place as any to do so.
>
> From our earlier discussion, it seems to stand that the path-walk API
> is fundamentally incompatible with reachability bitmaps and
> delta-islands, making the series a non-starter in environments that
> rely significantly on one or both of those features. My understanding as a
> result is that the path-walk API and feature are more targeted towards
> improving client-side repacks and push performance, where neither of the
> aforementioned two features are used quite as commonly.
This is correct. I would go even farther to say that this approach was
designed first and foremost for Git clients and specifically their
performance while computing a thin packfile during "git push". The same
logic to help the push case happens to also help the "git repack" case
significantly.
> I was discussing this a bit off-list with Peff (who I hope will join the
> thread and share his own thoughts), but I wonder if it was a mistake to
> discard your '--full-name-hash' idea (or something similar, which I'll
> discuss in a bit below) from earlier.
I'd be happy to resurrect that series, adding in the learnings from
working on the path-walk feature. It helps that the current series adds
the path-walk API and has no conflicting changes in the pack-objects or
repack builtins. (I can handle those conflicts as things merge down.)
> (Repeating a few things that I am sure are obvious to you out loud so
> that I can get a grasp on them for my own understanding):
>
> It seems that the problems you've identified which result in poor repack
> performance occur when you have files at the same path, but they get
> poorly sorted in the delta selection window due to other paths having
> the same final 16 characters, so Git doesn't see that much better delta
> opportunities exist.
>
> Your series takes into account the full name when hashing, which seems
> to produce a clear win in many cases. I'm sure that there are some cases
> where it presents a modest regression in pack sizes, but I think that's
> fine and probably par for the course when making any changes like this,
> as there is probably no easy silver bullet here that uniformly improves
> all cases.
>
> I suspect that you could go even further and intern the full path at
> which each object occurs, and sort lexically by that. Just stringing
> together all of the paths in linux.git only takes 3.099 MiB on my clone.
> (Of course, that's unbounded in the number of objects and length of
> their pathnames, but you could at least bound the latter by taking only
> the last, say, 128 characters, which would be more than good enough for
> the kernel, whose longest path is only 102 characters).
When the optimization idea is to focus on the full path and not care
about the "locality" of the path name by its later bits, storing the
full name in a list and storing an index into that list would have a
very similar effect.
I'd be interested to explore the idea of storing the full path name.
Based on my exploration with the 'test-tool name-hash' test helper in
that series, I'm not sure that we will make significant gains by doing
so. Worth trying.
> Some of the repositories that you've tested on I don't have easy access
> to, so I wonder if either doing (a) that, or (b) using some fancier
> context-sensitive hash (like SimHash or MinHash) would be beneficial.
I don't know too much about SimHash or MinHash, but based on what I
could gather from some initial reading I'm not sure that they would be
effective without increasing the hash length. We'd also get a different
kind of locality, such as the appearance of a common word would be more
likely to affect the locality than the end of the path.
The size of the hash could probably be mitigated by storing it in the
list of all full paths and accessing them from the index stored on each
to-pack object.
> I realize that this is taking us back to an idea you've already
> presented to the list, but I think (to me, at least) the benefit and
> simplicity of that approach has only become clear to me in hindsight
> when seeing some alternatives. I would like to apologize for the time
> you spent reworking this series back and forth to have the response be
> "maybe we should have just done the first thing you suggested". Like I
> said, I think to me it was really only clear in hindsight.
I always assumed that we'd come back to it eventually. There is also the
extra bit about making the change to the name-hash compatible with the
way name-hashes are stored in the reachability bitmaps. That will need
some work before it is ready for prime time.
> In any event, the major benefit to doing --full-name-hash would be that
> *all* environments could benefit from the size reduction, not just those
> that don't rely on certain other features.
I disagree that all environments will prefer the --full-name-hash. I'm
currently repeating the performance tests right now, and I've added one.
The issues are:
1. The --full-name-hash approach sometimes leads to a larger pack when
using "git push" on the client, especially when the name-hash is
already effective for compressing across paths.
2. A depth 1 shallow clone cannot use previous versions of a path, so
those situations will want to use the normal name hash. This can be
accomplished simply by disabling the --full-name-hash option when
the --shallow option is present; a more detailed version could be
used to check for a large depth before disabling it. This case also
disables bitmaps, so that isn't something to worry about.
> Perhaps just --full-name-hash isn't quite as good by itself as the
> --path-walk implementation that this series starts us off implementing.
> So in that sense, maybe we want both, which I understand was the
> original approach. I see a couple of options here:
>
> - We take both, because doing --path-walk on top represents a
> significant enough improvement that we are collectively OK with
> taking on more code to improve a more narrow (but common) use-case.
Doing both doesn't help at all, since the --path-walk approach already
batches by the full path name. The --path-walk approach has a significant
benefit by doing a second pass by the standard name-hash to pick up on the
cross-path deltas. This is why the --path-walk approach with the standard
name hash has consistently provided the most-compact pack-files in all
tests.
Aside: there were some initial tests that showed the --path-walk option
led to slightly larger packfiles, but I've since discovered that those
cases were due to an incorrect walking of indexed paths. This is fixed
by the code in patch 5 of the current series and my WIP patches in [3]
have the performance numbers with this change.
[3] https://github.com/gitgitgadget/git/pull/1819
PATH WALK II: Add --path-walk option to 'git pack-objects'
> - Or we decide that either the benefit isn't significant enough to
> warrant an additional and relatively complex implementation, or in
> other words that --full-name-hash by itself is good enough.
I hope that I've sufficiently communicated that --full-name-hash is not
good enough by itself.
The point I was trying to make by submitting it first was that I believed
it was likely the easiest way for Git servers to gain 90% of the benefits
that the --path-walk approach provides while making it relatively easy to
integrate with other server-side features such as bitmaps and delta islands.
(Maybe the --path-walk approach could also be extended to be compatible
with those features, but it would be a significant investment that rebuilds
those features within the context of the new object walk instead of relying
on the existing implementations. That could easily be a blocker.)
> Again, I apologize for not having a clearer picture of this all to start
> with, and I want to tell you specifically and sincerely that I
> appreciate your patience as I wrap my head around all of this. I think
> the benefit of --full-name-hash is much clearer and appealing to me now
> having had both more time and seeing the series approached in a couple
> of different ways. Let me know what you think.
Thanks for taking the time to engage with the patches. I'm currently
rerunning my performance tests on a rebased copy of the --full-name-hash
patches and will submit a new version when it's ready.
Thanks,
-Stolee
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-11-04 15:48 ` Derrick Stolee
@ 2024-11-04 17:25 ` Jeff King
2024-11-05 0:11 ` Junio C Hamano
0 siblings, 1 reply; 67+ messages in thread
From: Jeff King @ 2024-11-04 17:25 UTC (permalink / raw)
To: Derrick Stolee
Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, ps, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy
On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote:
> > I was discussing this a bit off-list with Peff (who I hope will join the
> > thread and share his own thoughts), but I wonder if it was a mistake to
> > discard your '--full-name-hash' idea (or something similar, which I'll
> > discuss in a bit below) from earlier.
>
> I'd be happy to resurrect that series, adding in the learnings from
> working on the path-walk feature. It helps that the current series adds
> the path-walk API and has no conflicting changes in the pack-objects or
> repack builtins. (I can handle those conflicts as things merge down.)
Adding my two cents, the discussion we had came after reading this post:
https://www.jonathancreamer.com/how-we-shrunk-our-git-repo-size-by-94-percent/
I think a few of the low-level details in there are confusing, but it
seemed to me that most of the improvement he mentions is just about
finding better delta candidates. And it seems obvious that our current
pack_name_hash() is pretty rudimentary as context-sensitive hashing
goes and won't do well for long paths with similar endings.
So just swapping that out for something better seems like an easy thing
to do regardless of whether we pursue --path-walk. It doesn't
drastically change how we choose delta pairs so it's not much code and
it shouldn't conflict with other features. And the risk of making
anything worse should be pretty low.
I wouldn't at all be surprised if --path-walk can do better, but if we
do the easy thing first then I think it gives us a better idea of the
cost/benefit it's providing.
I suspect there's room for both in the long run. You seem to be focused
on push size and cost, whereas I think Taylor and I are more interested
in overall repo size and cost of serving bitmapped fetches.
> When the optimization idea is to focus on the full path and not care
> about the "locality" of the path name by its later bits, storing the
> full name in a list and storing an index into that list would have a
> very similar effect.
>
> I'd be interested to explore the idea of storing the full path name.
> Based on my exploration with the 'test-tool name-hash' test helper in
> that series, I'm not sure that we will make significant gains by doing
> so. Worth trying.
The way I look at it is a possible continuum. We want to use pathnames
as a way to sort delta candidates near each other, since we expect them
to have high locality with respect to delta-able contents. The current
name_hash uses a very small bit of that path information and throws away
most of it. The other extreme end is holding the whole path. We may want
to end up in the middle for two reasons:
1. Dealing with whole paths might be costly (though I'm not yet
convinced of that; outside of pathological cases, the number of
paths in a repo tends to pale in comparison to the number of
objects, and the per-object costs dominate during repacking).
2. It's possible that over-emphasizing the path might be a slightly
worse heuristic (and I think this is a potential danger of
--path-walk, too). We still want to find candidate pairs that were
copied or renamed, for example, or that substantially share content
found in different parts of the tree.
So it would be interesting to be able to see the performance of various
points on that line, from full path down to partial paths down to longer
hashes down to the current hash. The true extreme end of course is no
path info at all, but I think we know that sucks; that's why we
implemented the bitmap name-hash extension in the first place.
> I don't know too much about SimHash or MinHash, but based on what I
> could gather from some initial reading I'm not sure that they would be
> effective without increasing the hash length. We'd also get a different
> kind of locality, such as the appearance of a common word would be more
> likely to affect the locality than the end of the path.
Good point. This is all heuristics, of course, but I suspect that the
order of the path is important, and that foo/bar.c and bar/foo.c are
unlikely to be good matches.
> > I realize that this is taking us back to an idea you've already
> > presented to the list, but I think (to me, at least) the benefit and
> > simplicity of that approach has only become clear to me in hindsight
> > when seeing some alternatives. I would like to apologize for the time
> > you spent reworking this series back and forth to have the response be
> > "maybe we should have just done the first thing you suggested". Like I
> > said, I think to me it was really only clear in hindsight.
>
> I always assumed that we'd come back to it eventually. There is also the
> extra bit about making the change to the name-hash compatible with the
> way name-hashes are stored in the reachability bitmaps. That will need
> some work before it is ready for prime time.
Having worked on that feature of bitmaps, I'm not too worried about it.
I think we'd just need to:
- introduce a new bitmap ext with a flag (HASH_CACHE_V2 or something,
either with the new hash, or with a "version" byte at the start for
extensibility).
- when bitmaps are not in use, we're free to use whichever hash we
want internally. If the new hash is consistently better, we'd
probably just enable it by default.
- when packing using on-disk bitmaps, use internally whichever format
the on-disk file provided. Technically the format could even provide
both (in which case we'd prefer the new hash), but I don't see much
point.
- when writing bitmaps, use whichever hash the command-line options
asked for. There's an off chance somebody might want to generate a
.bitmap file whose hashes will be understood by an older version of
git, in which case they'd use --no-full-name-hash or whatever while
repacking.
If we're considering full paths, then that is potentially a bit more
involved, just because we'd want the format to avoid repeating duplicate
paths for each object (plus they're no longer fixed-size). So probably
an extension with packed NUL-terminated path strings, plus a set of
fixed-length offsets into that block, one per object.
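To make that layout idea concrete, here is a minimal sketch of how a reader might resolve a path from such an extension. Everything named here (struct path_table, path_for_object(), the example data) is hypothetical and not part of Git's bitmap format. It only illustrates packed NUL-terminated strings plus one fixed-width offset per object, so that objects at the same path share one stored string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of a "full path" bitmap extension. */
struct path_table {
	const char *strings;     /* packed NUL-terminated path strings */
	size_t strings_len;      /* total bytes in the string block */
	const uint32_t *offsets; /* one fixed-width offset per object */
	size_t nr_objects;
};

/* Return the path recorded for object number i. */
const char *path_for_object(const struct path_table *t, size_t i)
{
	assert(i < t->nr_objects);
	assert(t->offsets[i] < t->strings_len);
	return t->strings + t->offsets[i];
}

/* Example: objects 0 and 2 share the single stored copy of "Makefile". */
const char example_strings[] = "Makefile\0src/main.c";
const uint32_t example_offsets[] = { 0, 9, 0 };
const struct path_table example_table = {
	example_strings, sizeof(example_strings),
	example_offsets, 3,
};
```

Because the offsets are fixed-width, looking up object i stays a constant-time array access even though the paths themselves vary in length.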
> I disagree that all environments will prefer the --full-name-hash. I'm
> currently repeating the performance tests right now, and I've added one.
> The issues are:
>
> 1. The --full-name-hash approach sometimes leads to a larger pack when
> using "git push" on the client, especially when the name-hash is
> already effective for compressing across paths.
That's interesting. I wonder which cases get worse, and if a larger
window size might help. I.e., presumably we are pushing the candidates
further away in the sorted delta list.
> 2. A depth 1 shallow clone cannot use previous versions of a path, so
> those situations will want to use the normal name hash. This can be
> accomplished simply by disabling the --full-name-hash option when
> the --shallow option is present; a more detailed version could be
> used to check for a large depth before disabling it. This case also
> disables bitmaps, so that isn't something to worry about.
I'm not sure why a larger hash would be worse in a shallow clone. As you
note, with only one version of each path the name-similarity heuristic
is not likely to buy you much. But I'd have thought that would be true
for the existing name hash as well as a longer one. Maybe this is the
"over-emphasizing" case.
-Peff
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-11-04 17:25 ` Jeff King
@ 2024-11-05 0:11 ` Junio C Hamano
2024-11-08 15:17 ` Derrick Stolee
0 siblings, 1 reply; 67+ messages in thread
From: Junio C Hamano @ 2024-11-05 0:11 UTC (permalink / raw)
To: Jeff King
Cc: Derrick Stolee, Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, ps, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy
Jeff King <peff@peff.net> writes:
> On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote:
>> I disagree that all environments will prefer the --full-name-hash. I'm
>> currently repeating the performance tests right now, and I've added one.
>> The issues are:
>>
>> 1. The --full-name-hash approach sometimes leads to a larger pack when
>> using "git push" on the client, especially when the name-hash is
>> already effective for compressing across paths.
>
> That's interesting. I wonder which cases get worse, and if a larger
> window size might help. I.e., presumably we are pushing the candidates
> further away in the sorted delta list.
>
>> 2. A depth 1 shallow clone cannot use previous versions of a path, so
>> those situations will want to use the normal name hash. This can be
>> accomplished simply by disabling the --full-name-hash option when
>> the --shallow option is present; a more detailed version could be
>> used to check for a large depth before disabling it. This case also
>> disables bitmaps, so that isn't something to worry about.
>
> I'm not sure why a larger hash would be worse in a shallow clone. As you
> note, with only one version of each path the name-similarity heuristic
> is not likely to buy you much. But I'd have thought that would be true
> for the existing name hash as well as a longer one. Maybe this is the
> "over-emphasizing" case.
I too am curious to hear Derrick explain the above points and what
was learned from the performance tests. The original hash was
designed to place files that are renamed across directories closer
to each other in the list sorted by the name hash, so a/Makefile and
b/Makefile would likely be treated as delta-base candidates while
foo/bar.c and bar/foo.c are treated as unrelated things. A push
of a handful of commits that rename paths would likely place the
rename source of older commits and rename destination of newer
commits into the same delta chain, even with a smaller delta window.
In such a history, uniformly-distributed-without-regard-to-renames
hash is likely to make them into two distinct delta chains, leading
to less optimal delta-base selection.
A whole-repository packing, or a large push or fetch, of the same
history with renamed files is affected a lot less by such negative
effects of full-name hash. When generating a pack with more commits
than the "--window", use of the original hash would mean blobs from
paths that share similar names (e.g., "Makefile"s everywhere in the
directory hierarchy) are placed close to each other, but full-name
hash will likely group the blobs from exactly the same path and
nothing else together, and the resulting delta-chain for identical
(and not similar) paths would be sufficiently long. A long delta
chain has to be broken into multiple chains _anyway_ due to finite
"--depth" setting, so placing blobs from each path into its own
(initial) delta chain, completely ignoring renamed paths, would
likely give us a long enough (initial) delta chain to be split at
the depth limit.
It would lead to a good delta-base selection with smaller window
size quite efficiently with full-name hash.
I think a full-name hash forces a single-commit pack of a wide tree
to give up on deltified blobs, but with the original hash, at least
similar and common files (e.g. Makefile and COPYING) would sit close
together in the delta queue and can be deltified with each other,
which may be where the inefficiency comes from when full-name hash
is used.
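The locality described above can be illustrated with a small sketch modeled on the classic name hash. This is an approximation written for illustration, and the function name name_hash_sketch() is hypothetical: each non-whitespace byte lands in the top bits and everything seen earlier is shifted down by two bits, so roughly the last sixteen characters contribute and the final characters dominate the sort order.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative sketch of a "last characters count most" name hash.
 * Paths with the same trailing characters end up numerically close.
 */
uint32_t name_hash_sketch(const char *name)
{
	uint32_t hash = 0;

	assert(name != NULL);
	for (; *name; name++) {
		unsigned char c = (unsigned char)*name;

		if (isspace(c))
			continue;
		/* Older characters fade out two bits at a time. */
		hash = (hash >> 2) + ((uint32_t)c << 24);
	}
	return hash;
}
```

With this sketch, "a/Makefile" and "b/Makefile" differ only in low-order bits and would sit next to each other in a list sorted by hash, while "foo/bar.c" and "bar/foo.c" differ in much higher bits and land further apart.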
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-11-05 0:11 ` Junio C Hamano
@ 2024-11-08 15:17 ` Derrick Stolee
2024-11-11 2:56 ` Junio C Hamano
2024-11-11 22:04 ` Jeff King
0 siblings, 2 replies; 67+ messages in thread
From: Derrick Stolee @ 2024-11-08 15:17 UTC (permalink / raw)
To: Junio C Hamano, Jeff King
Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, ps, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy
On 11/4/24 7:11 PM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
>> On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote:
>>> I disagree that all environments will prefer the --full-name-hash. I'm
>>> currently repeating the performance tests right now, and I've added one.
>>> The issues are:
>>>
>>> 1. The --full-name-hash approach sometimes leads to a larger pack when
>>> using "git push" on the client, especially when the name-hash is
>>> already effective for compressing across paths.
>>
>> That's interesting. I wonder which cases get worse, and if a larger
>> window size might help. I.e., presumably we are pushing the candidates
>> further away in the sorted delta list.
I think the cases that make things get worse with --full-name-hash are:
1. The presence of renames, partitioning objects that used to fit into
the same bucket (in the case of directory renames).
2. Some standard kinds of files may appear several times across the
tree but do not change very often and are similar across paths.
3. Common patterns across similar file types, such as similar includes
in .c and .h files or other kinds of boilerplate in different
languages.
A larger window size will always expand the range of possible deltas, at
a cost to the time taken to compute the deltas. My first experiment in
these repositories was to increase the window size to 250 (from the default
10). This caused a very slow repack, but the repositories shrank.
For example, the big JavaScript monorepo that repacked to ~100 GB with
default settings would repack to ~30 GB with --window=250. This was an
indicator that delta compression would work if we can find the right pairs
to use for deltas.
The point of the two strategies (--full-name-hash and --path-walk) is
about putting objects close to each other in better ways than the name
hash sort.
>>> 2. A depth 1 shallow clone cannot use previous versions of a path, so
>>> those situations will want to use the normal name hash. This can be
>>> accomplished simply by disabling the --full-name-hash option when
>>> the --shallow option is present; a more detailed version could be
>>> used to check for a large depth before disabling it. This case also
>>> disables bitmaps, so that isn't something to worry about.
>>
>> I'm not sure why a larger hash would be worse in a shallow clone. As you
>> note, with only one version of each path the name-similarity heuristic
>> is not likely to buy you much. But I'd have thought that would be true
>> for the existing name hash as well as a longer one. Maybe this is the
>> "over-emphasizing" case.
I'm confused by your wording of "larger hash" because the hash size
is the exact same: 32 bits. It's just that the --full-name-hash option
has fewer collisions by abandoning the hope of locality.
In a depth 1 shallow clone, there are no repeated paths, so any hash
collisions are true collisions instead of good candidates for deltas.
The full name hash is essentially random, so the delta compression
algorithm basically says:
1. Sort by type.
2. Within each type, sort the objects randomly.
With that sort, the delta compression scan is less effective than the
standard name hash.
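To see why a whole-path hash behaves "essentially randomly" with respect to similar names, consider the following stand-in. It uses 32-bit FNV-1a, which is an assumption made purely for illustration and is not the hash proposed for --full-name-hash; the function name full_path_hash_standin() is hypothetical as well. The point is only that every byte of the path influences all 32 bits, so similar names are not kept near each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in only: 32-bit FNV-1a over the entire path. Unlike a hash
 * that favors trailing characters, a change anywhere in the path
 * scrambles the whole value.
 */
uint32_t full_path_hash_standin(const char *path)
{
	uint32_t hash = 2166136261u; /* FNV-1a offset basis */

	assert(path != NULL);
	for (; *path; path++) {
		hash ^= (unsigned char)*path;
		hash *= 16777619u; /* FNV-1a prime */
	}
	return hash;
}
```

Identical paths still hash identically, which is what groups versions of one file together in a long history. In a depth 1 shallow clone there are no such repeats, so sorting by a value like this orders the objects within each type more or less arbitrarily.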
> I too am curious to hear Derrick explain the above points and what
> was learned from the performance tests. The original hash was
> designed to place files that are renamed across directories closer
> to each other in the list sorted by the name hash, so a/Makefile and
> b/Makefile would likely be treated as delta-base candidates while
> foo/bar.c and bar/foo.c are treated as unrelated things. A push
> of a handful of commits that rename paths would likely place the
> rename source of older commits and rename destination of newer
> commits into the same delta chain, even with a smaller delta window.
> In such a history, uniformly-distributed-without-regard-to-renames
> hash is likely to make them into two distinct delta chains, leading
> to less optimal delta-base selection.
Yes. This is the downside of the --full-name-hash compared to the
standard name hash. When repacking an entire repository, the effect
of these renames is typically not important in the long run as it's
basically a single break in the delta chain. The downside comes in
when doing a small fetch or push where the rename has more impact.
The --path-walk approach does not suffer from this problem because
it has a second pass that sorts by the name hash and looks for
better deltas than the ones that already exist. Thus, it gets the
best of both worlds.
The performance impact of the two passes of the --path-walk
approach is interesting, as you'd typically expect this to always
be slower. However:
1. The delta compression within each batch only compares the
objects within that batch. We do not compare these objects to
unrelated objects, which can be expensive and wasteful. This
also means that small batches may even be smaller than the
delta window, reducing the number of comparisons.
2. In the second pass, the delta calculation can short-circuit if
the computed delta would be larger than the current-best delta.
Thus, the good deltas from the first pass make the second pass
faster.
> A whole-repository packing, or a large push or fetch, of the same
> history with renamed files is affected a lot less by such negative
> effects of full-name hash. When generating a pack with more commits
> than the "--window", use of the original hash would mean blobs from
> paths that share similar names (e.g., "Makefile"s everywhere in the
> directory hierarchy) are placed close to each other, but full-name
> hash will likely group the blobs from exactly the same path and
> nothing else together, and the resulting delta-chain for identical
> (and not similar) paths would be sufficiently long. A long delta
> chain has to be broken into multiple chains _anyway_ due to finite
> "--depth" setting, so placing blobs from each path into its own
> (initial) delta chain, completely ignoring renamed paths, would
> likely give us a long enough (initial) delta chain to be split at
> the depth limit.
>
> It would lead to a good delta-base selection with smaller window
> size quite efficiently with full-name hash.
>
> I think a full-name hash forces a single-commit pack of a wide tree
> to give up on deltified blobs, but with the original hash, at least
> similar and common files (e.g. Makefile and COPYING) would sit close
> together in the delta queue and can be deltified with each other,
> which may be where the inefficiency comes from when full-name hash
> is used.
Yes, this is a good summary of why this works for the data
efficiency in long histories. Your earlier observations are why
the full-name hash has demonstrated issues with smaller time scales.
These numbers are carefully detailed in the performance tests in
the refreshed series [1]. The series also has a way to disable the
full-name hash when serving a shallow clone for this reason.
[1] https://lore.kernel.org/git/pull.1823.git.1730775907.gitgitgadget@gmail.com/
Thanks,
-Stolee
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-11-08 15:17 ` Derrick Stolee
@ 2024-11-11 2:56 ` Junio C Hamano
2024-11-11 13:20 ` Derrick Stolee
2024-11-11 21:55 ` Jeff King
2024-11-11 22:04 ` Jeff King
1 sibling, 2 replies; 67+ messages in thread
From: Junio C Hamano @ 2024-11-11 2:56 UTC (permalink / raw)
To: Derrick Stolee
Cc: Jeff King, Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, ps, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy
Derrick Stolee <stolee@gmail.com> writes:
>> Jeff King <peff@peff.net> writes:
>>
>>> That's interesting. I wonder which cases get worse, and if a larger
>>> window size might help. I.e., presumably we are pushing the candidates
>>> further away in the sorted delta list.
>
> I think the cases that make things get worse with --full-name-hash are:
>
> 1. The presence of renames, partitioning objects that used to fit into
> the same bucket (in the case of directory renames).
>
> 2. Some standard kinds of files may appear several times across the
> tree but do not change very often and are similar across paths.
>
> 3. Common patterns across similar file types, such as similar includes
> in .c and .h files or other kinds of boilerplate in different
> languages.
> ...
> In a depth 1 shallow clone, there are no repeated paths, so any hash
> collisions are true collisions instead of good candidates for deltas.
Or #2 and #3 above, where large boilerplates are shared across
similarly named files.
> Yes. This is the downside of the --full-name-hash compared to the
> standard name hash. When repacking an entire repository, the effect
> of these renames is typically not important in the long run as it's
> basically a single break in the delta chain. The downside comes in
> when doing a small fetch or push where the rename has more impact.
Yes. Due to --depth limit, we need to break delta chains somewhere
anyway, and a rename boundary is just as good a place as any other in
a sufficiently long chain.
> The --path-walk approach does not suffer from this problem because
> it has a second pass that sorts by the name hash and looks for
> better deltas than the ones that already exist. Thus, it gets the
> best of both worlds.
Yes, at the cost of being more complex :-)
> The performance impact of the two passes of the --path-walk
> approach is interesting, as you'd typically expect this to always
> be slower. However:
>
> 1. The delta compression within each batch only compares the
> objects within that batch. We do not compare these objects to
> unrelated objects, which can be expensive and wasteful. This
> also means that small batches may even be smaller than the
> delta window, reducing the number of comparisons.
Interesting.
> 2. In the second pass, the delta calculation can short-circuit if
> the computed delta would be larger than the current-best delta.
> Thus, the good deltas from the first pass make the second pass
> faster.
Yes, the early give-up codepath helps when you have already found a
semi-decent delta base.
Thanks for a good summary.
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-11-11 2:56 ` Junio C Hamano
@ 2024-11-11 13:20 ` Derrick Stolee
2024-11-11 21:55 ` Jeff King
1 sibling, 0 replies; 67+ messages in thread
From: Derrick Stolee @ 2024-11-11 13:20 UTC (permalink / raw)
To: Junio C Hamano
Cc: Jeff King, Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, ps, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy
On 11/10/24 9:56 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>> The --path-walk approach does not suffer from this problem because
>> it has a second pass that sorts by the name hash and looks for
>> better deltas than the ones that already exist. Thus, it gets the
>> best of both worlds.
>
> Yes, at the cost of being more complex :-)
True. I hope that the results sufficiently justify the complexity.
Further, the later uses of the path-walk API (git backfill and git
survey) build upon this up-front investment.
Thanks,
-Stolee
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-11-11 2:56 ` Junio C Hamano
2024-11-11 13:20 ` Derrick Stolee
@ 2024-11-11 21:55 ` Jeff King
2024-11-11 22:29 ` Junio C Hamano
1 sibling, 1 reply; 67+ messages in thread
From: Jeff King @ 2024-11-11 21:55 UTC (permalink / raw)
To: Junio C Hamano
Cc: Derrick Stolee, Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, ps, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy
On Mon, Nov 11, 2024 at 11:56:01AM +0900, Junio C Hamano wrote:
> > Yes. This is the downside of the --full-name-hash compared to the
> > standard name hash. When repacking an entire repository, the effect
> > of these renames is typically not important in the long run as it's
> > basically a single break in the delta chain. The downside comes in
> > when doing a small fetch or push where the rename has more impact.
>
> Yes. Due to --depth limit, we need to break delta chains somewhere
> anyway, and a rename boundary is just as good a place as any other in
> a sufficiently long chain.
We don't necessarily have to break the chains due to depth limits,
because they are not always linear. They can end up as bushy trees,
like:
A <- B <- C
\
<- D <- E
\
<- F
We might get that way because it mirrors the shape of history (e.g., if
D and E happened on a side branch from B and C). But we can also get
there from a linear history by choosing a delta that balances size
versus depth. In the above example, the smallest sequence might be:
A <- B <- C <- D <- E <- F
but if we have a limit of 3 and A-B-C already exists, then we might
reject the C-D delta and choose A-D instead (or I guess if it really is
linear, probably B-D is more likely). That may be a larger delta by
itself, but it is still better than storing a full copy of the object.
And we find it by having those candidates close together in the sorted
list, reject C-D based on depth, and then moving on to the next
candidate.
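The selection step in that example can be sketched as follows. This is a deliberately simplified model with hypothetical names (struct candidate, pick_base()) and made-up sizes. It assumes a hard cutoff on chain length, whereas the real heuristic in pack-objects is more nuanced about trading delta size against depth.

```c
#include <assert.h>
#include <stddef.h>

/* One potential delta base for the object being packed. */
struct candidate {
	const char *name;  /* label for the potential base */
	size_t chain_len;  /* objects in its chain, itself included */
	size_t delta_size; /* size of the delta if this base is used */
};

/*
 * Return the index of the smallest delta whose base still leaves room
 * under max_chain, or -1 if every candidate is too deep (the object
 * would then be stored in full).
 */
int pick_base(const struct candidate *c, size_t nr, size_t max_chain)
{
	int best = -1;
	size_t i;

	assert(max_chain > 0);
	for (i = 0; i < nr; i++) {
		if (c[i].chain_len >= max_chain)
			continue; /* e.g. C in A <- B <- C with a limit of 3 */
		if (best < 0 || c[i].delta_size < c[best].delta_size)
			best = (int)i;
	}
	return best;
}

/* Candidates for D, with invented delta sizes: C is smallest but deepest. */
const struct candidate example_bases[] = {
	{ "A", 1, 400 },
	{ "B", 2, 250 },
	{ "C", 3, 100 },
};
```

With a limit of 3 the sketch skips C and picks B, giving the bushy A <- B <- D shape. With a limit of 4 it would pick C and extend the linear chain instead.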
I'm not sure in practice how often we find these kinds of deltas. If you
look at, say, all the deltas for "Makefile" in git.git like this:
git rev-list --objects --all --reflog --full-history -- Makefile |
perl -lne 'print $1 if /(.*) Makefile/' |
git cat-file --batch-check='%(objectsize:disk) %(objectname) %(deltabase)'
my repo has 33 full copies (you can see non-deltas by grepping for
"0{40}$" in the output) out of 4473 total. So it's not like we never
break chains. But we can use graphviz to visualize it by piping the
above through:
perl -alne '
BEGIN { print "digraph {" }
print "node_$F[1] [label=$F[0]]";
print "node_$F[1] -> node_$F[2]" if $F[2] !~ /^0+$/;
END { print "}" }
'
and then piping it through "dot" to make an svg, or using an interactive
viewer like "xdot" (the labels are the on-disk size of each object). I
see a lot of wide parts of the graph in the output.
Of course this may all depend on packing patterns, too. I did my
investigations after running "git repack -adf" to generate what should
be a pretty reasonable pack. You might see something different from
incremental repacking over time.
I'm not sure what any of this means for --path-walk, of course. ;)
Ultimately we care about resulting size and time to compute, so if it
can do better on those metrics then it doesn't matter what the graph
looks like. But maybe those tools can help us understand where things
might go better (or worse) with various approaches.
-Peff
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-11-11 21:55 ` Jeff King
@ 2024-11-11 22:29 ` Junio C Hamano
0 siblings, 0 replies; 67+ messages in thread
From: Junio C Hamano @ 2024-11-11 22:29 UTC (permalink / raw)
To: Jeff King
Cc: Derrick Stolee, Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, ps, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy
Jeff King <peff@peff.net> writes:
>> Yes. Due to --depth limit, we need to break delta chains somewhere
>> anyway, and a rename boundary is just as good a place as any other in
>> a sufficiently long chain.
>
> We don't necessarily have to break the chains due to depth limits,
> because they are not always linear. They can end up as bushy trees,
True. And being able to pair blobs before and after a rename will
give us more candidates to place in a single bushy tree, so in that
sense, with a short segment of history, it is understandable that
the full-name hash fails to have as many candidates as the original
hash gives us. But with a sufficiently large number of blobs at the
same path that are similar (i.e. not a "pushing a short segment of
history", but an initial clone), splitting what could be one delta
family into two delta families at the rename boundary is not too
bad, as long as both halves have enough blobs to deltify against
each other.
> I'm not sure in practice how often we find these kinds of deltas. If you
> look at, say, all the deltas for "Makefile" in git.git like this:
>
> git rev-list --objects --all --reflog --full-history -- Makefile |
> perl -lne 'print $1 if /(.*) Makefile/' |
> git cat-file --batch-check='%(objectsize:disk) %(objectname) %(deltabase)'
>
> my repo has 33 full copies (you can see non-deltas by grepping for
> "0{40}$" in the output) out of 4473 total. So it's not like we never
> break chains. But we can use graphviz to visualize it by piping the
> above through:
>
> perl -alne '
> BEGIN { print "digraph {" }
> print "node_$F[1] [label=$F[0]]";
> print "node_$F[1] -> node_$F[2]" if $F[2] !~ /^0+$/;
> END { print "}" }
> '
>
> and then piping it through "dot" to make an svg, or using an interactive
> viewer like "xdot" (the labels are the on-disk size of each object). I
> see a lot of wide parts of the graph in the output.
>
> Of course this may all depend on packing patterns, too. I did my
> investigations after running "git repack -adf" to generate what should
> be a pretty reasonable pack. You might see something different from
> incremental repacking over time.
That is very true. I forgot that we do things to encourage bushy
delta-base selection. One thing I also am happy to see is the
effect of our "clever" delta-base selection, where the algorithm
does not blindly favor the delta-base that makes the resulting delta
absolutely minimal, but takes the depth of the delta-base into
account (i.e. a base at a much shallower depth is preferred over a
base near the depth limit, even if it results in a slightly larger
delta data).
> I'm not sure what any of this means for --path-walk, of course. ;)
> Ultimately we care about resulting size and time to compute, so if it
> can do better on those metrics then it doesn't matter what the graph
> looks like.
True, too. Another thing that we care about is the time to access
data, and favoring shallow delta chains, even with the help of the
in-core delta-base cache, has merit.
Thanks.
* Re: [PATCH 0/6] PATH WALK I: The path-walk API
2024-11-08 15:17 ` Derrick Stolee
2024-11-11 2:56 ` Junio C Hamano
@ 2024-11-11 22:04 ` Jeff King
1 sibling, 0 replies; 67+ messages in thread
From: Jeff King @ 2024-11-11 22:04 UTC (permalink / raw)
To: Derrick Stolee
Cc: Junio C Hamano, Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, ps, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy
On Fri, Nov 08, 2024 at 10:17:24AM -0500, Derrick Stolee wrote:
> > > I'm not sure why a larger hash would be worse in a shallow clone. As you
> > > note, with only one version of each path the name-similarity heuristic
> > > is not likely to buy you much. But I'd have thought that would be true
> > > for the existing name hash as well as a longer one. Maybe this is the
> > > "over-emphasizing" case.
>
> I'm confused by your wording of "larger hash" because the hash size
> is the exact same: 32 bits. It's just that the --full-name-hash option
> has fewer collisions by abandoning the hope of locality.
>
> In a depth 1 shallow clone, there are no repeated paths, so any hash
> collisions are true collisions instead of good candidates for deltas.
> The full name hash is essentially random, so the delta compression
> algorithm basically says:
>
> 1. Sort by type.
> 2. Within each type, sort the objects randomly.
>
> With that sort, the delta compression scan is less effective than the
> standard name hash.
Ah, OK. I'm sorry, I had not really investigated the full-name-hash and
misunderstood what it was doing. I thought we were using a larger hash
in order to give more locality hints. I.e., to let us distinguish
"foo/long-path.c" from "bar/long-path.c", but still retain some locality
between them.
But it is throwing away all locality outside of the exact name. So yes,
it would never find a rename from "foo/" to "bar/", as those mean the
name-hash is effectively random.
So I guess getting back to what Taylor and I had talked about off-list:
we were wondering if there was a way to provide a better "slider" for
locality as part of the normal delta candidate sorting process. I think
you could store the full pathname and then do a sort based on the
reverse of the string (so "foo.c" and "bar.c" would compare "c", then
".", etc). And that would let you tell the difference between
"foo/long-path.c" and "bar/long-path.c" (preferring the identical
filenames over the merely similar ones), but still sort them together
versus "some-unrelated-path.c".
And what I wondered (and what I had initially thought full-name-hash was
doing) was whether we could store some fixed-size hash that produces a
similar distribution (modulo collisions) to what that reverse-name sort
would do.
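The reverse-of-the-string comparison can be sketched as a plain comparator. The name reverse_path_cmp() is hypothetical, and this covers only the full-pathname variant of the idea; it says nothing about how a fixed-size hash with a similar distribution might be built.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Compare two paths byte by byte starting from the last character, so
 * paths sharing a longer suffix sort closer together.
 */
int reverse_path_cmp(const char *a, const char *b)
{
	size_t la, lb;

	assert(a != NULL && b != NULL);
	la = strlen(a);
	lb = strlen(b);
	while (la && lb) {
		unsigned char ca = (unsigned char)a[--la];
		unsigned char cb = (unsigned char)b[--lb];

		if (ca != cb)
			return ca < cb ? -1 : 1;
	}
	/* One path is a suffix of the other: the shorter sorts first. */
	if (la)
		return 1;
	if (lb)
		return -1;
	return 0;
}
```

Sorting with this comparator places "foo/long-path.c" directly next to "bar/long-path.c", since they share the suffix "/long-path.c", while "some-unrelated-path.c" shares only "-path.c" with them and sorts apart.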
-Peff
* [PATCH v2 0/6] PATH WALK I: The path-walk API
2024-10-31 6:26 [PATCH 0/6] PATH WALK I: The path-walk API Derrick Stolee via GitGitGadget
` (7 preceding siblings ...)
2024-11-01 19:23 ` Taylor Blau
@ 2024-11-09 19:41 ` Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
` (7 more replies)
8 siblings, 8 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-11-09 19:41 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee
Introduction and relation to prior series
=========================================
This is a new series that rerolls the initial "path-walk API" patches of my
RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
path-walk.h) presents a new way to walk objects such that trees and blobs
are walked in batches according to their path.
This also replaces the previous version of ds/path-walk that was being
reviewed in [2]. The consensus was that the series was too long/dense and
could use some reduction in size. This series takes the first few patches,
but also makes some updates (which will be described later).
[1]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
[RFC] Path-walk API and applications
[2]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/
[PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
This series only introduces the path-walk API, but does so to the full
complexity required to later add the integration with git pack-objects to
improve packing compression in both time and space for repositories with
many name hash collisions. The compression also at least improves for other
repositories, but may not always have an improvement in time.
Some of the changes that are present in this series that differ from the
previous version are motivated directly by discoveries made by testing the
feature in Git for Windows and microsoft/git forks that shipped these
features for fast delivery of these improvements to users who needed them.
That testing across many environments informed some things that needed to be
changed, and in this series those changes are checked by tests in the
t6601-path-walk.sh test script and the test-tool path-walk test helper.
Thus, the code being introduced in this series is covered by tests even
though it is not integrated into the git executable.
Discussion of follow-up applications
====================================
By splitting this series out into its own, I was able to reorganize the
patches such that each application can be built independently off of this
series. These are available as pending PRs in gitgitgadget/git:
* Better delta compression with 'git pack-objects' [3]: This application
allows an option in 'git pack-objects' to change how objects are walked
in order to group objects with the same path for early delta compression
before using the name hash sort to look for cross-path deltas. This helps
significantly in repositories with many name-hash collisions. This
reduces the size of 'git push' packfiles via a config option and reduces
the total repo size in 'git repack'.
* The 'git backfill' command [4]: This command downloads missing blobs in a
blobless partial clone. In order to save space and network bandwidth, it
assumes that objects at a common path are likely to delta well with each
other, so it downloads missing blobs in batches via the path-walk API.
This presents a way to use blobless clones as a pseudo-resumable clone,
since the initial clone of commits and trees is a smaller initial
download and the batch size allows downloading blobs incrementally. When
pairing this command with the sparse-checkout feature, the path-walk API
is adjusted to focus on the paths within the sparse-checkout. This allows
the user to only download the files they are likely to need when
inspecting history within their scope without downloading the entire
repository history.
* The 'git survey' command [5]. This application begins the work to mimic
the behavior of git-sizer, but to use internal data structures for better
performance and careful understanding of how objects are stored. Using
the path-walk API, paths with many versions can be considered in a batch
and sorted into a list to report the paths that contribute most to the
size of the repository. A version of this command was used to help
confirm the issues with the name hash collisions. It was also used to
diagnose why some repacks using the --path-walk option were taking more
space than repacks without it for some repositories. (More on this later.)
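As a rough illustration of the batching idea shared by these applications,
the sketch below groups objects by path and numbers the groups in the order
each path is first seen. It is not the path-walk implementation, which
discovers paths by walking trees depth-first. The function name
'batch_by_path' and the "path oid" input format are assumptions made up for
this example, and paths containing whitespace are not handled.

```shell
#!/bin/sh
# Hypothetical helper: read "path oid" pairs on stdin and print
# "batch:path:oid", giving all objects at the same path one batch number.
batch_by_path () {
	awk '
		# First time a path is seen: give it the next batch number.
		!($1 in batch) {
			batch[$1] = nr_batches++
			order[batch[$1]] = $1
		}
		# Remember every OID under the path it was found at.
		{ oids[$1] = oids[$1] " " $2 }
		END {
			for (i = 0; i < nr_batches; i++) {
				n = split(oids[order[i]], list, " ")
				for (j = 1; j <= n; j++)
					printf "%d:%s:%s\n", i, order[i], list[j]
			}
		}
	'
}
```

Feeding it two versions of one path and one version of another puts the two
versions in the same batch, which is the property the applications above
rely on when they look for deltas or download blobs in groups.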
Question for reviewers: I am prepped to send these three applications to the
mailing list, but I'll refrain for now to avoid causing too much noise for
folks. Would you like to see them on-list while this series is under review?
Or would you prefer to explore the PRs ([3], [4], and [5])?
[3] https://github.com/gitgitgadget/git/pull/1819
PATH WALK II: Add --path-walk option to 'git pack-objects'
[4] https://github.com/gitgitgadget/git/pull/1820
PATH WALK III: Add 'git backfill' command
[5] https://github.com/gitgitgadget/git/pull/1821
PATH WALK IV: Add 'git survey' command
Structure of the Patch Series
=============================
This patch series attempts to create the simplest version of the API in
patch 1, then build functionality incrementally. During the process, each
change will introduce an update to:
* The path-walk API itself in path-walk.c and path-walk.h.
* The documentation of the API in
Documentation/technical/api-path-walk.txt.
* The test script t/t6601-path-walk.sh.
The core of the API relies on using a 'struct rev_info' to define an initial
set of objects and some form of a commit walk to define what range of
objects to visit. Initially, only a subset of 'struct rev_info' options work
as expected. For example:
* Patch 1 assumes that only commit objects are starting positions, but the
focus is on exploring trees and blobs.
* Patch 3 allows users to specify object types, which includes outputting
the visited commits in a batch.
* Annotated tags and indexed objects are considered in Patch 4. These are
grouped because they both exist within the 'pending' object list.
* UNINTERESTING objects are not considered until Patch 5.
Changes in v1 (since previous version)
======================================
There are a few hard-won learnings from previous versions of this series due
to testing this in the wild with many different repositories.
* Initially, the 'git pack-objects --path-walk' feature was not tested with
the '--shallow' option because it was expected that this option was for
servers creating a pack containing shallow commits. However, this option
is also used when pushing from a shallow clone, and this was a critical
feature that we needed to reduce the size of commits pushed from
automated environments that were bootstrapped by shallow clones. The crux
of the change is in Patch 5 and how UNINTERESTING objects are handled. We
no longer need to push the UNINTERESTING flag around the objects
ourselves and can use existing logic in list-objects.c to do so. This
allows using the --objects-edge-aggressive option when necessary to
reduce the object count when pushing from a shallow clone. (The
pack-objects series expands on tests to cover this integration point.)
* When looking into cases where 'git repack -adf' outperformed 'git repack
-adf --path-walk', I discovered that the issue did not reproduce in a
bare repository. This is due to 'git repack' iterating over all indexed
objects before walking commits. I had inadvertently put all indexed
objects in their own category, leading to no good deltas with previous
versions of those files; I had also not used the 'path' option from the
pending list, so these objects had invalid name hash values. You will see
in patch 4 that the pending list is handled quite differently and the
'--indexed-objects' option is tested directly within t6601.
* I added a new 'test_cmp_sorted' helper because I wanted to simplify some
repeated sections of t6601.
* Patch 1 has significantly more context than it did before.
* Annotated tags are given a name of "/tags" to differentiate them slightly
from root trees and commits.
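To make the purpose of the 'test_cmp_sorted' helper concrete, here is a
minimal sketch of an order-insensitive file comparison. It is only an
approximation written for this note: the name 'cmp_sorted_sketch' is
hypothetical, and it uses plain 'cmp' where the real helper in
t/test-lib-functions.sh may use the test library's own comparison function.

```shell
#!/bin/sh
# Hypothetical stand-in for an order-insensitive comparison of two files.
# Returns 0 when both files hold the same lines, regardless of line order.
cmp_sorted_sketch () {
	sort <"$1" >"$1.sorted" &&
	sort <"$2" >"$2.sorted" &&
	cmp -s "$1.sorted" "$2.sorted"
	status=$?
	# Clean up the temporary sorted copies whether or not they matched.
	rm -f "$1.sorted" "$2.sorted"
	return $status
}
```

A comparison of this kind lets a test list expected objects in any
convenient order, since the order of objects within a batch is not part of
what is being tested.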
Changes in v2
=============
* Updated the test helper to output the batch number, allowing us to
confirm that OIDs are grouped appropriately. This also signaled a few
cases where the callback function was being called on an empty set.
* This change has resulted in significant changes to the test data,
including reordered lines and prepended batch numbers.
* Thanks to Patrick for providing a recommended change to remove memory
leaks from the test helper.
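The batch numbers make it possible to check grouping mechanically. The
sketch below assumes helper output lines of the form "batch:TYPE:path:oid",
as seen in the updated test data, and reports any type and path pair that
shows up under more than one batch number. The function name 'check_batches'
is hypothetical, and paths containing ':' are not handled.

```shell
#!/bin/sh
# Hypothetical checker: read helper-style output on stdin and exit non-zero
# if the same TYPE:path pair appears under two different batch numbers.
check_batches () {
	awk -F: '
		BEGIN { bad = 0 }
		# Skip summary lines such as "blobs:6" (fewer than four fields).
		NF < 4 { next }
		{
			key = $2 ":" $3
			if ((key in seen) && seen[key] != $1) {
				printf "%s split across batches %s and %s\n", key, seen[key], $1
				bad = 1
			}
			seen[key] = $1
		}
		END { exit bad }
	'
}
```

Something like 'check_batches <out' could then be run on the helper output
alongside the sorted comparison with the expected file.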
Thanks, -Stolee
Derrick Stolee (6):
path-walk: introduce an object walk by path
test-lib-functions: add test_cmp_sorted
t6601: add helper for testing path-walk API
path-walk: allow consumer to specify object types
path-walk: visit tags and cached objects
path-walk: mark trees and blobs as UNINTERESTING
Documentation/technical/api-path-walk.txt | 63 +++
Makefile | 2 +
path-walk.c | 531 ++++++++++++++++++++++
path-walk.h | 65 +++
t/helper/test-path-walk.c | 124 +++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t6601-path-walk.sh | 368 +++++++++++++++
t/test-lib-functions.sh | 10 +
9 files changed, 1165 insertions(+)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/t6601-path-walk.sh
base-commit: e9356ba3ea2a6754281ff7697b3e5a1697b21e24
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1818%2Fderrickstolee%2Fapi-upstream-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1818/derrickstolee/api-upstream-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1818
Range-diff vs v1:
1: c71f0a0e361 ! 1: b7e9b81e8b3 path-walk: introduce an object walk by path
@@ path-walk.c (new)
+
+ list = strmap_get(&ctx->paths_to_lists, path);
+
++ if (!list->oids.nr)
++ return 0;
++
+ /* Evaluate function pointer on this data. */
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
2: 4f9f898fec1 = 2: cf2ed61b324 test-lib-functions: add test_cmp_sorted
3: 6f93dff88e7 ! 3: a3c754d93cc t6601: add helper for testing path-walk API
@@ Commit message
sets a baseline for the behavior and we can extend it as new options are
introduced.
+ Store and output a 'batch_nr' value so we can demonstrate that the paths are
+ grouped together in a batch and not following some other ordering. This
+ allows us to test the depth-first behavior of the path-walk API. However, we
+ purposefully do not test the order of the objects in the batch, so the
+ output is compared to the expected output through a sort.
+
It is important to mention that the behavior of the API will change soon as
we start to handle UNINTERESTING objects differently, but these tests will
demonstrate the change in behavior.
@@ t/helper/test-path-walk.c (new)
+};
+
+struct path_walk_test_data {
++ uintmax_t batch_nr;
+ uintmax_t tree_nr;
+ uintmax_t blob_nr;
+};
@@ t/helper/test-path-walk.c (new)
+ }
+
+ for (size_t i = 0; i < oids->nr; i++)
-+ printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));
++ printf("%"PRIuMAX":%s:%s:%s\n",
++ tdata->batch_nr, typestr, path,
++ oid_to_hex(&oids->oid[i]));
+
++ tdata->batch_nr++;
+ return 0;
+}
+
@@ t/helper/test-path-walk.c (new)
+ OPT_END(),
+ };
+
-+ initialize_repository(the_repository);
+ setup_git_directory();
+ revs.repo = the_repository;
+
@@ t/helper/test-path-walk.c (new)
+ "blobs:%" PRIuMAX "\n",
+ data.tree_nr, data.blob_nr);
+
++ release_revisions(&revs);
+ return res;
+}
@@ t/t6601-path-walk.sh (new)
@@
+#!/bin/sh
+
++TEST_PASSES_SANITIZE_LEAK=true
++
+test_description='direct path-walk API tests'
+
+. ./test-lib.sh
@@ t/t6601-path-walk.sh (new)
+ test-tool path-walk -- --all >out &&
+
+ cat >expect <<-EOF &&
-+ TREE::$(git rev-parse topic^{tree})
-+ TREE::$(git rev-parse base^{tree})
-+ TREE::$(git rev-parse base~1^{tree})
-+ TREE::$(git rev-parse base~2^{tree})
-+ TREE:left/:$(git rev-parse base:left)
-+ TREE:left/:$(git rev-parse base~2:left)
-+ TREE:right/:$(git rev-parse topic:right)
-+ TREE:right/:$(git rev-parse base~1:right)
-+ TREE:right/:$(git rev-parse base~2:right)
-+ trees:9
-+ BLOB:a:$(git rev-parse base~2:a)
-+ BLOB:left/b:$(git rev-parse base~2:left/b)
-+ BLOB:left/b:$(git rev-parse base:left/b)
-+ BLOB:right/c:$(git rev-parse base~2:right/c)
-+ BLOB:right/c:$(git rev-parse topic:right/c)
-+ BLOB:right/d:$(git rev-parse base~1:right/d)
++ 0:TREE::$(git rev-parse topic^{tree})
++ 0:TREE::$(git rev-parse base^{tree})
++ 0:TREE::$(git rev-parse base~1^{tree})
++ 0:TREE::$(git rev-parse base~2^{tree})
++ 1:TREE:right/:$(git rev-parse topic:right)
++ 1:TREE:right/:$(git rev-parse base~1:right)
++ 1:TREE:right/:$(git rev-parse base~2:right)
++ 2:BLOB:right/d:$(git rev-parse base~1:right/d)
++ 3:BLOB:right/c:$(git rev-parse base~2:right/c)
++ 3:BLOB:right/c:$(git rev-parse topic:right/c)
++ 4:TREE:left/:$(git rev-parse base:left)
++ 4:TREE:left/:$(git rev-parse base~2:left)
++ 5:BLOB:left/b:$(git rev-parse base~2:left/b)
++ 5:BLOB:left/b:$(git rev-parse base:left/b)
++ 6:BLOB:a:$(git rev-parse base~2:a)
+ blobs:6
++ trees:9
+ EOF
+
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh (new)
+ test-tool path-walk -- topic >out &&
+
+ cat >expect <<-EOF &&
-+ TREE::$(git rev-parse topic^{tree})
-+ TREE::$(git rev-parse base~1^{tree})
-+ TREE::$(git rev-parse base~2^{tree})
-+ TREE:left/:$(git rev-parse base~2:left)
-+ TREE:right/:$(git rev-parse topic:right)
-+ TREE:right/:$(git rev-parse base~1:right)
-+ TREE:right/:$(git rev-parse base~2:right)
-+ trees:7
-+ BLOB:a:$(git rev-parse base~2:a)
-+ BLOB:left/b:$(git rev-parse base~2:left/b)
-+ BLOB:right/c:$(git rev-parse base~2:right/c)
-+ BLOB:right/c:$(git rev-parse topic:right/c)
-+ BLOB:right/d:$(git rev-parse base~1:right/d)
++ 0:TREE::$(git rev-parse topic^{tree})
++ 0:TREE::$(git rev-parse base~1^{tree})
++ 0:TREE::$(git rev-parse base~2^{tree})
++ 1:TREE:right/:$(git rev-parse topic:right)
++ 1:TREE:right/:$(git rev-parse base~1:right)
++ 1:TREE:right/:$(git rev-parse base~2:right)
++ 2:BLOB:right/d:$(git rev-parse base~1:right/d)
++ 3:BLOB:right/c:$(git rev-parse base~2:right/c)
++ 3:BLOB:right/c:$(git rev-parse topic:right/c)
++ 4:TREE:left/:$(git rev-parse base~2:left)
++ 5:BLOB:left/b:$(git rev-parse base~2:left/b)
++ 6:BLOB:a:$(git rev-parse base~2:a)
+ blobs:5
++ trees:7
+ EOF
+
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh (new)
+ test-tool path-walk -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
-+ TREE::$(git rev-parse topic^{tree})
-+ TREE:left/:$(git rev-parse topic:left)
-+ TREE:right/:$(git rev-parse topic:right)
-+ trees:3
-+ BLOB:a:$(git rev-parse topic:a)
-+ BLOB:left/b:$(git rev-parse topic:left/b)
-+ BLOB:right/c:$(git rev-parse topic:right/c)
-+ BLOB:right/d:$(git rev-parse topic:right/d)
++ 0:TREE::$(git rev-parse topic^{tree})
++ 1:TREE:right/:$(git rev-parse topic:right)
++ 2:BLOB:right/d:$(git rev-parse topic:right/d)
++ 3:BLOB:right/c:$(git rev-parse topic:right/c)
++ 4:TREE:left/:$(git rev-parse topic:left)
++ 5:BLOB:left/b:$(git rev-parse topic:left/b)
++ 6:BLOB:a:$(git rev-parse topic:a)
+ blobs:4
++ trees:3
+ EOF
+
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh (new)
+ test-tool path-walk -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
-+ TREE::$(git rev-parse topic^{tree})
-+ TREE::$(git rev-parse base~1^{tree})
-+ TREE:left/:$(git rev-parse base~1:left)
-+ TREE:right/:$(git rev-parse topic:right)
-+ TREE:right/:$(git rev-parse base~1:right)
-+ trees:5
-+ BLOB:a:$(git rev-parse base~1:a)
-+ BLOB:left/b:$(git rev-parse base~1:left/b)
-+ BLOB:right/c:$(git rev-parse base~1:right/c)
-+ BLOB:right/c:$(git rev-parse topic:right/c)
-+ BLOB:right/d:$(git rev-parse base~1:right/d)
++ 0:TREE::$(git rev-parse topic^{tree})
++ 0:TREE::$(git rev-parse base~1^{tree})
++ 1:TREE:right/:$(git rev-parse topic:right)
++ 1:TREE:right/:$(git rev-parse base~1:right)
++ 2:BLOB:right/d:$(git rev-parse base~1:right/d)
++ 3:BLOB:right/c:$(git rev-parse base~1:right/c)
++ 3:BLOB:right/c:$(git rev-parse topic:right/c)
++ 4:TREE:left/:$(git rev-parse base~1:left)
++ 5:BLOB:left/b:$(git rev-parse base~1:left/b)
++ 6:BLOB:a:$(git rev-parse base~1:a)
+ blobs:5
++ trees:5
+ EOF
+
+ test_cmp_sorted expect out
4: f4bf8be30b5 ! 4: 83b746f569d path-walk: allow consumer to specify object types
@@ path-walk.c: static int add_children(struct path_walk_context *ctx,
struct tree *child = lookup_tree(ctx->repo, &entry.oid);
o = child ? &child->object : NULL;
@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
-
- list = strmap_get(&ctx->paths_to_lists, path);
+ if (!list->oids.nr)
+ return 0;
- /* Evaluate function pointer on this data. */
- ret = ctx->info->path_fn(path, &list->oids, list->type,
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+ /* Track all commits. */
-+ if (info->commits)
++ if (info->commits && commit_list->oids.nr)
+ ret = info->path_fn("", &commit_list->oids, OBJ_COMMIT,
+ info->path_fn_data);
+ oid_array_clear(&commit_list->oids);
@@ path-walk.h: struct path_walk_info {
## t/helper/test-path-walk.c ##
@@ t/helper/test-path-walk.c: static const char * const path_walk_usage[] = {
- };
struct path_walk_test_data {
+ uintmax_t batch_nr;
++
+ uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
@@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
- data.tree_nr, data.blob_nr);
+ data.commit_nr, data.tree_nr, data.blob_nr);
+ release_revisions(&revs);
return res;
- }
## t/t6601-path-walk.sh ##
@@ t/t6601-path-walk.sh: test_expect_success 'all' '
test-tool path-walk -- --all >out &&
cat >expect <<-EOF &&
-+ COMMIT::$(git rev-parse topic)
-+ COMMIT::$(git rev-parse base)
-+ COMMIT::$(git rev-parse base~1)
-+ COMMIT::$(git rev-parse base~2)
+- 0:TREE::$(git rev-parse topic^{tree})
+- 0:TREE::$(git rev-parse base^{tree})
+- 0:TREE::$(git rev-parse base~1^{tree})
+- 0:TREE::$(git rev-parse base~2^{tree})
+- 1:TREE:right/:$(git rev-parse topic:right)
+- 1:TREE:right/:$(git rev-parse base~1:right)
+- 1:TREE:right/:$(git rev-parse base~2:right)
+- 2:BLOB:right/d:$(git rev-parse base~1:right/d)
+- 3:BLOB:right/c:$(git rev-parse base~2:right/c)
+- 3:BLOB:right/c:$(git rev-parse topic:right/c)
+- 4:TREE:left/:$(git rev-parse base:left)
+- 4:TREE:left/:$(git rev-parse base~2:left)
+- 5:BLOB:left/b:$(git rev-parse base~2:left/b)
+- 5:BLOB:left/b:$(git rev-parse base:left/b)
+- 6:BLOB:a:$(git rev-parse base~2:a)
++ 0:COMMIT::$(git rev-parse topic)
++ 0:COMMIT::$(git rev-parse base)
++ 0:COMMIT::$(git rev-parse base~1)
++ 0:COMMIT::$(git rev-parse base~2)
++ 1:TREE::$(git rev-parse topic^{tree})
++ 1:TREE::$(git rev-parse base^{tree})
++ 1:TREE::$(git rev-parse base~1^{tree})
++ 1:TREE::$(git rev-parse base~2^{tree})
++ 2:TREE:right/:$(git rev-parse topic:right)
++ 2:TREE:right/:$(git rev-parse base~1:right)
++ 2:TREE:right/:$(git rev-parse base~2:right)
++ 3:BLOB:right/d:$(git rev-parse base~1:right/d)
++ 4:BLOB:right/c:$(git rev-parse base~2:right/c)
++ 4:BLOB:right/c:$(git rev-parse topic:right/c)
++ 5:TREE:left/:$(git rev-parse base:left)
++ 5:TREE:left/:$(git rev-parse base~2:left)
++ 6:BLOB:left/b:$(git rev-parse base~2:left/b)
++ 6:BLOB:left/b:$(git rev-parse base:left/b)
++ 7:BLOB:a:$(git rev-parse base~2:a)
+ blobs:6
+ commits:4
- TREE::$(git rev-parse topic^{tree})
- TREE::$(git rev-parse base^{tree})
- TREE::$(git rev-parse base~1^{tree})
+ trees:9
+ EOF
+
@@ t/t6601-path-walk.sh: test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&
cat >expect <<-EOF &&
-+ COMMIT::$(git rev-parse topic)
-+ COMMIT::$(git rev-parse base~1)
-+ COMMIT::$(git rev-parse base~2)
+- 0:TREE::$(git rev-parse topic^{tree})
+- 0:TREE::$(git rev-parse base~1^{tree})
+- 0:TREE::$(git rev-parse base~2^{tree})
+- 1:TREE:right/:$(git rev-parse topic:right)
+- 1:TREE:right/:$(git rev-parse base~1:right)
+- 1:TREE:right/:$(git rev-parse base~2:right)
+- 2:BLOB:right/d:$(git rev-parse base~1:right/d)
+- 3:BLOB:right/c:$(git rev-parse base~2:right/c)
+- 3:BLOB:right/c:$(git rev-parse topic:right/c)
+- 4:TREE:left/:$(git rev-parse base~2:left)
+- 5:BLOB:left/b:$(git rev-parse base~2:left/b)
+- 6:BLOB:a:$(git rev-parse base~2:a)
++ 0:COMMIT::$(git rev-parse topic)
++ 0:COMMIT::$(git rev-parse base~1)
++ 0:COMMIT::$(git rev-parse base~2)
++ 1:TREE::$(git rev-parse topic^{tree})
++ 1:TREE::$(git rev-parse base~1^{tree})
++ 1:TREE::$(git rev-parse base~2^{tree})
++ 2:TREE:right/:$(git rev-parse topic:right)
++ 2:TREE:right/:$(git rev-parse base~1:right)
++ 2:TREE:right/:$(git rev-parse base~2:right)
++ 3:BLOB:right/d:$(git rev-parse base~1:right/d)
++ 4:BLOB:right/c:$(git rev-parse base~2:right/c)
++ 4:BLOB:right/c:$(git rev-parse topic:right/c)
++ 5:TREE:left/:$(git rev-parse base~2:left)
++ 6:BLOB:left/b:$(git rev-parse base~2:left/b)
++ 7:BLOB:a:$(git rev-parse base~2:a)
+ blobs:5
+ commits:3
- TREE::$(git rev-parse topic^{tree})
- TREE::$(git rev-parse base~1^{tree})
- TREE::$(git rev-parse base~2^{tree})
+ trees:7
+ EOF
+
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
test-tool path-walk -- topic --not base >out &&
cat >expect <<-EOF &&
-+ COMMIT::$(git rev-parse topic)
++ 0:COMMIT::$(git rev-parse topic)
++ 1:TREE::$(git rev-parse topic^{tree})
++ 2:TREE:right/:$(git rev-parse topic:right)
++ 3:BLOB:right/d:$(git rev-parse topic:right/d)
++ 4:BLOB:right/c:$(git rev-parse topic:right/c)
++ 5:TREE:left/:$(git rev-parse topic:left)
++ 6:BLOB:left/b:$(git rev-parse topic:left/b)
++ 7:BLOB:a:$(git rev-parse topic:a)
++ blobs:4
+ commits:1
- TREE::$(git rev-parse topic^{tree})
- TREE:left/:$(git rev-parse topic:left)
- TREE:right/:$(git rev-parse topic:right)
-@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
- test_cmp_sorted expect out
- '
-
++ trees:3
++ EOF
++
++ test_cmp_sorted expect out
++'
++
+test_expect_success 'topic, not base, only blobs' '
+ test-tool path-walk --no-trees --no-commits \
+ -- topic --not base >out &&
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ cat >expect <<-EOF &&
+ commits:0
+ trees:0
-+ BLOB:a:$(git rev-parse topic:a)
-+ BLOB:left/b:$(git rev-parse topic:left/b)
-+ BLOB:right/c:$(git rev-parse topic:right/c)
-+ BLOB:right/d:$(git rev-parse topic:right/d)
++ 0:BLOB:right/d:$(git rev-parse topic:right/d)
++ 1:BLOB:right/c:$(git rev-parse topic:right/c)
++ 2:BLOB:left/b:$(git rev-parse topic:left/b)
++ 3:BLOB:a:$(git rev-parse topic:a)
+ blobs:4
+ EOF
+
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
-+ COMMIT::$(git rev-parse topic)
++ 0:COMMIT::$(git rev-parse topic)
+ commits:1
+ trees:0
+ blobs:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+
+ cat >expect <<-EOF &&
+ commits:0
-+ TREE::$(git rev-parse topic^{tree})
-+ TREE:left/:$(git rev-parse topic:left)
-+ TREE:right/:$(git rev-parse topic:right)
-+ trees:3
+ 0:TREE::$(git rev-parse topic^{tree})
+ 1:TREE:right/:$(git rev-parse topic:right)
+- 2:BLOB:right/d:$(git rev-parse topic:right/d)
+- 3:BLOB:right/c:$(git rev-parse topic:right/c)
+- 4:TREE:left/:$(git rev-parse topic:left)
+- 5:BLOB:left/b:$(git rev-parse topic:left/b)
+- 6:BLOB:a:$(git rev-parse topic:a)
+- blobs:4
++ 2:TREE:left/:$(git rev-parse topic:left)
+ trees:3
+ blobs:0
-+ EOF
-+
-+ test_cmp_sorted expect out
-+'
-+
- test_expect_success 'topic, not base, boundary' '
+ EOF
+
+ test_cmp_sorted expect out
+@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
test-tool path-walk -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
-+ COMMIT::$(git rev-parse topic)
-+ COMMIT::$(git rev-parse base~1)
+- 0:TREE::$(git rev-parse topic^{tree})
+- 0:TREE::$(git rev-parse base~1^{tree})
+- 1:TREE:right/:$(git rev-parse topic:right)
+- 1:TREE:right/:$(git rev-parse base~1:right)
+- 2:BLOB:right/d:$(git rev-parse base~1:right/d)
+- 3:BLOB:right/c:$(git rev-parse base~1:right/c)
+- 3:BLOB:right/c:$(git rev-parse topic:right/c)
+- 4:TREE:left/:$(git rev-parse base~1:left)
+- 5:BLOB:left/b:$(git rev-parse base~1:left/b)
+- 6:BLOB:a:$(git rev-parse base~1:a)
++ 0:COMMIT::$(git rev-parse topic)
++ 0:COMMIT::$(git rev-parse base~1)
++ 1:TREE::$(git rev-parse topic^{tree})
++ 1:TREE::$(git rev-parse base~1^{tree})
++ 2:TREE:right/:$(git rev-parse topic:right)
++ 2:TREE:right/:$(git rev-parse base~1:right)
++ 3:BLOB:right/d:$(git rev-parse base~1:right/d)
++ 4:BLOB:right/c:$(git rev-parse base~1:right/c)
++ 4:BLOB:right/c:$(git rev-parse topic:right/c)
++ 5:TREE:left/:$(git rev-parse base~1:left)
++ 6:BLOB:left/b:$(git rev-parse base~1:left/b)
++ 7:BLOB:a:$(git rev-parse base~1:a)
+ blobs:5
+ commits:2
- TREE::$(git rev-parse topic^{tree})
- TREE::$(git rev-parse base~1^{tree})
- TREE:left/:$(git rev-parse base~1:left)
+ trees:5
+ EOF
+
5: 3dc27658526 ! 5: 97765aa04c2 path-walk: visit tags and cached objects
@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
+ if (!list)
+ BUG("provided path '%s' that had no associated list", path);
+
+ if (!list->oids.nr)
+ return 0;
+
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
- (list->type == OBJ_BLOB && ctx->info->blobs))
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
+ struct hashmap_iter iter;
+ struct strmap_entry *entry;
+
-+ strmap_for_each_entry(&ctx.paths_to_lists, &iter, entry) {
++ strmap_for_each_entry(&ctx.paths_to_lists, &iter, entry)
+ push_to_stack(&ctx, entry->key);
-+ }
+
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
@@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_ar
default:
BUG("we do not understand this type");
}
+
++ /* This should never be output during tests. */
++ if (!oids->nr)
++ printf("%"PRIuMAX":%s:%s:EMPTY\n",
++ tdata->batch_nr, typestr, path);
++
+ for (size_t i = 0; i < oids->nr; i++)
+ printf("%"PRIuMAX":%s:%s:%s\n",
+ tdata->batch_nr, typestr, path,
@@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of blob objects")),
OPT_BOOL(0, "commits", &info.commits,
@@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
+ "tags:%" PRIuMAX "\n",
+ data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
+ release_revisions(&revs);
return res;
- }
## t/t6601-path-walk.sh ##
@@ t/t6601-path-walk.sh: test_description='direct path-walk API tests'
@@ t/t6601-path-walk.sh: test_description='direct path-walk API tests'
'
test_expect_success 'all' '
-@@ t/t6601-path-walk.sh: test_expect_success 'all' '
- TREE::$(git rev-parse base^{tree})
- TREE::$(git rev-parse base~1^{tree})
- TREE::$(git rev-parse base~2^{tree})
-+ TREE::$(git rev-parse refs/tags/tree-tag^{})
-+ TREE::$(git rev-parse refs/tags/tree-tag2^{})
-+ TREE:a/:$(git rev-parse base:a)
- TREE:left/:$(git rev-parse base:left)
- TREE:left/:$(git rev-parse base~2:left)
- TREE:right/:$(git rev-parse topic:right)
- TREE:right/:$(git rev-parse base~1:right)
- TREE:right/:$(git rev-parse base~2:right)
-- trees:9
-+ TREE:child/:$(git rev-parse refs/tags/tree-tag^{}:child)
-+ trees:13
- BLOB:a:$(git rev-parse base~2:a)
-+ BLOB:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
- BLOB:left/b:$(git rev-parse base~2:left/b)
- BLOB:left/b:$(git rev-parse base:left/b)
- BLOB:right/c:$(git rev-parse base~2:right/c)
- BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse base~1:right/d)
-- blobs:6
-+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
-+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
-+ BLOB:child/file:$(git rev-parse refs/tags/tree-tag^{}:child/file)
+ test-tool path-walk -- --all >out &&
+
++ cat >expect <<-EOF &&
++ 0:COMMIT::$(git rev-parse topic)
++ 0:COMMIT::$(git rev-parse base)
++ 0:COMMIT::$(git rev-parse base~1)
++ 0:COMMIT::$(git rev-parse base~2)
++ 1:TAG:/tags:$(git rev-parse refs/tags/first)
++ 1:TAG:/tags:$(git rev-parse refs/tags/second.1)
++ 1:TAG:/tags:$(git rev-parse refs/tags/second.2)
++ 1:TAG:/tags:$(git rev-parse refs/tags/third)
++ 1:TAG:/tags:$(git rev-parse refs/tags/fourth)
++ 1:TAG:/tags:$(git rev-parse refs/tags/tree-tag)
++ 1:TAG:/tags:$(git rev-parse refs/tags/blob-tag)
++ 2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
++ 2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
++ 3:TREE::$(git rev-parse topic^{tree})
++ 3:TREE::$(git rev-parse base^{tree})
++ 3:TREE::$(git rev-parse base~1^{tree})
++ 3:TREE::$(git rev-parse base~2^{tree})
++ 3:TREE::$(git rev-parse refs/tags/tree-tag^{})
++ 3:TREE::$(git rev-parse refs/tags/tree-tag2^{})
++ 4:BLOB:a:$(git rev-parse base~2:a)
++ 5:TREE:right/:$(git rev-parse topic:right)
++ 5:TREE:right/:$(git rev-parse base~1:right)
++ 5:TREE:right/:$(git rev-parse base~2:right)
++ 6:BLOB:right/d:$(git rev-parse base~1:right/d)
++ 7:BLOB:right/c:$(git rev-parse base~2:right/c)
++ 7:BLOB:right/c:$(git rev-parse topic:right/c)
++ 8:TREE:left/:$(git rev-parse base:left)
++ 8:TREE:left/:$(git rev-parse base~2:left)
++ 9:BLOB:left/b:$(git rev-parse base~2:left/b)
++ 9:BLOB:left/b:$(git rev-parse base:left/b)
++ 10:TREE:a/:$(git rev-parse base:a)
++ 11:BLOB:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
++ 12:TREE:child/:$(git rev-parse refs/tags/tree-tag:child)
++ 13:BLOB:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+ blobs:10
-+ TAG:/tags:$(git rev-parse refs/tags/first)
-+ TAG:/tags:$(git rev-parse refs/tags/second.1)
-+ TAG:/tags:$(git rev-parse refs/tags/second.2)
-+ TAG:/tags:$(git rev-parse refs/tags/third)
-+ TAG:/tags:$(git rev-parse refs/tags/fourth)
-+ TAG:/tags:$(git rev-parse refs/tags/tree-tag)
-+ TAG:/tags:$(git rev-parse refs/tags/blob-tag)
++ commits:4
+ tags:7
++ trees:13
+ EOF
+
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'all' '
+ test-tool path-walk -- --indexed-objects >out &&
+
+ cat >expect <<-EOF &&
-+ commits:0
-+ TREE:right/:$(git rev-parse topic:right)
-+ trees:1
-+ BLOB:a:$(git rev-parse HEAD:a)
-+ BLOB:left/b:$(git rev-parse HEAD:left/b)
-+ BLOB:left/c:$(git rev-parse :left/c)
-+ BLOB:right/c:$(git rev-parse HEAD:right/c)
-+ BLOB:right/d:$(git rev-parse HEAD:right/d)
++ 0:BLOB:a:$(git rev-parse HEAD:a)
++ 1:BLOB:left/b:$(git rev-parse HEAD:left/b)
++ 2:BLOB:left/c:$(git rev-parse :left/c)
++ 3:BLOB:right/c:$(git rev-parse HEAD:right/c)
++ 4:BLOB:right/d:$(git rev-parse HEAD:right/d)
++ 5:TREE:right/:$(git rev-parse topic:right)
+ blobs:5
++ commits:0
+ tags:0
++ trees:1
+ EOF
+
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'all' '
+
+ test-tool path-walk -- --indexed-objects --branches >out &&
+
-+ cat >expect <<-EOF &&
-+ COMMIT::$(git rev-parse topic)
-+ COMMIT::$(git rev-parse base)
-+ COMMIT::$(git rev-parse base~1)
-+ COMMIT::$(git rev-parse base~2)
-+ commits:4
-+ TREE::$(git rev-parse topic^{tree})
-+ TREE::$(git rev-parse base^{tree})
-+ TREE::$(git rev-parse base~1^{tree})
-+ TREE::$(git rev-parse base~2^{tree})
-+ TREE:a/:$(git rev-parse base:a)
-+ TREE:left/:$(git rev-parse base:left)
-+ TREE:left/:$(git rev-parse base~2:left)
-+ TREE:right/:$(git rev-parse topic:right)
-+ TREE:right/:$(git rev-parse base~1:right)
-+ TREE:right/:$(git rev-parse base~2:right)
-+ trees:10
-+ BLOB:a:$(git rev-parse base~2:a)
-+ BLOB:left/b:$(git rev-parse base:left/b)
-+ BLOB:left/b:$(git rev-parse base~2:left/b)
-+ BLOB:right/c:$(git rev-parse base~2:right/c)
-+ BLOB:right/c:$(git rev-parse topic:right/c)
-+ BLOB:right/d:$(git rev-parse base~1:right/d)
-+ BLOB:right/d:$(git rev-parse :right/d)
+ cat >expect <<-EOF &&
+ 0:COMMIT::$(git rev-parse topic)
+ 0:COMMIT::$(git rev-parse base)
+@@ t/t6601-path-walk.sh: test_expect_success 'all' '
+ 1:TREE::$(git rev-parse base^{tree})
+ 1:TREE::$(git rev-parse base~1^{tree})
+ 1:TREE::$(git rev-parse base~2^{tree})
+- 2:TREE:right/:$(git rev-parse topic:right)
+- 2:TREE:right/:$(git rev-parse base~1:right)
+- 2:TREE:right/:$(git rev-parse base~2:right)
+- 3:BLOB:right/d:$(git rev-parse base~1:right/d)
+- 4:BLOB:right/c:$(git rev-parse base~2:right/c)
+- 4:BLOB:right/c:$(git rev-parse topic:right/c)
+- 5:TREE:left/:$(git rev-parse base:left)
+- 5:TREE:left/:$(git rev-parse base~2:left)
+- 6:BLOB:left/b:$(git rev-parse base~2:left/b)
+- 6:BLOB:left/b:$(git rev-parse base:left/b)
+- 7:BLOB:a:$(git rev-parse base~2:a)
+- blobs:6
++ 2:BLOB:a:$(git rev-parse base~2:a)
++ 3:TREE:right/:$(git rev-parse topic:right)
++ 3:TREE:right/:$(git rev-parse base~1:right)
++ 3:TREE:right/:$(git rev-parse base~2:right)
++ 4:BLOB:right/d:$(git rev-parse base~1:right/d)
++ 4:BLOB:right/d:$(git rev-parse :right/d)
++ 5:BLOB:right/c:$(git rev-parse base~2:right/c)
++ 5:BLOB:right/c:$(git rev-parse topic:right/c)
++ 6:TREE:left/:$(git rev-parse base:left)
++ 6:TREE:left/:$(git rev-parse base~2:left)
++ 7:BLOB:left/b:$(git rev-parse base:left/b)
++ 7:BLOB:left/b:$(git rev-parse base~2:left/b)
++ 8:TREE:a/:$(git rev-parse refs/tags/third:a)
+ blobs:7
+ commits:4
+- trees:9
+ tags:0
++ trees:10
EOF
test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic only' '
- BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse base~1:right/d)
+ 7:BLOB:a:$(git rev-parse base~2:a)
blobs:5
+ commits:3
+ tags:0
+ trees:7
EOF
- test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
- BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse topic:right/d)
+ 7:BLOB:a:$(git rev-parse topic:a)
blobs:4
+ commits:1
+ tags:0
+ trees:3
EOF
- test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' '
- BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse topic:right/d)
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+- commits:0
+- trees:0
+ 0:BLOB:right/d:$(git rev-parse topic:right/d)
+ 1:BLOB:right/c:$(git rev-parse topic:right/c)
+ 2:BLOB:left/b:$(git rev-parse topic:left/b)
+ 3:BLOB:a:$(git rev-parse topic:a)
blobs:4
++ commits:0
+ tags:0
++ trees:0
EOF
test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only commits' '
+ cat >expect <<-EOF &&
+ 0:COMMIT::$(git rev-parse topic)
commits:1
- trees:0
+- trees:0
blobs:0
+ tags:0
++ trees:0
EOF
test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' '
- TREE:right/:$(git rev-parse topic:right)
- trees:3
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+- commits:0
+ 0:TREE::$(git rev-parse topic^{tree})
+ 1:TREE:right/:$(git rev-parse topic:right)
+ 2:TREE:left/:$(git rev-parse topic:left)
+- trees:3
++ commits:0
blobs:0
+ tags:0
++ trees:3
EOF
test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
- BLOB:right/c:$(git rev-parse topic:right/c)
- BLOB:right/d:$(git rev-parse base~1:right/d)
+ 7:BLOB:a:$(git rev-parse base~1:a)
blobs:5
+ commits:2
+ tags:0
+ trees:5
EOF
test_cmp_sorted expect out
6: 0bb607e1fd3 ! 6: a4aaa3b001b path-walk: mark trees and blobs as UNINTERESTING
@@ path-walk.c: static int add_children(struct path_walk_context *ctx,
}
@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
- if (!list)
- BUG("provided path '%s' that had no associated list", path);
+ if (!list->oids.nr)
+ return 0;
+ if (ctx->info->prune_all_uninteresting) {
+ /*
@@ path-walk.h: struct path_walk_info {
## t/helper/test-path-walk.c ##
@@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_array *oids,
- BUG("we do not understand this type");
- }
+ printf("%"PRIuMAX":%s:%s:EMPTY\n",
+ tdata->batch_nr, typestr, path);
- for (size_t i = 0; i < oids->nr; i++)
-- printf("%s:%s:%s\n", typestr, path, oid_to_hex(&oids->oid[i]));
+- printf("%"PRIuMAX":%s:%s:%s\n",
+ for (size_t i = 0; i < oids->nr; i++) {
+ struct object *o = lookup_unknown_object(the_repository,
+ &oids->oid[i]);
-+ printf("%s:%s:%s%s\n", typestr, path, oid_to_hex(&oids->oid[i]),
++ printf("%"PRIuMAX":%s:%s:%s%s\n",
+ tdata->batch_nr, typestr, path,
+- oid_to_hex(&oids->oid[i]));
++ oid_to_hex(&oids->oid[i]),
+ o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
+ }
+ tdata->batch_nr++;
return 0;
- }
@@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
@@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
## t/t6601-path-walk.sh ##
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
- COMMIT::$(git rev-parse topic)
- commits:1
- TREE::$(git rev-parse topic^{tree})
-- TREE:left/:$(git rev-parse topic:left)
-+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
- TREE:right/:$(git rev-parse topic:right)
- trees:3
-- BLOB:a:$(git rev-parse topic:a)
-- BLOB:left/b:$(git rev-parse topic:left/b)
-+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
-+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
- BLOB:right/c:$(git rev-parse topic:right/c)
-- BLOB:right/d:$(git rev-parse topic:right/d)
-+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ 0:COMMIT::$(git rev-parse topic)
+ 1:TREE::$(git rev-parse topic^{tree})
+ 2:TREE:right/:$(git rev-parse topic:right)
+- 3:BLOB:right/d:$(git rev-parse topic:right/d)
++ 3:BLOB:right/d:$(git rev-parse topic:right/d):UNINTERESTING
+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
+- 5:TREE:left/:$(git rev-parse topic:left)
+- 6:BLOB:left/b:$(git rev-parse topic:left/b)
+- 7:BLOB:a:$(git rev-parse topic:a)
++ 5:TREE:left/:$(git rev-parse topic:left):UNINTERESTING
++ 6:BLOB:left/b:$(git rev-parse topic:left/b):UNINTERESTING
++ 7:BLOB:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
+ commits:1
tags:0
- EOF
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
test_cmp_sorted expect out
'
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ test-tool path-walk -- fourth blob-tag2 --not base >out &&
+
+ cat >expect <<-EOF &&
-+ COMMIT::$(git rev-parse topic)
-+ commits:1
-+ TREE::$(git rev-parse topic^{tree})
-+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
-+ TREE:right/:$(git rev-parse topic:right)
-+ trees:3
-+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
-+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
-+ BLOB:right/c:$(git rev-parse topic:right/c)
-+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
-+ BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
++ 0:COMMIT::$(git rev-parse topic)
++ 1:TAG:/tags:$(git rev-parse fourth)
++ 2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
++ 3:TREE::$(git rev-parse topic^{tree})
++ 4:TREE:right/:$(git rev-parse topic:right)
++ 5:BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
++ 6:BLOB:right/c:$(git rev-parse topic:right/c)
++ 7:TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
++ 8:BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
++ 9:BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+ blobs:5
-+ TAG:/tags:$(git rev-parse fourth)
++ commits:1
+ tags:1
++ trees:3
+ EOF
+
+ test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
test_expect_success 'topic, not base, only blobs' '
test-tool path-walk --no-trees --no-commits \
-- topic --not base >out &&
-@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' '
+
cat >expect <<-EOF &&
- commits:0
- trees:0
-- BLOB:a:$(git rev-parse topic:a)
-- BLOB:left/b:$(git rev-parse topic:left/b)
-+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
-+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
- BLOB:right/c:$(git rev-parse topic:right/c)
-- BLOB:right/d:$(git rev-parse topic:right/d)
-+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+- 0:BLOB:right/d:$(git rev-parse topic:right/d)
++ 0:BLOB:right/d:$(git rev-parse topic:right/d):UNINTERESTING
+ 1:BLOB:right/c:$(git rev-parse topic:right/c)
+- 2:BLOB:left/b:$(git rev-parse topic:left/b)
+- 3:BLOB:a:$(git rev-parse topic:a)
++ 2:BLOB:left/b:$(git rev-parse topic:left/b):UNINTERESTING
++ 3:BLOB:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
+ commits:0
tags:0
- EOF
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' '
cat >expect <<-EOF &&
+ 0:TREE::$(git rev-parse topic^{tree})
+ 1:TREE:right/:$(git rev-parse topic:right)
+- 2:TREE:left/:$(git rev-parse topic:left)
++ 2:TREE:left/:$(git rev-parse topic:left):UNINTERESTING
commits:0
- TREE::$(git rev-parse topic^{tree})
-- TREE:left/:$(git rev-parse topic:left)
-+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
- TREE:right/:$(git rev-parse topic:right)
- trees:3
blobs:0
+ tags:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
cat >expect <<-EOF &&
- COMMIT::$(git rev-parse topic)
-- COMMIT::$(git rev-parse base~1)
-+ COMMIT::$(git rev-parse base~1):UNINTERESTING
- commits:2
- TREE::$(git rev-parse topic^{tree})
-- TREE::$(git rev-parse base~1^{tree})
-- TREE:left/:$(git rev-parse base~1:left)
-+ TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
-+ TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
- TREE:right/:$(git rev-parse topic:right)
-- TREE:right/:$(git rev-parse base~1:right)
-+ TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
- trees:5
-- BLOB:a:$(git rev-parse base~1:a)
-- BLOB:left/b:$(git rev-parse base~1:left/b)
-- BLOB:right/c:$(git rev-parse base~1:right/c)
-+ BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
-+ BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
-+ BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
- BLOB:right/c:$(git rev-parse topic:right/c)
-- BLOB:right/d:$(git rev-parse base~1:right/d)
-+ BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ 0:COMMIT::$(git rev-parse topic)
+- 0:COMMIT::$(git rev-parse base~1)
++ 0:COMMIT::$(git rev-parse base~1):UNINTERESTING
+ 1:TREE::$(git rev-parse topic^{tree})
+- 1:TREE::$(git rev-parse base~1^{tree})
++ 1:TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
+ 2:TREE:right/:$(git rev-parse topic:right)
+- 2:TREE:right/:$(git rev-parse base~1:right)
+- 3:BLOB:right/d:$(git rev-parse base~1:right/d)
+- 4:BLOB:right/c:$(git rev-parse base~1:right/c)
++ 2:TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
++ 3:BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
++ 4:BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
+- 5:TREE:left/:$(git rev-parse base~1:left)
+- 6:BLOB:left/b:$(git rev-parse base~1:left/b)
+- 7:BLOB:a:$(git rev-parse base~1:a)
++ 5:TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
++ 6:BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
++ 7:BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
blobs:5
+ commits:2
tags:0
- EOF
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
test_cmp_sorted expect out
'
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
+ test-tool path-walk --prune -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
-+ COMMIT::$(git rev-parse topic)
-+ COMMIT::$(git rev-parse base~1):UNINTERESTING
-+ commits:2
-+ TREE::$(git rev-parse topic^{tree})
-+ TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
-+ TREE:right/:$(git rev-parse topic:right)
-+ TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
-+ trees:4
-+ BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
-+ BLOB:right/c:$(git rev-parse topic:right/c)
++ 0:COMMIT::$(git rev-parse topic)
++ 0:COMMIT::$(git rev-parse base~1):UNINTERESTING
++ 1:TREE::$(git rev-parse topic^{tree})
++ 1:TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
++ 2:TREE:right/:$(git rev-parse topic:right)
++ 2:TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
++ 3:BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
++ 3:BLOB:right/c:$(git rev-parse topic:right/c)
+ blobs:2
++ commits:2
+ tags:0
++ trees:4
+ EOF
+
+ test_cmp_sorted expect out
--
gitgitgadget
* [PATCH v2 1/6] path-walk: introduce an object walk by path
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2024-11-09 19:41 ` Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
` (6 subsequent siblings)
7 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-11-09 19:41 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.
These batches are collected in 'struct type_and_oid_list' objects, which
store an object type and an oid_array of objects.
The data structures are documented in 'struct path_walk_context', but in
summary the most important are:
* 'paths_to_lists' is a strmap that connects a path to a
type_and_oid_list for that path. To avoid conflicts in path names,
we make sure that tree paths end in "/" (except the root path, which
is an empty string) and blob paths do not end in "/".
* 'path_stack' is a string list that is added to in an append-only
way. This stores the stack of our depth-first search on the heap
instead of using recursion.
* 'path_stack_pushed' is a strset that stores path names that were
already added to 'path_stack', to avoid repeating paths in the
stack. Mostly, this saves us from quadratic lookups from doing
unsorted checks into the string_list.
The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
push_to_stack() method. Call this instead of inserting into these
structures directly.
The walk_objects_by_path() method initializes these structures and
starts walking commits from the given rev_info struct. The commits are
used to find the list of root trees which populate the start of our
depth-first search.
The core of our depth-first search is in a while loop that continues
while we have not indicated an early exit and our 'path_stack' still has
entries in it. The loop body pops a path off of the stack and "visits"
the path via the walk_path() method.
The walk_path() method gets the list of OIDs from the 'paths_to_lists'
strmap and executes the callback method on that list with the given path
and type. If the OIDs correspond to tree objects, then iterate over all
trees in the list and run add_children() to add the child objects to
their own lists, adding new entries to the stack if necessary.
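The loop described above can be sketched outside of Git as a small,
self-contained C program fragment. This is only an illustration under
stated assumptions: the toy_* names, the fixed-size arrays and the
hard-coded tree are hypothetical and are not part of path-walk.c. In
particular, the linear scan in toy_push() stands in for the strset lookup
that the real 'path_stack_pushed' provides.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy tree: each entry names a path and its parent tree. */
struct toy_entry {
	const char *path;   /* tree paths end in "/", blob paths do not */
	const char *parent; /* "" is the root tree */
};

static const struct toy_entry toy_entries[] = {
	{ "a", "" },
	{ "left/", "" },
	{ "right/", "" },
	{ "left/b", "left/" },
	{ "right/c", "right/" },
	{ "right/d", "right/" },
};
#define TOY_NR (sizeof(toy_entries) / sizeof(toy_entries[0]))
#define TOY_MAX 16

struct toy_walk {
	const char *stack[TOY_MAX];  /* plays the role of 'path_stack' */
	size_t stack_nr;
	const char *pushed[TOY_MAX]; /* plays the role of 'path_stack_pushed' */
	size_t pushed_nr;
};

/* Push a path at most once, keeping the stack and the set in sync. */
static void toy_push(struct toy_walk *w, const char *path)
{
	for (size_t i = 0; i < w->pushed_nr; i++)
		if (!strcmp(w->pushed[i], path))
			return;
	if (w->pushed_nr >= TOY_MAX)
		return; /* toy capacity limit; stack_nr never exceeds pushed_nr */
	w->pushed[w->pushed_nr++] = path;
	w->stack[w->stack_nr++] = path;
}

/* Visit paths depth-first without recursion; record the visit order. */
static size_t toy_walk_paths(const char **out, size_t out_max)
{
	struct toy_walk w = { 0 };
	size_t visited = 0;

	toy_push(&w, ""); /* start from the root tree path */
	while (w.stack_nr && visited < out_max) {
		const char *path = w.stack[--w.stack_nr];
		size_t len = strlen(path);

		out[visited++] = path; /* the "callback" for this path */

		/* Only trees (the root, or a trailing "/") have children. */
		if (len && path[len - 1] != '/')
			continue;
		for (size_t i = 0; i < TOY_NR; i++)
			if (!strcmp(toy_entries[i].parent, path))
				toy_push(&w, toy_entries[i].path);
	}
	return visited;
}
```

Calling toy_walk_paths() with an array of TOY_MAX slots fills it with the
paths in the order they were popped. Because the most recently pushed
child is popped first, a whole subtree is finished before its siblings
are visited.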
In testing, this depth-first search approach was the one that used the
least memory while iterating over the object lists. There is still a
chance that repositories with too-wide path patterns could cause memory
pressure issues. Limiting the stack size could be done in the future by
limiting how many objects are being considered in-progress, or by
visiting blob paths earlier than trees.
There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 45 ++++
Makefile | 1 +
path-walk.c | 263 ++++++++++++++++++++++
path-walk.h | 43 ++++
4 files changed, 352 insertions(+)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
new file mode 100644
index 00000000000..c550c77ca30
--- /dev/null
+++ b/Documentation/technical/api-path-walk.txt
@@ -0,0 +1,45 @@
+Path-Walk API
+=============
+
+The path-walk API is used to walk reachable objects, but to visit objects
+in batches based on a common path they appear in, or by type.
+
+For example, all reachable commits are visited in a group. All tags are
+visited in a group. Then, all root trees are visited. At some point, all
+blobs reachable via a path `my/dir/to/A` are visited. When there are
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
+Basics
+------
+
+To use the path-walk API, include `path-walk.h` and call
+`walk_objects_by_path()` with a customized `path_walk_info` struct. The
+struct is used to set all of the options for how the walk should proceed.
+Let's dig into the different options and their use.
+
+`path_fn` and `path_fn_data`::
+ The most important option is the `path_fn` option, which is a
+ function pointer to the callback that can execute logic on the
+ object IDs for objects grouped by type and path. This function
+ also receives a `data` value that corresponds to the
+ `path_fn_data` member, for providing custom data structures to
+ this callback function.
+
+`revs`::
+ To configure the exact details of the reachable set of objects,
+ use the `revs` member and initialize it using the revision
+ machinery in `revision.h`. Initialize `revs` using calls such as
+ `setup_revisions()` or `parse_revision_opt()`. Do not call
+ `prepare_revision_walk()`, as that will be called within
+ `walk_objects_by_path()`.
++
+It is also important that you do not specify the `--objects` flag for the
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
+
+Examples
+--------
+
+See example usages in future changes.
diff --git a/Makefile b/Makefile
index 7344a7f7257..d0d8d6888e3 100644
--- a/Makefile
+++ b/Makefile
@@ -1094,6 +1094,7 @@ LIB_OBJS += parse-options.o
LIB_OBJS += patch-delta.o
LIB_OBJS += patch-ids.o
LIB_OBJS += path.o
+LIB_OBJS += path-walk.o
LIB_OBJS += pathspec.o
LIB_OBJS += pkt-line.o
LIB_OBJS += preload-index.o
diff --git a/path-walk.c b/path-walk.c
new file mode 100644
index 00000000000..24cf04c1e7d
--- /dev/null
+++ b/path-walk.c
@@ -0,0 +1,263 @@
+/*
+ * path-walk.c: implementation for path-based walks of the object graph.
+ */
+#include "git-compat-util.h"
+#include "path-walk.h"
+#include "blob.h"
+#include "commit.h"
+#include "dir.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "object.h"
+#include "oid-array.h"
+#include "revision.h"
+#include "string-list.h"
+#include "strmap.h"
+#include "trace2.h"
+#include "tree.h"
+#include "tree-walk.h"
+
+struct type_and_oid_list
+{
+ enum object_type type;
+ struct oid_array oids;
+};
+
+#define TYPE_AND_OID_LIST_INIT { \
+ .type = OBJ_NONE, \
+ .oids = OID_ARRAY_INIT \
+}
+
+struct path_walk_context {
+ /**
+ * Repeats of data in 'struct path_walk_info' for
+ * access with fewer characters.
+ */
+ struct repository *repo;
+ struct rev_info *revs;
+ struct path_walk_info *info;
+
+ /**
+ * Map a path to a 'struct type_and_oid_list'
+ * containing the objects discovered at that
+ * path.
+ */
+ struct strmap paths_to_lists;
+
+ /**
+ * Store the current list of paths in a stack, to
+ * facilitate depth-first-search without recursion.
+ *
+ * Use path_stack_pushed to indicate whether a path
+ * was previously added to path_stack.
+ */
+ struct string_list path_stack;
+ struct strset path_stack_pushed;
+};
+
+static void push_to_stack(struct path_walk_context *ctx,
+ const char *path)
+{
+ if (strset_contains(&ctx->path_stack_pushed, path))
+ return;
+
+ strset_add(&ctx->path_stack_pushed, path);
+ string_list_append(&ctx->path_stack, path);
+}
+
+static int add_children(struct path_walk_context *ctx,
+ const char *base_path,
+ struct object_id *oid)
+{
+ struct tree_desc desc;
+ struct name_entry entry;
+ struct strbuf path = STRBUF_INIT;
+ size_t base_len;
+ struct tree *tree = lookup_tree(ctx->repo, oid);
+
+ if (!tree) {
+ error(_("failed to walk children of tree %s: not found"),
+ oid_to_hex(oid));
+ return -1;
+ } else if (parse_tree_gently(tree, 1)) {
+ die("bad tree object %s", oid_to_hex(oid));
+ }
+
+ strbuf_addstr(&path, base_path);
+ base_len = path.len;
+
+ parse_tree(tree);
+ init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
+ while (tree_entry(&desc, &entry)) {
+ struct type_and_oid_list *list;
+ struct object *o;
+ /* Not actually true, but we will ignore submodules later. */
+ enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB;
+
+ /* Skip submodules. */
+ if (S_ISGITLINK(entry.mode))
+ continue;
+
+ if (type == OBJ_TREE) {
+ struct tree *child = lookup_tree(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else if (type == OBJ_BLOB) {
+ struct blob *child = lookup_blob(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else {
+ /* Wrong type? */
+ continue;
+ }
+
+ if (!o) /* report error?*/
+ continue;
+
+ strbuf_setlen(&path, base_len);
+ strbuf_add(&path, entry.path, entry.pathlen);
+
+ /*
+ * Trees will end with "/" for concatenation and distinction
+ * from blobs at the same path.
+ */
+ if (type == OBJ_TREE)
+ strbuf_addch(&path, '/');
+
+ if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = type;
+ strmap_put(&ctx->paths_to_lists, path.buf, list);
+ }
+ push_to_stack(ctx, path.buf);
+
+ /* Skip this object if already seen. */
+ if (o->flags & SEEN)
+ continue;
+ o->flags |= SEEN;
+ oid_array_append(&list->oids, &entry.oid);
+ }
+
+ free_tree_buffer(tree);
+ strbuf_release(&path);
+ return 0;
+}
+
+/*
+ * Visit the objects collected at 'path': run the callback on the list
+ * for that path and, if the list holds trees, expand each tree one
+ * level via add_children() so its entries land in their own lists.
+ */
+static int walk_path(struct path_walk_context *ctx,
+ const char *path)
+{
+ struct type_and_oid_list *list;
+ int ret = 0;
+
+ list = strmap_get(&ctx->paths_to_lists, path);
+
+ if (!list->oids.nr)
+ return 0;
+
+ /* Evaluate function pointer on this data. */
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
+
+ /* Expand data for children. */
+ if (list->type == OBJ_TREE) {
+ for (size_t i = 0; i < list->oids.nr; i++) {
+ ret |= add_children(ctx,
+ path,
+ &list->oids.oid[i]);
+ }
+ }
+
+ oid_array_clear(&list->oids);
+ strmap_remove(&ctx->paths_to_lists, path, 1);
+ return ret;
+}
+
+static void clear_strmap(struct strmap *map)
+{
+ struct hashmap_iter iter;
+ struct strmap_entry *e;
+
+ hashmap_for_each_entry(&map->map, &iter, e, ent) {
+ struct type_and_oid_list *list = e->value;
+ oid_array_clear(&list->oids);
+ }
+ strmap_clear(map, 1);
+ strmap_init(map);
+}
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info)
+{
+ const char *root_path = "";
+ int ret = 0;
+ size_t commits_nr = 0, paths_nr = 0;
+ struct commit *c;
+ struct type_and_oid_list *root_tree_list;
+ struct path_walk_context ctx = {
+ .repo = info->revs->repo,
+ .revs = info->revs,
+ .info = info,
+ .path_stack = STRING_LIST_INIT_DUP,
+ .path_stack_pushed = STRSET_INIT,
+ .paths_to_lists = STRMAP_INIT
+ };
+
+ trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+
+ /* Insert a single list for the root tree into the paths. */
+ CALLOC_ARRAY(root_tree_list, 1);
+ root_tree_list->type = OBJ_TREE;
+ strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+ push_to_stack(&ctx, root_path);
+
+ if (prepare_revision_walk(info->revs))
+ die(_("failed to setup revision walk"));
+
+ while ((c = get_revision(info->revs))) {
+ struct object_id *oid;
+ struct tree *t;
+ commits_nr++;
+
+ oid = get_commit_tree_oid(c);
+ t = lookup_tree(info->revs->repo, oid);
+
+ if (!t) {
+ warning("could not find tree %s", oid_to_hex(oid));
+ continue;
+ }
+
+ if (t->object.flags & SEEN)
+ continue;
+ t->object.flags |= SEEN;
+ oid_array_append(&root_tree_list->oids, oid);
+ }
+
+ trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
+ trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+
+ trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
+ trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
+ clear_strmap(&ctx.paths_to_lists);
+ strset_clear(&ctx.path_stack_pushed);
+ string_list_clear(&ctx.path_stack, 0);
+ return ret;
+}
diff --git a/path-walk.h b/path-walk.h
new file mode 100644
index 00000000000..c9e94a98bc8
--- /dev/null
+++ b/path-walk.h
@@ -0,0 +1,43 @@
+/*
+ * path-walk.h : Methods and structures for walking the object graph in batches
+ * by the paths that can reach those objects.
+ */
+#include "object.h" /* Required for 'enum object_type'. */
+
+struct rev_info;
+struct oid_array;
+
+/**
+ * The type of a function pointer for the method that is called on a list of
+ * objects reachable at a given path.
+ */
+typedef int (*path_fn)(const char *path,
+ struct oid_array *oids,
+ enum object_type type,
+ void *data);
+
+struct path_walk_info {
+ /**
+ * revs provides the definitions for the commit walk, including
+ * which commits are UNINTERESTING or not.
+ */
+ struct rev_info *revs;
+
+ /**
+ * The caller wishes to execute custom logic on objects reachable at a
+ * given path. Every reachable object will be visited exactly once, and
+ * the first path to see an object wins. This may not be a stable choice.
+ */
+ path_fn path_fn;
+ void *path_fn_data;
+};
+
+#define PATH_WALK_INFO_INIT { 0 }
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info);
--
gitgitgadget
* [PATCH v2 2/6] test-lib-functions: add test_cmp_sorted
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
@ 2024-11-09 19:41 ` Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
` (5 subsequent siblings)
7 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-11-09 19:41 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
This test helper reduces repeated logic in t6601-path-walk.sh, and may
be useful elsewhere, too.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
t/test-lib-functions.sh | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index fde9bf54fc3..16b70aebd60 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1267,6 +1267,16 @@ test_cmp () {
eval "$GIT_TEST_CMP" '"$@"'
}
+# test_cmp_sorted runs test_cmp on sorted versions of the two
+# input files. Uses "$1.sorted" and "$2.sorted" as temp files.
+
+test_cmp_sorted () {
+ sort <"$1" >"$1.sorted" &&
+ sort <"$2" >"$2.sorted" &&
+ test_cmp "$1.sorted" "$2.sorted" &&
+ rm "$1.sorted" "$2.sorted"
+}
+
# Check that the given config key has the expected value.
#
# test_cmp_config [-C <dir>] <expected-value>
--
gitgitgadget
* [PATCH v2 3/6] t6601: add helper for testing path-walk API
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 1/6] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 2/6] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
@ 2024-11-09 19:41 ` Derrick Stolee via GitGitGadget
2024-11-21 22:39 ` Taylor Blau
2024-11-09 19:41 ` [PATCH v2 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
` (4 subsequent siblings)
7 siblings, 1 reply; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-11-09 19:41 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.
Store and output a 'batch_nr' value so we can demonstrate that the paths are
grouped together in a batch and not following some other ordering. This
allows us to test the depth-first behavior of the path-walk API. However, we
purposefully do not test the order of the objects in the batch, so the
output is compared to the expected output through a sort.
It is important to mention that the behavior of the API will change soon as
we start to handle UNINTERESTING objects differently, but these tests will
demonstrate the change in behavior.
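As a rough, standalone illustration of the batch numbering described
above, the following C fragment mimics the shape of such a callback. It
is a sketch under assumptions, not the test helper itself: the toy_*
names are hypothetical, object IDs are plain strings instead of a
'struct oid_array', and output goes to a fixed-size buffer instead of
stdout so that it can be inspected.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the object types seen by the callback. */
enum toy_type { TOY_TREE, TOY_BLOB };

struct toy_batch_data {
	unsigned long batch_nr; /* incremented once per callback invocation */
	unsigned long tree_nr;
	unsigned long blob_nr;
	char out[512];          /* collected "batch:TYPE:path:id" lines */
	size_t out_len;
};

/*
 * Called once per path with every object found at that path. All
 * lines written by one call share the same batch number.
 */
static int toy_emit_batch(const char *path, const char **ids, size_t nr,
			  enum toy_type type, void *data)
{
	struct toy_batch_data *d = data;
	const char *typestr = type == TOY_TREE ? "TREE" : "BLOB";

	if (type == TOY_TREE)
		d->tree_nr += nr;
	else
		d->blob_nr += nr;

	for (size_t i = 0; i < nr; i++) {
		size_t room = sizeof(d->out) - d->out_len;
		int n = snprintf(d->out + d->out_len, room, "%lu:%s:%s:%s\n",
				 d->batch_nr, typestr, path, ids[i]);
		if (n < 0 || (size_t)n >= room)
			return -1; /* buffer full: signal an early exit */
		d->out_len += (size_t)n;
	}

	d->batch_nr++;
	return 0;
}
```

Two calls, one for a list of root trees and one for a blob path, produce
lines prefixed with 0 and 1 respectively. The order of lines within a
batch is not meaningful, which is why the tests compare sorted output.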
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 3 +-
Makefile | 1 +
t/helper/test-path-walk.c | 90 ++++++++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t6601-path-walk.sh | 120 ++++++++++++++++++++++
6 files changed, 215 insertions(+), 1 deletion(-)
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/t6601-path-walk.sh
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index c550c77ca30..662162ec70b 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -42,4 +42,5 @@ commits.
Examples
--------
-See example usages in future changes.
+See example usages in:
+ `t/helper/test-path-walk.c`
diff --git a/Makefile b/Makefile
index d0d8d6888e3..50413d96492 100644
--- a/Makefile
+++ b/Makefile
@@ -818,6 +818,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
TEST_BUILTINS_OBJS += test-partial-clone.o
TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-path-walk.o
TEST_BUILTINS_OBJS += test-pcre2-config.o
TEST_BUILTINS_OBJS += test-pkt-line.o
TEST_BUILTINS_OBJS += test-proc-receive.o
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
new file mode 100644
index 00000000000..aa468871079
--- /dev/null
+++ b/t/helper/test-path-walk.c
@@ -0,0 +1,90 @@
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "test-tool.h"
+#include "environment.h"
+#include "hex.h"
+#include "object-name.h"
+#include "object.h"
+#include "pretty.h"
+#include "revision.h"
+#include "setup.h"
+#include "parse-options.h"
+#include "path-walk.h"
+#include "oid-array.h"
+
+static const char * const path_walk_usage[] = {
+ N_("test-tool path-walk <options> -- <revision-options>"),
+ NULL
+};
+
+struct path_walk_test_data {
+ uintmax_t batch_nr;
+ uintmax_t tree_nr;
+ uintmax_t blob_nr;
+};
+
+static int emit_block(const char *path, struct oid_array *oids,
+ enum object_type type, void *data)
+{
+ struct path_walk_test_data *tdata = data;
+ const char *typestr;
+
+ switch (type) {
+ case OBJ_TREE:
+ typestr = "TREE";
+ tdata->tree_nr += oids->nr;
+ break;
+
+ case OBJ_BLOB:
+ typestr = "BLOB";
+ tdata->blob_nr += oids->nr;
+ break;
+
+ default:
+ BUG("we do not understand this type");
+ }
+
+ for (size_t i = 0; i < oids->nr; i++)
+ printf("%"PRIuMAX":%s:%s:%s\n",
+ tdata->batch_nr, typestr, path,
+ oid_to_hex(&oids->oid[i]));
+
+ tdata->batch_nr++;
+ return 0;
+}
+
+int cmd__path_walk(int argc, const char **argv)
+{
+ int res;
+ struct rev_info revs = REV_INFO_INIT;
+ struct path_walk_info info = PATH_WALK_INFO_INIT;
+ struct path_walk_test_data data = { 0 };
+ struct option options[] = {
+ OPT_END(),
+ };
+
+ setup_git_directory();
+ revs.repo = the_repository;
+
+ argc = parse_options(argc, argv, NULL,
+ options, path_walk_usage,
+ PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
+
+ if (argc > 1)
+ setup_revisions(argc, argv, &revs, NULL);
+ else
+ usage(path_walk_usage[0]);
+
+ info.revs = &revs;
+ info.path_fn = emit_block;
+ info.path_fn_data = &data;
+
+ res = walk_objects_by_path(&info);
+
+ printf("trees:%" PRIuMAX "\n"
+ "blobs:%" PRIuMAX "\n",
+ data.tree_nr, data.blob_nr);
+
+ release_revisions(&revs);
+ return res;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 1ebb69a5dc4..43676e7b93a 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -52,6 +52,7 @@ static struct test_cmd cmds[] = {
{ "parse-subcommand", cmd__parse_subcommand },
{ "partial-clone", cmd__partial_clone },
{ "path-utils", cmd__path_utils },
+ { "path-walk", cmd__path_walk },
{ "pcre2-config", cmd__pcre2_config },
{ "pkt-line", cmd__pkt_line },
{ "proc-receive", cmd__proc_receive },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..9cfc5da6e57 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -45,6 +45,7 @@ int cmd__parse_pathspec_file(int argc, const char** argv);
int cmd__parse_subcommand(int argc, const char **argv);
int cmd__partial_clone(int argc, const char **argv);
int cmd__path_utils(int argc, const char **argv);
+int cmd__path_walk(int argc, const char **argv);
int cmd__pcre2_config(int argc, const char **argv);
int cmd__pkt_line(int argc, const char **argv);
int cmd__proc_receive(int argc, const char **argv);
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
new file mode 100755
index 00000000000..a3da55933f4
--- /dev/null
+++ b/t/t6601-path-walk.sh
@@ -0,0 +1,120 @@
+#!/bin/sh
+
+TEST_PASSES_SANITIZE_LEAK=true
+
+test_description='direct path-walk API tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repository' '
+ git checkout -b base &&
+
+ mkdir left &&
+ mkdir right &&
+ echo a >a &&
+ echo b >left/b &&
+ echo c >right/c &&
+ git add . &&
+ git commit -m "first" &&
+
+ echo d >right/d &&
+ git add right &&
+ git commit -m "second" &&
+
+ echo bb >left/b &&
+ git commit -a -m "third" &&
+
+ git checkout -b topic HEAD~1 &&
+ echo cc >right/c &&
+ git commit -a -m "topic"
+'
+
+test_expect_success 'all' '
+ test-tool path-walk -- --all >out &&
+
+ cat >expect <<-EOF &&
+ 0:TREE::$(git rev-parse topic^{tree})
+ 0:TREE::$(git rev-parse base^{tree})
+ 0:TREE::$(git rev-parse base~1^{tree})
+ 0:TREE::$(git rev-parse base~2^{tree})
+ 1:TREE:right/:$(git rev-parse topic:right)
+ 1:TREE:right/:$(git rev-parse base~1:right)
+ 1:TREE:right/:$(git rev-parse base~2:right)
+ 2:BLOB:right/d:$(git rev-parse base~1:right/d)
+ 3:BLOB:right/c:$(git rev-parse base~2:right/c)
+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
+ 4:TREE:left/:$(git rev-parse base:left)
+ 4:TREE:left/:$(git rev-parse base~2:left)
+ 5:BLOB:left/b:$(git rev-parse base~2:left/b)
+ 5:BLOB:left/b:$(git rev-parse base:left/b)
+ 6:BLOB:a:$(git rev-parse base~2:a)
+ blobs:6
+ trees:9
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic only' '
+ test-tool path-walk -- topic >out &&
+
+ cat >expect <<-EOF &&
+ 0:TREE::$(git rev-parse topic^{tree})
+ 0:TREE::$(git rev-parse base~1^{tree})
+ 0:TREE::$(git rev-parse base~2^{tree})
+ 1:TREE:right/:$(git rev-parse topic:right)
+ 1:TREE:right/:$(git rev-parse base~1:right)
+ 1:TREE:right/:$(git rev-parse base~2:right)
+ 2:BLOB:right/d:$(git rev-parse base~1:right/d)
+ 3:BLOB:right/c:$(git rev-parse base~2:right/c)
+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
+ 4:TREE:left/:$(git rev-parse base~2:left)
+ 5:BLOB:left/b:$(git rev-parse base~2:left/b)
+ 6:BLOB:a:$(git rev-parse base~2:a)
+ blobs:5
+ trees:7
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base' '
+ test-tool path-walk -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:TREE::$(git rev-parse topic^{tree})
+ 1:TREE:right/:$(git rev-parse topic:right)
+ 2:BLOB:right/d:$(git rev-parse topic:right/d)
+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
+ 4:TREE:left/:$(git rev-parse topic:left)
+ 5:BLOB:left/b:$(git rev-parse topic:left/b)
+ 6:BLOB:a:$(git rev-parse topic:a)
+ blobs:4
+ trees:3
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, boundary' '
+ test-tool path-walk -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:TREE::$(git rev-parse topic^{tree})
+ 0:TREE::$(git rev-parse base~1^{tree})
+ 1:TREE:right/:$(git rev-parse topic:right)
+ 1:TREE:right/:$(git rev-parse base~1:right)
+ 2:BLOB:right/d:$(git rev-parse base~1:right/d)
+ 3:BLOB:right/c:$(git rev-parse base~1:right/c)
+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
+ 4:TREE:left/:$(git rev-parse base~1:left)
+ 5:BLOB:left/b:$(git rev-parse base~1:left/b)
+ 6:BLOB:a:$(git rev-parse base~1:a)
+ blobs:5
+ trees:5
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_done
--
gitgitgadget
* Re: [PATCH v2 3/6] t6601: add helper for testing path-walk API
2024-11-09 19:41 ` [PATCH v2 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
@ 2024-11-21 22:39 ` Taylor Blau
0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2024-11-21 22:39 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee
On Sat, Nov 09, 2024 at 07:41:09PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
>
> Add some tests based on the current behavior, doing interesting checks
> for different sets of branches, ranges, and the --boundary option. This
> sets a baseline for the behavior and we can extend it as new options are
> introduced.
>
> Store and output a 'batch_nr' value so we can demonstrate that the paths are
> grouped together in a batch and not following some other ordering. This
> allows us to test the depth-first behavior of the path-walk API. However, we
> purposefully do not test the order of the objects in the batch, so the
> output is compared to the expected output through a sort.
>
> It is important to mention that the behavior of the API will change soon as
> we start to handle UNINTERESTING objects differently, but these tests will
> demonstrate the change in behavior.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
Nice. I like the approach of implementing the API in a single commit,
and then demonstrating a trivial "caller" by way of a custom test
helper. I think that the artifact of having a test helper here is useful
on its own, but it also serves as a good example of how to use the API,
and provides something to actually test the implementation with.
I'm going to steal this pattern the next time I need to work on
something that necessitates a complex new API ;-).
> +static int emit_block(const char *path, struct oid_array *oids,
> + enum object_type type, void *data)
> +{
> + struct path_walk_test_data *tdata = data;
> + const char *typestr;
> +
> + switch (type) {
> + case OBJ_TREE:
> + typestr = "TREE";
> + tdata->tree_nr += oids->nr;
> + break;
> +
> + case OBJ_BLOB:
> + typestr = "BLOB";
> + tdata->blob_nr += oids->nr;
> + break;
> + default:
> + BUG("we do not understand this type");
> + }
I think you could write this as:

    if (type == OBJ_TREE)
    	tdata->tree_nr += oids->nr;
    else if (type == OBJ_BLOB)
    	tdata->blob_nr += oids->nr;
    else
    	BUG("we do not understand this type");

    typestr = type_name(type);

Which DRYs things up a bit and uses the type_name() helper. That will
give you strings like "tree" and "blob" instead of "TREE" and "BLOB",
but I'm not sure if the casing is important.
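
To make the casing trade-off concrete, here is a minimal standalone sketch
rather than code from the patch. The enum `sketch_type` and the helper
`sketch_type_name()` are hypothetical stand-ins for Git's `enum object_type`
and `type_name()`, assumed here to yield lowercase names as described above;
`format_line()` only mimics the `batch:type:path:oid` lines that the test
helper prints.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for Git's enum object_type. */
enum sketch_type { SKETCH_COMMIT = 1, SKETCH_TREE, SKETCH_BLOB };

/* Hypothetical stand-in for type_name(): lowercase names, NULL if unknown. */
static const char *sketch_type_name(enum sketch_type type)
{
	switch (type) {
	case SKETCH_COMMIT: return "commit";
	case SKETCH_TREE: return "tree";
	case SKETCH_BLOB: return "blob";
	}
	return NULL;
}

/*
 * Format one output line as "batch:type:path:oid".
 * Returns the snprintf() result, or -1 for an unknown type.
 */
static int format_line(char *buf, size_t len, unsigned long batch,
		       enum sketch_type type, const char *path,
		       const char *oid)
{
	const char *name = sketch_type_name(type);

	assert(buf && len);
	if (!name)
		return -1;
	return snprintf(buf, len, "%lu:%s:%s:%s", batch, name, path, oid);
}
```

If the helper switched to lowercase names in this way, the `expect` blocks in
t6601 would need to use `tree` and `blob` instead of `TREE` and `BLOB`.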
Thanks,
Taylor
* [PATCH v2 4/6] path-walk: allow consumer to specify object types
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
` (2 preceding siblings ...)
2024-11-09 19:41 ` [PATCH v2 3/6] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
@ 2024-11-09 19:41 ` Derrick Stolee via GitGitGadget
2024-11-21 22:44 ` Taylor Blau
2024-11-09 19:41 ` [PATCH v2 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
` (3 subsequent siblings)
7 siblings, 1 reply; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-11-09 19:41 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <derrickstolee@github.com>
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.
This adds the ability to ask for the commits in a list, as well. We
re-use the empty string for this set of objects because these are passed
directly to the callback function instead of being part of the
'path_stack'.
Future changes will add the ability to visit annotated tags.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
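As a rough illustration of how the three flags interact, here is a standalone
sketch of the decisions this patch makes. It is not the patch's code:
`struct sketch_flags`, `enum sketch_type`, and both helpers are hypothetical
names that only mirror the conditions in `walk_path()` and
`walk_objects_by_path()`.

```c
#include <assert.h>

/* Hypothetical stand-in for Git's enum object_type. */
enum sketch_type { SKETCH_COMMIT = 1, SKETCH_TREE, SKETCH_BLOB };

/* Hypothetical mirror of the 'commits', 'trees', and 'blobs' members. */
struct sketch_flags {
	int commits;
	int trees;
	int blobs;
};

/* All types enabled, matching the defaults in PATH_WALK_INFO_INIT. */
#define SKETCH_FLAGS_INIT { .commits = 1, .trees = 1, .blobs = 1 }

/* Should the callback be invoked for a batch of this type? */
static int should_call_path_fn(const struct sketch_flags *f,
			       enum sketch_type type)
{
	assert(f);
	switch (type) {
	case SKETCH_COMMIT: return f->commits;
	case SKETCH_TREE: return f->trees;
	case SKETCH_BLOB: return f->blobs;
	}
	return 0;
}

/*
 * Trees must still be expanded when only blobs are requested, since
 * blobs are discovered through their parent trees. The tree walk can
 * be skipped entirely only when neither trees nor blobs are wanted.
 */
static int needs_tree_walk(const struct sketch_flags *f)
{
	assert(f);
	return f->trees || f->blobs;
}
```

This matches what the 'only blobs' test expects: the callback reports no tree
batches, yet blobs under right/ and left/ are still found.
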
Documentation/technical/api-path-walk.txt | 9 ++
path-walk.c | 33 ++++-
path-walk.h | 14 +-
t/helper/test-path-walk.c | 18 ++-
t/t6601-path-walk.sh | 149 +++++++++++++++-------
5 files changed, 173 insertions(+), 50 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 662162ec70b..dce553b6114 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,6 +39,15 @@ It is also important that you do not specify the `--objects` flag for the
the objects will be walked in a separate way based on those starting
commits.
+`commits`, `blobs`, `trees`::
+ By default, these members are enabled and signal that the path-walk
+ API should call the `path_fn` on objects of these types. Specialized
+ applications could disable some options to make it simpler to walk
+ the objects or to have fewer calls to `path_fn`.
++
+While it is possible to walk only commits in this way, consumers would be
+better off using the revision walk API instead.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index 24cf04c1e7d..2ca08402367 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -98,6 +98,10 @@ static int add_children(struct path_walk_context *ctx,
if (S_ISGITLINK(entry.mode))
continue;
+ /* If the caller doesn't want blobs, then don't bother. */
+ if (!ctx->info->blobs && type == OBJ_BLOB)
+ continue;
+
if (type == OBJ_TREE) {
struct tree *child = lookup_tree(ctx->repo, &entry.oid);
o = child ? &child->object : NULL;
@@ -157,9 +161,11 @@ static int walk_path(struct path_walk_context *ctx,
if (!list->oids.nr)
return 0;
- /* Evaluate function pointer on this data. */
- ret = ctx->info->path_fn(path, &list->oids, list->type,
- ctx->info->path_fn_data);
+ /* Evaluate function pointer on this data, if requested. */
+ if ((list->type == OBJ_TREE && ctx->info->trees) ||
+ (list->type == OBJ_BLOB && ctx->info->blobs))
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
/* Expand data for children. */
if (list->type == OBJ_TREE) {
@@ -201,6 +207,7 @@ int walk_objects_by_path(struct path_walk_info *info)
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
struct type_and_oid_list *root_tree_list;
+ struct type_and_oid_list *commit_list;
struct path_walk_context ctx = {
.repo = info->revs->repo,
.revs = info->revs,
@@ -212,6 +219,9 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+ CALLOC_ARRAY(commit_list, 1);
+ commit_list->type = OBJ_COMMIT;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
@@ -222,10 +232,18 @@ int walk_objects_by_path(struct path_walk_info *info)
die(_("failed to setup revision walk"));
while ((c = get_revision(info->revs))) {
- struct object_id *oid = get_commit_tree_oid(c);
+ struct object_id *oid;
struct tree *t;
commits_nr++;
+ if (info->commits)
+ oid_array_append(&commit_list->oids,
+ &c->object.oid);
+
+ /* If we only care about commits, then skip trees. */
+ if (!info->trees && !info->blobs)
+ continue;
+
oid = get_commit_tree_oid(c);
t = lookup_tree(info->revs->repo, oid);
@@ -243,6 +261,13 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+ /* Track all commits. */
+ if (info->commits && commit_list->oids.nr)
+ ret = info->path_fn("", &commit_list->oids, OBJ_COMMIT,
+ info->path_fn_data);
+ oid_array_clear(&commit_list->oids);
+ free(commit_list);
+
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
while (!ret && ctx.path_stack.nr) {
char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
diff --git a/path-walk.h b/path-walk.h
index c9e94a98bc8..2d2afc29b47 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -30,9 +30,21 @@ struct path_walk_info {
*/
path_fn path_fn;
void *path_fn_data;
+
+ /**
+ * Initialize which object types the path_fn should be called on. This
+ * could also limit the walk to skip blobs if not set.
+ */
+ int commits;
+ int trees;
+ int blobs;
};
-#define PATH_WALK_INFO_INIT { 0 }
+#define PATH_WALK_INFO_INIT { \
+ .blobs = 1, \
+ .trees = 1, \
+ .commits = 1, \
+}
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index aa468871079..2b7e6e98d18 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -19,6 +19,8 @@ static const char * const path_walk_usage[] = {
struct path_walk_test_data {
uintmax_t batch_nr;
+
+ uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
};
@@ -30,6 +32,11 @@ static int emit_block(const char *path, struct oid_array *oids,
const char *typestr;
switch (type) {
+ case OBJ_COMMIT:
+ typestr = "COMMIT";
+ tdata->commit_nr += oids->nr;
+ break;
+
case OBJ_TREE:
typestr = "TREE";
tdata->tree_nr += oids->nr;
@@ -60,6 +67,12 @@ int cmd__path_walk(int argc, const char **argv)
struct path_walk_info info = PATH_WALK_INFO_INIT;
struct path_walk_test_data data = { 0 };
struct option options[] = {
+ OPT_BOOL(0, "blobs", &info.blobs,
+ N_("toggle inclusion of blob objects")),
+ OPT_BOOL(0, "commits", &info.commits,
+ N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "trees", &info.trees,
+ N_("toggle inclusion of tree objects")),
OPT_END(),
};
@@ -81,9 +94,10 @@ int cmd__path_walk(int argc, const char **argv)
res = walk_objects_by_path(&info);
- printf("trees:%" PRIuMAX "\n"
+ printf("commits:%" PRIuMAX "\n"
+ "trees:%" PRIuMAX "\n"
"blobs:%" PRIuMAX "\n",
- data.tree_nr, data.blob_nr);
+ data.commit_nr, data.tree_nr, data.blob_nr);
release_revisions(&revs);
return res;
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index a3da55933f4..dcd3c03a2e8 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -33,22 +33,27 @@ test_expect_success 'all' '
test-tool path-walk -- --all >out &&
cat >expect <<-EOF &&
- 0:TREE::$(git rev-parse topic^{tree})
- 0:TREE::$(git rev-parse base^{tree})
- 0:TREE::$(git rev-parse base~1^{tree})
- 0:TREE::$(git rev-parse base~2^{tree})
- 1:TREE:right/:$(git rev-parse topic:right)
- 1:TREE:right/:$(git rev-parse base~1:right)
- 1:TREE:right/:$(git rev-parse base~2:right)
- 2:BLOB:right/d:$(git rev-parse base~1:right/d)
- 3:BLOB:right/c:$(git rev-parse base~2:right/c)
- 3:BLOB:right/c:$(git rev-parse topic:right/c)
- 4:TREE:left/:$(git rev-parse base:left)
- 4:TREE:left/:$(git rev-parse base~2:left)
- 5:BLOB:left/b:$(git rev-parse base~2:left/b)
- 5:BLOB:left/b:$(git rev-parse base:left/b)
- 6:BLOB:a:$(git rev-parse base~2:a)
+ 0:COMMIT::$(git rev-parse topic)
+ 0:COMMIT::$(git rev-parse base)
+ 0:COMMIT::$(git rev-parse base~1)
+ 0:COMMIT::$(git rev-parse base~2)
+ 1:TREE::$(git rev-parse topic^{tree})
+ 1:TREE::$(git rev-parse base^{tree})
+ 1:TREE::$(git rev-parse base~1^{tree})
+ 1:TREE::$(git rev-parse base~2^{tree})
+ 2:TREE:right/:$(git rev-parse topic:right)
+ 2:TREE:right/:$(git rev-parse base~1:right)
+ 2:TREE:right/:$(git rev-parse base~2:right)
+ 3:BLOB:right/d:$(git rev-parse base~1:right/d)
+ 4:BLOB:right/c:$(git rev-parse base~2:right/c)
+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
+ 5:TREE:left/:$(git rev-parse base:left)
+ 5:TREE:left/:$(git rev-parse base~2:left)
+ 6:BLOB:left/b:$(git rev-parse base~2:left/b)
+ 6:BLOB:left/b:$(git rev-parse base:left/b)
+ 7:BLOB:a:$(git rev-parse base~2:a)
blobs:6
+ commits:4
trees:9
EOF
@@ -59,19 +64,23 @@ test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&
cat >expect <<-EOF &&
- 0:TREE::$(git rev-parse topic^{tree})
- 0:TREE::$(git rev-parse base~1^{tree})
- 0:TREE::$(git rev-parse base~2^{tree})
- 1:TREE:right/:$(git rev-parse topic:right)
- 1:TREE:right/:$(git rev-parse base~1:right)
- 1:TREE:right/:$(git rev-parse base~2:right)
- 2:BLOB:right/d:$(git rev-parse base~1:right/d)
- 3:BLOB:right/c:$(git rev-parse base~2:right/c)
- 3:BLOB:right/c:$(git rev-parse topic:right/c)
- 4:TREE:left/:$(git rev-parse base~2:left)
- 5:BLOB:left/b:$(git rev-parse base~2:left/b)
- 6:BLOB:a:$(git rev-parse base~2:a)
+ 0:COMMIT::$(git rev-parse topic)
+ 0:COMMIT::$(git rev-parse base~1)
+ 0:COMMIT::$(git rev-parse base~2)
+ 1:TREE::$(git rev-parse topic^{tree})
+ 1:TREE::$(git rev-parse base~1^{tree})
+ 1:TREE::$(git rev-parse base~2^{tree})
+ 2:TREE:right/:$(git rev-parse topic:right)
+ 2:TREE:right/:$(git rev-parse base~1:right)
+ 2:TREE:right/:$(git rev-parse base~2:right)
+ 3:BLOB:right/d:$(git rev-parse base~1:right/d)
+ 4:BLOB:right/c:$(git rev-parse base~2:right/c)
+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
+ 5:TREE:left/:$(git rev-parse base~2:left)
+ 6:BLOB:left/b:$(git rev-parse base~2:left/b)
+ 7:BLOB:a:$(git rev-parse base~2:a)
blobs:5
+ commits:3
trees:7
EOF
@@ -82,15 +91,66 @@ test_expect_success 'topic, not base' '
test-tool path-walk -- topic --not base >out &&
cat >expect <<-EOF &&
+ 0:COMMIT::$(git rev-parse topic)
+ 1:TREE::$(git rev-parse topic^{tree})
+ 2:TREE:right/:$(git rev-parse topic:right)
+ 3:BLOB:right/d:$(git rev-parse topic:right/d)
+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
+ 5:TREE:left/:$(git rev-parse topic:left)
+ 6:BLOB:left/b:$(git rev-parse topic:left/b)
+ 7:BLOB:a:$(git rev-parse topic:a)
+ blobs:4
+ commits:1
+ trees:3
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only blobs' '
+ test-tool path-walk --no-trees --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ trees:0
+ 0:BLOB:right/d:$(git rev-parse topic:right/d)
+ 1:BLOB:right/c:$(git rev-parse topic:right/c)
+ 2:BLOB:left/b:$(git rev-parse topic:left/b)
+ 3:BLOB:a:$(git rev-parse topic:a)
+ blobs:4
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+# No, this doesn't make a lot of sense for the path-walk API,
+# but it is possible to do.
+test_expect_success 'topic, not base, only commits' '
+ test-tool path-walk --no-blobs --no-trees \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:COMMIT::$(git rev-parse topic)
+ commits:1
+ trees:0
+ blobs:0
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only trees' '
+ test-tool path-walk --no-blobs --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
0:TREE::$(git rev-parse topic^{tree})
1:TREE:right/:$(git rev-parse topic:right)
- 2:BLOB:right/d:$(git rev-parse topic:right/d)
- 3:BLOB:right/c:$(git rev-parse topic:right/c)
- 4:TREE:left/:$(git rev-parse topic:left)
- 5:BLOB:left/b:$(git rev-parse topic:left/b)
- 6:BLOB:a:$(git rev-parse topic:a)
- blobs:4
+ 2:TREE:left/:$(git rev-parse topic:left)
trees:3
+ blobs:0
EOF
test_cmp_sorted expect out
@@ -100,17 +160,20 @@ test_expect_success 'topic, not base, boundary' '
test-tool path-walk -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
- 0:TREE::$(git rev-parse topic^{tree})
- 0:TREE::$(git rev-parse base~1^{tree})
- 1:TREE:right/:$(git rev-parse topic:right)
- 1:TREE:right/:$(git rev-parse base~1:right)
- 2:BLOB:right/d:$(git rev-parse base~1:right/d)
- 3:BLOB:right/c:$(git rev-parse base~1:right/c)
- 3:BLOB:right/c:$(git rev-parse topic:right/c)
- 4:TREE:left/:$(git rev-parse base~1:left)
- 5:BLOB:left/b:$(git rev-parse base~1:left/b)
- 6:BLOB:a:$(git rev-parse base~1:a)
+ 0:COMMIT::$(git rev-parse topic)
+ 0:COMMIT::$(git rev-parse base~1)
+ 1:TREE::$(git rev-parse topic^{tree})
+ 1:TREE::$(git rev-parse base~1^{tree})
+ 2:TREE:right/:$(git rev-parse topic:right)
+ 2:TREE:right/:$(git rev-parse base~1:right)
+ 3:BLOB:right/d:$(git rev-parse base~1:right/d)
+ 4:BLOB:right/c:$(git rev-parse base~1:right/c)
+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
+ 5:TREE:left/:$(git rev-parse base~1:left)
+ 6:BLOB:left/b:$(git rev-parse base~1:left/b)
+ 7:BLOB:a:$(git rev-parse base~1:a)
blobs:5
+ commits:2
trees:5
EOF
--
gitgitgadget
* Re: [PATCH v2 4/6] path-walk: allow consumer to specify object types
2024-11-09 19:41 ` [PATCH v2 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
@ 2024-11-21 22:44 ` Taylor Blau
0 siblings, 0 replies; 67+ messages in thread
From: Taylor Blau @ 2024-11-21 22:44 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
On Sat, Nov 09, 2024 at 07:41:10PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/path-walk.c b/path-walk.c
> index 24cf04c1e7d..2ca08402367 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -98,6 +98,10 @@ static int add_children(struct path_walk_context *ctx,
> if (S_ISGITLINK(entry.mode))
> continue;
>
> + /* If the caller doesn't want blobs, then don't bother. */
> + if (!ctx->info->blobs && type == OBJ_BLOB)
> + continue;
> +
I was going to ask why we're not reusing the existing property of the
rev_info struct to specify what object type(s) we do/don't want, but the
answer is obvious: we don't want that to change the behavior of the
commit-level walk which is used to seed the actual path walk based on
its results.
However, it would be kind of nice to have a single place to specify how
you want to traverse objects, and using the rev_info struct seems like a
good choice there since you can naively ask Git to parse command-line
arguments like --blobs, --trees, --objects, etc. and set the appropriate
bits.
I wonder if it might make sense to do something like:

    unsigned int tmp_blobs = revs->blob_objects;
    unsigned int tmp_trees = revs->tree_objects;
    unsigned int tmp_tags = revs->tag_objects;

    if (prepare_revision_walk(revs))
    	die(_("failed to setup revision walk"));

    /* commit-level walk */

    revs->blob_objects = tmp_blobs;
    revs->tree_objects = tmp_trees;
    revs->tag_objects = tmp_tags;

    /* path-level walk */

I don't have strong feelings about it, but it feels like this approach
could be cleaner from a caller's perspective. But I can see an argument
to the contrary that it does introduce some awkwardness with the
dual-use of those fields.
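
One possible reading of that idea, written as a standalone sketch: the
`struct sketch_revs` bit-fields are a hypothetical subset of `rev_info`, and
both helpers are invented for illustration. The step that clears the bits
before the commit-level walk is an assumption on my part; the snippet above
does not spell it out.

```c
#include <assert.h>

/* Hypothetical subset of the rev_info object-type bits. */
struct sketch_revs {
	unsigned int blob_objects : 1;
	unsigned int tree_objects : 1;
	unsigned int tag_objects : 1;
};

/* The caller's original request, held while the commit walk runs. */
struct sketch_saved {
	unsigned int blobs;
	unsigned int trees;
	unsigned int tags;
};

/* Remember the caller's request, then clear it (assumed step). */
static struct sketch_saved stash_object_bits(struct sketch_revs *revs)
{
	struct sketch_saved saved;

	assert(revs);
	saved.blobs = revs->blob_objects;
	saved.trees = revs->tree_objects;
	saved.tags = revs->tag_objects;
	revs->blob_objects = 0;
	revs->tree_objects = 0;
	revs->tag_objects = 0;
	return saved;
}

/* Put the caller's request back before the path-level walk. */
static void restore_object_bits(struct sketch_revs *revs,
				const struct sketch_saved *saved)
{
	assert(revs && saved);
	revs->blob_objects = saved->blobs;
	revs->tree_objects = saved->trees;
	revs->tag_objects = saved->tags;
}
```

The awkwardness mentioned above shows up here too: between the two calls the
fields no longer describe what the caller asked for.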
Thanks,
Taylor
* [PATCH v2 5/6] path-walk: visit tags and cached objects
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
` (3 preceding siblings ...)
2024-11-09 19:41 ` [PATCH v2 4/6] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
@ 2024-11-09 19:41 ` Derrick Stolee via GitGitGadget
2024-11-09 19:41 ` [PATCH v2 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
` (2 subsequent siblings)
7 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-11-09 19:41 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The rev_info that is specified for a path-walk traversal may specify
visiting tag refs (both lightweight and annotated) and also may specify
indexed objects (blobs and trees). Update the path-walk API to walk
these objects as well.
When walking tags, we need to peel the annotated objects until reaching
a non-tag object. If we reach a commit, then we can add it to the
pending objects to make sure we visit it in the commit walk portion. If we
reach a tree, then we will assume that it is a root tree. If we reach a
blob, then we have no good path name and so add it to a new list of
"tagged blobs".
When the rev_info includes the "--indexed-objects" flag, then the
pending set includes blobs and trees found in the cache entries and
cache-tree. The cache entries are usually blobs, though they could be
trees in the case of a sparse index. The cache-tree stores
previously-hashed tree objects but these are cleared out when staging
objects below those paths. We add tests that demonstrate this.
The indexed objects come with a non-NULL 'path' value in the pending
item. This allows us to prepopulate the 'paths_to_lists' strmap with
lists for these paths.
The tricky thing about this walk is that we will want to combine the
indexed objects walk with the commit walk, especially in the future case
of walking objects during a command like 'git repack'.
Whenever possible, we want the objects from the index to be grouped with
similar objects in history. We don't want to miss any paths that appear
only in the index and not in the commit history.
Thus, we need to be careful to let the path stack be populated initially
with only the root tree path (and possibly tags and tagged blobs) and go
through the normal depth-first search. Afterwards, if there are other
paths that are remaining in the paths_to_lists strmap, we should then
iterate through the stack and visit those objects recursively.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
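The tag-peeling rule described above can be sketched in isolation. This is a
simplified model under stated assumptions, not the patch's implementation:
`struct sketch_obj`, its `seen` field, and `peel_tag_chain()` are hypothetical
stand-ins for Git's `struct object`, the `SEEN` flag, and the loop in
`setup_pending_objects()`. The sketch also assumes every tag has a non-NULL
target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Git's enum object_type. */
enum sketch_type { SKETCH_COMMIT = 1, SKETCH_TREE, SKETCH_BLOB, SKETCH_TAG };

/* Hypothetical stand-in for struct object plus the tag's target. */
struct sketch_obj {
	enum sketch_type type;
	int seen;                  /* models the SEEN flag */
	struct sketch_obj *tagged; /* target, used only for tags */
};

/*
 * Follow a chain of annotated tags until a non-tag object is reached.
 * Each newly visited tag is marked seen and counted in *tags_nr.
 * Returns the peeled object, or NULL if the chain reaches a tag that
 * was already visited (its target was handled on an earlier call).
 */
static struct sketch_obj *peel_tag_chain(struct sketch_obj *obj,
					 size_t *tags_nr)
{
	assert(obj && tags_nr);
	while (obj->type == SKETCH_TAG) {
		if (obj->seen)
			return NULL;
		obj->seen = 1;
		(*tags_nr)++;
		obj = obj->tagged;
	}
	return obj;
}
```

In the test setup, `second.2` points at `second.1`, which points at a commit;
under this rule each of the two tags is recorded once, and starting again from
`second.1` adds nothing.
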
Documentation/technical/api-path-walk.txt | 2 +-
path-walk.c | 174 +++++++++++++++++++-
path-walk.h | 2 +
t/helper/test-path-walk.c | 18 ++-
t/t6601-path-walk.sh | 186 +++++++++++++++++++---
5 files changed, 356 insertions(+), 26 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index dce553b6114..6022c381b7c 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,7 +39,7 @@ It is also important that you do not specify the `--objects` flag for the
the objects will be walked in a separate way based on those starting
commits.
-`commits`, `blobs`, `trees`::
+`commits`, `blobs`, `trees`, `tags`::
By default, these members are enabled and signal that the path-walk
API should call the `path_fn` on objects of these types. Specialized
applications could disable some options to make it simpler to walk
diff --git a/path-walk.c b/path-walk.c
index 2ca08402367..a1f539dcd46 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -13,10 +13,13 @@
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
+#include "tag.h"
#include "trace2.h"
#include "tree.h"
#include "tree-walk.h"
+static const char *root_path = "";
+
struct type_and_oid_list
{
enum object_type type;
@@ -158,12 +161,16 @@ static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
+ if (!list)
+ BUG("provided path '%s' that had no associated list", path);
+
if (!list->oids.nr)
return 0;
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
- (list->type == OBJ_BLOB && ctx->info->blobs))
+ (list->type == OBJ_BLOB && ctx->info->blobs) ||
+ (list->type == OBJ_TAG && ctx->info->tags))
ret = ctx->info->path_fn(path, &list->oids, list->type,
ctx->info->path_fn_data);
@@ -194,6 +201,134 @@ static void clear_strmap(struct strmap *map)
strmap_init(map);
}
+static void setup_pending_objects(struct path_walk_info *info,
+ struct path_walk_context *ctx)
+{
+ struct type_and_oid_list *tags = NULL;
+ struct type_and_oid_list *tagged_blobs = NULL;
+ struct type_and_oid_list *root_tree_list = NULL;
+
+ if (info->tags)
+ CALLOC_ARRAY(tags, 1);
+ if (info->blobs)
+ CALLOC_ARRAY(tagged_blobs, 1);
+ if (info->trees)
+ root_tree_list = strmap_get(&ctx->paths_to_lists, root_path);
+
+ /*
+ * Pending objects include:
+ * * Commits at branch tips.
+ * * Annotated tags at tag tips.
+ * * Any kind of object at lightweight tag tips.
+ * * Trees and blobs in the index (with an associated path).
+ */
+ for (size_t i = 0; i < info->revs->pending.nr; i++) {
+ struct object_array_entry *pending = info->revs->pending.objects + i;
+ struct object *obj = pending->item;
+
+ /* Commits will be picked up by revision walk. */
+ if (obj->type == OBJ_COMMIT)
+ continue;
+
+ /* Navigate annotated tag object chains. */
+ while (obj->type == OBJ_TAG) {
+ struct tag *tag = lookup_tag(info->revs->repo, &obj->oid);
+ if (!tag)
+ break;
+ if (tag->object.flags & SEEN)
+ break;
+ tag->object.flags |= SEEN;
+
+ if (tags)
+ oid_array_append(&tags->oids, &obj->oid);
+ obj = tag->tagged;
+ }
+
+ if (obj->type == OBJ_TAG)
+ continue;
+
+ /* We are now at a non-tag object. */
+ if (obj->flags & SEEN)
+ continue;
+ obj->flags |= SEEN;
+
+ switch (obj->type) {
+ case OBJ_TREE:
+ if (!info->trees)
+ continue;
+ if (pending->path) {
+ struct type_and_oid_list *list;
+ char *path = *pending->path ? xstrfmt("%s/", pending->path)
+ : xstrdup("");
+ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = OBJ_TREE;
+ strmap_put(&ctx->paths_to_lists, path, list);
+ }
+ oid_array_append(&list->oids, &obj->oid);
+ free(path);
+ } else {
+ /* assume a root tree, such as a lightweight tag. */
+ oid_array_append(&root_tree_list->oids, &obj->oid);
+ }
+ break;
+
+ case OBJ_BLOB:
+ if (!info->blobs)
+ continue;
+ if (pending->path) {
+ struct type_and_oid_list *list;
+ char *path = pending->path;
+ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = OBJ_BLOB;
+ strmap_put(&ctx->paths_to_lists, path, list);
+ }
+ oid_array_append(&list->oids, &obj->oid);
+ } else {
+ /* no path: assume a tagged blob, such as a lightweight tag. */
+ oid_array_append(&tagged_blobs->oids, &obj->oid);
+ }
+ break;
+
+ case OBJ_COMMIT:
+ /* Make sure it is in the object walk */
+ if (obj != pending->item)
+ add_pending_object(info->revs, obj, "");
+ break;
+
+ default:
+ BUG("should not see any other type here");
+ }
+ }
+
+ /*
+ * Add tag objects and tagged blobs if they exist.
+ */
+ if (tagged_blobs) {
+ if (tagged_blobs->oids.nr) {
+ const char *tagged_blob_path = "/tagged-blobs";
+ tagged_blobs->type = OBJ_BLOB;
+ push_to_stack(ctx, tagged_blob_path);
+ strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
+ } else {
+ oid_array_clear(&tagged_blobs->oids);
+ free(tagged_blobs);
+ }
+ }
+ if (tags) {
+ if (tags->oids.nr) {
+ const char *tag_path = "/tags";
+ tags->type = OBJ_TAG;
+ push_to_stack(ctx, tag_path);
+ strmap_put(&ctx->paths_to_lists, tag_path, tags);
+ } else {
+ oid_array_clear(&tags->oids);
+ free(tags);
+ }
+ }
+}
+
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
* call 'info->path_fn' on each discovered path.
@@ -202,7 +337,6 @@ static void clear_strmap(struct strmap *map)
*/
int walk_objects_by_path(struct path_walk_info *info)
{
- const char *root_path = "";
int ret = 0;
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
@@ -222,15 +356,31 @@ int walk_objects_by_path(struct path_walk_info *info)
CALLOC_ARRAY(commit_list, 1);
commit_list->type = OBJ_COMMIT;
+ if (info->tags)
+ info->revs->tag_objects = 1;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
push_to_stack(&ctx, root_path);
+ /*
+ * Set these values before preparing the walk to catch
+ * lightweight tags pointing to non-commits and indexed objects.
+ */
+ info->revs->blob_objects = info->blobs;
+ info->revs->tree_objects = info->trees;
+
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ info->revs->blob_objects = info->revs->tree_objects = 0;
+
+ trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
+ setup_pending_objects(info, &ctx);
+ trace2_region_leave("path-walk", "pending-walk", info->revs->repo);
+
while ((c = get_revision(info->revs))) {
struct object_id *oid;
struct tree *t;
@@ -278,6 +428,26 @@ int walk_objects_by_path(struct path_walk_info *info)
free(path);
}
+
+ /* Are there paths remaining? Likely they are from indexed objects. */
+ if (!strmap_empty(&ctx.paths_to_lists)) {
+ struct hashmap_iter iter;
+ struct strmap_entry *entry;
+
+ strmap_for_each_entry(&ctx.paths_to_lists, &iter, entry)
+ push_to_stack(&ctx, entry->key);
+
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ }
+
trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
trace2_region_leave("path-walk", "path-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index 2d2afc29b47..ca839f873e4 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -38,12 +38,14 @@ struct path_walk_info {
int commits;
int trees;
int blobs;
+ int tags;
};
#define PATH_WALK_INFO_INIT { \
.blobs = 1, \
.trees = 1, \
.commits = 1, \
+ .tags = 1, \
}
/**
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 2b7e6e98d18..265bd0b443b 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -23,6 +23,7 @@ struct path_walk_test_data {
uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
+ uintmax_t tag_nr;
};
static int emit_block(const char *path, struct oid_array *oids,
@@ -47,10 +48,20 @@ static int emit_block(const char *path, struct oid_array *oids,
tdata->blob_nr += oids->nr;
break;
+ case OBJ_TAG:
+ typestr = "TAG";
+ tdata->tag_nr += oids->nr;
+ break;
+
default:
BUG("we do not understand this type");
}
+ /* This should never be output during tests. */
+ if (!oids->nr)
+ printf("%"PRIuMAX":%s:%s:EMPTY\n",
+ tdata->batch_nr, typestr, path);
+
for (size_t i = 0; i < oids->nr; i++)
printf("%"PRIuMAX":%s:%s:%s\n",
tdata->batch_nr, typestr, path,
@@ -71,6 +82,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of blob objects")),
OPT_BOOL(0, "commits", &info.commits,
N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "tags", &info.tags,
+ N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
OPT_END(),
@@ -96,8 +109,9 @@ int cmd__path_walk(int argc, const char **argv)
printf("commits:%" PRIuMAX "\n"
"trees:%" PRIuMAX "\n"
- "blobs:%" PRIuMAX "\n",
- data.commit_nr, data.tree_nr, data.blob_nr);
+ "blobs:%" PRIuMAX "\n"
+ "tags:%" PRIuMAX "\n",
+ data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
release_revisions(&revs);
return res;
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index dcd3c03a2e8..bf43ab0e22a 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -9,29 +9,142 @@ test_description='direct path-walk API tests'
test_expect_success 'setup test repository' '
git checkout -b base &&
+ # Make some objects that will only be reachable
+ # via non-commit tags.
+ mkdir child &&
+ echo file >child/file &&
+ git add child &&
+ git commit -m "will abandon" &&
+ git tag -a -m "tree" tree-tag HEAD^{tree} &&
+ echo file2 >file2 &&
+ git add file2 &&
+ git commit --amend -m "will abandon" &&
+ git tag tree-tag2 HEAD^{tree} &&
+
+ echo blob >file &&
+ blob_oid=$(git hash-object -t blob -w --stdin <file) &&
+ git tag -a -m "blob" blob-tag "$blob_oid" &&
+ echo blob2 >file2 &&
+ blob2_oid=$(git hash-object -t blob -w --stdin <file2) &&
+ git tag blob-tag2 "$blob2_oid" &&
+
+ rm -fr child file file2 &&
+
mkdir left &&
mkdir right &&
echo a >a &&
echo b >left/b &&
echo c >right/c &&
git add . &&
- git commit -m "first" &&
+ git commit --amend -m "first" &&
+ git tag -m "first" first HEAD &&
echo d >right/d &&
git add right &&
git commit -m "second" &&
+ git tag -a -m "second (under)" second.1 HEAD &&
+ git tag -a -m "second (top)" second.2 second.1 &&
+ # Set up file/dir collision in history.
+ rm a &&
+ mkdir a &&
+ echo a >a/a &&
echo bb >left/b &&
- git commit -a -m "third" &&
+ git add a left &&
+ git commit -m "third" &&
+ git tag -a -m "third" third &&
git checkout -b topic HEAD~1 &&
echo cc >right/c &&
- git commit -a -m "topic"
+ git commit -a -m "topic" &&
+ git tag -a -m "fourth" fourth
'
test_expect_success 'all' '
test-tool path-walk -- --all >out &&
+ cat >expect <<-EOF &&
+ 0:COMMIT::$(git rev-parse topic)
+ 0:COMMIT::$(git rev-parse base)
+ 0:COMMIT::$(git rev-parse base~1)
+ 0:COMMIT::$(git rev-parse base~2)
+ 1:TAG:/tags:$(git rev-parse refs/tags/first)
+ 1:TAG:/tags:$(git rev-parse refs/tags/second.1)
+ 1:TAG:/tags:$(git rev-parse refs/tags/second.2)
+ 1:TAG:/tags:$(git rev-parse refs/tags/third)
+ 1:TAG:/tags:$(git rev-parse refs/tags/fourth)
+ 1:TAG:/tags:$(git rev-parse refs/tags/tree-tag)
+ 1:TAG:/tags:$(git rev-parse refs/tags/blob-tag)
+ 2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
+ 2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ 3:TREE::$(git rev-parse topic^{tree})
+ 3:TREE::$(git rev-parse base^{tree})
+ 3:TREE::$(git rev-parse base~1^{tree})
+ 3:TREE::$(git rev-parse base~2^{tree})
+ 3:TREE::$(git rev-parse refs/tags/tree-tag^{})
+ 3:TREE::$(git rev-parse refs/tags/tree-tag2^{})
+ 4:BLOB:a:$(git rev-parse base~2:a)
+ 5:TREE:right/:$(git rev-parse topic:right)
+ 5:TREE:right/:$(git rev-parse base~1:right)
+ 5:TREE:right/:$(git rev-parse base~2:right)
+ 6:BLOB:right/d:$(git rev-parse base~1:right/d)
+ 7:BLOB:right/c:$(git rev-parse base~2:right/c)
+ 7:BLOB:right/c:$(git rev-parse topic:right/c)
+ 8:TREE:left/:$(git rev-parse base:left)
+ 8:TREE:left/:$(git rev-parse base~2:left)
+ 9:BLOB:left/b:$(git rev-parse base~2:left/b)
+ 9:BLOB:left/b:$(git rev-parse base:left/b)
+ 10:TREE:a/:$(git rev-parse base:a)
+ 11:BLOB:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
+ 12:TREE:child/:$(git rev-parse refs/tags/tree-tag:child)
+ 13:BLOB:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+ blobs:10
+ commits:4
+ tags:7
+ trees:13
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'indexed objects' '
+ test_when_finished git reset --hard &&
+
+ # stage change into index, adding a blob but
+ # also invalidating the cache-tree for the root
+ # and the "left" directory.
+ echo bogus >left/c &&
+ git add left &&
+
+ test-tool path-walk -- --indexed-objects >out &&
+
+ cat >expect <<-EOF &&
+ 0:BLOB:a:$(git rev-parse HEAD:a)
+ 1:BLOB:left/b:$(git rev-parse HEAD:left/b)
+ 2:BLOB:left/c:$(git rev-parse :left/c)
+ 3:BLOB:right/c:$(git rev-parse HEAD:right/c)
+ 4:BLOB:right/d:$(git rev-parse HEAD:right/d)
+ 5:TREE:right/:$(git rev-parse topic:right)
+ blobs:5
+ commits:0
+ tags:0
+ trees:1
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'branches and indexed objects mix well' '
+ test_when_finished git reset --hard &&
+
+ # stage change into index, adding a blob but
+ # also invalidating the cache-tree for the root
+ # and the "right" directory.
+ echo fake >right/d &&
+ git add right &&
+
+ test-tool path-walk -- --indexed-objects --branches >out &&
+
cat >expect <<-EOF &&
0:COMMIT::$(git rev-parse topic)
0:COMMIT::$(git rev-parse base)
@@ -41,20 +154,23 @@ test_expect_success 'all' '
1:TREE::$(git rev-parse base^{tree})
1:TREE::$(git rev-parse base~1^{tree})
1:TREE::$(git rev-parse base~2^{tree})
- 2:TREE:right/:$(git rev-parse topic:right)
- 2:TREE:right/:$(git rev-parse base~1:right)
- 2:TREE:right/:$(git rev-parse base~2:right)
- 3:BLOB:right/d:$(git rev-parse base~1:right/d)
- 4:BLOB:right/c:$(git rev-parse base~2:right/c)
- 4:BLOB:right/c:$(git rev-parse topic:right/c)
- 5:TREE:left/:$(git rev-parse base:left)
- 5:TREE:left/:$(git rev-parse base~2:left)
- 6:BLOB:left/b:$(git rev-parse base~2:left/b)
- 6:BLOB:left/b:$(git rev-parse base:left/b)
- 7:BLOB:a:$(git rev-parse base~2:a)
- blobs:6
+ 2:BLOB:a:$(git rev-parse base~2:a)
+ 3:TREE:right/:$(git rev-parse topic:right)
+ 3:TREE:right/:$(git rev-parse base~1:right)
+ 3:TREE:right/:$(git rev-parse base~2:right)
+ 4:BLOB:right/d:$(git rev-parse base~1:right/d)
+ 4:BLOB:right/d:$(git rev-parse :right/d)
+ 5:BLOB:right/c:$(git rev-parse base~2:right/c)
+ 5:BLOB:right/c:$(git rev-parse topic:right/c)
+ 6:TREE:left/:$(git rev-parse base:left)
+ 6:TREE:left/:$(git rev-parse base~2:left)
+ 7:BLOB:left/b:$(git rev-parse base:left/b)
+ 7:BLOB:left/b:$(git rev-parse base~2:left/b)
+ 8:TREE:a/:$(git rev-parse refs/tags/third:a)
+ blobs:7
commits:4
- trees:9
+ tags:0
+ trees:10
EOF
test_cmp_sorted expect out
@@ -81,6 +197,7 @@ test_expect_success 'topic only' '
7:BLOB:a:$(git rev-parse base~2:a)
blobs:5
commits:3
+ tags:0
trees:7
EOF
@@ -101,6 +218,7 @@ test_expect_success 'topic, not base' '
7:BLOB:a:$(git rev-parse topic:a)
blobs:4
commits:1
+ tags:0
trees:3
EOF
@@ -112,13 +230,14 @@ test_expect_success 'topic, not base, only blobs' '
-- topic --not base >out &&
cat >expect <<-EOF &&
- commits:0
- trees:0
0:BLOB:right/d:$(git rev-parse topic:right/d)
1:BLOB:right/c:$(git rev-parse topic:right/c)
2:BLOB:left/b:$(git rev-parse topic:left/b)
3:BLOB:a:$(git rev-parse topic:a)
blobs:4
+ commits:0
+ tags:0
+ trees:0
EOF
test_cmp_sorted expect out
@@ -133,8 +252,9 @@ test_expect_success 'topic, not base, only commits' '
cat >expect <<-EOF &&
0:COMMIT::$(git rev-parse topic)
commits:1
- trees:0
blobs:0
+ tags:0
+ trees:0
EOF
test_cmp_sorted expect out
@@ -145,12 +265,13 @@ test_expect_success 'topic, not base, only trees' '
-- topic --not base >out &&
cat >expect <<-EOF &&
- commits:0
0:TREE::$(git rev-parse topic^{tree})
1:TREE:right/:$(git rev-parse topic:right)
2:TREE:left/:$(git rev-parse topic:left)
- trees:3
+ commits:0
blobs:0
+ tags:0
+ trees:3
EOF
test_cmp_sorted expect out
@@ -174,10 +295,33 @@ test_expect_success 'topic, not base, boundary' '
7:BLOB:a:$(git rev-parse base~1:a)
blobs:5
commits:2
+ tags:0
trees:5
EOF
test_cmp_sorted expect out
'
+test_expect_success 'trees are reported exactly once' '
+ test_when_finished "rm -rf unique-trees" &&
+ test_create_repo unique-trees &&
+ (
+ cd unique-trees &&
+ mkdir initial &&
+ test_commit initial/file &&
+
+ git switch -c move-to-top &&
+ git mv initial/file.t ./ &&
+ test_tick &&
+ git commit -m moved &&
+
+ git update-ref refs/heads/other HEAD
+ ) &&
+
+ test-tool -C unique-trees path-walk -- --all >out &&
+ tree=$(git -C unique-trees rev-parse HEAD:) &&
+ grep "$tree" out >out-filtered &&
+ test_line_count = 1 out-filtered
+'
+
test_done
--
gitgitgadget
* [PATCH v2 6/6] path-walk: mark trees and blobs as UNINTERESTING
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
` (4 preceding siblings ...)
2024-11-09 19:41 ` [PATCH v2 5/6] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
@ 2024-11-09 19:41 ` Derrick Stolee via GitGitGadget
2024-11-21 22:57 ` [PATCH v2 0/6] PATH WALK I: The path-walk API Taylor Blau
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
7 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-11-09 19:41 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
When the input rev_info has UNINTERESTING starting points, we want to be
sure that the UNINTERESTING flag is passed appropriately through the
objects. To match how this is done in places such as 'git pack-objects', we
use the mark_edges_uninteresting() method.
This method has an option for using the "sparse" walk, which is similar in
spirit to the path-walk API's walk. To be sure to keep it independent, add a
new 'prune_all_uninteresting' option to the path_walk_info struct.
To check how the UNINTERESTING flag is spread through our objects, extend the
'test-tool path-walk' command to output whether or not an object has that
flag. This changes our tests significantly, including the removal of some
objects that were previously visited due to the incomplete implementation.
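The pruning decision described above can be sketched in isolation. This is only an illustration under stated assumptions, not the code from the patch: 'struct sketch_object', 'struct sketch_batch', SKETCH_UNINTERESTING and both functions are hypothetical stand-ins for Git's object flags and the per-path 'type_and_oid_list'. It shows the two-step check: a cheap hint recorded while objects are added, followed by a re-scan because flags can change after insertion.

```c
#include <stddef.h>

/* Hypothetical stand-ins; not Git's real structs or flag values. */
#define SKETCH_UNINTERESTING (1u << 1)
#define SKETCH_BATCH_MAX 16

struct sketch_object {
	unsigned flags;
};

struct sketch_batch {
	struct sketch_object *objects[SKETCH_BATCH_MAX];
	size_t nr;
	int maybe_interesting; /* hint recorded at insertion time */
};

/* Add an object to the batch, remembering if it looked interesting. */
static int sketch_batch_add(struct sketch_batch *batch,
			    struct sketch_object *obj)
{
	if (batch->nr >= SKETCH_BATCH_MAX)
		return -1;
	if (!(obj->flags & SKETCH_UNINTERESTING))
		batch->maybe_interesting = 1;
	batch->objects[batch->nr++] = obj;
	return 0;
}

/*
 * Return 1 if the whole batch can be skipped, 0 if it must be visited.
 * Without pruning, every batch is visited.
 */
static int sketch_batch_should_prune(struct sketch_batch *batch,
				     int prune_all_uninteresting)
{
	if (!prune_all_uninteresting)
		return 0;

	/* Every object was already flagged when it was added. */
	if (!batch->maybe_interesting)
		return 1;

	/* Flags may have been set after insertion, so check again. */
	batch->maybe_interesting = 0;
	for (size_t i = 0; i < batch->nr && !batch->maybe_interesting; i++)
		if (!(batch->objects[i]->flags & SKETCH_UNINTERESTING))
			batch->maybe_interesting = 1;

	return !batch->maybe_interesting;
}
```

A batch holding one flagged and one unflagged object is visited; once the second object is flagged as well, the same call reports that the batch can be pruned.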
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 8 +++
path-walk.c | 73 +++++++++++++++++++++
path-walk.h | 8 +++
t/helper/test-path-walk.c | 12 +++-
t/t6601-path-walk.sh | 79 +++++++++++++++++------
5 files changed, 158 insertions(+), 22 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 6022c381b7c..7075d0d5ab5 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -48,6 +48,14 @@ commits.
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.
+`prune_all_uninteresting`::
+ By default, all reachable paths are emitted by the path-walk API.
+ This option allows consumers to declare that they are not
+ interested in paths where all included objects are marked with the
+ `UNINTERESTING` flag. This requires using the `boundary` option in
+ the revision walk so that the walk emits commits marked with the
+ `UNINTERESTING` flag.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index a1f539dcd46..896ec0c4779 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -8,6 +8,7 @@
#include "dir.h"
#include "hashmap.h"
#include "hex.h"
+#include "list-objects.h"
#include "object.h"
#include "oid-array.h"
#include "revision.h"
@@ -24,6 +25,7 @@ struct type_and_oid_list
{
enum object_type type;
struct oid_array oids;
+ int maybe_interesting;
};
#define TYPE_AND_OID_LIST_INIT { \
@@ -140,6 +142,9 @@ static int add_children(struct path_walk_context *ctx,
if (o->flags & SEEN)
continue;
o->flags |= SEEN;
+
+ if (!(o->flags & UNINTERESTING))
+ list->maybe_interesting = 1;
oid_array_append(&list->oids, &entry.oid);
}
@@ -167,6 +172,43 @@ static int walk_path(struct path_walk_context *ctx,
if (!list->oids.nr)
return 0;
+ if (ctx->info->prune_all_uninteresting) {
+ /*
+ * This is true if all objects were UNINTERESTING
+ * when added to the list.
+ */
+ if (!list->maybe_interesting)
+ return 0;
+
+ /*
+ * But it's still possible that the objects were set
+ * as UNINTERESTING after being added. Do a quick check.
+ */
+ list->maybe_interesting = 0;
+ for (size_t i = 0;
+ !list->maybe_interesting && i < list->oids.nr;
+ i++) {
+ if (list->type == OBJ_TREE) {
+ struct tree *t = lookup_tree(ctx->repo,
+ &list->oids.oid[i]);
+ if (t && !(t->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else if (list->type == OBJ_BLOB) {
+ struct blob *b = lookup_blob(ctx->repo,
+ &list->oids.oid[i]);
+ if (b && !(b->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else {
+ /* Tags are always interesting if visited. */
+ list->maybe_interesting = 1;
+ }
+ }
+
+ /* We have confirmed that all objects are UNINTERESTING. */
+ if (!list->maybe_interesting)
+ return 0;
+ }
+
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
(list->type == OBJ_BLOB && ctx->info->blobs) ||
@@ -201,6 +243,26 @@ static void clear_strmap(struct strmap *map)
strmap_init(map);
}
+static struct repository *edge_repo;
+static struct type_and_oid_list *edge_tree_list;
+
+static void show_edge(struct commit *commit)
+{
+ struct tree *t = repo_get_commit_tree(edge_repo, commit);
+
+ if (!t)
+ return;
+
+ if (commit->object.flags & UNINTERESTING)
+ t->object.flags |= UNINTERESTING;
+
+ if (t->object.flags & SEEN)
+ return;
+ t->object.flags |= SEEN;
+
+ oid_array_append(&edge_tree_list->oids, &t->object.oid);
+}
+
static void setup_pending_objects(struct path_walk_info *info,
struct path_walk_context *ctx)
{
@@ -309,6 +371,7 @@ static void setup_pending_objects(struct path_walk_info *info,
if (tagged_blobs->oids.nr) {
const char *tagged_blob_path = "/tagged-blobs";
tagged_blobs->type = OBJ_BLOB;
+ tagged_blobs->maybe_interesting = 1;
push_to_stack(ctx, tagged_blob_path);
strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
} else {
@@ -320,6 +383,7 @@ static void setup_pending_objects(struct path_walk_info *info,
if (tags->oids.nr) {
const char *tag_path = "/tags";
tags->type = OBJ_TAG;
+ tags->maybe_interesting = 1;
push_to_stack(ctx, tag_path);
strmap_put(&ctx->paths_to_lists, tag_path, tags);
} else {
@@ -362,6 +426,7 @@ int walk_objects_by_path(struct path_walk_info *info)
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
+ root_tree_list->maybe_interesting = 1;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
push_to_stack(&ctx, root_path);
@@ -375,6 +440,14 @@ int walk_objects_by_path(struct path_walk_info *info)
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ /* Walk trees to mark them as UNINTERESTING. */
+ edge_repo = info->revs->repo;
+ edge_tree_list = root_tree_list;
+ mark_edges_uninteresting(info->revs, show_edge,
+ info->prune_all_uninteresting);
+ edge_repo = NULL;
+ edge_tree_list = NULL;
+
info->revs->blob_objects = info->revs->tree_objects = 0;
trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index ca839f873e4..de0db007dc9 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -39,6 +39,14 @@ struct path_walk_info {
int trees;
int blobs;
int tags;
+
+ /**
+ * When 'prune_all_uninteresting' is set and a path has all objects
+ * marked as UNINTERESTING, then the path-walk will not visit those
+ * objects. It will not call path_fn on those objects and will not
+ * walk the children of such trees.
+ */
+ int prune_all_uninteresting;
};
#define PATH_WALK_INFO_INIT { \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 265bd0b443b..7e791cfaf97 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -62,10 +62,14 @@ static int emit_block(const char *path, struct oid_array *oids,
printf("%"PRIuMAX":%s:%s:EMPTY\n",
tdata->batch_nr, typestr, path);
- for (size_t i = 0; i < oids->nr; i++)
- printf("%"PRIuMAX":%s:%s:%s\n",
+ for (size_t i = 0; i < oids->nr; i++) {
+ struct object *o = lookup_unknown_object(the_repository,
+ &oids->oid[i]);
+ printf("%"PRIuMAX":%s:%s:%s%s\n",
tdata->batch_nr, typestr, path,
- oid_to_hex(&oids->oid[i]));
+ oid_to_hex(&oids->oid[i]),
+ o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
+ }
tdata->batch_nr++;
return 0;
@@ -86,6 +90,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
+ OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
+ N_("toggle pruning of uninteresting paths")),
OPT_END(),
};
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index bf43ab0e22a..d3c0015319a 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -211,11 +211,11 @@ test_expect_success 'topic, not base' '
0:COMMIT::$(git rev-parse topic)
1:TREE::$(git rev-parse topic^{tree})
2:TREE:right/:$(git rev-parse topic:right)
- 3:BLOB:right/d:$(git rev-parse topic:right/d)
+ 3:BLOB:right/d:$(git rev-parse topic:right/d):UNINTERESTING
4:BLOB:right/c:$(git rev-parse topic:right/c)
- 5:TREE:left/:$(git rev-parse topic:left)
- 6:BLOB:left/b:$(git rev-parse topic:left/b)
- 7:BLOB:a:$(git rev-parse topic:a)
+ 5:TREE:left/:$(git rev-parse topic:left):UNINTERESTING
+ 6:BLOB:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 7:BLOB:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
commits:1
tags:0
@@ -225,15 +225,38 @@ test_expect_success 'topic, not base' '
test_cmp_sorted expect out
'
+test_expect_success 'fourth, blob-tag2, not base' '
+ test-tool path-walk -- fourth blob-tag2 --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:COMMIT::$(git rev-parse topic)
+ 1:TAG:/tags:$(git rev-parse fourth)
+ 2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ 3:TREE::$(git rev-parse topic^{tree})
+ 4:TREE:right/:$(git rev-parse topic:right)
+ 5:BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ 6:BLOB:right/c:$(git rev-parse topic:right/c)
+ 7:TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 8:BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 9:BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+ blobs:5
+ commits:1
+ tags:1
+ trees:3
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_expect_success 'topic, not base, only blobs' '
test-tool path-walk --no-trees --no-commits \
-- topic --not base >out &&
cat >expect <<-EOF &&
- 0:BLOB:right/d:$(git rev-parse topic:right/d)
+ 0:BLOB:right/d:$(git rev-parse topic:right/d):UNINTERESTING
1:BLOB:right/c:$(git rev-parse topic:right/c)
- 2:BLOB:left/b:$(git rev-parse topic:left/b)
- 3:BLOB:a:$(git rev-parse topic:a)
+ 2:BLOB:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 3:BLOB:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
commits:0
tags:0
@@ -267,7 +290,7 @@ test_expect_success 'topic, not base, only trees' '
cat >expect <<-EOF &&
0:TREE::$(git rev-parse topic^{tree})
1:TREE:right/:$(git rev-parse topic:right)
- 2:TREE:left/:$(git rev-parse topic:left)
+ 2:TREE:left/:$(git rev-parse topic:left):UNINTERESTING
commits:0
blobs:0
tags:0
@@ -282,17 +305,17 @@ test_expect_success 'topic, not base, boundary' '
cat >expect <<-EOF &&
0:COMMIT::$(git rev-parse topic)
- 0:COMMIT::$(git rev-parse base~1)
+ 0:COMMIT::$(git rev-parse base~1):UNINTERESTING
1:TREE::$(git rev-parse topic^{tree})
- 1:TREE::$(git rev-parse base~1^{tree})
+ 1:TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
2:TREE:right/:$(git rev-parse topic:right)
- 2:TREE:right/:$(git rev-parse base~1:right)
- 3:BLOB:right/d:$(git rev-parse base~1:right/d)
- 4:BLOB:right/c:$(git rev-parse base~1:right/c)
+ 2:TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
+ 3:BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ 4:BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
4:BLOB:right/c:$(git rev-parse topic:right/c)
- 5:TREE:left/:$(git rev-parse base~1:left)
- 6:BLOB:left/b:$(git rev-parse base~1:left/b)
- 7:BLOB:a:$(git rev-parse base~1:a)
+ 5:TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 6:BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 7:BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
blobs:5
commits:2
tags:0
@@ -302,6 +325,27 @@ test_expect_success 'topic, not base, boundary' '
test_cmp_sorted expect out
'
+test_expect_success 'topic, not base, boundary with pruning' '
+ test-tool path-walk --prune -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:COMMIT::$(git rev-parse topic)
+ 0:COMMIT::$(git rev-parse base~1):UNINTERESTING
+ 1:TREE::$(git rev-parse topic^{tree})
+ 1:TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
+ 2:TREE:right/:$(git rev-parse topic:right)
+ 2:TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
+ 3:BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
+ blobs:2
+ commits:2
+ tags:0
+ trees:4
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_expect_success 'trees are reported exactly once' '
test_when_finished "rm -rf unique-trees" &&
test_create_repo unique-trees &&
@@ -309,15 +353,12 @@ test_expect_success 'trees are reported exactly once' '
cd unique-trees &&
mkdir initial &&
test_commit initial/file &&
-
git switch -c move-to-top &&
git mv initial/file.t ./ &&
test_tick &&
git commit -m moved &&
-
git update-ref refs/heads/other HEAD
) &&
-
test-tool -C unique-trees path-walk -- --all >out &&
tree=$(git -C unique-trees rev-parse HEAD:) &&
grep "$tree" out >out-filtered &&
--
gitgitgadget
* Re: [PATCH v2 0/6] PATH WALK I: The path-walk API
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
` (5 preceding siblings ...)
2024-11-09 19:41 ` [PATCH v2 6/6] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
@ 2024-11-21 22:57 ` Taylor Blau
2024-11-25 8:56 ` Patrick Steinhardt
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
7 siblings, 1 reply; 67+ messages in thread
From: Taylor Blau @ 2024-11-21 22:57 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, ps, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee
On Sat, Nov 09, 2024 at 07:41:06PM +0000, Derrick Stolee via GitGitGadget wrote:
> Derrick Stolee (6):
> path-walk: introduce an object walk by path
> test-lib-functions: add test_cmp_sorted
> t6601: add helper for testing path-walk API
> path-walk: allow consumer to specify object types
> path-walk: visit tags and cached objects
> path-walk: mark trees and blobs as UNINTERESTING
My apologies for taking so long to review this. Having read through the
patches in detail, a couple of thoughts:
- First, I like the structure that you decided on for this series. It
nicely demonstrates a minimal caller for this new API instead of
implementing a bunch of untested code. I think that's a great way to
lay out things up until this point.
- Second, I read through the existing API and only had minor comments.
I read through the implementation in detail and found it to match my
expectation of how each step should function.
So my take-away from spending a few hours with this series is that
everything seems on track so far, and I think this is in a good spot to
build on for more path-walk features.
That all said, I am still not totally sold on the idea that we need a
separate path-based traversal given the significant benefits of the
full-name hash approach that I reviewed earlier today.
To be clear, I am totally willing to believe that there are some
benefits to doing the path-walk approach on top, but I think we should
consider those benefits relative to the large amount of highly
non-trivial code that we're adding in order to power it.
So I'm not strongly opposed or in favor of the approach pursued starting
in this series; I just think that it's worth spending time as a group
(beyond just you and I) considering the benefits and costs associated
with it.
As for the patches themselves, I think that cooking them for a long time
in 'next' makes most sense. We will want to land this patch series if
and only if we decide that the traversal powered by this API is the
right approach. IOW, I don't think it makes sense to have the path-walk
stuff in the tree if it has no callers outside of the test helper
provided by this series.
OK. I think that's a good stopping point for me on the list today, and I
look forward to your responses :-).
Thanks,
Taylor
* Re: [PATCH v2 0/6] PATH WALK I: The path-walk API
2024-11-21 22:57 ` [PATCH v2 0/6] PATH WALK I: The path-walk API Taylor Blau
@ 2024-11-25 8:56 ` Patrick Steinhardt
2024-11-26 7:39 ` Junio C Hamano
0 siblings, 1 reply; 67+ messages in thread
From: Patrick Steinhardt @ 2024-11-25 8:56 UTC (permalink / raw)
To: Taylor Blau
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy, karthik nayak, Derrick Stolee
On Thu, Nov 21, 2024 at 05:57:25PM -0500, Taylor Blau wrote:
> On Sat, Nov 09, 2024 at 07:41:06PM +0000, Derrick Stolee via GitGitGadget wrote:
> > Derrick Stolee (6):
> > path-walk: introduce an object walk by path
> > test-lib-functions: add test_cmp_sorted
> > t6601: add helper for testing path-walk API
> > path-walk: allow consumer to specify object types
> > path-walk: visit tags and cached objects
> > path-walk: mark trees and blobs as UNINTERESTING
>
> My apologies for taking so long to review this. Having read through the
> patches in detail, a couple of thoughts:
>
> - First, I like the structure that you decided on for this series. It
> nicely demonstrates a minimal caller for this new API instead of
> implementing a bunch of untested code. I think that's a great way to
> lay out things up until this point.
>
> - Second, I read through the existing API and only had minor comments.
> I read through the implementation in detail and found it to match my
> expectation of how each step should function.
>
> So my take-away from spending a few hours with this series is that
> everything seems on track so far, and I think this is in a good spot to
> build on for more path-walk features.
>
> That all said, I am still not totally sold on the idea that we need a
> separate path-based traversal given the significant benefits of the
> full-name hash approach that I reviewed earlier today.
The repo size reductions achieved via the path-walk API were only one of
the selling points of this series. And from my current understanding we
will likely not end up realizing those gains via path-walk, but rather
via the much simpler full-name hash algorithm indeed.
But there were two more selling points:
- git-survey(1) as a native replacement for git-sizer(1). I think it
is a great idea to have a native tool that allows us to gain deep
insights into a repository so that we get better signals from our
users in case they face problems with their repository. I'd love to
have this tool as a baseline for an extensible format where we can
eventually also start reporting the health state of refs as well as
any auxiliary data structures.
- git-backfill(1) as a helper to fetch blobs more efficiently from a
promisor remote. This is a boon to have as well in our odyssey
towards a better UI/UX with huge monorepos.
Both of these tools are quite exciting to me, and there is a need for
having such tools from my point of view.
The question of course is whether these tools require the path-walk API,
or whether they could be built on top of existing functionality. But if
there are good reasons why the existing functionality is insufficient
then I'd be all for having the path-walk API, even if it doesn't help us
with repo size reductions as we initially thought.
Patrick
* Re: [PATCH v2 0/6] PATH WALK I: The path-walk API
2024-11-25 8:56 ` Patrick Steinhardt
@ 2024-11-26 7:39 ` Junio C Hamano
2024-11-26 7:43 ` Patrick Steinhardt
0 siblings, 1 reply; 67+ messages in thread
From: Junio C Hamano @ 2024-11-26 7:39 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy, karthik nayak, Derrick Stolee
Patrick Steinhardt <ps@pks.im> writes:
> The question of course is whether these tools require the path-walk API,
> or whether they could be built on top of existing functionality. But if
> there are good reasons why the existing functionality is insufficient
> then I'd be all for having the path-walk API, even if it doesn't help us
> with repo size reductions as we initially thought.
Is the implied statement that we didn't quite see sufficient rationale
to convince ourselves that a new path-walk machinery is needed?
Thanks.
* Re: [PATCH v2 0/6] PATH WALK I: The path-walk API
2024-11-26 7:39 ` Junio C Hamano
@ 2024-11-26 7:43 ` Patrick Steinhardt
2024-11-26 8:16 ` Junio C Hamano
0 siblings, 1 reply; 67+ messages in thread
From: Patrick Steinhardt @ 2024-11-26 7:43 UTC (permalink / raw)
To: Junio C Hamano
Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy, karthik nayak, Derrick Stolee
On Tue, Nov 26, 2024 at 04:39:17PM +0900, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> > The question of course is whether these tools require the path-walk API,
> > or whether they could be built on top of existing functionality. But if
> > there are good reasons why the existing functionality is insufficient
> > then I'd be all for having the path-walk API, even if it doesn't help us
> > with repo size reductions as we initially thought.
>
> Is the implied statement that we didn't quite see sufficient rationale
> to convince ourselves that a new path-walk machinery is needed?
No, it's rather that I didn't find the time yet to have a deeper look at
the patch series to figure out for myself whether the path-walk API is
needed for them. So I was trying to prompt Derrick with the above to
find out whether he thinks that it is needed for both of these features
and if so why the existing APIs are insufficient.
I'm already sold on the idea of git-survey(1) and git-backfill(1), so if
there are two use cases where the API makes sense I'm happy to have the
additional complexity even if it's not needed anymore for the repo size
reduction.
Patrick
* Re: [PATCH v2 0/6] PATH WALK I: The path-walk API
2024-11-26 7:43 ` Patrick Steinhardt
@ 2024-11-26 8:16 ` Junio C Hamano
0 siblings, 0 replies; 67+ messages in thread
From: Junio C Hamano @ 2024-11-26 8:16 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Taylor Blau, Derrick Stolee via GitGitGadget, git,
johannes.schindelin, peff, johncai86, newren, christian.couder,
kristofferhaugsbakk, jonathantanmy, karthik nayak, Derrick Stolee
Patrick Steinhardt <ps@pks.im> writes:
>> > then I'd be all for having the path-walk API, even if it doesn't help us
>> > with repo size reductions as we initially thought.
>>
>> Is the implied statement that we didn't quite see sufficient rationale
>> to convince ourselves that a new path-walk machinery is needed?
>
> No, it's rather that I didn't find the time yet to have a deeper look at
> the patch series to figure out for myself whether the path-walk API is
> needed for them. So I was trying to prompt Derrick with the above to
> find out whether he thinks that it is needed for both of these features
> and if so why the existing APIs are insufficient.
>
> I'm already sold on the idea of git-survey(1) and git-backfill(1), so if
> there are two use cases where the API makes sense I'm happy to have the
> additional complexity even if it's not needed anymore for the repo size
> reduction.
Ah, I misspoke and failed to add "in order to implement these new
features" after "is needed". I like the idea of "backfill" thing,
too (as code paths to deal with promisor remotes irritate me ;-).
Thanks.
* [PATCH v3 0/7] PATH WALK I: The path-walk API
2024-11-09 19:41 ` [PATCH v2 " Derrick Stolee via GitGitGadget
` (6 preceding siblings ...)
2024-11-21 22:57 ` [PATCH v2 0/6] PATH WALK I: The path-walk API Taylor Blau
@ 2024-12-06 19:45 ` Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
` (8 more replies)
7 siblings, 9 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 19:45 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee
Introduction and relation to prior series
=========================================
This is a new series that rerolls the initial "path-walk API" patches of my
RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
path-walk.h) presents a new way to walk objects such that trees and blobs
are walked in batches according to their path.
This also replaces the previous version of ds/path-walk that was being
reviewed in [2]. The consensus was that the series was too long/dense and
could use some reduction in size. This series takes the first few patches,
but also makes some updates (which will be described later).
[1]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
[RFC] Path-walk API and applications
[2]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/
[PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
This series only introduces the path-walk API, but does so to the full
complexity required to later add the integration with git pack-objects to
improve packing compression in both time and space for repositories with
many name hash collisions. For other repositories, the compression still
improves, though the packing time may not.
Some of the changes that are present in this series that differ from the
previous version are motivated directly by discoveries made by testing the
feature in Git for Windows and microsoft/git forks that shipped these
features for fast delivery of these improvements to users who needed them.
That testing across many environments informed some things that needed to be
changed, and in this series those changes are checked by tests in the
t6601-path-walk.sh test script and the test-tool path-walk test helper.
Thus, the code being introduced in this series is covered by tests even
though it is not integrated into the git executable.
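To picture what "walked in batches according to their path" means, here is a
rough approximation using only existing commands. It is a sketch, not the API:
the helper name group_objects_by_path is made up for illustration, and unlike
the path-walk API it does not append "/" to tree paths or deduplicate objects
across paths.

```shell
# Hypothetical helper, not part of Git or of this series.
# Reads "<oid> <path>" lines (the format of `git rev-list --objects`) on
# stdin and prints "<path><TAB><oid>", sorted so that all objects seen at
# the same path are adjacent. Commits and root trees have no path and are
# grouped under the empty string.
group_objects_by_path () {
	awk '{
		path = ""
		if (NF > 1)
			path = substr($0, length($1) + 2)
		print path "\t" $1
	}' | LC_ALL=C sort -t "$(printf '\t')" -k1,1
}

# Example, run inside a repository:
#   git rev-list --objects --all | group_objects_by_path
```

This only regroups the output after a full walk; the point of the API is to
hand each per-path group to a callback during the walk itself.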
Discussion of follow-up applications
====================================
By splitting this series out into its own, I was able to reorganize the
patches such that each application can be built independently off of this
series. These are available as pending PRs in gitgitgadget/git:
* Better delta compression with 'git pack-objects' [3]: This application
allows an option in 'git pack-objects' to change how objects are walked
in order to group objects with the same path for early delta compression
before using the name hash sort to look for cross-path deltas. This helps
significantly in repositories with many name-hash collisions. This
reduces the size of 'git push' packfiles via a config option and reduces
the total repo size in 'git repack'.
* The 'git backfill' command [4]: This command downloads missing blobs in a
blobless partial clone. In order to save space and network bandwidth, it
assumes that objects at a common path are likely to delta well with each
other, so it downloads missing blobs in batches via the path-walk API.
This presents a way to use blobless clones as a pseudo-resumable clone,
since the initial clone of commits and trees is a smaller initial
download and the batch size allows downloading blobs incrementally. When
pairing this command with the sparse-checkout feature, the path-walk API
is adjusted to focus on the paths within the sparse-checkout. This allows
the user to only download the files they are likely to need when
inspecting history within their scope without downloading the entire
repository history.
* The 'git survey' command [5]: This application begins the work to mimic
the behavior of git-sizer, but to use internal data structures for better
performance and careful understanding of how objects are stored. Using
the path-walk API, paths with many versions can be considered in a batch
and sorted into a list to report the paths that contribute most to the
size of the repository. A version of this command was used to help
confirm the issues with the name hash collisions. It was also used to
diagnose why some repacks using the --path-walk option were taking more
space than without for some repositories. (More on this later.)
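As a rough illustration of the backfill idea above: in a blobless clone (for
example one made with 'git clone --filter=blob:none'), the existing
'git rev-list --objects --missing=print' prints each missing object with a "?"
prefix. The sketch below chops that list into fixed-size batches. The helper
name batch_missing_oids is hypothetical, and it batches by count only, since
the missing-object lines carry no path; grouping by path is what the path-walk
API would add.

```shell
# Hypothetical helper, not part of Git or of this series.
# Reads `git rev-list --objects --missing=print` output on stdin, keeps only
# the missing objects (lines starting with "?"), and prints them in batches
# of at most $1 object IDs per line (default: 2).
batch_missing_oids () {
	batch_size=${1:-2}
	sed -n 's/^?//p' | awk -v n="$batch_size" '
		{ line = (count % n == 0) ? $1 : line " " $1; count++ }
		count % n == 0 { print line }
		END { if (count % n != 0) print line }
	'
}

# Example, run inside a blobless partial clone:
#   git rev-list --objects --missing=print HEAD | batch_missing_oids 100
```

Each output line could then be handed to whatever fetch mechanism is in use;
that step is left out here because it depends on the remote setup.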
Question for reviewers: I am prepped to send these three applications to the
mailing list, but I'll refrain for now to avoid causing too much noise for
folks. Would you like to see them on-list while this series is under review?
Or would you prefer to explore the PRs ([3] [4] and [5])?
[3] https://github.com/gitgitgadget/git/pull/1819
PATH WALK II: Add --path-walk option to 'git pack-objects'
[4] https://github.com/gitgitgadget/git/pull/1820
PATH WALK III: Add 'git backfill' command
[5] https://github.com/gitgitgadget/git/pull/1821
PATH WALK IV: Add 'git survey' command
Structure of the Patch Series
=============================
This patch series attempts to create the simplest version of the API in
patch 1, then build functionality incrementally. During the process, each
change will introduce an update to:
* The path-walk API itself in path-walk.c and path-walk.h.
* The documentation of the API in
Documentation/technical/api-path-walk.txt.
* The test script t/t6601-path-walk.sh.
The core of the API relies on using a 'struct rev_info' to define an initial
set of objects and some form of a commit walk to define what range of
objects to visit. Initially, only a subset of 'struct rev_info' options work
as expected. For example:
* Patch 1 assumes that only commit objects are starting positions, but the
focus is on exploring trees and blobs.
* Patch 4 allows users to specify object types, which includes outputting
the visited commits in a batch.
* Annotated tags and indexed objects are considered in Patch 5. These are
grouped because they both exist within the 'pending' object list.
* UNINTERESTING objects are not considered until Patch 6.
Changes in v1 (since previous version)
======================================
There are a few hard-won learnings from previous versions of this series due
to testing this in the wild with many different repositories.
* Initially, the 'git pack-objects --path-walk' feature was not tested with
the '--shallow' option because it was expected that this option was for
servers creating a pack containing shallow commits. However, this option
is also used when pushing from a shallow clone, and this was a critical
feature that we needed to reduce the size of commits pushed from
automated environments that were bootstrapped by shallow clones. The crux
of the change is in Patch 6 and how UNINTERESTING objects are handled. We
no longer need to push the UNINTERESTING flag around the objects
ourselves and can use existing logic in list-objects.c to do so. This
allows using the --objects-edge-aggressive option when necessary to
reduce the object count when pushing from a shallow clone. (The
pack-objects series expands on tests to cover this integration point.)
* When looking into cases where 'git repack -adf' outperformed 'git repack
-adf --path-walk', I discovered that the issue did not reproduce in a
bare repository. This is due to 'git repack' iterating over all indexed
objects before walking commits. I had inadvertently put all indexed
objects in their own category, leading to no good deltas with previous
versions of those files; I had also not used the 'path' option from the
pending list, so these objects had invalid name hash values. You will see
in patch 5 that the pending list is handled quite differently and the
'--indexed-objects' option is tested directly within t6601.
* I added a new 'test_cmp_sorted' helper because I wanted to simplify some
repeated sections of t6601.
* Patch 1 has significantly more context than it did before.
* Annotated tags are given a name of "/tags" to differentiate them slightly
from root trees and commits.
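For readers unfamiliar with the idea behind a "compare sorted" test helper,
the following is a minimal stand-alone sketch. It is not the test_cmp_sorted
added by this series: the name cmp_sorted is made up, and it uses plain
diff(1) rather than Git's test_cmp, so that it runs outside the test suite.

```shell
# Hypothetical stand-in, not the helper from t/test-lib-functions.sh.
# Compares two files while ignoring line order. On a mismatch the sorted
# copies are left behind so they can be inspected.
cmp_sorted () {
	sort <"$1" >"$1.sorted" &&
	sort <"$2" >"$2.sorted" &&
	diff -u "$1.sorted" "$2.sorted" &&
	rm -f "$1.sorted" "$2.sorted"
}

# Example:
#   cmp_sorted expect out
```

Ignoring line order keeps the expected output readable when the order of
batches or of objects within a batch is not something the test wants to pin
down.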
Changes in v2
=============
* Updated the test helper to output the batch number, allowing us to
confirm that OIDs are grouped appropriately. This also signaled a few
cases where the callback function was being called on an empty set.
* This change has resulted in significant changes to the test data,
including reordered lines and prepended batch numbers.
* Thanks to Patrick for providing a recommended change to remove memory
leaks from the test helper.
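The helper output seen in the range-diff has the shape
"<batch>:<type>:<path>:<oid>", followed by summary lines such as "blobs:6".
As a small sketch of how those per-object lines relate to the summary, the
hypothetical function below (count_by_type is not part of the series)
recomputes the per-type totals. It assumes paths contain no ":" character.

```shell
# Hypothetical helper, not part of Git or of this series.
# Reads "<batch>:<type>:<path>:<oid>" lines on stdin and prints one
# "<type>s:<count>" line per object type, sorted by type name. Lines that do
# not have exactly four fields (such as the summary lines) are ignored.
count_by_type () {
	awk -F: '
		NF == 4 { n[$2]++ }
		END { for (t in n) print t "s:" n[t] }
	' | LC_ALL=C sort
}

# Example, assuming a built test-tool from this series is on PATH:
#   test-tool path-walk -- --all | count_by_type
```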
Changes in v3
=============
* Updated test helper to use type_name(), which leads to a change to use
lowercase strings in the test scripts. That will lead to the range-diff
looking pretty terrible.
* Added a new patch that changes the visit order of the path-walk API. The
intention is to reduce memory pressure by emitting blob paths before
recursing into tree paths. This also has the effect of visiting blobs and
trees in lexicographic order instead of the reverse.
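One way to picture the new ordering among the children of a single tree is
sketched below: blobs first, then trees, each in lexicographic order. This is
only an illustration under that reading of the description; the function name
order_sibling_paths is made up, the real code may arrange this differently,
and the sketch assumes paths contain no tab characters.

```shell
# Hypothetical helper, not part of Git or of this series.
# Reads "<type> <path>" lines on stdin, where <type> is "blob" or "tree",
# and prints them with all blobs before all trees, each group sorted by
# path in byte order.
order_sibling_paths () {
	awk '{ rank = ($1 == "blob") ? 0 : 1; print rank "\t" $0 }' |
	LC_ALL=C sort -t "$(printf '\t')" -k1,1n -k2 |
	cut -f2-
}

# Example:
#   printf 'tree right/\nblob a\ntree left/\n' | order_sibling_paths
```

Emitting the blob paths first means their batches can be handed to the
callback and released before any subtree is expanded, which matches the
stated goal of reducing memory pressure.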
Thanks, -Stolee
Derrick Stolee (7):
path-walk: introduce an object walk by path
test-lib-functions: add test_cmp_sorted
t6601: add helper for testing path-walk API
path-walk: allow consumer to specify object types
path-walk: visit tags and cached objects
path-walk: mark trees and blobs as UNINTERESTING
path-walk: reorder object visits
Documentation/technical/api-path-walk.txt | 63 +++
Makefile | 2 +
path-walk.c | 567 ++++++++++++++++++++++
path-walk.h | 65 +++
t/helper/test-path-walk.c | 112 +++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t6601-path-walk.sh | 368 ++++++++++++++
t/test-lib-functions.sh | 10 +
9 files changed, 1189 insertions(+)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/t6601-path-walk.sh
base-commit: e9356ba3ea2a6754281ff7697b3e5a1697b21e24
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1818%2Fderrickstolee%2Fapi-upstream-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1818/derrickstolee/api-upstream-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1818
Range-diff vs v2:
1: b7e9b81e8b3 = 1: b7e9b81e8b3 path-walk: introduce an object walk by path
2: cf2ed61b324 = 2: cf2ed61b324 test-lib-functions: add test_cmp_sorted
3: a3c754d93cc ! 3: 54886fcb081 t6601: add helper for testing path-walk API
@@ t/helper/test-path-walk.c (new)
+ struct path_walk_test_data *tdata = data;
+ const char *typestr;
+
-+ switch (type) {
-+ case OBJ_TREE:
-+ typestr = "TREE";
++ if (type == OBJ_TREE)
+ tdata->tree_nr += oids->nr;
-+ break;
-+
-+ case OBJ_BLOB:
-+ typestr = "BLOB";
++ else if (type == OBJ_BLOB)
+ tdata->blob_nr += oids->nr;
-+ break;
-+
-+ default:
++ else
+ BUG("we do not understand this type");
-+ }
++
++ typestr = type_name(type);
+
+ for (size_t i = 0; i < oids->nr; i++)
+ printf("%"PRIuMAX":%s:%s:%s\n",
@@ t/t6601-path-walk.sh (new)
+ test-tool path-walk -- --all >out &&
+
+ cat >expect <<-EOF &&
-+ 0:TREE::$(git rev-parse topic^{tree})
-+ 0:TREE::$(git rev-parse base^{tree})
-+ 0:TREE::$(git rev-parse base~1^{tree})
-+ 0:TREE::$(git rev-parse base~2^{tree})
-+ 1:TREE:right/:$(git rev-parse topic:right)
-+ 1:TREE:right/:$(git rev-parse base~1:right)
-+ 1:TREE:right/:$(git rev-parse base~2:right)
-+ 2:BLOB:right/d:$(git rev-parse base~1:right/d)
-+ 3:BLOB:right/c:$(git rev-parse base~2:right/c)
-+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 4:TREE:left/:$(git rev-parse base:left)
-+ 4:TREE:left/:$(git rev-parse base~2:left)
-+ 5:BLOB:left/b:$(git rev-parse base~2:left/b)
-+ 5:BLOB:left/b:$(git rev-parse base:left/b)
-+ 6:BLOB:a:$(git rev-parse base~2:a)
++ 0:tree::$(git rev-parse topic^{tree})
++ 0:tree::$(git rev-parse base^{tree})
++ 0:tree::$(git rev-parse base~1^{tree})
++ 0:tree::$(git rev-parse base~2^{tree})
++ 1:tree:right/:$(git rev-parse topic:right)
++ 1:tree:right/:$(git rev-parse base~1:right)
++ 1:tree:right/:$(git rev-parse base~2:right)
++ 2:blob:right/d:$(git rev-parse base~1:right/d)
++ 3:blob:right/c:$(git rev-parse base~2:right/c)
++ 3:blob:right/c:$(git rev-parse topic:right/c)
++ 4:tree:left/:$(git rev-parse base:left)
++ 4:tree:left/:$(git rev-parse base~2:left)
++ 5:blob:left/b:$(git rev-parse base~2:left/b)
++ 5:blob:left/b:$(git rev-parse base:left/b)
++ 6:blob:a:$(git rev-parse base~2:a)
+ blobs:6
+ trees:9
+ EOF
@@ t/t6601-path-walk.sh (new)
+ test-tool path-walk -- topic >out &&
+
+ cat >expect <<-EOF &&
-+ 0:TREE::$(git rev-parse topic^{tree})
-+ 0:TREE::$(git rev-parse base~1^{tree})
-+ 0:TREE::$(git rev-parse base~2^{tree})
-+ 1:TREE:right/:$(git rev-parse topic:right)
-+ 1:TREE:right/:$(git rev-parse base~1:right)
-+ 1:TREE:right/:$(git rev-parse base~2:right)
-+ 2:BLOB:right/d:$(git rev-parse base~1:right/d)
-+ 3:BLOB:right/c:$(git rev-parse base~2:right/c)
-+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 4:TREE:left/:$(git rev-parse base~2:left)
-+ 5:BLOB:left/b:$(git rev-parse base~2:left/b)
-+ 6:BLOB:a:$(git rev-parse base~2:a)
++ 0:tree::$(git rev-parse topic^{tree})
++ 0:tree::$(git rev-parse base~1^{tree})
++ 0:tree::$(git rev-parse base~2^{tree})
++ 1:tree:right/:$(git rev-parse topic:right)
++ 1:tree:right/:$(git rev-parse base~1:right)
++ 1:tree:right/:$(git rev-parse base~2:right)
++ 2:blob:right/d:$(git rev-parse base~1:right/d)
++ 3:blob:right/c:$(git rev-parse base~2:right/c)
++ 3:blob:right/c:$(git rev-parse topic:right/c)
++ 4:tree:left/:$(git rev-parse base~2:left)
++ 5:blob:left/b:$(git rev-parse base~2:left/b)
++ 6:blob:a:$(git rev-parse base~2:a)
+ blobs:5
+ trees:7
+ EOF
@@ t/t6601-path-walk.sh (new)
+ test-tool path-walk -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
-+ 0:TREE::$(git rev-parse topic^{tree})
-+ 1:TREE:right/:$(git rev-parse topic:right)
-+ 2:BLOB:right/d:$(git rev-parse topic:right/d)
-+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 4:TREE:left/:$(git rev-parse topic:left)
-+ 5:BLOB:left/b:$(git rev-parse topic:left/b)
-+ 6:BLOB:a:$(git rev-parse topic:a)
++ 0:tree::$(git rev-parse topic^{tree})
++ 1:tree:right/:$(git rev-parse topic:right)
++ 2:blob:right/d:$(git rev-parse topic:right/d)
++ 3:blob:right/c:$(git rev-parse topic:right/c)
++ 4:tree:left/:$(git rev-parse topic:left)
++ 5:blob:left/b:$(git rev-parse topic:left/b)
++ 6:blob:a:$(git rev-parse topic:a)
+ blobs:4
+ trees:3
+ EOF
@@ t/t6601-path-walk.sh (new)
+ test-tool path-walk -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
-+ 0:TREE::$(git rev-parse topic^{tree})
-+ 0:TREE::$(git rev-parse base~1^{tree})
-+ 1:TREE:right/:$(git rev-parse topic:right)
-+ 1:TREE:right/:$(git rev-parse base~1:right)
-+ 2:BLOB:right/d:$(git rev-parse base~1:right/d)
-+ 3:BLOB:right/c:$(git rev-parse base~1:right/c)
-+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 4:TREE:left/:$(git rev-parse base~1:left)
-+ 5:BLOB:left/b:$(git rev-parse base~1:left/b)
-+ 6:BLOB:a:$(git rev-parse base~1:a)
++ 0:tree::$(git rev-parse topic^{tree})
++ 0:tree::$(git rev-parse base~1^{tree})
++ 1:tree:right/:$(git rev-parse topic:right)
++ 1:tree:right/:$(git rev-parse base~1:right)
++ 2:blob:right/d:$(git rev-parse base~1:right/d)
++ 3:blob:right/c:$(git rev-parse base~1:right/c)
++ 3:blob:right/c:$(git rev-parse topic:right/c)
++ 4:tree:left/:$(git rev-parse base~1:left)
++ 5:blob:left/b:$(git rev-parse base~1:left/b)
++ 6:blob:a:$(git rev-parse base~1:a)
+ blobs:5
+ trees:5
+ EOF
4: 83b746f569d ! 4: 42e71e6285f path-walk: allow consumer to specify object types
@@ t/helper/test-path-walk.c: static const char * const path_walk_usage[] = {
uintmax_t blob_nr;
};
@@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_array *oids,
- const char *typestr;
-
- switch (type) {
-+ case OBJ_COMMIT:
-+ typestr = "COMMIT";
-+ tdata->commit_nr += oids->nr;
-+ break;
-+
- case OBJ_TREE:
- typestr = "TREE";
tdata->tree_nr += oids->nr;
+ else if (type == OBJ_BLOB)
+ tdata->blob_nr += oids->nr;
++ else if (type == OBJ_COMMIT)
++ tdata->commit_nr += oids->nr;
+ else
+ BUG("we do not understand this type");
+
@@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
struct path_walk_info info = PATH_WALK_INFO_INIT;
struct path_walk_test_data data = { 0 };
@@ t/t6601-path-walk.sh: test_expect_success 'all' '
test-tool path-walk -- --all >out &&
cat >expect <<-EOF &&
-- 0:TREE::$(git rev-parse topic^{tree})
-- 0:TREE::$(git rev-parse base^{tree})
-- 0:TREE::$(git rev-parse base~1^{tree})
-- 0:TREE::$(git rev-parse base~2^{tree})
-- 1:TREE:right/:$(git rev-parse topic:right)
-- 1:TREE:right/:$(git rev-parse base~1:right)
-- 1:TREE:right/:$(git rev-parse base~2:right)
-- 2:BLOB:right/d:$(git rev-parse base~1:right/d)
-- 3:BLOB:right/c:$(git rev-parse base~2:right/c)
-- 3:BLOB:right/c:$(git rev-parse topic:right/c)
-- 4:TREE:left/:$(git rev-parse base:left)
-- 4:TREE:left/:$(git rev-parse base~2:left)
-- 5:BLOB:left/b:$(git rev-parse base~2:left/b)
-- 5:BLOB:left/b:$(git rev-parse base:left/b)
-- 6:BLOB:a:$(git rev-parse base~2:a)
-+ 0:COMMIT::$(git rev-parse topic)
-+ 0:COMMIT::$(git rev-parse base)
-+ 0:COMMIT::$(git rev-parse base~1)
-+ 0:COMMIT::$(git rev-parse base~2)
-+ 1:TREE::$(git rev-parse topic^{tree})
-+ 1:TREE::$(git rev-parse base^{tree})
-+ 1:TREE::$(git rev-parse base~1^{tree})
-+ 1:TREE::$(git rev-parse base~2^{tree})
-+ 2:TREE:right/:$(git rev-parse topic:right)
-+ 2:TREE:right/:$(git rev-parse base~1:right)
-+ 2:TREE:right/:$(git rev-parse base~2:right)
-+ 3:BLOB:right/d:$(git rev-parse base~1:right/d)
-+ 4:BLOB:right/c:$(git rev-parse base~2:right/c)
-+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 5:TREE:left/:$(git rev-parse base:left)
-+ 5:TREE:left/:$(git rev-parse base~2:left)
-+ 6:BLOB:left/b:$(git rev-parse base~2:left/b)
-+ 6:BLOB:left/b:$(git rev-parse base:left/b)
-+ 7:BLOB:a:$(git rev-parse base~2:a)
+- 0:tree::$(git rev-parse topic^{tree})
+- 0:tree::$(git rev-parse base^{tree})
+- 0:tree::$(git rev-parse base~1^{tree})
+- 0:tree::$(git rev-parse base~2^{tree})
+- 1:tree:right/:$(git rev-parse topic:right)
+- 1:tree:right/:$(git rev-parse base~1:right)
+- 1:tree:right/:$(git rev-parse base~2:right)
+- 2:blob:right/d:$(git rev-parse base~1:right/d)
+- 3:blob:right/c:$(git rev-parse base~2:right/c)
+- 3:blob:right/c:$(git rev-parse topic:right/c)
+- 4:tree:left/:$(git rev-parse base:left)
+- 4:tree:left/:$(git rev-parse base~2:left)
+- 5:blob:left/b:$(git rev-parse base~2:left/b)
+- 5:blob:left/b:$(git rev-parse base:left/b)
+- 6:blob:a:$(git rev-parse base~2:a)
++ 0:commit::$(git rev-parse topic)
++ 0:commit::$(git rev-parse base)
++ 0:commit::$(git rev-parse base~1)
++ 0:commit::$(git rev-parse base~2)
++ 1:tree::$(git rev-parse topic^{tree})
++ 1:tree::$(git rev-parse base^{tree})
++ 1:tree::$(git rev-parse base~1^{tree})
++ 1:tree::$(git rev-parse base~2^{tree})
++ 2:tree:right/:$(git rev-parse topic:right)
++ 2:tree:right/:$(git rev-parse base~1:right)
++ 2:tree:right/:$(git rev-parse base~2:right)
++ 3:blob:right/d:$(git rev-parse base~1:right/d)
++ 4:blob:right/c:$(git rev-parse base~2:right/c)
++ 4:blob:right/c:$(git rev-parse topic:right/c)
++ 5:tree:left/:$(git rev-parse base:left)
++ 5:tree:left/:$(git rev-parse base~2:left)
++ 6:blob:left/b:$(git rev-parse base~2:left/b)
++ 6:blob:left/b:$(git rev-parse base:left/b)
++ 7:blob:a:$(git rev-parse base~2:a)
blobs:6
+ commits:4
trees:9
@@ t/t6601-path-walk.sh: test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&
cat >expect <<-EOF &&
-- 0:TREE::$(git rev-parse topic^{tree})
-- 0:TREE::$(git rev-parse base~1^{tree})
-- 0:TREE::$(git rev-parse base~2^{tree})
-- 1:TREE:right/:$(git rev-parse topic:right)
-- 1:TREE:right/:$(git rev-parse base~1:right)
-- 1:TREE:right/:$(git rev-parse base~2:right)
-- 2:BLOB:right/d:$(git rev-parse base~1:right/d)
-- 3:BLOB:right/c:$(git rev-parse base~2:right/c)
-- 3:BLOB:right/c:$(git rev-parse topic:right/c)
-- 4:TREE:left/:$(git rev-parse base~2:left)
-- 5:BLOB:left/b:$(git rev-parse base~2:left/b)
-- 6:BLOB:a:$(git rev-parse base~2:a)
-+ 0:COMMIT::$(git rev-parse topic)
-+ 0:COMMIT::$(git rev-parse base~1)
-+ 0:COMMIT::$(git rev-parse base~2)
-+ 1:TREE::$(git rev-parse topic^{tree})
-+ 1:TREE::$(git rev-parse base~1^{tree})
-+ 1:TREE::$(git rev-parse base~2^{tree})
-+ 2:TREE:right/:$(git rev-parse topic:right)
-+ 2:TREE:right/:$(git rev-parse base~1:right)
-+ 2:TREE:right/:$(git rev-parse base~2:right)
-+ 3:BLOB:right/d:$(git rev-parse base~1:right/d)
-+ 4:BLOB:right/c:$(git rev-parse base~2:right/c)
-+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 5:TREE:left/:$(git rev-parse base~2:left)
-+ 6:BLOB:left/b:$(git rev-parse base~2:left/b)
-+ 7:BLOB:a:$(git rev-parse base~2:a)
+- 0:tree::$(git rev-parse topic^{tree})
+- 0:tree::$(git rev-parse base~1^{tree})
+- 0:tree::$(git rev-parse base~2^{tree})
+- 1:tree:right/:$(git rev-parse topic:right)
+- 1:tree:right/:$(git rev-parse base~1:right)
+- 1:tree:right/:$(git rev-parse base~2:right)
+- 2:blob:right/d:$(git rev-parse base~1:right/d)
+- 3:blob:right/c:$(git rev-parse base~2:right/c)
+- 3:blob:right/c:$(git rev-parse topic:right/c)
+- 4:tree:left/:$(git rev-parse base~2:left)
+- 5:blob:left/b:$(git rev-parse base~2:left/b)
+- 6:blob:a:$(git rev-parse base~2:a)
++ 0:commit::$(git rev-parse topic)
++ 0:commit::$(git rev-parse base~1)
++ 0:commit::$(git rev-parse base~2)
++ 1:tree::$(git rev-parse topic^{tree})
++ 1:tree::$(git rev-parse base~1^{tree})
++ 1:tree::$(git rev-parse base~2^{tree})
++ 2:tree:right/:$(git rev-parse topic:right)
++ 2:tree:right/:$(git rev-parse base~1:right)
++ 2:tree:right/:$(git rev-parse base~2:right)
++ 3:blob:right/d:$(git rev-parse base~1:right/d)
++ 4:blob:right/c:$(git rev-parse base~2:right/c)
++ 4:blob:right/c:$(git rev-parse topic:right/c)
++ 5:tree:left/:$(git rev-parse base~2:left)
++ 6:blob:left/b:$(git rev-parse base~2:left/b)
++ 7:blob:a:$(git rev-parse base~2:a)
blobs:5
+ commits:3
trees:7
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
test-tool path-walk -- topic --not base >out &&
cat >expect <<-EOF &&
-+ 0:COMMIT::$(git rev-parse topic)
-+ 1:TREE::$(git rev-parse topic^{tree})
-+ 2:TREE:right/:$(git rev-parse topic:right)
-+ 3:BLOB:right/d:$(git rev-parse topic:right/d)
-+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 5:TREE:left/:$(git rev-parse topic:left)
-+ 6:BLOB:left/b:$(git rev-parse topic:left/b)
-+ 7:BLOB:a:$(git rev-parse topic:a)
++ 0:commit::$(git rev-parse topic)
++ 1:tree::$(git rev-parse topic^{tree})
++ 2:tree:right/:$(git rev-parse topic:right)
++ 3:blob:right/d:$(git rev-parse topic:right/d)
++ 4:blob:right/c:$(git rev-parse topic:right/c)
++ 5:tree:left/:$(git rev-parse topic:left)
++ 6:blob:left/b:$(git rev-parse topic:left/b)
++ 7:blob:a:$(git rev-parse topic:a)
+ blobs:4
+ commits:1
+ trees:3
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ cat >expect <<-EOF &&
+ commits:0
+ trees:0
-+ 0:BLOB:right/d:$(git rev-parse topic:right/d)
-+ 1:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 2:BLOB:left/b:$(git rev-parse topic:left/b)
-+ 3:BLOB:a:$(git rev-parse topic:a)
++ 0:blob:right/d:$(git rev-parse topic:right/d)
++ 1:blob:right/c:$(git rev-parse topic:right/c)
++ 2:blob:left/b:$(git rev-parse topic:left/b)
++ 3:blob:a:$(git rev-parse topic:a)
+ blobs:4
+ EOF
+
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
-+ 0:COMMIT::$(git rev-parse topic)
++ 0:commit::$(git rev-parse topic)
+ commits:1
+ trees:0
+ blobs:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+
+ cat >expect <<-EOF &&
+ commits:0
- 0:TREE::$(git rev-parse topic^{tree})
- 1:TREE:right/:$(git rev-parse topic:right)
-- 2:BLOB:right/d:$(git rev-parse topic:right/d)
-- 3:BLOB:right/c:$(git rev-parse topic:right/c)
-- 4:TREE:left/:$(git rev-parse topic:left)
-- 5:BLOB:left/b:$(git rev-parse topic:left/b)
-- 6:BLOB:a:$(git rev-parse topic:a)
+ 0:tree::$(git rev-parse topic^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+- 2:blob:right/d:$(git rev-parse topic:right/d)
+- 3:blob:right/c:$(git rev-parse topic:right/c)
+- 4:tree:left/:$(git rev-parse topic:left)
+- 5:blob:left/b:$(git rev-parse topic:left/b)
+- 6:blob:a:$(git rev-parse topic:a)
- blobs:4
-+ 2:TREE:left/:$(git rev-parse topic:left)
++ 2:tree:left/:$(git rev-parse topic:left)
trees:3
+ blobs:0
EOF
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
test-tool path-walk -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
-- 0:TREE::$(git rev-parse topic^{tree})
-- 0:TREE::$(git rev-parse base~1^{tree})
-- 1:TREE:right/:$(git rev-parse topic:right)
-- 1:TREE:right/:$(git rev-parse base~1:right)
-- 2:BLOB:right/d:$(git rev-parse base~1:right/d)
-- 3:BLOB:right/c:$(git rev-parse base~1:right/c)
-- 3:BLOB:right/c:$(git rev-parse topic:right/c)
-- 4:TREE:left/:$(git rev-parse base~1:left)
-- 5:BLOB:left/b:$(git rev-parse base~1:left/b)
-- 6:BLOB:a:$(git rev-parse base~1:a)
-+ 0:COMMIT::$(git rev-parse topic)
-+ 0:COMMIT::$(git rev-parse base~1)
-+ 1:TREE::$(git rev-parse topic^{tree})
-+ 1:TREE::$(git rev-parse base~1^{tree})
-+ 2:TREE:right/:$(git rev-parse topic:right)
-+ 2:TREE:right/:$(git rev-parse base~1:right)
-+ 3:BLOB:right/d:$(git rev-parse base~1:right/d)
-+ 4:BLOB:right/c:$(git rev-parse base~1:right/c)
-+ 4:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 5:TREE:left/:$(git rev-parse base~1:left)
-+ 6:BLOB:left/b:$(git rev-parse base~1:left/b)
-+ 7:BLOB:a:$(git rev-parse base~1:a)
+- 0:tree::$(git rev-parse topic^{tree})
+- 0:tree::$(git rev-parse base~1^{tree})
+- 1:tree:right/:$(git rev-parse topic:right)
+- 1:tree:right/:$(git rev-parse base~1:right)
+- 2:blob:right/d:$(git rev-parse base~1:right/d)
+- 3:blob:right/c:$(git rev-parse base~1:right/c)
+- 3:blob:right/c:$(git rev-parse topic:right/c)
+- 4:tree:left/:$(git rev-parse base~1:left)
+- 5:blob:left/b:$(git rev-parse base~1:left/b)
+- 6:blob:a:$(git rev-parse base~1:a)
++ 0:commit::$(git rev-parse topic)
++ 0:commit::$(git rev-parse base~1)
++ 1:tree::$(git rev-parse topic^{tree})
++ 1:tree::$(git rev-parse base~1^{tree})
++ 2:tree:right/:$(git rev-parse topic:right)
++ 2:tree:right/:$(git rev-parse base~1:right)
++ 3:blob:right/d:$(git rev-parse base~1:right/d)
++ 4:blob:right/c:$(git rev-parse base~1:right/c)
++ 4:blob:right/c:$(git rev-parse topic:right/c)
++ 5:tree:left/:$(git rev-parse base~1:left)
++ 6:blob:left/b:$(git rev-parse base~1:left/b)
++ 7:blob:a:$(git rev-parse base~1:a)
blobs:5
+ commits:2
trees:5
5: 97765aa04c2 ! 5: a41f53f7ced path-walk: visit tags and cached objects
@@ t/helper/test-path-walk.c: struct path_walk_test_data {
static int emit_block(const char *path, struct oid_array *oids,
@@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_array *oids,
tdata->blob_nr += oids->nr;
- break;
-
-+ case OBJ_TAG:
-+ typestr = "TAG";
+ else if (type == OBJ_COMMIT)
+ tdata->commit_nr += oids->nr;
++ else if (type == OBJ_TAG)
+ tdata->tag_nr += oids->nr;
-+ break;
-+
- default:
+ else
BUG("we do not understand this type");
- }
+
+ typestr = type_name(type);
+ /* This should never be output during tests. */
+ if (!oids->nr)
@@ t/t6601-path-walk.sh: test_description='direct path-walk API tests'
test-tool path-walk -- --all >out &&
+ cat >expect <<-EOF &&
-+ 0:COMMIT::$(git rev-parse topic)
-+ 0:COMMIT::$(git rev-parse base)
-+ 0:COMMIT::$(git rev-parse base~1)
-+ 0:COMMIT::$(git rev-parse base~2)
-+ 1:TAG:/tags:$(git rev-parse refs/tags/first)
-+ 1:TAG:/tags:$(git rev-parse refs/tags/second.1)
-+ 1:TAG:/tags:$(git rev-parse refs/tags/second.2)
-+ 1:TAG:/tags:$(git rev-parse refs/tags/third)
-+ 1:TAG:/tags:$(git rev-parse refs/tags/fourth)
-+ 1:TAG:/tags:$(git rev-parse refs/tags/tree-tag)
-+ 1:TAG:/tags:$(git rev-parse refs/tags/blob-tag)
-+ 2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
-+ 2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
-+ 3:TREE::$(git rev-parse topic^{tree})
-+ 3:TREE::$(git rev-parse base^{tree})
-+ 3:TREE::$(git rev-parse base~1^{tree})
-+ 3:TREE::$(git rev-parse base~2^{tree})
-+ 3:TREE::$(git rev-parse refs/tags/tree-tag^{})
-+ 3:TREE::$(git rev-parse refs/tags/tree-tag2^{})
-+ 4:BLOB:a:$(git rev-parse base~2:a)
-+ 5:TREE:right/:$(git rev-parse topic:right)
-+ 5:TREE:right/:$(git rev-parse base~1:right)
-+ 5:TREE:right/:$(git rev-parse base~2:right)
-+ 6:BLOB:right/d:$(git rev-parse base~1:right/d)
-+ 7:BLOB:right/c:$(git rev-parse base~2:right/c)
-+ 7:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 8:TREE:left/:$(git rev-parse base:left)
-+ 8:TREE:left/:$(git rev-parse base~2:left)
-+ 9:BLOB:left/b:$(git rev-parse base~2:left/b)
-+ 9:BLOB:left/b:$(git rev-parse base:left/b)
-+ 10:TREE:a/:$(git rev-parse base:a)
-+ 11:BLOB:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
-+ 12:TREE:child/:$(git rev-parse refs/tags/tree-tag:child)
-+ 13:BLOB:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
++ 0:commit::$(git rev-parse topic)
++ 0:commit::$(git rev-parse base)
++ 0:commit::$(git rev-parse base~1)
++ 0:commit::$(git rev-parse base~2)
++ 1:tag:/tags:$(git rev-parse refs/tags/first)
++ 1:tag:/tags:$(git rev-parse refs/tags/second.1)
++ 1:tag:/tags:$(git rev-parse refs/tags/second.2)
++ 1:tag:/tags:$(git rev-parse refs/tags/third)
++ 1:tag:/tags:$(git rev-parse refs/tags/fourth)
++ 1:tag:/tags:$(git rev-parse refs/tags/tree-tag)
++ 1:tag:/tags:$(git rev-parse refs/tags/blob-tag)
++ 2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
++ 2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
++ 3:tree::$(git rev-parse topic^{tree})
++ 3:tree::$(git rev-parse base^{tree})
++ 3:tree::$(git rev-parse base~1^{tree})
++ 3:tree::$(git rev-parse base~2^{tree})
++ 3:tree::$(git rev-parse refs/tags/tree-tag^{})
++ 3:tree::$(git rev-parse refs/tags/tree-tag2^{})
++ 4:blob:a:$(git rev-parse base~2:a)
++ 5:tree:right/:$(git rev-parse topic:right)
++ 5:tree:right/:$(git rev-parse base~1:right)
++ 5:tree:right/:$(git rev-parse base~2:right)
++ 6:blob:right/d:$(git rev-parse base~1:right/d)
++ 7:blob:right/c:$(git rev-parse base~2:right/c)
++ 7:blob:right/c:$(git rev-parse topic:right/c)
++ 8:tree:left/:$(git rev-parse base:left)
++ 8:tree:left/:$(git rev-parse base~2:left)
++ 9:blob:left/b:$(git rev-parse base~2:left/b)
++ 9:blob:left/b:$(git rev-parse base:left/b)
++ 10:tree:a/:$(git rev-parse base:a)
++ 11:blob:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
++ 12:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
++ 13:blob:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+ blobs:10
+ commits:4
+ tags:7
@@ t/t6601-path-walk.sh: test_description='direct path-walk API tests'
+ test-tool path-walk -- --indexed-objects >out &&
+
+ cat >expect <<-EOF &&
-+ 0:BLOB:a:$(git rev-parse HEAD:a)
-+ 1:BLOB:left/b:$(git rev-parse HEAD:left/b)
-+ 2:BLOB:left/c:$(git rev-parse :left/c)
-+ 3:BLOB:right/c:$(git rev-parse HEAD:right/c)
-+ 4:BLOB:right/d:$(git rev-parse HEAD:right/d)
-+ 5:TREE:right/:$(git rev-parse topic:right)
++ 0:blob:a:$(git rev-parse HEAD:a)
++ 1:blob:left/b:$(git rev-parse HEAD:left/b)
++ 2:blob:left/c:$(git rev-parse :left/c)
++ 3:blob:right/c:$(git rev-parse HEAD:right/c)
++ 4:blob:right/d:$(git rev-parse HEAD:right/d)
++ 5:tree:right/:$(git rev-parse topic:right)
+ blobs:5
+ commits:0
+ tags:0
@@ t/t6601-path-walk.sh: test_description='direct path-walk API tests'
+ test-tool path-walk -- --indexed-objects --branches >out &&
+
cat >expect <<-EOF &&
- 0:COMMIT::$(git rev-parse topic)
- 0:COMMIT::$(git rev-parse base)
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base)
@@ t/t6601-path-walk.sh: test_expect_success 'all' '
- 1:TREE::$(git rev-parse base^{tree})
- 1:TREE::$(git rev-parse base~1^{tree})
- 1:TREE::$(git rev-parse base~2^{tree})
-- 2:TREE:right/:$(git rev-parse topic:right)
-- 2:TREE:right/:$(git rev-parse base~1:right)
-- 2:TREE:right/:$(git rev-parse base~2:right)
-- 3:BLOB:right/d:$(git rev-parse base~1:right/d)
-- 4:BLOB:right/c:$(git rev-parse base~2:right/c)
-- 4:BLOB:right/c:$(git rev-parse topic:right/c)
-- 5:TREE:left/:$(git rev-parse base:left)
-- 5:TREE:left/:$(git rev-parse base~2:left)
-- 6:BLOB:left/b:$(git rev-parse base~2:left/b)
-- 6:BLOB:left/b:$(git rev-parse base:left/b)
-- 7:BLOB:a:$(git rev-parse base~2:a)
+ 1:tree::$(git rev-parse base^{tree})
+ 1:tree::$(git rev-parse base~1^{tree})
+ 1:tree::$(git rev-parse base~2^{tree})
+- 2:tree:right/:$(git rev-parse topic:right)
+- 2:tree:right/:$(git rev-parse base~1:right)
+- 2:tree:right/:$(git rev-parse base~2:right)
+- 3:blob:right/d:$(git rev-parse base~1:right/d)
+- 4:blob:right/c:$(git rev-parse base~2:right/c)
+- 4:blob:right/c:$(git rev-parse topic:right/c)
+- 5:tree:left/:$(git rev-parse base:left)
+- 5:tree:left/:$(git rev-parse base~2:left)
+- 6:blob:left/b:$(git rev-parse base~2:left/b)
+- 6:blob:left/b:$(git rev-parse base:left/b)
+- 7:blob:a:$(git rev-parse base~2:a)
- blobs:6
-+ 2:BLOB:a:$(git rev-parse base~2:a)
-+ 3:TREE:right/:$(git rev-parse topic:right)
-+ 3:TREE:right/:$(git rev-parse base~1:right)
-+ 3:TREE:right/:$(git rev-parse base~2:right)
-+ 4:BLOB:right/d:$(git rev-parse base~1:right/d)
-+ 4:BLOB:right/d:$(git rev-parse :right/d)
-+ 5:BLOB:right/c:$(git rev-parse base~2:right/c)
-+ 5:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 6:TREE:left/:$(git rev-parse base:left)
-+ 6:TREE:left/:$(git rev-parse base~2:left)
-+ 7:BLOB:left/b:$(git rev-parse base:left/b)
-+ 7:BLOB:left/b:$(git rev-parse base~2:left/b)
-+ 8:TREE:a/:$(git rev-parse refs/tags/third:a)
++ 2:blob:a:$(git rev-parse base~2:a)
++ 3:tree:right/:$(git rev-parse topic:right)
++ 3:tree:right/:$(git rev-parse base~1:right)
++ 3:tree:right/:$(git rev-parse base~2:right)
++ 4:blob:right/d:$(git rev-parse base~1:right/d)
++ 4:blob:right/d:$(git rev-parse :right/d)
++ 5:blob:right/c:$(git rev-parse base~2:right/c)
++ 5:blob:right/c:$(git rev-parse topic:right/c)
++ 6:tree:left/:$(git rev-parse base:left)
++ 6:tree:left/:$(git rev-parse base~2:left)
++ 7:blob:left/b:$(git rev-parse base:left/b)
++ 7:blob:left/b:$(git rev-parse base~2:left/b)
++ 8:tree:a/:$(git rev-parse refs/tags/third:a)
+ blobs:7
commits:4
- trees:9
@@ t/t6601-path-walk.sh: test_expect_success 'all' '
test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic only' '
- 7:BLOB:a:$(git rev-parse base~2:a)
+ 7:blob:a:$(git rev-parse base~2:a)
blobs:5
commits:3
+ tags:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic only' '
EOF
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
- 7:BLOB:a:$(git rev-parse topic:a)
+ 7:blob:a:$(git rev-parse topic:a)
blobs:4
commits:1
+ tags:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' '
cat >expect <<-EOF &&
- commits:0
- trees:0
- 0:BLOB:right/d:$(git rev-parse topic:right/d)
- 1:BLOB:right/c:$(git rev-parse topic:right/c)
- 2:BLOB:left/b:$(git rev-parse topic:left/b)
- 3:BLOB:a:$(git rev-parse topic:a)
+ 0:blob:right/d:$(git rev-parse topic:right/d)
+ 1:blob:right/c:$(git rev-parse topic:right/c)
+ 2:blob:left/b:$(git rev-parse topic:left/b)
+ 3:blob:a:$(git rev-parse topic:a)
blobs:4
+ commits:0
+ tags:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only blobs' '
test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only commits' '
cat >expect <<-EOF &&
- 0:COMMIT::$(git rev-parse topic)
+ 0:commit::$(git rev-parse topic)
commits:1
- trees:0
blobs:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' '
cat >expect <<-EOF &&
- commits:0
- 0:TREE::$(git rev-parse topic^{tree})
- 1:TREE:right/:$(git rev-parse topic:right)
- 2:TREE:left/:$(git rev-parse topic:left)
+ 0:tree::$(git rev-parse topic^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+ 2:tree:left/:$(git rev-parse topic:left)
- trees:3
+ commits:0
blobs:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' '
test_cmp_sorted expect out
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
- 7:BLOB:a:$(git rev-parse base~1:a)
+ 7:blob:a:$(git rev-parse base~1:a)
blobs:5
commits:2
+ tags:0
6: a4aaa3b001b ! 6: 0f1e6c51b2c path-walk: mark trees and blobs as UNINTERESTING
@@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
## t/t6601-path-walk.sh ##
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
- 0:COMMIT::$(git rev-parse topic)
- 1:TREE::$(git rev-parse topic^{tree})
- 2:TREE:right/:$(git rev-parse topic:right)
-- 3:BLOB:right/d:$(git rev-parse topic:right/d)
-+ 3:BLOB:right/d:$(git rev-parse topic:right/d):UNINTERESTING
- 4:BLOB:right/c:$(git rev-parse topic:right/c)
-- 5:TREE:left/:$(git rev-parse topic:left)
-- 6:BLOB:left/b:$(git rev-parse topic:left/b)
-- 7:BLOB:a:$(git rev-parse topic:a)
-+ 5:TREE:left/:$(git rev-parse topic:left):UNINTERESTING
-+ 6:BLOB:left/b:$(git rev-parse topic:left/b):UNINTERESTING
-+ 7:BLOB:a:$(git rev-parse topic:a):UNINTERESTING
+ 0:commit::$(git rev-parse topic)
+ 1:tree::$(git rev-parse topic^{tree})
+ 2:tree:right/:$(git rev-parse topic:right)
+- 3:blob:right/d:$(git rev-parse topic:right/d)
++ 3:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+- 5:tree:left/:$(git rev-parse topic:left)
+- 6:blob:left/b:$(git rev-parse topic:left/b)
+- 7:blob:a:$(git rev-parse topic:a)
++ 5:tree:left/:$(git rev-parse topic:left):UNINTERESTING
++ 6:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
++ 7:blob:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
commits:1
tags:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
+ test-tool path-walk -- fourth blob-tag2 --not base >out &&
+
+ cat >expect <<-EOF &&
-+ 0:COMMIT::$(git rev-parse topic)
-+ 1:TAG:/tags:$(git rev-parse fourth)
-+ 2:BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
-+ 3:TREE::$(git rev-parse topic^{tree})
-+ 4:TREE:right/:$(git rev-parse topic:right)
-+ 5:BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
-+ 6:BLOB:right/c:$(git rev-parse topic:right/c)
-+ 7:TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
-+ 8:BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
-+ 9:BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
++ 0:commit::$(git rev-parse topic)
++ 1:tag:/tags:$(git rev-parse fourth)
++ 2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
++ 3:tree::$(git rev-parse topic^{tree})
++ 4:tree:right/:$(git rev-parse topic:right)
++ 5:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
++ 6:blob:right/c:$(git rev-parse topic:right/c)
++ 7:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
++ 8:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
++ 9:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ blobs:5
+ commits:1
+ tags:1
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base' '
-- topic --not base >out &&
cat >expect <<-EOF &&
-- 0:BLOB:right/d:$(git rev-parse topic:right/d)
-+ 0:BLOB:right/d:$(git rev-parse topic:right/d):UNINTERESTING
- 1:BLOB:right/c:$(git rev-parse topic:right/c)
-- 2:BLOB:left/b:$(git rev-parse topic:left/b)
-- 3:BLOB:a:$(git rev-parse topic:a)
-+ 2:BLOB:left/b:$(git rev-parse topic:left/b):UNINTERESTING
-+ 3:BLOB:a:$(git rev-parse topic:a):UNINTERESTING
+- 0:blob:right/d:$(git rev-parse topic:right/d)
++ 0:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
+ 1:blob:right/c:$(git rev-parse topic:right/c)
+- 2:blob:left/b:$(git rev-parse topic:left/b)
+- 3:blob:a:$(git rev-parse topic:a)
++ 2:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
++ 3:blob:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
commits:0
tags:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, only trees' '
cat >expect <<-EOF &&
- 0:TREE::$(git rev-parse topic^{tree})
- 1:TREE:right/:$(git rev-parse topic:right)
-- 2:TREE:left/:$(git rev-parse topic:left)
-+ 2:TREE:left/:$(git rev-parse topic:left):UNINTERESTING
+ 0:tree::$(git rev-parse topic^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+- 2:tree:left/:$(git rev-parse topic:left)
++ 2:tree:left/:$(git rev-parse topic:left):UNINTERESTING
commits:0
blobs:0
tags:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
cat >expect <<-EOF &&
- 0:COMMIT::$(git rev-parse topic)
-- 0:COMMIT::$(git rev-parse base~1)
-+ 0:COMMIT::$(git rev-parse base~1):UNINTERESTING
- 1:TREE::$(git rev-parse topic^{tree})
-- 1:TREE::$(git rev-parse base~1^{tree})
-+ 1:TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
- 2:TREE:right/:$(git rev-parse topic:right)
-- 2:TREE:right/:$(git rev-parse base~1:right)
-- 3:BLOB:right/d:$(git rev-parse base~1:right/d)
-- 4:BLOB:right/c:$(git rev-parse base~1:right/c)
-+ 2:TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
-+ 3:BLOB:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
-+ 4:BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
- 4:BLOB:right/c:$(git rev-parse topic:right/c)
-- 5:TREE:left/:$(git rev-parse base~1:left)
-- 6:BLOB:left/b:$(git rev-parse base~1:left/b)
-- 7:BLOB:a:$(git rev-parse base~1:a)
-+ 5:TREE:left/:$(git rev-parse base~1:left):UNINTERESTING
-+ 6:BLOB:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
-+ 7:BLOB:a:$(git rev-parse base~1:a):UNINTERESTING
+ 0:commit::$(git rev-parse topic)
+- 0:commit::$(git rev-parse base~1)
++ 0:commit::$(git rev-parse base~1):UNINTERESTING
+ 1:tree::$(git rev-parse topic^{tree})
+- 1:tree::$(git rev-parse base~1^{tree})
++ 1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
+ 2:tree:right/:$(git rev-parse topic:right)
+- 2:tree:right/:$(git rev-parse base~1:right)
+- 3:blob:right/d:$(git rev-parse base~1:right/d)
+- 4:blob:right/c:$(git rev-parse base~1:right/c)
++ 2:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
++ 3:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
++ 4:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+- 5:tree:left/:$(git rev-parse base~1:left)
+- 6:blob:left/b:$(git rev-parse base~1:left/b)
+- 7:blob:a:$(git rev-parse base~1:a)
++ 5:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
++ 6:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
++ 7:blob:a:$(git rev-parse base~1:a):UNINTERESTING
blobs:5
commits:2
tags:0
@@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
+ test-tool path-walk --prune -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
-+ 0:COMMIT::$(git rev-parse topic)
-+ 0:COMMIT::$(git rev-parse base~1):UNINTERESTING
-+ 1:TREE::$(git rev-parse topic^{tree})
-+ 1:TREE::$(git rev-parse base~1^{tree}):UNINTERESTING
-+ 2:TREE:right/:$(git rev-parse topic:right)
-+ 2:TREE:right/:$(git rev-parse base~1:right):UNINTERESTING
-+ 3:BLOB:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
-+ 3:BLOB:right/c:$(git rev-parse topic:right/c)
++ 0:commit::$(git rev-parse topic)
++ 0:commit::$(git rev-parse base~1):UNINTERESTING
++ 1:tree::$(git rev-parse topic^{tree})
++ 1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
++ 2:tree:right/:$(git rev-parse topic:right)
++ 2:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
++ 3:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
++ 3:blob:right/c:$(git rev-parse topic:right/c)
+ blobs:2
+ commits:2
+ tags:0
-: ----------- > 7: e716672c041 path-walk: reorder object visits
--
gitgitgadget
* [PATCH v3 1/7] path-walk: introduce an object walk by path
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
@ 2024-12-06 19:45 ` Derrick Stolee via GitGitGadget
2024-12-13 11:58 ` Patrick Steinhardt
2024-12-06 19:45 ` [PATCH v3 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
` (7 subsequent siblings)
8 siblings, 1 reply; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 19:45 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.
These batches are collected in 'struct type_and_oid_list' objects, which
store an object type and an oid_array of objects.
The data structures are documented in 'struct path_walk_context', but in
summary the most important are:
* 'paths_to_lists' is a strmap that connects a path to a
type_and_oid_list for that path. To avoid conflicts in path names,
we make sure that tree paths end in "/" (except the root path, which
is an empty string) and blob paths do not end in "/".
* 'path_stack' is a string list that is added to in an append-only
way. This stores the stack of our depth-first search on the heap
instead of using recursion.
* 'path_stack_pushed' is a strset that stores path names that were
already added to 'path_stack', to avoid repeating paths in the
stack. Mostly, this saves us from quadratic lookups from doing
unsorted checks into the string_list.
The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
push_to_stack() method. Call this instead of inserting into these
structures directly.
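
The coupling between the stack and the "already pushed" set can be sketched outside of Git roughly as follows. This is a minimal standalone illustration, not the code from this patch: `path_stack_sketch`, `push_to_stack_sketch()`, `path_stack_release()`, the fixed capacities, and the small open-addressing set are all assumptions standing in for Git's string_list and strset.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PATHS 64	/* fixed capacity keeps the sketch short */
#define SET_SLOTS 128	/* power of two, larger than MAX_PATHS */

/* Hypothetical stand-in for 'path_stack' plus 'path_stack_pushed'. */
struct path_stack_sketch {
	char *stack[MAX_PATHS];		/* append-only stack; owns its copies */
	size_t nr;
	char *pushed[SET_SLOTS];	/* every path ever pushed; owns its copies */
	size_t pushed_nr;
};

static char *dup_path(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	assert(copy);
	memcpy(copy, s, len);
	return copy;
}

static size_t hash_path(const char *s)
{
	size_t h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h & (SET_SLOTS - 1);
}

/* Returns 1 if 'path' was newly pushed, 0 if it had been pushed before. */
static int push_to_stack_sketch(struct path_stack_sketch *ps, const char *path)
{
	size_t slot = hash_path(path);

	/* Expected O(1) membership test instead of scanning the stack. */
	while (ps->pushed[slot]) {
		if (!strcmp(ps->pushed[slot], path))
			return 0;
		slot = (slot + 1) & (SET_SLOTS - 1);
	}

	assert(ps->nr < MAX_PATHS && ps->pushed_nr < MAX_PATHS);
	ps->pushed[slot] = dup_path(path);
	ps->pushed_nr++;
	ps->stack[ps->nr++] = dup_path(path);
	return 1;
}

static void path_stack_release(struct path_stack_sketch *ps)
{
	for (size_t i = 0; i < ps->nr; i++)
		free(ps->stack[i]);
	for (size_t i = 0; i < SET_SLOTS; i++)
		free(ps->pushed[i]);
	memset(ps, 0, sizeof(*ps));
}
```

Because both structures are only modified inside one function, a caller cannot push a path onto the stack without also recording it in the set, which is the property the commit message asks callers to preserve.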
The walk_objects_by_path() method initializes these structures and
starts walking commits from the given rev_info struct. The commits are
used to find the list of root trees which populate the start of our
depth-first search.
The core of our depth-first search is in a while loop that continues
while we have not indicated an early exit and our 'path_stack' still has
entries in it. The loop body pops a path off of the stack and "visits"
the path via the walk_path() method.
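
The shape of that loop, a depth-first search driven by an explicit stack with an early-exit flag, can be illustrated with a toy example. This sketch does not use Git's data structures: `struct node`, `walk_nodes()`, `count_visit()`, and `demo_walk()` are hypothetical names, and the node table is made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy "path" node; not a Git structure. */
struct node {
	const char *path;
	int nr_children;
	int children[4];	/* indexes into the node table */
};

typedef int (*visit_fn)(const char *path, void *data);

/*
 * Depth-first walk driven by an explicit stack rather than recursion.
 * The loop stops as soon as 'visit' returns nonzero or the stack is empty.
 */
static int walk_nodes(const struct node *nodes, int root,
		      visit_fn visit, void *data)
{
	int stack[16];
	size_t nr = 0;
	int ret = 0;

	stack[nr++] = root;
	while (!ret && nr) {
		const struct node *n = &nodes[stack[--nr]];	/* pop */

		ret = visit(n->path, data);			/* visit the path */

		for (int i = 0; !ret && i < n->nr_children; i++) {
			assert(nr < sizeof(stack) / sizeof(stack[0]));
			stack[nr++] = n->children[i];		/* push children */
		}
	}
	return ret;
}

struct counter {
	int seen;
	int limit;	/* 0 means "no limit" */
};

static int count_visit(const char *path, void *data)
{
	struct counter *c = data;

	(void)path;
	c->seen++;
	/* A nonzero return value requests an early exit. */
	return c->limit && c->seen >= c->limit;
}

/* Walk a four-node toy hierarchy and report how many paths were visited. */
static int demo_walk(int limit)
{
	static const struct node nodes[] = {
		{ "",       2, { 1, 2 } },
		{ "left/",  1, { 3 } },
		{ "right/", 0, { 0 } },
		{ "left/b", 0, { 0 } },
	};
	struct counter c = { 0, limit };

	walk_nodes(nodes, 0, count_visit, &c);
	return c.seen;
}
```

With no limit, `demo_walk(0)` visits all four paths; with `demo_walk(2)` the callback asks to stop after the second visit and the remaining stack entries are never popped.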
The walk_path() method gets the list of OIDs from the 'paths_to_lists'
strmap and executes the callback method on that list with the given path
and type. If the OIDs correspond to tree objects, then iterate over all
trees in the list and run add_children() to add the child objects to
their own lists, adding new entries to the stack if necessary.
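
To make the callback contract more concrete, here is a standalone sketch of a callback that receives one batch per path. The callback has the same parameter order as the `path_fn` typedef in path-walk.h, but every type here (`oid_sketch`, `oid_array_sketch`, `object_type_sketch`) is a simplified stand-in invented for the example, and `count_batch()` and `demo_batches()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for Git's object ID, oid_array, and object_type. */
enum object_type_sketch { SK_TREE, SK_BLOB };
struct oid_sketch { unsigned char hash[20]; };
struct oid_array_sketch {
	const struct oid_sketch *oid;
	size_t nr;
};

/* Same parameter order as 'path_fn', using the stand-in types. */
typedef int (*path_fn_sketch)(const char *path,
			      struct oid_array_sketch *oids,
			      enum object_type_sketch type,
			      void *data);

struct batch_stats {
	size_t batches;
	size_t trees;
	size_t blobs;
};

/* Count objects per type; one call corresponds to one (path, type) batch. */
static int count_batch(const char *path, struct oid_array_sketch *oids,
		       enum object_type_sketch type, void *data)
{
	struct batch_stats *stats = data;
	size_t len = strlen(path);

	/* Tree paths end in "/" (or are the empty root path); blobs do not. */
	if (type == SK_TREE)
		assert(!len || path[len - 1] == '/');
	else
		assert(len && path[len - 1] != '/');

	stats->batches++;
	if (type == SK_TREE)
		stats->trees += oids->nr;
	else
		stats->blobs += oids->nr;
	return 0;	/* nonzero would ask the walk to stop */
}

/* Feed two made-up batches to 'fn', as a walk over "right/" might. */
static struct batch_stats demo_batches(path_fn_sketch fn)
{
	static const struct oid_sketch oids[3] = { { { 0 } } };
	struct oid_array_sketch trees = { oids, 2 };
	struct oid_array_sketch blobs = { oids + 2, 1 };
	struct batch_stats stats = { 0, 0, 0 };

	fn("right/", &trees, SK_TREE, &stats);
	fn("right/c", &blobs, SK_BLOB, &stats);
	return stats;
}
```

The `data` pointer plays the role of `path_fn_data`: the walk passes it through untouched, so the callback can accumulate state without globals.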
In testing, this depth-first search approach was the one that used the
least memory while iterating over the object lists. There is still a
chance that repositories with too-wide path patterns could cause memory
pressure issues. Limiting the stack size could be done in the future by
limiting how many objects are being considered in-progress, or by
visiting blob paths earlier than trees.
There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 45 ++++
Makefile | 1 +
path-walk.c | 263 ++++++++++++++++++++++
path-walk.h | 43 ++++
4 files changed, 352 insertions(+)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
new file mode 100644
index 00000000000..c550c77ca30
--- /dev/null
+++ b/Documentation/technical/api-path-walk.txt
@@ -0,0 +1,45 @@
+Path-Walk API
+=============
+
+The path-walk API is used to walk reachable objects, but to visit objects
+in batches based on a common path they appear in, or by type.
+
+For example, all reachable commits are visited in a group. All tags are
+visited in a group. Then, all root trees are visited. At some point, all
+blobs reachable via a path `my/dir/to/A` are visited. When there are
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
+Basics
+------
+
+To use the path-walk API, include `path-walk.h` and call
+`walk_objects_by_path()` with a customized `path_walk_info` struct. The
+struct is used to set all of the options for how the walk should proceed.
+Let's dig into the different options and their use.
+
+`path_fn` and `path_fn_data`::
+ The most important option is the `path_fn` option, which is a
+ function pointer to the callback that can execute logic on the
+ object IDs for objects grouped by type and path. This function
+ also receives a `data` value that corresponds to the
+ `path_fn_data` member, for providing custom data structures to
+ this callback function.
+
+`revs`::
+ To configure the exact details of the reachable set of objects,
+ use the `revs` member and initialize it using the revision
+ machinery in `revision.h`. Initialize `revs` using calls such as
+ `setup_revisions()` or `parse_revision_opt()`. Do not call
+ `prepare_revision_walk()`, as that will be called within
+ `walk_objects_by_path()`.
++
+It is also important that you do not specify the `--objects` flag for the
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
+
+Examples
+--------
+
+See example usages in future changes.
diff --git a/Makefile b/Makefile
index 7344a7f7257..d0d8d6888e3 100644
--- a/Makefile
+++ b/Makefile
@@ -1094,6 +1094,7 @@ LIB_OBJS += parse-options.o
LIB_OBJS += patch-delta.o
LIB_OBJS += patch-ids.o
LIB_OBJS += path.o
+LIB_OBJS += path-walk.o
LIB_OBJS += pathspec.o
LIB_OBJS += pkt-line.o
LIB_OBJS += preload-index.o
diff --git a/path-walk.c b/path-walk.c
new file mode 100644
index 00000000000..24cf04c1e7d
--- /dev/null
+++ b/path-walk.c
@@ -0,0 +1,263 @@
+/*
+ * path-walk.c: implementation for path-based walks of the object graph.
+ */
+#include "git-compat-util.h"
+#include "path-walk.h"
+#include "blob.h"
+#include "commit.h"
+#include "dir.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "object.h"
+#include "oid-array.h"
+#include "revision.h"
+#include "string-list.h"
+#include "strmap.h"
+#include "trace2.h"
+#include "tree.h"
+#include "tree-walk.h"
+
+struct type_and_oid_list
+{
+ enum object_type type;
+ struct oid_array oids;
+};
+
+#define TYPE_AND_OID_LIST_INIT { \
+ .type = OBJ_NONE, \
+ .oids = OID_ARRAY_INIT \
+}
+
+struct path_walk_context {
+ /**
+ * Repeats of data in 'struct path_walk_info' for
+ * access with fewer characters.
+ */
+ struct repository *repo;
+ struct rev_info *revs;
+ struct path_walk_info *info;
+
+ /**
+ * Map a path to a 'struct type_and_oid_list'
+ * containing the objects discovered at that
+ * path.
+ */
+ struct strmap paths_to_lists;
+
+ /**
+ * Store the current list of paths in a stack, to
+ * facilitate depth-first-search without recursion.
+ *
+ * Use path_stack_pushed to indicate whether a path
+ * was previously added to path_stack.
+ */
+ struct string_list path_stack;
+ struct strset path_stack_pushed;
+};
+
+static void push_to_stack(struct path_walk_context *ctx,
+ const char *path)
+{
+ if (strset_contains(&ctx->path_stack_pushed, path))
+ return;
+
+ strset_add(&ctx->path_stack_pushed, path);
+ string_list_append(&ctx->path_stack, path);
+}
+
+static int add_children(struct path_walk_context *ctx,
+ const char *base_path,
+ struct object_id *oid)
+{
+ struct tree_desc desc;
+ struct name_entry entry;
+ struct strbuf path = STRBUF_INIT;
+ size_t base_len;
+ struct tree *tree = lookup_tree(ctx->repo, oid);
+
+ if (!tree) {
+ error(_("failed to walk children of tree %s: not found"),
+ oid_to_hex(oid));
+ return -1;
+ } else if (parse_tree_gently(tree, 1)) {
+ die("bad tree object %s", oid_to_hex(oid));
+ }
+
+ strbuf_addstr(&path, base_path);
+ base_len = path.len;
+
+ parse_tree(tree);
+ init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
+ while (tree_entry(&desc, &entry)) {
+ struct type_and_oid_list *list;
+ struct object *o;
+ /* Not actually true, but we will ignore submodules later. */
+ enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB;
+
+ /* Skip submodules. */
+ if (S_ISGITLINK(entry.mode))
+ continue;
+
+ if (type == OBJ_TREE) {
+ struct tree *child = lookup_tree(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else if (type == OBJ_BLOB) {
+ struct blob *child = lookup_blob(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else {
+ /* Wrong type? */
+ continue;
+ }
+
+ if (!o) /* report error?*/
+ continue;
+
+ strbuf_setlen(&path, base_len);
+ strbuf_add(&path, entry.path, entry.pathlen);
+
+ /*
+ * Trees will end with "/" for concatenation and distinction
+ * from blobs at the same path.
+ */
+ if (type == OBJ_TREE)
+ strbuf_addch(&path, '/');
+
+ if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = type;
+ strmap_put(&ctx->paths_to_lists, path.buf, list);
+ }
+ push_to_stack(ctx, path.buf);
+
+ /* Skip this object if already seen. */
+ if (o->flags & SEEN)
+ continue;
+ o->flags |= SEEN;
+ oid_array_append(&list->oids, &entry.oid);
+ }
+
+ free_tree_buffer(tree);
+ strbuf_release(&path);
+ return 0;
+}
+
+/*
+ * For each path in paths_to_explore, walk the trees another level
+ * and add any found blobs to the batch (but only if they exist and
+ * haven't been added yet).
+ */
+static int walk_path(struct path_walk_context *ctx,
+ const char *path)
+{
+ struct type_and_oid_list *list;
+ int ret = 0;
+
+ list = strmap_get(&ctx->paths_to_lists, path);
+
+ if (!list->oids.nr)
+ return 0;
+
+ /* Evaluate function pointer on this data. */
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
+
+ /* Expand data for children. */
+ if (list->type == OBJ_TREE) {
+ for (size_t i = 0; i < list->oids.nr; i++) {
+ ret |= add_children(ctx,
+ path,
+ &list->oids.oid[i]);
+ }
+ }
+
+ oid_array_clear(&list->oids);
+ strmap_remove(&ctx->paths_to_lists, path, 1);
+ return ret;
+}
+
+static void clear_strmap(struct strmap *map)
+{
+ struct hashmap_iter iter;
+ struct strmap_entry *e;
+
+ hashmap_for_each_entry(&map->map, &iter, e, ent) {
+ struct type_and_oid_list *list = e->value;
+ oid_array_clear(&list->oids);
+ }
+ strmap_clear(map, 1);
+ strmap_init(map);
+}
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info)
+{
+ const char *root_path = "";
+ int ret = 0;
+ size_t commits_nr = 0, paths_nr = 0;
+ struct commit *c;
+ struct type_and_oid_list *root_tree_list;
+ struct path_walk_context ctx = {
+ .repo = info->revs->repo,
+ .revs = info->revs,
+ .info = info,
+ .path_stack = STRING_LIST_INIT_DUP,
+ .path_stack_pushed = STRSET_INIT,
+ .paths_to_lists = STRMAP_INIT
+ };
+
+ trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+
+ /* Insert a single list for the root tree into the paths. */
+ CALLOC_ARRAY(root_tree_list, 1);
+ root_tree_list->type = OBJ_TREE;
+ strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+ push_to_stack(&ctx, root_path);
+
+ if (prepare_revision_walk(info->revs))
+ die(_("failed to setup revision walk"));
+
+ while ((c = get_revision(info->revs))) {
+ struct object_id *oid = get_commit_tree_oid(c);
+ struct tree *t;
+ commits_nr++;
+
+ oid = get_commit_tree_oid(c);
+ t = lookup_tree(info->revs->repo, oid);
+
+ if (!t) {
+ warning("could not find tree %s", oid_to_hex(oid));
+ continue;
+ }
+
+ if (t->object.flags & SEEN)
+ continue;
+ t->object.flags |= SEEN;
+ oid_array_append(&root_tree_list->oids, oid);
+ }
+
+ trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
+ trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+
+ trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
+ trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
+ clear_strmap(&ctx.paths_to_lists);
+ strset_clear(&ctx.path_stack_pushed);
+ string_list_clear(&ctx.path_stack, 0);
+ return ret;
+}
diff --git a/path-walk.h b/path-walk.h
new file mode 100644
index 00000000000..c9e94a98bc8
--- /dev/null
+++ b/path-walk.h
@@ -0,0 +1,43 @@
+/*
+ * path-walk.h : Methods and structures for walking the object graph in batches
+ * by the paths that can reach those objects.
+ */
+#include "object.h" /* Required for 'enum object_type'. */
+
+struct rev_info;
+struct oid_array;
+
+/**
+ * The type of a function pointer for the method that is called on a list of
+ * objects reachable at a given path.
+ */
+typedef int (*path_fn)(const char *path,
+ struct oid_array *oids,
+ enum object_type type,
+ void *data);
+
+struct path_walk_info {
+ /**
+ * revs provides the definitions for the commit walk, including
+ * which commits are UNINTERESTING or not.
+ */
+ struct rev_info *revs;
+
+ /**
+ * The caller wishes to execute custom logic on objects reachable at a
+ * given path. Every reachable object will be visited exactly once, and
+ * the first path to see an object wins. This may not be a stable choice.
+ */
+ path_fn path_fn;
+ void *path_fn_data;
+};
+
+#define PATH_WALK_INFO_INIT { 0 }
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info);
--
gitgitgadget
* Re: [PATCH v3 1/7] path-walk: introduce an object walk by path
2024-12-06 19:45 ` [PATCH v3 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
@ 2024-12-13 11:58 ` Patrick Steinhardt
2024-12-18 14:21 ` Derrick Stolee
0 siblings, 1 reply; 67+ messages in thread
From: Patrick Steinhardt @ 2024-12-13 11:58 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee
On Fri, Dec 06, 2024 at 07:45:52PM +0000, Derrick Stolee via GitGitGadget wrote:
> --- /dev/null
> +++ b/path-walk.c
> @@ -0,0 +1,263 @@
> +/*
> + * path-walk.c: implementation for path-based walks of the object graph.
> + */
> +#include "git-compat-util.h"
> +#include "path-walk.h"
> +#include "blob.h"
> +#include "commit.h"
> +#include "dir.h"
> +#include "hashmap.h"
> +#include "hex.h"
> +#include "object.h"
> +#include "oid-array.h"
> +#include "revision.h"
> +#include "string-list.h"
> +#include "strmap.h"
> +#include "trace2.h"
> +#include "tree.h"
> +#include "tree-walk.h"
> +
> +struct type_and_oid_list
> +{
> + enum object_type type;
> + struct oid_array oids;
> +};
Nit: formatting of this struct is off.
> +static void push_to_stack(struct path_walk_context *ctx,
> + const char *path)
> +{
> + if (strset_contains(&ctx->path_stack_pushed, path))
> + return;
> +
> + strset_add(&ctx->path_stack_pushed, path);
> + string_list_append(&ctx->path_stack, path);
> +}
> +
> +static int add_children(struct path_walk_context *ctx,
> + const char *base_path,
> + struct object_id *oid)
> +{
So this function assumes that `oid` always refers to a tree? I think it
would make sense to clarify this by naming the function accordingly,
e.g. `add_tree_entries()`.
> + struct tree_desc desc;
> + struct name_entry entry;
> + struct strbuf path = STRBUF_INIT;
> + size_t base_len;
> + struct tree *tree = lookup_tree(ctx->repo, oid);
> +
> + if (!tree) {
> + error(_("failed to walk children of tree %s: not found"),
> + oid_to_hex(oid));
> + return -1;
> + } else if (parse_tree_gently(tree, 1)) {
> + die("bad tree object %s", oid_to_hex(oid));
I wonder whether we maybe shouldn't die but instead return an error in
the spirit of libification.
> + }
> + strbuf_addstr(&path, base_path);
> + base_len = path.len;
> +
> + parse_tree(tree);
> + init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
> + while (tree_entry(&desc, &entry)) {
> + struct type_and_oid_list *list;
> + struct object *o;
> + /* Not actually true, but we will ignore submodules later. */
> + enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB;
> +
> + /* Skip submodules. */
> + if (S_ISGITLINK(entry.mode))
> + continue;
> +
> + if (type == OBJ_TREE) {
> + struct tree *child = lookup_tree(ctx->repo, &entry.oid);
> + o = child ? &child->object : NULL;
> + } else if (type == OBJ_BLOB) {
> + struct blob *child = lookup_blob(ctx->repo, &entry.oid);
> + o = child ? &child->object : NULL;
> + } else {
> + /* Wrong type? */
> + continue;
This code is unreachable, so we could make this a `BUG()`. Might also
use a switch instead, but that's more of a stylistic question.
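
For illustration, the switch-based variant could look roughly like the standalone sketch below. It does not use Git's headers: `SKETCH_BUG()` is a stand-in for Git's `BUG()`, the `SKETCH_IF*` constants copy the canonical mode values used in tree entries, and `type_for_mode()` and `needs_trailing_slash()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

enum sketch_type { SK_COMMIT, SK_TREE, SK_BLOB };

/* Canonical tree-entry mode bits, copied here so the sketch stands alone. */
#define SKETCH_IFMT      0170000
#define SKETCH_IFDIR     0040000
#define SKETCH_IFREG     0100000
#define SKETCH_IFGITLINK 0160000

/* Stand-in for Git's BUG(): report an "impossible" state and abort. */
#define SKETCH_BUG(msg) do { \
	fprintf(stderr, "BUG: %s\n", (msg)); \
	abort(); \
} while (0)

/* Callers are expected to have skipped submodules (gitlinks) already. */
static enum sketch_type type_for_mode(unsigned int mode)
{
	assert((mode & SKETCH_IFMT) != SKETCH_IFGITLINK);
	return (mode & SKETCH_IFMT) == SKETCH_IFDIR ? SK_TREE : SK_BLOB;
}

static int needs_trailing_slash(enum sketch_type type)
{
	switch (type) {
	case SK_TREE:
		return 1;
	case SK_BLOB:
		return 0;
	default:
		/* type_for_mode() never produces anything else. */
		SKETCH_BUG("tree entries should only be trees or blobs");
	}
	return -1;	/* not reached */
}
```

The difference from a silent `continue` is that a future change which introduces a third type fails loudly instead of dropping entries.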
> + }
> +
> + if (!o) /* report error?*/
> + continue;
So this can only happen in case `lookup_tree()` or `lookup_blob()`
run into an error. I think this error should definitely be bubbled up so
that we don't silently skip tree entries in case of repo corruption.
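
As a rough standalone sketch of what bubbling up could look like (not a proposed patch): the loop stops at the first failed lookup, releases its scratch buffer, and returns a negative value. `entry_sketch`, `report_error()`, and `add_entries_sketch()` are hypothetical, and `present` merely models whether a lookup would succeed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical toy entry; 'present' models a successful object lookup. */
struct entry_sketch {
	const char *name;
	int present;
};

/* Prints a message and returns -1, so it can be used in "ret = ..." form. */
static int report_error(const char *msg, const char *name)
{
	fprintf(stderr, "error: %s: %s\n", msg, name);
	return -1;
}

/*
 * Count the entries that can be added. On the first failed lookup, stop,
 * clean up, and return -1 so the caller can notice the problem.
 */
static int add_entries_sketch(const struct entry_sketch *entries, size_t nr,
			      size_t *added)
{
	int ret = 0;
	char *scratch = malloc(64);	/* models a buffer that must be freed */

	assert(entries && added);
	if (!scratch)
		return -1;

	*added = 0;
	for (size_t i = 0; i < nr; i++) {
		if (!entries[i].present) {
			ret = report_error("could not look up entry",
					   entries[i].name);
			goto cleanup;	/* do not leak 'scratch' on error */
		}
		(*added)++;
	}

cleanup:
	free(scratch);
	return ret;
}
```

A caller that accumulates results with `ret |= ...` would then see a nonzero value and could stop the walk, rather than continuing with a partial list.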
> + strbuf_setlen(&path, base_len);
> + strbuf_add(&path, entry.path, entry.pathlen);
> +
> + /*
> + * Trees will end with "/" for concatenation and distinction
> + * from blobs at the same path.
> + */
> + if (type == OBJ_TREE)
> + strbuf_addch(&path, '/');
> +
> + if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
> + CALLOC_ARRAY(list, 1);
> + list->type = type;
> + strmap_put(&ctx->paths_to_lists, path.buf, list);
> + }
> + push_to_stack(ctx, path.buf);
> +
> + /* Skip this object if already seen. */
> + if (o->flags & SEEN)
> + continue;
> + o->flags |= SEEN;
This made me wonder: why do we only skip the object this late? Couldn't
we already have done so immediately after we have looked up the object
to avoid some work? If not, it might be useful to add a comment why it
has to come this late.
> + oid_array_append(&list->oids, &entry.oid);
> + }
> +
> + free_tree_buffer(tree);
> + strbuf_release(&path);
> + return 0;
> +}
> +
> +/*
> + * For each path in paths_to_explore, walk the trees another level
> + * and add any found blobs to the batch (but only if they exist and
> + * haven't been added yet).
> + */
> +static int walk_path(struct path_walk_context *ctx,
> + const char *path)
> +{
> + struct type_and_oid_list *list;
> + int ret = 0;
Nit: needless initialization.
> +
> + list = strmap_get(&ctx->paths_to_lists, path);
> +
> + if (!list->oids.nr)
> + return 0;
> +
> + /* Evaluate function pointer on this data. */
> + ret = ctx->info->path_fn(path, &list->oids, list->type,
> + ctx->info->path_fn_data);
> +
> + /* Expand data for children. */
> + if (list->type == OBJ_TREE) {
> + for (size_t i = 0; i < list->oids.nr; i++) {
> + ret |= add_children(ctx,
> + path,
> + &list->oids.oid[i]);
> + }
> + }
> +
> + oid_array_clear(&list->oids);
> + strmap_remove(&ctx->paths_to_lists, path, 1);
> + return ret;
> +}
> +
> +static void clear_strmap(struct strmap *map)
Nit: this isn't clearing a generic strmap, but rather `paths_to_lists`.
Should we maybe rename it to `clear_paths_to_lists()`?
> +{
> + struct hashmap_iter iter;
> + struct strmap_entry *e;
> +
> + hashmap_for_each_entry(&map->map, &iter, e, ent) {
> + struct type_and_oid_list *list = e->value;
> + oid_array_clear(&list->oids);
> + }
> + strmap_clear(map, 1);
> + strmap_init(map);
> +}
> +
> +/**
> + * Given the configuration of 'info', walk the commits based on 'info->revs' and
> + * call 'info->path_fn' on each discovered path.
> + *
> + * Returns nonzero on an error.
> + */
> +int walk_objects_by_path(struct path_walk_info *info)
> +{
> + const char *root_path = "";
> + int ret = 0;
> + size_t commits_nr = 0, paths_nr = 0;
> + struct commit *c;
> + struct type_and_oid_list *root_tree_list;
> + struct path_walk_context ctx = {
> + .repo = info->revs->repo,
> + .revs = info->revs,
> + .info = info,
> + .path_stack = STRING_LIST_INIT_DUP,
> + .path_stack_pushed = STRSET_INIT,
> + .paths_to_lists = STRMAP_INIT
> + };
> +
> + trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
> +
> + /* Insert a single list for the root tree into the paths. */
> + CALLOC_ARRAY(root_tree_list, 1);
> + root_tree_list->type = OBJ_TREE;
> + strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
> + push_to_stack(&ctx, root_path);
> +
> + if (prepare_revision_walk(info->revs))
> + die(_("failed to setup revision walk"));
> +
> + while ((c = get_revision(info->revs))) {
> + struct object_id *oid = get_commit_tree_oid(c);
> + struct tree *t;
> + commits_nr++;
> +
> + oid = get_commit_tree_oid(c);
> + t = lookup_tree(info->revs->repo, oid);
> +
> + if (!t) {
> + warning("could not find tree %s", oid_to_hex(oid));
> + continue;
> + }
Is this error something we should bubble up to the caller? As mentioned
above, I'm being cautious about silently accepting potentially-corrupt
data. Silent in the spirit of the caller not noticing it, not in the
sense of the user not noticing it.
Patrick
* Re: [PATCH v3 1/7] path-walk: introduce an object walk by path
2024-12-13 11:58 ` Patrick Steinhardt
@ 2024-12-18 14:21 ` Derrick Stolee
2024-12-27 14:18 ` Patrick Steinhardt
0 siblings, 1 reply; 67+ messages in thread
From: Derrick Stolee @ 2024-12-18 14:21 UTC (permalink / raw)
To: Patrick Steinhardt, Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak
On 12/13/24 6:58 AM, Patrick Steinhardt wrote:
> On Fri, Dec 06, 2024 at 07:45:52PM +0000, Derrick Stolee via GitGitGadget wrote:
>> + } else if (parse_tree_gently(tree, 1)) {
>> + die("bad tree object %s", oid_to_hex(oid));
>
> I wonder whether we maybe shouldn't die but instead return an error in
> the spirit of libification.
This is in fact something that is being tested when 'git pack-objects' has
the --path-walk feature. See "get an error for missing tree object" in
t5317 as an example.
It's not enough to fail, but we need to fail with this error message.
Has there been enough progress in the libification effort to establish a
pattern for returning an error message like "bad tree object %s" from an
API like this to the caller?
I will try using an "error(); return -1;" pattern and consider that the
best option for right now.
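
A minimal sketch of that "error(); return -1;" shape, written as standalone C so it compiles outside of Git. Everything here is a stand-in: report_error() only imitates Git's error() (print a message prefixed with "error: " to stderr and return -1), and struct toy_tree, parse_toy_tree() and add_children_sketch() are hypothetical names, not the code from the patch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for Git's error(): print to stderr, return -1. */
static int report_error(const char *fmt, ...)
{
	va_list ap;

	/* A missing format string is a programming error, not bad data. */
	assert(fmt);

	fputs("error: ", stderr);
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
	return -1;
}

/* Toy tree object; 'corrupt' simulates a tree that fails to parse. */
struct toy_tree {
	const char *hex;
	int corrupt;
};

/* Hypothetical parser: returns nonzero on failure. */
static int parse_toy_tree(const struct toy_tree *t)
{
	return t->corrupt;
}

/*
 * Instead of die(), emit the same message and hand the failure back,
 * so the caller decides whether to abort or clean up first.
 */
static int add_children_sketch(const struct toy_tree *t)
{
	if (parse_toy_tree(t))
		return report_error("bad tree object %s", t->hex);
	return 0;
}
```

The message text stays the same as with die(), so a test that greps stderr for "bad tree object" could still pass as long as the caller turns the -1 into a nonzero exit.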
>> + } else {
>> + /* Wrong type? */
>> + continue;
>
> This code is unreachable, so we could make this a `BUG()`. Might also
> use a switch instead, but that's more of a stylistic question.
I think a BUG() would be good here.
>> + }
>> +
>> + if (!o) /* report error?*/
>> + continue;
>
> So this can only happen in case `lookup_tree()` or `lookup_blob()`
> run into an error. I think this error should definitely be bubbled up so
> that we don't silently skip tree entries in case of repo corruption.
Looks like I agreed with you but didn't follow through.
>> + /* Skip this object if already seen. */
>> + if (o->flags & SEEN)
>> + continue;
>> + o->flags |= SEEN;
>
> This made me wonder: why do we only skip the object this late? Couldn't
> we already have done so immediately after we have looked up the object
> to avoid some work? If not, it might be useful to add a comment why it
> has to come this late.
I went to look to see if there was a reason, and at this point there is
not a good reason. This should be moved up to avoid some checks and path
manipulation.
I think that in a later patch, it is important to pass the UNINTERESTING
flag along even when the object is already SEEN. The late check is
probably cruft from an earlier version that passed the UNINTERESTING bits
in this part of the code.
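
A rough sketch of the "check SEEN right after lookup" ordering, as standalone C. The TOY_SEEN bit, struct toy_object and both helpers are assumptions made for illustration; they only model the flag test, not the real add_children() loop.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SEEN (1u << 0)

/* Toy object carrying only the flag word used by the walk. */
struct toy_object {
	unsigned flags;
};

/*
 * Return 1 on the first visit and mark the object, 0 on repeats.
 * Lookup failures are assumed to be reported by the caller earlier.
 */
static int first_visit(struct toy_object *o)
{
	assert(o);
	if (o->flags & TOY_SEEN)
		return 0;
	o->flags |= TOY_SEEN;
	return 1;
}

/* Count entries that still need path work after the early skip. */
static size_t count_new_objects(struct toy_object **entries, size_t nr)
{
	size_t fresh = 0;

	for (size_t i = 0; i < nr; i++) {
		/* Skip before any path string is built for this entry. */
		if (!first_visit(entries[i]))
			continue;
		fresh++;
	}
	return fresh;
}
```

With the check at the top of the loop body, a repeated object costs one flag test and nothing else.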
>> + if (!t) {
>> + warning("could not find tree %s", oid_to_hex(oid));
>> + continue;
>> + }
>
> Is this error something we should bubble up to the caller? As mentioned
> above, I'm being cautious about silently accepting potentially-corrupt
> data. Silent in the spirit of the caller not noticing it, not in the
> sense of the user not noticing it.
Can do.
Thanks for the careful review!
-Stolee
* Re: [PATCH v3 1/7] path-walk: introduce an object walk by path
2024-12-18 14:21 ` Derrick Stolee
@ 2024-12-27 14:18 ` Patrick Steinhardt
0 siblings, 0 replies; 67+ messages in thread
From: Patrick Steinhardt @ 2024-12-27 14:18 UTC (permalink / raw)
To: Derrick Stolee
Cc: Derrick Stolee via GitGitGadget, git, gitster,
johannes.schindelin, peff, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak
On Wed, Dec 18, 2024 at 09:21:25AM -0500, Derrick Stolee wrote:
> On 12/13/24 6:58 AM, Patrick Steinhardt wrote:
> > On Fri, Dec 06, 2024 at 07:45:52PM +0000, Derrick Stolee via GitGitGadget wrote:
>
> > > + } else if (parse_tree_gently(tree, 1)) {
> > > + die("bad tree object %s", oid_to_hex(oid));
> >
> > I wonder whether we maybe shouldn't die but instead return an error in
> > the spirit of libification.
>
> This is in fact something that is being tested when 'git pack-objects' has
> the --path-walk feature. See "get an error for missing tree object" in
> t5317 as an example.
>
> It's not enough to fail, but we need to fail with this error message.
>
> Has there been enough progress in the libification effort to establish a
> pattern for returning an error message like "bad tree object %s" from an
> API like this to the caller?
>
> I will try using an "error(); return -1;" pattern and consider that the
> best option for right now.
Yeah, I think this is best practice for now where we don't have a
superior mechanism like structured errors.
Patrick
* [PATCH v3 2/7] test-lib-functions: add test_cmp_sorted
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
@ 2024-12-06 19:45 ` Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
` (6 subsequent siblings)
8 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 19:45 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
This test helper will reduce repeated logic in t6601-path-walk.sh, and
may be useful elsewhere, too.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
t/test-lib-functions.sh | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index fde9bf54fc3..16b70aebd60 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1267,6 +1267,16 @@ test_cmp () {
eval "$GIT_TEST_CMP" '"$@"'
}
+# test_cmp_sorted runs test_cmp on sorted versions of the two
+# input files. Uses "$1.sorted" and "$2.sorted" as temp files.
+
+test_cmp_sorted () {
+ sort <"$1" >"$1.sorted" &&
+ sort <"$2" >"$2.sorted" &&
+ test_cmp "$1.sorted" "$2.sorted" &&
+ rm "$1.sorted" "$2.sorted"
+}
+
# Check that the given config key has the expected value.
#
# test_cmp_config [-C <dir>] <expected-value>
--
gitgitgadget
* [PATCH v3 3/7] t6601: add helper for testing path-walk API
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
@ 2024-12-06 19:45 ` Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
` (5 subsequent siblings)
8 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 19:45 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.
Store and output a 'batch_nr' value so we can demonstrate that the paths are
grouped together in a batch and not following some other ordering. This
allows us to test the depth-first behavior of the path-walk API. However, we
purposefully do not test the order of the objects in the batch, so the
output is compared to the expected output through a sort.
It is important to mention that the behavior of the API will change soon as
we start to handle UNINTERESTING objects differently, but these tests will
demonstrate the change in behavior.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 3 +-
Makefile | 1 +
t/helper/test-path-walk.c | 84 +++++++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t6601-path-walk.sh | 120 ++++++++++++++++++++++
6 files changed, 209 insertions(+), 1 deletion(-)
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/t6601-path-walk.sh
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index c550c77ca30..662162ec70b 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -42,4 +42,5 @@ commits.
Examples
--------
-See example usages in future changes.
+See example usages in:
+ `t/helper/test-path-walk.c`
diff --git a/Makefile b/Makefile
index d0d8d6888e3..50413d96492 100644
--- a/Makefile
+++ b/Makefile
@@ -818,6 +818,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
TEST_BUILTINS_OBJS += test-partial-clone.o
TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-path-walk.o
TEST_BUILTINS_OBJS += test-pcre2-config.o
TEST_BUILTINS_OBJS += test-pkt-line.o
TEST_BUILTINS_OBJS += test-proc-receive.o
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
new file mode 100644
index 00000000000..def7c81ac4f
--- /dev/null
+++ b/t/helper/test-path-walk.c
@@ -0,0 +1,84 @@
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "test-tool.h"
+#include "environment.h"
+#include "hex.h"
+#include "object-name.h"
+#include "object.h"
+#include "pretty.h"
+#include "revision.h"
+#include "setup.h"
+#include "parse-options.h"
+#include "path-walk.h"
+#include "oid-array.h"
+
+static const char * const path_walk_usage[] = {
+ N_("test-tool path-walk <options> -- <revision-options>"),
+ NULL
+};
+
+struct path_walk_test_data {
+ uintmax_t batch_nr;
+ uintmax_t tree_nr;
+ uintmax_t blob_nr;
+};
+
+static int emit_block(const char *path, struct oid_array *oids,
+ enum object_type type, void *data)
+{
+ struct path_walk_test_data *tdata = data;
+ const char *typestr;
+
+ if (type == OBJ_TREE)
+ tdata->tree_nr += oids->nr;
+ else if (type == OBJ_BLOB)
+ tdata->blob_nr += oids->nr;
+ else
+ BUG("we do not understand this type");
+
+ typestr = type_name(type);
+
+ for (size_t i = 0; i < oids->nr; i++)
+ printf("%"PRIuMAX":%s:%s:%s\n",
+ tdata->batch_nr, typestr, path,
+ oid_to_hex(&oids->oid[i]));
+
+ tdata->batch_nr++;
+ return 0;
+}
+
+int cmd__path_walk(int argc, const char **argv)
+{
+ int res;
+ struct rev_info revs = REV_INFO_INIT;
+ struct path_walk_info info = PATH_WALK_INFO_INIT;
+ struct path_walk_test_data data = { 0 };
+ struct option options[] = {
+ OPT_END(),
+ };
+
+ setup_git_directory();
+ revs.repo = the_repository;
+
+ argc = parse_options(argc, argv, NULL,
+ options, path_walk_usage,
+ PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
+
+ if (argc > 1)
+ setup_revisions(argc, argv, &revs, NULL);
+ else
+ usage(path_walk_usage[0]);
+
+ info.revs = &revs;
+ info.path_fn = emit_block;
+ info.path_fn_data = &data;
+
+ res = walk_objects_by_path(&info);
+
+ printf("trees:%" PRIuMAX "\n"
+ "blobs:%" PRIuMAX "\n",
+ data.tree_nr, data.blob_nr);
+
+ release_revisions(&revs);
+ return res;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 1ebb69a5dc4..43676e7b93a 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -52,6 +52,7 @@ static struct test_cmd cmds[] = {
{ "parse-subcommand", cmd__parse_subcommand },
{ "partial-clone", cmd__partial_clone },
{ "path-utils", cmd__path_utils },
+ { "path-walk", cmd__path_walk },
{ "pcre2-config", cmd__pcre2_config },
{ "pkt-line", cmd__pkt_line },
{ "proc-receive", cmd__proc_receive },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..9cfc5da6e57 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -45,6 +45,7 @@ int cmd__parse_pathspec_file(int argc, const char** argv);
int cmd__parse_subcommand(int argc, const char **argv);
int cmd__partial_clone(int argc, const char **argv);
int cmd__path_utils(int argc, const char **argv);
+int cmd__path_walk(int argc, const char **argv);
int cmd__pcre2_config(int argc, const char **argv);
int cmd__pkt_line(int argc, const char **argv);
int cmd__proc_receive(int argc, const char **argv);
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
new file mode 100755
index 00000000000..4e052c09309
--- /dev/null
+++ b/t/t6601-path-walk.sh
@@ -0,0 +1,120 @@
+#!/bin/sh
+
+TEST_PASSES_SANITIZE_LEAK=true
+
+test_description='direct path-walk API tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repository' '
+ git checkout -b base &&
+
+ mkdir left &&
+ mkdir right &&
+ echo a >a &&
+ echo b >left/b &&
+ echo c >right/c &&
+ git add . &&
+ git commit -m "first" &&
+
+ echo d >right/d &&
+ git add right &&
+ git commit -m "second" &&
+
+ echo bb >left/b &&
+ git commit -a -m "third" &&
+
+ git checkout -b topic HEAD~1 &&
+ echo cc >right/c &&
+ git commit -a -m "topic"
+'
+
+test_expect_success 'all' '
+ test-tool path-walk -- --all >out &&
+
+ cat >expect <<-EOF &&
+ 0:tree::$(git rev-parse topic^{tree})
+ 0:tree::$(git rev-parse base^{tree})
+ 0:tree::$(git rev-parse base~1^{tree})
+ 0:tree::$(git rev-parse base~2^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+ 1:tree:right/:$(git rev-parse base~1:right)
+ 1:tree:right/:$(git rev-parse base~2:right)
+ 2:blob:right/d:$(git rev-parse base~1:right/d)
+ 3:blob:right/c:$(git rev-parse base~2:right/c)
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ 4:tree:left/:$(git rev-parse base:left)
+ 4:tree:left/:$(git rev-parse base~2:left)
+ 5:blob:left/b:$(git rev-parse base~2:left/b)
+ 5:blob:left/b:$(git rev-parse base:left/b)
+ 6:blob:a:$(git rev-parse base~2:a)
+ blobs:6
+ trees:9
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic only' '
+ test-tool path-walk -- topic >out &&
+
+ cat >expect <<-EOF &&
+ 0:tree::$(git rev-parse topic^{tree})
+ 0:tree::$(git rev-parse base~1^{tree})
+ 0:tree::$(git rev-parse base~2^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+ 1:tree:right/:$(git rev-parse base~1:right)
+ 1:tree:right/:$(git rev-parse base~2:right)
+ 2:blob:right/d:$(git rev-parse base~1:right/d)
+ 3:blob:right/c:$(git rev-parse base~2:right/c)
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ 4:tree:left/:$(git rev-parse base~2:left)
+ 5:blob:left/b:$(git rev-parse base~2:left/b)
+ 6:blob:a:$(git rev-parse base~2:a)
+ blobs:5
+ trees:7
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base' '
+ test-tool path-walk -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:tree::$(git rev-parse topic^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+ 2:blob:right/d:$(git rev-parse topic:right/d)
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ 4:tree:left/:$(git rev-parse topic:left)
+ 5:blob:left/b:$(git rev-parse topic:left/b)
+ 6:blob:a:$(git rev-parse topic:a)
+ blobs:4
+ trees:3
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, boundary' '
+ test-tool path-walk -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:tree::$(git rev-parse topic^{tree})
+ 0:tree::$(git rev-parse base~1^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+ 1:tree:right/:$(git rev-parse base~1:right)
+ 2:blob:right/d:$(git rev-parse base~1:right/d)
+ 3:blob:right/c:$(git rev-parse base~1:right/c)
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ 4:tree:left/:$(git rev-parse base~1:left)
+ 5:blob:left/b:$(git rev-parse base~1:left/b)
+ 6:blob:a:$(git rev-parse base~1:a)
+ blobs:5
+ trees:5
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_done
--
gitgitgadget
* [PATCH v3 4/7] path-walk: allow consumer to specify object types
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
` (2 preceding siblings ...)
2024-12-06 19:45 ` [PATCH v3 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
@ 2024-12-06 19:45 ` Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
` (4 subsequent siblings)
8 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 19:45 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <derrickstolee@github.com>
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.
This adds the ability to ask for the commits in a list, as well. We
re-use the empty string for this set of objects because these are passed
directly to the callback function instead of being part of the
'path_stack'.
Future changes will add the ability to visit annotated tags.
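
The filtering described above can be sketched in standalone C roughly as follows. The toy_* names are hypothetical; only the three member names (commits, trees, blobs) and the all-enabled default come from this patch.

```c
#include <assert.h>

enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB };

/* Toy version of the three per-type switches. */
struct toy_walk_info {
	int commits;
	int trees;
	int blobs;
};

/* Everything enabled by default, as with PATH_WALK_INFO_INIT. */
#define TOY_WALK_INFO_INIT { .commits = 1, .trees = 1, .blobs = 1 }

/* Should the callback be invoked for a batch of this type? */
static int wants_type(const struct toy_walk_info *info, enum toy_type type)
{
	switch (type) {
	case TOY_COMMIT:
		return info->commits;
	case TOY_TREE:
		return info->trees;
	case TOY_BLOB:
		return info->blobs;
	}
	assert(0 && "unknown object type");
	return 0;
}

/*
 * Trees still have to be expanded to reach blobs, even when the
 * trees themselves are not reported to the callback.
 */
static int must_expand_trees(const struct toy_walk_info *info)
{
	return info->trees || info->blobs;
}
```

Disabling trees alone only suppresses the callback for tree batches; the tree walk itself can be skipped only when both trees and blobs are off, which is the commits-only case.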
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 9 ++
path-walk.c | 33 ++++-
path-walk.h | 14 +-
t/helper/test-path-walk.c | 15 ++-
t/t6601-path-walk.sh | 149 +++++++++++++++-------
5 files changed, 170 insertions(+), 50 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 662162ec70b..dce553b6114 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,6 +39,15 @@ It is also important that you do not specify the `--objects` flag for the
the objects will be walked in a separate way based on those starting
commits.
+`commits`, `blobs`, `trees`::
+ By default, these members are enabled and signal that the path-walk
+ API should call the `path_fn` on objects of these types. Specialized
+ applications could disable some options to make it simpler to walk
+ the objects or to have fewer calls to `path_fn`.
++
+While it is possible to walk only commits in this way, consumers would be
+better off using the revision walk API instead.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index 24cf04c1e7d..2ca08402367 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -98,6 +98,10 @@ static int add_children(struct path_walk_context *ctx,
if (S_ISGITLINK(entry.mode))
continue;
+ /* If the caller doesn't want blobs, then don't bother. */
+ if (!ctx->info->blobs && type == OBJ_BLOB)
+ continue;
+
if (type == OBJ_TREE) {
struct tree *child = lookup_tree(ctx->repo, &entry.oid);
o = child ? &child->object : NULL;
@@ -157,9 +161,11 @@ static int walk_path(struct path_walk_context *ctx,
if (!list->oids.nr)
return 0;
- /* Evaluate function pointer on this data. */
- ret = ctx->info->path_fn(path, &list->oids, list->type,
- ctx->info->path_fn_data);
+ /* Evaluate function pointer on this data, if requested. */
+ if ((list->type == OBJ_TREE && ctx->info->trees) ||
+ (list->type == OBJ_BLOB && ctx->info->blobs))
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
/* Expand data for children. */
if (list->type == OBJ_TREE) {
@@ -201,6 +207,7 @@ int walk_objects_by_path(struct path_walk_info *info)
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
struct type_and_oid_list *root_tree_list;
+ struct type_and_oid_list *commit_list;
struct path_walk_context ctx = {
.repo = info->revs->repo,
.revs = info->revs,
@@ -212,6 +219,9 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+ CALLOC_ARRAY(commit_list, 1);
+ commit_list->type = OBJ_COMMIT;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
@@ -222,10 +232,18 @@ int walk_objects_by_path(struct path_walk_info *info)
die(_("failed to setup revision walk"));
while ((c = get_revision(info->revs))) {
- struct object_id *oid = get_commit_tree_oid(c);
+ struct object_id *oid;
struct tree *t;
commits_nr++;
+ if (info->commits)
+ oid_array_append(&commit_list->oids,
+ &c->object.oid);
+
+ /* If we only care about commits, then skip trees. */
+ if (!info->trees && !info->blobs)
+ continue;
+
oid = get_commit_tree_oid(c);
t = lookup_tree(info->revs->repo, oid);
@@ -243,6 +261,13 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+ /* Track all commits. */
+ if (info->commits && commit_list->oids.nr)
+ ret = info->path_fn("", &commit_list->oids, OBJ_COMMIT,
+ info->path_fn_data);
+ oid_array_clear(&commit_list->oids);
+ free(commit_list);
+
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
while (!ret && ctx.path_stack.nr) {
char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
diff --git a/path-walk.h b/path-walk.h
index c9e94a98bc8..2d2afc29b47 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -30,9 +30,21 @@ struct path_walk_info {
*/
path_fn path_fn;
void *path_fn_data;
+
+ /**
+ * Initialize which object types the path_fn should be called on. This
+ * could also limit the walk to skip blobs if not set.
+ */
+ int commits;
+ int trees;
+ int blobs;
};
-#define PATH_WALK_INFO_INIT { 0 }
+#define PATH_WALK_INFO_INIT { \
+ .blobs = 1, \
+ .trees = 1, \
+ .commits = 1, \
+}
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index def7c81ac4f..a57a05a6391 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -19,6 +19,8 @@ static const char * const path_walk_usage[] = {
struct path_walk_test_data {
uintmax_t batch_nr;
+
+ uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
};
@@ -33,6 +35,8 @@ static int emit_block(const char *path, struct oid_array *oids,
tdata->tree_nr += oids->nr;
else if (type == OBJ_BLOB)
tdata->blob_nr += oids->nr;
+ else if (type == OBJ_COMMIT)
+ tdata->commit_nr += oids->nr;
else
BUG("we do not understand this type");
@@ -54,6 +58,12 @@ int cmd__path_walk(int argc, const char **argv)
struct path_walk_info info = PATH_WALK_INFO_INIT;
struct path_walk_test_data data = { 0 };
struct option options[] = {
+ OPT_BOOL(0, "blobs", &info.blobs,
+ N_("toggle inclusion of blob objects")),
+ OPT_BOOL(0, "commits", &info.commits,
+ N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "trees", &info.trees,
+ N_("toggle inclusion of tree objects")),
OPT_END(),
};
@@ -75,9 +85,10 @@ int cmd__path_walk(int argc, const char **argv)
res = walk_objects_by_path(&info);
- printf("trees:%" PRIuMAX "\n"
+ printf("commits:%" PRIuMAX "\n"
+ "trees:%" PRIuMAX "\n"
"blobs:%" PRIuMAX "\n",
- data.tree_nr, data.blob_nr);
+ data.commit_nr, data.tree_nr, data.blob_nr);
release_revisions(&revs);
return res;
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 4e052c09309..4a4939a1b02 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -33,22 +33,27 @@ test_expect_success 'all' '
test-tool path-walk -- --all >out &&
cat >expect <<-EOF &&
- 0:tree::$(git rev-parse topic^{tree})
- 0:tree::$(git rev-parse base^{tree})
- 0:tree::$(git rev-parse base~1^{tree})
- 0:tree::$(git rev-parse base~2^{tree})
- 1:tree:right/:$(git rev-parse topic:right)
- 1:tree:right/:$(git rev-parse base~1:right)
- 1:tree:right/:$(git rev-parse base~2:right)
- 2:blob:right/d:$(git rev-parse base~1:right/d)
- 3:blob:right/c:$(git rev-parse base~2:right/c)
- 3:blob:right/c:$(git rev-parse topic:right/c)
- 4:tree:left/:$(git rev-parse base:left)
- 4:tree:left/:$(git rev-parse base~2:left)
- 5:blob:left/b:$(git rev-parse base~2:left/b)
- 5:blob:left/b:$(git rev-parse base:left/b)
- 6:blob:a:$(git rev-parse base~2:a)
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base)
+ 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~2)
+ 1:tree::$(git rev-parse topic^{tree})
+ 1:tree::$(git rev-parse base^{tree})
+ 1:tree::$(git rev-parse base~1^{tree})
+ 1:tree::$(git rev-parse base~2^{tree})
+ 2:tree:right/:$(git rev-parse topic:right)
+ 2:tree:right/:$(git rev-parse base~1:right)
+ 2:tree:right/:$(git rev-parse base~2:right)
+ 3:blob:right/d:$(git rev-parse base~1:right/d)
+ 4:blob:right/c:$(git rev-parse base~2:right/c)
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+ 5:tree:left/:$(git rev-parse base:left)
+ 5:tree:left/:$(git rev-parse base~2:left)
+ 6:blob:left/b:$(git rev-parse base~2:left/b)
+ 6:blob:left/b:$(git rev-parse base:left/b)
+ 7:blob:a:$(git rev-parse base~2:a)
blobs:6
+ commits:4
trees:9
EOF
@@ -59,19 +64,23 @@ test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&
cat >expect <<-EOF &&
- 0:tree::$(git rev-parse topic^{tree})
- 0:tree::$(git rev-parse base~1^{tree})
- 0:tree::$(git rev-parse base~2^{tree})
- 1:tree:right/:$(git rev-parse topic:right)
- 1:tree:right/:$(git rev-parse base~1:right)
- 1:tree:right/:$(git rev-parse base~2:right)
- 2:blob:right/d:$(git rev-parse base~1:right/d)
- 3:blob:right/c:$(git rev-parse base~2:right/c)
- 3:blob:right/c:$(git rev-parse topic:right/c)
- 4:tree:left/:$(git rev-parse base~2:left)
- 5:blob:left/b:$(git rev-parse base~2:left/b)
- 6:blob:a:$(git rev-parse base~2:a)
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~2)
+ 1:tree::$(git rev-parse topic^{tree})
+ 1:tree::$(git rev-parse base~1^{tree})
+ 1:tree::$(git rev-parse base~2^{tree})
+ 2:tree:right/:$(git rev-parse topic:right)
+ 2:tree:right/:$(git rev-parse base~1:right)
+ 2:tree:right/:$(git rev-parse base~2:right)
+ 3:blob:right/d:$(git rev-parse base~1:right/d)
+ 4:blob:right/c:$(git rev-parse base~2:right/c)
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+ 5:tree:left/:$(git rev-parse base~2:left)
+ 6:blob:left/b:$(git rev-parse base~2:left/b)
+ 7:blob:a:$(git rev-parse base~2:a)
blobs:5
+ commits:3
trees:7
EOF
@@ -82,15 +91,66 @@ test_expect_success 'topic, not base' '
test-tool path-walk -- topic --not base >out &&
cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 1:tree::$(git rev-parse topic^{tree})
+ 2:tree:right/:$(git rev-parse topic:right)
+ 3:blob:right/d:$(git rev-parse topic:right/d)
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+ 5:tree:left/:$(git rev-parse topic:left)
+ 6:blob:left/b:$(git rev-parse topic:left/b)
+ 7:blob:a:$(git rev-parse topic:a)
+ blobs:4
+ commits:1
+ trees:3
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only blobs' '
+ test-tool path-walk --no-trees --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ trees:0
+ 0:blob:right/d:$(git rev-parse topic:right/d)
+ 1:blob:right/c:$(git rev-parse topic:right/c)
+ 2:blob:left/b:$(git rev-parse topic:left/b)
+ 3:blob:a:$(git rev-parse topic:a)
+ blobs:4
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+# No, this doesn't make a lot of sense for the path-walk API,
+# but it is possible to do.
+test_expect_success 'topic, not base, only commits' '
+ test-tool path-walk --no-blobs --no-trees \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ commits:1
+ trees:0
+ blobs:0
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only trees' '
+ test-tool path-walk --no-blobs --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
0:tree::$(git rev-parse topic^{tree})
1:tree:right/:$(git rev-parse topic:right)
- 2:blob:right/d:$(git rev-parse topic:right/d)
- 3:blob:right/c:$(git rev-parse topic:right/c)
- 4:tree:left/:$(git rev-parse topic:left)
- 5:blob:left/b:$(git rev-parse topic:left/b)
- 6:blob:a:$(git rev-parse topic:a)
- blobs:4
+ 2:tree:left/:$(git rev-parse topic:left)
trees:3
+ blobs:0
EOF
test_cmp_sorted expect out
@@ -100,17 +160,20 @@ test_expect_success 'topic, not base, boundary' '
test-tool path-walk -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
- 0:tree::$(git rev-parse topic^{tree})
- 0:tree::$(git rev-parse base~1^{tree})
- 1:tree:right/:$(git rev-parse topic:right)
- 1:tree:right/:$(git rev-parse base~1:right)
- 2:blob:right/d:$(git rev-parse base~1:right/d)
- 3:blob:right/c:$(git rev-parse base~1:right/c)
- 3:blob:right/c:$(git rev-parse topic:right/c)
- 4:tree:left/:$(git rev-parse base~1:left)
- 5:blob:left/b:$(git rev-parse base~1:left/b)
- 6:blob:a:$(git rev-parse base~1:a)
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base~1)
+ 1:tree::$(git rev-parse topic^{tree})
+ 1:tree::$(git rev-parse base~1^{tree})
+ 2:tree:right/:$(git rev-parse topic:right)
+ 2:tree:right/:$(git rev-parse base~1:right)
+ 3:blob:right/d:$(git rev-parse base~1:right/d)
+ 4:blob:right/c:$(git rev-parse base~1:right/c)
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+ 5:tree:left/:$(git rev-parse base~1:left)
+ 6:blob:left/b:$(git rev-parse base~1:left/b)
+ 7:blob:a:$(git rev-parse base~1:a)
blobs:5
+ commits:2
trees:5
EOF
--
gitgitgadget
* [PATCH v3 5/7] path-walk: visit tags and cached objects
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
` (3 preceding siblings ...)
2024-12-06 19:45 ` [PATCH v3 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
@ 2024-12-06 19:45 ` Derrick Stolee via GitGitGadget
2024-12-13 11:58 ` Patrick Steinhardt
2024-12-06 19:45 ` [PATCH v3 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
` (3 subsequent siblings)
8 siblings, 1 reply; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 19:45 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The rev_info that is specified for a path-walk traversal may specify
visiting tag refs (both lightweight and annotated) and also may specify
indexed objects (blobs and trees). Update the path-walk API to walk
these objects as well.
When walking tags, we need to peel the annotated objects until reaching
a non-tag object. If we reach a commit, then we can add it to the
pending objects to make sure we visit it during the commit walk. If we
reach a tree, then we will assume that it is a root tree. If we reach a
blob, then we have no good path name and so add it to a new list of
"tagged blobs".
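
As a rough illustration of the peeling rule in the paragraph above, here is a standalone C sketch. The toy_* types and the TOY_BROKEN result for a tag with no target are assumptions for the example, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };

/* Toy object: a tag points at another object through 'tagged'. */
struct toy_object {
	enum toy_type type;
	struct toy_object *tagged; /* only meaningful for TOY_TAG */
};

/* Where a peeled tag target ends up in the walk. */
enum toy_dest {
	TOY_PENDING_COMMIT, /* visited later by the commit walk */
	TOY_ROOT_TREE,      /* treated as a root tree */
	TOY_TAGGED_BLOB,    /* no path name, so kept in a separate list */
	TOY_BROKEN          /* hypothetical: tag chain with no target */
};

/* Follow a chain of annotated tags until reaching a non-tag object. */
static const struct toy_object *peel(const struct toy_object *o)
{
	while (o && o->type == TOY_TAG)
		o = o->tagged;
	return o;
}

static enum toy_dest classify_tag_target(const struct toy_object *o)
{
	const struct toy_object *peeled = peel(o);

	if (!peeled)
		return TOY_BROKEN;
	assert(peeled->type != TOY_TAG);

	switch (peeled->type) {
	case TOY_COMMIT:
		return TOY_PENDING_COMMIT;
	case TOY_TREE:
		return TOY_ROOT_TREE;
	case TOY_BLOB:
		return TOY_TAGGED_BLOB;
	default:
		return TOY_BROKEN;
	}
}
```

A tag of a tag of a commit therefore lands in the same place as a direct tag of that commit.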
When the rev_info includes the "--indexed-objects" flag, then the
pending set includes blobs and trees found in the cache entries and
cache-tree. The cache entries are usually blobs, though they could be
trees in the case of a sparse index. The cache-tree stores
previously-hashed tree objects but these are cleared out when staging
objects below those paths. We add tests that demonstrate this.
The indexed objects come with a non-NULL 'path' value in the pending
item. This allows us to prepopulate the 'paths_to_lists' strmap with
lists for these paths.
The tricky thing about this walk is that we will want to combine the
indexed objects walk with the commit walk, especially in the future case
of walking objects during a command like 'git repack'.
Whenever possible, we want the objects from the index to be grouped with
similar objects in history. We don't want to miss any paths that appear
only in the index and not in the commit history.
Thus, we need to be careful to let the path stack be populated initially
with only the root tree path (and possibly tags and tagged blobs) and go
through the normal depth-first search. Afterwards, if there are other
paths that are remaining in the paths_to_lists strmap, we should then
iterate through the stack and visit those objects recursively.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 2 +-
path-walk.c | 174 +++++++++++++++++++-
path-walk.h | 2 +
t/helper/test-path-walk.c | 15 +-
t/t6601-path-walk.sh | 186 +++++++++++++++++++---
5 files changed, 353 insertions(+), 26 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index dce553b6114..6022c381b7c 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,7 +39,7 @@ It is also important that you do not specify the `--objects` flag for the
the objects will be walked in a separate way based on those starting
commits.
-`commits`, `blobs`, `trees`::
+`commits`, `blobs`, `trees`, `tags`::
By default, these members are enabled and signal that the path-walk
API should call the `path_fn` on objects of these types. Specialized
applications could disable some options to make it simpler to walk
diff --git a/path-walk.c b/path-walk.c
index 2ca08402367..a1f539dcd46 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -13,10 +13,13 @@
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
+#include "tag.h"
#include "trace2.h"
#include "tree.h"
#include "tree-walk.h"
+static const char *root_path = "";
+
struct type_and_oid_list
{
enum object_type type;
@@ -158,12 +161,16 @@ static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
+ if (!list)
+ BUG("provided path '%s' that had no associated list", path);
+
if (!list->oids.nr)
return 0;
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
- (list->type == OBJ_BLOB && ctx->info->blobs))
+ (list->type == OBJ_BLOB && ctx->info->blobs) ||
+ (list->type == OBJ_TAG && ctx->info->tags))
ret = ctx->info->path_fn(path, &list->oids, list->type,
ctx->info->path_fn_data);
@@ -194,6 +201,134 @@ static void clear_strmap(struct strmap *map)
strmap_init(map);
}
+static void setup_pending_objects(struct path_walk_info *info,
+ struct path_walk_context *ctx)
+{
+ struct type_and_oid_list *tags = NULL;
+ struct type_and_oid_list *tagged_blobs = NULL;
+ struct type_and_oid_list *root_tree_list = NULL;
+
+ if (info->tags)
+ CALLOC_ARRAY(tags, 1);
+ if (info->blobs)
+ CALLOC_ARRAY(tagged_blobs, 1);
+ if (info->trees)
+ root_tree_list = strmap_get(&ctx->paths_to_lists, root_path);
+
+ /*
+ * Pending objects include:
+ * * Commits at branch tips.
+ * * Annotated tags at tag tips.
+ * * Any kind of object at lightweight tag tips.
+ * * Trees and blobs in the index (with an associated path).
+ */
+ for (size_t i = 0; i < info->revs->pending.nr; i++) {
+ struct object_array_entry *pending = info->revs->pending.objects + i;
+ struct object *obj = pending->item;
+
+ /* Commits will be picked up by revision walk. */
+ if (obj->type == OBJ_COMMIT)
+ continue;
+
+ /* Navigate annotated tag object chains. */
+ while (obj->type == OBJ_TAG) {
+ struct tag *tag = lookup_tag(info->revs->repo, &obj->oid);
+ if (!tag)
+ break;
+ if (tag->object.flags & SEEN)
+ break;
+ tag->object.flags |= SEEN;
+
+ if (tags)
+ oid_array_append(&tags->oids, &obj->oid);
+ obj = tag->tagged;
+ }
+
+ if (obj->type == OBJ_TAG)
+ continue;
+
+ /* We are now at a non-tag object. */
+ if (obj->flags & SEEN)
+ continue;
+ obj->flags |= SEEN;
+
+ switch (obj->type) {
+ case OBJ_TREE:
+ if (!info->trees)
+ continue;
+ if (pending->path) {
+ struct type_and_oid_list *list;
+ char *path = *pending->path ? xstrfmt("%s/", pending->path)
+ : xstrdup("");
+ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = OBJ_TREE;
+ strmap_put(&ctx->paths_to_lists, path, list);
+ }
+ oid_array_append(&list->oids, &obj->oid);
+ free(path);
+ } else {
+ /* assume a root tree, such as a lightweight tag. */
+ oid_array_append(&root_tree_list->oids, &obj->oid);
+ }
+ break;
+
+ case OBJ_BLOB:
+ if (!info->blobs)
+ continue;
+ if (pending->path) {
+ struct type_and_oid_list *list;
+ char *path = pending->path;
+ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = OBJ_BLOB;
+ strmap_put(&ctx->paths_to_lists, path, list);
+ }
+ oid_array_append(&list->oids, &obj->oid);
+ } else {
+ /* no path is known: collect as a tagged blob, such as a lightweight tag. */
+ oid_array_append(&tagged_blobs->oids, &obj->oid);
+ }
+ break;
+
+ case OBJ_COMMIT:
+ /* Make sure it is in the object walk */
+ if (obj != pending->item)
+ add_pending_object(info->revs, obj, "");
+ break;
+
+ default:
+ BUG("should not see any other type here");
+ }
+ }
+
+ /*
+ * Add tag objects and tagged blobs if they exist.
+ */
+ if (tagged_blobs) {
+ if (tagged_blobs->oids.nr) {
+ const char *tagged_blob_path = "/tagged-blobs";
+ tagged_blobs->type = OBJ_BLOB;
+ push_to_stack(ctx, tagged_blob_path);
+ strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
+ } else {
+ oid_array_clear(&tagged_blobs->oids);
+ free(tagged_blobs);
+ }
+ }
+ if (tags) {
+ if (tags->oids.nr) {
+ const char *tag_path = "/tags";
+ tags->type = OBJ_TAG;
+ push_to_stack(ctx, tag_path);
+ strmap_put(&ctx->paths_to_lists, tag_path, tags);
+ } else {
+ oid_array_clear(&tags->oids);
+ free(tags);
+ }
+ }
+}
+
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
* call 'info->path_fn' on each discovered path.
@@ -202,7 +337,6 @@ static void clear_strmap(struct strmap *map)
*/
int walk_objects_by_path(struct path_walk_info *info)
{
- const char *root_path = "";
int ret = 0;
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
@@ -222,15 +356,31 @@ int walk_objects_by_path(struct path_walk_info *info)
CALLOC_ARRAY(commit_list, 1);
commit_list->type = OBJ_COMMIT;
+ if (info->tags)
+ info->revs->tag_objects = 1;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
push_to_stack(&ctx, root_path);
+ /*
+ * Set these values before preparing the walk to catch
+ * lightweight tags pointing to non-commits and indexed objects.
+ */
+ info->revs->blob_objects = info->blobs;
+ info->revs->tree_objects = info->trees;
+
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ info->revs->blob_objects = info->revs->tree_objects = 0;
+
+ trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
+ setup_pending_objects(info, &ctx);
+ trace2_region_leave("path-walk", "pending-walk", info->revs->repo);
+
while ((c = get_revision(info->revs))) {
struct object_id *oid;
struct tree *t;
@@ -278,6 +428,26 @@ int walk_objects_by_path(struct path_walk_info *info)
free(path);
}
+
+ /* Are there paths remaining? Likely they are from indexed objects. */
+ if (!strmap_empty(&ctx.paths_to_lists)) {
+ struct hashmap_iter iter;
+ struct strmap_entry *entry;
+
+ strmap_for_each_entry(&ctx.paths_to_lists, &iter, entry)
+ push_to_stack(&ctx, entry->key);
+
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ }
+
trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
trace2_region_leave("path-walk", "path-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index 2d2afc29b47..ca839f873e4 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -38,12 +38,14 @@ struct path_walk_info {
int commits;
int trees;
int blobs;
+ int tags;
};
#define PATH_WALK_INFO_INIT { \
.blobs = 1, \
.trees = 1, \
.commits = 1, \
+ .tags = 1, \
}
/**
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index a57a05a6391..56289859e69 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -23,6 +23,7 @@ struct path_walk_test_data {
uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
+ uintmax_t tag_nr;
};
static int emit_block(const char *path, struct oid_array *oids,
@@ -37,11 +38,18 @@ static int emit_block(const char *path, struct oid_array *oids,
tdata->blob_nr += oids->nr;
else if (type == OBJ_COMMIT)
tdata->commit_nr += oids->nr;
+ else if (type == OBJ_TAG)
+ tdata->tag_nr += oids->nr;
else
BUG("we do not understand this type");
typestr = type_name(type);
+ /* This should never be output during tests. */
+ if (!oids->nr)
+ printf("%"PRIuMAX":%s:%s:EMPTY\n",
+ tdata->batch_nr, typestr, path);
+
for (size_t i = 0; i < oids->nr; i++)
printf("%"PRIuMAX":%s:%s:%s\n",
tdata->batch_nr, typestr, path,
@@ -62,6 +70,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of blob objects")),
OPT_BOOL(0, "commits", &info.commits,
N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "tags", &info.tags,
+ N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
OPT_END(),
@@ -87,8 +97,9 @@ int cmd__path_walk(int argc, const char **argv)
printf("commits:%" PRIuMAX "\n"
"trees:%" PRIuMAX "\n"
- "blobs:%" PRIuMAX "\n",
- data.commit_nr, data.tree_nr, data.blob_nr);
+ "blobs:%" PRIuMAX "\n"
+ "tags:%" PRIuMAX "\n",
+ data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
release_revisions(&revs);
return res;
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 4a4939a1b02..1f3d2e0cb76 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -9,29 +9,142 @@ test_description='direct path-walk API tests'
test_expect_success 'setup test repository' '
git checkout -b base &&
+ # Make some objects that will only be reachable
+ # via non-commit tags.
+ mkdir child &&
+ echo file >child/file &&
+ git add child &&
+ git commit -m "will abandon" &&
+ git tag -a -m "tree" tree-tag HEAD^{tree} &&
+ echo file2 >file2 &&
+ git add file2 &&
+ git commit --amend -m "will abandon" &&
+ git tag tree-tag2 HEAD^{tree} &&
+
+ echo blob >file &&
+ blob_oid=$(git hash-object -t blob -w --stdin <file) &&
+ git tag -a -m "blob" blob-tag "$blob_oid" &&
+ echo blob2 >file2 &&
+ blob2_oid=$(git hash-object -t blob -w --stdin <file2) &&
+ git tag blob-tag2 "$blob2_oid" &&
+
+ rm -fr child file file2 &&
+
mkdir left &&
mkdir right &&
echo a >a &&
echo b >left/b &&
echo c >right/c &&
git add . &&
- git commit -m "first" &&
+ git commit --amend -m "first" &&
+ git tag -m "first" first HEAD &&
echo d >right/d &&
git add right &&
git commit -m "second" &&
+ git tag -a -m "second (under)" second.1 HEAD &&
+ git tag -a -m "second (top)" second.2 second.1 &&
+ # Set up file/dir collision in history.
+ rm a &&
+ mkdir a &&
+ echo a >a/a &&
echo bb >left/b &&
- git commit -a -m "third" &&
+ git add a left &&
+ git commit -m "third" &&
+ git tag -a -m "third" third &&
git checkout -b topic HEAD~1 &&
echo cc >right/c &&
- git commit -a -m "topic"
+ git commit -a -m "topic" &&
+ git tag -a -m "fourth" fourth
'
test_expect_success 'all' '
test-tool path-walk -- --all >out &&
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base)
+ 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~2)
+ 1:tag:/tags:$(git rev-parse refs/tags/first)
+ 1:tag:/tags:$(git rev-parse refs/tags/second.1)
+ 1:tag:/tags:$(git rev-parse refs/tags/second.2)
+ 1:tag:/tags:$(git rev-parse refs/tags/third)
+ 1:tag:/tags:$(git rev-parse refs/tags/fourth)
+ 1:tag:/tags:$(git rev-parse refs/tags/tree-tag)
+ 1:tag:/tags:$(git rev-parse refs/tags/blob-tag)
+ 2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
+ 2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ 3:tree::$(git rev-parse topic^{tree})
+ 3:tree::$(git rev-parse base^{tree})
+ 3:tree::$(git rev-parse base~1^{tree})
+ 3:tree::$(git rev-parse base~2^{tree})
+ 3:tree::$(git rev-parse refs/tags/tree-tag^{})
+ 3:tree::$(git rev-parse refs/tags/tree-tag2^{})
+ 4:blob:a:$(git rev-parse base~2:a)
+ 5:tree:right/:$(git rev-parse topic:right)
+ 5:tree:right/:$(git rev-parse base~1:right)
+ 5:tree:right/:$(git rev-parse base~2:right)
+ 6:blob:right/d:$(git rev-parse base~1:right/d)
+ 7:blob:right/c:$(git rev-parse base~2:right/c)
+ 7:blob:right/c:$(git rev-parse topic:right/c)
+ 8:tree:left/:$(git rev-parse base:left)
+ 8:tree:left/:$(git rev-parse base~2:left)
+ 9:blob:left/b:$(git rev-parse base~2:left/b)
+ 9:blob:left/b:$(git rev-parse base:left/b)
+ 10:tree:a/:$(git rev-parse base:a)
+ 11:blob:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
+ 12:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
+ 13:blob:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+ blobs:10
+ commits:4
+ tags:7
+ trees:13
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'indexed objects' '
+ test_when_finished git reset --hard &&
+
+ # stage change into index, adding a blob but
+ # also invalidating the cache-tree for the root
+ # and the "left" directory.
+ echo bogus >left/c &&
+ git add left &&
+
+ test-tool path-walk -- --indexed-objects >out &&
+
+ cat >expect <<-EOF &&
+ 0:blob:a:$(git rev-parse HEAD:a)
+ 1:blob:left/b:$(git rev-parse HEAD:left/b)
+ 2:blob:left/c:$(git rev-parse :left/c)
+ 3:blob:right/c:$(git rev-parse HEAD:right/c)
+ 4:blob:right/d:$(git rev-parse HEAD:right/d)
+ 5:tree:right/:$(git rev-parse topic:right)
+ blobs:5
+ commits:0
+ tags:0
+ trees:1
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'branches and indexed objects mix well' '
+ test_when_finished git reset --hard &&
+
+ # stage change into index, adding a blob but
+ # also invalidating the cache-tree for the root
+ # and the "right" directory.
+ echo fake >right/d &&
+ git add right &&
+
+ test-tool path-walk -- --indexed-objects --branches >out &&
+
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
0:commit::$(git rev-parse base)
@@ -41,20 +154,23 @@ test_expect_success 'all' '
1:tree::$(git rev-parse base^{tree})
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
- 2:tree:right/:$(git rev-parse topic:right)
- 2:tree:right/:$(git rev-parse base~1:right)
- 2:tree:right/:$(git rev-parse base~2:right)
- 3:blob:right/d:$(git rev-parse base~1:right/d)
- 4:blob:right/c:$(git rev-parse base~2:right/c)
- 4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse base:left)
- 5:tree:left/:$(git rev-parse base~2:left)
- 6:blob:left/b:$(git rev-parse base~2:left/b)
- 6:blob:left/b:$(git rev-parse base:left/b)
- 7:blob:a:$(git rev-parse base~2:a)
- blobs:6
+ 2:blob:a:$(git rev-parse base~2:a)
+ 3:tree:right/:$(git rev-parse topic:right)
+ 3:tree:right/:$(git rev-parse base~1:right)
+ 3:tree:right/:$(git rev-parse base~2:right)
+ 4:blob:right/d:$(git rev-parse base~1:right/d)
+ 4:blob:right/d:$(git rev-parse :right/d)
+ 5:blob:right/c:$(git rev-parse base~2:right/c)
+ 5:blob:right/c:$(git rev-parse topic:right/c)
+ 6:tree:left/:$(git rev-parse base:left)
+ 6:tree:left/:$(git rev-parse base~2:left)
+ 7:blob:left/b:$(git rev-parse base:left/b)
+ 7:blob:left/b:$(git rev-parse base~2:left/b)
+ 8:tree:a/:$(git rev-parse refs/tags/third:a)
+ blobs:7
commits:4
- trees:9
+ tags:0
+ trees:10
EOF
test_cmp_sorted expect out
@@ -81,6 +197,7 @@ test_expect_success 'topic only' '
7:blob:a:$(git rev-parse base~2:a)
blobs:5
commits:3
+ tags:0
trees:7
EOF
@@ -101,6 +218,7 @@ test_expect_success 'topic, not base' '
7:blob:a:$(git rev-parse topic:a)
blobs:4
commits:1
+ tags:0
trees:3
EOF
@@ -112,13 +230,14 @@ test_expect_success 'topic, not base, only blobs' '
-- topic --not base >out &&
cat >expect <<-EOF &&
- commits:0
- trees:0
0:blob:right/d:$(git rev-parse topic:right/d)
1:blob:right/c:$(git rev-parse topic:right/c)
2:blob:left/b:$(git rev-parse topic:left/b)
3:blob:a:$(git rev-parse topic:a)
blobs:4
+ commits:0
+ tags:0
+ trees:0
EOF
test_cmp_sorted expect out
@@ -133,8 +252,9 @@ test_expect_success 'topic, not base, only commits' '
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
commits:1
- trees:0
blobs:0
+ tags:0
+ trees:0
EOF
test_cmp_sorted expect out
@@ -145,12 +265,13 @@ test_expect_success 'topic, not base, only trees' '
-- topic --not base >out &&
cat >expect <<-EOF &&
- commits:0
0:tree::$(git rev-parse topic^{tree})
1:tree:right/:$(git rev-parse topic:right)
2:tree:left/:$(git rev-parse topic:left)
- trees:3
+ commits:0
blobs:0
+ tags:0
+ trees:3
EOF
test_cmp_sorted expect out
@@ -174,10 +295,33 @@ test_expect_success 'topic, not base, boundary' '
7:blob:a:$(git rev-parse base~1:a)
blobs:5
commits:2
+ tags:0
trees:5
EOF
test_cmp_sorted expect out
'
+test_expect_success 'trees are reported exactly once' '
+ test_when_finished "rm -rf unique-trees" &&
+ test_create_repo unique-trees &&
+ (
+ cd unique-trees &&
+ mkdir initial &&
+ test_commit initial/file &&
+
+ git switch -c move-to-top &&
+ git mv initial/file.t ./ &&
+ test_tick &&
+ git commit -m moved &&
+
+ git update-ref refs/heads/other HEAD
+ ) &&
+
+ test-tool -C unique-trees path-walk -- --all >out &&
+ tree=$(git -C unique-trees rev-parse HEAD:) &&
+ grep "$tree" out >out-filtered &&
+ test_line_count = 1 out-filtered
+'
+
test_done
--
gitgitgadget
* Re: [PATCH v3 5/7] path-walk: visit tags and cached objects
2024-12-06 19:45 ` [PATCH v3 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
@ 2024-12-13 11:58 ` Patrick Steinhardt
2024-12-18 14:23 ` Derrick Stolee
0 siblings, 1 reply; 67+ messages in thread
From: Patrick Steinhardt @ 2024-12-13 11:58 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee
On Fri, Dec 06, 2024 at 07:45:56PM +0000, Derrick Stolee via GitGitGadget wrote:
> @@ -194,6 +201,134 @@ static void clear_strmap(struct strmap *map)
> strmap_init(map);
> }
>
> +static void setup_pending_objects(struct path_walk_info *info,
> + struct path_walk_context *ctx)
> +{
> + struct type_and_oid_list *tags = NULL;
> + struct type_and_oid_list *tagged_blobs = NULL;
> + struct type_and_oid_list *root_tree_list = NULL;
> +
> + if (info->tags)
> + CALLOC_ARRAY(tags, 1);
> + if (info->blobs)
> + CALLOC_ARRAY(tagged_blobs, 1);
> + if (info->trees)
> + root_tree_list = strmap_get(&ctx->paths_to_lists, root_path);
> +
> + /*
> + * Pending objects include:
> + * * Commits at branch tips.
> + * * Annotated tags at tag tips.
> + * * Any kind of object at lightweight tag tips.
> + * * Trees and blobs in the index (with an associated path).
> + */
> + for (size_t i = 0; i < info->revs->pending.nr; i++) {
> + struct object_array_entry *pending = info->revs->pending.objects + i;
> + struct object *obj = pending->item;
> +
> + /* Commits will be picked up by revision walk. */
> + if (obj->type == OBJ_COMMIT)
> + continue;
> +
> + /* Navigate annotated tag object chains. */
> + while (obj->type == OBJ_TAG) {
> + struct tag *tag = lookup_tag(info->revs->repo, &obj->oid);
> + if (!tag)
> + break;
Same here as previous comments, is this an error that we should rather
report?
[snip]
> + if (tagged_blobs) {
> + if (tagged_blobs->oids.nr) {
> + const char *tagged_blob_path = "/tagged-blobs";
> + tagged_blobs->type = OBJ_BLOB;
> + push_to_stack(ctx, tagged_blob_path);
> + strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
> + } else {
> + oid_array_clear(&tagged_blobs->oids);
> + free(tagged_blobs);
> + }
> + }
> + if (tags) {
> + if (tags->oids.nr) {
> + const char *tag_path = "/tags";
> + tags->type = OBJ_TAG;
> + push_to_stack(ctx, tag_path);
> + strmap_put(&ctx->paths_to_lists, tag_path, tags);
> + } else {
> + oid_array_clear(&tags->oids);
> + free(tags);
> + }
> + }
> +}
So this is kind of curious. Does that mean that a file named
"tagged-blobs" would be thrown into the same bag as a tagged blob? Or
are these special due to the leading "/"?
Patrick
* Re: [PATCH v3 5/7] path-walk: visit tags and cached objects
2024-12-13 11:58 ` Patrick Steinhardt
@ 2024-12-18 14:23 ` Derrick Stolee
0 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee @ 2024-12-18 14:23 UTC (permalink / raw)
To: Patrick Steinhardt, Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak
On 12/13/24 6:58 AM, Patrick Steinhardt wrote:
> On Fri, Dec 06, 2024 at 07:45:56PM +0000, Derrick Stolee via GitGitGadget wrote:
>> @@ -194,6 +201,134 @@ static void clear_strmap(struct strmap *map)
>> + /* Navigate annotated tag object chains. */
>> + while (obj->type == OBJ_TAG) {
>> + struct tag *tag = lookup_tag(info->revs->repo, &obj->oid);
>> + if (!tag)
>> + break;
>
> Same here as previous comments, is this an error that we should rather
> report?
Can do.
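To make the cases concrete, here is a rough sketch of what the quoted
loop does today. It uses toy types rather than git's real 'struct object'
and 'struct tag' (the names toy_object and toy_peel are made up for
illustration), and it returns NULL in the places where the patch breaks
out of the loop and then skips the pending entry:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for git's object types; not the real structs. */
enum toy_type { TOY_COMMIT, TOY_TREE, TOY_BLOB, TOY_TAG };

struct toy_object {
	enum toy_type type;
	int seen;                  /* plays the role of the SEEN flag */
	struct toy_object *tagged; /* target; only meaningful for TOY_TAG */
};

/*
 * Follow a chain of annotated tags starting at 'obj'. Each tag that is
 * visited for the first time is marked as seen and counted in '*tags_nr'.
 *
 * Returns the first non-tag object, or NULL when the chain runs into a
 * tag that was already seen (its target was handled earlier) or into a
 * tag without a target.
 */
static struct toy_object *toy_peel(struct toy_object *obj, size_t *tags_nr)
{
	while (obj && obj->type == TOY_TAG) {
		if (obj->seen)
			return NULL;
		obj->seen = 1;
		(*tags_nr)++;
		obj = obj->tagged;
	}
	return obj;
}
```

With a two-level chain such as second.2 -> second.1 -> commit, the first
call counts both tags and yields the commit; a later call starting at
second.1 stops right away because that tag is already marked.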
> [snip]
>> + if (tagged_blobs) {
>> + if (tagged_blobs->oids.nr) {
>> + const char *tagged_blob_path = "/tagged-blobs";
>> + tagged_blobs->type = OBJ_BLOB;
>> + push_to_stack(ctx, tagged_blob_path);
>> + strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
>> + } else {
>> + oid_array_clear(&tagged_blobs->oids);
>> + free(tagged_blobs);
>> + }
>> + }
>> + if (tags) {
>> + if (tags->oids.nr) {
>> + const char *tag_path = "/tags";
>> + tags->type = OBJ_TAG;
>> + push_to_stack(ctx, tag_path);
>> + strmap_put(&ctx->paths_to_lists, tag_path, tags);
>> + } else {
>> + oid_array_clear(&tags->oids);
>> + free(tags);
>> + }
>> + }
>> +}
>
> So this is kind of curious. Does that mean that a file named
> "tagged-blobs" would be thrown into the same bag as a tagged blob? Or
> are these special due to the leading "/"?
Indeed, the leading "/" differentiates these categories from other paths
that are stored in the process.
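As a small illustration of why the keys cannot collide, here is a sketch
of the naming convention, assuming (as the series does) that tree keys
end in "/", blob keys do not, the root tree is the empty string, and
paths taken from trees or the index never start with "/". The enum and
the function name toy_classify_key are hypothetical and do not exist in
path-walk.c:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical categories for keys stored in the paths_to_lists strmap. */
enum toy_key_kind {
	KEY_ROOT_TREE, /* ""                      */
	KEY_SPECIAL,   /* "/tags", "/tagged-blobs" */
	KEY_TREE,      /* "right/"                */
	KEY_BLOB       /* "right/c"               */
};

/* Classify a key purely by its first and last characters. */
static enum toy_key_kind toy_classify_key(const char *key)
{
	size_t len = strlen(key);

	if (!len)
		return KEY_ROOT_TREE;
	if (key[0] == '/')
		return KEY_SPECIAL;
	if (key[len - 1] == '/')
		return KEY_TREE;
	return KEY_BLOB;
}
```

Under these assumptions a file named "tagged-blobs" at the root gets the
key "tagged-blobs" and is classified as a blob path, while the batch of
tagged blobs uses "/tagged-blobs" and lands in the special category.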
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v3 6/7] path-walk: mark trees and blobs as UNINTERESTING
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
` (4 preceding siblings ...)
2024-12-06 19:45 ` [PATCH v3 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
@ 2024-12-06 19:45 ` Derrick Stolee via GitGitGadget
2024-12-06 19:45 ` [PATCH v3 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget
` (2 subsequent siblings)
8 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 19:45 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
When the input rev_info has UNINTERESTING starting points, we want to be
sure that the UNINTERESTING flag is passed appropriately through the
objects. To match how this is done in places such as 'git pack-objects', we
use the mark_edges_uninteresting() method.
This method has an option for using the "sparse" walk, which is similar in
spirit to the path-walk API's walk. To be sure to keep it independent, add a
new 'prune_all_uninteresting' option to the path_walk_info struct.
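The rule behind the option can be sketched as follows: a path is skipped
only when every object collected for it carries the UNINTERESTING flag.
This is a simplified model, not the code in the patch; the flag value
TOY_UNINTERESTING and the function toy_should_prune are made up for
illustration, and the real implementation additionally caches a
'maybe_interesting' hint per list:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit; git's real UNINTERESTING lives in revision.h. */
#define TOY_UNINTERESTING (1u << 1)

/*
 * Return 1 when all 'nr' objects in a batch are marked uninteresting,
 * meaning the whole path can be skipped. A single interesting object is
 * enough to keep the path. An empty batch also returns 1 here; the patch
 * skips empty lists before this check is ever reached.
 */
static int toy_should_prune(const unsigned *flags, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (!(flags[i] & TOY_UNINTERESTING))
			return 0;
	return 1;
}
```

In the 'topic, not base, boundary' case, "right/c" holds one interesting
and one uninteresting blob and so would be kept, while "left/b" holds
only an uninteresting blob and would be pruned.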
To check how the UNINTERESTING flag is spread through our objects, extend the
'test-tool path-walk' command to output whether or not an object has that
flag. This changes our tests significantly, including the removal of some
objects that were previously visited due to the incomplete implementation.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 8 +++
path-walk.c | 73 +++++++++++++++++++++
path-walk.h | 8 +++
t/helper/test-path-walk.c | 12 +++-
t/t6601-path-walk.sh | 79 +++++++++++++++++------
5 files changed, 158 insertions(+), 22 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 6022c381b7c..7075d0d5ab5 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -48,6 +48,14 @@ commits.
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.
+`prune_all_uninteresting`::
+ By default, all reachable paths are emitted by the path-walk API.
+ This option allows consumers to declare that they are not
+ interested in paths where all included objects are marked with the
+ `UNINTERESTING` flag. This requires using the `boundary` option in
+ the revision walk so that the walk emits commits marked with the
+ `UNINTERESTING` flag.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index a1f539dcd46..896ec0c4779 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -8,6 +8,7 @@
#include "dir.h"
#include "hashmap.h"
#include "hex.h"
+#include "list-objects.h"
#include "object.h"
#include "oid-array.h"
#include "revision.h"
@@ -24,6 +25,7 @@ struct type_and_oid_list
{
enum object_type type;
struct oid_array oids;
+ int maybe_interesting;
};
#define TYPE_AND_OID_LIST_INIT { \
@@ -140,6 +142,9 @@ static int add_children(struct path_walk_context *ctx,
if (o->flags & SEEN)
continue;
o->flags |= SEEN;
+
+ if (!(o->flags & UNINTERESTING))
+ list->maybe_interesting = 1;
oid_array_append(&list->oids, &entry.oid);
}
@@ -167,6 +172,43 @@ static int walk_path(struct path_walk_context *ctx,
if (!list->oids.nr)
return 0;
+ if (ctx->info->prune_all_uninteresting) {
+ /*
+ * This is true if all objects were UNINTERESTING
+ * when added to the list.
+ */
+ if (!list->maybe_interesting)
+ return 0;
+
+ /*
+ * But it's still possible that the objects were set
+ * as UNINTERESTING after being added. Do a quick check.
+ */
+ list->maybe_interesting = 0;
+ for (size_t i = 0;
+ !list->maybe_interesting && i < list->oids.nr;
+ i++) {
+ if (list->type == OBJ_TREE) {
+ struct tree *t = lookup_tree(ctx->repo,
+ &list->oids.oid[i]);
+ if (t && !(t->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else if (list->type == OBJ_BLOB) {
+ struct blob *b = lookup_blob(ctx->repo,
+ &list->oids.oid[i]);
+ if (b && !(b->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else {
+ /* Tags are always interesting if visited. */
+ list->maybe_interesting = 1;
+ }
+ }
+
+ /* We have confirmed that all objects are UNINTERESTING. */
+ if (!list->maybe_interesting)
+ return 0;
+ }
+
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
(list->type == OBJ_BLOB && ctx->info->blobs) ||
@@ -201,6 +243,26 @@ static void clear_strmap(struct strmap *map)
strmap_init(map);
}
+static struct repository *edge_repo;
+static struct type_and_oid_list *edge_tree_list;
+
+static void show_edge(struct commit *commit)
+{
+ struct tree *t = repo_get_commit_tree(edge_repo, commit);
+
+ if (!t)
+ return;
+
+ if (commit->object.flags & UNINTERESTING)
+ t->object.flags |= UNINTERESTING;
+
+ if (t->object.flags & SEEN)
+ return;
+ t->object.flags |= SEEN;
+
+ oid_array_append(&edge_tree_list->oids, &t->object.oid);
+}
+
static void setup_pending_objects(struct path_walk_info *info,
struct path_walk_context *ctx)
{
@@ -309,6 +371,7 @@ static void setup_pending_objects(struct path_walk_info *info,
if (tagged_blobs->oids.nr) {
const char *tagged_blob_path = "/tagged-blobs";
tagged_blobs->type = OBJ_BLOB;
+ tagged_blobs->maybe_interesting = 1;
push_to_stack(ctx, tagged_blob_path);
strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
} else {
@@ -320,6 +383,7 @@ static void setup_pending_objects(struct path_walk_info *info,
if (tags->oids.nr) {
const char *tag_path = "/tags";
tags->type = OBJ_TAG;
+ tags->maybe_interesting = 1;
push_to_stack(ctx, tag_path);
strmap_put(&ctx->paths_to_lists, tag_path, tags);
} else {
@@ -362,6 +426,7 @@ int walk_objects_by_path(struct path_walk_info *info)
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
+ root_tree_list->maybe_interesting = 1;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
push_to_stack(&ctx, root_path);
@@ -375,6 +440,14 @@ int walk_objects_by_path(struct path_walk_info *info)
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ /* Walk trees to mark them as UNINTERESTING. */
+ edge_repo = info->revs->repo;
+ edge_tree_list = root_tree_list;
+ mark_edges_uninteresting(info->revs, show_edge,
+ info->prune_all_uninteresting);
+ edge_repo = NULL;
+ edge_tree_list = NULL;
+
info->revs->blob_objects = info->revs->tree_objects = 0;
trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index ca839f873e4..de0db007dc9 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -39,6 +39,14 @@ struct path_walk_info {
int trees;
int blobs;
int tags;
+
+ /**
+ * When 'prune_all_uninteresting' is set and a path has all objects
+ * marked as UNINTERESTING, then the path-walk will not visit those
+ * objects. It will not call path_fn on those objects and will not
+ * walk the children of such trees.
+ */
+ int prune_all_uninteresting;
};
#define PATH_WALK_INFO_INIT { \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 56289859e69..7f2d409c5bc 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -50,10 +50,14 @@ static int emit_block(const char *path, struct oid_array *oids,
printf("%"PRIuMAX":%s:%s:EMPTY\n",
tdata->batch_nr, typestr, path);
- for (size_t i = 0; i < oids->nr; i++)
- printf("%"PRIuMAX":%s:%s:%s\n",
+ for (size_t i = 0; i < oids->nr; i++) {
+ struct object *o = lookup_unknown_object(the_repository,
+ &oids->oid[i]);
+ printf("%"PRIuMAX":%s:%s:%s%s\n",
tdata->batch_nr, typestr, path,
- oid_to_hex(&oids->oid[i]));
+ oid_to_hex(&oids->oid[i]),
+ o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
+ }
tdata->batch_nr++;
return 0;
@@ -74,6 +78,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
+ OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
+ N_("toggle pruning of uninteresting paths")),
OPT_END(),
};
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 1f3d2e0cb76..a317cdf289e 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -211,11 +211,11 @@ test_expect_success 'topic, not base' '
0:commit::$(git rev-parse topic)
1:tree::$(git rev-parse topic^{tree})
2:tree:right/:$(git rev-parse topic:right)
- 3:blob:right/d:$(git rev-parse topic:right/d)
+ 3:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse topic:left)
- 6:blob:left/b:$(git rev-parse topic:left/b)
- 7:blob:a:$(git rev-parse topic:a)
+ 5:tree:left/:$(git rev-parse topic:left):UNINTERESTING
+ 6:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 7:blob:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
commits:1
tags:0
@@ -225,15 +225,38 @@ test_expect_success 'topic, not base' '
test_cmp_sorted expect out
'
+test_expect_success 'fourth, blob-tag2, not base' '
+ test-tool path-walk -- fourth blob-tag2 --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 1:tag:/tags:$(git rev-parse fourth)
+ 2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ 3:tree::$(git rev-parse topic^{tree})
+ 4:tree:right/:$(git rev-parse topic:right)
+ 5:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ 6:blob:right/c:$(git rev-parse topic:right/c)
+ 7:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 8:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 9:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ blobs:5
+ commits:1
+ tags:1
+ trees:3
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_expect_success 'topic, not base, only blobs' '
test-tool path-walk --no-trees --no-commits \
-- topic --not base >out &&
cat >expect <<-EOF &&
- 0:blob:right/d:$(git rev-parse topic:right/d)
+ 0:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
1:blob:right/c:$(git rev-parse topic:right/c)
- 2:blob:left/b:$(git rev-parse topic:left/b)
- 3:blob:a:$(git rev-parse topic:a)
+ 2:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 3:blob:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
commits:0
tags:0
@@ -267,7 +290,7 @@ test_expect_success 'topic, not base, only trees' '
cat >expect <<-EOF &&
0:tree::$(git rev-parse topic^{tree})
1:tree:right/:$(git rev-parse topic:right)
- 2:tree:left/:$(git rev-parse topic:left)
+ 2:tree:left/:$(git rev-parse topic:left):UNINTERESTING
commits:0
blobs:0
tags:0
@@ -282,17 +305,17 @@ test_expect_success 'topic, not base, boundary' '
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
- 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~1):UNINTERESTING
1:tree::$(git rev-parse topic^{tree})
- 1:tree::$(git rev-parse base~1^{tree})
+ 1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
2:tree:right/:$(git rev-parse topic:right)
- 2:tree:right/:$(git rev-parse base~1:right)
- 3:blob:right/d:$(git rev-parse base~1:right/d)
- 4:blob:right/c:$(git rev-parse base~1:right/c)
+ 2:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
+ 3:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ 4:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse base~1:left)
- 6:blob:left/b:$(git rev-parse base~1:left/b)
- 7:blob:a:$(git rev-parse base~1:a)
+ 5:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 6:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 7:blob:a:$(git rev-parse base~1:a):UNINTERESTING
blobs:5
commits:2
tags:0
@@ -302,6 +325,27 @@ test_expect_success 'topic, not base, boundary' '
test_cmp_sorted expect out
'
+test_expect_success 'topic, not base, boundary with pruning' '
+ test-tool path-walk --prune -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base~1):UNINTERESTING
+ 1:tree::$(git rev-parse topic^{tree})
+ 1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
+ 2:tree:right/:$(git rev-parse topic:right)
+ 2:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
+ 3:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ blobs:2
+ commits:2
+ tags:0
+ trees:4
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_expect_success 'trees are reported exactly once' '
test_when_finished "rm -rf unique-trees" &&
test_create_repo unique-trees &&
@@ -309,15 +353,12 @@ test_expect_success 'trees are reported exactly once' '
cd unique-trees &&
mkdir initial &&
test_commit initial/file &&
-
git switch -c move-to-top &&
git mv initial/file.t ./ &&
test_tick &&
git commit -m moved &&
-
git update-ref refs/heads/other HEAD
) &&
-
test-tool -C unique-trees path-walk -- --all >out &&
tree=$(git -C unique-trees rev-parse HEAD:) &&
grep "$tree" out >out-filtered &&
--
gitgitgadget
* [PATCH v3 7/7] path-walk: reorder object visits
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
` (5 preceding siblings ...)
2024-12-06 19:45 ` [PATCH v3 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
@ 2024-12-06 19:45 ` Derrick Stolee via GitGitGadget
2024-12-13 11:58 ` [PATCH v3 0/7] PATH WALK I: The path-walk API Patrick Steinhardt
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
8 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-06 19:45 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The path-walk API currently uses a stack-based approach to recursing
through the list of paths within the repository. This guarantees that
after a tree path is explored, all paths contained within that tree path
will be explored before continuing to explore siblings of that tree
path.
The initial motivation of this depth-first approach was to minimize
memory pressure while exploring the repository. A breadth-first approach
would have too many "active" paths being stored in the paths_to_lists
map.
We can take this approach one step further by making sure that blob
paths are visited before tree paths. This allows the API to free the
memory for these blob objects before continuing to perform the
depth-first search. This modifies the order in which we visit siblings,
but does not change the fact that we are performing depth-first search.
To achieve this goal, use a priority queue with a custom sorting method.
The sort needs to handle tags, blobs, and trees (commits are handled
slightly differently). When objects share a type, we can sort by path
name. This ensures that children of the path that most recently left the
stack are preferred over the rest of the paths in the stack, since they
agree in prefix up to and including a directory separator. When the types are
different, we can prefer tags over other types and blobs over trees.
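As a rough illustration of the ordering described above (not the code from
this patch), the following standalone C sketch sorts a few entries with the
same preference: tags first, then blobs, then trees, with ties broken by
path name. The names sketch_type, sketch_entry, sketch_rank, sketch_compare
and sketch_sort are hypothetical, and the sketch assumes qsort() on a fixed
array instead of Git's prio_queue, so it only shows the resulting order for
a fixed set of paths.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for OBJ_TAG, OBJ_BLOB and OBJ_TREE. */
enum sketch_type { SKETCH_TAG, SKETCH_BLOB, SKETCH_TREE };

struct sketch_entry {
	enum sketch_type type;
	const char *path;
};

/* Lower rank is visited earlier: tags, then blobs, then trees. */
static int sketch_rank(enum sketch_type type)
{
	switch (type) {
	case SKETCH_TAG:
		return 0;
	case SKETCH_BLOB:
		return 1;
	default:
		return 2;
	}
}

/* qsort()-compatible comparator: type rank first, path name second. */
static int sketch_compare(const void *one, const void *two)
{
	const struct sketch_entry *a = one;
	const struct sketch_entry *b = two;
	int rank_a = sketch_rank(a->type);
	int rank_b = sketch_rank(b->type);

	if (rank_a != rank_b)
		return rank_a < rank_b ? -1 : 1;
	return strcmp(a->path, b->path);
}

/* Sort 'nr' entries into the visit order sketched above. */
static void sketch_sort(struct sketch_entry *entries, size_t nr)
{
	qsort(entries, nr, sizeof(*entries), sketch_compare);
}
```

With this comparator, a tag list at "/tags" sorts before a blob path "a",
which in turn sorts before the tree paths "left/" and "right/". In the real
walk, paths are pushed while others are popped, so the queue interleaves
these decisions instead of sorting once.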
This causes significant adjustments to t6601-path-walk.sh to rearrange
the order of the visited paths.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
path-walk.c | 60 ++++++++++++++++-----
t/t6601-path-walk.sh | 122 +++++++++++++++++++++----------------------
2 files changed, 109 insertions(+), 73 deletions(-)
diff --git a/path-walk.c b/path-walk.c
index 896ec0c4779..b31924df52e 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -11,6 +11,7 @@
#include "list-objects.h"
#include "object.h"
#include "oid-array.h"
+#include "prio-queue.h"
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
@@ -50,16 +51,50 @@ struct path_walk_context {
struct strmap paths_to_lists;
/**
- * Store the current list of paths in a stack, to
- * facilitate depth-first-search without recursion.
+ * Store the current list of paths in a priority queue,
+ * using object type as a sorting mechanism, mostly to
+ * make sure blobs are popped off the stack first. No
+ * other sort is made, so within each object type it acts
+ * like a stack and performs a DFS within the trees.
*
* Use path_stack_pushed to indicate whether a path
* was previously added to path_stack.
*/
- struct string_list path_stack;
+ struct prio_queue path_stack;
struct strset path_stack_pushed;
};
+static int compare_by_type(const void *one, const void *two, void *cb_data)
+{
+ struct type_and_oid_list *list1, *list2;
+ const char *str1 = one;
+ const char *str2 = two;
+ struct path_walk_context *ctx = cb_data;
+
+ list1 = strmap_get(&ctx->paths_to_lists, str1);
+ list2 = strmap_get(&ctx->paths_to_lists, str2);
+
+ /*
+ * If object types are equal, then use path comparison.
+ */
+ if (!list1 || !list2 || list1->type == list2->type)
+ return strcmp(str1, str2);
+
+ /* Prefer tags to be popped off first. */
+ if (list1->type == OBJ_TAG)
+ return -1;
+ if (list2->type == OBJ_TAG)
+ return 1;
+
+ /* Prefer blobs to be popped off second. */
+ if (list1->type == OBJ_BLOB)
+ return -1;
+ if (list2->type == OBJ_BLOB)
+ return 1;
+
+ return 0;
+}
+
static void push_to_stack(struct path_walk_context *ctx,
const char *path)
{
@@ -67,7 +102,7 @@ static void push_to_stack(struct path_walk_context *ctx,
return;
strset_add(&ctx->path_stack_pushed, path);
- string_list_append(&ctx->path_stack, path);
+ prio_queue_put(&ctx->path_stack, xstrdup(path));
}
static int add_children(struct path_walk_context *ctx,
@@ -372,8 +407,8 @@ static void setup_pending_objects(struct path_walk_info *info,
const char *tagged_blob_path = "/tagged-blobs";
tagged_blobs->type = OBJ_BLOB;
tagged_blobs->maybe_interesting = 1;
- push_to_stack(ctx, tagged_blob_path);
strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
+ push_to_stack(ctx, tagged_blob_path);
} else {
oid_array_clear(&tagged_blobs->oids);
free(tagged_blobs);
@@ -384,8 +419,8 @@ static void setup_pending_objects(struct path_walk_info *info,
const char *tag_path = "/tags";
tags->type = OBJ_TAG;
tags->maybe_interesting = 1;
- push_to_stack(ctx, tag_path);
strmap_put(&ctx->paths_to_lists, tag_path, tags);
+ push_to_stack(ctx, tag_path);
} else {
oid_array_clear(&tags->oids);
free(tags);
@@ -410,7 +445,10 @@ int walk_objects_by_path(struct path_walk_info *info)
.repo = info->revs->repo,
.revs = info->revs,
.info = info,
- .path_stack = STRING_LIST_INIT_DUP,
+ .path_stack = {
+ .compare = compare_by_type,
+ .cb_data = &ctx
+ },
.path_stack_pushed = STRSET_INIT,
.paths_to_lists = STRMAP_INIT
};
@@ -493,8 +531,7 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
while (!ret && ctx.path_stack.nr) {
- char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
- ctx.path_stack.nr--;
+ char *path = prio_queue_get(&ctx.path_stack);
paths_nr++;
ret = walk_path(&ctx, path);
@@ -511,8 +548,7 @@ int walk_objects_by_path(struct path_walk_info *info)
push_to_stack(&ctx, entry->key);
while (!ret && ctx.path_stack.nr) {
- char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
- ctx.path_stack.nr--;
+ char *path = prio_queue_get(&ctx.path_stack);
paths_nr++;
ret = walk_path(&ctx, path);
@@ -526,6 +562,6 @@ int walk_objects_by_path(struct path_walk_info *info)
clear_strmap(&ctx.paths_to_lists);
strset_clear(&ctx.path_stack_pushed);
- string_list_clear(&ctx.path_stack, 0);
+ clear_prio_queue(&ctx.path_stack);
return ret;
}
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index a317cdf289e..7d765ffe907 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -84,20 +84,20 @@ test_expect_success 'all' '
3:tree::$(git rev-parse refs/tags/tree-tag^{})
3:tree::$(git rev-parse refs/tags/tree-tag2^{})
4:blob:a:$(git rev-parse base~2:a)
- 5:tree:right/:$(git rev-parse topic:right)
- 5:tree:right/:$(git rev-parse base~1:right)
- 5:tree:right/:$(git rev-parse base~2:right)
- 6:blob:right/d:$(git rev-parse base~1:right/d)
- 7:blob:right/c:$(git rev-parse base~2:right/c)
- 7:blob:right/c:$(git rev-parse topic:right/c)
- 8:tree:left/:$(git rev-parse base:left)
- 8:tree:left/:$(git rev-parse base~2:left)
- 9:blob:left/b:$(git rev-parse base~2:left/b)
- 9:blob:left/b:$(git rev-parse base:left/b)
- 10:tree:a/:$(git rev-parse base:a)
- 11:blob:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
- 12:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
- 13:blob:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+ 5:blob:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
+ 6:tree:a/:$(git rev-parse base:a)
+ 7:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
+ 8:blob:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+ 9:tree:left/:$(git rev-parse base:left)
+ 9:tree:left/:$(git rev-parse base~2:left)
+ 10:blob:left/b:$(git rev-parse base~2:left/b)
+ 10:blob:left/b:$(git rev-parse base:left/b)
+ 11:tree:right/:$(git rev-parse topic:right)
+ 11:tree:right/:$(git rev-parse base~1:right)
+ 11:tree:right/:$(git rev-parse base~2:right)
+ 12:blob:right/c:$(git rev-parse base~2:right/c)
+ 12:blob:right/c:$(git rev-parse topic:right/c)
+ 13:blob:right/d:$(git rev-parse base~1:right/d)
blobs:10
commits:4
tags:7
@@ -155,18 +155,18 @@ test_expect_success 'branches and indexed objects mix well' '
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
2:blob:a:$(git rev-parse base~2:a)
- 3:tree:right/:$(git rev-parse topic:right)
- 3:tree:right/:$(git rev-parse base~1:right)
- 3:tree:right/:$(git rev-parse base~2:right)
- 4:blob:right/d:$(git rev-parse base~1:right/d)
- 4:blob:right/d:$(git rev-parse :right/d)
- 5:blob:right/c:$(git rev-parse base~2:right/c)
- 5:blob:right/c:$(git rev-parse topic:right/c)
- 6:tree:left/:$(git rev-parse base:left)
- 6:tree:left/:$(git rev-parse base~2:left)
- 7:blob:left/b:$(git rev-parse base:left/b)
- 7:blob:left/b:$(git rev-parse base~2:left/b)
- 8:tree:a/:$(git rev-parse refs/tags/third:a)
+ 3:tree:a/:$(git rev-parse refs/tags/third:a)
+ 4:tree:left/:$(git rev-parse base:left)
+ 4:tree:left/:$(git rev-parse base~2:left)
+ 5:blob:left/b:$(git rev-parse base:left/b)
+ 5:blob:left/b:$(git rev-parse base~2:left/b)
+ 6:tree:right/:$(git rev-parse topic:right)
+ 6:tree:right/:$(git rev-parse base~1:right)
+ 6:tree:right/:$(git rev-parse base~2:right)
+ 7:blob:right/c:$(git rev-parse base~2:right/c)
+ 7:blob:right/c:$(git rev-parse topic:right/c)
+ 8:blob:right/d:$(git rev-parse base~1:right/d)
+ 8:blob:right/d:$(git rev-parse :right/d)
blobs:7
commits:4
tags:0
@@ -186,15 +186,15 @@ test_expect_success 'topic only' '
1:tree::$(git rev-parse topic^{tree})
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
- 2:tree:right/:$(git rev-parse topic:right)
- 2:tree:right/:$(git rev-parse base~1:right)
- 2:tree:right/:$(git rev-parse base~2:right)
- 3:blob:right/d:$(git rev-parse base~1:right/d)
- 4:blob:right/c:$(git rev-parse base~2:right/c)
- 4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse base~2:left)
- 6:blob:left/b:$(git rev-parse base~2:left/b)
- 7:blob:a:$(git rev-parse base~2:a)
+ 2:blob:a:$(git rev-parse base~2:a)
+ 3:tree:left/:$(git rev-parse base~2:left)
+ 4:blob:left/b:$(git rev-parse base~2:left/b)
+ 5:tree:right/:$(git rev-parse topic:right)
+ 5:tree:right/:$(git rev-parse base~1:right)
+ 5:tree:right/:$(git rev-parse base~2:right)
+ 6:blob:right/c:$(git rev-parse base~2:right/c)
+ 6:blob:right/c:$(git rev-parse topic:right/c)
+ 7:blob:right/d:$(git rev-parse base~1:right/d)
blobs:5
commits:3
tags:0
@@ -210,12 +210,12 @@ test_expect_success 'topic, not base' '
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
1:tree::$(git rev-parse topic^{tree})
- 2:tree:right/:$(git rev-parse topic:right)
- 3:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
- 4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse topic:left):UNINTERESTING
- 6:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
- 7:blob:a:$(git rev-parse topic:a):UNINTERESTING
+ 2:blob:a:$(git rev-parse topic:a):UNINTERESTING
+ 3:tree:left/:$(git rev-parse topic:left):UNINTERESTING
+ 4:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 5:tree:right/:$(git rev-parse topic:right)
+ 6:blob:right/c:$(git rev-parse topic:right/c)
+ 7:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
blobs:4
commits:1
tags:0
@@ -233,12 +233,12 @@ test_expect_success 'fourth, blob-tag2, not base' '
1:tag:/tags:$(git rev-parse fourth)
2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
3:tree::$(git rev-parse topic^{tree})
- 4:tree:right/:$(git rev-parse topic:right)
- 5:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
- 6:blob:right/c:$(git rev-parse topic:right/c)
- 7:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
- 8:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
- 9:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ 4:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ 5:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 6:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 7:tree:right/:$(git rev-parse topic:right)
+ 8:blob:right/c:$(git rev-parse topic:right/c)
+ 9:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:5
commits:1
tags:1
@@ -253,10 +253,10 @@ test_expect_success 'topic, not base, only blobs' '
-- topic --not base >out &&
cat >expect <<-EOF &&
- 0:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
- 1:blob:right/c:$(git rev-parse topic:right/c)
- 2:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
- 3:blob:a:$(git rev-parse topic:a):UNINTERESTING
+ 0:blob:a:$(git rev-parse topic:a):UNINTERESTING
+ 1:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 2:blob:right/c:$(git rev-parse topic:right/c)
+ 3:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
blobs:4
commits:0
tags:0
@@ -289,8 +289,8 @@ test_expect_success 'topic, not base, only trees' '
cat >expect <<-EOF &&
0:tree::$(git rev-parse topic^{tree})
- 1:tree:right/:$(git rev-parse topic:right)
- 2:tree:left/:$(git rev-parse topic:left):UNINTERESTING
+ 1:tree:left/:$(git rev-parse topic:left):UNINTERESTING
+ 2:tree:right/:$(git rev-parse topic:right)
commits:0
blobs:0
tags:0
@@ -308,14 +308,14 @@ test_expect_success 'topic, not base, boundary' '
0:commit::$(git rev-parse base~1):UNINTERESTING
1:tree::$(git rev-parse topic^{tree})
1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
- 2:tree:right/:$(git rev-parse topic:right)
- 2:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
- 3:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
- 4:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
- 4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
- 6:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
- 7:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ 2:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ 3:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 4:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 5:tree:right/:$(git rev-parse topic:right)
+ 5:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
+ 6:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ 6:blob:right/c:$(git rev-parse topic:right/c)
+ 7:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:5
commits:2
tags:0
--
gitgitgadget
* Re: [PATCH v3 0/7] PATH WALK I: The path-walk API
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
` (6 preceding siblings ...)
2024-12-06 19:45 ` [PATCH v3 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget
@ 2024-12-13 11:58 ` Patrick Steinhardt
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
8 siblings, 0 replies; 67+ messages in thread
From: Patrick Steinhardt @ 2024-12-13 11:58 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee
On Fri, Dec 06, 2024 at 07:45:51PM +0000, Derrick Stolee via GitGitGadget wrote:
>
> Introduction and relation to prior series
> =========================================
Sorry for being late to the party -- I wanted to review this series a
lot earlier, but never really found the time to do so. The patches
mostly look good to me. I've got a couple of nits on the first patch and
think that the error handling could be stricter so that we don't ignore
any kind of data corruption during the walk. But other than that I'm
looking forward to the usecases this API will enable.
Thanks!
Patrick
* [PATCH v4 0/7] PATH WALK I: The path-walk API
2024-12-06 19:45 ` [PATCH v3 0/7] " Derrick Stolee via GitGitGadget
` (7 preceding siblings ...)
2024-12-13 11:58 ` [PATCH v3 0/7] PATH WALK I: The path-walk API Patrick Steinhardt
@ 2024-12-20 16:21 ` Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
` (6 more replies)
8 siblings, 7 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:21 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee
Introduction and relation to prior series
=========================================
This is a new series that rerolls the initial "path-walk API" patches of my
RFC [1] "Path-walk API and applications". This new API (in path-walk.c and
path-walk.h) presents a new way to walk objects such that trees and blobs
are walked in batches according to their path.
This also replaces the previous version of ds/path-walk that was being
reviewed in [2]. The consensus was that the series was too long/dense and
could use some reduction in size. This series takes the first few patches,
but also makes some updates (which will be described later).
[1]
https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
[RFC] Path-walk API and applications
[2]
https://lore.kernel.org/git/pull.1813.v2.git.1729431810.gitgitgadget@gmail.com/
[PATCH v2 00/17] pack-objects: add --path-walk option for better deltas
This series only introduces the path-walk API, but does so to the full
complexity required to later add the integration with git pack-objects to
improve packing compression in both time and space for repositories with
many name hash collisions. The compression also at least improves for other
repositories, but may not always have an improvement in time.
Some of the changes that are present in this series that differ from the
previous version are motivated directly by discoveries made by testing the
feature in Git for Windows and microsoft/git forks that shipped these
features for fast delivery of these improvements to users who needed them.
That testing across many environments informed some things that needed to be
changed, and in this series those changes are checked by tests in the
t6601-path-walk.sh test script and the test-tool path-walk test helper.
Thus, the code being introduced in this series is covered by tests even
though it is not integrated into the git executable.
Discussion of follow-up applications
====================================
By splitting this series out into its own, I was able to reorganize the
patches such that each application can be built independently off of this
series. These are available as pending PRs in gitgitgadget/git:
* Better delta compression with 'git pack-objects' [3]: This application
allows an option in 'git pack-objects' to change how objects are walked
in order to group objects with the same path for early delta compression
before using the name hash sort to look for cross-path deltas. This helps
significantly in repositories with many name-hash collisions. This
reduces the size of 'git push' packfiles via a config option and reduces
the total repo size in 'git repack'.
* The 'git backfill' command [4]: This command downloads missing blobs in a
blobless partial clone. In order to save space and network bandwidth, it
assumes that objects at a common path are likely to delta well with each
other, so it downloads missing blobs in batches via the path-walk API.
This presents a way to use blobless clones as a pseudo-resumable clone,
since the initial clone of commits and trees is a smaller initial
download and the batch size allows downloading blobs incrementally. When
pairing this command with the sparse-checkout feature, the path-walk API
is adjusted to focus on the paths within the sparse-checkout. This allows
the user to only download the files they are likely to need when
inspecting history within their scope without downloading the entire
repository history.
* The 'git survey' command [5]. This application begins the work to mimic
the behavior of git-sizer, but to use internal data structures for better
performance and careful understanding of how objects are stored. Using
the path-walk API, paths with many versions can be considered in a batch
and sorted into a list to report the paths that contribute most to the
size of the repository. A version of this command was used to help
confirm the issues with the name hash collisions. It was also used to
diagnose why some repacks using the --path-walk option were taking more
space than without for some repositories. (More on this later.)
Question for reviewers: I am prepped to send these three applications to the
mailing list, but I'll refrain for now to avoid causing too much noise for
folks. Would you like to see them on-list while this series is under review?
Or would you prefer to explore the PRs ([3] [4] and [5])?
[3] https://github.com/gitgitgadget/git/pull/1819
PATH WALK II: Add --path-walk option to 'git pack-objects'
[4] https://github.com/gitgitgadget/git/pull/1820
PATH WALK III: Add 'git backfill' command
[5] https://github.com/gitgitgadget/git/pull/1821
PATH WALK IV: Add 'git survey' command
Structure of the Patch Series
=============================
This patch series attempts to create the simplest version of the API in
patch 1, then build functionality incrementally. During the process, each
change will introduce an update to:
* The path-walk API itself in path-walk.c and path-walk.h.
* The documentation of the API in
Documentation/technical/api-path-walk.txt.
* The test script t/t6601-path-walk.sh.
The core of the API relies on using a 'struct rev_info' to define an initial
set of objects and some form of a commit walk to define what range of
objects to visit. Initially, only a subset of 'struct rev_info' options work
as expected. For example:
* Patch 1 assumes that only commit objects are starting positions, but the
focus is on exploring trees and blobs.
* Patch 3 allows users to specify object types, which includes outputting
the visited commits in a batch.
* Annotated tags and indexed objects are considered in Patch 4. These are
grouped because they both exist within the 'pending' object list.
* UNINTERESTING objects are not considered until Patch 5.
Changes in v1 (since previous version)
======================================
There are a few hard-won learnings from previous versions of this series due
to testing this in the wild with many different repositories.
* Initially, the 'git pack-objects --path-walk' feature was not tested with
the '--shallow' option because it was expected that this option was for
servers creating a pack containing shallow commits. However, this option
is also used when pushing from a shallow clone, and this was a critical
feature that we needed to reduce the size of commits pushed from
automated environments that were bootstrapped by shallow clones. The crux
of the change is in Patch 5 and how UNINTERESTING objects are handled. We
no longer need to push the UNINTERESTING flag around the objects
ourselves and can use existing logic in list-objects.c to do so. This
allows using the --objects-edge-aggressive option when necessary to
reduce the object count when pushing from a shallow clone. (The
pack-objects series expands on tests to cover this integration point.)
* When looking into cases where 'git repack -adf' outperformed 'git repack
-adf --path-walk', I discovered that the issue did not reproduce in a
bare repository. This is due to 'git repack' iterating over all indexed
objects before walking commits. I had inadvertently put all indexed
objects in their own category, leading to no good deltas with previous
versions of those files; I had also not used the 'path' option from the
pending list, so these objects had invalid name hash values. You will see
in patch 4 that the pending list is handled quite differently and the
'--indexed-objects' option is tested directly within t6601.
* I added a new 'test_cmp_sorted' helper because I wanted to simplify some
repeated sections of t6601.
* Patch 1 has significantly more context than it did before.
* Annotated tags are given a name of "/tags" to differentiate them slightly
from root trees and commits.
Changes in v2
=============
* Updated the test helper to output the batch number, allowing us to
confirm that OIDs are grouped appropriately. This also signaled a few
cases where the callback function was being called on an empty set.
* This change has resulted in significant changes to the test data,
including reordered lines and prepended batch numbers.
* Thanks to Patrick for providing a recommended change to remove memory
leaks from the test helper.
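For readers unfamiliar with the helper output, the expected lines in
t6601-path-walk.sh appear to follow the shape
"<batch>:<type>:<path>:<oid>" with an optional ":UNINTERESTING" suffix. The
sketch below only reproduces that shape; sketch_format_line and its
parameters are hypothetical and this is not the code in
t/helper/test-path-walk.c.

```c
#include <stdio.h>

/*
 * Format one line in the "<batch>:<type>:<path>:<oid>" shape seen in the
 * expected test output. Returns the snprintf() result, that is, the
 * number of characters the full line needs (excluding the NUL).
 */
static int sketch_format_line(char *buf, size_t len, int batch,
			      const char *type, const char *path,
			      const char *oid_hex, int uninteresting)
{
	return snprintf(buf, len, "%d:%s:%s:%s%s",
			batch, type, path, oid_hex,
			uninteresting ? ":UNINTERESTING" : "");
}
```

Because every object in a batch shares the same leading number, a sorted
comparison of the output can still confirm which OIDs were grouped together.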
Changes in v3
=============
* Updated test helper to use type_string(), which leads to a change to use
lowercase strings in the test scripts. That will lead to the range-diff
looking pretty terrible.
* Added a new patch that changes the visit order of the path-walk API. The
intention is to reduce memory pressure by emitting blob paths before
recursing into tree paths. This also has the effect of visiting blobs and
trees in lexicographic order instead of the reverse.
Changes in v4
=============
* Several style fixes and function renames.
* Better error handling, avoiding some die() statements.
* Additional BUG() statements for "impossible" scenarios.
* Optimizations around SEEN objects to avoid extra work. This does have
some impact on paths that appear in the index but no other versions are
discovered during the tree walk. This changes a test in t6601: the visit
to the blob path "a" is now delayed to the end.
* The path_walk_info struct now has proper initializers and destructors,
even though the current destructor is empty.
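To illustrate the SEEN optimization mentioned above, here is a minimal
standalone sketch, assuming a simplified object with a flags word. The
names SKETCH_SEEN, sketch_object, sketch_batch and sketch_add_once are
hypothetical and do not exist in path-walk.c; the point is only that the
flag is checked before any per-path work, so an object that appears at
several paths lands in a single batch.

```c
#include <stddef.h>

#define SKETCH_SEEN (1u << 0)

/* Simplified stand-in for an object with a flags word. */
struct sketch_object {
	unsigned flags;
	int id;
};

/* Simplified stand-in for one per-path batch of objects. */
struct sketch_batch {
	int ids[16];
	size_t nr;
};

/*
 * Append 'o' to 'batch' unless it was already seen anywhere.
 * Returns 1 if appended, 0 if skipped, -1 if the batch is full.
 */
static int sketch_add_once(struct sketch_batch *batch,
			   struct sketch_object *o)
{
	/* Check first, so no work is done for repeated objects. */
	if (o->flags & SKETCH_SEEN)
		return 0;
	if (batch->nr >= sizeof(batch->ids) / sizeof(batch->ids[0]))
		return -1;

	o->flags |= SKETCH_SEEN;
	batch->ids[batch->nr++] = o->id;
	return 1;
}
```

Adding the same object to two different batches with this helper appends it
to the first and skips it for the second, which mirrors the "first time
discovered" rule of the API.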
Thanks, -Stolee
Derrick Stolee (7):
path-walk: introduce an object walk by path
test-lib-functions: add test_cmp_sorted
t6601: add helper for testing path-walk API
path-walk: allow consumer to specify object types
path-walk: visit tags and cached objects
path-walk: mark trees and blobs as UNINTERESTING
path-walk: reorder object visits
Documentation/technical/api-path-walk.txt | 63 +++
Makefile | 2 +
path-walk.c | 592 ++++++++++++++++++++++
path-walk.h | 69 +++
t/helper/test-path-walk.c | 112 ++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t6601-path-walk.sh | 368 ++++++++++++++
t/test-lib-functions.sh | 10 +
9 files changed, 1218 insertions(+)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/t6601-path-walk.sh
base-commit: e9356ba3ea2a6754281ff7697b3e5a1697b21e24
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1818%2Fderrickstolee%2Fapi-upstream-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1818/derrickstolee/api-upstream-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/1818
Range-diff vs v3:
1: b7e9b81e8b3 ! 1: 62f7aae477b path-walk: introduce an object walk by path
@@ path-walk.c (new)
+#include "tree.h"
+#include "tree-walk.h"
+
-+struct type_and_oid_list
-+{
++struct type_and_oid_list {
+ enum object_type type;
+ struct oid_array oids;
+};
@@ path-walk.c (new)
+ string_list_append(&ctx->path_stack, path);
+}
+
-+static int add_children(struct path_walk_context *ctx,
-+ const char *base_path,
-+ struct object_id *oid)
++static int add_tree_entries(struct path_walk_context *ctx,
++ const char *base_path,
++ struct object_id *oid)
+{
+ struct tree_desc desc;
+ struct name_entry entry;
@@ path-walk.c (new)
+ oid_to_hex(oid));
+ return -1;
+ } else if (parse_tree_gently(tree, 1)) {
-+ die("bad tree object %s", oid_to_hex(oid));
++ error("bad tree object %s", oid_to_hex(oid));
++ return -1;
+ }
+
+ strbuf_addstr(&path, base_path);
@@ path-walk.c (new)
+ struct blob *child = lookup_blob(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else {
-+ /* Wrong type? */
-+ continue;
++ BUG("invalid type for tree entry: %d", type);
+ }
+
-+ if (!o) /* report error?*/
++ if (!o) {
++ error(_("failed to find object %s"),
++ oid_to_hex(&o->oid));
++ return -1;
++ }
++
++ /* Skip this object if already seen. */
++ if (o->flags & SEEN)
+ continue;
++ o->flags |= SEEN;
+
+ strbuf_setlen(&path, base_len);
+ strbuf_add(&path, entry.path, entry.pathlen);
@@ path-walk.c (new)
+ strmap_put(&ctx->paths_to_lists, path.buf, list);
+ }
+ push_to_stack(ctx, path.buf);
-+
-+ /* Skip this object if already seen. */
-+ if (o->flags & SEEN)
-+ continue;
-+ o->flags |= SEEN;
+ oid_array_append(&list->oids, &entry.oid);
+ }
+
@@ path-walk.c (new)
+ /* Expand data for children. */
+ if (list->type == OBJ_TREE) {
+ for (size_t i = 0; i < list->oids.nr; i++) {
-+ ret |= add_children(ctx,
++ ret |= add_tree_entries(ctx,
+ path,
+ &list->oids.oid[i]);
+ }
@@ path-walk.c (new)
+ return ret;
+}
+
-+static void clear_strmap(struct strmap *map)
++static void clear_paths_to_lists(struct strmap *map)
+{
+ struct hashmap_iter iter;
+ struct strmap_entry *e;
@@ path-walk.c (new)
+ t = lookup_tree(info->revs->repo, oid);
+
+ if (!t) {
-+ warning("could not find tree %s", oid_to_hex(oid));
-+ continue;
++ error("could not find tree %s", oid_to_hex(oid));
++ return -1;
+ }
+
+ if (t->object.flags & SEEN)
@@ path-walk.c (new)
+ trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
+ trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
-+ clear_strmap(&ctx.paths_to_lists);
++ clear_paths_to_lists(&ctx.paths_to_lists);
+ strset_clear(&ctx.path_stack_pushed);
+ string_list_clear(&ctx.path_stack, 0);
+ return ret;
++}
++
++void path_walk_info_init(struct path_walk_info *info)
++{
++ struct path_walk_info empty = PATH_WALK_INFO_INIT;
++ memcpy(info, &empty, sizeof(empty));
++}
++
++void path_walk_info_clear(struct path_walk_info *info UNUSED)
++{
++ /*
++ * This destructor is empty for now, as info->revs
++ * is not owned by 'struct path_walk_info'.
++ */
+}
## path-walk.h (new) ##
@@ path-walk.h (new)
+struct path_walk_info {
+ /**
+ * revs provides the definitions for the commit walk, including
-+ * which commits are UNINTERESTING or not.
++ * which commits are UNINTERESTING or not. This structure is
++ * expected to be owned by the caller.
+ */
+ struct rev_info *revs;
+
@@ path-walk.h (new)
+
+#define PATH_WALK_INFO_INIT { 0 }
+
++void path_walk_info_init(struct path_walk_info *info);
++void path_walk_info_clear(struct path_walk_info *info);
++
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
2: cf2ed61b324 = 2: 8ad2a5e79a2 test-lib-functions: add test_cmp_sorted
3: 54886fcb081 = 3: 2bc0538bce9 t6601: add helper for testing path-walk API
4: 42e71e6285f ! 4: db5c8611230 path-walk: allow consumer to specify object types
@@ Documentation/technical/api-path-walk.txt: It is also important that you do not
## path-walk.c ##
-@@ path-walk.c: static int add_children(struct path_walk_context *ctx,
+@@ path-walk.c: static int add_tree_entries(struct path_walk_context *ctx,
if (S_ISGITLINK(entry.mode))
continue;
@@ path-walk.h: struct path_walk_info {
+ .commits = 1, \
+}
- /**
- * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ void path_walk_info_init(struct path_walk_info *info);
+ void path_walk_info_clear(struct path_walk_info *info);
## t/helper/test-path-walk.c ##
@@ t/helper/test-path-walk.c: static const char * const path_walk_usage[] = {
5: a41f53f7ced ! 5: 6df56f465d7 path-walk: visit tags and cached objects
@@ path-walk.c
+static const char *root_path = "";
+
- struct type_and_oid_list
- {
+ struct type_and_oid_list {
enum object_type type;
+ struct oid_array oids;
@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
ret = ctx->info->path_fn(path, &list->oids, list->type,
ctx->info->path_fn_data);
-@@ path-walk.c: static void clear_strmap(struct strmap *map)
+@@ path-walk.c: static void clear_paths_to_lists(struct strmap *map)
strmap_init(map);
}
-+static void setup_pending_objects(struct path_walk_info *info,
-+ struct path_walk_context *ctx)
++static int setup_pending_objects(struct path_walk_info *info,
++ struct path_walk_context *ctx)
+{
+ struct type_and_oid_list *tags = NULL;
+ struct type_and_oid_list *tagged_blobs = NULL;
@@ path-walk.c: static void clear_strmap(struct strmap *map)
+ /* Navigate annotated tag object chains. */
+ while (obj->type == OBJ_TAG) {
+ struct tag *tag = lookup_tag(info->revs->repo, &obj->oid);
-+ if (!tag)
-+ break;
++ if (!tag) {
++ error(_("failed to find tag %s"),
++ oid_to_hex(&obj->oid));
++ return -1;
++ }
+ if (tag->object.flags & SEEN)
+ break;
+ tag->object.flags |= SEEN;
@@ path-walk.c: static void clear_strmap(struct strmap *map)
+ free(tags);
+ }
+ }
++
++ return 0;
+}
+
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
* call 'info->path_fn' on each discovered path.
-@@ path-walk.c: static void clear_strmap(struct strmap *map)
+@@ path-walk.c: static void clear_paths_to_lists(struct strmap *map)
*/
int walk_objects_by_path(struct path_walk_info *info)
{
- const char *root_path = "";
- int ret = 0;
+- int ret = 0;
++ int ret;
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
+ struct type_and_oid_list *root_tree_list;
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
CALLOC_ARRAY(commit_list, 1);
commit_list->type = OBJ_COMMIT;
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
+ info->revs->blob_objects = info->revs->tree_objects = 0;
+
+ trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
-+ setup_pending_objects(info, &ctx);
++ ret = setup_pending_objects(info, &ctx);
+ trace2_region_leave("path-walk", "pending-walk", info->revs->repo);
++
++ if (ret)
++ return ret;
+
while ((c = get_revision(info->revs))) {
struct object_id *oid;
@@ path-walk.h: struct path_walk_info {
+ .tags = 1, \
}
- /**
+ void path_walk_info_init(struct path_walk_info *info);
## t/helper/test-path-walk.c ##
@@ t/helper/test-path-walk.c: struct path_walk_test_data {
6: 0f1e6c51b2c ! 6: f2ffc32a303 path-walk: mark trees and blobs as UNINTERESTING
@@ path-walk.c
#include "object.h"
#include "oid-array.h"
#include "revision.h"
-@@ path-walk.c: struct type_and_oid_list
- {
+@@ path-walk.c: static const char *root_path = "";
+ struct type_and_oid_list {
enum object_type type;
struct oid_array oids;
+ int maybe_interesting;
};
#define TYPE_AND_OID_LIST_INIT { \
-@@ path-walk.c: static int add_children(struct path_walk_context *ctx,
- if (o->flags & SEEN)
- continue;
- o->flags |= SEEN;
+@@ path-walk.c: static int add_tree_entries(struct path_walk_context *ctx,
+ strmap_put(&ctx->paths_to_lists, path.buf, list);
+ }
+ push_to_stack(ctx, path.buf);
+
+ if (!(o->flags & UNINTERESTING))
+ list->maybe_interesting = 1;
++
oid_array_append(&list->oids, &entry.oid);
}
@@ path-walk.c: static int walk_path(struct path_walk_context *ctx,
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
(list->type == OBJ_BLOB && ctx->info->blobs) ||
-@@ path-walk.c: static void clear_strmap(struct strmap *map)
+@@ path-walk.c: static void clear_paths_to_lists(struct strmap *map)
strmap_init(map);
}
@@ path-walk.c: static void clear_strmap(struct strmap *map)
+ oid_array_append(&edge_tree_list->oids, &t->object.oid);
+}
+
- static void setup_pending_objects(struct path_walk_info *info,
- struct path_walk_context *ctx)
+ static int setup_pending_objects(struct path_walk_info *info,
+ struct path_walk_context *ctx)
{
-@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info,
+@@ path-walk.c: static int setup_pending_objects(struct path_walk_info *info,
if (tagged_blobs->oids.nr) {
const char *tagged_blob_path = "/tagged-blobs";
tagged_blobs->type = OBJ_BLOB;
@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info,
push_to_stack(ctx, tagged_blob_path);
strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
} else {
-@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info,
+@@ path-walk.c: static int setup_pending_objects(struct path_walk_info *info,
if (tags->oids.nr) {
const char *tag_path = "/tags";
tags->type = OBJ_TAG;
7: e716672c041 ! 7: ef543429ed9 path-walk: reorder object visits
@@ path-walk.c: static void push_to_stack(struct path_walk_context *ctx,
+ prio_queue_put(&ctx->path_stack, xstrdup(path));
}
- static int add_children(struct path_walk_context *ctx,
-@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info,
+ static int add_tree_entries(struct path_walk_context *ctx,
+@@ path-walk.c: static int setup_pending_objects(struct path_walk_info *info,
const char *tagged_blob_path = "/tagged-blobs";
tagged_blobs->type = OBJ_BLOB;
tagged_blobs->maybe_interesting = 1;
@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info,
} else {
oid_array_clear(&tagged_blobs->oids);
free(tagged_blobs);
-@@ path-walk.c: static void setup_pending_objects(struct path_walk_info *info,
+@@ path-walk.c: static int setup_pending_objects(struct path_walk_info *info,
const char *tag_path = "/tags";
tags->type = OBJ_TAG;
tags->maybe_interesting = 1;
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
ret = walk_path(&ctx, path);
@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
- clear_strmap(&ctx.paths_to_lists);
+ clear_paths_to_lists(&ctx.paths_to_lists);
strset_clear(&ctx.path_stack_pushed);
- string_list_clear(&ctx.path_stack, 0);
+ clear_prio_queue(&ctx.path_stack);
return ret;
}
+
## t/t6601-path-walk.sh ##
@@ t/t6601-path-walk.sh: test_expect_success 'all' '
@@ t/t6601-path-walk.sh: test_expect_success 'all' '
commits:4
tags:7
@@ t/t6601-path-walk.sh: test_expect_success 'branches and indexed objects mix well' '
+ 1:tree::$(git rev-parse base^{tree})
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
- 2:blob:a:$(git rev-parse base~2:a)
+- 2:blob:a:$(git rev-parse base~2:a)
- 3:tree:right/:$(git rev-parse topic:right)
- 3:tree:right/:$(git rev-parse base~1:right)
- 3:tree:right/:$(git rev-parse base~2:right)
@@ t/t6601-path-walk.sh: test_expect_success 'branches and indexed objects mix well
- 7:blob:left/b:$(git rev-parse base:left/b)
- 7:blob:left/b:$(git rev-parse base~2:left/b)
- 8:tree:a/:$(git rev-parse refs/tags/third:a)
-+ 3:tree:a/:$(git rev-parse refs/tags/third:a)
-+ 4:tree:left/:$(git rev-parse base:left)
-+ 4:tree:left/:$(git rev-parse base~2:left)
-+ 5:blob:left/b:$(git rev-parse base:left/b)
-+ 5:blob:left/b:$(git rev-parse base~2:left/b)
-+ 6:tree:right/:$(git rev-parse topic:right)
-+ 6:tree:right/:$(git rev-parse base~1:right)
-+ 6:tree:right/:$(git rev-parse base~2:right)
-+ 7:blob:right/c:$(git rev-parse base~2:right/c)
-+ 7:blob:right/c:$(git rev-parse topic:right/c)
-+ 8:blob:right/d:$(git rev-parse base~1:right/d)
-+ 8:blob:right/d:$(git rev-parse :right/d)
++ 2:tree:a/:$(git rev-parse refs/tags/third:a)
++ 3:tree:left/:$(git rev-parse base:left)
++ 3:tree:left/:$(git rev-parse base~2:left)
++ 4:blob:left/b:$(git rev-parse base:left/b)
++ 4:blob:left/b:$(git rev-parse base~2:left/b)
++ 5:tree:right/:$(git rev-parse topic:right)
++ 5:tree:right/:$(git rev-parse base~1:right)
++ 5:tree:right/:$(git rev-parse base~2:right)
++ 6:blob:right/c:$(git rev-parse base~2:right/c)
++ 6:blob:right/c:$(git rev-parse topic:right/c)
++ 7:blob:right/d:$(git rev-parse base~1:right/d)
++ 7:blob:right/d:$(git rev-parse :right/d)
++ 8:blob:a:$(git rev-parse base~2:a)
blobs:7
commits:4
tags:0
--
gitgitgadget
* [PATCH v4 1/7] path-walk: introduce an object walk by path
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
@ 2024-12-20 16:21 ` Derrick Stolee via GitGitGadget
2024-12-27 14:18 ` Patrick Steinhardt
2024-12-20 16:21 ` [PATCH v4 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
` (5 subsequent siblings)
6 siblings, 1 reply; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:21 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.
These batches are collected in 'struct type_and_oid_list' objects, which
store an object type and an oid_array of objects.
The data structures are documented in 'struct path_walk_context', but in
summary the most important are:
* 'paths_to_lists' is a strmap that connects a path to a
type_and_oid_list for that path. To avoid conflicts in path names,
we make sure that tree paths end in "/" (except the root path, which
is an empty string) and blob paths do not end in "/".
* 'path_stack' is a string list that is added to in an append-only
way. This stores the stack of our depth-first search on the heap
instead of using recursion.
* 'path_stack_pushed' is a strset that stores path names that were
already added to 'path_stack', to avoid repeating paths in the
stack. Mostly, this saves us from quadratic lookups from doing
unsorted checks into the string_list.
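As a rough illustration of the key convention in the first bullet, here
is a minimal sketch. The helper name make_path_key() and its signature
are hypothetical and do not appear in the patch, which builds the same
keys with a strbuf inside add_tree_entries().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): build the map key for an
 * entry called 'name' found under 'base'. 'base' is either "" (the root
 * path) or a tree path that already ends in "/". Tree keys get a
 * trailing "/" so that a tree and a blob with the same name never share
 * a key.
 *
 * Returns the key length, or -1 if 'out' is too small.
 */
static int make_path_key(char *out, size_t outlen,
			 const char *base, const char *name, int is_tree)
{
	int n;

	assert(out && base && name);
	n = snprintf(out, outlen, "%s%s%s", base, name, is_tree ? "/" : "");
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return n;
}
```

With this convention a tree "a" at the root is keyed as "a/" while a
blob "a" is keyed as "a", and a blob "b" under "left/" is keyed as
"left/b".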
The coupling of 'path_stack' and 'path_stack_pushed' is protected by the
push_to_stack() method. Call this instead of inserting into these
structures directly.
The walk_objects_by_path() method initializes these structures and
starts walking commits from the given rev_info struct. The commits are
used to find the list of root trees which populate the start of our
depth-first search.
The core of our depth-first search is in a while loop that continues
while we have not indicated an early exit and our 'path_stack' still has
entries in it. The loop body pops a path off of the stack and "visits"
the path via the walk_path() method.
The walk_path() method gets the list of OIDs from the 'paths_to_lists'
strmap and executes the callback method on that list with the given path
and type. If the OIDs correspond to tree objects, then iterate over all
trees in the list and run add_tree_entries() to add the child objects to
their own lists, adding new entries to the stack if necessary.
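The push-once stack and the pop-and-visit loop described above can be
sketched with a toy model. Everything here (struct toy_stack,
toy_push(), toy_pop(), struct toy_edge, toy_walk()) is hypothetical and
stands in for the real string_list, strset and tree parsing. In
particular, the linear scan in toy_push() has exactly the quadratic
cost that the real 'path_stack_pushed' set is meant to avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATHS 64

/* Toy stand-in for 'path_stack' plus 'path_stack_pushed'. */
struct toy_stack {
	const char *items[MAX_PATHS];  /* LIFO stack of pending paths */
	size_t nr;
	const char *pushed[MAX_PATHS]; /* every path ever pushed */
	size_t pushed_nr;
};

/* Push 'path' unless it was pushed before; returns 1 if pushed. */
static int toy_push(struct toy_stack *s, const char *path)
{
	for (size_t i = 0; i < s->pushed_nr; i++)
		if (!strcmp(s->pushed[i], path))
			return 0;
	assert(s->pushed_nr < MAX_PATHS && s->nr < MAX_PATHS);
	s->pushed[s->pushed_nr++] = path;
	s->items[s->nr++] = path;
	return 1;
}

/* Pop the most recently pushed path, or NULL when the stack is empty. */
static const char *toy_pop(struct toy_stack *s)
{
	return s->nr ? s->items[--s->nr] : NULL;
}

/* Toy object graph: 'child' is a path discovered while visiting 'parent'. */
struct toy_edge {
	const char *parent;
	const char *child;
};

/*
 * Visit paths depth-first without recursion, starting at the root path
 * "". Records the visit order in 'visited' and returns the number of
 * paths visited.
 */
static size_t toy_walk(const struct toy_edge *edges, size_t edges_nr,
		       const char **visited, size_t visited_max)
{
	struct toy_stack s = { 0 };
	size_t n = 0;
	const char *path;

	toy_push(&s, "");
	while ((path = toy_pop(&s))) {
		assert(n < visited_max);
		visited[n++] = path;

		/* "Expand" this path: push each child path at most once. */
		for (size_t i = 0; i < edges_nr; i++)
			if (!strcmp(edges[i].parent, path))
				toy_push(&s, edges[i].child);
	}
	return n;
}
```

Given the edges ("", "a"), ("", "left/"), ("left/", "left/b") and a
repeated ("", "left/"), this toy visits "", "left/", "left/b", "a": the
repeated path is pushed only once, and the most recently pushed path is
fully explored before older stack entries.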
In testing, this depth-first search approach was the one that used the
least memory while iterating over the object lists. There is still a
chance that repositories with too-wide path patterns could cause memory
pressure issues. Limiting the stack size could be done in the future by
limiting how many objects are being considered in-progress, or by
visiting blob paths earlier than trees.
There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 45 ++++
Makefile | 1 +
path-walk.c | 279 ++++++++++++++++++++++
path-walk.h | 47 ++++
4 files changed, 372 insertions(+)
create mode 100644 Documentation/technical/api-path-walk.txt
create mode 100644 path-walk.c
create mode 100644 path-walk.h
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
new file mode 100644
index 00000000000..c550c77ca30
--- /dev/null
+++ b/Documentation/technical/api-path-walk.txt
@@ -0,0 +1,45 @@
+Path-Walk API
+=============
+
+The path-walk API is used to walk reachable objects, but to visit objects
+in batches based on a common path they appear in, or by type.
+
+For example, all reachable commits are visited in a group. All tags are
+visited in a group. Then, all root trees are visited. At some point, all
+blobs reachable via a path `my/dir/to/A` are visited. When there are
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
+Basics
+------
+
+To use the path-walk API, include `path-walk.h` and call
+`walk_objects_by_path()` with a customized `path_walk_info` struct. The
+struct is used to set all of the options for how the walk should proceed.
+Let's dig into the different options and their use.
+
+`path_fn` and `path_fn_data`::
+ The most important option is the `path_fn` option, which is a
+ function pointer to the callback that can execute logic on the
+ object IDs for objects grouped by type and path. This function
+ also receives a `data` value that corresponds to the
+ `path_fn_data` member, for providing custom data structures to
+ this callback function.
+
+`revs`::
+ To configure the exact details of the reachable set of objects,
+ use the `revs` member and initialize it using the revision
+ machinery in `revision.h`. Initialize `revs` using calls such as
+ `setup_revisions()` or `parse_revision_opt()`. Do not call
+ `prepare_revision_walk()`, as that will be called within
+ `walk_objects_by_path()`.
++
+It is also important that you do not specify the `--objects` flag for the
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
+
+Examples
+--------
+
+See example usages in future changes.
diff --git a/Makefile b/Makefile
index 7344a7f7257..d0d8d6888e3 100644
--- a/Makefile
+++ b/Makefile
@@ -1094,6 +1094,7 @@ LIB_OBJS += parse-options.o
LIB_OBJS += patch-delta.o
LIB_OBJS += patch-ids.o
LIB_OBJS += path.o
+LIB_OBJS += path-walk.o
LIB_OBJS += pathspec.o
LIB_OBJS += pkt-line.o
LIB_OBJS += preload-index.o
diff --git a/path-walk.c b/path-walk.c
new file mode 100644
index 00000000000..021840ab41b
--- /dev/null
+++ b/path-walk.c
@@ -0,0 +1,279 @@
+/*
+ * path-walk.c: implementation for path-based walks of the object graph.
+ */
+#include "git-compat-util.h"
+#include "path-walk.h"
+#include "blob.h"
+#include "commit.h"
+#include "dir.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "object.h"
+#include "oid-array.h"
+#include "revision.h"
+#include "string-list.h"
+#include "strmap.h"
+#include "trace2.h"
+#include "tree.h"
+#include "tree-walk.h"
+
+struct type_and_oid_list {
+ enum object_type type;
+ struct oid_array oids;
+};
+
+#define TYPE_AND_OID_LIST_INIT { \
+ .type = OBJ_NONE, \
+ .oids = OID_ARRAY_INIT \
+}
+
+struct path_walk_context {
+ /**
+ * Repeats of data in 'struct path_walk_info' for
+ * access with fewer characters.
+ */
+ struct repository *repo;
+ struct rev_info *revs;
+ struct path_walk_info *info;
+
+ /**
+ * Map a path to a 'struct type_and_oid_list'
+ * containing the objects discovered at that
+ * path.
+ */
+ struct strmap paths_to_lists;
+
+ /**
+ * Store the current list of paths in a stack, to
+ * facilitate depth-first-search without recursion.
+ *
+ * Use path_stack_pushed to indicate whether a path
+ * was previously added to path_stack.
+ */
+ struct string_list path_stack;
+ struct strset path_stack_pushed;
+};
+
+static void push_to_stack(struct path_walk_context *ctx,
+ const char *path)
+{
+ if (strset_contains(&ctx->path_stack_pushed, path))
+ return;
+
+ strset_add(&ctx->path_stack_pushed, path);
+ string_list_append(&ctx->path_stack, path);
+}
+
+static int add_tree_entries(struct path_walk_context *ctx,
+ const char *base_path,
+ struct object_id *oid)
+{
+ struct tree_desc desc;
+ struct name_entry entry;
+ struct strbuf path = STRBUF_INIT;
+ size_t base_len;
+ struct tree *tree = lookup_tree(ctx->repo, oid);
+
+ if (!tree) {
+ error(_("failed to walk children of tree %s: not found"),
+ oid_to_hex(oid));
+ return -1;
+ } else if (parse_tree_gently(tree, 1)) {
+ error("bad tree object %s", oid_to_hex(oid));
+ return -1;
+ }
+
+ strbuf_addstr(&path, base_path);
+ base_len = path.len;
+
+ parse_tree(tree);
+ init_tree_desc(&desc, &tree->object.oid, tree->buffer, tree->size);
+ while (tree_entry(&desc, &entry)) {
+ struct type_and_oid_list *list;
+ struct object *o;
+ /* Not actually true, but we will ignore submodules later. */
+ enum object_type type = S_ISDIR(entry.mode) ? OBJ_TREE : OBJ_BLOB;
+
+ /* Skip submodules. */
+ if (S_ISGITLINK(entry.mode))
+ continue;
+
+ if (type == OBJ_TREE) {
+ struct tree *child = lookup_tree(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else if (type == OBJ_BLOB) {
+ struct blob *child = lookup_blob(ctx->repo, &entry.oid);
+ o = child ? &child->object : NULL;
+ } else {
+ BUG("invalid type for tree entry: %d", type);
+ }
+
+ if (!o) {
+ error(_("failed to find object %s"),
+ oid_to_hex(&entry.oid));
+ return -1;
+ }
+
+ /* Skip this object if already seen. */
+ if (o->flags & SEEN)
+ continue;
+ o->flags |= SEEN;
+
+ strbuf_setlen(&path, base_len);
+ strbuf_add(&path, entry.path, entry.pathlen);
+
+ /*
+ * Trees will end with "/" for concatenation and distinction
+ * from blobs at the same path.
+ */
+ if (type == OBJ_TREE)
+ strbuf_addch(&path, '/');
+
+ if (!(list = strmap_get(&ctx->paths_to_lists, path.buf))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = type;
+ strmap_put(&ctx->paths_to_lists, path.buf, list);
+ }
+ push_to_stack(ctx, path.buf);
+ oid_array_append(&list->oids, &entry.oid);
+ }
+
+ free_tree_buffer(tree);
+ strbuf_release(&path);
+ return 0;
+}
+
+/*
+ * Visit the given path: run the callback on the objects collected at
+ * that path and, if they are trees, add their entries to the lists for
+ * the child paths (skipping objects that were already seen).
+ */
+static int walk_path(struct path_walk_context *ctx,
+ const char *path)
+{
+ struct type_and_oid_list *list;
+ int ret = 0;
+
+ list = strmap_get(&ctx->paths_to_lists, path);
+
+ if (!list->oids.nr)
+ return 0;
+
+ /* Evaluate function pointer on this data. */
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
+
+ /* Expand data for children. */
+ if (list->type == OBJ_TREE) {
+ for (size_t i = 0; i < list->oids.nr; i++) {
+ ret |= add_tree_entries(ctx,
+ path,
+ &list->oids.oid[i]);
+ }
+ }
+
+ oid_array_clear(&list->oids);
+ strmap_remove(&ctx->paths_to_lists, path, 1);
+ return ret;
+}
+
+static void clear_paths_to_lists(struct strmap *map)
+{
+ struct hashmap_iter iter;
+ struct strmap_entry *e;
+
+ hashmap_for_each_entry(&map->map, &iter, e, ent) {
+ struct type_and_oid_list *list = e->value;
+ oid_array_clear(&list->oids);
+ }
+ strmap_clear(map, 1);
+ strmap_init(map);
+}
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info)
+{
+ const char *root_path = "";
+ int ret = 0;
+ size_t commits_nr = 0, paths_nr = 0;
+ struct commit *c;
+ struct type_and_oid_list *root_tree_list;
+ struct path_walk_context ctx = {
+ .repo = info->revs->repo,
+ .revs = info->revs,
+ .info = info,
+ .path_stack = STRING_LIST_INIT_DUP,
+ .path_stack_pushed = STRSET_INIT,
+ .paths_to_lists = STRMAP_INIT
+ };
+
+ trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+
+ /* Insert a single list for the root tree into the paths. */
+ CALLOC_ARRAY(root_tree_list, 1);
+ root_tree_list->type = OBJ_TREE;
+ strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
+ push_to_stack(&ctx, root_path);
+
+ if (prepare_revision_walk(info->revs))
+ die(_("failed to setup revision walk"));
+
+ while ((c = get_revision(info->revs))) {
+ struct object_id *oid = get_commit_tree_oid(c);
+ struct tree *t;
+ commits_nr++;
+
+ oid = get_commit_tree_oid(c);
+ t = lookup_tree(info->revs->repo, oid);
+
+ if (!t) {
+ error("could not find tree %s", oid_to_hex(oid));
+ return -1;
+ }
+
+ if (t->object.flags & SEEN)
+ continue;
+ t->object.flags |= SEEN;
+ oid_array_append(&root_tree_list->oids, oid);
+ }
+
+ trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
+ trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+
+ trace2_region_enter("path-walk", "path-walk", info->revs->repo);
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
+ trace2_region_leave("path-walk", "path-walk", info->revs->repo);
+
+ clear_paths_to_lists(&ctx.paths_to_lists);
+ strset_clear(&ctx.path_stack_pushed);
+ string_list_clear(&ctx.path_stack, 0);
+ return ret;
+}
+
+void path_walk_info_init(struct path_walk_info *info)
+{
+ struct path_walk_info empty = PATH_WALK_INFO_INIT;
+ memcpy(info, &empty, sizeof(empty));
+}
+
+void path_walk_info_clear(struct path_walk_info *info UNUSED)
+{
+ /*
+ * This destructor is empty for now, as info->revs
+ * is not owned by 'struct path_walk_info'.
+ */
+}
diff --git a/path-walk.h b/path-walk.h
new file mode 100644
index 00000000000..7cb3538cd8f
--- /dev/null
+++ b/path-walk.h
@@ -0,0 +1,47 @@
+/*
+ * path-walk.h : Methods and structures for walking the object graph in batches
+ * by the paths that can reach those objects.
+ */
+#include "object.h" /* Required for 'enum object_type'. */
+
+struct rev_info;
+struct oid_array;
+
+/**
+ * The type of a function pointer for the method that is called on a list of
+ * objects reachable at a given path.
+ */
+typedef int (*path_fn)(const char *path,
+ struct oid_array *oids,
+ enum object_type type,
+ void *data);
+
+struct path_walk_info {
+ /**
+ * revs provides the definitions for the commit walk, including
+ * which commits are UNINTERESTING or not. This structure is
+ * expected to be owned by the caller.
+ */
+ struct rev_info *revs;
+
+ /**
+ * The caller wishes to execute custom logic on objects reachable at a
+ * given path. Every reachable object will be visited exactly once, and
+ * the first path to see an object wins. This may not be a stable choice.
+ */
+ path_fn path_fn;
+ void *path_fn_data;
+};
+
+#define PATH_WALK_INFO_INIT { 0 }
+
+void path_walk_info_init(struct path_walk_info *info);
+void path_walk_info_clear(struct path_walk_info *info);
+
+/**
+ * Given the configuration of 'info', walk the commits based on 'info->revs' and
+ * call 'info->path_fn' on each discovered path.
+ *
+ * Returns nonzero on an error.
+ */
+int walk_objects_by_path(struct path_walk_info *info);
--
gitgitgadget
* Re: [PATCH v4 1/7] path-walk: introduce an object walk by path
2024-12-20 16:21 ` [PATCH v4 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
@ 2024-12-27 14:18 ` Patrick Steinhardt
0 siblings, 0 replies; 67+ messages in thread
From: Patrick Steinhardt @ 2024-12-27 14:18 UTC (permalink / raw)
To: Derrick Stolee via GitGitGadget
Cc: git, gitster, johannes.schindelin, peff, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee
On Fri, Dec 20, 2024 at 04:21:09PM +0000, Derrick Stolee via GitGitGadget wrote:
[snip]
> +static int add_tree_entries(struct path_walk_context *ctx,
> + const char *base_path,
> + struct object_id *oid)
> +{
> + struct tree_desc desc;
> + struct name_entry entry;
> + struct strbuf path = STRBUF_INIT;
> + size_t base_len;
> + struct tree *tree = lookup_tree(ctx->repo, oid);
> +
> + if (!tree) {
> + error(_("failed to walk children of tree %s: not found"),
> + oid_to_hex(oid));
> + return -1;
> + } else if (parse_tree_gently(tree, 1)) {
> + error("bad tree object %s", oid_to_hex(oid));
> + return -1;
> + }
You can `return error(_("..."));` directly as it already returns `-1`.
Not sure whether this by itself warrants a reroll -- probably not. I'll
leave it up to you.
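For readers unfamiliar with the idiom, a minimal sketch of the
suggested shape follows. toy_error() is a stand-in written for this
example that only mimics the "report and return -1" contract of Git's
error(), and check_tree() is a hypothetical caller, not code from the
patch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Stand-in for error(): print "error: <msg>" to stderr, return -1. */
static int toy_error(const char *fmt, ...)
{
	va_list ap;

	assert(fmt);
	fputs("error: ", stderr);
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
	return -1;
}

/* Hypothetical caller: the message and the -1 come from one statement. */
static int check_tree(const void *tree, const char *hex)
{
	if (!tree)
		return toy_error("bad tree object %s", hex);
	return 0;
}
```

This drops the separate `return -1;` line and the braces around the
two-statement body without changing what the caller observes.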
The rest of the patch series looks as expected, mostly based on the
range diff.
Patrick
* [PATCH v4 2/7] test-lib-functions: add test_cmp_sorted
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
@ 2024-12-20 16:21 ` Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
` (4 subsequent siblings)
6 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:21 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
This test helper reduces repeated logic in t6601-path-walk.sh and may
be useful elsewhere, too.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
t/test-lib-functions.sh | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index fde9bf54fc3..16b70aebd60 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1267,6 +1267,16 @@ test_cmp () {
eval "$GIT_TEST_CMP" '"$@"'
}
+# test_cmp_sorted runs test_cmp on sorted versions of the two
+# input files. Uses "$1.sorted" and "$2.sorted" as temp files.
+
+test_cmp_sorted () {
+ sort <"$1" >"$1.sorted" &&
+ sort <"$2" >"$2.sorted" &&
+ test_cmp "$1.sorted" "$2.sorted" &&
+ rm "$1.sorted" "$2.sorted"
+}
+
# Check that the given config key has the expected value.
#
# test_cmp_config [-C <dir>] <expected-value>
--
gitgitgadget
* [PATCH v4 3/7] t6601: add helper for testing path-walk API
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 1/7] path-walk: introduce an object walk by path Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 2/7] test-lib-functions: add test_cmp_sorted Derrick Stolee via GitGitGadget
@ 2024-12-20 16:21 ` Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
` (3 subsequent siblings)
6 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:21 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.
Store and output a 'batch_nr' value so we can demonstrate that the paths are
grouped together in a batch and not following some other ordering. This
allows us to test the depth-first behavior of the path-walk API. However, we
purposefully do not test the order of the objects in the batch, so the
output is compared to the expected output through a sort.
It is important to mention that the behavior of the API will change soon as
we start to handle UNINTERESTING objects differently, but these tests will
demonstrate the change in behavior.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 3 +-
Makefile | 1 +
t/helper/test-path-walk.c | 84 +++++++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/t6601-path-walk.sh | 120 ++++++++++++++++++++++
6 files changed, 209 insertions(+), 1 deletion(-)
create mode 100644 t/helper/test-path-walk.c
create mode 100755 t/t6601-path-walk.sh
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index c550c77ca30..662162ec70b 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -42,4 +42,5 @@ commits.
Examples
--------
-See example usages in future changes.
+See example usages in:
+ `t/helper/test-path-walk.c`
diff --git a/Makefile b/Makefile
index d0d8d6888e3..50413d96492 100644
--- a/Makefile
+++ b/Makefile
@@ -818,6 +818,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
TEST_BUILTINS_OBJS += test-partial-clone.o
TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-path-walk.o
TEST_BUILTINS_OBJS += test-pcre2-config.o
TEST_BUILTINS_OBJS += test-pkt-line.o
TEST_BUILTINS_OBJS += test-proc-receive.o
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
new file mode 100644
index 00000000000..def7c81ac4f
--- /dev/null
+++ b/t/helper/test-path-walk.c
@@ -0,0 +1,84 @@
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "test-tool.h"
+#include "environment.h"
+#include "hex.h"
+#include "object-name.h"
+#include "object.h"
+#include "pretty.h"
+#include "revision.h"
+#include "setup.h"
+#include "parse-options.h"
+#include "path-walk.h"
+#include "oid-array.h"
+
+static const char * const path_walk_usage[] = {
+ N_("test-tool path-walk <options> -- <revision-options>"),
+ NULL
+};
+
+struct path_walk_test_data {
+ uintmax_t batch_nr;
+ uintmax_t tree_nr;
+ uintmax_t blob_nr;
+};
+
+static int emit_block(const char *path, struct oid_array *oids,
+ enum object_type type, void *data)
+{
+ struct path_walk_test_data *tdata = data;
+ const char *typestr;
+
+ if (type == OBJ_TREE)
+ tdata->tree_nr += oids->nr;
+ else if (type == OBJ_BLOB)
+ tdata->blob_nr += oids->nr;
+ else
+ BUG("we do not understand this type");
+
+ typestr = type_name(type);
+
+ for (size_t i = 0; i < oids->nr; i++)
+ printf("%"PRIuMAX":%s:%s:%s\n",
+ tdata->batch_nr, typestr, path,
+ oid_to_hex(&oids->oid[i]));
+
+ tdata->batch_nr++;
+ return 0;
+}
+
+int cmd__path_walk(int argc, const char **argv)
+{
+ int res;
+ struct rev_info revs = REV_INFO_INIT;
+ struct path_walk_info info = PATH_WALK_INFO_INIT;
+ struct path_walk_test_data data = { 0 };
+ struct option options[] = {
+ OPT_END(),
+ };
+
+ setup_git_directory();
+ revs.repo = the_repository;
+
+ argc = parse_options(argc, argv, NULL,
+ options, path_walk_usage,
+ PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
+
+ if (argc > 1)
+ setup_revisions(argc, argv, &revs, NULL);
+ else
+ usage(path_walk_usage[0]);
+
+ info.revs = &revs;
+ info.path_fn = emit_block;
+ info.path_fn_data = &data;
+
+ res = walk_objects_by_path(&info);
+
+ printf("trees:%" PRIuMAX "\n"
+ "blobs:%" PRIuMAX "\n",
+ data.tree_nr, data.blob_nr);
+
+ release_revisions(&revs);
+ return res;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 1ebb69a5dc4..43676e7b93a 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -52,6 +52,7 @@ static struct test_cmd cmds[] = {
{ "parse-subcommand", cmd__parse_subcommand },
{ "partial-clone", cmd__partial_clone },
{ "path-utils", cmd__path_utils },
+ { "path-walk", cmd__path_walk },
{ "pcre2-config", cmd__pcre2_config },
{ "pkt-line", cmd__pkt_line },
{ "proc-receive", cmd__proc_receive },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 21802ac27da..9cfc5da6e57 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -45,6 +45,7 @@ int cmd__parse_pathspec_file(int argc, const char** argv);
int cmd__parse_subcommand(int argc, const char **argv);
int cmd__partial_clone(int argc, const char **argv);
int cmd__path_utils(int argc, const char **argv);
+int cmd__path_walk(int argc, const char **argv);
int cmd__pcre2_config(int argc, const char **argv);
int cmd__pkt_line(int argc, const char **argv);
int cmd__proc_receive(int argc, const char **argv);
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
new file mode 100755
index 00000000000..4e052c09309
--- /dev/null
+++ b/t/t6601-path-walk.sh
@@ -0,0 +1,120 @@
+#!/bin/sh
+
+TEST_PASSES_SANITIZE_LEAK=true
+
+test_description='direct path-walk API tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup test repository' '
+ git checkout -b base &&
+
+ mkdir left &&
+ mkdir right &&
+ echo a >a &&
+ echo b >left/b &&
+ echo c >right/c &&
+ git add . &&
+ git commit -m "first" &&
+
+ echo d >right/d &&
+ git add right &&
+ git commit -m "second" &&
+
+ echo bb >left/b &&
+ git commit -a -m "third" &&
+
+ git checkout -b topic HEAD~1 &&
+ echo cc >right/c &&
+ git commit -a -m "topic"
+'
+
+test_expect_success 'all' '
+ test-tool path-walk -- --all >out &&
+
+ cat >expect <<-EOF &&
+ 0:tree::$(git rev-parse topic^{tree})
+ 0:tree::$(git rev-parse base^{tree})
+ 0:tree::$(git rev-parse base~1^{tree})
+ 0:tree::$(git rev-parse base~2^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+ 1:tree:right/:$(git rev-parse base~1:right)
+ 1:tree:right/:$(git rev-parse base~2:right)
+ 2:blob:right/d:$(git rev-parse base~1:right/d)
+ 3:blob:right/c:$(git rev-parse base~2:right/c)
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ 4:tree:left/:$(git rev-parse base:left)
+ 4:tree:left/:$(git rev-parse base~2:left)
+ 5:blob:left/b:$(git rev-parse base~2:left/b)
+ 5:blob:left/b:$(git rev-parse base:left/b)
+ 6:blob:a:$(git rev-parse base~2:a)
+ blobs:6
+ trees:9
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic only' '
+ test-tool path-walk -- topic >out &&
+
+ cat >expect <<-EOF &&
+ 0:tree::$(git rev-parse topic^{tree})
+ 0:tree::$(git rev-parse base~1^{tree})
+ 0:tree::$(git rev-parse base~2^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+ 1:tree:right/:$(git rev-parse base~1:right)
+ 1:tree:right/:$(git rev-parse base~2:right)
+ 2:blob:right/d:$(git rev-parse base~1:right/d)
+ 3:blob:right/c:$(git rev-parse base~2:right/c)
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ 4:tree:left/:$(git rev-parse base~2:left)
+ 5:blob:left/b:$(git rev-parse base~2:left/b)
+ 6:blob:a:$(git rev-parse base~2:a)
+ blobs:5
+ trees:7
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base' '
+ test-tool path-walk -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:tree::$(git rev-parse topic^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+ 2:blob:right/d:$(git rev-parse topic:right/d)
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ 4:tree:left/:$(git rev-parse topic:left)
+ 5:blob:left/b:$(git rev-parse topic:left/b)
+ 6:blob:a:$(git rev-parse topic:a)
+ blobs:4
+ trees:3
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, boundary' '
+ test-tool path-walk -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:tree::$(git rev-parse topic^{tree})
+ 0:tree::$(git rev-parse base~1^{tree})
+ 1:tree:right/:$(git rev-parse topic:right)
+ 1:tree:right/:$(git rev-parse base~1:right)
+ 2:blob:right/d:$(git rev-parse base~1:right/d)
+ 3:blob:right/c:$(git rev-parse base~1:right/c)
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ 4:tree:left/:$(git rev-parse base~1:left)
+ 5:blob:left/b:$(git rev-parse base~1:left/b)
+ 6:blob:a:$(git rev-parse base~1:a)
+ blobs:5
+ trees:5
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_done
--
gitgitgadget
* [PATCH v4 4/7] path-walk: allow consumer to specify object types
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
` (2 preceding siblings ...)
2024-12-20 16:21 ` [PATCH v4 3/7] t6601: add helper for testing path-walk API Derrick Stolee via GitGitGadget
@ 2024-12-20 16:21 ` Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
` (2 subsequent siblings)
6 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:21 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <derrickstolee@github.com>
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.
This also adds the ability to ask for the commits in a list. We re-use
the empty string as the path for this set of objects because they are
passed directly to the callback function instead of being part of the
'path_stack'.
Future changes will add the ability to visit annotated tags.
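The filtering rule can be sketched in isolation as follows. This is a simplified model and not the code from this patch: 'struct sketch_filter', 'sketch_wants' and 'sketch_needs_tree_walk' are hypothetical names, chosen to mirror the three members and the checks added below.

```c
/* Hypothetical object types for the sketch. */
enum sketch_type { SKETCH_COMMIT, SKETCH_TREE, SKETCH_BLOB };

/* Mirrors the three members added to the walk info; all default to on. */
struct sketch_filter {
	int commits;
	int trees;
	int blobs;
};

#define SKETCH_FILTER_INIT { .commits = 1, .trees = 1, .blobs = 1 }

/* Should the callback be invoked for a batch of this type? */
int sketch_wants(const struct sketch_filter *f, enum sketch_type type)
{
	switch (type) {
	case SKETCH_COMMIT:
		return f->commits;
	case SKETCH_TREE:
		return f->trees;
	case SKETCH_BLOB:
		return f->blobs;
	}
	return 0;
}

/*
 * Trees must still be expanded when only blobs are requested, since
 * blobs are discovered through tree entries. Only when neither trees
 * nor blobs are wanted can tree expansion be skipped entirely.
 */
int sketch_needs_tree_walk(const struct sketch_filter *f)
{
	return f->trees || f->blobs;
}
```

Note the asymmetry: disabling trees suppresses the callback for tree batches but does not stop the walk from descending through them.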
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 9 ++
path-walk.c | 33 ++++-
path-walk.h | 14 +-
t/helper/test-path-walk.c | 15 ++-
t/t6601-path-walk.sh | 149 +++++++++++++++-------
5 files changed, 170 insertions(+), 50 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 662162ec70b..dce553b6114 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,6 +39,15 @@ It is also important that you do not specify the `--objects` flag for the
the objects will be walked in a separate way based on those starting
commits.
+`commits`, `blobs`, `trees`::
+ By default, these members are enabled and signal that the path-walk
+ API should call the `path_fn` on objects of these types. Specialized
+ applications could disable some options to make it simpler to walk
+ the objects or to have fewer calls to `path_fn`.
++
+While it is possible to walk only commits in this way, consumers would be
+better off using the revision walk API instead.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index 021840ab41b..05ca7b2442a 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -98,6 +98,10 @@ static int add_tree_entries(struct path_walk_context *ctx,
if (S_ISGITLINK(entry.mode))
continue;
+ /* If the caller doesn't want blobs, then don't bother. */
+ if (!ctx->info->blobs && type == OBJ_BLOB)
+ continue;
+
if (type == OBJ_TREE) {
struct tree *child = lookup_tree(ctx->repo, &entry.oid);
o = child ? &child->object : NULL;
@@ -159,9 +163,11 @@ static int walk_path(struct path_walk_context *ctx,
if (!list->oids.nr)
return 0;
- /* Evaluate function pointer on this data. */
- ret = ctx->info->path_fn(path, &list->oids, list->type,
- ctx->info->path_fn_data);
+ /* Evaluate function pointer on this data, if requested. */
+ if ((list->type == OBJ_TREE && ctx->info->trees) ||
+ (list->type == OBJ_BLOB && ctx->info->blobs))
+ ret = ctx->info->path_fn(path, &list->oids, list->type,
+ ctx->info->path_fn_data);
/* Expand data for children. */
if (list->type == OBJ_TREE) {
@@ -203,6 +209,7 @@ int walk_objects_by_path(struct path_walk_info *info)
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
struct type_and_oid_list *root_tree_list;
+ struct type_and_oid_list *commit_list;
struct path_walk_context ctx = {
.repo = info->revs->repo,
.revs = info->revs,
@@ -214,6 +221,9 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_region_enter("path-walk", "commit-walk", info->revs->repo);
+ CALLOC_ARRAY(commit_list, 1);
+ commit_list->type = OBJ_COMMIT;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
@@ -224,10 +234,18 @@ int walk_objects_by_path(struct path_walk_info *info)
die(_("failed to setup revision walk"));
while ((c = get_revision(info->revs))) {
- struct object_id *oid = get_commit_tree_oid(c);
+ struct object_id *oid;
struct tree *t;
commits_nr++;
+ if (info->commits)
+ oid_array_append(&commit_list->oids,
+ &c->object.oid);
+
+ /* If we only care about commits, then skip trees. */
+ if (!info->trees && !info->blobs)
+ continue;
+
oid = get_commit_tree_oid(c);
t = lookup_tree(info->revs->repo, oid);
@@ -245,6 +263,13 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
+ /* Track all commits. */
+ if (info->commits && commit_list->oids.nr)
+ ret = info->path_fn("", &commit_list->oids, OBJ_COMMIT,
+ info->path_fn_data);
+ oid_array_clear(&commit_list->oids);
+ free(commit_list);
+
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
while (!ret && ctx.path_stack.nr) {
char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
diff --git a/path-walk.h b/path-walk.h
index 7cb3538cd8f..2cafc71e153 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -31,9 +31,21 @@ struct path_walk_info {
*/
path_fn path_fn;
void *path_fn_data;
+
+ /**
+ * Initialize which object types the path_fn should be called on. This
+ * could also limit the walk to skip blobs if not set.
+ */
+ int commits;
+ int trees;
+ int blobs;
};
-#define PATH_WALK_INFO_INIT { 0 }
+#define PATH_WALK_INFO_INIT { \
+ .blobs = 1, \
+ .trees = 1, \
+ .commits = 1, \
+}
void path_walk_info_init(struct path_walk_info *info);
void path_walk_info_clear(struct path_walk_info *info);
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index def7c81ac4f..a57a05a6391 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -19,6 +19,8 @@ static const char * const path_walk_usage[] = {
struct path_walk_test_data {
uintmax_t batch_nr;
+
+ uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
};
@@ -33,6 +35,8 @@ static int emit_block(const char *path, struct oid_array *oids,
tdata->tree_nr += oids->nr;
else if (type == OBJ_BLOB)
tdata->blob_nr += oids->nr;
+ else if (type == OBJ_COMMIT)
+ tdata->commit_nr += oids->nr;
else
BUG("we do not understand this type");
@@ -54,6 +58,12 @@ int cmd__path_walk(int argc, const char **argv)
struct path_walk_info info = PATH_WALK_INFO_INIT;
struct path_walk_test_data data = { 0 };
struct option options[] = {
+ OPT_BOOL(0, "blobs", &info.blobs,
+ N_("toggle inclusion of blob objects")),
+ OPT_BOOL(0, "commits", &info.commits,
+ N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "trees", &info.trees,
+ N_("toggle inclusion of tree objects")),
OPT_END(),
};
@@ -75,9 +85,10 @@ int cmd__path_walk(int argc, const char **argv)
res = walk_objects_by_path(&info);
- printf("trees:%" PRIuMAX "\n"
+ printf("commits:%" PRIuMAX "\n"
+ "trees:%" PRIuMAX "\n"
"blobs:%" PRIuMAX "\n",
- data.tree_nr, data.blob_nr);
+ data.commit_nr, data.tree_nr, data.blob_nr);
release_revisions(&revs);
return res;
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 4e052c09309..4a4939a1b02 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -33,22 +33,27 @@ test_expect_success 'all' '
test-tool path-walk -- --all >out &&
cat >expect <<-EOF &&
- 0:tree::$(git rev-parse topic^{tree})
- 0:tree::$(git rev-parse base^{tree})
- 0:tree::$(git rev-parse base~1^{tree})
- 0:tree::$(git rev-parse base~2^{tree})
- 1:tree:right/:$(git rev-parse topic:right)
- 1:tree:right/:$(git rev-parse base~1:right)
- 1:tree:right/:$(git rev-parse base~2:right)
- 2:blob:right/d:$(git rev-parse base~1:right/d)
- 3:blob:right/c:$(git rev-parse base~2:right/c)
- 3:blob:right/c:$(git rev-parse topic:right/c)
- 4:tree:left/:$(git rev-parse base:left)
- 4:tree:left/:$(git rev-parse base~2:left)
- 5:blob:left/b:$(git rev-parse base~2:left/b)
- 5:blob:left/b:$(git rev-parse base:left/b)
- 6:blob:a:$(git rev-parse base~2:a)
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base)
+ 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~2)
+ 1:tree::$(git rev-parse topic^{tree})
+ 1:tree::$(git rev-parse base^{tree})
+ 1:tree::$(git rev-parse base~1^{tree})
+ 1:tree::$(git rev-parse base~2^{tree})
+ 2:tree:right/:$(git rev-parse topic:right)
+ 2:tree:right/:$(git rev-parse base~1:right)
+ 2:tree:right/:$(git rev-parse base~2:right)
+ 3:blob:right/d:$(git rev-parse base~1:right/d)
+ 4:blob:right/c:$(git rev-parse base~2:right/c)
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+ 5:tree:left/:$(git rev-parse base:left)
+ 5:tree:left/:$(git rev-parse base~2:left)
+ 6:blob:left/b:$(git rev-parse base~2:left/b)
+ 6:blob:left/b:$(git rev-parse base:left/b)
+ 7:blob:a:$(git rev-parse base~2:a)
blobs:6
+ commits:4
trees:9
EOF
@@ -59,19 +64,23 @@ test_expect_success 'topic only' '
test-tool path-walk -- topic >out &&
cat >expect <<-EOF &&
- 0:tree::$(git rev-parse topic^{tree})
- 0:tree::$(git rev-parse base~1^{tree})
- 0:tree::$(git rev-parse base~2^{tree})
- 1:tree:right/:$(git rev-parse topic:right)
- 1:tree:right/:$(git rev-parse base~1:right)
- 1:tree:right/:$(git rev-parse base~2:right)
- 2:blob:right/d:$(git rev-parse base~1:right/d)
- 3:blob:right/c:$(git rev-parse base~2:right/c)
- 3:blob:right/c:$(git rev-parse topic:right/c)
- 4:tree:left/:$(git rev-parse base~2:left)
- 5:blob:left/b:$(git rev-parse base~2:left/b)
- 6:blob:a:$(git rev-parse base~2:a)
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~2)
+ 1:tree::$(git rev-parse topic^{tree})
+ 1:tree::$(git rev-parse base~1^{tree})
+ 1:tree::$(git rev-parse base~2^{tree})
+ 2:tree:right/:$(git rev-parse topic:right)
+ 2:tree:right/:$(git rev-parse base~1:right)
+ 2:tree:right/:$(git rev-parse base~2:right)
+ 3:blob:right/d:$(git rev-parse base~1:right/d)
+ 4:blob:right/c:$(git rev-parse base~2:right/c)
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+ 5:tree:left/:$(git rev-parse base~2:left)
+ 6:blob:left/b:$(git rev-parse base~2:left/b)
+ 7:blob:a:$(git rev-parse base~2:a)
blobs:5
+ commits:3
trees:7
EOF
@@ -82,15 +91,66 @@ test_expect_success 'topic, not base' '
test-tool path-walk -- topic --not base >out &&
cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 1:tree::$(git rev-parse topic^{tree})
+ 2:tree:right/:$(git rev-parse topic:right)
+ 3:blob:right/d:$(git rev-parse topic:right/d)
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+ 5:tree:left/:$(git rev-parse topic:left)
+ 6:blob:left/b:$(git rev-parse topic:left/b)
+ 7:blob:a:$(git rev-parse topic:a)
+ blobs:4
+ commits:1
+ trees:3
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only blobs' '
+ test-tool path-walk --no-trees --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
+ trees:0
+ 0:blob:right/d:$(git rev-parse topic:right/d)
+ 1:blob:right/c:$(git rev-parse topic:right/c)
+ 2:blob:left/b:$(git rev-parse topic:left/b)
+ 3:blob:a:$(git rev-parse topic:a)
+ blobs:4
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+# No, this doesn't make a lot of sense for the path-walk API,
+# but it is possible to do.
+test_expect_success 'topic, not base, only commits' '
+ test-tool path-walk --no-blobs --no-trees \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ commits:1
+ trees:0
+ blobs:0
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'topic, not base, only trees' '
+ test-tool path-walk --no-blobs --no-commits \
+ -- topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ commits:0
0:tree::$(git rev-parse topic^{tree})
1:tree:right/:$(git rev-parse topic:right)
- 2:blob:right/d:$(git rev-parse topic:right/d)
- 3:blob:right/c:$(git rev-parse topic:right/c)
- 4:tree:left/:$(git rev-parse topic:left)
- 5:blob:left/b:$(git rev-parse topic:left/b)
- 6:blob:a:$(git rev-parse topic:a)
- blobs:4
+ 2:tree:left/:$(git rev-parse topic:left)
trees:3
+ blobs:0
EOF
test_cmp_sorted expect out
@@ -100,17 +160,20 @@ test_expect_success 'topic, not base, boundary' '
test-tool path-walk -- --boundary topic --not base >out &&
cat >expect <<-EOF &&
- 0:tree::$(git rev-parse topic^{tree})
- 0:tree::$(git rev-parse base~1^{tree})
- 1:tree:right/:$(git rev-parse topic:right)
- 1:tree:right/:$(git rev-parse base~1:right)
- 2:blob:right/d:$(git rev-parse base~1:right/d)
- 3:blob:right/c:$(git rev-parse base~1:right/c)
- 3:blob:right/c:$(git rev-parse topic:right/c)
- 4:tree:left/:$(git rev-parse base~1:left)
- 5:blob:left/b:$(git rev-parse base~1:left/b)
- 6:blob:a:$(git rev-parse base~1:a)
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base~1)
+ 1:tree::$(git rev-parse topic^{tree})
+ 1:tree::$(git rev-parse base~1^{tree})
+ 2:tree:right/:$(git rev-parse topic:right)
+ 2:tree:right/:$(git rev-parse base~1:right)
+ 3:blob:right/d:$(git rev-parse base~1:right/d)
+ 4:blob:right/c:$(git rev-parse base~1:right/c)
+ 4:blob:right/c:$(git rev-parse topic:right/c)
+ 5:tree:left/:$(git rev-parse base~1:left)
+ 6:blob:left/b:$(git rev-parse base~1:left/b)
+ 7:blob:a:$(git rev-parse base~1:a)
blobs:5
+ commits:2
trees:5
EOF
--
gitgitgadget
* [PATCH v4 5/7] path-walk: visit tags and cached objects
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
` (3 preceding siblings ...)
2024-12-20 16:21 ` [PATCH v4 4/7] path-walk: allow consumer to specify object types Derrick Stolee via GitGitGadget
@ 2024-12-20 16:21 ` Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget
6 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:21 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The rev_info that is specified for a path-walk traversal may specify
visiting tag refs (both lightweight and annotated) and also may specify
indexed objects (blobs and trees). Update the path-walk API to walk
these objects as well.
When walking tags, we need to peel the annotated objects until reaching
a non-tag object. If we reach a commit, then we can add it to the
pending objects to make sure we visit in the commit walk portion. If we
reach a tree, then we will assume that it is a root tree. If we reach a
blob, then we have no good path name and so add it to a new list of
"tagged blobs".
When the rev_info includes the "--indexed-objects" flag, then the
pending set includes blobs and trees found in the cache entries and
cache-tree. The cache entries are usually blobs, though they could be
trees in the case of a sparse index. The cache-tree stores
previously-hashed tree objects but these are cleared out when staging
objects below those paths. We add tests that demonstrate this.
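The tag-peeling rule described earlier in this message can be sketched on its own. This is a simplified model under stated assumptions, not the code from this patch: 'struct sk_object', 'sk_peel' and the 'SK_DEST_*' values are hypothetical, every tag is assumed to have a non-NULL target, and a single 'seen' member stands in for the SEEN object flag.

```c
#include <stddef.h>

enum sk_type { SK_COMMIT, SK_TREE, SK_BLOB, SK_TAG };

/* Where a peeled object ends up in the sketch. */
enum sk_dest {
	SK_DEST_NONE,        /* already handled earlier */
	SK_DEST_COMMIT_WALK, /* handed to the commit walk */
	SK_DEST_ROOT_TREES,  /* treated as a root tree */
	SK_DEST_TAGGED_BLOBS /* no path known: "tagged blobs" list */
};

struct sk_object {
	enum sk_type type;
	int seen;                 /* stand-in for the SEEN flag */
	struct sk_object *tagged; /* target, used only for SK_TAG */
};

/*
 * Follow a chain of annotated tags until a non-tag object is reached,
 * counting each tag visited for the first time in '*tags_seen', and
 * report where the final object belongs.
 */
enum sk_dest sk_peel(struct sk_object *obj, size_t *tags_seen)
{
	while (obj->type == SK_TAG) {
		if (obj->seen)
			return SK_DEST_NONE;
		obj->seen = 1;
		(*tags_seen)++;
		obj = obj->tagged;
	}

	if (obj->seen)
		return SK_DEST_NONE;
	obj->seen = 1;

	switch (obj->type) {
	case SK_COMMIT:
		return SK_DEST_COMMIT_WALK;
	case SK_TREE:
		return SK_DEST_ROOT_TREES;
	case SK_BLOB:
		return SK_DEST_TAGGED_BLOBS;
	default:
		return SK_DEST_NONE;
	}
}
```

Marking each object as seen is what keeps a tag of a tag (such as second.2 pointing at second.1 in the tests) from being reported twice when both are tips.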
The indexed objects come with a non-NULL 'path' value in the pending
item. This allows us to prepopulate the 'paths_to_lists' strmap with
lists for these paths.
The tricky thing about this walk is that we will want to combine the
indexed objects walk with the commit walk, especially in the future case
of walking objects during a command like 'git repack'.
Whenever possible, we want the objects from the index to be grouped with
similar objects in history. We don't want to miss any paths that appear
only in the index and not in the commit history.
Thus, we need to be careful to let the path stack be populated initially
with only the root tree path (and possibly tags and tagged blobs) and go
through the normal depth-first search. Afterwards, if there are other
paths that are remaining in the paths_to_lists strmap, we should then
push those paths onto the stack and visit their objects recursively.
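The two-phase ordering can be modeled with a small, self-contained sketch. It is not the code from this patch: 'struct sk_path', 'sk_drain' and 'sk_walk' are hypothetical, the children of each path are given by a fixed table instead of being parsed from tree objects, and the fixed-size stack assumes a tiny input.

```c
#include <stddef.h>
#include <string.h>

#define SK_MAX 16

/* One known path; 'children' lists the paths pushed when it is visited. */
struct sk_path {
	const char *name;
	const char *children[3]; /* NULL-terminated */
	int visited;
};

static struct sk_path *sk_find(struct sk_path *map, size_t nr,
			       const char *name)
{
	for (size_t i = 0; i < nr; i++)
		if (!strcmp(map[i].name, name))
			return &map[i];
	return NULL;
}

/* Pop paths until the stack is empty, appending each visit to 'order'. */
static size_t sk_drain(struct sk_path *map, size_t nr,
		       const char **stack, size_t top,
		       const char **order, size_t n)
{
	while (top) {
		struct sk_path *p = sk_find(map, nr, stack[--top]);
		if (!p || p->visited)
			continue;
		p->visited = 1;
		order[n++] = p->name;
		for (size_t i = 0; i < 3 && p->children[i]; i++)
			stack[top++] = p->children[i];
	}
	return n;
}

/* Returns the number of paths visited; 'order' receives their names. */
size_t sk_walk(struct sk_path *map, size_t nr, const char **order)
{
	const char *stack[SK_MAX];
	size_t top = 0, n;

	/* Phase 1: depth-first search seeded with only the root path. */
	stack[top++] = "";
	n = sk_drain(map, nr, stack, top, order, 0);

	/* Phase 2: push whatever was never reached, then drain again. */
	top = 0;
	for (size_t i = 0; i < nr; i++)
		if (!map[i].visited)
			stack[top++] = map[i].name;
	return sk_drain(map, nr, stack, top, order, n);
}
```

Seeding the first phase with only the root path lets indexed objects at paths that also exist in history join the batch for that path, while the second phase covers paths that exist only in the index.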
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 2 +-
path-walk.c | 184 ++++++++++++++++++++-
path-walk.h | 2 +
t/helper/test-path-walk.c | 15 +-
t/t6601-path-walk.sh | 186 +++++++++++++++++++---
5 files changed, 362 insertions(+), 27 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index dce553b6114..6022c381b7c 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -39,7 +39,7 @@ It is also important that you do not specify the `--objects` flag for the
the objects will be walked in a separate way based on those starting
commits.
-`commits`, `blobs`, `trees`::
+`commits`, `blobs`, `trees`, `tags`::
By default, these members are enabled and signal that the path-walk
API should call the `path_fn` on objects of these types. Specialized
applications could disable some options to make it simpler to walk
diff --git a/path-walk.c b/path-walk.c
index 05ca7b2442a..f34dbf61de0 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -13,10 +13,13 @@
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
+#include "tag.h"
#include "trace2.h"
#include "tree.h"
#include "tree-walk.h"
+static const char *root_path = "";
+
struct type_and_oid_list {
enum object_type type;
struct oid_array oids;
@@ -160,12 +163,16 @@ static int walk_path(struct path_walk_context *ctx,
list = strmap_get(&ctx->paths_to_lists, path);
+ if (!list)
+ BUG("provided path '%s' that had no associated list", path);
+
if (!list->oids.nr)
return 0;
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
- (list->type == OBJ_BLOB && ctx->info->blobs))
+ (list->type == OBJ_BLOB && ctx->info->blobs) ||
+ (list->type == OBJ_TAG && ctx->info->tags))
ret = ctx->info->path_fn(path, &list->oids, list->type,
ctx->info->path_fn_data);
@@ -196,6 +203,139 @@ static void clear_paths_to_lists(struct strmap *map)
strmap_init(map);
}
+static int setup_pending_objects(struct path_walk_info *info,
+ struct path_walk_context *ctx)
+{
+ struct type_and_oid_list *tags = NULL;
+ struct type_and_oid_list *tagged_blobs = NULL;
+ struct type_and_oid_list *root_tree_list = NULL;
+
+ if (info->tags)
+ CALLOC_ARRAY(tags, 1);
+ if (info->blobs)
+ CALLOC_ARRAY(tagged_blobs, 1);
+ if (info->trees)
+ root_tree_list = strmap_get(&ctx->paths_to_lists, root_path);
+
+ /*
+ * Pending objects include:
+ * * Commits at branch tips.
+ * * Annotated tags at tag tips.
+ * * Any kind of object at lightweight tag tips.
+ * * Trees and blobs in the index (with an associated path).
+ */
+ for (size_t i = 0; i < info->revs->pending.nr; i++) {
+ struct object_array_entry *pending = info->revs->pending.objects + i;
+ struct object *obj = pending->item;
+
+ /* Commits will be picked up by revision walk. */
+ if (obj->type == OBJ_COMMIT)
+ continue;
+
+ /* Navigate annotated tag object chains. */
+ while (obj->type == OBJ_TAG) {
+ struct tag *tag = lookup_tag(info->revs->repo, &obj->oid);
+ if (!tag) {
+ error(_("failed to find tag %s"),
+ oid_to_hex(&obj->oid));
+ return -1;
+ }
+ if (tag->object.flags & SEEN)
+ break;
+ tag->object.flags |= SEEN;
+
+ if (tags)
+ oid_array_append(&tags->oids, &obj->oid);
+ obj = tag->tagged;
+ }
+
+ if (obj->type == OBJ_TAG)
+ continue;
+
+ /* We are now at a non-tag object. */
+ if (obj->flags & SEEN)
+ continue;
+ obj->flags |= SEEN;
+
+ switch (obj->type) {
+ case OBJ_TREE:
+ if (!info->trees)
+ continue;
+ if (pending->path) {
+ struct type_and_oid_list *list;
+ char *path = *pending->path ? xstrfmt("%s/", pending->path)
+ : xstrdup("");
+ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = OBJ_TREE;
+ strmap_put(&ctx->paths_to_lists, path, list);
+ }
+ oid_array_append(&list->oids, &obj->oid);
+ free(path);
+ } else {
+ /* assume a root tree, such as a lightweight tag. */
+ oid_array_append(&root_tree_list->oids, &obj->oid);
+ }
+ break;
+
+ case OBJ_BLOB:
+ if (!info->blobs)
+ continue;
+ if (pending->path) {
+ struct type_and_oid_list *list;
+ char *path = pending->path;
+ if (!(list = strmap_get(&ctx->paths_to_lists, path))) {
+ CALLOC_ARRAY(list, 1);
+ list->type = OBJ_BLOB;
+ strmap_put(&ctx->paths_to_lists, path, list);
+ }
+ oid_array_append(&list->oids, &obj->oid);
+ } else {
+ /* no path known, such as a lightweight tag of a blob. */
+ oid_array_append(&tagged_blobs->oids, &obj->oid);
+ }
+ break;
+
+ case OBJ_COMMIT:
+ /* Make sure it is in the object walk */
+ if (obj != pending->item)
+ add_pending_object(info->revs, obj, "");
+ break;
+
+ default:
+ BUG("should not see any other type here");
+ }
+ }
+
+ /*
+ * Add tag objects and tagged blobs if they exist.
+ */
+ if (tagged_blobs) {
+ if (tagged_blobs->oids.nr) {
+ const char *tagged_blob_path = "/tagged-blobs";
+ tagged_blobs->type = OBJ_BLOB;
+ push_to_stack(ctx, tagged_blob_path);
+ strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
+ } else {
+ oid_array_clear(&tagged_blobs->oids);
+ free(tagged_blobs);
+ }
+ }
+ if (tags) {
+ if (tags->oids.nr) {
+ const char *tag_path = "/tags";
+ tags->type = OBJ_TAG;
+ push_to_stack(ctx, tag_path);
+ strmap_put(&ctx->paths_to_lists, tag_path, tags);
+ } else {
+ oid_array_clear(&tags->oids);
+ free(tags);
+ }
+ }
+
+ return 0;
+}
+
/**
* Given the configuration of 'info', walk the commits based on 'info->revs' and
* call 'info->path_fn' on each discovered path.
@@ -204,8 +344,7 @@ static void clear_paths_to_lists(struct strmap *map)
*/
int walk_objects_by_path(struct path_walk_info *info)
{
- const char *root_path = "";
- int ret = 0;
+ int ret;
size_t commits_nr = 0, paths_nr = 0;
struct commit *c;
struct type_and_oid_list *root_tree_list;
@@ -224,15 +363,34 @@ int walk_objects_by_path(struct path_walk_info *info)
CALLOC_ARRAY(commit_list, 1);
commit_list->type = OBJ_COMMIT;
+ if (info->tags)
+ info->revs->tag_objects = 1;
+
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
push_to_stack(&ctx, root_path);
+ /*
+ * Set these values before preparing the walk to catch
+ * lightweight tags pointing to non-commits and indexed objects.
+ */
+ info->revs->blob_objects = info->blobs;
+ info->revs->tree_objects = info->trees;
+
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ info->revs->blob_objects = info->revs->tree_objects = 0;
+
+ trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
+ ret = setup_pending_objects(info, &ctx);
+ trace2_region_leave("path-walk", "pending-walk", info->revs->repo);
+
+ if (ret)
+ return ret;
+
while ((c = get_revision(info->revs))) {
struct object_id *oid;
struct tree *t;
@@ -280,6 +438,26 @@ int walk_objects_by_path(struct path_walk_info *info)
free(path);
}
+
+ /* Are there paths remaining? Likely they are from indexed objects. */
+ if (!strmap_empty(&ctx.paths_to_lists)) {
+ struct hashmap_iter iter;
+ struct strmap_entry *entry;
+
+ strmap_for_each_entry(&ctx.paths_to_lists, &iter, entry)
+ push_to_stack(&ctx, entry->key);
+
+ while (!ret && ctx.path_stack.nr) {
+ char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
+ ctx.path_stack.nr--;
+ paths_nr++;
+
+ ret = walk_path(&ctx, path);
+
+ free(path);
+ }
+ }
+
trace2_data_intmax("path-walk", ctx.repo, "paths", paths_nr);
trace2_region_leave("path-walk", "path-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index 2cafc71e153..3679fa7a859 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -39,12 +39,14 @@ struct path_walk_info {
int commits;
int trees;
int blobs;
+ int tags;
};
#define PATH_WALK_INFO_INIT { \
.blobs = 1, \
.trees = 1, \
.commits = 1, \
+ .tags = 1, \
}
void path_walk_info_init(struct path_walk_info *info);
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index a57a05a6391..56289859e69 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -23,6 +23,7 @@ struct path_walk_test_data {
uintmax_t commit_nr;
uintmax_t tree_nr;
uintmax_t blob_nr;
+ uintmax_t tag_nr;
};
static int emit_block(const char *path, struct oid_array *oids,
@@ -37,11 +38,18 @@ static int emit_block(const char *path, struct oid_array *oids,
tdata->blob_nr += oids->nr;
else if (type == OBJ_COMMIT)
tdata->commit_nr += oids->nr;
+ else if (type == OBJ_TAG)
+ tdata->tag_nr += oids->nr;
else
BUG("we do not understand this type");
typestr = type_name(type);
+ /* This should never be output during tests. */
+ if (!oids->nr)
+ printf("%"PRIuMAX":%s:%s:EMPTY\n",
+ tdata->batch_nr, typestr, path);
+
for (size_t i = 0; i < oids->nr; i++)
printf("%"PRIuMAX":%s:%s:%s\n",
tdata->batch_nr, typestr, path,
@@ -62,6 +70,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of blob objects")),
OPT_BOOL(0, "commits", &info.commits,
N_("toggle inclusion of commit objects")),
+ OPT_BOOL(0, "tags", &info.tags,
+ N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
OPT_END(),
@@ -87,8 +97,9 @@ int cmd__path_walk(int argc, const char **argv)
printf("commits:%" PRIuMAX "\n"
"trees:%" PRIuMAX "\n"
- "blobs:%" PRIuMAX "\n",
- data.commit_nr, data.tree_nr, data.blob_nr);
+ "blobs:%" PRIuMAX "\n"
+ "tags:%" PRIuMAX "\n",
+ data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
release_revisions(&revs);
return res;
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 4a4939a1b02..1f3d2e0cb76 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -9,29 +9,142 @@ test_description='direct path-walk API tests'
test_expect_success 'setup test repository' '
git checkout -b base &&
+ # Make some objects that will only be reachable
+ # via non-commit tags.
+ mkdir child &&
+ echo file >child/file &&
+ git add child &&
+ git commit -m "will abandon" &&
+ git tag -a -m "tree" tree-tag HEAD^{tree} &&
+ echo file2 >file2 &&
+ git add file2 &&
+ git commit --amend -m "will abandon" &&
+ git tag tree-tag2 HEAD^{tree} &&
+
+ echo blob >file &&
+ blob_oid=$(git hash-object -t blob -w --stdin <file) &&
+ git tag -a -m "blob" blob-tag "$blob_oid" &&
+ echo blob2 >file2 &&
+ blob2_oid=$(git hash-object -t blob -w --stdin <file2) &&
+ git tag blob-tag2 "$blob2_oid" &&
+
+ rm -fr child file file2 &&
+
mkdir left &&
mkdir right &&
echo a >a &&
echo b >left/b &&
echo c >right/c &&
git add . &&
- git commit -m "first" &&
+ git commit --amend -m "first" &&
+ git tag -m "first" first HEAD &&
echo d >right/d &&
git add right &&
git commit -m "second" &&
+ git tag -a -m "second (under)" second.1 HEAD &&
+ git tag -a -m "second (top)" second.2 second.1 &&
+ # Set up file/dir collision in history.
+ rm a &&
+ mkdir a &&
+ echo a >a/a &&
echo bb >left/b &&
- git commit -a -m "third" &&
+ git add a left &&
+ git commit -m "third" &&
+ git tag -a -m "third" third &&
git checkout -b topic HEAD~1 &&
echo cc >right/c &&
- git commit -a -m "topic"
+ git commit -a -m "topic" &&
+ git tag -a -m "fourth" fourth
'
test_expect_success 'all' '
test-tool path-walk -- --all >out &&
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base)
+ 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~2)
+ 1:tag:/tags:$(git rev-parse refs/tags/first)
+ 1:tag:/tags:$(git rev-parse refs/tags/second.1)
+ 1:tag:/tags:$(git rev-parse refs/tags/second.2)
+ 1:tag:/tags:$(git rev-parse refs/tags/third)
+ 1:tag:/tags:$(git rev-parse refs/tags/fourth)
+ 1:tag:/tags:$(git rev-parse refs/tags/tree-tag)
+ 1:tag:/tags:$(git rev-parse refs/tags/blob-tag)
+ 2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
+ 2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ 3:tree::$(git rev-parse topic^{tree})
+ 3:tree::$(git rev-parse base^{tree})
+ 3:tree::$(git rev-parse base~1^{tree})
+ 3:tree::$(git rev-parse base~2^{tree})
+ 3:tree::$(git rev-parse refs/tags/tree-tag^{})
+ 3:tree::$(git rev-parse refs/tags/tree-tag2^{})
+ 4:blob:a:$(git rev-parse base~2:a)
+ 5:tree:right/:$(git rev-parse topic:right)
+ 5:tree:right/:$(git rev-parse base~1:right)
+ 5:tree:right/:$(git rev-parse base~2:right)
+ 6:blob:right/d:$(git rev-parse base~1:right/d)
+ 7:blob:right/c:$(git rev-parse base~2:right/c)
+ 7:blob:right/c:$(git rev-parse topic:right/c)
+ 8:tree:left/:$(git rev-parse base:left)
+ 8:tree:left/:$(git rev-parse base~2:left)
+ 9:blob:left/b:$(git rev-parse base~2:left/b)
+ 9:blob:left/b:$(git rev-parse base:left/b)
+ 10:tree:a/:$(git rev-parse base:a)
+ 11:blob:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
+ 12:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
+ 13:blob:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+ blobs:10
+ commits:4
+ tags:7
+ trees:13
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'indexed objects' '
+ test_when_finished git reset --hard &&
+
+ # stage change into index, adding a blob but
+ # also invalidating the cache-tree for the root
+ # and the "left" directory.
+ echo bogus >left/c &&
+ git add left &&
+
+ test-tool path-walk -- --indexed-objects >out &&
+
+ cat >expect <<-EOF &&
+ 0:blob:a:$(git rev-parse HEAD:a)
+ 1:blob:left/b:$(git rev-parse HEAD:left/b)
+ 2:blob:left/c:$(git rev-parse :left/c)
+ 3:blob:right/c:$(git rev-parse HEAD:right/c)
+ 4:blob:right/d:$(git rev-parse HEAD:right/d)
+ 5:tree:right/:$(git rev-parse topic:right)
+ blobs:5
+ commits:0
+ tags:0
+ trees:1
+ EOF
+
+ test_cmp_sorted expect out
+'
+
+test_expect_success 'branches and indexed objects mix well' '
+ test_when_finished git reset --hard &&
+
+ # stage change into index, adding a blob but
+ # also invalidating the cache-tree for the root
+ # and the "right" directory.
+ echo fake >right/d &&
+ git add right &&
+
+ test-tool path-walk -- --indexed-objects --branches >out &&
+
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
0:commit::$(git rev-parse base)
@@ -41,20 +154,23 @@ test_expect_success 'all' '
1:tree::$(git rev-parse base^{tree})
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
- 2:tree:right/:$(git rev-parse topic:right)
- 2:tree:right/:$(git rev-parse base~1:right)
- 2:tree:right/:$(git rev-parse base~2:right)
- 3:blob:right/d:$(git rev-parse base~1:right/d)
- 4:blob:right/c:$(git rev-parse base~2:right/c)
- 4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse base:left)
- 5:tree:left/:$(git rev-parse base~2:left)
- 6:blob:left/b:$(git rev-parse base~2:left/b)
- 6:blob:left/b:$(git rev-parse base:left/b)
- 7:blob:a:$(git rev-parse base~2:a)
- blobs:6
+ 2:blob:a:$(git rev-parse base~2:a)
+ 3:tree:right/:$(git rev-parse topic:right)
+ 3:tree:right/:$(git rev-parse base~1:right)
+ 3:tree:right/:$(git rev-parse base~2:right)
+ 4:blob:right/d:$(git rev-parse base~1:right/d)
+ 4:blob:right/d:$(git rev-parse :right/d)
+ 5:blob:right/c:$(git rev-parse base~2:right/c)
+ 5:blob:right/c:$(git rev-parse topic:right/c)
+ 6:tree:left/:$(git rev-parse base:left)
+ 6:tree:left/:$(git rev-parse base~2:left)
+ 7:blob:left/b:$(git rev-parse base:left/b)
+ 7:blob:left/b:$(git rev-parse base~2:left/b)
+ 8:tree:a/:$(git rev-parse refs/tags/third:a)
+ blobs:7
commits:4
- trees:9
+ tags:0
+ trees:10
EOF
test_cmp_sorted expect out
@@ -81,6 +197,7 @@ test_expect_success 'topic only' '
7:blob:a:$(git rev-parse base~2:a)
blobs:5
commits:3
+ tags:0
trees:7
EOF
@@ -101,6 +218,7 @@ test_expect_success 'topic, not base' '
7:blob:a:$(git rev-parse topic:a)
blobs:4
commits:1
+ tags:0
trees:3
EOF
@@ -112,13 +230,14 @@ test_expect_success 'topic, not base, only blobs' '
-- topic --not base >out &&
cat >expect <<-EOF &&
- commits:0
- trees:0
0:blob:right/d:$(git rev-parse topic:right/d)
1:blob:right/c:$(git rev-parse topic:right/c)
2:blob:left/b:$(git rev-parse topic:left/b)
3:blob:a:$(git rev-parse topic:a)
blobs:4
+ commits:0
+ tags:0
+ trees:0
EOF
test_cmp_sorted expect out
@@ -133,8 +252,9 @@ test_expect_success 'topic, not base, only commits' '
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
commits:1
- trees:0
blobs:0
+ tags:0
+ trees:0
EOF
test_cmp_sorted expect out
@@ -145,12 +265,13 @@ test_expect_success 'topic, not base, only trees' '
-- topic --not base >out &&
cat >expect <<-EOF &&
- commits:0
0:tree::$(git rev-parse topic^{tree})
1:tree:right/:$(git rev-parse topic:right)
2:tree:left/:$(git rev-parse topic:left)
- trees:3
+ commits:0
blobs:0
+ tags:0
+ trees:3
EOF
test_cmp_sorted expect out
@@ -174,10 +295,33 @@ test_expect_success 'topic, not base, boundary' '
7:blob:a:$(git rev-parse base~1:a)
blobs:5
commits:2
+ tags:0
trees:5
EOF
test_cmp_sorted expect out
'
+test_expect_success 'trees are reported exactly once' '
+ test_when_finished "rm -rf unique-trees" &&
+ test_create_repo unique-trees &&
+ (
+ cd unique-trees &&
+ mkdir initial &&
+ test_commit initial/file &&
+
+ git switch -c move-to-top &&
+ git mv initial/file.t ./ &&
+ test_tick &&
+ git commit -m moved &&
+
+ git update-ref refs/heads/other HEAD
+ ) &&
+
+ test-tool -C unique-trees path-walk -- --all >out &&
+ tree=$(git -C unique-trees rev-parse HEAD:) &&
+ grep "$tree" out >out-filtered &&
+ test_line_count = 1 out-filtered
+'
+
test_done
--
gitgitgadget
* [PATCH v4 6/7] path-walk: mark trees and blobs as UNINTERESTING
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
` (4 preceding siblings ...)
2024-12-20 16:21 ` [PATCH v4 5/7] path-walk: visit tags and cached objects Derrick Stolee via GitGitGadget
@ 2024-12-20 16:21 ` Derrick Stolee via GitGitGadget
2024-12-20 16:21 ` [PATCH v4 7/7] path-walk: reorder object visits Derrick Stolee via GitGitGadget
6 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:21 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
When the input rev_info has UNINTERESTING starting points, we want to be
sure that the UNINTERESTING flag is passed appropriately through the
objects. To match how this is done in places such as 'git pack-objects', we
use the mark_edges_uninteresting() method.
This method has an option for using the "sparse" walk, which is similar in
spirit to the path-walk API's walk. To be sure to keep it independent, add a
new 'prune_all_uninteresting' option to the path_walk_info struct.
To check how the UNINTERESTING flag is spread through our objects, extend the
'test-tool path-walk' command to output whether or not an object has that
flag. This changes our tests significantly, including the removal of some
objects that were previously visited due to the incomplete implementation.
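As a rough illustration of the pruning rule (not the code in this patch), the
check on a path's batch can be sketched in standalone C as below. All of the
sketch_* names and the SKETCH_UNINTERESTING bit are hypothetical stand-ins for
Git's object flags and 'struct type_and_oid_list'; the sketch assumes objects
are already in memory rather than looked up by object ID.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit; Git's real UNINTERESTING flag lives elsewhere. */
#define SKETCH_UNINTERESTING (1u << 1)

/* Hypothetical stand-in for an in-memory object with a flags field. */
struct sketch_object {
	unsigned flags;
};

/* Hypothetical stand-in for the per-path batch of objects. */
struct sketch_list {
	struct sketch_object **objects;
	size_t nr;
	/* Set when an object without the flag was appended to the batch. */
	int maybe_interesting;
};

/*
 * Return 1 when the path can be skipped, that is, when every object in
 * the batch carries the uninteresting flag at the time of the check.
 */
int sketch_should_prune(struct sketch_list *list)
{
	/* Nothing interesting was ever appended: skip without scanning. */
	if (!list->maybe_interesting)
		return 1;

	/*
	 * Flags may have been set after the objects were appended, so
	 * rescan and stop at the first object that is still interesting.
	 */
	list->maybe_interesting = 0;
	for (size_t i = 0; !list->maybe_interesting && i < list->nr; i++)
		if (!(list->objects[i]->flags & SKETCH_UNINTERESTING))
			list->maybe_interesting = 1;

	return !list->maybe_interesting;
}
```

The early return avoids a scan for batches that never held an interesting
object, while the rescan covers objects that were flagged only after they
were queued.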
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
Documentation/technical/api-path-walk.txt | 8 +++
path-walk.c | 74 +++++++++++++++++++++
path-walk.h | 8 +++
t/helper/test-path-walk.c | 12 +++-
t/t6601-path-walk.sh | 79 +++++++++++++++++------
5 files changed, 159 insertions(+), 22 deletions(-)
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 6022c381b7c..7075d0d5ab5 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -48,6 +48,14 @@ commits.
While it is possible to walk only commits in this way, consumers would be
better off using the revision walk API instead.
+`prune_all_uninteresting`::
+ By default, all reachable paths are emitted by the path-walk API.
+ This option allows consumers to declare that they are not
+ interested in paths where all included objects are marked with the
+ `UNINTERESTING` flag. This requires using the `boundary` option in
+ the revision walk so that the walk emits commits marked with the
+ `UNINTERESTING` flag.
+
Examples
--------
diff --git a/path-walk.c b/path-walk.c
index f34dbf61de0..4013569e9e4 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -8,6 +8,7 @@
#include "dir.h"
#include "hashmap.h"
#include "hex.h"
+#include "list-objects.h"
#include "object.h"
#include "oid-array.h"
#include "revision.h"
@@ -23,6 +24,7 @@ static const char *root_path = "";
struct type_and_oid_list {
enum object_type type;
struct oid_array oids;
+ int maybe_interesting;
};
#define TYPE_AND_OID_LIST_INIT { \
@@ -142,6 +144,10 @@ static int add_tree_entries(struct path_walk_context *ctx,
strmap_put(&ctx->paths_to_lists, path.buf, list);
}
push_to_stack(ctx, path.buf);
+
+ if (!(o->flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+
oid_array_append(&list->oids, &entry.oid);
}
@@ -169,6 +175,43 @@ static int walk_path(struct path_walk_context *ctx,
if (!list->oids.nr)
return 0;
+ if (ctx->info->prune_all_uninteresting) {
+ /*
+ * This is true if all objects were UNINTERESTING
+ * when added to the list.
+ */
+ if (!list->maybe_interesting)
+ return 0;
+
+ /*
+ * But it's still possible that the objects were set
+ * as UNINTERESTING after being added. Do a quick check.
+ */
+ list->maybe_interesting = 0;
+ for (size_t i = 0;
+ !list->maybe_interesting && i < list->oids.nr;
+ i++) {
+ if (list->type == OBJ_TREE) {
+ struct tree *t = lookup_tree(ctx->repo,
+ &list->oids.oid[i]);
+ if (t && !(t->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else if (list->type == OBJ_BLOB) {
+ struct blob *b = lookup_blob(ctx->repo,
+ &list->oids.oid[i]);
+ if (b && !(b->object.flags & UNINTERESTING))
+ list->maybe_interesting = 1;
+ } else {
+ /* Tags are always interesting if visited. */
+ list->maybe_interesting = 1;
+ }
+ }
+
+ /* We have confirmed that all objects are UNINTERESTING. */
+ if (!list->maybe_interesting)
+ return 0;
+ }
+
/* Evaluate function pointer on this data, if requested. */
if ((list->type == OBJ_TREE && ctx->info->trees) ||
(list->type == OBJ_BLOB && ctx->info->blobs) ||
@@ -203,6 +246,26 @@ static void clear_paths_to_lists(struct strmap *map)
strmap_init(map);
}
+static struct repository *edge_repo;
+static struct type_and_oid_list *edge_tree_list;
+
+static void show_edge(struct commit *commit)
+{
+ struct tree *t = repo_get_commit_tree(edge_repo, commit);
+
+ if (!t)
+ return;
+
+ if (commit->object.flags & UNINTERESTING)
+ t->object.flags |= UNINTERESTING;
+
+ if (t->object.flags & SEEN)
+ return;
+ t->object.flags |= SEEN;
+
+ oid_array_append(&edge_tree_list->oids, &t->object.oid);
+}
+
static int setup_pending_objects(struct path_walk_info *info,
struct path_walk_context *ctx)
{
@@ -314,6 +377,7 @@ static int setup_pending_objects(struct path_walk_info *info,
if (tagged_blobs->oids.nr) {
const char *tagged_blob_path = "/tagged-blobs";
tagged_blobs->type = OBJ_BLOB;
+ tagged_blobs->maybe_interesting = 1;
push_to_stack(ctx, tagged_blob_path);
strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
} else {
@@ -325,6 +389,7 @@ static int setup_pending_objects(struct path_walk_info *info,
if (tags->oids.nr) {
const char *tag_path = "/tags";
tags->type = OBJ_TAG;
+ tags->maybe_interesting = 1;
push_to_stack(ctx, tag_path);
strmap_put(&ctx->paths_to_lists, tag_path, tags);
} else {
@@ -369,6 +434,7 @@ int walk_objects_by_path(struct path_walk_info *info)
/* Insert a single list for the root tree into the paths. */
CALLOC_ARRAY(root_tree_list, 1);
root_tree_list->type = OBJ_TREE;
+ root_tree_list->maybe_interesting = 1;
strmap_put(&ctx.paths_to_lists, root_path, root_tree_list);
push_to_stack(&ctx, root_path);
@@ -382,6 +448,14 @@ int walk_objects_by_path(struct path_walk_info *info)
if (prepare_revision_walk(info->revs))
die(_("failed to setup revision walk"));
+ /* Walk trees to mark them as UNINTERESTING. */
+ edge_repo = info->revs->repo;
+ edge_tree_list = root_tree_list;
+ mark_edges_uninteresting(info->revs, show_edge,
+ info->prune_all_uninteresting);
+ edge_repo = NULL;
+ edge_tree_list = NULL;
+
info->revs->blob_objects = info->revs->tree_objects = 0;
trace2_region_enter("path-walk", "pending-walk", info->revs->repo);
diff --git a/path-walk.h b/path-walk.h
index 3679fa7a859..414d6db23c2 100644
--- a/path-walk.h
+++ b/path-walk.h
@@ -40,6 +40,14 @@ struct path_walk_info {
int trees;
int blobs;
int tags;
+
+ /**
+ * When 'prune_all_uninteresting' is set and a path has all objects
+ * marked as UNINTERESTING, then the path-walk will not visit those
+ * objects. It will not call path_fn on those objects and will not
+ * walk the children of such trees.
+ */
+ int prune_all_uninteresting;
};
#define PATH_WALK_INFO_INIT { \
diff --git a/t/helper/test-path-walk.c b/t/helper/test-path-walk.c
index 56289859e69..7f2d409c5bc 100644
--- a/t/helper/test-path-walk.c
+++ b/t/helper/test-path-walk.c
@@ -50,10 +50,14 @@ static int emit_block(const char *path, struct oid_array *oids,
printf("%"PRIuMAX":%s:%s:EMPTY\n",
tdata->batch_nr, typestr, path);
- for (size_t i = 0; i < oids->nr; i++)
- printf("%"PRIuMAX":%s:%s:%s\n",
+ for (size_t i = 0; i < oids->nr; i++) {
+ struct object *o = lookup_unknown_object(the_repository,
+ &oids->oid[i]);
+ printf("%"PRIuMAX":%s:%s:%s%s\n",
tdata->batch_nr, typestr, path,
- oid_to_hex(&oids->oid[i]));
+ oid_to_hex(&oids->oid[i]),
+ o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
+ }
tdata->batch_nr++;
return 0;
@@ -74,6 +78,8 @@ int cmd__path_walk(int argc, const char **argv)
N_("toggle inclusion of tag objects")),
OPT_BOOL(0, "trees", &info.trees,
N_("toggle inclusion of tree objects")),
+ OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
+ N_("toggle pruning of uninteresting paths")),
OPT_END(),
};
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index 1f3d2e0cb76..a317cdf289e 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -211,11 +211,11 @@ test_expect_success 'topic, not base' '
0:commit::$(git rev-parse topic)
1:tree::$(git rev-parse topic^{tree})
2:tree:right/:$(git rev-parse topic:right)
- 3:blob:right/d:$(git rev-parse topic:right/d)
+ 3:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse topic:left)
- 6:blob:left/b:$(git rev-parse topic:left/b)
- 7:blob:a:$(git rev-parse topic:a)
+ 5:tree:left/:$(git rev-parse topic:left):UNINTERESTING
+ 6:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 7:blob:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
commits:1
tags:0
@@ -225,15 +225,38 @@ test_expect_success 'topic, not base' '
test_cmp_sorted expect out
'
+test_expect_success 'fourth, blob-tag2, not base' '
+ test-tool path-walk -- fourth blob-tag2 --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 1:tag:/tags:$(git rev-parse fourth)
+ 2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
+ 3:tree::$(git rev-parse topic^{tree})
+ 4:tree:right/:$(git rev-parse topic:right)
+ 5:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ 6:blob:right/c:$(git rev-parse topic:right/c)
+ 7:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 8:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 9:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ blobs:5
+ commits:1
+ tags:1
+ trees:3
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_expect_success 'topic, not base, only blobs' '
test-tool path-walk --no-trees --no-commits \
-- topic --not base >out &&
cat >expect <<-EOF &&
- 0:blob:right/d:$(git rev-parse topic:right/d)
+ 0:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
1:blob:right/c:$(git rev-parse topic:right/c)
- 2:blob:left/b:$(git rev-parse topic:left/b)
- 3:blob:a:$(git rev-parse topic:a)
+ 2:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 3:blob:a:$(git rev-parse topic:a):UNINTERESTING
blobs:4
commits:0
tags:0
@@ -267,7 +290,7 @@ test_expect_success 'topic, not base, only trees' '
cat >expect <<-EOF &&
0:tree::$(git rev-parse topic^{tree})
1:tree:right/:$(git rev-parse topic:right)
- 2:tree:left/:$(git rev-parse topic:left)
+ 2:tree:left/:$(git rev-parse topic:left):UNINTERESTING
commits:0
blobs:0
tags:0
@@ -282,17 +305,17 @@ test_expect_success 'topic, not base, boundary' '
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
- 0:commit::$(git rev-parse base~1)
+ 0:commit::$(git rev-parse base~1):UNINTERESTING
1:tree::$(git rev-parse topic^{tree})
- 1:tree::$(git rev-parse base~1^{tree})
+ 1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
2:tree:right/:$(git rev-parse topic:right)
- 2:tree:right/:$(git rev-parse base~1:right)
- 3:blob:right/d:$(git rev-parse base~1:right/d)
- 4:blob:right/c:$(git rev-parse base~1:right/c)
+ 2:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
+ 3:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
+ 4:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse base~1:left)
- 6:blob:left/b:$(git rev-parse base~1:left/b)
- 7:blob:a:$(git rev-parse base~1:a)
+ 5:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 6:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 7:blob:a:$(git rev-parse base~1:a):UNINTERESTING
blobs:5
commits:2
tags:0
@@ -302,6 +325,27 @@ test_expect_success 'topic, not base, boundary' '
test_cmp_sorted expect out
'
+test_expect_success 'topic, not base, boundary with pruning' '
+ test-tool path-walk --prune -- --boundary topic --not base >out &&
+
+ cat >expect <<-EOF &&
+ 0:commit::$(git rev-parse topic)
+ 0:commit::$(git rev-parse base~1):UNINTERESTING
+ 1:tree::$(git rev-parse topic^{tree})
+ 1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
+ 2:tree:right/:$(git rev-parse topic:right)
+ 2:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
+ 3:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ 3:blob:right/c:$(git rev-parse topic:right/c)
+ blobs:2
+ commits:2
+ tags:0
+ trees:4
+ EOF
+
+ test_cmp_sorted expect out
+'
+
test_expect_success 'trees are reported exactly once' '
test_when_finished "rm -rf unique-trees" &&
test_create_repo unique-trees &&
@@ -309,15 +353,12 @@ test_expect_success 'trees are reported exactly once' '
cd unique-trees &&
mkdir initial &&
test_commit initial/file &&
-
git switch -c move-to-top &&
git mv initial/file.t ./ &&
test_tick &&
git commit -m moved &&
-
git update-ref refs/heads/other HEAD
) &&
-
test-tool -C unique-trees path-walk -- --all >out &&
tree=$(git -C unique-trees rev-parse HEAD:) &&
grep "$tree" out >out-filtered &&
--
gitgitgadget
* [PATCH v4 7/7] path-walk: reorder object visits
2024-12-20 16:21 ` [PATCH v4 " Derrick Stolee via GitGitGadget
` (5 preceding siblings ...)
2024-12-20 16:21 ` [PATCH v4 6/7] path-walk: mark trees and blobs as UNINTERESTING Derrick Stolee via GitGitGadget
@ 2024-12-20 16:21 ` Derrick Stolee via GitGitGadget
6 siblings, 0 replies; 67+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-12-20 16:21 UTC (permalink / raw)
To: git
Cc: gitster, johannes.schindelin, peff, ps, me, johncai86, newren,
christian.couder, kristofferhaugsbakk, jonathantanmy,
karthik nayak, Derrick Stolee, Derrick Stolee
From: Derrick Stolee <stolee@gmail.com>
The path-walk API currently uses a stack-based approach to recursing
through the list of paths within the repository. This guarantees that
after a tree path is explored, all paths contained within that tree path
will be explored before continuing to explore siblings of that tree
path.
The initial motivation of this depth-first approach was to minimize
memory pressure while exploring the repository. A breadth-first approach
would have too many "active" paths being stored in the paths_to_lists
map.
We can take this approach one step further by making sure that blob
paths are visited before tree paths. This allows the API to free the
memory for these blob objects before continuing to perform the
depth-first search. This modifies the order in which we visit siblings,
but does not change the fact that we are performing depth-first search.
To achieve this goal, use a priority queue with a custom sorting method.
The sort needs to handle tags, blobs, and trees (commits are handled
slightly differently). When objects share a type, we can sort by path
name. This keeps the children of the path most recently popped from the
stack ahead of the rest of the paths in the stack, since they agree in
prefix up to and including a directory separator. When the types are
different, we can prefer tags over other types and blobs over trees.
This causes significant adjustments to t6601-path-walk.sh to rearrange
the order of the visited paths.
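The ordering rule described above can be sketched in standalone C as follows.
This is only an illustration under stated assumptions: the sketch_* names are
hypothetical, each entry carries its type directly instead of looking it up by
path in a map, and qsort() over a fixed array stands in for the priority
queue, so it shows the comparison rule rather than the full walk order (in the
walk, child paths are queued only after their parent tree is visited).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical object kinds; commits are not queued by path. */
enum sketch_type { SKETCH_TAG, SKETCH_BLOB, SKETCH_TREE };

/* Hypothetical queue entry pairing a path with the type of its batch. */
struct sketch_entry {
	enum sketch_type type;
	const char *path;
};

/* Negative result means "one" should be visited before "two". */
int sketch_compare(const void *one, const void *two)
{
	const struct sketch_entry *a = one;
	const struct sketch_entry *b = two;

	/* Same type: fall back to comparing path names. */
	if (a->type == b->type)
		return strcmp(a->path, b->path);

	/* Prefer tags first. */
	if (a->type == SKETCH_TAG)
		return -1;
	if (b->type == SKETCH_TAG)
		return 1;

	/* Then prefer blobs over trees. */
	if (a->type == SKETCH_BLOB)
		return -1;
	return 1;
}

/* Sort a fixed set of entries with the rule above. */
void sketch_sort(struct sketch_entry *entries, size_t nr)
{
	qsort(entries, nr, sizeof(*entries), sketch_compare);
}
```

With this rule, blob paths that are queued alongside a tree path are handed
out before that tree, which is what lets their batches be released before the
walk descends further.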
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
path-walk.c | 60 ++++++++++++++++-----
t/t6601-path-walk.sh | 124 +++++++++++++++++++++----------------------
2 files changed, 110 insertions(+), 74 deletions(-)
diff --git a/path-walk.c b/path-walk.c
index 4013569e9e4..136ec08fb0e 100644
--- a/path-walk.c
+++ b/path-walk.c
@@ -11,6 +11,7 @@
#include "list-objects.h"
#include "object.h"
#include "oid-array.h"
+#include "prio-queue.h"
#include "revision.h"
#include "string-list.h"
#include "strmap.h"
@@ -49,16 +50,50 @@ struct path_walk_context {
struct strmap paths_to_lists;
/**
- * Store the current list of paths in a stack, to
- * facilitate depth-first-search without recursion.
+ * Store the current list of paths in a priority queue,
+ * using object type as a sorting mechanism, mostly to
+ * make sure blobs are popped off the stack first. No
+ * other sort is made, so within each object type it acts
+ * like a stack and performs a DFS within the trees.
*
* Use path_stack_pushed to indicate whether a path
* was previously added to path_stack.
*/
- struct string_list path_stack;
+ struct prio_queue path_stack;
struct strset path_stack_pushed;
};
+static int compare_by_type(const void *one, const void *two, void *cb_data)
+{
+ struct type_and_oid_list *list1, *list2;
+ const char *str1 = one;
+ const char *str2 = two;
+ struct path_walk_context *ctx = cb_data;
+
+ list1 = strmap_get(&ctx->paths_to_lists, str1);
+ list2 = strmap_get(&ctx->paths_to_lists, str2);
+
+ /*
+ * If object types are equal, then use path comparison.
+ */
+ if (!list1 || !list2 || list1->type == list2->type)
+ return strcmp(str1, str2);
+
+ /* Prefer tags to be popped off first. */
+ if (list1->type == OBJ_TAG)
+ return -1;
+ if (list2->type == OBJ_TAG)
+ return 1;
+
+ /* Prefer blobs to be popped off second. */
+ if (list1->type == OBJ_BLOB)
+ return -1;
+ if (list2->type == OBJ_BLOB)
+ return 1;
+
+ return 0;
+}
+
static void push_to_stack(struct path_walk_context *ctx,
const char *path)
{
@@ -66,7 +101,7 @@ static void push_to_stack(struct path_walk_context *ctx,
return;
strset_add(&ctx->path_stack_pushed, path);
- string_list_append(&ctx->path_stack, path);
+ prio_queue_put(&ctx->path_stack, xstrdup(path));
}
static int add_tree_entries(struct path_walk_context *ctx,
@@ -378,8 +413,8 @@ static int setup_pending_objects(struct path_walk_info *info,
const char *tagged_blob_path = "/tagged-blobs";
tagged_blobs->type = OBJ_BLOB;
tagged_blobs->maybe_interesting = 1;
- push_to_stack(ctx, tagged_blob_path);
strmap_put(&ctx->paths_to_lists, tagged_blob_path, tagged_blobs);
+ push_to_stack(ctx, tagged_blob_path);
} else {
oid_array_clear(&tagged_blobs->oids);
free(tagged_blobs);
@@ -390,8 +425,8 @@ static int setup_pending_objects(struct path_walk_info *info,
const char *tag_path = "/tags";
tags->type = OBJ_TAG;
tags->maybe_interesting = 1;
- push_to_stack(ctx, tag_path);
strmap_put(&ctx->paths_to_lists, tag_path, tags);
+ push_to_stack(ctx, tag_path);
} else {
oid_array_clear(&tags->oids);
free(tags);
@@ -418,7 +453,10 @@ int walk_objects_by_path(struct path_walk_info *info)
.repo = info->revs->repo,
.revs = info->revs,
.info = info,
- .path_stack = STRING_LIST_INIT_DUP,
+ .path_stack = {
+ .compare = compare_by_type,
+ .cb_data = &ctx
+ },
.path_stack_pushed = STRSET_INIT,
.paths_to_lists = STRMAP_INIT
};
@@ -504,8 +542,7 @@ int walk_objects_by_path(struct path_walk_info *info)
trace2_region_enter("path-walk", "path-walk", info->revs->repo);
while (!ret && ctx.path_stack.nr) {
- char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
- ctx.path_stack.nr--;
+ char *path = prio_queue_get(&ctx.path_stack);
paths_nr++;
ret = walk_path(&ctx, path);
@@ -522,8 +559,7 @@ int walk_objects_by_path(struct path_walk_info *info)
push_to_stack(&ctx, entry->key);
while (!ret && ctx.path_stack.nr) {
- char *path = ctx.path_stack.items[ctx.path_stack.nr - 1].string;
- ctx.path_stack.nr--;
+ char *path = prio_queue_get(&ctx.path_stack);
paths_nr++;
ret = walk_path(&ctx, path);
@@ -537,7 +573,7 @@ int walk_objects_by_path(struct path_walk_info *info)
clear_paths_to_lists(&ctx.paths_to_lists);
strset_clear(&ctx.path_stack_pushed);
- string_list_clear(&ctx.path_stack, 0);
+ clear_prio_queue(&ctx.path_stack);
return ret;
}
diff --git a/t/t6601-path-walk.sh b/t/t6601-path-walk.sh
index a317cdf289e..5f04acb8a2f 100755
--- a/t/t6601-path-walk.sh
+++ b/t/t6601-path-walk.sh
@@ -84,20 +84,20 @@ test_expect_success 'all' '
3:tree::$(git rev-parse refs/tags/tree-tag^{})
3:tree::$(git rev-parse refs/tags/tree-tag2^{})
4:blob:a:$(git rev-parse base~2:a)
- 5:tree:right/:$(git rev-parse topic:right)
- 5:tree:right/:$(git rev-parse base~1:right)
- 5:tree:right/:$(git rev-parse base~2:right)
- 6:blob:right/d:$(git rev-parse base~1:right/d)
- 7:blob:right/c:$(git rev-parse base~2:right/c)
- 7:blob:right/c:$(git rev-parse topic:right/c)
- 8:tree:left/:$(git rev-parse base:left)
- 8:tree:left/:$(git rev-parse base~2:left)
- 9:blob:left/b:$(git rev-parse base~2:left/b)
- 9:blob:left/b:$(git rev-parse base:left/b)
- 10:tree:a/:$(git rev-parse base:a)
- 11:blob:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
- 12:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
- 13:blob:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+ 5:blob:file2:$(git rev-parse refs/tags/tree-tag2^{}:file2)
+ 6:tree:a/:$(git rev-parse base:a)
+ 7:tree:child/:$(git rev-parse refs/tags/tree-tag:child)
+ 8:blob:child/file:$(git rev-parse refs/tags/tree-tag:child/file)
+ 9:tree:left/:$(git rev-parse base:left)
+ 9:tree:left/:$(git rev-parse base~2:left)
+ 10:blob:left/b:$(git rev-parse base~2:left/b)
+ 10:blob:left/b:$(git rev-parse base:left/b)
+ 11:tree:right/:$(git rev-parse topic:right)
+ 11:tree:right/:$(git rev-parse base~1:right)
+ 11:tree:right/:$(git rev-parse base~2:right)
+ 12:blob:right/c:$(git rev-parse base~2:right/c)
+ 12:blob:right/c:$(git rev-parse topic:right/c)
+ 13:blob:right/d:$(git rev-parse base~1:right/d)
blobs:10
commits:4
tags:7
@@ -154,19 +154,19 @@ test_expect_success 'branches and indexed objects mix well' '
1:tree::$(git rev-parse base^{tree})
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
- 2:blob:a:$(git rev-parse base~2:a)
- 3:tree:right/:$(git rev-parse topic:right)
- 3:tree:right/:$(git rev-parse base~1:right)
- 3:tree:right/:$(git rev-parse base~2:right)
- 4:blob:right/d:$(git rev-parse base~1:right/d)
- 4:blob:right/d:$(git rev-parse :right/d)
- 5:blob:right/c:$(git rev-parse base~2:right/c)
- 5:blob:right/c:$(git rev-parse topic:right/c)
- 6:tree:left/:$(git rev-parse base:left)
- 6:tree:left/:$(git rev-parse base~2:left)
- 7:blob:left/b:$(git rev-parse base:left/b)
- 7:blob:left/b:$(git rev-parse base~2:left/b)
- 8:tree:a/:$(git rev-parse refs/tags/third:a)
+ 2:tree:a/:$(git rev-parse refs/tags/third:a)
+ 3:tree:left/:$(git rev-parse base:left)
+ 3:tree:left/:$(git rev-parse base~2:left)
+ 4:blob:left/b:$(git rev-parse base:left/b)
+ 4:blob:left/b:$(git rev-parse base~2:left/b)
+ 5:tree:right/:$(git rev-parse topic:right)
+ 5:tree:right/:$(git rev-parse base~1:right)
+ 5:tree:right/:$(git rev-parse base~2:right)
+ 6:blob:right/c:$(git rev-parse base~2:right/c)
+ 6:blob:right/c:$(git rev-parse topic:right/c)
+ 7:blob:right/d:$(git rev-parse base~1:right/d)
+ 7:blob:right/d:$(git rev-parse :right/d)
+ 8:blob:a:$(git rev-parse base~2:a)
blobs:7
commits:4
tags:0
@@ -186,15 +186,15 @@ test_expect_success 'topic only' '
1:tree::$(git rev-parse topic^{tree})
1:tree::$(git rev-parse base~1^{tree})
1:tree::$(git rev-parse base~2^{tree})
- 2:tree:right/:$(git rev-parse topic:right)
- 2:tree:right/:$(git rev-parse base~1:right)
- 2:tree:right/:$(git rev-parse base~2:right)
- 3:blob:right/d:$(git rev-parse base~1:right/d)
- 4:blob:right/c:$(git rev-parse base~2:right/c)
- 4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse base~2:left)
- 6:blob:left/b:$(git rev-parse base~2:left/b)
- 7:blob:a:$(git rev-parse base~2:a)
+ 2:blob:a:$(git rev-parse base~2:a)
+ 3:tree:left/:$(git rev-parse base~2:left)
+ 4:blob:left/b:$(git rev-parse base~2:left/b)
+ 5:tree:right/:$(git rev-parse topic:right)
+ 5:tree:right/:$(git rev-parse base~1:right)
+ 5:tree:right/:$(git rev-parse base~2:right)
+ 6:blob:right/c:$(git rev-parse base~2:right/c)
+ 6:blob:right/c:$(git rev-parse topic:right/c)
+ 7:blob:right/d:$(git rev-parse base~1:right/d)
blobs:5
commits:3
tags:0
@@ -210,12 +210,12 @@ test_expect_success 'topic, not base' '
cat >expect <<-EOF &&
0:commit::$(git rev-parse topic)
1:tree::$(git rev-parse topic^{tree})
- 2:tree:right/:$(git rev-parse topic:right)
- 3:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
- 4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse topic:left):UNINTERESTING
- 6:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
- 7:blob:a:$(git rev-parse topic:a):UNINTERESTING
+ 2:blob:a:$(git rev-parse topic:a):UNINTERESTING
+ 3:tree:left/:$(git rev-parse topic:left):UNINTERESTING
+ 4:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 5:tree:right/:$(git rev-parse topic:right)
+ 6:blob:right/c:$(git rev-parse topic:right/c)
+ 7:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
blobs:4
commits:1
tags:0
@@ -233,12 +233,12 @@ test_expect_success 'fourth, blob-tag2, not base' '
1:tag:/tags:$(git rev-parse fourth)
2:blob:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
3:tree::$(git rev-parse topic^{tree})
- 4:tree:right/:$(git rev-parse topic:right)
- 5:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
- 6:blob:right/c:$(git rev-parse topic:right/c)
- 7:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
- 8:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
- 9:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ 4:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ 5:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 6:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 7:tree:right/:$(git rev-parse topic:right)
+ 8:blob:right/c:$(git rev-parse topic:right/c)
+ 9:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:5
commits:1
tags:1
@@ -253,10 +253,10 @@ test_expect_success 'topic, not base, only blobs' '
-- topic --not base >out &&
cat >expect <<-EOF &&
- 0:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
- 1:blob:right/c:$(git rev-parse topic:right/c)
- 2:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
- 3:blob:a:$(git rev-parse topic:a):UNINTERESTING
+ 0:blob:a:$(git rev-parse topic:a):UNINTERESTING
+ 1:blob:left/b:$(git rev-parse topic:left/b):UNINTERESTING
+ 2:blob:right/c:$(git rev-parse topic:right/c)
+ 3:blob:right/d:$(git rev-parse topic:right/d):UNINTERESTING
blobs:4
commits:0
tags:0
@@ -289,8 +289,8 @@ test_expect_success 'topic, not base, only trees' '
cat >expect <<-EOF &&
0:tree::$(git rev-parse topic^{tree})
- 1:tree:right/:$(git rev-parse topic:right)
- 2:tree:left/:$(git rev-parse topic:left):UNINTERESTING
+ 1:tree:left/:$(git rev-parse topic:left):UNINTERESTING
+ 2:tree:right/:$(git rev-parse topic:right)
commits:0
blobs:0
tags:0
@@ -308,14 +308,14 @@ test_expect_success 'topic, not base, boundary' '
0:commit::$(git rev-parse base~1):UNINTERESTING
1:tree::$(git rev-parse topic^{tree})
1:tree::$(git rev-parse base~1^{tree}):UNINTERESTING
- 2:tree:right/:$(git rev-parse topic:right)
- 2:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
- 3:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
- 4:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
- 4:blob:right/c:$(git rev-parse topic:right/c)
- 5:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
- 6:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
- 7:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ 2:blob:a:$(git rev-parse base~1:a):UNINTERESTING
+ 3:tree:left/:$(git rev-parse base~1:left):UNINTERESTING
+ 4:blob:left/b:$(git rev-parse base~1:left/b):UNINTERESTING
+ 5:tree:right/:$(git rev-parse topic:right)
+ 5:tree:right/:$(git rev-parse base~1:right):UNINTERESTING
+ 6:blob:right/c:$(git rev-parse base~1:right/c):UNINTERESTING
+ 6:blob:right/c:$(git rev-parse topic:right/c)
+ 7:blob:right/d:$(git rev-parse base~1:right/d):UNINTERESTING
blobs:5
commits:2
tags:0
--
gitgitgadget