* [PATCH 1/8] combine-diff: zero memory used for callback filepairs
2025-03-26 20:18 [PATCH 0/8] Introduce git-blame-tree(1) command Toon Claes
@ 2025-03-26 20:18 ` Toon Claes
2025-03-26 20:18 ` [PATCH 2/8] within_depth: fix return for empty path Toon Claes
` (7 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Toon Claes @ 2025-03-26 20:18 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt
From: Jeff King <peff@peff.net>
In commit 25e5e2bf85 (combine-diff: support format_callback,
2011-08-19), the combined-diff code learned how to make a multi-sourced
diff_filepair to pass to a diff callback. When we create each filepair,
we do not bother to fill in many of the fields, because they would make
no sense (e.g., there can be no rename score or broken_pair flag because
we do not go through the diffcore filters). However, we did not even
bother to zero them, leading to random values. Let's make sure
everything is blank with xcalloc, just as the regular diff code does.
We would potentially want to set the "status" flag to
something non-zero, but it is not clear to what. Possibly a
new DIFF_STATUS_COMBINED would make sense, as this is not
strictly a modification, nor does it fit any other category.
Since it is not yet clear what callers would want, this
patch simply leaves it as "0", the same empty flag that is
seen when diffcore_std is not used at all.
Signed-off-by: Jeff King <peff@peff.net>
---
combine-diff.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/combine-diff.c b/combine-diff.c
index 9527f3160d..1deb8c7374 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1315,7 +1315,7 @@ static struct diff_filepair *combined_pair(struct combine_diff_path *p,
struct diff_filepair *pair;
struct diff_filespec *pool;
- pair = xmalloc(sizeof(*pair));
+ CALLOC_ARRAY(pair, 1);
CALLOC_ARRAY(pool, st_add(num_parent, 1));
pair->one = pool + 1;
pair->two = pool;
--
2.49.0.rc2
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH 2/8] within_depth: fix return for empty path
2025-03-26 20:18 [PATCH 0/8] Introduce git-blame-tree(1) command Toon Claes
2025-03-26 20:18 ` [PATCH 1/8] combine-diff: zero memory used for callback filepairs Toon Claes
@ 2025-03-26 20:18 ` Toon Claes
2025-03-26 20:18 ` [PATCH 3/8] diff: teach tree-diff a max-depth parameter Toon Claes
` (6 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Toon Claes @ 2025-03-26 20:18 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt
From: Jeff King <peff@peff.net>
The within_depth function is used to check whether pathspecs
limited by a max-depth parameter are acceptable. It takes a
path to check, a maximum depth, and a "base" depth. It
counts the components in the path (by counting slashes),
adds them to the base, and compare them to the maximum.
However, if the base does not have any slashes at all, we
always return "true". If the base depth is 0, then this is
correct; no matter what the maximum is, we are always within
it. However, if the base depth is greater than 0, then we
might return an erroneous result.
This ends up not causing any user-visible bugs in the
current code. The call sites in dir.c always pass a base
depth of 0, so are unaffected. But tree_entry_interesting
uses this function differently: it will pass the prefix of
the current entry, along with a "1" if the entry is a
directory, in essence checking whether items inside the
entry would be of interest. It turns out not to make a
difference in behavior, but the reasoning is complex.
Given a tree like:
file
a/file
a/b/file
walking the tree and calling tree_entry_interesting will
yield the following results:
(with max_depth=0):
file: yes
a: yes
a/file: no
a/b: no
(with max_depth=1):
file: yes
a: yes
a/file: yes
a/b: no
So we have inconsistent behavior in considering directories
interesting. If they are at the edge of our depth but at the
root, we will recurse into them, but then find all of their
entries uninteresting (e.g., in the first case, we will look
at "a" but find "a/*" uninteresting). But if they are at the
edge of our depth and not at the root, then we will not
recurse (in the second example, we do not even bother
entering "a/b").
This turns out not to matter because the only caller which
uses max-depth pathspecs is cmd_grep, which only cares about
blob entries. From its perspective, it is exactly the same
to not recurse into a subtree, or to recurse and find that
it contains no matching entries. Not recursing is merely an
optimization.
It is debatable whether tree_entry_interesting should
consider such an entry interesting. The only caller does not
care if it sees the tree itself, and can benefit from the
optimization. But if we add a "max-depth" limiter to regular
diffs, then a diff with DIFF_OPT_TREE_IN_RECURSIVE would
probably want to show the tree itself, but not what it
contains.
This patch just fixes within_depth, which means we consider
such entries uninteresting (and makes the current caller
happy). If we want to change that in the future, then this
fix is still the correct first step, as the current behavior is
simply inconsistent.
Signed-off-by: Jeff King <peff@peff.net>
---
dir.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/dir.c b/dir.c
index cbd82be6c9..3aead2a599 100644
--- a/dir.c
+++ b/dir.c
@@ -278,7 +278,7 @@ int within_depth(const char *name, int namelen,
if (depth > max_depth)
return 0;
}
- return 1;
+ return depth <= max_depth;
}
/*
--
2.49.0.rc2
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH 3/8] diff: teach tree-diff a max-depth parameter
2025-03-26 20:18 [PATCH 0/8] Introduce git-blame-tree(1) command Toon Claes
2025-03-26 20:18 ` [PATCH 1/8] combine-diff: zero memory used for callback filepairs Toon Claes
2025-03-26 20:18 ` [PATCH 2/8] within_depth: fix return for empty path Toon Claes
@ 2025-03-26 20:18 ` Toon Claes
2025-03-26 20:18 ` [PATCH 4/8] blame-tree: introduce new subcommand to blame files Toon Claes
` (5 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Toon Claes @ 2025-03-26 20:18 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt
From: Jeff King <peff@peff.net>
When you are doing a tree-diff, there are basically two
options: do not recurse into subtrees at all, or recurse
indefinitely. While most callers would want to always
recurse and see full pathnames, some may want the efficiency
of looking only at a particular level of the tree. This is
currently easy to do for the top-level (just turn off
recursion), but you cannot say "show me what changed in
subdir/, but do not recurse".
This patch adds a max-depth parameter which is measured from
the closest pathspec match, so that you can do:
git log --raw --max-depth=1 a/b/c
and see the contents of a/b/c/, but not those of a/b/c/d/
(you would see the --raw entry for a/b/c/d in such a case).
Signed-off-by: Jeff King <peff@peff.net>
---
Documentation/diff-options.adoc | 26 ++++++++++
diff-lib.c | 5 ++
diff.c | 18 +++++++
diff.h | 9 ++++
t/meson.build | 1 +
t/t4071-diff-max-depth.sh | 109 ++++++++++++++++++++++++++++++++++++++++
tree-diff.c | 78 ++++++++++++++++++++++++++--
7 files changed, 243 insertions(+), 3 deletions(-)
diff --git a/Documentation/diff-options.adoc b/Documentation/diff-options.adoc
index 640eb6e7db..8bb523d933 100644
--- a/Documentation/diff-options.adoc
+++ b/Documentation/diff-options.adoc
@@ -887,5 +887,31 @@ endif::git-format-patch[]
reverted with `--ita-visible-in-index`. Both options are
experimental and could be removed in future.
+--max-depth=<n>::
+
+ Limit diff recursion to `<n>` levels (implies `-r`). The depth
+ is measured from the closest pathspec. Given a tree containing
+ `foo/bar/baz`, the following list shows the matches generated by
+ each set of options:
++
+--
+ - `--max-depth=0 -- foo`: `foo`
+
+ - `--max-depth=1 -- foo`: `foo/bar`
+
+ - `--max-depth=2 -- foo`: `foo/bar/baz`
+
+ - `--max-depth=1 -- foo/bar`: `foo/bar/baz`
+--
++
+If no pathspec is given, the depth is measured as if all
+top-level entries were specified. Note that this is different
+than measuring from the root, in that `--max-depth=0` would
+still return `foo`. This allows you to still limit depth while
+asking for a subset of the top-level entries.
++
+Note that this option is only supported for diffs between tree objects,
+not against the index or working tree.
+
For more detailed explanation on these common options, see also
linkgit:gitdiffcore[7].
diff --git a/diff-lib.c b/diff-lib.c
index 353b473ed5..34d61cc6a0 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -115,6 +115,9 @@ void run_diff_files(struct rev_info *revs, unsigned int option)
uint64_t start = getnanotime();
struct index_state *istate = revs->diffopt.repo->index;
+ if (revs->diffopt.max_depth_valid)
+ die("max-depth is not supported for worktree diffs");
+
diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
refresh_fsmonitor(istate);
@@ -560,6 +563,8 @@ static int diff_cache(struct rev_info *revs,
opts.dst_index = NULL;
opts.pathspec = &revs->diffopt.pathspec;
opts.pathspec->recursive = 1;
+ if (revs->diffopt.max_depth_valid)
+ die("max-depth is not supported for index diffs");
init_tree_desc(&t, &tree->object.oid, tree->buffer, tree->size);
return unpack_trees(1, &t, &opts);
diff --git a/diff.c b/diff.c
index c89c15d98e..acf2212a4a 100644
--- a/diff.c
+++ b/diff.c
@@ -4986,6 +4986,9 @@ void diff_setup_done(struct diff_options *options)
options->filter = ~filter_bit[DIFF_STATUS_FILTER_AON];
options->filter &= ~options->filter_not;
}
+
+ if (options->pathspec.has_wildcard && options->max_depth_valid)
+ die("max-depth cannot be used with wildcard pathspecs");
}
int parse_long_opt(const char *opt, const char **argv,
@@ -5620,6 +5623,18 @@ static int diff_opt_rotate_to(const struct option *opt, const char *arg, int uns
return 0;
}
+static int diff_opt_max_depth(const struct option *opt,
+ const char *arg, int unset)
+{
+ struct diff_options *options = opt->value;
+
+ BUG_ON_OPT_NEG(unset);
+ options->flags.recursive = 1;
+ options->max_depth = strtol(arg, NULL, 10);
+ options->max_depth_valid = 1;
+ return 0;
+}
+
/*
* Consider adding new flags to __git_diff_common_options
* in contrib/completion/git-completion.bash
@@ -5895,6 +5910,9 @@ struct option *add_diff_options(const struct option *opts,
{ OPTION_CALLBACK, 0, "output", options, N_("<file>"),
N_("output to a specific file"),
PARSE_OPT_NONEG, NULL, 0, diff_opt_output },
+ OPT_CALLBACK_F(0, "max-depth", options, N_("<depth>"),
+ N_("maximum tree depth to recurse"),
+ PARSE_OPT_NONEG, diff_opt_max_depth),
OPT_END()
};
diff --git a/diff.h b/diff.h
index ff0348e4a9..a07d2178c4 100644
--- a/diff.h
+++ b/diff.h
@@ -396,6 +396,15 @@ struct diff_options {
struct strmap *additional_path_headers;
int no_free;
+
+ /*
+ * The extra "valid" flag is a slight hack. The value "0" is perfectly
+ * valid for max-depth. We would normally use -1 to indicate "not set",
+ * but there are many code paths which assume that assume that just
+ * zero-ing out a diff_options is enough to initialize it.
+ */
+ int max_depth;
+ int max_depth_valid;
};
unsigned diff_filter_bit(char status);
diff --git a/t/meson.build b/t/meson.build
index a59da26be3..950d1b7483 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -500,6 +500,7 @@ integration_tests = [
't4067-diff-partial-clone.sh',
't4068-diff-symmetric-merge-base.sh',
't4069-remerge-diff.sh',
+ 't4071-diff-max-depth.sh',
't4100-apply-stat.sh',
't4101-apply-nonl.sh',
't4102-apply-rename.sh',
diff --git a/t/t4071-diff-max-depth.sh b/t/t4071-diff-max-depth.sh
new file mode 100755
index 0000000000..1545ebb869
--- /dev/null
+++ b/t/t4071-diff-max-depth.sh
@@ -0,0 +1,109 @@
+#!/bin/sh
+
+test_description='check that diff --max-depth will limit recursion'
+. ./test-lib.sh
+
+make_dir() {
+ mkdir -p "$1" &&
+ echo "$2" >"$1/file"
+}
+
+make_files() {
+ echo "$1" >file &&
+ make_dir one "$1" &&
+ make_dir one/two "$1" &&
+ make_dir one/two/three "$1"
+}
+
+test_expect_success 'setup' '
+ git commit --allow-empty -m empty &&
+ git tag empty &&
+ make_files added &&
+ git add . &&
+ git commit -m added &&
+ make_files modified &&
+ git add . &&
+ git commit -m modified &&
+ make_files index &&
+ git add . &&
+ make_files worktree
+'
+
+test_expect_success '--max-depth is disallowed with wildcard pathspecs' '
+ test_must_fail git diff-tree --max-depth=0 HEAD^ HEAD -- "f*"
+'
+
+check_one() {
+ type=$1; shift
+ args=$1; shift
+ path=$1; shift
+ depth=$1; shift
+ test_expect_${expect:-success} "diff-$type $args, path=$path, depth=$depth" "
+ for i in $*; do echo \$i; done >expect &&
+ git diff-$type --max-depth=$depth --name-only $args -- $path >actual &&
+ test_cmp expect actual
+ "
+}
+
+# For tree comparisons, we expect to see subtrees at the boundary
+# get their own entry.
+check_trees() {
+ check_one tree "$*" '' 0 file one
+ check_one tree "$*" '' 1 file one/file one/two
+ check_one tree "$*" '' 2 file one/file one/two/file one/two/three
+ check_one tree "$*" '' 3 file one/file one/two/file one/two/three/file
+ check_one tree "$*" one 0 one
+ check_one tree "$*" one 1 one/file one/two
+ check_one tree "$*" one 2 one/file one/two/file one/two/three
+ check_one tree "$*" one 3 one/file one/two/file one/two/three/file
+ check_one tree "$*" one/two 0 one/two
+ check_one tree "$*" one/two 1 one/two/file one/two/three
+ check_one tree "$*" one/two 2 one/two/file one/two/three/file
+ check_one tree "$*" one/two/three 0 one/two/three
+ check_one tree "$*" one/two/three 1 one/two/three/file
+}
+
+# But for index comparisons, we do not store subtrees at all, so we do not
+# expect them.
+check_index() {
+ check_one "$@" '' 0 file
+ check_one "$@" '' 1 file one/file
+ check_one "$@" '' 2 file one/file one/two/file
+ check_one "$@" '' 3 file one/file one/two/file one/two/three/file
+ check_one "$@" one 0
+ check_one "$@" one 1 one/file
+ check_one "$@" one 2 one/file one/two/file
+ check_one "$@" one 3 one/file one/two/file one/two/three/file
+ check_one "$@" one/two 0
+ check_one "$@" one/two 1 one/two/file
+ check_one "$@" one/two 2 one/two/file one/two/three/file
+ check_one "$@" one/two/three 0
+ check_one "$@" one/two/three 1 one/two/three/file
+}
+
+# Check as a modification...
+check_trees HEAD^ HEAD
+# ...and as an addition...
+check_trees empty HEAD
+# ...and as a deletion.
+check_trees HEAD empty
+
+# We currently only implement max-depth for trees.
+expect=failure
+# Check index against a tree
+check_index index "--cached HEAD"
+# and index against the worktree
+check_index files ""
+expect=
+
+test_expect_success 'find shortest path within embedded pathspecs' '
+ cat >expect <<-\EOF &&
+ one/file
+ one/two/file
+ one/two/three/file
+ EOF
+ git diff-tree --max-depth=2 --name-only HEAD^ HEAD -- one one/two >actual &&
+ test_cmp expect actual
+'
+
+test_done
diff --git a/tree-diff.c b/tree-diff.c
index 60c558c2b5..2a744dfaec 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -13,6 +13,7 @@
#include "tree-walk.h"
#include "environment.h"
#include "repository.h"
+#include "dir.h"
/*
* Some mode bits are also used internally for computations.
@@ -48,6 +49,73 @@
free((x)); \
} while(0)
+/* Returns true if and only if "dir" is a leading directory of "path" */
+static int is_dir_prefix(const char *path, const char *dir, int dirlen)
+{
+ return !strncmp(path, dir, dirlen) &&
+ (!path[dirlen] || path[dirlen] == '/');
+}
+
+static int check_recursion_depth(struct strbuf *name,
+ const struct pathspec *ps,
+ int max_depth)
+{
+ int i;
+
+ if (!ps->nr)
+ return within_depth(name->buf, name->len, 1, max_depth);
+
+ /*
+ * We look through the pathspecs in reverse-sorted order, because we
+ * want to find the longest match first (e.g., "a/b" is better for
+ * checking depth than "a/b/c").
+ */
+ for (i = ps->nr - 1; i >= 0; i--) {
+ const struct pathspec_item *item = ps->items+i;
+
+ /*
+ * If the name to match is longer than the pathspec, then we
+ * are only interested if the pathspec matches and we are
+ * within the allowed depth.
+ */
+ if (name->len >= item->len) {
+ if (!is_dir_prefix(name->buf, item->match, item->len))
+ continue;
+ return within_depth(name->buf + item->len,
+ name->len - item->len,
+ 1, max_depth);
+ }
+
+ /*
+ * Otherwise, our name is shorter than the pathspec. We need to
+ * check if it is a prefix of the pathspec; if so, we must
+ * always recurse in order to process further (the resulting
+ * paths we find might or might not match our pathspec, but we
+ * cannot know until we recurse).
+ */
+ if (is_dir_prefix(item->match, name->buf, name->len))
+ return 1;
+ }
+ return 0;
+}
+
+static int should_recurse(struct strbuf *name, struct diff_options *opt)
+{
+ if (!opt->flags.recursive)
+ return 0;
+ if (!opt->max_depth_valid)
+ return 1;
+
+ /*
+ * We catch this during diff_setup_done, but let's double-check
+ * against any internal munging.
+ */
+ if (opt->pathspec.has_wildcard)
+ die("BUG: wildcard pathspecs are incompatible with max-depth");
+
+ return check_recursion_depth(name, &opt->pathspec, opt->max_depth);
+}
+
static void ll_diff_tree_paths(
struct combine_diff_path ***tail, const struct object_id *oid,
const struct object_id **parents_oid, int nparent,
@@ -170,9 +238,13 @@ static void emit_path(struct combine_diff_path ***tail,
mode = 0;
}
- if (opt->flags.recursive && isdir) {
- recurse = 1;
- emitthis = opt->flags.tree_in_recursive;
+ if (isdir) {
+ strbuf_add(base, path, pathlen);
+ if (should_recurse(base, opt)) {
+ recurse = 1;
+ emitthis = opt->flags.tree_in_recursive;
+ }
+ strbuf_setlen(base, old_baselen);
}
if (emitthis) {
--
2.49.0.rc2
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH 4/8] blame-tree: introduce new subcommand to blame files
2025-03-26 20:18 [PATCH 0/8] Introduce git-blame-tree(1) command Toon Claes
` (2 preceding siblings ...)
2025-03-26 20:18 ` [PATCH 3/8] diff: teach tree-diff a max-depth parameter Toon Claes
@ 2025-03-26 20:18 ` Toon Claes
2025-03-26 20:18 ` [PATCH 5/8] pathspec: add optional trie index Toon Claes
` (4 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Toon Claes @ 2025-03-26 20:18 UTC (permalink / raw)
To: git
Cc: Jeff King, Patrick Steinhardt,
Ævar Arnfjörð Bjarmason, Toon Claes
Similar to git-blame(1), introduce a new subcommand git-blame-tree(1).
This command shows the most recent modification to paths in a tree. It
does so by expanding the tree at a given commit, taking note of the
current state of each path, and then walking backwards through history
looking for commits where each path changed into its final commit ID.
Based-on-a-patch-by: Jeff King <peff@peff.net>
Improved-by: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
.gitignore | 1 +
Makefile | 2 +
blame-tree.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++
blame-tree.h | 43 +++++++++++
builtin.h | 1 +
builtin/blame-tree.c | 67 +++++++++++++++++
git.c | 1 +
meson.build | 2 +
t/helper/test-tool.h | 1 +
t/meson.build | 1 +
t/t8020-blame-tree.sh | 142 ++++++++++++++++++++++++++++++++++++
11 files changed, 459 insertions(+)
diff --git a/.gitignore b/.gitignore
index 08a66ca508..27faa0ce90 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,6 +22,7 @@
/git-backfill
/git-bisect
/git-blame
+/git-blame-tree
/git-branch
/git-bugreport
/git-bundle
diff --git a/Makefile b/Makefile
index 7315507381..92fdfc76df 100644
--- a/Makefile
+++ b/Makefile
@@ -972,6 +972,7 @@ LIB_OBJS += archive.o
LIB_OBJS += attr.o
LIB_OBJS += base85.o
LIB_OBJS += bisect.o
+LIB_OBJS += blame-tree.o
LIB_OBJS += blame.o
LIB_OBJS += blob.o
LIB_OBJS += bloom.o
@@ -1215,6 +1216,7 @@ BUILTIN_OBJS += builtin/archive.o
BUILTIN_OBJS += builtin/backfill.o
BUILTIN_OBJS += builtin/bisect.o
BUILTIN_OBJS += builtin/blame.o
+BUILTIN_OBJS += builtin/blame-tree.o
BUILTIN_OBJS += builtin/branch.o
BUILTIN_OBJS += builtin/bugreport.o
BUILTIN_OBJS += builtin/bundle.o
diff --git a/blame-tree.c b/blame-tree.c
new file mode 100644
index 0000000000..ed4ec1e694
--- /dev/null
+++ b/blame-tree.c
@@ -0,0 +1,198 @@
+#include "git-compat-util.h"
+#include "blame-tree.h"
+#include "strvec.h"
+#include "hex.h"
+#include "commit.h"
+#include "diffcore.h"
+#include "diff.h"
+#include "object.h"
+#include "revision.h"
+#include "repository.h"
+#include "log-tree.h"
+
+void blame_tree_opts_release(struct blame_tree_options *bto)
+{
+ strvec_clear(&bto->args);
+}
+
+struct blame_tree_entry {
+ struct object_id oid;
+ struct commit *commit;
+};
+
+static void add_from_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *data)
+{
+ struct blame_tree *bt = data;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct blame_tree_entry *ent = xcalloc(1, sizeof(*ent));
+ struct string_list_item *it;
+
+ oidcpy(&ent->oid, &p->two->oid);
+ it = string_list_append(&bt->paths, p->two->path);
+ it->util = ent;
+ }
+}
+
+static int add_from_revs(struct blame_tree *bt)
+{
+ size_t count = 0;
+ struct diff_options diffopt;
+
+ memcpy(&diffopt, &bt->rev.diffopt, sizeof(diffopt));
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ diffopt.format_callback = add_from_diff;
+ diffopt.format_callback_data = bt;
+ diffopt.no_free = 1;
+
+ for (size_t i = 0; i < bt->rev.pending.nr; i++) {
+ struct object_array_entry *obj = bt->rev.pending.objects + i;
+
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
+ if (count++)
+ return error(_("can only blame one tree at a time"));
+ diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
+
+ string_list_sort(&bt->paths);
+ return 0;
+}
+
+void blame_tree_init(struct repository *r, struct blame_tree *bt,
+ const struct blame_tree_options *opts)
+{
+ repo_init_revisions(r, &bt->rev, opts->prefix);
+ bt->rev.def = oid_to_hex(&opts->oid);
+ bt->rev.combine_merges = 1;
+ bt->rev.show_root_diff = 1;
+ bt->rev.boundary = 1;
+ bt->rev.no_commit_id = 1;
+ bt->rev.diff = 1;
+ bt->rev.diffopt.flags.recursive = opts->recursive;
+ setup_revisions(opts->args.nr, opts->args.v, &bt->rev, NULL);
+
+ if (add_from_revs(bt) < 0)
+ die(_("unable to setup blame-tree"));
+}
+
+void blame_tree_release(struct blame_tree *bt)
+{
+ string_list_clear(&bt->paths, 1);
+ release_revisions(&bt->rev);
+}
+
+struct blame_tree_callback_data {
+ struct commit *commit;
+ struct string_list *paths;
+ size_t num_interesting;
+
+ blame_tree_fn callback;
+ void *callback_data;
+};
+
+static void mark_path(const char *path, const struct object_id *oid,
+ struct blame_tree_callback_data *data)
+{
+ struct string_list_item *item = string_list_lookup(data->paths, path);
+ struct blame_tree_entry *ent;
+
+ /* Is it even a path that exists in our tree? */
+ if (!item)
+ return;
+
+ /* Have we already blamed a commit? */
+ ent = item->util;
+ if (ent->commit)
+ return;
+
+ /*
+ * Is it arriving at a version of interest, or is it from a side branch
+ * which did not contribute to the final state?
+ */
+ if (!oideq(oid, &ent->oid))
+ return;
+
+ ent->commit = data->commit;
+ data->num_interesting--;
+ if (data->callback)
+ data->callback(path, data->commit, data->callback_data);
+}
+
+static void blame_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *cbdata)
+{
+ struct blame_tree_callback_data *data = cbdata;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ switch (p->status) {
+ case DIFF_STATUS_DELETED:
+ /*
+ * There's no point in feeding a deletion, as it could
+ * not have resulted in our current state, which
+ * actually has the file.
+ */
+ break;
+
+ default:
+ /*
+ * Otherwise, we care only that we somehow arrived at
+ * a final path/sha1 state. Note that this covers some
+ * potentially controversial areas, including:
+ *
+ * 1. A rename or copy will be blamed, as it is the
+ * first time the content has arrived at the given
+ * path.
+ *
+ * 2. Even a non-content modification like a mode or
+ * type change will trigger it.
+ *
+ * We take the inclusive approach for now, and blame
+ * anything which impacts the path. Options to tweak
+ * the behavior (e.g., to "--follow" the content across
+ * renames) can come later.
+ */
+ mark_path(p->two->path, &p->two->oid, data);
+ break;
+ }
+ }
+}
+
+int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata)
+{
+ struct blame_tree_callback_data data;
+
+ data.paths = &bt->paths;
+ data.num_interesting = bt->paths.nr;
+ data.callback = cb;
+ data.callback_data = cbdata;
+
+ bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ bt->rev.diffopt.format_callback = blame_diff;
+ bt->rev.diffopt.format_callback_data = &data;
+
+ prepare_revision_walk(&bt->rev);
+
+ while (data.num_interesting) {
+ data.commit = get_revision(&bt->rev);
+ if (!data.commit)
+ break;
+
+ if (data.commit->object.flags & BOUNDARY) {
+ diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
+ &data.commit->object.oid,
+ "", &bt->rev.diffopt);
+ diff_flush(&bt->rev.diffopt);
+ } else {
+ log_tree_commit(&bt->rev, data.commit);
+ }
+ }
+
+ return 0;
+}
diff --git a/blame-tree.h b/blame-tree.h
new file mode 100644
index 0000000000..ea06298f88
--- /dev/null
+++ b/blame-tree.h
@@ -0,0 +1,43 @@
+#ifndef BLAME_TREE_H
+#define BLAME_TREE_H
+
+#include "hash.h"
+#include "strvec.h"
+#include "string-list.h"
+#include "revision.h"
+#include "commit.h"
+
+struct blame_tree_options {
+ struct object_id oid;
+ const char *prefix;
+ unsigned int recursive;
+ struct strvec args;
+};
+
+#define BLAME_TREE_OPTIONS_INIT(...) { \
+ .args = STRVEC_INIT, \
+ __VA_ARGS__ \
+}
+
+void blame_tree_opts_release(struct blame_tree_options *bto);
+
+struct blame_tree {
+ struct string_list paths;
+ struct rev_info rev;
+ struct repository *repository;
+};
+#define BLAME_TREE_INIT { \
+ .paths = STRING_LIST_INIT_DUP, \
+ .rev = REV_INFO_INIT, \
+}
+
+void blame_tree_init(struct repository *r, struct blame_tree *bt,
+ const struct blame_tree_options *opts);
+
+void blame_tree_release(struct blame_tree *);
+
+typedef void (*blame_tree_fn)(const char *path, const struct commit *commit,
+ void *data);
+int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data);
+
+#endif /* BLAME_TREE_H */
diff --git a/builtin.h b/builtin.h
index 993a583872..be3176924d 100644
--- a/builtin.h
+++ b/builtin.h
@@ -123,6 +123,7 @@ int cmd_archive(int argc, const char **argv, const char *prefix, struct reposito
int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_bisect(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_blame(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_blame_tree(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_branch(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_bugreport(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_bundle(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/blame-tree.c b/builtin/blame-tree.c
new file mode 100644
index 0000000000..bc404b63f3
--- /dev/null
+++ b/builtin/blame-tree.c
@@ -0,0 +1,67 @@
+#define USE_THE_REPOSITORY_VARIABLE
+
+#include "git-compat-util.h"
+#include "blame-tree.h"
+#include "strvec.h"
+#include "hex.h"
+#include "quote.h"
+#include "config.h"
+#include "environment.h"
+#include "object-name.h"
+#include "parse-options.h"
+#include "builtin.h"
+#include "setup.h"
+
+static void show_entry(const char *path, const struct commit *commit, void *d)
+{
+ struct blame_tree *bt = d;
+
+ if (commit->object.flags & BOUNDARY)
+ putchar('^');
+ printf("%s\t", oid_to_hex(&commit->object.oid));
+
+ if (bt->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+
+ fflush(stdout);
+}
+
+int cmd_blame_tree(int argc, const char **argv, const char *prefix, struct repository *repo)
+{
+ struct blame_tree bt = BLAME_TREE_INIT;
+ struct blame_tree_options opts = BLAME_TREE_OPTIONS_INIT(
+ .prefix = prefix,
+ );
+
+ struct option options[] = {
+ OPT_BOOL(0, "recursive", &opts.recursive,
+ "recurse into to subtrees"),
+ OPT_END()
+ };
+
+ const char * const blame_tree_usage[] = {
+ N_("git blame-tree [--no-recursive] [<rev-opts>]"),
+ NULL,
+ };
+
+ git_config(git_default_config, NULL);
+
+ if (repo_get_oid(the_repository, "HEAD", &opts.oid))
+ die("unable to get HEAD");
+
+ argc = parse_options(argc, argv, prefix, options, blame_tree_usage,
+ PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_UNKNOWN_OPT);
+ if (argc)
+ strvec_pushv(&opts.args, argv);
+
+ blame_tree_init(repo, &bt, &opts);
+
+ if (blame_tree_run(&bt, show_entry, &bt) < 0)
+ die(_("error running blame-tree traversal"));
+ blame_tree_release(&bt);
+ blame_tree_opts_release(&opts);
+
+ return 0;
+}
diff --git a/git.c b/git.c
index 450d6aaa86..42de740378 100644
--- a/git.c
+++ b/git.c
@@ -509,6 +509,7 @@ static struct cmd_struct commands[] = {
{ "backfill", cmd_backfill, RUN_SETUP },
{ "bisect", cmd_bisect, RUN_SETUP },
{ "blame", cmd_blame, RUN_SETUP },
+ { "blame-tree", cmd_blame_tree, RUN_SETUP },
{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
{ "bugreport", cmd_bugreport, RUN_SETUP_GENTLY },
{ "bundle", cmd_bundle, RUN_SETUP_GENTLY },
diff --git a/meson.build b/meson.build
index efe2871c9d..dd231b669b 100644
--- a/meson.build
+++ b/meson.build
@@ -241,6 +241,7 @@ libgit_sources = [
'attr.c',
'base85.c',
'bisect.c',
+ 'blame-tree.c',
'blame.c',
'blob.c',
'bloom.c',
@@ -512,6 +513,7 @@ builtin_sources = [
'builtin/archive.c',
'builtin/backfill.c',
'builtin/bisect.c',
+ 'builtin/blame-tree.c',
'builtin/blame.c',
'builtin/branch.c',
'builtin/bugreport.c',
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 6d62a5b53d..41cc3730dc 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -5,6 +5,7 @@
int cmd__advise_if_enabled(int argc, const char **argv);
int cmd__bitmap(int argc, const char **argv);
+int cmd__blame_tree(int argc, const char **argv);
int cmd__bloom(int argc, const char **argv);
int cmd__bundle_uri(int argc, const char **argv);
int cmd__cache_tree(int argc, const char **argv);
diff --git a/t/meson.build b/t/meson.build
index 950d1b7483..6f6485c8b4 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -960,6 +960,7 @@ integration_tests = [
't8012-blame-colors.sh',
't8013-blame-ignore-revs.sh',
't8014-blame-ignore-fuzzy.sh',
+ 't8020-blame-tree.sh',
't9001-send-email.sh',
't9002-column.sh',
't9003-help-autocorrect.sh',
diff --git a/t/t8020-blame-tree.sh b/t/t8020-blame-tree.sh
new file mode 100755
index 0000000000..5fd2c079fe
--- /dev/null
+++ b/t/t8020-blame-tree.sh
@@ -0,0 +1,142 @@
+#!/bin/sh
+
+test_description='blame-tree tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_commit 1 file &&
+ mkdir a &&
+ test_commit 2 a/file &&
+ mkdir a/b &&
+ test_commit 3 a/b/file
+'
+
+test_expect_success 'cannot blame two trees' '
+ test_must_fail git blame-tree HEAD HEAD~1
+'
+
+check_blame() {
+ local indir= &&
+ while test $# != 0
+ do
+ case "$1" in
+ -C)
+ indir="$2"
+ shift
+ ;;
+ *)
+ break
+ ;;
+ esac &&
+ shift
+ done &&
+
+ cat >expect &&
+ test_when_finished "rm -f tmp.*" &&
+ git ${indir:+-C "$indir"} blame-tree "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
+ tr '\t' ' ' <tmp.2 >tmp.3 &&
+ sort tmp.3 >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'blame recursive' '
+ check_blame --recursive <<-\EOF
+ 1 file
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'blame non-recursive' '
+ check_blame --no-recursive <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'blame subdir' '
+ check_blame a <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'blame subdir recursive' '
+ check_blame --recursive a <<-\EOF
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'blame from non-HEAD commit' '
+ check_blame --no-recursive HEAD^ <<-\EOF
+ 1 file
+ 2 a
+ EOF
+'
+
+test_expect_success 'blame from subdir defaults to root' '
+ check_blame -C a --no-recursive <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'blame from subdir uses relative pathspecs' '
+ check_blame -C a --recursive b <<-\EOF
+ 3 a/b/file
+ EOF
+'
+
+test_expect_failure 'limit blame traversal by count' '
+ check_blame --no-recursive -1 <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'limit blame traversal by commit' '
+ check_blame --no-recursive HEAD~2..HEAD <<-\EOF
+ 3 a
+ ^1 file
+ EOF
+'
+
+test_expect_success 'only blame files in the current tree' '
+ git rm -rf a &&
+ git commit -m "remove a" &&
+ check_blame <<-\EOF
+ 1 file
+ EOF
+'
+
+test_expect_success 'cross merge boundaries in blaming' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit m1 &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit m2 &&
+ git merge m1 &&
+ check_blame <<-\EOF
+ m1 m1.t
+ m2 m2.t
+ EOF
+'
+
+test_expect_success 'blame merge for resolved conflicts' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit c1 conflict &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit c2 conflict &&
+ test_must_fail git merge c1 &&
+ test_commit resolved conflict &&
+ check_blame conflict <<-\EOF
+ resolved conflict
+ EOF
+'
+
+test_done
--
2.49.0.rc2
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH 5/8] pathspec: add optional trie index
2025-03-26 20:18 [PATCH 0/8] Introduce git-blame-tree(1) command Toon Claes
` (3 preceding siblings ...)
2025-03-26 20:18 ` [PATCH 4/8] blame-tree: introduce new subcommand to blame files Toon Claes
@ 2025-03-26 20:18 ` Toon Claes
2025-03-26 20:18 ` [PATCH 6/8] pathspec: turn on tries when appropriate Toon Claes
` (3 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Toon Claes @ 2025-03-26 20:18 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt
From: Jeff King <peff@peff.net>
When checking if a path matches our pathspec list, in the
worst case we walk the whole list linearly to check each
pathspec. As a result, an operation like a tree diff does
`O(paths * pathspecs)` comparisons. If we assume that the
number of pathspecs may approach the number of paths, this
is `O(n^2)` in the number of paths. We have to do it this
way in the general case because pathspecs may be arbitrary
fnmatch expressions, and we cannot create any meaningful
index.
However, it is common that all of the pathspecs are vanilla
prefixes (e.g., "git.c" or "Documentation/") that do not use
any wildcards or magic features. In such a case, we can
index the pathspec list to give us faster lookups.
The simplest solution would be to keep the pathspec list
sorted, and use a binary search over the entries for each
path. This turns out to be rather tricky, though, as we are
not looking for an exact match in the list. We must also
find prefix matches, and take care to handle trailing
slashes for directories.
Instead, this patch introduces a pathspec_trie struct, which
represents the pathspecs as a graph of entries. For
example, the pathspec list ["foo/bar", "foo/baz/"] would be
represented as:
(bar, terminal)
/
(foo)
\
(baz, terminal, must_be_dir)
To see if a path "foo/eek" is covered by the pathspec, we
walk the trie, matching "foo" but ultimately seeing that
"eek" is not mentioned.
This provides us with two optimizations:
1. The children of each trie node are simple strings
representing directory components. This means we can sort
and binary-search them, giving us logarithmic lookups
(but note that rather than one log lookup on the whole
pathname, it is a sequence of smaller log lookups on
components of the pathname).
2. Many users of the pathspec code do not feed full
pathnames, but instead are walking a tree hierarchy
themselves, and limiting the walk as they go. They can
walk the trie at the same time, meanining they avoid
looking repeatedly at parts of the pathspec we already
know are matched. The current code, by contrast, will
repeatedly match "foo/" against each pathspec when
looking at "foo/bar", "foo/baz", etc.
Note that this patch just introduces the data structure; the
code is not used anywhere yet.
Signed-off-by: Jeff King <peff@peff.net>
---
Makefile | 1 +
pathspec.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++
pathspec.h | 23 +++++++++
t/helper/meson.build | 1 +
t/helper/test-pathspec.c | 98 ++++++++++++++++++++++++++++++++++++++
t/helper/test-tool.c | 1 +
t/helper/test-tool.h | 1 +
t/meson.build | 1 +
t/t6137-pathspec-trie.sh | 57 +++++++++++++++++++++++
9 files changed, 302 insertions(+)
diff --git a/Makefile b/Makefile
index 92fdfc76df..c90607baa1 100644
--- a/Makefile
+++ b/Makefile
@@ -827,6 +827,7 @@ TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
TEST_BUILTINS_OBJS += test-partial-clone.o
TEST_BUILTINS_OBJS += test-path-utils.o
TEST_BUILTINS_OBJS += test-path-walk.o
+TEST_BUILTINS_OBJS += test-pathspec.o
TEST_BUILTINS_OBJS += test-pcre2-config.o
TEST_BUILTINS_OBJS += test-pkt-line.o
TEST_BUILTINS_OBJS += test-proc-receive.o
diff --git a/pathspec.c b/pathspec.c
index 89663645e1..0e381fd748 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -17,6 +17,125 @@
#include "quote.h"
#include "wildmatch.h"
+/*
+ * This is basically a strcmp, but we do not want the caller
+ * to have to terminate "a", so we pretend as if it had a NUL.
+ */
+static int pathspec_trie_cmp(const char *a, size_t a_len,
+ const char *b)
+{
+ int ret = strncmp(a, b, a_len);
+ return ret ?
+ ret :
+ (unsigned char)0 - (unsigned char )b[a_len];
+}
+
+static struct pathspec_trie *pathspec_trie_alloc(const char *path, size_t len)
+{
+ struct pathspec_trie *ret = xcalloc(1, sizeof(*ret) + len);
+ memcpy(ret->path, path, len);
+ return ret;
+}
+
+/*
+ * Add "path" to the trie rooted at "t".
+ */
+static void pathspec_trie_add(struct pathspec_trie *t, const char *path)
+{
+ /*
+ * Special case the empty path (i.e., "."), as our splitting algorithm
+ * below assumes at least one component.
+ */
+ if (!*path) {
+ t->terminal = 1;
+ return;
+ }
+
+ while (1) {
+ const char *slash = strchrnul(path, '/');
+ size_t len = slash - path;
+ int pos;
+
+ pos = pathspec_trie_lookup(t, path, len);
+ if (pos < 0) {
+ ALLOC_GROW(t->entries, t->nr + 1, t->alloc);
+
+ pos = -pos - 1;
+ if (pos < t->nr)
+ memmove(t->entries + pos + 1,
+ t->entries + pos,
+ sizeof(*t->entries) * (t->nr - pos));
+ t->entries[pos] = pathspec_trie_alloc(path, len);
+ t->nr++;
+ }
+
+ t = t->entries[pos];
+ path += len;
+
+ if (!*path) {
+ t->must_be_dir = 0;
+ t->terminal = 1;
+ return;
+ }
+
+ while (*path == '/')
+ path++;
+ if (!*path) {
+ /*
+ * if we were already a terminal, then do not set
+ * must_be_dir; we are "foo/", but we already had a
+ * pathspec "foo", which is a superset of us.
+ */
+ if (!t->terminal)
+ t->must_be_dir = 1;
+ t->terminal = 1;
+ return;
+ }
+ }
+}
+
+struct pathspec_trie *pathspec_trie_build(const struct pathspec *pathspec)
+{
+ struct pathspec_trie *ret;
+ int i;
+
+ /* we only make a trie for plain-vanilla pathspecs */
+ if (pathspec->has_wildcard || (pathspec->magic & ~PATHSPEC_LITERAL))
+ return NULL;
+
+ ret = pathspec_trie_alloc("", 0);
+
+ /*
+ * XXX we could construct the trie more efficiently by creating each
+ * node with all of its entries in sorted order. But this is much
+ * simpler, and since we only do this once at the start of a traversal,
+ * it's probably fast enough.
+ */
+ for (i = 0; i < pathspec->nr; i++)
+ pathspec_trie_add(ret, pathspec->items[i].match);
+
+ return ret;
+}
+
+int pathspec_trie_lookup(const struct pathspec_trie *parent,
+ const char *path, size_t len)
+{
+ int lo = 0, hi = parent->nr;
+ while (lo < hi) {
+ int mi = lo + ((hi - lo) / 2);
+ int cmp;
+
+ cmp = pathspec_trie_cmp(path, len, parent->entries[mi]->path);
+ if (!cmp)
+ return mi;
+ if (cmp < 0)
+ hi = mi;
+ else
+ lo = mi + 1;
+ }
+ return -lo - 1;
+}
+
/*
* Finds which of the given pathspecs match items in the index.
*
diff --git a/pathspec.h b/pathspec.h
index de537cff3c..71bafb78a9 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -2,6 +2,7 @@
#define PATHSPEC_H
struct index_state;
+struct pathspec;
/* Pathspec magic */
#define PATHSPEC_FROMTOP (1<<0)
@@ -22,6 +23,28 @@ struct index_state;
#define PATHSPEC_ONESTAR 1 /* the pathspec pattern satisfies GFNM_ONESTAR */
+struct pathspec_trie {
+ struct pathspec_trie **entries;
+ int nr, alloc;
+ unsigned terminal:1,
+ must_be_dir:1;
+ char path[FLEX_ARRAY];
+};
+
+/*
+ * Build a pathspec_trie for the given pathspec.
+ */
+struct pathspec_trie *pathspec_trie_build(const struct pathspec *);
+
+/*
+ * Do a binary search on one level of the pathspec_trie. If found,
+ * returns the offset of the item in the entry list. If not found,
+ * return a negative value encoding the offset where it would be inserted
+ * (you can recover the true offset with "-pos - 1").
+ */
+int pathspec_trie_lookup(const struct pathspec_trie *pst,
+ const char *path, size_t len);
+
/**
* See glossary-content.txt for the syntax of pathspec.
* In memory, a pathspec set is represented by "struct pathspec" and is
diff --git a/t/helper/meson.build b/t/helper/meson.build
index d2cabaa2bc..f6dc3e443b 100644
--- a/t/helper/meson.build
+++ b/t/helper/meson.build
@@ -42,6 +42,7 @@ test_tool_sources = [
'test-partial-clone.c',
'test-path-utils.c',
'test-path-walk.c',
+ 'test-pathspec.c',
'test-pcre2-config.c',
'test-pkt-line.c',
'test-proc-receive.c',
diff --git a/t/helper/test-pathspec.c b/t/helper/test-pathspec.c
new file mode 100644
index 0000000000..c04417681f
--- /dev/null
+++ b/t/helper/test-pathspec.c
@@ -0,0 +1,98 @@
+#include "test-tool.h"
+#include "pathspec.h"
+
+static const char usage_msg[] =
+"test-tool pathspec trie [pathspecs...] -- [paths....]";
+
+/*
+ * XXX Yuck. This is a lot of complicated code specific to our test. Even if it
+ * runs correctly, we have no real guarantee that the actual trie users are
+ * doing it right. And reusing their code is tough, because it happens as part
+ * of their own traversals (e.g., we walk the pathspec trie while walking the
+ * tree objects themselves).
+ *
+ * This whole test program should probably go away in favor of directly testing
+ * the tree-diff code.
+ */
+static int trie_match(const struct pathspec_trie *pst,
+ const char *path)
+{
+ int pathlen = strlen(path);
+ int is_dir = 0;
+
+ if (pathlen > 0 && path[pathlen-1] == '/') {
+ is_dir = 1;
+ pathlen--;
+ }
+
+ while (pathlen) {
+ const char *slash = memchr(path, '/', pathlen);
+ int component_len;
+ int pos;
+
+ if (slash)
+ component_len = slash - path;
+ else
+ component_len = pathlen;
+
+ pos = pathspec_trie_lookup(pst, path, component_len);
+ if (pos < 0)
+ return 0;
+
+ pst = pst->entries[pos];
+ path += component_len;
+ pathlen -= component_len;
+
+ while (pathlen && *path == '/') {
+ path++;
+ pathlen--;
+ }
+
+ if (pst->terminal) {
+ if (!pst->must_be_dir)
+ return 1;
+ if (pathlen)
+ return 1;
+ return is_dir;
+ }
+ }
+ return 0;
+}
+
+static int cmd_trie(const char **argv)
+{
+ const char **specs, **paths;
+ struct pathspec pathspec;
+ struct pathspec_trie *trie;
+
+ paths = specs = argv;
+ while (*paths && strcmp(*paths, "--"))
+ paths++;
+ if (*paths)
+ *paths++ = NULL;
+
+ parse_pathspec(&pathspec, 0, 0, "", specs);
+ trie = pathspec_trie_build(&pathspec);
+ if (!trie)
+ die("unable to make trie from pathspec");
+
+ for (; *paths; paths++) {
+ if (trie_match(trie, *paths))
+ printf("yes\n");
+ else
+ printf("no\n");
+ }
+ return 0;
+}
+
+int cmd__pathspec(int argc UNUSED, const char **argv)
+{
+ const char *cmd = argv[1];
+
+ if (!cmd)
+ usage(usage_msg);
+ else if (!strcmp(cmd, "trie"))
+ return cmd_trie(argv + 2);
+ else
+ die("unknown cmd: %s", cmd);
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 50dc4dac4e..bc0a1b9245 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -54,6 +54,7 @@ static struct test_cmd cmds[] = {
{ "partial-clone", cmd__partial_clone },
{ "path-utils", cmd__path_utils },
{ "path-walk", cmd__path_walk },
+ { "pathspec", cmd__pathspec },
{ "pcre2-config", cmd__pcre2_config },
{ "pkt-line", cmd__pkt_line },
{ "proc-receive", cmd__proc_receive },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 41cc3730dc..5e79fa38ca 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -46,6 +46,7 @@ int cmd__parse_options_flags(int argc, const char **argv);
int cmd__parse_pathspec_file(int argc, const char** argv);
int cmd__parse_subcommand(int argc, const char **argv);
int cmd__partial_clone(int argc, const char **argv);
+int cmd__pathspec(int argc, const char **argv);
int cmd__path_utils(int argc, const char **argv);
int cmd__path_walk(int argc, const char **argv);
int cmd__pcre2_config(int argc, const char **argv);
diff --git a/t/meson.build b/t/meson.build
index 6f6485c8b4..bb29c36b2a 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -788,6 +788,7 @@ integration_tests = [
't6134-pathspec-in-submodule.sh',
't6135-pathspec-with-attrs.sh',
't6136-pathspec-in-bare.sh',
+ 't6137-pathspec-trie.sh',
't6200-fmt-merge-msg.sh',
't6300-for-each-ref.sh',
't6301-for-each-ref-errors.sh',
diff --git a/t/t6137-pathspec-trie.sh b/t/t6137-pathspec-trie.sh
new file mode 100755
index 0000000000..ca2935c7db
--- /dev/null
+++ b/t/t6137-pathspec-trie.sh
@@ -0,0 +1,57 @@
+#!/bin/sh
+
+test_description='test optimized pathspec trie lookup'
+. ./test-lib.sh
+
+# This is a basic lookup test; the offsets are there to provide
+# some variation in where we land in our binary search.
+ps="faa fzz foo/bar foo/baa foo/bzz"
+for offset in a "a b" "a b c"; do
+ test_expect_success "lookups with ps=$offset $ps" "
+ cat >expect <<-\\EOF &&
+ no
+ yes
+ yes
+ no
+ EOF
+ test-tool pathspec trie $offset $ps -- f faa foo/bar foo/baz >actual &&
+ test_cmp expect actual
+ "
+done
+
+test_expect_success 'pathspecs match by prefix' '
+ cat >expect <<-\EOF &&
+ yes
+ yes
+ yes
+ EOF
+ test-tool pathspec trie foo -- foo foo/bar foo/with/deep/subdirs >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'trailing slash sets must_be_dir' '
+ cat >expect <<-\EOF &&
+ no
+ yes
+ yes
+ EOF
+ test-tool pathspec trie dir/ -- dir dir/ dir/foo
+'
+
+test_expect_success 'overlapping pathspecs allow the "loose" side' '
+ echo yes >expect &&
+ test-tool pathspec trie foo foo/ -- foo >actual &&
+ test_cmp expect actual &&
+ test-tool pathspec trie foo/ foo -- foo >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success '"." at the root matches everything' '
+ cat >expect <<-\EOF &&
+ yes
+ yes
+ EOF
+ test-tool pathspec trie . -- foo foo/bar
+'
+
+test_done
--
2.49.0.rc2
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH 6/8] pathspec: turn on tries when appropriate
2025-03-26 20:18 [PATCH 0/8] Introduce git-blame-tree(1) command Toon Claes
` (4 preceding siblings ...)
2025-03-26 20:18 ` [PATCH 5/8] pathspec: add optional trie index Toon Claes
@ 2025-03-26 20:18 ` Toon Claes
2025-03-26 20:18 ` [PATCH 7/8] tree-diff: use pathspec tries Toon Claes
` (2 subsequent siblings)
8 siblings, 0 replies; 15+ messages in thread
From: Toon Claes @ 2025-03-26 20:18 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt
From: Jeff King <peff@peff.net>
An earlier commit introduced pathspec_tries, but we did not
actually generate them by default. This patch causes us to
do so when it is possible (i.e., when no wildcards or other
pathspec magic are in use). This doesn't actually do
anything yet, though, as none of the pathspec users have
learned to make use of the tries.
We embed the pathspec_trie directly inside the "struct
pathspec". This is not strictly necessary, as once created,
the trie does not depend on the original pathspec. However,
since the intended use is to optimize existing pathspec
callers, passing the trie around as part of the pathspec
will minimize disruption to the call chain.
Signed-off-by: Jeff King <peff@peff.net>
---
pathspec.c | 19 +++++++++++++++++++
pathspec.h | 1 +
t/helper/test-pathspec.c | 6 ++----
3 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/pathspec.c b/pathspec.c
index 0e381fd748..c174edef32 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -117,6 +117,19 @@ struct pathspec_trie *pathspec_trie_build(const struct pathspec *pathspec)
return ret;
}
+static void pathspec_trie_clear(struct pathspec_trie *t)
+{
+ if (t) {
+ for (size_t i = 0; i < t->nr; i++) {
+ pathspec_trie_clear(t->entries[i]);
+ FREE_AND_NULL(t->entries[i]);
+ }
+
+ t->nr = 0;
+ FREE_AND_NULL(t->entries);
+ }
+}
+
int pathspec_trie_lookup(const struct pathspec_trie *parent,
const char *path, size_t len)
{
@@ -799,6 +812,8 @@ void parse_pathspec(struct pathspec *pathspec,
BUG("PATHSPEC_MAXDEPTH_VALID and PATHSPEC_KEEP_ORDER are incompatible");
QSORT(pathspec->items, pathspec->nr, pathspec_item_cmp);
}
+
+ pathspec->trie = pathspec_trie_build(pathspec);
}
void parse_pathspec_file(struct pathspec *pathspec, unsigned magic_mask,
@@ -859,6 +874,8 @@ void copy_pathspec(struct pathspec *dst, const struct pathspec *src)
d->attr_check = attr_check_dup(s->attr_check);
}
+
+ dst->trie = pathspec_trie_build(dst);
}
void clear_pathspec(struct pathspec *pathspec)
@@ -877,6 +894,8 @@ void clear_pathspec(struct pathspec *pathspec)
attr_check_free(pathspec->items[i].attr_check);
}
+ pathspec_trie_clear(pathspec->trie);
+ FREE_AND_NULL(pathspec->trie);
FREE_AND_NULL(pathspec->items);
pathspec->nr = 0;
}
diff --git a/pathspec.h b/pathspec.h
index 71bafb78a9..ff11599d25 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -76,6 +76,7 @@ struct pathspec {
} *attr_match;
struct attr_check *attr_check;
} *items;
+ struct pathspec_trie *trie;
};
#define GUARD_PATHSPEC(ps, mask) \
diff --git a/t/helper/test-pathspec.c b/t/helper/test-pathspec.c
index c04417681f..72e335abc1 100644
--- a/t/helper/test-pathspec.c
+++ b/t/helper/test-pathspec.c
@@ -63,7 +63,6 @@ static int cmd_trie(const char **argv)
{
const char **specs, **paths;
struct pathspec pathspec;
- struct pathspec_trie *trie;
paths = specs = argv;
while (*paths && strcmp(*paths, "--"))
@@ -72,12 +71,11 @@ static int cmd_trie(const char **argv)
*paths++ = NULL;
parse_pathspec(&pathspec, 0, 0, "", specs);
- trie = pathspec_trie_build(&pathspec);
- if (!trie)
+ if (!pathspec.trie)
die("unable to make trie from pathspec");
for (; *paths; paths++) {
- if (trie_match(trie, *paths))
+ if (trie_match(pathspec.trie, *paths))
printf("yes\n");
else
printf("no\n");
--
2.49.0.rc2
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH 7/8] tree-diff: use pathspec tries
2025-03-26 20:18 [PATCH 0/8] Introduce git-blame-tree(1) command Toon Claes
` (5 preceding siblings ...)
2025-03-26 20:18 ` [PATCH 6/8] pathspec: turn on tries when appropriate Toon Claes
@ 2025-03-26 20:18 ` Toon Claes
2025-03-26 20:18 ` [PATCH 8/8] blame-tree: narrow pathspec as we traverse Toon Claes
2025-03-26 20:38 ` [PATCH 0/8] Introduce git-blame-tree(1) command Taylor Blau
8 siblings, 0 replies; 15+ messages in thread
From: Toon Claes @ 2025-03-26 20:18 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt
From: Jeff King <peff@peff.net>
The tree-diff currently matches each pathspec against every
entry of the tree. For the common case of a handful of
pathspecs, this is not a big deal. However, if you have a
large number of pathspecs, it gets noticeably slow.
Now that we have the pathspec_trie optimization, we can do
much better (at least for simple cases without wildcards).
Here are numbers for running "git rev-list" with limiting
pathspecs of varying sizes, both before and after this
patch:
Test origin HEAD
---------------------------------------------------------------
4003.2: size=1 0.12(0.11+0.00) 0.12(0.12+0.00) +0.0%
4003.3: size=2 0.17(0.16+0.00) 0.16(0.15+0.01) -5.9%
4003.4: size=4 0.17(0.17+0.00) 0.17(0.17+0.00) +0.0%
4003.5: size=8 0.21(0.20+0.00) 0.20(0.20+0.00) -4.8%
4003.6: size=16 0.25(0.24+0.00) 0.21(0.20+0.00) -16.0%
4003.7: size=32 0.31(0.31+0.00) 0.21(0.20+0.00) -32.3%
4003.8: size=64 0.43(0.41+0.01) 0.21(0.21+0.00) -51.2%
4003.9: size=128 0.73(0.72+0.00) 0.22(0.21+0.00) -69.9%
4003.10: size=256 2.02(2.02+0.00) 0.37(0.36+0.00) -81.7%
4003.11: size=512 6.78(6.78+0.00) 0.64(0.64+0.00) -90.6%
4003.12: size=1024 23.67(23.67+0.02) 1.22(1.20+0.01) -94.8%
For small pathspecs, we produce no real difference (which is
good; we know we are asymptotically better, but we have not
regressed our constant factor). Between 16 and 32 pathspecs we
start to see some small improvement, and the benefit keeps
growing with the number of pathspecs.
Obviously these large-pathspec cases are unusual. But you
might use them, for example, if the pathspecs were generated
programatically (e.g., if you want the history of all files
that are in Documentation/ _now_, not what was historically
ever there, you would expand the pathspec at the current
tree, and feed the result to rev-list).
Signed-off-by: Jeff King <peff@peff.net>
---
t/perf/p4003-diff-pathspec.sh | 26 ++++++++++++
tree-diff.c | 98 +++++++++++++++++++++++++++++++++++++------
2 files changed, 112 insertions(+), 12 deletions(-)
diff --git a/t/perf/p4003-diff-pathspec.sh b/t/perf/p4003-diff-pathspec.sh
new file mode 100755
index 0000000000..02312d1b0c
--- /dev/null
+++ b/t/perf/p4003-diff-pathspec.sh
@@ -0,0 +1,26 @@
+#!/bin/sh
+
+test_description='diff performance with many pathspecs'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+sizes='1 2 4 8 16 32 64 128 256 512 1024'
+
+test_expect_success 'create pathspec lists' '
+ git ls-tree --name-only -r HEAD >all &&
+ for i in $sizes; do
+ {
+ printf "%s\n" -- &&
+ head -$i all
+ } >ps-$i || return 1
+ done
+'
+
+for i in $sizes; do
+ test_perf "size=$i" "
+ git rev-list HEAD --stdin <ps-$i >/dev/null
+ "
+done
+
+test_done
diff --git a/tree-diff.c b/tree-diff.c
index 2a744dfaec..f3d916201b 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -120,7 +120,8 @@ static void ll_diff_tree_paths(
struct combine_diff_path ***tail, const struct object_id *oid,
const struct object_id **parents_oid, int nparent,
struct strbuf *base, struct diff_options *opt,
- int depth);
+ int depth, struct pathspec_trie *pst);
+
static void ll_diff_tree_oid(const struct object_id *old_oid,
const struct object_id *new_oid,
struct strbuf *base, struct diff_options *opt);
@@ -205,7 +206,7 @@ static int emit_diff_first_parent_only(struct diff_options *opt, struct combine_
static void emit_path(struct combine_diff_path ***tail,
struct strbuf *base, struct diff_options *opt,
int nparent, struct tree_desc *t, struct tree_desc *tp,
- int imin, int depth)
+ int imin, int depth, struct pathspec_trie *pst)
{
unsigned short mode;
const char *path;
@@ -309,23 +310,95 @@ static void emit_path(struct combine_diff_path ***tail,
parents_oid[i] = tpi_valid ? &tp[i].entry.oid : NULL;
}
+ /*
+ * As we recurse through the tree objects, move through
+ * our pathspec trie, as well. The one exception is if
+ * we already hit a terminal node. This means we have a strict
+ * prefix match (e.g., "foo/" matched, and we are in
+ * "foo/bar"). We don't have to bother with checking the
+ * pathspec at all anymore in that case.
+ *
+ * Note that the "pos < 0" case should not happen here,
+ * as we would have skipped the tree entry as uninteresting
+ * earlier. As a safety measure, we turn off the trie
+ * optimization and fall back to doing regular pathspec
+ * matching in this case.
+ */
+ if (pst && !pst->terminal) {
+ int pos = pathspec_trie_lookup(pst, path, pathlen);
+ if (pos < 0)
+ pst = NULL;
+ else
+ pst = pst->entries[pos];
+ }
+
strbuf_add(base, path, pathlen);
strbuf_addch(base, '/');
ll_diff_tree_paths(tail, oid, parents_oid, nparent, base, opt,
- depth + 1);
+ depth + 1, pst);
FAST_ARRAY_FREE(parents_oid, nparent);
}
strbuf_setlen(base, old_baselen);
}
+static enum interesting match_pathspec_trie_entry(struct pathspec_trie *pst,
+ const struct name_entry *entry)
+{
+ int pos;
+
+ /*
+ * If our base directory is matched, then everything below is
+ * interesting (i.e., a prefix match).
+ */
+ if (pst->terminal)
+ return entry_interesting;
+
+ /*
+ * Otherwise, look up the actual entry. If we don't mention it at all,
+ * it's definitely uninteresting. But furthermore, if we're at the
+ * end of our sorted list, we know that nothing after it is
+ * interesting, either.
+ *
+ * XXX It seems like we should have to make special consideration here
+ * for the sort order of trees. But tree_entry_interesting does not
+ * seem to. Is it OK, is tree_entry_interesting buggy too, or am I
+ * reading it wrong? This optimization gives substantial speedups, so
+ * we really need to keep it or something like it.
+ */
+ pos = pathspec_trie_lookup(pst, entry->path, tree_entry_len(entry));
+ if (pos < 0) {
+ if (-pos - 1 == pst->nr)
+ return all_entries_not_interesting;
+ else
+ return entry_not_interesting;
+ }
+
+ /*
+ * We definitely have the entry. First we have to resolve any directory
+ * restrictions; if there aren't any, then it's definitely interesting.
+ *
+ * Note that we do not need to check the "terminal" flag of the
+ * resulting trie node. If it is not set, then this particular entry
+ * does not match our pathspec, but we do still need to traverse
+ * through it to get to the interesting things inside. It's interesting
+ * either way.
+ */
+ if (pst->entries[pos]->must_be_dir)
+ return !!S_ISDIR(entry->mode);
+ return entry_interesting;
+}
+
static void skip_uninteresting(struct tree_desc *t, struct strbuf *base,
- struct diff_options *opt)
+ struct diff_options *opt,
+ struct pathspec_trie *pst)
{
enum interesting match;
while (t->size) {
- match = tree_entry_interesting(opt->repo->index, &t->entry,
+ match = pst ?
+ match_pathspec_trie_entry(pst, &t->entry) :
+ tree_entry_interesting(opt->repo->index, &t->entry,
base, &opt->pathspec);
if (match) {
if (match == all_entries_not_interesting)
@@ -433,7 +506,7 @@ static void ll_diff_tree_paths(
struct combine_diff_path ***tail, const struct object_id *oid,
const struct object_id **parents_oid, int nparent,
struct strbuf *base, struct diff_options *opt,
- int depth)
+ int depth, struct pathspec_trie *pst)
{
struct tree_desc t, *tp;
void *ttree, **tptree;
@@ -468,9 +541,9 @@ static void ll_diff_tree_paths(
break;
if (opt->pathspec.nr) {
- skip_uninteresting(&t, base, opt);
+ skip_uninteresting(&t, base, opt, pst);
for (i = 0; i < nparent; i++)
- skip_uninteresting(&tp[i], base, opt);
+ skip_uninteresting(&tp[i], base, opt, pst);
}
/* comparing is finished when all trees are done */
@@ -535,7 +608,7 @@ static void ll_diff_tree_paths(
/* D += {δ(t,pi) if pi=p[imin]; "+a" if pi > p[imin]} */
emit_path(tail, base, opt, nparent,
- &t, tp, imin, depth);
+ &t, tp, imin, depth, pst);
skip_emit_t_tp:
/* t↓, ∀ pi=p[imin] pi↓ */
@@ -547,7 +620,7 @@ static void ll_diff_tree_paths(
else if (cmp < 0) {
/* D += "+t" */
emit_path(tail, base, opt, nparent,
- &t, /*tp=*/NULL, -1, depth);
+ &t, /*tp=*/NULL, -1, depth, pst);
/* t↓ */
update_tree_entry(&t);
@@ -563,7 +636,7 @@ static void ll_diff_tree_paths(
}
emit_path(tail, base, opt, nparent,
- /*t=*/NULL, tp, imin, depth);
+ /*t=*/NULL, tp, imin, depth, pst);
skip_emit_tp:
/* ∀ pi=p[imin] pi↓ */
@@ -584,7 +657,8 @@ struct combine_diff_path *diff_tree_paths(
struct strbuf *base, struct diff_options *opt)
{
struct combine_diff_path *head = NULL, **tail = &head;
- ll_diff_tree_paths(&tail, oid, parents_oid, nparent, base, opt, 0);
+ ll_diff_tree_paths(&tail, oid, parents_oid, nparent, base, opt,
+ 0, opt->pathspec.trie);
return head;
}
--
2.49.0.rc2
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH 8/8] blame-tree: narrow pathspec as we traverse
2025-03-26 20:18 [PATCH 0/8] Introduce git-blame-tree(1) command Toon Claes
` (6 preceding siblings ...)
2025-03-26 20:18 ` [PATCH 7/8] tree-diff: use pathspec tries Toon Claes
@ 2025-03-26 20:18 ` Toon Claes
2025-03-26 20:38 ` [PATCH 0/8] Introduce git-blame-tree(1) command Taylor Blau
8 siblings, 0 replies; 15+ messages in thread
From: Toon Claes @ 2025-03-26 20:18 UTC (permalink / raw)
To: git; +Cc: Jeff King, Patrick Steinhardt
From: Jeff King <peff@peff.net>
As we find out the "last touched" for each path, we cease to
care about that path. However, we traditionally have not
told the revision-walking machinery about this, for two
reasons:
1. Munging the pathspecs mid-traversal is not an
officially supported thing to be doing. It does seem to
work, though, and with Duy's recent pathspec
refactoring, it's fairly straightforward.
2. The pathspec machinery is slow. We know have to look at
each pathspec to see if each tree entry is interesting.
And since our cost to handle a diff_filepair is so
cheap, it doesn't really help us much.
It turns out that (2) is not quite right, though. For the
diffs themselves, limiting the pathspec doesn't help at all.
But it does mean that we can simplify history more
aggressively, which may mean avoiding whole swaths of
history that don't touch interesting paths. This happens
especially near the end of the traversal, when we are only
interested in a few paths, and there are many side branches
that do not touch any of them.
And as of the last commit, the pathspec machinery is no
longer quadratic, so the cost to use it is easily worth
paying, even for large trees.
Here are runs of `git blame-tree --max-depth=0` for
torvalds/linux (a complex history with a small root tree)
and git/git (a flat hierarchy with a large root tree):
For git/git:
Benchmark 1: ./git.old blame-tree --max-depth=0
Time (mean ± σ): 2.223 s ± 0.360 s [User: 1.808 s, System: 0.369 s]
Range (min … max): 1.866 s … 2.927 s 10 runs
Benchmark 2: ./git.new blame-tree --max-depth=0
Time (mean ± σ): 945.0 ms ± 93.4 ms [User: 783.6 ms, System: 131.0 ms]
Range (min … max): 846.4 ms … 1071.5 ms 10 runs
Summary
./git.new blame-tree --max-depth=0 ran
2.35 ± 0.45 times faster than git.old blame-tree --max-depth=0
For torvalds/linux:
Benchmark 1: ./git.old blame-tree --max-depth=0
Time (mean ± σ): 13.504 s ± 0.545 s [User: 12.605 s, System: 0.583 s]
Range (min … max): 12.947 s … 14.488 s 10 runs
Benchmark 2: ./git.new blame-tree --max-depth=0
Time (mean ± σ): 622.8 ms ± 39.2 ms [User: 522.1 ms, System: 95.7 ms]
Range (min … max): 556.6 ms … 681.6 ms 10 runs
Summary
./git.new blame-tree --max-depth=0 ran
21.68 ± 1.62 times faster than ./git.old blame-tree --max-depth=0
Note that the output may change slightly in some cases due
to the history pruning. This should be acceptable, as either
output is correct in these cases (e.g., scenarios like a
cherry-picked commit on two parallel branches).
Signed-off-by: Jeff King <peff@peff.net>
---
blame-tree.c | 18 ++++++++++++++
pathspec.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
pathspec.h | 3 +++
3 files changed, 99 insertions(+)
diff --git a/blame-tree.c b/blame-tree.c
index ed4ec1e694..7be67520f8 100644
--- a/blame-tree.c
+++ b/blame-tree.c
@@ -79,6 +79,18 @@ void blame_tree_init(struct repository *r, struct blame_tree *bt,
if (add_from_revs(bt) < 0)
die(_("unable to setup blame-tree"));
+
+ pathspec_setup(&bt->rev.prune_data, &bt->paths);
+ copy_pathspec(&bt->rev.pruning.pathspec, &bt->rev.prune_data);
+ copy_pathspec(&bt->rev.diffopt.pathspec, &bt->rev.prune_data);
+ bt->rev.prune = 1;
+
+ /*
+ * Have the diff engine tell us about everything, including trees.
+ * We may have used --max-depth to get our list of paths to blame,
+ * in which case we would mention trees explicitly.
+ */
+ bt->rev.diffopt.flags.tree_in_recursive = 1;
}
void blame_tree_release(struct blame_tree *bt)
@@ -91,6 +103,7 @@ struct blame_tree_callback_data {
struct commit *commit;
struct string_list *paths;
size_t num_interesting;
+ struct rev_info *rev;
blame_tree_fn callback;
void *callback_data;
@@ -122,6 +135,10 @@ static void mark_path(const char *path, const struct object_id *oid,
data->num_interesting--;
if (data->callback)
data->callback(path, data->commit, data->callback_data);
+
+ pathspec_drop(&data->rev->pruning.pathspec, path);
+ clear_pathspec(&data->rev->diffopt.pathspec);
+ copy_pathspec(&data->rev->diffopt.pathspec, &data->rev->pruning.pathspec);
}
static void blame_diff(struct diff_queue_struct *q,
@@ -172,6 +189,7 @@ int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata)
data.num_interesting = bt->paths.nr;
data.callback = cb;
data.callback_data = cbdata;
+ data.rev = &bt->rev;
bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
bt->rev.diffopt.format_callback = blame_diff;
diff --git a/pathspec.c b/pathspec.c
index c174edef32..7ef2a55280 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -117,6 +117,50 @@ struct pathspec_trie *pathspec_trie_build(const struct pathspec *pathspec)
return ret;
}
+static void pathspec_drop_from_trie(struct pathspec *ps,
+ const char *path)
+{
+ struct pathspec_trie *prev = NULL, *cur = ps->trie;
+ int pos = -1;
+
+ if (!cur)
+ return;
+
+ while (*path) {
+ const char *end = strchrnul(path, '/');
+ size_t len = end - path;
+ pos = pathspec_trie_lookup(cur, path, len);
+
+ if (pos < 0)
+ die("BUG: didn't find the pathspec trie we matched");
+
+ prev = cur;
+ cur = cur->entries[pos];
+ path = end;
+ while (*path == '/')
+ path++;
+ }
+
+ if (!cur->terminal)
+ die("BUG: pathspec trie we found isn't terminal?");
+
+ if (cur->nr) {
+ cur->terminal = 0;
+ cur->must_be_dir = 0;
+ return;
+ }
+
+ free(cur);
+ if (pos < 0)
+ ps->trie = NULL;
+ else if (prev) {
+ prev->nr--;
+ memmove(prev->entries + pos,
+ prev->entries + pos + 1,
+ sizeof(*prev->entries) * (prev->nr - pos));
+ }
+}
+
static void pathspec_trie_clear(struct pathspec_trie *t)
{
if (t) {
@@ -878,6 +922,40 @@ void copy_pathspec(struct pathspec *dst, const struct pathspec *src)
dst->trie = pathspec_trie_build(dst);
}
+void pathspec_setup(struct pathspec *ps, const struct string_list *paths)
+{
+ struct strvec argv = STRVEC_INIT;
+ size_t i;
+
+ for (i = 0; i < paths->nr; i++)
+ strvec_push(&argv, paths->items[i].string);
+
+ clear_pathspec(ps);
+ parse_pathspec(ps, PATHSPEC_ALL_MAGIC & ~PATHSPEC_LITERAL,
+ PATHSPEC_PREFER_FULL | PATHSPEC_LITERAL_PATH, "",
+ argv.v);
+ strvec_clear(&argv);
+}
+
+void pathspec_drop(struct pathspec *ps, const char *path)
+{
+ int i;
+
+ /* We know these are literals, so we can just strcmp */
+ for (i = 0; i < ps->nr; i++)
+ if (!strcmp(ps->items[i].match, path))
+ break;
+
+ if (i == ps->nr)
+ die("BUG: didn't find the pathspec we just matched");
+
+ memmove(ps->items + i, ps->items + i + 1,
+ sizeof(*ps->items) * (ps->nr - i - 1));
+ ps->nr--;
+
+ pathspec_drop_from_trie(ps, path);
+}
+
void clear_pathspec(struct pathspec *pathspec)
{
int i, j;
diff --git a/pathspec.h b/pathspec.h
index ff11599d25..a46b3177e3 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -3,6 +3,7 @@
struct index_state;
struct pathspec;
+struct string_list;
/* Pathspec magic */
#define PATHSPEC_FROMTOP (1<<0)
@@ -152,6 +153,8 @@ void parse_pathspec_file(struct pathspec *pathspec,
int nul_term_line);
void copy_pathspec(struct pathspec *dst, const struct pathspec *src);
+void pathspec_setup(struct pathspec *ps, const struct string_list *paths);
+void pathspec_drop(struct pathspec *ps, const char *path);
void clear_pathspec(struct pathspec *);
/*
--
2.49.0.rc2
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH 0/8] Introduce git-blame-tree(1) command
2025-03-26 20:18 [PATCH 0/8] Introduce git-blame-tree(1) command Toon Claes
` (7 preceding siblings ...)
2025-03-26 20:18 ` [PATCH 8/8] blame-tree: narrow pathspec as we traverse Toon Claes
@ 2025-03-26 20:38 ` Taylor Blau
2025-03-27 6:32 ` Jeff King
2025-03-27 12:04 ` Derrick Stolee
8 siblings, 2 replies; 15+ messages in thread
From: Taylor Blau @ 2025-03-26 20:38 UTC (permalink / raw)
To: Toon Claes
Cc: git, Jeff King, Patrick Steinhardt,
Ævar Arnfjörð Bjarmason
On Wed, Mar 26, 2025 at 09:18:24PM +0100, Toon Claes wrote:
> This is yet another attempt to upstream the builtin command
> `git-blame-tree(1)`. This command is similar to git-blame(1) and shows
> the most recent modification to paths in a tree..
>
> The last attempt (I'm aware of) was made by Ævar in 2023[1]. That
> series was based of patches by Peff written in 2011[2].
For what it's worth, the blame-tree implementation that this came from
has evolved significantly since it was originally written in 2011. Most
recently Stolee and I worked on a version that uses changed-path Bloom
filters to narrow the search, passing un-blamed paths to their parents
at each level of the traversal.
I wonder if it would be easier to start from scratch with the modern
implementation rather than land this one and try to build on top of it.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH 0/8] Introduce git-blame-tree(1) command
2025-03-26 20:38 ` [PATCH 0/8] Introduce git-blame-tree(1) command Taylor Blau
@ 2025-03-27 6:32 ` Jeff King
2025-03-27 21:58 ` Taylor Blau
2025-03-27 12:04 ` Derrick Stolee
1 sibling, 1 reply; 15+ messages in thread
From: Jeff King @ 2025-03-27 6:32 UTC (permalink / raw)
To: Taylor Blau
Cc: Toon Claes, git, Patrick Steinhardt,
Ævar Arnfjörð Bjarmason
On Wed, Mar 26, 2025 at 04:38:59PM -0400, Taylor Blau wrote:
> On Wed, Mar 26, 2025 at 09:18:24PM +0100, Toon Claes wrote:
> > This is yet another attempt to upstream the builtin command
> > `git-blame-tree(1)`. This command is similar to git-blame(1) and shows
> > the most recent modification to paths in a tree..
> >
> > The last attempt (I'm aware of) was made by Ævar in 2023[1]. That
> > series was based of patches by Peff written in 2011[2].
>
> For what it's worth, the blame-tree implementation that this came from
> has evolved significantly since it was originally written in 2011. Most
> recently Stolee and I worked on a version that uses changed-path Bloom
> filters to narrow the search, passing un-blamed paths to their parents
> at each level of the traversal.
>
> I wonder if it would be easier to start from scratch with the modern
> implementation rather than land this one and try to build on top of it.
Yeah, I'd suggest starting with that work from you and Stolee (though I
do not know if it was ever made public?). It should be much faster and
will have been battle-tested in production.
The pathspec-trie stuff is, I think, still a reasonable idea for general
use. But IIRC, the rewritten blame-tree you guys worked on does not
benefit from it, because it ditches pathspecs entirely (both because
they're too slow without the tries, but also because it's important to
continually narrow the pathspec while traversing). That trie code was
never run in production, I think (and I see there is a patch to narrow
the pathspec while traversing; I suspect that likewise was never used).
The max-depth diff code is also in theory a reasonable thing to have in
general. But it is awkward to use, and not really necessary for
blame-tree. There we really only care about recursing vs not recursing,
but the usual "recursive" flag for diffing isn't enough (we have to
recurse down to the tree of interest, but may not want to go further). I
don't remember how that is handled in your blame-tree rewrites.
So that really mostly leaves the blame-tree scaffolding itself. I
remember Junio left a lot of good comments on the original thread on how
merges should be handled, but I don't think I ever fixed those bits. I
don't recall what your rewritten code does there, but I think it may
have improved things.
So yeah. I don't know if all of this is really a very good starting
point. Taylor, if you can share the current code that GitHub is running,
I think that would be beneficial for the community.
-Peff
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/8] Introduce git-blame-tree(1) command
2025-03-27 6:32 ` Jeff King
@ 2025-03-27 21:58 ` Taylor Blau
2025-03-28 13:27 ` Toon Claes
2025-04-01 9:29 ` Jeff King
0 siblings, 2 replies; 15+ messages in thread
From: Taylor Blau @ 2025-03-27 21:58 UTC (permalink / raw)
To: Jeff King
Cc: Toon Claes, git, Patrick Steinhardt,
Ævar Arnfjörð Bjarmason
On Thu, Mar 27, 2025 at 02:32:43AM -0400, Jeff King wrote:
> The pathspec-trie stuff is, I think, still a reasonable idea for general
> use. But IIRC, the rewritten blame-tree you guys worked on does not
> benefit from it, because it ditches pathspecs entirely (both because
> they're too slow without the tries, but also because it's important to
> continually narrow the pathspec while traversing). That trie code was
> never run in production, I think (and I see there is a patch to narrow
> the pathspec while traversing; I suspect that likewise was never used).
Yeah, the rewritten blame-tree code uses changed-path Bloom filters to
narrow the set of revisions that we need to actually compute tree-diffs
for.
The general idea is that we have a set of paths that we have yet to
blame, and those are the "interesting" ones. IOW, if a changed-path
Bloom filter tells us that we are at some revision where there is maybe
a change to one or more unblamed paths, we need to compute a tree-diff.
But if the Bloom filter says "no", then we can skip the tree-diff at
that layer entirely.
> The max-depth diff code is also in theory a reasonable thing to have in
> general. But it is awkward to use, and not really necessary for
> blame-tree. There we really only care about recursing vs not recursing,
> but the usual "recursive" flag for diffing isn't enough (we have to
> recurse down to the tree of interest, but may not want to go further). I
> don't remember how that is handled in your blame-tree rewrites.
I think that's mostly true, but the blame-tree caching stuff that Stolee
worked on many years ago and mentioned below does require it IIRC.
> So yeah. I don't know if all of this is really a very good starting
> point. Taylor, if you can share the current code that GitHub is running,
> I think that would be beneficial for the community.
Sure. You can fetch from the 'tb/blame-tree' branch from my tree (which
is located at 'git@github.com:ttaylorr/git.git'). I owe a huge "thank
you" to Victoria Dye, who split out the various topics from GitHub's
fork into individual rebased branches.
There were many more patches on top that came after Victoria's split
above, and I applied those manually. The commit structure probably needs
significant clean-up and polishing before it's ready for serious review,
since this is more-or-less a raw dump of the work on GitHub's side over
more than a decade.
It also doesn't pass the tests in t9932 (and the test number should
probably also be reworked, it's in the t99xx range so that inclusion in
GitHub's fork doesn't cause collisions with new tests when we merge
upstream). To that end, I removed everybody's Signed-off-by in case I
have mangled their work in some way unintentionally.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/8] Introduce git-blame-tree(1) command
2025-03-27 21:58 ` Taylor Blau
@ 2025-03-28 13:27 ` Toon Claes
2025-04-01 9:29 ` Jeff King
1 sibling, 0 replies; 15+ messages in thread
From: Toon Claes @ 2025-03-28 13:27 UTC (permalink / raw)
To: Taylor Blau, Jeff King
Cc: git, Patrick Steinhardt, Ævar Arnfjörð Bjarmason
Taylor Blau <me@ttaylorr.com> writes:
> Sure. You can fetch from the 'tb/blame-tree' branch from my tree (which
> is located at 'git@github.com:ttaylorr/git.git'). I owe a huge "thank
> you" to Victoria Dye, who split out the various topics from GitHub's
> fork into individual rebased branches.
>
> There were many more patches on top that came after Victoria's split
> above, and I applied those manually. The commit structure probably needs
> significant clean-up and polishing before it's ready for serious review,
> since this is more-or-less a raw dump of the work on GitHub's side over
> more than a decade.
>
> It also doesn't pass the tests in t9932 (and the test number should
> probably also be reworked, it's in the t99xx range so that inclusion in
> GitHub's fork doesn't cause collisions with new tests when we merge
> upstream). To that end, I removed everybody's Signed-off-by in case I
> have mangled their work in some way unintentionally.
Thanks for sharing. I will check it out.
--
Toon
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/8] Introduce git-blame-tree(1) command
2025-03-27 21:58 ` Taylor Blau
2025-03-28 13:27 ` Toon Claes
@ 2025-04-01 9:29 ` Jeff King
1 sibling, 0 replies; 15+ messages in thread
From: Jeff King @ 2025-04-01 9:29 UTC (permalink / raw)
To: Taylor Blau
Cc: Toon Claes, git, Patrick Steinhardt,
Ævar Arnfjörð Bjarmason
On Thu, Mar 27, 2025 at 05:58:19PM -0400, Taylor Blau wrote:
> On Thu, Mar 27, 2025 at 02:32:43AM -0400, Jeff King wrote:
> > The pathspec-trie stuff is, I think, still a reasonable idea for general
> > use. But IIRC, the rewritten blame-tree you guys worked on does not
> > benefit from it, because it ditches pathspecs entirely (both because
> > they're too slow without the tries, but also because it's important to
> > continually narrow the pathspec while traversing). That trie code was
> > never run in production, I think (and I see there is a patch to narrow
> > the pathspec while traversing; I suspect that likewise was never used).
>
> Yeah, the rewritten blame-tree code uses changed-path Bloom filters to
> narrow the set of revisions that we need to actually compute tree-diffs
> for.
>
> The general idea is that we have a set of paths that we have yet to
> blame, and those are the "interesting" ones. IOW, if a changed-path
> Bloom filter tells us that we are at some revision where there is maybe
> a change to one or more unblamed paths, we need to compute a tree-diff.
> But if the Bloom filter says "no", then we can skip the tree-diff at
> that layer entirely.
You'd still in theory benefit from the tree-diffs you _do_ run using a
continually narrowing pathspec. Skimming over the code from your
tb/blame-tree branch, it looks like it's just fed the original pathspec.
But that's probably good enough in practice. Especially for
non-recursive blame-trees, where pruning already-matched entries will
never save you from opening another tree anyway.
> > So yeah. I don't know if all of this is really a very good starting
> > point. Taylor, if you can share the current code that GitHub is running,
> > I think that would be beneficial for the community.
>
> Sure. You can fetch from the 'tb/blame-tree' branch from my tree (which
> is located at 'git@github.com:ttaylorr/git.git'). I owe a huge "thank
> you" to Victoria Dye, who split out the various topics from GitHub's
> fork into individual rebased branches.
Thanks. I don't have time to pick it up as a topic myself, but hopefully
it can be useful to Toon (or any others interested in the topic).
-Peff
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/8] Introduce git-blame-tree(1) command
2025-03-26 20:38 ` [PATCH 0/8] Introduce git-blame-tree(1) command Taylor Blau
2025-03-27 6:32 ` Jeff King
@ 2025-03-27 12:04 ` Derrick Stolee
1 sibling, 0 replies; 15+ messages in thread
From: Derrick Stolee @ 2025-03-27 12:04 UTC (permalink / raw)
To: Taylor Blau, Toon Claes
Cc: git, Jeff King, Patrick Steinhardt,
Ævar Arnfjörð Bjarmason
On 3/26/2025 4:38 PM, Taylor Blau wrote:
> On Wed, Mar 26, 2025 at 09:18:24PM +0100, Toon Claes wrote:
>> This is yet another attempt to upstream the builtin command
>> `git-blame-tree(1)`. This command is similar to git-blame(1) and shows
>> the most recent modification to paths in a tree..
>>
>> The last attempt (I'm aware of) was made by Ævar in 2023[1]. That
>> series was based of patches by Peff written in 2011[2].
>
> For what it's worth, the blame-tree implementation that this came from
> has evolved significantly since it was originally written in 2011. Most
> recently Stolee and I worked on a version that uses changed-path Bloom
> filters to narrow the search, passing un-blamed paths to their parents
> at each level of the traversal.
It's worth mentioning that the underlying algorithm was nearly rebuilt
from scratch with this "passing un-blamed paths to their parents" aspect,
which unlocked other features such as caching results to be used by
future queries, even when the tip branch advances.
With that in mind, using the 2011 version is unlikely to be valuable.
Taylor: do you have a drop of the latest blame-tree implementation that
could be shared with Toon?
Thanks,
-Stolee
^ permalink raw reply [flat|nested] 15+ messages in thread