* [PATCH RFC 0/5] Introduce git-blame-tree(1) command
@ 2025-04-22 17:46 Toon Claes
2025-04-22 17:46 ` [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files Toon Claes
` (7 more replies)
0 siblings, 8 replies; 135+ messages in thread
From: Toon Claes @ 2025-04-22 17:46 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason, Derrick Stolee
This is another attempt to upstream the git-blame-tree(1) subcommand.
After the previous attempt[1] the people of GitHub shared their version
of the subcommand, and this version integrates those changes.
What is different from the series shared by GitHub:
* Patches for --max-depth are excluded. I think it's a separate topic to
discuss and I'm not sure it needs to be part of blame-tree anyway. The
main patch was submitted in the previous attempt[2] and if people
consider it valuable, I'm happy to discuss that in a separate patch
series.
* The patches in 'tb/blame-tree' at Taylor's fork[3] implement a
caching layer. This feature reads/writes cached blame-tree results in
`.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
size, that feature is excluded from this series. I think it's better
to submit this as a separate series.
* Squashed various commits together. For example, they introduced a
`--go-faster` flag, which later became the default and only
implementation. That story is now wrapped up in a single commit.
* The blame-tree command isn't recursive by default. If you want to
recurse into subtrees, you need to pass `-r`.
* Fixed all memory leaks, and removed the use of
USE_THE_REPOSITORY_VARIABLE.
I've attempted to reuse commit messages as well as possible, but feel
free to correct me where you think I didn't give proper credit or messed
up. I have no idea what to do with the Signed-off-by trailers, though.
I didn't modify the benchmark results in the commit messages, simply
because I didn't get comparable results. In my benchmarks the difference
between the two implementations was negligible, and in some scenarios
the "improved" implementation even performed worse. As far as I can
tell, I didn't break anything in my refactoring, because the version in
these patches behaves similarly to Taylor's branch. To be honest, I
cannot explain why...?
With this version I'd like to gather as much feedback as possible for a
next version. I realize this feature is far from done, so that's why I'm
submitting it as an RFC.
Again thanks to Taylor and the people at GitHub for sharing these
patches. I hope we can work together to get this upstreamed.
[1]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-0-4173133f3786@iotcl.com/
[2]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-3-4173133f3786@iotcl.com/
[3]: git@github.com:ttaylorr/git.git
--
Toon
Signed-off-by: Toon Claes <toon@iotcl.com>
---
Jeff King (1):
t/perf: add blame-tree perf script
Taylor Blau (2):
blame-tree: use Bloom filters when available
blame-tree: implement faster algorithm
Toon Claes (2):
blame-tree: introduce new subcommand to blame files
blame-tree.c: initialize revision machinery without walk
.gitignore | 1 +
Makefile | 2 +
blame-tree.c | 496 +++++++++++++++++++++++++++++++++++++++++++++
blame-tree.h | 30 +++
builtin.h | 1 +
builtin/blame-tree.c | 43 ++++
git.c | 1 +
meson.build | 2 +
t/helper/test-tool.h | 1 +
t/meson.build | 1 +
t/perf/p8020-blame-tree.sh | 21 ++
t/t8020-blame-tree.sh | 148 ++++++++++++++
12 files changed, 747 insertions(+)
---
base-commit: 4bbb303af69990ccd05fe3a2eb58a1ce036f8220
change-id: 20250410-toon-new-blame-tree-bcdbb78c1c0f
Thanks
--
Toon
* [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
@ 2025-04-22 17:46 ` Toon Claes
2025-04-24 16:19 ` Junio C Hamano
2025-04-22 17:46 ` [PATCH RFC 2/5] t/perf: add blame-tree perf script Toon Claes
` (6 subsequent siblings)
7 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-04-22 17:46 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason
Similar to git-blame(1), introduce a new subcommand git-blame-tree(1).
This command shows the most recent modification to paths in a tree. It
does so by expanding the tree at a given commit, taking note of the
current state of each path, and then walking backwards through history
looking for commits where each path changed into its final object ID.
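To illustrate the walk described above, here is a minimal, self-contained
sketch on a toy linear history. It is not the code from this patch: the
`tree` table, the integer "object IDs" and `toy_blame_tree()` are
hypothetical stand-ins for the real tree expansion, diff machinery and
revision walk.

```c
#include <assert.h>

#define NPATHS 3
#define NCOMMITS 4

/*
 * Hypothetical toy history: tree[c][p] is the "object ID" of path p at
 * commit c (0 means the path does not exist). Commit 0 is the root and
 * commit c - 1 is the only parent of commit c.
 */
static const int tree[NCOMMITS][NPATHS] = {
	{ 10,  0,  0 },	/* commit 0 adds path 0 */
	{ 10, 20,  0 },	/* commit 1 adds path 1 */
	{ 11, 20, 30 },	/* commit 2 changes path 0 and adds path 2 */
	{ 11, 21, 30 },	/* commit 3 changes path 1 */
};

/*
 * For every path present at `tip`, store in blamed[p] the newest commit
 * whose diff against its parent produced the path's final ID. Paths that
 * do not exist at `tip` get -1.
 */
static void toy_blame_tree(int tip, int blamed[NPATHS])
{
	int remaining = 0;

	assert(tip >= 0 && tip < NCOMMITS);

	/* Expand the tree at the tip and note which paths need blaming. */
	for (int p = 0; p < NPATHS; p++) {
		blamed[p] = -1;
		if (tree[tip][p])
			remaining++;
	}

	/* Walk backwards until every path has been blamed. */
	for (int c = tip; c >= 0 && remaining; c--) {
		for (int p = 0; p < NPATHS; p++) {
			/* The root commit is diffed against the empty tree. */
			int before = c ? tree[c - 1][p] : 0;
			int after = tree[c][p];

			if (!tree[tip][p] || blamed[p] != -1)
				continue; /* not of interest, or already blamed */
			if (before == after)
				continue; /* this commit did not touch the path */
			if (after != tree[tip][p])
				continue; /* touched, but not into its final state */

			blamed[p] = c;
			remaining--;
		}
	}
}
```

Calling `toy_blame_tree(3, blamed)` on this table blames path 0 on commit 2,
path 1 on commit 3 and path 2 on commit 2. The loop stops as soon as no
unblamed paths remain, which mirrors the `hashmap_get_size()` check in
blame_tree_run().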
Based-on-patch-by: Jeff King <peff@peff.net>
Improved-by: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
.gitignore | 1 +
Makefile | 2 +
blame-tree.c | 213 ++++++++++++++++++++++++++++++++++++++++++++++++++
blame-tree.h | 27 +++++++
builtin.h | 1 +
builtin/blame-tree.c | 43 ++++++++++
git.c | 1 +
meson.build | 2 +
t/helper/test-tool.h | 1 +
t/meson.build | 1 +
t/t8020-blame-tree.sh | 147 ++++++++++++++++++++++++++++++++++
11 files changed, 439 insertions(+)
diff --git a/.gitignore b/.gitignore
index 04c444404e..ba23d5b098 100644
--- a/.gitignore
+++ b/.gitignore
@@ -22,6 +22,7 @@
/git-backfill
/git-bisect
/git-blame
+/git-blame-tree
/git-branch
/git-bugreport
/git-bundle
diff --git a/Makefile b/Makefile
index 13f9062a05..aaf22af0b8 100644
--- a/Makefile
+++ b/Makefile
@@ -972,6 +972,7 @@ LIB_OBJS += archive.o
LIB_OBJS += attr.o
LIB_OBJS += base85.o
LIB_OBJS += bisect.o
+LIB_OBJS += blame-tree.o
LIB_OBJS += blame.o
LIB_OBJS += blob.o
LIB_OBJS += bloom.o
@@ -1216,6 +1217,7 @@ BUILTIN_OBJS += builtin/archive.o
BUILTIN_OBJS += builtin/backfill.o
BUILTIN_OBJS += builtin/bisect.o
BUILTIN_OBJS += builtin/blame.o
+BUILTIN_OBJS += builtin/blame-tree.o
BUILTIN_OBJS += builtin/branch.o
BUILTIN_OBJS += builtin/bugreport.o
BUILTIN_OBJS += builtin/bundle.o
diff --git a/blame-tree.c b/blame-tree.c
new file mode 100644
index 0000000000..ce57db2cfc
--- /dev/null
+++ b/blame-tree.c
@@ -0,0 +1,213 @@
+#include "git-compat-util.h"
+#include "blame-tree.h"
+#include "commit.h"
+#include "diffcore.h"
+#include "diff.h"
+#include "object.h"
+#include "revision.h"
+#include "repository.h"
+#include "log-tree.h"
+
+struct blame_tree_entry {
+ struct hashmap_entry hashent;
+ struct object_id oid;
+ struct commit *commit;
+ const char path[FLEX_ARRAY];
+};
+
+static void add_from_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED,
+ void *data)
+{
+ struct blame_tree *bt = data;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct blame_tree_entry *ent;
+ const char *path = p->two->path;
+
+ FLEX_ALLOC_STR(ent, path, path);
+ oidcpy(&ent->oid, &p->two->oid);
+ hashmap_entry_init(&ent->hashent, strhash(ent->path));
+ hashmap_add(&bt->paths, &ent->hashent);
+ }
+}
+
+static int add_from_revs(struct blame_tree *bt)
+{
+ size_t count = 0;
+ struct diff_options diffopt;
+
+ memcpy(&diffopt, &bt->rev.diffopt, sizeof(diffopt));
+ copy_pathspec(&diffopt.pathspec, &bt->rev.diffopt.pathspec);
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ diffopt.format_callback = add_from_diff;
+ diffopt.format_callback_data = bt;
+
+ for (size_t i = 0; i < bt->rev.pending.nr; i++) {
+ struct object_array_entry *obj = bt->rev.pending.objects + i;
+
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
+ if (count++)
+ return error(_("can only blame one tree at a time"));
+
+ diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
+ clear_pathspec(&diffopt.pathspec);
+
+ return 0;
+}
+
+static int blame_tree_entry_hashcmp(const void *unused UNUSED,
+ const struct hashmap_entry *he1,
+ const struct hashmap_entry *he2,
+ const void *path)
+{
+ const struct blame_tree_entry *e1 =
+ container_of(he1, const struct blame_tree_entry, hashent);
+ const struct blame_tree_entry *e2 =
+ container_of(he2, const struct blame_tree_entry, hashent);
+ return strcmp(e1->path, path ? path : e2->path);
+}
+
+void blame_tree_init(struct blame_tree *bt,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv)
+{
+ memset(bt, 0, sizeof(*bt));
+ hashmap_init(&bt->paths, blame_tree_entry_hashcmp, NULL, 0);
+
+ repo_init_revisions(r, &bt->rev, prefix);
+ bt->rev.def = "HEAD";
+ bt->rev.combine_merges = 1;
+ bt->rev.show_root_diff = 1;
+ bt->rev.boundary = 1;
+ bt->rev.no_commit_id = 1;
+ bt->rev.diff = 1;
+ if (setup_revisions(argc, argv, &bt->rev, NULL) > 1)
+ die(_("unknown blame-tree argument: %s"), argv[1]);
+
+ if (add_from_revs(bt) < 0)
+ die(_("unable to setup blame-tree"));
+}
+
+void blame_tree_release(struct blame_tree *bt)
+{
+ hashmap_clear_and_free(&bt->paths, struct blame_tree_entry, hashent);
+ release_revisions(&bt->rev);
+}
+
+struct blame_tree_callback_data {
+ struct commit *commit;
+ struct hashmap *paths;
+
+ blame_tree_callback callback;
+ void *callback_data;
+};
+
+static void mark_path(const char *path, const struct object_id *oid,
+ struct blame_tree_callback_data *data)
+{
+ struct blame_tree_entry *ent;
+
+ /* Is it even a path that we are interested in? */
+ ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
+ struct blame_tree_entry, hashent);
+ if (!ent)
+ return;
+
+ /* Have we already blamed a commit? */
+ if (ent->commit)
+ return;
+
+ /*
+ * Is it arriving at a version of interest, or is it from a side branch
+ * which did not contribute to the final state?
+ */
+ if (!oideq(oid, &ent->oid))
+ return;
+
+ ent->commit = data->commit;
+ if (data->callback)
+ data->callback(path, data->commit, data->callback_data);
+
+ hashmap_remove(data->paths, &ent->hashent, path);
+ free(ent);
+}
+
+static void blame_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *cbdata)
+{
+ struct blame_tree_callback_data *data = cbdata;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ switch (p->status) {
+ case DIFF_STATUS_DELETED:
+ /*
+ * There's no point in feeding a deletion, as it could
+ * not have resulted in our current state, which
+ * actually has the file.
+ */
+ break;
+
+ default:
+ /*
+ * Otherwise, we care only that we somehow arrived at
+ * a final path/sha1 state. Note that this covers some
+ * potentially controversial areas, including:
+ *
+ * 1. A rename or copy will be blamed, as it is the
+ * first time the content has arrived at the given
+ * path.
+ *
+ * 2. Even a non-content modification like a mode or
+ * type change will trigger it.
+ *
+ * We take the inclusive approach for now, and blame
+ * anything which impacts the path. Options to tweak
+ * the behavior (e.g., to "--follow" the content across
+ * renames) can come later.
+ */
+ mark_path(p->two->path, &p->two->oid, data);
+ break;
+ }
+ }
+}
+
+int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
+{
+ struct blame_tree_callback_data data;
+
+ data.paths = &bt->paths;
+ data.callback = cb;
+ data.callback_data = cbdata;
+
+ bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ bt->rev.diffopt.format_callback = blame_diff;
+ bt->rev.diffopt.format_callback_data = &data;
+
+ prepare_revision_walk(&bt->rev);
+
+ while (hashmap_get_size(&bt->paths)) {
+ data.commit = get_revision(&bt->rev);
+ if (!data.commit)
+ break;
+
+ if (data.commit->object.flags & BOUNDARY) {
+ diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
+ &data.commit->object.oid,
+ "", &bt->rev.diffopt);
+ diff_flush(&bt->rev.diffopt);
+ } else {
+ log_tree_commit(&bt->rev, data.commit);
+ }
+ }
+
+ return 0;
+}
diff --git a/blame-tree.h b/blame-tree.h
new file mode 100644
index 0000000000..abb467cf1b
--- /dev/null
+++ b/blame-tree.h
@@ -0,0 +1,27 @@
+#ifndef BLAME_TREE_H
+#define BLAME_TREE_H
+
+#include "commit.h"
+#include "revision.h"
+#include "hashmap.h"
+
+struct blame_tree {
+ struct hashmap paths;
+ struct rev_info rev;
+};
+
+void blame_tree_init(struct blame_tree *bt,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv);
+
+void blame_tree_release(struct blame_tree *);
+
+typedef void (*blame_tree_callback)(const char *path,
+ const struct commit *commit,
+ void *data);
+int blame_tree_run(struct blame_tree *,
+ blame_tree_callback cb,
+ void *data);
+
+#endif /* BLAME_TREE_H */
diff --git a/builtin.h b/builtin.h
index bff13e3069..c7b06130b6 100644
--- a/builtin.h
+++ b/builtin.h
@@ -123,6 +123,7 @@ int cmd_archive(int argc, const char **argv, const char *prefix, struct reposito
int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_bisect(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_blame(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_blame_tree(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_branch(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_bugreport(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_bundle(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/blame-tree.c b/builtin/blame-tree.c
new file mode 100644
index 0000000000..aaa3e9daa1
--- /dev/null
+++ b/builtin/blame-tree.c
@@ -0,0 +1,43 @@
+#include "git-compat-util.h"
+#include "blame-tree.h"
+#include "hex.h"
+#include "quote.h"
+#include "config.h"
+#include "object-name.h"
+#include "parse-options.h"
+#include "builtin.h"
+
+static void show_entry(const char *path, const struct commit *commit, void *d)
+{
+ struct blame_tree *bt = d;
+
+ if (commit->object.flags & BOUNDARY)
+ putchar('^');
+ printf("%s\t", oid_to_hex(&commit->object.oid));
+
+ if (bt->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+
+ fflush(stdout);
+}
+
+int cmd_blame_tree(int argc,
+ const char **argv,
+ const char *prefix,
+ struct repository *repo)
+{
+ int ret = 0;
+ struct blame_tree bt;
+
+ repo_config(repo, git_default_config, NULL);
+
+ blame_tree_init(&bt, repo, prefix, argc, argv);
+ if (blame_tree_run(&bt, show_entry, &bt) < 0)
+ die(_("error running blame-tree traversal"));
+
+ blame_tree_release(&bt);
+
+ return ret;
+}
diff --git a/git.c b/git.c
index 77c4359522..9f8b99b2d1 100644
--- a/git.c
+++ b/git.c
@@ -509,6 +509,7 @@ static struct cmd_struct commands[] = {
{ "backfill", cmd_backfill, RUN_SETUP },
{ "bisect", cmd_bisect, RUN_SETUP },
{ "blame", cmd_blame, RUN_SETUP },
+ { "blame-tree", cmd_blame_tree, RUN_SETUP },
{ "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
{ "bugreport", cmd_bugreport, RUN_SETUP_GENTLY },
{ "bundle", cmd_bundle, RUN_SETUP_GENTLY },
diff --git a/meson.build b/meson.build
index c47cb79af0..214ccf5a72 100644
--- a/meson.build
+++ b/meson.build
@@ -274,6 +274,7 @@ libgit_sources = [
'attr.c',
'base85.c',
'bisect.c',
+ 'blame-tree.c',
'blame.c',
'blob.c',
'bloom.c',
@@ -546,6 +547,7 @@ builtin_sources = [
'builtin/archive.c',
'builtin/backfill.c',
'builtin/bisect.c',
+ 'builtin/blame-tree.c',
'builtin/blame.c',
'builtin/branch.c',
'builtin/bugreport.c',
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 6d62a5b53d..41cc3730dc 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -5,6 +5,7 @@
int cmd__advise_if_enabled(int argc, const char **argv);
int cmd__bitmap(int argc, const char **argv);
+int cmd__blame_tree(int argc, const char **argv);
int cmd__bloom(int argc, const char **argv);
int cmd__bundle_uri(int argc, const char **argv);
int cmd__cache_tree(int argc, const char **argv);
diff --git a/t/meson.build b/t/meson.build
index bfb744e886..65402e97da 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -960,6 +960,7 @@ integration_tests = [
't8012-blame-colors.sh',
't8013-blame-ignore-revs.sh',
't8014-blame-ignore-fuzzy.sh',
+ 't8020-blame-tree.sh',
't9001-send-email.sh',
't9002-column.sh',
't9003-help-autocorrect.sh',
diff --git a/t/t8020-blame-tree.sh b/t/t8020-blame-tree.sh
new file mode 100755
index 0000000000..c11876c210
--- /dev/null
+++ b/t/t8020-blame-tree.sh
@@ -0,0 +1,147 @@
+#!/bin/sh
+
+test_description='blame-tree tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_commit 1 file &&
+ mkdir a &&
+ test_commit 2 a/file &&
+ mkdir a/b &&
+ test_commit 3 a/b/file
+'
+
+test_expect_success 'cannot blame two trees' '
+ test_must_fail git blame-tree HEAD HEAD~1
+'
+
+check_blame() {
+ local indir= &&
+ while test $# != 0
+ do
+ case "$1" in
+ -C)
+ indir="$2"
+ shift
+ ;;
+ *)
+ break
+ ;;
+ esac &&
+ shift
+ done &&
+
+ cat >expect &&
+ test_when_finished "rm -f tmp.*" &&
+ git ${indir:+-C "$indir"} blame-tree "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
+ tr '\t' ' ' <tmp.2 >tmp.3 &&
+ sort tmp.3 >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'blame recursive' '
+ check_blame --recursive <<-\EOF
+ 1 file
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'blame non-recursive' '
+ check_blame --no-recursive <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'blame subdir' '
+ check_blame a <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'blame subdir recursive' '
+ check_blame --recursive a <<-\EOF
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'blame from non-HEAD commit' '
+ check_blame --no-recursive HEAD^ <<-\EOF
+ 1 file
+ 2 a
+ EOF
+'
+
+test_expect_success 'blame from subdir defaults to root' '
+ check_blame -C a --no-recursive <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'blame from subdir uses relative pathspecs' '
+ check_blame -C a --recursive b <<-\EOF
+ 3 a/b/file
+ EOF
+'
+
+test_expect_failure 'limit blame traversal by count' '
+ check_blame --no-recursive -1 <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'limit blame traversal by commit' '
+ check_blame --no-recursive HEAD~2..HEAD <<-\EOF
+ 3 a
+ ^1 file
+ EOF
+'
+
+test_expect_success 'only blame files in the current tree' '
+ git rm -rf a &&
+ git commit -m "remove a" &&
+ check_blame <<-\EOF
+ 1 file
+ EOF
+'
+
+test_expect_success 'cross merge boundaries in blaming' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit m1 &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit m2 &&
+ git merge m1 &&
+ check_blame <<-\EOF
+ m1 m1.t
+ m2 m2.t
+ EOF
+'
+
+test_expect_success 'blame merge for resolved conflicts' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit c1 conflict &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit c2 conflict &&
+ test_must_fail git merge c1 &&
+ test_commit resolved conflict &&
+ check_blame conflict <<-\EOF
+ resolved conflict
+ EOF
+'
+
+test_expect_success 'blame-tree complains about unknown arguments' '
+ test_must_fail git blame-tree --foo 2>err &&
+ grep "unknown blame-tree argument: --foo" err
+'
+
+test_done
--
2.49.0.rc2
* [PATCH RFC 2/5] t/perf: add blame-tree perf script
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
2025-04-22 17:46 ` [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files Toon Claes
@ 2025-04-22 17:46 ` Toon Claes
2025-04-22 17:46 ` [PATCH RFC 3/5] blame-tree: use Bloom filters when available Toon Claes
` (5 subsequent siblings)
7 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-04-22 17:46 UTC (permalink / raw)
To: git; +Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes
From: Jeff King <peff@peff.net>
This just runs a few simple blame-tree invocations. We already test correctness in
the regular suite, so this is just about finding performance regressions
from one version to another.
Signed-off-by: Toon Claes <toon@iotcl.com>
---
t/perf/p8020-blame-tree.sh | 21 +++++++++++++++++++++
t/t8020-blame-tree.sh | 19 ++++++++++---------
2 files changed, 31 insertions(+), 9 deletions(-)
diff --git a/t/perf/p8020-blame-tree.sh b/t/perf/p8020-blame-tree.sh
new file mode 100755
index 0000000000..6c4c2a369e
--- /dev/null
+++ b/t/perf/p8020-blame-tree.sh
@@ -0,0 +1,21 @@
+#!/bin/sh
+
+test_description='blame-tree perf tests'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_perf 'top-level blame-tree' '
+ git blame-tree HEAD
+'
+
+test_perf 'top-level recursive blame-tree' '
+ git blame-tree -r HEAD
+'
+
+test_perf 'subdir blame-tree' '
+ path=$(git ls-tree HEAD | grep ^040000 | head -n 1 | cut -f2)
+ git blame-tree -r HEAD -- "$path"
+'
+
+test_done
diff --git a/t/t8020-blame-tree.sh b/t/t8020-blame-tree.sh
index c11876c210..6a1db7efaa 100755
--- a/t/t8020-blame-tree.sh
+++ b/t/t8020-blame-tree.sh
@@ -43,7 +43,7 @@ check_blame() {
}
test_expect_success 'blame recursive' '
- check_blame --recursive <<-\EOF
+ check_blame -r <<-\EOF
1 file
2 a/file
3 a/b/file
@@ -51,7 +51,7 @@ test_expect_success 'blame recursive' '
'
test_expect_success 'blame non-recursive' '
- check_blame --no-recursive <<-\EOF
+ check_blame <<-\EOF
1 file
3 a
EOF
@@ -64,40 +64,41 @@ test_expect_success 'blame subdir' '
'
test_expect_success 'blame subdir recursive' '
- check_blame --recursive a <<-\EOF
+ check_blame -r a <<-\EOF
2 a/file
3 a/b/file
EOF
'
test_expect_success 'blame from non-HEAD commit' '
- check_blame --no-recursive HEAD^ <<-\EOF
+ check_blame HEAD^ <<-\EOF
1 file
2 a
EOF
'
test_expect_success 'blame from subdir defaults to root' '
- check_blame -C a --no-recursive <<-\EOF
+ check_blame -C a <<-\EOF
1 file
3 a
EOF
'
test_expect_success 'blame from subdir uses relative pathspecs' '
- check_blame -C a --recursive b <<-\EOF
+ check_blame -C a -r b <<-\EOF
3 a/b/file
EOF
'
-test_expect_failure 'limit blame traversal by count' '
- check_blame --no-recursive -1 <<-\EOF
+test_expect_success 'limit blame traversal by count' '
+ check_blame <<-\EOF
3 a
+ ^2 file
EOF
'
test_expect_success 'limit blame traversal by commit' '
- check_blame --no-recursive HEAD~2..HEAD <<-\EOF
+ check_blame HEAD~2..HEAD <<-\EOF
3 a
^1 file
EOF
--
2.49.0.rc2
* [PATCH RFC 3/5] blame-tree: use Bloom filters when available
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
2025-04-22 17:46 ` [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files Toon Claes
2025-04-22 17:46 ` [PATCH RFC 2/5] t/perf: add blame-tree perf script Toon Claes
@ 2025-04-22 17:46 ` Toon Claes
2025-04-22 17:46 ` [PATCH RFC 4/5] blame-tree: implement faster algorithm Toon Claes
` (4 subsequent siblings)
7 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-04-22 17:46 UTC (permalink / raw)
To: git; +Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes
From: Taylor Blau <me@ttaylorr.com>
Our 'git blame-tree' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.
When changed-path Bloom filters are available, we can avoid computing
many such diffs. Before computing a diff, we first check if any of the
remaining paths of interest were possibly changed at a given commit by
consulting its Bloom filter. If any of them were, we have no choice but
to compute the diff.
If none of those queries returned "maybe", we know that the given commit
doesn't contain any changed paths which are interesting to us. So, we
can skip computing the diff in this case.
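The "maybe" versus "definitely not" distinction can be sketched with a toy
Bloom filter. Everything here is an assumption made for illustration: the
filter size, the number of probes, the FNV-1a stand-in hash and the `toy_*`
names are not what Git's bloom.c uses. Only the shape of the check matches
maybe_changed_path() in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS 256	/* filter size in bits (arbitrary for this sketch) */
#define TOY_HASHES 3	/* probes per key (arbitrary for this sketch) */

struct toy_bloom {
	uint8_t bits[TOY_BITS / 8];
};

/* Seeded FNV-1a: a stand-in, not the hash Git uses. */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (uint8_t)*s;
		h *= 16777619u;
	}
	return h;
}

/* Record that `path` was changed by the commit owning this filter. */
static void toy_bloom_add(struct toy_bloom *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_HASHES; i++) {
		uint32_t bit = toy_hash(path, i) % TOY_BITS;
		f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* Returns 0 for "definitely not changed" and 1 for "maybe changed". */
static int toy_bloom_maybe(const struct toy_bloom *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_HASHES; i++) {
		uint32_t bit = toy_hash(path, i) % TOY_BITS;
		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return 0;
	}
	return 1;
}

/* A diff is only needed if at least one remaining path answers "maybe". */
static int toy_need_diff(const struct toy_bloom *f,
			 const char *const *paths, size_t nr)
{
	assert(f && paths);

	for (size_t i = 0; i < nr; i++)
		if (toy_bloom_maybe(f, paths[i]))
			return 1;
	return 0;
}
```

A filter never gives a false negative, so skipping the diff on a 0 result is
safe. A false positive only costs one unnecessary diff.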
This results in a substantial performance speed-up in common cases of
'git blame-tree'. In the kernel, here is the before and after (all times
computed with best-of-five):
With commit-graphs (but no Bloom filters):
real 0m5.133s
user 0m4.942s
sys 0m0.180s
...and with Bloom filters:
real 0m0.936s
user 0m0.842s
sys 0m0.092s
These times are with my development-version of Git, so it's compiled
without optimizations. Compiling instead with `-O3`, the results look
even better:
real 0m0.754s
user 0m0.661s
sys 0m0.092s
Signed-off-by: Toon Claes <toon@iotcl.com>
---
blame-tree.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/blame-tree.c b/blame-tree.c
index ce57db2cfc..47354557a7 100644
--- a/blame-tree.c
+++ b/blame-tree.c
@@ -7,11 +7,15 @@
#include "revision.h"
#include "repository.h"
#include "log-tree.h"
+#include "dir.h"
+#include "commit-graph.h"
+#include "bloom.h"
struct blame_tree_entry {
struct hashmap_entry hashent;
struct object_id oid;
struct commit *commit;
+ struct bloom_key key;
const char path[FLEX_ARRAY];
};
@@ -28,6 +32,9 @@ static void add_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
+ if (bt->rev.bloom_filter_settings)
+ fill_bloom_key(path, strlen(path), &ent->key,
+ bt->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&bt->paths, &ent->hashent);
}
@@ -92,12 +99,21 @@ void blame_tree_init(struct blame_tree *bt,
if (setup_revisions(argc, argv, &bt->rev, NULL) > 1)
die(_("unknown blame-tree argument: %s"), argv[1]);
+ (void)generation_numbers_enabled(bt->rev.repo);
+ bt->rev.bloom_filter_settings = get_bloom_filter_settings(bt->rev.repo);
+
if (add_from_revs(bt) < 0)
die(_("unable to setup blame-tree"));
}
void blame_tree_release(struct blame_tree *bt)
{
+ struct hashmap_iter iter;
+ struct blame_tree_entry *ent;
+
+ hashmap_for_each_entry(&bt->paths, &iter, ent, hashent) {
+ clear_bloom_key(&ent->key);
+ }
hashmap_clear_and_free(&bt->paths, struct blame_tree_entry, hashent);
release_revisions(&bt->rev);
}
@@ -137,6 +153,7 @@ static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
+ clear_bloom_key(&ent->key);
free(ent);
}
@@ -180,6 +197,30 @@ static void blame_diff(struct diff_queue_struct *q,
}
}
+static int maybe_changed_path(struct blame_tree *bt, struct commit *origin)
+{
+ struct bloom_filter *filter;
+ struct blame_tree_entry *e;
+ struct hashmap_iter iter;
+
+ if (!bt->rev.bloom_filter_settings)
+ return 1;
+
+ if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
+ return 1;
+
+ filter = get_bloom_filter(bt->rev.repo, origin);
+ if (!filter)
+ return 1;
+
+ hashmap_for_each_entry(&bt->paths, &iter, e, hashent) {
+ if (bloom_filter_contains(filter, &e->key,
+ bt->rev.bloom_filter_settings))
+ return 1;
+ }
+ return 0;
+}
+
int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
{
struct blame_tree_callback_data data;
@@ -199,6 +240,9 @@ int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
if (!data.commit)
break;
+ if (!maybe_changed_path(bt, data.commit))
+ continue;
+
if (data.commit->object.flags & BOUNDARY) {
diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
&data.commit->object.oid,
--
2.49.0.rc2
* [PATCH RFC 4/5] blame-tree: implement faster algorithm
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
` (2 preceding siblings ...)
2025-04-22 17:46 ` [PATCH RFC 3/5] blame-tree: use Bloom filters when available Toon Claes
@ 2025-04-22 17:46 ` Toon Claes
2025-04-22 17:46 ` [PATCH RFC 5/5] blame-tree.c: initialize revision machinery without walk Toon Claes
` (3 subsequent siblings)
7 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-04-22 17:46 UTC (permalink / raw)
To: git; +Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Derrick Stolee
From: Taylor Blau <me@ttaylorr.com>
The current implementation of 'git blame-tree' works by doing a revision
walk, and inspecting the diff at each level of that walk to annotate the
yet-unblamed entries to a path. In other words, if the diff at some
level touches a path which has not yet been associated with a commit,
then that commit becomes associated with the path.
While a perfectly reasonable implementation, it can perform poorly in
either one of two scenarios:
1. There are many entries of interest, in which case there is simply
more work to do.
2. Or, there are (even a few) entries which have not been updated in a
long time, and so we must walk through a lot of history in order to
find a commit that touches that path.
This patch rewrites the blame-tree implementation to address (2).
The idea behind the algorithm is to propagate a set of 'active' paths (a
path is 'active' if it does not yet belong to a commit) up to parents
and do a truncated revision walk.
The walk is truncated because it does not produce a revision for every
change in the original pathspec, but rather only for active paths.
More specifically, consider a priority queue of commits sorted by
generation number. First, enqueue the set of boundary commits with all
paths in the original spec marked as interesting.
Then, while the queue is not empty, do the following:
1. Pop an element, say, 'c', off of the queue, making sure that 'c'
isn't reachable by anything in the '--not' set.
2. For each parent 'p' (with index 'parent_i') of 'c', do the
following:
a. Compute the diff between 'c' and 'p'.
b. Pass any active paths that are TREESAME from 'c' to 'p'.
c. If 'p' has any active paths, push it onto the queue.
3. Associate any remaining paths with 'c', and mark them as inactive.
This ends up being equivalent to doing something like 'git log -1 --
$path' for each path simultaneously. But, it allows us to go much faster
than the original implementation by limiting the number of diffs we
compute, since we can avoid parts of history that would have been
considered by the revision walk in the original implementation, but are
known to be uninteresting to us because we have already marked all paths
in that area to be inactive.
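The propagation of active paths can be sketched on a toy history with one
merge. This is an illustration under stated assumptions, not the code in
this patch: `struct toy_commit`, the `history` table and
`toy_blame_tree_fast()` are hypothetical, commits are numbered so that every
parent has a lower index than its child, and scanning indices from high to
low stands in for the priority queue ordered by generation number. The
'--not' set and the Bloom filter shortcut are left out.

```c
#include <assert.h>

#define NPATHS 3
#define NCOMMITS 4
#define MAXPARENTS 2

struct toy_commit {
	int nr_parents;
	int parents[MAXPARENTS];
	int tree[NPATHS];	/* "object ID" per path, 0 means absent */
};

/* Hypothetical history: 0 is the root, 1 and 2 branch off it, 3 merges them. */
static const struct toy_commit history[NCOMMITS] = {
	{ 0, { -1, -1 }, { 1, 0, 5 } },	/* root: adds paths 0 and 2 */
	{ 1, {  0, -1 }, { 1, 2, 5 } },	/* side branch: adds path 1 */
	{ 1, {  0, -1 }, { 3, 0, 5 } },	/* main branch: changes path 0 */
	{ 2, {  2,  1 }, { 3, 2, 5 } },	/* merge of 2 (first parent) and 1 */
};

static void toy_blame_tree_fast(int tip, int blamed[NPATHS])
{
	char active[NCOMMITS][NPATHS] = { { 0 } };

	assert(tip >= 0 && tip < NCOMMITS);

	/* Start with every path that exists at the tip marked active. */
	for (int i = 0; i < NPATHS; i++) {
		blamed[i] = -1;
		active[tip][i] = history[tip].tree[i] != 0;
	}

	/* Visit children before parents, as the priority queue would. */
	for (int c = tip; c >= 0; c--) {
		const struct toy_commit *commit = &history[c];

		for (int k = 0; k < commit->nr_parents; k++) {
			int p = commit->parents[k];

			for (int i = 0; i < NPATHS; i++) {
				if (!active[c][i])
					continue;
				if (commit->tree[i] != history[p].tree[i])
					continue; /* not TREESAME to this parent */
				/* Pass the path up: the parent is now responsible. */
				active[c][i] = 0;
				active[p][i] = 1;
			}
		}

		/* Whatever no parent could take over is blamed on this commit. */
		for (int i = 0; i < NPATHS; i++) {
			if (active[c][i]) {
				blamed[i] = c;
				active[c][i] = 0;
			}
		}
	}
}
```

Starting from the merge (commit 3), path 0 is passed to commit 2 and blamed
there, path 1 is passed to commit 1 and blamed there, and path 2 travels
through commit 2 down to the root. A commit without active paths does no
work at all, which is the toy equivalent of never pushing it onto the queue.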
One other trick we can do on top is to avoid computing many first-parent
diffs when all paths active in 'c' are DEFINITELY_NOT in c's Bloom
filter. Since the commit-graph only stores first-parent diffs in the
Bloom filters, we can only apply this trick to first-parent diffs.
Now, some performance numbers. On github/git, our numbers look like the
following (all wall-clock times best-of-five, and with '--max-depth=0'
on the root):
                     github    ttaylorr/blame-tree-fast
    with filters:    0.754s    0.271s (2.78x faster, 6.18x overall)
    without filters: 1.676s    1.056s (1.58x faster)
and on torvalds/linux:
                     github    ttaylorr/blame-tree-fast
    with filters:    0.608s    0.062s (9.81x faster, ~52x overall)
    without filters: 3.251s    0.676s (4.81x faster)
In short, the existing implementation *with* filters is about as fast
as the new implementation *without* filters. So, most repositories
should get a dramatic speed-up by just deploying this (even without
computing Bloom filters), and all repositories should get faster still
when computing Bloom filters.
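The factors quoted above are plain ratios of the wall-clock times: "faster"
compares the two implementations in the same row, and "overall" compares the
old implementation without filters to the new one with filters. A small
sketch to re-derive them, where `speedup()` and `centi()` are hypothetical
helper names:

```c
#include <assert.h>

/* How many times faster a run taking `after_s` is than one taking `before_s`. */
static double speedup(double before_s, double after_s)
{
	assert(before_s > 0 && after_s > 0);
	return before_s / after_s;
}

/* Round a positive factor to hundredths, for comparison with "2.78x" etc. */
static long centi(double x)
{
	return (long)(x * 100 + 0.5);
}
```

For example, `centi(speedup(0.754, 0.271))` gives 278 for the github/git row
with filters, and `speedup(3.251, 0.062)` is roughly 52.4 for the overall
improvement on torvalds/linux.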
Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
blame-tree.c | 270 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
blame-tree.h | 3 +
2 files changed, 256 insertions(+), 17 deletions(-)
diff --git a/blame-tree.c b/blame-tree.c
index 47354557a7..2cb7a5045c 100644
--- a/blame-tree.c
+++ b/blame-tree.c
@@ -3,18 +3,20 @@
#include "commit.h"
#include "diffcore.h"
#include "diff.h"
-#include "object.h"
#include "revision.h"
#include "repository.h"
#include "log-tree.h"
#include "dir.h"
#include "commit-graph.h"
#include "bloom.h"
+#include "prio-queue.h"
+#include "commit-slab.h"
struct blame_tree_entry {
struct hashmap_entry hashent;
struct object_id oid;
struct commit *commit;
+ int diff_idx;
struct bloom_key key;
const char path[FLEX_ARRAY];
};
@@ -86,6 +88,9 @@ void blame_tree_init(struct blame_tree *bt,
const char *prefix,
int argc, const char **argv)
{
+ struct hashmap_iter iter;
+ struct blame_tree_entry *e;
+
memset(bt, 0, sizeof(*bt));
hashmap_init(&bt->paths, blame_tree_entry_hashcmp, NULL, 0);
@@ -104,6 +109,13 @@ void blame_tree_init(struct blame_tree *bt,
if (add_from_revs(bt) < 0)
die(_("unable to setup blame-tree"));
+
+ bt->all_paths = xcalloc(hashmap_get_size(&bt->paths), sizeof(const char *));
+ bt->all_paths_nr = 0;
+ hashmap_for_each_entry(&bt->paths, &iter, e, hashent) {
+ e->diff_idx = bt->all_paths_nr++;
+ bt->all_paths[e->diff_idx] = e->path;
+ }
}
void blame_tree_release(struct blame_tree *bt)
@@ -116,6 +128,20 @@ void blame_tree_release(struct blame_tree *bt)
}
hashmap_clear_and_free(&bt->paths, struct blame_tree_entry, hashent);
release_revisions(&bt->rev);
+ free(bt->all_paths);
+}
+
+struct commit_active_paths {
+ char *active;
+ int nr;
+};
+
+define_commit_slab(active_paths, struct commit_active_paths);
+static struct active_paths active_paths;
+
+static void free_one_active_path(struct commit_active_paths *active)
+{
+ free(active->active);
}
struct blame_tree_callback_data {
@@ -130,6 +156,7 @@ static void mark_path(const char *path, const struct object_id *oid,
struct blame_tree_callback_data *data)
{
struct blame_tree_entry *ent;
+ struct commit_active_paths *active;
/* Is it even a path that we are interested in? */
ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
@@ -141,11 +168,17 @@ static void mark_path(const char *path, const struct object_id *oid,
if (ent->commit)
return;
+ /* Are we inactive on the current commit? */
+ active = active_paths_at(&active_paths, data->commit);
+ if (active && active->active &&
+ !active->active[ent->diff_idx])
+ return;
+
/*
* Is it arriving at a version of interest, or is it from a side branch
* which did not contribute to the final state?
*/
- if (!oideq(oid, &ent->oid))
+ if (oid && !oideq(oid, &ent->oid))
return;
ent->commit = data->commit;
@@ -197,7 +230,32 @@ static void blame_diff(struct diff_queue_struct *q,
}
}
-static int maybe_changed_path(struct blame_tree *bt, struct commit *origin)
+static char *scratch;
+
+static void pass_to_parent(struct commit_active_paths *c,
+ struct commit_active_paths *p,
+ int i)
+{
+ c->active[i] = 0;
+ c->nr--;
+ p->active[i] = 1;
+ p->nr++;
+}
+
+#define PARENT1 (1u<<16) /* used instead of SEEN */
+#define PARENT2 (1u<<17) /* used instead of BOTTOM, BOUNDARY */
+
+static int diff2idx(struct blame_tree *bt, char *path)
+{
+ struct blame_tree_entry *ent;
+ ent = hashmap_get_entry_from_hash(&bt->paths, strhash(path), path,
+ struct blame_tree_entry, hashent);
+ return ent ? ent->diff_idx : -1;
+}
+
+static int maybe_changed_path(struct blame_tree *bt,
+ struct commit *origin,
+ struct commit_active_paths *active)
{
struct bloom_filter *filter;
struct blame_tree_entry *e;
@@ -214,6 +272,8 @@ static int maybe_changed_path(struct blame_tree *bt, struct commit *origin)
return 1;
hashmap_for_each_entry(&bt->paths, &iter, e, hashent) {
+ if (active && !active->active[e->diff_idx])
+ continue;
if (bloom_filter_contains(filter, &e->key,
bt->rev.bloom_filter_settings))
return 1;
@@ -221,8 +281,88 @@ static int maybe_changed_path(struct blame_tree *bt, struct commit *origin)
return 0;
}
+static int process_parent(struct blame_tree *bt,
+ struct prio_queue *queue,
+ struct commit *c, struct commit_active_paths *active_c,
+ struct commit *parent, int parent_i)
+{
+ int i, ret = 0; // TODO type & for loop var
+ struct commit_active_paths *active_p;
+
+ repo_parse_commit(bt->rev.repo, parent);
+
+ active_p = active_paths_at(&active_paths, parent);
+ if (!active_p->active) {
+ active_p->active = xcalloc(sizeof(char), bt->all_paths_nr);
+ active_p->nr = 0;
+ }
+
+ /*
+ * Before calling 'diff_tree_oid()' on our first parent, see if Bloom
+ * filters will tell us the diff is conclusively uninteresting.
+ */
+ if (parent_i || maybe_changed_path(bt, c, active_c)) {
+ diff_tree_oid(&parent->object.oid,
+ &c->object.oid, "", &bt->rev.diffopt);
+ diffcore_std(&bt->rev.diffopt);
+ }
+
+ if (!diff_queued_diff.nr) {
+ /*
+ * No diff entries means we are TREESAME on the base path, and
+ * so all active paths get passed onto this parent.
+ */
+ for (i = 0; i < bt->all_paths_nr; i++) {
+ if (active_c->active[i])
+ pass_to_parent(active_c, active_p, i);
+ }
+
+ if (!(parent->object.flags & PARENT1)) {
+ parent->object.flags |= PARENT1;
+ prio_queue_put(queue, parent);
+ }
+ ret = 1;
+ goto cleanup;
+ }
+
+ /*
+ * Otherwise, test each path for TREESAME-ness against the parent, and
+ * pass those along.
+ *
+ * First, set each position in 'scratch' to be zero for TREESAME paths,
+ * and one otherwise. Then, pass active and TREESAME paths to the
+ * parent.
+ */
+ for (i = 0; i < diff_queued_diff.nr; i++) {
+ struct diff_filepair *fp = diff_queued_diff.queue[i];
+ int k = diff2idx(bt, fp->two->path);
+ if (0 <= k && active_c->active[k])
+ scratch[k] = 1;
+ diff_free_filepair(fp);
+ }
+ diff_queued_diff.nr = 0;
+ for (i = 0; i < bt->all_paths_nr; i++) {
+ if (active_c->active[i] && !scratch[i])
+ pass_to_parent(active_c, active_p, i);
+ }
+
+ if (active_p->nr && !(parent->object.flags & PARENT1)) {
+ parent->object.flags |= PARENT1;
+ prio_queue_put(queue, parent);
+ }
+
+cleanup:
+ diff_queue_clear(&diff_queued_diff);
+ memset(scratch, 0, bt->all_paths_nr);
+
+ return ret;
+}
+
int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
{
+ int max_count, queue_popped = 0;
+ struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+ struct prio_queue not_queue = { compare_commits_by_gen_then_commit_date };
struct blame_tree_callback_data data;
data.paths = &bt->paths;
@@ -233,25 +373,121 @@ int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
bt->rev.diffopt.format_callback = blame_diff;
bt->rev.diffopt.format_callback_data = &data;
- prepare_revision_walk(&bt->rev);
+ max_count = bt->rev.max_count;
- while (hashmap_get_size(&bt->paths)) {
- data.commit = get_revision(&bt->rev);
- if (!data.commit)
- break;
+ init_active_paths(&active_paths);
+ scratch = xcalloc(bt->all_paths_nr, sizeof(char));
- if (!maybe_changed_path(bt, data.commit))
- continue;
+ /*
+ * bt->rev.pending holds the set of boundary commits for our walk.
+ *
+ * Loop through each such commit, and place it in the appropriate queue.
+ */
+ for (size_t i = 0; i < bt->rev.pending.nr; i++) {
+ struct commit *c = lookup_commit(bt->rev.repo,
+ &bt->rev.pending.objects[i].item->oid);
+ repo_parse_commit(bt->rev.repo, c);
- if (data.commit->object.flags & BOUNDARY) {
- diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
- &data.commit->object.oid,
- "", &bt->rev.diffopt);
- diff_flush(&bt->rev.diffopt);
- } else {
- log_tree_commit(&bt->rev, data.commit);
+ if (c->object.flags & BOTTOM) {
+ prio_queue_put(&not_queue, c);
+ c->object.flags |= PARENT2;
+ } else if (!(c->object.flags & PARENT1)) {
+ /*
+ * If the commit is a starting point (and hasn't been
+ * seen yet), then initialize the set of interesting
+ * paths, too.
+ */
+ struct commit_active_paths *active;
+
+ prio_queue_put(&queue, c);
+ c->object.flags |= PARENT1;
+
+ active = active_paths_at(&active_paths, c);
+ active->active = xcalloc(sizeof(char), bt->all_paths_nr);
+ memset(active->active, 1, bt->all_paths_nr);
+ active->nr = bt->all_paths_nr;
}
}
+ /*
+ * Now that we have processed the pending commits, allow the revision
+ * machinery to flush them by calling prepare_revision_walk().
+ */
+ prepare_revision_walk(&bt->rev);
+
+ while (queue.nr) {
+ int parent_i;
+ struct commit_list *p;
+ struct commit *c = prio_queue_get(&queue);
+ struct commit_active_paths *active_c = active_paths_at(&active_paths, c);
+
+ if ((0 <= max_count && max_count < ++queue_popped) ||
+ (c->object.flags & PARENT2)) {
+ /*
+ * Either a boundary commit, or we have already seen too
+ * many others. Either way, stop here.
+ */
+ c->object.flags |= PARENT2 | BOUNDARY;
+ data.commit = c;
+ diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
+ &c->object.oid,
+ "", &bt->rev.diffopt);
+ diff_flush(&bt->rev.diffopt);
+ goto cleanup;
+ }
+
+ /*
+ * Otherwise, keep going, but make sure that 'c' isn't reachable
+ * from anything in the '--not' queue.
+ */
+ repo_parse_commit(bt->rev.repo, c);
+
+ while (not_queue.nr) {
+ struct commit_list *np;
+ struct commit *n = prio_queue_get(&not_queue);
+
+ repo_parse_commit(bt->rev.repo, n);
+
+ for (np = n->parents; np; np = np->next) {
+ if (!(np->item->object.flags & PARENT2)) {
+ prio_queue_put(&not_queue, np->item);
+ np->item->object.flags |= PARENT2;
+ }
+ }
+
+ if (commit_graph_generation(n) < commit_graph_generation(c))
+ break;
+ }
+
+ /*
+ * Look at each remaining interesting path, and pass it onto
+ * parents in order if TREESAME.
+ */
+ for (p = c->parents, parent_i = 0; p; p = p->next, parent_i++) {
+ if (process_parent(bt, &queue,
+ c, active_c,
+ p->item, parent_i) > 0 )
+ break;
+ }
+
+ if (active_c->nr) {
+ /* Any paths that remain active were changed by 'c'. */
+ data.commit = c;
+ for (int i = 0; i < bt->all_paths_nr; i++) {
+ if (active_c->active[i])
+ mark_path(bt->all_paths[i], NULL, &data);
+ }
+ }
+
+cleanup:
+ FREE_AND_NULL(active_c->active);
+ active_c->nr = 0;
+ }
+
+ clear_prio_queue(&not_queue);
+ clear_prio_queue(&queue);
+ deep_clear_active_paths(&active_paths, free_one_active_path);
+ free(scratch);
+
return 0;
}
diff --git a/blame-tree.h b/blame-tree.h
index abb467cf1b..0e6a6929f6 100644
--- a/blame-tree.h
+++ b/blame-tree.h
@@ -8,6 +8,9 @@
struct blame_tree {
struct hashmap paths;
struct rev_info rev;
+
+ const char **all_paths;
+ int all_paths_nr;
};
void blame_tree_init(struct blame_tree *bt,
--
2.49.0.rc2
* [PATCH RFC 5/5] blame-tree.c: initialize revision machinery without walk
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
` (3 preceding siblings ...)
2025-04-22 17:46 ` [PATCH RFC 4/5] blame-tree: implement faster algorithm Toon Claes
@ 2025-04-22 17:46 ` Toon Claes
2025-04-23 13:26 ` [PATCH RFC 0/5] Introduce git-blame-tree(1) command Marc Branchaud
` (2 subsequent siblings)
7 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-04-22 17:46 UTC (permalink / raw)
To: git; +Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes
In a previous commit we inserted a call to 'prepare_revision_walk()'
before we started our traversal. This was done when we leveraged the
revision machinery more (at the time, we were leaning on
'log_tree_commit()' which only worked after calling
'prepare_revision_walk()').
But, we have since dropped 'log_tree_commit()', so we don't need most of
the initialization work of 'prepare_revision_walk()'. Now we ask it to
do very little work during initialization by setting the '->no_walk'
flag to '1', which leaves its internal state alone enough that we can
still function as normal.
Unfortunately, this means that we now no longer complain about
non-commit inputs, since the revision machinery checks this for us (it
just silently ignores them).
Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
blame-tree.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/blame-tree.c b/blame-tree.c
index 2cb7a5045c..e244797b7e 100644
--- a/blame-tree.c
+++ b/blame-tree.c
@@ -271,6 +271,13 @@ static int maybe_changed_path(struct blame_tree *bt,
if (!filter)
return 1;
+ for (int i = 0; i < bt->rev.bloom_keys_nr; i++) {
+ if (!(bloom_filter_contains(filter,
+ &bt->rev.bloom_keys[i],
+ bt->rev.bloom_filter_settings)))
+ return 0;
+ }
+
hashmap_for_each_entry(&bt->paths, &iter, e, hashent) {
if (active && !active->active[e->diff_idx])
continue;
@@ -364,6 +371,7 @@ int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct prio_queue not_queue = { compare_commits_by_gen_then_commit_date };
struct blame_tree_callback_data data;
+ struct commit_list *list;
data.paths = &bt->paths;
data.callback = cb;
@@ -372,6 +380,9 @@ int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
bt->rev.diffopt.format_callback = blame_diff;
bt->rev.diffopt.format_callback_data = &data;
+ bt->rev.no_walk = 1;
+
+ prepare_revision_walk(&bt->rev);
max_count = bt->rev.max_count;
@@ -379,14 +390,12 @@ int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
scratch = xcalloc(bt->all_paths_nr, sizeof(char));
/*
- * bt->rev.pending holds the set of boundary commits for our walk.
+ * bt->rev.commits holds the set of boundary commits for our walk.
*
* Loop through each such commit, and place it in the appropriate queue.
*/
- for (size_t i = 0; i < bt->rev.pending.nr; i++) {
- struct commit *c = lookup_commit(bt->rev.repo,
- &bt->rev.pending.objects[i].item->oid);
- repo_parse_commit(bt->rev.repo, c);
+ for (list = bt->rev.commits; list; list = list->next) {
+ struct commit *c = list->item;
if (c->object.flags & BOTTOM) {
prio_queue_put(&not_queue, c);
@@ -409,12 +418,6 @@ int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
}
}
- /*
- * Now that we have processed the pending commits, allow the revision
- * machinery to flush them by calling prepare_revision_walk().
- */
- prepare_revision_walk(&bt->rev);
-
while (queue.nr) {
int parent_i;
struct commit_list *p;
--
2.49.0.rc2
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
` (4 preceding siblings ...)
2025-04-22 17:46 ` [PATCH RFC 5/5] blame-tree.c: initialize revision machinery without walk Toon Claes
@ 2025-04-23 13:26 ` Marc Branchaud
2025-05-07 14:22 ` Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
7 siblings, 1 reply; 135+ messages in thread
From: Marc Branchaud @ 2025-04-23 13:26 UTC (permalink / raw)
To: Toon Claes, git
Cc: Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On 2025-04-22 13:46, Toon Claes wrote:
> This is another attempt to upstream the git-blame-tree(1) subcommand.
> After the previous attempt[1] the people of GitHub shared their version
> of the subcommand, and this version integrates those changes.
This functionality is awesome -- thanks for pushing this forwards.
I feel the need to get some bike-shedding off my chest, though:
"blame-tree" would be a terrible name for this command. I think that if
Git ends up with two blame-like commands it will merely solidify Git's
reputation for obscurity.
If this is really a form of blaming, then just make it an extension of
"git blame", like maybe "git blame --latest".
Otherwise, please come up with a new command name. "git latest"? "git
latest-revs"? As long as it doesn't use the word "blame"...
FYI, here's Peff's original explanation[1] of how he came up with the name:
> I wasn't sure at first what to call it or what the calling conventions
> should be. The initial thought was to make it part of "ls-tree". But
> that feels wrong, as ls-tree otherwise never cares about traversal.
> The combination of traversal and diff made me think of blame, and
> indeed, I think this is really just about blaming a whole tree at the
> file-level, rather than at the content-level. Thus I called it blame-
> tree, and I used the same calling conventions as blame:
> "git blame-tree <path> <rev opts>".
To me that reads like an argument for folding this into "git blame".
M.
[1]
https://lore.kernel.org/git/20110302164031.GA18233@sigill.intra.peff.net/
* Re: [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files
2025-04-22 17:46 ` [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files Toon Claes
@ 2025-04-24 16:19 ` Junio C Hamano
2025-05-07 13:13 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-04-24 16:19 UTC (permalink / raw)
To: Toon Claes
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Toon Claes <toon@iotcl.com> writes:
> Similar to git-blame(1), introduce a new subcommand git-blame-tree(1).
> This command shows the most recent modification to paths in a tree. It
> does so by expanding the tree at a given commit, taking note of the
> current state of each path, and then walking backwards through history
> looking for commits where each path changed into its final commit ID.
What is missing in the series is an end-user facing documentation,
it seems? Don't take this as a complaint; an RFC is expected to be
incomplete and one of the reasons asking for comment responses is
to fill the gaps.
How is the "most recent modification" defined in a history with
forks and merges? For example, in this topology:
A---B---B---B---B---B
\ /
C---D
where each letter denotes the contents in the path we are interested
in (and as usual, time flows from left to right), is it the child of
commit A that made the last modification from A to B? Or was it the
merge commit that compared B and D and decided that the path should
have B? Something else? Does it change the story if the sides of
the merge were swapped, i.e. if the branch that kept B all the way
were not the mainline but the side branch that got merged?
A---B---B---C---D---B---B
\ /
B-------B
The same question applies if the path we are interested in is a
tree, not a leaf file.
I do not seem to see such a case that involves "ours" merge in the
tests, either.
* Re: [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files
2025-04-24 16:19 ` Junio C Hamano
@ 2025-05-07 13:13 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-05-07 13:13 UTC (permalink / raw)
To: Junio C Hamano
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Junio C Hamano <gitster@pobox.com> writes:
> What is missing in the series is an end-user facing documentation,
> it seems? Don't take this as a complaint; an RFC is expected to be
> incomplete and one of the reasons asking for comment responses is
> to fill the gaps.
Yes, I'm aware that was missing, but I didn't want to spend too much
time on this if the general idea would be discarded anyway.
> How is the "most recent modification" defined in a history with
> forks and merges? For example, in this topology:
>
> A---B---B---B---B---B
> \ /
> C---D
>
> where each letter denotes the contents in the path we are interested
> in (and as usual, time flows from left to right), is it the child of
> commit A that made the last modification from A to B? Or was it the
> merge commit that compared B and D and decided that the path should
> have B? Something else? Does it change the story if the sides of
> the merge were swapped, i.e. if the branch that kept B all the way
> were not the mainline but the side branch that got merged?
>
> A---B---B---C---D---B---B
> \ /
> B-------B
>
> The same question applies if the path we are interested in is a
> tree, not a leaf file.
>
> I do not seem to see such a case that involves "ours" merge in the
> tests, either.
Those are good scenarios to think about. I'll try to include them in
test cases for the next version. I think it's easier to argue about it
then.
--
Toon
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-04-23 13:26 ` [PATCH RFC 0/5] Introduce git-blame-tree(1) command Marc Branchaud
@ 2025-05-07 14:22 ` Toon Claes
2025-05-07 20:23 ` Marc Branchaud
0 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-05-07 14:22 UTC (permalink / raw)
To: Marc Branchaud, git
Cc: Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Marc Branchaud <marcnarc@xiplink.com> writes:
> I feel the need to get some bike-shedding off my chest, though:
Always welcome!
> "blame-tree" would be a terrible name for this command.
Do you feel this way because "blame" has a negative connotation?
> I think that if Git ends up with two blame-like commands it will
> merely solidify Git's reputation for obscurity.
I think "blaming" is a well-concept in Git, and many people (familiar
with Git) would understand in an instant what `blame-tree` would do.
> If this is really a form of blaming, then just make it an extension of
> "git blame", like maybe "git blame --latest".
I'm afraid that won't work very well, because the code is very much
different. If naming is the only motivation to shoehorn this in, then I
think it's better to rethink the name?
> Otherwise, please come up with a new command name. "git latest"? "git
> latest-revs"? As long as it doesn't use the word "blame"...
I've been thinking about this a lot more, but I failed to come up with a
better name.
> FYI, here's Peff's original explanation[1] of how he came up with the name:
>
> > I wasn't sure at first what to call it or what the calling conventions
> > should be. The initial thought was to make it part of "ls-tree". But
> > that feels wrong, as ls-tree otherwise never cares about traversal.
> > The combination of traversal and diff made me think of blame, and
> > indeed, I think this is really just about blaming a whole tree at the
> > file-level, rather than at the content-level. Thus I called it blame-
> > tree, and I used the same calling conventions as blame:
> > "git blame-tree <path> <rev opts>".
>
> To me that reads like an argument for folding this into "git blame".
Forgive me, but I think folding into git-blame(1) will also solidify
Git's reputation of obscurity.
I think `blame-tree` is a fine name for this feature, but in the end I
don't care too much about the exact name. If we end up naming it `git
last-for-each`, `git annotate-files`, `git log-everyone`, or `git
when-modified` ... It's all good for me.
> [1]
> https://lore.kernel.org/git/20110302164031.GA18233@sigill.intra.peff.net/
--
Toon
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-07 14:22 ` Toon Claes
@ 2025-05-07 20:23 ` Marc Branchaud
2025-05-07 20:45 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 135+ messages in thread
From: Marc Branchaud @ 2025-05-07 20:23 UTC (permalink / raw)
To: Toon Claes, git
Cc: Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On 2025-05-07 10:22, Toon Claes wrote:
> Marc Branchaud <marcnarc@xiplink.com> writes:
>
>> I feel the need to get some bike-shedding off my chest, though:
>
> Always welcome!
>
>> "blame-tree" would be a terrible name for this command.
>
> Do you feel this way because "blame" has a negative connotation?
Good question, but no, not at all.
My concern is about having two commands to do blaming (or "crediting" or
whatever anyone wants to call it), instead of just one.
>> I think that if Git ends up with two blame-like commands it will
>> merely solidify Git's reputation for obscurity.
>
> I think "blaming" is a well-concept in Git, and many people (familiar
> with Git) would understand in an instant what `blame-tree` would do.
I agree that blaming is a well-(known) concept. I also agree that most
users would understand what blame-tree would do, *once they find it*.
But I think that's beside the point I'm trying to make. Git is
notorious for making users learn countless commands, and having two
slightly-different commands for blaming is just going to make that worse.
I mean, from a usability point of view, it makes much more sense if "git
blame" simply understood how to handle blaming a directory differently
from blaming a file/blob:
Want to see which commit last touched each line of a file? Just run
git blame path/to/file
Want to see which commits last touched each file under a tree? Just run
git blame path/to/directory
Git should be smart enough to figure out what to do from just whether or
not the last argument is a file or directory.
>> If this is really a form of blaming, then just make it an extension of
>> "git blame", like maybe "git blame --latest".
>
> I'm afraid that won't work very well, because the code is very much
> different. If naming is the only motivation to shoehorn this in, then I
> think it's better to rethink the name?
It's not just "naming" but rather trying to help Git be intuitively
useful to users.
Also, I think sacrificing usability because it makes the coding hard is
unfortunate.
I personally think it's fine for blame.c to contain two different
internal swathes of code that do different things. The ~500 lines or so
to implement blame-tree don't feel like a major burden to me, especially
compared to the ~3000 lines already in blame.c...
But if combining the two features into a single C file is too much to
bear, perhaps refactor the existing blame.c code? Something like:
- blame-file.c (the existing "git blame" implementation)
- blame-tree.c (the new functionality)
- blame.c (exposes both blame-file and blame-tree under "git blame")
>> Otherwise, please come up with a new command name. "git latest"? "git
>> latest-revs"? As long as it doesn't use the word "blame"...
>
> I've been thinking about this a lot more, but I failed to come up with a
> better name.
>
>> FYI, here's Peff's original explanation[1] of how he came up with the name:
>>
>> > I wasn't sure at first what to call it or what the calling conventions
>> > should be. The initial thought was to make it part of "ls-tree". But
>> > that feels wrong, as ls-tree otherwise never cares about traversal.
>> > The combination of traversal and diff made me think of blame, and
>> > indeed, I think this is really just about blaming a whole tree at the
>> > file-level, rather than at the content-level. Thus I called it blame-
>> > tree, and I used the same calling conventions as blame:
>> > "git blame-tree <path> <rev opts>".
>>
>> To me that reads like an argument for folding this into "git blame".
>
> Forgive me, but I think folding into git-blame(1) will also solidify
> Git's reputation of obscurity.
Please elaborate.
> I think `blame-tree` is a fine name for this feature, but in the end I
> don't care too much about the exact name. If we end up naming it `git
> last-for-each`, `git annotate-files`, `git log-everyone`, or `git
> when-modified` ... It's all good for me.
But then, why not just expand "git blame"?
I feel that the
git blame path/to/directory
use case I mentioned above is a compelling argument to fold the feature
into standard "git blame", aside from any reputation-for-obscurity
discussion.
M.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-07 20:23 ` Marc Branchaud
@ 2025-05-07 20:45 ` Junio C Hamano
2025-05-08 13:26 ` Marc Branchaud
2025-05-07 20:49 ` Kristoffer Haugsbakk
2025-05-08 13:18 ` D. Ben Knoble
2 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-05-07 20:45 UTC (permalink / raw)
To: Marc Branchaud
Cc: Toon Claes, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Marc Branchaud <marcnarc@xiplink.com> writes:
> My concern is about having two commands to do blaming (or "crediting"
> or whatever anyone wants to call it), instead of just one.
Existing "git blame" (or "git annotate") is about tracing the origin
of individual lines, so perhaps we can say "git blame" has two
modes, blame lines or blame files, and run the code for this new
mode with "git blame --mode=file" (and add "git blame --mode=line"
that is on by default that runs the original "git blame" code
paths)?
> I mean, from a usability point of view, it makes much more sense if
> "git blame" simply understood how to handle blaming a directory
> differently from blaming a file/blob:
I think this needs rephrasing: blaming a whole file (or a whole
tree) differently from blaming individual lines.
As lines can move across files, and we do find such moves while
tracing the origin of each line, "blaming a file" is not quite the
right way to think about it. "blaming lines in a file", perhaps.
> Git should be smart enough to figure out what to do from just whether
> or not the last argument is a file or directory.
Ah, that is interesting. We do not have to introduce "--mode=line/file"
option. Just see if the given pathspec names a tree object in the starting
commit and trigger the blame-tree logic; otherwise we just run the
"blame lines in a file" mode. So dispatching between the two modes
is almost trivial. I like that.
After command line option parsing, however, there may need to be some
sanity checking logic like "You said you want to blame the t/
directory and its contents, but at the same time you have -L1,10
to say you only want to blame the first 10 lines, which is an option
that does not make sense in blame-tree mode, so I abort". As long
as that is cleanly done, I think it is a good direction forward.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-07 20:23 ` Marc Branchaud
2025-05-07 20:45 ` Junio C Hamano
@ 2025-05-07 20:49 ` Kristoffer Haugsbakk
2025-05-08 13:20 ` D. Ben Knoble
2025-05-08 13:26 ` Marc Branchaud
2025-05-08 13:18 ` D. Ben Knoble
2 siblings, 2 replies; 135+ messages in thread
From: Kristoffer Haugsbakk @ 2025-05-07 20:49 UTC (permalink / raw)
To: Marc Branchaud, Toon Claes, git
Cc: Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Wed, May 7, 2025, at 22:23, Marc Branchaud wrote:
> On 2025-05-07 10:22, Toon Claes wrote:
>> Marc Branchaud <marcnarc@xiplink.com> writes:
>>
>>> I feel the need to get some bike-shedding off my chest, though:
>>
>> Always welcome!
>>
>>> "blame-tree" would be a terrible name for this command.
>>
>> Do you feel this way because "blame" has a negative connotation?
>
> Good question, but no, not at all.
>
> My concern is about having two commands to do blaming (or "crediting" or
> whatever anyone wants to call it), instead of just one.
>
>>> I think that if Git ends up with two blame-like commands it will
>>> merely solidify Git's reputation for obscurity.
>>
>> I think "blaming" is a well-concept in Git, and many people (familiar
>> with Git) would understand in an instant what `blame-tree` would do.
>
> I agree that blaming is a well-(known) concept. I also agree that most
> users would understand what blame-tree would do, *once they find it*.
>
> But I think that's beside the point I'm trying to make. Git is
> notorious for making users learn countless commands, and having two
> slightly-different commands for blaming is just going to make that worse.
As a Git user, I don’t see the problem. `git --list-cmds=builtins`
lists 144 commands. Six of them are `-tree` commands.
It’s not been my understanding that people stumble upon niche commands
that easily. Most questions I’ve seen about git-commit-tree(1) (one of
the `-tree` commands that seems to come up from time to time) seem to
come from a point of idle curiosity. That’s questions that bring it up
(i.e. potential user confusion).
(The first impression I got of `-tree` commands was that they were less
user-friendly commands for hardcore users.)
That’s just my perspective. Do you have a case in mind where such a new
command could lead to user confusion?
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-07 20:23 ` Marc Branchaud
2025-05-07 20:45 ` Junio C Hamano
2025-05-07 20:49 ` Kristoffer Haugsbakk
@ 2025-05-08 13:18 ` D. Ben Knoble
2 siblings, 0 replies; 135+ messages in thread
From: D. Ben Knoble @ 2025-05-08 13:18 UTC (permalink / raw)
To: Marc Branchaud
Cc: Toon Claes, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Wed, May 7, 2025 at 4:23 PM Marc Branchaud <marcnarc@xiplink.com> wrote:
>
>
> On 2025-05-07 10:22, Toon Claes wrote:
> > Marc Branchaud <marcnarc@xiplink.com> writes:
[cut]
> I agree that blaming is a well-(known) concept. I also agree that most
> users would understand what blame-tree would do, *once they find it*.
>
> But I think that's beside the point I'm trying to make. Git is
> notorious for making users learn countless commands, and having two
> slightly-different commands for blaming is just going to make that worse.
>
> I mean, from a usability point of view, it makes much more sense if "git
> blame" simply understood how to handle blaming a directory differently
> from blaming a file/blob:
>
> Want to see which commit last touched each line of a file? Just run
> git blame path/to/file
>
> Want to see which commits last touched each file under a tree? Just run
> git blame path/to/directory
>
> Git should be smart enough to figure out what to do from just whether or
> not the last argument is a file or directory.
I quite like this idea, too: today, "git blame t" in git.git is a
fatal error, for example (no such path 't' in HEAD). (#leftoverbits:
it's also not translated?)
Turning an error into a new use case seems like an excellent expansion
of capabilities.
>
> >> If this is really a form of blaming, then just make it an extension of
> >> "git blame", like maybe "git blame --latest".
> >
> > I'm afraid that won't work very well, because the code is very much
> > different. If naming is the only motivation to shoehorn this in, then I
> > think it's better to rethink the name?
>
> It's not just "naming" but rather trying to help Git be intuitively
> useful to users.
>
> Also, I think sacrificing usability because it makes the coding hard is
> unfortunate.
>
> I personally think it's fine for blame.c to contain two different
> internal swathes of code that do different things. The ~500 lines or so
> to implement blame-tree don't feel like a major burden to me, especially
> compared to the ~3000 lines already in blame.c...
>
> But if combining the two features into a single C file is too much to
> bear, perhaps refactor the existing blame.c code? Something like:
>
> - blame-file.c (the existing "git blame" implementation)
> - blame-tree.c (the new functionality)
> - blame.c (exposes both blame-file and blame-tree under "git blame")
A third alternative is to allow "git blame-tree" as here, but as
plumbing. Then we have "git blame <dir>" use it—today, that might mean
directly invoking "git blame-tree"; in the future, that might look
more like the relationship between "diff" and "diff-tree" (assuming
there is one)?
--
D. Ben Knoble
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-07 20:49 ` Kristoffer Haugsbakk
@ 2025-05-08 13:20 ` D. Ben Knoble
2025-05-08 13:26 ` Marc Branchaud
1 sibling, 0 replies; 135+ messages in thread
From: D. Ben Knoble @ 2025-05-08 13:20 UTC (permalink / raw)
To: Kristoffer Haugsbakk
Cc: Marc Branchaud, Toon Claes, git, Jeff King, Taylor Blau,
Derrick Stolee, Ævar Arnfjörð Bjarmason
On Wed, May 7, 2025 at 4:49 PM Kristoffer Haugsbakk
<kristofferhaugsbakk@fastmail.com> wrote:
> As a Git user I don’t see the problem. `git --list-cmds=builtins`
> lists 144 commands. Six of them are `-tree` commands.
>
> It’s not been my understanding that people stumble upon niche commands
> that easily.
Seconded. My experience is that the distribution of Git users is
skewed left towards "add/commit/push" with a long tail of curiosity on
the right… improving discoverability is a worthwhile goal, I think.
--
D. Ben Knoble
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-07 20:45 ` Junio C Hamano
@ 2025-05-08 13:26 ` Marc Branchaud
2025-05-08 14:26 ` Junio C Hamano
0 siblings, 1 reply; 135+ messages in thread
From: Marc Branchaud @ 2025-05-08 13:26 UTC (permalink / raw)
To: Junio C Hamano
Cc: Toon Claes, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On 2025-05-07 16:45, Junio C Hamano wrote:
> Marc Branchaud <marcnarc@xiplink.com> writes:
>>
>> I mean, from a usability point of view, it makes much more sense if
>> "git blame" simply understood how to handle blaming a directory
>> differently from blaming a file/blob:
>
> I think this needs rephrasing: blaming a whole file (or a whole
> tree) differently from blaming individual lines.
>
> As lines can move across files, and we do find such moves while
> tracing the origin of each line, "blaming a file" is not quite the
> right way to think about it. "blaming lines in a file", perhaps.
I see what you mean. "Blaming lines in a file" works for me.
This distinction brings up a wrinkle in my proposed DWIMery: should
git blame path/to/file
show the annotated blamed lines of the file, or simply display the last
commit that changed the file?
While a "whole-file blame" is really just
git log -1 path/to/file
I can appreciate the convenience of being able to do that with "git
blame". I suggest adding an option for this specific case, like maybe
"--latest" (I don't feel strongly about the option's name).
Want to see the annotated blamed lines of a file?
git blame path/to/file
Want to see the last commit to touch a file?
git blame --latest path/to/file
or
git log -1 path/to/file
Want to see the last commits to touch each file under a directory?
git blame path/to/directory
(--latest is implied because the target is a directory.)
It also occurs to me that
git blame path/to/directory
might need a way to toggle recursion. I suggest recursion be off by
default.
M.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-07 20:49 ` Kristoffer Haugsbakk
2025-05-08 13:20 ` D. Ben Knoble
@ 2025-05-08 13:26 ` Marc Branchaud
1 sibling, 0 replies; 135+ messages in thread
From: Marc Branchaud @ 2025-05-08 13:26 UTC (permalink / raw)
To: Kristoffer Haugsbakk, Toon Claes, git
Cc: Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On 2025-05-07 16:49, Kristoffer Haugsbakk wrote:
> On Wed, May 7, 2025, at 22:23, Marc Branchaud wrote:
>> I agree that blaming is a well-(known) concept. I also agree that most
>> users would understand what blame-tree would do, *once they find it*.
>>
>> But I think that's beside the point I'm trying to make. Git is
>> notorious for making users learn countless commands, and having two
>> slightly-different commands for blaming is just going to make that worse.
>
> As a Git user I don’t see the problem. `git --list-cmds=builtins`
> lists 144 commands. Six of them are `-tree` commands.
None of the -tree commands are porcelain meant for regular use, and only
merge-tree is "ancillary". The rest are all plumbing. These are hardly
the commands normal users will use. I've been using and scripting Git
for a great many years, and I think I've maybe used read-tree a handful
of times.
(I see that --list-cmds is experimental and only documented deep within
"git help git". You seem to be a very advanced Git user!)
> It’s not been my understanding that people stumble upon niche commands
> that easily.
Yes, I agree. That seems to support the point I've been trying to make...
> Most questions I’ve seen about git-commit-tree(1) (one of
> the `-tree` commands that seems to come up from time to time) seem to
> come from a point of idle curiosity. That’s questions that bring it up
> (i.e. potential user confusion).
>
> (The first impression I got of `-tree` commands was that they were less
> user-friendly commands for hardcore users.)
Of course they're less user-friendly: They're not porcelain.
> That’s just my perspective. Do you have a case in mind where such a new
> command could lead to user confusion?
Only decades of experience writing and using software. Bloating Git's
command set should only be done after serious consideration of alternatives.
If I were not subscribed to this list, and Git went ahead with
"blame-tree", I would most likely never learn about it. Since I do know
about "blame", if the feature were part of that command then I have a
good chance of discovering it the next time I read blame's documentation.
M.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-08 13:26 ` Marc Branchaud
@ 2025-05-08 14:26 ` Junio C Hamano
2025-05-08 15:12 ` Marc Branchaud
0 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-05-08 14:26 UTC (permalink / raw)
To: Marc Branchaud
Cc: Toon Claes, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Marc Branchaud <marcnarc@xiplink.com> writes:
> This distinction brings up a wrinkle in my proposed DWIMery: should
> git blame path/to/file
> show the annotated blamed lines of the file, or simply display the
> last commit that changed the file?
I thought you switch to blame-at-the-file-level only when you are
given a directory (or a tree)? "git blame path/to/file" has ALWAYS
done "blame these lines that appear in this file", and cannot change.
Of course you can say "git blame path/to/ | grep file"; as you said
yourself,
> git log -1 path/to/file
is so obvious, we do not need to introduce yet another way to get to
the same information, I think.
> It also occurs to me that
> git blame path/to/directory
> might need a way to toggle recursion. I suggest recursion be off by
> default.
I do not have a strong opinion on this part; I've somehow assumed
while reading your message that you wanted it to always recurse
(like `git ls-files` does) and I thought it made sense, but not
recursing and just showing a single level (like `git ls-tree` does)
with an option to make it recurse is certainly a possibility.
Thanks.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-08 14:26 ` Junio C Hamano
@ 2025-05-08 15:12 ` Marc Branchaud
2025-05-14 14:42 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Marc Branchaud @ 2025-05-08 15:12 UTC (permalink / raw)
To: Junio C Hamano
Cc: Toon Claes, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On 2025-05-08 10:26, Junio C Hamano wrote:
> Marc Branchaud <marcnarc@xiplink.com> writes:
>
>> This distinction brings up a wrinkle in my proposed DWIMery: should
>> git blame path/to/file
>> show the annotated blamed lines of the file, or simply display the
>> last commit that changed the file?
>
> I thought you switch to blame-at-the-file-level only when you are
> given a directory (or a tree)? "git blame path/to/file" has ALWAYS
> done "blame these lines that appear in this file", and cannot change.
>
> Of course you can say "git blame path/to/ | grep file"; as you said
> yourself,
>
>> git log -1 path/to/file
>
> is so obvious, we do not need to introduce yet another way to get to
> the same information, I think.
Fine by me. I personally don't think of "git blame" when I want to see
a file's commit history.
>> It also occurs to me that
>> git blame path/to/directory
>> might need a way to toggle recursion. I suggest recursion be off by
>> default.
>
> I do not have a strong opinion on this part; I've somehow assumed
> while reading your message that you wanted it to always recurse
> (like `git ls-files` does) and I thought it made sense, but not
> recursing and just showing a single level (like `git ls-tree` does)
> with an option to make it recurse is certainly a possibility.
I also don't feel strongly either way. It just seemed that defaulting
to recursion could end up creating a lot of processing (and output), and
that making the user explicitly ask for it seems friendly.
But whatever the default is, it does seem useful to have an option to
control recursion.
M.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-08 15:12 ` Marc Branchaud
@ 2025-05-14 14:42 ` Toon Claes
2025-05-14 19:29 ` Junio C Hamano
2025-05-14 21:15 ` Marc Branchaud
0 siblings, 2 replies; 135+ messages in thread
From: Toon Claes @ 2025-05-14 14:42 UTC (permalink / raw)
To: Marc Branchaud, Junio C Hamano
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Marc Branchaud <marcnarc@xiplink.com> writes:
> On 2025-05-08 10:26, Junio C Hamano wrote:
>> Marc Branchaud <marcnarc@xiplink.com> writes:
>>
>>> This distinction brings up a wrinkle in my proposed DWIMery: should
>>> git blame path/to/file
>>> show the annotated blamed lines of the file, or simply display the
>>> last commit that changed the file?
>>
>> I thought you switch to blame-at-the-file-level only when you are
>> given a directory (or a tree)? "git blame path/to/file" has ALWAYS
>> done "blame these lines that appear in this file", and cannot change.
I don't know about that. What if you want to blame multiple files:
$ git blame-tree refs.c refs.h
or (letting your shell do the globbing):
$ git blame-tree *.h
I find these use cases very convenient. At GitLab we need some kind of
pagination over the files in a tree; if we can pass individual
filenames, we could use that for pagination.
>> Of course you can say "git blame path/to/ | grep file"; as you said
>> yourself,
This isn't very efficient. If a file in that tree was only touched in
the "initial commit" you have to wait for the blame process to walk the
history all the way down to that commit, while you're not actually
interested in that file.
>>
>>> git log -1 path/to/file
>>
>> is so obvious, we do not need to introduce yet another way to get to
>> the same information, I think.
Well, if you can pass multiple files (which git-blame-tree(1) currently
can) it's way more efficient to walk the history once, and see along the
history which file was touched when. For us at GitLab that's the whole
idea of upstreaming this feature.
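As a rough illustration of the single-walk idea using commands that
exist today (only a sketch, not how git-blame-tree(1) is implemented;
the helper name `first_touch` is made up, and paths that Git quotes or
that start with "commit " are not handled), one `git log` over all
paths can stand in for a `git log -1` per path:

```shell
# Hypothetical helper: reads "git log --format='commit %H' --name-only"
# output on stdin and prints "<commit> TAB <path>" for the first (newest)
# commit in which each path appears.
first_touch() {
    awk '
        /^commit / { commit = $2; next }     # remember the current commit
        NF && !($0 in seen) {                # first time this path shows up
            seen[$0] = 1
            print commit "\t" $0
        }
    '
}

# One history walk for both paths, instead of one walk per path:
#   git log --format='commit %H' --name-only -- refs.c refs.h | first_touch
```

Note that this pipeline keeps walking history even after every path has
been found, so it only illustrates sharing one walk between paths.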
> Fine by me. I personally don't think of "git blame" when I want to see
> a file's commit history.
Personally I don't like the idea of the DWIM approach. I'd rather keep
following the UNIX philosophy and have each command do one thing well.
I think it is weird to change behavior based on context.
You said earlier in this thread:
> This distinction brings up a wrinkle in my proposed DWIMery: should
> git blame path/to/file
> show the annotated blamed lines of the file, or simply display the last
> commit that changed the file?
For me this gives good motivation to not mix behavior of file-level and
line-level blames into a single command. If behavior is ambiguous, we
should avoid it.
> I can appreciate the convenience of being able to do that with "git
> blame". I suggest adding an option for this specific case, like maybe
> "--latest" (I don't feel strongly about the option's name).
What makes `git blame --latest` better than `git blame-tree`?
> I agree that blaming is a well-(known) concept. I also agree that most
> users would understand what blame-tree would do, *once they find it*.
I'm also not convinced that an option argument to an existing command
would be easier to discover than a new command. I think it's more an
issue of us advertising features than of commands being discoverable on
their own.
> Also, I think sacrificing usability because it makes the coding hard is
> unfortunate.
Agreed, that was not a good argument for me to make.
I wrote:
> > Forgive me, but I think folding into git-blame(1) will also solidify
> > Git's reputation of obscurity.
>
> Please elaborate.
As I mentioned above, I think having behavior of git-blame(1) depend on
the type of the argument (is it a dir or a file) is rather obscure. The
format of the output returned will be drastically different in both
cases, and having to machine-parse this might be tricky.
Cheers,
-- Toon
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-14 14:42 ` Toon Claes
@ 2025-05-14 19:29 ` Junio C Hamano
2025-05-14 21:15 ` Marc Branchaud
2025-05-14 21:15 ` Marc Branchaud
1 sibling, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-05-14 19:29 UTC (permalink / raw)
To: Toon Claes
Cc: Marc Branchaud, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Toon Claes <toon@iotcl.com> writes:
>>> I thought you switch to blame-at-the-file-level only when you are
>>> given a directory (or a tree)? "git blame path/to/file" has ALWAYS
>>> done "blame these lines that appear in this file", and cannot change.
>
> I don't know about that. What if you want to blame multiple files:
>
> $ git blame-tree refs.c refs.h
I do not mind "multiple files mean blame-tree mode" as yet another
heuristic to tell which mode we are talking about, as "blame these
lines" mode would take just one pathname to a blob and never a tree.
But the topic, IIRC, was about how "git blame" (with 'blame-tree'
feature rolled into it) can tell which mode the request by the user
is about. So you should have said "git blame refs.c refs.h" above.
> or (letting your shell do the globbing):
>
> $ git blame-tree *.h
This one (with command name corrected) is questionable, as there
could be a case where there is a single .h file, in which case, the
command line would become "git blame that-single-header-file.h".
Again, I do not mind "even though I may have only a single blob
specified on the command line, I want the blame-tree mode" command
line option. So to recap
$ git blame path-to-dir ;# blame-tree mode for paths in the directory
$ git blame path1 path2 ;# blame-tree mode
$ git blame path ;# traditional blame-these-lines mode
$ git blame --tree path ;# blame-tree mode
$ git blame --tree path1 path2 ;# blame-tree mode
would work fine.
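The recap above amounts to a small decision rule. A minimal sketch in
shell, where `blame_mode` is a hypothetical helper and `--tree` is only
the option name floated in this message, not an existing flag:

```shell
# Hypothetical helper: prints which mode a combined "git blame" would
# pick for the given arguments ("tree" or "lines").
blame_mode() {
    force_tree=no
    if [ "$1" = "--tree" ]; then
        force_tree=yes
        shift
    fi
    # Tree mode if forced, if several paths are given, or if the single
    # path is a directory; otherwise the traditional line-level mode.
    if [ "$force_tree" = yes ] || [ "$#" -gt 1 ] || [ -d "$1" ]; then
        echo tree
    else
        echo lines
    fi
}
```

With this rule, a glob that happens to expand to a single file still
falls back to line-level mode unless `--tree` is given explicitly.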
Having said that, I personally do not think of what "blame-tree"
does as "blame" at all, and there should be a better name for that
operation that does not use "blame" or "annotate". So a separate
command that does not even hint it has any relationship with "blame"
(because it doesn't; in my mental model, it does not do any "blame"
at all---it just does "git log -1 path" for many paths in parallel)
would be even more preferable.
Thanks.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-14 19:29 ` Junio C Hamano
@ 2025-05-14 21:15 ` Marc Branchaud
2025-05-15 13:29 ` Patrick Steinhardt
0 siblings, 1 reply; 135+ messages in thread
From: Marc Branchaud @ 2025-05-14 21:15 UTC (permalink / raw)
To: Junio C Hamano, Toon Claes
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On 2025-05-14 15:29, Junio C Hamano wrote:
> Toon Claes <toon@iotcl.com> writes:
>
>>>> I thought you switch to blame-at-the-file-level only when you are
>>>> given a directory (or a tree)? "git blame path/to/file" has ALWAYS
>>>> done "blame these lines that appear in this file", and cannot change.
>>
>> I don't know about that. What if you want to blame multiple files:
>>
>> $ git blame-tree refs.c refs.h
>
> I do not mind "multiple files mean blame-tree mode" as yet another
> heuristic to tell which mode we are talking about, as "blame these
> lines" mode would take just one pathname to a blob and never a tree.
>
> But the topic, IIRC, was about how "git blame" (with 'blame-tree'
> feature rolled into it) can tell which mode the request by the user
> is about. So you should have said "git blame refs.c refs.h" above.
>
>> or (letting your shell do the globbing):
>>
>> $ git blame-tree *.h
>
> This one (with command name corrected) is questionable, as there
> could be a case where there is a single .h file, in which case, the
> command line would become "git blame that-single-header-file.h".
>
> Again, I do not mind "even though I may have only a single blob
> specified on the command line, I want the blame-tree mode" command
> line option. So to recap
>
> $ git blame path-to-dir ;# blame-tree mode for paths in the directory
> $ git blame path1 path2 ;# blame-tree mode
> $ git blame path ;# traditional blame-these-lines mode
> $ git blame --tree path ;# blame-tree mode
> $ git blame --tree path1 path2 ;# blame-tree mode
>
> would work fine.
I'd be happy with all of that.
> Having said that, I personally do not think of what "blame-tree"
> does as "blame" at all, and there should be a better name for that
> operation that does not use "blame" or "annotate". So a separate
> command that does not even hint it has any relationship with "blame"
> (because it doesn't; in my mental model, it does not do any "blame"
> at all---it just does "git log -1 path" for many paths in parallel)
> would be even more preferable.
I'd also be happy if instead this came in as a new command without
"blame" in its name.
How about [[consults thesaurus ...]] "git ascribe-tree"?
Or maybe fold it into ls-tree, e.g. "git ls-tree --ascribe"?
M.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-14 14:42 ` Toon Claes
2025-05-14 19:29 ` Junio C Hamano
@ 2025-05-14 21:15 ` Marc Branchaud
1 sibling, 0 replies; 135+ messages in thread
From: Marc Branchaud @ 2025-05-14 21:15 UTC (permalink / raw)
To: Toon Claes, Junio C Hamano
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
(I agree with Junio's reply to your message, so here I'm just going to
address the things that Junio didn't.)
I'll preface all this by restating my original point: If you really want
to implement this feature as a new command, please don't use "blame" in
that new command's name.
On 2025-05-14 10:42, Toon Claes wrote:
>
> Personally I don't like the idea of the DWIM approach. I'd rather keep
> following the UNIX philosophy and have each command do one thing well.
> I think it is weird to change behavior based on context.
>
> You said earlier in this thread:
>
>> This distinction brings up a wrinkle in my proposed DWIMery: should
>> git blame path/to/file
>> show the annotated blamed lines of the file, or simply display the last
>> commit that changed the file?
>
> For me this gives good motivation to not mix behavior of file-level and
> line-level blames into a single command. If behavior is ambiguous, we
> should avoid it.
The behavior is not ambiguous at all, it's simply context-dependent.
Like with "git add": We don't have "git add-tree" to add a directory of
files. Adding is adding, and so we make the "add" command handle all
types of adding.
Similarly, we don't need "blame-tree" to annotate a directory. Just
because annotating a single file has different output from annotating a
tree of files doesn't mean that we need two different verbs to annotate
either kind of object.
>> I can appreciate the convenience of being able to do that with "git
>> blame". I suggest adding an option for this specific case, like maybe
>> "--latest" (I don't feel strongly about the option's name).
>
> What makes `git blame --latest` better than `git blame-tree`?
If a user wants to blame/annotate something -- a tree or a file -- it's
much easier for them to just use one command to blame whatever they
want. No need to discover a different command and read a whole new man
page to figure it out.
And all the people who already know about "git blame" get new and useful
behavior from their familiar command. They are much more likely to
discover that when it's built into "git blame" than if the new feature
is hiding under a different command.
Also, people who tab-complete commands will appreciate that
git bl<tab>
continues to complete to "git blame " instead of "git blame". (You
could argue that this is one way people might discover blame-tree, but I
think messing with completionists' muscle-memory is going to annoy them
more than help them.)
>> I agree that blaming is a well-(known) concept. I also agree that most
>> users would understand what blame-tree would do, *once they find it*.
>
> I'm also not convinced that an option argument to an existing command
> would be easier to discover than a new command. I think it's more an
> issue of us advertising features than of commands being discoverable on
> their own.
Extending an existing command is an incremental way of making things
better for all the people who are already using that command. They are
more likely to discover the new behavior, either by spotting it when
they're checking the man page or, in this case, by accidentally passing
a directory to "git blame".
Hiding this in a new command makes it much less likely to be discovered
by current Git users.
Yes, it is an advertising issue. I don't consider Git to be a gold
standard for feature discoverability. So I don't think that simply
saying it's more of an advertising problem gets us anywhere, because so
far Git has failed miserably at advertising its commands.
>> Also, I think sacrificing usability because it makes the coding hard is
>> unfortunate.
>
> Agreed, that was not a good argument for me to make.
>
> I wrote:
>>> Forgive me, but I think folding into git-blame(1) will also solidify
>>> Git's reputation of obscurity.
>>
>> Please elaborate.
>
> As I mentioned above, I think having behavior of git-blame(1) depend on
> the type of the argument (is it a dir or a file) is rather obscure.
I don't buy that. Many Unix commands give different outputs when run
against a file vs. a directory (try diff, for example). Even simple
things like "ls" will show a single line of output for a file but
multiple lines for a directory. You can argue that one line vs. many
isn't a drastic difference, but it *is* a difference. And there's a
reason why it's "ls -R" instead of "ls-tree": Listing is listing, so
"ls" fulfills all your listing needs.
> The format of the output returned will be drastically different in both
> cases, and having to machine-parse this might be tricky.
Machine-parsing output is a strawman.
First of all, even though "blame" is considered an ancillary command and
not officially listed as porcelain, it's also not plumbing and so it has
no obligation to make machines' lives easier.
Second, why do you think a script needs to parse both output formats?
Even if there are reasons to write such a script, how does having two
commands for the different formats help? Either way such a script's
author needs to deal with both formats. Furthermore, if I was
maintaining a script that already understands how to parse single-file
annotation:
git blame path/to/file | my-script
I would be quite happy for it to die horribly if someone ran it on the
output of a tree annotation.
As you say, in that pipe example teaching my-script how to tell what
kind of output it's receiving could be tricky. But I doubt that many
existing scripts that parse blame output are implemented as
pipe-readers. Rather, I think (yes, without any evidence) that most
script authors run the blame command directly as part of their script
and so they'll know what kind of output the command they're running will
generate (since they'll know what kind of arguments they're passing to
the command).
Folks who really need to write a pipe-reader can just teach their script
an argument identifying the kind of output to expect. Much easier, and
more robust, than making the code figure it out. Pipe-reading scripts
will need to figure out something like this regardless of how we resolve
this discussion.
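A minimal sketch of such a pipe reader, assuming only that both output
formats start each line with a commit ID. The function name
`summarize_blame`, its `lines`/`tree` argument, and the
"<commit> <path>" layout for the tree case are all hypothetical:

```shell
# Hypothetical pipe reader: the caller states which format arrives on stdin.
summarize_blame() {
    case "$1" in
    lines)
        # Line-level blame: count how many lines each commit is blamed for.
        awk '{ n[$1]++ } END { for (c in n) print c, n[c] }' | sort
        ;;
    tree)
        # Tree-level output (assumed "<commit> <path>"): echo the pairs.
        awk '{ print $1, $2 }'
        ;;
    *)
        echo "usage: summarize_blame lines|tree" >&2
        return 2
        ;;
    esac
}

# Usage, with the caller declaring the format it is piping in:
#   git blame path/to/file | summarize_blame lines
```

Passing the format explicitly means the script fails loudly on an
unknown mode instead of guessing from the shape of its input.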
M.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-14 21:15 ` Marc Branchaud
@ 2025-05-15 13:29 ` Patrick Steinhardt
2025-05-15 16:39 ` Junio C Hamano
2025-05-15 17:30 ` Marc Branchaud
0 siblings, 2 replies; 135+ messages in thread
From: Patrick Steinhardt @ 2025-05-15 13:29 UTC (permalink / raw)
To: Marc Branchaud
Cc: Junio C Hamano, Toon Claes, git, Jeff King, Taylor Blau,
Derrick Stolee, Ævar Arnfjörð Bjarmason
On Wed, May 14, 2025 at 05:15:30PM -0400, Marc Branchaud wrote:
> On 2025-05-14 15:29, Junio C Hamano wrote:
> > Having said that, I personally do not think of what "blame-tree"
> > does as "blame" at all, and there should be a better name for that
> > operation that does not use "blame" or "annotate". So a separate
> > command that does not even hint it has any relationship with "blame"
> > (because it doesn't; in my mental model, it does not do any "blame"
> > at all---it just does "git log -1 path" for many paths in parallel)
> > would be even more preferable.
Curious. Isn't that exactly what git-blame(1) does, though? Given
the textual representation of a tree object, we figure out when each of
the lines has last been changed. That to me sounds like exactly the same
thing as git-blame(1), but just for trees instead of for blobs.
Sure, git-blame-tree(1) goes further than that. But conceptually it is
exactly the above thing, isn't it?
> I'd also be happy if instead this came in as a new command without "blame"
> in its name.
>
> How about [[consults thesaurus ...]] "git ascribe-tree"?
>
> Or maybe fold it into ls-tree, e.g. "git ls-tree --ascribe"?
I think anything that needs a thesaurus to come up with probably isn't a
good name for non-native speakers. I personally had to look up what this
word means.
Patrick
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-15 13:29 ` Patrick Steinhardt
@ 2025-05-15 16:39 ` Junio C Hamano
2025-05-15 17:39 ` Marc Branchaud
2025-05-15 17:30 ` Marc Branchaud
1 sibling, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-05-15 16:39 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Marc Branchaud, Toon Claes, git, Jeff King, Taylor Blau,
Derrick Stolee, Ævar Arnfjörð Bjarmason
Patrick Steinhardt <ps@pks.im> writes:
> Curious. Isn't that exactly what git-blame(1) does, though? Given
> the textual representation of a tree object, we figure out when each of
> the lines has last been changed. That to me sounds like exactly the same
> thing as git-blame(1), but just for trees instead of for blobs.
That's a mechanical worldview from the viewpoint of those who know the
internal representation and workings of Git, I would have to say.
As an end-user, I view "where does the body of this function come
from" and "when did I touch this file the last time" as quite different
and unrelated kinds of queries.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-15 13:29 ` Patrick Steinhardt
2025-05-15 16:39 ` Junio C Hamano
@ 2025-05-15 17:30 ` Marc Branchaud
2025-05-16 4:30 ` Patrick Steinhardt
1 sibling, 1 reply; 135+ messages in thread
From: Marc Branchaud @ 2025-05-15 17:30 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Junio C Hamano, Toon Claes, git, Jeff King, Taylor Blau,
Derrick Stolee, Ævar Arnfjörð Bjarmason
On 2025-05-15 09:29, Patrick Steinhardt wrote:
> On Wed, May 14, 2025 at 05:15:30PM -0400, Marc Branchaud wrote:
>> On 2025-05-14 15:29, Junio C Hamano wrote:
>>> Having said that, I personally do not think of what "blame-tree"
>>> does as "blame" at all, and there should be a better name for that
>>> operation that does not use "blame" or "annotate". So a separate
>>> command that does not even hint it has any relationship with "blame"
>>> (because it doesn't; in my mental model, it does not do any "blame"
>>> at all---it just does "git log -1 path" for many paths in parallel)
>>> would be even more preferable.
>
> Curious. Isn't that exactly what git-blame(1) does, though? Given
> the textual representation of a tree object, we figure out when each of
> the lines has last been changed. That to me sounds like exactly the same
> thing as git-blame(1), but just for trees instead of for blobs.
>
> Sure, git-blame-tree(1) goes further than that. But conceptually it is
> exactly the above thing, isn't it?
I think the operation can be perceived in different ways. My only point
is that if we do conclude that it is a form of blaming then we should fold it
into "git blame" instead of a new command.
>> I'd also be happy if instead this came in as a new command without "blame"
>> in its name.
>>
>> How about [[consults thesaurus ...]] "git ascribe-tree"?
>>
>> Or maybe fold it into ls-tree, e.g. "git ls-tree --ascribe"?
>
> I think anything that needs a thesaurus to come up with probably isn't a
> good name for non-native speakers. I personally had to look up what this
> word means.
Yeah, that was a bit tongue-in-cheek, sorry.
(Honestly, "ascribe" would be really bad, precisely because it is a
synonym of "blame"...)
M.
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-15 16:39 ` Junio C Hamano
@ 2025-05-15 17:39 ` Marc Branchaud
2025-05-15 19:30 ` Jeff King
0 siblings, 1 reply; 135+ messages in thread
From: Marc Branchaud @ 2025-05-15 17:39 UTC (permalink / raw)
To: Junio C Hamano, Patrick Steinhardt
Cc: Toon Claes, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On 2025-05-15 12:39, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
>> Curious. Isn't it exactly the same what git-blame(1) does though? Taken
>> the textual representation of a tree object, we figure out when each of
>> the lines has last been changed. That to me sounds like exactly the same
>> thing as git-blame(1), but just for trees instead of for blobs.
>
> That's mechanical worldview from the viewpoint of those who know the
> internal representation and workings of Git, I would have to say.
I interpreted Patrick's statement in the exact opposite way! I thought
he was speaking as a normal Git user, who is just considering the
semantics of the outputs: both are fundamentally just lines of a
Something (a file or a directory listing), each prefixed by a commit ID.
> As an end-user, I view "where does the body of this function came
> from" and "when did I touch this file the last time" quite different
> and unrelated kind of queries.
I can see them either way, depending on how I squint. I have no
objection if people want to think of this new operation as
something-that-is-not-a-blame. But then don't call it blame-tree!
How about last-touch?
M.
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-15 17:39 ` Marc Branchaud
@ 2025-05-15 19:30 ` Jeff King
2025-05-16 4:38 ` Patrick Steinhardt
0 siblings, 1 reply; 135+ messages in thread
From: Jeff King @ 2025-05-15 19:30 UTC (permalink / raw)
To: Marc Branchaud
Cc: Junio C Hamano, Patrick Steinhardt, Toon Claes, git, Taylor Blau,
Derrick Stolee, Ævar Arnfjörð Bjarmason
On Thu, May 15, 2025 at 01:39:59PM -0400, Marc Branchaud wrote:
> > As an end-user, I view "where does the body of this function came
> > from" and "when did I touch this file the last time" quite different
> > and unrelated kind of queries.
>
> I can see them either way, depending on how I squint. I have no objection
> if people want to think of this new operation as
> something-that-is-not-a-blame. But then don't call it blame-tree!
>
> How about last-touch?
The name "blame-tree" is probably my fault, as that's what I called it
in 2012 when I originally wrote it. I don't have access to the adjacent
repos anymore, but I _think_ it was replacing a script that was in fact
called "git-last-modified" or something like that. So it all comes
around. ;)
The debate has mostly been over "blame" here. But I think "tree" is also
inaccurate. Theoretically it can be about any set of paths in the repo,
not just the entries of a single tree. So:
git last-modified Makefile Documentation/Makefile t/Makefile
would be a perfectly valid thing to ask about (and of course a
pathspec like '**Makefile' would be a simpler way to do so). The word
"tree" was there because the original use case at GitHub was getting
those values for all of the entries in a particular tree.
But conceptually it is just about expanding a pathspec into a set of
paths, and then traversing and reporting the last time each path was
modified. It _almost_ fits into the "git-log" family, which is all about
traversing and pathspecs. The output is a bit different, but I almost
wonder if it would work as an option to continuously limit the pathspec.
Something like:
$ git log --format=%H --last-modified --raw '**Makefile'
89d557b950c7a0581c12452e8f9576c45546246b
:100644 100644 13f9062a05 c4d21ccd3d M Makefile
[ skip a bunch of commits that touched only Makefile, nothing else ]
a7fa5b2f0ccb567a5a6afedece113f207902fa6f
:100644 100644 6485d40f62 b109d25e9c M Documentation/Makefile
[ skip more; now this one is interesting, because one commit touches a
bunch of files! It also touches Documentation/Makefile, but we'd
have already narrowed our pathspec to forget about it by this point ]
5309c1e9fb399c390ed36ef476e91f76f6746fa9
:100644 100644 3e67552cc5 97ce9c92fb M contrib/credential/libsecret/Makefile
:100644 100644 238f5f8c36 0948297e20 M contrib/credential/osxkeychain/Makefile
:100644 100644 6e992c0866 5b795fc9fe M contrib/credential/wincred/Makefile
:100644 100644 f2be7cc924 33c2ccc9f7 M contrib/diff-highlight/Makefile
:100644 100644 5ff5275496 2a98541477 M contrib/diff-highlight/t/Makefile
:100644 100644 4e603512a3 497ac434d6 M contrib/mw-to-git/Makefile
:100644 100644 f422203fa0 6c9f377caa M contrib/mw-to-git/t/Makefile
:100644 100644 52b84ba3d4 691737e76b M contrib/persistent-https/Makefile
:100644 100644 093399c788 2a85f5ee84 M contrib/subtree/t/Makefile
:100644 100644 667c39ed56 6c5a12bc32 M git-gui/Makefile
:100644 100644 749aa2e7ec e656b0d2b0 M git-gui/po/glossary/Makefile
:100644 100644 6911c2915a 4ff4ed0616 M t/interop/Makefile
:100644 100644 e4808aebed 9b3090c4ed M t/perf/Makefile
:100644 100644 bd1e9e30c1 722755338d M templates/Makefile
[ ... end immediately without traversing further here, since all
paths have been reported ... ]
I dunno. I just made that up. The output is obviously quite different
from what blame-tree produces, but it would be easy-ish to collect it in the
same way. And it's much more flexible, because you could use --format
and diff options to report as much or as little about each commit as
you'd want.
It is a bit different from regular log, though, in that we'd expand the
pathspec at the very start, rather than applying it continuously as we
traverse (otherwise we could never end early, since we'd never know if
there was a "foo/Makefile" deep in history).
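To make the early-termination point concrete, here is a toy sketch in C.
It is not the code from the series: the `toy_*` names and the in-memory
"history" are made up for illustration, merges are ignored, and a commit's
touched paths are simply given rather than computed by diffing trees. It
only shows the shape of the idea: fix the set of paths up front, walk from
newest to oldest, drop each path once it is reported, and stop as soon as
nothing is pending.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_TOUCHED 8

/* Hypothetical toy model: a commit is an ID plus the paths it touched. */
struct toy_commit {
	const char *id;
	const char *touched[TOY_MAX_TOUCHED];	/* unused slots are NULL */
};

/* One path from the pathspec, expanded before the walk starts. */
struct toy_result {
	const char *path;
	const char *commit;	/* NULL until a commit is found */
};

/*
 * Walk `history` (newest commit first) and record, for each path in
 * `results`, the first commit seen that touches it. Returns the number
 * of commits visited, which shows how early the walk could stop.
 */
static size_t toy_last_modified(const struct toy_commit *history,
				size_t nr_commits,
				struct toy_result *results, size_t nr_paths)
{
	size_t pending = nr_paths;
	size_t visited = 0;

	assert(history && results);

	for (size_t p = 0; p < nr_paths; p++)
		results[p].commit = NULL;

	/* Stop as soon as every path has been reported. */
	for (size_t c = 0; c < nr_commits && pending; c++) {
		visited++;
		for (size_t t = 0;
		     t < TOY_MAX_TOUCHED && history[c].touched[t]; t++) {
			for (size_t p = 0; p < nr_paths; p++) {
				/* Already reported: narrowed out of the set. */
				if (results[p].commit)
					continue;
				if (strcmp(results[p].path,
					   history[c].touched[t]))
					continue;
				results[p].commit = history[c].id;
				pending--;
			}
		}
	}
	return visited;
}
```

With three Makefile paths and a four-commit history in which the newest
three commits cover all of them, the walk never looks at the fourth
commit. A path that only ever existed deep in history is never in the set
to begin with, which is what allows the walk to end.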
So you could argue that "git last-modified" could also just take
format and diff output options. ;)
-Peff
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-15 17:30 ` Marc Branchaud
@ 2025-05-16 4:30 ` Patrick Steinhardt
0 siblings, 0 replies; 135+ messages in thread
From: Patrick Steinhardt @ 2025-05-16 4:30 UTC (permalink / raw)
To: Marc Branchaud
Cc: Junio C Hamano, Toon Claes, git, Jeff King, Taylor Blau,
Derrick Stolee, Ævar Arnfjörð Bjarmason
On Thu, May 15, 2025 at 01:30:47PM -0400, Marc Branchaud wrote:
> On 2025-05-15 09:29, Patrick Steinhardt wrote:
> > On Wed, May 14, 2025 at 05:15:30PM -0400, Marc Branchaud wrote:
> > > in its name.
> > >
> > > How about [[consults thesaurus ...]] "git ascribe-tree"?
> > >
> > > Or maybe fold it into ls-tree, e.g. "git ls-tree --ascribe"?
> >
> > I think anything that needs a thesaurus to come up with probably isn't a
> > good name for non-native speakers. I personally had to look up what this
> > word means.
>
> Yeah, that was a bit tongue-in-cheek, sorry.
>
> (Honestly, "ascribe" would be really bad, precisely because it is a synonym
> of "blame"...)
There's no need to be sorry, even if it hadn't been tongue-in-cheek :)
Patrick
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-15 19:30 ` Jeff King
@ 2025-05-16 4:38 ` Patrick Steinhardt
2025-05-20 8:49 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Patrick Steinhardt @ 2025-05-16 4:38 UTC (permalink / raw)
To: Jeff King
Cc: Marc Branchaud, Junio C Hamano, Toon Claes, git, Taylor Blau,
Derrick Stolee, Ævar Arnfjörð Bjarmason
On Thu, May 15, 2025 at 03:30:46PM -0400, Jeff King wrote:
> On Thu, May 15, 2025 at 01:39:59PM -0400, Marc Branchaud wrote:
>
> > > As an end-user, I view "where does the body of this function came
> > > from" and "when did I touch this file the last time" quite different
> > > and unrelated kind of queries.
> >
> > I can see them either way, depending on how I squint. I have no objection
> > if people want to think of this new operation as
> > something-that-is-not-a-blame. But then don't call it blame-tree!
> >
> > How about last-touch?
>
> The name "blame-tree" is probably my fault, as that's what I called it
> in 2012 when I originally wrote it. I don't have access to the adjacent
> repos anymore, but I _think_ it was replacing a script that was in fact
> called "git-last-modified" or something like that. So it all comes
> around. ;)
>
> The debate has mostly been over "blame" here. But I think "tree" is also
> inaccurate. Theoretically it can be about any set of paths in the repo,
> not just the entries of a single tree. So:
>
> git last-modified Makefile Documentation/Makefile t/Makefile
>
> would be a perfectly valid thing to ask about (and of course a
> pathspec like '**Makefile' would be a simpler way to do so). The word
> "tree" was there because the original use case at GitHub was getting
> those values for all of the entries in a particular tree.
I like "git last-modified". Its name is very telling and it does just
what it says.
> But conceptually it is just about expanding a pathspec into a set of
> paths, and then traversing and reporting the last time each path was
> modified. It _almost_ fits into the "git-log" family, which is all about
> traversing and pathspecs. The output is a bit different, but I almost
> wonder if it would work as an option to continuously limit the pathspec.
> Something like:
>
> $ git log --format=%H --last-modified --raw '**Makefile'
> 89d557b950c7a0581c12452e8f9576c45546246b
> :100644 100644 13f9062a05 c4d21ccd3d M Makefile
> [ skip a bunch of commits that touched only Makefile, nothing else ]
> a7fa5b2f0ccb567a5a6afedece113f207902fa6f
> :100644 100644 6485d40f62 b109d25e9c M Documentation/Makefile
> [ skip more; now this one is interesting, because one commit touches a
> bunch of files! It also touches Documentation/Makefile, but we'd
> have already narrowed our pathspec to forget about it by this point ]
> 5309c1e9fb399c390ed36ef476e91f76f6746fa9
> :100644 100644 3e67552cc5 97ce9c92fb M contrib/credential/libsecret/Makefile
> :100644 100644 238f5f8c36 0948297e20 M contrib/credential/osxkeychain/Makefile
> :100644 100644 6e992c0866 5b795fc9fe M contrib/credential/wincred/Makefile
> :100644 100644 f2be7cc924 33c2ccc9f7 M contrib/diff-highlight/Makefile
> :100644 100644 5ff5275496 2a98541477 M contrib/diff-highlight/t/Makefile
> :100644 100644 4e603512a3 497ac434d6 M contrib/mw-to-git/Makefile
> :100644 100644 f422203fa0 6c9f377caa M contrib/mw-to-git/t/Makefile
> :100644 100644 52b84ba3d4 691737e76b M contrib/persistent-https/Makefile
> :100644 100644 093399c788 2a85f5ee84 M contrib/subtree/t/Makefile
> :100644 100644 667c39ed56 6c5a12bc32 M git-gui/Makefile
> :100644 100644 749aa2e7ec e656b0d2b0 M git-gui/po/glossary/Makefile
> :100644 100644 6911c2915a 4ff4ed0616 M t/interop/Makefile
> :100644 100644 e4808aebed 9b3090c4ed M t/perf/Makefile
> :100644 100644 bd1e9e30c1 722755338d M templates/Makefile
> [ ... end immediately without traversing further here, since all
> paths have been reported ... ]
>
> I dunno. I just made that up. The output is obviously quite different
> than blame-tree produces, but it would be easy-ish to collect it in the
> same way. And it's much more flexible, because you could use --format
> and diff options to report as much or as little about each commit as
> you'd want.
>
> It is a bit different from regular log, though, in that we'd expand the
> pathspec at the very start, rather than applying it continuously as we
> traverse (otherwise we could never end early, since we'd never know if
> there was a "foo/Makefile" deep in history).
That's the biggest downside from my point of view: it works quite
differently, so we can expect that many of the options that git-log(1)
accepts wouldn't make sense at all. From my point of view we already
have too many commands where we have different "modes" hidden behind
options. They are hard to discover, and in theory you have to manually
mark all incompatible options as such, which is bound to grow stale.
> So you could argue that "git last-modified" could also just take
> format and diff output options. ;)
But this one I agree with -- if we had git-last-modified(1), then it
would eventually make sense to have at least `--format`. I don't have a
use case for diff output options, but if any come up they could probably
be added at a later point as well.
Patrick
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH RFC 0/5] Introduce git-blame-tree(1) command
2025-05-16 4:38 ` Patrick Steinhardt
@ 2025-05-20 8:49 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-05-20 8:49 UTC (permalink / raw)
To: Patrick Steinhardt, Jeff King
Cc: Marc Branchaud, Junio C Hamano, git, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Patrick Steinhardt <ps@pks.im> writes:
>> On Thu, May 15, 2025 at 01:39:59PM -0400, Marc Branchaud wrote:
> On Thu, May 15, 2025 at 03:30:46PM -0400, Jeff King wrote:
>>
>> The debate has mostly been over "blame" here.
Okay, I'm happy to move away from "blame".
>> But I think "tree" is also inaccurate. Theoretically it can be about
>> any set of paths in the repo, not just the entries of a single tree.
Totally.
>> So:
>>
>> git last-modified Makefile Documentation/Makefile t/Makefile
>>
>> would be a perfectly valid thing to ask about (and of course a
>> pathspec like '**Makefile' would be a simpler way to do so). The word
>> "tree" was there because the original use case at GitHub was getting
>> those values for all of the entries in a particular tree.
>
> I like "git last-modified". It's name is very telling and it does just
> what it says.
I like `git last-modified` too, but I do wonder whether it still makes
sense when you pass it a revision range:
git last-modified HEAD~2..HEAD
It kind of still does, but it's a little more questionable.
>> But conceptually it is just about expanding a pathspec into a set of
>> paths, and then traversing and reporting the last time each path was
>> modified. It _almost_ fits into the "git-log" family, which is all about
>> traversing and pathspecs. The output is a bit different, but I almost
>> wonder if it would work as an option to continuously limit the pathspec.
>> Something like:
>>
>> $ git log --format=%H --last-modified --raw '**Makefile'
>> 89d557b950c7a0581c12452e8f9576c45546246b
>> :100644 100644 13f9062a05 c4d21ccd3d M Makefile
>> [ skip a bunch of commits that touched only Makefile, nothing else ]
>> a7fa5b2f0ccb567a5a6afedece113f207902fa6f
>> :100644 100644 6485d40f62 b109d25e9c M Documentation/Makefile
>> [ skip more; now this one is interesting, because one commit touches a
>> bunch of files! It also touches Documentation/Makefile, but we'd
>> have already narrowed our pathspec to forget about it by this point ]
>> 5309c1e9fb399c390ed36ef476e91f76f6746fa9
>> :100644 100644 3e67552cc5 97ce9c92fb M contrib/credential/libsecret/Makefile
>> :100644 100644 238f5f8c36 0948297e20 M contrib/credential/osxkeychain/Makefile
>> :100644 100644 6e992c0866 5b795fc9fe M contrib/credential/wincred/Makefile
>> :100644 100644 f2be7cc924 33c2ccc9f7 M contrib/diff-highlight/Makefile
>> :100644 100644 5ff5275496 2a98541477 M contrib/diff-highlight/t/Makefile
>> :100644 100644 4e603512a3 497ac434d6 M contrib/mw-to-git/Makefile
>> :100644 100644 f422203fa0 6c9f377caa M contrib/mw-to-git/t/Makefile
>> :100644 100644 52b84ba3d4 691737e76b M contrib/persistent-https/Makefile
>> :100644 100644 093399c788 2a85f5ee84 M contrib/subtree/t/Makefile
>> :100644 100644 667c39ed56 6c5a12bc32 M git-gui/Makefile
>> :100644 100644 749aa2e7ec e656b0d2b0 M git-gui/po/glossary/Makefile
>> :100644 100644 6911c2915a 4ff4ed0616 M t/interop/Makefile
>> :100644 100644 e4808aebed 9b3090c4ed M t/perf/Makefile
>> :100644 100644 bd1e9e30c1 722755338d M templates/Makefile
>> [ ... end immediately without traversing further here, since all
>> paths have been reported ... ]
I like this idea. I think it makes sense to report "commit ABBC touched
X, Y, and Z; commit BBCD touched xx and yy; and ...". It makes the output
a lot less verbose.
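As a rough illustration of that grouping (a sketch only; the `toy_*`
names, the record layout, and the tab-indented rendering are assumptions
of mine, not anything from the series), per-path results could be folded
into per-commit groups like this:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record: one (commit, path) result from a per-path report. */
struct toy_pair {
	const char *commit;
	const char *path;
};

/* Append `s` to the NUL-terminated string in `buf`, truncating if needed. */
static void toy_append(char *buf, size_t buflen, const char *s)
{
	size_t used = strlen(buf);

	if (used + 1 < buflen)
		snprintf(buf + used, buflen - used, "%s", s);
}

/*
 * Render `pairs` grouped by commit: the commit ID on its own line,
 * followed by one tab-indented line per path. Groups appear in order
 * of first occurrence. Returns the number of groups written.
 */
static size_t toy_group_by_commit(const struct toy_pair *pairs, size_t nr,
				  char *buf, size_t buflen)
{
	size_t groups = 0;

	assert(buf && buflen > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < nr; i++) {
		size_t j;

		/* Skip pairs whose commit already started a group. */
		for (j = 0; j < i; j++)
			if (!strcmp(pairs[j].commit, pairs[i].commit))
				break;
		if (j < i)
			continue;

		groups++;
		toy_append(buf, buflen, pairs[i].commit);
		toy_append(buf, buflen, "\n");

		/* Emit every path attributed to this commit. */
		for (j = i; j < nr; j++) {
			if (strcmp(pairs[j].commit, pairs[i].commit))
				continue;
			toy_append(buf, buflen, "\t");
			toy_append(buf, buflen, pairs[j].path);
			toy_append(buf, buflen, "\n");
		}
	}
	return groups;
}
```

The quadratic scan keeps the sketch short; a real implementation would
more likely emit the group while it is still looking at the commit, as in
the `git log` mock-up quoted above.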
>> I dunno. I just made that up. The output is obviously quite different
>> than blame-tree produces
I think it depends on who you consider the primary user to be. Or, said
differently, on whether we mark this new command as plumbing or
porcelain. I would consider it a plumbing command, and that's the main
reason why I'm trying to upstream it: to use it in our tooling at
$DAYJOB. The output you present above is even more obscure than my
proposed git-blame-tree version, making it even more plumbing-like.
>> It is a bit different from regular log, though, in that we'd expand the
>> pathspec at the very start, rather than applying it continuously as we
>> traverse (otherwise we could never end early, since we'd never know if
>> there was a "foo/Makefile" deep in history).
>
> That's the biggest downside from my point of view: it works quite
> differently, so we can expect that many of the options that git-log(1)
> accepts wouldn't make sense at all.
Agreed.
>> So you could argue that "git last-modified" could also just take
>> format and diff output options. ;)
>
> But this one I agree with -- if we had git-last-modified(1), then it
> would eventually make sense to have at least `--format`. I don't have a
> use case for diff output options, but if any come up it could probably
> be added at a later point, as well.
I was planning to add `--format` to git-blame-tree(1) in the future, or
do you think it should be part of the initial version?
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* [PATCH RFC v2 0/5] Introduce git-last-modified(1) command
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
` (5 preceding siblings ...)
2025-04-23 13:26 ` [PATCH RFC 0/5] Introduce git-blame-tree(1) command Marc Branchaud
@ 2025-05-23 9:33 ` Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified Toon Claes
` (5 more replies)
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
7 siblings, 6 replies; 135+ messages in thread
From: Toon Claes @ 2025-05-23 9:33 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason, Derrick Stolee
This is another attempt to upstream the ~~git-blame-tree(1)~~
git-last-modified(1) subcommand. After my previous attempt[1] the
people of GitHub shared their version of the subcommand, and this
version integrates those changes.
What is different from the series shared by GitHub:
* Renamed the subcommand from `blame-tree` to `last-modified`. There was
some consensus[4] this name works better, so let's give it a try and
see how this name feels.
* Patches for --max-depth are excluded. I think it's a separate topic to
discuss and I'm not sure it needs to be part of this series anyway. The
main patch was submitted in the previous attempt[2] and if people
consider it valuable, I'm happy to discuss that in a separate patch
series.
* The patches in 'tb/blame-tree' at Taylor's fork[3] implement a
caching layer. This feature reads/writes cached results in
`.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
size, that feature is excluded from this series. I think it's better
to submit this as a separate series.
* Squashed various commits together. For example, they introduced a flag
`--go-faster`, which later became the default and only implementation.
That story was wrapped up in a single commit.
* The last-modified command isn't recursive by default. If you want to
recurse into subtrees, you need to pass `-r`.
* Fixed all memory leaks, and removed the use of
USE_THE_REPOSITORY_VARIABLE.
I've attempted to reuse commit messages as well as possible, but feel
free to correct me where you think I didn't give proper credit or messed
up. Although I have no idea what to do with the Signed-off-by trailers.
I didn't modify the benchmark results in the commit messages, simply
because I didn't get comparable results. In my benchmarks the difference
between two implementations was negligible, and even in some scenarios
the performance was worse in the "improved" implementation. As far as I
can tell, I didn't break anything in my refactoring, because the version
in these patches acts similarly to Taylor's branch. To be honest, I cannot
explain why...?
Again thanks to Taylor and the people at GitHub for sharing these
patches. I hope we can work together to get this upstreamed.
[1]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-0-4173133f3786@iotcl.com/
[2]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-3-4173133f3786@iotcl.com/
[3]: git@github.com:ttaylorr/git.git
[4]: https://lore.kernel.org/git/aCbBKj7O9LjO3SMK@pks.im/
--
Cheers,
Toon
Signed-off-by: Toon Claes <toon@iotcl.com>
---
Changes in v2:
- The subcommand is renamed from `blame-tree` to `last-modified`
- Documentation is added. Here we mark the command as experimental.
- Some test cases are added related to merges.
- Link to v1: https://lore.kernel.org/r/20250422-toon-new-blame-tree-v1-0-fdb51b8a394a@iotcl.com
---
Jeff King (1):
t/perf: add last-modified perf script
Toon Claes (4):
last-modified: new subcommand to show when files were last modified
last-modified: use Bloom filters when available
last-modified: implement faster algorithm
last-modified: initialize revision machinery without walk
.gitignore | 1 +
Documentation/git-last-modified.adoc | 49 ++++
Documentation/meson.build | 1 +
Makefile | 2 +
builtin.h | 1 +
builtin/last-modified.c | 43 +++
command-list.txt | 1 +
git.c | 1 +
last-modified.c | 496 +++++++++++++++++++++++++++++++++++
last-modified.h | 30 +++
meson.build | 2 +
t/meson.build | 2 +
t/perf/p8020-last-modified.sh | 21 ++
t/t8020-last-modified.sh | 194 ++++++++++++++
14 files changed, 844 insertions(+)
---
Range-diff versus v1:
1: 1b6cb2603e ! 1: 586f60da1f blame-tree: introduce new subcommand to blame files
@@ Metadata
Author: Toon Claes <toon@iotcl.com>
## Commit message ##
- blame-tree: introduce new subcommand to blame files
+ last-modified: new subcommand to show when files were last modified
- Similar to git-blame(1), introduce a new subcommand git-blame-tree(1).
- This command shows the most recent modification to paths in a tree. It
- does so by expanding the tree at a given commit, taking note of the
- current state of each path, and then walking backwards through history
- looking for commits where each path changed into its final commit ID.
+ Similar to git-blame(1), introduce a new subcommand
+ git-last-modified(1). This command shows the most recent modification to
+ paths in a tree. It does so by expanding the tree at a given commit,
+ taking note of the current state of each path, and then walking
+ backwards through history looking for commits where each path changed
+ into its final commit ID.
Based-on-patch-by: Jeff King <peff@peff.net>
Improved-by: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
@@ Commit message
## .gitignore ##
@@
- /git-backfill
- /git-bisect
- /git-blame
-+/git-blame-tree
- /git-branch
- /git-bugreport
- /git-bundle
+ /git-init-db
+ /git-interpret-trailers
+ /git-instaweb
++/git-last-modified
+ /git-log
+ /git-ls-files
+ /git-ls-remote
+
+ ## Documentation/git-last-modified.adoc (new) ##
+@@
++git-last-modified(1)
++====================
++
++NAME
++----
++git-last-modified - EXPERIMENTAL: Show when files were last modified
++
++
++SYNOPSIS
++--------
++[synopsis]
++git last-modified [-r] [<revision-range>] [[--] <path>...]
++
++DESCRIPTION
++-----------
++
++Shows which commit last modified each of the relevant files and subdirectories.
++
++THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
++
++OPTIONS
++-------
++
++-r::
++ Recurse into subtrees.
++
++-t::
++ Show tree entry itself as well as subtrees. Implies `-r`.
++
++<revision-range>::
++ Only traverse commits in the specified revision range. When no
++ `<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
++ history leading to the current commit). For a complete list of ways to
++ spell `<revision-range>`, see the 'Specifying Ranges' section of
++ linkgit:gitrevisions[7].
++
++[--] <path>...::
++ For each _<path>_ given, the commit which last modified it is returned.
++ Without an optional path parameter, all files and subdirectories
++ of the current working directory are included in the
++
++SEE ALSO
++--------
++linkgit:git-blame[1],
++linkgit:git-log[1].
++
++GIT
++---
++Part of the linkgit:git[1] suite
+
+ ## Documentation/meson.build ##
+@@ Documentation/meson.build: manpages = {
+ 'git-init.adoc' : 1,
+ 'git-instaweb.adoc' : 1,
+ 'git-interpret-trailers.adoc' : 1,
++ 'git-last-modified.adoc' : 1,
+ 'git-log.adoc' : 1,
+ 'git-ls-files.adoc' : 1,
+ 'git-ls-remote.adoc' : 1,
## Makefile ##
-@@ Makefile: LIB_OBJS += archive.o
- LIB_OBJS += attr.o
- LIB_OBJS += base85.o
- LIB_OBJS += bisect.o
-+LIB_OBJS += blame-tree.o
- LIB_OBJS += blame.o
- LIB_OBJS += blob.o
- LIB_OBJS += bloom.o
-@@ Makefile: BUILTIN_OBJS += builtin/archive.o
- BUILTIN_OBJS += builtin/backfill.o
- BUILTIN_OBJS += builtin/bisect.o
- BUILTIN_OBJS += builtin/blame.o
-+BUILTIN_OBJS += builtin/blame-tree.o
- BUILTIN_OBJS += builtin/branch.o
- BUILTIN_OBJS += builtin/bugreport.o
- BUILTIN_OBJS += builtin/bundle.o
+@@ Makefile: LIB_OBJS += hook.o
+ LIB_OBJS += ident.o
+ LIB_OBJS += json-writer.o
+ LIB_OBJS += kwset.o
++LIB_OBJS += last-modified.o
+ LIB_OBJS += levenshtein.o
+ LIB_OBJS += line-log.o
+ LIB_OBJS += line-range.o
+@@ Makefile: BUILTIN_OBJS += builtin/hook.o
+ BUILTIN_OBJS += builtin/index-pack.o
+ BUILTIN_OBJS += builtin/init-db.o
+ BUILTIN_OBJS += builtin/interpret-trailers.o
++BUILTIN_OBJS += builtin/last-modified.o
+ BUILTIN_OBJS += builtin/log.o
+ BUILTIN_OBJS += builtin/ls-files.o
+ BUILTIN_OBJS += builtin/ls-remote.o
- ## blame-tree.c (new) ##
+ ## builtin.h ##
+@@ builtin.h: int cmd_hook(int argc, const char **argv, const char *prefix, struct repository
+ int cmd_index_pack(int argc, const char **argv, const char *prefix, struct repository *repo);
+ int cmd_init_db(int argc, const char **argv, const char *prefix, struct repository *repo);
+ int cmd_interpret_trailers(int argc, const char **argv, const char *prefix, struct repository *repo);
++int cmd_last_modified(int argc, const char **argv, const char *prefix, struct repository *repo);
+ int cmd_log_reflog(int argc, const char **argv, const char *prefix, struct repository *repo);
+ int cmd_log(int argc, const char **argv, const char *prefix, struct repository *repo);
+ int cmd_ls_files(int argc, const char **argv, const char *prefix, struct repository *repo);
+
+ ## builtin/last-modified.c (new) ##
@@
+#include "git-compat-util.h"
-+#include "blame-tree.h"
++#include "last-modified.h"
++#include "hex.h"
++#include "quote.h"
++#include "config.h"
++#include "object-name.h"
++#include "parse-options.h"
++#include "builtin.h"
++
++static void show_entry(const char *path, const struct commit *commit, void *d)
++{
++ struct last_modified *lm = d;
++
++ if (commit->object.flags & BOUNDARY)
++ putchar('^');
++ printf("%s\t", oid_to_hex(&commit->object.oid));
++
++ if (lm->rev.diffopt.line_termination)
++ write_name_quoted(path, stdout, '\n');
++ else
++ printf("%s%c", path, '\0');
++
++ fflush(stdout);
++}
++
++int cmd_last_modified(int argc,
++ const char **argv,
++ const char *prefix,
++ struct repository *repo)
++{
++ int ret = 0;
++ struct last_modified lm;
++
++ repo_config(repo, git_default_config, NULL);
++
++ last_modified_init(&lm, repo, prefix, argc, argv);
++ if (last_modified_run(&lm, show_entry, &lm) < 0)
++ die(_("error running last-modified traversal"));
++
++ last_modified_release(&lm);
++
++ return ret;
++}
+
+ ## command-list.txt ##
+@@ command-list.txt: git-index-pack plumbingmanipulators
+ git-init mainporcelain init
+ git-instaweb ancillaryinterrogators complete
+ git-interpret-trailers purehelpers
++git-last-modified plumbinginterrogators
+ git-log mainporcelain info
+ git-ls-files plumbinginterrogators
+ git-ls-remote plumbinginterrogators
+
+ ## git.c ##
+@@ git.c: static struct cmd_struct commands[] = {
+ { "init", cmd_init_db },
+ { "init-db", cmd_init_db },
+ { "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
++ { "last-modified", cmd_last_modified, RUN_SETUP },
+ { "log", cmd_log, RUN_SETUP },
+ { "ls-files", cmd_ls_files, RUN_SETUP },
+ { "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
+
+ ## last-modified.c (new) ##
+@@
++#include "git-compat-util.h"
++#include "last-modified.h"
+#include "commit.h"
+#include "diffcore.h"
+#include "diff.h"
@@ blame-tree.c (new)
+#include "repository.h"
+#include "log-tree.h"
+
-+struct blame_tree_entry {
++struct last_modified_entry {
+ struct hashmap_entry hashent;
+ struct object_id oid;
+ struct commit *commit;
@@ blame-tree.c (new)
+ struct diff_options *opt UNUSED,
+ void *data)
+{
-+ struct blame_tree *bt = data;
++ struct last_modified *lm = data;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
-+ struct blame_tree_entry *ent;
++ struct last_modified_entry *ent;
+ const char *path = p->two->path;
+
+ FLEX_ALLOC_STR(ent, path, path);
+ oidcpy(&ent->oid, &p->two->oid);
+ hashmap_entry_init(&ent->hashent, strhash(ent->path));
-+ hashmap_add(&bt->paths, &ent->hashent);
++ hashmap_add(&lm->paths, &ent->hashent);
+ }
+}
+
-+static int add_from_revs(struct blame_tree *bt)
++static int add_from_revs(struct last_modified *lm)
+{
+ size_t count = 0;
+ struct diff_options diffopt;
+
-+ memcpy(&diffopt, &bt->rev.diffopt, sizeof(diffopt));
-+ copy_pathspec(&diffopt.pathspec, &bt->rev.diffopt.pathspec);
++ memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
++ copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ diffopt.format_callback = add_from_diff;
-+ diffopt.format_callback_data = bt;
++ diffopt.format_callback_data = lm;
+
-+ for (size_t i = 0; i < bt->rev.pending.nr; i++) {
-+ struct object_array_entry *obj = bt->rev.pending.objects + i;
++ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
++ struct object_array_entry *obj = lm->rev.pending.objects + i;
+
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
+ if (count++)
-+ return error(_("can only blame one tree at a time"));
++ return error(_("can only get last-modified one tree at a time"));
+
-+ diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
++ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
@@ blame-tree.c (new)
+ return 0;
+}
+
-+static int blame_tree_entry_hashcmp(const void *unused UNUSED,
-+ const struct hashmap_entry *he1,
-+ const struct hashmap_entry *he2,
++static int last_modified_entry_hashcmp(const void *unused UNUSED,
++ const struct hashmap_entry *hent1,
++ const struct hashmap_entry *hent2,
+ const void *path)
+{
-+ const struct blame_tree_entry *e1 =
-+ container_of(he1, const struct blame_tree_entry, hashent);
-+ const struct blame_tree_entry *e2 =
-+ container_of(he2, const struct blame_tree_entry, hashent);
-+ return strcmp(e1->path, path ? path : e2->path);
++ const struct last_modified_entry *ent1 =
++ container_of(hent1, const struct last_modified_entry, hashent);
++ const struct last_modified_entry *ent2 =
++ container_of(hent2, const struct last_modified_entry, hashent);
++ return strcmp(ent1->path, path ? path : ent2->path);
+}
+
-+void blame_tree_init(struct blame_tree *bt,
++void last_modified_init(struct last_modified *lm,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv)
+{
-+ memset(bt, 0, sizeof(*bt));
-+ hashmap_init(&bt->paths, blame_tree_entry_hashcmp, NULL, 0);
++ memset(lm, 0, sizeof(*lm));
++ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
+
-+ repo_init_revisions(r, &bt->rev, prefix);
-+ bt->rev.def = "HEAD";
-+ bt->rev.combine_merges = 1;
-+ bt->rev.show_root_diff = 1;
-+ bt->rev.boundary = 1;
-+ bt->rev.no_commit_id = 1;
-+ bt->rev.diff = 1;
-+ if (setup_revisions(argc, argv, &bt->rev, NULL) > 1)
-+ die(_("unknown blame-tree argument: %s"), argv[1]);
++ repo_init_revisions(r, &lm->rev, prefix);
++ lm->rev.def = "HEAD";
++ lm->rev.combine_merges = 1;
++ lm->rev.show_root_diff = 1;
++ lm->rev.boundary = 1;
++ lm->rev.no_commit_id = 1;
++ lm->rev.diff = 1;
++ if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
++ die(_("unknown last-modified argument: %s"), argv[1]);
+
-+ if (add_from_revs(bt) < 0)
-+ die(_("unable to setup blame-tree"));
++ if (add_from_revs(lm) < 0)
++ die(_("unable to setup last-modified"));
+}
+
-+void blame_tree_release(struct blame_tree *bt)
++void last_modified_release(struct last_modified *lm)
+{
-+ hashmap_clear_and_free(&bt->paths, struct blame_tree_entry, hashent);
-+ release_revisions(&bt->rev);
++ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
++ release_revisions(&lm->rev);
+}
+
-+struct blame_tree_callback_data {
++struct last_modified_callback_data {
+ struct commit *commit;
+ struct hashmap *paths;
+
-+ blame_tree_callback callback;
++ last_modified_callback callback;
+ void *callback_data;
+};
+
+static void mark_path(const char *path, const struct object_id *oid,
-+ struct blame_tree_callback_data *data)
++ struct last_modified_callback_data *data)
+{
-+ struct blame_tree_entry *ent;
++ struct last_modified_entry *ent;
+
+ /* Is it even a path that we are interested in? */
+ ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
-+ struct blame_tree_entry, hashent);
++ struct last_modified_entry, hashent);
+ if (!ent)
+ return;
+
-+ /* Have we already blamed a commit? */
++ /* Have we already found a commit? */
+ if (ent->commit)
+ return;
+
@@ blame-tree.c (new)
+ free(ent);
+}
+
-+static void blame_diff(struct diff_queue_struct *q,
++static void last_modified_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *cbdata)
+{
-+ struct blame_tree_callback_data *data = cbdata;
++ struct last_modified_callback_data *data = cbdata;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
@@ blame-tree.c (new)
+ * a final path/sha1 state. Note that this covers some
+ * potentially controversial areas, including:
+ *
-+ * 1. A rename or copy will be blamed, as it is the
++ * 1. A rename or copy will be found, as it is the
+ * first time the content has arrived at the given
+ * path.
+ *
+ * 2. Even a non-content modification like a mode or
+ * type change will trigger it.
+ *
-+ * We take the inclusive approach for now, and blame
++ * We take the inclusive approach for now, and find
+ * anything which impacts the path. Options to tweak
+ * the behavior (e.g., to "--follow" the content across
+ * renames) can come later.
@@ blame-tree.c (new)
+ }
+}
+
-+int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
++int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
+{
-+ struct blame_tree_callback_data data;
++ struct last_modified_callback_data data;
+
-+ data.paths = &bt->paths;
++ data.paths = &lm->paths;
+ data.callback = cb;
+ data.callback_data = cbdata;
+
-+ bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
-+ bt->rev.diffopt.format_callback = blame_diff;
-+ bt->rev.diffopt.format_callback_data = &data;
++ lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
++ lm->rev.diffopt.format_callback = last_modified_diff;
++ lm->rev.diffopt.format_callback_data = &data;
+
-+ prepare_revision_walk(&bt->rev);
++ prepare_revision_walk(&lm->rev);
+
-+ while (hashmap_get_size(&bt->paths)) {
-+ data.commit = get_revision(&bt->rev);
++ while (hashmap_get_size(&lm->paths)) {
++ data.commit = get_revision(&lm->rev);
+ if (!data.commit)
+ break;
+
+ if (data.commit->object.flags & BOUNDARY) {
-+ diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
++ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &data.commit->object.oid,
-+ "", &bt->rev.diffopt);
-+ diff_flush(&bt->rev.diffopt);
++ "", &lm->rev.diffopt);
++ diff_flush(&lm->rev.diffopt);
+ } else {
-+ log_tree_commit(&bt->rev, data.commit);
++ log_tree_commit(&lm->rev, data.commit);
+ }
+ }
+
+ return 0;
+}
- ## blame-tree.h (new) ##
+ ## last-modified.h (new) ##
@@
-+#ifndef BLAME_TREE_H
-+#define BLAME_TREE_H
++#ifndef LAST_MODIFIED_H
++#define LAST_MODIFIED_H
+
+#include "commit.h"
+#include "revision.h"
+#include "hashmap.h"
+
-+struct blame_tree {
++struct last_modified {
+ struct hashmap paths;
+ struct rev_info rev;
+};
+
-+void blame_tree_init(struct blame_tree *bt,
++void last_modified_init(struct last_modified *lm,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv);
+
-+void blame_tree_release(struct blame_tree *);
++void last_modified_release(struct last_modified *);
+
-+typedef void (*blame_tree_callback)(const char *path,
++typedef void (*last_modified_callback)(const char *path,
+ const struct commit *commit,
+ void *data);
-+int blame_tree_run(struct blame_tree *,
-+ blame_tree_callback cb,
-+ void *data);
++int last_modified_run(struct last_modified *lm,
++ last_modified_callback cb,
++ void *cbdata);
+
-+#endif /* BLAME_TREE_H */
-
- ## builtin.h ##
-@@ builtin.h: int cmd_archive(int argc, const char **argv, const char *prefix, struct reposito
- int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo);
- int cmd_bisect(int argc, const char **argv, const char *prefix, struct repository *repo);
- int cmd_blame(int argc, const char **argv, const char *prefix, struct repository *repo);
-+int cmd_blame_tree(int argc, const char **argv, const char *prefix, struct repository *repo);
- int cmd_branch(int argc, const char **argv, const char *prefix, struct repository *repo);
- int cmd_bugreport(int argc, const char **argv, const char *prefix, struct repository *repo);
- int cmd_bundle(int argc, const char **argv, const char *prefix, struct repository *repo);
-
- ## builtin/blame-tree.c (new) ##
-@@
-+#include "git-compat-util.h"
-+#include "blame-tree.h"
-+#include "hex.h"
-+#include "quote.h"
-+#include "config.h"
-+#include "object-name.h"
-+#include "parse-options.h"
-+#include "builtin.h"
-+
-+static void show_entry(const char *path, const struct commit *commit, void *d)
-+{
-+ struct blame_tree *bt = d;
-+
-+ if (commit->object.flags & BOUNDARY)
-+ putchar('^');
-+ printf("%s\t", oid_to_hex(&commit->object.oid));
-+
-+ if (bt->rev.diffopt.line_termination)
-+ write_name_quoted(path, stdout, '\n');
-+ else
-+ printf("%s%c", path, '\0');
-+
-+ fflush(stdout);
-+}
-+
-+int cmd_blame_tree(int argc,
-+ const char **argv,
-+ const char *prefix,
-+ struct repository *repo)
-+{
-+ int ret = 0;
-+ struct blame_tree bt;
-+
-+ repo_config(repo, git_default_config, NULL);
-+
-+ blame_tree_init(&bt, repo, prefix, argc, argv);
-+ if (blame_tree_run(&bt, show_entry, &bt) < 0)
-+ die(_("error running blame-tree traversal"));
-+
-+ blame_tree_release(&bt);
-+
-+ return ret;
-+}
-
- ## git.c ##
-@@ git.c: static struct cmd_struct commands[] = {
- { "backfill", cmd_backfill, RUN_SETUP },
- { "bisect", cmd_bisect, RUN_SETUP },
- { "blame", cmd_blame, RUN_SETUP },
-+ { "blame-tree", cmd_blame_tree, RUN_SETUP },
- { "branch", cmd_branch, RUN_SETUP | DELAY_PAGER_CONFIG },
- { "bugreport", cmd_bugreport, RUN_SETUP_GENTLY },
- { "bundle", cmd_bundle, RUN_SETUP_GENTLY },
++#endif /* LAST_MODIFIED_H */
## meson.build ##
@@ meson.build: libgit_sources = [
- 'attr.c',
- 'base85.c',
- 'bisect.c',
-+ 'blame-tree.c',
- 'blame.c',
- 'blob.c',
- 'bloom.c',
+ 'ident.c',
+ 'json-writer.c',
+ 'kwset.c',
++ 'last-modified.c',
+ 'levenshtein.c',
+ 'line-log.c',
+ 'line-range.c',
@@ meson.build: builtin_sources = [
- 'builtin/archive.c',
- 'builtin/backfill.c',
- 'builtin/bisect.c',
-+ 'builtin/blame-tree.c',
- 'builtin/blame.c',
- 'builtin/branch.c',
- 'builtin/bugreport.c',
-
- ## t/helper/test-tool.h ##
-@@
-
- int cmd__advise_if_enabled(int argc, const char **argv);
- int cmd__bitmap(int argc, const char **argv);
-+int cmd__blame_tree(int argc, const char **argv);
- int cmd__bloom(int argc, const char **argv);
- int cmd__bundle_uri(int argc, const char **argv);
- int cmd__cache_tree(int argc, const char **argv);
+ 'builtin/index-pack.c',
+ 'builtin/init-db.c',
+ 'builtin/interpret-trailers.c',
++ 'builtin/last-modified.c',
+ 'builtin/log.c',
+ 'builtin/ls-files.c',
+ 'builtin/ls-remote.c',
## t/meson.build ##
@@ t/meson.build: integration_tests = [
't8012-blame-colors.sh',
't8013-blame-ignore-revs.sh',
't8014-blame-ignore-fuzzy.sh',
-+ 't8020-blame-tree.sh',
++ 't8020-last-modified.sh',
't9001-send-email.sh',
't9002-column.sh',
't9003-help-autocorrect.sh',
- ## t/t8020-blame-tree.sh (new) ##
+ ## t/t8020-last-modified.sh (new) ##
@@
+#!/bin/sh
+
-+test_description='blame-tree tests'
++test_description='last-modified tests'
+
+. ./test-lib.sh
+
@@ t/t8020-blame-tree.sh (new)
+ test_commit 3 a/b/file
+'
+
-+test_expect_success 'cannot blame two trees' '
-+ test_must_fail git blame-tree HEAD HEAD~1
++test_expect_success 'cannot run last-modified on two trees' '
++ test_must_fail git last-modified HEAD HEAD~1
+'
+
-+check_blame() {
++check_last_modified() {
+ local indir= &&
+ while test $# != 0
+ do
@@ t/t8020-blame-tree.sh (new)
+
+ cat >expect &&
+ test_when_finished "rm -f tmp.*" &&
-+ git ${indir:+-C "$indir"} blame-tree "$@" >tmp.1 &&
++ git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
+ tr '\t' ' ' <tmp.2 >tmp.3 &&
@@ t/t8020-blame-tree.sh (new)
+ test_cmp expect actual
+}
+
-+test_expect_success 'blame recursive' '
-+ check_blame --recursive <<-\EOF
++test_expect_success 'last-modified non-recursive' '
++ check_last_modified <<-\EOF
++ 1 file
++ 3 a
++ EOF
++'
++
++test_expect_success 'last-modified recursive' '
++ check_last_modified -r <<-\EOF
+ 1 file
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
-+test_expect_success 'blame non-recursive' '
-+ check_blame --no-recursive <<-\EOF
-+ 1 file
++test_expect_success 'last-modified subdir' '
++ check_last_modified a <<-\EOF
+ 3 a
+ EOF
+'
+
-+test_expect_success 'blame subdir' '
-+ check_blame a <<-\EOF
-+ 3 a
-+ EOF
-+'
-+
-+test_expect_success 'blame subdir recursive' '
-+ check_blame --recursive a <<-\EOF
++test_expect_success 'last-modified subdir recursive' '
++ check_last_modified -r a <<-\EOF
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
-+test_expect_success 'blame from non-HEAD commit' '
-+ check_blame --no-recursive HEAD^ <<-\EOF
++test_expect_success 'last-modified from non-HEAD commit' '
++ check_last_modified HEAD^ <<-\EOF
+ 1 file
+ 2 a
+ EOF
+'
+
-+test_expect_success 'blame from subdir defaults to root' '
-+ check_blame -C a --no-recursive <<-\EOF
++test_expect_success 'last-modified from subdir defaults to root' '
++ check_last_modified -C a <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
-+test_expect_success 'blame from subdir uses relative pathspecs' '
-+ check_blame -C a --recursive b <<-\EOF
++test_expect_success 'last-modified from subdir uses relative pathspecs' '
++ check_last_modified -C a -r b <<-\EOF
+ 3 a/b/file
+ EOF
+'
+
-+test_expect_failure 'limit blame traversal by count' '
-+ check_blame --no-recursive -1 <<-\EOF
++test_expect_success 'limit last-modified traversal by count' '
++ check_last_modified -1 <<-\EOF
+ 3 a
++ ^2 file
+ EOF
+'
+
-+test_expect_success 'limit blame traversal by commit' '
-+ check_blame --no-recursive HEAD~2..HEAD <<-\EOF
++test_expect_success 'limit last-modified traversal by commit' '
++ check_last_modified HEAD~2..HEAD <<-\EOF
+ 3 a
+ ^1 file
+ EOF
+'
+
-+test_expect_success 'only blame files in the current tree' '
++test_expect_success 'only last-modified files in the current tree' '
+ git rm -rf a &&
+ git commit -m "remove a" &&
-+ check_blame <<-\EOF
++ check_last_modified <<-\EOF
+ 1 file
+ EOF
+'
@@ t/t8020-blame-tree.sh (new)
+ git rm -rf . &&
+ test_commit m2 &&
+ git merge m1 &&
-+ check_blame <<-\EOF
++ check_last_modified <<-\EOF
+ m1 m1.t
+ m2 m2.t
+ EOF
+'
+
-+test_expect_success 'blame merge for resolved conflicts' '
++test_expect_success 'last-modified merge for resolved conflicts' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit c1 conflict &&
@@ t/t8020-blame-tree.sh (new)
+ test_commit c2 conflict &&
+ test_must_fail git merge c1 &&
+ test_commit resolved conflict &&
-+ check_blame conflict <<-\EOF
++ check_last_modified conflict <<-\EOF
+ resolved conflict
+ EOF
+'
+
-+test_expect_success 'blame-tree complains about unknown arguments' '
-+ test_must_fail git blame-tree --foo 2>err &&
-+ grep "unknown blame-tree argument: --foo" err
++
++# Consider `file` with this content through history:
++#
++# A---B---B-------B---B
++# \ /
++# C---D
++test_expect_success 'last-modified merge ignores content from branch' '
++ git checkout HEAD^0 &&
++ git rm -rf . &&
++ test_commit a1 file A &&
++ test_commit a2 file B &&
++ test_commit a3 file C &&
++ test_commit a4 file D &&
++ git checkout a2 &&
++ git merge --no-commit --no-ff a4 &&
++ git checkout a2 -- file &&
++ git merge --continue &&
++ check_last_modified <<-\EOF
++ a2 file
++ EOF
++'
++
++# Consider `file` with this content through history:
++#
++# A---B---B---C---D---B---B
++# \ /
++# B-------B
++test_expect_success 'last-modified merge undoes changes' '
++ git checkout HEAD^0 &&
++ git rm -rf . &&
++ test_commit b1 file A &&
++ test_commit b2 file B &&
++ test_commit b3 file C &&
++ test_commit b4 file D &&
++ git checkout b2 &&
++ test_commit b5 file2 2 &&
++ git checkout b4 &&
++ git merge --no-commit --no-ff b5 &&
++ git checkout b2 -- file &&
++ git merge --continue &&
++ check_last_modified <<-\EOF
++ b2 file
++ b5 file2
++ EOF
++'
++
++test_expect_success 'last-modified complains about unknown arguments' '
++ test_must_fail git last-modified --foo 2>err &&
++ grep "unknown last-modified argument: --foo" err
+'
+
+test_done
2: 595a8836fb ! 2: 54383e3f5c t/perf: add blame-tree perf script
@@ Metadata
Author: Jeff King <peff@peff.net>
## Commit message ##
- t/perf: add blame-tree perf script
+ t/perf: add last-modified perf script
- This just runs some simple blame-tree's. We already test correctness in
- the regular suite, so this is just about finding performance regressions
- from one version to another.
+ This just runs some simple last-modified commands. We already test
+ correctness in the regular suite, so this is just about finding
+ performance regressions from one version to another.
Signed-off-by: Toon Claes <toon@iotcl.com>
- ## t/perf/p8020-blame-tree.sh (new) ##
+ ## t/meson.build ##
+@@ t/meson.build: benchmarks = [
+ 'perf/p7820-grep-engines.sh',
+ 'perf/p7821-grep-engines-fixed.sh',
+ 'perf/p7822-grep-perl-character.sh',
++ 'perf/p8020-last-modified.sh',
+ 'perf/p9210-scalar.sh',
+ 'perf/p9300-fast-import-export.sh',
+ ]
+
+ ## t/perf/p8020-last-modified.sh (new) ##
@@
+#!/bin/sh
+
-+test_description='blame-tree perf tests'
++test_description='last-modified perf tests'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
-+test_perf 'top-level blame-tree' '
-+ git blame-tree HEAD
++test_perf 'top-level last-modified' '
++ git last-modified HEAD
+'
+
-+test_perf 'top-level recursive blame-tree' '
-+ git blame-tree -r HEAD
++test_perf 'top-level recursive last-modified' '
++ git last-modified -r HEAD
+'
+
-+test_perf 'subdir blame-tree' '
++test_perf 'subdir last-modified' '
+ path=$(git ls-tree HEAD | grep ^040000 | head -n 1 | cut -f2)
-+ git blame-tree -r HEAD -- "$path"
++ git last-modified -r HEAD -- "$path"
+'
+
+test_done
-
- ## t/t8020-blame-tree.sh ##
-@@ t/t8020-blame-tree.sh: check_blame() {
- }
-
- test_expect_success 'blame recursive' '
-- check_blame --recursive <<-\EOF
-+ check_blame -r <<-\EOF
- 1 file
- 2 a/file
- 3 a/b/file
-@@ t/t8020-blame-tree.sh: test_expect_success 'blame recursive' '
- '
-
- test_expect_success 'blame non-recursive' '
-- check_blame --no-recursive <<-\EOF
-+ check_blame <<-\EOF
- 1 file
- 3 a
- EOF
-@@ t/t8020-blame-tree.sh: test_expect_success 'blame subdir' '
- '
-
- test_expect_success 'blame subdir recursive' '
-- check_blame --recursive a <<-\EOF
-+ check_blame -r a <<-\EOF
- 2 a/file
- 3 a/b/file
- EOF
- '
-
- test_expect_success 'blame from non-HEAD commit' '
-- check_blame --no-recursive HEAD^ <<-\EOF
-+ check_blame HEAD^ <<-\EOF
- 1 file
- 2 a
- EOF
- '
-
- test_expect_success 'blame from subdir defaults to root' '
-- check_blame -C a --no-recursive <<-\EOF
-+ check_blame -C a <<-\EOF
- 1 file
- 3 a
- EOF
- '
-
- test_expect_success 'blame from subdir uses relative pathspecs' '
-- check_blame -C a --recursive b <<-\EOF
-+ check_blame -C a -r b <<-\EOF
- 3 a/b/file
- EOF
- '
-
--test_expect_failure 'limit blame traversal by count' '
-- check_blame --no-recursive -1 <<-\EOF
-+test_expect_success 'limit blame traversal by count' '
-+ check_blame <<-\EOF
- 3 a
-+ ^2 file
- EOF
- '
-
- test_expect_success 'limit blame traversal by commit' '
-- check_blame --no-recursive HEAD~2..HEAD <<-\EOF
-+ check_blame HEAD~2..HEAD <<-\EOF
- 3 a
- ^1 file
- EOF
3: 4d01e68e9b ! 3: f67b406980 blame-tree: use Bloom filters when available
@@
## Metadata ##
-Author: Taylor Blau <me@ttaylorr.com>
+Author: Toon Claes <toon@iotcl.com>
## Commit message ##
- blame-tree: use Bloom filters when available
+ last-modified: use Bloom filters when available
- Our 'git blame-tree' performs a revision walk, and computes a diff at
+ Our 'git last-modified' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.
@@ Commit message
can avoid computing it in this case.
This results in a substantial performance speed-up in common cases of
- 'git blame-tree'. In the kernel, here is the before and after (all times
- computed with best-of-five):
+ 'git last-modified'. In the kernel, here is the before and after (all
+ times computed with best-of-five):
With commit-graphs (but no Bloom filters):
@@ Commit message
Signed-off-by: Toon Claes <toon@iotcl.com>
- ## blame-tree.c ##
+ ## last-modified.c ##
@@
#include "revision.h"
#include "repository.h"
@@ blame-tree.c
+#include "commit-graph.h"
+#include "bloom.h"
- struct blame_tree_entry {
+ struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
struct commit *commit;
@@ blame-tree.c
const char path[FLEX_ARRAY];
};
-@@ blame-tree.c: static void add_from_diff(struct diff_queue_struct *q,
+@@ last-modified.c: static void add_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
-+ if (bt->rev.bloom_filter_settings)
++ if (lm->rev.bloom_filter_settings)
+ fill_bloom_key(path, strlen(path), &ent->key,
-+ bt->rev.bloom_filter_settings);
++ lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
- hashmap_add(&bt->paths, &ent->hashent);
+ hashmap_add(&lm->paths, &ent->hashent);
}
-@@ blame-tree.c: void blame_tree_init(struct blame_tree *bt,
- if (setup_revisions(argc, argv, &bt->rev, NULL) > 1)
- die(_("unknown blame-tree argument: %s"), argv[1]);
+@@ last-modified.c: void last_modified_init(struct last_modified *lm,
+ if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
+ die(_("unknown last-modified argument: %s"), argv[1]);
-+ (void)generation_numbers_enabled(bt->rev.repo);
-+ bt->rev.bloom_filter_settings = get_bloom_filter_settings(bt->rev.repo);
++ (void)generation_numbers_enabled(lm->rev.repo);
++ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
- if (add_from_revs(bt) < 0)
- die(_("unable to setup blame-tree"));
+ if (add_from_revs(lm) < 0)
+ die(_("unable to setup last-modified"));
}
- void blame_tree_release(struct blame_tree *bt)
+ void last_modified_release(struct last_modified *lm)
{
+ struct hashmap_iter iter;
-+ struct blame_tree_entry *ent;
++ struct last_modified_entry *ent;
+
-+ hashmap_for_each_entry(&bt->paths, &iter, ent, hashent) {
++ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ clear_bloom_key(&ent->key);
+ }
- hashmap_clear_and_free(&bt->paths, struct blame_tree_entry, hashent);
- release_revisions(&bt->rev);
+ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
+ release_revisions(&lm->rev);
}
-@@ blame-tree.c: static void mark_path(const char *path, const struct object_id *oid,
+@@ last-modified.c: static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
@@ blame-tree.c: static void mark_path(const char *path, const struct object_id *oi
free(ent);
}
-@@ blame-tree.c: static void blame_diff(struct diff_queue_struct *q,
+@@ last-modified.c: static void last_modified_diff(struct diff_queue_struct *q,
}
}
-+static int maybe_changed_path(struct blame_tree *bt, struct commit *origin)
++static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
-+ struct blame_tree_entry *e;
++ struct last_modified_entry *ent;
+ struct hashmap_iter iter;
+
-+ if (!bt->rev.bloom_filter_settings)
++ if (!lm->rev.bloom_filter_settings)
+ return 1;
+
+ if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
+ return 1;
+
-+ filter = get_bloom_filter(bt->rev.repo, origin);
++ filter = get_bloom_filter(lm->rev.repo, origin);
+ if (!filter)
+ return 1;
+
-+ hashmap_for_each_entry(&bt->paths, &iter, e, hashent) {
-+ if (bloom_filter_contains(filter, &e->key,
-+ bt->rev.bloom_filter_settings))
++ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
++ if (bloom_filter_contains(filter, &ent->key,
++ lm->rev.bloom_filter_settings))
+ return 1;
+ }
+ return 0;
+}
+
- int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
+ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
{
- struct blame_tree_callback_data data;
-@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
+ struct last_modified_callback_data data;
+@@ last-modified.c: int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
if (!data.commit)
break;
-+ if (!maybe_changed_path(bt, data.commit))
++ if (!maybe_changed_path(lm, data.commit))
+ continue;
+
if (data.commit->object.flags & BOUNDARY) {
- diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
&data.commit->object.oid,
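
As an aside on the entry above: the idea behind maybe_changed_path() can be
illustrated with a toy Bloom filter. The sketch below is not git's bloom.h
API. The names toy_bloom, toy_hash, toy_bloom_add,
toy_bloom_maybe_contains and toy_maybe_changed_path are hypothetical, and
the filter size and hash function are arbitrary assumptions. It only shows
the decision rule: if no filter is available, or any interesting path
"might" be in the commit's filter, the diff has to be computed; only when
every path is definitely absent can the diff be skipped.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy sizes, chosen for illustration only. */
#define TOY_BLOOM_BITS 256
#define TOY_BLOOM_HASHES 3

/* Hypothetical per-commit filter over the paths that commit changed. */
struct toy_bloom {
	uint8_t bits[TOY_BLOOM_BITS / 8];
};

/* Seeded FNV-1a; git uses a different hash, this is just a stand-in. */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (uint8_t)*s;
		h *= 16777619u;
	}
	return h;
}

/* Record that a commit changed 'path'. */
static void toy_bloom_add(struct toy_bloom *f, const char *path)
{
	for (uint32_t k = 0; k < TOY_BLOOM_HASHES; k++) {
		uint32_t bit = toy_hash(path, k) % TOY_BLOOM_BITS;
		f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* 0 means "definitely not changed", 1 means "maybe changed". */
static int toy_bloom_maybe_contains(const struct toy_bloom *f,
				    const char *path)
{
	for (uint32_t k = 0; k < TOY_BLOOM_HASHES; k++) {
		uint32_t bit = toy_hash(path, k) % TOY_BLOOM_BITS;
		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return 0;
	}
	return 1;
}

/*
 * Return 1 if the diff for this commit must be computed, 0 if it can be
 * skipped. A NULL filter means no information, so we must diff.
 */
static int toy_maybe_changed_path(const struct toy_bloom *commit_filter,
				  const char **wanted, size_t nr)
{
	if (!commit_filter)
		return 1;

	for (size_t i = 0; i < nr; i++) {
		if (toy_bloom_maybe_contains(commit_filter, wanted[i]))
			return 1;
	}
	return 0;
}
```

A false positive only costs an unnecessary diff; it never causes a commit to
be skipped wrongly, which is why the filter can be consulted before every
diff_tree_oid() call without affecting correctness.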
4: 1a20acce8f ! 4: 3eac929e36 blame-tree: implement faster algorithm
@@
## Metadata ##
-Author: Taylor Blau <me@ttaylorr.com>
+Author: Toon Claes <toon@iotcl.com>
## Commit message ##
- blame-tree: implement faster algorithm
+ last-modified: implement faster algorithm
- The current implementation of 'git blame-tree' works by doing a revision
- walk, and inspecting the diff at each level of that walk to annotate the
- yet-unblamed entries to a path. In other words, if the diff at some
- level touches a path which has not yet been associated with a commit,
- then that commit becomes associated with the path.
+ The current implementation of 'git last-modified' works by doing a
+ revision walk, and inspecting the diff at each level of that walk to
+ annotate the to-be-found entries to a path. In other words, if the diff
+ at some level touches a path which has not yet been associated with a
+ commit, then that commit becomes associated with the path.
While a perfectly reasonable implementation, it can perform poorly in
either one of two scenarios:
@@ Commit message
long time, and so we must walk through a lot of history in order to
find a commit that touches that path.
- This patch rewrites the blame-tree implementation that addresses (2).
+ This patch rewrites the last-modified implementation that addresses (2).
The idea behind the algorithm is to propagate a set of 'active' paths (a
path is 'active' if it does not yet belong to a commit) up to parents
and do a truncated revision walk.
@@ Commit message
Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
- ## blame-tree.c ##
+ ## last-modified.c ##
@@
#include "commit.h"
#include "diffcore.h"
@@ blame-tree.c
+#include "prio-queue.h"
+#include "commit-slab.h"
- struct blame_tree_entry {
+ struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
struct commit *commit;
@@ blame-tree.c
struct bloom_key key;
const char path[FLEX_ARRAY];
};
-@@ blame-tree.c: void blame_tree_init(struct blame_tree *bt,
+@@ last-modified.c: void last_modified_init(struct last_modified *lm,
const char *prefix,
int argc, const char **argv)
{
+ struct hashmap_iter iter;
-+ struct blame_tree_entry *e;
++ struct last_modified_entry *ent;
+
- memset(bt, 0, sizeof(*bt));
- hashmap_init(&bt->paths, blame_tree_entry_hashcmp, NULL, 0);
+ memset(lm, 0, sizeof(*lm));
+ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
-@@ blame-tree.c: void blame_tree_init(struct blame_tree *bt,
+@@ last-modified.c: void last_modified_init(struct last_modified *lm,
- if (add_from_revs(bt) < 0)
- die(_("unable to setup blame-tree"));
+ if (add_from_revs(lm) < 0)
+ die(_("unable to setup last-modified"));
+
-+ bt->all_paths = xcalloc(hashmap_get_size(&bt->paths), sizeof(const char *));
-+ bt->all_paths_nr = 0;
-+ hashmap_for_each_entry(&bt->paths, &iter, e, hashent) {
-+ e->diff_idx = bt->all_paths_nr++;
-+ bt->all_paths[e->diff_idx] = e->path;
++ lm->all_paths = xcalloc(hashmap_get_size(&lm->paths), sizeof(const char *));
++ lm->all_paths_nr = 0;
++ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
++ ent->diff_idx = lm->all_paths_nr++;
++ lm->all_paths[ent->diff_idx] = ent->path;
+ }
}
- void blame_tree_release(struct blame_tree *bt)
-@@ blame-tree.c: void blame_tree_release(struct blame_tree *bt)
+ void last_modified_release(struct last_modified *lm)
+@@ last-modified.c: void last_modified_release(struct last_modified *lm)
}
- hashmap_clear_and_free(&bt->paths, struct blame_tree_entry, hashent);
- release_revisions(&bt->rev);
-+ free(bt->all_paths);
+ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
+ release_revisions(&lm->rev);
++ free(lm->all_paths);
+}
+
+struct commit_active_paths {
@@ blame-tree.c: void blame_tree_release(struct blame_tree *bt)
+ free(active->active);
}
- struct blame_tree_callback_data {
-@@ blame-tree.c: static void mark_path(const char *path, const struct object_id *oid,
- struct blame_tree_callback_data *data)
+ struct last_modified_callback_data {
+@@ last-modified.c: static void mark_path(const char *path, const struct object_id *oid,
+ struct last_modified_callback_data *data)
{
- struct blame_tree_entry *ent;
+ struct last_modified_entry *ent;
+ struct commit_active_paths *active;
/* Is it even a path that we are interested in? */
ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
-@@ blame-tree.c: static void mark_path(const char *path, const struct object_id *oid,
+@@ last-modified.c: static void mark_path(const char *path, const struct object_id *oid,
if (ent->commit)
return;
@@ blame-tree.c: static void mark_path(const char *path, const struct object_id *oi
return;
ent->commit = data->commit;
-@@ blame-tree.c: static void blame_diff(struct diff_queue_struct *q,
+@@ last-modified.c: static void last_modified_diff(struct diff_queue_struct *q,
}
}
--static int maybe_changed_path(struct blame_tree *bt, struct commit *origin)
+-static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+static char *scratch;
+
+static void pass_to_parent(struct commit_active_paths *c,
@@ blame-tree.c: static void blame_diff(struct diff_queue_struct *q,
+#define PARENT1 (1u<<16) /* used instead of SEEN */
+#define PARENT2 (1u<<17) /* used instead of BOTTOM, BOUNDARY */
+
-+static int diff2idx(struct blame_tree *bt, char *path)
++static int diff2idx(struct last_modified *lm, char *path)
+{
-+ struct blame_tree_entry *ent;
-+ ent = hashmap_get_entry_from_hash(&bt->paths, strhash(path), path,
-+ struct blame_tree_entry, hashent);
++ struct last_modified_entry *ent;
++ ent = hashmap_get_entry_from_hash(&lm->paths, strhash(path), path,
++ struct last_modified_entry, hashent);
+ return ent ? ent->diff_idx : -1;
+}
+
-+static int maybe_changed_path(struct blame_tree *bt,
++static int maybe_changed_path(struct last_modified *lm,
+ struct commit *origin,
+ struct commit_active_paths *active)
{
struct bloom_filter *filter;
- struct blame_tree_entry *e;
-@@ blame-tree.c: static int maybe_changed_path(struct blame_tree *bt, struct commit *origin)
+ struct last_modified_entry *ent;
+@@ last-modified.c: static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
return 1;
- hashmap_for_each_entry(&bt->paths, &iter, e, hashent) {
-+ if (active && !active->active[e->diff_idx])
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
++ if (active && !active->active[ent->diff_idx])
+ continue;
- if (bloom_filter_contains(filter, &e->key,
- bt->rev.bloom_filter_settings))
+ if (bloom_filter_contains(filter, &ent->key,
+ lm->rev.bloom_filter_settings))
return 1;
-@@ blame-tree.c: static int maybe_changed_path(struct blame_tree *bt, struct commit *origin)
+@@ last-modified.c: static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
return 0;
}
-+static int process_parent(struct blame_tree *bt,
-+ struct prio_queue *queue,
-+ struct commit *c, struct commit_active_paths *active_c,
-+ struct commit *parent, int parent_i)
++static int process_parent(struct last_modified *lm, struct prio_queue *queue,
++ struct commit *c,
++ struct commit_active_paths *active_c,
++ struct commit *parent, int parent_i)
+{
+ int i, ret = 0; // TODO type & for loop var
+ struct commit_active_paths *active_p;
+
-+ repo_parse_commit(bt->rev.repo, parent);
++ repo_parse_commit(lm->rev.repo, parent);
+
+ active_p = active_paths_at(&active_paths, parent);
+ if (!active_p->active) {
-+ active_p->active = xcalloc(sizeof(char), bt->all_paths_nr);
++ active_p->active = xcalloc(sizeof(char), lm->all_paths_nr);
+ active_p->nr = 0;
+ }
+
@@ blame-tree.c: static int maybe_changed_path(struct blame_tree *bt, struct commit
+ * Before calling 'diff_tree_oid()' on our first parent, see if Bloom
+ * filters will tell us the diff is conclusively uninteresting.
+ */
-+ if (parent_i || maybe_changed_path(bt, c, active_c)) {
++ if (parent_i || maybe_changed_path(lm, c, active_c)) {
+ diff_tree_oid(&parent->object.oid,
-+ &c->object.oid, "", &bt->rev.diffopt);
-+ diffcore_std(&bt->rev.diffopt);
++ &c->object.oid, "", &lm->rev.diffopt);
++ diffcore_std(&lm->rev.diffopt);
+ }
+
+ if (!diff_queued_diff.nr) {
@@ blame-tree.c: static int maybe_changed_path(struct blame_tree *bt, struct commit
+ * No diff entries means we are TREESAME on the base path, and
+ * so all active paths get passed onto this parent.
+ */
-+ for (i = 0; i < bt->all_paths_nr; i++) {
++ for (i = 0; i < lm->all_paths_nr; i++) {
+ if (active_c->active[i])
+ pass_to_parent(active_c, active_p, i);
+ }
@@ blame-tree.c: static int maybe_changed_path(struct blame_tree *bt, struct commit
+ */
+ for (i = 0; i < diff_queued_diff.nr; i++) {
+ struct diff_filepair *fp = diff_queued_diff.queue[i];
-+ int k = diff2idx(bt, fp->two->path);
++ int k = diff2idx(lm, fp->two->path);
+ if (0 <= k && active_c->active[k])
+ scratch[k] = 1;
+ diff_free_filepair(fp);
+ }
+ diff_queued_diff.nr = 0;
-+ for (i = 0; i < bt->all_paths_nr; i++) {
++ for (i = 0; i < lm->all_paths_nr; i++) {
+ if (active_c->active[i] && !scratch[i])
+ pass_to_parent(active_c, active_p, i);
+ }
@@ blame-tree.c: static int maybe_changed_path(struct blame_tree *bt, struct commit
+
+cleanup:
+ diff_queue_clear(&diff_queued_diff);
-+ memset(scratch, 0, bt->all_paths_nr);
++ memset(scratch, 0, lm->all_paths_nr);
+
+ return ret;
+}
+
- int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
+ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
{
+ int max_count, queue_popped = 0;
+ struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+ struct prio_queue not_queue = { compare_commits_by_gen_then_commit_date };
- struct blame_tree_callback_data data;
+ struct last_modified_callback_data data;
- data.paths = &bt->paths;
-@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
- bt->rev.diffopt.format_callback = blame_diff;
- bt->rev.diffopt.format_callback_data = &data;
+ data.paths = &lm->paths;
+@@ last-modified.c: int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
+ lm->rev.diffopt.format_callback = last_modified_diff;
+ lm->rev.diffopt.format_callback_data = &data;
-- prepare_revision_walk(&bt->rev);
-+ max_count = bt->rev.max_count;
+- prepare_revision_walk(&lm->rev);
++ max_count = lm->rev.max_count;
-- while (hashmap_get_size(&bt->paths)) {
-- data.commit = get_revision(&bt->rev);
+- while (hashmap_get_size(&lm->paths)) {
+- data.commit = get_revision(&lm->rev);
- if (!data.commit)
- break;
+ init_active_paths(&active_paths);
-+ scratch = xcalloc(bt->all_paths_nr, sizeof(char));
++ scratch = xcalloc(lm->all_paths_nr, sizeof(char));
-- if (!maybe_changed_path(bt, data.commit))
+- if (!maybe_changed_path(lm, data.commit))
- continue;
+ /*
-+ * bt->rev.pending holds the set of boundary commits for our walk.
++ * lm->rev.pending holds the set of boundary commits for our walk.
+ *
+ * Loop through each such commit, and place it in the appropriate queue.
+ */
-+ for (size_t i = 0; i < bt->rev.pending.nr; i++) {
-+ struct commit *c = lookup_commit(bt->rev.repo,
-+ &bt->rev.pending.objects[i].item->oid);
-+ repo_parse_commit(bt->rev.repo, c);
++ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
++ struct commit *c = lookup_commit(lm->rev.repo,
++ &lm->rev.pending.objects[i].item->oid);
++ repo_parse_commit(lm->rev.repo, c);
- if (data.commit->object.flags & BOUNDARY) {
-- diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
+- diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
- &data.commit->object.oid,
-- "", &bt->rev.diffopt);
-- diff_flush(&bt->rev.diffopt);
+- "", &lm->rev.diffopt);
+- diff_flush(&lm->rev.diffopt);
- } else {
-- log_tree_commit(&bt->rev, data.commit);
+- log_tree_commit(&lm->rev, data.commit);
+ if (c->object.flags & BOTTOM) {
+ prio_queue_put(&not_queue, c);
+ c->object.flags |= PARENT2;
@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb,
+ c->object.flags |= PARENT1;
+
+ active = active_paths_at(&active_paths, c);
-+ active->active = xcalloc(sizeof(char), bt->all_paths_nr);
-+ memset(active->active, 1, bt->all_paths_nr);
-+ active->nr = bt->all_paths_nr;
++ active->active = xcalloc(sizeof(char), lm->all_paths_nr);
++ memset(active->active, 1, lm->all_paths_nr);
++ active->nr = lm->all_paths_nr;
}
}
@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb,
+ * Now that we have processed the pending commits, allow the revision
+ * machinery to flush them by calling prepare_revision_walk().
+ */
-+ prepare_revision_walk(&bt->rev);
++ prepare_revision_walk(&lm->rev);
+
+ while (queue.nr) {
+ int parent_i;
@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb,
+ */
+ c->object.flags |= PARENT2 | BOUNDARY;
+ data.commit = c;
-+ diff_tree_oid(bt->rev.repo->hash_algo->empty_tree,
++ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &c->object.oid,
-+ "", &bt->rev.diffopt);
-+ diff_flush(&bt->rev.diffopt);
++ "", &lm->rev.diffopt);
++ diff_flush(&lm->rev.diffopt);
+ goto cleanup;
+ }
+
@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb,
+ * Otherwise, keep going, but make sure that 'c' isn't reachable
+ * from anything in the '--not' queue.
+ */
-+ repo_parse_commit(bt->rev.repo, c);
++ repo_parse_commit(lm->rev.repo, c);
+
+ while (not_queue.nr) {
+ struct commit_list *np;
+ struct commit *n = prio_queue_get(&not_queue);
+
-+ repo_parse_commit(bt->rev.repo, n);
++ repo_parse_commit(lm->rev.repo, n);
+
+ for (np = n->parents; np; np = np->next) {
+ if (!(np->item->object.flags & PARENT2)) {
@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb,
+ * parents in order if TREESAME.
+ */
+ for (p = c->parents, parent_i = 0; p; p = p->next, parent_i++) {
-+ if (process_parent(bt, &queue,
++ if (process_parent(lm, &queue,
+ c, active_c,
+ p->item, parent_i) > 0 )
+ break;
@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb,
+ if (active_c->nr) {
+ /* Any paths that remain active were changed by 'c'. */
+ data.commit = c;
-+ for (int i = 0; i < bt->all_paths_nr; i++) {
++ for (int i = 0; i < lm->all_paths_nr; i++) {
+ if (active_c->active[i])
-+ mark_path(bt->all_paths[i], NULL, &data);
++ mark_path(lm->all_paths[i], NULL, &data);
+ }
+ }
+
@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb,
return 0;
}
- ## blame-tree.h ##
+ ## last-modified.h ##
@@
- struct blame_tree {
+ struct last_modified {
struct hashmap paths;
struct rev_info rev;
+
@@ blame-tree.h
+ int all_paths_nr;
};
- void blame_tree_init(struct blame_tree *bt,
+ void last_modified_init(struct last_modified *lm,
5: 10b306953f ! 5: 6808799f8b blame-tree.c: initialize revision machinery without walk
@@ Metadata
Author: Toon Claes <toon@iotcl.com>
## Commit message ##
- blame-tree.c: initialize revision machinery without walk
+ last-modified: initialize revision machinery without walk
In a previous commit we inserted a call to 'prepare_revision_walk()'
before we started our traversal. This was done when we leveraged the
@@ Commit message
Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
- ## blame-tree.c ##
-@@ blame-tree.c: static int maybe_changed_path(struct blame_tree *bt,
+ ## last-modified.c ##
+@@ last-modified.c: static int maybe_changed_path(struct last_modified *lm,
if (!filter)
return 1;
-+ for (int i = 0; i < bt->rev.bloom_keys_nr; i++) {
++ for (int i = 0; i < lm->rev.bloom_keys_nr; i++) {
+ if (!(bloom_filter_contains(filter,
-+ &bt->rev.bloom_keys[i],
-+ bt->rev.bloom_filter_settings)))
++ &lm->rev.bloom_keys[i],
++ lm->rev.bloom_filter_settings)))
+ return 0;
+ }
+
- hashmap_for_each_entry(&bt->paths, &iter, e, hashent) {
- if (active && !active->active[e->diff_idx])
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ if (active && !active->active[ent->diff_idx])
continue;
-@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
+@@ last-modified.c: int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct prio_queue not_queue = { compare_commits_by_gen_then_commit_date };
- struct blame_tree_callback_data data;
+ struct last_modified_callback_data data;
+ struct commit_list *list;
- data.paths = &bt->paths;
+ data.paths = &lm->paths;
data.callback = cb;
-@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
- bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
- bt->rev.diffopt.format_callback = blame_diff;
- bt->rev.diffopt.format_callback_data = &data;
-+ bt->rev.no_walk = 1;
+@@ last-modified.c: int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
+ lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ lm->rev.diffopt.format_callback = last_modified_diff;
+ lm->rev.diffopt.format_callback_data = &data;
++ lm->rev.no_walk = 1;
+
-+ prepare_revision_walk(&bt->rev);
++ prepare_revision_walk(&lm->rev);
- max_count = bt->rev.max_count;
+ max_count = lm->rev.max_count;
-@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
- scratch = xcalloc(bt->all_paths_nr, sizeof(char));
+@@ last-modified.c: int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
+ scratch = xcalloc(lm->all_paths_nr, sizeof(char));
/*
-- * bt->rev.pending holds the set of boundary commits for our walk.
-+ * bt->rev.commits holds the set of boundary commits for our walk.
+- * lm->rev.pending holds the set of boundary commits for our walk.
++ * lm->rev.commits holds the set of boundary commits for our walk.
*
* Loop through each such commit, and place it in the appropriate queue.
*/
-- for (size_t i = 0; i < bt->rev.pending.nr; i++) {
-- struct commit *c = lookup_commit(bt->rev.repo,
-- &bt->rev.pending.objects[i].item->oid);
-- repo_parse_commit(bt->rev.repo, c);
-+ for (list = bt->rev.commits; list; list = list->next) {
+- for (size_t i = 0; i < lm->rev.pending.nr; i++) {
+- struct commit *c = lookup_commit(lm->rev.repo,
+- &lm->rev.pending.objects[i].item->oid);
+- repo_parse_commit(lm->rev.repo, c);
++ for (list = lm->rev.commits; list; list = list->next) {
+ struct commit *c = list->item;
if (c->object.flags & BOTTOM) {
prio_queue_put(&not_queue, c);
-@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
+@@ last-modified.c: int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
}
}
@@ blame-tree.c: int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb,
- * Now that we have processed the pending commits, allow the revision
- * machinery to flush them by calling prepare_revision_walk().
- */
-- prepare_revision_walk(&bt->rev);
+- prepare_revision_walk(&lm->rev);
-
while (queue.nr) {
int parent_i;
---
base-commit: 8613c2bb6cd16ef530dc5dd74d3b818a1ccbf1c0
change-id: 20250410-toon-new-blame-tree-bcdbb78c1c0f
Thanks
--
Toon
* [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
@ 2025-05-23 9:33 ` Toon Claes
2025-05-25 20:07 ` Justin Tobler
2025-05-27 10:39 ` Patrick Steinhardt
2025-05-23 9:33 ` [PATCH RFC v2 2/5] t/perf: add last-modified perf script Toon Claes
` (4 subsequent siblings)
5 siblings, 2 replies; 135+ messages in thread
From: Toon Claes @ 2025-05-23 9:33 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason, Derrick Stolee
Similar to git-blame(1), introduce a new subcommand
git-last-modified(1). This command shows the most recent modification to
paths in a tree. It does so by expanding the tree at a given commit,
taking note of the current state of each path, and then walking
backwards through history looking for commits where each path changed
into its final object ID.
Based-on-patch-by: Jeff King <peff@peff.net>
Improved-by: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
.gitignore | 1 +
Documentation/git-last-modified.adoc | 49 ++++++++
Documentation/meson.build | 1 +
Makefile | 2 +
builtin.h | 1 +
builtin/last-modified.c | 43 +++++++
command-list.txt | 1 +
git.c | 1 +
last-modified.c | 213 +++++++++++++++++++++++++++++++++++
last-modified.h | 27 +++++
meson.build | 2 +
t/meson.build | 1 +
t/t8020-last-modified.sh | 194 +++++++++++++++++++++++++++++++
13 files changed, 536 insertions(+)
diff --git a/.gitignore b/.gitignore
index 04c444404e..a36ee94443 100644
--- a/.gitignore
+++ b/.gitignore
@@ -87,6 +87,7 @@
/git-init-db
/git-interpret-trailers
/git-instaweb
+/git-last-modified
/git-log
/git-ls-files
/git-ls-remote
diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
new file mode 100644
index 0000000000..1af38f402e
--- /dev/null
+++ b/Documentation/git-last-modified.adoc
@@ -0,0 +1,49 @@
+git-last-modified(1)
+====================
+
+NAME
+----
+git-last-modified - EXPERIMENTAL: Show when files were last modified
+
+
+SYNOPSIS
+--------
+[synopsis]
+git last-modified [-r] [<revision-range>] [[--] <path>...]
+
+DESCRIPTION
+-----------
+
+Shows which commit last modified each of the relevant files and subdirectories.
+
+THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
+
+OPTIONS
+-------
+
+-r::
+ Recurse into subtrees.
+
+-t::
+ Show tree entry itself as well as subtrees. Implies `-r`.
+
+<revision-range>::
+ Only traverse commits in the specified revision range. When no
+ `<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
+ history leading to the current commit). For a complete list of ways to
+ spell `<revision-range>`, see the 'Specifying Ranges' section of
+ linkgit:gitrevisions[7].
+
+[--] <path>...::
+ For each _<path>_ given, the commit which last modified it is returned.
+ Without an optional path parameter, all files and subdirectories
+ of the current working directory are included in the output.
+
+SEE ALSO
+--------
+linkgit:git-blame[1],
+linkgit:git-log[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index 1433acfd31..fa93cec5c3 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -74,6 +74,7 @@ manpages = {
'git-init.adoc' : 1,
'git-instaweb.adoc' : 1,
'git-interpret-trailers.adoc' : 1,
+ 'git-last-modified.adoc' : 1,
'git-log.adoc' : 1,
'git-ls-files.adoc' : 1,
'git-ls-remote.adoc' : 1,
diff --git a/Makefile b/Makefile
index ecd590a643..40bc24c704 100644
--- a/Makefile
+++ b/Makefile
@@ -1051,6 +1051,7 @@ LIB_OBJS += hook.o
LIB_OBJS += ident.o
LIB_OBJS += json-writer.o
LIB_OBJS += kwset.o
+LIB_OBJS += last-modified.o
LIB_OBJS += levenshtein.o
LIB_OBJS += line-log.o
LIB_OBJS += line-range.o
@@ -1266,6 +1267,7 @@ BUILTIN_OBJS += builtin/hook.o
BUILTIN_OBJS += builtin/index-pack.o
BUILTIN_OBJS += builtin/init-db.o
BUILTIN_OBJS += builtin/interpret-trailers.o
+BUILTIN_OBJS += builtin/last-modified.o
BUILTIN_OBJS += builtin/log.o
BUILTIN_OBJS += builtin/ls-files.o
BUILTIN_OBJS += builtin/ls-remote.o
diff --git a/builtin.h b/builtin.h
index bff13e3069..6ed6759ec4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -176,6 +176,7 @@ int cmd_hook(int argc, const char **argv, const char *prefix, struct repository
int cmd_index_pack(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_init_db(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_interpret_trailers(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_last_modified(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log_reflog(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_ls_files(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
new file mode 100644
index 0000000000..0d4733f666
--- /dev/null
+++ b/builtin/last-modified.c
@@ -0,0 +1,43 @@
+#include "git-compat-util.h"
+#include "last-modified.h"
+#include "hex.h"
+#include "quote.h"
+#include "config.h"
+#include "object-name.h"
+#include "parse-options.h"
+#include "builtin.h"
+
+static void show_entry(const char *path, const struct commit *commit, void *d)
+{
+ struct last_modified *lm = d;
+
+ if (commit->object.flags & BOUNDARY)
+ putchar('^');
+ printf("%s\t", oid_to_hex(&commit->object.oid));
+
+ if (lm->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+
+ fflush(stdout);
+}
+
+int cmd_last_modified(int argc,
+ const char **argv,
+ const char *prefix,
+ struct repository *repo)
+{
+ int ret = 0;
+ struct last_modified lm;
+
+ repo_config(repo, git_default_config, NULL);
+
+ last_modified_init(&lm, repo, prefix, argc, argv);
+ if (last_modified_run(&lm, show_entry, &lm) < 0)
+ die(_("error running last-modified traversal"));
+
+ last_modified_release(&lm);
+
+ return ret;
+}
diff --git a/command-list.txt b/command-list.txt
index b7ade3ab9f..b715777b24 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -124,6 +124,7 @@ git-index-pack plumbingmanipulators
git-init mainporcelain init
git-instaweb ancillaryinterrogators complete
git-interpret-trailers purehelpers
+git-last-modified plumbinginterrogators
git-log mainporcelain info
git-ls-files plumbinginterrogators
git-ls-remote plumbinginterrogators
diff --git a/git.c b/git.c
index 77c4359522..65afc0d0e7 100644
--- a/git.c
+++ b/git.c
@@ -565,6 +565,7 @@ static struct cmd_struct commands[] = {
{ "init", cmd_init_db },
{ "init-db", cmd_init_db },
{ "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
+ { "last-modified", cmd_last_modified, RUN_SETUP },
{ "log", cmd_log, RUN_SETUP },
{ "ls-files", cmd_ls_files, RUN_SETUP },
{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
diff --git a/last-modified.c b/last-modified.c
new file mode 100644
index 0000000000..9283f8fcae
--- /dev/null
+++ b/last-modified.c
@@ -0,0 +1,213 @@
+#include "git-compat-util.h"
+#include "last-modified.h"
+#include "commit.h"
+#include "diffcore.h"
+#include "diff.h"
+#include "object.h"
+#include "revision.h"
+#include "repository.h"
+#include "log-tree.h"
+
+struct last_modified_entry {
+ struct hashmap_entry hashent;
+ struct object_id oid;
+ struct commit *commit;
+ const char path[FLEX_ARRAY];
+};
+
+static void add_from_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED,
+ void *data)
+{
+ struct last_modified *lm = data;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct last_modified_entry *ent;
+ const char *path = p->two->path;
+
+ FLEX_ALLOC_STR(ent, path, path);
+ oidcpy(&ent->oid, &p->two->oid);
+ hashmap_entry_init(&ent->hashent, strhash(ent->path));
+ hashmap_add(&lm->paths, &ent->hashent);
+ }
+}
+
+static int add_from_revs(struct last_modified *lm)
+{
+ size_t count = 0;
+ struct diff_options diffopt;
+
+ memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
+ copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ diffopt.format_callback = add_from_diff;
+ diffopt.format_callback_data = lm;
+
+ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
+ struct object_array_entry *obj = lm->rev.pending.objects + i;
+
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
+ if (count++)
+ return error(_("can only get last-modified one tree at a time"));
+
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
+ clear_pathspec(&diffopt.pathspec);
+
+ return 0;
+}
+
+static int last_modified_entry_hashcmp(const void *unused UNUSED,
+ const struct hashmap_entry *hent1,
+ const struct hashmap_entry *hent2,
+ const void *path)
+{
+ const struct last_modified_entry *ent1 =
+ container_of(hent1, const struct last_modified_entry, hashent);
+ const struct last_modified_entry *ent2 =
+ container_of(hent2, const struct last_modified_entry, hashent);
+ return strcmp(ent1->path, path ? path : ent2->path);
+}
+
+void last_modified_init(struct last_modified *lm,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv)
+{
+ memset(lm, 0, sizeof(*lm));
+ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
+
+ repo_init_revisions(r, &lm->rev, prefix);
+ lm->rev.def = "HEAD";
+ lm->rev.combine_merges = 1;
+ lm->rev.show_root_diff = 1;
+ lm->rev.boundary = 1;
+ lm->rev.no_commit_id = 1;
+ lm->rev.diff = 1;
+ if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
+ die(_("unknown last-modified argument: %s"), argv[1]);
+
+ if (add_from_revs(lm) < 0)
+ die(_("unable to setup last-modified"));
+}
+
+void last_modified_release(struct last_modified *lm)
+{
+ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
+ release_revisions(&lm->rev);
+}
+
+struct last_modified_callback_data {
+ struct commit *commit;
+ struct hashmap *paths;
+
+ last_modified_callback callback;
+ void *callback_data;
+};
+
+static void mark_path(const char *path, const struct object_id *oid,
+ struct last_modified_callback_data *data)
+{
+ struct last_modified_entry *ent;
+
+ /* Is it even a path that we are interested in? */
+ ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
+ struct last_modified_entry, hashent);
+ if (!ent)
+ return;
+
+ /* Have we already found a commit? */
+ if (ent->commit)
+ return;
+
+ /*
+ * Is it arriving at a version of interest, or is it from a side branch
+ * which did not contribute to the final state?
+ */
+ if (!oideq(oid, &ent->oid))
+ return;
+
+ ent->commit = data->commit;
+ if (data->callback)
+ data->callback(path, data->commit, data->callback_data);
+
+ hashmap_remove(data->paths, &ent->hashent, path);
+ free(ent);
+}
+
+static void last_modified_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *cbdata)
+{
+ struct last_modified_callback_data *data = cbdata;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ switch (p->status) {
+ case DIFF_STATUS_DELETED:
+ /*
+ * There's no point in feeding a deletion, as it could
+ * not have resulted in our current state, which
+ * actually has the file.
+ */
+ break;
+
+ default:
+ /*
+ * Otherwise, we care only that we somehow arrived at
+ * a final path/sha1 state. Note that this covers some
+ * potentially controversial areas, including:
+ *
+ * 1. A rename or copy will be found, as it is the
+ * first time the content has arrived at the given
+ * path.
+ *
+ * 2. Even a non-content modification like a mode or
+ * type change will trigger it.
+ *
+ * We take the inclusive approach for now, and find
+ * anything which impacts the path. Options to tweak
+ * the behavior (e.g., to "--follow" the content across
+ * renames) can come later.
+ */
+ mark_path(p->two->path, &p->two->oid, data);
+ break;
+ }
+ }
+}
+
+int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
+{
+ struct last_modified_callback_data data;
+
+ data.paths = &lm->paths;
+ data.callback = cb;
+ data.callback_data = cbdata;
+
+ lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ lm->rev.diffopt.format_callback = last_modified_diff;
+ lm->rev.diffopt.format_callback_data = &data;
+
+ prepare_revision_walk(&lm->rev);
+
+ while (hashmap_get_size(&lm->paths)) {
+ data.commit = get_revision(&lm->rev);
+ if (!data.commit)
+ break;
+
+ if (data.commit->object.flags & BOUNDARY) {
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &data.commit->object.oid,
+ "", &lm->rev.diffopt);
+ diff_flush(&lm->rev.diffopt);
+ } else {
+ log_tree_commit(&lm->rev, data.commit);
+ }
+ }
+
+ return 0;
+}
diff --git a/last-modified.h b/last-modified.h
new file mode 100644
index 0000000000..42a819d979
--- /dev/null
+++ b/last-modified.h
@@ -0,0 +1,27 @@
+#ifndef LAST_MODIFIED_H
+#define LAST_MODIFIED_H
+
+#include "commit.h"
+#include "revision.h"
+#include "hashmap.h"
+
+struct last_modified {
+ struct hashmap paths;
+ struct rev_info rev;
+};
+
+void last_modified_init(struct last_modified *lm,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv);
+
+void last_modified_release(struct last_modified *);
+
+typedef void (*last_modified_callback)(const char *path,
+ const struct commit *commit,
+ void *data);
+int last_modified_run(struct last_modified *lm,
+ last_modified_callback cb,
+ void *cbdata);
+
+#endif /* LAST_MODIFIED_H */
diff --git a/meson.build b/meson.build
index a1476e5b32..bdd9ed2c4c 100644
--- a/meson.build
+++ b/meson.build
@@ -365,6 +365,7 @@ libgit_sources = [
'ident.c',
'json-writer.c',
'kwset.c',
+ 'last-modified.c',
'levenshtein.c',
'line-log.c',
'line-range.c',
@@ -609,6 +610,7 @@ builtin_sources = [
'builtin/index-pack.c',
'builtin/init-db.c',
'builtin/interpret-trailers.c',
+ 'builtin/last-modified.c',
'builtin/log.c',
'builtin/ls-files.c',
'builtin/ls-remote.c',
diff --git a/t/meson.build b/t/meson.build
index fcfc1c2c2b..be5a711375 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -962,6 +962,7 @@ integration_tests = [
't8012-blame-colors.sh',
't8013-blame-ignore-revs.sh',
't8014-blame-ignore-fuzzy.sh',
+ 't8020-last-modified.sh',
't9001-send-email.sh',
't9002-column.sh',
't9003-help-autocorrect.sh',
diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
new file mode 100755
index 0000000000..0c4a19c029
--- /dev/null
+++ b/t/t8020-last-modified.sh
@@ -0,0 +1,194 @@
+#!/bin/sh
+
+test_description='last-modified tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_commit 1 file &&
+ mkdir a &&
+ test_commit 2 a/file &&
+ mkdir a/b &&
+ test_commit 3 a/b/file
+'
+
+test_expect_success 'cannot run last-modified on two trees' '
+ test_must_fail git last-modified HEAD HEAD~1
+'
+
+check_last_modified() {
+ local indir= &&
+ while test $# != 0
+ do
+ case "$1" in
+ -C)
+ indir="$2"
+ shift
+ ;;
+ *)
+ break
+ ;;
+ esac &&
+ shift
+ done &&
+
+ cat >expect &&
+ test_when_finished "rm -f tmp.*" &&
+ git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
+ tr '\t' ' ' <tmp.2 >tmp.3 &&
+ sort tmp.3 >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'last-modified non-recursive' '
+ check_last_modified <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified recursive' '
+ check_last_modified -r <<-\EOF
+ 1 file
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'last-modified subdir' '
+ check_last_modified a <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified subdir recursive' '
+ check_last_modified -r a <<-\EOF
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'last-modified from non-HEAD commit' '
+ check_last_modified HEAD^ <<-\EOF
+ 1 file
+ 2 a
+ EOF
+'
+
+test_expect_success 'last-modified from subdir defaults to root' '
+ check_last_modified -C a <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified from subdir uses relative pathspecs' '
+ check_last_modified -C a -r b <<-\EOF
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by count' '
+ check_last_modified -1 <<-\EOF
+ 3 a
+ ^2 file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by commit' '
+ check_last_modified HEAD~2..HEAD <<-\EOF
+ 3 a
+ ^1 file
+ EOF
+'
+
+test_expect_success 'only last-modified files in the current tree' '
+ git rm -rf a &&
+ git commit -m "remove a" &&
+ check_last_modified <<-\EOF
+ 1 file
+ EOF
+'
+
+test_expect_success 'cross merge boundaries in blaming' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit m1 &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit m2 &&
+ git merge m1 &&
+ check_last_modified <<-\EOF
+ m1 m1.t
+ m2 m2.t
+ EOF
+'
+
+test_expect_success 'last-modified merge for resolved conflicts' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit c1 conflict &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit c2 conflict &&
+ test_must_fail git merge c1 &&
+ test_commit resolved conflict &&
+ check_last_modified conflict <<-\EOF
+ resolved conflict
+ EOF
+'
+
+
+# Consider `file` with this content through history:
+#
+# A---B---B-------B---B
+# \ /
+# C---D
+test_expect_success 'last-modified merge ignores content from branch' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit a1 file A &&
+ test_commit a2 file B &&
+ test_commit a3 file C &&
+ test_commit a4 file D &&
+ git checkout a2 &&
+ git merge --no-commit --no-ff a4 &&
+ git checkout a2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ a2 file
+ EOF
+'
+
+# Consider `file` with this content through history:
+#
+# A---B---B---C---D---B---B
+# \ /
+# B-------B
+test_expect_success 'last-modified merge undoes changes' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit b1 file A &&
+ test_commit b2 file B &&
+ test_commit b3 file C &&
+ test_commit b4 file D &&
+ git checkout b2 &&
+ test_commit b5 file2 2 &&
+ git checkout b4 &&
+ git merge --no-commit --no-ff b5 &&
+ git checkout b2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ b2 file
+ b5 file2
+ EOF
+'
+
+test_expect_success 'last-modified complains about unknown arguments' '
+ test_must_fail git last-modified --foo 2>err &&
+ grep "unknown last-modified argument: --foo" err
+'
+
+test_done
--
2.49.0
* [PATCH RFC v2 2/5] t/perf: add last-modified perf script
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-05-23 9:33 ` Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 3/5] last-modified: use Bloom filters when available Toon Claes
` (3 subsequent siblings)
5 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-05-23 9:33 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason, Derrick Stolee
From: Jeff King <peff@peff.net>
This just runs some simple last-modified commands. We already test
correctness in the regular suite, so this is just about finding
performance regressions from one version to another.
Signed-off-by: Toon Claes <toon@iotcl.com>
---
t/meson.build | 1 +
t/perf/p8020-last-modified.sh | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)
diff --git a/t/meson.build b/t/meson.build
index be5a711375..4ac28c04fe 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -1155,6 +1155,7 @@ benchmarks = [
'perf/p7820-grep-engines.sh',
'perf/p7821-grep-engines-fixed.sh',
'perf/p7822-grep-perl-character.sh',
+ 'perf/p8020-last-modified.sh',
'perf/p9210-scalar.sh',
'perf/p9300-fast-import-export.sh',
]
diff --git a/t/perf/p8020-last-modified.sh b/t/perf/p8020-last-modified.sh
new file mode 100755
index 0000000000..a02ec907d4
--- /dev/null
+++ b/t/perf/p8020-last-modified.sh
@@ -0,0 +1,21 @@
+#!/bin/sh
+
+test_description='last-modified perf tests'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_perf 'top-level last-modified' '
+ git last-modified HEAD
+'
+
+test_perf 'top-level recursive last-modified' '
+ git last-modified -r HEAD
+'
+
+test_perf 'subdir last-modified' '
+ path=$(git ls-tree HEAD | grep ^040000 | head -n 1 | cut -f2)
+ git last-modified -r HEAD -- "$path"
+'
+
+test_done
--
2.49.0
* [PATCH RFC v2 3/5] last-modified: use Bloom filters when available
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 2/5] t/perf: add last-modified perf script Toon Claes
@ 2025-05-23 9:33 ` Toon Claes
2025-05-27 10:40 ` Patrick Steinhardt
2025-05-23 9:33 ` [PATCH RFC v2 4/5] last-modified: implement faster algorithm Toon Claes
` (2 subsequent siblings)
5 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-05-23 9:33 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason, Derrick Stolee
Our 'git last-modified' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.
When changed-path Bloom filters are available, we can avoid computing
many such diffs. Before computing a diff, we first check whether any of
the remaining paths of interest were possibly changed at a given commit
by consulting its Bloom filter. If any query returns "maybe", we have to
compute the diff.
If none of those queries returned "maybe", we know that the given commit
doesn't contain any changed paths which are interesting to us. So, we
can avoid computing the diff in this case.
This results in a substantial speed-up in common cases of
'git last-modified'. In the kernel, here is the before and after (all
times computed with best-of-five):
With commit-graphs (but no Bloom filters):
real 0m5.133s
user 0m4.942s
sys 0m0.180s
...and with Bloom filters:
real 0m0.936s
user 0m0.842s
sys 0m0.092s
These times are with my development version of Git, so it's compiled
without optimizations. Compiling instead with `-O3`, the results look
even better:
real 0m0.754s
user 0m0.661s
sys 0m0.092s
Signed-off-by: Toon Claes <toon@iotcl.com>
---
last-modified.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/last-modified.c b/last-modified.c
index 9283f8fcae..f628434929 100644
--- a/last-modified.c
+++ b/last-modified.c
@@ -7,11 +7,15 @@
#include "revision.h"
#include "repository.h"
#include "log-tree.h"
+#include "dir.h"
+#include "commit-graph.h"
+#include "bloom.h"
struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
struct commit *commit;
+ struct bloom_key key;
const char path[FLEX_ARRAY];
};
@@ -28,6 +32,9 @@ static void add_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
+ if (lm->rev.bloom_filter_settings)
+ fill_bloom_key(path, strlen(path), &ent->key,
+ lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
}
@@ -92,12 +99,21 @@ void last_modified_init(struct last_modified *lm,
if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
die(_("unknown last-modified argument: %s"), argv[1]);
+ (void)generation_numbers_enabled(lm->rev.repo);
+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
if (add_from_revs(lm) < 0)
die(_("unable to setup last-modified"));
}
void last_modified_release(struct last_modified *lm)
{
+ struct hashmap_iter iter;
+ struct last_modified_entry *ent;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ clear_bloom_key(&ent->key);
+ }
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
}
@@ -137,6 +153,7 @@ static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
+ clear_bloom_key(&ent->key);
free(ent);
}
@@ -180,6 +197,30 @@ static void last_modified_diff(struct diff_queue_struct *q,
}
}
+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
+ struct last_modified_entry *ent;
+ struct hashmap_iter iter;
+
+ if (!lm->rev.bloom_filter_settings)
+ return 1;
+
+ if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
+ return 1;
+
+ filter = get_bloom_filter(lm->rev.repo, origin);
+ if (!filter)
+ return 1;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ if (bloom_filter_contains(filter, &ent->key,
+ lm->rev.bloom_filter_settings))
+ return 1;
+ }
+ return 0;
+}
+
int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
{
struct last_modified_callback_data data;
@@ -199,6 +240,9 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
if (!data.commit)
break;
+ if (!maybe_changed_path(lm, data.commit))
+ continue;
+
if (data.commit->object.flags & BOUNDARY) {
diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
&data.commit->object.oid,
--
2.49.0
* [PATCH RFC v2 4/5] last-modified: implement faster algorithm
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
` (2 preceding siblings ...)
2025-05-23 9:33 ` [PATCH RFC v2 3/5] last-modified: use Bloom filters when available Toon Claes
@ 2025-05-23 9:33 ` Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-05-23 9:33 ` [PATCH RFC v2 5/5] last-modified: initialize revision machinery without walk Toon Claes
2025-07-01 20:35 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Kristoffer Haugsbakk
5 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-05-23 9:33 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason, Derrick Stolee
The current implementation of 'git last-modified' works by doing a
revision walk, and inspecting the diff at each level of that walk to
annotate the to-be-found entries to a path. In other words, if the diff
at some level touches a path which has not yet been associated with a
commit, then that commit becomes associated with the path.
While a perfectly reasonable implementation, it can perform poorly in
either one of two scenarios:
1. There are many entries of interest, in which case there is simply
more work to do.
2. Or, there are (even a few) entries which have not been updated in a
long time, and so we must walk through a lot of history in order to
find a commit that touches that path.
This patch rewrites the last-modified implementation to address (2).
The idea behind the algorithm is to propagate a set of 'active' paths (a
path is 'active' if it does not yet belong to a commit) up to parents
and do a truncated revision walk.
The walk is truncated because it does not produce a revision for every
change in the original pathspec, but rather only for active paths.
More specifically, consider a priority queue of commits sorted by
generation number. First, enqueue the set of boundary commits with all
paths in the original spec marked as interesting.
Then, while the queue is not empty, do the following:
1. Pop an element, say, 'c', off of the queue, making sure that 'c'
isn't reachable by anything in the '--not' set.
2. For each parent 'p' (with index 'parent_i') of 'c', do the
following:
a. Compute the diff between 'c' and 'p'.
b. Pass any active paths that are TREESAME from 'c' to 'p'.
c. If 'p' has any active paths, push it onto the queue.
3. Associate any remaining paths with 'c', and mark them as inactive.
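The steps above can be sketched on a toy history. This is only an
illustration under stated assumptions, not the code from this patch: the
names toy_commit and toy_last_modified are hypothetical, trees are flat
arrays of blob ids, the priority queue is a linear scan for the highest
generation number, and the '--not' set and Bloom filters are left out.

```c
#include <assert.h>

#define TOY_MAX_COMMITS 16
#define TOY_MAX_PATHS 8

/* Hypothetical toy commit: not Git's struct commit. */
struct toy_commit {
	int generation;          /* larger means further from the root */
	int parents[2];          /* indices into the history, -1 if absent */
	int blob[TOY_MAX_PATHS]; /* blob id of each path in this tree */
};

/*
 * For each of nr_paths paths, store in result[] the index of the
 * commit that last modified it, starting the walk at 'tip'.
 */
static void toy_last_modified(const struct toy_commit *history,
			      int nr_commits, int tip, int nr_paths,
			      int *result)
{
	unsigned active[TOY_MAX_COMMITS] = { 0 }; /* bitmask of active paths */
	int queued[TOY_MAX_COMMITS] = { 0 };

	assert(nr_commits <= TOY_MAX_COMMITS && nr_paths <= TOY_MAX_PATHS);
	active[tip] = (1u << nr_paths) - 1;
	queued[tip] = 1;

	for (;;) {
		int c = -1;

		/* Step 1: pop the queued commit with the highest generation. */
		for (int i = 0; i < nr_commits; i++)
			if (queued[i] && (c < 0 ||
			    history[i].generation > history[c].generation))
				c = i;
		if (c < 0)
			break;
		queued[c] = 0;

		/* Step 2: pass active, TREESAME paths to each parent in order. */
		for (int k = 0; k < 2; k++) {
			int p = history[c].parents[k];

			if (p < 0)
				continue;
			for (int i = 0; i < nr_paths; i++) {
				if ((active[c] & (1u << i)) &&
				    history[c].blob[i] == history[p].blob[i]) {
					active[c] &= ~(1u << i);
					active[p] |= 1u << i;
				}
			}
			if (active[p])
				queued[p] = 1;
		}

		/* Step 3: paths still active differ from every parent. */
		for (int i = 0; i < nr_paths; i++)
			if (active[c] & (1u << i))
				result[i] = c;
		active[c] = 0;
	}
}
```

Because a parent always has a lower generation number than its
children, every commit that could pass paths to 'p' is popped before
'p' itself, so each commit is processed only once.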
This ends up being equivalent to doing something like 'git log -1 --
$path' for each path simultaneously. But, it allows us to go much faster
than the original implementation by limiting the number of diffs we
compute, since we can avoid parts of history that would have been
considered by the revision walk in the original implementation, but are
known to be uninteresting to us because we have already marked all paths
in that area to be inactive.
One other trick we can do on top is to avoid computing many first-parent
diffs when all paths active in 'c' are DEFINITELY_NOT in c's Bloom
filter. Since the commit-graph only stores first-parent diffs in the
Bloom filters, we can only apply this trick to first-parent diffs.
Now, some performance numbers. On github/git, our numbers look like the
following (all wall-clock times best-of-five, and with '--max-depth=0'
on the root):
                  github   ttaylorr/blame-tree-fast
with filters:     0.754s   0.271s (2.78x faster, 6.18x overall)
without filters:  1.676s   1.056s (1.58x faster)
and on torvalds/linux:
                  github   ttaylorr/blame-tree-fast
with filters:     0.608s   0.062s (9.81x faster, ~52x overall)
without filters:  3.251s   0.676s (4.81x faster)
In short, the existing implementation is about as fast *with* filters
as the new implementation is *without* filters. So, most repositories
should get a dramatic speed-up by just deploying this (even without
computing Bloom filters), and all repositories should get faster still
when computing Bloom filters.
Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
last-modified.c | 270 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
last-modified.h | 3 +
2 files changed, 256 insertions(+), 17 deletions(-)
diff --git a/last-modified.c b/last-modified.c
index f628434929..0a0818cdf1 100644
--- a/last-modified.c
+++ b/last-modified.c
@@ -3,18 +3,20 @@
#include "commit.h"
#include "diffcore.h"
#include "diff.h"
-#include "object.h"
#include "revision.h"
#include "repository.h"
#include "log-tree.h"
#include "dir.h"
#include "commit-graph.h"
#include "bloom.h"
+#include "prio-queue.h"
+#include "commit-slab.h"
struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
struct commit *commit;
+ int diff_idx;
struct bloom_key key;
const char path[FLEX_ARRAY];
};
@@ -86,6 +88,9 @@ void last_modified_init(struct last_modified *lm,
const char *prefix,
int argc, const char **argv)
{
+ struct hashmap_iter iter;
+ struct last_modified_entry *ent;
+
memset(lm, 0, sizeof(*lm));
hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
@@ -104,6 +109,13 @@ void last_modified_init(struct last_modified *lm,
if (add_from_revs(lm) < 0)
die(_("unable to setup last-modified"));
+
+ lm->all_paths = xcalloc(hashmap_get_size(&lm->paths), sizeof(const char *));
+ lm->all_paths_nr = 0;
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ ent->diff_idx = lm->all_paths_nr++;
+ lm->all_paths[ent->diff_idx] = ent->path;
+ }
}
void last_modified_release(struct last_modified *lm)
@@ -116,6 +128,20 @@ void last_modified_release(struct last_modified *lm)
}
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
+ free(lm->all_paths);
+}
+
+struct commit_active_paths {
+ char *active;
+ int nr;
+};
+
+define_commit_slab(active_paths, struct commit_active_paths);
+static struct active_paths active_paths;
+
+static void free_one_active_path(struct commit_active_paths *active)
+{
+ free(active->active);
}
struct last_modified_callback_data {
@@ -130,6 +156,7 @@ static void mark_path(const char *path, const struct object_id *oid,
struct last_modified_callback_data *data)
{
struct last_modified_entry *ent;
+ struct commit_active_paths *active;
/* Is it even a path that we are interested in? */
ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
@@ -141,11 +168,17 @@ static void mark_path(const char *path, const struct object_id *oid,
if (ent->commit)
return;
+ /* Are we inactive on the current commit? */
+ active = active_paths_at(&active_paths, data->commit);
+ if (active && active->active &&
+ !active->active[ent->diff_idx])
+ return;
+
/*
* Is it arriving at a version of interest, or is it from a side branch
* which did not contribute to the final state?
*/
- if (!oideq(oid, &ent->oid))
+ if (oid && !oideq(oid, &ent->oid))
return;
ent->commit = data->commit;
@@ -197,7 +230,32 @@ static void last_modified_diff(struct diff_queue_struct *q,
}
}
-static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+static char *scratch;
+
+static void pass_to_parent(struct commit_active_paths *c,
+ struct commit_active_paths *p,
+ int i)
+{
+ c->active[i] = 0;
+ c->nr--;
+ p->active[i] = 1;
+ p->nr++;
+}
+
+#define PARENT1 (1u<<16) /* used instead of SEEN */
+#define PARENT2 (1u<<17) /* used instead of BOTTOM, BOUNDARY */
+
+static int diff2idx(struct last_modified *lm, char *path)
+{
+ struct last_modified_entry *ent;
+ ent = hashmap_get_entry_from_hash(&lm->paths, strhash(path), path,
+ struct last_modified_entry, hashent);
+ return ent ? ent->diff_idx : -1;
+}
+
+static int maybe_changed_path(struct last_modified *lm,
+ struct commit *origin,
+ struct commit_active_paths *active)
{
struct bloom_filter *filter;
struct last_modified_entry *ent;
@@ -214,6 +272,8 @@ static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
return 1;
hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ if (active && !active->active[ent->diff_idx])
+ continue;
if (bloom_filter_contains(filter, &ent->key,
lm->rev.bloom_filter_settings))
return 1;
@@ -221,8 +281,88 @@ static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
return 0;
}
+static int process_parent(struct last_modified *lm, struct prio_queue *queue,
+ struct commit *c,
+ struct commit_active_paths *active_c,
+ struct commit *parent, int parent_i)
+{
+ int i, ret = 0; // TODO type & for loop var
+ struct commit_active_paths *active_p;
+
+ repo_parse_commit(lm->rev.repo, parent);
+
+ active_p = active_paths_at(&active_paths, parent);
+ if (!active_p->active) {
+ active_p->active = xcalloc(sizeof(char), lm->all_paths_nr);
+ active_p->nr = 0;
+ }
+
+ /*
+ * Before calling 'diff_tree_oid()' on our first parent, see if Bloom
+ * filters will tell us the diff is conclusively uninteresting.
+ */
+ if (parent_i || maybe_changed_path(lm, c, active_c)) {
+ diff_tree_oid(&parent->object.oid,
+ &c->object.oid, "", &lm->rev.diffopt);
+ diffcore_std(&lm->rev.diffopt);
+ }
+
+ if (!diff_queued_diff.nr) {
+ /*
+ * No diff entries means we are TREESAME on the base path, and
+ * so all active paths get passed onto this parent.
+ */
+ for (i = 0; i < lm->all_paths_nr; i++) {
+ if (active_c->active[i])
+ pass_to_parent(active_c, active_p, i);
+ }
+
+ if (!(parent->object.flags & PARENT1)) {
+ parent->object.flags |= PARENT1;
+ prio_queue_put(queue, parent);
+ }
+ ret = 1;
+ goto cleanup;
+ }
+
+ /*
+ * Otherwise, test each path for TREESAME-ness against the parent, and
+ * pass those along.
+ *
+ * First, set each position in 'scratch' to be zero for TREESAME paths,
+ * and one otherwise. Then, pass active and TREESAME paths to the
+ * parent.
+ */
+ for (i = 0; i < diff_queued_diff.nr; i++) {
+ struct diff_filepair *fp = diff_queued_diff.queue[i];
+ int k = diff2idx(lm, fp->two->path);
+ if (0 <= k && active_c->active[k])
+ scratch[k] = 1;
+ diff_free_filepair(fp);
+ }
+ diff_queued_diff.nr = 0;
+ for (i = 0; i < lm->all_paths_nr; i++) {
+ if (active_c->active[i] && !scratch[i])
+ pass_to_parent(active_c, active_p, i);
+ }
+
+ if (active_p->nr && !(parent->object.flags & PARENT1)) {
+ parent->object.flags |= PARENT1;
+ prio_queue_put(queue, parent);
+ }
+
+cleanup:
+ diff_queue_clear(&diff_queued_diff);
+ memset(scratch, 0, lm->all_paths_nr);
+
+ return ret;
+}
+
int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
{
+ int max_count, queue_popped = 0;
+ struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+ struct prio_queue not_queue = { compare_commits_by_gen_then_commit_date };
struct last_modified_callback_data data;
data.paths = &lm->paths;
@@ -233,25 +373,121 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
lm->rev.diffopt.format_callback = last_modified_diff;
lm->rev.diffopt.format_callback_data = &data;
- prepare_revision_walk(&lm->rev);
+ max_count = lm->rev.max_count;
- while (hashmap_get_size(&lm->paths)) {
- data.commit = get_revision(&lm->rev);
- if (!data.commit)
- break;
+ init_active_paths(&active_paths);
+ scratch = xcalloc(lm->all_paths_nr, sizeof(char));
- if (!maybe_changed_path(lm, data.commit))
- continue;
+ /*
+ * lm->rev.pending holds the set of boundary commits for our walk.
+ *
+ * Loop through each such commit, and place it in the appropriate queue.
+ */
+ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
+ struct commit *c = lookup_commit(lm->rev.repo,
+ &lm->rev.pending.objects[i].item->oid);
+ repo_parse_commit(lm->rev.repo, c);
- if (data.commit->object.flags & BOUNDARY) {
- diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
- &data.commit->object.oid,
- "", &lm->rev.diffopt);
- diff_flush(&lm->rev.diffopt);
- } else {
- log_tree_commit(&lm->rev, data.commit);
+ if (c->object.flags & BOTTOM) {
+ prio_queue_put(&not_queue, c);
+ c->object.flags |= PARENT2;
+ } else if (!(c->object.flags & PARENT1)) {
+ /*
+ * If the commit is a starting point (and hasn't been
+ * seen yet), then initialize the set of interesting
+ * paths, too.
+ */
+ struct commit_active_paths *active;
+
+ prio_queue_put(&queue, c);
+ c->object.flags |= PARENT1;
+
+ active = active_paths_at(&active_paths, c);
+ active->active = xcalloc(sizeof(char), lm->all_paths_nr);
+ memset(active->active, 1, lm->all_paths_nr);
+ active->nr = lm->all_paths_nr;
}
}
+ /*
+ * Now that we have processed the pending commits, allow the revision
+ * machinery to flush them by calling prepare_revision_walk().
+ */
+ prepare_revision_walk(&lm->rev);
+
+ while (queue.nr) {
+ int parent_i;
+ struct commit_list *p;
+ struct commit *c = prio_queue_get(&queue);
+ struct commit_active_paths *active_c = active_paths_at(&active_paths, c);
+
+ if ((0 <= max_count && max_count < ++queue_popped) ||
+ (c->object.flags & PARENT2)) {
+ /*
+ * Either a boundary commit, or we have already seen too
+ * many others. Either way, stop here.
+ */
+ c->object.flags |= PARENT2 | BOUNDARY;
+ data.commit = c;
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &c->object.oid,
+ "", &lm->rev.diffopt);
+ diff_flush(&lm->rev.diffopt);
+ goto cleanup;
+ }
+
+ /*
+ * Otherwise, keep going, but make sure that 'c' isn't reachable
+ * from anything in the '--not' queue.
+ */
+ repo_parse_commit(lm->rev.repo, c);
+
+ while (not_queue.nr) {
+ struct commit_list *np;
+ struct commit *n = prio_queue_get(&not_queue);
+
+ repo_parse_commit(lm->rev.repo, n);
+
+ for (np = n->parents; np; np = np->next) {
+ if (!(np->item->object.flags & PARENT2)) {
+ prio_queue_put(&not_queue, np->item);
+ np->item->object.flags |= PARENT2;
+ }
+ }
+
+ if (commit_graph_generation(n) < commit_graph_generation(c))
+ break;
+ }
+
+ /*
+ * Look at each remaining interesting path, and pass it onto
+ * parents in order if TREESAME.
+ */
+ for (p = c->parents, parent_i = 0; p; p = p->next, parent_i++) {
+ if (process_parent(lm, &queue,
+ c, active_c,
+ p->item, parent_i) > 0 )
+ break;
+ }
+
+ if (active_c->nr) {
+ /* Any paths that remain active were changed by 'c'. */
+ data.commit = c;
+ for (int i = 0; i < lm->all_paths_nr; i++) {
+ if (active_c->active[i])
+ mark_path(lm->all_paths[i], NULL, &data);
+ }
+ }
+
+cleanup:
+ FREE_AND_NULL(active_c->active);
+ active_c->nr = 0;
+ }
+
+ clear_prio_queue(&not_queue);
+ clear_prio_queue(&queue);
+ deep_clear_active_paths(&active_paths, free_one_active_path);
+ free(scratch);
+
return 0;
}
diff --git a/last-modified.h b/last-modified.h
index 42a819d979..8b2d896aaa 100644
--- a/last-modified.h
+++ b/last-modified.h
@@ -8,6 +8,9 @@
struct last_modified {
struct hashmap paths;
struct rev_info rev;
+
+ const char **all_paths;
+ int all_paths_nr;
};
void last_modified_init(struct last_modified *lm,
--
2.49.0
* [PATCH RFC v2 5/5] last-modified: initialize revision machinery without walk
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
` (3 preceding siblings ...)
2025-05-23 9:33 ` [PATCH RFC v2 4/5] last-modified: implement faster algorithm Toon Claes
@ 2025-05-23 9:33 ` Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-07-01 20:35 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Kristoffer Haugsbakk
5 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-05-23 9:33 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason, Derrick Stolee
In a previous commit we inserted a call to 'prepare_revision_walk()'
before we started our traversal. This was done when we leveraged the
revision machinery more (at the time, we were leaning on
'log_tree_commit()' which only worked after calling
'prepare_revision_walk()').
But, we have since dropped 'log_tree_commit()', so we don't need most of
the initialization work of 'prepare_revision_walk()'. Now we ask it to
do very little work during initialization by setting the '->no_walk'
flag to '1', which leaves its internal state alone enough that we can
still function as normal.
Unfortunately, this means that we now no longer complain about
non-commit inputs, since the revision machinery checks this for us (it
just silently ignores them).
Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
last-modified.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/last-modified.c b/last-modified.c
index 0a0818cdf1..b1458db0bc 100644
--- a/last-modified.c
+++ b/last-modified.c
@@ -271,6 +271,13 @@ static int maybe_changed_path(struct last_modified *lm,
if (!filter)
return 1;
+ for (int i = 0; i < lm->rev.bloom_keys_nr; i++) {
+ if (!(bloom_filter_contains(filter,
+ &lm->rev.bloom_keys[i],
+ lm->rev.bloom_filter_settings)))
+ return 0;
+ }
+
hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
if (active && !active->active[ent->diff_idx])
continue;
@@ -364,6 +371,7 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct prio_queue not_queue = { compare_commits_by_gen_then_commit_date };
struct last_modified_callback_data data;
+ struct commit_list *list;
data.paths = &lm->paths;
data.callback = cb;
@@ -372,6 +380,9 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
lm->rev.diffopt.format_callback = last_modified_diff;
lm->rev.diffopt.format_callback_data = &data;
+ lm->rev.no_walk = 1;
+
+ prepare_revision_walk(&lm->rev);
max_count = lm->rev.max_count;
@@ -379,14 +390,12 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
scratch = xcalloc(lm->all_paths_nr, sizeof(char));
/*
- * lm->rev.pending holds the set of boundary commits for our walk.
+ * lm->rev.commits holds the set of boundary commits for our walk.
*
* Loop through each such commit, and place it in the appropriate queue.
*/
- for (size_t i = 0; i < lm->rev.pending.nr; i++) {
- struct commit *c = lookup_commit(lm->rev.repo,
- &lm->rev.pending.objects[i].item->oid);
- repo_parse_commit(lm->rev.repo, c);
+ for (list = lm->rev.commits; list; list = list->next) {
+ struct commit *c = list->item;
if (c->object.flags & BOTTOM) {
prio_queue_put(&not_queue, c);
@@ -409,12 +418,6 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
}
}
- /*
- * Now that we have processed the pending commits, allow the revision
- * machinery to flush them by calling prepare_revision_walk().
- */
- prepare_revision_walk(&lm->rev);
-
while (queue.nr) {
int parent_i;
struct commit_list *p;
--
2.49.0
* Re: [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified
2025-05-23 9:33 ` [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-05-25 20:07 ` Justin Tobler
2025-06-05 8:32 ` Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
1 sibling, 1 reply; 135+ messages in thread
From: Justin Tobler @ 2025-05-25 20:07 UTC (permalink / raw)
To: Toon Claes
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On 25/05/23 11:33AM, Toon Claes wrote:
> Similar to git-blame(1), introduce a new subcommand
> git-last-modified(1). This command shows the most recent modification to
> paths in a tree. It does so by expanding the tree at a given commit,
> taking note of the current state of each path, and then walking
> backwards through history looking for commits where each path changed
> into its final commit ID.
Just a thought, but it might be nice to include in the commit message why
this operation is useful.
> Based-on-patch-by: Jeff King <peff@peff.net>
> Improved-by: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
> Signed-off-by: Toon Claes <toon@iotcl.com>
> ---
> .gitignore | 1 +
> Documentation/git-last-modified.adoc | 49 ++++++++
> Documentation/meson.build | 1 +
> Makefile | 2 +
> builtin.h | 1 +
> builtin/last-modified.c | 43 +++++++
> command-list.txt | 1 +
> git.c | 1 +
> last-modified.c | 213 +++++++++++++++++++++++++++++++++++
> last-modified.h | 27 +++++
> meson.build | 2 +
> t/meson.build | 1 +
> t/t8020-last-modified.sh | 194 +++++++++++++++++++++++++++++++
> 13 files changed, 536 insertions(+)
>
> diff --git a/.gitignore b/.gitignore
> index 04c444404e..a36ee94443 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -87,6 +87,7 @@
> /git-init-db
> /git-interpret-trailers
> /git-instaweb
> +/git-last-modified
> /git-log
> /git-ls-files
> /git-ls-remote
> diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
> new file mode 100644
> index 0000000000..1af38f402e
> --- /dev/null
> +++ b/Documentation/git-last-modified.adoc
> @@ -0,0 +1,49 @@
> +git-last-modified(1)
> +====================
> +
> +NAME
> +----
> +git-last-modified - EXPERIMENTAL: Show when files were last modified
> +
> +
> +SYNOPSIS
> +--------
> +[synopsis]
> +git last-modified [-r] [<revision-range>] [[--] <path>...]
> +
> +DESCRIPTION
> +-----------
> +
> +Shows which commit last modified each of the relevant files and subdirectories.
> +
> +THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
> +
> +OPTIONS
> +-------
> +
> +-r::
> + Recurse into subtrees.
> +
> +-t::
> + Show tree entry itself as well as subtrees. Implies `-r`.
This left me wondering about the default behavior regarding displaying
trees when neither `-t` nor `-r` is specified. Do we omit showing when
a tree was last modified?
> +
> +<revision-range>::
> + Only traverse commits in the specified revision range. When no
> + `<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
> + history leading to the current commit). For a complete list of ways to
> + spell `<revision-range>`, see the 'Specifying Ranges' section of
> + linkgit:gitrevisions[7].
> +
> +[--] <path>...::
> + For each _<path>_ given, the commit which last modified it is returned.
> + Without an optional path parameter, all files and subdirectories
> + of the current working directory are included in the
"are included in the" what? I assume you meant to say the search/operation.
> +
> +SEE ALSO
> +--------
> +linkgit:git-blame[1],
> +linkgit:git-log[1].
> +
> +GIT
> +---
> +Part of the linkgit:git[1] suite
> diff --git a/Documentation/meson.build b/Documentation/meson.build
> index 1433acfd31..fa93cec5c3 100644
> --- a/Documentation/meson.build
> +++ b/Documentation/meson.build
> @@ -74,6 +74,7 @@ manpages = {
> 'git-init.adoc' : 1,
> 'git-instaweb.adoc' : 1,
> 'git-interpret-trailers.adoc' : 1,
> + 'git-last-modified.adoc' : 1,
> 'git-log.adoc' : 1,
> 'git-ls-files.adoc' : 1,
> 'git-ls-remote.adoc' : 1,
> diff --git a/Makefile b/Makefile
> index ecd590a643..40bc24c704 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -1051,6 +1051,7 @@ LIB_OBJS += hook.o
> LIB_OBJS += ident.o
> LIB_OBJS += json-writer.o
> LIB_OBJS += kwset.o
> +LIB_OBJS += last-modified.o
> LIB_OBJS += levenshtein.o
> LIB_OBJS += line-log.o
> LIB_OBJS += line-range.o
> @@ -1266,6 +1267,7 @@ BUILTIN_OBJS += builtin/hook.o
> BUILTIN_OBJS += builtin/index-pack.o
> BUILTIN_OBJS += builtin/init-db.o
> BUILTIN_OBJS += builtin/interpret-trailers.o
> +BUILTIN_OBJS += builtin/last-modified.o
> BUILTIN_OBJS += builtin/log.o
> BUILTIN_OBJS += builtin/ls-files.o
> BUILTIN_OBJS += builtin/ls-remote.o
> diff --git a/builtin.h b/builtin.h
> index bff13e3069..6ed6759ec4 100644
> --- a/builtin.h
> +++ b/builtin.h
> @@ -176,6 +176,7 @@ int cmd_hook(int argc, const char **argv, const char *prefix, struct repository
> int cmd_index_pack(int argc, const char **argv, const char *prefix, struct repository *repo);
> int cmd_init_db(int argc, const char **argv, const char *prefix, struct repository *repo);
> int cmd_interpret_trailers(int argc, const char **argv, const char *prefix, struct repository *repo);
> +int cmd_last_modified(int argc, const char **argv, const char *prefix, struct repository *repo);
> int cmd_log_reflog(int argc, const char **argv, const char *prefix, struct repository *repo);
> int cmd_log(int argc, const char **argv, const char *prefix, struct repository *repo);
> int cmd_ls_files(int argc, const char **argv, const char *prefix, struct repository *repo);
> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
> new file mode 100644
> index 0000000000..0d4733f666
> --- /dev/null
> +++ b/builtin/last-modified.c
> @@ -0,0 +1,43 @@
> +#include "git-compat-util.h"
> +#include "last-modified.h"
> +#include "hex.h"
> +#include "quote.h"
> +#include "config.h"
> +#include "object-name.h"
> +#include "parse-options.h"
> +#include "builtin.h"
For builtins, "builtin.h" should be included at the top and
"git-compat-util.h" should be omitted.
> +
> +static void show_entry(const char *path, const struct commit *commit, void *d)
> +{
> + struct last_modified *lm = d;
> +
> + if (commit->object.flags & BOUNDARY)
> + putchar('^');
> + printf("%s\t", oid_to_hex(&commit->object.oid));
> +
> + if (lm->rev.diffopt.line_termination)
> + write_name_quoted(path, stdout, '\n');
> + else
> + printf("%s%c", path, '\0');
> +
> + fflush(stdout);
> +}
> +
> +int cmd_last_modified(int argc,
> + const char **argv,
> + const char *prefix,
> + struct repository *repo)
> +{
> + int ret = 0;
> + struct last_modified lm;
> +
> + repo_config(repo, git_default_config, NULL);
> +
> + last_modified_init(&lm, repo, prefix, argc, argv);
> + if (last_modified_run(&lm, show_entry, &lm) < 0)
> + die(_("error running last-modified traversal"));
> +
> + last_modified_release(&lm);
> +
> + return ret;
> +}
> diff --git a/command-list.txt b/command-list.txt
> index b7ade3ab9f..b715777b24 100644
> --- a/command-list.txt
> +++ b/command-list.txt
> @@ -124,6 +124,7 @@ git-index-pack plumbingmanipulators
> git-init mainporcelain init
> git-instaweb ancillaryinterrogators complete
> git-interpret-trailers purehelpers
> +git-last-modified plumbinginterrogators
> git-log mainporcelain info
> git-ls-files plumbinginterrogators
> git-ls-remote plumbinginterrogators
> diff --git a/git.c b/git.c
> index 77c4359522..65afc0d0e7 100644
> --- a/git.c
> +++ b/git.c
> @@ -565,6 +565,7 @@ static struct cmd_struct commands[] = {
> { "init", cmd_init_db },
> { "init-db", cmd_init_db },
> { "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
> + { "last-modified", cmd_last_modified, RUN_SETUP },
> { "log", cmd_log, RUN_SETUP },
> { "ls-files", cmd_ls_files, RUN_SETUP },
> { "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
> diff --git a/last-modified.c b/last-modified.c
> new file mode 100644
> index 0000000000..9283f8fcae
> --- /dev/null
> +++ b/last-modified.c
> @@ -0,0 +1,213 @@
> +#include "git-compat-util.h"
> +#include "last-modified.h"
> +#include "commit.h"
> +#include "diffcore.h"
> +#include "diff.h"
> +#include "object.h"
> +#include "revision.h"
> +#include "repository.h"
> +#include "log-tree.h"
> +
> +struct last_modified_entry {
> + struct hashmap_entry hashent;
> + struct object_id oid;
> + struct commit *commit;
> + const char path[FLEX_ARRAY];
> +};
> +
> +static void add_from_diff(struct diff_queue_struct *q,
> + struct diff_options *opt UNUSED,
> + void *data)
> +{
> + struct last_modified *lm = data;
> +
> + for (int i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + struct last_modified_entry *ent;
> + const char *path = p->two->path;
> +
> + FLEX_ALLOC_STR(ent, path, path);
> + oidcpy(&ent->oid, &p->two->oid);
> + hashmap_entry_init(&ent->hashent, strhash(ent->path));
> + hashmap_add(&lm->paths, &ent->hashent);
> + }
> +}
> +
> +static int add_from_revs(struct last_modified *lm)
> +{
> + size_t count = 0;
> + struct diff_options diffopt;
> +
> + memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
> + copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
> + diffopt.output_format = DIFF_FORMAT_CALLBACK;
> + diffopt.format_callback = add_from_diff;
> + diffopt.format_callback_data = lm;
> +
> + for (size_t i = 0; i < lm->rev.pending.nr; i++) {
> + struct object_array_entry *obj = lm->rev.pending.objects + i;
> +
> + if (obj->item->flags & UNINTERESTING)
> + continue;
> +
> + if (count++)
> + return error(_("can only get last-modified one tree at a time"));
> +
> + diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
> + &obj->item->oid, "", &diffopt);
> + diff_flush(&diffopt);
> + }
> + clear_pathspec(&diffopt.pathspec);
> +
> + return 0;
> +}
> +
> +static int last_modified_entry_hashcmp(const void *unused UNUSED,
> + const struct hashmap_entry *hent1,
> + const struct hashmap_entry *hent2,
> + const void *path)
> +{
> + const struct last_modified_entry *ent1 =
> + container_of(hent1, const struct last_modified_entry, hashent);
> + const struct last_modified_entry *ent2 =
> + container_of(hent2, const struct last_modified_entry, hashent);
> + return strcmp(ent1->path, path ? path : ent2->path);
> +}
> +
> +void last_modified_init(struct last_modified *lm,
> + struct repository *r,
> + const char *prefix,
> + int argc, const char **argv)
> +{
> + memset(lm, 0, sizeof(*lm));
> + hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
> +
> + repo_init_revisions(r, &lm->rev, prefix);
> + lm->rev.def = "HEAD";
> + lm->rev.combine_merges = 1;
> + lm->rev.show_root_diff = 1;
> + lm->rev.boundary = 1;
> + lm->rev.no_commit_id = 1;
> + lm->rev.diff = 1;
> + if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
> + die(_("unknown last-modified argument: %s"), argv[1]);
> +
> + if (add_from_revs(lm) < 0)
> + die(_("unable to setup last-modified"));
> +}
> +
> +void last_modified_release(struct last_modified *lm)
> +{
> + hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
> + release_revisions(&lm->rev);
> +}
> +
> +struct last_modified_callback_data {
> + struct commit *commit;
> + struct hashmap *paths;
> +
> + last_modified_callback callback;
> + void *callback_data;
> +};
> +
> +static void mark_path(const char *path, const struct object_id *oid,
> + struct last_modified_callback_data *data)
> +{
> + struct last_modified_entry *ent;
> +
> + /* Is it even a path that we are interested in? */
> + ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
> + struct last_modified_entry, hashent);
> + if (!ent)
> + return;
> +
> + /* Have we already found a commit? */
> + if (ent->commit)
> + return;
> +
> + /*
> + * Is it arriving at a version of interest, or is it from a side branch
> + * which did not contribute to the final state?
> + */
> + if (!oideq(oid, &ent->oid))
> + return;
> +
> + ent->commit = data->commit;
> + if (data->callback)
> + data->callback(path, data->commit, data->callback_data);
> +
> + hashmap_remove(data->paths, &ent->hashent, path);
> + free(ent);
> +}
> +
> +static void last_modified_diff(struct diff_queue_struct *q,
> + struct diff_options *opt UNUSED, void *cbdata)
> +{
> + struct last_modified_callback_data *data = cbdata;
> +
> + for (int i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + switch (p->status) {
> + case DIFF_STATUS_DELETED:
> + /*
> + * There's no point in feeding a deletion, as it could
> + * not have resulted in our current state, which
> + * actually has the file.
> + */
> + break;
> +
> + default:
> + /*
> + * Otherwise, we care only that we somehow arrived at
> + * a final path/sha1 state. Note that this covers some
> + * potentially controversial areas, including:
> + *
> + * 1. A rename or copy will be found, as it is the
> + * first time the content has arrived at the given
> + * path.
> + *
> + * 2. Even a non-content modification like a mode or
> + * type change will trigger it.
> + *
> + * We take the inclusive approach for now, and find
> + * anything which impacts the path. Options to tweak
> + * the behavior (e.g., to "--follow" the content across
> + * renames) can come later.
> + */
> + mark_path(p->two->path, &p->two->oid, data);
> + break;
> + }
> + }
> +}
> +
> +int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
> +{
> + struct last_modified_callback_data data;
> +
> + data.paths = &lm->paths;
> + data.callback = cb;
> + data.callback_data = cbdata;
> +
> + lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
> + lm->rev.diffopt.format_callback = last_modified_diff;
> + lm->rev.diffopt.format_callback_data = &data;
> +
> + prepare_revision_walk(&lm->rev);
> +
> + while (hashmap_get_size(&lm->paths)) {
> + data.commit = get_revision(&lm->rev);
> + if (!data.commit)
> + break;
> +
> + if (data.commit->object.flags & BOUNDARY) {
> + diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
> + &data.commit->object.oid,
> + "", &lm->rev.diffopt);
> + diff_flush(&lm->rev.diffopt);
> + } else {
> + log_tree_commit(&lm->rev, data.commit);
> + }
> + }
> +
> + return 0;
> +}
> diff --git a/last-modified.h b/last-modified.h
> new file mode 100644
> index 0000000000..42a819d979
> --- /dev/null
> +++ b/last-modified.h
Any reason this code doesn't just live with the builtin? Is there intent
for it to be used elsewhere?
> @@ -0,0 +1,27 @@
> +#ifndef LAST_MODIFIED_H
> +#define LAST_MODIFIED_H
> +
> +#include "commit.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +struct last_modified {
> + struct hashmap paths;
> + struct rev_info rev;
> +};
It might be nice to leave some comments to document the types and
functions here.
> +
> +void last_modified_init(struct last_modified *lm,
> + struct repository *r,
> + const char *prefix,
> + int argc, const char **argv);
Given that `last_modified_init()` handles argument parsing for the
builtin, I somewhat question the value of having it outside the
builtin.
> +
> +void last_modified_release(struct last_modified *);
> +
> +typedef void (*last_modified_callback)(const char *path,
> + const struct commit *commit,
> + void *data);
> +int last_modified_run(struct last_modified *lm,
> + last_modified_callback cb,
> + void *cbdata);
> +
> +#endif /* LAST_MODIFIED_H */
> diff --git a/meson.build b/meson.build
> index a1476e5b32..bdd9ed2c4c 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -365,6 +365,7 @@ libgit_sources = [
> 'ident.c',
> 'json-writer.c',
> 'kwset.c',
> + 'last-modified.c',
> 'levenshtein.c',
> 'line-log.c',
> 'line-range.c',
> @@ -609,6 +610,7 @@ builtin_sources = [
> 'builtin/index-pack.c',
> 'builtin/init-db.c',
> 'builtin/interpret-trailers.c',
> + 'builtin/last-modified.c',
> 'builtin/log.c',
> 'builtin/ls-files.c',
> 'builtin/ls-remote.c',
> diff --git a/t/meson.build b/t/meson.build
> index fcfc1c2c2b..be5a711375 100644
> --- a/t/meson.build
> +++ b/t/meson.build
> @@ -962,6 +962,7 @@ integration_tests = [
> 't8012-blame-colors.sh',
> 't8013-blame-ignore-revs.sh',
> 't8014-blame-ignore-fuzzy.sh',
> + 't8020-last-modified.sh',
> 't9001-send-email.sh',
> 't9002-column.sh',
> 't9003-help-autocorrect.sh',
> diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
> new file mode 100755
> index 0000000000..0c4a19c029
> --- /dev/null
> +++ b/t/t8020-last-modified.sh
> @@ -0,0 +1,194 @@
> +#!/bin/sh
> +
> +test_description='last-modified tests'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> + test_commit 1 file &&
> + mkdir a &&
> + test_commit 2 a/file &&
> + mkdir a/b &&
> + test_commit 3 a/b/file
> +'
> +
> +test_expect_success 'cannot run last-modified on two trees' '
> + test_must_fail git last-modified HEAD HEAD~1
> +'
> +
> +check_last_modified() {
> + local indir= &&
> + while test $# != 0
> + do
> + case "$1" in
> + -C)
> + indir="$2"
> + shift
> + ;;
> + *)
> + break
> + ;;
> + esac &&
> + shift
> + done &&
> +
> + cat >expect &&
> + test_when_finished "rm -f tmp.*" &&
> + git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
> + git name-rev --annotate-stdin --name-only --tags \
> + <tmp.1 >tmp.2 &&
> + tr '\t' ' ' <tmp.2 >tmp.3 &&
> + sort tmp.3 >actual &&
> + test_cmp expect actual
> +}
> +
> +test_expect_success 'last-modified non-recursive' '
> + check_last_modified <<-\EOF
> + 1 file
> + 3 a
> + EOF
> +'
> +
> +test_expect_success 'last-modified recursive' '
> + check_last_modified -r <<-\EOF
> + 1 file
> + 2 a/file
> + 3 a/b/file
> + EOF
> +'
> +
> +test_expect_success 'last-modified subdir' '
> + check_last_modified a <<-\EOF
> + 3 a
> + EOF
> +'
> +
> +test_expect_success 'last-modified subdir recursive' '
> + check_last_modified -r a <<-\EOF
> + 2 a/file
> + 3 a/b/file
> + EOF
> +'
> +
> +test_expect_success 'last-modified from non-HEAD commit' '
> + check_last_modified HEAD^ <<-\EOF
> + 1 file
> + 2 a
> + EOF
> +'
> +
> +test_expect_success 'last-modified from subdir defaults to root' '
> + check_last_modified -C a <<-\EOF
> + 1 file
> + 3 a
> + EOF
> +'
> +
> +test_expect_success 'last-modified from subdir uses relative pathspecs' '
> + check_last_modified -C a -r b <<-\EOF
> + 3 a/b/file
> + EOF
> +'
> +
> +test_expect_success 'limit last-modified traversal by count' '
> + check_last_modified -1 <<-\EOF
> + 3 a
> + ^2 file
> + EOF
> +'
> +
> +test_expect_success 'limit last-modified traversal by commit' '
> + check_last_modified HEAD~2..HEAD <<-\EOF
> + 3 a
> + ^1 file
> + EOF
> +'
> +
> +test_expect_success 'only last-modified files in the current tree' '
> + git rm -rf a &&
> + git commit -m "remove a" &&
> + check_last_modified <<-\EOF
> + 1 file
> + EOF
> +'
> +
> +test_expect_success 'cross merge boundaries in blaming' '
> + git checkout HEAD^0 &&
> + git rm -rf . &&
> + test_commit m1 &&
> + git checkout HEAD^ &&
> + git rm -rf . &&
> + test_commit m2 &&
> + git merge m1 &&
> + check_last_modified <<-\EOF
> + m1 m1.t
> + m2 m2.t
> + EOF
> +'
> +
> +test_expect_success 'last-modified merge for resolved conflicts' '
> + git checkout HEAD^0 &&
> + git rm -rf . &&
> + test_commit c1 conflict &&
> + git checkout HEAD^ &&
> + git rm -rf . &&
> + test_commit c2 conflict &&
> + test_must_fail git merge c1 &&
> + test_commit resolved conflict &&
> + check_last_modified conflict <<-\EOF
> + resolved conflict
> + EOF
> +'
> +
> +
> +# Consider `file` with this content through history:
> +#
> +# A---B---B-------B---B
> +# \ /
> +# C---D
> +test_expect_success 'last-modified merge ignores content from branch' '
> + git checkout HEAD^0 &&
> + git rm -rf . &&
> + test_commit a1 file A &&
> + test_commit a2 file B &&
> + test_commit a3 file C &&
> + test_commit a4 file D &&
> + git checkout a2 &&
> + git merge --no-commit --no-ff a4 &&
> + git checkout a2 -- file &&
> + git merge --continue &&
> + check_last_modified <<-\EOF
> + a2 file
> + EOF
> +'
> +
> +# Consider `file` with this content through history:
> +#
> +# A---B---B---C---D---B---B
> +# \ /
> +# B-------B
> +test_expect_success 'last-modified merge undoes changes' '
> + git checkout HEAD^0 &&
> + git rm -rf . &&
> + test_commit b1 file A &&
> + test_commit b2 file B &&
> + test_commit b3 file C &&
> + test_commit b4 file D &&
> + git checkout b2 &&
> + test_commit b5 file2 2 &&
> + git checkout b4 &&
> + git merge --no-commit --no-ff b5 &&
> + git checkout b2 -- file &&
> + git merge --continue &&
> + check_last_modified <<-\EOF
> + b2 file
> + b5 file2
> + EOF
> +'
> +
> +test_expect_success 'last-modified complains about unknown arguments' '
> + test_must_fail git last-modified --foo 2>err &&
> + grep "unknown last-modified argument: --foo" err
> +'
> +
> +test_done
>
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH RFC v2 4/5] last-modified: implement faster algorithm
2025-05-23 9:33 ` [PATCH RFC v2 4/5] last-modified: implement faster algorithm Toon Claes
@ 2025-05-27 10:39 ` Patrick Steinhardt
0 siblings, 0 replies; 135+ messages in thread
From: Patrick Steinhardt @ 2025-05-27 10:39 UTC (permalink / raw)
To: Toon Claes
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Fri, May 23, 2025 at 11:33:51AM +0200, Toon Claes wrote:
> The current implementation of 'git last-modified' works by doing a
> revision walk, and inspecting the diff at each level of that walk to
> annotate the to-be-found entries to a path. In other words, if the diff
> at some level touches a path which has not yet been associated with a
> commit, then that commit becomes associated with the path.
It's a bit funny that we first introduce an algorithm, only to change it
in a subsequent step again. I don't mind it much though: this variant
here is quite a bit more complicated, and it's nice to first gain an
understanding of how the simple algorithm works before going to the
harder one.
It's also nice that the tests prove that this indeed leads to the
same result.
[snip]
> More specifically, consider a priority queue of commits sorted by
> generation number. First, enqueue the set of boundary commits with all
> paths in the original spec marked as interesting.
>
> Then, while the queue is not empty, do the following:
>
> 1. Pop an element, say, 'c', off of the queue, making sure that 'c'
> isn't reachable by anything in the '--not' set.
>
> 2. For each parent 'p' (with index 'parent_i') of 'c', do the
> following:
>
> a. Compute the diff between 'c' and 'p'.
> b. Pass any active paths that are TREESAME from 'c' to 'p'.
> c. If 'p' has any active paths, push it onto the queue.
What if an active path is changed on both sides of a merge commit? Do we
pass it to the first parent?
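To make the question concrete, here is a toy model of steps 1 and 2 as
quoted above. It is only a sketch under stated assumptions, not the code
from the patch: all `toy_*` names are hypothetical, a linear scan stands in
for the priority queue, the '--not' check is left out, "TREESAME" is
modelled as "same blob id in 'c' and 'p'", and a path that no parent takes
over is assumed to be attributed to 'c' itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PATHS 8
#define TOY_MAX_COMMITS 32

struct toy_commit {
	unsigned generation;     /* assumed larger than that of every parent */
	int parents[2];          /* indices into the commit array, -1 if absent */
	int blob[TOY_MAX_PATHS]; /* toy tree: blob id per path, 0 if absent */
};

/* result[i] receives the index of the commit blamed for path i, or -1. */
void toy_walk(const struct toy_commit *commits, size_t nr_commits,
	      size_t tip, size_t nr_paths, int *result)
{
	uint32_t active[TOY_MAX_COMMITS] = { 0 };

	assert(nr_commits <= TOY_MAX_COMMITS && nr_paths <= TOY_MAX_PATHS);
	assert(tip < nr_commits);

	/* Every path present in the tip starts out active on the tip. */
	for (size_t i = 0; i < nr_paths; i++) {
		result[i] = -1;
		if (commits[tip].blob[i])
			active[tip] |= UINT32_C(1) << i;
	}

	for (;;) {
		size_t c = nr_commits;

		/* Linear scan standing in for a queue ordered by generation. */
		for (size_t k = 0; k < nr_commits; k++)
			if (active[k] && (c == nr_commits ||
			    commits[k].generation > commits[c].generation))
				c = k;
		if (c == nr_commits)
			break;

		for (size_t n = 0; n < 2; n++) {
			int p = commits[c].parents[n];
			uint32_t passed = 0;

			if (p < 0)
				continue;
			/* Pass on active paths that are unchanged relative to 'p'. */
			for (size_t i = 0; i < nr_paths; i++)
				if (((active[c] >> i) & 1) &&
				    commits[c].blob[i] == commits[p].blob[i])
					passed |= UINT32_C(1) << i;
			active[c] &= ~passed;
			active[p] |= passed;
		}

		/* Assumption: whatever no parent took over is blamed on 'c'. */
		for (size_t i = 0; i < nr_paths; i++)
			if ((active[c] >> i) & 1)
				result[i] = (int)c;
		active[c] = 0;
	}
}
```

In this model a path that differs from both parents of a merge never
reaches either parent and ends up attributed to the merge commit, while a
path that is unchanged relative to several parents goes to the first one
that matches. Whether the real implementation behaves the same way is
exactly what the question above is asking.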
[snip]
> Now, some performance numbers. On github/git, our numbers look like the
> following (all wall-clock times best-of-five, and with '--max-depth=0'
> on the root):
This option does not exist in this version of git-last-modified(1).
> github ttaylorr/blame-tree-fast
> with filters: 0.754s 0.271s (2.78x faster, 6.18x overall)
> without filters: 1.676s 1.056s (1.58x faster)
>
> and on torvalds/linux:
>
> github ttaylorr/blame-tree-fast
> with filters: 0.608 0.062 (9.81x faster, ~52x overall)
> without filters: 3.251 0.676 (4.81x faster)
>
> In short, the existing implementation is comparably fast *with* filters
> as the new implementation is *without* filters. So, most repositories
> should get a dramatic speed-up by just deploying this (even without
> computing Bloom filters), and all repositories should get faster still
> when computing Bloom filters.
It would be nice to introduce "filters" as "Bloom filters". I was
initially wondering what filters you talk about until you then mention
it in the last sentence.
> diff --git a/last-modified.c b/last-modified.c
> index f628434929..0a0818cdf1 100644
> --- a/last-modified.c
> +++ b/last-modified.c
> @@ -3,18 +3,20 @@
> #include "commit.h"
> #include "diffcore.h"
> #include "diff.h"
> -#include "object.h"
> #include "revision.h"
> #include "repository.h"
> #include "log-tree.h"
> #include "dir.h"
> #include "commit-graph.h"
> #include "bloom.h"
> +#include "prio-queue.h"
> +#include "commit-slab.h"
It would be nice if we could keep these lexicographically sorted right
from the first commit.
> @@ -116,6 +128,20 @@ void last_modified_release(struct last_modified *lm)
> }
> hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
> release_revisions(&lm->rev);
> + free(lm->all_paths);
> +}
> +
> +struct commit_active_paths {
> + char *active;
> + int nr;
Should this be a `size_t` as it is counting something?
> +};
Hm, a bit weird, as I don't know what `nr` is supposed to stand for.
Intuitively I would have expected that `active` is an array of strings,
and that `nr` tracks how many there are. But that's not the case.
Let's read on.
> +define_commit_slab(active_paths, struct commit_active_paths);
> +static struct active_paths active_paths;
> +
> +static void free_one_active_path(struct commit_active_paths *active)
s/free_one_active_path/commit_active_paths_release/
> @@ -197,7 +230,32 @@ static void last_modified_diff(struct diff_queue_struct *q,
> }
> }
>
> -static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
> +static char *scratch;
Having a global variable like this in a library is not great. Can we
instead pass around a context or something like this internally?
> +
> +static void pass_to_parent(struct commit_active_paths *c,
> + struct commit_active_paths *p,
> + int i)
> +{
> + c->active[i] = 0;
> + c->nr--;
> + p->active[i] = 1;
> + p->nr++;
Okay, so `active` is a bitfield. It's a bit weird that we use a full
byte for each bit though. It might not matter much in practice, but it
feels quite wasteful to me. Doubly so because we allocate this bitfield
for every commit we visit.
Can we maybe instead use `struct bitmap` for this?
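A rough sketch of what a bit-packed variant could look like. This is not
Git's `struct bitmap` API; `toy_active_set` and the `toy_*` helpers are
hypothetical stand-ins used only to show the idea of one bit per path plus
a counter of set bits. Unlike the quoted `pass_to_parent()`, the helper
here leaves the counters alone when a bit already has the requested value.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-commit set of active paths, one bit per path index. */
struct toy_active_set {
	unsigned long *words;
	size_t nr_paths;  /* capacity, in paths */
	size_t nr_active; /* number of bits currently set */
};

void toy_active_init(struct toy_active_set *s, size_t nr_paths)
{
	size_t nr_words = (nr_paths + TOY_WORD_BITS - 1) / TOY_WORD_BITS;

	s->words = calloc(nr_words ? nr_words : 1, sizeof(*s->words));
	assert(s->words);
	s->nr_paths = nr_paths;
	s->nr_active = 0;
}

int toy_active_get(const struct toy_active_set *s, size_t i)
{
	assert(i < s->nr_paths);
	return (s->words[i / TOY_WORD_BITS] >> (i % TOY_WORD_BITS)) & 1;
}

/* Set or clear bit i, keeping nr_active in sync with the bits. */
void toy_active_set_bit(struct toy_active_set *s, size_t i, int on)
{
	unsigned long mask = 1UL << (i % TOY_WORD_BITS);

	if (toy_active_get(s, i) == !!on)
		return;
	if (on) {
		s->words[i / TOY_WORD_BITS] |= mask;
		s->nr_active++;
	} else {
		s->words[i / TOY_WORD_BITS] &= ~mask;
		s->nr_active--;
	}
}

/* Move path i from the child's active set to the parent's. */
void toy_pass_to_parent(struct toy_active_set *c, struct toy_active_set *p,
			size_t i)
{
	toy_active_set_bit(c, i, 0);
	toy_active_set_bit(p, i, 1);
}

void toy_active_release(struct toy_active_set *s)
{
	free(s->words);
	s->words = NULL;
	s->nr_paths = s->nr_active = 0;
}
```

With a typical 8-bit `char` this needs one byte per eight paths instead of
one byte per path, which adds up when a set is allocated for every visited
commit.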
> +}
> +
> +#define PARENT1 (1u<<16) /* used instead of SEEN */
> +#define PARENT2 (1u<<17) /* used instead of BOTTOM, BOUNDARY */
These are the same definitions as in "commit-reach.c". Might be worth it
to deduplicate those.
> @@ -221,8 +281,88 @@ static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
> return 0;
> }
>
> +static int process_parent(struct last_modified *lm, struct prio_queue *queue,
> + struct commit *c,
> + struct commit_active_paths *active_c,
> + struct commit *parent, int parent_i)
> +{
> + int i, ret = 0; // TODO type & for loop var
This looks like a left-over comment that should be addressed.
Patrick
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified
2025-05-23 9:33 ` [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified Toon Claes
2025-05-25 20:07 ` Justin Tobler
@ 2025-05-27 10:39 ` Patrick Steinhardt
2025-06-13 9:34 ` Toon Claes
1 sibling, 1 reply; 135+ messages in thread
From: Patrick Steinhardt @ 2025-05-27 10:39 UTC (permalink / raw)
To: Toon Claes
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Fri, May 23, 2025 at 11:33:48AM +0200, Toon Claes wrote:
> diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
> new file mode 100644
> index 0000000000..1af38f402e
> --- /dev/null
> +++ b/Documentation/git-last-modified.adoc
> @@ -0,0 +1,49 @@
> +git-last-modified(1)
> +====================
> +
> +NAME
> +----
> +git-last-modified - EXPERIMENTAL: Show when files were last modified
Nit: we don't have the EXPERIMENTAL label here for git-switch(1) or
git-restore(1).
> +
> +
> +SYNOPSIS
> +--------
> +[synopsis]
> +git last-modified [-r] [<revision-range>] [[--] <path>...]
> +
> +DESCRIPTION
> +-----------
> +
> +Shows which commit last modified each of the relevant files and subdirectories.
> +
> +THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
> +
> +OPTIONS
> +-------
> +
> +-r::
> + Recurse into subtrees.
> +
> +-t::
> + Show tree entry itself as well as subtrees. Implies `-r`.
These flags aren't yet supported in this version, are they?
> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
> new file mode 100644
> index 0000000000..0d4733f666
> --- /dev/null
> +++ b/builtin/last-modified.c
> @@ -0,0 +1,43 @@
> +#include "git-compat-util.h"
> +#include "last-modified.h"
> +#include "hex.h"
> +#include "quote.h"
> +#include "config.h"
> +#include "object-name.h"
> +#include "parse-options.h"
> +#include "builtin.h"
> +
> +static void show_entry(const char *path, const struct commit *commit, void *d)
> +{
> + struct last_modified *lm = d;
> +
> + if (commit->object.flags & BOUNDARY)
> + putchar('^');
> + printf("%s\t", oid_to_hex(&commit->object.oid));
> +
> + if (lm->rev.diffopt.line_termination)
> + write_name_quoted(path, stdout, '\n');
> + else
> + printf("%s%c", path, '\0');
> +
> + fflush(stdout);
> +}
> +
> +int cmd_last_modified(int argc,
> + const char **argv,
> + const char *prefix,
> + struct repository *repo)
> +{
> + int ret = 0;
`ret` is basically unused here, we only use it to return 0.
> diff --git a/last-modified.c b/last-modified.c
> new file mode 100644
> index 0000000000..9283f8fcae
> --- /dev/null
> +++ b/last-modified.c
> @@ -0,0 +1,213 @@
> +#include "git-compat-util.h"
> +#include "last-modified.h"
> +#include "commit.h"
> +#include "diffcore.h"
> +#include "diff.h"
> +#include "object.h"
> +#include "revision.h"
> +#include "repository.h"
> +#include "log-tree.h"
> +
> +struct last_modified_entry {
> + struct hashmap_entry hashent;
> + struct object_id oid;
> + struct commit *commit;
> + const char path[FLEX_ARRAY];
> +};
> +
> +static void add_from_diff(struct diff_queue_struct *q,
> + struct diff_options *opt UNUSED,
> + void *data)
> +{
> + struct last_modified *lm = data;
> +
> + for (int i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + struct last_modified_entry *ent;
> + const char *path = p->two->path;
> +
> + FLEX_ALLOC_STR(ent, path, path);
> + oidcpy(&ent->oid, &p->two->oid);
> + hashmap_entry_init(&ent->hashent, strhash(ent->path));
> + hashmap_add(&lm->paths, &ent->hashent);
> + }
> +}
>
> +static int add_from_revs(struct last_modified *lm)
> +{
> + size_t count = 0;
> + struct diff_options diffopt;
> +
> + memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
> + copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
> + diffopt.output_format = DIFF_FORMAT_CALLBACK;
> + diffopt.format_callback = add_from_diff;
> + diffopt.format_callback_data = lm;
As far as I understand we populate `paths` from the diff here, and
`paths` later on acts as a filter of paths we're interested in? Might be
nice to add a comment explaining the intent of this.
> + for (size_t i = 0; i < lm->rev.pending.nr; i++) {
> + struct object_array_entry *obj = lm->rev.pending.objects + i;
> +
> + if (obj->item->flags & UNINTERESTING)
> + continue;
> +
> + if (count++)
> + return error(_("can only get last-modified one tree at a time"));
It's a bit funny that `count` is pretending to be a counter even though
it ultimately is only a boolean flag indicating whether we have already
seen an interesting item.
> + diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
> + &obj->item->oid, "", &diffopt);
> + diff_flush(&diffopt);
> + }
> + clear_pathspec(&diffopt.pathspec);
Shouldn't we call `diff_free()` instead of `clear_pathspec` to clear the
whole `struct diff_options`?
> +
> + return 0;
> +}
> +
> +static int last_modified_entry_hashcmp(const void *unused UNUSED,
> + const struct hashmap_entry *hent1,
> + const struct hashmap_entry *hent2,
> + const void *path)
> +{
> + const struct last_modified_entry *ent1 =
> + container_of(hent1, const struct last_modified_entry, hashent);
> + const struct last_modified_entry *ent2 =
> + container_of(hent2, const struct last_modified_entry, hashent);
> + return strcmp(ent1->path, path ? path : ent2->path);
> +}
> +
> +void last_modified_init(struct last_modified *lm,
> + struct repository *r,
> + const char *prefix,
> + int argc, const char **argv)
> +{
> + memset(lm, 0, sizeof(*lm));
> + hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
> +
> + repo_init_revisions(r, &lm->rev, prefix);
> + lm->rev.def = "HEAD";
> + lm->rev.combine_merges = 1;
> + lm->rev.show_root_diff = 1;
> + lm->rev.boundary = 1;
> + lm->rev.no_commit_id = 1;
> + lm->rev.diff = 1;
> + if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
> + die(_("unknown last-modified argument: %s"), argv[1]);
> +
> + if (add_from_revs(lm) < 0)
> + die(_("unable to setup last-modified"));
Given that this is library code, would we rather have
`last_modified_init()` return an error code and let the caller die?
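Roughly the split meant here, as a self-contained sketch. The names
`toy_ctx`, `toy_init()` and `toy_cmd()` are hypothetical and only model the
control flow; they are not functions from the patch. The only Git fact
relied upon is that `die()` exits with status 128.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the library state. */
struct toy_ctx {
	int nr_trees;
};

/* Library side: report the problem and return -1, never exit. */
int toy_init(struct toy_ctx *ctx, int nr_interesting_revs)
{
	assert(ctx);
	if (nr_interesting_revs > 1) {
		fprintf(stderr, "error: can only get last-modified "
			"one tree at a time\n");
		return -1;
	}
	ctx->nr_trees = nr_interesting_revs;
	return 0;
}

/* Builtin side: decide what a failure means for the process. */
int toy_cmd(int nr_interesting_revs)
{
	struct toy_ctx ctx;

	if (toy_init(&ctx, nr_interesting_revs) < 0)
		return 128; /* mirrors the exit status of die() */
	return 0;
}
```

That way another caller of the library could recover from a bad query
instead of having the whole process terminated underneath it.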
> +}
> +
> +void last_modified_release(struct last_modified *lm)
> +{
> + hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
> + release_revisions(&lm->rev);
> +}
> +
> +struct last_modified_callback_data {
> + struct commit *commit;
> + struct hashmap *paths;
> +
> + last_modified_callback callback;
> + void *callback_data;
> +};
> +
> +static void mark_path(const char *path, const struct object_id *oid,
> + struct last_modified_callback_data *data)
> +{
> + struct last_modified_entry *ent;
> +
> + /* Is it even a path that we are interested in? */
> + ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
> + struct last_modified_entry, hashent);
> + if (!ent)
> + return;
Yup, so this here is the filter to figure out whether we care for a
path, which uses the `paths` map we have populated at the beginning.
> + /* Have we already found a commit? */
> + if (ent->commit)
> + return;
Can this case even be hit? We remove the entry from the map once we have
seen it, so I'd expect that we never hit the same commit map entry
twice. If so, can this be converted to a `BUG()` or am I missing the
obvious?
> + /*
> + * Is it arriving at a version of interest, or is it from a side branch
> + * which did not contribute to the final state?
> + */
> + if (!oideq(oid, &ent->oid))
> + return;
> +
> + ent->commit = data->commit;
> + if (data->callback)
> + data->callback(path, data->commit, data->callback_data);
> +
> + hashmap_remove(data->paths, &ent->hashent, path);
And we end up removing that entry from paths so that we don't revisit it
in the future. After all, we're only interested in a single commit per
path.
> + free(ent);
> +}
> +
> +static void last_modified_diff(struct diff_queue_struct *q,
> + struct diff_options *opt UNUSED, void *cbdata)
> +{
> + struct last_modified_callback_data *data = cbdata;
> +
> + for (int i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + switch (p->status) {
> + case DIFF_STATUS_DELETED:
> + /*
> + * There's no point in feeding a deletion, as it could
> + * not have resulted in our current state, which
> + * actually has the file.
> + */
> + break;
> +
> + default:
> + /*
> + * Otherwise, we care only that we somehow arrived at
> + * a final path/sha1 state. Note that this covers some
> + * potentially controversial areas, including:
> + *
> + * 1. A rename or copy will be found, as it is the
> + * first time the content has arrived at the given
> + * path.
> + *
> + * 2. Even a non-content modification like a mode or
> + * type change will trigger it.
Curious, but sensible. We're looking for the last time a specific tree
entry was changed, and that of course includes modifications. I could
totally see that we may eventually want to add a flag that ignores such
mode changes and only presents content changes. But for now I agree that
this is sensible.
> + * We take the inclusive approach for now, and find
> + * anything which impacts the path. Options to tweak
> + * the behavior (e.g., to "--follow" the content across
> + * renames) can come later.
> + */
> + mark_path(p->two->path, &p->two->oid, data);
> + break;
> + }
> + }
> +}
> +
> +int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
> +{
> + struct last_modified_callback_data data;
> +
> + data.paths = &lm->paths;
> + data.callback = cb;
> + data.callback_data = cbdata;
> +
> + lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
> + lm->rev.diffopt.format_callback = last_modified_diff;
> + lm->rev.diffopt.format_callback_data = &data;
> +
> + prepare_revision_walk(&lm->rev);
> +
> + while (hashmap_get_size(&lm->paths)) {
Okay, and this is the core of our logic: we continue walking the tree
until there are no more paths that we care about.
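Boiled down to a toy model, the loop does something like the following.
This is only a sketch: `toy_commit`, `toy_path` and `toy_last_modified()`
are hypothetical, the history is a plain newest-first array, and the blob
comparison done in `mark_path()` is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHANGED 4

/* Toy commit: a name plus the paths it changed (NULL-padded). */
struct toy_commit {
	const char *name;
	const char *changed[TOY_MAX_CHANGED];
};

struct toy_path {
	const char *path;
	const char *last_modified; /* NULL while still pending */
};

/*
 * Walk `commits` from newest to oldest and record, for each pending path,
 * the first commit that touches it. Stops early once nothing is pending.
 * Returns the number of paths still pending at the end of the walk.
 */
size_t toy_last_modified(const struct toy_commit *commits, size_t nr_commits,
			 struct toy_path *paths, size_t nr_paths)
{
	size_t pending = 0;

	assert(paths || !nr_paths);
	for (size_t i = 0; i < nr_paths; i++)
		if (!paths[i].last_modified)
			pending++;

	for (size_t c = 0; c < nr_commits && pending; c++) {
		for (size_t k = 0; k < TOY_MAX_CHANGED; k++) {
			const char *changed = commits[c].changed[k];

			if (!changed)
				break;
			for (size_t i = 0; i < nr_paths; i++) {
				if (paths[i].last_modified ||
				    strcmp(paths[i].path, changed))
					continue;
				paths[i].last_modified = commits[c].name;
				pending--;
			}
		}
	}
	return pending;
}
```

The early exit on `pending == 0` corresponds to the
`hashmap_get_size(&lm->paths)` loop condition quoted above.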
> diff --git a/last-modified.h b/last-modified.h
> new file mode 100644
> index 0000000000..42a819d979
> --- /dev/null
> +++ b/last-modified.h
> @@ -0,0 +1,27 @@
> +#ifndef LAST_MODIFIED_H
> +#define LAST_MODIFIED_H
> +
> +#include "commit.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +struct last_modified {
> + struct hashmap paths;
> + struct rev_info rev;
> +};
> +
> +void last_modified_init(struct last_modified *lm,
> + struct repository *r,
> + const char *prefix,
> + int argc, const char **argv);
> +
> +void last_modified_release(struct last_modified *);
> +
> +typedef void (*last_modified_callback)(const char *path,
> + const struct commit *commit,
> + void *data);
> +int last_modified_run(struct last_modified *lm,
> + last_modified_callback cb,
> + void *cbdata);
> +#endif /* LAST_MODIFIED_H */
It would be nice to have some documentation for each of these functions
as well as a bit of a higher-level conceptual info.
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH RFC v2 5/5] last-modified: initialize revision machinery without walk
2025-05-23 9:33 ` [PATCH RFC v2 5/5] last-modified: initialize revision machinery without walk Toon Claes
@ 2025-05-27 10:39 ` Patrick Steinhardt
0 siblings, 0 replies; 135+ messages in thread
From: Patrick Steinhardt @ 2025-05-27 10:39 UTC (permalink / raw)
To: Toon Claes
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Fri, May 23, 2025 at 11:33:52AM +0200, Toon Claes wrote:
> In a previous commit we inserted a call to 'prepare_revision_walk()'
> before we started our traversal. This was done when we leveraged the
> revision machinery more (at the time, we were leaning on
> 'log_tree_commit()' which only worked after calling
> 'prepare_revision_walk()').
>
> But, we have since dropped 'log_tree_commit()', so we don't need most of
> the initialization work of 'prepare_revision_walk()'. Now we ask it to
> do very little work during initialization by setting the '->no_walk'
> flag to '1', which leaves its internal state alone enough that we can
> still function as normal.
>
> Unfortunately, this means that we now no longer complain about
> non-commit inputs, since the revision machinery check this for us (it
> just silently ignores them).
Hm. Should we maybe have a manual check that all inputs are commits? It
doesn't feel right to me to silently ignore invalid queries.
Patrick
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH RFC v2 3/5] last-modified: use Bloom filters when available
2025-05-23 9:33 ` [PATCH RFC v2 3/5] last-modified: use Bloom filters when available Toon Claes
@ 2025-05-27 10:40 ` Patrick Steinhardt
2025-06-13 11:05 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Patrick Steinhardt @ 2025-05-27 10:40 UTC (permalink / raw)
To: Toon Claes
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Fri, May 23, 2025 at 11:33:50AM +0200, Toon Claes wrote:
> Our 'git last-modified' performs a revision walk, and computes a diff at
> each point in the walk to figure out whether a given revision changed
> any of the paths it considers interesting.
>
> When changed-path Bloom filters are available, we can avoid computing
> many such diffs. Before computing a diff, we first check if any of the
> remaining paths of interest were possibly changed at a given commit by
> consulting its Bloom filter. If any of them are, we are resigned to
> compute the diff.
>
> If none of those queries returned "maybe", we know that the given commit
> doesn't contain any changed paths which are interesting to us. So, we
> can avoid computing it in this case.
>
> This results in a substantial performance speed-up in common cases of
> 'git last-modified'. In the kernel, here is the before and after (all
> times computed with best-of-five):
>
> With commit-graphs (but no Bloom filters):
>
> real 0m5.133s
> user 0m4.942s
> sys 0m0.180s
>
> ...and with Bloom filters:
>
> real 0m0.936s
> user 0m0.842s
> sys 0m0.092s
>
> These times are with my development-version of Git, so it's compiled
> without optimizations. Compiling instead with `-O3`, the results look
> even better:
>
> real 0m0.754s
> user 0m0.661s
> sys 0m0.092s
I'm sure that the old state without bloom filters will also improve a
bit?
> Signed-off-by: Toon Claes <toon@iotcl.com>
> ---
> last-modified.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/last-modified.c b/last-modified.c
> index 9283f8fcae..f628434929 100644
> --- a/last-modified.c
> +++ b/last-modified.c
> @@ -92,12 +99,21 @@ void last_modified_init(struct last_modified *lm,
> if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
> die(_("unknown last-modified argument: %s"), argv[1]);
>
> + (void)generation_numbers_enabled(lm->rev.repo);
Why the `(void)` cast? And why even call this in the first place? This
definitely needs a comment and smells like funky design in our commit
graph subsystem where we rely on side effects of one function to leak
into a different function.
> + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
> +
> if (add_from_revs(lm) < 0)
> die(_("unable to setup last-modified"));
> }
>
> void last_modified_release(struct last_modified *lm)
> {
> + struct hashmap_iter iter;
> + struct last_modified_entry *ent;
> +
> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
> + clear_bloom_key(&ent->key);
> + }
The curly braces shouldn't be needed.
> @@ -180,6 +197,30 @@ static void last_modified_diff(struct diff_queue_struct *q,
> }
> }
>
> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
> +{
> + struct bloom_filter *filter;
> + struct last_modified_entry *ent;
> + struct hashmap_iter iter;
> +
> + if (!lm->rev.bloom_filter_settings)
> + return 1;
> +
> + if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
> + return 1;
Hm, okay, so here we require generation numbers to exist. Why is that
though? Shouldn't we only care about bloom filters? I don't quite get
that part yet.
> + filter = get_bloom_filter(lm->rev.repo, origin);
> + if (!filter)
> + return 1;
> +
> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
> + if (bloom_filter_contains(filter, &ent->key,
> + lm->rev.bloom_filter_settings))
> + return 1;
> + }
> + return 0;
> +}
> +
Okay, and here we check whether any of our desired paths may be
contained in the bloom filter.
> int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
> {
> struct last_modified_callback_data data;
> @@ -199,6 +240,9 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
> if (!data.commit)
> break;
>
> + if (!maybe_changed_path(lm, data.commit))
> + continue;
If Bloom filters are available and the commit's filter contains none of
our paths, we can safely skip over the commit. Otherwise we'll have
to check whether the commit really is interesting.
Makes sense.
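The skip logic discussed above can be sketched in isolation. This is a
minimal, self-contained illustration and not the code from the patch: the
`toy_*` names, the 64-bit filter and the seeded FNV-1a hash are all
assumptions made for the example. Git's real changed-path filters use
their own hashing and settings.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a per-commit changed-path filter: 64 bits, 2 hashes. */
struct toy_filter {
	uint64_t bits;
};

/* FNV-1a with a seed mixed in (hypothetical choice for this sketch). */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/* The two bit positions a path maps to. */
static uint64_t toy_mask(const char *path)
{
	return (UINT64_C(1) << (toy_hash(path, 0) % 64)) |
	       (UINT64_C(1) << (toy_hash(path, 0x9e3779b9u) % 64));
}

static void toy_filter_add(struct toy_filter *f, const char *path)
{
	f->bits |= toy_mask(path);
}

/* Returns 0 for "definitely not changed", 1 for "maybe changed". */
static int toy_filter_maybe_contains(const struct toy_filter *f,
				     const char *path)
{
	uint64_t mask = toy_mask(path);

	return (f->bits & mask) == mask;
}

/*
 * Same shape as maybe_changed_path(): without a filter we must answer
 * "maybe" and compute the diff; with one, the commit can be skipped only
 * if every remaining path is definitely absent from the filter.
 */
static int toy_maybe_changed_path(const struct toy_filter *f,
				  const char **paths, size_t nr)
{
	if (!f)
		return 1;
	for (size_t i = 0; i < nr; i++)
		if (toy_filter_maybe_contains(f, paths[i]))
			return 1;
	return 0;
}
```

A caller would run the expensive diff only when `toy_maybe_changed_path()`
returns 1. A false positive costs one unnecessary diff but never a wrong
answer, which is why a missing filter has to map to "maybe".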
Patrick
* Re: [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified
2025-05-25 20:07 ` Justin Tobler
@ 2025-06-05 8:32 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-06-05 8:32 UTC (permalink / raw)
To: Justin Tobler
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Justin Tobler <jltobler@gmail.com> writes:
> On 25/05/23 11:33AM, Toon Claes wrote:
>> +-r::
>> + Recurse into subtrees.
>> +
>> +-t::
>> + Show tree entry itself as well as subtrees. Implies `-r`.
>
> This left me wondering about the default behavior regarding displaying
> trees when neither `-t` nor `-r` is specified. Do we omit showing when
> a tree was last modified?
When both are omitted, the command returns the commit that last modified
the tree. Basically, without either option it returns which commit
touched a tree entry last; it doesn't care whether that tree entry is a
tree or a blob.
I shall add something to the docs for `-r` that explains what happens if
it is omitted.
>> +[--] <path>...::
>> + For each _<path>_ given, the commit which last modified it is returned.
>> + Without an optional path parameter, all files and subdirectories
>> + of the current working directory are included in the
>
> are include in the? I assume you meant to say the search/operation.
Whoops.
>> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
>> new file mode 100644
>> index 0000000000..0d4733f666
>> --- /dev/null
>> +++ b/builtin/last-modified.c
>> @@ -0,0 +1,43 @@
>> +#include "git-compat-util.h"
>> +#include "last-modified.h"
>> +#include "hex.h"
>> +#include "quote.h"
>> +#include "config.h"
>> +#include "object-name.h"
>> +#include "parse-options.h"
>> +#include "builtin.h"
>
> For builtins, "builtin.h" should be included at the top and
> "git-compat-util.h" should be omitted.
Thanks, will address.
--
Cheers,
Toon
* Re: [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified
2025-05-27 10:39 ` Patrick Steinhardt
@ 2025-06-13 9:34 ` Toon Claes
2025-06-13 9:52 ` Kristoffer Haugsbakk
0 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-06-13 9:34 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Patrick Steinhardt <ps@pks.im> writes:
> On Fri, May 23, 2025 at 11:33:48AM +0200, Toon Claes wrote:
>> diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
>> new file mode 100644
>> index 0000000000..1af38f402e
>> --- /dev/null
>> +++ b/Documentation/git-last-modified.adoc
>> @@ -0,0 +1,49 @@
>> +git-last-modified(1)
>> +====================
>> +
>> +NAME
>> +----
>> +git-last-modified - EXPERIMENTAL: Show when files were last modified
>
> Nit: we don't have the EXPERIMENTAL label here for git-switch(1) or
> git-restore(1).
But we do for `git-replay(1)`. Because I haven't gotten much feedback
about the usage of the command, I wanted to be on the safe side and not
commit to the behavior. Marking it EXPERIMENTAL would allow us to make
changes to its interface without _breaking_. But I wouldn't mind
dropping the experimental status.
>> +
>> +
>> +SYNOPSIS
>> +--------
>> +[synopsis]
>> +git last-modified [-r] [<revision-range>] [[--] <path>...]
>> +
>> +DESCRIPTION
>> +-----------
>> +
>> +Shows which commit last modified each of the relevant files and subdirectories.
>> +
>> +THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
>> +
>> +OPTIONS
>> +-------
>> +
>> +-r::
>> + Recurse into subtrees.
>> +
>> +-t::
>> + Show tree entry itself as well as subtrees. Implies `-r`.
>
> These flags aren't yet supported in this version, are they?
They are, but I see there are no tests for `-t`. I shall add them.
>
>> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
>> new file mode 100644
>> index 0000000000..0d4733f666
>> --- /dev/null
>> +++ b/builtin/last-modified.c
>> @@ -0,0 +1,43 @@
>> +#include "git-compat-util.h"
>> +#include "last-modified.h"
>> +#include "hex.h"
>> +#include "quote.h"
>> +#include "config.h"
>> +#include "object-name.h"
>> +#include "parse-options.h"
>> +#include "builtin.h"
>> +
>> +static void show_entry(const char *path, const struct commit *commit, void *d)
>> +{
>> + struct last_modified *lm = d;
>> +
>> + if (commit->object.flags & BOUNDARY)
>> + putchar('^');
>> + printf("%s\t", oid_to_hex(&commit->object.oid));
>> +
>> + if (lm->rev.diffopt.line_termination)
>> + write_name_quoted(path, stdout, '\n');
>> + else
>> + printf("%s%c", path, '\0');
>> +
>> + fflush(stdout);
>> +}
>> +
>> +int cmd_last_modified(int argc,
>> + const char **argv,
>> + const char *prefix,
>> + struct repository *repo)
>> +{
>> + int ret = 0;
>
> `ret` is basically unused here, we only use it to return 0.
Good catch. Thanks!
>> diff --git a/last-modified.c b/last-modified.c
>> new file mode 100644
>> index 0000000000..9283f8fcae
>> --- /dev/null
>> +++ b/last-modified.c
>> @@ -0,0 +1,213 @@
>> +#include "git-compat-util.h"
>> +#include "last-modified.h"
>> +#include "commit.h"
>> +#include "diffcore.h"
>> +#include "diff.h"
>> +#include "object.h"
>> +#include "revision.h"
>> +#include "repository.h"
>> +#include "log-tree.h"
>> +
>> +struct last_modified_entry {
>> + struct hashmap_entry hashent;
>> + struct object_id oid;
>> + struct commit *commit;
>> + const char path[FLEX_ARRAY];
>> +};
>> +
>> +static void add_from_diff(struct diff_queue_struct *q,
>> + struct diff_options *opt UNUSED,
>> + void *data)
>> +{
>> + struct last_modified *lm = data;
>> +
>> + for (int i = 0; i < q->nr; i++) {
>> + struct diff_filepair *p = q->queue[i];
>> + struct last_modified_entry *ent;
>> + const char *path = p->two->path;
>> +
>> + FLEX_ALLOC_STR(ent, path, path);
>> + oidcpy(&ent->oid, &p->two->oid);
>> + hashmap_entry_init(&ent->hashent, strhash(ent->path));
>> + hashmap_add(&lm->paths, &ent->hashent);
>> + }
>> +}
>>
>> +static int add_from_revs(struct last_modified *lm)
>> +{
>> + size_t count = 0;
>> + struct diff_options diffopt;
>> +
>> + memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
>> + copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
>> + diffopt.output_format = DIFF_FORMAT_CALLBACK;
>> + diffopt.format_callback = add_from_diff;
>> + diffopt.format_callback_data = lm;
>
> As far as I understand we populate `paths` from the diff here, and
> `paths` later on acts as a filter of paths we're interested in? Might be
> nice to add a comment explaining the intent of this.
Will do.
>> + for (size_t i = 0; i < lm->rev.pending.nr; i++) {
>> + struct object_array_entry *obj = lm->rev.pending.objects + i;
>> +
>> + if (obj->item->flags & UNINTERESTING)
>> + continue;
>> +
>> + if (count++)
>> + return error(_("can only get last-modified one tree at a time"));
>
> It's a bit funny that `count` is pretending to be a counter even though
> it ultimately is only a boolean flag whether we have already seen an
> interesting item.
Okay, I'll refactor.
>> + diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
>> + &obj->item->oid, "", &diffopt);
>> + diff_flush(&diffopt);
>> + }
>> + clear_pathspec(&diffopt.pathspec);
>
> Shouldn't we call `diff_free()` instead of `clear_pathspec` to clear the
> whole `struct diff_options`?
Yeah, that's better.
>> +
>> + return 0;
>> +}
>> +
>> +static int last_modified_entry_hashcmp(const void *unused UNUSED,
>> + const struct hashmap_entry *hent1,
>> + const struct hashmap_entry *hent2,
>> + const void *path)
>> +{
>> + const struct last_modified_entry *ent1 =
>> + container_of(hent1, const struct last_modified_entry, hashent);
>> + const struct last_modified_entry *ent2 =
>> + container_of(hent2, const struct last_modified_entry, hashent);
>> + return strcmp(ent1->path, path ? path : ent2->path);
>> +}
>> +
>> +void last_modified_init(struct last_modified *lm,
>> + struct repository *r,
>> + const char *prefix,
>> + int argc, const char **argv)
>> +{
>> + memset(lm, 0, sizeof(*lm));
>> + hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
>> +
>> + repo_init_revisions(r, &lm->rev, prefix);
>> + lm->rev.def = "HEAD";
>> + lm->rev.combine_merges = 1;
>> + lm->rev.show_root_diff = 1;
>> + lm->rev.boundary = 1;
>> + lm->rev.no_commit_id = 1;
>> + lm->rev.diff = 1;
>> + if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
>> + die(_("unknown last-modified argument: %s"), argv[1]);
>> +
>> + if (add_from_revs(lm) < 0)
>> + die(_("unable to setup last-modified"));
>
> Given that this is library code, do we rather want to have
> `last_modified_init()` return an error code and let the caller die?
Yes, agreed.
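A minimal sketch of that split, using hypothetical names (`toy_walk`,
`toy_walk_init`, `toy_cmd`) rather than the real last-modified API: the
library-style function reports the problem and returns a negative code,
and only the command-level caller decides to abort.

```c
#include <stdio.h>

struct toy_walk {
	const char *rev;
};

/* Library side: report the problem and hand a negative code back. */
static int toy_walk_init(struct toy_walk *w, int argc, const char **argv)
{
	if (argc > 2) {
		fprintf(stderr, "error: unknown argument: %s\n", argv[2]);
		return -1;
	}
	w->rev = argc > 1 ? argv[1] : "HEAD";
	return 0;
}

/* Command side: the only place that decides to give up. */
static int toy_cmd(int argc, const char **argv)
{
	struct toy_walk w;

	if (toy_walk_init(&w, argc, argv) < 0)
		return 128;	/* a real builtin would call die() here */
	printf("walking from %s\n", w.rev);
	return 0;
}
```

Keeping the abort in the caller means other users of the library function
can recover from a bad argument, or clean up first, instead of having the
process exit underneath them.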
>> +}
>> +
>> +void last_modified_release(struct last_modified *lm)
>> +{
>> + hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
>> + release_revisions(&lm->rev);
>> +}
>> +
>> +struct last_modified_callback_data {
>> + struct commit *commit;
>> + struct hashmap *paths;
>> +
>> + last_modified_callback callback;
>> + void *callback_data;
>> +};
>> +
>> +static void mark_path(const char *path, const struct object_id *oid,
>> + struct last_modified_callback_data *data)
>> +{
>> + struct last_modified_entry *ent;
>> +
>> + /* Is it even a path that we are interested in? */
>> + ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
>> + struct last_modified_entry, hashent);
>> + if (!ent)
>> + return;
>
> Yup, so this here is the filter to figure out whether we care for a
> path, which uses the `paths` map we have populated at the beginning.
>
>> + /* Have we already found a commit? */
>> + if (ent->commit)
>> + return;
>
> Can this case even be hit? We remove the entry from the map once we have
> seen it, so I'd expect that we never hit the same commit map entry
> twice. If so, can this be converted to a `BUG()` or am I missing the
> obvious?
I was wondering about that, but I took it over from the original code and
left it in. I believe it's from a version where entries weren't removed
from the hashmap. I agree a `BUG()` would be a better approach, or,
even better, we could simply delete this condition.
>> + /*
>> + * Is it arriving at a version of interest, or is it from a side branch
>> + * which did not contribute to the final state?
>> + */
>> + if (!oideq(oid, &ent->oid))
>> + return;
>> +
>> + ent->commit = data->commit;
>> + if (data->callback)
>> + data->callback(path, data->commit, data->callback_data);
>> +
>> + hashmap_remove(data->paths, &ent->hashent, path);
>
> And we end up removing that entry from paths so that we don't revisit it
> in the future. After all, we're only interested in a single commit per
> path.
True. So it doesn't even make sense for `struct last_modified_entry` to
have a `commit` attribute. I will delete that in the next version.
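The reasoning above (an entry leaves the map the first time it is matched,
so no per-entry "already found" state is needed) can be sketched with a
plain array standing in for the hashmap. Everything here is a simplified
assumption for illustration: the `toy_*` names are hypothetical, each toy
commit changes exactly one path, and the object-ID comparison that the
real `mark_path()` performs is left out.

```c
#include <stddef.h>
#include <string.h>

struct toy_commit {
	const char *id;
	const char *changed_path;	/* one changed path per commit, for brevity */
};

struct toy_result {
	const char *path;
	const char *commit_id;
};

/*
 * Walk `commits` newest-first. A path leaves `pending` the first time a
 * commit touches it, so it can never be reported twice, and the walk
 * stops as soon as nothing is pending. Returns the number of results
 * written to `out`, which must have room for `nr_pending` entries.
 */
static size_t toy_last_modified(const struct toy_commit *commits,
				size_t nr_commits,
				const char **pending, size_t nr_pending,
				struct toy_result *out)
{
	size_t nr_out = 0;

	for (size_t c = 0; c < nr_commits && nr_pending; c++) {
		for (size_t p = 0; p < nr_pending; p++) {
			if (strcmp(pending[p], commits[c].changed_path))
				continue;
			out[nr_out].path = pending[p];
			out[nr_out].commit_id = commits[c].id;
			nr_out++;
			/* Swap-remove: the order of the pending set is irrelevant. */
			pending[p] = pending[--nr_pending];
			break;
		}
	}
	return nr_out;
}
```

Because removal happens at the point of the match, a later commit that
touches the same path simply finds nothing to look up, which is the
property that makes the `ent->commit` check redundant.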
>> + free(ent);
>> +}
>> +
>> +static void last_modified_diff(struct diff_queue_struct *q,
>> + struct diff_options *opt UNUSED, void *cbdata)
>> +{
>> + struct last_modified_callback_data *data = cbdata;
>> +
>> + for (int i = 0; i < q->nr; i++) {
>> + struct diff_filepair *p = q->queue[i];
>> + switch (p->status) {
>> + case DIFF_STATUS_DELETED:
>> + /*
>> + * There's no point in feeding a deletion, as it could
>> + * not have resulted in our current state, which
>> + * actually has the file.
>> + */
>> + break;
>> +
>> + default:
>> + /*
>> + * Otherwise, we care only that we somehow arrived at
>> + * a final path/sha1 state. Note that this covers some
>> + * potentially controversial areas, including:
>> + *
>> + * 1. A rename or copy will be found, as it is the
>> + * first time the content has arrived at the given
>> + * path.
>> + *
>> + * 2. Even a non-content modification like a mode or
>> + * type change will trigger it.
>
> Curious, but sensible. We're looking for the last time a specific tree
> entry was changed, and that of course includes modifications. I could
> totally see that we may eventually want to add a flag that ignores such
> mode changes and only presents content changes. But for now I agree that
> this is sensible.
>
>> + * We take the inclusive approach for now, and find
>> + * anything which impacts the path. Options to tweak
>> + * the behavior (e.g., to "--follow" the content across
>> + * renames) can come later.
>> + */
>> + mark_path(p->two->path, &p->two->oid, data);
>> + break;
>> + }
>> + }
>> +}
>> +
>> +int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
>> +{
>> + struct last_modified_callback_data data;
>> +
>> + data.paths = &lm->paths;
>> + data.callback = cb;
>> + data.callback_data = cbdata;
>> +
>> + lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
>> + lm->rev.diffopt.format_callback = last_modified_diff;
>> + lm->rev.diffopt.format_callback_data = &data;
>> +
>> + prepare_revision_walk(&lm->rev);
>> +
>> + while (hashmap_get_size(&lm->paths)) {
>
> Okay, and this is the core of our logic: we continue walking the tree
> until there are no more paths that we care about.
>
>> diff --git a/last-modified.h b/last-modified.h
>> new file mode 100644
>> index 0000000000..42a819d979
>> --- /dev/null
>> +++ b/last-modified.h
>> @@ -0,0 +1,27 @@
>> +#ifndef LAST_MODIFIED_H
>> +#define LAST_MODIFIED_H
>> +
>> +#include "commit.h"
>> +#include "revision.h"
>> +#include "hashmap.h"
>> +
>> +struct last_modified {
>> + struct hashmap paths;
>> + struct rev_info rev;
>> +};
>> +
>> +void last_modified_init(struct last_modified *lm,
>> + struct repository *r,
>> + const char *prefix,
>> + int argc, const char **argv);
>> +
>> +void last_modified_release(struct last_modified *);
>> +
>> +typedef void (*last_modified_callback)(const char *path,
>> + const struct commit *commit,
>> + void *data);
>> +int last_modified_run(struct last_modified *lm,
>> + last_modified_callback cb,
>> + void *cbdata);
>> +#endif /* LAST_MODIFIED_H */
>
> It would be nice to have some documentation for each of these functions
> as well as a bit of a higher-level conceptual info.
Will do.
--
Cheers,
Toon
* Re: [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified
2025-06-13 9:34 ` Toon Claes
@ 2025-06-13 9:52 ` Kristoffer Haugsbakk
0 siblings, 0 replies; 135+ messages in thread
From: Kristoffer Haugsbakk @ 2025-06-13 9:52 UTC (permalink / raw)
To: Toon Claes, Patrick Steinhardt
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Fri, Jun 13, 2025, at 11:34, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
>> On Fri, May 23, 2025 at 11:33:48AM +0200, Toon Claes wrote:
>>> diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
>>> new file mode 100644
>>> index 0000000000..1af38f402e
>>> --- /dev/null
>>> +++ b/Documentation/git-last-modified.adoc
>>> @@ -0,0 +1,49 @@
>>> +git-last-modified(1)
>>> +====================
>>> +
>>> +NAME
>>> +----
>>> +git-last-modified - EXPERIMENTAL: Show when files were last modified
>>
>> Nit: we don't have the EXPERIMENTAL label here for git-switch(1) or
>> git-restore(1).
>
> But we do for `git-replay(1)`. Because I haven't gotten much feedback
> about the usage of the command, I wanted to be on the safe side and not
> commit to the behavior. Marking it EXPERIMENTAL would allow us to make
> changes to its interface without _breaking_. But I wouldn't mind
> dropping the experimental status.
As a user I appreciate that experimental commands are prominently called
out as such, like it is here. I don’t see much user testing (more as in
DX/developer experience) on the mailing list for new commands.[1]
“Experimental” in my interpretation means that I should be careful about
using it in scripts and that the developers are open to making changes
to the command interface.
† 1: I mean specifically by a slightly wider user base; those of us who
might not be able to hack on or review the relevant code much but might
be interested in what the command interface will be like.
Thanks
--
Kristoffer Haugsbakk
* Re: [PATCH RFC v2 3/5] last-modified: use Bloom filters when available
2025-05-27 10:40 ` Patrick Steinhardt
@ 2025-06-13 11:05 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-06-13 11:05 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Patrick Steinhardt <ps@pks.im> writes:
> On Fri, May 23, 2025 at 11:33:50AM +0200, Toon Claes wrote:
>> Our 'git last-modified' performs a revision walk, and computes a diff at
>> each point in the walk to figure out whether a given revision changed
>> any of the paths it considers interesting.
>>
>> When changed-path Bloom filters are available, we can avoid computing
>> many such diffs. Before computing a diff, we first check if any of the
>> remaining paths of interest were possibly changed at a given commit by
>> consulting its Bloom filter. If any of them are, we are resigned to
>> compute the diff.
>>
>> If none of those queries returned "maybe", we know that the given commit
>> doesn't contain any changed paths which are interesting to us. So, we
>> can avoid computing it in this case.
>>
>> This results in a substantial performance speed-up in common cases of
>> 'git last-modified'. In the kernel, here is the before and after (all
>> times computed with best-of-five):
>>
>> With commit-graphs (but no Bloom filters):
>>
>> real 0m5.133s
>> user 0m4.942s
>> sys 0m0.180s
>>
>> ...and with Bloom filters:
>>
>> real 0m0.936s
>> user 0m0.842s
>> sys 0m0.092s
>>
>> These times are with my development-version of Git, so it's compiled
>> without optimizations. Compiling instead with `-O3`, the results look
>> even better:
>>
>> real 0m0.754s
>> user 0m0.661s
>> sys 0m0.092s
>
> I'm sure that the old state without bloom filters will also improve a
> bit?
These are the benchmarks from the original commits I took over. They are
no longer really relevant, I'll remove them.
>> Signed-off-by: Toon Claes <toon@iotcl.com>
>> ---
>> last-modified.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 44 insertions(+)
>>
>> diff --git a/last-modified.c b/last-modified.c
>> index 9283f8fcae..f628434929 100644
>> --- a/last-modified.c
>> +++ b/last-modified.c
>> @@ -92,12 +99,21 @@ void last_modified_init(struct last_modified *lm,
>> if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
>> die(_("unknown last-modified argument: %s"), argv[1]);
>>
>> + (void)generation_numbers_enabled(lm->rev.repo);
>
> Why the `(void)` cast? And why even call this in the first place? This
> definitely needs a comment and smells like funky design in our commit
> graph subsystem where we rely on side effects of one function to leak
> into a different function.
This function calls `prepare_commit_graph()` which I think is the
important side-effect. Let me add a comment. Or would you rather see
a separate function?
>> + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
>> +
>> if (add_from_revs(lm) < 0)
>> die(_("unable to setup last-modified"));
>> }
>>
>> void last_modified_release(struct last_modified *lm)
>> {
>> + struct hashmap_iter iter;
>> + struct last_modified_entry *ent;
>> +
>> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
>> + clear_bloom_key(&ent->key);
>> + }
>
> The curly braces shouldn't be needed.
Okay.
>> @@ -180,6 +197,30 @@ static void last_modified_diff(struct diff_queue_struct *q,
>> }
>> }
>>
>> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
>> +{
>> + struct bloom_filter *filter;
>> + struct last_modified_entry *ent;
>> + struct hashmap_iter iter;
>> +
>> + if (!lm->rev.bloom_filter_settings)
>> + return 1;
>> +
>> + if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
>> + return 1;
>
> Hm, okay, so here we require generation numbers to exist. Why is that
> though? Shouldn't we only care about bloom filters? I don't quite get
> that part yet.
That's a good question. Because we ignore the return value of
`generation_numbers_enabled()` above, we shouldn't rely on generation numbers.
I verified things and did some testing, and it seems to me we can safely
remove this condition.
>> + filter = get_bloom_filter(lm->rev.repo, origin);
>> + if (!filter)
>> + return 1;
>> +
>> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
>> + if (bloom_filter_contains(filter, &ent->key,
>> + lm->rev.bloom_filter_settings))
>> + return 1;
>> + }
>> + return 0;
>> +}
>> +
>
> Okay, and here we check whether any of our desired paths may be
> contained in the bloom filter.
>
>> int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
>> {
>> struct last_modified_callback_data data;
>> @@ -199,6 +240,9 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
>> if (!data.commit)
>> break;
>>
>> + if (!maybe_changed_path(lm, data.commit))
>> + continue;
>
> If Bloom filters are available and the commit's filter contains none of
> our paths, we can safely skip over the commit. Otherwise we'll have
> to check whether the commit really is interesting.
>
> Makes sense.
>
> Patrick
>
--
Cheers,
Toon
* [PATCH RFC v3 0/3] Introduce git-last-modified(1) command
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
` (6 preceding siblings ...)
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
@ 2025-06-30 18:49 ` Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
` (8 more replies)
7 siblings, 9 replies; 135+ messages in thread
From: Toon Claes @ 2025-06-30 18:49 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason
This series adds the git-last-modified(1) subcommand. In the past the
subcommand was proposed[1] to be named git-blame-tree(1). This version
is based on the patches shared by the kind people at GitHub[2].
What is different from the series shared by GitHub:
* Renamed the subcommand from `blame-tree` to `last-modified`. There was
some consensus[5] this name works better, so let's give it a try and
see how this name feels.
* Patches for --max-depth are excluded. I think it's a separate topic to
discuss and I'm not sure it needs to be part of this series anyway. The
main patch was submitted in the previous attempt[3] and if people
consider it valuable, I'm happy to discuss that in a separate patch
series.
* The patches in 'tb/blame-tree' at Taylor's fork[4] implement a
caching layer. This feature reads/writes cached results in
`.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
size, that feature is excluded from this series. I think it's better
to submit this as a separate series.
* Squashed various commits together. For example, they introduced a flag
`--go-faster`, which later became the default and only implementation.
That story was wrapped up in a single commit.
* Dropped the patches that attempt to increase performance for tree
entries that have not been updated in a long time. In my testing I've
seen both performance improvements *and* degradation with these
changes:
Test                                        HEAD~             HEAD
------------------------------------------------------------------------------------
8020.1: top-level last-modified             4.52(4.38+0.11)   2.03(1.93+0.08) -55.1%
8020.2: top-level recursive last-modified   5.79(5.64+0.11)   8.34(8.17+0.11) +44.0%
8020.3: subdir last-modified                0.15(0.09+0.06)   0.19(0.14+0.06) +26.7%
Before we include these patches, I want to make sure these changes
have a positive impact in all/most scenarios. This can happen in a
separate series.
* The last-modified command isn't recursive by default. If you want to
recurse into subtrees, you need to pass `-r`.
* Fixed all memory leaks, and removed the use of
USE_THE_REPOSITORY_VARIABLE.
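For readers checking the perf table in the list above: the percentage in
the last column matches the relative change of the first (elapsed) number
between HEAD~ and HEAD. A minimal sketch of that arithmetic, where
`percent_change` is a hypothetical helper written for this note and not
part of the perf framework:

```c
/* Relative change of `after` versus `before`, in percent. */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, going from 4.52 to 2.03 gives roughly -55.1, and going from
5.79 to 8.34 gives roughly +44.0, so a negative value means HEAD is
faster and a positive value means it is slower.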
I've set myself as the author and added Based-on-patch-by trailers to
credit the original authors. Let me know if you disagree.
Again thanks to Taylor and the people at GitHub for sharing these
patches. I hope we can work together to get this upstreamed.
[1]: https://lore.kernel.org/git/patch-1.1-0ea849d900b-20230205T204104Z-avarab@gmail.com/
[2]: https://lore.kernel.org/git/Z+XJ+1L3PnC9Dyba@nand.local/
[3]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-3-4173133f3786@iotcl.com/
[4]: git@github.com:ttaylorr/git.git
[5]: https://lore.kernel.org/git/aCbBKj7O9LjO3SMK@pks.im/
--
Cheers,
Toon
Signed-off-by: Toon Claes <toon@iotcl.com>
---
Changes in v3:
- Updated benchmarks in commit messages.
- Removed the patches that attempt to increase performance for tree
entries that have not been updated in a long time. (see above)
- Move handling failure in `last_modified_init()` to the caller.
- Sorted #include clauses lexicographically.
- Removed unneeded `commit` in `struct last_modified_entry`.
- Renamed some functions/variables and added some comments to make it
easier to understand.
- Removed unnecessary checking of the commit-graph generation number.
- Link to v2: https://lore.kernel.org/r/20250523-toon-new-blame-tree-v2-0-101e4ca4c1c9@iotcl.com
Changes in v2:
- The subcommand is renamed from `blame-tree` to `last-modified`
- Documentation is added. Here we mark the command as experimental.
- Some test cases are added related to merges.
- Link to v1: https://lore.kernel.org/r/20250422-toon-new-blame-tree-v1-0-fdb51b8a394a@iotcl.com
---
Toon Claes (3):
last-modified: new subcommand to show when files were last modified
t/perf: add last-modified perf script
last-modified: use Bloom filters when available
.gitignore | 1 +
Documentation/git-last-modified.adoc | 49 +++++++
Documentation/meson.build | 1 +
Makefile | 2 +
builtin.h | 1 +
builtin/last-modified.c | 44 ++++++
command-list.txt | 1 +
git.c | 1 +
last-modified.c | 257 +++++++++++++++++++++++++++++++++++
last-modified.h | 35 +++++
meson.build | 2 +
t/meson.build | 2 +
t/perf/p8020-last-modified.sh | 21 +++
t/t8020-last-modified.sh | 204 +++++++++++++++++++++++++++
14 files changed, 621 insertions(+)
---
Range-diff versus v2:
1: e77d1d65aa ! 1: 00e0ff81d9 last-modified: new subcommand to show when files were last modified
@@ builtin/last-modified.c (new)
+ const char *prefix,
+ struct repository *repo)
+{
-+ int ret = 0;
+ struct last_modified lm;
+
+ repo_config(repo, git_default_config, NULL);
+
-+ last_modified_init(&lm, repo, prefix, argc, argv);
++ if (last_modified_init(&lm, repo, prefix, argc, argv))
++ die(_("error setting up last-modified traversal"));
++
+ if (last_modified_run(&lm, show_entry, &lm) < 0)
+ die(_("error running last-modified traversal"));
+
+ last_modified_release(&lm);
+
-+ return ret;
++ return 0;
+}
## command-list.txt ##
@@ git.c: static struct cmd_struct commands[] = {
## last-modified.c (new) ##
@@
+#include "git-compat-util.h"
-+#include "last-modified.h"
+#include "commit.h"
-+#include "diffcore.h"
+#include "diff.h"
-+#include "object.h"
-+#include "revision.h"
-+#include "repository.h"
++#include "diffcore.h"
++#include "last-modified.h"
+#include "log-tree.h"
++#include "object.h"
++#include "repository.h"
++#include "revision.h"
+
+struct last_modified_entry {
+ struct hashmap_entry hashent;
+ struct object_id oid;
-+ struct commit *commit;
+ const char path[FLEX_ARRAY];
+};
+
-+static void add_from_diff(struct diff_queue_struct *q,
-+ struct diff_options *opt UNUSED,
-+ void *data)
++static void add_path_from_diff(struct diff_queue_struct *q,
++ struct diff_options *opt UNUSED,
++ void *data)
+{
+ struct last_modified *lm = data;
+
@@ last-modified.c (new)
+ }
+}
+
-+static int add_from_revs(struct last_modified *lm)
++static int populate_paths_from_revs(struct last_modified *lm)
+{
-+ size_t count = 0;
++ int num_interesting = 0;
+ struct diff_options diffopt;
+
+ memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
+ copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
++ /*
++ * Use a callback to populate the paths from revs
++ */
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
-+ diffopt.format_callback = add_from_diff;
++ diffopt.format_callback = add_path_from_diff;
+ diffopt.format_callback_data = lm;
+
+ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
@@ last-modified.c (new)
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
-+ if (count++)
++ if (num_interesting++)
+ return error(_("can only get last-modified one tree at a time"));
+
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
-+ clear_pathspec(&diffopt.pathspec);
++ diff_free(&diffopt);
+
+ return 0;
+}
@@ last-modified.c (new)
+ return strcmp(ent1->path, path ? path : ent2->path);
+}
+
-+void last_modified_init(struct last_modified *lm,
++int last_modified_init(struct last_modified *lm,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv)
@@ last-modified.c (new)
+ lm->rev.no_commit_id = 1;
+ lm->rev.diff = 1;
+ if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
-+ die(_("unknown last-modified argument: %s"), argv[1]);
++ return error(_("unknown last-modified argument: %s"), argv[1]);
+
-+ if (add_from_revs(lm) < 0)
-+ die(_("unable to setup last-modified"));
++ if (populate_paths_from_revs(lm) < 0)
++ return error(_("unable to setup last-modified"));
++
++ return 0;
+}
+
+void last_modified_release(struct last_modified *lm)
@@ last-modified.c (new)
+ if (!ent)
+ return;
+
-+ /* Have we already found a commit? */
-+ if (ent->commit)
-+ return;
-+
+ /*
+ * Is it arriving at a version of interest, or is it from a side branch
+ * which did not contribute to the final state?
@@ last-modified.c (new)
+ if (!oideq(oid, &ent->oid))
+ return;
+
-+ ent->commit = data->commit;
+ if (data->callback)
+ data->callback(path, data->commit, data->callback_data);
+
@@ last-modified.c (new)
+}
+
+static void last_modified_diff(struct diff_queue_struct *q,
-+ struct diff_options *opt UNUSED, void *cbdata)
++ struct diff_options *opt UNUSED, void *cbdata)
+{
+ struct last_modified_callback_data *data = cbdata;
+
@@ last-modified.h (new)
+#define LAST_MODIFIED_H
+
+#include "commit.h"
-+#include "revision.h"
+#include "hashmap.h"
++#include "revision.h"
+
+struct last_modified {
+ struct hashmap paths;
+ struct rev_info rev;
+};
+
-+void last_modified_init(struct last_modified *lm,
++/*
++ * Initialize the last-modified machinery using command line arguments.
++ */
++int last_modified_init(struct last_modified *lm,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv);
@@ last-modified.h (new)
+typedef void (*last_modified_callback)(const char *path,
+ const struct commit *commit,
+ void *data);
++
++/*
++ * Run the last-modified traversal. For each path found the callback is called
++ * passing the path, the commit, and the cbdata.
++ */
+int last_modified_run(struct last_modified *lm,
+ last_modified_callback cb,
+ void *cbdata);
@@ t/t8020-last-modified.sh (new)
+ EOF
+'
+
++test_expect_success 'last-modified recursive with tree' '
++ check_last_modified -t <<-\EOF
++ 1 file
++ 2 a/file
++ 3 a
++ 3 a/b
++ 3 a/b/file
++ EOF
++'
++
+test_expect_success 'last-modified subdir' '
+ check_last_modified a <<-\EOF
+ 3 a
2: a9b69bf2f1 ! 2: dceac8196a t/perf: add last-modified perf script
@@
## Metadata ##
-Author: Jeff King <peff@peff.net>
+Author: Toon Claes <toon@iotcl.com>
## Commit message ##
t/perf: add last-modified perf script
@@ Commit message
correctness in the regular suite, so this is just about finding
performance regressions from one version to another.
+ Based-on-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Toon Claes <toon@iotcl.com>
## t/meson.build ##
3: ee2fe0200a ! 3: a479ef7c40 last-modified: use Bloom filters when available
@@ Commit message
doesn't contain any changed paths which are interesting to us. So, we
can avoid computing it in this case.
- This results in a substantial performance speed-up in common cases of
- 'git last-modified'. In the kernel, here is the before and after (all
- times computed with best-of-five):
+ Comparing the perf test results on git.git:
- With commit-graphs (but no Bloom filters):
-
- real 0m5.133s
- user 0m4.942s
- sys 0m0.180s
-
- ...and with Bloom filters:
-
- real 0m0.936s
- user 0m0.842s
- sys 0m0.092s
-
- These times are with my development-version of Git, so it's compiled
- without optimizations. Compiling instead with `-O3`, the results look
- even better:
-
- real 0m0.754s
- user 0m0.661s
- sys 0m0.092s
+ Test                                        HEAD~             HEAD
+ ------------------------------------------------------------------------------------
+ 8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
+ 8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
+ 8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%
+ Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
## last-modified.c ##
@@
- #include "revision.h"
- #include "repository.h"
- #include "log-tree.h"
-+#include "dir.h"
-+#include "commit-graph.h"
+ #include "git-compat-util.h"
+#include "bloom.h"
-
++#include "commit-graph.h"
+ #include "commit.h"
+ #include "diff.h"
+ #include "diffcore.h"
++#include "dir.h"
+ #include "last-modified.h"
+ #include "log-tree.h"
+ #include "object.h"
+@@
struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
- struct commit *commit;
+ struct bloom_key key;
const char path[FLEX_ARRAY];
};
-@@ last-modified.c: static void add_from_diff(struct diff_queue_struct *q,
+@@ last-modified.c: static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
@@ last-modified.c: static void add_from_diff(struct diff_queue_struct *q,
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
}
-@@ last-modified.c: void last_modified_init(struct last_modified *lm,
+@@ last-modified.c: int last_modified_init(struct last_modified *lm,
if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
- die(_("unknown last-modified argument: %s"), argv[1]);
+ return error(_("unknown last-modified argument: %s"), argv[1]);
++ /*
++ * We're not interested in generation numbers here,
++ * but calling this function to prepare the commit-graph.
++ */
+ (void)generation_numbers_enabled(lm->rev.repo);
+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
- if (add_from_revs(lm) < 0)
- die(_("unable to setup last-modified"));
- }
+ if (populate_paths_from_revs(lm) < 0)
+ return error(_("unable to setup last-modified"));
+
+@@ last-modified.c: int last_modified_init(struct last_modified *lm,
void last_modified_release(struct last_modified *lm)
{
+ struct hashmap_iter iter;
+ struct last_modified_entry *ent;
+
-+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
++ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
+ clear_bloom_key(&ent->key);
-+ }
++
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
}
@@ last-modified.c: static void last_modified_diff(struct diff_queue_struct *q,
+ if (!lm->rev.bloom_filter_settings)
+ return 1;
+
-+ if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
-+ return 1;
-+
+ filter = get_bloom_filter(lm->rev.repo, origin);
+ if (!filter)
+ return 1;
4: 5dc990d49a < -: ---------- last-modified: implement faster algorithm
5: 6a5a921a41 < -: ---------- last-modified: initialize revision machinery without walk
---
base-commit: cf6f63ea6bf35173e02e18bdc6a4ba41288acff9
change-id: 20250410-toon-new-blame-tree-bcdbb78c1c0f
Thanks
--
Toon
* [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
@ 2025-06-30 18:49 ` Toon Claes
2025-07-01 20:20 ` Kristoffer Haugsbakk
2025-07-02 11:51 ` Junio C Hamano
2025-06-30 18:49 ` [PATCH RFC v3 2/3] t/perf: add last-modified perf script Toon Claes
` (7 subsequent siblings)
8 siblings, 2 replies; 135+ messages in thread
From: Toon Claes @ 2025-06-30 18:49 UTC (permalink / raw)
To: git
Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes,
Ævar Arnfjörð Bjarmason
Similar to git-blame(1), introduce a new subcommand
git-last-modified(1). This command shows the most recent modification to
paths in a tree. It does so by expanding the tree at a given commit,
taking note of the current state of each path, and then walking
backwards through history looking for the commit where each path
arrived at its final object ID.
Based-on-patch-by: Jeff King <peff@peff.net>
Improved-by: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
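A note outside the commit message: the walk described above can be
sketched as a small, self-contained C model. This is only an
illustration under simplifying assumptions (a strictly linear history,
and integer content IDs standing in for object IDs). The names
`toy_commit` and `toy_last_modified` are hypothetical and are not part
of this patch or of Git.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PATHS 8

/* Toy commit: content ID of each path in that snapshot (0 = path absent). */
struct toy_commit {
	int content[MAX_PATHS];
};

/*
 * history[0] is the tip and history[n - 1] is the root (linear history
 * is an assumption of this sketch). For every path present at the tip,
 * result[p] receives the index of the commit that brought the path to
 * its final content; paths absent from the tip get -1.
 */
void toy_last_modified(const struct toy_commit *history, size_t n,
		       size_t npaths, int *result)
{
	size_t pending = 0;

	assert(n > 0 && npaths <= MAX_PATHS);

	/* Step 1: take note of the final state of each path at the tip. */
	for (size_t p = 0; p < npaths; p++) {
		result[p] = -1;
		if (history[0].content[p])
			pending++;
	}

	/* Step 2: walk backwards until every path has been attributed. */
	for (size_t i = 0; i < n && pending; i++) {
		for (size_t p = 0; p < npaths; p++) {
			int final = history[0].content[p];
			int parent = (i + 1 < n) ? history[i + 1].content[p] : 0;

			if (!final || result[p] >= 0)
				continue;
			/*
			 * This commit "changed the path into its final
			 * state" when it holds the final content and its
			 * parent does not.
			 */
			if (history[i].content[p] == final && parent != final) {
				result[p] = (int)i;
				pending--;
			}
		}
	}
}
```

The loop stops as soon as no paths are pending, which mirrors the
`while (hashmap_get_size(&lm->paths))` condition in last_modified_run().
The patch itself does not compare snapshots directly; it feeds each
commit's diff to mark_path() and compares the resulting object ID with
the one recorded for the path.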
.gitignore | 1 +
Documentation/git-last-modified.adoc | 49 ++++++++
Documentation/meson.build | 1 +
Makefile | 2 +
builtin.h | 1 +
builtin/last-modified.c | 44 ++++++++
command-list.txt | 1 +
git.c | 1 +
last-modified.c | 212 +++++++++++++++++++++++++++++++++++
last-modified.h | 35 ++++++
meson.build | 2 +
t/meson.build | 1 +
t/t8020-last-modified.sh | 204 +++++++++++++++++++++++++++++++++
13 files changed, 554 insertions(+)
diff --git a/.gitignore b/.gitignore
index 04c444404e..a36ee94443 100644
--- a/.gitignore
+++ b/.gitignore
@@ -87,6 +87,7 @@
/git-init-db
/git-interpret-trailers
/git-instaweb
+/git-last-modified
/git-log
/git-ls-files
/git-ls-remote
diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
new file mode 100644
index 0000000000..1af38f402e
--- /dev/null
+++ b/Documentation/git-last-modified.adoc
@@ -0,0 +1,49 @@
+git-last-modified(1)
+====================
+
+NAME
+----
+git-last-modified - EXPERIMENTAL: Show when files were last modified
+
+
+SYNOPSIS
+--------
+[synopsis]
+git last-modified [-r] [-t] [<revision-range>] [[--] <path>...]
+
+DESCRIPTION
+-----------
+
+Shows which commit last modified each of the relevant files and subdirectories.
+
+THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
+
+OPTIONS
+-------
+
+-r::
+ Recurse into subtrees.
+
+-t::
+ Show tree entry itself as well as subtrees. Implies `-r`.
+
+<revision-range>::
+ Only traverse commits in the specified revision range. When no
+ `<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
+ history leading to the current commit). For a complete list of ways to
+ spell `<revision-range>`, see the 'Specifying Ranges' section of
+ linkgit:gitrevisions[7].
+
+[--] <path>...::
+ For each _<path>_ given, the commit which last modified it is returned.
+ Without an optional path parameter, all files and subdirectories
+ of the current working directory are included in the
+
+SEE ALSO
+--------
+linkgit:git-blame[1],
+linkgit:git-log[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index 2fe1a1369d..99aeb6d0e0 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -74,6 +74,7 @@ manpages = {
'git-init.adoc' : 1,
'git-instaweb.adoc' : 1,
'git-interpret-trailers.adoc' : 1,
+ 'git-last-modified.adoc' : 1,
'git-log.adoc' : 1,
'git-ls-files.adoc' : 1,
'git-ls-remote.adoc' : 1,
diff --git a/Makefile b/Makefile
index 70d1543b6b..e611bbae51 100644
--- a/Makefile
+++ b/Makefile
@@ -1052,6 +1052,7 @@ LIB_OBJS += hook.o
LIB_OBJS += ident.o
LIB_OBJS += json-writer.o
LIB_OBJS += kwset.o
+LIB_OBJS += last-modified.o
LIB_OBJS += levenshtein.o
LIB_OBJS += line-log.o
LIB_OBJS += line-range.o
@@ -1267,6 +1268,7 @@ BUILTIN_OBJS += builtin/hook.o
BUILTIN_OBJS += builtin/index-pack.o
BUILTIN_OBJS += builtin/init-db.o
BUILTIN_OBJS += builtin/interpret-trailers.o
+BUILTIN_OBJS += builtin/last-modified.o
BUILTIN_OBJS += builtin/log.o
BUILTIN_OBJS += builtin/ls-files.o
BUILTIN_OBJS += builtin/ls-remote.o
diff --git a/builtin.h b/builtin.h
index bff13e3069..6ed6759ec4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -176,6 +176,7 @@ int cmd_hook(int argc, const char **argv, const char *prefix, struct repository
int cmd_index_pack(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_init_db(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_interpret_trailers(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_last_modified(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log_reflog(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_ls_files(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
new file mode 100644
index 0000000000..4ff058c302
--- /dev/null
+++ b/builtin/last-modified.c
@@ -0,0 +1,44 @@
+#include "git-compat-util.h"
+#include "last-modified.h"
+#include "hex.h"
+#include "quote.h"
+#include "config.h"
+#include "object-name.h"
+#include "parse-options.h"
+#include "builtin.h"
+
+static void show_entry(const char *path, const struct commit *commit, void *d)
+{
+ struct last_modified *lm = d;
+
+ if (commit->object.flags & BOUNDARY)
+ putchar('^');
+ printf("%s\t", oid_to_hex(&commit->object.oid));
+
+ if (lm->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+
+ fflush(stdout);
+}
+
+int cmd_last_modified(int argc,
+ const char **argv,
+ const char *prefix,
+ struct repository *repo)
+{
+ struct last_modified lm;
+
+ repo_config(repo, git_default_config, NULL);
+
+ if (last_modified_init(&lm, repo, prefix, argc, argv))
+ die(_("error setting up last-modified traversal"));
+
+ if (last_modified_run(&lm, show_entry, &lm) < 0)
+ die(_("error running last-modified traversal"));
+
+ last_modified_release(&lm);
+
+ return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index b7ade3ab9f..b715777b24 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -124,6 +124,7 @@ git-index-pack plumbingmanipulators
git-init mainporcelain init
git-instaweb ancillaryinterrogators complete
git-interpret-trailers purehelpers
+git-last-modified plumbinginterrogators
git-log mainporcelain info
git-ls-files plumbinginterrogators
git-ls-remote plumbinginterrogators
diff --git a/git.c b/git.c
index 07a5fe39fb..76a0b2a1a4 100644
--- a/git.c
+++ b/git.c
@@ -565,6 +565,7 @@ static struct cmd_struct commands[] = {
{ "init", cmd_init_db },
{ "init-db", cmd_init_db },
{ "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
+ { "last-modified", cmd_last_modified, RUN_SETUP },
{ "log", cmd_log, RUN_SETUP },
{ "ls-files", cmd_ls_files, RUN_SETUP },
{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
diff --git a/last-modified.c b/last-modified.c
new file mode 100644
index 0000000000..4904d00d2a
--- /dev/null
+++ b/last-modified.c
@@ -0,0 +1,212 @@
+#include "git-compat-util.h"
+#include "commit.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "last-modified.h"
+#include "log-tree.h"
+#include "object.h"
+#include "repository.h"
+#include "revision.h"
+
+struct last_modified_entry {
+ struct hashmap_entry hashent;
+ struct object_id oid;
+ const char path[FLEX_ARRAY];
+};
+
+static void add_path_from_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED,
+ void *data)
+{
+ struct last_modified *lm = data;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct last_modified_entry *ent;
+ const char *path = p->two->path;
+
+ FLEX_ALLOC_STR(ent, path, path);
+ oidcpy(&ent->oid, &p->two->oid);
+ hashmap_entry_init(&ent->hashent, strhash(ent->path));
+ hashmap_add(&lm->paths, &ent->hashent);
+ }
+}
+
+static int populate_paths_from_revs(struct last_modified *lm)
+{
+ int num_interesting = 0;
+ struct diff_options diffopt;
+
+ memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
+ copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
+ /*
+ * Use a callback to populate the paths from revs
+ */
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ diffopt.format_callback = add_path_from_diff;
+ diffopt.format_callback_data = lm;
+
+ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
+ struct object_array_entry *obj = lm->rev.pending.objects + i;
+
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
+ if (num_interesting++)
+ return error(_("can only get last-modified one tree at a time"));
+
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
+ diff_free(&diffopt);
+
+ return 0;
+}
+
+static int last_modified_entry_hashcmp(const void *unused UNUSED,
+ const struct hashmap_entry *hent1,
+ const struct hashmap_entry *hent2,
+ const void *path)
+{
+ const struct last_modified_entry *ent1 =
+ container_of(hent1, const struct last_modified_entry, hashent);
+ const struct last_modified_entry *ent2 =
+ container_of(hent2, const struct last_modified_entry, hashent);
+ return strcmp(ent1->path, path ? path : ent2->path);
+}
+
+int last_modified_init(struct last_modified *lm,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv)
+{
+ memset(lm, 0, sizeof(*lm));
+ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
+
+ repo_init_revisions(r, &lm->rev, prefix);
+ lm->rev.def = "HEAD";
+ lm->rev.combine_merges = 1;
+ lm->rev.show_root_diff = 1;
+ lm->rev.boundary = 1;
+ lm->rev.no_commit_id = 1;
+ lm->rev.diff = 1;
+ if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
+ return error(_("unknown last-modified argument: %s"), argv[1]);
+
+ if (populate_paths_from_revs(lm) < 0)
+ return error(_("unable to setup last-modified"));
+
+ return 0;
+}
+
+void last_modified_release(struct last_modified *lm)
+{
+ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
+ release_revisions(&lm->rev);
+}
+
+struct last_modified_callback_data {
+ struct commit *commit;
+ struct hashmap *paths;
+
+ last_modified_callback callback;
+ void *callback_data;
+};
+
+static void mark_path(const char *path, const struct object_id *oid,
+ struct last_modified_callback_data *data)
+{
+ struct last_modified_entry *ent;
+
+ /* Is it even a path that we are interested in? */
+ ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
+ struct last_modified_entry, hashent);
+ if (!ent)
+ return;
+
+ /*
+ * Is it arriving at a version of interest, or is it from a side branch
+ * which did not contribute to the final state?
+ */
+ if (!oideq(oid, &ent->oid))
+ return;
+
+ if (data->callback)
+ data->callback(path, data->commit, data->callback_data);
+
+ hashmap_remove(data->paths, &ent->hashent, path);
+ free(ent);
+}
+
+static void last_modified_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *cbdata)
+{
+ struct last_modified_callback_data *data = cbdata;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ switch (p->status) {
+ case DIFF_STATUS_DELETED:
+ /*
+ * There's no point in feeding a deletion, as it could
+ * not have resulted in our current state, which
+ * actually has the file.
+ */
+ break;
+
+ default:
+ /*
+ * Otherwise, we care only that we somehow arrived at
+ * a final path/sha1 state. Note that this covers some
+ * potentially controversial areas, including:
+ *
+ * 1. A rename or copy will be found, as it is the
+ * first time the content has arrived at the given
+ * path.
+ *
+ * 2. Even a non-content modification like a mode or
+ * type change will trigger it.
+ *
+ * We take the inclusive approach for now, and find
+ * anything which impacts the path. Options to tweak
+ * the behavior (e.g., to "--follow" the content across
+ * renames) can come later.
+ */
+ mark_path(p->two->path, &p->two->oid, data);
+ break;
+ }
+ }
+}
+
+int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
+{
+ struct last_modified_callback_data data;
+
+ data.paths = &lm->paths;
+ data.callback = cb;
+ data.callback_data = cbdata;
+
+ lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ lm->rev.diffopt.format_callback = last_modified_diff;
+ lm->rev.diffopt.format_callback_data = &data;
+
+ prepare_revision_walk(&lm->rev);
+
+ while (hashmap_get_size(&lm->paths)) {
+ data.commit = get_revision(&lm->rev);
+ if (!data.commit)
+ break;
+
+ if (data.commit->object.flags & BOUNDARY) {
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &data.commit->object.oid,
+ "", &lm->rev.diffopt);
+ diff_flush(&lm->rev.diffopt);
+ } else {
+ log_tree_commit(&lm->rev, data.commit);
+ }
+ }
+
+ return 0;
+}
diff --git a/last-modified.h b/last-modified.h
new file mode 100644
index 0000000000..04d5a1a5b6
--- /dev/null
+++ b/last-modified.h
@@ -0,0 +1,35 @@
+#ifndef LAST_MODIFIED_H
+#define LAST_MODIFIED_H
+
+#include "commit.h"
+#include "hashmap.h"
+#include "revision.h"
+
+struct last_modified {
+ struct hashmap paths;
+ struct rev_info rev;
+};
+
+/*
+ * Initialize the last-modified machinery using command line arguments.
+ */
+int last_modified_init(struct last_modified *lm,
+ struct repository *r,
+ const char *prefix,
+ int argc, const char **argv);
+
+void last_modified_release(struct last_modified *);
+
+typedef void (*last_modified_callback)(const char *path,
+ const struct commit *commit,
+ void *data);
+
+/*
+ * Run the last-modified traversal. For each path found the callback is called
+ * passing the path, the commit, and the cbdata.
+ */
+int last_modified_run(struct last_modified *lm,
+ last_modified_callback cb,
+ void *cbdata);
+
+#endif /* LAST_MODIFIED_H */
diff --git a/meson.build b/meson.build
index 7fea4a34d6..fc84a3c008 100644
--- a/meson.build
+++ b/meson.build
@@ -363,6 +363,7 @@ libgit_sources = [
'ident.c',
'json-writer.c',
'kwset.c',
+ 'last-modified.c',
'levenshtein.c',
'line-log.c',
'line-range.c',
@@ -607,6 +608,7 @@ builtin_sources = [
'builtin/index-pack.c',
'builtin/init-db.c',
'builtin/interpret-trailers.c',
+ 'builtin/last-modified.c',
'builtin/log.c',
'builtin/ls-files.c',
'builtin/ls-remote.c',
diff --git a/t/meson.build b/t/meson.build
index 50e89e764a..44eb2a693f 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -962,6 +962,7 @@ integration_tests = [
't8012-blame-colors.sh',
't8013-blame-ignore-revs.sh',
't8014-blame-ignore-fuzzy.sh',
+ 't8020-last-modified.sh',
't9001-send-email.sh',
't9002-column.sh',
't9003-help-autocorrect.sh',
diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
new file mode 100755
index 0000000000..921d2a0807
--- /dev/null
+++ b/t/t8020-last-modified.sh
@@ -0,0 +1,204 @@
+#!/bin/sh
+
+test_description='last-modified tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_commit 1 file &&
+ mkdir a &&
+ test_commit 2 a/file &&
+ mkdir a/b &&
+ test_commit 3 a/b/file
+'
+
+test_expect_success 'cannot run last-modified on two trees' '
+ test_must_fail git last-modified HEAD HEAD~1
+'
+
+check_last_modified() {
+ local indir= &&
+ while test $# != 0
+ do
+ case "$1" in
+ -C)
+ indir="$2"
+ shift
+ ;;
+ *)
+ break
+ ;;
+ esac &&
+ shift
+ done &&
+
+ cat >expect &&
+ test_when_finished "rm -f tmp.*" &&
+ git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
+ tr '\t' ' ' <tmp.2 >tmp.3 &&
+ sort tmp.3 >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'last-modified non-recursive' '
+ check_last_modified <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified recursive' '
+ check_last_modified -r <<-\EOF
+ 1 file
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'last-modified recursive with tree' '
+ check_last_modified -t <<-\EOF
+ 1 file
+ 2 a/file
+ 3 a
+ 3 a/b
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'last-modified subdir' '
+ check_last_modified a <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified subdir recursive' '
+ check_last_modified -r a <<-\EOF
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'last-modified from non-HEAD commit' '
+ check_last_modified HEAD^ <<-\EOF
+ 1 file
+ 2 a
+ EOF
+'
+
+test_expect_success 'last-modified from subdir defaults to root' '
+ check_last_modified -C a <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified from subdir uses relative pathspecs' '
+ check_last_modified -C a -r b <<-\EOF
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by count' '
+ check_last_modified -1 <<-\EOF
+ 3 a
+ ^2 file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by commit' '
+ check_last_modified HEAD~2..HEAD <<-\EOF
+ 3 a
+ ^1 file
+ EOF
+'
+
+test_expect_success 'only last-modified files in the current tree' '
+ git rm -rf a &&
+ git commit -m "remove a" &&
+ check_last_modified <<-\EOF
+ 1 file
+ EOF
+'
+
+test_expect_success 'cross merge boundaries in blaming' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit m1 &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit m2 &&
+ git merge m1 &&
+ check_last_modified <<-\EOF
+ m1 m1.t
+ m2 m2.t
+ EOF
+'
+
+test_expect_success 'last-modified merge for resolved conflicts' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit c1 conflict &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit c2 conflict &&
+ test_must_fail git merge c1 &&
+ test_commit resolved conflict &&
+ check_last_modified conflict <<-\EOF
+ resolved conflict
+ EOF
+'
+
+
+# Consider `file` with this content through history:
+#
+# A---B---B-------B---B
+# \ /
+# C---D
+test_expect_success 'last-modified merge ignores content from branch' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit a1 file A &&
+ test_commit a2 file B &&
+ test_commit a3 file C &&
+ test_commit a4 file D &&
+ git checkout a2 &&
+ git merge --no-commit --no-ff a4 &&
+ git checkout a2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ a2 file
+ EOF
+'
+
+# Consider `file` with this content through history:
+#
+# A---B---B---C---D---B---B
+# \ /
+# B-------B
+test_expect_success 'last-modified merge undoes changes' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit b1 file A &&
+ test_commit b2 file B &&
+ test_commit b3 file C &&
+ test_commit b4 file D &&
+ git checkout b2 &&
+ test_commit b5 file2 2 &&
+ git checkout b4 &&
+ git merge --no-commit --no-ff b5 &&
+ git checkout b2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ b2 file
+ b5 file2
+ EOF
+'
+
+test_expect_success 'last-modified complains about unknown arguments' '
+ test_must_fail git last-modified --foo 2>err &&
+ grep "unknown last-modified argument: --foo" err
+'
+
+test_done
--
2.50.0.rc0.18.gfcfe60668e
* [PATCH RFC v3 2/3] t/perf: add last-modified perf script
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-06-30 18:49 ` Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 3/3] last-modified: use Bloom filters when available Toon Claes
` (6 subsequent siblings)
8 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-06-30 18:49 UTC (permalink / raw)
To: git; +Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes
This just runs some simple last-modified commands. We already test
correctness in the regular suite, so this is just about finding
performance regressions from one version to another.
Based-on-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
t/meson.build | 1 +
t/perf/p8020-last-modified.sh | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)
diff --git a/t/meson.build b/t/meson.build
index 44eb2a693f..09f83d89ca 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -1155,6 +1155,7 @@ benchmarks = [
'perf/p7820-grep-engines.sh',
'perf/p7821-grep-engines-fixed.sh',
'perf/p7822-grep-perl-character.sh',
+ 'perf/p8020-last-modified.sh',
'perf/p9210-scalar.sh',
'perf/p9300-fast-import-export.sh',
]
diff --git a/t/perf/p8020-last-modified.sh b/t/perf/p8020-last-modified.sh
new file mode 100755
index 0000000000..a02ec907d4
--- /dev/null
+++ b/t/perf/p8020-last-modified.sh
@@ -0,0 +1,21 @@
+#!/bin/sh
+
+test_description='last-modified perf tests'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_perf 'top-level last-modified' '
+ git last-modified HEAD
+'
+
+test_perf 'top-level recursive last-modified' '
+ git last-modified -r HEAD
+'
+
+test_perf 'subdir last-modified' '
+ path=$(git ls-tree HEAD | grep ^040000 | head -n 1 | cut -f2)
+ git last-modified -r HEAD -- "$path"
+'
+
+test_done
--
2.50.0.rc0.18.gfcfe60668e
* [PATCH RFC v3 3/3] last-modified: use Bloom filters when available
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 2/3] t/perf: add last-modified perf script Toon Claes
@ 2025-06-30 18:49 ` Toon Claes
2025-07-01 23:01 ` [PATCH RFC v3 0/3] Introduce git-last-modified(1) command Junio C Hamano
` (5 subsequent siblings)
8 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-06-30 18:49 UTC (permalink / raw)
To: git; +Cc: Jeff King, Taylor Blau, Derrick Stolee, Toon Claes
Our 'git last-modified' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.
When changed-path Bloom filters are available, we can avoid computing
many such diffs. Before computing a diff, we first check if any of the
remaining paths of interest were possibly changed at a given commit by
consulting its Bloom filter. If any of them might have been, we have no
choice but to compute the diff.
If none of those queries returned "maybe", we know that the given commit
doesn't contain any changed paths which are interesting to us. So, we
can skip computing the diff for that commit.
Comparing the perf test results on git.git:
Test                                        HEAD~             HEAD
------------------------------------------------------------------------------------
8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%
Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
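A note outside the commit message: the "maybe" versus "definitely not"
reasoning above can be sketched with a toy Bloom filter in C. This is a
rough illustration only. The 64-bit filter, the seeded FNV-1a hash and
every `toy_*` name are assumptions made for the example; Git's real
changed-path filters live in bloom.c and use different hashing and
sizing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FILTER_BITS 64
#define TOY_NUM_HASHES 3

/* Seeded FNV-1a; a stand-in for whatever hash the real filter uses. */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/* Record that `path` was changed by the commit owning `filter`. */
void toy_filter_add(uint64_t *filter, const char *path)
{
	for (uint32_t k = 0; k < TOY_NUM_HASHES; k++)
		*filter |= UINT64_C(1) << (toy_hash(path, k) % TOY_FILTER_BITS);
}

/* Returns 0 for "definitely not changed" and 1 for "maybe changed". */
int toy_filter_contains(uint64_t filter, const char *path)
{
	for (uint32_t k = 0; k < TOY_NUM_HASHES; k++) {
		uint64_t bit = UINT64_C(1) << (toy_hash(path, k) % TOY_FILTER_BITS);

		if (!(filter & bit))
			return 0;
	}
	return 1;
}

/*
 * The diff is needed only when at least one pending path is a "maybe".
 * A return value of 0 means the commit can be skipped without diffing.
 */
int toy_maybe_changed_path(uint64_t filter, const char **pending, size_t nr)
{
	assert(pending || !nr);

	for (size_t i = 0; i < nr; i++)
		if (toy_filter_contains(filter, pending[i]))
			return 1;
	return 0;
}
```

A Bloom filter never reports "definitely not" for a path that was
added, so skipping a commit on a 0 result cannot lose an answer. A
false "maybe" only costs one unnecessary diff. Like
maybe_changed_path() in the patch, the sketch returns on the first
"maybe", since one hit is enough to require the diff.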
last-modified.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/last-modified.c b/last-modified.c
index 4904d00d2a..2097894c6e 100644
--- a/last-modified.c
+++ b/last-modified.c
@@ -1,7 +1,10 @@
#include "git-compat-util.h"
+#include "bloom.h"
+#include "commit-graph.h"
#include "commit.h"
#include "diff.h"
#include "diffcore.h"
+#include "dir.h"
#include "last-modified.h"
#include "log-tree.h"
#include "object.h"
@@ -11,6 +14,7 @@
struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
+ struct bloom_key key;
const char path[FLEX_ARRAY];
};
@@ -27,6 +31,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
+ if (lm->rev.bloom_filter_settings)
+ fill_bloom_key(path, strlen(path), &ent->key,
+ lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
}
@@ -94,6 +101,13 @@ int last_modified_init(struct last_modified *lm,
if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
return error(_("unknown last-modified argument: %s"), argv[1]);
+ /*
+ * We're not interested in generation numbers here,
+ * but calling this function to prepare the commit-graph.
+ */
+ (void)generation_numbers_enabled(lm->rev.repo);
+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
if (populate_paths_from_revs(lm) < 0)
return error(_("unable to setup last-modified"));
@@ -102,6 +116,12 @@ int last_modified_init(struct last_modified *lm,
void last_modified_release(struct last_modified *lm)
{
+ struct hashmap_iter iter;
+ struct last_modified_entry *ent;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
+ clear_bloom_key(&ent->key);
+
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
}
@@ -136,6 +156,7 @@ static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
+ clear_bloom_key(&ent->key);
free(ent);
}
@@ -179,6 +200,27 @@ static void last_modified_diff(struct diff_queue_struct *q,
}
}
+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
+ struct last_modified_entry *ent;
+ struct hashmap_iter iter;
+
+ if (!lm->rev.bloom_filter_settings)
+ return 1;
+
+ filter = get_bloom_filter(lm->rev.repo, origin);
+ if (!filter)
+ return 1;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ if (bloom_filter_contains(filter, &ent->key,
+ lm->rev.bloom_filter_settings))
+ return 1;
+ }
+ return 0;
+}
+
int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
{
struct last_modified_callback_data data;
@@ -198,6 +240,9 @@ int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
if (!data.commit)
break;
+ if (!maybe_changed_path(lm, data.commit))
+ continue;
+
if (data.commit->object.flags & BOUNDARY) {
diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
&data.commit->object.oid,
--
2.50.0.rc0.18.gfcfe60668e
* Re: [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified
2025-06-30 18:49 ` [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-07-01 20:20 ` Kristoffer Haugsbakk
2025-07-02 11:51 ` Junio C Hamano
1 sibling, 0 replies; 135+ messages in thread
From: Kristoffer Haugsbakk @ 2025-07-01 20:20 UTC (permalink / raw)
To: Toon Claes, git
Cc: Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Mon, Jun 30, 2025, at 20:49, Toon Claes wrote:
> + of the current working directory are included in the
in the search/traversal?
> [snip]
> + * a final path/sha1 state. Note that this covers some
nit: oid state?
> + * potentially controversial areas, including:
> + *
> + * 1. A rename or copy will be found, as it is the
> + * first time the content has arrived at the given
> + * path.
> + *
> + * 2. Even a non-content modification like a mode or
> + * type change will trigger it.
> + *
> + * We take the inclusive approach for now, and find
> + * anything which impacts the path. Options to tweak
> + * the behavior (e.g., to "--follow" the content across
> + * renames) can come later.
> + */
> + mark_path(p->two->path, &p->two->oid, data);
> + break;
> [snip]
* Re: [PATCH RFC v2 0/5] Introduce git-last-modified(1) command
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
` (4 preceding siblings ...)
2025-05-23 9:33 ` [PATCH RFC v2 5/5] last-modified: initialize revision machinery without walk Toon Claes
@ 2025-07-01 20:35 ` Kristoffer Haugsbakk
2025-07-01 21:06 ` Junio C Hamano
5 siblings, 1 reply; 135+ messages in thread
From: Kristoffer Haugsbakk @ 2025-07-01 20:35 UTC (permalink / raw)
To: Toon Claes, git
Cc: Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Fri, May 23, 2025, at 11:33, Toon Claes wrote:
> This is another attempt to upstream the ~~git-blame-tree(1)~~
> git-last-modified(1) subcommand. After my previous attempt[1] the
> people of GitHub shared their version of the subcommand, and this
> version integrates those changes.
>
> What is different from the series shared by GitHub:
>
> * Renamed the subcommand from `blame-tree` to `last-modified`. There was
> some consensus[4] this name works better, so let's give it a try and
> see how this name feels.
>
> * Patches for --max-depth are excluded. I think it's a separate topic to
> discuss and I'm not sure it needs to be part of series anyway. The
> main patch was submitted in the previous attempt[2] and if people
> consider it valuable, I'm happy to discuss that in a separate patch
> series.
>
> * The patches in 'tb/blame-tree' at Taylor's fork[3] implements a
> caching layer. This feature reads/writes cached results in
> `.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
> size, that feature is excluded from this series. I think it's better
> to submit this as a separate series.
>
> * Squashed various commits together. Like they introduced a flag
> `--go-faster`, which later became the default and only implementation.
> That story was wrapped up in a single commit.
>
> * The last-modified command isn't recursive by default. If you want
> recurse into subtrees, you need to pass `-r`.
>
> * Fixed all memory leaks, and removed the use of
> USE_THE_REPOSITORY_VARIABLE.
>
> I've attempted to reuse commit messages as good as possible, but feel
> free to correct me where you think I didn't give proper credit or messed
> up. Although I have no idea what to do with the Signed-off-by trailers.
>
> I didn't modify the benchmark results in the commit messages, simply
> because I didn't get comparable results. In my benchmarks the difference
> between two implementations was negligible, and even in some scenarios
> the performance was worse in the "improved" implementation. As far as I
> can tell, I didn't break anything in my refactoring, because the version
> in these patches acts similar to Taylor's branch. To be honest, I cannot
> explain why...?
>
> Again thanks to Taylor and the people at GitHub for sharing these
> patches. I hope we can work together to get this upstreamed.
>
> [1]:
> https://lore.kernel.org/git/20250326-toon-blame-tree-v1-0-4173133f3786@iotcl.com/
> [2]:
> https://lore.kernel.org/git/20250326-toon-blame-tree-v1-3-4173133f3786@iotcl.com/
> [3]: git@github.com:ttaylorr/git.git
> [4]: https://lore.kernel.org/git/aCbBKj7O9LjO3SMK@pks.im/
> --
> Cheers,
> Toon
>
> Signed-off-by: Toon Claes <toon@iotcl.com>
> ---
> Changes in v2:
> - The subcommand is renamed from `blame-tree` to `last-modified`
> - Documentation is added. Here we mark the command as experimental.
> - Some test cases are added related to merges.
> - Link to v1:
> https://lore.kernel.org/r/20250422-toon-new-blame-tree-v1-0-fdb51b8a394a@iotcl.com
It feels like the command strays a bit from the usual patterns to me. For paths/files
that is. I like this:
```
$ git last-modified -r refs.c refs.h
062b914c841329a003f74e1340ea5178391274a6 refs.c
47478802daddf3f9916111307f153c6298ffc0bc refs.h
```
I ask for two files and I get those in the output.
But for individual files in subdirectories:
```
$ git last-modified refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
3691fe72d927658ae77ade7fe967544fc6739e67 Documentation
062b914c841329a003f74e1340ea5178391274a6 refs.c
47478802daddf3f9916111307f153c6298ffc0bc refs.h
```
Same as if I ask for `Documentation`:
```
$ git last-modified refs.c refs.h Documentation/
3691fe72d927658ae77ade7fe967544fc6739e67 Documentation
062b914c841329a003f74e1340ea5178391274a6 refs.c
47478802daddf3f9916111307f153c6298ffc0bc refs.h
```
But I didn’t ask for the directory first. I asked for two files.
I have to use `-r` (recurse):
```
$ git last-modified -r refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
3691fe72d927658ae77ade7fe967544fc6739e67 Documentation/git-last-modified.adoc
062b914c841329a003f74e1340ea5178391274a6 refs.c
47478802daddf3f9916111307f153c6298ffc0bc refs.h
0fbe93b36c05bbf4156c157f27998938ce312265 Documentation/git-config.adoc
```
And `-r` with a directory like `Documentation` will recurse through that
directory.
As a user I imagine I want `-r` to control whether to, say, only show
each directory under `Documentation`. But now you seem to get a special
case of allowing directly listing first-level files but not the ones in
subdirectories.
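
A toy model of the behaviour described above, assuming the non-recursive
mode simply reports the first path component of anything below the top
level. This is an illustration only, not the code from the series;
`toy_reported_path` and its parameters are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy into `out` the name a listing would report for `path`.
 * Without recursion only the first path component is visible; with
 * recursion the full path is kept.
 * Returns 0 on success, -1 if `out` is too small.
 */
int toy_reported_path(const char *path, int recursive,
		      char *out, size_t outlen)
{
	const char *slash;
	size_t len;

	assert(path && out);
	slash = recursive ? NULL : strchr(path, '/');
	len = slash ? (size_t)(slash - path) : strlen(path);
	if (len + 1 > outlen)
		return -1;
	memcpy(out, path, len);
	out[len] = '\0';
	return 0;
}
```

Under this model `refs.c` is reported as-is in both modes, while
`Documentation/git-config.adoc` collapses to `Documentation` unless `-r` is
given, which matches the output shown earlier in this message.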
I’m more used to being able to use individual files if I want that
as well as pathspecs for recursion. But now I just get the directory:
```
$ git last-modified -- 't/*'
532d9a0984e6464deadb6bdb0287fcce2990adc9 t
```
But I can still use pathspec magic which is nice:
```
$ git last-modified -r -- 't/*' ':^t/t1016*' | grep 1016
<empty>
```
I’m imagining that you may want to feed a specific pathspec to the
command and get only the output for those that match, without having
to worry about turning on recursion, since that may cover some of the things
you want but also make it do too much. Well maybe that is just as
controllable here but it seems less obvious than for commands where the
pathspec is used more directly (?). I appreciate when these commands
allow me to express things directly without postprocessing (no excessive
pipelining).
Also you get a sort of trailing error if you give a non-existing option:
```
$ git last-modified --recursive
error: unknown last-modified argument: --recursive
fatal: error setting up last-modified traversal
```
* Re: [PATCH RFC v2 0/5] Introduce git-last-modified(1) command
2025-07-01 20:35 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Kristoffer Haugsbakk
@ 2025-07-01 21:06 ` Junio C Hamano
2025-07-01 21:30 ` Kristoffer Haugsbakk
0 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-07-01 21:06 UTC (permalink / raw)
To: Kristoffer Haugsbakk
Cc: Toon Claes, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
"Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> writes:
> It feels like the command strays a bit from the usual patterns to me. For paths/files
> that is. I like this:
>
> ```
> $ git last-modified -r refs.c refs.h
> 062b914c841329a003f74e1340ea5178391274a6 refs.c
> 47478802daddf3f9916111307f153c6298ffc0bc refs.h
> ```
I am not getting this example. Unless "-r" stands for "reverse",
the above looks totally expected.
> I ask for two files and I get those in the output.
>
> But for individual files in subdirectories:
>
> ```
> $ git last-modified refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
> 3691fe72d927658ae77ade7fe967544fc6739e67 Documentation
> 062b914c841329a003f74e1340ea5178391274a6 refs.c
> 47478802daddf3f9916111307f153c6298ffc0bc refs.h
> ```
I am indifferent to this outcome. I do not mind the tool giving
Documentation/ even when paths inside it are asked about, when it is
not asked to go recursive.
> Same as if I ask for `Documentation`:
>
> ```
> $ git last-modified refs.c refs.h Documentation/
> 3691fe72d927658ae77ade7fe967544fc6739e67 Documentation
> 062b914c841329a003f74e1340ea5178391274a6 refs.c
> 47478802daddf3f9916111307f153c6298ffc0bc refs.h
> ```
>
> But I didn’t ask for the directory first. I asked for two files.
I do not see anything unexpected. Have you seen "git ls-tree"
output without -r(ecursive) before?
$ git ls-tree HEAD refs.c refs.h Documentation
040000 tree a0f7113f63a19b70dff14bfd9f8f82809f5068e1 Documentation
100644 blob dce5c49ca2ba65fd6a2974e38f67134215bee369 refs.c
100644 blob 46a6008e07f2624239139cd8b2ff712545f07d3f refs.h
As I understand it, this tool was written primarily to support
scripts, like repository browsers showing https://github.com/git/git,
so I do not mind non-recursive behaviour being the default. After all,
I view it as plumbing.
> I have to use `-r` (recurse):
>
> ```
> $ git last-modified -r refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
> 3691fe72d927658ae77ade7fe967544fc6739e67 Documentation/git-last-modified.adoc
> 062b914c841329a003f74e1340ea5178391274a6 refs.c
> 47478802daddf3f9916111307f153c6298ffc0bc refs.h
> 0fbe93b36c05bbf4156c157f27998938ce312265 Documentation/git-config.adoc
> ```
>
> And `-r` with a directory like `Documentation` will recurse through that
> directory.
Totally expected.
* Re: [PATCH RFC v2 0/5] Introduce git-last-modified(1) command
2025-07-01 21:06 ` Junio C Hamano
@ 2025-07-01 21:30 ` Kristoffer Haugsbakk
2025-07-02 13:00 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Kristoffer Haugsbakk @ 2025-07-01 21:30 UTC (permalink / raw)
To: Junio C Hamano
Cc: Toon Claes, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
On Tue, Jul 1, 2025, at 23:06, Junio C Hamano wrote:
> "Kristoffer Haugsbakk" <kristofferhaugsbakk@fastmail.com> writes:
>
>> It feels like the command strays a bit from the usual patterns to me. For paths/files
>> that is. I like this:
>>
>> ```
>> $ git last-modified -r refs.c refs.h
>> 062b914c841329a003f74e1340ea5178391274a6 refs.c
>> 47478802daddf3f9916111307f153c6298ffc0bc refs.h
>> ```
>
> I am not getting this example. Unless "-r" stands for "reverse",
> the above looks totally expected.
Sorry. I meant this (without `-r`) and that it makes sense.
```
$ git last-modified refs.c refs.h
062b914c841329a003f74e1340ea5178391274a6 refs.c
47478802daddf3f9916111307f153c6298ffc0bc refs.h
```
> I do not see anything unexpected. Have you seen "git ls-tree"
> output without -r(ecursive) before?
>
> $ git ls-tree HEAD refs.c refs.h Documentation
> 040000 tree a0f7113f63a19b70dff14bfd9f8f82809f5068e1 Documentation
> 100644 blob dce5c49ca2ba65fd6a2974e38f67134215bee369 refs.c
> 100644 blob 46a6008e07f2624239139cd8b2ff712545f07d3f refs.h
No. That’s my blindspot.
>
> As I understand that this tool was written primarily to implement
> scripts like repository browsers showing https://github.com/git/git
> I do not mind non-recursive behaviour being the default. After all
> I view it as a plumbing.
>
>> I have to use `-r` (recurse):
>>
>> ```
>> $ git last-modified -r refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
>> 3691fe72d927658ae77ade7fe967544fc6739e67 Documentation/git-last-modified.adoc
>> 062b914c841329a003f74e1340ea5178391274a6 refs.c
>> 47478802daddf3f9916111307f153c6298ffc0bc refs.h
>> 0fbe93b36c05bbf4156c157f27998938ce312265 Documentation/git-config.adoc
>> ```
>>
>> And `-r` with a directory like `Documentation` will recurse through that
>> directory.
>
> Totally expected.
* Re: [PATCH RFC v3 0/3] Introduce git-last-modified(1) command
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
` (2 preceding siblings ...)
2025-06-30 18:49 ` [PATCH RFC v3 3/3] last-modified: use Bloom filters when available Toon Claes
@ 2025-07-01 23:01 ` Junio C Hamano
2025-07-09 15:26 ` [PATCH v4 " Toon Claes
` (4 subsequent siblings)
8 siblings, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-07-01 23:01 UTC (permalink / raw)
To: Toon Claes
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Toon Claes <toon@iotcl.com> writes:
> Again thanks to Taylor and the people at GitHub for sharing these
> patches. I hope we can work together to get this upstreamed.
>
> [1]: https://lore.kernel.org/git/patch-1.1-0ea849d900b-20230205T204104Z-avarab@gmail.com/
> [2]: https://lore.kernel.org/git/Z+XJ+1L3PnC9Dyba@nand.local/
> [3]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-3-4173133f3786@iotcl.com/
> [4]: git@github.com:ttaylorr/git.git
> [5]: https://lore.kernel.org/git/aCbBKj7O9LjO3SMK@pks.im/
>
> --
> Cheers,
> Toon
>
> Signed-off-by: Toon Claes <toon@iotcl.com>
> ---
> Changes in v3:
> - Updated benchmarks in commit messages.
> - Removed the patches that attempt to increase performance for tree
> entries that have not been updated in a long time. (see above)
> - Move handling failure in `last_modified_init()` to the caller.
> - Sorted #include clauses lexicographically.
> - Removed unneeded `commit` in `struct last_modified_entry`.
> - Renamed some functions/variables and added some comments to make it
> easier to understand.
> - Removed unnecessary checking of the commit-graph generation number.
> - Link to v2: https://lore.kernel.org/r/20250523-toon-new-blame-tree-v2-0-101e4ca4c1c9@iotcl.com
>
> Changes in v2:
> - The subcommand is renamed from `blame-tree` to `last-modified`
> - Documentation is added. Here we mark the command as experimental.
> - Some test cases are added related to merges.
> - Link to v1: https://lore.kernel.org/r/20250422-toon-new-blame-tree-v1-0-fdb51b8a394a@iotcl.com
>
> ---
> Toon Claes (3):
> last-modified: new subcommand to show when files were last modified
> t/perf: add last-modified perf script
> last-modified: use Bloom filters when available
>
> .gitignore | 1 +
> Documentation/git-last-modified.adoc | 49 +++++++
> Documentation/meson.build | 1 +
> Makefile | 2 +
> builtin.h | 1 +
> builtin/last-modified.c | 44 ++++++
> command-list.txt | 1 +
> git.c | 1 +
> last-modified.c | 257 +++++++++++++++++++++++++++++++++++
> last-modified.h | 35 +++++
> meson.build | 2 +
> t/meson.build | 2 +
> t/perf/p8020-last-modified.sh | 21 +++
> t/t8020-last-modified.sh | 204 +++++++++++++++++++++++++++
> 14 files changed, 621 insertions(+)
FWIW, "git last-modified -h" does not work; its output is expected to
match what is in "git help last-modified", and t0450 would not pass
without it.
* Re: [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified
2025-06-30 18:49 ` [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-01 20:20 ` Kristoffer Haugsbakk
@ 2025-07-02 11:51 ` Junio C Hamano
1 sibling, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-07-02 11:51 UTC (permalink / raw)
To: Toon Claes
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Toon Claes <toon@iotcl.com> writes:
> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
> new file mode 100644
> index 0000000000..4ff058c302
> --- /dev/null
> +++ b/builtin/last-modified.c
> @@ -0,0 +1,44 @@
> +#include "git-compat-util.h"
> +#include "last-modified.h"
> +#include "hex.h"
> +#include "quote.h"
> +#include "config.h"
> +#include "object-name.h"
> +#include "parse-options.h"
> +#include "builtin.h"
Apparently "parse-options.h" is included but is never used.
How much of these include do you truly use in this step?
I was looking at the code, since I was wondering why you forgot to
handle "-h", which comes absolutely for free when you use the
parse-options API in the most natural way.
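
To illustrate the "comes for free" point: when a single option table drives
both parsing and the usage text, "-h" needs no per-command code. The sketch
below is a self-contained miniature, not Git's parse-options API; the
`toy_*` names and return codes are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical miniature option table. */
struct toy_option {
	char short_name;
	const char *help;
	int *value; /* set to 1 when the flag is seen */
};

enum { TOY_PARSE_OK = 0, TOY_PARSE_HELP = 1, TOY_PARSE_UNKNOWN = -1 };

/* The table that drives parsing also drives the help text. */
void toy_usage(FILE *out, const char *usage, const struct toy_option *opts)
{
	fprintf(out, "usage: %s\n\n", usage);
	for (; opts->short_name; opts++)
		fprintf(out, "    -%c    %s\n", opts->short_name, opts->help);
}

/* Parses single-letter flags in argv[1..]; stops at "--" or a non-option. */
int toy_parse(int argc, const char **argv, const char *usage,
	      const struct toy_option *opts)
{
	for (int i = 1; i < argc; i++) {
		const struct toy_option *o;
		const char *arg = argv[i];

		if (arg[0] != '-' || !strcmp(arg, "--"))
			break;
		if (!strcmp(arg, "-h")) {
			toy_usage(stdout, usage, opts);
			return TOY_PARSE_HELP;
		}
		for (o = opts; o->short_name; o++)
			if (arg[1] == o->short_name && !arg[2])
				break;
		if (!o->short_name)
			return TOY_PARSE_UNKNOWN;
		*o->value = 1;
	}
	return TOY_PARSE_OK;
}
```

With this shape, adding a new flag to the table automatically updates what
"-h" prints.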
> +int cmd_last_modified(int argc,
> + const char **argv,
> + const char *prefix,
> + struct repository *repo)
> +{
> + struct last_modified lm;
> +
> + repo_config(repo, git_default_config, NULL);
> +
> + if (last_modified_init(&lm, repo, prefix, argc, argv))
> + die(_("error setting up last-modified traversal"));
> +
> + if (last_modified_run(&lm, show_entry, &lm) < 0)
> + die(_("error running last-modified traversal"));
> +
> + last_modified_release(&lm);
> +
> + return 0;
> +}
It is a bit unusual for the top-level cmd_foo() to totally give up
the responsibility of command line parsing, and let a helper
function take over everything.
Is the idea that the family of last_modified_foo() functions wants
to form a library-ish API? I think the primary reason I find the
arrangement a bit unusual is that such a library interface would not
deal with end-user interactions like command line parsing. Even
commands that let setup_revisions() slurp the command line arguments
typically do the necessary set-up (like discovering the git directory
and reading the configuration files) on the side of the caller.
* Re: [PATCH RFC v2 0/5] Introduce git-last-modified(1) command
2025-07-01 21:30 ` Kristoffer Haugsbakk
@ 2025-07-02 13:00 ` Toon Claes
2025-07-09 15:53 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-02 13:00 UTC (permalink / raw)
To: Kristoffer Haugsbakk, Junio C Hamano
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
>> I do not see anything unexpected. Have you seen "git ls-tree"
>> output without -r(ecursive) before?
>>
>> $ git ls-tree HEAD refs.c refs.h Documentation
>> 040000 tree a0f7113f63a19b70dff14bfd9f8f82809f5068e1 Documentation
>> 100644 blob dce5c49ca2ba65fd6a2974e38f67134215bee369 refs.c
>> 100644 blob 46a6008e07f2624239139cd8b2ff712545f07d3f refs.h
You raise a good point here. Let's compare:
$ git ls-tree HEAD -- refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
100644 blob 936e0c5130fe7d67f645501fbb9e70b94b437f54 Documentation/git-config.adoc
100644 blob 1af38f402ed6437353fb5765f62251966d828df9 Documentation/git-last-modified.adoc
100644 blob dce5c49ca2ba65fd6a2974e38f67134215bee369 refs.c
100644 blob 46a6008e07f2624239139cd8b2ff712545f07d3f refs.h
$ git last-modified HEAD -- refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
56073a0af90be947cfefbfc3cf762b268e5e20a9 Documentation
062b914c841329a003f74e1340ea5178391274a6 refs.c
47478802daddf3f9916111307f153c6298ffc0bc refs.h
I have to agree with Kristoffer here, and the latter is not what I
would expect. Thanks for the testing! I will try to address this in the
next version.
--
Cheers,
Toon
* [PATCH v4 0/3] Introduce git-last-modified(1) command
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
` (3 preceding siblings ...)
2025-07-01 23:01 ` [PATCH RFC v3 0/3] Introduce git-last-modified(1) command Junio C Hamano
@ 2025-07-09 15:26 ` Toon Claes
2025-07-09 21:57 ` Junio C Hamano
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
2025-07-09 15:26 ` [PATCH v4 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
` (3 subsequent siblings)
8 siblings, 2 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-09 15:26 UTC (permalink / raw)
To: git
Cc: Toon Claes, Junio C Hamano, Kristoffer Haugsbakk, Derrick Stolee,
Taylor Blau
This series adds the git-last-modified(1) subcommand. In the past the
subcommand was proposed[1] to be named git-blame-tree(1). This version
is based on the patches shared by the kind people at GitHub[2].
What is different from the series shared by GitHub:
* Renamed the subcommand from `blame-tree` to `last-modified`. There was
some consensus[5] this name works better, so let's give it a try and
see how this name feels.
* Patches for --max-depth are excluded. I think it's a separate topic to
discuss and I'm not sure it needs to be part of this series anyway. The
main patch was submitted in the previous attempt[3] and if people
consider it valuable, I'm happy to discuss that in a separate patch
series.
* The patches in 'tb/blame-tree' at Taylor's fork[4] implement a
caching layer. This feature reads/writes cached results in
`.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
size, that feature is excluded from this series. I think it's better
to submit this as a separate series.
* Squashed various commits together. For example, they introduced a flag
`--go-faster`, which later became the default and only implementation.
That story was wrapped up in a single commit.
* Dropped the patches that attempt to increase performance for tree
entries that have not been updated in a long time. In my testing I've
seen both performance improvements *and* degradation with these
changes:
  Test                                        HEAD~             HEAD
  ------------------------------------------------------------------------------------
  8020.1: top-level last-modified             4.52(4.38+0.11)   2.03(1.93+0.08) -55.1%
  8020.2: top-level recursive last-modified   5.79(5.64+0.11)   8.34(8.17+0.11) +44.0%
  8020.3: subdir last-modified                0.15(0.09+0.06)   0.19(0.14+0.06) +26.7%
Before we include these patches, I want to make sure these changes
have a positive impact in all/most scenarios. This can happen in a
separate series.
* The last-modified command isn't recursive by default. If you want
to recurse into subtrees, you need to pass `-r`.
* Fixed all memory leaks, and removed the use of
USE_THE_REPOSITORY_VARIABLE.
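
For reference, the percentages in the perf table above look like the
relative change in the first (elapsed) number of each cell, that is
(HEAD - HEAD~) / HEAD~. That reading is an assumption, but it does
reproduce the three figures to one decimal place. The helper names below
are hypothetical.

```c
#include <assert.h>

/*
 * Relative change in percent going from `before` to `after`.
 * Negative means faster, positive means slower. `before` must be > 0.
 */
double toy_pct_change(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* True when `value` is within 0.05 of `shown`, i.e. agrees at one decimal. */
int toy_matches_shown(double value, double shown)
{
	double d = value - shown;

	return d > -0.05 && d < 0.05;
}
```

For example, toy_pct_change(4.52, 2.03) is roughly -55.09, shown as -55.1%
for 8020.1, while 8020.2 goes the other way at roughly +44.04.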
I've set myself as the author and added Based-on-patch-by trailers to
credit the original authors. Let me know if you disagree.
Again thanks to Taylor and the people at GitHub for sharing these
patches. I hope we can work together to get this upstreamed.
[1]: https://lore.kernel.org/git/patch-1.1-0ea849d900b-20230205T204104Z-avarab@gmail.com/
[2]: https://lore.kernel.org/git/Z+XJ+1L3PnC9Dyba@nand.local/
[3]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-3-4173133f3786@iotcl.com/
[4]: git@github.com:ttaylorr/git.git
[5]: https://lore.kernel.org/git/aCbBKj7O9LjO3SMK@pks.im/
--
Cheers,
Toon
Signed-off-by: Toon Claes <toon@iotcl.com>
---
Changes in v4:
- Removed root-level `last-modified.[ch]` library code and moved code to
`builtin/last-modified.c`. Historically we've had library code (also because it
was used in testtool), but we no longer need that separation. I'm sorry this
makes the range-diff hard to read.
- Added the use of parse_options() to get better usage messages.
- Formatting fixes after conversation in
https://lore.kernel.org/git/xmqqh5zvk5h0.fsf@gitster.g/
- Link to v3: https://lore.kernel.org/git/20250630-toon-new-blame-tree-v3-0-3516025dc3bc@iotcl.com/
Changes in v3:
- Updated benchmarks in commit messages.
- Removed the patches that attempt to increase performance for tree
entries that have not been updated in a long time. (see above)
- Move handling failure in `last_modified_init()` to the caller.
- Sorted #include clauses lexicographically.
- Removed unneeded `commit` in `struct last_modified_entry`.
- Renamed some functions/variables and added some comments to make it
easier to understand.
- Removed unnecessary checking of the commit-graph generation number.
- Link to v2: https://lore.kernel.org/r/20250523-toon-new-blame-tree-v2-0-101e4ca4c1c9@iotcl.com
Changes in v2:
- The subcommand is renamed from `blame-tree` to `last-modified`
- Documentation is added. Here we mark the command as experimental.
- Some test cases are added related to merges.
- Link to v1: https://lore.kernel.org/r/20250422-toon-new-blame-tree-v1-0-fdb51b8a394a@iotcl.com
Toon Claes (3):
last-modified: new subcommand to show when files were last modified
t/perf: add last-modified perf script
last-modified: use Bloom filters when available
.gitignore | 1 +
Documentation/git-last-modified.adoc | 49 ++++
Documentation/meson.build | 1 +
Makefile | 1 +
builtin.h | 1 +
builtin/last-modified.c | 334 +++++++++++++++++++++++++++
command-list.txt | 1 +
git.c | 1 +
meson.build | 1 +
t/meson.build | 2 +
t/perf/p8020-last-modified.sh | 21 ++
t/t8020-last-modified.sh | 204 ++++++++++++++++
12 files changed, 617 insertions(+)
create mode 100644 Documentation/git-last-modified.adoc
create mode 100644 builtin/last-modified.c
create mode 100755 t/perf/p8020-last-modified.sh
create mode 100755 t/t8020-last-modified.sh
Range-diff against v3:
1: 26a2d9b5e0 ! 1: 0cc625f3f5 last-modified: new subcommand to show when files were last modified
@@ Documentation/git-last-modified.adoc (new)
+SYNOPSIS
+--------
+[synopsis]
-+git last-modified [-r] [<revision-range>] [[--] <path>...]
++git last-modified [-r] [-t] [<revision-range>] [[--] <path>...]
+
+DESCRIPTION
+-----------
@@ Documentation/git-last-modified.adoc (new)
+[--] <path>...::
+ For each _<path>_ given, the commit which last modified it is returned.
+ Without an optional path parameter, all files and subdirectories
-+ of the current working directory are included in the
++ in path traversal the are included in the output.
+
+SEE ALSO
+--------
@@ Documentation/meson.build: manpages = {
'git-ls-remote.adoc' : 1,
## Makefile ##
-@@ Makefile: LIB_OBJS += hook.o
- LIB_OBJS += ident.o
- LIB_OBJS += json-writer.o
- LIB_OBJS += kwset.o
-+LIB_OBJS += last-modified.o
- LIB_OBJS += levenshtein.o
- LIB_OBJS += line-log.o
- LIB_OBJS += line-range.o
@@ Makefile: BUILTIN_OBJS += builtin/hook.o
BUILTIN_OBJS += builtin/index-pack.o
BUILTIN_OBJS += builtin/init-db.o
@@ builtin.h: int cmd_hook(int argc, const char **argv, const char *prefix, struct
## builtin/last-modified.c (new) ##
@@
+#include "git-compat-util.h"
-+#include "last-modified.h"
-+#include "hex.h"
-+#include "quote.h"
-+#include "config.h"
-+#include "object-name.h"
-+#include "parse-options.h"
+#include "builtin.h"
-+
-+static void show_entry(const char *path, const struct commit *commit, void *d)
-+{
-+ struct last_modified *lm = d;
-+
-+ if (commit->object.flags & BOUNDARY)
-+ putchar('^');
-+ printf("%s\t", oid_to_hex(&commit->object.oid));
-+
-+ if (lm->rev.diffopt.line_termination)
-+ write_name_quoted(path, stdout, '\n');
-+ else
-+ printf("%s%c", path, '\0');
-+
-+ fflush(stdout);
-+}
-+
-+int cmd_last_modified(int argc,
-+ const char **argv,
-+ const char *prefix,
-+ struct repository *repo)
-+{
-+ struct last_modified lm;
-+
-+ repo_config(repo, git_default_config, NULL);
-+
-+ if (last_modified_init(&lm, repo, prefix, argc, argv))
-+ die(_("error setting up last-modified traversal"));
-+
-+ if (last_modified_run(&lm, show_entry, &lm) < 0)
-+ die(_("error running last-modified traversal"));
-+
-+ last_modified_release(&lm);
-+
-+ return 0;
-+}
-
- ## command-list.txt ##
-@@ command-list.txt: git-index-pack plumbingmanipulators
- git-init mainporcelain init
- git-instaweb ancillaryinterrogators complete
- git-interpret-trailers purehelpers
-+git-last-modified plumbinginterrogators
- git-log mainporcelain info
- git-ls-files plumbinginterrogators
- git-ls-remote plumbinginterrogators
-
- ## git.c ##
-@@ git.c: static struct cmd_struct commands[] = {
- { "init", cmd_init_db },
- { "init-db", cmd_init_db },
- { "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
-+ { "last-modified", cmd_last_modified, RUN_SETUP },
- { "log", cmd_log, RUN_SETUP },
- { "ls-files", cmd_ls_files, RUN_SETUP },
- { "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
-
- ## last-modified.c (new) ##
-@@
-+#include "git-compat-util.h"
+#include "commit.h"
++#include "config.h"
+#include "diff.h"
+#include "diffcore.h"
-+#include "last-modified.h"
++#include "hashmap.h"
++#include "hex.h"
+#include "log-tree.h"
++#include "object-name.h"
+#include "object.h"
++#include "parse-options.h"
++#include "quote.h"
+#include "repository.h"
+#include "revision.h"
+
@@ last-modified.c (new)
+ const char path[FLEX_ARRAY];
+};
+
++static int last_modified_entry_hashcmp(const void *unused UNUSED,
++ const struct hashmap_entry *hent1,
++ const struct hashmap_entry *hent2,
++ const void *path)
++{
++ const struct last_modified_entry *ent1 =
++ container_of(hent1, const struct last_modified_entry, hashent);
++ const struct last_modified_entry *ent2 =
++ container_of(hent2, const struct last_modified_entry, hashent);
++ return strcmp(ent1->path, path ? path : ent2->path);
++}
++
++struct last_modified {
++ struct hashmap paths;
++ struct rev_info rev;
++ int recursive, tree_in_recursive;
++};
++
++static void last_modified_release(struct last_modified *lm)
++{
++ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
++ release_revisions(&lm->rev);
++}
++
++typedef void (*last_modified_callback)(const char *path,
++ const struct commit *commit, void *data);
++
++struct last_modified_callback_data {
++ struct commit *commit;
++ struct hashmap *paths;
++
++ last_modified_callback callback;
++ void *callback_data;
++};
++
+static void add_path_from_diff(struct diff_queue_struct *q,
-+ struct diff_options *opt UNUSED,
-+ void *data)
++ struct diff_options *opt UNUSED, void *data)
+{
+ struct last_modified *lm = data;
+
@@ last-modified.c (new)
+ return 0;
+}
+
-+static int last_modified_entry_hashcmp(const void *unused UNUSED,
-+ const struct hashmap_entry *hent1,
-+ const struct hashmap_entry *hent2,
-+ const void *path)
-+{
-+ const struct last_modified_entry *ent1 =
-+ container_of(hent1, const struct last_modified_entry, hashent);
-+ const struct last_modified_entry *ent2 =
-+ container_of(hent2, const struct last_modified_entry, hashent);
-+ return strcmp(ent1->path, path ? path : ent2->path);
-+}
-+
-+int last_modified_init(struct last_modified *lm,
-+ struct repository *r,
-+ const char *prefix,
-+ int argc, const char **argv)
-+{
-+ memset(lm, 0, sizeof(*lm));
-+ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
-+
-+ repo_init_revisions(r, &lm->rev, prefix);
-+ lm->rev.def = "HEAD";
-+ lm->rev.combine_merges = 1;
-+ lm->rev.show_root_diff = 1;
-+ lm->rev.boundary = 1;
-+ lm->rev.no_commit_id = 1;
-+ lm->rev.diff = 1;
-+ if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
-+ return error(_("unknown last-modified argument: %s"), argv[1]);
-+
-+ if (populate_paths_from_revs(lm) < 0)
-+ return error(_("unable to setup last-modified"));
-+
-+ return 0;
-+}
-+
-+void last_modified_release(struct last_modified *lm)
-+{
-+ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
-+ release_revisions(&lm->rev);
-+}
-+
-+struct last_modified_callback_data {
-+ struct commit *commit;
-+ struct hashmap *paths;
-+
-+ last_modified_callback callback;
-+ void *callback_data;
-+};
-+
+static void mark_path(const char *path, const struct object_id *oid,
+ struct last_modified_callback_data *data)
+{
@@ last-modified.c (new)
+ default:
+ /*
+ * Otherwise, we care only that we somehow arrived at
-+ * a final path/sha1 state. Note that this covers some
++ * a final oid state. Note that this covers some
+ * potentially controversial areas, including:
+ *
+ * 1. A rename or copy will be found, as it is the
@@ last-modified.c (new)
+ }
+}
+
-+int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
++static int last_modified_run(struct last_modified *lm,
++ last_modified_callback cb, void *cbdata)
+{
+ struct last_modified_callback_data data;
+
@@ last-modified.c (new)
+
+ if (data.commit->object.flags & BOUNDARY) {
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
-+ &data.commit->object.oid,
-+ "", &lm->rev.diffopt);
++ &data.commit->object.oid, "",
++ &lm->rev.diffopt);
+ diff_flush(&lm->rev.diffopt);
+ } else {
+ log_tree_commit(&lm->rev, data.commit);
@@ last-modified.c (new)
+ }
+
+ return 0;
++}
++
++static void show_entry(const char *path, const struct commit *commit, void *d)
++{
++ struct last_modified *lm = d;
++
++ if (commit->object.flags & BOUNDARY)
++ putchar('^');
++ printf("%s\t", oid_to_hex(&commit->object.oid));
++
++ if (lm->rev.diffopt.line_termination)
++ write_name_quoted(path, stdout, '\n');
++ else
++ printf("%s%c", path, '\0');
++
++ fflush(stdout);
++}
++
++static int last_modified_init(struct last_modified *lm, struct repository *r,
++ const char *prefix, int argc, const char **argv)
++{
++ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
++
++ repo_init_revisions(r, &lm->rev, prefix);
++ lm->rev.def = "HEAD";
++ lm->rev.combine_merges = 1;
++ lm->rev.show_root_diff = 1;
++ lm->rev.boundary = 1;
++ lm->rev.no_commit_id = 1;
++ lm->rev.diff = 1;
++ lm->rev.diffopt.flags.recursive = lm->recursive || lm->tree_in_recursive;
++ lm->rev.diffopt.flags.tree_in_recursive = lm->tree_in_recursive;
++
++ if ((argc = setup_revisions(argc, argv, &lm->rev, NULL)) > 1) {
++ error(_("unknown last-modified argument: %s"), argv[1]);
++ return argc;
++ }
++
++ if (populate_paths_from_revs(lm) < 0)
++ return error(_("unable to setup last-modified"));
++
++ return 0;
++}
++
++int cmd_last_modified(int argc, const char **argv, const char *prefix,
++ struct repository *repo)
++{
++ int ret;
++ struct last_modified lm;
++
++ const char * const last_modified_usage[] = {
++ N_("git last-modified [-r] [-t] "
++ "[<revision-range>] [[--] <path>...]"),
++ NULL
++ };
++
++ struct option last_modified_options[] = {
++ OPT_BOOL('r', "recursive", &lm.recursive,
++ N_("recurse into subtrees")),
++ OPT_BOOL('t', "tree-in-recursive", &lm.tree_in_recursive,
++ N_("recurse into subtrees and include the tree entries too")),
++ OPT_END()
++ };
++
++ memset(&lm, 0, sizeof(lm));
++
++ argc = parse_options(argc, argv, prefix, last_modified_options,
++ last_modified_usage,
++ PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_UNKNOWN_OPT);
++
++ repo_config(repo, git_default_config, NULL);
++
++ if ((ret = last_modified_init(&lm, repo, prefix, argc, argv))) {
++ if (ret > 0)
++ usage_with_options(last_modified_usage,
++ last_modified_options);
++ goto out;
++ }
++
++ if ((ret = last_modified_run(&lm, show_entry, &lm)))
++ goto out;
++
++out:
++ last_modified_release(&lm);
++
++ return ret;
+}
- ## last-modified.h (new) ##
-@@
-+#ifndef LAST_MODIFIED_H
-+#define LAST_MODIFIED_H
-+
-+#include "commit.h"
-+#include "hashmap.h"
-+#include "revision.h"
-+
-+struct last_modified {
-+ struct hashmap paths;
-+ struct rev_info rev;
-+};
-+
-+/*
-+ * Initialize the last-modified machinery using command line arguments.
-+ */
-+int last_modified_init(struct last_modified *lm,
-+ struct repository *r,
-+ const char *prefix,
-+ int argc, const char **argv);
-+
-+void last_modified_release(struct last_modified *);
-+
-+typedef void (*last_modified_callback)(const char *path,
-+ const struct commit *commit,
-+ void *data);
-+
-+/*
-+ * Run the last-modified traversal. For each path found the callback is called
-+ * passing the path, the commit, and the cbdata.
-+ */
-+int last_modified_run(struct last_modified *lm,
-+ last_modified_callback cb,
-+ void *cbdata);
-+
-+#endif /* LAST_MODIFIED_H */
+ ## command-list.txt ##
+@@ command-list.txt: git-index-pack plumbingmanipulators
+ git-init mainporcelain init
+ git-instaweb ancillaryinterrogators complete
+ git-interpret-trailers purehelpers
++git-last-modified plumbinginterrogators
+ git-log mainporcelain info
+ git-ls-files plumbinginterrogators
+ git-ls-remote plumbinginterrogators
+
+ ## git.c ##
+@@ git.c: static struct cmd_struct commands[] = {
+ { "init", cmd_init_db },
+ { "init-db", cmd_init_db },
+ { "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
++ { "last-modified", cmd_last_modified, RUN_SETUP },
+ { "log", cmd_log, RUN_SETUP },
+ { "ls-files", cmd_ls_files, RUN_SETUP },
+ { "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
## meson.build ##
-@@ meson.build: libgit_sources = [
- 'ident.c',
- 'json-writer.c',
- 'kwset.c',
-+ 'last-modified.c',
- 'levenshtein.c',
- 'line-log.c',
- 'line-range.c',
@@ meson.build: builtin_sources = [
'builtin/index-pack.c',
'builtin/init-db.c',
2: 0691884735 = 2: a017f2c81c t/perf: add last-modified perf script
3: 393f304a3f ! 3: c739a7dbcc last-modified: use Bloom filters when available
@@ Commit message
Comparing the perf test results on git.git:
- Test HEAD~ HEAD
- ------------------------------------------------------------------------------------
- 8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6%
- 8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4%
- 8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4%
+ Test HEAD~ HEAD
+ ------------------------------------------------------------------------------------
+ 8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6%
+ 8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4%
+ 8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4%
Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
- ## last-modified.c ##
+ ## builtin/last-modified.c ##
@@
#include "git-compat-util.h"
+#include "bloom.h"
+ #include "builtin.h"
+#include "commit-graph.h"
#include "commit.h"
+ #include "config.h"
#include "diff.h"
- #include "diffcore.h"
-+#include "dir.h"
- #include "last-modified.h"
- #include "log-tree.h"
- #include "object.h"
@@
struct last_modified_entry {
struct hashmap_entry hashent;
@@ last-modified.c
const char path[FLEX_ARRAY];
};
-@@ last-modified.c: static void add_path_from_diff(struct diff_queue_struct *q,
+@@ builtin/last-modified.c: struct last_modified {
- FLEX_ALLOC_STR(ent, path, path);
- oidcpy(&ent->oid, &p->two->oid);
-+ if (lm->rev.bloom_filter_settings)
-+ fill_bloom_key(path, strlen(path), &ent->key,
-+ lm->rev.bloom_filter_settings);
- hashmap_entry_init(&ent->hashent, strhash(ent->path));
- hashmap_add(&lm->paths, &ent->hashent);
- }
-@@ last-modified.c: int last_modified_init(struct last_modified *lm,
- if (setup_revisions(argc, argv, &lm->rev, NULL) > 1)
- return error(_("unknown last-modified argument: %s"), argv[1]);
-
-+ /*
-+ * We're not interested in generation numbers here,
-+ * but we call this function to prepare the commit-graph.
-+ */
-+ (void)generation_numbers_enabled(lm->rev.repo);
-+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
-+
- if (populate_paths_from_revs(lm) < 0)
- return error(_("unable to setup last-modified"));
-
-@@ last-modified.c: int last_modified_init(struct last_modified *lm,
-
- void last_modified_release(struct last_modified *lm)
+ static void last_modified_release(struct last_modified *lm)
{
+ struct hashmap_iter iter;
+ struct last_modified_entry *ent;
@@ last-modified.c: int last_modified_init(struct last_modified *lm,
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
}
-@@ last-modified.c: static void mark_path(const char *path, const struct object_id *oid,
+@@ builtin/last-modified.c: static void add_path_from_diff(struct diff_queue_struct *q,
+
+ FLEX_ALLOC_STR(ent, path, path);
+ oidcpy(&ent->oid, &p->two->oid);
++ if (lm->rev.bloom_filter_settings)
++ fill_bloom_key(path, strlen(path), &ent->key,
++ lm->rev.bloom_filter_settings);
+ hashmap_entry_init(&ent->hashent, strhash(ent->path));
+ hashmap_add(&lm->paths, &ent->hashent);
+ }
+@@ builtin/last-modified.c: static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
@@ last-modified.c: static void mark_path(const char *path, const struct object_id
free(ent);
}
-@@ last-modified.c: static void last_modified_diff(struct diff_queue_struct *q,
+@@ builtin/last-modified.c: static void last_modified_diff(struct diff_queue_struct *q,
}
}
++
+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
@@ last-modified.c: static void last_modified_diff(struct diff_queue_struct *q,
+ return 0;
+}
+
- int last_modified_run(struct last_modified *lm, last_modified_callback cb, void *cbdata)
+ static int last_modified_run(struct last_modified *lm,
+ last_modified_callback cb, void *cbdata)
{
- struct last_modified_callback_data data;
-@@ last-modified.c: int last_modified_run(struct last_modified *lm, last_modified_callback cb, void
+@@ builtin/last-modified.c: static int last_modified_run(struct last_modified *lm,
if (!data.commit)
break;
@@ last-modified.c: int last_modified_run(struct last_modified *lm, last_modified_c
+
if (data.commit->object.flags & BOUNDARY) {
diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
- &data.commit->object.oid,
+ &data.commit->object.oid, "",
+@@ builtin/last-modified.c: static int last_modified_init(struct last_modified *lm, struct repository *r,
+ return argc;
+ }
+
++ /*
++ * We're not interested in generation numbers here,
++ * but we call this function to prepare the commit-graph.
++ */
++ (void)generation_numbers_enabled(lm->rev.repo);
++ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
++
+ if (populate_paths_from_revs(lm) < 0)
+ return error(_("unable to setup last-modified"));
+
base-commit: 41905d60226a0346b22f0d0d99428c746a5a3b14
--
2.50.0.rc0.18.gfcfe60668e
* [PATCH v4 1/3] last-modified: new subcommand to show when files were last modified
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
` (4 preceding siblings ...)
2025-07-09 15:26 ` [PATCH v4 " Toon Claes
@ 2025-07-09 15:26 ` Toon Claes
2025-07-09 15:26 ` [PATCH v4 2/3] t/perf: add last-modified perf script Toon Claes
` (2 subsequent siblings)
8 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-09 15:26 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Jeff King, Ævar Arnfjörð Bjarmason
Similar to git-blame(1), introduce a new subcommand
git-last-modified(1). This command shows the most recent modification to
paths in a tree. It does so by expanding the tree at a given commit,
taking note of the current state of each path, and then walking
backwards through history looking for commits where each path changed
into its final commit ID.
Based-on-patch-by: Jeff King <peff@peff.net>
Improved-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
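For reviewers who want the gist of the walk described in the commit
message without reading the whole diff, here is a minimal standalone
sketch. It is not the code from this patch: it assumes a linear toy
history, uses plain integers as stand-ins for blob IDs, and every
`toy_*` name is made up for illustration. The real implementation uses
the revision walker, a hashmap of paths, and the diff machinery.

```c
#include <stddef.h>
#include <string.h>

/* Toy model (assumption): a "blob id" is an int, history is linear. */
struct toy_change {
	const char *path;
	int blob;		/* id of the path after this commit; 0 = deleted */
};

struct toy_commit {
	const char *name;
	const struct toy_change *changes;
	size_t nr_changes;
};

struct toy_pending {
	const char *path;
	int blob;		/* id of the path at the starting commit */
	const char *found;	/* commit that produced that id; NULL if unknown */
};

/*
 * Walk `history` (newest first) and fill in `found` for every pending
 * path. The caller must initialize all `found` members to NULL.
 * Returns the number of paths that were resolved.
 */
static size_t toy_last_modified(const struct toy_commit *history,
				size_t nr_commits,
				struct toy_pending *pending,
				size_t nr_pending)
{
	size_t remaining = nr_pending, resolved = 0;

	/* Stop early once every path has been attributed to a commit. */
	for (size_t c = 0; c < nr_commits && remaining; c++) {
		const struct toy_commit *commit = &history[c];

		for (size_t i = 0; i < commit->nr_changes; i++) {
			const struct toy_change *chg = &commit->changes[i];

			if (!chg->blob)
				continue; /* a deletion cannot yield the final state */

			for (size_t p = 0; p < nr_pending; p++) {
				struct toy_pending *ent = &pending[p];

				if (ent->found || strcmp(ent->path, chg->path))
					continue; /* already resolved, or another path */
				if (ent->blob != chg->blob)
					continue; /* not the version we ended up with */

				ent->found = commit->name;
				remaining--;
				resolved++;
			}
		}
	}
	return resolved;
}
```

With a three-commit history matching the 'setup' test (commit 1 adds
file, 2 adds a/file, 3 adds a/b/file), each path resolves to the commit
that introduced its final content, and the loop stops as soon as no
paths remain.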
.gitignore | 1 +
Documentation/git-last-modified.adoc | 49 +++++
Documentation/meson.build | 1 +
Makefile | 1 +
builtin.h | 1 +
builtin/last-modified.c | 289 +++++++++++++++++++++++++++
command-list.txt | 1 +
git.c | 1 +
meson.build | 1 +
t/meson.build | 1 +
t/t8020-last-modified.sh | 204 +++++++++++++++++++
11 files changed, 550 insertions(+)
create mode 100644 Documentation/git-last-modified.adoc
create mode 100644 builtin/last-modified.c
create mode 100755 t/t8020-last-modified.sh
diff --git a/.gitignore b/.gitignore
index 04c444404e..a36ee94443 100644
--- a/.gitignore
+++ b/.gitignore
@@ -87,6 +87,7 @@
/git-init-db
/git-interpret-trailers
/git-instaweb
+/git-last-modified
/git-log
/git-ls-files
/git-ls-remote
diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
new file mode 100644
index 0000000000..89138ebeb7
--- /dev/null
+++ b/Documentation/git-last-modified.adoc
@@ -0,0 +1,49 @@
+git-last-modified(1)
+====================
+
+NAME
+----
+git-last-modified - EXPERIMENTAL: Show when files were last modified
+
+
+SYNOPSIS
+--------
+[synopsis]
+git last-modified [-r] [-t] [<revision-range>] [[--] <path>...]
+
+DESCRIPTION
+-----------
+
+Shows which commit last modified each of the relevant files and subdirectories.
+
+THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
+
+OPTIONS
+-------
+
+-r::
+ Recurse into subtrees.
+
+-t::
+ Show tree entry itself as well as subtrees. Implies `-r`.
+
+<revision-range>::
+ Only traverse commits in the specified revision range. When no
+ `<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
+ history leading to the current commit). For a complete list of ways to
+ spell `<revision-range>`, see the 'Specifying Ranges' section of
+ linkgit:gitrevisions[7].
+
+[--] <path>...::
+ For each _<path>_ given, the commit which last modified it is returned.
+ Without an optional path parameter, all files and subdirectories
+ in the path traversal are included in the output.
+
+SEE ALSO
+--------
+linkgit:git-blame[1],
+linkgit:git-log[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index 2fe1a1369d..99aeb6d0e0 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -74,6 +74,7 @@ manpages = {
'git-init.adoc' : 1,
'git-instaweb.adoc' : 1,
'git-interpret-trailers.adoc' : 1,
+ 'git-last-modified.adoc' : 1,
'git-log.adoc' : 1,
'git-ls-files.adoc' : 1,
'git-ls-remote.adoc' : 1,
diff --git a/Makefile b/Makefile
index 70d1543b6b..11bf4fb55a 100644
--- a/Makefile
+++ b/Makefile
@@ -1267,6 +1267,7 @@ BUILTIN_OBJS += builtin/hook.o
BUILTIN_OBJS += builtin/index-pack.o
BUILTIN_OBJS += builtin/init-db.o
BUILTIN_OBJS += builtin/interpret-trailers.o
+BUILTIN_OBJS += builtin/last-modified.o
BUILTIN_OBJS += builtin/log.o
BUILTIN_OBJS += builtin/ls-files.o
BUILTIN_OBJS += builtin/ls-remote.o
diff --git a/builtin.h b/builtin.h
index bff13e3069..6ed6759ec4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -176,6 +176,7 @@ int cmd_hook(int argc, const char **argv, const char *prefix, struct repository
int cmd_index_pack(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_init_db(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_interpret_trailers(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_last_modified(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log_reflog(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_ls_files(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
new file mode 100644
index 0000000000..63993bc1c9
--- /dev/null
+++ b/builtin/last-modified.c
@@ -0,0 +1,289 @@
+#include "git-compat-util.h"
+#include "builtin.h"
+#include "commit.h"
+#include "config.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "log-tree.h"
+#include "object-name.h"
+#include "object.h"
+#include "parse-options.h"
+#include "quote.h"
+#include "repository.h"
+#include "revision.h"
+
+struct last_modified_entry {
+ struct hashmap_entry hashent;
+ struct object_id oid;
+ const char path[FLEX_ARRAY];
+};
+
+static int last_modified_entry_hashcmp(const void *unused UNUSED,
+ const struct hashmap_entry *hent1,
+ const struct hashmap_entry *hent2,
+ const void *path)
+{
+ const struct last_modified_entry *ent1 =
+ container_of(hent1, const struct last_modified_entry, hashent);
+ const struct last_modified_entry *ent2 =
+ container_of(hent2, const struct last_modified_entry, hashent);
+ return strcmp(ent1->path, path ? path : ent2->path);
+}
+
+struct last_modified {
+ struct hashmap paths;
+ struct rev_info rev;
+ int recursive, tree_in_recursive;
+};
+
+static void last_modified_release(struct last_modified *lm)
+{
+ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
+ release_revisions(&lm->rev);
+}
+
+typedef void (*last_modified_callback)(const char *path,
+ const struct commit *commit, void *data);
+
+struct last_modified_callback_data {
+ struct commit *commit;
+ struct hashmap *paths;
+
+ last_modified_callback callback;
+ void *callback_data;
+};
+
+static void add_path_from_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *data)
+{
+ struct last_modified *lm = data;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct last_modified_entry *ent;
+ const char *path = p->two->path;
+
+ FLEX_ALLOC_STR(ent, path, path);
+ oidcpy(&ent->oid, &p->two->oid);
+ hashmap_entry_init(&ent->hashent, strhash(ent->path));
+ hashmap_add(&lm->paths, &ent->hashent);
+ }
+}
+
+static int populate_paths_from_revs(struct last_modified *lm)
+{
+ int num_interesting = 0;
+ struct diff_options diffopt;
+
+ memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
+ copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
+ /*
+ * Use a callback to populate the paths from revs
+ */
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ diffopt.format_callback = add_path_from_diff;
+ diffopt.format_callback_data = lm;
+
+ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
+ struct object_array_entry *obj = lm->rev.pending.objects + i;
+
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
+ if (num_interesting++)
+ return error(_("can only get last-modified one tree at a time"));
+
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
+ diff_free(&diffopt);
+
+ return 0;
+}
+
+static void mark_path(const char *path, const struct object_id *oid,
+ struct last_modified_callback_data *data)
+{
+ struct last_modified_entry *ent;
+
+ /* Is it even a path that we are interested in? */
+ ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
+ struct last_modified_entry, hashent);
+ if (!ent)
+ return;
+
+ /*
+ * Is it arriving at a version of interest, or is it from a side branch
+ * which did not contribute to the final state?
+ */
+ if (!oideq(oid, &ent->oid))
+ return;
+
+ if (data->callback)
+ data->callback(path, data->commit, data->callback_data);
+
+ hashmap_remove(data->paths, &ent->hashent, path);
+ free(ent);
+}
+
+static void last_modified_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *cbdata)
+{
+ struct last_modified_callback_data *data = cbdata;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ switch (p->status) {
+ case DIFF_STATUS_DELETED:
+ /*
+ * There's no point in feeding a deletion, as it could
+ * not have resulted in our current state, which
+ * actually has the file.
+ */
+ break;
+
+ default:
+ /*
+ * Otherwise, we care only that we somehow arrived at
+ * a final oid state. Note that this covers some
+ * potentially controversial areas, including:
+ *
+ * 1. A rename or copy will be found, as it is the
+ * first time the content has arrived at the given
+ * path.
+ *
+ * 2. Even a non-content modification like a mode or
+ * type change will trigger it.
+ *
+ * We take the inclusive approach for now, and find
+ * anything which impacts the path. Options to tweak
+ * the behavior (e.g., to "--follow" the content across
+ * renames) can come later.
+ */
+ mark_path(p->two->path, &p->two->oid, data);
+ break;
+ }
+ }
+}
+
+static int last_modified_run(struct last_modified *lm,
+ last_modified_callback cb, void *cbdata)
+{
+ struct last_modified_callback_data data;
+
+ data.paths = &lm->paths;
+ data.callback = cb;
+ data.callback_data = cbdata;
+
+ lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ lm->rev.diffopt.format_callback = last_modified_diff;
+ lm->rev.diffopt.format_callback_data = &data;
+
+ prepare_revision_walk(&lm->rev);
+
+ while (hashmap_get_size(&lm->paths)) {
+ data.commit = get_revision(&lm->rev);
+ if (!data.commit)
+ break;
+
+ if (data.commit->object.flags & BOUNDARY) {
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &data.commit->object.oid, "",
+ &lm->rev.diffopt);
+ diff_flush(&lm->rev.diffopt);
+ } else {
+ log_tree_commit(&lm->rev, data.commit);
+ }
+ }
+
+ return 0;
+}
+
+static void show_entry(const char *path, const struct commit *commit, void *d)
+{
+ struct last_modified *lm = d;
+
+ if (commit->object.flags & BOUNDARY)
+ putchar('^');
+ printf("%s\t", oid_to_hex(&commit->object.oid));
+
+ if (lm->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+
+ fflush(stdout);
+}
+
+static int last_modified_init(struct last_modified *lm, struct repository *r,
+ const char *prefix, int argc, const char **argv)
+{
+ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
+
+ repo_init_revisions(r, &lm->rev, prefix);
+ lm->rev.def = "HEAD";
+ lm->rev.combine_merges = 1;
+ lm->rev.show_root_diff = 1;
+ lm->rev.boundary = 1;
+ lm->rev.no_commit_id = 1;
+ lm->rev.diff = 1;
+ lm->rev.diffopt.flags.recursive = lm->recursive || lm->tree_in_recursive;
+ lm->rev.diffopt.flags.tree_in_recursive = lm->tree_in_recursive;
+
+ if ((argc = setup_revisions(argc, argv, &lm->rev, NULL)) > 1) {
+ error(_("unknown last-modified argument: %s"), argv[1]);
+ return argc;
+ }
+
+ if (populate_paths_from_revs(lm) < 0)
+ return error(_("unable to setup last-modified"));
+
+ return 0;
+}
+
+int cmd_last_modified(int argc, const char **argv, const char *prefix,
+ struct repository *repo)
+{
+ int ret;
+ struct last_modified lm;
+
+ const char * const last_modified_usage[] = {
+ N_("git last-modified [-r] [-t] "
+ "[<revision-range>] [[--] <path>...]"),
+ NULL
+ };
+
+ struct option last_modified_options[] = {
+ OPT_BOOL('r', "recursive", &lm.recursive,
+ N_("recurse into subtrees")),
+ OPT_BOOL('t', "tree-in-recursive", &lm.tree_in_recursive,
+ N_("recurse into subtrees and include the tree entries too")),
+ OPT_END()
+ };
+
+ memset(&lm, 0, sizeof(lm));
+
+ argc = parse_options(argc, argv, prefix, last_modified_options,
+ last_modified_usage,
+ PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_UNKNOWN_OPT);
+
+ repo_config(repo, git_default_config, NULL);
+
+ if ((ret = last_modified_init(&lm, repo, prefix, argc, argv))) {
+ if (ret > 0)
+ usage_with_options(last_modified_usage,
+ last_modified_options);
+ goto out;
+ }
+
+ if ((ret = last_modified_run(&lm, show_entry, &lm)))
+ goto out;
+
+out:
+ last_modified_release(&lm);
+
+ return ret;
+}
diff --git a/command-list.txt b/command-list.txt
index b7ade3ab9f..b715777b24 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -124,6 +124,7 @@ git-index-pack plumbingmanipulators
git-init mainporcelain init
git-instaweb ancillaryinterrogators complete
git-interpret-trailers purehelpers
+git-last-modified plumbinginterrogators
git-log mainporcelain info
git-ls-files plumbinginterrogators
git-ls-remote plumbinginterrogators
diff --git a/git.c b/git.c
index 07a5fe39fb..76a0b2a1a4 100644
--- a/git.c
+++ b/git.c
@@ -565,6 +565,7 @@ static struct cmd_struct commands[] = {
{ "init", cmd_init_db },
{ "init-db", cmd_init_db },
{ "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
+ { "last-modified", cmd_last_modified, RUN_SETUP },
{ "log", cmd_log, RUN_SETUP },
{ "ls-files", cmd_ls_files, RUN_SETUP },
{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
diff --git a/meson.build b/meson.build
index 7fea4a34d6..a45608f0a4 100644
--- a/meson.build
+++ b/meson.build
@@ -607,6 +607,7 @@ builtin_sources = [
'builtin/index-pack.c',
'builtin/init-db.c',
'builtin/interpret-trailers.c',
+ 'builtin/last-modified.c',
'builtin/log.c',
'builtin/ls-files.c',
'builtin/ls-remote.c',
diff --git a/t/meson.build b/t/meson.build
index 6d7fe6b117..eee1863eb3 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -961,6 +961,7 @@ integration_tests = [
't8012-blame-colors.sh',
't8013-blame-ignore-revs.sh',
't8014-blame-ignore-fuzzy.sh',
+ 't8020-last-modified.sh',
't9001-send-email.sh',
't9002-column.sh',
't9003-help-autocorrect.sh',
diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
new file mode 100755
index 0000000000..921d2a0807
--- /dev/null
+++ b/t/t8020-last-modified.sh
@@ -0,0 +1,204 @@
+#!/bin/sh
+
+test_description='last-modified tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_commit 1 file &&
+ mkdir a &&
+ test_commit 2 a/file &&
+ mkdir a/b &&
+ test_commit 3 a/b/file
+'
+
+test_expect_success 'cannot run last-modified on two trees' '
+ test_must_fail git last-modified HEAD HEAD~1
+'
+
+check_last_modified() {
+ local indir= &&
+ while test $# != 0
+ do
+ case "$1" in
+ -C)
+ indir="$2"
+ shift
+ ;;
+ *)
+ break
+ ;;
+ esac &&
+ shift
+ done &&
+
+ cat >expect &&
+ test_when_finished "rm -f tmp.*" &&
+ git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
+ tr '\t' ' ' <tmp.2 >tmp.3 &&
+ sort tmp.3 >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'last-modified non-recursive' '
+ check_last_modified <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified recursive' '
+ check_last_modified -r <<-\EOF
+ 1 file
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'last-modified recursive with tree' '
+ check_last_modified -t <<-\EOF
+ 1 file
+ 2 a/file
+ 3 a
+ 3 a/b
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'last-modified subdir' '
+ check_last_modified a <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified subdir recursive' '
+ check_last_modified -r a <<-\EOF
+ 2 a/file
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'last-modified from non-HEAD commit' '
+ check_last_modified HEAD^ <<-\EOF
+ 1 file
+ 2 a
+ EOF
+'
+
+test_expect_success 'last-modified from subdir defaults to root' '
+ check_last_modified -C a <<-\EOF
+ 1 file
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified from subdir uses relative pathspecs' '
+ check_last_modified -C a -r b <<-\EOF
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by count' '
+ check_last_modified -1 <<-\EOF
+ 3 a
+ ^2 file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by commit' '
+ check_last_modified HEAD~2..HEAD <<-\EOF
+ 3 a
+ ^1 file
+ EOF
+'
+
+test_expect_success 'only last-modified files in the current tree' '
+ git rm -rf a &&
+ git commit -m "remove a" &&
+ check_last_modified <<-\EOF
+ 1 file
+ EOF
+'
+
+test_expect_success 'cross merge boundaries in blaming' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit m1 &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit m2 &&
+ git merge m1 &&
+ check_last_modified <<-\EOF
+ m1 m1.t
+ m2 m2.t
+ EOF
+'
+
+test_expect_success 'last-modified merge for resolved conflicts' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit c1 conflict &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit c2 conflict &&
+ test_must_fail git merge c1 &&
+ test_commit resolved conflict &&
+ check_last_modified conflict <<-\EOF
+ resolved conflict
+ EOF
+'
+
+
+# Consider `file` with this content through history:
+#
+# A---B---B-------B---B
+# \ /
+# C---D
+test_expect_success 'last-modified merge ignores content from branch' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit a1 file A &&
+ test_commit a2 file B &&
+ test_commit a3 file C &&
+ test_commit a4 file D &&
+ git checkout a2 &&
+ git merge --no-commit --no-ff a4 &&
+ git checkout a2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ a2 file
+ EOF
+'
+
+# Consider `file` with this content through history:
+#
+# A---B---B---C---D---B---B
+# \ /
+# B-------B
+test_expect_success 'last-modified merge undoes changes' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit b1 file A &&
+ test_commit b2 file B &&
+ test_commit b3 file C &&
+ test_commit b4 file D &&
+ git checkout b2 &&
+ test_commit b5 file2 2 &&
+ git checkout b4 &&
+ git merge --no-commit --no-ff b5 &&
+ git checkout b2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ b2 file
+ b5 file2
+ EOF
+'
+
+test_expect_success 'last-modified complains about unknown arguments' '
+ test_must_fail git last-modified --foo 2>err &&
+ grep "unknown last-modified argument: --foo" err
+'
+
+test_done
--
2.50.0.rc0.18.gfcfe60668e
* [PATCH v4 2/3] t/perf: add last-modified perf script
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
` (5 preceding siblings ...)
2025-07-09 15:26 ` [PATCH v4 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-07-09 15:26 ` Toon Claes
2025-07-09 15:26 ` [PATCH v4 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-16 13:35 ` [PATCH v5 6/6] fixup! " Toon Claes
8 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-09 15:26 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Jeff King
This just runs some simple last-modified commands. We already test
correctness in the regular suite, so this is just about finding
performance regressions from one version to another.
Based-on-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
t/meson.build | 1 +
t/perf/p8020-last-modified.sh | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)
create mode 100755 t/perf/p8020-last-modified.sh
diff --git a/t/meson.build b/t/meson.build
index eee1863eb3..b41dfc41d7 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -1154,6 +1154,7 @@ benchmarks = [
'perf/p7820-grep-engines.sh',
'perf/p7821-grep-engines-fixed.sh',
'perf/p7822-grep-perl-character.sh',
+ 'perf/p8020-last-modified.sh',
'perf/p9210-scalar.sh',
'perf/p9300-fast-import-export.sh',
]
diff --git a/t/perf/p8020-last-modified.sh b/t/perf/p8020-last-modified.sh
new file mode 100755
index 0000000000..a02ec907d4
--- /dev/null
+++ b/t/perf/p8020-last-modified.sh
@@ -0,0 +1,21 @@
+#!/bin/sh
+
+test_description='last-modified perf tests'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_perf 'top-level last-modified' '
+ git last-modified HEAD
+'
+
+test_perf 'top-level recursive last-modified' '
+ git last-modified -r HEAD
+'
+
+test_perf 'subdir last-modified' '
+ path=$(git ls-tree HEAD | grep ^040000 | head -n 1 | cut -f2)
+ git last-modified -r HEAD -- "$path"
+'
+
+test_done
--
2.50.0.rc0.18.gfcfe60668e
* [PATCH v4 3/3] last-modified: use Bloom filters when available
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
` (6 preceding siblings ...)
2025-07-09 15:26 ` [PATCH v4 2/3] t/perf: add last-modified perf script Toon Claes
@ 2025-07-09 15:26 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 6/6] fixup! " Toon Claes
8 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-09 15:26 UTC (permalink / raw)
To: git; +Cc: Toon Claes, Taylor Blau
Our 'git last-modified' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.
When changed-path Bloom filters are available, we can avoid computing
many such diffs. Before computing a diff, we first check if any of the
remaining paths of interest were possibly changed at a given commit by
consulting its Bloom filter. If any of them may have changed, we still
have to compute the diff.
If none of those queries returned "maybe", we know that the given commit
doesn't contain any changed paths which are interesting to us. So, we
can skip computing the diff in this case.
Comparing the perf test results on git.git:
Test                                        HEAD~             HEAD
------------------------------------------------------------------------------------
8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%
Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
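To illustrate the pre-check described in the commit message, here is a
small standalone sketch. It is an assumption-laden toy, not Git's
changed-path filter: the filter is a single 64-bit word, the hash is a
seeded FNV-1a rather than the hashing Git actually uses, and all
`toy_*` names are hypothetical. The point is only the decision logic:
a missing filter or any "maybe" answer means we must diff, and only
when every remaining path gets "definitely not" can the diff be skipped.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_HASHES 3

struct toy_filter {
	uint64_t bits;
};

/* Seeded FNV-1a; a stand-in for the real hash (assumption). */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/* Record that `path` was changed by the commit owning this filter. */
static void toy_filter_add(struct toy_filter *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_NR_HASHES; i++)
		f->bits |= UINT64_C(1) << (toy_hash(path, i) % 64);
}

/* Returns 0 for "definitely not changed", 1 for "maybe changed". */
static int toy_filter_contains(const struct toy_filter *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_NR_HASHES; i++)
		if (!(f->bits & (UINT64_C(1) << (toy_hash(path, i) % 64))))
			return 0;
	return 1;
}

/*
 * Mirrors the shape of the check in this patch: return 1 when the
 * diff has to be computed, 0 when the commit can be skipped.
 */
static int toy_maybe_changed_path(const struct toy_filter *f,
				  const char *const *paths, size_t nr)
{
	if (!f)
		return 1; /* no filter available: cannot rule anything out */

	for (size_t i = 0; i < nr; i++)
		if (toy_filter_contains(f, paths[i]))
			return 1;
	return 0;
}
```

Because a Bloom filter has no false negatives, a path that was added to
the filter always yields "maybe". False positives are possible, which
is why a "maybe" only means the diff still has to be computed, not that
the path really changed.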
builtin/last-modified.c | 45 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index 63993bc1c9..466df04fba 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -1,5 +1,7 @@
#include "git-compat-util.h"
+#include "bloom.h"
#include "builtin.h"
+#include "commit-graph.h"
#include "commit.h"
#include "config.h"
#include "diff.h"
@@ -17,6 +19,7 @@
struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
+ struct bloom_key key;
const char path[FLEX_ARRAY];
};
@@ -40,6 +43,12 @@ struct last_modified {
static void last_modified_release(struct last_modified *lm)
{
+ struct hashmap_iter iter;
+ struct last_modified_entry *ent;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
+ clear_bloom_key(&ent->key);
+
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
}
@@ -67,6 +76,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
+ if (lm->rev.bloom_filter_settings)
+ fill_bloom_key(path, strlen(path), &ent->key,
+ lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
}
@@ -126,6 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
+ clear_bloom_key(&ent->key);
free(ent);
}
@@ -169,6 +182,28 @@ static void last_modified_diff(struct diff_queue_struct *q,
}
}
+
+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
+ struct last_modified_entry *ent;
+ struct hashmap_iter iter;
+
+ if (!lm->rev.bloom_filter_settings)
+ return 1;
+
+ filter = get_bloom_filter(lm->rev.repo, origin);
+ if (!filter)
+ return 1;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ if (bloom_filter_contains(filter, &ent->key,
+ lm->rev.bloom_filter_settings))
+ return 1;
+ }
+ return 0;
+}
+
static int last_modified_run(struct last_modified *lm,
last_modified_callback cb, void *cbdata)
{
@@ -189,6 +224,9 @@ static int last_modified_run(struct last_modified *lm,
if (!data.commit)
break;
+ if (!maybe_changed_path(lm, data.commit))
+ continue;
+
if (data.commit->object.flags & BOUNDARY) {
diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
&data.commit->object.oid, "",
@@ -238,6 +276,13 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
return argc;
}
+ /*
+ * We're not interested in generation numbers here,
+ * but calling this function to prepare the commit-graph.
+ */
+ (void)generation_numbers_enabled(lm->rev.repo);
+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
if (populate_paths_from_revs(lm) < 0)
return error(_("unable to setup last-modified"));
--
2.50.0.rc0.18.gfcfe60668e
* Re: [PATCH RFC v2 0/5] Introduce git-last-modified(1) command
2025-07-02 13:00 ` Toon Claes
@ 2025-07-09 15:53 ` Toon Claes
2025-07-09 17:00 ` Junio C Hamano
0 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-09 15:53 UTC (permalink / raw)
To: Kristoffer Haugsbakk, Junio C Hamano
Cc: git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Toon Claes <toon@iotcl.com> writes:
> You raise a good point here. Let's compare:
>
> $ git ls-tree HEAD -- refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
> 100644 blob 936e0c5130fe7d67f645501fbb9e70b94b437f54 Documentation/git-config.adoc
> 100644 blob 1af38f402ed6437353fb5765f62251966d828df9 Documentation/git-last-modified.adoc
> 100644 blob dce5c49ca2ba65fd6a2974e38f67134215bee369 refs.c
> 100644 blob 46a6008e07f2624239139cd8b2ff712545f07d3f refs.h
>
> $ git last-modified HEAD -- refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
> 56073a0af90be947cfefbfc3cf762b268e5e20a9 Documentation
> 062b914c841329a003f74e1340ea5178391274a6 refs.c
> 47478802daddf3f9916111307f153c6298ffc0bc refs.h
>
> I have to agree with Kristoffer here, and the latter is not what I
> would expect. Thanks for the testing! I will try to address in next
> version.
After some more testing and tinkering with the code, I've decided to
keep the behavior for several reasons:
1. While behavior differs from git-ls-tree(1) (see above), current
behavior is identical to git-diff-tree(1):
$ git diff-tree HEAD~1000 HEAD -- refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
:040000 040000 810861a07e1360d3e3fa00db3c0d01e0604ff27a 1b01b770c15e7ae586452bb3587c3ce7c01abd91 M Documentation
:100644 100644 55d2e0b2cb9e959443e98eb329fdf97eff9073a9 dce5c49ca2ba65fd6a2974e38f67134215bee369 M refs.c
:100644 100644 d278775e086bfa7990999c226ad1db2f488e890d 46a6008e07f2624239139cd8b2ff712545f07d3f M refs.h
Both git-diff-tree(1) and git-last-modified(1) are marked as plumbing
commands, git-ls-tree(1) isn't. So I think that's okay.
2. This command uses the diff machinery to walk the trees, so it's not
straightforward to change behavior.
3. We can later introduce `--max-depth` as Peff has been trying to do in
the past, but that can happen outside this patch series. The patches
for it were taken out in a previous attempt[1] to upstream the
git-blame-tree(1) subcommand.
Having the `--max-depth` option would allow users to get all tree
entries in, for example, the Documentation directory, while not getting
any from its subtrees.
With these considerations I've submitted[2] a new version of my patches.
[1]: https://lore.kernel.org/git/patch-1.1-0ea849d900b-20230205T204104Z-avarab@gmail.com/
[2]: https://lore.kernel.org/git/20250709152628.1644521-1-toon@iotcl.com/
--
Cheers,
Toon
* Re: [PATCH RFC v2 0/5] Introduce git-last-modified(1) command
2025-07-09 15:53 ` Toon Claes
@ 2025-07-09 17:00 ` Junio C Hamano
0 siblings, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-07-09 17:00 UTC (permalink / raw)
To: Toon Claes
Cc: Kristoffer Haugsbakk, git, Jeff King, Taylor Blau, Derrick Stolee,
Ævar Arnfjörð Bjarmason
Toon Claes <toon@iotcl.com> writes:
> After some more testing and tinkering with the code, I've decided to
> keep the behavior for several reasons:
>
> 1. While behavior differs from git-ls-tree(1) (see above), current
> behavior is identical to git-diff-tree(1):
>
> $ git diff-tree HEAD~1000 HEAD -- refs.c refs.h Documentation/git-last-modified.adoc Documentation/git-config.adoc
> :040000 040000 810861a07e1360d3e3fa00db3c0d01e0604ff27a 1b01b770c15e7ae586452bb3587c3ce7c01abd91 M Documentation
> :100644 100644 55d2e0b2cb9e959443e98eb329fdf97eff9073a9 dce5c49ca2ba65fd6a2974e38f67134215bee369 M refs.c
> :100644 100644 d278775e086bfa7990999c226ad1db2f488e890d 46a6008e07f2624239139cd8b2ff712545f07d3f M refs.h
>
> Both git-diff-tree(1) and git-last-modified(1) are marked as plumbing
> commands, git-ls-tree(1) isn't. So I think that's okay.
Good.
We may want to "fix" this "inconsistency" someday, and I think it is
a bug that ls-tree is not marked as plumbing. But this is a topic
about last-modified, so it is fine.
Thanks.
* Re: [PATCH v4 0/3] Introduce git-last-modified(1) command
2025-07-09 15:26 ` [PATCH v4 " Toon Claes
@ 2025-07-09 21:57 ` Junio C Hamano
2025-07-10 18:37 ` Junio C Hamano
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
1 sibling, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-07-09 21:57 UTC (permalink / raw)
To: Toon Claes, Lidong Yan
Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau
Toon Claes <toon@iotcl.com> writes:
> This series adds the git-last-modified(1) subcommand. In the past the
> subcommand was proposed[1] to be named git-blame-tree(1). This version
> is based on the patches shared by the kind people at GitHub[2].
You do not have to deal with it just yet, but FYI, another topic in
flight renames away a few bloom API functions that this topic adds
more callers of.
If this topic needs to be rerolled after the other topic graduates
to 'master', we may need to see this topic rebased on a newer
'master' with something like the attached patch squashed in, but
because the other topic is not close to 'next' yet, let's keep these
two topics independent from each other as long as possible, and let
me deal with this trivial semantic conflict resolution, at least for
now.
Thanks.
builtin/last-modified.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index 466df04fba..2beae026cc 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -47,7 +47,7 @@ static void last_modified_release(struct last_modified *lm)
struct last_modified_entry *ent;
hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
- clear_bloom_key(&ent->key);
+ bloom_key_clear(&ent->key);
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
@@ -77,7 +77,7 @@ static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
if (lm->rev.bloom_filter_settings)
- fill_bloom_key(path, strlen(path), &ent->key,
+ bloom_key_fill(path, strlen(path), &ent->key,
lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
@@ -138,7 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
- clear_bloom_key(&ent->key);
+ bloom_key_clear(&ent->key);
free(ent);
}
--
2.50.1-382-gda22511645
* Re: [PATCH v4 0/3] Introduce git-last-modified(1) command
2025-07-09 21:57 ` Junio C Hamano
@ 2025-07-10 18:37 ` Junio C Hamano
0 siblings, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-07-10 18:37 UTC (permalink / raw)
To: Toon Claes
Cc: Lidong Yan, git, Kristoffer Haugsbakk, Derrick Stolee,
Taylor Blau
Junio C Hamano <gitster@pobox.com> writes:
> Toon Claes <toon@iotcl.com> writes:
>
>> This series adds the git-last-modified(1) subcommand. In the past the
>> subcommand was proposed[1] to be named git-blame-tree(1). This version
>> is based on the patches shared by the kind people at GitHub[2].
>
> You do not have to deal with it just yet, but FYI, another topic in
> flight renames away a few bloom API functions that this topic adds
> more callers of.
>
> If this topic needs to be rerolled after the other topic graduates
> to 'master', we may need to see this topic rebased on a newer
> 'master' with something like the attached patch squashed in, but
> because the other topic is not close to 'next' yet, let's keep these
> two topics independent from each other as long as possible, and let
> me deal with this trivial semantic conflict resolution, at least for
> now.
>
> Thanks.
The situation wrt what you need to do hasn't changed, but the other
topic reshuffled the order of parameters for a few API functions.
The result does make more sense to have the key structure as the
first parameter, but a fallout is that the way the new calls added
by this series are massaged to fit the new world order needs to be
updated.
Again, just FYI, if you ended up needing to rebase your topic on top
of the other one, the following would become necessary.
builtin/last-modified.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index 466df04fba..70e1e14f21 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -47,7 +47,7 @@ static void last_modified_release(struct last_modified *lm)
struct last_modified_entry *ent;
hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
- clear_bloom_key(&ent->key);
+ bloom_key_clear(&ent->key);
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
@@ -77,7 +77,7 @@ static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
if (lm->rev.bloom_filter_settings)
- fill_bloom_key(path, strlen(path), &ent->key,
+ bloom_key_fill(&ent->key, path, strlen(path),
lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
@@ -138,7 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
- clear_bloom_key(&ent->key);
+ bloom_key_clear(&ent->key);
free(ent);
}
--
2.50.1-394-g0a41f16de2
* [PATCH v5 0/6] Introduce git-last-modified(1) command
2025-07-09 15:26 ` [PATCH v4 " Toon Claes
2025-07-09 21:57 ` Junio C Hamano
@ 2025-07-16 13:32 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified Toon Claes
` (11 more replies)
1 sibling, 12 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-16 13:32 UTC (permalink / raw)
To: git
Cc: Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau, Junio C Hamano,
Toon Claes
On many forges the tree view is shown in combination with commit data.
In such a view each tree entry is accompanied by the message and date of
the commit that last modified that tree entry. Something like:
| README.md | README: *.txt -> *.adoc fixes | 4 months ago |
| RelNotes | Start 2.51 cycle, the first batch | 4 weeks ago |
| SECURITY.md | SECURITY: describe how to report vulnerabilities | 4 years |
| abspath.c | abspath: move related functions to abspath | 2 years |
| abspath.h | abspath: move related functions to abspath | 2 years |
| aclocal.m4 | configure: use AC_LANG_PROGRAM consistently | 15 years ago |
| add-patch.c | pager: stop using `the_repository` | 7 months ago |
| advice.c | advice: allow disabling default branch name advice | 4 months ago |
| advice.h | advice: allow disabling default branch name advice | 4 months ago |
| alias.h | rebase -m: fix serialization of strategy options | 2 years |
| alloc.h | git-compat-util: move alloc macros to git-compat-util.h | 2 years ago |
| apply.c | apply: only write intents to add for new files | 8 days ago |
| archive.c | Merge branch 'ps/parse-options-integers' | 3 months ago |
| archive.h | archive.h: remove unnecessary include | 1 year |
| attr.h | fuzz: port fuzz-parse-attr-line from OSS-Fuzz | 9 months ago |
| banned.h | banned.h: mark `strtok()` and `strtok_r()` as banned | 2 years |
This series adds the git-last-modified(1) subcommand to feed this view.
In the past the subcommand was proposed[1] to be named
git-blame-tree(1). This version is based on the patches shared by the
kind people at GitHub[2].
What is different from the series shared by GitHub:
* Renamed the subcommand from `blame-tree` to `last-modified`. There was
some consensus[5] this name works better, so let's give it a try and
see how this name feels.
* Patches for --max-depth are excluded. I think it's a separate topic to
discuss and I'm not sure it needs to be part of this series anyway. The
main patch was submitted in the previous attempt[3] and if people
consider it valuable, I'm happy to discuss that in a separate patch
series.
* The last-modified command isn't recursive by default. If you want to
recurse into subtrees, you need to pass `-r`.
* The patches in 'tb/blame-tree' at Taylor's fork[4] implement a
caching layer. This feature reads/writes cached results in
`.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
size, that feature is excluded from this series. I think it's better
to submit this as a separate series.
* Squashed various commits together. For example, they introduced a flag
`--go-faster`, which later became the default and only implementation.
That story was wrapped up in a single commit.
* Dropped the patches that attempt to increase performance for tree
entries that have not been updated in a long time. In my testing I've
seen both performance improvements *and* degradation with these
changes:
Test HEAD~ HEAD
------------------------------------------------------------------------------------
8020.1: top-level last-modified 4.52(4.38+0.11) 2.03(1.93+0.08) -55.1%
8020.2: top-level recursive last-modified 5.79(5.64+0.11) 8.34(8.17+0.11) +44.0%
8020.3: subdir last-modified 0.15(0.09+0.06) 0.19(0.14+0.06) +26.7%
Before we include these patches, I want to make sure these changes
have a positive impact in all/most scenarios. This can happen in a
separate series.
I've set myself as the author and added Based-on-patch-by trailers to
credit the original authors. Let me know if you disagree.
Again thanks to Taylor and the people at GitHub for sharing these
patches. I hope we can work together to get this upstreamed.
[1]: https://lore.kernel.org/git/patch-1.1-0ea849d900b-20230205T204104Z-avarab@gmail.com/
[2]: https://lore.kernel.org/git/Z+XJ+1L3PnC9Dyba@nand.local/
[3]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-3-4173133f3786@iotcl.com/
[4]: git@github.com:ttaylorr/git.git
[5]: https://lore.kernel.org/git/aCbBKj7O9LjO3SMK@pks.im/
---
Changes in v5:
- Added a patch to allow for an "extended" format. The name for this option is
open for debate (please, all input is welcome). But the main goal of this
series is to provide the data needed for the "forge tree view" as demoed at
the top of this cover letter. With this extra patch (and the preparatory patch
to pretty.[ch]), I hope the use-case becomes clearer. But because it wasn't
included in the previous 4 versions I also wouldn't mind sending a separate patch
series for it.
- Removed the call to sort(1) in the t8020 tests. This was needed for the tests for
--extended.
- I'm adding a fixup! commit to be compatible with in-flight patches for bloom
filter optimizations:
https://lore.kernel.org/git/20250712093517.17907-1-yldhome2d2@gmail.com/
This patch can be dropped if the current series lands before those.
Changes in v4:
- Removed root-level `last-modified.[ch]` library code and moved code to
`builtin/last-modified.c`. Historically we've had library code (also because it
was used in testtool), but we no longer need that separation. I'm sorry this
makes the range-diff hard to read.
- Added the use of parse_options() to get better usage messages.
- Formatting fixes after conversation in
https://lore.kernel.org/git/xmqqh5zvk5h0.fsf@gitster.g/
- Link to v3: https://lore.kernel.org/git/20250630-toon-new-blame-tree-v3-0-3516025dc3bc@iotcl.com/
Changes in v3:
- Updated benchmarks in commit messages.
- Removed the patches that attempt to increase performance for tree
entries that have not been updated in a long time. (see above)
- Move handling failure in `last_modified_init()` to the caller.
- Sorted #include clauses lexicographically.
- Removed unneeded `commit` in `struct last_modified_entry`.
- Renamed some functions/variables and added some comments to make it
easier to understand.
- Removed unnecessary checking of the commit-graph generation number.
- Link to v2: https://lore.kernel.org/r/20250523-toon-new-blame-tree-v2-0-101e4ca4c1c9@iotcl.com
Changes in v2:
- The subcommand is renamed from `blame-tree` to `last-modified`
- Documentation is added. Here we mark the command as experimental.
- Some test cases are added related to merges.
- Link to v1: https://lore.kernel.org/r/20250422-toon-new-blame-tree-v1-0-fdb51b8a394a@iotcl.com
Toon Claes (6):
last-modified: new subcommand to show when files were last modified
t/perf: add last-modified perf script
last-modified: use Bloom filters when available
pretty: allow caller to disable indentation
last-modified: support --extended format
fixup! last-modified: use Bloom filters when available
.gitignore | 1 +
Documentation/git-last-modified.adoc | 95 +++++++
Documentation/meson.build | 1 +
Makefile | 1 +
builtin.h | 1 +
builtin/last-modified.c | 366 +++++++++++++++++++++++++++
command-list.txt | 1 +
git.c | 1 +
meson.build | 1 +
pretty.c | 2 +-
pretty.h | 1 +
t/meson.build | 2 +
t/perf/p8020-last-modified.sh | 21 ++
t/t8020-last-modified.sh | 225 ++++++++++++++++
14 files changed, 718 insertions(+), 1 deletion(-)
create mode 100644 Documentation/git-last-modified.adoc
create mode 100644 builtin/last-modified.c
create mode 100755 t/perf/p8020-last-modified.sh
create mode 100755 t/t8020-last-modified.sh
Range-diff against v4:
1: 0cc625f3f5 ! 1: da0e391faa last-modified: new subcommand to show when files were last modified
@@ t/t8020-last-modified.sh (new)
+ git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
-+ tr '\t' ' ' <tmp.2 >tmp.3 &&
-+ sort tmp.3 >actual &&
++ tr '\t' ' ' <tmp.2 >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'last-modified non-recursive' '
+ check_last_modified <<-\EOF
-+ 1 file
+ 3 a
++ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified recursive' '
+ check_last_modified -r <<-\EOF
-+ 1 file
-+ 2 a/file
+ 3 a/b/file
++ 2 a/file
++ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified recursive with tree' '
+ check_last_modified -t <<-\EOF
-+ 1 file
-+ 2 a/file
+ 3 a
+ 3 a/b
+ 3 a/b/file
++ 2 a/file
++ 1 file
+ EOF
+'
+
@@ t/t8020-last-modified.sh (new)
+
+test_expect_success 'last-modified subdir recursive' '
+ check_last_modified -r a <<-\EOF
-+ 2 a/file
+ 3 a/b/file
++ 2 a/file
+ EOF
+'
+
+test_expect_success 'last-modified from non-HEAD commit' '
+ check_last_modified HEAD^ <<-\EOF
-+ 1 file
+ 2 a
++ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified from subdir defaults to root' '
+ check_last_modified -C a <<-\EOF
-+ 1 file
+ 3 a
++ 1 file
+ EOF
+'
+
@@ t/t8020-last-modified.sh (new)
+ test_commit m2 &&
+ git merge m1 &&
+ check_last_modified <<-\EOF
-+ m1 m1.t
+ m2 m2.t
++ m1 m1.t
+ EOF
+'
+
@@ t/t8020-last-modified.sh (new)
+ git checkout b2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
-+ b2 file
+ b5 file2
++ b2 file
+ EOF
+'
+
2: a017f2c81c = 2: 9182f86440 t/perf: add last-modified perf script
3: c739a7dbcc = 3: b4fa376572 last-modified: use Bloom filters when available
-: ---------- > 4: 3df7833d59 pretty: allow caller to disable indentation
-: ---------- > 5: c2ac21c057 last-modified: support --extended format
-: ---------- > 6: 0be24d898d fixup! last-modified: use Bloom filters when available
base-commit: 32571a0222eb85ef265e136f27e44c302302b45c
--
2.50.1.327.g047016eb4a
* [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
@ 2025-07-16 13:35 ` Toon Claes
2025-07-18 0:02 ` Taylor Blau
2025-07-16 13:35 ` [PATCH v5 2/6] t/perf: add last-modified perf script Toon Claes
` (10 subsequent siblings)
11 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-16 13:35 UTC (permalink / raw)
To: git
Cc: Toon Claes, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau,
Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason
Similar to git-blame(1), introduce a new subcommand
git-last-modified(1). This command shows the most recent modification to
paths in a tree. It does so by expanding the tree at a given commit,
taking note of the current state of each path, and then walking
backwards through history looking for commits where each path changed
into its final object ID.
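The walk described above can be sketched on a toy model. This is not the implementation in builtin/last-modified.c: the `toy_*` names, the fixed number of paths and the integer "blob ids" are assumptions for illustration, and the sketch only handles a linear history, whereas the real command relies on the revision walk and diff machinery to deal with merges.

```c
#include <stddef.h>

#define TOY_NPATHS 2

/* Hypothetical commit: one blob id per tracked path, 0 means absent. */
struct toy_commit {
	const char *name;
	int blob[TOY_NPATHS];
};

/*
 * history[0] is the newest commit and history[nr - 1] is the root.
 * For every path present in the newest tree, store in result[] the
 * index of the commit that last modified it, or -1 if none was found.
 */
static void toy_last_modified(const struct toy_commit *history, size_t nr,
			      int *result)
{
	int pending = 0;

	/* Take note of the current state of each path. */
	for (int p = 0; p < TOY_NPATHS; p++) {
		result[p] = -1;
		if (nr && history[0].blob[p])
			pending++;
	}

	/* Walk backwards until every path has been attributed. */
	for (size_t i = 0; i < nr && pending; i++) {
		for (int p = 0; p < TOY_NPATHS; p++) {
			int cur = history[i].blob[p];
			int parent = i + 1 < nr ? history[i + 1].blob[p] : 0;

			if (result[p] != -1 || !history[0].blob[p])
				continue; /* already done, or not in final tree */
			if (cur == parent)
				continue; /* this commit did not touch the path */
			if (cur != history[0].blob[p])
				continue; /* did not arrive at the final state */
			result[p] = (int)i;
			pending--;
		}
	}
}
```

The walk stops as soon as no paths are pending, which is why a tree whose entries were all touched recently is cheap to resolve while a single long-untouched entry forces a walk deep into history.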
Based-on-patch-by: Jeff King <peff@peff.net>
Improved-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
.gitignore | 1 +
Documentation/git-last-modified.adoc | 49 +++++
Documentation/meson.build | 1 +
Makefile | 1 +
builtin.h | 1 +
builtin/last-modified.c | 289 +++++++++++++++++++++++++++
command-list.txt | 1 +
git.c | 1 +
meson.build | 1 +
t/meson.build | 1 +
t/t8020-last-modified.sh | 203 +++++++++++++++++++
11 files changed, 549 insertions(+)
create mode 100644 Documentation/git-last-modified.adoc
create mode 100644 builtin/last-modified.c
create mode 100755 t/t8020-last-modified.sh
diff --git a/.gitignore b/.gitignore
index 04c444404e..a36ee94443 100644
--- a/.gitignore
+++ b/.gitignore
@@ -87,6 +87,7 @@
/git-init-db
/git-interpret-trailers
/git-instaweb
+/git-last-modified
/git-log
/git-ls-files
/git-ls-remote
diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
new file mode 100644
index 0000000000..89138ebeb7
--- /dev/null
+++ b/Documentation/git-last-modified.adoc
@@ -0,0 +1,49 @@
+git-last-modified(1)
+====================
+
+NAME
+----
+git-last-modified - EXPERIMENTAL: Show when files were last modified
+
+
+SYNOPSIS
+--------
+[synopsis]
+git last-modified [-r] [-t] [<revision-range>] [[--] <path>...]
+
+DESCRIPTION
+-----------
+
+Shows which commit last modified each of the relevant files and subdirectories.
+
+THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
+
+OPTIONS
+-------
+
+-r::
+ Recurse into subtrees.
+
+-t::
+ Show tree entry itself as well as subtrees. Implies `-r`.
+
+<revision-range>::
+ Only traverse commits in the specified revision range. When no
+ `<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
+ history leading to the current commit). For a complete list of ways to
+ spell `<revision-range>`, see the 'Specifying Ranges' section of
+ linkgit:gitrevisions[7].
+
+[--] <path>...::
+ For each _<path>_ given, the commit which last modified it is returned.
+ Without an optional path parameter, all files and subdirectories
+ in the path traversal are included in the output.
+
+SEE ALSO
+--------
+linkgit:git-blame[1],
+linkgit:git-log[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index 2fe1a1369d..99aeb6d0e0 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -74,6 +74,7 @@ manpages = {
'git-init.adoc' : 1,
'git-instaweb.adoc' : 1,
'git-interpret-trailers.adoc' : 1,
+ 'git-last-modified.adoc' : 1,
'git-log.adoc' : 1,
'git-ls-files.adoc' : 1,
'git-ls-remote.adoc' : 1,
diff --git a/Makefile b/Makefile
index 5f7dd79dfa..b5ce55a703 100644
--- a/Makefile
+++ b/Makefile
@@ -1265,6 +1265,7 @@ BUILTIN_OBJS += builtin/hook.o
BUILTIN_OBJS += builtin/index-pack.o
BUILTIN_OBJS += builtin/init-db.o
BUILTIN_OBJS += builtin/interpret-trailers.o
+BUILTIN_OBJS += builtin/last-modified.o
BUILTIN_OBJS += builtin/log.o
BUILTIN_OBJS += builtin/ls-files.o
BUILTIN_OBJS += builtin/ls-remote.o
diff --git a/builtin.h b/builtin.h
index bff13e3069..6ed6759ec4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -176,6 +176,7 @@ int cmd_hook(int argc, const char **argv, const char *prefix, struct repository
int cmd_index_pack(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_init_db(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_interpret_trailers(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_last_modified(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log_reflog(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_ls_files(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
new file mode 100644
index 0000000000..63993bc1c9
--- /dev/null
+++ b/builtin/last-modified.c
@@ -0,0 +1,289 @@
+#include "git-compat-util.h"
+#include "builtin.h"
+#include "commit.h"
+#include "config.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "log-tree.h"
+#include "object-name.h"
+#include "object.h"
+#include "parse-options.h"
+#include "quote.h"
+#include "repository.h"
+#include "revision.h"
+
+struct last_modified_entry {
+ struct hashmap_entry hashent;
+ struct object_id oid;
+ const char path[FLEX_ARRAY];
+};
+
+static int last_modified_entry_hashcmp(const void *unused UNUSED,
+ const struct hashmap_entry *hent1,
+ const struct hashmap_entry *hent2,
+ const void *path)
+{
+ const struct last_modified_entry *ent1 =
+ container_of(hent1, const struct last_modified_entry, hashent);
+ const struct last_modified_entry *ent2 =
+ container_of(hent2, const struct last_modified_entry, hashent);
+ return strcmp(ent1->path, path ? path : ent2->path);
+}
+
+struct last_modified {
+ struct hashmap paths;
+ struct rev_info rev;
+ int recursive, tree_in_recursive;
+};
+
+static void last_modified_release(struct last_modified *lm)
+{
+ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
+ release_revisions(&lm->rev);
+}
+
+typedef void (*last_modified_callback)(const char *path,
+ const struct commit *commit, void *data);
+
+struct last_modified_callback_data {
+ struct commit *commit;
+ struct hashmap *paths;
+
+ last_modified_callback callback;
+ void *callback_data;
+};
+
+static void add_path_from_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *data)
+{
+ struct last_modified *lm = data;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct last_modified_entry *ent;
+ const char *path = p->two->path;
+
+ FLEX_ALLOC_STR(ent, path, path);
+ oidcpy(&ent->oid, &p->two->oid);
+ hashmap_entry_init(&ent->hashent, strhash(ent->path));
+ hashmap_add(&lm->paths, &ent->hashent);
+ }
+}
+
+static int populate_paths_from_revs(struct last_modified *lm)
+{
+ int num_interesting = 0;
+ struct diff_options diffopt;
+
+ memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
+ copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
+ /*
+ * Use a callback to populate the paths from revs
+ */
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ diffopt.format_callback = add_path_from_diff;
+ diffopt.format_callback_data = lm;
+
+ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
+ struct object_array_entry *obj = lm->rev.pending.objects + i;
+
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
+ if (num_interesting++)
+ return error(_("can only get last-modified one tree at a time"));
+
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
+ diff_free(&diffopt);
+
+ return 0;
+}
+
+static void mark_path(const char *path, const struct object_id *oid,
+ struct last_modified_callback_data *data)
+{
+ struct last_modified_entry *ent;
+
+ /* Is it even a path that we are interested in? */
+ ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
+ struct last_modified_entry, hashent);
+ if (!ent)
+ return;
+
+ /*
+ * Is it arriving at a version of interest, or is it from a side branch
+ * which did not contribute to the final state?
+ */
+ if (!oideq(oid, &ent->oid))
+ return;
+
+ if (data->callback)
+ data->callback(path, data->commit, data->callback_data);
+
+ hashmap_remove(data->paths, &ent->hashent, path);
+ free(ent);
+}
+
+static void last_modified_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *cbdata)
+{
+ struct last_modified_callback_data *data = cbdata;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ switch (p->status) {
+ case DIFF_STATUS_DELETED:
+ /*
+ * There's no point in feeding a deletion, as it could
+ * not have resulted in our current state, which
+ * actually has the file.
+ */
+ break;
+
+ default:
+ /*
+ * Otherwise, we care only that we somehow arrived at
+ * a final oid state. Note that this covers some
+ * potentially controversial areas, including:
+ *
+ * 1. A rename or copy will be found, as it is the
+ * first time the content has arrived at the given
+ * path.
+ *
+ * 2. Even a non-content modification like a mode or
+ * type change will trigger it.
+ *
+ * We take the inclusive approach for now, and find
+ * anything which impacts the path. Options to tweak
+ * the behavior (e.g., to "--follow" the content across
+ * renames) can come later.
+ */
+ mark_path(p->two->path, &p->two->oid, data);
+ break;
+ }
+ }
+}
+
+static int last_modified_run(struct last_modified *lm,
+ last_modified_callback cb, void *cbdata)
+{
+ struct last_modified_callback_data data;
+
+ data.paths = &lm->paths;
+ data.callback = cb;
+ data.callback_data = cbdata;
+
+ lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ lm->rev.diffopt.format_callback = last_modified_diff;
+ lm->rev.diffopt.format_callback_data = &data;
+
+ prepare_revision_walk(&lm->rev);
+
+ while (hashmap_get_size(&lm->paths)) {
+ data.commit = get_revision(&lm->rev);
+ if (!data.commit)
+ break;
+
+ if (data.commit->object.flags & BOUNDARY) {
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &data.commit->object.oid, "",
+ &lm->rev.diffopt);
+ diff_flush(&lm->rev.diffopt);
+ } else {
+ log_tree_commit(&lm->rev, data.commit);
+ }
+ }
+
+ return 0;
+}
+
+static void show_entry(const char *path, const struct commit *commit, void *d)
+{
+ struct last_modified *lm = d;
+
+ if (commit->object.flags & BOUNDARY)
+ putchar('^');
+ printf("%s\t", oid_to_hex(&commit->object.oid));
+
+ if (lm->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+
+ fflush(stdout);
+}
+
+static int last_modified_init(struct last_modified *lm, struct repository *r,
+ const char *prefix, int argc, const char **argv)
+{
+ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
+
+ repo_init_revisions(r, &lm->rev, prefix);
+ lm->rev.def = "HEAD";
+ lm->rev.combine_merges = 1;
+ lm->rev.show_root_diff = 1;
+ lm->rev.boundary = 1;
+ lm->rev.no_commit_id = 1;
+ lm->rev.diff = 1;
+ lm->rev.diffopt.flags.recursive = lm->recursive || lm->tree_in_recursive;
+ lm->rev.diffopt.flags.tree_in_recursive = lm->tree_in_recursive;
+
+ if ((argc = setup_revisions(argc, argv, &lm->rev, NULL)) > 1) {
+ error(_("unknown last-modified argument: %s"), argv[1]);
+ return argc;
+ }
+
+ if (populate_paths_from_revs(lm) < 0)
+ return error(_("unable to setup last-modified"));
+
+ return 0;
+}
+
+int cmd_last_modified(int argc, const char **argv, const char *prefix,
+ struct repository *repo)
+{
+ int ret;
+ struct last_modified lm;
+
+ const char * const last_modified_usage[] = {
+ N_("git last-modified [-r] [-t] "
+ "[<revision-range>] [[--] <path>...]"),
+ NULL
+ };
+
+ struct option last_modified_options[] = {
+ OPT_BOOL('r', "recursive", &lm.recursive,
+ N_("recurse into subtrees")),
+ OPT_BOOL('t', "tree-in-recursive", &lm.tree_in_recursive,
+ N_("recurse into subtrees and include the tree entries too")),
+ OPT_END()
+ };
+
+ memset(&lm, 0, sizeof(lm));
+
+ argc = parse_options(argc, argv, prefix, last_modified_options,
+ last_modified_usage,
+ PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_UNKNOWN_OPT);
+
+ repo_config(repo, git_default_config, NULL);
+
+ if ((ret = last_modified_init(&lm, repo, prefix, argc, argv))) {
+ if (ret > 0)
+ usage_with_options(last_modified_usage,
+ last_modified_options);
+ goto out;
+ }
+
+ if ((ret = last_modified_run(&lm, show_entry, &lm)))
+ goto out;
+
+out:
+ last_modified_release(&lm);
+
+ return ret;
+}
diff --git a/command-list.txt b/command-list.txt
index b7ade3ab9f..b715777b24 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -124,6 +124,7 @@ git-index-pack plumbingmanipulators
git-init mainporcelain init
git-instaweb ancillaryinterrogators complete
git-interpret-trailers purehelpers
+git-last-modified plumbinginterrogators
git-log mainporcelain info
git-ls-files plumbinginterrogators
git-ls-remote plumbinginterrogators
diff --git a/git.c b/git.c
index 07a5fe39fb..76a0b2a1a4 100644
--- a/git.c
+++ b/git.c
@@ -565,6 +565,7 @@ static struct cmd_struct commands[] = {
{ "init", cmd_init_db },
{ "init-db", cmd_init_db },
{ "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
+ { "last-modified", cmd_last_modified, RUN_SETUP },
{ "log", cmd_log, RUN_SETUP },
{ "ls-files", cmd_ls_files, RUN_SETUP },
{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
diff --git a/meson.build b/meson.build
index 9579377f3d..cda6bf67fe 100644
--- a/meson.build
+++ b/meson.build
@@ -607,6 +607,7 @@ builtin_sources = [
'builtin/index-pack.c',
'builtin/init-db.c',
'builtin/interpret-trailers.c',
+ 'builtin/last-modified.c',
'builtin/log.c',
'builtin/ls-files.c',
'builtin/ls-remote.c',
diff --git a/t/meson.build b/t/meson.build
index 1af289425d..fc77343331 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -961,6 +961,7 @@ integration_tests = [
't8012-blame-colors.sh',
't8013-blame-ignore-revs.sh',
't8014-blame-ignore-fuzzy.sh',
+ 't8020-last-modified.sh',
't9001-send-email.sh',
't9002-column.sh',
't9003-help-autocorrect.sh',
diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
new file mode 100755
index 0000000000..05c113a1f8
--- /dev/null
+++ b/t/t8020-last-modified.sh
@@ -0,0 +1,203 @@
+#!/bin/sh
+
+test_description='last-modified tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_commit 1 file &&
+ mkdir a &&
+ test_commit 2 a/file &&
+ mkdir a/b &&
+ test_commit 3 a/b/file
+'
+
+test_expect_success 'cannot run last-modified on two trees' '
+ test_must_fail git last-modified HEAD HEAD~1
+'
+
+check_last_modified() {
+ local indir= &&
+ while test $# != 0
+ do
+ case "$1" in
+ -C)
+ indir="$2"
+ shift
+ ;;
+ *)
+ break
+ ;;
+ esac &&
+ shift
+ done &&
+
+ cat >expect &&
+ test_when_finished "rm -f tmp.*" &&
+ git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
+ tr '\t' ' ' <tmp.2 >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'last-modified non-recursive' '
+ check_last_modified <<-\EOF
+ 3 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified recursive' '
+ check_last_modified -r <<-\EOF
+ 3 a/b/file
+ 2 a/file
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified recursive with tree' '
+ check_last_modified -t <<-\EOF
+ 3 a
+ 3 a/b
+ 3 a/b/file
+ 2 a/file
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified subdir' '
+ check_last_modified a <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified subdir recursive' '
+ check_last_modified -r a <<-\EOF
+ 3 a/b/file
+ 2 a/file
+ EOF
+'
+
+test_expect_success 'last-modified from non-HEAD commit' '
+ check_last_modified HEAD^ <<-\EOF
+ 2 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified from subdir defaults to root' '
+ check_last_modified -C a <<-\EOF
+ 3 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified from subdir uses relative pathspecs' '
+ check_last_modified -C a -r b <<-\EOF
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by count' '
+ check_last_modified -1 <<-\EOF
+ 3 a
+ ^2 file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by commit' '
+ check_last_modified HEAD~2..HEAD <<-\EOF
+ 3 a
+ ^1 file
+ EOF
+'
+
+test_expect_success 'only last-modified files in the current tree' '
+ git rm -rf a &&
+ git commit -m "remove a" &&
+ check_last_modified <<-\EOF
+ 1 file
+ EOF
+'
+
+test_expect_success 'cross merge boundaries in blaming' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit m1 &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit m2 &&
+ git merge m1 &&
+ check_last_modified <<-\EOF
+ m2 m2.t
+ m1 m1.t
+ EOF
+'
+
+test_expect_success 'last-modified merge for resolved conflicts' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit c1 conflict &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit c2 conflict &&
+ test_must_fail git merge c1 &&
+ test_commit resolved conflict &&
+ check_last_modified conflict <<-\EOF
+ resolved conflict
+ EOF
+'
+
+
+# Consider `file` with this content through history:
+#
+# A---B---B-------B---B
+# \ /
+# C---D
+test_expect_success 'last-modified merge ignores content from branch' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit a1 file A &&
+ test_commit a2 file B &&
+ test_commit a3 file C &&
+ test_commit a4 file D &&
+ git checkout a2 &&
+ git merge --no-commit --no-ff a4 &&
+ git checkout a2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ a2 file
+ EOF
+'
+
+# Consider `file` with this content through history:
+#
+# A---B---B---C---D---B---B
+# \ /
+# B-------B
+test_expect_success 'last-modified merge undoes changes' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit b1 file A &&
+ test_commit b2 file B &&
+ test_commit b3 file C &&
+ test_commit b4 file D &&
+ git checkout b2 &&
+ test_commit b5 file2 2 &&
+ git checkout b4 &&
+ git merge --no-commit --no-ff b5 &&
+ git checkout b2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ b5 file2
+ b2 file
+ EOF
+'
+
+test_expect_success 'last-modified complains about unknown arguments' '
+ test_must_fail git last-modified --foo 2>err &&
+ grep "unknown last-modified argument: --foo" err
+'
+
+test_done
--
2.50.1.327.g047016eb4a
* [PATCH v5 2/6] t/perf: add last-modified perf script
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
2025-07-16 13:35 ` [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-07-16 13:35 ` Toon Claes
2025-07-18 0:08 ` Taylor Blau
2025-07-16 13:35 ` [PATCH v5 3/6] last-modified: use Bloom filters when available Toon Claes
` (9 subsequent siblings)
11 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-16 13:35 UTC (permalink / raw)
To: git
Cc: Toon Claes, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau,
Junio C Hamano, Jeff King
This just runs some simple last-modified commands. We already test
correctness in the regular suite, so this is just about finding
performance regressions from one version to another.
Based-on-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
t/meson.build | 1 +
t/perf/p8020-last-modified.sh | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)
create mode 100755 t/perf/p8020-last-modified.sh
diff --git a/t/meson.build b/t/meson.build
index fc77343331..567d524e91 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -1154,6 +1154,7 @@ benchmarks = [
'perf/p7820-grep-engines.sh',
'perf/p7821-grep-engines-fixed.sh',
'perf/p7822-grep-perl-character.sh',
+ 'perf/p8020-last-modified.sh',
'perf/p9210-scalar.sh',
'perf/p9300-fast-import-export.sh',
]
diff --git a/t/perf/p8020-last-modified.sh b/t/perf/p8020-last-modified.sh
new file mode 100755
index 0000000000..a02ec907d4
--- /dev/null
+++ b/t/perf/p8020-last-modified.sh
@@ -0,0 +1,21 @@
+#!/bin/sh
+
+test_description='last-modified perf tests'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_perf 'top-level last-modified' '
+ git last-modified HEAD
+'
+
+test_perf 'top-level recursive last-modified' '
+ git last-modified -r HEAD
+'
+
+test_perf 'subdir last-modified' '
+ path=$(git ls-tree HEAD | grep ^040000 | head -n 1 | cut -f2)
+ git last-modified -r HEAD -- "$path"
+'
+
+test_done
--
2.50.1.327.g047016eb4a
* [PATCH v5 3/6] last-modified: use Bloom filters when available
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
2025-07-16 13:35 ` [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-16 13:35 ` [PATCH v5 2/6] t/perf: add last-modified perf script Toon Claes
@ 2025-07-16 13:35 ` Toon Claes
2025-07-18 0:16 ` Taylor Blau
2025-07-16 13:35 ` [PATCH v5 4/6] pretty: allow caller to disable indentation Toon Claes
` (8 subsequent siblings)
11 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-16 13:35 UTC (permalink / raw)
To: git
Cc: Toon Claes, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau,
Junio C Hamano
Our 'git last-modified' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.
When changed-path Bloom filters are available, we can avoid computing
many such diffs. Before computing a diff, we first check if any of the
remaining paths of interest were possibly changed at a given commit by
consulting its Bloom filter. If the filter answers "maybe" for any of
them, we have no choice but to compute the diff.
If none of those queries returns "maybe", we know that the given commit
doesn't change any of the paths which are interesting to us, so we can
skip computing its diff.
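The pre-check described above can be sketched in isolation. What follows is a toy filter, not Git's changed-path Bloom filter: the names `toy_bloom`, `toy_hash`, `toy_bloom_add`, `toy_bloom_maybe_contains` and `toy_needs_diff`, the 256-bit size and the FNV-1a hashing are all assumptions made for illustration. It only shows the property the patch relies on: a filter can answer "definitely not" or "maybe", never "definitely yes".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical toy filter: 256 bits, two probes per key. Git's real
 * changed-path filters use different hashing and sizing.
 */
#define TOY_BITS 256

struct toy_bloom {
	uint8_t bits[TOY_BITS / 8];
};

/* FNV-1a, seeded so that the two probes use different hash functions. */
uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (uint8_t)*s;
		h *= 16777619u;
	}
	return h;
}

/* Record that `path` was changed by the commit owning this filter. */
void toy_bloom_add(struct toy_bloom *f, const char *path)
{
	for (uint32_t seed = 0; seed < 2; seed++) {
		uint32_t bit = toy_hash(path, seed) % TOY_BITS;

		f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* Returns 0 for "definitely not changed", 1 for "maybe changed". */
int toy_bloom_maybe_contains(const struct toy_bloom *f, const char *path)
{
	for (uint32_t seed = 0; seed < 2; seed++) {
		uint32_t bit = toy_hash(path, seed) % TOY_BITS;

		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return 0;
	}
	return 1;
}

/*
 * A diff is only needed if at least one path that is still of
 * interest might have been changed by the commit.
 */
int toy_needs_diff(const struct toy_bloom *commit_filter,
		   const char **remaining, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (toy_bloom_maybe_contains(commit_filter, remaining[i]))
			return 1;
	return 0;
}
```

A walk would call `toy_needs_diff()` once per commit and skip the diff whenever it returns 0. False positives cost only an unnecessary diff, so the result stays correct either way.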
Comparing the perf test results on git.git:
Test                                        HEAD~             HEAD
------------------------------------------------------------------------------------
8020.1: top-level last-modified             4.49(4.34+0.11)   2.22(2.05+0.09) -50.6%
8020.2: top-level recursive last-modified   5.64(5.45+0.11)   5.62(5.30+0.11) -0.4%
8020.3: subdir last-modified                0.11(0.06+0.04)   0.07(0.03+0.04) -36.4%
Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
builtin/last-modified.c | 45 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index 63993bc1c9..466df04fba 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -1,5 +1,7 @@
#include "git-compat-util.h"
+#include "bloom.h"
#include "builtin.h"
+#include "commit-graph.h"
#include "commit.h"
#include "config.h"
#include "diff.h"
@@ -17,6 +19,7 @@
struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
+ struct bloom_key key;
const char path[FLEX_ARRAY];
};
@@ -40,6 +43,12 @@ struct last_modified {
static void last_modified_release(struct last_modified *lm)
{
+ struct hashmap_iter iter;
+ struct last_modified_entry *ent;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
+ clear_bloom_key(&ent->key);
+
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
}
@@ -67,6 +76,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
+ if (lm->rev.bloom_filter_settings)
+ fill_bloom_key(path, strlen(path), &ent->key,
+ lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
}
@@ -126,6 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
+ clear_bloom_key(&ent->key);
free(ent);
}
@@ -169,6 +182,28 @@ static void last_modified_diff(struct diff_queue_struct *q,
}
}
+
+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
+ struct last_modified_entry *ent;
+ struct hashmap_iter iter;
+
+ if (!lm->rev.bloom_filter_settings)
+ return 1;
+
+ filter = get_bloom_filter(lm->rev.repo, origin);
+ if (!filter)
+ return 1;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ if (bloom_filter_contains(filter, &ent->key,
+ lm->rev.bloom_filter_settings))
+ return 1;
+ }
+ return 0;
+}
+
static int last_modified_run(struct last_modified *lm,
last_modified_callback cb, void *cbdata)
{
@@ -189,6 +224,9 @@ static int last_modified_run(struct last_modified *lm,
if (!data.commit)
break;
+ if (!maybe_changed_path(lm, data.commit))
+ continue;
+
if (data.commit->object.flags & BOUNDARY) {
diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
&data.commit->object.oid, "",
@@ -238,6 +276,13 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
return argc;
}
+ /*
+ * We're not interested in generation numbers here,
+ * but calling this function to prepare the commit-graph.
+ */
+ (void)generation_numbers_enabled(lm->rev.repo);
+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
if (populate_paths_from_revs(lm) < 0)
return error(_("unable to setup last-modified"));
--
2.50.1.327.g047016eb4a
* [PATCH v5 4/6] pretty: allow caller to disable indentation
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
` (2 preceding siblings ...)
2025-07-16 13:35 ` [PATCH v5 3/6] last-modified: use Bloom filters when available Toon Claes
@ 2025-07-16 13:35 ` Toon Claes
2025-07-16 15:50 ` Junio C Hamano
2025-07-16 13:35 ` [PATCH v5 5/6] last-modified: support --extended format Toon Claes
` (7 subsequent siblings)
11 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-16 13:35 UTC (permalink / raw)
To: git
Cc: Toon Claes, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau,
Junio C Hamano
Most pretty formats indent the commit message with 4 spaces. Add field
`no_indent` to `struct pretty_print_context` to suppress this
indentation.
Signed-off-by: Toon Claes <toon@iotcl.com>
# Conflicts:
# pretty.h
Signed-off-by: Toon Claes <toon@iotcl.com>
---
pretty.c | 2 +-
pretty.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/pretty.c b/pretty.c
index 0bc8ad8a9a..9b1698417e 100644
--- a/pretty.c
+++ b/pretty.c
@@ -2286,7 +2286,7 @@ void pretty_print_commit(struct pretty_print_context *pp,
struct strbuf *sb)
{
unsigned long beginning_of_body;
- int indent = 4;
+ int indent = pp->no_indent ? 0 : 4;
const char *msg;
const char *reencoded;
const char *encoding;
diff --git a/pretty.h b/pretty.h
index df267afe4a..5d25ae2320 100644
--- a/pretty.h
+++ b/pretty.h
@@ -50,6 +50,7 @@ struct pretty_print_context {
struct ident_split *from_ident;
unsigned encode_email_headers:1;
struct pretty_print_describe_status *describe_status;
+ int no_indent;
/*
* Fields below here are manipulated internally by pp_* functions and
--
2.50.1.327.g047016eb4a
* [PATCH v5 5/6] last-modified: support --extended format
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
` (3 preceding siblings ...)
2025-07-16 13:35 ` [PATCH v5 4/6] pretty: allow caller to disable indentation Toon Claes
@ 2025-07-16 13:35 ` Toon Claes
2025-07-16 16:09 ` Junio C Hamano
2025-07-17 22:37 ` Junio C Hamano
2025-07-16 13:42 ` [PATCH v5 6/6] fixup! last-modified: use Bloom filters when available Toon Claes
` (6 subsequent siblings)
11 siblings, 2 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-16 13:35 UTC (permalink / raw)
To: git
Cc: Toon Claes, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau,
Junio C Hamano
Many forges show, in their tree view, which commit last modified each
entry of the tree. The command git-last-modified(1) was introduced to
feed the data for this view, but so far it only returned the commit OID
and the path.
Add option --extended to git-last-modified(1). In addition to the path
and the commit OID, it shows the raw commit data, which can be used
directly to feed the tree view on a forge.
Signed-off-by: Toon Claes <toon@iotcl.com>
---
Documentation/git-last-modified.adoc | 46 ++++++++++++++++++++++++++++
builtin/last-modified.c | 46 +++++++++++++++++++++++-----
t/t8020-last-modified.sh | 22 +++++++++++++
3 files changed, 107 insertions(+), 7 deletions(-)
diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
index 89138ebeb7..af028c7b74 100644
--- a/Documentation/git-last-modified.adoc
+++ b/Documentation/git-last-modified.adoc
@@ -27,6 +27,14 @@ OPTIONS
-t::
Show tree entry itself as well as subtrees. Implies `-r`.
+-z::
+
+ Instead of separating output entries with newlines, use a NUL byte to
+ delimit them. See 'OUTPUT' for more details.
+
+--extended::
+ Show output in extended format. See 'OUTPUT' below.
+
<revision-range>::
Only traverse commits in the specified revision range. When no
`<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
@@ -39,6 +47,44 @@ OPTIONS
Without an optional path parameter, all files and subdirectories
in path traversal the are included in the output.
+OUTPUT
+------
+
+The default format prints for each path:
+
+ <oid> TAB <path> LF
+
+When the commit is at boundary, it's prefixed with a caret `^`.
+
+Or when option `-z` is given:
+
+ <oid> TAB <path> NUL
+
+When `--extended` is provided, the output will be in the format:
+
+ path SP <path> LF
+ commit SP <oid> LF
+ tree SP <tree> LF
+ parent SP <parent> LF
+ author SP <author> LF
+ <message>
+
+Each line of the commit message is indented with four spaces.
+
+Unless together with `--extended` option `-z` is given, then the output is:
+
+ path SP <path> NUL
+ commit SP <oid> LF
+ tree SP <tree> LF
+ parent SP <parent> LF
+ author SP <author> LF
+ <message>
+
+In this situation the commit message is not indented.
+
+A path containing SP or special characters is enclosed in double-quotes in the C
+style as needed, unless option `-z` is provided.
+
SEE ALSO
--------
linkgit:git-blame[1],
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index 466df04fba..71c66e8782 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -12,6 +12,7 @@
#include "object-name.h"
#include "object.h"
#include "parse-options.h"
+#include "pretty.h"
#include "quote.h"
#include "repository.h"
#include "revision.h"
@@ -39,6 +40,7 @@ struct last_modified {
struct hashmap paths;
struct rev_info rev;
int recursive, tree_in_recursive;
+ int extended;
};
static void last_modified_release(struct last_modified *lm)
@@ -244,14 +246,42 @@ static void show_entry(const char *path, const struct commit *commit, void *d)
{
struct last_modified *lm = d;
- if (commit->object.flags & BOUNDARY)
- putchar('^');
- printf("%s\t", oid_to_hex(&commit->object.oid));
+ if (lm->extended) {
+ struct strbuf buf = STRBUF_INIT;
+ struct pretty_print_context pp = { 0 };
- if (lm->rev.diffopt.line_termination)
- write_name_quoted(path, stdout, '\n');
- else
- printf("%s%c", path, '\0');
+ pp.abbrev = lm->rev.abbrev;
+ pp.date_mode = lm->rev.date_mode;
+ pp.date_mode_explicit = lm->rev.date_mode_explicit;
+ pp.fmt = CMIT_FMT_RAW;
+ pp.color = lm->rev.diffopt.use_color;
+ pp.rev = &lm->rev;
+ pp.no_indent = !lm->rev.diffopt.line_termination;
+
+ pretty_print_commit(&pp, commit, &buf);
+
+ printf("path ");
+ if (lm->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+
+ printf("commit %s%s\n",
+ (commit->object.flags & BOUNDARY) ? "^" : "",
+ oid_to_hex(&commit->object.oid));
+ printf("%s%c", buf.buf, lm->rev.diffopt.line_termination);
+
+ strbuf_release(&buf);
+ } else {
+ printf("%s%s\t",
+ (commit->object.flags & BOUNDARY) ? "^" : "",
+ oid_to_hex(&commit->object.oid));
+
+ if (lm->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+ }
fflush(stdout);
}
@@ -306,6 +336,8 @@ int cmd_last_modified(int argc, const char **argv, const char *prefix,
N_("recurse into subtrees")),
OPT_BOOL('t', "tree-in-recursive", &lm.tree_in_recursive,
N_("recurse into subtrees and include the tree entries too")),
+ OPT_BOOL(0, "extended", &lm.extended,
+ N_("extended format will include the commit message in the output")),
OPT_END()
};
diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
index 05c113a1f8..008ea708ab 100755
--- a/t/t8020-last-modified.sh
+++ b/t/t8020-last-modified.sh
@@ -48,6 +48,28 @@ test_expect_success 'last-modified non-recursive' '
EOF
'
+test_expect_success 'last-modified extended output' '
+ check_last_modified --extended <<-\EOF
+ path a
+ commit 3
+ tree e9a947598482012e54c9c5d3635d5b526b43a6a4
+ parent 2
+ author A U Thor <author@example.com> 1112912113 -0700
+ committer C O Mitter <committer@example.com> 1112912113 -0700
+
+ 3
+
+ path file
+ commit 1
+ tree f27c6ae26adb8396d3861976ba268f87ad8afa0b
+ author A U Thor <author@example.com> 1112911993 -0700
+ committer C O Mitter <committer@example.com> 1112911993 -0700
+
+ 1
+
+ EOF
+'
+
test_expect_success 'last-modified recursive' '
check_last_modified -r <<-\EOF
3 a/b/file
--
2.50.1.327.g047016eb4a
* [PATCH v5 6/6] fixup! last-modified: use Bloom filters when available
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
` (7 preceding siblings ...)
2025-07-09 15:26 ` [PATCH v4 3/3] last-modified: use Bloom filters when available Toon Claes
@ 2025-07-16 13:35 ` Toon Claes
8 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-16 13:35 UTC (permalink / raw)
To: git; +Cc: Toon Claes
Make changes compatible with the ongoing work on Bloom filter
optimizations for multiple pathspec elements[1].
[1]: https://lore.kernel.org/git/20250712093517.17907-1-yldhome2d2@gmail.com/
Signed-off-by: Toon Claes <toon@iotcl.com>
---
builtin/last-modified.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index 71c66e8782..7cc57f9ada 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -49,7 +49,7 @@ static void last_modified_release(struct last_modified *lm)
struct last_modified_entry *ent;
hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
- clear_bloom_key(&ent->key);
+ bloom_key_clear(&ent->key);
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
@@ -79,7 +79,7 @@ static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
if (lm->rev.bloom_filter_settings)
- fill_bloom_key(path, strlen(path), &ent->key,
+ bloom_key_fill(&ent->key, path, strlen(path),
lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
@@ -140,7 +140,7 @@ static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
- clear_bloom_key(&ent->key);
+ bloom_key_clear(&ent->key);
free(ent);
}
--
2.50.1.327.g047016eb4a
* [PATCH v5 6/6] fixup! last-modified: use Bloom filters when available
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
` (4 preceding siblings ...)
2025-07-16 13:35 ` [PATCH v5 5/6] last-modified: support --extended format Toon Claes
@ 2025-07-16 13:42 ` Toon Claes
2025-07-17 23:39 ` [PATCH v5 0/6] Introduce git-last-modified(1) command Taylor Blau
` (5 subsequent siblings)
11 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-16 13:42 UTC (permalink / raw)
To: git
Cc: Toon Claes, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau,
Junio C Hamano
Make changes compatible with the ongoing work on Bloom filter
optimizations for multiple pathspec elements[1].
[1]: https://lore.kernel.org/git/20250712093517.17907-1-yldhome2d2@gmail.com/
Signed-off-by: Toon Claes <toon@iotcl.com>
---
This is a resend because I originally sent it with the wrong In-Reply-To and Cc.
--
Toon
builtin/last-modified.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index 71c66e8782..7cc57f9ada 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -49,7 +49,7 @@ static void last_modified_release(struct last_modified *lm)
struct last_modified_entry *ent;
hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
- clear_bloom_key(&ent->key);
+ bloom_key_clear(&ent->key);
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
@@ -79,7 +79,7 @@ static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
if (lm->rev.bloom_filter_settings)
- fill_bloom_key(path, strlen(path), &ent->key,
+ bloom_key_fill(&ent->key, path, strlen(path),
lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
@@ -140,7 +140,7 @@ static void mark_path(const char *path, const struct object_id *oid,
data->callback(path, data->commit, data->callback_data);
hashmap_remove(data->paths, &ent->hashent, path);
- clear_bloom_key(&ent->key);
+ bloom_key_clear(&ent->key);
free(ent);
}
--
2.50.1.327.g047016eb4a
* Re: [PATCH v5 4/6] pretty: allow caller to disable indentation
2025-07-16 13:35 ` [PATCH v5 4/6] pretty: allow caller to disable indentation Toon Claes
@ 2025-07-16 15:50 ` Junio C Hamano
2025-07-17 16:31 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-07-16 15:50 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau
Toon Claes <toon@iotcl.com> writes:
> Most pretty formats indent the commit message with 4 spaces. Add field
> `no_indent` to `struct pretty_print_context` to suppress this
> indentation.
>
> Signed-off-by: Toon Claes <toon@iotcl.com>
>
> # Conflicts:
> # pretty.h
Careful. There is no need to rush your patches to send a version
that hasn't been proof-read.
>
> Signed-off-by: Toon Claes <toon@iotcl.com>
> ---
I doubt that this is what you want in this series anyway, though.
You use this for "-z --extended", which presumably is about giving
the output that is as faithful as possible to the original, but if
you look at pretty_print_commit(), it does a LOT MORE than just
indent the log by 4 spaces. I suspect you would rather want to
avoid even calling pretty_print_commit() in such a code path.
> pretty.c | 2 +-
> pretty.h | 1 +
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/pretty.c b/pretty.c
> index 0bc8ad8a9a..9b1698417e 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -2286,7 +2286,7 @@ void pretty_print_commit(struct pretty_print_context *pp,
> struct strbuf *sb)
> {
> unsigned long beginning_of_body;
> - int indent = 4;
> + int indent = pp->no_indent ? 0 : 4;
> const char *msg;
> const char *reencoded;
> const char *encoding;
> diff --git a/pretty.h b/pretty.h
> index df267afe4a..5d25ae2320 100644
> --- a/pretty.h
> +++ b/pretty.h
> @@ -50,6 +50,7 @@ struct pretty_print_context {
> struct ident_split *from_ident;
> unsigned encode_email_headers:1;
> struct pretty_print_describe_status *describe_status;
> + int no_indent;
>
> /*
> * Fields below here are manipulated internally by pp_* functions and
> --
> 2.50.1.327.g047016eb4a
* Re: [PATCH v5 5/6] last-modified: support --extended format
2025-07-16 13:35 ` [PATCH v5 5/6] last-modified: support --extended format Toon Claes
@ 2025-07-16 16:09 ` Junio C Hamano
2025-07-17 16:31 ` Toon Claes
2025-07-17 22:37 ` Junio C Hamano
1 sibling, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-07-16 16:09 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau
Toon Claes <toon@iotcl.com> writes:
> +OUTPUT
> +------
> +
> +The default format prints for each path:
> +
> + <oid> TAB <path> LF
> +
> +When the commit is at boundary, it's prefixed with a caret `^`.
> +
> +Or when option `-z` is given:
> +
> + <oid> TAB <path> NUL
> +
> +When `--extended` is provided, the output will be in the format:
> +
> + path SP <path> LF
> + commit SP <oid> LF
> + tree SP <tree> LF
> + parent SP <parent> LF
> + author SP <author> LF
> + <message>
> +
> +Each line of the commit message is indented with four spaces.
> +
> +Unless together with `--extended` option `-z` is given, then the output is:
"If" would probably have been more readable.
I can see why you wrote "Unless" here, i.e.
We indent by four spaces.
Unless you use "-z" and "--extended" together, that is.
but I do not think it is a good idea to use such a construct here.
The reason why I do not think you want to phrase it that way is
because the next block that illustrates what happens when "-z" and
"--extended" are used together has more differences than just a mere
"is the message indented?" single bit. Unlike "--extended" without
"-z", which uniformly uses LF as the inter-item separator, some items
are NUL terminated while others are LF terminated.
> + path SP <path> NUL
> + commit SP <oid> LF
> + tree SP <tree> LF
> + parent SP <parent> LF
> + author SP <author> LF
> + <message>
> +
> +In this situation the commit message is not indented.
> +
> +A path containing SP or special characters is enclosed in double-quotes in the C
> +style as needed, unless option `-z` is provided.
Another thing I find somewhat lacking in the above output description
is this: it is clear how each output entry ends when "--extended" is
not given (the output is one entry per path, and either LF or NUL
terminates an entry), but the description of "--extended", with or
without "-z", is silent about how the reader program is expected to
notice when the message ends.
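To make the contrast concrete, here is a minimal sketch of a reader for the non-extended `-z` format (`<oid> TAB <path> NUL`, with an optional leading `^` for boundary commits). The names `lm_record` and `lm_parse_record` are hypothetical and not part of Git. The point is that one NUL unambiguously ends one entry, which is the property the "--extended" description does not yet spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one "<oid> TAB <path> NUL" entry. */
struct lm_record {
	int boundary;     /* 1 if the oid was prefixed with '^' */
	const char *oid;  /* points into the buffer, not NUL-terminated */
	size_t oid_len;
	const char *path; /* points into the buffer, NUL-terminated */
};

/*
 * Parse one entry starting at buf. Returns the number of bytes
 * consumed (including the terminating NUL), or 0 if the buffer holds
 * no complete, well-formed entry.
 */
size_t lm_parse_record(const char *buf, size_t len, struct lm_record *out)
{
	const char *end = memchr(buf, '\0', len);
	const char *tab;

	if (!end)
		return 0; /* entry is not terminated yet */

	/* The oid is hex, so the first TAB always ends it. */
	tab = memchr(buf, '\t', (size_t)(end - buf));
	if (!tab)
		return 0; /* malformed: no oid/path separator */

	out->boundary = (buf[0] == '^');
	out->oid = buf + out->boundary;
	out->oid_len = (size_t)(tab - out->oid);
	out->path = tab + 1;
	return (size_t)(end - buf) + 1;
}
```

A caller would loop, advancing by the returned count until it reaches 0. Because paths are not quoted under `-z`, they may contain TAB or LF and still parse correctly, as only the first TAB and the final NUL are treated as delimiters.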
Without "-z" and indented, the end of the <message> part is either
EOF or any unindented line, whichever comes earlier, I presume? I
am planning to teach pretty_print_commit() to stop indenting an
empty line by 4 spaces, by the way---non-"-z" format needs to be
designed to withstand such a change.
How would this extended format gain more fields in the future? A
free-text <message> has to be at the end? What if we later need to
add another free-text thing (e.g., notes attached to the commit that
is responsible for that latest state of the path)? I suspect that
you'd want an explicit tag (perhaps "message SP <message>") so that
the log message does not have to be anything special among others.
In any case, the above considerations need to be documented.
With "-z", a message body can begin with "path ", so you'd need to
arrange some terminator (like NUL) after the message body anyway.
Unless your format is "we tell about one path and then always exit",
that is, but that is probably not what we want.
Thanks.
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 4/6] pretty: allow caller to disable indentation
2025-07-16 15:50 ` Junio C Hamano
@ 2025-07-17 16:31 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-17 16:31 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau
Junio C Hamano <gitster@pobox.com> writes:
> Toon Claes <toon@iotcl.com> writes:
>
>> Most pretty formats indent the commit message with 4 spaces. Add field
>> `no_indent` to `struct pretty_print_context` to suppress this
>> indentation.
>>
>> Signed-off-by: Toon Claes <toon@iotcl.com>
>>
>> # Conflicts:
>> # pretty.h
>
> Careful. There is no need to rush your patches to send a version
> that hasn't been proof-read.
Whoops, I missed that. No idea how that happened. I have
`core.commentchar` set to `;`, because I used to write markdown headings
in my commit messages (with leading `#`). But even then, this should not
happen. I'm curious if I'm able to figure out what went wrong...
>>
>> Signed-off-by: Toon Claes <toon@iotcl.com>
>> ---
>
> I doubt that this is what you want in this series anyway, though.
> You use this for "-z --extended", which presumably is about giving
> the output that is as faithful as possible to the original, but if
> you look at pretty_print_commit(), it does a LOT MORE than just
> indent the log by 4 spaces. I suspect you would rather want to
> avoid even calling pretty_print_commit() in such a code path.
That's a good point. Before I submitted this version I was thinking to
have it support `--format` and `--pretty`, but that was even more
complex than what we have here. Indeed, we probably don't need
pretty_print_commit().
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 5/6] last-modified: support --extended format
2025-07-16 16:09 ` Junio C Hamano
@ 2025-07-17 16:31 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-17 16:31 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau
Junio C Hamano <gitster@pobox.com> writes:
> Toon Claes <toon@iotcl.com> writes:
>
>> +OUTPUT
>> +------
>> +
>> +The default format prints for each path:
>> +
>> + <oid> TAB <path> LF
>> +
>> +When the commit is at boundary, it's prefixed with a caret `^`.
>> +
>> +Or when option `-z` is given:
>> +
>> + <oid> TAB <path> NUL
>> +
>> +When `--extended` is provided, the output will be in the format:
>> +
>> + path SP <path> LF
>> + commit SP <oid> LF
>> + tree SP <tree> LF
>> + parent SP <parent> LF
>> + author SP <author> LF
>> + <message>
>> +
>> +Each line of the commit message is indented with four spaces.
>> +
>> +Unless together with `--extended` option `-z` is given, then the output is:
>
> "If" would probably have been more readable.
>
> I can see why you wrote "Unless" here, i.e.
>
> We indent by four spaces.
> Unless you use "-z" and "--extended" together, that is.
>
> but I do not think it is a good idea to use such a construct here.
> The reason why I do not think you want to phrase it that way is
> because the next block that illustrates what happens when "-z" and
> "--extended" are used together has more differences than just a mere
> "is the message indented?" single bit. Unlike "--extended" without
> "-z" that uniformly use LF as inter-item separator, some items are
> NUL terminated while others are LF terminated.
>
>> + path SP <path> NUL
>> + commit SP <oid> LF
>> + tree SP <tree> LF
>> + parent SP <parent> LF
>> + author SP <author> LF
>> + <message>
>> +
>> +In this situation the commit message is not indented.
>> +
>> +A path containing SP or special characters is enclosed in double-quotes in the C
>> +style as needed, unless option `-z` is provided.
>
> Another thing I find the above output description somewhat lacking
> is that, while it is clear how each output entry ends when
> "--extended" is not given (i.e. it shows what terminates each output
> entry. The output is one entry per path and either LF or NUL
> terminates an entry), the description of "--extended", with or
> without "-z" is silent about how the reader program is expected to
> notice when the message ends.
>
> Without "-z" and indented, the end of the <message> part if either
> EOF or any unindented line, whichever comes earlier, I presume?
That's the idea.
> I
> am planning to teach pretty_print_commit() to stop indenting an
> empty line by 4 spaces, by the way---non-"-z" format needs to be
> designed to withstand such a change.
It kind of is. A new entry should start with "path " (no leading space).
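A minimal sketch of that boundary rule for the non-"-z" extended
output, assuming the reader has already split the stream on LF and
stripped the terminator. The enum and function names are hypothetical
and not part of the patch. Blank lines are treated as continuation, so
the reader would keep working if empty message lines stop being
indented.

```c
#include <string.h>

enum line_kind {
	LINE_ENTRY_START,  /* unindented "path SP <path>": a new entry begins */
	LINE_HEADER,       /* any other unindented line, e.g. "commit SP <oid>" */
	LINE_CONTINUATION  /* blank, or indented by four spaces: message text */
};

/* Classify one line (without its trailing LF) of the extended output. */
enum line_kind classify_line(const char *line)
{
	/* A blank line never starts an entry, indented or not. */
	if (!*line || !strncmp(line, "    ", 4))
		return LINE_CONTINUATION;
	/* Only an unindented "path " prefix marks an entry boundary. */
	if (!strncmp(line, "path ", 5))
		return LINE_ENTRY_START;
	return LINE_HEADER;
}
```

A message line that happens to read "path x" is still classified as
continuation because of its indentation, which is exactly the property
the "-z" variant loses once the message is no longer indented.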
> How would this extended format gain more fields in the future? A
> free-text <message> has to be at the end? What if we later need to
> add another free-text thing (e.g., notes ttached to the commit that
> is responsible for that latest state of the path)?
Ahha, you mean a future field that's multi-line? I assume we'd indent
those lines, but I agree it would make parsing harder.
> I suspect that
> you'd want an explicit tag (perhaps "message SP <message>") so that
> the log message does not have to be anything special among others.
The idea was to have the output compatible with the git-cat-file(1)
output. That would then no longer be the case. That's also why you see
mixed use of NUL and LF delimited fields in "-z" mode.
> In any case, the above considerations need to be documented.
Yeah, that's worth elaborating on.
> With "-z", a message body can begin with "path ", so you'd need to
> arrange some terminator (like NUL) after the message body anyway.
That should be the case, but it seems I missed that in the docs.
> Unless your format is "we tell about one path and then always exit",
> that is, but that is probably not what we want.
>
> Thanks.
Thanks for this feedback. I've submitted this patch on top mainly to
gather some feedback, and this is really valuable. I think in the next
version of this series I'm gonna leave out this patch because there are
too many loose ends, and I think in our product we can easily integrate
git-last-modified(1) if it only supports the single-line output format.
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 5/6] last-modified: support --extended format
2025-07-16 13:35 ` [PATCH v5 5/6] last-modified: support --extended format Toon Claes
2025-07-16 16:09 ` Junio C Hamano
@ 2025-07-17 22:37 ` Junio C Hamano
2025-07-18 17:36 ` Junio C Hamano
1 sibling, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-07-17 22:37 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau
Toon Claes <toon@iotcl.com> writes:
> diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
> index 05c113a1f8..008ea708ab 100755
> --- a/t/t8020-last-modified.sh
> +++ b/t/t8020-last-modified.sh
> @@ -48,6 +48,28 @@ test_expect_success 'last-modified non-recursive' '
> EOF
> '
>
> +test_expect_success 'last-modified extended output' '
> + check_last_modified --extended <<-\EOF
> + path a
> + commit 3
> + tree e9a947598482012e54c9c5d3635d5b526b43a6a4
> + parent 2
> + author A U Thor <author@example.com> 1112912113 -0700
> + committer C O Mitter <committer@example.com> 1112912113 -0700
> +
> + 3
> +
> + path file
> + commit 1
> + tree f27c6ae26adb8396d3861976ba268f87ad8afa0b
> + author A U Thor <author@example.com> 1112911993 -0700
> + committer C O Mitter <committer@example.com> 1112911993 -0700
> +
> + 1
> +
> + EOF
> +'
Hmph. This hardcoding of everything does not look easy to maintain.
Besides, the test will fail rather miserably when run with SHA-256
hash (e.g., post Git 3.0 where the "git init" command by default
will give you a repository with new hash).
It looks somewhat inconsistent that tree is shown with its object
name, but commit is not.
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 0/6] Introduce git-last-modified(1) command
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
` (5 preceding siblings ...)
2025-07-16 13:42 ` [PATCH v5 6/6] fixup! last-modified: use Bloom filters when available Toon Claes
@ 2025-07-17 23:39 ` Taylor Blau
2025-07-22 15:35 ` Toon Claes
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
` (4 subsequent siblings)
11 siblings, 1 reply; 135+ messages in thread
From: Taylor Blau @ 2025-07-17 23:39 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Junio C Hamano
On Wed, Jul 16, 2025 at 03:32:00PM +0200, Toon Claes wrote:
> This series adds the git-last-modified(1) to feed this view. In the past
> the subcommand was proposed[1] to be named git-blame-tree(1). This
> version is based on the patches shared by the kind people at GitHub[2].
Sorry for completely dropping this from my review queue. Let me try and
give it a read...
> What is different from the series shared by GitHub:
>
> * Renamed the subcommand from `blame-tree` to `last-modified`. There was
> some consensus[5] this name works better, so let's give it a try and
> see how this name feels.
Hmmph. I prefer the "blame-tree" name personally, but I am (a) biased,
and (b) used to it over "last-modified", so I don't think my preference
or bias should count for much here.
> * Patches for --max-depth are excluded. I think it's a separate topic to
> discuss and I'm not sure it needs to be part of series anyway. The
> main patch was submitted in the previous attempt[3] and if people
> consider it valuable, I'm happy to discuss that in a separate patch
> series.
Yeah, makes sense.
> * The last-modified command isn't recursive by default. If you want
> recurse into subtrees, you need to pass `-r`.
OK.
> * The patches in 'tb/blame-tree' at Taylor's fork[4] implements a
> caching layer. This feature reads/writes cached results in
> `.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
> size, that feature is excluded from this series. I think it's better
> to submit this as a separate series.
Makes sense; the caching feature was primarily implemented by Stolee and
I think for our purposes here can be considered additive and not
essential to the basic functionality of this new command. For what it's
worth, I *would* like[^1] to see those features sent to the list at some
point, but I agree that they are a significant source of additional
complexity. So punting on them for now seems like the right direction to
me.
[^1]: My ulterior motive here would be to eventually ditch GitHub's
"blame-tree" command entirely and remove it from GitHub's diff to
upstream. I'm happy to help however I can with that effort once this
series lands.
> * Squashed various commits together. Like they introduced a flag
> `--go-faster`, which later became the default and only implementation.
> That story was wrapped up in a single commit.
Perfect, thank you. I figured that we would not want to keep temporary
measures around like the "--go-faster" flag, but I also figured that
they may be helpful in unpacking the history of this command, hence why
I sent them in the first place.
> * Dropped the patches that attempt to increase performance for tree
> entries that have not been updated in a long time. In my testing I've
> seen both performance improvements *and* degradation with these
> changes:
>
> Test HEAD~ HEAD
> ------------------------------------------------------------------------------------
> 8020.1: top-level last-modified 4.52(4.38+0.11) 2.03(1.93+0.08) -55.1%
> 8020.2: top-level recursive last-modified 5.79(5.64+0.11) 8.34(8.17+0.11) +44.0%
> 8020.3: subdir last-modified 0.15(0.09+0.06) 0.19(0.14+0.06) +26.7%
>
> Before we include these patches, I want to make sure these changes
> have positive impact in all/most scenarios. This can happen in a
> separate series.
Hmm. It's been long enough that I honestly don't remember the details
here, but I agree that this is worth looking into at some point in the
future.
> I've set myself as the author and added Based-on-patch-by trailers to
> credit the original authors. Let me know if you disagree.
I can't speak for the other authors of this command, but I have no issue
being ~~blamed~~ credited with a "Based-on-patch-by" trailer ;-).
> Again thanks to Taylor and the people at GitHub for sharing these
> patches. I hope we can work together to get this upstreamed.
Ditto.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified
2025-07-16 13:35 ` [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-07-18 0:02 ` Taylor Blau
2025-07-19 6:44 ` Jeff King
` (2 more replies)
0 siblings, 3 replies; 135+ messages in thread
From: Taylor Blau @ 2025-07-18 0:02 UTC (permalink / raw)
To: Toon Claes
Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Junio C Hamano,
Jeff King, Ævar Arnfjörð Bjarmason
On Wed, Jul 16, 2025 at 03:35:13PM +0200, Toon Claes wrote:
> 11 files changed, 549 insertions(+)
> create mode 100644 Documentation/git-last-modified.adoc
> create mode 100644 builtin/last-modified.c
> create mode 100755 t/t8020-last-modified.sh
I'm admittedly not entirely sure what the best way to review this patch
is given its size and my previous exposure to (similar) code.
From what I can tell, this does not include the optimizations that
Stolee and I worked on back in 2020-ish. Those would be nice to have,
but they are somewhat complex and I think more easily reviewed as an
incremental change on top rather than as part of the initial version.
As I mentioned in my response to your cover letter, I would be more
than happy to help you with an effort to introduce those optimizations
on top.
> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
> new file mode 100644
> index 0000000000..63993bc1c9
> --- /dev/null
> +++ b/builtin/last-modified.c
> @@ -0,0 +1,289 @@
> +#include "git-compat-util.h"
> +#include "builtin.h"
> +#include "commit.h"
> +#include "config.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "hashmap.h"
> +#include "hex.h"
> +#include "log-tree.h"
> +#include "object-name.h"
> +#include "object.h"
> +#include "parse-options.h"
> +#include "quote.h"
> +#include "repository.h"
> +#include "revision.h"
> +
> +struct last_modified_entry {
> + struct hashmap_entry hashent;
> + struct object_id oid;
> + const char path[FLEX_ARRAY];
> +};
As a general comment on this patch, I am a little sad to see that many
of the implementation details have been moved back into the builtin
itself and not in their own last-modified.[ch] file(s).
Apologies if this was already discussed earlier in the thread and I
simply missed it, but can you comment on why the last-modified internals
were moved into the builtin?
Even in the earliest version of 'blame-tree' that I could find (from
26999d045b (add blame-tree command, 2012-10-20) in my fork) many of the
internals were written in blame-tree.c instead of builtin/blame-tree.c.
> +static int last_modified_entry_hashcmp(const void *unused UNUSED,
> + const struct hashmap_entry *hent1,
> + const struct hashmap_entry *hent2,
> + const void *path)
> +{
> + const struct last_modified_entry *ent1 =
> + container_of(hent1, const struct last_modified_entry, hashent);
> + const struct last_modified_entry *ent2 =
> + container_of(hent2, const struct last_modified_entry, hashent);
> + return strcmp(ent1->path, path ? path : ent2->path);
> +}
> +
> +struct last_modified {
> + struct hashmap paths;
> + struct rev_info rev;
> + int recursive, tree_in_recursive;
Can we either make these two part of a bitfield, or at least declare
them separately?
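For illustration, the bit-field variant could look like the sketch
below. The struct name is hypothetical and only the two flags are
shown; the hashmap and rev_info members are left out.

```c
/* Hypothetical reduced struct: just the two flags from the patch. */
struct last_modified_flags {
	unsigned int recursive : 1;          /* descend into subtrees */
	unsigned int tree_in_recursive : 1;  /* also show trees when recursing */
};
```

One thing worth checking before going this way: C does not allow taking
the address of a bit-field, so the flags could no longer be handed to
anything that expects an "int *". Declaring them as two separate "int"
members avoids that restriction.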
> +};
> +
> +static void last_modified_release(struct last_modified *lm)
> +{
> + hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
> + release_revisions(&lm->rev);
> +}
> +
> +typedef void (*last_modified_callback)(const char *path,
> + const struct commit *commit, void *data);
> +
> +struct last_modified_callback_data {
> + struct commit *commit;
> + struct hashmap *paths;
> +
> + last_modified_callback callback;
> + void *callback_data;
> +};
I can't quite tell what the purpose of this struct is in conjunction
with the last_modified_callback type above.
The last_modified_callback type makes sense as a generic callback
function that callers can pass to get <path, commit> pairs, along with
an arbitrary "data" pointer.
But then you define a last_modified_callback_data struct, which
made me think that it would be used as the data type passed to the
callback. In other words, given the existence of this struct, I would
have expected the function pointer above to be defined like:
typedef void (*last_modified_callback)(const char *path,
const struct commit *commit,
struct last_modified_callback_data *data);
But the fact that the _data struct contains a last_modified_callback
function pointer gives us a hint at what's going on here. It seems like
last_modified_callback_data is used to store some bookkeeping
information and dispatch calls to the "callback" function pointer.
I think that the fact the struct's name ends with "_data" is what is
confusing to me. I think this would be a little clearer if you renamed
this "struct last_modified_callback" and the function pointer to
"last_modified_callback_fn" or similar.
(The irony is not lost on me that these comments would be applicable to
GitHub's version of this code, too :-s).
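A rough sketch of what that renaming could look like, assuming the
struct keeps only bookkeeping plus the caller's function and opaque
pointer. The member names "fn" and "fn_data", the helper
last_modified_report() and the sample callback count_paths() are made
up for this example, and the hashmap member is omitted to keep it
self-contained.

```c
#include <stddef.h>

struct commit; /* opaque here; stands in for git's struct commit */

/* The function pointer type gets the "_fn" suffix. */
typedef void (*last_modified_callback_fn)(const char *path,
					  const struct commit *commit,
					  void *data);

/* The struct that carries bookkeeping and dispatches to the function. */
struct last_modified_callback {
	const struct commit *commit;   /* commit currently being examined */
	last_modified_callback_fn fn;  /* caller-supplied function */
	void *fn_data;                 /* caller's pointer, passed through as-is */
};

/* Report one <path, commit> pair to the caller, if a function is set. */
void last_modified_report(struct last_modified_callback *cb, const char *path)
{
	if (cb->fn)
		cb->fn(path, cb->commit, cb->fn_data);
}

/* Sample callback: counts how many paths were reported. */
void count_paths(const char *path, const struct commit *commit, void *data)
{
	(void)path;
	(void)commit;
	++*(int *)data;
}
```

With these names, the type passed to the callback ("void *data") and
the struct that owns the callback are no longer easy to confuse.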
> +static int populate_paths_from_revs(struct last_modified *lm)
> +{
> + int num_interesting = 0;
> + struct diff_options diffopt;
> +
> + memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
> + copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
> + /*
> + * Use a callback to populate the paths from revs
> + */
> + diffopt.output_format = DIFF_FORMAT_CALLBACK;
> + diffopt.format_callback = add_path_from_diff;
> + diffopt.format_callback_data = lm;
> +
> + for (size_t i = 0; i < lm->rev.pending.nr; i++) {
> + struct object_array_entry *obj = lm->rev.pending.objects + i;
> +
> + if (obj->item->flags & UNINTERESTING)
> + continue;
> +
> + if (num_interesting++)
> + return error(_("can only get last-modified one tree at a time"));
This error text is a little difficult to parse, but I'm not sure that I
have a great suggestion for improving it. The equivalent from GitHub's
fork is "can only blame one tree at a time", and I think the difficulty
in parsing is that "last-modified" isn't a verb.
> +static void mark_path(const char *path, const struct object_id *oid,
> + struct last_modified_callback_data *data)
> +{
> + struct last_modified_entry *ent;
> +
> + /* Is it even a path that we are interested in? */
> + ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
> + struct last_modified_entry, hashent);
> + if (!ent)
> + return;
> +
> + /*
> + * Is it arriving at a version of interest, or is it from a side branch
> + * which did not contribute to the final state?
> + */
> + if (!oideq(oid, &ent->oid))
> + return;
GitHub's fork writes this as "if (oid && !oideq(oid, &ent->oid))", but
the commit that introduces the "oid &&" portion of that expression
doesn't provide us with any clues as to why the change was necessary.
Since you have spent more time with these patches than I have recently,
perhaps you can help shed some light on what's going on here?
The rest of the code roughly matches my memory of the early versions of
this command.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 2/6] t/perf: add last-modified perf script
2025-07-16 13:35 ` [PATCH v5 2/6] t/perf: add last-modified perf script Toon Claes
@ 2025-07-18 0:08 ` Taylor Blau
2025-07-22 15:52 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Taylor Blau @ 2025-07-18 0:08 UTC (permalink / raw)
To: Toon Claes
Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Junio C Hamano,
Jeff King
On Wed, Jul 16, 2025 at 03:35:14PM +0200, Toon Claes wrote:
> diff --git a/t/perf/p8020-last-modified.sh b/t/perf/p8020-last-modified.sh
> new file mode 100755
> index 0000000000..a02ec907d4
> --- /dev/null
> +++ b/t/perf/p8020-last-modified.sh
> @@ -0,0 +1,21 @@
> +#!/bin/sh
> +
> +test_description='last-modified perf tests'
> +. ./perf-lib.sh
> +
> +test_perf_default_repo
> +
> +test_perf 'top-level last-modified' '
> + git last-modified HEAD
> +'
> +
> +test_perf 'top-level recursive last-modified' '
> + git last-modified -r HEAD
> +'
The only notable difference from GitHub's version here is that we do not
have a recursive option, so our test is just "git blame-tree
--max-depth=0", which is obviously not applicable here.
What you wrote (testing "last-modified" both with and without the "-r"
option) makes sense to me.
> +test_perf 'subdir last-modified' '
> + path=$(git ls-tree HEAD | grep ^040000 | head -n 1 | cut -f2)
Hmm. This line comes directly from the patches that I originally shared,
but seeing "git" on the left-hand side of a pipe makes me a little
uneasy.
We could also use the "-d" flag here, which will only show us trees,
thus eliminating the need for the "grep ^040000" portion above.
I'd probably write this as:
git ls-tree -d HEAD >subtrees &&
path="$(head -n 1 subtrees | cut -f2)" &&
git last-modified -- "$path"
Thanks,
Taylor
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 3/6] last-modified: use Bloom filters when available
2025-07-16 13:35 ` [PATCH v5 3/6] last-modified: use Bloom filters when available Toon Claes
@ 2025-07-18 0:16 ` Taylor Blau
2025-07-22 16:02 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Taylor Blau @ 2025-07-18 0:16 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Junio C Hamano
On Wed, Jul 16, 2025 at 03:35:15PM +0200, Toon Claes wrote:
> Our 'git last-modified' performs a revision walk, and computes a diff at
> each point in the walk to figure out whether a given revision changed
> any of the paths it considers interesting.
>
> When changed-path Bloom filters are available, we can avoid computing
> many such diffs. Before computing a diff, we first check if any of the
> remaining paths of interest were possibly changed at a given commit by
> consulting its Bloom filter. If any of them are, we are resigned to
> compute the diff.
>
> If none of those queries returned "maybe", we know that the given commit
> doesn't contain any changed paths which are interesting to us. So, we
> can avoid computing it in this case.
>
> Comparing the perf test results on git.git:
>
> Test HEAD~ HEAD
> ------------------------------------------------------------------------------------
> 8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6%
> 8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4%
> 8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4%
As an aside on 8020.3 (that I probably should have mentioned in the last
commit), I think that our "| head -n1" heuristic for picking a sub-tree
is skewing these results down. In git.git, the lexicographically
earliest sub-tree is ".github", which is awfully tiny. I wonder if we
should be grabbing the *last* sub-tree, or maybe the largest one by
count of entries?
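For readers less familiar with changed-path filters, the skip rule from
the commit message quoted above can be sketched with a toy filter.
Everything named demo_* or need_diff() is made up for illustration:
git's real filters use different hashing and sizing, and this is not
the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_FILTER_BITS 64

/* Toy filter: a single 64-bit word, two probes per path. */
struct demo_filter {
	uint64_t bits;
};

/* FNV-1a-style hash, used only as a convenient stand-in. */
uint32_t demo_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;
	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/* Bit mask with the two probe positions for a path. */
uint64_t demo_mask(const char *path)
{
	return (UINT64_C(1) << (demo_hash(path, 0) % DEMO_FILTER_BITS)) |
	       (UINT64_C(1) << (demo_hash(path, 1) % DEMO_FILTER_BITS));
}

/* Record that a commit changed this path. */
void demo_filter_add(struct demo_filter *f, const char *path)
{
	f->bits |= demo_mask(path);
}

/* 0 means "definitely not changed"; 1 means "maybe changed". */
int demo_filter_maybe(const struct demo_filter *f, const char *path)
{
	uint64_t want = demo_mask(path);
	return (f->bits & want) == want;
}

/*
 * Returns 1 if the diff for this commit must be computed, 0 if it can
 * be skipped because no remaining path of interest answered "maybe".
 */
int need_diff(const struct demo_filter *commit_filter,
	      const char **paths, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (demo_filter_maybe(commit_filter, paths[i]))
			return 1;
	return 0;
}
```

The filter can answer "maybe" for a path that was not changed, but it
never answers "no" for one that was, which is why skipping the diff on
an all-"no" result is safe.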
> @@ -40,6 +43,12 @@ struct last_modified {
>
> static void last_modified_release(struct last_modified *lm)
> {
> + struct hashmap_iter iter;
> + struct last_modified_entry *ent;
> +
> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
> + clear_bloom_key(&ent->key);
> +
I did a double-take to make sure that ent->key would always be
initialized here, but it is thanks to the FLEX_ALLOC_STR() call below.
> hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
> release_revisions(&lm->rev);
> }
> @@ -67,6 +76,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
>
> FLEX_ALLOC_STR(ent, path, path);
> oidcpy(&ent->oid, &p->two->oid);
> + if (lm->rev.bloom_filter_settings)
> + fill_bloom_key(path, strlen(path), &ent->key,
> + lm->rev.bloom_filter_settings);
> hashmap_entry_init(&ent->hashent, strhash(ent->path));
> hashmap_add(&lm->paths, &ent->hashent);
> }
> @@ -126,6 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid,
> data->callback(path, data->commit, data->callback_data);
>
> hashmap_remove(data->paths, &ent->hashent, path);
> + clear_bloom_key(&ent->key);
OK, we're calling clear_bloom_key() here, too, but it uses
FREE_AND_NULL(), so calling it again in last_modified_release() may be a
noop, but won't ever be a double-free.
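A small sketch of why the second call is harmless. The macro below is
modeled on git's FREE_AND_NULL() helper; "struct demo_key" and its
functions are simplified stand-ins invented for this example, not git's
struct bloom_key.

```c
#include <stdlib.h>

/* Modeled on git's helper: free the pointer, then reset it to NULL. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Simplified stand-in for a key that owns one heap allocation. */
struct demo_key {
	unsigned int *hashes;
};

/* Allocate room for nr hash values. */
void demo_key_fill(struct demo_key *key, size_t nr)
{
	key->hashes = calloc(nr, sizeof(*key->hashes));
}

/* Safe to call repeatedly: after the first call, hashes is NULL. */
void demo_key_clear(struct demo_key *key)
{
	FREE_AND_NULL(key->hashes);
}
```

Because the pointer is reset after the first free(), a later clear ends
up calling free(NULL), which the C standard defines as doing nothing.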
> @@ -238,6 +276,13 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
> return argc;
> }
>
> + /*
> + * We're not interested in generation numbers here,
> + * but calling this function to prepare the commit-graph.
> + */
> + (void)generation_numbers_enabled(lm->rev.repo);
> + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
Hmmph. I think when I originally wrote this I was using the side-effect
of calling generation_numbers_enabled() as a hack. But I think that it
may be worth making "prepare_commit_graph()" a non-static function and
calling that instead.
Thanks,
Taylor
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 5/6] last-modified: support --extended format
2025-07-17 22:37 ` Junio C Hamano
@ 2025-07-18 17:36 ` Junio C Hamano
2025-07-22 16:06 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-07-18 17:36 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau
Junio C Hamano <gitster@pobox.com> writes:
> Hmph. This hardcoding of everything does not look easy to maintain.
>
> Besides, the test will fail rather miserably when run with SHA-256
> hash (e.g., post Git 3.0 where the "git init" command by default
> will give you a repository with new hash).
>
> It looks somewhat inconsistent that tree is shown with its object
> name, but commit is not.
I address neither the first point nor the last point above,
but at least something like the attached patch needs to be squashed
into this step to make the SHA-256 tests pass.
Thanks.
commit 86a64ae7a4b866db0f17f906ca5be95333d907ab
Author: Junio C Hamano <gitster@pobox.com>
Date: Fri Jul 18 10:33:58 2025 -0700
fixup! last-modified: support --extended format
diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
index 008ea708ab..e737cb2505 100755
--- a/t/t8020-last-modified.sh
+++ b/t/t8020-last-modified.sh
@@ -6,10 +6,12 @@ test_description='last-modified tests'
test_expect_success 'setup' '
test_commit 1 file &&
+ TREE1=$(git rev-parse HEAD^{tree}) &&
mkdir a &&
test_commit 2 a/file &&
mkdir a/b &&
- test_commit 3 a/b/file
+ test_commit 3 a/b/file &&
+ TREE3=$(git rev-parse HEAD^{tree})
'
test_expect_success 'cannot run last-modified on two trees' '
@@ -49,10 +51,10 @@ test_expect_success 'last-modified non-recursive' '
'
test_expect_success 'last-modified extended output' '
- check_last_modified --extended <<-\EOF
+ check_last_modified --extended <<-EOF
path a
commit 3
- tree e9a947598482012e54c9c5d3635d5b526b43a6a4
+ tree $TREE3
parent 2
author A U Thor <author@example.com> 1112912113 -0700
committer C O Mitter <committer@example.com> 1112912113 -0700
@@ -61,7 +63,7 @@ test_expect_success 'last-modified extended output' '
path file
commit 1
- tree f27c6ae26adb8396d3861976ba268f87ad8afa0b
+ tree $TREE1
author A U Thor <author@example.com> 1112911993 -0700
committer C O Mitter <committer@example.com> 1112911993 -0700
^ permalink raw reply related [flat|nested] 135+ messages in thread
* Re: [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified
2025-07-18 0:02 ` Taylor Blau
@ 2025-07-19 6:44 ` Jeff King
2025-07-22 15:50 ` Toon Claes
2025-08-01 9:09 ` Christian Couder
2 siblings, 0 replies; 135+ messages in thread
From: Jeff King @ 2025-07-19 6:44 UTC (permalink / raw)
To: Taylor Blau
Cc: Toon Claes, git, Kristoffer Haugsbakk, Derrick Stolee,
Junio C Hamano, Ævar Arnfjörð Bjarmason
On Thu, Jul 17, 2025 at 08:02:37PM -0400, Taylor Blau wrote:
> > + /*
> > + * Is it arriving at a version of interest, or is it from a side branch
> > + * which did not contribute to the final state?
> > + */
> > + if (!oideq(oid, &ent->oid))
> > + return;
>
> GitHub's fork writes this as "if (oid && !oideq(oid, &ent->oid))", but
> the commit that introduces the "oid &&" portion of that expression
> doesn't provide us with any clues as to why the change was necessary.
>
> Since you have spent more time with these patches than I have recently,
> perhaps you can help shed some light on what's going on here?
In the version from tb/blame-tree of your repo, the caching system
calls mark_path() with a NULL oid. But none of that code is in Toon's
version here. The only call to mark_path() in this series always passes
a pointer to a real struct.
-Peff
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 0/6] Introduce git-last-modified(1) command
2025-07-17 23:39 ` [PATCH v5 0/6] Introduce git-last-modified(1) command Taylor Blau
@ 2025-07-22 15:35 ` Toon Claes
2025-07-30 17:59 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-22 15:35 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Junio C Hamano
Taylor Blau <me@ttaylorr.com> writes:
> On Wed, Jul 16, 2025 at 03:32:00PM +0200, Toon Claes wrote:
>> This series adds the git-last-modified(1) to feed this view. In the past
>> the subcommand was proposed[1] to be named git-blame-tree(1). This
>> version is based on the patches shared by the kind people at GitHub[2].
>
> Sorry for completely dropping this from my review queue. Let me try and
> give it a read...
No worries, we all got work to do ;)
>> What is different from the series shared by GitHub:
>>
>> * Renamed the subcommand from `blame-tree` to `last-modified`. There was
>> some consensus[5] this name works better, so let's give it a try and
>> see how this name feels.
>
> Hmmph. I prefer the "blame-tree" name personally, but I am (a) biased,
> and (b) used to it over "last-modified", so I don't think my preference
> or bias should count for much here.
Well, for what it's worth I like "blame-tree" more as well. But I didn't
feel strongly enough to push it through.
>> * Patches for --max-depth are excluded. I think it's a separate topic to
>> discuss and I'm not sure it needs to be part of series anyway. The
>> main patch was submitted in the previous attempt[3] and if people
>> consider it valuable, I'm happy to discuss that in a separate patch
>> series.
>
> Yeah, makes sense.
I might be revisiting this, because recently, while integrating the WIP
command in our tech stack, I noticed having this option would be
useful/required.
>> * The last-modified command isn't recursive by default. If you want
>> recurse into subtrees, you need to pass `-r`.
>
> OK.
>
>> * The patches in 'tb/blame-tree' at Taylor's fork[4] implements a
>> caching layer. This feature reads/writes cached results in
>> `.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
>> size, that feature is excluded from this series. I think it's better
>> to submit this as a separate series.
>
> Makes sense; the caching feature was primarily implemented by Stolee and
> I think for our purposes here can be considered additive and not
> essential to the basic functionality of this new command. For what it's
> worth, I *would* like[^1] to see those features sent to the list at some
> point, but I agree that they are a significant source of additional
> complexity. So punting on them for now seems like the right direction to
> me.
>
> [^1]: My ulterior motive here would be to eventually ditch GitHub's
> "blame-tree" command entirely and remove it from GitHub's diff to
> upstream. I'm happy to help however I can with that effort once this
> series lands.
Obviously. I know how it goes, more code maintained by the community is
better for everyone.
>> * Squashed various commits together. Like they introduced a flag
>> `--go-faster`, which later became the default and only implementation.
>> That story was wrapped up in a single commit.
>
> Perfect, thank you. I figured that we would not want to keep temporary
> measures around like the "--go-faster" flag, but I also figured that
> they may be helpful in unpacking the history of this command, hence why
> I sent them in the first place.
I wasn't sure about this. I've had a hard time unraveling what in the 55
patches in your branch was valuable and what could be squashed into
other commits.
>> * Dropped the patches that attempt to increase performance for tree
>> entries that have not been updated in a long time. In my testing I've
>> seen both performance improvements *and* degradation with these
>> changes:
>>
>> Test HEAD~ HEAD
>> ------------------------------------------------------------------------------------
>> 8020.1: top-level last-modified 4.52(4.38+0.11) 2.03(1.93+0.08) -55.1%
>> 8020.2: top-level recursive last-modified 5.79(5.64+0.11) 8.34(8.17+0.11) +44.0%
>> 8020.3: subdir last-modified 0.15(0.09+0.06) 0.19(0.14+0.06) +26.7%
>>
>> Before we include these patches, I want to make sure these changes
>> have positive impact in all/most scenarios. This can happen in a
>> separate series.
>
> Hmm. It's been long enough that I honestly don't remember the details
> here, but I agree that this is worth looking into at some point in the
> future.
I had this patch included in version 2[1]. I'd love to include it,
but it didn't give the results we were expecting. Over time I became
more comfortable with these changes. Let me see if I can get more
insights about it.
>> I've set myself as the author and added Based-on-patch-by trailers to
>> credit the original authors. Let me know if you disagree.
>
> I can't speak for the other authors of this command, but I have no issue
> being ~~blamed~~ credited with a "Based-on-patch-by" trailer ;-).
>
>> Again thanks to Taylor and the people at GitHub for sharing these
>> patches. I hope we can work together to get this upstreamed.
>
> Ditto.
<3
[1]: https://lore.kernel.org/git/20250523-toon-new-blame-tree-v2-4-101e4ca4c1c9@iotcl.com/
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified
2025-07-18 0:02 ` Taylor Blau
2025-07-19 6:44 ` Jeff King
@ 2025-07-22 15:50 ` Toon Claes
2025-08-01 9:09 ` Christian Couder
2 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-22 15:50 UTC (permalink / raw)
To: Taylor Blau
Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Junio C Hamano,
Jeff King, Ævar Arnfjörð Bjarmason
Taylor Blau <me@ttaylorr.com> writes:
> On Wed, Jul 16, 2025 at 03:35:13PM +0200, Toon Claes wrote:
>> 11 files changed, 549 insertions(+)
>> create mode 100644 Documentation/git-last-modified.adoc
>> create mode 100644 builtin/last-modified.c
>> create mode 100755 t/t8020-last-modified.sh
>
> I'm admittedly not entirely sure what the best way to review this patch
> is given its size and my previous exposure to (similar) code.
Yeah, I wasn't sure how to approach this. I didn't want to come in with
a big bang with the final version, but rather give the reviewers the
chance to see the improvements (and complexity) come in gradually.
>> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
>> new file mode 100644
>> index 0000000000..63993bc1c9
>> --- /dev/null
>> +++ b/builtin/last-modified.c
>> @@ -0,0 +1,289 @@
>> +#include "git-compat-util.h"
>> +#include "builtin.h"
>> +#include "commit.h"
>> +#include "config.h"
>> +#include "diff.h"
>> +#include "diffcore.h"
>> +#include "hashmap.h"
>> +#include "hex.h"
>> +#include "log-tree.h"
>> +#include "object-name.h"
>> +#include "object.h"
>> +#include "parse-options.h"
>> +#include "quote.h"
>> +#include "repository.h"
>> +#include "revision.h"
>> +
>> +struct last_modified_entry {
>> + struct hashmap_entry hashent;
>> + struct object_id oid;
>> + const char path[FLEX_ARRAY];
>> +};
>
> As a general comment on this patch, I am a little sad to see that many
> of the implementation details have been moved back into the builtin
> itself and not in their own last-modified.ch file(s).
>
> Apologies if this was already discussed earlier in the thread and I
> simply missed it, but can you comment on why the last-modified internals
> were moved into the builtin?
Wasn't discussed yet, and this only happened in this last version.
Basically my idea was: there's no one else using this, why put it at the
root level anyway? Also, it relies heavily on `setup_revisions()`. In my
first iterations `argc` and `argv` from the builtin were passed on
directly to the root-level `last-modified.[ch]` subsystem. This is a
little awkward, putting so much raw user-input handling in the
subsystem.
> Even in the earliest version of 'blame-tree' that I could find (from
> 26999d045b (add blame-tree command, 2012-10-20) in my fork) many of the
> internals were written in blame-tree.c instead of builtin/blame-tree.c.
>
>> +static int last_modified_entry_hashcmp(const void *unused UNUSED,
>> + const struct hashmap_entry *hent1,
>> + const struct hashmap_entry *hent2,
>> + const void *path)
>> +{
>> + const struct last_modified_entry *ent1 =
>> + container_of(hent1, const struct last_modified_entry, hashent);
>> + const struct last_modified_entry *ent2 =
>> + container_of(hent2, const struct last_modified_entry, hashent);
>> + return strcmp(ent1->path, path ? path : ent2->path);
>> +}
>> +
>> +struct last_modified {
>> + struct hashmap paths;
>> + struct rev_info rev;
>> + int recursive, tree_in_recursive;
>
> Can we either make these two part of a bitfield, or at least declare
> them separately?
>
>> +};
>> +
>> +static void last_modified_release(struct last_modified *lm)
>> +{
>> + hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
>> + release_revisions(&lm->rev);
>> +}
>> +
>> +typedef void (*last_modified_callback)(const char *path,
>> + const struct commit *commit, void *data);
>> +
>> +struct last_modified_callback_data {
>> + struct commit *commit;
>> + struct hashmap *paths;
>> +
>> + last_modified_callback callback;
>> + void *callback_data;
>> +};
>
> I can't quite tell what the purpose of this struct is in conjunction
> with the last_modified_callback type above.
Yeah, this is kind of a remnant of when there was a last-modified
subsystem. In the current implementation, where all code lives in the
builtin, there's no good reason to keep this callback struct.
> The last_modified_callback type makes sense as a generic callback
> function that callers can pass to get <path, commit> pairs, along with
> an arbitrary "data" pointer.
>
> But then you define a last_modified_callback_data struct, which
> made me think that it would be used as the data type passed to the
> callback. In other words, given the existence of this struct, I would
> have expected the function pointer above to be defined like:
>
> typedef void (*last_modified_callback)(const char *path,
> const struct commit *commit,
> struct last_modified_callback_data *data);
>
> But the fact that the _data struct contains a last_modified_callback
> function pointer gives us a hint at what's going on here. It seems like
> last_modified_callback_data is used to store some bookkeeping
> information and dispatch calls to the "callback" function pointer.
>
> I think that the fact the struct's name ends with "_data" is what is
> confusing to me. I think this would be a little clearer if you renamed
> this "struct last_modified_callback" and the function pointer to
> "last_modified_callback_fn" or similar.
>
> (The irony is not lost on me that these comments would be applicable to
> GitHub's version of this code, too :-s).
Hey, that's no excuse to keep it like this. I think keeping the callback
infrastructure depends on whether we bring back the last-modified
subsystem. In that case, I will address your comments. If not, I think
we can get rid of it completely.
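
To make the renaming suggested above concrete, here is a minimal,
self-contained sketch. It is not code from the series: only the names
"struct last_modified_callback" and "last_modified_callback_fn" come from
the review; the stand-in "struct commit", the members "fn" and "fn_data",
and the helpers last_modified_dispatch() and count_paths() are
hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Git's struct commit; hypothetical, for illustration only. */
struct commit {
	const char *hex;
};

/* Callback function type: receives one <path, commit> pair plus caller data. */
typedef void (*last_modified_callback_fn)(const char *path,
					  const struct commit *commit,
					  void *data);

/* Bookkeeping plus the function to dispatch to (the former "_data" struct). */
struct last_modified_callback {
	const struct commit *commit;
	last_modified_callback_fn fn;	/* hypothetical member name */
	void *fn_data;			/* hypothetical member name */
};

/* Hypothetical helper: dispatch one path to the registered callback, if any. */
static void last_modified_dispatch(const struct last_modified_callback *cb,
				   const char *path)
{
	assert(cb && path);
	if (cb->fn)
		cb->fn(path, cb->commit, cb->fn_data);
}

/* Example callback: counts how many paths were reported. */
static void count_paths(const char *path, const struct commit *commit,
			void *data)
{
	(void)path;
	(void)commit;
	++*(int *)data;
}
```

With this split, the "_fn" suffix marks the function pointer type and the
struct name describes the dispatcher state, so neither is mistaken for the
opaque "data" argument.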
>> +static int populate_paths_from_revs(struct last_modified *lm)
>> +{
>> + int num_interesting = 0;
>> + struct diff_options diffopt;
>> +
>> + memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
>> + copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
>> + /*
>> + * Use a callback to populate the paths from revs
>> + */
>> + diffopt.output_format = DIFF_FORMAT_CALLBACK;
>> + diffopt.format_callback = add_path_from_diff;
>> + diffopt.format_callback_data = lm;
>> +
>> + for (size_t i = 0; i < lm->rev.pending.nr; i++) {
>> + struct object_array_entry *obj = lm->rev.pending.objects + i;
>> +
>> + if (obj->item->flags & UNINTERESTING)
>> + continue;
>> +
>> + if (num_interesting++)
>> + return error(_("can only get last-modified one tree at a time"));
>
> This error text is a little difficult to parse, but I'm not sure that I
> have a great suggestion for improving it. The equivalent from GitHub's
> fork is "can only blame one tree at a time", and I think the difficulty
> in parsing is that "last-modified" isn't a verb.
Oh yeah, I've been struggling with that myself as well. I'm open to a
rename, if you've got a better name?
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 2/6] t/perf: add last-modified perf script
2025-07-18 0:08 ` Taylor Blau
@ 2025-07-22 15:52 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-22 15:52 UTC (permalink / raw)
To: Taylor Blau
Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Junio C Hamano,
Jeff King
Taylor Blau <me@ttaylorr.com> writes:
> On Wed, Jul 16, 2025 at 03:35:14PM +0200, Toon Claes wrote:
>> diff --git a/t/perf/p8020-last-modified.sh b/t/perf/p8020-last-modified.sh
>> new file mode 100755
>> index 0000000000..a02ec907d4
>> --- /dev/null
>> +++ b/t/perf/p8020-last-modified.sh
>> @@ -0,0 +1,21 @@
>> +#!/bin/sh
>> +
>> +test_description='last-modified perf tests'
>> +. ./perf-lib.sh
>> +
>> +test_perf_default_repo
>> +
>> +test_perf 'top-level last-modified' '
>> + git last-modified HEAD
>> +'
>> +
>> +test_perf 'top-level recursive last-modified' '
>> + git last-modified -r HEAD
>> +'
>
> The only notable difference from GitHub's version here is that we do not
> have a recursive option, so our test is just "git blame-tree
> --max-depth=0", which is obviously not applicable here.
>
> What you wrote (testing "last-modified" both with and without the "-r"
> option) makes sense to me.
>
>> +test_perf 'subdir last-modified' '
>> + path=$(git ls-tree HEAD | grep ^040000 | head -n 1 | cut -f2)
>
> Hmm. This line comes directly from the patches that I originally shared,
> but seeing "git" on the left-hand side of a pipe makes me a little
> uneasy.
>
> We could also use the "-d" flag here, which will only show us trees,
> thus eliminating the need for the "grep ^040000" portion above.
>
> I'd probably write this as:
>
> git ls-tree -d HEAD >subtrees &&
> path="$(head -n 1 subtrees | cut -f2)" &&
> git last-modified -- "$path"
Makes sense, I shall pick this up when I reroll.
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 3/6] last-modified: use Bloom filters when available
2025-07-18 0:16 ` Taylor Blau
@ 2025-07-22 16:02 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-22 16:02 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Junio C Hamano
Taylor Blau <me@ttaylorr.com> writes:
> On Wed, Jul 16, 2025 at 03:35:15PM +0200, Toon Claes wrote:
>> Our 'git last-modified' performs a revision walk, and computes a diff at
>> each point in the walk to figure out whether a given revision changed
>> any of the paths it considers interesting.
>>
>> When changed-path Bloom filters are available, we can avoid computing
>> many such diffs. Before computing a diff, we first check if any of the
>> remaining paths of interest were possibly changed at a given commit by
>> consulting its Bloom filter. If any of them are, we are resigned to
>> compute the diff.
>>
>> If none of those queries returned "maybe", we know that the given commit
>> doesn't contain any changed paths which are interesting to us. So, we
>> can avoid computing it in this case.
>>
>> Comparing the perf test results on git.git:
>>
>> Test HEAD~ HEAD
>> ------------------------------------------------------------------------------------
>> 8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6%
>> 8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4%
>> 8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4%
>
> As an aside on 8020.3 (that I probably should have mentioned in the last
> commit), I think that our "| head -n1" heuristic for picking a sub-tree
> is skewing these results down. In git.git, the lexicographically
> earliest sub-tree is ".github", which is awfully tiny. I wonder if we
> should be grabbing the *last* sub-tree, or maybe the largest one by
> count of entries?
Ah yes, that's automatable. Good suggestion.
>
>> @@ -40,6 +43,12 @@ struct last_modified {
>>
>> static void last_modified_release(struct last_modified *lm)
>> {
>> + struct hashmap_iter iter;
>> + struct last_modified_entry *ent;
>> +
>> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
>> + clear_bloom_key(&ent->key);
>> +
>
> I did a double-take to make sure that ent->key would always be
> initialized here, but it is thanks to the FLEX_ALLOC_STR() call below.
>
>> hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
>> release_revisions(&lm->rev);
>> }
>> @@ -67,6 +76,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
>>
>> FLEX_ALLOC_STR(ent, path, path);
>> oidcpy(&ent->oid, &p->two->oid);
>> + if (lm->rev.bloom_filter_settings)
>> + fill_bloom_key(path, strlen(path), &ent->key,
>> + lm->rev.bloom_filter_settings);
>> hashmap_entry_init(&ent->hashent, strhash(ent->path));
>> hashmap_add(&lm->paths, &ent->hashent);
>> }
>> @@ -126,6 +138,7 @@ static void mark_path(const char *path, const struct object_id *oid,
>> data->callback(path, data->commit, data->callback_data);
>>
>> hashmap_remove(data->paths, &ent->hashent, path);
>> + clear_bloom_key(&ent->key);
>
> OK, we're calling clear_bloom_key() here, too, but it uses
> FREE_AND_NULL(), so calling it again in last_modified_release() may be a
> noop, but won't ever be a double-free.
>
>> @@ -238,6 +276,13 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
>> return argc;
>> }
>>
>> + /*
>> + * We're not interested in generation numbers here,
>> + * but calling this function to prepare the commit-graph.
>> + */
>> + (void)generation_numbers_enabled(lm->rev.repo);
>> + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
>
> Hmmph. I think when I originally wrote this I was using the side-effect
> of calling generation_numbers_enabled() as a hack. But I think that it
> may be worth making "prepare_commit_graph()" a non-static function and
> calling that instead.
Well, thanks for confirming this assumption. I wasn't 100% sure, and I
agree having a non-static "prepare_commit_graph()" would be better.
Somewhat related to that. In your branch you have this line in
"maybe_changed_path()":
if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
return 1;
I excluded this line because if we ignore the result of
"generation_numbers_enabled()", why would it matter to have a generation
number?
The thing is, "maybe_changed_path()" in blame.c has this line too, but
unfortunately git-blaming that line didn't teach me anything about why
it's there. Do you have any idea?
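
For readers following along, the "definitely not / maybe" decision that
the commit message describes can be sketched with a toy filter. This is
an illustration under loud assumptions: it does not use Git's bloom.h
API, its hash function, or its filter settings. Every name below
(toy_bloom, toy_hash, toy_maybe_changed_path, and so on) is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOOM_BITS 256
#define TOY_BLOOM_HASHES 3

/* Toy filter; NOT Git's struct bloom_filter. */
struct toy_bloom {
	uint8_t bits[TOY_BLOOM_BITS / 8];
};

static void toy_bloom_init(struct toy_bloom *f)
{
	memset(f, 0, sizeof(*f));
}

/* FNV-1a with a per-probe seed; any reasonable hash would do here. */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;
	for (; *s; s++) {
		h ^= (uint8_t)*s;
		h *= 16777619u;
	}
	return h;
}

/* Record a path that a (toy) commit changed. */
static void toy_bloom_add(struct toy_bloom *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_BLOOM_HASHES; i++) {
		uint32_t bit = toy_hash(path, i) % TOY_BLOOM_BITS;
		f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* 0 means "definitely not changed", 1 means "maybe changed". */
static int toy_bloom_maybe_contains(const struct toy_bloom *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_BLOOM_HASHES; i++) {
		uint32_t bit = toy_hash(path, i) % TOY_BLOOM_BITS;
		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return 0;
	}
	return 1;
}

/* Returns 1 if the diff must be computed, 0 if the commit can be skipped. */
static int toy_maybe_changed_path(const struct toy_bloom *f,
				  const char *const *paths, size_t nr)
{
	assert(f);
	for (size_t i = 0; i < nr; i++)
		if (toy_bloom_maybe_contains(f, paths[i]))
			return 1;
	return 0;
}
```

The point of the sketch is the asymmetry: a "no" from every remaining
path lets the walk skip the diff, while a single "maybe" forces it,
because a Bloom filter can report false positives but never false
negatives.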
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v5 5/6] last-modified: support --extended format
2025-07-18 17:36 ` Junio C Hamano
@ 2025-07-22 16:06 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-22 16:06 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Taylor Blau
Junio C Hamano <gitster@pobox.com> writes:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Hmph. This hardcoding of everything does not look easy to maintain.
>>
>> Besides, the test will fail rather miserably when run with SHA-256
>> hash (e.g., post Git 3.0 where the "git init" command by default
>> will give you a repository with new hash).
>>
>> It looks somewhat inconsistent that tree is shown with its object
>> name, but commit is not.
>
>
> I do not address either the first point or the last point above,
> but at least something like the attached patch needs to be squashed
> into this step to make the SHA-256 tests pass.
Thank you for the suggestion. But I think we can expect this commit to
be excluded whenever I reroll this series. But I appreciate the effort.
I should have taken SHA256 into account in the first place.
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* [PATCH v6 0/4] Introduce git-last-modified(1) command
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
` (6 preceding siblings ...)
2025-07-17 23:39 ` [PATCH v5 0/6] Introduce git-last-modified(1) command Taylor Blau
@ 2025-07-30 17:55 ` Toon Claes
2025-07-31 18:40 ` Junio C Hamano
` (4 more replies)
2025-07-30 17:55 ` [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified Toon Claes
` (3 subsequent siblings)
11 siblings, 5 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-30 17:55 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt, Toon Claes
On many forges the tree view is shown in combination with commit data.
In such a view each tree entry is accompanied by the commit message
and date that last modified that tree entry. Something like:
| README.md | README: *.txt -> *.adoc fixes | 4 months ago |
| RelNotes | Start 2.51 cycle, the first batch | 4 weeks ago |
| SECURITY.md | SECURITY: describe how to report vulnerabilities | 4 years |
| abspath.c | abspath: move related functions to abspath | 2 years |
| abspath.h | abspath: move related functions to abspath | 2 years |
| aclocal.m4 | configure: use AC_LANG_PROGRAM consistently | 15 years ago |
| add-patch.c | pager: stop using `the_repository` | 7 months ago |
| advice.c | advice: allow disabling default branch name advice | 4 months ago |
| advice.h | advice: allow disabling default branch name advice | 4 months ago |
| alias.h | rebase -m: fix serialization of strategy options | 2 years |
| alloc.h | git-compat-util: move alloc macros to git-compat-util.h | 2 years ago |
| apply.c | apply: only write intents to add for new files | 8 days ago |
| archive.c | Merge branch 'ps/parse-options-integers' | 3 months ago |
| archive.h | archive.h: remove unnecessary include | 1 year |
| attr.h | fuzz: port fuzz-parse-attr-line from OSS-Fuzz | 9 months ago |
| banned.h | banned.h: mark `strtok()` and `strtok_r()` as banned | 2 years |
This series adds the git-last-modified(1) command to feed this view. In the past
the subcommand was proposed[1] to be named git-blame-tree(1). This
version is based on the patches shared by the kind people at GitHub[2].
What is different from the series shared by GitHub:
* Renamed the subcommand from `blame-tree` to `last-modified`. There was
some consensus[5] this name works better, so let's give it a try and
see how this name feels.
* Patches for --max-depth are excluded. I've submitted them as a separate patch
series[6].
* The last-modified command isn't recursive by default. If you want
to recurse into subtrees, you need to pass `-r`.
* The patches in 'tb/blame-tree' at Taylor's fork[4] implement a
caching layer. This feature reads/writes cached results in
`.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
size, that feature is excluded from this series. I think it's better
to submit this as a separate series.
* All the new last-modified machinery is no longer implemented in a library
layer (at the root of the project), but directly in the builtin. So far the
code is fairly small (little over 300 lines of code) and there are no other
users of this code anyway. Also, the library-level code taken from Taylor's
fork required passing `argc` and `argv` into it. It's quite awkward that the
library code was so tightly coupled with user interaction.
* Squashed various commits together. Like they introduced a flag
`--go-faster`, which later became the default and only implementation.
That story was wrapped up in a single commit.
* Dropped the patches that attempt to increase performance for tree
entries that have not been updated in a long time. In my testing I've
seen both performance improvements *and* degradation with these
changes:
Test                                        HEAD~             HEAD
------------------------------------------------------------------------------------
8020.1: top-level last-modified             4.52(4.38+0.11)   2.03(1.93+0.08) -55.1%
8020.2: top-level recursive last-modified   5.79(5.64+0.11)   8.34(8.17+0.11) +44.0%
8020.3: subdir last-modified                0.15(0.09+0.06)   0.19(0.14+0.06) +26.7%
Before we include these patches, I want to make sure these changes
have positive impact in all/most scenarios. This can happen in a
separate series.
I've set myself as the author and added Based-on-patch-by trailers to
credit the original authors. Let me know if you disagree.
Again thanks to Taylor and the people at GitHub for sharing these
patches. I hope we can work together to get this upstreamed.
[1]: https://lore.kernel.org/git/patch-1.1-0ea849d900b-20230205T204104Z-avarab@gmail.com/
[2]: https://lore.kernel.org/git/Z+XJ+1L3PnC9Dyba@nand.local/
[3]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-3-4173133f3786@iotcl.com/
[4]: git@github.com:ttaylorr/git.git
[5]: https://lore.kernel.org/git/aCbBKj7O9LjO3SMK@pks.im/
[6]: https://lore.kernel.org/git/20250729-toon-max-depth-v1-0-c177e39c40fb@iotcl.com/
---
Changes in v6:
- Only the first 3 patches are kept. The last 3 patches worked toward adding an
extra option `--format`. The way it was implemented was heavily debatable and
in the end it is not required for a first iteration, so they are dropped.
- Function prepare_commit_graph() is exported and used in
generation_numbers_enabled().
- Since the library layer was removed and all the code was moved into the
builtin, there were still some leftovers from using a callback mechanism to
display the results. These are removed (as far as possible) and
last_modified_emit() is now always called instead; this function was
previously named show_entry().
- Code is rebased to use refactoring in the bloom filter API.
Changes in v5:
- Added a patch to allow for an "extended" format. The name for this option is
open for debate (please, all input is welcome). But the main goal of this
series is to provide the data needed for the "forge tree view" as demoed at
the top of this cover letter. With this extra patch (and the preparatory patch
to pretty.[ch]), I hope the use-case becomes clearer. But because it wasn't
included in the previous 4 versions, I also wouldn't mind sending a separate
patch series for it.
- Removed the call to sort(1) in the t8020 tests. This was needed for the tests
for --extended.
- I'm adding a fixup! commit to be compatible with in-flight patches for bloom
filter optimizations:
https://lore.kernel.org/git/20250712093517.17907-1-yldhome2d2@gmail.com/
This patch can be dropped if current series lands before those.
Changes in v4:
- Removed root-level `last-modified.[ch]` library code and moved code to
`builtin/last-modified.c`. Historically we've had library code (also because it
was used in testtool), but we no longer need that separation. I'm sorry this
makes the range-diff hard to read.
- Added the use of parse_options() to get better usage messages.
- Formatting fixes after conversation in
https://lore.kernel.org/git/xmqqh5zvk5h0.fsf@gitster.g/
- Link to v3: https://lore.kernel.org/git/20250630-toon-new-blame-tree-v3-0-3516025dc3bc@iotcl.com/
Changes in v3:
- Updated benchmarks in commit messages.
- Removed the patches that attempt to increase performance for tree
entries that have not been updated in a long time. (see above)
- Move handling failure in `last_modified_init()` to the caller.
- Sorted #include clauses lexicographically.
- Removed unneeded `commit` in `struct last_modified_entry`.
- Renamed some functions/variables and added some comments to make it
easier to understand.
- Removed unnecessary checking of the commit-graph generation number.
- Link to v2: https://lore.kernel.org/r/20250523-toon-new-blame-tree-v2-0-101e4ca4c1c9@iotcl.com
Changes in v2:
- The subcommand is renamed from `blame-tree` to `last-modified`
- Documentation is added. Here we mark the command as experimental.
- Some test cases are added related to merges.
- Link to v1: https://lore.kernel.org/r/20250422-toon-new-blame-tree-v1-0-fdb51b8a394a@iotcl.com
Toon Claes (4):
last-modified: new subcommand to show when files were last modified
t/perf: add last-modified perf script
commit-graph: export prepare_commit_graph()
last-modified: use Bloom filters when available
.gitignore | 1 +
Documentation/git-last-modified.adoc | 49 +++++
Documentation/meson.build | 1 +
Makefile | 1 +
builtin.h | 1 +
builtin/last-modified.c | 318 +++++++++++++++++++++++++++
command-list.txt | 1 +
commit-graph.c | 8 +-
commit-graph.h | 8 +
git.c | 1 +
meson.build | 1 +
t/meson.build | 2 +
t/perf/p8020-last-modified.sh | 22 ++
t/t8020-last-modified.sh | 203 +++++++++++++++++
14 files changed, 610 insertions(+), 7 deletions(-)
create mode 100644 Documentation/git-last-modified.adoc
create mode 100644 builtin/last-modified.c
create mode 100755 t/perf/p8020-last-modified.sh
create mode 100755 t/t8020-last-modified.sh
Range-diff against v5:
1: 8c6493d1d1 ! 1: 9d5ce06460 last-modified: new subcommand to show when files were last modified
@@ builtin/last-modified.c (new)
+struct last_modified {
+ struct hashmap paths;
+ struct rev_info rev;
-+ int recursive, tree_in_recursive;
++ int recursive;
++ int tree_in_recursive;
+};
+
+static void last_modified_release(struct last_modified *lm)
@@ builtin/last-modified.c (new)
+ release_revisions(&lm->rev);
+}
+
-+typedef void (*last_modified_callback)(const char *path,
-+ const struct commit *commit, void *data);
-+
+struct last_modified_callback_data {
++ struct last_modified *lm;
+ struct commit *commit;
-+ struct hashmap *paths;
-+
-+ last_modified_callback callback;
-+ void *callback_data;
+};
+
+static void add_path_from_diff(struct diff_queue_struct *q,
@@ builtin/last-modified.c (new)
+ continue;
+
+ if (num_interesting++)
-+ return error(_("can only get last-modified one tree at a time"));
++ return error(_("last-modified can only operate on one tree at a time"));
+
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
@@ builtin/last-modified.c (new)
+ return 0;
+}
+
++static void last_modified_emit(struct last_modified *lm,
++ const char *path, const struct commit *commit)
++
++{
++ if (commit->object.flags & BOUNDARY)
++ putchar('^');
++ printf("%s\t", oid_to_hex(&commit->object.oid));
++
++ if (lm->rev.diffopt.line_termination)
++ write_name_quoted(path, stdout, '\n');
++ else
++ printf("%s%c", path, '\0');
++
++ fflush(stdout);
++}
++
+static void mark_path(const char *path, const struct object_id *oid,
+ struct last_modified_callback_data *data)
+{
+ struct last_modified_entry *ent;
+
+ /* Is it even a path that we are interested in? */
-+ ent = hashmap_get_entry_from_hash(data->paths, strhash(path), path,
++ ent = hashmap_get_entry_from_hash(&data->lm->paths, strhash(path), path,
+ struct last_modified_entry, hashent);
+ if (!ent)
+ return;
@@ builtin/last-modified.c (new)
+ if (!oideq(oid, &ent->oid))
+ return;
+
-+ if (data->callback)
-+ data->callback(path, data->commit, data->callback_data);
++ last_modified_emit(data->lm, path, data->commit);
+
-+ hashmap_remove(data->paths, &ent->hashent, path);
++ hashmap_remove(&data->lm->paths, &ent->hashent, path);
+ free(ent);
+}
+
@@ builtin/last-modified.c (new)
+ }
+}
+
-+static int last_modified_run(struct last_modified *lm,
-+ last_modified_callback cb, void *cbdata)
++static int last_modified_run(struct last_modified *lm)
+{
-+ struct last_modified_callback_data data;
-+
-+ data.paths = &lm->paths;
-+ data.callback = cb;
-+ data.callback_data = cbdata;
++ struct last_modified_callback_data data = { .lm = lm };
+
+ lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ lm->rev.diffopt.format_callback = last_modified_diff;
@@ builtin/last-modified.c (new)
+ return 0;
+}
+
-+static void show_entry(const char *path, const struct commit *commit, void *d)
-+{
-+ struct last_modified *lm = d;
-+
-+ if (commit->object.flags & BOUNDARY)
-+ putchar('^');
-+ printf("%s\t", oid_to_hex(&commit->object.oid));
-+
-+ if (lm->rev.diffopt.line_termination)
-+ write_name_quoted(path, stdout, '\n');
-+ else
-+ printf("%s%c", path, '\0');
-+
-+ fflush(stdout);
-+}
-+
+static int last_modified_init(struct last_modified *lm, struct repository *r,
+ const char *prefix, int argc, const char **argv)
+{
@@ builtin/last-modified.c (new)
+ goto out;
+ }
+
-+ if ((ret = last_modified_run(&lm, show_entry, &lm)))
++ if ((ret = last_modified_run(&lm)))
+ goto out;
+
+out:
2: dc34010bfb ! 2: 7c921d4344 t/perf: add last-modified perf script
@@ t/perf/p8020-last-modified.sh (new)
+'
+
+test_perf 'subdir last-modified' '
-+ path=$(git ls-tree HEAD | grep ^040000 | head -n 1 | cut -f2)
++ git ls-tree -d HEAD >subtrees &&
++ path="$(head -n 1 subtrees | cut -f2)" &&
+ git last-modified -r HEAD -- "$path"
+'
+
-: ---------- > 3: 3c42043682 commit-graph: export prepare_commit_graph()
3: 8cd05437f0 ! 4: 4d7376a46d last-modified: use Bloom filters when available
@@ builtin/last-modified.c: static void add_path_from_diff(struct diff_queue_struct
hashmap_add(&lm->paths, &ent->hashent);
}
@@ builtin/last-modified.c: static void mark_path(const char *path, const struct object_id *oid,
- data->callback(path, data->commit, data->callback_data);
+ last_modified_emit(data->lm, path, data->commit);
- hashmap_remove(data->paths, &ent->hashent, path);
+ hashmap_remove(&data->lm->paths, &ent->hashent, path);
+ bloom_key_clear(&ent->key);
free(ent);
}
@@ builtin/last-modified.c: static void last_modified_diff(struct diff_queue_struct
}
}
-+
+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
@@ builtin/last-modified.c: static void last_modified_diff(struct diff_queue_struct
+ return 0;
+}
+
- static int last_modified_run(struct last_modified *lm,
- last_modified_callback cb, void *cbdata)
+ static int last_modified_run(struct last_modified *lm)
{
-@@ builtin/last-modified.c: static int last_modified_run(struct last_modified *lm,
+ struct last_modified_callback_data data = { .lm = lm };
+@@ builtin/last-modified.c: static int last_modified_run(struct last_modified *lm)
if (!data.commit)
break;
@@ builtin/last-modified.c: static int last_modified_init(struct last_modified *lm,
return argc;
}
-+ /*
-+ * We're not interested in generation numbers here,
-+ * but calling this function to prepare the commit-graph.
-+ */
-+ (void)generation_numbers_enabled(lm->rev.repo);
++ prepare_commit_graph(lm->rev.repo);
+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
if (populate_paths_from_revs(lm) < 0)
base-commit: e813a0200a7121b97fec535f0d0b460b0a33356c
--
2.50.1.327.g047016eb4a
* [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
` (7 preceding siblings ...)
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
@ 2025-07-30 17:55 ` Toon Claes
2025-07-31 6:42 ` Patrick Steinhardt
2025-08-01 10:18 ` Christian Couder
2025-07-30 17:55 ` [PATCH v6 2/4] t/perf: add last-modified perf script Toon Claes
` (2 subsequent siblings)
11 siblings, 2 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-30 17:55 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt, Toon Claes, Jeff King,
Ævar Arnfjörð Bjarmason
Similar to git-blame(1), introduce a new subcommand
git-last-modified(1). This command shows the most recent modification to
paths in a tree. It does so by expanding the tree at a given commit,
taking note of the current state of each path, and then walking
backwards through history looking for commits where each path changed
into its final object ID.
Based-on-patch-by: Jeff King <peff@peff.net>
Improved-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
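For readers new to the approach, the walk described above can be sketched
as a small toy in C. This is only an illustration under simplifying
assumptions, not the code in builtin/last-modified.c: history is a flat,
newest-first array (so no merges), object IDs are plain integers, and all
`toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One path touched by a commit, with the blob it has after that commit. */
struct toy_change {
	const char *path;
	int blob;
};

/* A commit in a linear history; `changes` lists the paths it touched. */
struct toy_commit {
	const char *name;
	const struct toy_change *changes;
	size_t nr;
};

/*
 * A path in the tree we start from. `blob` is its final state and
 * `found` must start out as NULL; it is set to the commit that produced
 * that final state.
 */
struct toy_entry {
	const char *path;
	int blob;
	const char *found;
};

/*
 * Walk `history` from newest to oldest and attribute each entry to the
 * first commit that changed its path into the final blob. Stops early
 * once every entry is resolved. Returns the number of unresolved entries.
 */
static size_t toy_last_modified(const struct toy_commit *history,
				size_t nr_commits,
				struct toy_entry *ents, size_t nr_ents)
{
	size_t pending = nr_ents;

	assert(history && ents);

	for (size_t c = 0; c < nr_commits && pending; c++) {
		for (size_t i = 0; i < history[c].nr; i++) {
			const struct toy_change *ch = &history[c].changes[i];

			for (size_t e = 0; e < nr_ents; e++) {
				struct toy_entry *ent = &ents[e];

				/* Already resolved, or not the path we want. */
				if (ent->found || strcmp(ent->path, ch->path))
					continue;
				/* Not the version we ended up with. */
				if (ent->blob != ch->blob)
					continue;

				ent->found = history[c].name;
				pending--;
			}
		}
	}
	return pending;
}
```

With a three-commit history shaped like the setup in t8020 (commit 1 adds
`file`, 2 adds `a/file`, 3 adds `a/b/file`), this attributes each path to
the commit that introduced it. The blob comparison only starts to matter
once merges are involved, which the toy leaves out.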
.gitignore | 1 +
Documentation/git-last-modified.adoc | 49 +++++
Documentation/meson.build | 1 +
Makefile | 1 +
builtin.h | 1 +
builtin/last-modified.c | 278 +++++++++++++++++++++++++++
command-list.txt | 1 +
git.c | 1 +
meson.build | 1 +
t/meson.build | 1 +
t/t8020-last-modified.sh | 203 +++++++++++++++++++
11 files changed, 538 insertions(+)
create mode 100644 Documentation/git-last-modified.adoc
create mode 100644 builtin/last-modified.c
create mode 100755 t/t8020-last-modified.sh
diff --git a/.gitignore b/.gitignore
index 04c444404e..a36ee94443 100644
--- a/.gitignore
+++ b/.gitignore
@@ -87,6 +87,7 @@
/git-init-db
/git-interpret-trailers
/git-instaweb
+/git-last-modified
/git-log
/git-ls-files
/git-ls-remote
diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
new file mode 100644
index 0000000000..89138ebeb7
--- /dev/null
+++ b/Documentation/git-last-modified.adoc
@@ -0,0 +1,49 @@
+git-last-modified(1)
+====================
+
+NAME
+----
+git-last-modified - EXPERIMENTAL: Show when files were last modified
+
+
+SYNOPSIS
+--------
+[synopsis]
+git last-modified [-r] [-t] [<revision-range>] [[--] <path>...]
+
+DESCRIPTION
+-----------
+
+Shows which commit last modified each of the relevant files and subdirectories.
+
+THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
+
+OPTIONS
+-------
+
+-r::
+ Recurse into subtrees.
+
+-t::
+ Show tree entry itself as well as subtrees. Implies `-r`.
+
+<revision-range>::
+ Only traverse commits in the specified revision range. When no
+ `<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
+ history leading to the current commit). For a complete list of ways to
+ spell `<revision-range>`, see the 'Specifying Ranges' section of
+ linkgit:gitrevisions[7].
+
+[--] <path>...::
+ For each _<path>_ given, the commit which last modified it is returned.
+ Without an optional path parameter, all files and subdirectories
+ in the path traversal are included in the output.
+
+SEE ALSO
+--------
+linkgit:git-blame[1],
+linkgit:git-log[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index 4404c623f0..a8ac5285f0 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -74,6 +74,7 @@ manpages = {
'git-init.adoc' : 1,
'git-instaweb.adoc' : 1,
'git-interpret-trailers.adoc' : 1,
+ 'git-last-modified.adoc' : 1,
'git-log.adoc' : 1,
'git-ls-files.adoc' : 1,
'git-ls-remote.adoc' : 1,
diff --git a/Makefile b/Makefile
index 5f7dd79dfa..b5ce55a703 100644
--- a/Makefile
+++ b/Makefile
@@ -1265,6 +1265,7 @@ BUILTIN_OBJS += builtin/hook.o
BUILTIN_OBJS += builtin/index-pack.o
BUILTIN_OBJS += builtin/init-db.o
BUILTIN_OBJS += builtin/interpret-trailers.o
+BUILTIN_OBJS += builtin/last-modified.o
BUILTIN_OBJS += builtin/log.o
BUILTIN_OBJS += builtin/ls-files.o
BUILTIN_OBJS += builtin/ls-remote.o
diff --git a/builtin.h b/builtin.h
index bff13e3069..6ed6759ec4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -176,6 +176,7 @@ int cmd_hook(int argc, const char **argv, const char *prefix, struct repository
int cmd_index_pack(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_init_db(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_interpret_trailers(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_last_modified(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log_reflog(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_ls_files(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
new file mode 100644
index 0000000000..e4c73464c7
--- /dev/null
+++ b/builtin/last-modified.c
@@ -0,0 +1,278 @@
+#include "git-compat-util.h"
+#include "builtin.h"
+#include "commit.h"
+#include "config.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "log-tree.h"
+#include "object-name.h"
+#include "object.h"
+#include "parse-options.h"
+#include "quote.h"
+#include "repository.h"
+#include "revision.h"
+
+struct last_modified_entry {
+ struct hashmap_entry hashent;
+ struct object_id oid;
+ const char path[FLEX_ARRAY];
+};
+
+static int last_modified_entry_hashcmp(const void *unused UNUSED,
+ const struct hashmap_entry *hent1,
+ const struct hashmap_entry *hent2,
+ const void *path)
+{
+ const struct last_modified_entry *ent1 =
+ container_of(hent1, const struct last_modified_entry, hashent);
+ const struct last_modified_entry *ent2 =
+ container_of(hent2, const struct last_modified_entry, hashent);
+ return strcmp(ent1->path, path ? path : ent2->path);
+}
+
+struct last_modified {
+ struct hashmap paths;
+ struct rev_info rev;
+ int recursive;
+ int tree_in_recursive;
+};
+
+static void last_modified_release(struct last_modified *lm)
+{
+ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
+ release_revisions(&lm->rev);
+}
+
+struct last_modified_callback_data {
+ struct last_modified *lm;
+ struct commit *commit;
+};
+
+static void add_path_from_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *data)
+{
+ struct last_modified *lm = data;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct last_modified_entry *ent;
+ const char *path = p->two->path;
+
+ FLEX_ALLOC_STR(ent, path, path);
+ oidcpy(&ent->oid, &p->two->oid);
+ hashmap_entry_init(&ent->hashent, strhash(ent->path));
+ hashmap_add(&lm->paths, &ent->hashent);
+ }
+}
+
+static int populate_paths_from_revs(struct last_modified *lm)
+{
+ int num_interesting = 0;
+ struct diff_options diffopt;
+
+ memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
+ copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
+ /*
+ * Use a callback to populate the paths from revs
+ */
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ diffopt.format_callback = add_path_from_diff;
+ diffopt.format_callback_data = lm;
+
+ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
+ struct object_array_entry *obj = lm->rev.pending.objects + i;
+
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
+ if (num_interesting++)
+ return error(_("last-modified can only operate on one tree at a time"));
+
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
+ diff_free(&diffopt);
+
+ return 0;
+}
+
+static void last_modified_emit(struct last_modified *lm,
+ const char *path, const struct commit *commit)
+
+{
+ if (commit->object.flags & BOUNDARY)
+ putchar('^');
+ printf("%s\t", oid_to_hex(&commit->object.oid));
+
+ if (lm->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+
+ fflush(stdout);
+}
+
+static void mark_path(const char *path, const struct object_id *oid,
+ struct last_modified_callback_data *data)
+{
+ struct last_modified_entry *ent;
+
+ /* Is it even a path that we are interested in? */
+ ent = hashmap_get_entry_from_hash(&data->lm->paths, strhash(path), path,
+ struct last_modified_entry, hashent);
+ if (!ent)
+ return;
+
+ /*
+ * Is it arriving at a version of interest, or is it from a side branch
+ * which did not contribute to the final state?
+ */
+ if (!oideq(oid, &ent->oid))
+ return;
+
+ last_modified_emit(data->lm, path, data->commit);
+
+ hashmap_remove(&data->lm->paths, &ent->hashent, path);
+ free(ent);
+}
+
+static void last_modified_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *cbdata)
+{
+ struct last_modified_callback_data *data = cbdata;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ switch (p->status) {
+ case DIFF_STATUS_DELETED:
+ /*
+ * There's no point in feeding a deletion, as it could
+ * not have resulted in our current state, which
+ * actually has the file.
+ */
+ break;
+
+ default:
+ /*
+ * Otherwise, we care only that we somehow arrived at
+ * a final oid state. Note that this covers some
+ * potentially controversial areas, including:
+ *
+ * 1. A rename or copy will be found, as it is the
+ * first time the content has arrived at the given
+ * path.
+ *
+ * 2. Even a non-content modification like a mode or
+ * type change will trigger it.
+ *
+ * We take the inclusive approach for now, and find
+ * anything which impacts the path. Options to tweak
+ * the behavior (e.g., to "--follow" the content across
+ * renames) can come later.
+ */
+ mark_path(p->two->path, &p->two->oid, data);
+ break;
+ }
+ }
+}
+
+static int last_modified_run(struct last_modified *lm)
+{
+ struct last_modified_callback_data data = { .lm = lm };
+
+ lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ lm->rev.diffopt.format_callback = last_modified_diff;
+ lm->rev.diffopt.format_callback_data = &data;
+
+ prepare_revision_walk(&lm->rev);
+
+ while (hashmap_get_size(&lm->paths)) {
+ data.commit = get_revision(&lm->rev);
+ if (!data.commit)
+ break;
+
+ if (data.commit->object.flags & BOUNDARY) {
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &data.commit->object.oid, "",
+ &lm->rev.diffopt);
+ diff_flush(&lm->rev.diffopt);
+ } else {
+ log_tree_commit(&lm->rev, data.commit);
+ }
+ }
+
+ return 0;
+}
+
+static int last_modified_init(struct last_modified *lm, struct repository *r,
+ const char *prefix, int argc, const char **argv)
+{
+ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
+
+ repo_init_revisions(r, &lm->rev, prefix);
+ lm->rev.def = "HEAD";
+ lm->rev.combine_merges = 1;
+ lm->rev.show_root_diff = 1;
+ lm->rev.boundary = 1;
+ lm->rev.no_commit_id = 1;
+ lm->rev.diff = 1;
+ lm->rev.diffopt.flags.recursive = lm->recursive || lm->tree_in_recursive;
+ lm->rev.diffopt.flags.tree_in_recursive = lm->tree_in_recursive;
+
+ if ((argc = setup_revisions(argc, argv, &lm->rev, NULL)) > 1) {
+ error(_("unknown last-modified argument: %s"), argv[1]);
+ return argc;
+ }
+
+ if (populate_paths_from_revs(lm) < 0)
+ return error(_("unable to setup last-modified"));
+
+ return 0;
+}
+
+int cmd_last_modified(int argc, const char **argv, const char *prefix,
+ struct repository *repo)
+{
+ int ret;
+ struct last_modified lm;
+
+ const char * const last_modified_usage[] = {
+ N_("git last-modified [-r] [-t] "
+ "[<revision-range>] [[--] <path>...]"),
+ NULL
+ };
+
+ struct option last_modified_options[] = {
+ OPT_BOOL('r', "recursive", &lm.recursive,
+ N_("recurse into subtrees")),
+ OPT_BOOL('t', "tree-in-recursive", &lm.tree_in_recursive,
+ N_("recurse into subtrees and include the tree entries too")),
+ OPT_END()
+ };
+
+ memset(&lm, 0, sizeof(lm));
+
+ argc = parse_options(argc, argv, prefix, last_modified_options,
+ last_modified_usage,
+ PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_UNKNOWN_OPT);
+
+ repo_config(repo, git_default_config, NULL);
+
+ if ((ret = last_modified_init(&lm, repo, prefix, argc, argv))) {
+ if (ret > 0)
+ usage_with_options(last_modified_usage,
+ last_modified_options);
+ goto out;
+ }
+
+ if ((ret = last_modified_run(&lm)))
+ goto out;
+
+out:
+ last_modified_release(&lm);
+
+ return ret;
+}
diff --git a/command-list.txt b/command-list.txt
index b7ade3ab9f..b715777b24 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -124,6 +124,7 @@ git-index-pack plumbingmanipulators
git-init mainporcelain init
git-instaweb ancillaryinterrogators complete
git-interpret-trailers purehelpers
+git-last-modified plumbinginterrogators
git-log mainporcelain info
git-ls-files plumbinginterrogators
git-ls-remote plumbinginterrogators
diff --git a/git.c b/git.c
index 07a5fe39fb..76a0b2a1a4 100644
--- a/git.c
+++ b/git.c
@@ -565,6 +565,7 @@ static struct cmd_struct commands[] = {
{ "init", cmd_init_db },
{ "init-db", cmd_init_db },
{ "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
+ { "last-modified", cmd_last_modified, RUN_SETUP },
{ "log", cmd_log, RUN_SETUP },
{ "ls-files", cmd_ls_files, RUN_SETUP },
{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
diff --git a/meson.build b/meson.build
index 9bc1826cb6..77a3416b1c 100644
--- a/meson.build
+++ b/meson.build
@@ -607,6 +607,7 @@ builtin_sources = [
'builtin/index-pack.c',
'builtin/init-db.c',
'builtin/interpret-trailers.c',
+ 'builtin/last-modified.c',
'builtin/log.c',
'builtin/ls-files.c',
'builtin/ls-remote.c',
diff --git a/t/meson.build b/t/meson.build
index 660d780dcc..904455e3ab 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -961,6 +961,7 @@ integration_tests = [
't8012-blame-colors.sh',
't8013-blame-ignore-revs.sh',
't8014-blame-ignore-fuzzy.sh',
+ 't8020-last-modified.sh',
't9001-send-email.sh',
't9002-column.sh',
't9003-help-autocorrect.sh',
diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
new file mode 100755
index 0000000000..05c113a1f8
--- /dev/null
+++ b/t/t8020-last-modified.sh
@@ -0,0 +1,203 @@
+#!/bin/sh
+
+test_description='last-modified tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_commit 1 file &&
+ mkdir a &&
+ test_commit 2 a/file &&
+ mkdir a/b &&
+ test_commit 3 a/b/file
+'
+
+test_expect_success 'cannot run last-modified on two trees' '
+ test_must_fail git last-modified HEAD HEAD~1
+'
+
+check_last_modified() {
+ local indir= &&
+ while test $# != 0
+ do
+ case "$1" in
+ -C)
+ indir="$2"
+ shift
+ ;;
+ *)
+ break
+ ;;
+ esac &&
+ shift
+ done &&
+
+ cat >expect &&
+ test_when_finished "rm -f tmp.*" &&
+ git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
+ tr '\t' ' ' <tmp.2 >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'last-modified non-recursive' '
+ check_last_modified <<-\EOF
+ 3 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified recursive' '
+ check_last_modified -r <<-\EOF
+ 3 a/b/file
+ 2 a/file
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified recursive with tree' '
+ check_last_modified -t <<-\EOF
+ 3 a
+ 3 a/b
+ 3 a/b/file
+ 2 a/file
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified subdir' '
+ check_last_modified a <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified subdir recursive' '
+ check_last_modified -r a <<-\EOF
+ 3 a/b/file
+ 2 a/file
+ EOF
+'
+
+test_expect_success 'last-modified from non-HEAD commit' '
+ check_last_modified HEAD^ <<-\EOF
+ 2 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified from subdir defaults to root' '
+ check_last_modified -C a <<-\EOF
+ 3 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified from subdir uses relative pathspecs' '
+ check_last_modified -C a -r b <<-\EOF
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by count' '
+ check_last_modified -1 <<-\EOF
+ 3 a
+ ^2 file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by commit' '
+ check_last_modified HEAD~2..HEAD <<-\EOF
+ 3 a
+ ^1 file
+ EOF
+'
+
+test_expect_success 'only last-modified files in the current tree' '
+ git rm -rf a &&
+ git commit -m "remove a" &&
+ check_last_modified <<-\EOF
+ 1 file
+ EOF
+'
+
+test_expect_success 'cross merge boundaries in blaming' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit m1 &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit m2 &&
+ git merge m1 &&
+ check_last_modified <<-\EOF
+ m2 m2.t
+ m1 m1.t
+ EOF
+'
+
+test_expect_success 'last-modified merge for resolved conflicts' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit c1 conflict &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit c2 conflict &&
+ test_must_fail git merge c1 &&
+ test_commit resolved conflict &&
+ check_last_modified conflict <<-\EOF
+ resolved conflict
+ EOF
+'
+
+
+# Consider `file` with this content through history:
+#
+# A---B---B-------B---B
+# \ /
+# C---D
+test_expect_success 'last-modified merge ignores content from branch' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit a1 file A &&
+ test_commit a2 file B &&
+ test_commit a3 file C &&
+ test_commit a4 file D &&
+ git checkout a2 &&
+ git merge --no-commit --no-ff a4 &&
+ git checkout a2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ a2 file
+ EOF
+'
+
+# Consider `file` with this content through history:
+#
+# A---B---B---C---D---B---B
+# \ /
+# B-------B
+test_expect_success 'last-modified merge undoes changes' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit b1 file A &&
+ test_commit b2 file B &&
+ test_commit b3 file C &&
+ test_commit b4 file D &&
+ git checkout b2 &&
+ test_commit b5 file2 2 &&
+ git checkout b4 &&
+ git merge --no-commit --no-ff b5 &&
+ git checkout b2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ b5 file2
+ b2 file
+ EOF
+'
+
+test_expect_success 'last-modified complains about unknown arguments' '
+ test_must_fail git last-modified --foo 2>err &&
+ grep "unknown last-modified argument: --foo" err
+'
+
+test_done
--
2.50.1.327.g047016eb4a
* [PATCH v6 2/4] t/perf: add last-modified perf script
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
` (8 preceding siblings ...)
2025-07-30 17:55 ` [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-07-30 17:55 ` Toon Claes
2025-07-30 17:55 ` [PATCH v6 3/4] commit-graph: export prepare_commit_graph() Toon Claes
2025-07-30 17:55 ` [PATCH v6 4/4] last-modified: use Bloom filters when available Toon Claes
11 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-07-30 17:55 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt, Toon Claes, Jeff King
This just runs some simple last-modified commands. We already test
correctness in the regular suite, so this is just about finding
performance regressions from one version to another.
Based-on-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
t/meson.build | 1 +
t/perf/p8020-last-modified.sh | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+)
create mode 100755 t/perf/p8020-last-modified.sh
diff --git a/t/meson.build b/t/meson.build
index 904455e3ab..b74125b047 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -1155,6 +1155,7 @@ benchmarks = [
'perf/p7820-grep-engines.sh',
'perf/p7821-grep-engines-fixed.sh',
'perf/p7822-grep-perl-character.sh',
+ 'perf/p8020-last-modified.sh',
'perf/p9210-scalar.sh',
'perf/p9300-fast-import-export.sh',
]
diff --git a/t/perf/p8020-last-modified.sh b/t/perf/p8020-last-modified.sh
new file mode 100755
index 0000000000..cb1f98d3db
--- /dev/null
+++ b/t/perf/p8020-last-modified.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+
+test_description='last-modified perf tests'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_perf 'top-level last-modified' '
+ git last-modified HEAD
+'
+
+test_perf 'top-level recursive last-modified' '
+ git last-modified -r HEAD
+'
+
+test_perf 'subdir last-modified' '
+ git ls-tree -d HEAD >subtrees &&
+ path="$(head -n 1 subtrees | cut -f2)" &&
+ git last-modified -r HEAD -- "$path"
+'
+
+test_done
--
2.50.1.327.g047016eb4a
* [PATCH v6 3/4] commit-graph: export prepare_commit_graph()
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
` (9 preceding siblings ...)
2025-07-30 17:55 ` [PATCH v6 2/4] t/perf: add last-modified perf script Toon Claes
@ 2025-07-30 17:55 ` Toon Claes
2025-07-31 6:42 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 4/4] last-modified: use Bloom filters when available Toon Claes
11 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-30 17:55 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt, Toon Claes
Allow users of the commit-graph to explicitly prepare the commit-graph.
This can be useful when users want to start using bloom keys before
calling functions like prepare_revision_walk(). We'll use this exported
function in a subsequent commit.
Signed-off-by: Toon Claes <toon@iotcl.com>
---
commit-graph.c | 8 +-------
commit-graph.h | 8 ++++++++
2 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index bd7b6f5338..a1f9fc22a4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -739,13 +739,7 @@ static void prepare_commit_graph_one(struct repository *r,
r->objects->commit_graph = read_commit_graph_one(r, source);
}
-/*
- * Return 1 if commit_graph is non-NULL, and 0 otherwise.
- *
- * On the first invocation, this function attempts to load the commit
- * graph if the_repository is configured to have one.
- */
-static int prepare_commit_graph(struct repository *r)
+int prepare_commit_graph(struct repository *r)
{
struct odb_source *source;
diff --git a/commit-graph.h b/commit-graph.h
index 78ab7b875b..0f76681333 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -131,6 +131,14 @@ struct repo_settings;
struct commit_graph *parse_commit_graph(struct repo_settings *s,
void *graph_map, size_t graph_size);
+/*
+ * Return 1 if commit_graph is non-NULL, and 0 otherwise.
+ *
+ * On the first invocation, this function attempts to load the commit
+ * graph if the_repository is configured to have one.
+ */
+int prepare_commit_graph(struct repository *r);
+
/*
* Return 1 if and only if the repository has a commit-graph
* file and generation numbers are computed in that file.
--
2.50.1.327.g047016eb4a
* [PATCH v6 4/4] last-modified: use Bloom filters when available
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
` (10 preceding siblings ...)
2025-07-30 17:55 ` [PATCH v6 3/4] commit-graph: export prepare_commit_graph() Toon Claes
@ 2025-07-30 17:55 ` Toon Claes
2025-07-31 6:43 ` Patrick Steinhardt
11 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-30 17:55 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt, Toon Claes
Our 'git last-modified' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.
When changed-path Bloom filters are available, we can avoid computing
many such diffs. Before computing a diff, we first check if any of the
remaining paths of interest were possibly changed at a given commit by
consulting its Bloom filter. If any of them are, we are resigned to
computing the diff.
If none of those queries returned "maybe", we know that the given commit
doesn't contain any changed paths which are interesting to us. So, we
can skip computing the diff in this case.
Comparing the perf test results on git.git:
Test HEAD~ HEAD
------------------------------------------------------------------------------------
8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6%
8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4%
8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4%
Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
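As a rough illustration of the "maybe changed" check described above, here
is a toy Bloom filter in C. It is a sketch under stated assumptions, not
Git's implementation: the filter size, the number of probes, and the
seeded FNV-1a hash are all made up for the example (Git's changed-path
filters use their own hashing and sizing from bloom.c), and every `toy_*`
name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BITS 256	/* filter size in bits (assumption) */
#define TOY_HASHES 3	/* probes per key (assumption) */

/* Per-commit filter over the paths that commit changed. */
struct toy_filter {
	uint8_t bits[TOY_BITS / 8];
};

/* Seeded FNV-1a; a stand-in for the hashing Git really uses. */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (uint8_t)*s;
		h *= 16777619u;
	}
	return h;
}

/* Record that `path` was changed by the commit owning this filter. */
static void toy_filter_add(struct toy_filter *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_HASHES; i++) {
		uint32_t bit = toy_hash(path, i) % TOY_BITS;

		f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* Returns 0 for "definitely not changed", 1 for "maybe changed". */
static int toy_filter_contains(const struct toy_filter *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_HASHES; i++) {
		uint32_t bit = toy_hash(path, i) % TOY_BITS;

		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return 0;
	}
	return 1;
}

/*
 * Mirror of the idea behind maybe_changed_path(): the diff is needed
 * only if at least one remaining path of interest may have changed.
 */
static int toy_maybe_changed(const struct toy_filter *f,
			     const char **paths, size_t nr)
{
	assert(f);

	for (size_t i = 0; i < nr; i++)
		if (toy_filter_contains(f, paths[i]))
			return 1;
	return 0;
}
```

A return of 0 is always safe to act on, since a Bloom filter has no false
negatives. A return of 1 may be a false positive, which costs only an
unnecessary diff and never a wrong answer.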
builtin/last-modified.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index e4c73464c7..19bf25f8a5 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -1,5 +1,7 @@
#include "git-compat-util.h"
+#include "bloom.h"
#include "builtin.h"
+#include "commit-graph.h"
#include "commit.h"
#include "config.h"
#include "diff.h"
@@ -17,6 +19,7 @@
struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
+ struct bloom_key key;
const char path[FLEX_ARRAY];
};
@@ -41,6 +44,12 @@ struct last_modified {
static void last_modified_release(struct last_modified *lm)
{
+ struct hashmap_iter iter;
+ struct last_modified_entry *ent;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
+ bloom_key_clear(&ent->key);
+
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
}
@@ -62,6 +71,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
+ if (lm->rev.bloom_filter_settings)
+ bloom_key_fill(&ent->key, path, strlen(path),
+ lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
}
@@ -136,6 +148,7 @@ static void mark_path(const char *path, const struct object_id *oid,
last_modified_emit(data->lm, path, data->commit);
hashmap_remove(&data->lm->paths, &ent->hashent, path);
+ bloom_key_clear(&ent->key);
free(ent);
}
@@ -179,6 +192,27 @@ static void last_modified_diff(struct diff_queue_struct *q,
}
}
+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
+ struct last_modified_entry *ent;
+ struct hashmap_iter iter;
+
+ if (!lm->rev.bloom_filter_settings)
+ return 1;
+
+ filter = get_bloom_filter(lm->rev.repo, origin);
+ if (!filter)
+ return 1;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ if (bloom_filter_contains(filter, &ent->key,
+ lm->rev.bloom_filter_settings))
+ return 1;
+ }
+ return 0;
+}
+
static int last_modified_run(struct last_modified *lm)
{
struct last_modified_callback_data data = { .lm = lm };
@@ -194,6 +228,9 @@ static int last_modified_run(struct last_modified *lm)
if (!data.commit)
break;
+ if (!maybe_changed_path(lm, data.commit))
+ continue;
+
if (data.commit->object.flags & BOUNDARY) {
diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
&data.commit->object.oid, "",
@@ -227,6 +264,9 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
return argc;
}
+ prepare_commit_graph(lm->rev.repo);
+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
if (populate_paths_from_revs(lm) < 0)
return error(_("unable to setup last-modified"));
--
2.50.1.327.g047016eb4a
* Re: [PATCH v5 0/6] Introduce git-last-modified(1) command
2025-07-22 15:35 ` Toon Claes
@ 2025-07-30 17:59 ` Toon Claes
2025-07-31 7:45 ` Patrick Steinhardt
0 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-07-30 17:59 UTC (permalink / raw)
To: Taylor Blau; +Cc: git, Kristoffer Haugsbakk, Derrick Stolee, Junio C Hamano
Toon Claes <toon@iotcl.com> writes:
> I've had this patch included in version 2[1]. I'd love to include it,
> but it didn't give the results we were expecting. Over time I became
> more comfortable with these changes. Let me see if I can get more
> insights about it.
I've spent a considerable amount of time on this, but I didn't reach a
breakthrough. I just submitted v6[1] again without these patches. I'd
still love to figure it out and bring in the improvements, but for the
first iteration I think we're okay without them.
[1]: https://lore.kernel.org/git/20250730175510.987383-1-toon@iotcl.com/
--
Cheers,
Toon
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-07-30 17:55 ` [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-07-31 6:42 ` Patrick Steinhardt
2025-08-01 16:22 ` Toon Claes
2025-08-01 10:18 ` Christian Couder
1 sibling, 1 reply; 135+ messages in thread
From: Patrick Steinhardt @ 2025-07-31 6:42 UTC (permalink / raw)
To: Toon Claes
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King,
Ævar Arnfjörð Bjarmason
On Wed, Jul 30, 2025 at 07:55:07PM +0200, Toon Claes wrote:
> diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
> new file mode 100644
> index 0000000000..89138ebeb7
> --- /dev/null
> +++ b/Documentation/git-last-modified.adoc
> @@ -0,0 +1,49 @@
> +git-last-modified(1)
> +====================
> +
> +NAME
> +----
> +git-last-modified - EXPERIMENTAL: Show when files were last modified
> +
> +
> +SYNOPSIS
> +--------
> +[synopsis]
> +git last-modified [-r] [-t] [<revision-range>] [[--] <path>...]
I think we typically list long options here, not the short single-letter
ones.
> +
> +DESCRIPTION
> +-----------
> +
> +Shows which commit last modified each of the relevant files and subdirectories.
> +
> +THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
> +
> +OPTIONS
> +-------
> +
> +-r::
-r, --recursive::
> + Recurse into subtrees.
> +
> +-t::
-t, --tree-in-recursive::
> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
> new file mode 100644
> index 0000000000..e4c73464c7
> --- /dev/null
> +++ b/builtin/last-modified.c
[snip]
> +static int populate_paths_from_revs(struct last_modified *lm)
> +{
> + int num_interesting = 0;
> + struct diff_options diffopt;
> +
> + memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
> + copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
> + /*
> + * Use a callback to populate the paths from revs
> + */
> + diffopt.output_format = DIFF_FORMAT_CALLBACK;
> + diffopt.format_callback = add_path_from_diff;
> + diffopt.format_callback_data = lm;
I feel like this whole block could use a comment that explains what
we're doing. Why do we copy `diffopt` around? Why is it fine to free
the struct at the end without unsetting `lm->rev.diffopt`? Couldn't that
cause a double free?
> + for (size_t i = 0; i < lm->rev.pending.nr; i++) {
> + struct object_array_entry *obj = lm->rev.pending.objects + i;
> +
> + if (obj->item->flags & UNINTERESTING)
> + continue;
> +
> + if (num_interesting++)
> + return error(_("last-modified can only operate on one tree at a time"));
> +
> + diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
> + &obj->item->oid, "", &diffopt);
> + diff_flush(&diffopt);
> + }
> + diff_free(&diffopt);
> +
> + return 0;
> +}
> +
> +static void last_modified_emit(struct last_modified *lm,
> + const char *path, const struct commit *commit)
> +
> +{
> + if (commit->object.flags & BOUNDARY)
> + putchar('^');
> + printf("%s\t", oid_to_hex(&commit->object.oid));
> +
> + if (lm->rev.diffopt.line_termination)
> + write_name_quoted(path, stdout, '\n');
> + else
> + printf("%s%c", path, '\0');
> +
> + fflush(stdout);
Is there a reason why we have to explicitly flush output? This command
doesn't have any interactivity with the caller.
> +static void last_modified_diff(struct diff_queue_struct *q,
> + struct diff_options *opt UNUSED, void *cbdata)
> +{
> + struct last_modified_callback_data *data = cbdata;
> +
> + for (int i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + switch (p->status) {
> + case DIFF_STATUS_DELETED:
> + /*
> + * There's no point in feeding a deletion, as it could
> + * not have resulted in our current state, which
> + * actually has the file.
> + */
> + break;
> +
> + default:
> + /*
> + * Otherwise, we care only that we somehow arrived at
> + * a final oid state. Note that this covers some
> + * potentially controversial areas, including:
> + *
> + * 1. A rename or copy will be found, as it is the
> + * first time the content has arrived at the given
> + * path.
Makes sense that we don't handle renames (yet). I think I didn't spot
this in the manual, so maybe this is something we should document there.
> + * 2. Even a non-content modification like a mode or
> + * type change will trigger it.
Seems sensible as a default, as well. And likewise, we can add
`--ignore-mode-changes` at a later point if we ever have a use case for
it.
> + * We take the inclusive approach for now, and find
> + * anything which impacts the path. Options to tweak
> + * the behavior (e.g., to "--follow" the content across
> + * renames) can come later.
> + */
> + mark_path(p->two->path, &p->two->oid, data);
> + break;
> + }
> + }
> +}
> +
> +static int last_modified_run(struct last_modified *lm)
> +{
> + struct last_modified_callback_data data = { .lm = lm };
> +
> + lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
> + lm->rev.diffopt.format_callback = last_modified_diff;
> + lm->rev.diffopt.format_callback_data = &data;
> +
> + prepare_revision_walk(&lm->rev);
> +
> + while (hashmap_get_size(&lm->paths)) {
> + data.commit = get_revision(&lm->rev);
> + if (!data.commit)
> + break;
So in this case we have reached the end of our commit range. I assume we
simply print the oldest commit of that range in this case?
> + if (data.commit->object.flags & BOUNDARY) {
> + diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
> + &data.commit->object.oid, "",
> + &lm->rev.diffopt);
> + diff_flush(&lm->rev.diffopt);
> + } else {
> + log_tree_commit(&lm->rev, data.commit);
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int last_modified_init(struct last_modified *lm, struct repository *r,
> + const char *prefix, int argc, const char **argv)
> +{
> + hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
> +
> + repo_init_revisions(r, &lm->rev, prefix);
> + lm->rev.def = "HEAD";
> + lm->rev.combine_merges = 1;
> + lm->rev.show_root_diff = 1;
> + lm->rev.boundary = 1;
> + lm->rev.no_commit_id = 1;
> + lm->rev.diff = 1;
> + lm->rev.diffopt.flags.recursive = lm->recursive || lm->tree_in_recursive;
> + lm->rev.diffopt.flags.tree_in_recursive = lm->tree_in_recursive;
> +
> + if ((argc = setup_revisions(argc, argv, &lm->rev, NULL)) > 1) {
Tiny nit: it's rather unusual in our codebase to assign values in
conditionals. I personally don't mind this usage at all -- I think it
can make error handling way less verbose. But I'm not sure whether we
deem this style acceptable.
argc = setup_revisions(argc, argv, &lm->rev, NULL);
if (argc > 1) {
...
}
I've seen this style several times in this patch. I think we should keep
our typical style for now, but I wouldn't mind if you sent a patch for
our coding style document so that we can discuss this.
> + error(_("unknown last-modified argument: %s"), argv[1]);
> + return argc;
> + }
> +
> + if (populate_paths_from_revs(lm) < 0)
> + return error(_("unable to setup last-modified"));
> +
> + return 0;
> +}
> +
> +int cmd_last_modified(int argc, const char **argv, const char *prefix,
> + struct repository *repo)
> +{
> + int ret;
> + struct last_modified lm;
> +
> + const char * const last_modified_usage[] = {
> + N_("git last-modified [-r] [-t] "
> + "[<revision-range>] [[--] <path>...]"),
> + NULL
> + };
> +
> + struct option last_modified_options[] = {
> + OPT_BOOL('r', "recursive", &lm.recursive,
> + N_("recurse into subtrees")),
> + OPT_BOOL('t', "tree-in-recursive", &lm.tree_in_recursive,
> + N_("recurse into subtrees and include the tree entries too")),
Should this maybe be called something like "--recursive-with-trees"?
"--tree-in-recursive" reads somewhat strange to me.
> + OPT_END()
> + };
> +
> + memset(&lm, 0, sizeof(lm));
You can avoid the `memset()` and directly zero-initialize the struct
when it's declared. Alternatively, you can move this function call into
`last_modified_init()` itself, where it would be more reasonable.
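As a small illustration of the two variants, here is a sketch with a hypothetical, trimmed-down `struct toy_last_modified` (the member names are assumptions, not the real struct layout). Both helpers yield a zeroed struct; the initializer form simply removes the window in which the variable exists but is not yet initialized:

```c
#include <string.h>

/* Hypothetical, trimmed-down stand-in for the real struct. */
struct toy_last_modified {
	int recursive;
	int show_trees;
	const char *prefix;
};

/* Returns 1 if every member holds its zero value. */
static int toy_is_zeroed(const struct toy_last_modified *lm)
{
	return !lm->recursive && !lm->show_trees && !lm->prefix;
}

/* Variant 1: zero-initialize at the point of declaration. */
static struct toy_last_modified toy_make_with_initializer(void)
{
	struct toy_last_modified lm = { 0 };

	return lm;
}

/*
 * Variant 2: declare first, memset() afterwards. This relies on
 * all-bits-zero being a NULL pointer, which holds on common platforms.
 */
static struct toy_last_modified toy_make_with_memset(void)
{
	struct toy_last_modified lm;

	memset(&lm, 0, sizeof(lm));
	return lm;
}
```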
> + argc = parse_options(argc, argv, prefix, last_modified_options,
> + last_modified_usage,
> + PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_UNKNOWN_OPT);
> +
> + repo_config(repo, git_default_config, NULL);
> +
> + if ((ret = last_modified_init(&lm, repo, prefix, argc, argv))) {
> + if (ret > 0)
> + usage_with_options(last_modified_usage,
> + last_modified_options);
> + goto out;
> + }
> +
> + if ((ret = last_modified_run(&lm)))
> + goto out;
Two more cases where we assign `if ((ret = ...))`.
Patrick
* Re: [PATCH v6 3/4] commit-graph: export prepare_commit_graph()
2025-07-30 17:55 ` [PATCH v6 3/4] commit-graph: export prepare_commit_graph() Toon Claes
@ 2025-07-31 6:42 ` Patrick Steinhardt
0 siblings, 0 replies; 135+ messages in thread
From: Patrick Steinhardt @ 2025-07-31 6:42 UTC (permalink / raw)
To: Toon Claes
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder
On Wed, Jul 30, 2025 at 07:55:09PM +0200, Toon Claes wrote:
> Allow users of the commit-graph to explicitly prepare the commit-graph.
> This can be useful when users want to start using bloom keys before
> calling functions like prepare_revision_walk(). We'll use this exported
> function in a subsequent commit.
Hm. Ideally we wouldn't have to expose this low-level function and the
commit-graph subsystem would know to handle this. We typically have
patterns like this in our codebase:
if (repo_find_commit_pos_in_graph(r, c, &graph_pos))
load_bloom_filter_from_graph(r->objects->commit_graph,
filter, graph_pos);
The call to `repo_find_commit_pos_in_graph()` knows to call
`prepare_commit_graph()`, so no manual call to that function would be
required.
I haven't yet read the next commit though that adds the callsite. So
let's read on.
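The "accessor prepares lazily" pattern can be sketched as follows. All types and functions here (`struct toy_repo`, `toy_prepare_graph()`, `toy_graph_num_commits()`) are hypothetical stand-ins and not the real commit-graph API; the point is only that the low-level prepare step stays internal while higher-level accessors call it on demand:

```c
#include <stddef.h>

/* Hypothetical toy types; not the real commit-graph API. */
struct toy_graph {
	int num_commits;
};

struct toy_repo {
	struct toy_graph *graph;	/* NULL until loaded */
	int load_attempted;		/* ensures we try to load only once */
	int has_graph_on_disk;		/* pretend configuration */
	struct toy_graph storage;	/* backing store for the loaded graph */
	int load_count;			/* how often the load really ran */
};

/* Loads the graph on the first call; returns 1 if a graph is available. */
static int toy_prepare_graph(struct toy_repo *r)
{
	if (!r->load_attempted) {
		r->load_attempted = 1;
		if (r->has_graph_on_disk) {
			r->load_count++;
			r->storage.num_commits = 42;
			r->graph = &r->storage;
		}
	}
	return r->graph != NULL;
}

/*
 * Higher-level accessor: callers never need to call toy_prepare_graph()
 * themselves. Returns 1 and fills `out` if a graph is available.
 */
static int toy_graph_num_commits(struct toy_repo *r, int *out)
{
	if (!toy_prepare_graph(r))
		return 0;
	*out = r->graph->num_commits;
	return 1;
}
```

With this shape, a caller that only uses the accessor gets the one-time load for free, and repeated calls do not reload.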
> diff --git a/commit-graph.h b/commit-graph.h
> index 78ab7b875b..0f76681333 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -131,6 +131,14 @@ struct repo_settings;
> struct commit_graph *parse_commit_graph(struct repo_settings *s,
> void *graph_map, size_t graph_size);
>
> +/*
> + * Return 1 if commit_graph is non-NULL, and 0 otherwise.
> + *
> + * On the first invocation, this function attempts to load the commit
> + * graph if the_repository is configured to have one.
> + */
> +int prepare_commit_graph(struct repository *r);
Let's fix the reference to `the_repository` while at it.
Patrick
* Re: [PATCH v6 4/4] last-modified: use Bloom filters when available
2025-07-30 17:55 ` [PATCH v6 4/4] last-modified: use Bloom filters when available Toon Claes
@ 2025-07-31 6:43 ` Patrick Steinhardt
2025-08-01 16:23 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Patrick Steinhardt @ 2025-07-31 6:43 UTC (permalink / raw)
To: Toon Claes
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder
On Wed, Jul 30, 2025 at 07:55:10PM +0200, Toon Claes wrote:
> Our 'git last-modified' performs a revision walk, and computes a diff at
> each point in the walk to figure out whether a given revision changed
> any of the paths it considers interesting.
>
> When changed-path Bloom filters are available, we can avoid computing
> many such diffs. Before computing a diff, we first check if any of the
> remaining paths of interest were possibly changed at a given commit by
> consulting its Bloom filter. If any of them are, we are resigned to
> compute the diff.
>
> If none of those queries returned "maybe", we know that the given commit
> doesn't contain any changed paths which are interesting to us. So, we
> can avoid computing it in this case.
>
> Comparing the perf test results on git.git:
>
> Test HEAD~ HEAD
> ------------------------------------------------------------------------------------
> 8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6%
> 8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4%
> 8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4%
Nice results.
> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
> index e4c73464c7..19bf25f8a5 100644
> --- a/builtin/last-modified.c
> +++ b/builtin/last-modified.c
> @@ -179,6 +192,27 @@ static void last_modified_diff(struct diff_queue_struct *q,
> }
> }
>
> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
> +{
> + struct bloom_filter *filter;
> + struct last_modified_entry *ent;
> + struct hashmap_iter iter;
> +
> + if (!lm->rev.bloom_filter_settings)
> + return 1;
> +
> + filter = get_bloom_filter(lm->rev.repo, origin);
> + if (!filter)
> + return 1;
> +
> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
> + if (bloom_filter_contains(filter, &ent->key,
> + lm->rev.bloom_filter_settings))
> + return 1;
> + }
> + return 0;
> +}
This function is basically the same as `maybe_changed_paths()` in
"blame.c", but that isn't a huge issue from my point of view. What makes
me wonder though is why we have an additional check over there for
whether or not the commit has a valid generation number.
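For readers less familiar with the idea, the "maybe" versus "definitely not" semantics can be sketched with a toy Bloom filter. This is not Git's changed-path filter format: the 64-bit size, the seeded FNV-1a hash, and every `toy_*` name are assumptions chosen for brevity. What it shares with the patch is the shape of the check, where a single "maybe" for any remaining path forces the full diff:

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS 64	/* hypothetical, tiny filter size */

struct toy_bloom {
	uint64_t bits;
};

/* FNV-1a with a caller-chosen seed, used to derive two bit positions. */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/* Records `path` by setting its two bits. */
static void toy_bloom_add(struct toy_bloom *f, const char *path)
{
	f->bits |= UINT64_C(1) << (toy_hash(path, 0) % TOY_BITS);
	f->bits |= UINT64_C(1) << (toy_hash(path, 1) % TOY_BITS);
}

/* Returns 0 for "definitely not changed", 1 for "maybe changed". */
static int toy_bloom_maybe_contains(const struct toy_bloom *f,
				    const char *path)
{
	uint64_t mask = (UINT64_C(1) << (toy_hash(path, 0) % TOY_BITS)) |
			(UINT64_C(1) << (toy_hash(path, 1) % TOY_BITS));

	return (f->bits & mask) == mask;
}

/* Any "maybe" among the remaining paths means the diff must be computed. */
static int toy_maybe_changed(const struct toy_bloom *f,
			     const char **paths, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (toy_bloom_maybe_contains(f, paths[i]))
			return 1;
	return 0;
}
```

A path that was added always yields "maybe" (no false negatives), while a path that was never added may still yield "maybe" by collision, which is why the fallback is to compute the diff rather than to trust the answer.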
> @@ -227,6 +264,9 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
> return argc;
> }
>
> + prepare_commit_graph(lm->rev.repo);
> + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
> +
So this here is why we export `prepare_commit_graph()`. How about we
instead expose `bloom_filters_enabled()` that mirrors what we do in
`generation_numbers_enabled()` and `corrected_commit_dates_enabled()`?
That would both be on a higher level and do exactly what we want to
achieve.
Patrick
* Re: [PATCH v5 0/6] Introduce git-last-modified(1) command
2025-07-30 17:59 ` Toon Claes
@ 2025-07-31 7:45 ` Patrick Steinhardt
0 siblings, 0 replies; 135+ messages in thread
From: Patrick Steinhardt @ 2025-07-31 7:45 UTC (permalink / raw)
To: Toon Claes
Cc: Taylor Blau, git, Kristoffer Haugsbakk, Derrick Stolee,
Junio C Hamano
On Wed, Jul 30, 2025 at 07:59:10PM +0200, Toon Claes wrote:
> Toon Claes <toon@iotcl.com> writes:
>
> > I've had this patch included in version 2[1]. I'd love to include it,
> > but it didn't give the results we were expecting. Over time I became
> > more comfortable with these changes. Let me see if I can get more
> > insights about it.
>
> I've spent a considerable amount of time on this, I didn't get to any
> breakthrough. I just submitted v6[1] again without these patches. I
> still love to figure it out and bring in the improvements, but for the
> first iteration I think we're okay without.
>
> [1]: https://lore.kernel.org/git/20250730175510.987383-1-toon@iotcl.com/
I guess that's probably fine. The patches would go on top anyway, so I
don't see a reason why we shouldn't land the "trivial" implementation
that just works and then iterate from there. It's going to be way
faster than any scripted solution already, so it does provide benefit
even without the additional performance boost.
Patrick
* Re: [PATCH v6 0/4] Introduce git-last-modified(1) command
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
@ 2025-07-31 18:40 ` Junio C Hamano
2025-07-31 23:57 ` Junio C Hamano
2025-08-05 9:33 ` [PATCH v7 0/3] " Toon Claes
` (3 subsequent siblings)
4 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-07-31 18:40 UTC (permalink / raw)
To: Toon Claes
Cc: git, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt
Toon Claes <toon@iotcl.com> writes:
> Changes in v6:
> - Only the first 3 patches are kept. The last 3 patches worked toward adding an
> extra option `--format`. The way it was implemented was heavily debatable and
> in the end it is not required for a first iteration, so they are dropped.
OK.
> - Function prepare_commit_graph() is exported and used in
> generation_numbers_enabled().
OK.
> - Since the library layer was removed and all the code was moved into the
> builtin, there were still some leftovers from using a callback mechanism to
> display the results. This is removed (as far as possible) and instead
> last_modified_emit() is always called; this function was called show_entry() previously.
OK.
> - Code is rebased to use refactoring in the bloom filter API.
Ah, bloom_key_fill() and bloom_key_clear(); sorry to see you become
a victim of an unfortunate churn X-<, but hopefully it is for
greater good in the longer term.
Will queue. Thanks.
* Re: [PATCH v6 0/4] Introduce git-last-modified(1) command
2025-07-31 18:40 ` Junio C Hamano
@ 2025-07-31 23:57 ` Junio C Hamano
0 siblings, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-07-31 23:57 UTC (permalink / raw)
To: Toon Claes
Cc: git, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt
Junio C Hamano <gitster@pobox.com> writes:
> Toon Claes <toon@iotcl.com> writes:
>
>> Changes in v6:
>> - Only the first 3 patches are kept. The last 3 patches worked toward adding an
>> extra option `--format`. The way it was implemented was heavily debatable and
>> in the end it is not required for a first iteration, so they are dropped.
>
> OK.
>
>> - Function prepare_commit_graph() is exported and used in
>> generation_numbers_enabled().
>
> OK.
>
>> - Since the library layer was removed and all the code was moved into the
>> builtin, there were still some leftovers from using a callback mechanism to
>> display the results. This is removed (as far as possible) and instead
>> last_modified_emit() is always called; this function was called show_entry() previously.
>
> OK.
>
>> - Code is rebased to use refactoring in the bloom filter API.
>
> Ah, bloom_key_fill() and bloom_key_clear(); sorry to see you become
> a victim of an unfortunate churn X-<, but hopefully it is for
> greater good in the longer term.
>
> Will queue. Thanks.
CI runs without and with this topic in 'seen'
(without this topic)
https://github.com/git/git/actions/runs/16661801008
(with this topic)
https://github.com/git/git/actions/runs/16662408099
The difference in trees of these two runs matches what is in this
topic and nothing else.
* Re: [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified
2025-07-18 0:02 ` Taylor Blau
2025-07-19 6:44 ` Jeff King
2025-07-22 15:50 ` Toon Claes
@ 2025-08-01 9:09 ` Christian Couder
2025-08-01 16:59 ` Junio C Hamano
2 siblings, 1 reply; 135+ messages in thread
From: Christian Couder @ 2025-08-01 9:09 UTC (permalink / raw)
To: Taylor Blau
Cc: Toon Claes, git, Kristoffer Haugsbakk, Derrick Stolee,
Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason
On Fri, Jul 18, 2025 at 2:02 AM Taylor Blau <me@ttaylorr.com> wrote:
> > +struct last_modified {
> > + struct hashmap paths;
> > + struct rev_info rev;
> > + int recursive, tree_in_recursive;
>
> Can we either make these two part of a bitfield, or at least declare
> them separately?
I wonder if we could/should use the `bool` type from <stdbool.h> as
Documentation/CodingGuidelines says that it's now allowed.
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-07-30 17:55 ` [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-31 6:42 ` Patrick Steinhardt
@ 2025-08-01 10:18 ` Christian Couder
2025-08-01 10:22 ` Patrick Steinhardt
1 sibling, 1 reply; 135+ messages in thread
From: Christian Couder @ 2025-08-01 10:18 UTC (permalink / raw)
To: Toon Claes
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Patrick Steinhardt, Jeff King,
Ævar Arnfjörð Bjarmason
On Wed, Jul 30, 2025 at 7:55 PM Toon Claes <toon@iotcl.com> wrote:
> +[--] <path>...::
> + For each _<path>_ given, the commit which last modified it is returned.
> + Without an optional path parameter, all files and subdirectories
> + in path traversal the are included in the output.
s/the are included/are included/
> +static void last_modified_release(struct last_modified *lm)
I think these days we tend to name those functions using "clear"
instead of "release"
> +{
> + hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
> + release_revisions(&lm->rev);
> +}
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-01 10:18 ` Christian Couder
@ 2025-08-01 10:22 ` Patrick Steinhardt
2025-08-01 17:06 ` Junio C Hamano
0 siblings, 1 reply; 135+ messages in thread
From: Patrick Steinhardt @ 2025-08-01 10:22 UTC (permalink / raw)
To: Christian Couder
Cc: Toon Claes, git, Junio C Hamano, Kristoffer Haugsbakk,
Taylor Blau, Derrick Stolee, Jeff King,
Ævar Arnfjörð Bjarmason
On Fri, Aug 01, 2025 at 12:18:39PM +0200, Christian Couder wrote:
> On Wed, Jul 30, 2025 at 7:55 PM Toon Claes <toon@iotcl.com> wrote:
>
> > +[--] <path>...::
> > + For each _<path>_ given, the commit which last modified it is returned.
> > + Without an optional path parameter, all files and subdirectories
> > + in path traversal the are included in the output.
>
> s/the are included/are included/
>
> > +static void last_modified_release(struct last_modified *lm)
>
> I think these days we tend to name those functions using "clear"
> instead of "release"
It actually depends: if the structure can be immediately reused
afterwards without requiring another reinit it would be called "clear"
indeed. On the other hand, if we only release memory it's "release".
I think this function here falls into the latter category, so it's
correctly named.
Patrick
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-07-31 6:42 ` Patrick Steinhardt
@ 2025-08-01 16:22 ` Toon Claes
2025-08-01 17:09 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 135+ messages in thread
From: Toon Claes @ 2025-08-01 16:22 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King,
Ævar Arnfjörð Bjarmason
Patrick Steinhardt <ps@pks.im> writes:
> On Wed, Jul 30, 2025 at 07:55:07PM +0200, Toon Claes wrote:
>> diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
>> new file mode 100644
>> index 0000000000..89138ebeb7
>> --- /dev/null
>> +++ b/Documentation/git-last-modified.adoc
>> @@ -0,0 +1,49 @@
>> +git-last-modified(1)
>> +====================
>> +
>> +NAME
>> +----
>> +git-last-modified - EXPERIMENTAL: Show when files were last modified
>> +
>> +
>> +SYNOPSIS
>> +--------
>> +[synopsis]
>> +git last-modified [-r] [-t] [<revision-range>] [[--] <path>...]
>
> I think we typically list long options here, not the short single-letter
> ones.
Okay, makes sense.
>> +
>> +DESCRIPTION
>> +-----------
>> +
>> +Shows which commit last modified each of the relevant files and subdirectories.
>> +
>> +THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
>> +
>> +OPTIONS
>> +-------
>> +
>> +-r::
>
> -r, --recursive::
>
>> + Recurse into subtrees.
>> +
>> +-t::
>
> -t, --tree-in-recursive::
Sure!
>> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
>> new file mode 100644
>> index 0000000000..e4c73464c7
>> --- /dev/null
>> +++ b/builtin/last-modified.c
> [snip]
>> +static int populate_paths_from_revs(struct last_modified *lm)
>> +{
>> + int num_interesting = 0;
>> + struct diff_options diffopt;
>> +
>> + memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
>> + copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
>> + /*
>> + * Use a callback to populate the paths from revs
>> + */
>> + diffopt.output_format = DIFF_FORMAT_CALLBACK;
>> + diffopt.format_callback = add_path_from_diff;
>> + diffopt.format_callback_data = lm;
>
> I feel like this whole block could use a comment that explains what
> we're doing. Why do we copy `diffopt` around?
I can extend the comment. We simply don't want to touch the original,
that's why we copy. Do you think it would be better to simply set the
callback before and reset it after?
> Why is it fine to free the struct at the end without unsetting
> `lm->rev.diffopt`? Couldn't that cause a double free?
Oof, that's a good call. In an earlier version it was only calling
clear_pathspec(). But then I got a comment[1] it would be better to call
diff_free(). I must admit I didn't think it through further. Changing
back to clear_pathspec() seems the most sensible to me.
>> + for (size_t i = 0; i < lm->rev.pending.nr; i++) {
>> + struct object_array_entry *obj = lm->rev.pending.objects + i;
>> +
>> + if (obj->item->flags & UNINTERESTING)
>> + continue;
>> +
>> + if (num_interesting++)
>> + return error(_("last-modified can only operate on one tree at a time"));
>> +
>> + diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
>> + &obj->item->oid, "", &diffopt);
>> + diff_flush(&diffopt);
>> + }
>> + diff_free(&diffopt);
>> +
>> + return 0;
>> +}
>> +
>> +static void last_modified_emit(struct last_modified *lm,
>> + const char *path, const struct commit *commit)
>> +
>> +{
>> + if (commit->object.flags & BOUNDARY)
>> + putchar('^');
>> + printf("%s\t", oid_to_hex(&commit->object.oid));
>> +
>> + if (lm->rev.diffopt.line_termination)
>> + write_name_quoted(path, stdout, '\n');
>> + else
>> + printf("%s%c", path, '\0');
>> +
>> + fflush(stdout);
>
> Is there a reason why we have to explicitly flush output? This command
> doesn't have any interactivity with the caller.
Not that I'm aware of, yeah, shouldn't really be needed.
>> +static void last_modified_diff(struct diff_queue_struct *q,
>> + struct diff_options *opt UNUSED, void *cbdata)
>> +{
>> + struct last_modified_callback_data *data = cbdata;
>> +
>> + for (int i = 0; i < q->nr; i++) {
>> + struct diff_filepair *p = q->queue[i];
>> + switch (p->status) {
>> + case DIFF_STATUS_DELETED:
>> + /*
>> + * There's no point in feeding a deletion, as it could
>> + * not have resulted in our current state, which
>> + * actually has the file.
>> + */
>> + break;
>> +
>> + default:
>> + /*
>> + * Otherwise, we care only that we somehow arrived at
>> + * a final oid state. Note that this covers some
>> + * potentially controversial areas, including:
>> + *
>> + * 1. A rename or copy will be found, as it is the
>> + * first time the content has arrived at the given
>> + * path.
>
> Makes sense that we don't handle renames (yet). I think I didn't spot
> this in the manual, so maybe this is something we should document there.
I'll add a line in the docs.
>> + * 2. Even a non-content modification like a mode or
>> + * type change will trigger it.
>
> Seems sensible as a default, as well. And likewise, we can add
> `--ignore-mode-changes` at a later point if we ever have a use case for
> it.
Agreed.
>> + * We take the inclusive approach for now, and find
>> + * anything which impacts the path. Options to tweak
>> + * the behavior (e.g., to "--follow" the content across
>> + * renames) can come later.
>> + */
>> + mark_path(p->two->path, &p->two->oid, data);
>> + break;
>> + }
>> + }
>> +}
>> +
>> +static int last_modified_run(struct last_modified *lm)
>> +{
>> + struct last_modified_callback_data data = { .lm = lm };
>> +
>> + lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
>> + lm->rev.diffopt.format_callback = last_modified_diff;
>> + lm->rev.diffopt.format_callback_data = &data;
>> +
>> + prepare_revision_walk(&lm->rev);
>> +
>> + while (hashmap_get_size(&lm->paths)) {
>> + data.commit = get_revision(&lm->rev);
>> + if (!data.commit)
>> + break;
>
> So in this case we have reached the end of our commit range. I assume we
> simply print the oldest commit of that range in this case?
Looking at this more in detail, I feel we should be calling BUG here.
When we've hit the boundary commit, we should be printing the remaining
paths with that commit, but with a caret `^` prepended. If we hit this
condition it means we went beyond the boundary, but still have paths
remaining. That's a bug.
But, as a matter of fact, I had a test failing (on the commit using
Bloom filters): it didn't print the remaining paths with the boundary
commit and a caret. This happens only when GIT_TEST_COMMIT_GRAPH and
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are set, and it's perfectly
explainable now:
With those set, we hit this exit condition. This happens because
maybe_changed_path() was called in the previous loop iteration and
returned false. Then we hit this exit, and un-printed paths remain. Big
thanks for this hint.
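One reading of this failure can be sketched with a toy walk. Everything here is hypothetical (`struct toy_commit`, `toy_walk()`, and the `filter_boundary` switch are illustration only, and the real loop may differ): if the filter is allowed to skip the boundary commit, the walk runs out while the path is still unattributed, whereas letting boundary commits bypass the filter attributes the path to the boundary:

```c
#include <stddef.h>

/* Hypothetical model of one commit as seen by the walk. */
struct toy_commit {
	int is_boundary;	/* stand-in for the BOUNDARY flag */
	int filter_says_maybe;	/* stand-in for the Bloom filter answer */
	int touches_path;	/* whether a real diff would find the path */
};

/*
 * Walks `commits` (newest first) looking for the commit to attribute one
 * remaining path to. Returns the index of that commit, or -1 if the walk
 * ran out with the path still unattributed.
 *
 * `filter_boundary` selects whether the filter may skip boundary commits.
 */
static int toy_walk(const struct toy_commit *commits, size_t nr,
		    int filter_boundary)
{
	for (size_t i = 0; i < nr; i++) {
		const struct toy_commit *c = &commits[i];

		if ((!c->is_boundary || filter_boundary) &&
		    !c->filter_says_maybe)
			continue; /* skipped without computing a diff */

		/*
		 * A boundary commit is compared against the empty tree, so
		 * every remaining path is attributed to it.
		 */
		if (c->is_boundary || c->touches_path)
			return (int)i;
	}
	return -1;
}
```

In this model, a history of one ordinary commit followed by the boundary commit, with the filter answering "no" for both, returns -1 when the boundary is filtered and the boundary's index when it is not.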
>> + if (data.commit->object.flags & BOUNDARY) {
>> + diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
>> + &data.commit->object.oid, "",
>> + &lm->rev.diffopt);
>> + diff_flush(&lm->rev.diffopt);
>> + } else {
>> + log_tree_commit(&lm->rev, data.commit);
>> + }
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int last_modified_init(struct last_modified *lm, struct repository *r,
>> + const char *prefix, int argc, const char **argv)
>> +{
>> + hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
>> +
>> + repo_init_revisions(r, &lm->rev, prefix);
>> + lm->rev.def = "HEAD";
>> + lm->rev.combine_merges = 1;
>> + lm->rev.show_root_diff = 1;
>> + lm->rev.boundary = 1;
>> + lm->rev.no_commit_id = 1;
>> + lm->rev.diff = 1;
>> + lm->rev.diffopt.flags.recursive = lm->recursive || lm->tree_in_recursive;
>> + lm->rev.diffopt.flags.tree_in_recursive = lm->tree_in_recursive;
>> +
>> + if ((argc = setup_revisions(argc, argv, &lm->rev, NULL)) > 1) {
>
> Tiny nit: it's rather unusual in our codebase to assign values in
> conditionals. I personally don't mind this usage at all -- I think it
> can make error handling way less verbose. But I'm not sure whether we
> deem this style acceptable.
>
> argc = setup_revisions(argc, argv, &lm->rev, NULL);
> if (argc > 1) {
> ...
> }
I'm happy to adopt this change. I wasn't sure what the guideline is; for
some reason I assumed what I had was acceptable. Personally I prefer the
more verbose form a little more.
> I've seen this style several times in this patch. I think we should keep
> our typical style for now, but I wouldn't mind if you sent a patch for
> our coding style document so that we can discuss this.
No, let's follow typical style.
>> + error(_("unknown last-modified argument: %s"), argv[1]);
>> + return argc;
>> + }
>> +
>> + if (populate_paths_from_revs(lm) < 0)
>> + return error(_("unable to setup last-modified"));
>> +
>> + return 0;
>> +}
>> +
>> +int cmd_last_modified(int argc, const char **argv, const char *prefix,
>> + struct repository *repo)
>> +{
>> + int ret;
>> + struct last_modified lm;
>> +
>> + const char * const last_modified_usage[] = {
>> + N_("git last-modified [-r] [-t] "
>> + "[<revision-range>] [[--] <path>...]"),
>> + NULL
>> + };
>> +
>> + struct option last_modified_options[] = {
>> + OPT_BOOL('r', "recursive", &lm.recursive,
>> + N_("recurse into subtrees")),
>> + OPT_BOOL('t', "tree-in-recursive", &lm.tree_in_recursive,
>> + N_("recurse into subtrees and include the tree entries too")),
>
> Should this maybe be called something like "--recursive-with-trees"?
> "--tree-in-recursive" reads somewhat strange to me.
I agree that sounds better. It seems we don't have either option yet,
so we're still free to choose.
>> + OPT_END()
>> + };
>> +
>> + memset(&lm, 0, sizeof(lm));
>
> You can avoid the `memset()` and directly zero-initialize the struct
> when it's declared. Alternatively, you can move this function call into
> `last_modified_init()` itself, where it would be more reasonable.
Because I read parse_options() results into this struct, I cannot do the
memset() in last_modified_init(). So I'm changing to the `{ 0 }`
zero-init.
>> + argc = parse_options(argc, argv, prefix, last_modified_options,
>> + last_modified_usage,
>> + PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_UNKNOWN_OPT);
>> +
>> + repo_config(repo, git_default_config, NULL);
>> +
>> + if ((ret = last_modified_init(&lm, repo, prefix, argc, argv))) {
>> + if (ret > 0)
>> + usage_with_options(last_modified_usage,
>> + last_modified_options);
>> + goto out;
>> + }
>> +
>> + if ((ret = last_modified_run(&lm)))
>> + goto out;
>
> Two more cases where we assign `if ((ret = ...))`.
Yeah yeah, I've heard you ;-P. Just joking, I appreciate you pointing
this out.
>
> Patrick
>
[1]: https://lore.kernel.org/git/aDWWe6qCQXorPESd@pks.im/
--
Cheers,
Toon
* Re: [PATCH v6 4/4] last-modified: use Bloom filters when available
2025-07-31 6:43 ` Patrick Steinhardt
@ 2025-08-01 16:23 ` Toon Claes
2025-08-04 6:33 ` Patrick Steinhardt
0 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-08-01 16:23 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder
Patrick Steinhardt <ps@pks.im> writes:
> On Wed, Jul 30, 2025 at 07:55:10PM +0200, Toon Claes wrote:
>> Our 'git last-modified' performs a revision walk, and computes a diff at
>> each point in the walk to figure out whether a given revision changed
>> any of the paths it considers interesting.
>>
>> When changed-path Bloom filters are available, we can avoid computing
>> many such diffs. Before computing a diff, we first check if any of the
>> remaining paths of interest were possibly changed at a given commit by
>> consulting its Bloom filter. If any of them are, we are resigned to
>> compute the diff.
>>
>> If none of those queries returned "maybe", we know that the given commit
>> doesn't contain any changed paths which are interesting to us. So, we
>> can avoid computing it in this case.
>>
>> Comparing the perf test results on git.git:
>>
>> Test HEAD~ HEAD
>> ------------------------------------------------------------------------------------
>> 8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6%
>> 8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4%
>> 8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4%
>
> Nice results.
>
>> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
>> index e4c73464c7..19bf25f8a5 100644
>> --- a/builtin/last-modified.c
>> +++ b/builtin/last-modified.c
>> @@ -179,6 +192,27 @@ static void last_modified_diff(struct diff_queue_struct *q,
>> }
>> }
>>
>> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
>> +{
>> + struct bloom_filter *filter;
>> + struct last_modified_entry *ent;
>> + struct hashmap_iter iter;
>> +
>> + if (!lm->rev.bloom_filter_settings)
>> + return 1;
>> +
>> + filter = get_bloom_filter(lm->rev.repo, origin);
>> + if (!filter)
>> + return 1;
>> +
>> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
>> + if (bloom_filter_contains(filter, &ent->key,
>> + lm->rev.bloom_filter_settings))
>> + return 1;
>> + }
>> + return 0;
>> +}
>
> This function is basically the same as `maybe_changed_paths()` in
> "blame.c", but that isn't a huge issue from my point of view. What makes
> me wonder though is why we have an additional check over there for
> whether or not the commit has a valid generation number.
I've been asking me the same question. And I couldn't find a good reason
(neither from the commit history, or from my reasoning). This check was
in the version shared by Taylor, but because we were ignoring the return
value from generation_numbers_enabled() in that version, it didn't make
sense to me to do this check. That's why I removed it.
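As a rough illustration of the "maybe" versus "definitely not" answers the
quoted function relies on, here is a minimal, self-contained sketch. It is
not Git's implementation: `toy_filter`, `toy_hash()`, `toy_add()`,
`toy_contains()` and `toy_maybe_changed_path()` are hypothetical names, and
the filter is a single 64-bit word with two made-up probes rather than
whatever bloom.c actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy Bloom filter (hypothetical): 64 bits, two probes per key. */
struct toy_filter {
	uint64_t bits;
};

/* FNV-1a style string hash, perturbed by a seed. */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	assert(s);
	for (; *s; s++) {
		h ^= (unsigned char)*s;
		h *= 16777619u;
	}
	return h;
}

/* Record a changed path by setting its two probe bits. */
static void toy_add(struct toy_filter *f, const char *path)
{
	f->bits |= UINT64_C(1) << (toy_hash(path, 1) % 64);
	f->bits |= UINT64_C(1) << (toy_hash(path, 2) % 64);
}

/* Returns 0 for "definitely not changed", 1 for "maybe changed". */
static int toy_contains(const struct toy_filter *f, const char *path)
{
	uint64_t mask = (UINT64_C(1) << (toy_hash(path, 1) % 64)) |
			(UINT64_C(1) << (toy_hash(path, 2) % 64));

	return (f->bits & mask) == mask;
}

/*
 * Same shape as maybe_changed_path(): without a filter we have to
 * assume a change; otherwise a single "maybe" means the diff is needed.
 */
static int toy_maybe_changed_path(const struct toy_filter *f,
				  const char **paths, size_t nr)
{
	size_t i;

	if (!f)
		return 1;
	for (i = 0; i < nr; i++)
		if (toy_contains(f, paths[i]))
			return 1;
	return 0;
}
```

A filter can report "maybe" for a path that was never added (a false
positive), but never "definitely not" for one that was, which is why
skipping the diff on a 0 return is safe.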
>> @@ -227,6 +264,9 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
>> return argc;
>> }
>>
>> + prepare_commit_graph(lm->rev.repo);
>> + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
>> +
>
> So this here is why we export `prepare_commit_graph()`. How about we
> instead expose `bloom_filters_enabled()` that mirrors what we do in
> `generation_numbers_enabled()` and `corrected_commit_dates_enabled()`?
> That would both be on a higher level and do exactly what we want to
> achieve.
I've got another proposal, what if we let get_bloom_filter_settings()
call prepare_commit_graph()? Functions like
repo_find_commit_pos_in_graph() and lookup_commit_in_graph() do this
too.
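The shape of that proposal can be sketched with stand-ins. Everything below
is hypothetical (`toy_repo`, `toy_settings`, `toy_prepare_graph()` and
`toy_get_bloom_settings()` are made up for illustration and are not Git's
structures or functions): the accessor runs the preparation step itself, so
callers no longer need that step to be exported.

```c
#include <assert.h>
#include <stddef.h>

struct toy_settings {
	int bits_per_entry;
};

struct toy_repo {
	int graph_prepared;   /* set once preparation has run */
	int prepare_calls;    /* counts real preparations, for the demo only */
	int has_filters;      /* whether the "graph" carries Bloom data */
	struct toy_settings settings;
};

/* Idempotent: only the first call does any work. */
static void toy_prepare_graph(struct toy_repo *r)
{
	assert(r);
	if (r->graph_prepared)
		return;
	r->graph_prepared = 1;
	r->prepare_calls++;
	if (r->has_filters)
		r->settings.bits_per_entry = 10; /* arbitrary demo value */
}

/* Accessor that prepares on demand; NULL means "no Bloom filters". */
static const struct toy_settings *toy_get_bloom_settings(struct toy_repo *r)
{
	toy_prepare_graph(r);
	return r->has_filters ? &r->settings : NULL;
}
```

Because the preparation is idempotent, calling the accessor repeatedly
costs one flag check after the first call.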
--
Cheers,
Toon
* Re: [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified
2025-08-01 9:09 ` Christian Couder
@ 2025-08-01 16:59 ` Junio C Hamano
0 siblings, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-08-01 16:59 UTC (permalink / raw)
To: Christian Couder
Cc: Taylor Blau, Toon Claes, git, Kristoffer Haugsbakk,
Derrick Stolee, Jeff King, Ævar Arnfjörð Bjarmason
Christian Couder <christian.couder@gmail.com> writes:
> On Fri, Jul 18, 2025 at 2:02 AM Taylor Blau <me@ttaylorr.com> wrote:
>
>> > +struct last_modified {
>> > + struct hashmap paths;
>> > + struct rev_info rev;
>> > + int recursive, tree_in_recursive;
>>
>> Can we either make these two part of a bitfield, or at least declare
>> them separately?
>
> I wonder if we could/should use the `bool` type from <stdbool.h> as
> Documentation/CodingGuidelines says that it's now allowed.
Even though "allowed" is different from "encouraged", I would say
it is a good idea to declare them separately, i.e.
bool recursive;
bool show_trees_in_recursive;
I am guessing 'tree-in-recursive' is similar to the 'git ls-tree -t'
feature, but the name given in the patch requires such guessing, as
the name is a bit inadequate (it does not say what you want to do to
trees when recursive).
Renaming to show_trees_in_recursive eliminates the need for such
guessing. The implementation of ls-tree calls the corresponding bit
LS_SHOW_TREES, which is a bit inadequate.
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-01 10:22 ` Patrick Steinhardt
@ 2025-08-01 17:06 ` Junio C Hamano
2025-08-02 8:18 ` Christian Couder
2025-08-04 6:35 ` Patrick Steinhardt
0 siblings, 2 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-08-01 17:06 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Christian Couder, Toon Claes, git, Kristoffer Haugsbakk,
Taylor Blau, Derrick Stolee, Jeff King,
Ævar Arnfjörð Bjarmason
Patrick Steinhardt <ps@pks.im> writes:
>> > +static void last_modified_release(struct last_modified *lm)
>>
>> I think these days we tend to name those functions using "clear"
>> instead of "release"
>
> It actually depends: if the structure can be immediately reused
> afterwards without requiring another reinit it would be called "clear"
> indeed. On the other hand, if we only release memory it's "release".
>
> I think this function here falls into the latter category, so it's
> correctly named.
Given that even a long-time contributor gets confused (including me,
who needed to see where we documented this for our developers),
perhaps a clarification patch is in order?
--- >8 ---
Subject: CodingGuidelines: clarify that S_release() does not reinitialize
In the section for naming various API functions, the fact that
S_release() only releases the resources without preparing the
structure for immediate reuse becomes apparent only when you
read the entries for both S_release() and S_clear().
Clarify the description of S_release() a bit to make the entry self
sufficient.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
Documentation/CodingGuidelines | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git c/Documentation/CodingGuidelines w/Documentation/CodingGuidelines
index c1046abfb7..76ec6268f2 100644
--- c/Documentation/CodingGuidelines
+++ w/Documentation/CodingGuidelines
@@ -610,8 +610,9 @@ For C programs:
- `S_init()` initializes a structure without allocating the
structure itself.
- - `S_release()` releases a structure's contents without freeing the
- structure.
+ - `S_release()` releases a structure's contents without reinitializing
+ the structure for immediate reuse, and without freeing the structure
+ itself.
- `S_clear()` is equivalent to `S_release()` followed by `S_init()`
such that the structure is directly usable after clearing it. When
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-01 16:22 ` Toon Claes
@ 2025-08-01 17:09 ` Junio C Hamano
2025-08-04 6:34 ` Patrick Steinhardt
2025-08-01 20:34 ` Jean-Noël AVILA
2025-08-04 6:33 ` Patrick Steinhardt
2 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-08-01 17:09 UTC (permalink / raw)
To: Toon Claes
Cc: Patrick Steinhardt, git, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King,
Ævar Arnfjörð Bjarmason
Toon Claes <toon@iotcl.com> writes:
>>> +-t::
>>
>> -t, --tree-in-recursive::
>
> Sure!
Clarify *what* you do to trees in recursive by giving a verb, e.g.
--show-trees-in-recursive
perhaps?
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-01 16:22 ` Toon Claes
2025-08-01 17:09 ` Junio C Hamano
@ 2025-08-01 20:34 ` Jean-Noël AVILA
2025-08-05 5:36 ` Toon Claes
2025-08-04 6:33 ` Patrick Steinhardt
2 siblings, 1 reply; 135+ messages in thread
From: Jean-Noël AVILA @ 2025-08-01 20:34 UTC (permalink / raw)
To: Patrick Steinhardt, Toon Claes
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King,
Ævar Arnfjörð Bjarmason
Hello,
On Friday, 1 August 2025 18:22:50 CEST Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > On Wed, Jul 30, 2025 at 07:55:07PM +0200, Toon Claes wrote:
> >> diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
> >> new file mode 100644
> >> index 0000000000..89138ebeb7
> >> --- /dev/null
> >> +++ b/Documentation/git-last-modified.adoc
> >> @@ -0,0 +1,49 @@
> >> +git-last-modified(1)
> >> +====================
> >> +
> >> +NAME
> >> +----
> >> +git-last-modified - EXPERIMENTAL: Show when files were last modified
> >> +
> >> +
> >> +SYNOPSIS
> >> +--------
> >> +[synopsis]
> >> +git last-modified [-r] [-t] [<revision-range>] [[--] <path>...]
> >
> > I think we typically list long options here, not the short single-letter
> > ones.
>
> Okay, makes sense.
>
> >> +
> >> +DESCRIPTION
> >> +-----------
> >> +
> >> +Shows which commit last modified each of the relevant files and subdirectories.
> >> +
> >> +THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
> >> +
> >> +OPTIONS
> >> +-------
> >> +
> >
> >> +-r::
> > -r, --recursive::
Since this is a newly introduced man page, please switch to the full
synopsis style and list only one form per line:
`-r`::
`--recurse`::
> >> + Recurse into subtrees.
> >> +
> >
> >> +-t::
> > -t, --tree-in-recursive::
> Sure!
>
Same here.
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-01 17:06 ` Junio C Hamano
@ 2025-08-02 8:18 ` Christian Couder
2025-08-02 11:31 ` Christian Couder
2025-08-04 6:35 ` Patrick Steinhardt
1 sibling, 1 reply; 135+ messages in thread
From: Christian Couder @ 2025-08-02 8:18 UTC (permalink / raw)
To: Junio C Hamano
Cc: Patrick Steinhardt, Toon Claes, git, Kristoffer Haugsbakk,
Taylor Blau, Derrick Stolee, Jeff King,
Ævar Arnfjörð Bjarmason
On Fri, Aug 1, 2025 at 7:06 PM Junio C Hamano <gitster@pobox.com> wrote:
> Given that even a long-time contributor gets confused (including me,
> who needed to see where we documented this for our developers),
> perhaps a clarification patch is in order?
>
> --- >8 ---
> Subject: CodingGuidelines: clarify that S_release() does not reinitialize
>
> In the section for naming various API functions, the fact that
> S_release() only releases the resources without preparing the
> structure for immediate reuse becomes apparent only when you
> read the entries for both S_release() and S_clear().
>
> Clarify the description of S_release() a bit to make the entry self
> sufficient.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
> Documentation/CodingGuidelines | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git c/Documentation/CodingGuidelines w/Documentation/CodingGuidelines
> index c1046abfb7..76ec6268f2 100644
> --- c/Documentation/CodingGuidelines
> +++ w/Documentation/CodingGuidelines
> @@ -610,8 +610,9 @@ For C programs:
> - `S_init()` initializes a structure without allocating the
> structure itself.
>
> - - `S_release()` releases a structure's contents without freeing the
> - structure.
> + - `S_release()` releases a structure's contents without reinitializing
> + the structure for immediate reuse, and without freeing the structure
> + itself.
>
> - `S_clear()` is equivalent to `S_release()` followed by `S_init()`
> such that the structure is directly usable after clearing it. When
Yeah, I think that could help. Thanks!
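For readers less familiar with the three verbs in the quoted hunk, a
minimal sketch of the intended semantics follows. `toy_buf` and its
helpers are hypothetical and exist only for illustration; they are not an
existing Git API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_buf {
	char *data;
	size_t len;
};

/* S_init(): set up the fields; the struct itself belongs to the caller. */
static void toy_buf_init(struct toy_buf *b)
{
	b->data = NULL;
	b->len = 0;
}

/* Replace the contents with a copy of s. Requires an initialized struct. */
static void toy_buf_set(struct toy_buf *b, const char *s)
{
	size_t len = strlen(s);
	char *copy = malloc(len + 1);

	assert(copy);
	memcpy(copy, s, len + 1);
	free(b->data);
	b->data = copy;
	b->len = len;
}

/*
 * S_release(): free what the struct owns and nothing more. The fields
 * are left stale, so the struct must be re-initialized before reuse.
 */
static void toy_buf_release(struct toy_buf *b)
{
	free(b->data);
}

/* S_clear(): S_release() followed by S_init(); usable right away. */
static void toy_buf_clear(struct toy_buf *b)
{
	toy_buf_release(b);
	toy_buf_init(b);
}
```

In this sketch, calling toy_buf_set() after toy_buf_release() without an
intervening toy_buf_init() would free the stale pointer a second time,
which is the kind of misuse the naming distinction is meant to flag.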
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-02 8:18 ` Christian Couder
@ 2025-08-02 11:31 ` Christian Couder
2025-08-02 13:38 ` Christian Couder
0 siblings, 1 reply; 135+ messages in thread
From: Christian Couder @ 2025-08-02 11:31 UTC (permalink / raw)
To: Junio C Hamano
Cc: Patrick Steinhardt, Toon Claes, git, Kristoffer Haugsbakk,
Taylor Blau, Derrick Stolee, Jeff King,
Ævar Arnfjörð Bjarmason
On Sat, Aug 2, 2025 at 10:18 AM Christian Couder
<christian.couder@gmail.com> wrote:
>
> On Fri, Aug 1, 2025 at 7:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> > Given that even a long-time contributor gets confused (including me,
> > who needed to see where we documented this for our developers),
> > perhaps a clarification patch is in order?
> >
> > --- >8 ---
> > Subject: CodingGuidelines: clarify that S_release() does not reinitialize
> >
> > In the section for naming various API functions, the fact that
> > S_release() only releases the resources without preparing the
> > structure for immediate reuse becomes apparent only when you
> > read the entries for both S_release() and S_clear().
> >
> > Clarify the description of S_release() a bit to make the entry self
> > sufficient.
> >
> > Signed-off-by: Junio C Hamano <gitster@pobox.com>
> > ---
> > Documentation/CodingGuidelines | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git c/Documentation/CodingGuidelines w/Documentation/CodingGuidelines
> > index c1046abfb7..76ec6268f2 100644
> > --- c/Documentation/CodingGuidelines
> > +++ w/Documentation/CodingGuidelines
> > @@ -610,8 +610,9 @@ For C programs:
> > - `S_init()` initializes a structure without allocating the
> > structure itself.
> >
> > - - `S_release()` releases a structure's contents without freeing the
> > - structure.
> > + - `S_release()` releases a structure's contents without reinitializing
> > + the structure for immediate reuse, and without freeing the structure
> > + itself.
> >
> > - `S_clear()` is equivalent to `S_release()` followed by `S_init()`
> > such that the structure is directly usable after clearing it. When
>
> Yeah, I think that could help. Thanks!
For information, I have asked ChatGPT o3-pro (which can analyse a
whole repo) about this and here is what it says:
---------------------------
Q: In the source code of the Git SCM software are there standard ways
of naming functions that release memory from a struct or clear a
struct? Are they consistently named using "clear" or "release" or
"free" or "clear_and_free" or other things like that depending on what
they actually do?
A:
What the names mean in Git’s C code – quick rules of thumb
- *_clear():
effect: Free only the heap‑memory owned by the struct (arrays,
strings, hash‑table buckets, etc.) and reset its fields to an
initial/empty state.
reuse possible: Yes – the variable itself remains valid and can be
filled again.
- *_release():
effect: Historical synonym for the same thing as *_clear(), kept for
backward compatibility in a few low‑level APIs.
reuse possible: Yes – same as above.
- free_*() or *_free():
effect: Free the struct itself (it was malloc‑ed) and everything it
points at. After the call the pointer becomes dangling and must not be
reused.
reuse possible: No – the storage is gone.
- *_clear_and_free():
effect: Convenience wrapper that first calls the type’s *_clear() on
each contained element and then frees those elements themselves.
Mainly used for containers implemented as macros.
reuse possible: Not applicable – it is meant to be the last
operation on the container.
How consistent is the convention?
- Within each mini‑API the convention is very strict
The strbuf API always pairs strbuf_init() / strbuf_release(), the
strvec API always pairs strvec_init() / strvec_clear(), etc.
- Across the whole Git code‑base the spelling is mostly consistent but
not 100 % uniform
Very early core helpers (pre‑1.6): usually *_release()
Code written in the last decade: *_clear() for “empty in‑place”,
free_*() for “destroy”
- No “mixed‑semantics” functions
You will not find a helper that sometimes frees the struct and
sometimes only clears it depending on flags. The name alone tells you
which category it belongs to.
[...]
Bottom line
Git uses two distinct verbs to convey intent:
- clear / release – light‑weight reset, keep the container alive
- free – heavy‑weight destruction of the container itself
The rule is consistent enough that you can rely on it when reading or
writing Git’s C code; the rare exceptions are historical, not
ambiguous.
Q: Some say that if the structure can be immediately reused afterwards
without requiring another reinit the function should be named using
"clear", otherwise if it only releases memory then it should be called
"release". Is it true?
A:
Short answer – no.
In the Git source tree the verbs clear and release are not used to
distinguish “can I reuse the struct afterwards?” vs. “is this the
final teardown?”. Both variants free all heap memory owned by the
structure and then reset the in‑memory object to its pristine, empty
state, so the variable can be used again without an extra call to the
corresponding …_init() helper.
[...]
Why two different verbs exist
- Historical baggage – the very early helper that everybody still
uses, strbuf, was written long before today’s “*_clear()” convention
emerged, so its lifecycle pair is strbuf_init() / strbuf_release().
- Later subsystems standardised on *_clear() when they were introduced
(string‑list, strvec, oidset, hashmap, etc.).
- Because Git’s internal APIs are used by thousands of call‑sites,
there has been no large‑scale renaming; both spellings therefore
coexist.
[...]
Conclusion
The folk rule you quoted (“* use clear if the struct can be reused,
release if not *”) does not match what the Git code actually does:
strbuf_release() allows immediate reuse – just like every *_clear() helper does.
The difference in naming is historical, not semantic.
If you want to know whether the variable itself has been freed, look
for free_*()/*_free(), not for the presence or absence of the word
release.
Q: Do you have other examples than strbuf_release using "release"
where the function behaves like a "clear" function?
A:
Yes – a handful of other helpers are spelled “*_release()”, yet they
merely clear the inside of an on‑stack/object‑lifetime struct and
leave the container itself reusable
- Line reader: void line_buffer_release(struct line_buffer *buf);
Closes the underlying file descriptor and frees its internal scratch
space; the struct line_buffer itself remains valid and can be
re‑initialised with line_buffer_open() if desired.
- Packet‑line writer: void packet_writer_release(struct packet_writer *writer);
Calls strbuf_release(&writer‑>scratch_buf) and clears a few flags;
the struct packet_writer lives on so the caller may point it at
another FILE * later.
- FS‑monitor IPC channel: void fsmonitor_ipc__release(struct
fsmonitor_ipc *ipc);
Shuts down the socket, frees small temp buffers, and zeroes the
struct for re‑use inside the long‑running daemon.
All of these match the behaviour of strbuf_release() rather than that
of the many *_free() helpers which actually free() the struct itself.
How many are there?
A quick git grep '_release(struct' over the Git 2.50 tree finds fewer
than a dozen hits, and every one of them follows this
“clear‑but‑don’t‑free” pattern. In other words:
- *_release() in Git never deallocates the container itself.
- The form survives today only in subsystems that pre‑date the later
project‑wide switch to *_clear() (or that wrap strbuf_release()
internally).
- Newer code consistently uses *_clear() for the same semantics.
Bottom line
strbuf_release() is not unique, but it is the template that the other
rare *_release() helpers copy: they wipe the contents and reset the
struct so you can keep using the variable. If you are looking at Git C
code and see “_release” in a function name, you can safely assume
“clear in place” – not “free the object.”
---------------------------
In short it looks like it doesn't think there should be a distinction
between *_release() and *_clear(), because in many cases the existing
*_release() functions (starting with strbuf_release()) leave the
struct ready for immediate reuse.
For the documentation update, it suggests:
- `S_release()` releases a structure's contents without freeing the
structure. In older subsystems (e.g. *strbuf*), the helper also
resets the object to its initial, empty state; newer code should
prefer `S_clear()` for that purpose.
But if our intention is to tend towards a clear distinction between
"clear" and "release" even if in practice there is not a clear
distinction right now (because of historical reasons), I think we
could compromise with something like:
- `S_release()` releases a structure's contents without reinitializing
the structure for immediate reuse, and without freeing the structure
itself. In older subsystems (e.g. *strbuf*), the helper also
resets the object to its initial, empty state; newer code should
prefer `S_clear()` for that purpose.
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-02 11:31 ` Christian Couder
@ 2025-08-02 13:38 ` Christian Couder
2025-08-02 16:26 ` Junio C Hamano
0 siblings, 1 reply; 135+ messages in thread
From: Christian Couder @ 2025-08-02 13:38 UTC (permalink / raw)
To: Junio C Hamano
Cc: Patrick Steinhardt, Toon Claes, git, Kristoffer Haugsbakk,
Taylor Blau, Derrick Stolee, Jeff King,
Ævar Arnfjörð Bjarmason
On Sat, Aug 2, 2025 at 1:31 PM Christian Couder
<christian.couder@gmail.com> wrote:
> Q: Do you have other examples than strbuf_release using "release"
> where the function behaves like a "clear" function?
>
> A:
>
> Yes – a handful of other helpers are spelled “*_release()”, yet they
> merely clear the inside of an on‑stack/object‑lifetime struct and
> leave the container itself reusable
>
> - Line reader: void line_buffer_release(struct line_buffer *buf);
> Closes the underlying file descriptor and frees its internal scratch
> space; the struct line_buffer itself remains valid and can be
> re‑initialised with line_buffer_open() if desired.
>
> - Packet‑line writer: void packet_writer_release(struct packet_writer *writer);
> Calls strbuf_release(&writer‑>scratch_buf) and clears a few flags;
> the struct packet_writer lives on so the caller may point it at
> another FILE * later.
>
> - FS‑monitor IPC channel: void fsmonitor_ipc__release(struct
> fsmonitor_ipc *ipc);
> Shuts down the socket, frees small temp buffers, and zeroes the
> struct for re‑use inside the long‑running daemon.
>
> All of these match the behaviour of strbuf_release() rather than that
> of the many *_free() helpers which actually free() the struct itself.
Actually it looks like it hallucinated those examples. It's true that
strbuf_release() makes it possible to reuse the struct, but it's not
efficient as memory needs to be reallocated.
Sorry for the noise.
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-02 13:38 ` Christian Couder
@ 2025-08-02 16:26 ` Junio C Hamano
0 siblings, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-08-02 16:26 UTC (permalink / raw)
To: Christian Couder
Cc: Patrick Steinhardt, Toon Claes, git, Kristoffer Haugsbakk,
Taylor Blau, Derrick Stolee, Jeff King,
Ævar Arnfjörð Bjarmason
Christian Couder <christian.couder@gmail.com> writes:
> Sorry for the noise.
When 6d0618a8 (Add Documentation/CodingGuidelines, 2007-11-08)
started a written guideline, the project already had two years'
worth of accumulated code. It was more like "we have been operating
without any written guideline, and so far it has been OK because
most of our contributors and reviewers were competent and
interaction among them amicable. But now we are having more new
faces. It is a good time to codify the rules that we have been
trying to adhere to. It is possible we may have missed some
violations during our reviews and have already taken bad apples in
the code base, but they are tolerated-but-undesirable exceptions.
These are the rules we have been trying to follow." It is expected
that there are some corner cases that violate the writings without
meaning to.
Anybody reading the document should take it as an aspirational
guide, where existing violations (1) are not excuses to introduce
more deviations, (2) are "once written, it is often not worth the
code churn to go and fix them only for the sake of fixing them", and
(3) are very welcome to be rewritten if you are rewriting the code
that covers (not merely overlaps) the area.
And those of us writing or updating the document should try to make
sure that the aspirational nature is clear to readers.
So your intention to improve the wording of one single item was
surely appreciated, but I think the effort is better spent to make
sure that readers are aware that not just that single item, but
everything in the guideline, may have existing violations in the
code base, and they understand how they should treat these existing
violations, perhaps by polishing the preamble to the whole guideline
document somehow.
Thanks.
* Re: [PATCH v6 4/4] last-modified: use Bloom filters when available
2025-08-01 16:23 ` Toon Claes
@ 2025-08-04 6:33 ` Patrick Steinhardt
0 siblings, 0 replies; 135+ messages in thread
From: Patrick Steinhardt @ 2025-08-04 6:33 UTC (permalink / raw)
To: Toon Claes
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder
On Fri, Aug 01, 2025 at 06:23:08PM +0200, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > On Wed, Jul 30, 2025 at 07:55:10PM +0200, Toon Claes wrote:
> >> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
> >> index e4c73464c7..19bf25f8a5 100644
> >> --- a/builtin/last-modified.c
> >> +++ b/builtin/last-modified.c
> >> @@ -179,6 +192,27 @@ static void last_modified_diff(struct diff_queue_struct *q,
> >> }
> >> }
> >>
> >> +static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
> >> +{
> >> + struct bloom_filter *filter;
> >> + struct last_modified_entry *ent;
> >> + struct hashmap_iter iter;
> >> +
> >> + if (!lm->rev.bloom_filter_settings)
> >> + return 1;
> >> +
> >> + filter = get_bloom_filter(lm->rev.repo, origin);
> >> + if (!filter)
> >> + return 1;
> >> +
> >> + hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
> >> + if (bloom_filter_contains(filter, &ent->key,
> >> + lm->rev.bloom_filter_settings))
> >> + return 1;
> >> + }
> >> + return 0;
> >> +}
> >
> > This function is basically the same as `maybe_changed_paths()` in
> > "blame.c", but that isn't a huge issue from my point of view. What makes
> > me wonder though is why we have an additional check over there for
> > whether or not the commit has a valid generation number.
>
> I've been asking myself the same question, and I couldn't find a good
> reason (neither in the commit history nor from my own reasoning). This check was
> in the version shared by Taylor, but because we were ignoring the return
> value from generation_numbers_enabled() in that version, it didn't make
> sense to me to do this check. That's why I removed it.
Okay. It might make sense to point this out in the commit message.
> >> @@ -227,6 +264,9 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
> >> return argc;
> >> }
> >>
> >> + prepare_commit_graph(lm->rev.repo);
> >> + lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
> >> +
> >
> > So this here is why we export `prepare_commit_graph()`. How about we
> > instead expose `bloom_filters_enabled()` that mirrors what we do in
> > `generation_numbers_enabled()` and `corrected_commit_dates_enabled()`?
> > That would both be on a higher level and do exactly what we want to
> > achieve.
>
> I've got another proposal, what if we let get_bloom_filter_settings()
> call prepare_commit_graph()? Functions like
> repo_find_commit_pos_in_graph() and lookup_commit_in_graph() do this
> too.
Yeah, I don't see any issue with that, either.
Patrick
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-01 16:22 ` Toon Claes
2025-08-01 17:09 ` Junio C Hamano
2025-08-01 20:34 ` Jean-Noël AVILA
@ 2025-08-04 6:33 ` Patrick Steinhardt
2 siblings, 0 replies; 135+ messages in thread
From: Patrick Steinhardt @ 2025-08-04 6:33 UTC (permalink / raw)
To: Toon Claes
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King,
Ævar Arnfjörð Bjarmason
On Fri, Aug 01, 2025 at 06:22:50PM +0200, Toon Claes wrote:
> Patrick Steinhardt <ps@pks.im> writes:
> > On Wed, Jul 30, 2025 at 07:55:07PM +0200, Toon Claes wrote:
> >> diff --git a/builtin/last-modified.c b/builtin/last-modified.c
> >> new file mode 100644
> >> index 0000000000..e4c73464c7
> >> --- /dev/null
> >> +++ b/builtin/last-modified.c
> >> +static int last_modified_run(struct last_modified *lm)
> >> +{
> >> + struct last_modified_callback_data data = { .lm = lm };
> >> +
> >> + lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
> >> + lm->rev.diffopt.format_callback = last_modified_diff;
> >> + lm->rev.diffopt.format_callback_data = &data;
> >> +
> >> + prepare_revision_walk(&lm->rev);
> >> +
> >> + while (hashmap_get_size(&lm->paths)) {
> >> + data.commit = get_revision(&lm->rev);
> >> + if (!data.commit)
> >> + break;
> >
> > So in this case we have reached the end of our commit range. I assume we
> > simply print the oldest commit of that range in this case?
>
> Looking at this more in detail, I feel we should be calling BUG here.
> When we've hit the boundary commit, we should be printing the remaining
> paths with that commit, but with a caret `^` prepended. If we hit this
> condition it means we went beyond the boundary, but still have paths
> remaining. That's a bug.
>
> But... As a matter of fact. I had a test failing (on the commit using
> bloom filters). It didn't print remaining paths with the boundary commit
> with a caret. This happens only when having GIT_TEST_COMMIT_GRAPH and
> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS set. And it's perfectly explainable
> now:
>
> With those set, we hit this exit condition. This happens because
> maybe_changed_path() was called in previous loop, returning false. Then
> we hit this exit, and un-printed paths remain. Big thanks for this hint.
Nice :)
Patrick
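The boundary handling described in the quoted reply can be sketched as a
small simulation. This is only an illustration under stated assumptions:
`toy_commit` and `toy_last_modified()` are hypothetical, the "history" is a
plain array ordered newest first, and the output format (id, tab, path) is
made up. The real command walks revisions and would call BUG() where this
sketch returns a non-zero count.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_PATHS 8

struct toy_commit {
	const char *id;
	int boundary;                        /* last commit of the range */
	const char *changed[TOY_MAX_PATHS];  /* NULL-terminated list */
};

/*
 * Walks commits newest first. Each remaining path is reported at the
 * first commit that changed it; whatever is still left at the boundary
 * commit is reported there with a leading '^'. Returns the number of
 * paths that were never reported, which should be 0.
 */
static size_t toy_last_modified(const struct toy_commit *commits,
				size_t nr_commits,
				const char **paths, size_t nr_paths,
				char *out, size_t outsz)
{
	int done[TOY_MAX_PATHS] = { 0 };
	size_t remaining = nr_paths, used = 0, c, p, k;

	assert(nr_paths <= TOY_MAX_PATHS && outsz > 0);
	out[0] = '\0';
	for (c = 0; c < nr_commits && remaining; c++) {
		for (p = 0; p < nr_paths; p++) {
			/* At the boundary every remaining path is reported. */
			int hit = commits[c].boundary;

			if (done[p])
				continue;
			for (k = 0; !hit && k < TOY_MAX_PATHS &&
				    commits[c].changed[k]; k++)
				hit = !strcmp(commits[c].changed[k], paths[p]);
			if (!hit)
				continue;
			used += snprintf(out + used, outsz - used, "%s%s\t%s\n",
					 commits[c].boundary ? "^" : "",
					 commits[c].id, paths[p]);
			assert(used < outsz); /* caller's buffer was too small */
			done[p] = 1;
			remaining--;
		}
	}
	return remaining;
}
```

If the walk ends before a boundary commit has been seen, the return value
is non-zero, which corresponds to the "un-printed paths remain" situation
from the failing test.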
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-01 17:09 ` Junio C Hamano
@ 2025-08-04 6:34 ` Patrick Steinhardt
2025-08-04 17:14 ` Junio C Hamano
0 siblings, 1 reply; 135+ messages in thread
From: Patrick Steinhardt @ 2025-08-04 6:34 UTC (permalink / raw)
To: Junio C Hamano
Cc: Toon Claes, git, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King,
Ævar Arnfjörð Bjarmason
On Fri, Aug 01, 2025 at 10:09:41AM -0700, Junio C Hamano wrote:
> Toon Claes <toon@iotcl.com> writes:
>
> >>> +-t::
> >>
> >> -t, --tree-in-recursive::
> >
> > Sure!
>
> Clarify *what* you do to trees in recursive by giving a verb, e.g.
>
> --show-trees-in-recursive
Ah, that's even better indeed! One question that this raises is whether
this option then should continue to imply `--recursive`. I think it
rather shouldn't with this new wording, but don't feel overly strong
about it.
Patrick
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-01 17:06 ` Junio C Hamano
2025-08-02 8:18 ` Christian Couder
@ 2025-08-04 6:35 ` Patrick Steinhardt
1 sibling, 0 replies; 135+ messages in thread
From: Patrick Steinhardt @ 2025-08-04 6:35 UTC (permalink / raw)
To: Junio C Hamano
Cc: Christian Couder, Toon Claes, git, Kristoffer Haugsbakk,
Taylor Blau, Derrick Stolee, Jeff King,
Ævar Arnfjörð Bjarmason
On Fri, Aug 01, 2025 at 10:06:55AM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
>
> >> > +static void last_modified_release(struct last_modified *lm)
> >>
> >> I think these days we tend to name those functions using "clear"
> >> instead of "release"
> >
> > It actually depends: if the structure can be immediately reused
> > afterwards without requiring another reinit it would be called "clear"
> > indeed. On the other hand, if we only release memory it's "release".
> >
> > I think this function here falls into the latter category, so it's
> > correctly named.
>
> Given that even a long-time contributor gets confused (including me,
> who needed to see where we documented this for our developers),
> perhaps a clarification patch is in order?
>
> --- >8 ---
> Subject: CodingGuidelines: clarify that S_release() does not reinitialize
>
> In the section for naming various API functions, the fact that
> S_release() only releases the resources without preparing the
> structure for immediate reuse becomes apparent only when you
> read the entries for both S_release() and S_clear().
>
> Clarify the description of S_release() a bit to make the entry self
> sufficient.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
> Documentation/CodingGuidelines | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git c/Documentation/CodingGuidelines w/Documentation/CodingGuidelines
> index c1046abfb7..76ec6268f2 100644
> --- c/Documentation/CodingGuidelines
> +++ w/Documentation/CodingGuidelines
> @@ -610,8 +610,9 @@ For C programs:
> - `S_init()` initializes a structure without allocating the
> structure itself.
>
> - - `S_release()` releases a structure's contents without freeing the
> - structure.
> + - `S_release()` releases a structure's contents without reinitializing
> + the structure for immediate reuse, and without freeing the structure
> + itself.
>
> - `S_clear()` is equivalent to `S_release()` followed by `S_init()`
> such that the structure is directly usable after clearing it. When
Yup, this looks like a reasonable change to me, thanks!
Patrick
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-04 6:34 ` Patrick Steinhardt
@ 2025-08-04 17:14 ` Junio C Hamano
2025-08-05 5:35 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-08-04 17:14 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Toon Claes, git, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King,
Ævar Arnfjörð Bjarmason
Patrick Steinhardt <ps@pks.im> writes:
> On Fri, Aug 01, 2025 at 10:09:41AM -0700, Junio C Hamano wrote:
>> Toon Claes <toon@iotcl.com> writes:
>>
>> >>> +-t::
>> >>
>> >> -t, --tree-in-recursive::
>> >
>> > Sure!
>>
>> Clarify *what* you do to trees in recursive by giving a verb, e.g.
>>
>> --show-trees-in-recursive
>
> Ah, that's even better indeed! One question that this raises is whether
> this option then should continue to imply `--recursive`. I think it
> rather shouldn't with this new wording, but don't feel overly strong
> about it.
I am somewhat indifferent. We can call the option --show-trees,
which I suspect would make it more consistent with ls-tree, while
making it a bit more confusing at the same time. I dunno.
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-04 17:14 ` Junio C Hamano
@ 2025-08-05 5:35 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-08-05 5:35 UTC (permalink / raw)
To: Junio C Hamano, Patrick Steinhardt
Cc: git, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Jeff King,
Ævar Arnfjörð Bjarmason
Junio C Hamano <gitster@pobox.com> writes:
> I am somewhat indifferent. We can call the option --show-trees,
> which I suspect would make it more consistent with ls-tree, while
> making it a bit more confusing at the same time. I dunno.
I like this a lot. Gonna go with this.
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* Re: [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified
2025-08-01 20:34 ` Jean-Noël AVILA
@ 2025-08-05 5:36 ` Toon Claes
0 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-08-05 5:36 UTC (permalink / raw)
To: Jean-Noël AVILA, Patrick Steinhardt
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King,
Ævar Arnfjörð Bjarmason
Jean-Noël AVILA <jn.avila@free.fr> writes:
>> >> +
>> >> +DESCRIPTION
>> >> +-----------
>> >> +
>> >> +Shows which commit last modified each of the relevant files and
> subdirectories.
>> >> +
>> >> +THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
>> >> +
>> >> +OPTIONS
>> >> +-------
>> >> +
>> >
>> >> +-r::
>> > -r, --recursive::
>
> As a newly introduced man page, please switch to full synopsis style and cite
> only one form per line:
>
> `-r`::
> `--recurse`::
Thanks for pointing that out. I wasn't aware.
--
Cheers,
Toon
^ permalink raw reply [flat|nested] 135+ messages in thread
* [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
2025-07-31 18:40 ` Junio C Hamano
@ 2025-08-05 9:33 ` Toon Claes
2025-08-05 14:34 ` Patrick Steinhardt
2025-08-05 16:34 ` Junio C Hamano
2025-08-05 9:33 ` [PATCH v7 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
` (2 subsequent siblings)
4 siblings, 2 replies; 135+ messages in thread
From: Toon Claes @ 2025-08-05 9:33 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt, Jeff King, Toon Claes
On many forges the tree view is shown in combination with commit data.
In such a view each tree entry is accompanied by the message and date of
the commit that last modified that tree entry. Something like:
| README.md | README: *.txt -> *.adoc fixes | 4 months ago |
| RelNotes | Start 2.51 cycle, the first batch | 4 weeks ago |
| SECURITY.md | SECURITY: describe how to report vulnerabilities | 4 years |
| abspath.c | abspath: move related functions to abspath | 2 years |
| abspath.h | abspath: move related functions to abspath | 2 years |
| aclocal.m4 | configure: use AC_LANG_PROGRAM consistently | 15 years ago |
| add-patch.c | pager: stop using `the_repository` | 7 months ago |
| advice.c | advice: allow disabling default branch name advice | 4 months ago |
| advice.h | advice: allow disabling default branch name advice | 4 months ago |
| alias.h | rebase -m: fix serialization of strategy options | 2 years |
| alloc.h | git-compat-util: move alloc macros to git-compat-util.h | 2 years ago |
| apply.c | apply: only write intents to add for new files | 8 days ago |
| archive.c | Merge branch 'ps/parse-options-integers' | 3 months ago |
| archive.h | archive.h: remove unnecessary include | 1 year |
| attr.h | fuzz: port fuzz-parse-attr-line from OSS-Fuzz | 9 months ago |
| banned.h | banned.h: mark `strtok()` and `strtok_r()` as banned | 2 years |
This series adds the git-last-modified(1) command to feed this view. In the past
the subcommand was proposed[1] to be named git-blame-tree(1). This
version is based on the patches shared by the kind people at GitHub[2].
What is different from the series shared by GitHub:
* Renamed the subcommand from `blame-tree` to `last-modified`. There was
some consensus[5] this name works better, so let's give it a try and
see how this name feels.
* Patches for --max-depth are excluded. I've submitted them as a separate patch
series[6].
* The last-modified command isn't recursive by default. If you want to
recurse into subtrees, you need to pass `-r`.
* The patches in 'tb/blame-tree' at Taylor's fork[4] implement a
caching layer. This feature reads/writes cached results in
`.git/blame-tree/<hash>.btc`. To keep this series to a reviewable
size, that feature is excluded from this series. I think it's better
to submit this as a separate series.
* All the new last-modified machinery is no longer implemented in a library
layer (at the root of the project), but directly in the builtin. So far the
code is fairly small (little over 300 lines of code) and there are no other
users of this code anyway. Also the library level code taken from Taylor's
fork required passing `argc` and `argv` into it. It's quite awkward that the
library code was so tightly coupled with user interaction.
* Squashed various commits together. Like they introduced a flag
`--go-faster`, which later became the default and only implementation.
That story was wrapped up in a single commit.
* Dropped the patches that attempt to increase performance for tree
entries that have not been updated in a long time. In my testing I've
seen both performance improvements *and* degradation with these
changes:
Test HEAD~ HEAD
------------------------------------------------------------------------------------
8020.1: top-level last-modified 4.52(4.38+0.11) 2.03(1.93+0.08) -55.1%
8020.2: top-level recursive last-modified 5.79(5.64+0.11) 8.34(8.17+0.11) +44.0%
8020.3: subdir last-modified 0.15(0.09+0.06) 0.19(0.14+0.06) +26.7%
Before we include these patches, I want to make sure these changes
have positive impact in all/most scenarios. This can happen in a
separate series.
I've set myself as the author and added Based-on-patch-by trailers to
credit the original authors. Let me know if you disagree.
Again thanks to Taylor and the people at GitHub for sharing these
patches. I hope we can work together to get this upstreamed.
[1]: https://lore.kernel.org/git/patch-1.1-0ea849d900b-20230205T204104Z-avarab@gmail.com/
[2]: https://lore.kernel.org/git/Z+XJ+1L3PnC9Dyba@nand.local/
[3]: https://lore.kernel.org/git/20250326-toon-blame-tree-v1-3-4173133f3786@iotcl.com/
[4]: git@github.com:ttaylorr/git.git
[5]: https://lore.kernel.org/git/aCbBKj7O9LjO3SMK@pks.im/
[6]: https://lore.kernel.org/git/20250729-toon-max-depth-v1-0-c177e39c40fb@iotcl.com/
---
Changes in v7:
- Fix case when bloom filters were used and a commit range was given. This bug
was uncovered in CI.
- Rename the long option for `-t` to `--show-trees`. This option no longer
implies option `-r`. Reflect these changes in the documentation, along with a
few other small documentation tweaks.
- Move prepare_commit_graph() into get_bloom_filter_settings() which no longer
requires last-modified to worry about it itself. This is similar to
repo_find_commit_pos_in_graph() and lookup_commit_in_graph()
- Bring back the call to commit_graph_generation() in maybe_changed_path(). This
is also called in the same function in blame.c and in
check_maybe_different_in_bloom_filter() in revision.c. I couldn't find a test
case that triggers this exit condition, but it should not have negative
side-effects.
- No longer call diff_free() on the copy we make when populating the `paths` of
`struct last_modified`. Because we weren't doing a deep copy, this could clean
up fields used later on by the original. Instead only call clear_pathspec(). A
comment to clarify this mechanism better is added.
- Add BUG() call to exit condition that shouldn't happen.
- Switch some int types to bool types.
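
A note on the diff_free() item above, since the ownership rule is easy to get wrong. The sketch below is a toy model in plain C, not Git's actual `struct diff_options` or pathspec API; every `toy_*` name is hypothetical. It assumes the situation described in that item: a structure is copied with memcpy(), only one member is deep-copied, and so only that member may be released through the copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins; the names are illustrative, not Git's API. */
struct toy_pathspec {
	char **items;
	size_t nr;
};

struct toy_options {
	struct toy_pathspec pathspec;	/* deep-copied into the working copy */
	char *shared_buffer;		/* stays owned by the original */
	int output_format;
};

/* Deep-copy `src` into `dst`; returns 0 on success, -1 on failure. */
static int toy_copy_pathspec(struct toy_pathspec *dst,
			     const struct toy_pathspec *src)
{
	dst->nr = 0;
	dst->items = calloc(src->nr ? src->nr : 1, sizeof(*dst->items));
	if (!dst->items)
		return -1;
	for (size_t i = 0; i < src->nr; i++) {
		size_t len = strlen(src->items[i]) + 1;
		dst->items[i] = malloc(len);
		if (!dst->items[i])
			return -1; /* dst->nr still counts what was allocated */
		memcpy(dst->items[i], src->items[i], len);
		dst->nr++;
	}
	return 0;
}

static void toy_clear_pathspec(struct toy_pathspec *ps)
{
	for (size_t i = 0; i < ps->nr; i++)
		free(ps->items[i]);
	free(ps->items);
	ps->items = NULL;
	ps->nr = 0;
}

/* Shallow copy of everything, then a deep copy of the pathspec only. */
static int toy_options_working_copy(struct toy_options *copy,
				    const struct toy_options *orig)
{
	memcpy(copy, orig, sizeof(*copy));
	return toy_copy_pathspec(&copy->pathspec, &orig->pathspec);
}

/*
 * Release only what the copy owns. Freeing `shared_buffer` here would
 * pull the rug from under the original, which is the kind of problem a
 * full teardown of a shallow copy can cause.
 */
static void toy_options_working_copy_done(struct toy_options *copy)
{
	toy_clear_pathspec(&copy->pathspec);
}
```

The design choice is the same in both cases: the teardown of a copy has to mirror exactly what the copy duplicated, and nothing more.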
Changes in v6:
- Only the first 3 patches are kept. The last 3 patches worked toward adding an
extra option `--format`. The way it was implemented was heavily debatable and
in the end it is not required for a first iteration, so they are dropped.
- Function prepare_commit_graph() is exported and used in
generation_numbers_enabled().
- Since the library layer was removed and all the code was moved into the
builtin, there were still some leftovers from using a callback mechanism to
display the results. This is removed (as far as possible) and instead
last_modified_emit() is always called; this function was called show_entry() previously.
- Code is rebased to use refactoring in the bloom filter API.
Changes in v5:
- Added a patch to allow for an "extended" format. The name for this option is
open for debate (please, all input is welcome). But the main goal of this
series is to provide the data needed for the "forge tree view" as demoed at
the top of this cover letter. With this extra patch (and the preparatory patch
to pretty.[ch]), I hope the use-case becomes more clear. But because it wasn't
included in previous 4 versions I also wouldn't mind sending a separate patch
series for it.
- Removed the call to sort(1) in the t8020 tests. This was needed for the tests for
--extended.
- I'm adding a fixup! commit to be compatible with in-flight patches for bloom
filter optimizations:
https://lore.kernel.org/git/20250712093517.17907-1-yldhome2d2@gmail.com/
This patch can be dropped if current series lands before those.
Changes in v4:
- Removed root-level `last-modified.[ch]` library code and moved code to
`builtin/last-modified.c`. Historically we've had library code (also because it
was used in testtool), but we no longer need that separation. I'm sorry this
makes the range-diff hard to read.
- Added the use of parse_options() to get better usage messages.
- Formatting fixes after conversation in
https://lore.kernel.org/git/xmqqh5zvk5h0.fsf@gitster.g/
- Link to v3: https://lore.kernel.org/git/20250630-toon-new-blame-tree-v3-0-3516025dc3bc@iotcl.com/
Changes in v3:
- Updated benchmarks in commit messages.
- Removed the patches that attempt to increase performance for tree
entries that have not been updated in a long time. (see above)
- Move handling failure in `last_modified_init()` to the caller.
- Sorted #include clauses lexicographically.
- Removed unneeded `commit` in `struct last_modified_entry`.
- Renamed some functions/variables and added some comments to make it
easier to understand.
- Removed unnecessary checking of the commit-graph generation number.
- Link to v2: https://lore.kernel.org/r/20250523-toon-new-blame-tree-v2-0-101e4ca4c1c9@iotcl.com
Changes in v2:
- The subcommand is renamed from `blame-tree` to `last-modified`
- Documentation is added. Here we mark the command as experimental.
- Some test cases are added related to merges.
- Link to v1: https://lore.kernel.org/r/20250422-toon-new-blame-tree-v1-0-fdb51b8a394a@iotcl.com
Toon Claes (3):
last-modified: new subcommand to show when files were last modified
t/perf: add last-modified perf script
last-modified: use Bloom filters when available
.gitignore | 1 +
Documentation/git-last-modified.adoc | 54 +++++
Documentation/meson.build | 1 +
Makefile | 1 +
builtin.h | 1 +
builtin/last-modified.c | 325 +++++++++++++++++++++++++++
command-list.txt | 1 +
commit-graph.c | 7 +-
git.c | 1 +
meson.build | 1 +
t/meson.build | 2 +
t/perf/p8020-last-modified.sh | 22 ++
t/t8020-last-modified.sh | 210 +++++++++++++++++
13 files changed, 626 insertions(+), 1 deletion(-)
create mode 100644 Documentation/git-last-modified.adoc
create mode 100644 builtin/last-modified.c
create mode 100755 t/perf/p8020-last-modified.sh
create mode 100755 t/t8020-last-modified.sh
Range-diff against v6:
1: 9d5ce06460 < -: ---------- last-modified: new subcommand to show when files were last modified
-: ---------- > 1: d5a2359633 last-modified: new subcommand to show when files were last modified
2: 7c921d4344 = 2: 7537f0e597 t/perf: add last-modified perf script
3: 3c42043682 < -: ---------- commit-graph: export prepare_commit_graph()
4: e3c2d5e3c1 ! 3: ebc7b061df last-modified: use Bloom filters when available
@@ builtin/last-modified.c: static void last_modified_diff(struct diff_queue_struct
}
}
-+static int maybe_changed_path(struct last_modified *lm, struct commit *origin)
++static bool maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
+ struct last_modified_entry *ent;
+ struct hashmap_iter iter;
+
+ if (!lm->rev.bloom_filter_settings)
-+ return 1;
++ return true;
++
++ if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
++ return true;
+
+ filter = get_bloom_filter(lm->rev.repo, origin);
+ if (!filter)
-+ return 1;
++ return true;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ if (bloom_filter_contains(filter, &ent->key,
+ lm->rev.bloom_filter_settings))
-+ return 1;
++ return true;
+ }
-+ return 0;
++ return false;
+}
+
static int last_modified_run(struct last_modified *lm)
{
struct last_modified_callback_data data = { .lm = lm };
@@ builtin/last-modified.c: static int last_modified_run(struct last_modified *lm)
- lm->rev.diffopt.format_callback_data = &data;
-
- prepare_revision_walk(&lm->rev);
-+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
-
- while (hashmap_get_size(&lm->paths)) {
- data.commit = get_revision(&lm->rev);
- if (!data.commit)
- break;
-
+ &data.commit->object.oid, "",
+ &lm->rev.diffopt);
+ diff_flush(&lm->rev.diffopt);
+- } else {
+- log_tree_commit(&lm->rev, data.commit);
++
++ break;
+ }
++
+ if (!maybe_changed_path(lm, data.commit))
+ continue;
+
- if (data.commit->object.flags & BOUNDARY) {
- diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
- &data.commit->object.oid, "",
++ log_tree_commit(&lm->rev, data.commit);
+ }
+
+ return 0;
@@ builtin/last-modified.c: static int last_modified_init(struct last_modified *lm, struct repository *r,
return argc;
}
-+ prepare_commit_graph(lm->rev.repo);
+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
if (populate_paths_from_revs(lm) < 0)
return error(_("unable to setup last-modified"));
+
+ ## commit-graph.c ##
+@@ commit-graph.c: int corrected_commit_dates_enabled(struct repository *r)
+
+ struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
+ {
+- struct commit_graph *g = r->objects->commit_graph;
++ struct commit_graph *g;
++
++ if (!prepare_commit_graph(r))
++ return NULL;
++
++ g = r->objects->commit_graph;
+ while (g) {
+ if (g->bloom_filter_settings)
+ return g->bloom_filter_settings;
base-commit: 112648dd6bdd8e4f485cd0ae11636807959d48be
--
2.50.1.327.g047016eb4a
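
For readers unfamiliar with the maybe_changed_path() check in the range-diff above, the following toy model shows the idea. It is a sketch under stated assumptions: the filter here is a tiny fixed-size Bloom filter with an FNV-1a hash, which is not how Git's changed-path filters are laid out or hashed, and all `toy_*` names are hypothetical. What it shares with the real function is the contract: return true when the commit might touch one of the remaining paths (or when nothing can be ruled out), and false only when the filter proves none of them changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOOM_BITS 256
#define TOY_BLOOM_HASHES 3

/* Toy per-commit filter; Git's on-disk format and hashing differ. */
struct toy_bloom {
	uint8_t bits[TOY_BLOOM_BITS / 8];
};

/* Seeded FNV-1a, chosen only because it is short. */
static uint32_t toy_hash(const char *s, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *s; s++) {
		h ^= (uint8_t)*s;
		h *= 16777619u;
	}
	return h;
}

static void toy_bloom_add(struct toy_bloom *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_BLOOM_HASHES; i++) {
		uint32_t bit = toy_hash(path, i) % TOY_BLOOM_BITS;
		f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* False means "definitely not changed"; true means "maybe changed". */
static bool toy_bloom_contains(const struct toy_bloom *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_BLOOM_HASHES; i++) {
		uint32_t bit = toy_hash(path, i) % TOY_BLOOM_BITS;
		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return false;
	}
	return true;
}

/*
 * Same shape as maybe_changed_path(): without a filter we cannot rule
 * anything out, so the caller has to run the real diff.
 */
static bool toy_maybe_changed_path(const struct toy_bloom *filter,
				   const char *const *paths, size_t nr)
{
	if (!filter)
		return true;
	for (size_t i = 0; i < nr; i++)
		if (toy_bloom_contains(filter, paths[i]))
			return true;
	return false;
}
```

A false positive only costs an unnecessary diff, while a false negative would drop a result, which is why every "cannot tell" case returns true.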
^ permalink raw reply [flat|nested] 135+ messages in thread
* [PATCH v7 1/3] last-modified: new subcommand to show when files were last modified
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
2025-07-31 18:40 ` Junio C Hamano
2025-08-05 9:33 ` [PATCH v7 0/3] " Toon Claes
@ 2025-08-05 9:33 ` Toon Claes
2025-08-05 9:33 ` [PATCH v7 2/3] t/perf: add last-modified perf script Toon Claes
2025-08-05 9:33 ` [PATCH v7 3/3] last-modified: use Bloom filters when available Toon Claes
4 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-08-05 9:33 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt, Jeff King, Toon Claes,
Ævar Arnfjörð Bjarmason
Similar to git-blame(1), introduce a new subcommand
git-last-modified(1). This command shows the most recent modification to
paths in a tree. It does so by expanding the tree at a given commit,
taking note of the current state of each path, and then walking
backwards through history looking for commits where each path changed
into its final commit ID.
Based-on-patch-by: Jeff King <peff@peff.net>
Improved-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
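
Not part of the commit message: a small standalone sketch of the walk described above, for reviewers who want the idea before reading the diff. It is a simplification under several assumptions: history is linear and listed newest first, object IDs are plain integers, and lookups are linear scans instead of a hashmap. All `toy_*` names are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One path touched by a commit, with the blob it ended up at. */
struct toy_change {
	const char *path;
	int blob;	/* stand-in for an object ID */
};

struct toy_commit {
	const char *name;
	const struct toy_change *changes;
	size_t nr;
};

/* A path in the target tree, its final blob, and the commit found for it. */
struct toy_wanted {
	const char *path;
	int blob;
	const char *found;
};

/*
 * Walk `history` (newest first). A path is resolved by the first commit
 * whose change leaves it at the blob recorded for the target tree.
 * Deletions are simply absent from `changes`, since a deletion cannot
 * produce a path that exists in the target. Returns the number of paths
 * left unresolved.
 */
static size_t toy_last_modified(const struct toy_commit *history,
				size_t nr_commits,
				struct toy_wanted *wanted, size_t nr_wanted)
{
	size_t remaining = nr_wanted;

	for (size_t w = 0; w < nr_wanted; w++)
		wanted[w].found = NULL;

	for (size_t c = 0; c < nr_commits && remaining; c++) {
		const struct toy_commit *commit = &history[c];

		for (size_t i = 0; i < commit->nr; i++) {
			for (size_t w = 0; w < nr_wanted; w++) {
				struct toy_wanted *ent = &wanted[w];

				if (ent->found ||
				    strcmp(ent->path, commit->changes[i].path))
					continue;
				/* Not the version present in the target tree. */
				if (ent->blob != commit->changes[i].blob)
					continue;
				ent->found = commit->name;
				remaining--;
			}
		}
	}
	return remaining;
}
```

The walk stops as soon as every path is resolved, which matches the `while (hashmap_get_size(&lm->paths))` loop in last_modified_run().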
.gitignore | 1 +
Documentation/git-last-modified.adoc | 54 +++++
Documentation/meson.build | 1 +
Makefile | 1 +
builtin.h | 1 +
builtin/last-modified.c | 281 +++++++++++++++++++++++++++
command-list.txt | 1 +
git.c | 1 +
meson.build | 1 +
t/meson.build | 1 +
t/t8020-last-modified.sh | 210 ++++++++++++++++++++
11 files changed, 553 insertions(+)
create mode 100644 Documentation/git-last-modified.adoc
create mode 100644 builtin/last-modified.c
create mode 100755 t/t8020-last-modified.sh
diff --git a/.gitignore b/.gitignore
index 04c444404e..a36ee94443 100644
--- a/.gitignore
+++ b/.gitignore
@@ -87,6 +87,7 @@
/git-init-db
/git-interpret-trailers
/git-instaweb
+/git-last-modified
/git-log
/git-ls-files
/git-ls-remote
diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
new file mode 100644
index 0000000000..35bd4a1dd0
--- /dev/null
+++ b/Documentation/git-last-modified.adoc
@@ -0,0 +1,54 @@
+git-last-modified(1)
+====================
+
+NAME
+----
+git-last-modified - EXPERIMENTAL: Show when files were last modified
+
+
+SYNOPSIS
+--------
+[synopsis]
+git last-modified [--recursive] [--show-trees] [<revision-range>] [[--] <path>...]
+
+DESCRIPTION
+-----------
+
+Shows which commit last modified each of the relevant files and subdirectories.
+A commit renaming a path, or changing its mode is also taken into account.
+
+THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
+
+OPTIONS
+-------
+
+-r::
+--recursive::
+ Instead of showing tree entries, step into subtrees and show all entries
+ inside them recursively.
+
+-t::
+--show-trees::
+ Show tree entries even when recursing into them. It has no effect
+ without `--recursive`.
+
+<revision-range>::
+ Only traverse commits in the specified revision range. When no
+ `<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
+ history leading to the current commit). For a complete list of ways to
+ spell `<revision-range>`, see the 'Specifying Ranges' section of
+ linkgit:gitrevisions[7].
+
+[--] <path>...::
+ For each _<path>_ given, the commit which last modified it is returned.
+ Without an optional path parameter, all files and subdirectories
+ in the path traversal are included in the output.
+
+SEE ALSO
+--------
+linkgit:git-blame[1],
+linkgit:git-log[1].
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Documentation/meson.build b/Documentation/meson.build
index 4404c623f0..a8ac5285f0 100644
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -74,6 +74,7 @@ manpages = {
'git-init.adoc' : 1,
'git-instaweb.adoc' : 1,
'git-interpret-trailers.adoc' : 1,
+ 'git-last-modified.adoc' : 1,
'git-log.adoc' : 1,
'git-ls-files.adoc' : 1,
'git-ls-remote.adoc' : 1,
diff --git a/Makefile b/Makefile
index e11340c1ae..b0b3a30daa 100644
--- a/Makefile
+++ b/Makefile
@@ -1265,6 +1265,7 @@ BUILTIN_OBJS += builtin/hook.o
BUILTIN_OBJS += builtin/index-pack.o
BUILTIN_OBJS += builtin/init-db.o
BUILTIN_OBJS += builtin/interpret-trailers.o
+BUILTIN_OBJS += builtin/last-modified.o
BUILTIN_OBJS += builtin/log.o
BUILTIN_OBJS += builtin/ls-files.o
BUILTIN_OBJS += builtin/ls-remote.o
diff --git a/builtin.h b/builtin.h
index bff13e3069..6ed6759ec4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -176,6 +176,7 @@ int cmd_hook(int argc, const char **argv, const char *prefix, struct repository
int cmd_index_pack(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_init_db(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_interpret_trailers(int argc, const char **argv, const char *prefix, struct repository *repo);
+int cmd_last_modified(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log_reflog(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_log(int argc, const char **argv, const char *prefix, struct repository *repo);
int cmd_ls_files(int argc, const char **argv, const char *prefix, struct repository *repo);
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
new file mode 100644
index 0000000000..364493ac69
--- /dev/null
+++ b/builtin/last-modified.c
@@ -0,0 +1,281 @@
+#include "git-compat-util.h"
+#include "builtin.h"
+#include "commit.h"
+#include "config.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "environment.h"
+#include "hashmap.h"
+#include "hex.h"
+#include "log-tree.h"
+#include "object-name.h"
+#include "object.h"
+#include "parse-options.h"
+#include "quote.h"
+#include "repository.h"
+#include "revision.h"
+
+struct last_modified_entry {
+ struct hashmap_entry hashent;
+ struct object_id oid;
+ const char path[FLEX_ARRAY];
+};
+
+static int last_modified_entry_hashcmp(const void *unused UNUSED,
+ const struct hashmap_entry *hent1,
+ const struct hashmap_entry *hent2,
+ const void *path)
+{
+ const struct last_modified_entry *ent1 =
+ container_of(hent1, const struct last_modified_entry, hashent);
+ const struct last_modified_entry *ent2 =
+ container_of(hent2, const struct last_modified_entry, hashent);
+ return strcmp(ent1->path, path ? path : ent2->path);
+}
+
+struct last_modified {
+ struct hashmap paths;
+ struct rev_info rev;
+ bool recursive;
+ bool show_trees;
+};
+
+static void last_modified_release(struct last_modified *lm)
+{
+ hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
+ release_revisions(&lm->rev);
+}
+
+struct last_modified_callback_data {
+ struct last_modified *lm;
+ struct commit *commit;
+};
+
+static void add_path_from_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *data)
+{
+ struct last_modified *lm = data;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ struct last_modified_entry *ent;
+ const char *path = p->two->path;
+
+ FLEX_ALLOC_STR(ent, path, path);
+ oidcpy(&ent->oid, &p->two->oid);
+ hashmap_entry_init(&ent->hashent, strhash(ent->path));
+ hashmap_add(&lm->paths, &ent->hashent);
+ }
+}
+
+static int populate_paths_from_revs(struct last_modified *lm)
+{
+ int num_interesting = 0;
+ struct diff_options diffopt;
+
+ /*
+ * Create a copy of `struct diff_options`. In this copy a callback is
+ * set that when called adds entries to `paths` in `struct last_modified`.
+ * This copy is used to diff the tree of the target revision against an
+ * empty tree. This results in all paths in the target revision being
+ * listed. After `paths` is populated, we don't need this copy anymore.
+ */
+ memcpy(&diffopt, &lm->rev.diffopt, sizeof(diffopt));
+ copy_pathspec(&diffopt.pathspec, &lm->rev.diffopt.pathspec);
+ diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ diffopt.format_callback = add_path_from_diff;
+ diffopt.format_callback_data = lm;
+
+ for (size_t i = 0; i < lm->rev.pending.nr; i++) {
+ struct object_array_entry *obj = lm->rev.pending.objects + i;
+
+ if (obj->item->flags & UNINTERESTING)
+ continue;
+
+ if (num_interesting++)
+ return error(_("last-modified can only operate on one tree at a time"));
+
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &obj->item->oid, "", &diffopt);
+ diff_flush(&diffopt);
+ }
+ clear_pathspec(&diffopt.pathspec);
+
+ return 0;
+}
+
+static void last_modified_emit(struct last_modified *lm,
+ const char *path, const struct commit *commit)
+
+{
+ if (commit->object.flags & BOUNDARY)
+ putchar('^');
+ printf("%s\t", oid_to_hex(&commit->object.oid));
+
+ if (lm->rev.diffopt.line_termination)
+ write_name_quoted(path, stdout, '\n');
+ else
+ printf("%s%c", path, '\0');
+}
+
+static void mark_path(const char *path, const struct object_id *oid,
+ struct last_modified_callback_data *data)
+{
+ struct last_modified_entry *ent;
+
+ /* Is it even a path that we are interested in? */
+ ent = hashmap_get_entry_from_hash(&data->lm->paths, strhash(path), path,
+ struct last_modified_entry, hashent);
+ if (!ent)
+ return;
+
+ /*
+ * Is it arriving at a version of interest, or is it from a side branch
+ * which did not contribute to the final state?
+ */
+ if (!oideq(oid, &ent->oid))
+ return;
+
+ last_modified_emit(data->lm, path, data->commit);
+
+ hashmap_remove(&data->lm->paths, &ent->hashent, path);
+ free(ent);
+}
+
+static void last_modified_diff(struct diff_queue_struct *q,
+ struct diff_options *opt UNUSED, void *cbdata)
+{
+ struct last_modified_callback_data *data = cbdata;
+
+ for (int i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ switch (p->status) {
+ case DIFF_STATUS_DELETED:
+ /*
+ * There's no point in feeding a deletion, as it could
+ * not have resulted in our current state, which
+ * actually has the file.
+ */
+ break;
+
+ default:
+ /*
+ * Otherwise, we care only that we somehow arrived at
+ * a final oid state. Note that this covers some
+ * potentially controversial areas, including:
+ *
+ * 1. A rename or copy will be found, as it is the
+ * first time the content has arrived at the given
+ * path.
+ *
+ * 2. Even a non-content modification like a mode or
+ * type change will trigger it.
+ *
+ * We take the inclusive approach for now, and find
+ * anything which impacts the path. Options to tweak
+ * the behavior (e.g., to "--follow" the content across
+ * renames) can come later.
+ */
+ mark_path(p->two->path, &p->two->oid, data);
+ break;
+ }
+ }
+}
+
+static int last_modified_run(struct last_modified *lm)
+{
+ struct last_modified_callback_data data = { .lm = lm };
+
+ lm->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ lm->rev.diffopt.format_callback = last_modified_diff;
+ lm->rev.diffopt.format_callback_data = &data;
+
+ prepare_revision_walk(&lm->rev);
+
+ while (hashmap_get_size(&lm->paths)) {
+ data.commit = get_revision(&lm->rev);
+ if (!data.commit)
+ BUG("paths remaining beyond boundary in last-modified");
+
+ if (data.commit->object.flags & BOUNDARY) {
+ diff_tree_oid(lm->rev.repo->hash_algo->empty_tree,
+ &data.commit->object.oid, "",
+ &lm->rev.diffopt);
+ diff_flush(&lm->rev.diffopt);
+ } else {
+ log_tree_commit(&lm->rev, data.commit);
+ }
+ }
+
+ return 0;
+}
+
+static int last_modified_init(struct last_modified *lm, struct repository *r,
+ const char *prefix, int argc, const char **argv)
+{
+ hashmap_init(&lm->paths, last_modified_entry_hashcmp, NULL, 0);
+
+ repo_init_revisions(r, &lm->rev, prefix);
+ lm->rev.def = "HEAD";
+ lm->rev.combine_merges = 1;
+ lm->rev.show_root_diff = 1;
+ lm->rev.boundary = 1;
+ lm->rev.no_commit_id = 1;
+ lm->rev.diff = 1;
+ lm->rev.diffopt.flags.recursive = lm->recursive;
+ lm->rev.diffopt.flags.tree_in_recursive = lm->show_trees;
+
+ argc = setup_revisions(argc, argv, &lm->rev, NULL);
+ if (argc > 1) {
+ error(_("unknown last-modified argument: %s"), argv[1]);
+ return argc;
+ }
+
+ if (populate_paths_from_revs(lm) < 0)
+ return error(_("unable to setup last-modified"));
+
+ return 0;
+}
+
+int cmd_last_modified(int argc, const char **argv, const char *prefix,
+ struct repository *repo)
+{
+ int ret;
+ struct last_modified lm = { 0 };
+
+ const char * const last_modified_usage[] = {
+ N_("git last-modified [--recursive] [--show-trees] "
+ "[<revision-range>] [[--] <path>...]"),
+ NULL
+ };
+
+ struct option last_modified_options[] = {
+ OPT_BOOL('r', "recursive", &lm.recursive,
+ N_("recurse into subtrees")),
+ OPT_BOOL('t', "show-trees", &lm.show_trees,
+ N_("show tree entries when recursing into subtrees")),
+ OPT_END()
+ };
+
+ argc = parse_options(argc, argv, prefix, last_modified_options,
+ last_modified_usage,
+ PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_KEEP_UNKNOWN_OPT);
+
+ repo_config(repo, git_default_config, NULL);
+
+ ret = last_modified_init(&lm, repo, prefix, argc, argv);
+ if (ret > 0)
+ usage_with_options(last_modified_usage,
+ last_modified_options);
+ if (ret)
+ goto out;
+
+ ret = last_modified_run(&lm);
+ if (ret)
+ goto out;
+
+out:
+ last_modified_release(&lm);
+
+ return ret;
+}
diff --git a/command-list.txt b/command-list.txt
index b7ade3ab9f..b715777b24 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -124,6 +124,7 @@ git-index-pack plumbingmanipulators
git-init mainporcelain init
git-instaweb ancillaryinterrogators complete
git-interpret-trailers purehelpers
+git-last-modified plumbinginterrogators
git-log mainporcelain info
git-ls-files plumbinginterrogators
git-ls-remote plumbinginterrogators
diff --git a/git.c b/git.c
index 07a5fe39fb..76a0b2a1a4 100644
--- a/git.c
+++ b/git.c
@@ -565,6 +565,7 @@ static struct cmd_struct commands[] = {
{ "init", cmd_init_db },
{ "init-db", cmd_init_db },
{ "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
+ { "last-modified", cmd_last_modified, RUN_SETUP },
{ "log", cmd_log, RUN_SETUP },
{ "ls-files", cmd_ls_files, RUN_SETUP },
{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
diff --git a/meson.build b/meson.build
index 5dd299b496..cff7125638 100644
--- a/meson.build
+++ b/meson.build
@@ -607,6 +607,7 @@ builtin_sources = [
'builtin/index-pack.c',
'builtin/init-db.c',
'builtin/interpret-trailers.c',
+ 'builtin/last-modified.c',
'builtin/log.c',
'builtin/ls-files.c',
'builtin/ls-remote.c',
diff --git a/t/meson.build b/t/meson.build
index bbeba1a8d5..68656fe08a 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -946,6 +946,7 @@ integration_tests = [
't8012-blame-colors.sh',
't8013-blame-ignore-revs.sh',
't8014-blame-ignore-fuzzy.sh',
+ 't8020-last-modified.sh',
't9001-send-email.sh',
't9002-column.sh',
't9003-help-autocorrect.sh',
diff --git a/t/t8020-last-modified.sh b/t/t8020-last-modified.sh
new file mode 100755
index 0000000000..5eb4cef035
--- /dev/null
+++ b/t/t8020-last-modified.sh
@@ -0,0 +1,210 @@
+#!/bin/sh
+
+test_description='last-modified tests'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_commit 1 file &&
+ mkdir a &&
+ test_commit 2 a/file &&
+ mkdir a/b &&
+ test_commit 3 a/b/file
+'
+
+test_expect_success 'cannot run last-modified on two trees' '
+ test_must_fail git last-modified HEAD HEAD~1
+'
+
+check_last_modified() {
+ local indir= &&
+ while test $# != 0
+ do
+ case "$1" in
+ -C)
+ indir="$2"
+ shift
+ ;;
+ *)
+ break
+ ;;
+ esac &&
+ shift
+ done &&
+
+ cat >expect &&
+ test_when_finished "rm -f tmp.*" &&
+ git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
+ git name-rev --annotate-stdin --name-only --tags \
+ <tmp.1 >tmp.2 &&
+ tr '\t' ' ' <tmp.2 >actual &&
+ test_cmp expect actual
+}
+
+test_expect_success 'last-modified non-recursive' '
+ check_last_modified <<-\EOF
+ 3 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified recursive' '
+ check_last_modified -r <<-\EOF
+ 3 a/b/file
+ 2 a/file
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified recursive with show-trees' '
+ check_last_modified -r -t <<-\EOF
+ 3 a
+ 3 a/b
+ 3 a/b/file
+ 2 a/file
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified non-recursive with show-trees' '
+ check_last_modified -t <<-\EOF
+ 3 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified subdir' '
+ check_last_modified a <<-\EOF
+ 3 a
+ EOF
+'
+
+test_expect_success 'last-modified subdir recursive' '
+ check_last_modified -r a <<-\EOF
+ 3 a/b/file
+ 2 a/file
+ EOF
+'
+
+test_expect_success 'last-modified from non-HEAD commit' '
+ check_last_modified HEAD^ <<-\EOF
+ 2 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified from subdir defaults to root' '
+ check_last_modified -C a <<-\EOF
+ 3 a
+ 1 file
+ EOF
+'
+
+test_expect_success 'last-modified from subdir uses relative pathspecs' '
+ check_last_modified -C a -r b <<-\EOF
+ 3 a/b/file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by count' '
+ check_last_modified -1 <<-\EOF
+ 3 a
+ ^2 file
+ EOF
+'
+
+test_expect_success 'limit last-modified traversal by commit' '
+ check_last_modified HEAD~2..HEAD <<-\EOF
+ 3 a
+ ^1 file
+ EOF
+'
+
+test_expect_success 'only last-modified files in the current tree' '
+ git rm -rf a &&
+ git commit -m "remove a" &&
+ check_last_modified <<-\EOF
+ 1 file
+ EOF
+'
+
+test_expect_success 'cross merge boundaries in blaming' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit m1 &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit m2 &&
+ git merge m1 &&
+ check_last_modified <<-\EOF
+ m2 m2.t
+ m1 m1.t
+ EOF
+'
+
+test_expect_success 'last-modified merge for resolved conflicts' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit c1 conflict &&
+ git checkout HEAD^ &&
+ git rm -rf . &&
+ test_commit c2 conflict &&
+ test_must_fail git merge c1 &&
+ test_commit resolved conflict &&
+ check_last_modified conflict <<-\EOF
+ resolved conflict
+ EOF
+'
+
+
+# Consider `file` with this content through history:
+#
+# A---B---B-------B---B
+#          \     /
+#           C---D
+test_expect_success 'last-modified merge ignores content from branch' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit a1 file A &&
+ test_commit a2 file B &&
+ test_commit a3 file C &&
+ test_commit a4 file D &&
+ git checkout a2 &&
+ git merge --no-commit --no-ff a4 &&
+ git checkout a2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ a2 file
+ EOF
+'
+
+# Consider `file` with this content through history:
+#
+# A---B---B---C---D---B---B
+#          \         /
+#           B-------B
+test_expect_success 'last-modified merge undoes changes' '
+ git checkout HEAD^0 &&
+ git rm -rf . &&
+ test_commit b1 file A &&
+ test_commit b2 file B &&
+ test_commit b3 file C &&
+ test_commit b4 file D &&
+ git checkout b2 &&
+ test_commit b5 file2 2 &&
+ git checkout b4 &&
+ git merge --no-commit --no-ff b5 &&
+ git checkout b2 -- file &&
+ git merge --continue &&
+ check_last_modified <<-\EOF
+ b5 file2
+ b2 file
+ EOF
+'
+
+test_expect_success 'last-modified complains about unknown arguments' '
+ test_must_fail git last-modified --foo 2>err &&
+ grep "unknown last-modified argument: --foo" err
+'
+
+test_done
--
2.50.1.327.g047016eb4a
* [PATCH v7 2/3] t/perf: add last-modified perf script
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
` (2 preceding siblings ...)
2025-08-05 9:33 ` [PATCH v7 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
@ 2025-08-05 9:33 ` Toon Claes
2025-08-05 9:33 ` [PATCH v7 3/3] last-modified: use Bloom filters when available Toon Claes
4 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-08-05 9:33 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt, Jeff King, Toon Claes
This just runs some simple last-modified commands. We already test
correctness in the regular suite, so this is just about finding
performance regressions from one version to another.
Based-on-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
t/meson.build | 1 +
t/perf/p8020-last-modified.sh | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+)
create mode 100755 t/perf/p8020-last-modified.sh
diff --git a/t/meson.build b/t/meson.build
index 68656fe08a..21d5e99bf5 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -1140,6 +1140,7 @@ benchmarks = [
'perf/p7820-grep-engines.sh',
'perf/p7821-grep-engines-fixed.sh',
'perf/p7822-grep-perl-character.sh',
+ 'perf/p8020-last-modified.sh',
'perf/p9210-scalar.sh',
'perf/p9300-fast-import-export.sh',
]
diff --git a/t/perf/p8020-last-modified.sh b/t/perf/p8020-last-modified.sh
new file mode 100755
index 0000000000..cb1f98d3db
--- /dev/null
+++ b/t/perf/p8020-last-modified.sh
@@ -0,0 +1,22 @@
+#!/bin/sh
+
+test_description='last-modified perf tests'
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+test_perf 'top-level last-modified' '
+ git last-modified HEAD
+'
+
+test_perf 'top-level recursive last-modified' '
+ git last-modified -r HEAD
+'
+
+test_perf 'subdir last-modified' '
+ git ls-tree -d HEAD >subtrees &&
+ path="$(head -n 1 subtrees | cut -f2)" &&
+ git last-modified -r HEAD -- "$path"
+'
+
+test_done
--
2.50.1.327.g047016eb4a
* [PATCH v7 3/3] last-modified: use Bloom filters when available
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
` (3 preceding siblings ...)
2025-08-05 9:33 ` [PATCH v7 2/3] t/perf: add last-modified perf script Toon Claes
@ 2025-08-05 9:33 ` Toon Claes
4 siblings, 0 replies; 135+ messages in thread
From: Toon Claes @ 2025-08-05 9:33 UTC (permalink / raw)
To: git
Cc: Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau, Derrick Stolee,
Christian Couder, Patrick Steinhardt, Jeff King, Toon Claes
Our 'git last-modified' performs a revision walk, and computes a diff at
each point in the walk to figure out whether a given revision changed
any of the paths it considers interesting.
When changed-path Bloom filters are available, we can avoid computing
many such diffs. Before computing a diff, we first check if any of the
remaining paths of interest were possibly changed at a given commit by
consulting its Bloom filter. If any of them might have changed, we have
to compute the diff.
If none of those queries returned "maybe", we know that the given commit
doesn't change any of the paths which are interesting to us, so we can
skip computing the diff in this case.
Comparing the perf test results on git.git:
Test HEAD~ HEAD
------------------------------------------------------------------------------------
8020.1: top-level last-modified 4.49(4.34+0.11) 2.22(2.05+0.09) -50.6%
8020.2: top-level recursive last-modified 5.64(5.45+0.11) 5.62(5.30+0.11) -0.4%
8020.3: subdir last-modified 0.11(0.06+0.04) 0.07(0.03+0.04) -36.4%
Based-on-patch-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
---
builtin/last-modified.c | 48 +++++++++++++++++++++++++++++++++++++++--
commit-graph.c | 7 +++++-
2 files changed, 52 insertions(+), 3 deletions(-)
diff --git a/builtin/last-modified.c b/builtin/last-modified.c
index 364493ac69..82c5739827 100644
--- a/builtin/last-modified.c
+++ b/builtin/last-modified.c
@@ -1,5 +1,7 @@
#include "git-compat-util.h"
+#include "bloom.h"
#include "builtin.h"
+#include "commit-graph.h"
#include "commit.h"
#include "config.h"
#include "diff.h"
@@ -18,6 +20,7 @@
struct last_modified_entry {
struct hashmap_entry hashent;
struct object_id oid;
+ struct bloom_key key;
const char path[FLEX_ARRAY];
};
@@ -42,6 +45,12 @@ struct last_modified {
static void last_modified_release(struct last_modified *lm)
{
+ struct hashmap_iter iter;
+ struct last_modified_entry *ent;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent)
+ bloom_key_clear(&ent->key);
+
hashmap_clear_and_free(&lm->paths, struct last_modified_entry, hashent);
release_revisions(&lm->rev);
}
@@ -63,6 +72,9 @@ static void add_path_from_diff(struct diff_queue_struct *q,
FLEX_ALLOC_STR(ent, path, path);
oidcpy(&ent->oid, &p->two->oid);
+ if (lm->rev.bloom_filter_settings)
+ bloom_key_fill(&ent->key, path, strlen(path),
+ lm->rev.bloom_filter_settings);
hashmap_entry_init(&ent->hashent, strhash(ent->path));
hashmap_add(&lm->paths, &ent->hashent);
}
@@ -139,6 +151,7 @@ static void mark_path(const char *path, const struct object_id *oid,
last_modified_emit(data->lm, path, data->commit);
hashmap_remove(&data->lm->paths, &ent->hashent, path);
+ bloom_key_clear(&ent->key);
free(ent);
}
@@ -182,6 +195,30 @@ static void last_modified_diff(struct diff_queue_struct *q,
}
}
+static bool maybe_changed_path(struct last_modified *lm, struct commit *origin)
+{
+ struct bloom_filter *filter;
+ struct last_modified_entry *ent;
+ struct hashmap_iter iter;
+
+ if (!lm->rev.bloom_filter_settings)
+ return true;
+
+ if (commit_graph_generation(origin) == GENERATION_NUMBER_INFINITY)
+ return true;
+
+ filter = get_bloom_filter(lm->rev.repo, origin);
+ if (!filter)
+ return true;
+
+ hashmap_for_each_entry(&lm->paths, &iter, ent, hashent) {
+ if (bloom_filter_contains(filter, &ent->key,
+ lm->rev.bloom_filter_settings))
+ return true;
+ }
+ return false;
+}
+
static int last_modified_run(struct last_modified *lm)
{
struct last_modified_callback_data data = { .lm = lm };
@@ -202,9 +239,14 @@ static int last_modified_run(struct last_modified *lm)
&data.commit->object.oid, "",
&lm->rev.diffopt);
diff_flush(&lm->rev.diffopt);
- } else {
- log_tree_commit(&lm->rev, data.commit);
+
+ break;
}
+
+ if (!maybe_changed_path(lm, data.commit))
+ continue;
+
+ log_tree_commit(&lm->rev, data.commit);
}
return 0;
@@ -231,6 +273,8 @@ static int last_modified_init(struct last_modified *lm, struct repository *r,
return argc;
}
+ lm->rev.bloom_filter_settings = get_bloom_filter_settings(lm->rev.repo);
+
if (populate_paths_from_revs(lm) < 0)
return error(_("unable to setup last-modified"));
diff --git a/commit-graph.c b/commit-graph.c
index e0d92b816f..a74ac342b3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -821,7 +821,12 @@ int corrected_commit_dates_enabled(struct repository *r)
struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
{
- struct commit_graph *g = r->objects->commit_graph;
+ struct commit_graph *g;
+
+ if (!prepare_commit_graph(r))
+ return NULL;
+
+ g = r->objects->commit_graph;
while (g) {
if (g->bloom_filter_settings)
return g->bloom_filter_settings;
--
2.50.1.327.g047016eb4a
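For readers following along, the short-circuit described in the commit
message of this patch can be sketched outside of Git. The block below is
a minimal toy under stated assumptions, not Git's bloom.h API: the names
toy_filter, toy_hash(), toy_filter_add(), toy_filter_maybe_contains()
and toy_maybe_changed_path() are hypothetical, and the filter size and
hash function are arbitrary choices made for illustration. It only shows
the shape of the check: a "no" from the filter for every remaining path
is definitive, while any "maybe" means the diff still has to be computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS 512u	/* filter size in bits; arbitrary for this sketch */
#define TOY_HASHES 3u	/* number of probes per path; also arbitrary */

/* Hypothetical stand-in for a per-commit changed-path filter. */
struct toy_filter {
	uint8_t bits[TOY_BITS / 8];
};

/* FNV-1a over the path, mixed with a per-probe seed (an assumption). */
static uint32_t toy_hash(const char *path, uint32_t seed)
{
	uint32_t h = 2166136261u ^ seed;

	for (; *path; path++) {
		h ^= (uint8_t)*path;
		h *= 16777619u;
	}
	return h % TOY_BITS;
}

/* Record that `path` was changed by the commit owning this filter. */
void toy_filter_add(struct toy_filter *f, const char *path)
{
	assert(f && path);
	for (uint32_t i = 0; i < TOY_HASHES; i++) {
		uint32_t bit = toy_hash(path, i);
		f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
	}
}

/* false means "definitely not changed"; true only means "maybe". */
bool toy_filter_maybe_contains(const struct toy_filter *f, const char *path)
{
	for (uint32_t i = 0; i < TOY_HASHES; i++) {
		uint32_t bit = toy_hash(path, i);
		if (!(f->bits[bit / 8] & (1u << (bit % 8))))
			return false;
	}
	return true;
}

/*
 * Decide whether a diff is needed for one commit. Without a filter we
 * cannot rule anything out, so we answer "maybe" and let the caller
 * compute the diff.
 */
bool toy_maybe_changed_path(const struct toy_filter *f,
			    const char *const *paths, size_t nr)
{
	if (!f)
		return true;
	for (size_t i = 0; i < nr; i++)
		if (toy_filter_maybe_contains(f, paths[i]))
			return true;
	return false;
}
```

A caller walking history would skip the diff for a commit whenever
toy_maybe_changed_path() returns false. Because a Bloom filter can
report false positives but never false negatives, this can only cost an
unnecessary diff, never a missed change.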
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-05 9:33 ` [PATCH v7 0/3] " Toon Claes
@ 2025-08-05 14:34 ` Patrick Steinhardt
2025-08-05 16:21 ` Junio C Hamano
2025-08-05 16:34 ` Junio C Hamano
1 sibling, 1 reply; 135+ messages in thread
From: Patrick Steinhardt @ 2025-08-05 14:34 UTC (permalink / raw)
To: Toon Claes
Cc: git, Junio C Hamano, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King
On Tue, Aug 05, 2025 at 11:33:55AM +0200, Toon Claes wrote:
> Changes in v7:
> - Fix case when bloom filters were used and a commit range was given. This bug
> was uncovered in CI.
> - Rename the long option for `-t` to `--show-trees`. This option no longer
> implies option `-r`. And resemble these changes in the documentation, with a
> few other small documentation tweaks.
> - Move prepare_commit_graph() into get_bloom_filter_settings() which no longer
> requires last-modified to worry about it itself. This is similar to
> repo_find_commit_pos_in_graph() and lookup_commit_in_graph()
> - Bring back the call to commit_graph_generation() in maybe_changed_path(). This
> is also called in the same function in blame.c and in
> check_maybe_different_in_bloom_filter() in revision.c. I couldn't find a test
> case that triggers this exit condition, but it should not have negative
> side-effects.
> - No longer call diff_free() on the copy we make when populating the `paths` of
> `struct last_modified`. Because we weren't doing a deep copy, this could clean
> up fields used later on by the original. Instead only call clear_pathspec(). A
> comment to clarify this mechanism better is added.
> - Add BUG() call to exit condition that shouldn't happen.
> - Switch some int types to bool types.
This version looks good to me, thanks!
Patrick
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-05 14:34 ` Patrick Steinhardt
@ 2025-08-05 16:21 ` Junio C Hamano
0 siblings, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-08-05 16:21 UTC (permalink / raw)
To: Patrick Steinhardt
Cc: Toon Claes, git, Kristoffer Haugsbakk, Taylor Blau,
Derrick Stolee, Christian Couder, Jeff King
Patrick Steinhardt <ps@pks.im> writes:
> On Tue, Aug 05, 2025 at 11:33:55AM +0200, Toon Claes wrote:
>> Changes in v7:
>> - Fix case when bloom filters were used and a commit range was given. This bug
>> was uncovered in CI.
>> - Rename the long option for `-t` to `--show-trees`. This option no longer
>> implies option `-r`. And resemble these changes in the documentation, with a
>> few other small documentation tweaks.
>> - Move prepare_commit_graph() into get_bloom_filter_settings() which no longer
>> requires last-modified to worry about it itself. This is similar to
>> repo_find_commit_pos_in_graph() and lookup_commit_in_graph()
>> - Bring back the call to commit_graph_generation() in maybe_changed_path(). This
>> is also called in the same function in blame.c and in
>> check_maybe_different_in_bloom_filter() in revision.c. I couldn't find a test
>> case that triggers this exit condition, but it should not have negative
>> side-effects.
>> - No longer call diff_free() on the copy we make when populating the `paths` of
>> `struct last_modified`. Because we weren't doing a deep copy, this could clean
>> up fields used later on by the original. Instead only call clear_pathspec(). A
>> comment to clarify this mechanism better is added.
>> - Add BUG() call to exit condition that shouldn't happen.
>> - Switch some int types to bool types.
>
> This version looks good to me, thanks!
Thanks, both of you.
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-05 9:33 ` [PATCH v7 0/3] " Toon Claes
2025-08-05 14:34 ` Patrick Steinhardt
@ 2025-08-05 16:34 ` Junio C Hamano
2025-08-05 16:55 ` Toon Claes
1 sibling, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-08-05 16:34 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Jean-Noël Avila
Toon Claes <toon@iotcl.com> writes:
> Changes in v7:
> - Fix case when bloom filters were used and a commit range was given. This bug
> was uncovered in CI.
> - Rename the long option for `-t` to `--show-trees`. This option no longer
> implies option `-r`. And resemble these changes in the documentation, with a
> few other small documentation tweaks.
> - Move prepare_commit_graph() into get_bloom_filter_settings() which no longer
> requires last-modified to worry about it itself. This is similar to
> repo_find_commit_pos_in_graph() and lookup_commit_in_graph()
> - Bring back the call to commit_graph_generation() in maybe_changed_path(). This
> is also called in the same function in blame.c and in
> check_maybe_different_in_bloom_filter() in revision.c. I couldn't find a test
> case that triggers this exit condition, but it should not have negative
> side-effects.
> - No longer call diff_free() on the copy we make when populating the `paths` of
> `struct last_modified`. Because we weren't doing a deep copy, this could clean
> up fields used later on by the original. Instead only call clear_pathspec(). A
> comment to clarify this mechanism better is added.
> - Add BUG() call to exit condition that shouldn't happen.
> - Switch some int types to bool types.
I am happy with the updates, but am wondering if documentation
update along the lines of attached patch is also needed. I am not
sure about the last two, i.e. things that are not dash+option
appearing as enumeration labels, though (and Cc'ing Jean-Noël to ask
for help).
Thanks.
Documentation/git-last-modified.adoc | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git c/Documentation/git-last-modified.adoc w/Documentation/git-last-modified.adoc
index 35bd4a1dd0..602843e095 100644
--- c/Documentation/git-last-modified.adoc
+++ w/Documentation/git-last-modified.adoc
@@ -22,24 +22,24 @@ THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
OPTIONS
-------
--r::
---recursive::
+`-r`::
+`--recursive`::
Instead of showing tree entries, step into subtrees and show all entries
inside them recursively.
--t::
---show-trees::
+`-t`::
+`--show-trees`::
Show tree entries even when recursing into them. It has no effect
without `--recursive`.
-<revision-range>::
+`<revision-range>`::
Only traverse commits in the specified revision range. When no
`<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
history leading to the current commit). For a complete list of ways to
spell `<revision-range>`, see the 'Specifying Ranges' section of
linkgit:gitrevisions[7].
-[--] <path>...::
+`[--] <path>...`::
For each _<path>_ given, the commit which last modified it is returned.
Without an optional path parameter, all files and subdirectories
in path traversal the are included in the output.
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-05 16:34 ` Junio C Hamano
@ 2025-08-05 16:55 ` Toon Claes
2025-08-05 17:20 ` Jean-Noël AVILA
2025-08-05 18:28 ` Junio C Hamano
0 siblings, 2 replies; 135+ messages in thread
From: Toon Claes @ 2025-08-05 16:55 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jean-Noël Avila
Junio C Hamano <gitster@pobox.com> writes:
> I am happy with the updates, but am wondering if documentation
> update along the lines of attached patch is also needed.
Ah (annoyed grunt), I should have added backticks. Yes. I missed those,
sorry about that.
> I am not sure about the last two, i.e. things that are not dash+option
> appearing as enumeration labels, though (and Cc'ing Jean-Noël to ask
> for help).
Well, this gave me a nice opportunity to test Jean-Noël's proposed docs
linter[1].
$ make check-docs
[snip]
git-last-modified.adoc:25: '-r::' synopsis style and definition list item not backquoted
git-last-modified.adoc:26: '--recursive::' synopsis style and definition list item not backquoted
git-last-modified.adoc:30: '-t::' synopsis style and definition list item not backquoted
git-last-modified.adoc:31: '--show-trees::' synopsis style and definition list item not backquoted
It seems only dashed options should be backquoted.
[1]: https://lore.kernel.org/git/pull.1945.git.1754399033.gitgitgadget@gmail.com/
--
Cheers,
Toon
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-05 16:55 ` Toon Claes
@ 2025-08-05 17:20 ` Jean-Noël AVILA
2025-08-05 21:46 ` Junio C Hamano
2025-08-05 18:28 ` Junio C Hamano
1 sibling, 1 reply; 135+ messages in thread
From: Jean-Noël AVILA @ 2025-08-05 17:20 UTC (permalink / raw)
To: Junio C Hamano, Toon Claes; +Cc: git
On Tuesday, 5 August 2025 18:55:14 CEST Toon Claes wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> > I am happy with the updates, but am wondering if documentation
> > update along the lines of attached patch is also needed.
>
> Ah (annoyed grunt), I should have added backticks. Yes. I missed those,
> sorry about that.
>
> > I am not sure about the last two, i.e. things that are not dash+option
> > appearing as enumeration labels, though (and Cc'ing Jean-Noël to ask
> > for help).
>
> Well, this gave me a nice opportunity to test Jean-Noël's proposed docs
> linter[1].
>
> $ make check-docs
> [snip]
> git-last-modified.adoc:25: '-r::' synopsis style and definition list item not backquoted
> git-last-modified.adoc:26: '--recursive::' synopsis style and definition list item not backquoted
> git-last-modified.adoc:30: '-t::' synopsis style and definition list item not backquoted
> git-last-modified.adoc:31: '--show-trees::' synopsis style and definition list item not backquoted
>
> It seems only dashed options should be backquoted.
>
> [1]: https://lore.kernel.org/git/pull.1945.git.1754399033.gitgitgadget@gmail.com/
Well, the check fails to catch all the missing cases: The last two terms
should also be formatted. For the <revision-range>, you can either enclose it
with underscores (as a placeholder) or with backticks (which the formatter
formats like a placeholder). For the last one, backticks are definitely needed
to differentiate the formatting between the placeholder and the syntax marks.
As for my patch series, this can definitely be checked. Will reroll.
Thanks
Jean-Noël
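The rule discussed above (every definition-list term should be marked
up, not only the dashed options) can be sketched as a small standalone
check. This is a hypothetical illustration and not the linter from the
referenced series: term_needs_backticks() is an invented name, it looks
at one line at a time, and it is stricter than the advice above because
it does not accept the underscore form that is also allowed for a pure
placeholder such as <revision-range>.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true when `line` is a definition-list term (it ends with
 * "::") whose label is not wrapped in backticks. Any other line is
 * treated as ordinary text and never reported.
 */
bool term_needs_backticks(const char *line)
{
	size_t len = strlen(line);

	/* Need at least one label character in front of the "::". */
	if (len < 3 || strcmp(line + len - 2, "::") != 0)
		return false;

	len -= 2; /* length of the label without the trailing "::" */
	if (len >= 2 && line[0] == '`' && line[len - 1] == '`')
		return false;
	return true;
}
```

Under these assumptions the check reports all four terms touched by the
documentation fixup in this thread, including `<revision-range>::` and
`[--] <path>...::`, which the output quoted above did not flag.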
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-05 16:55 ` Toon Claes
2025-08-05 17:20 ` Jean-Noël AVILA
@ 2025-08-05 18:28 ` Junio C Hamano
1 sibling, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-08-05 18:28 UTC (permalink / raw)
To: Toon Claes; +Cc: git, Jean-Noël Avila
Toon Claes <toon@iotcl.com> writes:
>> I am not sure about the last two, i.e. things that are not dash+option
>> appearing as enumeration labels, though (and Cc'ing Jean-Noël to ask
>> for help).
>
> Well, this gave me a nice opportunity to test Jean-Noël's proposed docs
> linter[1].
You'd need to be careful and account for the possibility that a
just-off-the-press linter may not be complete, though ;-)
> $ make check-docs
> [snip]
> git-last-modified.adoc:25: '-r::' synopsis style and definition list item not backquoted
> git-last-modified.adoc:26: '--recursive::' synopsis style and definition list item not backquoted
> git-last-modified.adoc:30: '-t::' synopsis style and definition list item not backquoted
> git-last-modified.adoc:31: '--show-trees::' synopsis style and definition list item not backquoted
>
> It seems only dashed options should be backquoted.
My go-to example has been git-commit.adoc where it has things like:
`--`::
Do not interpret any more arguments as options.
`<pathspec>...`::
When _<pathspec>_ is given on the command line, commit the contents of
the files that match the pathspec without recording the changes
already added to the index. The contents of these files are also
staged for the next commit on top of what have been staged before.
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-05 17:20 ` Jean-Noël AVILA
@ 2025-08-05 21:46 ` Junio C Hamano
2025-08-06 12:01 ` Toon Claes
0 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-08-05 21:46 UTC (permalink / raw)
To: Toon Claes; +Cc: Jean-Noël AVILA, git
Jean-Noël AVILA <jn.avila@free.fr> writes:
>> > I am not sure about the last two, i.e. things that are not dash+option
>> > appearing as enumeration labels, though (and Cc'ing Jean-Noël to ask
>> > for help).
>> ...
> Well, the check fails to catch all the missing cases: The last two terms
> should also be formatted. For the <revision-range>, you can either enclose it
> with underscores (as a placeholder) or with backticks (which the formatter
> formats like a placeholder). For the last one, backticks are definitely needed
> to differentiate the formatting between the placeholder and the syntax marks.
>
> As for my patch series, this can definitely be checked. will reroll.
This is what I queued on top of your topic to prepare the
integration today.
--- >8 ---
From: Junio C Hamano <gitster@pobox.com>
Date: Tue, 5 Aug 2025 14:37:25 -0700
Subject: [PATCH] fixup! last-modified: new subcommand to show when files were last modified
---
Documentation/git-last-modified.adoc | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
index 35bd4a1dd0..602843e095 100644
--- a/Documentation/git-last-modified.adoc
+++ b/Documentation/git-last-modified.adoc
@@ -22,24 +22,24 @@ THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
OPTIONS
-------
--r::
---recursive::
+`-r`::
+`--recursive`::
Instead of showing tree entries, step into subtrees and show all entries
inside them recursively.
--t::
---show-trees::
+`-t`::
+`--show-trees`::
Show tree entries even when recursing into them. It has no effect
without `--recursive`.
-<revision-range>::
+`<revision-range>`::
Only traverse commits in the specified revision range. When no
`<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
history leading to the current commit). For a complete list of ways to
spell `<revision-range>`, see the 'Specifying Ranges' section of
linkgit:gitrevisions[7].
-[--] <path>...::
+`[--] <path>...`::
For each _<path>_ given, the commit which last modified it is returned.
Without an optional path parameter, all files and subdirectories
in path traversal the are included in the output.
--
2.51.0-rc0-162-g220549999b
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-05 21:46 ` Junio C Hamano
@ 2025-08-06 12:01 ` Toon Claes
2025-08-06 15:38 ` Junio C Hamano
0 siblings, 1 reply; 135+ messages in thread
From: Toon Claes @ 2025-08-06 12:01 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jean-Noël AVILA, git
Junio C Hamano <gitster@pobox.com> writes:
> This is what I queued on top of your topic to prepare the
> integration today.
>
> --- >8 ---
> From: Junio C Hamano <gitster@pobox.com>
> Date: Tue, 5 Aug 2025 14:37:25 -0700
> Subject: [PATCH] fixup! last-modified: new subcommand to show when files were last modified
>
> ---
> Documentation/git-last-modified.adoc | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
> index 35bd4a1dd0..602843e095 100644
> --- a/Documentation/git-last-modified.adoc
> +++ b/Documentation/git-last-modified.adoc
> @@ -22,24 +22,24 @@ THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
> OPTIONS
> -------
>
> --r::
> ---recursive::
> +`-r`::
> +`--recursive`::
> Instead of showing tree entries, step into subtrees and show all entries
> inside them recursively.
>
> --t::
> ---show-trees::
> +`-t`::
> +`--show-trees`::
> Show tree entries even when recursing into them. It has no effect
> without `--recursive`.
>
> -<revision-range>::
> +`<revision-range>`::
> Only traverse commits in the specified revision range. When no
> `<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
> history leading to the current commit). For a complete list of ways to
> spell `<revision-range>`, see the 'Specifying Ranges' section of
> linkgit:gitrevisions[7].
>
> -[--] <path>...::
> +`[--] <path>...`::
> For each _<path>_ given, the commit which last modified it is returned.
> Without an optional path parameter, all files and subdirectories
> in path traversal the are included in the output.
> --
> 2.51.0-rc0-162-g220549999b
Looks good to me. Do you want me to reroll, or will you `--autosquash`
yourself?
--
Cheers,
Toon
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-06 12:01 ` Toon Claes
@ 2025-08-06 15:38 ` Junio C Hamano
2025-08-28 22:44 ` Junio C Hamano
0 siblings, 1 reply; 135+ messages in thread
From: Junio C Hamano @ 2025-08-06 15:38 UTC (permalink / raw)
To: Toon Claes; +Cc: Jean-Noël AVILA, git
Toon Claes <toon@iotcl.com> writes:
> Looks good to me. Do you want me to reroll, or will you `--autosquash`
> yourself?
I can do the latter, unless there are other reasons that make it
necessary to update the patches. We'll see.
Thanks.
* Re: [PATCH v7 0/3] Introduce git-last-modified(1) command
2025-08-06 15:38 ` Junio C Hamano
@ 2025-08-28 22:44 ` Junio C Hamano
0 siblings, 0 replies; 135+ messages in thread
From: Junio C Hamano @ 2025-08-28 22:44 UTC (permalink / raw)
To: Toon Claes; +Cc: Jean-Noël AVILA, git
Junio C Hamano <gitster@pobox.com> writes:
> Toon Claes <toon@iotcl.com> writes:
>
>> Looks good to me. Do you want me to reroll, or will you `--autosquash`
>> yourself?
>
> I can do the latter, unless there are other reasons that make it
> necessary to update the patches. We'll see.
Sorry, but it seems that I dropped the ball after this exchange.
The topic still has the fixup! sitting at the top. If there are no
further changes needed, let me squash it into the base commit and
then mark the topic for 'next'.
Documentation/git-last-modified.adoc | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/Documentation/git-last-modified.adoc b/Documentation/git-last-modified.adoc
index 35bd4a1dd0..602843e095 100644
--- a/Documentation/git-last-modified.adoc
+++ b/Documentation/git-last-modified.adoc
@@ -22,24 +22,24 @@ THIS COMMAND IS EXPERIMENTAL. THE BEHAVIOR MAY CHANGE.
OPTIONS
-------
--r::
---recursive::
+`-r`::
+`--recursive`::
Instead of showing tree entries, step into subtrees and show all entries
inside them recursively.
--t::
---show-trees::
+`-t`::
+`--show-trees`::
Show tree entries even when recursing into them. It has no effect
without `--recursive`.
-<revision-range>::
+`<revision-range>`::
Only traverse commits in the specified revision range. When no
`<revision-range>` is specified, it defaults to `HEAD` (i.e. the whole
history leading to the current commit). For a complete list of ways to
spell `<revision-range>`, see the 'Specifying Ranges' section of
linkgit:gitrevisions[7].
-[--] <path>...::
+`[--] <path>...`::
For each _<path>_ given, the commit which last modified it is returned.
Without an optional path parameter, all files and subdirectories
in path traversal the are included in the output.
--
2.51.0-262-gbae8ff527a
Thread overview: 135+ messages
2025-04-22 17:46 [PATCH RFC 0/5] Introduce git-blame-tree(1) command Toon Claes
2025-04-22 17:46 ` [PATCH RFC 1/5] blame-tree: introduce new subcommand to blame files Toon Claes
2025-04-24 16:19 ` Junio C Hamano
2025-05-07 13:13 ` Toon Claes
2025-04-22 17:46 ` [PATCH RFC 2/5] t/perf: add blame-tree perf script Toon Claes
2025-04-22 17:46 ` [PATCH RFC 3/5] blame-tree: use Bloom filters when available Toon Claes
2025-04-22 17:46 ` [PATCH RFC 4/5] blame-tree: implement faster algorithm Toon Claes
2025-04-22 17:46 ` [PATCH RFC 5/5] blame-tree.c: initialize revision machinery without walk Toon Claes
2025-04-23 13:26 ` [PATCH RFC 0/5] Introduce git-blame-tree(1) command Marc Branchaud
2025-05-07 14:22 ` Toon Claes
2025-05-07 20:23 ` Marc Branchaud
2025-05-07 20:45 ` Junio C Hamano
2025-05-08 13:26 ` Marc Branchaud
2025-05-08 14:26 ` Junio C Hamano
2025-05-08 15:12 ` Marc Branchaud
2025-05-14 14:42 ` Toon Claes
2025-05-14 19:29 ` Junio C Hamano
2025-05-14 21:15 ` Marc Branchaud
2025-05-15 13:29 ` Patrick Steinhardt
2025-05-15 16:39 ` Junio C Hamano
2025-05-15 17:39 ` Marc Branchaud
2025-05-15 19:30 ` Jeff King
2025-05-16 4:38 ` Patrick Steinhardt
2025-05-20 8:49 ` Toon Claes
2025-05-15 17:30 ` Marc Branchaud
2025-05-16 4:30 ` Patrick Steinhardt
2025-05-14 21:15 ` Marc Branchaud
2025-05-07 20:49 ` Kristoffer Haugsbakk
2025-05-08 13:20 ` D. Ben Knoble
2025-05-08 13:26 ` Marc Branchaud
2025-05-08 13:18 ` D. Ben Knoble
2025-05-23 9:33 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 1/5] last-modified: new subcommand to show when files were last modified Toon Claes
2025-05-25 20:07 ` Justin Tobler
2025-06-05 8:32 ` Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-06-13 9:34 ` Toon Claes
2025-06-13 9:52 ` Kristoffer Haugsbakk
2025-05-23 9:33 ` [PATCH RFC v2 2/5] t/perf: add last-modified perf script Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 3/5] last-modified: use Bloom filters when available Toon Claes
2025-05-27 10:40 ` Patrick Steinhardt
2025-06-13 11:05 ` Toon Claes
2025-05-23 9:33 ` [PATCH RFC v2 4/5] last-modified: implement faster algorithm Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-05-23 9:33 ` [PATCH RFC v2 5/5] last-modified: initialize revision machinery without walk Toon Claes
2025-05-27 10:39 ` Patrick Steinhardt
2025-07-01 20:35 ` [PATCH RFC v2 0/5] Introduce git-last-modified(1) command Kristoffer Haugsbakk
2025-07-01 21:06 ` Junio C Hamano
2025-07-01 21:30 ` Kristoffer Haugsbakk
2025-07-02 13:00 ` Toon Claes
2025-07-09 15:53 ` Toon Claes
2025-07-09 17:00 ` Junio C Hamano
2025-06-30 18:49 ` [PATCH RFC v3 0/3] " Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-01 20:20 ` Kristoffer Haugsbakk
2025-07-02 11:51 ` Junio C Hamano
2025-06-30 18:49 ` [PATCH RFC v3 2/3] t/perf: add last-modified perf script Toon Claes
2025-06-30 18:49 ` [PATCH RFC v3 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-01 23:01 ` [PATCH RFC v3 0/3] Introduce git-last-modified(1) command Junio C Hamano
2025-07-09 15:26 ` [PATCH v4 " Toon Claes
2025-07-09 21:57 ` Junio C Hamano
2025-07-10 18:37 ` Junio C Hamano
2025-07-16 13:32 ` [PATCH v5 0/6] " Toon Claes
2025-07-16 13:35 ` [PATCH v5 1/6] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-18 0:02 ` Taylor Blau
2025-07-19 6:44 ` Jeff King
2025-07-22 15:50 ` Toon Claes
2025-08-01 9:09 ` Christian Couder
2025-08-01 16:59 ` Junio C Hamano
2025-07-16 13:35 ` [PATCH v5 2/6] t/perf: add last-modified perf script Toon Claes
2025-07-18 0:08 ` Taylor Blau
2025-07-22 15:52 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 3/6] last-modified: use Bloom filters when available Toon Claes
2025-07-18 0:16 ` Taylor Blau
2025-07-22 16:02 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 4/6] pretty: allow caller to disable indentation Toon Claes
2025-07-16 15:50 ` Junio C Hamano
2025-07-17 16:31 ` Toon Claes
2025-07-16 13:35 ` [PATCH v5 5/6] last-modified: support --extended format Toon Claes
2025-07-16 16:09 ` Junio C Hamano
2025-07-17 16:31 ` Toon Claes
2025-07-17 22:37 ` Junio C Hamano
2025-07-18 17:36 ` Junio C Hamano
2025-07-22 16:06 ` Toon Claes
2025-07-16 13:42 ` [PATCH v5 6/6] fixup! last-modified: use Bloom filters when available Toon Claes
2025-07-17 23:39 ` [PATCH v5 0/6] Introduce git-last-modified(1) command Taylor Blau
2025-07-22 15:35 ` Toon Claes
2025-07-30 17:59 ` Toon Claes
2025-07-31 7:45 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 0/4] " Toon Claes
2025-07-31 18:40 ` Junio C Hamano
2025-07-31 23:57 ` Junio C Hamano
2025-08-05 9:33 ` [PATCH v7 0/3] " Toon Claes
2025-08-05 14:34 ` Patrick Steinhardt
2025-08-05 16:21 ` Junio C Hamano
2025-08-05 16:34 ` Junio C Hamano
2025-08-05 16:55 ` Toon Claes
2025-08-05 17:20 ` Jean-Noël AVILA
2025-08-05 21:46 ` Junio C Hamano
2025-08-06 12:01 ` Toon Claes
2025-08-06 15:38 ` Junio C Hamano
2025-08-28 22:44 ` Junio C Hamano
2025-08-05 18:28 ` Junio C Hamano
2025-08-05 9:33 ` [PATCH v7 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-08-05 9:33 ` [PATCH v7 2/3] t/perf: add last-modified perf script Toon Claes
2025-08-05 9:33 ` [PATCH v7 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-30 17:55 ` [PATCH v6 1/4] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-31 6:42 ` Patrick Steinhardt
2025-08-01 16:22 ` Toon Claes
2025-08-01 17:09 ` Junio C Hamano
2025-08-04 6:34 ` Patrick Steinhardt
2025-08-04 17:14 ` Junio C Hamano
2025-08-05 5:35 ` Toon Claes
2025-08-01 20:34 ` Jean-Noël AVILA
2025-08-05 5:36 ` Toon Claes
2025-08-04 6:33 ` Patrick Steinhardt
2025-08-01 10:18 ` Christian Couder
2025-08-01 10:22 ` Patrick Steinhardt
2025-08-01 17:06 ` Junio C Hamano
2025-08-02 8:18 ` Christian Couder
2025-08-02 11:31 ` Christian Couder
2025-08-02 13:38 ` Christian Couder
2025-08-02 16:26 ` Junio C Hamano
2025-08-04 6:35 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 2/4] t/perf: add last-modified perf script Toon Claes
2025-07-30 17:55 ` [PATCH v6 3/4] commit-graph: export prepare_commit_graph() Toon Claes
2025-07-31 6:42 ` Patrick Steinhardt
2025-07-30 17:55 ` [PATCH v6 4/4] last-modified: use Bloom filters when available Toon Claes
2025-07-31 6:43 ` Patrick Steinhardt
2025-08-01 16:23 ` Toon Claes
2025-08-04 6:33 ` Patrick Steinhardt
2025-07-09 15:26 ` [PATCH v4 1/3] last-modified: new subcommand to show when files were last modified Toon Claes
2025-07-09 15:26 ` [PATCH v4 2/3] t/perf: add last-modified perf script Toon Claes
2025-07-09 15:26 ` [PATCH v4 3/3] last-modified: use Bloom filters when available Toon Claes
2025-07-16 13:35 ` [PATCH v5 6/6] fixup! " Toon Claes