* [RFC] git blame-tree
@ 2011-03-02 16:40 Jeff King
2011-03-02 17:16 ` Jeff King
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Jeff King @ 2011-03-02 16:40 UTC (permalink / raw)
To: git
[I know, I know, another RFC. I'll get to actually cleaning up and
submitting some of these patches soon.]
It's sometimes useful to get a list of files in a tree along with the
last commit that touched them. This is the default tree view shown on
github.com, but it can also be handy from the command line (there has
been talk lately of having a "git ls"), or as plumbing for a local
fancier tree view. E.g., something like:
add.c 6e7293e git-add: make -A description clearer vs. -u
apply.c fd03881 add description parameter to OPT__VERBOSE
blame.c 9ca1169 parse-options: Don't call parse_options_check() so much
branch.c 62270f6 branch_merged: fix grammar in warning
bundle.c 62b4698 Use angles for placeholders consistently
The obvious naive way to do this is something like:
for i in `git ls-tree --name-only HEAD`; do
echo "`git rev-list -1 --no-merges HEAD -- $i` $i";
done
which is really slow, because we end up traversing the same commits many
times (plus the startup overhead for each rev-list). It takes about 35
seconds to run on git.git.
So the next obvious thing is to do one traversal, output the changed
files for each commit, and then mark each file as you see it. The perl
script below does this (though the careful reader will note it is
actually buggy with sub-trees; I didn't bother fixing it since it was
just a stage in the evolution):
-- >8 --
#!/usr/bin/perl
my $tree = shift;
my @revs = @ARGV;
my @files = `git ls-tree --name-only $tree`;
chomp @files;
my $num_interesting = @files;
open(my $fh, '-|', join(' ',
"git rev-list --no-merges",
@revs,
"| git diff-tree --stdin --name-status")
) or die "unable to run rev-list: $!";
my %last = map { $_ => undef } @files;
while($num_interesting and $_ = <$fh>) {
if (/^[0-9a-f]{40}$/) {
$sha1 = $&;
}
elsif (/^.\t(.*)/) {
next unless exists $last{$1};
next if defined $last{$1};
$last{$1} = $sha1;
$num_interesting--;
}
}
foreach my $file (sort keys(%last)) {
print "$last{$file} $file\n";
}
-- 8< --
This runs in about 3 seconds. And besides the above-mentioned bug,
also doesn't properly handle things like filenames that need quoting.
So I wrote it in C, which drops the time down to about 1.5 seconds, and
of course doesn't have any parsing issues. The patch is below.
I wasn't sure at first what to call it or what the calling conventions
should be. The initial thought was to make it part of "ls-tree". But
that feels wrong, as ls-tree otherwise never cares about traversal. The
combination of traversal and diff made me think of blame, and indeed, I
think this is really just about blaming a whole tree at the file-level,
rather than at the content-level. Thus I called it blame-tree, and I
used the same calling conventions as blame: "git blame-tree <path>
<rev opts>". See the test script for examples.
I have many thoughts on the patch already, but rather than put them
here, I'll include the patch without further ado, and put them inline in
a reply.
---
.gitignore | 1 +
Makefile | 3 +
blame-tree.c | 156 +++++++++++++++++++++++++++++++++++++++++++++++++
blame-tree.h | 25 ++++++++
builtin.h | 1 +
builtin/blame-tree.c | 34 +++++++++++
git.c | 1 +
t/t8010-blame-tree.sh | 90 ++++++++++++++++++++++++++++
8 files changed, 311 insertions(+), 0 deletions(-)
create mode 100644 blame-tree.c
create mode 100644 blame-tree.h
create mode 100644 builtin/blame-tree.c
create mode 100755 t/t8010-blame-tree.sh
diff --git a/.gitignore b/.gitignore
index c460c66..20c6501 100644
--- a/.gitignore
+++ b/.gitignore
@@ -14,6 +14,7 @@
/git-bisect
/git-bisect--helper
/git-blame
+/git-blame-tree
/git-branch
/git-bundle
/git-cat-file
diff --git a/Makefile b/Makefile
index ade7923..918ce1a 100644
--- a/Makefile
+++ b/Makefile
@@ -494,6 +494,7 @@ VCSSVN_LIB=vcs-svn/lib.a
LIB_H += advice.h
LIB_H += archive.h
LIB_H += attr.h
+LIB_H += blame-tree.h
LIB_H += blob.h
LIB_H += builtin.h
LIB_H += cache.h
@@ -571,6 +572,7 @@ LIB_OBJS += archive-zip.o
LIB_OBJS += attr.o
LIB_OBJS += base85.o
LIB_OBJS += bisect.o
+LIB_OBJS += blame-tree.o
LIB_OBJS += blob.o
LIB_OBJS += branch.o
LIB_OBJS += bundle.o
@@ -683,6 +685,7 @@ BUILTIN_OBJS += builtin/apply.o
BUILTIN_OBJS += builtin/archive.o
BUILTIN_OBJS += builtin/bisect--helper.o
BUILTIN_OBJS += builtin/blame.o
+BUILTIN_OBJS += builtin/blame-tree.o
BUILTIN_OBJS += builtin/branch.o
BUILTIN_OBJS += builtin/bundle.o
BUILTIN_OBJS += builtin/cat-file.o
diff --git a/blame-tree.c b/blame-tree.c
new file mode 100644
index 0000000..79ceb7e
--- /dev/null
+++ b/blame-tree.c
@@ -0,0 +1,156 @@
+#include "cache.h"
+#include "blame-tree.h"
+#include "commit.h"
+#include "diff.h"
+#include "diffcore.h"
+#include "revision.h"
+
+void blame_tree_init(struct blame_tree *bt)
+{
+ memset(bt, 0, sizeof(*bt));
+ bt->paths.strdup_strings = 1;
+ init_revisions(&bt->rev, NULL);
+ bt->rev.no_merges = 1;
+ bt->rev.def = "HEAD";
+}
+
+void blame_tree_release(struct blame_tree *bt)
+{
+ string_list_clear(&bt->paths, 0);
+ free(bt->prefix);
+}
+
+static int add_path(const unsigned char *sha1, const char *base, int baselen,
+ const char *name, unsigned mode, int stage, void *data)
+{
+ struct string_list *paths = data;
+ string_list_append(paths, name);
+ return 0;
+}
+
+static int add_from_revs(struct blame_tree *bt)
+{
+ struct object_array_entry *obj;
+ unsigned char sha1[20];
+ unsigned mode;
+ struct tree *tree;
+
+ if (!bt->rev.pending.nr)
+ return error("no revisions specified");
+
+ obj = bt->rev.pending.objects;
+ if (get_tree_entry(obj->item->sha1, bt->prefix, sha1, &mode) < 0)
+ return error("no such path: %s:%s", obj->name, bt->prefix);
+
+ tree = parse_tree_indirect(sha1);
+ if (!tree)
+ return error("not a tree: %s:%s", obj->name, bt->prefix);
+
+ if (read_tree_recursive(tree, "", 0, 0, NULL, add_path, &bt->paths) < 0)
+ return error("unable to read tree object");
+
+ sort_string_list(&bt->paths);
+ return 0;
+}
+
+void blame_tree_set_path(struct blame_tree *bt, const char *path,
+ const char *prefix)
+{
+ if (!path) {
+ bt->prefix = xstrdup("");
+ return;
+ }
+ bt->prefix = prefix_path(prefix, prefix ? strlen(prefix) : 0, path);
+}
+
+void blame_tree_set_revs(struct blame_tree *bt, int argc, const char **argv)
+{
+ setup_revisions(argc, argv, &bt->rev, NULL);
+}
+
+struct blame_tree_callback_data {
+ struct commit *commit;
+ struct string_list *paths;
+ int num_interesting;
+
+ blame_tree_callback callback;
+ void *callback_data;
+};
+
+static void process_diff(struct diff_queue_struct *q,
+ struct diff_options *opt, void *cbdata)
+{
+ struct blame_tree_callback_data *data = cbdata;
+ int i;
+
+ for (i = 0; i < q->nr; i++) {
+ struct diff_filepair *p = q->queue[i];
+ const char *path = p->one->path;
+ struct string_list_item *item;
+
+ item = string_list_lookup(data->paths, path);
+ /* Not an interesting path to us */
+ if (!item)
+ continue;
+ /* We already found its latest commit */
+ if (item->util)
+ continue;
+
+ item->util = data->commit;
+ data->num_interesting--;
+ if (data->callback)
+ data->callback(path, data->commit, data->callback_data);
+ }
+}
+
+int blame_tree_run(struct blame_tree *bt, blame_tree_callback cb, void *cbdata)
+{
+ struct blame_tree_callback_data data;
+ struct commit *commit;
+
+ if (add_from_revs(bt) < 0)
+ return -1;
+
+ data.paths = &bt->paths;
+ data.num_interesting = bt->paths.nr;
+ data.callback = cb;
+ data.callback_data = cbdata;
+
+ bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK;
+ bt->rev.diffopt.format_callback = process_diff;
+ bt->rev.diffopt.format_callback_data = &data;
+ diff_setup_done(&bt->rev.diffopt);
+
+ prepare_revision_walk(&bt->rev);
+
+ while (data.num_interesting > 0 &&
+ (commit = get_revision(&bt->rev)) != NULL) {
+ unsigned char to_sha1[20];
+ unsigned mode;
+
+ if (get_tree_entry(commit->object.sha1, bt->prefix,
+ to_sha1, &mode) < 0)
+ continue;
+
+ data.commit = commit;
+
+ if (commit->parents) {
+ struct commit_list *p;
+ for (p = commit->parents; p; p = p->next) {
+ unsigned char from_sha1[20];
+ if (get_tree_entry(p->item->object.sha1,
+ bt->prefix,
+ from_sha1, &mode) < 0)
+ diff_root_tree_sha1(to_sha1, "",
+ &bt->rev.diffopt);
+ else
+ diff_tree_sha1(from_sha1, to_sha1,
+ "", &bt->rev.diffopt);
+ }
+ }
+ else
+ diff_root_tree_sha1(to_sha1, "", &bt->rev.diffopt);
+ diff_flush(&bt->rev.diffopt);
+ }
+ return 0;
+}
diff --git a/blame-tree.h b/blame-tree.h
new file mode 100644
index 0000000..c39f203
--- /dev/null
+++ b/blame-tree.h
@@ -0,0 +1,25 @@
+#ifndef BLAME_TREE_H
+#define BLAME_TREE_H
+
+#include "commit.h"
+#include "diff.h"
+#include "revision.h"
+#include "string-list.h"
+
+struct blame_tree {
+ struct string_list paths;
+ char *prefix;
+ struct rev_info rev;
+};
+
+void blame_tree_init(struct blame_tree *);
+void blame_tree_release(struct blame_tree *);
+
+void blame_tree_set_path(struct blame_tree *, const char *path,
+ const char *prefix);
+void blame_tree_set_revs(struct blame_tree *, int argc, const char **argv);
+
+typedef void (*blame_tree_callback)(const char *, const struct commit *, void *);
+int blame_tree_run(struct blame_tree *, blame_tree_callback, void *);
+
+#endif /* BLAME_TREE_H */
diff --git a/builtin.h b/builtin.h
index 0e9da90..01f7842 100644
--- a/builtin.h
+++ b/builtin.h
@@ -143,5 +143,6 @@ extern int cmd_verify_pack(int argc, const char **argv, const char *prefix);
extern int cmd_show_ref(int argc, const char **argv, const char *prefix);
extern int cmd_pack_refs(int argc, const char **argv, const char *prefix);
extern int cmd_replace(int argc, const char **argv, const char *prefix);
+extern int cmd_blame_tree(int argc, const char **argv, const char *prefix);
#endif
diff --git a/builtin/blame-tree.c b/builtin/blame-tree.c
new file mode 100644
index 0000000..3f580c6
--- /dev/null
+++ b/builtin/blame-tree.c
@@ -0,0 +1,34 @@
+#include "cache.h"
+#include "blame-tree.h"
+#include "quote.h"
+#include "parse-options.h"
+
+static void show_entry(const char *path, const struct commit *commit, void *d)
+{
+ printf("%s\t", sha1_to_hex(commit->object.sha1));
+ write_name_quoted(path, stdout, '\n');
+ fflush(stdout);
+}
+
+int cmd_blame_tree(int argc, const char **argv, const char *prefix)
+{
+ struct blame_tree bt;
+ const char *path = NULL;
+
+ git_config(git_default_config, NULL);
+
+ if (argv[1]) {
+ path = argv[1];
+ argc--;
+ argv++;
+ }
+
+ blame_tree_init(&bt);
+ blame_tree_set_path(&bt, path, prefix);
+ blame_tree_set_revs(&bt, argc, argv);
+ if (blame_tree_run(&bt, show_entry, NULL) < 0)
+ die("error running blame-tree traversal");
+ blame_tree_release(&bt);
+
+ return 0;
+}
diff --git a/git.c b/git.c
index ef598c3..54ad91d 100644
--- a/git.c
+++ b/git.c
@@ -318,6 +318,7 @@ static void handle_internal_command(int argc, const char **argv)
{ "archive", cmd_archive },
{ "bisect--helper", cmd_bisect__helper, RUN_SETUP | NEED_WORK_TREE },
{ "blame", cmd_blame, RUN_SETUP },
+ { "blame-tree", cmd_blame_tree, RUN_SETUP },
{ "branch", cmd_branch, RUN_SETUP },
{ "bundle", cmd_bundle, RUN_SETUP_GENTLY },
{ "cat-file", cmd_cat_file, RUN_SETUP },
diff --git a/t/t8010-blame-tree.sh b/t/t8010-blame-tree.sh
new file mode 100755
index 0000000..21833f3
--- /dev/null
+++ b/t/t8010-blame-tree.sh
@@ -0,0 +1,90 @@
+#!/bin/sh
+
+test_description='basic blame-tree tests'
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+ test_commit 1 file &&
+ mkdir sub &&
+ test_commit 2 sub/file &&
+ mkdir -p deep/sub/nesting &&
+ test_commit 3 deep/sub/nesting/file
+'
+
+cat >expect.root <<'EOF'
+1 file
+2 sub
+3 deep
+EOF
+
+echo 2 file >expect.sub
+echo 3 sub >expect.deep
+echo 3 nesting >expect.deep.sub
+echo 3 file >expect.deep.sub.nesting
+
+check() {
+ expect=$1; shift
+ git blame-tree "$@" >actual &&
+ git name-rev --stdin --name-only --tags <actual >tmp &&
+ mv tmp actual &&
+ tr '\t' ' ' <actual >tmp &&
+ mv tmp actual &&
+ sort <actual >tmp &&
+ mv tmp actual &&
+ test_cmp "$expect" actual
+}
+
+test_expect_success 'blame root' '
+ check expect.root . HEAD
+'
+
+test_expect_success 'blame subdir' '
+ check expect.sub sub HEAD
+'
+
+test_expect_success 'blame nested subdirs' '
+ check expect.deep deep HEAD &&
+ check expect.deep.sub deep/sub HEAD &&
+ check expect.deep.sub.nesting deep/sub/nesting
+'
+
+test_expect_success 'assume HEAD if no rev opts' '
+ check expect.root .
+'
+
+test_expect_success 'assume root if no path opt' '
+ check expect.root
+'
+
+test_expect_success 'blame from older revision' '
+ echo 1 file >expect &&
+ check expect . HEAD~2
+'
+
+test_expect_success 'rev limiting works' '
+ echo 3 deep >expect &&
+ check expect . -1
+'
+
+test_expect_success 'complaint about a bogus path' '
+ test_must_fail git blame-tree bogus HEAD
+'
+
+test_expect_success 'complain about a non-tree' '
+ test_must_fail git blame-tree file HEAD
+'
+
+test_expect_success 'blame from subdir defaults to root' '
+ (cd deep &&
+ check ../expect.root
+ )
+'
+
+test_expect_success 'blame from subdir uses relative paths' '
+ (cd deep &&
+ check ../expect.deep . &&
+ check ../expect.deep.sub sub
+ )
+'
+
+test_done
--
1.7.4.rc1.11.g1c440
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 16:40 [RFC] git blame-tree Jeff King
@ 2011-03-02 17:16 ` Jeff King
2011-03-02 17:51 ` Piotr Krukowiecki
` (2 more replies)
2011-03-02 17:40 ` Jeff King
2011-03-04 14:40 ` Will Palmer
2 siblings, 3 replies; 17+ messages in thread
From: Jeff King @ 2011-03-02 17:16 UTC (permalink / raw)
To: git
On Wed, Mar 02, 2011 at 11:40:32AM -0500, Jeff King wrote:
> blame-tree.c | 156 +++++++++++++++++++++++++++++++++++++++++++++++++
> blame-tree.h | 25 ++++++++
> builtin.h | 1 +
> builtin/blame-tree.c | 34 +++++++++++
I tried to lib-ify the implementation as much as possible. It increases
the lines of code, of course, but I figured there was a reasonable
chance that there might be a user-friendly "git ls" command eventually,
and this would probably make a good "-v" option to it.
I considered making it a special mode of "git blame" when blame is fed a
directory instead of a file. But the implementations aren't shared at
all (nor do I think they need to be; blame-tree is _way_ simpler). And I
didn't want to steal that concept in case somebody can think of a more
content-level way of blaming a whole tree that makes sense (obviously
just showing the concatenation of the blames of each file is one way,
but I don't know how useful that would be). If we want to go that way,
we can always catch the special case in blame and just exec blame-tree.
> +void blame_tree_init(struct blame_tree *bt)
> +{
> + memset(bt, 0, sizeof(*bt));
> + bt->paths.strdup_strings = 1;
> + init_revisions(&bt->rev, NULL);
> + bt->rev.no_merges = 1;
> + bt->rev.def = "HEAD";
> +}
I turn off merges by default, since they are unlikely to be interesting
matches (you will see the merge of a side-branch that touched a file
instead of the actual commit on the side-branch). You could of course do
"git blame-tree . --no-merges" to get the same effect. I think no-merges
makes a saner default, but sadly it doesn't seem like there is a way to
turn no-merges back off ("--merges" means something else, and there is
no --no-no-merges").
> +static int add_from_revs(struct blame_tree *bt)
> +{
> + struct object_array_entry *obj;
> + unsigned char sha1[20];
> + unsigned mode;
> + struct tree *tree;
> +
> + if (!bt->rev.pending.nr)
> + return error("no revisions specified");
> +
> + obj = bt->rev.pending.objects;
> + if (get_tree_entry(obj->item->sha1, bt->prefix, sha1, &mode) < 0)
> + return error("no such path: %s:%s", obj->name, bt->prefix);
> +
> + tree = parse_tree_indirect(sha1);
> + if (!tree)
> + return error("not a tree: %s:%s", obj->name, bt->prefix);
> +
> + if (read_tree_recursive(tree, "", 0, 0, NULL, add_path, &bt->paths) < 0)
> + return error("unable to read tree object");
> +
> + sort_string_list(&bt->paths);
> + return 0;
> +}
The initial set of interesting files we come up with is gotten by
looking at the tree of the first pending object after parsing the rev
options (defaulting to HEAD). Which sounds a little flaky to me, but
does what you want in practice. I'd be curious if somebody can come up
with a counterexample where the ability to manually specify the source
tree would be more useful.
Right now the code just handles trees. But in the long run, it would
probably make sense to get the list of files from the index, and mark
files modified in the working tree or index, too. So something like:
foo.c 1234abcd this is a commit subject
bar.c modified in working tree
baz.c modified in index
Sort of like how gitk shows "pseudo-commits" on top of history to
indicate changes.
> +static void process_diff(struct diff_queue_struct *q,
> + struct diff_options *opt, void *cbdata)
> +{
> + struct blame_tree_callback_data *data = cbdata;
> + int i;
> +
> + for (i = 0; i < q->nr; i++) {
> + struct diff_filepair *p = q->queue[i];
> + const char *path = p->one->path;
> + struct string_list_item *item;
> +
> + item = string_list_lookup(data->paths, path);
> + /* Not an interesting path to us */
> + if (!item)
> + continue;
> + /* We already found its latest commit */
> + if (item->util)
> + continue;
> +
> + item->util = data->commit;
> + data->num_interesting--;
> + if (data->callback)
> + data->callback(path, data->commit, data->callback_data);
> + }
> +}
So this is the per-commit processing. Basically we just do a diff for
each commit, and see if each path has been claimed. Note that it
depends on the string-list item->util being initialized to zero. Hence
my recent patch to string-list, and this needs to go on top of 62b8102
(which is in master and maint).
Note that the further back you go in history, the less interesting most
of the diffs will be, so you waste a lot of time generating boring diffs
and then looking up those diffs in the string list (which is at least a
binary search). I considered an implementation where we literally just
do a 3-way merge-traversal of the old tree, the new tree, and the list
of paths still to be claimed. That would make it O(# of paths) for each
commit. But I decided against it because:
1. It would be a lot more complex, having to hook into unpack_trees
instead of the diff machinery.
2. I'm not sure it would actually be that much faster. If prior
measurements are any indication, we are probably spending most of
our time unpacking the tree objects.
3. It closes the door for more advanced diff machinery, like rename
detection.
> + while (data.num_interesting > 0 &&
An optimization; we can quit the traversal as soon as everything is
claimed. In practice this helps disappointingly little. Blaming git.git,
the last thing that gets claimed is tar.h, which saves only 3838 commits
(out of ~24500) being traversed. Similarly, COPYING in linux-2.6 is the
last, saving only 8378 commits out of ~232000. Still, every little bit
helps.
> + if (commit->parents) {
> + struct commit_list *p;
> + for (p = commit->parents; p; p = p->next) {
> + unsigned char from_sha1[20];
> + if (get_tree_entry(p->item->object.sha1,
> + bt->prefix,
> + from_sha1, &mode) < 0)
> + diff_root_tree_sha1(to_sha1, "",
> + &bt->rev.diffopt);
> + else
> + diff_tree_sha1(from_sha1, to_sha1,
> + "", &bt->rev.diffopt);
> + }
> + }
My merge handling is just "which files are different from the parents".
Which is reasonable, but I don't actually exercise it since we use
--no-merges by default. :)
We could try to do something clever here about evil merges. If the sha1
for a file is different in the merge from both parents, then we know
there was at least a conflict. So we could perhaps not blame the merge
for non-conflict cases pretty easily, and that would be more useful than
the current behavior. But finding out if a conflicted file was simply
resolved or if content was introduced is much more expensive.
> + else
> + diff_root_tree_sha1(to_sha1, "", &bt->rev.diffopt);
> + diff_flush(&bt->rev.diffopt);
Hmm. I mentioned rename detection previously. Probably I should
diffcore_std here so that it actually works.
> --- /dev/null
> +++ b/builtin/blame-tree.c
> [...]
> +static void show_entry(const char *path, const struct commit *commit, void *d)
> +{
> + printf("%s\t", sha1_to_hex(commit->object.sha1));
> + write_name_quoted(path, stdout, '\n');
> + fflush(stdout);
> +}
This callback just shows entries and their commit sha1s as we find them
during the traversal. So the output is not ordered by pathname, but
rather by traversal order (which is chronological-ish, though you can
also use --date-order to get a more certain ordering). It does keep the
output streaming, which is nice if you are incrementally filling in a
display.
Probably for a porcelain command we would want to collect and sort the
results by pathname. And show actual commit onelines instead of the
sha1.
> +int cmd_blame_tree(int argc, const char **argv, const char *prefix)
> +{
> + struct blame_tree bt;
> + const char *path = NULL;
> +
> + git_config(git_default_config, NULL);
> +
> + if (argv[1]) {
> + path = argv[1];
> + argc--;
> + argv++;
> + }
Obviously no options. Probably there should at least be "--porcelain" to
output the current form, and the default output should be more
user-friendly. And probably "-z" to avoid quoting issues.
> --- /dev/null
> +++ b/t/t8010-blame-tree.sh
> [...]
> +test_description='basic blame-tree tests'
These are extremely basic. If people can think of more interesting
corner cases, I'd love to exercise this more. I definitely need to
consider merge behavior a bit more and codify it in some tests. Probably
also diffcore stuff like renames and copies should be tested.
-Peff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 16:40 [RFC] git blame-tree Jeff King
2011-03-02 17:16 ` Jeff King
@ 2011-03-02 17:40 ` Jeff King
2011-03-04 14:40 ` Will Palmer
2 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2011-03-02 17:40 UTC (permalink / raw)
To: git
On Wed, Mar 02, 2011 at 11:40:32AM -0500, Jeff King wrote:
> It's sometimes useful to get a list of files in a tree along with the
> last commit that touched them. This is the default tree view shown on
> github.com...
A note on the motivation for this patch.
Some of you may already know this, but I started working full-time for
GitHub in January, working specifically on git. Most of it is just doing
the same general git work I've been doing; obviously GitHub has a vested
interest in git being awesome, and this gives me time to make it more
so.
I'm also working on changes that are going to be useful to large service
providers like GitHub more than other people. My plan is to open-source
those changes by putting them in a form that's useful to other git users
and submitting them as patches to the list. This is the first example,
but I have one or two more so far (GitHub actually runs a surprisingly
stock git).
I wanted to mention it here because:
1. GitHub is paying money for git development, which I think is cool
and they deserve some credit. Having more git time, I'm hoping to
tackle some of the long-standing issues that I have queued up
(e.g., I'll probably be working on multi-file follow and better
support for large binary files in the near future).
2. I want to be very open about the background behind my patches. I
think the list more-or-less operates on a per-patch meritocracy,
so in theory the patches should stand on their own or not. But as a
long-time contributor, I think it's best to be honest about the
fact that my motivations aren't _just_ scratching my own itch
anymore.
I have more or less full autonomy with respect to how I spend my
time, so there will still be a lot of itch-scratching. And if
something is junk or not useful to non-GitHub users, I won't waste
your time with it. So I think in general there won't be any change
in the usefulness of my patches. But being open about it lets you
choose how many grains of salt to take my patches with. :)
-Peff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 17:16 ` Jeff King
@ 2011-03-02 17:51 ` Piotr Krukowiecki
2011-03-02 18:07 ` Jeff King
2011-03-02 18:31 ` Junio C Hamano
2011-03-03 20:20 ` Jakub Narebski
2 siblings, 1 reply; 17+ messages in thread
From: Piotr Krukowiecki @ 2011-03-02 17:51 UTC (permalink / raw)
To: Jeff King; +Cc: git
Hi,
On Wed, Mar 2, 2011 at 6:16 PM, Jeff King <peff@peff.net> wrote:
> I considered making it a special mode of "git blame" when blame is fed a
> directory instead of a file. But the implementations aren't shared at
> all (nor do I think they need to be; blame-tree is _way_ simpler). And I
git blame dir/file.c
"Show what revision and author last modified each line of a file"
git blame dir/
"Show what revision and author last modified each file"
This makes sense to me (the user).
I don't understand the implementation thing. I don't see a difference between
those two commands. Even more, if I'm educated Unix user I might know
directories are also files.
> didn't want to steal that concept in case somebody can think of a more
> content-level way of blaming a whole tree that makes sense (obviously
> just showing the concatenation of the blames of each file is one way,
> but I don't know how useful that would be). If we want to go that way,
> we can always catch the special case in blame and just exec blame-tree.
Still can be in git-blame command, no?
> The initial set of interesting files we come up with is gotten by
> looking at the tree of the first pending object after parsing the rev
> options (defaulting to HEAD). Which sounds a little flaky to me, but
> does what you want in practice. I'd be curious if somebody can come up
> with a counterexample where the ability to manually specify the source
> tree would be more useful.
Same argument as for normal blame: I want to know who modified files at
the state of commit X (if I understand the question correctly).
--
Piotrek
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 17:51 ` Piotr Krukowiecki
@ 2011-03-02 18:07 ` Jeff King
2011-03-02 18:39 ` Piotr Krukowiecki
0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2011-03-02 18:07 UTC (permalink / raw)
To: Piotr Krukowiecki; +Cc: git
On Wed, Mar 02, 2011 at 06:51:57PM +0100, Piotr Krukowiecki wrote:
> On Wed, Mar 2, 2011 at 6:16 PM, Jeff King <peff@peff.net> wrote:
> > I considered making it a special mode of "git blame" when blame is fed a
> > directory instead of a file. But the implementations aren't shared at
> > all (nor do I think they need to be; blame-tree is _way_ simpler). And I
>
> git blame dir/file.c
> "Show what revision and author last modified each line of a file"
>
> git blame dir/
> "Show what revision and author last modified each file"
Right, I think we are agreeing.
> This makes sense to me (the user). I don't understand the
> implementation thing. I don't see a difference between those two
> commands. Even more, if I'm educated Unix user I might know
> directories are also files.
I mean the implementations are very different, so there was not much
point in putting the code into builtin/blame.c.
> > didn't want to steal that concept in case somebody can think of a more
> > content-level way of blaming a whole tree that makes sense (obviously
> > just showing the concatenation of the blames of each file is one way,
> > but I don't know how useful that would be). If we want to go that way,
> > we can always catch the special case in blame and just exec blame-tree.
>
> Still can be in git-blame command, no?
Right. What I meant was that we don't have to make the decision now. If
people like blame-tree, we can later magically turn:
git blame dir
into "git blame-tree dir". So I think we are just agreeing.
> > The initial set of interesting files we come up with is gotten by
> > looking at the tree of the first pending object after parsing the rev
> > options (defaulting to HEAD). Which sounds a little flaky to me, but
> > does what you want in practice. I'd be curious if somebody can come up
> > with a counterexample where the ability to manually specify the source
> > tree would be more useful.
>
> Same argument as for normal blame: I want to know who modified files at
> the state of commit X (if I understand the question correctly).
Yeah, that's what it does now. Specifically I was wondering about more
elaborate examples, like:
git blame-tree dir branch1 branch2
It will traverse using both branch1 and branch2, but get the initial
list of files from branch1. I guess we could also union those trees or
something. But I expect most calls to be:
git blame-tree dir commit
and that's it.
-Peff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 17:16 ` Jeff King
2011-03-02 17:51 ` Piotr Krukowiecki
@ 2011-03-02 18:31 ` Junio C Hamano
2011-03-02 21:04 ` Jeff King
2011-03-05 5:41 ` Ryan Tomayko
2011-03-03 20:20 ` Jakub Narebski
2 siblings, 2 replies; 17+ messages in thread
From: Junio C Hamano @ 2011-03-02 18:31 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> I considered making it a special mode of "git blame" when blame is fed a
> directory instead of a file... ... If we want to go that way,
> we can always catch the special case in blame and just exec blame-tree.
I agree that rolling it into "blame" will be a good way to conclude this
feature from the end-user UI point of view, provided if (as you said) this
proves to be the most reasonable way to give a coarser, file-level blame
to a history.
It has been my opinion that the default github view that seems to be
modeled after cvsview is the least useful choice of the default in the
more modern git era ;-), but I guess people's eyes and brains are trained
by the old school "file boundaries matter" way of thinking.
I am not against catering to them as long as it is done in a modular way
that does not negatively affect the other parts of the code, and this
implementation certainly is a good one from a cursory reading, except for
two points.
One is merges, which you punted on with --no-merges ;-). If I were doing
this, I would probably have traversed without --no-merges, and would blame
all cases that are not TREESAME to the merge (i.e. regardless of the
cleanness of the content level merge), as this feature is about file level
blame. If the child took the whole content as-is from a parent, it is a
no-event, but if the child has the resulting content that is different
from any of its parents, then that child is responsible for the end-result
content at the file level.
Doing nothing else makes sense. There is a fundamental expectation for
file level history: if you give a commit C as the result for path F,
saying C was the one that touched path F the last, you are saying that F
that is contained in C is exactly the same as F at the tip of the history
(i.e. whereever you started digging back from). If you don't stop at a
merge that had content-level merge, you would break that expectation. You
would also rely on the local clock of the merged branches to say which
branch made a commit touching the path the last, but it is much minor
issue compared to the breakage of "result must match where we say it came
from" expectation.
> The initial set of interesting files we come up with is gotten by
> looking at the tree of the first pending object after parsing the rev
> options (defaulting to HEAD). Which sounds a little flaky to me, but
> does what you want in practice. I'd be curious if somebody can come up
> with a counterexample where the ability to manually specify the source
> tree would be more useful.
I started writing "In what sense is it flaky? What corner case do you
have in mind?" in a few lines above and then moved that here ;-).
The second point is "why didn't you use pathspec, but chose to take only
one path?" It would be useful if you can say
$ git blame v2.6.24..v2.6.37 -- net include/net
no?
> So this is the per-commit processing. Basically we just do a diff for
> each commit, and see if each path has been claimed. Note that it
> depends on the string-list item->util being initialized to zero. Hence
> my recent patch to string-list, and this needs to go on top of 62b8102
> (which is in master and maint).
Once you know who the guilty party is, can't you just remove that element
from the list, so that later search from the "remaining path" list would
decrease in cost as you go deeper?
Also at some point the set of remaining path would become small and it
might make sense to feed that as diff-tree pathspec. After all, aren't
you re-initializing the diff machinery for each commit (which is _not_ a
bad thing to do--I am just pointing out that there is an opportunity to
use narrower pathspecs as you go without any additional cost)? Note that
I am not suggesting to update the pathspec used for commit traversal in
the middle; that is a lot trickier.
>> + while (data.num_interesting > 0 &&
>
> An optimization; we can quit the traversal as soon as everything is
> claimed. In practice this helps disappointingly little.
That is somewhat unexpected, but would strongly suggest that the approach
to use more specific pathspec that cover only the remaining paths would
give you faster per-commit diff?
> My merge handling is just "which files are different from the parents".
> Which is reasonable, but I don't actually exercise it since we use
> --no-merges by default. :)
I think that is a big mistake and I already said it above.
> We could try to do something clever here about evil merges.
Again, I wouldn't even think about it; this is a "file boundary matters"
tool; you shouldn't care about cleanliness of content-level merges.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 18:07 ` Jeff King
@ 2011-03-02 18:39 ` Piotr Krukowiecki
2011-03-02 21:10 ` Jeff King
0 siblings, 1 reply; 17+ messages in thread
From: Piotr Krukowiecki @ 2011-03-02 18:39 UTC (permalink / raw)
To: Jeff King; +Cc: git
On Wed, Mar 2, 2011 at 7:07 PM, Jeff King <peff@peff.net> wrote:
> On Wed, Mar 02, 2011 at 06:51:57PM +0100, Piotr Krukowiecki wrote:
>
>> On Wed, Mar 2, 2011 at 6:16 PM, Jeff King <peff@peff.net> wrote:
>> > I considered making it a special mode of "git blame" when blame is fed a
>> > directory instead of a file. But the implementations aren't shared at
>> > all (nor do I think they need to be; blame-tree is _way_ simpler). And I
>>
>> git blame dir/file.c
>> "Show what revision and author last modified each line of a file"
>>
>> git blame dir/
>> "Show what revision and author last modified each file"
>
> Right, I think we are agreeing.
>
>> This makes sense to me (the user). I don't understand the
>> implementation thing. I don't see a difference between those two
>> commands. Even more, if I'm educated Unix user I might know
>> directories are also files.
>
> I mean the implementations are very different, so there was not much
> point in putting the code into builtin/blame.c.
Ah, ok.
>> > didn't want to steal that concept in case somebody can think of a more
>> > content-level way of blaming a whole tree that makes sense (obviously
>> > just showing the concatenation of the blames of each file is one way,
>> > but I don't know how useful that would be). If we want to go that way,
>> > we can always catch the special case in blame and just exec blame-tree.
>>
>> Still can be in git-blame command, no?
>
> Right. What I meant was that we don't have to make the decision now. If
> people like blame-tree, we can later magically turn:
>
> git blame dir
>
> into "git blame-tree dir". So I think we are just agreeing.
I hope nobody likes "blame-dir" :)
>> > The initial set of interesting files we come up with is gotten by
>> > looking at the tree of the first pending object after parsing the rev
>> > options (defaulting to HEAD). Which sounds a little flaky to me, but
>> > does what you want in practice. I'd be curious if somebody can come up
>> > with a counterexample where the ability to manually specify the source
>> > tree would be more useful.
>>
>> Same argument as for normal blame: I want to know who modified files at
>> the state of commit X (if I understand the question correctly).
>
> Yeah, that's what it does now. Specifically I was wondering about more
> elaborate examples, like:
>
> git blame-tree dir branch1 branch2
>
> It will traverse using both branch1 and branch2, but get the initial
> list of files from branch1. I guess we could also union those trees or
> something.
I'd expect this to be something like union. Currently I can only think about
following case:
Some files were changed in branch1, some in branch2, some in both.
Show me how the files are changed. For example:
file1 changed in branch1 in commit1
file2 changed in branch2 in commit2
file3 changed in branch1 in commit3 and in branch2 in commit4
If file was not changed since branch creation then don't show it (optionally).
But maybe this is more like a diff or log than a blame. Maybe there's already
such mode - I could not find it.
$ git init
Initialized empty Git repository in /tmp/a/.git/
$ echo a > a
$ echo b > b
$ echo c > c
$ git add .
$ git commit -a -m new
[master (root-commit) af5d319] new
3 files changed, 3 insertions(+), 0 deletions(-)
create mode 100644 a
create mode 100644 b
create mode 100644 c
$ git branch branch1
$ echo trunk1 > a
$ git commit -a -m trunk1
[master 2dc7f47] trunk1
1 files changed, 1 insertions(+), 1 deletions(-)
$ echo trunk2 > b
$ git commit -a -m trunk1
[master 736fcd2] trunk1
1 files changed, 1 insertions(+), 1 deletions(-)
$ git checkout branch1
Switched to branch 'branch1'
$ echo branch1 > c
$ git commit -a -m branch1
[branch1 52e371d] branch1
1 files changed, 1 insertions(+), 1 deletions(-)
$ echo branch2 > b
$ git commit -a -m branch2
[branch1 9fed07c] branch2
1 files changed, 1 insertions(+), 1 deletions(-)
$ git diff --stat branch1 master
a | 2 +-
b | 2 +-
c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
I would like to see output like this:
a 2dc7f47 (master)
b 736fcd2 (master)
b 9fed07c (branch1)
c 52e371d (branch1)
Not sure how useful it would be. Just an idea.
> But I expect most calls to be:
>
> git blame-tree dir commit
>
> and that's it.
Me too.
--
Piotrek
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 18:31 ` Junio C Hamano
@ 2011-03-02 21:04 ` Jeff King
2011-03-02 23:22 ` Junio C Hamano
2011-03-05 5:41 ` Ryan Tomayko
1 sibling, 1 reply; 17+ messages in thread
From: Jeff King @ 2011-03-02 21:04 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, Mar 02, 2011 at 10:31:30AM -0800, Junio C Hamano wrote:
> I agree that rolling it into "blame" will be a good way to conclude this
> feature from the end-user UI point of view, provided if (as you said) this
> proves to be the most reasonable way to give a coarser, file-level blame
> to a history.
OK, then I'll proceed with it as a separate command, and we can decide
on rolling it in later.
> One is merges, which you punted on with --no-merges ;-). If I were doing
> this, I would probably have traversed without --no-merges, and would blame
> all cases that are not TREESAME to the merge (i.e. regardless of the
> cleanness of the content level merge), as this feature is about file level
> blame. If the child took the whole content as-is from a parent, it is a
> no-event, but if the child has the resulting content that is different
> from any of its parents, then that child is responsible for the end-result
> content at the file level.
Yeah. I made the initial implementation as absolutely simple as
possible. But after writing my comments, I can't see why I ever thought
anything but what you propose (and what I mentioned in the reply to
myself) would make any sense.
With sensible semantics, we can turn off --no-merges and still get good
results. And anybody who cares for something different can use
--no-merges themselves. Though I expect the opposite to be the case; I
can imagine somebody wanting --first-parent to blame files only to
merges. I'll have to make sure that works properly in my re-roll.
> > The initial set of interesting files we come up with is gotten by
> > looking at the tree of the first pending object after parsing the rev
> > options (defaulting to HEAD). Which sounds a little flaky to me, but
> > does what you want in practice. I'd be curious if somebody can come up
> > with a counterexample where the ability to manually specify the source
> > tree would be more useful.
>
> I started writing "In what sense is it flaky? What corner case do you
> have in mind?" in a few lines above and then moved that here ;-).
I was thinking something like:
git blame-tree dir branch1 branch2
where branch2 is actually _ahead_ of branch1. We take the file list from
branch1, but may blame commits on branch2. It's probably a little bit of
a crazy thing to ask for, though.
> The second point is "why didn't you use pathspec, but chose to take only
> one path?" It would be useful if you can say
>
> $ git blame v2.6.24..v2.6.37 -- net include/net
>
> no?
Thanks for mentioning; it was one of the things I meant to bring up but
forgot to. There was one thing that gave me pause[1]. The paths here
are not about limiting a traversal; they are about limiting the
_initial_ tree that defines which paths we care about. So in the
traversal above, if 2.6.25 removes net/foo.c, then a regular traversal
will think that's interesting and mention it. But for blame-tree, we
don't want that. We want to mention net/foo.c only if it existed at the
beginning.
Of course it would be easy enough to parse the revision parameters, then
create our initial path list by combining the initial tree and the
pathspecs. But I wonder if the different semantics implies that we
should be using different syntax. I dunno.
There is also the matter of output. Right now blame-tree takes one path
parameter, we look it up as a tree, and if it's not a tree we complain.
Then we output paths that are relative to the inside of that tree. If I
have _two_ trees, what should the output be? Should all of the entries
in "net" be prefixed with net/? Should there be a "--relative" option?
What should be the default when one pathspec is given. Two pathspecs?
[1] Actually, there was a second thing, but considering it more, I think
I am just being silly. Pathspecs like that imply to me that we are
interested in all paths underneath. But blame-tree (to me, anyway) is
about listing the contents of a single tree, and not recursing. But I
think I am being silly because we _are_ interested in all things
underneath for blame-tree. We just don't necessarily mention the
details. When net/subdir/foo.c changes, we do care, we just say
"net/subdir" changed. So I think it is just a matter of my mental model
matching what is happening. And also we could support both modes (blame
a tree without recursion, or blame recursively all files under a tree).
> > So this is the per-commit processing. Basically we just do a diff for
> > each commit, and see if each path has been claimed. Note that it
> > depends on the string-list item->util being initialized to zero. Hence
> > my recent patch to string-list, and this needs to go on top of 62b8102
> > (which is in master and maint).
>
> Once you know who the guilty party is, can't you just remove that element
> from the list, so that later search from the "remaining path" list would
> decrease in cost as you go deeper?
You can, and I considered that, but I don't think it would speed things
up. The lookup is a binary search, so it's already lg(# of paths), and
that number is small (in the hundreds). So there's not much to be saved.
The expensive part is actually looking at the tree and doing even the
raw diff for each commit.
In an ideal world, trees would be random-access, and for each commit, I
would check each unclaimed path to see how it differed in the commit's
tree versus the parent trees. And then I would do O(# of paths) work for
each commit, with that # getting smaller each time. But obviously that
is not the case.
> Also at some point the set of remaining path would become small and it
> might make sense to feed that as diff-tree pathspec. After all, aren't
> you re-initializing the diff machinery for each commit (which is _not_ a
> bad thing to do--I am just pointing out that there is an opportunity to
> use narrower pathspecs as you go without any additional cost)? Note that
> I am not suggesting to update the pathspec used for commit traversal in
> the middle; that is a lot trickier.
Is that going to be any cheaper than what I'm doing now? The whole thing
is a single non-recursive tree diff. So now the diff machinery looks at
each entry, shows me the differences, and then I look them up pretty
efficiently against my list.
By feeding diff the pathspec, it still has to traverse the whole tree
looking for interesting entries comparing them against my list, and then
hand me the reduced list. Its examination of the pathspec is going to be
on par with my list lookup, complexity-wise. I guess there is one small
optimization it can do: if the trees and pathspec are sorted, as it gets
to an entry with "t" and sees that the last interesting pathspec is "q",
it knows it can stop comparing entries.
I'm not sure if we do that optimization. Even if we do, I'm not sure how
much practical difference it will make. I expect most of the time is
spent getting the tree out of storage at all, not walking the entries.
But I admit I haven't done any timing; that's just based on previous
experience.
All that being said, I think I may need to restructure diff calls
anyway. Right now I find the relevant trees and call diff_tree_sha1. But
that doesn't leave any room for rename detection, since the diff
machinery never even knows about the other parts of the project ouside
our subtree, which may be rename sources. So probably I need to be
feeding the whole commit to the diff machinery, getting full pathnames
in my callback, and then picking out the interesting ones.
And then I _will_ need to feed the proper pathspecs to diff. Because it
won't be a matter of looking at a few extra entries in a tree we have
already pulled from storage. We need to tell diff not to delve into
those extra trees at all (unless we need them for rename sources, but we
should avoid it for the case that rename detection is not enabled).
I'll have to look into it.
> > An optimization; we can quit the traversal as soon as everything is
> > claimed. In practice this helps disappointingly little.
>
> That is somewhat unexpected, but would strongly suggest that the approach
> to use more specific pathspec that cover only the remaining paths would
> give you faster per-commit diff?
I don't think so, for the reasons stated above. But I didn't time.
> > My merge handling is just "which files are different from the parents".
> > Which is reasonable, but I don't actually exercise it since we use
> > --no-merges by default. :)
>
> I think that is a big mistake and I already said it above.
>
> > We could try to do something clever here about evil merges.
>
> Again, I wouldn't even think about it; this is a "file boundary matters"
> tool; you shouldn't care about cleanliness of content-level merges.
Yeah, I am very much in agreement with you on both points now.
-Peff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 18:39 ` Piotr Krukowiecki
@ 2011-03-02 21:10 ` Jeff King
2011-03-02 21:24 ` Piotr Krukowiecki
0 siblings, 1 reply; 17+ messages in thread
From: Jeff King @ 2011-03-02 21:10 UTC (permalink / raw)
To: Piotr Krukowiecki; +Cc: git
On Wed, Mar 02, 2011 at 07:39:20PM +0100, Piotr Krukowiecki wrote:
> I'd expect this to be something like union. Currently I can only think about
> following case:
>
> Some files were changed in branch1, some in branch2, some in both.
> Show me how the files are changed. For example:
> file1 changed in branch1 in commit1
> file2 changed in branch2 in commit2
> file3 changed in branch1 in commit3 and in branch2 in commit4
>
> If file was not changed since branch creation then don't show it (optionally).
I think we are getting into something different here, because you are
caring not just about the commit in some traversal that touched a file,
but for each source, which commits got us there and potentially multiple
such commits, one per source for each file.
And that's a bit more expensive to compute, and the answers are not
always unambiguous. For example, let's say branch1 and branch2 fork from
some merge-base M. In the parent of M, file "foo" was changed. We
traverse from branch1 and branch2, not seeing anything interesting for
"foo". We hit M, and then finally see that its parent touched "foo".
What do we output? Both branches have equal claim to the commit.
I think you could figure out semantics that make sense if you spent
enough time on it. But I also think it is making the relatively simple
problem of blame-tree a lot more complex.
-Peff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 21:10 ` Jeff King
@ 2011-03-02 21:24 ` Piotr Krukowiecki
2011-03-02 21:41 ` Jeff King
0 siblings, 1 reply; 17+ messages in thread
From: Piotr Krukowiecki @ 2011-03-02 21:24 UTC (permalink / raw)
To: Jeff King; +Cc: git
On Wed, Mar 2, 2011 at 10:10 PM, Jeff King <peff@peff.net> wrote:
> On Wed, Mar 02, 2011 at 07:39:20PM +0100, Piotr Krukowiecki wrote:
>
>> I'd expect this to be something like union. Currently I can only think about
>> following case:
>>
>> Some files were changed in branch1, some in branch2, some in both.
>> Show me how the files are changed. For example:
>> file1 changed in branch1 in commit1
>> file2 changed in branch2 in commit2
>> file3 changed in branch1 in commit3 and in branch2 in commit4
>>
>> If file was not changed since branch creation then don't show it (optionally).
>
> I think we are getting into something different here, because you are
> caring not just about the commit in some traversal that touched a file,
> but for each source, which commits got us there and potentially multiple
> such commits, one per source for each file.
Yah, it might be something for git-log or git-diff.
> And that's a bit more expensive to compute, and the answers are not
> always unambiguous. For example, let's say branch1 and branch2 fork from
> some merge-base M. In the parent of M, file "foo" was changed. We
> traverse from branch1 and branch2, not seeing anything interesting for
> "foo". We hit M, and then finally see that its parent touched "foo".
So it's like this?
B1
|
M - B2
|
P <- changes foo
> What do we output? Both branches have equal claim to the commit.
That's easy. In "show only differences" we don't show anything,
because on both branches last-change-commit of "foo" is the same.
In "show all" last-change-commit is P so show it (with message like
"changed in common root" or whatever).
> I think you could figure out semantics that make sense if you spent
> enough time on it. But I also think it is making the relatively simple
> problem of blame-tree a lot more complex.
I think this is simple, but maybe I don't understand some git
internals that make it hard.
--
Piotrek
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 21:24 ` Piotr Krukowiecki
@ 2011-03-02 21:41 ` Jeff King
0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2011-03-02 21:41 UTC (permalink / raw)
To: Piotr Krukowiecki; +Cc: git
On Wed, Mar 02, 2011 at 10:24:29PM +0100, Piotr Krukowiecki wrote:
> So it's like this?
>
> B1
> |
> M - B2
> |
> P <- changes foo
Yes.
> > What do we output? Both branches have equal claim to the commit.
>
> That's easy. In "show only differences" we don't show anything,
> because on both branches last-change-commit of "foo" is the same.
> In "show all" last-change-commit is P so show it (with message like
> "changed in common root" or whatever).
Ah, that is totally not the output I would have expected. But now I
understand a little better what you are talking about. In the former
case, you are interested in a blame traversal going to the merge-base of
branch1 and branch2, and you are interested in the source. So I think
you could do it with something like:
git blame-tree dir --left-right branch1...branch2
though of course the current output doesn't actually notice things like
left-right markings from the revision traversal machinery.
And then there is also the question of representing greater than two
branches. If "foo" blames to a commit that is in branch1 and branch2,
but not branch3, what should be output? Presumably you would want it
enumerated as "branch1 and branch2 touched it in commit X, branch three
touched it in commit Y". But I'm not sure how well git's revision
machinery tracks more than two sources.
> I think this is simple, but maybe I don't understand some git
> internals that make it hard.
I think it is possible, and probably would build on top of the work I am
doing. But I am going to try to get the basics right first, and then we
can see about building other stuff on top.
-Peff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 21:04 ` Jeff King
@ 2011-03-02 23:22 ` Junio C Hamano
2011-03-03 15:36 ` Jeff King
0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2011-03-02 23:22 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> With sensible semantics, we can turn off --no-merges and still get good
> results. And anybody who cares for something different can use
> --no-merges themselves. Though I expect the opposite to be the case; I
> can imagine somebody wanting --first-parent to blame files only to
> merges. I'll have to make sure that works properly in my re-roll.
I still do not understand why anybody would even want to say --no-merges.
What does this even mean?
git blame --no-merges master..pu -- builtin | builtin/push.c
It may find ab/i18n or mm/push-default-advice depending on the timestamp,
by ignoring a content-level merge but is it a useful information?
> I was thinking something like:
>
> git blame-tree dir branch1 branch2
>
> where branch2 is actually _ahead_ of branch1. We take the file list from
> branch1, but may blame commits on branch2. It's probably a little bit of
> a crazy thing to ask for, though.
To be compatible with all the git commands, the order of parameters should
be revs first and then pathspec, i.e. "git blame-tree branch1 branch2 dir".
But that is a tangent.
Again, the same "what does that mean" applies.
>> The second point is "why didn't you use pathspec, but chose to take only
>> one path?" It would be useful if you can say
>>
>> $ git blame v2.6.24..v2.6.37 -- net include/net
>>
>> no?
>
> ... So in the
> traversal above, if 2.6.25 removes net/foo.c, then a regular traversal
> will think that's interesting and mention it. But for blame-tree, we
> don't want that. We want to mention net/foo.c only if it existed at the
> beginning.
Of course, if v2.6.37 that is the endpoint of the history (i.e. "at the
beginning" above) didn't have net/foo.c, then we shouldn't show it.
But that doesn't change my question. Why are you taking a single
directory and doing "ls-tree $endpoint $that_directory" to populate the
set of paths to find who touched them last, instead of using the args as
regular pathspecs and running "ls-tree $endpoint -- net include/net" to
populate the set?
> Of course it would be easy enough to parse the revision parameters, then
> create our initial path list by combining the initial tree and the
> pathspecs....
As I said already, it does not make any sense to have more than one
endpoint to begin with when you are asking "what is the last one that
touched things", and that also applies to the regular blame, which I think
already has a logic to make sure you have only one positive commit in the
starting rev list.
> ... When net/subdir/foo.c changes, we do care, we just say
> "net/subdir" changed.
I think you would want diff-tree with or without -r then. I suspect
people may want to pass --recursive to this command.
> The expensive part is actually looking at the tree and doing even the
> raw diff for each commit.
... which may probably be helped by limiting the recursion of that diff
with pathspec, but I realize that you are not by default running recursive
diff, so...
It also occurs to me that it may also be a plausible implementation to
hook this into the usual revision traversal machinery.
> I'm not sure if we do that optimization. Even if we do, I'm not sure how
> much practical difference it will make. I expect most of the time is
> spent getting the tree out of storage at all, not walking the entries.
Yeah, my "pathspec" comment primarily comes from my expectation that this
would want to be recursive. If you are only scanning a single flat tree
without recursing, then there is no need.
> And then I _will_ need to feed the proper pathspecs to diff. Because it
> won't be a matter of looking at a few extra entries in a tree we have
> already pulled from storage. We need to tell diff not to delve into
> those extra trees at all...
Yes.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 23:22 ` Junio C Hamano
@ 2011-03-03 15:36 ` Jeff King
0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2011-03-03 15:36 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, Mar 02, 2011 at 03:22:14PM -0800, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > With sensible semantics, we can turn off --no-merges and still get good
> > results. And anybody who cares for something different can use
> > --no-merges themselves. Though I expect the opposite to be the case; I
> > can imagine somebody wanting --first-parent to blame files only to
> > merges. I'll have to make sure that works properly in my re-roll.
>
> I still do not understand why anybody would even want to say --no-merges.
Originally, because my merge semantics were crap. But what I was saying
here was "if, for whatever reason, somebody decided that --no-merges,
they can still do it". So I don't know why. My point was we can safely
drop the automatic no-merges as junk.
> > I was thinking something like:
> >
> > git blame-tree dir branch1 branch2
> >
> > where branch2 is actually _ahead_ of branch1. We take the file list from
> > branch1, but may blame commits on branch2. It's probably a little bit of
> > a crazy thing to ask for, though.
>
> To be compatible with all the git commands, the order of parameters should
> be revs first and then pathspec, i.e. "git blame-tree branch1 branch2 dir".
> But that is a tangent.
Yeah, agreed (for some reason when I wrote the original patch, I had the
brain-dead idea that I was matching blame. But obviously that is not the
case).
> Again, the same "what does that mean" applies.
Assuming we switched it, we have:
git blame-tree branch1 branch2 -- dir
And no, I don't know what that means. My question was what should we do
with it? I think your answer is "complain if rev.pending.nr != 1", and
that sounds sensible to me.
One other possible answer is to take the union of the trees of each
revision listed on the command line. Then you could do something like
"git blame-tree --all -- path", and see all files in all trees. But I
don't really see how that is a particularly useful output, so I'd just
as soon disallow it for now.
> But that doesn't change my question. Why are you taking a single
> directory and doing "ls-tree $endpoint $that_directory" to populate the
> set of paths to find who touched them last, instead of using the args as
> regular pathspecs and running "ls-tree $endpoint -- net include/net" to
> populate the set?
Because my initial attempt was bolted onto ls-tree before I realized
that what I really wanted was blame-tree, and this historical crap
survived from an earlier iteration. I think what you are suggesting is
much better.
> As I said already, it does not make any sense to have more than one
> endpoint to begin with when you are asking "what is the last one that
> touched things", and that also applies to the regular blame, which I think
> already has a logic to make sure you have only one positive commit in the
> starting rev list.
Yeah, agreed.
> > ... When net/subdir/foo.c changes, we do care, we just say
> > "net/subdir" changed.
>
> I think you would want diff-tree with or without -r then. I suspect
> people may want to pass --recursive to this command.
Yeah, I am coming to that conclusion the more I think about it.
> ... which may probably be helped by limiting the recursion of that diff
> with pathspec, but I realize that you are not by default running recursive
> diff, so...
Yes, I definitely didn't want recursion. But clearly that is an obvious
option for people to want, and we should support it. So the code needs
to be structured to handle both cases.
Partially I was blinded in my initial effort by trying to make it as
fast and simple as possible for one specific case. But I think we can
get good performance results using the more traditional traversal and
diff machinery, and have a lot more options, too. My rewrite will be
based around that.
Thanks for the comments.
-Peff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 17:16 ` Jeff King
2011-03-02 17:51 ` Piotr Krukowiecki
2011-03-02 18:31 ` Junio C Hamano
@ 2011-03-03 20:20 ` Jakub Narebski
2 siblings, 0 replies; 17+ messages in thread
From: Jakub Narebski @ 2011-03-03 20:20 UTC (permalink / raw)
To: Jeff King; +Cc: git
Jeff King <peff@peff.net> writes:
> On Wed, Mar 02, 2011 at 11:40:32AM -0500, Jeff King wrote:
> > It's sometimes useful to get a list of files in a tree along with the
> > last commit that touched them. This is the default tree view shown on
> > github.com, but it can also be handy from the command line (there has
> > been talk lately of having a "git ls"), or as plumbing for a local
> > fancier tree view. E.g., something like:
> >
> > add.c 6e7293e git-add: make -A description clearer vs. -u
> > apply.c fd03881 add description parameter to OPT__VERBOSE
> > blame.c 9ca1169 parse-options: Don't call parse_options_check() so much
> > branch.c 62270f6 branch_merged: fix grammar in warning
> > bundle.c 62b4698 Use angles for placeholders consistently
[...]
> > blame-tree.c | 156 +++++++++++++++++++++++++++++++++++++++++++++++++
> > blame-tree.h | 25 ++++++++
> > builtin.h | 1 +
> > builtin/blame-tree.c | 34 +++++++++++
>
> I tried to lib-ify the implementation as much as possible. It increases
> the lines of code, of course, but I figured there was a reasonable
> chance that there might be a user-friendly "git ls" command eventually,
> and this would probably make a good "-v" option to it.
I think it _might_ be a good idea to add `--blame' option to
"git ls-tree", as a one of ways of presenting tree-blame output.
Or perhaps as part of "git ls".
In "[RFC] Tree blame (git blame <directory>)"[1] I proposed for
$ git blame --abbrev v1.6.3.3 -- .
to generate
100644 blob e57630e ba19a80 Junio C Hamano Feb 10 17:42 walker.c
100644 blob 8a149e1 c13b263 Daniel Barkalow Apr 26 2008 walker.h
100644 blob 7eb3218 fc71db3 Alex Riesen Apr 29 23:21 wrapper.c
100644 blob 4c29255 559e840 Junio C Hamano Jul 20 2008 write_or_die.c
100644 blob 819c797 a437900 Junio C Hamano Jun 21 02:35 ws.c
100644 blob 1b6df45 2af202b Linus Torvalds Jun 18 10:28 wt-status.c
100644 blob 78add09 6c2ce04 Marius Storm-Olsen Jun 5 2008 wt-status.h
100644 blob b9b0db8 eb3a9dd Benjamin Kramer Mar 7 21:02 xdiff-interface.c
100644 blob 7352b9a 86295bb Rene Scharfe Oct 25 2008 xdiff-interface.h
040000 tree ef5d413 5719db9 Charles Bailey May 25 01:21 xdiff/
or something like that. Date doesn't have to be in this strange format
used by 'ls'. Also instead of name we can use username part of email,
or just email; OTOH git-blame uses above format for author.
This could be result of "git ls-tree --abbrev --blame v1.6.3.3"...
and it could be combined with `-l' option of git-ls-tree.
[1]: http://article.gmane.org/gmane.comp.version-control.git/122830
>
> I considered making it a special mode of "git blame" when blame is fed a
> directory instead of a file. But the implementations aren't shared at
> all (nor do I think they need to be; blame-tree is _way_ simpler). And I
> didn't want to steal that concept in case somebody can think of a more
> content-level way of blaming a whole tree that makes sense (obviously
> just showing the concatenation of the blames of each file is one way,
> but I don't know how useful that would be). If we want to go that way,
> we can always catch the special case in blame and just exec blame-tree.
Well, having "git blame [<rev>] <directory>" to output tre-blame
would allow to reuse some of already existing options to ordinary
git-blame; well those that makes sense, like `-b', `-S <revs-file>',
`--reverse', perhaps (depending on available output) also `-l', `-t',
`-s', `--date <format>'.
<rev> is here starting revision or revision range; if it is revision
range then negative specifiers function as boundary.
We could use `-M' to turn on rename detection, and `-C' to turn on
copy detection; I think that in tree-blame we need to consider only
_exact_ renames (pure renames, i.e. the same SHA-1, different name).
Also for GitHub (and perhaps also in the future for gitweb too) would
I think use `--porcelain' or even `--incremental' version of tree-blame;
in [1] I have proposed the following output (following existing "for
porcelain" format):
$ git blame --porcelain v1.6.3.3 -- .
86295bb6bac1482d29650d1f77f19d8e7a7cc2fe 7352b9a9c204c2b1d4ca9df5ce040fe22d6f521c
author Rene Scharfe
author-mail <rene.scharfe@lsrfire.ath.cx>
author-time 1224941475
author-tz +0200
committer Junio C Hamano
committer-mail <gitster@pobox.com>
committer-time 1224961771
committer-tz -0700
summary add xdi_diff_hunks() for callers that only need hunk lengths
filename xdiff-interface.h
100644 blob 7352b9a9c204c2b1d4ca9df5ce040fe22d6f521c xdiff-interface.h
5719db91ce5915ee07c50f1afdc94fe34e91529f ef5d413237b3a390007fba56671b00d7c371ae1e
author Charles Bailey
author-mail <charles@hashpling.org>
author-time 1243210874
author-tz +0100
committer Junio C Hamano
committer-mail <gitster@pobox.com>
committer-time 1243234594
committer-tz -0700
summary add xdi_diff_hunks() for callers that only need hunk lengths
filename xdiff
040000 tree ef5d413237b3a390007fba56671b00d7c371ae1e xdiff
> > +void blame_tree_init(struct blame_tree *bt)
> > +{
> > + memset(bt, 0, sizeof(*bt));
> > + bt->paths.strdup_strings = 1;
> > + init_revisions(&bt->rev, NULL);
> > + bt->rev.no_merges = 1;
> > + bt->rev.def = "HEAD";
> > +}
>
> I turn off merges by default, since they are unlikely to be interesting
> matches (you will see the merge of a side-branch that touched a file
> instead of the actual commit on the side-branch). You could of course do
> "git blame-tree . --no-merges" to get the same effect. I think no-merges
> makes a saner default, but sadly it doesn't seem like there is a way to
> turn no-merges back off ("--merges" means something else, and there is
> no --no-no-merges").
IMHO merges are interecting; moreover if I remember correctly my proof
of concept of tree-blame which I tried to implement in Perl using
Git.pm (git cat-file --batch + git diff-tree --stdin), I have problems
with merges in tree-blame of subdirectory ("--relative" option doesn't
work as I thought it did).
> Right now the code just handles trees. But in the long run, it would
> probably make sense to get the list of files from the index, and mark
> files modified in the working tree or index, too. So something like:
>
> foo.c 1234abcd this is a commit subject
> bar.c modified in working tree
> baz.c modified in index
>
> Sort of like how gitk shows "pseudo-commits" on top of history to
> indicate changes.
Or how "git blame" handles "--contents <file>" option... though what
you mentioned is more than that.
[...]
> > +int cmd_blame_tree(int argc, const char **argv, const char *prefix)
> > +{
> > + struct blame_tree bt;
> > + const char *path = NULL;
> > +
> > + git_config(git_default_config, NULL);
> > +
> > + if (argv[1]) {
> > + path = argv[1];
> > + argc--;
> > + argv++;
> > + }
>
> Obviously no options. Probably there should at least be "--porcelain" to
> output the current form, and the default output should be more
> user-friendly. And probably "-z" to avoid quoting issues.
Thank you for working on this.
--
Jakub Narebski
Poland
ShadeHawk on #git
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 16:40 [RFC] git blame-tree Jeff King
2011-03-02 17:16 ` Jeff King
2011-03-02 17:40 ` Jeff King
@ 2011-03-04 14:40 ` Will Palmer
2011-03-04 14:53 ` Jeff King
2 siblings, 1 reply; 17+ messages in thread
From: Will Palmer @ 2011-03-04 14:40 UTC (permalink / raw)
To: Jeff King; +Cc: git
On Wed, 2011-03-02 at 11:40 -0500, Jeff King wrote:
> [I know, I know, another RFC. I'll get to actually cleaning up and
> submitting some of these patches soon.]
>
> It's sometimes useful to get a list of files in a tree along with the
> last commit that touched them. This is the default tree view shown on
> github.com, but it can also be handy from the command line (there has
> been talk lately of having a "git ls"), or as plumbing for a local
> fancier tree view. E.g., something like:
>
> add.c 6e7293e git-add: make -A description clearer vs. -u
> apply.c fd03881 add description parameter to OPT__VERBOSE
> blame.c 9ca1169 parse-options: Don't call parse_options_check() so much
> branch.c 62270f6 branch_merged: fix grammar in warning
> bundle.c 62b4698 Use angles for placeholders consistently
>
> The obvious naive way to do this is something like:
>
> for i in `git ls-tree --name-only HEAD`; do
> echo "`git rev-list -1 --no-merges HEAD -- $i` $i";
> done
>
> which is really slow, because we end up traversing the same commits many
> times (plus the startup overhead for each rev-list). It takes about 35
> seconds to run on git.git.
>
> So the next obvious thing is to do one traversal, output the changed
> files for each commit, and then mark each file as you see it. The perl
> script below does this (though the careful reader will note it is
> actually buggy with sub-trees; I didn't bother fixing it since it was
> just a stage in the evolution):
>
[code snipped]
>
> This runs in about 3 seconds. And besides the above-mentioned bug,
> also doesn't properly handle things like filenames that need quoting.
>
> So I wrote it in C, which drops the time down to about 1.5 seconds, and
> of course doesn't have any parsing issues. The patch is below.
>
> I wasn't sure at first what to call it or what the calling conventions
> should be. The initial thought was to make it part of "ls-tree". But
> that feels wrong, as ls-tree otherwise never cares about traversal. The
> combination of traversal and diff made me think of blame, and indeed, I
> think this is really just about blaming a whole tree at the file-level,
> rather than at the content-level. Thus I called it blame-tree, and I
> used the same calling conventions as blame: "git blame-tree <path>
> <rev opts>". See the test script for examples.
>
> I have many thoughts on the patch already, but rather than put them
> here, I'll include the patch without further ado, and put them inline in
> a reply.
>
[patch snipped]
Coincidentally, I'm doing a similar thing in a shell script at the
moment. Unfortunately, no tree-object is involved: I'm instead using the
output from "git diff" on two different branches to generate a list of
files I care about. How hard would it be to accept a nul-delimited list
of filenames via stdin, rather than from a tree? If I'm reading this
right, it looks like a pretty trivial change. (I couldn't get the
existing patch to apply, myself.. I assume I'm just doing something
wrong as I don't need to use "git am" very often.)
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-04 14:40 ` Will Palmer
@ 2011-03-04 14:53 ` Jeff King
0 siblings, 0 replies; 17+ messages in thread
From: Jeff King @ 2011-03-04 14:53 UTC (permalink / raw)
To: Will Palmer; +Cc: git
On Fri, Mar 04, 2011 at 02:40:14PM +0000, Will Palmer wrote:
> Coincidentally, I'm doing a similar thing in a shell script at the
> moment. Unfortunately, no tree-object is involved: I'm instead using the
> output from "git diff" on two different branches to generate a list of
> files I care about. How hard would it be to accept a nul-delimited list
> of filenames via stdin, rather than from a tree? If I'm reading this
> right, it looks like a pretty trivial change.
My planned rewrite will take arbitrary pathspecs, so you can ask for a
subset of the project files, not just a specific tree, on the
command-line. Which may be enough for your purposes (coupled with
xargs). But it would also be easy enough to take rev options on stdin
for very large cases that would exceed the usual command-line limits.
> (I couldn't get the existing patch to apply, myself.. I assume I'm
> just doing something wrong as I don't need to use "git am" very
> often.)
It's also possible I screwed something up in posting it. If you describe
the problem, I might be able to help.
-Peff
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] git blame-tree
2011-03-02 18:31 ` Junio C Hamano
2011-03-02 21:04 ` Jeff King
@ 2011-03-05 5:41 ` Ryan Tomayko
1 sibling, 0 replies; 17+ messages in thread
From: Ryan Tomayko @ 2011-03-05 5:41 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jeff King, git
On Wed, Mar 2, 2011 at 10:31 AM, Junio C Hamano <gitster@pobox.com> wrote:
> It has been my opinion that the default github view that seems to be
> modeled after cvsview is the least useful choice of the default in the
> more modern git era ;-), but I guess people's eyes and brains are trained
> by the old school "file boundaries matter" way of thinking.
Junio, would you mind elaborating a bit here? I read this as "the
file tree is less interesting than the commit history", or maybe
some other view? I'm curious what would be your preference for
default view.
Ryan
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2011-03-05 5:41 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-02 16:40 [RFC] git blame-tree Jeff King
2011-03-02 17:16 ` Jeff King
2011-03-02 17:51 ` Piotr Krukowiecki
2011-03-02 18:07 ` Jeff King
2011-03-02 18:39 ` Piotr Krukowiecki
2011-03-02 21:10 ` Jeff King
2011-03-02 21:24 ` Piotr Krukowiecki
2011-03-02 21:41 ` Jeff King
2011-03-02 18:31 ` Junio C Hamano
2011-03-02 21:04 ` Jeff King
2011-03-02 23:22 ` Junio C Hamano
2011-03-03 15:36 ` Jeff King
2011-03-05 5:41 ` Ryan Tomayko
2011-03-03 20:20 ` Jakub Narebski
2011-03-02 17:40 ` Jeff King
2011-03-04 14:40 ` Will Palmer
2011-03-04 14:53 ` Jeff King
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).