From: Derrick Stolee <stolee@gmail.com>
To: "SZEDER Gábor" <szeder.dev@gmail.com>, git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>, Thomas Rast <tr@thomasrast.ch>
Subject: Re: [PATCH 2/2] line-log: avoid unnecessary full tree diffs
Date: Wed, 21 Aug 2019 11:53:28 -0400 [thread overview]
Message-ID: <e52e867b-af17-d239-11b5-e0c6353acc2f@gmail.com> (raw)
In-Reply-To: <20190821110424.18184-3-szeder.dev@gmail.com>
On 8/21/2019 7:04 AM, SZEDER Gábor wrote:
> With rename detection enabled the line-level log is able to trace the
> evolution of line ranges across whole-file renames [1]. Alas, to
> achieve that it uses the diff machinery very inefficiently, making the
> operation very slow [2]. And since rename detection is enabled by
> default, the line-level log is very slow by default.
>
> When the line-level log processes a commit with rename detection
> enabled, it currently does the following (see queue_diffs()):
>
> 1. Computes a full tree diff between the commit and (one of) its
> parent(s), i.e. invokes diff_tree_oid() with an empty
> 'diffopt->pathspec'.
> 2. Checks whether any paths in the line ranges were modified.
> 3. Checks whether any modified paths in the line ranges are missing
> in the parent commit's tree.
> 4. If there is such a missing path, then calls diffcore_std() to
> figure out whether the path was indeed renamed based on the
> previously computed full tree diff.
> 5. Continues doing stuff that are unrelated to the slowness.
>
> So basically the line-level log computes a full tree diff for each
> commit-parent pair in step (1) to be used for rename detection in step
> (4) in the off chance that an interesting path is missing from the
> parent.
>
> Avoid these expensive and mostly unnecessary full tree diffs by
> limiting the diffs to paths in the line ranges. This is much cheaper,
> and makes step (2) unnecessary. If it turns out that an interesting
> path is missing from the parent, then fall back and compute a full
> tree diff, so the rename detection will still work.
I applied your patches and tried them on our VFS-enabled version of Git
(see [1]). Unfortunately, the new logic is still triggering rename
detection, as measured by the number of objects being downloaded.
[1] https://github.com/microsoft/git/pull/182
My *guess* is that the repo has a lot of merge commits, and for many
of those, the file does not exist in the first parent. Since we are
essentially doing a --full-history, this means that edge tries a
rename detection. If we used the file-history simplification route of
traveling along a treesame edge instead of caring about both parents,
then maybe this would be avoided.
I could also be completely wrong about how this line-log code works
with regards to --full-history.
> Care must be taken when to update the pathspec used to limit the diff
> in case of renames. A path might be renamed on one branch and
> modified on several parallel running branches, and while processing
> commits on these branches the line-level log might have to alternate
> between looking at a path's new and old name. However, at any one
> time there is only a single 'diffopt->pathspec'.
>
> So add a step (0) to the above to ensure that the paths in the
> pathspec match the paths in the line ranges associated with the
> currently processed commit, and re-parse the pathspec from the paths
> in the line ranges if they differ.
>
> The new test cases include a specially crafted piece of history with
> two merged branches and two files, where each branch modifies both
> files, renames on of them, and then modifies both again. Then two
> separate 'git log -L' invocations check the line-level log of each of
> those two files, which ensures that at least one of those invocations
> have to do that back-and-forth between the file's old and new name (no
> matter which branch is traversed first). 't/t4211-line-log.sh'
> already contains two tests involving renames, they don't don't trigger
> this back-and-forth.
>
> Avoiding these unnecessary full tree diffs can have huge impact on
> performance, especially in big repositories with big trees and mergy
> history. Tracing the evolution of a function through the whole
> history:
>
> # git.git
> $ time git --no-pager log -L:read_alternate_refs:sha1-file.c v2.23.0
>
> Before:
>
> real 0m8.874s
> user 0m8.816s
> sys 0m0.057s
>
> After:
>
> real 0m2.516s
> user 0m2.456s
> sys 0m0.060s
>
> # linux.git
> $ time ~/src/git/git --no-pager log \
> -L:build_restore_work_registers:arch/mips/mm/tlbex.c v5.2
>
> Before:
>
> real 3m50.033s
> user 3m48.041s
> sys 0m0.300s
>
> After:
>
> real 0m2.599s
> user 0m2.466s
> sys 0m0.157s
>
> That's just over 88x speedup.
These performance numbers are great! Please don't let my complaints of
"it doesn't work for my particularly bad example" be a deterrent to this
change. If I figure out what is going on in my case, then I can create
an update on top of your changes.
> diff --git a/line-log.c b/line-log.c
> index fddd91f060..9010e00950 100644
> --- a/line-log.c
> +++ b/line-log.c
> @@ -737,6 +737,22 @@ static struct line_log_data *lookup_line_range(struct rev_info *revs,
> return ret;
> }
>
> +static int same_paths_in_pathspec_and_range(struct pathspec *pathspec,
> + struct line_log_data *range)
> +{
> + int i;
> + struct line_log_data *r;
> +
> + for (i = 0, r = range; i < pathspec->nr && r; i++, r = r->next)
> + if (strcmp(pathspec->items[i].match, r->path))
> + return 0;
> + if (i < pathspec->nr || r)
> + /* different number of pathspec items and ranges */
> + return 0;
> +
> + return 1;
> +}
This method is easy to digest. Looks correct.
> @@ -762,8 +778,7 @@ void line_log_init(struct rev_info *rev, const char *prefix, struct string_list
> range = parse_lines(rev->diffopt.repo, commit, prefix, args);
> add_line_range(rev, commit, range);
>
> - if (!rev->diffopt.detect_rename)
> - parse_pathspec_from_ranges(&rev->diffopt.pathspec, range);
> + parse_pathspec_from_ranges(&rev->diffopt.pathspec, range);
> }
So we always parse the pathspec, even if we don't do detect renames.
> @@ -821,15 +836,29 @@ static void queue_diffs(struct line_log_data *range,
> struct diff_queue_struct *queue,
> struct commit *commit, struct commit *parent)
> {
> + struct object_id *tree_oid, *parent_tree_oid;
> +
> assert(commit);
>
> + tree_oid = get_commit_tree_oid(commit);
> + parent_tree_oid = parent ? get_commit_tree_oid(parent) : NULL;
> +
> + if (opt->detect_rename &&
> + !same_paths_in_pathspec_and_range(&opt->pathspec, range)) {
> + clear_pathspec(&opt->pathspec);
> + parse_pathspec_from_ranges(&opt->pathspec, range);
> + }
If we are detecting renames and our pathspec is not up-to-date with the
range, then clear the pathspec and reparse. Makes sense.
> DIFF_QUEUE_CLEAR(&diff_queued_diff);
> - diff_tree_oid(parent ? get_commit_tree_oid(parent) : NULL,
> - get_commit_tree_oid(commit), "", opt);
> + diff_tree_oid(parent_tree_oid, tree_oid, "", opt);
(I rearranged a pair of -/+ lines in the diff to highlight this change.)
Makes sense, parent_tree_oid above was set using the same conditional.
> - if (opt->detect_rename) {
> + if (opt->detect_rename && diff_might_be_rename()) {
Here is the crux of the matter: diff_might_be_rename() can prevent
the full tree diff.
> + /* must look at the full tree diff to detect renames */
> + clear_pathspec(&opt->pathspec);
> + DIFF_QUEUE_CLEAR(&diff_queued_diff);
> +
> + diff_tree_oid(parent_tree_oid, tree_oid, "", opt);
> +
> filter_diffs_for_paths(range, 1);
> - if (diff_might_be_rename())
> - diffcore_std(opt);
> + diffcore_std(opt);
> filter_diffs_for_paths(range, 0);
> }
So before, diff_might_be_rename() already prevented diffcore_std(), but
now it also prevents clearing the pathspec. diff_might_be_rename() has
a simple implementation:
static inline int diff_might_be_rename(void)
{
int i;
for (i = 0; i < diff_queued_diff.nr; i++)
if (!DIFF_FILE_VALID(diff_queued_diff.queue[i]->one)) {
/* fprintf(stderr, "diff_might_be_rename found creation of: %s\n", */
/* diff_queued_diff.queue[i]->two->path); */
return 1;
}
return 0;
}
So yes, it is triggered by any path appearing in the child but not
a parent.
> move_diff_queue(queue, &diff_queued_diff);
> diff --git a/t/t4211-line-log.sh b/t/t4211-line-log.sh
> index 1db7bd0f59..8319163744 100755
> --- a/t/t4211-line-log.sh
> +++ b/t/t4211-line-log.sh
> @@ -132,4 +132,86 @@ test_expect_success '--raw is forbidden' '
> test_must_fail git log -L1,24:b.c --raw
> '
>
> +test_expect_success 'setup for checking fancy rename following' '
> + git checkout --orphan moves-start &&
> + git reset --hard &&
> +
> + printf "%s\n" 12 13 14 15 b c d e >file-1 &&
> + printf "%s\n" 22 23 24 25 B C D E >file-2 &&
> + git add file-1 file-2 &&
> + test_tick &&
> + git commit -m "Add file-1 and file-2" &&
> + oid_add_f1_f2=$(git rev-parse --short HEAD) &&
> +
> + git checkout -b moves-main &&
> + printf "%s\n" 11 12 13 14 15 b c d e >file-1 &&
> + git commit -a -m "Modify file-1 on main" &&
> + oid_mod_f1_main=$(git rev-parse --short HEAD) &&
> +
> + printf "%s\n" 21 22 23 24 25 B C D E >file-2 &&
> + git commit -a -m "Modify file-2 on main #1" &&
> + oid_mod_f2_main_1=$(git rev-parse --short HEAD) &&
> +
> + git mv file-1 renamed-1 &&
> + git commit -m "Rename file-1 to renamed-1 on main" &&
> +
> + printf "%s\n" 11 12 13 14 15 b c d e f >renamed-1 &&
> + git commit -a -m "Modify renamed-1 on main" &&
> + oid_mod_r1_main=$(git rev-parse --short HEAD) &&
> +
> + printf "%s\n" 21 22 23 24 25 B C D E F >file-2 &&
> + git commit -a -m "Modify file-2 on main #2" &&
> + oid_mod_f2_main_2=$(git rev-parse --short HEAD) &&
> +
> + git checkout -b moves-side moves-start &&
> + printf "%s\n" 12 13 14 15 16 b c d e >file-1 &&
> + git commit -a -m "Modify file-1 on side #1" &&
> + oid_mod_f1_side_1=$(git rev-parse --short HEAD) &&
> +
> + printf "%s\n" 22 23 24 25 26 B C D E >file-2 &&
> + git commit -a -m "Modify file-2 on side" &&
> + oid_mod_f2_side=$(git rev-parse --short HEAD) &&
> +
> + git mv file-2 renamed-2 &&
> + git commit -m "Rename file-2 to renamed-2 on side" &&
> +
> + printf "%s\n" 12 13 14 15 16 a b c d e >file-1 &&
> + git commit -a -m "Modify file-1 on side #2" &&
> + oid_mod_f1_side_2=$(git rev-parse --short HEAD) &&
> +
> + printf "%s\n" 22 23 24 25 26 A B C D E >renamed-2 &&
> + git commit -a -m "Modify renamed-2 on side" &&
> + oid_mod_r2_side=$(git rev-parse --short HEAD) &&
> +
> + git checkout moves-main &&
> + git merge moves-side &&
> + oid_merge=$(git rev-parse --short HEAD)
> +'
> +
> +test_expect_success 'fancy rename following #1' '
> + cat >expect <<-EOF &&
> + $oid_merge Merge branch '\''moves-side'\'' into moves-main
> + $oid_mod_f1_side_2 Modify file-1 on side #2
> + $oid_mod_f1_side_1 Modify file-1 on side #1
> + $oid_mod_r1_main Modify renamed-1 on main
> + $oid_mod_f1_main Modify file-1 on main
> + $oid_add_f1_f2 Add file-1 and file-2
> + EOF
> + git log -L1:renamed-1 --oneline --no-patch >actual &&
> + test_cmp expect actual
> +'
> +
> +test_expect_success 'fancy rename following #2' '
> + cat >expect <<-EOF &&
> + $oid_merge Merge branch '\''moves-side'\'' into moves-main
> + $oid_mod_r2_side Modify renamed-2 on side
> + $oid_mod_f2_side Modify file-2 on side
> + $oid_mod_f2_main_2 Modify file-2 on main #2
> + $oid_mod_f2_main_1 Modify file-2 on main #1
> + $oid_add_f1_f2 Add file-1 and file-2
> + EOF
> + git log -L1:renamed-2 --oneline --no-patch >actual &&
> + test_cmp expect actual
> +'
These look to be suitably interesting test cases. Thanks!
Looking at your patch, I can mostly follow the logic, but my
unfamiliarity with the code is keeping me from being confident
in full understanding. I hope someone who is familiar can
chime in, because I really like the direction here.
Hopefully I will have time in the next few weeks to revisit
this and work to resolve my abnormal case.
-Stolee
next prev parent reply other threads:[~2019-08-21 15:53 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-21 11:04 [PATCH 0/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor
2019-08-21 11:04 ` [PATCH 1/2] line-log: extract pathspec parsing from line ranges into a helper function SZEDER Gábor
2019-08-21 11:04 ` [PATCH 2/2] line-log: avoid unnecessary full tree diffs SZEDER Gábor
2019-08-21 15:53 ` Derrick Stolee [this message]
2019-08-21 17:35 ` SZEDER Gábor
2019-08-21 18:12 ` Derrick Stolee
2019-08-22 8:41 ` SZEDER Gábor
2019-08-22 14:53 ` Derrick Stolee
2019-08-22 16:01 ` Junio C Hamano
2019-08-22 16:26 ` SZEDER Gábor
2019-08-22 16:51 ` Derrick Stolee
2019-08-23 10:04 ` SZEDER Gábor
2019-08-21 17:29 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e52e867b-af17-d239-11b5-e0c6353acc2f@gmail.com \
--to=stolee@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=szeder.dev@gmail.com \
--cc=tr@thomasrast.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).