* [PATCH] Avoid loading commits twice in log with diffs @ 2013-03-02 10:05 Thomas Rast 2013-03-03 7:06 ` Junio C Hamano 0 siblings, 1 reply; 5+ messages in thread From: Thomas Rast @ 2013-03-02 10:05 UTC (permalink / raw) To: git If you run a log with diffs (such as -p, --raw, --stat etc.) the current code ends up loading many objects twice. For example, for 'log -3000 -p' my instrumentation said the objects loaded more than once are distributed as follows: 2008 blob 2103 commit 2678 tree Fixing blobs and trees will be harder, because those are really used within the diff engine and need some form of caching. However, fixing the commits is easy at least at the band-aid level. They are triggered by log_tree_diff() invoking diff_tree_sha1() on commits, which duly loads the specified object to dereference it to a tree. Since log_tree_diff() knows that it works with commits and they must have trees, we can simply pass through the trees. This has a quite dramatic effect on log --raw, though only a negligible impact on log -p: Test with patch before -------------------------------------------------------------------- 4000.2: log --raw -3000 0.50(0.43+0.06) 0.54(0.46+0.06) +7.0%*** 4000.3: log -p -3000 2.34(2.20+0.13) 2.37(2.22+0.13) +1.2% -------------------------------------------------------------------- Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 Signed-off-by: Thomas Rast <trast@student.ethz.ch> --- log-tree.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/log-tree.c b/log-tree.c index eb1a1b4..277a38f 100644 --- a/log-tree.c +++ b/log-tree.c @@ -715,7 +715,7 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log { int showed_log; struct commit_list *parents; - unsigned const char *sha1 = commit->object.sha1; + unsigned const char *sha1 = commit->tree->object.sha1; if (!opt->diff && !DIFF_OPT_TST(&opt->diffopt, EXIT_WITH_STATUS)) return 0; @@ -742,7 +742,8 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log * parent, showing summary diff of the others * we merged _in_. */ - diff_tree_sha1(parents->item->object.sha1, sha1, "", &opt->diffopt); + parse_commit(parents->item); + diff_tree_sha1(parents->item->tree->object.sha1, sha1, "", &opt->diffopt); log_tree_diff_flush(opt); return !opt->loginfo; } @@ -755,7 +756,8 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log for (;;) { struct commit *parent = parents->item; - diff_tree_sha1(parent->object.sha1, sha1, "", &opt->diffopt); + parse_commit(parent); + diff_tree_sha1(parent->tree->object.sha1, sha1, "", &opt->diffopt); log_tree_diff_flush(opt); showed_log |= !opt->loginfo; -- 1.8.2.rc1.393.ga167915 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] Avoid loading commits twice in log with diffs 2013-03-02 10:05 [PATCH] Avoid loading commits twice in log with diffs Thomas Rast @ 2013-03-03 7:06 ` Junio C Hamano 2013-03-04 9:58 ` Thomas Rast 2013-03-28 8:19 ` [PATCH v2] " Thomas Rast 0 siblings, 2 replies; 5+ messages in thread From: Junio C Hamano @ 2013-03-03 7:06 UTC (permalink / raw) To: Thomas Rast; +Cc: git Thomas Rast <trast@student.ethz.ch> writes: > Test with patch before > -------------------------------------------------------------------- > 4000.2: log --raw -3000 0.50(0.43+0.06) 0.54(0.46+0.06) +7.0%*** > 4000.3: log -p -3000 2.34(2.20+0.13) 2.37(2.22+0.13) +1.2% > -------------------------------------------------------------------- > Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 It may be a silly question but what is a significance hint? > Signed-off-by: Thomas Rast <trast@student.ethz.ch> > --- > log-tree.c | 8 +++++--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/log-tree.c b/log-tree.c > index eb1a1b4..277a38f 100644 > --- a/log-tree.c > +++ b/log-tree.c > @@ -715,7 +715,7 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log > { > int showed_log; > struct commit_list *parents; > - unsigned const char *sha1 = commit->object.sha1; > + unsigned const char *sha1 = commit->tree->object.sha1; Overall I think this goes in the right direction and I can see why the changes in later two hunks are correct, but I am not sure if we can safely assume that the caller has parsed the incoming commit and we have a good commit->tree here already. Right now, this static function has a single call-site in a public function log_tree_commit(), whose existing callers may all pass an already parsed commit, but I feel somewhat uneasy to do the above without some mechanism in place (either parse it here or in the caller if unparsed, or document that log_tree_commit() must be called with a parsed commit and perhaps add an assert there) to ensure that the invariant is not broken in the future. Thanks. > if (!opt->diff && !DIFF_OPT_TST(&opt->diffopt, EXIT_WITH_STATUS)) > return 0; > @@ -742,7 +742,8 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log > * parent, showing summary diff of the others > * we merged _in_. > */ > - diff_tree_sha1(parents->item->object.sha1, sha1, "", &opt->diffopt); > + parse_commit(parents->item); > + diff_tree_sha1(parents->item->tree->object.sha1, sha1, "", &opt->diffopt); > log_tree_diff_flush(opt); > return !opt->loginfo; > } > @@ -755,7 +756,8 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log > for (;;) { > struct commit *parent = parents->item; > > - diff_tree_sha1(parent->object.sha1, sha1, "", &opt->diffopt); > + parse_commit(parent); > + diff_tree_sha1(parent->tree->object.sha1, sha1, "", &opt->diffopt); > log_tree_diff_flush(opt); > > showed_log |= !opt->loginfo; ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Avoid loading commits twice in log with diffs 2013-03-03 7:06 ` Junio C Hamano @ 2013-03-04 9:58 ` Thomas Rast 2013-03-28 8:19 ` [PATCH v2] " Thomas Rast 1 sibling, 0 replies; 5+ messages in thread From: Thomas Rast @ 2013-03-04 9:58 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Jakub Narebski Junio C Hamano <gitster@pobox.com> writes: > Thomas Rast <trast@student.ethz.ch> writes: > >> Test with patch before >> -------------------------------------------------------------------- >> 4000.2: log --raw -3000 0.50(0.43+0.06) 0.54(0.46+0.06) +7.0%*** >> 4000.3: log -p -3000 2.34(2.20+0.13) 2.37(2.22+0.13) +1.2% >> -------------------------------------------------------------------- >> Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 > > It may be a silly question but what is a significance hint? That's from my still-not-rerolled perf improvements series from, ahem, ages ago: http://thread.gmane.org/gmane.comp.version-control.git/192884 I stole the idea from R (and in fact use R to compute it). The *** tells you that the probability of the 7% difference is attributable to chance only with 0.1% probability, or in other words, it's highly likely that the difference is *not* down to chance. On the other hand, note that the p4000.3 measurements do not show a significant difference (significance hint is absent). The statistical background: Assume that the two series of measurements are drawn from two normal distributions (possibly with different mean/variance). Welch's t-test estimates the probability of the null hypothesis that the two distributions in fact had the same mean. If you can reject the null hypothesis, you have essentially proven that there's *some* difference in the timing runs. (Hopefully for the better, and hopefully _not_ caused just by CPU scaling or such.) By the way, in the above test Jakub pointed me at the Dumbbench perl module. I've had a look at the ideas within, and I'll try to put some sample rejection along their lines into perf-lib. However, several things make the module itself rather deserve the name. Most prominently, I can only get it to print timings to stdout in a fixed format designed for human consumption. >> --- a/log-tree.c >> +++ b/log-tree.c >> @@ -715,7 +715,7 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log >> { >> int showed_log; >> struct commit_list *parents; >> - unsigned const char *sha1 = commit->object.sha1; >> + unsigned const char *sha1 = commit->tree->object.sha1; > > Overall I think this goes in the right direction and I can see why > the changes in later two hunks are correct, but I am not sure if we > can safely assume that the caller has parsed the incoming commit and > we have a good commit->tree here already. > > Right now, this static function has a single call-site in a public > function log_tree_commit(), whose existing callers may all pass an > already parsed commit, but I feel somewhat uneasy to do the above > without some mechanism in place (either parse it here or in the > caller if unparsed, or document that log_tree_commit() must be > called with a parsed commit and perhaps add an assert there) to > ensure that the invariant is not broken in the future. In that case I'll reroll with the parsing -- it will have about the same cost as the assertion, since it'll just check ->object.parsed and return. -- Thomas Rast trast@{inf,student}.ethz.ch ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH v2] Avoid loading commits twice in log with diffs 2013-03-03 7:06 ` Junio C Hamano 2013-03-04 9:58 ` Thomas Rast @ 2013-03-28 8:19 ` Thomas Rast 2013-03-28 13:51 ` Jeff King 1 sibling, 1 reply; 5+ messages in thread From: Thomas Rast @ 2013-03-28 8:19 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Thomas Rast If you run a log with diffs (such as -p, --raw, --stat etc.) the current code ends up loading many objects twice. For example, for 'log -3000 -p' my instrumentation said the objects loaded more than once are distributed as follows: 2008 blob 2103 commit 2678 tree Fixing blobs and trees will be harder, because those are really used within the diff engine and need some form of caching. However, fixing the commits is easy at least at the band-aid level. They are triggered by log_tree_diff() invoking diff_tree_sha1() on commits, which duly loads the specified object to dereference it to a tree. Since log_tree_diff() knows that it works with commits and they must have trees, we can simply pass through the trees. We add some parse_commit() calls. The ones for the parents are required; we do not know at this stage if they have been looked at. The one for the commit itself is pure paranoia, but has about the same cost as an assertion on commit->object.parsed. This has a quite dramatic effect on log --raw, though only a negligible impact on log -p: Test this tree HEAD -------------------------------------------------------------------- 4000.2: log --raw -3000 0.50(0.43+0.06) 0.54(0.46+0.06) +7.0%*** 4000.3: log -p -3000 2.34(2.20+0.13) 2.37(2.22+0.13) +1.2% -------------------------------------------------------------------- Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001 Signed-off-by: Thomas Rast <trast@student.ethz.ch> Signed-off-by: Thomas Rast <trast@inf.ethz.ch> --- Adjusted for the concern that the commit might not be parsed yet. I think it's now more paranoid than the original code, since we cannot look at commit->parents without parsing. But it's really an almost free check. log-tree.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/log-tree.c b/log-tree.c index 5dc45c4..8a34332 100644 --- a/log-tree.c +++ b/log-tree.c @@ -792,11 +792,14 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log { int showed_log; struct commit_list *parents; - unsigned const char *sha1 = commit->object.sha1; + unsigned const char *sha1; if (!opt->diff && !DIFF_OPT_TST(&opt->diffopt, EXIT_WITH_STATUS)) return 0; + parse_commit(commit); + sha1 = commit->tree->object.sha1; + /* Root commit? */ parents = commit->parents; if (!parents) { @@ -819,7 +822,9 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log * parent, showing summary diff of the others * we merged _in_. */ - diff_tree_sha1(parents->item->object.sha1, sha1, "", &opt->diffopt); + parse_commit(parents->item); + diff_tree_sha1(parents->item->tree->object.sha1, + sha1, "", &opt->diffopt); log_tree_diff_flush(opt); return !opt->loginfo; } @@ -832,7 +837,9 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log for (;;) { struct commit *parent = parents->item; - diff_tree_sha1(parent->object.sha1, sha1, "", &opt->diffopt); + parse_commit(parent); + diff_tree_sha1(parent->tree->object.sha1, + sha1, "", &opt->diffopt); log_tree_diff_flush(opt); showed_log |= !opt->loginfo; -- 1.8.2.351.g867a5da ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v2] Avoid loading commits twice in log with diffs 2013-03-28 8:19 ` [PATCH v2] " Thomas Rast @ 2013-03-28 13:51 ` Jeff King 0 siblings, 0 replies; 5+ messages in thread From: Jeff King @ 2013-03-28 13:51 UTC (permalink / raw) To: Thomas Rast; +Cc: Junio C Hamano, git, Thomas Rast On Thu, Mar 28, 2013 at 09:19:34AM +0100, Thomas Rast wrote: > However, fixing the commits is easy at least at the band-aid level. > They are triggered by log_tree_diff() invoking diff_tree_sha1() on > commits, which duly loads the specified object to dereference it to a > tree. Since log_tree_diff() knows that it works with commits and they > must have trees, we can simply pass through the trees. Reading this made me wonder if we could be doing the optimization at a lower level, by re-using objects that we have in our lookup_object cache (which would include the commit->tree peeling that you're optimizing here). My first instinct was to look at read_object_with_reference, but it's a bit general; for trees, we can indeed find the buffer/size pair. But for commits, we must infer the size from the buffer (by using strlen). And for blobs, we do not win at all, as we do not cache the blob contents. Another option is for diff_tree_sha1 to use parse_tree_indirect. That will pull the commit object from the cache and avoid re-parsing it, as well as re-use any trees (although that should not happen much in a regular "log" traversal). That patch looks something like: diff --git a/tree-diff.c b/tree-diff.c index ba01563..db454bc 100644 --- a/tree-diff.c +++ b/tree-diff.c @@ -269,27 +269,28 @@ int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const cha int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base, struct diff_options *opt) { - void *tree1, *tree2; + struct tree *tree1, *tree2; struct tree_desc t1, t2; - unsigned long size1, size2; int retval; - tree1 = read_object_with_reference(old, tree_type, &size1, NULL); + tree1 = parse_tree_indirect(old); if (!tree1) die("unable to read source tree (%s)", sha1_to_hex(old)); - tree2 = read_object_with_reference(new, tree_type, &size2, NULL); + tree2 = parse_tree_indirect(new); if (!tree2) die("unable to read destination tree (%s)", sha1_to_hex(new)); - init_tree_desc(&t1, tree1, size1); - init_tree_desc(&t2, tree2, size2); + init_tree_desc(&t1, tree1->buffer, tree1->size); + init_tree_desc(&t2, tree2->buffer, tree2->size); retval = diff_tree(&t1, &t2, base, opt); if (!*base && DIFF_OPT_TST(opt, FOLLOW_RENAMES) && diff_might_be_rename()) { - init_tree_desc(&t1, tree1, size1); - init_tree_desc(&t2, tree2, size2); + init_tree_desc(&t1, tree1->buffer, tree1->size); + init_tree_desc(&t2, tree2->buffer, tree2->size); try_to_follow_renames(&t1, &t2, base, opt); } - free(tree1); - free(tree2); + /* + * Note that we are now filling up our cache with extra tree data; we + * could potentially unparse the tree objects. + */ return retval; } but it turns out to actually be slower! The problem is that parse_object will actually re-check the sha1 on every object it reads, which read_object_with_reference will not do. Using this hack: diff --git a/object.c b/object.c index 20703f5..361eb74 100644 --- a/object.c +++ b/object.c @@ -195,6 +195,7 @@ struct object *parse_object_or_die(const unsigned char *sha1, die(_("unable to parse object: %s"), name ? name : sha1_to_hex(sha1)); } +int parse_object_quick = 0; struct object *parse_object(const unsigned char *sha1) { unsigned long size; @@ -221,7 +222,8 @@ struct object *parse_object(const unsigned char *sha1) buffer = read_sha1_file(sha1, &type, &size); if (buffer) { - if (check_sha1_signature(repl, buffer, size, typename(type)) < 0) { + if (!parse_object_quick && + check_sha1_signature(repl, buffer, size, typename(type)) < 0) { free(buffer); error("sha1 mismatch %s", sha1_to_hex(repl)); return NULL; diff --git a/tree-diff.c b/tree-diff.c index db454bc..4b55a1a 100644 --- a/tree-diff.c +++ b/tree-diff.c @@ -267,18 +267,23 @@ int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const cha q->nr = 1; } +/* hacky */ +extern int parse_object_quick; + int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base, struct diff_options *opt) { struct tree *tree1, *tree2; struct tree_desc t1, t2; int retval; + parse_object_quick = 1; tree1 = parse_tree_indirect(old); if (!tree1) die("unable to read source tree (%s)", sha1_to_hex(old)); tree2 = parse_tree_indirect(new); if (!tree2) die("unable to read destination tree (%s)", sha1_to_hex(new)); + parse_object_quick = 0; init_tree_desc(&t1, tree1->buffer, tree1->size); init_tree_desc(&t2, tree2->buffer, tree2->size); retval = diff_tree(&t1, &t2, base, opt); my best-of-five "git log --raw" on git.git went from ~4.3s to ~4.1s, which is on par with what you saw. -Peff ^ permalink raw reply related [flat|nested] 5+ messages in thread
end of thread, other threads:[~2013-03-28 13:51 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-03-02 10:05 [PATCH] Avoid loading commits twice in log with diffs Thomas Rast 2013-03-03 7:06 ` Junio C Hamano 2013-03-04 9:58 ` Thomas Rast 2013-03-28 8:19 ` [PATCH v2] " Thomas Rast 2013-03-28 13:51 ` Jeff King
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).