* [PATCH] Avoid loading commits twice in log with diffs
@ 2013-03-02 10:05 Thomas Rast
2013-03-03 7:06 ` Junio C Hamano
0 siblings, 1 reply; 5+ messages in thread
From: Thomas Rast @ 2013-03-02 10:05 UTC (permalink / raw)
To: git
If you run a log with diffs (such as -p, --raw, --stat etc.) the
current code ends up loading many objects twice. For example, for
'log -3000 -p' my instrumentation said the objects loaded more than
once are distributed as follows:
2008 blob
2103 commit
2678 tree
Fixing blobs and trees will be harder, because those are really used
within the diff engine and need some form of caching.
However, fixing the commits is easy at least at the band-aid level.
They are triggered by log_tree_diff() invoking diff_tree_sha1() on
commits, which duly loads the specified object to dereference it to a
tree. Since log_tree_diff() knows that it works with commits and they
must have trees, we can simply pass through the trees.
This has a quite dramatic effect on log --raw, though only a
negligible impact on log -p:
Test with patch before
--------------------------------------------------------------------
4000.2: log --raw -3000 0.50(0.43+0.06) 0.54(0.46+0.06) +7.0%***
4000.3: log -p -3000 2.34(2.20+0.13) 2.37(2.22+0.13) +1.2%
--------------------------------------------------------------------
Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---
log-tree.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/log-tree.c b/log-tree.c
index eb1a1b4..277a38f 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -715,7 +715,7 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
{
int showed_log;
struct commit_list *parents;
- unsigned const char *sha1 = commit->object.sha1;
+ unsigned const char *sha1 = commit->tree->object.sha1;
if (!opt->diff && !DIFF_OPT_TST(&opt->diffopt, EXIT_WITH_STATUS))
return 0;
@@ -742,7 +742,8 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
* parent, showing summary diff of the others
* we merged _in_.
*/
- diff_tree_sha1(parents->item->object.sha1, sha1, "", &opt->diffopt);
+ parse_commit(parents->item);
+ diff_tree_sha1(parents->item->tree->object.sha1, sha1, "", &opt->diffopt);
log_tree_diff_flush(opt);
return !opt->loginfo;
}
@@ -755,7 +756,8 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
for (;;) {
struct commit *parent = parents->item;
- diff_tree_sha1(parent->object.sha1, sha1, "", &opt->diffopt);
+ parse_commit(parent);
+ diff_tree_sha1(parent->tree->object.sha1, sha1, "", &opt->diffopt);
log_tree_diff_flush(opt);
showed_log |= !opt->loginfo;
--
1.8.2.rc1.393.ga167915
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH] Avoid loading commits twice in log with diffs
2013-03-02 10:05 [PATCH] Avoid loading commits twice in log with diffs Thomas Rast
@ 2013-03-03 7:06 ` Junio C Hamano
2013-03-04 9:58 ` Thomas Rast
2013-03-28 8:19 ` [PATCH v2] " Thomas Rast
0 siblings, 2 replies; 5+ messages in thread
From: Junio C Hamano @ 2013-03-03 7:06 UTC (permalink / raw)
To: Thomas Rast; +Cc: git
Thomas Rast <trast@student.ethz.ch> writes:
> Test with patch before
> --------------------------------------------------------------------
> 4000.2: log --raw -3000 0.50(0.43+0.06) 0.54(0.46+0.06) +7.0%***
> 4000.3: log -p -3000 2.34(2.20+0.13) 2.37(2.22+0.13) +1.2%
> --------------------------------------------------------------------
> Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001
It may be a silly question but what is a significance hint?
> Signed-off-by: Thomas Rast <trast@student.ethz.ch>
> ---
> log-tree.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/log-tree.c b/log-tree.c
> index eb1a1b4..277a38f 100644
> --- a/log-tree.c
> +++ b/log-tree.c
> @@ -715,7 +715,7 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
> {
> int showed_log;
> struct commit_list *parents;
> - unsigned const char *sha1 = commit->object.sha1;
> + unsigned const char *sha1 = commit->tree->object.sha1;
Overall I think this goes in the right direction and I can see why
the changes in later two hunks are correct, but I am not sure if we
can safely assume that the caller has parsed the incoming commit and
we have a good commit->tree here already.
Right now, this static function has a single call-site in a public
function log_tree_commit(), whose existing callers may all pass an
already parsed commit, but I feel somewhat uneasy to do the above
without some mechanism in place (either parse it here or in the
caller if unparsed, or document that log_tree_commit() must be
called with a parsed commit and perhaps add an assert there) to
ensure that the invariant is not broken in the future.
Thanks.
> if (!opt->diff && !DIFF_OPT_TST(&opt->diffopt, EXIT_WITH_STATUS))
> return 0;
> @@ -742,7 +742,8 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
> * parent, showing summary diff of the others
> * we merged _in_.
> */
> - diff_tree_sha1(parents->item->object.sha1, sha1, "", &opt->diffopt);
> + parse_commit(parents->item);
> + diff_tree_sha1(parents->item->tree->object.sha1, sha1, "", &opt->diffopt);
> log_tree_diff_flush(opt);
> return !opt->loginfo;
> }
> @@ -755,7 +756,8 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
> for (;;) {
> struct commit *parent = parents->item;
>
> - diff_tree_sha1(parent->object.sha1, sha1, "", &opt->diffopt);
> + parse_commit(parent);
> + diff_tree_sha1(parent->tree->object.sha1, sha1, "", &opt->diffopt);
> log_tree_diff_flush(opt);
>
> showed_log |= !opt->loginfo;
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] Avoid loading commits twice in log with diffs
2013-03-03 7:06 ` Junio C Hamano
@ 2013-03-04 9:58 ` Thomas Rast
2013-03-28 8:19 ` [PATCH v2] " Thomas Rast
1 sibling, 0 replies; 5+ messages in thread
From: Thomas Rast @ 2013-03-04 9:58 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Jakub Narebski
Junio C Hamano <gitster@pobox.com> writes:
> Thomas Rast <trast@student.ethz.ch> writes:
>
>> Test with patch before
>> --------------------------------------------------------------------
>> 4000.2: log --raw -3000 0.50(0.43+0.06) 0.54(0.46+0.06) +7.0%***
>> 4000.3: log -p -3000 2.34(2.20+0.13) 2.37(2.22+0.13) +1.2%
>> --------------------------------------------------------------------
>> Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001
>
> It may be a silly question but what is a significance hint?
That's from my still-not-rerolled perf improvements series from, ahem,
ages ago:
http://thread.gmane.org/gmane.comp.version-control.git/192884
I stole the idea from R (and in fact use R to compute it). The ***
tells you that the probability of the 7% difference is attributable to
chance only with 0.1% probability, or in other words, it's highly likely
that the difference is *not* down to chance. On the other hand, note
that the p4000.3 measurements do not show a significant difference
(significance hint is absent).
The statistical background: Assume that the two series of measurements
are drawn from two normal distributions (possibly with different
mean/variance). Welch's t-test estimates the probability of the null
hypothesis that the two distributions in fact had the same mean. If you
can reject the null hypothesis, you have essentially proven that there's
*some* difference in the timing runs. (Hopefully for the better, and
hopefully _not_ caused just by CPU scaling or such.)
By the way, in the above test Jakub pointed me at the Dumbbench perl
module. I've had a look at the ideas within, and I'll try to put some
sample rejection along their lines into perf-lib. However, several
things make the module itself rather deserve the name. Most
prominently, I can only get it to print timings to stdout in a fixed
format designed for human consumption.
>> --- a/log-tree.c
>> +++ b/log-tree.c
>> @@ -715,7 +715,7 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
>> {
>> int showed_log;
>> struct commit_list *parents;
>> - unsigned const char *sha1 = commit->object.sha1;
>> + unsigned const char *sha1 = commit->tree->object.sha1;
>
> Overall I think this goes in the right direction and I can see why
> the changes in later two hunks are correct, but I am not sure if we
> can safely assume that the caller has parsed the incoming commit and
> we have a good commit->tree here already.
>
> Right now, this static function has a single call-site in a public
> function log_tree_commit(), whose existing callers may all pass an
> already parsed commit, but I feel somewhat uneasy to do the above
> without some mechanism in place (either parse it here or in the
> caller if unparsed, or document that log_tree_commit() must be
> called with a parsed commit and perhaps add an assert there) to
> ensure that the invariant is not broken in the future.
In that case I'll reroll with the parsing -- it will have about the same
cost as the assertion, since it'll just check ->object.parsed and
return.
--
Thomas Rast
trast@{inf,student}.ethz.ch
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH v2] Avoid loading commits twice in log with diffs
2013-03-03 7:06 ` Junio C Hamano
2013-03-04 9:58 ` Thomas Rast
@ 2013-03-28 8:19 ` Thomas Rast
2013-03-28 13:51 ` Jeff King
1 sibling, 1 reply; 5+ messages in thread
From: Thomas Rast @ 2013-03-28 8:19 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Thomas Rast
If you run a log with diffs (such as -p, --raw, --stat etc.) the
current code ends up loading many objects twice. For example, for
'log -3000 -p' my instrumentation said the objects loaded more than
once are distributed as follows:
2008 blob
2103 commit
2678 tree
Fixing blobs and trees will be harder, because those are really used
within the diff engine and need some form of caching.
However, fixing the commits is easy at least at the band-aid level.
They are triggered by log_tree_diff() invoking diff_tree_sha1() on
commits, which duly loads the specified object to dereference it to a
tree. Since log_tree_diff() knows that it works with commits and they
must have trees, we can simply pass through the trees.
We add some parse_commit() calls. The ones for the parents are
required; we do not know at this stage if they have been looked at.
The one for the commit itself is pure paranoia, but has about the same
cost as an assertion on commit->object.parsed.
This has a quite dramatic effect on log --raw, though only a
negligible impact on log -p:
Test this tree HEAD
--------------------------------------------------------------------
4000.2: log --raw -3000 0.50(0.43+0.06) 0.54(0.46+0.06) +7.0%***
4000.3: log -p -3000 2.34(2.20+0.13) 2.37(2.22+0.13) +1.2%
--------------------------------------------------------------------
Significance hints: '.' 0.1 '*' 0.05 '**' 0.01 '***' 0.001
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Thomas Rast <trast@inf.ethz.ch>
---
Adjusted for the concern that the commit might not be parsed yet. I
think it's now more paranoid than the original code, since we cannot
look at commit->parents without parsing. But it's really an almost
free check.
log-tree.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/log-tree.c b/log-tree.c
index 5dc45c4..8a34332 100644
--- a/log-tree.c
+++ b/log-tree.c
@@ -792,11 +792,14 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
{
int showed_log;
struct commit_list *parents;
- unsigned const char *sha1 = commit->object.sha1;
+ unsigned const char *sha1;
if (!opt->diff && !DIFF_OPT_TST(&opt->diffopt, EXIT_WITH_STATUS))
return 0;
+ parse_commit(commit);
+ sha1 = commit->tree->object.sha1;
+
/* Root commit? */
parents = commit->parents;
if (!parents) {
@@ -819,7 +822,9 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
* parent, showing summary diff of the others
* we merged _in_.
*/
- diff_tree_sha1(parents->item->object.sha1, sha1, "", &opt->diffopt);
+ parse_commit(parents->item);
+ diff_tree_sha1(parents->item->tree->object.sha1,
+ sha1, "", &opt->diffopt);
log_tree_diff_flush(opt);
return !opt->loginfo;
}
@@ -832,7 +837,9 @@ static int log_tree_diff(struct rev_info *opt, struct commit *commit, struct log
for (;;) {
struct commit *parent = parents->item;
- diff_tree_sha1(parent->object.sha1, sha1, "", &opt->diffopt);
+ parse_commit(parent);
+ diff_tree_sha1(parent->tree->object.sha1,
+ sha1, "", &opt->diffopt);
log_tree_diff_flush(opt);
showed_log |= !opt->loginfo;
--
1.8.2.351.g867a5da
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v2] Avoid loading commits twice in log with diffs
2013-03-28 8:19 ` [PATCH v2] " Thomas Rast
@ 2013-03-28 13:51 ` Jeff King
0 siblings, 0 replies; 5+ messages in thread
From: Jeff King @ 2013-03-28 13:51 UTC (permalink / raw)
To: Thomas Rast; +Cc: Junio C Hamano, git, Thomas Rast
On Thu, Mar 28, 2013 at 09:19:34AM +0100, Thomas Rast wrote:
> However, fixing the commits is easy at least at the band-aid level.
> They are triggered by log_tree_diff() invoking diff_tree_sha1() on
> commits, which duly loads the specified object to dereference it to a
> tree. Since log_tree_diff() knows that it works with commits and they
> must have trees, we can simply pass through the trees.
Reading this made me wonder if we could be doing the optimization at a
lower level, by re-using objects that we have in our lookup_object cache
(which would include the commit->tree peeling that you're optimizing
here).
My first instinct was to look at read_object_with_reference, but it's a
bit general; for trees, we can indeed find the buffer/size pair. But for
commits, we must infer the size from the buffer (by using strlen). And
for blobs, we do not win at all, as we do not cache the blob contents.
Another option is for diff_tree_sha1 to use parse_tree_indirect. That
will pull the commit object from the cache and avoid re-parsing it, as
well as re-use any trees (although that should not happen much in a
regular "log" traversal).
That patch looks something like:
diff --git a/tree-diff.c b/tree-diff.c
index ba01563..db454bc 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -269,27 +269,28 @@ int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const cha
int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base, struct diff_options *opt)
{
- void *tree1, *tree2;
+ struct tree *tree1, *tree2;
struct tree_desc t1, t2;
- unsigned long size1, size2;
int retval;
- tree1 = read_object_with_reference(old, tree_type, &size1, NULL);
+ tree1 = parse_tree_indirect(old);
if (!tree1)
die("unable to read source tree (%s)", sha1_to_hex(old));
- tree2 = read_object_with_reference(new, tree_type, &size2, NULL);
+ tree2 = parse_tree_indirect(new);
if (!tree2)
die("unable to read destination tree (%s)", sha1_to_hex(new));
- init_tree_desc(&t1, tree1, size1);
- init_tree_desc(&t2, tree2, size2);
+ init_tree_desc(&t1, tree1->buffer, tree1->size);
+ init_tree_desc(&t2, tree2->buffer, tree2->size);
retval = diff_tree(&t1, &t2, base, opt);
if (!*base && DIFF_OPT_TST(opt, FOLLOW_RENAMES) && diff_might_be_rename()) {
- init_tree_desc(&t1, tree1, size1);
- init_tree_desc(&t2, tree2, size2);
+ init_tree_desc(&t1, tree1->buffer, tree1->size);
+ init_tree_desc(&t2, tree2->buffer, tree2->size);
try_to_follow_renames(&t1, &t2, base, opt);
}
- free(tree1);
- free(tree2);
+ /*
+ * Note that we are now filling up our cache with extra tree data; we
+ * could potentially unparse the tree objects.
+ */
return retval;
}
but it turns out to actually be slower! The problem is that parse_object
will actually re-check the sha1 on every object it reads, which
read_object_with_reference will not do.
Using this hack:
diff --git a/object.c b/object.c
index 20703f5..361eb74 100644
--- a/object.c
+++ b/object.c
@@ -195,6 +195,7 @@ struct object *parse_object_or_die(const unsigned char *sha1,
die(_("unable to parse object: %s"), name ? name : sha1_to_hex(sha1));
}
+int parse_object_quick = 0;
struct object *parse_object(const unsigned char *sha1)
{
unsigned long size;
@@ -221,7 +222,8 @@ struct object *parse_object(const unsigned char *sha1)
buffer = read_sha1_file(sha1, &type, &size);
if (buffer) {
- if (check_sha1_signature(repl, buffer, size, typename(type)) < 0) {
+ if (!parse_object_quick &&
+ check_sha1_signature(repl, buffer, size, typename(type)) < 0) {
free(buffer);
error("sha1 mismatch %s", sha1_to_hex(repl));
return NULL;
diff --git a/tree-diff.c b/tree-diff.c
index db454bc..4b55a1a 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -267,18 +267,23 @@ int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const cha
q->nr = 1;
}
+/* hacky */
+extern int parse_object_quick;
+
int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base, struct diff_options *opt)
{
struct tree *tree1, *tree2;
struct tree_desc t1, t2;
int retval;
+ parse_object_quick = 1;
tree1 = parse_tree_indirect(old);
if (!tree1)
die("unable to read source tree (%s)", sha1_to_hex(old));
tree2 = parse_tree_indirect(new);
if (!tree2)
die("unable to read destination tree (%s)", sha1_to_hex(new));
+ parse_object_quick = 0;
init_tree_desc(&t1, tree1->buffer, tree1->size);
init_tree_desc(&t2, tree2->buffer, tree2->size);
retval = diff_tree(&t1, &t2, base, opt);
my best-of-five "git log --raw" on git.git went from ~4.3s to ~4.1s,
which is on par with what you saw.
-Peff
^ permalink raw reply related [flat|nested] 5+ messages in thread
end of thread, other threads:[~2013-03-28 13:51 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-03-02 10:05 [PATCH] Avoid loading commits twice in log with diffs Thomas Rast
2013-03-03 7:06 ` Junio C Hamano
2013-03-04 9:58 ` Thomas Rast
2013-03-28 8:19 ` [PATCH v2] " Thomas Rast
2013-03-28 13:51 ` Jeff King
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).