From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 2/2] diffcore-break: save cnt_data for other phases
Date: Thu, 19 Nov 2009 10:22:34 -0500 [thread overview]
Message-ID: <20091119152234.GA6877@coredump.intra.peff.net> (raw)
In-Reply-To: <7vvdhanp4v.fsf@alter.siamese.dyndns.org>
On Mon, Nov 16, 2009 at 01:20:00PM -0800, Junio C Hamano wrote:
> The change looks correct, but I initially got worried about your change
> interacts badly with this part in estimate_similarity() around line 176:
>
> if (!src->cnt_data && diff_populate_filespec(src, 0))
> return 0;
> if (!dst->cnt_data && diff_populate_filespec(dst, 0))
> return 0;
>
> I think the reason why it (not your patch but the original code) felt a
> bit brittle is because the above if() statements know too much about the
> internal of d-c-c (namely, it never looks at src->data when src->cnt_data
> is already available, so it is safe to leave src->data NULL).
No, I am counting on those lines to actually optimize out the count
later on. But I agree it is a bit brittle. Probably a better design
would have been for diffcore_count_changes to simply call a new
"diff_filespec_hash", which would assign to the cnt_data member and
perform any optimizations, just as we do already with
diff_populate_filespec and the actual blob contents.
I don't know if it is worth refactoring though. This is a pretty static
part of the code these days, and what is there now is working.
> The same logic suggests that the loop to build the matrix in
> diffcore_rename() could free two->data at the end of the innermost loop.
>
> We currently loop this way (around ll. 505-530):
>
> for each two (i.e. rename destination candidate):
> for each one (i.e. rename source candidate):
> compute and record how similar one and two are
> free one->data
> free two->data
>
> After computing how similar "one" and something else is only once, we have
> one->cnt_data so we won't need one->data (the same fact is exploited by
> your patch for optimization), and that is why freeing one->data in the
> innermost loop does not result in constant re-reading of the same blob
> data when we iterate more than one rename destination. But the same logic
> applies to "two" and we should be able to free the blob data early to
> reduce the memory pressure.
>
> I dunno if it would give us measurable performance benefit, though.
>
> Here is the idea on top of your patch, but I think it can be applied
> independently.
With your patch, I shaved 0.2 seconds off of a 46-second rename
detection (one of my pathological "rename and tweak the tags on a bunch
of photos" commits). To be fair, though, that commit is now quite old,
so some of that CPU time was spent assembling delta chains.
It should also relieve memory pressure for large repositories, though
that is not nearly as big an issue for renames as for breaks, as we tend
to have many fewer candidates.
I think it is worth applying. Though perhaps the free-ing should simply
happen in estimate_similarity, which is the function that actually
populates the filespec.
Also, I don't see where we ever actually free the cnt_data. After we run
the whole O(n^2) loop, we should be able to drop cnt_data entirely. I
think in practice it doesn't matter all that much, since it isn't that
big, and afterwards we just generate the patch and exit (unless we are
doing diffcore-break, too, which will actually free all of the data).
-Peff
next prev parent reply other threads:[~2009-11-19 15:22 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-11-16 15:53 [PATCH 0/2] diffcore-break optimizations Jeff King
2009-11-16 15:56 ` [PATCH 1/2] diffcore-break: free filespec data as we go Jeff King
2009-11-16 16:02 ` [PATCH 2/2] diffcore-break: save cnt_data for other phases Jeff King
2009-11-16 21:20 ` Junio C Hamano
2009-11-19 15:22 ` Jeff King [this message]
2009-11-20 7:32 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20091119152234.GA6877@coredump.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).