Re: [PATCH] Rename detection: Avoid repeated filespec population

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jeff King <peff@peff.net>
To: "Björn Steinbrink" <B.Steinbrink@gmx.de>
Cc: Junio C Hamano <gitster@pobox.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	git@vger.kernel.org
Subject: Re: [PATCH] Rename detection: Avoid repeated filespec population
Date: Tue, 20 Jan 2009 16:27:23 -0500	[thread overview]
Message-ID: <20090120212723.GA10967@coredump.intra.peff.net> (raw)
In-Reply-To: <20090120155957.GA23237@atjola.homenet>

On Tue, Jan 20, 2009 at 04:59:57PM +0100, Björn Steinbrink wrote:

> In diffcore_rename, we assume that the blob contents in the filespec
> aren't required anymore after estimate_similarity has been called and thus
> we free it. But estimate_similarity might return early when the file sizes
> differ too much. In that case, cnt_data is never set and the next call to
> estimate_similarity will populate the filespec again, eventually rereading
> the same blob over and over again.
> 
> To fix that, we first get the blob sizes and only when the blob contents
> are actually required, and when cnt_data will be set, the full filespec is
> populated, once.

I think this is a sane thing to do, and obviously your numbers show an
impressive improvement.

However, I found your explanation a little confusing. Yes, repeatedly
loading those blobs is a problem. But what this is really about is that
estimate_similarity has two levels of checks: cheap checks involving the
size, and expensive checks involving the content. What is happening now
is that we are doing extra work for the expensive checks, even if the
cheap checks are going to let us fail early.

And that's just stupid, and a waste of processing time even if we
_don't_ ever look at the same filespec again.

But what makes it even worse is that we have a system in place to cache
the expensive work, but because we only do part of it, we don't bother
to cache the result. And that's what your patch description is about.

So I think your patch is absolutely the right thing to do. But I think
from the commit message it isn't clear that it would not be equally
correct to follow through on generating cnt_data instead of an early
return (which _isn't_ right, because you might not need to generate
cnt_data at all).

> This actually affects the copy detection way more than the pure
> rename detection, due to the larger number of candidates, but it's the
> same code, and I only realized that when I reran the stuff to get some
> numbers to show off. ;-)

I have a pathological real-world rename test case from a long time ago.
It's so awful on rename because almost the whole repo was reorganized,
but almost every path got its content adjusted, too. I was hoping it
would be better with your patch, but it isn't: as it turns out, it is
_too_ uniform. The cheap size checks don't work because all of the files
are almost the same size (they're all jpgs from a camera).

But I think for non-pathological cases, your improvement makes sense.

-Peff

next prev parent reply	other threads:[~2009-01-20 22:28 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-01-20 15:59 [PATCH] Rename detection: Avoid repeated filespec population Björn Steinbrink
2009-01-20 21:27 ` Jeff King [this message]
2009-01-20 22:12   ` Jeff King
2009-01-21 12:56   ` Björn Steinbrink
2009-01-21 13:32     ` Jeff King
2009-01-21 10:27 ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090120212723.GA10967@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=B.Steinbrink@gmx.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).