From: Junio C Hamano <gitster@pobox.com>
To: Jeff King <peff@peff.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, git@vger.kernel.org
Subject: Re: [PATCH] diffcore-rename: cache file deltas
Date: Tue, 25 Sep 2007 14:29:07 -0700 [thread overview]
Message-ID: <7vy7eu4eos.fsf@gitster.siamese.dyndns.org> (raw)
In-Reply-To: 20070925192941.GA8564@coredump.intra.peff.net
Jeff King <peff@peff.net> writes:
> We find rename candidates by computing a fingerprint hash of
> each file, and then comparing those fingerprints. There are
> inherently O(n^2) comparisons, so it pays in CPU time to
> hoist the (rather expensive) computation of the fingerprint
> out of that loop (or to cache it once we have computed it once).
>
> Previously, we didn't keep the filespec information around
> because then we had the potential to consume a great deal of
> memory. However, instead of keeping all of the filespec
> data, we can instead just keep the fingerprint.
>
> This patch implements and uses diff_free_filespec_data_large
> to accomplish that goal. We also have to change
> estimate_similarity not to needlessly repopulate the
> filespec data when we already have the hash.
>
> Practical tests showed 4.5x speedup for a 10% memory usage
> increase.
>
> Signed-off-by: Jeff King <peff@peff.net>
Very nice.
> The implementation is a little less nice than I would like, but I was
> trying to be non-invasive. Specifically:
>
> - the name diff_free_filespec_data_large is horrible, but this is based
> on the fact that diff_free_filespec_data actually does too much (it
> frees the data _and_ some other auxiliary data). And renaming that
> would entail changing many callsites.
True. But we can rename it to diff_file_filespec_blob() and
that would perfectly well describe what it does. Will do so
when applying if it is Ok to you.
> - It seems that a better place to call diffcore_populate_filespec
> (rather than in estimate_similarity) would actually be in
> diffcore_count_changes (when we _know_ that we need to populate the
> contents data).
>
> - The hash_chars() should arguably be tied into
> diffcore_populate_filespec, which should have more of a "what
> information do you want" interface. I.e., the "size_only" parameter
> could be extended to a bitfield to say "populate this, and I need the
> delta fingerprint, size, actual contents, etc". Then callers could
> just use "populate" before looking at the filespec and it would
> lazily load whatever they needed.
Both good points, but I agree with you that it is wise to do
that with a follow-up patch.
next prev parent reply other threads:[~2007-09-25 21:29 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-09-25 19:29 [PATCH] diffcore-rename: cache file deltas Jeff King
2007-09-25 21:29 ` Junio C Hamano [this message]
2007-09-25 21:42 ` Jeff King
2007-10-03 1:36 ` Linus Torvalds
2007-10-03 4:10 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7vy7eu4eos.fsf@gitster.siamese.dyndns.org \
--to=gitster@pobox.com \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.