git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Andy C" <andychup@gmail.com>
To: "Jeff King" <peff@peff.net>
Cc: git@vger.kernel.org,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Junio C Hamano" <gitster@pobox.com>
Subject: Re: [PATCH/RFC 0/3] faster inexact rename handling
Date: Tue, 30 Oct 2007 17:06:02 -0700	[thread overview]
Message-ID: <596909b30710301706u78a326a4y88afac92f1480cec@mail.gmail.com> (raw)
In-Reply-To: <20071030042118.GA14729@sigill.intra.peff.net>

Sorry I have been AWOL... I was going to try to work on this, but I
got abjectly sick (long story).  But it's great to see this out.


On 10/29/07, Jeff King <peff@peff.net> wrote:
> This is my first stab at faster rename handling based on Andy's code.
> The patches are on top of next (to get Linus' recent work on exact
> renames). Most of the interesting stuff is in 2/3.
>
>   1/3: extension of hash interface
>   2/3: similarity detection code
>   3/3: integrate similarity detection into diffcore-rename
>
> The implementation is pretty basic, so I think there is room for
> code optimization (50% of the time is spent in hash lookups, so we might
> be able to micro-optimize that) as well as algorithmic improvements (like the
> sampling Andy mentioned).

For microoptimization, I was thinking that the hash tables could be
implemented without pointers per value (or memory allocation per
value), so everything is in a contiguous block of memory.   In C++ you
can do this trivially by declaring a small struct as the second
template parameter of the container; in C I guess you can simulate it
with a macro or something.

For the inverted indexing step, the values in the hash are going to be
quite small, especially if line_threshold=1.  Then you only need 2
integers for the left side and right side == 4 integers.  The integers
could just be indexes into the lists (like the current code uses).

For the count matrix step, the values are just going to be integers,
so storing it right in the hash table makes sense.

The sampling should be only necessary for binary files, I think.


> With these patches, I can get my monster binary diff down from about 2
> minutes to 17 seconds. And comparing all of linux-2.4 to all of
> linux-2.6 (similar to Andy's previous demo) takes about 10 seconds.

Hopefully that should be close to just reading the files off disk.
The algorithm should take a fraction of the time that simply reading
the files does, which presumably a git diff has to do.

I was timing that by comparing it to doing a "| xargs wc -l" on the
lists of files.


> There are a few downsides:
>   - the current implementation tends to give lower similarity values
>     compared to the old code (see discussion in 2/3), but this should be
>     tweakable
>   - on large datasets, it's more memory hungry than the old code because
>     the hash grows very large. This can be helped by bumping up the
>     binary chunk size (actually, the 17 seconds quoted above is using
>     256-byte chunks rather than 64-byte -- with 64-byte chunks, it's
>     more like 24 seconds) as well as sampling.
>   - no improvement on smaller datasets. Running "git-whatchanged -M
>     --raw -l0" on the linux-2.6 repo takes about the same time with the
>     old and new code (presumably the algorithmic savings of the new code
>     are lost in a higher constant factor, so when n is small, it is a
>     wash).

I think the old code tries to respect the cache as much as possible,
from what I can tell.  The new code has to use hash tables which are
unpredictable of course.  Though for smaller data sets I would expect
the hash table to fit in cache.  What's your definition of small here?
 Are you sure the old code isn't triggering one of the limits that was
there?

thanks,
Andy

      parent reply	other threads:[~2007-10-31  0:06 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-10-30  4:21 [PATCH/RFC 0/3] faster inexact rename handling Jeff King
2007-10-30  4:23 ` [PATCH/RFC 1/3] change hash table calling conventions Jeff King
2007-10-30  4:24 ` [PATCH/RFC 2/3] introduce generic similarity library Jeff King
2007-10-30  4:24 ` [PATCH/RFC 3/3] handle renames using similarity engine Jeff King
2007-10-30  5:06 ` [PATCH/RFC 0/3] faster inexact rename handling Linus Torvalds
2007-10-30  8:29   ` Junio C Hamano
2007-10-30 13:46     ` Jeff King
2007-10-30 13:43   ` Jeff King
2007-10-30 15:38     ` Linus Torvalds
2007-10-30 20:20       ` Jeff King
2007-10-30 20:39         ` Linus Torvalds
2007-10-31  0:27         ` Andy C
2007-10-31  0:06 ` Andy C [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=596909b30710301706u78a326a4y88afac92f1480cec@mail.gmail.com \
    --to=andychup@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).