git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH/RFC 0/3] faster inexact rename handling
@ 2007-10-30  4:21 Jeff King
  2007-10-30  4:23 ` [PATCH/RFC 1/3] change hash table calling conventions Jeff King
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Jeff King @ 2007-10-30  4:21 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, Andy C, Junio C Hamano

This is my first stab at faster rename handling based on Andy's code.
The patches are on top of next (to get Linus' recent work on exact
renames). Most of the interesting stuff is in 2/3.

  1/3: extension of hash interface
  2/3: similarity detection code
  3/3: integrate similarity detection into diffcore-rename

The implementation is pretty basic, so I think there is room for
code optimization (50% of the time is spent in hash lookups, so we might
be able to micro-optimize that) as well as algorithmic improvements (like the
sampling Andy mentioned).

With these patches, I can get my monster binary diff down from about 2
minutes to 17 seconds. And comparing all of linux-2.4 to all of
linux-2.6 (similar to Andy's previous demo) takes about 10 seconds.

There are a few downsides:
  - the current implementation tends to give lower similarity values
    compared to the old code (see discussion in 2/3), but this should be
    tweakable
  - on large datasets, it's more memory hungry than the old code because
    the hash grows very large. This can be helped by bumping up the
    binary chunk size (actually, the 17 seconds quoted above is using
    256-byte chunks rather than 64-byte -- with 64-byte chunks, it's
    more like 24 seconds) as well as sampling.
  - no improvement on smaller datasets. Running "git-whatchanged -M
    --raw -l0" on the linux-2.6 repo takes about the same time with the
    old and new code (presumably the algorithmic savings of the new code
    are lost in a higher constant factor, so when n is small, it is a
    wash).

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2007-10-31  0:27 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-30  4:21 [PATCH/RFC 0/3] faster inexact rename handling Jeff King
2007-10-30  4:23 ` [PATCH/RFC 1/3] change hash table calling conventions Jeff King
2007-10-30  4:24 ` [PATCH/RFC 2/3] introduce generic similarity library Jeff King
2007-10-30  4:24 ` [PATCH/RFC 3/3] handle renames using similarity engine Jeff King
2007-10-30  5:06 ` [PATCH/RFC 0/3] faster inexact rename handling Linus Torvalds
2007-10-30  8:29   ` Junio C Hamano
2007-10-30 13:46     ` Jeff King
2007-10-30 13:43   ` Jeff King
2007-10-30 15:38     ` Linus Torvalds
2007-10-30 20:20       ` Jeff King
2007-10-30 20:39         ` Linus Torvalds
2007-10-31  0:27         ` Andy C
2007-10-31  0:06 ` Andy C

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).