Re: [PATCH] diff-delta: bound hash list length to avoid O(m*n) behavior

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Nicolas Pitre <nico@cam.org>
To: Junio C Hamano <junkio@cox.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] diff-delta: bound hash list length to avoid O(m*n) behavior
Date: Wed, 01 Mar 2006 12:22:22 -0500 (EST)	[thread overview]
Message-ID: <Pine.LNX.4.64.0603011211370.29834@localhost.localdomain> (raw)
In-Reply-To: <7vmzgajvpl.fsf@assigned-by-dhcp.cox.net>

On Wed, 1 Mar 2006, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> >> I tried an experimental patch to cull collided hash buckets
> >> very aggressively.  I haven't applied your last "reuse index"
> >> patch, though -- I think that is orthogonal and I'd like to
> >> leave that to the next round.
> >
> > It is indeed orthogonal and I think you could apply it to the next 
> > branch without the other patches (it should apply with little problems).  
> > This is an obvious and undisputable gain, even more if pack-objects is 
> > reworked to reduce memory usage by keeping only one live index for 
> > multiple consecutive deltaattempts.
> 
> Umm.  The hash-index is rather huge, isn't it?  I did not
> realize it was two-pointer structure for every byte in the
> source material, and we typically delta from larger to smaller,
> so we will keep about 10x the unpacked source.  Until we swap
> the windowing around, that means about 100x the unpacked source
> with the default window size.

That's why I said that the window reversal has to be done as well to be 
effective.  As for the index itself it can be reduced to a single 
pointer since the "ptr" value can be deduced from the offset of the 
index entry.

> Also, I am not sure which one is more costly: hash-index
> building or use of that to search inside target.  I somehow got
> an impression that the former is relatively cheap, and that is
> what is being cached here.

Yes, but caching it saves 10% on CPU time, probably more when the window 
is swapped around due to less memory usage.

> > Let's suppose the reference buffer has:
> >  
> > ***********************************************************************/
> >...
> > One improvement might consist of counting the number of consecutive 
> > identical bytes when starting a compare, and manage to skip as many hash 
> > entries (minus the block size) before looping again with more entries in 
> > the same hash bucket.
> 
> Umm, again.  Consecutive identical bytes (BTW, I think "* * *"
> and "** ** **" patterns have the same collision issues without
> being consecutive bytes, so such an optimization may be trickier
> and cost more),

First, those "** ** **" are less frequent in general. Next, they will be 
spread amongst 3 hash buckets instead of all the same one.  And with 
large binary files with lots of zeroes then scanning over those areas in 
one pass instead of iterating over them from every offset would help 
enormously as well, even without limiting the hash list length.

 when emitted as literals, would compress well,
> wouldn't they?  At the end of the day, I think what matters is
> the size of deflated delta, since going to disk to read it out
> is more expensive than deflating and applying.  I think you made
> a suggestion along the same line, capping the max delta used by
> try_delta() more precisely by taking the deflated size into
> account.

Yes.  But deflating a bunch of characters will never be as dense as a 4 
byte delta sequence that might expand to hundreds.


Nicolas

next prev parent reply	other threads:[~2006-03-01 17:22 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-02-28  4:09 [PATCH] diff-delta: bound hash list length to avoid O(m*n) behavior Nicolas Pitre
2006-02-28  6:51 ` Junio C Hamano
2006-02-28  8:10   ` Junio C Hamano
2006-02-28 17:05     ` Nicolas Pitre
2006-03-01 10:38       ` Junio C Hamano
2006-03-01 17:22         ` Nicolas Pitre [this message]
2006-03-04  2:28     ` Junio C Hamano
2006-03-04  2:39       ` Junio C Hamano
2006-03-04  3:01       ` Nicolas Pitre
2006-03-04  6:21       ` Linus Torvalds
  -- strict thread matches above, loose matches on Subject: below --
2006-03-08 19:32 Nicolas Pitre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0603011211370.29834@localhost.localdomain \
    --to=nico@cam.org \
    --cc=git@vger.kernel.org \
    --cc=junkio@cox.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).