From: Nicolas Pitre <nico@cam.org>
To: David Kastrup <dak@gnu.org>
Cc: Jon Smirl <jonsmirl@gmail.com>, Git Mailing List <git@vger.kernel.org>
Subject: Re: Performance problem, long run of identical hashes
Date: Mon, 10 Dec 2007 15:11:15 -0500 (EST)
Message-ID: <alpine.LFD.0.99999.0712101458140.555@xanadu.home>
In-Reply-To: <85tzmqnxua.fsf@lola.goethe.zz>

On Mon, 10 Dec 2007, David Kastrup wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > On Mon, 10 Dec 2007, Jon Smirl wrote:
> >
> >> I added some debug printfs which show that I
> >> have a 100,000+ run of identical hash entries. Processing the 100,000
> >> entries also causes RAM consumption to explode.
> >
> > That is impossible.  If you look at the code where those hash entries 
> > are created in create_delta_index(), you'll notice a hard limit of 
> > HASH_LIMIT (currently 64) is imposed on the number of identical hash 
> > entries.
> 
> On the other hand, if we have, say, laaaaaarge streaks of zeros, what
> happens is that we have 64 hashes seeing them.  Now about 4096 bytes are
> compared, and then the comparison stops.

No.

What would happen in that case is that the first hash entry pointing
somewhere near the beginning of the zero chunk will match _at least_
4096 bytes, probably much more, up to the end of the buffer if that is
possible.
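
For illustration, here is a minimal sketch of that forward matching
step, loosely modeled on the inner loop of create_delta() in
diff-delta.c; the function name and signature are simplifications of
mine, not the actual code:

static unsigned long forward_match(const unsigned char *ref,
                                   const unsigned char *ref_top,
                                   const unsigned char *src,
                                   const unsigned char *src_top)
{
        /* Extend a candidate match byte by byte until a mismatch or
         * either buffer ends.  The per-bucket candidate list feeding
         * this is already capped at HASH_LIMIT (64) entries by
         * create_delta_index(). */
        const unsigned char *start = ref;
        unsigned long n = ref_top - ref;

        if (n > (unsigned long)(src_top - src))
                n = src_top - src;      /* don't run past the target */
        while (n-- && *src++ == *ref)
                ref++;                  /* stops only on mismatch or end */
        return ref - start;             /* the match length */
}

So for a huge run of zeroes present in both buffers, this returns the
whole run (or whatever remains of either buffer), not 4096.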

> Then it scans backwards to
> seek for more zeros (and finds about 64k of them before it stops)

The first hash entry for those zeroes is going to point at an offset
no greater than 15 bytes from the start of the zero area, since index
entries are created on fixed 16-byte block boundaries.  So, unless
both buffers already match before that zero area, the backward scan
will stop quickly.
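
As noted, that bound comes from the index stride: one entry per fixed
16-byte block (RABIN_WINDOW in diff-delta.c).  A hedged sketch of the
stride only -- hash_block() and add_index_entry() are hypothetical
stand-ins for the real routines, and the real loop in
create_delta_index() walks the buffer in a different order:

#define RABIN_WINDOW 16

static void build_index(struct delta_index *index,
                        const unsigned char *buf, unsigned long size)
{
        const unsigned char *ptr;

        /* One entry per 16-byte block, so the first block entirely
         * inside a run of zeroes starts at most 15 bytes past the
         * run's beginning. */
        for (ptr = buf; ptr + RABIN_WINDOW <= buf + size;
             ptr += RABIN_WINDOW) {
                unsigned int val = hash_block(ptr);   /* hypothetical */
                add_index_entry(index, val, ptr);     /* hypothetical */
        }
}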

> folds them into the current compacted form.  Each of these backward
> scans (of which we have 64 in the worst case) is in a different memory
> area.  So since we scan/compare areas of 64k for each advance of 4k, we
> have an overscanning factor of 16 (for a worst case scenario).

Actually, if we match 100MB of zeroes, we'll just encode that as a
series of 64KB copy operations, and continue scanning the buffer after
that 100MB.

So I don't see where your claim of scanning/comparing 64k areas for
each 4k advance comes from.
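
As a concrete sketch of that encoding, assuming the 64KB-per-op cap of
git's delta format, with emit_copy() as a hypothetical helper standing
in for the actual opcode/offset/size byte packing in create_delta():

static void emit_match(unsigned long moff, unsigned long msize)
{
        /* One long match becomes several copy ops of at most 0x10000
         * (64KB) bytes each; a 100MB match is ~1600 such ops, and the
         * scan then resumes after the matched region -- there is no
         * per-4k rescanning of the zero area. */
        while (msize > 0) {
                unsigned long chunk = msize < 0x10000 ? msize : 0x10000;
                emit_copy(moff, chunk);               /* hypothetical */
                moff  += chunk;
                msize -= chunk;
        }
}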

> Not sure whether this is what we are seeing here.  It would still not
> explain exploding memory usage I think.

Right.


Nicolas

Thread overview: 7+ messages
2007-12-10 15:07 Performance problem, long run of identical hashes Jon Smirl
2007-12-10 15:45 ` Nicolas Pitre
2007-12-10 16:14   ` David Kastrup
2007-12-10 16:20   ` Jon Smirl
2007-12-10 16:30     ` Nicolas Pitre
2007-12-10 19:39   ` David Kastrup
2007-12-10 20:11     ` Nicolas Pitre [this message]
