Re: [PATCH] diff-delta: produce optimal pack data

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Nicolas Pitre <nico@cam.org>
To: Junio C Hamano <junkio@cox.net>
Cc: git@vger.kernel.org, Linus Torvalds <torvalds@osdl.org>,
	Carl Baldwin <cnb@fc.hp.com>
Subject: Re: [PATCH] diff-delta: produce optimal pack data
Date: Fri, 24 Feb 2006 16:39:42 -0500 (EST)	[thread overview]
Message-ID: <Pine.LNX.4.64.0602241613030.31162@localhost.localdomain> (raw)
In-Reply-To: <7vpslc8oni.fsf@assigned-by-dhcp.cox.net>

On Fri, 24 Feb 2006, Junio C Hamano wrote:

> I haven't looked at Nico's original or updated code closely at
> all, but two things come to mind.
> 
> (1) if we could tell the particular data is intrinsically
>     diff_delta unfriendly and diff_delta would waste too much
>     time when tried to delta against almost _any_ other blob,
>     then it might help to give an interface in diff-delta.c for
>     the caller to check for such a blob without even trying
>     diff_delta.
> 
> (2) otherwise, if diff_delta could detect it would spend too
>     many cycles to finish its work for a particular input early
>     on, we might want it to bail out instead of trying a
>     complete job.

I have a patch that implements an hybrid approach.

Currently, diff-delta takes blocks of data in the reference file and 
hash them.  When the target file is scanned, it uses the hash to match 
blocks from the target file with the reference file.

If blocks are hashed evenly the cost of  producing a delta is at most 
O(n+m) where n and m are the size of the reference and target files 
respectively.  In other words, with good data set the cost is linear.

But if many blocks from the reference buffer do hash to the same bucket 
then for each block in the target file many blocks from the reference 
buffer have to be tested against, making it tend towards O(n^m) which is 
pretty highly exponential.

The solution I'm investigating is to put a limit on the number of 
entries in the same hash bucket so to bring the cost back to something 
more linear.  That means the delta might miss on better matches that 
have not been hashed but still benefit from a limited set. Experience 
seems to show that the time to deltify the first two blobs you found to 
be problematic can be reduced by 2 orders of magnitude with about only 
10% increase in the resulting delta size, and still nearly 40% smaller 
than what the current delta code produces.

The question is how to determine the best limit on the number of entries 
in the same hash bucket.

Nicolas

next prev parent reply	other threads:[~2006-02-24 21:41 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-02-22  1:45 [PATCH] diff-delta: produce optimal pack data Nicolas Pitre
2006-02-24  8:49 ` Junio C Hamano
2006-02-24 15:37   ` Nicolas Pitre
2006-02-24 23:55     ` Junio C Hamano
2006-02-24 17:44   ` Carl Baldwin
2006-02-24 17:56     ` Nicolas Pitre
2006-02-24 18:35       ` Carl Baldwin
2006-02-24 18:57         ` Nicolas Pitre
2006-02-24 19:23           ` Carl Baldwin
2006-02-24 20:02             ` Nicolas Pitre
2006-02-24 20:40               ` Carl Baldwin
2006-02-24 21:12                 ` Nicolas Pitre
2006-02-24 22:50                   ` Carl Baldwin
2006-02-25  3:53                     ` Nicolas Pitre
2006-02-24 20:02             ` Linus Torvalds
2006-02-24 20:19               ` Nicolas Pitre
2006-02-24 20:53               ` Junio C Hamano
2006-02-24 21:39                 ` Nicolas Pitre [this message]
2006-02-24 21:48                   ` Nicolas Pitre
2006-02-25  0:45                   ` Linus Torvalds
2006-02-25  3:07                     ` Nicolas Pitre
2006-02-25  4:05                       ` Linus Torvalds
2006-02-25  5:10                         ` Nicolas Pitre
2006-02-25  5:35                           ` Nicolas Pitre
2006-03-07 23:48                             ` [RFH] zlib gurus out there? Junio C Hamano
2006-03-08  0:59                               ` Linus Torvalds
2006-03-08  1:22                                 ` Junio C Hamano
2006-03-08  2:00                                   ` Linus Torvalds
2006-03-08  9:46                                     ` Johannes Schindelin
2006-03-08 10:45                               ` [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data Sergey Vlasov
2006-03-08 11:04                                 ` Junio C Hamano
2006-03-08 14:17                                   ` Sergey Vlasov
2006-02-25 19:18                           ` [PATCH] diff-delta: produce optimal pack data Linus Torvalds
2006-02-24 18:49       ` Carl Baldwin
2006-02-24 19:03         ` Nicolas Pitre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Pine.LNX.4.64.0602241613030.31162@localhost.localdomain \
    --to=nico@cam.org \
    --cc=cnb@fc.hp.com \
    --cc=git@vger.kernel.org \
    --cc=junkio@cox.net \
    --cc=torvalds@osdl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).