From: Linus Torvalds <torvalds@osdl.org>
To: Nicolas Pitre <nico@cam.org>
Cc: Junio C Hamano <junkio@cox.net>,
	git@vger.kernel.org, Carl Baldwin <cnb@fc.hp.com>
Subject: Re: [PATCH] diff-delta: produce optimal pack data
Date: Fri, 24 Feb 2006 16:45:23 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0602241637480.22647@g5.osdl.org>
In-Reply-To: <Pine.LNX.4.64.0602241613030.31162@localhost.localdomain>



On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> 
> Currently, diff-delta takes blocks of data in the reference file and 
> hash them.  When the target file is scanned, it uses the hash to match 
> blocks from the target file with the reference file.
> 
> If blocks are hashed evenly the cost of producing a delta is at most 
> O(n+m) where n and m are the size of the reference and target files 
> respectively.  In other words, with good data set the cost is linear.

Assuming the hash is good, of course.

I think this was the problem with you trying something simpler than 
adler32..
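
For reference, the textbook Adler-32 checksum mentioned above can be 
sketched as follows. This is only the standard definition; the function 
name adler32_sketch is made up for illustration, and it is not claimed 
to be the code diff-delta actually uses (a real block matcher would 
also update the sums incrementally as the window slides).

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u  /* largest prime below 2^16 */

/*
 * Plain Adler-32 over a buffer: 'a' is 1 plus the sum of all bytes,
 * 'b' is the sum of every intermediate value of 'a', both mod 65521.
 * The name is hypothetical, chosen to avoid clashing with zlib's adler32().
 */
static uint32_t adler32_sketch(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}
```

Because 'b' weights each byte by its position, two blocks holding the 
same bytes in a different order usually get different values, which a 
plain byte sum would not manage.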

> But if many blocks from the reference buffer do hash to the same bucket 
> then for each block in the target file many blocks from the reference 
> buffer have to be tested against, making it tend towards O(n^m) which is 
> pretty highly exponential.
> 
> The solution I'm investigating is to put a limit on the number of 
> entries in the same hash bucket so to bring the cost back to something 
> more linear.  That means the delta might miss on better matches that 
> have not been hashed but still benefit from a limited set.

Sounds fair enough.
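
A minimal sketch of the bucket limit described in the quoted text, 
assuming a chained hash table over reference-block offsets. All names 
here (ref_index, index_add, BUCKET_LIMIT) and the sizes are 
hypothetical, not taken from diff-delta.c:

```c
#include <stddef.h>
#include <stdlib.h>

#define HASH_BITS    4
#define HASH_SIZE    (1u << HASH_BITS)
#define BUCKET_LIMIT 4  /* hypothetical cap on entries per bucket */

struct entry {
    size_t offset;       /* where the block starts in the reference buffer */
    struct entry *next;
};

struct ref_index {
    struct entry *bucket[HASH_SIZE];
    unsigned count[HASH_SIZE];
};

/*
 * Add a reference block to the index.
 * Returns 1 if stored, 0 if dropped because the bucket is already
 * full, and -1 if allocation failed.
 */
static int index_add(struct ref_index *idx, unsigned hash, size_t offset)
{
    unsigned slot = hash & (HASH_SIZE - 1);
    struct entry *e;

    if (idx->count[slot] >= BUCKET_LIMIT)
        return 0;  /* bound the chain length, and so the lookup cost */

    e = malloc(sizeof(*e));
    if (!e)
        return -1;
    e->offset = offset;
    e->next = idx->bucket[slot];
    idx->bucket[slot] = e;
    idx->count[slot]++;
    return 1;
}

/* Release every chain and reset the counters. */
static void index_free(struct ref_index *idx)
{
    unsigned slot;

    for (slot = 0; slot < HASH_SIZE; slot++) {
        struct entry *e = idx->bucket[slot];
        while (e) {
            struct entry *next = e->next;
            free(e);
            e = next;
        }
        idx->bucket[slot] = NULL;
        idx->count[slot] = 0;
    }
}
```

With the cap in place, each target block is compared against at most 
BUCKET_LIMIT candidates, so the scan stays roughly linear in the target 
size. The price is the one named in the quote: a dropped entry may have 
been the better match.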

However, you might also want to consider another approach..

One of the biggest costs for the xdelta algorithm is probably just the 
"delta_prepare()", but at the same time, that is constant wrt the source 
buffer.

Now, the sad part is that when I wrote pack-objects, I didn't really 
understand the diff-delta algorithm, I just plugged it in. Which means 
that when I did it, I made the (obvious and simple) decision to keep the 
_result_ that we are looking at constant, and try to delta against 
different sources.

HOWEVER.

I suspect you already see where this is going..

We _could_ switch the "pack-objects" window handling around, and instead 
of looking at the object we want to pack, and looking at the ten (or 
"window") previous objects to delta against, we could do it the other way 
around: keep the object we delta against constant, and see what deltas we 
could prepare for the ten next objects.

And since the source would now be constant, you'd need to do the 
"delta_prepare()" just _once_ per window, instead of every single time.
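
To make the difference concrete, here is a rough sketch that only 
counts how often the source index would be built under each loop order. 
The functions prepare_index() and try_delta() are empty stand-ins 
invented for this illustration, not the real pack-objects or diff-delta 
code, and objects are just numbered 0..n-1:

```c
#include <stddef.h>

static unsigned long prepare_calls;  /* number of hypothetical index builds */

/* Stand-in for the expensive per-source hashing step. */
static void prepare_index(size_t src) { (void)src; prepare_calls++; }

/* Stand-in for scanning one target against a prepared source. */
static void try_delta(size_t src, size_t dst) { (void)src; (void)dst; }

/* Target held constant: the index is rebuilt for every (source, target) pair. */
static unsigned long window_target_fixed(size_t n, size_t window)
{
    size_t dst, k;

    prepare_calls = 0;
    for (dst = 0; dst < n; dst++) {
        for (k = 1; k <= window && k <= dst; k++) {
            size_t src = dst - k;
            prepare_index(src);
            try_delta(src, dst);
        }
    }
    return prepare_calls;
}

/* Source held constant: the index is built once, then reused for the next targets. */
static unsigned long window_source_fixed(size_t n, size_t window)
{
    size_t src, k;

    prepare_calls = 0;
    for (src = 0; src < n; src++) {
        prepare_index(src);
        for (k = 1; k <= window && src + k < n; k++)
            try_delta(src, src + k);
    }
    return prepare_calls;
}
```

Both loops try exactly the same (source, target) pairs. For 100 objects 
and a window of 10, the first order builds the index 945 times and the 
second 100 times. Whether that matters in practice depends on how much 
of the total time the prepare step really takes.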

Now, I haven't done any profiling on the diff-delta code, and maybe my 
guess that delta_prepare() is a pretty expensive part is wrong, and maybe 
it wouldn't help to switch the window probing around. But I thought I'd 
mention it as one thing to explore..

		Linus

