From: Nicolas Pitre <nico@cam.org>
To: Junio C Hamano <junkio@cox.net>
Cc: git@vger.kernel.org, Linus Torvalds <torvalds@osdl.org>,
Carl Baldwin <cnb@fc.hp.com>
Subject: Re: [PATCH] diff-delta: produce optimal pack data
Date: Fri, 24 Feb 2006 16:39:42 -0500 (EST) [thread overview]
Message-ID: <Pine.LNX.4.64.0602241613030.31162@localhost.localdomain> (raw)
In-Reply-To: <7vpslc8oni.fsf@assigned-by-dhcp.cox.net>
On Fri, 24 Feb 2006, Junio C Hamano wrote:
> I haven't looked at Nico's original or updated code closely at
> all, but two things come to mind.
>
> (1) if we could tell the particular data is intrinsically
> diff_delta unfriendly and diff_delta would waste too much
> time when tried to delta against almost _any_ other blob,
> then it might help to give an interface in diff-delta.c for
> the caller to check for such a blob without even trying
> diff_delta.
>
> (2) otherwise, if diff_delta could detect it would spend too
> many cycles to finish its work for a particular input early
> on, we might want it to bail out instead of trying a
> complete job.
I have a patch that implements an hybrid approach.
Currently, diff-delta takes blocks of data in the reference file and
hash them. When the target file is scanned, it uses the hash to match
blocks from the target file with the reference file.
If blocks are hashed evenly the cost of producing a delta is at most
O(n+m) where n and m are the size of the reference and target files
respectively. In other words, with good data set the cost is linear.
But if many blocks from the reference buffer do hash to the same bucket
then for each block in the target file many blocks from the reference
buffer have to be tested against, making it tend towards O(n^m) which is
pretty highly exponential.
The solution I'm investigating is to put a limit on the number of
entries in the same hash bucket so to bring the cost back to something
more linear. That means the delta might miss on better matches that
have not been hashed but still benefit from a limited set. Experience
seems to show that the time to deltify the first two blobs you found to
be problematic can be reduced by 2 orders of magnitude with about only
10% increase in the resulting delta size, and still nearly 40% smaller
than what the current delta code produces.
The question is how to determine the best limit on the number of entries
in the same hash bucket.
Nicolas
next prev parent reply other threads:[~2006-02-24 21:41 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-02-22 1:45 [PATCH] diff-delta: produce optimal pack data Nicolas Pitre
2006-02-24 8:49 ` Junio C Hamano
2006-02-24 15:37 ` Nicolas Pitre
2006-02-24 23:55 ` Junio C Hamano
2006-02-24 17:44 ` Carl Baldwin
2006-02-24 17:56 ` Nicolas Pitre
2006-02-24 18:35 ` Carl Baldwin
2006-02-24 18:57 ` Nicolas Pitre
2006-02-24 19:23 ` Carl Baldwin
2006-02-24 20:02 ` Nicolas Pitre
2006-02-24 20:40 ` Carl Baldwin
2006-02-24 21:12 ` Nicolas Pitre
2006-02-24 22:50 ` Carl Baldwin
2006-02-25 3:53 ` Nicolas Pitre
2006-02-24 20:02 ` Linus Torvalds
2006-02-24 20:19 ` Nicolas Pitre
2006-02-24 20:53 ` Junio C Hamano
2006-02-24 21:39 ` Nicolas Pitre [this message]
2006-02-24 21:48 ` Nicolas Pitre
2006-02-25 0:45 ` Linus Torvalds
2006-02-25 3:07 ` Nicolas Pitre
2006-02-25 4:05 ` Linus Torvalds
2006-02-25 5:10 ` Nicolas Pitre
2006-02-25 5:35 ` Nicolas Pitre
2006-03-07 23:48 ` [RFH] zlib gurus out there? Junio C Hamano
2006-03-08 0:59 ` Linus Torvalds
2006-03-08 1:22 ` Junio C Hamano
2006-03-08 2:00 ` Linus Torvalds
2006-03-08 9:46 ` Johannes Schindelin
2006-03-08 10:45 ` [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data Sergey Vlasov
2006-03-08 11:04 ` Junio C Hamano
2006-03-08 14:17 ` Sergey Vlasov
2006-02-25 19:18 ` [PATCH] diff-delta: produce optimal pack data Linus Torvalds
2006-02-24 18:49 ` Carl Baldwin
2006-02-24 19:03 ` Nicolas Pitre
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.64.0602241613030.31162@localhost.localdomain \
--to=nico@cam.org \
--cc=cnb@fc.hp.com \
--cc=git@vger.kernel.org \
--cc=junkio@cox.net \
--cc=torvalds@osdl.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).