From: Linus Torvalds <torvalds@osdl.org>
To: Nicolas Pitre <nico@cam.org>
Cc: Junio C Hamano <junkio@cox.net>,
git@vger.kernel.org, Carl Baldwin <cnb@fc.hp.com>
Subject: Re: [PATCH] diff-delta: produce optimal pack data
Date: Fri, 24 Feb 2006 20:05:06 -0800 (PST) [thread overview]
Message-ID: <Pine.LNX.4.64.0602241952140.22647@g5.osdl.org> (raw)
In-Reply-To: <Pine.LNX.4.64.0602242130030.31162@localhost.localdomain>
On Fri, 24 Feb 2006, Nicolas Pitre wrote:
>
> Well, that's the compromize to make. By default the version with
> adler32 used 16 byte blocks to index the reference buffer. That means
> you can match target data against the reference only if whole 16 byte
> blocks match. Then, if you fix a typo in the target buffer then you'll
> inevitably need 16 literal bytes in the delta instead of only
> one because you won't be able to resynchronize with the reference buffer
> until the next 16 byte block.
That shouldn't be true. Once you find a 16-byte match, you can search
forward (in theory you should be able to search backwards too, but by that
time we've already expanded the non-matching previous bytes, I think).
But I'm no xdelta expert, so I migt be wrong.
> What I've made in my last delta patch is to reduce that 16 byte block to
> only 3 bytes. Why 3 bytes? Because less than that produces smaller
> delta data if done with literal bytes directly, and 3 bytes provided
> enough bits to hash.
On the other hand, the cost is that your lookups _are_ going to be more
expensive. Regardless of how good the hash is, basically you have 16/3
more hash-entries to look up, so you've made compression more expensive in
footprint, at least (I assume you've made the hash appropriately larger).
Also, at 3 bytes, insertion is at least equally dense (three bytes of data
vs three bytes of offset into the source), and can be worse (the offset
might be 5 bytes, no?). So it would seem like you'd be better off with 4+
bytes, at which point the delta should be a win.
Have you tried some half-way point, like ~8 bytes?
> Now the problem comes when indexing a reference file full of:
>
> 0x46f8, 0x000b, 0x42fe, 0x0000, 0xffc0, 0x0001, 0xff00, 0x0008,
...
>
> There is a bunch of ", 0x" that get hashed to the same thing.
You'll find a lot of that in any file: three or four bytes of similarity
just doesn't sound worthwhile to go digging after.
> The adler32 made that particular example a non issue since the
> likelyhood of many 16 byte blocks to be the same is pretty low in this
> case. But the flaw remains if for example there is lots of similar 16
> byte blocks, like a binary file with lots of zeroes for example. In
> fact, the performance problem Carl is having does use the diff-delta
> version still using adler32.
Agreed. I think limiting the hash length is a fine idea regardless, I just
think it sounds dangerous with the three-byte thing where a lot of matches
should be expected (never mind ", 0x", just things like newlines and tabs
in source code).
Only considering 16-byte sequences of similarities is a trade-off of
packing cost vs win.. Not saying other block-sizes aren't worth testing,
but I suspect trying too hard is going to be too costly.
Especially as deltas compress _worse_ the smaller they are. Bigger
"insert" chunks probably compress a lot better than a copy chunk.
Have you looked at the delta size vs compression?
Linus
next prev parent reply other threads:[~2006-02-25 4:14 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-02-22 1:45 [PATCH] diff-delta: produce optimal pack data Nicolas Pitre
2006-02-24 8:49 ` Junio C Hamano
2006-02-24 15:37 ` Nicolas Pitre
2006-02-24 23:55 ` Junio C Hamano
2006-02-24 17:44 ` Carl Baldwin
2006-02-24 17:56 ` Nicolas Pitre
2006-02-24 18:35 ` Carl Baldwin
2006-02-24 18:57 ` Nicolas Pitre
2006-02-24 19:23 ` Carl Baldwin
2006-02-24 20:02 ` Nicolas Pitre
2006-02-24 20:40 ` Carl Baldwin
2006-02-24 21:12 ` Nicolas Pitre
2006-02-24 22:50 ` Carl Baldwin
2006-02-25 3:53 ` Nicolas Pitre
2006-02-24 20:02 ` Linus Torvalds
2006-02-24 20:19 ` Nicolas Pitre
2006-02-24 20:53 ` Junio C Hamano
2006-02-24 21:39 ` Nicolas Pitre
2006-02-24 21:48 ` Nicolas Pitre
2006-02-25 0:45 ` Linus Torvalds
2006-02-25 3:07 ` Nicolas Pitre
2006-02-25 4:05 ` Linus Torvalds [this message]
2006-02-25 5:10 ` Nicolas Pitre
2006-02-25 5:35 ` Nicolas Pitre
2006-03-07 23:48 ` [RFH] zlib gurus out there? Junio C Hamano
2006-03-08 0:59 ` Linus Torvalds
2006-03-08 1:22 ` Junio C Hamano
2006-03-08 2:00 ` Linus Torvalds
2006-03-08 9:46 ` Johannes Schindelin
2006-03-08 10:45 ` [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data Sergey Vlasov
2006-03-08 11:04 ` Junio C Hamano
2006-03-08 14:17 ` Sergey Vlasov
2006-02-25 19:18 ` [PATCH] diff-delta: produce optimal pack data Linus Torvalds
2006-02-24 18:49 ` Carl Baldwin
2006-02-24 19:03 ` Nicolas Pitre
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.64.0602241952140.22647@g5.osdl.org \
--to=torvalds@osdl.org \
--cc=cnb@fc.hp.com \
--cc=git@vger.kernel.org \
--cc=junkio@cox.net \
--cc=nico@cam.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).