* [PATCH] diff-delta: produce optimal pack data
@ 2006-02-22 1:45 Nicolas Pitre
2006-02-24 8:49 ` Junio C Hamano
0 siblings, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-22 1:45 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
Indexing based on adler32 has a match precision based on the block size
(currently 16). Lowering the block size would produce smaller deltas
but the indexing memory and computing cost increases significantly.
For an optimal delta result the indexing block size should be 3 with an
increment of 1 (instead of 16 and 16). With such low parameters the adler32
becomes a clear overhead, increasing the time for git-repack by a factor
of 3. And with such small blocks the adler32 is not very useful anyway, as
the whole of the block bits can be used directly.
This patch replaces the adler32 with an open coded index value based on
3 characters directly. This gives sufficient bits for hashing and
allows for optimal delta with reasonable CPU cycles.
The resulting packs are 6% smaller on average. The increase in CPU time
is about 25%. But this cost is now hidden by the delta reuse patch
while the saving on data transfers is always there.
Signed-off-by: Nicolas Pitre <nico@cam.org>
---
diff-delta.c | 77 +++++++++++++++++++++++-----------------------------------
1 files changed, 30 insertions(+), 47 deletions(-)
54aa50fb403981a9292453b76d894a79da9698de
diff --git a/diff-delta.c b/diff-delta.c
index 2ed5984..27f83a0 100644
--- a/diff-delta.c
+++ b/diff-delta.c
@@ -20,21 +20,11 @@
#include <stdlib.h>
#include <string.h>
-#include <zlib.h>
#include "delta.h"
-/* block size: min = 16, max = 64k, power of 2 */
-#define BLK_SIZE 16
-
-#define MIN(a, b) ((a) < (b) ? (a) : (b))
-
-#define GR_PRIME 0x9e370001
-#define HASH(v, shift) (((unsigned int)(v) * GR_PRIME) >> (shift))
-
struct index {
const unsigned char *ptr;
- unsigned int val;
struct index *next;
};
@@ -42,21 +32,21 @@ static struct index ** delta_index(const
unsigned long bufsize,
unsigned int *hash_shift)
{
- unsigned int hsize, hshift, entries, blksize, i;
+ unsigned long hsize;
+ unsigned int hshift, i;
const unsigned char *data;
struct index *entry, **hash;
void *mem;
/* determine index hash size */
- entries = (bufsize + BLK_SIZE - 1) / BLK_SIZE;
- hsize = entries / 4;
- for (i = 4; (1 << i) < hsize && i < 16; i++);
+ hsize = bufsize / 4;
+ for (i = 8; (1 << i) < hsize && i < 16; i++);
hsize = 1 << i;
- hshift = 32 - i;
+ hshift = i - 8;
*hash_shift = hshift;
/* allocate lookup index */
- mem = malloc(hsize * sizeof(*hash) + entries * sizeof(*entry));
+ mem = malloc(hsize * sizeof(*hash) + bufsize * sizeof(*entry));
if (!mem)
return NULL;
hash = mem;
@@ -64,17 +54,12 @@ static struct index ** delta_index(const
memset(hash, 0, hsize * sizeof(*hash));
/* then populate it */
- data = buf + entries * BLK_SIZE - BLK_SIZE;
- blksize = bufsize - (data - buf);
- while (data >= buf) {
- unsigned int val = adler32(0, data, blksize);
- i = HASH(val, hshift);
- entry->ptr = data;
- entry->val = val;
+ data = buf + bufsize - 2;
+ while (data > buf) {
+ entry->ptr = --data;
+ i = data[0] ^ data[1] ^ (data[2] << hshift);
entry->next = hash[i];
hash[i] = entry++;
- blksize = BLK_SIZE;
- data -= BLK_SIZE;
}
return hash;
@@ -141,29 +126,27 @@ void *diff_delta(void *from_buf, unsigne
while (data < top) {
unsigned int moff = 0, msize = 0;
- unsigned int blksize = MIN(top - data, BLK_SIZE);
- unsigned int val = adler32(0, data, blksize);
- i = HASH(val, hash_shift);
- for (entry = hash[i]; entry; entry = entry->next) {
- const unsigned char *ref = entry->ptr;
- const unsigned char *src = data;
- unsigned int ref_size = ref_top - ref;
- if (entry->val != val)
- continue;
- if (ref_size > top - src)
- ref_size = top - src;
- while (ref_size && *src++ == *ref) {
- ref++;
- ref_size--;
- }
- ref_size = ref - entry->ptr;
- if (ref_size > msize) {
- /* this is our best match so far */
- moff = entry->ptr - ref_data;
- msize = ref_size;
- if (msize >= 0x10000) {
- msize = 0x10000;
+ if (data + 2 < top) {
+ i = data[0] ^ data[1] ^ (data[2] << hash_shift);
+ for (entry = hash[i]; entry; entry = entry->next) {
+ const unsigned char *ref = entry->ptr;
+ const unsigned char *src = data;
+ unsigned int ref_size = ref_top - ref;
+ if (ref_size > top - src)
+ ref_size = top - src;
+ if (ref_size > 0x10000)
+ ref_size = 0x10000;
+ if (ref_size <= msize)
break;
+ while (ref_size && *src++ == *ref) {
+ ref++;
+ ref_size--;
+ }
+ ref_size = ref - entry->ptr;
+ if (msize < ref - entry->ptr) {
+ /* this is our best match so far */
+ msize = ref - entry->ptr;
+ moff = entry->ptr - ref_data;
}
}
}
--
1.2.2.g6643-dirty
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-22 1:45 [PATCH] diff-delta: produce optimal pack data Nicolas Pitre
@ 2006-02-24 8:49 ` Junio C Hamano
2006-02-24 15:37 ` Nicolas Pitre
2006-02-24 17:44 ` Carl Baldwin
0 siblings, 2 replies; 35+ messages in thread
From: Junio C Hamano @ 2006-02-24 8:49 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
Nicolas Pitre <nico@cam.org> writes:
> Indexing based on adler32 has a match precision based on the block size
> (currently 16). Lowering the block size would produce smaller deltas
> but the indexing memory and computing cost increases significantly.
Indeed.
I had this patch in my personal tree for a while. I was
wondering why the progress indication during the "Deltifying"
stage sometimes stops for literally several seconds, or more.
In the Linux 2.6 repository, these object pairs take forever to
delta.
blob 9af06ba723df75fed49f7ccae5b6c9c34bc5115f ->
blob dfc9cd58dc065d17030d875d3fea6e7862ede143
size (491102 -> 496045)
58 seconds
blob 4917ec509720a42846d513addc11cbd25e0e3c4f ->
blob dfc9cd58dc065d17030d875d3fea6e7862ede143
size (495831 -> 496045)
64 seconds
Admittedly, these are *BAD* input samples (a binary firmware
blob with many similar looking ", 0x" sequences). I can see
that trying to reuse source materials really hard would take
significant computation.
However, this is simply unacceptable.
The new algorithm takes 58 seconds to produce 136000 bytes of
delta, while the old takes 0.25 seconds to produce 248899 bytes (I am
using the test-delta program in the git.git distribution). The
compression ratio is significantly better, but this is unusable
even for offline archival use (remember, pack delta selection
needs to do window=10 such deltification trials to come up with
the best delta, so you are spending 10 minutes to save 100k from
one oddball blob), let alone on-the-fly pack generation for
network transfer.
Maybe we would want two implementations next to each other, and
internally see if it is taking too many cycles compared to the
input size, then switch to the cheaper version?
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 8:49 ` Junio C Hamano
@ 2006-02-24 15:37 ` Nicolas Pitre
2006-02-24 23:55 ` Junio C Hamano
2006-02-24 17:44 ` Carl Baldwin
1 sibling, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-24 15:37 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Fri, 24 Feb 2006, Junio C Hamano wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > Indexing based on adler32 has a match precision based on the block size
> > (currently 16). Lowering the block size would produce smaller deltas
> > but the indexing memory and computing cost increases significantly.
>
> Indeed.
>
> I had this patch in my personal tree for a while. I was
> wondering why the progress indication during the "Deltifying"
> stage sometimes stops for literally several seconds, or more.
Note that above I'm saying that _keeping_ adler32 for small blocks is
even slower. In other words, for small blocks, the version not using
adler32 is about 3 times faster.
I also noticed the significant slowdown after I made the
improved progress patch. The plan now is to detect
pathological cases and break out of them early.
> In the Linux 2.6 repository, these object pairs take forever to
> delta.
>
> blob 9af06ba723df75fed49f7ccae5b6c9c34bc5115f ->
> blob dfc9cd58dc065d17030d875d3fea6e7862ede143
> size (491102 -> 496045)
> 58 seconds
>
> blob 4917ec509720a42846d513addc11cbd25e0e3c4f ->
> blob dfc9cd58dc065d17030d875d3fea6e7862ede143
> size (495831 -> 496045)
> 64 seconds
Thanks for this. I'll see what I can do to tweak the code to better
cope with those. Just keep my fourth delta patch in the pu branch for
now.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 8:49 ` Junio C Hamano
2006-02-24 15:37 ` Nicolas Pitre
@ 2006-02-24 17:44 ` Carl Baldwin
2006-02-24 17:56 ` Nicolas Pitre
1 sibling, 1 reply; 35+ messages in thread
From: Carl Baldwin @ 2006-02-24 17:44 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, git
Junio,
This message came to me at exactly the right time. Yesterday I was
exploring using git as the content storage back-end for some binary
files. Up until now I've only used it for software projects.
I found the largest RCS file that we had in our current back-end. It
contained twelve versions of a binary file. Each version averaged about
20 MB. The ,v file from RCS was about 250MB. I did some experiments on
these binary files.
First, gzip consistently is able to compress these files to about 10% of
their original size. So, they are quite inflated. Second, xdelta would
produce a delta between two neighboring revisions of about 2.5MB in size
that would compress down to about 2MB. (about the same size as the next
revision compressed without deltification so packing is ineffective
here).
I added these 12 revisions to several version control back-ends
including subversion and git. Git produced a much smaller repository
size than the others simply due to the compression that it applies to
objects. It also was at least as fast as the others.
The problem came when I tried to clone this repository.
git-pack-objects chewed on these 12 revisions for over an hour before I
finally interrupted it. As far as I could tell, it hadn't made much
progress.
My other complaint was that git prune ran slowly (~8 seconds on my very
fast machine with fast disk access) on a repository with only these
twelve revisions in it (37 total objects in the object store). This is
because 'git prune' actually ends up running fsck on all of the objects,
which verifies the sha1 of each object. This seems like a lot of work
just to prune unwanted objects. What would you say to a --fast option
to git-prune that would avoid most of what fsck does, including verifying
the sha1 of each object?
Anyway, that was a tangent. I looked into overriding the --depth
option to git-pack-objects and setting it to 0. However, this isn't
trivial. git-pack-objects is never called directly by the user. It is
only called through things like 'git clone', 'git push' and 'git
repack'. What do you think about this? Could we add a configuration
option that could be set for the repository? Something smarter like
what you suggest where git would pack small text files but give up on
large binaries would be optimal.
I've already determined that packing a repository with this type of
largish binary file doesn't do any good but there doesn't seem to be a
way to avoid packing when doing network operations.
Thoughts?
Carl
On Fri, Feb 24, 2006 at 12:49:13AM -0800, Junio C Hamano wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > Indexing based on adler32 has a match precision based on the block size
> > (currently 16). Lowering the block size would produce smaller deltas
> > but the indexing memory and computing cost increases significantly.
>
> Indeed.
>
> I had this patch in my personal tree for a while. I was
> wondering why the progress indication during the "Deltifying"
> stage sometimes stops for literally several seconds, or more.
>
> In the Linux 2.6 repository, these object pairs take forever to
> delta.
>
> blob 9af06ba723df75fed49f7ccae5b6c9c34bc5115f ->
> blob dfc9cd58dc065d17030d875d3fea6e7862ede143
> size (491102 -> 496045)
> 58 seconds
>
> blob 4917ec509720a42846d513addc11cbd25e0e3c4f ->
> blob dfc9cd58dc065d17030d875d3fea6e7862ede143
> size (495831 -> 496045)
> 64 seconds
>
> Admittedly, these are *BAD* input samples (a binary firmware
> blob with many similar looking ", 0x" sequences). I can see
> that trying to reuse source materials really hard would take
> significant computation.
>
> However, this is simply unacceptable.
>
> The new algorithm takes 58 seconds to produce 136000 bytes of
> delta, while the old takes 0.25 seconds to produce 248899 bytes (I am
> using the test-delta program in the git.git distribution). The
> compression ratio is significantly better, but this is unusable
> even for offline archival use (remember, pack delta selection
> needs to do window=10 such deltification trials to come up with
> the best delta, so you are spending 10 minutes to save 100k from
> one oddball blob), let alone on-the-fly pack generation for
> network transfer.
>
> Maybe we would want two implementations next to each other, and
> internally see if it is taking too many cycles compared to the
> input size, then switch to the cheaper version?
>
--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Carl Baldwin RADCAD (R&D CAD)
Hewlett Packard Company
MS 88 work: 970 898-1523
3404 E. Harmony Rd. work: Carl.N.Baldwin@hp.com
Fort Collins, CO 80525 home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 17:44 ` Carl Baldwin
@ 2006-02-24 17:56 ` Nicolas Pitre
2006-02-24 18:35 ` Carl Baldwin
2006-02-24 18:49 ` Carl Baldwin
0 siblings, 2 replies; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-24 17:56 UTC (permalink / raw)
To: Carl Baldwin; +Cc: Junio C Hamano, git
On Fri, 24 Feb 2006, Carl Baldwin wrote:
> Junio,
>
> This message came to me at exactly the right time. Yesterday I was
> exploring using git as the content storage back-end for some binary
> files. Up until now I've only used it for software projects.
>
> I found the largest RCS file that we had in our current back-end. It
> contained twelve versions of a binary file. Each version averaged about
> 20 MB. The ,v file from RCS was about 250MB. I did some experiments on
> these binary files.
>
> First, gzip consistently is able to compress these files to about 10% of
> their original size. So, they are quite inflated. Second, xdelta would
> produce a delta between two neighboring revisions of about 2.5MB in size
> that would compress down to about 2MB. (about the same size as the next
> revision compressed without deltification so packing is ineffective
> here).
>
> I added these 12 revisions to several version control back-ends
> including subversion and git. Git produced a much smaller repository
> size than the others simply due to the compression that it applies to
> objects. It also was at least as fast as the others.
>
> The problem came when I tried to clone this repository.
> git-pack-objects chewed on these 12 revisions for over an hour before I
> finally interrupted it. As far as I could tell, it hadn't made much
> progress.
I must ask if you had applied my latest delta patches?
Also did you use a recent version of git that implements pack data
reuse?
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 17:56 ` Nicolas Pitre
@ 2006-02-24 18:35 ` Carl Baldwin
2006-02-24 18:57 ` Nicolas Pitre
2006-02-24 18:49 ` Carl Baldwin
1 sibling, 1 reply; 35+ messages in thread
From: Carl Baldwin @ 2006-02-24 18:35 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Fri, Feb 24, 2006 at 12:56:04PM -0500, Nicolas Pitre wrote:
My version is 1.2.1. I have not been following your work. When was
pack data reuse introduced? From where can I obtain your delta patches?
There is really no opportunity for pack-data reuse in this case. The
repository had never been packed or cloned in the first place. As I
said, I do not intend to pack these binary files at all since there is
no benefit in this case.
The delta patches may help but I can't say for sure since I don't know
anything about them. Let me know where I can get them.
Carl
>
> I must ask if you had applied my latest delta patches?
>
> Also did you use a recent version of git that implements pack data
> reuse?
>
>
> Nicolas
--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Carl Baldwin RADCAD (R&D CAD)
Hewlett Packard Company
MS 88 work: 970 898-1523
3404 E. Harmony Rd. work: Carl.N.Baldwin@hp.com
Fort Collins, CO 80525 home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 17:56 ` Nicolas Pitre
2006-02-24 18:35 ` Carl Baldwin
@ 2006-02-24 18:49 ` Carl Baldwin
2006-02-24 19:03 ` Nicolas Pitre
1 sibling, 1 reply; 35+ messages in thread
From: Carl Baldwin @ 2006-02-24 18:49 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
I've updated to a very current master branch. This seems to include the
pack data reuse stuff. I've not made an attempt yet to apply your delta
patches.
git-repack quickly gets up to 5% (2/36) and hangs there. I'll let it
run for a while just to see how far it claims to get. I'm not hopeful.
Maybe your patches can help?
Carl
On Fri, Feb 24, 2006 at 12:56:04PM -0500, Nicolas Pitre wrote:
> On Fri, 24 Feb 2006, Carl Baldwin wrote:
>
> > Junio,
> >
> > This message came to me at exactly the right time. Yesterday I was
> > exploring using git as the content storage back-end for some binary
> > files. Up until now I've only used it for software projects.
> >
> > I found the largest RCS file that we had in our current back-end. It
> > contained twelve versions of a binary file. Each version averaged about
> > 20 MB. The ,v file from RCS was about 250MB. I did some experiments on
> > these binary files.
> >
> > First, gzip consistently is able to compress these files to about 10% of
> > their original size. So, they are quite inflated. Second, xdelta would
> > produce a delta between two neighboring revisions of about 2.5MB in size
> > that would compress down to about 2MB. (about the same size as the next
> > revision compressed without deltification so packing is ineffective
> > here).
> >
> > I added these 12 revisions to several version control back-ends
> > including subversion and git. Git produced a much smaller repository
> > size than the others simply due to the compression that it applies to
> > objects. It also was at least as fast as the others.
> >
> > The problem came when I tried to clone this repository.
> > git-pack-objects chewed on these 12 revisions for over an hour before I
> > finally interrupted it. As far as I could tell, it hadn't made much
> > progress.
>
> I must ask if you had applied my latest delta patches?
>
> Also did you use a recent version of git that implements pack data
> reuse?
>
>
> Nicolas
--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Carl Baldwin RADCAD (R&D CAD)
Hewlett Packard Company
MS 88 work: 970 898-1523
3404 E. Harmony Rd. work: Carl.N.Baldwin@hp.com
Fort Collins, CO 80525 home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 18:35 ` Carl Baldwin
@ 2006-02-24 18:57 ` Nicolas Pitre
2006-02-24 19:23 ` Carl Baldwin
0 siblings, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-24 18:57 UTC (permalink / raw)
To: Carl Baldwin; +Cc: Junio C Hamano, git
On Fri, 24 Feb 2006, Carl Baldwin wrote:
> My version is 1.2.1. I have not been following your work. When was
> pack data reuse introduced?
Try out version 1.2.3.
> From where can I obtain your delta patches?
Forget them for now -- they won't help you.
> There is really no opportunity for pack-data reuse in this case. The
> repository had never been packed or cloned in the first place. As I
> said, I do not intend to pack these binary files at all since there is
> no benefit in this case.
Yes there is, as long as you have version 1.2.3. The clone logic will
simply reuse already packed data without attempting to recompute it.
> The delta patches may help but I can't say for sure since I don't know
> anything about them.
They (actually the last one) might help reduce the size of resulting
packs, but it currently has performance problems with some pathological
data sets.
I think you really should try git version 1.2.3 with a packed
repository. It might handle your special case just fine.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 18:49 ` Carl Baldwin
@ 2006-02-24 19:03 ` Nicolas Pitre
0 siblings, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-24 19:03 UTC (permalink / raw)
To: Carl Baldwin; +Cc: Junio C Hamano, git
On Fri, 24 Feb 2006, Carl Baldwin wrote:
> I've updated to a very current master branch. This seems to include the
> pack data reuse stuff. I've not made an attempt yet to apply your delta
> patches.
>
> git-repack quickly gets up to 5% (2/36) and hangs there. I'll let it
> run for a while just to see how far it claims to get. I'm not hopeful.
It should complete eventually, probably after the same amount of time
needed by your previous clone attempt. But after that any clone
operation should be quick. This is clearly unacceptable but at least
with the pack data reuse you should suffer only once for the initial
repack.
> Maybe your patches can help?
No. They actually make things worse performance wise, much worse in
some special cases.
Is it possible for me to have access to 2 consecutive versions of your
big binary file?
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 18:57 ` Nicolas Pitre
@ 2006-02-24 19:23 ` Carl Baldwin
2006-02-24 20:02 ` Nicolas Pitre
2006-02-24 20:02 ` Linus Torvalds
0 siblings, 2 replies; 35+ messages in thread
From: Carl Baldwin @ 2006-02-24 19:23 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Fri, Feb 24, 2006 at 01:57:20PM -0500, Nicolas Pitre wrote:
> On Fri, 24 Feb 2006, Carl Baldwin wrote:
>
> > My version is 1.2.1. I have not been following your work. When was
> > pack data reuse introduced?
>
> Try out version 1.2.3.
I'm on it now.
> > From where can I obtain your delta patches?
>
> Forget them for now -- they won't help you.
Ah, I have been looking at your patches and clearly they will not help.
> > There is really no opportunity for pack-data reuse in this case. The
> > repository had never been packed or cloned in the first place. As I
> > said, I do not intend to pack these binary files at all since there is
> > no benefit in this case.
>
> Yes there is, as long as you have version 1.2.3. The clone logic will
> simply reuse already packed data without attempting to recompute it.
I meant that there is no benefit in disk space usage. Packing may
actually increase my disk space usage in this case. Refer to what I
said about experimentally running gzip and xdelta on the files
independently of git.
I see what you're saying about this data reuse helping to speed up
subsequent cloning operations. However, if packing takes this long and
doesn't give me any disk space savings then I don't want to pay the very
heavy price of packing these files even the first time nor do I want to
pay the price incrementally.
The most I would tolerate for the first pack is a few seconds. The most
I would tolerate for any incremental pack is about 1 second.
BTW, git repack has been going for 30 minutes and has packed 4/36
objects. :-)
> I think you really should try git version 1.2.3 with a packed
> repository. It might handle your special case just fine.
No, not when I'm not willing to pay the price to pack even once. This
isn't a case where I have one such repository and 'once it's been packed
then it's packed'. This is only one example of such a repository. I am
looking for a process for revisioning this type of data that will be
used over and over. Git may not be the answer here but it sure is
looking good in many other ways.
I think the right answer would be for git to avoid trying to pack files
like this. Junio mentioned something like this in his message.
Thanks for your input.
Cheers,
Carl
--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Carl Baldwin RADCAD (R&D CAD)
Hewlett Packard Company
MS 88 work: 970 898-1523
3404 E. Harmony Rd. work: Carl.N.Baldwin@hp.com
Fort Collins, CO 80525 home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 19:23 ` Carl Baldwin
@ 2006-02-24 20:02 ` Nicolas Pitre
2006-02-24 20:40 ` Carl Baldwin
2006-02-24 20:02 ` Linus Torvalds
1 sibling, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-24 20:02 UTC (permalink / raw)
To: Carl Baldwin; +Cc: Junio C Hamano, git
On Fri, 24 Feb 2006, Carl Baldwin wrote:
> I see what you're saying about this data reuse helping to speed up
> subsequent cloning operations. However, if packing takes this long and
> doesn't give me any disk space savings then I don't want to pay the very
> heavy price of packing these files even the first time nor do I want to
> pay the price incrementally.
Of course. There is admittedly a problem here. I'm just abusing a bit
of your time to properly identify its parameters.
> The most I would tolerate for the first pack is a few seconds. The most
> I would tolerate for any incremental pack is about 1 second.
Well that is probably a bit tight. Ideally it should be linear with the
size of the data set to process. If you have 10 files of 10MB each it
should take about the same time to pack as 10000 files of 10KB each.
Of course incrementally packing one additional 10MB file might take more
than a second although it is only one file.
> BTW, git repack has been going for 30 minutes and has packed 4/36
> objects. :-)
Pathetic.
> I think the right answer would be for git to avoid trying to pack files
> like this. Junio mentioned something like this in his message.
Yes. First we could add an additional parameter to the repacking
strategy, which is the undeltified but deflated size of an object. That
would prevent any delta from becoming bigger than the simply deflated
version.
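As a rough illustration only (a sketch, not a tested patch), such a cap
could look like this; both the helper and the deflated_size value it takes
are made up here, assuming the deflated size is recorded when the object is
read:

    /*
     * Sketch only: never accept a delta larger than the plain deflated
     * object.  "deflated_size" is a hypothetical per-object value, not
     * an existing field in pack-objects.
     */
    static long cap_delta_size(unsigned long size, unsigned long deflated_size)
    {
            long max_size = (long)(size / 2) - 20;  /* current heuristic */

            if (max_size > (long)deflated_size)
                    max_size = deflated_size;
            return max_size > 0 ? max_size : -1;    /* -1: skip the delta */
    }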
There remains the delta performance issue. I think I know what the
problem is. I'm not sure I know what the best solution would be though.
The pathological data set is easy to identify quickly, and one strategy
might simply be to bail out early when it happens and therefore not
attempt any delta.
However, if you could let me play with two samples of your big file I'd
be grateful. If so I'd like to make git work well with your data set
too which is not that uncommon after all.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 19:23 ` Carl Baldwin
2006-02-24 20:02 ` Nicolas Pitre
@ 2006-02-24 20:02 ` Linus Torvalds
2006-02-24 20:19 ` Nicolas Pitre
2006-02-24 20:53 ` Junio C Hamano
1 sibling, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2006-02-24 20:02 UTC (permalink / raw)
To: Carl Baldwin; +Cc: Nicolas Pitre, Junio C Hamano, git
On Fri, 24 Feb 2006, Carl Baldwin wrote:
>
> I meant that there is no benefit in disk space usage. Packing may
> actually increase my disk space usage in this case. Refer to what I
> said about experimentally running gzip and xdelta on the files
> independently of git.
Yes. The deltas tend to compress a lot less well than "normal" files.
> I see what you're saying about this data reuse helping to speed up
> subsequent cloning operations. However, if packing takes this long and
> doesn't give me any disk space savings then I don't want to pay the very
> heavy price of packing these files even the first time nor do I want to
> pay the price incrementally.
I would look at tuning the heuristics in "try_delta()" (pack-objects.c) a
bit. That's the place that decides whether to even bother trying to make a
delta, and how big a delta is acceptable. For example, looking at them, I
already see one bug:
..
sizediff = oldsize > size ? oldsize - size : size - oldsize;
if (sizediff > size / 8)
return -1;
..
we really should compare sizediff to the _smaller_ of the two sizes, and
skip the delta if the difference in sizes is bound to be bigger than that.
However, the "size / 8" thing isn't a very strict limit anyway, so this
probably doesn't matter (and I think Nico already removed it as part of
his patches: the heuristic can make us avoid some deltas that would be
ok).
The other thing to look at is "max_size": right now it initializes that to
"size / 2 - 20", which just says that we don't ever want a delta that is
larger than about half the result (plus the 20 byte overhead for pointing
to the thing we delta against). Again, if you feel that normal compression
compresses better than half, you could try changing that to
..
max_size = size / 4 - 20;
..
or something like that instead (but then you need to check that it's still
positive - otherwise the comparisons with unsigned later on are screwed
up. Right now that value is guaranteed to be positive if only because we
already checked
..
if (size < 50)
return -1;
..
earlier).
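Putting those two tweaks together, a sketch (untested, thresholds still
made up) of that part of try_delta() might look like:

    /* sketch only: compare the size difference against the smaller size */
    unsigned long smaller = oldsize < size ? oldsize : size;

    sizediff = oldsize > size ? oldsize - size : size - oldsize;
    if (sizediff > smaller / 8)
            return -1;      /* sizes too different to be worth trying */

    /* sketch only: a tighter limit than the current size / 2 - 20 */
    max_size = (long)(size / 4) - 20;
    if (max_size <= 0)
            return -1;      /* keep the later unsigned comparisons sane */

With the tighter limit the earlier size < 50 check no longer guarantees a
positive max_size, hence the explicit test.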
NOTE! Every SINGLE one of those heuristics are just totally made up by
yours truly, and have no testing behind them. They're more of the type
"that sounds about right" than "this is how it must be". As mentioned,
Nico has already been playing with the heuristics - but he wanted better
packs, not better CPU usage, so he went the other way from what you would
want to try..
Linus
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 20:02 ` Linus Torvalds
@ 2006-02-24 20:19 ` Nicolas Pitre
2006-02-24 20:53 ` Junio C Hamano
1 sibling, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-24 20:19 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Carl Baldwin, Junio C Hamano, git
On Fri, 24 Feb 2006, Linus Torvalds wrote:
> The other thing to look at is "max_size": right now it initializes that to
> "size / 2 - 20", which just says that we don't ever want a delta that is
> larger than about half the result (plus the 20 byte overhead for pointing
> to the thing we delta against). Again, if you feel that normal compression
> compresses better than half, you could try changing that to
>
> ..
> max_size = size / 4 - 20;
> ..
Like I mentioned, max_size should also be capped by the deflated
undeltified object size. This value is easy to get since plain objects
are already deflated.
> NOTE! Every SINGLE one of those heuristics are just totally made up by
> yours truly, and have no testing behind them. They're more of the type
> "that sounds about right" than "this is how it must be". As mentioned,
> Nico has already been playing with the heuristics - but he wanted better
> packs, not better CPU usage, so he went the other way from what you would
> want to try..
Actually it's a good balance I'm after.
Using 30% more CPU for 10% smaller packs is OK I'd say.
Using 100 times the CPU for 50% saving on only one particular delta is
not acceptable.
And using more than one hour for 200MB of data with the current window
default is not acceptable either.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 20:02 ` Nicolas Pitre
@ 2006-02-24 20:40 ` Carl Baldwin
2006-02-24 21:12 ` Nicolas Pitre
0 siblings, 1 reply; 35+ messages in thread
From: Carl Baldwin @ 2006-02-24 20:40 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Fri, Feb 24, 2006 at 03:02:07PM -0500, Nicolas Pitre wrote:
> Well that is probably a bit tight. Ideally it should be linear with the
> size of the data set to process. If you have 10 files of 10MB each it
> should take about the same time to pack as 10000 files of 10KB each.
> Of course incrementally packing one additional 10MB file might take more
> than a second although it is only one file.
Well, I might not have been fair here. I tried an experiment where I
packed each of the twelve large blob objects explicitly one-by-one using
git-pack-objects. Incrementally packing each single object was very
fast. Well under a second per object on my machine.
After the twelve large objects were packed into individual packs, the
rest of the packing went very quickly and git v1.2.3's data reuse worked
very well. This was sort of my attempt at simulating how things would
be if git avoided deltification of each of these large files. I'm sorry
to have been so harsh earlier; I just didn't understand that
incrementally packing one-by-one was going to help this much.
This gives me hope that if somehow git were to not attempt to deltify
these objects then performance would be much better than acceptable.
[snip]
> However, if you could let me play with two samples of your big file I'd
> be grateful. If so I'd like to make git work well with your data set
> too which is not that uncommon after all.
I would be happy to do this. I will probably need to scrub a bit and
make sure that the result shows the same characteristics. How would you
like me to deliver these files to you? They are about 25 MB deflated.
Carl
--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Carl Baldwin RADCAD (R&D CAD)
Hewlett Packard Company
MS 88 work: 970 898-1523
3404 E. Harmony Rd. work: Carl.N.Baldwin@hp.com
Fort Collins, CO 80525 home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 20:02 ` Linus Torvalds
2006-02-24 20:19 ` Nicolas Pitre
@ 2006-02-24 20:53 ` Junio C Hamano
2006-02-24 21:39 ` Nicolas Pitre
1 sibling, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2006-02-24 20:53 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git, Linus Torvalds, Carl Baldwin
Linus Torvalds <torvalds@osdl.org> writes:
> NOTE! Every SINGLE one of those heuristics are just totally made up by
> yours truly, and have no testing behind them. They're more of the type
> "that sounds about right" than "this is how it must be". As mentioned,
> Nico has already been playing with the heuristics - but he wanted better
> packs, not better CPU usage, so he went the other way from what you would
> want to try..
I haven't looked at Nico's original or updated code closely at
all, but two things come to mind.
(1) if we could tell the particular data is intrinsically
diff_delta unfriendly and diff_delta would waste too much
time when tried to delta against almost _any_ other blob,
then it might help to give an interface in diff-delta.c for
the caller to check for such a blob without even trying
diff_delta.
(2) otherwise, if diff_delta could detect it would spend too
many cycles to finish its work for a particular input early
on, we might want it to bail out instead of trying a
complete job.
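For (1), such an interface could be a cheap scan of the blob itself; the
sketch below is purely illustrative, and both the function name and the
skewed-hash heuristic are made up rather than existing code:

    /*
     * Illustrative sketch only: a hypothetical helper for diff-delta.c
     * letting callers reject delta-unfriendly blobs cheaply.  The name
     * and the heuristic (a very skewed 3-byte hash distribution) are
     * assumptions, not an existing interface.
     */
    int blob_looks_delta_unfriendly(const unsigned char *buf, unsigned long size)
    {
            unsigned int count[256] = { 0 };
            unsigned long i, max = 0;

            for (i = 0; i + 2 < size; i++) {
                    unsigned int h = (buf[i] ^ buf[i+1] ^ buf[i+2]) & 0xff;
                    if (++count[h] > max)
                            max = count[h];
            }
            /* if one bucket holds over 1/8 of all blocks, matching degenerates */
            return size > 1024 && max > size / 8;
    }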
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 20:40 ` Carl Baldwin
@ 2006-02-24 21:12 ` Nicolas Pitre
2006-02-24 22:50 ` Carl Baldwin
0 siblings, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-24 21:12 UTC (permalink / raw)
To: Carl Baldwin; +Cc: Junio C Hamano, git
On Fri, 24 Feb 2006, Carl Baldwin wrote:
> On Fri, Feb 24, 2006 at 03:02:07PM -0500, Nicolas Pitre wrote:
> > Well that is probably a bit tight. Ideally it should be linear with the
> > size of the data set to process. If you have 10 files of 10MB each it
> > should take about the same time to pack as 10000 files of 10KB each.
> > Of course incrementally packing one additional 10MB file might take more
> > than a second although it is only one file.
>
> Well, I might not have been fair here. I tried an experiment where I
> packed each of the twelve large blob objects explicitly one-by-one using
> git-pack-objects. Incrementally packing each single object was very
> fast. Well under a second per object on my machine.
>
> After the twelve large objects were packed into individual packs, the
> rest of the packing went very quickly and git v1.2.3's data reuse worked
> very well. This was sort of my attempt at simulating how things would
> be if git avoided deltification of each of these large files. I'm sorry
> to have been so harsh earlier; I just didn't understand that
> incrementally packing one-by-one was going to help this much.
Hmmmmmmm....
I don't think I understand what is going on here.
You say that, if you add those big files and incrementally repack after
each commit using git repack with no option, then it requires only about
one second each time. Right?
But if you use "git-repack -a -f" then it is gone for more than an hour?
I'd expect something like 2 * (sum i for i = 1 to 10) i.e. in the 110
second range due to the combinatorial effect when repacking everything.
This is far from one hour and something appears to be really really
wrong.
How many files besides those 12 big blobs do you have?
> This gives me hope that if somehow git were to not attempt to deltify
> these objects then performance would be much better than acceptable.
>
> [snip]
> > However, if you could let me play with two samples of your big file I'd
> > be grateful. If so I'd like to make git work well with your data set
> > too which is not that uncommon after all.
>
> I would be happy to do this. I will probably need to scrub a bit and
> make sure that the result shows the same characteristics. How would you
> like me to deliver these files to you? They are about 25 MB deflated.
If you can add them into a single .tgz with instructions on how
to reproduce the issue and provide me with an URL where I can fetch it
that'd be perfect.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 20:53 ` Junio C Hamano
@ 2006-02-24 21:39 ` Nicolas Pitre
2006-02-24 21:48 ` Nicolas Pitre
2006-02-25 0:45 ` Linus Torvalds
0 siblings, 2 replies; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-24 21:39 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Linus Torvalds, Carl Baldwin
On Fri, 24 Feb 2006, Junio C Hamano wrote:
> I haven't looked at Nico's original or updated code closely at
> all, but two things come to mind.
>
> (1) if we could tell the particular data is intrinsically
> diff_delta unfriendly and diff_delta would waste too much
> time when tried to delta against almost _any_ other blob,
> then it might help to give an interface in diff-delta.c for
> the caller to check for such a blob without even trying
> diff_delta.
>
> (2) otherwise, if diff_delta could detect it would spend too
> many cycles to finish its work for a particular input early
> on, we might want it to bail out instead of trying a
> complete job.
I have a patch that implements a hybrid approach.
Currently, diff-delta takes blocks of data in the reference file and
hashes them. When the target file is scanned, it uses the hash to match
blocks from the target file with the reference file.
If blocks are hashed evenly the cost of producing a delta is at most
O(n+m) where n and m are the size of the reference and target files
respectively. In other words, with good data set the cost is linear.
But if many blocks from the reference buffer do hash to the same bucket
then for each block in the target file many blocks from the reference
buffer have to be tested against, making it tend towards O(n^m) which is
pretty highly exponential.
The solution I'm investigating is to put a limit on the number of
entries in the same hash bucket so as to bring the cost back to something
more linear. That means the delta might miss on better matches that
have not been hashed but still benefit from a limited set. Experience
seems to show that the time to deltify the first two blobs you found to
be problematic can be reduced by 2 orders of magnitude with about only
10% increase in the resulting delta size, and still nearly 40% smaller
than what the current delta code produces.
The question is how to determine the best limit on the number of entries
in the same hash bucket.
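For illustration, a rough sketch (not the actual patch) of how such a cap
could be applied while populating the index, based on the loop from the
patch at the start of this thread; HASH_LIMIT and the skip-on-insert policy
are arbitrary choices here:

    /*
     * Sketch only: refuse to grow a hash bucket past HASH_LIMIT entries
     * so the matching phase stays roughly linear.  hash_count[] is an
     * extra zero-initialized array sized like hash[], and HASH_LIMIT is
     * an arbitrary illustrative value.
     */
    #define HASH_LIMIT 64
    data = buf + bufsize - 2;
    while (data > buf) {
            entry->ptr = --data;
            i = data[0] ^ data[1] ^ (data[2] << hshift);
            if (hash_count[i] < HASH_LIMIT) {
                    entry->next = hash[i];
                    hash[i] = entry++;
                    hash_count[i]++;
            }
    }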
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 21:39 ` Nicolas Pitre
@ 2006-02-24 21:48 ` Nicolas Pitre
2006-02-25 0:45 ` Linus Torvalds
1 sibling, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-24 21:48 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Linus Torvalds, Carl Baldwin
On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> If blocks are hashed evenly the cost of producing a delta is at most
> O(n+m) where n and m are the size of the reference and target files
> respectively. In other words, with good data set the cost is linear.
>
> But if many blocks from the reference buffer do hash to the same bucket
> then for each block in the target file many blocks from the reference
> buffer have to be tested against, making it tend towards O(n^m) which is
> pretty highly exponential.
Well, actually this is rather O(n*m) not O(n^m), but bad nevertheless.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 21:12 ` Nicolas Pitre
@ 2006-02-24 22:50 ` Carl Baldwin
2006-02-25 3:53 ` Nicolas Pitre
0 siblings, 1 reply; 35+ messages in thread
From: Carl Baldwin @ 2006-02-24 22:50 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git
On Fri, Feb 24, 2006 at 04:12:14PM -0500, Nicolas Pitre wrote:
> On Fri, 24 Feb 2006, Carl Baldwin wrote:
> > After the twelve large objects were packed into individual packs, the
> > rest of the packing went very quickly and git v1.2.3's data reuse worked
> > very well. This was sort of my attempt at simulating how things would
> > be if git avoided deltification of each of these large files. I'm sorry
> > to have been so harsh earlier; I just didn't understand that
> > incrementally packing one-by-one was going to help this much.
>
> Hmmmmmmm....
>
> I don't think I understand what is going on here.
>
> You say that, if you add those big files and incrementally repack after
> each commit using git repack with no option, then it requires only about
> one second each time. Right?
Well, actually I was packing them individually by calling
git-pack-objects directly on each blob.
I'll try doing it exactly as you describe...
Ok, I tried it. Basically I do the following.
% mkdir test
% cd test
% git init-db
% cp ../files/binfile.1 binfile
% time git add binfile
real 0m2.459s
user 0m2.443s
sys 0m0.019s
% git commit -a -m "Rev 1"
% time git repack
[snip]
real 0m1.111s
user 0m1.046s
sys 0m0.061s
% for i in $(seq 2 12); do
cp ../files/binfile.$i binfile
time git commit -a -m "Rev $i"
time git repack
done
Each commit takes around 2.8-3.5 seconds and each repack takes about
1.2-1.5 seconds. These are pretty reasonable.
Now, I try 'git repack -a -f' (or even without -f) and it goes out to
lunch. I think it would take on the order of a day to actually finish
because it wasn't very far after an hour.
[snip]
> How many files besides those 12 big blobs do you have?
This repository has been completely stripped to the 12 revisions of the
one file. So, there are 36 objects.
12 blobs.
12 trees.
12 commits.
That is all.
[snip]
> If you can add them into a single .tgz with instructions on how
> to reproduce the issue and provide me with an URL where I can fetch it
> that'd be perfect.
I will do this in an email off of the list because these files really
shouldn't be available on a public list.
Carl
--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Carl Baldwin RADCAD (R&D CAD)
Hewlett Packard Company
MS 88 work: 970 898-1523
3404 E. Harmony Rd. work: Carl.N.Baldwin@hp.com
Fort Collins, CO 80525 home: Carl@ecBaldwin.net
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 15:37 ` Nicolas Pitre
@ 2006-02-24 23:55 ` Junio C Hamano
0 siblings, 0 replies; 35+ messages in thread
From: Junio C Hamano @ 2006-02-24 23:55 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
Nicolas Pitre <nico@cam.org> writes:
> On Fri, 24 Feb 2006, Junio C Hamano wrote:
>
>> In the Linux 2.6 repository, these object pairs take forever to
>> delta.
>>
>> blob 9af06ba723df75fed49f7ccae5b6c9c34bc5115f ->
>> blob dfc9cd58dc065d17030d875d3fea6e7862ede143
>> size (491102 -> 496045)
>> 58 seconds
>>
>> blob 4917ec509720a42846d513addc11cbd25e0e3c4f ->
>> blob dfc9cd58dc065d17030d875d3fea6e7862ede143
>> size (495831 -> 496045)
>> 64 seconds
>
> Thanks for this. I'll see what I can do to tweak the code to better
> cope with those. Just keep my fourth delta patch in the pu branch for
> now.
If you apply this on top of pack-objects.c, you can find more of
them yourself.
---
diff --git a/pack-objects.c b/pack-objects.c
index be7a200..3f88e86 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -62,6 +62,7 @@ static const char *base_name;
static unsigned char pack_file_sha1[20];
static int progress = 1;
static volatile int progress_update = 0;
+static volatile int progress_update_cnt = 0;
/*
* The object names in objects array are hashed with this hashtable,
@@ -826,6 +827,7 @@ static int try_delta(struct unpacked *cu
struct object_entry *old_entry = old->entry;
int old_preferred = (old_entry->preferred_base ||
old_entry->based_on_preferred);
+ int last_up;
unsigned long size, oldsize, delta_size, sizediff;
long max_size;
void *delta_buf;
@@ -890,8 +892,17 @@ static int try_delta(struct unpacked *cu
}
if (sizediff >= max_size)
return -1;
+ last_up = progress_update_cnt;
delta_buf = diff_delta(old->data, oldsize,
cur->data, size, &delta_size, max_size);
+ if (last_up + 1 < progress_update_cnt) {
+ /* It took more than one second */
+ fprintf(stderr, "%d -> %d: %s -> ",
+ last_up, progress_update_cnt,
+ sha1_to_hex(old_entry->sha1));
+ fprintf(stderr, "%s (%lu -> %lu)\n",
+ sha1_to_hex(cur_entry->sha1), oldsize, size);
+ }
if (!delta_buf)
return 0;
cur_entry->delta = old_entry;
@@ -906,6 +917,7 @@ static void progress_interval(int signum
{
signal(SIGALRM, progress_interval);
progress_update = 1;
+ progress_update_cnt++;
}
static void find_deltas(struct object_entry **list, int window, int depth)
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 21:39 ` Nicolas Pitre
2006-02-24 21:48 ` Nicolas Pitre
@ 2006-02-25 0:45 ` Linus Torvalds
2006-02-25 3:07 ` Nicolas Pitre
1 sibling, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2006-02-25 0:45 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git, Carl Baldwin
On Fri, 24 Feb 2006, Nicolas Pitre wrote:
>
> Currently, diff-delta takes blocks of data in the reference file and
> hashes them. When the target file is scanned, it uses the hash to match
> blocks from the target file with the reference file.
>
> If blocks are hashed evenly the cost of producing a delta is at most
> O(n+m) where n and m are the size of the reference and target files
> respectively. In other words, with good data set the cost is linear.
Assuming the hash is good, of course.
I think this was the problem with you trying something simpler than
adler32..
> But if many blocks from the reference buffer do hash to the same bucket
> then for each block in the target file many blocks from the reference
> buffer have to be tested against, making it tend towards O(n^m) which is
> pretty highly exponential.
>
> The solution I'm investigating is to put a limit on the number of
> entries in the same hash bucket so as to bring the cost back to something
> more linear. That means the delta might miss on better matches that
> have not been hashed but still benefit from a limited set.
Sounds fair enough.
However, you might also want to consider another approach..
One of the biggest costs for the xdelta algorithm is probably just the
"delta_prepare()", but at the same time, that is constant wrt the source
buffer.
Now, the sad part is that when I wrote pack-objects, I didn't really
understand the diff-delta algorithm, I just plugged it in. Which means
that when I did it, I made the (obvious and simple) decision to keep the
_result_ that we are looking at constant, and try to delta against
different sources.
HOWEVER.
I suspect you already see where this is going..
We _could_ switch the "pack-objects" window handling around, and instead
of looking at the object we want to pack, and looking at the ten (or
"window") previous objects to delta against, we could do it the other way
around: keep the object we delta against constant, and see what deltas we
could prepare for the ten next objects.
And since the source would now be constant, you'd need to do the
"delta_prepare()" just _once_ per window, instead of every single time.
Now, I haven't done any profiling on the diff-delta code, and maybe my
guess that delta_prepare() is a pretty expensive part is wrong, and maybe
it wouldn't help to switch the window probing around. But I thought I'd
mention it as one thing to explore..
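In code the idea would look roughly like the sketch below; the split of
diff-delta into a create-index/create-delta pair is hypothetical (the
current code only exposes diff_delta()), and struct candidate merely stands
in for pack-objects' window entries:

    /*
     * Hypothetical split of diff_delta() into an index step and a delta
     * step; none of these names exist in the current code.
     */
    struct delta_index;
    struct delta_index *create_delta_index(const void *src, unsigned long src_size);
    void *create_delta(const struct delta_index *index, const void *trg,
                       unsigned long trg_size, unsigned long *delta_size,
                       unsigned long max_size);
    void free_delta_index(struct delta_index *index);

    struct candidate {              /* stand-in for the window entries */
            void *data;
            unsigned long size;
    };

    /*
     * Keep the delta source constant and probe the next 'window' objects,
     * so the prepare step is paid once per window instead of once per pair.
     */
    static void probe_window(struct candidate *src, struct candidate *trg, int window)
    {
            struct delta_index *index = create_delta_index(src->data, src->size);
            int j;

            if (!index)
                    return;
            for (j = 0; j < window; j++) {
                    unsigned long delta_size;
                    void *delta = create_delta(index, trg[j].data, trg[j].size,
                                               &delta_size, trg[j].size / 2 - 20);
                    if (delta) {
                            /* ... record trg[j] as a delta against src ... */
                            free(delta);    /* free() from <stdlib.h> */
                    }
            }
            free_delta_index(index);
    }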
Linus
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-25 0:45 ` Linus Torvalds
@ 2006-02-25 3:07 ` Nicolas Pitre
2006-02-25 4:05 ` Linus Torvalds
0 siblings, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-25 3:07 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, git, Carl Baldwin
On Fri, 24 Feb 2006, Linus Torvalds wrote:
>
>
> On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> >
> > Currently, diff-delta takes blocks of data in the reference file and
> > hashes them. When the target file is scanned, it uses the hash to match
> > blocks from the target file with the reference file.
> >
> > If blocks are hashed evenly the cost of producing a delta is at most
> > O(n+m) where n and m are the size of the reference and target files
> > respectively. In other words, with good data set the cost is linear.
>
> Assuming the hash is good, of course.
>
> I think this was the problem with you trying something simpler than
> adler32..
Well, that's the compromise to make. By default the version with
adler32 used 16 byte blocks to index the reference buffer. That means
you can match target data against the reference only if whole 16 byte
blocks match. Then, if you fix a typo in the target buffer then you'll
inevitably need 16 literal bytes in the delta instead of only
one because you won't be able to resynchronize with the reference buffer
until the next 16 byte block.
What I've done in my last delta patch is to reduce that 16 byte block to
only 3 bytes. Why 3 bytes? Because less than that produces smaller
delta data if done with literal bytes directly, and 3 bytes provided
enough bits to hash. I also made those 3 byte blocks overlap so
indexing would start at any offset with byte precision. This really
allows for optimal deltas so that they cannot be smaller.
Now the problem comes when indexing a reference file full of:
0x46f8, 0x000b, 0x42fe, 0x0000, 0xffc0, 0x0001, 0xff00, 0x0008,
0x03e0, 0x0009, 0x0f01, 0x0003, 0x8072, 0x0000, 0x0400, 0x0000,
0x0046, 0x0003, 0x9180, 0x0001, 0x0003, 0x0008, 0x02eb, 0x0003,
0x8072, 0x0000, 0x0400, 0x0000, 0x8010, 0x0008, 0x0010, 0x0000,
0x0361, 0x0003, 0x037e, 0x0004, 0x3941, 0x0002, 0x0b0f, 0x0003,
0x8072, 0x0000, 0x0400, 0x0000, 0x000a, 0x000b, 0x0346, 0x000c,
0x11fe, 0x0000, 0x3717, 0x0003, 0x8072, 0x0000, 0x0400, 0x0000,
0x8010, 0x0008, 0x000e, 0x0000, 0x0361, 0x0003, 0x8060, 0x0000,
There is a bunch of ", 0x" sequences that all get hashed to the same
thing. And in the second phase, i.e. trying to find the best match in the
reference buffer for each of the many occurrences of ", 0x" in the target
buffer, you get a combinatorial explosion.
The adler32 made that particular example a non-issue since the
likelihood of many 16 byte blocks being the same is pretty low in this
case. But the flaw remains if, for example, there are lots of similar 16
byte blocks, like a binary file with lots of zeroes. In
fact, the performance problem Carl is having happens with the diff-delta
version still using adler32.
> > But if many blocks from the reference buffer do hash to the same bucket
> > then for each block in the target file many blocks from the reference
> > buffer have to be tested against, making it tend towards O(n^m) which is
> > pretty highly exponential.
> >
> > The solution I'm investigating is to put a limit on the number of
> > entries in the same hash bucket so as to bring the cost back to something
> > more linear. That means the delta might miss on better matches that
> > have not been hashed but still benefit from a limited set.
>
> Sounds fair enough.
Testing appears to show that this is a worthwhile safety valve. And in
most cases that safety valve should not be activated at all.
> However, you might also want to consider another approach..
>
> One of the biggest costs for the xdelta algorithm is probably just the
> "delta_prepare()", but at the same time, that is constant wrt the source
> buffer.
Actually it is not that costly. Much much less than computing the sha1
of the same buffer for example.
> Now, the sad part is that when I wrote pack-objects, I didn't really
> understand the diff-delta algorithm, I just plugged it in. Which means
> that when I did it, I made the (obvious and simple) decision to keep the
> _result_ that we are looking at constant, and try to delta against
> different sources.
>
> HOWEVER.
>
> I suspect you already see where this is going..
>
> We _could_ switch the "pack-objects" window handling around, and instead
> of looking at the object we want to pack, and looking at the ten (or
> "window") previous objects to delta against, we could do it the other way
> around: keep the object we delta against constant, and see what deltas we
> could prepare for the ten next objects.
>
> And since the source would now be constant, you'd need to do the
> "delta_prepare()" just _once_ per window, instead of every single time.
Might be worth trying. Actually, this can be tested without even
changing the window handling just yet, since diff-delta() could return
the index data instead of freeing it, and pack-objects can store it
alongside the object data it tries to delta against. That
wouldn't be memory efficient, but at least that would give an idea of
the magnitude of the saving on CPU time. But I really doubt that'll
save more than a few percent.
>
> Now, I haven't done any profiling on the diff-delta code, and maybe my
> guess that delta_prepare() is a pretty expensive part is wrong, and maybe
> it wouldn't help to switch the window probing around. But I thought I'd
> mention it as one thing to explore..
Just to give you an idea, the bulk of my current "prepare" code looks
like this:
/* then populate the index */
data = buf + bufsize - 2;
while (data > buf) {
entry->ptr = --data;
i = (data[0] << hshift) ^ data[1];
i ^= (i << hshift) ^ data[2];
entry->next = hash[i];
hash[i] = entry++;
}
As you can see it is pretty lightweight.
But that would probably be a worthwhile optimization to have even if it
only saves 10% of CPU time.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-24 22:50 ` Carl Baldwin
@ 2006-02-25 3:53 ` Nicolas Pitre
0 siblings, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-25 3:53 UTC (permalink / raw)
To: Carl Baldwin; +Cc: Junio C Hamano, git
On Fri, 24 Feb 2006, Carl Baldwin wrote:
> > If you can add them into a single .tgz with instructions on how
> > to reproduce the issue and provide me with an URL where I can fetch it
> > that'd be perfect.
>
> I will do this in an email off of the list because these files really
> shouldn't be available on a public list.
OK I have the files, and I can confirm that your problem is of the same
combinatorial explosion type I already talked about, _even_ with the
version using adler32. This is really O(m*n), where m and n are the
sizes of two consecutive versions of the file. And since m and n are
both _huge_, the delta code really goes out to lunch.
I'm working on a patch to cap this to something like O(m+n).
Stay tuned.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-25 3:07 ` Nicolas Pitre
@ 2006-02-25 4:05 ` Linus Torvalds
2006-02-25 5:10 ` Nicolas Pitre
0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2006-02-25 4:05 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git, Carl Baldwin
On Fri, 24 Feb 2006, Nicolas Pitre wrote:
>
> > Well, that's the compromise to make. By default the version with
> adler32 used 16 byte blocks to index the reference buffer. That means
> you can match target data against the reference only if whole 16 byte
> > blocks match. Then, if you fix a typo in the target buffer, you'll
> inevitably need 16 literal bytes in the delta instead of only
> one because you won't be able to resynchronize with the reference buffer
> until the next 16 byte block.
That shouldn't be true. Once you find a 16-byte match, you can search
forward (in theory you should be able to search backwards too, but by that
time we've already expanded the non-matching previous bytes, I think).
But I'm no xdelta expert, so I might be wrong.
> What I've done in my last delta patch is to reduce that 16 byte block to
> only 3 bytes. Why 3 bytes? Because less than that produces smaller
> delta data if done with literal bytes directly, and 3 bytes provide
> enough bits to hash.
On the other hand, the cost is that your lookups _are_ going to be more
expensive. Regardless of how good the hash is, basically you have 16/3
more hash-entries to look up, so you've made compression more expensive in
footprint, at least (I assume you've made the hash appropriately larger).
Also, at 3 bytes, insertion is at least equally dense (three bytes of data
vs three bytes of offset into the source), and can be worse (the offset
might be 5 bytes, no?). So it would seem like you'd be better off with 4+
bytes, at which point the delta should be a win.
Have you tried some half-way point, like ~8 bytes?
> Now the problem comes when indexing a reference file full of:
>
> 0x46f8, 0x000b, 0x42fe, 0x0000, 0xffc0, 0x0001, 0xff00, 0x0008,
...
>
> There are a bunch of ", 0x" sequences that get hashed to the same thing.
You'll find a lot of that in any file: three or four bytes of similarity
just doesn't sound worthwhile to go digging after.
> The adler32 made that particular example a non-issue since the
> likelihood of many 16 byte blocks being the same is pretty low in this
> case. But the flaw remains if, for example, there are lots of similar 16
> byte blocks, like a binary file with lots of zeroes. In
> fact, the performance problem Carl is having occurs with the diff-delta
> version still using adler32.
Agreed. I think limiting the hash chain length is a fine idea regardless, I just
think it sounds dangerous with the three-byte thing where a lot of matches
should be expected (never mind ", 0x", just things like newlines and tabs
in source code).
Only considering 16-byte sequences of similarities is a trade-off of
packing cost vs win.. Not saying other block-sizes aren't worth testing,
but I suspect trying too hard is going to be too costly.
Especially as deltas compress _worse_ the smaller they are. Bigger
"insert" chunks probably compress a lot better than a copy chunk.
Have you looked at the delta size vs compression?
Linus
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-25 4:05 ` Linus Torvalds
@ 2006-02-25 5:10 ` Nicolas Pitre
2006-02-25 5:35 ` Nicolas Pitre
2006-02-25 19:18 ` [PATCH] diff-delta: produce optimal pack data Linus Torvalds
0 siblings, 2 replies; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-25 5:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, git, Carl Baldwin
On Fri, 24 Feb 2006, Linus Torvalds wrote:
>
>
> On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> >
> > Well, that's the compromise to make. By default the version with
> > adler32 used 16 byte blocks to index the reference buffer. That means
> > you can match target data against the reference only if whole 16 byte
> > blocks match. Then, if you fix a typo in the target buffer, you'll
> > inevitably need 16 literal bytes in the delta instead of only
> > one because you won't be able to resynchronize with the reference buffer
> > until the next 16 byte block.
>
> That shouldn't be true. Once you find a 16-byte match, you can search
> forward (in theory you should be able to search backwards too, but by that
> time we've already expanded the non-matching previous bytes, I think).
Obviously, but that's not my point. What I mean is that small changes
will kind of sabotage the match since you'll have to scan forward
(adding literal bytes to the delta output in the meantime) until you
find another 16 byte block that matches. This is harder to find than a
3 byte block. Hence you'll end up adding more literal bytes (up to 16
of them) until you can match the reference buffer again even if you only
flipped one byte in the target buffer. In some cases that adds up to
deltas that are twice as large.
> > What I've done in my last delta patch is to reduce that 16 byte block to
> > only 3 bytes. Why 3 bytes? Because less than that produces smaller
> > delta data if done with literal bytes directly, and 3 bytes provide
> > enough bits to hash.
>
> On the other hand, the cost is that your lookups _are_ going to be more
> expensive. Regardless of how good the hash is, basically you have 16/3
> more hash-entries to look up, so you've made compression more expensive in
> footprint, at least (I assume you've made the hash appropriately larger).
Yes, the hash is larger. There is a cost in memory usage but not really
in CPU cycles.
> Also, at 3 bytes, insertion is at least equally dense (three bytes of data
> vs three bytes of offset into the source), and can be worse (the offset
> might be 5 bytes, no?). So it would seem like you'd be better off with 4+
> bytes, at which point the delta should be a win.
The code already discriminates between the space needed for a block copy
notation, given the offset and size, and the space needed for the
equivalent literal bytes. So the optimal encoding is always chosen
already.
In fact, if you want to copy up to 15 bytes from offset 0 that will be
encoded with only 2 bytes in the delta. The only case that is
suboptimal is when you want to copy only two bytes from offset 0 (2
delta bytes), but a two byte match is never found by the hash lookup
since the hash is computed over 3 bytes. In that case 2 literal bytes
will be added to the delta plus an opcode = 3 bytes. I considered that
special case not worth handling. However, copying a block of 3 bytes
that gets encoded into 3 bytes of delta is quite common (that'd take 4
bytes of delta if they were literals).
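For reference, that space comparison falls straight out of the delta
copy opcode: one command byte plus one byte for every non-zero byte of
the offset and of the size. A small sketch of the size computation (my
reading of the format handled by patch-delta.c, so treat it as
illustrative only):

/*
 * Bytes needed to encode "copy <size> bytes from <offset>" in the pack
 * delta format: one command byte, plus one byte per non-zero byte of
 * the 4-byte offset and of the (up to 3-byte) size.  Literals cost one
 * opcode byte per run of up to 127 bytes plus the bytes themselves.
 */
static unsigned int copy_op_bytes(unsigned long offset, unsigned long size)
{
	unsigned int n = 1;	/* the command byte itself */

	if (offset & 0x000000ff) n++;
	if (offset & 0x0000ff00) n++;
	if (offset & 0x00ff0000) n++;
	if (offset & 0xff000000) n++;
	if (size & 0x0000ff) n++;
	if (size & 0x00ff00) n++;
	if (size & 0xff0000) n++;
	return n;
}

So copy_op_bytes(0, 15) == 2 while 15 literal bytes cost 16, and
copy_op_bytes(200, 3) == 3 versus 4 bytes for the same three literals.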
As for using more bytes for block hashing, that increases the number of
cycles needed to compute the hash. The adler32 version reads 16 bytes for
every byte offset in the target file while my latest version only reads
3 bytes for every byte offset. So in effect my target hash computation
is faster than the adler32 one. However, there are potentially more
entries in the same hash bucket to validate, especially with repetitive
data.
> Have you tried some half-way point, like ~8 bytes?
Yes, and while the needed cycles tend to remain the same on average, the
resulting pack gets larger.
> > Now the problem comes when indexing a reference file full of:
> >
> > 0x46f8, 0x000b, 0x42fe, 0x0000, 0xffc0, 0x0001, 0xff00, 0x0008,
> ...
> >
> > There are a bunch of ", 0x" sequences that get hashed to the same thing.
>
> You'll find a lot of that in any file: three or four bytes of similarity
> just doesn't sound worthwhile to go digging after.
Well, after having experimented a lot with multiple parameters I think
they are worth it after all. Not only do they provide for optimal deltas,
but their hash is faster to compute than for larger blocks, which seems
to counterbalance the cost of the longer hash chains.
> > The adler32 made that particular example a non-issue since the
> > likelihood of many 16 byte blocks being the same is pretty low in this
> > case. But the flaw remains if, for example, there are lots of similar 16
> > byte blocks, like a binary file with lots of zeroes. In
> > fact, the performance problem Carl is having occurs with the diff-delta
> > version still using adler32.
>
> Agreed. I think limiting the hash chain length is a fine idea regardless, I just
> think it sounds dangerous with the three-byte thing where a lot of matches
> should be expected (never mind ", 0x", just things like newlines and tabs
> in source code).
They usually number less than the number of lines. Even if you have a
1000 line source file and we keep only 50 hashed "\n\t\t" entries, that
is sufficient to provide enough opportunities for matching them, plus
common patterns like a following "if (" for example.
> Only considering 16-byte sequences of similarities is a trade-off of
> packing cost vs win.. Not saying other block-sizes aren't worth testing,
> but I suspect trying too hard is going to be too costly.
I of course looked at the time to pack vs the size reduction in my
tests. And really, like I said above, the cost is well balanced. The
only issue is that smaller blocks are more likely to run into
pathological data sets. But that problem exists with larger blocks
too, to a lesser degree of course, but still. For example, using a 16
byte block size with adler32, computing a delta between two files
> Especially as deltas compress _worse_ the smaller they are. Bigger
> "insert" chunks probably compress a lot better than a copy chunk.
Yes, but given that we already favor deltas from larger to smaller,
those inserts are not making much difference. They have to be quite
large to effectively provide better zlib compression.
> Have you looked at the delta size vs compression?
That's certainly an additional test worth adding to try_delta():
max_size should be smaller than the original object's deflated size,
making sure we won't store deltas that might end up larger than the
undeltified object.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-25 5:10 ` Nicolas Pitre
@ 2006-02-25 5:35 ` Nicolas Pitre
2006-03-07 23:48 ` [RFH] zlib gurus out there? Junio C Hamano
2006-02-25 19:18 ` [PATCH] diff-delta: produce optimal pack data Linus Torvalds
1 sibling, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2006-02-25 5:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, git, Carl Baldwin
Oops... I forgot to complete one paragraph.
On Sat, 25 Feb 2006, Nicolas Pitre wrote:
> I of course looked at the time to pack vs the size reduction in my
> tests. And really, like I said above, the cost is well balanced. The
> only issue is that smaller blocks are more likely to run into
> pathological data sets. But that problem exists with larger blocks
> too, to a lesser degree of course, but still. For example, using a 16
> byte block size with adler32, computing a delta between two files
... as provided by Carl takes up to _nine_ minutes for a _single_ delta!
So regardless of the block size used, the issue right now has more to do
with that combinatorial explosion than with the actual block size. And
preventing that pathological case from expanding out of bounds is pretty
easy to do.
OK, I just tested a tentative patch to trap that case, and the time to
delta those two 20MB files dropped from over 9 minutes to only 36 seconds
here, with less than a 10% difference in delta size. So I think I might
be on the right track. Further tuning might help even more.
Nicolas
* Re: [PATCH] diff-delta: produce optimal pack data
2006-02-25 5:10 ` Nicolas Pitre
2006-02-25 5:35 ` Nicolas Pitre
@ 2006-02-25 19:18 ` Linus Torvalds
1 sibling, 0 replies; 35+ messages in thread
From: Linus Torvalds @ 2006-02-25 19:18 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, git, Carl Baldwin
On Sat, 25 Feb 2006, Nicolas Pitre wrote:
>
> Yes, the hash is larger. There is a cost in memory usage but not really
> in CPU cycles.
Note that memory usage translates almost 1:1 (or worse) to CPU cycles in
almost all real-life behaviours. Only in carefully tuned benchmarks does
it not.
Increased memory usage means more paging, and worse cache behaviour. Now,
hashes aren't wonderful for caches in the first place, but imagine the
hump you pass when the data doesn't fit in a 64kB L1 any more (or a 256kB
L2). Huge.
> > You'll find a lot of that in any file: three or four bytes of similarity
> > just doesn't sound worthwhile to go digging after.
>
> Well, after having experimented a lot with multiple parameters I think
> they are worth it after all. Not only do they provide for optimal deltas,
> but their hash is faster to compute than for larger blocks, which seems
> to counterbalance the cost of the longer hash chains.
Hey, numbers talk. If you've got the numbers, I'll just shut up ;)
Linus
* [RFH] zlib gurus out there?
2006-02-25 5:35 ` Nicolas Pitre
@ 2006-03-07 23:48 ` Junio C Hamano
2006-03-08 0:59 ` Linus Torvalds
2006-03-08 10:45 ` [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data Sergey Vlasov
0 siblings, 2 replies; 35+ messages in thread
From: Junio C Hamano @ 2006-03-07 23:48 UTC (permalink / raw)
To: git; +Cc: Nicolas Pitre, Linus Torvalds
I've been staring at reusing existing data while packing, and
this occurred to me...
During packing, suppose that we chose to store an object in
base form, undeltified. And also suppose we have that object
loose in the .git/objects/??/ directory. We already have it in
deflated form, but with its own header. I started wondering if
we can somehow reuse this.
A short object format brush-up lesson is in order here.
* An undeltified object in a pack is represented like this:
  (1) the header is dense, variable-size binary data that encodes
      the type and inflated length (a sketch of this encoding
      follows the code quoted below);
  (2) the deflated data immediately follows the header.
* On the other hand, a loose object is represented like this:
  (1) the header looks like sprintf("%s %lu%c", type, len, 0);
  (2) concatenate the data to the header;
  (3) the SHA1 checksum of the above becomes the object name;
  (4) deflate the header and data using the same z_stream, in two
      steps, like this (sha1_file.c::write_sha1_file):
	/* Compress it */
	stream.next_out = compressed;
	stream.avail_out = size;

	/* First header.. */
	stream.next_in = hdr;
	stream.avail_in = hdrlen;
	while (deflate(&stream, 0) == Z_OK)
		/* nothing */;

	/* Then the data itself.. */
	stream.next_in = buf;
	stream.avail_in = len;
	while (deflate(&stream, Z_FINISH) == Z_OK)
		/* nothing */;
	deflateEnd(&stream);
	size = stream.total_out;
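To make the pack-side header in (1) above concrete, here is a sketch of
that variable size encoding as I understand it (type in the high bits of
the first byte, size spread over the remaining bits, MSB used as a
continuation flag); illustrative only:

/*
 * Sketch of the "dense variable size binary" pack entry header:
 * first byte = (type << 4) | (low 4 bits of size), MSB set while more
 * size bytes follow; each following byte carries 7 more bits of size.
 * Returns the number of header bytes written into hdr.
 */
static int encode_pack_header(unsigned char *hdr, unsigned int type,
                              unsigned long size)
{
	int n = 0;
	unsigned char c = (type << 4) | (size & 15);

	size >>= 4;
	while (size) {
		hdr[n++] = c | 0x80;
		c = size & 0x7f;
		size >>= 7;
	}
	hdr[n++] = c;
	return n;
}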
So I thought... if we cause a full flush after the header part,
I can find the flush boundary in a loose object file and
copy the rest into the packfile I am generating, after placing the
binary encoded header. If this works, we do not have to inflate a
loose object to read it and deflate it again to store it in the
pack. We would get better packing as well, since we deflate
loose objects with Z_BEST_COMPRESSION, while packs are done with
Z_DEFAULT_COMPRESSION. When pack-objects reads from a loose
object, if we detect that there is no full flush after the
header, we would do the traditional inflate-deflate cycle, so
this would be backward compatible.
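As a rough sketch of the "find the flush boundary" step (illustrative
only): zlib marks a full flush with an empty stored block, the byte
sequence 00 00 ff ff, so the boundary could be located like this:

#include <string.h>

/*
 * Sketch: return the offset just past the first full-flush marker in a
 * deflated loose object, or -1 if there is none.  A real implementation
 * would verify the boundary by inflating the header first, since the
 * same byte pattern could in principle occur inside compressed data.
 */
static long find_flush_boundary(const unsigned char *buf, unsigned long len)
{
	static const unsigned char marker[4] = { 0x00, 0x00, 0xff, 0xff };
	unsigned long i;

	for (i = 0; i + 4 <= len; i++)
		if (!memcmp(buf + i, marker, 4))
			return (long)(i + 4);
	return -1;
}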
However, I am stuck with the first step, which is to do a full
flush after the header. An obvious change to the code quoted
above writes out a corrupt object:
	/* First header.. */
	stream.next_in = hdr;
	stream.avail_in = hdrlen;
-	while (deflate(&stream, 0) == Z_OK)
+	while (deflate(&stream, Z_FULL_FLUSH) == Z_OK)
		/* nothing */;
git-fsck-objects complains that sha1 does not match. It appears
that sha1_file.c::unpack_sha1_rest() somehow barfs upon
seeing the full flush, but I haven't dug into it yet.
Would anybody with more experience with zlib want to help?
* Re: [RFH] zlib gurus out there?
2006-03-07 23:48 ` [RFH] zlib gurus out there? Junio C Hamano
@ 2006-03-08 0:59 ` Linus Torvalds
2006-03-08 1:22 ` Junio C Hamano
2006-03-08 10:45 ` [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data Sergey Vlasov
1 sibling, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2006-03-08 0:59 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Nicolas Pitre
On Tue, 7 Mar 2006, Junio C Hamano wrote:
>
> However, I am stuck with the first step, which is to do a full
> flush after the header. An obvious change to the code quoted
> above writes out a corrupt object:
>
> /* First header.. */
> stream.next_in = hdr;
> stream.avail_in = hdrlen;
> - while (deflate(&stream, 0) == Z_OK)
> + while (deflate(&stream, Z_FULL_FLUSH) == Z_OK)
> /* nothing */;
No, I don't think that's good. You're only doing a partial deflate, you
can't ask for a Z_FULL_FLUSH. That only works if you give it the whole
buffer, and you don't.
iirc, wtf, wdik, and ianal.
Linus
* Re: [RFH] zlib gurus out there?
2006-03-08 0:59 ` Linus Torvalds
@ 2006-03-08 1:22 ` Junio C Hamano
2006-03-08 2:00 ` Linus Torvalds
0 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2006-03-08 1:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@osdl.org> writes:
> On Tue, 7 Mar 2006, Junio C Hamano wrote:
>>
>> However, I am stuck with the first step, which is to do a full
>> flush after the header. An obvious change to the code quoted
>> above writes out a corrupt object:
>>
>> /* First header.. */
>> stream.next_in = hdr;
>> stream.avail_in = hdrlen;
>> - while (deflate(&stream, 0) == Z_OK)
>> + while (deflate(&stream, Z_FULL_FLUSH) == Z_OK)
>> /* nothing */;
>
> No, I don't think that's good. You're only doing a partial deflate, you
> can't ask for a Z_FULL_FLUSH. That only works if you give it the whole
> buffer, and you don't.
So, in short there is no way to create:
    hdr part deflated.
    flush.
    data part deflated independently.
and have the current sha1_read_file() not notice that flush,
while I can inspect the deflated stream to find the "flush" and
copy only the deflated data part into a pack? Bummer... I was
really shooting for full backward compatibility.
* Re: [RFH] zlib gurus out there?
2006-03-08 1:22 ` Junio C Hamano
@ 2006-03-08 2:00 ` Linus Torvalds
2006-03-08 9:46 ` Johannes Schindelin
0 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2006-03-08 2:00 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Tue, 7 Mar 2006, Junio C Hamano wrote:
> >
> > No, I don't think that's good. You're only doing a partial deflate, you
> > can't ask for a Z_FULL_FLUSH. That only works if you give it the whole
> > buffer, and you don't.
Actually, I misread what you were trying to do, and thought this was the
inflate phase, not the deflate. Now that I understand what you want,
> So, in short there is no way to create:
>
> hdr part deflated.
> flush.
> data part deflated independently.
>
> and have the current sha1_read_file() not to notice that flush,
Actually, try the patch you already tried, except you'll need to add a

	deflateEnd(&stream);
	deflateInit(&stream, Z_BEST_COMPRESSION);
	.. set up output parameters again ..

and you need to change the initial

	size = deflateBound(&stream, len+hdrlen);

to

	size = deflateBound(&stream, len) + deflateBound(&stream, hdrlen);

and then you might be ok.
That said, I'm not sure I agree with what you're trying to do.
Linus
* Re: [RFH] zlib gurus out there?
2006-03-08 2:00 ` Linus Torvalds
@ 2006-03-08 9:46 ` Johannes Schindelin
0 siblings, 0 replies; 35+ messages in thread
From: Johannes Schindelin @ 2006-03-08 9:46 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, git
Hi,
On Tue, 7 Mar 2006, Linus Torvalds wrote:
> On Tue, 7 Mar 2006, Junio C Hamano wrote:
> > >
> > > No, I don't think that's good. You're only doing a partial deflate, you
> > > can't ask for a Z_FULL_FLUSH. That only works if you give it the whole
> > > buffer, and you don't.
>
> Actually, I misread what you were trying to do, and thought this was the
> inflate phase, not the deflate.
I don't think it matters whether it is inflate or deflate. zlib keeps
internal state that depends on the data. That is the whole reason why the
packing is so good: it uses the redundancy in the data already seen to
construct a codebook. (And that's also the reason why you can't start to
deflate in the middle.)
Ciao,
Dscho
* [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data
2006-03-07 23:48 ` [RFH] zlib gurus out there? Junio C Hamano
2006-03-08 0:59 ` Linus Torvalds
@ 2006-03-08 10:45 ` Sergey Vlasov
2006-03-08 11:04 ` Junio C Hamano
1 sibling, 1 reply; 35+ messages in thread
From: Sergey Vlasov @ 2006-03-08 10:45 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, Nicolas Pitre, Linus Torvalds
Data after Z_FULL_FLUSH will be compressed independently of the
header, and could therefore be reused without recompressing when
creating a pack.
---
This passes "make test" and unpacking of the whole git repo with
git-fsck-objects afterwards.
However, a straight reuse will still not be possible, because
sha1write_compressed() uses deflateInit(&stream, Z_DEFAULT_COMPRESSION),
which writes zlib headers around the deflate stream, and the zlib footer
contains an adler32 checksum. So, at a minimum, you will need to
decompress the object data, calculate its adler32 checksum and write the
zlib header yourself.
sha1_file.c | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
8b12d9a58e87a4c5b5a2a7b20d06fe29a5afb903
diff --git a/sha1_file.c b/sha1_file.c
index a80d849..34d4da4 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1399,7 +1399,8 @@ int write_sha1_file(void *buf, unsigned
 	/* Set it up */
 	memset(&stream, 0, sizeof(stream));
 	deflateInit(&stream, Z_BEST_COMPRESSION);
-	size = deflateBound(&stream, len+hdrlen);
+	/* Additional 6 bytes for the Z_FULL_FLUSH marker */
+	size = deflateBound(&stream, hdrlen) + 6 + deflateBound(&stream, len);
 	compressed = xmalloc(size);
 
 	/* Compress it */
@@ -1412,6 +1413,10 @@ int write_sha1_file(void *buf, unsigned
 	while (deflate(&stream, 0) == Z_OK)
 		/* nothing */;
 
+	/* Flush before data */
+	while (deflate(&stream, Z_FULL_FLUSH) == Z_OK)
+		/* nothing */;
+
 	/* Then the data itself.. */
 	stream.next_in = buf;
 	stream.avail_in = len;
--
1.2.GIT
* Re: [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data
2006-03-08 10:45 ` [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data Sergey Vlasov
@ 2006-03-08 11:04 ` Junio C Hamano
2006-03-08 14:17 ` Sergey Vlasov
0 siblings, 1 reply; 35+ messages in thread
From: Junio C Hamano @ 2006-03-08 11:04 UTC (permalink / raw)
To: Sergey Vlasov; +Cc: git
Sergey Vlasov <vsu@altlinux.ru> writes:
> However, a straight reuse will still not be possible, because
> sha1write_compressed() uses deflateInit(&stream, Z_DEFAULT_COMPRESSION),
> which writes zlib headers around the deflate stream, and the zlib footer
> contains an adler32 checksum. So, at a minimum, you will need to
> decompress the object data, calculate its adler32 checksum and write the
> zlib header yourself.
Hmph. Thanks for helping, but it sounds like my original plan
was not useful at all. Probably inflating alone would still be
cheaper than inflating and then deflating, but it would not be
as cool as a straight copy. Sigh...
* Re: [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data
2006-03-08 11:04 ` Junio C Hamano
@ 2006-03-08 14:17 ` Sergey Vlasov
0 siblings, 0 replies; 35+ messages in thread
From: Sergey Vlasov @ 2006-03-08 14:17 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Wed, Mar 08, 2006 at 03:04:14AM -0800, Junio C Hamano wrote:
> Sergey Vlasov <vsu@altlinux.ru> writes:
> > However, a straight reuse will still not be possible, because
> > sha1write_compressed() uses deflateInit(&stream, Z_DEFAULT_COMPRESSION),
> > which writes zlib headers around the deflate stream, and the zlib footer
> > contains an adler32 checksum. So, at a minimum, you will need to
> > decompress the object data, calculate its adler32 checksum and write the
> > zlib header yourself.
>
> Hmph. Thanks for helping, but it sounds like my original plan
> was not useful at all. Probably inflating would be still
> cheaper than inflating and then deflating, but it would not be
> as cool as a straight copy. Sigh...
Actually you can calculate the adler32 checksum of the object data from
adler32(header+data) (available at the end of the loose object file),
adler32(header) (which you will need to calculate), and len(data)
(which is available in the header):
#define ADLER32_BASE 65521UL

unsigned int adler32_split(unsigned int adler_full, unsigned int adler_1,
                           unsigned long len_2)
{
	unsigned long s1_1 = adler_1 & 0xffff;
	unsigned long s1_2 = (adler_1 >> 16) & 0xffff;
	unsigned long rem = len_2 % ADLER32_BASE;
	unsigned long s_1_offset = (s1_1 + ADLER32_BASE - 1) % ADLER32_BASE;
	unsigned long s_2_offset = (s1_2 + s_1_offset*rem) % ADLER32_BASE;
	unsigned long sf_1 = adler_full & 0xffff;
	unsigned long sf_2 = (adler_full >> 16) & 0xffff;
	unsigned long s2_1 = (sf_1 + ADLER32_BASE - s_1_offset) % ADLER32_BASE;
	unsigned long s2_2 = (sf_2 + ADLER32_BASE - s_2_offset) % ADLER32_BASE;
	return (s2_2 << 16) | s2_1;
}
However, the resulting code probably won't be pretty...
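For what it's worth, hypothetical usage would look something like this
(the helper name is made up; adler32() is zlib's own):

#include <zlib.h>

/*
 * Given the loose object's "<type> <len>\0" header and the adler32
 * found at the end of its zlib stream (which covers header+data),
 * recover the adler32 of the data alone, which is the value a pack
 * writer would need when rewrapping the reused deflate stream.
 */
static unsigned int data_adler32(const unsigned char *hdr, unsigned int hdrlen,
                                 unsigned long datalen,
                                 unsigned int adler_of_whole_file)
{
	unsigned int adler_hdr = adler32(adler32(0L, Z_NULL, 0), hdr, hdrlen);
	return adler32_split(adler_of_whole_file, adler_hdr, datalen);
}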
Thread overview: 35+ messages
2006-02-22 1:45 [PATCH] diff-delta: produce optimal pack data Nicolas Pitre
2006-02-24 8:49 ` Junio C Hamano
2006-02-24 15:37 ` Nicolas Pitre
2006-02-24 23:55 ` Junio C Hamano
2006-02-24 17:44 ` Carl Baldwin
2006-02-24 17:56 ` Nicolas Pitre
2006-02-24 18:35 ` Carl Baldwin
2006-02-24 18:57 ` Nicolas Pitre
2006-02-24 19:23 ` Carl Baldwin
2006-02-24 20:02 ` Nicolas Pitre
2006-02-24 20:40 ` Carl Baldwin
2006-02-24 21:12 ` Nicolas Pitre
2006-02-24 22:50 ` Carl Baldwin
2006-02-25 3:53 ` Nicolas Pitre
2006-02-24 20:02 ` Linus Torvalds
2006-02-24 20:19 ` Nicolas Pitre
2006-02-24 20:53 ` Junio C Hamano
2006-02-24 21:39 ` Nicolas Pitre
2006-02-24 21:48 ` Nicolas Pitre
2006-02-25 0:45 ` Linus Torvalds
2006-02-25 3:07 ` Nicolas Pitre
2006-02-25 4:05 ` Linus Torvalds
2006-02-25 5:10 ` Nicolas Pitre
2006-02-25 5:35 ` Nicolas Pitre
2006-03-07 23:48 ` [RFH] zlib gurus out there? Junio C Hamano
2006-03-08 0:59 ` Linus Torvalds
2006-03-08 1:22 ` Junio C Hamano
2006-03-08 2:00 ` Linus Torvalds
2006-03-08 9:46 ` Johannes Schindelin
2006-03-08 10:45 ` [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data Sergey Vlasov
2006-03-08 11:04 ` Junio C Hamano
2006-03-08 14:17 ` Sergey Vlasov
2006-02-25 19:18 ` [PATCH] diff-delta: produce optimal pack data Linus Torvalds
2006-02-24 18:49 ` Carl Baldwin
2006-02-24 19:03 ` Nicolas Pitre