* [RFC] Optimize diff-delta.c
@ 2007-05-01 14:49 Martin Koegler
2007-05-01 15:51 ` Johannes Schindelin
2007-05-01 16:05 ` Nicolas Pitre
0 siblings, 2 replies; 8+ messages in thread
From: Martin Koegler @ 2007-05-01 14:49 UTC (permalink / raw)
To: git; +Cc: Nicolas Pitre, Martin Koegler
I am trying to use git with large blobs. Putting such blobs into pack
files is slow and requires lots of memory, so I took a look at
the packing process.
As the delta format only supports 32 bit offsets, the uncompressed
blob size is limited to 4GB.
The delta index has approximately the same size in memory as the
uncompressed blob ((blob size)/16 * sizeof(index_entry)).
git-pack-objects keeps the delta indexes of all objects in the search
window in memory.
So creating a delta for 4 GB files is totally unrealistic:
(4 GB data + ~4 GB index) * window size [default: 10] = 80 GB in the worst case
In my case, the blobs are some hundred MB big, but git-pack-objects
already uses some GB of memory. As the memory requirement of
git-pack-objects is currently below the available memory of my system,
I do not need to address this issue yet.
In the future, I'll probably need to create a patch that frees big delta
indexes in find_delta immediately after create_delta returns. This
will increase the processing time, but that is better than not being
able to pack objects.
I tried to speed up the delta generation by searching for a common
prefix, as my blobs are mostly append-only. I tested it with a bit
fewer than 1000 big blobs. The time for finding the deltas decreased
from 17 to 14 minutes of CPU time.
When repacking the git repository itself, I get the following numbers:
Unmodified version (gcc-4.1):
$ echo | time ./git-pack-objects --non-empty --all --reflog --unpacked=pack-d44dc76d0e873a7c7566bcc4503731b9a5640b30.pack --no-reuse-delta .git/.tmp-28449-pack
Generating pack...
Done counting 42553 objects.
Deltifying 42553 objects...
100% (42553/42553) done
Writing 42553 objects...
100% (42553/42553) done
d44dc76d0e873a7c7566bcc4503731b9a5640b30
Total 42553 (delta 29605), reused 12346 (delta 0)
63.82user 0.80system 1:06.64elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+29177minor)pagefaults 0swaps
Patched version (gcc-4.1):
$ echo | time ../git/git-pack-objects --non-empty --all --reflog --unpacked=pack-d44dc76d0e873a7c7566bcc4503731b9a5640b30.pack --no-reuse-delta .git/.tmp-28448-pack
Generating pack...
Done counting 42553 objects.
Deltifying 42553 objects...
100% (42553/42553) done
Writing 42553 objects...
100% (42553/42553) done
d44dc76d0e873a7c7566bcc4503731b9a5640b30
Total 42553 (delta 29581), reused 12353 (delta 0)
62.13user 0.91system 1:05.07elapsed 96%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+39690minor)pagefaults 0swaps
So it can help improve performance a little bit (62.13+0.91 vs. 63.82+0.80)
on normal repositories.
The following patch is only for testing purposes and not cleaned up.
Regards, Martin Kögler
diff-delta.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 62 insertions(+), 0 deletions(-)
diff --git a/diff-delta.c b/diff-delta.c
index 9f998d0..4f29eb7 100644
--- a/diff-delta.c
+++ b/diff-delta.c
@@ -250,6 +250,8 @@ create_delta(const struct delta_index *index,
int inscnt;
const unsigned char *ref_data, *ref_top, *data, *top;
unsigned char *out;
+ const unsigned char* d, *dt, *t, *td;
+ unsigned int psize;
if (!trg_buf || !trg_size)
return NULL;
@@ -283,6 +285,66 @@ create_delta(const struct delta_index *index,
data = trg_buf;
top = (const unsigned char *) trg_buf + trg_size;
+ psize = 0;
+ d=data;
+ dt=top-RABIN_WINDOW;
+ t=ref_data;
+ td=ref_top-RABIN_WINDOW;
+ while (d<dt && t<td && !memcmp (d, t, RABIN_WINDOW))
+ {
+ psize+=RABIN_WINDOW;
+ d+=RABIN_WINDOW;
+ t+=RABIN_WINDOW;
+ }
+ while (psize)
+ {
+ unsigned int size=psize;
+ unsigned int offset=0, moff;
+ unsigned char *op;
+
+ if(size>0xFFFFFF)
+ size=0xFFFFFF;
+
+
+ data+=size;
+ moff=offset;
+ offset+=size;
+ psize-=size;
+
+ op = out + outpos++;
+ i = 0x80;
+
+ if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; }
+ moff >>= 8;
+ if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; }
+ moff >>= 8;
+ if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; }
+ moff >>= 8;
+ if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; }
+
+ if (size & 0xff) { out[outpos++] = size; i |= 0x10; }
+ size >>= 8;
+ if (size & 0xff) { out[outpos++] = size; i |= 0x20; }
+ size >>= 8;
+ if (size & 0xff) { out[outpos++] = size; i |= 0x40; }
+
+ *op = i;
+
+ if (outpos >= outsize - MAX_OP_SIZE) {
+ void *tmp = out;
+ outsize = outsize * 3 / 2;
+ if (max_size && outsize >= max_size)
+ outsize = max_size + MAX_OP_SIZE + 1;
+ if (max_size && outpos > max_size)
+ break;
+ out = xrealloc(out, outsize);
+ if (!out) {
+ free(tmp);
+ return NULL;
+ }
+ }
+ }
+
outpos++;
val = 0;
for (i = 0; i < RABIN_WINDOW && data < top; i++, data++) {
--
1.4.4.4
* Re: [RFC] Optimize diff-delta.c
2007-05-01 14:49 Martin Koegler
@ 2007-05-01 15:51 ` Johannes Schindelin
2007-05-01 16:05 ` Nicolas Pitre
1 sibling, 0 replies; 8+ messages in thread
From: Johannes Schindelin @ 2007-05-01 15:51 UTC (permalink / raw)
To: Martin Koegler; +Cc: git, Nicolas Pitre
Hi,
On Tue, 1 May 2007, Martin Koegler wrote:
> I tried to speed up the delta generation by searching for a common
> prefix, as my blobs are mostly append only. I tested it with about less
> than 1000 big blobs. The time for finding the deltas decreased from 17
> to 14 minutes cpu time.
The interesting timings, of course, would be of big blobs which are _not_
append-only, as they are more common, at least for me.
Since git.git contains next to no binary blobs, it is not a good test
case.
Ciao,
Dscho
* Re: [RFC] Optimize diff-delta.c
2007-05-01 14:49 Martin Koegler
2007-05-01 15:51 ` Johannes Schindelin
@ 2007-05-01 16:05 ` Nicolas Pitre
1 sibling, 0 replies; 8+ messages in thread
From: Nicolas Pitre @ 2007-05-01 16:05 UTC (permalink / raw)
To: Martin Koegler; +Cc: git
On Tue, 1 May 2007, Martin Koegler wrote:
> As the delta format only supports 32 bit offsets, the uncompressed
> blob size is limited to 4GB.
Right. I think it would be a good idea to extend the delta format as
well to allow for larger offsets in pack v4.
> The delta index has approximately the same size in memory as the
> uncompressed blob ((blob size)/16*(sizeof(index_entry)).
One thing that could be done with really large blobs is to create a
sparser index, i.e. have a larger step than 16. Because the delta match
loop scans backward after a match the sparse index shouldn't affect
compression that much on large blobs and the index could be
significantly smaller.
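The suggestion could look roughly like this (a hypothetical sketch; the
256 MB index budget and the 16-byte per-entry cost are invented for
illustration and are not values from diff-delta.c):

```c
#include <assert.h>

/* Hypothetical self-tuning index step: keep the default step of 16 for
 * small blobs, and double it for large ones until the index fits a
 * memory budget.  Both constants below are illustrative assumptions. */
static unsigned long long index_step(unsigned long long blob_size)
{
	const unsigned long long max_index_bytes = 256ULL << 20; /* assumed budget */
	const unsigned long long entry_size = 16;                /* assumed per-entry cost */
	unsigned long long step = 16;

	/* A step of s yields blob_size/s entries; grow s until they fit. */
	while (blob_size / step * entry_size > max_index_bytes)
		step *= 2;
	return step;
}
```

Because the match loop scans backward from an indexed position, a sparser
index mostly costs a slightly longer backward scan, not lost matches.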
> I tried to speed up the delta generation by searching for a common
> prefix, as my blobs are mostly append only. I tested it with about
> less than 1000 big blobs. The time for finding the deltas decreased
> from 17 to 14 minutes cpu time.
I'm surprised that your patch makes so much of a difference. Normally
the first window should always match in the case you're trying to
optimize and the current code should already perform more or less the
same as your common prefix match does.
Ah, no, actually what your patch does is a pessimisation of the matching
code by not considering other and possibly better matches elsewhere in
the reference buffer whenever there is a match at the beginning of both
buffers. I don't think this is a good idea in general.
What you should try instead if you want to make the process faster is to
lower the threshold used to consider a match sufficiently large to stop
searching. That has the potential for even faster processing as the
"optimization" would then be effective throughout the buffer and not
only at the beginning.
Currently the threshold is implicit and equal to 65536. Please consider
this patch instead of yours for testing:
diff --git a/diff-delta.c b/diff-delta.c
index 9f998d0..755c0a9 100644
--- a/diff-delta.c
+++ b/diff-delta.c
@@ -315,6 +315,9 @@ create_delta(const struct delta_index *index,
/* this is our best match so far */
msize = ref - entry->ptr;
moff = entry->ptr - ref_data;
+ /* a sufficiently large match is good enough */
+ if (msize >= 4096)
+ break;
}
}
You could experiment with that value to determine the best speed vs size
compromise.
Nicolas
* Re: [RFC] Optimize diff-delta.c
@ 2007-05-04 6:40 Martin Koegler
2007-05-04 13:35 ` Nicolas Pitre
0 siblings, 1 reply; 8+ messages in thread
From: Martin Koegler @ 2007-05-04 6:40 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git
On 2007-05-01 16:05:24, Nicolas Pitre wrote:
> > On Tue, 1 May 2007, Martin Koegler wrote:
> Right. I think it would be a good idea to extend the delta format as
> well to allow for larger offsets in pack v4.
Is git://repo.or.cz/git/fastimport.git#sp/pack4 the current version of
pack v4 efforts?
> > The delta index has approximately the same size in memory as the
> > uncompressed blob ((blob size)/16*(sizeof(index_entry)).
>
> One thing that could be done with really large blobs is to create a
> sparser index, i.e. have a larger step than 16. Because the delta match
> loop scans backward after a match the sparse index shouldn't affect
> compression that much on large blobs and the index could be
> significantly smaller.
In the long term, I think that the delta generation code needs to become
tunable.
I would add an init_delta function for reading the configuration
options. In create_delta and create_delta_index, different delta
heuristics can be selected based on the blob size and the
configuration options. As patch_delta is not affected by this, it
should be easy to integrate new strategies.
> > I tried to speed up the delta generation by searching for a common
> > prefix, as my blobs are mostly append only. I tested it with about
> > less than 1000 big blobs. The time for finding the deltas decreased
> > from 17 to 14 minutes cpu time.
>
> I'm surprised that your patch makes so much of a difference. Normally
> the first window should always match in the case you're trying to
> optimize and the current code should already perform more or less the
> same as your common prefix match does.
A block is limited to 64k. If the file is some hundred MB big, it has
to be matched in many blocks.
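For context, the 64k cap comes from the copy opcode in the delta format,
which is why a long match has to be emitted as a loop of blocks. A
sketch of the opcode layout, mirroring the emit sequences visible in the
patches in this thread (note the real decoder additionally treats a size
of zero as 0x10000; that special case is omitted here):

```c
#include <assert.h>

/* Sketch of a delta "copy" opcode: high bit set, low bits flag which
 * offset/size bytes follow (little-endian, zero bytes omitted).
 * Offset uses flags 0x01..0x08, size uses flags 0x10..0x40. */
static unsigned emit_copy(unsigned char *out, unsigned off, unsigned size)
{
	unsigned pos = 1, i = 0x80, shift;

	for (shift = 0; shift < 4; shift++)
		if ((off >> (8 * shift)) & 0xff) {
			out[pos++] = off >> (8 * shift);
			i |= 1 << shift;
		}
	for (shift = 0; shift < 3; shift++)
		if ((size >> (8 * shift)) & 0xff) {
			out[pos++] = size >> (8 * shift);
			i |= 0x10 << shift;
		}
	out[0] = i;
	return pos; /* total bytes written for this opcode */
}
```

For example, copying 0x2000 bytes from offset 0x1234 emits four bytes:
the opcode byte plus two offset bytes and one size byte.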
My patch can process everything except the few last thousand lines by
doing a memcmp.
Additionally, nearly every line starts with the same prefix, which is
longer than 16 bytes. So it's likely that many blocks map to the same
hash value.
> Ah, no, actually what your patch does is a pessimisation of the matching
> code by not considering other and possibly better matches elsewhere in
> the reference buffer whenever there is a match at the beginning of both
> buffers. I don't think this is a good idea in general.
For small files, I agree with you.
> What you should try instead if you want to make the process faster is to
> lower the threshold used to consider a match sufficiently large to stop
> searching. That has the potential for even faster processing as the
> "optimization" would then be effective throughout the buffer and not
> only at the beginning.
>
[...]
>
> You could experiment with that value to determine the best speed vs size
> compromise.
I will do some experiments.
Regards, Martin Kögler
PS: Sorry for breaking threading; you did not CC me.
* Re: [RFC] Optimize diff-delta.c
2007-05-04 6:40 [RFC] Optimize diff-delta.c Martin Koegler
@ 2007-05-04 13:35 ` Nicolas Pitre
0 siblings, 0 replies; 8+ messages in thread
From: Nicolas Pitre @ 2007-05-04 13:35 UTC (permalink / raw)
To: Martin Koegler; +Cc: git
On Fri, 4 May 2007, Martin Koegler wrote:
> On 2007-05-01 16:05:24, Nicolas Pitre wrote:
> > > On Tue, 1 May 2007, Martin Koegler wrote:
> > Right. I think it would be a good idea to extend the delta format as
> > well to allow for larger offsets in pack v4.
>
> Is git://repo.or.cz/git/fastimport.git#sp/pack4 the current version of
> pack v4 efforts?
Yes. It is lagging behind current git though, and not usable yet.
> > > The delta index has approximately the same size in memory as the
> > > uncompressed blob ((blob size)/16*(sizeof(index_entry)).
> >
> > One thing that could be done with really large blobs is to create a
> > sparser index, i.e. have a larger step than 16. Because the delta match
> > loop scans backward after a match the sparse index shouldn't affect
> > compression that much on large blobs and the index could be
> > significantly smaller.
>
> In the long term, I think, that the delta generation code needs to get
> tunable.
No. It should certainly be self-tunable, but there are way too many
config options already, and another one for the inner workings of the
delta algorithm would be a bit too esoteric for most users; they
wouldn't take advantage of it. This thing really has to tune itself.
> > > I tried to speed up the delta generation by searching for a common
> > > prefix, as my blobs are mostly append only. I tested it with about
> > > less than 1000 big blobs. The time for finding the deltas decreased
> > > from 17 to 14 minutes cpu time.
> >
> > I'm surprised that your patch makes so much of a difference. Normally
> > the first window should always match in the case you're trying to
> > optimize and the current code should already perform more or less the
> > same as your common prefix match does.
>
> A block is limited to 64k. If the file has some hundred MBs, it has to
> match many blocks.
Only if the first match is smaller than 64K. If the first match is 64K
in size then the rest of the file is not considered at all.
> My patch can process everything except the few last thousand lines by
> doing a memcmp.
>
> Additionally, nearly every line starts with the same, longer than 16
> byte prefix. So its likely, that many blocks map to the same hash
> value.
The hash index only remembers the lowest of consecutive identical blocks
so repeated blocks are indexed only with the first one. If however you
happen to have many identical blocks interlaced between other blocks
then there is not much that can be done. What the code does in that
case is to trim those hash buckets that get too large by keeping only a
few entries across the reference buffer to avoid a O(n^2) behavior. But
that happens only when your line beginnings are located on the same
block boundary (but with a block size of 16 this is rather likely in the
presence of lots of lines I suppose).
I'll be very interested in the results you get with my suggested patch.
Nicolas
* Re: [RFC] Optimize diff-delta.c
@ 2007-05-13 20:50 Martin Koegler
2007-05-14 16:34 ` Nicolas Pitre
0 siblings, 1 reply; 8+ messages in thread
From: Martin Koegler @ 2007-05-13 20:50 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git, Martin Koegler
---
On Fri, 04 May 2007, Nicolas Pitre wrote:
> On Fri, 4 May 2007, Martin Koegler wrote:
> > On 2007-05-01 16:05:24, Nicolas Pitre wrote:
> > In the long term, I think, that the delta generation code needs to get
> > tunable.
>
> No. It should be self-tunable certainly, but there are way too many
> config options already, and another one for the inner working of the
> delta algorithm would be a bit too esoteric for most users and they
> won't get advantage of it. This thing really has to self tune itself.
I would like that too, but that will probably not be the case for
everybody.
Why does git have options to control the zlib compression?
Why are patches like "Custom compression levels for objects and packs"
(http://www.spinics.net/lists/git/msg30244.html) sent to the mailing
list?
> > > > I tried to speed up the delta generation by searching for a common
> > > > prefix, as my blobs are mostly append only. I tested it with about
> > > > less than 1000 big blobs. The time for finding the deltas decreased
> > > > from 17 to 14 minutes cpu time.
> > >
> > > I'm surprised that your patch makes so much of a difference. Normally
> > > the first window should always match in the case you're trying to
> > > optimize and the current code should already perform more or less the
> > > same as your common prefix match does.
> >
> > A block is limited to 64k. If the file has some hundred MBs, it has to
> > match many blocks.
>
> Only if the first match is smaller than 64K. If the first match is 64K
> in size then the rest of the file is not considered at all.
If I read the code correctly, the current code does a new hash table
search after writing a block.
> > My patch can process everything except the few last thousand lines by
> > doing a memcmp.
> >
> > Additionally, nearly every line starts with the same, longer than 16
> > byte prefix. So its likely, that many blocks map to the same hash
> > value.
>
> The hash index only remembers the lowest of consecutive identical blocks
> so repeated blocks are indexed only with the first one. If however you
> happen to have many identical blocks interlaced between other blocks
> then there is not much that can be done. What the code does in that
> case is to trim those hash buckets that gets too large by keeping only a
> few entries across the reference buffer to avoid a O(n^2) behavior. But
> that happens only when your line beginnings are located on the same
> block boundary (but with a block size of 16 this is rather likely in the
> presence of lots of lines I suppose).
>
> I'll be very interested in the results you get with my suggested patch.
As my original patch does not consider other and possibly better
matches, I rewrote it. The new version matches blocks of
unlimited length (stopping after finding a match >= 64kB). If the block
is bigger than 64kB, it writes blocks of 64kB in a loop.
The patch recomputes the hash of 16 bytes after writing a block, as I
haven't had time to correct the hash update code for blocks < 16
bytes. (By the way, is this code path necessary?)
I did some tests on different machines:
Repacking a test repository with big blobs:
- original version:
Total 6452 (delta 4582), reused 1522 (delta 0)
11752.26user 4256.21system 4:26:52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (576major+1042499851minor)pagefaults 0swaps
=>92 MB pack size
- your patch (stop at 4096 bytes block size):
Total 6452 (delta 4582), reused 1522 (delta 0)
11587.41user 4064.73system 4:20:54elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1042500427minor)pagefaults 0swaps
=>92 MB pack size
- my first patch
Total 6452 (delta 4706), reused 1450 (delta 0)
10316.93user 4220.67system 4:02:20elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1045003394minor)pagefaults 0swaps
=>92 MB pack size
- attached patch
Total 6452 (delta 4581), reused 1522 (delta 0)
11354.38user 5451.60system 4:40:09elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1371504762minor)pagefaults 0swaps
=>75 MB pack size
Repacking git.git repository:
- original version
Total 42893 (delta 29900), reused 12400 (delta 0)
57.32user 1.17system 1:03.73elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (267major+47813minor)pagefaults 0swaps
11538746 bytes pack size
- attached patch
Total 42893 (delta 29899), reused 12400 (delta 0)
57.04user 0.80system 1:00.68elapsed 95%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (7major+28628minor)pagefaults 0swaps
11538869 bytes pack size
Repacking linux kernel repository:
- original version
Total 467313 (delta 376494), reused 78113 (delta 0)
809.08user 39.59system 36:54.70elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (83839major+1257686minor)pagefaults 0swaps
149460017 bytes pack size
- attached patch
Total 467313 (delta 376495), reused 78113 (delta 0)
806.27user 37.81system 35:06.12elapsed 40%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (79755major+1235874minor)pagefaults 0swaps
149336835 bytes pack size
The following patch is only for testing purposes and not cleaned up.
diff-delta.c | 105 +++++++++++++++++++++++++++++++++++++--------------------
1 files changed, 68 insertions(+), 37 deletions(-)
diff --git a/diff-delta.c b/diff-delta.c
index 9f998d0..df9b336 100644
--- a/diff-delta.c
+++ b/diff-delta.c
@@ -305,8 +305,6 @@ create_delta(const struct delta_index *index,
continue;
if (ref_size > top - src)
ref_size = top - src;
- if (ref_size > 0x10000)
- ref_size = 0x10000;
if (ref_size <= msize)
break;
while (ref_size-- && *src++ == *ref)
@@ -315,6 +313,8 @@ create_delta(const struct delta_index *index,
/* this is our best match so far */
msize = ref - entry->ptr;
moff = entry->ptr - ref_data;
+ if (msize >= 0x10000)
+ break;
}
}
@@ -328,30 +328,15 @@ create_delta(const struct delta_index *index,
inscnt = 0;
}
} else {
- unsigned char *op;
-
- if (msize >= RABIN_WINDOW) {
- const unsigned char *sk;
- sk = data + msize - RABIN_WINDOW;
- val = 0;
- for (i = 0; i < RABIN_WINDOW; i++)
- val = ((val << 8) | *sk++) ^ T[val >> RABIN_SHIFT];
- } else {
- const unsigned char *sk = data + 1;
- for (i = 1; i < msize; i++) {
- val ^= U[sk[-RABIN_WINDOW]];
- val = ((val << 8) | *sk++) ^ T[val >> RABIN_SHIFT];
- }
- }
-
+ unsigned int wsize, undo = 0;
+
if (inscnt) {
while (moff && ref_data[moff-1] == data[-1]) {
- if (msize == 0x10000)
- break;
/* we can match one byte back */
msize++;
moff--;
data--;
+ undo++;
outpos--;
if (--inscnt)
continue;
@@ -359,27 +344,72 @@ create_delta(const struct delta_index *index,
inscnt--; /* make it -1 */
break;
}
- out[outpos - inscnt - 1] = inscnt;
+ out[outpos - inscnt - 1] = inscnt;
inscnt = 0;
}
+
+ wsize=msize;
+ while (wsize >= 4)
+ {
+ unsigned char *op;
+ unsigned int boff, bsize;
+ boff = moff;
+ bsize = wsize;
+
+ if (bsize > 0x10000)
+ bsize = 0x10000;
+
+ moff += bsize;
+ wsize -= bsize;
+
+ op = out + outpos++;
+ i = 0x80;
+
+ if (boff & 0xff) { out[outpos++] = boff; i |= 0x01; }
+ boff >>= 8;
+ if (boff & 0xff) { out[outpos++] = boff; i |= 0x02; }
+ boff >>= 8;
+ if (boff & 0xff) { out[outpos++] = boff; i |= 0x04; }
+ boff >>= 8;
+ if (boff & 0xff) { out[outpos++] = boff; i |= 0x08; }
+
+ if (bsize & 0xff) { out[outpos++] = bsize; i |= 0x10; }
+ bsize >>= 8;
+ if (bsize & 0xff) { out[outpos++] = bsize; i |= 0x20; }
+
+ *op = i;
+
+ if (outpos >= outsize - MAX_OP_SIZE) {
+ void *tmp = out;
+ outsize = outsize * 3 / 2;
+ if (max_size && outsize >= max_size)
+ outsize = max_size + MAX_OP_SIZE + 1;
+ if (max_size && outpos > max_size)
+ goto out;
+ out = xrealloc(out, outsize);
+ if (!out) {
+ free(tmp);
+ return NULL;
+ }
+ }
+ }
+
+ msize -= wsize;
+ if (msize - undo >= RABIN_WINDOW || 1) {
+ const unsigned char *sk;
+ sk = data + msize - RABIN_WINDOW;
+ val = 0;
+ for (i = 0; i < RABIN_WINDOW; i++)
+ val = ((val << 8) | *sk++) ^ T[val >> RABIN_SHIFT];
+ } else {
+ const unsigned char *sk = data + 1;
+ for (i = 1; i < msize; i++) {
+ val ^= U[sk[-RABIN_WINDOW + undo]];
+ val = ((val << 8) | *sk++) ^ T[val >> RABIN_SHIFT];
+ }
+ }
data += msize;
- op = out + outpos++;
- i = 0x80;
-
- if (moff & 0xff) { out[outpos++] = moff; i |= 0x01; }
- moff >>= 8;
- if (moff & 0xff) { out[outpos++] = moff; i |= 0x02; }
- moff >>= 8;
- if (moff & 0xff) { out[outpos++] = moff; i |= 0x04; }
- moff >>= 8;
- if (moff & 0xff) { out[outpos++] = moff; i |= 0x08; }
-
- if (msize & 0xff) { out[outpos++] = msize; i |= 0x10; }
- msize >>= 8;
- if (msize & 0xff) { out[outpos++] = msize; i |= 0x20; }
-
- *op = i;
}
if (outpos >= outsize - MAX_OP_SIZE) {
@@ -397,6 +427,7 @@ create_delta(const struct delta_index *index,
}
}
+ out:
if (inscnt)
out[outpos - inscnt - 1] = inscnt;
--
1.4.4.4
* Re: [RFC] Optimize diff-delta.c
2007-05-13 20:50 Martin Koegler
@ 2007-05-14 16:34 ` Nicolas Pitre
0 siblings, 0 replies; 8+ messages in thread
From: Nicolas Pitre @ 2007-05-14 16:34 UTC (permalink / raw)
To: Martin Koegler; +Cc: git
On Sun, 13 May 2007, Martin Koegler wrote:
> ---
> On Fri, 04 May 2007, Nicolas Pitre wrote:
> > On Fri, 4 May 2007, Martin Koegler wrote:
> > > On 2007-05-01 16:05:24, Nicolas Pitre wrote:
> > > In the long term, I think, that the delta generation code needs to get
> > > tunable.
> >
> > No. It should be self-tunable certainly, but there are way too many
> > config options already, and another one for the inner working of the
> > delta algorithm would be a bit too esoteric for most users and they
> > won't get advantage of it. This thing really has to self tune itself.
>
> I would like that too, but that will probably not the case for
> everybody.
>
> Why does git has options to control the zlib compression?
>
> Why are patches like "Custom compression levels for objects and packs"
> (http://www.spinics.net/lists/git/msg30244.html) sent to the mailing
> list?
Because there is no nice ways to auto-tune them given the current zlib
interface.
> > > > > I tried to speed up the delta generation by searching for a common
> > > > > prefix, as my blobs are mostly append only. I tested it with about
> > > > > less than 1000 big blobs. The time for finding the deltas decreased
> > > > > from 17 to 14 minutes cpu time.
> > > >
> > > > I'm surprised that your patch makes so much of a difference. Normally
> > > > the first window should always match in the case you're trying to
> > > > optimize and the current code should already perform more or less the
> > > > same as your common prefix match does.
> > >
> > > A block is limited to 64k. If the file has some hundred MBs, it has to
> > > match many blocks.
> >
> > Only if the first match is smaller than 64K. If the first match is 64K
> > in size then the rest of the file is not considered at all.
>
> If I read the code correctly, the current code does a new hash table
> search after writing a block.
Right.
> As my original patch is not considering other and possibly better
> matches, I rewrote the patch. The new version matches block of
> unlimited length (stopping after finding a match >=64kB). If the block
> it bigger than 64kB, it writes block of 64kB in a loop.
I like this idea a lot.
> The patch recomputes the hash of 16 bytes after writing a block, as I
> haven't had time to correct the hash update code for blocks < 16
> bytes. (By the way, is this code path necessary?)
It certainly helps in those cases where the block match is smaller than
a Rabin window size, although I don't know how much. Sliding the window
over 8 bytes might be approximately the same cost as computing the
hash from scratch over 16 bytes since in the sliding case there are
twice the amount of pointer dereferences. So the test should probably
be "if (msize >= RABIN_WINDOW/2) ..." for the full reinitialization.
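The cost comparison can be illustrated with a generic polynomial rolling
hash (an illustrative stand-in, not git's actual Rabin tables T and U):
sliding removes the outgoing byte and mixes in the incoming one, while
reinitialization simply rescans the window.

```c
#include <assert.h>
#include <stdint.h>

#define WIN 16       /* stand-in for RABIN_WINDOW */
#define BASE 257ULL  /* illustrative multiplier, not git's polynomial */

/* Hash of buf[0..WIN-1], computed from scratch (one mix per byte). */
static uint64_t hash_full(const unsigned char *buf)
{
	uint64_t h = 0;
	int i;
	for (i = 0; i < WIN; i++)
		h = h * BASE + buf[i];
	return h;
}

/* Slide the window one byte: drop buf[0], add buf[WIN].  Needs two
 * dereferences per step, versus one for the from-scratch loop. */
static uint64_t hash_slide(uint64_t h, const unsigned char *buf)
{
	uint64_t pow = 1;
	int i;
	for (i = 0; i < WIN - 1; i++)
		pow *= BASE;   /* BASE^(WIN-1); precomputed in practice */
	return (h - buf[0] * pow) * BASE + buf[WIN];
}
```

Sliding and recomputing agree exactly (modulo 2^64), which is what makes
the "recompute when the match is at least half a window" heuristic a pure
performance trade-off.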
> I did some tests on differenent machines:
>
> Repacking a test repository with big blobs:
>
> - original version:
> Total 6452 (delta 4582), reused 1522 (delta 0)
> 11752.26user 4256.21system 4:26:52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (576major+1042499851minor)pagefaults 0swaps
> =>92 MB pack size
> - your patch (stop at 4096 bytes block size):
> Total 6452 (delta 4582), reused 1522 (delta 0)
> 11587.41user 4064.73system 4:20:54elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+1042500427minor)pagefaults 0swaps
> =>92 MB pack size
> - my first patch
> Total 6452 (delta 4706), reused 1450 (delta 0)
> 10316.93user 4220.67system 4:02:20elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+1045003394minor)pagefaults 0swaps
> =>92 MB pack size
> - attached patch
> Total 6452 (delta 4581), reused 1522 (delta 0)
> 11354.38user 5451.60system 4:40:09elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+1371504762minor)pagefaults 0swaps
> =>75 MB pack size
This is quite weird. I wonder what might cause such a large difference
in pack size.
Your first patch is probably faster due to the use of memcmp() which is
certainly highly optimized, more than the comparison loop we have. It
is unfortunate that there is no library function to find the number of
identical bytes between two buffers. Or is there some?
But the size difference? That has certainly something to do with your
data set since your patch makes no significant difference on the git.git
nor the Linux kernel repos. Would it be possible for me to have a copy
of your repo for further analysis?
Nicolas
* Re: [RFC] Optimize diff-delta.c
@ 2007-05-14 20:43 Martin Koegler
0 siblings, 0 replies; 8+ messages in thread
From: Martin Koegler @ 2007-05-14 20:43 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: git, Martin Koegler
git-pack-objects: cache small deltas between big objects
---
On Mon, 14 May 2007, Nicolas Pitre wrote:
> > I did some tests on different machines:
> >
> > - attached patch
> > Total 6452 (delta 4581), reused 1522 (delta 0)
> > 11354.38user 5451.60system 4:40:09elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> > 0inputs+0outputs (0major+1371504762minor)pagefaults 0swaps
> > =>75 MB pack size
>
> This is quite weird. I wonder what might cause such a large difference
> in pack size.
>
> Your first patch is probably faster due to the use of memcmp() which is
> certainly highly optimized, more than the comparison loop we have. It
> is unfortunate that there is no library function to find the number of
> identical bytes between two buffers. Or is there some?
As far as I know, no.
> But the size difference? That has certainly something to do with your
> data set since your patch makes no significant difference on the git.git
> nor the Linux kernel repos. Would it be possible for me to have a copy
> of your repo for further analysis?
I generate my repository by dumping a database. A script dumps a
few tables with mysqldump into per-table files and commits them.
The dump file of each table uses the one-insert-per-line syntax, so
nearly all lines of a file share a >= 27 byte prefix. A diffstat against
the previous commit typically shows that some hundred lines are added
at the end of each file.
A statistic of dropped hash table entries for the first thousand objects
in create_delta_index shows that up to 41% are dropped, e.g.:
Dropping: 3404731 of 8540915 (39.86 %)
Dropping: 3397330 of 8525886 (39.85 %)
Dropping: 3388813 of 8509648 (39.82 %)
Dropping: 3381134 of 8494317 (39.80 %)
Dropping: 3381128 of 8494294 (39.80 %)
Dropping: 3375786 of 8483589 (39.79 %)
Dropping: 3369725 of 8472206 (39.77 %)
Dropping: 3364377 of 8460707 (39.76 %)
Dropping: 3358120 of 8447813 (39.75 %)
Dropping: 3351482 of 8435015 (39.73 %)
Dropping: 3351481 of 8435007 (39.73 %)
Dropping: 3351478 of 8435000 (39.73 %)
Dropping: 3346193 of 8424055 (39.72 %)
Dropping: 3339952 of 8410503 (39.71 %)
Dropping: 3334324 of 8398253 (39.70 %)
Dropping: 3370085 of 8384362 (40.19 %)
Dropping: 3362979 of 8369151 (40.18 %)
Dropping: 3354905 of 8353432 (40.16 %)
So the current code will probably not always find the best match. As
my last patch can match nearly the whole file after finding a match,
the missing entries will not have a big influence.
As a side effect, my last patch increased the total running time and
minor page faults compared to the original version. I cannot explain
why this happens. The deltifying phase needs to read and uncompress the
same data. The only difference I can imagine is that different objects
are selected as the delta base for the writing phase, which requires
more reading time (e.g. because they are bigger or have a longer delta
chain in their current pack file).
As a large amount of CPU time is spent writing the pack file (reading
all blobs again, applying the delta chain, computing the delta index
and recomputing the delta) and the deltas are small, I tried to cache
the deltas from try_delta if the compared blobs are big (and therefore
the delta operation is expensive):
- my last patch + this patch
Total 6452 (delta 4581), reused 1522 (delta 0)
4176.04user 322.10system 1:14:58elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+83869085minor)pagefaults 0swaps
=> 75MB
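The caching condition in the patch below reduces to a shift comparison:
cache the delta only when it is small relative to the blobs it connects
(a sketch using the same shifts as the try_delta hunk in the diff):

```c
#include <assert.h>

/* Cache iff src/1M + trg/2M > delta/1K, i.e. the delta is small
 * relative to the sizes of the source and target blobs. */
static int should_cache(unsigned long src_size, unsigned long trg_size,
			unsigned long delta_size)
{
	return (src_size >> 20) + (trg_size >> 21) > (delta_size >> 10);
}
```

For two 100 MB blobs this caches deltas up to roughly 150 KB; for small
blobs the left side is zero, so nothing is cached and memory use stays
bounded.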
builtin-pack-objects.c | 35 +++++++++++++++++++++++++----------
1 files changed, 25 insertions(+), 10 deletions(-)
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 966f843..fe19272 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -35,6 +35,7 @@ struct object_entry {
struct object_entry *delta_sibling; /* other deltified objects who
* uses the same base as me
*/
+ void *delta_data; /* cached delta (uncompressed) */
unsigned long delta_size; /* delta data size (uncompressed) */
enum object_type type;
enum object_type in_pack_type; /* could be delta */
@@ -445,17 +446,24 @@ static unsigned long write_object(struct sha1file *f,
}
if (!to_reuse) {
- buf = read_sha1_file(entry->sha1, &type, &size);
- if (!buf)
- die("unable to read %s", sha1_to_hex(entry->sha1));
- if (size != entry->size)
- die("object %s size inconsistency (%lu vs %lu)",
- sha1_to_hex(entry->sha1), size, entry->size);
- if (entry->delta) {
- buf = delta_against(buf, size, entry);
+ if (entry->delta_data) {
+ buf = entry->delta_data;
size = entry->delta_size;
obj_type = (allow_ofs_delta && entry->delta->offset) ?
- OBJ_OFS_DELTA : OBJ_REF_DELTA;
+ OBJ_OFS_DELTA : OBJ_REF_DELTA;
+ } else {
+ buf = read_sha1_file(entry->sha1, &type, &size);
+ if (!buf)
+ die("unable to read %s", sha1_to_hex(entry->sha1));
+ if (size != entry->size)
+ die("object %s size inconsistency (%lu vs %lu)",
+ sha1_to_hex(entry->sha1), size, entry->size);
+ if (entry->delta) {
+ buf = delta_against(buf, size, entry);
+ size = entry->delta_size;
+ obj_type = (allow_ofs_delta && entry->delta->offset) ?
+ OBJ_OFS_DELTA : OBJ_REF_DELTA;
+ }
}
/*
* The object header is a byte of 'type' followed by zero or
@@ -1359,10 +1367,17 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
if (!delta_buf)
return 0;
+ if (trg_entry->delta_data)
+ free (trg_entry->delta_data);
+ trg_entry->delta_data = 0;
trg_entry->delta = src_entry;
trg_entry->delta_size = delta_size;
trg_entry->depth = src_entry->depth + 1;
- free(delta_buf);
+ /* cache delta, if objects are large enough compared to delta size */
+ if ((src_size >> 20) + (trg_size >> 21) > (delta_size >> 10))
+ trg_entry->delta_data = delta_buf;
+ else
+ free(delta_buf);
return 1;
}
--
1.5.1.4.g01b3