From: Nicolas Pitre <nico@cam.org>
To: Junio C Hamano <junkio@cox.net>
Cc: git@vger.kernel.org
Subject: [PATCH] diff-delta: produce optimal pack data
Date: Tue, 21 Feb 2006 20:45:36 -0500 (EST) [thread overview]
Message-ID: <Pine.LNX.4.64.0602212043260.5606@localhost.localdomain> (raw)
Indexing based on adler32 has a match precision based on the block size
(currently 16). Lowering the block size would produce smaller deltas
but the indexing memory and computing cost increases significantly.
For optimal delta result the indexing block size should be 3 with an
increment of 1 (instead of 16 and 16). With such low params the adler32
becomes a clear overhead increasing the time for git-repack by a factor
of 3. And with such small blocks the adler 32 is not very useful as the
whole of the block bits can be used directly.
This patch replaces the adler32 with an open coded index value based on
3 characters directly. This gives sufficient bits for hashing and
allows for optimal delta with reasonable CPU cycles.
The resulting packs are 6% smaller on average. The increase in CPU time
is about 25%. But this cost is now hidden by the delta reuse patch
while the saving on data transfers is always there.
Signed-off-by: Nicolas Pitre <nico@cam.org>
---
diff-delta.c | 77 +++++++++++++++++++++++-----------------------------------
1 files changed, 30 insertions(+), 47 deletions(-)
54aa50fb403981a9292453b76d894a79da9698de
diff --git a/diff-delta.c b/diff-delta.c
index 2ed5984..27f83a0 100644
--- a/diff-delta.c
+++ b/diff-delta.c
@@ -20,21 +20,11 @@
#include <stdlib.h>
#include <string.h>
-#include <zlib.h>
#include "delta.h"
-/* block size: min = 16, max = 64k, power of 2 */
-#define BLK_SIZE 16
-
-#define MIN(a, b) ((a) < (b) ? (a) : (b))
-
-#define GR_PRIME 0x9e370001
-#define HASH(v, shift) (((unsigned int)(v) * GR_PRIME) >> (shift))
-
struct index {
const unsigned char *ptr;
- unsigned int val;
struct index *next;
};
@@ -42,21 +32,21 @@ static struct index ** delta_index(const
unsigned long bufsize,
unsigned int *hash_shift)
{
- unsigned int hsize, hshift, entries, blksize, i;
+ unsigned long hsize;
+ unsigned int hshift, i;
const unsigned char *data;
struct index *entry, **hash;
void *mem;
/* determine index hash size */
- entries = (bufsize + BLK_SIZE - 1) / BLK_SIZE;
- hsize = entries / 4;
- for (i = 4; (1 << i) < hsize && i < 16; i++);
+ hsize = bufsize / 4;
+ for (i = 8; (1 << i) < hsize && i < 16; i++);
hsize = 1 << i;
- hshift = 32 - i;
+ hshift = i - 8;
*hash_shift = hshift;
/* allocate lookup index */
- mem = malloc(hsize * sizeof(*hash) + entries * sizeof(*entry));
+ mem = malloc(hsize * sizeof(*hash) + bufsize * sizeof(*entry));
if (!mem)
return NULL;
hash = mem;
@@ -64,17 +54,12 @@ static struct index ** delta_index(const
memset(hash, 0, hsize * sizeof(*hash));
/* then populate it */
- data = buf + entries * BLK_SIZE - BLK_SIZE;
- blksize = bufsize - (data - buf);
- while (data >= buf) {
- unsigned int val = adler32(0, data, blksize);
- i = HASH(val, hshift);
- entry->ptr = data;
- entry->val = val;
+ data = buf + bufsize - 2;
+ while (data > buf) {
+ entry->ptr = --data;
+ i = data[0] ^ data[1] ^ (data[2] << hshift);
entry->next = hash[i];
hash[i] = entry++;
- blksize = BLK_SIZE;
- data -= BLK_SIZE;
}
return hash;
@@ -141,29 +126,27 @@ void *diff_delta(void *from_buf, unsigne
while (data < top) {
unsigned int moff = 0, msize = 0;
- unsigned int blksize = MIN(top - data, BLK_SIZE);
- unsigned int val = adler32(0, data, blksize);
- i = HASH(val, hash_shift);
- for (entry = hash[i]; entry; entry = entry->next) {
- const unsigned char *ref = entry->ptr;
- const unsigned char *src = data;
- unsigned int ref_size = ref_top - ref;
- if (entry->val != val)
- continue;
- if (ref_size > top - src)
- ref_size = top - src;
- while (ref_size && *src++ == *ref) {
- ref++;
- ref_size--;
- }
- ref_size = ref - entry->ptr;
- if (ref_size > msize) {
- /* this is our best match so far */
- moff = entry->ptr - ref_data;
- msize = ref_size;
- if (msize >= 0x10000) {
- msize = 0x10000;
+ if (data + 2 < top) {
+ i = data[0] ^ data[1] ^ (data[2] << hash_shift);
+ for (entry = hash[i]; entry; entry = entry->next) {
+ const unsigned char *ref = entry->ptr;
+ const unsigned char *src = data;
+ unsigned int ref_size = ref_top - ref;
+ if (ref_size > top - src)
+ ref_size = top - src;
+ if (ref_size > 0x10000)
+ ref_size = 0x10000;
+ if (ref_size <= msize)
break;
+ while (ref_size && *src++ == *ref) {
+ ref++;
+ ref_size--;
+ }
+ ref_size = ref - entry->ptr;
+ if (msize < ref - entry->ptr) {
+ /* this is our best match so far */
+ msize = ref - entry->ptr;
+ moff = entry->ptr - ref_data;
}
}
}
--
1.2.2.g6643-dirty
next reply other threads:[~2006-02-22 1:45 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-02-22 1:45 Nicolas Pitre [this message]
2006-02-24 8:49 ` [PATCH] diff-delta: produce optimal pack data Junio C Hamano
2006-02-24 15:37 ` Nicolas Pitre
2006-02-24 23:55 ` Junio C Hamano
2006-02-24 17:44 ` Carl Baldwin
2006-02-24 17:56 ` Nicolas Pitre
2006-02-24 18:35 ` Carl Baldwin
2006-02-24 18:57 ` Nicolas Pitre
2006-02-24 19:23 ` Carl Baldwin
2006-02-24 20:02 ` Nicolas Pitre
2006-02-24 20:40 ` Carl Baldwin
2006-02-24 21:12 ` Nicolas Pitre
2006-02-24 22:50 ` Carl Baldwin
2006-02-25 3:53 ` Nicolas Pitre
2006-02-24 20:02 ` Linus Torvalds
2006-02-24 20:19 ` Nicolas Pitre
2006-02-24 20:53 ` Junio C Hamano
2006-02-24 21:39 ` Nicolas Pitre
2006-02-24 21:48 ` Nicolas Pitre
2006-02-25 0:45 ` Linus Torvalds
2006-02-25 3:07 ` Nicolas Pitre
2006-02-25 4:05 ` Linus Torvalds
2006-02-25 5:10 ` Nicolas Pitre
2006-02-25 5:35 ` Nicolas Pitre
2006-03-07 23:48 ` [RFH] zlib gurus out there? Junio C Hamano
2006-03-08 0:59 ` Linus Torvalds
2006-03-08 1:22 ` Junio C Hamano
2006-03-08 2:00 ` Linus Torvalds
2006-03-08 9:46 ` Johannes Schindelin
2006-03-08 10:45 ` [PATCH] write_sha1_file(): Perform Z_FULL_FLUSH between header and data Sergey Vlasov
2006-03-08 11:04 ` Junio C Hamano
2006-03-08 14:17 ` Sergey Vlasov
2006-02-25 19:18 ` [PATCH] diff-delta: produce optimal pack data Linus Torvalds
2006-02-24 18:49 ` Carl Baldwin
2006-02-24 19:03 ` Nicolas Pitre
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.64.0602212043260.5606@localhost.localdomain \
--to=nico@cam.org \
--cc=git@vger.kernel.org \
--cc=junkio@cox.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).