* [FYI] pack idx format @ 2006-02-15 8:39 Junio C Hamano 2006-02-15 11:16 ` Johannes Schindelin ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-15 8:39 UTC (permalink / raw) To: git

This is still WIP but if anybody is interested... Once done, it should become Documentation/technical/pack-format.txt. The reason I started doing this is to prototype this one: <7v4q3453qu.fsf@assigned-by-dhcp.cox.net>

-- >8 --

Idx file:

The idx file maps an object name SHA1 to the offset into the corresponding pack file. There is a 'first-level fan-out' table at the beginning, and then the main part of the index follows. This is a table whose entries are sorted by their object name SHA1. The file ends with some trailer information.

The main part is a table of 24-byte entries, and each entry is:

    offset : 4-byte network byte order integer.
    SHA1   : 20-byte object name SHA1.

The data for the named object begins at byte offset "offset" in the corresponding pack file.

Before this main table, at the beginning of the idx file, there is a table of 256 4-byte network byte order integers. This is called "first-level fan-out". N-th entry of this table records the offset into the main index for the first object whose object name SHA1 starts with N+1. fanout[255] points at the end of main index. The offset is expressed in 24-byte units.

Example:

                 idx
          +--------------------------------+
          | fanout[0] = 2                  |-.
          +--------------------------------+ |
          | fanout[1]                      | |
          +--------------------------------+ |
          | fanout[2]                      | |
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
          | fanout[255]                    | |
          +--------------------------------+ |
 main     | offset                         | |
 index    | object name 00XXXXXXXXXXXXXXXX | |
 table    +--------------------------------+ |
          | offset                         | |
          | object name 00XXXXXXXXXXXXXXXX | |
          +--------------------------------+ |
        .-| offset                         |<+
        | | object name 01XXXXXXXXXXXXXXXX |
        | +--------------------------------+
        | | offset                         |
        | | object name 01XXXXXXXXXXXXXXXX |
        | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        | | offset                         |
        | | object name FFXXXXXXXXXXXXXXXX |
        | +--------------------------------+
trailer | | packfile checksum              |
        | +--------------------------------+
        | | idxfile checksum               |
        | +--------------------------------+
        .---------.
                  |
Pack file entry: <+

    packed object header:
        1-byte type (upper 4-bit)
               size0 (lower 4-bit)
        n-byte sizeN (as long as MSB is set, each 7-bit)
        size0..sizeN form 4+7+7+..+7 bit integer, size0 is the
        least significant part.

    packed object data:
        If it is not DELTA, then deflated bytes (the size above
        is the size before compression).
        If it is DELTA, then
          20-byte base object name SHA1 (the size above is the
          size of the delta data that follows).
          delta data, deflated.

^ permalink raw reply [flat|nested] 17+ messages in thread
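To make the layout above concrete, here is a minimal C sketch of reading the first-level fan-out from an idx file. The struct and function names are made up for illustration (they are not from the git source), the byte-order conversion is done by hand, and error handling is kept to a bare minimum:

    #include <stdio.h>
    #include <stdint.h>

    struct idx_entry {                  /* one 24-byte main index entry */
            uint32_t offset;            /* where the object starts in the pack file */
            unsigned char sha1[20];     /* object name */
    };

    /* Read the 256-entry first-level fan-out; fanout[255] is also the
     * total number of entries in the main index table. */
    static int read_fanout(FILE *f, uint32_t fanout[256], uint32_t *nr_objects)
    {
            unsigned char b[4];
            int i;

            for (i = 0; i < 256; i++) {
                    if (fread(b, 4, 1, f) != 1)
                            return -1;  /* truncated idx file */
                    fanout[i] = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
                                ((uint32_t)b[2] << 8) | (uint32_t)b[3];
            }
            *nr_objects = fanout[255];
            return 0;
    }

The main index table (nr_objects entries of 4-byte offset plus 20-byte SHA1) and the two 20-byte trailer checksums follow immediately after these 1024 bytes.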
* Re: [FYI] pack idx format 2006-02-15 8:39 [FYI] pack idx format Junio C Hamano @ 2006-02-15 11:16 ` Johannes Schindelin 2006-02-15 16:46 ` Nicolas Pitre 2006-02-16 1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano 2 siblings, 0 replies; 17+ messages in thread From: Johannes Schindelin @ 2006-02-15 11:16 UTC (permalink / raw) To: Junio C Hamano; +Cc: git Hi, On Wed, 15 Feb 2006, Junio C Hamano wrote: > [...] > > Before this main table, at the beginning of the idx file, there > is a table of 256 4-byte network byte order integers. This is > called "first-level fan-out". N-th entry of this table records > the offset into the main index for the first object whose object > name SHA1 starts with N+1. fanout[255] points at the end of > main index. The offset is expressed in 24-bytes unit. The description of the n-th entry is not completely correct. How about something like this: The fan-out table is sort of an index to the main table: It contains offsets into the main table, and all SHA1s starting with byte b are between fanout[b-1] .. fanout[b]-1 (or if b==0 between 0 .. fanout[0]-1). Comments? Ciao, Dscho ^ permalink raw reply [flat|nested] 17+ messages in thread
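Johannes's phrasing translates directly into the lookup loop. A hedged sketch, reusing the illustrative idx_entry struct from the sketch above (again, these are not git functions, and the entries are assumed to have been read into memory with their offsets already converted to host byte order):

    #include <string.h>
    #include <stdint.h>

    struct idx_entry { uint32_t offset; unsigned char sha1[20]; };

    /* Return the main-index slot of sha1, or -1 if it is not in this pack.
     * All names starting with byte b live in slots fanout[b-1] .. fanout[b]-1
     * (0 .. fanout[0]-1 when b == 0). */
    static long find_object(const uint32_t fanout[256],
                            const struct idx_entry *entries,  /* sorted by SHA1 */
                            const unsigned char *sha1)
    {
            unsigned int b = sha1[0];
            uint32_t lo = b ? fanout[b - 1] : 0;
            uint32_t hi = fanout[b];            /* one past the last candidate */

            while (lo < hi) {
                    uint32_t mi = lo + (hi - lo) / 2;
                    int cmp = memcmp(entries[mi].sha1, sha1, 20);
                    if (!cmp)
                            return mi;          /* entries[mi].offset is the pack offset */
                    if (cmp < 0)
                            lo = mi + 1;
                    else
                            hi = mi;
            }
            return -1;
    }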
* Re: [FYI] pack idx format 2006-02-15 8:39 [FYI] pack idx format Junio C Hamano 2006-02-15 11:16 ` Johannes Schindelin @ 2006-02-15 16:46 ` Nicolas Pitre 2006-02-16 1:58 ` Junio C Hamano 2006-02-16 1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano 2 siblings, 1 reply; 17+ messages in thread From: Nicolas Pitre @ 2006-02-15 16:46 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Wed, 15 Feb 2006, Junio C Hamano wrote: > This is still WIP but if anybody is interested... Once done, it > should become Documentation/technical/pack-format.txt. > [...] > > Pack file entry: <+ > > packed object header: > 1-byte type (upper 4-bit) Actually the type occupies only 3 bits (bits 4 to 6) as bit 7 is the size continuation bit. Nicolas ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [FYI] pack idx format 2006-02-15 16:46 ` Nicolas Pitre @ 2006-02-16 1:58 ` Junio C Hamano 0 siblings, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-16 1:58 UTC (permalink / raw) To: git Nicolas Pitre <nico@cam.org> writes: > On Wed, 15 Feb 2006, Junio C Hamano wrote: > >> This is still WIP but if anybody is interested... Once done, it >> should become Documentation/technical/pack-format.txt. >> > [...] >> >> Pack file entry: <+ >> >> packed object header: >> 1-byte type (upper 4-bit) > > Actually the type occupies only 3 bits (bits 4 to 6) as bit 7 is the > size continuation bit. You are right. ^ permalink raw reply [flat|nested] 17+ messages in thread
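Putting the draft and Nicolas's correction together, the packed object header decodes as follows. This is a hedged sketch with illustrative names and no bounds checking: bit 7 of each byte is the "more size bytes follow" flag, bits 4..6 of the first byte are the type, and the size is accumulated least-significant bits first (4 bits from the first byte, 7 from each continuation byte):

    /* Returns a pointer to the data that follows the header. */
    static const unsigned char *decode_entry_header(const unsigned char *p,
                                                    unsigned int *type,
                                                    unsigned long *size)
    {
            unsigned char c = *p++;
            unsigned long sz = c & 0x0f;         /* size0: low 4 bits */
            int shift = 4;

            *type = (c >> 4) & 7;                /* 3-bit type; bit 7 is not part of it */
            while (c & 0x80) {                   /* MSB set: another size byte follows */
                    c = *p++;
                    sz |= (unsigned long)(c & 0x7f) << shift;
                    shift += 7;
            }
            *size = sz;
            return p;
    }

For a DELTA entry the 20-byte base object name comes next, followed by the deflated delta data; for everything else the deflated object data follows immediately.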
* [PATCH] pack-objects: reuse data from existing pack. 2006-02-15 8:39 [FYI] pack idx format Junio C Hamano 2006-02-15 11:16 ` Johannes Schindelin 2006-02-15 16:46 ` Nicolas Pitre @ 2006-02-16 1:43 ` Junio C Hamano 2006-02-16 1:45 ` [PATCH] packed objects: minor cleanup Junio C Hamano ` (4 more replies) 2 siblings, 5 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-16 1:43 UTC (permalink / raw) To: Linus Torvalds; +Cc: git

When generating a new pack, notice if we already have the wanted object in existing packs. If the object has a deltified representation, and its base object is also what we are going to pack, then reuse the existing deltified representation unconditionally, bypassing all the expensive find_deltas() and try_delta() routines.

Also, when writing out such a deltified or undeltified representation, if matching data already exists in an existing pack, just write it out without uncompressing & recompressing.

Without this patch:

    $ git-rev-list --objects v1.0.0 >RL
    $ time git-pack-objects p <RL

    Generating pack...
    Done counting 12233 objects.
    Packing 12233 objects....................
    60a88b3979df41e22d1edc3967095e897f720192

    real 0m32.751s
    user 0m27.090s
    sys 0m2.750s

With this patch:

    $ git-rev-list --objects v1.0.0 >RL
    $ time ../git.junio/git-pack-objects q <RL

    Generating pack...
    Done counting 12233 objects.
    Packing 12233 objects.....................
    60a88b3979df41e22d1edc3967095e897f720192
    Total 12233, written 12233, reused 12177

    real 0m4.007s
    user 0m3.360s
    sys 0m0.090s

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

* This may depend on one cleanup patch I have not sent out, but I am so excited that I could not help sending this out first. Admittedly this is hot off the press; I have not had enough time to beat on it too hard, but the resulting pack from the above passed unpack-objects, index-pack and verify-pack.
pack-objects.c | 317 ++++++++++++++++++++++++++++++++++++++++++++++---------- pack.h | 4 + sha1_file.c | 19 +++ 3 files changed, 283 insertions(+), 57 deletions(-) 0d574a3c3ec6924118d06ee0487d02d2fbb12646 diff --git a/pack-objects.c b/pack-objects.c index c5a5e61..f2a45a2 100644 --- a/pack-objects.c +++ b/pack-objects.c @@ -9,13 +9,16 @@ static const char pack_usage[] = "git-pa struct object_entry { unsigned char sha1[20]; - unsigned long size; - unsigned long offset; - unsigned int depth; - unsigned int hash; + unsigned long size; /* uncompressed size */ + unsigned long offset; /* offset into the final pack file (nonzero if already written) */ + unsigned int depth; /* delta depth */ + unsigned int hash; /* name hint hash */ enum object_type type; - unsigned long delta_size; - struct object_entry *delta; + unsigned long delta_size; /* delta data size (uncompressed) */ + struct object_entry *delta; /* delta base object */ + struct packed_git *in_pack; /* already in pack */ + enum object_type in_pack_type; /* could be delta */ + unsigned int in_pack_offset; }; static unsigned char object_list_sha1[20]; @@ -29,6 +32,105 @@ static const char *base_name; static unsigned char pack_file_sha1[20]; static int progress = 1; +static int *object_ix = NULL; +static int object_ix_hashsz = 0; + +struct pack_revindex { + struct packed_git *p; + unsigned long *revindex; +} *pack_revindex = NULL; +static int pack_revindex_hashsz = 0; + +static int pack_revindex_ix(struct packed_git *p) +{ + unsigned int ui = (unsigned int) p; + int i; + ui = ui ^ (ui >> 16); + i = (int)(ui % pack_revindex_hashsz); + while (pack_revindex[i].p) { + if (pack_revindex[i].p == p) + return i; + if (++i == pack_revindex_hashsz) + i = 0; + } + return -1 - i; +} + +static void prepare_pack_ix(void) +{ + int num; + struct packed_git *p; + for (num = 0, p = packed_git; p; p = p->next) + num++; + if (!num) + return; + pack_revindex_hashsz = num * 11; + pack_revindex = xcalloc(sizeof(*pack_revindex), pack_revindex_hashsz); + for (p = packed_git; p; p = p->next) { + num = pack_revindex_ix(p); + num = - 1 - num; + pack_revindex[num].p = p; + } + /* revindex elements are lazily initialized */ +} + +static int cmp_offset(const void *a_, const void *b_) +{ + unsigned long a = *(unsigned long *) a_; + unsigned long b = *(unsigned long *) b_; + if (a < b) + return -1; + else if (a == b) + return 0; + else + return 1; +} + +static void prepare_pack_revindex(struct pack_revindex *rix) +{ + struct packed_git *p = rix->p; + int num_ent = num_packed_objects(p); + int i; + void *index = p->index_base + 256; + + rix->revindex = xmalloc(sizeof(unsigned long) * (num_ent + 1)); + for (i = 0; i < num_ent; i++) { + long hl = *((long *)(index + 24 * i)); + rix->revindex[i] = ntohl(hl); + } + rix->revindex[num_ent] = p->pack_size - 20; + qsort(rix->revindex, num_ent, sizeof(unsigned long), cmp_offset); +} + +static unsigned long find_packed_object_size(struct packed_git *p, + unsigned long ofs) +{ + int num; + int lo, hi; + struct pack_revindex *rix; + unsigned long *revindex; + num = pack_revindex_ix(p); + if (num < 0) + die("internal error: pack revindex uninitialized"); + rix = &pack_revindex[num]; + if (!rix->revindex) + prepare_pack_revindex(rix); + revindex = rix->revindex; + lo = 0; + hi = num_packed_objects(p) + 1; + do { + int mi = (lo + hi) / 2; + if (revindex[mi] == ofs) { + return revindex[mi+1] - ofs; + } + else if (ofs < revindex[mi]) + hi = mi; + else + lo = mi + 1; + } while (lo < hi); + die("internal error: pack revindex corrupt"); +} + 
static void *delta_against(void *buf, unsigned long size, struct object_entry *entry) { unsigned long othersize, delta_size; @@ -74,39 +176,59 @@ static int encode_header(enum object_typ return n; } +static int written = 0; +static int reused = 0; + static unsigned long write_object(struct sha1file *f, struct object_entry *entry) { unsigned long size; char type[10]; - void *buf = read_sha1_file(entry->sha1, type, &size); + void *buf; unsigned char header[10]; unsigned hdrlen, datalen; enum object_type obj_type; - if (!buf) - die("unable to read %s", sha1_to_hex(entry->sha1)); - if (size != entry->size) - die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size); - - /* - * The object header is a byte of 'type' followed by zero or - * more bytes of length. For deltas, the 20 bytes of delta sha1 - * follows that. - */ obj_type = entry->type; - if (entry->delta) { - buf = delta_against(buf, size, entry); - size = entry->delta_size; - obj_type = OBJ_DELTA; - } - hdrlen = encode_header(obj_type, size, header); - sha1write(f, header, hdrlen); - if (entry->delta) { - sha1write(f, entry->delta, 20); - hdrlen += 20; + if (!entry->in_pack || + (obj_type != entry->in_pack_type)) { + buf = read_sha1_file(entry->sha1, type, &size); + if (!buf) + die("unable to read %s", sha1_to_hex(entry->sha1)); + if (size != entry->size) + die("object %s size inconsistency (%lu vs %lu)", + sha1_to_hex(entry->sha1), size, entry->size); + if (entry->delta) { + buf = delta_against(buf, size, entry); + size = entry->delta_size; + obj_type = OBJ_DELTA; + } + /* + * The object header is a byte of 'type' followed by zero or + * more bytes of length. For deltas, the 20 bytes of delta + * sha1 follows that. + */ + hdrlen = encode_header(obj_type, size, header); + sha1write(f, header, hdrlen); + + if (entry->delta) { + sha1write(f, entry->delta, 20); + hdrlen += 20; + } + datalen = sha1write_compressed(f, buf, size); + free(buf); + } + else { + struct packed_git *p = entry->in_pack; + use_packed_git(p); + + datalen = find_packed_object_size(p, entry->in_pack_offset); + buf = p->pack_base + entry->in_pack_offset; + sha1write(f, buf, datalen); + unuse_packed_git(p); + hdrlen = 0; /* not really */ + reused++; } - datalen = sha1write_compressed(f, buf, size); - free(buf); + written++; return hdrlen + datalen; } @@ -196,18 +318,21 @@ static int add_object_entry(unsigned cha { unsigned int idx = nr_objects; struct object_entry *entry; - - if (incremental || local) { - struct packed_git *p; - - for (p = packed_git; p; p = p->next) { - struct pack_entry e; - - if (find_pack_entry_one(sha1, &e, p)) { - if (incremental) - return 0; - if (local && !p->pack_local) - return 0; + struct packed_git *p; + unsigned int found_offset; + struct packed_git *found_pack; + + found_pack = NULL; + for (p = packed_git; p; p = p->next) { + struct pack_entry e; + if (find_pack_entry_one(sha1, &e, p)) { + if (incremental) + return 0; + if (local && !p->pack_local) + return 0; + if (!found_pack) { + found_offset = e.offset; + found_pack = e.p; } } } @@ -221,30 +346,99 @@ static int add_object_entry(unsigned cha memset(entry, 0, sizeof(*entry)); memcpy(entry->sha1, sha1, 20); entry->hash = hash; + if (found_pack) { + entry->in_pack = found_pack; + entry->in_pack_offset = found_offset; + } nr_objects = idx+1; return 1; } +static int locate_object_entry_hash(unsigned char *sha1) +{ + int i; + unsigned int ui; + memcpy(&ui, sha1, sizeof(unsigned int)); + i = ui % object_ix_hashsz; + while (0 < object_ix[i]) { + if 
(!memcmp(sha1, objects[object_ix[i]-1].sha1, 20)) + return i; + if (++i == object_ix_hashsz) + i = 0; + } + return -1 - i; +} + +static struct object_entry *locate_object_entry(unsigned char *sha1) +{ + int i = locate_object_entry_hash(sha1); + if (0 <= i) + return &objects[object_ix[i]-1]; + return NULL; +} + static void check_object(struct object_entry *entry) { char type[20]; - if (!sha1_object_info(entry->sha1, type, &entry->size)) { - if (!strcmp(type, "commit")) { - entry->type = OBJ_COMMIT; - } else if (!strcmp(type, "tree")) { - entry->type = OBJ_TREE; - } else if (!strcmp(type, "blob")) { - entry->type = OBJ_BLOB; - } else if (!strcmp(type, "tag")) { - entry->type = OBJ_TAG; - } else - die("unable to pack object %s of type %s", - sha1_to_hex(entry->sha1), type); + if (entry->in_pack) { + /* Check if it is delta, and the base is also an object + * we are going to pack. If so we will reuse the existing + * delta. + */ + unsigned char base[20]; + unsigned long size; + struct object_entry *base_entry; + if (!check_reuse_pack_delta(entry->in_pack, + entry->in_pack_offset, + base, &size, + &entry->in_pack_type) && + (base_entry = locate_object_entry(base))) { + /* we do not know yet, but it does not matter */ + entry->depth = 1; + /* uncompressed size */ + entry->size = entry->delta_size = size; + entry->delta = base_entry; + entry->type = OBJ_DELTA; /* !! */ + return; + } + /* Otherwise we would do the usual */ } - else + + if (sha1_object_info(entry->sha1, type, &entry->size)) die("unable to get type of object %s", sha1_to_hex(entry->sha1)); + + if (!strcmp(type, "commit")) { + entry->type = OBJ_COMMIT; + } else if (!strcmp(type, "tree")) { + entry->type = OBJ_TREE; + } else if (!strcmp(type, "blob")) { + entry->type = OBJ_BLOB; + } else if (!strcmp(type, "tag")) { + entry->type = OBJ_TAG; + } else + die("unable to pack object %s of type %s", + sha1_to_hex(entry->sha1), type); +} + +static void hash_objects(void) +{ + int i; + struct object_entry *oe; + + object_ix_hashsz = nr_objects * 2; + object_ix = xcalloc(sizeof(int), object_ix_hashsz); + for (i = 0, oe = objects; i < nr_objects; i++, oe++) { + int ix = locate_object_entry_hash(oe->sha1); + if (0 <= ix) { + error("the same object '%s' added twice", + sha1_to_hex(oe->sha1)); + continue; + } + ix = -1 - ix; + object_ix[ix] = i + 1; + } } static void get_object_details(void) @@ -252,6 +446,8 @@ static void get_object_details(void) int i; struct object_entry *entry = objects; + hash_objects(); + prepare_pack_ix(); for (i = 0; i < nr_objects; i++) check_object(entry++); } @@ -382,6 +578,11 @@ static void find_deltas(struct object_en eye_candy -= nr_objects / 20; fputc('.', stderr); } + + if (entry->delta) + /* we already know what we need to know */ + continue; + free(n->data); n->entry = entry; n->data = read_sha1_file(entry->sha1, type, &size); @@ -411,10 +612,12 @@ static void find_deltas(struct object_en static void prepare_pack(int window, int depth) { - get_object_details(); - if (progress) fprintf(stderr, "Packing %d objects", nr_objects); + get_object_details(); + if (progress) + fprintf(stderr, "."); + sorted_by_type = create_sorted_list(type_size_sort); if (window && depth) find_deltas(sorted_by_type, window+1, depth); @@ -599,5 +802,7 @@ int main(int argc, char **argv) puts(sha1_to_hex(object_list_sha1)); } } + fprintf(stderr, "Total %d, written %d, reused %d\n", + nr_objects, written, reused); return 0; } diff --git a/pack.h b/pack.h index 9dafa2b..694e0c5 100644 --- a/pack.h +++ b/pack.h @@ -29,5 +29,7 @@ struct 
pack_header { }; extern int verify_pack(struct packed_git *, int); - +extern int check_reuse_pack_delta(struct packed_git *, unsigned long, + unsigned char *, unsigned long *, + enum object_type *); #endif diff --git a/sha1_file.c b/sha1_file.c index 64cf245..0a3a721 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -826,6 +826,25 @@ static unsigned long unpack_object_heade return offset; } +int check_reuse_pack_delta(struct packed_git *p, unsigned long offset, + unsigned char *base, unsigned long *sizep, + enum object_type *kindp) +{ + unsigned long ptr; + int status = -1; + + use_packed_git(p); + ptr = offset; + ptr = unpack_object_header(p, ptr, kindp, sizep); + if (*kindp != OBJ_DELTA) + goto done; + memcpy(base, p->pack_base + ptr, 20); + status = 0; + done: + unuse_packed_git(p); + return status; +} + void packed_object_info_detail(struct pack_entry *e, char *type, unsigned long *size, -- 1.2.0.gcfba7 ^ permalink raw reply related [flat|nested] 17+ messages in thread
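One detail of the patch worth spelling out is find_packed_object_size(): the idx file records only where each object starts, so the patch builds a "reverse index" of all start offsets, sorted, and takes the on-disk length of an entry as the distance to the next start, with pack_size - 20 as a sentinel because the last 20 bytes of a pack file are its SHA1 trailer. A hedged, self-contained sketch of that idea (not the patch's code; the caller is assumed to pass an array with room for nr + 1 offsets):

    #include <stdlib.h>

    static int cmp_offset(const void *a, const void *b)
    {
            unsigned long x = *(const unsigned long *)a;
            unsigned long y = *(const unsigned long *)b;
            return (x > y) - (x < y);
    }

    /* Sort the start offsets once, appending the end-of-data sentinel. */
    static void prepare_revindex(unsigned long *ofs, int nr, unsigned long pack_size)
    {
            ofs[nr] = pack_size - 20;
            qsort(ofs, nr + 1, sizeof(*ofs), cmp_offset);
    }

    /* Binary-search the sorted offsets; the entry's on-disk size is the
     * gap to the next start.  Returns 0 if 'want' is not an entry start. */
    static unsigned long packed_entry_size(const unsigned long *ofs, int nr,
                                           unsigned long want)
    {
            int lo = 0, hi = nr;

            while (lo < hi) {
                    int mi = (lo + hi) / 2;
                    if (ofs[mi] == want)
                            return ofs[mi + 1] - want;
                    if (want < ofs[mi])
                            hi = mi;
                    else
                            lo = mi + 1;
            }
            return 0;
    }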
* [PATCH] packed objects: minor cleanup 2006-02-16 1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano @ 2006-02-16 1:45 ` Junio C Hamano 2006-02-16 3:41 ` [PATCH] pack-objects: reuse data from existing pack Nicolas Pitre ` (3 subsequent siblings) 4 siblings, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-16 1:45 UTC (permalink / raw) To: git; +Cc: Linus Torvalds The delta depth is unsigned. Signed-off-by: Junio C Hamano <junkio@cox.net> --- * And this is the clean-up patch that comes before the "packed representation reuse" patch. cache.h | 2 +- pack-check.c | 4 ++-- sha1_file.c | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) f8f135c9bae74f846a92e1f1f1fea8308802ace5 diff --git a/cache.h b/cache.h index c255421..b5db01f 100644 --- a/cache.h +++ b/cache.h @@ -322,7 +322,7 @@ extern int num_packed_objects(const stru extern int nth_packed_object_sha1(const struct packed_git *, int, unsigned char*); extern int find_pack_entry_one(const unsigned char *, struct pack_entry *, struct packed_git *); extern void *unpack_entry_gently(struct pack_entry *, char *, unsigned long *); -extern void packed_object_info_detail(struct pack_entry *, char *, unsigned long *, unsigned long *, int *, unsigned char *); +extern void packed_object_info_detail(struct pack_entry *, char *, unsigned long *, unsigned long *, unsigned int *, unsigned char *); /* Dumb servers support */ extern int update_server_info(int); diff --git a/pack-check.c b/pack-check.c index 67a7ecd..eca32b6 100644 --- a/pack-check.c +++ b/pack-check.c @@ -84,7 +84,7 @@ static void show_pack_info(struct packed char type[20]; unsigned long size; unsigned long store_size; - int delta_chain_length; + unsigned int delta_chain_length; if (nth_packed_object_sha1(p, i, sha1)) die("internal error pack-check nth-packed-object"); @@ -98,7 +98,7 @@ static void show_pack_info(struct packed if (!delta_chain_length) printf("%-6s %lu %u\n", type, size, e.offset); else - printf("%-6s %lu %u %d %s\n", type, size, e.offset, + printf("%-6s %lu %u %u %s\n", type, size, e.offset, delta_chain_length, sha1_to_hex(base_sha1)); } diff --git a/sha1_file.c b/sha1_file.c index 3d11a9b..64cf245 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -830,7 +830,7 @@ void packed_object_info_detail(struct pa char *type, unsigned long *size, unsigned long *store_size, - int *delta_chain_length, + unsigned int *delta_chain_length, unsigned char *base_sha1) { struct packed_git *p = e->p; @@ -844,7 +844,7 @@ void packed_object_info_detail(struct pa if (kind != OBJ_DELTA) *delta_chain_length = 0; else { - int chain_length = 0; + unsigned int chain_length = 0; memcpy(base_sha1, pack, 20); do { struct pack_entry base_ent; -- 1.2.0.gcfba7 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] pack-objects: reuse data from existing pack. 2006-02-16 1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano 2006-02-16 1:45 ` [PATCH] packed objects: minor cleanup Junio C Hamano @ 2006-02-16 3:41 ` Nicolas Pitre 2006-02-16 3:59 ` Junio C Hamano 2006-02-16 3:55 ` Linus Torvalds ` (2 subsequent siblings) 4 siblings, 1 reply; 17+ messages in thread From: Nicolas Pitre @ 2006-02-16 3:41 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git On Wed, 15 Feb 2006, Junio C Hamano wrote: > When generating a new pack, notice if we have already the wanted > object in existing packs. If the object has a delitified > representation, and its base object is also what we are going to > pack, then reuse the existing deltified representation > unconditionally, bypassing all the expensive find_deltas() and > try_deltas() routines. > > Also, when writing out such deltified representation and > undeltified representation, if a matching data already exists in > an existing pack, just write it out without uncompressing & > recompressing. Great ! > Without this patch: > > $ git-rev-list --objects v1.0.0 >RL > $ time git-pack-objects p <RL > > Generating pack... > Done counting 12233 objects. > Packing 12233 objects.................... > 60a88b3979df41e22d1edc3967095e897f720192 > > real 0m32.751s > user 0m27.090s > sys 0m2.750s > > With this patch: > > $ git-rev-list --objects v1.0.0 >RL > $ time ../git.junio/git-pack-objects q <RL > > Generating pack... > Done counting 12233 objects. > Packing 12233 objects..................... > 60a88b3979df41e22d1edc3967095e897f720192 > Total 12233, written 12233, reused 12177 > > real 0m4.007s > user 0m3.360s > sys 0m0.090s > > Signed-off-by: Junio C Hamano <junkio@cox.net> > > --- > > * This may depend on one cleanup patch I have not sent out, but > I am so excited that I could not help sending this out first. > > Admittedly this is hot off the press, I have not had enough > time to beat this too hard, but the resulting pack from the > above passed unpack-objects, index-pack and verify-pack. In fact, the resulting pack should be identical with or without this patch, shouldn't it? FYI: I have list of patches to produce even smaller (yet still compatible) packs, or less dense ones but with much reduced CPU usage. All depending on a new --speed argument to git-pack-objects. I've been able to produce 15-20% smaller packs with the same depth and window size, but taking twice as much CPU time to produce. Combined with your patch, one could repack the object store with the maximum compression even if it is expensive CPU wise, but any pull will benefit from it afterwards with no additional cost. I only need to find some time to finally clean and re-test those patches... Nicolas ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] pack-objects: reuse data from existing pack. 2006-02-16 3:41 ` [PATCH] pack-objects: reuse data from existing pack Nicolas Pitre @ 2006-02-16 3:59 ` Junio C Hamano 0 siblings, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-16 3:59 UTC (permalink / raw) To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 15 Feb 2006, Junio C Hamano wrote:
>
>> Generating pack...
>> Done counting 12233 objects.
>> Packing 12233 objects.....................
>> 60a88b3979df41e22d1edc3967095e897f720192
>> Total 12233, written 12233, reused 12177
>> ...
> In fact, the resulting pack should be identical with or without this
> patch, shouldn't it?

Not necessarily. The delta-depth limitation is currently lifted when reusing deltified objects (finding out the current depth is not so expensive compared to the uncompress-delta-recompress cycle, but it still costs somewhat, and the objective of this exercise is to gain performance).

Notice the numbers 'written' and 'reused' in the output? The difference in that example comes from the fact that I am omitting some objects from the set of objects to be packed (v1.0.0 is ancient) in a repository where some newer objects are packed. Since packing-delta goes backwards, objects that are in v1.0.0 but not at my tip tend to be deltified in the original pack, but the resulting pack needs to have them expanded -- that is where the difference comes from.

A cleaned-up patch will be in the "pu" branch tonight. I considered putting it in "next", but decided against it. I have not spent enough time really beating on it, although I haven't seen major breakage.

^ permalink raw reply [flat|nested] 17+ messages in thread
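If you want to see the difference Junio describes, the delta structure of a pack can be inspected with verify-pack in verbose mode, which (per the show_pack_info() output format touched by the cleanup patch earlier in this thread) lists each object with its type, size and offset, and for deltified objects the delta chain length and base object name. Something along these lines, assuming the .idx files sit next to the packs as usual:

    $ git-verify-pack -v .git/objects/pack/pack-<name>.idx

Comparing the chain-length column between a pack built with reuse and one built from scratch shows where reused deltas ended up deeper than the requested --depth.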
* Re: [PATCH] pack-objects: reuse data from existing pack. 2006-02-16 1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano 2006-02-16 1:45 ` [PATCH] packed objects: minor cleanup Junio C Hamano 2006-02-16 3:41 ` [PATCH] pack-objects: reuse data from existing pack Nicolas Pitre @ 2006-02-16 3:55 ` Linus Torvalds 2006-02-16 4:07 ` Junio C Hamano 2006-02-16 8:32 ` Andreas Ericsson 2006-02-17 4:30 ` Junio C Hamano 4 siblings, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2006-02-16 3:55 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Wed, 15 Feb 2006, Junio C Hamano wrote: > > When generating a new pack, notice if we have already the wanted > object in existing packs. If the object has a delitified > representation, and its base object is also what we are going to > pack, then reuse the existing deltified representation > unconditionally, bypassing all the expensive find_deltas() and > try_deltas() routines. I bow down before you. No ugly special-case caching, just automatically "the right thing", with very little overhead. It just makes sense. We have a winner. Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] pack-objects: reuse data from existing pack. 2006-02-16 3:55 ` Linus Torvalds @ 2006-02-16 4:07 ` Junio C Hamano 0 siblings, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-16 4:07 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > No ugly special-case caching, just automatically "the right thing", with > very little overhead. > > It just makes sense. Thanks. I threw away two rounds of crap before this one, which were full of ugly special cases ;-). ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] pack-objects: reuse data from existing pack. 2006-02-16 1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano ` (2 preceding siblings ...) 2006-02-16 3:55 ` Linus Torvalds @ 2006-02-16 8:32 ` Andreas Ericsson 2006-02-16 9:13 ` Junio C Hamano 2006-02-17 4:30 ` Junio C Hamano 4 siblings, 1 reply; 17+ messages in thread From: Andreas Ericsson @ 2006-02-16 8:32 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git

Junio C Hamano wrote:
> When generating a new pack, notice if we already have the wanted
> object in existing packs. If the object has a deltified
> representation, and its base object is also what we are going to
> pack, then reuse the existing deltified representation
> unconditionally, bypassing all the expensive find_deltas() and
> try_delta() routines.
>
> Also, when writing out such a deltified or undeltified
> representation, if matching data already exists in an existing
> pack, just write it out without uncompressing & recompressing.
>
> Without this patch:
>
> $ git-rev-list --objects v1.0.0 >RL
> $ time git-pack-objects p <RL
>
> Generating pack...
> Done counting 12233 objects.
> Packing 12233 objects....................
> 60a88b3979df41e22d1edc3967095e897f720192
>
> real 0m32.751s
> user 0m27.090s
> sys 0m2.750s
>
> With this patch:
>
> $ git-rev-list --objects v1.0.0 >RL
> $ time ../git.junio/git-pack-objects q <RL
>
> Generating pack...
> Done counting 12233 objects.
> Packing 12233 objects.....................
> 60a88b3979df41e22d1edc3967095e897f720192
> Total 12233, written 12233, reused 12177
>
> real 0m4.007s
> user 0m3.360s
> sys 0m0.090s

Whoa! Columbus and the egg. Strange no one saw it before. It's so obvious when you shove it under the nose like that. :)

Now that pack-creation went from bizarrely expensive to insanely cheap (well, comparable to "tar czf" anyways), what's BCP for packing a public repository? Always one mega-pack and never worry, or should one still use incremental and sometimes overlapping pack-files?

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] pack-objects: reuse data from existing pack. 2006-02-16 8:32 ` Andreas Ericsson @ 2006-02-16 9:13 ` Junio C Hamano 0 siblings, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-16 9:13 UTC (permalink / raw) To: Andreas Ericsson; +Cc: git

Andreas Ericsson <ae@op5.se> writes:

> Whoa! Columbus and the egg. Strange no one saw it before. It's so
> obvious when you shove it under the nose like that. :)

I wish the pack format were not as dense as it is today. It is very expensive to obtain the uncompressed size of a deltified object. For this reason, a delta newly created (either from a non-delta in an existing pack or from a loose object) by the experimental algorithm is never made against an object that is in deltified form in a pack. Also it incurs nontrivial cost to obtain the size of the in-pack representation of an object (either deltified or not). But the inefficiency in the resulting pack due to these factors may not matter in practice.

I just packed the v2.6.16-rc3 object list (184141 objects) using the current and the experimental code, just for fun. Tonight's one runs just under 1 minute on my Duron 750 (with slow disks I should add). This was done in a repository that has about 1500 loose objects and a single mega pack; the reuse rate of packed data by the experimental algorithm is about 99%. I am hoping the one from the "master" would come back before I finish writing this message ;-).

There are subtleties. For example, in a typical project, files tend to grow rather than shrink on average, and older ones tend to be in packs. If you do packing the traditional way, the largest one (which is typically the latest) is kept as non-delta, and all the smaller ones will be incremental deltas from that, no matter how your packs and loose objects are organized. Usually, you have the latest objects as loose objects in your repository to be packed (either you push from it, or somebody else pulls from you). In other words, as you develop after your last repack, you accumulate loose objects, and they are the ones that typically matter the most.

Let's say you have been changing the same file in every commit (1..N), then you fully packed, and then created another commit (revision N+1) that touches the file. The experimental reuse-packer would:

 - notice blobs from revisions 1..(N-1) are deltified, relative to the rev one greater than each of them. These would be reused.

 - notice the blob from revision N is in the pack but not deltified.

 - notice the blob from revision N+1 is loose.

Then it emits the bigger of N and N+1 as non-delta, and the other one as a delta. 1..(N-1) are output as deltas. If it happens to choose N as plain, it does not have to uncompress and recompress, so the pack process goes very fast, but you end up always having to apply a delta on top of the non-delta N to get to the latest blob in rev N+1, and you typically would want to access the rev N+1 blob more often. In other words, the experimental reuse-packer would create a suboptimal pack in such a case.

Not a big deal, though. We may want an option to disable the optimization for weekly/monthly repacking. git-daemon (or whatever runs pack-objects via upload-pack) should use the default with the optimization, since this is so obviously faster.

> Now that pack-creation went from bizarrely expensive to insanely cheap
> (well, comparable to "tar czf" anyways), what's BCP for packing a
> public repository?
> Always one mega-pack and never worry, or should one
> still use incremental and sometimes overlapping pack-files?

I would say an optimum single mega-pack would work the best, but "repack -a -d" to create the mega-pack _with_ the optimization may have a performance impact for users of the resulting packs.

Oh, the traditional one finally came back after 11 minutes.

^ permalink raw reply [flat|nested] 17+ messages in thread
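On "very expensive to obtain the uncompressed size of a deltified object": the delta payload does carry the base and result sizes at its very start, but that payload is stored deflated in the pack, so you must inflate at least the head of it before you can read them -- which is exactly the cost being avoided. For reference, a hedged sketch of how such a size header decodes (7 bits per byte, least significant bits first, MSB as the continuation flag); this reflects the delta format as generally understood, not code from this thread:

    /* Decode one variable-length size from an already-inflated delta
     * buffer.  A delta starts with two of these: the expected size of
     * the base object, then the size of the result. */
    static unsigned long delta_hdr_size(const unsigned char **p,
                                        const unsigned char *end)
    {
            const unsigned char *data = *p;
            unsigned long size = 0;
            int shift = 0;
            unsigned char c;

            do {
                    if (data >= end)
                            return 0;        /* truncated delta */
                    c = *data++;
                    size |= (unsigned long)(c & 0x7f) << shift;
                    shift += 7;
            } while (c & 0x80);
            *p = data;
            return size;
    }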
* Re: [PATCH] pack-objects: reuse data from existing pack. 2006-02-16 1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano ` (3 preceding siblings ...) 2006-02-16 8:32 ` Andreas Ericsson @ 2006-02-17 4:30 ` Junio C Hamano 2006-02-17 10:37 ` [PATCH] pack-objects: finishing touches Junio C Hamano 2006-02-17 15:39 ` [PATCH] pack-objects: reuse data from existing pack Linus Torvalds 4 siblings, 2 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-17 4:30 UTC (permalink / raw) To: git Junio C Hamano <junkio@cox.net> writes: > When generating a new pack, notice if we have already the wanted > object in existing packs. If the object has a delitified > representation, and its base object is also what we are going to > pack, then reuse the existing deltified representation > unconditionally, bypassing all the expensive find_deltas() and > try_deltas() routines. This one has one nasty data corruption bug, which fortunately I think I have figured out how to fix. Please do not use it for your production repository in the meantime. ^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH] pack-objects: finishing touches. 2006-02-17 4:30 ` Junio C Hamano @ 2006-02-17 10:37 ` Junio C Hamano 2006-02-18 6:50 ` [PATCH] pack-objects: avoid delta chains that are too long Junio C Hamano 2006-02-17 15:39 ` [PATCH] pack-objects: reuse data from existing pack Linus Torvalds 1 sibling, 1 reply; 17+ messages in thread From: Junio C Hamano @ 2006-02-17 10:37 UTC (permalink / raw) To: git Junio C Hamano <junkio@cox.net> writes: > This one has one nasty data corruption bug, which fortunately I > think I have figured out how to fix. Please do not use it for > your production repository in the meantime. It turns out that there wasn't a data corruption bug, but it can create a pack with delta chains whose length exceeds twice the specified depth. The data corruption bug was on my unpublished WIP and killed even before it hit "pu" ;-). This addresses most of the remaining issues and has been merged to "next" tonight. -- >8 -- This introduces --no-reuse-delta option to disable reusing of existing delta, which is a large part of the optimization introduced by this series. This may become necessary if repeated repacking makes delta chain too long. With this, the output of the command becomes identical to that of the older implementation. It still allows reusing non-deltified representations; there is no point uncompressing and recompressing the whole text. It also adds a couple more statistics output, while squelching it under -q flag, which the last round forgot to do. One remaining minor issue when --no-reuse-delta option is not used is that it can create a delta chain that is deeper than specified. A<--B<--C<--D E F G Suppose we have a delta chain A to D (A is stored in full either in a pack or as a loose object. B is depth1 delta relative to A, C is depth2 delta relative to B...) with loose objects E, F, G. And we are going to pack them. B, C and D are left as delta against A, B and C respectively. So A, E, F, and G are examined, and let's say we decided to keep E expanded, and store the rest as deltas like this: E<--F<--G<--A Oops. We ended up making D a bit too deep, didn't we? B, C and D form a chain on top of A! This is because we did not know what the final depth of A would be when checking objects and deciding to keep the existing delta. To prevent this from happening, we could say that A should be kept expanded. But how would we tell that, cheaply? To do this most precisely, after check_object() runs, each object that is used as the base object of some existing delta needs to be marked with the maximum depth of the objects we decided to keep deltified (in this case, D is depth 3 relative to A, so if no other delta chain that is longer than 3 based on A exists, mark A with 3). Then when attempting to deltify A, we would take that number into account to see if the final delta chain that leads to D becomes too deep. However, this is a bit cumbersome to compute, so we would cheat and reduce the maximum depth for A arbitrarily to depth/4 in this implementation. 
Signed-off-by: Junio C Hamano <junkio@cox.net> --- Documentation/git-pack-objects.txt | 21 +++++++ pack-objects.c | 102 +++++++++++++++++++++++++----------- 2 files changed, 91 insertions(+), 32 deletions(-) 98885061623676a37575e59d7f7aff64072e3300 diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt index 2d67d39..4cb2e83 100644 --- a/Documentation/git-pack-objects.txt +++ b/Documentation/git-pack-objects.txt @@ -8,7 +8,10 @@ git-pack-objects - Create a packed archi SYNOPSIS -------- -'git-pack-objects' [--non-empty] [--local] [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list +[verse] +'git-pack-objects' [-q] [--no-reuse-delta] [--non-empty] + [--local] [--incremental] [--window=N] [--depth=N] + {--stdout | base-name} < object-list DESCRIPTION @@ -32,6 +35,10 @@ Placing both in the pack/ subdirectory o any of the directories on $GIT_ALTERNATE_OBJECT_DIRECTORIES) enables git to read from such an archive. +In a packed archive, an object is either stored as a compressed +whole, or as a difference from some other object. The latter is +often called a delta. + OPTIONS ------- @@ -74,6 +81,18 @@ base-name:: Only create a packed archive if it would contain at least one object. +-q:: + This flag makes the command not to report its progress + on the standard error stream. + +--no-reuse-delta:: + When creating a packed archive in a repository that + has existing packs, the command reuses existing deltas. + This sometimes results in a slightly suboptimal pack. + This flag tells the command not to reuse existing deltas + but compute them from scratch. + + Author ------ Written by Linus Torvalds <torvalds@osdl.org> diff --git a/pack-objects.c b/pack-objects.c index 70fb2af..38e1c99 100644 --- a/pack-objects.c +++ b/pack-objects.c @@ -5,7 +5,7 @@ #include "csum-file.h" #include <sys/time.h> -static const char pack_usage[] = "git-pack-objects [-q] [--non-empty] [--local] [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list"; +static const char pack_usage[] = "git-pack-objects [-q] [--no-reuse-delta] [--non-empty] [--local] [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list"; struct object_entry { unsigned char sha1[20]; @@ -14,10 +14,11 @@ struct object_entry { unsigned int depth; /* delta depth */ unsigned int hash; /* name hint hash */ enum object_type type; + unsigned char edge; /* reused delta chain points at this entry. */ + enum object_type in_pack_type; /* could be delta */ unsigned long delta_size; /* delta data size (uncompressed) */ struct object_entry *delta; /* delta base object */ struct packed_git *in_pack; /* already in pack */ - enum object_type in_pack_type; /* could be delta */ unsigned int in_pack_offset; }; @@ -36,6 +37,7 @@ struct object_entry { static unsigned char object_list_sha1[20]; static int non_empty = 0; +static int no_reuse_delta = 0; static int local = 0; static int incremental = 0; static struct object_entry **sorted_by_sha, **sorted_by_type; @@ -75,7 +77,9 @@ static int pack_revindex_hashsz = 0; * stats */ static int written = 0; +static int written_delta = 0; static int reused = 0; +static int reused_delta = 0; static int pack_revindex_ix(struct packed_git *p) { @@ -227,10 +231,23 @@ static unsigned long write_object(struct unsigned char header[10]; unsigned hdrlen, datalen; enum object_type obj_type; + int to_reuse = 0; obj_type = entry->type; - if (!entry->in_pack || - (obj_type != entry->in_pack_type)) { + if (! 
entry->in_pack) + to_reuse = 0; /* can't reuse what we don't have */ + else if (obj_type == OBJ_DELTA) + to_reuse = 1; /* check_object() decided it for us */ + else if (obj_type != entry->in_pack_type) + to_reuse = 0; /* pack has delta which is unusable */ + else if (entry->delta) + to_reuse = 0; /* we want to pack afresh */ + else + to_reuse = 1; /* we have it in-pack undeltified, + * and we do not need to deltify it. + */ + + if (! to_reuse) { buf = read_sha1_file(entry->sha1, type, &size); if (!buf) die("unable to read %s", sha1_to_hex(entry->sha1)); @@ -266,8 +283,12 @@ static unsigned long write_object(struct sha1write(f, buf, datalen); unuse_packed_git(p); hdrlen = 0; /* not really */ + if (obj_type == OBJ_DELTA) + reused_delta++; reused++; } + if (obj_type == OBJ_DELTA) + written_delta++; written++; return hdrlen + datalen; } @@ -294,7 +315,6 @@ static void write_pack_file(void) int i; struct sha1file *f; unsigned long offset; - unsigned long mb; struct pack_header hdr; if (!base_name) @@ -357,10 +377,9 @@ static int add_object_entry(unsigned cha unsigned int idx = nr_objects; struct object_entry *entry; struct packed_git *p; - unsigned int found_offset; - struct packed_git *found_pack; + unsigned int found_offset = 0; + struct packed_git *found_pack = NULL; - found_pack = NULL; for (p = packed_git; p; p = p->next) { struct pack_entry e; if (find_pack_entry_one(sha1, &e, p)) { @@ -420,32 +439,39 @@ static void check_object(struct object_e char type[20]; if (entry->in_pack) { + unsigned char base[20]; + unsigned long size; + struct object_entry *base_entry; + + /* We want in_pack_type even if we do not reuse delta. + * There is no point not reusing non-delta representations. + */ + check_reuse_pack_delta(entry->in_pack, + entry->in_pack_offset, + base, &size, + &entry->in_pack_type); + /* Check if it is delta, and the base is also an object * we are going to pack. If so we will reuse the existing * delta. */ - unsigned char base[20]; - unsigned long size; - struct object_entry *base_entry; - if (!check_reuse_pack_delta(entry->in_pack, - entry->in_pack_offset, - base, &size, - &entry->in_pack_type) && + if (!no_reuse_delta && + entry->in_pack_type == OBJ_DELTA && (base_entry = locate_object_entry(base))) { - /* We do not know depth at this point, but it - * does not matter. Getting delta_chain_length - * with packed_object_info_detail() is not so - * expensive, so we could do that later if we - * wanted to. Calling sha1_object_info to get - * the true size (and later an uncompressed - * representation) of deeply deltified object - * is quite expensive. + + /* Depth value does not matter - find_deltas() + * will never consider reused delta as the + * base object to deltify other objects + * against, in order to avoid circular deltas. */ - entry->depth = 1; - /* uncompressed size */ + + /* uncompressed size of the delta data */ entry->size = entry->delta_size = size; entry->delta = base_entry; entry->type = OBJ_DELTA; + + base_entry->edge = 1; + return; } /* Otherwise we would do the usual */ @@ -568,6 +594,13 @@ static int try_delta(struct unpacked *cu if (cur_entry->type != old_entry->type) return -1; + /* If the current object is at edge, take the depth the objects + * that depend on the current object into account -- otherwise + * they would become too deep. 
+ */ + if (cur_entry->edge) + max_depth /= 4; + size = cur_entry->size; if (size < 50) return -1; @@ -627,7 +660,7 @@ static void find_deltas(struct object_en if (entry->delta) /* This happens if we decided to reuse existing - * delta from a pack. + * delta from a pack. "!no_reuse_delta &&" is implied. */ continue; @@ -636,6 +669,7 @@ static void find_deltas(struct object_en n->data = read_sha1_file(entry->sha1, type, &size); if (size != entry->size) die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size); + j = window; while (--j > 0) { unsigned int other_idx = idx + j; @@ -664,7 +698,7 @@ static void prepare_pack(int window, int fprintf(stderr, "Packing %d objects", nr_objects); get_object_details(); if (progress) - fprintf(stderr, "."); + fputc('.', stderr); sorted_by_type = create_sorted_list(type_size_sort); if (window && depth) @@ -694,8 +728,9 @@ static int reuse_cached_pack(unsigned ch } } - fprintf(stderr, "Reusing %d objects pack %s\n", nr_objects, - sha1_to_hex(sha1)); + if (progress) + fprintf(stderr, "Reusing %d objects pack %s\n", nr_objects, + sha1_to_hex(sha1)); if (pack_to_stdout) { if (copy_fd(ifd, 1)) @@ -775,6 +810,10 @@ int main(int argc, char **argv) progress = 0; continue; } + if (!strcmp("--no-reuse-delta", arg)) { + no_reuse_delta = 1; + continue; + } if (!strcmp("--stdout", arg)) { pack_to_stdout = 1; continue; @@ -850,7 +889,8 @@ int main(int argc, char **argv) puts(sha1_to_hex(object_list_sha1)); } } - fprintf(stderr, "Total %d, written %d, reused %d\n", - nr_objects, written, reused); + if (progress) + fprintf(stderr, "Total %d, written %d (delta %d), reused %d (delta %d)\n", + nr_objects, written, written_delta, reused, reused_delta); return 0; } -- 1.2.1.g91016 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH] pack-objects: avoid delta chains that are too long. 2006-02-17 10:37 ` [PATCH] pack-objects: finishing touches Junio C Hamano @ 2006-02-18 6:50 ` Junio C Hamano 0 siblings, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-18 6:50 UTC (permalink / raw) To: git This tries to rework the solution for the excess delta chain problem. An earlier commit worked it around ``cheaply'', but repeated repacking risks unbound growth of delta chains. This version counts the length of delta chain we are reusing from the existing pack, and makes sure a base object that has sufficiently long delta chain does not get deltified. Signed-off-by: Junio C Hamano <junkio@cox.net> --- * This solves the problem by doing exactly what the previous commit log suggested. We limit the chain depth correctly, to save the user of the resulting pack from suffering runtime overhead. Will be in "next" tonight. pack-objects.c | 43 +++++++++++++++++++++++++++++++++++-------- 1 files changed, 35 insertions(+), 8 deletions(-) e4c9327a77bd59e85d4b17a612e78977d68773ee diff --git a/pack-objects.c b/pack-objects.c index 38e1c99..0c9f4c9 100644 --- a/pack-objects.c +++ b/pack-objects.c @@ -10,16 +10,22 @@ static const char pack_usage[] = "git-pa struct object_entry { unsigned char sha1[20]; unsigned long size; /* uncompressed size */ - unsigned long offset; /* offset into the final pack file (nonzero if already written) */ + unsigned long offset; /* offset into the final pack file; + * nonzero if already written. + */ unsigned int depth; /* delta depth */ + unsigned int delta_limit; /* base adjustment for in-pack delta */ unsigned int hash; /* name hint hash */ enum object_type type; - unsigned char edge; /* reused delta chain points at this entry. */ enum object_type in_pack_type; /* could be delta */ unsigned long delta_size; /* delta data size (uncompressed) */ struct object_entry *delta; /* delta base object */ struct packed_git *in_pack; /* already in pack */ unsigned int in_pack_offset; + struct object_entry *delta_child; /* delitified objects who bases me */ + struct object_entry *delta_sibling; /* other deltified objects who + * uses the same base as me + */ }; /* @@ -470,7 +476,8 @@ static void check_object(struct object_e entry->delta = base_entry; entry->type = OBJ_DELTA; - base_entry->edge = 1; + entry->delta_sibling = base_entry->delta_child; + base_entry->delta_child = entry; return; } @@ -513,15 +520,32 @@ static void hash_objects(void) } } +static unsigned int check_delta_limit(struct object_entry *me, unsigned int n) +{ + struct object_entry *child = me->delta_child; + unsigned int m = n; + while (child) { + unsigned int c = check_delta_limit(child, n + 1); + if (m < c) + m = c; + child = child->delta_sibling; + } + return m; +} + static void get_object_details(void) { int i; - struct object_entry *entry = objects; + struct object_entry *entry; hash_objects(); prepare_pack_ix(); - for (i = 0; i < nr_objects; i++) - check_object(entry++); + for (i = 0, entry = objects; i < nr_objects; i++, entry++) + check_object(entry); + for (i = 0, entry = objects; i < nr_objects; i++, entry++) + if (!entry->delta && entry->delta_child) + entry->delta_limit = + check_delta_limit(entry, 1); } typedef int (*entry_sort_t)(const struct object_entry *, const struct object_entry *); @@ -598,8 +622,11 @@ static int try_delta(struct unpacked *cu * that depend on the current object into account -- otherwise * they would become too deep. 
*/ - if (cur_entry->edge) - max_depth /= 4; + if (cur_entry->delta_child) { + if (max_depth <= cur_entry->delta_limit) + return 0; + max_depth -= cur_entry->delta_limit; + } size = cur_entry->size; if (size < 50) -- 1.2.1.g9b132 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] pack-objects: reuse data from existing pack. 2006-02-17 4:30 ` Junio C Hamano 2006-02-17 10:37 ` [PATCH] pack-objects: finishing touches Junio C Hamano @ 2006-02-17 15:39 ` Linus Torvalds 2006-02-17 18:18 ` Junio C Hamano 1 sibling, 1 reply; 17+ messages in thread From: Linus Torvalds @ 2006-02-17 15:39 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Thu, 16 Feb 2006, Junio C Hamano wrote: > > This one has one nasty data corruption bug, which fortunately I > think I have figured out how to fix. Circular deltas? What else can go wrong? Linus ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] pack-objects: reuse data from existing pack. 2006-02-17 15:39 ` [PATCH] pack-objects: reuse data from existing pack Linus Torvalds @ 2006-02-17 18:18 ` Junio C Hamano 0 siblings, 0 replies; 17+ messages in thread From: Junio C Hamano @ 2006-02-17 18:18 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > On Thu, 16 Feb 2006, Junio C Hamano wrote: >> >> This one has one nasty data corruption bug, which fortunately I >> think I have figured out how to fix. > > Circular deltas? What else can go wrong? Circular deltas are prevented by not using deltified objects check_object() decides to keep in find_deltas(), so I do not think it is an issue. ^ permalink raw reply [flat|nested] 17+ messages in thread
Thread overview: 17+ messages

2006-02-15  8:39 [FYI] pack idx format Junio C Hamano
2006-02-15 11:16 ` Johannes Schindelin
2006-02-15 16:46 ` Nicolas Pitre
2006-02-16  1:58 ` Junio C Hamano
2006-02-16  1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano
2006-02-16  1:45 ` [PATCH] packed objects: minor cleanup Junio C Hamano
2006-02-16  3:41 ` [PATCH] pack-objects: reuse data from existing pack Nicolas Pitre
2006-02-16  3:59 ` Junio C Hamano
2006-02-16  3:55 ` Linus Torvalds
2006-02-16  4:07 ` Junio C Hamano
2006-02-16  8:32 ` Andreas Ericsson
2006-02-16  9:13 ` Junio C Hamano
2006-02-17  4:30 ` Junio C Hamano
2006-02-17 10:37 ` [PATCH] pack-objects: finishing touches Junio C Hamano
2006-02-18  6:50 ` [PATCH] pack-objects: avoid delta chains that are too long Junio C Hamano
2006-02-17 15:39 ` [PATCH] pack-objects: reuse data from existing pack Linus Torvalds
2006-02-17 18:18 ` Junio C Hamano