git.vger.kernel.org archive mirror
* [FYI] pack idx format
@ 2006-02-15  8:39 Junio C Hamano
  2006-02-15 11:16 ` Johannes Schindelin
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-15  8:39 UTC (permalink / raw)
  To: git

This is still WIP but if anybody is interested...  Once done, it
should become Documentation/technical/pack-format.txt.

The reason I started doing this is to prototype this one:

	<7v4q3453qu.fsf@assigned-by-dhcp.cox.net>

-- >8 --

Idx file:

The idx file is to map object name SHA1 to offset into the
corresponding pack file.  There is the 'first-level fan-out'
table at the beginning, and then the main part of the index
follows.  This is a table whose entries are sorted by their
object name SHA1.  The file ends with some trailer information.

The main part is a table of 24-byte entries, and each entry is:

	offset : 4-byte network byte order integer.
	SHA1   : 20-byte object name SHA1.

The data for the named object begins at byte offset "offset" in
the corresponding pack file.

Before this main table, at the beginning of the idx file, there
is a table of 256 4-byte network byte order integers.  This is
called the "first-level fan-out".  The N-th entry of this table
records the offset into the main index for the first object whose
object name SHA1 starts with N+1.  fanout[255] points at the end
of the main index.  The offsets are expressed in units of 24-byte
entries.

Example:

	idx
	    +--------------------------------+
	    | fanout[0] = 2                  |-.
	    +--------------------------------+ |
	    | fanout[1]                      | |
	    +--------------------------------+ |
	    | fanout[2]                      | |
	    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
	    | fanout[255]                    | |
	    +--------------------------------+ |
main	    | offset                         | |
index	    | object name 00XXXXXXXXXXXXXXXX | |
table	    +--------------------------------+ | 
	    | offset                         | |
	    | object name 00XXXXXXXXXXXXXXXX | |
	    +--------------------------------+ |
	  .-| offset                         |<+
	  | | object name 01XXXXXXXXXXXXXXXX |
	  | +--------------------------------+
	  | | offset                         |
	  | | object name 01XXXXXXXXXXXXXXXX |
	  | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	  | | offset                         |
	  | | object name FFXXXXXXXXXXXXXXXX |
	  | +--------------------------------+
trailer	  | | packfile checksum              |
	  | +--------------------------------+
	  | | idxfile checksum               |
	  | +--------------------------------+
          .-------.      
                  |
Pack file entry: <+

     packed object header:
	1-byte type (upper 4-bit)
	       size0 (lower 4-bit) 
        n-byte sizeN (as long as MSB is set, each 7-bit)
		size0..sizeN form a 4+7+7+..+7 bit integer; size0
		is the least significant part, and each later byte
		supplies more significant bits.
     packed object data:
        If it is not DELTA, then deflated bytes (the size above
		is the size before compression).
	If it is DELTA, then
	  20-byte base object name SHA1 (the size above is the
	  	size of the delta data that follows).
          delta data, deflated.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [FYI] pack idx format
  2006-02-15  8:39 [FYI] pack idx format Junio C Hamano
@ 2006-02-15 11:16 ` Johannes Schindelin
  2006-02-15 16:46 ` Nicolas Pitre
  2006-02-16  1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano
  2 siblings, 0 replies; 17+ messages in thread
From: Johannes Schindelin @ 2006-02-15 11:16 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Hi,

On Wed, 15 Feb 2006, Junio C Hamano wrote:

> [...]
> 
> Before this main table, at the beginning of the idx file, there
> is a table of 256 4-byte network byte order integers.  This is
> called the "first-level fan-out".  The N-th entry of this table
> records the offset into the main index for the first object whose
> object name SHA1 starts with N+1.  fanout[255] points at the end
> of the main index.  The offsets are expressed in units of 24-byte
> entries.

The description of the n-th entry is not completely correct. How about 
something like this:

The fan-out table is sort of an index to the main table: It contains 
offsets into the main table, and all SHA1s starting with byte b are 
between fanout[b-1] .. fanout[b]-1 (or if b==0 between 0 .. fanout[0]-1).

Comments?

Ciao,
Dscho


* Re: [FYI] pack idx format
  2006-02-15  8:39 [FYI] pack idx format Junio C Hamano
  2006-02-15 11:16 ` Johannes Schindelin
@ 2006-02-15 16:46 ` Nicolas Pitre
  2006-02-16  1:58   ` Junio C Hamano
  2006-02-16  1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano
  2 siblings, 1 reply; 17+ messages in thread
From: Nicolas Pitre @ 2006-02-15 16:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Wed, 15 Feb 2006, Junio C Hamano wrote:

> This is still WIP but if anybody is interested...  Once done, it
> should become Documentation/technical/pack-format.txt.
> 
[...]
> 
> Pack file entry: <+
> 
>      packed object header:
> 	1-byte type (upper 4-bit)

Actually the type occupies only 3 bits (bits 4 to 6) as bit 7 is the 
size continuation bit.


Nicolas


* [PATCH] pack-objects: reuse data from existing pack.
  2006-02-15  8:39 [FYI] pack idx format Junio C Hamano
  2006-02-15 11:16 ` Johannes Schindelin
  2006-02-15 16:46 ` Nicolas Pitre
@ 2006-02-16  1:43 ` Junio C Hamano
  2006-02-16  1:45   ` [PATCH] packed objects: minor cleanup Junio C Hamano
                     ` (4 more replies)
  2 siblings, 5 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-16  1:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

When generating a new pack, notice if we already have the wanted
object in existing packs.  If the object has a deltified
representation, and its base object is also what we are going to
pack, then reuse the existing deltified representation
unconditionally, bypassing all the expensive find_deltas() and
try_deltas() routines.

Also, when writing out such a deltified or undeltified
representation, if matching data already exists in an existing
pack, just write it out without uncompressing and recompressing.

Without this patch:

    $ git-rev-list --objects v1.0.0 >RL
    $ time git-pack-objects p <RL

    Generating pack...
    Done counting 12233 objects.
    Packing 12233 objects....................
    60a88b3979df41e22d1edc3967095e897f720192

    real    0m32.751s
    user    0m27.090s
    sys     0m2.750s

With this patch:

    $ git-rev-list --objects v1.0.0 >RL
    $ time ../git.junio/git-pack-objects q <RL

    Generating pack...
    Done counting 12233 objects.
    Packing 12233 objects.....................
    60a88b3979df41e22d1edc3967095e897f720192
    Total 12233, written 12233, reused 12177

    real    0m4.007s
    user    0m3.360s
    sys     0m0.090s

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 * This may depend on one cleanup patch I have not sent out, but
   I am so excited that I could not help sending this out first.

   Admittedly this is hot off the press, I have not had enough
   time to beat this too hard, but the resulting pack from the
   above passed unpack-objects, index-pack and verify-pack.

 pack-objects.c |  317 ++++++++++++++++++++++++++++++++++++++++++++++----------
 pack.h         |    4 +
 sha1_file.c    |   19 +++
 3 files changed, 283 insertions(+), 57 deletions(-)

0d574a3c3ec6924118d06ee0487d02d2fbb12646
diff --git a/pack-objects.c b/pack-objects.c
index c5a5e61..f2a45a2 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -9,13 +9,16 @@ static const char pack_usage[] = "git-pa
 
 struct object_entry {
 	unsigned char sha1[20];
-	unsigned long size;
-	unsigned long offset;
-	unsigned int depth;
-	unsigned int hash;
+	unsigned long size;	/* uncompressed size */
+	unsigned long offset;	/* offset into the final pack file (nonzero if already written) */
+	unsigned int depth;	/* delta depth */
+	unsigned int hash;	/* name hint hash */
 	enum object_type type;
-	unsigned long delta_size;
-	struct object_entry *delta;
+	unsigned long delta_size;	/* delta data size (uncompressed) */
+	struct object_entry *delta;	/* delta base object */
+	struct packed_git *in_pack; 	/* already in pack */
+	enum object_type in_pack_type;	/* could be delta */
+	unsigned int in_pack_offset;
 };
 
 static unsigned char object_list_sha1[20];
@@ -29,6 +32,105 @@ static const char *base_name;
 static unsigned char pack_file_sha1[20];
 static int progress = 1;
 
+static int *object_ix = NULL;
+static int object_ix_hashsz = 0;
+
+struct pack_revindex {
+	struct packed_git *p;
+	unsigned long *revindex;
+} *pack_revindex = NULL;
+static int pack_revindex_hashsz = 0;
+
+static int pack_revindex_ix(struct packed_git *p)
+{
+	unsigned int ui = (unsigned int) p;
+	int i;
+	ui = ui ^ (ui >> 16);
+	i = (int)(ui % pack_revindex_hashsz);
+	while (pack_revindex[i].p) {
+		if (pack_revindex[i].p == p)
+			return i;
+		if (++i == pack_revindex_hashsz)
+			i = 0;
+	}
+	return -1 - i;
+}
+
+static void prepare_pack_ix(void)
+{
+	int num;
+	struct packed_git *p;
+	for (num = 0, p = packed_git; p; p = p->next)
+		num++;
+	if (!num)
+		return;
+	pack_revindex_hashsz = num * 11;
+	pack_revindex = xcalloc(sizeof(*pack_revindex), pack_revindex_hashsz);
+	for (p = packed_git; p; p = p->next) {
+		num = pack_revindex_ix(p);
+		num = - 1 - num;
+		pack_revindex[num].p = p;
+	}
+	/* revindex elements are lazily initialized */
+}
+
+static int cmp_offset(const void *a_, const void *b_)
+{
+	unsigned long a = *(unsigned long *) a_;
+	unsigned long b = *(unsigned long *) b_;
+	if (a < b)
+		return -1;
+	else if (a == b)
+		return 0;
+	else
+		return 1;
+}
+
+static void prepare_pack_revindex(struct pack_revindex *rix)
+{
+	struct packed_git *p = rix->p;
+	int num_ent = num_packed_objects(p);
+	int i;
+	void *index = p->index_base + 256;
+
+	rix->revindex = xmalloc(sizeof(unsigned long) * (num_ent + 1));
+	for (i = 0; i < num_ent; i++) {
+		long hl = *((long *)(index + 24 * i));
+		rix->revindex[i] = ntohl(hl);
+	}
+	rix->revindex[num_ent] = p->pack_size - 20;
+	qsort(rix->revindex, num_ent, sizeof(unsigned long), cmp_offset);
+}
+
+static unsigned long find_packed_object_size(struct packed_git *p,
+					     unsigned long ofs)
+{
+	int num;
+	int lo, hi;
+	struct pack_revindex *rix;
+	unsigned long *revindex;
+	num = pack_revindex_ix(p);
+	if (num < 0)
+		die("internal error: pack revindex uninitialized");
+	rix = &pack_revindex[num];
+	if (!rix->revindex)
+		prepare_pack_revindex(rix);
+	revindex = rix->revindex;
+	lo = 0;
+	hi = num_packed_objects(p) + 1;
+	do {
+		int mi = (lo + hi) / 2;
+		if (revindex[mi] == ofs) {
+			return revindex[mi+1] - ofs;
+		}
+		else if (ofs < revindex[mi])
+			hi = mi;
+		else
+			lo = mi + 1;
+	} while (lo < hi);
+	die("internal error: pack revindex corrupt");
+}
+
 static void *delta_against(void *buf, unsigned long size, struct object_entry *entry)
 {
 	unsigned long othersize, delta_size;
@@ -74,39 +176,59 @@ static int encode_header(enum object_typ
 	return n;
 }
 
+static int written = 0;
+static int reused = 0;
+
 static unsigned long write_object(struct sha1file *f, struct object_entry *entry)
 {
 	unsigned long size;
 	char type[10];
-	void *buf = read_sha1_file(entry->sha1, type, &size);
+	void *buf;
 	unsigned char header[10];
 	unsigned hdrlen, datalen;
 	enum object_type obj_type;
 
-	if (!buf)
-		die("unable to read %s", sha1_to_hex(entry->sha1));
-	if (size != entry->size)
-		die("object %s size inconsistency (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
-
-	/*
-	 * The object header is a byte of 'type' followed by zero or
-	 * more bytes of length.  For deltas, the 20 bytes of delta sha1
-	 * follows that.
-	 */
 	obj_type = entry->type;
-	if (entry->delta) {
-		buf = delta_against(buf, size, entry);
-		size = entry->delta_size;
-		obj_type = OBJ_DELTA;
-	}
-	hdrlen = encode_header(obj_type, size, header);
-	sha1write(f, header, hdrlen);
-	if (entry->delta) {
-		sha1write(f, entry->delta, 20);
-		hdrlen += 20;
+	if (!entry->in_pack ||
+	    (obj_type != entry->in_pack_type)) {
+		buf = read_sha1_file(entry->sha1, type, &size);
+		if (!buf)
+			die("unable to read %s", sha1_to_hex(entry->sha1));
+		if (size != entry->size)
+			die("object %s size inconsistency (%lu vs %lu)",
+			    sha1_to_hex(entry->sha1), size, entry->size);
+		if (entry->delta) {
+			buf = delta_against(buf, size, entry);
+			size = entry->delta_size;
+			obj_type = OBJ_DELTA;
+		}
+		/*
+		 * The object header is a byte of 'type' followed by zero or
+		 * more bytes of length.  For deltas, the 20 bytes of delta
+		 * sha1 follows that.
+		 */
+		hdrlen = encode_header(obj_type, size, header);
+		sha1write(f, header, hdrlen);
+
+		if (entry->delta) {
+			sha1write(f, entry->delta, 20);
+			hdrlen += 20;
+		}
+		datalen = sha1write_compressed(f, buf, size);
+		free(buf);
+	}
+	else {
+		struct packed_git *p = entry->in_pack;
+		use_packed_git(p);
+
+		datalen = find_packed_object_size(p, entry->in_pack_offset);
+		buf = p->pack_base + entry->in_pack_offset;
+		sha1write(f, buf, datalen);
+		unuse_packed_git(p);
+		hdrlen = 0; /* not really */
+		reused++;
 	}
-	datalen = sha1write_compressed(f, buf, size);
-	free(buf);
+	written++;
 	return hdrlen + datalen;
 }
 
@@ -196,18 +318,21 @@ static int add_object_entry(unsigned cha
 {
 	unsigned int idx = nr_objects;
 	struct object_entry *entry;
-
-	if (incremental || local) {
-		struct packed_git *p;
-
-		for (p = packed_git; p; p = p->next) {
-			struct pack_entry e;
-
-			if (find_pack_entry_one(sha1, &e, p)) {
-				if (incremental)
-					return 0;
-				if (local && !p->pack_local)
-					return 0;
+	struct packed_git *p;
+	unsigned int found_offset;
+	struct packed_git *found_pack;
+
+	found_pack = NULL;
+	for (p = packed_git; p; p = p->next) {
+		struct pack_entry e;
+		if (find_pack_entry_one(sha1, &e, p)) {
+			if (incremental)
+				return 0;
+			if (local && !p->pack_local)
+				return 0;
+			if (!found_pack) {
+				found_offset = e.offset;
+				found_pack = e.p;
 			}
 		}
 	}
@@ -221,30 +346,99 @@ static int add_object_entry(unsigned cha
 	memset(entry, 0, sizeof(*entry));
 	memcpy(entry->sha1, sha1, 20);
 	entry->hash = hash;
+	if (found_pack) {
+		entry->in_pack = found_pack;
+		entry->in_pack_offset = found_offset;
+	}
 	nr_objects = idx+1;
 	return 1;
 }
 
+static int locate_object_entry_hash(unsigned char *sha1)
+{
+	int i;
+	unsigned int ui;
+	memcpy(&ui, sha1, sizeof(unsigned int));
+	i = ui % object_ix_hashsz;
+	while (0 < object_ix[i]) {
+		if (!memcmp(sha1, objects[object_ix[i]-1].sha1, 20))
+			return i;
+		if (++i == object_ix_hashsz)
+			i = 0;
+	}
+	return -1 - i;
+}
+
+static struct object_entry *locate_object_entry(unsigned char *sha1)
+{
+	int i = locate_object_entry_hash(sha1);
+	if (0 <= i)
+		return &objects[object_ix[i]-1];
+	return NULL;
+}
+
 static void check_object(struct object_entry *entry)
 {
 	char type[20];
 
-	if (!sha1_object_info(entry->sha1, type, &entry->size)) {
-		if (!strcmp(type, "commit")) {
-			entry->type = OBJ_COMMIT;
-		} else if (!strcmp(type, "tree")) {
-			entry->type = OBJ_TREE;
-		} else if (!strcmp(type, "blob")) {
-			entry->type = OBJ_BLOB;
-		} else if (!strcmp(type, "tag")) {
-			entry->type = OBJ_TAG;
-		} else
-			die("unable to pack object %s of type %s",
-			    sha1_to_hex(entry->sha1), type);
+	if (entry->in_pack) {
+		/* Check if it is delta, and the base is also an object
+		 * we are going to pack.  If so we will reuse the existing
+		 * delta.
+		 */
+		unsigned char base[20];
+		unsigned long size;
+		struct object_entry *base_entry;
+		if (!check_reuse_pack_delta(entry->in_pack,
+					    entry->in_pack_offset,
+					    base, &size,
+					    &entry->in_pack_type) &&
+		    (base_entry = locate_object_entry(base))) {
+			/* we do not know yet, but it does not matter */
+			entry->depth = 1;
+			/* uncompressed size */
+			entry->size = entry->delta_size = size;
+			entry->delta = base_entry;
+			entry->type = OBJ_DELTA; /* !! */
+			return;
+		}
+		/* Otherwise we would do the usual */
 	}
-	else
+
+	if (sha1_object_info(entry->sha1, type, &entry->size))
 		die("unable to get type of object %s",
 		    sha1_to_hex(entry->sha1));
+
+	if (!strcmp(type, "commit")) {
+		entry->type = OBJ_COMMIT;
+	} else if (!strcmp(type, "tree")) {
+		entry->type = OBJ_TREE;
+	} else if (!strcmp(type, "blob")) {
+		entry->type = OBJ_BLOB;
+	} else if (!strcmp(type, "tag")) {
+		entry->type = OBJ_TAG;
+	} else
+		die("unable to pack object %s of type %s",
+		    sha1_to_hex(entry->sha1), type);
+}
+
+static void hash_objects(void)
+{
+	int i;
+	struct object_entry *oe;
+
+	object_ix_hashsz = nr_objects * 2;
+	object_ix = xcalloc(sizeof(int), object_ix_hashsz);
+	for (i = 0, oe = objects; i < nr_objects; i++, oe++) {
+		int ix = locate_object_entry_hash(oe->sha1);
+		if (0 <= ix) {
+			error("the same object '%s' added twice",
+			      sha1_to_hex(oe->sha1));
+			continue;
+		}
+		ix = -1 - ix;
+		object_ix[ix] = i + 1;
+	}
 }
 
 static void get_object_details(void)
@@ -252,6 +446,8 @@ static void get_object_details(void)
 	int i;
 	struct object_entry *entry = objects;
 
+	hash_objects();
+	prepare_pack_ix();
 	for (i = 0; i < nr_objects; i++)
 		check_object(entry++);
 }
@@ -382,6 +578,11 @@ static void find_deltas(struct object_en
 			eye_candy -= nr_objects / 20;
 			fputc('.', stderr);
 		}
+
+		if (entry->delta)
+			/* we already know what we need to know */
+			continue;
+
 		free(n->data);
 		n->entry = entry;
 		n->data = read_sha1_file(entry->sha1, type, &size);
@@ -411,10 +612,12 @@ static void find_deltas(struct object_en
 
 static void prepare_pack(int window, int depth)
 {
-	get_object_details();
-
 	if (progress)
 		fprintf(stderr, "Packing %d objects", nr_objects);
+	get_object_details();
+	if (progress)
+		fprintf(stderr, ".");
+
 	sorted_by_type = create_sorted_list(type_size_sort);
 	if (window && depth)
 		find_deltas(sorted_by_type, window+1, depth);
@@ -599,5 +802,7 @@ int main(int argc, char **argv)
 			puts(sha1_to_hex(object_list_sha1));
 		}
 	}
+	fprintf(stderr, "Total %d, written %d, reused %d\n",
+		nr_objects, written, reused);
 	return 0;
 }
diff --git a/pack.h b/pack.h
index 9dafa2b..694e0c5 100644
--- a/pack.h
+++ b/pack.h
@@ -29,5 +29,7 @@ struct pack_header {
 };
 
 extern int verify_pack(struct packed_git *, int);
-
+extern int check_reuse_pack_delta(struct packed_git *, unsigned long,
+				  unsigned char *, unsigned long *,
+				  enum object_type *);
 #endif
diff --git a/sha1_file.c b/sha1_file.c
index 64cf245..0a3a721 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -826,6 +826,25 @@ static unsigned long unpack_object_heade
 	return offset;
 }
 
+int check_reuse_pack_delta(struct packed_git *p, unsigned long offset,
+			   unsigned char *base, unsigned long *sizep,
+			   enum object_type *kindp)
+{
+	unsigned long ptr;
+	int status = -1;
+
+	use_packed_git(p);
+	ptr = offset;
+	ptr = unpack_object_header(p, ptr, kindp, sizep);
+	if (*kindp != OBJ_DELTA)
+		goto done;
+	memcpy(base, p->pack_base + ptr, 20);
+	status = 0;
+ done:
+	unuse_packed_git(p);
+	return status;
+}
+
 void packed_object_info_detail(struct pack_entry *e,
 			       char *type,
 			       unsigned long *size,
-- 
1.2.0.gcfba7


* [PATCH] packed objects: minor cleanup
  2006-02-16  1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano
@ 2006-02-16  1:45   ` Junio C Hamano
  2006-02-16  3:41   ` [PATCH] pack-objects: reuse data from existing pack Nicolas Pitre
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-16  1:45 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds

The delta depth is unsigned.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 * And this is the clean-up patch that comes before the "packed
   representation reuse" patch.

 cache.h      |    2 +-
 pack-check.c |    4 ++--
 sha1_file.c  |    4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

f8f135c9bae74f846a92e1f1f1fea8308802ace5
diff --git a/cache.h b/cache.h
index c255421..b5db01f 100644
--- a/cache.h
+++ b/cache.h
@@ -322,7 +322,7 @@ extern int num_packed_objects(const stru
 extern int nth_packed_object_sha1(const struct packed_git *, int, unsigned char*);
 extern int find_pack_entry_one(const unsigned char *, struct pack_entry *, struct packed_git *);
 extern void *unpack_entry_gently(struct pack_entry *, char *, unsigned long *);
-extern void packed_object_info_detail(struct pack_entry *, char *, unsigned long *, unsigned long *, int *, unsigned char *);
+extern void packed_object_info_detail(struct pack_entry *, char *, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
 
 /* Dumb servers support */
 extern int update_server_info(int);
diff --git a/pack-check.c b/pack-check.c
index 67a7ecd..eca32b6 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -84,7 +84,7 @@ static void show_pack_info(struct packed
 		char type[20];
 		unsigned long size;
 		unsigned long store_size;
-		int delta_chain_length;
+		unsigned int delta_chain_length;
 
 		if (nth_packed_object_sha1(p, i, sha1))
 			die("internal error pack-check nth-packed-object");
@@ -98,7 +98,7 @@ static void show_pack_info(struct packed
 		if (!delta_chain_length)
 			printf("%-6s %lu %u\n", type, size, e.offset);
 		else
-			printf("%-6s %lu %u %d %s\n", type, size, e.offset,
+			printf("%-6s %lu %u %u %s\n", type, size, e.offset,
 			       delta_chain_length, sha1_to_hex(base_sha1));
 	}
 
diff --git a/sha1_file.c b/sha1_file.c
index 3d11a9b..64cf245 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -830,7 +830,7 @@ void packed_object_info_detail(struct pa
 			       char *type,
 			       unsigned long *size,
 			       unsigned long *store_size,
-			       int *delta_chain_length,
+			       unsigned int *delta_chain_length,
 			       unsigned char *base_sha1)
 {
 	struct packed_git *p = e->p;
@@ -844,7 +844,7 @@ void packed_object_info_detail(struct pa
 	if (kind != OBJ_DELTA)
 		*delta_chain_length = 0;
 	else {
-		int chain_length = 0;
+		unsigned int chain_length = 0;
 		memcpy(base_sha1, pack, 20);
 		do {
 			struct pack_entry base_ent;
-- 
1.2.0.gcfba7


* Re: [FYI] pack idx format
  2006-02-15 16:46 ` Nicolas Pitre
@ 2006-02-16  1:58   ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-16  1:58 UTC (permalink / raw)
  To: git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 15 Feb 2006, Junio C Hamano wrote:
>
>> This is still WIP but if anybody is interested...  Once done, it
>> should become Documentation/technical/pack-format.txt.
>> 
> [...]
>> 
>> Pack file entry: <+
>> 
>>      packed object header:
>> 	1-byte type (upper 4-bit)
>
> Actually the type occupies only 3 bits (bits 4 to 6) as bit 7 is the 
> size continuation bit.

You are right.


* Re: [PATCH] pack-objects: reuse data from existing pack.
  2006-02-16  1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano
  2006-02-16  1:45   ` [PATCH] packed objects: minor cleanup Junio C Hamano
@ 2006-02-16  3:41   ` Nicolas Pitre
  2006-02-16  3:59     ` Junio C Hamano
  2006-02-16  3:55   ` Linus Torvalds
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Nicolas Pitre @ 2006-02-16  3:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Wed, 15 Feb 2006, Junio C Hamano wrote:

> When generating a new pack, notice if we already have the wanted
> object in existing packs.  If the object has a deltified
> representation, and its base object is also what we are going to
> pack, then reuse the existing deltified representation
> unconditionally, bypassing all the expensive find_deltas() and
> try_deltas() routines.
> 
> Also, when writing out such a deltified or undeltified
> representation, if matching data already exists in an existing
> pack, just write it out without uncompressing and recompressing.

Great!

> Without this patch:
> 
>     $ git-rev-list --objects v1.0.0 >RL
>     $ time git-pack-objects p <RL
> 
>     Generating pack...
>     Done counting 12233 objects.
>     Packing 12233 objects....................
>     60a88b3979df41e22d1edc3967095e897f720192
> 
>     real    0m32.751s
>     user    0m27.090s
>     sys     0m2.750s
> 
> With this patch:
> 
>     $ git-rev-list --objects v1.0.0 >RL
>     $ time ../git.junio/git-pack-objects q <RL
> 
>     Generating pack...
>     Done counting 12233 objects.
>     Packing 12233 objects.....................
>     60a88b3979df41e22d1edc3967095e897f720192
>     Total 12233, written 12233, reused 12177
> 
>     real    0m4.007s
>     user    0m3.360s
>     sys     0m0.090s
> 
> Signed-off-by: Junio C Hamano <junkio@cox.net>
> 
> ---
> 
>  * This may depend on one cleanup patch I have not sent out, but
>    I am so excited that I could not help sending this out first.
> 
>    Admittedly this is hot off the press, I have not had enough
>    time to beat this too hard, but the resulting pack from the
>    above passed unpack-objects, index-pack and verify-pack.

In fact, the resulting pack should be identical with or without this 
patch, shouldn't it?

FYI: I have a list of patches to produce even smaller (yet still
compatible) packs, or less dense ones with much reduced CPU usage,
all depending on a new --speed argument to git-pack-objects.  I've
been able to produce 15-20% smaller packs with the same depth and
window size, but taking twice as much CPU time.  Combined with your
patch, one could repack the object store with maximum compression
even if it is CPU-expensive, and any pull will benefit from it
afterwards at no additional cost.

I only need to find some time to finally clean and re-test those 
patches...


Nicolas


* Re: [PATCH] pack-objects: reuse data from existing pack.
  2006-02-16  1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano
  2006-02-16  1:45   ` [PATCH] packed objects: minor cleanup Junio C Hamano
  2006-02-16  3:41   ` [PATCH] pack-objects: reuse data from existing pack Nicolas Pitre
@ 2006-02-16  3:55   ` Linus Torvalds
  2006-02-16  4:07     ` Junio C Hamano
  2006-02-16  8:32   ` Andreas Ericsson
  2006-02-17  4:30   ` Junio C Hamano
  4 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2006-02-16  3:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Wed, 15 Feb 2006, Junio C Hamano wrote:
>
> When generating a new pack, notice if we already have the wanted
> object in existing packs.  If the object has a deltified
> representation, and its base object is also what we are going to
> pack, then reuse the existing deltified representation
> unconditionally, bypassing all the expensive find_deltas() and
> try_deltas() routines.

I bow down before you.

No ugly special-case caching, just automatically "the right thing", with 
very little overhead.

It just makes sense.

We have a winner.

		Linus


* Re: [PATCH] pack-objects: reuse data from existing pack.
  2006-02-16  3:41   ` [PATCH] pack-objects: reuse data from existing pack Nicolas Pitre
@ 2006-02-16  3:59     ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-16  3:59 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> writes:

> On Wed, 15 Feb 2006, Junio C Hamano wrote:
>
>>     Generating pack...
>>     Done counting 12233 objects.
>>     Packing 12233 objects.....................
>>     60a88b3979df41e22d1edc3967095e897f720192
>>     Total 12233, written 12233, reused 12177
>> ...
> In fact, the resulting pack should be identical with or without this 
> patch, shouldn't it?

Not necessarily.  The delta-depth limitation is currently lifted
when reusing deltified objects (finding out the current depth is
not so expensive compared to the uncompress-delta-recompress
cycle, but it still costs somewhat, and the objective of this
exercise is to gain performance).

Notice the numbers 'written' and 'reused' in the output?
The difference in that example comes from the fact that I am
omitting some objects from the set of objects to be packed
(v1.0.0 is ancient) in a repository where some newer objects are
packed.  Since packing-delta goes backwards, what is in v1.0.0
but not in my tip tends to be deltified in the original pack,
but the resulting pack needs to have them expanded -- that is
where the difference comes from.

A cleaned-up patch will be in "pu" branch tonight.  I considered
putting it in "next", but decided against it.  I have not spent
enough time really beating on it, although I haven't seen major
breakage.


* Re: [PATCH] pack-objects: reuse data from existing pack.
  2006-02-16  3:55   ` Linus Torvalds
@ 2006-02-16  4:07     ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-16  4:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> No ugly special-case caching, just automatically "the right thing", with 
> very little overhead.
>
> It just makes sense.

Thanks.  I threw away two rounds of crap before this one, which
were full of ugly special cases ;-).


* Re: [PATCH] pack-objects: reuse data from existing pack.
  2006-02-16  1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano
                     ` (2 preceding siblings ...)
  2006-02-16  3:55   ` Linus Torvalds
@ 2006-02-16  8:32   ` Andreas Ericsson
  2006-02-16  9:13     ` Junio C Hamano
  2006-02-17  4:30   ` Junio C Hamano
  4 siblings, 1 reply; 17+ messages in thread
From: Andreas Ericsson @ 2006-02-16  8:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

Junio C Hamano wrote:
> When generating a new pack, notice if we already have the wanted
> object in existing packs.  If the object has a deltified
> representation, and its base object is also what we are going to
> pack, then reuse the existing deltified representation
> unconditionally, bypassing all the expensive find_deltas() and
> try_deltas() routines.
> 
> Also, when writing out such a deltified or undeltified
> representation, if matching data already exists in an existing
> pack, just write it out without uncompressing and recompressing.
> 
> Without this patch:
> 
>     $ git-rev-list --objects v1.0.0 >RL
>     $ time git-pack-objects p <RL
> 
>     Generating pack...
>     Done counting 12233 objects.
>     Packing 12233 objects....................
>     60a88b3979df41e22d1edc3967095e897f720192
> 
>     real    0m32.751s
>     user    0m27.090s
>     sys     0m2.750s
> 
> With this patch:
> 
>     $ git-rev-list --objects v1.0.0 >RL
>     $ time ../git.junio/git-pack-objects q <RL
> 
>     Generating pack...
>     Done counting 12233 objects.
>     Packing 12233 objects.....................
>     60a88b3979df41e22d1edc3967095e897f720192
>     Total 12233, written 12233, reused 12177
> 
>     real    0m4.007s
>     user    0m3.360s
>     sys     0m0.090s
> 

Whoa! Columbus and the egg. Strange no one saw it before. It's so obvious
when you shove it under the nose like that. :)

Now that pack-creation went from bizarrely expensive to insanely cheap 
(well, comparable to "tar czf" anyway), what's the BCP for packing a public
repository? Always one mega-pack and never worry, or should one still 
use incremental and sometimes overlapping pack-files?

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] pack-objects: reuse data from existing pack.
  2006-02-16  8:32   ` Andreas Ericsson
@ 2006-02-16  9:13     ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-16  9:13 UTC (permalink / raw)
  To: Andreas Ericsson; +Cc: git

Andreas Ericsson <ae@op5.se> writes:

> Whoa! Columbus and the egg. Strange no one saw it before. It's so
> obvious when you shove it under the nose like that. :)

I wish the pack format were not as dense as it is today.  It
is very expensive to obtain the uncompressed size of a deltified
object.  For this reason, a delta newly created (either from a
non-delta in an existing pack or from a loose object) by the
experimental algorithm is never made against an object that is
in deltified form in a pack.  Also it incurs nontrivial cost to
obtain the size of the in-pack representation of an object
(either deltified or not).  But the inefficiency in the
resulting pack due to these factors may not matter in practice.

I just packed the v2.6.16-rc3 object list (184141 objects) using the
current and the experimental packers, just for fun.  Tonight's one runs
in just under 1 minute on my Duron 750 (with slow disks, I should
add).  This was done in a repository that has about 1500 loose
objects and a single mega pack; reuse rate of packed data by the
experimental algorithm is about 99%.  I am hoping the one from
the "master" would come back before I finish writing this
message ;-).

There are subtleties.

For example, in a typical project, files tend to grow rather
than shrink on average, and older ones tend to be in packs.  If
you do packing the traditional way, the largest one (which is
typically the latest) is kept as non-delta, and all the smaller
ones will be incremental delta from that, no matter how your
packs and loose objects are organized.

Usually, you have the latest objects as loose objects in your
repository to be packed (either you push from it, or somebody
else pulls from you).  In other words, as you develop after your
last repack, you would accumulate loose objects, and they are
the ones that typically matter the most.

Let's say you have been changing the same file in every commit
(1..N), then you fully packed and then created another commit
(revision N+1) that touches the file.  The experimental
reuse-packer would:

    - notice blobs from revision 1..(N-1) are deltified,
      relative to the rev one greater than each of them.
      These would be reused.
    - notice blob from revision N is in the pack but not
      deltified.
    - notice blob from revision N+1 is loose.

Then emit the bigger one of N and N+1 as non-delta, and the
other one as delta.  1..(N-1) are output as deltas.  If it
happens to choose N as plain, it does not have to uncompress and
recompress, so the pack process would go very fast, but you would
end up always applying a delta on top of the non-delta N to bring
rev N to N+1 and get at the latest blob in rev N+1, and you
typically would want to access the rev N+1 blob more often.

In other words, the experimental reuse-packer would create a
suboptimal pack in such a case.  Not a big deal, though.

We may want an option to disable the optimization for
weekly/monthly repacking.  git-daemon (or whatever runs
pack-objects via upload-pack) should use the default with the
optimization, since this is so obviously faster.

> Now that pack-creation went from bizarrely expensive to insanely cheap
> (well, comparable to "tar czf" anyways), what's BCP for packing a
> public repository? Always one mega-pack and never worry, or should one
> still use incremental and sometimes overlapping pack-files?

I would say an optimal single mega-pack would work the best, but
"repack -a -d" to create the mega-pack _with_ the optimization
may have a performance impact for users of the resulting packs.

Oh, the traditional one finally came back after 11 minutes.


* Re: [PATCH] pack-objects: reuse data from existing pack.
  2006-02-16  1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano
                     ` (3 preceding siblings ...)
  2006-02-16  8:32   ` Andreas Ericsson
@ 2006-02-17  4:30   ` Junio C Hamano
  2006-02-17 10:37     ` [PATCH] pack-objects: finishing touches Junio C Hamano
  2006-02-17 15:39     ` [PATCH] pack-objects: reuse data from existing pack Linus Torvalds
  4 siblings, 2 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-17  4:30 UTC (permalink / raw)
  To: git

Junio C Hamano <junkio@cox.net> writes:

> When generating a new pack, notice if we have already the wanted
> object in existing packs.  If the object has a deltified
> representation, and its base object is also what we are going to
> pack, then reuse the existing deltified representation
> unconditionally, bypassing all the expensive find_deltas() and
> try_deltas() routines.

This one has one nasty data corruption bug, which fortunately I
think I have figured out how to fix.  Please do not use it for
your production repository in the meantime.


* [PATCH] pack-objects: finishing touches.
  2006-02-17  4:30   ` Junio C Hamano
@ 2006-02-17 10:37     ` Junio C Hamano
  2006-02-18  6:50       ` [PATCH] pack-objects: avoid delta chains that are too long Junio C Hamano
  2006-02-17 15:39     ` [PATCH] pack-objects: reuse data from existing pack Linus Torvalds
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2006-02-17 10:37 UTC (permalink / raw)
  To: git

Junio C Hamano <junkio@cox.net> writes:

> This one has one nasty data corruption bug, which fortunately I
> think I have figured out how to fix.  Please do not use it for
> your production repository in the meantime.

It turns out that there wasn't a data corruption bug, but it can
create a pack with delta chains whose length exceeds twice the
specified depth.  The data corruption bug was on my unpublished
WIP and killed even before it hit "pu" ;-).  This addresses most
of the remaining issues and has been merged to "next" tonight.

-- >8 --
This introduces the --no-reuse-delta option to disable reusing of
existing deltas, which is a large part of the optimization
introduced by this series.  This may become necessary if
repeated repacking makes the delta chain too long.  With this, the
output of the command becomes identical to that of the older
implementation.

It still allows reusing non-deltified representations; there is
no point in uncompressing and recompressing the whole text.

It also adds a couple more statistics outputs, squelching them
under the -q flag, which the last round forgot to do.

One remaining minor issue when the --no-reuse-delta option is not
used is that it can create a delta chain that is deeper than
specified.

    A<--B<--C<--D   E   F   G

Suppose we have a delta chain A to D (A is stored in full either
in a pack or as a loose object.  B is a depth-1 delta relative to
A, C is a depth-2 delta relative to B...) with loose objects E, F,
G.  And we are going to pack them.

B, C and D are left as delta against A, B and C respectively.
So A, E, F, and G are examined, and let's say we decided to keep
E expanded, and store the rest as deltas like this:

	E<--F<--G<--A

Oops.  We ended up making D a bit too deep, didn't we?  B, C and
D form a chain on top of A!

This is because we did not know what the final depth of A would
be when checking objects and deciding to keep the existing
delta.

To prevent this from happening, we could say that A should be
kept expanded.  But how would we tell that, cheaply?

To do this most precisely, after check_object() runs, each
object that is used as the base object of some existing delta
needs to be marked with the maximum depth of the objects we
decided to keep deltified (in this case, D is depth 3 relative
to A, so if no other delta chain that is longer than 3 based on
A exists, mark A with 3).  Then when attempting to deltify A, we
would take that number into account to see if the final delta
chain that leads to D becomes too deep.

However, this is a bit cumbersome to compute, so we would cheat
and reduce the maximum depth for A arbitrarily to depth/4 in
this implementation.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 Documentation/git-pack-objects.txt |   21 +++++++
 pack-objects.c                     |  102 +++++++++++++++++++++++++-----------
 2 files changed, 91 insertions(+), 32 deletions(-)

98885061623676a37575e59d7f7aff64072e3300
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 2d67d39..4cb2e83 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -8,7 +8,10 @@ git-pack-objects - Create a packed archi
 
 SYNOPSIS
 --------
-'git-pack-objects' [--non-empty] [--local] [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list
+[verse]
+'git-pack-objects' [-q] [--no-reuse-delta] [--non-empty]
+	[--local] [--incremental] [--window=N] [--depth=N]
+	{--stdout | base-name} < object-list
 
 
 DESCRIPTION
@@ -32,6 +35,10 @@ Placing both in the pack/ subdirectory o
 any of the directories on $GIT_ALTERNATE_OBJECT_DIRECTORIES)
 enables git to read from such an archive.
 
+In a packed archive, an object is either stored as a compressed
+whole, or as a difference from some other object.  The latter is
+often called a delta.
+
 
 OPTIONS
 -------
@@ -74,6 +81,18 @@ base-name::
         Only create a packed archive if it would contain at
         least one object.
 
+-q::
+	This flag makes the command not report its progress
+	on the standard error stream.
+
+--no-reuse-delta::
+	When creating a packed archive in a repository that
+	has existing packs, the command reuses existing deltas.
+	This sometimes results in a slightly suboptimal pack.
+	This flag tells the command not to reuse existing deltas
+	but compute them from scratch.
+
+
 Author
 ------
 Written by Linus Torvalds <torvalds@osdl.org>
diff --git a/pack-objects.c b/pack-objects.c
index 70fb2af..38e1c99 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -5,7 +5,7 @@
 #include "csum-file.h"
 #include <sys/time.h>
 
-static const char pack_usage[] = "git-pack-objects [-q] [--non-empty] [--local] [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list";
+static const char pack_usage[] = "git-pack-objects [-q] [--no-reuse-delta] [--non-empty] [--local] [--incremental] [--window=N] [--depth=N] {--stdout | base-name} < object-list";
 
 struct object_entry {
 	unsigned char sha1[20];
@@ -14,10 +14,11 @@ struct object_entry {
 	unsigned int depth;	/* delta depth */
 	unsigned int hash;	/* name hint hash */
 	enum object_type type;
+	unsigned char edge;	/* reused delta chain points at this entry. */
+	enum object_type in_pack_type;	/* could be delta */
 	unsigned long delta_size;	/* delta data size (uncompressed) */
 	struct object_entry *delta;	/* delta base object */
 	struct packed_git *in_pack; 	/* already in pack */
-	enum object_type in_pack_type;	/* could be delta */
 	unsigned int in_pack_offset;
 };
 
@@ -36,6 +37,7 @@ struct object_entry {
 
 static unsigned char object_list_sha1[20];
 static int non_empty = 0;
+static int no_reuse_delta = 0;
 static int local = 0;
 static int incremental = 0;
 static struct object_entry **sorted_by_sha, **sorted_by_type;
@@ -75,7 +77,9 @@ static int pack_revindex_hashsz = 0;
  * stats
  */
 static int written = 0;
+static int written_delta = 0;
 static int reused = 0;
+static int reused_delta = 0;
 
 static int pack_revindex_ix(struct packed_git *p)
 {
@@ -227,10 +231,23 @@ static unsigned long write_object(struct
 	unsigned char header[10];
 	unsigned hdrlen, datalen;
 	enum object_type obj_type;
+	int to_reuse = 0;
 
 	obj_type = entry->type;
-	if (!entry->in_pack ||
-	    (obj_type != entry->in_pack_type)) {
+	if (! entry->in_pack)
+		to_reuse = 0;	/* can't reuse what we don't have */
+	else if (obj_type == OBJ_DELTA)
+		to_reuse = 1;	/* check_object() decided it for us */
+	else if (obj_type != entry->in_pack_type)
+		to_reuse = 0;	/* pack has delta which is unusable */
+	else if (entry->delta)
+		to_reuse = 0;	/* we want to pack afresh */
+	else
+		to_reuse = 1;	/* we have it in-pack undeltified,
+				 * and we do not need to deltify it.
+				 */
+
+	if (! to_reuse) {
 		buf = read_sha1_file(entry->sha1, type, &size);
 		if (!buf)
 			die("unable to read %s", sha1_to_hex(entry->sha1));
@@ -266,8 +283,12 @@ static unsigned long write_object(struct
 		sha1write(f, buf, datalen);
 		unuse_packed_git(p);
 		hdrlen = 0; /* not really */
+		if (obj_type == OBJ_DELTA)
+			reused_delta++;
 		reused++;
 	}
+	if (obj_type == OBJ_DELTA)
+		written_delta++;
 	written++;
 	return hdrlen + datalen;
 }
@@ -294,7 +315,6 @@ static void write_pack_file(void)
 	int i;
 	struct sha1file *f;
 	unsigned long offset;
-	unsigned long mb;
 	struct pack_header hdr;
 
 	if (!base_name)
@@ -357,10 +377,9 @@ static int add_object_entry(unsigned cha
 	unsigned int idx = nr_objects;
 	struct object_entry *entry;
 	struct packed_git *p;
-	unsigned int found_offset;
-	struct packed_git *found_pack;
+	unsigned int found_offset = 0;
+	struct packed_git *found_pack = NULL;
 
-	found_pack = NULL;
 	for (p = packed_git; p; p = p->next) {
 		struct pack_entry e;
 		if (find_pack_entry_one(sha1, &e, p)) {
@@ -420,32 +439,39 @@ static void check_object(struct object_e
 	char type[20];
 
 	if (entry->in_pack) {
+		unsigned char base[20];
+		unsigned long size;
+		struct object_entry *base_entry;
+
+		/* We want in_pack_type even if we do not reuse delta.
+		 * There is no point not reusing non-delta representations.
+		 */
+		check_reuse_pack_delta(entry->in_pack,
+				       entry->in_pack_offset,
+				       base, &size,
+				       &entry->in_pack_type);
+
 		/* Check if it is delta, and the base is also an object
 		 * we are going to pack.  If so we will reuse the existing
 		 * delta.
 		 */
-		unsigned char base[20];
-		unsigned long size;
-		struct object_entry *base_entry;
-		if (!check_reuse_pack_delta(entry->in_pack,
-					    entry->in_pack_offset,
-					    base, &size,
-					    &entry->in_pack_type) &&
+		if (!no_reuse_delta &&
+		    entry->in_pack_type == OBJ_DELTA &&
 		    (base_entry = locate_object_entry(base))) {
-			/* We do not know depth at this point, but it
-			 * does not matter.  Getting delta_chain_length
-			 * with packed_object_info_detail() is not so
-			 * expensive, so we could do that later if we
-			 * wanted to.  Calling sha1_object_info to get
-			 * the true size (and later an uncompressed
-			 * representation) of deeply deltified object
-			 * is quite expensive.
+
+			/* Depth value does not matter - find_deltas()
+			 * will never consider reused delta as the
+			 * base object to deltify other objects
+			 * against, in order to avoid circular deltas.
 			 */
-			entry->depth = 1;
-			/* uncompressed size */
+
+			/* uncompressed size of the delta data */
 			entry->size = entry->delta_size = size;
 			entry->delta = base_entry;
 			entry->type = OBJ_DELTA;
+
+			base_entry->edge = 1;
+
 			return;
 		}
 		/* Otherwise we would do the usual */
@@ -568,6 +594,13 @@ static int try_delta(struct unpacked *cu
 	if (cur_entry->type != old_entry->type)
 		return -1;
 
+	/* If the current object is at edge, take the depth the objects
+	 * that depend on the current object into account -- otherwise
+	 * they would become too deep.
+	 */
+	if (cur_entry->edge)
+		max_depth /= 4;
+
 	size = cur_entry->size;
 	if (size < 50)
 		return -1;
@@ -627,7 +660,7 @@ static void find_deltas(struct object_en
 
 		if (entry->delta)
 			/* This happens if we decided to reuse existing
-			 * delta from a pack.
+			 * delta from a pack.  "!no_reuse_delta &&" is implied.
 			 */
 			continue;
 
@@ -636,6 +669,7 @@ static void find_deltas(struct object_en
 		n->data = read_sha1_file(entry->sha1, type, &size);
 		if (size != entry->size)
 			die("object %s inconsistent object length (%lu vs %lu)", sha1_to_hex(entry->sha1), size, entry->size);
+
 		j = window;
 		while (--j > 0) {
 			unsigned int other_idx = idx + j;
@@ -664,7 +698,7 @@ static void prepare_pack(int window, int
 		fprintf(stderr, "Packing %d objects", nr_objects);
 	get_object_details();
 	if (progress)
-		fprintf(stderr, ".");
+		fputc('.', stderr);
 
 	sorted_by_type = create_sorted_list(type_size_sort);
 	if (window && depth)
@@ -694,8 +728,9 @@ static int reuse_cached_pack(unsigned ch
 		}
 	}
 
-	fprintf(stderr, "Reusing %d objects pack %s\n", nr_objects,
-		sha1_to_hex(sha1));
+	if (progress)
+		fprintf(stderr, "Reusing %d objects pack %s\n", nr_objects,
+			sha1_to_hex(sha1));
 
 	if (pack_to_stdout) {
 		if (copy_fd(ifd, 1))
@@ -775,6 +810,10 @@ int main(int argc, char **argv)
 				progress = 0;
 				continue;
 			}
+			if (!strcmp("--no-reuse-delta", arg)) {
+				no_reuse_delta = 1;
+				continue;
+			}
 			if (!strcmp("--stdout", arg)) {
 				pack_to_stdout = 1;
 				continue;
@@ -850,7 +889,8 @@ int main(int argc, char **argv)
 			puts(sha1_to_hex(object_list_sha1));
 		}
 	}
-	fprintf(stderr, "Total %d, written %d, reused %d\n",
-		nr_objects, written, reused);
+	if (progress)
+		fprintf(stderr, "Total %d, written %d (delta %d), reused %d (delta %d)\n",
+			nr_objects, written, written_delta, reused, reused_delta);
 	return 0;
 }
-- 
1.2.1.g91016


* Re: [PATCH] pack-objects: reuse data from existing pack.
  2006-02-17  4:30   ` Junio C Hamano
  2006-02-17 10:37     ` [PATCH] pack-objects: finishing touches Junio C Hamano
@ 2006-02-17 15:39     ` Linus Torvalds
  2006-02-17 18:18       ` Junio C Hamano
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2006-02-17 15:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Thu, 16 Feb 2006, Junio C Hamano wrote:
> 
> This one has one nasty data corruption bug, which fortunately I
> think I have figured out how to fix.

Circular deltas? What else can go wrong?

		Linus


* Re: [PATCH] pack-objects: reuse data from existing pack.
  2006-02-17 15:39     ` [PATCH] pack-objects: reuse data from existing pack Linus Torvalds
@ 2006-02-17 18:18       ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-17 18:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Thu, 16 Feb 2006, Junio C Hamano wrote:
>> 
>> This one has one nasty data corruption bug, which fortunately I
>> think I have figured out how to fix.
>
> Circular deltas? What else can go wrong?

Circular deltas are prevented by not using, in find_deltas(), the
deltified objects check_object() decided to keep, so I do not
think it is an issue.


* [PATCH] pack-objects: avoid delta chains that are too long.
  2006-02-17 10:37     ` [PATCH] pack-objects: finishing touches Junio C Hamano
@ 2006-02-18  6:50       ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2006-02-18  6:50 UTC (permalink / raw)
  To: git

This tries to rework the solution for the excess delta chain
problem.  An earlier commit worked around it ``cheaply'', but
repeated repacking risks unbounded growth of delta chains.

This version counts the length of the delta chain we are reusing
from the existing pack, and makes sure a base object that has a
sufficiently long delta chain does not get deltified.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 * This solves the problem by doing exactly what the previous
   commit log suggested.  We limit the chain depth correctly, to
   save the user of the resulting pack from suffering runtime
   overhead.

   Will be in "next" tonight.

 pack-objects.c |   43 +++++++++++++++++++++++++++++++++++--------
 1 files changed, 35 insertions(+), 8 deletions(-)

e4c9327a77bd59e85d4b17a612e78977d68773ee
diff --git a/pack-objects.c b/pack-objects.c
index 38e1c99..0c9f4c9 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -10,16 +10,22 @@ static const char pack_usage[] = "git-pa
 struct object_entry {
 	unsigned char sha1[20];
 	unsigned long size;	/* uncompressed size */
-	unsigned long offset;	/* offset into the final pack file (nonzero if already written) */
+	unsigned long offset;	/* offset into the final pack file;
+				 * nonzero if already written.
+				 */
 	unsigned int depth;	/* delta depth */
+	unsigned int delta_limit;	/* base adjustment for in-pack delta */
 	unsigned int hash;	/* name hint hash */
 	enum object_type type;
-	unsigned char edge;	/* reused delta chain points at this entry. */
 	enum object_type in_pack_type;	/* could be delta */
 	unsigned long delta_size;	/* delta data size (uncompressed) */
 	struct object_entry *delta;	/* delta base object */
 	struct packed_git *in_pack; 	/* already in pack */
 	unsigned int in_pack_offset;
+	struct object_entry *delta_child; /* deltified objects whose base is me */
+	struct object_entry *delta_sibling; /* other deltified objects that
+					     * use the same base as me
+					     */
 };
 
 /*
@@ -470,7 +476,8 @@ static void check_object(struct object_e
 			entry->delta = base_entry;
 			entry->type = OBJ_DELTA;
 
-			base_entry->edge = 1;
+			entry->delta_sibling = base_entry->delta_child;
+			base_entry->delta_child = entry;
 
 			return;
 		}
@@ -513,15 +520,32 @@ static void hash_objects(void)
 	}
 }
 
+static unsigned int check_delta_limit(struct object_entry *me, unsigned int n)
+{
+	struct object_entry *child = me->delta_child;
+	unsigned int m = n;
+	while (child) {
+		unsigned int c = check_delta_limit(child, n + 1);
+		if (m < c)
+			m = c;
+		child = child->delta_sibling;
+	}
+	return m;
+}
+
 static void get_object_details(void)
 {
 	int i;
-	struct object_entry *entry = objects;
+	struct object_entry *entry;
 
 	hash_objects();
 	prepare_pack_ix();
-	for (i = 0; i < nr_objects; i++)
-		check_object(entry++);
+	for (i = 0, entry = objects; i < nr_objects; i++, entry++)
+		check_object(entry);
+	for (i = 0, entry = objects; i < nr_objects; i++, entry++)
+		if (!entry->delta && entry->delta_child)
+			entry->delta_limit =
+				check_delta_limit(entry, 1);
 }
 
 typedef int (*entry_sort_t)(const struct object_entry *, const struct object_entry *);
@@ -598,8 +622,11 @@ static int try_delta(struct unpacked *cu
 	 * that depend on the current object into account -- otherwise
 	 * they would become too deep.
 	 */
-	if (cur_entry->edge)
-		max_depth /= 4;
+	if (cur_entry->delta_child) {
+		if (max_depth <= cur_entry->delta_limit)
+			return 0;
+		max_depth -= cur_entry->delta_limit;
+	}
 
 	size = cur_entry->size;
 	if (size < 50)
-- 
1.2.1.g9b132


end of thread, other threads:[~2006-02-18  6:50 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-15  8:39 [FYI] pack idx format Junio C Hamano
2006-02-15 11:16 ` Johannes Schindelin
2006-02-15 16:46 ` Nicolas Pitre
2006-02-16  1:58   ` Junio C Hamano
2006-02-16  1:43 ` [PATCH] pack-objects: reuse data from existing pack Junio C Hamano
2006-02-16  1:45   ` [PATCH] packed objects: minor cleanup Junio C Hamano
2006-02-16  3:41   ` [PATCH] pack-objects: reuse data from existing pack Nicolas Pitre
2006-02-16  3:59     ` Junio C Hamano
2006-02-16  3:55   ` Linus Torvalds
2006-02-16  4:07     ` Junio C Hamano
2006-02-16  8:32   ` Andreas Ericsson
2006-02-16  9:13     ` Junio C Hamano
2006-02-17  4:30   ` Junio C Hamano
2006-02-17 10:37     ` [PATCH] pack-objects: finishing touches Junio C Hamano
2006-02-18  6:50       ` [PATCH] pack-objects: avoid delta chains that are too long Junio C Hamano
2006-02-17 15:39     ` [PATCH] pack-objects: reuse data from existing pack Linus Torvalds
2006-02-17 18:18       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).