From: Nicolas Pitre <nico@cam.org>
To: Jon Smirl <jonsmirl@gmail.com>
Cc: Shawn Pearce <spearce@spearce.org>, git <git@vger.kernel.org>
Subject: Re: Huge win, compressing a window of delta runs as a unit
Date: Fri, 18 Aug 2006 00:03:37 -0400 (EDT)
Message-ID: <Pine.LNX.4.64.0608172323520.11359@localhost.localdomain>
In-Reply-To: <9e4733910608161020s6855140bs68aaab6e1bbd3bad@mail.gmail.com>

On Wed, 16 Aug 2006, Jon Smirl wrote:
> Shawn put together a new version of his import utility that packs all
> of the deltas from a run into a single blob instead of one blob per
> delta. The idea is to put 10 or more deltas into each delta entry
> instead of one. The index format would map the 10 sha1's to a single
> packed delta entry which would be expanded when needed. Note that you
> probably needed multiple entries out of the delta pack to generate the
> revision you were looking for so this is no real loss on extraction.
>
> I ran it overnight on mozcvs. If his delta pack code is correct this
> is a huge win.
>
> One entry per delta - 845,420,150
> Packed deltas - 295,018,474
> 65% smaller

Well, I have to say that I don't believe those results are right.  Not
that I doubt the numbers you measured, but rather that Shawn's pack
generation is most probably broken.
Since I was highly suspicious looking at the above size difference (it
is just too good to be true), I did a quick hack to pack-objects.c to
produce packs with all deltas depending on the same root object into the
same zlib stream.  In effect I implemented exactly what Shawn claims
to have done, but directly in pack-objects.c, so git-repack could be used
with existing repositories like the linux kernel in order to validate
the concept.

Results:

One entry per delta - 122,103,455
Packed (or grouped) deltas - 108,479,014

Only 11% smaller.
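
The effect both sets of numbers hinge on is that similar deltas
compressed in a single zlib stream share one compression window, so a
repeated pattern is paid for only once.  A toy sketch (Python for
brevity; the strings below are made-up stand-ins, not real git delta
data):

```python
import zlib

# Made-up "deltas": successive revisions of a file tend to produce
# delta payloads that strongly resemble each other.
deltas = [b"copy 0-4096; insert 'revision %d of the same routine'" % i
          for i in range(10)]

# One zlib stream per delta: every stream re-encodes the common pattern.
separate = sum(len(zlib.compress(d)) for d in deltas)

# One zlib stream for the whole group: the pattern is encoded once and
# later deltas become back-references into the shared window.
grouped = len(zlib.compress(b"".join(deltas)))

assert grouped < separate
```

How large the win is depends entirely on how redundant consecutive
deltas really are, which is what the 65% vs. 11% disagreement above
comes down to.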

I was really enthusiastic at first when I saw the result you obtained,
but I no longer believe it can be right.

Furthermore, such a pack organization has many disadvantages: it
prevents trivial delta data reuse for incremental push/pull operations,
which is a huge performance penalty (bad for servers), and it makes
random object extraction from such a pack somewhat more complex.  Those
disadvantages greatly outweigh the 11% size saving IMHO.

A better way to get such a size saving is to increase the window and
depth parameters.  For example, a window of 20 and depth of 20 can
usually provide a pack size saving greater than 11% with none of the
disadvantages mentioned above.

I therefore don't think this idea should be pursued any further.

For completeness you can find my changes to pack-objects below.  Beware
that this produces packs that cannot be read back, so back up your
repository before playing with this.

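
For reference, encode_header() in the patch below writes git's
variable-length pack entry header: the first byte carries the object
type in bits 4-6 and size bits 0-3 in its low nibble, and each
0x80-flagged continuation byte carries seven more size bits.  A Python
transcription of that logic (an illustration, not the canonical code):

```python
# Object type codes used in the pack entry header.
OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG = 1, 2, 3, 4

def encode_header(obj_type, size):
    """Variable-length pack entry header, as in pack-objects.c:
    byte 0 = [cont][type:3][size bits 0-3], then 7 size bits per
    continuation byte, least significant bits first."""
    c = (obj_type << 4) | (size & 15)
    size >>= 4
    out = bytearray()
    while size:
        out.append(c | 0x80)   # high bit set: more size bytes follow
        c = size & 0x7f
        size >>= 7
    out.append(c)              # last byte: high bit clear
    return bytes(out)

# A 10-byte commit fits in one header byte; a 100-byte blob needs two.
assert encode_header(OBJ_COMMIT, 10) == b"\x1a"
assert encode_header(OBJ_BLOB, 100) == b"\xb4\x06"

# The patch's new "delta group" header uses type 0 with the child count
# as the size field, which is why the first hunk relaxes the range check
# in encode_header() from OBJ_COMMIT down to 0.
assert encode_header(0, 3) == b"\x03"
```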
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 448461b..bfae892 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -227,7 +227,7 @@ static int encode_header(enum object_typ
int n = 1;
unsigned char c;
- if (type < OBJ_COMMIT || type > OBJ_DELTA)
+ if (type < 0 || type > OBJ_DELTA)
die("bad type %d", type);
c = (type << 4) | (size & 15);
@@ -332,6 +332,8 @@ static unsigned long write_object(struct
return hdrlen + datalen;
}
+#if 0
+
static unsigned long write_one(struct sha1file *f,
struct object_entry *e,
unsigned long offset)
@@ -349,6 +351,107 @@ static unsigned long write_one(struct sh
return offset;
}
+#else
+
+static unsigned count_delta_childs(struct object_entry *e, unsigned long o)
+{
+ unsigned count = 0;
+ struct object_entry *child = e->delta_child;
+ while (child) {
+ count += 1 + count_delta_childs(child, o);
+ if (child->offset)
+ die("object already seen???");
+ child->offset = o;
+ child = child->delta_sibling;
+ }
+ return count;
+}
+
+static unsigned long write_delta_headers(struct sha1file *f, struct object_entry *e, unsigned long *size)
+{
+ unsigned long offset = 0;
+ struct object_entry *child = e->delta_child;
+ while (child) {
+ unsigned char header[10];
+ unsigned hdrlen;
+ hdrlen = encode_header(OBJ_DELTA, child->delta_size, header);
+ *size += child->delta_size;
+ sha1write(f, header, hdrlen);
+ sha1write(f, child->delta, 20);
+ offset += hdrlen + 20;
+ offset += write_delta_headers(f, child, size);
+ child = child->delta_sibling;
+ }
+ return offset;
+}
+
+static void * concat_delta_data(struct object_entry *e, void *buf)
+{
+ struct object_entry *child = e->delta_child;
+ while (child) {
+ char type[10];
+ unsigned long size;
+ void *data = read_sha1_file(child->sha1, type, &size);
+ if (!data)
+ die("unable to read %s", sha1_to_hex(child->sha1));
+ data = delta_against(data, size, child);
+ memcpy(buf, data, child->delta_size);
+ buf += child->delta_size;
+ free(data);
+ written++;
+ written_delta++;
+ buf = concat_delta_data(child, buf);
+ child = child->delta_sibling;
+ }
+ return buf;
+}
+
+static unsigned long write_one(struct sha1file *f,
+ struct object_entry *e,
+ unsigned long offset)
+{
+ unsigned char header[10];
+ unsigned count, hdrlen;
+ unsigned long size;
+ void *buf, *p;
+
+ if (e->offset)
+ /* offset starts from header size and cannot be zero
+ * if it is written already.
+ */
+ return offset;
+ if (!e->delta) {
+ e->offset = offset;
+ return offset + write_object(f, e);
+ }
+
+ /* find delta root object */
+ while (e->delta)
+ e = e->delta;
+
+ /* count how many deltas depend on it */
+ count = count_delta_childs(e, offset);
+
+ /* write header for object group */
+ hdrlen = encode_header(0, count, header);
+ sha1write(f, header, hdrlen);
+ offset += hdrlen;
+
+ /* write each object's header and find total data size */
+ size = 0;
+ offset += write_delta_headers(f, e, &size);
+
+ /* write concatenated object data buffer */
+ buf = xmalloc(size);
+ p = concat_delta_data(e, buf);
+ offset += sha1write_compressed(f, buf, p-buf);
+ free(buf);
+
+ return offset;
+}
+
+#endif
+
static void write_pack_file(void)
{
int i;
Nicolas