[PATCH 0/2] fast-import: improve deltas for blobs

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] fast-import: improve deltas for blobs
@ 2011-08-20 19:04 Dmitry Ivankov
  2011-08-20 19:04 ` [PATCH 1/2] fast-import: count and report # of calls to diff_delta in stats Dmitry Ivankov
  2011-08-20 19:04 ` [PATCH 2/2] fast-import: treat cat-blob as a delta base hint for next blob Dmitry Ivankov
  0 siblings, 2 replies; 5+ messages in thread
From: Dmitry Ivankov @ 2011-08-20 19:04 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, Shawn O. Pearce, David Barr, Dmitry Ivankov

Currently delta base for blob objects is just a previous blob object
written. This way we just keep the last one in memory and it's cheap
(not too smart though and gains no pack size reduction most of the time).
If we also keep as last blob a response to cat-blob (whose main purpose
is to provide delta bases for a importer), svn-fe imports become faster
and packs produced become smaller.

1/2 adds a diff_delta attemps count as a related and interesting number
2/2 gives a nice performance improvement for svn-fe produced imports

Dmitry Ivankov (2):
  fast-import: count and report # of calls to diff_delta in stats
  fast-import: treat cat-blob as a delta base hint for next blob

 fast-import.c |   17 ++++++++++++-----
 1 files changed, 12 insertions(+), 5 deletions(-)

-- 
1.7.3.4

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] fast-import: count and report # of calls to diff_delta in stats
  2011-08-20 19:04 [PATCH 0/2] fast-import: improve deltas for blobs Dmitry Ivankov
@ 2011-08-20 19:04 ` Dmitry Ivankov
  2011-08-20 19:04 ` [PATCH 2/2] fast-import: treat cat-blob as a delta base hint for next blob Dmitry Ivankov
  1 sibling, 0 replies; 5+ messages in thread
From: Dmitry Ivankov @ 2011-08-20 19:04 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, Shawn O. Pearce, David Barr, Dmitry Ivankov

It's an interesting number, how often do we try to deltify each type of
objects and how often do we succeed. So do add it to stats.

Success doesn't mean much gain in pack size though. As we allow delta to
be as big as (data.len - 20). And delta close to data.len gains nothing
compared to no delta at all even after zlib compression (delta is pretty
much the same as data, just with few modifications).

We should try to make less attempts that result in huge deltas as these
consume more cpu than trivial small deltas. Either by choosing a better
delta base or reducing delta size upper bound or doing less delta attempts
at all.

Currently, delta base for blobs is a waste literally. Each blob delta
base is chosen as a previously stored blob. Disabling deltas for blobs
doesn't increase pack size and reduce import time, or at least doesn't
increase time for all fast-import streams I've tried.

Signed-off-by: Dmitry Ivankov <divanorama@gmail.com>
---
 fast-import.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index 7cc2262..2b069e3 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -284,6 +284,7 @@ static uintmax_t marks_set_count;
 static uintmax_t object_count_by_type[1 << TYPE_BITS];
 static uintmax_t duplicate_count_by_type[1 << TYPE_BITS];
 static uintmax_t delta_count_by_type[1 << TYPE_BITS];
+static uintmax_t delta_count_attempts_by_type[1 << TYPE_BITS];
 static unsigned long object_count;
 static unsigned long branch_count;
 static unsigned long branch_load_count;
@@ -1045,6 +1046,7 @@ static int store_object(
 	}
 
 	if (last && last->data.buf && last->depth < max_depth && dat->len > 20) {
+		delta_count_attempts_by_type[type]++;
 		delta = diff_delta(last->data.buf, last->data.len,
 			dat->buf, dat->len,
 			&deltalen, dat->len - 20);
@@ -3338,10 +3340,10 @@ int main(int argc, const char **argv)
 		fprintf(stderr, "---------------------------------------------------------------------\n");
 		fprintf(stderr, "Alloc'd objects: %10" PRIuMAX "\n", alloc_count);
 		fprintf(stderr, "Total objects:   %10" PRIuMAX " (%10" PRIuMAX " duplicates                  )\n", total_count, duplicate_count);
-		fprintf(stderr, "      blobs  :   %10" PRIuMAX " (%10" PRIuMAX " duplicates %10" PRIuMAX " deltas)\n", object_count_by_type[OBJ_BLOB], duplicate_count_by_type[OBJ_BLOB], delta_count_by_type[OBJ_BLOB]);
-		fprintf(stderr, "      trees  :   %10" PRIuMAX " (%10" PRIuMAX " duplicates %10" PRIuMAX " deltas)\n", object_count_by_type[OBJ_TREE], duplicate_count_by_type[OBJ_TREE], delta_count_by_type[OBJ_TREE]);
-		fprintf(stderr, "      commits:   %10" PRIuMAX " (%10" PRIuMAX " duplicates %10" PRIuMAX " deltas)\n", object_count_by_type[OBJ_COMMIT], duplicate_count_by_type[OBJ_COMMIT], delta_count_by_type[OBJ_COMMIT]);
-		fprintf(stderr, "      tags   :   %10" PRIuMAX " (%10" PRIuMAX " duplicates %10" PRIuMAX " deltas)\n", object_count_by_type[OBJ_TAG], duplicate_count_by_type[OBJ_TAG], delta_count_by_type[OBJ_TAG]);
+		fprintf(stderr, "      blobs  :   %10" PRIuMAX " (%10" PRIuMAX " duplicates %10" PRIuMAX " deltas of %10" PRIuMAX" attempts)\n", object_count_by_type[OBJ_BLOB], duplicate_count_by_type[OBJ_BLOB], delta_count_by_type[OBJ_BLOB], delta_count_attempts_by_type[OBJ_BLOB]);
+		fprintf(stderr, "      trees  :   %10" PRIuMAX " (%10" PRIuMAX " duplicates %10" PRIuMAX " deltas of %10" PRIuMAX" attempts)\n", object_count_by_type[OBJ_TREE], duplicate_count_by_type[OBJ_TREE], delta_count_by_type[OBJ_TREE], delta_count_attempts_by_type[OBJ_TREE]);
+		fprintf(stderr, "      commits:   %10" PRIuMAX " (%10" PRIuMAX " duplicates %10" PRIuMAX " deltas of %10" PRIuMAX" attempts)\n", object_count_by_type[OBJ_COMMIT], duplicate_count_by_type[OBJ_COMMIT], delta_count_by_type[OBJ_COMMIT], delta_count_attempts_by_type[OBJ_COMMIT]);
+		fprintf(stderr, "      tags   :   %10" PRIuMAX " (%10" PRIuMAX " duplicates %10" PRIuMAX " deltas of %10" PRIuMAX" attempts)\n", object_count_by_type[OBJ_TAG], duplicate_count_by_type[OBJ_TAG], delta_count_by_type[OBJ_TAG], delta_count_attempts_by_type[OBJ_TAG]);
 		fprintf(stderr, "Total branches:  %10lu (%10lu loads     )\n", branch_count, branch_load_count);
 		fprintf(stderr, "      marks:     %10" PRIuMAX " (%10" PRIuMAX " unique    )\n", (((uintmax_t)1) << marks->shift) * 1024, marks_set_count);
 		fprintf(stderr, "      atoms:     %10u\n", atom_cnt);
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] fast-import: treat cat-blob as a delta base hint for next blob
  2011-08-20 19:04 [PATCH 0/2] fast-import: improve deltas for blobs Dmitry Ivankov
  2011-08-20 19:04 ` [PATCH 1/2] fast-import: count and report # of calls to diff_delta in stats Dmitry Ivankov
@ 2011-08-20 19:04 ` Dmitry Ivankov
  2011-08-20 19:17   ` Jonathan Nieder
  1 sibling, 1 reply; 5+ messages in thread
From: Dmitry Ivankov @ 2011-08-20 19:04 UTC (permalink / raw)
  To: git; +Cc: Jonathan Nieder, Shawn O. Pearce, David Barr, Dmitry Ivankov

Delta base for blobs is chosen as a previously saved blob. If we
treat cat-blob's blob as a delta base for the next blob, nothing
is likely to become worse.

For fast-import stream producer like svn-fe cat-blob is used like
following:
- svn-fe reads file delta in svn format
- to apply it, svn-fe asks cat-blob 'svn delta base'
- applies 'svn delta' to the response
- produces a blob command to store the result

Currently there is no way for svn-fe to give fast-import a hint on
object delta base. While what's requested in cat-blob is most of
the time a best delta base possible. Of course, it could be not a
good delta base, but we don't know any better one anyway.

So do treat cat-blob's result as a delta base for next blob. The
profit is nice: 2x to 7x reduction in pack size AND 1.2x to 3x
time speedup due to diff_delta being faster on good deltas. git gc
--aggressive can compress it even more, by 10% to 70%, utilizing
more cpu time, real time and 3 cpu cores.

Tested on 213M and 2.7G fast-import streams, resulting packs are 22M
and 113M, import time is 7s and 60s, both streams are produced by
svn-fe, sniffed and then used as raw input for fast-import.

For git-fast-export produced streams there is no change as it doesn't
use cat-blob and doesn't try to reorder blobs in some smart way to
make successive deltas small.

Signed-off-by: Dmitry Ivankov <divanorama@gmail.com>
---
 fast-import.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index 2b069e3..0480fbf 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2802,7 +2802,12 @@ static void cat_blob(struct object_entry *oe, unsigned char sha1[20])
 	strbuf_release(&line);
 	cat_blob_write(buf, size);
 	cat_blob_write("\n", 1);
-	free(buf);
+	if (oe && oe->pack_id == pack_id) {
+		last_blob.offset = oe->idx.offset;
+		strbuf_attach(&last_blob.data, buf, size, size);
+		last_blob.depth = oe->depth;
+	} else
+		free(buf);
 }
 
 static void parse_cat_blob(void)
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] fast-import: treat cat-blob as a delta base hint for next blob
  2011-08-20 19:04 ` [PATCH 2/2] fast-import: treat cat-blob as a delta base hint for next blob Dmitry Ivankov
@ 2011-08-20 19:17   ` Jonathan Nieder
  2011-08-21 11:01     ` David Michael Barr
  0 siblings, 1 reply; 5+ messages in thread
From: Jonathan Nieder @ 2011-08-20 19:17 UTC (permalink / raw)
  To: Dmitry Ivankov; +Cc: git, Shawn O. Pearce, David Barr

Dmitry Ivankov wrote:

> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -2802,7 +2802,12 @@ static void cat_blob(struct object_entry *oe, unsigned char sha1[20])
>  	strbuf_release(&line);
>  	cat_blob_write(buf, size);
>  	cat_blob_write("\n", 1);
> -	free(buf);
> +	if (oe && oe->pack_id == pack_id) {
> +		strbuf_attach(&last_blob.data, buf, size, size);
> +		last_blob.offset = oe->idx.offset;
> +		last_blob.depth = oe->depth;
> +	} else
> +		free(buf);
>  }

Neat.  For what it's worth,
Acked-by: Jonathan Nieder <jrnieder@gmail.com>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] fast-import: treat cat-blob as a delta base hint for next blob
  2011-08-20 19:17   ` Jonathan Nieder
@ 2011-08-21 11:01     ` David Michael Barr
  0 siblings, 0 replies; 5+ messages in thread
From: David Michael Barr @ 2011-08-21 11:01 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Dmitry Ivankov, git, Shawn O. Pearce

On Sun, Aug 21, 2011 at 5:17 AM, Jonathan Nieder <jrnieder@gmail.com> wrote:
> Dmitry Ivankov wrote:
>
>> --- a/fast-import.c
>> +++ b/fast-import.c
>> @@ -2802,7 +2802,12 @@ static void cat_blob(struct object_entry *oe, unsigned char sha1[20])
>>       strbuf_release(&line);
>>       cat_blob_write(buf, size);
>>       cat_blob_write("\n", 1);
>> -     free(buf);
>> +     if (oe && oe->pack_id == pack_id) {
>> +             strbuf_attach(&last_blob.data, buf, size, size);
>> +             last_blob.offset = oe->idx.offset;
>> +             last_blob.depth = oe->depth;
>> +     } else
>> +             free(buf);
>>  }
>
> Neat.  For what it's worth,
> Acked-by: Jonathan Nieder <jrnieder@gmail.com>
>

Brilliant!

Acked-by: David Barr <davidbarr@google.com>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-08-21 11:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-20 19:04 [PATCH 0/2] fast-import: improve deltas for blobs Dmitry Ivankov
2011-08-20 19:04 ` [PATCH 1/2] fast-import: count and report # of calls to diff_delta in stats Dmitry Ivankov
2011-08-20 19:04 ` [PATCH 2/2] fast-import: treat cat-blob as a delta base hint for next blob Dmitry Ivankov
2011-08-20 19:17   ` Jonathan Nieder
2011-08-21 11:01     ` David Michael Barr

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).