git.vger.kernel.org archive mirror
From: Dmitry Ivankov <divanorama@gmail.com>
To: git@vger.kernel.org
Cc: Jonathan Nieder <jrnieder@gmail.com>,
	"Shawn O. Pearce" <spearce@spearce.org>,
	David Barr <davidbarr@google.com>,
	Dmitry Ivankov <divanorama@gmail.com>
Subject: [PATCH 2/2] fast-import: treat cat-blob as a delta base hint for next blob
Date: Sun, 21 Aug 2011 01:04:12 +0600
Message-ID: <1313867052-11993-3-git-send-email-divanorama@gmail.com>
In-Reply-To: <1313867052-11993-1-git-send-email-divanorama@gmail.com>

The delta base for a blob is chosen to be the previously saved blob.
If we instead treat the blob requested by cat-blob as the delta base
for the next blob, nothing is likely to get worse.

A fast-import stream producer such as svn-fe uses cat-blob as
follows:
- svn-fe reads a file delta in svn format
- to apply it, svn-fe requests the 'svn delta base' via cat-blob
- it applies the 'svn delta' to the response
- it produces a blob command to store the result
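
For illustration only, here is a minimal sketch of how a producer
might check the first line of a cat-blob response before reading the
contents. It assumes the response header format given in the
git-fast-import documentation, "<sha1> SP 'blob' SP <size> LF";
parse_cat_blob_header is a hypothetical helper and is not taken from
svn-fe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse the header line of a cat-blob response,
 * "<sha1> SP 'blob' SP <size> LF".
 * Returns 0 on success and -1 on malformed input.
 * Hypothetical helper for illustration, not svn-fe code.
 */
static int parse_cat_blob_header(const char *line, char sha1_hex[41],
				 unsigned long *size)
{
	char type[5];
	int consumed = 0;

	assert(line && sha1_hex && size);
	if (sscanf(line, "%40[0-9a-f] %4s %lu%n",
		   sha1_hex, type, size, &consumed) != 3)
		return -1;
	if (strlen(sha1_hex) != 40 || strcmp(type, "blob"))
		return -1;
	/* Only a trailing LF (or nothing) may follow the size. */
	if (line[consumed] != '\0' && strcmp(line + consumed, "\n"))
		return -1;
	return 0;
}
```

After a successful parse the producer would read exactly <size> bytes
of contents plus the terminating LF, apply the svn delta, and emit the
resulting blob command.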

Currently there is no way for svn-fe to give fast-import a hint about
an object's delta base, even though the blob requested via cat-blob
is, most of the time, the best delta base available. Of course, it
may turn out to be a poor delta base, but we don't know of a better
one anyway.

So treat cat-blob's result as the delta base for the next blob. The
gain is substantial: a 2x to 7x reduction in pack size and a 1.2x to
3x speedup, because diff_delta is faster on good deltas. git gc
--aggressive can shrink the pack further, by 10% to 70%, at the cost
of more cpu time and real time, while using 3 cpu cores.
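
The buffer ownership rule can be sketched in isolation as below. This
is a simplified stand-in, not the fast-import code: struct
last_blob_hint and offer_delta_base are hypothetical names, and plain
malloc'ed buffers replace the strbuf used in the patch. The buffer is
kept as the hint only when the object is in the pack currently being
written; otherwise it is freed as before.

```c
#include <stdlib.h>

/* Simplified stand-in for fast-import's record of the last blob. */
struct last_blob_hint {
	char *data;           /* owned buffer, or NULL */
	size_t len;
	unsigned long offset; /* offset of the object in the current pack */
	unsigned depth;       /* delta chain depth of that object */
};

/*
 * Offer a buffer returned by cat-blob as the delta base for the next
 * blob. Ownership is taken only if the object is in the current pack;
 * otherwise the buffer is freed. Returns 1 if the buffer was kept.
 */
static int offer_delta_base(struct last_blob_hint *last, char *buf,
			    size_t len, unsigned long offset,
			    unsigned depth, unsigned obj_pack_id,
			    unsigned cur_pack_id)
{
	if (obj_pack_id != cur_pack_id) {
		free(buf);
		return 0;
	}
	free(last->data); /* drop the previous hint */
	last->data = buf;
	last->len = len;
	last->offset = offset;
	last->depth = depth;
	return 1;
}
```

Either way the caller gives up the buffer, so no path leaks it or
frees it twice.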

Tested on 213M and 2.7G fast-import streams: the resulting packs are
22M and 113M, and the import times are 7s and 60s. Both streams were
produced by svn-fe, captured, and then used as raw input for
fast-import.

For streams produced by git-fast-export there is no change, as it
doesn't use cat-blob and doesn't try to reorder blobs in some smart
way to make successive deltas small.

Signed-off-by: Dmitry Ivankov <divanorama@gmail.com>
---
 fast-import.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fast-import.c b/fast-import.c
index 2b069e3..0480fbf 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2802,7 +2802,12 @@ static void cat_blob(struct object_entry *oe, unsigned char sha1[20])
 	strbuf_release(&line);
 	cat_blob_write(buf, size);
 	cat_blob_write("\n", 1);
-	free(buf);
+	if (oe && oe->pack_id == pack_id) {
+		last_blob.offset = oe->idx.offset;
+		strbuf_attach(&last_blob.data, buf, size, size);
+		last_blob.depth = oe->depth;
+	} else
+		free(buf);
 }
 
 static void parse_cat_blob(void)
-- 
1.7.3.4
