git.vger.kernel.org archive mirror
From: Jeff King <peff@peff.net>
To: Victor Leschuk <vleschuk@accesssoftek.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: git-svn: cat-file memory usage
Date: Wed, 16 Sep 2015 07:56:42 -0400
Message-ID: <20150916115642.GA5104@sigill.intra.peff.net>
In-Reply-To: <6AE1604EE3EC5F4296C096518C6B77EE5D0FDAB9CB@mail.accesssoftek.com>

On Wed, Sep 16, 2015 at 04:00:48AM -0700, Victor Leschuk wrote:

>  * git svn clone of trac takes about 1 hour
>  * git svn clone of FreeBSD has already taken more than 3 days and
>  is still running (it has cloned about 40% of revisions so far)

I haven't worked with git-svn in a long time, but I doubt that it is the
fastest way to do a large repository import. You might want to look into
a tool like svn2git or reposurgeon to do the initial import.

> I have valgrind'ed git-cat-file (which runs in --batch mode
> during the whole clone) and found no serious leaks (about 100 bytes
> definitely leaked), so all memory is carefully freed, but the heap
> usage grows, maybe due to fragmentation or something else. When I
> looked through the code I found that most heap allocations come
> from the batch_object_write() function (strbuf_expand -> realloc).

Certainly we will call strbuf_expand once per object. I would have
expected we would call read_sha1_file(), too. It looks like we always
try to stream blobs, but I think we have to fall back to reading the
whole object if there are deltas.

You can try this patch, which will reuse the same strbuf over and over:

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 07baad1..73f338c 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -256,7 +256,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
 static void batch_object_write(const char *obj_name, struct batch_options *opt,
 			       struct expand_data *data)
 {
-	struct strbuf buf = STRBUF_INIT;
+	static struct strbuf buf = STRBUF_INIT;
 
 	if (sha1_object_info_extended(data->sha1, &data->info, LOOKUP_REPLACE_OBJECT) < 0) {
 		printf("%s missing\n", obj_name ? obj_name : sha1_to_hex(data->sha1));
@@ -264,10 +264,10 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
 		return;
 	}
 
+	strbuf_reset(&buf);
 	strbuf_expand(&buf, opt->format, expand_format, data);
 	strbuf_addch(&buf, '\n');
 	batch_write(opt, buf.buf, buf.len);
-	strbuf_release(&buf);
 
 	if (opt->print_contents) {
 		print_object_or_die(opt, data);

That will reduce your reallocs due to strbuf_expand, though I'm doubtful
that it will solve the problem (and if it does, I think the right
solution is probably to look into using a better allocator than what
your system malloc() is providing).
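
To show the reuse pattern in isolation, here is a minimal standalone
sketch. It does not use git's real strbuf API: `sketch_buf`,
`sketch_buf_reset`, `sketch_buf_addf`, and `format_header` are
hypothetical names made up for illustration. The idea is the same as in
the patch above: a function-static buffer is reset (length set to zero,
allocation kept) instead of being released, so later calls only hit
realloc() when a line is longer than anything seen before.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for git's strbuf; NOT the real API. */
struct sketch_buf {
	char *data;
	size_t len;
	size_t alloc;
};

/* Drop the contents but keep the allocation (analogous to strbuf_reset). */
static void sketch_buf_reset(struct sketch_buf *b)
{
	b->len = 0;
	if (b->data)
		b->data[0] = '\0';
}

/*
 * Append formatted text, growing the allocation only when it is too
 * small. Returns 1 if realloc() was called, 0 otherwise.
 */
static int sketch_buf_addf(struct sketch_buf *b, const char *fmt, ...)
{
	va_list ap;
	int need, grew = 0;

	/* First pass: measure how many bytes the formatted text needs. */
	va_start(ap, fmt);
	need = vsnprintf(NULL, 0, fmt, ap);
	va_end(ap);
	if (need < 0)
		abort();

	if (b->len + (size_t)need + 1 > b->alloc) {
		size_t want = (b->len + (size_t)need + 1) * 2;
		char *p = realloc(b->data, want);
		if (!p)
			abort();
		b->data = p;
		b->alloc = want;
		grew = 1;
	}

	/* Second pass: write into the (possibly grown) buffer. */
	va_start(ap, fmt);
	vsnprintf(b->data + b->len, b->alloc - b->len, fmt, ap);
	va_end(ap);
	b->len += (size_t)need;
	return grew;
}

/*
 * Format one header line per object into a buffer that lives across
 * calls. *out points into that buffer and is only valid until the next
 * call. Returns 1 if this call had to reallocate.
 */
static int format_header(const char *hex, const char *type,
			 unsigned long size, const char **out)
{
	static struct sketch_buf buf; /* zero-initialized, never freed */
	int grew;

	sketch_buf_reset(&buf);
	grew = sketch_buf_addf(&buf, "%s %s %lu\n", hex, type, size);
	*out = buf.data;
	return grew;
}
```

The trade-off is the usual one for a static buffer: the function is no
longer reentrant, and the memory is held until the process exits. For a
long-running single-threaded loop that is probably acceptable.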

>  * In perl code do not run git cat-file in batch mode (in
>  Git::SVN::apply_textdelta) but rather run it as separate commands
>  each time
> 
>    my $size = $self->command_oneline('cat-file', '-s', $sha1);
>    # .....
>    my ($in, $c) = $self->command_output_pipe('cat-file', 'blob', $sha1);
> 
> The second approach doesn't slow down the whole process at all (~72
> minutes to clone repo both with --batch mode and without).

I'm surprised the startup cost of the process doesn't make an impact,
but maybe it gets lost in the noise of the rest of the work (AFAICT, the
point of this cat-file is to retrieve a blob, apply a delta to it, and
then write out the resulting object; that write is probably a lot more
expensive).
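
For reference, what the long-running reader has to do per object is
small. With the default format, each --batch response starts with a
header line of the form "<name> <type> <size>", followed by <size> bytes
of content and a newline, or "<name> missing" if the object does not
exist. A rough sketch of parsing that header follows; the struct and
function names are hypothetical, and it assumes the object name itself
contains no whitespace:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one parsed "--batch" header line. */
struct batch_header {
	char name[65];      /* object name as echoed back; up to 64 chars */
	char type[16];      /* "blob", "tree", "commit", or "tag" */
	unsigned long size; /* number of content bytes that follow */
	int missing;        /* set when the line is "<name> missing" */
};

/*
 * Parse "<name> <type> <size>" or "<name> missing".
 * Returns 0 on success and -1 on a malformed line.
 */
static int parse_batch_header(const char *line, struct batch_header *h)
{
	char word[16];

	memset(h, 0, sizeof(*h));
	if (sscanf(line, "%64s %15s", h->name, word) != 2)
		return -1;
	if (!strcmp(word, "missing")) {
		h->missing = 1;
		return 0;
	}
	/* Skip the name, then read the type and the decimal size. */
	if (sscanf(line, "%*s %15s %lu", h->type, &h->size) != 2)
		return -1;
	return 0;
}
```

After a successful, non-missing header the caller would read exactly
h->size bytes of content plus the trailing newline before sending the
next request.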

-Peff

Thread overview: 4+ messages
2015-09-16 11:00 git-svn: cat-file memory usage Victor Leschuk
2015-09-16 11:56 ` Jeff King [this message]
2015-09-16 13:40   ` Victor Leschuk
2015-09-16 16:31     ` Jeff King
