git.vger.kernel.org archive mirror
* git-svn: cat-file memory usage
@ 2015-09-16 11:00 Victor Leschuk
  2015-09-16 11:56 ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Victor Leschuk @ 2015-09-16 11:00 UTC (permalink / raw)
  To: git@vger.kernel.org

Hello all,

We are currently getting acquainted with the git-svn tool and have run into a few problems with it. The main issue is memory usage during "git svn clone": on large repositories the perl and git processes use a significant amount of memory.

I have conducted several tests with different repositories. I created mirrors of the Trac project (http://trac.edgewall.org/ - a rather small repo, ~14000 commits) and the FreeBSD base repo (~280000 commits). Here is a summary of my tests (to eliminate network issues, all clones were performed from file:// repos):

 * git svn clone of trac takes about 1 hour
 * git svn clone of FreeBSD has already taken more than 3 days and is still running (about 40% of revisions cloned so far)
 * the memory footprint of the git cat-file process keeps growing during the clone (see the attached figure)

The main issue here is git cat-file consuming memory. The attached figure is for the small repository, which takes about an hour to clone. However, on another machine of mine, where the FreeBSD clone is currently running, git cat-file has already taken more than 1 GB of memory (RSS) and has outgrown the parent perl process (~300-400 MB).

I have run git cat-file (which runs in --batch mode during the whole clone) under valgrind and found no serious leaks (about 100 bytes definitely leaked), so all memory is carefully freed, but heap usage still grows, maybe due to fragmentation or something else. When I looked through the code I found that most heap allocations come from the batch_object_write() function (strbuf_expand -> realloc).

So I have found two possible workarounds for the issue: 

 * Set the GIT_ALLOC_LIMIT variable - it does reduce the memory footprint but slows down the process
 * In the perl code, do not run git cat-file in batch mode (in Git::SVN::apply_textdelta) but rather run it as a separate command each time

   my $size = $self->command_oneline('cat-file', '-s', $sha1);
   # .....
   my ($in, $c) = $self->command_output_pipe('cat-file', 'blob', $sha1);

The second approach doesn't slow down the whole process at all (~72 minutes to clone the repo both with and without --batch mode).
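
For context on what batch mode provides: with its default format, "git cat-file --batch" answers each object name read from stdin with a header line of the form "<sha1> SP <type> SP <size> LF", followed by the object contents and a final LF, or with "<object> SP missing LF" if the object does not exist. That header is where the size comes from without a separate "cat-file -s" call. Below is a minimal sketch of parsing such a header line in C. The struct and function names are hypothetical (they are not part of git or of Git.pm), and the sketch assumes object names without embedded spaces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one parsed header line; not a git data structure. */
struct batch_header {
	char name[65];      /* object name as echoed by cat-file */
	char type[16];      /* e.g. "blob", "tree", "commit", "tag" */
	unsigned long size; /* object size in bytes */
	int missing;        /* 1 if cat-file reported "<object> missing" */
};

/*
 * Parse "<sha1> <type> <size>\n" or "<object> missing\n".
 * Returns 0 on success, -1 if the line matches neither form.
 * Assumption: the object name contains no whitespace.
 */
int parse_batch_header(const char *line, struct batch_header *h)
{
	char second[16];
	int n;

	memset(h, 0, sizeof(*h));
	n = sscanf(line, "%64s %15s %lu", h->name, second, &h->size);
	if (n == 2 && !strcmp(second, "missing")) {
		h->missing = 1;
		return 0;
	}
	if (n != 3)
		return -1;
	strcpy(h->type, second); /* fits: both arrays hold 16 bytes */
	return 0;
}
```

A caller would read one line from the cat-file pipe, pass it to parse_batch_header(), and then read exactly h.size bytes of contents plus the trailing newline before sending the next request.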

So the question is: what would be the correct approach to the cat-file memory usage problem? Should we get rid of batch mode in the perl code, or somehow tune the allocation policy in the C code?

Please let me know your thoughts. 

--
Best Regards,
Victor Leschuk

[-- Attachment #2: mem_usage.png --]
[-- Type: image/png, Size: 40523 bytes --]


* Re: git-svn: cat-file memory usage
  2015-09-16 11:00 git-svn: cat-file memory usage Victor Leschuk
@ 2015-09-16 11:56 ` Jeff King
  2015-09-16 13:40   ` Victor Leschuk
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2015-09-16 11:56 UTC (permalink / raw)
  To: Victor Leschuk; +Cc: git@vger.kernel.org

On Wed, Sep 16, 2015 at 04:00:48AM -0700, Victor Leschuk wrote:

>  * git svn clone of trac takes about 1 hour
>  * git svn clone of FreeBSD has already taken more than 3 days and
>  is still running (currently has cloned about 40% of revisions)

I haven't worked with git-svn in a long time, but I doubt that it is the
fastest way to do a large repository import. You might want to look into
a tool like svn2git or reposurgeon to do the initial import.

> I have valgrind'ed the git-cat-file (which is running in --batch mode
> during the whole clone) and found no serious leaks (about 100 bytes
> definitely leaked), so all memory is carefully freed, but the heap
> usage grows maybe due to fragmentation or something else. When I looked
> through the code I found out that most of heap allocations are called
> from batch_object_write() function (strbuf_expand -> realloc).

Certainly we will call strbuf_expand once per object. I would have
expected we would call read_sha1_file(), too. It looks like we always
try to stream blobs, but I think we have to fall back to reading the
whole object if there are deltas.

You can try this patch, which will reuse the same strbuf over and over:

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 07baad1..73f338c 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -256,7 +256,7 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
 static void batch_object_write(const char *obj_name, struct batch_options *opt,
 			       struct expand_data *data)
 {
-	struct strbuf buf = STRBUF_INIT;
+	static struct strbuf buf = STRBUF_INIT;
 
 	if (sha1_object_info_extended(data->sha1, &data->info, LOOKUP_REPLACE_OBJECT) < 0) {
 		printf("%s missing\n", obj_name ? obj_name : sha1_to_hex(data->sha1));
@@ -264,10 +264,10 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
 		return;
 	}
 
+	strbuf_reset(&buf);
 	strbuf_expand(&buf, opt->format, expand_format, data);
 	strbuf_addch(&buf, '\n');
 	batch_write(opt, buf.buf, buf.len);
-	strbuf_release(&buf);
 
 	if (opt->print_contents) {
 		print_object_or_die(opt, data);

That will reduce your reallocs due to strbuf_expand, though I'm doubtful
that it will solve the problem (and if it does, I think the right
solution is probably to look into using a better allocator than what
your system malloc() is providing).
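
To illustrate what the patch changes, here is a minimal sketch of the same reuse pattern outside of git. It uses a hypothetical mini_buf type (a stand-in, NOT git's strbuf API) and counts how often the buffer has to call realloc() when it is released after every line versus reset and reused. The header-like string and all names are made up for illustration. The sketch only counts allocator calls; it says nothing about resident memory or fragmentation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a growable string buffer; NOT git's strbuf API. */
struct mini_buf {
	char *data;
	size_t len, cap;
};

static size_t grow_calls; /* number of realloc() calls made so far */

/* Append a NUL-terminated string, growing the allocation if needed. */
static void mini_buf_add(struct mini_buf *b, const char *s)
{
	size_t n = strlen(s);

	if (b->len + n + 1 > b->cap) {
		size_t cap = 2 * (b->len + n + 1);
		char *p = realloc(b->data, cap);
		if (!p) {
			perror("realloc");
			exit(EXIT_FAILURE);
		}
		b->data = p;
		b->cap = cap;
		grow_calls++;
	}
	memcpy(b->data + b->len, s, n + 1); /* copies the trailing NUL too */
	b->len += n;
	assert(b->len < b->cap);
}

/* Forget the contents but keep the allocation for the next use. */
static void mini_buf_reset(struct mini_buf *b)
{
	b->len = 0;
}

/* Hand the allocation back to the allocator. */
static void mini_buf_release(struct mini_buf *b)
{
	free(b->data);
	b->data = NULL;
	b->len = b->cap = 0;
}

/*
 * Format `iterations` header-like lines and return how many times the
 * buffer called realloc(). With reuse != 0 a single buffer is reset and
 * reused; otherwise it is released after every line.
 */
size_t count_grows(int iterations, int reuse)
{
	struct mini_buf b = { NULL, 0, 0 };
	int i;

	grow_calls = 0;
	for (i = 0; i < iterations; i++) {
		if (reuse)
			mini_buf_reset(&b);
		mini_buf_add(&b, "0123456789abcdef0123456789abcdef01234567 blob 1234");
		mini_buf_add(&b, "\n");
		if (!reuse)
			mini_buf_release(&b);
	}
	mini_buf_release(&b);
	return grow_calls;
}
```

Calling count_grows(1000, 0) from a small main() reports one realloc() per line, while count_grows(1000, 1) reports a single realloc() for the whole run, since the first allocation is large enough for every later line.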

>  * In perl code do not run git cat-file in batch mode (in
>  Git::SVN::apply_textdelta) but rather run it as separate commands
>  each time
> 
>    my $size = $self->command_oneline('cat-file', '-s', $sha1);
>    # .....
>    my ($in, $c) = $self->command_output_pipe('cat-file', 'blob', $sha1);
> 
> The second approach doesn't slow down the whole process at all (~72
> minutes to clone repo both with --batch mode and without).

I'm surprised the startup cost of the process doesn't make an impact,
but maybe it gets lost in the noise of the rest of the work (AFAICT, the
point of this cat-file is to retrieve a blob, apply a delta to it, and
then write out the resulting object; that write is probably a lot more
expensive).

-Peff


* RE: git-svn: cat-file memory usage
  2015-09-16 11:56 ` Jeff King
@ 2015-09-16 13:40   ` Victor Leschuk
  2015-09-16 16:31     ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Victor Leschuk @ 2015-09-16 13:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git@vger.kernel.org

Hello Jeff, thanks for the advice.

Unfortunately, applying the patch didn't change the situation. I will run some tests with alternate allocators (looking at jemalloc and tcmalloc). As for alternate tools: as far as I understand, svn2git calls 'git svn' itself. So I assume it can't fix the memory usage or speed up the clone process... Correct me if I'm wrong.

Reposurgeon looks interesting... Will give it a try. 

Btw, what do you think of getting rid of batch mode for clone/fetch in the perl code? It has hardly any impact on performance but reduces memory usage a lot.

--
Best Regards,
Victor


* Re: git-svn: cat-file memory usage
  2015-09-16 13:40   ` Victor Leschuk
@ 2015-09-16 16:31     ` Jeff King
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2015-09-16 16:31 UTC (permalink / raw)
  To: Victor Leschuk; +Cc: git@vger.kernel.org

On Wed, Sep 16, 2015 at 06:40:23AM -0700, Victor Leschuk wrote:

> Unfortunately, applying the patch didn't change the situation. I will
> run some tests with alternate allocators (looking at jemalloc and
> tcmalloc). As for alternate tools: as far as I understand, svn2git
> calls 'git svn' itself. So I assume it can't fix the memory usage or
> speed up the clone process... Correct me if I'm wrong.

I think there are actually several tools calling themselves svn2git.
There was a C tool once upon a time, but it looks fairly inactive, and
the top search hit for svn2git does turn up a git-svn wrapper. Like I
said, I am not very up on the current state of affairs.

> Btw, what do you think of getting rid of batch mode for clone/fetch in
> the perl code? It has hardly any impact on performance but reduces
> memory usage a lot.

I'd worry there are other cases where it does impact performance (e.g.,
perhaps smaller blobs), but I don't know enough about the git-svn
internals to say much more.

-Peff

