git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] git-svn: make batch mode optional for git-cat-file
@ 2015-09-21 13:51 Victor Leschuk
  2015-09-21 18:25 ` Junio C Hamano
  0 siblings, 1 reply; 10+ messages in thread
From: Victor Leschuk @ 2015-09-21 13:51 UTC (permalink / raw)
  To: git; +Cc: Victor Leschuk

[-- Attachment #1: Type: text/plain, Size: 196 bytes --]


Signed-off-by: Victor Leschuk <vleschuk@accesssoftek.com>
---
 git-svn.perl |  1 +
 perl/Git.pm  | 41 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-git-svn-make-batch-mode-optional-for-git-cat-file.patch --]
[-- Type: text/x-patch; name="0001-git-svn-make-batch-mode-optional-for-git-cat-file.patch", Size: 2225 bytes --]

diff --git a/git-svn.perl b/git-svn.perl
index 36f7240..b793c26 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -139,6 +139,7 @@ my %fc_opts = ( 'follow-parent|follow!' => \$Git::SVN::_follow_parent,
 		'use-log-author' => \$Git::SVN::_use_log_author,
 		'add-author-from' => \$Git::SVN::_add_author_from,
 		'localtime' => \$Git::SVN::_localtime,
+		'no-cat-file-batch' => sub { $Git::no_cat_file_batch = 1; },
 		%remote_opts );
 
 my ($_trunk, @_tags, @_branches, $_stdlayout);
diff --git a/perl/Git.pm b/perl/Git.pm
index 19ef081..69e5293 100644
--- a/perl/Git.pm
+++ b/perl/Git.pm
@@ -107,6 +107,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR);
 use Time::Local qw(timegm);
 }
 
+our $no_cat_file_batch = 0;
 
 =head1 CONSTRUCTORS
 
@@ -1012,6 +1013,10 @@ returns the number of bytes printed.
 =cut
 
 sub cat_blob {
+	(1 == $no_cat_file_batch) ? _cat_blob_cmd(@_) : _cat_blob_batch(@_);
+}
+
+sub _cat_blob_batch {
 	my ($self, $sha1, $fh) = @_;
 
 	$self->_open_cat_blob_if_needed();
@@ -1072,7 +1077,7 @@ sub cat_blob {
 sub _open_cat_blob_if_needed {
 	my ($self) = @_;
 
-	return if defined($self->{cat_blob_pid});
+	return if ( defined($self->{cat_blob_pid}) || 1 == $no_cat_file_batch );
 
 	($self->{cat_blob_pid}, $self->{cat_blob_in},
 	 $self->{cat_blob_out}, $self->{cat_blob_ctx}) =
@@ -1090,6 +1095,40 @@ sub _close_cat_blob {
 	delete @$self{@vars};
 }
 
+sub _cat_blob_cmd {
+	my ($self, $sha1, $fh) = @_;
+
+	my $size = $self->command_oneline('cat-file', '-s', $sha1);
+
+	if (!defined $size) {
+		carp "cat-file couldn't detect object size";
+		return -1;
+	}
+
+	my ($in, $c) = $self->command_output_pipe('cat-file', 'blob', $sha1);
+
+	my $blob;
+	my $bytesLeft = $size;
+
+	while (1) {
+		last unless $bytesLeft;
+
+		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
+		my $read = read($in, $blob, $bytesToRead);
+		unless (defined($read)) {
+			$self->command_close_pipe($in, $c);
+			throw Error::Simple("in pipe went bad");
+		}
+		unless (print $fh $blob) {
+			$self->command_close_pipe($in, $c);
+			throw Error::Simple("couldn't write to passed in filehandle");
+		}
+		$bytesLeft -= $read;
+	}
+
+	$self->command_close_pipe($in, $c);
+	return $size;
+}
 
 =item credential_read( FILEHANDLE )
 

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] git-svn: make batch mode optional for git-cat-file
  2015-09-21 13:51 [PATCH] git-svn: make batch mode optional for git-cat-file Victor Leschuk
@ 2015-09-21 18:25 ` Junio C Hamano
  2015-09-21 22:03   ` Victor Leschuk
  0 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2015-09-21 18:25 UTC (permalink / raw)
  To: Victor Leschuk; +Cc: git, Victor Leschuk

Victor Leschuk <vleschuk@gmail.com> writes:

> Signed-off-by: Victor Leschuk <vleschuk@accesssoftek.com>
> ---

Before the S-o-b line is a good place to explain why this is a good
change to have.  Please use it.

>  git-svn.perl |  1 +
>  perl/Git.pm  | 41 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/git-svn.perl b/git-svn.perl
> index 36f7240..b793c26 100755
> --- a/git-svn.perl
> +++ b/git-svn.perl
> @@ -139,6 +139,7 @@ my %fc_opts = ( 'follow-parent|follow!' => \$Git::SVN::_follow_parent,
>  		'use-log-author' => \$Git::SVN::_use_log_author,
>  		'add-author-from' => \$Git::SVN::_add_author_from,
>  		'localtime' => \$Git::SVN::_localtime,
> +		'no-cat-file-batch' => sub { $Git::no_cat_file_batch = 1; },

An option whose name begins with no- looks somewhat strange.  You
can even say --no-no-cat-file-batch from the command line, I
suspect.

Why not give an option 'cat-file-batch' that sets the variable
$Git::cat_file_batch to false, and initialize the variable to true
to keep existing users who do not pass the option happy?

>  		%remote_opts );
>  
>  my ($_trunk, @_tags, @_branches, $_stdlayout);
> diff --git a/perl/Git.pm b/perl/Git.pm
> index 19ef081..69e5293 100644
> --- a/perl/Git.pm
> +++ b/perl/Git.pm
> @@ -107,6 +107,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR);
>  use Time::Local qw(timegm);
>  }
>  
> +our $no_cat_file_batch = 0;
>  
>  =head1 CONSTRUCTORS
>  
> @@ -1012,6 +1013,10 @@ returns the number of bytes printed.
>  =cut
>  
>  sub cat_blob {
> +	(1 == $no_cat_file_batch) ? _cat_blob_cmd(@_) : _cat_blob_batch(@_);

Discard "1 ==" here.  You are clearly using the variable as a
boolean, so writing this as

	$cat_file_batch ? _cat_blob_batch(@_) : _cat_blob_cmd(@_);

or better yet

	if ($cat_file_batch) {
        	_cat_blob_batch(@_);
	} else {
        	_cat_blob_cmd(@_);
	}

would be more natural.

> +}
> +
> +sub _cat_blob_batch {
>  	my ($self, $sha1, $fh) = @_;
>  
>  	$self->_open_cat_blob_if_needed();
> @@ -1072,7 +1077,7 @@ sub cat_blob {
>  sub _open_cat_blob_if_needed {
>  	my ($self) = @_;
>  
> -	return if defined($self->{cat_blob_pid});
> +	return if ( defined($self->{cat_blob_pid}) || 1 == $no_cat_file_batch );

Likewise.

	return if (!$cat_file_batch);
	return if defined($self->{cat_blob_pid});

> +sub _cat_blob_cmd {
> +	my ($self, $sha1, $fh) = @_;
> +...

The biggest thing that is missing from this patch is the explanation
of why this is a good thing to do.  The batch interface was invented
because people found that it was wasteful to spawn a new cat-file
process every time the contents of a blob is needed and wanted to
avoid it, and this new feature gives the user a way to tell Git to
do things in a "wasteful" way, so there must be a reason why the
user would want to use the "wasteful" way, perhaps work around some
other issue.  Without explaining that in the documentation what that
issue is, i.e. telling users who reads "git svn --help" when and why
the option might help them, nobody would use the feature to benefit
from it.

I wonder if "cat-file --batch" is leaky and bloats after running for
a while.  If that is the case, I have to wonder if "never do batch"
like this patch does is a sensible way forward.  Instead, "recycle
and renew the process after running it for N requests" (and ideally
auto-adjust that N without being told by the user) might be a better
way to do what you are trying to achieve, but as I already said, I
cannot read the motivation behind this change that is not explained,
so...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH] git-svn: make batch mode optional for git-cat-file
  2015-09-21 18:25 ` Junio C Hamano
@ 2015-09-21 22:03   ` Victor Leschuk
  2015-09-22 10:47     ` Victor Leschuk
  2015-09-23  0:13     ` Eric Wong
  0 siblings, 2 replies; 10+ messages in thread
From: Victor Leschuk @ 2015-09-21 22:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5845 bytes --]

Hello Junio,

thanks for your review. First of all I'd like to apologize for sending the patch without description. Actually I was in a hurry and sent it by accident: I planned to edit the mail before sending... 

Here is the detailed description: 

Last week we had a quick discussion in this mailing list: http://thread.gmane.org/gmane.comp.version-control.git/278021 .

The thing is that git-cat-file keeps growing during work when running in "batch" mode. See the figure attached: it is for cloning a rather small repo (1 hour to clone about ~14000 revisions). However the clone of a large repo (~280000 revisions) took about 2 weeks and git-cat-file has outgrown the parent perl process several times (git-cat-file - ~3-4Gb, perl - 400Mb).

What was done:
 * I have run it under valgrind and mtrace and haven't found any memory leaks
 * Found the source of most number of memory reallocations (batch_object_write() function (strbuf_expand -> realloc)) - tried to make the streambuf object static and avoid reallocs - didn't help
 * Tried preloading other allocators than standard glibc - no significant difference

After that I replaced the batch mode with separate cat-file calls for each blob and it didn't have any impact on clone performance on real code repositories. However I created a fake test repo with large number of small files (~10 bytes each): here is how I created it https://bitbucket.org/vleschuk/svngenrepo

And on this artificial test repo it really slowed down the process. So I decided to suggest to make the batch mode optional to let the user "tune" the process and created a patch for this. 

As for your code-style notes, I agree with them, will adjust the code.

--
Best Regards,
Victor
________________________________________
From: Junio C Hamano [jch2355@gmail.com] On Behalf Of Junio C Hamano [gitster@pobox.com]
Sent: Monday, September 21, 2015 11:25 AM
To: Victor Leschuk
Cc: git@vger.kernel.org; Victor Leschuk
Subject: Re: [PATCH] git-svn: make batch mode optional for git-cat-file

Victor Leschuk <vleschuk@gmail.com> writes:

> Signed-off-by: Victor Leschuk <vleschuk@accesssoftek.com>
> ---

Before the S-o-b line is a good place to explain why this is a good
change to have.  Please use it.

>  git-svn.perl |  1 +
>  perl/Git.pm  | 41 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/git-svn.perl b/git-svn.perl
> index 36f7240..b793c26 100755
> --- a/git-svn.perl
> +++ b/git-svn.perl
> @@ -139,6 +139,7 @@ my %fc_opts = ( 'follow-parent|follow!' => \$Git::SVN::_follow_parent,
>               'use-log-author' => \$Git::SVN::_use_log_author,
>               'add-author-from' => \$Git::SVN::_add_author_from,
>               'localtime' => \$Git::SVN::_localtime,
> +             'no-cat-file-batch' => sub { $Git::no_cat_file_batch = 1; },

An option whose name begins with no- looks somewhat strange.  You
can even say --no-no-cat-file-batch from the command line, I
suspect.

Why not give an option 'cat-file-batch' that sets the variable
$Git::cat_file_batch to false, and initialize the variable to true
to keep existing users who do not pass the option happy?

>               %remote_opts );
>
>  my ($_trunk, @_tags, @_branches, $_stdlayout);
> diff --git a/perl/Git.pm b/perl/Git.pm
> index 19ef081..69e5293 100644
> --- a/perl/Git.pm
> +++ b/perl/Git.pm
> @@ -107,6 +107,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR);
>  use Time::Local qw(timegm);
>  }
>
> +our $no_cat_file_batch = 0;
>
>  =head1 CONSTRUCTORS
>
> @@ -1012,6 +1013,10 @@ returns the number of bytes printed.
>  =cut
>
>  sub cat_blob {
> +     (1 == $no_cat_file_batch) ? _cat_blob_cmd(@_) : _cat_blob_batch(@_);

Discard "1 ==" here.  You are clearly using the variable as a
boolean, so writing this as

        $cat_file_batch ? _cat_blob_batch(@_) : _cat_blob_cmd(@_);

or better yet

        if ($cat_file_batch) {
                _cat_blob_batch(@_);
        } else {
                _cat_blob_cmd(@_);
        }

would be more natural.

> +}
> +
> +sub _cat_blob_batch {
>       my ($self, $sha1, $fh) = @_;
>
>       $self->_open_cat_blob_if_needed();
> @@ -1072,7 +1077,7 @@ sub cat_blob {
>  sub _open_cat_blob_if_needed {
>       my ($self) = @_;
>
> -     return if defined($self->{cat_blob_pid});
> +     return if ( defined($self->{cat_blob_pid}) || 1 == $no_cat_file_batch );

Likewise.

        return if (!$cat_file_batch);
        return if defined($self->{cat_blob_pid});

> +sub _cat_blob_cmd {
> +     my ($self, $sha1, $fh) = @_;
> +...

The biggest thing that is missing from this patch is the explanation
of why this is a good thing to do.  The batch interface was invented
because people found that it was wasteful to spawn a new cat-file
process every time the contents of a blob is needed and wanted to
avoid it, and this new feature gives the user a way to tell Git to
do things in a "wasteful" way, so there must be a reason why the
user would want to use the "wasteful" way, perhaps work around some
other issue.  Without explaining that in the documentation what that
issue is, i.e. telling users who reads "git svn --help" when and why
the option might help them, nobody would use the feature to benefit
from it.

I wonder if "cat-file --batch" is leaky and bloats after running for
a while.  If that is the case, I have to wonder if "never do batch"
like this patch does is a sensible way forward.  Instead, "recycle
and renew the process after running it for N requests" (and ideally
auto-adjust that N without being told by the user) might be a better
way to do what you are trying to achieve, but as I already said, I
cannot read the motivation behind this change that is not explained,
so...

[-- Attachment #2: mem_usage.png --]
[-- Type: image/png, Size: 40523 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH] git-svn: make batch mode optional for git-cat-file
  2015-09-21 22:03   ` Victor Leschuk
@ 2015-09-22 10:47     ` Victor Leschuk
  2015-09-22 14:17       ` Junio C Hamano
  2015-09-23  0:13     ` Eric Wong
  1 sibling, 1 reply; 10+ messages in thread
From: Victor Leschuk @ 2015-09-22 10:47 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git@vger.kernel.org

As for your remark regarding the option naming: 

> An option whose name begins with no- looks somewhat strange.  You
can even say --no-no-cat-file-batch from the command line, I
suspect.

We already do have some of these: 'no-metadata', 'no-checkout', 'no-auth-cache'. So I was just following the existing convention. Do you think we need to change it and stick with --catch-file-batch=1/--cat-file-batch=0 ?

--
Best Regards,
Victor
________________________________________
From: Victor Leschuk
Sent: Monday, September 21, 2015 3:03 PM
To: Junio C Hamano
Cc: git@vger.kernel.org
Subject: RE: [PATCH] git-svn: make batch mode optional for git-cat-file

Hello Junio,

thanks for your review. First of all I'd like to apologize for sending the patch without description. Actually I was in a hurry and sent it by accident: I planned to edit the mail before sending...

Here is the detailed description:

Last week we had a quick discussion in this mailing list: http://thread.gmane.org/gmane.comp.version-control.git/278021 .

The thing is that git-cat-file keeps growing during work when running in "batch" mode. See the figure attached: it is for cloning a rather small repo (1 hour to clone about ~14000 revisions). However the clone of a large repo (~280000 revisions) took about 2 weeks and git-cat-file has outgrown the parent perl process several times (git-cat-file - ~3-4Gb, perl - 400Mb).

What was done:
 * I have run it under valgrind and mtrace and haven't found any memory leaks
 * Found the source of most number of memory reallocations (batch_object_write() function (strbuf_expand -> realloc)) - tried to make the streambuf object static and avoid reallocs - didn't help
 * Tried preloading other allocators than standard glibc - no significant difference

After that I replaced the batch mode with separate cat-file calls for each blob and it didn't have any impact on clone performance on real code repositories. However I created a fake test repo with large number of small files (~10 bytes each): here is how I created it https://bitbucket.org/vleschuk/svngenrepo

And on this artificial test repo it really slowed down the process. So I decided to suggest to make the batch mode optional to let the user "tune" the process and created a patch for this.

As for your code-style notes, I agree with them, will adjust the code.

--
Best Regards,
Victor
________________________________________
From: Junio C Hamano [jch2355@gmail.com] On Behalf Of Junio C Hamano [gitster@pobox.com]
Sent: Monday, September 21, 2015 11:25 AM
To: Victor Leschuk
Cc: git@vger.kernel.org; Victor Leschuk
Subject: Re: [PATCH] git-svn: make batch mode optional for git-cat-file

Victor Leschuk <vleschuk@gmail.com> writes:

> Signed-off-by: Victor Leschuk <vleschuk@accesssoftek.com>
> ---

Before the S-o-b line is a good place to explain why this is a good
change to have.  Please use it.

>  git-svn.perl |  1 +
>  perl/Git.pm  | 41 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/git-svn.perl b/git-svn.perl
> index 36f7240..b793c26 100755
> --- a/git-svn.perl
> +++ b/git-svn.perl
> @@ -139,6 +139,7 @@ my %fc_opts = ( 'follow-parent|follow!' => \$Git::SVN::_follow_parent,
>               'use-log-author' => \$Git::SVN::_use_log_author,
>               'add-author-from' => \$Git::SVN::_add_author_from,
>               'localtime' => \$Git::SVN::_localtime,
> +             'no-cat-file-batch' => sub { $Git::no_cat_file_batch = 1; },

An option whose name begins with no- looks somewhat strange.  You
can even say --no-no-cat-file-batch from the command line, I
suspect.

Why not give an option 'cat-file-batch' that sets the variable
$Git::cat_file_batch to false, and initialize the variable to true
to keep existing users who do not pass the option happy?

>               %remote_opts );
>
>  my ($_trunk, @_tags, @_branches, $_stdlayout);
> diff --git a/perl/Git.pm b/perl/Git.pm
> index 19ef081..69e5293 100644
> --- a/perl/Git.pm
> +++ b/perl/Git.pm
> @@ -107,6 +107,7 @@ use Fcntl qw(SEEK_SET SEEK_CUR);
>  use Time::Local qw(timegm);
>  }
>
> +our $no_cat_file_batch = 0;
>
>  =head1 CONSTRUCTORS
>
> @@ -1012,6 +1013,10 @@ returns the number of bytes printed.
>  =cut
>
>  sub cat_blob {
> +     (1 == $no_cat_file_batch) ? _cat_blob_cmd(@_) : _cat_blob_batch(@_);

Discard "1 ==" here.  You are clearly using the variable as a
boolean, so writing this as

        $cat_file_batch ? _cat_blob_batch(@_) : _cat_blob_cmd(@_);

or better yet

        if ($cat_file_batch) {
                _cat_blob_batch(@_);
        } else {
                _cat_blob_cmd(@_);
        }

would be more natural.

> +}
> +
> +sub _cat_blob_batch {
>       my ($self, $sha1, $fh) = @_;
>
>       $self->_open_cat_blob_if_needed();
> @@ -1072,7 +1077,7 @@ sub cat_blob {
>  sub _open_cat_blob_if_needed {
>       my ($self) = @_;
>
> -     return if defined($self->{cat_blob_pid});
> +     return if ( defined($self->{cat_blob_pid}) || 1 == $no_cat_file_batch );

Likewise.

        return if (!$cat_file_batch);
        return if defined($self->{cat_blob_pid});

> +sub _cat_blob_cmd {
> +     my ($self, $sha1, $fh) = @_;
> +...

The biggest thing that is missing from this patch is the explanation
of why this is a good thing to do.  The batch interface was invented
because people found that it was wasteful to spawn a new cat-file
process every time the contents of a blob is needed and wanted to
avoid it, and this new feature gives the user a way to tell Git to
do things in a "wasteful" way, so there must be a reason why the
user would want to use the "wasteful" way, perhaps work around some
other issue.  Without explaining that in the documentation what that
issue is, i.e. telling users who reads "git svn --help" when and why
the option might help them, nobody would use the feature to benefit
from it.

I wonder if "cat-file --batch" is leaky and bloats after running for
a while.  If that is the case, I have to wonder if "never do batch"
like this patch does is a sensible way forward.  Instead, "recycle
and renew the process after running it for N requests" (and ideally
auto-adjust that N without being told by the user) might be a better
way to do what you are trying to achieve, but as I already said, I
cannot read the motivation behind this change that is not explained,
so...

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] git-svn: make batch mode optional for git-cat-file
  2015-09-22 10:47     ` Victor Leschuk
@ 2015-09-22 14:17       ` Junio C Hamano
  0 siblings, 0 replies; 10+ messages in thread
From: Junio C Hamano @ 2015-09-22 14:17 UTC (permalink / raw)
  To: Victor Leschuk; +Cc: git@vger.kernel.org

Victor Leschuk <vleschuk@accesssoftek.com> writes:

> We already do have some of these: 'no-metadata', 'no-checkout',
> no-auth-cache'. So I was just following the existing convention. Do
> you think we need to change it and stick with
> --catch-file-batch=1/--cat-file-batch=0 ?

Inventing a new --cat-file-batch=[0|1] is not a good idea, and
certainly not what I would suggest at all.

My suggestion was to accept --cat-file-batch to allow the --batch
processing, and to accept--no-cat-file-batch to trigger your new
codepath (and leave --cat-file-batch the default when neither is
given).  As these option descriptions are eventually passed to
Getopt::Long, I thought it should not be too hard to arrange.

Mimicking the existing handling of no-whatever is less bad than
accepting --cat-file-batch=[0|1], if you cannot tell the code to
take --[no-]cat-file-batch for whatever reason.  In the longer term
it would need to be cleaned up together with existing ones.  Your
patch would be adding another instance that needs to be cleaned up
to that existing pile, but as long as it follows the same pattern as
existing ones, it is easier to spot what needs to be fixed later.
Compared to that, accepting --cat-file-batch=[0|1] would be far
worse, as such a future clean-up effort can miss it due to its not
following the same pattern.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] git-svn: make batch mode optional for git-cat-file
  2015-09-21 22:03   ` Victor Leschuk
  2015-09-22 10:47     ` Victor Leschuk
@ 2015-09-23  0:13     ` Eric Wong
  2015-09-23  0:35       ` Eric Wong
  1 sibling, 1 reply; 10+ messages in thread
From: Eric Wong @ 2015-09-23  0:13 UTC (permalink / raw)
  To: Victor Leschuk; +Cc: Junio C Hamano, git@vger.kernel.org

Victor Leschuk <vleschuk@accesssoftek.com> wrote:
> The thing is that git-cat-file keeps growing during work when running
> in "batch" mode. See the figure attached: it is for cloning a rather
> small repo (1 hour to clone about ~14000 revisions). However the clone
> of a large repo (~280000 revisions) took about 2 weeks and
> git-cat-file has outgrown the parent perl process several times
> (git-cat-file - ~3-4Gb, perl - 400Mb).

Ugh, that sucks.
Even the 400Mb size of Perl annoys me greatly and I'd work
on fixing it if I had more time.

But I'm completely against adding this parameter to git-svn.
git-svn is not the only "cat-file --batch" user, so this option is
only hiding problems.

The best choice is to figure out why cat-file is wasting memory.

Disclaimer: I'm no expert on parts of git written in C,
but perhaps the alloc.c interface is why memory keeps growing.

> What was done:
>  * I have run it under valgrind and mtrace and haven't found any memory leaks
>  * Found the source of most number of memory reallocations (batch_object_write() function (strbuf_expand -> realloc)) - tried to make the streambuf object static and avoid reallocs - didn't help
>  * Tried preloading other allocators than standard glibc - no significant difference

A few more questions:

* What is the largest file that existed in that repo?

* Did you try "MALLOC_MMAP_THRESHOLD_" with glibc?

  Perhaps setting that to 131072 will help, that'll force releasing
  larger chunks than that; but it might be moot if alloc.c is
  getting in the way.

If alloc.c is the culprit, I would consider to transparently restart
"cat-file --batch" once it grows to a certain size or after a certain
number of requests are made to it.

We can probably do this inside "git cat-file" itself without
changing any callers by calling execve.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] git-svn: make batch mode optional for git-cat-file
  2015-09-23  0:13     ` Eric Wong
@ 2015-09-23  0:35       ` Eric Wong
  2015-09-23 15:28         ` Victor Leschuk
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Wong @ 2015-09-23  0:35 UTC (permalink / raw)
  To: Victor Leschuk; +Cc: Junio C Hamano, git@vger.kernel.org

Eric Wong <normalperson@yhbt.net> wrote:
> Victor Leschuk <vleschuk@accesssoftek.com> wrote:
> > The thing is that git-cat-file keeps growing during work when running
> > in "batch" mode. See the figure attached: it is for cloning a rather
> > small repo (1 hour to clone about ~14000 revisions). However the clone
> > of a large repo (~280000 revisions) took about 2 weeks and
> > git-cat-file has outgrown the parent perl process several times
> > (git-cat-file - ~3-4Gb, perl - 400Mb).

How much of that is anonymous memory, though?
(pmap $PID_OF_GIT_CAT_FILE)

Running the following on the Linux kernel tree I had lying around:

(for i in $(seq 100 200); do git ls-files | sed -e "s/^/HEAD~$i:/"; done)|\
  git cat-file --batch >/dev/null

Reveals about 510M RSS in top, but pmap says less than 20M of that
is anonymous.  So the rest are mmap-ed packfiles; that RSS gets
transparently released back to the kernel under memory pressure.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH] git-svn: make batch mode optional for git-cat-file
  2015-09-23  0:35       ` Eric Wong
@ 2015-09-23 15:28         ` Victor Leschuk
  2015-09-23 19:22           ` Eric Wong
  0 siblings, 1 reply; 10+ messages in thread
From: Victor Leschuk @ 2015-09-23 15:28 UTC (permalink / raw)
  To: Eric Wong; +Cc: Junio C Hamano, git@vger.kernel.org

Hello Eric, thanks for looking into it.

>> git-cat-file has outgrown the parent perl process several times
>> (git-cat-file - ~3-4Gb, perl - 400Mb).

> Ugh, that sucks.
> Even the 400Mb size of Perl annoys me greatly and I'd work
> on fixing it if I had more time.

I was going to look at this problem also, but first I'd like to improve the situation with cat-file as on large repos it is larger problem. By the way, what direction would you suggest to begin with?

> A few more questions:

> * What is the largest file that existed in that repo?

About 2.5M

> * Did you try "MALLOC_MMAP_THRESHOLD_" with glibc?

Have just tried it on a smaller repo (which takes about 1 hour to clone and RSS grows from 4M to 40M during the process. Unfortunately there is no much of an effect: max RSS is 41M with default settings and 38M with MALLOC_MMAP_THRESHOLD_=131072.

> If alloc.c is the culprit, I would consider to transparently restart
"cat-file --batch" once it grows to a certain size or after a certain
number of requests are made to it.

alloc.c interface is not used in cat-file at all, only direct calls to xmalloc and xrealloc from wrapper.c, and also xmmap() from sha1_file.c.

> > git-cat-file has outgrown the parent perl process several times
> > (git-cat-file - ~3-4Gb, perl - 400Mb).

> How much of that is anonymous memory, though?

Haven't measured on this particular repo: didn't redo the 2 week experiment =) However I checked on a smaller test repo and anon memory is about 12M out of 40M total. Most of memory is really taken by mmaped *.pack and *idx files.

Actually I accidentally found out that if I export GIT_MALLOC_LIMIT variable set to several megabytes it has the following effect:
 * git-svn.perl launches git-gc
 * git-gc can't allocate enough memory and thus doesn't create any pack files
 * git-cat-file works only with pure blob object, not packs, and it's memory usage doesn't grow larger than 4-5M

It gave me a thought that maybe we could get rid of "git gc" calls after each commit in perl code and just perform one large gc operation at the end. It will cost disk space during clone but save us memory. What do you think?

As for your suggestion regarding periodic restart of batch process inside git-cat-file, I think we could give it a try, I can prepare a patch and run some tests.

--
Best Regards,
Victor
________________________________________
From: Eric Wong [normalperson@yhbt.net]
Sent: Tuesday, September 22, 2015 5:35 PM
To: Victor Leschuk
Cc: Junio C Hamano; git@vger.kernel.org
Subject: Re: [PATCH] git-svn: make batch mode optional for git-cat-file

Eric Wong <normalperson@yhbt.net> wrote:
> Victor Leschuk <vleschuk@accesssoftek.com> wrote:
> > The thing is that git-cat-file keeps growing during work when running
> > in "batch" mode. See the figure attached: it is for cloning a rather
> > small repo (1 hour to clone about ~14000 revisions). However the clone
> > of a large repo (~280000 revisions) took about 2 weeks and
> > git-cat-file has outgrown the parent perl process several times
> > (git-cat-file - ~3-4Gb, perl - 400Mb).

How much of that is anonymous memory, though?
(pmap $PID_OF_GIT_CAT_FILE)

Running the following on the Linux kernel tree I had lying around:

(for i in $(seq 100 200); do git ls-files | sed -e "s/^/HEAD~$i:/"; done)|\
  git cat-file --batch >/dev/null

Reveals about 510M RSS in top, but pmap says less than 20M of that
is anonymous.  So the rest are mmap-ed packfiles; that RSS gets
transparently released back to the kernel under memory pressure.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] git-svn: make batch mode optional for git-cat-file
  2015-09-23 15:28         ` Victor Leschuk
@ 2015-09-23 19:22           ` Eric Wong
  2015-10-11 12:31             ` Victor Leschuk
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Wong @ 2015-09-23 19:22 UTC (permalink / raw)
  To: Victor Leschuk; +Cc: Junio C Hamano, git

Victor Leschuk <vleschuk@accesssoftek.com> wrote:
> Hello Eric, thanks for looking into it.
> 
> >> git-cat-file has outgrown the parent perl process several times
> >> (git-cat-file - ~3-4Gb, perl - 400Mb).
> 
> > Ugh, that sucks.
> > Even the 400Mb size of Perl annoys me greatly and I'd work
> > on fixing it if I had more time.
> 
> I was going to look at this problem also, but first I'd like to improve the situation with cat-file as on large repos it is larger problem. By the way, what direction would you suggest to begin with?

See below :)

<snip anonymous memory stuff, it doesn't seem to be a culprit>

> > > git-cat-file has outgrown the parent perl process several times
> > > (git-cat-file - ~3-4Gb, perl - 400Mb).
> 
> > How much of that is anonymous memory, though?
> 
> Haven't measured on this particular repo: didn't redo the 2 week
> experiment =) However I checked on a smaller test repo and anon memory
> is about 12M out of 40M total. Most of memory is really taken by
> mmaped *.pack and *idx files.

If it's mmap-ed files, that physical memory is only used on-demand
and can be dropped at any time because it's backed by disk.

In other words, I would not worry about any file-backed mmap at all
(unless you're on 32-bit, but I think git has workarounds for that)

Do you still have that giant repo around?

Are the combined size of the pack + idx files are at least 3-4 GB?

This should cat all the blobs in history without re-running git-svn:

	git log --all --raw -r --no-abbrev | \
	  awk '/^:/ {print $3; print $4}' | git cat-file --batch

git log actually keeps growing, but the cat-file process shouldn't
use anonymous memory much if you inspect it with pmap.

> Actually I accidentally found out that if I export GIT_MALLOC_LIMIT
> variable set to several megabytes it has the following effect:

>  * git-svn.perl launches git-gc
>  * git-gc can't allocate enough memory and thus doesn't create any pack files
>  * git-cat-file works only with pure blob object, not packs, and it's
> memory usage doesn't grow larger than 4-5M
> 
> It gave me a thought that maybe we could get rid of "git gc" calls
> after each commit in perl code and just perform one large gc operation
> at the end. It will cost disk space during clone but save us memory.
> What do you think?

You can set gc.auto to zero in your $GIT_CONFIG to disable gc.
The "git gc" calls were added because unpacked repos were growing
too large and caused problems for other people.

Perhaps play with some other pack* options documented in
Documentation/config to limit maximum pack size/depth.

Is this a 32-bit or 64-bit system?

> As for your suggestion regarding periodic restart of batch process
> inside git-cat-file, I think we could give it a try, I can prepare a
> patch and run some tests.

I am not sure if we need it for git-svn.

In another project, the only reason I've found to restart
"cat-file --batch" is in case the repo got repacked and old packs
got unlinked, cat-file would hold a reference onto the old file
and suck up space.   It might be better if "cat-file --batch" learned
to detect unlinked files and then munmap + close them.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH] git-svn: make batch mode optional for git-cat-file
  2015-09-23 19:22           ` Eric Wong
@ 2015-10-11 12:31             ` Victor Leschuk
  0 siblings, 0 replies; 10+ messages in thread
From: Victor Leschuk @ 2015-10-11 12:31 UTC (permalink / raw)
  To: Eric Wong; +Cc: Junio C Hamano, git@vger.kernel.org, vleschuk@gmail.com

Hello Eric,

Thanks for all the advices. I have played with several repositories (both on 32bit and 64bit machines). You were correct most of the memory if used by mapped files and yes it doesn't cause any problems, even a 32bit machine with 500Mb of memory works normally with a heavy loaded git-cat-file.

Thanks also for the advice to use git gc config options, I tested gc.auto=0 and it lead to the same behavior as my setting MALLOC_LIMIT, however it is more correct way to get this effect =)

I agree that we shouldn't worry about mapped files.

--
Best Regards,
Victor
________________________________________
From: Eric Wong [normalperson@yhbt.net]
Sent: Wednesday, September 23, 2015 12:22 PM
To: Victor Leschuk
Cc: Junio C Hamano; git@vger.kernel.org
Subject: Re: [PATCH] git-svn: make batch mode optional for git-cat-file

Victor Leschuk <vleschuk@accesssoftek.com> wrote:
> Hello Eric, thanks for looking into it.
>
> >> git-cat-file has outgrown the parent perl process several times
> >> (git-cat-file - ~3-4Gb, perl - 400Mb).
>
> > Ugh, that sucks.
> > Even the 400Mb size of Perl annoys me greatly and I'd work
> > on fixing it if I had more time.
>
> I was going to look at this problem also, but first I'd like to improve the situation with cat-file as on large repos it is larger problem. By the way, what direction would you suggest to begin with?

See below :)

<snip anonymous memory stuff, it doesn't seem to be a culprit>

> > > git-cat-file has outgrown the parent perl process several times
> > > (git-cat-file - ~3-4Gb, perl - 400Mb).
>
> > How much of that is anonymous memory, though?
>
> Haven't measured on this particular repo: didn't redo the 2 week
> experiment =) However I checked on a smaller test repo and anon memory
> is about 12M out of 40M total. Most of memory is really taken by
> mmaped *.pack and *idx files.

If it's mmap-ed files, that physical memory is only used on-demand
and can be dropped at any time because it's backed by disk.

In other words, I would not worry about any file-backed mmap at all
(unless you're on 32-bit, but I think git has workarounds for that)

Do you still have that giant repo around?

Are the combined size of the pack + idx files are at least 3-4 GB?

This should cat all the blobs in history without re-running git-svn:

        git log --all --raw -r --no-abbrev | \
          awk '/^:/ {print $3; print $4}' | git cat-file --batch

git log actually keeps growing, but the cat-file process shouldn't
use anonymous memory much if you inspect it with pmap.

> Actually I accidentally found out that if I export GIT_MALLOC_LIMIT
> variable set to several megabytes it has the following effect:

>  * git-svn.perl launches git-gc
>  * git-gc can't allocate enough memory and thus doesn't create any pack files
>  * git-cat-file works only with pure blob object, not packs, and it's
> memory usage doesn't grow larger than 4-5M
>
> It gave me a thought that maybe we could get rid of "git gc" calls
> after each commit in perl code and just perform one large gc operation
> at the end. It will cost disk space during clone but save us memory.
> What do you think?

You can set gc.auto to zero in your $GIT_CONFIG to disable gc.
The "git gc" calls were added because unpacked repos were growing
too large and caused problems for other people.

Perhaps play with some other pack* options documented in
Documentation/config to limit maximum pack size/depth.

Is this a 32-bit or 64-bit system?

> As for your suggestion regarding periodic restart of batch process
> inside git-cat-file, I think we could give it a try, I can prepare a
> patch and run some tests.

I am not sure if we need it for git-svn.

In another project, the only reason I've found to restart
"cat-file --batch" is in case the repo got repacked and old packs
got unlinked, cat-file would hold a reference onto the old file
and suck up space.   It might be better if "cat-file --batch" learned
to detect unlinked files and then munmap + close them.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-10-11 12:32 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-21 13:51 [PATCH] git-svn: make batch mode optional for git-cat-file Victor Leschuk
2015-09-21 18:25 ` Junio C Hamano
2015-09-21 22:03   ` Victor Leschuk
2015-09-22 10:47     ` Victor Leschuk
2015-09-22 14:17       ` Junio C Hamano
2015-09-23  0:13     ` Eric Wong
2015-09-23  0:35       ` Eric Wong
2015-09-23 15:28         ` Victor Leschuk
2015-09-23 19:22           ` Eric Wong
2015-10-11 12:31             ` Victor Leschuk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).