git.vger.kernel.org archive mirror
* [PATCH] Fix in Git.pm cat_blob crashes on large files
From: Joshua Clayton @ 2013-02-21 22:13 UTC (permalink / raw)
  To: git

Greetings.
This is my first patch here. Hopefully I get the stylistic & political
details right... :)
Patch applies against maint and master

(If I understand the mechanics, a negative offset could in theory still
work if the values lined up just right, but it would do the wrong thing,
overwriting the earlier contents of the file.)
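
As a rough sketch of the failure mode as I understand it (illustrative
only, not the actual Git.pm code):

  my $buf = '';
  my $offset = 0;
  while (my $n = read($in, $buf, 1024, $offset)) {
          # Once $offset passes 2 GiB it can end up negative (the crash
          # described below), and perl documents a negative OFFSET as
          # counting back from the end of the string, so the next read
          # would no longer append.
          $offset += $n;
  }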

        Developer's Certificate of Origin 1.1

        By making a contribution to this project, I certify that:

        (a) The contribution was created in whole or in part by me and I
            have the right to submit it under the open source license
            indicated in the file; or

        (b) The contribution is based upon previous work that, to the best
            of my knowledge, is covered under an appropriate open source
            license and I have the right under that license to submit that
            work with modifications, whether created in whole or in part
            by me, under the same open source license (unless I am
            permitted to submit under a different license), as indicated
            in the file; or

        (c) The contribution was provided directly to me by some other
            person who certified (a), (b) or (c) and I have not modified
            it.

        (d) I understand and agree that this project and the contribution
            are public and that a record of the contribution (including all
            personal information I submit with it, including my sign-off) is
            maintained indefinitely and may be redistributed consistent with
            this project or the open source license(s) involved.

Affects git svn clone/fetch.
Original code loaded the entire file contents into a variable
before writing to disk. If the offset within the variable passed
2 GiB, it became negative, resulting in a crash.
On a 32-bit system, or a system with low memory, it may crash before
reaching 2 GiB due to memory exhaustion.
The fix writes in smaller 64K increments. Tested to work with git svn fetch.

Signed-off-by: Joshua Clayton <stillcompiling@gmail.com>
---
 perl/Git.pm |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/perl/Git.pm b/perl/Git.pm
index 931047c..e55840f 100644
--- a/perl/Git.pm
+++ b/perl/Git.pm
@@ -942,6 +942,8 @@ sub cat_blob {
 	my $size = $1;

 	my $blob;
+	my $blobSize = 0;
+	my $flushSize = 1024*64;
 	my $bytesRead = 0;

 	while (1) {
@@ -949,13 +951,21 @@ sub cat_blob {
 		last unless $bytesLeft;

 		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
-		my $read = read($in, $blob, $bytesToRead, $bytesRead);
+		my $read = read($in, $blob, $bytesToRead, $blobSize);
 		unless (defined($read)) {
 			$self->_close_cat_blob();
 			throw Error::Simple("in pipe went bad");
 		}
-
 		$bytesRead += $read;
+		$blobSize += $read;
+		if (($blobSize >= $flushSize) || ($bytesLeft <= 1024)) {
+			unless (print $fh $blob) {
+				$self->_close_cat_blob();
+				throw Error::Simple("couldn't write to passed in filehandle");
+			}
+			$blob = "";
+			$blobSize = 0;
+		}
 	}

 	# Skip past the trailing newline.
@@ -970,11 +980,6 @@ sub cat_blob {
 		throw Error::Simple("didn't find newline after blob");
 	}

-	unless (print $fh $blob) {
-		$self->_close_cat_blob();
-		throw Error::Simple("couldn't write to passed in filehandle");
-	}
-
 	return $size;
 }

-- 
1.7.10.4


* Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
From: Jeff King @ 2013-02-21 22:43 UTC (permalink / raw)
  To: Joshua Clayton; +Cc: git

On Thu, Feb 21, 2013 at 02:13:32PM -0800, Joshua Clayton wrote:

> Greetings.
> This is my first patch here. Hopefully I get the stylistic & political
> details right... :)
> Patch applies against maint and master

I have some comments. :)

The body of your email should contain the commit message (i.e., whatever
people reading "git log" a year from now would see). Cover letter bits
like this should go after the "---". That way "git am" knows which part
is which.

>         Developer's Certificate of Origin 1.1

You don't need to include the DCO. Your "Signed-off-by" is an indication
that you agree to it.

> Affects git svn clone/fetch.
> Original code loaded the entire file contents into a variable
> before writing to disk. If the offset within the variable passed
> 2 GiB, it became negative, resulting in a crash.

Interesting. I didn't think perl had signed wrap-around issues like
this, as its numeric variables are not strictly integers. But I don't
have a 32-bit machine to test on (and numbers larger than 2G obviously
work on 64-bit machines). At any rate, though:

> On a 32-bit system, or a system with low memory, it may crash before
> reaching 2 GiB due to memory exhaustion.

Yeah, it is stupid to read the whole thing into memory if we are just
going to dump it to another filehandle.

> @@ -949,13 +951,21 @@ sub cat_blob {
>  		last unless $bytesLeft;
> 
>  		my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
> -		my $read = read($in, $blob, $bytesToRead, $bytesRead);
> +		my $read = read($in, $blob, $bytesToRead, $blobSize);
>  		unless (defined($read)) {
>  			$self->_close_cat_blob();
>  			throw Error::Simple("in pipe went bad");
>  		}

Hmph. The existing code already reads in 1024-byte chunks. For no
reason, as far as I can tell, since we are just loading the blob buffer
incrementally into memory, only to then flush it all out at once.

Why do you read at the $blobSize offset? If we are just reading in
chunks, we should be able to just keep writing to the start of our small
buffer, as we flush each chunk out before trying to read more.

IOW, shouldn't the final code look like this:

  my $bytesLeft = $size;
  while ($bytesLeft > 0) {
          my $buf;
          my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
          my $read = read($in, $buf, $bytesToRead);
          unless (defined($read)) {
                  $self->_close_cat_blob();
                  throw Error::Simple("unable to read cat-blob pipe");
          }
          unless (print $fh $buf) {
                  $self->_close_cat_blob();
                  throw Error::Simple("unable to write blob output");
          }

          $bytesLeft -= $read;
  }

By having the read and flush size be the same, it's much simpler.

Your change (and my proposed code) do mean that an error during the read
operation will result in a truncated output file, rather than an empty
one. I think that is OK, though. That can happen anyway in the original
due to a failure in the "print" step. Any caller who wants to be careful
that they leave only a full file in place must either:

  1. Check the return value of cat_blob and verify that the result has
     $size bytes, and otherwise delete it.

  2. Write to a temporary file, then once success has been returned from
     cat_blob, rename the result into place.

Neither of which is affected by this change.
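
For what it's worth, option 2 might look roughly like this (a sketch
with made-up names -- $dir, $repo, $sha1, $dest, $size -- not code from
our tree):

  use File::Temp qw(tempfile);

  my ($tmp_fh, $tmp_name) = tempfile(DIR => $dir);
  my $got = $repo->cat_blob($sha1, $tmp_fh);
  close($tmp_fh) or die "close failed: $!";
  if ($got == $size) {
          # only a complete file is renamed into place
          rename($tmp_name, $dest) or die "rename failed: $!";
  } else {
          unlink($tmp_name);
  }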

-Peff


* Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
From: Joshua Clayton @ 2013-02-21 23:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git

On Thu, Feb 21, 2013 at 2:43 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Feb 21, 2013 at 02:13:32PM -0800, Joshua Clayton wrote:
>
>> Greetings.
>> This is my first patch here. Hopefully I get the stylistic & political
>> details right... :)
>> Patch applies against maint and master
>
> I have some comments. :)
>
> The body of your email should contain the commit message (i.e., whatever
> people reading "git log" a year from now would see). Cover letter bits
> like this should go after the "---". That way "git am" knows which part
> is which.
>
>>         Developer's Certificate of Origin 1.1
>
> You don't need to include the DCO. Your "Signed-off-by" is an indication
> that you agree to it.
>
>> Affects git svn clone/fetch.
>> Original code loaded the entire file contents into a variable
>> before writing to disk. If the offset within the variable passed
>> 2 GiB, it became negative, resulting in a crash.
>
> Interesting. I didn't think perl had signed wrap-around issues like
> this, as its numeric variables are not strictly integers. But I don't
> have a 32-bit machine to test on (and numbers larger than 2G obviously
> work on 64-bit machines). At any rate, though:
>
>> On a 32-bit system, or a system with low memory, it may crash before
>> reaching 2 GiB due to memory exhaustion.
>
> Yeah, it is stupid to read the whole thing into memory if we are just
> going to dump it to another filehandle.
>
>> @@ -949,13 +951,21 @@ sub cat_blob {
>>               last unless $bytesLeft;
>>
>>               my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
>> -             my $read = read($in, $blob, $bytesToRead, $bytesRead);
>> +             my $read = read($in, $blob, $bytesToRead, $blobSize);
>>               unless (defined($read)) {
>>                       $self->_close_cat_blob();
>>                       throw Error::Simple("in pipe went bad");
>>               }
>
> Hmph. The existing code already reads in 1024-byte chunks. For no
> reason, as far as I can tell, since we are just loading the blob buffer
> incrementally into memory, only to then flush it all out at once.
>
> Why do you read at the $blobSize offset? If we are just reading in
> chunks, we should be able to just keep writing to the start of our small
> buffer, as we flush each chunk out before trying to read more.
>
> IOW, shouldn't the final code look like this:
>
>   my $bytesLeft = $size;
>   while ($bytesLeft > 0) {
>           my $buf;
>           my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
>           my $read = read($in, $buf, $bytesToRead);
>           unless (defined($read)) {
>                   $self->_close_cat_blob();
>                   throw Error::Simple("unable to read cat-blob pipe");
>           }
>           unless (print $fh $buf) {
>                   $self->_close_cat_blob();
>                   throw Error::Simple("unable to write blob output");
>           }
>
>           $bytesLeft -= $read;
>   }
>
> By having the read and flush size be the same, it's much simpler.

My original bugfix did just read 1024, and write 1024. That works fine
and, yes, is simpler.
I changed it to be more similar to the original code in case there
were performance reasons for doing it that way.
That was the only reason I could think of for the design, and adding
the $flushSize variable means that some motivated person could easily
optimize it.

So far I have been too lazy to profile the two versions...
I guess I'll try a trivial git svn init; git svn fetch and check back in.
It's running now.

>
> Your change (and my proposed code) do mean that an error during the read
> operation will result in a truncated output file, rather than an empty
> one. I think that is OK, though. That can happen anyway in the original
> due to a failure in the "print" step. Any caller who wants to be careful
> that they leave only a full file in place must either:
>
>   1. Check the return value of cat_blob and verify that the result has
>      $size bytes, and otherwise delete it.
>
>   2. Write to a temporary file, then once success has been returned from
>      cat_blob, rename the result into place.
>
> Neither of which is affected by this change.
>
> -Peff

In git svn fetch (which is how I discovered it) the file being passed
to cat_blob is a temporary file, which is checksummed before putting
it into place.


* Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
From: Jeff King @ 2013-02-21 23:24 UTC (permalink / raw)
  To: Joshua Clayton; +Cc: git

On Thu, Feb 21, 2013 at 03:18:40PM -0800, Joshua Clayton wrote:

> > By having the read and flush size be the same, it's much simpler.
> 
> My original bugfix did just read 1024, and write 1024. That works fine
> and, yes, is simpler.
> I changed it to be more similar to the original code in case there
> were performance reasons for doing it that way.
> That was the only reason I could think of for the design, and adding
> the $flushSize variable means that some motivated person could easily
> optimize it.
> 
> So far I have been too lazy to profile the two versions...
> I guess I'll try a trivial git svn init; git svn fetch and check back in.
> It's running now.

I doubt it will make much of a difference; we are already writing to a
perl filehandle, so it will be buffered there (with a buffer I assume is
4K, but I haven't checked). And your version retains the 1024-byte read.
I do think 1024 is quite low for this sort of descriptor-to-descriptor
copying. I'd be tempted to just bump that 1024 to 64K.
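
I.e., the same loop as before, just with a bigger chunk (untested):

  my $chunkSize = 64 * 1024;
  my $bytesLeft = $size;
  while ($bytesLeft > 0) {
          my $buf;
          my $bytesToRead = $bytesLeft < $chunkSize ? $bytesLeft : $chunkSize;
          my $read = read($in, $buf, $bytesToRead);
          unless (defined($read)) {
                  $self->_close_cat_blob();
                  throw Error::Simple("unable to read cat-blob pipe");
          }
          unless (print $fh $buf) {
                  $self->_close_cat_blob();
                  throw Error::Simple("unable to write blob output");
          }
          $bytesLeft -= $read;
  }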

> In git svn fetch (which is how I discovered it) the file being passed
> to cat_blob is a temporary file, which is checksummed before putting
> it into place.

Great. There may be other callers outside of our tree, of course, but I
think it's pretty clear that the responsibility is on the caller to make
sure the function succeeded. We are changing what gets put on the output
stream for various error conditions, but ultimately that is an
implementation detail that the caller should not be depending on.

-Peff


* Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
From: Joshua Clayton @ 2013-02-22 15:11 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Running git svn fetch on a remote repository (yes, I know there are a
lot of possible outside variables, including network latency):
Code with 1024 reads and 64K writes:

real    75m19.906s
user    16m43.919s
sys     29m16.326s

Code with 1024 reads and 1024 writes:

real    71m21.006s
user    12m36.275s
sys     24m26.112s

...so the simpler code wins the trivial test.
I would say go with it.
Should I resubmit?

On Thu, Feb 21, 2013 at 3:24 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Feb 21, 2013 at 03:18:40PM -0800, Joshua Clayton wrote:
>
>> > By having the read and flush size be the same, it's much simpler.
>>
>> My original bugfix did just read 1024, and write 1024. That works fine
>> and, yes, is simpler.
>> I changed it to be more similar to the original code in case there
>> were performance reasons for doing it that way.
>> That was the only reason I could think of for the design, and adding
>> the $flushSize variable means that some motivated person could easily
>> optimize it.
>>
>> So far I have been too lazy to profile the two versions...
>> I guess I'll try a trivial git svn init; git svn fetch and check back in.
>> It's running now.
>
> I doubt it will make much of a difference; we are already writing to a
> perl filehandle, so it will be buffered there (with a buffer I assume is
> 4K, but I haven't checked). And your version retains the 1024-byte read.
> I do think 1024 is quite low for this sort of descriptor-to-descriptor
> copying. I'd be tempted to just bump that 1024 to 64K.
>
>> In git svn fetch (which is how I discovered it) the file being passed
>> to cat_blob is a temporary file, which is checksummed before putting
>> it into place.
>
> Great. There may be other callers outside of our tree, of course, but I
> think it's pretty clear that the responsibility is on the caller to make
> sure the function succeeded. We are changing what gets put on the output
> stream for various error conditions, but ultimately that is an
> implementation detail that the caller should not be depending on.
>
> -Peff


* Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
From: Jeff King @ 2013-02-22 15:38 UTC (permalink / raw)
  To: Joshua Clayton; +Cc: git

On Fri, Feb 22, 2013 at 07:11:54AM -0800, Joshua Clayton wrote:

> Running git svn fetch on a remote repository (yes, I know there are a
> lot of possible outside variables, including network latency):
> Code with 1024 reads and 64K writes:
> 
> real    75m19.906s
> user    16m43.919s
> sys     29m16.326s
> 
> Code with 1024 reads and 1024 writes:
> 
> real    71m21.006s
> user    12m36.275s
> sys     24m26.112s
> 
> ...so the simpler code wins the trivial test.

Interesting; I'd have expected no change or a slight win for your
version, which makes me wonder if the outside variables are dominating.
I wonder what 64K/64K would look like.

> I would say go with it.
> Should I resubmit?

Yes, please.

-Peff
