From: Joshua Clayton <stillcompiling@gmail.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] Fix in Git.pm cat_blob crashes on large files
Date: Thu, 21 Feb 2013 15:18:40 -0800
Message-ID: <CAMB+bf+whVFD03neCh-gBORXOBoNjgaCbfP_mh8HgDy6UqGFZA@mail.gmail.com>
In-Reply-To: <20130221224319.GA19021@sigill.intra.peff.net>
On Thu, Feb 21, 2013 at 2:43 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Feb 21, 2013 at 02:13:32PM -0800, Joshua Clayton wrote:
>
>> Greetings.
>> This is my first patch here. Hopefully I get the stylistic & political
>> details right... :)
>> Patch applies against maint and master
>
> I have some comments. :)
>
> The body of your email should contain the commit message (i.e., whatever
> people reading "git log" a year from now would see). Cover letter bits
> like this should go after the "---". That way "git am" knows which part
> is which.
>
>> Developer's Certificate of Origin 1.1
>
> You don't need to include the DCO. Your "Signed-off-by" is an indication
> that you agree to it.
>
>> Affects git svn clone/fetch
>> Original code loaded entire file contents into a variable
>> before writing to disk. If the offset within the variable passed
>> 2 GiB, it became negative, resulting in a crash.
>
> Interesting. I didn't think perl had signed wrap-around issues like
> this, as its numeric variables are not strictly integers. But I don't
> have a 32-bit machine to test on (and numbers larger than 2G obviously
> work on 64-bit machines). At any rate, though:
>
>> On a 32 bit system, or a system with low memory it may crash before
>> reaching 2 GiB due to memory exhaustion.
>
> Yeah, it is stupid to read the whole thing into memory if we are just
> going to dump it to another filehandle.
>
>> @@ -949,13 +951,21 @@ sub cat_blob {
>> last unless $bytesLeft;
>>
>> my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
>> - my $read = read($in, $blob, $bytesToRead, $bytesRead);
>> + my $read = read($in, $blob, $bytesToRead, $blobSize);
>> unless (defined($read)) {
>> $self->_close_cat_blob();
>> throw Error::Simple("in pipe went bad");
>> }
>
> Hmph. The existing code already reads in 1024-byte chunks. For no
> reason, as far as I can tell, since we are just loading the blob buffer
> incrementally into memory, only to then flush it all out at once.
>
> Why do you read at the $blobSize offset? If we are just reading in
> chunks, we should be able to just keep writing to the start of our small
> buffer, as we flush each chunk out before trying to read more.
>
> IOW, shouldn't the final code look like this:
>
> my $bytesLeft = $size;
> while ($bytesLeft > 0) {
> my $buf;
> my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
> my $read = read($in, $buf, $bytesToRead);
> unless (defined($read)) {
> $self->_close_cat_blob();
> throw Error::Simple("unable to read cat-blob pipe");
> }
> unless (print $fh $buf) {
> $self->_close_cat_blob();
> throw Error::Simple("unable to write blob output");
> }
>
> $bytesLeft -= $read;
> }
>
> By having the read and flush size be the same, it's much simpler.
My original bugfix did just read 1024, and write 1024. That works fine
and, yes, is simpler.
I changed it to be more similar to the original code in case there
were performance reasons for doing it that way.
That was the only reason I could think of for the design, and adding
the $flushSize variable means that
some motivated person could easily optimize it.
So far I have been too lazy to profile the two versions....
I guess I'll try a trivial git svn init; git svn fetch and check back in.
It's running now.
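For what it's worth, the two strategies can be compared without a full
git svn run by pointing them at an in-memory filehandle standing in for
the cat-blob pipe. This is only a hypothetical micro-benchmark sketch
(the 4 MiB test size, 1024-byte chunks, and sub names are made up, not
from Git.pm):

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Stand-in for the cat-blob pipe: a 4 MiB in-memory "file".
my $size = 4 * 1024 * 1024;
my $data = 'x' x $size;

# Original style: read at an ever-growing offset into one big buffer.
sub copy_growing {
    open my $in, '<', \$data or die "open: $!";
    my ($buf, $off) = ('', 0);
    while ((my $r = read($in, $buf, 1024, $off)) > 0) { $off += $r }
    return $off;
}

# Simpler style: reuse a small buffer, as each chunk is flushed anyway.
sub copy_small {
    open my $in, '<', \$data or die "open: $!";
    my ($buf, $total) = (undef, 0);
    while ((my $r = read($in, $buf, 1024)) > 0) { $total += $r }
    return $total;
}

# Sanity check: both strategies consume the whole input.
die "size mismatch" unless copy_growing() == $size
                       and copy_small()   == $size;

timethese(10, { growing => \&copy_growing, small => \&copy_small });
```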
>
> Your change (and my proposed code) do mean that an error during the read
> operation will result in a truncated output file, rather than an empty
> one. I think that is OK, though. That can happen anyway in the original
> due to a failure in the "print" step. Any caller who wants to be careful
> that they leave only a full file in place must either:
>
> 1. Check the return value of cat_blob and verify that the result has
> $size bytes, and otherwise delete it.
>
> 2. Write to a temporary file, then once success has been returned from
> cat_blob, rename the result into place.
>
> Neither of which is affected by this change.
>
> -Peff
In git svn fetch (which is how I discovered it) the file being passed
to cat_blob is a temporary file, which is checksummed before being put
into place.
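That temp-file approach is option 2 from the list above: stream the blob
somewhere disposable and only rename it into place once cat_blob has
succeeded, so readers never see a truncated file. A rough sketch of the
pattern (careful_write, blob.out, and the contents are invented for
illustration; rename is atomic only within one filesystem, hence
DIR => '.'):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

sub careful_write {
    my ($dest, $content) = @_;
    # Create the temp file next to the destination so rename() stays
    # on the same filesystem and is atomic.
    my ($fh, $tmp) = tempfile('blob-XXXXXX', DIR => '.');
    print {$fh} $content or die "write $tmp: $!";
    close $fh            or die "close $tmp: $!";
    # Only now expose the file under its real name.
    rename $tmp, $dest   or die "rename $tmp -> $dest: $!";
}

careful_write('blob.out', "example blob contents\n");
```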
Thread overview: 6+ messages
2013-02-21 22:13 [PATCH] Fix in Git.pm cat_blob crashes on large files Joshua Clayton
2013-02-21 22:43 ` Jeff King
2013-02-21 23:18 ` Joshua Clayton [this message]
2013-02-21 23:24 ` Jeff King
2013-02-22 15:11 ` Joshua Clayton
2013-02-22 15:38 ` Jeff King