Re: git status OOM on mmap of large file

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Joey Hess <id@joeyh.name>
To: Jeff King <peff@peff.net>
Cc: git <git@vger.kernel.org>
Subject: Re: git status OOM on mmap of large file
Date: Thu, 24 Jan 2019 14:38:10 -0400	[thread overview]
Message-ID: <20190124183810.GC29200@kitenet.net> (raw)
In-Reply-To: <20190124121037.GA4949@sigill.intra.peff.net>

Jeff King wrote:
> Looking at apply_single_file_filter(), it's not the _original_ file that
> it's trying to store, but rather the data coming back from the filter.
> It's just that we use the original file size as a hint!

Thanks much for working that out!

> In other words, I think this patch fixes your problem:
> 
> diff --git a/convert.c b/convert.c
> index 0d89ae7c23..85aebe2ed3 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
>  	if (start_async(&async))
>  		return 0;	/* error was already reported */
>  
> -	if (strbuf_read(&nbuf, async.out, len) < 0) {
> +	if (strbuf_read(&nbuf, async.out, 0) < 0) {
>  		err = error(_("read from external filter '%s' failed"), cmd);
>  	}
>  	if (close(async.out)) {

Yes, confirmed that does fix it.

> though possibly we should actually continue to use the file size as a
> hint up to a certain point, which avoids reallocations for more "normal"
> filters where the input and output sizes are in the same ballpark.
> 
> Just off the top of my head, something like:
> 
>   /* guess that the filtered output will be the same size as the original */
>   hint = len;
> 
>   /* allocate 10% extra in case the clean size is slightly larger */
>   hint *= 1.1;
> 
>   /*
>    * in any case, never go higher than half of core.bigfileThreshold.
>    * We'd like to avoid allocating more bytes than that, and that still
>    * gives us room for our strbuf to preemptively double if our guess is
>    * just a little on the low side.
>    */
>   if (hint > big_file_threshold / 2)
> 	hint = big_file_threshold / 2;
> 
> But to be honest, I have no idea if that would even produce measurable
> benefits over simply growing the strbuf from scratch (i.e., hint==0).

Half of 512 MB is still quite a lot of memory to default to using in
this situation. Eg smaller VPS's still often only have a GB or two of ram.

When the clean filter is being used in a way that doesn't involve hashes
of large files, it will mostly be operating on typically sized source
code files. So capping the maximum hint size around the size of a typical
source code file would be plenty for both common cases for the clean filter.

I did some benchmarking, using cat as the clean filter:

git status 32 kb file, hint == len
   time                 3.865 ms   (3.829 ms .. 3.943 ms)
                        0.994 R²   (0.987 R² .. 0.999 R²)
   mean                 3.934 ms   (3.889 ms .. 4.021 ms)
   std dev              191.8 μs   (106.8 μs .. 291.8 μs)
git status 32 kb file, hint == 0
   time                 3.887 ms   (3.751 ms .. 4.064 ms)
                        0.992 R²   (0.986 R² .. 0.998 R²)
   mean                 4.002 ms   (3.931 ms .. 4.138 ms)
   std dev              292.2 μs   (189.0 μs .. 498.3 μs)
git status 1 mb file, hint == len
   time                 3.942 ms   (3.816 ms .. 4.078 ms)
                        0.995 R²   (0.991 R² .. 0.999 R²)
   mean                 3.969 ms   (3.916 ms .. 4.054 ms)
   std dev              220.1 μs   (155.1 μs .. 304.3 μs)
git status 1 mb file, hint == 0
   time                 3.869 ms   (3.836 ms .. 3.911 ms)
                        0.998 R²   (0.995 R² .. 1.000 R²)
   mean                 3.895 ms   (3.868 ms .. 3.947 ms)
   std dev              112.3 μs   (47.93 μs .. 182.7 μs)
git status 1 gb file, hint == len
   time                 7.173 s    (6.834 s .. 7.564 s)
                        0.999 R²   (0.999 R² .. 1.000 R²)
   mean                 7.560 s    (7.369 s .. 7.903 s)
   std dev              333.2 ms   (27.65 ms .. 412.2 ms)
git status 1 gb file, hint == 0
   time                 7.652 s    (6.307 s .. 8.263 s)
                        0.996 R²   (0.992 R² .. 1.000 R²)
   mean                 8.082 s    (7.843 s .. 8.202 s)
   std dev              232.3 ms   (2.362 ms .. 277.1 ms)

From this, it looks like the file has to be quite large before the
preallocation makes a sizable improvement to runtime, and the
smudge/clean filters have to be used for actual content filtering
(not for hash generation purposes as git-annex and git-lfs use it).
An unusual edge case I think. So hint == 0 seems fine.

-- 
see shy jo

next prev parent reply	other threads:[~2019-01-24 18:38 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-22 22:07 git status OOM on mmap of large file Joey Hess
2019-01-24  0:39 ` brian m. carlson
2019-01-24 12:14   ` Jeff King
2019-01-24 17:05     ` Joey Hess
2019-01-24 12:10 ` Jeff King
2019-01-24 12:54   ` Duy Nguyen
2019-01-24 18:38   ` Joey Hess [this message]
2019-01-24 19:18     ` Jeff King
2019-01-24 19:28       ` Jeff King
2019-01-24 20:36       ` [PATCH] avoid unncessary malloc of whole file size Joey Hess
2019-01-24 21:12         ` Junio C Hamano
2019-01-24 21:18           ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190124183810.GC29200@kitenet.net \
    --to=id@joeyh.name \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).