All of lore.kernel.org
 help / color / mirror / Atom feed
From: Joey Hess <id@joeyh.name>
To: Jeff King <peff@peff.net>
Cc: git <git@vger.kernel.org>
Subject: Re: git status OOM on mmap of large file
Date: Thu, 24 Jan 2019 14:38:10 -0400	[thread overview]
Message-ID: <20190124183810.GC29200@kitenet.net> (raw)
In-Reply-To: <20190124121037.GA4949@sigill.intra.peff.net>

Jeff King wrote:
> Looking at apply_single_file_filter(), it's not the _original_ file that
> it's trying to store, but rather the data coming back from the filter.
> It's just that we use the original file size as a hint!

Thanks much for working that out!

> In other words, I think this patch fixes your problem:
> 
> diff --git a/convert.c b/convert.c
> index 0d89ae7c23..85aebe2ed3 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
>  	if (start_async(&async))
>  		return 0;	/* error was already reported */
>  
> -	if (strbuf_read(&nbuf, async.out, len) < 0) {
> +	if (strbuf_read(&nbuf, async.out, 0) < 0) {
>  		err = error(_("read from external filter '%s' failed"), cmd);
>  	}
>  	if (close(async.out)) {

Yes, confirmed that does fix it.

> though possibly we should actually continue to use the file size as a
> hint up to a certain point, which avoids reallocations for more "normal"
> filters where the input and output sizes are in the same ballpark.
> 
> Just off the top of my head, something like:
> 
>   /* guess that the filtered output will be the same size as the original */
>   hint = len;
> 
>   /* allocate 10% extra in case the clean size is slightly larger */
>   hint *= 1.1;
> 
>   /*
>    * in any case, never go higher than half of core.bigfileThreshold.
>    * We'd like to avoid allocating more bytes than that, and that still
>    * gives us room for our strbuf to preemptively double if our guess is
>    * just a little on the low side.
>    */
>   if (hint > big_file_threshold / 2)
> 	hint = big_file_threshold / 2;
> 
> But to be honest, I have no idea if that would even produce measurable
> benefits over simply growing the strbuf from scratch (i.e., hint==0).

Half of 512 MB is still quite a lot of memory to default to using in
this situation. Eg smaller VPS's still often only have a GB or two of ram.

When the clean filter is being used in a way that doesn't involve hashes
of large files, it will mostly be operating on typically sized source
code files. So capping the maximum hint size around the size of a typical
source code file would be plenty for both common cases for the clean filter.

I did some benchmarking, using cat as the clean filter:

git status 32 kb file, hint == len
   time                 3.865 ms   (3.829 ms .. 3.943 ms)
                        0.994 R²   (0.987 R² .. 0.999 R²)
   mean                 3.934 ms   (3.889 ms .. 4.021 ms)
   std dev              191.8 μs   (106.8 μs .. 291.8 μs)
git status 32 kb file, hint == 0
   time                 3.887 ms   (3.751 ms .. 4.064 ms)
                        0.992 R²   (0.986 R² .. 0.998 R²)
   mean                 4.002 ms   (3.931 ms .. 4.138 ms)
   std dev              292.2 μs   (189.0 μs .. 498.3 μs)
git status 1 mb file, hint == len
   time                 3.942 ms   (3.816 ms .. 4.078 ms)
                        0.995 R²   (0.991 R² .. 0.999 R²)
   mean                 3.969 ms   (3.916 ms .. 4.054 ms)
   std dev              220.1 μs   (155.1 μs .. 304.3 μs)
git status 1 mb file, hint == 0
   time                 3.869 ms   (3.836 ms .. 3.911 ms)
                        0.998 R²   (0.995 R² .. 1.000 R²)
   mean                 3.895 ms   (3.868 ms .. 3.947 ms)
   std dev              112.3 μs   (47.93 μs .. 182.7 μs)
git status 1 gb file, hint == len
   time                 7.173 s    (6.834 s .. 7.564 s)
                        0.999 R²   (0.999 R² .. 1.000 R²)
   mean                 7.560 s    (7.369 s .. 7.903 s)
   std dev              333.2 ms   (27.65 ms .. 412.2 ms)
git status 1 gb file, hint == 0
   time                 7.652 s    (6.307 s .. 8.263 s)
                        0.996 R²   (0.992 R² .. 1.000 R²)
   mean                 8.082 s    (7.843 s .. 8.202 s)
   std dev              232.3 ms   (2.362 ms .. 277.1 ms)

From this, it looks like the file has to be quite large before the
preallocation makes a sizable improvement to runtime, and the
smudge/clean filters have to be used for actual content filtering
(not for hash generation purposes as git-annex and git-lfs use it).
An unusual edge case I think. So hint == 0 seems fine.

-- 
see shy jo

  parent reply	other threads:[~2019-01-24 18:38 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-22 22:07 git status OOM on mmap of large file Joey Hess
2019-01-24  0:39 ` brian m. carlson
2019-01-24 12:14   ` Jeff King
2019-01-24 17:05     ` Joey Hess
2019-01-24 12:10 ` Jeff King
2019-01-24 12:54   ` Duy Nguyen
2019-01-24 18:38   ` Joey Hess [this message]
2019-01-24 19:18     ` Jeff King
2019-01-24 19:28       ` Jeff King
2019-01-24 20:36       ` [PATCH] avoid unncessary malloc of whole file size Joey Hess
2019-01-24 21:12         ` Junio C Hamano
2019-01-24 21:18           ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190124183810.GC29200@kitenet.net \
    --to=id@joeyh.name \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.