From: Joey Hess <id@joeyh.name>
To: Jeff King <peff@peff.net>
Cc: git <git@vger.kernel.org>
Subject: Re: git status OOM on mmap of large file
Date: Thu, 24 Jan 2019 14:38:10 -0400 [thread overview]
Message-ID: <20190124183810.GC29200@kitenet.net> (raw)
In-Reply-To: <20190124121037.GA4949@sigill.intra.peff.net>
Jeff King wrote:
> Looking at apply_single_file_filter(), it's not the _original_ file that
> it's trying to store, but rather the data coming back from the filter.
> It's just that we use the original file size as a hint!
Thanks much for working that out!
> In other words, I think this patch fixes your problem:
>
> diff --git a/convert.c b/convert.c
> index 0d89ae7c23..85aebe2ed3 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -732,7 +732,7 @@ static int apply_single_file_filter(const char *path, const char *src, size_t le
> if (start_async(&async))
> return 0; /* error was already reported */
>
> - if (strbuf_read(&nbuf, async.out, len) < 0) {
> + if (strbuf_read(&nbuf, async.out, 0) < 0) {
> err = error(_("read from external filter '%s' failed"), cmd);
> }
> if (close(async.out)) {
Yes, confirmed that does fix it.
> though possibly we should actually continue to use the file size as a
> hint up to a certain point, which avoids reallocations for more "normal"
> filters where the input and output sizes are in the same ballpark.
>
> Just off the top of my head, something like:
>
> /* guess that the filtered output will be the same size as the original */
> hint = len;
>
> /* allocate 10% extra in case the clean size is slightly larger */
> hint *= 1.1;
>
> /*
> * in any case, never go higher than half of core.bigfileThreshold.
> * We'd like to avoid allocating more bytes than that, and that still
> * gives us room for our strbuf to preemptively double if our guess is
> * just a little on the low side.
> */
> if (hint > big_file_threshold / 2)
> hint = big_file_threshold / 2;
>
> But to be honest, I have no idea if that would even produce measurable
> benefits over simply growing the strbuf from scratch (i.e., hint==0).
Half of 512 MB is still quite a lot of memory to default to using in
this situation. Eg smaller VPS's still often only have a GB or two of ram.
When the clean filter is being used in a way that doesn't involve hashes
of large files, it will mostly be operating on typically sized source
code files. So capping the maximum hint size around the size of a typical
source code file would be plenty for both common cases for the clean filter.
I did some benchmarking, using cat as the clean filter:
git status 32 kb file, hint == len
time 3.865 ms (3.829 ms .. 3.943 ms)
0.994 R² (0.987 R² .. 0.999 R²)
mean 3.934 ms (3.889 ms .. 4.021 ms)
std dev 191.8 μs (106.8 μs .. 291.8 μs)
git status 32 kb file, hint == 0
time 3.887 ms (3.751 ms .. 4.064 ms)
0.992 R² (0.986 R² .. 0.998 R²)
mean 4.002 ms (3.931 ms .. 4.138 ms)
std dev 292.2 μs (189.0 μs .. 498.3 μs)
git status 1 mb file, hint == len
time 3.942 ms (3.816 ms .. 4.078 ms)
0.995 R² (0.991 R² .. 0.999 R²)
mean 3.969 ms (3.916 ms .. 4.054 ms)
std dev 220.1 μs (155.1 μs .. 304.3 μs)
git status 1 mb file, hint == 0
time 3.869 ms (3.836 ms .. 3.911 ms)
0.998 R² (0.995 R² .. 1.000 R²)
mean 3.895 ms (3.868 ms .. 3.947 ms)
std dev 112.3 μs (47.93 μs .. 182.7 μs)
git status 1 gb file, hint == len
time 7.173 s (6.834 s .. 7.564 s)
0.999 R² (0.999 R² .. 1.000 R²)
mean 7.560 s (7.369 s .. 7.903 s)
std dev 333.2 ms (27.65 ms .. 412.2 ms)
git status 1 gb file, hint == 0
time 7.652 s (6.307 s .. 8.263 s)
0.996 R² (0.992 R² .. 1.000 R²)
mean 8.082 s (7.843 s .. 8.202 s)
std dev 232.3 ms (2.362 ms .. 277.1 ms)
From this, it looks like the file has to be quite large before the
preallocation makes a sizable improvement to runtime, and the
smudge/clean filters have to be used for actual content filtering
(not for hash generation purposes as git-annex and git-lfs use it).
An unusual edge case I think. So hint == 0 seems fine.
--
see shy jo
next prev parent reply other threads:[~2019-01-24 18:38 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-01-22 22:07 git status OOM on mmap of large file Joey Hess
2019-01-24 0:39 ` brian m. carlson
2019-01-24 12:14 ` Jeff King
2019-01-24 17:05 ` Joey Hess
2019-01-24 12:10 ` Jeff King
2019-01-24 12:54 ` Duy Nguyen
2019-01-24 18:38 ` Joey Hess [this message]
2019-01-24 19:18 ` Jeff King
2019-01-24 19:28 ` Jeff King
2019-01-24 20:36 ` [PATCH] avoid unncessary malloc of whole file size Joey Hess
2019-01-24 21:12 ` Junio C Hamano
2019-01-24 21:18 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190124183810.GC29200@kitenet.net \
--to=id@joeyh.name \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).