From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: "Iucha\, Florin" <Florin.Iucha@amd.com>,
"git\@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: High locking contention during repack?
Date: Wed, 12 Dec 2018 15:04:19 +0100 [thread overview]
Message-ID: <87wooen8nw.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20181212112409.GB30673@sigill.intra.peff.net>
On Wed, Dec 12 2018, Jeff King wrote:
> On Wed, Dec 12, 2018 at 03:01:47AM +0000, Iucha, Florin wrote:
>
>> I am running “git-repack -A -d -f -F --window=250 --depth=250” on a
>> Git repository converted using git-svn.
>
> Sort of tangential to your question, but:
>
> - Using "-F" implies "-f" already, so you don't need both. That said,
> you are probably wasting CPU to use "-F", unless you have adjusted
> zlib compression settings since the last pack. (Whereas using "-f"
> is useful, if you're willing to pay the CPU tradeoff).
>
> - Using --depth=250 does not actually decrease the packfile size very
> much, and results in a packfile which is more expensive for
> subsequent processes to use. Some experimentation showed that
> --depth=50 is a sweet spot, and that is the default for both normal
> "git gc" and "git gc --aggressive" these days.
>
> See 07e7dbf0db (gc: default aggressive depth to 50, 2016-08-11) for
> more discussion.
I wonder if the size is still too high. I'd rather take the 5-10%
speedup quoted in that commit than a slight decrease in disk space use.
The git.git repository now with the repack command in that commit
message, with --depth=$n:
573M /tmp/git-1
200M /tmp/git-2
118M /tmp/git-3
91M /tmp/git-4
82M /tmp/git-5
77M /tmp/git-6
74M /tmp/git-7
72M /tmp/git-8
71M /tmp/git-9
70M /tmp/git-10
67M /tmp/git-15
66M /tmp/git-20
65M /tmp/git-40
65M /tmp/git-35
65M /tmp/git-30
65M /tmp/git-25
64M /tmp/git-50
64M /tmp/git-45
Produced via:
parallel 'rm -rf /tmp/git-{}; cp -Rvp /tmp/git.git/ /tmp/git-{} && git -C /tmp/git-{} repack -adf --depth={} --window=250' ::: {1..10} 15 20 25 30 35 40 45 50
And then the log -S benchmarks:
s/iter log -S 1 log -S 50 log -S 45 log -S 35 log -S 30 log -S 25 log -S 20 log -S 15 log -S 10 log -S 5
log -S 1 12.6 -- -26% -27% -32% -36% -38% -38% -39% -40% -42%
log -S 50 9.28 36% -- -0% -7% -13% -15% -16% -18% -19% -21%
log -S 45 9.25 36% 0% -- -7% -12% -15% -16% -17% -19% -20%
log -S 35 8.62 46% 8% 7% -- -6% -9% -10% -11% -13% -14%
log -S 30 8.12 55% 14% 14% 6% -- -3% -4% -6% -8% -9%
log -S 25 7.86 60% 18% 18% 10% 3% -- -1% -3% -5% -6%
log -S 20 7.77 62% 19% 19% 11% 4% 1% -- -2% -3% -5%
log -S 15 7.64 65% 21% 21% 13% 6% 3% 2% -- -2% -4%
log -S 10 7.51 68% 24% 23% 15% 8% 5% 3% 2% -- -2%
log -S 5 7.37 71% 26% 25% 17% 10% 7% 5% 4% 2% --
So we're getting a 20% speedup in log -S by using --depth=5 instead of
--depth=50, neatly at the cost of 20% more disk space, but could also
get a 19% speedup for 8% (--depth=10) more disk space, etc.
Then in rev-list it makes less of a difference, narrowing this down a
bit:
s/iter rev-list 50 rev-list 25 rev-list 10 rev-list 5
rev-list 50 1.61 -- -1% -4% -5%
rev-list 25 1.60 1% -- -3% -4%
rev-list 10 1.55 4% 3% -- -1%
rev-list 5 1.54 5% 4% 1% --
It seems to me we should at least move this to --depth=25 by default. A
~15% speedup in log -S & other mass-blob lookups for 1.5% more disk
space seems like a good trade-off. As your commit notes RAM (or disk
space) is not commonly the bottleneck anymore.
>> The system is a 16 core / 32 thread Threadripper with 128GB of RAM and
>> NVMe storage. The repack starts strong, with 32 threads but it fairly
>> quickly gets to 99% done and the number of threads drops to 4 then 3
>> then 2. However, running “dstat 5” I see lots of “sys” time without
>> any IO time (the network traffic you see is caused by SSH).
>
> This sounds mostly normal and expected. The parallel part of a repack is
> the delta search, which is not infinitely parallelizable. Each worker
> thread is given a "chunk" of objects, and it uses a sliding window to
> search for delta pairs through that chunk. You don't want a chunk that
> approaches the window size, since at every chunk boundary you're missing
> delta possibilities.
>
> The default chunk size is about 1/nr_threads of the total list size
> (i.e., we portion out all the work). And then when a thread finishes, we
> take work from the thread with the most work remaining, and portion it
> out. However, at some point the chunks approach their minimum, and we
> stop dividing. So the number of threads will drop, eventually to 1, and
> you'll be waiting on it to finish that final chunk.
>
> So that's all working as planned. Having high sys load does seem a bit
> odd. Most of the effort should be going to reading the mmap'd data from
> disk, zlib-inflating it and computing a fingerprint, and then comparing
> the fingerprints. So that would mostly be user time.
>
>> Running a strace on the running git-repack process shows only these:
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>
>> Any idea on how to debug this? I have ran git-repack under gdb, but it seems to spin on builtin/repack.c line 409.
>
> The heavy lifting here is done by the pack-objects child process, not
> git-repack itself. Try running with GIT_TRACE=1 in the environment to
> see the exact invocation, but timing and debugging:
>
> git pack-objects --all --no-reuse-delta --delta-base-offset --stdout \
> </dev/null >/dev/null
>
> should produce interesting results.
>
> The SIGALRM loop you see above is likely just the progress meter
> triggering once per second (the actual worker threads are updating an
> int, and then at least once per second we'll show the int).
>
> -Peff
next prev parent reply other threads:[~2018-12-12 14:04 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-12-12 3:01 High locking contention during repack? Iucha, Florin
2018-12-12 11:24 ` Jeff King
2018-12-12 14:04 ` Ævar Arnfjörð Bjarmason [this message]
2018-12-12 16:49 ` Iucha, Florin
2018-12-12 16:54 ` Ævar Arnfjörð Bjarmason
2018-12-12 18:08 ` Iucha, Florin
2018-12-12 18:30 ` Iucha, Florin
2018-12-12 19:05 ` Iucha, Florin
2018-12-12 21:50 ` Iucha, Florin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87wooen8nw.fsf@evledraar.gmail.com \
--to=avarab@gmail.com \
--cc=Florin.Iucha@amd.com \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.