From: Dennis Zhou <dennis@kernel.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: linux-kernel@vger.kernel.org, tj@kernel.org, cl@linux.com,
akpm@linux-foundation.org, shakeelb@google.com,
linux-mm@kvack.org
Subject: Re: [PATCH 0/2] execve scalability issues, part 1
Date: Mon, 21 Aug 2023 14:07:28 -0700
Message-ID: <ZOPSEJTzrow8YFix@snowbird>
In-Reply-To: <20230821202829.2163744-1-mjguzik@gmail.com>
Hello,
On Mon, Aug 21, 2023 at 10:28:27PM +0200, Mateusz Guzik wrote:
> To start I figured I'm going to bench about as friendly a case as it gets
> -- statically linked *separate* binaries all doing execve in a loop.
>
> I borrowed the bench found here:
> http://apollo.backplane.com/DFlyMisc/doexec.c
>
> $ cc -static -O2 -o static-doexec doexec.c
> $ ./static-doexec $(nproc)
>
> It prints a result every second (warning: first line is garbage).
>
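(For reference: as I read it, that benchmark is essentially N workers each
fork+execve'ing a small static binary in a tight loop and reporting the
rate. The real doexec.c uses separate static binaries and differs in the
details, but the shape is roughly this simplified sketch:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
        char *args[] = { argv[0], "-x", NULL };
        char *env[] = { NULL };

        /* exec'd copy: exit immediately, the worker counts the exec */
        if (argc > 1 && strcmp(argv[1], "-x") == 0)
                return 0;

        int nworkers = argc > 1 ? atoi(argv[1]) : 1;

        for (int i = 0; i < nworkers; i++) {
                if (fork() != 0)
                        continue;

                /* worker: fork+execve this binary forever, print the rate */
                long count = 0;
                time_t last = time(NULL);

                for (;;) {
                        pid_t pid = fork();

                        if (pid == 0) {
                                execve(argv[0], args, env);
                                _exit(1);
                        }
                        waitpid(pid, NULL, 0);
                        count++;

                        time_t now = time(NULL);
                        if (now != last) {
                                printf("%ld execs/sec\n", count);
                                count = 0;
                                last = now;
                        }
                }
        }

        /* parent just sits there; the workers do the printing */
        for (;;)
                pause();
}
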
> My test box temporarily has only 26 cores, and even at this scale I run
> into massive lock contention stemming from back-to-back calls to
> percpu_counter_init (and _destroy later).
>
> While not a panacea, one simple thing to do here is to batch these ops.
> Since the term "batching" is already used in the file, I decided to
> refer to it as "grouping" instead.
>
Unfortunately it's taken me longer than I'd like to get back to this, and
I'm not super happy with the results: I wrote a batch percpu allocation
API. It's better than the starting place, but I'm falling short on the
free path. I am/was also wrestling with the lifetime question (should
these lifetimes be percpu's problem, or bound to the call site as you've
done?). What I like about your approach is that it groups the
percpu_counter lock acquisitions, which batching percpu allocations alone
can't solve no matter how well I do.
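
To make that concrete, the win from your grouping is roughly the shape
below. This is only a hand-wavy sketch with a made-up *_many() name,
lockdep and debug bits elided, not your actual patch; the lock and list
names are the ones from lib/percpu_counter.c and it assumes
CONFIG_HOTPLUG_CPU.

/*
 * Initialize nr counters with one percpu allocation and a single trip
 * through the hotplug-protection lock, instead of nr separate
 * percpu_counter_init() calls each taking it on their own.
 */
static int percpu_counter_init_many_sketch(struct percpu_counter *fbc,
                                           s64 amount, gfp_t gfp, u32 nr)
{
        s32 __percpu *counters;
        unsigned long flags;
        u32 i;

        /* one backing allocation covering all nr counters */
        counters = __alloc_percpu_gfp(nr * sizeof(*counters),
                                      sizeof(*counters), gfp);
        if (!counters)
                return -ENOMEM;

        /* the global lock is taken once here, not nr times */
        spin_lock_irqsave(&percpu_counters_lock, flags);
        for (i = 0; i < nr; i++) {
                raw_spin_lock_init(&fbc[i].lock);
                fbc[i].count = amount;
                fbc[i].counters = &counters[i];
                list_add(&fbc[i].list, &percpu_counters);
        }
        spin_unlock_irqrestore(&percpu_counters_lock, flags);

        return 0;
}

A batch allocator on my end could collapse the nr percpu allocations into
one, but those nr list insertions under percpu_counters_lock only get
grouped if it's done at the percpu_counter level like you have here.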
I'll review this more closely today.
> Even if this code could be patched to dodge these counters, I would
> argue a high-traffic alloc/free consumer is only a matter of time, so it
> makes sense to facilitate it.
>
> With the fix I get an ok win, to quote from the commit:
> > Even at a very modest scale of 26 cores (ops/s):
> > before: 133543.63
> > after: 186061.81 (+39%)
>
> > While with the patch these allocations remain a significant problem,
> > the primary bottleneck shifts to:
> >
> > __pv_queued_spin_lock_slowpath+1
> > _raw_spin_lock_irqsave+57
> > folio_lruvec_lock_irqsave+91
> > release_pages+590
> > tlb_batch_pages_flush+61
> > tlb_finish_mmu+101
> > exit_mmap+327
> > __mmput+61
> > begin_new_exec+1245
> > load_elf_binary+712
> > bprm_execve+644
> > do_execveat_common.isra.0+429
> > __x64_sys_execve+50
> > do_syscall_64+46
> > entry_SYSCALL_64_after_hwframe+110
>
> I intend to do more work in this area to mostly sort it out, but I would
> not mind if someone else took the hammer to folio. :)
>
> With this out of the way I'll be looking at some form of caching to
> eliminate these allocs as a problem.
>
I'm not against caching, this is just my first thought. Caching will have
an impact on the backing pages of percpu: all it takes is 1 allocation on
a page for the current allocator to pin n pages of memory. A few years
ago percpu depopulation was implemented, which limits the number of
resident backing pages.
Maybe the right thing to do is to preallocate pools of common-sized
allocations so they can be recycled, and we don't have to think too hard
about the fragmentation that can occur if we populate these pools over
time?
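
For illustration only, something along these lines (names invented, a
single size class, and with the obvious caveat that one global pool lock
would just move the contention, so a real version would presumably want
per-CPU caches):

/* Rough sketch of a recycling pool for one common allocation size. */
#define POOL_OBJ_SIZE   32      /* e.g. one popular size class */

struct pcpu_pool_entry {
        struct list_head        node;
        void __percpu           *ptr;   /* pooled POOL_OBJ_SIZE allocation */
};

static LIST_HEAD(pcpu_pool);
static DEFINE_SPINLOCK(pcpu_pool_lock);

/* hand out a recycled allocation, or fall back to the percpu allocator */
static void __percpu *pcpu_pool_get(gfp_t gfp)
{
        struct pcpu_pool_entry *e = NULL;

        spin_lock(&pcpu_pool_lock);
        if (!list_empty(&pcpu_pool)) {
                e = list_first_entry(&pcpu_pool, struct pcpu_pool_entry, node);
                list_del(&e->node);
        }
        spin_unlock(&pcpu_pool_lock);

        if (e) {
                void __percpu *p = e->ptr;

                kfree(e);
                return p;
        }
        return __alloc_percpu_gfp(POOL_OBJ_SIZE, __alignof__(u64), gfp);
}

/* recycle instead of free_percpu(), so the backing pages stay populated */
static void pcpu_pool_put(void __percpu *ptr)
{
        struct pcpu_pool_entry *e = kmalloc(sizeof(*e), GFP_KERNEL);

        if (!e) {
                free_percpu(ptr);
                return;
        }
        e->ptr = ptr;
        spin_lock(&pcpu_pool_lock);
        list_add(&e->node, &pcpu_pool);
        spin_unlock(&pcpu_pool_lock);
}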
Also, as you've pointed out, it wasn't just the percpu allocation that
was the bottleneck, but also percpu_counter's global lock, taken for
hotplug support. I'm hazarding a guess that most percpu users have
additional locking requirements like percpu_counter does.
Thanks,
Dennis
> Thoughts?
>
> Mateusz Guzik (2):
> pcpcntr: add group allocation/free
> fork: group allocation of per-cpu counters for mm struct
>
> include/linux/percpu_counter.h | 19 ++++++++---
> kernel/fork.c | 13 ++------
> lib/percpu_counter.c | 61 ++++++++++++++++++++++++----------
> 3 files changed, 60 insertions(+), 33 deletions(-)
>
> --
> 2.39.2
>