From: Lorenzo Stoakes <ljs@kernel.org>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>,
Joseph Salisbury <joseph.salisbury@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
Jason Gunthorpe <jgg@ziepe.ca>,
John Hubbard <jhubbard@nvidia.com>, Peter Xu <peterx@redhat.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
Barry Song <baohua@kernel.org>,
linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
Date: Thu, 9 Apr 2026 19:24:35 +0100 [thread overview]
Message-ID: <adfqY3ayAHvACm4q@lucifer> (raw)
In-Reply-To: <639f20f3-9e65-4117-af9b-e37af0829847@kernel.org>
On Wed, Apr 08, 2026 at 10:09:23AM +0200, David Hildenbrand (Arm) wrote:
> >>
> >> It was also found that adding '--mremap-numa' changes the behavior
> >> substantially:
> >
> > "assign memory mapped pages to randomly selected NUMA nodes. This is
> > disabled for systems that do not support NUMA."
> >
> > so this is just sharding your lock contention across your NUMA nodes (you
> > have an lruvec per node).
> >
> >>
> >> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
> >> --metrics-brief
> >>
> >> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
> >>
> >> So it's possible that either actual swapping, or the mbind(...,
> >> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
> >> system time.
> >>
> >> Does this look like a known MM scalability issue around short-lived
> >> MAP_POPULATE / munmap churn?
> >
> > Yes. Is this an actual issue on some workload?
>
> Same thought, it's unclear to me why we should care here. In particular,
> when talking about excessive use of zero-filled pages.
Yup, I fear that this might also be misleading - stress-ng is designed to
saturate.
When swapping is enabled, it ends up rate-limited by I/O (there is simultaneous
MADV_PAGEOUT occurring).
Then you see lower systime because... the system is sleeping more :)
The zero pages patch stops all that, so you throttle on the next thing - the
lruvec lock.
If you distribute across NUMA nodes rather than not at all (the default), you
naturally spread the load evenly across lruvec locks, because they're per-node
(and per-memcg).
So all this is arbitrary; it is essentially asking 'what do I rate-limit on?'
And 'optimising' things to give different outcomes, especially on metrics like
system time, doesn't really make sense.
If you absolutely hammer the hell out of the populate/unmap paths, unevenly over
NUMA nodes, you'll see system time explode because now you're hammering the
lruvec lock, which is a spinlock (it has to be, since it can be taken in irq
context).
You're not actually asking 'how fast is this in a real workload?' or even 'how
fast is this microbenchmark?', you're asking 'what does saturating this look
like?'.
So it's rather asking the wrong question, I fear, and is one reason why
stress-ng-as-benchmark has to be treated with caution.
I would definitely recommend examining any underlying real-world workload that
is triggering the issue rather than stress-ng, and then examining closely what's
going on there.
This whole thing might be unfortunately misleading, as you observe saturation of
the lruvec lock, but in reality it might simply be a manifestation of:
- syscalls on the hotpath
- not distributing work sensibly over NUMA nodes
Perhaps it is indeed an issue with the lruvec that needs attention, but with a
real-world use case we can perhaps be a little more sure it's that rather than
stress-ng doing its thing :)
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
Thread overview: 19+ messages
2026-04-07 20:09 [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths Joseph Salisbury
2026-04-07 21:47 ` Pedro Falcato
2026-04-08 8:09 ` David Hildenbrand (Arm)
2026-04-08 14:27 ` [External] : " Joseph Salisbury
2026-04-09 16:37 ` Haakon Bugge
2026-04-09 17:26 ` Joseph Salisbury
2026-04-10 10:43 ` [External] : " Pedro Falcato
2026-04-09 18:24 ` Lorenzo Stoakes [this message]
2026-04-09 21:59 ` Barry Song
2026-04-10 10:30 ` Pedro Falcato
2026-04-11 9:09 ` Barry Song
2026-04-07 22:44 ` John Hubbard
2026-04-08 0:35 ` Hugh Dickins
2026-04-09 18:03 ` Lorenzo Stoakes
2026-04-09 18:12 ` John Hubbard
2026-04-09 18:20 ` David Hildenbrand (Arm)
2026-04-09 18:47 ` Lorenzo Stoakes
2026-04-09 18:15 ` Haakon Bugge
2026-04-09 18:43 ` Lorenzo Stoakes