From: Lorenzo Stoakes <ljs@kernel.org>
To: Haakon Bugge <haakon.bugge@oracle.com>
Cc: Hugh Dickins <hughd@google.com>,
John Hubbard <jhubbard@nvidia.com>,
Joseph Salisbury <joseph.salisbury@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
Jason Gunthorpe <jgg@ziepe.ca>, Peter Xu <peterx@redhat.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
Barry Song <baohua@kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
Date: Thu, 9 Apr 2026 19:43:55 +0100 [thread overview]
Message-ID: <adfxfjtZqnxDKkyz@lucifer>
In-Reply-To: <AD19A104-818A-4880-9172-0C8FCA6B4633@oracle.com>
On Thu, Apr 09, 2026 at 06:15:50PM +0000, Haakon Bugge wrote:
>
>
> > On 9 Apr 2026, at 20:03, Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > On Tue, Apr 07, 2026 at 05:35:18PM -0700, Hugh Dickins wrote:
> >> On Tue, 7 Apr 2026, John Hubbard wrote:
> >>> On 4/7/26 1:09 PM, Joseph Salisbury wrote:
> >>>> Hello,
> >>>>
> >>>> I would like to ask for feedback on an MM performance issue triggered by
> >>>> stress-ng's mremap stressor:
> >>>>
> >>>> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
> >>>>
> >>>> This was first investigated as a possible regression from 0ca0c24e3211
> >>>> ("mm: store zero pages to be swapped out in a bitmap"), but the current
> >>>> evidence suggests that commit is mostly exposing an older problem for
> >>>> this workload rather than directly causing it.
> >>>>
> >>>
> >>> Can you try this out? (Adding Hugh to Cc.)
> >>>
> >>> From: John Hubbard <jhubbard@nvidia.com>
> >>> Date: Tue, 7 Apr 2026 15:33:47 -0700
> >>> Subject: [PATCH] mm/gup: skip lru_add_drain() for non-locked populate
> >>> X-NVConfidentiality: public
> >>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>>
> >>> populate_vma_page_range() calls lru_add_drain() unconditionally after
> >>> __get_user_pages(). With high-frequency single-page MAP_POPULATE/munmap
> >>> cycles at high thread counts, this forces a lruvec->lru_lock acquire
> >>> per page, defeating per-CPU folio_batch batching.
> >>>
> >>> The drain was added by commit ece369c7e104 ("mm/munlock: add
> >>> lru_add_drain() to fix memcg_stat_test") for VM_LOCKED populate, where
> >>> unevictable page stats must be accurate after faulting. Non-locked VMAs
> >>> have no such requirement. Skip the drain for them.
> >>>
> >>> Cc: Hugh Dickins <hughd@google.com>
> >>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> >>
> >> Thanks for the Cc. I'm not convinced that we should be making such a
> >> change, just to avoid the stress that an avowed stresstest is showing;
> >> but can let others debate that - and, need it be said, I have no
> >> problem with Joseph trying your patch.
> >
> > Yeah, the test case (as said by others also) is rather synthetic, and it's a
> > test designed to saturate: if we're not I/O-throttled by swap, then we hammer
> > the populate path. It feels like a micro-optimisation for something that is
> > not (at least not yet demonstrated to be) an actual problem.
> >
> > stress-ng is not a benchmarking tool per se, it's designed to eke out bugs.
> >
> > So really we need to see a real-world case I think.
> >
> >>
> >> I tend to stand by my comment in that commit, that it's not just for
> >> VM_LOCKED: I believe it's in everyone's interest that a bulk faulting
> >> interface like populate_vma_page_range() or faultin_vma_page_range()
> >> should drain its local pagevecs at the end, to save others sometimes
> >> needing the much more expensive lru_add_drain_all().
> >
> > I mean yeah, but I guess anywhere that _really_ needs to be sure of the drain
> > has to do an lru_add_drain_all(), because it'd be fragile to rely on
> > lru_add_drain()'s being done at the right time?
> >
> >>
> >> But lru_add_drain() and lru_add_drain_all(): there's so much to be
> >> said and agonized over there. They've distressed me for years, and
> >> are a hot topic for us at present. But I won't be able to contribute
> >> more on that subject, not this week.
> >
> > Yeah they do feel rather delicate... :) sometimes you _really do_ need to know
> > everything's drained. But other times it feels a bit whack-a-mole.
> >
> > I also do agree it makes sense to drain locally after a batch operation.
> >
> > It all comes down to whether this manifests in a real-world case, at which point
> > maybe this is a more useful change?
> >
> >>
> >> Hugh
> >>
> >>> ---
> >>> mm/gup.c | 13 ++++++++++++-
> >>> 1 file changed, 12 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/gup.c b/mm/gup.c
> >>> index 8e7dc2c6ee73..2dd5de1cb5b9 100644
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -1816,6 +1816,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> >>> struct mm_struct *mm = vma->vm_mm;
> >>> unsigned long nr_pages = (end - start) / PAGE_SIZE;
> >>> int local_locked = 1;
> >>> + bool need_drain;
> >>> int gup_flags;
> >>> long ret;
> >>>
> >>> @@ -1857,9 +1858,19 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> >>> * We made sure addr is within a VMA, so the following will
> >>> * not result in a stack expansion that recurses back here.
> >>> */
> >>> + /*
> >>> + * Read VM_LOCKED before __get_user_pages(), which may drop
> >>> + * mmap_lock when FOLL_UNLOCKABLE is set, after which the vma
> >>> + * must not be accessed. The read is stable: mmap_lock is held
> >>> + * for read here, so mlock() (which needs the write lock)
> >>> + * cannot change VM_LOCKED concurrently.
> >>> + */
> >
> > BTW, not to nitpick (OK, maybe to nitpick :) this comment feels a bit
> > redundant. Maybe useful to note that the lock might be dropped (but you don't
> > indicate why it's OK to still assume state about the VMA), and it's a known
> > thing that you need a VMA write lock to alter flags; if we had to comment
> > this each time, mm would be mostly comments :)
> >
> > So if you want a comment here I'd say something like 'the lock might be dropped
> > due to FOLL_UNLOCKABLE, but that's ok, we would simply end up doing a redundant
> > drain in this case'.
> >
> > But I'm not sure it's needed?
> >
> >>> + need_drain = vma->vm_flags & VM_LOCKED;
> >
> > Please use the new VMA flag interface :)
> >
> > need_drain = vma_test(VMA_LOCKED_BIT);
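To make the above concrete, the hunk could then read something like the
following sketch. This is only an illustration, not a tested patch:
vma_test()/VMA_LOCKED_BIT are the names of the new flag-query interface as
suggested above, and the comment is the shortened one proposed:

```c
	/*
	 * mmap_lock may be dropped inside __get_user_pages() due to
	 * FOLL_UNLOCKABLE; that's OK, worst case we simply end up
	 * performing a redundant drain below.
	 */
	need_drain = vma_test(VMA_LOCKED_BIT);
```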
> >
>
> I think we all agree that the stress-ng test case is synthetic. I evaluated John's patch as I understood that was requested, and the outcome was merely as expected.
(Please wrap lines :)
Ack re: synthetic.
Thanks for evaluating it! I don't think John's patch is incorrect per se but, as
I said in a reply further up the thread, I fear this all might be rather a
distraction from a real-world perspective, because you'd expect similar results
for reasons other than the lruvec being a bit, *ahem*, sub-optimal shall we say.
>
> The fio case is more interesting, as, if my runs make sense, it improves IOPS by ~20% and avoids threads being stuck at termination. But I am not intimate with fio, so take that part with a grain of salt.
That is interesting, but again I wonder what it's actually measuring, because if
things are getting stuck because of saturation from stress-ng doing insane
things (hammering the hell out of madvise(..., MADV_PAGEOUT), mremap(), munmap()
in the hot path, all while not caring about NUMA node locality), then that's
sort of what you'd expect I guess?
I guess the only way to avoid possibly measuring the wrong thing is to examine a
real-world case, and if there is something lurking there with lruvec scalability
(very possible) then we can definitely look at that!
Thanks for digging into this!
>
>
> Thxs, Håkon
>
>
Cheers, Lorenzo
Thread overview: 19+ messages [newest: 2026-04-09 18:44 UTC]
2026-04-07 20:09 [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths Joseph Salisbury
2026-04-07 21:47 ` Pedro Falcato
2026-04-08 8:09 ` David Hildenbrand (Arm)
2026-04-08 14:27 ` [External] : " Joseph Salisbury
2026-04-09 16:37 ` Haakon Bugge
2026-04-09 17:26 ` Joseph Salisbury
2026-04-10 10:43 ` [External] : " Pedro Falcato
2026-04-09 18:24 ` Lorenzo Stoakes
2026-04-09 21:59 ` Barry Song
2026-04-10 10:30 ` Pedro Falcato
2026-04-11 9:09 ` Barry Song
2026-04-07 22:44 ` John Hubbard
2026-04-08 0:35 ` Hugh Dickins
2026-04-09 18:03 ` Lorenzo Stoakes
2026-04-09 18:12 ` John Hubbard
2026-04-09 18:20 ` David Hildenbrand (Arm)
2026-04-09 18:47 ` Lorenzo Stoakes
2026-04-09 18:15 ` Haakon Bugge
2026-04-09 18:43 ` Lorenzo Stoakes [this message]