* [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
@ 2026-04-07 20:09 Joseph Salisbury
2026-04-07 21:47 ` Pedro Falcato
2026-04-07 22:44 ` John Hubbard
0 siblings, 2 replies; 15+ messages in thread
From: Joseph Salisbury @ 2026-04-07 20:09 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Chris Li, Kairui Song
Cc: Jason Gunthorpe, John Hubbard, Peter Xu, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, linux-mm, LKML
Hello,
I would like to ask for feedback on an MM performance issue triggered by
stress-ng's mremap stressor:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
This was first investigated as a possible regression from 0ca0c24e3211
("mm: store zero pages to be swapped out in a bitmap"), but the current
evidence suggests that commit is mostly exposing an older problem for
this workload rather than directly causing it.
Observed behavior:
The metrics below are in this format:
stressor  bogo ops  real time  usr time  sys time   bogo ops/s     bogo ops/s
                     (secs)     (secs)    (secs)   (real time)  (usr+sys time)
On a 5.15-based kernel, the workload behaves much worse when swapping is
disabled:
swap enabled:
mremap 1660980 31.08 64.78 84.63 53437.09 11116.73
swap disabled:
mremap 40786258 27.94 15.41 15354.79 1459749.43 2653.59
On a 6.12-based kernel with swap enabled, the same high-system-time
behavior is also observed:
mremap 77087729 21.50 29.95 30558.08 3584738.22 2520.19
A recent 7.0-rc5-based mainline build still behaves similarly:
mremap 39208813 28.12 12.34 15318.39 1394408.50 2557.53
So this does not appear to be already fixed upstream.
The current theory is that 0ca0c24e3211 merely exposes the underlying
problem for this zero-page-heavy workload. Before that change, swap-enabled runs
actually swapped pages. After that change, zero pages are stored in the
swap bitmap instead, so the workload behaves much more like the
swap-disabled case.
Perf data supports the idea that the expensive behavior is global LRU
lock contention caused by short-lived populate/unmap churn.
The dominant stacks on the bad cases include:
vm_mmap_pgoff
__mm_populate
populate_vma_page_range
lru_add_drain
folio_batch_move_lru
folio_lruvec_lock_irqsave
native_queued_spin_lock_slowpath
and:
__x64_sys_munmap
__vm_munmap
...
release_pages
folios_put_refs
__page_cache_release
folio_lruvec_relock_irqsave
native_queued_spin_lock_slowpath
It was also found that adding '--mremap-numa' changes the behavior
substantially:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
--metrics-brief
mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
So it's possible that either actual swapping, or the mbind(...,
MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the
excessive system time.
Does this look like a known MM scalability issue around short-lived
MAP_POPULATE / munmap churn?
REPRODUCER:
The issue is reproducible with stress-ng's mremap stressor:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
On older kernels, the bad behavior is easiest to expose by disabling
swap first:
swapoff -a
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
On kernels with 0ca0c24e3211 ("mm: store zero pages to be swapped out in
a bitmap") or newer, the same bad behavior can be seen even with swap
enabled, because this zero-page-heavy workload no longer actually swaps
pages and behaves much like the swap-disabled case.
Typical bad-case behaviour:
- Very large aggregate sys time during a 30s run (for example, ~15000s
or higher)
- Poor bogo ops/s measured against usr+sys time (~2500 range in our tests)
- Perf shows time dominated by:
vm_mmap_pgoff -> __mm_populate -> populate_vma_page_range ->
lru_add_drain
and
munmap -> release_pages -> __page_cache_release
with heavy time in
folio_lruvec_lock_irqsave/native_queued_spin_lock_slowpath
Diagnostic variant:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
--metrics-brief
That variant greatly reduces the excessive system time, which is one of
the clues that the excessive system time depends on which MM path the
workload takes.
Thanks in advance!
Joe
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-07 20:09 [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths Joseph Salisbury
@ 2026-04-07 21:47 ` Pedro Falcato
2026-04-08 8:09 ` David Hildenbrand (Arm)
2026-04-07 22:44 ` John Hubbard
1 sibling, 1 reply; 15+ messages in thread
From: Pedro Falcato @ 2026-04-07 21:47 UTC (permalink / raw)
To: Joseph Salisbury
Cc: Andrew Morton, David Hildenbrand, Chris Li, Kairui Song,
Jason Gunthorpe, John Hubbard, Peter Xu, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, linux-mm, LKML
Hi,
On Tue, Apr 07, 2026 at 04:09:20PM -0400, Joseph Salisbury wrote:
> Hello,
>
> I would like to ask for feedback on an MM performance issue triggered by
> stress-ng's mremap stressor:
>
> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
>
> This was first investigated as a possible regression from 0ca0c24e3211 ("mm:
> store zero pages to be swapped out in a bitmap"), but the current evidence
> suggests that commit is mostly exposing an older problem for this workload
> rather than directly causing it.
>
>
> Observed behavior:
>
> The metrics below are in this format:
> stressor  bogo ops  real time  usr time  sys time   bogo ops/s     bogo ops/s
>                      (secs)     (secs)    (secs)   (real time)  (usr+sys time)
>
> On a 5.15-based kernel, the workload behaves much worse when swapping is
> disabled:
>
> swap enabled:
> mremap 1660980 31.08 64.78 84.63 53437.09 11116.73
>
> swap disabled:
> mremap 40786258 27.94 15.41 15354.79 1459749.43 2653.59
>
> On a 6.12-based kernel with swap enabled, the same high-system-time behavior
> is also observed:
>
> mremap 77087729 21.50 29.95 30558.08 3584738.22 2520.19
>
> A recent 7.0-rc5-based mainline build still behaves similarly:
>
> mremap 39208813 28.12 12.34 15318.39 1394408.50 2557.53
>
> So this does not appear to be already fixed upstream.
>
>
>
> The current theory is that 0ca0c24e3211 exposes this specific
> zero-page-heavy workload. Before that change, swap-enabled runs actually
> swapped pages. After that change, zero pages are stored in the swap bitmap
> instead, so the workload behaves much more like the swap-disabled case.
>
> Perf data supports the idea that the expensive behavior is global LRU lock
> contention caused by short-lived populate/unmap churn.
>
> The dominant stacks on the bad cases include:
>
> vm_mmap_pgoff
> __mm_populate
> populate_vma_page_range
> lru_add_drain
> folio_batch_move_lru
> folio_lruvec_lock_irqsave
> native_queued_spin_lock_slowpath
>
> and:
>
> __x64_sys_munmap
> __vm_munmap
> ...
> release_pages
> folios_put_refs
> __page_cache_release
> folio_lruvec_relock_irqsave
> native_queued_spin_lock_slowpath
>
Yes, this is a known problem. The lruvec locks are gigantic and, despite
the LRU cache in front, they are still problematic. It might be argued that the
current cache is downright useless for populate as it's too small to contain
a significant number of folios. Perhaps worth thinking about, but not trivial
to change given the way things are structured + the way folio batches work.
You should be able to see this on any workload that does lots of page faulting
or population (it does not depend on mremap at all).
>
>
> It was also found that adding '--mremap-numa' changes the behavior
> substantially:
"assign memory mapped pages to randomly selected NUMA nodes. This is
disabled for systems that do not support NUMA."
so this is just sharding your lock contention across your NUMA nodes (you
have an lruvec per node).
>
> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
> --metrics-brief
>
> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
>
> So it's possible that either actual swapping, or the mbind(...,
> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
> system time.
>
> Does this look like a known MM scalability issue around short-lived
> MAP_POPULATE / munmap churn?
Yes. Is this an actual issue on some workload?
--
Pedro
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-07 20:09 [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths Joseph Salisbury
2026-04-07 21:47 ` Pedro Falcato
@ 2026-04-07 22:44 ` John Hubbard
2026-04-08 0:35 ` Hugh Dickins
1 sibling, 1 reply; 15+ messages in thread
From: John Hubbard @ 2026-04-07 22:44 UTC (permalink / raw)
To: Joseph Salisbury, Andrew Morton, David Hildenbrand, Chris Li,
Kairui Song, Hugh Dickins
Cc: Jason Gunthorpe, Peter Xu, Kemeng Shi, Nhat Pham, Baoquan He,
Barry Song, linux-mm, LKML
On 4/7/26 1:09 PM, Joseph Salisbury wrote:
> Hello,
>
> I would like to ask for feedback on an MM performance issue triggered by
> stress-ng's mremap stressor:
>
> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
>
> This was first investigated as a possible regression from 0ca0c24e3211
> ("mm: store zero pages to be swapped out in a bitmap"), but the current
> evidence suggests that commit is mostly exposing an older problem for
> this workload rather than directly causing it.
>
Can you try this out? (Adding Hugh to Cc.)
From: John Hubbard <jhubbard@nvidia.com>
Date: Tue, 7 Apr 2026 15:33:47 -0700
Subject: [PATCH] mm/gup: skip lru_add_drain() for non-locked populate
X-NVConfidentiality: public
Cc: John Hubbard <jhubbard@nvidia.com>
populate_vma_page_range() calls lru_add_drain() unconditionally after
__get_user_pages(). With high-frequency single-page MAP_POPULATE/munmap
cycles at high thread counts, this forces a lruvec->lru_lock acquire
per page, defeating per-CPU folio_batch batching.
The drain was added by commit ece369c7e104 ("mm/munlock: add
lru_add_drain() to fix memcg_stat_test") for VM_LOCKED populate, where
unevictable page stats must be accurate after faulting. Non-locked VMAs
have no such requirement. Skip the drain for them.
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
mm/gup.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index 8e7dc2c6ee73..2dd5de1cb5b9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1816,6 +1816,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
unsigned long nr_pages = (end - start) / PAGE_SIZE;
int local_locked = 1;
+ bool need_drain;
int gup_flags;
long ret;
@@ -1857,9 +1858,19 @@ long populate_vma_page_range(struct vm_area_struct *vma,
* We made sure addr is within a VMA, so the following will
* not result in a stack expansion that recurses back here.
*/
+ /*
+ * Read VM_LOCKED before __get_user_pages(), which may drop
+ * mmap_lock when FOLL_UNLOCKABLE is set, after which the vma
+ * must not be accessed. The read is stable: mmap_lock is held
+ * for read here, so mlock() (which needs the write lock)
+ * cannot change VM_LOCKED concurrently.
+ */
+ need_drain = vma->vm_flags & VM_LOCKED;
+
ret = __get_user_pages(mm, start, nr_pages, gup_flags,
NULL, locked ? locked : &local_locked);
- lru_add_drain();
+ if (need_drain)
+ lru_add_drain();
return ret;
}
base-commit: 3036cd0d3328220a1858b1ab390be8b562774e8a
--
2.53.0
thanks,
--
John Hubbard
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-07 22:44 ` John Hubbard
@ 2026-04-08 0:35 ` Hugh Dickins
2026-04-09 18:03 ` Lorenzo Stoakes
0 siblings, 1 reply; 15+ messages in thread
From: Hugh Dickins @ 2026-04-08 0:35 UTC (permalink / raw)
To: John Hubbard
Cc: Joseph Salisbury, Andrew Morton, David Hildenbrand, Chris Li,
Kairui Song, Hugh Dickins, Jason Gunthorpe, Peter Xu, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, linux-mm, LKML
On Tue, 7 Apr 2026, John Hubbard wrote:
> On 4/7/26 1:09 PM, Joseph Salisbury wrote:
> > Hello,
> >
> > I would like to ask for feedback on an MM performance issue triggered by
> > stress-ng's mremap stressor:
> >
> > stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
> >
> > This was first investigated as a possible regression from 0ca0c24e3211
> > ("mm: store zero pages to be swapped out in a bitmap"), but the current
> > evidence suggests that commit is mostly exposing an older problem for
> > this workload rather than directly causing it.
> >
>
> Can you try this out? (Adding Hugh to Cc.)
>
> From: John Hubbard <jhubbard@nvidia.com>
> Date: Tue, 7 Apr 2026 15:33:47 -0700
> Subject: [PATCH] mm/gup: skip lru_add_drain() for non-locked populate
> X-NVConfidentiality: public
> Cc: John Hubbard <jhubbard@nvidia.com>
>
> populate_vma_page_range() calls lru_add_drain() unconditionally after
> __get_user_pages(). With high-frequency single-page MAP_POPULATE/munmap
> cycles at high thread counts, this forces a lruvec->lru_lock acquire
> per page, defeating per-CPU folio_batch batching.
>
> The drain was added by commit ece369c7e104 ("mm/munlock: add
> lru_add_drain() to fix memcg_stat_test") for VM_LOCKED populate, where
> unevictable page stats must be accurate after faulting. Non-locked VMAs
> have no such requirement. Skip the drain for them.
>
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Thanks for the Cc. I'm not convinced that we should be making such a
change, just to avoid the stress that an avowed stresstest is showing;
but can let others debate that - and, need it be said, I have no
problem with Joseph trying your patch.
I tend to stand by my comment in that commit, that it's not just for
VM_LOCKED: I believe it's in everyone's interest that a bulk faulting
interface like populate_vma_page_range() or faultin_vma_page_range()
should drain its local pagevecs at the end, to save others sometimes
needing the much more expensive lru_add_drain_all().
But lru_add_drain() and lru_add_drain_all(): there's so much to be
said and agonized over there. They've distressed me for years, and
are a hot topic for us at present. But I won't be able to contribute
more on that subject, not this week.
Hugh
> ---
> mm/gup.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 8e7dc2c6ee73..2dd5de1cb5b9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1816,6 +1816,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> struct mm_struct *mm = vma->vm_mm;
> unsigned long nr_pages = (end - start) / PAGE_SIZE;
> int local_locked = 1;
> + bool need_drain;
> int gup_flags;
> long ret;
>
> @@ -1857,9 +1858,19 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> * We made sure addr is within a VMA, so the following will
> * not result in a stack expansion that recurses back here.
> */
> + /*
> + * Read VM_LOCKED before __get_user_pages(), which may drop
> + * mmap_lock when FOLL_UNLOCKABLE is set, after which the vma
> + * must not be accessed. The read is stable: mmap_lock is held
> + * for read here, so mlock() (which needs the write lock)
> + * cannot change VM_LOCKED concurrently.
> + */
> + need_drain = vma->vm_flags & VM_LOCKED;
> +
> ret = __get_user_pages(mm, start, nr_pages, gup_flags,
> NULL, locked ? locked : &local_locked);
> - lru_add_drain();
> + if (need_drain)
> + lru_add_drain();
> return ret;
> }
>
>
> base-commit: 3036cd0d3328220a1858b1ab390be8b562774e8a
> --
> 2.53.0
>
>
> thanks,
> --
> John Hubbard
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-07 21:47 ` Pedro Falcato
@ 2026-04-08 8:09 ` David Hildenbrand (Arm)
2026-04-08 14:27 ` [External] : " Joseph Salisbury
2026-04-09 18:24 ` Lorenzo Stoakes
0 siblings, 2 replies; 15+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-08 8:09 UTC (permalink / raw)
To: Pedro Falcato, Joseph Salisbury
Cc: Andrew Morton, Chris Li, Kairui Song, Jason Gunthorpe,
John Hubbard, Peter Xu, Kemeng Shi, Nhat Pham, Baoquan He,
Barry Song, linux-mm, LKML
>>
>> It was also found that adding '--mremap-numa' changes the behavior
>> substantially:
>
> "assign memory mapped pages to randomly selected NUMA nodes. This is
> disabled for systems that do not support NUMA."
>
> so this is just sharding your lock contention across your NUMA nodes (you
> have an lruvec per node).
>
>>
>> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
>> --metrics-brief
>>
>> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
>>
>> So it's possible that either actual swapping, or the mbind(...,
>> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
>> system time.
>>
>> Does this look like a known MM scalability issue around short-lived
>> MAP_POPULATE / munmap churn?
>
> Yes. Is this an actual issue on some workload?
Same thought; it's unclear to me why we should care here, in particular
when talking about excessive use of zero-filled pages.
--
Cheers,
David
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [External] : Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-08 8:09 ` David Hildenbrand (Arm)
@ 2026-04-08 14:27 ` Joseph Salisbury
2026-04-09 16:37 ` Haakon Bugge
2026-04-09 18:24 ` Lorenzo Stoakes
1 sibling, 1 reply; 15+ messages in thread
From: Joseph Salisbury @ 2026-04-08 14:27 UTC (permalink / raw)
To: David Hildenbrand (Arm), Pedro Falcato
Cc: Andrew Morton, Chris Li, Kairui Song, Jason Gunthorpe,
John Hubbard, Peter Xu, Kemeng Shi, Nhat Pham, Baoquan He,
Barry Song, linux-mm, LKML
On 4/8/26 4:09 AM, David Hildenbrand (Arm) wrote:
>>> It was also found that adding '--mremap-numa' changes the behavior
>>> substantially:
>> "assign memory mapped pages to randomly selected NUMA nodes. This is
>> disabled for systems that do not support NUMA."
>>
>> so this is just sharding your lock contention across your NUMA nodes (you
>> have an lruvec per node).
>>
>>> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
>>> --metrics-brief
>>>
>>> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
>>>
>>> So it's possible that either actual swapping, or the mbind(...,
>>> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
>>> system time.
>>>
>>> Does this look like a known MM scalability issue around short-lived
>>> MAP_POPULATE / munmap churn?
>> Yes. Is this an actual issue on some workload?
> Same thought, it's unclear to me why we should care here. In particular,
> when talking about excessive use of zero-filled pages.
>
Currently this is only showing up with that particular stress test. We
will try John's patch and provide feedback.
Thanks for all the feedback, everyone!
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [External] : Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-08 14:27 ` [External] : " Joseph Salisbury
@ 2026-04-09 16:37 ` Haakon Bugge
2026-04-09 17:26 ` Joseph Salisbury
0 siblings, 1 reply; 15+ messages in thread
From: Haakon Bugge @ 2026-04-09 16:37 UTC (permalink / raw)
To: Joseph Salisbury
Cc: David Hildenbrand (Arm), Pedro Falcato, Andrew Morton, Chris Li,
Kairui Song, Jason Gunthorpe, John Hubbard, Peter Xu, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, linux-mm@kvack.org, LKML
> On 8 Apr 2026, at 16:27, Joseph Salisbury <joseph.salisbury@oracle.com> wrote:
>
>
>
> On 4/8/26 4:09 AM, David Hildenbrand (Arm) wrote:
>>>> It was also found that adding '--mremap-numa' changes the behavior
>>>> substantially:
>>> "assign memory mapped pages to randomly selected NUMA nodes. This is
>>> disabled for systems that do not support NUMA."
>>>
>>> so this is just sharding your lock contention across your NUMA nodes (you
>>> have an lruvec per node).
>>>
>>>> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
>>>> --metrics-brief
>>>>
>>>> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
>>>>
>>>> So it's possible that either actual swapping, or the mbind(...,
>>>> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
>>>> system time.
>>>>
>>>> Does this look like a known MM scalability issue around short-lived
>>>> MAP_POPULATE / munmap churn?
>>> Yes. Is this an actual issue on some workload?
>> Same thought, it's unclear to me why we should care here. In particular,
>> when talking about excessive use of zero-filled pages.
>>
> Currently this is only showing up with that particular stress test. We will try John's patch and provide feedback.
>
> Thanks for all the feedback, everyone!
I reported this internally and have worked with Joseph on it. I tested v7.0-rc7-68-g7f87a5ea75f01 ("-", "Base") vs. the same tree plus John Hubbard's patch ("+", "Test").
Stress-ng command: stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
System is an AMD EPYC 9J45:
NUMA node(s): 2
NUMA node0 CPU(s): 0-127,256-383
NUMA node1 CPU(s): 128-255,384-511
The stress-ng command was run ten times and here are the averages and pstdev:
    bogo ops/s   pstdev   system time   pstdev
    (realtime)
--------------------------------------------------
-    3192638      35%        24041       32%
+    3657904       5%        15278        0%
This is a 15% improvement in bogo ops/s (realtime) and a decent 36% reduction in system time.
I shamelessly copied and modified the fio command from [1]. I ran:
# fio -filename=/dev/nvme0n1 -direct=0 -thread -size=1024G -rwmixwrite=30 \
--norandommap --randrepeat=0 -ioengine=mmap -bs=4k -numjobs=1024 -runtime=3600 \
--time_based -group_reporting -name=mytest
(that is, one hour runtime)
- read: IOPS=14.0M, BW=53.4GiB/s (57.3GB/s)(188TiB/3608413msec)
+ read: IOPS=16.0M, BW=61.2GiB/s (65.7GB/s)(215TiB/3600051msec)
- READ: bw=53.4GiB/s (57.3GB/s), 53.4GiB/s-53.4GiB/s (57.3GB/s-57.3GB/s), io=188TiB (207TB), run=3608413-3608413msec
+ READ: bw=61.2GiB/s (65.7GB/s), 61.2GiB/s-61.2GiB/s (65.7GB/s-65.7GB/s), io=215TiB (237TB), run=3600051-3600051msec
Also, running Base, I see tons of:
Jobs: 726 (f=726): [_(2),R(1),_(1),R(3),_(4),R(6),_(1),R(2),_(2),R(2),_(3),R(1),_(5),R(2),_(1),R(2),_(1),R(1),_(2),R(2),_(1),R(1),_(1),R(2),_(1),R(3),_(1),R(3),_(1),R(1),_(1),R(1),_(1),R(1),_(1),R(3),_(1),R(3),_(1),R(1),_(3),R(1),_(1),R(5),_(1),R(5),_(1),R(1),_(2),R(1),_(4),R(2),_(1),R(3),_(1),R(3),_(1),R(1),_(2),R(1),_(1),R(8),_(1),R(4),_(1),R(3),_(1),R(1),_(1),R(2),_(1),R(7),_(2),R(2)
when the fio test terminates, which I do not see using Test. I take that to mean the threads do not terminate in a timely manner on the Base kernel.
Thxs, Håkon
[1] https://lkml.org/lkml/2024/7/3/1049
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-09 16:37 ` Haakon Bugge
@ 2026-04-09 17:26 ` Joseph Salisbury
0 siblings, 0 replies; 15+ messages in thread
From: Joseph Salisbury @ 2026-04-09 17:26 UTC (permalink / raw)
To: Haakon Bugge
Cc: David Hildenbrand (Arm), Pedro Falcato, Andrew Morton, Chris Li,
Kairui Song, Jason Gunthorpe, John Hubbard, Peter Xu, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, linux-mm@kvack.org, LKML,
Lorenzo Stoakes
On 4/9/26 12:37 PM, Haakon Bugge wrote:
>> On 8 Apr 2026, at 16:27, Joseph Salisbury <joseph.salisbury@oracle.com> wrote:
>>
>>
>>
>> On 4/8/26 4:09 AM, David Hildenbrand (Arm) wrote:
>>>>> It was also found that adding '--mremap-numa' changes the behavior
>>>>> substantially:
>>>> "assign memory mapped pages to randomly selected NUMA nodes. This is
>>>> disabled for systems that do not support NUMA."
>>>>
>>>> so this is just sharding your lock contention across your NUMA nodes (you
>>>> have an lruvec per node).
>>>>
>>>>> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
>>>>> --metrics-brief
>>>>>
>>>>> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
>>>>>
>>>>> So it's possible that either actual swapping, or the mbind(...,
>>>>> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
>>>>> system time.
>>>>>
>>>>> Does this look like a known MM scalability issue around short-lived
>>>>> MAP_POPULATE / munmap churn?
>>>> Yes. Is this an actual issue on some workload?
>>> Same thought, it's unclear to me why we should care here. In particular,
>>> when talking about excessive use of zero-filled pages.
>>>
>> Currently this is only showing up with that particular stress test. We will try John's patch and provide feedback.
>>
>> Thanks for all the feedback, everyone!
> I reported this internally and have worked with Joseph on it. I tested v7.0-rc7-68-g7f87a5ea75f01 ("-"), "Base", vs. ditto plus John Hubbard's patch ("+"), "Test".
>
> Stress-ng command: stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
>
> System is an AMD EPYC 9J45:
> NUMA node(s): 2
> NUMA node0 CPU(s): 0-127,256-383
> NUMA node1 CPU(s): 128-255,384-511
>
> The stress-ng command was run ten times and here are the averages and pstdev:
>
>     bogo ops/s   pstdev   system time   pstdev
>     (realtime)
> --------------------------------------------------
> -    3192638      35%        24041       32%
> +    3657904       5%        15278        0%
>
> This is 15% improvement in bogo ops/s (realtime) and a decent 36% reduction in system time.
>
> I shamelessly copied and modified the fio command from [1]. I ran:
>
> # fio -filename=/dev/nvme0n1 -direct=0 -thread -size=1024G -rwmixwrite=30 \
> --norandommap --randrepeat=0 -ioengine=mmap -bs=4k -numjobs=1024 -runtime=3600 \
> --time_based -group_reporting -name=mytest
>
> (that is, one hour runtime)
>
> - read: IOPS=14.0M, BW=53.4GiB/s (57.3GB/s)(188TiB/3608413msec)
> + read: IOPS=16.0M, BW=61.2GiB/s (65.7GB/s)(215TiB/3600051msec)
> - READ: bw=53.4GiB/s (57.3GB/s), 53.4GiB/s-53.4GiB/s (57.3GB/s-57.3GB/s), io=188TiB (207TB), run=3608413-3608413msec
> + READ: bw=61.2GiB/s (65.7GB/s), 61.2GiB/s-61.2GiB/s (65.7GB/s-65.7GB/s), io=215TiB (237TB), run=3600051-3600051msec
>
> Also, running Base, I see tons of:
>
> Jobs: 726 (f=726): [_(2),R(1),_(1),R(3),_(4),R(6),_(1),R(2),_(2),R(2),_(3),R(1),_(5),R(2),_(1),R(2),_(1),R(1),_(2),R(2),_(1),R(1),_(1),R(2),_(1),R(3),_(1),R(3),_(1),R(1),_(1),R(1),_(1),R(1),_(1),R(3),_(1),R(3),_(1),R(1),_(3),R(1),_(1),R(5),_(1),R(5),_(1),R(1),_(2),R(1),_(4),R(2),_(1),R(3),_(1),R(3),_(1),R(1),_(2),R(1),_(1),R(8),_(1),R(4),_(1),R(3),_(1),R(1),_(1),R(2),_(1),R(7),_(2),R(2)
>
> when the fio test terminates, which I do not see using Test. I take that as the threads do not terminate timely using the Base kernel.
>
>
> Thxs, Håkon
>
>
> [1] https://lkml.org/lkml/2024/7/3/1049
>
>
Adding Lorenzo Stoakes to Cc.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-08 0:35 ` Hugh Dickins
@ 2026-04-09 18:03 ` Lorenzo Stoakes
2026-04-09 18:12 ` John Hubbard
2026-04-09 18:15 ` Haakon Bugge
0 siblings, 2 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2026-04-09 18:03 UTC (permalink / raw)
To: Hugh Dickins
Cc: John Hubbard, Joseph Salisbury, Andrew Morton, David Hildenbrand,
Chris Li, Kairui Song, Jason Gunthorpe, Peter Xu, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, linux-mm, LKML
On Tue, Apr 07, 2026 at 05:35:18PM -0700, Hugh Dickins wrote:
> On Tue, 7 Apr 2026, John Hubbard wrote:
> > On 4/7/26 1:09 PM, Joseph Salisbury wrote:
> > > Hello,
> > >
> > > I would like to ask for feedback on an MM performance issue triggered by
> > > stress-ng's mremap stressor:
> > >
> > > stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief
> > >
> > > This was first investigated as a possible regression from 0ca0c24e3211
> > > ("mm: store zero pages to be swapped out in a bitmap"), but the current
> > > evidence suggests that commit is mostly exposing an older problem for
> > > this workload rather than directly causing it.
> > >
> >
> > Can you try this out? (Adding Hugh to Cc.)
> >
> > From: John Hubbard <jhubbard@nvidia.com>
> > Date: Tue, 7 Apr 2026 15:33:47 -0700
> > Subject: [PATCH] mm/gup: skip lru_add_drain() for non-locked populate
> > X-NVConfidentiality: public
> > Cc: John Hubbard <jhubbard@nvidia.com>
> >
> > populate_vma_page_range() calls lru_add_drain() unconditionally after
> > __get_user_pages(). With high-frequency single-page MAP_POPULATE/munmap
> > cycles at high thread counts, this forces a lruvec->lru_lock acquire
> > per page, defeating per-CPU folio_batch batching.
> >
> > The drain was added by commit ece369c7e104 ("mm/munlock: add
> > lru_add_drain() to fix memcg_stat_test") for VM_LOCKED populate, where
> > unevictable page stats must be accurate after faulting. Non-locked VMAs
> > have no such requirement. Skip the drain for them.
> >
> > Cc: Hugh Dickins <hughd@google.com>
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>
> Thanks for the Cc. I'm not convinced that we should be making such a
> change, just to avoid the stress that an avowed stresstest is showing;
> but can let others debate that - and, need it be said, I have no
> problem with Joseph trying your patch.
Yeah, the test case (as said by others also) is rather synthetic, and it's a
test designed to saturate: if not I/O-throttled by swap, then we hammer the
populate path. It feels like a micro-optimisation for something that is not (at
least not yet demonstrated to be) an actual problem.
stress-ng is not a benchmarking tool per se; it's designed to eke out bugs.
So really we need to see a real-world case I think.
>
> I tend to stand by my comment in that commit, that it's not just for
> VM_LOCKED: I believe it's in everyone's interest that a bulk faulting
> interface like populate_vma_page_range() or faultin_vma_page_range()
> should drain its local pagevecs at the end, to save others sometimes
> needing the much more expensive lru_add_drain_all().
I mean yeah, but I guess anywhere that _really_ needs to be sure of the drain
has to do an lru_add_drain_all(), because it'd be fragile to rely on
lru_add_drain()'s being done at the right time?
>
> But lru_add_drain() and lru_add_drain_all(): there's so much to be
> said and agonized over there They've distressed me for years, and
> are a hot topic for us at present. But I won't be able to contribute
> more on that subject, not this week.
Yeah they do feel rather delicate... :) sometimes you _really do_ need to know
everything's drained. But other times it feels a bit whack-a-mole.
I also do agree it makes sense to drain locally after a batch operation.
It all comes down to whether this manifests in a real-world case, at which point
maybe this is a more useful change?
>
> Hugh
>
> > ---
> > mm/gup.c | 13 ++++++++++++-
> > 1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 8e7dc2c6ee73..2dd5de1cb5b9 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1816,6 +1816,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> > struct mm_struct *mm = vma->vm_mm;
> > unsigned long nr_pages = (end - start) / PAGE_SIZE;
> > int local_locked = 1;
> > + bool need_drain;
> > int gup_flags;
> > long ret;
> >
> > @@ -1857,9 +1858,19 @@ long populate_vma_page_range(struct vm_area_struct *vma,
> > * We made sure addr is within a VMA, so the following will
> > * not result in a stack expansion that recurses back here.
> > */
> > + /*
> > + * Read VM_LOCKED before __get_user_pages(), which may drop
> > + * mmap_lock when FOLL_UNLOCKABLE is set, after which the vma
> > + * must not be accessed. The read is stable: mmap_lock is held
> > + * for read here, so mlock() (which needs the write lock)
> > + * cannot change VM_LOCKED concurrently.
> > + */
BTW, not to nitpick (OK, maybe to nitpick :) this comment feels a bit
redundant. Maybe useful to note that the lock might be dropped (but you don't
indicate why it's OK to still assume state about the VMA), and it's a known
thing that you need a VMA write lock to alter flags, if we had to comment this
each time mm would be mostly comments :)
So if you want a comment here I'd say something like 'the lock might be dropped
due to FOLL_UNLOCKABLE, but that's ok, we would simply end up doing a redundant
drain in this case'.
But I'm not sure it's needed?
> > + need_drain = vma->vm_flags & VM_LOCKED;
Please use the new VMA flag interface :)
need_drain = vma_test(VMA_LOCKED_BIT);
> > +
> > ret = __get_user_pages(mm, start, nr_pages, gup_flags,
> > NULL, locked ? locked : &local_locked);
> > - lru_add_drain();
> > + if (need_drain)
> > + lru_add_drain();
> > return ret;
> > }
> >
> >
> > base-commit: 3036cd0d3328220a1858b1ab390be8b562774e8a
> > --
> > 2.53.0
> >
> >
> > thanks,
> > --
> > John Hubbard
>
Cheers, Lorenzo
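[Editor's note: combining Lorenzo's two review comments above, John's hunk would reduce to something like the sketch below. This is an illustration, not a revision posted to the thread; `vma_test(VMA_LOCKED_BIT)` is the interface Lorenzo names above, and the fragment assumes the surrounding kernel context of populate_vma_page_range(), so it is not compilable standalone.]

```c
	/*
	 * The lock might be dropped due to FOLL_UNLOCKABLE, but that's
	 * OK: we would simply end up doing a redundant drain then.
	 */
	need_drain = vma_test(VMA_LOCKED_BIT);

	ret = __get_user_pages(mm, start, nr_pages, gup_flags,
			       NULL, locked ? locked : &local_locked);
	/* Only VM_LOCKED populate needs the immediate per-CPU drain. */
	if (need_drain)
		lru_add_drain();
	return ret;
```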
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-09 18:03 ` Lorenzo Stoakes
@ 2026-04-09 18:12 ` John Hubbard
2026-04-09 18:20 ` David Hildenbrand (Arm)
2026-04-09 18:47 ` Lorenzo Stoakes
2026-04-09 18:15 ` Haakon Bugge
1 sibling, 2 replies; 15+ messages in thread
From: John Hubbard @ 2026-04-09 18:12 UTC (permalink / raw)
To: Lorenzo Stoakes, Hugh Dickins
Cc: Joseph Salisbury, Andrew Morton, David Hildenbrand, Chris Li,
Kairui Song, Jason Gunthorpe, Peter Xu, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, linux-mm, LKML
On 4/9/26 11:03 AM, Lorenzo Stoakes wrote:
> On Tue, Apr 07, 2026 at 05:35:18PM -0700, Hugh Dickins wrote:
>> On Tue, 7 Apr 2026, John Hubbard wrote:
>>> On 4/7/26 1:09 PM, Joseph Salisbury wrote:
...
>> Thanks for the Cc. I'm not convinced that we should be making such a
>> change, just to avoid the stress that an avowed stresstest is showing;
>> but can let others debate that - and, need it be said, I have no
>> problem with Joseph trying your patch.
>
> Yeah, the test case (as said by others also) is rather synthetic, and it's a
> test designed to saturate, if not I/O throttled by swap then we hammer the
> populate path. It feels like a micro-optimisation for something that is not (at
> least not yet demonstrated to be) an actual problem.
>
> stress-ng is not a benchmarking tool per se, it's designed to eke out bugs.
>
> So really we need to see a real-world case I think.
Absolutely. And to be honest, I saw "Oracle" and recalled that they are
always doing things with zillions of threads, so I assumed that a real
world case was waiting right behind this. But maybe not, after all?
...
>>> + * Read VM_LOCKED before __get_user_pages(), which may drop
>>> + * mmap_lock when FOLL_UNLOCKABLE is set, after which the vma
>>> + * must not be accessed. The read is stable: mmap_lock is held
>>> + * for read here, so mlock() (which needs the write lock)
>>> + * cannot change VM_LOCKED concurrently.
>>> + */
>
> BTW, not to nitpick (OK, maybe to nitpick :) this comment feels a bit
> redundant. Maybe useful to note that the lock might be dropped (but you don't
> indicate why it's OK to still assume state about the VMA), and it's a known
> thing that you need a VMA write lock to alter flags; if we had to comment this
> each time, mm would be mostly comments :)
>
> So if you want a comment here I'd say something like 'the lock might be dropped
> due to FOLL_UNLOCKABLE, but that's ok, we would simply end up doing a redundant
> drain in this case'.
>
> But I'm not sure it's needed?
I'm OK with just dropping the whole comment. I've lost my way lately with
comment density. :)
>
>>> + need_drain = vma->vm_flags & VM_LOCKED;
>
> Please use the new VMA flag interface :)
Oops, yes.
I'm on the fence about whether to post an updated version of this.
Maybe wait until someone pops up with a real need for it?
Thoughts?
thanks,
--
John Hubbard
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-09 18:03 ` Lorenzo Stoakes
2026-04-09 18:12 ` John Hubbard
@ 2026-04-09 18:15 ` Haakon Bugge
2026-04-09 18:43 ` Lorenzo Stoakes
1 sibling, 1 reply; 15+ messages in thread
From: Haakon Bugge @ 2026-04-09 18:15 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Hugh Dickins, John Hubbard, Joseph Salisbury, Andrew Morton,
David Hildenbrand, Chris Li, Kairui Song, Jason Gunthorpe,
Peter Xu, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
linux-mm@kvack.org, LKML
> On 9 Apr 2026, at 20:03, Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> ...
I think we all agree that the stress-ng test case is synthetic. I evaluated John's patch as I understood that was requested, and the outcome was simply as expected.
The fio case is more interesting, as, if my runs make sense, it improves IOPS by ~20% and avoids threads being stuck at termination. But I am not intimate with fio, so take that part with a grain of salt.
Thxs, Håkon
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-09 18:12 ` John Hubbard
@ 2026-04-09 18:20 ` David Hildenbrand (Arm)
2026-04-09 18:47 ` Lorenzo Stoakes
1 sibling, 0 replies; 15+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-09 18:20 UTC (permalink / raw)
To: John Hubbard, Lorenzo Stoakes, Hugh Dickins
Cc: Joseph Salisbury, Andrew Morton, Chris Li, Kairui Song,
Jason Gunthorpe, Peter Xu, Kemeng Shi, Nhat Pham, Baoquan He,
Barry Song, linux-mm, LKML
>
> Oops, yes.
>
> I'm on the fence about whether to post an updated version of this.
> Maybe wait until someone pops up with a real need for it?
>
> Thoughts?
+1
--
Cheers,
David
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-08 8:09 ` David Hildenbrand (Arm)
2026-04-08 14:27 ` [External] : " Joseph Salisbury
@ 2026-04-09 18:24 ` Lorenzo Stoakes
1 sibling, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2026-04-09 18:24 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Pedro Falcato, Joseph Salisbury, Andrew Morton, Chris Li,
Kairui Song, Jason Gunthorpe, John Hubbard, Peter Xu, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, linux-mm, LKML
On Wed, Apr 08, 2026 at 10:09:23AM +0200, David Hildenbrand (Arm) wrote:
> >>
> >> It was also found that adding '--mremap-numa' changes the behavior
> >> substantially:
> >
> > "assign memory mapped pages to randomly selected NUMA nodes. This is
> > disabled for systems that do not support NUMA."
> >
> > so this is just sharding your lock contention across your NUMA nodes (you
> > have an lruvec per node).
> >
> >>
> >> stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa
> >> --metrics-brief
> >>
> >> mremap 2570798 29.39 8.06 106.23 87466.50 22494.74
> >>
> >> So it's possible that either actual swapping, or the mbind(...,
> >> MPOL_MF_MOVE) path used by '--mremap-numa', removes most of the excessive
> >> system time.
> >>
> >> Does this look like a known MM scalability issue around short-lived
> >> MAP_POPULATE / munmap churn?
> >
> > Yes. Is this an actual issue on some workload?
>
> Same thought, it's unclear to me why we should care here. In particular,
> when talking about excessive use of zero-filled pages.
Yup, I fear that this might also be misleading - stress-ng is designed to
saturate.
When swapping is enabled, it ends up rate-limited by I/O (there is simultaneous
MADV_PAGEOUT occurring).
Then you see lower systime because... the system is sleeping more :)
The zero pages patch stops all that, so you throttle on the next thing - the
lruvec lock.
If you group by NUMA node rather than just not-at-all (the default) you
naturally distribute evenly across lruvec locks, because they're per node (+
memcg whatever).
So all this is arbitrary, it is essentially asking 'what do I rate limit on?'
And 'optimising' things to give different outcomes, esp. on things like system
time, doesn't really make sense.
If you absolutely hammer the hell out of the populate/unmap paths, unevenly over
NUMA nodes, you'll see system time explode because now you're hitting up on the
lruvec lock which is a spinlock (has to be due to possible irq context
invocation).
You're not actually asking 'how fast is this in a real workload?' or even a 'how
fast is this microbenchmark?', you're asking 'what does saturating this look
like?'.
So it's rather asking the wrong question, I fear, and a reason why
stress-ng-as-benchmark has to be treated with caution.
I would definitely recommend examining any underlying real-world workload that
is triggering the issue rather than stress-ng, and then examining closely what's
going on there.
This whole thing might be unfortunately misleading, as you observe saturation of
lruvec lock, but in reality it might simply be a manifestation of:
- syscalls on the hotpath
- not distributing work sensibly over NUMA nodes
Perhaps it is indeed an issue with the lruvec that needs attention, but with a
real world usecase we can perhaps be a little more sure it's that rather than
stress-ng doing its thing :)
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-09 18:15 ` Haakon Bugge
@ 2026-04-09 18:43 ` Lorenzo Stoakes
0 siblings, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2026-04-09 18:43 UTC (permalink / raw)
To: Haakon Bugge
Cc: Hugh Dickins, John Hubbard, Joseph Salisbury, Andrew Morton,
David Hildenbrand, Chris Li, Kairui Song, Jason Gunthorpe,
Peter Xu, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
linux-mm@kvack.org, LKML
On Thu, Apr 09, 2026 at 06:15:50PM +0000, Haakon Bugge wrote:
>
>
> > On 9 Apr 2026, at 20:03, Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > ...
>
> I think we all agree that the stress-ng test case is synthetic. I evaluated John's patch as I understood that was requested, and the outcome was, merely, as expected.
(Please wrap lines :)
Ack re: synthetic.
Thanks for evaluating it! I don't think John's patch is incorrect per se, but as
I said on reply further up thread, I fear this all might be rather a distraction
from a real world perspective, because you'd expect similar results due to
reasons other than the lruvec being a bit *ahem* sub-optimal shall we say.
>
> The fio case is more interesting, as, if my runs make sense, it improves IOPS by ~20% and avoid threads being stuck at termination. But, I am not intimate with fio, so take that part as a grain of salt.
That is interesting, but again I wonder what it's actually measuring, because if
things are getting stuck because of saturation from stress-ng doing insane
things (hammering the hell out of madvise(..., MADV_PAGEOUT), mremap(), munmap()
in the hot path, all while not caring about NUMA node locality), then that's
sort of what you'd expect I guess?
I guess the only way to avoid possibly measuring the wrong thing is to examine a
real-world case, and if there is something lurking there with lruvec scalability
(very possible) then we can definitely look at that!
Thanks for digging into this!
>
>
> Thxs, Håkon
>
>
Cheers, Lorenzo
* Re: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
2026-04-09 18:12 ` John Hubbard
2026-04-09 18:20 ` David Hildenbrand (Arm)
@ 2026-04-09 18:47 ` Lorenzo Stoakes
1 sibling, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2026-04-09 18:47 UTC (permalink / raw)
To: John Hubbard
Cc: Hugh Dickins, Joseph Salisbury, Andrew Morton, David Hildenbrand,
Chris Li, Kairui Song, Jason Gunthorpe, Peter Xu, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, linux-mm, LKML
On Thu, Apr 09, 2026 at 11:12:39AM -0700, John Hubbard wrote:
>
> ...
> >>> + * Read VM_LOCKED before __get_user_pages(), which may drop
> >>> + * mmap_lock when FOLL_UNLOCKABLE is set, after which the vma
> >>> + * must not be accessed. The read is stable: mmap_lock is held
> >>> + * for read here, so mlock() (which needs the write lock)
> >>> + * cannot change VM_LOCKED concurrently.
> >>> + */
> >
> > BTW, not to nitpick (OK, maybe to nitpick :) this comments feels a bit
> > redundant. Maybe useful to note that the lock might be dropped (but you don't
> > indicate why it's OK to still assume state about the VMA), and it's a known
> > thing that you need a VMA write lock to alter flags, if we had to comment this
> > each time mm would be mostly comments :)
> >
> > So if you want a comment here I'd say something like 'the lock might be dropped
> > due to FOLL_UNLOCKABLE, but that's ok, we would simply end up doing a redundant
> > drain in this case'.
> >
> > But I'm not sure it's needed?
>
> I'm OK with just dropping the whole comment. I've lost my way lately with
> comment density. :)
Thanks, yeah I get this wrong myself often - I think tending towards more
comments is the better default :)
>
> >
> >>> + need_drain = vma->vm_flags & VM_LOCKED;
> >
> > Please use the new VMA flag interface :)
>
> Oops, yes.
It's entirely forgivable because this is massively new and pretty much nobody
but me and people who've bumped into it on review are all that aware :P
I mean I'll end up converting it all later anyway.
>
> I'm on the fence about whether to post an updated version of this.
> Maybe wait until someone pops up with a real need for it?
>
> Thoughts?
Yeah, let's wait for a real world case, otherwise we'll never be sure that we're
not solving the wrong problem I think.
>
> thanks,
> --
> John Hubbard
>
Cheers, Lorenzo
Thread overview: 15+ messages
2026-04-07 20:09 [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths Joseph Salisbury
2026-04-07 21:47 ` Pedro Falcato
2026-04-08 8:09 ` David Hildenbrand (Arm)
2026-04-08 14:27 ` [External] : " Joseph Salisbury
2026-04-09 16:37 ` Haakon Bugge
2026-04-09 17:26 ` Joseph Salisbury
2026-04-09 18:24 ` Lorenzo Stoakes
2026-04-07 22:44 ` John Hubbard
2026-04-08 0:35 ` Hugh Dickins
2026-04-09 18:03 ` Lorenzo Stoakes
2026-04-09 18:12 ` John Hubbard
2026-04-09 18:20 ` David Hildenbrand (Arm)
2026-04-09 18:47 ` Lorenzo Stoakes
2026-04-09 18:15 ` Haakon Bugge
2026-04-09 18:43 ` Lorenzo Stoakes