From: Joseph Salisbury <joseph.salisbury@oracle.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
	John Hubbard <jhubbard@nvidia.com>, Peter Xu <peterx@redhat.com>,
	Kemeng Shi <shikemeng@huaweicloud.com>,
	Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
	Barry Song <baohua@kernel.org>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>
Subject: [RFC] mm: stress-ng --mremap triggers severe lruvec lock contention in populate/unmap paths
Date: Tue, 7 Apr 2026 16:09:20 -0400
Message-ID: <a3474fcf-9f20-47ee-9d15-233e5c7e3f83@oracle.com>

Hello,

I would like to ask for feedback on an MM performance issue triggered by 
stress-ng's mremap stressor:

stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief

This was first investigated as a possible regression from 0ca0c24e3211 
("mm: store zero pages to be swapped out in a bitmap"), but the current 
evidence suggests that commit mostly exposes an older problem for this 
workload rather than directly causing it.


Observed behavior:

The metrics below are in this format:
     stressor      bogo ops  real time  usr time  sys time   bogo ops/s     bogo ops/s
                               (secs)     (secs)    (secs)   (real time)  (usr+sys time)

On a 5.15-based kernel, the workload behaves much worse when swapping 
is disabled: aggregate system time explodes and the bogo ops/s measured 
against usr+sys time drops sharply:

     swap enabled:
       mremap 1660980 31.08 64.78 84.63 53437.09 11116.73

     swap disabled:
       mremap 40786258 27.94 15.41 15354.79 1459749.43 2653.59

On a 6.12-based kernel, the same high-system-time behavior is observed 
even with swap enabled:

     mremap 77087729 21.50 29.95 30558.08 3584738.22 2520.19

A recent 7.0-rc5-based mainline build still behaves similarly:

     mremap 39208813 28.12 12.34 15318.39 1394408.50 2557.53

So this does not appear to have been fixed upstream.



The current theory is that 0ca0c24e3211 exposes the problem for this 
specific zero-page-heavy workload.  Before that change, swap-enabled 
runs actually swapped pages.  After that change, zero pages are 
recorded in the swap bitmap instead of being written out, so the 
workload behaves much more like the swap-disabled case.
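
For context, the mechanism that commit introduced is, conceptually, a 
zero-fill check at swap-out time: a folio containing only zeroes gets 
a bit set in a per-swap-device bitmap instead of being written out, 
and swap-in recreates a zero-filled page.  A minimal userspace sketch 
of just the detection idea (the function name and PAGE_SIZE here are 
made up for illustration; the kernel code operates on folios and the 
swap map, not raw buffers):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096

    /* True if the page contains only zero bytes; conceptually the
     * test that decides whether swap-out I/O can be skipped. */
    static bool page_is_zero_filled(const unsigned long *p)
    {
        for (size_t i = 0; i < PAGE_SIZE / sizeof(*p); i++)
            if (p[i])
                return false;
        return true;
    }

    int main(void)
    {
        unsigned long *page = calloc(1, PAGE_SIZE);

        printf("%d\n", page_is_zero_filled(page));  /* 1 */
        page[7] = 1;
        printf("%d\n", page_is_zero_filled(page));  /* 0 */
        free(page);
        return 0;
    }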

Perf data supports the theory that the expensive behavior is global 
lruvec (LRU) lock contention caused by short-lived populate/unmap 
churn.

The dominant stacks on the bad cases include:

     vm_mmap_pgoff
       __mm_populate
         populate_vma_page_range
           lru_add_drain
             folio_batch_move_lru
               folio_lruvec_lock_irqsave
                 native_queued_spin_lock_slowpath

and:

     __x64_sys_munmap
       __vm_munmap
         ...
           release_pages
             folios_put_refs
               __page_cache_release
                 folio_lruvec_relock_irqsave
                   native_queued_spin_lock_slowpath
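
The pattern in those stacks can be approximated without stress-ng. 
Below is a rough standalone sketch (my approximation of the 
populate/unmap churn, not stress-ng's actual code; the thread and 
iteration counts are arbitrary): each thread repeatedly populates and 
tears down a small anonymous mapping, so every iteration pushes a 
folio onto the LRU and then pulls it back off, taking the lruvec lock 
on both sides.

    #include <pthread.h>
    #include <sys/mman.h>

    #define NTHREADS 8
    #define ITERS    100000
    #define LEN      4096

    static void *churn(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            /* MAP_POPULATE pre-faults the page, going through
             * __mm_populate -> populate_vma_page_range (LRU add). */
            void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                           -1, 0);
            if (p == MAP_FAILED)
                break;
            /* Immediate unmap frees the folio via release_pages
             * (LRU removal). */
            munmap(p, LEN);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, churn, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Build with 'gcc -O2 -pthread'.  On an affected kernel this should 
exercise the same lock-acquisition paths as the stressor, though I 
have not measured whether it reproduces the contention at the same 
scale.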



It was also found that adding '--mremap-numa' changes the behavior 
substantially:

stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa --metrics-brief

mremap 2570798 29.39 8.06 106.23 87466.50 22494.74

So it is possible that either actual swapping or the mbind(..., 
MPOL_MF_MOVE) path used by '--mremap-numa' removes most of the 
excessive system time.
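
For reference, the mbind() side of that path looks roughly like the 
following (an assumption about what '--mremap-numa' does internally; 
the policy and nodemask stress-ng actually uses may differ).  Build 
with -lnuma for the numaif.h wrapper:

    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        unsigned long nodemask = 1UL;  /* node 0; arbitrary choice */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                       -1, 0);

        if (p == MAP_FAILED)
            return 1;
        /* MPOL_MF_MOVE asks the kernel to migrate already-populated
         * pages to the requested node (see mbind(2)). */
        if (mbind(p, 4096, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0)
            perror("mbind");
        munmap(p, 4096);
        return 0;
    }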

Does this look like a known MM scalability issue around short-lived 
MAP_POPULATE / munmap churn?




REPRODUCER:
The issue is reproducible with stress-ng's mremap stressor:

stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief

On older kernels, the bad behavior is easiest to expose by disabling 
swap first:

swapoff -a
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --metrics-brief

On kernels with 0ca0c24e3211 ("mm: store zero pages to be swapped out in 
a bitmap") or newer, the same bad behavior can be seen even with swap 
enabled, because this zero-page-heavy workload no longer actually swaps 
pages and behaves much like the swap-disabled case.

Typical bad-case behavior:
  - Very large aggregate sys time during a 30s run (for example,
    ~15000s or higher)
  - Poor bogo ops/s measured against usr+sys time (~2500 range in our
    tests)
  - Perf shows time dominated by:
       vm_mmap_pgoff -> __mm_populate -> populate_vma_page_range ->
         lru_add_drain
     and
       munmap -> release_pages -> __page_cache_release
    with heavy time in
    folio_lruvec_lock_irqsave/native_queued_spin_lock_slowpath
    (see the capture note below)
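
(A profile of this shape should be obtainable with something like 
'perf record -a -g -- sleep 30' while the stressor runs, followed by 
'perf report --stdio', on an affected kernel.)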

Diagnostic variant:
stress-ng --mremap 8192 --mremap-bytes 4K --timeout 30 --mremap-numa --metrics-brief

That variant greatly reduces the excessive system time, which is one 
of the clues that the overhead depends on which MM path the workload 
takes.


Thanks in advance!

Joe




