From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, ljs@kernel.org, liam@infradead.org,
vbabka@kernel.org, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, kasong@tencent.com, qi.zheng@linux.dev,
shakeel.butt@linux.dev, axelrasmussen@google.com,
yuanchu@google.com, weixugc@google.com, chrisl@kernel.org,
nphamcs@gmail.com, baoquan.he@linux.dev, youngjun.park@lge.com,
hannes@cmpxchg.org, roman.gushchin@linux.dev,
muchun.song@linux.dev, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
rientjes@google.com, kernel-team@meta.com
Cc: Usama Arif <usama.arif@linux.dev>
Subject: [RFC 0/1] mm/vmscan: reduce lru_lock contention via vmstat-derived scan-balance cost
Date: Fri, 26 Jun 2026 05:19:46 -0700
Message-ID: <20260626122009.75334-1-usama.arif@linux.dev>
The anon/file scan balance heuristic in get_scan_count() is fed by two
scalars in struct lruvec (anon_cost, file_cost) that every reclaim
producer updates under lruvec->lru_lock. The cost-recording work
itself is trivial, but it both waits on lru_lock and adds to the
contention on it, and lru_lock is often a bottleneck for workloads
under memory pressure. Specifically:
- shrink_inactive_list() re-acquires lru_lock at function exit just
to call lru_note_cost_unlock_irq().
- shrink_active_list() does the same after rotation accounting.
- workingset_refault() takes folio_lruvec_lock_irq() purely to
record the refault cost.
- prepare_scan_control() snapshots anon_cost/file_cost under
lru_lock.
- lru_note_cost_unlock_irq() itself walks parent_lruvec() and
re-acquires lru_lock on every ancestor, multiplying the cost
of every update by memcg-hierarchy depth.
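
The depth multiplication in the last bullet can be illustrated with a
minimal userspace model. This is a sketch, not the kernel code:
struct model_lruvec, model_note_cost() and the lock_acquisitions
counter are hypothetical names, and the counter merely stands in for
taking lruvec->lru_lock on each level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct lruvec: only what the model needs. */
struct model_lruvec {
	struct model_lruvec *parent;     /* NULL at the root of the hierarchy */
	unsigned long anon_cost;
	unsigned long file_cost;
	unsigned long lock_acquisitions; /* counts simulated lru_lock takes */
};

/*
 * Models the producer-side pattern: record the cost on the lruvec, then
 * walk to each ancestor and take its lock as well. Returns the number of
 * simulated lock acquisitions caused by this single update.
 */
static unsigned long model_note_cost(struct model_lruvec *lruvec, int file,
				     unsigned long nr_io)
{
	unsigned long taken = 0;

	assert(lruvec != NULL);
	for (; lruvec; lruvec = lruvec->parent) {
		lruvec->lock_acquisitions++;	/* simulated lock/unlock pair */
		taken++;
		if (file)
			lruvec->file_cost += nr_io;
		else
			lruvec->anon_cost += nr_io;
	}
	return taken;
}
```

In this model a leaf with two ancestors pays three acquisitions for
every recorded cost, one per level.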
This patch removes those producer-side acquisitions entirely. The
producer-local inputs (PGROTATE_*, PGRECLAIM_PAGEOUT_*) become
per-LRU vmstat counters; WORKINGSET_RESTORE_* already captures the
refault input. prepare_scan_control() reads the raw cost signal
lock-free from those vmstats and folds the delta into a per-lruvec
accumulator. A dedicated per-lruvec cost_lock — not touched by
isolate_lru_folios(), move_folios_to_lru(), or folio_add_lru() —
serialises the accumulator RMW and the lrusize/4 halving check.
Hierarchy aggregation is implicit in rstat propagation, so the
parent_lruvec() walk and the lru_reparent_memcg() cost-splice both
disappear.
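
The read-side flow described above can be sketched in userspace C.
Everything here is an assumption made for illustration: the struct
layout, the names model_producer_event() and model_fold_cost(), the
single atomic counter per type standing in for the vmstat-derived
signal, and the atomic_flag standing in for the spinlock. It is not
the code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>

enum { MODEL_ANON, MODEL_FILE, MODEL_NR_TYPES };

/* Hypothetical per-lruvec state; field names are not the patch's. */
struct model_lruvec {
	atomic_ulong raw[MODEL_NR_TYPES];   /* stand-in for the vmstat signal */
	unsigned long seen[MODEL_NR_TYPES]; /* raw value at the previous fold */
	unsigned long cost[MODEL_NR_TYPES]; /* decayed anon/file cost */
	atomic_flag cost_lock;              /* stand-in for a spinlock */
};

/* Producer: bump a counter; no lock is taken. */
static void model_producer_event(struct model_lruvec *lv, int type,
				 unsigned long nr)
{
	assert(type >= 0 && type < MODEL_NR_TYPES);
	atomic_fetch_add_explicit(&lv->raw[type], nr, memory_order_relaxed);
}

/* Consumer: read lock-free, then fold the delta and decay under cost_lock. */
static void model_fold_cost(struct model_lruvec *lv, unsigned long lrusize)
{
	unsigned long now[MODEL_NR_TYPES];
	int t;

	for (t = 0; t < MODEL_NR_TYPES; t++)
		now[t] = atomic_load_explicit(&lv->raw[t],
					      memory_order_relaxed);

	while (atomic_flag_test_and_set_explicit(&lv->cost_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */

	for (t = 0; t < MODEL_NR_TYPES; t++) {
		/* Skip if another consumer already folded a newer snapshot. */
		if (now[t] > lv->seen[t]) {
			lv->cost[t] += now[t] - lv->seen[t];
			lv->seen[t] = now[t];
		}
	}

	/* One halving check per fold, however many events were batched. */
	if (lv->cost[MODEL_ANON] + lv->cost[MODEL_FILE] > lrusize / 4) {
		lv->cost[MODEL_ANON] /= 2;
		lv->cost[MODEL_FILE] /= 2;
	}

	atomic_flag_clear_explicit(&lv->cost_lock, memory_order_release);
}
```

The model ignores counter wrap-around and rstat flushing; it only
shows that producers never take a lock and that the accumulator RMW
and the halving check share one critical section.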
Trade-offs:
- Signal freshness is slightly worse: cost reads see rstat-
aggregated values that can lag until periodic / reader-triggered
flushing. Decay timing is also coarser, since multiple producer
events may be batched into one read-side halving check.
The cost signal is a heuristic feeding the anon-vs-file split,
not a precise control loop, and it is already deliberately
smoothed by the lrusize/4 halving. Producing/consuming it with a
small lag should not be perceptible.
- Per-lruvec footprint grows by 2 unsigned longs plus a spinlock,
which is a small cost.
== Numbers ==
Tested on a 176-core, 256 GB host. The benchmark drives sustained
swap-out/refault inside a tight memcg using vm-scalability/usemem:
usemem -n 16 --prealloc --prefault --random $((256*1024*1024))
run inside a two-level memcg with memory.max=512M on the leaf
(4 GB anon working set has to fit in 512 MB -> continuous
shrink_inactive_list + workingset_refault). A 16 GB swap file
is used. Measurement is a 30 s `perf lock record -a` window
over otherwise-idle hardware.
Workload rates are identical on both kernels (the bench drives the
same memory pressure):
                                 baseline     patched    delta
  pgscan_direct / s               172,662     171,817     ~0%
  pgsteal_direct / s               67,162      66,306     ~0%
  workingset_refault_anon / s      40,696      39,830     ~0%
perf lock contention (total wait per 30 s window):
  Lock Name                          Before       After        % change
  shrink_lruvec+0x770                722.84 ms    0            -100% (eliminated)
    (= lru_note_cost_unlock_irq)
  workingset_refault+0x167           385.26 ms    0            -100% (eliminated)
    (= lru_note_cost_refault)
  shrink_node+0x4ad                  689.43 ms    26.95 ms     -96%
  shrink_active_list                 208.34 ms    15.97 ms     -92%
  lru_add_drain_cpu+0x34             1.96 s       917.71 ms    -53%
  Total LRU lock wait                ~4.23 s      ~1.66 s      -61%
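
The "% change" column follows directly from the two wait columns. As a
quick arithmetic check, a small C helper (pct_change() is a
hypothetical name used only here, with all inputs converted to
milliseconds) reproduces the rounded figures:

```c
#include <assert.h>

/*
 * Percentage change from before_ms to after_ms, rounded to the nearest
 * integer. Rounding is done by hand to avoid depending on libm.
 */
static long pct_change(double before_ms, double after_ms)
{
	double pct;

	assert(before_ms > 0.0);
	pct = (after_ms - before_ms) / before_ms * 100.0;
	return (long)(pct < 0.0 ? pct - 0.5 : pct + 0.5);
}
```

For example, pct_change(689.43, 26.95) gives -96 for shrink_node and
pct_change(4230.0, 1660.0) gives -61 for the total.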
The two specific contention sites the patch removes
(shrink_lruvec+0x770 = lru_note_cost_unlock_irq;
workingset_refault+0x167 = lru_note_cost_refault) are completely
absent from the patched perf-lock-contention output.
Secondary reductions in shrink_node, shrink_active_list,
lru_add_drain_cpu and pgrefill/pgactivate look like knock-on
effects from removing the cost-recording overhead and the
parent_lruvec walk.
The remaining ~1.66 s of LRU lock wait on the patched kernel is
dominated by the per-CPU pagevec drain (lru_add_drain_cpu) and the
main reclaim path in shrink_lruvec.
The numbers above can be reproduced using the script in [1].
== Alternatives considered ==
1. cost_lock for both producer and consumer (no vmstat indirection):
Keep the producer loop, just swap lru_lock for a new per-lruvec
cost_lock. Decouples cost from LRU manipulation, but producers
still synchronously contend on cost_lock, the parent_lruvec()
walk is still required (O(memcg-depth) acquisitions per recording,
now on cost_lock), and lru_reparent_memcg() still needs an explicit
cost-splice. We can do much better: this patch removes the
producer lock entirely and gets hierarchy propagation for "free"
via rstat.
2. Switch to MGLRU's scan model:
MGLRU has no anon_cost/file_cost at all. It replaces the cost
heuristic with generation-based aging: per-LRU sequence numbers
(min_seq/max_seq) age folios into generations, and the
older-generation type is the one to scan. So
lru_note_cost_unlock_irq() / lru_note_cost_refault() are simply
not called when lru_gen_enabled() — by design it sidesteps every
concern this patch addresses.
But MGLRU is not a substitute for fixing classic LRU:
- It depends on a lot of infrastructure, including per-lruvec
generation lists, bloom filters, the mm_struct walk, working-set
protection tiers and a whole sysfs interface. Replacing
classic LRU's cost recording with the MGLRU model would
mean dragging in all of that.
- It changes scan-balance semantics, not just the locking, so
it's a heuristic change we would need to evaluate separately.
There are known regressions (database/anon-heavy workloads
sensitive to swappiness, or file-cache-dominated workloads
where MGLRU's bloom-filter protection differs from classic
refault tracking).
This patch preserves classic-LRU semantics.
3. Atomic cost counter:
lrusize/4 halving has no clean atomic form, and the parent
walk still has to run explicitly. Reusing vmstats gives per-CPU
aggregation AND rstat hierarchy propagation for free.
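
One way to read "no clean atomic form" is that the check spans two
counters while each atomic operation covers only one. The sketch
below is an illustration under that assumption, with hypothetical
names (struct model_costs, model_should_decay(), model_decay()); it
is not taken from the patch. Each counter is halved with a CAS loop,
which is atomic per counter, but the decision and the two halvings
are not one atomic step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock-free cost pair. */
struct model_costs {
	atomic_ulong anon;
	atomic_ulong file;
};

/* Step 1: decide from a snapshot of both counters. */
static bool model_should_decay(struct model_costs *c, unsigned long lrusize)
{
	return atomic_load(&c->anon) + atomic_load(&c->file) > lrusize / 4;
}

/* Halve one counter with a CAS loop; atomic for this counter only. */
static void model_halve(atomic_ulong *v)
{
	unsigned long old = atomic_load(v);

	while (!atomic_compare_exchange_weak(v, &old, old / 2))
		;	/* a failed CAS reloads 'old'; retry with the new value */
}

/* Step 2: act on the earlier decision, which may be stale by now. */
static void model_decay(struct model_costs *c)
{
	model_halve(&c->anon);
	model_halve(&c->file);
}
```

If two CPUs both run step 1 before either runs step 2, both see the
sum above lrusize/4 and both halve, so the costs are quartered even
though a single halving would already have brought the sum back under
the threshold. Under a lock, the second CPU would re-evaluate the
check after the first halving and skip the decay.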
4. Drop cost_lock from the existing patch and reuse lru_lock in the
consumer (prepare_scan_control()):
Saves one spinlock per lruvec but re-couples the cost path to LRU
manipulation, though only from the consumer side this time.
prepare_scan_control() runs at the start of every shrink_lruvec()
cycle, so under sustained memory pressure it would take lru_lock
on the hot path and block isolate_lru_folios() /
move_folios_to_lru() / folio_add_lru() exactly when reclaim is
in flight. A dedicated cost_lock is never taken by anyone except
the consumer-side cost calculation.
[1] https://gist.github.com/uarif1/a4eb33a86c5b2d7bbc55b42f0956e884
Usama Arif (1):
mm/vmscan: reduce lru_lock contention via vmstat-derived scan-balance
cost
include/linux/mmzone.h | 11 +++++--
include/linux/swap.h | 3 --
mm/memcontrol-v1.c | 4 +--
mm/memcontrol.c | 4 +++
mm/mmzone.c | 1 +
mm/swap.c | 69 ------------------------------------------
mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++------
mm/vmstat.c | 4 +++
mm/workingset.c | 5 ---
9 files changed, 74 insertions(+), 91 deletions(-)
--
2.53.0-Meta