From: Usama Arif <usama.arif@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
david@kernel.org, ljs@kernel.org, liam@infradead.org,
vbabka@kernel.org, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, kasong@tencent.com, qi.zheng@linux.dev,
shakeel.butt@linux.dev, axelrasmussen@google.com,
yuanchu@google.com, weixugc@google.com, chrisl@kernel.org,
nphamcs@gmail.com, baoquan.he@linux.dev, youngjun.park@lge.com,
hannes@cmpxchg.org, roman.gushchin@linux.dev,
muchun.song@linux.dev, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
rientjes@google.com, kernel-team@meta.com
Cc: Usama Arif <usama.arif@linux.dev>
Subject: [RFC 1/1] mm/vmscan: reduce lru_lock contention via vmstat-derived scan-balance cost
Date: Fri, 26 Jun 2026 05:19:47 -0700
Message-ID: <20260626122009.75334-2-usama.arif@linux.dev>
In-Reply-To: <20260626122009.75334-1-usama.arif@linux.dev>
The anon/file scan balance in get_scan_count() is driven by two scalars
in struct lruvec, anon_cost and file_cost, accumulated by every reclaim
producer under lruvec->lru_lock. The lock acquisition sites that exist
purely for cost work are:
- shrink_inactive_list() re-takes lru_lock at function exit purely
  to call lru_note_cost_unlock_irq() with (nr_pageout, nr_scanned -
  nr_reclaimed). One acquisition per inactive shrink.
- shrink_active_list() does the same with (0, nr_rotated). One
  acquisition per active shrink.
- workingset_refault() takes the lock via folio_lruvec_lock_irq()
  purely to record the refault cost. One acquisition per refault.
- prepare_scan_control() takes lru_lock just to snapshot the two
  scalars into sc->{anon,file}_cost.
- lru_note_cost_unlock_irq() itself walks parent_lruvec and
  re-acquires lru_lock on each ancestor to propagate the update,
  adding O(memcg-depth) acquisitions per producer call.
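To make the O(memcg-depth) point concrete, here is a toy userspace model of the walk. It is only a sketch, not kernel code: struct toy_lruvec and toy_note_cost() are hypothetical names, and a plain counter stands in for taking lru_lock at each level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of one level in the memcg hierarchy. */
struct toy_lruvec {
	struct toy_lruvec *parent;		/* NULL at the root */
	unsigned long lock_acquisitions;	/* stands in for lru_lock */
	unsigned long cost;
};

/*
 * Mimics the shape of the producer-side walk: bill the cost at the
 * current level under its lock, then move to the parent and repeat.
 * Returns the number of lock acquisitions for one producer call.
 */
static unsigned long toy_note_cost(struct toy_lruvec *lruvec,
				   unsigned long cost)
{
	unsigned long acquired = 0;

	assert(lruvec != NULL);
	for (; lruvec; lruvec = lruvec->parent) {
		lruvec->lock_acquisitions++;	/* "spin_lock_irq()" */
		lruvec->cost += cost;
		acquired++;
	}
	return acquired;
}
```

In this model a producer call from a leaf three levels deep touches three locks, and every call from any descendant lands on the root's lock.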
This hurts because lru_lock is already a heavy contention point on
memory-heavy workloads: every isolate_lru_folios(), move_folios_to_lru()
and folio_add_lru() takes it. The cost work itself is trivial (two
scalar bumps and one comparison), but it competes for the same lock as
actual LRU manipulation. The parent_lruvec() walk also multiplies
cost-update overhead by memcg hierarchy depth.
Replace the producer-side accumulators with a read-side accumulator fed
from per-LRU vmstat counters. The old producer formula was:
  cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated
Add explicit node_stat counters for the producer-local inputs:
  PGRECLAIM_PAGEOUT_{ANON,FILE} - reclaim-driven pageout submissions
                                  (formerly stat.nr_pageout, weighted
                                  by SWAP_CLUSTER_MAX).
  PGROTATE_{ANON,FILE}          - reclaim-driven rotations, bumped from
                                  both shrink_inactive_list (by
                                  nr_scanned - nr_reclaimed) and
                                  shrink_active_list (by nr_rotated),
                                  unweighted.
WORKINGSET_RESTORE_{ANON,FILE} already captures the refault IO that
lru_note_cost_refault() used to bill.
In prepare_scan_control() the raw cost signal is recomputed from
monotonic counters without taking lru_lock:
  now = (PGRECLAIM_PAGEOUT_X + WORKINGSET_RESTORE_X) * SWAP_CLUSTER_MAX
        + PGROTATE_X
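As a rough userspace illustration of this formula (not kernel code), the sketch below computes the raw cost for one LRU type from three counter values. struct cost_counters and raw_cost() are hypothetical names, and SWAP_CLUSTER_MAX is assumed to be 32, its value in current kernels.

```c
#include <assert.h>

/* Assumed to match the kernel's SWAP_CLUSTER_MAX (32UL). */
#define SWAP_CLUSTER_MAX 32UL

/* Hypothetical snapshot of the three monotonic counters for one LRU type. */
struct cost_counters {
	unsigned long pageout;	/* PGRECLAIM_PAGEOUT_X */
	unsigned long restore;	/* WORKINGSET_RESTORE_X */
	unsigned long rotate;	/* PGROTATE_X */
};

/* IO events are weighted by SWAP_CLUSTER_MAX; rotations count once. */
static unsigned long raw_cost(const struct cost_counters *c)
{
	assert(c);
	return (c->pageout + c->restore) * SWAP_CLUSTER_MAX + c->rotate;
}
```

For example, 4 pageouts, 2 workingset restores and 10 rotations give (4 + 2) * 32 + 10 = 202.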
The delta against a per-lruvec prev_cost[] snapshot is folded into
cost_accum[]. The lrusize/4 halving threshold is preserved, but the
decay check now happens at the read site instead of on every producer
update.
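The read-side update can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the patch itself: struct cost_state and cost_update() are hypothetical names, the raw costs and lrusize are passed in by the caller instead of being read from vmstat, and locking is left out.

```c
#include <assert.h>

enum { ANON, FILE_LRU, NR_TYPES };

/* Hypothetical userspace stand-in for the per-lruvec cost state. */
struct cost_state {
	unsigned long prev_cost[NR_TYPES];	/* last raw cost seen */
	unsigned long cost_accum[NR_TYPES];	/* decayed running cost */
};

/*
 * Fold the growth of the raw cost since the previous call into the
 * accumulator, then halve both sides once if their sum exceeds
 * lrusize / 4.
 */
static void cost_update(struct cost_state *s,
			const unsigned long now[NR_TYPES],
			unsigned long lrusize)
{
	assert(s && now);

	for (int t = 0; t < NR_TYPES; t++) {
		unsigned long delta = now[t] - s->prev_cost[t];

		s->prev_cost[t] = now[t];
		s->cost_accum[t] += delta;
	}

	if (s->cost_accum[ANON] + s->cost_accum[FILE_LRU] > lrusize / 4) {
		s->cost_accum[ANON] /= 2;
		s->cost_accum[FILE_LRU] /= 2;
	}
}
```

One consequence visible in the sketch: however large the delta, a single call halves the accumulators at most once, whereas a per-producer check would have had a chance to decay on each individual update.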
A dedicated per-lruvec spinlock, cost_lock, serialises the prev_cost
RMW, the accumulator update, and the halving check against concurrent
reclaimers in the same memcg+node.
Hierarchy aggregation is now implicit in the vmstat accounting. The
producer-side parent_lruvec() walk and lru_reparent_memcg() cost splice
existed only because anon_cost/file_cost were private lruvec fields. With
the cost expressed as lruvec vmstats, rstat propagates the underlying
counters through the memcg hierarchy and prepare_scan_control() consumes
the same ratelimited rstat view as the surrounding reclaim heuristics.
Signed-off-by: Usama Arif <usama.arif@linux.dev>
---
include/linux/mmzone.h | 11 +++++--
include/linux/swap.h | 3 --
mm/memcontrol-v1.c | 4 +--
mm/memcontrol.c | 4 +++
mm/mmzone.c | 1 +
mm/swap.c | 69 ------------------------------------------
mm/vmscan.c | 64 +++++++++++++++++++++++++++++++++------
mm/vmstat.c | 4 +++
mm/workingset.c | 5 ---
9 files changed, 74 insertions(+), 91 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca2712187147..0627622a5184 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -323,6 +323,10 @@ enum node_stat_item {
PGSCAN_PROACTIVE,
PGSCAN_ANON,
PGSCAN_FILE,
+ PGRECLAIM_PAGEOUT_ANON,
+ PGRECLAIM_PAGEOUT_FILE,
+ PGROTATE_ANON,
+ PGROTATE_FILE,
PGREFILL,
#ifdef CONFIG_HUGETLB_PAGE
NR_HUGETLB,
@@ -763,9 +767,12 @@ struct lruvec {
* These track the cost of reclaiming one LRU - file or anon -
* over the other. As the observed cost of reclaiming one LRU
* increases, the reclaim scan balance tips toward the other.
+ * Updated and decayed at prepare_scan_control() time; cost_lock
+ * serialises that update.
*/
- unsigned long anon_cost;
- unsigned long file_cost;
+ unsigned long prev_cost[ANON_AND_FILE];
+ unsigned long cost_accum[ANON_AND_FILE];
+ spinlock_t cost_lock;
/* Non-resident age, driven by LRU movement */
atomic_long_t nonresident_age;
/* Refaults at the time of last reclaim cycle */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6d72778e6cc3..d35a4761ebd7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -309,9 +309,6 @@ extern unsigned long totalreserve_pages;
/* linux/mm/swap.c */
-void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
- unsigned int nr_io, unsigned int nr_rotated);
-void lru_note_cost_refault(struct folio *);
void folio_add_lru(struct folio *);
void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
void mark_page_accessed(struct page *);
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 765069211567..c7a52bb68f4c 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1988,8 +1988,8 @@ void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
for_each_online_pgdat(pgdat) {
mz = memcg->nodeinfo[pgdat->node_id];
- anon_cost += mz->lruvec.anon_cost;
- file_cost += mz->lruvec.file_cost;
+ anon_cost += mz->lruvec.cost_accum[WORKINGSET_ANON];
+ file_cost += mz->lruvec.cost_accum[WORKINGSET_FILE];
}
seq_buf_printf(s, "anon_cost %lu\n", anon_cost);
seq_buf_printf(s, "file_cost %lu\n", file_cost);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 56cd4af08232..3c068ebefd97 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -419,6 +419,10 @@ static const unsigned int memcg_node_stat_items[] = {
PGSCAN_PROACTIVE,
PGSCAN_ANON,
PGSCAN_FILE,
+ PGRECLAIM_PAGEOUT_ANON,
+ PGRECLAIM_PAGEOUT_FILE,
+ PGROTATE_ANON,
+ PGROTATE_FILE,
PGREFILL,
#ifdef CONFIG_HUGETLB_PAGE
NR_HUGETLB,
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 0c8f181d9d50..17139db4d291 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -78,6 +78,7 @@ void lruvec_init(struct lruvec *lruvec)
memset(lruvec, 0, sizeof(struct lruvec));
spin_lock_init(&lruvec->lru_lock);
+ spin_lock_init(&lruvec->cost_lock);
zswap_lruvec_state_init(lruvec);
for_each_lru(lru)
diff --git a/mm/swap.c b/mm/swap.c
index 588f50d8f1a8..74b281778cbc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -272,73 +272,6 @@ void folio_rotate_reclaimable(struct folio *folio)
folio_batch_add_and_move(folio, lru_move_tail);
}
-void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
- unsigned int nr_io, unsigned int nr_rotated)
- __releases(lruvec->lru_lock)
- __releases(rcu)
-{
- unsigned long cost;
-
- /*
- * Reflect the relative cost of incurring IO and spending CPU
- * time on rotations. This doesn't attempt to make a precise
- * comparison, it just says: if reloads are about comparable
- * between the LRU lists, or rotations are overwhelmingly
- * different between them, adjust scan balance for CPU work.
- */
- cost = nr_io * SWAP_CLUSTER_MAX + nr_rotated;
- if (!cost) {
- spin_unlock_irq(&lruvec->lru_lock);
- rcu_read_unlock();
- return;
- }
-
- for (;;) {
- unsigned long lrusize;
-
- /* Record cost event */
- if (file)
- lruvec->file_cost += cost;
- else
- lruvec->anon_cost += cost;
-
- /*
- * Decay previous events
- *
- * Because workloads change over time (and to avoid
- * overflow) we keep these statistics as a floating
- * average, which ends up weighing recent refaults
- * more than old ones.
- */
- lrusize = lruvec_page_state(lruvec, NR_INACTIVE_ANON) +
- lruvec_page_state(lruvec, NR_ACTIVE_ANON) +
- lruvec_page_state(lruvec, NR_INACTIVE_FILE) +
- lruvec_page_state(lruvec, NR_ACTIVE_FILE);
-
- if (lruvec->file_cost + lruvec->anon_cost > lrusize / 4) {
- lruvec->file_cost /= 2;
- lruvec->anon_cost /= 2;
- }
-
- spin_unlock_irq(&lruvec->lru_lock);
- lruvec = parent_lruvec(lruvec);
- if (!lruvec) {
- rcu_read_unlock();
- break;
- }
- spin_lock_irq(&lruvec->lru_lock);
- }
-}
-
-void lru_note_cost_refault(struct folio *folio)
-{
- struct lruvec *lruvec;
-
- lruvec = folio_lruvec_lock_irq(folio);
- lru_note_cost_unlock_irq(lruvec, folio_is_file_lru(folio),
- folio_nr_pages(folio), 0);
-}
-
static void lru_activate(struct lruvec *lruvec, struct folio *folio)
{
long nr_pages = folio_nr_pages(folio);
@@ -1164,8 +1097,6 @@ void lru_reparent_memcg(struct mem_cgroup *memcg, struct mem_cgroup *parent, int
child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
- parent_lruvec->anon_cost += child_lruvec->anon_cost;
- parent_lruvec->file_cost += child_lruvec->file_cost;
for_each_lru(lru)
lruvec_reparent_lru(child_lruvec, parent_lruvec, lru, nid);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e8a90911bf88..6d187f0f8bf8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2043,10 +2043,13 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
item = PGSTEAL_KSWAPD + reclaimer_offset(sc);
mod_lruvec_state(lruvec, item, nr_reclaimed);
mod_lruvec_state(lruvec, PGSTEAL_ANON + file, nr_reclaimed);
+ if (stat.nr_pageout)
+ mod_lruvec_state(lruvec, PGRECLAIM_PAGEOUT_ANON + file,
+ stat.nr_pageout);
+ if (nr_scanned > nr_reclaimed)
+ mod_lruvec_state(lruvec, PGROTATE_ANON + file,
+ nr_scanned - nr_reclaimed);
- lruvec_lock_irq(lruvec);
- lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
- nr_scanned - nr_reclaimed);
handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
@@ -2152,9 +2155,9 @@ static void shrink_active_list(unsigned long nr_to_scan,
count_vm_events(PGDEACTIVATE, nr_deactivate);
count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
+ if (nr_rotated)
+ mod_lruvec_state(lruvec, PGROTATE_ANON + file, nr_rotated);
- lruvec_lock_irq(lruvec);
- lru_note_cost_unlock_irq(lruvec, file, 0, nr_rotated);
trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
nr_deactivate, nr_rotated, sc->priority, file);
}
@@ -2303,12 +2306,53 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
/*
- * Determine the scan balance between anon and file LRUs.
+ * Determine the scan balance between anon and file LRUs from per-LRU
+ * vmstat counters. The raw cost per side is:
+ *
+ * PGROTATE - reclaim-driven rotations, bumped from both
+ * shrink_inactive_list and shrink_active_list
+ * (CPU work).
+ * PGRECLAIM_PAGEOUT - reclaim-driven pageout IO.
+ * WORKINGSET_RESTORE - refaults of previously-workingset pages.
+ *
+ * The two IO terms are weighted by SWAP_CLUSTER_MAX to reflect the
+ * higher cost of an IO over a rotation.
+ *
+ * Reads are lock-free per-cpu sum collations, rstat-aggregated up
+ * the memcg hierarchy by mem_cgroup_flush_stats_ratelimited() above.
+ *
+ * The delta against prev_cost is folded into cost_accum, which is
+ * halved on both sides whenever their sum exceeds lrusize/4.
+ * cost_lock serialises concurrent reclaimers in the same memcg+node.
*/
- spin_lock_irq(&target_lruvec->lru_lock);
- sc->anon_cost = target_lruvec->anon_cost;
- sc->file_cost = target_lruvec->file_cost;
- spin_unlock_irq(&target_lruvec->lru_lock);
+ spin_lock(&target_lruvec->cost_lock);
+ for (int f = 0; f <= 1; f++) {
+ unsigned long now, delta;
+
+ now = lruvec_page_state(target_lruvec, PGROTATE_ANON + f) +
+ (lruvec_page_state(target_lruvec,
+ PGRECLAIM_PAGEOUT_ANON + f) +
+ lruvec_page_state(target_lruvec,
+ WORKINGSET_RESTORE_BASE + f)) *
+ SWAP_CLUSTER_MAX;
+ delta = now - target_lruvec->prev_cost[f];
+ target_lruvec->prev_cost[f] = now;
+ target_lruvec->cost_accum[f] += delta;
+ }
+ unsigned long lrusize =
+ lruvec_page_state(target_lruvec, NR_INACTIVE_ANON) +
+ lruvec_page_state(target_lruvec, NR_ACTIVE_ANON) +
+ lruvec_page_state(target_lruvec, NR_INACTIVE_FILE) +
+ lruvec_page_state(target_lruvec, NR_ACTIVE_FILE);
+
+ if (target_lruvec->cost_accum[WORKINGSET_ANON] +
+ target_lruvec->cost_accum[WORKINGSET_FILE] > lrusize / 4) {
+ target_lruvec->cost_accum[WORKINGSET_ANON] /= 2;
+ target_lruvec->cost_accum[WORKINGSET_FILE] /= 2;
+ }
+ sc->anon_cost = target_lruvec->cost_accum[WORKINGSET_ANON];
+ sc->file_cost = target_lruvec->cost_accum[WORKINGSET_FILE];
+ spin_unlock(&target_lruvec->cost_lock);
/*
* Target desirable inactive:active list ratios for the anon
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f534972f517d..2dfed5e9b64d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1289,6 +1289,10 @@ const char * const vmstat_text[] = {
[I(PGSCAN_PROACTIVE)] = "pgscan_proactive",
[I(PGSCAN_ANON)] = "pgscan_anon",
[I(PGSCAN_FILE)] = "pgscan_file",
+ [I(PGRECLAIM_PAGEOUT_ANON)] = "pgreclaim_pageout_anon",
+ [I(PGRECLAIM_PAGEOUT_FILE)] = "pgreclaim_pageout_file",
+ [I(PGROTATE_ANON)] = "pgrotate_anon",
+ [I(PGROTATE_FILE)] = "pgrotate_file",
[I(PGREFILL)] = "pgrefill",
#ifdef CONFIG_HUGETLB_PAGE
[I(NR_HUGETLB)] = "nr_hugetlb",
diff --git a/mm/workingset.c b/mm/workingset.c
index f351798e723a..7ac2b88c80ae 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -584,11 +584,6 @@ void workingset_refault(struct folio *folio, void *shadow)
/* Folio was active prior to eviction */
if (workingset) {
folio_set_workingset(folio);
- /*
- * XXX: Move to folio_add_lru() when it supports new vs
- * putback
- */
- lru_note_cost_refault(folio);
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
}
out:
--
2.53.0-Meta