public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] mm/mglru: fix cgroup OOM during MGLRU state switching
@ 2026-03-11 12:09 Leno Hou via B4 Relay
  2026-03-11 12:09 ` [PATCH v2 1/2] " Leno Hou via B4 Relay
  2026-03-11 12:09 ` [PATCH v2 2/2] mm/mglru: maintain workingset refault context across state transitions Leno Hou via B4 Relay
  0 siblings, 2 replies; 6+ messages in thread
From: Leno Hou via B4 Relay @ 2026-03-11 12:09 UTC (permalink / raw)
  To: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, Barry Song
  Cc: linux-mm, linux-kernel, Leno Hou

When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim
path. This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.

Problem Description
===================

The issue arises from a "reclaim vacuum" during the transition.

1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
   false before the pages are drained from MGLRU lists back to
   traditional LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
   and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
   yet, or the changes are not yet visible to all CPUs due to a lack of
   synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
   concludes there is no reclaimable memory, and triggers an OOM kill.

A similar race can occur during enablement, where the reclaimer sees
the new state but the MGLRU lists haven't been populated via
fill_evictable() yet.
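
The vacuum can be modeled outside the kernel. The toy program below is a
userspace sketch only (the names `enabled`, `mglru_pages`, `lru_pages` and
the step functions are illustrative, not kernel code): it walks the disable
path step by step and shows that a reclaimer keyed solely off the global
flag observes zero reclaimable pages mid-transition.

```c
#include <stdbool.h>

/* Toy model of the "reclaim vacuum" (all names illustrative, not
 * kernel code): pages sit on either the MGLRU lists or the
 * traditional LRU lists, and 'enabled' selects which lists a
 * reclaimer scans. */
static bool enabled = true;
static int mglru_pages = 100;   /* pages still on MGLRU lists */
static int lru_pages;           /* pages on traditional LRU lists */

/* A reclaimer that trusts only the global flag, as before the fix. */
static int scannable_pages(void)
{
	return enabled ? mglru_pages : lru_pages;
}

/* Step 1 of disabling: the flag flips first. */
static void begin_disable(void)
{
	enabled = false;
}

/* Step 2: the drain back to the traditional lists completes later. */
static void finish_drain(void)
{
	lru_pages += mglru_pages;
	mglru_pages = 0;
}
```

Between begin_disable() and finish_drain(), scannable_pages() returns 0,
which is the window in which get_scan_count() concludes there is nothing
to reclaim and an OOM kill fires.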

Solution
========

Introduce a 'draining' state (`lru_drain_core`) to bridge the
transition. When transitioning, the system enters this intermediate state
where the reclaimer is forced to attempt both MGLRU and traditional reclaim
paths sequentially. This ensures that folios remain visible to at least
one reclaim mechanism until the transition is fully materialized across all
CPUs.
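
As a rough userspace sketch of the dispatch (the flag names mirror the
patch, but nothing here is kernel code and `reclaim_paths()` is a
hypothetical helper), the draining state simply widens the set of reclaim
paths a reclaimer walks:

```c
#include <stdbool.h>

/* Stand-ins for the lru_gen_caps[LRU_GEN_CORE] and lru_drain_core
 * static branches; illustrative only. */
static bool mglru_enabled;
static bool draining;

#define PATH_MGLRU 1   /* lru_gen_shrink_lruvec() */
#define PATH_LRU   2   /* get_scan_count() + shrink_list() */

/* Mirrors the control flow the patch adds to shrink_lruvec():
 * try MGLRU when enabled or draining, and fall through to the
 * traditional path unless MGLRU is stably enabled. */
static int reclaim_paths(void)
{
	int paths = 0;

	if (mglru_enabled || draining)
		paths |= PATH_MGLRU;
	if (!mglru_enabled || draining)
		paths |= PATH_LRU;
	return paths;
}
```

In the two steady states exactly one path is taken; while draining, both
are, so folios are always visible to at least one of them.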

Changes
=======

v2:
- Replace the explicit 'draining' flag with a static branch `lru_drain_core`
  to track the transition state.
- Ensure all LRU helpers correctly identify page state by checking
  folio_lru_gen(folio) != -1 instead of relying solely on global flags.
- Maintain workingset refault context across MGLRU state transitions.
- Fix a build error when CONFIG_LRU_GEN is disabled.

v1:
- Use smp_store_release() and smp_load_acquire() to ensure the visibility
  of 'enabled' and 'draining' flags across CPUs.
- Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
  is in the 'draining' state, the reclaimer will attempt to scan MGLRU
  lists first, and then fall through to traditional LRU lists instead
  of returning early. This ensures that folios are visible to at least
  one reclaim path at any given time.

This effectively eliminates the race window that previously triggered OOMs
under high memory pressure.

Reproduction
============

The issue was consistently reproduced on v6.1.157 and v6.18.3 using
a high-pressure memory cgroup (v1) environment.

Reproduction steps:
1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
   and 8GB active anonymous memory.
2. Toggle MGLRU state while performing new memory allocations to force
   direct reclaim.

Reproduction script
===================
```bash
MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"

switch_mglru() {
    local orig_val=$(cat "$MGLRU_FILE")
    if [[ "$orig_val" != "0x0000" ]]; then
        echo n > "$MGLRU_FILE" &
    else
        echo y > "$MGLRU_FILE" &
    fi
}

mkdir -p "$CGROUP_PATH"
echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
echo $$ > "$CGROUP_PATH/cgroup.procs"

dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache

stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
sleep 5

switch_mglru
stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || echo "OOM Triggered"

grep oom_kill "$CGROUP_PATH/memory.oom_control"
```

Signed-off-by: Leno Hou <lenohou@gmail.com>
---
Leno Hou (2):
      mm/mglru: fix cgroup OOM during MGLRU state switching
      mm/mglru: maintain workingset refault context across state transitions

 include/linux/mm_inline.h |  5 +++++
 mm/rmap.c                 |  2 +-
 mm/swap.c                 | 14 ++++++++------
 mm/vmscan.c               | 49 ++++++++++++++++++++++++++++++++++++++---------
 mm/workingset.c           | 19 ++++++++++++------
 5 files changed, 67 insertions(+), 22 deletions(-)
---
base-commit: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
change-id: 20260311-b4-switch-mglru-v2-8b926a03843f

Best regards,
-- 
Leno Hou <lenohou@gmail.com>




^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 1/2] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-11 12:09 [PATCH v2 0/2] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou via B4 Relay
@ 2026-03-11 12:09 ` Leno Hou via B4 Relay
  2026-03-12  6:02   ` Barry Song
  2026-03-11 12:09 ` [PATCH v2 2/2] mm/mglru: maintain workingset refault context across state transitions Leno Hou via B4 Relay
  1 sibling, 1 reply; 6+ messages in thread
From: Leno Hou via B4 Relay @ 2026-03-11 12:09 UTC (permalink / raw)
  To: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, Barry Song
  Cc: linux-mm, linux-kernel, Leno Hou

From: Leno Hou <lenohou@gmail.com>

When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim
path. This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.

Problem Description
===================

The issue arises from a "reclaim vacuum" during the transition.

1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
   false before the pages are drained from MGLRU lists back to
   traditional LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
   and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
   yet, or the changes are not yet visible to all CPUs due to a lack of
   synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
   concludes there is no reclaimable memory, and triggers an OOM kill.

A similar race can occur during enablement, where the reclaimer sees
the new state but the MGLRU lists haven't been populated via
fill_evictable() yet.


Solution
========

Introduce a 'draining' state (`lru_drain_core`) to bridge the
transition. When transitioning, the system enters this intermediate state
where the reclaimer is forced to attempt both MGLRU and traditional reclaim
paths sequentially. This ensures that folios remain visible to at least
one reclaim mechanism until the transition is fully materialized across all
CPUs.

Changes
=======

- Adds a static branch `lru_drain_core` to track the transition state.
- Updates shrink_lruvec(), shrink_node(), and kswapd_age_node() to allow
  a "joint reclaim" period during the transition.
- Ensures all LRU helpers correctly identify page state by checking
  folio_lru_gen(folio) != -1 instead of relying solely on global flags.

This effectively eliminates the race window that previously triggered OOMs
under high memory pressure.

The issue was consistently reproduced on v6.1.157 and v6.18.3 using
a high-pressure memory cgroup (v1) environment.

To: Andrew Morton <akpm@linux-foundation.org>
To: Axel Rasmussen <axelrasmussen@google.com>
To: Yuanchu Xie <yuanchu@google.com>
To: Wei Xu <weixugc@google.com>
To: Barry Song <21cnbao@gmail.com>
To: Jialing Wang <wjl.linux@gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
To: Yu Zhao <yuzhao@google.com>
To: Kairui Song <ryncsn@gmail.com>
To: Bingfang Guo <bfguo@icloud.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Leno Hou <lenohou@gmail.com>
---
 include/linux/mm_inline.h |  5 +++++
 mm/rmap.c                 |  2 +-
 mm/swap.c                 | 14 ++++++++------
 mm/vmscan.c               | 49 ++++++++++++++++++++++++++++++++++++++---------
 4 files changed, 54 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index fa2d6ba811b5..e6443e22bf67 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -321,6 +321,11 @@ static inline bool lru_gen_in_fault(void)
 	return false;
 }
 
+static inline int folio_lru_gen(const struct folio *folio)
+{
+	return -1;
+}
+
 static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 {
 	return false;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f00570d1b9e..488bcdca65ed 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -958,7 +958,7 @@ static bool folio_referenced_one(struct folio *folio,
 			return false;
 		}
 
-		if (lru_gen_enabled() && pvmw.pte) {
+		if ((folio_lru_gen(folio) != -1) && pvmw.pte) {
 			if (lru_gen_look_around(&pvmw))
 				referenced++;
 		} else if (pvmw.pte) {
diff --git a/mm/swap.c b/mm/swap.c
index bb19ccbece46..a2397b44710a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -456,7 +456,7 @@ void folio_mark_accessed(struct folio *folio)
 {
 	if (folio_test_dropbehind(folio))
 		return;
-	if (lru_gen_enabled()) {
+	if (folio_lru_gen(folio) != -1) {
 		lru_gen_inc_refs(folio);
 		return;
 	}
@@ -553,7 +553,7 @@ void folio_add_lru_vma(struct folio *folio, struct vm_area_struct *vma)
  */
 static void lru_deactivate_file(struct lruvec *lruvec, struct folio *folio)
 {
-	bool active = folio_test_active(folio) || lru_gen_enabled();
+	bool active = folio_test_active(folio) || (folio_lru_gen(folio) != -1);
 	long nr_pages = folio_nr_pages(folio);
 
 	if (folio_test_unevictable(folio))
@@ -596,7 +596,9 @@ static void lru_deactivate(struct lruvec *lruvec, struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
 
-	if (folio_test_unevictable(folio) || !(folio_test_active(folio) || lru_gen_enabled()))
+	if (folio_test_unevictable(folio) ||
+		!(folio_test_active(folio) ||
+		(folio_lru_gen(folio) != -1)))
 		return;
 
 	lruvec_del_folio(lruvec, folio);
@@ -618,7 +620,7 @@ static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio)
 
 	lruvec_del_folio(lruvec, folio);
 	folio_clear_active(folio);
-	if (lru_gen_enabled())
+	if (folio_lru_gen(folio) != -1)
 		lru_gen_clear_refs(folio);
 	else
 		folio_clear_referenced(folio);
@@ -689,7 +691,7 @@ void deactivate_file_folio(struct folio *folio)
 	if (folio_test_unevictable(folio) || !folio_test_lru(folio))
 		return;
 
-	if (lru_gen_enabled() && lru_gen_clear_refs(folio))
+	if ((folio_lru_gen(folio) != -1) && lru_gen_clear_refs(folio))
 		return;
 
 	folio_batch_add_and_move(folio, lru_deactivate_file);
@@ -708,7 +710,7 @@ void folio_deactivate(struct folio *folio)
 	if (folio_test_unevictable(folio) || !folio_test_lru(folio))
 		return;
 
-	if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
+	if ((folio_lru_gen(folio) != -1) ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
 		return;
 
 	folio_batch_add_and_move(folio, lru_deactivate);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fc9373e8251..38d38edda471 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -873,11 +873,23 @@ static bool lru_gen_set_refs(struct folio *folio)
 	set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
 	return true;
 }
+
+DEFINE_STATIC_KEY_FALSE(lru_drain_core);
+static inline bool lru_gen_draining(void)
+{
+	return static_branch_unlikely(&lru_drain_core);
+}
+
 #else
 static bool lru_gen_set_refs(struct folio *folio)
 {
 	return false;
 }
+static inline bool lru_gen_draining(void)
+{
+	return false;
+}
+
 #endif /* CONFIG_LRU_GEN */
 
 static enum folio_references folio_check_references(struct folio *folio,
@@ -905,7 +917,7 @@ static enum folio_references folio_check_references(struct folio *folio,
 	if (referenced_ptes == -1)
 		return FOLIOREF_KEEP;
 
-	if (lru_gen_enabled()) {
+	if (folio_lru_gen(folio) != -1) {
 		if (!referenced_ptes)
 			return FOLIOREF_RECLAIM;
 
@@ -2319,7 +2331,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
-	if (lru_gen_enabled())
+	if (lru_gen_enabled() && !lru_gen_draining())
 		return;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
@@ -5178,6 +5190,8 @@ static void lru_gen_change_state(bool enabled)
 	if (enabled == lru_gen_enabled())
 		goto unlock;
 
+	static_branch_enable_cpuslocked(&lru_drain_core);
+
 	if (enabled)
 		static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
 	else
@@ -5208,6 +5222,9 @@ static void lru_gen_change_state(bool enabled)
 
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	static_branch_disable_cpuslocked(&lru_drain_core);
+
 unlock:
 	mutex_unlock(&state_mutex);
 	put_online_mems();
@@ -5780,9 +5797,12 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	bool proportional_reclaim;
 	struct blk_plug plug;
 
-	if (lru_gen_enabled() && !root_reclaim(sc)) {
+	if ((lru_gen_enabled() || lru_gen_draining()) && !root_reclaim(sc)) {
 		lru_gen_shrink_lruvec(lruvec, sc);
-		return;
+
+		if (!lru_gen_draining())
+			return;
+
 	}
 
 	get_scan_count(lruvec, sc, nr);
@@ -6041,11 +6061,17 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long nr_reclaimed, nr_scanned, nr_node_reclaimed;
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
+	s8 priority = sc->priority;
 
-	if (lru_gen_enabled() && root_reclaim(sc)) {
+	if ((lru_gen_enabled() || lru_gen_draining()) && root_reclaim(sc)) {
 		memset(&sc->nr, 0, sizeof(sc->nr));
 		lru_gen_shrink_node(pgdat, sc);
-		return;
+
+		if (!lru_gen_draining())
+			return;
+
+		sc->priority = priority;
+
 	}
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
@@ -6315,7 +6341,7 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
-	if (lru_gen_enabled())
+	if (lru_gen_enabled() && !lru_gen_draining())
 		return;
 
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
@@ -6703,10 +6729,15 @@ static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
+	s8 priority = sc->priority;
 
-	if (lru_gen_enabled()) {
+	if (lru_gen_enabled() || lru_gen_draining()) {
 		lru_gen_age_node(pgdat, sc);
-		return;
+
+		if (!lru_gen_draining())
+			return;
+
+		sc->priority = priority;
 	}
 
 	lruvec = mem_cgroup_lruvec(NULL, pgdat);

-- 
2.52.0




^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v2 2/2] mm/mglru: maintain workingset refault context across state transitions
  2026-03-11 12:09 [PATCH v2 0/2] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou via B4 Relay
  2026-03-11 12:09 ` [PATCH v2 1/2] " Leno Hou via B4 Relay
@ 2026-03-11 12:09 ` Leno Hou via B4 Relay
  1 sibling, 0 replies; 6+ messages in thread
From: Leno Hou via B4 Relay @ 2026-03-11 12:09 UTC (permalink / raw)
  To: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, Barry Song
  Cc: linux-mm, linux-kernel, Leno Hou

From: Leno Hou <lenohou@gmail.com>

When MGLRU state is toggled dynamically, existing shadow entries (eviction
tokens) lose their context. Traditional LRU and MGLRU handle workingset
refaults using different logic. Without context, shadow entries
re-activated by the "wrong" reclaim logic trigger excessive page
activations (pgactivate) and system thrashing, as the kernel cannot
correctly distinguish if a refaulted page was originally managed by
MGLRU or the traditional LRU.

This patch introduces shadow entry context tracking:

- Encode MGLRU origin: Introduce WORKINGSET_MGLRU_SHIFT into the shadow
  entry (eviction token) encoding. This adds an 'is_mglru' bit to shadow
  entries, allowing the kernel to correctly identify the originating
  reclaim logic for a page even after the global MGLRU state has been
  toggled.

- Refault logic dispatch: Use this 'is_mglru' bit in workingset_refault()
  and workingset_test_recent() to dispatch refault events to the correct
  handler (lru_gen_refault vs. traditional workingset refault).

This ensures that refaulted pages are handled by the appropriate reclaim
logic regardless of the current MGLRU enabled state, preventing
unnecessary thrashing and state-inconsistent refault activations during
state transitions.
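
The encoding can be sketched in isolation. The field widths below are
illustrative only (the kernel derives them from MEM_CGROUP_ID_SHIFT,
NODES_SHIFT, and BITS_PER_XA_VALUE, and also folds in the memcg ID and
xa_mk_value()); the point is that the is_mglru bit is packed last, so it
lands in the lowest bit and is shifted out first on unpack:

```c
#include <stdbool.h>

/* Illustrative field widths; not the kernel's actual layout. */
#define MGLRU_BITS      1
#define WORKINGSET_BITS 1
#define NID_BITS        4

static unsigned long pack_shadow(unsigned long eviction, int nid,
				 bool workingset, bool is_mglru)
{
	eviction = (eviction << NID_BITS) | (unsigned long)nid;
	eviction = (eviction << WORKINGSET_BITS) | workingset;
	eviction = (eviction << MGLRU_BITS) | is_mglru;
	return eviction;
}

static void unpack_shadow(unsigned long entry, unsigned long *eviction,
			  int *nid, bool *workingset, bool *is_mglru)
{
	*is_mglru = entry & ((1UL << MGLRU_BITS) - 1);
	entry >>= MGLRU_BITS;
	*workingset = entry & ((1UL << WORKINGSET_BITS) - 1);
	entry >>= WORKINGSET_BITS;
	*nid = (int)(entry & ((1UL << NID_BITS) - 1));
	entry >>= NID_BITS;
	*eviction = entry;
}
```

A round trip preserves all fields, and a refault handler only needs the
low bit of the raw entry to pick the correct dispatch path.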

To: Andrew Morton <akpm@linux-foundation.org>
To: Axel Rasmussen <axelrasmussen@google.com>
To: Yuanchu Xie <yuanchu@google.com>
To: Wei Xu <weixugc@google.com>
To: Barry Song <21cnbao@gmail.com>
To: Jialing Wang <wjl.linux@gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
To: Yu Zhao <yuzhao@google.com>
To: Kairui Song <ryncsn@gmail.com>
To: Bingfang Guo <bfguo@icloud.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Leno Hou <lenohou@gmail.com>
---
 mm/workingset.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/workingset.c b/mm/workingset.c
index 13422d304715..baa766daac24 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -180,8 +180,10 @@
  * refault distance will immediately activate the refaulting page.
  */
 
+#define WORKINGSET_MGLRU_SHIFT  1
 #define WORKINGSET_SHIFT 1
 #define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) +	\
+			 WORKINGSET_MGLRU_SHIFT + \
 			 WORKINGSET_SHIFT + NODES_SHIFT + \
 			 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
@@ -197,12 +199,13 @@
 static unsigned int bucket_order __read_mostly;
 
 static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
-			 bool workingset)
+			 bool workingset, bool is_mglru)
 {
 	eviction &= EVICTION_MASK;
 	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
 	eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
 	eviction = (eviction << WORKINGSET_SHIFT) | workingset;
+	eviction = (eviction << WORKINGSET_MGLRU_SHIFT) | is_mglru;
 
 	return xa_mk_value(eviction);
 }
@@ -214,6 +217,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	int memcgid, nid;
 	bool workingset;
 
+	entry >>= WORKINGSET_MGLRU_SHIFT;
 	workingset = entry & ((1UL << WORKINGSET_SHIFT) - 1);
 	entry >>= WORKINGSET_SHIFT;
 	nid = entry & ((1UL << NODES_SHIFT) - 1);
@@ -254,7 +258,7 @@ static void *lru_gen_eviction(struct folio *folio)
 	hist = lru_hist_from_seq(min_seq);
 	atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
 
-	return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset);
+	return pack_shadow(mem_cgroup_private_id(memcg), pgdat, token, workingset, true);
 }
 
 /*
@@ -390,7 +394,7 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 
-	if (lru_gen_enabled())
+	if (folio_lru_gen(folio) != -1)
 		return lru_gen_eviction(folio);
 
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
@@ -400,7 +404,7 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 	eviction >>= bucket_order;
 	workingset_age_nonresident(lruvec, folio_nr_pages(folio));
 	return pack_shadow(memcgid, pgdat, eviction,
-				folio_test_workingset(folio));
+				folio_test_workingset(folio), false);
 }
 
 /**
@@ -426,8 +430,10 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
 	int memcgid;
 	struct pglist_data *pgdat;
 	unsigned long eviction;
+	unsigned long entry = xa_to_value(shadow);
+	bool is_mglru = entry & ((1UL << WORKINGSET_MGLRU_SHIFT) - 1);
 
-	if (lru_gen_enabled()) {
+	if (is_mglru) {
 		bool recent;
 
 		rcu_read_lock();
@@ -539,10 +545,11 @@ void workingset_refault(struct folio *folio, void *shadow)
 	struct lruvec *lruvec;
 	bool workingset;
 	long nr;
+	unsigned long entry = xa_to_value(shadow);
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 
-	if (lru_gen_enabled()) {
+	if (entry & ((1UL << WORKINGSET_MGLRU_SHIFT) - 1)) {
 		lru_gen_refault(folio, shadow);
 		return;
 	}

-- 
2.52.0




^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-11 12:09 ` [PATCH v2 1/2] " Leno Hou via B4 Relay
@ 2026-03-12  6:02   ` Barry Song
  2026-03-12 16:44     ` Leno Hou
  0 siblings, 1 reply; 6+ messages in thread
From: Barry Song @ 2026-03-12  6:02 UTC (permalink / raw)
  To: lenohou
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On Wed, Mar 11, 2026 at 8:11 PM Leno Hou via B4 Relay
<devnull+lenohou.gmail.com@kernel.org> wrote:
>
> From: Leno Hou <lenohou@gmail.com>
>
> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim
> path. This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
>
> Problem Description
> ==================
>
> The issue arises from a "reclaim vacuum" during the transition.
>
> 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
>    false before the pages are drained from MGLRU lists back to
>    traditional LRU lists.
> 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
>    and skip the MGLRU path.
> 3. However, these pages might not have reached the traditional LRU lists
>    yet, or the changes are not yet visible to all CPUs due to a lack of
>    synchronization.
> 4. get_scan_count() subsequently finds traditional LRU lists empty,
>    concludes there is no reclaimable memory, and triggers an OOM kill.
>
> A similar race can occur during enablement, where the reclaimer sees
> the new state but the MGLRU lists haven't been populated via
> fill_evictable() yet.
>
>
> Solution
> =======
>
> Introduce a 'draining' state (`lru_drain_core`) to bridge the
> transition. When transitioning, the system enters this intermediate state
> where the reclaimer is forced to attempt both MGLRU and traditional reclaim
> paths sequentially. This ensures that folios remain visible to at least
> one reclaim mechanism until the transition is fully materialized across all
> CPUs.
>
> Changes
> =======
>
> - Adds a static branch `lru_drain_core` to track the transition state.
> - Updates shrink_lruvec(), shrink_node(), and kswapd_age_node() to allow
>   a "joint reclaim" period during the transition.
> - Ensures all LRU helpers correctly identify page state by checking
>   folio_lru_gen(folio) != -1 instead of relying solely on global flags.
>
> This effectively eliminates the race window that previously triggered OOMs
> under high memory pressure.

I don't think this eliminates the race window, but it does reduce it.
There is nothing preventing the draining state from changing while
you are shrinking.

for example:
t1:                                    t2:
   lru_gen_draining() = false;

                                    Drain mglru


   Drain mglru only....

>
> The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> a high-pressure memory cgroup (v1) environment.
>
> To: Andrew Morton <akpm@linux-foundation.org>
> To: Axel Rasmussen <axelrasmussen@google.com>
> To: Yuanchu Xie <yuanchu@google.com>
> To: Wei Xu <weixugc@google.com>
> To: Barry Song <21cnbao@gmail.com>
> To: Jialing Wang <wjl.linux@gmail.com>
> To: Yafang Shao <laoar.shao@gmail.com>
> To: Yu Zhao <yuzhao@google.com>
> To: Kairui Song <ryncsn@gmail.com>
> To: Bingfang Guo <bfguo@icloud.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Leno Hou <lenohou@gmail.com>
> ---
>  include/linux/mm_inline.h |  5 +++++
>  mm/rmap.c                 |  2 +-
>  mm/swap.c                 | 14 ++++++++------
>  mm/vmscan.c               | 49 ++++++++++++++++++++++++++++++++++++++---------
>  4 files changed, 54 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index fa2d6ba811b5..e6443e22bf67 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -321,6 +321,11 @@ static inline bool lru_gen_in_fault(void)
>         return false;
>  }
>
> +static inline int folio_lru_gen(const struct folio *folio)
> +{
> +       return -1;
> +}
> +
>  static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
>  {
>         return false;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f00570d1b9e..488bcdca65ed 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -958,7 +958,7 @@ static bool folio_referenced_one(struct folio *folio,
>                         return false;
>                 }
>
> -               if (lru_gen_enabled() && pvmw.pte) {
> +               if ((folio_lru_gen(folio) != -1) && pvmw.pte) {

I am not quite sure if a folio's gen is set to -1 when it is isolated
from MGLRU for reclamation. If so, I don't think this would work.

Thanks
Barry


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-12  6:02   ` Barry Song
@ 2026-03-12 16:44     ` Leno Hou
  2026-03-12 20:08       ` Barry Song
  0 siblings, 1 reply; 6+ messages in thread
From: Leno Hou @ 2026-03-12 16:44 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On 3/12/26 2:02 PM, Barry Song wrote:
[...]
>> This effectively eliminates the race window that previously triggered OOMs
>> under high memory pressure.
>
> I don't think this eliminates the race window, but it does reduce it.
> There is nothing preventing the draining state from changing while
> you are shrinking.
>
Hi Barry,

You raise a valid point about the race window.

While the window still exists, I believe its impact is effectively mitigated
by the reclaim retry mechanism: even if a reclaimer makes a suboptimal path
decision due to the race, the kernel's reclaim loop (in do_try_to_free_pages())
provides multiple retries. If the first pass fails to reclaim memory because
of the race or an inconsistent state, a subsequent retry will observe the
updated draining state and correctly direct the scan to the appropriate LRU
lists.

Anyway, thanks for the correction; I'll update the commit message.

>>
>> The issue was consistently reproduced on v6.1.157 and v6.18.3 using
>> a high-pressure memory cgroup (v1) environment.
>>
>> To: Andrew Morton <akpm@linux-foundation.org>
>> To: Axel Rasmussen <axelrasmussen@google.com>
>> To: Yuanchu Xie <yuanchu@google.com>
>> To: Wei Xu <weixugc@google.com>
>> To: Barry Song <21cnbao@gmail.com>
>> To: Jialing Wang <wjl.linux@gmail.com>
>> To: Yafang Shao <laoar.shao@gmail.com>
>> To: Yu Zhao <yuzhao@google.com>
>> To: Kairui Song <ryncsn@gmail.com>
>> To: Bingfang Guo <bfguo@icloud.com>
>> Cc: linux-mm@kvack.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Leno Hou <lenohou@gmail.com>
>> ---
>>   include/linux/mm_inline.h |  5 +++++
>>   mm/rmap.c                 |  2 +-
>>   mm/swap.c                 | 14 ++++++++------
>>   mm/vmscan.c               | 49 ++++++++++++++++++++++++++++++++++++++---------
>>   4 files changed, 54 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index fa2d6ba811b5..e6443e22bf67 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -321,6 +321,11 @@ static inline bool lru_gen_in_fault(void)
>>          return false;
>>   }
>>
>> +static inline int folio_lru_gen(const struct folio *folio)
>> +{
>> +       return -1;
>> +}
>> +
>>   static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
>>   {
>>          return false;
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 0f00570d1b9e..488bcdca65ed 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -958,7 +958,7 @@ static bool folio_referenced_one(struct folio *folio,
>>                          return false;
>>                  }
>>
>> -               if (lru_gen_enabled() && pvmw.pte) {
>> +               if ((folio_lru_gen(folio) != -1) && pvmw.pte) {
>
> I am not quite sure if a folio's gen is set to -1 when it is isolated
> from MGLRU for reclamation. If so, I don't think this would work.
>

Thanks for your feedback. You are absolutely right: relying solely on
folio_lru_gen(folio) != -1 is insufficient because the generation tag
is cleared during isolation in lru_gen_del_folio().

In the current implementation, once a page is isolated, its generation
information is lost, causing the logic to fall back to the traditional LRU
path prematurely.

To address this, I am changing the logic to:
```c
@@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio,
                        nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
                }

-               if (lru_gen_enabled() && pvmw.pte) {
+               if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) {
                        if (lru_gen_look_around(&pvmw, nr))
                                referenced++;
                } else if (pvmw.pte) {
```
The rationale is that lru_gen_look_around() is an MGLRU-specific optimization
that promotes pages based on generation aging. During the draining
transition, we should strictly rely on traditional RMAP-based reference
auditing to ensure that pages are correctly migrated to traditional LRU
lists without being spuriously promoted by MGLRU logic. This avoids
mixing MGLRU's generation-based promotion with traditional LRU's 'active'
status, preventing potential reclaim state inconsistencies during the
transition period.


---
Thanks

Leno Hou


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-12 16:44     ` Leno Hou
@ 2026-03-12 20:08       ` Barry Song
  0 siblings, 0 replies; 6+ messages in thread
From: Barry Song @ 2026-03-12 20:08 UTC (permalink / raw)
  To: Leno Hou
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On Fri, Mar 13, 2026 at 12:44 AM Leno Hou <lenohou@gmail.com> wrote:
>
> On 3/12/26 2:02 PM, Barry Song wrote:
> [...]
> >> This effectively eliminates the race window that previously triggered OOMs
> >> under high memory pressure.
> >
> > I don't think this eliminates the race window, but it does reduce it.
> > There is nothing preventing the draining state from changing while
> > you are shrinking.
> >
> Hi Barry,
>
> You raise a valid point about the race window.
>
> While the window exists, I believe the impact is effectively mitigated by
> The reclaim retry mechanism: Even if the reclaimer makes a suboptimal path
>  decision due to the race,  the kernel's reclaim loop (in do_try_to_free_pages)
> provides multiple retries. If the first pass fails to reclaim  memory
> due to the
> race or an inconsistent state, the subsequent retry will observe the updated
> draining state and correctly direct the scan to the appropriate LRU lists.
>
> Anyway,thanks correct me and i'll updating the commit message.

I’m fine with this race condition as long as it is clearly documented
in the commit message.

>
> >>
> >> The issue was consistently reproduced on v6.1.157 and v6.18.3 using
> >> a high-pressure memory cgroup (v1) environment.
> >>
> >> To: Andrew Morton <akpm@linux-foundation.org>
> >> To: Axel Rasmussen <axelrasmussen@google.com>
> >> To: Yuanchu Xie <yuanchu@google.com>
> >> To: Wei Xu <weixugc@google.com>
> >> To: Barry Song <21cnbao@gmail.com>
> >> To: Jialing Wang <wjl.linux@gmail.com>
> >> To: Yafang Shao <laoar.shao@gmail.com>
> >> To: Yu Zhao <yuzhao@google.com>
> >> To: Kairui Song <ryncsn@gmail.com>
> >> To: Bingfang Guo <bfguo@icloud.com>
> >> Cc: linux-mm@kvack.org
> >> Cc: linux-kernel@vger.kernel.org
> >> Signed-off-by: Leno Hou <lenohou@gmail.com>
> >> ---
> >>   include/linux/mm_inline.h |  5 +++++
> >>   mm/rmap.c                 |  2 +-
> >>   mm/swap.c                 | 14 ++++++++------
> >>   mm/vmscan.c               | 49 ++++++++++++++++++++++++++++++++++++++---------
> >>   4 files changed, 54 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> >> index fa2d6ba811b5..e6443e22bf67 100644
> >> --- a/include/linux/mm_inline.h
> >> +++ b/include/linux/mm_inline.h
> >> @@ -321,6 +321,11 @@ static inline bool lru_gen_in_fault(void)
> >>          return false;
> >>   }
> >>
> >> +static inline int folio_lru_gen(const struct folio *folio)
> >> +{
> >> +       return -1;
> >> +}
> >> +
> >>   static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> >>   {
> >>          return false;
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 0f00570d1b9e..488bcdca65ed 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -958,7 +958,7 @@ static bool folio_referenced_one(struct folio *folio,
> >>                          return false;
> >>                  }
> >>
> >> -               if (lru_gen_enabled() && pvmw.pte) {
> >> +               if ((folio_lru_gen(folio) != -1) && pvmw.pte) {
> >
> > I am not quite sure if a folio's gen is set to -1 when it is isolated
> > from MGLRU for reclamation. If so, I don't think this would work.
> >
>
> Thanks for your feedback. You are absolutely right—relying solely on
> folio_lru_gen(folio) != -1 is insufficient because the generation tag
> is cleared during isolation in lru_gen_del_folio().
>
> In the current implementation, once a page is isolated, its generation
> information is lost, causing the logic to fall back to the traditional LRU
> path prematurely.
>
> To address this, I am changing the logic to:
> ```c
> @@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio,
>                         nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
>                 }
>
> -               if (lru_gen_enabled() && pvmw.pte) {
> +               if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) {
>                         if (lru_gen_look_around(&pvmw, nr))
>                                 referenced++;
>                 } else if (pvmw.pte) {
> ```
> The rationale is that lru_gen_look_around() is an MGLRU-specific optimization
> that promotes pages based on generation aging. During the draining
> transition, we should strictly rely on traditional RMAP-based reference
> auditing to ensure that pages are correctly migrated to traditional LRU
> lists without being spuriously promoted by MGLRU logic. This avoids
> mixing MGLRU's generation-based promotion with traditional LRU's 'active'
> status, preventing potential reclaim state inconsistencies during the
> transition period."

Right, using folio_lru_gen(folio) != -1 is dangerous because it
implicitly depends on whether the folio is still on the LRU list.
Even if we never switch between MGLRU and the active/inactive LRU,
the check can become invalid once the folio is removed from the LRU
for various reasons—for example, during isolate_folio().

So if possible, we should identify the real issue that led us to rely
on folio_lru_gen(folio) != -1 in the first place, and avoid inventing
this kind of unstable condition check.

Please DO NOT depend on folio_lru_gen(folio) != -1. If it is
unavoidable, limit its use to the smallest scope possible.

Thanks
Barry


end of thread, other threads:[~2026-03-12 20:08 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-11 12:09 [PATCH v2 0/2] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou via B4 Relay
2026-03-11 12:09 ` [PATCH v2 1/2] " Leno Hou via B4 Relay
2026-03-12  6:02   ` Barry Song
2026-03-12 16:44     ` Leno Hou
2026-03-12 20:08       ` Barry Song
2026-03-11 12:09 ` [PATCH v2 2/2] mm/mglru: maintain workingset refault context across state transitions Leno Hou via B4 Relay
