* [PATCH v3 0/2] mm/mglru: fix cgroup OOM during MGLRU state switching
@ 2026-03-15 18:18 Leno Hou via B4 Relay
2026-03-15 18:18 ` [PATCH v3 1/2] " Leno Hou via B4 Relay
2026-03-15 18:18 ` [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions Leno Hou via B4 Relay
0 siblings, 2 replies; 7+ messages in thread
From: Leno Hou via B4 Relay @ 2026-03-15 18:18 UTC (permalink / raw)
To: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, Barry Song
Cc: linux-mm, linux-kernel, Leno Hou
When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim path.
This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.
Problem Description
===================
The issue arises from a "reclaim vacuum" during the transition.
1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
false before the pages are drained from MGLRU lists back to traditional
LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
yet, or the changes are not yet visible to all CPUs due to a lack
of synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
concludes there is no reclaimable memory, and triggers an OOM kill.
A similar race can occur during enablement, where the reclaimer sees the
new state but the MGLRU lists haven't been populated via fill_evictable()
yet.
Solution
========
Introduce a 'draining' state (`lru_drain_core`) to bridge the transition.
When transitioning, the system enters this intermediate state where
the reclaimer is forced to attempt both MGLRU and traditional reclaim
paths sequentially. This ensures that folios remain visible to at least
one reclaim mechanism until the transition is fully materialized across
all CPUs.
Changes
=======
v3:
- Rebase onto mm-new branch for queue testing
- Skip lru_gen_look_around() while draining
- Address Barry Song's review comments on v2
v2:
- Replace the flags with a static branch `lru_drain_core` to track the
  transition state.
- Ensure all LRU helpers correctly identify page state by checking
  folio_lru_gen(folio) != -1 instead of relying solely on global flags.
- Maintain workingset refault context across MGLRU state transitions
- Fix build error when CONFIG_LRU_GEN is disabled.
v1:
- Use smp_store_release() and smp_load_acquire() to ensure the visibility
of 'enabled' and 'draining' flags across CPUs.
- Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
is in the 'draining' state, the reclaimer will attempt to scan MGLRU
lists first, and then fall through to traditional LRU lists instead
of returning early. This ensures that folios are visible to at least
one reclaim path at any given time.
Race & Mitigation
=================
A race window exists between checking the 'draining' state and performing
the actual list operations. For instance, a reclaimer might observe the
draining state as false just before it changes, leading to a suboptimal
reclaim path decision.
However, this impact is effectively mitigated by the kernel's reclaim
retry mechanism (e.g., in do_try_to_free_pages). If a reclaimer pass fails
to find eligible folios due to a state transition race, subsequent retries
in the loop will observe the updated state and correctly direct the scan
to the appropriate LRU lists. This ensures the transient inconsistency
does not escalate into a terminal OOM kill.
This reduces the race window that previously triggered OOMs under high
memory pressure.
Reproduction
============
The issue was consistently reproduced on v6.1.157 and v6.18.3 using a
high-pressure memory cgroup (v1) environment.
Reproduction steps:
1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
and 8GB active anonymous memory.
2. Toggle MGLRU state while performing new memory allocations to force
direct reclaim.
Reproduction script
===================
```bash
MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"
switch_mglru() {
    local orig_val
    orig_val=$(cat "$MGLRU_FILE")
    if [[ "$orig_val" != "0x0000" ]]; then
        echo n > "$MGLRU_FILE" &
    else
        echo y > "$MGLRU_FILE" &
    fi
}
mkdir -p "$CGROUP_PATH"
echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
echo $$ > "$CGROUP_PATH/cgroup.procs"
dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache
stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
sleep 5
switch_mglru
stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || \
echo "OOM Triggered"
grep oom_kill "$CGROUP_PATH/memory.oom_control"
```
Signed-off-by: Leno Hou <lenohou@gmail.com>
---
Leno Hou (2):
mm/mglru: fix cgroup OOM during MGLRU state switching
mm/mglru: maintain workingset refault context across state transitions
include/linux/mm_inline.h | 16 ++++++++++++++
include/linux/swap.h | 2 +-
mm/rmap.c | 2 +-
mm/swap.c | 15 +++++++------
mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++------------
mm/workingset.c | 22 +++++++++++++------
6 files changed, 83 insertions(+), 29 deletions(-)
---
base-commit: c5a81ff6071bcf42531426e6336b5cc424df6e3d
change-id: 20260311-b4-switch-mglru-v2-8b926a03843f
Best regards,
--
Leno Hou <lenohou@gmail.com>
* [PATCH v3 1/2] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-15 18:18 [PATCH v3 0/2] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou via B4 Relay
@ 2026-03-15 18:18 ` Leno Hou via B4 Relay
2026-03-17 7:52 ` Barry Song
2026-03-15 18:18 ` [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions Leno Hou via B4 Relay
1 sibling, 1 reply; 7+ messages in thread
From: Leno Hou via B4 Relay @ 2026-03-15 18:18 UTC (permalink / raw)
To: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, Barry Song
Cc: linux-mm, linux-kernel, Leno Hou
From: Leno Hou <lenohou@gmail.com>
When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim path.
This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.
Problem Description
===================
The issue arises from a "reclaim vacuum" during the transition.
1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
false before the pages are drained from MGLRU lists back to traditional
LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
yet, or the changes are not yet visible to all CPUs due to a lack
of synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
concludes there is no reclaimable memory, and triggers an OOM kill.
A similar race can occur during enablement, where the reclaimer sees the
new state but the MGLRU lists haven't been populated via fill_evictable()
yet.
Solution
========
Introduce a 'draining' state (`lru_drain_core`) to bridge the transition.
When transitioning, the system enters this intermediate state where
the reclaimer is forced to attempt both MGLRU and traditional reclaim
paths sequentially. This ensures that folios remain visible to at least
one reclaim mechanism until the transition is fully materialized across
all CPUs.
Changes
=======
v3:
- Rebase onto mm-new branch for queue testing
- Skip lru_gen_look_around() while draining
- Address Barry Song's review comments on v2
v2:
- Replace the flags with a static branch `lru_drain_core` to track the
  transition state.
- Ensure all LRU helpers correctly identify page state by checking
  folio_lru_gen(folio) != -1 instead of relying solely on global flags.
- Maintain workingset refault context across MGLRU state transitions
- Fix build error when CONFIG_LRU_GEN is disabled.
v1:
- Use smp_store_release() and smp_load_acquire() to ensure the visibility
of 'enabled' and 'draining' flags across CPUs.
- Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
is in the 'draining' state, the reclaimer will attempt to scan MGLRU
lists first, and then fall through to traditional LRU lists instead
of returning early. This ensures that folios are visible to at least
one reclaim path at any given time.
Race & Mitigation
=================
A race window exists between checking the 'draining' state and performing
the actual list operations. For instance, a reclaimer might observe the
draining state as false just before it changes, leading to a suboptimal
reclaim path decision.
However, this impact is effectively mitigated by the kernel's reclaim
retry mechanism (e.g., in do_try_to_free_pages). If a reclaimer pass fails
to find eligible folios due to a state transition race, subsequent retries
in the loop will observe the updated state and correctly direct the scan
to the appropriate LRU lists. This ensures the transient inconsistency
does not escalate into a terminal OOM kill.
This reduces the race window that previously triggered OOMs under high
memory pressure.
To: Andrew Morton <akpm@linux-foundation.org>
To: Axel Rasmussen <axelrasmussen@google.com>
To: Yuanchu Xie <yuanchu@google.com>
To: Wei Xu <weixugc@google.com>
To: Barry Song <21cnbao@gmail.com>
To: Jialing Wang <wjl.linux@gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
To: Yu Zhao <yuzhao@google.com>
To: Kairui Song <ryncsn@gmail.com>
To: Bingfang Guo <bfguo@icloud.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Leno Hou <lenohou@gmail.com>
---
include/linux/mm_inline.h | 16 ++++++++++++++++
mm/rmap.c | 2 +-
mm/swap.c | 15 +++++++++------
mm/vmscan.c | 38 +++++++++++++++++++++++++++++---------
4 files changed, 55 insertions(+), 16 deletions(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index ad50688d89db..16ac700dac9c 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -102,6 +102,12 @@ static __always_inline enum lru_list folio_lru_list(const struct folio *folio)
#ifdef CONFIG_LRU_GEN
+static inline bool lru_gen_draining(void)
+{
+ DECLARE_STATIC_KEY_FALSE(lru_drain_core);
+
+ return static_branch_unlikely(&lru_drain_core);
+}
#ifdef CONFIG_LRU_GEN_ENABLED
static inline bool lru_gen_enabled(void)
{
@@ -316,11 +322,21 @@ static inline bool lru_gen_enabled(void)
return false;
}
+static inline bool lru_gen_draining(void)
+{
+ return false;
+}
+
static inline bool lru_gen_in_fault(void)
{
return false;
}
+static inline int folio_lru_gen(const struct folio *folio)
+{
+ return -1;
+}
+
static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
{
return false;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6398d7eef393..0b5f663f3062 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio,
nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
}
- if (lru_gen_enabled() && pvmw.pte) {
+ if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) {
if (lru_gen_look_around(&pvmw, nr))
referenced++;
} else if (pvmw.pte) {
diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..ecb192c02d2e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -462,7 +462,7 @@ void folio_mark_accessed(struct folio *folio)
{
if (folio_test_dropbehind(folio))
return;
- if (lru_gen_enabled()) {
+ if (folio_lru_gen(folio) != -1) {
lru_gen_inc_refs(folio);
return;
}
@@ -559,7 +559,7 @@ void folio_add_lru_vma(struct folio *folio, struct vm_area_struct *vma)
*/
static void lru_deactivate_file(struct lruvec *lruvec, struct folio *folio)
{
- bool active = folio_test_active(folio) || lru_gen_enabled();
+ bool active = folio_test_active(folio) || (folio_lru_gen(folio) != -1);
long nr_pages = folio_nr_pages(folio);
if (folio_test_unevictable(folio))
@@ -602,7 +602,9 @@ static void lru_deactivate(struct lruvec *lruvec, struct folio *folio)
{
long nr_pages = folio_nr_pages(folio);
- if (folio_test_unevictable(folio) || !(folio_test_active(folio) || lru_gen_enabled()))
+ if (folio_test_unevictable(folio) ||
+ !(folio_test_active(folio) ||
+ (folio_lru_gen(folio) != -1)))
return;
lruvec_del_folio(lruvec, folio);
@@ -617,6 +619,7 @@ static void lru_deactivate(struct lruvec *lruvec, struct folio *folio)
static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio)
{
long nr_pages = folio_nr_pages(folio);
+ int gen = folio_lru_gen(folio);
if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) ||
folio_test_swapcache(folio) || folio_test_unevictable(folio))
@@ -624,7 +627,7 @@ static void lru_lazyfree(struct lruvec *lruvec, struct folio *folio)
lruvec_del_folio(lruvec, folio);
folio_clear_active(folio);
- if (lru_gen_enabled())
+ if (gen != -1)
lru_gen_clear_refs(folio);
else
folio_clear_referenced(folio);
@@ -695,7 +698,7 @@ void deactivate_file_folio(struct folio *folio)
if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
- if (lru_gen_enabled() && lru_gen_clear_refs(folio))
+ if ((folio_lru_gen(folio) != -1) && lru_gen_clear_refs(folio))
return;
folio_batch_add_and_move(folio, lru_deactivate_file);
@@ -714,7 +717,7 @@ void folio_deactivate(struct folio *folio)
if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
- if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
+ if ((folio_lru_gen(folio) != -1) ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
return;
folio_batch_add_and_move(folio, lru_deactivate);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33287ba4a500..bcefd8db9c03 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio,
if (referenced_ptes == -1)
return FOLIOREF_KEEP;
- if (lru_gen_enabled()) {
+ if (lru_gen_enabled() && !lru_gen_draining()) {
if (!referenced_ptes)
return FOLIOREF_RECLAIM;
@@ -2286,7 +2286,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
unsigned long file;
struct lruvec *target_lruvec;
- if (lru_gen_enabled())
+ if (lru_gen_enabled() && !lru_gen_draining())
return;
target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
@@ -2625,6 +2625,7 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
#ifdef CONFIG_LRU_GEN
+DEFINE_STATIC_KEY_FALSE(lru_drain_core);
#ifdef CONFIG_LRU_GEN_ENABLED
DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
#define get_cap(cap) static_branch_likely(&lru_gen_caps[cap])
@@ -5318,6 +5319,8 @@ static void lru_gen_change_state(bool enabled)
if (enabled == lru_gen_enabled())
goto unlock;
+ static_branch_enable_cpuslocked(&lru_drain_core);
+
if (enabled)
static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
else
@@ -5348,6 +5351,9 @@ static void lru_gen_change_state(bool enabled)
cond_resched();
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+ static_branch_disable_cpuslocked(&lru_drain_core);
+
unlock:
mutex_unlock(&state_mutex);
put_online_mems();
@@ -5920,9 +5926,12 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
bool proportional_reclaim;
struct blk_plug plug;
- if (lru_gen_enabled() && !root_reclaim(sc)) {
+ if ((lru_gen_enabled() || lru_gen_draining()) && !root_reclaim(sc)) {
lru_gen_shrink_lruvec(lruvec, sc);
- return;
+
+ if (!lru_gen_draining())
+ return;
+
}
get_scan_count(lruvec, sc, nr);
@@ -6181,11 +6190,17 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
unsigned long nr_reclaimed, nr_scanned, nr_node_reclaimed;
struct lruvec *target_lruvec;
bool reclaimable = false;
+ s8 priority = sc->priority;
- if (lru_gen_enabled() && root_reclaim(sc)) {
+ if ((lru_gen_enabled() || lru_gen_draining()) && root_reclaim(sc)) {
memset(&sc->nr, 0, sizeof(sc->nr));
lru_gen_shrink_node(pgdat, sc);
- return;
+
+ if (!lru_gen_draining())
+ return;
+
+ sc->priority = priority;
+
}
target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
@@ -6455,7 +6470,7 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
struct lruvec *target_lruvec;
unsigned long refaults;
- if (lru_gen_enabled())
+ if (lru_gen_enabled() && !lru_gen_draining())
return;
target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
@@ -6844,10 +6859,15 @@ static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc)
{
struct mem_cgroup *memcg;
struct lruvec *lruvec;
+ s8 priority = sc->priority;
- if (lru_gen_enabled()) {
+ if (lru_gen_enabled() || lru_gen_draining()) {
lru_gen_age_node(pgdat, sc);
- return;
+
+ if (!lru_gen_draining())
+ return;
+
+ sc->priority = priority;
}
lruvec = mem_cgroup_lruvec(NULL, pgdat);
--
2.52.0
* [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions
2026-03-15 18:18 [PATCH v3 0/2] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou via B4 Relay
2026-03-15 18:18 ` [PATCH v3 1/2] " Leno Hou via B4 Relay
@ 2026-03-15 18:18 ` Leno Hou via B4 Relay
2026-03-18 3:30 ` Kairui Song
2026-03-18 8:14 ` kernel test robot
1 sibling, 2 replies; 7+ messages in thread
From: Leno Hou via B4 Relay @ 2026-03-15 18:18 UTC (permalink / raw)
To: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, Barry Song
Cc: linux-mm, linux-kernel, Leno Hou
From: Leno Hou <lenohou@gmail.com>
When MGLRU state is toggled dynamically, existing shadow entries (eviction
tokens) lose their context. Traditional LRU and MGLRU handle workingset
refaults using different logic. Without context, shadow entries
re-activated by the "wrong" reclaim logic trigger excessive page
activations (pgactivate) and system thrashing, as the kernel cannot
correctly distinguish if a refaulted page was originally managed by
MGLRU or the traditional LRU.
This patch introduces shadow entry context tracking:
- Encode MGLRU origin: Introduce WORKINGSET_MGLRU_SHIFT into the shadow
entry (eviction token) encoding. This adds an 'is_mglru' bit to shadow
entries, allowing the kernel to correctly identify the originating
reclaim logic for a page even after the global MGLRU state has been
toggled.
- Refault logic dispatch: Use this 'is_mglru' bit in workingset_refault()
and workingset_test_recent() to dispatch refault events to the correct
handler (lru_gen_refault vs. traditional workingset refault).
This ensures that refaulted pages are handled by the appropriate reclaim
logic regardless of the current MGLRU enabled state, preventing
unnecessary thrashing and state-inconsistent refault activations during
state transitions.
To: Andrew Morton <akpm@linux-foundation.org>
To: Axel Rasmussen <axelrasmussen@google.com>
To: Yuanchu Xie <yuanchu@google.com>
To: Wei Xu <weixugc@google.com>
To: Barry Song <21cnbao@gmail.com>
To: Jialing Wang <wjl.linux@gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
To: Yu Zhao <yuzhao@google.com>
To: Kairui Song <ryncsn@gmail.com>
To: Bingfang Guo <bfguo@icloud.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Leno Hou <lenohou@gmail.com>
---
include/linux/swap.h | 2 +-
mm/vmscan.c | 17 ++++++++++++-----
mm/workingset.c | 22 +++++++++++++++-------
3 files changed, 28 insertions(+), 13 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7a09df6977a5..5f7d3f08d840 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -297,7 +297,7 @@ static inline swp_entry_t page_swap_entry(struct page *page)
bool workingset_test_recent(void *shadow, bool file, bool *workingset,
bool flush);
void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
-void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
+void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg, bool lru_gen);
void workingset_refault(struct folio *folio, void *shadow);
void workingset_activation(struct folio *folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bcefd8db9c03..de21343b5cd2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -180,6 +180,9 @@ struct scan_control {
/* for recording the reclaimed slab by now */
struct reclaim_state reclaim_state;
+
+ /* whether in lru gen scan context */
+ unsigned int lru_gen:1;
};
#ifdef ARCH_HAS_PREFETCHW
@@ -685,7 +688,7 @@ static pageout_t pageout(struct folio *folio, struct address_space *mapping,
* gets returned with a refcount of 0.
*/
static int __remove_mapping(struct address_space *mapping, struct folio *folio,
- bool reclaimed, struct mem_cgroup *target_memcg)
+ bool reclaimed, struct mem_cgroup *target_memcg, struct scan_control *sc)
{
int refcount;
void *shadow = NULL;
@@ -739,7 +742,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
swp_entry_t swap = folio->swap;
if (reclaimed && !mapping_exiting(mapping))
- shadow = workingset_eviction(folio, target_memcg);
+ shadow = workingset_eviction(folio, target_memcg, sc->lru_gen);
memcg1_swapout(folio, swap);
__swap_cache_del_folio(ci, folio, swap, shadow);
swap_cluster_unlock_irq(ci);
@@ -765,7 +768,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
*/
if (reclaimed && folio_is_file_lru(folio) &&
!mapping_exiting(mapping) && !dax_mapping(mapping))
- shadow = workingset_eviction(folio, target_memcg);
+ shadow = workingset_eviction(folio, target_memcg, sc->lru_gen);
__filemap_remove_folio(folio, shadow);
xa_unlock_irq(&mapping->i_pages);
if (mapping_shrinkable(mapping))
@@ -802,7 +805,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
*/
long remove_mapping(struct address_space *mapping, struct folio *folio)
{
- if (__remove_mapping(mapping, folio, false, NULL)) {
+ if (__remove_mapping(mapping, folio, false, NULL, NULL)) {
/*
* Unfreezing the refcount with 1 effectively
* drops the pagecache ref for us without requiring another
@@ -1499,7 +1502,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
count_vm_events(PGLAZYFREED, nr_pages);
count_memcg_folio_events(folio, PGLAZYFREED, nr_pages);
} else if (!mapping || !__remove_mapping(mapping, folio, true,
- sc->target_mem_cgroup))
+ sc->target_mem_cgroup, sc))
goto keep_locked;
folio_unlock(folio);
@@ -1599,6 +1602,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
+ .lru_gen = lru_gen_enabled(),
};
struct reclaim_stat stat;
unsigned int nr_reclaimed;
@@ -1993,6 +1997,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
if (nr_taken == 0)
return 0;
+ sc->lru_gen = 0;
nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
lruvec_memcg(lruvec));
@@ -2167,6 +2172,7 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
.may_unmap = 1,
.may_swap = 1,
.no_demotion = 1,
+ .lru_gen = lru_gen_enabled(),
};
nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &stat, true, NULL);
@@ -4864,6 +4870,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
if (list_empty(&list))
return scanned;
retry:
+ sc->lru_gen = 1;
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr_reclaimed += reclaimed;
diff --git a/mm/workingset.c b/mm/workingset.c
index 07e6836d0502..3764a4a68c2c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -181,8 +181,10 @@
* refault distance will immediately activate the refaulting page.
*/
+#define WORKINGSET_MGLRU_SHIFT 1
#define WORKINGSET_SHIFT 1
#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
+ WORKINGSET_MGLRU_SHIFT + \
WORKINGSET_SHIFT + NODES_SHIFT + \
MEM_CGROUP_ID_SHIFT)
#define EVICTION_SHIFT_ANON (EVICTION_SHIFT + SWAP_COUNT_SHIFT)
@@ -200,12 +202,13 @@
static unsigned int bucket_order[ANON_AND_FILE] __read_mostly;
static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
- bool workingset, bool file)
+ bool workingset, bool file, bool is_mglru)
{
eviction &= file ? EVICTION_MASK : EVICTION_MASK_ANON;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
eviction = (eviction << WORKINGSET_SHIFT) | workingset;
+ eviction = (eviction << WORKINGSET_MGLRU_SHIFT) | is_mglru;
return xa_mk_value(eviction);
}
@@ -217,6 +220,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
int memcgid, nid;
bool workingset;
+ entry >>= WORKINGSET_MGLRU_SHIFT;
workingset = entry & ((1UL << WORKINGSET_SHIFT) - 1);
entry >>= WORKINGSET_SHIFT;
nid = entry & ((1UL << NODES_SHIFT) - 1);
@@ -263,7 +267,7 @@ static void *lru_gen_eviction(struct folio *folio)
memcg_id = mem_cgroup_private_id(memcg);
rcu_read_unlock();
- return pack_shadow(memcg_id, pgdat, token, workingset, type);
+ return pack_shadow(memcg_id, pgdat, token, workingset, type, true);
}
/*
@@ -387,7 +391,8 @@ void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
* Return: a shadow entry to be stored in @folio->mapping->i_pages in place
* of the evicted @folio so that a later refault can be detected.
*/
-void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
+void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg,
+ bool lru_gen)
{
struct pglist_data *pgdat = folio_pgdat(folio);
int file = folio_is_file_lru(folio);
@@ -400,7 +405,7 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- if (lru_gen_enabled())
+ if (lru_gen)
return lru_gen_eviction(folio);
lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
@@ -410,7 +415,7 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
eviction >>= bucket_order[file];
workingset_age_nonresident(lruvec, folio_nr_pages(folio));
return pack_shadow(memcgid, pgdat, eviction,
- folio_test_workingset(folio), file);
+ folio_test_workingset(folio), file, false);
}
/**
@@ -436,8 +441,10 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
int memcgid;
struct pglist_data *pgdat;
unsigned long eviction;
+ unsigned long entry = xa_to_value(shadow);
+ bool is_mglru = !!(entry & ((1UL << WORKINGSET_MGLRU_SHIFT) - 1));
- if (lru_gen_enabled()) {
+ if (is_mglru) {
bool recent;
rcu_read_lock();
@@ -550,10 +557,11 @@ void workingset_refault(struct folio *folio, void *shadow)
struct lruvec *lruvec;
bool workingset;
long nr;
+ unsigned long entry = xa_to_value(shadow);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- if (lru_gen_enabled()) {
+ if (entry & ((1UL << WORKINGSET_MGLRU_SHIFT) - 1)) {
lru_gen_refault(folio, shadow);
return;
}
--
2.52.0
* Re: [PATCH v3 1/2] mm/mglru: fix cgroup OOM during MGLRU state switching
2026-03-15 18:18 ` [PATCH v3 1/2] " Leno Hou via B4 Relay
@ 2026-03-17 7:52 ` Barry Song
0 siblings, 0 replies; 7+ messages in thread
From: Barry Song @ 2026-03-17 7:52 UTC (permalink / raw)
To: lenohou
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
linux-kernel
On Mon, Mar 16, 2026 at 1:56 PM Leno Hou via B4 Relay
<devnull+lenohou.gmail.com@kernel.org> wrote:
>
> From: Leno Hou <lenohou@gmail.com>
>
> When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
> condition exists between the state switching and the memory reclaim path.
> This can lead to unexpected cgroup OOM kills, even when plenty of
> reclaimable memory is available.
>
> Problem Description
> ==================
>
> The issue arises from a "reclaim vacuum" during the transition.
>
> 1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
> false before the pages are drained from MGLRU lists back to traditional
> LRU lists.
> 2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
> and skip the MGLRU path.
> 3. However, these pages might not have reached the traditional LRU lists
> yet, or the changes are not yet visible to all CPUs due to a lack
> of synchronization.
> 4. get_scan_count() subsequently finds traditional LRU lists empty,
> concludes there is no reclaimable memory, and triggers an OOM kill.
>
> A similar race can occur during enablement, where the reclaimer sees the
> new state but the MGLRU lists haven't been populated via fill_evictable()
> yet.
>
> Solution
> ========
>
> Introduce a 'draining' state (`lru_drain_core`) to bridge the transition.
> When transitioning, the system enters this intermediate state where
> the reclaimer is forced to attempt both MGLRU and traditional reclaim
> paths sequentially. This ensures that folios remain visible to at least
> one reclaim mechanism until the transition is fully materialized across
> all CPUs.
>
> Changes
> =======
>
> v3:
> - Rebase onto mm-new branch for queue testing
> - Skip lru_gen_look_around() while draining
> - Address Barry Song's review comments on v2
>
> v2:
> - Replace the flags with a static branch `lru_drain_core` to track the
>   transition state.
> - Ensure all LRU helpers correctly identify page state by checking
>   folio_lru_gen(folio) != -1 instead of relying solely on global flags.
> - Maintain workingset refault context across MGLRU state transitions
> - Fix build error when CONFIG_LRU_GEN is disabled.
>
> v1:
> - Use smp_store_release() and smp_load_acquire() to ensure the visibility
> of 'enabled' and 'draining' flags across CPUs.
> - Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
> is in the 'draining' state, the reclaimer will attempt to scan MGLRU
> lists first, and then fall through to traditional LRU lists instead
> of returning early. This ensures that folios are visible to at least
> one reclaim path at any given time.
>
> Race & Mitigation
> ================
>
> A race window exists between checking the 'draining' state and performing
> the actual list operations. For instance, a reclaimer might observe the
> draining state as false just before it changes, leading to a suboptimal
> reclaim path decision.
>
> However, this impact is effectively mitigated by the kernel's reclaim
> retry mechanism (e.g., in do_try_to_free_pages). If a reclaimer pass fails
> to find eligible folios due to a state transition race, subsequent retries
> in the loop will observe the updated state and correctly direct the scan
> to the appropriate LRU lists. This ensures the transient inconsistency
> does not escalate into a terminal OOM kill.
>
> This reduces the race window that previously triggered OOMs under high
> memory pressure.
>
> To: Andrew Morton <akpm@linux-foundation.org>
> To: Axel Rasmussen <axelrasmussen@google.com>
> To: Yuanchu Xie <yuanchu@google.com>
> To: Wei Xu <weixugc@google.com>
> To: Barry Song <21cnbao@gmail.com>
> To: Jialing Wang <wjl.linux@gmail.com>
> To: Yafang Shao <laoar.shao@gmail.com>
> To: Yu Zhao <yuzhao@google.com>
> To: Kairui Song <ryncsn@gmail.com>
> To: Bingfang Guo <bfguo@icloud.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Leno Hou <lenohou@gmail.com>
> ---
> include/linux/mm_inline.h | 16 ++++++++++++++++
> mm/rmap.c | 2 +-
> mm/swap.c | 15 +++++++++------
> mm/vmscan.c | 38 +++++++++++++++++++++++++++++---------
> 4 files changed, 55 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index ad50688d89db..16ac700dac9c 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -102,6 +102,12 @@ static __always_inline enum lru_list folio_lru_list(const struct folio *folio)
>
> #ifdef CONFIG_LRU_GEN
>
> +static inline bool lru_gen_draining(void)
> +{
> + DECLARE_STATIC_KEY_FALSE(lru_drain_core);
> +
> + return static_branch_unlikely(&lru_drain_core);
> +}
> #ifdef CONFIG_LRU_GEN_ENABLED
> static inline bool lru_gen_enabled(void)
> {
> @@ -316,11 +322,21 @@ static inline bool lru_gen_enabled(void)
> return false;
> }
>
> +static inline bool lru_gen_draining(void)
> +{
> + return false;
> +}
> +
> static inline bool lru_gen_in_fault(void)
> {
> return false;
> }
>
> +static inline int folio_lru_gen(const struct folio *folio)
> +{
> + return -1;
> +}
> +
> static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
> {
> return false;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6398d7eef393..0b5f663f3062 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio,
> nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
> }
>
> - if (lru_gen_enabled() && pvmw.pte) {
> + if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) {
> if (lru_gen_look_around(&pvmw, nr))
> referenced++;
> } else if (pvmw.pte) {
> diff --git a/mm/swap.c b/mm/swap.c
> index 5cc44f0de987..ecb192c02d2e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -462,7 +462,7 @@ void folio_mark_accessed(struct folio *folio)
> {
> if (folio_test_dropbehind(folio))
> return;
> - if (lru_gen_enabled()) {
> + if (folio_lru_gen(folio) != -1) {
I still feel this is quite dangerous. A folio could be on the
lru_cache rather than on MGLRU’s lists.
This still changes MGLRU’s behavior, much like your v2, which
effectively disabled look_around.
I mentioned this in v2: please avoid depending on
folio_lru_gen() == -1 unless it is absolutely necessary and you are
certain the folio is on an LRU list.
This is hard to verify case by case. From a design perspective,
relying on folio_lru_gen() == -1 is not appropriate.
Thanks
Barry
* Re: [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions
2026-03-15 18:18 ` [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions Leno Hou via B4 Relay
@ 2026-03-18 3:30 ` Kairui Song
2026-03-18 3:41 ` Leno Hou
2026-03-18 8:14 ` kernel test robot
1 sibling, 1 reply; 7+ messages in thread
From: Kairui Song @ 2026-03-18 3:30 UTC (permalink / raw)
To: lenohou
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
Yafang Shao, Yu Zhao, Bingfang Guo, Barry Song, linux-mm,
linux-kernel
On Mon, Mar 16, 2026 at 1:56 PM Leno Hou via B4 Relay
<devnull+lenohou.gmail.com@kernel.org> wrote:
>
> From: Leno Hou <lenohou@gmail.com>
>
> When MGLRU state is toggled dynamically, existing shadow entries (eviction
> tokens) lose their context. Traditional LRU and MGLRU handle workingset
> refaults using different logic. Without context, shadow entries
> re-activated by the "wrong" reclaim logic trigger excessive page
> activations (pgactivate) and system thrashing, as the kernel cannot
> correctly distinguish if a refaulted page was originally managed by
> MGLRU or the traditional LRU.
>
> This patch introduces shadow entry context tracking:
>
> - Encode MGLRU origin: Introduce WORKINGSET_MGLRU_SHIFT into the shadow
> entry (eviction token) encoding. This adds an 'is_mglru' bit to shadow
> entries, allowing the kernel to correctly identify the originating
> reclaim logic for a page even after the global MGLRU state has been
> toggled.
Hi Leno,
I really don't think it's a good idea to waste a bit there just for
the transition state, which is rarely used. And if you switch between
MGLRU / non-MGLRU, the refault distance check is already kind of
meaningless unless we unify their reactivation logic.
BTW I tried that sometime ago: https://lwn.net/Articles/945266/
>
> - Refault logic dispatch: Use this 'is_mglru' bit in workingset_refault()
> and workingset_test_recent() to dispatch refault events to the correct
> handler (lru_gen_refault vs. traditional workingset refault).
Hmm, restoring the folio ref count in MGLRU is not the same thing as
reactivation or restoring the workingset flag in the non-MGLRU case,
and not really comparable. Not sure this will be helpful.
Maybe for now we can just ignore this part; a shadow entry is only a
hint after all. Switching the LRU at runtime is already a huge
performance hit and not recommended, so the shadow handling is trivial
by comparison.
* Re: [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions
2026-03-18 3:30 ` Kairui Song
@ 2026-03-18 3:41 ` Leno Hou
0 siblings, 0 replies; 7+ messages in thread
From: Leno Hou @ 2026-03-18 3:41 UTC (permalink / raw)
To: Kairui Song
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
Yafang Shao, Yu Zhao, Bingfang Guo, Barry Song, linux-mm,
linux-kernel
On 3/18/26 11:30 AM, Kairui Song wrote:
> On Mon, Mar 16, 2026 at 1:56 PM Leno Hou via B4 Relay
> <devnull+lenohou.gmail.com@kernel.org> wrote:
>>
>> From: Leno Hou <lenohou@gmail.com>
>>
>> When MGLRU state is toggled dynamically, existing shadow entries (eviction
>> tokens) lose their context. Traditional LRU and MGLRU handle workingset
>> refaults using different logic. Without context, shadow entries
>> re-activated by the "wrong" reclaim logic trigger excessive page
>> activations (pgactivate) and system thrashing, as the kernel cannot
>> correctly distinguish if a refaulted page was originally managed by
>> MGLRU or the traditional LRU.
>>
>> This patch introduces shadow entry context tracking:
>>
>> - Encode MGLRU origin: Introduce WORKINGSET_MGLRU_SHIFT into the shadow
>> entry (eviction token) encoding. This adds an 'is_mglru' bit to shadow
>> entries, allowing the kernel to correctly identify the originating
>> reclaim logic for a page even after the global MGLRU state has been
>> toggled.
>
> Hi Leno,
>
> I really don't think it's a good idea to waste a bit there just for
> the transition state, which is rarely used. And if you switch between
> MGLRU / non-MGLRU, the refault distance check is already kind of
> meaningless unless we unify their reactivation logic.
>
> BTW I tried that sometime ago: https://lwn.net/Articles/945266/
>
>>
>> - Refault logic dispatch: Use this 'is_mglru' bit in workingset_refault()
>> and workingset_test_recent() to dispatch refault events to the correct
>> handler (lru_gen_refault vs. traditional workingset refault).
>
> Hmm, restoring the folio ref count in MGLRU is not the same thing as
> reactivation or restoring the workingset flag in the non-MGLRU case,
> and not really comparable. Not sure this will be helpful.
>
> Maybe for now we can just ignore this part; a shadow entry is only a
> hint after all. Switching the LRU at runtime is already a huge
> performance hit and not recommended, so the shadow handling is trivial
> by comparison.
Hi Kairui,
Thank you for the insightful feedback. I completely agree with your
assessment: the workingset refault context is indeed just a hint, and
trying to align or convert these tokens between MGLRU and non-MGLRU
states is overly complex and likely unnecessary, especially given that
runtime switching is an extreme and infrequent operation.
I have decided to take your advice and completely remove the patches
related to workingset refault context tracking and folio_lru_gen state
checking.
My revised patch will focus solely on the lru_drain_core state machine,
which is the minimal and robust approach to address the primary issue:
preventing cgroup OOMs caused by the race condition during state
transitions. This should significantly reduce the complexity and risk of
the patch series.
I've sent a simplified v4 patch series that focuses strictly on the
lru_drain_core logic, removing all the disputed context-tracking code.
The patch was tested on the latest 7.0.0-rc1 with 1000 on/off toggle
iterations, and no OOM was observed.
Thank you for helping me sharpen the focus of this fix.
Best regards,
Leno Hou
* Re: [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions
2026-03-15 18:18 ` [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions Leno Hou via B4 Relay
2026-03-18 3:30 ` Kairui Song
@ 2026-03-18 8:14 ` kernel test robot
1 sibling, 0 replies; 7+ messages in thread
From: kernel test robot @ 2026-03-18 8:14 UTC (permalink / raw)
To: Leno Hou via B4 Relay, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Jialing Wang, Yafang Shao, Yu Zhao, Kairui Song,
Bingfang Guo, Barry Song
Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
Leno Hou
Hi Leno,
kernel test robot noticed the following build warnings:
[auto build test WARNING on c5a81ff6071bcf42531426e6336b5cc424df6e3d]
url: https://github.com/intel-lab-lkp/linux/commits/Leno-Hou-via-B4-Relay/mm-mglru-fix-cgroup-OOM-during-MGLRU-state-switching/20260316-140702
base: c5a81ff6071bcf42531426e6336b5cc424df6e3d
patch link: https://lore.kernel.org/r/20260316-b4-switch-mglru-v2-v3-2-c846ce9a2321%40gmail.com
patch subject: [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions
config: openrisc-defconfig (https://download.01.org/0day-ci/archive/20260318/202603181625.6juiPMws-lkp@intel.com/config)
compiler: or1k-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260318/202603181625.6juiPMws-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603181625.6juiPMws-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> Warning: mm/workingset.c:395 function parameter 'lru_gen' not described in 'workingset_eviction'
>> Warning: mm/workingset.c:395 function parameter 'lru_gen' not described in 'workingset_eviction'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
end of thread, other threads:[~2026-03-18 8:14 UTC | newest]
Thread overview: 7+ messages
2026-03-15 18:18 [PATCH v3 0/2] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou via B4 Relay
2026-03-15 18:18 ` [PATCH v3 1/2] " Leno Hou via B4 Relay
2026-03-17 7:52 ` Barry Song
2026-03-15 18:18 ` [PATCH v3 2/2] mm/mglru: maintain workingset refault context across state transitions Leno Hou via B4 Relay
2026-03-18 3:30 ` Kairui Song
2026-03-18 3:41 ` Leno Hou
2026-03-18 8:14 ` kernel test robot