public inbox for linux-mm@kvack.org
* [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching
@ 2026-03-17 17:43 Leno Hou via B4 Relay
  2026-03-18  7:16 ` Barry Song
  0 siblings, 1 reply; 8+ messages in thread
From: Leno Hou via B4 Relay @ 2026-03-17 17:43 UTC (permalink / raw)
  To: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, Barry Song
  Cc: linux-mm, linux-kernel, Leno Hou

From: Leno Hou <lenohou@gmail.com>

When the Multi-Gen LRU (MGLRU) state is toggled dynamically, a race
condition exists between the state switching and the memory reclaim path.
This can lead to unexpected cgroup OOM kills, even when plenty of
reclaimable memory is available.

Problem Description
===================
The issue arises from a "reclaim vacuum" during the transition.

1. When disabling MGLRU, lru_gen_change_state() sets lrugen->enabled to
   false before the pages are drained from MGLRU lists back to traditional
   LRU lists.
2. Concurrent reclaimers in shrink_lruvec() see lrugen->enabled as false
   and skip the MGLRU path.
3. However, these pages might not have reached the traditional LRU lists
   yet, or the changes are not yet visible to all CPUs due to a lack
   of synchronization.
4. get_scan_count() subsequently finds traditional LRU lists empty,
   concludes there is no reclaimable memory, and triggers an OOM kill.

A similar race can occur during enablement, where the reclaimer sees the
new state but the MGLRU lists haven't been populated via fill_evictable()
yet.

Solution
========
Introduce a 'draining' state (`lru_drain_core`) to bridge the transition.
When transitioning, the system enters this intermediate state where
the reclaimer is forced to attempt both MGLRU and traditional reclaim
paths sequentially. This ensures that folios remain visible to at least
one reclaim mechanism until the transition is fully materialized across
all CPUs.

Changes
=======
v4:
 - Address Sashiko.dev's AI CodeReview comments
 - Drop the patch that maintained workingset refault context across
   state transitions
 - Remove the folio_lru_gen(folio) != -1 check introduced in the v2 patch

v3:
 - Rebase onto mm-new branch for queue testing
 - Don't look around while draining
 - Address Barry Song's review comments

v2:
- Use a static branch `lru_drain_core` to track the transition
  state.
- Ensure all LRU helpers correctly identify page state by checking
  folio_lru_gen(folio) != -1 instead of relying solely on global flags.
- Maintain workingset refault context across MGLRU state transitions
- Fix build error when CONFIG_LRU_GEN is disabled.

v1:
- Use smp_store_release() and smp_load_acquire() to ensure the visibility
  of 'enabled' and 'draining' flags across CPUs.
- Modify shrink_lruvec() to allow a "joint reclaim" period. If an lruvec
  is in the 'draining' state, the reclaimer will attempt to scan MGLRU
  lists first, and then fall through to traditional LRU lists instead
  of returning early. This ensures that folios are visible to at least
  one reclaim path at any given time.

Race & Mitigation
=================
A race window exists between checking the 'draining' state and performing
the actual list operations. For instance, a reclaimer might observe the
draining state as false just before it changes, leading to a suboptimal
reclaim path decision.

However, this impact is effectively mitigated by the kernel's reclaim
retry mechanism (e.g., in do_try_to_free_pages()). If a reclaim pass fails
to find eligible folios due to a state transition race, subsequent retries
in the loop will observe the updated state and correctly direct the scan
to the appropriate LRU lists. This ensures the transient inconsistency
does not escalate into a terminal OOM kill.

This effectively narrows the race window that previously triggered OOMs
under high memory pressure.

This fix has been verified on v7.0.0-rc1; dynamic toggling of MGLRU
functions correctly without triggering unexpected OOM kills.

To: Andrew Morton <akpm@linux-foundation.org>
To: Axel Rasmussen <axelrasmussen@google.com>
To: Yuanchu Xie <yuanchu@google.com>
To: Wei Xu <weixugc@google.com>
To: Barry Song <21cnbao@gmail.com>
To: Jialing Wang <wjl.linux@gmail.com>
To: Yafang Shao <laoar.shao@gmail.com>
To: Yu Zhao <yuzhao@google.com>
To: Kairui Song <ryncsn@gmail.com>
To: Bingfang Guo <bfguo@icloud.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Leno Hou <lenohou@gmail.com>
---

Reproduction
============

The issue was consistently reproduced on v6.1.157 and v6.18.3 using a
high-pressure memory cgroup (v1) environment.

Reproduction steps: 
1. Create a 16GB memcg and populate it with 10GB file cache (5GB active)
   and 8GB active anonymous memory.
2. Toggle MGLRU state while performing new memory allocations to force
   direct reclaim.

Reproduction script 
===================

```bash

MGLRU_FILE="/sys/kernel/mm/lru_gen/enabled"
CGROUP_PATH="/sys/fs/cgroup/memory/memcg_oom_test"

# Toggle the MGLRU state in the background so the switch races with
# the direct reclaim triggered below.
switch_mglru() {
    local orig_val
    orig_val=$(cat "$MGLRU_FILE")
    if [[ "$orig_val" != "0x0000" ]]; then
        echo n > "$MGLRU_FILE" &
    else
        echo y > "$MGLRU_FILE" &
    fi
}

mkdir -p "$CGROUP_PATH"
echo $((16 * 1024 * 1024 * 1024)) > "$CGROUP_PATH/memory.limit_in_bytes"
echo $$ > "$CGROUP_PATH/cgroup.procs"

dd if=/dev/urandom of=/tmp/test_file bs=1M count=10240
dd if=/tmp/test_file of=/dev/null bs=1M # Warm up cache

stress-ng --vm 1 --vm-bytes 8G --vm-keep -t 600 &
sleep 5

switch_mglru
stress-ng --vm 1 --vm-bytes 2G --vm-populate --timeout 5s || \
    echo "OOM Triggered"

grep oom_kill "$CGROUP_PATH/memory.oom_control"
```
---
Changes in v4:
- Address Sashiko.dev's AI CodeReview comments
  Link: https://sashiko.dev/#/patchset/20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321%40gmail.com
- Drop the patch that maintained workingset refault context across
  state transitions
- Remove the folio_lru_gen(folio) != -1 check introduced in the v2 patch
- Link to v3: https://lore.kernel.org/r/20260316-b4-switch-mglru-v2-v3-0-c846ce9a2321@gmail.com
---
 include/linux/mm_inline.h | 11 +++++++++++
 mm/rmap.c                 |  2 +-
 mm/vmscan.c               | 33 ++++++++++++++++++++++++---------
 3 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index ad50688d89db..1f6b19bf365b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -102,6 +102,12 @@ static __always_inline enum lru_list folio_lru_list(const struct folio *folio)
 
 #ifdef CONFIG_LRU_GEN
 
+static inline bool lru_gen_draining(void)
+{
+	DECLARE_STATIC_KEY_FALSE(lru_drain_core);
+
+	return static_branch_unlikely(&lru_drain_core);
+}
 #ifdef CONFIG_LRU_GEN_ENABLED
 static inline bool lru_gen_enabled(void)
 {
@@ -316,6 +322,11 @@ static inline bool lru_gen_enabled(void)
 	return false;
 }
 
+static inline bool lru_gen_draining(void)
+{
+	return false;
+}
+
 static inline bool lru_gen_in_fault(void)
 {
 	return false;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6398d7eef393..0b5f663f3062 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio,
 			nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
 		}
 
-		if (lru_gen_enabled() && pvmw.pte) {
+		if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) {
 			if (lru_gen_look_around(&pvmw, nr))
 				referenced++;
 		} else if (pvmw.pte) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33287ba4a500..88b9db06e331 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio,
 	if (referenced_ptes == -1)
 		return FOLIOREF_KEEP;
 
-	if (lru_gen_enabled()) {
+	if (lru_gen_enabled() && !lru_gen_draining()) {
 		if (!referenced_ptes)
 			return FOLIOREF_RECLAIM;
 
@@ -2286,7 +2286,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
 	unsigned long file;
 	struct lruvec *target_lruvec;
 
-	if (lru_gen_enabled())
+	if (lru_gen_enabled() && !lru_gen_draining())
 		return;
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
@@ -2625,6 +2625,7 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
 
 #ifdef CONFIG_LRU_GEN
 
+DEFINE_STATIC_KEY_FALSE(lru_drain_core);
 #ifdef CONFIG_LRU_GEN_ENABLED
 DEFINE_STATIC_KEY_ARRAY_TRUE(lru_gen_caps, NR_LRU_GEN_CAPS);
 #define get_cap(cap)	static_branch_likely(&lru_gen_caps[cap])
@@ -5318,6 +5319,8 @@ static void lru_gen_change_state(bool enabled)
 	if (enabled == lru_gen_enabled())
 		goto unlock;
 
+	static_branch_enable_cpuslocked(&lru_drain_core);
+
 	if (enabled)
 		static_branch_enable_cpuslocked(&lru_gen_caps[LRU_GEN_CORE]);
 	else
@@ -5348,6 +5351,9 @@ static void lru_gen_change_state(bool enabled)
 
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	static_branch_disable_cpuslocked(&lru_drain_core);
+
 unlock:
 	mutex_unlock(&state_mutex);
 	put_online_mems();
@@ -5920,9 +5926,12 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	bool proportional_reclaim;
 	struct blk_plug plug;
 
-	if (lru_gen_enabled() && !root_reclaim(sc)) {
+	if ((lru_gen_enabled() || lru_gen_draining()) && !root_reclaim(sc)) {
 		lru_gen_shrink_lruvec(lruvec, sc);
-		return;
+
+		if (!lru_gen_draining())
+			return;
+
 	}
 
 	get_scan_count(lruvec, sc, nr);
@@ -6182,10 +6191,13 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct lruvec *target_lruvec;
 	bool reclaimable = false;
 
-	if (lru_gen_enabled() && root_reclaim(sc)) {
+	if ((lru_gen_enabled() || lru_gen_draining()) && root_reclaim(sc)) {
 		memset(&sc->nr, 0, sizeof(sc->nr));
 		lru_gen_shrink_node(pgdat, sc);
-		return;
+
+		if (!lru_gen_draining())
+			return;
+
 	}
 
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
@@ -6455,7 +6467,7 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	struct lruvec *target_lruvec;
 	unsigned long refaults;
 
-	if (lru_gen_enabled())
+	if (lru_gen_enabled() && !lru_gen_draining())
 		return;
 
 	target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
@@ -6845,9 +6857,12 @@ static void kswapd_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	struct mem_cgroup *memcg;
 	struct lruvec *lruvec;
 
-	if (lru_gen_enabled()) {
+	if (lru_gen_enabled() || lru_gen_draining()) {
 		lru_gen_age_node(pgdat, sc);
-		return;
+
+		if (!lru_gen_draining())
+			return;
+
 	}
 
 	lruvec = mem_cgroup_lruvec(NULL, pgdat);

---
base-commit: 39849a55738542a4cdef8394095ccfa98530e250
change-id: 20260311-b4-switch-mglru-v2-8b926a03843f

Best regards,
-- 
Leno Hou <lenohou@gmail.com>




^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-17 17:43 [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou via B4 Relay
@ 2026-03-18  7:16 ` Barry Song
  2026-03-18  8:16   ` Leno Hou
  0 siblings, 1 reply; 8+ messages in thread
From: Barry Song @ 2026-03-18  7:16 UTC (permalink / raw)
  To: lenohou
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On Wed, Mar 18, 2026 at 11:29 AM Leno Hou via B4 Relay
<devnull+lenohou.gmail.com@kernel.org> wrote:
>
> From: Leno Hou <lenohou@gmail.com>

[...]

>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index ad50688d89db..1f6b19bf365b 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -102,6 +102,12 @@ static __always_inline enum lru_list folio_lru_list(const struct folio *folio)
>
>  #ifdef CONFIG_LRU_GEN
>
> +static inline bool lru_gen_draining(void)
> +{
> +       DECLARE_STATIC_KEY_FALSE(lru_drain_core);
> +
> +       return static_branch_unlikely(&lru_drain_core);
> +}

Can we name it lru_gen_switch() or lru_switch?
Since “drain” implies disabling MGLRU, the operation
could just as well be enabling it. Also, can we drop
the _core suffix?


>  #ifdef CONFIG_LRU_GEN_ENABLED
>  static inline bool lru_gen_enabled(void)
>  {
> @@ -316,6 +322,11 @@ static inline bool lru_gen_enabled(void)
>         return false;
>  }
>
> +static inline bool lru_gen_draining(void)

lru_gen_switching()?

> +{
> +       return false;
> +}
> +
>  static inline bool lru_gen_in_fault(void)
>  {
>         return false;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 6398d7eef393..0b5f663f3062 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio,
>                         nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
>                 }
>
> -               if (lru_gen_enabled() && pvmw.pte) {
> +               if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) {

Ack. When LRU is switching, we don’t know where the
surrounding folios are—they could be on active/inactive
lists or on MGLRU. So the simplest approach is to
disable this look-around optimization.
But please add a comment here explaining it.


>                         if (lru_gen_look_around(&pvmw, nr))
>                                 referenced++;
>                 } else if (pvmw.pte) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 33287ba4a500..88b9db06e331 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio,
>         if (referenced_ptes == -1)
>                 return FOLIOREF_KEEP;
>
> -       if (lru_gen_enabled()) {
> +       if (lru_gen_enabled() && !lru_gen_draining()) {

I’m curious what prompted you to do this.

This feels a bit odd. I assume this effectively makes
folios on MGLRU, as well as those on active/inactive
lists, always follow the active/inactive logic.

It might be fine, but it needs thorough documentation here.

another approach would be:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33287ba4a500..91b60664b652 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -122,6 +122,9 @@ struct scan_control {
        /* Proactive reclaim invoked by userspace */
        unsigned int proactive:1;

+       /* Are we reclaiming from MGLRU */
+       unsigned int lru_gen:1;
+
        /*
         * Cgroup memory below memory.low is protected as long as we
         * don't threaten to OOM. If any cgroup is reclaimed at
@@ -886,7 +889,7 @@ static enum folio_references
folio_check_references(struct folio *folio,
        if (referenced_ptes == -1)
                return FOLIOREF_KEEP;

-       if (lru_gen_enabled()) {
+       if (sc->lru_gen) {
                if (!referenced_ptes)
                        return FOLIOREF_RECLAIM;

This makes the logic perfectly correct (you know exactly
where your folios come from), but I’m not sure it’s worth it.

Anyway, I’d like to understand why you always need to
use the active/inactive logic even for folios from MGLRU.
To me, it seems to work only by coincidence, which isn’t good.

Thanks
Barry



* Re: [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-18  7:16 ` Barry Song
@ 2026-03-18  8:16   ` Leno Hou
  2026-03-18  8:30     ` Barry Song
  0 siblings, 1 reply; 8+ messages in thread
From: Leno Hou @ 2026-03-18  8:16 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On 3/18/26 3:16 PM, Barry Song wrote:
> On Wed, Mar 18, 2026 at 11:29 AM Leno Hou via B4 Relay
> <devnull+lenohou.gmail.com@kernel.org> wrote:
>>
>> From: Leno Hou <lenohou@gmail.com>
> 
> [...]
> 
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index ad50688d89db..1f6b19bf365b 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -102,6 +102,12 @@ static __always_inline enum lru_list folio_lru_list(const struct folio *folio)
>>
>>   #ifdef CONFIG_LRU_GEN
>>
>> +static inline bool lru_gen_draining(void)
>> +{
>> +       DECLARE_STATIC_KEY_FALSE(lru_drain_core);
>> +
>> +       return static_branch_unlikely(&lru_drain_core);
>> +}
> 
> Can we name it lru_gen_switch() or lru_switch?
> Since “drain” implies disabling MGLRU, the operation
> could just as well be enabling it. Also, can we drop
> the _core suffix?

OK. Next V5 patch will be:

+static inline bool lru_gen_switching(void)
+{
+       DECLARE_STATIC_KEY_FALSE(lru_switch);
+
+       return static_branch_unlikely(&lru_switch);
+}

> 
> 
>>   #ifdef CONFIG_LRU_GEN_ENABLED
>>   static inline bool lru_gen_enabled(void)
>>   {
>> @@ -316,6 +322,11 @@ static inline bool lru_gen_enabled(void)
>>          return false;
>>   }
>>
>> +static inline bool lru_gen_draining(void)
> 
> lru_gen_switching()? >
>> +{
>> +       return false;
>> +}
>> +
>>   static inline bool lru_gen_in_fault(void)
>>   {
>>          return false;
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 6398d7eef393..0b5f663f3062 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio,
>>                          nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
>>                  }

OK. I'll add the following documentation along the lines you suggested:
/*
 * When LRU is switching, we don't know where the surrounding folios
 * are: they could be on active/inactive lists or on MGLRU. So the
 * simplest approach is to disable this look-around optimization.
 */
>> -               if (lru_gen_enabled() && pvmw.pte) {
>> +               if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) {
> 
> Ack. When LRU is switching, we don’t know where the
> surrounding folios are—they could be on active/inactive
> lists or on MGLRU. So the simplest approach is to
> disable this look-around optimization.
> But please add a comment here explaining it.
> 
> 
>>                          if (lru_gen_look_around(&pvmw, nr))
>>                                  referenced++;
>>                  } else if (pvmw.pte) {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 33287ba4a500..88b9db06e331 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio,
>>          if (referenced_ptes == -1)
>>                  return FOLIOREF_KEEP;
>>
>> -       if (lru_gen_enabled()) {

Documentation as follows:

/*
  * During the MGLRU state transition (lru_gen_switching), we force
  * folios to follow the traditional active/inactive reference checking.
  *
  * While MGLRU is switching, the generational state of folios is in flux.
  * Falling back to the traditional logic (which relies on PG_referenced/
  * PG_active flags that are consistent across both mechanisms) provides
  * a stable, safe behavior for the folio until it is fully migrated back
  * to the traditional LRU lists. This avoids relying on potentially
  * inconsistent MGLRU generational metadata during the transition.
  */

>> +       if (lru_gen_enabled() && !lru_gen_draining()) {
> 
> I’m curious what prompted you to do this.
> 
> This feels a bit odd. I assume this effectively makes
> folios on MGLRU, as well as those on active/inactive
> lists, always follow the active/inactive logic.
> 
> It might be fine, but it needs thorough documentation here.
> 
> another approach would be:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 33287ba4a500..91b60664b652 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -122,6 +122,9 @@ struct scan_control {
>          /* Proactive reclaim invoked by userspace */
>          unsigned int proactive:1;
> 
> +       /* Are we reclaiming from MGLRU */
> +       unsigned int lru_gen:1;
> +
>          /*
>           * Cgroup memory below memory.low is protected as long as we
>           * don't threaten to OOM. If any cgroup is reclaimed at
> @@ -886,7 +889,7 @@ static enum folio_references
> folio_check_references(struct folio *folio,
>          if (referenced_ptes == -1)
>                  return FOLIOREF_KEEP;
> 
> -       if (lru_gen_enabled()) {
> +       if (sc->lru_gen) {
>                  if (!referenced_ptes)
>                          return FOLIOREF_RECLAIM;
> 
> This makes the logic perfectly correct (you know exactly
> where your folios come from), but I’m not sure it’s worth it.
> 
> Anyway, I’d like to understand why you always need to
> use the active/inactive logic even for folios from MGLRU.
> To me, it seems to work only by coincidence, which isn’t good.
> 
> Thanks
> Barry

Hi Barry,

I agree that using !lru_gen_draining() feels a bit like a fallback path.
However, after considering your suggestion for sc->lru_gen, I'm
concerned about the broad impact of modifying struct scan_control. Since
lru_drain_core is a very transient state, I prefer a localized fix that
doesn't propagate architectural changes throughout the entire reclaim stack.

You mentioned that using the active/inactive logic feels like it works
by 'coincidence'. To clarify, this is an intentional fallback: because
the generational metadata in MGLRU becomes unreliable during draining,
we intentionally downgrade these folios to the traditional logic. Since
the PG_referenced and PG_active bits are maintained by the core VM and
are consistent regardless of whether MGLRU is active, this fallback is
technically sound and robust.

I have added detailed documentation to the code to explain this design
choice, clarifying that it's a deliberate transition strategy rather
than a coincidence.



Best Regards,
Leno Hou




* Re: [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-18  8:16   ` Leno Hou
@ 2026-03-18  8:30     ` Barry Song
  2026-03-18 12:56       ` Leno Hou
  2026-03-18 12:59       ` Leno Hou
  0 siblings, 2 replies; 8+ messages in thread
From: Barry Song @ 2026-03-18  8:30 UTC (permalink / raw)
  To: Leno Hou
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On Wed, Mar 18, 2026 at 4:17 PM Leno Hou <lenohou@gmail.com> wrote:
>
> On 3/18/26 3:16 PM, Barry Song wrote:
> > On Wed, Mar 18, 2026 at 11:29 AM Leno Hou via B4 Relay
> > <devnull+lenohou.gmail.com@kernel.org> wrote:
> >>
> >> From: Leno Hou <lenohou@gmail.com>
> >
> > [...]
> >
> >>
> >> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> >> index ad50688d89db..1f6b19bf365b 100644
> >> --- a/include/linux/mm_inline.h
> >> +++ b/include/linux/mm_inline.h
> >> @@ -102,6 +102,12 @@ static __always_inline enum lru_list folio_lru_list(const struct folio *folio)
> >>
> >>   #ifdef CONFIG_LRU_GEN
> >>
> >> +static inline bool lru_gen_draining(void)
> >> +{
> >> +       DECLARE_STATIC_KEY_FALSE(lru_drain_core);
> >> +
> >> +       return static_branch_unlikely(&lru_drain_core);
> >> +}
> >
> > Can we name it lru_gen_switch() or lru_switch?
> > Since “drain” implies disabling MGLRU, the operation
> > could just as well be enabling it. Also, can we drop
> > the _core suffix?
>
> OK. Next V5 patch will be:
>
> +static inline bool lru_gen_switching(void)
> +{
> +       DECLARE_STATIC_KEY_FALSE(lru_switch);
> +
> +       return static_branch_unlikely(&lru_switch);
> +}
>
> >
> >
> >>   #ifdef CONFIG_LRU_GEN_ENABLED
> >>   static inline bool lru_gen_enabled(void)
> >>   {
> >> @@ -316,6 +322,11 @@ static inline bool lru_gen_enabled(void)
> >>          return false;
> >>   }
> >>
> >> +static inline bool lru_gen_draining(void)
> >
> > lru_gen_switching()? >
> >> +{
> >> +       return false;
> >> +}
> >> +
> >>   static inline bool lru_gen_in_fault(void)
> >>   {
> >>          return false;
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 6398d7eef393..0b5f663f3062 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -966,7 +966,7 @@ static bool folio_referenced_one(struct folio *folio,
> >>                          nr = folio_pte_batch(folio, pvmw.pte, pteval, max_nr);
> >>                  }
>
> OK. I'll add the following documentation along the lines you suggested:
> /*
>  * When LRU is switching, we don't know where the surrounding folios
>  * are: they could be on active/inactive lists or on MGLRU. So the
>  * simplest approach is to disable this look-around optimization.
>  */
> >> -               if (lru_gen_enabled() && pvmw.pte) {
> >> +               if (lru_gen_enabled() && !lru_gen_draining() && pvmw.pte) {
> >
> > Ack. When LRU is switching, we don’t know where the
> > surrounding folios are—they could be on active/inactive
> > lists or on MGLRU. So the simplest approach is to
> > disable this look-around optimization.
> > But please add a comment here explaining it.
> >
> >
> >>                          if (lru_gen_look_around(&pvmw, nr))
> >>                                  referenced++;
> >>                  } else if (pvmw.pte) {
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 33287ba4a500..88b9db06e331 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio,
> >>          if (referenced_ptes == -1)
> >>                  return FOLIOREF_KEEP;
> >>
> >> -       if (lru_gen_enabled()) {
>
> Documentation as follows:
>
> /*
>   * During the MGLRU state transition (lru_gen_switching), we force
>   * folios to follow the traditional active/inactive reference checking.
>   *
>   * While MGLRU is switching, the generational state of folios is in flux.
>   * Falling back to the traditional logic (which relies on PG_referenced/
>   * PG_active flags that are consistent across both mechanisms) provides
>   * a stable, safe behavior for the folio until it is fully migrated back
>   * to the traditional LRU lists. This avoids relying on potentially
>   * inconsistent MGLRU generational metadata during the transition.
>   */
>
> >> +       if (lru_gen_enabled() && !lru_gen_draining()) {
> >
> > I’m curious what prompted you to do this.
> >
> > This feels a bit odd. I assume this effectively makes
> > folios on MGLRU, as well as those on active/inactive
> > lists, always follow the active/inactive logic.
> >
> > It might be fine, but it needs thorough documentation here.
> >
> > another approach would be:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 33287ba4a500..91b60664b652 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -122,6 +122,9 @@ struct scan_control {
> >          /* Proactive reclaim invoked by userspace */
> >          unsigned int proactive:1;
> >
> > +       /* Are we reclaiming from MGLRU */
> > +       unsigned int lru_gen:1;
> > +
> >          /*
> >           * Cgroup memory below memory.low is protected as long as we
> >           * don't threaten to OOM. If any cgroup is reclaimed at
> > @@ -886,7 +889,7 @@ static enum folio_references
> > folio_check_references(struct folio *folio,
> >          if (referenced_ptes == -1)
> >                  return FOLIOREF_KEEP;
> >
> > -       if (lru_gen_enabled()) {
> > +       if (sc->lru_gen) {
> >                  if (!referenced_ptes)
> >                          return FOLIOREF_RECLAIM;
> >
> > This makes the logic perfectly correct (you know exactly
> > where your folios come from), but I’m not sure it’s worth it.
> >
> > Anyway, I’d like to understand why you always need to
> > use the active/inactive logic even for folios from MGLRU.
> > To me, it seems to work only by coincidence, which isn’t good.
> >
> > Thanks
> > Barry
>
> Hi Barry,
>
> I agree that using !lru_gen_draining() feels a bit like a fallback path.
> However, after considering your suggestion for sc->lru_gen, I’m
> concerned about the broad impact of modifying struct scan_control. Since
> lru_drain_core is a very transient state, I prefer a localized fix that
> doesn't propagate architectural changes throughout the entire reclaim stack.
>
> You mentioned that using the active/inactive logic feels like it works
> by 'coincidence'. To clarify, this is an intentional fallback: because
> the generational metadata in MGLRU becomes unreliable during draining,
> we intentionally downgrade these folios to the traditional logic. Since
> the PG_referenced and PG_active bits are maintained by the core VM and
> are consistent regardless of whether MGLRU is active, this fallback is
> technically sound and robust.
>
> I have added detailed documentation to the code to explain this design
> choice, clarifying that it's a deliberate transition strategy rather
> than a coincidence.

Nope. You still haven’t explained why the active/inactive LRU
logic makes it work. MGLRU and active/inactive use different
methods to determine whether a folio is hot or cold. You’re
forcing active/inactive logic to decide hot/cold for an MGLRU
folio. It’s not that simple—PG_referenced isn’t maintained
by the core; it’s specific to active/inactive. See folio_mark_accessed().

Best Regards
Barry


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-18  8:30     ` Barry Song
@ 2026-03-18 12:56       ` Leno Hou
  2026-03-18 21:29         ` Barry Song
  2026-03-18 12:59       ` Leno Hou
  1 sibling, 1 reply; 8+ messages in thread
From: Leno Hou @ 2026-03-18 12:56 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On 3/18/26 4:30 PM, Barry Song wrote:
> On Wed, Mar 18, 2026 at 4:17 PM Leno Hou <lenohou@gmail.com> wrote:
[...]
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 33287ba4a500..88b9db06e331 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio,
>>>>           if (referenced_ptes == -1)
>>>>                   return FOLIOREF_KEEP;
>>>>
>>>> -       if (lru_gen_enabled()) {
>>
>> documentation as following:
>>
>> /*
>>    * During the MGLRU state transition (lru_gen_switching), we force
>>    * folios to follow the traditional active/inactive reference checking.
>>    *
>>    * While MGLRU is switching, the generational state of folios is in flux.
>>    * Falling back to the traditional logic (which relies on PG_referenced/
>>    * PG_active flags that are consistent across both mechanisms) provides
>>    * a stable, safe behavior for the folio until it is fully migrated back
>>    * to the traditional LRU lists. This avoids relying on potentially
>>    * inconsistent MGLRU generational metadata during the transition.
>>    */
>>
>>>> +       if (lru_gen_enabled() && !lru_gen_draining()) {
>>>
>>> I’m curious what prompted you to do this.
>>>
>>> This feels a bit odd. I assume this effectively makes
>>> folios on MGLRU, as well as those on active/inactive
>>> lists, always follow the active/inactive logic.
>>>
>>> It might be fine, but it needs thorough documentation here.
>>>
>>> another approach would be:
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 33287ba4a500..91b60664b652 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -122,6 +122,9 @@ struct scan_control {
>>>           /* Proactive reclaim invoked by userspace */
>>>           unsigned int proactive:1;
>>>
>>> +       /* Are we reclaiming from MGLRU */
>>> +       unsigned int lru_gen:1;
>>> +
>>>           /*
>>>            * Cgroup memory below memory.low is protected as long as we
>>>            * don't threaten to OOM. If any cgroup is reclaimed at
>>> @@ -886,7 +889,7 @@ static enum folio_references
>>> folio_check_references(struct folio *folio,
>>>           if (referenced_ptes == -1)
>>>                   return FOLIOREF_KEEP;
>>>
>>> -       if (lru_gen_enabled()) {
>>> +       if (sc->lru_gen) {
>>>                   if (!referenced_ptes)
>>>                           return FOLIOREF_RECLAIM;
>>>
>>> This makes the logic perfectly correct (you know exactly
>>> where your folios come from), but I’m not sure it’s worth it.
>>>
>>> Anyway, I’d like to understand why you always need to
>>> use the active/inactive logic even for folios from MGLRU.
>>> To me, it seems to work only by coincidence, which isn’t good.
>>>
>>> Thanks
>>> Barry
>>
>> Hi Barry,
>>
>> I agree that using !lru_gen_draining() feels a bit like a fallback path.
>> However, after considering your suggestion for sc->lru_gen, I’m
>> concerned about the broad impact of modifying struct scan_control. Since
>> lru_drain_core is a very transient state, I prefer a localized fix that
>> doesn't propagate architectural changes throughout the entire reclaim stack.
>>
>> You mentioned that using the active/inactive logic feels like it works
>> by 'coincidence'. To clarify, this is an intentional fallback: because
>> the generational metadata in MGLRU becomes unreliable during draining,
>> we intentionally downgrade these folios to the traditional logic. Since
>> the PG_referenced and PG_active bits are maintained by the core VM and
>> are consistent regardless of whether MGLRU is active, this fallback is
>> technically sound and robust.
>>
>> I have added detailed documentation to the code to explain this design
>> choice, clarifying that it's a deliberate transition strategy rather
>> than a coincidence.
> 
> Nope. You still haven’t explained why the active/inactive LRU
> logic makes it work. MGLRU and active/inactive use different
> methods to determine whether a folio is hot or cold. You’re
> forcing active/inactive logic to decide hot/cold for an MGLRU
> folio. It’s not that simple—PG_referenced isn’t maintained
> by the core; it’s specific to active/inactive. See folio_mark_accessed().
> 
> Best Regards
> Barry

Hi Barry,

Thank you for your patience and for pointing out the version-specific 
nuances. You are absolutely correct—my previous assumption that the 
traditional reference-checking logic would serve as a robust fallback 
was fundamentally flawed.

After re-examining the code in v7.0 and comparing it with older versions 
(e.g., v6.1), I see the core issue you highlighted:

1. Evolution of PG_referenced: In older kernels, lru_gen_inc_refs() 
often interacted with the PG_referenced bit, which inadvertently 
provided a 'coincidental' hint for the legacy reclaim path. However, in 
v7.0+, lru_gen_inc_refs() has evolved to use set_mask_bits() on the 
LRU_REFS_MASK bitfield, and it no longer relies on or updates the legacy 
PG_referenced bit for MGLRU folios.

2. The Logic Flaw: When switching from MGLRU to the traditional LRU, 
these folios arrive at the legacy reclaim path with PG_referenced unset 
or stale. If I force them through the legacy folio_check_references() 
path, folio_test_clear_referenced(folio) predictably returns 0. The 
legacy path interprets this as a 'cold' folio, leading to premature 
reclamation. You are correct that forcing this active/inactive logic 
onto MGLRU folios is logically inconsistent.


3. My Revised Approach: Instead of attempting to patch 
folio_check_references() with a fallback logic, I have decided to keep 
the folio_check_references() logic unchanged.

The system handles this transition safely through the kernel's existing 
reclaim loop and retry mechanisms:

    a) While MGLRU is draining, folios are moved back to the traditional 
    LRU lists. Once migrated, these folios will naturally begin 
participating in the legacy reclaim path.

    b) Although some folios might be initially underestimated as 'cold' 
in the very first reclaim pass immediately after the switch, the 
kernel's reclaim loop will naturally re-evaluate them. As they are 
accessed, the standard legacy mechanism will correctly maintain the 
PG_referenced bit, and the system will converge to the correct state 
without needing an explicit fallback path or state-checking in 
folio_check_references().


This approach avoids the logical corruption caused by forcing 
incompatible evaluation methods and relies on the natural convergence of 
the existing reclaim loop.


Does this alignment with the existing reclaim mechanism address your 
concerns about logical consistency?


Best regards,
Leno Hou

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-18  8:30     ` Barry Song
  2026-03-18 12:56       ` Leno Hou
@ 2026-03-18 12:59       ` Leno Hou
  1 sibling, 0 replies; 8+ messages in thread
From: Leno Hou @ 2026-03-18 12:59 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On 3/18/26 4:30 PM, Barry Song wrote:
> On Wed, Mar 18, 2026 at 4:17 PM Leno Hou <lenohou@gmail.com> wrote:
[...]
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 33287ba4a500..88b9db06e331 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio,
>>>>           if (referenced_ptes == -1)
>>>>                   return FOLIOREF_KEEP;
>>>>
>>>> -       if (lru_gen_enabled()) {
>>
>> documentation as following:
>>
>> /*
>>    * During the MGLRU state transition (lru_gen_switching), we force
>>    * folios to follow the traditional active/inactive reference checking.
>>    *
>>    * While MGLRU is switching, the generational state of folios is in flux.
>>    * Falling back to the traditional logic (which relies on PG_referenced/
>>    * PG_active flags that are consistent across both mechanisms) provides
>>    * a stable, safe behavior for the folio until it is fully migrated back
>>    * to the traditional LRU lists. This avoids relying on potentially
>>    * inconsistent MGLRU generational metadata during the transition.
>>    */
>>
>>>> +       if (lru_gen_enabled() && !lru_gen_draining()) {
>>>
>>> I’m curious what prompted you to do this.
>>>
>>> This feels a bit odd. I assume this effectively makes
>>> folios on MGLRU, as well as those on active/inactive
>>> lists, always follow the active/inactive logic.
>>>
>>> It might be fine, but it needs thorough documentation here.
>>>
>>> another approach would be:
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 33287ba4a500..91b60664b652 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -122,6 +122,9 @@ struct scan_control {
>>>           /* Proactive reclaim invoked by userspace */
>>>           unsigned int proactive:1;
>>>
>>> +       /* Are we reclaiming from MGLRU */
>>> +       unsigned int lru_gen:1;
>>> +
>>>           /*
>>>            * Cgroup memory below memory.low is protected as long as we
>>>            * don't threaten to OOM. If any cgroup is reclaimed at
>>> @@ -886,7 +889,7 @@ static enum folio_references
>>> folio_check_references(struct folio *folio,
>>>           if (referenced_ptes == -1)
>>>                   return FOLIOREF_KEEP;
>>>
>>> -       if (lru_gen_enabled()) {
>>> +       if (sc->lru_gen) {
>>>                   if (!referenced_ptes)
>>>                           return FOLIOREF_RECLAIM;
>>>
>>> This makes the logic perfectly correct (you know exactly
>>> where your folios come from), but I’m not sure it’s worth it.
>>>
>>> Anyway, I’d like to understand why you always need to
>>> use the active/inactive logic even for folios from MGLRU.
>>> To me, it seems to work only by coincidence, which isn’t good.
>>>
>>> Thanks
>>> Barry
>>
>> Hi Barry,
>>
>> I agree that using !lru_gen_draining() feels a bit like a fallback path.
>> However, after considering your suggestion for sc->lru_gen, I’m
>> concerned about the broad impact of modifying struct scan_control. Since
>> lru_drain_core is a very transient state, I prefer a localized fix that
>> doesn't propagate architectural changes throughout the entire reclaim stack.
>>
>> You mentioned that using the active/inactive logic feels like it works
>> by 'coincidence'. To clarify, this is an intentional fallback: because
>> the generational metadata in MGLRU becomes unreliable during draining,
>> we intentionally downgrade these folios to the traditional logic. Since
>> the PG_referenced and PG_active bits are maintained by the core VM and
>> are consistent regardless of whether MGLRU is active, this fallback is
>> technically sound and robust.
>>
>> I have added detailed documentation to the code to explain this design
>> choice, clarifying that it's a deliberate transition strategy rather
>> than a coincidence.
> 
> Nope. You still haven’t explained why the active/inactive LRU
> logic makes it work. MGLRU and active/inactive use different
> methods to determine whether a folio is hot or cold. You’re
> forcing active/inactive logic to decide hot/cold for an MGLRU
> folio. It’s not that simple—PG_referenced isn’t maintained
> by the core; it’s specific to active/inactive. See folio_mark_accessed().
> 
> Best Regards
> Barry

Hi Barry,
Thank you for your patience and for pointing out the version-specific 
nuances. You are absolutely correct—my previous assumption that the 
traditional reference-checking logic would serve as a robust fallback 
was fundamentally flawed.

After re-examining the code in v7.0 and comparing it with older versions 
(e.g., v6.1), I see the core issue you highlighted:

1. Evolution of PG_referenced: In older kernels, lru_gen_inc_refs() 
often interacted with the PG_referenced bit, which inadvertently 
provided a 'coincidental' hint for the legacy reclaim path. However, in 
v7.0+, lru_gen_inc_refs() has evolved to use set_mask_bits() on the 
LRU_REFS_MASK bitfield, and it no longer relies on or updates the legacy 
PG_referenced bit for MGLRU folios.

2. The Logic Flaw: When switching from MGLRU to the traditional LRU, 
these folios arrive at the legacy reclaim path with PG_referenced unset 
or stale. If I force them through the legacy folio_check_references() 
path, folio_test_clear_referenced(folio) predictably returns 0. The 
legacy path interprets this as a 'cold' folio, leading to premature 
reclamation. You are correct that forcing this active/inactive logic 
onto MGLRU folios is logically inconsistent.


3. My Revised Approach: Instead of attempting to patch 
folio_check_references() with a fallback logic, I have decided to keep 
the folio_check_references() logic unchanged.

The system handles this transition safely through the kernel's existing 
reclaim loop and retry mechanisms:

    a) While MGLRU is draining, folios are moved back to the traditional 
    LRU lists. Once migrated, these folios will naturally begin 
participating in the legacy reclaim path.

    b) Although some folios might be initially underestimated as 'cold' 
in the very first reclaim pass immediately after the switch, the 
kernel's reclaim loop will naturally re-evaluate them. As they are 
accessed, the standard legacy mechanism will correctly maintain the 
PG_referenced bit, and the system will converge to the correct state 
without needing an explicit fallback path or state-checking in 
folio_check_references().


This approach avoids the logical corruption caused by forcing 
incompatible evaluation methods and relies on the natural convergence of 
the existing reclaim loop.


Does this alignment with the existing reclaim mechanism address your 
concerns about logical consistency?

Best regards,
Leno Hou

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-18 12:56       ` Leno Hou
@ 2026-03-18 21:29         ` Barry Song
  2026-03-19  3:14           ` Leno Hou
  0 siblings, 1 reply; 8+ messages in thread
From: Barry Song @ 2026-03-18 21:29 UTC (permalink / raw)
  To: Leno Hou
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On Wed, Mar 18, 2026 at 8:56 PM Leno Hou <lenohou@gmail.com> wrote:
>
> On 3/18/26 4:30 PM, Barry Song wrote:
> > On Wed, Mar 18, 2026 at 4:17 PM Leno Hou <lenohou@gmail.com> wrote:
> [...]
> >>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>> index 33287ba4a500..88b9db06e331 100644
> >>>> --- a/mm/vmscan.c
> >>>> +++ b/mm/vmscan.c
> >>>> @@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio,
> >>>>           if (referenced_ptes == -1)
> >>>>                   return FOLIOREF_KEEP;
> >>>>
> >>>> -       if (lru_gen_enabled()) {
> >>
> >> documentation as following:
> >>
> >> /*
> >>    * During the MGLRU state transition (lru_gen_switching), we force
> >>    * folios to follow the traditional active/inactive reference checking.
> >>    *
> >>    * While MGLRU is switching, the generational state of folios is in flux.
> >>    * Falling back to the traditional logic (which relies on PG_referenced/
> >>    * PG_active flags that are consistent across both mechanisms) provides
> >>    * a stable, safe behavior for the folio until it is fully migrated back
> >>    * to the traditional LRU lists. This avoids relying on potentially
> >>    * inconsistent MGLRU generational metadata during the transition.
> >>    */
> >>
> >>>> +       if (lru_gen_enabled() && !lru_gen_draining()) {
> >>>
> >>> I’m curious what prompted you to do this.
> >>>
> >>> This feels a bit odd. I assume this effectively makes
> >>> folios on MGLRU, as well as those on active/inactive
> >>> lists, always follow the active/inactive logic.
> >>>
> >>> It might be fine, but it needs thorough documentation here.
> >>>
> >>> another approach would be:
> >>> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>> index 33287ba4a500..91b60664b652 100644
> >>> --- a/mm/vmscan.c
> >>> +++ b/mm/vmscan.c
> >>> @@ -122,6 +122,9 @@ struct scan_control {
> >>>           /* Proactive reclaim invoked by userspace */
> >>>           unsigned int proactive:1;
> >>>
> >>> +       /* Are we reclaiming from MGLRU */
> >>> +       unsigned int lru_gen:1;
> >>> +
> >>>           /*
> >>>            * Cgroup memory below memory.low is protected as long as we
> >>>            * don't threaten to OOM. If any cgroup is reclaimed at
> >>> @@ -886,7 +889,7 @@ static enum folio_references
> >>> folio_check_references(struct folio *folio,
> >>>           if (referenced_ptes == -1)
> >>>                   return FOLIOREF_KEEP;
> >>>
> >>> -       if (lru_gen_enabled()) {
> >>> +       if (sc->lru_gen) {
> >>>                   if (!referenced_ptes)
> >>>                           return FOLIOREF_RECLAIM;
> >>>
> >>> This makes the logic perfectly correct (you know exactly
> >>> where your folios come from), but I’m not sure it’s worth it.
> >>>
> >>> Anyway, I’d like to understand why you always need to
> >>> use the active/inactive logic even for folios from MGLRU.
> >>> To me, it seems to work only by coincidence, which isn’t good.
> >>>
> >>> Thanks
> >>> Barry
> >>
> >> Hi Barry,
> >>
> >> I agree that using !lru_gen_draining() feels a bit like a fallback path.
> >> However, after considering your suggestion for sc->lru_gen, I’m
> >> concerned about the broad impact of modifying struct scan_control. Since
> >> lru_drain_core is a very transient state, I prefer a localized fix that
> >> doesn't propagate architectural changes throughout the entire reclaim stack.
> >>
> >> You mentioned that using the active/inactive logic feels like it works
> >> by 'coincidence'. To clarify, this is an intentional fallback: because
> >> the generational metadata in MGLRU becomes unreliable during draining,
> >> we intentionally downgrade these folios to the traditional logic. Since
> >> the PG_referenced and PG_active bits are maintained by the core VM and
> >> are consistent regardless of whether MGLRU is active, this fallback is
> >> technically sound and robust.
> >>
> >> I have added detailed documentation to the code to explain this design
> >> choice, clarifying that it's a deliberate transition strategy rather
> >> than a coincidence.
> >
> > Nope. You still haven’t explained why the active/inactive LRU
> > logic makes it work. MGLRU and active/inactive use different
> > methods to determine whether a folio is hot or cold. You’re
> > forcing active/inactive logic to decide hot/cold for an MGLRU
> > folio. It’s not that simple—PG_referenced isn’t maintained
> > by the core; it’s specific to active/inactive. See folio_mark_accessed().
> >
> > Best Regards
> > Barry
>
> Hi Barry,
>
> Thank you for your patience and for pointing out the version-specific
> nuances. You are absolutely correct—my previous assumption that the
> traditional reference-checking logic would serve as a robust fallback
> was fundamentally flawed.
>
> After re-examining the code in v7.0 and comparing it with older versions
> (e.g., v6.1), I see the core issue you highlighted:
>
> 1. Evolution of PG_referenced: In older kernels, lru_gen_inc_refs()
> often interacted with the PG_referenced bit, which inadvertently
> provided a 'coincidental' hint for the legacy reclaim path. However, in
> v7.0+, lru_gen_inc_refs() has evolved to use set_mask_bits() on the
> LRU_REFS_MASK bitfield, and it no longer relies on or updates the legacy
> PG_referenced bit for MGLRU folios.
>
> 2. The Logic Flaw: When switching from MGLRU to the traditional LRU,
> these folios arrive at the legacy reclaim path with PG_referenced unset
> or stale. If I force them through the legacy folio_check_references()
> path, folio_test_clear_referenced(folio) predictably returns 0. The
> legacy path interprets this as a 'cold' folio, leading to premature
> reclamation. You are correct that forcing this active/inactive logic
> onto MGLRU folios is logically inconsistent.
>
>
> 3. My Revised Approach: Instead of attempting to patch
> folio_check_references() with a fallback logic, I have decided to keep
> the folio_check_references() logic unchanged.
>
> The system handles this transition safely through the kernel's existing
> reclaim loop and retry mechanisms:
>
>     a) While MGLRU is draining, folios are moved back to the traditional
>     LRU lists. Once migrated, these folios will naturally begin
> participating in the legacy reclaim path.
>
>     b) Although some folios might be initially underestimated as 'cold'
> in the very first reclaim pass immediately after the switch, the
> kernel's reclaim loop will naturally re-evaluate them. As they are
> accessed, the standard legacy mechanism will correctly maintain the
> PG_referenced bit, and the system will converge to the correct state
> without needing an explicit fallback path or state-checking in
> folio_check_references().
>
>
> This approach avoids the logical corruption caused by forcing
> incompatible evaluation methods and relies on the natural convergence of
> the existing reclaim loop.
>
>
> Does this alignment with the existing reclaim mechanism address your
> concerns about logical consistency?

My gut feeling is that we probably don’t need to worry
too much about the accuracy of hot/cold evaluation during
switching, since the system is already in a volatile state
at that point. So as long as we avoid introducing unusual
logic—such as forcing active/inactive decisions onto MGLRU
folios—I’m fine with it.

Ideally, we would add an sc->lru_gen boolean so we know
exactly where the folios come from, rather than relying on
folio_lru_gen(folio) != -1, which can be misleading.
However, if this doesn’t bring much improvement, it may
not be worth increasing the complexity.

Thanks
Barry


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching
  2026-03-18 21:29         ` Barry Song
@ 2026-03-19  3:14           ` Leno Hou
  0 siblings, 0 replies; 8+ messages in thread
From: Leno Hou @ 2026-03-19  3:14 UTC (permalink / raw)
  To: Barry Song
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Jialing Wang,
	Yafang Shao, Yu Zhao, Kairui Song, Bingfang Guo, linux-mm,
	linux-kernel

On 3/19/26 5:29 AM, Barry Song wrote:
> On Wed, Mar 18, 2026 at 8:56 PM Leno Hou <lenohou@gmail.com> wrote:
>>
>> On 3/18/26 4:30 PM, Barry Song wrote:
>>> On Wed, Mar 18, 2026 at 4:17 PM Leno Hou <lenohou@gmail.com> wrote:
>> [...]
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index 33287ba4a500..88b9db06e331 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -886,7 +886,7 @@ static enum folio_references folio_check_references(struct folio *folio,
>>>>>>            if (referenced_ptes == -1)
>>>>>>                    return FOLIOREF_KEEP;
>>>>>>
>>>>>> -       if (lru_gen_enabled()) {
>>>>
>>>> documentation as following:
>>>>
>>>> /*
>>>>     * During the MGLRU state transition (lru_gen_switching), we force
>>>>     * folios to follow the traditional active/inactive reference checking.
>>>>     *
>>>>     * While MGLRU is switching, the generational state of folios is in flux.
>>>>     * Falling back to the traditional logic (which relies on PG_referenced/
>>>>     * PG_active flags that are consistent across both mechanisms) provides
>>>>     * a stable, safe behavior for the folio until it is fully migrated back
>>>>     * to the traditional LRU lists. This avoids relying on potentially
>>>>     * inconsistent MGLRU generational metadata during the transition.
>>>>     */
>>>>
>>>>>> +       if (lru_gen_enabled() && !lru_gen_draining()) {
>>>>>
>>>>> I’m curious what prompted you to do this.
>>>>>
>>>>> This feels a bit odd. I assume this effectively makes
>>>>> folios on MGLRU, as well as those on active/inactive
>>>>> lists, always follow the active/inactive logic.
>>>>>
>>>>> It might be fine, but it needs thorough documentation here.
>>>>>
>>>>> another approach would be:
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 33287ba4a500..91b60664b652 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -122,6 +122,9 @@ struct scan_control {
>>>>>            /* Proactive reclaim invoked by userspace */
>>>>>            unsigned int proactive:1;
>>>>>
>>>>> +       /* Are we reclaiming from MGLRU */
>>>>> +       unsigned int lru_gen:1;
>>>>> +
>>>>>            /*
>>>>>             * Cgroup memory below memory.low is protected as long as we
>>>>>             * don't threaten to OOM. If any cgroup is reclaimed at
>>>>> @@ -886,7 +889,7 @@ static enum folio_references
>>>>> folio_check_references(struct folio *folio,
>>>>>            if (referenced_ptes == -1)
>>>>>                    return FOLIOREF_KEEP;
>>>>>
>>>>> -       if (lru_gen_enabled()) {
>>>>> +       if (sc->lru_gen) {
>>>>>                    if (!referenced_ptes)
>>>>>                            return FOLIOREF_RECLAIM;
>>>>>
>>>>> This makes the logic perfectly correct (you know exactly
>>>>> where your folios come from), but I’m not sure it’s worth it.
>>>>>
>>>>> Anyway, I’d like to understand why you always need to
>>>>> use the active/inactive logic even for folios from MGLRU.
>>>>> To me, it seems to work only by coincidence, which isn’t good.
>>>>>
>>>>> Thanks
>>>>> Barry
>>>>
>>>> Hi Barry,
>>>>
>>>> I agree that using !lru_gen_draining() feels a bit like a fallback path.
>>>> However, after considering your suggestion for sc->lru_gen, I’m
>>>> concerned about the broad impact of modifying struct scan_control. Since
>>>> lru_drain_core is a very transient state, I prefer a localized fix that
>>>> doesn't propagate architectural changes throughout the entire reclaim stack.
>>>>
>>>> You mentioned that using the active/inactive logic feels like it works
>>>> by 'coincidence'. To clarify, this is an intentional fallback: because
>>>> the generational metadata in MGLRU becomes unreliable during draining,
>>>> we intentionally downgrade these folios to the traditional logic. Since
>>>> the PG_referenced and PG_active bits are maintained by the core VM and
>>>> are consistent regardless of whether MGLRU is active, this fallback is
>>>> technically sound and robust.
>>>>
>>>> I have added detailed documentation to the code to explain this design
>>>> choice, clarifying that it's a deliberate transition strategy rather
>>>> than a coincidence.
>>>
>>> Nope. You still haven’t explained why the active/inactive LRU
>>> logic makes it work. MGLRU and active/inactive use different
>>> methods to determine whether a folio is hot or cold. You’re
>>> forcing active/inactive logic to decide hot/cold for an MGLRU
>>> folio. It’s not that simple—PG_referenced isn’t maintained
>>> by the core; it’s specific to active/inactive. See folio_mark_accessed().
>>>
>>> Best Regards
>>> Barry
>>
>> Hi Barry,
>>
>> Thank you for your patience and for pointing out the version-specific
>> nuances. You are absolutely correct—my previous assumption that the
>> traditional reference-checking logic would serve as a robust fallback
>> was fundamentally flawed.
>>
>> After re-examining the code in v7.0 and comparing it with older versions
>> (e.g., v6.1), I see the core issue you highlighted:
>>
>> 1. Evolution of PG_referenced: In older kernels, lru_gen_inc_refs()
>> often interacted with the PG_referenced bit, which inadvertently
>> provided a 'coincidental' hint for the legacy reclaim path. However, in
>> v7.0+, lru_gen_inc_refs() has evolved to use set_mask_bits() on the
>> LRU_REFS_MASK bitfield, and it no longer relies on or updates the legacy
>> PG_referenced bit for MGLRU folios.
>>
>> 2. The Logic Flaw: When switching from MGLRU to the traditional LRU,
>> these folios arrive at the legacy reclaim path with PG_referenced unset
>> or stale. If I force them through the legacy folio_check_references()
>> path, folio_test_clear_referenced(folio) predictably returns 0. The
>> legacy path interprets this as a 'cold' folio, leading to premature
>> reclamation. You are correct that forcing this active/inactive logic
>> onto MGLRU folios is logically inconsistent.
>>
>>
>> 3. My Revised Approach: Instead of attempting to patch
>> folio_check_references() with a fallback logic, I have decided to keep
>> the folio_check_references() logic unchanged.
>>
>> The system handles this transition safely through the kernel's existing
>> reclaim loop and retry mechanisms:
>>
>>      a) While MGLRU is draining, folios are moved back to the traditional
>>      LRU lists. Once migrated, these folios will naturally begin
>> participating in the legacy reclaim path.
>>
>>      b) Although some folios might be initially underestimated as 'cold'
>> in the very first reclaim pass immediately after the switch, the
>> kernel's reclaim loop will naturally re-evaluate them. As they are
>> accessed, the standard legacy mechanism will correctly maintain the
>> PG_referenced bit, and the system will converge to the correct state
>> without needing an explicit fallback path or state-checking in
>> folio_check_references().
>>
>>
>> This approach avoids the logical corruption caused by forcing
>> incompatible evaluation methods and relies on the natural convergence of
>> the existing reclaim loop.
>>
>>
>> Does this alignment with the existing reclaim mechanism address your
>> concerns about logical consistency?
> 
> My gut feeling is that we probably don’t need to worry
> too much about the accuracy of hot/cold evaluation during
> switching, since the system is already in a volatile state
> at that point. So as long as we avoid introducing unusual
> logic—such as forcing active/inactive decisions onto MGLRU
> folios—I’m fine with it.
> 
> Ideally, we would add an sc->lru_gen boolean so we know
> exactly where the folios come from, rather than relying on
> folio_lru_gen(folio) != -1, which can be misleading.
> However, if this doesn’t bring much improvement, it may
> not be worth increasing the complexity.
> 

Hi Barry,


Thank you for the guidance.


To address your concerns regarding readability and maintainability:

1. Naming: I'll rename the transition state to lru_gen_switching 
(instead of draining) to better reflect its purpose.

2. Documentation: I'll add documentation explaining why the look-around
optimization is disabled while the LRU is switching.

This approach keeps the patch minimal and fixes the OOM issue during
switching without introducing extra complexity.

I'll send the v5 patch shortly.

---
Best regards,
Leno Hou

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-03-19  3:14 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-17 17:43 [PATCH v4] mm/mglru: fix cgroup OOM during MGLRU state switching Leno Hou via B4 Relay
2026-03-18  7:16 ` Barry Song
2026-03-18  8:16   ` Leno Hou
2026-03-18  8:30     ` Barry Song
2026-03-18 12:56       ` Leno Hou
2026-03-18 21:29         ` Barry Song
2026-03-19  3:14           ` Leno Hou
2026-03-18 12:59       ` Leno Hou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox