* [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
@ 2026-04-18 12:02 Barry Song (Xiaomi)
2026-04-24 11:53 ` Andrew Morton
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-18 12:02 UTC (permalink / raw)
To: akpm, linux-mm
Cc: linux-kernel, Barry Song (Xiaomi), Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, Shakeel Butt, wangzicheng,
Suren Baghdasaryan, Lei Liu, Matthew Wilcox, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Will Deacon
MGLRU gives high priority to folios mapped in page tables.
As a result, folio_set_active() is invoked for all folios
read during page faults. In practice, however, readahead
can bring in many folios that are never accessed via page
tables.
A previous attempt by Lei Liu proposed introducing a separate
LRU for readahead[1] to make readahead pages easier to reclaim,
but that approach is likely over-engineered.
Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
protection"), folios with PG_active were always placed in
the youngest generation, leading to over-protection and
increased refaults. After that commit, PG_active folios
are placed in the second youngest generation, which is
still too optimistic given the presence of readahead. In
contrast, the classic active/inactive scheme is more
conservative.
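Roughly, the placement after that commit looks like this (a
simplified sketch of lru_gen_folio_seq(); PG_workingset handling
elided):

	if (folio_test_active(folio))
		seq = lrugen->max_seq - 1;	/* second youngest generation */
	else
		seq = lrugen->min_seq[type];	/* the tail, evicted first */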
This patch switches to folio_mark_accessed(). If
folio_check_references() later detects referenced PTEs,
the folio will be promoted based on the reference flag
set by folio_mark_accessed().
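A condensed view of the resulting flow (sketch only, paraphrasing
mm/swap.c and mm/vmscan.c):

	/* fault path, in folio_add_lru() with this patch: */
	folio_mark_accessed(folio);	/* records a reference, not PG_active */

	/* later, under reclaim, in folio_check_references(): */
	if (!referenced_ptes)
		return FOLIOREF_RECLAIM;	/* never touched: reclaim it */
	return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE : FOLIOREF_KEEP;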
The following uses a simple model to demonstrate why the current
code is not ideal. It runs fio-3.42 in a memcg, reading a file in a
strided pattern—4KB every 64KB—to simulate prefaulted pages that may
not be accessed.
#!/bin/bash
CG_NAME="mglru_verify_test"
CG_PATH="/sys/fs/cgroup/$CG_NAME"
MEM_LIMIT="400M"
HOT_SIZE="600M"
# 1. Environment Setup
sudo rmdir "$CG_PATH" 2>/dev/null
sudo mkdir -p "$CG_PATH"
sudo chown -R $USER:$USER "$CG_PATH"
echo "$MEM_LIMIT" > "$CG_PATH/memory.max"
# 2. Prepare Data Files
dd if=/dev/urandom of=hot_data.bin bs=1M count=600 conv=notrunc 2>/dev/null
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
# 3. Start Workload (Working Set)
(
echo $BASHPID > "$CG_PATH/cgroup.procs"
exec ./fio-3.42 --name=hot_ws --rw=read --bs=4K --size=$HOT_SIZE --runtime=600 \
--zonemode=strided --zonesize=4K --zonerange=64K \
--time_based --direct=0 --filename=hot_data.bin --ioengine=mmap \
--fadvise_hint=0 --group_reporting --numjobs=1 > fio.stats
) &
WORKLOAD_PID=$!
# 4. Wait for hot data to warm up
sleep 30
BASE_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
# 5. Run the workload for 60 seconds
sleep 60
# 6. Report refault and IO bandwidth
FINAL_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
FINAL_D_FILE=$((FINAL_FILE - BASE_FILE))
echo "File Refault Delta is $FINAL_D_FILE"
kill $WORKLOAD_PID 2>/dev/null
sleep 2
grep -E "READ|WRITE" fio.stats \
| awk '{for(i=1;i<=NF;i++){if($i ~ /^bw=/) bw=$i; if($i ~ /^io=/) io=$i} print $1, bw, io}'
rm -f hot_data.bin fio.stats
Without the patch, we observed 12883855 file refaults and a very low
bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy
hot positions, continuously pushing out the real working set and
causing incorrect reclaim. With the patch, we observed 0 refaults
and bandwidth increased to 5078 MiB/s.
Note that this patch does not benefit any platform other than arm64,
since commit 315d09bf30c2 ("Revert "mm: make faultaround produce old
ptes"") reverted the change that made prefaulted PTEs “old” after it
was identified as the cause of a ~6% UnixBench regression on x86[2].
x86 reportedly handles the hardware access flag (HW AF) through an
internal microfault mechanism, which is relatively expensive; that
cost shows up when prefaulted PTEs are not marked young directly in
the page fault path, especially when UnixBench runs without any
memory pressure.
Thanks to Will for raising this for arm64—“Create ‘old’ PTEs for
faultaround mappings on arm64 with hardware access flag” [3].
This is also thanks to arm64 microarchitectures, which incur zero cost
for HW AF handling.
It may be time for x86 and other architectures to revisit
whether HW AF is truly costly on their platforms, given that
the original x86 regression was reported 10 years ago.
For those who want to try the model on x86, you will need the
following in arch/x86/include/asm/pgtable.h.
#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
static inline bool arch_wants_old_prefaulted_pte(void)
{
	return true;
}
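For context, the fault-around path consumes this hook roughly as
follows (paraphrasing set_pte_range() in mm/memory.c; exact code may
differ):

	if (prefault && arch_wants_old_prefaulted_pte())
		entry = pte_mkold(entry);	/* old: access must be proven later */
	else
		entry = pte_sw_mkyoung(entry);	/* young: counts as accessed now */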
Lance and Xueyuan made a huge contribution to this patch
through testing. They truly worked over weekends and after
work hours. If this patch deserves any credit, it belongs to
them.
[1] https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@vivo.com/
[2] https://lore.kernel.org/lkml/20160606022724.GA26227@yexl-desktop/
[3] https://lore.kernel.org/lkml/20210120173612.20913-1-will@kernel.org/
Tested-by: Lance Yang <lance.yang@linux.dev>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Qi Zheng <qi.zheng@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: wangzicheng <wangzicheng@honor.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lei Liu <liulei.rjpt@vivo.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
The RFC version was:
[PATCH RFC] mm/mglru: lazily activate folios while folios are really mapped
https://lore.kernel.org/linux-mm/20260225212642.15219-1-21cnbao@gmail.com/
mm/swap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..e3cf703ccb89 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
/* see the comment in lru_gen_folio_seq() */
if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
- folio_set_active(folio);
+ folio_mark_accessed(folio);
folio_batch_add_and_move(folio, lru_add);
}
--
2.39.3 (Apple Git-146)
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
@ 2026-04-24 11:53 ` Andrew Morton
2026-04-28 5:40 ` Barry Song
2026-04-24 14:10 ` Andrew Morton
` (3 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2026-04-24 11:53 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: linux-mm, linux-kernel, Lance Yang, Xueyuan Chen, Kairui Song,
Qi Zheng, Shakeel Butt, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Sat, 18 Apr 2026 20:02:33 +0800 "Barry Song (Xiaomi)" <baohua@kernel.org> wrote:
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
>
> Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> protection"), folios with PG_active were always placed in
> the youngest generation, leading to over-protection and
> increased refaults. After that commit, PG_active folios
> are placed in the second youngest generation, which is
> still too optimistic given the presence of readahead. In
> contrast, the classic active/inactive scheme is more
> conservative.
>
> This patch switches to folio_mark_accessed(). If
> folio_check_references() later detects referenced PTEs,
> the folio will be promoted based on the reference flag
> set by folio_mark_accessed().
>
> The following uses a simple model to demonstrate why the current
> code is not ideal. It runs fio-3.42 in a memcg, reading a file in a
> strided pattern—4KB every 64KB—to simulate prefaulted pages that may
> not be accessed.
Are you able to suggest any workloads which might regress? And test
for those?
> Without the patch, we observed 12883855 file refaults and a very low
> bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy
> hot positions, continuously pushing out the real working set and
> causing incorrect reclaim. With the patch, we observed 0 refaults
> and bandwidth increased to 5078 MiB/s.
Wow. And that isn't a crazy workload.
> For those who want to try the model on x86, you will need the
> following in arch/x86/include/asm/pgtable.h.
>
> #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
> static inline bool arch_wants_old_prefaulted_pte(void)
> {
> return true;
> }
Can you propose a patch? We can at least toss it in there for testing
while we think about it.
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
> /* see the comment in lru_gen_folio_seq() */
> if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> - folio_set_active(folio);
> + folio_mark_accessed(folio);
>
> folio_batch_add_and_move(folio, lru_add);
> }
lol, I was expecting something larger ;)
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
2026-04-24 11:53 ` Andrew Morton
@ 2026-04-24 14:10 ` Andrew Morton
2026-04-24 15:19 ` Pedro Falcato
` (2 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2026-04-24 14:10 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: linux-mm, linux-kernel, Lance Yang, Xueyuan Chen, Kairui Song,
Qi Zheng, Shakeel Butt, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Sat, 18 Apr 2026 20:02:33 +0800 "Barry Song (Xiaomi)" <baohua@kernel.org> wrote:
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
>
> Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> protection"), folios with PG_active were always placed in
> the youngest generation, leading to over-protection and
> increased refaults. After that commit, PG_active folios
> are placed in the second youngest generation, which is
> still too optimistic given the presence of readahead. In
> contrast, the classic active/inactive scheme is more
> conservative.
>
> This patch switches to folio_mark_accessed(). If
> folio_check_references() later detects referenced PTEs,
> the folio will be promoted based on the reference flag
> set by folio_mark_accessed().
Sashiko: https://sashiko.dev/#/patchset/20260418120233.7162-1-baohua@kernel.org
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
2026-04-24 11:53 ` Andrew Morton
2026-04-24 14:10 ` Andrew Morton
@ 2026-04-24 15:19 ` Pedro Falcato
2026-04-26 4:35 ` Barry Song
2026-04-24 17:03 ` Shakeel Butt
2026-04-28 18:54 ` Kairui Song
4 siblings, 1 reply; 16+ messages in thread
From: Pedro Falcato @ 2026-04-24 15:19 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, Shakeel Butt, wangzicheng,
Suren Baghdasaryan, Lei Liu, Matthew Wilcox, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Will Deacon
On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
Why does this even need to be kept? I'm not sure it makes sense
to even mark readahead folios as referenced.
I'd suggest folios should only be marked referenced (or even active, whatever)
when they're mapped. Anything else is a bit random and is hoping you are
eventually going to map them in the future (which is not true for, for example,
anything in an ELF file that may be readahead but not mapped, like debug info,
symbol tables, section headers, relocation tables, etc etc)
--
Pedro
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
` (2 preceding siblings ...)
2026-04-24 15:19 ` Pedro Falcato
@ 2026-04-24 17:03 ` Shakeel Butt
2026-04-26 21:56 ` Barry Song
2026-04-28 18:54 ` Kairui Song
4 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2026-04-24 17:03 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
>
> Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> protection"), folios with PG_active were always placed in
> the youngest generation, leading to over-protection and
> increased refaults. After that commit, PG_active folios
> are placed in the second youngest generation, which is
> still too optimistic given the presence of readahead. In
> contrast, the classic active/inactive scheme is more
> conservative.
>
> This patch switches to folio_mark_accessed(). If
> folio_check_references() later detects referenced PTEs,
> the folio will be promoted based on the reference flag
> set by folio_mark_accessed().
>
There is a comment and stat update in lru_gen_refault() that refer to
the folio_set_active() call this patch removes:

	/* see folio_add_lru() where folio_set_active() will be called */
	if (lru_gen_in_fault())
		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);

Is this still relevant, or does it need changes?

I have not dug deeper into the patch and the heuristic yet. Will do later.
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-24 15:19 ` Pedro Falcato
@ 2026-04-26 4:35 ` Barry Song
2026-04-27 14:46 ` Pedro Falcato
0 siblings, 1 reply; 16+ messages in thread
From: Barry Song @ 2026-04-26 4:35 UTC (permalink / raw)
To: Pedro Falcato
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, Shakeel Butt, wangzicheng,
Suren Baghdasaryan, Lei Liu, Matthew Wilcox, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Will Deacon
On Fri, Apr 24, 2026 at 11:19 PM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> > MGLRU gives high priority to folios mapped in page tables.
> > As a result, folio_set_active() is invoked for all folios
> > read during page faults. In practice, however, readahead
> > can bring in many folios that are never accessed via page
> > tables.
> >
> > A previous attempt by Lei Liu proposed introducing a separate
> > LRU for readahead[1] to make readahead pages easier to reclaim,
> > but that approach is likely over-engineered.
>
> Why does this even need to be kept? I'm not sure it makes sense
> to even mark readahead folios as referenced.
>
> I'd suggest folios should only be marked referenced (or even active, whatever)
> when they're mapped. Anything else is a bit random and is hoping you are
> eventually going to map them in the future (which is not true for, for example,
> anything in an ELF file that may be readahead but not mapped, like debug info,
> symbol tables, section headers, relocation tables, etc etc)
The patch targets the mmap readahead path rather than the syscall
readahead path.
With lru_gen_in_fault() in place, it’s roughly equivalent to
the mapped case, since readahead is typically 128 KB while
fault_around is 64 KB in PF.
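(Those are the kernel defaults: VM_READAHEAD_PAGES is SZ_128K /
PAGE_SIZE in include/linux/mm.h, and fault_around_bytes defaults to
65536 in mm/memory.c.)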
Thanks
Barry
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-24 17:03 ` Shakeel Butt
@ 2026-04-26 21:56 ` Barry Song
0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2026-04-26 21:56 UTC (permalink / raw)
To: Shakeel Butt
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Sat, Apr 25, 2026 at 1:03 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> > MGLRU gives high priority to folios mapped in page tables.
> > As a result, folio_set_active() is invoked for all folios
> > read during page faults. In practice, however, readahead
> > can bring in many folios that are never accessed via page
> > tables.
> >
> > A previous attempt by Lei Liu proposed introducing a separate
> > LRU for readahead[1] to make readahead pages easier to reclaim,
> > but that approach is likely over-engineered.
> >
> > Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> > protection"), folios with PG_active were always placed in
> > the youngest generation, leading to over-protection and
> > increased refaults. After that commit, PG_active folios
> > are placed in the second youngest generation, which is
> > still too optimistic given the presence of readahead. In
> > contrast, the classic active/inactive scheme is more
> > conservative.
> >
> > This patch switches to folio_mark_accessed(). If
> > folio_check_references() later detects referenced PTEs,
> > the folio will be promoted based on the reference flag
> > set by folio_mark_accessed().
> >
>
> There is a comment and stat update in lru_gen_refault() that refer to
> the folio_set_active() call this patch removes:
>
> 	/* see folio_add_lru() where folio_set_active() will be called */
> 	if (lru_gen_in_fault())
> 		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
>
> Is this still relevant, or does it need changes?
This seems like a very good question. From a counting
perspective, there is no impact — in MGLRU, no code depends
on the counting to make decisions, so it is fine. However,
it raises the question of whether we should proactively
call folio_set_active() for some refaulted folios.
In the classic active/inactive case, we mark recently
refaulted folios as active. workingset_test_recent()
measures the refault distance. If the distance is less than
workingset_size, we mark the refaulted folio as active
to protect it.
if (!workingset_test_recent(shadow, file, &workingset, true))
goto out;
folio_set_active(folio);
workingset_age_nonresident(lruvec, nr);
mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
In MGLRU, we compare the current max_seq with the historical
max_seq. If the gap is less than MAX_NR_GENS,
lru_gen_test_recent() considers it recent:
static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
				unsigned long *token, bool *workingset, bool file)
{
	int memcg_id;
	unsigned long max_seq;
	struct mem_cgroup *memcg;
	struct pglist_data *pgdat;

	unpack_shadow(shadow, &memcg_id, &pgdat, token, workingset);
	memcg = mem_cgroup_from_private_id(memcg_id);
	*lruvec = mem_cgroup_lruvec(memcg, pgdat);

	max_seq = READ_ONCE((*lruvec)->lrugen.max_seq);
	max_seq &= (file ? EVICTION_MASK : EVICTION_MASK_ANON) >> LRU_REFS_WIDTH;

	return abs_diff(max_seq, *token >> LRU_REFS_WIDTH) < MAX_NR_GENS;
}
But the existing code never marks any folios other than those
read from PF as active. Instead, MGLRU unconditionally treats
PF as important and non-PF as unimportant. That is what we are
addressing in this patch. We do not think PF-read folios are
always important.
Maybe we can test a “very recent” case to emulate the classic
LRU workingset_test_recent(). When it returns true, we set the
folio active, regardless of where it originally came from.

lru_gen_test_refault_hot()
{
	/* same unpacking as lru_gen_test_recent() above, then: */
	return abs_diff(max_seq, *token >> LRU_REFS_WIDTH) <= MIN_NR_GENS;
}
diff --git a/mm/workingset.c b/mm/workingset.c
index 07e6836d0502..aaf873101091 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -319,9 +319,16 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 	atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
-	/* see folio_add_lru() where folio_set_active() will be called */
-	if (lru_gen_in_fault())
-		mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+	/*
+	 * If the folio was evicted within the most recent MIN_NR_GENS
+	 * generations, it is considered very hot and should be protected.
+	 */
+	if (lru_gen_test_refault_hot(folio)) {
+		folio_set_active(folio);
+		mod_lruvec_state(lruvec,
+				 WORKINGSET_ACTIVATE_BASE + type,
+				 delta);
+	}
 	if (workingset) {
 		folio_set_workingset(folio);
Maybe this will handle both sys-call folios and PF folios in a more
sensible way, rather than simply treating PF as high priority.
>
> I have not dug deeper into the patch and the heuristic yet. Will do later.
Thanks
Barry
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-26 4:35 ` Barry Song
@ 2026-04-27 14:46 ` Pedro Falcato
2026-04-27 18:22 ` Axel Rasmussen
2026-04-28 4:24 ` Barry Song
0 siblings, 2 replies; 16+ messages in thread
From: Pedro Falcato @ 2026-04-27 14:46 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, Shakeel Butt, wangzicheng,
Suren Baghdasaryan, Lei Liu, Matthew Wilcox, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Will Deacon
On Sun, Apr 26, 2026 at 12:35:46PM +0800, Barry Song wrote:
> On Fri, Apr 24, 2026 at 11:19 PM Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> > > MGLRU gives high priority to folios mapped in page tables.
> > > As a result, folio_set_active() is invoked for all folios
> > > read during page faults. In practice, however, readahead
> > > can bring in many folios that are never accessed via page
> > > tables.
> > >
> > > A previous attempt by Lei Liu proposed introducing a separate
> > > LRU for readahead[1] to make readahead pages easier to reclaim,
> > > but that approach is likely over-engineered.
> >
> > Why does this even need to be kept? I'm not sure it makes sense
> > to even mark readahead folios as referenced.
> >
> > I'd suggest folios should only be marked referenced (or even active, whatever)
> > when they're mapped. Anything else is a bit random and is hoping you are
> > eventually going to map them in the future (which is not true for, for example,
> > anything in an ELF file that may be readahead but not mapped, like debug info,
> > symbol tables, section headers, relocation tables, etc etc)
>
> The patch targets the mmap readahead path rather than the syscall
> readahead path.
Yes.
>
> With lru_gen_in_fault() in place, it’s roughly equivalent to
> the mapped case, since readahead is typically 128 KB while
> fault_around is 64 KB in PF.
I'm not sure I understand. How is 128KB roughly equivalent to 64KB?
That's almost a 2x difference!
And readahead, as of now, will read beyond the VMA's limits (which will not
be mapped).
Really, it's extremely unclear why this adhoc heuristic is here. A folio isn't
active just because you started readahead for it inside a page fault. It's not
even necessarily active just because it's mapped (although that is more
understandable).
And of course, because the whole ordeal is already extremely simple, it turns
out that in_lru_fault doesn't actually mean "we're inside a page fault" but
"we're inside a page fault and this VMA/file don't have any sequential/random
annotations".
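For reference, the gate is roughly vma_has_recency() in
include/linux/mm_inline.h (paraphrased, not verbatim):

	static inline bool vma_has_recency(struct vm_area_struct *vma)
	{
		/* VM_SEQ_READ/VM_RAND_READ come from madvise() annotations */
		if (vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))
			return false;
		/* FMODE_NOREUSE comes from POSIX_FADV_NOREUSE */
		if (vma->vm_file && (vma->vm_file->f_mode & FMODE_NOREUSE))
			return false;
		return true;
	}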
I would really like to understand why this is here (and this is true for the
various odd heuristics mglru uses) if we ever want to have a chance to merge
the two LRUs together.
--
Pedro
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-27 14:46 ` Pedro Falcato
@ 2026-04-27 18:22 ` Axel Rasmussen
2026-04-28 1:35 ` Barry Song (Xiaomi)
2026-04-28 4:24 ` Barry Song
1 sibling, 1 reply; 16+ messages in thread
From: Axel Rasmussen @ 2026-04-27 18:22 UTC (permalink / raw)
To: Pedro Falcato
Cc: Barry Song, akpm, linux-mm, linux-kernel, Lance Yang,
Xueyuan Chen, Kairui Song, Qi Zheng, Shakeel Butt, wangzicheng,
Suren Baghdasaryan, Lei Liu, Matthew Wilcox, Yuanchu Xie, Wei Xu,
Will Deacon
For what it's worth, I agree with this change in principle.
In production we set fault_around_bytes to 4096. That setting is
surprisingly load-bearing (i.e. if I change it, even at a small
experimental scale, I expect workloads to notice and complain). So I
don't think I have an easy way to test this change under production
workloads.
Like Andrew said the workload in the commit message doesn't seem
unreasonable, and the benefit is large.
I guess the workload that would see a downside from this is one that
heavily uses readahead pages but also generates many "one-time-use"
pages instead of maintaining a "fixed" working set. Without activating
the readahead pages, does it lose some of the readahead benefit
because they are pushed out?
About the Sashiko comments, the tier bits being cleared doesn't seem
that problematic to me. However, the WORKINGSET_ACTIVATE counter issue
seems worth fixing.
On Mon, Apr 27, 2026 at 7:46 AM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Sun, Apr 26, 2026 at 12:35:46PM +0800, Barry Song wrote:
> > On Fri, Apr 24, 2026 at 11:19 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > >
> > > On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> > > > MGLRU gives high priority to folios mapped in page tables.
> > > > As a result, folio_set_active() is invoked for all folios
> > > > read during page faults. In practice, however, readahead
> > > > can bring in many folios that are never accessed via page
> > > > tables.
> > > >
> > > > A previous attempt by Lei Liu proposed introducing a separate
> > > > LRU for readahead[1] to make readahead pages easier to reclaim,
> > > > but that approach is likely over-engineered.
> > >
> > > Why does this even need to be kept? I'm not sure it makes sense
> > > to even mark readahead folios as referenced.
> > >
> > > I'd suggest folios should only be marked referenced (or even active, whatever)
> > > when they're mapped. Anything else is a bit random and is hoping you are
> > > eventually going to map them in the future (which is not true for, for example,
> > > anything in an ELF file that may be readahead but not mapped, like debug info,
> > > symbol tables, section headers, relocation tables, etc etc)
> >
> > The patch targets the mmap readahead path rather than the syscall
> > readahead path.
>
> Yes.
>
> >
> > With lru_gen_in_fault() in place, it’s roughly equivalent to
> > the mapped case, since readahead is typically 128 KB while
> > fault_around is 64 KB in PF.
>
> I'm not sure I understand. How is 128KB roughly equivalent to 64KB?
> That's almost a 2x difference!
>
> And readahead, as of now, will read beyond the VMA's limits (which will not
> be mapped).
>
> Really, it's extremely unclear why this adhoc heuristic is here. A folio isn't
> active just because you started readahead for it inside a page fault. It's not
> even necessarily active just because it's mapped (although that is more
> understandable).
>
> And of course, because the whole ordeal is already extremely simple, it turns
> out that in_lru_fault doesn't actually mean "we're inside a page fault" but
> "we're inside a page fault and this VMA/file don't have any sequential/random
> annotations".
>
> I would really like to understand why this is here (and this is true for the
> various odd heuristics mglru uses) if we ever want to have a chance to merge
> the two LRUs together.
This heuristic was added in the original MGLRU implementation, which
is quite a large patch. Had it been added later in response to some
specific issue/use case, we'd have more rationale written down. I
think it really just comes down to this line in the original commit
message: "A page is added to the youngest generation on faulting." I
don't see evidence that a great deal of thought was put into the point
Barry is raising here.
>
> --
> Pedro
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-27 18:22 ` Axel Rasmussen
@ 2026-04-28 1:35 ` Barry Song (Xiaomi)
0 siblings, 0 replies; 16+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-28 1:35 UTC (permalink / raw)
To: axelrasmussen
Cc: akpm, baohua, kasong, lance.yang, linux-kernel, linux-mm,
liulei.rjpt, pfalcato, qi.zheng, shakeel.butt, surenb,
wangzicheng, weixugc, will, willy, xueyuan.chen21, yuanchu
On Tue, Apr 28, 2026 at 2:23 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> For what it's worth, I agree with this change in principle.
>
> In production we set fault_around_bytes to 4096. That setting is
> surprisingly load-bearing (i.e. if I change it, even at a small
> experimental scale, I expect workloads to notice and complain). So I
> don't think I have an easy way to test this change under production
> workloads.
>
> Like Andrew said the workload in the commit message doesn't seem
> unreasonable, and the benefit is large.
>
> I guess the workload that would see a downside from this is one that
> heavily uses readahead pages but also generates many "one-time-use"
> pages instead of maintaining a "fixed" working set. Without activating
> the readahead pages, does it lose some of the readahead benefit
> because they are pushed out?
>
> About the Sashiko comments, the tier bits being cleared doesn't seem
> that problematic to me. However, the WORKINGSET_ACTIVATE counter issue
> seems worth fixing.
>
I am considering something more reasonable than simply
"fixing" the counter. Right now, MGLRU unconditionally
counts PF folios under WORKINGSET_ACTIVATE and neglects
other folios entirely. I am thinking of a better approach
that detects true recency. In the active/inactive case,
this is refault_distance < workingset_size.

In MGLRU, we might detect whether the folio was reclaimed
within the most recent one or two generations. I am
queuing the following for testing:
diff --git a/mm/workingset.c b/mm/workingset.c
index 07e6836d0502..8b552b3d7e37 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -271,10 +271,11 @@ static void *lru_gen_eviction(struct folio *folio)
* Fills in @lruvec, @token, @workingset with the values unpacked from shadow.
*/
static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
- unsigned long *token, bool *workingset, bool file)
+ unsigned long *token, bool *workingset, bool file,
+ unsigned long *gen_distance)
{
int memcg_id;
- unsigned long max_seq;
+ unsigned long max_seq, distance;
struct mem_cgroup *memcg;
struct pglist_data *pgdat;
@@ -286,7 +287,10 @@ static bool lru_gen_test_recent(void *shadow, struct lruvec **lruvec,
max_seq = READ_ONCE((*lruvec)->lrugen.max_seq);
max_seq &= (file ? EVICTION_MASK : EVICTION_MASK_ANON) >> LRU_REFS_WIDTH;
- return abs_diff(max_seq, *token >> LRU_REFS_WIDTH) < MAX_NR_GENS;
+ distance = abs_diff(max_seq, *token >> LRU_REFS_WIDTH);
+ if (gen_distance)
+ *gen_distance = distance;
+ return distance < MAX_NR_GENS;
}
static void lru_gen_refault(struct folio *folio, void *shadow)
@@ -294,7 +298,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
bool recent;
int hist, tier, refs;
bool workingset;
- unsigned long token;
+ unsigned long token, distance;
struct lruvec *lruvec;
struct lru_gen_folio *lrugen;
int type = folio_is_file_lru(folio);
@@ -302,7 +306,8 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
rcu_read_lock();
- recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset, type);
+ recent = lru_gen_test_recent(shadow, &lruvec, &token, &workingset, type,
+ &distance);
if (lruvec != folio_lruvec(folio))
goto unlock;
@@ -319,9 +324,11 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
- /* see folio_add_lru() where folio_set_active() will be called */
- if (lru_gen_in_fault())
+ /* If the folio was reclaimed very recently. */
+ if (distance <= MIN_NR_GENS) {
+ folio_set_active(folio);
mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
+ }
if (workingset) {
folio_set_workingset(folio);
@@ -442,7 +449,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
rcu_read_lock();
recent = lru_gen_test_recent(shadow, &eviction_lruvec, &eviction,
- workingset, file);
+ workingset, file, NULL);
rcu_read_unlock();
return recent;
}
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-27 14:46 ` Pedro Falcato
2026-04-27 18:22 ` Axel Rasmussen
@ 2026-04-28 4:24 ` Barry Song
1 sibling, 0 replies; 16+ messages in thread
From: Barry Song @ 2026-04-28 4:24 UTC (permalink / raw)
To: Pedro Falcato
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, Shakeel Butt, wangzicheng,
Suren Baghdasaryan, Lei Liu, Matthew Wilcox, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Will Deacon
On Mon, Apr 27, 2026 at 10:46 PM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Sun, Apr 26, 2026 at 12:35:46PM +0800, Barry Song wrote:
> > On Fri, Apr 24, 2026 at 11:19 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > >
> > > On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> > > > MGLRU gives high priority to folios mapped in page tables.
> > > > As a result, folio_set_active() is invoked for all folios
> > > > read during page faults. In practice, however, readahead
> > > > can bring in many folios that are never accessed via page
> > > > tables.
> > > >
> > > > A previous attempt by Lei Liu proposed introducing a separate
> > > > LRU for readahead[1] to make readahead pages easier to reclaim,
> > > > but that approach is likely over-engineered.
> > >
> > > Why does this even need to be kept? I'm not sure it makes sense
> > > to even mark readahead folios as referenced.
> > >
> > > I'd suggest folios should only be marked referenced (or even active, whatever)
> > > when they're mapped. Anything else is a bit random and is hoping you are
> > > eventually going to map them in the future (which is not true for, for example,
> > > anything in an ELF file that may be readahead but not mapped, like debug info,
> > > symbol tables, section headers, relocation tables, etc etc)
> >
> > The patch targets the mmap readahead path rather than the syscall
> > readahead path.
>
> Yes.
>
> >
> > With lru_gen_in_fault() in place, it’s roughly equivalent to
> > the mapped case, since readahead is typically 128 KB while
> > fault_around is 64 KB in PF.
>
> I'm not sure I understand. How is 128KB roughly equivalent to 64KB?
> That's almost a 2x difference!
For two reasons. First, we have moved those folios out of
active generations, so they no longer significantly occupy
active space in MGLRU. In any case, they require a real PTE
reference to be promoted in folio_check_references():
	if (lru_gen_enabled() && !lru_gen_switching()) {
		if (!referenced_ptes)
			return FOLIOREF_RECLAIM;
		return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE : FOLIOREF_KEEP;
	}
Second, 64KB is effectively similar to 128KB when two
subpages within each 64KB are used by fault_around /
map_pages.
Of course, we could move the setting into multiple map
functions, but given such a small gap, I do not think it is
worth the added complexity.
>
> And readahead, as of now, will read beyond the VMA's limits (which will not
> be mapped).
Folios beyond the VMA limits will not be promoted in MGLRU.
But yes, I like the idea of not reading ahead beyond VMA limits:
https://lore.kernel.org/linux-mm/20260422005608.342028-1-fmayle@google.com/
>
> Really, it's extremely unclear why this adhoc heuristic is here. A folio isn't
> active just because you started readahead for it inside a page fault. It's not
> even necessarily active just because it's mapped (although that is more
> understandable).
Right. And with fault_around enabled, I don’t think it is
correct to set the folio active when it is mapped.
>
> And of course, because the whole ordeal is already extremely simple, it turns
> out that in_lru_fault doesn't actually mean "we're inside a page fault" but
> "we're inside a page fault and this VMA/file don't have any sequential/random
> annotations".
>
> I would really like to understand why this is here (and this is true for the
> various odd heuristics mglru uses) if we ever want to have a chance to merge
> the two LRUs together.
I’m probably not the right person to answer this question, as I’m
sending this patch to change it :-)
It might make sense to promote PF folios directly if there is no
readahead or fault_around.
Thanks
Barry
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-24 11:53 ` Andrew Morton
@ 2026-04-28 5:40 ` Barry Song
0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2026-04-28 5:40 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, Lance Yang, Xueyuan Chen, Kairui Song,
Qi Zheng, Shakeel Butt, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon,
qiaozhe
On Fri, Apr 24, 2026 at 7:53 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sat, 18 Apr 2026 20:02:33 +0800 "Barry Song (Xiaomi)" <baohua@kernel.org> wrote:
>
> > MGLRU gives high priority to folios mapped in page tables.
> > As a result, folio_set_active() is invoked for all folios
> > read during page faults. In practice, however, readahead
> > can bring in many folios that are never accessed via page
> > tables.
> >
> > A previous attempt by Lei Liu proposed introducing a separate
> > LRU for readahead[1] to make readahead pages easier to reclaim,
> > but that approach is likely over-engineered.
> >
> > Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> > protection"), folios with PG_active were always placed in
> > the youngest generation, leading to over-protection and
> > increased refaults. After that commit, PG_active folios
> > are placed in the second youngest generation, which is
> > still too optimistic given the presence of readahead. In
> > contrast, the classic active/inactive scheme is more
> > conservative.
> >
> > This patch switches to folio_mark_accessed(). If
> > folio_check_references() later detects referenced PTEs,
> > the folio will be promoted based on the reference flag
> > set by folio_mark_accessed().
> >
> > The following uses a simple model to demonstrate why the current
> > code is not ideal. It runs fio-3.42 in a memcg, reading a file in a
> > strided pattern—4KB every 64KB—to simulate prefaulted pages that may
> > not be accessed.
>
> Are you able to suggest any workloads which might regress? And test
> for those?
I don’t have a specific workload, but I can imagine one. For example,
a workload with readahead, where all readahead pages are mapped and
all folios also happen to be hot. In this case, placing them into
active generations at the readahead stage might be beneficial. However,
this patch would not lose them either; it may just take one more
folio_check_references() PTE scan to confirm they are truly
accessed before activating them.
>
> > Without the patch, we observed 12883855 file refaults and a very low
> > bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy
> > hot positions, continuously pushing out the real working set and
> > causing incorrect reclaim. With the patch, we observed 0 refaults
> > and bandwidth increased to 5078 MiB/s.
>
> Wow. And that isn't a crazy workload.
Right.
Readahead is mainly for I/O performance, but it does not necessarily
mean those pages will be accessed or become hot. On a memory-limited
system, promoting readahead folios can be very harmful.
>
> > For those who want to try the model on x86, you will need the
> > following in arch/x86/include/asm/pgtable.h.
> >
> > #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
> > static inline bool arch_wants_old_prefaulted_pte(void)
> > {
> > return true;
> > }
>
> Can you propose a patch? We can at least toss it in there for testing
> while we think about it.
I can include this RFC after this patch in v2. Right now only arm64
supports fault_around with old PTEs; the feature was disabled for all
architectures due to an x86 issue in commit 315d09bf30c2
("Revert 'mm: make faultaround produce old ptes'") and later revived
only for arm64 by Will.

I may reach out to the x86 and RISC-V communities for a revisit. I
actually have Zhe Qiao testing it on newer AMD and Intel platforms.
With sufficient memory and no reclaim pressure, mapping fault_around
PTEs as old still shows a small UnixBench regression, as below.
However, I doubt this matters much in practice: under memory pressure,
reclaiming the correct folios is likely more important than the
hardware access flag cost and should offset the HW AF overhead.
Alternatively, should we add some self-tuning logic that detects
memory pressure and decides whether to map fault_around PTEs as old on
platforms where HW AF handling is costly?
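A very rough sketch of that direction (all names here are invented
for illustration; the hard part is choosing a cheap pressure signal):

	/* hypothetical: set by the reclaim path, decaying over time */
	static bool recent_reclaim;

	static bool want_old_prefaulted_pte(void)
	{
		/* HW AF handling is cheap (e.g. arm64): always use old PTEs */
		if (arch_wants_old_prefaulted_pte())
			return true;
		/* otherwise pay the HW AF cost only when reclaim is active,
		 * i.e. when better reclaim decisions can offset it */
		return READ_ONCE(recent_reclaim);
	}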
Thanks to Zhe Qiao for the data below. Cc’ing Zhe Qiao as well.
AMD Ryzen 9 9950X 16-Core Processor:
========== Kernel A: 7.0.0-14-generic ==========
================================================================
Kernel: 7.0.0-14-generic
================================================================
Test Index (avg) n Iters
----------------------------------------------------------------
dhry2reg 111703.38 4 [0,1,2,3]
whetstone-double 34170.80 4 [0,1,2,3]
execl 15312.25 4 [0,1,2,3]
fstime 59662.58 4 [0,1,2,3]
fsbuffer 72696.42 4 [0,1,2,3]
fsdisk 31885.47 4 [0,1,2,3]
pipe 61432.48 4 [0,1,2,3]
context1 15438.23 4 [0,1,2,3]
spawn 16349.47 4 [0,1,2,3]
syscall 38748.50 4 [0,1,2,3]
shell1 24710.05 4 [0,1,2,3]
shell8 20207.05 4 [0,1,2,3]
----------------------------------------------------------------
Valid tests: 12 / 12
╔══════════════════════════════════════════════════╗
║ System Benchmarks Index Score: 34045.39 ║
╚══════════════════════════════════════════════════╝
========== Kernel B: 7.0.0-custom-test+ ==========
================================================================
Kernel: 7.0.0-custom-test+
================================================================
Test Index (avg) n Iters
----------------------------------------------------------------
dhry2reg 105061.82 5 [0,1,2,3,4]
whetstone-double 33945.03 4 [0,1,2,3]
execl 13992.70 5 [0,1,2,3,4]
fstime 59037.77 4 [0,1,2,3]
fsbuffer 72465.12 4 [0,1,2,3]
fsdisk 28388.05 4 [0,1,2,3]
pipe 64047.38 4 [0,1,2,3]
context1 15322.56 5 [0,1,2,3,4]
spawn 16020.58 4 [0,1,2,3]
syscall 40669.70 4 [0,1,2,3]
shell1 23514.32 4 [0,1,2,3]
shell8 19393.55 4 [0,1,2,3]
----------------------------------------------------------------
Valid tests: 12 / 12
╔══════════════════════════════════════════════════╗
║ System Benchmarks Index Score: 33159.46 ║
╚══════════════════════════════════════════════════╝
================================================================
Per-Test Index Comparison
================================================================
Test Kernel A Kernel B Diff %
----------------------------------------------------------------
dhry2reg 111703.38 105061.82 -5.95% ⬇
whetstone-double 34170.80 33945.03 -0.66%
execl 15312.25 13992.70 -8.62% ⬇
fstime 59662.58 59037.77 -1.05%
fsbuffer 72696.42 72465.12 -0.32%
fsdisk 31885.47 28388.05 -10.97% ⬇
pipe 61432.48 64047.38 +4.26% ⬆
context1 15438.23 15322.56 -0.75%
spawn 16349.47 16020.58 -2.01% ⬇
syscall 38748.50 40669.70 +4.96% ⬆
shell1 24710.05 23514.32 -4.84% ⬇
shell8 20207.05 19393.55 -4.03% ⬇
================================================================
Final System Benchmarks Index Score
================================================================
7.0.0-14-generic 34045.39
7.0.0-custom-test+ 33159.46
--------------------------------------------
B vs A -2.60%
INTEL(R) XEON(R) PLATINUM 8575C:
========== Kernel A: 7.0.0-14-generic ==========
================================================================
Kernel: 7.0.0-14-generic
Prefix: 24-300s- (24 threads)
================================================================
Test AvgScore Baseline Index n
----------------------------------------------------------------
dhry2reg 87226.50 116700.0 7.47 4
whetstone-double 29323.70 55.0 5331.58 4
execl 12638.17 43.0 2939.11 4
fstime 84223.20 3960.0 212.68 4
fsbuffer 56491.22 1655.0 341.34 4
fsdisk 34570.90 5800.0 59.61 4
pipe 45941.97 12440.0 36.93 4
context1 20177.08 4000.0 50.44 4
spawn 9498.88 126.0 753.88 4
syscall 29181.70 15000.0 19.45 4
shell1 21060.58 42.4 4967.12 4
shell8 17669.08 6.0 29448.47 4
----------------------------------------------------------------
Valid tests: 12 / 12
╔════════════════════════════════════════════════╗
║ System Benchmarks Index Score: 335.36 ║
╚════════════════════════════════════════════════╝
========== Kernel B: 7.0.0-custom-test+ ==========
================================================================
Kernel: 7.0.0-custom-test+
Prefix: 24-300s- (24 threads)
================================================================
Test AvgScore Baseline Index n
----------------------------------------------------------------
dhry2reg 87607.90 116700.0 7.51 4
whetstone-double 29092.33 55.0 5289.51 4
execl 12318.00 43.0 2864.65 4
fstime 85738.35 3960.0 216.51 4
fsbuffer 57621.05 1655.0 348.16 4
fsdisk 33608.60 5800.0 57.95 4
pipe 46320.38 12440.0 37.24 4
context1 20450.12 4000.0 51.13 4
spawn 9579.15 126.0 760.25 4
syscall 29563.43 15000.0 19.71 4
shell1 20073.10 42.4 4734.22 4
shell8 16946.05 6.0 28243.42 4
----------------------------------------------------------------
Valid tests: 12 / 12
╔════════════════════════════════════════════════╗
║ System Benchmarks Index Score: 333.55 ║
╚════════════════════════════════════════════════╝
================================================================
Per-Test Index Comparison
================================================================
Test Kernel A Kernel B Diff %
----------------------------------------------------------------
dhry2reg 87226.50 87607.90 +0.44%
whetstone-double 29323.70 29092.33 -0.79%
execl 12638.17 12318.00 -2.53% ⬇
fstime 84223.20 85738.35 +1.80%
fsbuffer 56491.22 57621.05 +2.00% ⬆
fsdisk 34570.90 33608.60 -2.78% ⬇
pipe 45941.97 46320.38 +0.82%
context1 20177.08 20450.12 +1.35%
spawn 9498.88 9579.15 +0.85%
syscall 29181.70 29563.43 +1.31%
shell1 21060.58 20073.10 -4.69% ⬇
shell8 17669.08 16946.05 -4.09% ⬇
================================================================
Final System Benchmarks Index Score
================================================================
7.0.0-14-generic 335.36
7.0.0-custom-test+ 333.55
--------------------------------------------
B vs A -0.54%
>
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
> > /* see the comment in lru_gen_folio_seq() */
> > if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > - folio_set_active(folio);
> > + folio_mark_accessed(folio);
> >
> > folio_batch_add_and_move(folio, lru_add);
> > }
>
> lol, I was expecting something larger ;)
Yep. I usually prefer small patches if they can resolve the problem,
which would make our lives easier :-)
But we will likely need a 2/2 patch for refault activation, as
discussed with Shakeel and Axel.
Thanks
Barry
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
` (3 preceding siblings ...)
2026-04-24 17:03 ` Shakeel Butt
@ 2026-04-28 18:54 ` Kairui Song
2026-04-28 22:26 ` Barry Song
4 siblings, 1 reply; 16+ messages in thread
From: Kairui Song @ 2026-04-28 18:54 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen, Qi Zheng,
Shakeel Butt, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Sat, Apr 18, 2026 at 8:03 PM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
>
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
>
> Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> protection"), folios with PG_active were always placed in
> the youngest generation, leading to over-protection and
> increased refaults. After that commit, PG_active folios
> are placed in the second youngest generation, which is
> still too optimistic given the presence of readahead. In
> contrast, the classic active/inactive scheme is more
> conservative.
>
> This patch switches to folio_mark_accessed(). If
> folio_check_references() later detects referenced PTEs,
> the folio will be promoted based on the reference flag
> set by folio_mark_accessed().
>
> The following uses a simple model to demonstrate why the current
> code is not ideal. It runs fio-3.42 in a memcg, reading a file in a
> strided pattern—4KB every 64KB—to simulate prefaulted pages that may
> not be accessed.
>
> #!/bin/bash
>
> CG_NAME="mglru_verify_test"
> CG_PATH="/sys/fs/cgroup/$CG_NAME"
> MEM_LIMIT="400M"
> HOT_SIZE="600M"
>
> # 1. Environment Setup
> sudo rmdir "$CG_PATH" 2>/dev/null
> sudo mkdir -p "$CG_PATH"
> sudo chown -R $USER:$USER "$CG_PATH"
> echo "$MEM_LIMIT" > "$CG_PATH/memory.max"
>
> # 2. Prepare Data Files
> dd if=/dev/urandom of=hot_data.bin bs=1M count=600 conv=notrunc 2>/dev/null
> sync
> echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
>
> # 3. Start Workload (Working Set)
> (
> echo $BASHPID > "$CG_PATH/cgroup.procs"
> exec ./fio-3.42 --name=hot_ws --rw=read --bs=4K --size=$HOT_SIZE --runtime=600 \
> --zonemode=strided --zonesize=4K --zonerange=64K \
> --time_based --direct=0 --filename=hot_data.bin --ioengine=mmap \
> --fadvise_hint=0 --group_reporting --numjobs=1 > fio.stats
> ) &
> WORKLOAD_PID=$!
>
> # 4. Wait for hot data to warm up
> sleep 30
> BASE_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
>
> # 5. Run the workload for 60 seconds
> sleep 60
>
> # 6. Report refault and IO bandwidth
> FINAL_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
> FINAL_D_FILE=$((FINAL_FILE - BASE_FILE))
> echo "File Refault Delta is $FINAL_D_FILE"
>
> kill $WORKLOAD_PID 2>/dev/null
> sleep 2
> grep -E "READ|WRITE" fio.stats \
> | awk '{for(i=1;i<=NF;i++){if($i ~ /^bw=/) bw=$i; if($i ~ /^io=/) io=$i} print $1, bw, io}'
> rm -f hot_data.bin fio.stats
>
> Without the patch, we observed 12883855 file refaults and a very low
> bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy
> hot positions, continuously pushing out the real working set and
> causing incorrect reclaim. With the patch, we observed 0 refaults
> and bandwidth increased to 5078 MiB/s.
>
> Note that this patch does not benefit any platform other than arm64,
> since commit 315d09bf30c2 ("Revert "mm: make faultaround produce old
> ptes"") reverted the change that made prefaulted PTEs “old” after it
> was identified as the cause of a ~6% UnixBench regression on x86[2].
> x86 reportedly handles the hardware access flag (HW AF) through an
> internal microfault mechanism, which is relatively expensive; that
> cost shows up when prefaulted PTEs are not marked young directly in
> the page fault path, especially when UnixBench runs without any
> memory pressure.
>
> Thanks to Will for raising this for arm64—“Create ‘old’ PTEs for
> faultaround mappings on arm64 with hardware access flag” [3].
> This is also thanks to arm64 microarchitectures, which incur zero cost
> for HW AF handling.
>
> It may be time for x86 and other architectures to revisit
> whether HW AF is truly costly on their platforms, given that
> the original x86 regression was reported 10 years ago.
>
> For those who want to try the model on x86, you will need the
> following in arch/x86/include/asm/pgtable.h.
>
> #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
> static inline bool arch_wants_old_prefaulted_pte(void)
> {
> return true;
> }
>
> Lance and Xueyuan made a huge contribution to this patch
> through testing. They truly worked over weekends and after
> work hours. If this patch deserves any credit, it belongs to
> them.
>
> [1] https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@vivo.com/
> [2] https://lore.kernel.org/lkml/20160606022724.GA26227@yexl-desktop/
> [3] https://lore.kernel.org/lkml/20210120173612.20913-1-will@kernel.org/
> Tested-by: Lance Yang <lance.yang@linux.dev>
> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
> Cc: Kairui Song <kasong@tencent.com>
> Cc: Qi Zheng <qi.zheng@linux.dev>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: wangzicheng <wangzicheng@honor.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Lei Liu <liulei.rjpt@vivo.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Yuanchu Xie <yuanchu@google.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Will Deacon <will@kernel.org>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> The RFC version was:
> [PATCH RFC] mm/mglru: lazily activate folios while folios are really mapped
> https://lore.kernel.org/linux-mm/20260225212642.15219-1-21cnbao@gmail.com/
>
> mm/swap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 5cc44f0de987..e3cf703ccb89 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
> /* see the comment in lru_gen_folio_seq() */
> if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> - folio_set_active(folio);
> + folio_mark_accessed(folio);
Hi Barry,
Sorry, I haven't checked everything yet, but here is a naive idea: what
if we just remove this whole lru_gen_* check chunk, and only keep the
one in workingset.c that does folio_set_active(), so refaulted folios
are promoted like in the classical LRU? I have a series to restore
refault-distance-based activation for MGLRU:

https://lwn.net/Articles/945266/

That series of mine is a bit buggy, but easy to fix; I can resend it.
Some workloads benefit a lot from it, like the one in the cover letter,
and the latest MGLRU still does not perform well with them.
Is there any evidence that folios allocated through faults are always
frequently used? The classical LRU makes exactly the opposite
assumption. Refault-distance-based activation is more battle-tested
(I'm not saying it is absolutely right, though).

Would performance be worse or better if we removed this activation
here and instead activated only through folio_mark_accessed() (not
right now, see below), the page table walk, and refault distance
checking?
Oh, and right now MGLRU performs badly on some workloads because
folio_mark_accessed() never activates a folio, which can also be fixed
with:
https://github.com/ryncsn/linux/blob/b4/mglru-lfu/mm/swap.c#L393 (I
hope to send it out as an RFC if I can finish the benchmarking and
tweaking before LSFMM, but for now I'll just share this link...)
This is the LSF/MM/BPF topic idea I proposed: there, folio_mark_accessed()
calls folio_inc_lru_refs(), which promotes the folio by exactly one
gen once the access count goes beyond LRU_REFS_MAX, making MGLRU
frequency-aware and much more proactive on certain workloads. Testing
with YCSB on a server and using it on my phone both look great.
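To make the promotion rule concrete, here is a minimal userspace toy
model of the idea (not the code from the branch above; the counter
width, the saturation threshold and the toy_* names are all made up
for illustration):

#include <stdio.h>

/* toy stand-ins for LRU_REFS_MAX and the per-folio state */
#define TOY_LRU_REFS_MAX 3

struct toy_folio {
        unsigned int refs;      /* saturating access counter */
        unsigned int gen;       /* generation; higher is younger */
};

/* each access bumps the counter; once it saturates, promote the
 * folio by exactly one gen and restart counting */
static void toy_inc_lru_refs(struct toy_folio *folio, unsigned int max_gen)
{
        if (folio->refs < TOY_LRU_REFS_MAX) {
                folio->refs++;
                return;
        }
        if (folio->gen < max_gen)
                folio->gen++;
        folio->refs = 0;
}

int main(void)
{
        struct toy_folio folio = { 0, 0 };

        for (int i = 1; i <= 8; i++) {
                toy_inc_lru_refs(&folio, 3);
                printf("access %d: refs=%u gen=%u\n", i,
                       folio.refs, folio.gen);
        }
        return 0;
}

So a folio climbs one generation per saturation of the counter,
instead of jumping straight to a hot position on the first access.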
It also removes the forced protection on the eviction path (the
"if (refs + workingset != BIT(LRU_REFS_WIDTH) + 1)" check, which was
added a year or two after the first MGLRU release); that forced
protection causes trouble too, because some cold folios with high
historical access counts get stuck in the LRU for a bit longer.
In general I think it might be a good idea to weaken or maybe just
remove this activation here. I need some time to discuss and verify,
though.
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-28 18:54 ` Kairui Song
@ 2026-04-28 22:26 ` Barry Song
2026-04-28 22:50 ` Barry Song
2026-04-29 3:17 ` Kairui Song
0 siblings, 2 replies; 16+ messages in thread
From: Barry Song @ 2026-04-28 22:26 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen, Qi Zheng,
Shakeel Butt, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Wed, Apr 29, 2026 at 2:55 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Apr 18, 2026 at 8:03 PM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
> >
> > [...]
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 5cc44f0de987..e3cf703ccb89 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
> > /* see the comment in lru_gen_folio_seq() */
> > if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > - folio_set_active(folio);
> > + folio_mark_accessed(folio);
>
> Hi Barry,
>
> Sorry I haven't checked everything yet, but just a naive idea: what if
> we just remove this whole lru_gen_* check chunk here and only keep the
Do you mean the below?
index 5cc44f0de987..499ad49c1b51 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -509,11 +509,6 @@ void folio_add_lru(struct folio *folio)
folio_test_unevictable(folio), folio);
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
- /* see the comment in lru_gen_folio_seq() */
- if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
- lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
- folio_set_active(folio);
-
folio_batch_add_and_move(folio, lru_add);
}
EXPORT_SYMBOL(folio_add_lru);
If so, this essentially resembles the active/inactive LRU. But I
assume Yu Zhao’s earlier point about mmaped folio access still
has some merit? The problem, however, is that readahead and
prefaulting may have made this assumption less accurate, since
being mmaped doesn’t necessarily mean the user actually wants
to access it.
If we drop folio_mark_accessed(), we would need two scans to
confirm that an mmapped folio is active. This seems reasonable to me
on platforms other than arm64, since they always set access
flags for prefaulted folios. The first scan would clear the
prefaulted access flag (which is fake), and the second scan
would confirm that the folio was actually accessed.
But for arm64, it seems we might slightly negatively impact PTE-mapped
folios?
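As a toy userspace illustration of that two-scan behaviour (not
kernel code; a single bool stands in for the PTE young bit, and
scan_and_clear() plays the role folio_referenced() plays for real
PTEs):

#include <stdbool.h>
#include <stdio.h>

struct toy_pte {
        bool young;     /* stand-in for the hardware access flag */
};

/* one aging pass: report whether the young bit was set, then
 * clear it */
static bool scan_and_clear(struct toy_pte *pte)
{
        bool was_young = pte->young;

        pte->young = false;
        return was_young;
}

int main(void)
{
        /* prefaulted but never touched, on an arch that maps
         * prefaulted PTEs young */
        struct toy_pte pte = { .young = true };

        /* first scan only consumes the fake prefault reference */
        printf("scan 1 referenced: %d\n", scan_and_clear(&pte));
        /* second scan shows the folio was never really accessed */
        printf("scan 2 referenced: %d\n", scan_and_clear(&pte));
        return 0;
}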
I mean, I’m at least convinced the following might be correct:
@@ -509,10 +511,14 @@ void folio_add_lru(struct folio *folio)
folio_test_unevictable(folio), folio);
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
- /* see the comment in lru_gen_folio_seq() */
- if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
- lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
- folio_set_active(folio);
+ /*
+ * For architectures without old prefaulted PTEs, we need a first
+ * PTE scan to clear the access flag set during prefault, and a second
+ * scan to confirm the folio is active. For architectures with old
+ * prefaulted PTEs, we can skip the scan that clears the access flag.
+ */
+ if (arch_wants_old_prefaulted_pte())
+ folio_mark_accessed(folio);
folio_batch_add_and_move(folio, lru_add);
}
We could also check whether fault-around is disabled, as below
(fault_around_bytes == PAGE_SIZE means only the faulting page is
mapped), if it's not too ugly :-)
+ if (arch_wants_old_prefaulted_pte() || fault_around_bytes == PAGE_SIZE)
+ folio_mark_accessed(folio);
I suspect the above code also fixes, on x86, the fio workload
performance problem I posted in the changelog. Let me queue it for
testing.
BTW, it seems we can also fix set_pte_range(). The prefault check
feels quite useless to me; just let folio_referenced() do one extra
scan.
diff --git a/mm/memory.c b/mm/memory.c
index ea6568571131..bee58a8fee0a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5593,13 +5593,12 @@ void set_pte_range(struct vm_fault *vmf,
struct folio *folio,
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
- bool prefault = !in_range(vmf->address, addr, nr * PAGE_SIZE);
pte_t entry;
flush_icache_pages(vma, page, nr);
entry = mk_pte(page, vma->vm_page_prot);
- if (prefault && arch_wants_old_prefaulted_pte())
+ if (arch_wants_old_prefaulted_pte())
entry = pte_mkold(entry);
else
entry = pte_sw_mkyoung(entry);
> one in workingset.c that does the folio_set_active, so refaulted
> folios are promoted like in the classical LRU? I have a series to
> restore the refault-distance-based activation for MGLRU:
> https://lwn.net/Articles/945266/
>
> That series of mine is a bit buggy but easy to fix; I can resend it.
> Some workloads benefit a lot from it, like the one in the cover
> letter, and the latest MGLRU is still not performing well with these
> workloads.
>
> Is there any evidence that folios allocated through page faults are
> always frequently used? The classical LRU makes the exact opposite
> assumption. Refault-distance-based activation is more battle-tested
> (I'm not saying it is absolutely right, though).
I agree with this. I’m also queuing some code for testing
to check whether reclamation has occurred very recently.
If so, we set the folios active:
https://lore.kernel.org/linux-mm/20260428013520.47417-1-baohua@kernel.org/
So basically we’re on the same page, just taking slightly different
approaches to checking recency during refault?
>
> Will performance be worse or better if we remove this activation here
> and instead only activate through folio_mark_accessed() (not right
> now, see below), the page table walk, and refault distance checking?
>
As explained above, I think it is probably sensible to remove the
chunk for x86, but not for arm64.
> Oh, and right now MGLRU performs badly on some workloads because
> folio_mark_accessed() never activates a folio, which can also be fixed
> with:
> https://github.com/ryncsn/linux/blob/b4/mglru-lfu/mm/swap.c#L393 (I
> hope to send it out as an RFC if I can finish the benchmarking and
> tweaking before LSFMM, but for now I'll just share this link...)
Thanks, I’d be glad to read it once you post the RFC.
>
> This is the LSF/MM/BPF topic idea I proposed: there, folio_mark_accessed()
> calls folio_inc_lru_refs(), which promotes the folio by exactly one
> gen once the access count goes beyond LRU_REFS_MAX, making MGLRU
> frequency-aware and much more proactive on certain workloads. Testing
> with YCSB on a server and using it on my phone both look great.
>
> It also removes the forced protection on the eviction path (the
> "if (refs + workingset != BIT(LRU_REFS_WIDTH) + 1)" check, which was
> added a year or two after the first MGLRU release); that forced
> protection causes trouble too, because some cold folios with high
> historical access counts get stuck in the LRU for a bit longer.
>
> In general I think it might be a good idea to weaken or maybe just
> remove this activation here. I need some time to discuss and verify,
> though.
Yep, many thanks for your points.
Thanks
Barry
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-28 22:26 ` Barry Song
@ 2026-04-28 22:50 ` Barry Song
2026-04-29 3:17 ` Kairui Song
1 sibling, 0 replies; 16+ messages in thread
From: Barry Song @ 2026-04-28 22:50 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen, Qi Zheng,
Shakeel Butt, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Wed, Apr 29, 2026 at 6:26 AM Barry Song <baohua@kernel.org> wrote:
[...]
>
> BTW, it seems we can also fix set_pte_range(). The prefault check
> feels quite useless to me; just let folio_referenced() do one extra
> scan.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index ea6568571131..bee58a8fee0a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5593,13 +5593,12 @@ void set_pte_range(struct vm_fault *vmf,
> struct folio *folio,
> {
> struct vm_area_struct *vma = vmf->vma;
> bool write = vmf->flags & FAULT_FLAG_WRITE;
> - bool prefault = !in_range(vmf->address, addr, nr * PAGE_SIZE);
> pte_t entry;
>
> flush_icache_pages(vma, page, nr);
> entry = mk_pte(page, vma->vm_page_prot);
>
> - if (prefault && arch_wants_old_prefaulted_pte())
> + if (arch_wants_old_prefaulted_pte())
> entry = pte_mkold(entry);
> else
> entry = pte_sw_mkyoung(entry);
>
>
Please ignore this part.
I had a bit of a brain fart there. If it's not a prefault, it's a
real fault and we are actually accessing the page, so the PTE
shouldn't be considered old.
Thanks
Barry
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-28 22:26 ` Barry Song
2026-04-28 22:50 ` Barry Song
@ 2026-04-29 3:17 ` Kairui Song
1 sibling, 0 replies; 16+ messages in thread
From: Kairui Song @ 2026-04-29 3:17 UTC (permalink / raw)
To: Barry Song
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen, Qi Zheng,
Shakeel Butt, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Wed, Apr 29, 2026 at 6:26 AM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, Apr 29, 2026 at 2:55 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Sat, Apr 18, 2026 at 8:03 PM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
> > >
> > > diff --git a/mm/swap.c b/mm/swap.c
> > > index 5cc44f0de987..e3cf703ccb89 100644
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
> > > /* see the comment in lru_gen_folio_seq() */
> > > if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > > lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > > - folio_set_active(folio);
> > > + folio_mark_accessed(folio);
> >
> > Hi Barry,
> >
> > Sorry I haven't checked everything yet, but just a naive idea: What if
> > we just remove this whole lru_gen_* check chunk here? Only keep the
>
> Do you mean the below?
>
> index 5cc44f0de987..499ad49c1b51 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -509,11 +509,6 @@ void folio_add_lru(struct folio *folio)
> folio_test_unevictable(folio), folio);
> VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> - /* see the comment in lru_gen_folio_seq() */
> - if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> - lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> - folio_set_active(folio);
> -
> folio_batch_add_and_move(folio, lru_add);
> }
> EXPORT_SYMBOL(folio_add_lru);
>
> If so, this essentially resembles the active/inactive LRU. But I
> assume Yu Zhao’s earlier point about mmaped folio access still
> has some merit? The problem, however, is that readahead and
> prefaulting may have made this assumption less accurate, since
> being mmaped doesn’t necessarily mean the user actually wants
> to access it.
Right, my question is that, fundamentally, it depends on which
assumption is true:
- All allocated folios should be treated equally.
vs
- Folios allocated through page faults are more important.
Some databases have the option to use mmap or read; in that case,
they are just two different methods for reading the data.
The classical LRU also used to treat anon folios differently, until
commit ccc5dc67340c and the series following it changed this.
Still, I'm not against the idea that page fault allocations are more
important; I'm just not sure about it. I think taking your approach
first and then improving it further might be a good idea.
> I mean, I’m at least convinced the following might be correct:
>
> @@ -509,10 +511,14 @@ void folio_add_lru(struct folio *folio)
> folio_test_unevictable(folio), folio);
> VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> - /* see the comment in lru_gen_folio_seq() */
> - if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> - lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> - folio_set_active(folio);
> + /*
> + * For architectures without old prefaulted PTEs, we need a first
> + * PTE scan to clear the access flag set during prefault, and a second
> + * scan to confirm the folio is active. For architectures with old
> + * prefaulted PTEs, we can skip the scan that clears the access flag.
> + */
> + if (arch_wants_old_prefaulted_pte())
> + folio_mark_accessed(folio);
>
> folio_batch_add_and_move(folio, lru_add);
> }
>
> We could also check whether fault-around is disabled, as below
> (fault_around_bytes == PAGE_SIZE means only the faulting page is
> mapped), if it's not too ugly :-)
>
> + if (arch_wants_old_prefaulted_pte() || fault_around_bytes == PAGE_SIZE)
> + folio_mark_accessed(folio);
Seems like an interesting idea!
> > Is there any evidence that folios allocated through page faults are
> > always frequently used? The classical LRU makes the exact opposite
> > assumption. Refault-distance-based activation is more battle-tested
> > (I'm not saying it is absolutely right, though).
>
> I agree with this. I’m also queuing some code for testing
> to check whether reclamation has occurred very recently.
> If so, we set the folios active:
> https://lore.kernel.org/linux-mm/20260428013520.47417-1-baohua@kernel.org/
>
> So basically we’re on the same page, just taking slightly different
> approaches to checking recency during refault?
That's a very nice idea too!
Actually, I hesitated a lot between using refault distance and gen
distance, and didn't find a good reason to make a decision on that.
Using refault distance makes the result much more accurate: it works
at page granularity, whereas aging is kind of random, and each gen
can have a vastly different lifetime, so it could be very rough.
Using the gen number makes things much cleaner, avoids the bucket in
workingset.c, and might work better if we have periodic aging
support. It also avoids dealing with the eviction root problem, which
might also help make the swap table more compact :)
https://lore.kernel.org/linux-mm/CAMgjq7Aq5ckraKtNtet8+1ANuqnitFsXxefbDJQZpBxNmaW7Cg@mail.gmail.com/
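For reference, the page-granularity accuracy of the refault distance
comes from a check along these lines (a simplified toy sketch only;
the real mm/workingset.c also packs node/memcg information and a
workingset bit into the shadow entry, and the toy_* names are
illustrative):

#include <stdbool.h>
#include <stdio.h>

/*
 * evicted_age: the LRU "age" (eviction counter) recorded in the
 * shadow entry when the folio was evicted
 * current_age: the same counter at refault time
 */
static bool toy_refault_is_workingset(unsigned long evicted_age,
                                      unsigned long current_age,
                                      unsigned long workingset_size)
{
        unsigned long refault_distance = current_age - evicted_age;

        /* the folio would have stayed resident if the list had been
         * refault_distance slots larger, so activate it if the
         * workingset could have held it */
        return refault_distance <= workingset_size;
}

int main(void)
{
        /* evicted at age 1000, refaults at 1400, workingset holds
         * 500 folios: close enough, activate */
        printf("%d\n", toy_refault_is_workingset(1000, 1400, 500));
        /* refaults at 2000: too far out, stay inactive */
        printf("%d\n", toy_refault_is_workingset(1000, 2000, 500));
        return 0;
}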
Thread overview: 16+ messages
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
2026-04-24 11:53 ` Andrew Morton
2026-04-28 5:40 ` Barry Song
2026-04-24 14:10 ` Andrew Morton
2026-04-24 15:19 ` Pedro Falcato
2026-04-26 4:35 ` Barry Song
2026-04-27 14:46 ` Pedro Falcato
2026-04-27 18:22 ` Axel Rasmussen
2026-04-28 1:35 ` Barry Song (Xiaomi)
2026-04-28 4:24 ` Barry Song
2026-04-24 17:03 ` Shakeel Butt
2026-04-26 21:56 ` Barry Song
2026-04-28 18:54 ` Kairui Song
2026-04-28 22:26 ` Barry Song
2026-04-28 22:50 ` Barry Song
2026-04-29 3:17 ` Kairui Song