* [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
@ 2026-04-18 12:02 Barry Song (Xiaomi)
2026-04-24 11:53 ` Andrew Morton
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-18 12:02 UTC (permalink / raw)
To: akpm, linux-mm
Cc: linux-kernel, Barry Song (Xiaomi), Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, Shakeel Butt, wangzicheng,
Suren Baghdasaryan, Lei Liu, Matthew Wilcox, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Will Deacon
MGLRU gives high priority to folios mapped in page tables.
As a result, folio_set_active() is invoked for all folios
read during page faults. In practice, however, readahead
can bring in many folios that are never accessed via page
tables.
A previous attempt by Lei Liu proposed introducing a separate
LRU for readahead[1] to make readahead pages easier to reclaim,
but that approach is likely over-engineered.
Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
protection"), folios with PG_active were always placed in
the youngest generation, leading to over-protection and
increased refaults. After that commit, PG_active folios
are placed in the second youngest generation, which is
still too optimistic given the presence of readahead. In
contrast, the classic active/inactive scheme is more
conservative.
This patch switches to folio_mark_accessed(). If
folio_check_references() later detects referenced PTEs,
the folio will be promoted based on the reference flag
set by folio_mark_accessed().
The following uses a simple model to demonstrate why the current
code is not ideal. It runs fio-3.42 in a memcg, reading a file in a
strided pattern—4KB every 64KB—to simulate prefaulted pages that may
not be accessed.
#!/bin/bash
CG_NAME="mglru_verify_test"
CG_PATH="/sys/fs/cgroup/$CG_NAME"
MEM_LIMIT="400M"
HOT_SIZE="600M"
# 1. Environment Setup
sudo rmdir "$CG_PATH" 2>/dev/null
sudo mkdir -p "$CG_PATH"
sudo chown -R $USER:$USER "$CG_PATH"
echo "$MEM_LIMIT" > "$CG_PATH/memory.max"
# 2. Prepare Data Files
dd if=/dev/urandom of=hot_data.bin bs=1M count=600 conv=notrunc 2>/dev/null
sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
# 3. Start Workload (Working Set)
(
echo $BASHPID > "$CG_PATH/cgroup.procs"
exec ./fio-3.42 --name=hot_ws --rw=read --bs=4K --size=$HOT_SIZE --runtime=600 \
--zonemode=strided --zonesize=4K --zonerange=64K \
--time_based --direct=0 --filename=hot_data.bin --ioengine=mmap \
--fadvise_hint=0 --group_reporting --numjobs=1 > fio.stats
) &
WORKLOAD_PID=$!
# 4. Wait for the hot data to warm up
sleep 30
BASE_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
# 5. Run the workload for 60 seconds
sleep 60
# 6. Report refault and IO bandwidth
FINAL_FILE=$(grep "workingset_refault_file" "$CG_PATH/memory.stat" | awk '{print $2}')
FINAL_D_FILE=$((FINAL_FILE - BASE_FILE))
echo "File Refault Delta is $FINAL_D_FILE"
kill $WORKLOAD_PID 2>/dev/null
sleep 2
grep -E "READ|WRITE" fio.stats \
| awk '{for(i=1;i<=NF;i++){if($i ~ /^bw=/) bw=$i; if($i ~ /^io=/) io=$i} print $1, bw, io}'
rm -f hot_data.bin fio.stats
Without the patch, we observed 12883855 file refaults and a very low
bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy
hot positions, continuously pushing out the real working set and
causing incorrect reclaim. With the patch, we observed 0 refaults
and bandwidth increased to 5078 MiB/s.
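To spell out the arithmetic behind the model (a back-of-envelope
sketch; the 16:1 ratio comes from the 4KB-per-64KB stride above,
not from the fio output):

```python
MiB = 1024 * 1024
file_size = 600 * MiB          # hot_data.bin
memcg_limit = 400 * MiB        # memory.max
stride, touched = 64 * 1024, 4 * 1024

# Only 4KB of every 64KB is accessed via page tables, so the
# genuinely hot set is 1/16 of the file.
hot_set = file_size * touched // stride
print(hot_set / MiB)           # 37.5 (MiB) -- fits easily in 400M

# The rest is readahead that is never referenced again; protecting
# it as "active" evicts the real working set instead.
cold = (file_size - hot_set) / MiB
print(cold)                    # 562.5
```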
Note that this patch does not benefit any platform other than
arm64, since commit 315d09bf30c2 ("Revert "mm: make faultaround
produce old ptes"") reverted the change that made prefaulted
PTEs “old” after it was identified as the cause of a ~6%
UnixBench regression on x86. The regression was attributed to
x86's internal microfault mechanism for the hardware access
flag (HW AF), which is relatively expensive when prefaulted
PTEs are not marked young directly in the page fault path,
especially when UnixBench runs without any memory pressure[2].
Thanks to Will for raising this for arm64—“Create ‘old’ PTEs for
faultaround mappings on arm64 with hardware access flag” [3].
This is also thanks to arm64 microarchitectures, which incur zero cost
for HW AF handling.
It may be time for x86 and other architectures to revisit
whether HW AF is truly costly on their platforms, given that
the original x86 regression was reported 10 years ago.
For those who want to try the model on x86, you will need the
following in arch/x86/include/asm/pgtable.h.
#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
static inline bool arch_wants_old_prefaulted_pte(void)
{
return true;
}
Lance and Xueyuan made a huge contribution to this patch
through testing. They truly worked over weekends and after
work hours. If this patch deserves any credit, it belongs to
them.
[1] https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@vivo.com/
[2] https://lore.kernel.org/lkml/20160606022724.GA26227@yexl-desktop/
[3] https://lore.kernel.org/lkml/20210120173612.20913-1-will@kernel.org/
Tested-by: Lance Yang <lance.yang@linux.dev>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Qi Zheng <qi.zheng@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: wangzicheng <wangzicheng@honor.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Lei Liu <liulei.rjpt@vivo.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
-rfc was:
[PATCH RFC] mm/mglru: lazily activate folios while folios are really mapped
https://lore.kernel.org/linux-mm/20260225212642.15219-1-21cnbao@gmail.com/
mm/swap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..e3cf703ccb89 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
/* see the comment in lru_gen_folio_seq() */
if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
- folio_set_active(folio);
+ folio_mark_accessed(folio);
folio_batch_add_and_move(folio, lru_add);
}
--
2.39.3 (Apple Git-146)
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
@ 2026-04-24 11:53 ` Andrew Morton
2026-04-24 14:10 ` Andrew Morton
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2026-04-24 11:53 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: linux-mm, linux-kernel, Lance Yang, Xueyuan Chen, Kairui Song,
Qi Zheng, Shakeel Butt, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Sat, 18 Apr 2026 20:02:33 +0800 "Barry Song (Xiaomi)" <baohua@kernel.org> wrote:
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
>
> Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> protection"), folios with PG_active were always placed in
> the youngest generation, leading to over-protection and
> increased refaults. After that commit, PG_active folios
> are placed in the second youngest generation, which is
> still too optimistic given the presence of readahead. In
> contrast, the classic active/inactive scheme is more
> conservative.
>
> This patch switches to folio_mark_accessed(). If
> folio_check_references() later detects referenced PTEs,
> the folio will be promoted based on the reference flag
> set by folio_mark_accessed().
>
> The following uses a simple model to demonstrate why the current
> code is not ideal. It runs fio-3.42 in a memcg, reading a file in a
> strided pattern—4KB every 64KB—to simulate prefaulted pages that may
> not be accessed.
Are you able to suggest any workloads which might regress? And test
for those?
> Without the patch, we observed 12883855 file refaults and a very low
> bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy
> hot positions, continuously pushing out the real working set and
> causing incorrect reclaim. With the patch, we observed 0 refaults
> and bandwidth increased to 5078 MiB/s.
Wow. And that isn't a crazy workload.
> For those who want to try the model on x86, you will need the
> following in arch/x86/include/asm/pgtable.h.
>
> #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
> static inline bool arch_wants_old_prefaulted_pte(void)
> {
> return true;
> }
Can you propose a patch? We can at least toss it in there for testing
while we think about it.
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -512,7 +512,7 @@ void folio_add_lru(struct folio *folio)
> /* see the comment in lru_gen_folio_seq() */
> if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> - folio_set_active(folio);
> + folio_mark_accessed(folio);
>
> folio_batch_add_and_move(folio, lru_add);
> }
lol, I was expecting something larger ;)
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
2026-04-24 11:53 ` Andrew Morton
@ 2026-04-24 14:10 ` Andrew Morton
2026-04-24 15:19 ` Pedro Falcato
2026-04-24 17:03 ` Shakeel Butt
3 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2026-04-24 14:10 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: linux-mm, linux-kernel, Lance Yang, Xueyuan Chen, Kairui Song,
Qi Zheng, Shakeel Butt, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Sat, 18 Apr 2026 20:02:33 +0800 "Barry Song (Xiaomi)" <baohua@kernel.org> wrote:
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
>
> Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> protection"), folios with PG_active were always placed in
> the youngest generation, leading to over-protection and
> increased refaults. After that commit, PG_active folios
> are placed in the second youngest generation, which is
> still too optimistic given the presence of readahead. In
> contrast, the classic active/inactive scheme is more
> conservative.
>
> This patch switches to folio_mark_accessed(). If
> folio_check_references() later detects referenced PTEs,
> the folio will be promoted based on the reference flag
> set by folio_mark_accessed().
Sashiko: https://sashiko.dev/#/patchset/20260418120233.7162-1-baohua@kernel.org
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
2026-04-24 11:53 ` Andrew Morton
2026-04-24 14:10 ` Andrew Morton
@ 2026-04-24 15:19 ` Pedro Falcato
2026-04-26 4:35 ` Barry Song
2026-04-24 17:03 ` Shakeel Butt
3 siblings, 1 reply; 6+ messages in thread
From: Pedro Falcato @ 2026-04-24 15:19 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, Shakeel Butt, wangzicheng,
Suren Baghdasaryan, Lei Liu, Matthew Wilcox, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Will Deacon
On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
Why does this even need to be kept? I'm not sure it makes sense
to even mark readahead folios as referenced.
I'd suggest folios should only be marked referenced (or even active, whatever)
when they're mapped. Anything else is a bit random and is hoping you are
eventually going to map them in the future (which is not true for, for example,
anything in an ELF file that may be readahead but not mapped, like debug info,
symbol tables, section headers, relocation tables, etc etc)
--
Pedro
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-18 12:02 [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF Barry Song (Xiaomi)
` (2 preceding siblings ...)
2026-04-24 15:19 ` Pedro Falcato
@ 2026-04-24 17:03 ` Shakeel Butt
3 siblings, 0 replies; 6+ messages in thread
From: Shakeel Butt @ 2026-04-24 17:03 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, wangzicheng, Suren Baghdasaryan, Lei Liu,
Matthew Wilcox, Axel Rasmussen, Yuanchu Xie, Wei Xu, Will Deacon
On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> MGLRU gives high priority to folios mapped in page tables.
> As a result, folio_set_active() is invoked for all folios
> read during page faults. In practice, however, readahead
> can bring in many folios that are never accessed via page
> tables.
>
> A previous attempt by Lei Liu proposed introducing a separate
> LRU for readahead[1] to make readahead pages easier to reclaim,
> but that approach is likely over-engineered.
>
> Before commit 4d5d14a01e2c ("mm/mglru: rework workingset
> protection"), folios with PG_active were always placed in
> the youngest generation, leading to over-protection and
> increased refaults. After that commit, PG_active folios
> are placed in the second youngest generation, which is
> still too optimistic given the presence of readahead. In
> contrast, the classic active/inactive scheme is more
> conservative.
>
> This patch switches to folio_mark_accessed(). If
> folio_check_references() later detects referenced PTEs,
> the folio will be promoted based on the reference flag
> set by folio_mark_accessed().
>
There is a comment and stat update in lru_gen_refault() that
refer to the active-bit setting this patch removes:
/* see folio_add_lru() where folio_set_active() will be called */
if (lru_gen_in_fault())
mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
Is this still relevant, or does it need changes?
I have not yet dug deeper into the patch and the heuristic. Will
do later.
* Re: [PATCH] mm/mglru: Use folio_mark_accessed to replace folio_set_active in PF
2026-04-24 15:19 ` Pedro Falcato
@ 2026-04-26 4:35 ` Barry Song
0 siblings, 0 replies; 6+ messages in thread
From: Barry Song @ 2026-04-26 4:35 UTC (permalink / raw)
To: Pedro Falcato
Cc: akpm, linux-mm, linux-kernel, Lance Yang, Xueyuan Chen,
Kairui Song, Qi Zheng, Shakeel Butt, wangzicheng,
Suren Baghdasaryan, Lei Liu, Matthew Wilcox, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Will Deacon
On Fri, Apr 24, 2026 at 11:19 PM Pedro Falcato <pfalcato@suse.de> wrote:
>
> On Sat, Apr 18, 2026 at 08:02:33PM +0800, Barry Song (Xiaomi) wrote:
> > MGLRU gives high priority to folios mapped in page tables.
> > As a result, folio_set_active() is invoked for all folios
> > read during page faults. In practice, however, readahead
> > can bring in many folios that are never accessed via page
> > tables.
> >
> > A previous attempt by Lei Liu proposed introducing a separate
> > LRU for readahead[1] to make readahead pages easier to reclaim,
> > but that approach is likely over-engineered.
>
> Why does this even need to be kept? I'm not sure it makes sense
> to even mark readahead folios as referenced.
>
> I'd suggest folios should only be marked referenced (or even active, whatever)
> when they're mapped. Anything else is a bit random and is hoping you are
> eventually going to map them in the future (which is not true for, for example,
> anything in an ELF file that may be readahead but not mapped, like debug info,
> symbol tables, section headers, relocation tables, etc etc)
The patch targets the mmap readahead path rather than the syscall
readahead path.
With the lru_gen_in_fault() check in place, it’s roughly
equivalent to the mapped case, since readahead is typically
128 KB while fault_around is 64 KB in the page fault path.
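As a rough back-of-envelope, using the default window sizes
cited above:

```shell
# Typical mmap readahead vs fault-around windows (KB).
readahead_kb=128
fault_around_kb=64
# Fraction of each readahead window that fault-around maps
# immediately at fault time:
echo "$((fault_around_kb * 100 / readahead_kb))%"   # 50%
```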
Thanks
Barry