[REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12

* [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
@ 2026-05-18 13:01 Chengfeng Lin
  2026-05-18 13:10 ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Chengfeng Lin @ 2026-05-18 13:01 UTC
  To: Andrew Morton, linux-mm
  Cc: Liam R. Howlett, Lorenzo Stoakes, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Johannes Weiner, Michal Hocko,
	Qi Zheng, Shakeel Butt, Chris Li, Kairui Song, linux-kernel,
	regressions

Hi,

I would like to report a userspace-visible mprotect() performance
regression in a shared dirty PTE workload.

The workload is intentionally narrow:

  - anonymous shared 64 MiB mapping
  - prefault before protection changes
  - repeatedly toggle the whole range with mprotect(PROT_READ)
  - restore with mprotect(PROT_READ | PROT_WRITE)
  - write-touch after the protection cycle
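
As a rough illustration of the steps above, the toggle loop could be
sketched as follows. This is a minimal sketch, not the generated
mprotect_paths_storm.c linked below: the function name
toggle_ns_per_page(), its parameters, and the choice to include the
write-touch in the timed region are all assumptions made for
illustration.

```c
/* Hypothetical sketch of the toggle loop; not the linked workload source. */
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Write one byte per page so that every page is present and dirty. */
static void touch_pages(volatile unsigned char *buf, size_t len, size_t page)
{
	for (size_t off = 0; off < len; off += page)
		buf[off]++;
}

/*
 * Returns the average nanoseconds per page for one
 * protect/restore/write-touch cycle, or a negative value on failure.
 * Assumption: the write-touch is part of the timed cycle.
 */
double toggle_ns_per_page(size_t len, int cycles)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	uint64_t start, elapsed;

	assert(page > 0);
	if (len < page || cycles <= 0)
		return -1.0;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	touch_pages(buf, len, page);	/* prefault before protection changes */

	start = now_ns();
	for (int i = 0; i < cycles; i++) {
		if (mprotect(buf, len, PROT_READ) != 0 ||
		    mprotect(buf, len, PROT_READ | PROT_WRITE) != 0) {
			munmap(buf, len);
			return -1.0;
		}
		touch_pages(buf, len, page);	/* write-touch after the cycle */
	}
	elapsed = now_ns() - start;

	munmap(buf, len);
	return (double)elapsed / ((double)cycles * (double)(len / page));
}
```

Calling toggle_ns_per_page(64UL << 20, 100) would exercise a 64 MiB
range; the iteration count of 100 is arbitrary.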

This is not meant as a generic mprotect() regression report. In
particular, I am not claiming that the anon/THP mprotect paths regress.
The current signal is scoped to the shared-dirty full-range PTE toggle
path above.

The current public evidence bundle is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle

The generated workload source used for auditing the workload semantics is
here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/e13469b/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c

The formal experiment profile is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle/experiments

The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
configuration, using QEMU direct boot. The formal performance runs were
clean timing runs with coverage disabled. Coverage was collected
separately and is not used for the timing numbers below.

Lab environment:

  host label: lcf
  host kernel: Linux 6.14.0-37-generic x86_64
  QEMU: qemu-system-x86_64 8.2.2
  container/cgroup CPU set: 0,2,4,6,8,10,12,14
  container/cgroup memory limit: 16106127360 bytes
  guest memory: QEMU_MEM_MB=14336
  guest CPUs: QEMU_SMP=1/2/4
  repetitions: 9
  version order: interleaved
  performance coverage_enabled: false

Primary result, cycle_ns_per_page, lower is better:

  CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12   reliability
    1      346.8     578.1        40.0%             1.67x      reliable
    2      394.7     641.7        38.5%             1.63x      robust-only
    4      381.1     624.8        39.0%             1.64x      partial, same direction

The strongest current result is the 1-CPU lab formal result. The 2-CPU
case points in the same direction but is classified only as robust-only
by the framework. The 4-CPU case also points in the same direction but is
partial because one of the 9 QEMU runs failed; the summary still has
8 successful runs for that CPU count.
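
For readers checking the table, the two derived columns appear to follow
from the raw cycle_ns_per_page values as sketched below. The formula for
"old-lower-vs-new" is inferred from the numbers in the table rather than
stated in the report, and both function names are hypothetical.

```c
#include <assert.h>

/* "v6.19/v6.12" column: how many times longer the newer kernel takes. */
double slowdown_ratio(double old_ns, double new_ns)
{
	assert(old_ns > 0.0);
	return new_ns / old_ns;
}

/*
 * "old-lower-vs-new" column (inferred): how much lower the older
 * kernel's value is, as a percentage of the newer kernel's value.
 */
double old_lower_pct(double old_ns, double new_ns)
{
	assert(new_ns > 0.0);
	return (new_ns - old_ns) / new_ns * 100.0;
}
```

With the 1-CPU row, slowdown_ratio(346.8, 578.1) is about 1.67 and
old_lower_pct(346.8, 578.1) is about 40.0.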

The current mechanism hypothesis is local to the shared-dirty PTE path.
In v6.19, the measured hot path goes through the change_pte_range()
batching machinery:

  change_pte_range()
    -> mprotect_folio_pte_batch()
    -> modify_prot_start_ptes()
    -> set_write_prot_commit_flush_ptes()
    -> prot_commit_flush_ptes()

For this shared-dirty workload, follow-up batch-probe attribution showed
nr_ptes=1 in the measured path. The hypothesis is that the extra folio
lookup, batch-size query, helper dispatch, and commit machinery are paid
per 4 KiB PTE without effective batch-size amortization in this workload.
This is a mechanism interpretation, not a completed culprit-commit bisect.
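
The amortization argument can be made concrete with a toy cost model.
This is purely illustrative: the split into a fixed per-batch overhead
and a per-PTE cost is an assumption, the function name is hypothetical,
and no measured kernel costs are implied.

```c
#include <assert.h>

/*
 * Toy model: each batch pays a fixed overhead (folio lookup, batch-size
 * query, helper dispatch, commit), spread over the nr_ptes it covers,
 * plus a cost that is paid for every PTE regardless of batching.
 */
double per_page_cost(double fixed_per_batch, double per_pte,
		     unsigned int nr_ptes)
{
	assert(nr_ptes > 0);
	return fixed_per_batch / nr_ptes + per_pte;
}
```

Under this model, nr_ptes=1 means the whole fixed overhead lands on every
4 KiB page, while larger batches would dilute it.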

I have not bisected the exact culprit commit yet. Separate release-level
sanity checks showed v6.18.19 already in the slow range, so the current
best reporting range is:

#regzbot introduced: v6.12..v6.18

Please let me know if a standalone reproducer, a narrower bisect, or
additional raw logs would be more useful.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12
  2026-05-18 13:01 [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12 Chengfeng Lin
@ 2026-05-18 13:10 ` Chengfeng Lin
  2026-05-18 15:36 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " David Hildenbrand (Arm)
  2026-05-18 15:43 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " Lorenzo Stoakes
  2 siblings, 0 replies; 11+ messages in thread
From: Chengfeng Lin @ 2026-05-18 13:10 UTC
  To: Andrew Morton, linux-mm
  Cc: Liam R. Howlett, Lorenzo Stoakes, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Johannes Weiner, Michal Hocko,
	Qi Zheng, Shakeel Butt, Chris Li, Kairui Song, linux-kernel,
	regressions

Sorry, I sent the previous report with the wrong subject line.

The intended subject is:

  [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12

The body and evidence links in that message are for the mprotect
shared-dirty PTE toggle regression. Please treat it as the mprotect
report, not as a MADV_PAGEOUT report.

#regzbot title: mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12

Sorry for the noise.


* Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
  2026-05-18 13:01 [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12 Chengfeng Lin
  2026-05-18 13:10 ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
@ 2026-05-18 15:36 ` David Hildenbrand (Arm)
  2026-05-18 17:01   ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
  2026-05-18 15:43 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " Lorenzo Stoakes
  2 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-18 15:36 UTC
  To: Chengfeng Lin, Andrew Morton, linux-mm
  Cc: Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	Johannes Weiner, Michal Hocko, Qi Zheng, Shakeel Butt, Chris Li,
	Kairui Song, linux-kernel, regressions, Pedro Falcato

On 5/18/26 15:01, Chengfeng Lin wrote:
> Hi,
> 
> I would like to report a userspace-visible mprotect() performance
> regression in a shared dirty PTE workload.
> 
> The workload is intentionally narrow:
> 
>   - anonymous shared 64 MiB mapping
>   - prefault before protection changes
>   - repeatedly toggle the whole range with mprotect(PROT_READ)
>   - restore with mprotect(PROT_READ | PROT_WRITE)
>   - write-touch after the protection cycle
> 
> This is not meant as a generic mprotect() regression report. In
> particular, I am not claiming that the anon/THP mprotect paths regress.
> The current signal is scoped to the shared-dirty full-range PTE toggle
> path above.
> 
> The current public evidence bundle is here:
> 
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle
> 
> The generated workload source used for auditing the workload semantics is
> here:
> 
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/e13469b/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c
> 
> The formal experiment profile is here:
> 
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle/experiments
> 
> The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> configuration, using QEMU direct boot. The formal performance runs were
> clean timing runs with coverage disabled. Coverage was collected
> separately and is not used for the timing numbers below.
> 
> Lab environment:
> 
>   host label: lcf
>   host kernel: Linux 6.14.0-37-generic x86_64
>   QEMU: qemu-system-x86_64 8.2.2
>   container/cgroup CPU set: 0,2,4,6,8,10,12,14
>   container/cgroup memory limit: 16106127360 bytes
>   guest memory: QEMU_MEM_MB=14336
>   guest CPUs: QEMU_SMP=1/2/4
>   repetitions: 9
>   version order: interleaved
>   performance coverage_enabled: false
> 
> Primary result, cycle_ns_per_page, lower is better:
> 
>   CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12   reliability
>     1      346.8     578.1        40.0%             1.67x      reliable
>     2      394.7     641.7        38.5%             1.63x      robust-only
>     4      381.1     624.8        39.0%             1.64x      partial, same direction
> 
> The strongest current result is the 1CPU lab formal result. The 2CPU case
> is same-direction but robust-only in the framework classification. The
> 4CPU case is same-direction but partial because one QEMU run failed; the
> summary still has 8 successful runs for that CPU count.
> 
> The current mechanism hypothesis is local to the shared-dirty PTE path.
> In v6.19, the measured hot path goes through the change_pte_range()
> batching machinery:
> 
>   change_pte_range()
>     -> mprotect_folio_pte_batch()
>     -> modify_prot_start_ptes()
>     -> set_write_prot_commit_flush_ptes()
>     -> prot_commit_flush_ptes()
> 
> For this shared-dirty workload, follow-up batch-probe attribution showed
> nr_ptes=1 in the measured path. The hypothesis is that the extra folio
> lookup, batch-size query, helper dispatch, and commit machinery are paid
> per 4 KiB PTE without effective batch-size amortization in this workload.
> This is mechanism interpretation, not a completed culprit-commit bisect.
> 
> I have not bisected the exact culprit commit yet. Separate release-level
> sanity checks showed v6.18.19 already in the slow range, so the current
> best reporting range is:
> 
> #regzbot introduced: v6.12..v6.18
> 
> Please let me know if a standalone reproducer, a narrower bisect, or
> additional raw logs would be more useful.

Pedro recently optimized this:

https://lore.kernel.org/all/20260402141628.3367596-1-pfalcato@suse.de/

Maybe that fixes most of the regression for you?

-- 
Cheers,

David

* Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12
  2026-05-18 15:36 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " David Hildenbrand (Arm)
@ 2026-05-18 17:01   ` Chengfeng Lin
  2026-05-22  9:03     ` Chengfeng Lin
  0 siblings, 1 reply; 11+ messages in thread
From: Chengfeng Lin @ 2026-05-18 17:01 UTC
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, linux-mm, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Johannes Weiner, Michal Hocko, Qi Zheng, Shakeel Butt,
	Chris Li, Kairui Song, linux-kernel, regressions, Pedro Falcato,
	ljs


Hi David,

Thanks for the pointer. I have not tested Pedro's mprotect
micro-optimization series yet.

That series does look directly relevant to this report, especially the
change_pte_range() / small-folio path. My current data only compares
v6.12.77 with v6.19.9, so I do not yet know whether that newer series
fixes or reduces the slowdown.

I will rerun the shared-dirty toggle workload on a kernel with that series
applied, or on the first branch/tag where it is included, and report back
with the same 1/2/4 CPU matrix if possible.

If it removes most of the delta, I will follow up and mark the report
accordingly.

Thanks,
Chengfeng

* Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12
  2026-05-18 17:01   ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
@ 2026-05-22  9:03     ` Chengfeng Lin
  2026-05-25 10:29       ` Pedro Falcato
  0 siblings, 1 reply; 11+ messages in thread
From: Chengfeng Lin @ 2026-05-22  9:03 UTC
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, linux-mm, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Johannes Weiner, Michal Hocko, Qi Zheng, Shakeel Butt,
	Chris Li, Kairui Song, linux-kernel, regressions, Pedro Falcato,
	ljs

Hi David,

Thanks for the pointer. I tested the current akpm/mm mm-unstable branch at
444fc9435e57, which contains Pedro's v3 two-patch mprotect series: the
softleaf refactor and the relevant small-folio / nr_ptes == 1 changes.

I first ran a local sanity check, and then reran the same shared-dirty
full-range toggle workload on the lab machine:

  kernels: v6.12.77, v6.19.9, akpm/mm mm-unstable 444fc9435e57
  QEMU: direct boot
  lab guest CPUs: QEMU_SMP=1/2/4/8/16
  lab guest memory: 14336 MiB for 1/2/4 CPU, 16384 MiB for 8 CPU,
                    32768 MiB for 16 CPU
  repetitions: 9
  order: interleaved
  coverage: disabled

The primary metric is cycle_ns_per_page, lower is better. Here "cycle" means
one workload iteration, not CPU cycles:

  CPU   v6.12.77   v6.19.9   mm-unstable   mm-unstable vs v6.19   gap closed
    1      336.1     532.0       497.0          6.6% faster          17.9%
    2      369.2     581.9       503.3         13.5% faster          36.9%
    4      355.7     587.2       524.2         10.7% faster          27.2%
    8      369.7     583.6       534.2          8.5% faster          23.1%
   16      374.8     607.1       547.8          9.8% faster          25.5%

The 1/2/4/8 CPU rows completed 9/9 runs for all three kernels. In the
16 CPU row, v6.12.77 had one QEMU failure, so I would treat that row only
as a supporting trend.

So yes, Pedro's small-folio work does reduce this synthetic shared-dirty
signal in my setup. It does not seem to remove most of the gap to v6.12.77:
looking at cycle_ns_per_page, it closes roughly 18-37% of the v6.12 ->
v6.19 gap in the clean 1/2/4/8 CPU lab rows.
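
The last two columns of the table are consistent with the arithmetic
sketched below. The formulas are inferred from the table values, and the
function names are hypothetical.

```c
#include <assert.h>

/* "mm-unstable vs v6.19": relative improvement over v6.19.9, in percent. */
double faster_pct(double v619_ns, double mm_ns)
{
	assert(v619_ns > 0.0);
	return (v619_ns - mm_ns) / v619_ns * 100.0;
}

/*
 * "gap closed": share of the v6.12 -> v6.19 slowdown that mm-unstable
 * recovers, in percent. 0 means no improvement over v6.19.9, 100 means
 * parity with v6.12.77.
 */
double gap_closed_pct(double v612_ns, double v619_ns, double mm_ns)
{
	assert(v619_ns > v612_ns);
	return (v619_ns - mm_ns) / (v619_ns - v612_ns) * 100.0;
}
```

For the 1-CPU row, faster_pct(532.0, 497.0) is about 6.6 and
gap_closed_pct(336.1, 532.0, 497.0) is about 17.9.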

I also ran a separate state-shape audit, because the MADV_PAGEOUT follow-up
showed that a timing delta can be misleading if the compared kernels are not
actually operating on the same page state. For this mprotect workload, the
successful runs across v6.12.77, v6.19.9, and mm-unstable all used the same
4 KiB shared-dirty PTE mapping shape:

  expected_match_ratio = 100
  unexpected_results = 0
  final_vmas_avg = 1
  present pages before/after protect = 16384 / 16384
  AnonHugePages = 0
  KernelPageSize/MMUPageSize = 4 KiB / 4 KiB
  THPeligible = 0

The state audit used the same 1/2/4/8/16 CPU and memory matrix, with 5 runs
per kernel. The 1/2/4/8 CPU rows completed 5/5 for all three kernels; the
16 CPU row had one v6.19.9 QEMU failure, but the successful v6.19.9 runs had
the same state-shape values.
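
Fields such as AnonHugePages, KernelPageSize, MMUPageSize, and
THPeligible are reported per mapping in /proc/<pid>/smaps. A minimal
sketch of reading them for one mapping is shown below, assuming the
audit reads smaps; the thread does not say how the audit collected
these values, and the helper name smaps_field() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up a numeric field (for example "AnonHugePages" or "THPeligible")
 * in the /proc/self/smaps entry whose address range contains addr.
 * Size fields are reported in kB. Returns 0 and stores the value in
 * *out on success, -1 if the file, mapping, or field is not found.
 */
int smaps_field(const void *addr, const char *field, long *out)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[512];
	unsigned long start, end, target = (unsigned long)addr;
	size_t flen = strlen(field);
	int in_vma = 0, ret = -1;

	assert(out != NULL);
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* A "start-end perms ..." header line opens a new mapping. */
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			in_vma = (target >= start && target < end);
			continue;
		}
		if (in_vma && strncmp(line, field, flen) == 0 &&
		    line[flen] == ':') {
			*out = strtol(line + flen + 1, NULL, 10);
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

Passing the base address of the 64 MiB mapping together with
"AnonHugePages" or "THPeligible" would return the corresponding value
for that mapping.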

I put the follow-up summaries here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/mprotect-shared-dirty-toggle/mm-unstable-lab-sanity

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/mprotect-shared-dirty-toggle/state-audit-lab

Given Lorenzo's question and the synthetic nature of this workload, I will
avoid treating this as a strong regression claim unless I can provide a
standalone reproducer and/or a narrower bisect. If this remaining signal is
still useful to characterize, I can prepare a smaller standalone reproducer
or try to bisect the remaining gap.

Thanks,
Chengfeng


* Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12
  2026-05-22  9:03     ` Chengfeng Lin
@ 2026-05-25 10:29       ` Pedro Falcato
  2026-05-26  7:57         ` Chengfeng Lin
  0 siblings, 1 reply; 11+ messages in thread
From: Pedro Falcato @ 2026-05-25 10:29 UTC
  To: Chengfeng Lin
  Cc: David Hildenbrand (Arm), Andrew Morton, linux-mm, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Johannes Weiner, Michal Hocko,
	Qi Zheng, Shakeel Butt, Chris Li, Kairui Song, linux-kernel,
	regressions, ljs

On Fri, May 22, 2026 at 05:03:44PM +0800, Chengfeng Lin wrote:
> Hi David,
> 
> Thanks for the pointer. I tested the current akpm/mm mm-unstable branch at
> 444fc9435e57, which contains Pedro's v3 two-patch mprotect series: the
> softleaf refactor and the relevant small-folio / nr_ptes == 1 changes.
> 
> I first ran a local sanity check, and then reran the same shared-dirty
> full-range toggle workload on the lab machine:
> 
>   kernels: v6.12.77, v6.19.9, akpm/mm mm-unstable 444fc9435e57
>   QEMU: direct boot
>   lab guest CPUs: QEMU_SMP=1/2/4/8/16
>   lab guest memory: 14336 MiB for 1/2/4 CPU, 16384 MiB for 8 CPU,
>                     32768 MiB for 16 CPU
>   repetitions: 9
>   order: interleaved
>   coverage: disabled
> 
> The primary metric is cycle_ns_per_page, lower is better. Here "cycle" means
> one workload iteration, not CPU cycles:
> 
>   CPU   v6.12.77   v6.19.9   mm-unstable   mm-unstable vs v6.19   gap closed
>     1      336.1     532.0       497.0          6.6% faster          17.9%
>     2      369.2     581.9       503.3         13.5% faster          36.9%
>     4      355.7     587.2       524.2         10.7% faster          27.2%
>     8      369.7     583.6       534.2          8.5% faster          23.1%
>    16      374.8     607.1       547.8          9.8% faster          25.5%
> 
> The 1/2/4/8 CPU rows completed 9/9 runs for all three kernels. In the
> 16 CPU row, v6.12.77 had one QEMU failure, so I would treat that row only
> as a supporting trend.
> 
> So yes, Pedro's small-folio work does reduce this synthetic shared-dirty
> signal in my setup. It does not seem to remove most of the gap to v6.12.77:
> looking at cycle_ns_per_page, it closes roughly 18-37% of the v6.12 ->
> v6.19 gap in the clean 1/2/4/8 CPU lab rows.
> 
> I also ran a separate state-shape audit, because the MADV_PAGEOUT follow-up
> showed that a timing delta can be misleading if the compared kernels are not
> actually operating on the same page state. For this mprotect workload, the
> successful runs across v6.12.77, v6.19.9, and mm-unstable all used the same
> 4 KiB shared-dirty PTE mapping shape:
> 
>   expected_match_ratio = 100
>   unexpected_results = 0
>   final_vmas_avg = 1
>   present pages before/after protect = 16384 / 16384
>   AnonHugePages = 0
>   KernelPageSize/MMUPageSize = 4 KiB / 4 KiB
>   THPeligible = 0
> 
> The state audit used the same 1/2/4/8/16 CPU and memory matrix, with 5 runs
> per kernel. The 1/2/4/8 CPU rows completed 5/5 for all three kernels; the
> 16 CPU row had one v6.19.9 QEMU failure, but the successful v6.19.9 runs had
> the same state-shape values.
> 
> I put the follow-up summaries here:
> 
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/mprotect-shared-dirty-toggle/mm-unstable-lab-sanity
> 
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/0c0e2d9/mprotect-shared-dirty-toggle/state-audit-lab
> 
> Given Lorenzo's question and the synthetic nature of this workload, I will
> avoid treating this as a strong regression claim unless I can provide a
> standalone reproducer and/or a narrower bisect. If this remaining signal is
> still useful to characterize, I can prepare a smaller standalone reproducer
> or try to bisect the remaining gap.

Yes, if you could give me more pointers (and a simpler repro) I would be happy
to take a quick look. Otherwise there's not much I can do here :)

-- 
Pedro

* Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12
  2026-05-25 10:29       ` Pedro Falcato
@ 2026-05-26  7:57         ` Chengfeng Lin
  0 siblings, 0 replies; 11+ messages in thread
From: Chengfeng Lin @ 2026-05-26  7:57 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: David Hildenbrand (Arm), Andrew Morton, linux-mm, Liam R. Howlett,
	Vlastimil Babka, Jann Horn, Johannes Weiner, Michal Hocko,
	Qi Zheng, Shakeel Butt, Chris Li, Kairui Song, linux-kernel,
	regressions, ljs

Hi Pedro,

Thanks. I prepared a smaller standalone reproducer for the shared-dirty case:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/aec9695/mprotect-shared-dirty-toggle/reproducer

It is distilled from the `shared_dirty_full_toggle_64m` scenario in the
generated workload I used for the earlier QEMU/lab runs. It keeps only the
core operation:

  - MAP_SHARED | MAP_ANONYMOUS mapping
  - write-prefault the whole range
  - full-range mprotect(PROT_READ)
  - restore with mprotect(PROT_READ | PROT_WRITE)
  - write-touch after each protection cycle

The core loop is essentially:

  p = mmap(..., MAP_SHARED | MAP_ANONYMOUS, ...);
  write_touch(p, len);
  for (...) {
          mprotect(p, len, PROT_READ);
          mprotect(p, len, PROT_READ | PROT_WRITE);
          write_touch(p, len);
  }

Build/run:

  gcc -O2 -Wall -Wextra -o mprotect_shared_dirty_reproducer \
    mprotect_shared_dirty_reproducer.c

  ./mprotect_shared_dirty_reproducer \
    shared_dirty_full_toggle_64m 5 \
    --mapping-mb 64 \
    --iterations 200 \
    --warmup 5

The main metric is `iteration_ns_per_page`, lower is better. It is
wall-clock nanoseconds per base page for one full
protect/restore/post-touch iteration. The program also prints
`protect_ns_per_page` and `restore_ns_per_page` separately.

I rebuilt the QEMU direct-boot kernels with an SMP-capable config and reran the
standalone reproducer on the lab machine:

  kernels: v6.12.77, v6.19.9, akpm/mm mm-unstable 444fc9435e57
  kernel config additions: CONFIG_SMP=y, CONFIG_NR_CPUS=16,
                           CONFIG_ACPI=y, CONFIG_ACPI_PROCESSOR=y
  QEMU_SMP: 1/2/4/8/16
  guest memory: 14336 MiB for 1/2/4 CPU, 16384 MiB for 8 CPU,
                32768 MiB for 16 CPU
  repetitions: 5
  order: interleaved
  coverage: disabled
  extra cmdline: tsc=unstable clocksource=refined-jiffies

I also checked the serial logs. The 1/2/4/8 CPU rows each had 15 serial logs
checked. The 16 CPU full-matrix row had one v6.12.77 QEMU failure, but a
targeted 16 CPU rerun completed cleanly with 15/15 serial logs checked. All
checked logs matched the requested guest CPU count, and none had `noapic` in
the guest cmdline.

`iteration_ns_per_page` results:

  CPU   v6.12.77   v6.19.9   mm-unstable   mm-unstable vs v6.19   gap closed
    1      296.4     548.6       498.6          9.1% faster          19.8%
    2      327.2     564.8       488.4         13.5% faster          32.2%
    4      319.8     578.2       505.8         12.5% faster          28.0%
    8      336.4     570.4       508.2         10.9% faster          26.6%
   16      380.0     624.0       553.8         11.3% faster          28.8%

The 1/2/4/8 CPU rows are clean screening rows. I would treat 16 CPU as
extended/supporting only because it uses the larger 32 GiB guest-memory setting;
the earlier v6.12.77 QEMU failure appears transient after the clean rerun.

So the standalone reproducer keeps the same broad direction: v6.19.9 is slower
than v6.12.77, and current mm-unstable improves the result but does not return
it to the v6.12.77 level in this setup. The per-phase metrics still put most of
the gap in the protect/restore mprotect phases rather than the post-touch phase.

The lab validation summary is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/aec9695/mprotect-shared-dirty-toggle/reproducer-validation

One caveat: the standalone run does not collect the same detailed
smaps/pagemap state-shape audit as my separate state-audit run, so I would
treat this as a reproducer/timing screening check. The earlier state audit for
the same workload shape is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/aec9695/mprotect-shared-dirty-toggle/state-audit-lab

For reference, the original generated workload source and formal profile are:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/aec9695/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/aec9695/mprotect-shared-dirty-toggle/experiments/mprotect_shared_dirty_formal_refresh.toml

I can try a narrower bisect next if this reproducer shape is useful.

Thanks,
Chengfeng



* Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
  2026-05-18 13:01 [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12 Chengfeng Lin
  2026-05-18 13:10 ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
  2026-05-18 15:36 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " David Hildenbrand (Arm)
@ 2026-05-18 15:43 ` Lorenzo Stoakes
  2026-05-18 16:51   ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
  2 siblings, 1 reply; 11+ messages in thread
From: Lorenzo Stoakes @ 2026-05-18 15:43 UTC (permalink / raw)
  To: Chengfeng Lin
  Cc: Andrew Morton, linux-mm, Liam R. Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Johannes Weiner, Michal Hocko,
	Qi Zheng, Shakeel Butt, Chris Li, Kairui Song, linux-kernel,
	regressions

-cc wrong email

One day I will get to stop nagging like this :) Or just ignore mails that go to
the wrong place.

Please use ljs@kernel.org. I switched over a while ago. I tend to mark kernel
mails that go to my work address read without reading them.

People regularly update their emails, so it's important to re-check them when
you send a new mail.

On Mon, May 18, 2026 at 09:01:02PM +0800, Chengfeng Lin wrote:
> Hi,
>
> I would like to report a userspace-visible mprotect() performance
> regression in a shared dirty PTE workload.
>
> The workload is intentionally narrow:
>
>   - anonymous shared 64 MiB mapping
>   - prefault before protection changes
>   - repeatedly toggle the whole range with mprotect(PROT_READ)
>   - restore with mprotect(PROT_READ | PROT_WRITE)
>   - write-touch after the protection cycle
>
> This is not meant as a generic mprotect() regression report. In
> particular, I am not claiming that the anon/THP mprotect paths regress.
> The current signal is scoped to the shared-dirty full-range PTE toggle
> path above.
>
> The current public evidence bundle is here:
>
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle
>
> The generated workload source used for auditing the workload semantics is
> here:
>
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/e13469b/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c
>
> The formal experiment profile is here:
>
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle/experiments
>
> The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> configuration, using QEMU direct boot. The formal performance runs were
> clean timing runs with coverage disabled. Coverage was collected
> separately and is not used for the timing numbers below.
>
> Lab environment:
>
>   host label: lcf
>   host kernel: Linux 6.14.0-37-generic x86_64
>   QEMU: qemu-system-x86_64 8.2.2
>   container/cgroup CPU set: 0,2,4,6,8,10,12,14
>   container/cgroup memory limit: 16106127360 bytes
>   guest memory: QEMU_MEM_MB=14336
>   guest CPUs: QEMU_SMP=1/2/4
>   repetitions: 9
>   version order: interleaved
>   performance coverage_enabled: false
>
> Primary result, cycle_ns_per_page, lower is better:
>
>   CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12   reliability
>     1      346.8     578.1        40.0%             1.67x      reliable
>     2      394.7     641.7        38.5%             1.63x      robust-only
>     4      381.1     624.8        39.0%             1.64x      partial, same direction
>
> The strongest current result is the 1CPU lab formal result. The 2CPU case
> is same-direction but robust-only in the framework classification. The
> 4CPU case is same-direction but partial because one QEMU run failed; the
> summary still has 8 successful runs for that CPU count.
>
> The current mechanism hypothesis is local to the shared-dirty PTE path.
> In v6.19, the measured hot path goes through the change_pte_range()
> batching machinery:
>
>   change_pte_range()
>     -> mprotect_folio_pte_batch()
>     -> modify_prot_start_ptes()
>     -> set_write_prot_commit_flush_ptes()
>     -> prot_commit_flush_ptes()
>
> For this shared-dirty workload, follow-up batch-probe attribution showed
> nr_ptes=1 in the measured path. The hypothesis is that the extra folio
> lookup, batch-size query, helper dispatch, and commit machinery are paid
> per 4 KiB PTE without effective batch-size amortization in this workload.
> This is mechanism interpretation, not a completed culprit-commit bisect.
>
> I have not bisected the exact culprit commit yet. Separate release-level
> sanity checks showed v6.18.19 already in the slow range, so the current
> best reporting range is:
>
> #regzbot introduced: v6.12..v6.18
>
> Please let me know if a standalone reproducer, a narrower bisect, or
> additional raw logs would be more useful.

Is this really a regression you're seeing in real workloads, or is it synthetic?

Thanks, Lorenzo

* Re: [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x longer on v6.19 than v6.12
  2026-05-18 15:43 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " Lorenzo Stoakes
@ 2026-05-18 16:51   ` Chengfeng Lin
  0 siblings, 0 replies; 11+ messages in thread
From: Chengfeng Lin @ 2026-05-18 16:51 UTC (permalink / raw)
  To: ljs
  Cc: Andrew Morton, linux-mm, Liam R. Howlett, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Johannes Weiner, Michal Hocko,
	Qi Zheng, Shakeel Butt, Chris Li, Kairui Song, linux-kernel,
	regressions

Hi Lorenzo,

Sorry about the stale address. I will use ljs@kernel.org for future
kernel mails.

This is a synthetic/source-calibrated userspace micro-workload, not a
regression I observed in a production application.

The workload was generated from the mm/mprotect.c path and then narrowed
to the shared-dirty full-range PTE toggle case where the timing signal was
stable enough to report. So the intended claim is limited to "this legal
userspace mprotect pattern regressed in the test setup", not "a known real
application workload regressed".

I agree that this makes the report weaker than an application-level
regression. I sent it because the delta is large in the clean 1CPU formal
run (~1.67x slower on v6.19 vs v6.12), and the path looked plausibly tied
to the change_pte_range() batching path where the shared-dirty case did
not form an effective batch in my probe runs.

David also pointed me at Pedro's recent mprotect micro-optimization
series. I have not tested that yet, so I will first check whether that
series already removes most of this synthetic signal. If it does, I will
follow up and mark this accordingly. If the signal remains, I can prepare
a standalone reproducer and/or try to bisect the exact culprit commit
before asking you to spend more time on it.

Sorry for the noise, and thanks for taking a look.

Thanks,
Chengfeng



* [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
@ 2026-05-18 12:59 Chengfeng Lin
  2026-05-18 18:14 ` Kairui Song
  0 siblings, 1 reply; 11+ messages in thread
From: Chengfeng Lin @ 2026-05-18 12:59 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Liam R. Howlett, Lorenzo Stoakes, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Johannes Weiner, Michal Hocko,
	Qi Zheng, Shakeel Butt, Chris Li, Kairui Song, linux-kernel,
	regressions

Hi,

I would like to report a userspace-visible performance regression in a
MADV_PAGEOUT workload.

The workload is intentionally narrow:

  - map 16 MiB anonymous memory
  - use the default THP policy
  - run in a guest with no configured swap
  - call madvise(MADV_PAGEOUT)
  - refault/write-touch the mapping

This is not meant as a generic madvise() or generic MADV_PAGEOUT
regression report. The signal is currently scoped to the THP + no-swap +
refault/write-touch workflow above.

The current public evidence bundle is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault

The standalone workload source is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/workload

The formal experiment profile is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/experiments

The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
configuration, using QEMU direct boot. The formal performance runs were
clean timing runs with coverage disabled. Coverage was collected
separately and is not used for the timing numbers below.

Lab environment:

  host label: lcf
  host kernel: Linux 6.14.0-37-generic x86_64
  QEMU: qemu-system-x86_64 8.2.2
  container/cgroup CPU set: 0,2,4,6,8,10,12,14
  container/cgroup memory limit: 16106127360 bytes
  guest memory: QEMU_MEM_MB=14336
  guest CPUs: QEMU_SMP=1/2/4
  repetitions: 9
  version order: interleaved
  performance coverage_enabled: false

Primary result, cycle_ns_per_page, lower is better:

  CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12
    1     1900.3    3304.7        42.5%             1.74x
    2     2107.7    3583.2        41.2%             1.70x
    4     2154.2    3690.9        41.6%             1.71x

MADV_PAGEOUT syscall/reclaim-side segment, advise_ns_per_page, lower is
better:

  CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12
    1     1713.2    2922.7        41.4%             1.71x
    2     1924.7    3162.9        39.1%             1.64x
    4     1953.1    3284.2        40.5%             1.68x

The current mechanism interpretation is that the timing difference is in
the MADV_PAGEOUT/reclaim part, not primarily in the later refault touch.
The path evidence points at the no-swap reclaim/swap-allocation-failure
chain:

  madvise(MADV_PAGEOUT)
    -> reclaim_pages()
    -> shrink_folio_list()
    -> folio_alloc_swap()
    -> swap allocation failure path

I have not bisected the exact culprit commit yet. Separate release-level
sanity checks showed v6.18.19 already in the slow range, so the current
best reporting range is:

#regzbot introduced: v6.12..v6.18

Please let me know if a different reproducer shape, a narrower bisect, or
additional raw logs would be more useful.

* Re: [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
  2026-05-18 12:59 [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " Chengfeng Lin
@ 2026-05-18 18:14 ` Kairui Song
  0 siblings, 0 replies; 11+ messages in thread
From: Kairui Song @ 2026-05-18 18:14 UTC (permalink / raw)
  To: Chengfeng Lin
  Cc: Andrew Morton, linux-mm, Liam R. Howlett, Lorenzo Stoakes,
	David Hildenbrand, Vlastimil Babka, Jann Horn, Johannes Weiner,
	Michal Hocko, Qi Zheng, Shakeel Butt, Chris Li, linux-kernel,
	regressions

On Mon, May 18, 2026 at 9:09 PM Chengfeng Lin
<23020251154299@stu.xmu.edu.cn> wrote:
>
> Hi,
>
> I would like to report a userspace-visible performance regression in a
> MADV_PAGEOUT workload.

Hi Chengfeng,

Thanks for the report. Very interesting, but it looks a bit confusing.
You are doing PAGEOUT of anon pages without swap set up, so nothing is
actually being paged out?

> The workload is intentionally narrow:
>
>   - map 16 MiB anonymous memory
>   - use the default THP policy
>   - run in a guest with no configured swap
>   - call madvise(MADV_PAGEOUT)
>   - refault/write-touch the mapping
>
> This is not meant as a generic madvise() or generic MADV_PAGEOUT
> regression report. The signal is currently scoped to the THP + no-swap +
> refault/write-touch workflow above.
>
> The current public evidence bundle is here:
>
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault
>
> The standalone workload source is here:
>
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/workload
>
> The formal experiment profile is here:
>
>   https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/experiments
>
> The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
> configuration, using QEMU direct boot. The formal performance runs were
> clean timing runs with coverage disabled. Coverage was collected
> separately and is not used for the timing numbers below.
>
> Lab environment:
>
>   host label: lcf
>   host kernel: Linux 6.14.0-37-generic x86_64
>   QEMU: qemu-system-x86_64 8.2.2
>   container/cgroup CPU set: 0,2,4,6,8,10,12,14
>   container/cgroup memory limit: 16106127360 bytes
>   guest memory: QEMU_MEM_MB=14336
>   guest CPUs: QEMU_SMP=1/2/4
>   repetitions: 9
>   version order: interleaved
>   performance coverage_enabled: false
>
> Primary result, cycle_ns_per_page, lower is better:

What is cycle_ns_per_page? The name is ambiguous. "cycle" reads like
CPU cycles, but the value doesn't match. Or do you mean iteration? Can you
make this clearer?

>   CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12
>     1     1900.3    3304.7        42.5%             1.74x
>     2     2107.7    3583.2        41.2%             1.70x
>     4     2154.2    3690.9        41.6%             1.71x
>
> MADV_PAGEOUT syscall/reclaim-side segment, advise_ns_per_page, lower is
> better:
>
>   CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12
>     1     1713.2    2922.7        41.4%             1.71x
>     2     1924.7    3162.9        39.1%             1.64x
>     4     1953.1    3284.2        40.5%             1.68x
>
> The current mechanism interpretation is that the timing difference is in
> the MADV_PAGEOUT/reclaim part, not primarily in the later refault touch.

Right, with no swap configured, I think the post-advise touch should not be
faulting at all? IIUC it will just split the pages; however, that might
cause more TLB pressure. But this split-on-alloc-failure path itself
wasn't changed from 6.12 to 6.18. If there is no swap, we can skip the
split, though, and maybe make folio_alloc_swap() fail a bit faster.
Skipping the split seems more meaningful and could significantly speed
up this specific test, if that turns out to be necessary.

> The path evidence points at the no-swap reclaim/swap-allocation-failure
> chain:
>
>   madvise(MADV_PAGEOUT)
>     -> reclaim_pages()
>     -> shrink_folio_list()
>     -> folio_alloc_swap()
>     -> swap allocation failure path

What you posted, including that repo, seems to have only end-to-end timing.
It would be much more useful if you collected a perf or ftrace breakdown,
something like:
https://lore.kernel.org/all/CAMgjq7BfO=dNYep4z1aS7nUAJU3bktR17gYAufx=kkLudq4dAQ@mail.gmail.com/

Or like this:
https://lore.kernel.org/all/20260422003412.11678-1-xueyuan.chen21@gmail.com/


Thread overview: 11+ messages
2026-05-18 13:01 [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12 Chengfeng Lin
2026-05-18 13:10 ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
2026-05-18 15:36 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " David Hildenbrand (Arm)
2026-05-18 17:01   ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
2026-05-22  9:03     ` Chengfeng Lin
2026-05-25 10:29       ` Pedro Falcato
2026-05-26  7:57         ` Chengfeng Lin
2026-05-18 15:43 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " Lorenzo Stoakes
2026-05-18 16:51   ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
  -- strict thread matches above, loose matches on Subject: below --
2026-05-18 12:59 [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " Chengfeng Lin
2026-05-18 18:14 ` Kairui Song
