* [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
@ 2026-05-18 13:01 Chengfeng Lin
From: Chengfeng Lin @ 2026-05-18 13:01 UTC
  To: Andrew Morton, linux-mm
  Cc: Liam R. Howlett, Lorenzo Stoakes, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Johannes Weiner, Michal Hocko,
	Qi Zheng, Shakeel Butt, Chris Li, Kairui Song, linux-kernel,
	regressions

Hi,

I would like to report a userspace-visible mprotect() performance
regression in a shared dirty PTE workload.

The workload is intentionally narrow:

  - anonymous shared 64 MiB mapping
  - prefault before protection changes
  - repeatedly toggle the whole range with mprotect(PROT_READ)
  - restore with mprotect(PROT_READ | PROT_WRITE)
  - write-touch after the protection cycle

This is not meant as a generic mprotect() regression report. In
particular, I am not claiming that the anon/THP mprotect paths regress.
The current signal is scoped to the shared-dirty full-range PTE toggle
path above.

The current public evidence bundle is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle

The generated workload source used for auditing the workload semantics is
here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/e13469b/mprotect-shared-dirty-toggle/workload/mprotect_paths_storm.c

The formal experiment profile is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/mprotect-shared-dirty-toggle/experiments

The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
configuration, using QEMU direct boot. The formal performance runs were
clean timing runs with coverage disabled. Coverage was collected
separately and is not used for the timing numbers below.

Lab environment:

  host label: lcf
  host kernel: Linux 6.14.0-37-generic x86_64
  QEMU: qemu-system-x86_64 8.2.2
  container/cgroup CPU set: 0,2,4,6,8,10,12,14
  container/cgroup memory limit: 16106127360 bytes
  guest memory: QEMU_MEM_MB=14336
  guest CPUs: QEMU_SMP=1/2/4
  repetitions: 9
  version order: interleaved
  performance coverage_enabled: false

Primary result, cycle_ns_per_page, lower is better:

  CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12   reliability
    1      346.8     578.1        40.0%             1.67x      reliable
    2      394.7     641.7        38.5%             1.63x      robust-only
    4      381.1     624.8        39.0%             1.64x      partial, same direction

The 1-CPU formal lab result is currently the strongest. The 2-CPU result
points in the same direction but is classified only as robust-only by the
framework. The 4-CPU result also points in the same direction but is
partial because one QEMU run failed; the summary still has 8 successful
runs for that CPU count.

The current mechanism hypothesis is local to the shared-dirty PTE path.
In v6.19, the measured hot path goes through the change_pte_range()
batching machinery:

  change_pte_range()
    -> mprotect_folio_pte_batch()
    -> modify_prot_start_ptes()
    -> set_write_prot_commit_flush_ptes()
    -> prot_commit_flush_ptes()

For this shared-dirty workload, follow-up batch-probe attribution showed
nr_ptes=1 in the measured path. The hypothesis is that the extra folio
lookup, batch-size query, helper dispatch, and commit machinery are paid
per 4 KiB PTE without effective batch-size amortization in this workload.
This is a mechanism interpretation, not a completed culprit-commit bisect.

I have not bisected the exact culprit commit yet. Separate release-level
sanity checks showed v6.18.19 already in the slow range, so the current
best reporting range is:

#regzbot introduced: v6.12..v6.18

Please let me know if a standalone reproducer, a narrower bisect, or
additional raw logs would be more useful.

* [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12
@ 2026-05-18 12:59 Chengfeng Lin
From: Chengfeng Lin @ 2026-05-18 12:59 UTC
  To: Andrew Morton, linux-mm
  Cc: Liam R. Howlett, Lorenzo Stoakes, David Hildenbrand,
	Vlastimil Babka, Jann Horn, Johannes Weiner, Michal Hocko,
	Qi Zheng, Shakeel Butt, Chris Li, Kairui Song, linux-kernel,
	regressions

Hi,

I would like to report a userspace-visible performance regression in a
MADV_PAGEOUT workload.

The workload is intentionally narrow:

  - map 16 MiB anonymous memory
  - use the default THP policy
  - run in a guest with no configured swap
  - call madvise(MADV_PAGEOUT)
  - refault/write-touch the mapping

This is not meant as a generic madvise() or generic MADV_PAGEOUT
regression report. The signal is currently scoped to the THP + no-swap +
refault/write-touch workflow above.

The current public evidence bundle is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault

The standalone workload source is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/workload

The formal experiment profile is here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/e13469b/madvise-pageout-thp-noswap-refault/experiments

The formal timing runs compare v6.12.77 and v6.19.9 with similar kernel
configuration, using QEMU direct boot. The formal performance runs were
clean timing runs with coverage disabled. Coverage was collected
separately and is not used for the timing numbers below.

Lab environment:

  host label: lcf
  host kernel: Linux 6.14.0-37-generic x86_64
  QEMU: qemu-system-x86_64 8.2.2
  container/cgroup CPU set: 0,2,4,6,8,10,12,14
  container/cgroup memory limit: 16106127360 bytes
  guest memory: QEMU_MEM_MB=14336
  guest CPUs: QEMU_SMP=1/2/4
  repetitions: 9
  version order: interleaved
  performance coverage_enabled: false

Primary result, cycle_ns_per_page, lower is better:

  CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12
    1     1900.3    3304.7        42.5%             1.74x
    2     2107.7    3583.2        41.2%             1.70x
    4     2154.2    3690.9        41.6%             1.71x

MADV_PAGEOUT syscall/reclaim-side segment, advise_ns_per_page, lower is
better:

  CPU   v6.12.77   v6.19.9   old-lower-vs-new   v6.19/v6.12
    1     1713.2    2922.7        41.4%             1.71x
    2     1924.7    3162.9        39.1%             1.64x
    4     1953.1    3284.2        40.5%             1.68x

The current mechanism interpretation is that the timing difference is in
the MADV_PAGEOUT/reclaim part, not primarily in the later refault touch.
The path evidence points at the no-swap reclaim/swap-allocation-failure
chain:

  madvise(MADV_PAGEOUT)
    -> reclaim_pages()
    -> shrink_folio_list()
    -> folio_alloc_swap()
    -> swap allocation failure path

I have not bisected the exact culprit commit yet. Separate release-level
sanity checks showed v6.18.19 already in the slow range, so the current
best reporting range is:

#regzbot introduced: v6.12..v6.18

Please let me know if a different reproducer shape, a narrower bisect, or
additional raw logs would be more useful.


Thread overview: 11+ messages
2026-05-18 13:01 [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x longer on v6.19 than v6.12 Chengfeng Lin
2026-05-18 13:10 ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
2026-05-18 15:36 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " David Hildenbrand (Arm)
2026-05-18 17:01   ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
2026-05-22  9:03     ` Chengfeng Lin
2026-05-25 10:29       ` Pedro Falcato
2026-05-26  7:57         ` Chengfeng Lin
2026-05-18 15:43 ` [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " Lorenzo Stoakes
2026-05-18 16:51   ` [REGRESSION] mm/mprotect: shared dirty PTE toggle takes ~1.6x " Chengfeng Lin
  -- strict thread matches above, loose matches on Subject: below --
2026-05-18 12:59 [REGRESSION] mm: MADV_PAGEOUT THP/no-swap refault takes ~1.7x " Chengfeng Lin
2026-05-18 18:14 ` Kairui Song
