Linux-mm Archive on lore.kernel.org
* [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
From: Chengfeng Lin @ 2026-06-09  7:26 UTC
  To: Andrew Morton
  Cc: Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Baolin Wang, Barry Song,
	Dev Jain, Ryan Roberts, David Hildenbrand, Zi Yan

Hi,

I found a source-calibrated synthetic mincore() signal in the resident
base-page PTE path.  I do not currently have an easy arm64/mTHP validation
setup, so before trying to arrange that more expensive validation I would like
to ask whether the candidate fix shape below looks reasonable.

To keep the scope clear, I am not presenting this as a production application
regression report or as a generic mincore() regression.  It is a controlled
reproducer for a real userspace-visible syscall path, with the page-table shape
kept intentionally simple:

  mmap() private anonymous memory
  madvise(MADV_NOHUGEPAGE)
  fault in all pages
  repeatedly call mincore() over a resident 64 MiB range

The practical hook is that mincore() is the userspace-visible residency query
for an address range.  The resident anonymous no-THP range is intended to
isolate the base-page present-PTE scan and avoid file cache, swap, THP, marker,
and unmapped-range effects.  I would read the result as source-path evidence for
the hot path below, not as evidence that every mincore() caller or a specific
application workload regressed.
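
The setup steps above can be sketched in userspace roughly as follows.  This
is a minimal illustration, not the reproducer from the evidence bundle; the
function name resident_pages_after_fault_in() is hypothetical, and the
madvise() call is treated as best effort on the assumption that some kernels
are built without THP support.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of private anonymous memory, opt out
 * of THP, touch every page, then query residency with mincore().
 * Returns the number of pages reported resident, or -1 on failure.
 */
static long resident_pages_after_fault_in(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t npages, i;
	long resident = 0;
	unsigned char *buf, *vec;

	assert(page > 0);
	if (len == 0 || len % (size_t)page)
		return -1;
	npages = len / (size_t)page;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Best effort: the advice may be rejected when THP is not built in. */
	(void)madvise(buf, len, MADV_NOHUGEPAGE);

	/* Write one byte per page so every base page gets a present PTE. */
	for (i = 0; i < npages; i++)
		buf[i * (size_t)page] = 1;

	vec = malloc(npages);
	if (!vec || mincore(buf, len, vec) != 0) {
		free(vec);
		munmap(buf, len);
		return -1;
	}

	/* Only the least significant bit of each vec byte is defined. */
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;

	free(vec);
	munmap(buf, len);
	return resident;
}
```

Calling it with 64UL << 20 matches the 64 MiB range used in this message; a
fully resident range should report every page.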

The intended hot path is:

  mincore()
    -> walk_page_range()
       -> mincore_pte_range()

The main metric is mincore_ns_per_1k_pages, lower is better.  It is the
wall-clock time spent in the mincore() scan, normalized by the number of pages
covered by the range and reported as nanoseconds per 1000 pages scanned.
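
As a sketch of how such a metric can be computed, assuming the timing comes
from two CLOCK_MONOTONIC samples around the scan loop and that pages_scanned
counts pages across all repeated calls (both are my assumptions, and the
helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds between two CLOCK_MONOTONIC samples. */
static uint64_t elapsed_ns(struct timespec start, struct timespec end)
{
	int64_t sec = (int64_t)end.tv_sec - (int64_t)start.tv_sec;
	int64_t nsec = (int64_t)end.tv_nsec - (int64_t)start.tv_nsec;
	int64_t total = sec * 1000000000LL + nsec;

	assert(total >= 0);	/* end must not precede start */
	return (uint64_t)total;
}

/* Normalize a total scan time to nanoseconds per 1000 pages scanned. */
static double ns_per_1k_pages(uint64_t total_ns, uint64_t pages_scanned)
{
	assert(pages_scanned > 0);
	return (double)total_ns * 1000.0 / (double)pages_scanned;
}
```

With 4 KiB pages, a 64 MiB range is 16384 pages, so a value of 12827.667
corresponds to about 210 us per full scan of the range.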

As a release-level starting point, the matched-PREEMPT 1/2/4 CPU bridge below
uses original release kernels, QEMU direct boot, 9 repetitions, coverage
disabled, and the same CONFIG_ADVISE_SYSCALLS setup:

  scenario: no_thp_pte_scan_64m
  metric:   mincore_ns_per_1k_pages, lower is better

  CPU   v6.12.77   v6.18.19   v6.19.9    v7.0.9
  1     12827.667  15677.444  16482.667  16726.333
  2     13628.444  16102.333  18256.889  17270.333
  4     13798.222  16739.333  18892.111  17068.222

This shows cumulative cost relative to v6.12 in the primary 1/2/4 CPU matrix:
v7.0.9 is roughly 24% to 30% slower than v6.12.77 across the three rows.
I also reran the 8CPU/16CPU release-level bridge on the same scenario.  These
rows show the same general direction, but the shared lab was busy during this
rerun and the high-CPU rows have higher CV, so I include them as extended
context only:

  CPU/mem     v6.12.77   v6.18.19   v6.19.9    v7.0.9
  8/16 GiB    17251.889  23335.556  21863.556  21664.778
  16/32 GiB   16697.333  21428.333  21629.778  21628.333

The 16CPU rerun had two QEMU returncode-139 failures in the original 36-run
matrix; I filled the missing v6.12/v6.18 samples with a clean two-run
supplement.  I therefore use the high-CPU rows as context for the release
bridge, not as part of the primary matrix.

Follow-up release-ladder and A/B testing narrowed the main step to the
v6.15 -> v6.16 window.  The strongest suspect is:

  4df65651f7075 ("mm: mincore: use pte_batch_hint() to batch process large folios")

That patch improved the mTHP/large-folio case, but in this base-page resident
PTE scan I see a cost of roughly 17% to 32% from v6.15 to v6.16 (table below).
The original commit message mentioned that base pages did not show an obvious
regression, so this may simply be a different x86/base-page corner than the
original arm64/mTHP test.

For the v6.16 introduction-window A/B, all rows below are lab QEMU direct boot,
9 repetitions, coverage disabled, same PREEMPT and CONFIG_ADVISE_SYSCALLS
setup.  The 1/2/4 CPU rows are the primary matrix; the 8CPU/16GiB and
16CPU/32GiB rows are the high-CPU follow-up:

  scenario: no_thp_pte_scan_64m
  metric:   mincore_ns_per_1k_pages, lower is better

  CPU/mem     v6.15      v6.16      v6.16 batch<=1 fastpath   v6.16 nobatch
  1           12946.889  17117.667        14560.556              13843.222
  2           15053.111  18214.667        15714.778              14270.556
  4           14942.000  18338.222        14397.889              14719.667
  8/16 GiB    15046.444  17540.222        13696.333              13200.000
  16/32 GiB   14674.111  18928.889        13949.000              15351.111

The high-CPU matrix completed 72/72 with all_cpu_match=true,
any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true.  One v6.15
16CPU timing sample in the main matrix was an obvious outlier, so the 16/32 GiB
v6.15 value above uses a clean v6.15-only 9-repeat supplement.

I also ran ftrace attribution on the same path as mechanism evidence, not as
clean timing.  In that run, v6.16 original had a higher mincore_pte_range
average than v6.15, v6.16-nobatch, and the batch<=1 fastpath:

  kernel                       mincore_pte_range avg_us
  v6.15-mainline-preempt       6.040
  v6.16-mainline-preempt       7.899
  v6.16-mainline-nobatch       6.031
  v6.16-mainline-fastpath      6.103

The smaller batch<=1 fastpath helped, but later v6.18 testing suggested the
remaining cost was more about the hot present-PTE branch layout.  The candidate
shape I tested is to check pte_present() first, while keeping pte_batch_hint()
for batch > 1:

  if (pte_present(pte)) {
          batch = pte_batch_hint(ptep, pte);
          if (batch > 1)
                  fill vec[0..step-1];
          else
                  *vec = 1;
  } else if (pte_none(pte) || pte_is_marker(pte)) {
          __mincore_unmapped_range(...);
  } else {
          mincore_swap(...);
  }

x86 uses the generic pte_batch_hint(), which always returns 1, so this mainly
measures the resident-PTE hot path layout.  On arm64 the batch > 1 path should
still be preserved, but I have not validated mTHP/contiguous-PTE performance
yet.
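
To make the branch shape concrete, here is a userspace toy model of the
present-first scan.  It is a sketch only: the toy_* names and the enum are
hypothetical stand-ins, not kernel types, the hint callbacks return fixed
values rather than the distance to the end of a contiguous block, and
non-present entries are simply reported as not resident instead of going
through the real unmapped and swap handling.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for PTE states; these are not kernel types. */
enum toy_pte { TOY_NONE, TOY_MARKER, TOY_SWAP, TOY_PRESENT };

/* Hypothetical batch hint: 1 models x86, larger values model contiguous PTEs. */
typedef unsigned int (*toy_batch_hint_fn)(const enum toy_pte *ptep);

static unsigned int toy_hint_one(const enum toy_pte *ptep)
{
	(void)ptep;
	return 1;
}

static unsigned int toy_hint_four(const enum toy_pte *ptep)
{
	(void)ptep;
	return 4;
}

/* Present-first scan over nr entries, writing one residency byte per entry. */
static void toy_mincore_scan(const enum toy_pte *ptes, size_t nr,
			     unsigned char *vec, toy_batch_hint_fn hint)
{
	size_t i = 0;

	while (i < nr) {
		size_t step = 1;

		if (ptes[i] == TOY_PRESENT) {
			unsigned int batch = hint(&ptes[i]);

			assert(batch >= 1);
			if (batch > 1) {
				/* Clamp the batch to the entries left in range. */
				size_t left = nr - i;
				size_t k;

				step = batch < left ? batch : left;
				for (k = 0; k < step; k++)
					vec[i + k] = 1;
			} else {
				vec[i] = 1;	/* single-store fast path */
			}
		} else {
			vec[i] = 0;	/* none, marker, or swap entry */
		}
		i += step;
	}
}
```

With toy_hint_one the loop degenerates to one store per present entry, which
is the x86 case measured here; toy_hint_four exercises the clamped batch fill.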

For the v6.18 confirmation A/B, the 1/2/4 CPU rows are the primary matrix and
the 8CPU/16GiB and 16CPU/32GiB rows are the high-CPU follow-up.  All rows use
the same no-THP scenario, 9 repetitions, and coverage disabled.  The
improvement column compares present-first against unpatched v6.18, not against
v6.15:

  scenario: no_thp_pte_scan_64m
  metric:   mincore_ns_per_1k_pages, lower is better

  CPU/mem     v6.15      v6.18      v6.18 present-first   mean improvement
  1           13373.222  16473.000        11055.222              32.89%
  2           13454.444  16424.444        11467.556              30.18%
  4           13651.778  16772.333        11470.444              31.61%
  8/16 GiB            -  16008.778        10941.444              31.65%
  16/32 GiB           -  17549.556        11725.111              33.19%

And the v7.0.9 A/B, with the same row layout:

  CPU/mem     v7.0.9     v7.0.9 present-first   mean improvement
  1           16328.778        10061.778              38.38%
  2           17600.000        11856.444              32.63%
  4           17819.000        11961.556              32.87%
  8/16 GiB    17379.778        10999.889              36.71%
  16/32 GiB   17917.778        11555.889              35.51%
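
For reference, the "mean improvement" columns in the two tables above are
consistent with the relative reduction of the present-first mean against the
unpatched mean of the same kernel version.  A minimal sketch of that
arithmetic, with a hypothetical helper name:

```c
#include <assert.h>

/* Percent reduction of the patched mean relative to the baseline mean. */
static double improvement_pct(double baseline_mean, double patched_mean)
{
	assert(baseline_mean > 0.0);	/* a mean timing must be positive */
	return (baseline_mean - patched_mean) / baseline_mean * 100.0;
}
```

For the v6.18 1 CPU row this gives (16473.000 - 11055.222) / 16473.000, which
rounds to the 32.89% shown in the table.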

The high-CPU rows completed 72/72 with all_cpu_match=true,
any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true.  I still
treat them as extended x86 validation only: the open question is whether
arm64/mTHP/contiguous-PTE behavior is preserved, and these rows cannot answer
it.

I also ran an all-scenario semantic smoke on v6.18 original vs present-first.
Both THP and no-THP scenarios completed with all_semantic_ok=true.  That smoke
only checks that the THP/no-THP state shape still behaves as expected on x86;
it is not a substitute for arm64/mTHP preservation testing.

For the branch-ordering safety question, my reading is that pte_is_marker()
goes through softleaf_from_pte(), which first returns an empty leaf for
pte_present() or pte_none().  So a real marker is a non-present, non-none leaf,
and checking pte_present() first should not hide the marker path.  I would
still appreciate review from people more familiar with the arch PTE encodings.
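
The ordering argument can be checked with a small toy model.  This is only a
sketch of the reasoning, not kernel code: the struct, the toy_* and classify_*
names are hypothetical, the "original" order (none/marker first, then present,
then swap) is my reading of the current code, and the model takes as given
that an entry cannot be both present and none.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy entry: marker_bits models "the encoding looks like a marker". */
struct toy_pte {
	bool present;
	bool none;
	bool marker_bits;
};

enum toy_path { PATH_UNMAPPED, PATH_PRESENT, PATH_SWAP };

/* Models the described behavior: present or none entries are never markers. */
static bool toy_is_marker(struct toy_pte p)
{
	if (p.present || p.none)
		return false;
	return p.marker_bits;
}

/* Assumed original order: none/marker first, then present, then swap. */
static enum toy_path classify_original(struct toy_pte p)
{
	assert(!(p.present && p.none));
	if (p.none || toy_is_marker(p))
		return PATH_UNMAPPED;
	if (p.present)
		return PATH_PRESENT;
	return PATH_SWAP;
}

/* Candidate order: present first, then none/marker, then swap. */
static enum toy_path classify_present_first(struct toy_pte p)
{
	assert(!(p.present && p.none));
	if (p.present)
		return PATH_PRESENT;
	if (p.none || toy_is_marker(p))
		return PATH_UNMAPPED;
	return PATH_SWAP;
}

/* True when both orderings agree for every consistent toy state. */
static bool orderings_agree(void)
{
	for (int bits = 0; bits < 8; bits++) {
		struct toy_pte p = {
			.present = (bits & 1) != 0,
			.none = (bits & 2) != 0,
			.marker_bits = (bits & 4) != 0,
		};

		if (p.present && p.none)
			continue;	/* contradictory state, excluded */
		if (classify_original(p) != classify_present_first(p))
			return false;
	}
	return true;
}
```

The two orders agree in this model only because present and none are mutually
exclusive and a marker is neither, which is exactly the property I would like
reviewers to confirm for the real arch encodings.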

I prepared a compact evidence/reproducer bundle here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/206a39d/mincore-present-pte-scan

It includes:

  - a standalone C reproducer for the no-THP mincore scan
  - the workload target/profile used by the local experiment framework
  - the local test patch shape, clearly marked not for direct submission
  - compact lab CSV summaries for the v6.16 intro-window A/B, including the
    high-CPU follow-up
  - compact lab CSV summaries for the v6.18, v7.0 and high-CPU present-first
    A/B runs
  - matched-PREEMPT release-level bridge summaries for the 1/2/4 CPU matrix
    and the separate 8CPU/16CPU context rows

I am intentionally not asking regzbot to track this at this stage.  It is a
source-calibrated synthetic signal with a strong x86 lab result across the
primary 1/2/4 CPU matrix and the 8/16CPU present-first A/B follow-up, but it
still needs arm64/mTHP validation and a proper patch before it should be
treated as an upstream-ready fix.

Does this present-first shape look like the right direction to validate further,
or would you prefer a different approach such as a smaller local fastpath around
pte_batch_hint() returning 1?

Thanks,
Chengfeng
