[RFC] mm/mincore: present-PTE scan cost after pte_batch

* [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
@ 2026-06-09  7:26 Chengfeng Lin
  2026-06-09  9:01 ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 5+ messages in thread
From: Chengfeng Lin @ 2026-06-09  7:26 UTC
  To: Andrew Morton
  Cc: Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Baolin Wang, Barry Song,
	Dev Jain, Ryan Roberts, David Hildenbrand, Zi Yan

Hi,

I found a source-calibrated synthetic mincore() signal in the resident
base-page PTE path.  I do not currently have an easy arm64/mTHP validation
setup, so before trying to arrange that more expensive validation I would like
to ask whether the candidate fix shape below looks reasonable.

To keep the scope clear, I am not presenting this as a production application
regression report or as a generic mincore() regression.  It is a controlled
reproducer for a real userspace-visible syscall path, with the page-table shape
kept intentionally simple:

  mmap() private anonymous memory
  madvise(MADV_NOHUGEPAGE)
  fault in all pages
  repeatedly call mincore() over a resident 64 MiB range

The practical hook is that mincore() is the userspace-visible residency query
for an address range.  The resident anonymous no-THP range is intended to
isolate the base-page present-PTE scan and avoid file cache, swap, THP, marker,
and unmapped-range effects.  I would read the result as source-path evidence for
the hot path below, not as evidence that every mincore() caller or a specific
application workload regressed.

The intended hot path is:

  mincore()
    -> walk_page_range()
       -> mincore_pte_range()

The main metric is mincore_ns_per_1k_pages, lower is better.  It is the
wall-clock time spent in the mincore() scan, normalized by the number of pages
covered by the range and reported as nanoseconds per 1000 pages scanned.
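
As a hedged illustration of the four steps and the metric described above, here
is a minimal userspace sketch in C.  It is not the reproducer from the evidence
bundle: the function names ns_per_1k_pages() and measure_mincore() are made up
for this sketch, and it assumes the timer wraps the whole loop of mincore()
calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Normalize a total scan time to nanoseconds per 1000 pages scanned. */
double ns_per_1k_pages(double total_ns, size_t pages_per_call, int calls)
{
	assert(pages_per_call > 0 && calls > 0);
	return total_ns / ((double)pages_per_call * calls) * 1000.0;
}

/*
 * Hypothetical helper, not the author's reproducer: time `calls` mincore()
 * scans over `len` bytes of resident private anonymous memory.
 * Returns ns per 1000 pages, or -1.0 on any failure.
 */
double measure_mincore(size_t len, int calls)
{
	long page = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;
	double result = -1.0;
	unsigned char *vec;
	size_t pages;
	char *buf;

	if (page <= 0 || len == 0 || calls <= 0)
		return -1.0;
	pages = (len + (size_t)page - 1) / (size_t)page;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	vec = malloc(pages);
	if (!vec)
		goto out_unmap;

	/* Best effort: request base pages only; failure is ignored here. */
	(void)madvise(buf, len, MADV_NOHUGEPAGE);

	/* Fault in every page so the scan only sees present PTEs. */
	for (size_t off = 0; off < len; off += (size_t)page)
		buf[off] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < calls; i++) {
		if (mincore(buf, len, vec) != 0)
			goto out_free;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	result = ns_per_1k_pages((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
				 (double)(t1.tv_nsec - t0.tv_nsec),
				 pages, calls);
out_free:
	free(vec);
out_unmap:
	munmap(buf, len);
	return result;
}
```

For example, measure_mincore(64UL << 20, 100) would scan a 64 MiB range 100
times; the iteration count is an arbitrary choice for this sketch.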

As a release-level starting point, the matched-PREEMPT 1/2/4 CPU bridge below
uses original release kernels, QEMU direct boot, 9 repetitions, coverage
disabled, and the same CONFIG_ADVISE_SYSCALLS setup:

  scenario: no_thp_pte_scan_64m
  metric:   mincore_ns_per_1k_pages, lower is better

  CPU   v6.12.77   v6.18.19   v6.19.9    v7.0.9
  1     12827.667  15677.444  16482.667  16726.333
  2     13628.444  16102.333  18256.889  17270.333
  4     13798.222  16739.333  18892.111  17068.222

This shows cumulative cost relative to v6.12 in the primary 1/2/4 CPU matrix.
I also reran the 8CPU/16CPU release-level bridge on the same scenario.  These
rows show the same general direction, but the shared lab was busy during this
rerun and the high-CPU rows have higher CV, so I include them as extended
context only:

  CPU/mem     v6.12.77   v6.18.19   v6.19.9    v7.0.9
  8/16 GiB    17251.889  23335.556  21863.556  21664.778
  16/32 GiB   16697.333  21428.333  21629.778  21628.333

The 16CPU rerun had two QEMU returncode-139 failures in the original 36-run
matrix; I filled the missing v6.12/v6.18 samples with a clean two-run
supplement.  I therefore use the high-CPU rows as context for the release
bridge, not as part of the primary matrix.

Follow-up release-ladder and A/B testing narrowed the main step to the
v6.15 -> v6.16 window.  The strongest suspect is:

  4df65651f7075 ("mm: mincore: use pte_batch_hint() to batch process large folios")

That patch improved the mTHP/large-folio case, but in this base-page resident
PTE scan I see a sizeable cost.  The original commit message mentioned that
base pages did not show an obvious regression, so this may simply be a
different x86/base-page corner than the original arm64/mTHP test.

For the v6.16 introduction-window A/B, all rows below are lab QEMU direct boot,
9 repetitions, coverage disabled, same PREEMPT and CONFIG_ADVISE_SYSCALLS
setup.  The 1/2/4 CPU rows are the primary matrix; the 8CPU/16GiB and
16CPU/32GiB rows are the high-CPU follow-up:

  scenario: no_thp_pte_scan_64m
  metric:   mincore_ns_per_1k_pages, lower is better

  CPU/mem     v6.15      v6.16      v6.16 batch<=1 fastpath   v6.16 nobatch
  1           12946.889  17117.667        14560.556              13843.222
  2           15053.111  18214.667        15714.778              14270.556
  4           14942.000  18338.222        14397.889              14719.667
  8/16 GiB    15046.444  17540.222        13696.333              13200.000
  16/32 GiB   14674.111  18928.889        13949.000              15351.111

The high-CPU matrix completed 72/72 with all_cpu_match=true,
any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true.  One v6.15
16CPU timing sample in the main matrix was an obvious outlier, so the 16/32 GiB
v6.15 value above uses a clean v6.15-only 9-repeat supplement.

I also ran ftrace attribution on the same path as mechanism evidence, not as
clean timing.  In that run, v6.16 original had a higher mincore_pte_range
average than v6.15, v6.16-nobatch, and the batch<=1 fastpath:

  kernel                       mincore_pte_range avg_us
  v6.15-mainline-preempt       6.040
  v6.16-mainline-preempt       7.899
  v6.16-mainline-nobatch       6.031
  v6.16-mainline-fastpath      6.103

The smaller batch<=1 fastpath helped, but later v6.18 testing suggested the
remaining cost was more about the hot present-PTE branch layout.  The candidate
shape I tested is to check pte_present() first, while keeping pte_batch_hint()
for batch > 1:

  if (pte_present(pte)) {
          batch = pte_batch_hint(ptep, pte);
          if (batch > 1)
                  fill vec[0..step-1];
          else
                  *vec = 1;
  } else if (pte_none(pte) || pte_is_marker(pte)) {
          __mincore_unmapped_range(...);
  } else {
          mincore_swap(...);
  }

On x86, pte_batch_hint() defaults to 1, so this mainly measures the
resident-PTE hot path layout.  On arm64 the batch > 1 path should still be
preserved, but I have not validated mTHP/contiguous-PTE performance yet.

In the v6.18 confirmation A/B below, the 1/2/4 CPU rows are the primary matrix
and the 8CPU/16GiB and 16CPU/32GiB rows are the high-CPU follow-up.  All rows
use the same no-THP scenario, 9 repetitions, and coverage disabled:

  scenario: no_thp_pte_scan_64m
  metric:   mincore_ns_per_1k_pages, lower is better

  CPU/mem     v6.15      v6.18      v6.18 present-first   mean improvement
  1           13373.222  16473.000        11055.222              32.89%
  2           13454.444  16424.444        11467.556              30.18%
  4           13651.778  16772.333        11470.444              31.61%
  8/16 GiB            -  16008.778        10941.444              31.65%
  16/32 GiB           -  17549.556        11725.111              33.19%

And the v7.0.9 A/B, with the same row layout:

  CPU/mem     v7.0.9     v7.0.9 present-first   mean improvement
  1           16328.778        10061.778              38.38%
  2           17600.000        11856.444              32.63%
  4           17819.000        11961.556              32.87%
  8/16 GiB    17379.778        10999.889              36.71%
  16/32 GiB   17917.778        11555.889              35.51%
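
The "mean improvement" column in the two tables above is consistent with the
relative reduction of the mean, (original - present-first) / original.  A small
C helper makes the arithmetic explicit; the name improvement_pct() is
hypothetical and only used for this illustration.

```c
#include <assert.h>

/* Relative reduction of the patched mean versus the original mean, in percent. */
double improvement_pct(double orig_ns, double patched_ns)
{
	assert(orig_ns > 0.0);
	return (orig_ns - patched_ns) / orig_ns * 100.0;
}
```

With the 1-CPU v6.18 row, improvement_pct(16473.000, 11055.222) gives about
32.89, which matches the table.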

The high-CPU rows completed 72/72 with all_cpu_match=true,
any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true.  I still
treat them as extended x86 validation because the remaining preservation
question is arm64/mTHP/contiguous-PTE behavior.

I also ran an all-scenario semantic smoke on v6.18 original vs present-first.
Both THP and no-THP scenarios completed with all_semantic_ok=true.  That smoke
only checks that the THP/no-THP state shape still behaves as expected on x86;
it is not a substitute for arm64/mTHP preservation testing.

For the branch-ordering safety question, my reading is that pte_is_marker()
goes through softleaf_from_pte(), which first returns an empty leaf for
pte_present() or pte_none().  So a real marker is a non-present, non-none leaf,
and checking pte_present() first should not hide the marker path.  I would
still appreciate review from people more familiar with the arch PTE encodings.
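
The ordering argument above can be sketched with a toy userspace model.  The
PTE encoding below is invented for illustration and matches no real
architecture; the toy_* names are hypothetical.  The only property carried over
from the text is that the marker check reports false for present or none
entries.  The none/marker-first function is assumed to resemble the existing
order.  Under that assumption both orderings classify every value the same way.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PTE encoding, purely illustrative; no real architecture uses it. */
typedef unsigned int toy_pte_t;
#define TOY_PRESENT 0x1u
#define TOY_MARKER  0x2u	/* only meaningful for non-present, non-none */

enum toy_action { ACT_UNMAPPED, ACT_RESIDENT, ACT_SWAP };

bool toy_pte_none(toy_pte_t pte)
{
	return pte == 0;
}

bool toy_pte_present(toy_pte_t pte)
{
	return (pte & TOY_PRESENT) != 0;
}

/* Mirrors the reading in the text: present or none entries are never markers. */
bool toy_pte_is_marker(toy_pte_t pte)
{
	if (toy_pte_present(pte) || toy_pte_none(pte))
		return false;
	return (pte & TOY_MARKER) != 0;
}

/* Assumed existing order: none/marker first, then present, else swap. */
enum toy_action classify_none_first(toy_pte_t pte)
{
	if (toy_pte_none(pte) || toy_pte_is_marker(pte))
		return ACT_UNMAPPED;
	if (toy_pte_present(pte))
		return ACT_RESIDENT;
	return ACT_SWAP;
}

/* Candidate order: present first, then none/marker, else swap. */
enum toy_action classify_present_first(toy_pte_t pte)
{
	if (toy_pte_present(pte))
		return ACT_RESIDENT;
	if (toy_pte_none(pte) || toy_pte_is_marker(pte))
		return ACT_UNMAPPED;
	return ACT_SWAP;
}

/* True if both orderings agree for every toy PTE value below `limit`. */
bool orderings_agree(toy_pte_t limit)
{
	for (toy_pte_t pte = 0; pte < limit; pte++) {
		if (classify_none_first(pte) != classify_present_first(pte))
			return false;
	}
	return true;
}
```

This only checks the logical shape of the branches.  It says nothing about
whether a real architecture's PTE encoding satisfies the same property, which
is the part that still needs review.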

I prepared a compact evidence/reproducer bundle here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/206a39d/mincore-present-pte-scan

It includes:

  - a standalone C reproducer for the no-THP mincore scan
  - the workload target/profile used by the local experiment framework
  - the local test patch shape, clearly marked not for direct submission
  - compact lab CSV summaries for the v6.16 intro-window A/B, including the
    high-CPU follow-up
  - compact lab CSV summaries for the v6.18, v7.0 and high-CPU present-first
    A/B runs
  - matched-PREEMPT release-level bridge summaries for the 1/2/4 CPU matrix
    and the separate 8CPU/16CPU context rows

I am intentionally not asking regzbot to track this at this stage.  It is a
source-calibrated synthetic signal with a strong x86 lab result across the
primary 1/2/4 CPU matrix and the 8/16CPU present-first A/B follow-up, but it
still needs arm64/mTHP validation and a proper patch before it should be
treated as an upstream-ready fix.

Does this present-first shape look like the right direction to validate further,
or would you prefer a different approach such as a smaller local fastpath around
pte_batch_hint() returning 1?

Thanks,
Chengfeng


* Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
  2026-06-09  7:26 [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching Chengfeng Lin
@ 2026-06-09  9:01 ` David Hildenbrand (Arm)
  2026-06-09  9:55   ` Barry Song
  2026-06-09 14:12   ` Chengfeng Lin
  0 siblings, 2 replies; 5+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-09  9:01 UTC
  To: Chengfeng Lin, Andrew Morton
  Cc: Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka, Jann Horn,
	Pedro Falcato, linux-mm, linux-kernel, Baolin Wang, Barry Song,
	Dev Jain, Ryan Roberts, Zi Yan

On 6/9/26 09:26, Chengfeng Lin wrote:
> Hi,

Hi,

> 
> I found a source-calibrated synthetic mincore() signal in the resident
> base-page PTE path. 

Sorry, I'm confused. Did you mean to say "I found a performance regression"?

> I do not currently have an easy arm64/mTHP validation
> setup, so before trying to arrange that more expensive validation I would like
> to ask whether the candidate fix shape below looks reasonable.
> 
> To keep the scope clear, I am not presenting this as a production application
> regression report or as a generic mincore() regression.  It is a controlled
> reproducer for a real userspace-visible syscall path, with the page-table shape
> kept intentionally simple:
> 
>   mmap() private anonymous memory
>   madvise(MADV_NOHUGEPAGE)
>   fault in all pages
>   repeatedly call mincore() over a resident 64 MiB range

Okay, so I assume a mincore() regression. On arm64?

> 
> The practical hook is that mincore() is the userspace-visible residency query
> for an address range.  The resident anonymous no-THP range is intended to
> isolate the base-page present-PTE scan and avoid file cache, swap, THP, marker,
> and unmapped-range effects.  I would read the result as source-path evidence for
> the hot path below, not as evidence that every mincore() caller or a specific
> application workload regressed.

This reads very obscure and cryptic. Was that written by, or translated by an LLM?

The way it's phrased makes it a bit hard to digest.

> 
> The intended hot path is:
> 
>   mincore()
>     -> walk_page_range()
>        -> mincore_pte_range()
> 
> The main metric is mincore_ns_per_1k_pages, lower is better.  It is the
> wall-clock time spent in the mincore() scan, normalized by the number of pages
> covered by the range and reported as nanoseconds per 1000 pages scanned.
> 
> As a release-level starting point, the matched-PREEMPT 1/2/4 CPU bridge below
> uses original release kernels, QEMU direct boot, 9 repetitions, coverage
> disabled, and the same CONFIG_ADVISE_SYSCALLS setup:
> 
>   scenario: no_thp_pte_scan_64m
>   metric:   mincore_ns_per_1k_pages, lower is better
> 
>   CPU   v6.12.77   v6.18.19   v6.19.9    v7.0.9
>   1     12827.667  15677.444  16482.667  16726.333
>   2     13628.444  16102.333  18256.889  17270.333
>   4     13798.222  16739.333  18892.111  17068.222

Okay, so we see two steps of "degradation". I assume this code is so performance
sensitive that even compiler changes might easily affect it, because all we do
is scan page tables for present entries.

The mincore optimization went into v6.16.

> 
> This shows cumulative cost relative to v6.12 in the primary 1/2/4 CPU matrix.
> I also reran the 8CPU/16CPU release-level bridge on the same scenario.  These
> rows show the same general direction, but the shared lab was busy during this
> rerun and the high-CPU rows have higher CV, so I include them as extended
> context only:
> 
>   CPU/mem     v6.12.77   v6.18.19   v6.19.9    v7.0.9
>   8/16 GiB    17251.889  23335.556  21863.556  21664.778
>   16/32 GiB   16697.333  21428.333  21629.778  21628.333

I don't think measuring concurrency here really makes a lot of sense.

Especially, as it's becoming a rather weird, unrealistic micro-benchmark that way.

> 
> The 16CPU rerun had two QEMU returncode-139 failures in the original 36-run
> matrix; I filled the missing v6.12/v6.18 samples with a clean two-run
> supplement.  I therefore use the high-CPU rows as context for the release
> bridge, not as part of the primary matrix.
> 
> Follow-up release-ladder and A/B testing narrowed the main step to the
> v6.15 -> v6.16 window.  The strongest suspect is:
> 
>   4df65651f7075 ("mm: mincore: use pte_batch_hint() to batch process large folios")
> 
> That patch improved the mTHP/large-folio case, but in this base-page resident
> PTE scan I see a sizeable cost.  The original commit message mentioned that
> base pages did not show an obvious regression, so this may simply be a
> different x86/base-page corner than the original arm64/mTHP test.

Okay, so it is on x86 then?

On x86, pte_batch_hint() is hard-coded at 1, so the expectation is that the loop
and everything should get completely optimized out.

> 
> For the v6.16 introduction-window A/B, all rows below are lab QEMU direct boot,
> 9 repetitions, coverage disabled, same PREEMPT and CONFIG_ADVISE_SYSCALLS
> setup.  The 1/2/4 CPU rows are the primary matrix; the 8CPU/16GiB and
> 16CPU/32GiB rows are the high-CPU follow-up:
> 
>   scenario: no_thp_pte_scan_64m
>   metric:   mincore_ns_per_1k_pages, lower is better
> 
>   CPU/mem     v6.15      v6.16      v6.16 batch<=1 fastpath   v6.16 nobatch
>   1           12946.889  17117.667        14560.556              13843.222
>   2           15053.111  18214.667        15714.778              14270.556
>   4           14942.000  18338.222        14397.889              14719.667
>   8/16 GiB    15046.444  17540.222        13696.333              13200.000
>   16/32 GiB   14674.111  18928.889        13949.000              15351.111
> 
> The high-CPU matrix completed 72/72 with all_cpu_match=true,
> any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true.  One v6.15
> 16CPU timing sample in the main matrix was an obvious outlier, so the 16/32 GiB
> v6.15 value above uses a clean v6.15-only 9-repeat supplement.
> 
> I also ran ftrace attribution on the same path as mechanism evidence, not as
> clean timing.  In that run, v6.16 original had a higher mincore_pte_range
> average than v6.15, v6.16-nobatch, and the batch<=1 fastpath:
> 
>   kernel                       mincore_pte_range avg_us
>   v6.15-mainline-preempt       6.040
>   v6.16-mainline-preempt       7.899
>   v6.16-mainline-nobatch       6.031
>   v6.16-mainline-fastpath      6.103
> 
> The smaller batch<=1 fastpath helped, but later v6.18 testing suggested the
> remaining cost was more about the hot present-PTE branch layout.  The candidate
> shape I tested is to check pte_present() first, while keeping pte_batch_hint()
> for batch > 1:
> 
>   if (pte_present(pte)) {
>           batch = pte_batch_hint(ptep, pte);
>           if (batch > 1)
>                   fill vec[0..step-1];
>           else
>                   *vec = 1;
>   } else if (pte_none(pte) || pte_is_marker(pte)) {
>           __mincore_unmapped_range(...);
>   } else {
>           mincore_swap(...);
>   }
> 
> On x86, pte_batch_hint() defaults to 1, so this mainly measures the
> resident-PTE hot path layout.  On arm64 the batch > 1 path should still be
> preserved, but I have not validated mTHP/contiguous-PTE performance yet.
> 
> The v6.18 confirmation A/B.  The 1/2/4 CPU rows are the primary matrix; the
> 8CPU/16GiB and 16CPU/32GiB rows are the high-CPU follow-up.  All rows use the
> same no-THP scenario, 9 repetitions, and coverage disabled:


Which compiler are you using?

The expectation is that the whole code would get optimized on x86 such that the
behavior is just like before.


-- 
Cheers,

David



* Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
  2026-06-09  9:01 ` David Hildenbrand (Arm)
@ 2026-06-09  9:55   ` Barry Song
  2026-06-09 14:12   ` Chengfeng Lin
  1 sibling, 0 replies; 5+ messages in thread
From: Barry Song @ 2026-06-09  9:55 UTC
  To: David Hildenbrand (Arm)
  Cc: Chengfeng Lin, Andrew Morton, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pedro Falcato, linux-mm, linux-kernel,
	Baolin Wang, Dev Jain, Ryan Roberts, Zi Yan

On Tue, Jun 9, 2026 at 5:01 PM David Hildenbrand (Arm) <david@kernel.org> wrote:
>
> On 6/9/26 09:26, Chengfeng Lin wrote:
> > Hi,
>
> Hi,
>
> >
> > I found a source-calibrated synthetic mincore() signal in the resident
> > base-page PTE path.
>
> sorry, I'm confused. Did you mean to say "I found a performance regression" ?

I guess so.

>
> > I do not currently have an easy arm64/mTHP validation
> > setup, so before trying to arrange that more expensive validation I would like
> > to ask whether the candidate fix shape below looks reasonable.
> >
> > To keep the scope clear, I am not presenting this as a production application
> > regression report or as a generic mincore() regression.  It is a controlled
> > reproducer for a real userspace-visible syscall path, with the page-table shape
> > kept intentionally simple:
> >
> >   mmap() private anonymous memory
> >   madvise(MADV_NOHUGEPAGE)
> >   fault in all pages
> >   repeatedly call mincore() over a resident 64 MiB range
>
> Okay, so I assume a mincore() regression. On arm64?
>
> >
> > The practical hook is that mincore() is the userspace-visible residency query
> > for an address range.  The resident anonymous no-THP range is intended to
> > isolate the base-page present-PTE scan and avoid file cache, swap, THP, marker,
> > and unmapped-range effects.  I would read the result as source-path evidence for
> > the hot path below, not as evidence that every mincore() caller or a specific
> > application workload regressed.
>
> This reads very obscure and cryptic. Was that written by, or translated by an LLM?
>
> The way it's phrased makes it a bit hard to digest.

I really suggest we first write in plain, imperfect English and then
use an LLM to refine it.

A direct translation can sometimes make it very hard for readers to
understand.

I think I share the same native language as Chengfeng, but I still
find this email difficult to read.

>
> >
> > The intended hot path is:
> >
> >   mincore()
> >     -> walk_page_range()
> >        -> mincore_pte_range()
> >
> > The main metric is mincore_ns_per_1k_pages, lower is better.  It is the
> > wall-clock time spent in the mincore() scan, normalized by the number of pages
> > covered by the range and reported as nanoseconds per 1000 pages scanned.
> >
> > As a release-level starting point, the matched-PREEMPT 1/2/4 CPU bridge below
> > uses original release kernels, QEMU direct boot, 9 repetitions, coverage
> > disabled, and the same CONFIG_ADVISE_SYSCALLS setup:

QEMU sometimes produces highly distorted performance numbers, so it would be
better to re-test on physical CPUs.

> >
> >   scenario: no_thp_pte_scan_64m
> >   metric:   mincore_ns_per_1k_pages, lower is better
> >
> >   CPU   v6.12.77   v6.18.19   v6.19.9    v7.0.9
> >   1     12827.667  15677.444  16482.667  16726.333
> >   2     13628.444  16102.333  18256.889  17270.333
> >   4     13798.222  16739.333  18892.111  17068.222
>
> Okay, so we see two steps of "degradation". I assume this code is so performance
> sensitive that even compiler changes might easily affect it. Because all we do
> is scan page tables for present entries.
>
> The mincore optimization went into v6.16.
>
> >
> > This shows cumulative cost relative to v6.12 in the primary 1/2/4 CPU matrix.
> > I also reran the 8CPU/16CPU release-level bridge on the same scenario.  These
> > rows show the same general direction, but the shared lab was busy during this
> > rerun and the high-CPU rows have higher CV, so I include them as extended
> > context only:
> >
> >   CPU/mem     v6.12.77   v6.18.19   v6.19.9    v7.0.9
> >   8/16 GiB    17251.889  23335.556  21863.556  21664.778
> >   16/32 GiB   16697.333  21428.333  21629.778  21628.333
>
> I don't think measuring concurrency here really makes a lot of sense.
>
> Especially, as it's becoming a rather weird, unrealistic micro-benchmark that way.

I'm not sure multithreading is involved here, since the reproducer
is single-threaded:

https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/main/mincore-present-pte-scan/reproducer/mincore_present_pte_scan.c

>
> >
> > The 16CPU rerun had two QEMU returncode-139 failures in the original 36-run
> > matrix; I filled the missing v6.12/v6.18 samples with a clean two-run
> > supplement.  I therefore use the high-CPU rows as context for the release
> > bridge, not as part of the primary matrix.
> >
> > Follow-up release-ladder and A/B testing narrowed the main step to the
> > v6.15 -> v6.16 window.  The strongest suspect is:
> >
> >   4df65651f7075 ("mm: mincore: use pte_batch_hint() to batch process large folios")
> >
> > That patch improved the mTHP/large-folio case, but in this base-page resident
> > PTE scan I see a sizeable cost.  The original commit message mentioned that
> > base pages did not show an obvious regression, so this may simply be a
> > different x86/base-page corner than the original arm64/mTHP test.
>
> Okay, so it is on x86 then?
>
> On x86, pte_batch_hint() is hard-coded at 1, so the expectation is that the loop
> and everything should get completely optimized out.
>
> >
> > For the v6.16 introduction-window A/B, all rows below are lab QEMU direct boot,
> > 9 repetitions, coverage disabled, same PREEMPT and CONFIG_ADVISE_SYSCALLS
> > setup.  The 1/2/4 CPU rows are the primary matrix; the 8CPU/16GiB and
> > 16CPU/32GiB rows are the high-CPU follow-up:

I don’t really understand what the “primary matrix” is, or what
“high-CPU” means.

> >
> >   scenario: no_thp_pte_scan_64m
> >   metric:   mincore_ns_per_1k_pages, lower is better
> >
> >   CPU/mem     v6.15      v6.16      v6.16 batch<=1 fastpath   v6.16 nobatch
> >   1           12946.889  17117.667        14560.556              13843.222
> >   2           15053.111  18214.667        15714.778              14270.556
> >   4           14942.000  18338.222        14397.889              14719.667
> >   8/16 GiB    15046.444  17540.222        13696.333              13200.000
> >   16/32 GiB   14674.111  18928.889        13949.000              15351.111
> >
> > The high-CPU matrix completed 72/72 with all_cpu_match=true,
> > any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true.  One v6.15
> > 16CPU timing sample in the main matrix was an obvious outlier, so the 16/32 GiB
> > v6.15 value above uses a clean v6.15-only 9-repeat supplement.
> >
> > I also ran ftrace attribution on the same path as mechanism evidence, not as
> > clean timing.  In that run, v6.16 original had a higher mincore_pte_range
> > average than v6.15, v6.16-nobatch, and the batch<=1 fastpath:
> >
> >   kernel                       mincore_pte_range avg_us
> >   v6.15-mainline-preempt       6.040
> >   v6.16-mainline-preempt       7.899
> >   v6.16-mainline-nobatch       6.031
> >   v6.16-mainline-fastpath      6.103
> >
> > The smaller batch<=1 fastpath helped, but later v6.18 testing suggested the
> > remaining cost was more about the hot present-PTE branch layout.  The candidate
> > shape I tested is to check pte_present() first, while keeping pte_batch_hint()
> > for batch > 1:
> >
> >   if (pte_present(pte)) {
> >           batch = pte_batch_hint(ptep, pte);
> >           if (batch > 1)
> >                   fill vec[0..step-1];
> >           else
> >                   *vec = 1;
> >   } else if (pte_none(pte) || pte_is_marker(pte)) {
> >           __mincore_unmapped_range(...);
> >   } else {
> >           mincore_swap(...);
> >   }
> >
> > On x86, pte_batch_hint() defaults to 1, so this mainly measures the
> > resident-PTE hot path layout.  On arm64 the batch > 1 path should still be
> > preserved, but I have not validated mTHP/contiguous-PTE performance yet.

Is the performance improvement mainly because the tested PTEs are always
present, so that the patch avoids some conditional branches?
https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/blob/main/mincore-present-pte-scan/patches/mincore-present-first-fastpath-rfc.patch

In any case, it looks like the gain does not come from fixing an actual issue
in the existing logic.

> >
> > The v6.18 confirmation A/B.  The 1/2/4 CPU rows are the primary matrix; the
> > 8CPU/16GiB and 16CPU/32GiB rows are the high-CPU follow-up.  All rows use the
> > same no-THP scenario, 9 repetitions, and coverage disabled:
>
>
> Which compiler are you using?
>
> The expectation is that the whole code would get optimized on x86 such that the
> behavior is just like before.

Yes, this is likely due to QEMU behavior and some compiler-related effects.

Best Regards
Barry



* Re: Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
  2026-06-09  9:01 ` David Hildenbrand (Arm)
  2026-06-09  9:55   ` Barry Song
@ 2026-06-09 14:12   ` Chengfeng Lin
  2026-06-09 14:27     ` David Hildenbrand (Arm)
  1 sibling, 1 reply; 5+ messages in thread
From: Chengfeng Lin @ 2026-06-09 14:12 UTC
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pedro Falcato, linux-mm, linux-kernel, Baolin Wang,
	Barry Song, Dev Jain, Ryan Roberts, Zi Yan

Hi David,

Thanks, and sorry for the confusing wording.  The plain statement is: I
observed a performance difference in a narrow x86/QEMU synthetic mincore()
case, and after your comment I checked whether this is really a codegen issue.

The wording in my first mail was too abstract.  What I was trying to say is
only that the benchmark focuses on one specific case:

  private anonymous memory
  MADV_NOHUGEPAGE
  faulted/resident base pages
  repeated mincore() over the range

so the measured path should mostly be the present-PTE scan in
mincore_pte_range().  I agree that the 8/16 CPU rows are not very useful for
this path; please treat them as extra context only.  The useful data is the
single-threaded / low-CPU v6.15 -> v6.16 A/B and the patched variants.

The compiler used for the lab kernels was:

  gcc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
  GNU ld (GNU Binutils for Ubuntu) 2.42

Your point about x86 pte_batch_hint() is exactly the right thing to check.
Since pte_batch_hint() returns 1 on x86, I agree that the expectation would be
for the compiler to optimize the batching logic back down to something very
close to the old base-page path.

I checked the generated mincore_pte_range() code with the same GCC/config setup.
The function sizes from nm are:

  v6.15 original:              0x1fb
  v6.16 original:              0x245
  v6.16 batch<=1 fastpath:     0x1ec
  v6.16 with batching removed: 0x1ec

So, with GCC 13.3, the v6.16 original build does not look optimized back to the
old x86 base-page shape.  The v6.16 batch<=1 fastpath and the v6.16 nobatch
variant produce the same mincore_pte_range() objdump output in my build.

I also checked Clang 18.1.3 as a cross-check.  With Clang, v6.15 original,
v6.16 original, v6.16 batch<=1 fastpath and v6.16 nobatch all produce the same
mincore_pte_range() size, 0x1f9, and the objdump output is byte-identical.

So your expectation does hold with Clang, but not with the GCC 13.3 build I used
for the original lab runs.  This does not prove a compiler bug, and it means my
original report should be narrowed: it is not a generic x86 mincore()
regression claim.  In this check, GCC 13.3 generates a different
mincore_pte_range() shape for v6.16 original, while Clang 18.1.3 generates
byte-identical output for all checked variants.  The timing signal I reported
came from the GCC-built QEMU lab kernels.

I put the compact codegen summary and the relevant nm/objdump snippets here:

  https://github.com/lcf0399/linux-mm-regression-evidence-2026-05/tree/a5e3312deafd97321aa99a32772180989949fa59/mincore-present-pte-scan/codegen

Thanks,
Chengfeng


> -----Original Message-----
> From: "David Hildenbrand (Arm)" <david@kernel.org>
> Sent: 2026-06-09 17:01:51 (Tuesday)
> To: "Chengfeng Lin" <chengfenglin@stu.xmu.edu.cn>, "Andrew Morton" <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <liam@infradead.org>, "Lorenzo Stoakes" <ljs@kernel.org>, "Vlastimil Babka" <vbabka@kernel.org>, "Jann Horn" <jannh@google.com>, "Pedro Falcato" <pfalcato@suse.de>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Baolin Wang" <baolin.wang@linux.alibaba.com>, "Barry Song" <baohua@kernel.org>, "Dev Jain" <dev.jain@arm.com>, "Ryan Roberts" <ryan.roberts@arm.com>, "Zi Yan" <ziy@nvidia.com>
> Subject: Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
> 
> On 6/9/26 09:26, Chengfeng Lin wrote:
> > Hi,
> 
> Hi,
> 
> > 
> > I found a source-calibrated synthetic mincore() signal in the resident
> > base-page PTE path. 
> 
> sorry, I'm confused. Did you mean to say "I found a performance regression" ?
> 
> > I do not currently have an easy arm64/mTHP validation
> > setup, so before trying to arrange that more expensive validation I would like
> > to ask whether the candidate fix shape below looks reasonable.
> > 
> > To keep the scope clear, I am not presenting this as a production application
> > regression report or as a generic mincore() regression.  It is a controlled
> > reproducer for a real userspace-visible syscall path, with the page-table shape
> > kept intentionally simple:
> > 
> >   mmap() private anonymous memory
> >   madvise(MADV_NOHUGEPAGE)
> >   fault in all pages
> >   repeatedly call mincore() over a resident 64 MiB range
> 
> Okay, so I assume a mincore() regression. On arm64?
> 
> > 
> > The practical hook is that mincore() is the userspace-visible residency query
> > for an address range.  The resident anonymous no-THP range is intended to
> > isolate the base-page present-PTE scan and avoid file cache, swap, THP, marker,
> > and unmapped-range effects.  I would read the result as source-path evidence for
> > the hot path below, not as evidence that every mincore() caller or a specific
> > application workload regressed.
> 
> This reads very obscure and cryptic. Was that written by, or translated by an LLM?
> 
> The way it's phrased makes it a bit hard to digest.
> 
> > 
> > The intended hot path is:
> > 
> >   mincore()
> >     -> walk_page_range()
> >        -> mincore_pte_range()
> > 
> > The main metric is mincore_ns_per_1k_pages, lower is better.  It is the
> > wall-clock time spent in the mincore() scan, normalized by the number of pages
> > covered by the range and reported as nanoseconds per 1000 pages scanned.
> > 
> > As a release-level starting point, the matched-PREEMPT 1/2/4 CPU bridge below
> > uses original release kernels, QEMU direct boot, 9 repetitions, coverage
> > disabled, and the same CONFIG_ADVISE_SYSCALLS setup:
> > 
> >   scenario: no_thp_pte_scan_64m
> >   metric:   mincore_ns_per_1k_pages, lower is better
> > 
> >   CPU   v6.12.77   v6.18.19   v6.19.9    v7.0.9
> >   1     12827.667  15677.444  16482.667  16726.333
> >   2     13628.444  16102.333  18256.889  17270.333
> >   4     13798.222  16739.333  18892.111  17068.222
> 
> Okay, so we see two steps of "degradation". I assume this code is so performance
> sensitive that even compiler changes might easily affect it, because all we do
> is scan page tables for present entries.
> 
> The mincore optimization went into v6.16.
> 
> > 
> > This shows cumulative cost relative to v6.12 in the primary 1/2/4 CPU matrix.
> > I also reran the 8CPU/16CPU release-level bridge on the same scenario.  These
> > rows show the same general direction, but the shared lab was busy during this
> > rerun and the high-CPU rows have higher CV, so I include them as extended
> > context only:
> > 
> >   CPU/mem     v6.12.77   v6.18.19   v6.19.9    v7.0.9
> >   8/16 GiB    17251.889  23335.556  21863.556  21664.778
> >   16/32 GiB   16697.333  21428.333  21629.778  21628.333
> 
> I don't think measuring concurrency here really makes a lot of sense,
> especially as it turns this into a rather weird, unrealistic micro-benchmark.
> 
> > 
> > The 16CPU rerun had two QEMU returncode-139 failures in the original 36-run
> > matrix; I filled the missing v6.12/v6.18 samples with a clean two-run
> > supplement.  I therefore use the high-CPU rows as context for the release
> > bridge, not as part of the primary matrix.
> > 
> > Follow-up release-ladder and A/B testing narrowed the main step to the
> > v6.15 -> v6.16 window.  The strongest suspect is:
> > 
> >   4df65651f7075 ("mm: mincore: use pte_batch_hint() to batch process large folios")
> > 
> > That patch improved the mTHP/large-folio case, but in this base-page resident
> > PTE scan I see a sizeable cost.  The original commit message mentioned that
> > base pages did not show an obvious regression, so this may simply be a
> > different x86/base-page corner than the original arm64/mTHP test.
> 
> Okay, so it is on x86 then?
> 
> On x86, pte_batch_hint() is hard-coded to 1, so the expectation is that the
> batching loop should get completely optimized out.
> 
> > 
> > For the v6.16 introduction-window A/B, all rows below are lab QEMU direct boot,
> > 9 repetitions, coverage disabled, same PREEMPT and CONFIG_ADVISE_SYSCALLS
> > setup.  The 1/2/4 CPU rows are the primary matrix; the 8CPU/16GiB and
> > 16CPU/32GiB rows are the high-CPU follow-up:
> > 
> >   scenario: no_thp_pte_scan_64m
> >   metric:   mincore_ns_per_1k_pages, lower is better
> > 
> >   CPU/mem     v6.15      v6.16      v6.16 batch<=1 fastpath   v6.16 nobatch
> >   1           12946.889  17117.667        14560.556              13843.222
> >   2           15053.111  18214.667        15714.778              14270.556
> >   4           14942.000  18338.222        14397.889              14719.667
> >   8/16 GiB    15046.444  17540.222        13696.333              13200.000
> >   16/32 GiB   14674.111  18928.889        13949.000              15351.111
> > 
> > The high-CPU matrix completed 72/72 with all_cpu_match=true,
> > any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true.  One v6.15
> > 16CPU timing sample in the main matrix was an obvious outlier, so the 16/32 GiB
> > v6.15 value above uses a clean v6.15-only 9-repeat supplement.
> > 
> > I also ran ftrace attribution on the same path as mechanism evidence, not as
> > clean timing.  In that run, v6.16 original had a higher mincore_pte_range
> > average than v6.15, v6.16-nobatch, and the batch<=1 fastpath:
> > 
> >   kernel                       mincore_pte_range avg_us
> >   v6.15-mainline-preempt       6.040
> >   v6.16-mainline-preempt       7.899
> >   v6.16-mainline-nobatch       6.031
> >   v6.16-mainline-fastpath      6.103
> > 
> > The smaller batch<=1 fastpath helped, but later v6.18 testing suggested the
> > remaining cost was more about the hot present-PTE branch layout.  The candidate
> > shape I tested is to check pte_present() first, while keeping pte_batch_hint()
> > for batch > 1:
> > 
> >   if (pte_present(pte)) {
> >           batch = pte_batch_hint(ptep, pte);
> >           if (batch > 1)
> >                   fill vec[0..step-1];
> >           else
> >                   *vec = 1;
> >   } else if (pte_none(pte) || pte_is_marker(pte)) {
> >           __mincore_unmapped_range(...);
> >   } else {
> >           mincore_swap(...);
> >   }
> > 
> > On x86, pte_batch_hint() defaults to 1, so this mainly measures the
> > resident-PTE hot path layout.  On arm64 the batch > 1 path should still be
> > preserved, but I have not validated mTHP/contiguous-PTE performance yet.
> > 
> > The v6.18 confirmation A/B.  The 1/2/4 CPU rows are the primary matrix; the
> > 8CPU/16GiB and 16CPU/32GiB rows are the high-CPU follow-up.  All rows use the
> > same no-THP scenario, 9 repetitions, and coverage disabled:
> 
> 
> Which compiler are you using?
> 
> The expectation is that the whole code would get optimized on x86 such that the
> behavior is just like before.
> 
> 
> -- 
> Cheers,
> 
> David


* Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
  2026-06-09 14:12   ` Chengfeng Lin
@ 2026-06-09 14:27     ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 5+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-09 14:27 UTC (permalink / raw)
  To: Chengfeng Lin
  Cc: Andrew Morton, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pedro Falcato, linux-mm, linux-kernel, Baolin Wang,
	Barry Song, Dev Jain, Ryan Roberts, Zi Yan

On 6/9/26 16:12, Chengfeng Lin wrote:
> Hi David,

Hi,

> Thanks, and sorry for the confusing wording.  The plain statement is: I
> observed a performance difference in a narrow x86/QEMU synthetic mincore()
> case, and after your comment I checked whether this is really a codegen issue.
> 
> The wording in my first mail was too abstract.  What I was trying to say is
> only that the benchmark focuses on one specific case:
> 
>   private anonymous memory
>   MADV_NOHUGEPAGE
>   faulted/resident base pages
>   repeated mincore() over the range
> 
> so the measured path should mostly be the present-PTE scan in
> mincore_pte_range().  I agree that the 8/16 CPU rows are not very useful for
> this path; please treat them as extra context only.  The useful data is the
> single-threaded / low-CPU v6.15 -> v6.16 A/B and the patched variants.
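
For reference, the four steps quoted above can be sketched in userspace roughly
as follows. This is only a minimal sketch, not the reproducer used for the lab
numbers: the helper name mincore_ns_per_1k_pages() is hypothetical (borrowed
from the metric name), CLOCK_MONOTONIC is an assumed choice of clock, and the
madvise() call is treated as best effort because it fails on kernels built
without THP support.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: average cost of mincore() over a resident private
 * anonymous range of `len` bytes, in nanoseconds per 1000 pages scanned.
 * Returns -1.0 on error.
 */
static double mincore_ns_per_1k_pages(size_t len, int reps)
{
	long psz = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;
	unsigned char *vec;
	size_t page, pages, i;
	double ns = -1.0;
	char *buf;
	int r;

	if (psz <= 0 || reps <= 0)
		return -1.0;
	page = (size_t)psz;
	pages = len / page;
	if (pages == 0)
		return -1.0;
	len = pages * page;	/* round down to whole pages */

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;
	vec = malloc(pages);	/* mincore() writes one byte per page */
	if (!vec)
		goto out_unmap;

	/* Best effort: fails with EINVAL when THP is not configured. */
	(void)madvise(buf, len, MADV_NOHUGEPAGE);

	/* Fault in every page so the scan only sees present PTEs. */
	for (i = 0; i < pages; i++)
		buf[i * page] = 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (r = 0; r < reps; r++) {
		if (mincore(buf, len, vec) != 0)
			goto out_free;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	     (double)(t1.tv_nsec - t0.tv_nsec);
	ns = ns / reps / (double)pages * 1000.0;

out_free:
	free(vec);
out_unmap:
	munmap(buf, len);
	return ns;
}
```

Calling mincore_ns_per_1k_pages(64UL << 20, 1000) would cover the 64 MiB case;
the repetition count is an arbitrary choice here, not a value from the report.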
> 
> The compiler used for the lab kernels was:
> 
>   gcc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0

Okay, GCC 13 was released 3 years ago.

>   GNU ld (GNU Binutils for Ubuntu) 2.42
> 
> Your point about x86 pte_batch_hint() is exactly the right thing to check.
> Since pte_batch_hint() returns 1 on x86, I agree that the expectation would be
> for the compiler to optimize the batching logic back down to something very
> close to the old base-page path.
> 
> I checked the generated mincore_pte_range() code with the same GCC/config setup.
> The function sizes from nm are:
> 
>   v6.15 original:              0x1fb
>   v6.16 original:              0x245
>   v6.16 batch<=1 fastpath:     0x1ec
>   v6.16 with batching removed: 0x1ec
> 
> So, with GCC 13.3, the v6.16 original build does not look optimized back to the
> old x86 base-page shape.  The v6.16 batch<=1 fastpath and the v6.16 nobatch
> variant produce the same mincore_pte_range() objdump output in my build.
> 
> I also checked Clang 18.1.3 as a cross-check.  With Clang, v6.15 original,
> v6.16 original, v6.16 batch<=1 fastpath and v6.16 nobatch all produce the same
> mincore_pte_range() size, 0x1f9, and the objdump output is byte-identical.
> 
> So your expectation does hold with Clang, but not with the GCC 13.3 build I used
> for the original lab runs.  This does not prove a compiler bug, and it means my
> original report should be narrowed: it is not a generic x86 mincore()
> regression claim.  In this check, GCC 13.3 generates a different
> mincore_pte_range() shape for v6.16 original, while Clang 18.1.3 generates
> byte-identical output for all checked variants.  The timing signal I reported
> came from the GCC-built QEMU lab kernels.

It's probably a good idea to

1) Try with newer GCC

2) Take a look at the actual difference in the generated code

Is it down to inlining decisions? E.g., if the function is larger, is that
because other code got inlined into it?

The function is not particularly large, so it's a bit unexpected.
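
One way to look at the codegen question in isolation might be a small userspace
model like the one below. It is only a sketch under stated assumptions, not
kernel code: the toy entry format, PRESENT, batch_hint(), scan_nobatch() and
scan_batch() are all hypothetical stand-ins. The idea is that when the hint is
a compile-time constant 1, the batched loop should ideally compile to about the
same code as the plain loop, which can be checked by comparing the assembly the
compiler emits for both functions at the same optimization level.

```c
#include <stddef.h>

#define PRESENT 0x1u	/* hypothetical "present" bit of a toy entry */

/* Stand-in for a batch hint that always reports a batch of one. */
static inline unsigned int batch_hint(const unsigned int *ptep,
				      unsigned int pte)
{
	(void)ptep;
	(void)pte;
	return 1;
}

/* Pre-batching shape: handle exactly one entry per iteration. */
void scan_nobatch(const unsigned int *pt, size_t nr, unsigned char *vec)
{
	size_t i;

	for (i = 0; i < nr; i++)
		vec[i] = (pt[i] & PRESENT) ? 1 : 0;
}

/* Batching shape: advance by the hinted step and fill `step` bytes. */
void scan_batch(const unsigned int *pt, size_t nr, unsigned char *vec)
{
	size_t i = 0, j, step;

	while (i < nr) {
		step = 1;
		if (pt[i] & PRESENT) {
			step = batch_hint(&pt[i], pt[i]);
			if (step > nr - i)	/* clamp to the remaining range */
				step = nr - i;
			for (j = 0; j < step; j++)
				vec[i + j] = 1;
		} else {
			vec[i] = 0;
		}
		i += step;
	}
}
```

Both functions produce the same vec for the same input, so any difference
between them is purely in the generated code.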

-- 
Cheers,

David




Thread overview: 5+ messages
2026-06-09  7:26 [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching Chengfeng Lin
2026-06-09  9:01 ` David Hildenbrand (Arm)
2026-06-09  9:55   ` Barry Song
2026-06-09 14:12   ` Chengfeng Lin
2026-06-09 14:27     ` David Hildenbrand (Arm)
