Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching

From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Chengfeng Lin <chengfenglin@stu.xmu.edu.cn>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <liam@infradead.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	Vlastimil Babka <vbabka@kernel.org>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Barry Song <baohua@kernel.org>, Dev Jain <dev.jain@arm.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Zi Yan <ziy@nvidia.com>
Subject: Re: [RFC] mm/mincore: present-PTE scan cost after pte_batch_hint() batching
Date: Tue, 9 Jun 2026 11:01:51 +0200
Message-ID: <74cdec57-67aa-4ae3-a416-e39846049ab1@kernel.org>
In-Reply-To: <5fb8bead.17483.19eab4626f5.Coremail.chengfenglin@stu.xmu.edu.cn>

On 6/9/26 09:26, Chengfeng Lin wrote:
> Hi,

Hi,

> 
> I found a source-calibrated synthetic mincore() signal in the resident
> base-page PTE path. 

Sorry, I'm confused. Did you mean to say "I found a performance regression"?

> I do not currently have an easy arm64/mTHP validation
> setup, so before trying to arrange that more expensive validation I would like
> to ask whether the candidate fix shape below looks reasonable.
> 
> To keep the scope clear, I am not presenting this as a production application
> regression report or as a generic mincore() regression.  It is a controlled
> reproducer for a real userspace-visible syscall path, with the page-table shape
> kept intentionally simple:
> 
>   mmap() private anonymous memory
>   madvise(MADV_NOHUGEPAGE)
>   fault in all pages
>   repeatedly call mincore() over a resident 64 MiB range

Okay, so I assume a mincore() regression. On arm64?
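
For reference, the four quoted steps and the mincore_ns_per_1k_pages metric
could be sketched in userspace roughly as below. This is a minimal sketch, not
the reporter's actual harness: the function name, the iteration count passed
by the caller, and the choice of CLOCK_MONOTONIC are assumptions.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

static double now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/*
 * Hypothetical helper: returns the average mincore() cost in nanoseconds
 * per 1000 pages scanned, or -1.0 on error. 'len' must be a multiple of
 * the page size.
 */
static double mincore_ns_per_1k_pages(size_t len, int iterations)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	size_t pages = len / page_size;
	unsigned char *vec;
	char *buf;
	double t0, t1, result = -1.0;

	assert(pages > 0 && iterations > 0 && len % page_size == 0);

	vec = malloc(pages);
	if (!vec)
		return -1.0;

	/* Step 1: private anonymous mapping. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		free(vec);
		return -1.0;
	}

	/* Step 2: opt out of THP; the call fails on kernels without THP. */
	(void)madvise(buf, len, MADV_NOHUGEPAGE);

	/* Step 3: fault in every page by writing to it. */
	memset(buf, 1, len);

	/* Step 4: time repeated mincore() calls over the resident range. */
	t0 = now_ns();
	for (int i = 0; i < iterations; i++)
		if (mincore(buf, len, vec) != 0)
			goto out;
	t1 = now_ns();

	/* Sanity check: every page should be reported resident. */
	for (size_t i = 0; i < pages; i++)
		if (!(vec[i] & 1))
			goto out;

	result = (t1 - t0) / iterations / (double)pages * 1000.0;
out:
	munmap(buf, len);
	free(vec);
	return result;
}
```

Calling mincore_ns_per_1k_pages(64UL << 20, n) from a small main() would
correspond to the quoted 64 MiB scenario; the value of n used in the report
is not stated.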

> 
> The practical hook is that mincore() is the userspace-visible residency query
> for an address range.  The resident anonymous no-THP range is intended to
> isolate the base-page present-PTE scan and avoid file cache, swap, THP, marker,
> and unmapped-range effects.  I would read the result as source-path evidence for
> the hot path below, not as evidence that every mincore() caller or a specific
> application workload regressed.

This reads as rather obscure and cryptic. Was it written or translated by an LLM?

The phrasing makes it a bit hard to digest.

> 
> The intended hot path is:
> 
>   mincore()
>     -> walk_page_range()
>        -> mincore_pte_range()
> 
> The main metric is mincore_ns_per_1k_pages, lower is better.  It is the
> wall-clock time spent in the mincore() scan, normalized by the number of pages
> covered by the range and reported as nanoseconds per 1000 pages scanned.
> 
> As a release-level starting point, the matched-PREEMPT 1/2/4 CPU bridge below
> uses original release kernels, QEMU direct boot, 9 repetitions, coverage
> disabled, and the same CONFIG_ADVISE_SYSCALLS setup:
> 
>   scenario: no_thp_pte_scan_64m
>   metric:   mincore_ns_per_1k_pages, lower is better
> 
>   CPU   v6.12.77   v6.18.19   v6.19.9    v7.0.9
>   1     12827.667  15677.444  16482.667  16726.333
>   2     13628.444  16102.333  18256.889  17270.333
>   4     13798.222  16739.333  18892.111  17068.222

Okay, so we see two steps of "degradation". I assume this code is so
performance-sensitive that even compiler changes might easily affect it,
because all we do is scan page tables for present entries.

The mincore optimization went into v6.16.

> 
> This shows cumulative cost relative to v6.12 in the primary 1/2/4 CPU matrix.
> I also reran the 8CPU/16CPU release-level bridge on the same scenario.  These
> rows show the same general direction, but the shared lab was busy during this
> rerun and the high-CPU rows have higher CV, so I include them as extended
> context only:
> 
>   CPU/mem     v6.12.77   v6.18.19   v6.19.9    v7.0.9
>   8/16 GiB    17251.889  23335.556  21863.556  21664.778
>   16/32 GiB   16697.333  21428.333  21629.778  21628.333

I don't think measuring concurrency here makes a lot of sense, especially as
it turns this into a rather weird, unrealistic micro-benchmark.

> 
> The 16CPU rerun had two QEMU returncode-139 failures in the original 36-run
> matrix; I filled the missing v6.12/v6.18 samples with a clean two-run
> supplement.  I therefore use the high-CPU rows as context for the release
> bridge, not as part of the primary matrix.
> 
> Follow-up release-ladder and A/B testing narrowed the main step to the
> v6.15 -> v6.16 window.  The strongest suspect is:
> 
>   4df65651f7075 ("mm: mincore: use pte_batch_hint() to batch process large folios")
> 
> That patch improved the mTHP/large-folio case, but in this base-page resident
> PTE scan I see a sizeable cost.  The original commit message mentioned that
> base pages did not show an obvious regression, so this may simply be a
> different x86/base-page corner than the original arm64/mTHP test.

Okay, so it is on x86 then?

On x86, pte_batch_hint() is hard-coded to return 1, so the expectation is that
the batching loop and everything around it gets completely optimized out.
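
To illustrate that expectation, here is a small userspace model. It is only a
sketch: model_pte_t, model_pte_present() and model_pte_batch_hint() are
hypothetical stand-ins, and the loop is loosely modelled on the batched scan
rather than copied from mm/mincore.c. With a hint that is a compile-time
constant 1, the "batch > 1" branch is dead code and the inner fill loop runs
exactly once, so the batched scan should reduce to the simple per-entry scan.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a PTE: bit 0 set means "present". */
typedef unsigned long model_pte_t;

static inline int model_pte_present(model_pte_t pte)
{
	return (pte & 1UL) != 0;
}

/* Models a hint that always returns 1, as on x86. */
static inline unsigned int model_pte_batch_hint(const model_pte_t *ptep,
						 model_pte_t pte)
{
	(void)ptep;
	(void)pte;
	return 1;
}

/* Batched shape: step over 'batch' entries when the hint exceeds 1. */
static void scan_batched(const model_pte_t *ptes, size_t nr,
			 unsigned char *vec)
{
	size_t i = 0;

	assert(ptes && vec);
	while (i < nr) {
		model_pte_t pte = ptes[i];
		size_t step = 1;

		if (model_pte_present(pte)) {
			unsigned int batch = model_pte_batch_hint(&ptes[i], pte);

			if (batch > 1) {	/* never true with a constant 1 */
				size_t max_nr = nr - i;

				step = batch < max_nr ? batch : max_nr;
			}
			for (size_t k = 0; k < step; k++)
				vec[i + k] = 1;
		} else {
			vec[i] = 0;	/* simplification: no cache lookup */
		}
		i += step;
	}
}

/* Pre-batching shape: one entry per iteration. */
static void scan_simple(const model_pte_t *ptes, size_t nr,
			unsigned char *vec)
{
	assert(ptes && vec);
	for (size_t i = 0; i < nr; i++)
		vec[i] = model_pte_present(ptes[i]) ? 1 : 0;
}

/* Returns 1 when both scans produce the same residency vector. */
static int scans_agree(void)
{
	const model_pte_t ptes[8] = { 1, 1, 0, 1, 0, 0, 1, 1 };
	unsigned char a[8], b[8];

	scan_batched(ptes, 8, a);
	scan_simple(ptes, 8, b);
	return memcmp(a, b, sizeof(a)) == 0;
}
```

Whether a given compiler really produces identical code for both shapes is
exactly what would need checking in the generated assembly; the model only
shows that the two are functionally equivalent when the hint is 1.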

> 
> For the v6.16 introduction-window A/B, all rows below are lab QEMU direct boot,
> 9 repetitions, coverage disabled, same PREEMPT and CONFIG_ADVISE_SYSCALLS
> setup.  The 1/2/4 CPU rows are the primary matrix; the 8CPU/16GiB and
> 16CPU/32GiB rows are the high-CPU follow-up:
> 
>   scenario: no_thp_pte_scan_64m
>   metric:   mincore_ns_per_1k_pages, lower is better
> 
>   CPU/mem     v6.15      v6.16      v6.16 batch<=1 fastpath   v6.16 nobatch
>   1           12946.889  17117.667        14560.556              13843.222
>   2           15053.111  18214.667        15714.778              14270.556
>   4           14942.000  18338.222        14397.889              14719.667
>   8/16 GiB    15046.444  17540.222        13696.333              13200.000
>   16/32 GiB   14674.111  18928.889        13949.000              15351.111
> 
> The high-CPU matrix completed 72/72 with all_cpu_match=true,
> any_noapic=false, all_autorun_exit0=true, and all_semantic_ok=true.  One v6.15
> 16CPU timing sample in the main matrix was an obvious outlier, so the 16/32 GiB
> v6.15 value above uses a clean v6.15-only 9-repeat supplement.
> 
> I also ran ftrace attribution on the same path as mechanism evidence, not as
> clean timing.  In that run, v6.16 original had a higher mincore_pte_range
> average than v6.15, v6.16-nobatch, and the batch<=1 fastpath:
> 
>   kernel                       mincore_pte_range avg_us
>   v6.15-mainline-preempt       6.040
>   v6.16-mainline-preempt       7.899
>   v6.16-mainline-nobatch       6.031
>   v6.16-mainline-fastpath      6.103
> 
> The smaller batch<=1 fastpath helped, but later v6.18 testing suggested the
> remaining cost was more about the hot present-PTE branch layout.  The candidate
> shape I tested is to check pte_present() first, while keeping pte_batch_hint()
> for batch > 1:
> 
>   if (pte_present(pte)) {
>           batch = pte_batch_hint(ptep, pte);
>           if (batch > 1)
>                   fill vec[0..step-1];
>           else
>                   *vec = 1;
>   } else if (pte_none(pte) || pte_is_marker(pte)) {
>           __mincore_unmapped_range(...);
>   } else {
>           mincore_swap(...);
>   }
> 
> On x86, pte_batch_hint() defaults to 1, so this mainly measures the
> resident-PTE hot path layout.  On arm64 the batch > 1 path should still be
> preserved, but I have not validated mTHP/contiguous-PTE performance yet.
> 
> The v6.18 confirmation A/B.  The 1/2/4 CPU rows are the primary matrix; the
> 8CPU/16GiB and 16CPU/32GiB rows are the high-CPU follow-up.  All rows use the
> same no-THP scenario, 9 repetitions, and coverage disabled:


Which compiler are you using?

The expectation is that the compiler optimizes the whole thing on x86 such
that the behavior is just like before.
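
For discussion purposes, the present-first candidate shape quoted earlier
could be modelled in userspace as below. This is a sketch under stated
assumptions, not kernel code: struct model_pte, enum model_kind and
scan_present_first() are hypothetical, the batch hint is stored per entry
instead of being queried from the architecture, and non-present entries are
simply recorded as 0 where the kernel would consult the page cache or swap
cache.

```c
#include <assert.h>
#include <stddef.h>

enum model_kind { MODEL_NONE, MODEL_MARKER, MODEL_PRESENT, MODEL_SWAP };

/* Hypothetical stand-in for a PTE plus its batch hint. */
struct model_pte {
	enum model_kind kind;
	unsigned int batch_hint;	/* 1 on x86; may exceed 1 elsewhere */
};

/*
 * Present-first scan: test the hot "present" case before the none/marker
 * and swap cases. Returns the number of entries marked resident.
 */
static size_t scan_present_first(const struct model_pte *ptes, size_t nr,
				 unsigned char *vec)
{
	size_t resident = 0;
	size_t i = 0;

	assert(ptes && vec);
	while (i < nr) {
		const struct model_pte *pte = &ptes[i];
		size_t step = 1;

		if (pte->kind == MODEL_PRESENT) {
			if (pte->batch_hint > 1) {
				/* Clamp the batch to the end of the range. */
				size_t max_nr = nr - i;

				step = pte->batch_hint < max_nr ?
				       pte->batch_hint : max_nr;
				for (size_t k = 0; k < step; k++)
					vec[i + k] = 1;
			} else {
				vec[i] = 1;
			}
			resident += step;
		} else if (pte->kind == MODEL_NONE ||
			   pte->kind == MODEL_MARKER) {
			vec[i] = 0;	/* kernel: unmapped-range lookup */
		} else {
			vec[i] = 0;	/* kernel: swap entry handling */
		}
		i += step;
	}
	return resident;
}
```

The model keeps the batch > 1 path intact, so it covers both the
single-entry case and a contiguous-hint case, but it says nothing about
branch layout or timing on real hardware.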


-- 
Cheers,

David
