Re: [PATCH v11 7/8] mm: folio_zero_user: clear page ranges

From: Ankur Arora <ankur.a.arora@oracle.com>
To: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	akpm@linux-foundation.org, bp@alien8.de,
	dave.hansen@linux.intel.com, hpa@zytor.com, mingo@redhat.com,
	mjguzik@gmail.com, luto@kernel.org, peterz@infradead.org,
	tglx@linutronix.de, willy@infradead.org, raghavendra.kt@amd.com,
	chleroy@kernel.org, ioworker0@gmail.com, lizhe.67@bytedance.com,
	boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH v11 7/8] mm: folio_zero_user: clear page ranges
Date: Wed, 07 Jan 2026 16:44:44 -0800
Message-ID: <87eco1rkzn.fsf@oracle.com>
In-Reply-To: <409dc029-ad8a-4f7e-931e-8044b61a0295@kernel.org>


David Hildenbrand (Red Hat) <david@kernel.org> writes:

> On 1/7/26 08:20, Ankur Arora wrote:
>> Use batch clearing in clear_contig_highpages() instead of clearing a
>> single page at a time. Exposing larger ranges enables the processor to
>> optimize based on extent.
>> To do this we just switch to using clear_user_highpages() which would
>> in turn use clear_user_pages() or clear_pages().
>> Batched clearing, when running under non-preemptible models, however,
>> has latency considerations. In particular, we need periodic invocations
>> of cond_resched() to keep to reasonable preemption latencies.
>> This is a problem because the clearing primitives do not, or might not
>> be able to, call cond_resched() to check if preemption is needed.
>> So, limit the worst case preemption latency by doing the clearing in
>> units of no more than PROCESS_PAGES_NON_PREEMPT_BATCH pages.
>> (Preemptible models already define away most of cond_resched(), so the
>> batch size is ignored when running under those.)
>> PROCESS_PAGES_NON_PREEMPT_BATCH: for architectures with "fast"
>> clear-pages (ones that define clear_pages()), we define it as 32MB
>> worth of pages. This is meant to be large enough to allow the processor
>> to optimize the operation and yet small enough that we see reasonable
>> preemption latency for when this optimization is not possible
>> (ex. slow microarchitectures, memory bandwidth saturation.)
>> This specific value also allows for a cacheline allocation elision
>> optimization (which might help unrelated applications by not evicting
>> potentially useful cache lines) that kicks in on recent generations of
>> AMD Zen processors at around LLC-size (32MB is a typical size).
>> At the same time 32MB is small enough that even with poor clearing
>> bandwidth (say ~10GBps), time to clear 32MB should be well below the
>> scheduler's default warning threshold (sysctl_resched_latency_warn_ms=100).
>> "Slow" architectures (don't have clear_pages()) will continue to use
>> the base value (single page).
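
As a rough check of the latency numbers above, here is a minimal
userspace sketch in C. The helper names and the bandwidth argument are
assumptions made for illustration; only the 32MB batch size and the
100ms warning threshold come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* 32MB batch, as chosen in the patch. */
#define BATCH_BYTES		(32ULL << 20)
/* Default sysctl_resched_latency_warn_ms (100), converted to microseconds. */
#define WARN_THRESHOLD_USECS	(100ULL * 1000)

/*
 * Time in microseconds to clear one batch at bytes_per_sec.
 * bytes_per_sec is an assumed input and must be non-zero.
 */
static inline uint64_t batch_clear_usecs(uint64_t bytes_per_sec)
{
	assert(bytes_per_sec != 0);
	return BATCH_BYTES * 1000000ULL / bytes_per_sec;
}

/* Non-zero if one batch at this bandwidth stays under the warning threshold. */
static inline int batch_within_threshold(uint64_t bytes_per_sec)
{
	return batch_clear_usecs(bytes_per_sec) < WARN_THRESHOLD_USECS;
}
```

With an assumed bandwidth of 10GBps (10^10 bytes/s) this gives a little
over 3ms per batch, well under the 100ms threshold.
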
>> Performance
>> ==
>> Testing a demand fault workload shows a decent improvement in bandwidth
>> with pg-sz=1GB. Bandwidth with pg-sz=2MB stays flat.
>>   $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>>                     contiguous-pages       batched-pages
>>                     (GBps +- %stdev)      (GBps +- %stdev)
>>     pg-sz=2MB       23.58 +- 1.95%        25.34 +- 1.18%       +  7.50%  preempt=*
>>     pg-sz=1GB       25.09 +- 0.79%        39.22 +- 2.32%       + 56.31%  preempt=none|voluntary
>>     pg-sz=1GB       25.71 +- 0.03%        52.73 +- 0.20% [#]   +110.16%  preempt=full|lazy
>>   [#] We perform much better with preempt=full|lazy because, not
>>    needing explicit invocations of cond_resched(), we can clear the
>>    full extent (pg-sz=1GB) as a single unit which the processor
>>    can optimize for.
>>   (Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13);
>>    region-size=64GB, local node; 2.56 GHz, boost=0.)
>> Analysis
>> ==
>> pg-sz=1GB: the improvement we see falls in two buckets depending on
>> the batch size in use.
>> For batch-size=32MB the number of cachelines allocated (L1-dcache-loads),
>> which stays relatively flat for smaller batches, starts to drop off
>> because cacheline allocation elision kicks in. And as can be seen below,
>> at batch-size=1GB, we stop allocating cachelines almost entirely.
>> (Not visible here but from testing with intermediate sizes, the
>> allocation change kicks in only at batch-size=32MB and ramps up from
>> there.)
>>   contiguous-pages      6,949,417,798      L1-dcache-loads                  #  883.599 M/sec                       ( +-  0.01% )  (35.75%)
>>                         3,226,709,573      L1-dcache-load-misses            #   46.43% of all L1-dcache accesses   ( +-  0.05% )  (35.75%)
>>      batched,32MB       2,290,365,772      L1-dcache-loads                  #  471.171 M/sec                       ( +-  0.36% )  (35.72%)
>>                         1,144,426,272      L1-dcache-load-misses            #   49.97% of all L1-dcache accesses   ( +-  0.58% )  (35.70%)
>>      batched,1GB           63,914,157      L1-dcache-loads                  #   17.464 M/sec                       ( +-  8.08% )  (35.73%)
>>                            22,074,367      L1-dcache-load-misses            #   34.54% of all L1-dcache accesses   ( +- 16.70% )  (35.70%)
>> The dropoff is also visible in L2 prefetch hits (miss numbers are
>> on similar lines):
>>   contiguous-pages      3,464,861,312      l2_pf_hit_l2.all                 #  437.722 M/sec                       ( +-  0.74% )  (15.69%)
>>     batched,32MB          883,750,087      l2_pf_hit_l2.all                 #  181.223 M/sec                       ( +-  1.18% )  (15.71%)
>>      batched,1GB            8,967,943      l2_pf_hit_l2.all                 #    2.450 M/sec                       ( +- 17.92% )  (15.77%)
>> This largely decouples the frontend from the backend since the clearing
>> operation does not need to wait on loads from memory (we still need
>> cacheline ownership but that's a shorter path). This is most visible
>> if we rerun the test above with (boost=1, 3.66 GHz).
>>   $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>>                     contiguous-pages       batched-pages
>>                     (GBps +- %stdev)      (GBps +- %stdev)
>>     pg-sz=2MB       26.08 +- 1.72%        26.13 +- 0.92%           -      preempt=*
>>     pg-sz=1GB       26.99 +- 0.62%        48.85 +- 2.19%       + 80.99%  preempt=none|voluntary
>>     pg-sz=1GB       27.69 +- 0.18%        75.18 +- 0.25%       +171.50%  preempt=full|lazy
>> Comparing the batched-pages numbers from the boost=0 ones and these: for
>> a clock-speed gain of 43% we gain 24.5% for batch-size=32MB and 42.5%
>> for batch-size=1GB.
>> In comparison the baseline contiguous-pages case and both the
>> pg-sz=2MB ones are largely backend bound so gain no more than ~10%.
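
The scaling comparison above can be rechecked from the two tables with
a small helper. This is only a sketch; pct_gain() is a hypothetical
name and the inputs are the clock speeds and GBps values quoted above.

```c
#include <assert.h>

/* Percentage gain going from 'before' to 'after'; 'before' must be > 0. */
static inline double pct_gain(double before, double after)
{
	assert(before > 0.0);
	return (after / before - 1.0) * 100.0;
}
```

For example, pct_gain(2.56, 3.66) is the clock-speed gain, while
pct_gain(39.22, 48.85) and pct_gain(52.73, 75.18) are the batched-pages
gains for batch-size=32MB and batch-size=1GB respectively.
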
>> Other platforms tested, Intel Icelakex (Oracle X9) and ARM64 Neoverse-N1
>> (Ampere Altra), both show an improvement of ~35% for pg-sz=2MB|1GB.
>> The first goes from around 8GBps to 11GBps and the second from 32GBps
>> to 44GBps.
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>   include/linux/mm.h | 36 ++++++++++++++++++++++++++++++++++++
>>   mm/memory.c        | 18 +++++++++++++++---
>>   2 files changed, 51 insertions(+), 3 deletions(-)
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a4a9a8d1ffec..fb5b86d78093 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4204,6 +4204,15 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
>>    * mapped to user space.
>>    *
>>    * Does absolutely no exception handling.
>> + *
>> + * Note that even though the clearing operation is preemptible, clear_pages()
>> + * does not (and on architectures where it reduces to a few long-running
>> + * instructions, might not be able to) call cond_resched() to check if
>> + * rescheduling is required.
>> + *
>> + * When running under preemptible models this is not a problem. Under
>> + * cooperatively scheduled models, however, the caller is expected to
>> + * limit @npages to no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
>>    */
>>   static inline void clear_pages(void *addr, unsigned int npages)
>>   {
>> @@ -4214,6 +4224,32 @@ static inline void clear_pages(void *addr, unsigned int npages)
>>   }
>>   #endif
>>
>> +#ifndef PROCESS_PAGES_NON_PREEMPT_BATCH
>> +#ifdef clear_pages
>> +/*
>> + * The architecture defines clear_pages(), and we assume that it is
>> + * generally "fast". So choose a batch size large enough to allow the processor
>> + * headroom for optimizing the operation and yet small enough that we see
>> + * reasonable preemption latency for when this optimization is not possible
>> + * (ex. slow microarchitectures, memory bandwidth saturation.)
>> + *
>> + * With a value of 32MB and assuming a memory bandwidth of ~10GBps, this should
>> + * result in worst case preemption latency of around 3ms when clearing pages.
>> + *
>> + * (See comment above clear_pages() for why preemption latency is a concern
>> + * here.)
>> + */
>> +#define PROCESS_PAGES_NON_PREEMPT_BATCH		(32 << (20 - PAGE_SHIFT))
>
> Nit: Could we use SZ_32M here?
>
> 	SZ_32M >> PAGE_SHIFT;
>
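
For reference, the open-coded shift in the patch and a SZ_32M based
form give the same page count. The sketch below checks that in
userspace; the two helpers and their page_shift parameter are
hypothetical stand-ins for PAGE_SHIFT, and SZ_32M is redefined locally
with the value it has in include/linux/sizes.h.

```c
#include <assert.h>

/* Same value as SZ_32M in include/linux/sizes.h. */
#define SZ_32M	0x02000000

/* Open-coded form from the patch; assumes page_shift <= 20. */
static inline unsigned long batch_open_coded(unsigned int page_shift)
{
	assert(page_shift <= 20);
	return 32UL << (20 - page_shift);
}

/* SZ_32M based form suggested in review. */
static inline unsigned long batch_from_sz_32m(unsigned int page_shift)
{
	return (unsigned long)SZ_32M >> page_shift;
}
```

With 4KB pages (page_shift of 12) both forms come to 8192 pages.
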
>> +#else /* !clear_pages */
>> +/*
>> + * The architecture does not provide a clear_pages() implementation. Assume
>> + * that clear_page() -- which clear_pages() will fallback to -- is relatively
>> + * slow and choose a small value for PROCESS_PAGES_NON_PREEMPT_BATCH.
>> + */
>> +#define PROCESS_PAGES_NON_PREEMPT_BATCH		1
>> +#endif
>> +#endif
>> +
>>   #ifdef __HAVE_ARCH_GATE_AREA
>>   extern struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
>>   extern int in_gate_area_no_mm(unsigned long addr);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c06e43a8861a..49e7154121f5 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7240,13 +7240,25 @@ static inline int process_huge_page(
>>   static void clear_contig_highpages(struct page *page, unsigned long addr,
>>   				   unsigned int nr_pages)
>>   {
>> -	unsigned int i;
>> +	unsigned int i, unit, count;
>>
>>   	might_sleep();
>> -	for (i = 0; i < nr_pages; i++) {
>> +	/*
>> +	 * When clearing we want to operate on the largest extent possible since
>> +	 * that allows for extent based architecture specific optimizations.
>> +	 *
>> +	 * However, since the clearing interfaces (clear_user_highpages(),
>> +	 * clear_user_pages(), clear_pages()), do not call cond_resched(), we
>> +	 * limit the batch size when running under non-preemptible scheduling
>> +	 * models.
>> +	 */
>> +	unit = preempt_model_preemptible() ? nr_pages : PROCESS_PAGES_NON_PREEMPT_BATCH;
>> +
>
> Nit: you could do above:
>
> const unsigned int unit = preempt_model_preemptible() ? nr_pages : PROCESS_PAGES_NON_PREEMPT_BATCH;
>
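
To illustrate how the unit selection plays out, here is a userspace
model of the loop in clear_contig_highpages(). It is a sketch under
stated assumptions: the struct, the function names and the 'preemptible'
and 'batch' parameters are hypothetical stand-ins for
preempt_model_preemptible() and PROCESS_PAGES_NON_PREEMPT_BATCH, and the
counters replace the real cond_resched() and clear_user_highpages()
calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters standing in for the kernel side effects. */
struct clear_stats {
	unsigned int resched_checks;	/* models cond_resched() calls */
	unsigned int clear_calls;	/* models clear_user_highpages() calls */
	unsigned int pages_cleared;	/* total pages handed to the clear calls */
};

static inline unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Walk nr_pages in units of 'unit' pages, as the patched loop does. */
static inline struct clear_stats model_clear(unsigned int nr_pages,
					     bool preemptible,
					     unsigned int batch)
{
	const unsigned int unit = preemptible ? nr_pages : batch;
	struct clear_stats st = { 0, 0, 0 };
	unsigned int i, count;

	/* A zero batch would never make progress in the loop below. */
	assert(batch > 0);

	for (i = 0; i < nr_pages; i += count) {
		st.resched_checks++;
		count = min_uint(unit, nr_pages - i);
		st.clear_calls++;
		st.pages_cleared += count;
	}
	return st;
}
```

Assuming 4KB pages, a 1GB folio is 262144 pages, so a batch of 8192
pages gives 32 clear calls under non-preemptible models and a single
call under preemptible ones.
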
>> +	for (i = 0; i < nr_pages; i += count) {
>>   		cond_resched();
>>
>> -		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
>> +		count = min(unit, nr_pages - i);
>> +		clear_user_highpages(page + i, addr + i * PAGE_SIZE, count);
>>   	}
>>   }
>>
>
> Feel free to send a fixup patch inline as reply to this mail for any of these that
> Andrew can simply squash. No need to resend just because of that.

Done.

> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>

Thanks David!

--
ankur

