From: Ankur Arora <ankur.a.arora@oracle.com>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com,
	mingo@redhat.com, luto@kernel.org, peterz@infradead.org,
	paulmck@kernel.org, rostedt@goodmis.org, tglx@linutronix.de,
	willy@infradead.org, jon.grimm@amd.com, bharata@amd.com,
	raghavendra.kt@amd.com, boris.ostrovsky@oracle.com,
	konrad.wilk@oracle.com
Subject: Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
Date: Tue, 15 Apr 2025 14:46:22 -0700	[thread overview]
Message-ID: <87tt6pw11t.fsf@oracle.com> (raw)
In-Reply-To: <mp6sg35nbmjzahnlkstw7y7n2cbcz3waqcthz27ciyc7fmki3s@jws4rtvqyds4>


Mateusz Guzik <mjguzik@gmail.com> writes:

> On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote:
>> clear_pages_rep() and clear_pages_erms() use string instructions to zero
>> memory. When operating on more than a single page, we can use these
>> more effectively by explicitly advertising the region-size to the
>> processor, which can use that as a hint to optimize the clearing
>> (e.g. by eliding cacheline allocation).
>>
>> As a secondary benefit, string instructions are typically microcoded,
>> and working with larger regions helps amortize the cost of the decode.
>>
>> When zeroing a 2MB page, maximize spatial locality by clearing in
>> three sections: the faulting page and its immediate neighbourhood, and
>> the regions to its left and right, with the local neighbourhood
>> cleared last.
>>
>> Performance
>> ==
>>
>> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local
>> NUMA node.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s  +- stddev)      (GB/s  +- stddev)
>>
>>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>
>> Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
>> allocation, so we see a drop-off in cacheline allocations for pg-sz=1GB.
>>
>> pg-sz=1GB:
>>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
>>   -    544,878,976      instructions                     #    0.06  insn per cycle
>>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
>>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
>>
>>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
>>   +     10,979,121      instructions                     #    0.00  insn per cycle
>>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
>>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
>>
>> That's not the case with pg-sz=2MB, where we also perform better but
>> the number of cacheline allocations remains the same.
>>
>> It's not entirely clear why the performance for pg-sz=2MB improves. We
>> decode fewer instructions and the hardware prefetcher can do a better
>> job, but the perf stats for both of those fall short of explaining an
>> improvement of ~30%.
>>
>> pg-sz=2MB:
>>   - 13,110,306,584      cycles                           #    2.418 GHz                         ( +-  0.48% )  (46.13%)
>>   -    607,589,360      instructions                     #    0.05  insn per cycle
>>   -  2,416,130,434      L1-dcache-loads                  #  445.682 M/sec                       ( +-  0.08% )  (46.19%)
>>   -  1,080,187,594      L1-dcache-load-misses            #   44.71% of all L1-dcache accesses   ( +-  0.01% )  (46.18%)
>>
>>   +  9,624,624,178      cycles                           #    2.418 GHz                         ( +-  0.01% )  (46.13%)
>>   +    277,336,691      instructions                     #    0.03  insn per cycle
>>   +  2,251,220,599      L1-dcache-loads                  #  565.624 M/sec                       ( +-  0.01% )  (46.20%)
>>   +  1,092,386,130      L1-dcache-load-misses            #   48.52% of all L1-dcache accesses   ( +-  0.02% )  (46.19%)
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>
>> For both page sizes, Icelakex behaves similarly to Milan at pg-sz=2MB:
>> we see a drop in cycles but no drop in cacheline allocations.
>>
>
> Back when I was young and handsome and 32-bit x86 was king, people
> assumed 4K pages needed to be cleared with non-temporal stores to avoid
> evicting stuff from caches. I had never seen measurements showing this
> had the intended effect. Some time after this became a thing I did see
> measurements showing that this in fact *increases* cache misses. I am
> not saying this was necessarily the case for all x86 uarchs, merely that
> the sensible-sounding assumption turned bogus at some point (if it was
> ever legit).

That was a long time ago though ;-). And your point makes sense for
small pages. But consider that zeroing a 1GB page can easily blow away
the L3 cache for absolutely nothing gained -- probabilistically, none of
the freshly zeroed lines that remain in the cache will ever be accessed.

Now, you could argue that the situation is less clear for 2MB pages.
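
(For reference, the primitive underneath is essentially one string store
over the whole extent. A simplified sketch -- ignoring the
clear_pages_rep()/clear_pages_erms() selection that the series handles:

	static inline void clear_pages_sketch(void *addr, unsigned long npages)
	{
		unsigned long len = npages * PAGE_SIZE;

		/* RCX advertises the full extent, not just one page. */
		asm volatile("rep stosb"
			     : "+c" (len), "+D" (addr)
			     : "a" (0)
			     : "memory");
	}

That region-size hint is what lets the CPU elide cacheline allocation
once the extent crosses its internal threshold.)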

> This brings me to the multi-stage clearing employed here for locality.
> While it sounds great on paper, for all I know it does not provide any
> advantage. It may very well be that it is harmful, by preventing the CPU
> from knowing what you are trying to do.
>
> I think doing this warrants obtaining stats from some real workloads,
> but given how time-consuming this can be I think it would be tolerable
> to skip it for now.
>
>> Performance for preempt=none|voluntary remains unchanged.
>>
>
> So I was under the impression the benefit would be realized for all
> kernels.
>
> I don't know how preemption support is implemented on Linux. Do you
> always get an IPI?

No. The need-resched bit is common to all preemption models. It's just
that there's no preemption via irqentry, only synchronous calls to
cond_resched() (as you mention below).

Zeroing via a subroutine-like instruction (rep; stos) is incompatible with
synchronous calls to cond_resched(), so this code is explicitly not called
for none/voluntary (see patch 3).

That said, I'll probably take Ingo's suggestion of chunking things up
in, say, 8MB/16MB portions for the cooperative preemption models.
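
Roughly along these lines -- an untested sketch, with the 8MB chunk
size just a placeholder:

	#define CLEAR_PAGES_CHUNK	(8 << (20 - PAGE_SHIFT))  /* 8MB worth of pages */

	static void clear_pages_chunked(struct page *page, unsigned int npages)
	{
		while (npages) {
			unsigned int n = min_t(unsigned int, npages,
					       CLEAR_PAGES_CHUNK);

			/* Bound each rep;stos by the chunk size... */
			clear_pages(page_address(page), n);
			page += n;
			npages -= n;

			/* ...so we hit a reschedule point every chunk. */
			cond_resched();
		}
	}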

Ankur


> I was thinking something like this: a per-cpu var akin to the preemption
> count, but indicating that the particular code section is fully
> preemptible.
>
> Then:
>
> preemptible_enter();
> clear_pages();
> preemptible_exit();
>
> For simpler handling of the var, it could prevent migration to other
> CPUs.
>
> Then the IPI handler for preemption would check if ->preemptible is set
> and preemption disablement is zero, in which case it would take you off
> the CPU.
>
> If this is a problem, then a better granularity would help (say, 8 pages
> between cond_rescheds?)
>
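
[ For concreteness, a sketch of the scheme above -- all hypothetical
  names, none of this exists today:

	DEFINE_PER_CPU(int, fully_preemptible);

	static inline void preemptible_enter(void)
	{
		migrate_disable();	/* simpler per-cpu var handling */
		this_cpu_write(fully_preemptible, 1);
	}

	static inline void preemptible_exit(void)
	{
		this_cpu_write(fully_preemptible, 0);
		migrate_enable();
	}

  with the resched IPI taking us off the CPU if fully_preemptible is
  set and preempt_count() is zero. ]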
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  arch/x86/mm/Makefile |  1 +
>>  arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/mm.h   |  1 +
>>  3 files changed, 62 insertions(+)
>>  create mode 100644 arch/x86/mm/memory.c
>>
>> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
>> index 32035d5be5a0..e61b4d331cdf 100644
>> --- a/arch/x86/mm/Makefile
>> +++ b/arch/x86/mm/Makefile
>> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
>>  obj-$(CONFIG_NUMA)		+= numa.o numa_$(BITS).o
>>  obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
>>  obj-$(CONFIG_ACPI_NUMA)		+= srat.o
>> +obj-$(CONFIG_PREEMPTION)	+= memory.o
>>
>>  obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
>>  obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
>> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
>> new file mode 100644
>> index 000000000000..99851c246fcc
>> --- /dev/null
>> +++ b/arch/x86/mm/memory.c
>> @@ -0,0 +1,60 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +#include <linux/mm.h>
>> +#include <linux/range.h>
>> +#include <linux/minmax.h>
>> +
>> +#ifndef CONFIG_HIGHMEM
>> +/*
>> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
>> + *
>> + * Taking inspiration from the common code variant, we split the zeroing in
>> + * three parts: left of the fault, right of the fault, and up to 5 pages
>> + * in the immediate neighbourhood of the target page.
>> + *
>> + * Cleared in that order to keep cache lines of the target region hot.
>> + *
>> + * For gigantic pages, there is no expectation of cache locality so just do a
>> + * straight zero.
>> + */
>> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
>> +{
>> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> +	int width = 2; /* pages cleared last on either side */
>> +	struct range r[3];
>> +	int i;
>> +
>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> +		clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
>> +	 * ensure it sticks around in the cache.
>> +	 */
>> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
>> +
>> +	/* Region to the left of the fault */
>> +	r[1] = DEFINE_RANGE(pg.start,
>> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>> +
>> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>> +			    pg.end);
>> +
>> +	for (i = 0; i <= 2; i++) {
>> +		int len = range_len(&r[i]);
>> +
>> +		if (len > 0)
>> +			clear_pages(page_address(folio_page(folio, r[i].start)), len);
>> +	}
>> +
>> +out:
>> +	/* Explicitly invoke cond_resched() to handle any live patching necessary. */
>> +	cond_resched();
>> +}
>> +
>> +#endif /* CONFIG_HIGHMEM */
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b7f13f087954..b57512da8173 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4114,6 +4114,7 @@ enum mf_action_page_type {
>>  };
>>
>>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>> +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint);
>>  void folio_zero_user(struct folio *folio, unsigned long addr_hint);
>>  int copy_user_large_folio(struct folio *dst, struct folio *src,
>>  			  unsigned long addr_hint,
>> --
>> 2.31.1
>>
>>


--
ankur


Thread overview: 38+ messages
2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
2025-04-14  3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
2025-04-14  6:32   ` Ingo Molnar
2025-04-14 11:02     ` Peter Zijlstra
2025-04-14 11:14       ` Ingo Molnar
2025-04-14 19:46       ` Ankur Arora
2025-04-14 22:26       ` Mateusz Guzik
2025-04-15  6:14         ` Ankur Arora
2025-04-15  8:22           ` Mateusz Guzik
2025-04-15 20:01             ` Ankur Arora
2025-04-15 20:32               ` Mateusz Guzik
2025-04-14 19:52     ` Ankur Arora
2025-04-14 20:09       ` Matthew Wilcox
2025-04-15 21:59         ` Ankur Arora
2025-04-14  3:46 ` [PATCH v3 2/4] x86/clear_page: add clear_pages() Ankur Arora
2025-04-14  3:46 ` [PATCH v3 3/4] huge_page: allow arch override for folio_zero_user() Ankur Arora
2025-04-14  3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
2025-04-14  6:53   ` Ingo Molnar
2025-04-14 21:21     ` Ankur Arora
2025-04-14  7:05   ` Ingo Molnar
2025-04-15  6:36     ` Ankur Arora
2025-04-22  6:36     ` Raghavendra K T
2025-04-22 19:14       ` Ankur Arora
2025-04-15 10:16   ` Mateusz Guzik
2025-04-15 21:46     ` Ankur Arora [this message]
2025-04-15 22:01       ` Mateusz Guzik
2025-04-16  4:46         ` Ankur Arora
2025-04-17 14:06           ` Mateusz Guzik
2025-04-14  5:34 ` [PATCH v3 0/4] mm/folio_zero_user: add " Ingo Molnar
2025-04-14 19:30   ` Ankur Arora
2025-04-14  6:36 ` Ingo Molnar
2025-04-14 19:19   ` Ankur Arora
2025-04-15 19:10 ` Zi Yan
2025-04-22 19:32   ` Ankur Arora
2025-04-22  6:23 ` Raghavendra K T
2025-04-22 19:22   ` Ankur Arora
2025-04-23  8:12     ` Raghavendra K T
2025-04-23  9:18       ` Raghavendra K T
