* [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
@ 2025-04-14  3:46 Ankur Arora
  2025-04-14  3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-14  3:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, peterz,
	paulmck, rostedt, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

This series adds multi-page clearing for hugepages. It is a rework
of [1] which took a detour through PREEMPT_LAZY [2].

Why multi-page clearing? It improves upon the current page-at-a-time
approach by providing the processor with a hint about the real region
size. A processor can use this hint to, for instance, elide cacheline
allocation when clearing a large region.

AMD Zen does exactly this for REP; STOS: regions larger than the
L3 size are cleared with non-temporal stores.

This results in significantly better performance. 

We also see performance improvement for cases where this optimization is
unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
REP; STOS is typically microcoded which can now be amortized over
larger regions and the hint allows the hardware prefetcher to do a
better job.
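
Concretely (an illustrative sketch using the clear_pages() interface
added in patch 2; not literal code from the series):

	/* page-at-a-time: the CPU only ever sees a PAGE_SIZE extent */
	for (i = 0; i < nr_pages; i++)
		clear_page(page_address(folio_page(folio, i)));

	/* multi-page: a single call advertises the full extent */
	clear_pages(page_address(folio_page(folio, 0)), nr_pages);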

Milan (EPYC 7J13, boost=0, preempt=full|lazy):

                 mm/folio_zero_user    x86/folio_zero_user     change
                  (GB/s  +- stddev)      (GB/s  +- stddev)

  pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
  pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%

Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):

                 mm/folio_zero_user    x86/folio_zero_user     change
                  (GB/s +- stddev)      (GB/s +- stddev)

  pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
  pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%

Interaction with preemption: as discussed in [3], zeroing large
regions with string instructions doesn't work well with cooperative
preemption models, which need regular invocations of cond_resched().
So this optimization is limited to preemptible models (full, lazy).

This is done by overriding __folio_zero_user() -- which does the usual
page-at-a-time zeroing -- with an architecture-optimized version, but
only when running under a preemptible model.
As such, this ties an architecture-specific optimization rather closely
to preemption. It should be easy enough to change, but this seemed like
the simplest approach.

Comments appreciated!

Also at:
  github.com/terminus/linux clear-pages-preempt.v1


[1] https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/
[2] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[3] https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/

Ankur Arora (4):
  x86/clear_page: extend clear_page*() for multi-page clearing
  x86/clear_page: add clear_pages()
  huge_page: allow arch override for folio_zero_user()
  x86/folio_zero_user: multi-page clearing

 arch/x86/include/asm/page_32.h |  6 ++++
 arch/x86/include/asm/page_64.h | 27 +++++++++------
 arch/x86/lib/clear_page_64.S   | 52 +++++++++++++++++++++--------
 arch/x86/mm/Makefile           |  1 +
 arch/x86/mm/memory.c           | 60 ++++++++++++++++++++++++++++++++++
 include/linux/mm.h             |  1 +
 mm/memory.c                    | 38 ++++++++++++++++++---
 7 files changed, 156 insertions(+), 29 deletions(-)
 create mode 100644 arch/x86/mm/memory.c

-- 
2.31.1




* [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
@ 2025-04-14  3:46 ` Ankur Arora
  2025-04-14  6:32   ` Ingo Molnar
  2025-04-14  3:46 ` [PATCH v3 2/4] x86/clear_page: add clear_pages() Ankur Arora
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Ankur Arora @ 2025-04-14  3:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, peterz,
	paulmck, rostedt, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

clear_page*() variants now take a page-aligned length parameter and
clears the whole region.

Rename to clear_pages*().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h | 20 +++++++------
 arch/x86/lib/clear_page_64.S   | 52 +++++++++++++++++++++++++---------
 2 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index d3aab6f4e59a..45db74db9021 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -40,22 +40,24 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 
 #define __phys_reloc_hide(x)	(x)
 
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+void clear_pages_orig(void *page, unsigned int length);
+void clear_pages_rep(void *page, unsigned int length);
+void clear_pages_erms(void *page, unsigned int length);
 
 static inline void clear_page(void *page)
 {
+	unsigned int length = PAGE_SIZE;
 	/*
-	 * Clean up KMSAN metadata for the page being cleared. The assembly call
+	 * Clean up KMSAN metadata for the pages being cleared. The assembly call
 	 * below clobbers @page, so we perform unpoisoning before it.
 	 */
-	kmsan_unpoison_memory(page, PAGE_SIZE);
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
+	kmsan_unpoison_memory(page, length);
+
+	alternative_call_2(clear_pages_orig,
+			   clear_pages_rep, X86_FEATURE_REP_GOOD,
+			   clear_pages_erms, X86_FEATURE_ERMS,
 			   "=D" (page),
-			   "D" (page),
+			   ASM_INPUT("D" (page), "S" (length)),
 			   "cc", "memory", "rax", "rcx");
 }
 
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index a508e4a8c66a..bce516263b69 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -13,20 +13,35 @@
  */
 
 /*
- * Zero a page.
- * %rdi	- page
+ * Zero kernel page aligned region.
+ *
+ * Input:
+ * %rdi	- destination
+ * %esi	- length
+ *
+ * Clobbers: %rax, %rcx
  */
-SYM_TYPED_FUNC_START(clear_page_rep)
-	movl $4096/8,%ecx
+SYM_TYPED_FUNC_START(clear_pages_rep)
+	movl %esi, %ecx
 	xorl %eax,%eax
+	shrl $3,%ecx
 	rep stosq
 	RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
+SYM_FUNC_END(clear_pages_rep)
+EXPORT_SYMBOL_GPL(clear_pages_rep)
 
-SYM_TYPED_FUNC_START(clear_page_orig)
+/*
+ * Original page zeroing loop.
+ * Input:
+ * %rdi	- destination
+ * %esi	- length
+ *
+ * Clobbers: %rax, %rcx, %rflags
+ */
+SYM_TYPED_FUNC_START(clear_pages_orig)
+	movl   %esi, %ecx
 	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+	shrl   $6,%ecx
 	.p2align 4
 .Lloop:
 	decl	%ecx
@@ -43,16 +58,25 @@ SYM_TYPED_FUNC_START(clear_page_orig)
 	jnz	.Lloop
 	nop
 	RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
+SYM_FUNC_END(clear_pages_orig)
+EXPORT_SYMBOL_GPL(clear_pages_orig)
 
-SYM_TYPED_FUNC_START(clear_page_erms)
-	movl $4096,%ecx
+/*
+ * Zero kernel page aligned region.
+ *
+ * Input:
+ * %rdi	- destination
+ * %esi	- length
+ *
+ * Clobbers: %rax, %rcx
+ */
+SYM_TYPED_FUNC_START(clear_pages_erms)
+	movl %esi, %ecx
 	xorl %eax,%eax
 	rep stosb
 	RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(clear_pages_erms)
+EXPORT_SYMBOL_GPL(clear_pages_erms)
 
 /*
  * Default clear user-space.
-- 
2.31.1




* [PATCH v3 2/4] x86/clear_page: add clear_pages()
  2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
  2025-04-14  3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
@ 2025-04-14  3:46 ` Ankur Arora
  2025-04-14  3:46 ` [PATCH v3 3/4] huge_page: allow arch override for folio_zero_user() Ankur Arora
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-14  3:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, peterz,
	paulmck, rostedt, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

Expose multi-page clearing via clear_pages().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_32.h | 6 ++++++
 arch/x86/include/asm/page_64.h | 9 +++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_32.h b/arch/x86/include/asm/page_32.h
index 0c623706cb7e..73c94c9e4eee 100644
--- a/arch/x86/include/asm/page_32.h
+++ b/arch/x86/include/asm/page_32.h
@@ -22,6 +22,12 @@ static inline void clear_page(void *page)
 	memset(page, 0, PAGE_SIZE);
 }
 
+static inline void clear_pages(void *page, unsigned int nsubpages)
+{
+	for (int i = 0; i < nsubpages; i++)
+		clear_page(page + i * PAGE_SIZE);
+}
+
 static inline void copy_page(void *to, void *from)
 {
 	memcpy(to, from, PAGE_SIZE);
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 45db74db9021..6781cdee6541 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -44,9 +44,9 @@ void clear_pages_orig(void *page, unsigned int length);
 void clear_pages_rep(void *page, unsigned int length);
 void clear_pages_erms(void *page, unsigned int length);
 
-static inline void clear_page(void *page)
+static inline void clear_pages(void *page, unsigned int nsubpages)
 {
-	unsigned int length = PAGE_SIZE;
+	unsigned int length = nsubpages * PAGE_SIZE;
 	/*
 	 * Clean up KMSAN metadata for the pages being cleared. The assembly call
 	 * below clobbers @page, so we perform unpoisoning before it.
@@ -61,6 +61,11 @@ static inline void clear_page(void *page)
 			   "cc", "memory", "rax", "rcx");
 }
 
+static inline void clear_page(void *page)
+{
+	clear_pages(page, 1);
+}
+
 void copy_page(void *to, void *from);
 KCFI_REFERENCE(copy_page);
 
-- 
2.31.1




* [PATCH v3 3/4] huge_page: allow arch override for folio_zero_user()
  2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
  2025-04-14  3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
  2025-04-14  3:46 ` [PATCH v3 2/4] x86/clear_page: add clear_pages() Ankur Arora
@ 2025-04-14  3:46 ` Ankur Arora
  2025-04-14  3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-14  3:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, peterz,
	paulmck, rostedt, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

folio_zero_user() is constrained to operate in a page-at-a-time fashion
because it needs to handle the CONFIG_HIGHMEM case. Additionally,
cooperative preemption models (none, voluntary) need regular
invocations of cond_resched(), which limits the chunk size when zeroing.

Move the page-at-a-time handling to __folio_zero_user() and allow an
architecture-specific override. Note that when running under
CONFIG_PREEMPT_DYNAMIC, we could switch between cooperative and
preemptible models at runtime, falling back to __folio_zero_user() in
the former case.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 mm/memory.c | 38 +++++++++++++++++++++++++++++++++-----
 1 file changed, 33 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2d8c265fc7d6..ac6a19d7bdf4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7235,6 +7235,32 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
 	return 0;
 }
 
+/*
+ * __folio_zero_user - page-at-a-time zeroing.
+ *
+ * Handle cases where we have nothing better available. This could be
+ * for a few reasons:
+ *
+ *   - the architecture does not support multi-page zeroing (no override
+ *     for folio_zero_user_preemptible()): because there might be no
+ *     optimized zeroing primitive, or because CONFIG_HIGHMEM is supported.
+ *
+ *   - !preempt_model_preemptible(): need to call cond_resched()
+ *     periodically to provide reasonable latency.
+ */
+static void __folio_zero_user(struct folio *folio, unsigned long addr_hint)
+{
+	unsigned int nr_pages = folio_nr_pages(folio);
+
+	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
+		clear_gigantic_page(folio, addr_hint, nr_pages);
+	else
+		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+}
+
+void __weak folio_zero_user_preemptible(struct folio *, unsigned long)
+	__alias(__folio_zero_user);
+
 /**
  * folio_zero_user - Zero a folio which will be mapped to userspace.
  * @folio: The folio to zero.
@@ -7242,12 +7268,14 @@ static int clear_subpage(unsigned long addr, int idx, void *arg)
  */
 void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 {
-	unsigned int nr_pages = folio_nr_pages(folio);
-
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-		clear_gigantic_page(folio, addr_hint, nr_pages);
+	/*
+	 * Use the arch optimized version if we are preemptible and can
+	 * do zeroing of extended extents without worrying about latency.
+	 */
+	if (preempt_model_preemptible())
+		folio_zero_user_preemptible(folio, addr_hint);
 	else
-		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+		__folio_zero_user(folio, addr_hint);
 }
 
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
-- 
2.31.1




* [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
                   ` (2 preceding siblings ...)
  2025-04-14  3:46 ` [PATCH v3 3/4] huge_page: allow arch override for folio_zero_user() Ankur Arora
@ 2025-04-14  3:46 ` Ankur Arora
  2025-04-14  6:53   ` Ingo Molnar
                     ` (2 more replies)
  2025-04-14  5:34 ` [PATCH v3 0/4] mm/folio_zero_user: add " Ingo Molnar
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-14  3:46 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, peterz,
	paulmck, rostedt, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk, ankur.a.arora

clear_pages_rep(), clear_pages_erms() use string instructions to zero
memory. When operating on more than a single page, we can use these
more effectively by explicitly advertising the region-size to the
processor, which can use that as a hint to optimize the clearing
(ex. by eliding cacheline allocation.)

As a secondary benefit, string instructions are typically microcoded,
and working with larger regions helps amortize the cost of the decode.

When zeroing the 2MB page, maximize spatial locality by clearing in 
three sections: the faulting page and its immediate neighbourhood, the
left and the right regions, with the local neighbourhood cleared last.
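
As an illustration of the ordering: for a 2MB folio (512 4K pages) with
the fault at page index 100, the immediate neighbourhood is pages
98-102; pages 103-511 (right) are cleared first, then pages 0-97
(left), and pages 98-102 last.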

Performance
==

Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local
NUMA node.
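
Roughly, the workload looks like this (a sketch, not the exact test
program; error handling and timing are omitted, MAP_HUGE_* may need
<linux/mman.h>, and the run is bound to one node with numactl):

	#define _GNU_SOURCE
	#include <sys/mman.h>

	#define SZ	(64UL << 30)	/* 64GB */
	#define STEP	(1UL << 30)	/* pg-sz: 1GB (or 2MB) */

	int main(void)
	{
		char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			       MAP_HUGE_1GB, -1, 0);

		/* demand fault -- and so zero -- each huge page */
		for (unsigned long off = 0; off < SZ; off += STEP)
			p[off] = 1;

		return 0;
	}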

Milan (EPYC 7J13, boost=0, preempt=full|lazy):

                 mm/folio_zero_user    x86/folio_zero_user     change
                  (GB/s  +- stddev)      (GB/s  +- stddev)

  pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
  pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%

Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB.

pg-sz=1GB:
  -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
  -    544,878,976      instructions                     #    0.06  insn per cycle
  -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
  -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)

  +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
  +     10,979,121      instructions                     #    0.00  insn per cycle
  +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
  +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)

That's not the case with pg-sz=2MB, where we also perform better, but
the number of cacheline allocations remains the same.

It's not entirely clear why the performance for pg-sz=2MB improves. We
decode fewer instructions and the hardware prefetcher can do a better
job, but the perf stats for both of those aren't convincing enough to
the extent of ~30%.

pg-sz=2MB:
  - 13,110,306,584      cycles                           #    2.418 GHz                         ( +-  0.48% )  (46.13%)
  -    607,589,360      instructions                     #    0.05  insn per cycle
  -  2,416,130,434      L1-dcache-loads                  #  445.682 M/sec                       ( +-  0.08% )  (46.19%)
  -  1,080,187,594      L1-dcache-load-misses            #   44.71% of all L1-dcache accesses   ( +-  0.01% )  (46.18%)

  +  9,624,624,178      cycles                           #    2.418 GHz                         ( +-  0.01% )  (46.13%)
  +    277,336,691      instructions                     #    0.03  insn per cycle
  +  2,251,220,599      L1-dcache-loads                  #  565.624 M/sec                       ( +-  0.01% )  (46.20%)
  +  1,092,386,130      L1-dcache-load-misses            #   48.52% of all L1-dcache accesses   ( +-  0.02% )  (46.19%)

Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):

                 mm/folio_zero_user    x86/folio_zero_user     change
                  (GB/s +- stddev)      (GB/s +- stddev)

  pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
  pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%

For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
see a drop in cycles but there's no drop in cacheline allocation.

Performance for preempt=none|voluntary remains unchanged.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/mm/Makefile |  1 +
 arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h   |  1 +
 3 files changed, 62 insertions(+)
 create mode 100644 arch/x86/mm/memory.c

diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 32035d5be5a0..e61b4d331cdf 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
 obj-$(CONFIG_NUMA)		+= numa.o numa_$(BITS).o
 obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat.o
+obj-$(CONFIG_PREEMPTION)	+= memory.o
 
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
new file mode 100644
index 000000000000..99851c246fcc
--- /dev/null
+++ b/arch/x86/mm/memory.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/mm.h>
+#include <linux/range.h>
+#include <linux/minmax.h>
+
+#ifndef CONFIG_HIGHMEM
+/*
+ * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
+ *
+ * Taking inspiration from the common code variant, we split the zeroing in
+ * three parts: left of the fault, right of the fault, and up to 5 pages
+ * in the immediate neighbourhood of the target page.
+ *
+ * Cleared in that order to keep cache lines of the target region hot.
+ *
+ * For gigantic pages, there is no expectation of cache locality so just do a
+ * straight zero.
+ */
+void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
+{
+	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+	int width = 2; /* pages cleared last on either side */
+	struct range r[3];
+	int i;
+
+	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+		clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
+		goto out;
+	}
+
+	/*
+	 * Faulting page and its immediate neighbourhood. Cleared at the end to
+	 * ensure it sticks around in the cache.
+	 */
+	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+	/* Region to the left of the fault */
+	r[1] = DEFINE_RANGE(pg.start,
+			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+			    pg.end);
+
+	for (i = 0; i <= 2; i++) {
+		int len = range_len(&r[i]);
+
+		if (len > 0)
+			clear_pages(page_address(folio_page(folio, r[i].start)), len);
+	}
+
+out:
+	/* Explicitly invoke cond_resched() to handle any live patching necessary. */
+	cond_resched();
+}
+
+#endif /* CONFIG_HIGHMEM */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b7f13f087954..b57512da8173 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4114,6 +4114,7 @@ enum mf_action_page_type {
 };
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint);
 void folio_zero_user(struct folio *folio, unsigned long addr_hint);
 int copy_user_large_folio(struct folio *dst, struct folio *src,
 			  unsigned long addr_hint,
-- 
2.31.1




* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
                   ` (3 preceding siblings ...)
  2025-04-14  3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
@ 2025-04-14  5:34 ` Ingo Molnar
  2025-04-14 19:30   ` Ankur Arora
  2025-04-14  6:36 ` Ingo Molnar
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2025-04-14  5:34 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Ankur Arora (4):
>   x86/clear_page: extend clear_page*() for multi-page clearing
>   x86/clear_page: add clear_pages()
>   huge_page: allow arch override for folio_zero_user()
>   x86/folio_zero_user: multi-page clearing

These are not how x86 commit titles should look like. Please take a 
look at the titles of previous commits to the x86 files you are 
modifying and follow that style. (Capitalization, use of verbs, etc.)

Thanks,

	Ingo



* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14  3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
@ 2025-04-14  6:32   ` Ingo Molnar
  2025-04-14 11:02     ` Peter Zijlstra
  2025-04-14 19:52     ` Ankur Arora
  0 siblings, 2 replies; 38+ messages in thread
From: Ingo Molnar @ 2025-04-14  6:32 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> clear_page*() variants now take a page-aligned length parameter and
> clears the whole region.

Please read your changelogs and fix typos. ;-)

> +void clear_pages_orig(void *page, unsigned int length);
> +void clear_pages_rep(void *page, unsigned int length);
> +void clear_pages_erms(void *page, unsigned int length);

What unit is 'length' in? If it's bytes, why is this interface 
artificially limiting itself to ~4GB? On x86-64 there's very little (if 
any) performance difference between a 32-bit and a 64-bit length 
iterations.

Even if we end up only exposing a 32-bit length API to the generic MM 
layer, there's no reason to limit the x86-64 assembly code in such a 
fashion.

>  static inline void clear_page(void *page)
>  {
> +	unsigned int length = PAGE_SIZE;
>  	/*
> -	 * Clean up KMSAN metadata for the page being cleared. The assembly call
> +	 * Clean up KMSAN metadata for the pages being cleared. The assembly call
>  	 * below clobbers @page, so we perform unpoisoning before it.

>  	 */
> -	kmsan_unpoison_memory(page, PAGE_SIZE);
> -	alternative_call_2(clear_page_orig,
> -			   clear_page_rep, X86_FEATURE_REP_GOOD,
> -			   clear_page_erms, X86_FEATURE_ERMS,
> +	kmsan_unpoison_memory(page, length);
> +
> +	alternative_call_2(clear_pages_orig,
> +			   clear_pages_rep, X86_FEATURE_REP_GOOD,
> +			   clear_pages_erms, X86_FEATURE_ERMS,
>  			   "=D" (page),
> -			   "D" (page),
> +			   ASM_INPUT("D" (page), "S" (length)),
>  			   "cc", "memory", "rax", "rcx");
>  }
>  
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index a508e4a8c66a..bce516263b69 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -13,20 +13,35 @@
>   */
>  
>  /*
> - * Zero a page.
> - * %rdi	- page
> + * Zero kernel page aligned region.
> + *
> + * Input:
> + * %rdi	- destination
> + * %esi	- length
> + *
> + * Clobbers: %rax, %rcx
>   */
> -SYM_TYPED_FUNC_START(clear_page_rep)
> -	movl $4096/8,%ecx
> +SYM_TYPED_FUNC_START(clear_pages_rep)
> +	movl %esi, %ecx
>  	xorl %eax,%eax
> +	shrl $3,%ecx
>  	rep stosq
>  	RET
> -SYM_FUNC_END(clear_page_rep)
> -EXPORT_SYMBOL_GPL(clear_page_rep)
> +SYM_FUNC_END(clear_pages_rep)
> +EXPORT_SYMBOL_GPL(clear_pages_rep)
>  
> -SYM_TYPED_FUNC_START(clear_page_orig)
> +/*
> + * Original page zeroing loop.
> + * Input:
> + * %rdi	- destination
> + * %esi	- length
> + *
> + * Clobbers: %rax, %rcx, %rflags
> + */
> +SYM_TYPED_FUNC_START(clear_pages_orig)
> +	movl   %esi, %ecx
>  	xorl   %eax,%eax
> -	movl   $4096/64,%ecx
> +	shrl   $6,%ecx

So if the natural input parameter is RCX, why is this function using 
RSI as the input 'length' parameter? Causes unnecessary register 
shuffling.

> +/*
> + * Zero kernel page aligned region.
> + *
> + * Input:
> + * %rdi	- destination
> + * %esi	- length
> + *
> + * Clobbers: %rax, %rcx
> + */
> +SYM_TYPED_FUNC_START(clear_pages_erms)
> +	movl %esi, %ecx
>  	xorl %eax,%eax
>  	rep stosb
>  	RET

Same observation: unnecessary register shuffling.

Also, please rename this (now-) terribly named interface:

> +void clear_pages_orig(void *page, unsigned int length);
> +void clear_pages_rep(void *page, unsigned int length);
> +void clear_pages_erms(void *page, unsigned int length);

Because the 'pages' is now a bit misleading, and why is the starting 
address called a 'page'?

So a more sensible namespace would be to follow memset nomenclature:

	void memzero_page_aligned_*(void *addr, unsigned long len);

... and note the intentional abbreviation to 'len'.

Also, since most of these changes are to x86 architecture code, this is 
a new interface only used by x86, and the MM glue is minimal, I'd like 
to merge this series via the x86 tree, if the glue gets acks from MM 
folks.

Thanks,

	Ingo



* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
                   ` (4 preceding siblings ...)
  2025-04-14  5:34 ` [PATCH v3 0/4] mm/folio_zero_user: add " Ingo Molnar
@ 2025-04-14  6:36 ` Ingo Molnar
  2025-04-14 19:19   ` Ankur Arora
  2025-04-15 19:10 ` Zi Yan
  2025-04-22  6:23 ` Raghavendra K T
  7 siblings, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2025-04-14  6:36 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> We also see performance improvement for cases where this optimization is
> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
> REP; STOS is typically microcoded which can now be amortized over
> larger regions and the hint allows the hardware prefetcher to do a
> better job.
> 
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
> 
>                  mm/folio_zero_user    x86/folio_zero_user     change
>                   (GB/s  +- stddev)      (GB/s  +- stddev)
> 
>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
> 
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
> 
>                  mm/folio_zero_user    x86/folio_zero_user     change
>                   (GB/s +- stddev)      (GB/s +- stddev)
> 
>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%

How was this measured? Could you integrate this measurement as a new 
tools/perf/bench/ subcommand so that people can try it on different 
systems, etc.? There's already a 'perf bench mem' subcommand space 
where this feature could be added to.

Thanks,

	Ingo



* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-14  3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
@ 2025-04-14  6:53   ` Ingo Molnar
  2025-04-14 21:21     ` Ankur Arora
  2025-04-14  7:05   ` Ingo Molnar
  2025-04-15 10:16   ` Mateusz Guzik
  2 siblings, 1 reply; 38+ messages in thread
From: Ingo Molnar @ 2025-04-14  6:53 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> clear_pages_rep(), clear_pages_erms() use string instructions to zero
> memory. When operating on more than a single page, we can use these
> more effectively by explicitly advertising the region-size to the
> processor, which can use that as a hint to optimize the clearing
> (ex. by eliding cacheline allocation.)

> +#ifndef CONFIG_HIGHMEM
> +/*
> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
> + *
> + * Taking inspiration from the common code variant, we split the zeroing in
> + * three parts: left of the fault, right of the fault, and up to 5 pages
> + * in the immediate neighbourhood of the target page.
> + *
> + * Cleared in that order to keep cache lines of the target region hot.
> + *
> + * For gigantic pages, there is no expectation of cache locality so just do a
> + * straight zero.
> + */
> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
> +{
> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
> +	int width = 2; /* pages cleared last on either side */
> +	struct range r[3];
> +	int i;
> +
> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
> +		clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));

> +			clear_pages(page_address(folio_page(folio, r[i].start)), len);

So the _user postfix naming is super confusing here and elsewhere in 
this series.

clear_page(), and by extension the clear_pages() interface you extended 
it to, fundamentally only works on kernel addresses:

/*
 * Zero a page.
 * %rdi - page
 */
SYM_TYPED_FUNC_START(clear_page_rep)
        movl $4096/8,%ecx
        xorl %eax,%eax
        rep stosq
        RET

Note the absolute lack of fault & exception handling.

But folio_zero_user*() uses the kernel-space variants of page clearing 
AFAICT (contrary to the naming):

void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
        unsigned int nr_pages = folio_nr_pages(folio);

        if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
                clear_gigantic_page(folio, addr_hint, nr_pages);
        else
                process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
}


static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
                                unsigned int nr_pages)
{
        unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
        int i;
         
        might_sleep();
        for (i = 0; i < nr_pages; i++) {
                cond_resched();
                clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
        }       
}

Which on x86 is simply mapped into a kernel-memory interface:

static inline void clear_user_page(void *page, unsigned long vaddr,
                                   struct page *pg)
{
        clear_page(page);
}

So at minimum this is a misnomer and a confusing mixture of user/kernel
interface names on an epic scale that TBH should be cleaned up before
being extended...

> +out:
> +	/* Explicitly invoke cond_resched() to handle any live patching necessary. */
> +	cond_resched();

What again?

Thanks,

	Ingo



* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-14  3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
  2025-04-14  6:53   ` Ingo Molnar
@ 2025-04-14  7:05   ` Ingo Molnar
  2025-04-15  6:36     ` Ankur Arora
  2025-04-22  6:36     ` Raghavendra K T
  2025-04-15 10:16   ` Mateusz Guzik
  2 siblings, 2 replies; 38+ messages in thread
From: Ingo Molnar @ 2025-04-14  7:05 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


* Ankur Arora <ankur.a.arora@oracle.com> wrote:

> clear_pages_rep(), clear_pages_erms() use string instructions to zero
> memory. When operating on more than a single page, we can use these
> more effectively by explicitly advertising the region-size to the
> processor, which can use that as a hint to optimize the clearing
> (ex. by eliding cacheline allocation.)
> 
> As a secondary benefit, string instructions are typically microcoded,
> and working with larger regions helps amortize the cost of the decode.

Not just the decoding, but also iterations around page-sized chunks are
not cheap these days: there are various compiler-generated mitigations
and other overhead that apply on a typical kernel, and using larger
sizes amortizes that per-page-iteration setup cost.

> When zeroing the 2MB page, maximize spatial locality by clearing in 
> three sections: the faulting page and its immediate neighbourhood, the
> left and the right regions, with the local neighbourhood cleared last.

s/zeroing the 2MB page
 /zeroing a 2MB page


> It's not entirely clear why the performance for pg-sz=2MB improves. 
> We decode fewer instructions and the hardware prefetcher can do a 
> better job, but the perf stats for both of those aren't convincing 
> enough to the extent of ~30%.

s/why the performance
 /why performance

> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
> see a drop in cycles but there's no drop in cacheline allocation.

s/Icelakex, behaves similarly
 /Icelakex behaves similarly

> Performance for preempt=none|voluntary remains unchanged.

CONFIG_PREEMPT_VOLUNTARY=y is the default on a number of major 
distributions, such as Ubuntu, and a lot of enterprise distro kernels - 
and this patch does nothing for them, for no good reason.

So could you please provide a sensible size granularity cutoff of 16MB 
or so on non-preemptible kernels, instead of this weird build-time 
all-or-nothing binary cutoff based on preemption modes?

On preempt=full/lazy the granularity limit would be infinite.

I.e. the only code dependent on the preemption mode should be the size
cutoff/limit.

On full/lazy preemption the code would, ideally, compile to something 
close to your current code.
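
Something like the sketch below, perhaps (illustrative only; the 16MB
value and the helper are placeholders, not code from this series):

	/* bytes cleared between cond_resched() calls on non-preemptible kernels */
	#define CLEAR_CHUNK_BYTES	(16UL << 20)

	static void clear_folio_pages(struct folio *folio, unsigned long idx,
				      unsigned long npages)
	{
		/* only the chunk size depends on the preemption model */
		unsigned long max_chunk = preempt_model_preemptible() ?
				npages : CLEAR_CHUNK_BYTES / PAGE_SIZE;

		while (npages) {
			unsigned long chunk = min(npages, max_chunk);

			clear_pages(page_address(folio_page(folio, idx)), chunk);
			idx += chunk;
			npages -= chunk;
			cond_resched();
		}
	}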

> +obj-$(CONFIG_PREEMPTION)	+= memory.o

> +#ifndef CONFIG_HIGHMEM
> +/*
> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().

We don't care much about HIGHMEM these days, I suppose, but this
dependency still feels wrong. This is a stealth dependency on x86-64,
trying to avoid a new arch Kconfig for this new API, right? ;-)

Thanks,

	Ingo



* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14  6:32   ` Ingo Molnar
@ 2025-04-14 11:02     ` Peter Zijlstra
  2025-04-14 11:14       ` Ingo Molnar
                         ` (2 more replies)
  2025-04-14 19:52     ` Ankur Arora
  1 sibling, 3 replies; 38+ messages in thread
From: Peter Zijlstra @ 2025-04-14 11:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, paulmck, rostedt, tglx, willy,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On Mon, Apr 14, 2025 at 08:32:29AM +0200, Ingo Molnar wrote:

> >  static inline void clear_page(void *page)
> >  {
> > +	unsigned int length = PAGE_SIZE;
> >  	/*
> > -	 * Clean up KMSAN metadata for the page being cleared. The assembly call
> > +	 * Clean up KMSAN metadata for the pages being cleared. The assembly call
> >  	 * below clobbers @page, so we perform unpoisoning before it.
> 
> >  	 */
> > -	kmsan_unpoison_memory(page, PAGE_SIZE);
> > -	alternative_call_2(clear_page_orig,
> > -			   clear_page_rep, X86_FEATURE_REP_GOOD,
> > -			   clear_page_erms, X86_FEATURE_ERMS,
> > +	kmsan_unpoison_memory(page, length);
> > +
> > +	alternative_call_2(clear_pages_orig,
> > +			   clear_pages_rep, X86_FEATURE_REP_GOOD,
> > +			   clear_pages_erms, X86_FEATURE_ERMS,
> >  			   "=D" (page),
> > -			   "D" (page),
> > +			   ASM_INPUT("D" (page), "S" (length)),
> >  			   "cc", "memory", "rax", "rcx");
> >  }
> >  
> > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> > index a508e4a8c66a..bce516263b69 100644
> > --- a/arch/x86/lib/clear_page_64.S
> > +++ b/arch/x86/lib/clear_page_64.S
> > @@ -13,20 +13,35 @@
> >   */
> >  
> >  /*
> > - * Zero a page.
> > - * %rdi	- page
> > + * Zero kernel page aligned region.
> > + *
> > + * Input:
> > + * %rdi	- destination
> > + * %esi	- length
> > + *
> > + * Clobbers: %rax, %rcx
> >   */
> > -SYM_TYPED_FUNC_START(clear_page_rep)
> > -	movl $4096/8,%ecx
> > +SYM_TYPED_FUNC_START(clear_pages_rep)
> > +	movl %esi, %ecx
> >  	xorl %eax,%eax
> > +	shrl $3,%ecx
> >  	rep stosq
> >  	RET
> > -SYM_FUNC_END(clear_page_rep)
> > -EXPORT_SYMBOL_GPL(clear_page_rep)
> > +SYM_FUNC_END(clear_pages_rep)
> > +EXPORT_SYMBOL_GPL(clear_pages_rep)
> >  
> > -SYM_TYPED_FUNC_START(clear_page_orig)
> > +/*
> > + * Original page zeroing loop.
> > + * Input:
> > + * %rdi	- destination
> > + * %esi	- length
> > + *
> > + * Clobbers: %rax, %rcx, %rflags
> > + */
> > +SYM_TYPED_FUNC_START(clear_pages_orig)
> > +	movl   %esi, %ecx
> >  	xorl   %eax,%eax
> > -	movl   $4096/64,%ecx
> > +	shrl   $6,%ecx
> 
> So if the natural input parameter is RCX, why is this function using 
> RSI as the input 'length' parameter? Causes unnecessary register 
> shuffling.

This symbol is written as a C function with C calling convention, even
though it is only meant to be called from that clear_page() alternative.

If we want to go change all this, then we should go do the same we do
for __clear_user() and write it thusly:

	asm volatile(ALTERNATIVE("rep stosb",
				 "call rep_stos_alternative", ALT_NOT(X86_FEATURE_FSRS)
				 : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
				 : "a" (0))

And forget about all those clear_page_*() thingies.



* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14 11:02     ` Peter Zijlstra
@ 2025-04-14 11:14       ` Ingo Molnar
  2025-04-14 19:46       ` Ankur Arora
  2025-04-14 22:26       ` Mateusz Guzik
  2 siblings, 0 replies; 38+ messages in thread
From: Ingo Molnar @ 2025-04-14 11:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, paulmck, rostedt, tglx, willy,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Apr 14, 2025 at 08:32:29AM +0200, Ingo Molnar wrote:
> 
> > >  static inline void clear_page(void *page)
> > >  {
> > > +	unsigned int length = PAGE_SIZE;
> > >  	/*
> > > -	 * Clean up KMSAN metadata for the page being cleared. The assembly call
> > > +	 * Clean up KMSAN metadata for the pages being cleared. The assembly call
> > >  	 * below clobbers @page, so we perform unpoisoning before it.
> > 
> > >  	 */
> > > -	kmsan_unpoison_memory(page, PAGE_SIZE);
> > > -	alternative_call_2(clear_page_orig,
> > > -			   clear_page_rep, X86_FEATURE_REP_GOOD,
> > > -			   clear_page_erms, X86_FEATURE_ERMS,
> > > +	kmsan_unpoison_memory(page, length);
> > > +
> > > +	alternative_call_2(clear_pages_orig,
> > > +			   clear_pages_rep, X86_FEATURE_REP_GOOD,
> > > +			   clear_pages_erms, X86_FEATURE_ERMS,
> > >  			   "=D" (page),
> > > -			   "D" (page),
> > > +			   ASM_INPUT("D" (page), "S" (length)),
> > >  			   "cc", "memory", "rax", "rcx");
> > >  }
> > >  
> > > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> > > index a508e4a8c66a..bce516263b69 100644
> > > --- a/arch/x86/lib/clear_page_64.S
> > > +++ b/arch/x86/lib/clear_page_64.S
> > > @@ -13,20 +13,35 @@
> > >   */
> > >  
> > >  /*
> > > - * Zero a page.
> > > - * %rdi	- page
> > > + * Zero kernel page aligned region.
> > > + *
> > > + * Input:
> > > + * %rdi	- destination
> > > + * %esi	- length
> > > + *
> > > + * Clobbers: %rax, %rcx
> > >   */
> > > -SYM_TYPED_FUNC_START(clear_page_rep)
> > > -	movl $4096/8,%ecx
> > > +SYM_TYPED_FUNC_START(clear_pages_rep)
> > > +	movl %esi, %ecx
> > >  	xorl %eax,%eax
> > > +	shrl $3,%ecx
> > >  	rep stosq
> > >  	RET
> > > -SYM_FUNC_END(clear_page_rep)
> > > -EXPORT_SYMBOL_GPL(clear_page_rep)
> > > +SYM_FUNC_END(clear_pages_rep)
> > > +EXPORT_SYMBOL_GPL(clear_pages_rep)
> > >  
> > > -SYM_TYPED_FUNC_START(clear_page_orig)
> > > +/*
> > > + * Original page zeroing loop.
> > > + * Input:
> > > + * %rdi	- destination
> > > + * %esi	- length
> > > + *
> > > + * Clobbers: %rax, %rcx, %rflags
> > > + */
> > > +SYM_TYPED_FUNC_START(clear_pages_orig)
> > > +	movl   %esi, %ecx
> > >  	xorl   %eax,%eax
> > > -	movl   $4096/64,%ecx
> > > +	shrl   $6,%ecx
> > 
> > So if the natural input parameter is RCX, why is this function using 
> > RSI as the input 'length' parameter? Causes unnecessary register 
> > shuffling.
> 
> This symbol is written as a C function with C calling convention, even
> though it is only meant to be called from that clear_page() alternative.
>
> If we want to go change all this, then we should go do the same we do
> for __clear_user() and write it thusly:
> 
> 	asm volatile(ALTERNATIVE("rep stosb",
> 				 "call rep_stos_alternative", ALT_NOT(X86_FEATURE_FSRS)
> 				 : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
> 				 : "a" (0))
> 
> And forget about all those clear_page_*() thingies.

Yeah.

Thanks,

	Ingo



* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-14  6:36 ` Ingo Molnar
@ 2025-04-14 19:19   ` Ankur Arora
  0 siblings, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-14 19:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> We also see performance improvement for cases where this optimization is
>> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded which can now be amortized over
>> larger regions and the hint allows the hardware prefetcher to do a
>> better job.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s  +- stddev)      (GB/s  +- stddev)
>>
>>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>
> How was this measured? Could you integrate this measurement as a new
> tools/perf/bench/ subcommand so that people can try it on different
> systems, etc.? There's already a 'perf bench mem' subcommand space
> where this feature could be added to.

This was a trivial standalone mmap workload, similar to what qemu does
when creating a VM -- really, any hugetlb mmap().

x86-64-stosq (lib/memset_64.S::__memset) should have the same performance
characteristics but it uses malloc() for allocation.

For this workload we want to control the allocation path as well. Let me
see if it makes sense to extend perf bench mem memset to optionally allocate
via mmap(MAP_HUGETLB) or add a new workload under perf bench mem which
does that.

Thanks for the review!

--
ankur



* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-14  5:34 ` [PATCH v3 0/4] mm/folio_zero_user: add " Ingo Molnar
@ 2025-04-14 19:30   ` Ankur Arora
  0 siblings, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-14 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Ankur Arora (4):
>>   x86/clear_page: extend clear_page*() for multi-page clearing
>>   x86/clear_page: add clear_pages()
>>   huge_page: allow arch override for folio_zero_user()
>>   x86/folio_zero_user: multi-page clearing
>
> These are not how x86 commit titles should look like. Please take a
> look at the titles of previous commits to the x86 files you are
> modifying and follow that style. (Capitalization, use of verbs, etc.)

Ack. Will fix.

--
ankur



* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14 11:02     ` Peter Zijlstra
  2025-04-14 11:14       ` Ingo Molnar
@ 2025-04-14 19:46       ` Ankur Arora
  2025-04-14 22:26       ` Mateusz Guzik
  2 siblings, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-14 19:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Ankur Arora, linux-kernel, linux-mm, x86, torvalds,
	akpm, bp, dave.hansen, hpa, mingo, luto, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Apr 14, 2025 at 08:32:29AM +0200, Ingo Molnar wrote:

[ ... ]

>> > -SYM_TYPED_FUNC_START(clear_page_orig)
>> > +/*
>> > + * Original page zeroing loop.
>> > + * Input:
>> > + * %rdi	- destination
>> > + * %esi	- length
>> > + *
>> > + * Clobbers: %rax, %rcx, %rflags
>> > + */
>> > +SYM_TYPED_FUNC_START(clear_pages_orig)
>> > +	movl   %esi, %ecx
>> >  	xorl   %eax,%eax
>> > -	movl   $4096/64,%ecx
>> > +	shrl   $6,%ecx
>>
>> So if the natural input parameter is RCX, why is this function using
>> RSI as the input 'length' parameter? Causes unnecessary register
>> shuffling.
>
> This symbol is written as a C function with C calling convention, even
> though it is only meant to be called from that clear_page() alternative.
>
> If we want to go change all this, then we should go do the same we do
> for __clear_user() and write it thusly:
>
> 	asm volatile(ALTERNATIVE("rep stosb",
> 				 "call rep_stos_alternative", ALT_NOT(X86_FEATURE_FSRS)
> 				 : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
> 				 : "a" (0))
>
> And forget about all those clear_page_*() thingies.

Yeah, this makes sense. We don't call any of the clear_pages_*() variants
from anywhere else.

clear_pages_rep() and clear_pages_erms() are trivial enough to be
inlined in the ALTERNATIVE as well.
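
Something like this, I suppose (a rough sketch, untested; the fallback
name is a placeholder and would change per Ingo's renaming suggestion):

	static inline void clear_pages(void *addr, unsigned int nsubpages)
	{
		unsigned long len = nsubpages * PAGE_SIZE;

		kmsan_unpoison_memory(addr, len);
		asm volatile(ALTERNATIVE("rep stosb",
					 "call clear_pages_alternative",
					 ALT_NOT(X86_FEATURE_FSRS))
			     : "+c" (len), "+D" (addr), ASM_CALL_CONSTRAINT
			     : "a" (0)
			     : "memory");
	}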

Thanks!

--
ankur



* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14  6:32   ` Ingo Molnar
  2025-04-14 11:02     ` Peter Zijlstra
@ 2025-04-14 19:52     ` Ankur Arora
  2025-04-14 20:09       ` Matthew Wilcox
  1 sibling, 1 reply; 38+ messages in thread
From: Ankur Arora @ 2025-04-14 19:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> clear_page*() variants now take a page-aligned length parameter and
>> clears the whole region.
>
> Please read your changelogs and fix typos. ;-)

Ack!

>> +void clear_pages_orig(void *page, unsigned int length);
>> +void clear_pages_rep(void *page, unsigned int length);
>> +void clear_pages_erms(void *page, unsigned int length);
>
> What unit is 'length' in? If it's bytes, why is this interface
> artificially limiting itself to ~4GB? On x86-64 there's very little (if

I was in two minds about the unit. Given that the largest page size is
1GB, I decided to go with 32-bit. But, as you say below, there's no
reason to limit the x86-64 interface for MM reasons. Will fix.

> any) performance difference between a 32-bit and a 64-bit length
> iterations.
>
> Even if we end up only exposing a 32-bit length API to the generic MM
> layer, there's no reason to limit the x86-64 assembly code in such a
> fashion.
>
>>  static inline void clear_page(void *page)
>>  {
>> +	unsigned int length = PAGE_SIZE;
>>  	/*
>> -	 * Clean up KMSAN metadata for the page being cleared. The assembly call
>> +	 * Clean up KMSAN metadata for the pages being cleared. The assembly call
>>  	 * below clobbers @page, so we perform unpoisoning before it.
>
>>  	 */
>> -	kmsan_unpoison_memory(page, PAGE_SIZE);
>> -	alternative_call_2(clear_page_orig,
>> -			   clear_page_rep, X86_FEATURE_REP_GOOD,
>> -			   clear_page_erms, X86_FEATURE_ERMS,
>> +	kmsan_unpoison_memory(page, length);
>> +
>> +	alternative_call_2(clear_pages_orig,
>> +			   clear_pages_rep, X86_FEATURE_REP_GOOD,
>> +			   clear_pages_erms, X86_FEATURE_ERMS,
>>  			   "=D" (page),
>> -			   "D" (page),
>> +			   ASM_INPUT("D" (page), "S" (length)),
>>  			   "cc", "memory", "rax", "rcx");
>>  }
>>
>> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> index a508e4a8c66a..bce516263b69 100644
>> --- a/arch/x86/lib/clear_page_64.S
>> +++ b/arch/x86/lib/clear_page_64.S
>> @@ -13,20 +13,35 @@
>>   */
>>
>>  /*
>> - * Zero a page.
>> - * %rdi	- page
>> + * Zero kernel page aligned region.
>> + *
>> + * Input:
>> + * %rdi	- destination
>> + * %esi	- length
>> + *
>> + * Clobbers: %rax, %rcx
>>   */
>> -SYM_TYPED_FUNC_START(clear_page_rep)
>> -	movl $4096/8,%ecx
>> +SYM_TYPED_FUNC_START(clear_pages_rep)
>> +	movl %esi, %ecx
>>  	xorl %eax,%eax
>> +	shrl $3,%ecx
>>  	rep stosq
>>  	RET
>> -SYM_FUNC_END(clear_page_rep)
>> -EXPORT_SYMBOL_GPL(clear_page_rep)
>> +SYM_FUNC_END(clear_pages_rep)
>> +EXPORT_SYMBOL_GPL(clear_pages_rep)
>>
>> -SYM_TYPED_FUNC_START(clear_page_orig)
>> +/*
>> + * Original page zeroing loop.
>> + * Input:
>> + * %rdi	- destination
>> + * %esi	- length
>> + *
>> + * Clobbers: %rax, %rcx, %rflags
>> + */
>> +SYM_TYPED_FUNC_START(clear_pages_orig)
>> +	movl   %esi, %ecx
>>  	xorl   %eax,%eax
>> -	movl   $4096/64,%ecx
>> +	shrl   $6,%ecx
>
> So if the natural input parameter is RCX, why is this function using
> RSI as the input 'length' parameter? Causes unnecessary register
> shuffling.

True. Will fix via ALTERNATIVE as Peter describes in the other mail.
That should get rid of some of the repetition in clear_pages_rep() and
clear_pages_erms() as well.

Though I guess clear_pages_orig() will need to stick around, just with
a different parameter-passing convention.

>> +/*
>> + * Zero kernel page aligned region.
>> + *
>> + * Input:
>> + * %rdi	- destination
>> + * %esi	- length
>> + *
>> + * Clobbers: %rax, %rcx
>> + */
>> +SYM_TYPED_FUNC_START(clear_pages_erms)
>> +	movl %esi, %ecx
>>  	xorl %eax,%eax
>>  	rep stosb
>>  	RET
>
> Same observation: unnecessary register shuffling.
>
> Also, please rename this (now-) terribly named interface:
>
>> +void clear_pages_orig(void *page, unsigned int length);
>> +void clear_pages_rep(void *page, unsigned int length);
>> +void clear_pages_erms(void *page, unsigned int length);
>
> Because the 'pages' is now a bit misleading, and why is the starting
> address called a 'page'?

> So a more sensible namespace would be to follow memset nomenclature:
>
> 	void memzero_page_aligned_*(void *addr, unsigned long len);
>
> ... and note the intentional abbreviation to 'len'.

Ack.

> Also, since most of these changes are to x86 architecture code, this is
> a new interface only used by x86, and the MM glue is minimal, I'd like
> to merge this series via the x86 tree, if the glue gets acks from MM
> folks.

Works for me!


Thanks for the review.

--
ankur



* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14 19:52     ` Ankur Arora
@ 2025-04-14 20:09       ` Matthew Wilcox
  2025-04-15 21:59         ` Ankur Arora
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2025-04-14 20:09 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Ingo Molnar, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	jon.grimm, bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On Mon, Apr 14, 2025 at 12:52:37PM -0700, Ankur Arora wrote:
> Ingo Molnar <mingo@kernel.org> writes:
> >> +void clear_pages_orig(void *page, unsigned int length);
> >> +void clear_pages_rep(void *page, unsigned int length);
> >> +void clear_pages_erms(void *page, unsigned int length);
> >
> > What unit is 'length' in? If it's bytes, why is this interface
> > artificially limiting itself to ~4GB? On x86-64 there's very little (if
> 
> I was in two minds about the unit. Given that the largest page size is
> 1GB, decided to go with 32bit. But, as you say below, there's no reason
> to limit the x86-64 interface for MM reasons.  Will fix.

Actually, I think there is (and we went through this with SPARC, if you
remember?). We _shouldn't_ be calling memset() with a large size (i.e.
larger than 4GB). If we have that much memory to clear, we should be
doing something smarter, like using padata to get lots of CPUs clearing
individual portions of the page.

I don't know how relevant this is now that you're going to be using
ALTERNATIVES.



* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-14  6:53   ` Ingo Molnar
@ 2025-04-14 21:21     ` Ankur Arora
  0 siblings, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-14 21:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> clear_pages_rep(), clear_pages_erms() use string instructions to zero
>> memory. When operating on more than a single page, we can use these
>> more effectively by explicitly advertising the region-size to the
>> processor, which can use that as a hint to optimize the clearing
>> (ex. by eliding cacheline allocation.)
>
>> +#ifndef CONFIG_HIGHMEM
>> +/*
>> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
>> + *
>> + * Taking inspiration from the common code variant, we split the zeroing in
>> + * three parts: left of the fault, right of the fault, and up to 5 pages
>> + * in the immediate neighbourhood of the target page.
>> + *
>> + * Cleared in that order to keep cache lines of the target region hot.
>> + *
>> + * For gigantic pages, there is no expectation of cache locality so just do a
>> + * straight zero.
>> + */
>> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
>> +{
>> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> +	int width = 2; /* pages cleared last on either side */
>> +	struct range r[3];
>> +	int i;
>> +
>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> +		clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>
>> +			clear_pages(page_address(folio_page(folio, r[i].start)), len);
>
> So the _user postfix naming is super confusing here and elsewhere in
> this series.

The problem is that the _user naming comes from the MM interface name and
is meant to address architectures where you might need to do more than
just zero the kernel address range for the page.

> clear_page(), and by extension the clear_pages() interface you extended
> it to, fundamentally only works on kernel addresses:

Agreed.

> /*
>  * Zero a page.
>  * %rdi - page
>  */
> SYM_TYPED_FUNC_START(clear_page_rep)
>         movl $4096/8,%ecx
>         xorl %eax,%eax
>         rep stosq
>         RET
>
> Note the absolute lack of fault & exception handling.

Yeah. And, as you are implying, that is safe because folio_zero_user()
(and this path) is only called after the range has been validated.

> But folio_zero_user*() uses the kernel-space variants of page clearing
> AFAICT (contrary to the naming):
>
> void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> {
>         unsigned int nr_pages = folio_nr_pages(folio);
>
>         if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
>                 clear_gigantic_page(folio, addr_hint, nr_pages);
>         else
>                 process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
> }
>
>
> static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>                                 unsigned int nr_pages)
> {
>         unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>         int i;
>
>         might_sleep();
>         for (i = 0; i < nr_pages; i++) {
>                 cond_resched();
>                 clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>         }
> }
>
> Which on x86 is simply mapped into a kernel-memory interface:
>
> static inline void clear_user_page(void *page, unsigned long vaddr,
>                                    struct page *pg)
> {
>         clear_page(page);
> }
>
> So at minimum this is a misnomer and a confusing mixture of user/kernel
> interface names on an epic scale that TBH should be cleaned up first
> before extended...

I think a comment to avoid this confusion is definitely warranted. About
the mixture of names, I'm not sure how to avoid that. For instance, see
arch/arc/mm/cache.c::clear_user_page():

void clear_user_page(void *to, unsigned long u_vaddr, struct page *page)
{
        struct folio *folio = page_folio(page);
        clear_page(to);
        clear_bit(PG_dc_clean, &folio->flags);
}

arch/arm also does a bunch of stuff which made my head hurt, but the arc
version is clearly different enough.

>> +out:
>> +	/* Explicitly invoke cond_resched() to handle any live patching necessary. */
>> +	cond_resched();
>
> What again?

Yeah, I can see how this looks out of place :). The idea was that even
though we don't need explicit invocations of cond_resched() (because
this path is only called when preemptible), we still need some because
cond_resched() is overloaded to help with live patching.

Anyway, this comment can go away based on your suggestion elsewhere
(extensions for cooperative preemption models).

Thanks for the detailed review.

--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14 11:02     ` Peter Zijlstra
  2025-04-14 11:14       ` Ingo Molnar
  2025-04-14 19:46       ` Ankur Arora
@ 2025-04-14 22:26       ` Mateusz Guzik
  2025-04-15  6:14         ` Ankur Arora
  2 siblings, 1 reply; 38+ messages in thread
From: Mateusz Guzik @ 2025-04-14 22:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Ankur Arora, linux-kernel, linux-mm, x86, torvalds,
	akpm, bp, dave.hansen, hpa, mingo, luto, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk

On Mon, Apr 14, 2025 at 01:02:59PM +0200, Peter Zijlstra wrote:
> This symbol is written as a C function with C calling convention, even
> though it is only meant to be called from that clear_page() alternative.
> 
> If we want to go change all this, then we should go do the same we do
> for __clear_user() and write it thusly:
> 
> 	asm volatile(ALTERNATIVE("rep stosb",
> 				 "call rep_stos_alternative", ALT_NOT(X86_FEATURE_FSRS))
> 				 : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
> 				 : "a" (0));
> 
> And forget about all those clear_page_*() thingies.
> 

I have to disagree.

Next to nobody has FSRS, so for now one would have to expect everyone
would be punting to the routine. Did you mean ERMS, as the sizes are in
fact not short?

rep_stos_alternative() as implemented right now sucks in its own right
("small" areas handled with 8-byte and 1-byte loops, bigger ones with an
unrolled 64-bytes-at-a-time loop, no rep stos{b,q} in sight). Someone(tm)
should fix it, and for the sake of argument suppose that happened. That's
still some code executed to figure out how to zero and to align the buf.

Instead, I think one can start with just retiring clear_page_orig().

With that sucker out of the way, an optional quest is to figure out if
rep stosq vs rep stosb makes any difference for pages -- for all I know
rep stosq is the way. This would require testing on quite a few uarchs
and I'm not going to blame anyone for not being interested.

Let's say nobody bothered OR rep stosb provides a win. In that case this
can trivially ALTERNATIVE between rep stosb and rep stosq based on ERMS,
no func calls necessary.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14 22:26       ` Mateusz Guzik
@ 2025-04-15  6:14         ` Ankur Arora
  2025-04-15  8:22           ` Mateusz Guzik
  0 siblings, 1 reply; 38+ messages in thread
From: Ankur Arora @ 2025-04-15  6:14 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Peter Zijlstra, Ingo Molnar, Ankur Arora, linux-kernel, linux-mm,
	x86, torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, paulmck,
	rostedt, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk


Mateusz Guzik <mjguzik@gmail.com> writes:

> On Mon, Apr 14, 2025 at 01:02:59PM +0200, Peter Zijlstra wrote:
>> This symbol is written as a C function with C calling convention, even
>> though it is only meant to be called from that clear_page() alternative.
>>
>> If we want to go change all this, then we should go do the same we do
>> for __clear_user() and write it thusly:
>>
>> 	asm volatile(ALTERNATIVE("rep stosb",
>> 				 "call rep_stos_alternative", ALT_NOT(X86_FEATURE_FSRS))
>> 				 : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
>> 				 : "a" (0));
>>
>> And forget about all those clear_page_*() thingies.
>>
>
> I have to disagree.
>
> Next to nobody has FSRS, so for now one would have to expect everyone
> would be punting to the routine. Did you mean ERMS as sizes are in fact
> not short?
>
> rep_stos_alternative() as implemented right now sucks in its own right
> ("small" areas sorted out with an 8 byte and 1 byte loops, bigger ones
> unrolled 64 byte loop at a time, no rep stos{b,q} in sight). Someone(tm)
> should fix it and for the sake of argument suppose it happened. That's
> still some code executed to figure out how to zero and to align the buf.
>
> Instead, I think one can start with just retiring clear_page_orig().
>
> With that sucker out of the way, an optional quest is to figure out if
> rep stosq vs rep stosb makes any difference for pages -- for all I know
> rep stosq is the way. This would require testing on quite a few uarchs
> and I'm not going to blame anyone for not being interested.

IIRC some recent AMD models (Rome?) did expose REP_GOOD but not ERMS.

> Let's say nobody bothered OR rep stosb provides a win. In that case this
> can trivially ALTERNATIVE between rep stosb and rep stosq based on ERMS,
> no func calls necessary.

We shouldn't need any function calls for ERMS and REP_GOOD.

I think something like this untested code should work:

 	asm volatile(
            ALTERNATIVE_2("call clear_pages_orig",
                          "rep stosb", X86_FEATURE_REP_GOOD,
                          "shrl $3,%ecx; rep stosq", X86_FEATURE_ERMS,
 		          : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
                          : "a" (0)))

--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-14  7:05   ` Ingo Molnar
@ 2025-04-15  6:36     ` Ankur Arora
  2025-04-22  6:36     ` Raghavendra K T
  1 sibling, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-15  6:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Ingo Molnar <mingo@kernel.org> writes:

> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> clear_pages_rep(), clear_pages_erms() use string instructions to zero
>> memory. When operating on more than a single page, we can use these
>> more effectively by explicitly advertising the region-size to the
>> processor, which can use that as a hint to optimize the clearing
>> (ex. by eliding cacheline allocation.)
>>
>> As a secondary benefit, string instructions are typically microcoded,
>> and working with larger regions helps amortize the cost of the decode.
>
> Not just the decoding, but also iterations around page-sized chunks are
> not cheap these days: there's various compiler generated mitigations
> and other overhead that applies on a typical kernel, and using larger
> sizes amortizes that per-page-iteration setup cost.

Thanks. Yeah, I was completely forgetting that even the cost of returns
has gone up in the mitigation era :D.

Is retbleed the one you were alluding to, or are there others that would
apply here as well?

>> When zeroing the 2MB page, maximize spatial locality by clearing in
>> three sections: the faulting page and its immediate neighbourhood, the
>> left and the right regions, with the local neighbourhood cleared last.
>
> s/zeroing the 2MB page
>  /zeroing a 2MB page
>
>
>> It's not entirely clear why the performance for pg-sz=2MB improves.
>> We decode fewer instructions and the hardware prefetcher can do a
>> better job, but the perf stats for both of those aren't convincing
>> enough to the extent of ~30%.
>
> s/why the performance
>  /why performance
>
>> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
>> see a drop in cycles but there's no drop in cacheline allocation.
>
> s/Icelakex, behaves similarly
>  /Icelakex behaves similarly

Ack to all of the above.

>> Performance for preempt=none|voluntary remains unchanged.
>
> CONFIG_PREEMPT_VOLUNTARY=y is the default on a number of major
> distributions, such as Ubuntu, and a lot of enterprise distro kernels -
> and this patch does nothing for them, for no good reason.
> So could you please provide a sensible size granularity cutoff of 16MB
> or so on non-preemptible kernels, instead of this weird build-time
> all-or-nothing binary cutoff based on preemption modes?

So, the reason for associating this with preemption modes was in no
small part the difficulty of deciding a sensible granularity cutoff.

I had done a variety of chunking for an earlier version which was a bit
of a mess:
https://lore.kernel.org/lkml/20220606203725.1313715-11-ankur.a.arora@oracle.com/.

Fixed-size chunking should be straightforward enough. However, 16MB is
around 1.6ms if you zero at 10GB/s, and longer if you are on older
hardware.

> On preempt=full/lazy the granularity limit would be infinite.
>
> I.e the only code dependent on the preemption mode should be the size
> cutoff/limit.
> On full/lazy preemption the code would, ideally, compile to something
> close to your current code.

Yeah, agree.
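
Maybe something like the following (hypothetical sketch, names made up;
assumes clear_pages() takes a page count, and picks an arbitrary 8MB
chunk for the cooperative models):

	/* ~0.8ms between rescheds at 10GB/s */
	#define CLEAR_PAGES_CHUNK	((8 << 20) / PAGE_SIZE)

	static void clear_pages_resched(void *addr, unsigned long npages)
	{
		unsigned long chunk;

		while (npages) {
			/* No cutoff needed under full/lazy preemption. */
			chunk = preempt_model_preemptible() ?
				npages : min_t(unsigned long, npages,
					       CLEAR_PAGES_CHUNK);

			clear_pages(addr, chunk);
			addr += chunk * PAGE_SIZE;
			npages -= chunk;

			cond_resched();
		}
	}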

>> +obj-$(CONFIG_PREEMPTION)	+= memory.o
>
>> +#ifndef CONFIG_HIGHMEM
>> +/*
>> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
>
> We don't care much about HIGHMEM these days I suppose, but this
> dependency still feels wrong. Is this a stealth dependency on x86-64,
> trying to avoid a new arch Kconfig for this new API, right? ;-)

Alas, nothing so crafty :). HIGHMEM means that we need to map pages in a
hugepage folio via kmap_local_page() -- so we cannot treat a hugepage folio
as contiguous memory and thus cannot use REP; STOS on it.

I guess the CONFIG_HIGHMEM condition clearly warrants a comment.
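
Illustrative only -- with HIGHMEM each page of the folio may need its own
temporary mapping, so the best we can do is the usual per-page loop:

	unsigned long i;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		/* Map, clear, and unmap one page at a time. */
		void *kaddr = kmap_local_page(folio_page(folio, i));

		clear_page(kaddr);
		kunmap_local(kaddr);
	}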

--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-15  6:14         ` Ankur Arora
@ 2025-04-15  8:22           ` Mateusz Guzik
  2025-04-15 20:01             ` Ankur Arora
  0 siblings, 1 reply; 38+ messages in thread
From: Mateusz Guzik @ 2025-04-15  8:22 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm, x86,
	torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, paulmck,
	rostedt, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Tue, Apr 15, 2025 at 8:14 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Mateusz Guzik <mjguzik@gmail.com> writes:
> > With that sucker out of the way, an optional quest is to figure out if
> > rep stosq vs rep stosb makes any difference for pages -- for all I know
> > rep stosq is the way. This would require testing on quite a few uarchs
> > and I'm not going to blame anyone for not being interested.
>
> IIRC some recent AMD models (Rome?) did expose REP_GOOD but not ERMS.
>

Does the uarch not have it, or does the bit magically fail to show up?
Worst case, should rep stosb be faster on that uarch, the kernel can
pretend the bit is set.

> > Let's say nobody bothered OR rep stosb provides a win. In that case this
> > can trivially ALTERNATIVE between rep stosb and rep stosq based on ERMS,
> > no func calls necessary.
>
> We shouldn't need any function calls for ERMS and REP_GOOD.
>
> I think something like this untested code should work:
>
>         asm volatile(
>             ALTERNATIVE_2("call clear_pages_orig",
>                           "shrl $3,%ecx; rep stosq", X86_FEATURE_REP_GOOD,
>                           "rep stosb", X86_FEATURE_ERMS)
>                           : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
>                           : "a" (0));
>

That's what I'm suggesting, with one difference: whack
clear_pages_orig altogether.

-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-14  3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
  2025-04-14  6:53   ` Ingo Molnar
  2025-04-14  7:05   ` Ingo Molnar
@ 2025-04-15 10:16   ` Mateusz Guzik
  2025-04-15 21:46     ` Ankur Arora
  2 siblings, 1 reply; 38+ messages in thread
From: Mateusz Guzik @ 2025-04-15 10:16 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote:
> clear_pages_rep(), clear_pages_erms() use string instructions to zero
> memory. When operating on more than a single page, we can use these
> more effectively by explicitly advertising the region-size to the
> processor, which can use that as a hint to optimize the clearing
> (ex. by eliding cacheline allocation.)
> 
> As a secondary benefit, string instructions are typically microcoded,
> and working with larger regions helps amortize the cost of the decode.
> 
> When zeroing the 2MB page, maximize spatial locality by clearing in 
> three sections: the faulting page and its immediate neighbourhood, the
> left and the right regions, with the local neighbourhood cleared last.
> 
> Performance
> ==
> 
> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local
> NUMA node.
> 
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
> 
>                  mm/folio_zero_user    x86/folio_zero_user     change
>                   (GB/s  +- stddev)      (GB/s  +- stddev)
> 
>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
> 
> Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
> allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB.
> 
> pg-sz=1GB:
>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
>   -    544,878,976      instructions                     #    0.06  insn per cycle
>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
> 
>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
>   +     10,979,121      instructions                     #    0.00  insn per cycle
>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
> 
> That's not the case with pg-sz=2MB, where we also perform better but
> the number of cacheline allocations remain the same.
> 
> It's not entirely clear why the performance for pg-sz=2MB improves. We
> decode fewer instructions and the hardware prefetcher can do a better
> job, but the perf stats for both of those aren't convincing enough to
> the extent of ~30%.
> 
> pg-sz=2MB:
>   - 13,110,306,584      cycles                           #    2.418 GHz                         ( +-  0.48% )  (46.13%)
>   -    607,589,360      instructions                     #    0.05  insn per cycle
>   -  2,416,130,434      L1-dcache-loads                  #  445.682 M/sec                       ( +-  0.08% )  (46.19%)
>   -  1,080,187,594      L1-dcache-load-misses            #   44.71% of all L1-dcache accesses   ( +-  0.01% )  (46.18%)
> 
>   +  9,624,624,178      cycles                           #    2.418 GHz                         ( +-  0.01% )  (46.13%)
>   +    277,336,691      instructions                     #    0.03  insn per cycle
>   +  2,251,220,599      L1-dcache-loads                  #  565.624 M/sec                       ( +-  0.01% )  (46.20%)
>   +  1,092,386,130      L1-dcache-load-misses            #   48.52% of all L1-dcache accesses   ( +-  0.02% )  (46.19%)
> 
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
> 
>                  mm/folio_zero_user    x86/folio_zero_user     change
>                   (GB/s +- stddev)      (GB/s +- stddev)
> 
>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
> 
> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
> see a drop in cycles but there's no drop in cacheline allocation.
> 

Back when I was young and handsome and 32-bit x86 was king, people
assumed 4K pages needed to be cleared with non-temporal stores to avoid
evicting stuff from caches. I had never seen measurements showing this
has the intended effect. Some time after this became a thing I did see
measurements showing that this in fact *increases* cache misses. I am
not saying this was necessarily the case for all x86 uarchs, merely that
the sensible-sounding assumption turned bogus at some point (if it was
ever legit).

This brings me to the multi-stage clearing employed here for locality.
While it sounds great on paper, for all I know it does not provide any
advantage. It may very well be harmful, by preventing the CPU from
knowing what you are trying to do.

I think doing this warrants obtaining stats from some real workloads,
but given how time consuming this can be I think it would be tolerable
to skip it for now.

> Performance for preempt=none|voluntary remains unchanged.
> 

So I was under the impression the benefit would be realized for all
kernels.

I don't know how preemption support is implemented on Linux. Do you
always get an IPI?

I was thinking something like this: a per-cpu var akin to preemption
count, but indicating the particular code section is fully preemptible

Then:

preemptible_enter();
clear_pages();
preemptible_exit();

For simpler handling of the var, it could prevent migration to other
CPUs.

Then the IPI handler for preemption would check whether ->preemptible is
set and the preemption-disable count is zero, in which case it would
take you off the CPU.
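
Rough sketch of the var handling (all names made up):

	DEFINE_PER_CPU(int, fully_preemptible);

	static inline void preemptible_enter(void)
	{
		/* keep the section on this CPU for simpler accounting */
		migrate_disable();
		this_cpu_inc(fully_preemptible);
	}

	static inline void preemptible_exit(void)
	{
		this_cpu_dec(fully_preemptible);
		migrate_enable();
	}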

If this is a problem, then a better granularity would help (say 8 pages
between cond_rescheds?)

> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/x86/mm/Makefile |  1 +
>  arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h   |  1 +
>  3 files changed, 62 insertions(+)
>  create mode 100644 arch/x86/mm/memory.c
> 
> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> index 32035d5be5a0..e61b4d331cdf 100644
> --- a/arch/x86/mm/Makefile
> +++ b/arch/x86/mm/Makefile
> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
>  obj-$(CONFIG_NUMA)		+= numa.o numa_$(BITS).o
>  obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
>  obj-$(CONFIG_ACPI_NUMA)		+= srat.o
> +obj-$(CONFIG_PREEMPTION)	+= memory.o
>  
>  obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
>  obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
> new file mode 100644
> index 000000000000..99851c246fcc
> --- /dev/null
> +++ b/arch/x86/mm/memory.c
> @@ -0,0 +1,60 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +#include <linux/mm.h>
> +#include <linux/range.h>
> +#include <linux/minmax.h>
> +
> +#ifndef CONFIG_HIGHMEM
> +/*
> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
> + *
> + * Taking inspiration from the common code variant, we split the zeroing in
> + * three parts: left of the fault, right of the fault, and up to 5 pages
> + * in the immediate neighbourhood of the target page.
> + *
> + * Cleared in that order to keep cache lines of the target region hot.
> + *
> + * For gigantic pages, there is no expectation of cache locality so just do a
> + * straight zero.
> + */
> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
> +{
> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
> +	int width = 2; /* pages cleared last on either side */
> +	struct range r[3];
> +	int i;
> +
> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
> +		clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
> +		goto out;
> +	}
> +
> +	/*
> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
> +	 * ensure it sticks around in the cache.
> +	 */
> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
> +
> +	/* Region to the left of the fault */
> +	r[1] = DEFINE_RANGE(pg.start,
> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
> +
> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> +			    pg.end);
> +
> +	for (i = 0; i <= 2; i++) {
> +		int len = range_len(&r[i]);
> +
> +		if (len > 0)
> +			clear_pages(page_address(folio_page(folio, r[i].start)), len);
> +	}
> +
> +out:
> +	/* Explicitly invoke cond_resched() to handle any live patching necessary. */
> +	cond_resched();
> +}
> +
> +#endif /* CONFIG_HIGHMEM */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b7f13f087954..b57512da8173 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4114,6 +4114,7 @@ enum mf_action_page_type {
>  };
>  
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint);
>  void folio_zero_user(struct folio *folio, unsigned long addr_hint);
>  int copy_user_large_folio(struct folio *dst, struct folio *src,
>  			  unsigned long addr_hint,
> -- 
> 2.31.1
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
                   ` (5 preceding siblings ...)
  2025-04-14  6:36 ` Ingo Molnar
@ 2025-04-15 19:10 ` Zi Yan
  2025-04-22 19:32   ` Ankur Arora
  2025-04-22  6:23 ` Raghavendra K T
  7 siblings, 1 reply; 38+ messages in thread
From: Zi Yan @ 2025-04-15 19:10 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On 13 Apr 2025, at 23:46, Ankur Arora wrote:

> This series adds multi-page clearing for hugepages. It is a rework
> of [1] which took a detour through PREEMPT_LAZY [2].
>
> Why multi-page clearing?: multi-page clearing improves upon the
> current page-at-a-time approach by providing the processor with a
> hint as to the real region size. A processor could use this hint to,
> for instance, elide cacheline allocation when clearing a large
> region.
>
> This optimization in particular is done by REP; STOS on AMD Zen
> where regions larger than L3-size use non-temporal stores.
>
> This results in significantly better performance.

Do you have init_on_alloc=1 in your kernel?
With that, pages coming from the buddy allocator are zeroed
in post_alloc_hook() by kernel_init_pages(), which is a for loop
of clear_highpage_kasan_tagged(), a wrapper around clear_page().
And folio_zero_user() is not used.
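
For reference, kernel_init_pages() is roughly this page-at-a-time loop:

	static void kernel_init_pages(struct page *page, int numpages)
	{
		int i;

		/* s390's use of memset() could override KASAN redzones. */
		kasan_disable_current();
		for (i = 0; i < numpages; i++)
			clear_highpage_kasan_tagged(page + i);
		kasan_enable_current();
	}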

At least Debian, Fedora, and Ubuntu by default have
CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y, which means init_on_alloc=1.

Maybe kernel_init_pages() should get your optimization as well,
unless you only target hugetlb pages.

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-15  8:22           ` Mateusz Guzik
@ 2025-04-15 20:01             ` Ankur Arora
  2025-04-15 20:32               ` Mateusz Guzik
  0 siblings, 1 reply; 38+ messages in thread
From: Ankur Arora @ 2025-04-15 20:01 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Ankur Arora, Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm,
	x86, torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, paulmck,
	rostedt, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk


Mateusz Guzik <mjguzik@gmail.com> writes:

> On Tue, Apr 15, 2025 at 8:14 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Mateusz Guzik <mjguzik@gmail.com> writes:
>> > With that sucker out of the way, an optional quest is to figure out if
>> > rep stosq vs rep stosb makes any difference for pages -- for all I know
>> > rep stosq is the way. This would require testing on quite a few uarchs
>> > and I'm not going to blame anyone for not being interested.
>>
>> IIRC some recent AMD models (Rome?) did expose REP_GOOD but not ERMS.
>>
>
> The uarch does not have it or the bit magically fails to show up?
> Worst case, should rep stosb be faster on that uarch, the kernel can
> pretend the bit is set.

It's a synthetic bit, so the uarch has both. I think REP STOSB is optimized
post-FSRS (AIUI Zen3):

        if (c->x86 >= 0x10)
                set_cpu_cap(c, X86_FEATURE_REP_GOOD);

        /* AMD FSRM also implies FSRS */
        if (cpu_has(c, X86_FEATURE_FSRM))
                set_cpu_cap(c, X86_FEATURE_FSRS);


>> > Let's say nobody bothered OR rep stosb provides a win. In that case this
>> > can trivially ALTERNATIVE between rep stosb and rep stosq based on ERMS,
>> > no func calls necessary.
>>
>> We shouldn't need any function calls for ERMS and REP_GOOD.
>>
>> I think something like this untested code should work:
>>
>>         asm volatile(
>>             ALTERNATIVE_2("call clear_pages_orig",
>>                           "shrl $3,%ecx; rep stosq", X86_FEATURE_REP_GOOD,
>>                           "rep stosb", X86_FEATURE_ERMS)
>>                           : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
>>                           : "a" (0));
>>
>
> That's what I'm suggesting, with one difference: whack
> clear_pages_orig altogether.

What do we gain by getting rid of it? Maybe there's old hardware with
unoptimized rep; stos*.

--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-15 20:01             ` Ankur Arora
@ 2025-04-15 20:32               ` Mateusz Guzik
  0 siblings, 0 replies; 38+ messages in thread
From: Mateusz Guzik @ 2025-04-15 20:32 UTC (permalink / raw)
  To: Ankur Arora
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, linux-mm, x86,
	torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, paulmck,
	rostedt, tglx, willy, jon.grimm, bharata, raghavendra.kt,
	boris.ostrovsky, konrad.wilk

On Tue, Apr 15, 2025 at 10:02 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Mateusz Guzik <mjguzik@gmail.com> writes:
>
> > On Tue, Apr 15, 2025 at 8:14 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >>
> >>
> >> Mateusz Guzik <mjguzik@gmail.com> writes:
> >> > With that sucker out of the way, an optional quest is to figure out if
> >> > rep stosq vs rep stosb makes any difference for pages -- for all I know
> >> > rep stosq is the way. This would require testing on quite a few uarchs
> >> > and I'm not going to blame anyone for not being interested.
> >>
> >> IIRC some recent AMD models (Rome?) did expose REP_GOOD but not ERMS.
> >>
> >
> > The uarch does not have it or the bit magically fails to show up?
> > Worst case, should rep stosb be faster on that uarch, the kernel can
> > pretend the bit is set.
>
> It's a synthetic bit so the uarch has both. I think REP STOSB is optimized
> post FSRS (AIUI Zen3)
>
>         if (c->x86 >= 0x10)
>                 set_cpu_cap(c, X86_FEATURE_REP_GOOD);
>
>         /* AMD FSRM also implies FSRS */
>         if (cpu_has(c, X86_FEATURE_FSRM))
>                 set_cpu_cap(c, X86_FEATURE_FSRS);
>
>
> >> > Let's say nobody bothered OR rep stosb provides a win. In that case this
> >> > can trivially ALTERNATIVE between rep stosb and rep stosq based on ERMS,
> >> > no func calls necessary.
> >>
> >> We shouldn't need any function calls for ERMS and REP_GOOD.
> >>
> >> I think something like this untested code should work:
> >>
> >>         asm volatile(
> >>             ALTERNATIVE_2("call clear_pages_orig",
> >>                           "shrl $3,%ecx; rep stosq", X86_FEATURE_REP_GOOD,
> >>                           "rep stosb", X86_FEATURE_ERMS)
> >>                           : "+c" (size), "+D" (addr), ASM_CALL_CONSTRAINT
> >>                           : "a" (0));
> >>
> >
> > That's what I'm suggesting, with one difference: whack
> > clear_pages_orig altogether.
>
> What do we gain by getting rid of it? Maybe there's old hardware with
> unoptimized rep; stos*.
>

The string routines (memset, memcpy et al.) need a lot of love, and
preferably nobody would have to spend time placating non-rep users
while sorting them out.

According to wiki, AMD CPUs started getting REP_GOOD in 2007, meaning
you would need something even older than that to not have it. Intel is
presumably in a similar boat.

As it happens, gcc spent several years emitting inlined rep stosq and rep
movsq, so either users don't care or there are no users (well
realistically someone somewhere has a machine like that in the garage,
but fringe cases are not an argument).

rep_movs_alternative has already been punting to rep mov, ignoring the
issue of REP_GOOD, for some time now (admittedly, I removed the non-rep
support :P) and again there are no pitchforks (that I have seen).

So I think it would be best for everyone in the long run to completely
rip out the REP_GOOD thing. For all I know the kernel stopped booting
on machines with such uarchs a long time ago for unrelated reasons.

As far as this specific patchset goes, keeping it is just a waste of
testing to make sure it still works, but I can't *insist* on removing the
routine. I guess it is the x86 maintainers' call whether to whack this.
-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-15 10:16   ` Mateusz Guzik
@ 2025-04-15 21:46     ` Ankur Arora
  2025-04-15 22:01       ` Mateusz Guzik
  0 siblings, 1 reply; 38+ messages in thread
From: Ankur Arora @ 2025-04-15 21:46 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Mateusz Guzik <mjguzik@gmail.com> writes:

> On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote:
>> clear_pages_rep(), clear_pages_erms() use string instructions to zero
>> memory. When operating on more than a single page, we can use these
>> more effectively by explicitly advertising the region-size to the
>> processor, which can use that as a hint to optimize the clearing
>> (ex. by eliding cacheline allocation.)
>>
>> As a secondary benefit, string instructions are typically microcoded,
>> and working with larger regions helps amortize the cost of the decode.
>>
>> When zeroing the 2MB page, maximize spatial locality by clearing in
>> three sections: the faulting page and its immediate neighbourhood, the
>> left and the right regions, with the local neighbourhood cleared last.
>>
>> Performance
>> ==
>>
>> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local
>> NUMA node.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s  +- stddev)      (GB/s  +- stddev)
>>
>>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>
>> Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
>> allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB.
>>
>> pg-sz=1GB:
>>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
>>   -    544,878,976      instructions                     #    0.06  insn per cycle
>>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
>>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
>>
>>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
>>   +     10,979,121      instructions                     #    0.00  insn per cycle
>>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
>>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
>>
>> That's not the case with pg-sz=2MB, where we also perform better but
>> the number of cacheline allocations remain the same.
>>
>> It's not entirely clear why the performance for pg-sz=2MB improves. We
>> decode fewer instructions and the hardware prefetcher can do a better
>> job, but the perf stats for both of those aren't convincing enough to
>> the extent of ~30%.
>>
>> pg-sz=2MB:
>>   - 13,110,306,584      cycles                           #    2.418 GHz                         ( +-  0.48% )  (46.13%)
>>   -    607,589,360      instructions                     #    0.05  insn per cycle
>>   -  2,416,130,434      L1-dcache-loads                  #  445.682 M/sec                       ( +-  0.08% )  (46.19%)
>>   -  1,080,187,594      L1-dcache-load-misses            #   44.71% of all L1-dcache accesses   ( +-  0.01% )  (46.18%)
>>
>>   +  9,624,624,178      cycles                           #    2.418 GHz                         ( +-  0.01% )  (46.13%)
>>   +    277,336,691      instructions                     #    0.03  insn per cycle
>>   +  2,251,220,599      L1-dcache-loads                  #  565.624 M/sec                       ( +-  0.01% )  (46.20%)
>>   +  1,092,386,130      L1-dcache-load-misses            #   48.52% of all L1-dcache accesses   ( +-  0.02% )  (46.19%)
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>                  mm/folio_zero_user    x86/folio_zero_user     change
>>                   (GB/s +- stddev)      (GB/s +- stddev)
>>
>>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>
>> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
>> see a drop in cycles but there's no drop in cacheline allocation.
>>
>
> Back when I was young and handsome and 32-bit x86 was king, people
> assumed 4K pages needed to be cleared with non-temporal stores to avoid
> evicting stuff from caches. I had never seen measurements showing this
> has the intended effect. Some time after this became a thing I did see
> measurements showing that this in fact *increases* cache misses. I am
> not saying this was necessarily the case for all x86 uarchs, merely that
> the sensibly sounding assumption turned bogus at some point (if it was
> ever legit).

That was a long time ago though ;-). And, your point makes sense for
small pages. But consider that zeroing a 1GB page can easily blow
away an L3 cache for absolutely nothing gained -- probabilistically,
nothing from the page that remains in the cache will ever be
accessed.

Now, you could argue that the situation is less clear for 2MB pages.

> This brings me to the multi-stage clearing employed here for locality.
> While it sounds great on paper, for all I know it does not provide any
> advantage. It very well may be it is harmful by preventing the CPU from
> knowing what you are trying to do.
>
> I think doing this warrants obtaining stats from some real workloads,
> but given how time consuming this can be I think it would be tolerable
> to skip it for now.
>
>> Performance for preempt=none|voluntary remains unchanged.
>>
>
> So I was under the impression the benefit would be realized for all
> kernels.
>
> I don't know how preemption support is implemented on Linux. Do you
> always get an IPI?

No. The need-resched bit is common. It's just that there's no preemption
via irqentry, only synchronous calls to cond_resched() (as you mention below).

Zeroing via a subroutine-like instruction (rep; stos) is incompatible with
synchronous calls to cond_resched(), so this code is explicitly not called
for none/voluntary (see patch 3).
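
Very roughly, the selection ends up looking like this (hypothetical
sketch, not the literal patch 3 glue):

	void folio_zero_user(struct folio *folio, unsigned long addr_hint)
	{
		/* Multi-page clearing only under a preemptible model. */
		if (IS_ENABLED(CONFIG_PREEMPTION) && preempt_model_preemptible())
			folio_zero_user_preemptible(folio, addr_hint);
		else
			__folio_zero_user(folio, addr_hint); /* page-at-a-time */
	}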

That said, I'll probably take Ingo's suggestion of chunking things up
in, say, 8/16MB portions for cooperative preemption models.

Ankur


> I was thinking something like this: a per-cpu var akin to preemption
> count, but indicating the particular code section is fully preemptible
>
> Then:
>
> preemptible_enter();
> clear_pages();
> preemptible_exit();
>
> for simpler handling of the var it could prevent migration to other
> CPUs.
>
> then the IPI handler for preemption would check if ->preemptible is set
> + preemption disablement is zero, in which case it would take you off
> cpu.
>
> If this is a problem, then a better granularity would help (say 8 pages
> between cond_rescheds?)
>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>  arch/x86/mm/Makefile |  1 +
>>  arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/mm.h   |  1 +
>>  3 files changed, 62 insertions(+)
>>  create mode 100644 arch/x86/mm/memory.c
>>
>> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
>> index 32035d5be5a0..e61b4d331cdf 100644
>> --- a/arch/x86/mm/Makefile
>> +++ b/arch/x86/mm/Makefile
>> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)	+= testmmiotrace.o
>>  obj-$(CONFIG_NUMA)		+= numa.o numa_$(BITS).o
>>  obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
>>  obj-$(CONFIG_ACPI_NUMA)		+= srat.o
>> +obj-$(CONFIG_PREEMPTION)	+= memory.o
>>
>>  obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
>>  obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
>> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
>> new file mode 100644
>> index 000000000000..99851c246fcc
>> --- /dev/null
>> +++ b/arch/x86/mm/memory.c
>> @@ -0,0 +1,60 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +#include <linux/mm.h>
>> +#include <linux/range.h>
>> +#include <linux/minmax.h>
>> +
>> +#ifndef CONFIG_HIGHMEM
>> +/*
>> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
>> + *
>> + * Taking inspiration from the common code variant, we split the zeroing in
>> + * three parts: left of the fault, right of the fault, and up to 5 pages
>> + * in the immediate neighbourhood of the target page.
>> + *
>> + * Cleared in that order to keep cache lines of the target region hot.
>> + *
>> + * For gigantic pages, there is no expectation of cache locality so just do a
>> + * straight zero.
>> + */
>> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
>> +{
>> +	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> +	const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> +	const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> +	int width = 2; /* pages cleared last on either side */
>> +	struct range r[3];
>> +	int i;
>> +
>> +	if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> +		clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * Faulting page and its immediate neighbourhood. Cleared at the end to
>> +	 * ensure it sticks around in the cache.
>> +	 */
>> +	r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>> +			    clamp_t(s64, fault_idx + width, pg.start, pg.end));
>> +
>> +	/* Region to the left of the fault */
>> +	r[1] = DEFINE_RANGE(pg.start,
>> +			    clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>> +
>> +	/* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>> +	r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>> +			    pg.end);
>> +
>> +	for (i = 0; i <= 2; i++) {
>> +		int len = range_len(&r[i]);
>> +
>> +		if (len > 0)
>> +			clear_pages(page_address(folio_page(folio, r[i].start)), len);
>> +	}
>> +
>> +out:
>> +	/* Explicitly invoke cond_resched() to handle any live patching necessary. */
>> +	cond_resched();
>> +}
>> +
>> +#endif /* CONFIG_HIGHMEM */
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b7f13f087954..b57512da8173 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4114,6 +4114,7 @@ enum mf_action_page_type {
>>  };
>>
>>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>> +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint);
>>  void folio_zero_user(struct folio *folio, unsigned long addr_hint);
>>  int copy_user_large_folio(struct folio *dst, struct folio *src,
>>  			  unsigned long addr_hint,
>> --
>> 2.31.1
>>
>>


--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 1/4] x86/clear_page: extend clear_page*() for multi-page clearing
  2025-04-14 20:09       ` Matthew Wilcox
@ 2025-04-15 21:59         ` Ankur Arora
  0 siblings, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-15 21:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ankur Arora, Ingo Molnar, linux-kernel, linux-mm, x86, torvalds,
	akpm, bp, dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt,
	tglx, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Apr 14, 2025 at 12:52:37PM -0700, Ankur Arora wrote:
>> Ingo Molnar <mingo@kernel.org> writes:
>> >> +void clear_pages_orig(void *page, unsigned int length);
>> >> +void clear_pages_rep(void *page, unsigned int length);
>> >> +void clear_pages_erms(void *page, unsigned int length);
>> >
>> > What unit is 'length' in? If it's bytes, why is this interface
>> > artificially limiting itself to ~4GB? On x86-64 there's very little (if
>>
>> I was in two minds about the unit. Given that the largest page size is
>> 1GB, decided to go with 32bit. But, as you say below, there's no reason
>> to limit the x86-64 interface for MM reasons.  Will fix.
>
> Actually, I think there is (and we went through this with SPARC, if you
> remember?)

My google-fu is failing me. I don't think it was this thread:
 https://lore.kernel.org/lkml/1490310113-824438-1-git-send-email-pasha.tatashin@oracle.com/

> We _shouldn't_ be calling memset() with a large size (ie
> larger than 4GB).  If we have that much memory to clear, we should be
> doing something smarter, like using padata to get lots of CPUs clearing
> individual portions of the page.

Agreed. Or even offloading to an accelerator so as to not waste CPU time.

That said, whether to invoke clear_pages() on > 4GB regions seems like an MM
policy question. Not sure it makes sense to limit the low-level interface.

> I don't know how relevant this is now that you're going to be using
> ALTERNATIVES.


--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-15 21:46     ` Ankur Arora
@ 2025-04-15 22:01       ` Mateusz Guzik
  2025-04-16  4:46         ` Ankur Arora
  0 siblings, 1 reply; 38+ messages in thread
From: Mateusz Guzik @ 2025-04-15 22:01 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On Tue, Apr 15, 2025 at 11:46 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>
> Mateusz Guzik <mjguzik@gmail.com> writes:
>
> > On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote:
> >> clear_pages_rep(), clear_pages_erms() use string instructions to zero
> >> memory. When operating on more than a single page, we can use these
> >> more effectively by explicitly advertising the region-size to the
> >> processor, which can use that as a hint to optimize the clearing
> >> (ex. by eliding cacheline allocation.)
> >>
> >> As a secondary benefit, string instructions are typically microcoded,
> >> and working with larger regions helps amortize the cost of the decode.
> >>
> >> When zeroing the 2MB page, maximize spatial locality by clearing in
> >> three sections: the faulting page and its immediate neighbourhood, the
> >> left and the right regions, with the local neighbourhood cleared last.
> >>
> >> Performance
> >> ==
> >>
> >> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local
> >> NUMA node.
> >>
> >> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
> >>
> >>                  mm/folio_zero_user    x86/folio_zero_user     change
> >>                   (GB/s  +- stddev)      (GB/s  +- stddev)
> >>
> >>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
> >>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
> >>
> >> Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
> >> allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB.
> >>
> >> pg-sz=1GB:
> >>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
> >>   -    544,878,976      instructions                     #    0.06  insn per cycle
> >>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
> >>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
> >>
> >>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
> >>   +     10,979,121      instructions                     #    0.00  insn per cycle
> >>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
> >>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
> >>
> >> That's not the case with pg-sz=2MB, where we also perform better but
> >> the number of cacheline allocations remain the same.
> >>
> >> It's not entirely clear why the performance for pg-sz=2MB improves. We
> >> decode fewer instructions and the hardware prefetcher can do a better
> >> job, but the perf stats for both of those aren't convincing enough to
> >> the extent of ~30%.
> >>
> >> pg-sz=2MB:
> >>   - 13,110,306,584      cycles                           #    2.418 GHz                         ( +-  0.48% )  (46.13%)
> >>   -    607,589,360      instructions                     #    0.05  insn per cycle
> >>   -  2,416,130,434      L1-dcache-loads                  #  445.682 M/sec                       ( +-  0.08% )  (46.19%)
> >>   -  1,080,187,594      L1-dcache-load-misses            #   44.71% of all L1-dcache accesses   ( +-  0.01% )  (46.18%)
> >>
> >>   +  9,624,624,178      cycles                           #    2.418 GHz                         ( +-  0.01% )  (46.13%)
> >>   +    277,336,691      instructions                     #    0.03  insn per cycle
> >>   +  2,251,220,599      L1-dcache-loads                  #  565.624 M/sec                       ( +-  0.01% )  (46.20%)
> >>   +  1,092,386,130      L1-dcache-load-misses            #   48.52% of all L1-dcache accesses   ( +-  0.02% )  (46.19%)
> >>
> >> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
> >>
> >>                  mm/folio_zero_user    x86/folio_zero_user     change
> >>                   (GB/s +- stddev)      (GB/s +- stddev)
> >>
> >>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
> >>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
> >>
> >> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
> >> see a drop in cycles but there's no drop in cacheline allocation.
> >>
> >
> > Back when I was young and handsome and 32-bit x86 was king, people
> > assumed 4K pages needed to be cleared with non-temporal stores to avoid
> > evicting stuff from caches. I had never seen measurements showing this
> > has the intended effect. Some time after this became a thing I did see
> > measurements showing that this in fact *increases* cache misses. I am
> > not saying this was necessarily the case for all x86 uarchs, merely that
> > the sensibly sounding assumption turned bogus at some point (if it was
> > ever legit).
>
> That was a long time ago though ;-). And, your point makes sense for
> small sized pages. But, consider that zeroing a 1GB page can easily blow
> away an L3 cache for absolutely nothing gained -- probabilistically,
> nothing that was in the page that remains in the cache will ever be
> accessed.
>
> Now, you could argue that the situation is less clear for 2MB pages.
>

Well I was talking about 2MB. ;) I thought it was a foregone conclusion
that 1GB pages will be handled with non-temporal stores, but maybe I'm
crossing my wires.

> > This brings me to the multi-stage clearing employed here for locality.
> > While it sounds great on paper, for all I know it does not provide any
> > advantage. It very well may be it is harmful by preventing the CPU from
> > knowing what you are trying to do.
> >
> > I think doing this warrants obtaining stats from some real workloads,
> > but given how time consuming this can be I think it would be tolerable
> > to skip it for now.
> >
> >> Performance for preempt=none|voluntary remains unchanged.
> >>
> >
> > So I was under the impression the benefit would be realized for all
> > kernels.
> >
> > I don't know how preemption support is implemented on Linux. Do you
> > always get an IPI?
>
> No. The need-resched bit is common. It's just there's no preemption via
> irqentry, just synchronous calls to cond_resched() (as you mention below).
>
> Zeroing via a subroutine like instruction (rep; stos) is incompatible with
> synchronous calls to cond_resched() so this code is explicitly not called
> for none/voluntary (see patch 3.)
>
> That said, I'll probably take Ingo's suggestion of chunking things up
> in say 8/16MB portions for cooperative preemption models.

makes sense, thanks

>
>
> > I was thinking something like this: a per-cpu var akin to preemption
> > count, but indicating the particular code section is fully preemptible
> >
> > Then:
> >
> > preemptible_enter();
> > clear_pages();
> > preemptible_exit();
> >
> > for simpler handling of the var it could prevent migration to other
> > CPUs.
> >
> > then the IPI handler for preemption would check if ->preemptible is set
> > + preemption disablement is zero, in which case it would take you off
> > cpu.
> >
> > If this is a problem, then a better granularity would help (say 8 pages
> > between cond_rescheds?)
> >
> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> >> ---
> >>  arch/x86/mm/Makefile |  1 +
> >>  arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
> >>  include/linux/mm.h   |  1 +
> >>  3 files changed, 62 insertions(+)
> >>  create mode 100644 arch/x86/mm/memory.c
> >>
> >> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
> >> index 32035d5be5a0..e61b4d331cdf 100644
> >> --- a/arch/x86/mm/Makefile
> >> +++ b/arch/x86/mm/Makefile
> >> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)       += testmmiotrace.o
> >>  obj-$(CONFIG_NUMA)          += numa.o numa_$(BITS).o
> >>  obj-$(CONFIG_AMD_NUMA)              += amdtopology.o
> >>  obj-$(CONFIG_ACPI_NUMA)             += srat.o
> >> +obj-$(CONFIG_PREEMPTION)    += memory.o
> >>
> >>  obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)      += pkeys.o
> >>  obj-$(CONFIG_RANDOMIZE_MEMORY)                      += kaslr.o
> >> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
> >> new file mode 100644
> >> index 000000000000..99851c246fcc
> >> --- /dev/null
> >> +++ b/arch/x86/mm/memory.c
> >> @@ -0,0 +1,60 @@
> >> +// SPDX-License-Identifier: GPL-2.0-or-later
> >> +#include <linux/mm.h>
> >> +#include <linux/range.h>
> >> +#include <linux/minmax.h>
> >> +
> >> +#ifndef CONFIG_HIGHMEM
> >> +/*
> >> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
> >> + *
> >> + * Taking inspiration from the common code variant, we split the zeroing in
> >> + * three parts: left of the fault, right of the fault, and up to 5 pages
> >> + * in the immediate neighbourhood of the target page.
> >> + *
> >> + * Cleared in that order to keep cache lines of the target region hot.
> >> + *
> >> + * For gigantic pages, there is no expectation of cache locality so just do a
> >> + * straight zero.
> >> + */
> >> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
> >> +{
> >> +    unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> >> +    const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
> >> +    const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
> >> +    int width = 2; /* pages cleared last on either side */
> >> +    struct range r[3];
> >> +    int i;
> >> +
> >> +    if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
> >> +            clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
> >> +            goto out;
> >> +    }
> >> +
> >> +    /*
> >> +     * Faulting page and its immediate neighbourhood. Cleared at the end to
> >> +     * ensure it sticks around in the cache.
> >> +     */
> >> +    r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> >> +                        clamp_t(s64, fault_idx + width, pg.start, pg.end));
> >> +
> >> +    /* Region to the left of the fault */
> >> +    r[1] = DEFINE_RANGE(pg.start,
> >> +                        clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
> >> +
> >> +    /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> >> +    r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> >> +                        pg.end);
> >> +
> >> +    for (i = 0; i <= 2; i++) {
> >> +            int len = range_len(&r[i]);
> >> +
> >> +            if (len > 0)
> >> +                    clear_pages(page_address(folio_page(folio, r[i].start)), len);
> >> +    }
> >> +
> >> +out:
> >> +    /* Explicitly invoke cond_resched() to handle any live patching necessary. */
> >> +    cond_resched();
> >> +}
> >> +
> >> +#endif /* CONFIG_HIGHMEM */
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index b7f13f087954..b57512da8173 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -4114,6 +4114,7 @@ enum mf_action_page_type {
> >>  };
> >>
> >>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> >> +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint);
> >>  void folio_zero_user(struct folio *folio, unsigned long addr_hint);
> >>  int copy_user_large_folio(struct folio *dst, struct folio *src,
> >>                        unsigned long addr_hint,
> >> --
> >> 2.31.1
> >>
> >>
>
>
> --
> ankur



-- 
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-15 22:01       ` Mateusz Guzik
@ 2025-04-16  4:46         ` Ankur Arora
  2025-04-17 14:06           ` Mateusz Guzik
  0 siblings, 1 reply; 38+ messages in thread
From: Ankur Arora @ 2025-04-16  4:46 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Mateusz Guzik <mjguzik@gmail.com> writes:

> On Tue, Apr 15, 2025 at 11:46 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>
>> Mateusz Guzik <mjguzik@gmail.com> writes:
>>
>> > On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote:
>> >> clear_pages_rep(), clear_pages_erms() use string instructions to zero
>> >> memory. When operating on more than a single page, we can use these
>> >> more effectively by explicitly advertising the region-size to the
>> >> processor, which can use that as a hint to optimize the clearing
>> >> (ex. by eliding cacheline allocation.)
>> >>
>> >> As a secondary benefit, string instructions are typically microcoded,
>> >> and working with larger regions helps amortize the cost of the decode.
>> >>
>> >> When zeroing the 2MB page, maximize spatial locality by clearing in
>> >> three sections: the faulting page and its immediate neighbourhood, the
>> >> left and the right regions, with the local neighbourhood cleared last.
>> >>
>> >> Performance
>> >> ==
>> >>
>> >> Use mmap(MAP_HUGETLB) to demand fault a 64GB region on the local
>> >> NUMA node.
>> >>
>> >> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>> >>
>> >>                  mm/folio_zero_user    x86/folio_zero_user     change
>> >>                   (GB/s  +- stddev)      (GB/s  +- stddev)
>> >>
>> >>   pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>> >>   pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>> >>
>> >> Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
>> >> allocation, so we see a dropoff in cacheline-allocations for pg-sz=1GB.
>> >>
>> >> pg-sz=1GB:
>> >>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
>> >>   -    544,878,976      instructions                     #    0.06  insn per cycle
>> >>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
>> >>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
>> >>
>> >>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
>> >>   +     10,979,121      instructions                     #    0.00  insn per cycle
>> >>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
>> >>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
>> >>
>> >> That's not the case with pg-sz=2MB, where we also perform better but
>> >> the number of cacheline allocations remain the same.
>> >>
>> >> It's not entirely clear why the performance for pg-sz=2MB improves. We
>> >> decode fewer instructions and the hardware prefetcher can do a better
>> >> job, but the perf stats for both of those aren't convincing enough to
>> >> the extent of ~30%.
>> >>
>> >> pg-sz=2MB:
>> >>   - 13,110,306,584      cycles                           #    2.418 GHz                         ( +-  0.48% )  (46.13%)
>> >>   -    607,589,360      instructions                     #    0.05  insn per cycle
>> >>   -  2,416,130,434      L1-dcache-loads                  #  445.682 M/sec                       ( +-  0.08% )  (46.19%)
>> >>   -  1,080,187,594      L1-dcache-load-misses            #   44.71% of all L1-dcache accesses   ( +-  0.01% )  (46.18%)
>> >>
>> >>   +  9,624,624,178      cycles                           #    2.418 GHz                         ( +-  0.01% )  (46.13%)
>> >>   +    277,336,691      instructions                     #    0.03  insn per cycle
>> >>   +  2,251,220,599      L1-dcache-loads                  #  565.624 M/sec                       ( +-  0.01% )  (46.20%)
>> >>   +  1,092,386,130      L1-dcache-load-misses            #   48.52% of all L1-dcache accesses   ( +-  0.02% )  (46.19%)
>> >>
>> >> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>> >>
>> >>                  mm/folio_zero_user    x86/folio_zero_user     change
>> >>                   (GB/s +- stddev)      (GB/s +- stddev)
>> >>
>> >>   pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>> >>   pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>> >>
>> >> For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
>> >> see a drop in cycles but there's no drop in cacheline allocation.
>> >>
>> >
>> > Back when I was young and handsome and 32-bit x86 was king, people
>> > assumed 4K pages needed to be cleared with non-temporal stores to avoid
>> > evicting stuff from caches. I had never seen measurements showing this
>> > has the intended effect. Some time after this became a thing I did see
>> > measurements showing that this in fact *increases* cache misses. I am
>> > not saying this was necessarily the case for all x86 uarchs, merely that
>> > the sensibly sounding assumption turned bogus at some point (if it was
>> > ever legit).
>>
>> That was a long time ago though ;-). And, your point makes sense for
>> small sized pages. But, consider that zeroing a 1GB page can easily blow
>> away an L3 cache for absolutely nothing gained -- probabilistically,
>> nothing that was in the page that remains in the cache will ever be
>> accessed.
>>
>> Now, you could argue that the situation is less clear for 2MB pages.
>>
>
> Well, I was talking about 2MB. ;) I thought it was a foregone conclusion
> that 1GB pages would be handled with non-temporal stores, but maybe I'm
> crossing my wires.

Or I am. I agree the situation is less clear for 2MB. I think for the
immediate neighbourhood we clearly do want temporal.

For the rest, L2 sizes are typically ~2MB or less. While there might be
latency benefits for caching the rest of the page, there are clear
negatives to blowing away state that was definitely useful in the
near past.

It would be a good idea to quantify that. I'm not sure what a good
benchmark for it would be, though.
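
As a starting point -- purely a sketch, not something that's been run in
this thread, and the working-set size plus the MAP_HUGETLB demand-fault
trigger are assumptions -- one could time a re-walk of a hot working set
before and after the kernel zeroes a 2MB page:

/*
 * Sketch of a cache-pollution check: how much does demand-zeroing a 2MB
 * page cost a hot working set? All sizes here are assumptions.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define WSS     (16UL << 20)    /* hot working set: a slice of L3 */
#define HPAGE   (2UL << 20)

/* Walk the working set one cacheline at a time; return elapsed nanoseconds. */
static long walk_ns(volatile unsigned char *buf, size_t len)
{
        struct timespec a, b;
        unsigned long sum = 0;
        size_t i;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (i = 0; i < len; i += 64)
                sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &b);
        if (sum == (unsigned long)-1)   /* keep the reads from being elided */
                puts("");
        return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
        unsigned char *ws = malloc(WSS);
        long before, after;
        char *hp;

        if (!ws)
                return 1;
        memset(ws, 1, WSS);             /* populate and warm the working set */
        walk_ns(ws, WSS);
        before = walk_ns(ws, WSS);

        /* First touch demand-faults the hugepage; the kernel zeroes it here. */
        hp = mmap(NULL, HPAGE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (hp != MAP_FAILED)
                hp[0] = 1;

        after = walk_ns(ws, WSS);
        printf("walk before: %ld ns, after: %ld ns\n", before, after);
        return 0;
}

Comparing the two walk times with and without the series applied would at
least give a first-order number for how much of the hot working set gets
evicted by the zeroing.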

Ankur

>> > This brings me to the multi-stage clearing employed here for locality.
>> > While it sounds great on paper, for all I know it does not provide any
>> > advantage. It very well may be it is harmful by preventing the CPU from
>> > knowing what you are trying to do.
>> >
>> > I think doing this warrants obtaining stats from some real workloads,
>> > but given how time consuming this can be I think it would be tolerable
>> > to skip it for now.
>> >
>> >> Performance for preempt=none|voluntary remains unchanged.
>> >>
>> >
>> > So I was under the impression the benefit would be realized for all
>> > kernels.
>> >
>> > I don't know how preemption support is implemented on Linux. Do you
>> > always get an IPI?
>>
>> No. The need-resched bit is common. It's just there's no preemption via
>> irqentry, just synchronous calls to cond_resched() (as you mention below).
>>
>> Zeroing via a subroutine like instruction (rep; stos) is incompatible with
>> synchronous calls to cond_resched() so this code is explicitly not called
>> for none/voluntary (see patch 3.)
>>
>> That said, I'll probably take Ingo's suggestion of chunking things up
>> in say 8/16MB portions for cooperative preemption models.
>
> makes sense, thanks
>
>>
>>
>> > I was thinking something like this: a per-cpu var akin to preemption
>> > count, but indicating the particular code section is fully preemptible
>> >
>> > Then:
>> >
>> > preemptible_enter();
>> > clear_pages();
>> > preemptible_exit();
>> >
>> > for simpler handling of the var it could prevent migration to other
>> > CPUs.
>> >
>> > then the IPI handler for preemption would check if ->preemptible is set
>> > + preemption disablement is zero, in which case it would take you off
>> > cpu.
>> >
>> > If this is a problem, then a better granularity would help (say 8 pages
>> > between cond_rescheds?)
>> >
>> >> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> >> ---
>> >>  arch/x86/mm/Makefile |  1 +
>> >>  arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
>> >>  include/linux/mm.h   |  1 +
>> >>  3 files changed, 62 insertions(+)
>> >>  create mode 100644 arch/x86/mm/memory.c
>> >>
>> >> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
>> >> index 32035d5be5a0..e61b4d331cdf 100644
>> >> --- a/arch/x86/mm/Makefile
>> >> +++ b/arch/x86/mm/Makefile
>> >> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST)       += testmmiotrace.o
>> >>  obj-$(CONFIG_NUMA)          += numa.o numa_$(BITS).o
>> >>  obj-$(CONFIG_AMD_NUMA)              += amdtopology.o
>> >>  obj-$(CONFIG_ACPI_NUMA)             += srat.o
>> >> +obj-$(CONFIG_PREEMPTION)    += memory.o
>> >>
>> >>  obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)      += pkeys.o
>> >>  obj-$(CONFIG_RANDOMIZE_MEMORY)                      += kaslr.o
>> >> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
>> >> new file mode 100644
>> >> index 000000000000..99851c246fcc
>> >> --- /dev/null
>> >> +++ b/arch/x86/mm/memory.c
>> >> @@ -0,0 +1,60 @@
>> >> +// SPDX-License-Identifier: GPL-2.0-or-later
>> >> +#include <linux/mm.h>
>> >> +#include <linux/range.h>
>> >> +#include <linux/minmax.h>
>> >> +
>> >> +#ifndef CONFIG_HIGHMEM
>> >> +/*
>> >> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
>> >> + *
>> >> + * Taking inspiration from the common code variant, we split the zeroing in
>> >> + * three parts: left of the fault, right of the fault, and up to 5 pages
>> >> + * in the immediate neighbourhood of the target page.
>> >> + *
>> >> + * Cleared in that order to keep cache lines of the target region hot.
>> >> + *
>> >> + * For gigantic pages, there is no expectation of cache locality so just do a
>> >> + * straight zero.
>> >> + */
>> >> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
>> >> +{
>> >> +    unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> >> +    const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> >> +    const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> >> +    int width = 2; /* pages cleared last on either side */
>> >> +    struct range r[3];
>> >> +    int i;
>> >> +
>> >> +    if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> >> +            clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>> >> +            goto out;
>> >> +    }
>> >> +
>> >> +    /*
>> >> +     * Faulting page and its immediate neighbourhood. Cleared at the end to
>> >> +     * ensure it sticks around in the cache.
>> >> +     */
>> >> +    r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>> >> +                        clamp_t(s64, fault_idx + width, pg.start, pg.end));
>> >> +
>> >> +    /* Region to the left of the fault */
>> >> +    r[1] = DEFINE_RANGE(pg.start,
>> >> +                        clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>> >> +
>> >> +    /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>> >> +    r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>> >> +                        pg.end);
>> >> +
>> >> +    for (i = 0; i <= 2; i++) {
>> >> +            int len = range_len(&r[i]);
>> >> +
>> >> +            if (len > 0)
>> >> +                    clear_pages(page_address(folio_page(folio, r[i].start)), len);
>> >> +    }
>> >> +
>> >> +out:
>> >> +    /* Explicitly invoke cond_resched() to handle any live patching necessary. */
>> >> +    cond_resched();
>> >> +}
>> >> +
>> >> +#endif /* CONFIG_HIGHMEM */
>> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> >> index b7f13f087954..b57512da8173 100644
>> >> --- a/include/linux/mm.h
>> >> +++ b/include/linux/mm.h
>> >> @@ -4114,6 +4114,7 @@ enum mf_action_page_type {
>> >>  };
>> >>
>> >>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>> >> +void folio_zero_user_preemptible(struct folio *fio, unsigned long addr_hint);
>> >>  void folio_zero_user(struct folio *folio, unsigned long addr_hint);
>> >>  int copy_user_large_folio(struct folio *dst, struct folio *src,
>> >>                        unsigned long addr_hint,
>> >> --
>> >> 2.31.1
>> >>
>> >>
>>
>>
>> --
>> ankur


--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-16  4:46         ` Ankur Arora
@ 2025-04-17 14:06           ` Mateusz Guzik
  0 siblings, 0 replies; 38+ messages in thread
From: Mateusz Guzik @ 2025-04-17 14:06 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, raghavendra.kt, boris.ostrovsky, konrad.wilk

On Wed, Apr 16, 2025 at 6:46 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> Mateusz Guzik <mjguzik@gmail.com> writes:
>
> > On Tue, Apr 15, 2025 at 11:46 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
> >> Now, you could argue that the situation is less clear for 2MB pages.
> >>
> >
> > Well, I was talking about 2MB. ;) I thought it was a foregone conclusion
> > that 1GB pages would be handled with non-temporal stores, but maybe I'm
> > crossing my wires.
>
> Or I am. I agree the situation is less clear for 2MB. I think for the
> immediate neighbourhood we clearly do want temporal.
>
> For the rest, L2 sizes are typically ~2MB or less. While there might be
> latency benefits for caching the rest of the page, there are clear
> negatives to blowing away state that was definitely useful in the
> near past.
>
> It would be a good idea to quantify that. I'm not sure what a good
> benchmark for it would be, though.
>

L3 caches are not that great on many CPUs either, so you could argue you
are blowing away the real stuff by zeroing a 2MB page.

I slept on it and I think a proper evaluation is rather time consuming and
tied to faulting more than one page at a time:
https://lore.kernel.org/linux-mm/CAGudoHGY012mwJqtGPUQ9mfQEVF1_otr9NSbbTYi_vazS09-CQ@mail.gmail.com/#t

I don't believe putting that effort in is a hard requirement at this
stage, albeit it would have been most welcome.

I think your patchset is a step in the right direction, even if the
specific policy is to change later.

All that is to say, I think this bit can be disregarded for now. I
definitely don't want to try to hold up the patchset.

cheers


--
Mateusz Guzik <mjguzik gmail.com>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
                   ` (6 preceding siblings ...)
  2025-04-15 19:10 ` Zi Yan
@ 2025-04-22  6:23 ` Raghavendra K T
  2025-04-22 19:22   ` Ankur Arora
  7 siblings, 1 reply; 38+ messages in thread
From: Raghavendra K T @ 2025-04-22  6:23 UTC (permalink / raw)
  To: Ankur Arora, linux-kernel, linux-mm, x86
  Cc: torvalds, akpm, bp, dave.hansen, hpa, mingo, luto, peterz,
	paulmck, rostedt, tglx, willy, jon.grimm, bharata,
	boris.ostrovsky, konrad.wilk



On 4/14/2025 9:16 AM, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages. It is a rework
> of [1] which took a detour through PREEMPT_LAZY [2].
> 
> Why multi-page clearing?: multi-page clearing improves upon the
> current page-at-a-time approach by providing the processor with a
> hint as to the real region size. A processor could use this hint to,
> for instance, elide cacheline allocation when clearing a large
> region.
> 
> This optimization in particular is done by REP; STOS on AMD Zen
> where regions larger than L3-size use non-temporal stores.
> 
> This results in significantly better performance.
> 
> We also see performance improvement for cases where this optimization is
> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
> REP; STOS is typically microcoded which can now be amortized over
> larger regions and the hint allows the hardware prefetcher to do a
> better job.
> 
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
> 
>                   mm/folio_zero_user    x86/folio_zero_user     change
>                    (GB/s  +- stddev)      (GB/s  +- stddev)
> 
>    pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>    pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
> 
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
> 
>                   mm/folio_zero_user    x86/folio_zero_user     change
>                    (GB/s +- stddev)      (GB/s +- stddev)
> 
>    pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>    pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
> 
[...]

Hello Ankur,

Thank you for the patches. Was able to test briefly w/ lazy preempt
mode.

(I do understand that there could be a lot of churn based on Ingo's,
Mateusz's, and others' comments.)
But here it goes:

SUT: AMD EPYC 9B24 (Genoa) preempt=lazy

metric = time taken in sec (lower is better). total SIZE=64GB
                  mm/folio_zero_user    x86/folio_zero_user     change
   pg-sz=1GB       2.47044  +-  0.38%    1.060877  +-  0.07%    57.06
   pg-sz=2MB       5.098403 +-  0.01%    2.52015   +-  0.36%    50.57
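
(The map_hugetlb_1G harness used for the perf runs below isn't posted in
this thread; a minimal stand-in, with the size and mmap flags as
assumptions, could look like the following. It needs 1GB hugetlb pages
reserved up front, e.g. via hugepagesz=1G hugepages=64.)

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB    (30 << 26)      /* 30 == log2(1GB), MAP_HUGE_SHIFT == 26 */
#endif

#define SIZE    (64UL << 30)            /* 64GB, as in the runs below */

int main(void)
{
        struct timespec a, b;
        double secs;
        size_t off;
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (off = 0; off < SIZE; off += 1UL << 30)
                p[off] = 1;     /* first touch: demand fault zeroes each 1GB page */
        clock_gettime(CLOCK_MONOTONIC, &b);

        secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        printf("%.2f GB/s\n", (double)(SIZE >> 30) / secs);
        return 0;
}

Run under something like 'perf stat -r 10 numactl -m 0 -N 0 ...', this is
roughly the setup the perf invocations below use.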

More details (1G example run):

base kernel    =   6.14 (preempt = lazy)

mm/folio_zero_user
Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

           2,476.47 msec task-clock                       #    1.002 CPUs utilized            ( +-  0.39% )
                  5      context-switches                 #    2.025 /sec                     ( +- 29.70% )
                  2      cpu-migrations                   #    0.810 /sec                     ( +- 21.15% )
                202      page-faults                      #   81.806 /sec                     ( +-  0.18% )
      7,348,664,233      cycles                           #    2.976 GHz                      ( +-  0.38% )  (38.39%)
        878,805,326      stalled-cycles-frontend          #   11.99% frontend cycles idle     ( +-  0.74% )  (38.43%)
        339,023,729      instructions                     #    0.05  insn per cycle
                                                          #    2.53  stalled cycles per insn  ( +-  0.08% )  (38.47%)
         88,579,915      branches                         #   35.873 M/sec                    ( +-  0.06% )  (38.51%)
         17,369,776      branch-misses                    #   19.55% of all branches          ( +-  0.04% )  (38.55%)
      2,261,339,695      L1-dcache-loads                  #  915.795 M/sec                    ( +-  0.06% )  (38.56%)
      1,073,880,164      L1-dcache-load-misses            #   47.48% of all L1-dcache accesses  ( +-  0.05% )  (38.56%)
        511,231,988      L1-icache-loads                  #  207.038 M/sec                    ( +-  0.25% )  (38.52%)
            128,533      L1-icache-load-misses            #    0.02% of all L1-icache accesses  ( +-  0.40% )  (38.48%)
             38,134      dTLB-loads                       #   15.443 K/sec                    ( +-  4.22% )  (38.44%)
             33,992      dTLB-load-misses                 #  114.39% of all dTLB cache accesses  ( +-  9.42% )  (38.40%)
                156      iTLB-loads                       #   63.177 /sec                     ( +- 13.34% )  (38.36%)
                156      iTLB-load-misses                 #  102.50% of all iTLB cache accesses  ( +- 25.98% )  (38.36%)

            2.47044 +- 0.00949 seconds time elapsed  ( +-  0.38% )

x86/folio_zero_user
           1,056.72 msec task-clock                       #    0.996 CPUs utilized            ( +-  0.07% )
                 10      context-switches                 #    9.436 /sec                     ( +-  3.59% )
                  3      cpu-migrations                   #    2.831 /sec                     ( +- 11.33% )
                200      page-faults                      #  188.718 /sec                     ( +-  0.15% )
      3,146,571,264      cycles                           #    2.969 GHz                      ( +-  0.07% )  (38.35%)
         17,226,261      stalled-cycles-frontend          #    0.55% frontend cycles idle     ( +-  4.12% )  (38.44%)
         14,130,553      instructions                     #    0.00  insn per cycle
                                                          #    1.39  stalled cycles per insn  ( +-  1.59% )  (38.53%)
          3,578,614      branches                         #    3.377 M/sec                    ( +-  1.54% )  (38.62%)
            415,807      branch-misses                    #   12.45% of all branches          ( +-  1.17% )  (38.62%)
         22,208,699      L1-dcache-loads                  #   20.956 M/sec                    ( +-  5.27% )  (38.60%)
          7,312,684      L1-dcache-load-misses            #   27.79% of all L1-dcache accesses  ( +-  8.46% )  (38.51%)
          4,032,315      L1-icache-loads                  #    3.805 M/sec                    ( +-  1.29% )  (38.48%)
             15,094      L1-icache-load-misses            #    0.38% of all L1-icache accesses  ( +-  1.14% )  (38.39%)
             14,365      dTLB-loads                       #   13.555 K/sec                    ( +-  7.23% )  (38.38%)
              9,477      dTLB-load-misses                 #   65.36% of all dTLB cache accesses  ( +- 12.05% )  (38.38%)
                 18      iTLB-loads                       #   16.985 /sec                     ( +- 34.84% )  (38.38%)
                 67      iTLB-load-misses                 #  158.39% of all iTLB cache accesses  ( +- 48.32% )  (38.32%)

           1.060877 +- 0.000766 seconds time elapsed  ( +-  0.07% )

Thanks and Regards
- Raghu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-14  7:05   ` Ingo Molnar
  2025-04-15  6:36     ` Ankur Arora
@ 2025-04-22  6:36     ` Raghavendra K T
  2025-04-22 19:14       ` Ankur Arora
  1 sibling, 1 reply; 38+ messages in thread
From: Raghavendra K T @ 2025-04-22  6:36 UTC (permalink / raw)
  To: Ingo Molnar, Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, boris.ostrovsky, konrad.wilk



On 4/14/2025 12:35 PM, Ingo Molnar wrote:
> 
> * Ankur Arora <ankur.a.arora@oracle.com> wrote:

[...]

>> Performance for preempt=none|voluntary remains unchanged.
> 
> CONFIG_PREEMPT_VOLUNTARY=y is the default on a number of major
> distributions, such as Ubuntu, and a lot of enterprise distro kernels -
> and this patch does nothing for them, for no good reason.
> 
> So could you please provide a sensible size granularity cutoff of 16MB
> or so on non-preemptive kernels, instead of this weird build-time
> all-or-nothing binary cutoff based on preemption modes?
> 

Agree with Ingo here. We are too harsh on none and voluntary with this
limit. At least 512KB size?

I will try to see if I can get numbers for different sizes that do not
interfere much with other workloads' perf (in none/voluntary mode).

- Raghu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
  2025-04-22  6:36     ` Raghavendra K T
@ 2025-04-22 19:14       ` Ankur Arora
  0 siblings, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-22 19:14 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Ingo Molnar, Ankur Arora, linux-kernel, linux-mm, x86, torvalds,
	akpm, bp, dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt,
	tglx, willy, jon.grimm, bharata, boris.ostrovsky, konrad.wilk


Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/14/2025 12:35 PM, Ingo Molnar wrote:
>> * Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> [...]
>
>>> Performance for preempt=none|voluntary remains unchanged.
>> CONFIG_PREEMPT_VOLUNTARY=y is the default on a number of major
>> distributions, such as Ubuntu, and a lot of enterprise distro kernels -
>> and this patch does nothing for them, for no good reason.
>> So could you please provide a sensible size granularity cutoff of 16MB
>> or so on non-preemptive kernels, instead of this weird build-time
>> all-or-nothing binary cutoff based on preemption modes?
>>
>
> Agree with Ingo here. We are too harsh on none and voluntary with this
> limit. At least 512KB size?

512KB seems much too small. I was thinking of something larger: 8MB. With
a zeroing BW of 10GBps (typical value I've seen with relatively new
Intel/AMD hardware), that would mean a preemption delay of 0.8ms
for preempt=none/voluntary.

Even with relatively old hardware -- say Skylake -- we should see ~5
GBps, i.e. a ~1.6ms delay.
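
For concreteness, the chunking could be as simple as the sketch below (the
CLEAR_PAGES_CHUNK value and helper name are made up here, and the
clear_pages(addr, npages) signature is assumed from patch 2):

/*
 * Rough sketch only: give preempt=none/voluntary a cond_resched() point
 * every 8MB, i.e. roughly every 0.8ms at 10GBps.
 */
#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/sched.h>

#define CLEAR_PAGES_CHUNK       (8UL << 20)     /* bytes between resched points */

static void clear_pages_resched(void *addr, unsigned long npages)
{
        const unsigned long chunk = CLEAR_PAGES_CHUNK / PAGE_SIZE;

        while (npages) {
                unsigned long n = min(npages, chunk);

                clear_pages(addr, n);
                addr += n * PAGE_SIZE;
                npages -= n;
                cond_resched();
        }
}

That bounds the preemption latency by the chunk size rather than the folio
size, at the cost of re-issuing the microcoded REP; STOS prologue once per
chunk.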

> I will try to see if I can get numbers for different sizes that do not
> interfere much with other workloads' perf (in none/voluntary mode).

Thanks that'll be useful to make the case.

--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-22  6:23 ` Raghavendra K T
@ 2025-04-22 19:22   ` Ankur Arora
  2025-04-23  8:12     ` Raghavendra K T
  0 siblings, 1 reply; 38+ messages in thread
From: Ankur Arora @ 2025-04-22 19:22 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, boris.ostrovsky, konrad.wilk


Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/14/2025 9:16 AM, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1] which took a detour through PREEMPT_LAZY [2].
>> Why multi-page clearing?: multi-page clearing improves upon the
>> current page-at-a-time approach by providing the processor with a
>> hint as to the real region size. A processor could use this hint to,
>> for instance, elide cacheline allocation when clearing a large
>> region.
>> This optimization in particular is done by REP; STOS on AMD Zen
>> where regions larger than L3-size use non-temporal stores.
>> This results in significantly better performance.
>> We also see performance improvement for cases where this optimization is
>> unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on Intel) because
>> REP; STOS is typically microcoded which can now be amortized over
>> larger regions and the hint allows the hardware prefetcher to do a
>> better job.
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>                    (GB/s  +- stddev)      (GB/s  +- stddev)
>>    pg-sz=1GB       16.51  +- 0.54%        42.80  +-  3.48%    + 159.2%
>>    pg-sz=2MB       11.89  +- 0.78%        16.12  +-  0.12%    +  35.5%
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>                    (GB/s +- stddev)      (GB/s +- stddev)
>>    pg-sz=1GB       8.01  +- 0.24%        11.26 +- 0.48%       + 40.57%
>>    pg-sz=2MB       7.95  +- 0.30%        10.90 +- 0.26%       + 37.10%
>>
> [...]
>
> Hello Ankur,
>
> Thank you for the patches. Was able to test briefly w/ lazy preempt
> mode.

Thanks for testing.

> (I do understand that there could be a lot of churn based on Ingo's,
> Mateusz's, and others' comments.)
> But here it goes:
>
> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>
> metric = time taken in sec (lower is better). total SIZE=64GB
>                  mm/folio_zero_user    x86/folio_zero_user     change
>   pg-sz=1GB       2.47044  +-  0.38%    1.060877  +-  0.07%    57.06
>   pg-sz=2MB       5.098403 +-  0.01%    2.52015   +-  0.36%    50.57


Just to translate those into the same bandwidth units I was using above
(64GB / time elapsed):

                  mm/folio_zero_user        x86/folio_zero_user
   pg-sz=1GB       25.91 GBps +-  0.38%    60.37 GBps +-  0.07%
   pg-sz=2MB       12.57 GBps +-  0.01%    25.39 GBps +-  0.36%

That's a decent improvement over Milan. Btw, are you using boost=1?

Also, any idea why there's such a huge delta between the mm/folio_zero_user
2MB and 1GB cases? Both of these are zeroing a 4K page at a time, so the
huge delta is a little head-scratching.

There's a gap on Milan as well but it is much smaller.

Thanks
Ankur

> More details (1G example run):
>
> base kernel    =   6.14 (preempt = lazy)
>
> mm/folio_zero_user
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>           2,476.47 msec task-clock                       #    1.002 CPUs utilized            ( +-  0.39% )
>                  5      context-switches                 #    2.025 /sec                     ( +- 29.70% )
>                  2      cpu-migrations                   #    0.810 /sec                     ( +- 21.15% )
>                202      page-faults                      #   81.806 /sec                     ( +-  0.18% )
>      7,348,664,233      cycles                           #    2.976 GHz                      ( +-  0.38% )  (38.39%)
>        878,805,326      stalled-cycles-frontend          #   11.99% frontend cycles idle     ( +-  0.74% )  (38.43%)
>        339,023,729      instructions                     #    0.05  insn per cycle
>                                                          #    2.53  stalled cycles per insn  ( +-  0.08% )  (38.47%)
>         88,579,915      branches                         #   35.873 M/sec                    ( +-  0.06% )  (38.51%)
>         17,369,776      branch-misses                    #   19.55% of all branches          ( +-  0.04% )  (38.55%)
>      2,261,339,695      L1-dcache-loads                  #  915.795 M/sec                    ( +-  0.06% )  (38.56%)
>      1,073,880,164      L1-dcache-load-misses            #   47.48% of all L1-dcache accesses  ( +-  0.05% )  (38.56%)
>        511,231,988      L1-icache-loads                  #  207.038 M/sec                    ( +-  0.25% )  (38.52%)
>            128,533      L1-icache-load-misses            #    0.02% of all L1-icache accesses  ( +-  0.40% )  (38.48%)
>             38,134      dTLB-loads                       #   15.443 K/sec                    ( +-  4.22% )  (38.44%)
>             33,992      dTLB-load-misses                 #  114.39% of all dTLB cache accesses  ( +-  9.42% )  (38.40%)
>                156      iTLB-loads                       #   63.177 /sec                     ( +- 13.34% )  (38.36%)
>                156      iTLB-load-misses                 #  102.50% of all iTLB cache accesses  ( +- 25.98% )  (38.36%)
>
>            2.47044 +- 0.00949 seconds time elapsed  ( +-  0.38% )
>
> x86/folio_zero_user
>           1,056.72 msec task-clock                       #    0.996 CPUs utilized            ( +-  0.07% )
>                 10      context-switches                 #    9.436 /sec                     ( +-  3.59% )
>                  3      cpu-migrations                   #    2.831 /sec                     ( +- 11.33% )
>                200      page-faults                      #  188.718 /sec                     ( +-  0.15% )
>      3,146,571,264      cycles                           #    2.969 GHz                      ( +-  0.07% )  (38.35%)
>         17,226,261      stalled-cycles-frontend          #    0.55% frontend cycles idle     ( +-  4.12% )  (38.44%)
>         14,130,553      instructions                     #    0.00  insn per cycle
>                                                          #    1.39  stalled cycles per insn  ( +-  1.59% )  (38.53%)
>          3,578,614      branches                         #    3.377 M/sec                    ( +-  1.54% )  (38.62%)
>            415,807      branch-misses                    #   12.45% of all branches          ( +-  1.17% )  (38.62%)
>         22,208,699      L1-dcache-loads                  #   20.956 M/sec                    ( +-  5.27% )  (38.60%)
>          7,312,684      L1-dcache-load-misses            #   27.79% of all L1-dcache accesses  ( +-  8.46% )  (38.51%)
>          4,032,315      L1-icache-loads                  #    3.805 M/sec                    ( +-  1.29% )  (38.48%)
>             15,094      L1-icache-load-misses            #    0.38% of all L1-icache accesses  ( +-  1.14% )  (38.39%)
>             14,365      dTLB-loads                       #   13.555 K/sec                    ( +-  7.23% )  (38.38%)
>              9,477      dTLB-load-misses                 #   65.36% of all dTLB cache accesses  ( +- 12.05% )  (38.38%)
>                 18      iTLB-loads                       #   16.985 /sec                     ( +- 34.84% )  (38.38%)
>                 67      iTLB-load-misses                 #  158.39% of all iTLB cache accesses  ( +- 48.32% )  (38.32%)
>
>           1.060877 +- 0.000766 seconds time elapsed  ( +-  0.07% )
>
> Thanks and Regards
> - Raghu


--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-15 19:10 ` Zi Yan
@ 2025-04-22 19:32   ` Ankur Arora
  0 siblings, 0 replies; 38+ messages in thread
From: Ankur Arora @ 2025-04-22 19:32 UTC (permalink / raw)
  To: Zi Yan
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, torvalds, akpm, bp,
	dave.hansen, hpa, mingo, luto, peterz, paulmck, rostedt, tglx,
	willy, jon.grimm, bharata, raghavendra.kt, boris.ostrovsky,
	konrad.wilk


Zi Yan <ziy@nvidia.com> writes:

> On 13 Apr 2025, at 23:46, Ankur Arora wrote:
>
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1] which took a detour through PREEMPT_LAZY [2].
>>
>> Why multi-page clearing?: multi-page clearing improves upon the
>> current page-at-a-time approach by providing the processor with a
>> hint as to the real region size. A processor could use this hint to,
>> for instance, elide cacheline allocation when clearing a large
>> region.
>>
>> This optimization in particular is done by REP; STOS on AMD Zen
>> where regions larger than L3-size use non-temporal stores.
>>
>> This results in significantly better performance.
>
> Do you have init_on_alloc=1 in your kernel?
> With that, pages coming from buddy allocator are zeroed
> in post_alloc_hook() by kernel_init_pages(), which is a for loop
> of clear_highpage_kasan_tagged(), a wrap of clear_page().
> And folio_zero_user() is not used.
>
> At least Debian, Fedora, and Ubuntu by default have
> CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y, which means init_on_alloc=1.
>
> Maybe kernel_init_pages() should get your optimization as well,
> unless you only target hugetlb pages.

Thanks for the suggestion. I do plan to look for other places where
we could be zeroing contiguous regions.

Often the problem is that even if the underlying region is contiguous,
it isn't so under CONFIG_HIGHMEM. For instance,
clear_highpage_kasan_tagged() does a kmap/kunmap_local_page() around the
clearing. This breaks the contiguous region into multiple 4K pages even
when CONFIG_HIGHMEM is not defined.

I wonder if we need a clear_highpages() kind of abstraction which lets
HIGHMEM and non-HIGHMEM go their separate ways.
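
Something along these lines is what I have in mind -- purely a sketch; the
clear_highpages() name and signature are invented here, and the !HIGHMEM
side assumes clear_pages() from this series:

#include <linux/highmem.h>
#include <linux/mm.h>

static inline void clear_highpages(struct page *page, unsigned int npages)
{
#ifdef CONFIG_HIGHMEM
        unsigned int i;

        /* No contiguous kernel mapping guaranteed: map and clear one page at a time. */
        for (i = 0; i < npages; i++) {
                void *addr = kmap_local_page(page + i);

                clear_page(addr);
                kunmap_local(addr);
        }
#else
        /* Direct map: let the architecture see the whole extent. */
        clear_pages(page_address(page), npages);
#endif
}

The !HIGHMEM side then keeps the whole extent visible to the architecture,
which is what the multi-page clearing here relies on.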

--
ankur


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-22 19:22   ` Ankur Arora
@ 2025-04-23  8:12     ` Raghavendra K T
  2025-04-23  9:18       ` Raghavendra K T
  0 siblings, 1 reply; 38+ messages in thread
From: Raghavendra K T @ 2025-04-23  8:12 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, boris.ostrovsky, konrad.wilk



On 4/23/2025 12:52 AM, Ankur Arora wrote:
> 
> Raghavendra K T <raghavendra.kt@amd.com> writes:
[...]
>>
>> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>>
>> metric = time taken in sec (lower is better). total SIZE=64GB
>>                   mm/folio_zero_user    x86/folio_zero_user     change
>>    pg-sz=1GB       2.47044  +-  0.38%    1.060877  +-  0.07%    57.06
>>    pg-sz=2MB       5.098403 +-  0.01%    2.52015   +-  0.36%    50.57
> 
> 
> Just to translate it into the same units I was using above:
> 
>                    mm/folio_zero_user        x86/folio_zero_user
>     pg-sz=1GB       25.91 GBps +-  0.38%    60.37 GBps +-  0.07%
>     pg-sz=2MB       12.57 GBps +-  0.01%    25.39 GBps +-  0.36%
> 
> That's a decent improvement over Milan. Btw, are you using boost=1?
> 

yes boost=1

> Also, any idea why the huge delta in the mm/folio_zero_user 2MB, 1GB
> cases? Both of these are doing 4k page at a time, so the huge delta
> is a little head scratching.
> 
> There's a gap on Milan as well but it is much smaller.
> 

Need to think/analyze further, but from perf stat I see a
glaring difference in:
			2M		1G
pagefaults		32,906		202
iTLB-load-misses	44,490		156

- Raghu


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
  2025-04-23  8:12     ` Raghavendra K T
@ 2025-04-23  9:18       ` Raghavendra K T
  0 siblings, 0 replies; 38+ messages in thread
From: Raghavendra K T @ 2025-04-23  9:18 UTC (permalink / raw)
  To: Ankur Arora
  Cc: linux-kernel, linux-mm, x86, torvalds, akpm, bp, dave.hansen, hpa,
	mingo, luto, peterz, paulmck, rostedt, tglx, willy, jon.grimm,
	bharata, boris.ostrovsky, konrad.wilk

On 4/23/2025 1:42 PM, Raghavendra K T wrote:
[...]

>> Also, any idea why the huge delta in the mm/folio_zero_user 2MB, 1GB
>> cases? Both of these are doing 4k page at a time, so the huge delta
>> is a little head scratching.
>>
>> There's a gap on Milan as well but it is much smaller.
>>
> 

Sorry, I forgot to add a perhaps even more valid/important data point.

> Need to think/analyze further, but from perf stat I see a
> glaring difference in:
>              2M        1G
> pagefaults        32,906        202
> iTLB-load-misses    44,490        156

dTLB-loads		1,000,169	38,134
dTLB-load-misses	402,641		33,992

- Raghu



^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2025-04-23  9:19 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-14  3:46 [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing Ankur Arora
2025-04-14  3:46 ` [PATCH v3 1/4] x86/clear_page: extend clear_page*() for " Ankur Arora
2025-04-14  6:32   ` Ingo Molnar
2025-04-14 11:02     ` Peter Zijlstra
2025-04-14 11:14       ` Ingo Molnar
2025-04-14 19:46       ` Ankur Arora
2025-04-14 22:26       ` Mateusz Guzik
2025-04-15  6:14         ` Ankur Arora
2025-04-15  8:22           ` Mateusz Guzik
2025-04-15 20:01             ` Ankur Arora
2025-04-15 20:32               ` Mateusz Guzik
2025-04-14 19:52     ` Ankur Arora
2025-04-14 20:09       ` Matthew Wilcox
2025-04-15 21:59         ` Ankur Arora
2025-04-14  3:46 ` [PATCH v3 2/4] x86/clear_page: add clear_pages() Ankur Arora
2025-04-14  3:46 ` [PATCH v3 3/4] huge_page: allow arch override for folio_zero_user() Ankur Arora
2025-04-14  3:46 ` [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing Ankur Arora
2025-04-14  6:53   ` Ingo Molnar
2025-04-14 21:21     ` Ankur Arora
2025-04-14  7:05   ` Ingo Molnar
2025-04-15  6:36     ` Ankur Arora
2025-04-22  6:36     ` Raghavendra K T
2025-04-22 19:14       ` Ankur Arora
2025-04-15 10:16   ` Mateusz Guzik
2025-04-15 21:46     ` Ankur Arora
2025-04-15 22:01       ` Mateusz Guzik
2025-04-16  4:46         ` Ankur Arora
2025-04-17 14:06           ` Mateusz Guzik
2025-04-14  5:34 ` [PATCH v3 0/4] mm/folio_zero_user: add " Ingo Molnar
2025-04-14 19:30   ` Ankur Arora
2025-04-14  6:36 ` Ingo Molnar
2025-04-14 19:19   ` Ankur Arora
2025-04-15 19:10 ` Zi Yan
2025-04-22 19:32   ` Ankur Arora
2025-04-22  6:23 ` Raghavendra K T
2025-04-22 19:22   ` Ankur Arora
2025-04-23  8:12     ` Raghavendra K T
2025-04-23  9:18       ` Raghavendra K T
