linux-mm.kvack.org archive mirror
* [PATCH v8 0/7] mm: folio_zero_user: clear contiguous pages
@ 2025-10-27 20:21 Ankur Arora
  2025-10-27 20:21 ` [PATCH v8 1/7] treewide: provide a generic clear_user_page() variant Ankur Arora
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Ankur Arora @ 2025-10-27 20:21 UTC (permalink / raw)
  To: linux-kernel, linux-mm, x86
  Cc: akpm, david, bp, dave.hansen, hpa, mingo, mjguzik, luto, peterz,
	acme, namhyung, tglx, willy, raghavendra.kt, boris.ostrovsky,
	konrad.wilk, ankur.a.arora

This series adds clearing of contiguous page ranges for hugepages,
improving on the current page-at-a-time approach in two ways:

 - amortizes the per-page setup cost over a larger extent
 - when using string instructions, exposes the real region size
   to the processor.

A processor could use knowledge of the extent to optimize the
clearing. AMD Zen uarchs, for example, elide allocation of
cachelines for regions larger than the L3 size.

Demand faulting a 64GB region shows performance improvements:

 $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5

                       baseline              +series             change

                  (GB/s  +- %stdev)     (GB/s  +- %stdev)

   pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*

   pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05% [#]   +  5.2%	preempt=none|voluntary
   pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21%       +144.3%	preempt=full|lazy

[#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, which is higher than the maximum extent used on x86
(ARCH_CONTIG_PAGE_NR=8MB), so preempt=none|voluntary sees no improvement
with pg-sz=1GB.

The anon-w-seq test in the vm-scalability benchmark, however, does show
worse performance with utime increasing by ~9%:

                         stime                  utime

  baseline         1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
  +series          1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )

In part this is because anon-w-seq runs with 384 processes zeroing
anonymously mapped memory which they then access sequentially. As
such, this is likely an uncommon pattern: memory bandwidth is
saturated while the workload is also cache limited, because the
entire region is accessed.

Raghavendra also tested a previous version of the series on AMD Genoa [1].

Changelog:

v8:
 - make clear_user_highpages(), clear_user_pages() and clear_pages()
   more robust across architectures. (Thanks David!)
 - split up folio_zero_user() changes into ones for clearing contiguous
   regions and those for maintaining temporal locality since they have
   different performance profiles (Suggested by Andrew Morton.)
 - added Raghavendra's Reviewed-by, Tested-by.
 - get rid of nth_page()
 - perf related patches have been pulled already. Remove them.

v7:
 - interface cleanups, comments for clear_user_highpages(), clear_user_pages(),
   clear_pages().
 - fixed build errors flagged by kernel test robot
 (https://lore.kernel.org/lkml/20250917152418.4077386-1-ankur.a.arora@oracle.com/)

v6:
 - perf bench mem: update man pages and other cleanups (Namhyung Kim)
 - unify folio_zero_user() for HIGHMEM, !HIGHMEM options instead of
   working through a new config option (David Hildenbrand).
   - cleanups and simplification around that.
 (https://lore.kernel.org/lkml/20250902080816.3715913-1-ankur.a.arora@oracle.com/)

v5:
 - move the non HIGHMEM implementation of folio_zero_user() from x86
   to common code (Dave Hansen)
 - Minor naming cleanups, commit messages etc
 (https://lore.kernel.org/lkml/20250710005926.1159009-1-ankur.a.arora@oracle.com/)

v4:
 - adds perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
 - inline stosb etc (PeterZ)
 - handle cooperative preemption models (Ingo)
 - interface and other cleanups all over (Ingo)
 (https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)

v3:
 - get rid of preemption dependency (TIF_ALLOW_RESCHED); this version
   was limited to preempt=full|lazy.
 - override folio_zero_user() (Linus)
 (https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)

v2:
 - addressed review comments from peterz, tglx.
 - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
 - General code cleanup
 (https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

Comments appreciated!

Also at:
  github.com/terminus/linux clear-pages.v7

[1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/

Ankur Arora (6):
  mm: introduce clear_pages() and clear_user_pages()
  mm/highmem: introduce clear_user_highpages()
  x86/mm: Simplify clear_page_*
  x86/clear_page: Introduce clear_pages()
  mm, folio_zero_user: support clearing page ranges
  mm: folio_zero_user: cache neighbouring pages

David Hildenbrand (1):
  treewide: provide a generic clear_user_page() variant

 arch/alpha/include/asm/page.h      |  1 -
 arch/arc/include/asm/page.h        |  2 +
 arch/arm/include/asm/page-nommu.h  |  1 -
 arch/arm64/include/asm/page.h      |  1 -
 arch/csky/abiv1/inc/abi/page.h     |  1 +
 arch/csky/abiv2/inc/abi/page.h     |  7 ---
 arch/hexagon/include/asm/page.h    |  1 -
 arch/loongarch/include/asm/page.h  |  1 -
 arch/m68k/include/asm/page_mm.h    |  1 +
 arch/m68k/include/asm/page_no.h    |  1 -
 arch/microblaze/include/asm/page.h |  1 -
 arch/mips/include/asm/page.h       |  1 +
 arch/nios2/include/asm/page.h      |  1 +
 arch/openrisc/include/asm/page.h   |  1 -
 arch/parisc/include/asm/page.h     |  1 -
 arch/powerpc/include/asm/page.h    |  1 +
 arch/riscv/include/asm/page.h      |  1 -
 arch/s390/include/asm/page.h       |  1 -
 arch/sparc/include/asm/page_32.h   |  2 +
 arch/sparc/include/asm/page_64.h   |  1 +
 arch/um/include/asm/page.h         |  1 -
 arch/x86/include/asm/page.h        |  6 ---
 arch/x86/include/asm/page_32.h     |  6 +++
 arch/x86/include/asm/page_64.h     | 64 ++++++++++++++++++-----
 arch/x86/lib/clear_page_64.S       | 39 +++-----------
 arch/xtensa/include/asm/page.h     |  1 -
 include/linux/highmem.h            | 29 +++++++++++
 include/linux/mm.h                 | 69 +++++++++++++++++++++++++
 mm/memory.c                        | 82 ++++++++++++++++++++++--------
 mm/util.c                          | 13 +++++
 30 files changed, 247 insertions(+), 91 deletions(-)

-- 
2.43.5



* Re: [PATCH v8 0/7] mm: folio_zero_user: clear contiguous pages
@ 2025-10-28 17:15 Ankur Arora
  0 siblings, 0 replies; 28+ messages in thread
From: Ankur Arora @ 2025-10-28 17:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ankur Arora, linux-kernel, linux-mm, x86, david, bp, dave.hansen,
	hpa, mingo, mjguzik, luto, peterz, acme, namhyung, tglx, willy,
	raghavendra.kt, boris.ostrovsky, konrad.wilk

References: <20251027202109.678022-1-ankur.a.arora@oracle.com>
 <20251027143309.4331a65f38f05ea95d9e46ad@linux-foundation.org>
User-agent: mu4e 1.4.10; emacs 27.2
In-reply-to: <20251027143309.4331a65f38f05ea95d9e46ad@linux-foundation.org>

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 27 Oct 2025 13:21:02 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> This series adds clearing of contiguous page ranges for hugepages,
>> improving on the current page-at-a-time approach in two ways:
>>
>>  - amortizes the per-page setup cost over a larger extent
>>  - when using string instructions, exposes the real region size
>>    to the processor.
>>
>> A processor could use knowledge of the extent to optimize the
>> clearing. AMD Zen uarchs, for example, elide allocation of
>> cachelines for regions larger than the L3 size.
>>
>> Demand faulting a 64GB region shows performance improvements:
>>
>>  $ perf bench mem map -p $pg-sz -f demand -s 64GB -l 5
>>
>>                        baseline              +series             change
>>
>>                   (GB/s  +- %stdev)     (GB/s  +- %stdev)
>>
>>    pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*
>>
>>    pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05% [#]   +  5.2%	preempt=none|voluntary
>>    pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21%       +144.3%	preempt=full|lazy
>>
>> [#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
>> allocation, which is higher than the maximum extent used on x86
>> (ARCH_CONTIG_PAGE_NR=8MB), so preempt=none|voluntary sees no improvement
>> with pg-sz=1GB.
>
> I wasn't understanding this preemption thing at all, but then I saw this
> in the v4 series changelogging:
>
> : [#] Only with preempt=full|lazy because cooperatively preempted models
> : need regular invocations of cond_resched(). This limits the extent
> : sizes that can be cleared as a unit.
>
> Please put this back in!!

/me facepalms. Sorry you had to go that far back.
Yeah, that doesn't make any kind of sense standalone. Will fix.

> It's possible that we're being excessively aggressive with those
> cond_resched()s.  Have you investigated tuning their frequency so we
> can use larger extent sizes with these preemption models?

folio_zero_user() does a small part of that: for 2MB pages the clearing
is split in three parts with an intervening cond_resched() for each.

This is of course much simpler than the process_huge_page() approach where
we do a left right dance around the faulting page.

In [a] I had implemented a version of process_huge_page() with larger
extent sizes that narrowed as we got closer to the faulting page (x86
performance was similar to the current series; see [b]).

In hindsight, that felt too elaborate and probably unnecessary on most
modern systems, which have reasonably large caches. Where it might
help is on more cache-constrained systems where spatial locality
really does matter.

So, my idea was to start with a simple version, get some testing and
then fill in the gaps instead of starting with something like [a].


[a] https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/#r
[b] https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/


>> The anon-w-seq test in the vm-scalability benchmark, however, does show
>> worse performance with utime increasing by ~9%:
>>
>>                          stime                  utime
>>
>>   baseline         1654.63 ( +- 3.84% )     811.00 ( +- 3.84% )
>>   +series          1630.32 ( +- 2.73% )     886.37 ( +- 5.19% )
>>
>> In part this is because anon-w-seq runs with 384 processes zeroing
>> anonymously mapped memory which they then access sequentially. As
>> such, this is likely an uncommon pattern: memory bandwidth is
>> saturated while the workload is also cache limited, because the
>> entire region is accessed.
>>
>> Raghavendra also tested a previous version of the series on AMD Genoa [1].
>
> I suggest you paste Raghavendra's results into this [0/N] - it's
> important material.

Will do.

>>
>> ...
>>
>>  arch/alpha/include/asm/page.h      |  1 -
>>  arch/arc/include/asm/page.h        |  2 +
>>  arch/arm/include/asm/page-nommu.h  |  1 -
>>  arch/arm64/include/asm/page.h      |  1 -
>>  arch/csky/abiv1/inc/abi/page.h     |  1 +
>>  arch/csky/abiv2/inc/abi/page.h     |  7 ---
>>  arch/hexagon/include/asm/page.h    |  1 -
>>  arch/loongarch/include/asm/page.h  |  1 -
>>  arch/m68k/include/asm/page_mm.h    |  1 +
>>  arch/m68k/include/asm/page_no.h    |  1 -
>>  arch/microblaze/include/asm/page.h |  1 -
>>  arch/mips/include/asm/page.h       |  1 +
>>  arch/nios2/include/asm/page.h      |  1 +
>>  arch/openrisc/include/asm/page.h   |  1 -
>>  arch/parisc/include/asm/page.h     |  1 -
>>  arch/powerpc/include/asm/page.h    |  1 +
>>  arch/riscv/include/asm/page.h      |  1 -
>>  arch/s390/include/asm/page.h       |  1 -
>>  arch/sparc/include/asm/page_32.h   |  2 +
>>  arch/sparc/include/asm/page_64.h   |  1 +
>>  arch/um/include/asm/page.h         |  1 -
>>  arch/x86/include/asm/page.h        |  6 ---
>>  arch/x86/include/asm/page_32.h     |  6 +++
>>  arch/x86/include/asm/page_64.h     | 64 ++++++++++++++++++-----
>>  arch/x86/lib/clear_page_64.S       | 39 +++-----------
>>  arch/xtensa/include/asm/page.h     |  1 -
>>  include/linux/highmem.h            | 29 +++++++++++
>>  include/linux/mm.h                 | 69 +++++++++++++++++++++++++
>>  mm/memory.c                        | 82 ++++++++++++++++++++++--------
>>  mm/util.c                          | 13 +++++
>>  30 files changed, 247 insertions(+), 91 deletions(-)
>
> I guess this is an mm.git thing, with x86 acks (please).

Ack that.

> The documented review activity is rather thin at this time so I'll sit
> this out for a while.  Please ping me next week and we can reassess,

Will do. And, thanks for the quick look!

--
ankur
Date: Tue, 28 Oct 2025 10:15:38 -0700
Message-ID: <87zf9bq75x.fsf@oracle.com>



end of thread, other threads:[~2025-11-11  6:25 UTC | newest]

Thread overview: 28+ messages
2025-10-27 20:21 [PATCH v8 0/7] mm: folio_zero_user: clear contiguous pages Ankur Arora
2025-10-27 20:21 ` [PATCH v8 1/7] treewide: provide a generic clear_user_page() variant Ankur Arora
2025-10-27 20:21 ` [PATCH v8 2/7] mm: introduce clear_pages() and clear_user_pages() Ankur Arora
2025-11-07  8:47   ` David Hildenbrand (Red Hat)
2025-10-27 20:21 ` [PATCH v8 3/7] mm/highmem: introduce clear_user_highpages() Ankur Arora
2025-11-07  8:48   ` David Hildenbrand (Red Hat)
2025-11-10  7:20     ` Ankur Arora
2025-10-27 20:21 ` [PATCH v8 4/7] x86/mm: Simplify clear_page_* Ankur Arora
2025-10-28 13:36   ` Borislav Petkov
2025-10-29 23:26     ` Ankur Arora
2025-10-30  0:17       ` Borislav Petkov
2025-10-30  5:21         ` Ankur Arora
2025-10-27 20:21 ` [PATCH v8 5/7] x86/clear_page: Introduce clear_pages() Ankur Arora
2025-10-28 13:56   ` Borislav Petkov
2025-10-28 18:51     ` Ankur Arora
2025-10-29 22:57       ` Borislav Petkov
2025-10-29 23:31         ` Ankur Arora
2025-10-27 20:21 ` [PATCH v8 6/7] mm, folio_zero_user: support clearing page ranges Ankur Arora
2025-11-07  8:59   ` David Hildenbrand (Red Hat)
2025-11-10  7:20     ` Ankur Arora
2025-11-10  8:57       ` David Hildenbrand (Red Hat)
2025-11-11  6:24         ` Ankur Arora
2025-10-27 20:21 ` [PATCH v8 7/7] mm: folio_zero_user: cache neighbouring pages Ankur Arora
2025-10-27 21:33 ` [PATCH v8 0/7] mm: folio_zero_user: clear contiguous pages Andrew Morton
2025-10-28 17:22   ` Ankur Arora
2025-11-07  5:33     ` Ankur Arora
2025-11-07  8:59       ` David Hildenbrand (Red Hat)
  -- strict thread matches above, loose matches on Subject: below --
2025-10-28 17:15 Ankur Arora
