linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/5] add static PMD zero page support
@ 2025-07-07 14:23 Pankaj Raghav (Samsung)
  2025-07-07 14:23 ` [PATCH v2 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h Pankaj Raghav (Samsung)
                   ` (9 more replies)
  0 siblings, 10 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-07 14:23 UTC (permalink / raw)
  To: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time using ZERO_PAGE
is limited to PAGE_SIZE.

This concern was raised during the review of adding Large Block Size support
to XFS[1][2].

This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as part of a single bvec.

Some examples of places in the kernel where this could be useful:
- blkdev_issue_zero_pages()
- iomap_dio_zero()
- vmalloc.c:zero_iter()
- rxperf_process_call()
- fscrypt_zeroout_range_inline_crypt()
- bch2_checksum_update()
...
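
To illustrate the bvec savings, here is a userspace sketch (the helper
name is hypothetical, not a kernel API) of how many bio segments are
needed to describe a run of zeroes, given the size of the zero page
each segment can reference:

```c
#include <assert.h>
#include <stddef.h>

/*
 * nr_zero_segments - hypothetical illustration, not a kernel API.
 *
 * Computes how many bio segments (bvecs) are needed to describe `len`
 * bytes of zeroes when each segment can reference at most one zero
 * page of `zero_page_size` bytes.
 */
size_t nr_zero_segments(size_t len, size_t zero_page_size)
{
	/* Round up: a partial tail still needs its own segment. */
	return (len + zero_page_size - 1) / zero_page_size;
}
```

With a 4KB ZERO_PAGE, zeroing 1MB takes 256 segments; with a 2MB zero
folio, the same range fits in a single segment.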

We already have huge_zero_folio that is allocated on demand, and it will be
deallocated by the shrinker if there are no users of it left.

At the moment, the huge_zero_folio infrastructure's refcount is tied to
the lifetime of the process that created it. This might not work for the
bio layer, as completions can be async and the process that created the
huge_zero_folio might no longer be alive.

Add a config option STATIC_PMD_ZERO_PAGE that will always allocate
the huge_zero_folio via memblock, and it will never be freed.

I have converted blkdev_issue_zero_pages() as an example as part of
this series.

I will send patches to individual subsystems using the huge_zero_folio
once this gets upstreamed.

Looking forward to some feedback.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/

Changes since v1:
- Move from .bss to allocating it through memblock (David)

Changes since RFC:
- Added the config option based on the feedback from David.
- Encode more info in the header to avoid dead code (Dave Hansen's
  feedback)
- The static part of huge_zero_folio stays in memory.c and the dynamic
  part stays in huge_memory.c
- Split the patches to make it easy for review.

Pankaj Raghav (5):
  mm: move huge_zero_page declaration from huge_mm.h to mm.h
  huge_memory: add huge_zero_page_shrinker_(init|exit) function
  mm: add static PMD zero page
  mm: add largest_zero_folio() routine
  block: use largest_zero_folio in __blkdev_issue_zero_pages()

 block/blk-lib.c         | 17 +++++----
 include/linux/huge_mm.h | 31 ----------------
 include/linux/mm.h      | 81 +++++++++++++++++++++++++++++++++++++++++
 mm/Kconfig              |  9 +++++
 mm/huge_memory.c        | 62 +++++++++++++++++++++++--------
 mm/memory.c             | 25 +++++++++++++
 mm/mm_init.c            |  1 +
 7 files changed, 173 insertions(+), 53 deletions(-)


base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
-- 
2.49.0


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v2 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
@ 2025-07-07 14:23 ` Pankaj Raghav (Samsung)
  2025-07-15 14:08   ` Lorenzo Stoakes
  2025-07-07 14:23 ` [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function Pankaj Raghav (Samsung)
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-07 14:23 UTC (permalink / raw)
  To: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Move the declarations associated with huge_zero_page from huge_mm.h to
mm.h. This patch is in preparation for adding a static PMD zero page, as
we will be reusing some of the huge_zero_page infrastructure.

No functional changes.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/huge_mm.h | 31 -------------------------------
 include/linux/mm.h      | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 31 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..3e887374892c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -478,22 +478,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
-extern struct folio *huge_zero_folio;
-extern unsigned long huge_zero_pfn;
-
-static inline bool is_huge_zero_folio(const struct folio *folio)
-{
-	return READ_ONCE(huge_zero_folio) == folio;
-}
-
-static inline bool is_huge_zero_pmd(pmd_t pmd)
-{
-	return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
-}
-
-struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
-void mm_put_huge_zero_folio(struct mm_struct *mm);
-
 static inline bool thp_migration_supported(void)
 {
 	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
@@ -631,21 +615,6 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	return 0;
 }
 
-static inline bool is_huge_zero_folio(const struct folio *folio)
-{
-	return false;
-}
-
-static inline bool is_huge_zero_pmd(pmd_t pmd)
-{
-	return false;
-}
-
-static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
-{
-	return;
-}
-
 static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
 	unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0ef2ba0c667a..c8fbeaacf896 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4018,6 +4018,40 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct folio *huge_zero_folio;
+extern unsigned long huge_zero_pfn;
+
+static inline bool is_huge_zero_folio(const struct folio *folio)
+{
+	return READ_ONCE(huge_zero_folio) == folio;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+	return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
+}
+
+struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
+void mm_put_huge_zero_folio(struct mm_struct *mm);
+
+#else
+static inline bool is_huge_zero_folio(const struct folio *folio)
+{
+	return false;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+	return false;
+}
+
+static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
+{
+	return;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 #if MAX_NUMNODES > 1
 void __init setup_nr_node_ids(void);
 #else
-- 
2.49.0



* [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
  2025-07-07 14:23 ` [PATCH v2 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h Pankaj Raghav (Samsung)
@ 2025-07-07 14:23 ` Pankaj Raghav (Samsung)
  2025-07-15 14:18   ` David Hildenbrand
  2025-07-15 14:29   ` Lorenzo Stoakes
  2025-07-07 14:23 ` [PATCH v2 3/5] mm: add static PMD zero page Pankaj Raghav (Samsung)
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-07 14:23 UTC (permalink / raw)
  To: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Add huge_zero_page_shrinker_init() and huge_zero_page_shrinker_exit().
As the shrinker will not be needed when the static PMD zero page is
enabled, these two functions can become no-ops.

This is a preparation patch for static PMD zero page. No functional
changes.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/huge_memory.c | 38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..101b67ab2eb6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -289,6 +289,24 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
 }
 
 static struct shrinker *huge_zero_page_shrinker;
+static int huge_zero_page_shrinker_init(void)
+{
+	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
+	if (!huge_zero_page_shrinker)
+		return -ENOMEM;
+
+	huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
+	huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
+	shrinker_register(huge_zero_page_shrinker);
+	return 0;
+}
+
+static void huge_zero_page_shrinker_exit(void)
+{
+	shrinker_free(huge_zero_page_shrinker);
+	return;
+}
+
 
 #ifdef CONFIG_SYSFS
 static ssize_t enabled_show(struct kobject *kobj,
@@ -850,33 +868,31 @@ static inline void hugepage_exit_sysfs(struct kobject *hugepage_kobj)
 
 static int __init thp_shrinker_init(void)
 {
-	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
-	if (!huge_zero_page_shrinker)
-		return -ENOMEM;
+	int ret = 0;
 
 	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
 						 SHRINKER_MEMCG_AWARE |
 						 SHRINKER_NONSLAB,
 						 "thp-deferred_split");
-	if (!deferred_split_shrinker) {
-		shrinker_free(huge_zero_page_shrinker);
+	if (!deferred_split_shrinker)
 		return -ENOMEM;
-	}
-
-	huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
-	huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
-	shrinker_register(huge_zero_page_shrinker);
 
 	deferred_split_shrinker->count_objects = deferred_split_count;
 	deferred_split_shrinker->scan_objects = deferred_split_scan;
 	shrinker_register(deferred_split_shrinker);
 
+	ret = huge_zero_page_shrinker_init();
+	if (ret) {
+		shrinker_free(deferred_split_shrinker);
+		return ret;
+	}
+
 	return 0;
 }
 
 static void __init thp_shrinker_exit(void)
 {
-	shrinker_free(huge_zero_page_shrinker);
+	huge_zero_page_shrinker_exit();
 	shrinker_free(deferred_split_shrinker);
 }
 
-- 
2.49.0



* [PATCH v2 3/5] mm: add static PMD zero page
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
  2025-07-07 14:23 ` [PATCH v2 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h Pankaj Raghav (Samsung)
  2025-07-07 14:23 ` [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function Pankaj Raghav (Samsung)
@ 2025-07-07 14:23 ` Pankaj Raghav (Samsung)
  2025-07-15 14:21   ` David Hildenbrand
  2025-07-15 15:26   ` Lorenzo Stoakes
  2025-07-07 14:23 ` [PATCH v2 4/5] mm: add largest_zero_folio() routine Pankaj Raghav (Samsung)
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-07 14:23 UTC (permalink / raw)
  To: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time using ZERO_PAGE
is limited to PAGE_SIZE.

This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as part of a single bvec.

This concern was raised during the review of adding LBS support to
XFS[1][2].

Usually huge_zero_folio is allocated on demand, and it will be
deallocated by the shrinker if there are no users of it left. At the
moment, the huge_zero_folio infrastructure's refcount is tied to the
lifetime of the process that created it. This might not work for the bio
layer, as completions can be async and the process that created the
huge_zero_folio might no longer be alive.

Add a config option STATIC_PMD_ZERO_PAGE that will always allocate
the huge_zero_folio, and it will never be freed. This makes it possible
to use the huge_zero_folio without passing any mm struct, and does not
tie the lifetime of the zero folio to anything.

memblock is used to allocate this PMD zero page during early boot.

If STATIC_PMD_ZERO_PAGE config option is enabled, then
mm_get_huge_zero_folio() will simply return this page instead of
dynamically allocating a new PMD page.

As STATIC_PMD_ZERO_PAGE does not depend on THP, declare huge_zero_folio
and huge_zero_pfn outside the THP config.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/mm.h | 25 ++++++++++++++++++++++++-
 mm/Kconfig         |  9 +++++++++
 mm/huge_memory.c   | 24 ++++++++++++++++++++----
 mm/memory.c        | 25 +++++++++++++++++++++++++
 mm/mm_init.c       |  1 +
 5 files changed, 79 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c8fbeaacf896..428fe6d36b3c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4018,10 +4018,19 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+extern void __init static_pmd_zero_init(void);
+#else
+static inline void __init static_pmd_zero_init(void)
+{
+	return;
+}
+#endif
+
 extern struct folio *huge_zero_folio;
 extern unsigned long huge_zero_pfn;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline bool is_huge_zero_folio(const struct folio *folio)
 {
 	return READ_ONCE(huge_zero_folio) == folio;
@@ -4032,9 +4041,23 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
 	return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
 }
 
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+static inline struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
+{
+	return READ_ONCE(huge_zero_folio);
+}
+
+static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
+{
+	return;
+}
+
+#else
 struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
 void mm_put_huge_zero_folio(struct mm_struct *mm);
 
+#endif /* CONFIG_STATIC_PMD_ZERO_PAGE */
+
 #else
 static inline bool is_huge_zero_folio(const struct folio *folio)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..89d5971cf180 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,15 @@ config ARCH_WANTS_THP_SWAP
 config MM_ID
 	def_bool n
 
+config STATIC_PMD_ZERO_PAGE
+	bool "Allocate a PMD page for zeroing"
+	help
+	  Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
+	  on demand and deallocated when not in use. This option will
+	  allocate a PMD sized zero page during early boot and huge_zero_folio will
+	  use it instead of allocating it dynamically.
+	  Not suitable for memory constrained systems.
+
 menuconfig TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
 	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 101b67ab2eb6..c12ca7134e88 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -75,9 +75,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 					 struct shrink_control *sc);
 static bool split_underused_thp = true;
 
-static atomic_t huge_zero_refcount;
-struct folio *huge_zero_folio __read_mostly;
-unsigned long huge_zero_pfn __read_mostly = ~0UL;
 unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
@@ -208,6 +205,23 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 	return orders;
 }
 
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+static int huge_zero_page_shrinker_init(void)
+{
+	return 0;
+}
+
+static void huge_zero_page_shrinker_exit(void)
+{
+	return;
+}
+#else
+
+static struct shrinker *huge_zero_page_shrinker;
+static atomic_t huge_zero_refcount;
+struct folio *huge_zero_folio __read_mostly;
+unsigned long huge_zero_pfn __read_mostly = ~0UL;
+
 static bool get_huge_zero_page(void)
 {
 	struct folio *zero_folio;
@@ -288,7 +302,6 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
 	return 0;
 }
 
-static struct shrinker *huge_zero_page_shrinker;
 static int huge_zero_page_shrinker_init(void)
 {
 	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
@@ -307,6 +320,7 @@ static void huge_zero_page_shrinker_exit(void)
 	return;
 }
 
+#endif
 
 #ifdef CONFIG_SYSFS
 static ssize_t enabled_show(struct kobject *kobj,
@@ -2843,6 +2857,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pte_t *pte;
 	int i;
 
+	// FIXME: can this be called with static zero page?
+	VM_BUG_ON(IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE));
 	/*
 	 * Leave pmd empty until pte is filled note that it is fine to delay
 	 * notification until mmu_notifier_invalidate_range_end() as we are
diff --git a/mm/memory.c b/mm/memory.c
index b0cda5aab398..42c4c31ad14c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -42,6 +42,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/memblock.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -159,6 +160,30 @@ static int __init init_zero_pfn(void)
 }
 early_initcall(init_zero_pfn);
 
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+struct folio *huge_zero_folio __read_mostly = NULL;
+unsigned long huge_zero_pfn __read_mostly = ~0UL;
+
+void __init static_pmd_zero_init(void)
+{
+	void *alloc = memblock_alloc(PMD_SIZE, PAGE_SIZE);
+
+	if (!alloc)
+		return;
+
+	huge_zero_folio = virt_to_folio(alloc);
+	huge_zero_pfn = page_to_pfn(virt_to_page(alloc));
+
+	__folio_set_head(huge_zero_folio);
+	prep_compound_head((struct page *)huge_zero_folio, PMD_ORDER);
+	/* Ensure zero folio won't have large_rmappable flag set. */
+	folio_clear_large_rmappable(huge_zero_folio);
+	folio_zero_range(huge_zero_folio, 0, PMD_SIZE);
+
+	return;
+}
+#endif
+
 void mm_trace_rss_stat(struct mm_struct *mm, int member)
 {
 	trace_rss_stat(mm, member);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f2944748f526..56d7ec372af1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2765,6 +2765,7 @@ void __init mm_core_init(void)
 	 */
 	kho_memory_init();
 
+	static_pmd_zero_init();
 	memblock_free_all();
 	mem_init();
 	kmem_cache_init();
-- 
2.49.0



* [PATCH v2 4/5] mm: add largest_zero_folio() routine
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
                   ` (2 preceding siblings ...)
  2025-07-07 14:23 ` [PATCH v2 3/5] mm: add static PMD zero page Pankaj Raghav (Samsung)
@ 2025-07-07 14:23 ` Pankaj Raghav (Samsung)
  2025-07-15 14:16   ` David Hildenbrand
  2025-07-15 16:13   ` Lorenzo Stoakes
  2025-07-07 14:23 ` [PATCH v2 5/5] block: use largest_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav (Samsung)
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-07 14:23 UTC (permalink / raw)
  To: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Add a largest_zero_folio() routine so that huge_zero_folio can be
used without the need to pass any mm struct. This will return the
ZERO_PAGE folio if CONFIG_STATIC_PMD_ZERO_PAGE is disabled or if we
failed to allocate a PMD page from memblock.

This routine can also be called even if THP is disabled.
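
Because callers are expected to deduce the folio size with folio_size()
rather than assume PMD_SIZE, the same fill loop works whether the
routine returned the PMD folio or the ZERO_PAGE fallback. A userspace
sketch of the per-iteration length computation (hypothetical helper
name, not a kernel API):

```c
#include <assert.h>
#include <stddef.h>

/*
 * bytes_this_iteration - hypothetical illustration, not a kernel API.
 *
 * Mirrors the min_t(sector_t, folio_size(zero_folio), remaining)
 * computation a caller would do: each bio segment covers at most one
 * whole zero folio, clamped to the bytes still left to zero.
 */
size_t bytes_this_iteration(size_t zero_folio_size, size_t remaining)
{
	return remaining < zero_folio_size ? remaining : zero_folio_size;
}
```

The caller's loop is unchanged between configs; only the clamp value
(4KB vs 2MB) differs.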

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 include/linux/mm.h | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 428fe6d36b3c..d5543cf7b8e9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4018,17 +4018,41 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
 
+extern struct folio *huge_zero_folio;
+extern unsigned long huge_zero_pfn;
+
 #ifdef CONFIG_STATIC_PMD_ZERO_PAGE
 extern void __init static_pmd_zero_init(void);
+
+/*
+ * largest_zero_folio - Get the largest zero size folio available
+ *
+ * This function will return a PMD sized zero folio if CONFIG_STATIC_PMD_ZERO_PAGE
+ * is enabled. Otherwise, a ZERO_PAGE folio is returned.
+ *
+ * Deduce the size of the folio with folio_size instead of assuming the
+ * folio size.
+ */
+static inline struct folio *largest_zero_folio(void)
+{
+	if(!huge_zero_folio)
+		return page_folio(ZERO_PAGE(0));
+
+	return READ_ONCE(huge_zero_folio);
+}
+
 #else
 static inline void __init static_pmd_zero_init(void)
 {
 	return;
 }
+
+static inline struct folio *largest_zero_folio(void)
+{
+	return page_folio(ZERO_PAGE(0));
+}
 #endif
 
-extern struct folio *huge_zero_folio;
-extern unsigned long huge_zero_pfn;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline bool is_huge_zero_folio(const struct folio *folio)
-- 
2.49.0



* [PATCH v2 5/5] block: use largest_zero_folio in __blkdev_issue_zero_pages()
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
                   ` (3 preceding siblings ...)
  2025-07-07 14:23 ` [PATCH v2 4/5] mm: add largest_zero_folio() routine Pankaj Raghav (Samsung)
@ 2025-07-07 14:23 ` Pankaj Raghav (Samsung)
  2025-07-15 16:19   ` Lorenzo Stoakes
  2025-07-07 18:06 ` [PATCH v2 0/5] add static PMD zero page support Zi Yan
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-07 14:23 UTC (permalink / raw)
  To: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Use largest_zero_folio() in __blkdev_issue_zero_pages().

On systems with CONFIG_STATIC_PMD_ZERO_PAGE enabled, we will end up
sending larger bvecs instead of multiple small ones.

Noticed a 4% increase in performance on a commercial NVMe SSD that does
not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
gains might be bigger if the device supports a bigger MDTS.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-lib.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 4c9f20a689f7..70a5700b6717 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -196,6 +196,10 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned int flags)
 {
+	struct folio *zero_folio;
+
+	zero_folio = largest_zero_folio();
+
 	while (nr_sects) {
 		unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
 		struct bio *bio;
@@ -208,15 +212,14 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
 			break;
 
 		do {
-			unsigned int len, added;
+			unsigned int len;
 
-			len = min_t(sector_t,
-				PAGE_SIZE, nr_sects << SECTOR_SHIFT);
-			added = bio_add_page(bio, ZERO_PAGE(0), len, 0);
-			if (added < len)
+			len = min_t(sector_t, folio_size(zero_folio),
+				    nr_sects << SECTOR_SHIFT);
+			if (!bio_add_folio(bio, zero_folio, len, 0))
 				break;
-			nr_sects -= added >> SECTOR_SHIFT;
-			sector += added >> SECTOR_SHIFT;
+			nr_sects -= len >> SECTOR_SHIFT;
+			sector += len >> SECTOR_SHIFT;
 		} while (nr_sects);
 
 		*biop = bio_chain_and_submit(*biop, bio);
-- 
2.49.0



* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
                   ` (4 preceding siblings ...)
  2025-07-07 14:23 ` [PATCH v2 5/5] block: use largest_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav (Samsung)
@ 2025-07-07 18:06 ` Zi Yan
  2025-07-09  8:03   ` Pankaj Raghav
  2025-07-07 22:38 ` Andrew Morton
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Zi Yan @ 2025-07-07 18:06 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Mike Rapoport,
	Dave Hansen, Michal Hocko, David Hildenbrand, Lorenzo Stoakes,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On 7 Jul 2025, at 10:23, Pankaj Raghav (Samsung) wrote:

> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zeroout larger
> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
> is limited by PAGE_SIZE.
>
> This concern was raised during the review of adding Large Block Size support
> to XFS[1][2].
>
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in block layer, it is much more efficient to send out
> larger zero pages as a part of a single bvec.
>
> Some examples of places in the kernel where this could be useful:
> - blkdev_issue_zero_pages()
> - iomap_dio_zero()
> - vmalloc.c:zero_iter()
> - rxperf_process_call()
> - fscrypt_zeroout_range_inline_crypt()
> - bch2_checksum_update()
> ...
>
> We already have huge_zero_folio that is allocated on demand, and it will be
> deallocated by the shrinker if there are no users of it left.
>
> At moment, huge_zero_folio infrastructure refcount is tied to the process
> lifetime that created it. This might not work for bio layer as the completions
> can be async and the process that created the huge_zero_folio might no
> longer be alive.
>
> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate
> the huge_zero_folio via memblock, and it will never be freed.

Do the above users want a PMD sized zero page or a 2MB zero page?
Because on systems with a non-4KB base page size, e.g., ARM64 with a
64KB base page, the PMD size is different. ARM64 with a 64KB base page
has 512MB PMD sized pages. Having STATIC_PMD_ZERO_PAGE would mean
losing half a GB of memory. I am not sure if that is acceptable.
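
(The arithmetic behind those numbers can be sketched in userspace,
assuming 8-byte page table entries as on 64-bit kernels; the helper
name is illustrative, not a kernel API:)

```c
#include <assert.h>

/*
 * pmd_coverage - illustrative only, not a kernel API.
 *
 * With 8-byte PTEs, one PTE page holds page_size / 8 entries, so a
 * single PMD entry maps page_size * (page_size / 8) bytes.
 */
unsigned long long pmd_coverage(unsigned long long page_size)
{
	unsigned long long ptrs_per_pte = page_size / 8;
	return page_size * ptrs_per_pte;
}
```

With 4KB base pages a PMD maps 2MB; with 64KB base pages it maps 512MB.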

Best Regards,
Yan, Zi


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
                   ` (5 preceding siblings ...)
  2025-07-07 18:06 ` [PATCH v2 0/5] add static PMD zero page support Zi Yan
@ 2025-07-07 22:38 ` Andrew Morton
  2025-07-09  9:59   ` Pankaj Raghav
  2025-07-15 14:15   ` David Hildenbrand
  2025-07-15 13:53 ` Pankaj Raghav
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 40+ messages in thread
From: Andrew Morton @ 2025-07-07 22:38 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Lorenzo Stoakes, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Mon,  7 Jul 2025 16:23:14 +0200 "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> wrote:

> There are many places in the kernel where we need to zeroout larger
> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
> is limited by PAGE_SIZE.
> 
> This concern was raised during the review of adding Large Block Size support
> to XFS[1][2].
> 
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in block layer, it is much more efficient to send out
> larger zero pages as a part of a single bvec.
> 
> Some examples of places in the kernel where this could be useful:
> - blkdev_issue_zero_pages()
> - iomap_dio_zero()
> - vmalloc.c:zero_iter()
> - rxperf_process_call()
> - fscrypt_zeroout_range_inline_crypt()
> - bch2_checksum_update()
> ...
> 
> We already have huge_zero_folio that is allocated on demand, and it will be
> deallocated by the shrinker if there are no users of it left.
> 
> At moment, huge_zero_folio infrastructure refcount is tied to the process
> lifetime that created it. This might not work for bio layer as the completions
> can be async and the process that created the huge_zero_folio might no
> longer be alive.

Can we change that?  Alter the refcounting model so that dropping the
final reference at interrupt time works as expected?

And if we were to do this, what sort of benefit might it produce?

> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate
> the huge_zero_folio via memblock, and it will never be freed.


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-07 18:06 ` [PATCH v2 0/5] add static PMD zero page support Zi Yan
@ 2025-07-09  8:03   ` Pankaj Raghav
  2025-07-09 15:55     ` Zi Yan
  2025-07-15 14:02     ` Lorenzo Stoakes
  0 siblings, 2 replies; 40+ messages in thread
From: Pankaj Raghav @ 2025-07-09  8:03 UTC (permalink / raw)
  To: Zi Yan
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Mike Rapoport,
	Dave Hansen, Michal Hocko, David Hildenbrand, Lorenzo Stoakes,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

Hi Zi,

>> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the huge_zero_folio via 
>> memblock, and it will never be freed.
> 
> Do the above users want a PMD sized zero page or a 2MB zero page? Because on systems with non 
> 4KB base page size, e.g., ARM64 with 64KB base page, PMD size is different. ARM64 with 64KB base 
> page has 512MB PMD sized pages. Having STATIC_PMD_ZERO_PAGE means losing half GB memory. I am 
> not sure if it is acceptable.
> 

That is a good point. My initial RFC patches allocated 2M instead of a PMD-sized
page.

But later David wanted to reuse the memory we allocate here with huge_zero_folio. So
if this config is enabled, we simply just use the same pointer for huge_zero_folio.

Because of that, I decided to go with a PMD-sized page.

This config is still opt-in, and I would expect users of 64k page size systems not to enable
this.

But to make sure we don't enable this for those architectures, I could do a per-arch opt-in with
something like this[1], which I did in my previous patch:

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 340e5468980e..c3a9d136ec0a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
 	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_HAS_STATIC_PMD_ZERO_PAGE	if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select BUILDTIME_TABLE_SORT


diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..fd1c51995029 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
 config MM_ID
 	def_bool n

+config ARCH_HAS_STATIC_PMD_ZERO_PAGE
+	def_bool n
+
+config STATIC_PMD_ZERO_PAGE
+	bool "Allocate a PMD page for zeroing"
+	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE
<snip>

Let me know your thoughts.

[1] https://lore.kernel.org/linux-mm/20250612105100.59144-4-p.raghav@samsung.com/#Z31mm:Kconfig
--
Pankaj


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-07 22:38 ` Andrew Morton
@ 2025-07-09  9:59   ` Pankaj Raghav
  2025-07-15 14:15   ` David Hildenbrand
  1 sibling, 0 replies; 40+ messages in thread
From: Pankaj Raghav @ 2025-07-09  9:59 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, Lorenzo Stoakes,
	Thomas Gleixner, Nico Pache, Dev Jain, Liam R . Howlett,
	Jens Axboe, linux-kernel, willy, linux-mm, x86, linux-block,
	linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev, hch,
	Pankaj Raghav

Hi Andrew,

>> We already have huge_zero_folio that is allocated on demand, and it will be
>> deallocated by the shrinker if there are no users of it left.
>>
>> At the moment, the huge_zero_folio infrastructure refcount is tied to the
>> lifetime of the process that created it. This might not work for the bio
>> layer, as completions can be async and the process that created the
>> huge_zero_folio might no longer be alive.
> 
> Can we change that?  Alter the refcounting model so that dropping the
> final reference at interrupt time works as expected?
> 

That is an interesting point. I did not try it. At the moment, we always drop the reference in
__mmput().

Going back to the discussion before this work started, one of the main things people wanted was
some sort of a **drop-in replacement** for ZERO_PAGE that can be bigger than PAGE_SIZE[1].

And, during the RFCs of these patches, one piece of feedback I got from David was that on big server
systems, 2M (in the case of 4k page size) should not be a problem, and we don't need any unnecessary
refcounting for it.

Also, when I had a chat with David, he mentioned that he wants to make changes to the existing
mm_huge_zero_folio infrastructure to get rid of the shrinker if possible. So we decided that it is
better to have an opt-in static allocation and keep the existing dynamic allocation path.

So that is why I went with this approach of having a static PMD allocation.

I hope this clarifies the motivation a bit.

Let me know if you have more questions.

> And if we were to do this, what sort of benefit might it produce?
> 
>> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate
>> the huge_zero_folio via memblock, and it will never be freed.
[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-09  8:03   ` Pankaj Raghav
@ 2025-07-09 15:55     ` Zi Yan
  2025-07-15 14:02     ` Lorenzo Stoakes
  1 sibling, 0 replies; 40+ messages in thread
From: Zi Yan @ 2025-07-09 15:55 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Mike Rapoport,
	Dave Hansen, Michal Hocko, David Hildenbrand, Lorenzo Stoakes,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On 9 Jul 2025, at 4:03, Pankaj Raghav wrote:

> Hi Zi,
>
>>> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the huge_zero_folio via
>>> memblock, and it will never be freed.
>>
>> Do the above users want a PMD sized zero page or a 2MB zero page? Because on systems with non
>> 4KB base page size, e.g., ARM64 with 64KB base page, PMD size is different. ARM64 with 64KB base
>> page has 512MB PMD sized pages. Having STATIC_PMD_ZERO_PAGE means losing half GB memory. I am
>> not sure if it is acceptable.
>>
>
> That is a good point. My initial RFC patches allocated 2M instead of a PMD sized
> page.
>
> But later David wanted to reuse the memory we allocate here with huge_zero_folio. So
> if this config is enabled, we simply just use the same pointer for huge_zero_folio.
>
> Since that happened, I decided to go with PMD sized page.

Got it. Thank you for the explanation. This means that for your use cases
2MB is big enough. For those arches whose PMD is larger than 2MB, ideally,
a 2MB zero mTHP should be used. Thinking about this feature long term,
I wonder what we should do to support arches with PMD > 2MB. Make
the static huge zero page size a boot-time parameter?

>
> This config is still opt in and I would expect the users with 64k page size systems to not enable
> this.
>
> But to make sure we don't enable this for those architecture, I could do a per-arch opt in with
> something like this[1] that I did in my previous patch:
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 340e5468980e..c3a9d136ec0a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -153,6 +153,7 @@ config X86
>  	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
>  	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
>  	select ARCH_WANTS_THP_SWAP		if X86_64
> +	select ARCH_HAS_STATIC_PMD_ZERO_PAGE	if X86_64
>  	select ARCH_HAS_PARANOID_L1D_FLUSH
>  	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>  	select BUILDTIME_TABLE_SORT
>
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 781be3240e21..fd1c51995029 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
>  config MM_ID
>  	def_bool n
>
> +config ARCH_HAS_STATIC_PMD_ZERO_PAGE
> +	def_bool n
> +
> +config STATIC_PMD_ZERO_PAGE
> +	bool "Allocate a PMD page for zeroing"
> +	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE
> <snip>
>
> Let me know your thoughts.

Sounds good to me, since without STATIC_PMD_ZERO_PAGE, when THP is enabled,
the use cases you mentioned are still able to use the THP zero page.

Thanks.

>
> [1] https://lore.kernel.org/linux-mm/20250612105100.59144-4-p.raghav@samsung.com/#Z31mm:Kconfig
> --
> Pankaj


Best Regards,
Yan, Zi


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
                   ` (6 preceding siblings ...)
  2025-07-07 22:38 ` Andrew Morton
@ 2025-07-15 13:53 ` Pankaj Raghav
  2025-07-15 14:04 ` Lorenzo Stoakes
  2025-07-15 15:34 ` Lorenzo Stoakes
  9 siblings, 0 replies; 40+ messages in thread
From: Pankaj Raghav @ 2025-07-15 13:53 UTC (permalink / raw)
  To: Zi Yan, David Hildenbrand
  Cc: Dave Hansen, Michal Hocko, Jens Axboe, Liam R . Howlett,
	Nico Pache, Andrew Morton, Lorenzo Stoakes, Mike Rapoport,
	Vlastimil Babka, H . Peter Anvin, Borislav Petkov, Baolin Wang,
	Ryan Roberts, Suren Baghdasaryan, linux-kernel, Dev Jain,
	Thomas Gleixner, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, hch, Pankaj Raghav,
	Ingo Molnar

Hi David,

For now I have some feedback from Zi. It would be great to hear your
feedback before I send the next version :)

--
Pankaj

On Mon, Jul 07, 2025 at 04:23:14PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zeroout larger
> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
> is limited by PAGE_SIZE.
>
> This concern was raised during the review of adding Large Block Size support
> to XFS[1][2].
>
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in block layer, it is much more efficient to send out
> larger zero pages as a part of a single bvec.
>
> Some examples of places in the kernel where this could be useful:
> - blkdev_issue_zero_pages()
> - iomap_dio_zero()
> - vmalloc.c:zero_iter()
> - rxperf_process_call()
> - fscrypt_zeroout_range_inline_crypt()
> - bch2_checksum_update()
> ...
>
> We already have huge_zero_folio that is allocated on demand, and it will be
> deallocated by the shrinker if there are no users of it left.
>
> At the moment, the huge_zero_folio infrastructure refcount is tied to the
> lifetime of the process that created it. This might not work for the bio
> layer, as completions can be async and the process that created the
> huge_zero_folio might no longer be alive.
>
> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate
> the huge_zero_folio via memblock, and it will never be freed.
>
> I have converted blkdev_issue_zero_pages() as an example as a part of
> this series.
>
> I will send patches to individual subsystems using the huge_zero_folio
> once this gets upstreamed.
>
> Looking forward to some feedback.
>
> [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
> [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
>
> Changes since v1:
> - Move from .bss to allocating it through memblock(David)
>
> Changes since RFC:
> - Added the config option based on the feedback from David.
> - Encode more info in the header to avoid dead code (Dave hansen
>   feedback)
> - The static part of huge_zero_folio in memory.c and the dynamic part
>   stays in huge_memory.c
> - Split the patches to make it easy for review.
>
> Pankaj Raghav (5):
>   mm: move huge_zero_page declaration from huge_mm.h to mm.h
>   huge_memory: add huge_zero_page_shrinker_(init|exit) function
>   mm: add static PMD zero page
>   mm: add largest_zero_folio() routine
>   block: use largest_zero_folio in __blkdev_issue_zero_pages()
>
>  block/blk-lib.c         | 17 +++++----
>  include/linux/huge_mm.h | 31 ----------------
>  include/linux/mm.h      | 81 +++++++++++++++++++++++++++++++++++++++++
>  mm/Kconfig              |  9 +++++
>  mm/huge_memory.c        | 62 +++++++++++++++++++++++--------
>  mm/memory.c             | 25 +++++++++++++
>  mm/mm_init.c            |  1 +
>  7 files changed, 173 insertions(+), 53 deletions(-)
>
>
> base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
> --
> 2.49.0



* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-09  8:03   ` Pankaj Raghav
  2025-07-09 15:55     ` Zi Yan
@ 2025-07-15 14:02     ` Lorenzo Stoakes
  2025-07-15 14:06       ` David Hildenbrand
  1 sibling, 1 reply; 40+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 14:02 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Zi Yan, Suren Baghdasaryan, Ryan Roberts, Baolin Wang,
	Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Wed, Jul 09, 2025 at 10:03:51AM +0200, Pankaj Raghav wrote:
> Hi Zi,
>
> >> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the huge_zero_folio via
> >> memblock, and it will never be freed.
> >
> > Do the above users want a PMD sized zero page or a 2MB zero page? Because on systems with non
> > 4KB base page size, e.g., ARM64 with 64KB base page, PMD size is different. ARM64 with 64KB base
> > page has 512MB PMD sized pages. Having STATIC_PMD_ZERO_PAGE means losing half GB memory. I am
> > not sure if it is acceptable.
> >
>
> That is a good point. My initial RFC patches allocated 2M instead of a PMD sized
> page.
>
> But later David wanted to reuse the memory we allocate here with huge_zero_folio. So
> if this config is enabled, we simply just use the same pointer for huge_zero_folio.
>
> Since that happened, I decided to go with PMD sized page.
>
> This config is still opt in and I would expect the users with 64k page size systems to not enable
> this.
>
> But to make sure we don't enable this for those architecture, I could do a per-arch opt in with
> something like this[1] that I did in my previous patch:
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 340e5468980e..c3a9d136ec0a 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -153,6 +153,7 @@ config X86
>  	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
>  	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
>  	select ARCH_WANTS_THP_SWAP		if X86_64
> +	select ARCH_HAS_STATIC_PMD_ZERO_PAGE	if X86_64
>  	select ARCH_HAS_PARANOID_L1D_FLUSH
>  	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>  	select BUILDTIME_TABLE_SORT
>
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 781be3240e21..fd1c51995029 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
>  config MM_ID
>  	def_bool n
>
> +config ARCH_HAS_STATIC_PMD_ZERO_PAGE
> +	def_bool n

Hm, is this correct? arm64 supports multiple page table sizes, so while the
architecture might 'support' it, this will vary based on page size, so actually we
don't care about the arch at all?

> +
> +config STATIC_PMD_ZERO_PAGE
> +	bool "Allocate a PMD page for zeroing"
> +	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE

Maybe need to just make this depend on !CONFIG_PAGE_SIZE_xx?

> <snip>
>
> Let me know your thoughts.
>
> [1] https://lore.kernel.org/linux-mm/20250612105100.59144-4-p.raghav@samsung.com/#Z31mm:Kconfig
> --
> Pankaj


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
                   ` (7 preceding siblings ...)
  2025-07-15 13:53 ` Pankaj Raghav
@ 2025-07-15 14:04 ` Lorenzo Stoakes
  2025-07-15 15:34 ` Lorenzo Stoakes
  9 siblings, 0 replies; 40+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 14:04 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

Note that this series does not apply to mm-new. Please rebase for the next
respin.


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-15 14:02     ` Lorenzo Stoakes
@ 2025-07-15 14:06       ` David Hildenbrand
  2025-07-15 14:12         ` Lorenzo Stoakes
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-07-15 14:06 UTC (permalink / raw)
  To: Lorenzo Stoakes, Pankaj Raghav
  Cc: Zi Yan, Suren Baghdasaryan, Ryan Roberts, Baolin Wang,
	Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
	Mike Rapoport, Dave Hansen, Michal Hocko, Andrew Morton,
	Thomas Gleixner, Nico Pache, Dev Jain, Liam R . Howlett,
	Jens Axboe, linux-kernel, willy, linux-mm, x86, linux-block,
	linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev, hch,
	Pankaj Raghav

On 15.07.25 16:02, Lorenzo Stoakes wrote:
> On Wed, Jul 09, 2025 at 10:03:51AM +0200, Pankaj Raghav wrote:
>> Hi Zi,
>>
>>>> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the huge_zero_folio via
>>>> memblock, and it will never be freed.
>>>
>>> Do the above users want a PMD sized zero page or a 2MB zero page? Because on systems with non
>>> 4KB base page size, e.g., ARM64 with 64KB base page, PMD size is different. ARM64 with 64KB base
>>> page has 512MB PMD sized pages. Having STATIC_PMD_ZERO_PAGE means losing half GB memory. I am
>>> not sure if it is acceptable.
>>>
>>
>> That is a good point. My initial RFC patches allocated 2M instead of a PMD sized
>> page.
>>
>> But later David wanted to reuse the memory we allocate here with huge_zero_folio. So
>> if this config is enabled, we simply just use the same pointer for huge_zero_folio.
>>
>> Since that happened, I decided to go with PMD sized page.
>>
>> This config is still opt in and I would expect the users with 64k page size systems to not enable
>> this.
>>
>> But to make sure we don't enable this for those architecture, I could do a per-arch opt in with
>> something like this[1] that I did in my previous patch:
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 340e5468980e..c3a9d136ec0a 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -153,6 +153,7 @@ config X86
>>   	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
>>   	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
>>   	select ARCH_WANTS_THP_SWAP		if X86_64
>> +	select ARCH_HAS_STATIC_PMD_ZERO_PAGE	if X86_64
>>   	select ARCH_HAS_PARANOID_L1D_FLUSH
>>   	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>>   	select BUILDTIME_TABLE_SORT
>>
>>
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 781be3240e21..fd1c51995029 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
>>   config MM_ID
>>   	def_bool n
>>
>> +config ARCH_HAS_STATIC_PMD_ZERO_PAGE
>> +	def_bool n
> 
> Hm is this correct? arm64 supports mutliple page tables sizes, so while the
> architecture might 'support' it, it will vary based on page size, so actually we
> don't care about arch at all?
> 
>> +
>> +config STATIC_PMD_ZERO_PAGE
>> +	bool "Allocate a PMD page for zeroing"
>> +	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE
> 
> Maybe need to just make this depend on !CONFIG_PAGE_SIZE_xx?

I think at some point we discussed "when does the PMD-sized zeropage 
make *any* sense on these weird arch configs" (512MiB on arm64 with 64K pages)

No idea who wants to waste half a gig on that at runtime either.

But yeah, we should let the arch code opt in whether it wants it or not 
(in particular, maybe only on arm64 with CONFIG_PAGE_SIZE_4K)

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h
  2025-07-07 14:23 ` [PATCH v2 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h Pankaj Raghav (Samsung)
@ 2025-07-15 14:08   ` Lorenzo Stoakes
  2025-07-16  7:47     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 40+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 14:08 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Mon, Jul 07, 2025 at 04:23:15PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Move the declaration associated with huge_zero_page from huge_mm.h to
> mm.h. This patch is in preparation for adding static PMD zero page as we
> will be reusing some of the huge_zero_page infrastructure.

Hmm this is really iffy.

The whole purpose of huge_mm.h is to handle huge page stuff, and now you're
moving it to a general header... not a fan of this - now we have _some_
huge stuff in mm.h and some stuff here.

Yes, this might be something we screwed up already, but that's not a reason
to perpetuate the mistake.

Surely you don't _need_ to do this, and it's just a question of fixing up
header includes, right?

Or is there some horrible cyclical header issue here?

Also, your commit message doesn't give any reason as to why you _need_ to do
this. For something like this, where you're doing something that at
face value seems to contradict the purpose of these headers, you need to
explain why.

>
> No functional changes.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  include/linux/huge_mm.h | 31 -------------------------------
>  include/linux/mm.h      | 34 ++++++++++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+), 31 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..3e887374892c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -478,22 +478,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>
>  vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
>
> -extern struct folio *huge_zero_folio;
> -extern unsigned long huge_zero_pfn;
> -
> -static inline bool is_huge_zero_folio(const struct folio *folio)
> -{
> -	return READ_ONCE(huge_zero_folio) == folio;
> -}
> -
> -static inline bool is_huge_zero_pmd(pmd_t pmd)
> -{
> -	return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
> -}
> -
> -struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
> -void mm_put_huge_zero_folio(struct mm_struct *mm);
> -
>  static inline bool thp_migration_supported(void)
>  {
>  	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
> @@ -631,21 +615,6 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>  	return 0;
>  }
>
> -static inline bool is_huge_zero_folio(const struct folio *folio)
> -{
> -	return false;
> -}
> -
> -static inline bool is_huge_zero_pmd(pmd_t pmd)
> -{
> -	return false;
> -}
> -
> -static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
> -{
> -	return;
> -}
> -
>  static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
>  	unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
>  {
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0ef2ba0c667a..c8fbeaacf896 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4018,6 +4018,40 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +extern struct folio *huge_zero_folio;
> +extern unsigned long huge_zero_pfn;
> +
> +static inline bool is_huge_zero_folio(const struct folio *folio)
> +{
> +	return READ_ONCE(huge_zero_folio) == folio;
> +}
> +
> +static inline bool is_huge_zero_pmd(pmd_t pmd)
> +{
> +	return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
> +}
> +
> +struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
> +void mm_put_huge_zero_folio(struct mm_struct *mm);
> +
> +#else
> +static inline bool is_huge_zero_folio(const struct folio *folio)
> +{
> +	return false;
> +}
> +
> +static inline bool is_huge_zero_pmd(pmd_t pmd)
> +{
> +	return false;
> +}
> +
> +static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
> +{
> +	return;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  #if MAX_NUMNODES > 1
>  void __init setup_nr_node_ids(void);
>  #else
> --
> 2.49.0
>


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-15 14:06       ` David Hildenbrand
@ 2025-07-15 14:12         ` Lorenzo Stoakes
  2025-07-15 14:16           ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 14:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Raghav, Zi Yan, Suren Baghdasaryan, Ryan Roberts,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Vlastimil Babka, Mike Rapoport, Dave Hansen, Michal Hocko,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Tue, Jul 15, 2025 at 04:06:29PM +0200, David Hildenbrand wrote:
> I think at some point we discussed "when does the PMD-sized zeropage make
> *any* sense on these weird arch configs" (512MiB on arm64 with 64K pages)
>
> No idea who wants to waste half a gig on that at runtime either.

Yeah this is a problem we _really_ need to solve. But obviously somewhat out of
scope here.

>
> But yeah, we should let the arch code opt in whether it wants it or not (in
> particular, maybe only on arm64 with CONFIG_PAGE_SIZE_4K)

I don't think this should be an ARCH_HAS_xxx.

Because that's saying 'this architecture has X', and this isn't architecture
scope.

I suppose PMDs may vary in terms of how huge they are regardless of page
table size actually.

So maybe the best solution is a semantic one - just rename this to
ARCH_WANT_STATIC_PMD_ZERO_PAGE

And then put the page size selector in the arch code.

For example in arm64 we have:

	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)

So doing something similar here like:

	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES

Would do the job and sort everything out.

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-07 22:38 ` Andrew Morton
  2025-07-09  9:59   ` Pankaj Raghav
@ 2025-07-15 14:15   ` David Hildenbrand
  1 sibling, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2025-07-15 14:15 UTC (permalink / raw)
  To: Andrew Morton, Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, Lorenzo Stoakes,
	Thomas Gleixner, Nico Pache, Dev Jain, Liam R . Howlett,
	Jens Axboe, linux-kernel, willy, linux-mm, x86, linux-block,
	linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev, hch,
	Pankaj Raghav

On 08.07.25 00:38, Andrew Morton wrote:
> On Mon,  7 Jul 2025 16:23:14 +0200 "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> wrote:
> 
>> There are many places in the kernel where we need to zeroout larger
>> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
>> is limited by PAGE_SIZE.
>>
>> This concern was raised during the review of adding Large Block Size support
>> to XFS[1][2].
>>
>> This is especially annoying in block devices and filesystems where we
>> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
>> bvec support in block layer, it is much more efficient to send out
>> larger zero pages as a part of a single bvec.
>>
>> Some examples of places in the kernel where this could be useful:
>> - blkdev_issue_zero_pages()
>> - iomap_dio_zero()
>> - vmalloc.c:zero_iter()
>> - rxperf_process_call()
>> - fscrypt_zeroout_range_inline_crypt()
>> - bch2_checksum_update()
>> ...
>>
>> We already have huge_zero_folio that is allocated on demand, and it will be
>> deallocated by the shrinker if there are no users of it left.
>>
>> At the moment, the huge_zero_folio infrastructure refcount is tied to the
>> lifetime of the process that created it. This might not work for the bio
>> layer, as completions can be async and the process that created the
>> huge_zero_folio might no longer be alive.
> 
> Can we change that?  Alter the refcounting model so that dropping the
> final reference at interrupt time works as expected?

I would hope that we can drop that whole shrinking+freeing mechanism at 
some point, and simply always keep it around once allocated.

Any unprivileged process can keep the huge zero folio mapped and, 
therefore, around, until that process is killed ...

But I assume some people might still have an opinion on the shrinker, so 
for the time being having a second static model might be less controversial.

(I don't think we should be refcounting the huge zero folio in the long term)

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 4/5] mm: add largest_zero_folio() routine
  2025-07-07 14:23 ` [PATCH v2 4/5] mm: add largest_zero_folio() routine Pankaj Raghav (Samsung)
@ 2025-07-15 14:16   ` David Hildenbrand
  2025-07-15 14:46     ` David Hildenbrand
  2025-07-15 16:13   ` Lorenzo Stoakes
  1 sibling, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-07-15 14:16 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Suren Baghdasaryan, Ryan Roberts,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Vlastimil Babka, Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, hch, Pankaj Raghav

On 07.07.25 16:23, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Add largest_zero_folio() routine so that huge_zero_folio can be
> used without the need to pass any mm struct. This will return ZERO_PAGE
> folio if CONFIG_STATIC_PMD_ZERO_PAGE is disabled or if we failed to
> allocate a PMD page from memblock.
> 
> This routine can also be called even if THP is disabled.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   include/linux/mm.h | 28 ++++++++++++++++++++++++++--
>   1 file changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 428fe6d36b3c..d5543cf7b8e9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4018,17 +4018,41 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>   
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>   
> +extern struct folio *huge_zero_folio;
> +extern unsigned long huge_zero_pfn;

No need for "extern".

> +
>   #ifdef CONFIG_STATIC_PMD_ZERO_PAGE
>   extern void __init static_pmd_zero_init(void);
> +
> +/*
> + * largest_zero_folio - Get the largest zero size folio available
> + *
> + * This function will return a PMD sized zero folio if CONFIG_STATIC_PMD_ZERO_PAGE
> + * is enabled. Otherwise, a ZERO_PAGE folio is returned.
> + *
> + * Deduce the size of the folio with folio_size instead of assuming the
> + * folio size.
> + */
> +static inline struct folio *largest_zero_folio(void)
> +{
> +	if(!huge_zero_folio)

Nit: if (!huge_zero_folio), but see below

I assume this check is only in place to handle whether static allocation 
failed, correct?

> +		return page_folio(ZERO_PAGE(0));
> +
> +	return READ_ONCE(huge_zero_folio);

READ_ONCE should not be required if it cannot get freed.

> +}
> +
>   #else
>   static inline void __init static_pmd_zero_init(void)
>   {
>   	return;
>   }
> +
> +static inline struct folio *largest_zero_folio(void)
> +{
> +	return page_folio(ZERO_PAGE(0));
> +}
>   #endif

Could we do:

static inline struct folio *largest_zero_folio(void)
{
	if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE) &&
  	    likely(huge_zero_folio))
		return huge_zero_folio;
	return page_folio(ZERO_PAGE(0));
}
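For context, a hypothetical block-layer caller of such a helper might look like this (a sketch only; bio_add_zero_pages() and its loop are illustrative, not part of the series):

```c
/*
 * Illustrative sketch (not from the series): zero `len` bytes of a bio
 * using the largest zero folio available. With the static PMD zero page
 * this adds one large segment per PMD_SIZE chunk; with only ZERO_PAGE it
 * degrades to PAGE_SIZE segments.
 */
static void bio_add_zero_pages(struct bio *bio, size_t len)
{
	struct folio *zf = largest_zero_folio();
	size_t chunk = folio_size(zf);

	while (len) {
		size_t this = min(len, chunk);

		__bio_add_page(bio, folio_page(zf, 0), this, 0);
		len -= this;
	}
}
```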
	

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-15 14:12         ` Lorenzo Stoakes
@ 2025-07-15 14:16           ` David Hildenbrand
  2025-07-15 15:25             ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-07-15 14:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Pankaj Raghav, Zi Yan, Suren Baghdasaryan, Ryan Roberts,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Vlastimil Babka, Mike Rapoport, Dave Hansen, Michal Hocko,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On 15.07.25 16:12, Lorenzo Stoakes wrote:
> On Tue, Jul 15, 2025 at 04:06:29PM +0200, David Hildenbrand wrote:
>> I think at some point we discussed "when does the PMD-sized zeropage make
>> *any* sense on these weird arch configs" (512MiB on arm64 64bit)
>>
>> No idea who wants to waste half a gig on that at runtime either.
> 
> Yeah this is a problem we _really_ need to solve. But obviously somewhat out of
> scope here.
> 
>>
>> But yeah, we should let the arch code opt in whether it wants it or not (in
>> particular, maybe only on arm64 with CONFIG_PAGE_SIZE_4K)
> 
> I don't think this should be an ARCH_HAS_xxx.
> 
> Because that's saying 'this architecture has X', this isn't architecture
> scope.
> 
> I suppose PMDs may vary in terms of how huge they are regardless of page
> table size actually.
> 
> So maybe the best solution is a semantic one - just rename this to
> ARCH_WANT_STATIC_PMD_ZERO_PAGE
> 
> And then put the page size selector in the arch code.
> 
> For example in arm64 we have:
> 
> 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> 
> So doing something similar here like:
> 
> 	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES
> 
> Would do the job and sort everything out.

Yes.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function
  2025-07-07 14:23 ` [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function Pankaj Raghav (Samsung)
@ 2025-07-15 14:18   ` David Hildenbrand
  2025-07-16  8:01     ` Pankaj Raghav (Samsung)
  2025-07-15 14:29   ` Lorenzo Stoakes
  1 sibling, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-07-15 14:18 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Suren Baghdasaryan, Ryan Roberts,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Vlastimil Babka, Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, hch, Pankaj Raghav

On 07.07.25 16:23, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Add huge_zero_page_shrinker_init() and huge_zero_page_shrinker_exit().
> As shrinker will not be needed when static PMD zero page is enabled,
> these two functions can be a no-op.
> 
> This is a preparation patch for static PMD zero page. No functional
> changes.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   mm/huge_memory.c | 38 +++++++++++++++++++++++++++-----------
>   1 file changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d3e66136e41a..101b67ab2eb6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -289,6 +289,24 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
>   }
>   
>   static struct shrinker *huge_zero_page_shrinker;
> +static int huge_zero_page_shrinker_init(void)
> +{
> +	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
> +	if (!huge_zero_page_shrinker)
> +		return -ENOMEM;
> +
> +	huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
> +	huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
> +	shrinker_register(huge_zero_page_shrinker);
> +	return 0;
> +}
> +
> +static void huge_zero_page_shrinker_exit(void)
> +{
> +	shrinker_free(huge_zero_page_shrinker);
> +	return;
> +}

While at it, we should rename most of that to "huge_zero_folio" I assume.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 3/5] mm: add static PMD zero page
  2025-07-07 14:23 ` [PATCH v2 3/5] mm: add static PMD zero page Pankaj Raghav (Samsung)
@ 2025-07-15 14:21   ` David Hildenbrand
  2025-07-15 14:53     ` David Hildenbrand
  2025-07-15 15:26   ` Lorenzo Stoakes
  1 sibling, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-07-15 14:21 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Suren Baghdasaryan, Ryan Roberts,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Vlastimil Babka, Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, hch, Pankaj Raghav

On 07.07.25 16:23, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time using
> ZERO_PAGE is limited to PAGE_SIZE.
> 
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in the block layer, it is much more efficient to send out
> larger zero pages as part of a single bvec.
> 
> This concern was raised during the review of adding LBS support to
> XFS[1][2].
> 
> Usually huge_zero_folio is allocated on demand, and it will be
> deallocated by the shrinker if there are no users of it left. At the
> moment, the huge_zero_folio infrastructure's refcount is tied to the
> lifetime of the process that created it. This might not work for the
> bio layer as the completions can be async and the process that created
> the huge_zero_folio might no longer be alive.

Of course, what we could do is indicating that there is any untracked 
reference to the huge zero folio, and then simply refuse to free it for 
all eternity.

Essentially, any non-mm reference -> un-shrinkable.

We'd still be allocating the huge zero folio dynamically. We could try 
allocating it on first usage either from memblock, or from the buddy if
already around.

Then, we'd only need a config option to allow for that to happen.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function
  2025-07-07 14:23 ` [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function Pankaj Raghav (Samsung)
  2025-07-15 14:18   ` David Hildenbrand
@ 2025-07-15 14:29   ` Lorenzo Stoakes
  2025-07-16  8:08     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 40+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 14:29 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

Nit on subject, function -> functions.

On Mon, Jul 07, 2025 at 04:23:16PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Add huge_zero_page_shrinker_init() and huge_zero_page_shrinker_exit().
> As shrinker will not be needed when static PMD zero page is enabled,
> these two functions can be a no-op.
>
> This is a preparation patch for static PMD zero page. No functional
> changes.

This is nitty stuff, but I think this is a little unclear, maybe something
like:

	We will soon be determining whether to use a shrinker depending on
	whether a static PMD zero page is available, therefore abstract out
	shrink initialisation and teardown such that we can more easily
	handle both the shrinker and static PMD zero page cases.

>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

Other than nits, this LGTM, so with those addressed:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c | 38 +++++++++++++++++++++++++++-----------
>  1 file changed, 27 insertions(+), 11 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d3e66136e41a..101b67ab2eb6 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -289,6 +289,24 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
>  }
>
>  static struct shrinker *huge_zero_page_shrinker;
> +static int huge_zero_page_shrinker_init(void)
> +{
> +	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
> +	if (!huge_zero_page_shrinker)
> +		return -ENOMEM;
> +
> +	huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
> +	huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
> +	shrinker_register(huge_zero_page_shrinker);
> +	return 0;
> +}
> +
> +static void huge_zero_page_shrinker_exit(void)
> +{
> +	shrinker_free(huge_zero_page_shrinker);
> +	return;
> +}
> +
>
>  #ifdef CONFIG_SYSFS
>  static ssize_t enabled_show(struct kobject *kobj,
> @@ -850,33 +868,31 @@ static inline void hugepage_exit_sysfs(struct kobject *hugepage_kobj)
>
>  static int __init thp_shrinker_init(void)
>  {
> -	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
> -	if (!huge_zero_page_shrinker)
> -		return -ENOMEM;
> +	int ret = 0;

Kinda no point in initialising to zero, unless...

>
>  	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
>  						 SHRINKER_MEMCG_AWARE |
>  						 SHRINKER_NONSLAB,
>  						 "thp-deferred_split");
> -	if (!deferred_split_shrinker) {
> -		shrinker_free(huge_zero_page_shrinker);
> +	if (!deferred_split_shrinker)
>  		return -ENOMEM;
> -	}
> -
> -	huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
> -	huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
> -	shrinker_register(huge_zero_page_shrinker);
>
>  	deferred_split_shrinker->count_objects = deferred_split_count;
>  	deferred_split_shrinker->scan_objects = deferred_split_scan;
>  	shrinker_register(deferred_split_shrinker);
>
> +	ret = huge_zero_page_shrinker_init();
> +	if (ret) {
> +		shrinker_free(deferred_split_shrinker);
> +		return ret;
> +	}

... you change this to:

	if (ret)
		shrinker_free(deferred_split_shrinker);

	return ret;

But it's not a big deal. Maybe I'd rename ret -> err if you keep things as
they are (but don't init to 0).
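Put together, thp_shrinker_init() with these suggestions applied would read roughly as follows (a sketch combining the quoted patch with the review comments above):

```c
static int __init thp_shrinker_init(void)
{
	int ret;

	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
						 SHRINKER_MEMCG_AWARE |
						 SHRINKER_NONSLAB,
						 "thp-deferred_split");
	if (!deferred_split_shrinker)
		return -ENOMEM;

	deferred_split_shrinker->count_objects = deferred_split_count;
	deferred_split_shrinker->scan_objects = deferred_split_scan;
	shrinker_register(deferred_split_shrinker);

	ret = huge_zero_page_shrinker_init();
	if (ret)
		shrinker_free(deferred_split_shrinker);

	return ret;
}
```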

> +
>  	return 0;
>  }
>
>  static void __init thp_shrinker_exit(void)
>  {
> -	shrinker_free(huge_zero_page_shrinker);
> +	huge_zero_page_shrinker_exit();
>  	shrinker_free(deferred_split_shrinker);
>  }
>
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 4/5] mm: add largest_zero_folio() routine
  2025-07-15 14:16   ` David Hildenbrand
@ 2025-07-15 14:46     ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2025-07-15 14:46 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Suren Baghdasaryan, Ryan Roberts,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Vlastimil Babka, Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, hch, Pankaj Raghav

On 15.07.25 16:16, David Hildenbrand wrote:
> On 07.07.25 16:23, Pankaj Raghav (Samsung) wrote:
>> From: Pankaj Raghav <p.raghav@samsung.com>
>>
>> Add largest_zero_folio() routine so that huge_zero_folio can be
>> used without the need to pass any mm struct. This will return ZERO_PAGE
>> folio if CONFIG_STATIC_PMD_ZERO_PAGE is disabled or if we failed to
>> allocate a PMD page from memblock.
>>
>> This routine can also be called even if THP is disabled.
>>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
>>    include/linux/mm.h | 28 ++++++++++++++++++++++++++--
>>    1 file changed, 26 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 428fe6d36b3c..d5543cf7b8e9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4018,17 +4018,41 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>>    
>>    #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>>    
>> +extern struct folio *huge_zero_folio;
>> +extern unsigned long huge_zero_pfn;
> 
> No need for "extern".

Scratch that, was confused with functions ...

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 3/5] mm: add static PMD zero page
  2025-07-15 14:21   ` David Hildenbrand
@ 2025-07-15 14:53     ` David Hildenbrand
  2025-07-17 10:34       ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-07-15 14:53 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Suren Baghdasaryan, Ryan Roberts,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Vlastimil Babka, Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko,
	Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
	Dev Jain, Liam R . Howlett, Jens Axboe
  Cc: linux-kernel, willy, linux-mm, x86, linux-block, linux-fsdevel,
	Darrick J . Wong, mcgrof, gost.dev, hch, Pankaj Raghav

On 15.07.25 16:21, David Hildenbrand wrote:
> On 07.07.25 16:23, Pankaj Raghav (Samsung) wrote:
>> From: Pankaj Raghav <p.raghav@samsung.com>
>>
>> There are many places in the kernel where we need to zero out larger
>> chunks, but the maximum segment we can zero out at a time using
>> ZERO_PAGE is limited to PAGE_SIZE.
>>
>> This is especially annoying in block devices and filesystems where we
>> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
>> bvec support in the block layer, it is much more efficient to send out
>> larger zero pages as part of a single bvec.
>>
>> This concern was raised during the review of adding LBS support to
>> XFS[1][2].
>>
>> Usually huge_zero_folio is allocated on demand, and it will be
>> deallocated by the shrinker if there are no users of it left. At the
>> moment, the huge_zero_folio infrastructure's refcount is tied to the
>> lifetime of the process that created it. This might not work for the
>> bio layer as the completions can be async and the process that created
>> the huge_zero_folio might no longer be alive.
> 
> Of course, what we could do is indicating that there is any untracked
> reference to the huge zero folio, and then simply refuse to free it for
> all eternity.
> 
> Essentially, any non-mm reference -> un-shrinkable.
> 
> We'd still be allocating the huge zero folio dynamically. We could try
> allocating it on first usage either from memblock, or from the buddy if
> already around.
> 
> Then, we'd only need a config option to allow for that to happen.

Something incomplete and very hacky just to give an idea. It would try allocating
it if there is actual code running that would need it, and then have it
stick around forever.


diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e0a27f80f390d..357e29e98d8d2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -481,6 +481,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
  
  extern struct folio *huge_zero_folio;
  extern unsigned long huge_zero_pfn;
+extern atomic_t huge_zero_folio_is_static;
  
  static inline bool is_huge_zero_folio(const struct folio *folio)
  {
@@ -499,6 +500,16 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
  
  struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
  void mm_put_huge_zero_folio(struct mm_struct *mm);
+struct folio *__get_static_huge_zero_folio(void);
+
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+       if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
+               return NULL;
+       if (likely(atomic_read(&huge_zero_folio_is_static)))
+               return huge_zero_folio;
+       return __get_static_huge_zero_folio();
+}
  
  static inline bool thp_migration_supported(void)
  {
@@ -509,7 +520,6 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
                            pmd_t *pmd, bool freeze);
  bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
                            pmd_t *pmdp, struct folio *folio);
-
  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
  
  static inline bool folio_test_pmd_mappable(struct folio *folio)
@@ -690,6 +700,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
  {
         return 0;
  }
+
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+       return NULL;
+}
  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
  
  static inline int split_folio_to_list_to_order(struct folio *folio,
@@ -703,4 +718,14 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
         return split_folio_to_list_to_order(folio, NULL, new_order);
  }
  
+static inline struct folio *largest_zero_folio(void)
+{
+       struct folio *folio;
+
+       folio = get_static_huge_zero_folio();
+       if (folio)
+               return folio;
+       return page_folio(ZERO_PAGE(0));
+}
+
  #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 31b5c4e61a574..eb49c69f9c8e2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -77,6 +77,7 @@ static bool split_underused_thp = true;
  static atomic_t huge_zero_refcount;
  struct folio *huge_zero_folio __read_mostly;
  unsigned long huge_zero_pfn __read_mostly = ~0UL;
+atomic_t huge_zero_folio_is_static __read_mostly;
  unsigned long huge_anon_orders_always __read_mostly;
  unsigned long huge_anon_orders_madvise __read_mostly;
  unsigned long huge_anon_orders_inherit __read_mostly;
@@ -266,6 +267,25 @@ void mm_put_huge_zero_folio(struct mm_struct *mm)
                 put_huge_zero_page();
  }
  
+#ifdef CONFIG_STATIC_HUGE_ZERO_FOLIO
+struct folio *__get_static_huge_zero_folio(void)
+{
+       /*
+        * Our raised reference will prevent the shrinker from ever having
+        * success -> static.
+        */
+       if (atomic_read(&huge_zero_folio_is_static))
+               return huge_zero_folio;
+       /* TODO: memblock allocation if buddy is not up yet? Or Reject that earlier. */
+       if (!get_huge_zero_page())
+               return NULL;
+       if (atomic_cmpxchg(&huge_zero_folio_is_static, 0, 1) != 0)
+               put_huge_zero_page();
+       return huge_zero_folio;
+
+}
+#endif /* CONFIG_STATIC_HUGE_ZERO_FOLIO */
+
  static unsigned long shrink_huge_zero_page_count(struct shrinker *shrink,
                                         struct shrink_control *sc)
  {


-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-15 14:16           ` David Hildenbrand
@ 2025-07-15 15:25             ` Pankaj Raghav (Samsung)
  2025-07-15 15:27               ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-15 15:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Zi Yan, Suren Baghdasaryan, Ryan Roberts,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Vlastimil Babka, Mike Rapoport, Dave Hansen, Michal Hocko,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Tue, Jul 15, 2025 at 04:16:44PM +0200, David Hildenbrand wrote:
> On 15.07.25 16:12, Lorenzo Stoakes wrote:
> > On Tue, Jul 15, 2025 at 04:06:29PM +0200, David Hildenbrand wrote:
> > > I think at some point we discussed "when does the PMD-sized zeropage make
> > > *any* sense on these weird arch configs" (512MiB on arm64 64bit)
> > > 
> > > No idea who wants to waste half a gig on that at runtime either.
> > 
> > Yeah this is a problem we _really_ need to solve. But obviously somewhat out of
> > scope here.
> > 
> > > 
> > > But yeah, we should let the arch code opt in whether it wants it or not (in
> > > particular, maybe only on arm64 with CONFIG_PAGE_SIZE_4K)
> > 
> > I don't think this should be an ARCH_HAS_xxx.
> > 
> > Because that's saying 'this architecture has X', this isn't architecture
> > scope.
> > 
> > I suppose PMDs may vary in terms of how huge they are regardless of page
> > table size actually.
> > 
> > So maybe the best solution is a semantic one - just rename this to
> > ARCH_WANT_STATIC_PMD_ZERO_PAGE
> > 
> > And then put the page size selector in the arch code.
> > 
> > For example in arm64 we have:
> > 
> > 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> > 
> > So doing something similar here like:
> > 
> > 	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES
> > 
> > Would do the job and sort everything out.
> 
> Yes.

Actually I had something similar in one of my earlier versions[1] where we
can opt in from the arch-specific Kconfig with *WANT* instead of *HAS*.

For starters, I will enable this only from x86. We can probably extend
this once we get the base patches up.

[1] https://lore.kernel.org/linux-mm/20250522090243.758943-2-p.raghav@samsung.com/
--
Pankaj

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 3/5] mm: add static PMD zero page
  2025-07-07 14:23 ` [PATCH v2 3/5] mm: add static PMD zero page Pankaj Raghav (Samsung)
  2025-07-15 14:21   ` David Hildenbrand
@ 2025-07-15 15:26   ` Lorenzo Stoakes
  1 sibling, 0 replies; 40+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 15:26 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Mon, Jul 07, 2025 at 04:23:17PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time using
> ZERO_PAGE is limited to PAGE_SIZE.
>
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in the block layer, it is much more efficient to send out
> larger zero pages as part of a single bvec.
>
> This concern was raised during the review of adding LBS support to
> XFS[1][2].

Nit, but maybe worth spelling out LBS = (presumably :P) Large Block
Support.

>
> Usually huge_zero_folio is allocated on demand, and it will be
> deallocated by the shrinker if there are no users of it left. At the
> moment, the huge_zero_folio infrastructure's refcount is tied to the
> lifetime of the process that created it. This might not work for the
> bio layer as the completions can be async and the process that created
> the huge_zero_folio might no longer be alive.
>
> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate
> the huge_zero_folio, and it will never be freed. This makes using the
> huge_zero_folio without having to pass any mm struct and does not tie
> the lifetime of the zero folio to anything.

Can we in that case wrap the refcount logic in
#ifndef CONFIG_STATIC_PMD_ZERO_PAGE?

And surely we should additionally update mm_get_huge_zero_folio() etc. to
account for this?

>
> memblock is used to allocated this PMD zero page during early boot.
>
> If STATIC_PMD_ZERO_PAGE config option is enabled, then
> mm_get_huge_zero_folio() will simply return this page instead of
> dynamically allocating a new PMD page.
>
> As STATIC_PMD_ZERO_PAGE does not depend on THP, declare huge_zero_folio
> and huge_zero_pfn outside the THP config.
>
> [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
> [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  include/linux/mm.h | 25 ++++++++++++++++++++++++-
>  mm/Kconfig         |  9 +++++++++
>  mm/huge_memory.c   | 24 ++++++++++++++++++++----
>  mm/memory.c        | 25 +++++++++++++++++++++++++
>  mm/mm_init.c       |  1 +
>  5 files changed, 79 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c8fbeaacf896..428fe6d36b3c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4018,10 +4018,19 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
> +extern void __init static_pmd_zero_init(void);

We don't use extern for this kind of function declaration, and actually try
to remove extern's as we touch header decls that have them as we go.

> +#else
> +static inline void __init static_pmd_zero_init(void)
> +{
> +	return;

This return is redundant.

> +}
> +#endif
> +
>  extern struct folio *huge_zero_folio;
>  extern unsigned long huge_zero_pfn;
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE

OK I guess the point here is to make huge_zero_folio, huge_zero_pfn
available regardless of whether THP is enabled.

Again, I really think this should live in huge_mm.h and any place that
doesn't include it needs to like, just include it :)

I really don't want these randomly placed in mm.h if we can avoid it.

Can we also add a comment saying 'this is used for both the static PMD zero page and THP'?

>  static inline bool is_huge_zero_folio(const struct folio *folio)
>  {
>  	return READ_ONCE(huge_zero_folio) == folio;
> @@ -4032,9 +4041,23 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
>  	return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
>  }
>
> +#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
> +static inline struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
> +{
> +	return READ_ONCE(huge_zero_folio);
> +}
> +
> +static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
> +{
> +	return;

This return is redundant.

> +}
> +
> +#else
>  struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
>  void mm_put_huge_zero_folio(struct mm_struct *mm);
>
> +#endif /* CONFIG_STATIC_PMD_ZERO_PAGE */
> +
>  #else
>  static inline bool is_huge_zero_folio(const struct folio *folio)
>  {
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 781be3240e21..89d5971cf180 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -826,6 +826,15 @@ config ARCH_WANTS_THP_SWAP
>  config MM_ID
>  	def_bool n
>
> +config STATIC_PMD_ZERO_PAGE
> +	bool "Allocate a PMD page for zeroing"
> +	help
> +	  Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
> +	  on demand and deallocated when not in use. This option will
> +	  allocate a PMD sized zero page during early boot and huge_zero_folio will
> +	  use it instead of allocating one dynamically.
> +	  Not suitable for memory constrained systems.

Would have to be pretty constrained to not spare 2 MiB :P but I accept of
course these devices do exist...

> +
>  menuconfig TRANSPARENT_HUGEPAGE
>  	bool "Transparent Hugepage Support"
>  	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 101b67ab2eb6..c12ca7134e88 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -75,9 +75,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  					 struct shrink_control *sc);
>  static bool split_underused_thp = true;
>
> -static atomic_t huge_zero_refcount;
> -struct folio *huge_zero_folio __read_mostly;
> -unsigned long huge_zero_pfn __read_mostly = ~0UL;

Ugh yeah this is a mess.

I see you're moving this to mm/memory.c because we only compile
huge_memory.c if THP is enabled.

Are there any circumstances where it makes sense to want to use static PMD
page and NOT have THP enabled?

It'd just be simpler if we could have CONFIG_STATIC_PMD_ZERO_PAGE depend on
CONFIG_TRANSPARENT_HUGEPAGE.

Why can't we do that?

>  unsigned long huge_anon_orders_always __read_mostly;
>  unsigned long huge_anon_orders_madvise __read_mostly;
>  unsigned long huge_anon_orders_inherit __read_mostly;
> @@ -208,6 +205,23 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  	return orders;
>  }
>
> +#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
> +static int huge_zero_page_shrinker_init(void)
> +{
> +	return 0;
> +}
> +
> +static void huge_zero_page_shrinker_exit(void)
> +{
> +	return;

You seem to love putting return statements in void functions like this :P
you don't need to, please remove.

> +}
> +#else
> +
> +static struct shrinker *huge_zero_page_shrinker;
> +static atomic_t huge_zero_refcount;
> +struct folio *huge_zero_folio __read_mostly;
> +unsigned long huge_zero_pfn __read_mostly = ~0UL;
> +
>  static bool get_huge_zero_page(void)
>  {
>  	struct folio *zero_folio;
> @@ -288,7 +302,6 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
>  	return 0;
>  }
>
> -static struct shrinker *huge_zero_page_shrinker;
>  static int huge_zero_page_shrinker_init(void)
>  {
>  	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
> @@ -307,6 +320,7 @@ static void huge_zero_page_shrinker_exit(void)
>  	return;
>  }
>
> +#endif
>
>  #ifdef CONFIG_SYSFS
>  static ssize_t enabled_show(struct kobject *kobj,
> @@ -2843,6 +2857,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>  	pte_t *pte;
>  	int i;
>
> +	// FIXME: can this be called with static zero page?

This shouldn't be in upstream code, it's up to you to determine this. And
please don't use //.

> +	VM_BUG_ON(IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE));

Also [VM_]BUG_ON() is _entirely_ deprecated. This should be
VM_WARN_ON_ONCE().

>  	/*
>  	 * Leave pmd empty until pte is filled note that it is fine to delay
>  	 * notification until mmu_notifier_invalidate_range_end() as we are
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..42c4c31ad14c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,7 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/memblock.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -159,6 +160,30 @@ static int __init init_zero_pfn(void)
>  }
>  early_initcall(init_zero_pfn);
>
> +#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
> +struct folio *huge_zero_folio __read_mostly = NULL;
> +unsigned long huge_zero_pfn __read_mostly = ~0UL;
> +
> +void __init static_pmd_zero_init(void)
> +{
> +	void *alloc = memblock_alloc(PMD_SIZE, PAGE_SIZE);
> +
> +	if (!alloc)
> +		return;

Ummm... so we're fine with just having huge_zero_folio, huge_zero_pfn
uninitialised if the allocation fails?

This seems to me to be a rare case where we should panic the kernel?
Because everything's broken now.

There's actually a memblock_alloc_or_panic() function you could use for
this.
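Something like this, perhaps (untested, just to sketch the shape):

	void __init static_pmd_zero_init(void)
	{
		/* Panics on allocation failure, so the folio can never be left NULL. */
		void *alloc = memblock_alloc_or_panic(PMD_SIZE, PAGE_SIZE);

		huge_zero_folio = virt_to_folio(alloc);
		huge_zero_pfn = page_to_pfn(virt_to_page(alloc));
		...
	}

That would let you drop the !alloc check entirely.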

> +
> +	huge_zero_folio = virt_to_folio(alloc);
> +	huge_zero_pfn = page_to_pfn(virt_to_page(alloc));
> +
> +	__folio_set_head(huge_zero_folio);
> +	prep_compound_head((struct page *)huge_zero_folio, PMD_ORDER);

What will the reference count be on the folio here? Might something
accidentally put this somewhere if we're not careful?


> +	/* Ensure zero folio won't have large_rmappable flag set. */
> +	folio_clear_large_rmappable(huge_zero_folio);

Why? What would set it?

I'm a little concerned as to whether this folio is correctly initialised,
need to be careful here.

> +	folio_zero_range(huge_zero_folio, 0, PMD_SIZE);
> +
> +	return;

You don't need to put returns at the end of void functions.

> +}
> +#endif
> +
>  void mm_trace_rss_stat(struct mm_struct *mm, int member)
>  {
>  	trace_rss_stat(mm, member);
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index f2944748f526..56d7ec372af1 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -2765,6 +2765,7 @@ void __init mm_core_init(void)
>  	 */
>  	kho_memory_init();
>
> +	static_pmd_zero_init();
>  	memblock_free_all();
>  	mem_init();
>  	kmem_cache_init();
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-15 15:25             ` Pankaj Raghav (Samsung)
@ 2025-07-15 15:27               ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2025-07-15 15:27 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Lorenzo Stoakes, Zi Yan, Suren Baghdasaryan, Ryan Roberts,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Vlastimil Babka, Mike Rapoport, Dave Hansen, Michal Hocko,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On 15.07.25 17:25, Pankaj Raghav (Samsung) wrote:
> On Tue, Jul 15, 2025 at 04:16:44PM +0200, David Hildenbrand wrote:
>> On 15.07.25 16:12, Lorenzo Stoakes wrote:
>>> On Tue, Jul 15, 2025 at 04:06:29PM +0200, David Hildenbrand wrote:
>>>> I think at some point we discussed "when does the PMD-sized zeropage make
>>>> *any* sense on these weird arch configs" (512MiB on arm64 64bit)
>>>>
>>>> No idea who wants to waste half a gig on that at runtime either.
>>>
>>> Yeah this is a problem we _really_ need to solve. But obviously somewhat out of
>>> scope here.
>>>
>>>>
>>>> But yeah, we should let the arch code opt in whether it wants it or not (in
>>>> particular, maybe only on arm64 with CONFIG_PAGE_SIZE_4K)
>>>
>>> I don't think this should be an ARCH_HAS_xxx.
>>>
>>> Because that's saying 'this architecture has X', this isn't architecture
>>> scope.
>>>
>>> I suppose PMDs may vary in terms of how huge they are regardless of page
>>> table size actually.
>>>
>>> So maybe the best solution is a semantic one - just rename this to
>>> ARCH_WANT_STATIC_PMD_ZERO_PAGE
>>>
>>> And then put the page size selector in the arch code.
>>>
>>> For example in arm64 we have:
>>>
>>> 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>>>
>>> So doing something similar here like:
>>>
>>> 	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES
>>>
>>> Would do the job and sort everything out.
>>
>> Yes.
> 
> Actually I had something similar in one of my earlier versions[1] where we
> can opt in from the arch-specific Kconfig with *WANT* instead of *HAS*.
> 
> For starters, I will enable this only from x86. We can probably extend
> this once we get the base patches up.

Makes sense.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
                   ` (8 preceding siblings ...)
  2025-07-15 14:04 ` Lorenzo Stoakes
@ 2025-07-15 15:34 ` Lorenzo Stoakes
  2025-07-17 10:43   ` Pankaj Raghav (Samsung)
  9 siblings, 1 reply; 40+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 15:34 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

Pankaj,

There seems to be quite a lot to work on here, and it seems rather speculative,
so can we respin as an RFC please?

Thanks! :)

On Mon, Jul 07, 2025 at 04:23:14PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
> is limited by PAGE_SIZE.
>
> This concern was raised during the review of adding Large Block Size support
> to XFS[1][2].
>
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in block layer, it is much more efficient to send out
> larger zero pages as a part of a single bvec.
>
> Some examples of places in the kernel where this could be useful:
> - blkdev_issue_zero_pages()
> - iomap_dio_zero()
> - vmalloc.c:zero_iter()
> - rxperf_process_call()
> - fscrypt_zeroout_range_inline_crypt()
> - bch2_checksum_update()
> ...
>
> We already have huge_zero_folio that is allocated on demand, and it will be
> deallocated by the shrinker if there are no users of it left.
>
> At the moment, the huge_zero_folio refcount is tied to the lifetime of the
> process that created it. This might not work for the bio layer, as completions
> can be async and the process that created the huge_zero_folio might no
> longer be alive.
>
> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate
> the huge_zero_folio via memblock, and it will never be freed.
>
> I have converted blkdev_issue_zero_pages() as an example as a part of
> this series.
>
> I will send patches to individual subsystems using the huge_zero_folio
> once this gets upstreamed.
>
> Looking forward to some feedback.
>
> [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
> [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
>
> Changes since v1:
> - Move from .bss to allocating it through memblock(David)
>
> Changes since RFC:
> - Added the config option based on the feedback from David.
> - Encode more info in the header to avoid dead code (Dave Hansen
>   feedback)
> - The static part of huge_zero_folio in memory.c and the dynamic part
>   stays in huge_memory.c
> - Split the patches to make it easy for review.
>
> Pankaj Raghav (5):
>   mm: move huge_zero_page declaration from huge_mm.h to mm.h
>   huge_memory: add huge_zero_page_shrinker_(init|exit) function
>   mm: add static PMD zero page
>   mm: add largest_zero_folio() routine
>   block: use largest_zero_folio in __blkdev_issue_zero_pages()
>
>  block/blk-lib.c         | 17 +++++----
>  include/linux/huge_mm.h | 31 ----------------
>  include/linux/mm.h      | 81 +++++++++++++++++++++++++++++++++++++++++
>  mm/Kconfig              |  9 +++++
>  mm/huge_memory.c        | 62 +++++++++++++++++++++++--------
>  mm/memory.c             | 25 +++++++++++++
>  mm/mm_init.c            |  1 +
>  7 files changed, 173 insertions(+), 53 deletions(-)
>
>
> base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
> --
> 2.49.0
>


* Re: [PATCH v2 4/5] mm: add largest_zero_folio() routine
  2025-07-07 14:23 ` [PATCH v2 4/5] mm: add largest_zero_folio() routine Pankaj Raghav (Samsung)
  2025-07-15 14:16   ` David Hildenbrand
@ 2025-07-15 16:13   ` Lorenzo Stoakes
  1 sibling, 0 replies; 40+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 16:13 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

Nit on subject - this is a function, not a routine :)

On Mon, Jul 07, 2025 at 04:23:18PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Add largest_zero_folio() routine so that huge_zero_folio can be
> used without the need to pass any mm struct. This will return ZERO_PAGE
> folio if CONFIG_STATIC_PMD_ZERO_PAGE is disabled or if we failed to
> allocate a PMD page from memblock.
>
> This routine can also be called even if THP is disabled.

This is pretty much implicit in the series as a whole though, so probably
not worth mentioning :)

>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  include/linux/mm.h | 28 ++++++++++++++++++++++++++--
>  1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 428fe6d36b3c..d5543cf7b8e9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4018,17 +4018,41 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
>
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
>
> +extern struct folio *huge_zero_folio;
> +extern unsigned long huge_zero_pfn;

I mean this should be in huge_mm.h again imo, but at any rate I don't know why
you're moving them up here.

But maybe diff algorithm being weird?

> +
>  #ifdef CONFIG_STATIC_PMD_ZERO_PAGE
>  extern void __init static_pmd_zero_init(void);
> +
> +/*
> + * largest_zero_folio - Get the largest zero size folio available
> + *
> + * This function will return a PMD sized zero folio if CONFIG_STATIC_PMD_ZERO_PAGE
> + * is enabled. Otherwise, a ZERO_PAGE folio is returned.
> + *
> + * Deduce the size of the folio with folio_size instead of assuming the
> + * folio size.
> + */
> +static inline struct folio *largest_zero_folio(void)
> +{
> +	if(!huge_zero_folio)
> +		return page_folio(ZERO_PAGE(0));
> +
> +	return READ_ONCE(huge_zero_folio);
> +}
> +
>  #else
>  static inline void __init static_pmd_zero_init(void)
>  {
>  	return;

No need to return in void functions.

>  }
> +
> +static inline struct folio *largest_zero_folio(void)
> +{
> +	return page_folio(ZERO_PAGE(0));
> +}
>  #endif
>
> -extern struct folio *huge_zero_folio;
> -extern unsigned long huge_zero_pfn;
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  static inline bool is_huge_zero_folio(const struct folio *folio)
> --
> 2.49.0
>


* Re: [PATCH v2 5/5] block: use largest_zero_folio in __blkdev_issue_zero_pages()
  2025-07-07 14:23 ` [PATCH v2 5/5] block: use largest_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav (Samsung)
@ 2025-07-15 16:19   ` Lorenzo Stoakes
  2025-07-16 13:24     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 40+ messages in thread
From: Lorenzo Stoakes @ 2025-07-15 16:19 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Mon, Jul 07, 2025 at 04:23:19PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Use largest_zero_folio() in __blkdev_issue_zero_pages().
>
> On systems with CONFIG_STATIC_PMD_ZERO_PAGE enabled, we will end up
> sending larger bvecs instead of multiple small ones.
>
> Noticed a 4% increase in performance on a commercial NVMe SSD which does
> not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
> gains might be bigger if the device supports bigger MDTS.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  block/blk-lib.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index 4c9f20a689f7..70a5700b6717 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -196,6 +196,10 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
>  		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
>  		struct bio **biop, unsigned int flags)
>  {
> +	struct folio *zero_folio;
> +
> +	zero_folio = largest_zero_folio();

Just assign this in the decl :)

> +
>  	while (nr_sects) {
>  		unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
>  		struct bio *bio;
> @@ -208,15 +212,14 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
>  			break;
>
>  		do {
> -			unsigned int len, added;
> +			unsigned int len;
>
> -			len = min_t(sector_t,
> -				PAGE_SIZE, nr_sects << SECTOR_SHIFT);
> -			added = bio_add_page(bio, ZERO_PAGE(0), len, 0);
> -			if (added < len)
> +			len = min_t(sector_t, folio_size(zero_folio),
> +				    nr_sects << SECTOR_SHIFT);
> +			if (!bio_add_folio(bio, zero_folio, len, 0))

Hmm, will this work if nr_sects << SECTOR_SHIFT size isn't PMD-aligned?

I guess it actually just copies individual pages in the folio as needed?

Does this actually result in a significant performance improvement? Do we
have numbers for this to justify the series?

>  				break;
> -			nr_sects -= added >> SECTOR_SHIFT;
> -			sector += added >> SECTOR_SHIFT;
> +			nr_sects -= len >> SECTOR_SHIFT;
> +			sector += len >> SECTOR_SHIFT;
>  		} while (nr_sects);
>
>  		*biop = bio_chain_and_submit(*biop, bio);
> --
> 2.49.0
>


* Re: [PATCH v2 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h
  2025-07-15 14:08   ` Lorenzo Stoakes
@ 2025-07-16  7:47     ` Pankaj Raghav (Samsung)
  2025-07-16 15:24       ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-16  7:47 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Tue, Jul 15, 2025 at 03:08:40PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jul 07, 2025 at 04:23:15PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > Move the declaration associated with huge_zero_page from huge_mm.h to
> > mm.h. This patch is in preparation for adding static PMD zero page as we
> > will be reusing some of the huge_zero_page infrastructure.
> 
> Hmm this is really iffy.
> 
> The whole purpose of huge_mm.h is to handle huge page stuff, and now you're
> moving it to a general header... not a fan of this - now we have _some_
> huge stuff in mm.h and some stuff here.
> 
> Yes this might be something we screwed up already, but that's not a reason
> to perpetuate mistakes.
> 
> Surely you don't _need_ to do this and this is a question of fixing up
> header includes right?
> 
> Or is there some horrible cyclical header issue here?
> 
> Also your commit message doesn't give any reason as to why you _need_ to do
> this also. For something like this where you're doing something that at
> face value seems to contradict the purpose of these headers, you need to
> explain why.
> 

In one of the earlier versions, David asked me to experiment by moving some of these
declarations to mm.h and see how it looks. Mainly because, as you
guessed later, we can use it without THP being enabled.

But I see that you strongly feel against moving this to mm.h (and I see
why).

I can move it back to huge_mm.h.

Thanks

--
Pankaj



* Re: [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function
  2025-07-15 14:18   ` David Hildenbrand
@ 2025-07-16  8:01     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-16  8:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, Lorenzo Stoakes,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Tue, Jul 15, 2025 at 04:18:26PM +0200, David Hildenbrand wrote:
> On 07.07.25 16:23, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Add huge_zero_page_shrinker_init() and huge_zero_page_shrinker_exit().
> > As shrinker will not be needed when static PMD zero page is enabled,
> > these two functions can be a no-op.
> > 
> > This is a preparation patch for static PMD zero page. No functional
> > changes.
> > 
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >   mm/huge_memory.c | 38 +++++++++++++++++++++++++++-----------
> >   1 file changed, 27 insertions(+), 11 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d3e66136e41a..101b67ab2eb6 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -289,6 +289,24 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
> >   }
> >   static struct shrinker *huge_zero_page_shrinker;
> > +static int huge_zero_page_shrinker_init(void)
> > +{
> > +	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
> > +	if (!huge_zero_page_shrinker)
> > +		return -ENOMEM;
> > +
> > +	huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
> > +	huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
> > +	shrinker_register(huge_zero_page_shrinker);
> > +	return 0;
> > +}
> > +
> > +static void huge_zero_page_shrinker_exit(void)
> > +{
> > +	shrinker_free(huge_zero_page_shrinker);
> > +	return;
> > +}
> 
> While at it, we should rename most of that to "huge_zero_folio" I assume.
Sounds good.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 


* Re: [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function
  2025-07-15 14:29   ` Lorenzo Stoakes
@ 2025-07-16  8:08     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-16  8:08 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Tue, Jul 15, 2025 at 03:29:08PM +0100, Lorenzo Stoakes wrote:
> Nit on subject, function -> functions.
> 
> On Mon, Jul 07, 2025 at 04:23:16PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > Add huge_zero_page_shrinker_init() and huge_zero_page_shrinker_exit().
> > As shrinker will not be needed when static PMD zero page is enabled,
> > these two functions can be a no-op.
> >
> > This is a preparation patch for static PMD zero page. No functional
> > changes.
> 
> This is nitty stuff, but I think this is a little unclear, maybe something
> like:
> 
> 	We will soon be determining whether to use a shrinker depending on
> 	whether a static PMD zero page is available, therefore abstract out
> 	shrink initialisation and teardown such that we can more easily
> 	handle both the shrinker and static PMD zero page cases.
> 
This looks good. I will add this to the commit message.

> >
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> 
> Other than nits, this LGTM, so with those addressed:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks.

> >  #ifdef CONFIG_SYSFS
> >  static ssize_t enabled_show(struct kobject *kobj,
> > @@ -850,33 +868,31 @@ static inline void hugepage_exit_sysfs(struct kobject *hugepage_kobj)
> >
> >  static int __init thp_shrinker_init(void)
> >  {
> > -	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
> > -	if (!huge_zero_page_shrinker)
> > -		return -ENOMEM;
> > +	int ret = 0;
> 
> Kinda no point in initialising to zero, unless...
> 
> >
> >  	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
> >  						 SHRINKER_MEMCG_AWARE |
> >  						 SHRINKER_NONSLAB,
> >  						 "thp-deferred_split");
> > -	if (!deferred_split_shrinker) {
> > -		shrinker_free(huge_zero_page_shrinker);
> > +	if (!deferred_split_shrinker)
> >  		return -ENOMEM;
> > -	}
> > -
> > -	huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
> > -	huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
> > -	shrinker_register(huge_zero_page_shrinker);
> >
> >  	deferred_split_shrinker->count_objects = deferred_split_count;
> >  	deferred_split_shrinker->scan_objects = deferred_split_scan;
> >  	shrinker_register(deferred_split_shrinker);
> >
> > +	ret = huge_zero_page_shrinker_init();
> > +	if (ret) {
> > +		shrinker_free(deferred_split_shrinker);
> > +		return ret;
> > +	}
> 
> ... you change this to:
> 
> 	if (ret)
> 		shrinker_free(deferred_split_shrinker);
> 
> 	return ret;
> 
> But it's not a big deal. Maybe I'd rename ret -> err if you keep things as
> they are (but don't init to 0).

Sounds good.

--
Pankaj


* Re: [PATCH v2 5/5] block: use largest_zero_folio in __blkdev_issue_zero_pages()
  2025-07-15 16:19   ` Lorenzo Stoakes
@ 2025-07-16 13:24     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-16 13:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Tue, Jul 15, 2025 at 05:19:54PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jul 07, 2025 at 04:23:19PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > Use largest_zero_folio() in __blkdev_issue_zero_pages().
> >
> > On systems with CONFIG_STATIC_PMD_ZERO_PAGE enabled, we will end up
> > sending larger bvecs instead of multiple small ones.
> >
> > Noticed a 4% increase in performance on a commercial NVMe SSD which does
> > not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
> > gains might be bigger if the device supports bigger MDTS.
> >
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >  block/blk-lib.c | 17 ++++++++++-------
> >  1 file changed, 10 insertions(+), 7 deletions(-)
> >
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index 4c9f20a689f7..70a5700b6717 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -196,6 +196,10 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
> >  		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
> >  		struct bio **biop, unsigned int flags)
> >  {
> > +	struct folio *zero_folio;
> > +
> > +	zero_folio = largest_zero_folio();
> 
> Just assign this in the decl :)
Yeah!
> 
> > +
> >  	while (nr_sects) {
> >  		unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
> >  		struct bio *bio;
> > @@ -208,15 +212,14 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
> >  			break;
> >
> >  		do {
> > -			unsigned int len, added;
> > +			unsigned int len;
> >
> > -			len = min_t(sector_t,
> > -				PAGE_SIZE, nr_sects << SECTOR_SHIFT);
> > -			added = bio_add_page(bio, ZERO_PAGE(0), len, 0);
> > -			if (added < len)
> > +			len = min_t(sector_t, folio_size(zero_folio),
> > +				    nr_sects << SECTOR_SHIFT);
> > +			if (!bio_add_folio(bio, zero_folio, len, 0))
> 
> Hmm, will this work if nr_sects << SECTOR_SHIFT size isn't PMD-aligned?

Yeah, that should not be a problem as long as (nr_sects << SECTOR_SHIFT) is
smaller than the PMD-sized folio.

> 
> I guess it actually just copies individual pages in the folio as needed?
> 
> Does this actually result in a significant performance improvement? Do we
> have numbers for this to justify the series?
I put it in my commit message:
```
Noticed a 4% increase in performance on a commercial NVMe SSD which does
not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
gains might be bigger if the device supports bigger MDTS.
```

Even though it is more of a synthetic benchmark, it goes to show the
effect of adding multiple bio_vecs with ZERO_PAGE instead of a single
PMD-sized zero page.

--
Pankaj


* Re: [PATCH v2 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h
  2025-07-16  7:47     ` Pankaj Raghav (Samsung)
@ 2025-07-16 15:24       ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2025-07-16 15:24 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, Andrew Morton,
	Thomas Gleixner, Nico Pache, Dev Jain, Liam R . Howlett,
	Jens Axboe, linux-kernel, willy, linux-mm, x86, linux-block,
	linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev, hch,
	Pankaj Raghav

On 16.07.25 09:47, Pankaj Raghav (Samsung) wrote:
> On Tue, Jul 15, 2025 at 03:08:40PM +0100, Lorenzo Stoakes wrote:
>> On Mon, Jul 07, 2025 at 04:23:15PM +0200, Pankaj Raghav (Samsung) wrote:
>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>
>>> Move the declaration associated with huge_zero_page from huge_mm.h to
>>> mm.h. This patch is in preparation for adding static PMD zero page as we
>>> will be reusing some of the huge_zero_page infrastructure.
>>
>> Hmm this is really iffy.
>>
>> The whole purpose of huge_mm.h is to handle huge page stuff, and now you're
>> moving it to a general header... not a fan of this - now we have _some_
>> huge stuff in mm.h and some stuff here.
>>
>> Yes this might be something we screwed up already, but that's not a reason
>> to perpetuate mistakes.
>>
>> Surely you don't _need_ to do this and this is a question of fixing up
>> header includes right?
>>
>> Or is there some horrible cyclical header issue here?
>>
>> Also your commit message doesn't give any reason as to why you _need_ to do
>> this also. For something like this where you're doing something that at
>> face value seems to contradict the purpose of these headers, you need to
>> explain why.
>>
> 
> In one of the earlier versions, David asked me to experiment by moving some of these
> declarations to mm.h and see how it looks. Mainly because, as you
> guessed later, we can use it without THP being enabled.

I assume, in the future, most setups we care about (-> performance) will 
have THP compiled in. So likely we can defer moving it until it's really 
required.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 3/5] mm: add static PMD zero page
  2025-07-15 14:53     ` David Hildenbrand
@ 2025-07-17 10:34       ` Pankaj Raghav (Samsung)
  2025-07-17 11:46         ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-17 10:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, Lorenzo Stoakes,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

> > Then, we'd only need a config option to allow for that to happen.
> 
> Something incomplete and very hacky just to give an idea. It would try allocating
> it if there is actual code running that would need it, and then have it
> stick around forever.
> 
Thanks a lot for this David :) I think this is a much better idea; it
reduces the amount of code and reuses the existing infrastructure.

I will try this approach in the next version.

<snip>
> +       /*
> +        * Our raised reference will prevent the shrinker from ever having
> +        * success -> static.
> +        */
> +       if (atomic_read(&huge_zero_folio_is_static))
> +               return huge_zero_folio;
> +       /* TODO: memblock allocation if buddy is not up yet? Or Reject that earlier. */

Do we need memblock allocation? At least the use cases I foresee for
the static PMD zero page are all after the mm is up. So I don't see why we
need to allocate it via memblock.

> +       if (!get_huge_zero_page())
> +               return NULL;
> +       if (atomic_cmpxchg(&huge_zero_folio_is_static, 0, 1) != 0)
> +               put_huge_zero_page();
> +       return huge_zero_folio;
> +
> +}
> +#endif /* CONFIG_STATIC_HUGE_ZERO_FOLIO */
> +
>  static unsigned long shrink_huge_zero_page_count(struct shrinker *shrink,
>                                         struct shrink_control *sc)
>  {
> 

--
Pankaj


* Re: [PATCH v2 0/5] add static PMD zero page support
  2025-07-15 15:34 ` Lorenzo Stoakes
@ 2025-07-17 10:43   ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-17 10:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

Hi Lorenzo,

On Tue, Jul 15, 2025 at 04:34:45PM +0100, Lorenzo Stoakes wrote:
> Pankaj,
> 
> There seems to be quite a lot to work on here, and it seems rather speculative,
> so can we respin as an RFC please?
> 

Thanks for all the review comments.

Yeah, I agree. I will resend it as RFC. I will try the new approach
suggested by David in Patch 3 in the next version.

--
Pankaj




* Re: [PATCH v2 3/5] mm: add static PMD zero page
  2025-07-17 10:34       ` Pankaj Raghav (Samsung)
@ 2025-07-17 11:46         ` David Hildenbrand
  2025-07-17 12:07           ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:46 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, Lorenzo Stoakes,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On 17.07.25 12:34, Pankaj Raghav (Samsung) wrote:
>>> Then, we'd only need a config option to allow for that to happen.
>>
>> Something incomplete and very hacky just to give an idea. It would try allocating
>> it if there is actual code running that would need it, and then have it
>> stick around forever.
>>
> Thanks a lot for this David :) I think this is a much better idea and
> reduces the amount of code and reuses the existing infrastructure.
> 
> I will try this approach in the next version.
> 
> <snip>
>> +       /*
>> +        * Our raised reference will prevent the shrinker from ever having
>> +        * success -> static.
>> +        */
>> +       if (atomic_read(&huge_zero_folio_is_static))
>> +               return huge_zero_folio;
>> +       /* TODO: memblock allocation if buddy is not up yet? Or Reject that earlier. */
> 
> Do we need memblock allocation? At least the use cases I foresee for
> static pmd zero page are all after the mm is up. So I don't see why we
> need to allocate it via memblock.

Even better!

We might want to detect whether allocation of the huge zeropage failed a 
couple of times and then just give up. Otherwise, each and every user of 
the largest zero folio will keep trying to allocate it.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v2 3/5] mm: add static PMD zero page
  2025-07-17 11:46         ` David Hildenbrand
@ 2025-07-17 12:07           ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 40+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-07-17 12:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Suren Baghdasaryan, Ryan Roberts, Baolin Wang, Borislav Petkov,
	Ingo Molnar, H . Peter Anvin, Vlastimil Babka, Zi Yan,
	Mike Rapoport, Dave Hansen, Michal Hocko, Lorenzo Stoakes,
	Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
	Liam R . Howlett, Jens Axboe, linux-kernel, willy, linux-mm, x86,
	linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
	hch, Pankaj Raghav

On Thu, Jul 17, 2025 at 01:46:03PM +0200, David Hildenbrand wrote:
> On 17.07.25 12:34, Pankaj Raghav (Samsung) wrote:
> > > > Then, we'd only need a config option to allow for that to happen.
> > > 
> > > Something incomplete and very hacky just to give an idea. It would try allocating
> > > it if there is actual code running that would need it, and then have it
> > > stick around forever.
> > > 
> > Thanks a lot for this David :) I think this is a much better idea and
> > reduces the amount of code and reuses the existing infrastructure.
> > 
> > I will try this approach in the next version.
> > 
> > <snip>
> > > +       /*
> > > +        * Our raised reference will prevent the shrinker from ever having
> > > +        * success -> static.
> > > +        */
> > > +       if (atomic_read(&huge_zero_folio_is_static))
> > > +               return huge_zero_folio;
> > > +       /* TODO: memblock allocation if buddy is not up yet? Or Reject that earlier. */
> > 
> > Do we need memblock allocation? At least the use cases I foresee for
> > static pmd zero page are all after the mm is up. So I don't see why we
> > need to allocate it via memblock.
> 
> Even better!
> 
> We might want to detect whether allocation of the huge zeropage failed a
> couple of times and then just give up. Otherwise, each and every user of the
> largest zero folio will keep allocating it.

Yes, that makes sense. We need something like a global counter to track
the number of failures and then give up trying to allocate it once it
exceeds a threshold.

--
Pankaj

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2025-07-17 12:07 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-07 14:23 [PATCH v2 0/5] add static PMD zero page support Pankaj Raghav (Samsung)
2025-07-07 14:23 ` [PATCH v2 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h Pankaj Raghav (Samsung)
2025-07-15 14:08   ` Lorenzo Stoakes
2025-07-16  7:47     ` Pankaj Raghav (Samsung)
2025-07-16 15:24       ` David Hildenbrand
2025-07-07 14:23 ` [PATCH v2 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function Pankaj Raghav (Samsung)
2025-07-15 14:18   ` David Hildenbrand
2025-07-16  8:01     ` Pankaj Raghav (Samsung)
2025-07-15 14:29   ` Lorenzo Stoakes
2025-07-16  8:08     ` Pankaj Raghav (Samsung)
2025-07-07 14:23 ` [PATCH v2 3/5] mm: add static PMD zero page Pankaj Raghav (Samsung)
2025-07-15 14:21   ` David Hildenbrand
2025-07-15 14:53     ` David Hildenbrand
2025-07-17 10:34       ` Pankaj Raghav (Samsung)
2025-07-17 11:46         ` David Hildenbrand
2025-07-17 12:07           ` Pankaj Raghav (Samsung)
2025-07-15 15:26   ` Lorenzo Stoakes
2025-07-07 14:23 ` [PATCH v2 4/5] mm: add largest_zero_folio() routine Pankaj Raghav (Samsung)
2025-07-15 14:16   ` David Hildenbrand
2025-07-15 14:46     ` David Hildenbrand
2025-07-15 16:13   ` Lorenzo Stoakes
2025-07-07 14:23 ` [PATCH v2 5/5] block: use largest_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav (Samsung)
2025-07-15 16:19   ` Lorenzo Stoakes
2025-07-16 13:24     ` Pankaj Raghav (Samsung)
2025-07-07 18:06 ` [PATCH v2 0/5] add static PMD zero page support Zi Yan
2025-07-09  8:03   ` Pankaj Raghav
2025-07-09 15:55     ` Zi Yan
2025-07-15 14:02     ` Lorenzo Stoakes
2025-07-15 14:06       ` David Hildenbrand
2025-07-15 14:12         ` Lorenzo Stoakes
2025-07-15 14:16           ` David Hildenbrand
2025-07-15 15:25             ` Pankaj Raghav (Samsung)
2025-07-15 15:27               ` David Hildenbrand
2025-07-07 22:38 ` Andrew Morton
2025-07-09  9:59   ` Pankaj Raghav
2025-07-15 14:15   ` David Hildenbrand
2025-07-15 13:53 ` Pankaj Raghav
2025-07-15 14:04 ` Lorenzo Stoakes
2025-07-15 15:34 ` Lorenzo Stoakes
2025-07-17 10:43   ` Pankaj Raghav (Samsung)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).