linux-fsdevel.vger.kernel.org archive mirror
* [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
@ 2025-05-22  9:02 Pankaj Raghav
  2025-05-22  9:02 ` [RFC v2 1/2] mm: " Pankaj Raghav
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Pankaj Raghav @ 2025-05-22  9:02 UTC (permalink / raw)
  To: Suren Baghdasaryan, Vlastimil Babka, Ryan Roberts, Mike Rapoport,
	Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
	Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
	Dave Hansen, David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
	Liam R . Howlett, Jens Axboe
  Cc: linux-block, linux-fsdevel, Darrick J . Wong, gost.dev, kernel,
	hch, linux-kernel, linux-mm, willy, x86, mcgrof, Pankaj Raghav

There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
is limited by PAGE_SIZE.

This concern was raised during the review of adding Large Block Size support
to XFS[1][2].

This is especially annoying in block devices and filesystems, where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as part of a single bvec.

Some examples of places in the kernel where this could be useful:
- blkdev_issue_zero_pages()
- iomap_dio_zero()
- vmalloc.c:zero_iter()
- rxperf_process_call()
- fscrypt_zeroout_range_inline_crypt()
- bch2_checksum_update()
...

We already have huge_zero_folio that is allocated on demand, and it will be
deallocated by the shrinker if there are no users of it left.

But to use huge_zero_folio, we need to pass an mm struct, and put_folio
needs to be called in the destructor. This makes sense for systems that
have memory constraints, but for bigger servers it does not matter if the
PMD size is reasonable (like x86).

Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
the huge_zero_folio, and it will never be freed. This makes it possible to
use the huge_zero_folio without passing any mm struct and without a call
to put_folio in the destructor.
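
As a rough illustration of the two lifetimes, here is a small userspace
model (not kernel code, and not part of this series). All names prefixed
with model_ are hypothetical stand-ins, and the model frees the folio on
the last put, whereas the kernel leaves that to the shrinker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PMD_SIZE (2UL << 20)	/* 2 MiB, as on x86_64 */

/* Hypothetical userspace stand-ins for struct folio and struct mm_struct. */
struct model_folio { size_t size; };
struct model_mm { bool holds_ref; };

static struct model_folio *model_zero_folio;	/* NULL while not allocated */
static int model_refcount;			/* used in dynamic mode only */
static bool model_always;			/* models the config option */

static bool model_alloc(void)
{
	if (model_zero_folio)
		return true;
	model_zero_folio = calloc(1, sizeof(*model_zero_folio));
	if (!model_zero_folio)
		return false;
	model_zero_folio->size = MODEL_PMD_SIZE;
	return true;
}

/* Boot-time init: only the "always" mode allocates up front. */
static int model_init(bool always)
{
	model_always = always;
	if (always && !model_alloc())
		return -1;
	return 0;
}

/* With the option on, mm may be NULL and no reference is tracked. */
static struct model_folio *model_get(struct model_mm *mm)
{
	if (model_always)
		return model_zero_folio;
	if (!mm)
		return NULL;	/* dynamic mode needs an mm to hold the reference */
	if (!mm->holds_ref) {
		if (!model_alloc())
			return NULL;
		model_refcount++;
		mm->holds_ref = true;
	}
	return model_zero_folio;
}

/* Simplification: the last put frees directly instead of via a shrinker. */
static void model_put(struct model_mm *mm)
{
	if (model_always || !mm || !mm->holds_ref)
		return;
	assert(model_refcount > 0);
	mm->holds_ref = false;
	if (--model_refcount == 0) {
		free(model_zero_folio);
		model_zero_folio = NULL;
	}
}
```

In this model, a caller without an mm (such as a block layer helper) gets
NULL in dynamic mode and has to fall back to ZERO_PAGE, whereas with the
option enabled the same call succeeds any time after init.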

I have converted blkdev_issue_zero_pages() as an example in this series.

I will send patches to individual subsystems using the huge_zero_folio
once this gets upstreamed.

Looking forward to some feedback.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/

Changes since v1:
- Added the config option based on the feedback from David.
- Removed iomap patches so that I don't clutter this series with too
  many subsystems.

Pankaj Raghav (2):
  mm: add THP_HUGE_ZERO_PAGE_ALWAYS config option
  block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()

 arch/x86/Kconfig |  1 +
 block/blk-lib.c  | 15 +++++++++---
 mm/Kconfig       | 12 +++++++++
 mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 74 insertions(+), 17 deletions(-)


base-commit: f1f6aceb82a55f87d04e2896ac3782162e7859bd
-- 
2.47.2



* [RFC v2 1/2] mm: add THP_HUGE_ZERO_PAGE_ALWAYS config option
  2025-05-22  9:02 [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option Pankaj Raghav
@ 2025-05-22  9:02 ` Pankaj Raghav
  2025-05-22  9:02 ` [RFC v2 2/2] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav
  2025-05-22 11:31 ` [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option Mike Rapoport
  2 siblings, 0 replies; 10+ messages in thread
From: Pankaj Raghav @ 2025-05-22  9:02 UTC (permalink / raw)
  To: Suren Baghdasaryan, Vlastimil Babka, Ryan Roberts, Mike Rapoport,
	Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
	Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
	Dave Hansen, David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
	Liam R . Howlett, Jens Axboe
  Cc: linux-block, linux-fsdevel, Darrick J . Wong, gost.dev, kernel,
	hch, linux-kernel, linux-mm, willy, x86, mcgrof, Pankaj Raghav

There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
is limited by PAGE_SIZE.

This is especially annoying in block devices and filesystems, where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as part of a single bvec.

This concern was raised during the review of adding LBS support to
XFS[1][2].

Usually huge_zero_folio is allocated on demand, and it will be
deallocated by the shrinker if there are no users of it left.

Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
the huge_zero_folio, and it will never be freed. This makes it possible to
use the huge_zero_folio without passing any mm struct and without calling
put_folio in the destructor.

We can enable it by default for x86_64 where the PMD size is 2M.
It is a good compromise between memory usage and efficiency.
As a THP zero page might be wasteful for architectures with bigger page
sizes, let's not enable it for them.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 arch/x86/Kconfig |  1 +
 mm/Kconfig       | 12 +++++++++
 mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++----------
 3 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 055204dc211d..2e1527580746 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
 	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_WANTS_THP_ZERO_PAGE_ALWAYS	if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select BUILDTIME_TABLE_SORT
 	select CLKEVT_I8253
diff --git a/mm/Kconfig b/mm/Kconfig
index bd08e151fa1b..a2994e7d55ba 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -823,6 +823,9 @@ config ARCH_WANT_GENERAL_HUGETLB
 config ARCH_WANTS_THP_SWAP
 	def_bool n
 
+config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
+	def_bool n
+
 config MM_ID
 	def_bool n
 
@@ -895,6 +898,15 @@ config READ_ONLY_THP_FOR_FS
 	  support of file THPs will be developed in the next few release
 	  cycles.
 
+config THP_ZERO_PAGE_ALWAYS
+	def_bool y
+	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
+	help
+	  Typically huge_zero_folio, which is a THP of zeroes, is allocated
+	  on demand and deallocated when not in use. This option will always
+	  allocate huge_zero_folio for zeroing and it is never deallocated.
+	  Not suitable for memory constrained systems.
+
 config NO_PAGE_MAPCOUNT
 	bool "No per-page mapcount (EXPERIMENTAL)"
 	help
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..1a0556ca3839 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -247,9 +247,16 @@ static void put_huge_zero_page(void)
 	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
 }
 
+/*
+ * If THP_ZERO_PAGE_ALWAYS is enabled, @mm can be NULL, i.e, the huge_zero_folio
+ * is not associated with any mm_struct.
+ */
 struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
 {
-	if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+	if (!IS_ENABLED(CONFIG_THP_ZERO_PAGE_ALWAYS) && !mm)
+		return NULL;
+
+	if (IS_ENABLED(CONFIG_THP_ZERO_PAGE_ALWAYS) || test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
 		return READ_ONCE(huge_zero_folio);
 
 	if (!get_huge_zero_page())
@@ -263,6 +270,9 @@ struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
 
 void mm_put_huge_zero_folio(struct mm_struct *mm)
 {
+	if (IS_ENABLED(CONFIG_THP_ZERO_PAGE_ALWAYS))
+		return;
+
 	if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
 		put_huge_zero_page();
 }
@@ -274,14 +284,21 @@ static unsigned long shrink_huge_zero_page_count(struct shrinker *shrink,
 	return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
 }
 
+static void _put_huge_zero_folio(void)
+{
+	struct folio *zero_folio;
+
+	zero_folio = xchg(&huge_zero_folio, NULL);
+	BUG_ON(zero_folio == NULL);
+	WRITE_ONCE(huge_zero_pfn, ~0UL);
+	folio_put(zero_folio);
+}
+
 static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
 				       struct shrink_control *sc)
 {
 	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
-		struct folio *zero_folio = xchg(&huge_zero_folio, NULL);
-		BUG_ON(zero_folio == NULL);
-		WRITE_ONCE(huge_zero_pfn, ~0UL);
-		folio_put(zero_folio);
+		_put_huge_zero_folio();
 		return HPAGE_PMD_NR;
 	}
 
@@ -850,10 +867,6 @@ static inline void hugepage_exit_sysfs(struct kobject *hugepage_kobj)
 
 static int __init thp_shrinker_init(void)
 {
-	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
-	if (!huge_zero_page_shrinker)
-		return -ENOMEM;
-
 	deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
 						 SHRINKER_MEMCG_AWARE |
 						 SHRINKER_NONSLAB,
@@ -863,14 +876,21 @@ static int __init thp_shrinker_init(void)
 		return -ENOMEM;
 	}
 
-	huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
-	huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
-	shrinker_register(huge_zero_page_shrinker);
-
 	deferred_split_shrinker->count_objects = deferred_split_count;
 	deferred_split_shrinker->scan_objects = deferred_split_scan;
 	shrinker_register(deferred_split_shrinker);
 
+	if (IS_ENABLED(CONFIG_THP_ZERO_PAGE_ALWAYS))
+		return 0;
+
+	huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
+	if (!huge_zero_page_shrinker)
+		return -ENOMEM;
+
+	huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
+	huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
+	shrinker_register(huge_zero_page_shrinker);
+
 	return 0;
 }
 
@@ -880,6 +900,17 @@ static void __init thp_shrinker_exit(void)
 	shrinker_free(deferred_split_shrinker);
 }
 
+static int __init huge_zero_page_init(void)
+{
+	if (!IS_ENABLED(CONFIG_THP_ZERO_PAGE_ALWAYS))
+		return 0;
+
+	if (!get_huge_zero_page()) {
+		return -ENOMEM;
+	}
+	return 0;
+}
+
 static int __init hugepage_init(void)
 {
 	int err;
@@ -903,6 +934,10 @@ static int __init hugepage_init(void)
 	if (err)
 		goto err_slab;
 
+	err = huge_zero_page_init();
+	if (err)
+		goto err_huge_zero_page;
+
 	err = thp_shrinker_init();
 	if (err)
 		goto err_shrinker;
@@ -925,6 +960,8 @@ static int __init hugepage_init(void)
 err_khugepaged:
 	thp_shrinker_exit();
 err_shrinker:
+	_put_huge_zero_folio();
+err_huge_zero_page:
 	khugepaged_destroy();
 err_slab:
 	hugepage_exit_sysfs(hugepage_kobj);
-- 
2.47.2



* [RFC v2 2/2] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
  2025-05-22  9:02 [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option Pankaj Raghav
  2025-05-22  9:02 ` [RFC v2 1/2] mm: " Pankaj Raghav
@ 2025-05-22  9:02 ` Pankaj Raghav
  2025-05-22 11:31 ` [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option Mike Rapoport
  2 siblings, 0 replies; 10+ messages in thread
From: Pankaj Raghav @ 2025-05-22  9:02 UTC (permalink / raw)
  To: Suren Baghdasaryan, Vlastimil Babka, Ryan Roberts, Mike Rapoport,
	Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
	Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
	Dave Hansen, David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
	Liam R . Howlett, Jens Axboe
  Cc: linux-block, linux-fsdevel, Darrick J . Wong, gost.dev, kernel,
	hch, linux-kernel, linux-mm, willy, x86, mcgrof, Pankaj Raghav

Use mm_huge_zero_folio in __blkdev_issue_zero_pages(). Fall back to
ZERO_PAGE if mm_huge_zero_folio is not available.

On systems that allocate mm_huge_zero_folio, we will end up sending larger
bvecs instead of multiple small ones.

Noticed a 4% increase in performance on a commercial NVMe SSD which does
not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
gains might be bigger if the device supports a bigger MDTS.
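
To illustrate the effect on segment counts, here is a small userspace
model of the zeroing loop (not the kernel code itself). model_zero_fill(),
struct zero_fill_stats and the max_vecs parameter are hypothetical names;
the model assumes a bio simply holds up to max_vecs segments and ignores
every other bio limit.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SECTOR_SHIFT 9	/* 512-byte sectors */

struct zero_fill_stats {
	unsigned long bios;	/* bios that had to be allocated */
	unsigned long segments;	/* segments (bvecs) added in total */
};

/*
 * Walk nr_sects the way the loop above does: every bio takes up to
 * max_vecs segments, and every segment covers min(seg_size, bytes left).
 * seg_size is PAGE_SIZE for ZERO_PAGE or the folio size for a huge
 * zero folio.
 */
static struct zero_fill_stats model_zero_fill(unsigned long long nr_sects,
					      size_t seg_size,
					      unsigned int max_vecs)
{
	struct zero_fill_stats st = { 0, 0 };

	/* The model only handles sector-aligned, non-empty segments. */
	assert(seg_size > 0 && seg_size % (1u << MODEL_SECTOR_SHIFT) == 0);
	assert(max_vecs > 0);

	while (nr_sects) {
		unsigned int vecs = 0;

		st.bios++;
		while (nr_sects && vecs < max_vecs) {
			unsigned long long left = nr_sects << MODEL_SECTOR_SHIFT;
			unsigned long long len = left < seg_size ? left : seg_size;

			nr_sects -= len >> MODEL_SECTOR_SHIFT;
			vecs++;
			st.segments++;
		}
	}
	return st;
}
```

Under these assumptions, zeroing 4 MiB (8192 sectors) with 4 KiB segments
and a max_vecs of 256 takes 1024 segments in 4 bios, while a 2 MiB segment
size takes 2 segments in 1 bio.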

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 block/blk-lib.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 4c9f20a689f7..221389412359 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -196,6 +196,12 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
 		sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
 		struct bio **biop, unsigned int flags)
 {
+	struct folio *zero_folio;
+
+	zero_folio = mm_get_huge_zero_folio(NULL);
+	if (!zero_folio)
+		zero_folio = page_folio(ZERO_PAGE(0));
+
 	while (nr_sects) {
 		unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
 		struct bio *bio;
@@ -208,11 +214,12 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
 			break;
 
 		do {
-			unsigned int len, added;
+			unsigned int len, added = 0;
 
-			len = min_t(sector_t,
-				PAGE_SIZE, nr_sects << SECTOR_SHIFT);
-			added = bio_add_page(bio, ZERO_PAGE(0), len, 0);
+			len = min_t(sector_t, folio_size(zero_folio),
+				    nr_sects << SECTOR_SHIFT);
+			if (bio_add_folio(bio, zero_folio, len, 0))
+				added = len;
 			if (added < len)
 				break;
 			nr_sects -= added >> SECTOR_SHIFT;
-- 
2.47.2



* Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
  2025-05-22  9:02 [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option Pankaj Raghav
  2025-05-22  9:02 ` [RFC v2 1/2] mm: " Pankaj Raghav
  2025-05-22  9:02 ` [RFC v2 2/2] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav
@ 2025-05-22 11:31 ` Mike Rapoport
  2025-05-22 12:00   ` Pankaj Raghav (Samsung)
  2025-05-22 12:01   ` David Hildenbrand
  2 siblings, 2 replies; 10+ messages in thread
From: Mike Rapoport @ 2025-05-22 11:31 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Suren Baghdasaryan, Vlastimil Babka, Ryan Roberts, Michal Hocko,
	Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
	Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
	Dave Hansen, David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
	Liam R . Howlett, Jens Axboe, linux-block, linux-fsdevel,
	Darrick J . Wong, gost.dev, kernel, hch, linux-kernel, linux-mm,
	willy, x86, mcgrof

Hi Pankaj,

On Thu, May 22, 2025 at 11:02:41AM +0200, Pankaj Raghav wrote:
> There are many places in the kernel where we need to zeroout larger
> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
> is limited by PAGE_SIZE.
> 
> This concern was raised during the review of adding Large Block Size support
> to XFS[1][2].
> 
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in block layer, it is much more efficient to send out
> larger zero pages as a part of a single bvec.
> 
> Some examples of places in the kernel where this could be useful:
> - blkdev_issue_zero_pages()
> - iomap_dio_zero()
> - vmalloc.c:zero_iter()
> - rxperf_process_call()
> - fscrypt_zeroout_range_inline_crypt()
> - bch2_checksum_update()
> ...
> 
> We already have huge_zero_folio that is allocated on demand, and it will be
> deallocated by the shrinker if there are no users of it left.
> 
> But to use huge_zero_folio, we need to pass a mm struct and the
> put_folio needs to be called in the destructor. This makes sense for
> systems that have memory constraints but for bigger servers, it does not
> matter if the PMD size is reasonable (like x86).
> 
> Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
> the huge_zero_folio, and it will never be freed. This makes using the
> huge_zero_folio without having to pass any mm struct and a call to put_folio
> in the destructor.

I don't think this config option should be tied to THP. It's perfectly
sensible to have a configuration with HUGETLB and without THP.
 
> I have converted blkdev_issue_zero_pages() as an example as a part of
> this series.
> 
> I will send patches to individual subsystems using the huge_zero_folio
> once this gets upstreamed.
> 
> Looking forward to some feedback.
> 
> [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
> [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
> 
> Changes since v1:
> - Added the config option based on the feedback from David.
> - Removed iomap patches so that I don't clutter this series with too
>   many subsystems.
> 
> Pankaj Raghav (2):
>   mm: add THP_HUGE_ZERO_PAGE_ALWAYS config option
>   block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
> 
>  arch/x86/Kconfig |  1 +
>  block/blk-lib.c  | 15 +++++++++---
>  mm/Kconfig       | 12 +++++++++
>  mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++----------
>  4 files changed, 74 insertions(+), 17 deletions(-)
> 
> 
> base-commit: f1f6aceb82a55f87d04e2896ac3782162e7859bd
> -- 
> 2.47.2
> 
> 

-- 
Sincerely yours,
Mike.


* Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
  2025-05-22 11:31 ` [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option Mike Rapoport
@ 2025-05-22 12:00   ` Pankaj Raghav (Samsung)
  2025-05-22 12:04     ` David Hildenbrand
  2025-05-22 12:01   ` David Hildenbrand
  1 sibling, 1 reply; 10+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-05-22 12:00 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pankaj Raghav, Suren Baghdasaryan, Vlastimil Babka, Ryan Roberts,
	Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
	Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
	Dave Hansen, David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
	Liam R . Howlett, Jens Axboe, linux-block, linux-fsdevel,
	Darrick J . Wong, gost.dev, hch, linux-kernel, linux-mm, willy,
	x86, mcgrof

Hi Mike,

> > Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
> > the huge_zero_folio, and it will never be freed. This makes using the
> > huge_zero_folio without having to pass any mm struct and a call to put_folio
> > in the destructor.
> 
> I don't think this config option should be tied to THP. It's perfectly
> sensible to have a configuration with HUGETLB and without THP.
>  

Hmm, that makes sense. You mean something like this (untested):

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2e1527580746..d447a9b9eb7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -151,8 +151,8 @@ config X86
        select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP   if X86_64
        select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP       if X86_64
        select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
+       select ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS if X86_64
        select ARCH_WANTS_THP_SWAP              if X86_64
-       select ARCH_WANTS_THP_ZERO_PAGE_ALWAYS  if X86_64
        select ARCH_HAS_PARANOID_L1D_FLUSH
        select BUILDTIME_TABLE_SORT
        select CLKEVT_I8253
diff --git a/mm/Kconfig b/mm/Kconfig
index a2994e7d55ba..83a5b95a2286 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -823,9 +823,19 @@ config ARCH_WANT_GENERAL_HUGETLB
 config ARCH_WANTS_THP_SWAP
        def_bool n
 
-config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
+config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
        def_bool n
 
+config HUGE_ZERO_PAGE_ALWAYS
+       def_bool y
+       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
+       help
+         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
+         on demand and deallocated when not in use. This option will always
+         allocate huge_zero_folio for zeroing and it is never deallocated.
+         Not suitable for memory constrained systems.
+
+
 config MM_ID
        def_bool n
 
@@ -898,15 +908,6 @@ config READ_ONLY_THP_FOR_FS
          support of file THPs will be developed in the next few release
          cycles.
 
-config THP_ZERO_PAGE_ALWAYS
-       def_bool y
-       depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
-       help
-         Typically huge_zero_folio, which is a THP of zeroes, is allocated
-         on demand and deallocated when not in use. This option will always
-         allocate huge_zero_folio for zeroing and it is never deallocated.
-         Not suitable for memory constrained systems.
-
 config NO_PAGE_MAPCOUNT
        bool "No per-page mapcount (EXPERIMENTAL)"
        help

--
Pankaj


* Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
  2025-05-22 11:31 ` [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option Mike Rapoport
  2025-05-22 12:00   ` Pankaj Raghav (Samsung)
@ 2025-05-22 12:01   ` David Hildenbrand
  1 sibling, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2025-05-22 12:01 UTC (permalink / raw)
  To: Mike Rapoport, Pankaj Raghav
  Cc: Suren Baghdasaryan, Vlastimil Babka, Ryan Roberts, Michal Hocko,
	Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
	Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
	Dave Hansen, Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
	Jens Axboe, linux-block, linux-fsdevel, Darrick J . Wong,
	gost.dev, kernel, hch, linux-kernel, linux-mm, willy, x86, mcgrof

On 22.05.25 13:31, Mike Rapoport wrote:
> Hi Pankaj,
> 
> On Thu, May 22, 2025 at 11:02:41AM +0200, Pankaj Raghav wrote:
>> There are many places in the kernel where we need to zeroout larger
>> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
>> is limited by PAGE_SIZE.
>>
>> This concern was raised during the review of adding Large Block Size support
>> to XFS[1][2].
>>
>> This is especially annoying in block devices and filesystems where we
>> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
>> bvec support in block layer, it is much more efficient to send out
>> larger zero pages as a part of a single bvec.
>>
>> Some examples of places in the kernel where this could be useful:
>> - blkdev_issue_zero_pages()
>> - iomap_dio_zero()
>> - vmalloc.c:zero_iter()
>> - rxperf_process_call()
>> - fscrypt_zeroout_range_inline_crypt()
>> - bch2_checksum_update()
>> ...
>>
>> We already have huge_zero_folio that is allocated on demand, and it will be
>> deallocated by the shrinker if there are no users of it left.
>>
>> But to use huge_zero_folio, we need to pass a mm struct and the
>> put_folio needs to be called in the destructor. This makes sense for
>> systems that have memory constraints but for bigger servers, it does not
>> matter if the PMD size is reasonable (like x86).
>>
>> Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
>> the huge_zero_folio, and it will never be freed. This makes using the
>> huge_zero_folio without having to pass any mm struct and a call to put_folio
>> in the destructor.
> 
> I don't think this config option should be tied to THP. It's perfectly
> sensible to have a configuration with HUGETLB and without THP.

Such configs are getting rarer ...

I assume we would then simply reuse that page from THP code if available?

-- 
Cheers,

David / dhildenb



* Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
  2025-05-22 12:00   ` Pankaj Raghav (Samsung)
@ 2025-05-22 12:04     ` David Hildenbrand
  2025-05-22 12:34       ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-05-22 12:04 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Mike Rapoport
  Cc: Pankaj Raghav, Suren Baghdasaryan, Vlastimil Babka, Ryan Roberts,
	Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
	Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
	Dave Hansen, Lorenzo Stoakes, Andrew Morton, Liam R . Howlett,
	Jens Axboe, linux-block, linux-fsdevel, Darrick J . Wong,
	gost.dev, hch, linux-kernel, linux-mm, willy, x86, mcgrof

On 22.05.25 14:00, Pankaj Raghav (Samsung) wrote:
> Hi Mike,
> 
>>> Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
>>> the huge_zero_folio, and it will never be freed. This makes using the
>>> huge_zero_folio without having to pass any mm struct and a call to put_folio
>>> in the destructor.
>>
>> I don't think this config option should be tied to THP. It's perfectly
>> sensible to have a configuration with HUGETLB and without THP.
>>   
> 
> Hmm, that makes sense. You mean something like this (untested):
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2e1527580746..d447a9b9eb7d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -151,8 +151,8 @@ config X86
>          select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP   if X86_64
>          select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP       if X86_64
>          select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
> +       select ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS if X86_64
>          select ARCH_WANTS_THP_SWAP              if X86_64
> -       select ARCH_WANTS_THP_ZERO_PAGE_ALWAYS  if X86_64
>          select ARCH_HAS_PARANOID_L1D_FLUSH
>          select BUILDTIME_TABLE_SORT
>          select CLKEVT_I8253
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a2994e7d55ba..83a5b95a2286 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -823,9 +823,19 @@ config ARCH_WANT_GENERAL_HUGETLB
>   config ARCH_WANTS_THP_SWAP
>          def_bool n
>   
> -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
> +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>          def_bool n
>   
> +config HUGE_ZERO_PAGE_ALWAYS

Likely something like

PMD_ZERO_PAGE

Will be a lot clearer.

> +       def_bool y
> +       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS

I suspect it should then also be independent of HUGETLB_PAGE?

> +       help
> +         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
> +         on demand and deallocated when not in use. This option will always
> +         allocate huge_zero_folio for zeroing and it is never deallocated.
> +         Not suitable for memory constrained systems.

I assume that code then has to live in mm/memory.c ?


-- 
Cheers,

David / dhildenb



* Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
  2025-05-22 12:04     ` David Hildenbrand
@ 2025-05-22 12:34       ` Pankaj Raghav (Samsung)
  2025-05-22 12:50         ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-05-22 12:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Pankaj Raghav, Suren Baghdasaryan, Vlastimil Babka,
	Ryan Roberts, Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Zi Yan, Dave Hansen, Lorenzo Stoakes, Andrew Morton,
	Liam R . Howlett, Jens Axboe, linux-block, linux-fsdevel,
	Darrick J . Wong, gost.dev, hch, linux-kernel, linux-mm, willy,
	x86, mcgrof

Hi David,

> >   config ARCH_WANTS_THP_SWAP
> >          def_bool n
> > -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
> > +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> >          def_bool n
> > +config HUGE_ZERO_PAGE_ALWAYS
> 
> Likely something like
> 
> PMD_ZERO_PAGE
> 
> Will be a lot clearer.

Sounds much better :)

> 
> > +       def_bool y
> > +       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> 
> I suspect it should then also be independent of HUGETLB_PAGE?

You are right. So we don't depend on any of these features.

> 
> > +       help
> > +         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
> > +         on demand and deallocated when not in use. This option will always
> > +         allocate huge_zero_folio for zeroing and it is never deallocated.
> > +         Not suitable for memory constrained systems.
> 
> I assume that code then has to live in mm/memory.c ?

Hmm, then huge_zero_folio should have always been in mm/memory.c to
begin with?

I assume this was placed in mm/huge_memory.c because the users of
huge_zero_folio have been a part of mm/huge_memory.c?

So IIUC your comment, we should move the huge_zero_page_init() in the
first patch to mm/memory.c, and the existing shrinker code can stay
where it already is?

--
Pankaj


* Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
  2025-05-22 12:34       ` Pankaj Raghav (Samsung)
@ 2025-05-22 12:50         ` David Hildenbrand
  2025-05-22 13:34           ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-05-22 12:50 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Mike Rapoport, Pankaj Raghav, Suren Baghdasaryan, Vlastimil Babka,
	Ryan Roberts, Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Zi Yan, Dave Hansen, Lorenzo Stoakes, Andrew Morton,
	Liam R . Howlett, Jens Axboe, linux-block, linux-fsdevel,
	Darrick J . Wong, gost.dev, hch, linux-kernel, linux-mm, willy,
	x86, mcgrof

On 22.05.25 14:34, Pankaj Raghav (Samsung) wrote:
> Hi David,
> 
>>>    config ARCH_WANTS_THP_SWAP
>>>           def_bool n
>>> -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
>>> +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>>>           def_bool n
>>> +config HUGE_ZERO_PAGE_ALWAYS
>>
>> Likely something like
>>
>> PMD_ZERO_PAGE
>>
>> Will be a lot clearer.
> 
> Sounds much better :)

And maybe something like

"STATIC_PMD_ZERO_PAGE"

would be even clearer.

The other one would be the dynamic one.

> 
>>
>>> +       def_bool y
>>> +       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>>
>> I suspect it should then also be independent of HUGETLB_PAGE?
> 
> You are right. So we don't depend on any of these features.
> 
>>
>>> +       help
>>> +         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
>>> +         on demand and deallocated when not in use. This option will always
>>> +         allocate huge_zero_folio for zeroing and it is never deallocated.
>>> +         Not suitable for memory constrained systems.
>>
>> I assume that code then has to live in mm/memory.c ?
> 
> Hmm, then huge_zero_folio should have always been in mm/memory.c to
> begin with?
> 

It's complicated. Only do_huge_pmd_anonymous_page() (and fsdax) really 
uses it, and it may only get mapped into a process under certain 
conditions (related to THP / PMD handling).

> I assume probably this was placed in mm/huge_memory.c because the users
> of this huge_zero_folio has been a part of mm/huge_memory.c?

Yes.

> 
> So IIUC your comment, we should move the huge_zero_page_init() in the
> first patch to mm/memory.c and the existing shrinker code can be a part
> where they already are?

Good question. At least the "static" part can easily be moved over. 
Maybe the dynamic part as well.

Worth trying it out and seeing how it looks :)

-- 
Cheers,

David / dhildenb



* Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
  2025-05-22 12:50         ` David Hildenbrand
@ 2025-05-22 13:34           ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 10+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-05-22 13:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Pankaj Raghav, Suren Baghdasaryan, Vlastimil Babka,
	Ryan Roberts, Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain,
	Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
	Zi Yan, Dave Hansen, Lorenzo Stoakes, Andrew Morton,
	Liam R . Howlett, Jens Axboe, linux-block, linux-fsdevel,
	Darrick J . Wong, gost.dev, hch, linux-kernel, linux-mm, willy,
	x86, mcgrof

On Thu, May 22, 2025 at 02:50:20PM +0200, David Hildenbrand wrote:
> On 22.05.25 14:34, Pankaj Raghav (Samsung) wrote:
> > Hi David,
> > 
> > > >    config ARCH_WANTS_THP_SWAP
> > > >           def_bool n
> > > > -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
> > > > +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> > > >           def_bool n
> > > > +config HUGE_ZERO_PAGE_ALWAYS
> > > 
> > > Likely something like
> > > 
> > > PMD_ZERO_PAGE
> > > 
> > > Will be a lot clearer.
> > 
> > Sounds much better :)
> 
> And maybe something like
> 
> "STATIC_PMD_ZERO_PAGE"
> 
> would be even clearer.
> 
> The other one would be the dynamic one.

Got it.
So if I understand correctly, we are going to have two huge zero pages:
- one that is always allocated statically.
- the existing dynamic one, which will still be there for the existing users.

> 
> > 
> > > 
> > > > +       def_bool y
> > > > +       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> > > 
> > > I suspect it should then also be independent of HUGETLB_PAGE?
> > 
> > You are right. So we don't depend on any of these features.
> > 
> > > 
> > > > +       help
> > > > +         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
> > > > +         on demand and deallocated when not in use. This option will always
> > > > +         allocate huge_zero_folio for zeroing and it is never deallocated.
> > > > +         Not suitable for memory constrained systems.
> > > 
> > > I assume that code then has to live in mm/memory.c ?
> > 
> > Hmm, then huge_zero_folio should have always been in mm/memory.c to
> > begin with?
> > 
> 
> It's complicated. Only do_huge_pmd_anonymous_page() (and fsdax) really uses
> it, and it may only get mapped into a process under certain conditions
> (related to THP / PMD handling).
> 
Got it.
> > 
> > So IIUC your comment, we should move the huge_zero_page_init() in the
> > first patch to mm/memory.c and the existing shrinker code can be a part
> > where they already are?
> 
> Good question. At least the "static" part can easily be moved over. Maybe
> the dynamic part as well.
> 
> Worth trying it out and seeing how it looks :)

Challenge accepted ;) Thanks for the comments David.

--
Pankaj

