From: David Hildenbrand <david@redhat.com>
To: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Pankaj Raghav (Samsung)" <kernel@pankajraghav.com>
Cc: Suren Baghdasaryan <surenb@google.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@redhat.com>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>,
	Mike Rapoport <rppt@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Michal Hocko <mhocko@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Nico Pache <npache@redhat.com>, Dev Jain <dev.jain@arm.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Jens Axboe <axboe@kernel.dk>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	willy@infradead.org, x86@kernel.org, linux-block@vger.kernel.org,
	Ritesh Harjani <ritesh.list@gmail.com>,
	linux-fsdevel@vger.kernel.org,
	"Darrick J . Wong" <djwong@kernel.org>,
	mcgrof@kernel.org, gost.dev@samsung.com, hch@lst.de,
	Pankaj Raghav <p.raghav@samsung.com>
Subject: Re: [PATCH 3/5] mm: add static huge zero folio
Date: Mon, 4 Aug 2025 19:07:06 +0200	[thread overview]
Message-ID: <70049abc-bf79-4d04-a0a8-dd3787195986@redhat.com> (raw)
In-Reply-To: <4463bc75-486d-4034-a19e-d531bec667e8@lucifer.local>

On 04.08.25 18:46, Lorenzo Stoakes wrote:
> On Mon, Aug 04, 2025 at 02:13:54PM +0200, Pankaj Raghav (Samsung) wrote:
>> From: Pankaj Raghav <p.raghav@samsung.com>
>>
>> There are many places in the kernel where we need to zero out larger
>> chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
>> is limited to PAGE_SIZE.
>>
>> This is especially annoying in block devices and filesystems, where we
>> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
>> bvec support in the block layer, it is much more efficient to send out
>> larger zero pages as part of a single bvec.
>>
>> This concern was raised during the review of adding LBS support to
>> XFS[1][2].
>>
>> Usually, huge_zero_folio is allocated on demand and deallocated by the
>> shrinker once there are no users of it left. At the moment, the
>> huge_zero_folio refcount is tied to the lifetime of the process that
>> created it. This might not work for the bio layer, as completions can be
>> asynchronous and the process that created the huge_zero_folio might no
>> longer be alive. One of the main points raised during the discussion is
>> to have something bigger than the zero page as a drop-in replacement.
>>
>> Add a config option STATIC_HUGE_ZERO_FOLIO that results in allocating the
>> huge zero folio on first request, if not already allocated, and making it
>> static so that it can never get freed. This allows the huge_zero_folio to
>> be used without passing any mm struct and does not tie the lifetime of the
>> zero folio to anything, making it a drop-in replacement for ZERO_PAGE.
>>
>> If the STATIC_HUGE_ZERO_FOLIO config option is enabled,
>> mm_get_huge_zero_folio() will simply return this page instead of
>> dynamically allocating a new PMD page.
>>
>> This option can waste memory on small systems or systems with a 64k base
>> page size, so make it opt-in and also add a per-architecture option so
>> that we don't enable this feature on systems with a larger base page
>> size. Only x86 is enabled as part of this series; other architectures
>> will be enabled as a follow-up to this series.
>>
>> [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
>> [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
>>
>> Co-developed-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
>> ---
>>   arch/x86/Kconfig        |  1 +
>>   include/linux/huge_mm.h | 18 ++++++++++++++++
>>   mm/Kconfig              | 21 +++++++++++++++++++
>>   mm/huge_memory.c        | 46 ++++++++++++++++++++++++++++++++++++++++-
>>   4 files changed, 85 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 0ce86e14ab5e..8e2aa1887309 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -153,6 +153,7 @@ config X86
>>   	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
>>   	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
>>   	select ARCH_WANTS_THP_SWAP		if X86_64
>> +	select ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO if X86_64
>>   	select ARCH_HAS_PARANOID_L1D_FLUSH
>>   	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>>   	select BUILDTIME_TABLE_SORT
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 7748489fde1b..78ebceb61d0e 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -476,6 +476,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
>>
>>   extern struct folio *huge_zero_folio;
>>   extern unsigned long huge_zero_pfn;
>> +extern atomic_t huge_zero_folio_is_static;
> 
> Really don't love having globals like this - please can we have a helper
> function that tells you this rather than extern-ing it?
> 
> Also, we're not checking CONFIG_STATIC_HUGE_ZERO_FOLIO but are still
> exposing this value, which a helper function would also avoid.
> 
>>
>>   static inline bool is_huge_zero_folio(const struct folio *folio)
>>   {
>> @@ -494,6 +495,18 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
>>
>>   struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
>>   void mm_put_huge_zero_folio(struct mm_struct *mm);
>> +struct folio *__get_static_huge_zero_folio(void);
> 
> Why are we declaring a static inline function prototype that we then
> implement immediately below?
> 
>> +
>> +static inline struct folio *get_static_huge_zero_folio(void)
>> +{
>> +	if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
>> +		return NULL;
>> +
>> +	if (likely(atomic_read(&huge_zero_folio_is_static)))
>> +		return huge_zero_folio;
>> +
>> +	return __get_static_huge_zero_folio();
>> +}
>>
>>   static inline bool thp_migration_supported(void)
>>   {
>> @@ -685,6 +698,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
>>   {
>>   	return 0;
>>   }
>> +
>> +static inline struct folio *get_static_huge_zero_folio(void)
>> +{
>> +	return NULL;
>> +}
>>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>>   static inline int split_folio_to_list_to_order(struct folio *folio,
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index e443fe8cd6cf..366a6d2d771e 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -823,6 +823,27 @@ config ARCH_WANT_GENERAL_HUGETLB
>>   config ARCH_WANTS_THP_SWAP
>>   	def_bool n
>>
>> +config ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO
>> +	def_bool n
>> +
>> +config STATIC_HUGE_ZERO_FOLIO
>> +	bool "Allocate a PMD sized folio for zeroing"
>> +	depends on ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO && TRANSPARENT_HUGEPAGE
>> +	help
>> +	  Without this config enabled, the huge zero folio is allocated on
>> +	  demand and freed under memory pressure once no longer in use.
>> +	  To detect remaining users reliably, references to the huge zero folio
>> +	  must be tracked precisely, so it is commonly only available for mapping
>> +	  it into user page tables.
>> +
>> +	  With this config enabled, the huge zero folio can also be used
>> +	  for other purposes that do not implement precise reference counting:
>> +	  it is still allocated on demand, but never freed, allowing for more
>> +	  wide-spread use, for example, when performing I/O similar to the
>> +	  traditional shared zeropage.
>> +
>> +	  Not suitable for memory constrained systems.
>> +
>>   config MM_ID
>>   	def_bool n
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index ff06dee213eb..e117b280b38d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -75,6 +75,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>   static bool split_underused_thp = true;
>>
>>   static atomic_t huge_zero_refcount;
>> +atomic_t huge_zero_folio_is_static __read_mostly;
>>   struct folio *huge_zero_folio __read_mostly;
>>   unsigned long huge_zero_pfn __read_mostly = ~0UL;
>>   unsigned long huge_anon_orders_always __read_mostly;
>> @@ -266,6 +267,45 @@ void mm_put_huge_zero_folio(struct mm_struct *mm)
>>   		put_huge_zero_folio();
>>   }
>>
>> +#ifdef CONFIG_STATIC_HUGE_ZERO_FOLIO
>> +
> 
> Extremely tiny silly nit - there's a blank line below this, but not under
> the #endif; let's remove this line.
> 
>> +struct folio *__get_static_huge_zero_folio(void)
>> +{
>> +	static unsigned long fail_count_clear_timer;
>> +	static atomic_t huge_zero_static_fail_count __read_mostly;
>> +
>> +	if (unlikely(!slab_is_available()))
>> +		return NULL;
>> +
>> +	/*
>> +	 * If we failed to allocate a huge zero folio, just refrain from
>> +	 * trying for one minute before retrying to get a reference again.
>> +	 */
>> +	if (atomic_read(&huge_zero_static_fail_count) > 1) {
>> +		if (time_before(jiffies, fail_count_clear_timer))
>> +			return NULL;
>> +		atomic_set(&huge_zero_static_fail_count, 0);
>> +	}
> 
> Yeah, I really don't like this. This seems overly complicated and too
> fiddly. Also, if I want a static PMD, do I want to wait a minute for the
> next attempt?
> 
> Also, doing things this way we might end up with:
> 
> 0. Enabling CONFIG_STATIC_HUGE_ZERO_FOLIO.
> 1. Not doing anything that needs a static PMD for a while + getting fragmentation.
> 2. Doing something that needs it - oops, we can't get an order-9 page, and we're
>     waiting 60 seconds between attempts.
> 3. This is silent, so you think you have it switched on but are actually getting
>     bad performance.
> 
> I appreciate wanting to reuse this code, but we need to find a way to do this
> really, really early, and get rid of this arbitrary timeout. It's very arbitrary,
> and we have no easy way of tracing how this might behave under a workload.
> 
> Also, we end up pinning an order-9 page either way, so there's no harm in
> getting it first thing?

What we could do, to avoid messing with memblock and having two ways of initializing a huge zero folio early, is just disable the shrinker.

Downside is that the page is then really static (allocated unconditionally, not only once it has actually been used at least once). I like it:


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0ce86e14ab5e1..8e2aa18873098 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
  	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP	if X86_64
  	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
  	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO if X86_64
  	select ARCH_HAS_PARANOID_L1D_FLUSH
  	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  	select BUILDTIME_TABLE_SORT
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b7..ccfa5c95f14b1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -495,6 +495,17 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
  struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
  void mm_put_huge_zero_folio(struct mm_struct *mm);
  
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
+		return NULL;
+
+	if (unlikely(!huge_zero_folio))
+		return NULL;
+
+	return huge_zero_folio;
+}
+
  static inline bool thp_migration_supported(void)
  {
  	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
@@ -685,6 +696,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
  {
  	return 0;
  }
+
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	return NULL;
+}
  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
  
  static inline int split_folio_to_list_to_order(struct folio *folio,
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf2..366a6d2d771e3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -823,6 +823,27 @@ config ARCH_WANT_GENERAL_HUGETLB
  config ARCH_WANTS_THP_SWAP
  	def_bool n
  
+config ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO
+	def_bool n
+
+config STATIC_HUGE_ZERO_FOLIO
+	bool "Allocate a PMD sized folio for zeroing"
+	depends on ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO && TRANSPARENT_HUGEPAGE
+	help
+	  Without this config enabled, the huge zero folio is allocated on
+	  demand and freed under memory pressure once no longer in use.
+	  To detect remaining users reliably, references to the huge zero folio
+	  must be tracked precisely, so it is commonly only available for mapping
+	  it into user page tables.
+
+	  With this config enabled, the huge zero folio can also be used
+	  for other purposes that do not implement precise reference counting:
+	  it is allocated statically and never freed, allowing for more
+	  wide-spread use, for example, when performing I/O similar to the
+	  traditional shared zeropage.
+
+	  Not suitable for memory constrained systems.
+
  config MM_ID
  	def_bool n
  
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ff06dee213eb2..f65ba3e6f0824 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -866,9 +866,14 @@ static int __init thp_shrinker_init(void)
  	huge_zero_folio_shrinker->scan_objects = shrink_huge_zero_folio_scan;
  	shrinker_register(huge_zero_folio_shrinker);
  
-	deferred_split_shrinker->count_objects = deferred_split_count;
-	deferred_split_shrinker->scan_objects = deferred_split_scan;
-	shrinker_register(deferred_split_shrinker);
+	if (IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO)) {
+		if (!get_huge_zero_folio())
+			pr_warn("Allocating static huge zero folio failed\n");
+	} else {
+		deferred_split_shrinker->count_objects = deferred_split_count;
+		deferred_split_shrinker->scan_objects = deferred_split_scan;
+		shrinker_register(deferred_split_shrinker);
+	}
  
  	return 0;
  }
-- 
2.50.1
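
Just to illustrate how a caller could consume that helper (rough sketch only, not part of the diff above, and assuming the usual <linux/bio.h> / <linux/huge_mm.h> context; the function name add_zero_folio_pages() is made up, and the loop merely mirrors what __blkdev_issue_zero_pages() does with ZERO_PAGE today):

static void add_zero_folio_pages(struct bio *bio, sector_t nr_sects)
{
	struct folio *zero_folio = get_static_huge_zero_folio();
	/* Fall back to the shared zeropage if the huge zero folio is unavailable. */
	struct page *page = zero_folio ? folio_page(zero_folio, 0) : ZERO_PAGE(0);
	unsigned int max_len = zero_folio ? folio_size(zero_folio) : PAGE_SIZE;

	while (nr_sects) {
		unsigned int len = min_t(sector_t, max_len, nr_sects << SECTOR_SHIFT);

		/* Each (large) zero segment goes into a single bvec. */
		if (bio_add_page(bio, page, len, 0) != len)
			break;
		nr_sects -= len >> SECTOR_SHIFT;
	}
}

With a PMD-sized zero folio that is one bvec per 2 MiB on x86-64 instead of one per PAGE_SIZE, which is the whole point of the series.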


Now, one thing I do not like is that we have "ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO" but
then have a user-selectable option.

Should we just get rid of ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO?
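
That is, just something like the following (sketch only, dropping the arch gate entirely and keeping the help text from above unchanged):

config STATIC_HUGE_ZERO_FOLIO
	bool "Allocate a PMD sized folio for zeroing"
	depends on TRANSPARENT_HUGEPAGE
	help
	  ... (help text as above)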

-- 
Cheers,

David / dhildenb

