linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Dave Hansen <dave.hansen@intel.com>
To: "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@redhat.com>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>,
	Mike Rapoport <rppt@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Michal Hocko <mhocko@suse.com>,
	David Hildenbrand <david@redhat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Nico Pache <npache@redhat.com>, Dev Jain <dev.jain@arm.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Jens Axboe <axboe@kernel.dk>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	willy@infradead.org, x86@kernel.org, linux-block@vger.kernel.org,
	Ritesh Harjani <ritesh.list@gmail.com>,
	linux-fsdevel@vger.kernel.org,
	"Darrick J . Wong" <djwong@kernel.org>,
	mcgrof@kernel.org, gost.dev@samsung.com, hch@lst.de,
	Pankaj Raghav <p.raghav@samsung.com>
Subject: Re: [PATCH 3/5] mm: add static huge zero folio
Date: Tue, 5 Aug 2025 09:33:10 -0700	[thread overview]
Message-ID: <558da90d-e43d-464d-a3b6-02f6ee0de035@intel.com> (raw)
In-Reply-To: <20250804121356.572917-4-kernel@pankajraghav.com>

On 8/4/25 05:13, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> There are many places in the kernel where we need to zeroout larger
> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
> is limited by PAGE_SIZE.
...

In x86-land, the rules are pretty clear about using imperative voice.
There are quite a few "we's" in the changelog and comments in this series.

I do think they're generally good to avoid and do lead to more clarity,
but I'm also not sure how important that is in mm-land these days.


> +static inline struct folio *get_static_huge_zero_folio(void)
> +{
> +	if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
> +		return NULL;
> +
> +	if (likely(atomic_read(&huge_zero_folio_is_static)))
> +		return huge_zero_folio;
> +
> +	return __get_static_huge_zero_folio();
> +}

This seems like an ideal place to use 'struct static_key'.

>  static inline bool thp_migration_supported(void)
>  {
> @@ -685,6 +698,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
>  {
>  	return 0;
>  }
> +
> +static inline struct folio *get_static_huge_zero_folio(void)
> +{
> +	return NULL;
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  static inline int split_folio_to_list_to_order(struct folio *folio,
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e443fe8cd6cf..366a6d2d771e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -823,6 +823,27 @@ config ARCH_WANT_GENERAL_HUGETLB
>  config ARCH_WANTS_THP_SWAP
>  	def_bool n
>  
> +config ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO
> +	def_bool n
> +
> +config STATIC_HUGE_ZERO_FOLIO
> +	bool "Allocate a PMD sized folio for zeroing"
> +	depends on ARCH_WANTS_STATIC_HUGE_ZERO_FOLIO && TRANSPARENT_HUGEPAGE
> +	help
> +	  Without this config enabled, the huge zero folio is allocated on
> +	  demand and freed under memory pressure once no longer in use.
> +	  To detect remaining users reliably, references to the huge zero folio
> +	  must be tracked precisely, so it is commonly only available for mapping
> +	  it into user page tables.
> +
> +	  With this config enabled, the huge zero folio can also be used
> +	  for other purposes that do not implement precise reference counting:
> +	  it is still allocated on demand, but never freed, allowing for more
> +	  wide-spread use, for example, when performing I/O similar to the
> +	  traditional shared zeropage.
> +
> +	  Not suitable for memory constrained systems.

IMNHO, this is written like a changelog, not documentation for end users
trying to make sense of Kconfig options. I'd suggest keeping it short
and sweet:

config PERSISTENT_HUGE_ZERO_FOLIO
	bool "Allocate a persistent PMD-sized folio for zeroing"
	...
	help
	  Enable this option to reduce the runtime refcounting overhead
	  of the huge zero folio and expand the places in the kernel
	  that can use huge zero folios.

	  With this option enabled, the huge zero folio is allocated
	  once and never freed. It potentially wastes one huge page
	  worth of memory.

	  Say Y if your system has lots of memory. Say N if you are
	  memory constrained.

>  config MM_ID
>  	def_bool n
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ff06dee213eb..e117b280b38d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -75,6 +75,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  static bool split_underused_thp = true;
>  
>  static atomic_t huge_zero_refcount;
> +atomic_t huge_zero_folio_is_static __read_mostly;
>  struct folio *huge_zero_folio __read_mostly;
>  unsigned long huge_zero_pfn __read_mostly = ~0UL;
>  unsigned long huge_anon_orders_always __read_mostly;
> @@ -266,6 +267,45 @@ void mm_put_huge_zero_folio(struct mm_struct *mm)
>  		put_huge_zero_folio();
>  }
>  
> +#ifdef CONFIG_STATIC_HUGE_ZERO_FOLIO
> +
> +struct folio *__get_static_huge_zero_folio(void)
> +{
> +	static unsigned long fail_count_clear_timer;
> +	static atomic_t huge_zero_static_fail_count __read_mostly;
> +
> +	if (unlikely(!slab_is_available()))
> +		return NULL;
> +
> +	/*
> +	 * If we failed to allocate a huge zero folio, just refrain from
> +	 * trying for one minute before retrying to get a reference again.
> +	 */
> +	if (atomic_read(&huge_zero_static_fail_count) > 1) {
> +		if (time_before(jiffies, fail_count_clear_timer))
> +			return NULL;
> +		atomic_set(&huge_zero_static_fail_count, 0);
> +	}

Any reason that this is an open-coded ratelimit instead of using
'struct ratelimit_state'?

I also find the 'huge_zero_static_fail_count' use pretty unintuitive.
This is fundamentally a slow path. Ideally, it's called once. In the
pathological case, it's called once a minute.

I'd probably just recommend putting a rate limit on this function, then
using a plain old mutex for the actual allocation to keep multiple
threads out.

Then the function becomes something like this:

	if (__ratelimit(&huge_zero_alloc_ratelimit))
		return;

	guard(mutex)(&huge_zero_mutex);

	if (!get_huge_zero_folio())
		return NULL;

	static_key_enable(&huge_zero_noref_key);

	return huge_zero_folio;

No atomic, no cmpxchg, no races on allocating.


...
>  static unsigned long shrink_huge_zero_folio_count(struct shrinker *shrink,
>  						  struct shrink_control *sc)
>  {
> @@ -277,7 +317,11 @@ static unsigned long shrink_huge_zero_folio_scan(struct shrinker *shrink,
>  						 struct shrink_control *sc)
>  {
>  	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
> -		struct folio *zero_folio = xchg(&huge_zero_folio, NULL);
> +		struct folio *zero_folio;
> +
> +		if (WARN_ON_ONCE(atomic_read(&huge_zero_folio_is_static)))
> +			return 0;
> +		zero_folio = xchg(&huge_zero_folio, NULL);
>  		BUG_ON(zero_folio == NULL);
>  		WRITE_ONCE(huge_zero_pfn, ~0UL);
>  		folio_put(zero_folio);

This seems like a hack to me. If you don't want the shrinker to run,
then deregister it. Keeping the refcount elevated is fine, but
repeatedly calling the shrinker to do atomic_cmpxchg() when you *know*
it will do nothing seems silly.

If you can't deregister the shrinker, at least use the static_key
approach and check the static key instead of doing futile cmpxchg's forever.


  parent reply	other threads:[~2025-08-05 16:33 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-04 12:13 [PATCH 0/5] add static huge zero folio support Pankaj Raghav (Samsung)
2025-08-04 12:13 ` [PATCH 1/5] mm: rename huge_zero_page to huge_zero_folio Pankaj Raghav (Samsung)
2025-08-04 18:14   ` Zi Yan
2025-08-04 12:13 ` [PATCH 2/5] mm: rename MMF_HUGE_ZERO_PAGE to MMF_HUGE_ZERO_FOLIO Pankaj Raghav (Samsung)
2025-08-04 15:24   ` Lorenzo Stoakes
2025-08-04 16:20   ` David Hildenbrand
2025-08-04 18:04   ` Zi Yan
2025-08-04 12:13 ` [PATCH 3/5] mm: add static huge zero folio Pankaj Raghav (Samsung)
2025-08-04 16:46   ` Lorenzo Stoakes
2025-08-04 17:07     ` David Hildenbrand
2025-08-04 17:08       ` David Hildenbrand
2025-08-04 17:18       ` Lorenzo Stoakes
2025-08-05 10:55         ` David Hildenbrand
2025-08-05 11:40           ` Pankaj Raghav (Samsung)
2025-08-05 12:10             ` David Hildenbrand
2025-08-05 13:40               ` Lorenzo Stoakes
2025-08-06 12:18           ` Pankaj Raghav (Samsung)
2025-08-06 12:24             ` David Hildenbrand
2025-08-06 12:28               ` Pankaj Raghav (Samsung)
2025-08-06 12:36                 ` David Hildenbrand
2025-08-06 12:43                   ` Pankaj Raghav (Samsung)
2025-08-06 12:48                     ` David Hildenbrand
2025-08-05 16:33   ` Dave Hansen [this message]
2025-08-06  8:26     ` Pankaj Raghav (Samsung)
2025-08-04 12:13 ` [PATCH 4/5] mm: add largest_zero_folio() routine Pankaj Raghav (Samsung)
2025-08-04 16:50   ` Lorenzo Stoakes
2025-08-05 11:24     ` Pankaj Raghav (Samsung)
2025-08-04 18:13   ` Zi Yan
2025-08-05 16:42   ` Dave Hansen
2025-08-06  7:59     ` Pankaj Raghav (Samsung)
2025-08-04 12:13 ` [PATCH 5/5] block: use largest_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav (Samsung)
2025-08-04 16:53   ` Lorenzo Stoakes
2025-08-05 16:00 ` [PATCH 0/5] add static huge zero folio support Dave Hansen
2025-08-06  8:31   ` Pankaj Raghav (Samsung)
2025-08-06 11:28     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=558da90d-e43d-464d-a3b6-02f6ee0de035@intel.com \
    --to=dave.hansen@intel.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=dev.jain@arm.com \
    --cc=djwong@kernel.org \
    --cc=gost.dev@samsung.com \
    --cc=hch@lst.de \
    --cc=hpa@zytor.com \
    --cc=kernel@pankajraghav.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mcgrof@kernel.org \
    --cc=mhocko@suse.com \
    --cc=mingo@redhat.com \
    --cc=npache@redhat.com \
    --cc=p.raghav@samsung.com \
    --cc=ritesh.list@gmail.com \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=surenb@google.com \
    --cc=tglx@linutronix.de \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    --cc=x86@kernel.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).