From: David Hildenbrand <david@redhat.com>
To: "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com>,
Suren Baghdasaryan <surenb@google.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@redhat.com>,
"H . Peter Anvin" <hpa@zytor.com>,
Vlastimil Babka <vbabka@suse.cz>, Zi Yan <ziy@nvidia.com>,
Mike Rapoport <rppt@kernel.org>,
Dave Hansen <dave.hansen@linux.intel.com>,
Michal Hocko <mhocko@suse.com>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>,
Nico Pache <npache@redhat.com>, Dev Jain <dev.jain@arm.com>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Jens Axboe <axboe@kernel.dk>
Cc: linux-kernel@vger.kernel.org, willy@infradead.org,
linux-mm@kvack.org, x86@kernel.org, linux-block@vger.kernel.org,
linux-fsdevel@vger.kernel.org,
"Darrick J . Wong" <djwong@kernel.org>,
mcgrof@kernel.org, gost.dev@samsung.com, hch@lst.de,
Pankaj Raghav <p.raghav@samsung.com>
Subject: Re: [PATCH v2 3/5] mm: add static PMD zero page
Date: Tue, 15 Jul 2025 16:53:23 +0200
Message-ID: <fbcb6038-43a9-4d47-8cf7-f5ca32824079@redhat.com>
In-Reply-To: <26fded53-b79d-4538-bc56-3d2055eb5d62@redhat.com>
On 15.07.25 16:21, David Hildenbrand wrote:
> On 07.07.25 16:23, Pankaj Raghav (Samsung) wrote:
>> From: Pankaj Raghav <p.raghav@samsung.com>
>>
>> There are many places in the kernel where we need to zero out larger
>> chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
>> is limited by PAGE_SIZE.
>>
>> This is especially annoying in block devices and filesystems where we
>> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
>> bvec support in the block layer, it is much more efficient to send out
>> larger zero pages as part of a single bvec.
>>
>> This concern was raised during the review of adding LBS support to
>> XFS[1][2].
>>
>> Usually huge_zero_folio is allocated on demand, and it will be
>> deallocated by the shrinker if there are no users of it left. At the
>> moment, the huge_zero_folio refcount is tied to the lifetime of the
>> process that created it. This might not work for the bio layer, as the
>> completions can be async and the process that created the
>> huge_zero_folio might no longer be alive.
>
> Of course, what we could do is indicate that there is an untracked
> reference to the huge zero folio, and then simply refuse to free it for
> all eternity.
>
> Essentially, any non-mm reference -> un-shrinkable.
>
> We'd still be allocating the huge zero folio dynamically. We could try
> allocating it on first use, either from memblock, or from the buddy if
> the buddy is already up.
>
> Then, we'd only need a config option to allow for that to happen.
Something incomplete and very hacky, just to give an idea: it would try
allocating the folio once there is actual code running that would need it,
and then have it stick around forever.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e0a27f80f390d..357e29e98d8d2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -481,6 +481,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
+extern atomic_t huge_zero_folio_is_static;
static inline bool is_huge_zero_folio(const struct folio *folio)
{
@@ -499,6 +500,17 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
void mm_put_huge_zero_folio(struct mm_struct *mm);
+struct folio *__get_static_huge_zero_folio(void);
+
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	if (!IS_ENABLED(CONFIG_STATIC_HUGE_ZERO_FOLIO))
+		return NULL;
+	/* Fast path: the folio was already allocated and made static. */
+	if (likely(atomic_read(&huge_zero_folio_is_static)))
+		return huge_zero_folio;
+	return __get_static_huge_zero_folio();
+}
static inline bool thp_migration_supported(void)
{
@@ -509,7 +520,6 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd, bool freeze);
bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct folio *folio);
-
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline bool folio_test_pmd_mappable(struct folio *folio)
@@ -690,6 +700,11 @@ static inline int change_huge_pud(struct mmu_gather *tlb,
{
return 0;
}
+
+static inline struct folio *get_static_huge_zero_folio(void)
+{
+	return NULL;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline int split_folio_to_list_to_order(struct folio *folio,
@@ -703,4 +718,14 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
return split_folio_to_list_to_order(folio, NULL, new_order);
}
+static inline struct folio *largest_zero_folio(void)
+{
+	struct folio *folio;
+
+	folio = get_static_huge_zero_folio();
+	if (folio)
+		return folio;
+	return page_folio(ZERO_PAGE(0));
+}
+
#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 31b5c4e61a574..eb49c69f9c8e2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -77,6 +77,7 @@ static bool split_underused_thp = true;
static atomic_t huge_zero_refcount;
struct folio *huge_zero_folio __read_mostly;
unsigned long huge_zero_pfn __read_mostly = ~0UL;
+atomic_t huge_zero_folio_is_static __read_mostly;
unsigned long huge_anon_orders_always __read_mostly;
unsigned long huge_anon_orders_madvise __read_mostly;
unsigned long huge_anon_orders_inherit __read_mostly;
@@ -266,6 +267,25 @@ void mm_put_huge_zero_folio(struct mm_struct *mm)
put_huge_zero_page();
}
+#ifdef CONFIG_STATIC_HUGE_ZERO_FOLIO
+struct folio *__get_static_huge_zero_folio(void)
+{
+	/*
+	 * Our raised reference will prevent the shrinker from ever having
+	 * success -> effectively static.
+	 */
+	if (atomic_read(&huge_zero_folio_is_static))
+		return huge_zero_folio;
+	/* TODO: memblock allocation if the buddy is not up yet? Or reject that earlier. */
+	if (!get_huge_zero_page())
+		return NULL;
+	/* If we lost the race, someone else made it static; drop our reference. */
+	if (atomic_cmpxchg(&huge_zero_folio_is_static, 0, 1) != 0)
+		put_huge_zero_page();
+	return huge_zero_folio;
+}
+#endif /* CONFIG_STATIC_HUGE_ZERO_FOLIO */
+
static unsigned long shrink_huge_zero_page_count(struct shrinker *shrink,
struct shrink_control *sc)
{
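
For context, the consumer side (patch 5/5 in this series) could then look
roughly like the sketch below. To be clear, this is a hypothetical
illustration, not part of the patch: only largest_zero_folio(),
bio_add_folio() and folio_size() are existing API; the function name and
the loop structure are made up.

static void sketch_zero_fill_bio(struct bio *bio, sector_t nr_sects)
{
	/* Either the static PMD zero folio or, as fallback, ZERO_PAGE(0). */
	struct folio *zero_folio = largest_zero_folio();

	while (nr_sects) {
		size_t len = min_t(sector_t, nr_sects << SECTOR_SHIFT,
				   folio_size(zero_folio));

		/*
		 * No get/put per call: a static huge zero folio can never
		 * get freed, so async completions are fine.
		 */
		if (!bio_add_folio(bio, zero_folio, len, 0))
			break;
		nr_sects -= len >> SECTOR_SHIFT;
	}
}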
--
Cheers,
David / dhildenb