* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
From: David Hildenbrand (Arm) @ 2026-06-08 14:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608100602-mutt-send-email-mst@kernel.org>

On 6/8/26 16:08, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 02:46:34PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 14:25, Lorenzo Stoakes wrote:
>>>
>>> Do not put comments about specific expected races like this in the commit
>>> message but not in the code. Subtleties need to be called out.
>>>
>>> The commit message also doesn't at all explain why PG_zeroed doesn't
>>> suffice here.
>>>
>>>
>>> I really don't understand why you have a 'zeroed' folio flag but need to
>>> also have new API calls to detect that?
>>>
>>> They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
>>> Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
>>> VM zeroed?
>>>
>>> Each are cases we address individually and relate to folios.
>>>
>>> You absolutely fail to clarify _which one_ you mean, and provide absolutely
>>> no documentation and add an exported mm API with no description.
>>>
>>> This is just I think not something we want to add? Especially on something
>>> so fundamental?
>>
>> I raised previously that providing a folio helper is odd, and that I suggested
>> that we defer this change.
> 
> Sadly it's a dependency actually - without it memcg failures would cause
> repeated re-zeroing where previously it failed without zeroing.

Oh, you mean that we succeeded allocating (+zeroing) but failed to charge?

I don't immediately see that to be a real problem?

-- 
Cheers,

David

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-08 14:26 UTC (permalink / raw)
  To: Lorenzo Stoakes, Matthew Wilcox
  Cc: Michael S. Tsirkin, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aia-q026facbptoT@lucifer>

On 6/8/26 15:09, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>> make life easier by actually breaking things out.
>>>
>>> Like:
>>>
>>> enum alloc_context_type {
>>> 	KERNEL_ALLOCATION,
>>> 	USER_MAPPED_ALLOCATION,
>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>> 	/* Perhaps some other states we want to encode? */
>>> };
>>>
>>> struct alloc_context {
>>> 	...
>>>
>>> 	enum alloc_context_type type;
>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>
>>> 	// Maybe something suggesting context or whether we init before in some
>>> 	// cases?
>>> };
>>
>> Ugh, please, no.  As I suggested last time I commented on this
>> trainwreck of a series, lift the zeroing functionality from
>> alloc_frozen_pages() into its callers.
> 
> I've not looked at the callers closely enough to see the delta on that, but if
> it avoids this mess then also worth looking at yes...

If that means that we would handle __GFP_ZERO consistently in the callers of
alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
user address down to some degree, through folio interfaces only at least.

-- 
Cheers,

David

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Matthew Wilcox @ 2026-06-08 14:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <cover.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> Further, on architectures with aliasing caches, upstream with init_on_alloc

Further to what?  Did you leave out some paragraphs here?

As far as I can tell, this patch series decides to trust that the
hypervisor has zeroed pages that it allocates to the guest.  But
as far as I can tell, the trend is towards less trust in the hypervisor
from the guest, not more.

* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Lorenzo Stoakes @ 2026-06-08 14:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin
In-Reply-To: <20260608094153-mutt-send-email-mst@kernel.org>

On Mon, Jun 08, 2026 at 09:48:34AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 10:43:21AM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:34:23AM -0400, Michael S. Tsirkin wrote:
> > > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > > update to page->flags can race with non-atomic flag operations
> > > that run under zone->lock in the buddy allocator.
> > >
> > > In particular, __free_pages_prepare() does:
> > >
> > >     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > >
> > > This non-atomic read-modify-write, while correctly excluding
> > > __PG_HWPOISON from the mask, can still lose a concurrent
> > > TestSetPageHWPoison if the read happens before the poison bit
> > > is set and the write happens after.  Follow-up patches in this
> > > series add similar non-atomic flag operations as well.
> > >
> > > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > > around ClearPageHWPoison in the retry path.  This
> > > serializes with all buddy flag manipulation.  The cost is
> > > negligible: one lock/unlock in an extremely rare path
> > > (hardware memory errors).
> > >
> > > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > > in this file operate on pages already removed from the buddy
> > > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > > need zone->lock protection.
> > >
> > > Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > Can we have Fixes: and Cc: stable and also send this separately please?
> >
> > These patches seem like unrelated fixups that you've discovered along the way,
> > and don't belong as part of the already rather large series, unless I'm missing
> > something here.
> >
> > Thanks, Lorenzo
>
> I think you are missing that they are a dependency, not unrelated.

Then say so.

> For example, this issue gets worse with the patchset as there are more
> places that manipulate flags without atomics. No?

It's your job to make that case, not mine.

>
>
> You are welcome to send this to stable, but I think stable rules
> preclude theoretical bugfixes.

It's a dependency but also theoretical?

>
> As for Fixes: the issue has been there for decades. I wouldn't know
> what to attribute it for.

Again, your job.

>
>
> I guess I could send these separately, too, why not. Not sure
> what this accomplishes, but hey.  But is that an ack? You want
> this fix merged even before the feature?

I already made the case as to why, as have other maintainers.

If you need to review what an ack looks like please consult
https://docs.kernel.org/process/5.Posting.html

Thanks, Lorenzo

* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
From: Michael S. Tsirkin @ 2026-06-08 14:08 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <8ec82477-b497-466c-902b-82e108ae2b7b@kernel.org>

On Mon, Jun 08, 2026 at 02:46:34PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 14:25, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:38:54AM -0400, Michael S. Tsirkin wrote:
> >> Add put_page_zeroed() / folio_put_zeroed() for callers that hold
> >> a reference to a page known to be zeroed.
> >>
> >> If this drops the last reference, the zeroed hint is
> >> propagated to the buddy allocator.  If someone else still holds a
> >> reference, the hint is simply lost - this is best-effort.
> >>
> >> This is useful for balloon drivers during deflation: the host
> >> has already zeroed the pages, and the balloon is typically the
> >> sole owner.  But if the page happens to be shared, silently
> >> dropping the hint is safe and avoids the need for callers to
> >> check the refcount.
> >>
> >> Note: put_page_zeroed uses folio_put_testzero() which only
> >> detects sole ownership at the instant of the atomic decrement.
> >> A concurrent reference holder (e.g. migration) means the hint
> >> is silently lost. This is by design: the zeroed hint is a
> >> performance optimization, not a correctness requirement.
> >> Losing it just means the next allocation re-zeroes the page.
> > 
> > Do not put comments about specific expected races like this in the commit
> > message but not in the code. Subtleties need to be called out.
> > 
> > The commit message also doesn't at all explain why PG_zeroed doesn't
> > suffice here.
> > 
> >>
> >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> Assisted-by: Claude:claude-opus-4-6
> > 
> > I really don't understand why you have a 'zeroed' folio flag but need to
> > also have new API calls to detect that?
> > 
> > They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
> > Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
> > VM zeroed?
> > 
> > Each are cases we address individually and relate to folios.
> > 
> > You absolutely fail to clarify _which one_ you mean, and provide absolutely
> > no documentation and add an exported mm API with no description.
> > 
> > This is just I think not something we want to add? Especially on something
> > so fundamental?
> 
> I raised previously that providing a folio helper is odd, and that I suggested
> that we defer this change.

Sadly it's a dependency actually - without it memcg failures would cause
repeated re-zeroing where previously it failed without zeroing.

It's the result of the whole GFP_ZERO idea.



> Maybe we'd want to add such an interface for frozen pages later (to be used by
> the balloon), but I don't think we want that for folios.
> 
> [1] https://lore.kernel.org/all/5f76af6e-9818-42ea-a305-c0fc1d920dca@kernel.org/
> 
> -- 
> Cheers,
> 
> David
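
The best-effort behaviour described in the quoted commit message can be sketched in userspace C11. This is only an illustration under stated assumptions: struct demo_page, its fields and demo_put_page_zeroed() are hypothetical names, and a plain atomic counter stands in for folio_put_testzero(). It is not the patch's implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct demo_page {
	atomic_int refcount;
	bool zeroed_hint;	/* models the hint handed to the allocator */
};

/*
 * Drop one reference. The hint is recorded only if this call dropped the
 * last reference; otherwise it is silently lost. Returns true if the hint
 * was recorded.
 */
bool demo_put_page_zeroed(struct demo_page *page)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	if (atomic_fetch_sub(&page->refcount, 1) == 1) {
		page->zeroed_hint = true;
		return true;
	}
	return false;
}
```

With two holders, the first put returns false and leaves the hint unset. Only the holder that brings the count to zero records it, so a concurrent reference costs a later re-zeroing and nothing else.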


* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-08 13:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin
In-Reply-To: <aiaOZoa2vSCz_0wY@lucifer>

On Mon, Jun 08, 2026 at 10:43:21AM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:34:23AM -0400, Michael S. Tsirkin wrote:
> > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > update to page->flags can race with non-atomic flag operations
> > that run under zone->lock in the buddy allocator.
> >
> > In particular, __free_pages_prepare() does:
> >
> >     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >
> > This non-atomic read-modify-write, while correctly excluding
> > __PG_HWPOISON from the mask, can still lose a concurrent
> > TestSetPageHWPoison if the read happens before the poison bit
> > is set and the write happens after.  Follow-up patches in this
> > series add similar non-atomic flag operations as well.
> >
> > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > around ClearPageHWPoison in the retry path.  This
> > serializes with all buddy flag manipulation.  The cost is
> > negligible: one lock/unlock in an extremely rare path
> > (hardware memory errors).
> >
> > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > in this file operate on pages already removed from the buddy
> > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > need zone->lock protection.
> >
> > Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> Can we have Fixes: and Cc: stable and also send this separately please?
> 
> These patches seem like unrelated fixups that you've discovered along the way,
> and don't belong as part of the already rather large series, unless I'm missing
> something here.
> 
> Thanks, Lorenzo

I think you are missing that they are a dependency, not unrelated.
For example, this issue gets worse with the patchset as there are more
places that manipulate flags without atomics. No?


You are welcome to send this to stable, but I think stable rules
preclude theoretical bugfixes.

As for Fixes: the issue has been there for decades. I wouldn't know
what to attribute it for.


I guess I could send these separately, too, why not. Not sure
what this accomplishes, but hey.  But is that an ack? You want
this fix merged even before the feature?



> > Assisted-by: Claude:claude-opus-4-6
> > ---
> >  mm/memory-failure.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index ee42d4361309..3880486028a1 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2348,6 +2348,8 @@ int memory_failure(unsigned long pfn, int flags)
> >  	unsigned long page_flags;
> >  	bool retry = true;
> >  	int hugetlb = 0;
> > +	struct zone *zone;
> > +	unsigned long mf_flags;
> >
> >  	if (!sysctl_memory_failure_recovery)
> >  		panic("Memory failure on page %lx", pfn);
> > @@ -2390,7 +2392,11 @@ int memory_failure(unsigned long pfn, int flags)
> >  	if (hugetlb)
> >  		goto unlock_mutex;
> >
> > +	/* Serialize with non-atomic buddy flag operations */
> > +	zone = page_zone(p);
> > +	spin_lock_irqsave(&zone->lock, mf_flags);
> >  	if (TestSetPageHWPoison(p)) {
> > +		spin_unlock_irqrestore(&zone->lock, mf_flags);
> >  		res = -EHWPOISON;
> >  		if (flags & MF_ACTION_REQUIRED)
> >  			res = kill_accessing_process(current, pfn, flags);
> > @@ -2399,6 +2405,7 @@ int memory_failure(unsigned long pfn, int flags)
> >  		action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
> >  		goto unlock_mutex;
> >  	}
> > +	spin_unlock_irqrestore(&zone->lock, mf_flags);
> >
> >  	/*
> >  	 * We need/can do nothing about count=0 pages.
> > @@ -2420,7 +2427,10 @@ int memory_failure(unsigned long pfn, int flags)
> >  			} else {
> >  				/* We lost the race, try again */
> >  				if (retry) {
> > +					/* Serialize with non-atomic buddy flag operations */
> > +					spin_lock_irqsave(&zone->lock, mf_flags);
> >  					ClearPageHWPoison(p);
> > +					spin_unlock_irqrestore(&zone->lock, mf_flags);
> >  					retry = false;
> >  					goto try_again;
> >  				}
> > --
> > MST
> >


* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Lorenzo Stoakes @ 2026-06-08 13:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aia93IzhTgi1vGi6@casper.infradead.org>

On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> > But instead of overloading user_addr to indicate all kinds of things, instead
> > make life easier by actually breaking things out.
> >
> > Like:
> >
> > enum alloc_context_type {
> > 	KERNEL_ALLOCATION,
> > 	USER_MAPPED_ALLOCATION,
> > 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> > 	/* Perhaps some other states we want to encode? */
> > };
> >
> > struct alloc_context {
> > 	...
> >
> > 	enum alloc_context_type type;
> > 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> >
> > 	// Maybe something suggesting context or whether we init before in some
> > 	// cases?
> > };
>
> Ugh, please, no.  As I suggested last time I commented on this
> trainwreck of a series, lift the zeroing functionality from
> alloc_frozen_pages() into its callers.

I've not looked at the callers closely enough to see the delta on that, but if
it avoids this mess then also worth looking at yes...

Cheers, Lorenzo

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Matthew Wilcox @ 2026-06-08 13:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiagMc9Ng_AE_adh@lucifer>

On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> But instead of overloading user_addr to indicate all kinds of things, instead
> make life easier by actually breaking things out.
> 
> Like:
> 
> enum alloc_context_type {
> 	KERNEL_ALLOCATION,
> 	USER_MAPPED_ALLOCATION,
> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> 	/* Perhaps some other states we want to encode? */
> };
> 
> struct alloc_context {
> 	...
> 
> 	enum alloc_context_type type;
> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> 
> 	// Maybe something suggesting context or whether we init before in some
> 	// cases?
> };

Ugh, please, no.  As I suggested last time I commented on this
trainwreck of a series, lift the zeroing functionality from
alloc_frozen_pages() into its callers.

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Lorenzo Stoakes @ 2026-06-08 12:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiaHd3T42XyB3UBn@lucifer>

On Mon, Jun 08, 2026 at 10:17:47AM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> > Further, on architectures with aliasing caches, upstream with init_on_alloc
> > double-zeros user pages: once via kernel_init_pages() in
> > post_alloc_hook, and again via clear_user_highpage() at the
> > callsite (because user_alloc_needs_zeroing() returns true).
> > This series eliminates that double-zeroing by moving the zeroing
> > into the post_alloc_hook + propagating the "host
> > already zeroed this page" information through the buddy allocator.
> >
> > For page reporting, VIRTIO_BALLOON_F_DEVICE_INIT_REPORTED (bit 6)
> > is used. For the inflate/deflate path,
> > VIRTIO_BALLOON_F_DEVICE_INIT_ON_INFLATE (bit 7) is used.
> >
> > Virtio spec: https://lore.kernel.org/all/cover.1778140241.git.mst@redhat.com
> >
> > Based on v7.1-rc6.  When applying on mm-unstable, two conflicts
> > are expected:
> > - kernel_init_pages() was renamed to clear_highpages_kasan_tagged()
> >   in mm-unstable.  Use clear_highpages_kasan_tagged() in the
> >   post_alloc_hook else branch.
> > - FPI_PREPARED uses BIT(3) in mm-unstable.  Bump FPI_ZEROED to
> >   BIT(4).
> > Build-tested on mm-unstable at e9dd96806dbc:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git zero-mm-unstable
> >
> > Patches 1-5: fixes/cleanups, dependencies of the zeroing patches.
> > Patches 6-9: thread user_addr through page allocator, contig API,
> >   and gigantic hugetlb allocation.
> > Patches 10-16: folio_zero_user in post_alloc_hook, vma_alloc_zeroed
> >   conversion, raw fault address threading.
> > Patches 17-24: PG_zeroed flag, aliasing guard, buddy merge/split
> >   tracking, FPI_ZEROED optimization, folio_put_zeroed.
> > Patches 25-27: __GFP_ZERO callsite conversions (alloc_anon_folio,
> >   vma_alloc_anon_folio_pmd) with memcg charge failure mitigation.
> > Patches 28-29: hugetlb __GFP_ZERO + HPG_zeroed.
> > Patches 30-35: page reporting zeroing (DEVICE_INIT_REPORTED),
> >   disable indirect descriptors.
> > Patches 36-37: inflate/deflate zeroing (DEVICE_INIT_ON_INFLATE).
>
> This seems far too much for one series.
>
> You're doing a bunch of mm stuff that seems relatively independent, then
> putting the virtio stuff on top.
>
> I think this should be broken out into separate series laying foundations
> rather than doing it all in one go, which is also difficult for review
> purposes.
>
> Adding a new folio flag is contentious also for instance, we maybe want to
> go bit-by-bit and ensure that each foundational element is acceptable
> before doing the next bit rather than having it as part of a big series.
>
> Looking through the changelog only adds to this feeling! Huge numbers of
> changes, even relatively recently and I'm not sure all relevant maintainers
> in mm have had a look through either.
>
> Thanks, Lorenzo

Additionally, it seems you've missed/ignored (I hope not) a bunch of
pre-existing feedback, so it'd be helpful if you'd carefully go through
what people have previously asked!

Keeping track of that, especially on a big series, is a lot of work.

Thanks, Lorenzo

* Re: [PATCH v10 03/37] mm: page_alloc: propagate PageReported flag across buddy splits
From: Matthew Wilcox @ 2026-06-08 12:50 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiaPBRm0Mp_WDPFa@lucifer>

On Mon, Jun 08, 2026 at 10:52:28AM +0100, Lorenzo Stoakes wrote:
> > -			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
> > +			split_large_buddy(zone, p, page_to_pfn(p), p_order,
> > +					  fpi_flags, false);
> 
> I don't love adding a mystery meat boolean parameter like this.

Particularly when we have the FPI flags already being passed.  Surely
this should just be another FPI flag?
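
A rough sketch of the flag-based alternative suggested above. Everything here is hypothetical: the demo_fpi_t type, the DEMO_FPI_* names and demo_split_large_buddy() are made up for illustration, and the quoted hunk does not show what the extra boolean controls, so DEMO_FPI_KEEP_REPORTED is only a guess at its intent.

```c
#include <stdbool.h>

/* Hypothetical flag type and bits; not the kernel's fpi_t definitions. */
typedef unsigned int demo_fpi_t;

#define DEMO_FPI_NONE          0u
#define DEMO_FPI_SKIP_NOTIFY   (1u << 0) /* models a pre-existing flag */
#define DEMO_FPI_KEEP_REPORTED (1u << 1) /* hypothetical new flag */

struct demo_split_result {
	bool notify_skipped;
	bool reported_kept;
};

/*
 * The behaviour travels in the flags word already being passed, so the
 * call site names what it asks for instead of passing a bare true/false.
 */
struct demo_split_result demo_split_large_buddy(demo_fpi_t fpi_flags)
{
	struct demo_split_result r = {
		.notify_skipped = (fpi_flags & DEMO_FPI_SKIP_NOTIFY) != 0,
		.reported_kept  = (fpi_flags & DEMO_FPI_KEEP_REPORTED) != 0,
	};

	return r;
}
```

A caller would then write demo_split_large_buddy(fpi_flags | DEMO_FPI_KEEP_REPORTED), which reads on its own without looking up the meaning of a trailing boolean.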


* Re: [PATCH v10 29/37] mm: memfd: skip zeroing for zeroed hugetlb pool pages
From: Lorenzo Stoakes @ 2026-06-08 12:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <0d6c2d31f48ff454223ad4f1d37ef7b73263bf5d.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:39:37AM -0400, Michael S. Tsirkin wrote:
> Add bool *zeroed output to alloc_hugetlb_folio_reserve() so
> callers can check whether the pool page is known-zero.  memfd's
> memfd_alloc_folio() uses this to skip the explicit folio_zero_user()
> when the page is already zero.

But why does memfd do that?

This is more AI-ish 'write out in English what the code does' which isn't
really helpful.

>
> This avoids redundant zeroing for memfd hugetlb pages that were
> pre-allocated into the pool and never mapped to userspace.

I think this should lead the commit message given it seems to be the whole
intent no?

>
> Note: HPG_zeroed is currently only set for surplus pages
> allocated with __GFP_ZERO (via alloc_surplus_hugetlb_folio),
> not for pool pages from alloc_pool_huge_folio. So the
> zeroed output from alloc_hugetlb_folio_reserve is typically
> false for pool-only reservations. It becomes true when
> surplus pages fill the reservation. The addr_hint 0 passed
> to folio_zero_user is acceptable for memfd: these pages are
> not mapped yet and will get proper dcache handling at mmap
> time via the page fault path.

This paragraph is really hard to read, and you don't seem to propagate the
same very specific information in the code so people maintaining it don't
know what's going on.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6

This is committing the sins of the rest and adding more complexity
throughout.

The whole approach needs a rework I think, but hugetlbfs stuff should be
deferred in general.

> ---
>  include/linux/cma.h     |  3 ++-
>  include/linux/hugetlb.h |  6 ++++--
>  mm/cma.c                |  6 ++++--
>  mm/hugetlb.c            | 11 +++++++++--
>  mm/hugetlb_cma.c        |  4 ++--
>  mm/memfd.c              | 14 ++++++++------
>  6 files changed, 29 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/cma.h b/include/linux/cma.h
> index 8555d38a97b1..dee88909cf5d 100644
> --- a/include/linux/cma.h
> +++ b/include/linux/cma.h
> @@ -53,7 +53,8 @@ extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long
>
>  struct page *cma_alloc_frozen(struct cma *cma, unsigned long count,
>  		unsigned int align, bool no_warn);
> -struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order);
> +struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order,
> +				       gfp_t caller_gfp);
>  bool cma_release_frozen(struct cma *cma, const struct page *pages,
>  		unsigned long count);
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 06d033a57a61..7eb529eabe99 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -708,7 +708,8 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>  				nodemask_t *nmask, gfp_t gfp_mask,
>  				bool allow_alloc_fallback);
>  struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> -					  nodemask_t *nmask, gfp_t gfp_mask);
> +					  nodemask_t *nmask, gfp_t gfp_mask,
> +					  bool *zeroed);
>
>  int hugetlb_add_to_page_cache(struct folio *folio, struct address_space *mapping,
>  			pgoff_t idx);
> @@ -1128,7 +1129,8 @@ static inline void wait_for_freed_hugetlb_folios(void)
>
>  static inline struct folio *
>  alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> -			    nodemask_t *nmask, gfp_t gfp_mask)
> +			    nodemask_t *nmask, gfp_t gfp_mask,
> +			    bool *zeroed)
>  {
>  	return NULL;
>  }
> diff --git a/mm/cma.c b/mm/cma.c
> index c7ca567f4c5c..27971f6264ab 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -924,9 +924,11 @@ struct page *cma_alloc_frozen(struct cma *cma, unsigned long count,
>  	return __cma_alloc_frozen(cma, count, align, gfp);
>  }
>
> -struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order)
> +struct page *cma_alloc_frozen_compound(struct cma *cma, unsigned int order,
> +				       gfp_t caller_gfp)
>  {
> -	gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_NOWARN;
> +	gfp_t gfp = GFP_KERNEL | __GFP_COMP | __GFP_NOWARN |
> +		    (caller_gfp & __GFP_ZERO);
>
>  	return __cma_alloc_frozen(cma, 1 << order, order, gfp);
>  }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed00db703911..a087e915783f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2196,7 +2196,7 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
>  }
>
>  struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
> -		nodemask_t *nmask, gfp_t gfp_mask)
> +		nodemask_t *nmask, gfp_t gfp_mask, bool *zeroed)
>  {
>  	struct folio *folio;
>
> @@ -2212,6 +2212,12 @@ struct folio *alloc_hugetlb_folio_reserve(struct hstate *h, int preferred_nid,
>  		h->resv_huge_pages--;
>
>  	spin_unlock_irq(&hugetlb_lock);
> +
> +	if (zeroed && folio) {
> +		*zeroed = folio_test_hugetlb_zeroed(folio);
> +		folio_clear_hugetlb_zeroed(folio);
> +	}
> +
>  	return folio;
>  }
>
> @@ -2296,7 +2302,8 @@ static int gather_surplus_pages(struct hstate *h, long delta)
>  		 * It is okay to use NUMA_NO_NODE because we use numa_mem_id()
>  		 * down the road to pick the current node if that is the case.
>  		 */
> -		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
> +		folio = alloc_surplus_hugetlb_folio(h,
> +						    htlb_alloc_mask(h),
>  						    NUMA_NO_NODE, &alloc_nodemask,
>  						    USER_ADDR_NONE);
>  		if (!folio) {
> diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
> index 7693ccefd0c6..c9266b25be3d 100644
> --- a/mm/hugetlb_cma.c
> +++ b/mm/hugetlb_cma.c
> @@ -35,14 +35,14 @@ struct folio *hugetlb_cma_alloc_frozen_folio(int order, gfp_t gfp_mask,
>  		return NULL;
>
>  	if (hugetlb_cma[nid])
> -		page = cma_alloc_frozen_compound(hugetlb_cma[nid], order);
> +		page = cma_alloc_frozen_compound(hugetlb_cma[nid], order, gfp_mask);
>
>  	if (!page && !(gfp_mask & __GFP_THISNODE)) {
>  		for_each_node_mask(node, *nodemask) {
>  			if (node == nid || !hugetlb_cma[node])
>  				continue;
>
> -			page = cma_alloc_frozen_compound(hugetlb_cma[node], order);
> +			page = cma_alloc_frozen_compound(hugetlb_cma[node], order, gfp_mask);
>  			if (page)
>  				break;
>  		}
> diff --git a/mm/memfd.c b/mm/memfd.c
> index abe13b291ddc..a99617a62e33 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -69,6 +69,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>  #ifdef CONFIG_HUGETLB_PAGE
>  	struct folio *folio;
>  	gfp_t gfp_mask;
> +	bool zeroed;
>
>  	if (is_file_hugepages(memfd)) {
>  		/*
> @@ -93,17 +94,18 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
>  		folio = alloc_hugetlb_folio_reserve(h,
>  						    numa_node_id(),
>  						    NULL,
> -						    gfp_mask);
> +						    gfp_mask,
> +						    &zeroed);
>  		if (folio) {
>  			u32 hash;
>
>  			/*
> -			 * Zero the folio to prevent information leaks to userspace.
> -			 * Use folio_zero_user() which is optimized for huge/gigantic
> -			 * pages. Pass 0 as addr_hint since this is not a faulting path
> -			 *  and we don't have a user virtual address yet.
> +			 * Zero the folio to prevent information leaks to
> +			 * userspace.  Skip if the pool page is known-zero
> +			 * (HPG_zeroed set during pool pre-allocation).
>  			 */
> -			folio_zero_user(folio, 0);
> +			if (!zeroed)
> +				folio_zero_user(folio, 0);
>
>  			/*
>  			 * Mark the folio uptodate before adding to page cache,
> --
> MST
>

Thanks, Lorenzo

* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
From: David Hildenbrand (Arm) @ 2026-06-08 12:46 UTC
  To: Lorenzo Stoakes, Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, Christoph Lameter,
	David Rientjes, Roman Gushchin, Harry Yoo, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiazXE8pUDeCmP7c@lucifer>

On 6/8/26 14:25, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:38:54AM -0400, Michael S. Tsirkin wrote:
>> Add put_page_zeroed() / folio_put_zeroed() for callers that hold
>> a reference to a page known to be zeroed.
>>
>> If this drops the last reference, the zeroed hint is
>> propagated to the buddy allocator.  If someone else still holds a
>> reference, the hint is simply lost - this is best-effort.
>>
>> This is useful for balloon drivers during deflation: the host
>> has already zeroed the pages, and the balloon is typically the
>> sole owner.  But if the page happens to be shared, silently
>> dropping the hint is safe and avoids the need for callers to
>> check the refcount.
>>
>> Note: put_page_zeroed uses folio_put_testzero() which only
>> detects sole ownership at the instant of the atomic decrement.
>> A concurrent reference holder (e.g. migration) means the hint
>> is silently lost. This is by design: the zeroed hint is a
>> performance optimization, not a correctness requirement.
>> Losing it just means the next allocation re-zeroes the page.
> 
> Do not put comments about specific expected races like this in the commit
> message but not in the code. Subtleties need to be called out.
> 
> The commit message also doesn't at all explain why PG_zeroed doesn't
> suffice here.
> 
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Assisted-by: Claude:claude-opus-4-6
> 
> I really don't understand why you have a 'zeroed' folio flag but need to
> also have new API calls to detect that?
> 
> They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
> Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
> VM zeroed?
> 
> Each are cases we address individually and relate to folios.
> 
> You absolutely fail to clarify _which one_ you mean, and provide absolutely
> no documentation and add an exported mm API with no description.
> 
> This is just I think not something we want to add? Especially on something
> so fundamental?

I raised previously [1] that providing a folio helper is odd, and suggested
that we defer this change.

Maybe we'd want to add such an interface for frozen pages later (to be used by
the balloon), but I don't think we want that for folios.
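
The best-effort behaviour described in the quoted commit message can be
modelled in userspace. What follows is a minimal sketch under stated
assumptions, not the patch's implementation: struct demo_page and every
demo_* helper are hypothetical stand-ins, and the "free path" is reduced to
two booleans. It only shows that the hint survives when the caller drops the
last reference and is silently lost otherwise.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page / struct folio. */
struct demo_page {
	atomic_int refcount;
	bool freed;		/* has the modelled free path run? */
	bool freed_as_zeroed;	/* did the free path see the zeroed hint? */
};

static void demo_page_init(struct demo_page *p, int refs)
{
	atomic_init(&p->refcount, refs);
	p->freed = false;
	p->freed_as_zeroed = false;
}

/* Drop one reference; true only for the caller that drops the last one. */
static bool demo_put_testzero(struct demo_page *p)
{
	return atomic_fetch_sub(&p->refcount, 1) == 1;
}

/* Plain put: frees on the last reference and never carries a hint. */
static void demo_put_page(struct demo_page *p)
{
	if (demo_put_testzero(p))
		p->freed = true;
}

/* Put with hint: the hint is used only if this was the last reference. */
static void demo_put_page_zeroed(struct demo_page *p)
{
	if (demo_put_testzero(p)) {
		p->freed = true;
		p->freed_as_zeroed = true;
	}
	/* Otherwise another holder remains and the hint is dropped. */
}
```

In this model a page with one reference ends up freed with the hint, while a
page with two references is later freed by demo_put_page() without it.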

[1] https://lore.kernel.org/all/5f76af6e-9818-42ea-a305-c0fc1d920dca@kernel.org/

-- 
Cheers,

David

* Re: [PATCH v10 28/37] mm: hugetlb: add gfp parameter and skip zeroing for zeroed pages
From: Lorenzo Stoakes @ 2026-06-08 12:44 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <f251c9e495cdf98169f225559927112103303137.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:39:26AM -0400, Michael S. Tsirkin wrote:
> Add a gfp_t parameter to alloc_hugetlb_folio(). When __GFP_ZERO
> is set, the function guarantees the returned folio is zeroed:
> - Fresh allocations (buddy or gigantic): zeroed by
>   post_alloc_hook via __GFP_ZERO, HPG_zeroed set by
>   alloc_surplus_hugetlb_folio.
> - Pool pages with HPG_zeroed set: already zeroed, skip.
> - Pool pages without HPG_zeroed: zeroed via folio_zero_user().
>
> The address parameter is renamed to user_addr; the function
> aligns it internally for reservation and NUMA policy lookups.
> For pages that need zeroing, user_addr is passed to
> folio_zero_user() for cache-friendly zeroing near the faulting
> subpage.  All callers pass a page-aligned address; the
> hugetlb_no_page caller passes vmf->real_address & PAGE_MASK
> for consistency.
>
> HPG_zeroed (stored in hugetlb folio->private bits) tracks
> known-zero pool pages. It is set when alloc_surplus_hugetlb_folio
> allocates with __GFP_ZERO, and cleared in free_huge_folio when
> the page returns to the pool after userspace use.
>
> Note: for gigantic CMA pages, __GFP_ZERO is passed through
> to cma_alloc_frozen_compound() via its caller_gfp parameter,
> so the pages ARE zeroed by the allocator. HPG_zeroed is only
> set when __GFP_ZERO was in the original gfp_mask.
> Pool pages allocated without __GFP_ZERO (e.g. by
> alloc_pool_huge_folio) do not get HPG_zeroed; they are zeroed
> later by folio_zero_user() at fault time.
>
> Note: with __GFP_ZERO, the folio is zeroed before
> mem_cgroup_charge_hugetlb().  If the charge fails, the zeroed
> folio is freed back.  Before this patch it is zeroed after charge, so
> simply freeing after zeroing would be a regression.  Thread a
> zeroed hint through free_huge_folio so surplus pages freed back
> to buddy preserve the zeroed state via free_frozen_pages_zeroed,
> avoiding redundant re-zeroing on the next allocation.
>
> Suggested-by: Gregory Price <gourry@gourry.net>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh

This whole set of hugetlb changes should be a separate series. Or really
deferred.

You very badly need to pare this down into the _minimum_ changes required rather
than AI go brrrr on everything.

And propagating _yet more_ 'zeroed' state seems unpleasant, do we have to?

> ---
>  fs/hugetlbfs/inode.c    |  3 +-
>  include/linux/hugetlb.h |  5 ++-
>  mm/hugetlb.c            | 78 +++++++++++++++++++++++++++--------------
>  3 files changed, 57 insertions(+), 29 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 78d61bf2bd9b..2c0c51fe9ec3 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -790,13 +790,12 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>  		 * folios in these areas, we need to consume the reserves
>  		 * to keep reservation accounting consistent.
>  		 */
> -		folio = alloc_hugetlb_folio(&pseudo_vma, addr, false);
> +		folio = alloc_hugetlb_folio(&pseudo_vma, addr, false, __GFP_ZERO);
>  		if (IS_ERR(folio)) {
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			error = PTR_ERR(folio);
>  			goto out;
>  		}
> -		folio_zero_user(folio, addr);
>  		__folio_mark_uptodate(folio);
>  		error = hugetlb_add_to_page_cache(folio, mapping, index);
>  		if (unlikely(error)) {
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1f7ae6609e51..06d033a57a61 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -593,6 +593,7 @@ enum hugetlb_page_flags {
>  	HPG_vmemmap_optimized,
>  	HPG_raw_hwp_unreliable,
>  	HPG_cma,
> +	HPG_zeroed,
>  	__NR_HPAGEFLAGS,
>  };
>
> @@ -653,6 +654,7 @@ HPAGEFLAG(Freed, freed)
>  HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
>  HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
>  HPAGEFLAG(Cma, cma)
> +HPAGEFLAG(Zeroed, zeroed)

Same comments about this naming as elsewhere. Let's be specific about what
_kind_ of zeroing this is.

>
>  #ifdef CONFIG_HUGETLB_PAGE
>
> @@ -700,7 +702,8 @@ int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
>  int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
>  void wait_for_freed_hugetlb_folios(void);
>  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -				unsigned long addr, bool cow_from_owner);
> +				unsigned long user_addr, bool cow_from_owner,
> +				gfp_t gfp);

You already started calling into this user_addr stuff in an earlier patch I
believe, so why not rename it then...

>  struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
>  				nodemask_t *nmask, gfp_t gfp_mask,
>  				bool allow_alloc_fallback);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 5d7e546565f5..ed00db703911 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1455,7 +1455,8 @@ void add_hugetlb_folio(struct hstate *h, struct folio *folio,
>  }
>
>  static void __update_and_free_hugetlb_folio(struct hstate *h,
> -						struct folio *folio)
> +						struct folio *folio,
> +						bool zeroed)
>  {
>  	bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
>
> @@ -1506,6 +1507,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>  	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
>  	if (folio_test_hugetlb_cma(folio))
>  		hugetlb_cma_free_frozen_folio(folio);
> +	else if (zeroed)
> +		free_frozen_pages_zeroed(&folio->page, folio_order(folio));
>  	else
>  		free_frozen_pages(&folio->page, folio_order(folio));
>  }
> @@ -1545,7 +1548,7 @@ static void free_hpage_workfn(struct work_struct *work)
>  		 */
>  		h = size_to_hstate(folio_size(folio));
>
> -		__update_and_free_hugetlb_folio(h, folio);
> +		__update_and_free_hugetlb_folio(h, folio, false);
>
>  		cond_resched();
>  	}
> @@ -1559,10 +1562,10 @@ static inline void flush_free_hpage_work(struct hstate *h)
>  }
>
>  static void update_and_free_hugetlb_folio(struct hstate *h, struct folio *folio,
> -				 bool atomic)
> +				 bool atomic, bool zeroed)
>  {
>  	if (!folio_test_hugetlb_vmemmap_optimized(folio) || !atomic) {
> -		__update_and_free_hugetlb_folio(h, folio);
> +		__update_and_free_hugetlb_folio(h, folio, zeroed);
>  		return;
>  	}
>
> @@ -1596,7 +1599,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
>  			spin_lock_irq(&hugetlb_lock);
>  			__folio_clear_hugetlb(folio);
>  			spin_unlock_irq(&hugetlb_lock);
> -			update_and_free_hugetlb_folio(h, folio, false);
> +			update_and_free_hugetlb_folio(h, folio, false, false);
>  			cond_resched();
>  		}
>  	} else {
> @@ -1621,7 +1624,7 @@ static void bulk_vmemmap_restore_error(struct hstate *h,
>  				spin_lock_irq(&hugetlb_lock);
>  				__folio_clear_hugetlb(folio);
>  				spin_unlock_irq(&hugetlb_lock);
> -				update_and_free_hugetlb_folio(h, folio, false);
> +				update_and_free_hugetlb_folio(h, folio, false, false);
>  				cond_resched();
>  				break;
>  			}
> @@ -1664,7 +1667,7 @@ static void update_and_free_pages_bulk(struct hstate *h,
>  	}
>
>  	list_for_each_entry_safe(folio, t_folio, &non_hvo_folios, lru) {
> -		update_and_free_hugetlb_folio(h, folio, false);
> +		update_and_free_hugetlb_folio(h, folio, false, false);
>  		cond_resched();
>  	}
>  }
> @@ -1680,7 +1683,7 @@ struct hstate *size_to_hstate(unsigned long size)
>  	return NULL;
>  }
>
> -void free_huge_folio(struct folio *folio)
> +static void __free_huge_folio(struct folio *folio, bool zeroed)

This zeroed flag seems to be used for both hugetlb zeroed flag state and
__GFP_ZERO?

What does it mean? Can we be specific and comment? Because it's very confusing.

>  {
>  	/*
>  	 * Can't pass hstate in here because it is called from the
> @@ -1692,6 +1695,9 @@ void free_huge_folio(struct folio *folio)
>  	bool restore_reserve;
>  	unsigned long flags;
>
> +	/* Page was mapped to userspace; no longer known-zero */

Again, please be specific about the kind of zeroing.

> +	folio_clear_hugetlb_zeroed(folio);
> +
>  	VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
>
> @@ -1735,12 +1741,12 @@ void free_huge_folio(struct folio *folio)
>  	if (folio_test_hugetlb_temporary(folio)) {
>  		remove_hugetlb_folio(h, folio, false);
>  		spin_unlock_irqrestore(&hugetlb_lock, flags);
> -		update_and_free_hugetlb_folio(h, folio, true);
> +		update_and_free_hugetlb_folio(h, folio, true, zeroed);
>  	} else if (h->surplus_huge_pages_node[nid]) {
>  		/* remove the page from active list */
>  		remove_hugetlb_folio(h, folio, true);
>  		spin_unlock_irqrestore(&hugetlb_lock, flags);
> -		update_and_free_hugetlb_folio(h, folio, true);
> +		update_and_free_hugetlb_folio(h, folio, true, zeroed);
>  	} else {
>  		arch_clear_hugetlb_flags(folio);
>  		enqueue_hugetlb_folio(h, folio);
> @@ -1748,6 +1754,11 @@ void free_huge_folio(struct folio *folio)
>  	}
>  }
>
> +void free_huge_folio(struct folio *folio)

_Please_ be specific about hugetlb in newer functions. 'Huge' is very
overloaded.

And again you're doing the pattern of adding various 'zeroed' state, but then
_also_ adding a specific flag for hinting zeroed state.

I don't understand why we need both and you're adding complexity here for that.

> +{
> +	__free_huge_folio(folio, false);
> +}
> +
>  /*
>   * Must be called with the hugetlb lock held
>   */
> @@ -2031,7 +2042,7 @@ int dissolve_free_hugetlb_folio(struct folio *folio)
>  			rc = 0;
>  		}
>
> -		update_and_free_hugetlb_folio(h, folio, false);
> +		update_and_free_hugetlb_folio(h, folio, false, false);

More mystery meat boolean flags.

>  		return rc;
>  	}
>  out:
> @@ -2093,6 +2104,10 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h,
>  	if (!folio)
>  		return NULL;
>
> +	/* Mark as known-zero only if __GFP_ZERO was requested */

This comment is redundant and underspecified.

> +	if (gfp_mask & __GFP_ZERO)
> +		folio_set_hugetlb_zeroed(folio);

So now we're marking zeroed even in cases where it's not the host VM zeroing?

Is this useful?

> +
>  	spin_lock_irq(&hugetlb_lock);
>  	/*
>  	 * nr_huge_pages needs to be adjusted within the same lock cycle
> @@ -2156,11 +2171,11 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
>   */
>  static
>  struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
> -		struct vm_area_struct *vma, unsigned long addr)
> +		struct vm_area_struct *vma, unsigned long addr, gfp_t gfp)

Can we not propagate arbitrary GFP flags if we can avoid it?

>  {
>  	struct folio *folio = NULL;
>  	struct mempolicy *mpol;
> -	gfp_t gfp_mask = htlb_alloc_mask(h);
> +	gfp_t gfp_mask = htlb_alloc_mask(h) | gfp;
>  	int nid;
>  	nodemask_t *nodemask;
>
> @@ -2715,7 +2730,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
>  		 * Folio has been replaced, we can safely free the old one.
>  		 */
>  		spin_unlock_irq(&hugetlb_lock);
> -		update_and_free_hugetlb_folio(h, old_folio, false);
> +		update_and_free_hugetlb_folio(h, old_folio, false, false);
>  	}
>
>  	return ret;
> @@ -2723,7 +2738,7 @@ static int alloc_and_dissolve_hugetlb_folio(struct folio *old_folio,
>  free_new:
>  	spin_unlock_irq(&hugetlb_lock);
>  	if (new_folio)
> -		update_and_free_hugetlb_folio(h, new_folio, false);
> +		update_and_free_hugetlb_folio(h, new_folio, false, false);
>
>  	return ret;
>  }
> @@ -2857,16 +2872,19 @@ typedef enum {
>   * When it's set, the allocation will bypass all vma level reservations.
>   */
>  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> -				    unsigned long addr, bool cow_from_owner)
> +				    unsigned long user_addr, bool cow_from_owner,
> +				    gfp_t gfp)
>  {
>  	struct hugepage_subpool *spool = subpool_vma(vma);
>  	struct hstate *h = hstate_vma(vma);
> +	unsigned long addr = user_addr & huge_page_mask(h);
>  	struct folio *folio;
>  	long retval, gbl_chg, gbl_reserve;
>  	map_chg_state map_chg;
>  	int ret, idx;
>  	struct hugetlb_cgroup *h_cg = NULL;
> -	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> +
> +	gfp |= htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
>
>  	idx = hstate_index(h);
>
> @@ -2934,13 +2952,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
>  	if (!folio) {
>  		spin_unlock_irq(&hugetlb_lock);
> -		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
> +		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, user_addr, gfp);
>  		if (!folio)
>  			goto out_uncharge_cgroup;
>  		spin_lock_irq(&hugetlb_lock);
>  		list_add(&folio->lru, &h->hugepage_activelist);
>  		folio_ref_unfreeze(folio, 1);
> -		/* Fall through */

Why are you dropping this?

>  	}
>
>  	/*
> @@ -2963,6 +2980,10 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>
>  	spin_unlock_irq(&hugetlb_lock);
>
> +	if ((gfp & __GFP_ZERO) && !folio_test_hugetlb_zeroed(folio))
> +		folio_zero_user(folio, user_addr);
> +	folio_clear_hugetlb_zeroed(folio);

So this represents general zeroed state (not just host VM, based on how you set
it above), but you also clear it when you just zeroed the folio? I'm confused.

What does this flag actually mean?
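
Reading only the quoted hunk, the flag seems to behave as a one-shot
"contents known-zero" marker that is consumed at allocation time. The
sketch below models just those three lines in userspace. It is an
assumption-laden illustration: struct demo_folio, DEMO_GFP_ZERO and
demo_finish_alloc() are hypothetical names, and memset() stands in for
folio_zero_user().

```c
#include <stdbool.h>
#include <string.h>

#define DEMO_GFP_ZERO 0x1u	/* hypothetical stand-in for __GFP_ZERO */

/* Hypothetical stand-in for a hugetlb folio. */
struct demo_folio {
	bool known_zero;		/* models HPG_zeroed */
	unsigned char data[64];
	unsigned int zero_calls;	/* counts modelled folio_zero_user() calls */
};

/* Models the three lines added after spin_unlock_irq() in the hunk. */
static void demo_finish_alloc(struct demo_folio *f, unsigned int gfp)
{
	/* Zero only when the caller asked for it and the flag is not set. */
	if ((gfp & DEMO_GFP_ZERO) && !f->known_zero) {
		memset(f->data, 0, sizeof(f->data));
		f->zero_calls++;
	}
	/* The flag is cleared unconditionally once the folio is handed out. */
	f->known_zero = false;
}
```

Under this reading, the flag is never observable on a folio that has been
returned to a caller: it is either consumed to skip zeroing or discarded.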

> +
>  	hugetlb_set_folio_subpool(folio, spool);
>
>  	if (map_chg != MAP_CHG_ENFORCED) {
> @@ -2999,7 +3020,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
>
>  	if (ret == -ENOMEM) {
> -		free_huge_folio(folio);
> +		__free_huge_folio(folio, !!(gfp & __GFP_ZERO));

Is the !! actually necessary?
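
As far as the C language goes, conversion to bool already maps any nonzero
value to true, so the two spellings give the same result when the parameter
type is bool. A small userspace sketch, with a made-up DEMO_GFP_ZERO bit
(not the real __GFP_ZERO value) and hypothetical demo_* helpers:

```c
#include <stdbool.h>

#define DEMO_GFP_ZERO 0x100u	/* hypothetical bit, chosen above bit 7 on purpose */

/* Takes a bool, like the new 'zeroed' parameter in the patch. */
static int demo_take_bool(bool zeroed)
{
	return zeroed;	/* always 0 or 1 */
}

/* Masked value passed directly; the bool conversion normalises it. */
static int demo_without_double_negation(unsigned int gfp)
{
	return demo_take_bool(gfp & DEMO_GFP_ZERO);
}

/* Masked value normalised by hand first. */
static int demo_with_double_negation(unsigned int gfp)
{
	return demo_take_bool(!!(gfp & DEMO_GFP_ZERO));
}
```

The explicit form only changes the outcome when the destination is a narrow
integer type rather than bool, where a high bit could be truncated away.
Whether sparse or local style prefers the explicit form for the __bitwise
gfp_t is a separate question that this sketch does not answer.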

>  		return ERR_PTR(-ENOMEM);
>  	}
>
> @@ -4971,7 +4992,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  				spin_unlock(src_ptl);
>  				spin_unlock(dst_ptl);
>  				/* Do not use reserve as it's private owned */
> -				new_folio = alloc_hugetlb_folio(dst_vma, addr, false);
> +				new_folio = alloc_hugetlb_folio(dst_vma, addr, false, 0);
>  				if (IS_ERR(new_folio)) {
>  					folio_put(pte_folio);
>  					ret = PTR_ERR(new_folio);
> @@ -5500,7 +5521,7 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
>  	 * be acquired again before returning to the caller, as expected.
>  	 */
>  	spin_unlock(vmf->ptl);
> -	new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
> +	new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner, 0);
>
>  	if (IS_ERR(new_folio)) {
>  		/*
> @@ -5760,7 +5781,13 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  				goto out;
>  		}
>
> -		folio = alloc_hugetlb_folio(vma, vmf->address, false);
> +		/*
> +		 * Passing vmf->real_address would work just as well,
> +		 * but PAGE_MASK helps make sure we never pass
> +		 * USER_ADDR_NONE by mistake.
> +		 */

Wait what??

Your whole thesis is that USER_ADDR_NONE can never possibly get used, and now
you're guarding against that not being true?


> +		folio = alloc_hugetlb_folio(vma, vmf->real_address & PAGE_MASK,
> +					   false, __GFP_ZERO);

OK so we propagate GFP just for __GFP_ZERO... Can we not?

>  		if (IS_ERR(folio)) {
>  			/*
>  			 * Returning error will result in faulting task being
> @@ -5780,7 +5807,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  				ret = 0;
>  			goto out;
>  		}
> -		folio_zero_user(folio, vmf->real_address);
>  		__folio_mark_uptodate(folio);
>  		new_folio = true;
>
> @@ -6219,7 +6245,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  			goto out;
>  		}
>
> -		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
> +		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false, 0);
>  		if (IS_ERR(folio)) {
>  			pte_t *actual_pte = hugetlb_walk(dst_vma, dst_addr, PMD_SIZE);
>  			if (actual_pte) {
> @@ -6266,7 +6292,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  			goto out;
>  		}
>
> -		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
> +		folio = alloc_hugetlb_folio(dst_vma, dst_addr, false, 0);

Not loving the '0' on these... Let's find another way to represent that.

>  		if (IS_ERR(folio)) {
>  			folio_put(*foliop);
>  			ret = -ENOMEM;
> --
> MST
>

Thanks, Lorenzo

* Re: [PATCH v10 27/37] mm: use __GFP_ZERO in vma_alloc_anon_folio_pmd
From: Lorenzo Stoakes @ 2026-06-08 12:32 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <2ee827fed4765a155d2b56bb0e13d7ee5fc6dce8.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:39:18AM -0400, Michael S. Tsirkin wrote:
> Convert vma_alloc_anon_folio_pmd() to pass __GFP_ZERO instead of
> zeroing at the callsite. post_alloc_hook uses the fault address
> passed through vma_alloc_folio for cache-friendly zeroing.
>
> Note: before this series, replacing folio_zero_user() with
> __GFP_ZERO was unsafe on cache-aliasing architectures because
> __GFP_ZERO uses clear_page() without a dcache flush. With this
> series, it is safe if the caller passes a valid user address
> (not USER_ADDR_NONE) to vma_alloc_folio() etc., which delivers
> it to post_alloc_hook() for the dcache flush via
> folio_zero_user(). It is only unsafe if USER_ADDR_NONE is passed.
>
> Note: with __GFP_ZERO, the folio is zeroed before
> mem_cgroup_charge().  If the charge fails, the zeroing work is
> wasted.  Previously zeroing was done after a successful charge.
> This is inherent to moving zeroing into the allocator.
> Charge failures are rare (only at cgroup limits).
>
> Use folio_put_zeroed() on charge failure so the zeroed hint
> propagates to the buddy allocator, avoiding redundant re-zeroing
> on the next allocation attempt.

Again, is this worth it?...

Every bit of code added increases the risk of bugs, the maintenance burden,
etc. Let's not do stuff just because we can.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/huge_memory.c | 14 +++-----------
>  1 file changed, 3 insertions(+), 11 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d689e6491ddb..0dec3c717ff2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1333,7 +1333,7 @@ EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
>  static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
>  		unsigned long addr)
>  {
> -	gfp_t gfp = vma_thp_gfp_mask(vma);
> +	gfp_t gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
>  	const int order = HPAGE_PMD_ORDER;
>  	struct folio *folio;
>
> @@ -1347,7 +1347,7 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
>
>  	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
>  	if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> -		folio_put(folio);
> +		folio_put_zeroed(folio);

Same comments as previously.

>  		count_vm_event(THP_FAULT_FALLBACK);
>  		count_vm_event(THP_FAULT_FALLBACK_CHARGE);
>  		count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> @@ -1356,17 +1356,9 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
>  	}
>  	folio_throttle_swaprate(folio, gfp);
>
> -       /*
> -	* When a folio is not zeroed during allocation (__GFP_ZERO not used)
> -	* or user folios require special handling, folio_zero_user() is used to
> -	* make sure that the page corresponding to the faulting address will be
> -	* hot in the cache after zeroing.
> -	*/
> -	if (user_alloc_needs_zeroing())
> -		folio_zero_user(folio, addr);
>  	/*
>  	 * The memory barrier inside __folio_mark_uptodate makes sure that
> -	 * folio_zero_user writes become visible before the set_pmd_at()
> +	 * page zeroing becomes visible before the set_pmd_at()

folio zeroing?

>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> --
> MST
>

Thanks, Lorenzo

* Re: [PATCH v10 26/37] mm: vma_alloc_anon_folio_pmd: pass raw fault address to vma_alloc_folio
From: Lorenzo Stoakes @ 2026-06-08 12:30 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <94f64aeca62d8419d76f2cf82b36f1da6017a312.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:39:10AM -0400, Michael S. Tsirkin wrote:
> Drop the redundant HPAGE_PMD_MASK alignment at the callsite.
> NUMA interleave is not affected by the raw address; the ilx
> calculation shifts addr >> PAGE_SHIFT >> order, dropping
> sub-page bits regardless of alignment. post_alloc_hook will
> use the raw address for cache-friendly zeroing.

But then what's the point in this change?

And why are we changing what we pass in this parameter but not the
vma_alloc_folio() kdoc?
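
The arithmetic claim in the commit message can be checked in isolation.
This sketch takes the description "addr >> PAGE_SHIFT >> order" at face
value and assumes 4 KiB pages with a PMD order of 9; the DEMO_* constants
and demo_ilx() are hypothetical. The real interleave index computation may
also involve the VMA start and pgoff, which are ignored here.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12	/* assumption: 4 KiB base pages */
#define DEMO_PMD_ORDER	9	/* assumption: 2 MiB PMD-sized folios */
#define DEMO_HPAGE_PMD_MASK \
	(~((UINT64_C(1) << (DEMO_PAGE_SHIFT + DEMO_PMD_ORDER)) - 1))

/* The shift as described in the commit message, nothing more. */
static uint64_t demo_ilx(uint64_t addr, unsigned int order)
{
	return addr >> DEMO_PAGE_SHIFT >> order;
}
```

With these assumptions, demo_ilx(addr, DEMO_PMD_ORDER) equals
demo_ilx(addr & DEMO_HPAGE_PMD_MASK, DEMO_PMD_ORDER) for any addr, since the
mask only clears bits that the shift discards anyway.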

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Reviewed-by: Gregory Price <gourry@gourry.net>
> ---
>  mm/huge_memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 970e077019b7..d689e6491ddb 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1337,7 +1337,7 @@ static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
>  	const int order = HPAGE_PMD_ORDER;
>  	struct folio *folio;
>
> -	folio = vma_alloc_folio(gfp, order, vma, addr & HPAGE_PMD_MASK);
> +	folio = vma_alloc_folio(gfp, order, vma, addr);
>
>  	if (unlikely(!folio)) {
>  		count_vm_event(THP_FAULT_FALLBACK);
> --
> MST
>

Thanks, Lorenzo

* Re: [PATCH v10 25/37] mm: use __GFP_ZERO in alloc_anon_folio
From: Lorenzo Stoakes @ 2026-06-08 12:29 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <13ed2aa6607d510bd8dc6d602e96626511ee4fee.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:39:02AM -0400, Michael S. Tsirkin wrote:
> Convert alloc_anon_folio() to pass __GFP_ZERO instead of zeroing
> at the callsite. post_alloc_hook uses the fault address passed
> through vma_alloc_folio for cache-friendly zeroing.
>
> Note: before this series, replacing clear_user_highpage() with
> __GFP_ZERO was unsafe on cache-aliasing architectures because
> __GFP_ZERO uses clear_page() without a dcache flush. With this
> series, it is safe if the caller passes a valid user address
> (not USER_ADDR_NONE) to vma_alloc_folio() etc., which delivers
> it to post_alloc_hook() for the dcache flush via
> folio_zero_user(). It is only unsafe if USER_ADDR_NONE is passed.
>
> Note: with __GFP_ZERO, the folio is zeroed before
> mem_cgroup_charge().  If the charge fails, the zeroing work is
> wasted.  Previously zeroing was done after a successful charge.
> This is inherent to moving zeroing into the allocator.
> Charge failures are rare (only at cgroup limits).
>
> Use folio_put_zeroed() on charge failure so the zeroed hint
> propagates to the buddy allocator, avoiding redundant re-zeroing
> on the next allocation attempt.

Is this even worth the effort? This is surely not a hotpath...

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/memory.c | 13 ++-----------
>  1 file changed, 2 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6c14b90f558e..6d6a3e1a02c1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5265,25 +5265,16 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  		goto fallback;
>
>  	/* Try allocating the highest of the remaining orders. */
> -	gfp = vma_thp_gfp_mask(vma);
> +	gfp = vma_thp_gfp_mask(vma) | __GFP_ZERO;
>  	while (orders) {
>  		folio = vma_alloc_folio(gfp, order, vma, vmf->address);
>  		if (folio) {
>  			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>  				count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> -				folio_put(folio);
> +				folio_put_zeroed(folio);

You just allocated the folio as zeroed above; should PG_zeroed not be set, thus
making it unnecessary to add folio_put_zeroed()?

>  				goto next;
>  			}
>  			folio_throttle_swaprate(folio, gfp);
> -			/*
> -			 * When a folio is not zeroed during allocation
> -			 * (__GFP_ZERO not used) or user folios require special
> -			 * handling, folio_zero_user() is used to make sure
> -			 * that the page corresponding to the faulting address
> -			 * will be hot in the cache after zeroing.
> -			 */
> -			if (user_alloc_needs_zeroing())
> -				folio_zero_user(folio, vmf->address);
>  			return folio;
>  		}
>  next:
> --
> MST
>

Thanks, Lorenzo

* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
From: Lorenzo Stoakes @ 2026-06-08 12:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <fef7ca7bbf591865defef658b0792b58494aaed5.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:38:54AM -0400, Michael S. Tsirkin wrote:
> Add put_page_zeroed() / folio_put_zeroed() for callers that hold
> a reference to a page known to be zeroed.
>
> If this drops the last reference, the zeroed hint is
> propagated to the buddy allocator.  If someone else still holds a
> reference, the hint is simply lost - this is best-effort.
>
> This is useful for balloon drivers during deflation: the host
> has already zeroed the pages, and the balloon is typically the
> sole owner.  But if the page happens to be shared, silently
> dropping the hint is safe and avoids the need for callers to
> check the refcount.
>
> Note: put_page_zeroed uses folio_put_testzero() which only
> detects sole ownership at the instant of the atomic decrement.
> A concurrent reference holder (e.g. migration) means the hint
> is silently lost. This is by design: the zeroed hint is a
> performance optimization, not a correctness requirement.
> Losing it just means the next allocation re-zeroes the page.

Do not put comments about specific expected races like this in the commit
message but not in the code. Subtleties need to be called out.

The commit message also doesn't at all explain why PG_zeroed doesn't
suffice here.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6

I really don't understand why you have a 'zeroed' folio flag but need to
also have new API calls to detect that?

They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
VM zeroed?

Each of these is a case we address individually, and each relates to folios.

You absolutely fail to clarify _which one_ you mean, and provide absolutely
no documentation and add an exported mm API with no description.

I think this is just not something we want to add? Especially on something
so fundamental?

> ---
>  include/linux/mm.h | 13 +++++++++++++
>  mm/swap.c          | 20 ++++++++++++++++++--
>  2 files changed, 31 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 06bbe9eba636..79b3a8cb9a3b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1913,6 +1913,7 @@ static inline struct folio *virt_to_folio(const void *x)
>  }
>
>  void __folio_put(struct folio *folio);
> +void __folio_put_zeroed(struct folio *folio);
>
>  void split_page(struct page *page, unsigned int order);
>  void folio_copy(struct folio *dst, struct folio *src);
> @@ -2090,6 +2091,18 @@ static inline void folio_put(struct folio *folio)
>  		__folio_put(folio);
>  }
>
> +/* Caller must be sole owner to guarantee page is still zero */
> +static inline void folio_put_zeroed(struct folio *folio)
> +{
> +	if (folio_put_testzero(folio))
> +		__folio_put_zeroed(folio);
> +}
> +
> +static inline void put_page_zeroed(struct page *page)
> +{
> +	folio_put_zeroed(page_folio(page));
> +}
> +

Please stop adding more APIs to mm without kdocs. This just isn't
acceptable.
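
For illustration only, here is a userspace sketch of what a kernel-doc
comment carrying the best-effort caveat from the commit message could look
like. Everything prefixed mock_ is hypothetical (including the
host_zeroed naming), and struct mock_folio is a stand-in rather than the
kernel's struct folio; only the comment layout follows the usual kernel-doc
conventions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a folio; not the kernel structure. */
struct mock_folio {
	atomic_int refcount;
	bool freed;
	bool freed_with_hint;	/* the hint reached the mock allocator */
};

/* Drop one reference; returns true if it was the last one. */
static bool mock_put_testzero(struct mock_folio *folio)
{
	int old = atomic_fetch_sub(&folio->refcount, 1);

	assert(old > 0);	/* catch refcount underflow in the model */
	return old == 1;
}

/**
 * mock_folio_put_host_zeroed() - Drop a reference to a host-zeroed folio.
 * @folio: The folio whose contents the caller knows to be zero.
 *
 * The hint is best-effort. It is forwarded only if this call drops the
 * final reference. If another holder (e.g. migration) still has a
 * reference, the hint is lost and the next allocation zeroes the memory
 * again, which is slower but still correct.
 *
 * Return: true if this call freed the folio.
 */
static bool mock_folio_put_host_zeroed(struct mock_folio *folio)
{
	if (!mock_put_testzero(folio))
		return false;	/* not the last reference: hint dropped */

	folio->freed = true;
	folio->freed_with_hint = true;
	return true;
}

/* Plain put: frees on the last reference and never carries a hint. */
static bool mock_folio_put(struct mock_folio *folio)
{
	if (!mock_put_testzero(folio))
		return false;

	folio->freed = true;
	return true;
}
```

With two references held, calling mock_folio_put_host_zeroed() first and
mock_folio_put() second frees the folio without the hint, which is the
"silently lost" case the commit message describes.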

>  /**
>   * folio_put_refs - Reduce the reference count on a folio.
>   * @folio: The folio.
> diff --git a/mm/swap.c b/mm/swap.c
> index 5cc44f0de987..ecec780172ad 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -94,13 +94,15 @@ static void page_cache_release(struct folio *folio)
>  		lruvec_unlock_irqrestore(lruvec, flags);
>  }
>
> -void __folio_put(struct folio *folio)
> +static void ___folio_put(struct folio *folio, bool zeroed)
>  {
> +	/* zeroed hint ignored for now, no current user */

Please don't add comments about why you didn't do something that nobody
here knows about with no context.

If you want to say something about this, make it clear. This is so succinct
it's utterly meaningless.

>  	if (unlikely(folio_is_zone_device(folio))) {
>  		free_zone_device_folio(folio);
>  		return;
>  	}
>
> +	/* zeroed hint ignored for now, no current user */
>  	if (folio_test_hugetlb(folio)) {
>  		free_huge_folio(folio);
>  		return;
> @@ -109,10 +111,24 @@ void __folio_put(struct folio *folio)
>  	page_cache_release(folio);
>  	folio_unqueue_deferred_split(folio);
>  	mem_cgroup_uncharge(folio);
> -	free_frozen_pages(&folio->page, folio_order(folio));
> +	if (zeroed)
> +		free_frozen_pages_zeroed(&folio->page, folio_order(folio));
> +	else
> +		free_frozen_pages(&folio->page, folio_order(folio));
> +}
> +
> +void __folio_put(struct folio *folio)
> +{
> +	___folio_put(folio, false);
>  }
>  EXPORT_SYMBOL(__folio_put);
>

No documentation again...

> +void __folio_put_zeroed(struct folio *folio)
> +{
> +	___folio_put(folio, true);
> +}
> +EXPORT_SYMBOL(__folio_put_zeroed);
> +
>  typedef void (*move_fn_t)(struct lruvec *lruvec, struct folio *folio);
>
>  static void lru_add(struct lruvec *lruvec, struct folio *folio)
> --
> MST
>

Thanks, Lorenzo

* Re: [PATCH v10 23/37] mm: page_alloc: skip kernel_init_pages for FPI_ZEROED when safe
From: Lorenzo Stoakes @ 2026-06-08 12:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <7de6959a0fb1b255dc9bee2de494f51a6c21e468.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:38:46AM -0400, Michael S. Tsirkin wrote:
> In __free_pages_prepare(), when FPI_ZEROED is set the page is already
> known to be zero. We can skip kernel_init_pages() if page poisoning is
> not enabled (because poison would overwrite the zeroes).
>
> This avoids redundant zeroing work when freeing pages that are already
> known to contain all zeros.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  mm/page_alloc.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 008f1a311c40..e3a7c40c769c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1443,7 +1443,14 @@ __always_inline bool __free_pages_prepare(struct page *page,
>  		if (kasan_has_integrated_init())
>  			init = false;
>  	}
> -	if (init)
> +	/*
> +	 * Skip redundant zeroing when the page is already known-zero
> +	 * (FPI_ZEROED) and page poisoning did not overwrite it.
> +	 * When page_poisoning is enabled, kernel_poison_pages above
> +	 * wrote PAGE_POISON (0xAA), so we must re-zero.
> +	 */

Again, please stop specifying arbitrary hex values in comments, this seems
mostly 'describing what we do here'.

Maybe drop to just e.g.:

/* if poisoned or not zeroed by a virtualised host, zero now. */

or suchlike?

> +	if (init && !((fpi_flags & FPI_ZEROED) &&
> +		      !page_poisoning_enabled_static()))

This condition is absolutely horrible, !(X && !Y), you're making life difficult
for the readers.

'if not both zeroed and not poisoned' is how that reads logically. Which is hard
to understand.

De Morgan's law gives us -> !zeroed || poisoned

How about:

	if (init && (!(fpi_flags & FPI_ZEROED) || page_poisoning_enabled_static()))

Or preferably something like:

	const bool poisoned = page_poisoning_enabled_static();
	const bool vm_host_zeroed = fpi_flags & FPI_ZEROED;

	...

	if (init && (poisoned || !vm_host_zeroed))
		...

?
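
As a sanity check on the rewrite, a minimal userspace sketch, assuming
plain bools stand in for 'fpi_flags & FPI_ZEROED' and
page_poisoning_enabled_static(). The function names are made up for the
example; it only confirms that the two forms agree on every input.

```c
#include <assert.h>
#include <stdbool.h>

/* Form used in the patch: init && !(zeroed && !poisoned). */
static bool should_init_original(bool init, bool zeroed, bool poisoned)
{
	return init && !(zeroed && !poisoned);
}

/* Form after applying De Morgan: init && (poisoned || !zeroed). */
static bool should_init_rewritten(bool init, bool zeroed, bool poisoned)
{
	return init && (poisoned || !zeroed);
}

/* Walk all eight input combinations and assert that the forms agree. */
static void check_forms_agree(void)
{
	for (int bits = 0; bits < 8; bits++) {
		bool init = bits & 1;
		bool zeroed = bits & 2;
		bool poisoned = bits & 4;

		assert(should_init_original(init, zeroed, poisoned) ==
		       should_init_rewritten(init, zeroed, poisoned));
	}
}
```

The only case where init is requested and zeroing is skipped is
zeroed && !poisoned, which the second form states directly.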


>  		kernel_init_pages(page, 1 << order);
>
>  	/*
> --
> MST
>

Thanks, Lorenzo

* Re: [PATCH v10 22/37] mm: add free_frozen_pages_zeroed
From: Lorenzo Stoakes @ 2026-06-08 12:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <4220116b4d91f3e933fa76ad473106c2e48fd452.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:38:37AM -0400, Michael S. Tsirkin wrote:
> Add free_frozen_pages_zeroed(page, order) to free a frozen page
> while marking it as zeroed, so the next allocation can skip
> redundant zeroing.
>
> An FPI_ZEROED internal flag carries the hint through the free path.
> PageZeroed is set after __free_pages_prepare() clears all flags,
> so the hint survives on the free list.
>
> __SetPageZeroed is non-atomic but safe here: the page is frozen
> (refcount 0) and not yet on any free list.
>
> Note: when want_init_on_free() zeroes the page via
> kernel_init_pages(), the page is zero but the direct-map
> cache lines may be dirty. A later patch (skip
> kernel_init_pages for FPI_ZEROED) avoids the redundant
> re-zero, and post_alloc_hook handles the dcache flush
> for user pages on aliasing architectures.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> ---
>  include/linux/gfp.h |  1 +
>  mm/internal.h       |  1 +
>  mm/page_alloc.c     | 23 ++++++++++++++++++++++-
>  3 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 73109d4e31a4..d24b61e45861 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -384,6 +384,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
>  extern void __free_pages(struct page *page, unsigned int order);
>  extern void free_pages_nolock(struct page *page, unsigned int order);
>  extern void free_pages(unsigned long addr, unsigned int order);
> +void free_frozen_pages_zeroed(struct page *page, unsigned int order);
>
>  #define __free_page(page) __free_pages((page), 0)
>  #define free_page(addr) free_pages((addr), 0)
> diff --git a/mm/internal.h b/mm/internal.h
> index 4af5e72742ba..fd910743ddc3 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -938,6 +938,7 @@ struct page *__alloc_frozen_pages_noprof(gfp_t, unsigned int order, int nid,
>  #define __alloc_frozen_pages(...) \
>  	alloc_hooks(__alloc_frozen_pages_noprof(__VA_ARGS__))
>  void free_frozen_pages(struct page *page, unsigned int order);
> +void free_frozen_pages_zeroed(struct page *page, unsigned int order);

This is badly named. That name implies you're freeing frozen, zeroed pages, not
that you're marking them zeroed.

And again, you're overloading 'zeroed' here. Be specific, it's about
host zeroing in virtualisation.

>  void free_unref_folios(struct folio_batch *fbatch);
>
>  #ifdef CONFIG_NUMA
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 21f9e92922f1..008f1a311c40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -91,6 +91,13 @@ typedef int __bitwise fpi_t;
>  /* Free the page without taking locks. Rely on trylock only. */
>  #define FPI_TRYLOCK		((__force fpi_t)BIT(2))
>
> +/*
> + * The page contents are known to be zero (e.g., the host zeroed them
> + * during balloon deflate).  Set PageZeroed after free so the next

Can we just be specific that this is about VM hosts? I don't imagine that we are
going to ever have a use beyond that, and we can adjust the phrasing later if
needed.

Otherwise it's just confusing right now, you're overloading 'zeroed' to mean
different things and we do that enough in mm already.

> + * allocation can skip redundant zeroing.
> + */
> +#define FPI_ZEROED		((__force fpi_t)BIT(3))

Hmm now we have another flag to propagate this around... this is messy.

Now we have multiple different ways of representing this state... ugh.

> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
> @@ -1596,8 +1603,12 @@ static void __free_pages_ok(struct page *page, unsigned int order,
>  	unsigned long pfn = page_to_pfn(page);
>  	struct zone *zone = page_zone(page);
>
> -	if (__free_pages_prepare(page, order, fpi_flags))
> +	if (__free_pages_prepare(page, order, fpi_flags)) {
> +		/* Don't mark zeroed if poison overwrote with 0xAA. */

Can we not reference arbitrary values in comments? And this comment seems
redundant.

> +		if ((fpi_flags & FPI_ZEROED) && !page_poisoning_enabled_static())
> +			__SetPageZeroed(page);
>  		free_one_page(zone, page, pfn, order, fpi_flags);
> +	}
>  }
>
>  void __meminit __free_pages_core(struct page *page, unsigned int order,
> @@ -3020,6 +3031,10 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>  	if (!__free_pages_prepare(page, order, fpi_flags))
>  		return;
>
> +	/* Don't mark zeroed if poison overwrote with 0xAA. */

Same comment as above.

> +	if ((fpi_flags & FPI_ZEROED) && !page_poisoning_enabled_static())
> +		__SetPageZeroed(page);
> +
>  	/*
>  	 * We only track unmovable, reclaimable and movable on pcp lists.
>  	 * Place ISOLATE pages on the isolated list because they are being
> @@ -3058,6 +3073,12 @@ void free_frozen_pages(struct page *page, unsigned int order)
>  	__free_frozen_pages(page, order, FPI_NONE);
>  }
>

No comment describing this? kdoc please.

> +void free_frozen_pages_zeroed(struct page *page, unsigned int order)
> +{
> +	__free_frozen_pages(page, order, FPI_ZEROED);
> +}
> +EXPORT_SYMBOL(free_frozen_pages_zeroed);

Do we have to use EXPORT_SYMBOL()? Why not EXPORT_SYMBOL_GPL()?

> +
>  void free_frozen_pages_nolock(struct page *page, unsigned int order)
>  {
>  	__free_frozen_pages(page, order, FPI_TRYLOCK);
> --
> MST
>

Thanks, Lorenzo

* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
From: Lorenzo Stoakes @ 2026-06-08 12:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <6a2f93e4447e55ada6ccf0f4d5c64e8d41a848d4.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> When a guest reports free pages to the hypervisor via the page reporting
> framework (used by virtio-balloon and hv_balloon), the host typically
> zeros those pages when reclaiming their backing memory.  However, when
> those pages are later allocated in the guest, post_alloc_hook()
> unconditionally zeros them again if __GFP_ZERO is set.  This
> double-zeroing is wasteful, especially for large pages.
>
> Avoid redundant zeroing:
>
> - Add a host_zeroes_pages flag to page_reporting_dev_info, allowing
>   drivers to declare that their host zeros reported pages on reclaim.
>   A static key (page_reporting_host_zeroes) gates the fast path.
>
> - Add PG_zeroed page flag (sharing PG_private bit) to mark pages
>   that have been zeroed by the host.  Set it in
>   page_reporting_drain() after the host reports them.

I think this flag is really confusingly named, if it's a virtualised host
thing, then can we please encode that in the flag name?

I was looking at a later commit and wondering who was doing the zeroing
exactly.

And could we please propagate that throughout the code? Some nebulous 'bool
zeroed = ...' begs the question of whether it's the kernel that did it and
why we are adding logic.

>
> - Thread the zeroed bool through rmqueue -> prep_new_page ->
>   post_alloc_hook, where it skips redundant zeroing for __GFP_ZERO
>   allocations.
>
> Currently the PG_zeroed hint can be lost when pages are
> split (expand) or merged in the buddy allocator.  This is
> harmless: losing the hint just means the page gets re-zeroed,
> which is correct but suboptimal.  Follow-up patches propagate
> PG_zeroed across splits and merges to preserve the hint on
> common paths.
>
> No driver sets host_zeroes_pages yet; a follow-up patch to
> virtio_balloon is needed to opt in.
>
> PG_zeroed pages may pass through PCP lists before being freed.
> This is safe: __free_pages_prepare clears all
> PAGE_FLAGS_CHECK_AT_PREP flags (including PG_zeroed/PG_private)
> before the page re-enters the buddy allocator.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  include/linux/page-flags.h     |  9 +++++
>  include/linux/page_reporting.h |  3 ++
>  mm/compaction.c                |  6 ++-
>  mm/internal.h                  |  2 +-
>  mm/page_alloc.c                | 68 +++++++++++++++++++++++-----------
>  mm/page_reporting.c            | 14 ++++++-
>  mm/page_reporting.h            | 12 ++++++
>  7 files changed, 88 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 7223f6f4e2b4..91f8ddb1d512 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -135,6 +135,8 @@ enum pageflags {
>  	PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
>  	/* Some filesystems */
>  	PG_checked = PG_owner_priv_1,
> +	/* Page contents are known to be zero */
> +	PG_zeroed = PG_private,
>
>  	/*
>  	 * Depending on the way an anonymous folio can be mapped into a page
> @@ -673,6 +675,13 @@ FOLIO_TEST_CLEAR_FLAG_FALSE(young)
>  FOLIO_FLAG_FALSE(idle)
>  #endif
>
> +/*
> + * PageZeroed() tracks pages known to be zero.  The allocator
> + * uses this to skip redundant zeroing in post_alloc_hook().
> + */
> +__PAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
> +#define __PG_ZEROED (1UL << PG_zeroed)
> +
>  /*
>   * PageReported() is used to track reported free pages within the Buddy
>   * allocator. We can use the non-atomic version of the test and set
> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> index 5ab5be02fa15..c331c6b36687 100644
> --- a/include/linux/page_reporting.h
> +++ b/include/linux/page_reporting.h
> @@ -14,6 +14,9 @@ struct page_reporting_dev_info {
>  	int (*report)(struct page_reporting_dev_info *prdev,
>  		      struct scatterlist *sg, unsigned int nents);
>
> +	/* If true, host zeros reported pages on reclaim */
> +	bool host_zeroes_pages;
> +
>  	/* work struct for processing reports */
>  	struct delayed_work work;
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4336e433c99b..8000fc5e0a2e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -82,7 +82,8 @@ static inline bool is_via_compact_memory(int order) { return false; }
>
>  static struct page *mark_allocated_noprof(struct page *page, unsigned int order, gfp_t gfp_flags)
>  {
> -	post_alloc_hook(page, order, __GFP_MOVABLE, USER_ADDR_NONE);
> +	__ClearPageZeroed(page);
> +	post_alloc_hook(page, order, __GFP_MOVABLE, false, USER_ADDR_NONE);
>  	set_page_refcounted(page);
>  	return page;
>  }
> @@ -1849,9 +1850,10 @@ static struct folio *compaction_alloc_noprof(struct folio *src, unsigned long da
>  		set_page_private(&freepage[size], start_order);
>  	}
>  	dst = (struct folio *)freepage;
> +	__ClearPageZeroed(&dst->page);
>  	if (order)
>  		prep_compound_page(&dst->page, order);
> -	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, USER_ADDR_NONE);
> +	post_alloc_hook(&dst->page, order, __GFP_MOVABLE, false, USER_ADDR_NONE);
>  	set_page_refcounted(&dst->page);
>  	cc->nr_freepages -= 1 << order;
>  	cc->nr_migratepages -= 1 << order;
> diff --git a/mm/internal.h b/mm/internal.h
> index 9d2198114510..4af5e72742ba 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -928,7 +928,7 @@ static inline void init_compound_tail(struct page *tail,
>  }
>
>  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> -		     unsigned long user_addr);
> +		     bool zeroed, unsigned long user_addr);

host_zeroed or something would be more appropriate no?

But in general do we need to propagate this around, can't we derive it from
the page zeroed flag?

It's really confusing as to _which_ zeroing this refers to, it seems the
only one relevant here is the VM host zeroing but that's completely
non-obvious and now everybody using these functions with the extra param
will simply have to happen to know this.

If we could find a way to avoid this propagation that'd be ideal.

Failing that, making it clear this is _only_ for vm host zeroing would be
better, but then maybe we need to think about how we could encode this in
some other way, e.g. passing alloc_context perhaps?

>  extern bool free_pages_prepare(struct page *page, unsigned int order);
>
>  extern int user_min_free_kbytes;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d4fbf1861a8a..45e824b1ec75 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1743,6 +1743,7 @@ static __always_inline void page_del_and_expand(struct zone *zone,
>  	bool was_reported = page_reported(page);
>
>  	__del_page_from_free_list(page, zone, high, migratetype);
> +

Stray whitespace?

>  	nr_pages -= expand(zone, page, low, high, migratetype, was_reported);
>  	account_freepages(zone, -nr_pages, migratetype);
>  }
> @@ -1815,8 +1816,10 @@ static inline bool should_skip_init(gfp_t flags)
>  	return (flags & __GFP_SKIP_ZERO);
>  }
>
> +

Stray whitespace?

>  inline void post_alloc_hook(struct page *page, unsigned int order,
> -				gfp_t gfp_flags, unsigned long user_addr)
> +				gfp_t gfp_flags, bool zeroed,
> +				unsigned long user_addr)
>  {
>  	const bool zero_tags = gfp_flags & __GFP_ZEROTAGS;
>  	bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
> @@ -1825,6 +1828,14 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>
>  	set_page_private(page, 0);
>
> +	/*
> +	 * If the page is zeroed, skip memory initialization.
> +	 * We still need to handle tag zeroing separately since the host
> +	 * does not know about memory tags.
> +	 */
> +	if (zeroed && init && !zero_tags)
> +		init = false;
> +
>  	arch_alloc_page(page, order);
>  	debug_pagealloc_map_pages(page, 1 << order);
>
> @@ -1867,7 +1878,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	 * through a user-congruent mapping.  Host-zeroed pages
>  	 * (zeroed flag) don't need this: physical RAM is clean.
>  	 */
> -	if (!init && (gfp_flags & __GFP_ZERO) &&
> +	if (!zeroed && !init && (gfp_flags & __GFP_ZERO) &&
>  	    user_addr != USER_ADDR_NONE &&
>  	    user_alloc_needs_zeroing())
>  		init = true;
> @@ -1900,13 +1911,13 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  }
>
>  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> -							unsigned int alloc_flags,
> -							unsigned long user_addr)
> +			  unsigned int alloc_flags, bool zeroed,
> +			  unsigned long user_addr)
>  {
>  	if (order && (gfp_flags & __GFP_COMP))
>  		prep_compound_page(page, order);
>
> -	post_alloc_hook(page, order, gfp_flags, user_addr);
> +	post_alloc_hook(page, order, gfp_flags, zeroed, user_addr);
>
>  	/*
>  	 * page is set pfmemalloc when ALLOC_NO_WATERMARKS was necessary to
> @@ -3174,6 +3185,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
>  	}
>
>  	del_page_from_free_list(page, zone, order, mt);
> +	__ClearPageZeroed(page);
>
>  	/*
>  	 * Set the pageblock if the isolated page is at least half of a
> @@ -3246,7 +3258,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
>  static __always_inline
>  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  			   unsigned int order, unsigned int alloc_flags,
> -			   int migratetype)
> +			   int migratetype, bool *zeroed)
>  {
>  	struct page *page;
>  	unsigned long flags;
> @@ -3281,6 +3293,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  			}
>  		}
>  		spin_unlock_irqrestore(&zone->lock, flags);
> +		*zeroed = PageZeroed(page);
> +		__ClearPageZeroed(page);
>  	} while (check_new_pages(page, order));
>
>  	/*
> @@ -3349,10 +3363,9 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
>  /* Remove page from the per-cpu list, caller must protect the list */
>  static inline
>  struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
> -			int migratetype,
> -			unsigned int alloc_flags,
> +			int migratetype, unsigned int alloc_flags,
>  			struct per_cpu_pages *pcp,
> -			struct list_head *list)
> +			struct list_head *list, bool *zeroed)
>  {
>  	struct page *page;
>
> @@ -3387,6 +3400,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  		page = list_first_entry(list, struct page, pcp_list);
>  		list_del(&page->pcp_list);
>  		pcp->count -= 1 << order;
> +		*zeroed = PageZeroed(page);
> +		__ClearPageZeroed(page);
>  	} while (check_new_pages(page, order));
>
>  	return page;
> @@ -3395,7 +3410,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>  /* Lock and remove page from the per-cpu list */
>  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  			struct zone *zone, unsigned int order,
> -			int migratetype, unsigned int alloc_flags)
> +			int migratetype, unsigned int alloc_flags,
> +			bool *zeroed)
>  {
>  	struct per_cpu_pages *pcp;
>  	struct list_head *list;
> @@ -3413,7 +3429,8 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>  	 */
>  	pcp->free_count >>= 1;
>  	list = &pcp->lists[order_to_pindex(migratetype, order)];
> -	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
> +	page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags,
> +				 pcp, list, zeroed);
>  	pcp_spin_unlock(pcp);
>  	if (page) {
>  		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> @@ -3438,19 +3455,19 @@ static inline
>  struct page *rmqueue(struct zone *preferred_zone,
>  			struct zone *zone, unsigned int order,
>  			gfp_t gfp_flags, unsigned int alloc_flags,
> -			int migratetype)
> +			int migratetype, bool *zeroed)
>  {
>  	struct page *page;
>
>  	if (likely(pcp_allowed_order(order))) {
>  		page = rmqueue_pcplist(preferred_zone, zone, order,
> -				       migratetype, alloc_flags);
> +				       migratetype, alloc_flags, zeroed);
>  		if (likely(page))
>  			goto out;
>  	}
>
>  	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> -							migratetype);
> +			     migratetype, zeroed);
>
>  out:
>  	/* Separate test+clear to avoid unnecessary atomics */
> @@ -3841,6 +3858,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	struct pglist_data *last_pgdat = NULL;
>  	bool last_pgdat_dirty_ok = false;
>  	bool no_fallback;
> +	bool zeroed;
>  	bool skip_kswapd_nodes = nr_online_nodes > 1;
>  	bool skipped_kswapd_nodes = false;
>
> @@ -3985,10 +4003,11 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>
>  try_this_zone:
>  		page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order,
> -				gfp_mask, alloc_flags, ac->migratetype);
> +					gfp_mask, alloc_flags, ac->migratetype,
> +					&zeroed);
>  		if (page) {
>  			prep_new_page(page, order, gfp_mask, alloc_flags,
> -				      ac->user_addr);
> +				      zeroed, ac->user_addr);
>
>  			return page;
>  		} else {
> @@ -4215,9 +4234,11 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	count_vm_event(COMPACTSTALL);
>
>  	/* Prep a captured page if available */
> -	if (page)
> -		prep_new_page(page, order, gfp_mask, alloc_flags,
> +	if (page) {
> +		__ClearPageZeroed(page);
> +		prep_new_page(page, order, gfp_mask, alloc_flags, false,
>  			      ac->user_addr);
> +	}
>
>  	/* Try get a page from the freelist if available */
>  	if (!page)
> @@ -5190,6 +5211,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	/* Attempt the batch allocation */
>  	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
>  	while (nr_populated < nr_pages) {
> +		bool zeroed = false;
>
>  		/* Skip existing pages */
>  		if (page_array[nr_populated]) {
> @@ -5198,7 +5220,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  		}
>
>  		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
> -								pcp, pcp_list);
> +					 pcp, pcp_list, &zeroed);
>  		if (unlikely(!page)) {
>  			/* Try and allocate at least one page */
>  			if (!nr_account) {
> @@ -5209,7 +5231,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  		}
>  		nr_account++;
>
> -		prep_new_page(page, 0, gfp, 0, USER_ADDR_NONE);
> +		prep_new_page(page, 0, gfp, 0, zeroed, USER_ADDR_NONE);
>  		set_page_refcounted(page);
>  		page_array[nr_populated++] = page;
>  	}
> @@ -6949,7 +6971,8 @@ static void split_free_frozen_pages(struct list_head *list, gfp_t gfp_mask)
>  		list_for_each_entry_safe(page, next, &list[order], lru) {
>  			int i;
>
> -			post_alloc_hook(page, order, gfp_mask, USER_ADDR_NONE);
> +			__ClearPageZeroed(page);
> +			post_alloc_hook(page, order, gfp_mask, false, USER_ADDR_NONE);
>  			if (!order)
>  				continue;
>
> @@ -7157,8 +7180,9 @@ static int __alloc_contig_frozen_range(unsigned long start, unsigned long end,
>  	} else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
>  		struct page *head = pfn_to_page(start);
>
> +		__ClearPageZeroed(head);
>  		check_new_pages(head, order);
> -		prep_new_page(head, order, gfp_mask, 0, user_addr);
> +		prep_new_page(head, order, gfp_mask, 0, false, user_addr);

A nit, but I really dislike these kinds of mystery-meat booleans: you now
have to go and look up what the argument means.

One thing we use in mm quite a bit now is e.g. '/*zeroed=*/ false'. Though
some might say even having a boolean like this is a code smell in itself.
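
For illustration only, here is a minimal userspace sketch of the two options
mentioned above: annotating a bare boolean argument with an inline comment,
and replacing the boolean with an enum. All names (pages_to_clear,
page_contents, captured_page_example) are hypothetical and are not taken from
the patch or from the kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-in: how many pages still need clearing after prep. */
static unsigned int pages_to_clear(unsigned int order, bool zeroed)
{
	return zeroed ? 0 : 1u << order;
}

/* Hypothetical enum alternative that names the state at every call site. */
enum page_contents {
	PAGE_CONTENTS_UNKNOWN,
	PAGE_CONTENTS_ZEROED,
};

static unsigned int pages_to_clear_enum(unsigned int order,
					enum page_contents contents)
{
	return pages_to_clear(order, contents == PAGE_CONTENTS_ZEROED);
}

/* Call site using the inline-comment convention from the review. */
static unsigned int captured_page_example(void)
{
	return pages_to_clear(2, /*zeroed=*/ false);
}
```

With the comment, the reader no longer has to look up the parameter name; with
the enum, the compiler-visible type carries the meaning instead.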

>  	} else {
>  		ret = -EINVAL;
>  		WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n",
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index 5b6b17f67131..84ebc4547119 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -50,6 +50,8 @@ EXPORT_SYMBOL_GPL(page_reporting_order);
>  #define PAGE_REPORTING_DELAY	(2 * HZ)
>  static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
>
> +DEFINE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
> +
>  enum {
>  	PAGE_REPORTING_IDLE = 0,
>  	PAGE_REPORTING_REQUESTED,
> @@ -129,8 +131,11 @@ page_reporting_drain(struct page_reporting_dev_info *prdev,
>  		 * report on the new larger page when we make our way
>  		 * up to that higher order.
>  		 */
> -		if (PageBuddy(page) && buddy_order(page) == order)
> +		if (PageBuddy(page) && buddy_order(page) == order) {
>  			__SetPageReported(page);
> +			if (page_reporting_host_zeroes_pages())
> +				__SetPageZeroed(page);
> +		}
>  	} while ((sg = sg_next(sg)));
>
>  	/* reinitialize scatterlist now that it is empty */
> @@ -390,6 +395,10 @@ int page_reporting_register(struct page_reporting_dev_info *prdev)
>  	/* Assign device to allow notifications */
>  	rcu_assign_pointer(pr_dev_info, prdev);
>
> +	/* enable zeroed page optimization if host zeroes reported pages */
> +	if (prdev->host_zeroes_pages)
> +		static_branch_enable(&page_reporting_host_zeroes);
> +
>  	/* enable page reporting notification */
>  	if (!static_key_enabled(&page_reporting_enabled)) {
>  		static_branch_enable(&page_reporting_enabled);
> @@ -414,6 +423,9 @@ void page_reporting_unregister(struct page_reporting_dev_info *prdev)
>
>  		/* Flush any existing work, and lock it out */
>  		cancel_delayed_work_sync(&prdev->work);
> +
> +		if (prdev->host_zeroes_pages)
> +			static_branch_disable(&page_reporting_host_zeroes);
>  	}
>
>  	mutex_unlock(&page_reporting_mutex);
> diff --git a/mm/page_reporting.h b/mm/page_reporting.h
> index c51dbc228b94..736ea7b37e9e 100644
> --- a/mm/page_reporting.h
> +++ b/mm/page_reporting.h
> @@ -15,6 +15,13 @@ DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
>  extern unsigned int page_reporting_order;
>  void __page_reporting_notify(void);
>
> +DECLARE_STATIC_KEY_FALSE(page_reporting_host_zeroes);
> +
> +static inline bool page_reporting_host_zeroes_pages(void)
> +{
> +	return static_branch_unlikely(&page_reporting_host_zeroes);
> +}
> +
>  static inline bool page_reported(struct page *page)
>  {
>  	return static_branch_unlikely(&page_reporting_enabled) &&
> @@ -46,6 +53,11 @@ static inline void page_reporting_notify_free(unsigned int order)
>  #else /* CONFIG_PAGE_REPORTING */
>  #define page_reported(_page)	false
>
> +static inline bool page_reporting_host_zeroes_pages(void)
> +{
> +	return false;
> +}
> +
>  static inline void page_reporting_notify_free(unsigned int order)
>  {
>  }
> --
> MST
>

Thanks, Lorenzo


* Re: [PATCH v10 19/37] mm: page_alloc: clear PG_zeroed on buddy merge if not both zero
From: Lorenzo Stoakes @ 2026-06-08 11:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <918897302a4cab9f476625adc238ec65a688d1d7.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:38:08AM -0400, Michael S. Tsirkin wrote:
> When two buddy pages merge in __free_one_page(), preserve
> PG_zeroed on the merged page only if both buddies have the
> flag set.  Otherwise clear it.
>
> The merged page would inherit PG_zeroed, and a later __GFP_ZERO
> allocation would skip zeroing stale data in the non-zero half.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  include/linux/page-flags.h |  1 +
>  mm/page_alloc.c            | 15 ++++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 91f8ddb1d512..9365d59ac1d6 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -680,6 +680,7 @@ FOLIO_FLAG_FALSE(idle)
>   * uses this to skip redundant zeroing in post_alloc_hook().
>   */
>  __PAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
> +CLEARPAGEFLAG(Zeroed, zeroed, PF_NO_COMPOUND)
>  #define __PG_ZEROED (1UL << PG_zeroed)
>
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index edfc83571985..a90bca5317c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -941,10 +941,14 @@ static inline void __free_one_page(struct page *page,
>  	unsigned long buddy_pfn = 0;
>  	unsigned long combined_pfn;
>  	struct page *buddy;
> +	bool buddy_zeroed;
> +	bool page_zeroed;
>  	bool to_tail;
>
>  	VM_BUG_ON(!zone_is_initialized(zone));
> -	VM_BUG_ON_PAGE(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP, page);
> +	/* PG_zeroed (aliased to PG_private) is valid on free-list pages */
> +	VM_BUG_ON_PAGE(page->flags.f &

NIT: We don't add new VM_BUG_ON()s; please use VM_WARN_ON().

> +		       (PAGE_FLAGS_CHECK_AT_PREP & ~__PG_ZEROED), page);
>
>  	VM_BUG_ON(migratetype == -1);
>  	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
> @@ -979,6 +983,8 @@ static inline void __free_one_page(struct page *page,
>  				goto done_merging;
>  		}
>
> +		buddy_zeroed = PageZeroed(buddy);
> +
>  		/*
>  		 * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
>  		 * merge with it and move up one order.
> @@ -997,10 +1003,17 @@ static inline void __free_one_page(struct page *page,
>  			change_pageblock_range(buddy, order, migratetype);
>  		}
>
> +		page_zeroed = PageZeroed(page);
> +		__ClearPageZeroed(page);
> +		__ClearPageZeroed(buddy);
> +
>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;
>  		order++;
> +
> +		if (page_zeroed && buddy_zeroed)
> +			__SetPageZeroed(page);
>  	}
>
>  done_merging:
> --
> MST
>
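
As a rough model of the merge rule described in the quoted commit message
(keep PG_zeroed on the merged page only if both buddies had it), the following
userspace sketch may help. The struct and function names are hypothetical; it
does not use struct page or the real page-flag helpers.

```c
#include <stdbool.h>

/* Hypothetical model of a free block; not the kernel's struct page. */
struct free_block {
	unsigned int order;
	bool zeroed;
};

/*
 * Merge two buddies of equal order into one block of order + 1.
 * The result counts as zeroed only when both halves were zeroed,
 * so a later zero-skipping allocation never sees stale data.
 */
static struct free_block merge_buddies(struct free_block page,
				       struct free_block buddy)
{
	struct free_block merged = {
		.order = page.order + 1,
		.zeroed = page.zeroed && buddy.zeroed,
	};

	return merged;
}
```

This models only the flag propagation; the locking, pageblock and migratetype
handling in __free_one_page() are deliberately left out.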


* Re: [PATCH v10 18/37] mm: page_alloc: use aliasing checks instead of user_alloc_needs_zeroing
From: Lorenzo Stoakes @ 2026-06-08 11:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <d9b5fc983725665749024312f9ff5ede4f05e405.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:38:00AM -0400, Michael S. Tsirkin wrote:
> Replace user_alloc_needs_zeroing() with the direct aliasing checks
> (cpu_dcache_is_aliasing() || cpu_icache_is_aliasing()) in the
> post_alloc_hook aliasing guard.
>
> user_alloc_needs_zeroing() includes a !init_on_alloc term that
> means "allocator didn't zero this page."  But in this guard's
> context (!zeroed && !init && __GFP_ZERO), we already know the page
> is zero; init incorporates init_on_alloc via want_init_on_alloc().
> The only question left is whether the cache architecture needs
> the data re-zeroed through a congruent mapping, which is purely
> cpu_dcache_is_aliasing() || cpu_icache_is_aliasing().
>
> On non-aliasing architectures with init_on_free=true and
> init_on_alloc=false, this avoids a redundant re-zero of an
> already-zero page.
>
> Note on PowerPC: PowerPC overrides clear_user_page to call
> flush_dcache_page after clear_page, but on freshly allocated
> pages PG_dcache_clean is already clear (cleared by
> __free_pages_prepare), so flush_dcache_page is a no-op.
> Skipping this here thus has no effect.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6

This seems like an odd ordering of patches; can we group like changes
together?

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 45e824b1ec75..edfc83571985 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1880,7 +1880,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  	 */
>  	if (!zeroed && !init && (gfp_flags & __GFP_ZERO) &&
>  	    user_addr != USER_ADDR_NONE &&
> -	    user_alloc_needs_zeroing())
> +	    (cpu_dcache_is_aliasing() || cpu_icache_is_aliasing()))

Let's try to simplify things rather than adding ever-larger if conditionals.

It's now incredibly hard to track exactly what's going on here, and that is
bug-bait.
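
One possible shape for such a simplification, sketched in userspace C under
stated assumptions: the quoted condition is split into early returns inside a
named helper. The struct, its fields and the helper name are hypothetical; the
fields simply mirror the terms of the quoted condition one for one and make no
claim about what the patch series should actually do.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the terms used in the quoted condition. */
struct zero_ctx {
	bool zeroed;        /* the hook's 'zeroed' argument */
	bool init;          /* the hook's 'init' local at this point */
	bool gfp_zero;      /* gfp_flags & __GFP_ZERO */
	bool has_user_addr; /* user_addr != USER_ADDR_NONE */
	bool cache_aliases; /* cpu_dcache_is_aliasing() || cpu_icache_is_aliasing() */
};

/* Returns true in exactly the cases where the quoted code sets init = true. */
static bool alias_guard_forces_init(const struct zero_ctx *c)
{
	if (c->zeroed || c->init)
		return false;
	if (!c->gfp_zero || !c->has_user_addr)
		return false;

	return c->cache_aliases;
}
```

The logic is unchanged; the point is only that each early return can carry its
own comment, which is harder to do inside one multi-line conditional.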

>  		init = true;
>  	/*
>  	 * If memory is still not initialized, initialize it now.
> --
> MST
>

Thanks, Lorenzo


* Re: [PATCH v10 16/37] mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
From: Lorenzo Stoakes @ 2026-06-08 11:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <d2be78654e9dcc7cf5b8a542c2c8f331187c80c9.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:37:41AM -0400, Michael S. Tsirkin wrote:
> Same change as the previous patch but for alloc_swap_folio:

Please don't say 'same change as the previous patch' :) Explain what you're
doing here; it's a pain to have to go and check otherwise.

> pass vmf->address directly instead of ALIGN_DOWN(vmf->address, ...).
>
> Note: NUMA interleave is not affected by the raw address;
> the ilx calculation shifts addr >> PAGE_SHIFT >> order,
> dropping sub-page bits regardless of alignment.

You're expressing the same thing as the last patch differently, but then
eliding the other explanations that patch gave?

All the same questions I asked on the previous patch apply here too.

And also: if you've now made this a _requirement_, and things are broken
without it, then aren't these patches bisection hazards that should be
squashed into the change?

No patch should break anything at any point.

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  mm/memory.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 21f640674c4f..6c14b90f558e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4750,8 +4750,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  	/* Try allocating the highest of the remaining orders. */
>  	gfp = vma_thp_gfp_mask(vma);
>  	while (orders) {
> -		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> -		folio = vma_alloc_folio(gfp, order, vma, addr);
> +		folio = vma_alloc_folio(gfp, order, vma, vmf->address);
>  		if (folio) {
>  			if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
>  							    gfp, entry))
> --
> MST
>

Thanks, Lorenzo


* Re: [PATCH v10 15/37] mm: alloc_anon_folio: pass raw fault address to vma_alloc_folio
From: Lorenzo Stoakes @ 2026-06-08 11:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <2e931dc3daa76b57eaedd5b2d3d49f1075797252.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:37:25AM -0400, Michael S. Tsirkin wrote:
> Pass vmf->address directly instead of ALIGN_DOWN(vmf->address, ...).
> NUMA interleave is not affected: the ilx calculation in
> get_vma_policy() shifts addr >> PAGE_SHIFT >> order, which
> drops sub-page bits regardless of alignment. post_alloc_hook
> will use the raw address for cache-friendly zeroing via
> folio_zero_user().

I'm confused as to the justification for this? You're saying 'make change X,
it's safe because Y'. So the justification is now this post_alloc_hook thing.

But are you now creating a new requirement of vma_alloc_folio() that you must
specify the actual address we are faulting on, not an address within the folio
or the folio's base address?

(If that's a requirement, why is it?)

If so, you should update the vma_alloc_folio() description; 'virtual address
of the allocation' is not at all clear.

And if that _is_ a requirement, then are you sure all allocation paths are
correct?

I already see addr & HPAGE_PMD_MASK in vma_alloc_anon_folio_pmd() for
instance?

If it's not a requirement, why are we doing this? It's surely useless in
that case?
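
For reference, the alignment claim in the quoted commit message (that shifting
by PAGE_SHIFT and then by order gives the same result for the raw and the
aligned-down address) can be checked with a small userspace sketch. It models
only the shift described there, not the full interleave index computation, and
it assumes 4 KiB pages; all names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Round addr down to a multiple of size; size must be a power of two. */
static uint64_t sketch_align_down(uint64_t addr, uint64_t size)
{
	return addr & ~(size - 1);
}

/* Only the shift from the commit message: addr >> PAGE_SHIFT >> order. */
static uint64_t sketch_shifted_index(uint64_t addr, unsigned int order)
{
	return addr >> SKETCH_PAGE_SHIFT >> order;
}

/* True when raw and aligned-down addresses yield the same shifted value. */
static int sketch_same_index(uint64_t addr, unsigned int order)
{
	uint64_t folio_size = (uint64_t)1 << (SKETCH_PAGE_SHIFT + order);

	return sketch_shifted_index(addr, order) ==
	       sketch_shifted_index(sketch_align_down(addr, folio_size), order);
}
```

This supports only the narrow arithmetic point; it says nothing about whether
other users of the address argument tolerate an unaligned value, which is the
question raised above.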

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: cursor-agent:GPT-5.4-xhigh
> ---
>  mm/memory.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 86a973119bd4..21f640674c4f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5268,8 +5268,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  	/* Try allocating the highest of the remaining orders. */
>  	gfp = vma_thp_gfp_mask(vma);
>  	while (orders) {
> -		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);

If you're removing this usage, could you also fix the silly thing where we
declare this at function scope and then use it in separate branches? These
should always have been declared separately.

> -		folio = vma_alloc_folio(gfp, order, vma, addr);
> +		folio = vma_alloc_folio(gfp, order, vma, vmf->address);
>  		if (folio) {
>  			if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
>  				count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> --
> MST
>

Thanks, Lorenzo


* Re: [PATCH v10 14/37] mm: remove arch vma_alloc_zeroed_movable_folio overrides
From: Lorenzo Stoakes @ 2026-06-08 11:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Magnus Lindholm,
	Greg Ungerer, Geert Uytterhoeven
In-Reply-To: <c374d7a35a35eb4bd5ed85f87ff6885fae605fa9.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:37:08AM -0400, Michael S. Tsirkin wrote:
> Now that the generic vma_alloc_zeroed_movable_folio() uses
> __GFP_ZERO, the arch-specific macros on alpha, m68k, s390, and
> x86 that did the same thing are redundant.  Remove them.
>
> arm64 is not affected: it has a real function override that
> handles MTE tag zeroing, not just __GFP_ZERO.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Acked-by: Magnus Lindholm <linmag7@gmail.com>
> Acked-by: Greg Ungerer <gerg@linux-m68k.org>
> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Assisted-by: Claude:claude-opus-4-6
> Reviewed-by: Gregory Price <gourry@gourry.net>
> ---
>  arch/alpha/include/asm/page.h   | 3 ---
>  arch/m68k/include/asm/page_no.h | 3 ---
>  arch/s390/include/asm/page.h    | 3 ---
>  arch/x86/include/asm/page.h     | 3 ---
>  include/linux/highmem.h         | 8 +++++---
>  5 files changed, 5 insertions(+), 15 deletions(-)
>
> diff --git a/arch/alpha/include/asm/page.h b/arch/alpha/include/asm/page.h
> index 59d01f9b77f6..4327029cd660 100644
> --- a/arch/alpha/include/asm/page.h
> +++ b/arch/alpha/include/asm/page.h
> @@ -12,9 +12,6 @@
>
>  extern void clear_page(void *page);
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
> -
>  extern void copy_page(void * _to, void * _from);
>  #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
>
> diff --git a/arch/m68k/include/asm/page_no.h b/arch/m68k/include/asm/page_no.h
> index d2532bc407ef..f511b763a235 100644
> --- a/arch/m68k/include/asm/page_no.h
> +++ b/arch/m68k/include/asm/page_no.h
> @@ -12,9 +12,6 @@ extern unsigned long memory_end;
>
>  #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
> -
>  #define __pa(vaddr)		((unsigned long)(vaddr))
>  #define __va(paddr)		((void *)((unsigned long)(paddr)))
>
> diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
> index 56da819a79e6..e995d2a413f9 100644
> --- a/arch/s390/include/asm/page.h
> +++ b/arch/s390/include/asm/page.h
> @@ -67,9 +67,6 @@ static inline void copy_page(void *to, void *from)
>
>  #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
> -
>  #ifdef CONFIG_STRICT_MM_TYPECHECKS
>  #define STRICT_MM_TYPECHECKS
>  #endif
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 416dc88e35c1..92fa975b46f3 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -28,9 +28,6 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
>  	copy_page(to, from);
>  }
>
> -#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> -	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)
> -
>  #ifndef __pa
>  #define __pa(x)		__phys_addr((unsigned long)(x))
>  #endif
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index 8b0afaabbc6e..642718a50c27 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -303,7 +303,6 @@ static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
>  #endif
>  }
>
> -#ifndef vma_alloc_zeroed_movable_folio

We're now defining this function unconditionally even though arm64 overrides it?

>  /**
>   * vma_alloc_zeroed_movable_folio - Allocate a zeroed page for a VMA.
>   * @vma: The VMA the page is to be allocated for.
> @@ -317,12 +316,15 @@ static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
>   * we are out of memory.
>   */
>  static inline
> -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
> +struct folio *vma_alloc_zeroed_movable_folio_noprof(struct vm_area_struct *vma,
>  				   unsigned long vaddr)
>  {
> -	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
> +	return vma_alloc_folio_noprof(GFP_HIGHUSER_MOVABLE | __GFP_ZERO,
>  			      0, vma, vaddr);

This whole change seems unnecessary?

>  }
> +#ifndef vma_alloc_zeroed_movable_folio
> +#define vma_alloc_zeroed_movable_folio(...) \
> +	alloc_hooks(vma_alloc_zeroed_movable_folio_noprof(__VA_ARGS__))
>  #endif

I don't know why we need to add more of this alloc_hooks() dance when we could
just do:

#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
	vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr)

Like the existing arch stuff?

>
>  static inline void clear_highpage(struct page *page)
> --
> MST
>

Thanks, Lorenzo

