Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 21:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608170646-mutt-send-email-mst@kernel.org>

On Mon, Jun 08, 2026 at 05:16:53PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> > 
> > As a start:
> > 
> >   1) the user_addr and zeroing piece seems like a discrete
> >      improvement worthy of its own set - aside from end goal.
> > 
> >      This is needed by your patch set, but was requested to
> >      try to push us towards a more reasonable pattern for
> >      folio_zero_user().
> 
> What I worry about is people can't agree what api they want.
>

Oh that's just our base state of existence.  We mostly agree that
all APIs are bad in some way and we don't want any of them :P

What you're looking for is to get people to agree to the
least-offensive, least-worst option :]

I don't think we're far off from that.  I suggest doing as Zi said and
start a [DISCUSSION] thread on specifically this and lay out the needs
and wants and design issues that you've learned from the past set of
versions and continue the discussion there.

It helps to take some snippets from your set to lay out what you've
learned and explain why you need the folio_user_zero() stuff to get from
A->Z, and then let maintainers hash out whether that should live in
post_alloc_hook or new interfaces (or outside page_alloc.c altogether).

> I don't mind trying all kind of approaches, but it seems to
> be past the point where people feel it's costing too much of
> their time with all of these revisions.
> 

People are still commenting, so I don't think you've gotten there yet.
I think the rate of revision is what's costing too much attention.

You'd save yourself some revisions by taking the attention you have
right now and starting the discussion thread (and consider submitting
the topic to LPC if that's something interests you!).

All this is to say you're doing fine, just keep on keepin' on. Maybe
pivot your approach from iterations to discussion for a bit until the
opinions settle.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 21:16 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aicruns_4nOx0hD-@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 04:30:46PM -0400, Michael S. Tsirkin wrote:
> > > 
> > > Please consider that this is arguably the most fundamental interface in
> > > in all of mm/.  All we're doing is going through the process of figuring
> > > out what changes here are reasonable while trying to meet your goal.
> > > 
> > > ~Gregory
> > 
> > I don't mind discarding all of this and doing something else completely,
> > but I dislike it that multiple people are apparently now angry that I
> 
> I wouldn't say anyone is angry, I think most folks are tripping on the
> complexity of the set - which has increased (at the request of others).
> 
> > don't address all the contradictory comments at the same time.
> 
> Such is life in mm/ :] - it's hard to known the entire state machine,
> and sometimes the contradictions aren't even wrong.
> 
> > I thought just sending a patchset to show how the result looks like
> > is easier than arguing about architecture, and would be helpful.
> > 
> 
> Notice: When folks argue implementation, they largely agree the
> end goal is useful.  I haven't seen anyone say your problem isn't
> real or that it shouldn't be addressed - just opinions on a particular
> path forward (which is utterly normal here).
> 
> Getting the right incantation of an API is really hard when the
> API being changes is something that underpins the entire kernel.
> 
> > I'm not pushing any of the mm rework, I was asked to do it,
> > myself I just want the ridiculously effective optimization in there.
> > 
> 
> As Lorenzo, David, and Matthew have said, the focus of the patch set
> does seem to have become unweildy (in part at the request of folks
> asking something be done differently).
> 
> What needs to be done now is to break it up into some pull-ahead 
> sets that are easier to review.  Having a brief RFC doc that lays out
> the set of patches might help clarify the confusion going on here,
> especially as new folks come in to ask "What's all this about?".
> 
> As a start:
> 
>   1) the user_addr and zeroing piece seems like a discrete
>      improvement worthy of its own set - aside from end goal.
> 
>      This is needed by your patch set, but was requested to
>      try to push us towards a more reasonable pattern for
>      folio_zero_user().

What I worry about is people can't agree what api they want.

Simply not being an mm maintainer, I don't really have the
perspective of what changes are envisioned down the road
and so what api makes sense for you guys.

I don't mind trying all kind of approaches, but it seems to
be past the point where people feel it's costing too much of
their time with all of these revisions.


>   2) There are a handful of patches that seem able to pull-ahead
>      (some of the mempolicy stuff), either as prep work for #1 or
>      just on their own.
> 
>      Some of these patches seem like latent bugs that aren't hit by
>      current users, but do seem to be doing something subtly wrong?

Right.

>   3) the final virtio piece seems like it should be entirely separate
>      once the core pieces are done.
> 
> It's not uncommon for core changes like this to take multiple prepatory
> sets over many major versions before the final feature lands.
> 
> ~Gregory


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 21:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli
In-Reply-To: <20260608170348-mutt-send-email-mst@kernel.org>

On 8 Jun 2026, at 17:04, Michael S. Tsirkin wrote:

> On Mon, Jun 08, 2026 at 04:40:15PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 16:33, Michael S. Tsirkin wrote:
>>
>>> On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
>>>> On 8 Jun 2026, at 15:59, Gregory Price wrote:
>>>>
>>>>> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>>>>>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>>>>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>>>>>> make life easier by actually breaking things out.
>>>>>>>
>>>>>>> Like:
>>>>>>>
>>>>>>> enum alloc_context_type {
>>>>>>> 	KERNEL_ALLOCATION,
>>>>>>> 	USER_MAPPED_ALLOCATION,
>>>>>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>>>>>> 	/* Perhaps some other states we want to encode? */
>>>>>>> };
>>>>>>>
>>>>>>> struct alloc_context {
>>>>>>> 	...
>>>>>>>
>>>>>>> 	enum alloc_context_type type;
>>>>>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>>>>>
>>>>>>> 	// Maybe something suggesting context or whether we init before in some
>>>>>>> 	// cases?
>>>>>>> };
>>>>>>
>>>>>> Ugh, please, no.  As I suggested last time I commented on this
>>>>>> trainwreck of a series, lift the zeroing functionality from
>>>>>> alloc_frozen_pages() into its callers.
>>>>>
>>>>> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
>>>>> wrapper that does the zeroing at the end before return, and then killing
>>>>> the post hook nonsense associated with it in the first place.
>>>>
>>>> This means it is going to be a multi-step optimization. This is probably
>>>> step 1.
>>>>
>>>>>
>>>>> None of this resolves the user address annoyance which is needed on some
>>>>> archs for cache flushing.  Whether anyone agrees that the page allocator
>>>>> should be responsible for this particular operation - open debate.
>>>>
>>>> This is probably step 2. But does the virtio use case apply to these
>>>> archs? Does the performance matter for them? If not, maybe this part can
>>>> be left as a TODO.
>>>>
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>>
>>> I doubt it. But I don't get what's proposed, the code that we
>>> have to modify is arch independent?
>>
>> Change user_alloc_needs_zeroing() to only check address aliasing even
>> if that can cause double zeroing for virtio.
>>
>> Best Regards,
>> Yan, Zi
>
> Ah. I started with exactly that in v1/v2. It's a simple approach.
>
> But mm maintainers said no, user_alloc_needs_zeroing is a hack and
> I must not add to it.

Got it. It sounds that you now get conflicting ideas. Maybe you should
start a [DISCUSSION] thread that presents the high level idea of what
you want to achieve and all the ideas you got from the reviews, so that
people in this thread can have the big picture and come up a consensus
before you send another version.

Thank you for patiently replying my comments, since those points
apparently have been discussed in prior submissions.

Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-08 21:04 UTC (permalink / raw)
  To: Zi Yan
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli
In-Reply-To: <BD0C8760-BC10-4BF0-9EE3-9B0DAF0D977A@nvidia.com>

On Mon, Jun 08, 2026 at 04:40:15PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 16:33, Michael S. Tsirkin wrote:
> 
> > On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
> >> On 8 Jun 2026, at 15:59, Gregory Price wrote:
> >>
> >>> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> >>>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> >>>>> But instead of overloading user_addr to indicate all kinds of things, instead
> >>>>> make life easier by actually breaking things out.
> >>>>>
> >>>>> Like:
> >>>>>
> >>>>> enum alloc_context_type {
> >>>>> 	KERNEL_ALLOCATION,
> >>>>> 	USER_MAPPED_ALLOCATION,
> >>>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> >>>>> 	/* Perhaps some other states we want to encode? */
> >>>>> };
> >>>>>
> >>>>> struct alloc_context {
> >>>>> 	...
> >>>>>
> >>>>> 	enum alloc_context_type type;
> >>>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> >>>>>
> >>>>> 	// Maybe something suggesting context or whether we init before in some
> >>>>> 	// cases?
> >>>>> };
> >>>>
> >>>> Ugh, please, no.  As I suggested last time I commented on this
> >>>> trainwreck of a series, lift the zeroing functionality from
> >>>> alloc_frozen_pages() into its callers.
> >>>
> >>> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
> >>> wrapper that does the zeroing at the end before return, and then killing
> >>> the post hook nonsense associated with it in the first place.
> >>
> >> This means it is going to be a multi-step optimization. This is probably
> >> step 1.
> >>
> >>>
> >>> None of this resolves the user address annoyance which is needed on some
> >>> archs for cache flushing.  Whether anyone agrees that the page allocator
> >>> should be responsible for this particular operation - open debate.
> >>
> >> This is probably step 2. But does the virtio use case apply to these
> >> archs? Does the performance matter for them? If not, maybe this part can
> >> be left as a TODO.
> >>
> >>
> >> Best Regards,
> >> Yan, Zi
> >
> > I doubt it. But I don't get what's proposed, the code that we
> > have to modify is arch independent?
> 
> Change user_alloc_needs_zeroing() to only check address aliasing even
> if that can cause double zeroing for virtio.
> 
> Best Regards,
> Yan, Zi

Ah. I started with exactly that in v1/v2. It's a simple approach.

But mm maintainers said no, user_alloc_needs_zeroing is a hack and
I must not add to it.

-- 
MST


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-08 21:03 UTC (permalink / raw)
  To: Zi Yan
  Cc: Gregory Price, David Hildenbrand (Arm), Lorenzo Stoakes,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <D754DEDB-3A3A-4F6E-BF64-79086F0A6467@nvidia.com>

On Mon, Jun 08, 2026 at 04:37:59PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 16:25, Gregory Price wrote:
> 
> > On Mon, Jun 08, 2026 at 03:52:20PM -0400, Zi Yan wrote:
> >> On 8 Jun 2026, at 15:43, Gregory Price wrote:
> >>
> >>> On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
> >>>> On 6/8/26 17:27, Zi Yan wrote:
> >>>>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> >>>>>
> >>>>> Or should we defer zeroing after a page is returned from allocator? So that
> >>>>> user_addr does not need to be passed through irrelevant allocation APIs.
> >>>>> Something like:
> >>>>>
> >>>>> alloc_page_wrapper(gfp, order, user_addr)
> >>>>> {
> >>>>> 	page = alloc_pages();
> >>>>> 	if (gfp & __GFP_ZERO)
> >>>>> 		clear_page(page);
> >>>>> }
> >>>>>
> >>>>
> >>>> Not really sure what's best here. I think we'd want to limit the lifting to some
> >>>> internal API, so it cannot easily be messed up by random kernel code calling
> >>>> into the wrong API and not getting pages cleared.
> >>>>
> >>>
> >>> We're a bit in circles on this.  We discussed explicit interfaces a few
> >>> months back and the trade off was:
> >>>
> >>> a) add user_addr to the existing API and cause churn
> >>>
> >>> or
> >>>
> >>> b) add special interface like above
> >>>    increase the buddy surface
> >>>    leaves open the ability for users to get it wrong easily
> >>>
> >>> If we forget VMs for a moment and break this step out separately, the
> >>> core question is whether page_alloc.c is the right place to be calling
> >>> the folio_user_zero() or whatever it is.
> >>
> >> page_alloc.c calling folio_user_zero() is fine, but my question is
> >> whether we should do the zeroing inside post_alloc_hook(), part of
> >> allocation.
> >>
> >> What I propose is to lift __GFP_ZERO up as much as possible,
> >> so that most of allocation code does not need to care about it.
> >> We do the zeroing right before the page is returned to callers.
> >>
> >
> > essentially we end up with something like
> >
> > alloc_frozen_...(..., gfp)
> > {
> >   folio = whatever(..., gfp);
> >   if (gfp & __GFP_ZERO)
> >     folio_zero(folio, -1); /* don't do cache flush part */
> > }
> >
> > alloc_frozen_user_...(..., gfp, user_addr)
> > {
> >   folio = whatever(..., gfp);
> >   if (gfp & __GFP_ZERO)
> >     folio_zero(folio, user_addr); /* do cache flush part */
> > }
> >
> > The downside of this is obvious: it's easy for developers to get this
> > wrong and call the non-user interface for user-bound allocations and
> > miss the cache flush (that is only needed on some archs).
> >
> > Not saying that's a deal breaker, but it's something to chew on.
> 
> I agree that misuse can cause trouble. But if we do the churn approach,
> what prevents developer from doing alloc_frozen(..., user_addr = -1)
> and using the returned page for userspace? It is possible the allocated
> page can be exported to userspace later.
> 
> BTW, that cache flush thing is fragile even today,

Probably arch dependent. On arm32, I think if you miss the flush, then
PG_dcache_clean will be clear and then you get a perf hit but
it's still correct. Didn't check others.

> you probably can
> do alloc_page() + vm_insert() to get a page without doing proper flush
> and export it to userspace. There seems to be no mechanism to
> prevent that.
> 
> Best Regards,
> Yan, Zi

Because maybe you want to expose data to userspace?


-- 
MST


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 20:56 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <D754DEDB-3A3A-4F6E-BF64-79086F0A6467@nvidia.com>

On Mon, Jun 08, 2026 at 04:37:59PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 16:25, Gregory Price wrote:
> >
> > essentially we end up with something like
> >
> > alloc_frozen_...(..., gfp)
> > {
> >   folio = whatever(..., gfp);
> >   if (gfp & __GFP_ZERO)
> >     folio_zero(folio, -1); /* don't do cache flush part */
> > }
> >
> > alloc_frozen_user_...(..., gfp, user_addr)
> > {
> >   folio = whatever(..., gfp);
> >   if (gfp & __GFP_ZERO)
> >     folio_zero(folio, user_addr); /* do cache flush part */
> > }
> >
> > The downside of this is obvious: it's easy for developers to get this
> > wrong and call the non-user interface for user-bound allocations and
> > miss the cache flush (that is only needed on some archs).
> >
> > Not saying that's a deal breaker, but it's something to chew on.
> 
> I agree that misuse can cause trouble. But if we do the churn approach,
> what prevents developer from doing alloc_frozen(..., user_addr = -1)
> and using the returned page for userspace? It is possible the allocated
> page can be exported to userspace later.
> 
> BTW, that cache flush thing is fragile even today, you probably can
> do alloc_page() + vm_insert() to get a page without doing proper flush
> and export it to userspace. There seems to be no mechanism to
> prevent that.
>

Oh of course, I said that elsewhere.  It leaves us in a spot where we're
not technically worse than we were yesterday - except that the surface
of the buddy has increased (developers need to know about 2 APIs instead
of 1).  That carries maintenance burden (if something in alloc_frozen()
changes, something in alloc_frozen_user() may need to change).

There's a careful dance here.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 20:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608161810-mutt-send-email-mst@kernel.org>

On Mon, Jun 08, 2026 at 04:30:46PM -0400, Michael S. Tsirkin wrote:
> > 
> > Please consider that this is arguably the most fundamental interface in
> > in all of mm/.  All we're doing is going through the process of figuring
> > out what changes here are reasonable while trying to meet your goal.
> > 
> > ~Gregory
> 
> I don't mind discarding all of this and doing something else completely,
> but I dislike it that multiple people are apparently now angry that I

I wouldn't say anyone is angry, I think most folks are tripping on the
complexity of the set - which has increased (at the request of others).

> don't address all the contradictory comments at the same time.

Such is life in mm/ :] - it's hard to known the entire state machine,
and sometimes the contradictions aren't even wrong.

> I thought just sending a patchset to show how the result looks like
> is easier than arguing about architecture, and would be helpful.
> 

Notice: When folks argue implementation, they largely agree the
end goal is useful.  I haven't seen anyone say your problem isn't
real or that it shouldn't be addressed - just opinions on a particular
path forward (which is utterly normal here).

Getting the right incantation of an API is really hard when the
API being changes is something that underpins the entire kernel.

> I'm not pushing any of the mm rework, I was asked to do it,
> myself I just want the ridiculously effective optimization in there.
> 

As Lorenzo, David, and Matthew have said, the focus of the patch set
does seem to have become unweildy (in part at the request of folks
asking something be done differently).

What needs to be done now is to break it up into some pull-ahead 
sets that are easier to review.  Having a brief RFC doc that lays out
the set of patches might help clarify the confusion going on here,
especially as new folks come in to ask "What's all this about?".

As a start:

  1) the user_addr and zeroing piece seems like a discrete
     improvement worthy of its own set - aside from end goal.

     This is needed by your patch set, but was requested to
     try to push us towards a more reasonable pattern for
     folio_zero_user().

  2) There are a handful of patches that seem able to pull-ahead
     (some of the mempolicy stuff), either as prep work for #1 or
     just on their own.

     Some of these patches seem like latent bugs that aren't hit by
     current users, but do seem to be doing something subtly wrong?

  3) the final virtio piece seems like it should be entirely separate
     once the core pieces are done.

It's not uncommon for core changes like this to take multiple prepatory
sets over many major versions before the final feature lands.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 20:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli
In-Reply-To: <20260608163257-mutt-send-email-mst@kernel.org>

On 8 Jun 2026, at 16:33, Michael S. Tsirkin wrote:

> On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 15:59, Gregory Price wrote:
>>
>>> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>>>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>>>> make life easier by actually breaking things out.
>>>>>
>>>>> Like:
>>>>>
>>>>> enum alloc_context_type {
>>>>> 	KERNEL_ALLOCATION,
>>>>> 	USER_MAPPED_ALLOCATION,
>>>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>>>> 	/* Perhaps some other states we want to encode? */
>>>>> };
>>>>>
>>>>> struct alloc_context {
>>>>> 	...
>>>>>
>>>>> 	enum alloc_context_type type;
>>>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>>>
>>>>> 	// Maybe something suggesting context or whether we init before in some
>>>>> 	// cases?
>>>>> };
>>>>
>>>> Ugh, please, no.  As I suggested last time I commented on this
>>>> trainwreck of a series, lift the zeroing functionality from
>>>> alloc_frozen_pages() into its callers.
>>>
>>> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
>>> wrapper that does the zeroing at the end before return, and then killing
>>> the post hook nonsense associated with it in the first place.
>>
>> This means it is going to be a multi-step optimization. This is probably
>> step 1.
>>
>>>
>>> None of this resolves the user address annoyance which is needed on some
>>> archs for cache flushing.  Whether anyone agrees that the page allocator
>>> should be responsible for this particular operation - open debate.
>>
>> This is probably step 2. But does the virtio use case apply to these
>> archs? Does the performance matter for them? If not, maybe this part can
>> be left as a TODO.
>>
>>
>> Best Regards,
>> Yan, Zi
>
> I doubt it. But I don't get what's proposed, the code that we
> have to modify is arch independent?

Change user_alloc_needs_zeroing() to only check address aliasing even
if that can cause double zeroing for virtio.

Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 20:37 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiclJGgB07L3rV-P@gourry-fedora-PF4VCD3F>

On 8 Jun 2026, at 16:25, Gregory Price wrote:

> On Mon, Jun 08, 2026 at 03:52:20PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 15:43, Gregory Price wrote:
>>
>>> On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
>>>> On 6/8/26 17:27, Zi Yan wrote:
>>>>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
>>>>>
>>>>> Or should we defer zeroing after a page is returned from allocator? So that
>>>>> user_addr does not need to be passed through irrelevant allocation APIs.
>>>>> Something like:
>>>>>
>>>>> alloc_page_wrapper(gfp, order, user_addr)
>>>>> {
>>>>> 	page = alloc_pages();
>>>>> 	if (gfp & __GFP_ZERO)
>>>>> 		clear_page(page);
>>>>> }
>>>>>
>>>>
>>>> Not really sure what's best here. I think we'd want to limit the lifting to some
>>>> internal API, so it cannot easily be messed up by random kernel code calling
>>>> into the wrong API and not getting pages cleared.
>>>>
>>>
>>> We're a bit in circles on this.  We discussed explicit interfaces a few
>>> months back and the trade off was:
>>>
>>> a) add user_addr to the existing API and cause churn
>>>
>>> or
>>>
>>> b) add special interface like above
>>>    increase the buddy surface
>>>    leaves open the ability for users to get it wrong easily
>>>
>>> If we forget VMs for a moment and break this step out separately, the
>>> core question is whether page_alloc.c is the right place to be calling
>>> the folio_user_zero() or whatever it is.
>>
>> page_alloc.c calling folio_user_zero() is fine, but my question is
>> whether we should do the zeroing inside post_alloc_hook(), part of
>> allocation.
>>
>> What I propose is to lift __GFP_ZERO up as much as possible,
>> so that most of allocation code does not need to care about it.
>> We do the zeroing right before the page is returned to callers.
>>
>
> essentially we end up with something like
>
> alloc_frozen_...(..., gfp)
> {
>   folio = whatever(..., gfp);
>   if (gfp & __GFP_ZERO)
>     folio_zero(folio, -1); /* don't do cache flush part */
> }
>
> alloc_frozen_user_...(..., gfp, user_addr)
> {
>   folio = whatever(..., gfp);
>   if (gfp & __GFP_ZERO)
>     folio_zero(folio, user_addr); /* do cache flush part */
> }
>
> The downside of this is obvious: it's easy for developers to get this
> wrong and call the non-user interface for user-bound allocations and
> miss the cache flush (that is only needed on some archs).
>
> Not saying that's a deal breaker, but it's something to chew on.

I agree that misuse can cause trouble. But if we do the churn approach,
what prevents developer from doing alloc_frozen(..., user_addr = -1)
and using the returned page for userspace? It is possible the allocated
page can be exported to userspace later.

BTW, that cache flush thing is fragile even today, you probably can
do alloc_page() + vm_insert() to get a page without doing proper flush
and export it to userspace. There seems to be no mechanism to
prevent that.

Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-08 20:33 UTC (permalink / raw)
  To: Zi Yan
  Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli
In-Reply-To: <D1CF67EF-2CF1-45F8-9265-70A528C1CB01@nvidia.com>

On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 15:59, Gregory Price wrote:
> 
> > On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> >> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> >>> But instead of overloading user_addr to indicate all kinds of things, instead
> >>> make life easier by actually breaking things out.
> >>>
> >>> Like:
> >>>
> >>> enum alloc_context_type {
> >>> 	KERNEL_ALLOCATION,
> >>> 	USER_MAPPED_ALLOCATION,
> >>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> >>> 	/* Perhaps some other states we want to encode? */
> >>> };
> >>>
> >>> struct alloc_context {
> >>> 	...
> >>>
> >>> 	enum alloc_context_type type;
> >>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> >>>
> >>> 	// Maybe something suggesting context or whether we init before in some
> >>> 	// cases?
> >>> };
> >>
> >> Ugh, please, no.  As I suggested last time I commented on this
> >> trainwreck of a series, lift the zeroing functionality from
> >> alloc_frozen_pages() into its callers.
> >
> > This sort of just implies writing the "alloc_frozen_zeroed_pages()"
> > wrapper that does the zeroing at the end before return, and then killing
> > the post hook nonsense associated with it in the first place.
> 
> This means it is going to be a multi-step optimization. This is probably
> step 1.
> 
> >
> > None of this resolves the user address annoyance which is needed on some
> > archs for cache flushing.  Whether anyone agrees that the page allocator
> > should be responsible for this particular operation - open debate.
> 
> This is probably step 2. But does the virtio use case apply to these
> archs? Does the performance matter for them? If not, maybe this part can
> be left as a TODO.
> 
> 
> Best Regards,
> Yan, Zi

I doubt it. But I don't get what's proposed, the code that we
have to modify is arch independent?


^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 20:30 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aicjNRrHYOOBt0Hx@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 04:16:53PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 03:45:58PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 11:53:40AM -0400, Gregory Price wrote:
> > > 
> > > If `user_addr` is now implying anything other than exactly: "This needs
> > > to be zeroed / caches flushed", then this is bad.
> > > 
> > > ~Gregory
> > 
> > Well if you do folio_zero_user in a non sleepable context then things
> > are not going to work. So combining e.g. GFP_ATOMIC and GFP_ZERO and
> > user_addr all together is not a good idea.
> >
> 
> Can you say whether (GFP_ATOMIC | GFP_ZERO) w/o user_addr has the same
> issue?

I don't think it is because it does not call folio_zero_user, right?

>  If not, then this subtle complexity is now a tripping hazard.

Yes.

> Is there some combination of arguments here that should just outright
> fail if a user attempts it?

__GFP_DIRECT_RECLAIM at least.

> >
> > You are saying it's bad? It's pretty fundamental to the idea of moving
> > zeroing into the allocator, I feel.
> > 
> 
> I'm saying having to infer that safety state from the cobbling of those
> things together is not a good pattern (at least as-is).

Understood. Don't have a better idea, yet.

> If the introduction of user_addr into the mix is the thing that causes
> us to have to infer safety, then there's an argument the page allocator
> shouldn't handle that operation (in this case: user_addr cache flush).

It's not just the flush, we are also trying to use that to optimize
zeroing.

> 
> Please consider that this is arguably the most fundamental interface in
> in all of mm/.  All we're doing is going through the process of figuring
> out what changes here are reasonable while trying to meet your goal.
> 
> ~Gregory

I don't mind discarding all of this and doing something else completely,
but I dislike it that multiple people are apparently now angry that I
don't address all the contradictory comments at the same time.
I thought just sending a patchset to show how the result looks like
is easier than arguing about architecture, and would be helpful.

I'm not pushing any of the mm rework, I was asked to do it,
myself I just want the ridiculously effective optimization in there.


-- 
MST


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 20:25 UTC (permalink / raw)
  To: Zi Yan
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <8BB8D6E5-8644-4CBC-BAF1-0AA19702E042@nvidia.com>

On Mon, Jun 08, 2026 at 03:52:20PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 15:43, Gregory Price wrote:
> 
> > On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/8/26 17:27, Zi Yan wrote:
> >>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> >>>
> >>> Or should we defer zeroing after a page is returned from allocator? So that
> >>> user_addr does not need to be passed through irrelevant allocation APIs.
> >>> Something like:
> >>>
> >>> alloc_page_wrapper(gfp, order, user_addr)
> >>> {
> >>> 	page = alloc_pages();
> >>> 	if (gfp & __GFP_ZERO)
> >>> 		clear_page(page);
> >>> }
> >>>
> >>
> >> Not really sure what's best here. I think we'd want to limit the lifting to some
> >> internal API, so it cannot easily be messed up by random kernel code calling
> >> into the wrong API and not getting pages cleared.
> >>
> >
> > We're a bit in circles on this.  We discussed explicit interfaces a few
> > months back and the trade off was:
> >
> > a) add user_addr to the existing API and cause churn
> >
> > or
> >
> > b) add special interface like above
> >    increase the buddy surface
> >    leaves open the ability for users to get it wrong easily
> >
> > If we forget VMs for a moment and break this step out separately, the
> > core question is whether page_alloc.c is the right place to be calling
> > the folio_user_zero() or whatever it is.
> 
> page_alloc.c calling folio_user_zero() is fine, but my question is
> whether we should do the zeroing inside post_alloc_hook(), part of
> allocation.
> 
> What I propose is to lift __GFP_ZERO up as much as possible,
> so that most of allocation code does not need to care about it.
> We do the zeroing right before the page is returned to callers.
>

essentially we end up with something like

alloc_frozen_...(..., gfp)
{
  folio = whatever(..., gfp);
  if (gfp & __GFP_ZERO)
    folio_zero(folio, -1); /* don't do cache flush part */
}

alloc_frozen_user_...(..., gfp, user_addr)
{
  folio = whatever(..., gfp);
  if (gfp & __GFP_ZERO)
    folio_zero(folio, user_addr); /* do cache flush part */
}

The downside of this is obvious: it's easy for developers to get this
wrong and call the non-user interface for user-bound allocations and
miss the cache flush (that is only needed on some archs).

Not saying that's a deal breaker, but it's something to chew on.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 20:21 UTC (permalink / raw)
  To: Gregory Price
  Cc: Matthew Wilcox, Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli
In-Reply-To: <aicfERTP3H6AndOs@gourry-fedora-PF4VCD3F>

On 8 Jun 2026, at 15:59, Gregory Price wrote:

> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>> make life easier by actually breaking things out.
>>>
>>> Like:
>>>
>>> enum alloc_context_type {
>>> 	KERNEL_ALLOCATION,
>>> 	USER_MAPPED_ALLOCATION,
>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>> 	/* Perhaps some other states we want to encode? */
>>> };
>>>
>>> struct alloc_context {
>>> 	...
>>>
>>> 	enum alloc_context_type type;
>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>
>>> 	// Maybe something suggesting context or whether we init before in some
>>> 	// cases?
>>> };
>>
>> Ugh, please, no.  As I suggested last time I commented on this
>> trainwreck of a series, lift the zeroing functionality from
>> alloc_frozen_pages() into its callers.
>
> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
> wrapper that does the zeroing at the end before return, and then killing
> the post hook nonsense associated with it in the first place.

This means it is going to be a multi-step optimization. This is probably
step 1.

>
> None of this resolves the user address annoyance which is needed on some
> archs for cache flushing.  Whether anyone agrees that the page allocator
> should be responsible for this particular operation - open debate.

This is probably step 2. But does the virtio use case apply to these
archs? Does the performance matter for them? If not, maybe this part can
be left as a TODO.


Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-08 20:17 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin
In-Reply-To: <aibMs9DXuhH_5F2Z@lucifer>

On Mon, Jun 08, 2026 at 03:14:51PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 09:48:34AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 10:43:21AM +0100, Lorenzo Stoakes wrote:
> > > On Mon, Jun 08, 2026 at 04:34:23AM -0400, Michael S. Tsirkin wrote:
> > > > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > > > update to page->flags can race with non-atomic flag operations
> > > > that run under zone->lock in the buddy allocator.
> > > >
> > > > In particular, __free_pages_prepare() does:
> > > >
> > > >     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > >
> > > > This non-atomic read-modify-write, while correctly excluding
> > > > __PG_HWPOISON from the mask, can still lose a concurrent
> > > > TestSetPageHWPoison if the read happens before the poison bit
> > > > is set and the write happens after.  Follow-up patches in this
> > > > series add similar non-atomic flag operations as well.
> > > >
> > > > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > > > around ClearPageHWPoison in the retry path.  This
> > > > serializes with all buddy flag manipulation.  The cost is
> > > > negligible: one lock/unlock in an extremely rare path
> > > > (hardware memory errors).
> > > >
> > > > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > > > in this file operate on pages already removed from the buddy
> > > > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > > > need zone->lock protection.
> > > >
> > > > Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >
> > > Can we have Fixes: and Cc: stable and also send this separately please?
> > >
> > > These patches seem like unrelated fixups that you've discovered along the way,
> > > and don't belong as part of the already rather large series, unless I'm missing
> > > something here.
> > >
> > > Thanks, Lorenzo
> >
> > I think you are mising that they are a dependency, not unrelated.
> 
> Then say so.
> 
> > For example, this issue gets worse with the patchset as there are more
> > places that manipulate flags without atomics. No?
> 
> It's your job to make that case, not mine.
> 
> >
> >
> > You are welcome to send this to stable, but I think stable rules
> > preclude theoretical bugfixes.
> 
> It's a dependency but also theoretical?

As in, the race is exteremely hard to trigger and I have no idea if it
triggers for anyone, but it's obvious from reading the code that
theoretically it exists? Yes.

> >
> > As for Fixes: the issue has been there for decades. I wouldn't know
> > what to attribute it for.
> 
> Again, your job.

Alright, if you insist:

Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")

now everyone running 2.6 kernels will backport this fix, I presume.


> >
> >
> > I guess I could send these separately, too, why not. Not sure
> > what this accomplishes, but hey.  But is that an ack? You want
> > this fix merged even before the feature?
> 
> I already made the case as to why, as have other maintainers.
> 
> If you need to review what an ack looks like please consult
> https://docs.kernel.org/process/5.Posting.html
> 
> Thanks, Lorenzo

I am merely asking if you want this patch in the set including
all these nits I had to fix.

-- 
MST


^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 20:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608154354-mutt-send-email-mst@kernel.org>

On Mon, Jun 08, 2026 at 03:45:58PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 11:53:40AM -0400, Gregory Price wrote:
> > 
> > If `user_addr` is now implying anything other than exactly: "This needs
> > to be zeroed / caches flushed", then this is bad.
> > 
> > ~Gregory
> 
> Well if you do folio_zero_user in a non sleepable context then things
> are not going to work. So combining e.g. GFP_ATOMIC and GFP_ZERO and
> user_addr all together is not a good idea.
>

Can you say whether (GFP_ATOMIC | GFP_ZERO) w/o user_addr has the same
issue?  If not, then this subtle complexity is now a tripping hazard.

Is there some combination of arguments here that should just outright
fail if a user attempts it?

>
> You are saying it's bad? It's pretty fundamental to the idea of moving
> zeroing into the allocator, I feel.
> 

I'm saying having to infer that safety state from the cobbling of those
things together is not a good pattern (at least as-is).

If the introduction of user_addr into the mix is the thing that causes
us to have to infer safety, then there's an argument the page allocator
shouldn't handle that operation (in this case: user_addr cache flush).

Please consider that this is arguably the most fundamental interface in
in all of mm/.  All we're doing is going through the process of figuring
out what changes here are reasonable while trying to meet your goal.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 16/37] mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
From: Michael S. Tsirkin @ 2026-06-08 20:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibm5OFmiFK6lCqH@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 11:59:32AM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 12:37:20PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:37:41AM -0400, Michael S. Tsirkin wrote:
> > > Same change as the previous patch but for alloc_swap_folio:
> > 
> > Please don't say 'same change as the previous patch' :) explain what you're
> > doing here. It's a pain to have to go check otherwise.
> > 
> 
> MST you need to slow down a bit.
> 
> I gave you this same feedback 3 versions ago:
> 
> https://lore.kernel.org/linux-mm/agXUHItfxSwtriRF@gourry-fedora-PF4VCD3F/
> 
> ~Gregory

Ooof I do try but the patchset is just too big. Sorry. I need to find a
way to split it. Or maybe Matthew will tell me how to make it much
smaller, he says he sees a way that will make everyone happy. Let's
wait.

-- 
MST


^ permalink raw reply

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Michael S. Tsirkin @ 2026-06-08 20:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibP5W-fGcyPp9e4@casper.infradead.org>

On Mon, Jun 08, 2026 at 03:21:25PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> > Further, on architectures with aliasing caches, upstream with init_on_alloc
> 
> Further to what?  Did you leave out some paragraphs here?
> 
> As far as I can tell, this patch series decides to trust that the
> hypervisor has zeroed pages that it allocates to the guest.  But
> as far as I can tell, the trend is towards less trust in the hypervisor
> from the guest, not more.

AKA confidential computing. I'm not a visionary, no idea about trends, but
yes these are used more than in the past (not hard given it used to be
0% of the market in the past).

Page reporting already leaks some info like free page addresses, so it's
for trusted hypervisors.

Anyway:
Subject: [PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests

makes sure guests that do not trust hypervisors are not affected.

-- 
MST


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 19:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aia93IzhTgi1vGi6@casper.infradead.org>

On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> > But instead of overloading user_addr to indicate all kinds of things, instead
> > make life easier by actually breaking things out.
> > 
> > Like:
> > 
> > enum alloc_context_type {
> > 	KERNEL_ALLOCATION,
> > 	USER_MAPPED_ALLOCATION,
> > 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> > 	/* Perhaps some other states we want to encode? */
> > };
> > 
> > struct alloc_context {
> > 	...
> > 
> > 	enum alloc_context_type type;
> > 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> > 
> > 	// Maybe something suggesting context or whether we init before in some
> > 	// cases?
> > };
> 
> Ugh, please, no.  As I suggested last time I commented on this
> trainwreck of a series, lift the zeroing functionality from
> alloc_frozen_pages() into its callers.

This sort of just implies writing the "alloc_frozen_zeroed_pages()"
wrapper that does the zeroing at the end before return, and then killing
the post hook nonsense associated with it in the first place.

None of this resolves the user address annoyance which is needed on some
archs for cache flushing.  Whether anyone agrees that the page allocator
should be responsible for this particular operation - open debate.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
From: Michael S. Tsirkin @ 2026-06-08 19:58 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <497dff41-e831-43a1-a56b-a6ab98bc11a3@kernel.org>

On Mon, Jun 08, 2026 at 04:28:48PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 16:08, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 02:46:34PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/8/26 14:25, Lorenzo Stoakes wrote:
> >>>
> >>> Do not put comments about specific expected races like this in the commit
> >>> message but not in the code. Subtleties need to be called out.
> >>>
> >>> The commit message also doesn't at all explain why PG_zeroed doesn't
> >>> suffice here.
> >>>
> >>>
> >>> I really don't understand why you have a 'zeroed' folio flag but need to
> >>> also have new API calls to detect that?
> >>>
> >>> They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
> >>> Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
> >>> VM zeroed?
> >>>
> >>> Each are cases we address individually and relate to folios.
> >>>
> >>> You absolutely fail to clarify _which one_ you mean, and provide absolutely
> >>> no documentation and add an exported mm API with no description.
> >>>
> >>> This is just I think not something we want to add? Especially on something
> >>> so fundamental?
> >>
> >> I raised previously that providing a folio helper is odd, and that I suggested
> >> that we defer this change.
> > 
> > Sadly it's a dependency actually - without it memcg failures would cause
> > repeated re-zeroing where previously it failed without zeroing.
> 
> Oh, you mean that we succeeded allocating (+zeroing) but failed to charge?
> 
> I don't immediately see that to be a real problem?

Yes exactly.
I don't really know if any real applications live close enough to memcg edge
that repeatedly wasting cycles zeroing pages then discarding that
information will be noticeable.

I should be able to write a test to show the difference, if that's the
question.

So just writing code in a way that we are not regressing them seemed cleaner to me.
But I'm not a maintainer so hey. Just so we are clear.

-- 
MST


^ permalink raw reply

* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
From: Michael S. Tsirkin @ 2026-06-08 19:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gregory Price, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aicMl3hPJMgOkfSv@casper.infradead.org>

On Mon, Jun 08, 2026 at 07:40:23PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 12:09:19PM -0400, Gregory Price wrote:
> > On Mon, Jun 08, 2026 at 01:00:17PM +0100, Lorenzo Stoakes wrote:
> > > On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> > > >
> > > >  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > > > -		     unsigned long user_addr);
> > > > +		     bool zeroed, unsigned long user_addr);
> > > 
> > > host_zeroed or something would be more appropriate no?
> > > 
> > > But in general do we need to propagate this around, can't we derive it from
> > > the page zeroed flag?
> > > 
> > > It's really confusing as to _which_ zeroing this refers to, it seems the
> > > only one relevant here is the VM host zeroing but that's completely
> > > non-obvious and now everybody using these functions with the extra param
> > > will simply have to happen to know this.
> > > 
> > > If we could find a way to avoid this propagation that'd be ideal.
> > > 
> > > Failing that, making it clear this is _only_ for vm host zeroing would be
> > > better, but then maybe we need to think about how we could encode this in
> > > some other way, e.g. passing alloc_context perhaps?
> > > 
> > 
> > This is unaddressed feedback from 3 version ago:
> > 
> > https://lore.kernel.org/linux-mm/agXYbcuQYooG74pb@gourry-fedora-PF4VCD3F/
> > 
> > We can infer all of this from snapshotted page flags and propogate those
> > around.  This is infinitely more useful than just a single flag being
> > pulled out into a boolean, and more extensible.
> > 
> > void
> > post_alloc_hook(struct page *page, usigned int order, gfp_t gfp_flags,
> >                 uint64_t pg_alloc_flags, unsigned long user_addr);
> > 
> > 		^^^ page_alloc.c internal falgs only
> > 
> > Once the allocator gets a page it wants to return, it can take a snapshot
> > of the flags at that point, and then doodle all over the flags as it
> > goes through the page setup prior to return (include the post hook).
> > 
> > Haven't seen a reason why this shouldn't be done this way.
> 
> I'd tuned out this awful series since it'd become apparent that my
> feedback wasn't going to be taken seriously.

Sorry about that.

> Good to know I shouldn't
> take it personally because he does it to you too.
> 
> I think it's apparent that Michael has no understanding of the MM.

It's a bit of an overstatement but I'm more of a networking guy, yes.
What I freely admit I don't understand is why I have to refactor all of mm
first.

> So we should start again with the architecture.  Let's look at the
> problem that he's trying to solve:
> 
>  - The hypervisor has zeroed the memory that it passes to the MM
>  - But the MM then zeroes it again before passing it to userspace.
>  - We want to avoid this
> 
> Let's make sure that's the actual problem before going any further.
> Because I do have a design that will satisfy that without doing this
> insane level of invasive change, but if that's not the problem, there's
> no point wasting my time writing it up.

Yes that's exactly the problem I was trying to solve. Early RFC versions
didn't do this invasive change:

https://lore.kernel.org/lkml/cover.1776689093.git.mst@redhat.com/


But I was asked to refactor mm first and implement the optimization
second.

Sure, glad to hear what the design is.



-- 
MST


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 19:52 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aicbZ9lewv1wJqqS@gourry-fedora-PF4VCD3F>

On 8 Jun 2026, at 15:43, Gregory Price wrote:

> On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 17:27, Zi Yan wrote:
>>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
>>>
>>> Or should we defer zeroing after a page is returned from allocator? So that
>>> user_addr does not need to be passed through irrelevant allocation APIs.
>>> Something like:
>>>
>>> alloc_page_wrapper(gfp, order, user_addr)
>>> {
>>> 	page = alloc_pages();
>>> 	if (gfp & __GFP_ZERO)
>>> 		clear_page(page);
>>> }
>>>
>>
>> Not really sure what's best here. I think we'd want to limit the lifting to some
>> internal API, so it cannot easily be messed up by random kernel code calling
>> into the wrong API and not getting pages cleared.
>>
>
> We're a bit in circles on this.  We discussed explicit interfaces a few
> months back and the trade off was:
>
> a) add user_addr to the existing API and cause churn
>
> or
>
> b) add special interface like above
>    increase the buddy surface
>    leaves open the ability for users to get it wrong easily
>
> If we forget VMs for a moment and break this step out separately, the
> core question is whether page_alloc.c is the right place to be calling
> the folio_user_zero() or whatever it is.

page_alloc.c calling folio_user_zero() is fine, but my question is
whether we should do the zeroing inside post_alloc_hook(), part of
allocation.

What I propose is to lift __GFP_ZERO up as much as possible,
so that most of allocation code does not need to care about it.
We do the zeroing right before the page is returned to callers.

>
> We seem to have agreed "yes", which necessitates the plumbing of the
> address into the allocator.  The question is whether it should churn the
> existing interface for have its own explicit interface.
>
> The implications of getting it wrong are:  a user page doesn't get
> zeroed.  *oof*
>
> I think that's why we thought the churn was better.  But considering
> we're already in the same state now (callers are responsible for calling
> folio_user_zero() or whatever), maybe that's not horrid?

Yeah, some callers think they know better about huge page zeroing and want to
do it differently. That caused double zeroing when init_on_alloc was
introduced and was mitigated by user_alloc_needs_zeroing().

Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 19:45 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiblhJv_BvMn_8XY@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 11:53:40AM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 12:23:07PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> > 
> > > +			 * user_addr != USER_ADDR_NONE implies sleepable
> > > +			 * context (user page fault).
> > 
> > Can you safely assume that? Also inferring which context we are in from this
> > parameter seems risky.
> > 
> > It seems to me that you're now making it such that kernel developers:
> > 
> > - Have to know when and when not to specify a user address, and under what
> >   circumstances we might consider that to be mapped.
> > 
> > - Need to know to do this correctly for aliasing architectures or have silent
> >   correctness issues.
> > 
> > - Need to take context into account when specifying this.
> > 
> > We definitely need to find a simpler way to do this!
> >
> 
> This feedback was poked at in earlier versions.  There's a tension
> between keeping the old interface as-is, having explicit interfaces
> for something like this, and the state of a page inside the
> allocator vs outside.
> 
> Double-plus complicated by the fact that we're trying to reason about
> two allocators at once:  host and guest.
> 
> It seems it has gotten a bit more complicated since then (I missed this
> "sleepable context" bit, not sure if it was there on prior versions).
> 
> If `user_addr` is now implying anything other than exactly: "This needs
> to be zeroed / caches flushed", then this is bad.
> 
> ~Gregory

Well if you do folio_zero_user in a non sleepable context then things
are not going to work. So combining e.g. GFP_ATOMIC and GFP_ZERO and
user_addr all together is not a good idea.

You are saying it's bad? It's pretty fundamental to the idea of moving
zeroing into the allocator, I feel.

-- 
MST


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 19:43 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Zi Yan, Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli
In-Reply-To: <2475e4c4-9ba3-4091-9bfd-1c1c15163da7@kernel.org>

On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 17:27, Zi Yan wrote:
> > On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> > 
> > Or should we defer zeroing after a page is returned from allocator? So that
> > user_addr does not need to be passed through irrelevant allocation APIs.
> > Something like:
> > 
> > alloc_page_wrapper(gfp, order, user_addr)
> > {
> > 	page = alloc_pages();
> > 	if (gfp & __GFP_ZERO)
> > 		clear_page(page);
> > }
> > 
> 
> Not really sure what's best here. I think we'd want to limit the lifting to some
> internal API, so it cannot easily be messed up by random kernel code calling
> into the wrong API and not getting pages cleared.
> 

We're a bit in circles on this.  We discussed explicit interfaces a few
months back and the trade off was:

a) add user_addr to the existing API and cause churn

or

b) add special interface like above
   increase the buddy surface
   leaves open the ability for users to get it wrong easily

If we forget VMs for a moment and break this step out separately, the
core question is whether page_alloc.c is the right place to be calling
the folio_user_zero() or whatever it is.

We seem to have agreed "yes", which necessitates the plumbing of the
address into the allocator.  The question is whether it should churn the
existing interface for have its own explicit interface.

The implications of getting it wrong are:  a user page doesn't get
zeroed.  *oof*

I think that's why we thought the churn was better.  But considering
we're already in the same state now (callers are responsible for calling
folio_user_zero() or whatever), maybe that's not horrid? 

~Gregory

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 19:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiacZ6_7SG3nvVjM@lucifer>

On Mon, Jun 08, 2026 at 12:23:07PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> > When post_alloc_hook() needs to zero a page for an explicit
> > __GFP_ZERO allocation for a user page (user_addr is set), use folio_zero_user()
> > instead of kernel_init_pages().  This zeros near the faulting
> > address last, keeping those cachelines hot for the impending
> > user access.
> >
> > folio_zero_user() is only used for explicit __GFP_ZERO, not for
> > init_on_alloc.  On architectures with virtually-indexed caches
> > (e.g., ARM), clear_user_highpage() performs per-line cache
> > operations; using it for init_on_alloc would add overhead that
> > kernel_init_pages() avoids (the page fault path flushes the
> > cache at PTE installation time regardless).
> >
> > No functional change yet: current callers do not pass __GFP_ZERO
> > for user pages (they zero at the callsite instead).  Subsequent
> > patches will convert them.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Assisted-by: Claude:claude-opus-4-6
> > ---
> >  mm/page_alloc.c | 35 ++++++++++++++++++++++++++++++++---
> >  1 file changed, 32 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4676fd49819e..d4fbf1861a8a 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1861,9 +1861,38 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  		for (i = 0; i != 1 << order; ++i)
> >  			page_kasan_tag_reset(page + i);
> >  	}
> > -	/* If memory is still not initialized, initialize it now. */
> > -	if (init)
> > -		kernel_init_pages(page, 1 << order);
> > +	/*
> > +	 * On architectures with cache aliasing, pages zeroed via the
> > +	 * kernel direct map (e.g. init_on_free) must be re-zeroed
> > +	 * through a user-congruent mapping.  Host-zeroed pages
> > +	 * (zeroed flag) don't need this: physical RAM is clean.
> > +	 */
> > +	if (!init && (gfp_flags & __GFP_ZERO) &&
> > +	    user_addr != USER_ADDR_NONE &&
> > +	    user_alloc_needs_zeroing())
> 
> We check this (gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE thing
> twice, can we just put in a 'init_should_folio_zero' const bool or something?
> 
> > +		init = true;
> 
> As Vlasta says not sure if we want to add complexity just for these arches.
> 
> > +	/*
> > +	 * If memory is still not initialized, initialize it now.
> 
> I kinda hate that 'init' is unclear as to 'do init' or 'was init somewhere
> else'... Anwyay.
> 
> > +	 * When __GFP_ZERO was explicitly requested and user_addr is set,
> > +	 * use folio_zero_user() which zeros near the faulting address
> > +	 * last, keeping those cachelines hot.  For init_on_alloc, use
> > +	 * kernel_init_pages() to avoid unnecessary cache flush overhead
> > +	 * on architectures with virtually-indexed caches.
> 
> This whole paragraph seems pretty useless and just describing the code?
> 
> > +	 */
> > +	if (init) {
> > +		if ((gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE) {
> > +			/*
> > +			 * folio_zero_user relies on folio_nr_pages which
> > +			 * requires __GFP_COMP for order > 0.  All user folio
> > +			 * allocations set __GFP_COMP via __folio_alloc.
> 
> This whole paragraph is useless and very like the kind of stuff AI generates for
> comments, i.e. overly long + entirely unnecessary stuff.


It was an attempt to make sashiko shut up, it doesn't understand the
context and kept complaining. Didn't really help so yea I should drop this.

> 
> > +			 * user_addr != USER_ADDR_NONE implies sleepable
> > +			 * context (user page fault).
> 
> Can you safely assume that? Also inferring which context we are in from this
> parameter seems risky.
> 
> It seems to me that you're now making it such that kernel developers:
> 
> - Have to know when and when not to specify a user address, and under what
>   circumstances we might consider that to be mapped.
> 
> - Need to know to do this correctly for aliasing architectures or have silent
>   correctness issues.
> 
> - Need to take context into account when specifying this.
> 
> We definitely need to find a simpler way to do this!
> 
> > +			 */
> > +			VM_WARN_ON_ONCE(order && !(gfp_flags & __GFP_COMP));
> 
> Surely by now we can assume this?

Another attempt to make it obvious.


> > +			folio_zero_user(page_folio(page), user_addr);
> > +		} else
> > +			kernel_init_pages(page, 1 << order);
> 
> I hate this hanging else branch... definitely prefer {} on both branches.
> 
> But in any case it seems like we could avoid some indentation with something
> like:
> 
> 	if (init && init_should_folio_zero) {
> 		...
> 	} else if (init) {
> 		...
> 	}
> 
> Or even a:
> 
> 	if (!init)
> 		goto out;
> 
> And stick an out label below?
> 
> > +	}
> 
> >
> >  	set_page_owner(page, order, gfp_flags);
> >  	page_table_check_alloc(page, order);
> > --
> > MST
> >
> 
> Oh and in general it seems that this conflicts with [0] which removes
> kernel_init_pages().
> 
> [0];https://lore.kernel.org/all/20260422102729.166599-1-hsalunke@amd.com/
> 
> Thanks, Lorenzo


^ permalink raw reply

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Michael S. Tsirkin @ 2026-06-08 19:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Gregory Price, Vlastimil Babka (SUSE), linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aib9QIGIn8f4W6RM@lucifer>

On Mon, Jun 08, 2026 at 06:50:46PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 11:45:18AM -0400, Gregory Price wrote:
> > On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> > > On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > > > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> > > >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > > >
> > > > It seems those are niche architectures so we can ignore that part for perf
> > > > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > > > booting with init_on_alloc.
> > >
> > > OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> > > negated. But you're changing that anyway to skip that user zeroing, right?
> > >
> > > "
> > > This series eliminates that double-zeroing by moving the zeroing
> > > into the post_alloc_hook + propagating the "host
> > > already zeroed this page" information through the buddy allocator.
> > > "
> > >
> > > So relying on "everything in buddy is zeroed" would still work I'd think.
> > >
> >
> > This regresses for anything that previously didn't zero on free or
> > alloc, which is most kernel allocations.
> >
> > I think the scope of this set has increased too much based on early
> > feedback to fix the userland-initiated allocations piece along with the
> > balloon/reporting/double-zero piece.  That's making all of this
> > difficult to continue following.
> 
> Yeah I feel this is 3, 4 or 5 series put together, and there's a lot to
> discuss in each :) so it's pretty difficult to work with them all put
> together.
> 
> These need to be deferred/separated.

I can do that, it's just that the real performance benefits
only come with the last patches in the series.

If I send series that merely moves zeroing around, with
a bunch of threading of addresses and stuff to achieve that,
0 perf gain and slight degradation in corner cases like memcg
failures, you feel it will be well received? You guys really
want to do that, independently of the rest?

Just making sure, I'm not the maintainer here.


> >
> > ~Gregory
> 
> Thanks, Lorenzo


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox