Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Michael S. Tsirkin @ 2026-06-08 20:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibP5W-fGcyPp9e4@casper.infradead.org>

On Mon, Jun 08, 2026 at 03:21:25PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> > Further, on architectures with aliasing caches, upstream with init_on_alloc
> 
> Further to what?  Did you leave out some paragraphs here?
> 
> As far as I can tell, this patch series decides to trust that the
> hypervisor has zeroed pages that it allocates to the guest.  But
> as far as I can tell, the trend is towards less trust in the hypervisor
> from the guest, not more.

AKA confidential computing. I'm not a visionary, no idea about trends, but
yes these are used more than in the past (not hard given it used to be
0% of the market in the past).

Page reporting already leaks some info like free page addresses, so it's
for trusted hypervisors.

Anyway:
Subject: [PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests

makes sure guests that do not trust hypervisors are not affected.

-- 
MST


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 19:59 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aia93IzhTgi1vGi6@casper.infradead.org>

On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> > But instead of overloading user_addr to indicate all kinds of things, instead
> > make life easier by actually breaking things out.
> > 
> > Like:
> > 
> > enum alloc_context_type {
> > 	KERNEL_ALLOCATION,
> > 	USER_MAPPED_ALLOCATION,
> > 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> > 	/* Perhaps some other states we want to encode? */
> > };
> > 
> > struct alloc_context {
> > 	...
> > 
> > 	enum alloc_context_type type;
> > 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
> > 
> > 	// Maybe something suggesting context or whether we init before in some
> > 	// cases?
> > };
> 
> Ugh, please, no.  As I suggested last time I commented on this
> trainwreck of a series, lift the zeroing functionality from
> alloc_frozen_pages() into its callers.

This sort of just implies writing the "alloc_frozen_zeroed_pages()"
wrapper that does the zeroing at the end before return, and then killing
the post hook nonsense associated with it in the first place.

None of this resolves the user address annoyance which is needed on some
archs for cache flushing.  Whether anyone agrees that the page allocator
should be responsible for this particular operation - open debate.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
From: Michael S. Tsirkin @ 2026-06-08 19:58 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <497dff41-e831-43a1-a56b-a6ab98bc11a3@kernel.org>

On Mon, Jun 08, 2026 at 04:28:48PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 16:08, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 02:46:34PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/8/26 14:25, Lorenzo Stoakes wrote:
> >>>
> >>> Do not put comments about specific expected races like this in the commit
> >>> message but not in the code. Subtleties need to be called out.
> >>>
> >>> The commit message also doesn't at all explain why PG_zeroed doesn't
> >>> suffice here.
> >>>
> >>>
> >>> I really don't understand why you have a 'zeroed' folio flag but need to
> >>> also have new API calls to detect that?
> >>>
> >>> They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
> >>> Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
> >>> VM zeroed?
> >>>
> >>> Each are cases we address individually and relate to folios.
> >>>
> >>> You absolutely fail to clarify _which one_ you mean, and provide absolutely
> >>> no documentation and add an exported mm API with no description.
> >>>
> >>> This is just I think not something we want to add? Especially on something
> >>> so fundamental?
> >>
> >> I raised previously that providing a folio helper is odd, and that I suggested
> >> that we defer this change.
> > 
> > Sadly it's a dependency actually - without it memcg failures would cause
> > repeated re-zeroing where previously it failed without zeroing.
> 
> Oh, you mean that we succeeded allocating (+zeroing) but failed to charge?
> 
> I don't immediately see that to be a real problem?

Yes exactly.
I don't really know if any real applications live close enough to memcg edge
that repeatedly wasting cycles zeroing pages then discarding that
information will be noticeable.

I should be able to write a test to show the difference, if that's the
question.

So just writing code in a way that we are not regressing them seemed cleaner to me.
But I'm not a maintainer so hey. Just so we are clear.

-- 
MST


^ permalink raw reply

* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
From: Michael S. Tsirkin @ 2026-06-08 19:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gregory Price, Lorenzo Stoakes, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aicMl3hPJMgOkfSv@casper.infradead.org>

On Mon, Jun 08, 2026 at 07:40:23PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 12:09:19PM -0400, Gregory Price wrote:
> > On Mon, Jun 08, 2026 at 01:00:17PM +0100, Lorenzo Stoakes wrote:
> > > On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> > > >
> > > >  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > > > -		     unsigned long user_addr);
> > > > +		     bool zeroed, unsigned long user_addr);
> > > 
> > > host_zeroed or something would be more appropriate no?
> > > 
> > > But in general do we need to propagate this around, can't we derive it from
> > > the page zeroed flag?
> > > 
> > > It's really confusing as to _which_ zeroing this refers to, it seems the
> > > only one relevant here is the VM host zeroing but that's completely
> > > non-obvious and now everybody using these functions with the extra param
> > > will simply have to happen to know this.
> > > 
> > > If we could find a way to avoid this propagation that'd be ideal.
> > > 
> > > Failing that, making it clear this is _only_ for vm host zeroing would be
> > > better, but then maybe we need to think about how we could encode this in
> > > some other way, e.g. passing alloc_context perhaps?
> > > 
> > 
> > This is unaddressed feedback from 3 version ago:
> > 
> > https://lore.kernel.org/linux-mm/agXYbcuQYooG74pb@gourry-fedora-PF4VCD3F/
> > 
> > We can infer all of this from snapshotted page flags and propogate those
> > around.  This is infinitely more useful than just a single flag being
> > pulled out into a boolean, and more extensible.
> > 
> > void
> > post_alloc_hook(struct page *page, usigned int order, gfp_t gfp_flags,
> >                 uint64_t pg_alloc_flags, unsigned long user_addr);
> > 
> > 		^^^ page_alloc.c internal falgs only
> > 
> > Once the allocator gets a page it wants to return, it can take a snapshot
> > of the flags at that point, and then doodle all over the flags as it
> > goes through the page setup prior to return (include the post hook).
> > 
> > Haven't seen a reason why this shouldn't be done this way.
> 
> I'd tuned out this awful series since it'd become apparent that my
> feedback wasn't going to be taken seriously.

Sorry about that.

> Good to know I shouldn't
> take it personally because he does it to you too.
> 
> I think it's apparent that Michael has no understanding of the MM.

It's a bit of an overstatement but I'm more of a networking guy, yes.
What I freely admit I don't understand is why I have to refactor all of mm
first.

> So we should start again with the architecture.  Let's look at the
> problem that he's trying to solve:
> 
>  - The hypervisor has zeroed the memory that it passes to the MM
>  - But the MM then zeroes it again before passing it to userspace.
>  - We want to avoid this
> 
> Let's make sure that's the actual problem before going any further.
> Because I do have a design that will satisfy that without doing this
> insane level of invasive change, but if that's not the problem, there's
> no point wasting my time writing it up.

Yes that's exactly the problem I was trying to solve. Early RFC versions
didn't do this invasive change:

https://lore.kernel.org/lkml/cover.1776689093.git.mst@redhat.com/


But I was asked to refactor mm first and implement the optimization
second.

Sure, glad to hear what the design is.



-- 
MST


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 19:52 UTC (permalink / raw)
  To: Gregory Price
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
	linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aicbZ9lewv1wJqqS@gourry-fedora-PF4VCD3F>

On 8 Jun 2026, at 15:43, Gregory Price wrote:

> On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 17:27, Zi Yan wrote:
>>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
>>>
>>> Or should we defer zeroing after a page is returned from allocator? So that
>>> user_addr does not need to be passed through irrelevant allocation APIs.
>>> Something like:
>>>
>>> alloc_page_wrapper(gfp, order, user_addr)
>>> {
>>> 	page = alloc_pages();
>>> 	if (gfp & __GFP_ZERO)
>>> 		clear_page(page);
>>> }
>>>
>>
>> Not really sure what's best here. I think we'd want to limit the lifting to some
>> internal API, so it cannot easily be messed up by random kernel code calling
>> into the wrong API and not getting pages cleared.
>>
>
> We're a bit in circles on this.  We discussed explicit interfaces a few
> months back and the trade off was:
>
> a) add user_addr to the existing API and cause churn
>
> or
>
> b) add special interface like above
>    increase the buddy surface
>    leaves open the ability for users to get it wrong easily
>
> If we forget VMs for a moment and break this step out separately, the
> core question is whether page_alloc.c is the right place to be calling
> the folio_user_zero() or whatever it is.

page_alloc.c calling folio_user_zero() is fine, but my question is
whether we should do the zeroing inside post_alloc_hook(), part of
allocation.

What I propose is to lift __GFP_ZERO up as much as possible,
so that most of allocation code does not need to care about it.
We do the zeroing right before the page is returned to callers.

>
> We seem to have agreed "yes", which necessitates the plumbing of the
> address into the allocator.  The question is whether it should churn the
> existing interface for have its own explicit interface.
>
> The implications of getting it wrong are:  a user page doesn't get
> zeroed.  *oof*
>
> I think that's why we thought the churn was better.  But considering
> we're already in the same state now (callers are responsible for calling
> folio_user_zero() or whatever), maybe that's not horrid?

Yeah, some callers think they know better about huge page zeroing and want to
do it differently. That caused double zeroing when init_on_alloc was
introduced and was mitigated by user_alloc_needs_zeroing().

Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 19:45 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiblhJv_BvMn_8XY@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 11:53:40AM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 12:23:07PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> > 
> > > +			 * user_addr != USER_ADDR_NONE implies sleepable
> > > +			 * context (user page fault).
> > 
> > Can you safely assume that? Also inferring which context we are in from this
> > parameter seems risky.
> > 
> > It seems to me that you're now making it such that kernel developers:
> > 
> > - Have to know when and when not to specify a user address, and under what
> >   circumstances we might consider that to be mapped.
> > 
> > - Need to know to do this correctly for aliasing architectures or have silent
> >   correctness issues.
> > 
> > - Need to take context into account when specifying this.
> > 
> > We definitely need to find a simpler way to do this!
> >
> 
> This feedback was poked at in earlier versions.  There's a tension
> between keeping the old interface as-is, having explicit interfaces
> for something like this, and the state of a page inside the
> allocator vs outside.
> 
> Double-plus complicated by the fact that we're trying to reason about
> two allocators at once:  host and guest.
> 
> It seems it has gotten a bit more complicated since then (I missed this
> "sleepable context" bit, not sure if it was there on prior versions).
> 
> If `user_addr` is now implying anything other than exactly: "This needs
> to be zeroed / caches flushed", then this is bad.
> 
> ~Gregory

Well if you do folio_zero_user in a non sleepable context then things
are not going to work. So combining e.g. GFP_ATOMIC and GFP_ZERO and
user_addr all together is not a good idea.

You are saying it's bad? It's pretty fundamental to the idea of moving
zeroing into the allocator, I feel.

-- 
MST


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 19:43 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Zi Yan, Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
	Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
	Andrea Arcangeli
In-Reply-To: <2475e4c4-9ba3-4091-9bfd-1c1c15163da7@kernel.org>

On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 17:27, Zi Yan wrote:
> > On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> > 
> > Or should we defer zeroing after a page is returned from allocator? So that
> > user_addr does not need to be passed through irrelevant allocation APIs.
> > Something like:
> > 
> > alloc_page_wrapper(gfp, order, user_addr)
> > {
> > 	page = alloc_pages();
> > 	if (gfp & __GFP_ZERO)
> > 		clear_page(page);
> > }
> > 
> 
> Not really sure what's best here. I think we'd want to limit the lifting to some
> internal API, so it cannot easily be messed up by random kernel code calling
> into the wrong API and not getting pages cleared.
> 

We're a bit in circles on this.  We discussed explicit interfaces a few
months back and the trade off was:

a) add user_addr to the existing API and cause churn

or

b) add special interface like above
   increase the buddy surface
   leaves open the ability for users to get it wrong easily

If we forget VMs for a moment and break this step out separately, the
core question is whether page_alloc.c is the right place to be calling
the folio_user_zero() or whatever it is.

We seem to have agreed "yes", which necessitates the plumbing of the
address into the allocator.  The question is whether it should churn the
existing interface for have its own explicit interface.

The implications of getting it wrong are:  a user page doesn't get
zeroed.  *oof*

I think that's why we thought the churn was better.  But considering
we're already in the same state now (callers are responsible for calling
folio_user_zero() or whatever), maybe that's not horrid? 

~Gregory

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 19:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiacZ6_7SG3nvVjM@lucifer>

On Mon, Jun 08, 2026 at 12:23:07PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> > When post_alloc_hook() needs to zero a page for an explicit
> > __GFP_ZERO allocation for a user page (user_addr is set), use folio_zero_user()
> > instead of kernel_init_pages().  This zeros near the faulting
> > address last, keeping those cachelines hot for the impending
> > user access.
> >
> > folio_zero_user() is only used for explicit __GFP_ZERO, not for
> > init_on_alloc.  On architectures with virtually-indexed caches
> > (e.g., ARM), clear_user_highpage() performs per-line cache
> > operations; using it for init_on_alloc would add overhead that
> > kernel_init_pages() avoids (the page fault path flushes the
> > cache at PTE installation time regardless).
> >
> > No functional change yet: current callers do not pass __GFP_ZERO
> > for user pages (they zero at the callsite instead).  Subsequent
> > patches will convert them.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Assisted-by: Claude:claude-opus-4-6
> > ---
> >  mm/page_alloc.c | 35 ++++++++++++++++++++++++++++++++---
> >  1 file changed, 32 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 4676fd49819e..d4fbf1861a8a 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1861,9 +1861,38 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  		for (i = 0; i != 1 << order; ++i)
> >  			page_kasan_tag_reset(page + i);
> >  	}
> > -	/* If memory is still not initialized, initialize it now. */
> > -	if (init)
> > -		kernel_init_pages(page, 1 << order);
> > +	/*
> > +	 * On architectures with cache aliasing, pages zeroed via the
> > +	 * kernel direct map (e.g. init_on_free) must be re-zeroed
> > +	 * through a user-congruent mapping.  Host-zeroed pages
> > +	 * (zeroed flag) don't need this: physical RAM is clean.
> > +	 */
> > +	if (!init && (gfp_flags & __GFP_ZERO) &&
> > +	    user_addr != USER_ADDR_NONE &&
> > +	    user_alloc_needs_zeroing())
> 
> We check this (gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE thing
> twice, can we just put in a 'init_should_folio_zero' const bool or something?
> 
> > +		init = true;
> 
> As Vlasta says not sure if we want to add complexity just for these arches.
> 
> > +	/*
> > +	 * If memory is still not initialized, initialize it now.
> 
> I kinda hate that 'init' is unclear as to 'do init' or 'was init somewhere
> else'... Anwyay.
> 
> > +	 * When __GFP_ZERO was explicitly requested and user_addr is set,
> > +	 * use folio_zero_user() which zeros near the faulting address
> > +	 * last, keeping those cachelines hot.  For init_on_alloc, use
> > +	 * kernel_init_pages() to avoid unnecessary cache flush overhead
> > +	 * on architectures with virtually-indexed caches.
> 
> This whole paragraph seems pretty useless and just describing the code?
> 
> > +	 */
> > +	if (init) {
> > +		if ((gfp_flags & __GFP_ZERO) && user_addr != USER_ADDR_NONE) {
> > +			/*
> > +			 * folio_zero_user relies on folio_nr_pages which
> > +			 * requires __GFP_COMP for order > 0.  All user folio
> > +			 * allocations set __GFP_COMP via __folio_alloc.
> 
> This whole paragraph is useless and very like the kind of stuff AI generates for
> comments, i.e. overly long + entirely unnecessary stuff.


It was an attempt to make sashiko shut up, it doesn't understand the
context and kept complaining. Didn't really help so yea I should drop this.

> 
> > +			 * user_addr != USER_ADDR_NONE implies sleepable
> > +			 * context (user page fault).
> 
> Can you safely assume that? Also inferring which context we are in from this
> parameter seems risky.
> 
> It seems to me that you're now making it such that kernel developers:
> 
> - Have to know when and when not to specify a user address, and under what
>   circumstances we might consider that to be mapped.
> 
> - Need to know to do this correctly for aliasing architectures or have silent
>   correctness issues.
> 
> - Need to take context into account when specifying this.
> 
> We definitely need to find a simpler way to do this!
> 
> > +			 */
> > +			VM_WARN_ON_ONCE(order && !(gfp_flags & __GFP_COMP));
> 
> Surely by now we can assume this?

Another attempt to make it obvious.


> > +			folio_zero_user(page_folio(page), user_addr);
> > +		} else
> > +			kernel_init_pages(page, 1 << order);
> 
> I hate this hanging else branch... definitely prefer {} on both branches.
> 
> But in any case it seems like we could avoid some indentation with something
> like:
> 
> 	if (init && init_should_folio_zero) {
> 		...
> 	} else if (init) {
> 		...
> 	}
> 
> Or even a:
> 
> 	if (!init)
> 		goto out;
> 
> And stick an out label below?
> 
> > +	}
> 
> >
> >  	set_page_owner(page, order, gfp_flags);
> >  	page_table_check_alloc(page, order);
> > --
> > MST
> >
> 
> Oh and in general it seems that this conflicts with [0] which removes
> kernel_init_pages().
> 
> [0];https://lore.kernel.org/all/20260422102729.166599-1-hsalunke@amd.com/
> 
> Thanks, Lorenzo


^ permalink raw reply

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Michael S. Tsirkin @ 2026-06-08 19:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Gregory Price, Vlastimil Babka (SUSE), linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aib9QIGIn8f4W6RM@lucifer>

On Mon, Jun 08, 2026 at 06:50:46PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 11:45:18AM -0400, Gregory Price wrote:
> > On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> > > On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > > > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> > > >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > > >
> > > > It seems those are niche architectures so we can ignore that part for perf
> > > > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > > > booting with init_on_alloc.
> > >
> > > OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> > > negated. But you're changing that anyway to skip that user zeroing, right?
> > >
> > > "
> > > This series eliminates that double-zeroing by moving the zeroing
> > > into the post_alloc_hook + propagating the "host
> > > already zeroed this page" information through the buddy allocator.
> > > "
> > >
> > > So relying on "everything in buddy is zeroed" would still work I'd think.
> > >
> >
> > This regresses for anything that previously didn't zero on free or
> > alloc, which is most kernel allocations.
> >
> > I think the scope of this set has increased too much based on early
> > feedback to fix the userland-initiated allocations piece along with the
> > balloon/reporting/double-zero piece.  That's making all of this
> > difficult to continue following.
> 
> Yeah I feel this is 3, 4 or 5 series put together, and there's a lot to
> discuss in each :) so it's pretty difficult to work with them all put
> together.
> 
> These need to be deferred/separated.

I can do that, it's just that the real performance benefits
only come with the last patches in the series.

If I send series that merely moves zeroing around, with
a bunch of threading of addresses and stuff to achieve that,
0 perf gain and slight degradation in corner cases like memcg
failures, you feel it will be well received? You guys really
want to do that, independently of the rest?

Just making sure, I'm not the maintainer here.


> >
> > ~Gregory
> 
> Thanks, Lorenzo


^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-08 19:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibVTe5HTk-fF_kj@casper.infradead.org>

On Mon, Jun 08, 2026 at 03:44:29PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
> > On 6/8/26 16:31, Matthew Wilcox wrote:
> > > On Mon, Jun 08, 2026 at 04:26:18PM +0200, David Hildenbrand (Arm) wrote:
> > >> If that means that we would handle __GFP_ZERO consistently in the callers of
> > >> alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
> > >> user address down to some degree, through folio interfaces only at least.
> > > 
> > > What I don't understand is how the kernel page allocator needs to know
> > > the user address in order to effectively zero it, but the hypervisor is
> > > able to zero the page without knowing the user address.  It feels like
> > > somebody has x86-centric thinking where cache colouring doesn't matter.
> > 
> > (not commenting on the icache dache mess we have to drag along)
> 
> Well, that was kind of the point of this email ... I did ask the
> question you're answering in a different email so let me respond
> to that too.
> 
> > The thing is that with free-page-reporting the memory is already zeroed by the
> > hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
> > and allocating fresh pages on re-access.
> > 
> > So it's not a question of "why is the hypervisor zeroing less efficiently", as
> > zeroing is just a side-product of reclaiming that memory in the first place.
> 
> We definitely have users who don't want the guest to trust the
> hypervisor.  So how do they disable this optimisation?

What do you mean, how? This is done by:

[PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests


-- 
MST


^ permalink raw reply

* Re: [PATCH] drm/virtio: fix dma_fence refcount leak on error in virtio_gpu_dma_fence_wait()
From: Dmitry Osipenko @ 2026-06-08 18:43 UTC (permalink / raw)
  To: Wentao Liang, airlied, kraxel, maarten.lankhorst, mripard,
	tzimmermann, simona
  Cc: gurchetansingh, olvaffe, dri-devel, virtualization, linux-kernel,
	stable
In-Reply-To: <20260607090303.92423-1-vulab@iscas.ac.cn>

On 6/7/26 12:03, Wentao Liang wrote:
> dma_fence_unwrap_for_each() internally calls dma_fence_unwrap_first()
> which does cursor->chain = dma_fence_get(head), taking an extra
> reference. On normal loop completion, dma_fence_unwrap_next()
> releases this via dma_fence_chain_walk() -> dma_fence_put().
> 
> When virtio_gpu_do_fence_wait() fails and the function returns early
> from inside the loop, the cursor->chain reference is never released.
> This is the only caller in the entire kernel that does an early return
> inside dma_fence_unwrap_for_each.
> 
> Add dma_fence_put(itr.chain) before the early return.
> 
> Cc: stable@vger.kernel.org
> Fixes: eba57fb5498f ("drm/virtio: Wait for each dma-fence of in-fence array individually")
> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
> ---
>  drivers/gpu/drm/virtio/virtgpu_submit.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_submit.c b/drivers/gpu/drm/virtio/virtgpu_submit.c
> index dae761fa5788..32cb1e4aa425 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_submit.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_submit.c
> @@ -65,8 +65,10 @@ static int virtio_gpu_dma_fence_wait(struct virtio_gpu_submit *submit,
>  
>  	dma_fence_unwrap_for_each(f, &itr, fence) {
>  		err = virtio_gpu_do_fence_wait(submit, f);
> -		if (err)
> +		if (err) {
> +			dma_fence_put(itr.chain);
>  			return err;
> +		}
>  	}
>  
>  	return 0;

Applied to misc-fixes, thanks!

-- 
Best regards,
Dmitry

^ permalink raw reply

* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
From: Matthew Wilcox @ 2026-06-08 18:40 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibpL3REDdQ8Vhys@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 12:09:19PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 01:00:17PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> > >
> > >  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > > -		     unsigned long user_addr);
> > > +		     bool zeroed, unsigned long user_addr);
> > 
> > host_zeroed or something would be more appropriate no?
> > 
> > But in general do we need to propagate this around, can't we derive it from
> > the page zeroed flag?
> > 
> > It's really confusing as to _which_ zeroing this refers to, it seems the
> > only one relevant here is the VM host zeroing but that's completely
> > non-obvious and now everybody using these functions with the extra param
> > will simply have to happen to know this.
> > 
> > If we could find a way to avoid this propagation that'd be ideal.
> > 
> > Failing that, making it clear this is _only_ for vm host zeroing would be
> > better, but then maybe we need to think about how we could encode this in
> > some other way, e.g. passing alloc_context perhaps?
> > 
> 
> This is unaddressed feedback from 3 version ago:
> 
> https://lore.kernel.org/linux-mm/agXYbcuQYooG74pb@gourry-fedora-PF4VCD3F/
> 
> We can infer all of this from snapshotted page flags and propogate those
> around.  This is infinitely more useful than just a single flag being
> pulled out into a boolean, and more extensible.
> 
> void
> post_alloc_hook(struct page *page, usigned int order, gfp_t gfp_flags,
>                 uint64_t pg_alloc_flags, unsigned long user_addr);
> 
> 		^^^ page_alloc.c internal falgs only
> 
> Once the allocator gets a page it wants to return, it can take a snapshot
> of the flags at that point, and then doodle all over the flags as it
> goes through the page setup prior to return (include the post hook).
> 
> Haven't seen a reason why this shouldn't be done this way.

I'd tuned out this awful series since it'd become apparent that my
feedback wasn't going to be taken seriously.  Good to know I shouldn't
take it personally because he does it to you too.

I think it's apparent that Michael has no understanding of the MM.
So we should start again with the architecture.  Let's look at the
problem that he's trying to solve:

 - The hypervisor has zeroed the memory that it passes to the MM
 - But the MM then zeroes it again before passing it to userspace.
 - We want to avoid this

Let's make sure that's the actual problem before going any further.
Because I do have a design that will satisfy that without doing this
insane level of invasive change, but if that's not the problem, there's
no point wasting my time writing it up.

^ permalink raw reply

* Re: [PATCH v1] drm/virtio: Fix driver removal with disabled KMS
From: Dmitry Osipenko @ 2026-06-08 18:40 UTC (permalink / raw)
  To: Ryosuke Yasuoka, David Airlie, Gerd Hoffmann, Gurchetan Singh,
	Chia-I Wu
  Cc: dri-devel, virtualization, linux-kernel
In-Reply-To: <18b6b23ae1412a71.6d9db1e7acbb9813.37ad12b07518c586@ryasuoka-thinkpadx1carbongen9.tokyo.csb>

Hi,

On 6/7/26 07:31, Ryosuke Yasuoka wrote:
> Hi Dmitry
> 
> On 04/06/2026 15:27, Dmitry Osipenko wrote:
>> DRM atomic and modesetting aren't initialized if virtio-gpu driver built
>> with disabled KMS, leading to access of uninitialized data on driver
>> removal/unbinding and crashing kernel. Fix it by skipping shutting down
>> atomic core with unavailable KMS.
>>
>> Fixes: 72122c69d717 ("drm/virtio: Add option to disable KMS support")
>> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
>> ---
>>  drivers/gpu/drm/virtio/virtgpu_drv.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
>> index f0fb784c0f6f..2aaa7cb08085 100644
>> --- a/drivers/gpu/drm/virtio/virtgpu_drv.c
>> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
>> @@ -138,7 +138,10 @@ static void virtio_gpu_remove(struct virtio_device *vdev)
>>  
>>  	virtio_gpu_release_vqs(dev);
>>  	drm_dev_unplug(dev);
>> -	drm_atomic_helper_shutdown(dev);
>> +
>> +	if (drm_core_check_feature(dev, DRIVER_ATOMIC))
>> +		drm_atomic_helper_shutdown(dev);
>> +
>>  	virtio_gpu_deinit(dev);
>>  	drm_dev_put(dev);
>>  }
> 
> The patch looks good to me at a glance. I haven't done a full, deep code
> review yet, but I've tested it on my lab and everything works as
> expected.
> 
> Tested-by: Ryosuke Yasuoka <ryasuoka@redhat.com>

Thanks a lot for the testing. The review from you will be very welcomed
too.

-- 
Best regards,
Dmitry

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-08 18:39 UTC (permalink / raw)
  To: Zi Yan
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <AEB19464-20A9-4C5B-97B8-681CCDD81031@nvidia.com>

On 6/8/26 17:27, Zi Yan wrote:
> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> 
>> On 6/8/26 12:23, Lorenzo Stoakes wrote:
>>> I noticed this patch, again, sneaks in some additional code changes that
>>> are not mentioned in the commit message and seem irrelevant to the patch.
>>>
>>> Not sure if the AI is doing this, but please don't do that.
>>>
>>>
>>> This feels like very weak justification for code that massively changes mm
>>> code and makes it all much worse.
>>>
>>>
>>> Is this the only relevant code path? You're changing a TON of code here
>>> that will have multiple different code paths?
>>>
>>>
>>> Ugh god, so now we're passing a user address through allocation paths that
>>> might not even be aware of this or be allocating memory at a point when the
>>> mapping is known?
>>
>> The original ideas was to do this only with internal interfaces. I think I
>> raised before to leave hugetlb alone for now.
>>
>> Fundamentally, user_alloc_needs_zeroing() is something we should get rid of, to
>> just be able use __GFP_ZERO and do zeroing at exactly one place.
> 
> Just a reminder that user_alloc_needs_zeroing() not only checks init_on_alloc,
> but also some arc clearing page requirements. To do zeroing in one place,
> clear_page() used in post_alloc_hook() will need to learn how to handle
> arch-specific zeroing from clear_user_page()/clear_user_highpage().

Right.

> 
>>
>>
>> The question is how to pass that information (user_addr) through internal APIs.
> 
> Or should we defer zeroing after a page is returned from allocator? So that
> user_addr does not need to be passed through irrelevant allocation APIs.
> Something like:
> 
> alloc_page_wrapper(gfp, order, user_addr)
> {
> 	page = alloc_pages();
> 	if (gfp & __GFP_ZERO)
> 		clear_page(page);
> }
> 

Not really sure what's best here. I think we'd want to limit the lifting to some
internal API, so it cannot easily be messed up by random kernel code calling
into the wrong API and not getting pages cleared.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v1] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-08 18:20 UTC (permalink / raw)
  To: David Laight
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman, Xuan Zhuo, Eugenio Pérez,
	Simon Horman, kvm, virtualization, netdev, linux-kernel, oxffffaa,
	rulkc
In-Reply-To: <20260608103701.7839ff1b@pumpkin>


On 08/06/2026 12:37, David Laight wrote:
> On Mon, 8 Jun 2026 11:10:24 +0300
> Arseniy Krasnov <avkrasnov@rulkc.org> wrote:
>
>> On 05/06/2026 18:08, David Laight wrote:
>>> On Fri,  5 Jun 2026 14:53:14 +0300
>>> Arseniy Krasnov <avkrasnov@rulkc.org> wrote:
>>>  
>>>> Logically it was based on TCP implementation, so to make further
>>>> support easier, rewrite it in the TCP way.
>>>>
>>>> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
>>>> ---
>>>>  net/vmw_vsock/virtio_transport_common.c | 64 ++++++++++++-------------
>>>>  1 file changed, 32 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>> index 2fd9eaaf5ca6..00caeeaa5590 100644
>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>> @@ -73,10 +73,13 @@ static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops,
>>>>  static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>>  				     struct virtio_vsock_pkt_info *info,
>>>>  				     size_t len,
>>>> -				     bool zcopy)
>>>> +				     bool zcopy, struct ubuf_info *uarg)
>>>>  {
>>>>  	struct msghdr *msg = info->msg;
>>>>  
>>>> +	/* We have completion - attach it to 'skb'. */
>>>> +	skb_zcopy_set(skb, uarg, NULL);
>>>> +
>>>>  	if (zcopy)
>>>>  		return __zerocopy_sg_from_iter(msg, NULL, skb,
>>>>  					       &msg->msg_iter, len, NULL);
>>>> @@ -208,7 +211,8 @@ static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *
>>>>  						  u32 src_cid,
>>>>  						  u32 src_port,
>>>>  						  u32 dst_cid,
>>>> -						  u32 dst_port)
>>>> +						  u32 dst_port,
>>>> +						  struct ubuf_info *uarg)
>>>>  {
>>>>  	struct vsock_sock *vsk;
>>>>  	struct sk_buff *skb;
>>>> @@ -245,7 +249,7 @@ static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *
>>>>  	if (info->msg && payload_len > 0) {
>>>>  		int err;
>>>>  
>>>> -		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy, uarg);
>>>>  		if (err)
>>>>  			goto out;
>>>>  
>>>> @@ -321,38 +325,36 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>  		return pkt_len;
>>>>  
>>>> -	if (info->msg) {
>>>> -		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>> -		 * there is no MSG_ZEROCOPY flag set.
>>>> +	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
>>>> +		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
>>>> +		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
>>>> +		 * handling from 'tcp_sendmsg_locked()'.
>>>>  		 */
>>>> -		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>> -			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>> +		if (info->msg->msg_ubuf) {
>>>> +			uarg = info->msg->msg_ubuf;
>>>> +			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>>>> +		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
>>>> +			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
>>>> +						    NULL, false);
>>>> +			if (!uarg) {
>>>> +				virtio_transport_put_credit(vvs, pkt_len);
>>>> +				return -ENOMEM;
>>>> +			}
>>>>  
>>>> -		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>>  			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>>>>  
>>>> +			if (!can_zcopy)
>>>> +				uarg_to_msgzc(uarg)->zerocopy = 0;
>>>> +
>>>> +			have_uref = true;
>>>> +		}
>>>> +
>>>> +		/* 'can_zcopy' means that this transmission will be
>>>> +		 * in zerocopy way (e.g. using 'frags' array).
>>>> +		 */  
>>> I've not looked at the tcp code, but the above doesn't look right.
>>> I don't see why msg->msg_ubuf might be non-NULL without SOCK_ZEROCOPY set.
>>> That would give the outer code a callback when the last skb is freed but
>>> still copy the data.  
>> Hi, 
>>
>> I guess case when 'msg->msg_ubuf' is non-NULL is special case today for io_uring MSG_ZEROCOPY implementation.
>> It was added here https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb315a7d1396b1139fc7daea55f2d3191e8e7092
>> As I see implementation of its tests in tools/testing/selftests/net/io_uring_zerocopy_tx.c , it doesn't require setting SOCK_ZEROCOPY option for
>> socket, so for virtio vsock case I just copied same logic to maintain compatibility, because there is MSG_ZEROCOPY io_uring test for virtio/vsock.
> That code path probably sets info->msg->msg_flags & MSG_ZEROCOPY so wants 'zerocopy'.
> But wants it's own callback function called rather than one that msg_zerocopy_realloc()
> adds.
>
> But there is no reason why a caller might not just want a notification that
> all the skb associated with the sendmsg have been freed without requesting zerocopy.
> Maybe no one does it today, but it is trivial to support.

Caller for this path is io_uring today. It uses own callback 'io_tx_ubuf_complete()' which implements notification using io_uring
achitecture instead of classic MSG_ERRQUEUE socket read way.

>
>>
>>> I also don't see the point of calling msg_zerocopy_realloc() to get a
>>> callback when the last skb is freed and then setting
>>> 	uarg_to_msgzc(uarg)->zerocopy = 0;
>>> so that the callback doesn't actually do anything.
>>> It isn't as though you 'find out' later on that you can't actually do
>>> zerocopy.  
>>
>> Sorry, what do You mean "last skb" ? In this code we first allocate uarg (allocate, because third arg is always NULL). Then in
>> loop we allocate sk_buffs, fill it with data and send. I mean first/last skb will be freed after uarg is already allocated and we
>> don't touch it. I think i didn't understand Your question here.
> The 'uarg' is referenced by all of the skb that contain data for the sendmsg().
> So when the last one of them is freed the callback function is called.
> The purpose of that callback is to 'undo' the zerocopy (page pinning etc).
> But when you set uarg_to_msgzc(uarg)->zerocopy = 0 the callback does nothing.
> So there is no point setting up the callback at all.

Pages are unpinned in sk_buff freeing logic: 'skb_release_data()' -> '__skb_frag_unref()'.  'zerocopy' flag of uarg shows 
caller that real zerocopy was used of not - it is checked in '__msg_zerocopy_callback()' and 'SO_EE_CODE_ZEROCOPY_COPIED' is set in
the 'ee_code' field of struct 'sock_extended_err' which is read by user. I mean idea of MSG_ZEROCOPY API is that if kernel doesn't
return error from 'sendmsg()' call with MSG_ZEROCOPY flag passed, it will send notification about data tx complete anyway - it doesn't
matter that real zercopy was done or not.

Thanks


>
> -- David
>
>>
>>>  
>>>>  		if (can_zcopy)
>>>>  			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>>  					    (MAX_SKB_FRAGS * PAGE_SIZE));
>>>> -
>>>> -		if (info->msg->msg_flags & MSG_ZEROCOPY &&
>>>> -		    info->op == VIRTIO_VSOCK_OP_RW) {
>>>> -			uarg = info->msg->msg_ubuf;
>>>> -
>>>> -			if (!uarg) {
>>>> -				uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>> -							    pkt_len, NULL, false);
>>>> -				if (!uarg) {
>>>> -					virtio_transport_put_credit(vvs, pkt_len);
>>>> -					return -ENOMEM;
>>>> -				}
>>>> -
>>>> -				if (!can_zcopy)
>>>> -					uarg_to_msgzc(uarg)->zerocopy = 0;
>>>> -
>>>> -				have_uref = true;
>>>> -			}
>>>> -		}
>>>>  	}
>>>>  
>>>>  	rest_len = pkt_len;
>>>> @@ -365,14 +367,12 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  
>>>>  		skb = virtio_transport_alloc_skb(info, skb_len, can_zcopy,
>>>>  						 src_cid, src_port,
>>>> -						 dst_cid, dst_port);
>>>> +						 dst_cid, dst_port, uarg);
>>>>  		if (!skb) {
>>>>  			ret = -ENOMEM;
>>>>  			break;
>>>>  		}
>>>>  
>>>> -		skb_zcopy_set(skb, uarg, NULL);  
>>> Aren't you passing uarg through two function calls instead of doing it here.
>>> Doesn't even make it clearer what is going on.  
>>
>> Agree, to simplify patch, uarg could be set earlier (without passing it to functions) I guess.
>>
>> Thanks
>>
>>
>>> -- David
>>>  
>>>> -
>>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>>>  
>>>>  		ret = t_ops->send_pkt(skb, info->net);
>>>> @@ -1178,7 +1178,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>  					   le64_to_cpu(hdr->dst_cid),
>>>>  					   le32_to_cpu(hdr->dst_port),
>>>>  					   le64_to_cpu(hdr->src_cid),
>>>> -					   le32_to_cpu(hdr->src_port));
>>>> +					   le32_to_cpu(hdr->src_port), NULL);
>>>>  	if (!reply)
>>>>  		return -ENOMEM;
>>>>    

^ permalink raw reply

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Lorenzo Stoakes @ 2026-06-08 17:50 UTC (permalink / raw)
  To: Gregory Price
  Cc: Vlastimil Babka (SUSE), Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibjjsoeEcvXobCw@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 11:45:18AM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> > On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> > >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > >
> > > It seems those are niche architectures so we can ignore that part for perf
> > > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > > booting with init_on_alloc.
> >
> > OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> > negated. But you're changing that anyway to skip that user zeroing, right?
> >
> > "
> > This series eliminates that double-zeroing by moving the zeroing
> > into the post_alloc_hook + propagating the "host
> > already zeroed this page" information through the buddy allocator.
> > "
> >
> > So relying on "everything in buddy is zeroed" would still work I'd think.
> >
>
> This regresses for anything that previously didn't zero on free or
> alloc, which is most kernel allocations.
>
> I think the scope of this set has increased too much based on early
> feedback to fix the userland-initiated allocations piece along with the
> balloon/reporting/double-zero piece.  That's making all of this
> difficult to continue following.

Yeah I feel this is 3, 4 or 5 series put together, and there's a lot to
discuss in each :) so it's pretty difficult to work with them all put
together.

These need to be deferred/separated.

>
> ~Gregory

Thanks, Lorenzo

^ permalink raw reply

* Re: [PATCH v1] i2c: virtio: wait uninterruptibly for completions to avoid UAF
From: Gavin Li @ 2026-06-08 17:46 UTC (permalink / raw)
  To: Viresh Kumar; +Cc: linux-i2c, Chen, Jian Jun, andi.shyti, virtualization
In-Reply-To: <ars6vptfsifv3zvdbx2p7ihcw4rbwbgkvft3xpqyjbvkdckfzy@zxx2exbtuib4>

Noted. I have sent a v2 patch that allows a hung process to be killed.

On Sun, Jun 7, 2026 at 11:44 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 07-06-26, 10:36, Gavin Li wrote:
> > virtio_i2c_complete_reqs() uses wait_for_completion_interruptible() and stops
> > waiting when a signal arrives. virtio_i2c_xfer() then frees reqs and the
> > per-request DMA bounce buffers while the device may still hold virtqueue tokens
> > pointing at &reqs[i] and DMA into read bounce buffers. Additionally, when the
> > device later completes those requests, virtio_i2c_msg_done() calls complete()
> > on freed memory and can corrupt the slab freelist.
> >
> > Wait uninterruptibly for every completion before freeing reqs. This
> > matches how other virtio drivers retain request storage until the device
> > completes it. The virtio spec unfortunately does not provide a way to
> > cancel an in-flight request, so waiting uninterruptibly is required.
> >
> > Signed-off-by: Gavin Li <gavin.li@samsara.com>
> > ---
> >  drivers/i2c/busses/i2c-virtio.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
>
> This is a revert of (and maybe better if that is mentioned in the logs):
>
> commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible completion wait")
>
> I don't think this is the right approach here. We shouldn't hang the kernel
> indefinitely if the other side is dead.
>
> --
> viresh

^ permalink raw reply

* [PATCH v2] i2c: virtio: retain xfer with kref to fix UAF on interrupted wait
From: Gavin Li @ 2026-06-08 17:44 UTC (permalink / raw)
  To: linux-i2c, viresh.kumar
  Cc: Chen, Jian Jun, andi.shyti, virtualization, Gavin Li
In-Reply-To: <20260607143608.76122-1-gavin.li@samsara.com>

commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible
completion wait") switched virtio_i2c_complete_reqs() to
wait_for_completion_interruptible() so a stuck device cannot hang a
task forever. That left a use-after-free: if the wait returns early on
a signal, virtio_i2c_xfer() frees reqs and DMA bounce buffers while the
device may still hold virtqueue tokens pointing at &reqs[i] and DMA
into read buffers. When those requests complete later,
virtio_i2c_msg_done() calls complete() on freed memory.

Waiting uninterruptibly for every completion before freeing avoids the
UAF but can hang the caller indefinitely if the virtio side never
completes the request. The virtio spec provides no way to cancel an
in-flight transfer, so that is not an acceptable tradeoff.

This commit makes two changes:

- Manage the freeing of the xfer allocations via kref, and ensure that
  each in-flight request holds a reference. This fixes the
  use-after-free by ensuring that the virtio device has a valid location
  to write to until the request completes. This will cause a memory
  leak in cases where the device hangs, but that is much preferable to
  memory corruption.

- Use wait_for_completion_killable() instead of _interruptible(). Even
  partial I2C transactions can have side effects, so the only time it
  makes sense to interrupt a transaction is when a process needs to be
  killed. Most existing I2C drivers don't support interruption at all,
  so this should not break userspace applications. This also addresses
  issues with Go programs accessing devices via the I2C userspace API,
  since the Go runtime stochastically signals SIGURG to running threads;
  leaving this as _interruptible() may cause partial side effects from
  which it is impossible to cleanly restart.

Signed-off-by: Gavin Li <gavin.li@samsara.com>
---
 drivers/i2c/busses/i2c-virtio.c | 89 ++++++++++++++++++++++++---------
 1 file changed, 64 insertions(+), 25 deletions(-)

diff --git a/drivers/i2c/busses/i2c-virtio.c b/drivers/i2c/busses/i2c-virtio.c
index 726c162cabd86..f7320a67a3409 100644
--- a/drivers/i2c/busses/i2c-virtio.c
+++ b/drivers/i2c/busses/i2c-virtio.c
@@ -13,6 +13,7 @@
 #include <linux/err.h>
 #include <linux/i2c.h>
 #include <linux/kernel.h>
+#include <linux/kref.h>
 #include <linux/module.h>
 #include <linux/virtio.h>
 #include <linux/virtio_ids.h>
@@ -31,39 +32,77 @@ struct virtio_i2c {
 	struct virtqueue *vq;
 };
 
+struct virtio_i2c_xfer;
+
 /**
  * struct virtio_i2c_req - the virtio I2C request structure
+ * @xfer: owning transfer
+ * @msg: copy of the I2C message for virtio_i2c_xfer_release
  * @completion: completion of virtio I2C message
  * @out_hdr: the OUT header of the virtio I2C message
  * @buf: the buffer into which data is read, or from which it's written
  * @in_hdr: the IN header of the virtio I2C message
  */
 struct virtio_i2c_req {
+	struct virtio_i2c_xfer *xfer;
+	struct i2c_msg msg;
 	struct completion completion;
 	struct virtio_i2c_out_hdr out_hdr	____cacheline_aligned;
 	uint8_t *buf				____cacheline_aligned;
 	struct virtio_i2c_in_hdr in_hdr		____cacheline_aligned;
 };
 
+/**
+ * struct virtio_i2c_xfer - a queued I2C transfer
+ * @ref: one ref for the caller, plus one per in-flight virtqueue request
+ * @num: number of messages
+ * @reqs: the virtio I2C requests
+ */
+struct virtio_i2c_xfer {
+	struct kref ref;
+	int num;
+	struct virtio_i2c_req reqs[];
+};
+
+static void virtio_i2c_xfer_release(struct kref *ref)
+{
+	struct virtio_i2c_xfer *xfer = container_of(ref, struct virtio_i2c_xfer, ref);
+	int i;
+
+	for (i = 0; i < xfer->num; i++) {
+		struct virtio_i2c_req *req = &xfer->reqs[i];
+		i2c_put_dma_safe_msg_buf(req->buf, &req->msg, false);
+	}
+
+	kfree(xfer);
+}
+
 static void virtio_i2c_msg_done(struct virtqueue *vq)
 {
 	struct virtio_i2c_req *req;
 	unsigned int len;
 
-	while ((req = virtqueue_get_buf(vq, &len)))
+	while ((req = virtqueue_get_buf(vq, &len))) {
 		complete(&req->completion);
+		kref_put(&req->xfer->ref, virtio_i2c_xfer_release);
+	}
 }
 
 static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
-				   struct virtio_i2c_req *reqs,
+				   struct virtio_i2c_xfer *xfer,
 				   struct i2c_msg *msgs, int num)
 {
 	struct scatterlist *sgs[3], out_hdr, msg_buf, in_hdr;
+	struct virtio_i2c_req *reqs = xfer->reqs;
 	int i;
 
+	kref_init(&xfer->ref);
+
 	for (i = 0; i < num; i++) {
 		int outcnt = 0, incnt = 0;
 
+		reqs[i].xfer = xfer;
+		reqs[i].msg = msgs[i];
 		init_completion(&reqs[i].completion);
 
 		/*
@@ -99,36 +138,36 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
 
 		if (virtqueue_add_sgs(vq, sgs, outcnt, incnt, &reqs[i], GFP_KERNEL)) {
 			i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], false);
+			reqs[i].buf = NULL; /* prevent free by virtio_i2c_xfer_release */
 			break;
 		}
+
+		kref_get(&xfer->ref); /* released in virtio_i2c_msg_done() */
 	}
 
+	xfer->num = i;
 	return i;
 }
 
-static int virtio_i2c_complete_reqs(struct virtqueue *vq,
-				    struct virtio_i2c_req *reqs,
-				    struct i2c_msg *msgs, int num)
+static int virtio_i2c_complete_reqs(struct virtio_i2c_xfer *xfer)
 {
-	bool failed = false;
-	int i, j = 0;
+	struct virtio_i2c_req *reqs = xfer->reqs;
+	int i, fail_index = -1;
 
-	for (i = 0; i < num; i++) {
+	for (i = 0; i < xfer->num; i++) {
 		struct virtio_i2c_req *req = &reqs[i];
-
-		if (!failed) {
-			if (wait_for_completion_interruptible(&req->completion))
-				failed = true;
-			else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK)
-				failed = true;
-			else
-				j++;
+		if (wait_for_completion_killable(&req->completion)) {
+			return -EINTR;
+		} else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK) {
+			/* Don't break yet. Try to wait until all requests complete. */
+			if (fail_index < 0)
+				fail_index = i;
 		}
-
-		i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], !failed);
+		i2c_put_dma_safe_msg_buf(req->buf, &req->msg, fail_index < 0);
+		req->buf = NULL; /* prevent free by virtio_i2c_xfer_release */
 	}
 
-	return j;
+	return fail_index >= 0 ? fail_index : xfer->num; /* number of successful transactions */
 }
 
 static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
@@ -136,14 +175,14 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
 {
 	struct virtio_i2c *vi = i2c_get_adapdata(adap);
 	struct virtqueue *vq = vi->vq;
-	struct virtio_i2c_req *reqs;
+	struct virtio_i2c_xfer *xfer;
 	int count;
 
-	reqs = kcalloc(num, sizeof(*reqs), GFP_KERNEL);
-	if (!reqs)
+	xfer = kzalloc(struct_size(xfer, reqs, num), GFP_KERNEL);
+	if (!xfer)
 		return -ENOMEM;
 
-	count = virtio_i2c_prepare_reqs(vq, reqs, msgs, num);
+	count = virtio_i2c_prepare_reqs(vq, xfer, msgs, num);
 	if (!count)
 		goto err_free;
 
@@ -157,10 +196,10 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
 	 */
 	virtqueue_kick(vq);
 
-	count = virtio_i2c_complete_reqs(vq, reqs, msgs, count);
+	count = virtio_i2c_complete_reqs(xfer);
 
 err_free:
-	kfree(reqs);
+	kref_put(&xfer->ref, virtio_i2c_xfer_release);
 	return count;
 }
 
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Andrew Morton @ 2026-06-08 16:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin
In-Reply-To: <20260608094153-mutt-send-email-mst@kernel.org>

On Mon, 8 Jun 2026 09:48:34 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> > These patches seem like unrelated fixups that you've discovered along the way,
> > and don't belong as part of the already rather large series, unless I'm missing
> > something here.
> > 
> > Thanks, Lorenzo
> 
> I think you are mising that they are a dependency, not unrelated.
> For example, this issue gets worse with the patchset as there are more
> places that manipulate flags without atomics. No?
> 
> 
> You are welcome to send this to stable, but I think stable rules
> preclude theoretical bugfixes.

I agree with Lorenzo, please - let's have fixes separated out.  After
all, the rest of the series might never be merged (sorry, it happens).

Whether each fix gets a cc:stable is TBD, case-by-case - it depends on
the expected userspace impact.


^ permalink raw reply

* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
From: Gregory Price @ 2026-06-08 16:09 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aias_qh3wR-Xkf6T@lucifer>

On Mon, Jun 08, 2026 at 01:00:17PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> >
> >  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > -		     unsigned long user_addr);
> > +		     bool zeroed, unsigned long user_addr);
> 
> host_zeroed or something would be more appropriate no?
> 
> But in general do we need to propagate this around, can't we derive it from
> the page zeroed flag?
> 
> It's really confusing as to _which_ zeroing this refers to, it seems the
> only one relevant here is the VM host zeroing but that's completely
> non-obvious and now everybody using these functions with the extra param
> will simply have to happen to know this.
> 
> If we could find a way to avoid this propagation that'd be ideal.
> 
> Failing that, making it clear this is _only_ for vm host zeroing would be
> better, but then maybe we need to think about how we could encode this in
> some other way, e.g. passing alloc_context perhaps?
> 

This is unaddressed feedback from 3 version ago:

https://lore.kernel.org/linux-mm/agXYbcuQYooG74pb@gourry-fedora-PF4VCD3F/

We can infer all of this from snapshotted page flags and propogate those
around.  This is infinitely more useful than just a single flag being
pulled out into a boolean, and more extensible.

void
post_alloc_hook(struct page *page, usigned int order, gfp_t gfp_flags,
                uint64_t pg_alloc_flags, unsigned long user_addr);

		^^^ page_alloc.c internal falgs only

Once the allocator gets a page it wants to return, it can take a snapshot
of the flags at that point, and then doodle all over the flags as it
goes through the page setup prior to return (include the post hook).

Haven't seen a reason why this shouldn't be done this way.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 16/37] mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
From: Gregory Price @ 2026-06-08 15:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiapFvxPXuPd4-OP@lucifer>

On Mon, Jun 08, 2026 at 12:37:20PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:37:41AM -0400, Michael S. Tsirkin wrote:
> > Same change as the previous patch but for alloc_swap_folio:
> 
> Please don't say 'same change as the previous patch' :) explain what you're
> doing here. It's a pain to have to go check otherwise.
> 

MST you need to slow down a bit.

I gave you this same feedback 3 versions ago:

https://lore.kernel.org/linux-mm/agXUHItfxSwtriRF@gourry-fedora-PF4VCD3F/

~Gregory

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 15:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiacZ6_7SG3nvVjM@lucifer>

On Mon, Jun 08, 2026 at 12:23:07PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> 
> > +			 * user_addr != USER_ADDR_NONE implies sleepable
> > +			 * context (user page fault).
> 
> Can you safely assume that? Also inferring which context we are in from this
> parameter seems risky.
> 
> It seems to me that you're now making it such that kernel developers:
> 
> - Have to know when and when not to specify a user address, and under what
>   circumstances we might consider that to be mapped.
> 
> - Need to know to do this correctly for aliasing architectures or have silent
>   correctness issues.
> 
> - Need to take context into account when specifying this.
> 
> We definitely need to find a simpler way to do this!
>

This feedback was poked at in earlier versions.  There's a tension
between keeping the old interface as-is, having explicit interfaces
for something like this, and the state of a page inside the
allocator vs outside.

Double-plus complicated by the fact that we're trying to reason about
two allocators at once:  host and guest.

It seems it has gotten a bit more complicated since then (I missed this
"sleepable context" bit, not sure if it was there on prior versions).

If `user_addr` is now implying anything other than exactly: "This needs
to be zeroed / caches flushed", then this is bad.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Gregory Price @ 2026-06-08 15:45 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <e54f9b79-b6b4-4911-86ac-48b61889765a@kernel.org>

On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > 
> > It seems those are niche architectures so we can ignore that part for perf
> > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > booting with init_on_alloc.
> 
> OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> negated. But you're changing that anyway to skip that user zeroing, right?
> 
> "
> This series eliminates that double-zeroing by moving the zeroing
> into the post_alloc_hook + propagating the "host
> already zeroed this page" information through the buddy allocator.
> "
> 
> So relying on "everything in buddy is zeroed" would still work I'd think.
> 

This regresses for anything that previously didn't zero on free or
alloc, which is most kernel allocations.

I think the scope of this set has increased too much based on early
feedback to fix the userland-initiated allocations piece along with the
balloon/reporting/double-zero piece.  That's making all of this
difficult to continue following.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 15:27 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <64f2e580-aa82-4849-9236-b8ec4208ca24@kernel.org>

On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:

> On 6/8/26 12:23, Lorenzo Stoakes wrote:
>> I noticed this patch, again, sneaks in some additional code changes that
>> are not mentioned in the commit message and seem irrelevant to the patch.
>>
>> Not sure if the AI is doing this, but please don't do that.
>>
>> On Mon, Jun 08, 2026 at 04:35:17AM -0400, Michael S. Tsirkin wrote:
>>> Thread a user virtual address from vma_alloc_folio() down through
>>> the page allocator to post_alloc_hook(). This is plumbing
>>> preparation for a subsequent patch that will use user_addr to
>>> call folio_zero_user() for cache-friendly zeroing of user pages.
>>
>> This feels like very weak justification for code that massively changes mm
>> code and makes it all much worse.
>>
>>>
>>> The user_addr is stored in struct alloc_context and flows through:
>>>   vma_alloc_folio -> folio_alloc_mpol -> __alloc_pages_mpol ->
>>>   __alloc_frozen_pages -> get_page_from_freelist -> prep_new_page ->
>>>   post_alloc_hook
>>
>> Is this the only relevant code path? You're changing a TON of code here
>> that will have multiple different code paths?
>>
>>>
>>> USER_ADDR_NONE ((unsigned long)-1) is used for non-user
>>> allocations, since address 0 is a valid userspace mapping.
>>
>> Ugh god, so now we're passing a user address through allocation paths that
>> might not even be aware of this or be allocating memory at a point when the
>> mapping is known?
>
> The original ideas was to do this only with internal interfaces. I think I
> raised before to leave hugetlb alone for now.
>
> Fundamentally, user_alloc_needs_zeroing() is something we should get rid of, to
> just be able use __GFP_ZERO and do zeroing at exactly one place.

Just a reminder that user_alloc_needs_zeroing() not only checks init_on_alloc,
but also some arc clearing page requirements. To do zeroing in one place,
clear_page() used in post_alloc_hook() will need to learn how to handle
arch-specific zeroing from clear_user_page()/clear_user_highpage().

>
>
> The question is how to pass that information (user_addr) through internal APIs.

Or should we defer zeroing after a page is returned from allocator? So that
user_addr does not need to be passed through irrelevant allocation APIs.
Something like:

alloc_page_wrapper(gfp, order, user_addr)
{
	page = alloc_pages();
	if (gfp & __GFP_ZERO)
		clear_page(page);
}

>
> I previously said, that ideally we'd end up with a folio allocation interface in
> mm/page_alloc.c, from where we could get this done more cleanly internally.
>
> But I don't want what the previous proposals did: use GFP flags+leak state or
> return magic "zeroed" flags.


Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-08 14:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibVTe5HTk-fF_kj@casper.infradead.org>

On 6/8/26 16:44, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 16:31, Matthew Wilcox wrote:
>>>
>>> What I don't understand is how the kernel page allocator needs to know
>>> the user address in order to effectively zero it, but the hypervisor is
>>> able to zero the page without knowing the user address.  It feels like
>>> somebody has x86-centric thinking where cache colouring doesn't matter.
>>
>> (not commenting on the icache dache mess we have to drag along)
> 
> Well, that was kind of the point of this email ... I did ask the
> question you're answering in a different email so let me respond
> to that too.

Now I'm confused :)

> 
>> The thing is that with free-page-reporting the memory is already zeroed by the
>> hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
>> and allocating fresh pages on re-access.
>>
>> So it's not a question of "why is the hypervisor zeroing less efficiently", as
>> zeroing is just a side-product of reclaiming that memory in the first place.
> 
> We definitely have users who don't want the guest to trust the
> hypervisor.  So how do they disable this optimisation?

Right, I don't think we currently have a toggle to disable free page reporting.
So IIUC, this optimization would similarly automatically get enabled if the
hypervisor advertises it.

-- 
Cheers,

David

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox