Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
From: Michael S. Tsirkin @ 2026-06-08 14:08 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <8ec82477-b497-466c-902b-82e108ae2b7b@kernel.org>

On Mon, Jun 08, 2026 at 02:46:34PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 14:25, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:38:54AM -0400, Michael S. Tsirkin wrote:
> >> Add put_page_zeroed() / folio_put_zeroed() for callers that hold
> >> a reference to a page known to be zeroed.
> >>
> >> If this drops the last reference, the zeroed hint is
> >> propagated to the buddy allocator.  If someone else still holds a
> >> reference, the hint is simply lost - this is best-effort.
> >>
> >> This is useful for balloon drivers during deflation: the host
> >> has already zeroed the pages, and the balloon is typically the
> >> sole owner.  But if the page happens to be shared, silently
> >> dropping the hint is safe and avoids the need for callers to
> >> check the refcount.
> >>
> >> Note: put_page_zeroed uses folio_put_testzero() which only
> >> detects sole ownership at the instant of the atomic decrement.
> >> A concurrent reference holder (e.g. migration) means the hint
> >> is silently lost. This is by design: the zeroed hint is a
> >> performance optimization, not a correctness requirement.
> >> Losing it just means the next allocation re-zeroes the page.
> > 
> > Do not put comments about specific expected races like this in the commit
> > message but not in the code. Subtleties need to be called out.
> > 
> > The commit message also doesn't at all explain why PG_zeroed doesn't
> > suffice here.
> > 
> >>
> >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> Assisted-by: Claude:claude-opus-4-6
> > 
> > I really don't understand why you have a 'zeroed' folio flag but need to
> > also have new API calls to detect that?
> > 
> > They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
> > Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
> > VM zeroed?
> > 
> > Each are cases we address individually and relate to folios.
> > 
> > You absolutely fail to clarify _which one_ you mean, and provide absolutely
> > no documentation and add an exported mm API with no description.
> > 
> > This is just I think not something we want to add? Especially on something
> > so fundamental?
> 
> I raised previously that providing a folio helper is odd, and that I suggested
> that we defer this change.

Sadly it's a dependency actually - without it memcg failures would cause
repeated re-zeroing where previously it failed without zeroing.

It's the result of the whole GFP_ZERO idea.



> Maybe we'd want to add such an interface for frozen pages later (to be used by
> the balloon), but I don't think we want that for folios.
> 
> [1] https://lore.kernel.org/all/5f76af6e-9818-42ea-a305-c0fc1d920dca@kernel.org/
> 
> -- 
> Cheers,
> 
> David


^ permalink raw reply

* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Lorenzo Stoakes @ 2026-06-08 14:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin
In-Reply-To: <20260608094153-mutt-send-email-mst@kernel.org>

On Mon, Jun 08, 2026 at 09:48:34AM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 10:43:21AM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:34:23AM -0400, Michael S. Tsirkin wrote:
> > > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > > update to page->flags can race with non-atomic flag operations
> > > that run under zone->lock in the buddy allocator.
> > >
> > > In particular, __free_pages_prepare() does:
> > >
> > >     page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > >
> > > This non-atomic read-modify-write, while correctly excluding
> > > __PG_HWPOISON from the mask, can still lose a concurrent
> > > TestSetPageHWPoison if the read happens before the poison bit
> > > is set and the write happens after.  Follow-up patches in this
> > > series add similar non-atomic flag operations as well.
> > >
> > > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > > around ClearPageHWPoison in the retry path.  This
> > > serializes with all buddy flag manipulation.  The cost is
> > > negligible: one lock/unlock in an extremely rare path
> > > (hardware memory errors).
> > >
> > > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > > in this file operate on pages already removed from the buddy
> > > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > > need zone->lock protection.
> > >
> > > Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > Can we have Fixes: and Cc: stable and also send this separately please?
> >
> > These patches seem like unrelated fixups that you've discovered along the way,
> > and don't belong as part of the already rather large series, unless I'm missing
> > something here.
> >
> > Thanks, Lorenzo
>
> I think you are mising that they are a dependency, not unrelated.

Then say so.

> For example, this issue gets worse with the patchset as there are more
> places that manipulate flags without atomics. No?

It's your job to make that case, not mine.

>
>
> You are welcome to send this to stable, but I think stable rules
> preclude theoretical bugfixes.

It's a dependency but also theoretical?

>
> As for Fixes: the issue has been there for decades. I wouldn't know
> what to attribute it for.

Again, your job.

>
>
> I guess I could send these separately, too, why not. Not sure
> what this accomplishes, but hey.  But is that an ack? You want
> this fix merged even before the feature?

I already made the case as to why, as have other maintainers.

If you need to review what an ack looks like please consult
https://docs.kernel.org/process/5.Posting.html

Thanks, Lorenzo

^ permalink raw reply

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Matthew Wilcox @ 2026-06-08 14:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <cover.1780906288.git.mst@redhat.com>

On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> Further, on architectures with aliasing caches, upstream with init_on_alloc

Further to what?  Did you leave out some paragraphs here?

As far as I can tell, this patch series decides to trust that the
hypervisor has zeroed pages that it allocates to the guest.  But
as far as I can tell, the trend is towards less trust in the hypervisor
from the guest, not more.

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-08 14:26 UTC (permalink / raw)
  To: Lorenzo Stoakes, Matthew Wilcox
  Cc: Michael S. Tsirkin, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aia-q026facbptoT@lucifer>

On 6/8/26 15:09, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>> make life easier by actually breaking things out.
>>>
>>> Like:
>>>
>>> enum alloc_context_type {
>>> 	KERNEL_ALLOCATION,
>>> 	USER_MAPPED_ALLOCATION,
>>> 	USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>> 	/* Perhaps some other states we want to encode? */
>>> };
>>>
>>> struct alloc_context {
>>> 	...
>>>
>>> 	enum alloc_context_type type;
>>> 	unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>
>>> 	// Maybe something suggesting context or whether we init before in some
>>> 	// cases?
>>> };
>>
>> Ugh, please, no.  As I suggested last time I commented on this
>> trainwreck of a series, lift the zeroing functionality from
>> alloc_frozen_pages() into its callers.
> 
> I've not looked at the callers closely enough to see the delta on that, but if
> it avoids this mess then also worth looking at yes...

If that means that we would handle __GFP_ZERO consistently in the callers of
alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
user address down to some degree, through folio interfaces only at least.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v10 24/37] mm: add put_page_zeroed and folio_put_zeroed
From: David Hildenbrand (Arm) @ 2026-06-08 14:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608100602-mutt-send-email-mst@kernel.org>

On 6/8/26 16:08, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 02:46:34PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 14:25, Lorenzo Stoakes wrote:
>>>
>>> Do not put comments about specific expected races like this in the commit
>>> message but not in the code. Subtleties need to be called out.
>>>
>>> The commit message also doesn't at all explain why PG_zeroed doesn't
>>> suffice here.
>>>
>>>
>>> I really don't understand why you have a 'zeroed' folio flag but need to
>>> also have new API calls to detect that?
>>>
>>> They're also HORRIBLY named. Zeroed as in what? Zero page? Huge zero page?
>>> Memory zeroed by kernel? Pages that userland happen to have zeroed? Or host
>>> VM zeroed?
>>>
>>> Each are cases we address individually and relate to folios.
>>>
>>> You absolutely fail to clarify _which one_ you mean, and provide absolutely
>>> no documentation and add an exported mm API with no description.
>>>
>>> This is just I think not something we want to add? Especially on something
>>> so fundamental?
>>
>> I raised previously that providing a folio helper is odd, and that I suggested
>> that we defer this change.
> 
> Sadly it's a dependency actually - without it memcg failures would cause
> repeated re-zeroing where previously it failed without zeroing.

Oh, you mean that we succeeded allocating (+zeroing) but failed to charge?

I don't immediately see that to be a real problem?

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Matthew Wilcox @ 2026-06-08 14:31 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <d0ff6269-8b8a-4983-8569-d9a8495829f3@kernel.org>

On Mon, Jun 08, 2026 at 04:26:18PM +0200, David Hildenbrand (Arm) wrote:
> If that means that we would handle __GFP_ZERO consistently in the callers of
> alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
> user address down to some degree, through folio interfaces only at least.

What I don't understand is how the kernel page allocator needs to know
the user address in order to effectively zero it, but the hypervisor is
able to zero the page without knowing the user address.  It feels like
somebody has x86-centric thinking where cache colouring doesn't matter.

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-08 14:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibSPsc33YQiYSMI@casper.infradead.org>

On 6/8/26 16:31, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:26:18PM +0200, David Hildenbrand (Arm) wrote:
>> If that means that we would handle __GFP_ZERO consistently in the callers of
>> alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
>> user address down to some degree, through folio interfaces only at least.
> 
> What I don't understand is how the kernel page allocator needs to know
> the user address in order to effectively zero it, but the hypervisor is
> able to zero the page without knowing the user address.  It feels like
> somebody has x86-centric thinking where cache colouring doesn't matter.

(not commenting on the icache dache mess we have to drag along)

The thing is that with free-page-reporting the memory is already zeroed by the
hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
and allocating fresh pages on re-access.

So it's not a question of "why is the hypervisor zeroing less efficiently", as
zeroing is just a side-product of reclaiming that memory in the first place.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Matthew Wilcox @ 2026-06-08 14:44 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <ef14cfdf-5f1f-4e3b-846d-fe1462042883@kernel.org>

On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 16:31, Matthew Wilcox wrote:
> > On Mon, Jun 08, 2026 at 04:26:18PM +0200, David Hildenbrand (Arm) wrote:
> >> If that means that we would handle __GFP_ZERO consistently in the callers of
> >> alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
> >> user address down to some degree, through folio interfaces only at least.
> > 
> > What I don't understand is how the kernel page allocator needs to know
> > the user address in order to effectively zero it, but the hypervisor is
> > able to zero the page without knowing the user address.  It feels like
> > somebody has x86-centric thinking where cache colouring doesn't matter.
> 
> (not commenting on the icache dache mess we have to drag along)

Well, that was kind of the point of this email ... I did ask the
question you're answering in a different email so let me respond
to that too.

> The thing is that with free-page-reporting the memory is already zeroed by the
> hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
> and allocating fresh pages on re-access.
> 
> So it's not a question of "why is the hypervisor zeroing less efficiently", as
> zeroing is just a side-product of reclaiming that memory in the first place.

We definitely have users who don't want the guest to trust the
hypervisor.  So how do they disable this optimisation?

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-08 14:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibVTe5HTk-fF_kj@casper.infradead.org>

On 6/8/26 16:44, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
>> On 6/8/26 16:31, Matthew Wilcox wrote:
>>>
>>> What I don't understand is how the kernel page allocator needs to know
>>> the user address in order to effectively zero it, but the hypervisor is
>>> able to zero the page without knowing the user address.  It feels like
>>> somebody has x86-centric thinking where cache colouring doesn't matter.
>>
>> (not commenting on the icache dache mess we have to drag along)
> 
> Well, that was kind of the point of this email ... I did ask the
> question you're answering in a different email so let me respond
> to that too.

Now I'm confused :)

> 
>> The thing is that with free-page-reporting the memory is already zeroed by the
>> hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
>> and allocating fresh pages on re-access.
>>
>> So it's not a question of "why is the hypervisor zeroing less efficiently", as
>> zeroing is just a side-product of reclaiming that memory in the first place.
> 
> We definitely have users who don't want the guest to trust the
> hypervisor.  So how do they disable this optimisation?

Right, I don't think we currently have a toggle to disable free page reporting.
So IIUC, this optimization would similarly automatically get enabled if the
hypervisor advertises it.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 15:27 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <64f2e580-aa82-4849-9236-b8ec4208ca24@kernel.org>

On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:

> On 6/8/26 12:23, Lorenzo Stoakes wrote:
>> I noticed this patch, again, sneaks in some additional code changes that
>> are not mentioned in the commit message and seem irrelevant to the patch.
>>
>> Not sure if the AI is doing this, but please don't do that.
>>
>> On Mon, Jun 08, 2026 at 04:35:17AM -0400, Michael S. Tsirkin wrote:
>>> Thread a user virtual address from vma_alloc_folio() down through
>>> the page allocator to post_alloc_hook(). This is plumbing
>>> preparation for a subsequent patch that will use user_addr to
>>> call folio_zero_user() for cache-friendly zeroing of user pages.
>>
>> This feels like very weak justification for code that massively changes mm
>> code and makes it all much worse.
>>
>>>
>>> The user_addr is stored in struct alloc_context and flows through:
>>>   vma_alloc_folio -> folio_alloc_mpol -> __alloc_pages_mpol ->
>>>   __alloc_frozen_pages -> get_page_from_freelist -> prep_new_page ->
>>>   post_alloc_hook
>>
>> Is this the only relevant code path? You're changing a TON of code here
>> that will have multiple different code paths?
>>
>>>
>>> USER_ADDR_NONE ((unsigned long)-1) is used for non-user
>>> allocations, since address 0 is a valid userspace mapping.
>>
>> Ugh god, so now we're passing a user address through allocation paths that
>> might not even be aware of this or be allocating memory at a point when the
>> mapping is known?
>
> The original ideas was to do this only with internal interfaces. I think I
> raised before to leave hugetlb alone for now.
>
> Fundamentally, user_alloc_needs_zeroing() is something we should get rid of, to
> just be able use __GFP_ZERO and do zeroing at exactly one place.

Just a reminder that user_alloc_needs_zeroing() not only checks init_on_alloc,
but also some arc clearing page requirements. To do zeroing in one place,
clear_page() used in post_alloc_hook() will need to learn how to handle
arch-specific zeroing from clear_user_page()/clear_user_highpage().

>
>
> The question is how to pass that information (user_addr) through internal APIs.

Or should we defer zeroing after a page is returned from allocator? So that
user_addr does not need to be passed through irrelevant allocation APIs.
Something like:

alloc_page_wrapper(gfp, order, user_addr)
{
	page = alloc_pages();
	if (gfp & __GFP_ZERO)
		clear_page(page);
}

>
> I previously said, that ideally we'd end up with a folio allocation interface in
> mm/page_alloc.c, from where we could get this done more cleanly internally.
>
> But I don't want what the previous proposals did: use GFP flags+leak state or
> return magic "zeroed" flags.


Best Regards,
Yan, Zi

^ permalink raw reply

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Gregory Price @ 2026-06-08 15:45 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Lorenzo Stoakes, Liam R. Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <e54f9b79-b6b4-4911-86ac-48b61889765a@kernel.org>

On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > 
> > It seems those are niche architectures so we can ignore that part for perf
> > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > booting with init_on_alloc.
> 
> OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> negated. But you're changing that anyway to skip that user zeroing, right?
> 
> "
> This series eliminates that double-zeroing by moving the zeroing
> into the post_alloc_hook + propagating the "host
> already zeroed this page" information through the buddy allocator.
> "
> 
> So relying on "everything in buddy is zeroed" would still work I'd think.
> 

This regresses for anything that previously didn't zero on free or
alloc, which is most kernel allocations.

I think the scope of this set has increased too much based on early
feedback to fix the userland-initiated allocations piece along with the
balloon/reporting/double-zero piece.  That's making all of this
difficult to continue following.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 15:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiacZ6_7SG3nvVjM@lucifer>

On Mon, Jun 08, 2026 at 12:23:07PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:36:38AM -0400, Michael S. Tsirkin wrote:
> 
> > +			 * user_addr != USER_ADDR_NONE implies sleepable
> > +			 * context (user page fault).
> 
> Can you safely assume that? Also inferring which context we are in from this
> parameter seems risky.
> 
> It seems to me that you're now making it such that kernel developers:
> 
> - Have to know when and when not to specify a user address, and under what
>   circumstances we might consider that to be mapped.
> 
> - Need to know to do this correctly for aliasing architectures or have silent
>   correctness issues.
> 
> - Need to take context into account when specifying this.
> 
> We definitely need to find a simpler way to do this!
>

This feedback was poked at in earlier versions.  There's a tension
between keeping the old interface as-is, having explicit interfaces
for something like this, and the state of a page inside the
allocator vs outside.

Double-plus complicated by the fact that we're trying to reason about
two allocators at once:  host and guest.

It seems it has gotten a bit more complicated since then (I missed this
"sleepable context" bit, not sure if it was there on prior versions).

If `user_addr` is now implying anything other than exactly: "This needs
to be zeroed / caches flushed", then this is bad.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 16/37] mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
From: Gregory Price @ 2026-06-08 15:59 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiapFvxPXuPd4-OP@lucifer>

On Mon, Jun 08, 2026 at 12:37:20PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:37:41AM -0400, Michael S. Tsirkin wrote:
> > Same change as the previous patch but for alloc_swap_folio:
> 
> Please don't say 'same change as the previous patch' :) explain what you're
> doing here. It's a pain to have to go check otherwise.
> 

MST you need to slow down a bit.

I gave you this same feedback 3 versions ago:

https://lore.kernel.org/linux-mm/agXUHItfxSwtriRF@gourry-fedora-PF4VCD3F/

~Gregory

^ permalink raw reply

* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
From: Gregory Price @ 2026-06-08 16:09 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michael S. Tsirkin, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aias_qh3wR-Xkf6T@lucifer>

On Mon, Jun 08, 2026 at 01:00:17PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> >
> >  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > -		     unsigned long user_addr);
> > +		     bool zeroed, unsigned long user_addr);
> 
> host_zeroed or something would be more appropriate no?
> 
> But in general do we need to propagate this around, can't we derive it from
> the page zeroed flag?
> 
> It's really confusing as to _which_ zeroing this refers to, it seems the
> only one relevant here is the VM host zeroing but that's completely
> non-obvious and now everybody using these functions with the extra param
> will simply have to happen to know this.
> 
> If we could find a way to avoid this propagation that'd be ideal.
> 
> Failing that, making it clear this is _only_ for vm host zeroing would be
> better, but then maybe we need to think about how we could encode this in
> some other way, e.g. passing alloc_context perhaps?
> 

This is unaddressed feedback from 3 version ago:

https://lore.kernel.org/linux-mm/agXYbcuQYooG74pb@gourry-fedora-PF4VCD3F/

We can infer all of this from snapshotted page flags and propogate those
around.  This is infinitely more useful than just a single flag being
pulled out into a boolean, and more extensible.

void
post_alloc_hook(struct page *page, usigned int order, gfp_t gfp_flags,
                uint64_t pg_alloc_flags, unsigned long user_addr);

		^^^ page_alloc.c internal falgs only

Once the allocator gets a page it wants to return, it can take a snapshot
of the flags at that point, and then doodle all over the flags as it
goes through the page setup prior to return (include the post hook).

Haven't seen a reason why this shouldn't be done this way.

~Gregory

^ permalink raw reply

* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Andrew Morton @ 2026-06-08 16:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin
In-Reply-To: <20260608094153-mutt-send-email-mst@kernel.org>

On Mon, 8 Jun 2026 09:48:34 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote:

> > These patches seem like unrelated fixups that you've discovered along the way,
> > and don't belong as part of the already rather large series, unless I'm missing
> > something here.
> > 
> > Thanks, Lorenzo
> 
> I think you are mising that they are a dependency, not unrelated.
> For example, this issue gets worse with the patchset as there are more
> places that manipulate flags without atomics. No?
> 
> 
> You are welcome to send this to stable, but I think stable rules
> preclude theoretical bugfixes.

I agree with Lorenzo, please - let's have fixes separated out.  After
all, the rest of the series might never be merged (sorry, it happens).

Whether each fix gets a cc:stable is TBD, case-by-case - it depends on
the expected userspace impact.


^ permalink raw reply

* [PATCH v2] i2c: virtio: retain xfer with kref to fix UAF on interrupted wait
From: Gavin Li @ 2026-06-08 17:44 UTC (permalink / raw)
  To: linux-i2c, viresh.kumar
  Cc: Chen, Jian Jun, andi.shyti, virtualization, Gavin Li
In-Reply-To: <20260607143608.76122-1-gavin.li@samsara.com>

commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible
completion wait") switched virtio_i2c_complete_reqs() to
wait_for_completion_interruptible() so a stuck device cannot hang a
task forever. That left a use-after-free: if the wait returns early on
a signal, virtio_i2c_xfer() frees reqs and DMA bounce buffers while the
device may still hold virtqueue tokens pointing at &reqs[i] and DMA
into read buffers. When those requests complete later,
virtio_i2c_msg_done() calls complete() on freed memory.

Waiting uninterruptibly for every completion before freeing avoids the
UAF but can hang the caller indefinitely if the virtio side never
completes the request. The virtio spec provides no way to cancel an
in-flight transfer, so that is not an acceptable tradeoff.

This commit makes two changes:

- Manage the freeing of the xfer allocations via kref, and ensure that
  each in-flight request holds a reference. This fixes the
  use-after-free by ensuring that the virtio device has a valid location
  to write to until the request completes. This will cause a memory
  leak in cases where the device hangs, but that is much preferable to
  memory corruption.

- Use wait_for_completion_killable() instead of _interruptible(). Even
  partial I2C transactions can have side effects, so the only time it
  makes sense to interrupt a transaction is when a process needs to be
  killed. Most existing I2C drivers don't support interruption at all,
  so this should not break userspace applications. This also addresses
  issues with Go programs accessing devices via the I2C userspace API,
  since the Go runtime stochastically signals SIGURG to running threads;
  leaving this as _interruptible() may cause partial side effects from
  which it is impossible to cleanly restart.

Signed-off-by: Gavin Li <gavin.li@samsara.com>
---
 drivers/i2c/busses/i2c-virtio.c | 89 ++++++++++++++++++++++++---------
 1 file changed, 64 insertions(+), 25 deletions(-)

diff --git a/drivers/i2c/busses/i2c-virtio.c b/drivers/i2c/busses/i2c-virtio.c
index 726c162cabd86..f7320a67a3409 100644
--- a/drivers/i2c/busses/i2c-virtio.c
+++ b/drivers/i2c/busses/i2c-virtio.c
@@ -13,6 +13,7 @@
 #include <linux/err.h>
 #include <linux/i2c.h>
 #include <linux/kernel.h>
+#include <linux/kref.h>
 #include <linux/module.h>
 #include <linux/virtio.h>
 #include <linux/virtio_ids.h>
@@ -31,39 +32,77 @@ struct virtio_i2c {
 	struct virtqueue *vq;
 };
 
+struct virtio_i2c_xfer;
+
 /**
  * struct virtio_i2c_req - the virtio I2C request structure
+ * @xfer: owning transfer
+ * @msg: copy of the I2C message for virtio_i2c_xfer_release
  * @completion: completion of virtio I2C message
  * @out_hdr: the OUT header of the virtio I2C message
  * @buf: the buffer into which data is read, or from which it's written
  * @in_hdr: the IN header of the virtio I2C message
  */
 struct virtio_i2c_req {
+	struct virtio_i2c_xfer *xfer;
+	struct i2c_msg msg;
 	struct completion completion;
 	struct virtio_i2c_out_hdr out_hdr	____cacheline_aligned;
 	uint8_t *buf				____cacheline_aligned;
 	struct virtio_i2c_in_hdr in_hdr		____cacheline_aligned;
 };
 
+/**
+ * struct virtio_i2c_xfer - a queued I2C transfer
+ * @ref: one ref for the caller, plus one per in-flight virtqueue request
+ * @num: number of messages
+ * @reqs: the virtio I2C requests
+ */
+struct virtio_i2c_xfer {
+	struct kref ref;
+	int num;
+	struct virtio_i2c_req reqs[];
+};
+
+static void virtio_i2c_xfer_release(struct kref *ref)
+{
+	struct virtio_i2c_xfer *xfer = container_of(ref, struct virtio_i2c_xfer, ref);
+	int i;
+
+	for (i = 0; i < xfer->num; i++) {
+		struct virtio_i2c_req *req = &xfer->reqs[i];
+		i2c_put_dma_safe_msg_buf(req->buf, &req->msg, false);
+	}
+
+	kfree(xfer);
+}
+
 static void virtio_i2c_msg_done(struct virtqueue *vq)
 {
 	struct virtio_i2c_req *req;
 	unsigned int len;
 
-	while ((req = virtqueue_get_buf(vq, &len)))
+	while ((req = virtqueue_get_buf(vq, &len))) {
 		complete(&req->completion);
+		kref_put(&req->xfer->ref, virtio_i2c_xfer_release);
+	}
 }
 
 static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
-				   struct virtio_i2c_req *reqs,
+				   struct virtio_i2c_xfer *xfer,
 				   struct i2c_msg *msgs, int num)
 {
 	struct scatterlist *sgs[3], out_hdr, msg_buf, in_hdr;
+	struct virtio_i2c_req *reqs = xfer->reqs;
 	int i;
 
+	kref_init(&xfer->ref);
+
 	for (i = 0; i < num; i++) {
 		int outcnt = 0, incnt = 0;
 
+		reqs[i].xfer = xfer;
+		reqs[i].msg = msgs[i];
 		init_completion(&reqs[i].completion);
 
 		/*
@@ -99,36 +138,36 @@ static int virtio_i2c_prepare_reqs(struct virtqueue *vq,
 
 		if (virtqueue_add_sgs(vq, sgs, outcnt, incnt, &reqs[i], GFP_KERNEL)) {
 			i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], false);
+			reqs[i].buf = NULL; /* prevent free by virtio_i2c_xfer_release */
 			break;
 		}
+
+		kref_get(&xfer->ref); /* released in virtio_i2c_msg_done() */
 	}
 
+	xfer->num = i;
 	return i;
 }
 
-static int virtio_i2c_complete_reqs(struct virtqueue *vq,
-				    struct virtio_i2c_req *reqs,
-				    struct i2c_msg *msgs, int num)
+static int virtio_i2c_complete_reqs(struct virtio_i2c_xfer *xfer)
 {
-	bool failed = false;
-	int i, j = 0;
+	struct virtio_i2c_req *reqs = xfer->reqs;
+	int i, fail_index = -1;
 
-	for (i = 0; i < num; i++) {
+	for (i = 0; i < xfer->num; i++) {
 		struct virtio_i2c_req *req = &reqs[i];
-
-		if (!failed) {
-			if (wait_for_completion_interruptible(&req->completion))
-				failed = true;
-			else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK)
-				failed = true;
-			else
-				j++;
+		if (wait_for_completion_killable(&req->completion)) {
+			return -EINTR;
+		} else if (req->in_hdr.status != VIRTIO_I2C_MSG_OK) {
+			/* Don't break yet. Try to wait until all requests complete. */
+			if (fail_index < 0)
+				fail_index = i;
 		}
-
-		i2c_put_dma_safe_msg_buf(reqs[i].buf, &msgs[i], !failed);
+		i2c_put_dma_safe_msg_buf(req->buf, &req->msg, fail_index < 0);
+		req->buf = NULL; /* prevent free by virtio_i2c_xfer_release */
 	}
 
-	return j;
+	return fail_index >= 0 ? fail_index : xfer->num; /* number of successful transactions */
 }
 
 static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
@@ -136,14 +175,14 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
 {
 	struct virtio_i2c *vi = i2c_get_adapdata(adap);
 	struct virtqueue *vq = vi->vq;
-	struct virtio_i2c_req *reqs;
+	struct virtio_i2c_xfer *xfer;
 	int count;
 
-	reqs = kcalloc(num, sizeof(*reqs), GFP_KERNEL);
-	if (!reqs)
+	xfer = kzalloc(struct_size(xfer, reqs, num), GFP_KERNEL);
+	if (!xfer)
 		return -ENOMEM;
 
-	count = virtio_i2c_prepare_reqs(vq, reqs, msgs, num);
+	count = virtio_i2c_prepare_reqs(vq, xfer, msgs, num);
 	if (!count)
 		goto err_free;
 
@@ -157,10 +196,10 @@ static int virtio_i2c_xfer(struct i2c_adapter *adap, struct i2c_msg *msgs,
 	 */
 	virtqueue_kick(vq);
 
-	count = virtio_i2c_complete_reqs(vq, reqs, msgs, count);
+	count = virtio_i2c_complete_reqs(xfer);
 
 err_free:
-	kfree(reqs);
+	kref_put(&xfer->ref, virtio_i2c_xfer_release);
 	return count;
 }
 
-- 
2.54.0


^ permalink raw reply related

* Re: [PATCH v1] i2c: virtio: wait uninterruptibly for completions to avoid UAF
From: Gavin Li @ 2026-06-08 17:46 UTC (permalink / raw)
  To: Viresh Kumar; +Cc: linux-i2c, Chen, Jian Jun, andi.shyti, virtualization
In-Reply-To: <ars6vptfsifv3zvdbx2p7ihcw4rbwbgkvft3xpqyjbvkdckfzy@zxx2exbtuib4>

Noted. I have sent a v2 patch that allows a hung process to be killed.

On Sun, Jun 7, 2026 at 11:44 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 07-06-26, 10:36, Gavin Li wrote:
> > virtio_i2c_complete_reqs() uses wait_for_completion_interruptible() and stops
> > waiting when a signal arrives. virtio_i2c_xfer() then frees reqs and the
> > per-request DMA bounce buffers while the device may still hold virtqueue tokens
> > pointing at &reqs[i] and DMA into read bounce buffers. Additionally, when the
> > device later completes those requests, virtio_i2c_msg_done() calls complete()
> > on freed memory and can corrupt the slab freelist.
> >
> > Wait uninterruptibly for every completion before freeing reqs. This
> > matches how other virtio drivers retain request storage until the device
> > completes it. The virtio spec unfortunately does not provide a way to
> > cancel an in-flight request, so waiting uninterruptibly is required.
> >
> > Signed-off-by: Gavin Li <gavin.li@samsara.com>
> > ---
> >  drivers/i2c/busses/i2c-virtio.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
>
> This is a revert of (and maybe better if that is mentioned in the logs):
>
> commit a663b3c47ab1 ("i2c: virtio: Avoid hang by using interruptible completion wait")
>
> I don't think this is the right approach here. We shouldn't hang the kernel
> indefinitely if the other side is dead.
>
> --
> viresh

^ permalink raw reply

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Lorenzo Stoakes @ 2026-06-08 17:50 UTC (permalink / raw)
  To: Gregory Price
  Cc: Vlastimil Babka (SUSE), Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibjjsoeEcvXobCw@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 11:45:18AM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> > On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> > >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > >
> > > It seems those are niche architectures so we can ignore that part for perf
> > > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > > booting with init_on_alloc.
> >
> > OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> > negated. But you're changing that anyway to skip that user zeroing, right?
> >
> > "
> > This series eliminates that double-zeroing by moving the zeroing
> > into the post_alloc_hook + propagating the "host
> > already zeroed this page" information through the buddy allocator.
> > "
> >
> > So relying on "everything in buddy is zeroed" would still work I'd think.
> >
>
> This regresses for anything that previously didn't zero on free or
> alloc, which is most kernel allocations.
>
> I think the scope of this set has increased too much based on early
> feedback to fix the userland-initiated allocations piece along with the
> balloon/reporting/double-zero piece.  That's making all of this
> difficult to continue following.

Yeah I feel this is 3, 4 or 5 series put together, and there's a lot to
discuss in each :) so it's pretty difficult to work with them all put
together.

These need to be deferred/separated.

>
> ~Gregory

Thanks, Lorenzo

^ permalink raw reply

* Re: [PATCH v1] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-08 18:20 UTC (permalink / raw)
  To: David Laight
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman, Xuan Zhuo, Eugenio Pérez,
	Simon Horman, kvm, virtualization, netdev, linux-kernel, oxffffaa,
	rulkc
In-Reply-To: <20260608103701.7839ff1b@pumpkin>


On 08/06/2026 12:37, David Laight wrote:
> On Mon, 8 Jun 2026 11:10:24 +0300
> Arseniy Krasnov <avkrasnov@rulkc.org> wrote:
>
>> On 05/06/2026 18:08, David Laight wrote:
>>> On Fri,  5 Jun 2026 14:53:14 +0300
>>> Arseniy Krasnov <avkrasnov@rulkc.org> wrote:
>>>  
>>>> Logically it was based on TCP implementation, so to make further
>>>> support easier, rewrite it in the TCP way.
>>>>
>>>> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
>>>> ---
>>>>  net/vmw_vsock/virtio_transport_common.c | 64 ++++++++++++-------------
>>>>  1 file changed, 32 insertions(+), 32 deletions(-)
>>>>
>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>> index 2fd9eaaf5ca6..00caeeaa5590 100644
>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>> @@ -73,10 +73,13 @@ static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops,
>>>>  static int virtio_transport_fill_skb(struct sk_buff *skb,
>>>>  				     struct virtio_vsock_pkt_info *info,
>>>>  				     size_t len,
>>>> -				     bool zcopy)
>>>> +				     bool zcopy, struct ubuf_info *uarg)
>>>>  {
>>>>  	struct msghdr *msg = info->msg;
>>>>  
>>>> +	/* We have completion - attach it to 'skb'. */
>>>> +	skb_zcopy_set(skb, uarg, NULL);
>>>> +
>>>>  	if (zcopy)
>>>>  		return __zerocopy_sg_from_iter(msg, NULL, skb,
>>>>  					       &msg->msg_iter, len, NULL);
>>>> @@ -208,7 +211,8 @@ static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *
>>>>  						  u32 src_cid,
>>>>  						  u32 src_port,
>>>>  						  u32 dst_cid,
>>>> -						  u32 dst_port)
>>>> +						  u32 dst_port,
>>>> +						  struct ubuf_info *uarg)
>>>>  {
>>>>  	struct vsock_sock *vsk;
>>>>  	struct sk_buff *skb;
>>>> @@ -245,7 +249,7 @@ static struct sk_buff *virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *
>>>>  	if (info->msg && payload_len > 0) {
>>>>  		int err;
>>>>  
>>>> -		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>>>> +		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy, uarg);
>>>>  		if (err)
>>>>  			goto out;
>>>>  
>>>> @@ -321,38 +325,36 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>  		return pkt_len;
>>>>  
>>>> -	if (info->msg) {
>>>> -		/* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>> -		 * there is no MSG_ZEROCOPY flag set.
>>>> +	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
>>>> +		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
>>>> +		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
>>>> +		 * handling from 'tcp_sendmsg_locked()'.
>>>>  		 */
>>>> -		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>> -			info->msg->msg_flags &= ~MSG_ZEROCOPY;
>>>> +		if (info->msg->msg_ubuf) {
>>>> +			uarg = info->msg->msg_ubuf;
>>>> +			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>>>> +		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
>>>> +			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
>>>> +						    NULL, false);
>>>> +			if (!uarg) {
>>>> +				virtio_transport_put_credit(vvs, pkt_len);
>>>> +				return -ENOMEM;
>>>> +			}
>>>>  
>>>> -		if (info->msg->msg_flags & MSG_ZEROCOPY)
>>>>  			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
>>>>  
>>>> +			if (!can_zcopy)
>>>> +				uarg_to_msgzc(uarg)->zerocopy = 0;
>>>> +
>>>> +			have_uref = true;
>>>> +		}
>>>> +
>>>> +		/* 'can_zcopy' means that this transmission will be
>>>> +		 * in zerocopy way (e.g. using 'frags' array).
>>>> +		 */  
>>> I've not looked at the tcp code, but the above doesn't look right.
>>> I don't see why msg->msg_ubuf might be non-NULL without SOCK_ZEROCOPY set.
>>> That would give the outer code a callback when the last skb is freed but
>>> still copy the data.  
>> Hi, 
>>
>> I guess case when 'msg->msg_ubuf' is non-NULL is special case today for io_uring MSG_ZEROCOPY implementation.
>> It was added here https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb315a7d1396b1139fc7daea55f2d3191e8e7092
>> As I see implementation of its tests in tools/testing/selftests/net/io_uring_zerocopy_tx.c , it doesn't require setting SOCK_ZEROCOPY option for
>> socket, so for virtio vsock case I just copied same logic to maintain compatibility, because there is MSG_ZEROCOPY io_uring test for virtio/vsock.
> That code path probably sets info->msg->msg_flags & MSG_ZEROCOPY so wants 'zerocopy'.
> But wants it's own callback function called rather than one that msg_zerocopy_realloc()
> adds.
>
> But there is no reason why a caller might not just want a notification that
> all the skb associated with the sendmsg have been freed without requesting zerocopy.
> Maybe no one does it today, but it is trivial to support.

Caller for this path is io_uring today. It uses own callback 'io_tx_ubuf_complete()' which implements notification using io_uring
achitecture instead of classic MSG_ERRQUEUE socket read way.

>
>>
>>> I also don't see the point of calling msg_zerocopy_realloc() to get a
>>> callback when the last skb is freed and then setting
>>> 	uarg_to_msgzc(uarg)->zerocopy = 0;
>>> so that the callback doesn't actually do anything.
>>> It isn't as though you 'find out' later on that you can't actually do
>>> zerocopy.  
>>
>> Sorry, what do You mean "last skb" ? In this code we first allocate uarg (allocate, because third arg is always NULL). Then in
>> loop we allocate sk_buffs, fill it with data and send. I mean first/last skb will be freed after uarg is already allocated and we
>> don't touch it. I think i didn't understand Your question here.
> The 'uarg' is referenced by all of the skb that contain data for the sendmsg().
> So when the last one of them is freed the callback function is called.
> The purpose of that callback is to 'undo' the zerocopy (page pinning etc).
> But when you set uarg_to_msgzc(uarg)->zerocopy = 0 the callback does nothing.
> So there is no point setting up the callback at all.

Pages are unpinned in sk_buff freeing logic: 'skb_release_data()' -> '__skb_frag_unref()'.  'zerocopy' flag of uarg shows 
caller that real zerocopy was used of not - it is checked in '__msg_zerocopy_callback()' and 'SO_EE_CODE_ZEROCOPY_COPIED' is set in
the 'ee_code' field of struct 'sock_extended_err' which is read by user. I mean idea of MSG_ZEROCOPY API is that if kernel doesn't
return error from 'sendmsg()' call with MSG_ZEROCOPY flag passed, it will send notification about data tx complete anyway - it doesn't
matter that real zercopy was done or not.

Thanks


>
> -- David
>
>>
>>>  
>>>>  		if (can_zcopy)
>>>>  			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>>  					    (MAX_SKB_FRAGS * PAGE_SIZE));
>>>> -
>>>> -		if (info->msg->msg_flags & MSG_ZEROCOPY &&
>>>> -		    info->op == VIRTIO_VSOCK_OP_RW) {
>>>> -			uarg = info->msg->msg_ubuf;
>>>> -
>>>> -			if (!uarg) {
>>>> -				uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>> -							    pkt_len, NULL, false);
>>>> -				if (!uarg) {
>>>> -					virtio_transport_put_credit(vvs, pkt_len);
>>>> -					return -ENOMEM;
>>>> -				}
>>>> -
>>>> -				if (!can_zcopy)
>>>> -					uarg_to_msgzc(uarg)->zerocopy = 0;
>>>> -
>>>> -				have_uref = true;
>>>> -			}
>>>> -		}
>>>>  	}
>>>>  
>>>>  	rest_len = pkt_len;
>>>> @@ -365,14 +367,12 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>  
>>>>  		skb = virtio_transport_alloc_skb(info, skb_len, can_zcopy,
>>>>  						 src_cid, src_port,
>>>> -						 dst_cid, dst_port);
>>>> +						 dst_cid, dst_port, uarg);
>>>>  		if (!skb) {
>>>>  			ret = -ENOMEM;
>>>>  			break;
>>>>  		}
>>>>  
>>>> -		skb_zcopy_set(skb, uarg, NULL);  
>>> Aren't you passing uarg through two function calls instead of doing it here.
>>> Doesn't even make it clearer what is going on.  
>>
>> Agree, to simplify patch, uarg could be set earlier (without passing it to functions) I guess.
>>
>> Thanks
>>
>>
>>> -- David
>>>  
>>>> -
>>>>  		virtio_transport_inc_tx_pkt(vvs, skb);
>>>>  
>>>>  		ret = t_ops->send_pkt(skb, info->net);
>>>> @@ -1178,7 +1178,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>>>  					   le64_to_cpu(hdr->dst_cid),
>>>>  					   le32_to_cpu(hdr->dst_port),
>>>>  					   le64_to_cpu(hdr->src_cid),
>>>> -					   le32_to_cpu(hdr->src_port));
>>>> +					   le32_to_cpu(hdr->src_port), NULL);
>>>>  	if (!reply)
>>>>  		return -ENOMEM;
>>>>    

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-08 18:39 UTC (permalink / raw)
  To: Zi Yan
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Muchun Song, Oscar Salvador,
	Andrew Morton, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
	Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <AEB19464-20A9-4C5B-97B8-681CCDD81031@nvidia.com>

On 6/8/26 17:27, Zi Yan wrote:
> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> 
>> On 6/8/26 12:23, Lorenzo Stoakes wrote:
>>> I noticed this patch, again, sneaks in some additional code changes that
>>> are not mentioned in the commit message and seem irrelevant to the patch.
>>>
>>> Not sure if the AI is doing this, but please don't do that.
>>>
>>>
>>> This feels like very weak justification for code that massively changes mm
>>> code and makes it all much worse.
>>>
>>>
>>> Is this the only relevant code path? You're changing a TON of code here
>>> that will have multiple different code paths?
>>>
>>>
>>> Ugh god, so now we're passing a user address through allocation paths that
>>> might not even be aware of this or be allocating memory at a point when the
>>> mapping is known?
>>
>> The original ideas was to do this only with internal interfaces. I think I
>> raised before to leave hugetlb alone for now.
>>
>> Fundamentally, user_alloc_needs_zeroing() is something we should get rid of, to
>> just be able use __GFP_ZERO and do zeroing at exactly one place.
> 
> Just a reminder that user_alloc_needs_zeroing() not only checks init_on_alloc,
> but also some arc clearing page requirements. To do zeroing in one place,
> clear_page() used in post_alloc_hook() will need to learn how to handle
> arch-specific zeroing from clear_user_page()/clear_user_highpage().

Right.

> 
>>
>>
>> The question is how to pass that information (user_addr) through internal APIs.
> 
> Or should we defer zeroing after a page is returned from allocator? So that
> user_addr does not need to be passed through irrelevant allocation APIs.
> Something like:
> 
> alloc_page_wrapper(gfp, order, user_addr)
> {
> 	page = alloc_pages();
> 	if (gfp & __GFP_ZERO)
> 		clear_page(page);
> }
> 

Not really sure what's best here. I think we'd want to limit the lifting to some
internal API, so it cannot easily be messed up by random kernel code calling
into the wrong API and not getting pages cleared.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH v1] drm/virtio: Fix driver removal with disabled KMS
From: Dmitry Osipenko @ 2026-06-08 18:40 UTC (permalink / raw)
  To: Ryosuke Yasuoka, David Airlie, Gerd Hoffmann, Gurchetan Singh,
	Chia-I Wu
  Cc: dri-devel, virtualization, linux-kernel
In-Reply-To: <18b6b23ae1412a71.6d9db1e7acbb9813.37ad12b07518c586@ryasuoka-thinkpadx1carbongen9.tokyo.csb>

Hi,

On 6/7/26 07:31, Ryosuke Yasuoka wrote:
> Hi Dmitry
> 
> On 04/06/2026 15:27, Dmitry Osipenko wrote:
>> DRM atomic and modesetting aren't initialized if virtio-gpu driver built
>> with disabled KMS, leading to access of uninitialized data on driver
>> removal/unbinding and crashing kernel. Fix it by skipping shutting down
>> atomic core with unavailable KMS.
>>
>> Fixes: 72122c69d717 ("drm/virtio: Add option to disable KMS support")
>> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
>> ---
>>  drivers/gpu/drm/virtio/virtgpu_drv.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
>> index f0fb784c0f6f..2aaa7cb08085 100644
>> --- a/drivers/gpu/drm/virtio/virtgpu_drv.c
>> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
>> @@ -138,7 +138,10 @@ static void virtio_gpu_remove(struct virtio_device *vdev)
>>  
>>  	virtio_gpu_release_vqs(dev);
>>  	drm_dev_unplug(dev);
>> -	drm_atomic_helper_shutdown(dev);
>> +
>> +	if (drm_core_check_feature(dev, DRIVER_ATOMIC))
>> +		drm_atomic_helper_shutdown(dev);
>> +
>>  	virtio_gpu_deinit(dev);
>>  	drm_dev_put(dev);
>>  }
> 
> The patch looks good to me at a glance. I haven't done a full, deep code
> review yet, but I've tested it on my lab and everything works as
> expected.
> 
> Tested-by: Ryosuke Yasuoka <ryasuoka@redhat.com>

Thanks a lot for the testing. The review from you will be very welcomed
too.

-- 
Best regards,
Dmitry

^ permalink raw reply

* Re: [PATCH v10 17/37] mm: page_reporting: skip redundant zeroing of host-zeroed reported pages
From: Matthew Wilcox @ 2026-06-08 18:40 UTC (permalink / raw)
  To: Gregory Price
  Cc: Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibpL3REDdQ8Vhys@gourry-fedora-PF4VCD3F>

On Mon, Jun 08, 2026 at 12:09:19PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 01:00:17PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:37:48AM -0400, Michael S. Tsirkin wrote:
> > >
> > >  void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags,
> > > -		     unsigned long user_addr);
> > > +		     bool zeroed, unsigned long user_addr);
> > 
> > host_zeroed or something would be more appropriate no?
> > 
> > But in general do we need to propagate this around, can't we derive it from
> > the page zeroed flag?
> > 
> > It's really confusing as to _which_ zeroing this refers to, it seems the
> > only one relevant here is the VM host zeroing but that's completely
> > non-obvious and now everybody using these functions with the extra param
> > will simply have to happen to know this.
> > 
> > If we could find a way to avoid this propagation that'd be ideal.
> > 
> > Failing that, making it clear this is _only_ for vm host zeroing would be
> > better, but then maybe we need to think about how we could encode this in
> > some other way, e.g. passing alloc_context perhaps?
> > 
> 
> This is unaddressed feedback from 3 version ago:
> 
> https://lore.kernel.org/linux-mm/agXYbcuQYooG74pb@gourry-fedora-PF4VCD3F/
> 
> We can infer all of this from snapshotted page flags and propogate those
> around.  This is infinitely more useful than just a single flag being
> pulled out into a boolean, and more extensible.
> 
> void
> post_alloc_hook(struct page *page, usigned int order, gfp_t gfp_flags,
>                 uint64_t pg_alloc_flags, unsigned long user_addr);
> 
> 		^^^ page_alloc.c internal falgs only
> 
> Once the allocator gets a page it wants to return, it can take a snapshot
> of the flags at that point, and then doodle all over the flags as it
> goes through the page setup prior to return (include the post hook).
> 
> Haven't seen a reason why this shouldn't be done this way.

I'd tuned out this awful series since it'd become apparent that my
feedback wasn't going to be taken seriously.  Good to know I shouldn't
take it personally because he does it to you too.

I think it's apparent that Michael has no understanding of the MM.
So we should start again with the architecture.  Let's look at the
problem that he's trying to solve:

 - The hypervisor has zeroed the memory that it passes to the MM
 - But the MM then zeroes it again before passing it to userspace.
 - We want to avoid this

Let's make sure that's the actual problem before going any further.
Because I do have a design that will satisfy that without doing this
insane level of invasive change, but if that's not the problem, there's
no point wasting my time writing it up.

^ permalink raw reply

* Re: [PATCH] drm/virtio: fix dma_fence refcount leak on error in virtio_gpu_dma_fence_wait()
From: Dmitry Osipenko @ 2026-06-08 18:43 UTC (permalink / raw)
  To: Wentao Liang, airlied, kraxel, maarten.lankhorst, mripard,
	tzimmermann, simona
  Cc: gurchetansingh, olvaffe, dri-devel, virtualization, linux-kernel,
	stable
In-Reply-To: <20260607090303.92423-1-vulab@iscas.ac.cn>

On 6/7/26 12:03, Wentao Liang wrote:
> dma_fence_unwrap_for_each() internally calls dma_fence_unwrap_first()
> which does cursor->chain = dma_fence_get(head), taking an extra
> reference. On normal loop completion, dma_fence_unwrap_next()
> releases this via dma_fence_chain_walk() -> dma_fence_put().
> 
> When virtio_gpu_do_fence_wait() fails and the function returns early
> from inside the loop, the cursor->chain reference is never released.
> This is the only caller in the entire kernel that does an early return
> inside dma_fence_unwrap_for_each.
> 
> Add dma_fence_put(itr.chain) before the early return.
> 
> Cc: stable@vger.kernel.org
> Fixes: eba57fb5498f ("drm/virtio: Wait for each dma-fence of in-fence array individually")
> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
> ---
>  drivers/gpu/drm/virtio/virtgpu_submit.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_submit.c b/drivers/gpu/drm/virtio/virtgpu_submit.c
> index dae761fa5788..32cb1e4aa425 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_submit.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_submit.c
> @@ -65,8 +65,10 @@ static int virtio_gpu_dma_fence_wait(struct virtio_gpu_submit *submit,
>  
>  	dma_fence_unwrap_for_each(f, &itr, fence) {
>  		err = virtio_gpu_do_fence_wait(submit, f);
> -		if (err)
> +		if (err) {
> +			dma_fence_put(itr.chain);
>  			return err;
> +		}
>  	}
>  
>  	return 0;

Applied to misc-fixes, thanks!

-- 
Best regards,
Dmitry

^ permalink raw reply

* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-08 19:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Hildenbrand (Arm), Lorenzo Stoakes, linux-kernel,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
	Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibVTe5HTk-fF_kj@casper.infradead.org>

On Mon, Jun 08, 2026 at 03:44:29PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:37:03PM +0200, David Hildenbrand (Arm) wrote:
> > On 6/8/26 16:31, Matthew Wilcox wrote:
> > > On Mon, Jun 08, 2026 at 04:26:18PM +0200, David Hildenbrand (Arm) wrote:
> > >> If that means that we would handle __GFP_ZERO consistently in the callers of
> > >> alloc_frozen_pages(), that would also do I guess. We'd still have to pass the
> > >> user address down to some degree, through folio interfaces only at least.
> > > 
> > > What I don't understand is how the kernel page allocator needs to know
> > > the user address in order to effectively zero it, but the hypervisor is
> > > able to zero the page without knowing the user address.  It feels like
> > > somebody has x86-centric thinking where cache colouring doesn't matter.
> > 
> > (not commenting on the icache dache mess we have to drag along)
> 
> Well, that was kind of the point of this email ... I did ask the
> question you're answering in a different email so let me respond
> to that too.
> 
> > The thing is that with free-page-reporting the memory is already zeroed by the
> > hypervisor as part of discarding that memory previously (e.g., MADV_DONTNEED)
> > and allocating fresh pages on re-access.
> > 
> > So it's not a question of "why is the hypervisor zeroing less efficiently", as
> > zeroing is just a side-product of reclaiming that memory in the first place.
> 
> We definitely have users who don't want the guest to trust the
> hypervisor.  So how do they disable this optimisation?

What do you mean, how? This is done by:

[PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests


-- 
MST


^ permalink raw reply

* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Michael S. Tsirkin @ 2026-06-08 19:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Gregory Price, Vlastimil Babka (SUSE), linux-kernel,
	David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
	Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
	Alistair Popple, Christoph Lameter, David Rientjes,
	Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
	virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aib9QIGIn8f4W6RM@lucifer>

On Mon, Jun 08, 2026 at 06:50:46PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 11:45:18AM -0400, Gregory Price wrote:
> > On Mon, Jun 08, 2026 at 01:13:42PM +0200, Vlastimil Babka (SUSE) wrote:
> > > On 6/8/26 13:02, Vlastimil Babka (SUSE) wrote:
> > > > On 6/8/26 10:33, Michael S. Tsirkin wrote:
> > > >> Further, on architectures with aliasing caches, upstream with init_on_alloc
> > > >
> > > > It seems those are niche architectures so we can ignore that part for perf
> > > > purposes; the other reason why user_alloc_needs_zeroing() would be true is
> > > > booting with init_on_alloc.
> > >
> > > OK I misread how user_alloc_needs_zeroing() works wrt init_on_alloc, as it's
> > > negated. But you're changing that anyway to skip that user zeroing, right?
> > >
> > > "
> > > This series eliminates that double-zeroing by moving the zeroing
> > > into the post_alloc_hook + propagating the "host
> > > already zeroed this page" information through the buddy allocator.
> > > "
> > >
> > > So relying on "everything in buddy is zeroed" would still work I'd think.
> > >
> >
> > This regresses for anything that previously didn't zero on free or
> > alloc, which is most kernel allocations.
> >
> > I think the scope of this set has increased too much based on early
> > feedback to fix the userland-initiated allocations piece along with the
> > balloon/reporting/double-zero piece.  That's making all of this
> > difficult to continue following.
> 
> Yeah I feel this is 3, 4 or 5 series put together, and there's a lot to
> discuss in each :) so it's pretty difficult to work with them all put
> together.
> 
> These need to be deferred/separated.

I can do that, it's just that the real performance benefits
only come with the last patches in the series.

If I send series that merely moves zeroing around, with
a bunch of threading of addresses and stuff to achieve that,
0 perf gain and slight degradation in corner cases like memcg
failures, you feel it will be well received? You guys really
want to do that, independently of the rest?

Just making sure, I'm not the maintainer here.


> >
> > ~Gregory
> 
> Thanks, Lorenzo


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox