* New design
From: Matthew Wilcox @ 2026-06-09 3:58 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <cover.1780906288.git.mst@redhat.com>
OK, here's how I'd structure this:
1. Introduce PG_zeroed for buddy pages
2. Set it if init_on_free is set
3. Set it from balloon driver
https://lore.kernel.org/lkml/c7094de807c0e963526686e1d245bc76193b1a92.1776689093.git.mst@redhat.com/
but add FPI_ZEROED instead of an extra bool parameter.
4. Introduce page_is_zeroed like this:
static inline bool page_is_zeroed(const struct page *page)
{
	/*
	 * lru.next has bit 2 set if the page is already zeroed.
	 * Callers may simply overwrite it once they no longer
	 * need to preserve that information.
	 */
	return (unsigned long)page->lru.next & BIT(2);
}
(you'll notice this is similar to page_is_pfmemalloc() but it doesn't
need to be in mm.h)
This step is going to be a bit fiddly. We weren't expecting to return
multiple flags in page->lru.next, so clear_page_pfmemalloc() just sets
page->lru.next to NULL. So somewhere we need to make sure that
page->lru.next is definitely NULL, and then allow both the zeroed and
pfmemalloc flags to be set in it.
The important part of this is that it allows the zeroed flag to be
returned from the page allocator without introducing pghint_t like you
did in v2.
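To make the "clear first, then allow both flags" idea concrete, here is a
minimal userspace sketch. It is not kernel code: struct page_sketch,
struct list_head_sketch and the sketch_* helpers are hypothetical stand-ins,
and using bit 1 for pfmemalloc is an assumption of this sketch (the mail
above only fixes bit 2 for the zeroed flag).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's list_head and struct page. */
struct list_head_sketch { struct list_head_sketch *next, *prev; };
struct page_sketch { struct list_head_sketch lru; };

/* Bit 2 follows the mail above; bit 1 for pfmemalloc is an assumption. */
#define SKETCH_PFMEMALLOC ((uintptr_t)1 << 1)
#define SKETCH_ZEROED     ((uintptr_t)1 << 2)

/* Start from a known state so stale pointer bits cannot look like flags. */
static inline void sketch_clear_flags(struct page_sketch *page)
{
	page->lru.next = NULL;
}

/* OR a flag in instead of overwriting, so both flags can coexist. */
static inline void sketch_set_flag(struct page_sketch *page, uintptr_t flag)
{
	assert((flag & ~(SKETCH_PFMEMALLOC | SKETCH_ZEROED)) == 0);
	page->lru.next = (struct list_head_sketch *)
			 ((uintptr_t)page->lru.next | flag);
}

static inline bool sketch_test_flag(const struct page_sketch *page,
				    uintptr_t flag)
{
	return ((uintptr_t)page->lru.next & flag) != 0;
}
```

The point of the sketch is only the ordering: the field has to be cleared
once at a well-defined place, after which each flag is set with an OR so
that setting one does not wipe the other.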
5. Now you can start skipping various zeroing steps higher in the call
chain.
I understand David's disgust with vma_alloc_zeroed_movable_folio()
but that is surely a separate cleanup and nothing to do with this
patchset.
^ permalink raw reply
* Re: [PATCH v1] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Jakub Kicinski @ 2026-06-09 2:08 UTC (permalink / raw)
To: Arseniy Krasnov
Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
Eric Dumazet, Paolo Abeni, Michael S. Tsirkin, Jason Wang,
Bobby Eshleman, Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm,
virtualization, netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260605115314.552321-1-avkrasnov@rulkc.org>
On Fri, 5 Jun 2026 14:53:14 +0300 Arseniy Krasnov wrote:
> Logically it was based on TCP implementation, so to make further
> support easier, rewrite it in the TCP way.
Does not apply:
$ git pw series apply 1106582
Failed to apply patch:
Applying: vsock/virtio: rework MSG_ZEROCOPY flag handling
error: sha1 information is lacking or useless (net/vmw_vsock/virtio_transport_common.c).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 vsock/virtio: rework MSG_ZEROCOPY flag handling
--
pw-bot: cr
^ permalink raw reply
* [PATCH net] vsock/virtio: restore msg_iter on transmission failure
From: Octavian Purdila @ 2026-06-09 0:48 UTC (permalink / raw)
To: netdev
Cc: Octavian Purdila, syzbot+28e5f3d207b14bae122a, Stefan Hajnoczi,
Stefano Garzarella, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
Eugenio Pérez, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Arseniy Krasnov, kvm, virtualization,
linux-kernel
When transmission fails in virtio_transport_send_pkt_info, the msg_iter
might have been partially advanced. If we don't restore it, the next
attempt to send data will use an incorrect iterator state, leading to
desync and warnings like "send_pkt() returns 0, but X expected".
Specifically, this can happen in the following scenario, triggered by
the syzkaller repro:
1. A write-only VMA (PROT_WRITE only) is partially populated by a
prior TUN write that failed with -EIO but still faulted in some
pages.
2. A vsock sendmmsg call with MSG_ZEROCOPY requests transmission of a
buffer from this VMA.
3. The first packet (64KB) is sent successfully because the pages are
populated.
4. The second packet allocation fails because GUP fast pins the first page
but GUP slow fails on the next unpopulated page due to PROT_WRITE-only
permissions.
5. The iterator is advanced by the partially successful GUP (68KB total
advanced: 64KB from first packet + 4KB from second), but the send loop
breaks and only reports 64KB sent. This creates a 4KB desync.
6. The next retry starts with a non-zero iov_offset, disabling zerocopy
and falling back to copy mode.
7. In copy mode, the transmission succeeds for the next packets but
exhausts the iterator early because of the desync.
8. The final retry sees an empty iterator but zerocopy is re-enabled
(offset resets). It attempts to send the remaining bytes with zerocopy
but pins 0 pages, creating an empty packet.
9. The transport sends the empty packet, triggering the warning because
the returned bytes (header only) do not match the expected payload size.
10. The loop continues to spin, allocating ubuf_info each time, eventually
exhausting sysctl_optmem_max and returning -ENOMEM to userspace.
Restore msg_iter to its original state before the packet allocation
and transmission attempt if they fail.
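The save/restore pattern can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: struct iter_sketch,
build_packet() and send_all() are hypothetical names, and the "fault" is
simulated by a byte offset rather than by GUP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct iov_iter. */
struct iter_sketch {
	size_t offset;	/* bytes already consumed */
	size_t count;	/* bytes still to send */
};

/*
 * Pretend packet builder: consumes up to 'len' bytes, but fails once the
 * offset would pass 'fault_at', leaving the iterator partially advanced.
 */
static bool build_packet(struct iter_sketch *it, size_t len, size_t fault_at)
{
	size_t ok = len;
	bool failed = false;

	if (it->offset + len > fault_at) {
		ok = fault_at > it->offset ? fault_at - it->offset : 0;
		failed = true;
	}
	assert(ok <= it->count);
	it->offset += ok;
	it->count -= ok;
	return !failed;
}

/* Returns the bytes sent; the iterator always matches that total. */
static size_t send_all(struct iter_sketch *it, size_t max_pkt, size_t fault_at)
{
	size_t sent = 0;

	while (it->count > 0) {
		struct iter_sketch saved = *it;	/* snapshot before the attempt */
		size_t len = it->count < max_pkt ? it->count : max_pkt;

		if (!build_packet(it, len, fault_at)) {
			*it = saved;		/* undo the partial advance */
			break;
		}
		sent += len;
	}
	return sent;
}
```

With a 256KB buffer, 64KB packets and a simulated fault at 68KB, the first
packet succeeds, the second attempt advances 4KB and fails, and the restore
puts the iterator back at 64KB so it agrees with the reported byte count.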
Fixes: e0718bd82e27 ("vsock: enable setting SO_ZEROCOPY")
Reported-by: syzbot+28e5f3d207b14bae122a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=28e5f3d207b14bae122a
Assisted-by: gemini:gemini-3.1-pro
Signed-off-by: Octavian Purdila <tavip@google.com>
---
net/vmw_vsock/virtio_transport_common.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b10666937c490..588623a3e2bbc 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -367,6 +367,10 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
do {
struct sk_buff *skb;
size_t skb_len;
+ struct iov_iter saved_iter;
+
+ if (info->msg)
+ saved_iter = info->msg->msg_iter;
skb_len = min(max_skb_len, rest_len);
@@ -375,6 +379,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
src_cid, src_port,
dst_cid, dst_port);
if (!skb) {
+ if (info->msg)
+ info->msg->msg_iter = saved_iter;
ret = -ENOMEM;
break;
}
@@ -382,8 +388,11 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
virtio_transport_inc_tx_pkt(vvs, skb);
ret = t_ops->send_pkt(skb, info->net);
- if (ret < 0)
+ if (ret < 0) {
+ if (info->msg)
+ info->msg->msg_iter = saved_iter;
break;
+ }
/* Both virtio and vhost 'send_pkt()' returns 'skb_len',
* but for reliability use 'ret' instead of 'skb_len'.
--
2.54.0.1064.gd145956f57-goog
^ permalink raw reply related
* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Sean Christopherson @ 2026-06-08 22:38 UTC (permalink / raw)
To: David Woodhouse
Cc: Thomas Gleixner, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Kiryl Shutsemau, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, John Stultz, H. Peter Anvin,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
Michael Kelley
In-Reply-To: <eef867eae15e30d08482ba16a1a32159745b64a7.camel@infradead.org>
On Sat, Jun 06, 2026, David Woodhouse wrote:
> On Sat, 2026-06-06 at 12:34 +0200, Thomas Gleixner wrote:
> > On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> >
> > > Now that all paravirt code that explicitly specifies the TSC frequency
> > > also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> > > and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
> > >
> > > Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> > > by the user. Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> > > line parameter"), one of the goals of the param is to allow the refined
> > > calibration work "to do meaningful error checking".
> > >
> > > Note, preferring the user-provided TSC frequency over the frequency from
> > > the hypervisor or trusted firmware, while simultaneously not treating the
> > > user-provided frequency as gospel, is obviously incongruous. Sweep the
> > > problem under the rug for now to avoid opening a big can of worms that
> > > likely doesn't have a great answer.
> >
> > There is a good answer I think.
> >
> > early_tsc_khz exists to cater for the overclocking crowd. On their
> > modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
> > matching reality anymore. So they work around that by supplying a close
> > enough tsc_early_khz and then they let the refined calibration work
> > figure it out.
> >
> > Arguably that's only relevant for bare metal systems and what's worse is
> > that in virtual environments the refined calibration work can fail,
> > which renders the TSC unstable.
> >
> > So I'd rather say we change this logic to:
> >
> > if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
> > tsc_khz = x86_init.....();
> > force(X86_FEATURE_TSC_KNOWN_FREQ);
> > } else if (tsc_khz_early) {
> > ....
> > } else {
> > ...
> > }
> >
> > Along with:
> >
> > if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
> > if (tsc_khz_early)
> > pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
> >
> > or something daft like that.
Ya, I ended up in the same place once Sashiko pointed out that skipping the SNP/TDX
setup was hazardous[*], and also once I realized that tsc_early_khz *complemented*
the refinement instead of replacing it.
This is what I have locally:
	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
		known_tsc_khz = snp_secure_tsc_init();
	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
		known_tsc_khz = tdx_tsc_init();

	/*
	 * If the TSC frequency wasn't provided by trusted firmware, try to get
	 * it from the hypervisor (which is untrusted when running as a CoCo guest).
	 */
	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
		known_tsc_khz = x86_init.hyper.get_tsc_khz();

	/*
	 * Mark the TSC frequency as known if it was obtained from a hypervisor
	 * or trusted firmware. Don't mark the frequency as known if the user
	 * specified the frequency, as the user-provided frequency is intended
	 * as a "starting point", not a known, guaranteed frequency.
	 */
	if (known_tsc_khz && !tsc_early_khz)
		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

	/*
	 * Ignore the user-provided TSC frequency if the exact frequency was
	 * obtained from trusted firmware or the hypervisor, as the user-
	 * provided frequency is intended as a "starting point", not a known,
	 * guaranteed frequency.
	 */
	if (!known_tsc_khz)
		known_tsc_khz = tsc_early_khz;
	else if (tsc_early_khz)
		pr_err("Ignoring 'tsc_early_khz' in favor of firmware/hypervisor.\n");
[*] https://lore.kernel.org/all/ahnF-FehodVd474X@google.com
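The precedence in the snippet above can be modelled as a pure function,
which makes the corner cases easy to check in userspace. This is a sketch
only: struct tsc_pick and pick_tsc_khz() are hypothetical names, the
firmware and hypervisor lookups are reduced to plain arguments (0 meaning
"not available"), and the logic simply mirrors the snippet as posted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type; none of these names exist in the kernel. */
struct tsc_pick {
	unsigned int khz;	/* frequency to start from, 0 if none */
	bool known_freq;	/* would TSC_KNOWN_FREQ be forced? */
	bool user_ignored;	/* was the user-provided value overridden? */
};

static struct tsc_pick pick_tsc_khz(unsigned int firmware_khz,
				    unsigned int hyper_khz,
				    unsigned int user_khz)
{
	struct tsc_pick p = { .khz = firmware_khz };

	/* Fall back to the hypervisor if trusted firmware gave nothing. */
	if (!p.khz)
		p.khz = hyper_khz;

	/* Mirrors the posted check: only mark known if no user value. */
	if (p.khz && !user_khz)
		p.known_freq = true;

	/* The user value is a starting point, used only as a last resort. */
	if (!p.khz)
		p.khz = user_khz;
	else if (user_khz)
		p.user_ignored = true;

	/* By construction these two outcomes are mutually exclusive. */
	assert(!(p.known_freq && p.user_ignored));
	return p;
}
```

One thing the model makes visible: when both a hypervisor frequency and a
user value are present, the user value is ignored but the frequency is
still not marked as known, exactly as in the posted condition.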
> > The kernel has for various reasons always tried to cater for the needs
> > of users who are plagued by bonkers firmware, but we have to stop to
> > prioritize or treating equal ancient and modded out of spec hardware.
> >
> > TBH, I consider that whole KVM clock nonsense to fall into the modded
> > out of spec hardware realm. Do a reality check:
> >
> > How many production systems are out there still which run VMs on CPUs
> > with a broken TSC and the lack of VM TSC scaling?
> >
> > I'm not saying that we should not support the few remaining systems
> > anymore, but our tendency to pretend that we can keep all of this
> > nonsense working and at the same time making progress is just a fallacy.
FWIW, I have the exact same sentiments about kvmclock, but I'm also trying my
best not to break folks that are happily running on what is effectively flawed,
ancient "hardware".
> I don't know that we can take the KVM (and Xen) clock away from guests,
> but all of the *horrid* part about it is the way it attempts to cope
> with the possibility that the *host* timekeeping might flip away from
> TSC-based mode at any point in time. By the end of my outstanding
> cleanup series, that is the *only* thing the gtod_notifier remains for.
>
> If we can trust the hardware *and* the host kernel, then KVM could
> theoretically hardwire the kvmclock into 'master clock mode' where it
> basically just advertises the TSC→kvmclock relationship *once* to all
> CPUs and it never changes.
>
> All the nonsense about updating it every time we enter a CPU could just
> go away completely.
But to Thomas' point, why bother? For actual old hardware, kvmclock is what it
is. For modern hardware, it's completely antiquated.
^ permalink raw reply
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 22:28 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Zi Yan, Michael S. Tsirkin, Matthew Wilcox, Lorenzo Stoakes,
linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <afb14c99-162c-4c36-b1ef-fe971678f6b9@kernel.org>
On Mon, Jun 08, 2026 at 11:51:47PM +0200, David Hildenbrand (Arm) wrote:
> On 6/8/26 23:16, Zi Yan wrote:
>
> There was Willy's comment in RFC v3 [1], which had 19 patches. Unfortunately, he
> no longer followed up to my initial push back and Michael's question later [2].
>
> That would have probably been the right time to wait for more discussion.
>
> RFC v4 had 22 patches with little replies.
> v5 had 28 patches with little replies.
> v6 had 30 patches with no replies.
> v7 had 31 patches with little replies.
> v8 had 37 patches with no replies.
>
> [1] https://lore.kernel.org/lkml/aeu5P1bZW3yEH54t@casper.infradead.org/
> [2] https://lore.kernel.org/lkml/20260426165330-mutt-send-email-mst@kernel.org/
>
Hm, rewinding on this back to v3 here:
https://lore.kernel.org/lkml/016cc5e5-044c-46c6-a668-200f90a64d85@kernel.org/
You said:
```
Exactly, that's why I am saying that vma_alloc_folio() is the only
external interface people should be using with a user address.
```
Going through the list of folio_zero_user references:
Called unconditionally if a folio is acquired:
fs/hugetlbfs/inode.c: folio_zero_user(folio, addr);
mm/hugetlb.c: folio_zero_user(folio, vmf->real_address);
mm/memfd.c: folio_zero_user(folio, 0);
Called when user_alloc_needs_zeroing() and charging passes:
mm/huge_memory.c: folio_zero_user(folio, addr);
mm/memory.c: folio_zero_user(folio, vmf->address);
No one outside mm/ should know about this interface at all.
Arguably none of these should know about this interface either.
The appropriate place for this logic appears to be:
vma_alloc_folio
alloc_hugetlb_folio
alloc_hugetlb_folio_reserve
The reason to sink it into the post_alloc_hook is to let the buddy
decide whether the page actually needs to be zeroed (like the virtio
situation) based on PG_zeroed or whatever.
It seems like at a minimum moving the logic all the way into
post_alloc_hook lets us actually delete folio_zero_user() as a published
interface and move it entirely within page_alloc.c.
The catch is user_alloc_needs_zeroing() coming along with it.
~Gregory
^ permalink raw reply
* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 21:53 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608174430-mutt-send-email-mst@kernel.org>
On Mon, Jun 08, 2026 at 05:46:27PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 05:33:50PM -0400, Gregory Price wrote:
> >
> > You'd save yourself some revisions by taking the attention you have
> > right now and starting the discussion thread (and consider submitting
> > the topic to LPC if that's something interests you!).
>
> Well it's in october, is it not? I don't think I have the patience to
> keep fiddling with that for half a year.
>
You might be able to find a way forward that doesn't take that long, but
that starts with trying to build consensus on what to build before you
build it.
You're proposing a non-trivial change to the page allocator API; I would
not expect this to move at the speed of claude.
~Gregory
^ permalink raw reply
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: David Hildenbrand (Arm) @ 2026-06-08 21:51 UTC (permalink / raw)
To: Zi Yan, Michael S. Tsirkin
Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
Andrea Arcangeli
In-Reply-To: <DD2E08E8-1FD7-43E3-A2C8-B4435D3773D1@nvidia.com>
On 6/8/26 23:16, Zi Yan wrote:
> On 8 Jun 2026, at 17:04, Michael S. Tsirkin wrote:
>
>> On Mon, Jun 08, 2026 at 04:40:15PM -0400, Zi Yan wrote:
>>>
>>>
>>> Change user_alloc_needs_zeroing() to only check address aliasing even
>>> if that can cause double zeroing for virtio.
>>>
>>> Best Regards,
>>> Yan, Zi
>>
>> Ah. I started with exactly that in v1/v2. It's a simple approach.
Simple, and hacky -> unmergeable. I tried to push it into a different (no GFP
flags -> IMHO better) direction, but the patch set grew in complexity.
I kept telling to keep it simple (e.g., no folio_put optimization, no hugetlb
optimization, simple wrapper functions), and ideally we would have gotten a
better discussion with other folks here much earlier.
And I still do not consider providing a user address to selected interfaces
while centralizing zeroing a bad idea. The real question is how that could be
done in a cleaner way.
Or as Willy said, if we could move zeroing further out to callers, where they
can special-case. But given that KASAN and friends interact with zeroing in their
own way, that is not as straightforward as people might think.
>>
>> But mm maintainers said no, user_alloc_needs_zeroing is a hack and
>> I must not add to it.
I mean, I would hope that we can agree that our existing page/folio zeroing is a
mess and should not be extended by slapping more special casing on top?
Sure, we can try cleaning it up, but conceptually, zeroing happening at two
places in the callchain, with random optimizations to avoid double-zeroing is
just bad.
The fact that a vma_alloc_zeroed_movable_folio() that can be overridden by
architectures even exists makes me angry. user_alloc_needs_zeroing() is just the
tip of the ugly iceberg.
>
> Got it. It sounds that you now get conflicting ideas. Maybe you should
> start a [DISCUSSION] thread that presents the high level idea of what
> you want to achieve and all the ideas you got from the reviews, so that
> people in this thread can have the big picture and come up a consensus
> before you send another version.
>
> Thank you for patiently replying my comments, since those points
> apparently have been discussed in prior submissions.
>
There was Willy's comment in RFC v3 [1], which had 19 patches. Unfortunately, he
no longer followed up to my initial push back and Michael's question later [2].
That would have probably been the right time to wait for more discussion.
RFC v4 had 22 patches with few replies.
v5 had 28 patches with few replies.
v6 had 30 patches with no replies.
v7 had 31 patches with few replies.
v8 had 37 patches with no replies.
[1] https://lore.kernel.org/lkml/aeu5P1bZW3yEH54t@casper.infradead.org/
[2] https://lore.kernel.org/lkml/20260426165330-mutt-send-email-mst@kernel.org/
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 21:46 UTC (permalink / raw)
To: Gregory Price
Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aic1PoxSiHzZ40Jr@gourry-fedora-PF4VCD3F>
On Mon, Jun 08, 2026 at 05:33:50PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 05:16:53PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> > >
> > > As a start:
> > >
> > > 1) the user_addr and zeroing piece seems like a discrete
> > > improvement worthy of its own set - aside from end goal.
> > >
> > > This is needed by your patch set, but was requested to
> > > try to push us towards a more reasonable pattern for
> > > folio_zero_user().
> >
> > What I worry about is people can't agree what api they want.
> >
>
> Oh that's just our base state of existence. We mostly agree that
> all APIs are bad in some way and we don't want any of them :P
>
> What you're looking for is to get people to agree to the
> least-offensive, least-worst option :]
>
> I don't think we're far off from that. I suggest doing as Zi said and
> start a [DISCUSSION] thread on specifically this and lay out the needs
> and wants and design issues that you've learned from the past set of
> versions and continue the discussion there.
>
> It helps to take some snippets from your set to lay out what you've
> learned and explain why you need the folio_user_zero() stuff to get from
> A->Z, and then let maintainers hash out whether that should live in
> post_alloc_hook or new interfaces (or outside page_alloc.c altogether).
>
> > I don't mind trying all kind of approaches, but it seems to
> > be past the point where people feel it's costing too much of
> > their time with all of these revisions.
> >
>
> People are still commenting, so I don't think you've gotten there yet.
> I think the rate of revision is what's costing too much attention.
>
> You'd save yourself some revisions by taking the attention you have
> right now and starting the discussion thread (and consider submitting
> the topic to LPC if that's something interests you!).
Well it's in October, is it not? I don't think I have the patience to
keep fiddling with that for half a year.
> All this is to say you're doing fine, just keep on keepin' on. Maybe
> pivot your approach from iterations to discussion for a bit until the
> opinions settle.
>
> ~Gregory
^ permalink raw reply
* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 21:33 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608170646-mutt-send-email-mst@kernel.org>
On Mon, Jun 08, 2026 at 05:16:53PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> >
> > As a start:
> >
> > 1) the user_addr and zeroing piece seems like a discrete
> > improvement worthy of its own set - aside from end goal.
> >
> > This is needed by your patch set, but was requested to
> > try to push us towards a more reasonable pattern for
> > folio_zero_user().
>
> What I worry about is people can't agree what api they want.
>
Oh that's just our base state of existence. We mostly agree that
all APIs are bad in some way and we don't want any of them :P
What you're looking for is to get people to agree to the
least-offensive, least-worst option :]
I don't think we're far off from that. I suggest doing as Zi said and
start a [DISCUSSION] thread on specifically this and lay out the needs
and wants and design issues that you've learned from the past set of
versions and continue the discussion there.
It helps to take some snippets from your set to lay out what you've
learned and explain why you need the folio_zero_user() stuff to get from
A->Z, and then let maintainers hash out whether that should live in
post_alloc_hook or new interfaces (or outside page_alloc.c altogether).
> I don't mind trying all kind of approaches, but it seems to
> be past the point where people feel it's costing too much of
> their time with all of these revisions.
>
People are still commenting, so I don't think you've gotten there yet.
I think the rate of revision is what's costing too much attention.
You'd save yourself some revisions by taking the attention you have
right now and starting the discussion thread (and consider submitting
the topic to LPC if that's something that interests you!).
All this is to say you're doing fine, just keep on keepin' on. Maybe
pivot your approach from iterations to discussion for a bit until the
opinions settle.
~Gregory
^ permalink raw reply
* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 21:16 UTC (permalink / raw)
To: Gregory Price
Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aicruns_4nOx0hD-@gourry-fedora-PF4VCD3F>
On Mon, Jun 08, 2026 at 04:53:14PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 04:30:46PM -0400, Michael S. Tsirkin wrote:
> > >
> > > Please consider that this is arguably the most fundamental interface in
> > > in all of mm/. All we're doing is going through the process of figuring
> > > out what changes here are reasonable while trying to meet your goal.
> > >
> > > ~Gregory
> >
> > I don't mind discarding all of this and doing something else completely,
> > but I dislike it that multiple people are apparently now angry that I
>
> I wouldn't say anyone is angry, I think most folks are tripping on the
> complexity of the set - which has increased (at the request of others).
>
> > don't address all the contradictory comments at the same time.
>
> Such is life in mm/ :] - it's hard to known the entire state machine,
> and sometimes the contradictions aren't even wrong.
>
> > I thought just sending a patchset to show how the result looks like
> > is easier than arguing about architecture, and would be helpful.
> >
>
> Notice: When folks argue implementation, they largely agree the
> end goal is useful. I haven't seen anyone say your problem isn't
> real or that it shouldn't be addressed - just opinions on a particular
> path forward (which is utterly normal here).
>
> Getting the right incantation of an API is really hard when the
> API being changes is something that underpins the entire kernel.
>
> > I'm not pushing any of the mm rework, I was asked to do it,
> > myself I just want the ridiculously effective optimization in there.
> >
>
> As Lorenzo, David, and Matthew have said, the focus of the patch set
> does seem to have become unweildy (in part at the request of folks
> asking something be done differently).
>
> What needs to be done now is to break it up into some pull-ahead
> sets that are easier to review. Having a brief RFC doc that lays out
> the set of patches might help clarify the confusion going on here,
> especially as new folks come in to ask "What's all this about?".
>
> As a start:
>
> 1) the user_addr and zeroing piece seems like a discrete
> improvement worthy of its own set - aside from end goal.
>
> This is needed by your patch set, but was requested to
> try to push us towards a more reasonable pattern for
> folio_zero_user().
What I worry about is people can't agree what api they want.
Simply not being an mm maintainer, I don't really have the
perspective of what changes are envisioned down the road
and so what api makes sense for you guys.
I don't mind trying all kinds of approaches, but it seems to
be past the point where people feel it's costing too much of
their time with all of these revisions.
> 2) There are a handful of patches that seem able to pull-ahead
> (some of the mempolicy stuff), either as prep work for #1 or
> just on their own.
>
> Some of these patches seem like latent bugs that aren't hit by
> current users, but do seem to be doing something subtly wrong?
Right.
> 3) the final virtio piece seems like it should be entirely separate
> once the core pieces are done.
>
> It's not uncommon for core changes like this to take multiple prepatory
> sets over many major versions before the final feature lands.
>
> ~Gregory
^ permalink raw reply
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 21:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
Andrea Arcangeli
In-Reply-To: <20260608170348-mutt-send-email-mst@kernel.org>
On 8 Jun 2026, at 17:04, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 04:40:15PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 16:33, Michael S. Tsirkin wrote:
>>
>>> On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
>>>> On 8 Jun 2026, at 15:59, Gregory Price wrote:
>>>>
>>>>> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>>>>>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>>>>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>>>>>> make life easier by actually breaking things out.
>>>>>>>
>>>>>>> Like:
>>>>>>>
>>>>>>> enum alloc_context_type {
>>>>>>> KERNEL_ALLOCATION,
>>>>>>> USER_MAPPED_ALLOCATION,
>>>>>>> USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>>>>>> /* Perhaps some other states we want to encode? */
>>>>>>> };
>>>>>>>
>>>>>>> struct alloc_context {
>>>>>>> ...
>>>>>>>
>>>>>>> enum alloc_context_type type;
>>>>>>> unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>>>>>
>>>>>>> // Maybe something suggesting context or whether we init before in some
>>>>>>> // cases?
>>>>>>> };
>>>>>>
>>>>>> Ugh, please, no. As I suggested last time I commented on this
>>>>>> trainwreck of a series, lift the zeroing functionality from
>>>>>> alloc_frozen_pages() into its callers.
>>>>>
>>>>> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
>>>>> wrapper that does the zeroing at the end before return, and then killing
>>>>> the post hook nonsense associated with it in the first place.
>>>>
>>>> This means it is going to be a multi-step optimization. This is probably
>>>> step 1.
>>>>
>>>>>
>>>>> None of this resolves the user address annoyance which is needed on some
>>>>> archs for cache flushing. Whether anyone agrees that the page allocator
>>>>> should be responsible for this particular operation - open debate.
>>>>
>>>> This is probably step 2. But does the virtio use case apply to these
>>>> archs? Does the performance matter for them? If not, maybe this part can
>>>> be left as a TODO.
>>>>
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>>
>>> I doubt it. But I don't get what's proposed, the code that we
>>> have to modify is arch independent?
>>
>> Change user_alloc_needs_zeroing() to only check address aliasing even
>> if that can cause double zeroing for virtio.
>>
>> Best Regards,
>> Yan, Zi
>
> Ah. I started with exactly that in v1/v2. It's a simple approach.
>
> But mm maintainers said no, user_alloc_needs_zeroing is a hack and
> I must not add to it.
Got it. It sounds like you are now getting conflicting ideas. Maybe you should
start a [DISCUSSION] thread that presents the high level idea of what
you want to achieve and all the ideas you got from the reviews, so that
people in this thread can have the big picture and come up with a consensus
before you send another version.
Thank you for patiently replying to my comments, since those points
apparently have been discussed in prior submissions.
Best Regards,
Yan, Zi
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-08 21:04 UTC (permalink / raw)
To: Zi Yan
Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
Andrea Arcangeli
In-Reply-To: <BD0C8760-BC10-4BF0-9EE3-9B0DAF0D977A@nvidia.com>
On Mon, Jun 08, 2026 at 04:40:15PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 16:33, Michael S. Tsirkin wrote:
>
> > On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
> >> On 8 Jun 2026, at 15:59, Gregory Price wrote:
> >>
> >>> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> >>>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> >>>>> But instead of overloading user_addr to indicate all kinds of things, instead
> >>>>> make life easier by actually breaking things out.
> >>>>>
> >>>>> Like:
> >>>>>
> >>>>> enum alloc_context_type {
> >>>>> KERNEL_ALLOCATION,
> >>>>> USER_MAPPED_ALLOCATION,
> >>>>> USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> >>>>> /* Perhaps some other states we want to encode? */
> >>>>> };
> >>>>>
> >>>>> struct alloc_context {
> >>>>> ...
> >>>>>
> >>>>> enum alloc_context_type type;
> >>>>> unsigned long user_addr; // Only set if type == USER_ALLOCATION
> >>>>>
> >>>>> // Maybe something suggesting context or whether we init before in some
> >>>>> // cases?
> >>>>> };
> >>>>
> >>>> Ugh, please, no. As I suggested last time I commented on this
> >>>> trainwreck of a series, lift the zeroing functionality from
> >>>> alloc_frozen_pages() into its callers.
> >>>
> >>> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
> >>> wrapper that does the zeroing at the end before return, and then killing
> >>> the post hook nonsense associated with it in the first place.
> >>
> >> This means it is going to be a multi-step optimization. This is probably
> >> step 1.
> >>
> >>>
> >>> None of this resolves the user address annoyance which is needed on some
> >>> archs for cache flushing. Whether anyone agrees that the page allocator
> >>> should be responsible for this particular operation - open debate.
> >>
> >> This is probably step 2. But does the virtio use case apply to these
> >> archs? Does the performance matter for them? If not, maybe this part can
> >> be left as a TODO.
> >>
> >>
> >> Best Regards,
> >> Yan, Zi
> >
> > I doubt it. But I don't get what's proposed, the code that we
> > have to modify is arch independent?
>
> Change user_alloc_needs_zeroing() to only check address aliasing even
> if that can cause double zeroing for virtio.
>
> Best Regards,
> Yan, Zi
Ah. I started with exactly that in v1/v2. It's a simple approach.
But mm maintainers said no, user_alloc_needs_zeroing is a hack and
I must not add to it.
--
MST
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-08 21:03 UTC (permalink / raw)
To: Zi Yan
Cc: Gregory Price, David Hildenbrand (Arm), Lorenzo Stoakes,
linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <D754DEDB-3A3A-4F6E-BF64-79086F0A6467@nvidia.com>
On Mon, Jun 08, 2026 at 04:37:59PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 16:25, Gregory Price wrote:
>
> > On Mon, Jun 08, 2026 at 03:52:20PM -0400, Zi Yan wrote:
> >> On 8 Jun 2026, at 15:43, Gregory Price wrote:
> >>
> >>> On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
> >>>> On 6/8/26 17:27, Zi Yan wrote:
> >>>>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> >>>>>
> >>>>> Or should we defer zeroing after a page is returned from allocator? So that
> >>>>> user_addr does not need to be passed through irrelevant allocation APIs.
> >>>>> Something like:
> >>>>>
> >>>>> alloc_page_wrapper(gfp, order, user_addr)
> >>>>> {
> >>>>> page = alloc_pages();
> >>>>> if (gfp & __GFP_ZERO)
> >>>>> clear_page(page);
> >>>>> }
> >>>>>
> >>>>
> >>>> Not really sure what's best here. I think we'd want to limit the lifting to some
> >>>> internal API, so it cannot easily be messed up by random kernel code calling
> >>>> into the wrong API and not getting pages cleared.
> >>>>
> >>>
> >>> We're a bit in circles on this. We discussed explicit interfaces a few
> >>> months back and the trade off was:
> >>>
> >>> a) add user_addr to the existing API and cause churn
> >>>
> >>> or
> >>>
> >>> b) add special interface like above
> >>> increase the buddy surface
> >>> leaves open the ability for users to get it wrong easily
> >>>
> >>> If we forget VMs for a moment and break this step out separately, the
> >>> core question is whether page_alloc.c is the right place to be calling
> >>> the folio_user_zero() or whatever it is.
> >>
> >> page_alloc.c calling folio_user_zero() is fine, but my question is
> >> whether we should do the zeroing inside post_alloc_hook(), part of
> >> allocation.
> >>
> >> What I propose is to lift __GFP_ZERO up as much as possible,
> >> so that most of allocation code does not need to care about it.
> >> We do the zeroing right before the page is returned to callers.
> >>
> >
> > essentially we end up with something like
> >
> > alloc_frozen_...(..., gfp)
> > {
> > folio = whatever(..., gfp);
> > if (gfp & __GFP_ZERO)
> > folio_zero(folio, -1); /* don't do cache flush part */
> > }
> >
> > alloc_frozen_user_...(..., gfp, user_addr)
> > {
> > folio = whatever(..., gfp);
> > if (gfp & __GFP_ZERO)
> > folio_zero(folio, user_addr); /* do cache flush part */
> > }
> >
> > The downside of this is obvious: it's easy for developers to get this
> > wrong and call the non-user interface for user-bound allocations and
> > miss the cache flush (that is only needed on some archs).
> >
> > Not saying that's a deal breaker, but it's something to chew on.
>
> I agree that misuse can cause trouble. But if we do the churn approach,
> what prevents developer from doing alloc_frozen(..., user_addr = -1)
> and using the returned page for userspace? It is possible the allocated
> page can be exported to userspace later.
>
> BTW, that cache flush thing is fragile even today,
Probably arch dependent. On arm32, I think if you miss the flush, then
PG_dcache_clean will be clear and then you get a perf hit but
it's still correct. Didn't check others.
> you probably can
> do alloc_page() + vm_insert() to get a page without doing proper flush
> and export it to userspace. There seems to be no mechanism to
> prevent that.
>
> Best Regards,
> Yan, Zi
Because maybe you want to expose data to userspace?
--
MST
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 20:56 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <D754DEDB-3A3A-4F6E-BF64-79086F0A6467@nvidia.com>
On Mon, Jun 08, 2026 at 04:37:59PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 16:25, Gregory Price wrote:
> >
> > essentially we end up with something like
> >
> > alloc_frozen_...(..., gfp)
> > {
> > folio = whatever(..., gfp);
> > if (gfp & __GFP_ZERO)
> > folio_zero(folio, -1); /* don't do cache flush part */
> > }
> >
> > alloc_frozen_user_...(..., gfp, user_addr)
> > {
> > folio = whatever(..., gfp);
> > if (gfp & __GFP_ZERO)
> > folio_zero(folio, user_addr); /* do cache flush part */
> > }
> >
> > The downside of this is obvious: it's easy for developers to get this
> > wrong and call the non-user interface for user-bound allocations and
> > miss the cache flush (that is only needed on some archs).
> >
> > Not saying that's a deal breaker, but it's something to chew on.
>
> I agree that misuse can cause trouble. But if we do the churn approach,
> what prevents developer from doing alloc_frozen(..., user_addr = -1)
> and using the returned page for userspace? It is possible the allocated
> page can be exported to userspace later.
>
> BTW, that cache flush thing is fragile even today, you probably can
> do alloc_page() + vm_insert() to get a page without doing proper flush
> and export it to userspace. There seems to be no mechanism to
> prevent that.
>
Oh of course, I said that elsewhere. It leaves us in a spot where we're
not technically worse than we were yesterday - except that the surface
of the buddy has increased (developers need to know about 2 APIs instead
of 1). That carries maintenance burden (if something in alloc_frozen()
changes, something in alloc_frozen_user() may need to change).
There's a careful dance here.
~Gregory
* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 20:53 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608161810-mutt-send-email-mst@kernel.org>
On Mon, Jun 08, 2026 at 04:30:46PM -0400, Michael S. Tsirkin wrote:
> >
> > Please consider that this is arguably the most fundamental interface
> > in all of mm/. All we're doing is going through the process of figuring
> > out what changes here are reasonable while trying to meet your goal.
> >
> > ~Gregory
>
> I don't mind discarding all of this and doing something else completely,
> but I dislike it that multiple people are apparently now angry that I
I wouldn't say anyone is angry, I think most folks are tripping on the
complexity of the set - which has increased (at the request of others).
> don't address all the contradictory comments at the same time.
Such is life in mm/ :] - it's hard to know the entire state machine,
and sometimes the contradictions aren't even wrong.
> I thought just sending a patchset to show what the result looks like
> is easier than arguing about architecture, and would be helpful.
>
Notice: When folks argue implementation, they largely agree the
end goal is useful. I haven't seen anyone say your problem isn't
real or that it shouldn't be addressed - just opinions on a particular
path forward (which is utterly normal here).
Getting the right incantation of an API is really hard when the
API being changed is something that underpins the entire kernel.
> I'm not pushing any of the mm rework, I was asked to do it,
> myself I just want the ridiculously effective optimization in there.
>
As Lorenzo, David, and Matthew have said, the focus of the patch set
does seem to have become unwieldy (in part at the request of folks
asking something be done differently).
What needs to be done now is to break it up into some pull-ahead
sets that are easier to review. Having a brief RFC doc that lays out
the set of patches might help clarify the confusion going on here,
especially as new folks come in to ask "What's all this about?".
As a start:
1) the user_addr and zeroing piece seems like a discrete
improvement worthy of its own set - aside from end goal.
This is needed by your patch set, but was requested to
try to push us towards a more reasonable pattern for
folio_zero_user().
2) There are a handful of patches that seem able to pull-ahead
(some of the mempolicy stuff), either as prep work for #1 or
just on their own.
Some of these patches seem like latent bugs that aren't hit by
current users, but do seem to be doing something subtly wrong?
3) the final virtio piece seems like it should be entirely separate
once the core pieces are done.
It's not uncommon for core changes like this to take multiple preparatory
sets over many major versions before the final feature lands.
~Gregory
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 20:40 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
Andrea Arcangeli
In-Reply-To: <20260608163257-mutt-send-email-mst@kernel.org>
On 8 Jun 2026, at 16:33, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 15:59, Gregory Price wrote:
>>
>>> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>>>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>>>> make life easier by actually breaking things out.
>>>>>
>>>>> Like:
>>>>>
>>>>> enum alloc_context_type {
>>>>> KERNEL_ALLOCATION,
>>>>> USER_MAPPED_ALLOCATION,
>>>>> USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>>>> /* Perhaps some other states we want to encode? */
>>>>> };
>>>>>
>>>>> struct alloc_context {
>>>>> ...
>>>>>
>>>>> enum alloc_context_type type;
>>>>> unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>>>
>>>>> // Maybe something suggesting context or whether we init before in some
>>>>> // cases?
>>>>> };
>>>>
>>>> Ugh, please, no. As I suggested last time I commented on this
>>>> trainwreck of a series, lift the zeroing functionality from
>>>> alloc_frozen_pages() into its callers.
>>>
>>> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
>>> wrapper that does the zeroing at the end before return, and then killing
>>> the post hook nonsense associated with it in the first place.
>>
>> This means it is going to be a multi-step optimization. This is probably
>> step 1.
>>
>>>
>>> None of this resolves the user address annoyance which is needed on some
>>> archs for cache flushing. Whether anyone agrees that the page allocator
>>> should be responsible for this particular operation - open debate.
>>
>> This is probably step 2. But does the virtio use case apply to these
>> archs? Does the performance matter for them? If not, maybe this part can
>> be left as a TODO.
>>
>>
>> Best Regards,
>> Yan, Zi
>
> I doubt it. But I don't get what's proposed, the code that we
> have to modify is arch independent?
Change user_alloc_needs_zeroing() to only check address aliasing even
if that can cause double zeroing for virtio.
Best Regards,
Yan, Zi
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 20:37 UTC (permalink / raw)
To: Gregory Price
Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aiclJGgB07L3rV-P@gourry-fedora-PF4VCD3F>
On 8 Jun 2026, at 16:25, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 03:52:20PM -0400, Zi Yan wrote:
>> On 8 Jun 2026, at 15:43, Gregory Price wrote:
>>
>>> On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
>>>> On 6/8/26 17:27, Zi Yan wrote:
>>>>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
>>>>>
>>>>> Or should we defer zeroing after a page is returned from allocator? So that
>>>>> user_addr does not need to be passed through irrelevant allocation APIs.
>>>>> Something like:
>>>>>
>>>>> alloc_page_wrapper(gfp, order, user_addr)
>>>>> {
>>>>> page = alloc_pages();
>>>>> if (gfp & __GFP_ZERO)
>>>>> clear_page(page);
>>>>> }
>>>>>
>>>>
>>>> Not really sure what's best here. I think we'd want to limit the lifting to some
>>>> internal API, so it cannot easily be messed up by random kernel code calling
>>>> into the wrong API and not getting pages cleared.
>>>>
>>>
>>> We're a bit in circles on this. We discussed explicit interfaces a few
>>> months back and the trade off was:
>>>
>>> a) add user_addr to the existing API and cause churn
>>>
>>> or
>>>
>>> b) add special interface like above
>>> increase the buddy surface
>>> leaves open the ability for users to get it wrong easily
>>>
>>> If we forget VMs for a moment and break this step out separately, the
>>> core question is whether page_alloc.c is the right place to be calling
>>> the folio_user_zero() or whatever it is.
>>
>> page_alloc.c calling folio_user_zero() is fine, but my question is
>> whether we should do the zeroing inside post_alloc_hook(), part of
>> allocation.
>>
>> What I propose is to lift __GFP_ZERO up as much as possible,
>> so that most of allocation code does not need to care about it.
>> We do the zeroing right before the page is returned to callers.
>>
>
> essentially we end up with something like
>
> alloc_frozen_...(..., gfp)
> {
> folio = whatever(..., gfp);
> if (gfp & __GFP_ZERO)
> folio_zero(folio, -1); /* don't do cache flush part */
> }
>
> alloc_frozen_user_...(..., gfp, user_addr)
> {
> folio = whatever(..., gfp);
> if (gfp & __GFP_ZERO)
> folio_zero(folio, user_addr); /* do cache flush part */
> }
>
> The downside of this is obvious: it's easy for developers to get this
> wrong and call the non-user interface for user-bound allocations and
> miss the cache flush (that is only needed on some archs).
>
> Not saying that's a deal breaker, but it's something to chew on.
I agree that misuse can cause trouble. But if we do the churn approach,
what prevents developer from doing alloc_frozen(..., user_addr = -1)
and using the returned page for userspace? It is possible the allocated
page can be exported to userspace later.
BTW, that cache flush thing is fragile even today, you probably can
do alloc_page() + vm_insert() to get a page without doing proper flush
and export it to userspace. There seems to be no mechanism to
prevent that.
Best Regards,
Yan, Zi
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Michael S. Tsirkin @ 2026-06-08 20:33 UTC (permalink / raw)
To: Zi Yan
Cc: Gregory Price, Matthew Wilcox, Lorenzo Stoakes, linux-kernel,
David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
Andrea Arcangeli
In-Reply-To: <D1CF67EF-2CF1-45F8-9265-70A528C1CB01@nvidia.com>
On Mon, Jun 08, 2026 at 04:21:08PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 15:59, Gregory Price wrote:
>
> > On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
> >> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
> >>> But instead of overloading user_addr to indicate all kinds of things, instead
> >>> make life easier by actually breaking things out.
> >>>
> >>> Like:
> >>>
> >>> enum alloc_context_type {
> >>> KERNEL_ALLOCATION,
> >>> USER_MAPPED_ALLOCATION,
> >>> USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
> >>> /* Perhaps some other states we want to encode? */
> >>> };
> >>>
> >>> struct alloc_context {
> >>> ...
> >>>
> >>> enum alloc_context_type type;
> >>> unsigned long user_addr; // Only set if type == USER_ALLOCATION
> >>>
> >>> // Maybe something suggesting context or whether we init before in some
> >>> // cases?
> >>> };
> >>
> >> Ugh, please, no. As I suggested last time I commented on this
> >> trainwreck of a series, lift the zeroing functionality from
> >> alloc_frozen_pages() into its callers.
> >
> > This sort of just implies writing the "alloc_frozen_zeroed_pages()"
> > wrapper that does the zeroing at the end before return, and then killing
> > the post hook nonsense associated with it in the first place.
>
> This means it is going to be a multi-step optimization. This is probably
> step 1.
>
> >
> > None of this resolves the user address annoyance which is needed on some
> > archs for cache flushing. Whether anyone agrees that the page allocator
> > should be responsible for this particular operation - open debate.
>
> This is probably step 2. But does the virtio use case apply to these
> archs? Does the performance matter for them? If not, maybe this part can
> be left as a TODO.
>
>
> Best Regards,
> Yan, Zi
I doubt it. But I don't get what's proposed, the code that we
have to modify is arch independent?
* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Michael S. Tsirkin @ 2026-06-08 20:30 UTC (permalink / raw)
To: Gregory Price
Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aicjNRrHYOOBt0Hx@gourry-fedora-PF4VCD3F>
On Mon, Jun 08, 2026 at 04:16:53PM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 03:45:58PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 11:53:40AM -0400, Gregory Price wrote:
> > >
> > > If `user_addr` is now implying anything other than exactly: "This needs
> > > to be zeroed / caches flushed", then this is bad.
> > >
> > > ~Gregory
> >
> > Well if you do folio_zero_user in a non sleepable context then things
> > are not going to work. So combining e.g. GFP_ATOMIC and GFP_ZERO and
> > user_addr all together is not a good idea.
> >
>
> Can you say whether (GFP_ATOMIC | GFP_ZERO) w/o user_addr has the same
> issue?
I don't think it does, because it does not call folio_zero_user, right?
> If not, then this subtle complexity is now a tripping hazard.
Yes.
> Is there some combination of arguments here that should just outright
> fail if a user attempts it?
__GFP_DIRECT_RECLAIM at least.
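One reading of the answer above is that a request for user-address-aware
zeroing should be refused unless the caller is also allowed to sleep. A
minimal userspace sketch of such a check follows. Every name and flag value
in it (MOCK_GFP_ZERO, MOCK_GFP_DIRECT_RECLAIM, MOCK_NO_USER_ADDR,
mock_user_zero_args_ok) is hypothetical and chosen for illustration; none of
it is the kernel's definition or code from the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not the kernel's gfp bits. */
#define MOCK_GFP_ZERO           0x1u
#define MOCK_GFP_DIRECT_RECLAIM 0x2u
/* Assumed sentinel meaning "no user address", as in the -1 used in the thread. */
#define MOCK_NO_USER_ADDR       ((unsigned long)-1)

/*
 * Return true if the argument combination is acceptable.
 * Assumption: zeroing with a user address may sleep, so it is only
 * allowed when the caller passed the "may sleep" flag.
 */
bool mock_user_zero_args_ok(unsigned int gfp, unsigned long user_addr)
{
	bool wants_user_zero = (gfp & MOCK_GFP_ZERO) &&
			       user_addr != MOCK_NO_USER_ADDR;

	if (!wants_user_zero)
		return true;	/* nothing sleepable is being requested */

	return (gfp & MOCK_GFP_DIRECT_RECLAIM) != 0;
}
```

With this shape, an atomic-style request (no reclaim bit) that also supplies a
user address fails up front instead of leaving the caller to infer safety.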
> >
> > You are saying it's bad? It's pretty fundamental to the idea of moving
> > zeroing into the allocator, I feel.
> >
>
> I'm saying having to infer that safety state from the cobbling of those
> things together is not a good pattern (at least as-is).
Understood. Don't have a better idea, yet.
> If the introduction of user_addr into the mix is the thing that causes
> us to have to infer safety, then there's an argument the page allocator
> shouldn't handle that operation (in this case: user_addr cache flush).
It's not just the flush, we are also trying to use that to optimize
zeroing.
>
> Please consider that this is arguably the most fundamental interface
> in all of mm/. All we're doing is going through the process of figuring
> out what changes here are reasonable while trying to meet your goal.
>
> ~Gregory
I don't mind discarding all of this and doing something else completely,
but I dislike it that multiple people are apparently now angry that I
don't address all the contradictory comments at the same time.
I thought just sending a patchset to show what the result looks like
is easier than arguing about architecture, and would be helpful.
I'm not pushing any of the mm rework, I was asked to do it,
myself I just want the ridiculously effective optimization in there.
--
MST
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Gregory Price @ 2026-06-08 20:25 UTC (permalink / raw)
To: Zi Yan
Cc: David Hildenbrand (Arm), Lorenzo Stoakes, Michael S. Tsirkin,
linux-kernel, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Muchun Song, Oscar Salvador, Andrew Morton, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Hugh Dickins,
Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <8BB8D6E5-8644-4CBC-BAF1-0AA19702E042@nvidia.com>
On Mon, Jun 08, 2026 at 03:52:20PM -0400, Zi Yan wrote:
> On 8 Jun 2026, at 15:43, Gregory Price wrote:
>
> > On Mon, Jun 08, 2026 at 08:39:10PM +0200, David Hildenbrand (Arm) wrote:
> >> On 6/8/26 17:27, Zi Yan wrote:
> >>> On 8 Jun 2026, at 7:08, David Hildenbrand (Arm) wrote:
> >>>
> >>> Or should we defer zeroing after a page is returned from allocator? So that
> >>> user_addr does not need to be passed through irrelevant allocation APIs.
> >>> Something like:
> >>>
> >>> alloc_page_wrapper(gfp, order, user_addr)
> >>> {
> >>> page = alloc_pages();
> >>> if (gfp & __GFP_ZERO)
> >>> clear_page(page);
> >>> }
> >>>
> >>
> >> Not really sure what's best here. I think we'd want to limit the lifting to some
> >> internal API, so it cannot easily be messed up by random kernel code calling
> >> into the wrong API and not getting pages cleared.
> >>
> >
> > We're a bit in circles on this. We discussed explicit interfaces a few
> > months back and the trade off was:
> >
> > a) add user_addr to the existing API and cause churn
> >
> > or
> >
> > b) add special interface like above
> > increase the buddy surface
> > leaves open the ability for users to get it wrong easily
> >
> > If we forget VMs for a moment and break this step out separately, the
> > core question is whether page_alloc.c is the right place to be calling
> > the folio_user_zero() or whatever it is.
>
> page_alloc.c calling folio_user_zero() is fine, but my question is
> whether we should do the zeroing inside post_alloc_hook(), part of
> allocation.
>
> What I propose is to lift __GFP_ZERO up as much as possible,
> so that most of allocation code does not need to care about it.
> We do the zeroing right before the page is returned to callers.
>
Essentially we end up with something like:

	alloc_frozen_...(..., gfp)
	{
		folio = whatever(..., gfp);
		if (gfp & __GFP_ZERO)
			folio_zero(folio, -1); /* don't do cache flush part */
	}

	alloc_frozen_user_...(..., gfp, user_addr)
	{
		folio = whatever(..., gfp);
		if (gfp & __GFP_ZERO)
			folio_zero(folio, user_addr); /* do cache flush part */
	}

The downside of this is obvious: it's easy for developers to get this
wrong and call the non-user interface for user-bound allocations and
miss the cache flush (that is only needed on some archs).
Not saying that's a deal breaker, but it's something to chew on.
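
To make the two-wrapper shape concrete, here is a minimal userspace sketch. It is only a model: every model_* name, MODEL_GFP_ZERO, MODEL_NO_USER_ADDR and the cache_flushed field are hypothetical stand-ins, not kernel definitions. malloc() plays the role of "whatever(..., gfp)", and the flush is reduced to a flag so the difference between the two entry points is observable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define MODEL_GFP_ZERO		0x1u
#define MODEL_PAGE_SIZE		4096u
#define MODEL_NO_USER_ADDR	((unsigned long)-1)

struct model_folio {
	unsigned char data[MODEL_PAGE_SIZE];
	int cache_flushed;	/* set only when the user-address path ran */
};

/* Plays the role of "whatever(..., gfp)": memory with stale contents. */
static struct model_folio *model_alloc_raw(void)
{
	struct model_folio *folio = malloc(sizeof(*folio));

	if (folio) {
		memset(folio->data, 0xaa, sizeof(folio->data));
		folio->cache_flushed = 0;
	}
	return folio;
}

/* Zero the folio; record a flush only when a real user address is given. */
static void model_folio_zero(struct model_folio *folio, unsigned long user_addr)
{
	assert(folio != NULL);
	memset(folio->data, 0, sizeof(folio->data));
	if (user_addr != MODEL_NO_USER_ADDR)
		folio->cache_flushed = 1;
}

/* Kernel-side wrapper: zeroing lifted to just before return, no flush. */
static struct model_folio *model_alloc_frozen(unsigned int gfp)
{
	struct model_folio *folio = model_alloc_raw();

	if (folio && (gfp & MODEL_GFP_ZERO))
		model_folio_zero(folio, MODEL_NO_USER_ADDR);
	return folio;
}

/* User-side wrapper: same allocation, but zeroing also does the flush. */
static struct model_folio *model_alloc_frozen_user(unsigned int gfp,
						   unsigned long user_addr)
{
	struct model_folio *folio = model_alloc_raw();

	if (folio && (gfp & MODEL_GFP_ZERO))
		model_folio_zero(folio, user_addr);
	return folio;
}
```

In this model, calling model_alloc_frozen() for a user-bound allocation still returns zeroed memory, but cache_flushed stays 0. That is the failure mode described above: nothing at the call site flags the wrong wrapper.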
~Gregory
^ permalink raw reply
* Re: [PATCH v10 07/37] mm: thread user_addr through page allocator for cache-friendly zeroing
From: Zi Yan @ 2026-06-08 20:21 UTC (permalink / raw)
To: Gregory Price
Cc: Matthew Wilcox, Lorenzo Stoakes, Michael S. Tsirkin, linux-kernel,
David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Hugh Dickins, Matthew Brost, Joshua Hahn,
Rakie Kim, Byungchul Park, Ying Huang, Alistair Popple,
Christoph Lameter, David Rientjes, Roman Gushchin, Harry Yoo,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, virtualization, linux-mm,
Andrea Arcangeli
In-Reply-To: <aicfERTP3H6AndOs@gourry-fedora-PF4VCD3F>
On 8 Jun 2026, at 15:59, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 02:04:28PM +0100, Matthew Wilcox wrote:
>> On Mon, Jun 08, 2026 at 12:06:35PM +0100, Lorenzo Stoakes wrote:
>>> But instead of overloading user_addr to indicate all kinds of things, instead
>>> make life easier by actually breaking things out.
>>>
>>> Like:
>>>
>>> enum alloc_context_type {
>>> KERNEL_ALLOCATION,
>>> USER_MAPPED_ALLOCATION,
>>> USER_UNMAPPED_ALLOCATION, // Maybe? Do we ever?
>>> /* Perhaps some other states we want to encode? */
>>> };
>>>
>>> struct alloc_context {
>>> ...
>>>
>>> enum alloc_context_type type;
>>> unsigned long user_addr; // Only set if type == USER_ALLOCATION
>>>
>>> // Maybe something suggesting context or whether we init before in some
>>> // cases?
>>> };
>>
>> Ugh, please, no. As I suggested last time I commented on this
>> trainwreck of a series, lift the zeroing functionality from
>> alloc_frozen_pages() into its callers.
>
> This sort of just implies writing the "alloc_frozen_zeroed_pages()"
> wrapper that does the zeroing at the end before return, and then killing
> the post hook nonsense associated with it in the first place.
This means it is going to be a multi-step optimization. This is probably
step 1.
>
> None of this resolves the user address annoyance which is needed on some
> archs for cache flushing. Whether anyone agrees that the page allocator
> should be responsible for this particular operation - open debate.
This is probably step 2. But does the virtio use case apply to these
archs? Does the performance matter for them? If not, maybe this part can
be left as a TODO.
Best Regards,
Yan, Zi
^ permalink raw reply
* Re: [PATCH v10 02/37] mm: memory-failure: serialize TestSetPageHWPoison with zone->lock
From: Michael S. Tsirkin @ 2026-06-08 20:17 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli, Miaohe Lin
In-Reply-To: <aibMs9DXuhH_5F2Z@lucifer>
On Mon, Jun 08, 2026 at 03:14:51PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 08, 2026 at 09:48:34AM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jun 08, 2026 at 10:43:21AM +0100, Lorenzo Stoakes wrote:
> > > On Mon, Jun 08, 2026 at 04:34:23AM -0400, Michael S. Tsirkin wrote:
> > > > TestSetPageHWPoison() is called without zone->lock, so its atomic
> > > > update to page->flags can race with non-atomic flag operations
> > > > that run under zone->lock in the buddy allocator.
> > > >
> > > > In particular, __free_pages_prepare() does:
> > > >
> > > > page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > >
> > > > This non-atomic read-modify-write, while correctly excluding
> > > > __PG_HWPOISON from the mask, can still lose a concurrent
> > > > TestSetPageHWPoison if the read happens before the poison bit
> > > > is set and the write happens after. Follow-up patches in this
> > > > series add similar non-atomic flag operations as well.
> > > >
> > > > Fix by acquiring zone->lock around TestSetPageHWPoison and
> > > > around ClearPageHWPoison in the retry path. This
> > > > serializes with all buddy flag manipulation. The cost is
> > > > negligible: one lock/unlock in an extremely rare path
> > > > (hardware memory errors).
> > > >
> > > > Note: SetPageHWPoison and TestClearPageHWPoison calls elsewhere
> > > > in this file operate on pages already removed from the buddy
> > > > allocator or on non-buddy pages (DAX, hugetlb), so they do not
> > > > need zone->lock protection.
> > > >
> > > > Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > >
> > > Can we have Fixes: and Cc: stable and also send this separately please?
> > >
> > > These patches seem like unrelated fixups that you've discovered along the way,
> > > and don't belong as part of the already rather large series, unless I'm missing
> > > something here.
> > >
> > > Thanks, Lorenzo
> >
> > I think you are missing that they are a dependency, not unrelated.
>
> Then say so.
>
> > For example, this issue gets worse with the patchset as there are more
> > places that manipulate flags without atomics. No?
>
> It's your job to make that case, not mine.
>
> >
> >
> > You are welcome to send this to stable, but I think stable rules
> > preclude theoretical bugfixes.
>
> It's a dependency but also theoretical?
As in, the race is extremely hard to trigger and I have no idea if it
triggers for anyone, but it's obvious from reading the code that
theoretically it exists? Yes.
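
For anyone who wants the interleaving spelled out, here is a small single-threaded replay of it in userspace C. This is a sketch, not kernel code: the MODEL_* bit values and the model_* function names are made up for illustration, and the two "CPUs" are just statements placed in the problematic order.

```c
#include <assert.h>

/* Hypothetical bit layout for illustration; not the kernel's values. */
#define MODEL_PG_HWPOISON	(1UL << 0)
#define MODEL_CHECK_AT_PREP	((1UL << 1) | (1UL << 2))

/*
 * Unserialized case: the free path's read-modify-write straddles the
 * poison-bit update, so the stale snapshot overwrites the new bit.
 */
static unsigned long model_unserialized(unsigned long flags)
{
	unsigned long snapshot;

	assert(!(flags & MODEL_PG_HWPOISON));
	snapshot = flags;				/* free path: read */
	flags |= MODEL_PG_HWPOISON;			/* poison bit set */
	flags = snapshot & ~MODEL_CHECK_AT_PREP;	/* free path: stale write */
	return flags;
}

/*
 * Serialized case: with both sides under one lock, each update runs
 * as a whole, in either order, and the poison bit survives.
 */
static unsigned long model_serialized(unsigned long flags, int poison_first)
{
	if (poison_first) {
		flags |= MODEL_PG_HWPOISON;
		flags &= ~MODEL_CHECK_AT_PREP;
	} else {
		flags &= ~MODEL_CHECK_AT_PREP;
		flags |= MODEL_PG_HWPOISON;
	}
	return flags;
}
```

The mask never includes the poison bit in either function. The bit is lost in the first one purely because the write uses a value read before the bit was set.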
> >
> > As for Fixes: the issue has been there for decades. I wouldn't know
> > what to attribute it for.
>
> Again, your job.
Alright, if you insist:
Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
now everyone running 2.6 kernels will backport this fix, I presume.
> >
> >
> > I guess I could send these separately, too, why not. Not sure
> > what this accomplishes, but hey. But is that an ack? You want
> > this fix merged even before the feature?
>
> I already made the case as to why, as have other maintainers.
>
> If you need to review what an ack looks like please consult
> https://docs.kernel.org/process/5.Posting.html
>
> Thanks, Lorenzo
I am merely asking if you want this patch in the set including
all these nits I had to fix.
--
MST
^ permalink raw reply
* Re: [PATCH v10 12/37] mm: use folio_zero_user for user pages in post_alloc_hook
From: Gregory Price @ 2026-06-08 20:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <20260608154354-mutt-send-email-mst@kernel.org>
On Mon, Jun 08, 2026 at 03:45:58PM -0400, Michael S. Tsirkin wrote:
> On Mon, Jun 08, 2026 at 11:53:40AM -0400, Gregory Price wrote:
> >
> > If `user_addr` is now implying anything other than exactly: "This needs
> > to be zeroed / caches flushed", then this is bad.
> >
> > ~Gregory
>
> Well if you do folio_zero_user in a non sleepable context then things
> are not going to work. So combining e.g. GFP_ATOMIC and GFP_ZERO and
> user_addr all together is not a good idea.
>
Can you say whether (GFP_ATOMIC | GFP_ZERO) w/o user_addr has the same
issue? If not, then this subtle complexity is now a tripping hazard.
Is there some combination of arguments here that should just outright
fail if a user attempts it?
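
As a strawman for what "outright fail" could look like, here is a userspace sketch of an up-front argument check. It takes the constraint above at face value (user-address zeroing must not run where sleeping is not allowed). All names here are hypothetical: MODEL_GFP_CAN_SLEEP stands in for whatever "may block" test the real code would use, and MODEL_NO_USER_ADDR is an invented sentinel for "no user mapping".

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define MODEL_GFP_ZERO		0x1u
#define MODEL_GFP_CAN_SLEEP	0x2u
#define MODEL_NO_USER_ADDR	((unsigned long)-1)

/*
 * Reject the one combination assumed to be unsafe: zeroing requested,
 * a user address supplied, and a context that is not allowed to sleep.
 * Every other combination is accepted unchanged.
 */
static bool model_alloc_args_valid(unsigned int gfp, unsigned long user_addr)
{
	bool wants_user_zero = (gfp & MODEL_GFP_ZERO) &&
			       user_addr != MODEL_NO_USER_ADDR;

	if (wants_user_zero && !(gfp & MODEL_GFP_CAN_SLEEP))
		return false;
	return true;
}
```

With a check like this, the atomic + zero case without a user address still passes, so the rule becomes explicit instead of something the caller has to infer.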
>
> You are saying it's bad? It's pretty fundamental to the idea of moving
> zeroing into the allocator, I feel.
>
I'm saying having to infer that safety state from the cobbling of those
things together is not a good pattern (at least as-is).
If the introduction of user_addr into the mix is the thing that causes
us to have to infer safety, then there's an argument the page allocator
shouldn't handle that operation (in this case: user_addr cache flush).
Please consider that this is arguably the most fundamental interface in
all of mm/. All we're doing is going through the process of figuring
out what changes here are reasonable while trying to meet your goal.
~Gregory
^ permalink raw reply
* Re: [PATCH v10 16/37] mm: alloc_swap_folio: pass raw fault address to vma_alloc_folio
From: Michael S. Tsirkin @ 2026-06-08 20:09 UTC (permalink / raw)
To: Gregory Price
Cc: Lorenzo Stoakes, linux-kernel, David Hildenbrand (Arm),
Jason Wang, Xuan Zhuo, Eugenio Pérez, Muchun Song,
Oscar Salvador, Andrew Morton, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibm5OFmiFK6lCqH@gourry-fedora-PF4VCD3F>
On Mon, Jun 08, 2026 at 11:59:32AM -0400, Gregory Price wrote:
> On Mon, Jun 08, 2026 at 12:37:20PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jun 08, 2026 at 04:37:41AM -0400, Michael S. Tsirkin wrote:
> > > Same change as the previous patch but for alloc_swap_folio:
> >
> > Please don't say 'same change as the previous patch' :) explain what you're
> > doing here. It's a pain to have to go check otherwise.
> >
>
> MST you need to slow down a bit.
>
> I gave you this same feedback 3 versions ago:
>
> https://lore.kernel.org/linux-mm/agXUHItfxSwtriRF@gourry-fedora-PF4VCD3F/
>
> ~Gregory
Ooof I do try but the patchset is just too big. Sorry. I need to find a
way to split it. Or maybe Matthew will tell me how to make it much
smaller, he says he sees a way that will make everyone happy. Let's
wait.
--
MST
^ permalink raw reply
* Re: [PATCH v10 00/37] mm/virtio: skip redundant zeroing of host-zeroed pages
From: Michael S. Tsirkin @ 2026-06-08 20:02 UTC (permalink / raw)
To: Matthew Wilcox
Cc: linux-kernel, David Hildenbrand (Arm), Jason Wang, Xuan Zhuo,
Eugenio Pérez, Muchun Song, Oscar Salvador, Andrew Morton,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Hugh Dickins, Matthew Brost,
Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
Alistair Popple, Christoph Lameter, David Rientjes,
Roman Gushchin, Harry Yoo, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He,
virtualization, linux-mm, Andrea Arcangeli
In-Reply-To: <aibP5W-fGcyPp9e4@casper.infradead.org>
On Mon, Jun 08, 2026 at 03:21:25PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 08, 2026 at 04:33:46AM -0400, Michael S. Tsirkin wrote:
> > Further, on architectures with aliasing caches, upstream with init_on_alloc
>
> Further to what? Did you leave out some paragraphs here?
>
> As far as I can tell, this patch series decides to trust that the
> hypervisor has zeroed pages that it allocates to the guest. But
> as far as I can tell, the trend is towards less trust in the hypervisor
> from the guest, not more.
AKA confidential computing. I'm not a visionary, no idea about trends, but
yes these are used more than in the past (not hard given it used to be
0% of the market in the past).
Page reporting already leaks some info like free page addresses, so it's
for trusted hypervisors.
Anyway:
Subject: [PATCH v10 35/37] virtio_balloon: disable reporting zeroed optimization for confidential guests
makes sure guests that do not trust hypervisors are not affected.
--
MST
^ permalink raw reply