* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-26 0:04 UTC (permalink / raw)
To: Ackerley Tng
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgFfgV0FbQLzP8hhNH5hMGaQao6OFQin4cb3TAmC7SVhfA@mail.gmail.com>
On Thu, Jun 25, 2026 at 11:20:30AM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
>
> > On Wed, Jun 24, 2026 at 05:05:44PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >>
> >> >
> >> > [...snip...]
> >> >
> >> >>
> >> >> #ifdef kvm_arch_has_private_mem
> >> >> -bool __ro_after_init gmem_in_place_conversion = false;
> >> >> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
> >> >> +module_param(gmem_in_place_conversion, bool, 0444);
> >> >
> >> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> >> > MMAP flag. In such cases, shared memory is allocated from different backends.
> >> > This means this module parameter only enables per-gmem memory attribute and does
> >> > not guarantee that gmem in-place conversion will actually occur.
> >> >
> >> > To avoid confusion, could we rename this module parameter to something more
> >> > accurate, such as gmem_memory_attribute?
> >> >
> >>
> >> I asked Sean about this after getting some fixes off list. Sean said
> >> gmem_in_place_conversion is named for a host admin to use, and something
> >> like gmem_memory_attributes is too much implementation details for the
> >> admin.
> > Thanks for this background.
> >
> > Some more context on why I'm asking:
> >
> > Currently, I'm testing TDX huge pages with the following two gmem components:
> > 1. The gmem memory attribute in this gmem in-place conversion v8.
> > 2. The gmem 2MB from buddy allocator. (for development/testing only).
> >
> > The gmem 2MB from buddy allocator allocates 2MB folios from buddy for private
> > memory, while shared memory is allocated from a different backend.
> > (To avoid fragmentation, only private mappings are split during private-to-shared
> > conversions. In this approach, the 2MB folios are always retained in the gmem
> > inode filemap cache without splitting.)
> >
> > Since shared memory is not allocated from gmem, there're no in-place conversions.
> > The reason I'm using "gmem memory attribute" is that the per-VM attribute is
> > being deprecated, as suggested by Sean [1].
> >
>
> v8 of conversions series changed that slightly, per-VM attributes is
> going to stay around (because of work on RWX attributes, coming up) and
> RWX will stay tracked at the VM level.
>
> For v8 and beyond, only tracking of private/shared in per-VM attributes
> is being deprecated.
>
> By extension the entire thing about using guest_memfd for private memory
> and a different backing memory for shared memory is being deprecated.
Thanks for the info. I was actually referring to the per-VM shared/private
attribute, which is being deprecated. Sean hoped TDX huge page would be the
first mandated user of the per-gmem shared/private attribute.
> > Besides my current usage,
>
> I think you can set up guest_memfd+2M for private memory and shared
> memory from some other source, and that's the deprecated usage pattern.
Yes, though this is the deprecated usage pattern, gmem_in_place_conversion=true
allows it.
In fact, even without huge pages, v8 allows userspace to have shared memory
allocated from other source when gmem_in_place_conversion=true.
(My default testing of this series for the 4KB setting is with this
configuration).
> > there may be other scenarios where gmem memory
> > attributes is preferred without allocating shared memory from gmem.
> > (e.g., PAGE.ADD from a temp extra shared source memory).
> >
>
> Is this TDH.MEM.PAGE.ADD, used indirectly from
> tdx_gmem_post_populate()? This use case isn't blocked. Even if
> gmem_in_place_conversion=true, you can still set src_address to
> non-guest_memfd memory and load from anywhere you like.
>
> Please let me know if that is broken! I think I accidentally used that
It's not broken. I tested it with my hacked-up QEMU.
> setup in selftests and it worked. The selftests are now defaulting to
> in-place conversion.
>
> > For such use cases, I'm concerns that the admins may find it confusing if they
> > enable gmem_in_place_conversion but still observe extra memory consumptions for
> > shared memory.
> >
>
> Hmm but I guess if someone enables gmem_in_place_conversion but still
> allocates from elsewhere, they'd have to figure it out?
If gmem_in_place_conversion=true means gmem in place conversion is allowed (but
not enforced), I agree.
I'm wondering if we could rename it to "allow_gmem_in_place_conversion":)
> > [1] https://lore.kernel.org/kvm/aWmEegVP_A613WIr@google.com/
> >
> >> Sean, would you reconsider since Yan also asked? If the admin compiled
> >> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> >> admin would also be able to use a param like gmem_memory_attributes?
> >>
> >> There's the additional benefit that the similar naming aids in
> >> understanding for both the admin and software engineers.
> >>
> >> Either way, in the next revision, I'll also add this documentation for
> >> this module_param:
> >>
> >> Setting the module parameter gmem_in_place_conversion to true will
> >> enable the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disables
> >> the KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
> >> true, the private/shared attribute will be tracked per-guest_memfd
> >> instead of per-VM.
> >>
> >> Let me know what y'all think of the wording!
> >>
> >> >>
> >> >> [...snip...]
> >> >>
^ permalink raw reply
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Ackerley Tng @ 2026-06-26 0:07 UTC (permalink / raw)
To: Yan Zhao
Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <ajyRg3BwGu5dCfOn@yzhao56-desk.sh.intel.com>
Yan Zhao <yan.y.zhao@intel.com> writes:
> On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>> > On Tue, Jun 23, 2026, Yan Zhao wrote:
>> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
>> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
>> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
>> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
>> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
>> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
>> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
>> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>> >> > > > > return -EIO;
>> >> > > > >
>> >> > > > > - if (!src_page)
>> >> > > > > - return -EOPNOTSUPP;
>> >> > > > > + if (!src_page) {
>> >> > > > > + if (!gmem_in_place_conversion)
>> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
>> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
>> >> > > > error.
>> >> > >
>> >> > > Why MMAP?
>> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
>> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
>> >> >
>> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just
>> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
>> >> > > and written memory. And when write() lands, MMAP wouldn't be necessary to
>> >> > > initialize the memory.
>> >> > Do you mean using up-to-date flag as below?
>> >
>> > Yes? I didn't actually look at the implementation details.
>> >
>> >> > if (!src_page) {
>> >> > src_page = pfn_to_page(pfn);
>> >> > if (!folio_test_uptodate(page_folio(src_page)))
>> >> > return -EOPNOTSUPP;
>> >> > }
>>
>> Yan is right that with the earlier patch "Zero page while getting pfn",
>> folio_test_uptodate() here will always return true.
>>
>> Actually, this is an alternative fix for the issue Sashiko pointed out
>> on v7 where userspace can do a populate() (either TDX or SNP) without
>> first allocating the page, with src_address == NULL, and leak
>> uninitialized memory into the guest.
>>
>> Advantage of using the uptodate check in populate: if the host never
>> allocates the page, populate doesn't incur zeroing before writing the
>> page anyway in populate().
>>
>> Disadvantage: Both TDX and SNP will have to implement this uptodate
>> check. guest_memfd can't check centrally because for SNP, for a
>> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
>> firmware will zero and there's no leakage of uninitialized host memory?
> Another disadvantage: the uptodate flag is per-folio. What if the folio
> is only partially initialized by the userspace especially after huge page is
> supported?
>
Good point on huge pages!
The uptodate flag on the folio in guest_memfd means "this folio has been
written to". As of now (before patch at [1]), this happens when
+ folio is zeroed on first use by userspace
+ folio is zeroed on first use of the guest
+ folio is populated
When huge pages are supported, the folio can't partially be initialized?
On allocation, if any part is shared, we split the page. The parts are
separate folios that have their own uptodate flags.
On splitting, if the huge page is uptodate, the split pages will also be
uptodate. If the huge page is not uptodate, the split pages won't be
uptodate, but that's ok since they will be marked uptodate on first use.
On merging, the non-uptodate parts have to be zeroed and then marked
uptodate. Any parts that are in use would have been marked uptodate
already, so there's no overwriting data that is in use. I'll need to
think more about when it's safe to zero.
I'm still on the fence between the two options
1. Using uptodate check in populate to reject src_pages that have never
been written to or
2. Always zero before populate
but whether the uptodate flag is per-folio or not doesn't affect these
two options in terms of fixing the leak of uninitialized host memory,
right?
>
>> >> Another concern with this fix is that:
>> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
>> >> folio uptodate before reaching post_populate().
>> >>
>> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
>> >>
>> >> > One concern is that TDX now does not much care about the up-to-date flag since
>> >> > TDX doesn't rely on the flag to clear pages on conversions.
>> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
>> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
>> >> > user access.
>> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
>> >> > to have more effort than introducing a dedicated flag for TDX.
>> >> >
>> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
>> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
>> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
>> >> > > > 1. create guest_memfd with MMAP flag
>> >> > > > 2. mmap the guest_memfd.
>> >> > > > 3. convert the initial memory range to shared.
>> >> > > > 4. copy initial content to the source page.
>> >> > > > 5. convert the initial memory range to private
>> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
>> >> > > > 7. do not unmap the source backend.
>> >> > > >
>> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
>> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
>> >> > >
>> >> > > Why? It's userspace's responsibility to get the above right. If userspace fails
>> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
>>
>> Yan, is your concern that userspace forgot to update the code and
>> forgets to provide a src_page, and if we keep the "Zero page while
> Yes. Previously, it would be rejected after GUP fails.
>
I see, didn't realize previously it would be rejected because GUP
fails. GUP failed because it wasn't faulted into the host?
That's kind of orthogonal, I don't think GUP fail leading to rejecting
populate was meant to help userspace catch these issues. GUP would also
fail if the user did mmap(), write to it, unmap using
madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
>> getting pfn" patch, ends up with the guest silently having a zero page?
>> I think that would be found quite early in userspace VMM testing...
> I actually encountered this during testing this patch.
> I update most code path to follow this sequence. However, still some corner ones
> for TDVF HOB, which are less obvious and harder to update.
> The TD just booted up and hang silently.
>
I think this is just the life of a close-to-hardware software engineer
:P no errors, got stuck somewhere, root cause is some unitialized
thing.
>> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
>> >> > kernel to detect this mistake, similar to how it validates whether source_addr
>> >> > is PAGE_ALIGNED.
>> >
>> > The alignment case is different. If userspace provides an unaligned value, KVM
>> > *can't* do what userspace is asking because hardware and thus KVM only supports
>> > converting on page boundaries.
>> >
>> > For a NULL source, KVM can still do what userspace is asking. Rejecting userspace's
>> > request would then be making assumptions about what userspace wants.
>> >
>>
>> Also, +1 on this, what if userspace, knowing that pages are zeroed on
>> allocation, actually wants to rely on that to get a zero page in the guest?
> What if 0 uaddr is a valid address? :)
>
>> >> > Since userspace already needs to perform additional steps to enable in-place
>> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
>> >> > intentional seems like a reasonable burden.
>> >
>> > I don't see how it adds any value. I wouldn't be at all surprised if most VMMs
>> > just wen up with code that does:
>> >
>> > if (in-place) {
>> > src = NULL;
>> > flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
>> > }
>>
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default\
From: Yan Zhao @ 2026-06-26 0:29 UTC (permalink / raw)
To: Sean Christopherson
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <aj087H1UWSFxbShR@google.com>
On Thu, Jun 25, 2026 at 07:36:28AM -0700, Sean Christopherson wrote:
> On Thu, Jun 25, 2026, Yan Zhao wrote:
> > On Thu, Jun 25, 2026 at 09:51:01AM +0800, Yan Zhao wrote:
> > > On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> > > > On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > > > > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > > > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > > > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > > > > This means this module parameter only enables per-gmem memory attribute and does
> > > > > > not guarantee that gmem in-place conversion will actually occur.
> > > >
> > > > KVM module params are pretty much always about what KVM supports, not what is
> > > > guaranteed to happen.
> > > >
> > > > - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
> > > > because maybe the guest never accesses emulated MMIO.
> > > > - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
> > > > not to advertise one.
> > > > - and so on and so forth...
> > > >
> > > > Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> > > > to "I need to set memory attributes on the guest_memfd instance, not the VM",
> > > > but I don't see that as a big hurdle, certainly not in the long term. And once
> > > > the VMM code is written, I really do think most people are going to care about
> > > > whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
> > > Sorry, I just saw this mail after posting my reply in [1].
> > >
> > > I'm ok with gmem_in_place_conversion=true just means KVM supports in-place
> > > conversion, while we can still create VMs with shared memory not from gmem.
> > Or what about "allow_gmem_in_place_conversion" ?
>
> No, because turning on the param also disallows setting PRIVATE in the VM-scoped
> KVM_SET_MEMORY_ATTRIBUTES ioctl.
>
> > > Though it still feels a bit odd to require TDX huge pages to depend on
> > > gmem_in_place_conversion=true when shared memory is not currently allocated
> > > from gmem,
>
> I fully expect that to be a transient state, and in all likelihood not something
> that is *ever* shipped in production. Landing TDX hugepages without guest_memfd
> hugepage support is all about avoiding unnecessary serialization of series and
> features that aren't strictly dependent on each other.
>
> > > it should become more natural over time once gmem supports in-place
> > > conversions for huge page.
>
> Yes, and I want to prioritize the steady state for end users, not the in-progress
> state for developers. Once all of this settles out, I fully expect the majority
> of deployments to only support in-place conversion, at which point the end user
> is only going to care whether or not in-place conversion is enabled in KVM, not
> the subtle detail that it's still possible to do out-of-place conversions (and
> that will always hold true, it's not like VMA-based memslots are being deprecated).
>
> > > Besides my current usage, there may be other scenarios where gmem memory
> > > attributes is preferred without allocating shared memory from gmem.
> > > (e.g., PAGE.ADD from a temp extra shared source memory).
> > >
> > > For such use cases, I'm concerns that the admins may find it confusing if they
> > > enable gmem_in_place_conversion but still observe extra memory consumptions for
> > > shared memory.
>
> KVM can help with documentation, but beyond that, it's not KVM's problem to solve.
> If a VMM *and* platform owner chooses to deploy a setup that utilizes out-of-place
> conversions, then it's on the VMM and/or plaform owner to understand and communicate
> the implications to the end user.
Thanks for all the explanations!
Documentation that choosing a different source after enabling
gmem_in_place_conversion is deprecated looks good to me.
> And I'm not remotely convinced that prepending allow_ to the param will help
> end users diagnose "unexpected" memory consumption, in quotes because anyone that
> is deploying a stack that utilizes out-of-place conversion absolutely needs to
> understand and plan for the additional memory consumption. I.e. if the memory
> consumption is "unexpected" to the end user, they likely have far bigger problems.
My first impression of gmem_in_place_conversion=true was that it enforces gmem
in-place conversion. However, it actually only enforces per-gmem private/shared
attribute.
My worry was that people might think it's a kernel bug if userspace can still
have shared memory from other sources after they configured
gmem_in_place_conversion=true.
However, I have no strong opinion if you think gmem_in_place_conversion is good,
and with the above documentation. :)
^ permalink raw reply
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-26 1:17 UTC (permalink / raw)
To: Ackerley Tng
Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <CAEvNRgH5KOHoemnC9QOn_oK97=KeAH1XuX3ps36-pJ0Fn0aBHQ@mail.gmail.com>
On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
>
> > On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
> >> Sean Christopherson <seanjc@google.com> writes:
> >>
> >> > On Tue, Jun 23, 2026, Yan Zhao wrote:
> >> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
> >> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> >> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
> >> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> >> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
> >> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
> >> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
> >> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> >> >> > > > > return -EIO;
> >> >> > > > >
> >> >> > > > > - if (!src_page)
> >> >> > > > > - return -EOPNOTSUPP;
> >> >> > > > > + if (!src_page) {
> >> >> > > > > + if (!gmem_in_place_conversion)
> >> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> >> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
> >> >> > > > error.
> >> >> > >
> >> >> > > Why MMAP?
> >> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
> >> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
> >> >> >
> >> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just
> >> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> >> >> > > and written memory. And when write() lands, MMAP wouldn't be necessary to
> >> >> > > initialize the memory.
> >> >> > Do you mean using up-to-date flag as below?
> >> >
> >> > Yes? I didn't actually look at the implementation details.
> >> >
> >> >> > if (!src_page) {
> >> >> > src_page = pfn_to_page(pfn);
> >> >> > if (!folio_test_uptodate(page_folio(src_page)))
> >> >> > return -EOPNOTSUPP;
> >> >> > }
> >>
> >> Yan is right that with the earlier patch "Zero page while getting pfn",
> >> folio_test_uptodate() here will always return true.
> >>
> >> Actually, this is an alternative fix for the issue Sashiko pointed out
> >> on v7 where userspace can do a populate() (either TDX or SNP) without
> >> first allocating the page, with src_address == NULL, and leak
> >> uninitialized memory into the guest.
> >>
> >> Advantage of using the uptodate check in populate: if the host never
> >> allocates the page, populate doesn't incur zeroing before writing the
> >> page anyway in populate().
> >>
> >> Disadvantage: Both TDX and SNP will have to implement this uptodate
> >> check. guest_memfd can't check centrally because for SNP, for a
> >> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
> >> firmware will zero and there's no leakage of uninitialized host memory?
> > Another disadvantage: the uptodate flag is per-folio. What if the folio
> > is only partially initialized by the userspace especially after huge page is
> > supported?
> >
>
> Good point on huge pages!
>
> The uptodate flag on the folio in guest_memfd means "this folio has been
> written to". As of now (before patch at [1]), this happens when
>
> + folio is zeroed on first use by userspace
> + folio is zeroed on first use of the guest
> + folio is populated
>
> When huge pages are supported, the folio can't partially be initialized?
>
> On allocation, if any part is shared, we split the page. The parts are
> separate folios that have their own uptodate flags.
>
> On splitting, if the huge page is uptodate, the split pages will also be
> uptodate. If the huge page is not uptodate, the split pages won't be
> uptodate, but that's ok since they will be marked uptodate on first use.
>
> On merging, the non-uptodate parts have to be zeroed and then marked
If that's true, it would be good.
> uptodate. Any parts that are in use would have been marked uptodate
> already, so there's no overwriting data that is in use. I'll need to
> think more about when it's safe to zero.
>
> I'm still on the fence between the two options
>
> 1. Using uptodate check in populate to reject src_pages that have never
> been written to or
> 2. Always zero before populate
2 does not work?
The flow is
1. mmap gmem_fd, make GFN shared, and write initial content.
2. convert GFN to private
3. invoke ioctl to trigger populate.
> but whether the uptodate flag is per-folio or not doesn't affect these
> two options in terms of fixing the leak of uninitialized host memory,
> right?
yes, provided "On merging, the non-uptodate parts have to be zeroed and then
marked uptodate".
> >
> >> >> Another concern with this fix is that:
> >> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
> >> >> folio uptodate before reaching post_populate().
> >> >>
> >> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
> >> >>
> >> >> > One concern is that TDX now does not much care about the up-to-date flag since
> >> >> > TDX doesn't rely on the flag to clear pages on conversions.
> >> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
> >> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
> >> >> > user access.
> >> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
> >> >> > to have more effort than introducing a dedicated flag for TDX.
> >> >> >
> >> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
> >> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
> >> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> >> >> > > > 1. create guest_memfd with MMAP flag
> >> >> > > > 2. mmap the guest_memfd.
> >> >> > > > 3. convert the initial memory range to shared.
> >> >> > > > 4. copy initial content to the source page.
> >> >> > > > 5. convert the initial memory range to private
> >> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> >> >> > > > 7. do not unmap the source backend.
> >> >> > > >
> >> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
> >> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
> >> >> > >
> >> >> > > Why? It's userspace's responsibility to get the above right. If userspace fails
> >> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
> >>
> >> Yan, is your concern that userspace forgot to update the code and
> >> forgets to provide a src_page, and if we keep the "Zero page while
> > Yes. Previously, it would be rejected after GUP fails.
> >
>
> I see, didn't realize previously it would be rejected because GUP
> fails. GUP failed because it wasn't faulted into the host?
GUP fails if 0 is not a valid user address.
But GUP would not fail if 0 is a valid address. e.g., in below scenario:
#include <sys/mman.h>
#include <stdio.h>
int main(void)
{
void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
if (p==MAP_FAILED) {
perror("mmap");
return 1;
}
*(char*)0='Y';
printf("addr0=%p val=%c\n",p,*(char*)0);
return 0;
}
> That's kind of orthogonal, I don't think GUP fail leading to rejecting
> populate was meant to help userspace catch these issues. GUP would also
> fail if the user did mmap(), write to it, unmap using
> madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 was
rejected depended on whether the user mmap()'d address 0. If 0 was a valid
mapping, populate() could proceed.
commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating
guest memory") changed the behavior though. It would return -EOPNOTSUPP for a 0
uaddr.
But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
source_addr(not from gmem), I'm not sure if it's good for the kernel to silently
treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
> >> getting pfn" patch, ends up with the guest silently having a zero page?
> >> I think that would be found quite early in userspace VMM testing...
> > I actually encountered this during testing this patch.
> > I update most code path to follow this sequence. However, still some corner ones
> > for TDVF HOB, which are less obvious and harder to update.
> > The TD just booted up and hang silently.
> >
>
> I think this is just the life of a close-to-hardware software engineer
> :P no errors, got stuck somewhere, root cause is some unitialized
> thing.
>
> >> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
> >> >> > kernel to detect this mistake, similar to how it validates whether source_addr
> >> >> > is PAGE_ALIGNED.
> >> >
> >> > The alignment case is different. If userspace provides an unaligned value, KVM
> >> > *can't* do what userspace is asking because hardware and thus KVM only supports
> >> > converting on page boundaries.
> >> >
> >> > For a NULL source, KVM can still do what userspace is asking. Rejecting userspace's
> >> > request would then be making assumptions about what userspace wants.
> >> >
> >>
> >> Also, +1 on this, what if userspace, knowing that pages are zeroed on
> >> allocation, actually wants to rely on that to get a zero page in the guest?
> > What if 0 uaddr is a valid address? :)
> >
> >> >> > Since userspace already needs to perform additional steps to enable in-place
> >> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
> >> >> > intentional seems like a reasonable burden.
> >> >
> >> > I don't see how it adds any value. I wouldn't be at all surprised if most VMMs
> >> > just wen up with code that does:
> >> >
> >> > if (in-place) {
> >> > src = NULL;
> >> > flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
> >> > }
> >>
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Kalra, Ashish @ 2026-06-26 2:38 UTC (permalink / raw)
To: K Prateek Nayak, Borislav Petkov
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, Tycho.Andersen, Nathan.Fontenot, ackerleytng,
jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
linux-crypto, kvm, linux-coco
In-Reply-To: <898e378a-cf7c-4310-b439-e28ec0a71338@amd.com>
On 6/25/2026 5:16 PM, K Prateek Nayak wrote:
> Hello Ashish,
>
> On 6/26/2026 1:12 AM, Kalra, Ashish wrote:
>> Hello Boris,
>>
>> On 6/25/2026 10:02 AM, Borislav Petkov wrote:
>>> On Wed, Jun 24, 2026 at 09:56:49PM +0000, Ashish Kalra wrote:
>>>> +/* Set while SNP has CPU hotplug disabled (kernel-lifetime; survives ccp reload). */
>>>> +static bool snp_cpu_hotplug_disabled;
>>>
>>> Do you really need this?
>>>
>>
>> Yes.
>>
>> cpu_hotplug_disable()/cpu_hotplug_enable() are refcounted (cpu_hotplug_disabled++/--,
>> with a WARN on underflow), so they have to be balanced. This flag collapses them to
>> exactly one outstanding disable per SNP-active window, because the disable and enable
>> sites are not reached a symmetric number of times:
>>> - On firmware without SNP_X86_SHUTDOWN_SUPPORTED, __sev_snp_shutdown_locked() does not
>> call snp_shutdown() (it's gated on data.x86_snp_shutdown), so SNP stays enabled in
>> hardware — SNP_EN stays set and hotplug stays disabled — while sev->snp_initialized is
>> cleared. Re-init after that is routine, the SNP ioctls self-bracket init and shutdown
>> (e.g. SNP_COMMIT, SNP_SET_CONFIG, SNP_VLEK_LOAD):
>>
>> if (!sev->snp_initialized)
>> snp_move_to_init_state(...); /* -> __sev_snp_init_locked -> snp_prepare() */
>> ... SNP_CMD ...
>> if (shutdown_required)
>> __sev_snp_shutdown_locked(...);
>> - So whenever SNP isn't already initialized (psp_init_on_probe off, or after a prior
>> legacy shutdown), every such ioctl does init -> command -> legacy shutdown. Each init
>> reaches snp_prepare() with SNP_EN already set, and the disable now sits at the top of
>> snp_prepare(), so it fires on every cycle. Without this flag that keeps bumping
>> cpu_hotplug_disabled while the legacy shutdown never re-enables — hotplug ends up stuck
>> disabled. This flag makes all but the first disable a no-op.
>>
>> - Also, importantly, kvm-amd module reload on legacy firmware is the same pattern:
>> unload leaves SNP_EN set, reload re-inits.)
>
> Looking at snp_prepare(), we have an early-bailout for
>
> rdmsrq(MSR_AMD64_SYSCFG, val);
> if (val & MSR_AMD64_SYSCFG_SNP_EN)
> return;
>
> Does executing SHUTDOWN command lead to the firmware clearing SNP_EN in
> SYSCFG on all CPUS?
Yes, in case of X86_SNP_SHUTDOWN (available if firmware supports X86SnpShutdown feature)
SNP is disabled on all cores by clearing SYSCFG[SNPEn] bit.
If X86_SNP_SHUTDOWN is set to 1, the firmware clears the SYSCFG[SNPEn] bit in each core.
But, in case of legacy SNP shutdown, SNP_EN bit is not cleared and so SNP remains enabled.
>
> If SNP_EN remains set (and Linux can't clear it since it is
> "Write-1-only" bit), then a subsequent snp_prepare() will skip setting
> SYSCFG if it sees SNP_EN on local CPU.
>
> It can so happen that we enable hotlpug at shutdown, CPUs come online
> without setting SNP_EN in SYSCFG, subsequent snp_prepare() runs on a CPU
> where SNP_EN is still set and skips configuring it for the CPUs that
> don't have it set, and we'll be in a pickle still.
>
> The comment above that bailout saying "this can happen in case of kexec
> boot" makes me believe that SNP_EN remains set until a full system
> reset.
>
> The only safe way to do this is to ensure all possible CPUs are online
> during snp_prepare() and do snp_enable() regardless of whether local CPU
> has SNP_EN or not.
>
> Am I missing something?
>
The piece that makes the early bailout safe is the disable this patch adds:
hotplug is disabled while SNP is active, so the online set can't change under an
active SNP. snp_prepare() already requires online == present, so at a successful
init every present CPU gets SNP_EN, and because hotplug is then disabled none
can leave or rejoin without it. So whenever the bailout is hit with SNP active,
every online CPU already has SNP_EN:
- kexec: SNP_EN is already set on all CPUs by the previous kernel.
- re-init while SNP is still active (e.g. after a legacy SNP_SHUTDOWN that
leaves SNP_EN set): hotplug was disabled the whole time, so the online set is
unchanged and all of them still have SNP_EN.
The only way a CPU can be online without SNP_EN is when SNP is not active --
i.e. after an SNP_INIT failure, where this patch re-enables hotplug. That is
deliberately the same as the behavior before this support existed (hotplug was
never disabled then), and it is benign: SNP_EN only gates RMP checks, the RMP
itself is initialized by SNP_INIT, so on a failed init the RMP is all-zeroes --
every entry is in the default HV-owned state, no page is assigned, no check ever blocks
and snp_initialized stays false, so no SNP guest can be created.
Nothing is enforced and nothing is protected.
So I've kept snp_prepare()'s existing bailout / snp_enable() behavior unchanged;
what this patch adds is disabling hotplug while SNP is active, which is what
actually closes the window (a CPU coming online without SNP_EN while SNP is
live). That window -- and the SNP_EN-stays-set-on-failure situation -- already
exist in today's code, this patch constrains the dangerous (active) case and
otherwise matches current behavior.
(On the v9 placement specifically: I'm moving the disable into snp_prepare()
ahead of SNP_EN in the next version; in v9 it sits after SNP_INIT, which leaves
the window you originally pointed out.)
Thanks,
Ashish
>>
>> - On the enable side it avoids an unbalanced cpu_hotplug_enable() when the teardown/failure
>> paths run without an outstanding disable (e.g. shutdown of a never-fully-initialized SNP).
>>
>> So it's not redundant with cpu_hotplug_disabled — it tracks whether the outstanding disable
>> belongs to this SNP-active window in this kernel, which keeps the single disable/enable
>> balanced across the asymmetric legacy-vs-full SNP teardown paths and re-init.
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: K Prateek Nayak @ 2026-06-26 4:01 UTC (permalink / raw)
To: Kalra, Ashish, Borislav Petkov
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, Tycho.Andersen, Nathan.Fontenot, ackerleytng,
jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
linux-crypto, kvm, linux-coco
In-Reply-To: <b9777de5-a6fa-418c-92d2-89c095e91837@amd.com>
Hello Ashish,
On 6/26/2026 8:08 AM, Kalra, Ashish wrote:
>> Looking at snp_prepare(), we have an early-bailout for
>>
>> rdmsrq(MSR_AMD64_SYSCFG, val);
>> if (val & MSR_AMD64_SYSCFG_SNP_EN)
>> return;
>>
>> Does executing SHUTDOWN command lead to the firmware clearing SNP_EN in
>> SYSCFG on all CPUS?
>
> Yes, in case of X86_SNP_SHUTDOWN (available if firmware supports X86SnpShutdown feature)
> SNP is disabled on all cores by clearing SYSCFG[SNPEn] bit.
>
> If X86_SNP_SHUTDOWN is set to 1, the firmware clears the SYSCFG[SNPEn] bit in each core.
>
> But, in case of legacy SNP shutdown, SNP_EN bit is not cleared and so SNP remains enabled.
Ah! That was the bit I was missing. Thanks a ton for clarifying.
>
>>
>> If SNP_EN remains set (and Linux can't clear it since it is
>> "Write-1-only" bit), then a subsequent snp_prepare() will skip setting
>> SYSCFG if it sees SNP_EN on local CPU.
>>
>> It can so happen that we enable hotlpug at shutdown, CPUs come online
>> without setting SNP_EN in SYSCFG, subsequent snp_prepare() runs on a CPU
>> where SNP_EN is still set and skips configuring it for the CPUs that
>> don't have it set, and we'll be in a pickle still.
>>
>> The comment above that bailout saying "this can happen in case of kexec
>> boot" makes me believe that SNP_EN remains set until a full system
>> reset.
>>
>> The only safe way to do this is to ensure all possible CPUs are online
>> during snp_prepare() and do snp_enable() regardless of whether local CPU
>> has SNP_EN or not.
>>
>> Am I missing something?
>>
>
> The piece that makes the early bailout safe is the disable this patch adds:
> hotplug is disabled while SNP is active, so the online set can't change under an
> active SNP. snp_prepare() already requires online == present, so at a successful
> init every present CPU gets SNP_EN,
How is this enforced? AFAICT, on_each_cpu(snp_enable) will only covers
the online CPUs and there could be CPUs that have been offlined before
that right?
> and because hotplug is then disabled none
> can leave or rejoin without it. So whenever the bailout is hit with SNP active,
> every online CPU already has SNP_EN:
>
> - kexec: SNP_EN is already set on all CPUs by the previous kernel.
There is a catch here: you can have offline CPUs during the previous boot
(say you have maxcpus=8 in your cmdline), and then you kexec with a different
kernel / cmdline that brings online a bunch more CPUs.
SNP_EN will only be set for a subset of then with the legacy SNP_INIT and
if snp_prepare() runs on those legacy CPUs, you still skip setting it for
the ones that don't have SNP_EN set.
Is that case covered somehow or is it a non-issue?
> - re-init while SNP is still active (e.g. after a legacy SNP_SHUTDOWN that
> leaves SNP_EN set): hotplug was disabled the whole time, so the online set is
> unchanged and all of them still have SNP_EN.
>
> The only way a CPU can be online without SNP_EN is when SNP is not active --
> i.e. after an SNP_INIT failure, where this patch re-enables hotplug. That is
> deliberately the same as the behavior before this support existed (hotplug was
> never disabled then), and it is benign: SNP_EN only gates RMP checks, the RMP
> itself is initialized by SNP_INIT, so on a failed init the RMP is all-zeroes --
> every entry is in the default HV-owned state, no page is assigned, no check ever blocks
> and snp_initialized stays false, so no SNP guest can be created.
> Nothing is enforced and nothing is protected.
>
> So I've kept snp_prepare()'s existing bailout / snp_enable() behavior unchanged;
> what this patch adds is disabling hotplug while SNP is active, which is what
> actually closes the window (a CPU coming online without SNP_EN while SNP is
> live). That window -- and the SNP_EN-stays-set-on-failure situation -- already
> exist in today's code, this patch constrains the dangerous (active) case and
> otherwise matches current behavior.
Ack! Just that one small bit up above bothers me but other than that,
doing it in snp_prepare() should be good.
This is all new to me so thanks a ton for answering my queries.
>
> (On the v9 placement specifically: I'm moving the disable into snp_prepare()
> ahead of SNP_EN in the next version; in v9 it sits after SNP_INIT, which leaves
> the window you originally pointed out.)
--
Thanks and Regards,
Prateek
^ permalink raw reply
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-26 7:43 UTC (permalink / raw)
To: Suzuki K Poulose, Lorenzo Pieralisi
Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <98d2a0f3-b831-466a-8212-5bcf97ad9d8b@arm.com>
On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
> On 25/06/2026 14:53, Gavin Shan wrote:
>> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
[...]
>>>>
>>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
>>>> that memory slots backed by THP are broken. With THP disabled on the host and
>>>> other fixes (mentioned in my prevous replies) applied on the top of this (v14)
>>>> series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
>>>> fxies on the top.
>>>>
>>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git (branch: cca/ latest)
>>>> [2] https://git.qemu.org/git/qemu.git (branch: cca/gavin)
>>>>
>>>> Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
>>>
>>> Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
>>> issues you pointed out above.
>>>
>>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>>
>>> We should be able to submit patches by end of June - we shall let you know
>>> whether we can make something available earlier.
>>>
>>
>> Not sure if there are other known issues in this series. It seems the stage2
>> page fault handling on the shared space isn't working well. In my test, the
>> vring (struct vring_desc) of virtio-net-pci is updated by the guest, and the
>> data isn't seen by QEMU, I'm suspecting if the host-page-frame-number is properly
>> resolved in the s2 page fault handler for shared (unprotected) space.
>>
>> - I rebased Jean's latest qemu branch to the upstream qemu;
>>
>> - On the host, which is emulated by qemu/tcg, the THP (transparent huge page) is
>> disabled.
>>
>> - On the guest, I can see the virtio vring (struct vring_desc) is updated. The
>> S1 page-table entry looks correct because the corresponding physical address
>> 0x10046880000 is a sane shared (unprotected) space address.
>>
>> [ 52.094143] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
>> [ 52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000, [00000100b983f000 00000640 0002 0001]
>> [ 52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>>
>> - On the host, the s2 page-table-entry is unmapped due to attribute transition (private -> shared).
>> A subsequent S2 page fault is raised against the adress and the s2 page-table-entry is built.
>>
>> [ 109.259077] ====> realm_unmap_shared_range: tracked_unprot_addr=0x10046880000
>> [ 109.260249] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>> [ 109.317786] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>> [ 109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000, esr=0x92000007
>> [ 109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59, size=0x1000, prot=0xf
>> [ 109.630331] realm_map_non_secure: ipa=0x10046880000, ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
>
> Are you able to correlate the order of the transitions and the Guest
> access with RMM log ? We haven't seen this from our end. We are aware
> of permission fault issues with Unprotected IPA when backing the memslot
> with MAP_PRIVATE areas. But this looks different.
>
> Lorenzo, have you run into this ?
>
It's hard to correlate the order since the logs are collected from two separate
consoles. For the write permission, I add code to the host where the permission
is always added for all s2 page faults in the shared space. Otherwise, qemu can
be killed by -EFAULT or similar error.
There are more findings after more experiments: this virtio-net-pci device has 3
queues or vrings (Rx/Tx/Ctrl). The Rx/Tx/Ctrl queue are populated in order one after
one. In the guest kernel, I intentionally write fixed data (0x0123456789abcdef) to
the first 8 bytes of the queue when it gets populated, and stop the guest at random
points to see if the data is gone. I found that the data written to Rx/Tx queue are
lost after Ctrl queue is allocated.
The data written to Rx/Tx queue is lost if the guest stops (B). The data written to
Rx/Tx queue isn't lost if the guest stops at (A). I can see the pattern (0x0123...cdef)
by dumping the physcial memory through 'pmemsave' command in qemu.
DMA allocation
==============
dma_alloc_coherent
dma_alloc_attrs
dma_direct_alloc
__dma_direct_alloc_pages
dma_set_decrypted // (A) No data lost if being stopped here for the Ctrl queue
memset(ret, 0, size) // (B) Data lost after being stopped after memset() for the Ctrl queue
The memset() on the Ctrl queue should trigger a stage2 page fault. It seems the page
fault enforces the shared pages for Rx/Tx queue to be dropped? I need to add more
debugging code and track it down.
> Suzuki
>
>
>>
>> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 isn't seen. All the
>> data in that adress are zeros.
>>
>> ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38, queue_index=0x0, vq->vring.num=0x100
>> virtqueue_split_pop: last_avail_idx=0x0, head=0x0
>> address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0, buf=0xffffeee34880, len=0x10
>> address_space_read_cached_slow: cache: ptr=0x0, xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
>> address_space_read_cached_slow: translated to mr=<mach-virt.ram>, mr_addr=0x6880000, l=0x10
>> flatview_read_continue_step: mr=<mach-virt.ram>, host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
>> virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 - 00000000
>> qemu-system-aarch64: virtio: zero sized buffers are not allowed
>>
>>
Thanks,
Gavin
^ permalink raw reply
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Suzuki K Poulose @ 2026-06-26 8:47 UTC (permalink / raw)
To: Gavin Shan, Lorenzo Pieralisi
Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <8da87878-2a5d-478a-a280-60dbed7ad1b9@redhat.com>
On 26/06/2026 08:43, Gavin Shan wrote:
> On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
>> On 25/06/2026 14:53, Gavin Shan wrote:
>>> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>>>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>
> [...]
>
>>>>>
>>>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU,
>>>>> and found
>>>>> that memory slots backed by THP are broken. With THP disabled on
>>>>> the host and
>>>>> other fixes (mentioned in my prevous replies) applied on the top of
>>>>> this (v14)
>>>>> series, I'm able to boot a realm guest with rebased QEMU series
>>>>> [2], plus more
>>>>> fxies on the top.
>>>>>
>>>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git (branch: cca/
>>>>> latest)
>>>>> [2] https://git.qemu.org/git/qemu.git (branch: cca/
>>>>> gavin)
>>>>>
>>>>> Lorenzo, You may be saying there is someone making QEMU to support
>>>>> ARM/CCA?
>>>>
>>>> Mathieu and I are working on that yes and with Steven/Suzuki to fix
>>>> the THP
>>>> issues you pointed out above.
>>>>
>>>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>>>
>>>> We should be able to submit patches by end of June - we shall let
>>>> you know
>>>> whether we can make something available earlier.
>>>>
>>>
>>> Not sure if there are other known issues in this series. It seems the
>>> stage2
>>> page fault handling on the shared space isn't working well. In my
>>> test, the
>>> vring (struct vring_desc) of virtio-net-pci is updated by the guest,
>>> and the
>>> data isn't seen by QEMU, I'm suspecting if the host-page-frame-number
>>> is properly
>>> resolved in the s2 page fault handler for shared (unprotected) space.
>>>
>>> - I rebased Jean's latest qemu branch to the upstream qemu;
>>>
>>> - On the host, which is emulated by qemu/tcg, the THP (transparent
>>> huge page) is
>>> disabled.
>>>
>>> - On the guest, I can see the virtio vring (struct vring_desc) is
>>> updated. The
>>> S1 page-table entry looks correct because the corresponding
>>> physical address
>>> 0x10046880000 is a sane shared (unprotected) space address.
>>>
>>> [ 52.094143] software IO TLB: Memory encryption is active and
>>> system is using DMA bounce buffers
>>> [ 52.289746] virtqueue_add_desc_split:
>>> desc[0]@0xffff000006880000, [00000100b983f000 00000640 0002 0001]
>>> [ 52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>>>
>>> - On the host, the s2 page-table-entry is unmapped due to attribute
>>> transition (private -> shared).
>>> A subsequent S2 page fault is raised against the adress and the s2
>>> page-table-entry is built.
>>>
>>> [ 109.259077] ====> realm_unmap_shared_range:
>>> tracked_unprot_addr=0x10046880000
>>> [ 109.260249] realm_unmap_shared_range: unmapped shared range at
>>> 0x10046880000
>>> [ 109.317786] realm_unmap_shared_range: unmapped shared range at
>>> 0x10046880000
>>> [ 109.629939] ====> kvm_handle_guest_abort:
>>> fault_ipa=0x10046880000, esr=0x92000007
>>> [ 109.630245] realm_map_non_secure: ipa=0x10046880000,
>>> pfn=0xb8b59, size=0x1000, prot=0xf
>>> [ 109.630331] realm_map_non_secure: ipa=0x10046880000,
>>> ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
>>
>> Are you able to correlate the order of the transitions and the Guest
>> access with RMM log ? We haven't seen this from our end. We are aware
>> of permission fault issues with Unprotected IPA when backing the memslot
>> with MAP_PRIVATE areas. But this looks different.
>>
>> Lorenzo, have you run into this ?
>>
>
> It's hard to correlate the order since the logs are collected from two
> separate
> consoles. For the write permission, I add code to the host where the
> permission
> is always added for all s2 page faults in the shared space. Otherwise,
> qemu can
> be killed by -EFAULT or similar error.
This is the problem. We can't add WRITE permission by default. I believe
you may have MAP_PRIVATE mapping and it has to be mapped as READ only
and on a permission fault, we replace it with a writable page. By
overriding the WRITE permission, you let the guest write to a page
that may not be seen by the VMM.
We identified this as a bug in the KVM driver in this series (reported
by Lorenzo) and there is a corresponding tf-RMM change that is required
to get this working. So, please could you wait until the next series
when this will be addressed ? Or you could switch to using MAP_SHARED
for the "shared" memory in the memslot.
Suzuki
>
> There are more findings after more experiments: this virtio-net-pci
> device has 3
> queues or vrings (Rx/Tx/Ctrl). The Rx/Tx/Ctrl queue are populated in
> order one after
> one. In the guest kernel, I intentionally write fixed data
> (0x0123456789abcdef) to
> the first 8 bytes of the queue when it gets populated, and stop the
> guest at random
> points to see if the data is gone. I found that the data written to Rx/
> Tx queue are
> lost after Ctrl queue is allocated.
>
> The data written to Rx/Tx queue is lost if the guest stops (B). The data
> written to
> Rx/Tx queue isn't lost if the guest stops at (A). I can see the pattern
> (0x0123...cdef)
> by dumping the physcial memory through 'pmemsave' command in qemu.
>
> DMA allocation
> ==============
> dma_alloc_coherent
> dma_alloc_attrs
> dma_direct_alloc
> __dma_direct_alloc_pages
> dma_set_decrypted // (A) No data lost if being
> stopped here for the Ctrl queue
> memset(ret, 0, size) // (B) Data lost after being
> stopped after memset() for the Ctrl queue
>
> The memset() on the Ctrl queue should trigger a stage2 page fault. It
> seems the page
> fault enforces the shared pages for Rx/Tx queue to be dropped? I need to
> add more
> debugging code and track it down.
>
>> Suzuki
>>
>>
>>>
>>> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000
>>> isn't seen. All the
>>> data in that adress are zeros.
>>>
>>> ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38,
>>> queue_index=0x0, vq->vring.num=0x100
>>> virtqueue_split_pop: last_avail_idx=0x0, head=0x0
>>> address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0,
>>> buf=0xffffeee34880, len=0x10
>>> address_space_read_cached_slow: cache: ptr=0x0,
>>> xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
>>> address_space_read_cached_slow: translated to mr=<mach-virt.ram>,
>>> mr_addr=0x6880000, l=0x10
>>> flatview_read_continue_step: mr=<mach-virt.ram>,
>>> host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
>>> virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000
>>> - 00000000
>>> qemu-system-aarch64: virtio: zero sized buffers are not allowed
>>>
>>>
> Thanks,
> Gavin
>
^ permalink raw reply
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Suzuki K Poulose @ 2026-06-26 9:04 UTC (permalink / raw)
To: Gavin Shan, Lorenzo Pieralisi
Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <9482dfbc-4d96-47ba-a615-f4ba0bda833f@arm.com>
On 26/06/2026 09:47, Suzuki K Poulose wrote:
> On 26/06/2026 08:43, Gavin Shan wrote:
>> On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
>>> On 25/06/2026 14:53, Gavin Shan wrote:
>>>> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>>>>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>>>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>>
>> [...]
>>
>>>>>>
>>>>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU,
>>>>>> and found
>>>>>> that memory slots backed by THP are broken. With THP disabled on
>>>>>> the host and
>>>>>> other fixes (mentioned in my prevous replies) applied on the top
>>>>>> of this (v14)
>>>>>> series, I'm able to boot a realm guest with rebased QEMU series
>>>>>> [2], plus more
>>>>>> fxies on the top.
>>>>>>
>>>>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git (branch: cca/
>>>>>> latest)
>>>>>> [2] https://git.qemu.org/git/qemu.git (branch: cca/
>>>>>> gavin)
>>>>>>
>>>>>> Lorenzo, You may be saying there is someone making QEMU to support
>>>>>> ARM/CCA?
>>>>>
>>>>> Mathieu and I are working on that yes and with Steven/Suzuki to fix
>>>>> the THP
>>>>> issues you pointed out above.
>>>>>
>>>>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>>>>
>>>>> We should be able to submit patches by end of June - we shall let
>>>>> you know
>>>>> whether we can make something available earlier.
>>>>>
>>>>
>>>> Not sure if there are other known issues in this series. It seems
>>>> the stage2
>>>> page fault handling on the shared space isn't working well. In my
>>>> test, the
>>>> vring (struct vring_desc) of virtio-net-pci is updated by the guest,
>>>> and the
>>>> data isn't seen by QEMU, I'm suspecting if the host-page-frame-
>>>> number is properly
>>>> resolved in the s2 page fault handler for shared (unprotected) space.
>>>>
>>>> - I rebased Jean's latest qemu branch to the upstream qemu;
>>>>
>>>> - On the host, which is emulated by qemu/tcg, the THP (transparent
>>>> huge page) is
>>>> disabled.
>>>>
>>>> - On the guest, I can see the virtio vring (struct vring_desc) is
>>>> updated. The
>>>> S1 page-table entry looks correct because the corresponding
>>>> physical address
>>>> 0x10046880000 is a sane shared (unprotected) space address.
>>>>
>>>> [ 52.094143] software IO TLB: Memory encryption is active and
>>>> system is using DMA bounce buffers
>>>> [ 52.289746] virtqueue_add_desc_split:
>>>> desc[0]@0xffff000006880000, [00000100b983f000 00000640 0002 0001]
>>>> [ 52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>>>>
>>>> - On the host, the s2 page-table-entry is unmapped due to attribute
>>>> transition (private -> shared).
>>>> A subsequent S2 page fault is raised against the adress and the
>>>> s2 page-table-entry is built.
>>>>
>>>> [ 109.259077] ====> realm_unmap_shared_range:
>>>> tracked_unprot_addr=0x10046880000
>>>> [ 109.260249] realm_unmap_shared_range: unmapped shared range at
>>>> 0x10046880000
>>>> [ 109.317786] realm_unmap_shared_range: unmapped shared range at
>>>> 0x10046880000
>>>> [ 109.629939] ====> kvm_handle_guest_abort:
>>>> fault_ipa=0x10046880000, esr=0x92000007
>>>> [ 109.630245] realm_map_non_secure: ipa=0x10046880000,
>>>> pfn=0xb8b59, size=0x1000, prot=0xf
>>>> [ 109.630331] realm_map_non_secure: ipa=0x10046880000,
>>>> ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
>>>
>>> Are you able to correlate the order of the transitions and the Guest
>>> access with RMM log ? We haven't seen this from our end. We are aware
>>> of permission fault issues with Unprotected IPA when backing the memslot
>>> with MAP_PRIVATE areas. But this looks different.
>>>
>>> Lorenzo, have you run into this ?
>>>
>>
>> It's hard to correlate the order since the logs are collected from two
>> separate
>> consoles. For the write permission, I add code to the host where the
>> permission
>> is always added for all s2 page faults in the shared space. Otherwise,
>> qemu can
>> be killed by -EFAULT or similar error.
>
> This is the problem. We can't add WRITE permission by default. I believe
> you may have MAP_PRIVATE mapping and it has to be mapped as READ only
> and on a permission fault, we replace it with a writable page. By
> overriding the WRITE permission, you let the guest write to a page
> that may not be seen by the VMM.
>
> We identified this as a bug in the KVM driver in this series (reported
> by Lorenzo) and there is a corresponding tf-RMM change that is required
> to get this working. So, please could you wait until the next series
> when this will be addressed ? Or you could switch to using MAP_SHARED
> for the "shared" memory in the memslot.
For the record, you need something like this :
--- a/arch/arm64/kvm/rmi.c
+++ b/arch/arm64/kvm/rmi.c
@@ -838,8 +838,17 @@ int realm_map_non_secure(struct realm *realm,
if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
/* Create missing RTTs and retry */
int level = RMI_RETURN_INDEX(ret);
+ int req_level = find_map_level(realm, ipa, ipa_top);
+
+ /*
+ * There already exists a mapping at the level.
May be
+ * we are relaxing a permission for the given
range ?
+ */
+ if (level >= req_level) {
+ realm_unmap_shared_range(kvm, ipa,
ipa_top, false);
+ continue;
+ }
- WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
ret = realm_create_rtt_levels(realm, ipa, level,
KVM_PGTABLE_LAST_LEVEL,
memcache);
Thanks
Suzuki
>
>
> Suzuki
>
>
>>
>> There are more findings after more experiments: this virtio-net-pci
>> device has 3
>> queues or vrings (Rx/Tx/Ctrl). The Rx/Tx/Ctrl queue are populated in
>> order one after
>> one. In the guest kernel, I intentionally write fixed data
>> (0x0123456789abcdef) to
>> the first 8 bytes of the queue when it gets populated, and stop the
>> guest at random
>> points to see if the data is gone. I found that the data written to
>> Rx/ Tx queue are
>> lost after Ctrl queue is allocated.
>>
>> The data written to Rx/Tx queue is lost if the guest stops (B). The
>> data written to
>> Rx/Tx queue isn't lost if the guest stops at (A). I can see the
>> pattern (0x0123...cdef)
>> by dumping the physcial memory through 'pmemsave' command in qemu.
>>
>> DMA allocation
>> ==============
>> dma_alloc_coherent
>> dma_alloc_attrs
>> dma_direct_alloc
>> __dma_direct_alloc_pages
>> dma_set_decrypted // (A) No data lost if
>> being stopped here for the Ctrl queue
>> memset(ret, 0, size) // (B) Data lost after
>> being stopped after memset() for the Ctrl queue
>>
>> The memset() on the Ctrl queue should trigger a stage2 page fault. It
>> seems the page
>> fault enforces the shared pages for Rx/Tx queue to be dropped? I need
>> to add more
>> debugging code and track it down.
>>
>>> Suzuki
>>>
>>>
>>>>
>>>> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000
>>>> isn't seen. All the
>>>> data in that adress are zeros.
>>>>
>>>> ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38,
>>>> queue_index=0x0, vq->vring.num=0x100
>>>> virtqueue_split_pop: last_avail_idx=0x0, head=0x0
>>>> address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0,
>>>> buf=0xffffeee34880, len=0x10
>>>> address_space_read_cached_slow: cache: ptr=0x0,
>>>> xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
>>>> address_space_read_cached_slow: translated to mr=<mach-virt.ram>,
>>>> mr_addr=0x6880000, l=0x10
>>>> flatview_read_continue_step: mr=<mach-virt.ram>,
>>>> host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
>>>> virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000
>>>> - 00000000
>>>> qemu-system-aarch64: virtio: zero sized buffers are not allowed
>>>>
>>>>
>> Thanks,
>> Gavin
>>
>
^ permalink raw reply
* Re: [RFC PATCH 09/15] x86/virt/tdx: Add interface to generate a Quote
From: Peter Fang @ 2026-06-26 9:58 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: kas@kernel.org, djbw@kernel.org, yilun.xu@linux.intel.com,
x86@kernel.org, Xu, Yilun, Duan, Zhenzhong,
baolu.lu@linux.intel.com, Li, Xiaoyao,
linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
linux-coco@lists.linux.dev
In-Reply-To: <20260614112948.GG3200182@pedri>
On Sun, Jun 14, 2026 at 04:29:48AM -0700, Peter Fang wrote:
> >
> > > + goto out;
> > > +
> > > + /*
> > > + * The quote buffer is a shared resource, so use it only for the
> > > + * SEAMCALL and copy the data out as soon as possible.
> > > + */
> > > + quote_dup = kvmemdup(quote_data.buf, out_len, GFP_KERNEL);
> >
> > So at init time we allocate a vmalloc for the quote and pre-populate the
> > hpa_list. Then we use it every time and copy the contents to a new vmalloc.
> > Would it really be that hard to keep the hpa list allocation around, do a
> > vmalloc here and update the pfn list. Then do get quote on that and pass back
> > the vmalloc we just allocated? Just feels like global reuse way has extra pieces
> > in it. Compared to the whole quoting operation, this vmalloc_to_pfn() loop is
> > probably not very expensive.
>
> Hm interesting idea. But a Quote buffer could be close to 4MB in the worst
> case. Let's say max_quote_size is 3MB, that's 768 vmalloc_to_pfn() calls
> each time... That sounds a bit excessive right?
>
> The extra bits mainly come from using kvmemdup() I think. Having to use
> kvfree() on it does feel a bit annoying but that was the tradeoff I
> made...
>
I didn't change the design in the v2 series, but after some discussion
with Yilun, I now realize this is actually a really good idea. Having a
static buffer does complicate the code a bit. I'll probably try doing
the allocation on the KVM side and just do vmalloc_to_pfn() calls each
time. Thanks.
> >
>
^ permalink raw reply
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-26 11:43 UTC (permalink / raw)
To: Suzuki K Poulose, Lorenzo Pieralisi
Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <9482dfbc-4d96-47ba-a615-f4ba0bda833f@arm.com>
On 6/26/26 6:47 PM, Suzuki K Poulose wrote:
> On 26/06/2026 08:43, Gavin Shan wrote:
>> On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
>>> On 25/06/2026 14:53, Gavin Shan wrote:
>>>> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>>>>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>>>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>>
>> [...]
>>
>>>>>>
>>>>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
>>>>>> that memory slots backed by THP are broken. With THP disabled on the host and
>>>>>> other fixes (mentioned in my prevous replies) applied on the top of this (v14)
>>>>>> series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
>>>>>> fxies on the top.
>>>>>>
>>>>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git (branch: cca/ latest)
>>>>>> [2] https://git.qemu.org/git/qemu.git (branch: cca/ gavin)
>>>>>>
>>>>>> Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
>>>>>
>>>>> Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
>>>>> issues you pointed out above.
>>>>>
>>>>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>>>>
>>>>> We should be able to submit patches by end of June - we shall let you know
>>>>> whether we can make something available earlier.
>>>>>
>>>>
>>>> Not sure if there are other known issues in this series. It seems the stage2
>>>> page fault handling on the shared space isn't working well. In my test, the
>>>> vring (struct vring_desc) of virtio-net-pci is updated by the guest, and the
>>>> data isn't seen by QEMU, I'm suspecting if the host-page-frame-number is properly
>>>> resolved in the s2 page fault handler for shared (unprotected) space.
>>>>
>>>> - I rebased Jean's latest qemu branch to the upstream qemu;
>>>>
>>>> - On the host, which is emulated by qemu/tcg, the THP (transparent huge page) is
>>>> disabled.
>>>>
>>>> - On the guest, I can see the virtio vring (struct vring_desc) is updated. The
>>>> S1 page-table entry looks correct because the corresponding physical address
>>>> 0x10046880000 is a sane shared (unprotected) space address.
>>>>
>>>> [ 52.094143] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
>>>> [ 52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000, [00000100b983f000 00000640 0002 0001]
>>>> [ 52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>>>>
>>>> - On the host, the s2 page-table-entry is unmapped due to attribute transition (private -> shared).
>>>> A subsequent S2 page fault is raised against the adress and the s2 page-table-entry is built.
>>>>
>>>> [ 109.259077] ====> realm_unmap_shared_range: tracked_unprot_addr=0x10046880000
>>>> [ 109.260249] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>>>> [ 109.317786] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>>>> [ 109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000, esr=0x92000007
>>>> [ 109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59, size=0x1000, prot=0xf
>>>> [ 109.630331] realm_map_non_secure: ipa=0x10046880000, ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
>>>
>>> Are you able to correlate the order of the transitions and the Guest
>>> access with RMM log ? We haven't seen this from our end. We are aware
>>> of permission fault issues with Unprotected IPA when backing the memslot
>>> with MAP_PRIVATE areas. But this looks different.
>>>
>>> Lorenzo, have you run into this ?
>>>
>>
>> It's hard to correlate the order since the logs are collected from two separate
>> consoles. For the write permission, I add code to the host where the permission
>> is always added for all s2 page faults in the shared space. Otherwise, qemu can
>> be killed by -EFAULT or similar error.
>
> This is the problem. We can't add WRITE permission by default. I believe
> you may have MAP_PRIVATE mapping and it has to be mapped as READ only
> and on a permission fault, we replace it with a writable page. By
> overriding the WRITE permission, you let the guest write to a page
> that may not be seen by the VMM.
>
> We identified this as a bug in the KVM driver in this series (reported
> by Lorenzo) and there is a corresponding tf-RMM change that is required
> to get this working. So, please could you wait until the next series
> when this will be addressed ? Or you could switch to using MAP_SHARED
> for the "shared" memory in the memslot.
>
Exactly. the syntax for MAP_PRIVATE is broken if the write permission is
enforced for a read fault in the shared space. In my case, the host page can
be the zero page and eventually multiple s2 page-table entries (for multiple
unprotected or shared pages) point to the zero page. It's why clearing the
3rd queue (Ctrl queue) also clears the first queue (Rx queue) in my case.
Yes, this issue can be avoid by using a shared memory backend in qemu, something
like below. With this, I'm able to see virtio-net-pci starts to work...
-object memory-backend-ram,id=mem0,size=2G,share=yes
Thanks,
Gavin
>
> Suzuki
>
>
>>
>> There are more findings after more experiments: this virtio-net-pci device has 3
>> queues or vrings (Rx/Tx/Ctrl). The Rx/Tx/Ctrl queue are populated in order one after
>> one. In the guest kernel, I intentionally write fixed data (0x0123456789abcdef) to
>> the first 8 bytes of the queue when it gets populated, and stop the guest at random
>> points to see if the data is gone. I found that the data written to Rx/ Tx queue are
>> lost after Ctrl queue is allocated.
>>
>> The data written to Rx/Tx queue is lost if the guest stops (B). The data written to
>> Rx/Tx queue isn't lost if the guest stops at (A). I can see the pattern (0x0123...cdef)
>> by dumping the physcial memory through 'pmemsave' command in qemu.
>>
>> DMA allocation
>> ==============
>> dma_alloc_coherent
>> dma_alloc_attrs
>> dma_direct_alloc
>> __dma_direct_alloc_pages
>> dma_set_decrypted // (A) No data lost if being stopped here for the Ctrl queue
>> memset(ret, 0, size) // (B) Data lost after being stopped after memset() for the Ctrl queue
>>
>> The memset() on the Ctrl queue should trigger a stage2 page fault. It seems the page
>> fault enforces the shared pages for Rx/Tx queue to be dropped? I need to add more
>> debugging code and track it down.
>>
>>> Suzuki
>>>
>>>
>>>>
>>>> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 isn't seen. All the
>>>> data in that adress are zeros.
>>>>
>>>> ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38, queue_index=0x0, vq->vring.num=0x100
>>>> virtqueue_split_pop: last_avail_idx=0x0, head=0x0
>>>> address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0, buf=0xffffeee34880, len=0x10
>>>> address_space_read_cached_slow: cache: ptr=0x0, xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
>>>> address_space_read_cached_slow: translated to mr=<mach-virt.ram>, mr_addr=0x6880000, l=0x10
>>>> flatview_read_continue_step: mr=<mach-virt.ram>, host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
>>>> virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 - 00000000
>>>> qemu-system-aarch64: virtio: zero sized buffers are not allowed
>>>>
>>>>
>> Thanks,
>> Gavin
>>
>
^ permalink raw reply
* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Ackerley Tng @ 2026-06-26 15:28 UTC (permalink / raw)
To: Yan Zhao
Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <aj3TGLGWT1kMFIVH@yzhao56-desk.sh.intel.com>
Yan Zhao <yan.y.zhao@intel.com> writes:
> On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>
>> > On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
>> >> Sean Christopherson <seanjc@google.com> writes:
>> >>
>> >> > On Tue, Jun 23, 2026, Yan Zhao wrote:
>> >> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
>> >> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
>> >> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
>> >> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
>> >> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
>> >> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
>> >> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> >> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>> >> >> > > > > return -EIO;
>> >> >> > > > >
>> >> >> > > > > - if (!src_page)
>> >> >> > > > > - return -EOPNOTSUPP;
>> >> >> > > > > + if (!src_page) {
>> >> >> > > > > + if (!gmem_in_place_conversion)
>> >> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
>> >> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
>> >> >> > > > error.
>> >> >> > >
>> >> >> > > Why MMAP?
>> >> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
>> >> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
>> >> >> >
>> >> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"? Just
>> >> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
>> >> >> > > and written memory. And when write() lands, MMAP wouldn't be necessary to
>> >> >> > > initialize the memory.
>> >> >> > Do you mean using up-to-date flag as below?
>> >> >
>> >> > Yes? I didn't actually look at the implementation details.
>> >> >
>> >> >> > if (!src_page) {
>> >> >> > src_page = pfn_to_page(pfn);
>> >> >> > if (!folio_test_uptodate(page_folio(src_page)))
>> >> >> > return -EOPNOTSUPP;
>> >> >> > }
>> >>
>> >> Yan is right that with the earlier patch "Zero page while getting pfn",
>> >> folio_test_uptodate() here will always return true.
>> >>
>> >> Actually, this is an alternative fix for the issue Sashiko pointed out
>> >> on v7 where userspace can do a populate() (either TDX or SNP) without
>> >> first allocating the page, with src_address == NULL, and leak
>> >> uninitialized memory into the guest.
>> >>
>> >> Advantage of using the uptodate check in populate: if the host never
>> >> allocates the page, populate doesn't incur zeroing before writing the
>> >> page anyway in populate().
>> >>
>> >> Disadvantage: Both TDX and SNP will have to implement this uptodate
>> >> check. guest_memfd can't check centrally because for SNP, for a
>> >> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
>> >> firmware will zero and there's no leakage of uninitialized host memory?
>> > Another disadvantage: the uptodate flag is per-folio. What if the folio
>> > is only partially initialized by the userspace especially after huge page is
>> > supported?
>> >
>>
>> Good point on huge pages!
>>
>> The uptodate flag on the folio in guest_memfd means "this folio has been
>> written to". As of now (before patch at [1]), this happens when
>>
>> + folio is zeroed on first use by userspace
>> + folio is zeroed on first use of the guest
>> + folio is populated
>>
>> When huge pages are supported, the folio can't partially be initialized?
>>
>> On allocation, if any part is shared, we split the page. The parts are
>> separate folios that have their own uptodate flags.
>>
>> On splitting, if the huge page is uptodate, the split pages will also be
>> uptodate. If the huge page is not uptodate, the split pages won't be
>> uptodate, but that's ok since they will be marked uptodate on first use.
>>
>> On merging, the non-uptodate parts have to be zeroed and then marked
> If that's true, it would be good.
>
>> uptodate. Any parts that are in use would have been marked uptodate
>> already, so there's no overwriting data that is in use. I'll need to
>> think more about when it's safe to zero.
>>
>> I'm still on the fence between the two options
>>
>> 1. Using uptodate check in populate to reject src_pages that have never
>> been written to or
>> 2. Always zero before populate
> 2 does not work?
> The flow is
> 1. mmap gmem_fd, make GFN shared, and write initial content.
> 2. convert GFN to private
> 3. invoke ioctl to trigger populate.
>
This flow is correct, is what users of in-place conversion should do.
"Always" is the wrong word, I should have said "zero if not uptodate
before populate", as in, with patch at [1].
By doing the zeroing in __kvm_gmem_get_pfn instead, by the time populate
gets the pfn, the page would be zeroed, either because userspace faulted
it in, and the zeroing happened in kvm_gmem_fault_user_mapping(), or if
userspace never faulted it in, the zeroing would happen because
populate() allocated the page.
>> but whether the uptodate flag is per-folio or not doesn't affect these
>> two options in terms of fixing the leak of uninitialized host memory,
>> right?
> yes, provided "On merging, the non-uptodate parts have to be zeroed and then
> marked uptodate".
>
Thank you so much for bringing this up, I hadn't considered this
before. I'll do that when I get to guest_memfd hugepage restructuring.
>> >
>> >> >> Another concern with this fix is that:
>> >> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
>> >> >> folio uptodate before reaching post_populate().
>> >> >>
>> >> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
>> >> >>
>> >> >> > One concern is that TDX now does not much care about the up-to-date flag since
>> >> >> > TDX doesn't rely on the flag to clear pages on conversions.
>> >> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
>> >> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
>> >> >> > user access.
>> >> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
>> >> >> > to have more effort than introducing a dedicated flag for TDX.
>> >> >> >
>> >> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
>> >> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
>> >> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
>> >> >> > > > 1. create guest_memfd with MMAP flag
>> >> >> > > > 2. mmap the guest_memfd.
>> >> >> > > > 3. convert the initial memory range to shared.
>> >> >> > > > 4. copy initial content to the source page.
>> >> >> > > > 5. convert the initial memory range to private
>> >> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
>> >> >> > > > 7. do not unmap the source backend.
>> >> >> > > >
>> >> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
>> >> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
>> >> >> > >
>> >> >> > > Why? It's userspace's responsibility to get the above right. If userspace fails
>> >> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
>> >>
>> >> Yan, is your concern that userspace forgot to update the code and
>> >> forgets to provide a src_page, and if we keep the "Zero page while
>> > Yes. Previously, it would be rejected after GUP fails.
>> >
>>
>> I see, didn't realize previously it would be rejected because GUP
>> fails. GUP failed because it wasn't faulted into the host?
> GUP fails if 0 is not a valid user address.
> But GUP would not fail if 0 is a valid address. e.g., in below scenario:
>
> #include <sys/mman.h>
> #include <stdio.h>
> int main(void)
> {
> void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
> if (p==MAP_FAILED) {
> perror("mmap");
> return 1;
> }
> *(char*)0='Y';
> printf("addr0=%p val=%c\n",p,*(char*)0);
> return 0;
> }
>
>
>> That's kind of orthogonal, I don't think GUP fail leading to rejecting
>> populate was meant to help userspace catch these issues. GUP would also
>> fail if the user did mmap(), write to it, unmap using
>> madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
> The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 was
> rejected depended on whether the user mmap()'d address 0. If 0 was a valid
> mapping, populate() could proceed.
>
> commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating
> guest memory") changed the behavior though. It would return -EOPNOTSUPP for a 0
> uaddr.
>
I see, I only looked at this after commit 2a62345b3052.
> But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
> source_addr(not from gmem), I'm not sure if it's good for the kernel to silently
> treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.
>
I'd say the original uAPI perhaps just didn't document 0 as an
unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI
was perhaps accidentally changed and no customer complained, I think we
can move forward with 0 as an invalid src_address? I wouldn't think
anyone relies on 0 intentionally being a valid address.
I could document that, if it helps?
>
>> >> getting pfn" patch, ends up with the guest silently having a zero page?
>> >> I think that would be found quite early in userspace VMM testing...
>> > I actually encountered this during testing this patch.
>> > I update most code path to follow this sequence. However, still some corner ones
>> > for TDVF HOB, which are less obvious and harder to update.
>> > The TD just booted up and hang silently.
>> >
>>
>> I think this is just the life of a close-to-hardware software engineer
>> :P no errors, got stuck somewhere, root cause is some unitialized
>> thing.
>>
>> >> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
>> >> >> > kernel to detect this mistake, similar to how it validates whether source_addr
>> >> >> > is PAGE_ALIGNED.
>> >> >
>> >> > The alignment case is different. If userspace provides an unaligned value, KVM
>> >> > *can't* do what userspace is asking because hardware and thus KVM only supports
>> >> > converting on page boundaries.
>> >> >
>> >> > For a NULL source, KVM can still do what userspace is asking. Rejecting userspace's
>> >> > request would then be making assumptions about what userspace wants.
>> >> >
>> >>
>> >> Also, +1 on this, what if userspace, knowing that pages are zeroed on
>> >> allocation, actually wants to rely on that to get a zero page in the guest?
>> > What if 0 uaddr is a valid address? :)
>> >
>> >> >> > Since userspace already needs to perform additional steps to enable in-place
>> >> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
>> >> >> > intentional seems like a reasonable burden.
>> >> >
>> >> > I don't see how it adds any value. I wouldn't be at all surprised if most VMMs
>> >> > just wen up with code that does:
>> >> >
>> >> > if (in-place) {
>> >> > src = NULL;
>> >> > flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
>> >> > }
>> >>
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Borislav Petkov @ 2026-06-26 16:40 UTC (permalink / raw)
To: Kalra, Ashish
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <7c64d96f-f932-4db9-8119-b9e40d5b7fd9@amd.com>
On Thu, Jun 25, 2026 at 02:42:23PM -0500, Kalra, Ashish wrote:
> Hello Boris,
Hello Ashish,
lemme try to make sense of your AI reply...
> cpu_hotplug_disable()/cpu_hotplug_enable() are refcounted (cpu_hotplug_disabled++/--,
> with a WARN on underflow), so they have to be balanced. This flag collapses them to
> exactly one outstanding disable per SNP-active window, because the disable and enable
> sites are not reached a symmetric number of times:
Well, why aren't they?
Why isn't a simple design where on SNP init hotplug is disabled - *exactly*
one call to cpu_hotplug_disable() and on SNP shutdown hotplug is reenabled
again - also exactly one call.
I know why...
> - On firmware without SNP_X86_SHUTDOWN_SUPPORTED, __sev_snp_shutdown_locked() does not
This function is one convoluted mess which does gazillion things. If I were
maintaining that code, I would impose a mandatory cleanup phase before new
features are added. But I probably said that already before...
And because a lot of code from your set goes into areas I maintain, I would
suggest you take the time and do that cleanup. Before that code goes
completely off the rails. And I'm willing to offer you review bandwidth and
other help I can with doing this right.
> call snp_shutdown() (it's gated on data.x86_snp_shutdown), so SNP stays enabled in
> hardware — SNP_EN stays set and hotplug stays disabled — while sev->snp_initialized is
> cleared. Re-init after that is routine, the SNP ioctls self-bracket init and shutdown
> (e.g. SNP_COMMIT, SNP_SET_CONFIG, SNP_VLEK_LOAD):
That init and teardown flow should be simplified:
You have multiple things which you need to do at different times
- per-CPU init
- global init
- per-CPU teardown
- global teardown
CPU hotplug toggling belongs to the global category. Instead of piling more
stuff onto that __sev_snp_shutdown_locked() function, you should take some
time to clean it up, analyze what goes where and then simplify that flow.
So let's clean stuff up first, please, analyze the flow and determine what
goes where and then do it. Not bolt more stuff on what is already wobbly.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Lorenzo Pieralisi @ 2026-06-26 16:44 UTC (permalink / raw)
To: Gavin Shan
Cc: Suzuki K Poulose, Steven Price, kvm, kvmarm, Catalin Marinas,
Marc Zyngier, Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <8f81ed99-c53a-4196-baa2-adea9239a000@redhat.com>
On Fri, Jun 26, 2026 at 09:43:03PM +1000, Gavin Shan wrote:
> On 6/26/26 6:47 PM, Suzuki K Poulose wrote:
> > On 26/06/2026 08:43, Gavin Shan wrote:
> > > On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
> > > > On 25/06/2026 14:53, Gavin Shan wrote:
> > > > > On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
> > > > > > On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
> > > > > > > On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
> > > > > > > > On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
> > >
> > > [...]
> > >
> > > > > > >
> > > > > > > I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
> > > > > > > that memory slots backed by THP are broken. With THP disabled on the host and
> > > > > > > other fixes (mentioned in my prevous replies) applied on the top of this (v14)
> > > > > > > series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
> > > > > > > fxies on the top.
> > > > > > >
> > > > > > > [1] https://git.codelinaro.org/linaro/dcap/qemu.git (branch: cca/ latest)
> > > > > > > [2] https://git.qemu.org/git/qemu.git (branch: cca/ gavin)
> > > > > > >
> > > > > > > Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
> > > > > >
> > > > > > Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
> > > > > > issues you pointed out above.
> > > > > >
> > > > > > > If so, I'm not sure if there is a QEMU repository for me to try?
> > > > > >
> > > > > > We should be able to submit patches by end of June - we shall let you know
> > > > > > whether we can make something available earlier.
> > > > > >
> > > > >
> > > > > Not sure if there are other known issues in this series. It seems the stage2
> > > > > page fault handling on the shared space isn't working well. In my test, the
> > > > > vring (struct vring_desc) of virtio-net-pci is updated by the guest, and the
> > > > > data isn't seen by QEMU, I'm suspecting if the host-page-frame-number is properly
> > > > > resolved in the s2 page fault handler for shared (unprotected) space.
> > > > >
> > > > > - I rebased Jean's latest qemu branch to the upstream qemu;
> > > > >
> > > > > - On the host, which is emulated by qemu/tcg, the THP (transparent huge page) is
> > > > > disabled.
> > > > >
> > > > > - On the guest, I can see the virtio vring (struct vring_desc) is updated. The
> > > > > S1 page-table entry looks correct because the corresponding physical address
> > > > > 0x10046880000 is a sane shared (unprotected) space address.
> > > > >
> > > > > [ 52.094143] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
> > > > > [ 52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000, [00000100b983f000 00000640 0002 0001]
> > > > > [ 52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
> > > > >
> > > > > - On the host, the s2 page-table-entry is unmapped due to attribute transition (private -> shared).
> > > > > A subsequent S2 page fault is raised against the adress and the s2 page-table-entry is built.
> > > > >
> > > > > [ 109.259077] ====> realm_unmap_shared_range: tracked_unprot_addr=0x10046880000
> > > > > [ 109.260249] realm_unmap_shared_range: unmapped shared range at 0x10046880000
> > > > > [ 109.317786] realm_unmap_shared_range: unmapped shared range at 0x10046880000
> > > > > [ 109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000, esr=0x92000007
> > > > > [ 109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59, size=0x1000, prot=0xf
> > > > > [ 109.630331] realm_map_non_secure: ipa=0x10046880000, ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
> > > >
> > > > Are you able to correlate the order of the transitions and the Guest
> > > > access with RMM log ? We haven't seen this from our end. We are aware
> > > > of permission fault issues with Unprotected IPA when backing the memslot
> > > > with MAP_PRIVATE areas. But this looks different.
> > > >
> > > > Lorenzo, have you run into this ?
> > > >
> > >
> > > It's hard to correlate the order since the logs are collected from two separate
> > > consoles. For the write permission, I add code to the host where the permission
> > > is always added for all s2 page faults in the shared space. Otherwise, qemu can
> > > be killed by -EFAULT or similar error.
> >
> > This is the problem. We can't add WRITE permission by default. I believe
> > you may have MAP_PRIVATE mapping and it has to be mapped as READ only
> > and on a permission fault, we replace it with a writable page. By
> > overriding the WRITE permission, you let the guest write to a page
> > that may not be seen by the VMM.
> >
> > We identified this as a bug in the KVM driver in this series (reported
> > by Lorenzo) and there is a corresponding tf-RMM change that is required
> > to get this working. So, please could you wait until the next series
> > when this will be addressed ? Or you could switch to using MAP_SHARED
> > for the "shared" memory in the memslot.
> >
>
> Exactly. the syntax for MAP_PRIVATE is broken if the write permission is
> enforced for a read fault in the shared space. In my case, the host page can
> be the zero page and eventually multiple s2 page-table entries (for multiple
> unprotected or shared pages) point to the zero page. It's why clearing the
> 3rd queue (Ctrl queue) also clears the first queue (Rx queue) in my case.
>
> Yes, this issue can be avoid by using a shared memory backend in qemu, something
> like below. With this, I'm able to see virtio-net-pci starts to work...
>
> -object memory-backend-ram,id=mem0,size=2G,share=yes
Yes, as Suzuki said that's what we have been fixing. QEmu patches
will be on the mailing lists very shortly - the KVM/tf-RMM fixes
to make MAP_PRIVATE work will be included in the next posting.
Feel free to drop your QEmu command line so that I can give it
a shot and check whether the fixes solve the problem you hit
(I think so because that's precisely the kind of issue I got
into when I started debugging THP/MAP_PRIVATE but it is better
to check).
Thanks,
Lorenzo
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default\
From: Sean Christopherson @ 2026-06-26 19:06 UTC (permalink / raw)
To: Yan Zhao
Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <aj3H2sxymOYTWTnE@yzhao56-desk.sh.intel.com>
On Fri, Jun 26, 2026, Yan Zhao wrote:
> On Thu, Jun 25, 2026 at 07:36:28AM -0700, Sean Christopherson wrote:
> > On Thu, Jun 25, 2026, Yan Zhao wrote:
> > And I'm not remotely convinced that prepending allow_ to the param will help
> > end users diagnose "unexpected" memory consumption, in quotes because anyone that
> > is deploying a stack that utilizes out-of-place conversion absolutely needs to
> > understand and plan for the additional memory consumption. I.e. if the memory
> > consumption is "unexpected" to the end user, they likely have far bigger problems.
> My first impression of gmem_in_place_conversion=true was that it enforces gmem
> in-place conversion. However, it actually only enforces per-gmem private/shared
> attribute.
> My worry was that people might think it's a kernel bug if userspace can still
> have shared memory from other sources after they configured
> gmem_in_place_conversion=true.
Ah, I see where you're coming from. FWIW, truly enforcing in-place conversion
is flat out impossible. E.g. userspace can simply replace the memslot, at which
point the memory effectively reverts to shared.
> However, I have no strong opinion if you think gmem_in_place_conversion is good,
> and with the above documentation. :)
Ya, I think this largely a documentation problem. I agree that a param name
like gmem_private_memory_attributes would be more precise, but I think it'd be
far less informative for the vast majority of users that only care whether or
not KVM can do in-place conversion, and don't care about how that is done.
^ permalink raw reply
* Re: [PATCH 1/4] kvm: sev: Fix user-space triggerable WARN_ON on snp_launch_update path
From: Tom Lendacky @ 2026-06-26 20:21 UTC (permalink / raw)
To: Jörg Rödel, Sean Christopherson, Paolo Bonzini
Cc: x86, Kiryl Shutsemau, Rick Edgecombe, Ashish Kalra, Michael Roth,
kvm, linux-kernel, linux-coco, Joerg Roedel
In-Reply-To: <20260623091556.1500930-2-joro@8bytes.org>
On 6/23/26 04:15, Jörg Rödel wrote:
> From: Joerg Roedel <joerg.roedel@amd.com>
>
> Sashiko reported on an unrelated patch:
>
> [Severity: High]
> This is a pre-existing issue, but can a host userspace process trigger a
> kernel warning by passing a NULL user address (uaddr = 0) here?
>
> If params.uaddr is 0, src becomes NULL and passes the PAGE_ALIGNED(src)
> check. kvm_gmem_populate() skips fetching the user page and passes
> src_page = NULL to sev_gmem_post_populate().
>
> That function then unconditionally evaluates:
>
> WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO &&
> !src_page)
>
> Since the type isn't ZERO, won't this allow an unprivileged user to spam
> the kernel log?
It would only be one warning that is emitted, "spam the kernel log" sounds
like you could fill it with warnings. And I would say that the severity is
only "High" should the kernel be configured with PANIC_ON_WARN.
>
> The assessment is correct, so check for this condition earlier in the
> snp_launch_update() path to avoid the WARN_ON_ONCE.
I'm not positive, but isn't it technically possible that the userspace
virtual address can be 0? In which case, should this be fixed in the
kvm_gmem_populate() API with maybe a new parameter that indicates whether
src is valid or not?
Thanks,
Tom
>
> Fixes: dee5a47cc7a45 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_UPDATE command")
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> ---
> arch/x86/kvm/svm/sev.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 6c6a6d663e29..41dcba5180ca 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2438,6 +2438,13 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
> if (!PAGE_ALIGNED(src))
> return -EINVAL;
>
> + /*
> + * Make sure user-mode did not pass NULL as src with
> + * type != KVM_SEV_SNP_PAGE_TYPE_ZERO.
> + */
> + if (src == NULL && params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
> + return -EINVAL;
> +
> npages = params.len / PAGE_SIZE;
>
> /*
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Kalra, Ashish @ 2026-06-26 20:23 UTC (permalink / raw)
To: K Prateek Nayak, Borislav Petkov
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, Tycho.Andersen, Nathan.Fontenot, ackerleytng,
jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
linux-crypto, kvm, linux-coco
In-Reply-To: <24aef42a-86da-42d4-92d1-9a6d2d329592@amd.com>
Hello Prateek,
On 6/25/2026 11:01 PM, K Prateek Nayak wrote:
> Hello Ashish,
>
> On 6/26/2026 8:08 AM, Kalra, Ashish wrote:
>>> Looking at snp_prepare(), we have an early-bailout for
>>>
>>> rdmsrq(MSR_AMD64_SYSCFG, val);
>>> if (val & MSR_AMD64_SYSCFG_SNP_EN)
>>> return;
>>>
>>> Does executing SHUTDOWN command lead to the firmware clearing SNP_EN in
>>> SYSCFG on all CPUS?
>>
>> Yes, in case of X86_SNP_SHUTDOWN (available if firmware supports X86SnpShutdown feature)
>> SNP is disabled on all cores by clearing SYSCFG[SNPEn] bit.
>>
>> If X86_SNP_SHUTDOWN is set to 1, the firmware clears the SYSCFG[SNPEn] bit in each core.
>>
>> But, in case of legacy SNP shutdown, SNP_EN bit is not cleared and so SNP remains enabled.
>
> Ah! That was the bit I was missing. Thanks a ton for clarifying.
>
>>
>>>
>>> If SNP_EN remains set (and Linux can't clear it since it is
>>> "Write-1-only" bit), then a subsequent snp_prepare() will skip setting
>>> SYSCFG if it sees SNP_EN on local CPU.
>>>
>>> It can so happen that we enable hotlpug at shutdown, CPUs come online
>>> without setting SNP_EN in SYSCFG, subsequent snp_prepare() runs on a CPU
>>> where SNP_EN is still set and skips configuring it for the CPUs that
>>> don't have it set, and we'll be in a pickle still.
>>>
>>> The comment above that bailout saying "this can happen in case of kexec
>>> boot" makes me believe that SNP_EN remains set until a full system
>>> reset.
>>>
>>> The only safe way to do this is to ensure all possible CPUs are online
>>> during snp_prepare() and do snp_enable() regardless of whether local CPU
>>> has SNP_EN or not.
>>>
>>> Am I missing something?
>>>
>>
>> The piece that makes the early bailout safe is the disable this patch adds:
>> hotplug is disabled while SNP is active, so the online set can't change under an
>> active SNP. snp_prepare() already requires online == present, so at a successful
>> init every present CPU gets SNP_EN,
>
> How is this enforced? AFAICT, on_each_cpu(snp_enable) will only covers
> the online CPUs and there could be CPUs that have been offlined before
> that right
Right that on_each_cpu() only covers online CPUs -- but snp_prepare() refuses to
proceed unless online == present.
if (!cpumask_equal(cpu_online_mask, cpu_present_mask)) {
ret = -EOPNOTSUPP;
pr_warn("SNP init failed: not all CPUs online. ...");
goto unlock;
}
The check right before on_each_cpu(snp_enable) returns -EOPNOTSUPP if any present CPU
is offline, so SNP init simply fails in that case -- there is no successful init
that leaves a CPU without SNP_EN. The check and on_each_cpu(snp_enable) both run
under cpus_read_lock(), so the online set can't change between the two, at a
successful snp_prepare(), online == present and every present CPU has SNP_EN.
After that this patch disables hotplug, so the set stays == present.
(That online == present requirement is existing snp_prepare() behavior, not
something this patch adds.)
And cpu_hotplug_disable() comes right before cpus_read_lock() as it must not be called
while holding cpus_read_lock(), something like this:
rdmsrq(MSR_AMD64_SYSCFG, val);
if (val & MSR_AMD64_SYSCFG_SNP_EN)
return 0; /* bailout: re-init/kexec, SNP_EN already set */
clear_rmp();
cpu_hotplug_disable(); /* <-- here: after bailout, before cpus_read_lock */
cpus_read_lock();
if (!cpumask_equal(cpu_online_mask, cpu_present_mask)) {
ret = -EOPNOTSUPP;
...
goto unlock; /* will re-enable below */
}
on_each_cpu(mfd_reconfigure, ...);
on_each_cpu(snp_enable, ...);
...
unlock:
cpus_read_unlock();
if (ret)
cpu_hotplug_enable(); /* undo: failed before SNP_EN was set */
return ret;
>
>> and because hotplug is then disabled none
>> can leave or rejoin without it. So whenever the bailout is hit with SNP active,
>> every online CPU already has SNP_EN:
>>
>> - kexec: SNP_EN is already set on all CPUs by the previous kernel.
>
> There is a catch here: you can have offline CPUs during the previous boot
> (say you have maxcpus=8 in your cmdline), and then you kexec with a different
> kernel / cmdline that brings online a bunch more CPUs.
>
> SNP_EN will only be set for a subset of then with the legacy SNP_INIT and
> if snp_prepare() runs on those legacy CPUs, you still skip setting it for
> the ones that don't have SNP_EN set.
>
> Is that case covered somehow or is it a non-issue?
>
It's a non-issue, for two independent reasons.
First, kexec with SNP active currently requires a full SNP shutdown before the
kexec. SNP_SHUTDOWN_EX (and the IOMMU SNP shutdown it performs) fail if there
are any active SNP guests or assigned ASIDs, so a working kexec has to terminate
all SNP guests and run a full shutdown first (via systemctl kexec). On
firmware that supports X86_SNP_SHUTDOWN, that full shutdown clears SNP_EN on all
CPUs, so the kexec target boots with SNP_EN clear and runs a complete, fresh
snp_prepare() -- where online == present is enforced, so every present CPU gets
SNP_EN. There is no inherited partial-SNP_EN state.
Second, even independent of kexec, this kernel's snp_prepare() never sets SNP_EN
on a subset: on_each_cpu(snp_enable) runs only after the
cpumask_equal(cpu_online_mask, cpu_present_mask) check passes, so it's all (every
present CPU) or nothing (snp_prepare() returns -EOPNOTSUPP and SNP_EN is never
set). With maxcpus=8 on a larger system, online != present, so SNP simply does
not initialize -- it cannot leave SNP_EN set on only those 8 cores. A successful
init therefore implies every present CPU has SNP_EN, and the present mask is the
same physical hardware across kexec.
So producing a partial SNP_EN set would require a source kernel that both sets
SNP_EN partially (i.e. doesn't enforce online == present) and skips the
full shutdown before kexec -- neither of which applies here. I think it's a
non-issue in practice.
>> - re-init while SNP is still active (e.g. after a legacy SNP_SHUTDOWN that
>> leaves SNP_EN set): hotplug was disabled the whole time, so the online set is
>> unchanged and all of them still have SNP_EN.
>>
>> The only way a CPU can be online without SNP_EN is when SNP is not active --
>> i.e. after an SNP_INIT failure, where this patch re-enables hotplug. That is
>> deliberately the same as the behavior before this support existed (hotplug was
>> never disabled then), and it is benign: SNP_EN only gates RMP checks, the RMP
>> itself is initialized by SNP_INIT, so on a failed init the RMP is all-zeroes --
>> every entry is in the default HV-owned state, no page is assigned, no check ever blocks
>> and snp_initialized stays false, so no SNP guest can be created.
>> Nothing is enforced and nothing is protected.
>>
>> So I've kept snp_prepare()'s existing bailout / snp_enable() behavior unchanged;
>> what this patch adds is disabling hotplug while SNP is active, which is what
>> actually closes the window (a CPU coming online without SNP_EN while SNP is
>> live). That window -- and the SNP_EN-stays-set-on-failure situation -- already
>> exist in today's code, this patch constrains the dangerous (active) case and
>> otherwise matches current behavior.
>
> Ack! Just that one small bit up above bothers me but other than that,
> doing it in snp_prepare() should be good.
>
> This is all new to me so thanks a ton for answering my queries.
>
Thanks,
Ashish
>>
>> (On the v9 placement specifically: I'm moving the disable into snp_prepare()
>> ahead of SNP_EN in the next version; in v9 it sits after SNP_INIT, which leaves
>> the window you originally pointed out.)
^ permalink raw reply
* Re: [PATCH 1/4] kvm: sev: Fix user-space triggerable WARN_ON on snp_launch_update path
From: Sean Christopherson @ 2026-06-26 20:46 UTC (permalink / raw)
To: Tom Lendacky
Cc: Jörg Rödel, Paolo Bonzini, x86, Kiryl Shutsemau,
Rick Edgecombe, Ashish Kalra, Michael Roth, kvm, linux-kernel,
linux-coco, Joerg Roedel
In-Reply-To: <b3b60c37-aaf5-44b4-87bc-6aacd49be717@amd.com>
On Fri, Jun 26, 2026, Tom Lendacky wrote:
> On 6/23/26 04:15, Jörg Rödel wrote:
> > From: Joerg Roedel <joerg.roedel@amd.com>
> >
> > Sashiko reported on an unrelated patch:
> >
> > [Severity: High]
> > This is a pre-existing issue, but can a host userspace process trigger a
> > kernel warning by passing a NULL user address (uaddr = 0) here?
> >
> > If params.uaddr is 0, src becomes NULL and passes the PAGE_ALIGNED(src)
> > check. kvm_gmem_populate() skips fetching the user page and passes
> > src_page = NULL to sev_gmem_post_populate().
> >
> > That function then unconditionally evaluates:
> >
> > WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO &&
> > !src_page)
> >
> > Since the type isn't ZERO, won't this allow an unprivileged user to spam
> > the kernel log?
>
> It would only be one warning that is emitted, "spam the kernel log" sounds
> like you could fill it with warnings. And I would say that the severity is
> only "High" should the kernel be configured with PANIC_ON_WARN.
Yeah, ignore any and all complaints about panic_on_warn=1 leading to DoS.
https://lore.kernel.org/all/CABgObfZJV5hU_7WoPWLRH3-EvKts%2BUBZOwtCXmwVZYJP8dDo2A@mail.gmail.com
> > The assessment is correct, so check for this condition earlier in the
> > snp_launch_update() path to avoid the WARN_ON_ONCE.
>
> I'm not positive, but isn't it technically possible that the userspace
> virtual address can be 0?
Yep, though it requires an explicit opt-in.
config DEFAULT_MMAP_MIN_ADDR
int "Low address space to protect from user allocation"
depends on MMU
default 4096
help
This is the portion of low virtual memory which should be protected
from userspace allocation. Keeping a user from writing to low pages
can help reduce the impact of kernel NULL pointer bugs.
For most ppc64 and x86 users with lots of address space
a value of 65536 is reasonable and should cause no problems.
On arm and other archs it should not be higher than 32768.
Programs which use vm86 functionality or have some need to map
this low address space will need CAP_SYS_RAWIO or disable this
protection by setting the value to 0.
This value can be changed after boot using the
/proc/sys/vm/mmap_min_addr tunable.
> In which case, should this be fixed in the kvm_gmem_populate() API with maybe
> a new parameter that indicates whether src is valid or not?
Nah, treating 0/NULL as invalid is perfectly acceptable. It doesn't work today,
i.e. there's no risk of breaking anyone, and just because userspace can use VA=0
for other things doesn't mean KVM needs to support that for all of its uAPI
surface.
FWIW, I was initially opposed to using 0/NULL to mean "invalid", and even called
it ridiculous, but David W. convinced me that accepting 0/NULL in modern KVM uAPI
would violate the principle of least surprise. So now I get to claim using NULL
to mean invalid as my own original idea ;-)
https://lore.kernel.org/all/a2cfad68277cae67791f07646c842672593a8dca.camel@infradead.org
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Kalra, Ashish @ 2026-06-26 20:59 UTC (permalink / raw)
To: Borislav Petkov
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <20260626164032.GDaj6rgHq4xPd-qjvG@fat_crate.local>
Hello Boris,
On 6/26/2026 11:40 AM, Borislav Petkov wrote:
> On Thu, Jun 25, 2026 at 02:42:23PM -0500, Kalra, Ashish wrote:
>> Hello Boris,
>
> Hello Ashish,
>
> lemme try to make sense of your AI reply...
>
>> cpu_hotplug_disable()/cpu_hotplug_enable() are refcounted (cpu_hotplug_disabled++/--,
>> with a WARN on underflow), so they have to be balanced. This flag collapses them to
>> exactly one outstanding disable per SNP-active window, because the disable and enable
>> sites are not reached a symmetric number of times:
>
> Well, why aren't they?
>
> Why isn't a simple design where on SNP init hotplug is disabled - *exactly*
> one call to cpu_hotplug_disable() and on SNP shutdown hotplug is reenabled
> again - also exactly one call.
>
> I know why...
>
It can be that simple, and flag-free, by following the SNP_EN state:
- cpu_hotplug_disable() when SNP_EN is programmed: in snp_prepare(), before snp_enable().
- cpu_hotplug_enable() when SNP_EN is cleared: in snp_shutdown(), after the firmware clears
it on X86_SNP_SHUTDOWN.
SNP_EN is set only by snp_enable() in snp_prepare() (gated by online == present),
and only the firmware clears it. So:
- snp_prepare() programs SNP_EN and disables hotplug on the same path; if it's
called again while SNP_EN is already set (re-init), it bails before the
disable.
- snp_shutdown() runs only on the X86_SNP_SHUTDOWN path, after SNP_EN has been
cleared, and enables hotplug. A legacy shutdown leaves SNP_EN set and does
not call snp_shutdown(), so hotplug correctly stays disabled.
We also have to re-enable cpu hotplug on the init failure paths
(snp_prepare()'s online != present check, and the SNP_INIT_EX / DF_FLUSH failures in
__sev_snp_init_locked()), so a failed init leaves hotplug enabled, as it was before
this support.
The only extra case is a kexec target that boots with SNP_EN already set (legacy
firmware -- on X86_SNP_SHUTDOWN firmware the full shutdown required before kexec
clears SNP_EN, so the target re-inits normally). There snp_prepare() bails, so I
do the disable once at boot in snp_rmptable_init() when SNP_EN is already set.
That and the snp_prepare() disable can't both run -- SNP_EN is either already set
at boot, or it gets programmed by snp_prepare().
No (extra) flag needed.
Thanks,
Ashish
^ permalink raw reply
* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Borislav Petkov @ 2026-06-27 4:41 UTC (permalink / raw)
To: Kalra, Ashish
Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <9d019b55-739d-429c-bb34-ce792e8340b6@amd.com>
On Fri, Jun 26, 2026 at 03:59:34PM -0500, Kalra, Ashish wrote:
> It can be that simple, and flag-free, by following the SNP_EN state:
Maybe. But that doesn't mean that you should not clean things up first where
needed. But I'll do a proper review once the dust from patchsets flying around
settles.
> We also have to re-enable cpu hotplug on the init failure paths
> (snp_prepare()'s online != present check, and the SNP_INIT_EX / DF_FLUSH failures in
> __sev_snp_init_locked()), so a failed init leaves hotplug enabled, as it was before
> this support.
You could also block hotplug for the time being by grabbing cpus_read_lock().
And only when you know you are all clear to disable hotplug, then you can do
that in the end and drop the hotplug lock.
> The only extra case is a kexec target that boots with SNP_EN already set (legacy
> firmware -- on X86_SNP_SHUTDOWN firmware the full shutdown required before kexec
> clears SNP_EN, so the target re-inits normally). There snp_prepare() bails, so I
> do the disable once at boot in snp_rmptable_init() when SNP_EN is already set.
> That and the snp_prepare() disable can't both run -- SNP_EN is either already set
> at boot, or it gets programmed by snp_prepare().
Ok.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply
* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-28 10:33 UTC (permalink / raw)
To: Lorenzo Pieralisi
Cc: Suzuki K Poulose, Steven Price, kvm, kvmarm, Catalin Marinas,
Marc Zyngier, Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <aj6sdzlbxT8D5fnf@lpieralisi>
On 6/27/26 2:44 AM, Lorenzo Pieralisi wrote:
> On Fri, Jun 26, 2026 at 09:43:03PM +1000, Gavin Shan wrote:
>> On 6/26/26 6:47 PM, Suzuki K Poulose wrote:
>>> On 26/06/2026 08:43, Gavin Shan wrote:
>>>> On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
>>>>> On 25/06/2026 14:53, Gavin Shan wrote:
>>>>>> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>>>>>>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>>>>>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>>>>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>>>>
>>>> [...]
>>>>
>>>>>>>>
>>>>>>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
>>>>>>>> that memory slots backed by THP are broken. With THP disabled on the host and
>>>>>>>> other fixes (mentioned in my prevous replies) applied on the top of this (v14)
>>>>>>>> series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
>>>>>>>> fxies on the top.
>>>>>>>>
>>>>>>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git (branch: cca/ latest)
>>>>>>>> [2] https://git.qemu.org/git/qemu.git (branch: cca/ gavin)
>>>>>>>>
>>>>>>>> Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
>>>>>>>
>>>>>>> Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
>>>>>>> issues you pointed out above.
>>>>>>>
>>>>>>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>>>>>>
>>>>>>> We should be able to submit patches by end of June - we shall let you know
>>>>>>> whether we can make something available earlier.
>>>>>>>
>>>>>>
>>>>>> Not sure if there are other known issues in this series. It seems the stage2
>>>>>> page fault handling on the shared space isn't working well. In my test, the
>>>>>> vring (struct vring_desc) of virtio-net-pci is updated by the guest, and the
>>>>>> data isn't seen by QEMU, I'm suspecting if the host-page-frame-number is properly
>>>>>> resolved in the s2 page fault handler for shared (unprotected) space.
>>>>>>
>>>>>> - I rebased Jean's latest qemu branch to the upstream qemu;
>>>>>>
>>>>>> - On the host, which is emulated by qemu/tcg, the THP (transparent huge page) is
>>>>>> disabled.
>>>>>>
>>>>>> - On the guest, I can see the virtio vring (struct vring_desc) is updated. The
>>>>>> S1 page-table entry looks correct because the corresponding physical address
>>>>>> 0x10046880000 is a sane shared (unprotected) space address.
>>>>>>
>>>>>> [ 52.094143] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
>>>>>> [ 52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000, [00000100b983f000 00000640 0002 0001]
>>>>>> [ 52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>>>>>>
>>>>>> - On the host, the s2 page-table-entry is unmapped due to attribute transition (private -> shared).
>>>>>> A subsequent S2 page fault is raised against the adress and the s2 page-table-entry is built.
>>>>>>
>>>>>> [ 109.259077] ====> realm_unmap_shared_range: tracked_unprot_addr=0x10046880000
>>>>>> [ 109.260249] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>>>>>> [ 109.317786] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>>>>>> [ 109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000, esr=0x92000007
>>>>>> [ 109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59, size=0x1000, prot=0xf
>>>>>> [ 109.630331] realm_map_non_secure: ipa=0x10046880000, ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
>>>>>
>>>>> Are you able to correlate the order of the transitions and the Guest
>>>>> access with RMM log ? We haven't seen this from our end. We are aware
>>>>> of permission fault issues with Unprotected IPA when backing the memslot
>>>>> with MAP_PRIVATE areas. But this looks different.
>>>>>
>>>>> Lorenzo, have you run into this ?
>>>>>
>>>>
>>>> It's hard to correlate the order since the logs are collected from two separate
>>>> consoles. For the write permission, I add code to the host where the permission
>>>> is always added for all s2 page faults in the shared space. Otherwise, qemu can
>>>> be killed by -EFAULT or similar error.
>>>
>>> This is the problem. We can't add WRITE permission by default. I believe
>>> you may have MAP_PRIVATE mapping and it has to be mapped as READ only
>>> and on a permission fault, we replace it with a writable page. By
>>> overriding the WRITE permission, you let the guest write to a page
>>> that may not be seen by the VMM.
>>>
>>> We identified this as a bug in the KVM driver in this series (reported
>>> by Lorenzo) and there is a corresponding tf-RMM change that is required
>>> to get this working. So, please could you wait until the next series
>>> when this will be addressed ? Or you could switch to using MAP_SHARED
>>> for the "shared" memory in the memslot.
>>>
>>
>> Exactly. the syntax for MAP_PRIVATE is broken if the write permission is
>> enforced for a read fault in the shared space. In my case, the host page can
>> be the zero page and eventually multiple s2 page-table entries (for multiple
>> unprotected or shared pages) point to the zero page. It's why clearing the
>> 3rd queue (Ctrl queue) also clears the first queue (Rx queue) in my case.
>>
>> Yes, this issue can be avoid by using a shared memory backend in qemu, something
>> like below. With this, I'm able to see virtio-net-pci starts to work...
>>
>> -object memory-backend-ram,id=mem0,size=2G,share=yes
>
> Yes, as Suzuki said that's what we have been fixing. QEmu patches
> will be on the mailing lists very shortly - the KVM/tf-RMM fixes
> to make MAP_PRIVATE work will be included in the next posting.
>
> Feel free to drop your QEmu command line so that I can give it
> a shot and check whether the fixes solve the problem you hit
> (I think so because that's precisely the kind of issue I got
> into when I started debugging THP/MAP_PRIVATE but it is better
> to check).
>
The virtio-net-pci doesn't work with the following command lines. The guest
kernel image is built from upstream kernel (v7.1.rc7).
qemu-system-aarch64 -enable-kvm -object rme-guest,id=rme0, \
-machine virt,gic-version=3,confidential-guest-support=rme0 \
-cpu host,pmu=off \
-smp maxcpus=2,cpus=2,sockets=1,clusters=1,cores=1,threads=2 \
-m 2G -object memory-backend-ram,id=mem0,size=2G \
-numa node,nodeid=0,cpus=0-1,memdev=mem0 \
-serial mon:stdio -monitor none -nographic -nodefaults \
-kernel /mnt/linux/arch/arm64/boot/Image \
-initrd /mnt/buildroot/output/images/rootfs.cpio.xz \
-append earlycon=pl011,mmio,0x10009000000 \
-device pcie-root-port,bus=pcie.0,chassis=1,id=pcie.1 \
-device pcie-root-port,bus=pcie.0,chassis=2,id=pcie.2 \
-device pcie-root-port,bus=pcie.0,chassis=3,id=pcie.3 \
-device pcie-root-port,bus=pcie.0,chassis=4,id=pcie.4 \
-netdev tap,id=tap1,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device virtio-net-pci,bus=pcie.2,netdev=tap1,mac=b8:3f:d2:1d:3e:c0
The virtio-net-pci starts to work with the shareable memory-backend.
-object memory-backend-ram,id=mem0,size=2G,share=yes
Note that THP is disabled on my host.
root@host:~# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
Thanks,
Gavin
> Thanks,
> Lorenzo
>
^ permalink raw reply
page: | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox