Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default\
From: Yan Zhao @ 2026-06-26  0:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aj087H1UWSFxbShR@google.com>

On Thu, Jun 25, 2026 at 07:36:28AM -0700, Sean Christopherson wrote:
> On Thu, Jun 25, 2026, Yan Zhao wrote:
> > On Thu, Jun 25, 2026 at 09:51:01AM +0800, Yan Zhao wrote:
> > > On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> > > > On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > > > > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > > > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > > > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > > > > This means this module parameter only enables per-gmem memory attribute and does
> > > > > > not guarantee that gmem in-place conversion will actually occur.
> > > > 
> > > > KVM module params are pretty much always about what KVM supports, not what is
> > > > guaranteed to happen.
> > > > 
> > > >   - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
> > > >     because maybe the guest never accesses emulated MMIO.
> > > >   - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
> > > >     not to advertise one.
> > > >   - and so on and so forth...
> > > > 
> > > > Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> > > > to "I need to set memory attributes on the guest_memfd instance, not the VM",
> > > > but I don't see that as a big hurdle, certainly not in the long term.  And once
> > > > the VMM code is written, I really do think most people are going to care about
> > > > whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
> > > Sorry, I just saw this mail after posting my reply in [1].
> > > 
> > > I'm ok with gmem_in_place_conversion=true just means KVM supports in-place
> > > conversion, while we can still create VMs with shared memory not from gmem.
> > Or what about "allow_gmem_in_place_conversion" ?
> 
> No, because turning on the param also disallows setting PRIVATE in the VM-scoped
> KVM_SET_MEMORY_ATTRIBUTES ioctl.
> 
> > > Though it still feels a bit odd to require TDX huge pages to depend on
> > > gmem_in_place_conversion=true when shared memory is not currently allocated
> > > from gmem, 
> 
> I fully expect that to be a transient state, and in all likelihood not something
> that is *ever* shipped in production.  Landing TDX hugepages without guest_memfd
> hugepage support is all about avoiding unnecessary serialization of series and
> features that aren't strictly dependent on each other.
> 
> > > it should become more natural over time once gmem supports in-place
> > > conversions for huge page.
> 
> Yes, and I want to prioritize the steady state for end users, not the in-progress
> state for developers.  Once all of this settles out, I fully expect the majority
> of deployments to only support in-place conversion, at which point the end user
> is only going to care whether or not in-place conversion is enabled in KVM, not
> the subtle detail that it's still possible to do out-of-place conversions (and
> that will always hold true, it's not like VMA-based memslots are being deprecated).
> 
> > > Besides my current usage, there may be other scenarios where gmem memory
> > > attributes is preferred without allocating shared memory from gmem.
> > > (e.g., PAGE.ADD from a temp extra shared source memory).
> > > 
> > > For such use cases, I'm concerns that the admins may find it confusing if they
> > > enable gmem_in_place_conversion but still observe extra memory consumptions for
> > > shared memory.
> 
> KVM can help with documentation, but beyond that, it's not KVM's problem to solve.
> If a VMM *and* platform owner chooses to deploy a setup that utilizes out-of-place
> conversions, then it's on the VMM and/or plaform owner to understand and communicate
> the implications to the end user.
Thanks for all the explanations!
Documentation that choosing a different source after enabling
gmem_in_place_conversion is deprecated looks good to me.
 
> And I'm not remotely convinced that prepending allow_ to the param will help
> end users diagnose "unexpected" memory consumption, in quotes because anyone that
> is deploying a stack that utilizes out-of-place conversion absolutely needs to
> understand and plan for the additional memory consumption.  I.e. if the memory
> consumption is "unexpected" to the end user, they likely have far bigger problems.
My first impression of gmem_in_place_conversion=true was that it enforces gmem
in-place conversion. However, it actually only enforces per-gmem private/shared
attribute.
My worry was that people might think it's a kernel bug if userspace can still
have shared memory from other sources after they configured
gmem_in_place_conversion=true.

However, I have no strong opinion if you think gmem_in_place_conversion is good,
and with the above documentation. :)




^ permalink raw reply

* Re: [PATCH v8 23/46] KVM: TDX: Make source page optional for KVM_TDX_INIT_MEM_REGION
From: Yan Zhao @ 2026-06-26  1:17 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Sean Christopherson, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <CAEvNRgH5KOHoemnC9QOn_oK97=KeAH1XuX3ps36-pJ0Fn0aBHQ@mail.gmail.com>

On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
> >> Sean Christopherson <seanjc@google.com> writes:
> >>
> >> > On Tue, Jun 23, 2026, Yan Zhao wrote:
> >> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
> >> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson wrote:
> >> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
> >> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4 Relay wrote:
> >> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> >> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
> >> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
> >> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
> >> >> > > > > @@ -3198,8 +3198,12 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> >> >> > > > >  	if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
> >> >> > > > >  		return -EIO;
> >> >> > > > >
> >> >> > > > > -	if (!src_page)
> >> >> > > > > -		return -EOPNOTSUPP;
> >> >> > > > > +	if (!src_page) {
> >> >> > > > > +		if (!gmem_in_place_conversion)
> >> >> > > > When userspace turns on gmem_in_place_conversion while creating guest_memfd
> >> >> > > > without the MMAP flag, the absence of src_page should still be treated as an
> >> >> > > > error.
> >> >> > >
> >> >> > > Why MMAP?
> >> >> > Hmm, I was showing a scenario that in-place conversion couldn't occur.
> >> >> > I didn't mean that with the MMAP flag, mmap() and user write must occur.
> >> >> >
> >> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"?  Just
> >> >> > > because userspace _can_ mmap() the memory doesn't mean userspace _has_ mmap()'d
> >> >> > > and written memory.  And when write() lands, MMAP wouldn't be necessary to
> >> >> > > initialize the memory.
> >> >> > Do you mean using up-to-date flag as below?
> >> >
> >> > Yes?  I didn't actually look at the implementation details.
> >> >
> >> >> > if (!src_page) {
> >> >> > 	src_page = pfn_to_page(pfn);
> >> >> > 	if (!folio_test_uptodate(page_folio(src_page)))
> >> >> > 		return -EOPNOTSUPP;
> >> >> > }
> >>
> >> Yan is right that with the earlier patch "Zero page while getting pfn",
> >> folio_test_uptodate() here will always return true.
> >>
> >> Actually, this is an alternative fix for the issue Sashiko pointed out
> >> on v7 where userspace can do a populate() (either TDX or SNP) without
> >> first allocating the page, with src_address == NULL, and leak
> >> uninitialized memory into the guest.
> >>
> >> Advantage of using the uptodate check in populate: if the host never
> >> allocates the page, populate doesn't incur zeroing before writing the
> >> page anyway in populate().
> >>
> >> Disadvantage: Both TDX and SNP will have to implement this uptodate
> >> check. guest_memfd can't check centrally because for SNP, for a
> >> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
> >> firmware will zero and there's no leakage of uninitialized host memory?
> > Another disadvantage: the uptodate flag is per-folio. What if the folio
> > is only partially initialized by the userspace especially after huge page is
> > supported?
> >
> 
> Good point on huge pages!
> 
> The uptodate flag on the folio in guest_memfd means "this folio has been
> written to". As of now (before patch at [1]), this happens when
> 
> + folio is zeroed on first use by userspace
> + folio is zeroed on first use of the guest
> + folio is populated
> 
> When huge pages are supported, the folio can't partially be initialized?
> 
> On allocation, if any part is shared, we split the page. The parts are
> separate folios that have their own uptodate flags.
> 
> On splitting, if the huge page is uptodate, the split pages will also be
> uptodate. If the huge page is not uptodate, the split pages won't be
> uptodate, but that's ok since they will be marked uptodate on first use.
> 
> On merging, the non-uptodate parts have to be zeroed and then marked
If that's true, it would be good.

> uptodate. Any parts that are in use would have been marked uptodate
> already, so there's no overwriting data that is in use. I'll need to
> think more about when it's safe to zero.
> 
> I'm still on the fence between the two options
> 
> 1. Using uptodate check in populate to reject src_pages that have never
>    been written to or
> 2. Always zero before populate
2 does not work?
The flow is
1. mmap gmem_fd, make GFN shared, and write initial content.
2. convert GFN to private
3. invoke ioctl to trigger populate.

> but whether the uptodate flag is per-folio or not doesn't affect these
> two options in terms of fixing the leak of uninitialized host memory,
> right?
yes, provided "On merging, the non-uptodate parts have to be zeroed and then
marked uptodate".

> >
> >> >> Another concern with this fix is that:
> >> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always marks the
> >> >> folio uptodate before reaching post_populate().
> >> >>
> >> >> [1] https://lore.kernel.org/all/20260618-gmem-inplace-conversion-v8-21-9d2959357853@google.com/
> >> >>
> >> >> > One concern is that TDX now does not much care about the up-to-date flag since
> >> >> > TDX doesn't rely on the flag to clear pages on conversions.
> >> >> > I'm not sure if the flag can be reliably checked in this case. e.g.,
> >> >> > now the whole folio is marked up-to-date even if only part of it is faulted by
> >> >> > user access.
> >> >> > Ensuring that the up-to-date flag works correctly with huge page support seems
> >> >> > to have more effort than introducing a dedicated flag for TDX.
> >> >> >
> >> >> > > > Additionally, to properly enable in-place copying for the TDX initial memory
> >> >> > > > region, userspace must not only specify source_addr to NULL, but also follow
> >> >> > > > a specific sequence (where steps 1/2/3/7 are required only for in-place copy):
> >> >> > > > 1. create guest_memfd with MMAP flag
> >> >> > > > 2. mmap the guest_memfd.
> >> >> > > > 3. convert the initial memory range to shared.
> >> >> > > > 4. copy initial content to the source page.
> >> >> > > > 5. convert the initial memory range to private
> >> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
> >> >> > > > 7. do not unmap the source backend.
> >> >> > > >
> >> >> > > > So, would it be reasonable to introduce a dedicated flag that allows userspace
> >> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
> >> >> > >
> >> >> > > Why?  It's userspace's responsibility to get the above right.  If userspace fails
> >> >> > > to provide a src_page when it doesn't want in-place copy, that's a userspace bug.
> >>
> >> Yan, is your concern that userspace forgot to update the code and
> >> forgets to provide a src_page, and if we keep the "Zero page while
> > Yes. Previously, it would be rejected after GUP fails.
> >
> 
> I see, didn't realize previously it would be rejected because GUP
> fails. GUP failed because it wasn't faulted into the host?
GUP fails if 0 is not a valid user address.
But GUP would not fail if 0 is a valid address. e.g., in below scenario: 

#include <sys/mman.h>
#include <stdio.h>
int main(void)
{
        void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
        if (p==MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        *(char*)0='Y';
        printf("addr0=%p val=%c\n",p,*(char*)0);
        return 0;
}


> That's kind of orthogonal, I don't think GUP fail leading to rejecting
> populate was meant to help userspace catch these issues. GUP would also
> fail if the user did mmap(), write to it, unmap using
> madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
The original uAPI did not explicitly define 0 as an invalid uaddr. Whether 0 was
rejected depended on whether the user mmap()'d address 0. If 0 was a valid
mapping, populate() could proceed.

commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to populating
guest memory") changed the behavior though. It would return -EOPNOTSUPP for a 0
uaddr.

But if a user configures 0 uaddr as valid, writes to it, and then passes 0 as
source_addr(not from gmem), I'm not sure if it's good for the kernel to silently
treat 0 uaddr as an identifier for in-place copy from the private PFN in gmem.


> >> getting pfn" patch, ends up with the guest silently having a zero page?
> >> I think that would be found quite early in userspace VMM testing...
> > I actually encountered this during testing this patch.
> > I update most code path to follow this sequence. However, still some corner ones
> > for TDVF HOB, which are less obvious and harder to update.
> > The TD just booted up and hang silently.
> >
> 
> I think this is just the life of a close-to-hardware software engineer
> :P no errors, got stuck somewhere, root cause is some unitialized
> thing.
> 
> >> >> > I mean if userspace specifies a NULL source_addr by mistake, it's better for
> >> >> > kernel to detect this mistake, similar to how it validates whether source_addr
> >> >> > is PAGE_ALIGNED.
> >> >
> >> > The alignment case is different.  If userspace provides an unaligned value, KVM
> >> > *can't* do what userspace is asking because hardware and thus KVM only supports
> >> > converting on page boundaries.
> >> >
> >> > For a NULL source, KVM can still do what userspace is asking.  Rejecting userspace's
> >> > request would then be making assumptions about what userspace wants.
> >> >
> >>
> >> Also, +1 on this, what if userspace, knowing that pages are zeroed on
> >> allocation, actually wants to rely on that to get a zero page in the guest?
> > What if 0 uaddr is a valid address? :)
> >
> >> >> > Since userspace already needs to perform additional steps to enable in-place
> >> >> > copy, specifying a dedicated flag to indicate that the NULL source_addr is
> >> >> > intentional seems like a reasonable burden.
> >> >
> >> > I don't see how it adds any value.  I wouldn't be at all surprised if most VMMs
> >> > just wen up with code that does:
> >> >
> >> > 	if (in-place) {
> >> > 		src = NULL;
> >> > 		flags |= KVM_TDX_IN_PLACE_COPY_INITIAL_MEMORY_REGION;
> >> > 	}
> >>

^ permalink raw reply

* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Kalra, Ashish @ 2026-06-26  2:38 UTC (permalink / raw)
  To: K Prateek Nayak, Borislav Petkov
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, Tycho.Andersen, Nathan.Fontenot, ackerleytng,
	jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
	babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
	linux-crypto, kvm, linux-coco
In-Reply-To: <898e378a-cf7c-4310-b439-e28ec0a71338@amd.com>


On 6/25/2026 5:16 PM, K Prateek Nayak wrote:
> Hello Ashish,
> 
> On 6/26/2026 1:12 AM, Kalra, Ashish wrote:
>> Hello Boris,
>>
>> On 6/25/2026 10:02 AM, Borislav Petkov wrote:
>>> On Wed, Jun 24, 2026 at 09:56:49PM +0000, Ashish Kalra wrote:
>>>> +/* Set while SNP has CPU hotplug disabled (kernel-lifetime; survives ccp reload). */
>>>> +static bool snp_cpu_hotplug_disabled;
>>>
>>> Do you really need this?
>>>
>>
>> Yes.
>>
>> cpu_hotplug_disable()/cpu_hotplug_enable() are refcounted (cpu_hotplug_disabled++/--,
>> with a WARN on underflow), so they have to be balanced. This flag collapses them to
>> exactly one outstanding disable per SNP-active window, because the disable and enable
>> sites are not reached a symmetric number of times:
>>>   - On firmware without SNP_X86_SHUTDOWN_SUPPORTED, __sev_snp_shutdown_locked() does not
>>   call snp_shutdown() (it's gated on data.x86_snp_shutdown), so SNP stays enabled in
>>   hardware — SNP_EN stays set and hotplug stays disabled — while sev->snp_initialized is
>>   cleared. Re-init after that is routine, the SNP ioctls self-bracket init and shutdown
>>   (e.g. SNP_COMMIT, SNP_SET_CONFIG, SNP_VLEK_LOAD):
>>
>>   if (!sev->snp_initialized)
>>           snp_move_to_init_state(...);   /* -> __sev_snp_init_locked -> snp_prepare() */
>>   ... SNP_CMD ...
>>   if (shutdown_required)
>>           __sev_snp_shutdown_locked(...);
>>   - So whenever SNP isn't already initialized (psp_init_on_probe off, or after a prior
>>   legacy shutdown), every such ioctl does init -> command -> legacy shutdown. Each init
>>   reaches snp_prepare() with SNP_EN already set, and the disable now sits at the top of
>>   snp_prepare(), so it fires on every cycle. Without this flag that keeps bumping
>>   cpu_hotplug_disabled while the legacy shutdown never re-enables — hotplug ends up stuck
>>   disabled. This flag makes all but the first disable a no-op.
>>  
>>   - Also, importantly, kvm-amd module reload on legacy firmware is the same pattern: 
>>   unload leaves SNP_EN set, reload re-inits.)
> 
> Looking at snp_prepare(), we have an early-bailout for
> 
>     rdmsrq(MSR_AMD64_SYSCFG, val);
>     if (val & MSR_AMD64_SYSCFG_SNP_EN)
>          return;
> 
> Does executing SHUTDOWN command lead to the firmware clearing SNP_EN in
> SYSCFG on all CPUS?

Yes, in case of X86_SNP_SHUTDOWN (available if firmware supports X86SnpShutdown feature)
SNP is disabled on all cores by clearing SYSCFG[SNPEn] bit.

If X86_SNP_SHUTDOWN is set to 1, the firmware clears the SYSCFG[SNPEn] bit in each core. 

But, in case of legacy SNP shutdown, SNP_EN bit is not cleared and so SNP remains enabled.

> 
> If SNP_EN remains set (and Linux can't clear it since it is
> "Write-1-only" bit), then a subsequent snp_prepare() will skip setting
> SYSCFG if it sees SNP_EN on local CPU.
> 
> It can so happen that we enable hotlpug at shutdown, CPUs come online
> without setting SNP_EN in SYSCFG, subsequent snp_prepare() runs on a CPU
> where SNP_EN is still set and skips configuring it for the CPUs that
> don't have it set, and we'll be in a pickle still.
> 
> The comment above that bailout saying "this can happen in case of kexec
> boot" makes me believe that SNP_EN remains set until a full system
> reset.
> 
> The only safe way to do this is to ensure all possible CPUs are online
> during snp_prepare() and do snp_enable() regardless of whether local CPU
> has SNP_EN or not.
> 
> Am I missing something?
> 

The piece that makes the early bailout safe is the disable this patch adds:
hotplug is disabled while SNP is active, so the online set can't change under an
active SNP. snp_prepare() already requires online == present, so at a successful
init every present CPU gets SNP_EN, and because hotplug is then disabled none
can leave or rejoin without it. So whenever the bailout is hit with SNP active,
every online CPU already has SNP_EN:

  - kexec: SNP_EN is already set on all CPUs by the previous kernel.
  - re-init while SNP is still active (e.g. after a legacy SNP_SHUTDOWN that
  leaves SNP_EN set): hotplug was disabled the whole time, so the online set is
  unchanged and all of them still have SNP_EN.

The only way a CPU can be online without SNP_EN is when SNP is not active --
i.e. after an SNP_INIT failure, where this patch re-enables hotplug. That is
deliberately the same as the behavior before this support existed (hotplug was
never disabled then), and it is benign: SNP_EN only gates RMP checks, the RMP
itself is initialized by SNP_INIT, so on a failed init the RMP is all-zeroes --
every entry is in the default HV-owned state, no page is assigned, no check ever blocks
and snp_initialized stays false, so no SNP guest can be created.
Nothing is enforced and nothing is protected.

So I've kept snp_prepare()'s existing bailout / snp_enable() behavior unchanged;
what this patch adds is disabling hotplug while SNP is active, which is what
actually closes the window (a CPU coming online without SNP_EN while SNP is
live). That window -- and the SNP_EN-stays-set-on-failure situation -- already
exist in today's code, this patch constrains the dangerous (active) case and
otherwise matches current behavior.

(On the v9 placement specifically: I'm moving the disable into snp_prepare()
ahead of SNP_EN in the next version; in v9 it sits after SNP_INIT, which leaves
the window you originally pointed out.)

Thanks,
Ashish

>>
>>   - On the enable side it avoids an unbalanced cpu_hotplug_enable() when the teardown/failure
>>   paths run without an outstanding disable (e.g. shutdown of a never-fully-initialized SNP).
>>
>> So it's not redundant with cpu_hotplug_disabled — it tracks whether the outstanding disable
>> belongs to this SNP-active window in this kernel, which keeps the single disable/enable
>> balanced across the asymmetric legacy-vs-full SNP teardown paths and re-init.

^ permalink raw reply

* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: K Prateek Nayak @ 2026-06-26  4:01 UTC (permalink / raw)
  To: Kalra, Ashish, Borislav Petkov
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, Tycho.Andersen, Nathan.Fontenot, ackerleytng,
	jackyli, pgonda, rientjes, jacobhxu, xin, pawan.kumar.gupta,
	babu.moger, dyoung, nikunj, john.allen, darwi, linux-kernel,
	linux-crypto, kvm, linux-coco
In-Reply-To: <b9777de5-a6fa-418c-92d2-89c095e91837@amd.com>

Hello Ashish,

On 6/26/2026 8:08 AM, Kalra, Ashish wrote:
>> Looking at snp_prepare(), we have an early-bailout for
>>
>>     rdmsrq(MSR_AMD64_SYSCFG, val);
>>     if (val & MSR_AMD64_SYSCFG_SNP_EN)
>>          return;
>>
>> Does executing SHUTDOWN command lead to the firmware clearing SNP_EN in
>> SYSCFG on all CPUS?
> 
> Yes, in case of X86_SNP_SHUTDOWN (available if firmware supports X86SnpShutdown feature)
> SNP is disabled on all cores by clearing SYSCFG[SNPEn] bit.
> 
> If X86_SNP_SHUTDOWN is set to 1, the firmware clears the SYSCFG[SNPEn] bit in each core. 
> 
> But, in case of legacy SNP shutdown, SNP_EN bit is not cleared and so SNP remains enabled.

Ah! That was the bit I was missing. Thanks a ton for clarifying.

> 
>>
>> If SNP_EN remains set (and Linux can't clear it since it is
>> "Write-1-only" bit), then a subsequent snp_prepare() will skip setting
>> SYSCFG if it sees SNP_EN on local CPU.
>>
>> It can so happen that we enable hotlpug at shutdown, CPUs come online
>> without setting SNP_EN in SYSCFG, subsequent snp_prepare() runs on a CPU
>> where SNP_EN is still set and skips configuring it for the CPUs that
>> don't have it set, and we'll be in a pickle still.
>>
>> The comment above that bailout saying "this can happen in case of kexec
>> boot" makes me believe that SNP_EN remains set until a full system
>> reset.
>>
>> The only safe way to do this is to ensure all possible CPUs are online
>> during snp_prepare() and do snp_enable() regardless of whether local CPU
>> has SNP_EN or not.
>>
>> Am I missing something?
>>
> 
> The piece that makes the early bailout safe is the disable this patch adds:
> hotplug is disabled while SNP is active, so the online set can't change under an
> active SNP. snp_prepare() already requires online == present, so at a successful
> init every present CPU gets SNP_EN,

How is this enforced? AFAICT, on_each_cpu(snp_enable) will only covers
the online CPUs and there could be CPUs that have been offlined before
that right?

> and because hotplug is then disabled none
> can leave or rejoin without it. So whenever the bailout is hit with SNP active,
> every online CPU already has SNP_EN:
> 
>   - kexec: SNP_EN is already set on all CPUs by the previous kernel.

There is a catch here: you can have offline CPUs during the previous boot
(say you have maxcpus=8 in your cmdline), and then you kexec with a different
kernel / cmdline that brings online a bunch more CPUs.

SNP_EN will only be set for a subset of then with the legacy SNP_INIT and
if snp_prepare() runs on those legacy CPUs, you still skip setting it for
the ones that don't have SNP_EN set.

Is that case covered somehow or is it a non-issue?

>   - re-init while SNP is still active (e.g. after a legacy SNP_SHUTDOWN that
>   leaves SNP_EN set): hotplug was disabled the whole time, so the online set is
>   unchanged and all of them still have SNP_EN.
> 
> The only way a CPU can be online without SNP_EN is when SNP is not active --
> i.e. after an SNP_INIT failure, where this patch re-enables hotplug. That is
> deliberately the same as the behavior before this support existed (hotplug was
> never disabled then), and it is benign: SNP_EN only gates RMP checks, the RMP
> itself is initialized by SNP_INIT, so on a failed init the RMP is all-zeroes --
> every entry is in the default HV-owned state, no page is assigned, no check ever blocks
> and snp_initialized stays false, so no SNP guest can be created.
> Nothing is enforced and nothing is protected.
> 
> So I've kept snp_prepare()'s existing bailout / snp_enable() behavior unchanged;
> what this patch adds is disabling hotplug while SNP is active, which is what
> actually closes the window (a CPU coming online without SNP_EN while SNP is
> live). That window -- and the SNP_EN-stays-set-on-failure situation -- already
> exist in today's code, this patch constrains the dangerous (active) case and
> otherwise matches current behavior.

Ack! Just that one small bit up above bothers me but other than that,
doing it in snp_prepare() should be good.

This is all new to me so thanks a ton for answering my queries.

> 
> (On the v9 placement specifically: I'm moving the disable into snp_prepare()
> ahead of SNP_EN in the next version; in v9 it sits after SNP_INIT, which leaves
> the window you originally pointed out.)
-- 
Thanks and Regards,
Prateek


^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-26  7:43 UTC (permalink / raw)
  To: Suzuki K Poulose, Lorenzo Pieralisi
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
	linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
	Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <98d2a0f3-b831-466a-8212-5bcf97ad9d8b@arm.com>

On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
> On 25/06/2026 14:53, Gavin Shan wrote:
>> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:

[...]

>>>>
>>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
>>>> that memory slots backed by THP are broken. With THP disabled on the host and
>>>> other fixes (mentioned in my prevous replies) applied on the top of this (v14)
>>>> series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
>>>> fxies on the top.
>>>>
>>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git  (branch: cca/ latest)
>>>> [2] https://git.qemu.org/git/qemu.git                (branch: cca/gavin)
>>>>
>>>> Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
>>>
>>> Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
>>> issues you pointed out above.
>>>
>>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>>
>>> We should be able to submit patches by end of June - we shall let you know
>>> whether we can make something available earlier.
>>>
>>
>> Not sure if there are other known issues in this series. It seems the stage2
>> page fault handling on the shared space isn't working well. In my test, the
>> vring (struct vring_desc) of virtio-net-pci is updated by the guest, and the
>> data isn't seen by QEMU, I'm suspecting if the host-page-frame-number is properly
>> resolved in the s2 page fault handler for shared (unprotected) space.
>>
>> - I rebased Jean's latest qemu branch to the upstream qemu;
>>
>> - On the host, which is emulated by qemu/tcg, the THP (transparent huge page) is
>>    disabled.
>>
>> - On the guest, I can see the virtio vring (struct vring_desc) is updated. The
>>    S1 page-table entry looks correct because the corresponding physical address
>>    0x10046880000 is a sane shared (unprotected) space address.
>>
>>    [   52.094143] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
>>    [   52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000, [00000100b983f000  00000640  0002  0001]
>>    [   52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>>
>> - On the host, the s2 page-table-entry is unmapped due to attribute transition (private -> shared).
>>    A subsequent S2 page fault is raised against the adress and the s2 page-table-entry is built.
>>
>>    [  109.259077] ====> realm_unmap_shared_range: tracked_unprot_addr=0x10046880000
>>    [  109.260249] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>>    [  109.317786] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>>    [  109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000, esr=0x92000007
>>    [  109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59, size=0x1000, prot=0xf
>>    [  109.630331] realm_map_non_secure: ipa=0x10046880000, ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
> 
> Are you able to correlate the order of the transitions and the Guest
> access with RMM log ? We haven't seen this from our end. We are aware
> of permission fault issues with Unprotected IPA when backing the memslot
> with MAP_PRIVATE areas. But this looks different.
> 
> Lorenzo, have you run into this ?
> 

It's hard to correlate the order since the logs are collected from two separate
consoles. For the write permission, I add code to the host where the permission
is always added for all s2 page faults in the shared space. Otherwise, qemu can
be killed by -EFAULT or similar error.

There are more findings after more experiments: this virtio-net-pci device has 3
queues or vrings (Rx/Tx/Ctrl). The Rx/Tx/Ctrl queue are populated in order one after
one. In the guest kernel, I intentionally write fixed data (0x0123456789abcdef) to
the first 8 bytes of the queue when it gets populated, and stop the guest at random
points to see if the data is gone. I found that the data written to Rx/Tx queue are
lost after Ctrl queue is allocated.

The data written to Rx/Tx queue is lost if the guest stops (B). The data written to
Rx/Tx queue isn't lost if the guest stops at (A). I can see the pattern (0x0123...cdef)
by dumping the physcial memory through 'pmemsave' command in qemu.

DMA allocation
==============
dma_alloc_coherent
   dma_alloc_attrs
     dma_direct_alloc
       __dma_direct_alloc_pages
       dma_set_decrypted                    // (A) No data lost if being stopped here for the Ctrl queue
       memset(ret, 0, size)                 // (B) Data lost after being stopped after memset() for the Ctrl queue

The memset() on the Ctrl queue should trigger a stage2 page fault. It seems the page
fault enforces the shared pages for Rx/Tx queue to be dropped? I need to add more
debugging code and track it down.

> Suzuki
> 
> 
>>
>> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 isn't seen. All the
>>    data in that adress are zeros.
>>
>>    ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38, queue_index=0x0, vq->vring.num=0x100
>>    virtqueue_split_pop: last_avail_idx=0x0, head=0x0
>>    address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0, buf=0xffffeee34880, len=0x10
>>    address_space_read_cached_slow: cache: ptr=0x0, xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
>>    address_space_read_cached_slow: translated to mr=<mach-virt.ram>, mr_addr=0x6880000, l=0x10
>>    flatview_read_continue_step: mr=<mach-virt.ram>, host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
>>    virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 - 00000000
>>    qemu-system-aarch64: virtio: zero sized buffers are not allowed
>>
>>
Thanks,
Gavin


^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Suzuki K Poulose @ 2026-06-26  8:47 UTC (permalink / raw)
  To: Gavin Shan, Lorenzo Pieralisi
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
	linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
	Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <8da87878-2a5d-478a-a280-60dbed7ad1b9@redhat.com>

On 26/06/2026 08:43, Gavin Shan wrote:
> On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
>> On 25/06/2026 14:53, Gavin Shan wrote:
>>> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>>>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
> 
> [...]
> 
>>>>>
>>>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, 
>>>>> and found
>>>>> that memory slots backed by THP are broken. With THP disabled on 
>>>>> the host and
>>>>> other fixes (mentioned in my prevous replies) applied on the top of 
>>>>> this (v14)
>>>>> series, I'm able to boot a realm guest with rebased QEMU series 
>>>>> [2], plus more
>>>>> fxies on the top.
>>>>>
>>>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git  (branch: cca/ 
>>>>> latest)
>>>>> [2] https://git.qemu.org/git/qemu.git                (branch: cca/ 
>>>>> gavin)
>>>>>
>>>>> Lorenzo, You may be saying there is someone making QEMU to support 
>>>>> ARM/CCA?
>>>>
>>>> Mathieu and I are working on that yes and with Steven/Suzuki to fix 
>>>> the THP
>>>> issues you pointed out above.
>>>>
>>>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>>>
>>>> We should be able to submit patches by end of June - we shall let 
>>>> you know
>>>> whether we can make something available earlier.
>>>>
>>>
>>> Not sure if there are other known issues in this series. It seems the 
>>> stage2
>>> page fault handling on the shared space isn't working well. In my 
>>> test, the
>>> vring (struct vring_desc) of virtio-net-pci is updated by the guest, 
>>> and the
>>> data isn't seen by QEMU, I'm suspecting if the host-page-frame-number 
>>> is properly
>>> resolved in the s2 page fault handler for shared (unprotected) space.
>>>
>>> - I rebased Jean's latest qemu branch to the upstream qemu;
>>>
>>> - On the host, which is emulated by qemu/tcg, the THP (transparent 
>>> huge page) is
>>>    disabled.
>>>
>>> - On the guest, I can see the virtio vring (struct vring_desc) is 
>>> updated. The
>>>    S1 page-table entry looks correct because the corresponding 
>>> physical address
>>>    0x10046880000 is a sane shared (unprotected) space address.
>>>
>>>    [   52.094143] software IO TLB: Memory encryption is active and 
>>> system is using DMA bounce buffers
>>>    [   52.289746] virtqueue_add_desc_split: 
>>> desc[0]@0xffff000006880000, [00000100b983f000  00000640  0002  0001]
>>>    [   52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>>>
>>> - On the host, the s2 page-table-entry is unmapped due to attribute 
>>> transition (private -> shared).
>>>    A subsequent S2 page fault is raised against the adress and the s2 
>>> page-table-entry is built.
>>>
>>>    [  109.259077] ====> realm_unmap_shared_range: 
>>> tracked_unprot_addr=0x10046880000
>>>    [  109.260249] realm_unmap_shared_range: unmapped shared range at 
>>> 0x10046880000
>>>    [  109.317786] realm_unmap_shared_range: unmapped shared range at 
>>> 0x10046880000
>>>    [  109.629939] ====> kvm_handle_guest_abort: 
>>> fault_ipa=0x10046880000, esr=0x92000007
>>>    [  109.630245] realm_map_non_secure: ipa=0x10046880000, 
>>> pfn=0xb8b59, size=0x1000, prot=0xf
>>>    [  109.630331] realm_map_non_secure: ipa=0x10046880000, 
>>> ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
>>
>> Are you able to correlate the order of the transitions and the Guest
>> access with RMM log ? We haven't seen this from our end. We are aware
>> of permission fault issues with Unprotected IPA when backing the memslot
>> with MAP_PRIVATE areas. But this looks different.
>>
>> Lorenzo, have you run into this ?
>>
> 
> It's hard to correlate the order since the logs are collected from two 
> separate
> consoles. For the write permission, I add code to the host where the 
> permission
> is always added for all s2 page faults in the shared space. Otherwise, 
> qemu can
> be killed by -EFAULT or similar error.

This is the problem. We can't add WRITE permission by default. I believe
you may have MAP_PRIVATE mapping and it has to be mapped as READ only
and on a permission fault, we replace it with a writable page. By
overriding the WRITE permission, you let the guest write to a page
that may not be seen by the VMM.

We identified this as a bug in the KVM driver in this series (reported
by Lorenzo) and there is a corresponding tf-RMM change that is required
to get this working. So, please could you wait until the next series
when this will be addressed ? Or you could switch to using MAP_SHARED
for the "shared" memory in the memslot.


Suzuki


> 
> There are more findings after more experiments: this virtio-net-pci 
> device has 3
> queues or vrings (Rx/Tx/Ctrl). The Rx/Tx/Ctrl queue are populated in 
> order one after
> one. In the guest kernel, I intentionally write fixed data 
> (0x0123456789abcdef) to
> the first 8 bytes of the queue when it gets populated, and stop the 
> guest at random
> points to see if the data is gone. I found that the data written to Rx/ 
> Tx queue are
> lost after Ctrl queue is allocated.
> 
> The data written to Rx/Tx queue is lost if the guest stops (B). The data 
> written to
> Rx/Tx queue isn't lost if the guest stops at (A). I can see the pattern 
> (0x0123...cdef)
> by dumping the physcial memory through 'pmemsave' command in qemu.
> 
> DMA allocation
> ==============
> dma_alloc_coherent
>    dma_alloc_attrs
>      dma_direct_alloc
>        __dma_direct_alloc_pages
>        dma_set_decrypted                    // (A) No data lost if being 
> stopped here for the Ctrl queue
>        memset(ret, 0, size)                 // (B) Data lost after being 
> stopped after memset() for the Ctrl queue
> 
> The memset() on the Ctrl queue should trigger a stage2 page fault. It 
> seems the page
> fault enforces the shared pages for Rx/Tx queue to be dropped? I need to 
> add more
> debugging code and track it down.
> 
>> Suzuki
>>
>>
>>>
>>> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 
>>> isn't seen. All the
>>>    data in that adress are zeros.
>>>
>>>    ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38, 
>>> queue_index=0x0, vq->vring.num=0x100
>>>    virtqueue_split_pop: last_avail_idx=0x0, head=0x0
>>>    address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0, 
>>> buf=0xffffeee34880, len=0x10
>>>    address_space_read_cached_slow: cache: ptr=0x0, 
>>> xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
>>>    address_space_read_cached_slow: translated to mr=<mach-virt.ram>, 
>>> mr_addr=0x6880000, l=0x10
>>>    flatview_read_continue_step: mr=<mach-virt.ram>, 
>>> host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
>>>    virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 
>>> - 00000000
>>>    qemu-system-aarch64: virtio: zero sized buffers are not allowed
>>>
>>>
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Suzuki K Poulose @ 2026-06-26  9:04 UTC (permalink / raw)
  To: Gavin Shan, Lorenzo Pieralisi
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
	linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
	Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <9482dfbc-4d96-47ba-a615-f4ba0bda833f@arm.com>

On 26/06/2026 09:47, Suzuki K Poulose wrote:
> On 26/06/2026 08:43, Gavin Shan wrote:
>> On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
>>> On 25/06/2026 14:53, Gavin Shan wrote:
>>>> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>>>>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>>>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>>
>> [...]
>>
>>>>>>
>>>>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, 
>>>>>> and found
>>>>>> that memory slots backed by THP are broken. With THP disabled on 
>>>>>> the host and
>>>>>> other fixes (mentioned in my prevous replies) applied on the top 
>>>>>> of this (v14)
>>>>>> series, I'm able to boot a realm guest with rebased QEMU series 
>>>>>> [2], plus more
>>>>>> fxies on the top.
>>>>>>
>>>>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git  (branch: cca/ 
>>>>>> latest)
>>>>>> [2] https://git.qemu.org/git/qemu.git                (branch: cca/ 
>>>>>> gavin)
>>>>>>
>>>>>> Lorenzo, You may be saying there is someone making QEMU to support 
>>>>>> ARM/CCA?
>>>>>
>>>>> Mathieu and I are working on that yes and with Steven/Suzuki to fix 
>>>>> the THP
>>>>> issues you pointed out above.
>>>>>
>>>>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>>>>
>>>>> We should be able to submit patches by end of June - we shall let 
>>>>> you know
>>>>> whether we can make something available earlier.
>>>>>
>>>>
>>>> Not sure if there are other known issues in this series. It seems 
>>>> the stage2
>>>> page fault handling on the shared space isn't working well. In my 
>>>> test, the
>>>> vring (struct vring_desc) of virtio-net-pci is updated by the guest, 
>>>> and the
>>>> data isn't seen by QEMU, I'm suspecting if the host-page-frame- 
>>>> number is properly
>>>> resolved in the s2 page fault handler for shared (unprotected) space.
>>>>
>>>> - I rebased Jean's latest qemu branch to the upstream qemu;
>>>>
>>>> - On the host, which is emulated by qemu/tcg, the THP (transparent 
>>>> huge page) is
>>>>    disabled.
>>>>
>>>> - On the guest, I can see the virtio vring (struct vring_desc) is 
>>>> updated. The
>>>>    S1 page-table entry looks correct because the corresponding 
>>>> physical address
>>>>    0x10046880000 is a sane shared (unprotected) space address.
>>>>
>>>>    [   52.094143] software IO TLB: Memory encryption is active and 
>>>> system is using DMA bounce buffers
>>>>    [   52.289746] virtqueue_add_desc_split: 
>>>> desc[0]@0xffff000006880000, [00000100b983f000  00000640  0002  0001]
>>>>    [   52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>>>>
>>>> - On the host, the s2 page-table-entry is unmapped due to attribute 
>>>> transition (private -> shared).
>>>>    A subsequent S2 page fault is raised against the adress and the 
>>>> s2 page-table-entry is built.
>>>>
>>>>    [  109.259077] ====> realm_unmap_shared_range: 
>>>> tracked_unprot_addr=0x10046880000
>>>>    [  109.260249] realm_unmap_shared_range: unmapped shared range at 
>>>> 0x10046880000
>>>>    [  109.317786] realm_unmap_shared_range: unmapped shared range at 
>>>> 0x10046880000
>>>>    [  109.629939] ====> kvm_handle_guest_abort: 
>>>> fault_ipa=0x10046880000, esr=0x92000007
>>>>    [  109.630245] realm_map_non_secure: ipa=0x10046880000, 
>>>> pfn=0xb8b59, size=0x1000, prot=0xf
>>>>    [  109.630331] realm_map_non_secure: ipa=0x10046880000, 
>>>> ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
>>>
>>> Are you able to correlate the order of the transitions and the Guest
>>> access with RMM log ? We haven't seen this from our end. We are aware
>>> of permission fault issues with Unprotected IPA when backing the memslot
>>> with MAP_PRIVATE areas. But this looks different.
>>>
>>> Lorenzo, have you run into this ?
>>>
>>
>> It's hard to correlate the order since the logs are collected from two 
>> separate
>> consoles. For the write permission, I add code to the host where the 
>> permission
>> is always added for all s2 page faults in the shared space. Otherwise, 
>> qemu can
>> be killed by -EFAULT or similar error.
> 
> This is the problem. We can't add WRITE permission by default. I believe
> you may have MAP_PRIVATE mapping and it has to be mapped as READ only
> and on a permission fault, we replace it with a writable page. By
> overriding the WRITE permission, you let the guest write to a page
> that may not be seen by the VMM.
> 
> We identified this as a bug in the KVM driver in this series (reported
> by Lorenzo) and there is a corresponding tf-RMM change that is required
> to get this working. So, please could you wait until the next series
> when this will be addressed ? Or you could switch to using MAP_SHARED
> for the "shared" memory in the memslot.

For the record, you need something like this :

--- a/arch/arm64/kvm/rmi.c
+++ b/arch/arm64/kvm/rmi.c

@@ -838,8 +838,17 @@ int realm_map_non_secure(struct realm *realm,
                 if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
                         /* Create missing RTTs and retry */
                         int level = RMI_RETURN_INDEX(ret);
+                       int req_level = find_map_level(realm, ipa, ipa_top);
+
+                       /*
+                        * There already exists a mapping at the level. 
May be
+                        * we are relaxing a permission for the given 
range ?
+                        */
+                       if (level >= req_level) {
+                               realm_unmap_shared_range(kvm, ipa, 
ipa_top, false);
+                               continue;
+                       }

-                       WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
                         ret = realm_create_rtt_levels(realm, ipa, level,
  
KVM_PGTABLE_LAST_LEVEL,
                                                       memcache);


Thanks
Suzuki


> 
> 
> Suzuki
> 
> 
>>
>> There are more findings after more experiments: this virtio-net-pci 
>> device has 3
>> queues or vrings (Rx/Tx/Ctrl). The Rx/Tx/Ctrl queue are populated in 
>> order one after
>> one. In the guest kernel, I intentionally write fixed data 
>> (0x0123456789abcdef) to
>> the first 8 bytes of the queue when it gets populated, and stop the 
>> guest at random
>> points to see if the data is gone. I found that the data written to 
>> Rx/ Tx queue are
>> lost after Ctrl queue is allocated.
>>
>> The data written to Rx/Tx queue is lost if the guest stops (B). The 
>> data written to
>> Rx/Tx queue isn't lost if the guest stops at (A). I can see the 
>> pattern (0x0123...cdef)
>> by dumping the physcial memory through 'pmemsave' command in qemu.
>>
>> DMA allocation
>> ==============
>> dma_alloc_coherent
>>    dma_alloc_attrs
>>      dma_direct_alloc
>>        __dma_direct_alloc_pages
>>        dma_set_decrypted                    // (A) No data lost if 
>> being stopped here for the Ctrl queue
>>        memset(ret, 0, size)                 // (B) Data lost after 
>> being stopped after memset() for the Ctrl queue
>>
>> The memset() on the Ctrl queue should trigger a stage2 page fault. It 
>> seems the page
>> fault enforces the shared pages for Rx/Tx queue to be dropped? I need 
>> to add more
>> debugging code and track it down.
>>
>>> Suzuki
>>>
>>>
>>>>
>>>> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 
>>>> isn't seen. All the
>>>>    data in that adress are zeros.
>>>>
>>>>    ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38, 
>>>> queue_index=0x0, vq->vring.num=0x100
>>>>    virtqueue_split_pop: last_avail_idx=0x0, head=0x0
>>>>    address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0, 
>>>> buf=0xffffeee34880, len=0x10
>>>>    address_space_read_cached_slow: cache: ptr=0x0, 
>>>> xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
>>>>    address_space_read_cached_slow: translated to mr=<mach-virt.ram>, 
>>>> mr_addr=0x6880000, l=0x10
>>>>    flatview_read_continue_step: mr=<mach-virt.ram>, 
>>>> host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
>>>>    virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 
>>>> - 00000000
>>>>    qemu-system-aarch64: virtio: zero sized buffers are not allowed
>>>>
>>>>
>> Thanks,
>> Gavin
>>
> 


^ permalink raw reply

* Re: [RFC PATCH 09/15] x86/virt/tdx: Add interface to generate a Quote
From: Peter Fang @ 2026-06-26  9:58 UTC (permalink / raw)
  To: Edgecombe, Rick P
  Cc: kas@kernel.org, djbw@kernel.org, yilun.xu@linux.intel.com,
	x86@kernel.org, Xu, Yilun, Duan, Zhenzhong,
	baolu.lu@linux.intel.com, Li, Xiaoyao,
	linux-kernel@vger.kernel.org, Mehta, Sohil, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev
In-Reply-To: <20260614112948.GG3200182@pedri>

On Sun, Jun 14, 2026 at 04:29:48AM -0700, Peter Fang wrote:
> > 
> > > +		goto out;
> > > +
> > > +	/*
> > > +	 * The quote buffer is a shared resource, so use it only for the
> > > +	 * SEAMCALL and copy the data out as soon as possible.
> > > +	 */
> > > +	quote_dup = kvmemdup(quote_data.buf, out_len, GFP_KERNEL);
> > 
> > So at init time we allocate a vmalloc for the quote and pre-populate the
> > hpa_list. Then we use it every time and copy the contents to a new vmalloc.
> > Would it really be that hard to keep the hpa list allocation around, do a
> > vmalloc here and update the pfn list. Then do get quote on that and pass back
> > the vmalloc we just allocated? Just feels like global reuse way has extra pieces
> > in it. Compared to the whole quoting operation, this vmalloc_to_pfn() loop is
> > probably not very expensive.
> 
> Hm interesting idea. But a Quote buffer could be close to 4MB in the worst
> case. Let's say max_quote_size is 3MB, that's 768 vmalloc_to_pfn() calls
> each time... That sounds a bit excessive right?
> 
> The extra bits mainly come from using kvmemdup() I think. Having to use
> kvfree() on it does feel a bit annoying but that was the tradeoff I
> made...
> 

I didn't change the design in the v2 series, but after some discussion
with Yilun, I now realize this is actually a really good idea. Having a
static buffer does complicate the code a bit. I'll probably try doing
the allocation on the KVM side and just do vmalloc_to_pfn() calls each
time. Thanks.

> > 
> 

^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-26 11:43 UTC (permalink / raw)
  To: Suzuki K Poulose, Lorenzo Pieralisi
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
	linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
	Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <9482dfbc-4d96-47ba-a615-f4ba0bda833f@arm.com>

On 6/26/26 6:47 PM, Suzuki K Poulose wrote:
> On 26/06/2026 08:43, Gavin Shan wrote:
>> On 6/26/26 1:58 AM, Suzuki K Poulose wrote:
>>> On 25/06/2026 14:53, Gavin Shan wrote:
>>>> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>>>>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>>>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>>
>> [...]
>>
>>>>>>
>>>>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
>>>>>> that memory slots backed by THP are broken. With THP disabled on the host and
>>>>>> other fixes (mentioned in my prevous replies) applied on the top of this (v14)
>>>>>> series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
>>>>>> fxies on the top.
>>>>>>
>>>>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git  (branch: cca/ latest)
>>>>>> [2] https://git.qemu.org/git/qemu.git                (branch: cca/ gavin)
>>>>>>
>>>>>> Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
>>>>>
>>>>> Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
>>>>> issues you pointed out above.
>>>>>
>>>>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>>>>
>>>>> We should be able to submit patches by end of June - we shall let you know
>>>>> whether we can make something available earlier.
>>>>>
>>>>
>>>> Not sure if there are other known issues in this series. It seems the stage2
>>>> page fault handling on the shared space isn't working well. In my test, the
>>>> vring (struct vring_desc) of virtio-net-pci is updated by the guest, and the
>>>> data isn't seen by QEMU, I'm suspecting if the host-page-frame-number is properly
>>>> resolved in the s2 page fault handler for shared (unprotected) space.
>>>>
>>>> - I rebased Jean's latest qemu branch to the upstream qemu;
>>>>
>>>> - On the host, which is emulated by qemu/tcg, the THP (transparent huge page) is
>>>>    disabled.
>>>>
>>>> - On the guest, I can see the virtio vring (struct vring_desc) is updated. The
>>>>    S1 page-table entry looks correct because the corresponding physical address
>>>>    0x10046880000 is a sane shared (unprotected) space address.
>>>>
>>>>    [   52.094143] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
>>>>    [   52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000, [00000100b983f000  00000640  0002  0001]
>>>>    [   52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
>>>>
>>>> - On the host, the s2 page-table-entry is unmapped due to attribute transition (private -> shared).
>>>>    A subsequent S2 page fault is raised against the adress and the s2 page-table-entry is built.
>>>>
>>>>    [  109.259077] ====> realm_unmap_shared_range: tracked_unprot_addr=0x10046880000
>>>>    [  109.260249] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>>>>    [  109.317786] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>>>>    [  109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000, esr=0x92000007
>>>>    [  109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59, size=0x1000, prot=0xf
>>>>    [  109.630331] realm_map_non_secure: ipa=0x10046880000, ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004
>>>
>>> Are you able to correlate the order of the transitions and the Guest
>>> access with RMM log ? We haven't seen this from our end. We are aware
>>> of permission fault issues with Unprotected IPA when backing the memslot
>>> with MAP_PRIVATE areas. But this looks different.
>>>
>>> Lorenzo, have you run into this ?
>>>
>>
>> It's hard to correlate the order since the logs are collected from two separate
>> consoles. For the write permission, I add code to the host where the permission
>> is always added for all s2 page faults in the shared space. Otherwise, qemu can
>> be killed by -EFAULT or similar error.
> 
> This is the problem. We can't add WRITE permission by default. I believe
> you may have MAP_PRIVATE mapping and it has to be mapped as READ only
> and on a permission fault, we replace it with a writable page. By
> overriding the WRITE permission, you let the guest write to a page
> that may not be seen by the VMM.
> 
> We identified this as a bug in the KVM driver in this series (reported
> by Lorenzo) and there is a corresponding tf-RMM change that is required
> to get this working. So, please could you wait until the next series
> when this will be addressed ? Or you could switch to using MAP_SHARED
> for the "shared" memory in the memslot.
> 

Exactly. the syntax for MAP_PRIVATE is broken if the write permission is
enforced for a read fault in the shared space. In my case, the host page can
be the zero page and eventually multiple s2 page-table entries (for multiple
unprotected or shared pages) point to the zero page. It's why clearing the
3rd queue (Ctrl queue) also clears the first queue (Rx queue) in my case.

Yes, this issue can be avoid by using a shared memory backend in qemu, something
like below. With this, I'm able to see virtio-net-pci starts to work...

     -object memory-backend-ram,id=mem0,size=2G,share=yes

Thanks,
Gavin

> 
> Suzuki
> 
> 
>>
>> There are more findings after more experiments: this virtio-net-pci device has 3
>> queues or vrings (Rx/Tx/Ctrl). The Rx/Tx/Ctrl queue are populated in order one after
>> one. In the guest kernel, I intentionally write fixed data (0x0123456789abcdef) to
>> the first 8 bytes of the queue when it gets populated, and stop the guest at random
>> points to see if the data is gone. I found that the data written to Rx/ Tx queue are
>> lost after Ctrl queue is allocated.
>>
>> The data written to Rx/Tx queue is lost if the guest stops (B). The data written to
>> Rx/Tx queue isn't lost if the guest stops at (A). I can see the pattern (0x0123...cdef)
>> by dumping the physcial memory through 'pmemsave' command in qemu.
>>
>> DMA allocation
>> ==============
>> dma_alloc_coherent
>>    dma_alloc_attrs
>>      dma_direct_alloc
>>        __dma_direct_alloc_pages
>>        dma_set_decrypted                    // (A) No data lost if being stopped here for the Ctrl queue
>>        memset(ret, 0, size)                 // (B) Data lost after being stopped after memset() for the Ctrl queue
>>
>> The memset() on the Ctrl queue should trigger a stage2 page fault. It seems the page
>> fault enforces the shared pages for Rx/Tx queue to be dropped? I need to add more
>> debugging code and track it down.
>>
>>> Suzuki
>>>
>>>
>>>>
>>>> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 isn't seen. All the
>>>>    data in that adress are zeros.
>>>>
>>>>    ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38, queue_index=0x0, vq->vring.num=0x100
>>>>    virtqueue_split_pop: last_avail_idx=0x0, head=0x0
>>>>    address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0, buf=0xffffeee34880, len=0x10
>>>>    address_space_read_cached_slow: cache: ptr=0x0, xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
>>>>    address_space_read_cached_slow: translated to mr=<mach-virt.ram>, mr_addr=0x6880000, l=0x10
>>>>    flatview_read_continue_step: mr=<mach-virt.ram>, host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
>>>>    virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 - 00000000
>>>>    qemu-system-aarch64: virtio: zero sized buffers are not allowed
>>>>
>>>>
>> Thanks,
>> Gavin
>>
> 


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox