Linux Confidential Computing Development


* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: David Hildenbrand (Arm) @ 2026-06-25 12:57 UTC (permalink / raw)
  To: Sean Christopherson, Ackerley Tng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, jmattson,
	jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajx3vmNPRf-M9kR6@google.com>

On 6/25/26 02:35, Sean Christopherson wrote:
> On Wed, Jun 24, 2026, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>>>
>>> Under what circumstances does this happen,
>>
>> It happened 100% of the time in selftests. Perhaps it's because in the
>> selftests the pages are almost always freshly allocated and so the
>> lru_add fbatch isn't full yet? (and that the host isn't super busy so
>> lru_add fbatch doesn't get drained yet).
> 
> I chatted with Ackerley about this.  What I wanted to understand is why guest_memfd
> pages were getting put onto per-CPU batches for lru_add(), given that guest_memfd
> pages are unevictable.  The answer (assuming I read the code right), is that
> lruvec_add_folio() updates stats and other per-lru metadata for the unevictable
> lru, and does so under a per-lru lock.  I.e. we don't want to skip that stuff
> entirely.

Hm. Our pages don't participate in any LRU activity (including
isolation+migration). Isolation+migration would only apply once we'd support
page migration.

But yes, secretmem also does it like that: filemap_add_folio() will call
folio_add_lru().

Traditionally we used the unevictable LRU only for mlock purposes.

But yeah, there are "unevictable" stats involved ....

> 
> One thought I had, to avoid the IPIs that draining all per-CPU caches requires,
> was to disallow putting guest_memfd pages in folio batches, e.g. by hacking
> something into folio_may_be_lru_cached().  But due to taking a per-lru lock,
> that would penalize the relatively hot path and definitely common operation of
> faulting in guest memory.  On the other hand, memory conversion is already a
> relatively slow operation and is relatively uncommon compared to page faults,
> (and likely very uncommon for real world setups).  I.e. having to drain all
> caches if conversion isn't safe penalizes a relatively slow, relatively uncommon
> path.

Yeah, the lru_add_drain_all is rather messy.

We have similar code in

collect_longterm_unpinnable_folios(), where we first try a lru_add_drain(), to
then escalate to a lru_add_drain_all().

Maybe we could factor that (suboptimal code) out to not have to reinvent the
same thing multiple times?
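
To make the "try the cheap drain first, then escalate" idea concrete, here is a
minimal userspace model in C. It is only a sketch: it is not kernel code, and
struct folio_model, model_drain_local(), model_drain_all() and
model_drain_until_expected() are hypothetical stand-ins, not existing helpers.
The assumption is that each per-CPU batch can hold one extra reference on a
folio, and that the caller knows how many references it expects to remain.

```c
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical model of a folio whose refcount may be inflated by batches. */
struct folio_model {
	int refcount;                  /* all references, including batch refs */
	bool batched[MODEL_NR_CPUS];   /* true if that CPU's batch holds a ref */
};

/* Stand-in for a local drain: drops only the given CPU's batched reference. */
static void model_drain_local(struct folio_model *f, int cpu)
{
	if (f->batched[cpu]) {
		f->batched[cpu] = false;
		f->refcount--;
	}
}

/* Stand-in for a global drain: drops every CPU's batched reference. */
static void model_drain_all(struct folio_model *f)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		model_drain_local(f, cpu);
}

/*
 * Escalating drain. Returns:
 *    0  refcount already matched, nothing drained
 *    1  the local drain was enough
 *    2  the global (expensive) drain was needed
 *   -1  refcount still elevated, i.e. someone else holds a reference
 */
static int model_drain_until_expected(struct folio_model *f, int cpu,
				      int expected)
{
	if (f->refcount == expected)
		return 0;

	model_drain_local(f, cpu);
	if (f->refcount == expected)
		return 1;

	model_drain_all(f);
	return f->refcount == expected ? 2 : -1;
}
```

In this model the expensive step only runs when the reference sits in another
CPU's batch, which is the property a shared helper would want to preserve.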

-- 
Cheers,

David


* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Gavin Shan @ 2026-06-25 13:53 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Suzuki K Poulose,
	Zenghui Yu, linux-arm-kernel, linux-kernel, Joey Gouly,
	Alexandru Elisei, Christoffer Dall, Fuad Tabba, linux-coco,
	Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <aiLes2ecZSr17UwZ@lpieralisi>

On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>>>
>>> [...]
>>>
>>>>> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
>>>>> +			 kvm_pfn_t pfn, unsigned long map_size,
>>>>> +			 enum kvm_pgtable_prot prot,
>>>>> +			 struct kvm_mmu_memory_cache *memcache)
>>>>> +{
>>>>> +	struct realm *realm = &kvm->arch.realm;
>>>>> +
>>>>> +	/*
>>>>> +	 * Write permission is required for now even though it's possible to
>>>>> +	 * map unprotected pages (granules) as read-only. It's impossible to
>>>>> +	 * map protected pages (granules) as read-only.
>>>>> +	 */
>>>>> +	if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
>>>>> +		return -EFAULT;
>>>>> +
>>>>
>>>> I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set in @prot
>>>> if the stage2 fault is raised due to memory read. With -EFAULT returned to VMM
>>>> (e.g. QEMU), the vCPU continuous execution is stopped and system won't be
>>>> working any more.
>>>>
>>>>> +	ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
>>>>> +	if (!kvm_realm_is_private_address(realm, ipa))
>>>>> +		return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
>>>>> +					    memcache);
>>>>> +
>>>>> +	return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
>>>>> +}
>>>>> +
>>>>>     static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>>>>>     {
>>>>>     	switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
>>>>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>>>     	bool write_fault, exec_fault;
>>>>>     	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>>>>>     	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>>>> -	struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>>>>> +	struct kvm_vcpu *vcpu = s2fd->vcpu;
>>>>> +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>>>>> +	gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>>>>>     	unsigned long mmu_seq;
>>>>>     	struct page *page;
>>>>> -	struct kvm *kvm = s2fd->vcpu->kvm;
>>>>> +	struct kvm *kvm = vcpu->kvm;
>>>>>     	void *memcache;
>>>>>     	kvm_pfn_t pfn;
>>>>>     	gfn_t gfn;
>>>>>     	int ret;
>>>>> -	memcache = get_mmu_memcache(s2fd->vcpu);
>>>>> -	ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>>>>> +	if (kvm_is_realm(vcpu->kvm)) {
>>>>> +		/* check for memory attribute mismatch */
>>>>> +		bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
>>>>> +		/*
>>>>> +		 * For Realms, the shared address is an alias of the private
>>>>> +		 * PA with the top bit set. Thus if the fault address matches
>>>>> +		 * the GPA then it is the private alias.
>>>>> +		 */
>>>>> +		bool is_priv_fault = (gpa == s2fd->fault_ipa);
>>>>> +
>>>>> +		if (is_priv_gfn != is_priv_fault) {
>>>>> +			kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>>> +						      kvm_is_write_fault(vcpu),
>>>>> +						      false,
>>>>> +						      is_priv_fault);
>>>>> +			/*
>>>>> +			 * KVM_EXIT_MEMORY_FAULT requires an return code of
>>>>> +			 * -EFAULT, see the API documentation
>>>>> +			 */
>>>>> +			return -EFAULT;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	memcache = get_mmu_memcache(vcpu);
>>>>> +	ret = topup_mmu_memcache(vcpu, memcache);
>>>>>     	if (ret)
>>>>>     		return ret;
>>>>>     	if (s2fd->nested)
>>>>>     		gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>>>>>     	else
>>>>> -		gfn = s2fd->fault_ipa >> PAGE_SHIFT;
>>>>> +		gfn = gpa >> PAGE_SHIFT;
>>>>> -	write_fault = kvm_is_write_fault(s2fd->vcpu);
>>>>> -	exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
>>>>> +	write_fault = kvm_is_write_fault(vcpu);
>>>>> +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>>>>>     	VM_WARN_ON_ONCE(write_fault && exec_fault);
>>>>> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>>>     	ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
>>>>>     	if (ret) {
>>>>> -		kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, PAGE_SIZE,
>>>>> +		kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>>>     					      write_fault, exec_fault, false);
>>>>>     		return ret;
>>>>>     	}
>>>>> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
>>>>>     	kvm_fault_lock(kvm);
>>>>>     	if (mmu_invalidate_retry(kvm, mmu_seq)) {
>>>>>     		ret = -EAGAIN;
>>>>> -		goto out_unlock;
>>>>> +		goto out_release_page;
>>>>> +	}
>>>>> +
>>>>> +	if (kvm_is_realm(kvm)) {
>>>>> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
>>>>> +				    PAGE_SIZE, KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W, memcache);
>>>>> +		goto out_release_page;
>>>>>     	}
>>>>>     	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
>>>>>     						 __pfn_to_phys(pfn), prot,
>>>>>     						 memcache, flags);
>>>>> -out_unlock:
>>>>> +out_release_page:
>>>>>     	kvm_release_faultin_page(kvm, page, !!ret, prot & KVM_PGTABLE_PROT_W);
>>>>>     	kvm_fault_unlock(kvm);
>>>>> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const struct kvm_s2_fault_desc *s2fd,
>>>>>     	 * mapping size to ensure we find the right PFN and lay down the
>>>>>     	 * mapping in the right place.
>>>>>     	 */
>>>>> -	s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> PAGE_SHIFT;
>>>>> +	s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
>>>>>     	s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>>>>> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
>>>>>     		prot &= ~KVM_NV_GUEST_MAP_SZ;
>>>>>     		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
>>>>>     								 prot, flags);
>>>>> +	} else if (kvm_is_realm(kvm)) {
>>>>> +		ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
>>>>> +				    prot, memcache);
>>>>>     	} else {
>>>>>     		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
>>>>>     							 __pfn_to_phys(pfn), prot,
>>>>
>>>> For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for the sake of
>>>> huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
>>>> transparent_hugepage_adjust() to be aligned with huge page size. If the
>>>> adjustment happened in transparent_hugepage_adjust(), we need to align
>>>> s2fd->fault_ipa down to the huge page size either.
>>>
>>> All of the above + some RMM changes are needed to get QEmu VMM going
>>> with anon pages guest memory backing - currently testing various
>>> configurations in the background.
>>>
>>
>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and found
>> that memory slots backed by THP are broken. With THP disabled on the host and
>> other fixes (mentioned in my previous replies) applied on the top of this (v14)
>> series, I'm able to boot a realm guest with rebased QEMU series [2], plus more
>> fixes on the top.
>>
>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git  (branch: cca/latest)
>> [2] https://git.qemu.org/git/qemu.git                (branch: cca/gavin)
>>
>> Lorenzo, You may be saying there is someone making QEMU to support ARM/CCA?
> 
> Mathieu and I are working on that yes and with Steven/Suzuki to fix the THP
> issues you pointed out above.
> 
>> If so, I'm not sure if there is a QEMU repository for me to try?
> 
> We should be able to submit patches by end of June - we shall let you know
> whether we can make something available earlier.
> 

I'm not sure whether there are other known issues in this series, but stage-2
page fault handling for the shared space doesn't seem to work well. In my test,
the guest updates the vring (struct vring_desc) of virtio-net-pci, but QEMU
doesn't see the data. I suspect the host page frame number isn't resolved
properly in the S2 page fault handler for the shared (unprotected) space.

- I rebased Jean's latest qemu branch to the upstream qemu;

- On the host, which is emulated by qemu/tcg, the THP (transparent huge page) is
   disabled.

- On the guest, I can see the virtio vring (struct vring_desc) is updated. The
   S1 page-table entry looks correct because the corresponding physical address
   0x10046880000 is a sane shared (unprotected) space address.

   [   52.094143] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
   [   52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000, [00000100b983f000  00000640  0002  0001]
   [   52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000

- On the host, the S2 page-table entry is unmapped due to the attribute transition (private -> shared).
   A subsequent S2 page fault is raised against the address and the S2 page-table entry is built.

   [  109.259077] ====> realm_unmap_shared_range: tracked_unprot_addr=0x10046880000
   [  109.260249] realm_unmap_shared_range: unmapped shared range at 0x10046880000
   [  109.317786] realm_unmap_shared_range: unmapped shared range at 0x10046880000
   [  109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000, esr=0x92000007
   [  109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59, size=0x1000, prot=0xf
   [  109.630331] realm_map_non_secure: ipa=0x10046880000, ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004

- On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 isn't seen. All the
   data at that address is zero.

   ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38, queue_index=0x0, vq->vring.num=0x100
   virtqueue_split_pop: last_avail_idx=0x0, head=0x0
   address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0, buf=0xffffeee34880, len=0x10
   address_space_read_cached_slow: cache: ptr=0x0, xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
   address_space_read_cached_slow: translated to mr=<mach-virt.ram>, mr_addr=0x6880000, l=0x10
   flatview_read_continue_step: mr=<mach-virt.ram>, host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
   virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 - 00000000
   qemu-system-aarch64: virtio: zero sized buffers are not allowed


Thanks,
Gavin



* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Sean Christopherson @ 2026-06-25 14:36 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, jmattson, jthoughton, michael.roth, oupton, pankaj.gupta,
	qperret, rick.p.edgecombe, rientjes, shivankg, steven.price,
	tabba, willy, wyihan, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <aj0Jf30PS2f7x1nt@yzhao56-desk.sh.intel.com>

On Thu, Jun 25, 2026, Yan Zhao wrote:
> On Thu, Jun 25, 2026 at 09:51:01AM +0800, Yan Zhao wrote:
> > On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> > > On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > > > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > > > This means this module parameter only enables per-gmem memory attribute and does
> > > > > not guarantee that gmem in-place conversion will actually occur.
> > > 
> > > KVM module params are pretty much always about what KVM supports, not what is
> > > guaranteed to happen.
> > > 
> > >   - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
> > >     because maybe the guest never accesses emulated MMIO.
> > >   - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
> > >     not to advertise one.
> > >   - and so on and so forth...
> > > 
> > > Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> > > to "I need to set memory attributes on the guest_memfd instance, not the VM",
> > > but I don't see that as a big hurdle, certainly not in the long term.  And once
> > > the VMM code is written, I really do think most people are going to care about
> > > whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
> > Sorry, I just saw this mail after posting my reply in [1].
> > 
> > I'm ok with gmem_in_place_conversion=true just means KVM supports in-place
> > conversion, while we can still create VMs with shared memory not from gmem.
> Or what about "allow_gmem_in_place_conversion" ?

No, because turning on the param also disallows setting PRIVATE in the VM-scoped
KVM_SET_MEMORY_ATTRIBUTES ioctl.

> > Though it still feels a bit odd to require TDX huge pages to depend on
> > gmem_in_place_conversion=true when shared memory is not currently allocated
> > from gmem, 

I fully expect that to be a transient state, and in all likelihood not something
that is *ever* shipped in production.  Landing TDX hugepages without guest_memfd
hugepage support is all about avoiding unnecessary serialization of series and
features that aren't strictly dependent on each other.

> > it should become more natural over time once gmem supports in-place
> > conversions for huge page.

Yes, and I want to prioritize the steady state for end users, not the in-progress
state for developers.  Once all of this settles out, I fully expect the majority
of deployments to only support in-place conversion, at which point the end user
is only going to care whether or not in-place conversion is enabled in KVM, not
the subtle detail that it's still possible to do out-of-place conversions (and
that will always hold true, it's not like VMA-based memslots are being deprecated).

> > Besides my current usage, there may be other scenarios where gmem memory
> > attributes are preferred without allocating shared memory from gmem
> > (e.g., PAGE.ADD from a temp extra shared source memory).
> > 
> > For such use cases, I'm concerned that the admins may find it confusing if they
> > enable gmem_in_place_conversion but still observe extra memory consumption for
> > shared memory.

KVM can help with documentation, but beyond that, it's not KVM's problem to solve.
If a VMM *and* platform owner chooses to deploy a setup that utilizes out-of-place
conversions, then it's on the VMM and/or platform owner to understand and communicate
the implications to the end user.

And I'm not remotely convinced that prepending allow_ to the param will help
end users diagnose "unexpected" memory consumption, in quotes because anyone that
is deploying a stack that utilizes out-of-place conversion absolutely needs to
understand and plan for the additional memory consumption.  I.e. if the memory
consumption is "unexpected" to the end user, they likely have far bigger problems.


* Re: [PATCH v9 3/6] x86/sev: Disable CPU hotplug while SNP is active
From: Borislav Petkov @ 2026-06-25 15:02 UTC (permalink / raw)
  To: Ashish Kalra
  Cc: tglx, mingo, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb, pbonzini, aik,
	Michael.Roth, KPrateek.Nayak, Tycho.Andersen, Nathan.Fontenot,
	ackerleytng, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <ba146ca15b7f76eee386c8c073fb3f1cc36e5781.1782336473.git.ashish.kalra@amd.com>

On Wed, Jun 24, 2026 at 09:56:49PM +0000, Ashish Kalra wrote:
> +/* Set while SNP has CPU hotplug disabled (kernel-lifetime; survives ccp reload). */
> +static bool snp_cpu_hotplug_disabled;

Do you really need this?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v8 18/46] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Sean Christopherson @ 2026-06-25 15:40 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, liam, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <6ed7d12a-c3a1-4572-8385-754e6d5b8b44@kernel.org>

On Thu, Jun 25, 2026, David Hildenbrand (Arm) wrote:
> On 6/25/26 02:35, Sean Christopherson wrote:
> > One thought I had, to avoid the IPIs that draining all per-CPU caches requires,
> > was to disallow putting guest_memfd pages in folio batches, e.g. by hacking
> > something into folio_may_be_lru_cached().  But due to taking a per-lru lock,
> > that would penalize the relatively hot path and definitely common operation of
> > faulting in guest memory.  On the other hand, memory conversion is already a
> > relatively slow operation and is relatively uncommon compared to page faults,
> > (and likely very uncommon for real world setups).  I.e. having to drain all
> > caches if conversion isn't safe penalizes a relatively slow, relatively uncommon
> > path.
> 
> Yeah, the lru_add_drain_all is rather messy.
> 
> We have similar code in
> 
> collect_longterm_unpinnable_folios(), where we first try a lru_add_drain(), to
> then escalate to a lru_add_drain_all().
> 
> Maybe we could factor that (suboptimal code) out to not have to reinvent the
> same thing multiple times?

As discussed in the guest_memfd call, we should do this straightaway, i.e. instead
of merging this series as-is, so that we don't export lru_add_drain_all() only to
drop the export a kernel or two later, and can instead export the helper to drain
any batches for a folio (or set of folios/pages).



* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Suzuki K Poulose @ 2026-06-25 15:58 UTC (permalink / raw)
  To: Gavin Shan, Lorenzo Pieralisi
  Cc: Steven Price, kvm, kvmarm, Catalin Marinas, Marc Zyngier,
	Will Deacon, James Morse, Oliver Upton, Zenghui Yu,
	linux-arm-kernel, linux-kernel, Joey Gouly, Alexandru Elisei,
	Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <1e39094f-7fa3-4ef1-be54-53d7a8643506@redhat.com>

On 25/06/2026 14:53, Gavin Shan wrote:
> On 6/6/26 12:35 AM, Lorenzo Pieralisi wrote:
>> On Fri, Jun 05, 2026 at 06:11:11PM +1000, Gavin Shan wrote:
>>> On 6/5/26 5:28 PM, Lorenzo Pieralisi wrote:
>>>> On Fri, Jun 05, 2026 at 04:23:15PM +1000, Gavin Shan wrote:
>>>>
>>>> [...]
>>>>
>>>>>> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
>>>>>> +             kvm_pfn_t pfn, unsigned long map_size,
>>>>>> +             enum kvm_pgtable_prot prot,
>>>>>> +             struct kvm_mmu_memory_cache *memcache)
>>>>>> +{
>>>>>> +    struct realm *realm = &kvm->arch.realm;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * Write permission is required for now even though it's 
>>>>>> possible to
>>>>>> +     * map unprotected pages (granules) as read-only. It's 
>>>>>> impossible to
>>>>>> +     * map protected pages (granules) as read-only.
>>>>>> +     */
>>>>>> +    if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
>>>>>> +        return -EFAULT;
>>>>>> +
>>>>>
>>>>> I'm a bit concerned with this. We don't have KVM_PGTABLE_PROT_W set 
>>>>> in @prot
>>>>> if the stage2 fault is raised due to memory read. With -EFAULT 
>>>>> returned to VMM
>>>>> (e.g. QEMU), the vCPU continuous execution is stopped and system 
>>>>> won't be
>>>>> working any more.
>>>>>
>>>>>> +    ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
>>>>>> +    if (!kvm_realm_is_private_address(realm, ipa))
>>>>>> +        return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
>>>>>> +                        memcache);
>>>>>> +
>>>>>> +    return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
>>>>>> +}
>>>>>> +
>>>>>>     static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>>>>>>     {
>>>>>>         switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma- 
>>>>>> >vm_page_prot))) {
>>>>>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct 
>>>>>> kvm_s2_fault_desc *s2fd)
>>>>>>         bool write_fault, exec_fault;
>>>>>>         enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>>>>>>         enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>>>>> -    struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>>>>>> +    struct kvm_vcpu *vcpu = s2fd->vcpu;
>>>>>> +    struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>>>>>> +    gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>>>>>>         unsigned long mmu_seq;
>>>>>>         struct page *page;
>>>>>> -    struct kvm *kvm = s2fd->vcpu->kvm;
>>>>>> +    struct kvm *kvm = vcpu->kvm;
>>>>>>         void *memcache;
>>>>>>         kvm_pfn_t pfn;
>>>>>>         gfn_t gfn;
>>>>>>         int ret;
>>>>>> -    memcache = get_mmu_memcache(s2fd->vcpu);
>>>>>> -    ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>>>>>> +    if (kvm_is_realm(vcpu->kvm)) {
>>>>>> +        /* check for memory attribute mismatch */
>>>>>> +        bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> 
>>>>>> PAGE_SHIFT);
>>>>>> +        /*
>>>>>> +         * For Realms, the shared address is an alias of the private
>>>>>> +         * PA with the top bit set. Thus if the fault address 
>>>>>> matches
>>>>>> +         * the GPA then it is the private alias.
>>>>>> +         */
>>>>>> +        bool is_priv_fault = (gpa == s2fd->fault_ipa);
>>>>>> +
>>>>>> +        if (is_priv_gfn != is_priv_fault) {
>>>>>> +            kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>>>> +                              kvm_is_write_fault(vcpu),
>>>>>> +                              false,
>>>>>> +                              is_priv_fault);
>>>>>> +            /*
>>>>>> +             * KVM_EXIT_MEMORY_FAULT requires an return code of
>>>>>> +             * -EFAULT, see the API documentation
>>>>>> +             */
>>>>>> +            return -EFAULT;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    memcache = get_mmu_memcache(vcpu);
>>>>>> +    ret = topup_mmu_memcache(vcpu, memcache);
>>>>>>         if (ret)
>>>>>>             return ret;
>>>>>>         if (s2fd->nested)
>>>>>>             gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>>>>>>         else
>>>>>> -        gfn = s2fd->fault_ipa >> PAGE_SHIFT;
>>>>>> +        gfn = gpa >> PAGE_SHIFT;
>>>>>> -    write_fault = kvm_is_write_fault(s2fd->vcpu);
>>>>>> -    exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
>>>>>> +    write_fault = kvm_is_write_fault(vcpu);
>>>>>> +    exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>>>>>>         VM_WARN_ON_ONCE(write_fault && exec_fault);
>>>>>> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct 
>>>>>> kvm_s2_fault_desc *s2fd)
>>>>>>         ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, 
>>>>>> &page, NULL);
>>>>>>         if (ret) {
>>>>>> -        kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd- 
>>>>>> >fault_ipa, PAGE_SIZE,
>>>>>> +        kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>>>>>                               write_fault, exec_fault, false);
>>>>>>             return ret;
>>>>>>         }
>>>>>> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct 
>>>>>> kvm_s2_fault_desc *s2fd)
>>>>>>         kvm_fault_lock(kvm);
>>>>>>         if (mmu_invalidate_retry(kvm, mmu_seq)) {
>>>>>>             ret = -EAGAIN;
>>>>>> -        goto out_unlock;
>>>>>> +        goto out_release_page;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (kvm_is_realm(kvm)) {
>>>>>> +        ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
>>>>>> +                    PAGE_SIZE, KVM_PGTABLE_PROT_R | 
>>>>>> KVM_PGTABLE_PROT_W, memcache);
>>>>>> +        goto out_release_page;
>>>>>>         }
>>>>>>         ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd- 
>>>>>> >fault_ipa, PAGE_SIZE,
>>>>>>                              __pfn_to_phys(pfn), prot,
>>>>>>                              memcache, flags);
>>>>>> -out_unlock:
>>>>>> +out_release_page:
>>>>>>         kvm_release_faultin_page(kvm, page, !!ret, prot & 
>>>>>> KVM_PGTABLE_PROT_W);
>>>>>>         kvm_fault_unlock(kvm);
>>>>>> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const 
>>>>>> struct kvm_s2_fault_desc *s2fd,
>>>>>>          * mapping size to ensure we find the right PFN and lay 
>>>>>> down the
>>>>>>          * mapping in the right place.
>>>>>>          */
>>>>>> -    s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) 
>>>>>> >> PAGE_SHIFT;
>>>>>> +    s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd- 
>>>>>> >fault_ipa, s2vi->vma_pagesize)) >> PAGE_SHIFT;
>>>>>>         s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>>>>>> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct 
>>>>>> kvm_s2_fault_desc *s2fd,
>>>>>>             prot &= ~KVM_NV_GUEST_MAP_SZ;
>>>>>>             ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, 
>>>>>> gfn_to_gpa(gfn),
>>>>>>                                      prot, flags);
>>>>>> +    } else if (kvm_is_realm(kvm)) {
>>>>>> +        ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
>>>>>> +                    prot, memcache);
>>>>>>         } else {
>>>>>>             ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, 
>>>>>> gfn_to_gpa(gfn), mapping_size,
>>>>>>                                  __pfn_to_phys(pfn), prot,
>>>>>
>>>>> For the case kvm_is_realm(), need we adjust 's2fd->fault_ipa' for 
>>>>> the sake of
>>>>> huge pages. In kvm_s2_fault_map(), @gfn and @pfn may have been 
>>>>> adjusted by
>>>>> transparent_hugepage_adjust() to be aligned with huge page size. If 
>>>>> the
>>>>> adjustment happened in transparent_hugepage_adjust(), we need to align
>>>>> s2fd->fault_ipa down to the huge page size either.
>>>>
>>>> All of the above + some RMM changes are needed to get QEmu VMM going
>>>> with anon pages guest memory backing - currently testing various
>>>> configurations in the background.
>>>>
>>>
>>> I tried to rebase Jean's latest QEMU series [1] to upstream QEMU, and 
>>> found
>>> that memory slots backed by THP are broken. With THP disabled on the 
>>> host and
>>> other fixes (mentioned in my previous replies) applied on the top of 
>>> this (v14)
>>> series, I'm able to boot a realm guest with rebased QEMU series [2], 
>>> plus more
>>> fixes on the top.
>>>
>>> [1] https://git.codelinaro.org/linaro/dcap/qemu.git  (branch: cca/ 
>>> latest)
>>> [2] https://git.qemu.org/git/qemu.git                (branch: cca/gavin)
>>>
>>> Lorenzo, You may be saying there is someone making QEMU to support 
>>> ARM/CCA?
>>
>> Mathieu and I are working on that yes and with Steven/Suzuki to fix 
>> the THP
>> issues you pointed out above.
>>
>>> If so, I'm not sure if there is a QEMU repository for me to try?
>>
>> We should be able to submit patches by end of June - we shall let you 
>> know
>> whether we can make something available earlier.
>>
> 
> Not sure if there are other known issues in this series. It seems the stage2
> page fault handling on the shared space isn't working well. In my test, the
> vring (struct vring_desc) of virtio-net-pci is updated by the guest, but the
> data isn't seen by QEMU. I suspect the host page frame number isn't properly
> resolved in the s2 page fault handler for shared (unprotected) space.
> 
> - I rebased Jean's latest qemu branch to the upstream qemu;
> 
> - On the host, which is emulated by qemu/tcg, THP (transparent huge page) is
>    disabled.
> 
> - On the guest, I can see the virtio vring (struct vring_desc) is updated. The
>    S1 page-table entry looks correct because the corresponding physical address
>    0x10046880000 is a sane shared (unprotected) space address.
> 
>    [   52.094143] software IO TLB: Memory encryption is active and system is using DMA bounce buffers
>    [   52.289746] virtqueue_add_desc_split: desc[0]@0xffff000006880000, [00000100b983f000  00000640  0002  0001]
>    [   52.432150] PTE 0x00e8010046880707 at address 0xffff000006880000
> 
> - On the host, the s2 page-table entry is unmapped due to the attribute
>    transition (private -> shared). A subsequent S2 page fault is raised
>    against the address and the s2 page-table entry is built.
> 
>    [  109.259077] ====> realm_unmap_shared_range: tracked_unprot_addr=0x10046880000
>    [  109.260249] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>    [  109.317786] realm_unmap_shared_range: unmapped shared range at 0x10046880000
>    [  109.629939] ====> kvm_handle_guest_abort: fault_ipa=0x10046880000, esr=0x92000007
>    [  109.630245] realm_map_non_secure: ipa=0x10046880000, pfn=0xb8b59, size=0x1000, prot=0xf
>    [  109.630331] realm_map_non_secure: ipa=0x10046880000, ipa_top=0x10046881000, flags=0x1e0001, range_desc=0xb8b59004

Are you able to correlate the order of the transitions and the guest
access with the RMM log? We haven't seen this on our end. We are aware
of permission fault issues with Unprotected IPA when backing the memslot
with MAP_PRIVATE areas. But this looks different.

Lorenzo, have you run into this?

Suzuki
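
On the huge-page question quoted further up (aligning s2fd->fault_ipa down
when transparent_hugepage_adjust() has aligned @gfn and @pfn), the arithmetic
can be sketched as below. This is only an illustration, not code from the
series: align_down_pow2() is a hypothetical helper name, and the 2MB block
size is an assumption about the huge page size in use.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch series): round an IPA down to
 * the start of the block that contains it. 'size' must be a power of two,
 * e.g. 4KB for a page or 2MB for a PMD-sized huge page (assumed sizes).
 */
static uint64_t align_down_pow2(uint64_t addr, uint64_t size)
{
	/* A power of two has exactly one bit set. */
	assert(size != 0 && (size & (size - 1)) == 0);

	/* Clear the offset bits below the block size. */
	return addr & ~(size - 1);
}
```

With the fault IPA from the logs in this thread, a 2MB mapping would start at
align_down_pow2(0x10046880000, 0x200000) == 0x10046800000, while a 4KB mapping
leaves the address unchanged.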


> 
> - On QEMU, the updated vring (struct vring_desc) at GPA 0x46880000 isn't seen.
>    All the data at that address are zeros.
> 
>    ====> virtqueue_split_pop: vdev=<virtio-net>, sz=0x38, queue_index=0x0, vq->vring.num=0x100
>    virtqueue_split_pop: last_avail_idx=0x0, head=0x0
>    address_space_read_cached_slow: cache@0xffff1c036440, addr=0x0, buf=0xffffeee34880, len=0x10
>    address_space_read_cached_slow: cache: ptr=0x0, xlat=0x10046880000, len=0x1000, mrs=<realm-dma-region>, is_write=no
>    address_space_read_cached_slow: translated to mr=<mach-virt.ram>, mr_addr=0x6880000, l=0x10
>    flatview_read_continue_step: mr=<mach-virt.ram>, host=0xffff23e00000, mr_addr=0x6880000, ram_ptr=0xffff2a680000
>    virtqueue_split_pop: desc: 0000000000000000 - 00000000 - 00000000 - 00000000
>    qemu-system-aarch64: virtio: zero sized buffers are not allowed
> 
> 
> Thanks,
> Gavin
> 
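
For anyone cross-checking the addresses in the logs above, the relationship
between them can be sketched as follows. The constants are assumptions read
off the logs rather than taken from the code: the shared (unprotected) alias
is assumed to be IPA bit 40, and mach-virt.ram is assumed to start at GPA
0x40000000. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption from the logs: xlat=0x10046880000 vs GPA 0x46880000. */
#define SHARED_ALIAS_BIT	(UINT64_C(1) << 40)
/* Assumption from the logs: GPA 0x46880000 vs mr_addr=0x6880000. */
#define RAM_BASE		UINT64_C(0x40000000)
/* Output address field, bits [47:12], of a 4KB-granule descriptor. */
#define PTE_ADDR_MASK		UINT64_C(0x0000FFFFFFFFF000)

/* Extract the output address the guest S1 PTE points at. */
static uint64_t pte_output_addr(uint64_t pte)
{
	return pte & PTE_ADDR_MASK;
}

/* Strip the shared alias bit to get the GPA that QEMU reports. */
static uint64_t ipa_to_gpa(uint64_t ipa)
{
	return ipa & ~SHARED_ALIAS_BIT;
}

/* Offset of a GPA inside the RAM memory region. */
static uint64_t gpa_to_ram_offset(uint64_t gpa)
{
	assert(gpa >= RAM_BASE);
	return gpa - RAM_BASE;
}
```

Under those assumptions the guest PTE 0x00e8010046880707 yields the faulting
IPA 0x10046880000, which maps to GPA 0x46880000 and RAM offset 0x6880000, and
host=0xffff23e00000 plus that offset gives ram_ptr=0xffff2a680000. So the
addresses on the guest and QEMU sides are consistent with each other.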


^ permalink raw reply

* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Suzuki K Poulose @ 2026-06-25 16:19 UTC (permalink / raw)
  To: Steven Price, Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <9631be66-c757-488d-bb66-a62698aa26b8@arm.com>

On 08/06/2026 14:53, Steven Price wrote:
> On 08/06/2026 10:41, Suzuki K Poulose wrote:
>> On 08/06/2026 10:36, Steven Price wrote:
>>> On 28/05/2026 06:30, Gavin Shan wrote:
>>>> Hi Steve,
>>>>
>>>> On 5/13/26 11:17 PM, Steven Price wrote:
>>>>> The VMM needs to populate the realm with some data before starting (e.g.
>>>>> a kernel and initrd). This is measured by the RMM and used as part of
>>>>> the attestation later on.
>>>>>
>>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>
>> ...
>>
>>>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>>>> index a89873a5eb77..209087bcf399 100644
>>>>> --- a/arch/arm64/kvm/rmi.c
>>>>> +++ b/arch/arm64/kvm/rmi.c
>>>>> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm, unsigned long start,
>>>>>             realm_unmap_private_range(kvm, start, end, may_block);
>>>>>     }
>>>>>     +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>>>>> +                   kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
>>>>> +                   unsigned long flags)
>>>>> +{
>>>>> +    struct realm *realm = &kvm->arch.realm;
>>>>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>>>>> +    phys_addr_t dst_phys, src_phys;
>>>>> +    int ret;
>>>>> +
>>>>> +    dst_phys = __pfn_to_phys(dst_pfn);
>>>>> +    src_phys = __pfn_to_phys(src_pfn);
>>>>> +
>>>>> +    if (rmi_delegate_page(dst_phys))
>>>>> +        return -ENXIO;
>>>>> +
>>>>> +    ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
>>>>> +    if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>>>> +        /* Create missing RTTs and retry */
>>>>> +        int level = RMI_RETURN_INDEX(ret);
>>>>> +
>>>>> +        KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);
>>>>
>>>>           KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);
>>>
>>> Ack.
>>>
>>
>> Thinking more about this, I guess a buggy VMM can trigger this
>> by populating twice? (level == KVM_PGTABLE_LAST_LEVEL). So, we should
>> return the error back, rather than warning here and suppressing the error?
> 
> Populating twice causes rmi_delegate_page() to be run twice on the same
> page and the second one will then fail. So I don't think this is
> possible (please correct me if I've missed something!)

Good point, but I think this may not fail in the future, in order to allow
huge pages. DELEGATE_RANGE would skip the granules in the DELEGATED/DATA
state. I am getting this clarified in the spec.


Suzuki

> 
> Thanks,
> Steve
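
The decision being discussed (retry after creating missing RTTs, or hand the
error back when the walk already reached the last level) can be sketched as
below. This is a standalone illustration, not the patch: the macro encodings
(status in bits [7:0], index in bits [15:8]), the value of RMI_ERROR_RTT and
LAST_LEVEL == 3 are assumptions, and classify_rmi_ret() is a hypothetical
name.

```c
#include <assert.h>

/* Assumed encodings, mirroring the names used in the quoted patch. */
#define RMI_RETURN_STATUS(ret)	((ret) & 0xFF)
#define RMI_RETURN_INDEX(ret)	(((ret) >> 8) & 0xFF)
#define RMI_ERROR_RTT		4	/* assumed status value */
#define LAST_LEVEL		3	/* assumed KVM_PGTABLE_LAST_LEVEL */

enum rmi_action {
	RMI_ACT_DONE,		/* not an RTT walk error: caller handles 'ret' as usual */
	RMI_ACT_CREATE_RTTS,	/* tables missing below 'level': create them and retry */
	RMI_ACT_FAIL,		/* walk reached the last level: return an error */
};

/* Hypothetical helper: decide what to do with a data-map-init result. */
static enum rmi_action classify_rmi_ret(unsigned long ret, int *level)
{
	if (RMI_RETURN_STATUS(ret) != RMI_ERROR_RTT)
		return RMI_ACT_DONE;

	*level = (int)RMI_RETURN_INDEX(ret);

	/* '>=' rather than '==', as suggested in the review above. */
	if (*level >= LAST_LEVEL)
		return RMI_ACT_FAIL;

	return RMI_ACT_CREATE_RTTS;
}
```

Returning RMI_ACT_FAIL instead of tripping KVM_BUG_ON() corresponds to the
"return the error back" option. Whether that case is reachable by a VMM is
exactly what is still being clarified.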


^ permalink raw reply

* Re: [PATCH v2 16/17] KVM: TDX: Add in-kernel Quote generation
From: Sean Christopherson @ 2026-06-25 18:01 UTC (permalink / raw)
  To: Xu Yilun
  Cc: x86, kvm, linux-coco, linux-kernel, djbw, kas, rick.p.edgecombe,
	yilun.xu, xiaoyao.li, sohil.mehta, adrian.hunter, kishen.maloor,
	tony.lindgren, peter.fang, baolu.lu, zhenzhong.duan, dave.hansen,
	dave.hansen
In-Reply-To: <20260618081355.3253581-17-yilun.xu@linux.intel.com>

On Thu, Jun 18, 2026, Xu Yilun wrote:
> From: Peter Fang <peter.fang@intel.com>
> 
> Provide an in-kernel path for Quote generation when handling
> TDG.VP.VMCALL<GetQuote>, without requiring an exit to userspace.

Why?

^ permalink raw reply

* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Ackerley Tng @ 2026-06-25 18:20 UTC (permalink / raw)
  To: Yan Zhao
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	jmattson, jthoughton, michael.roth, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Baoquan He, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ajyCn0PnFtQK+Nka@yzhao56-desk.sh.intel.com>

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Wed, Jun 24, 2026 at 05:05:44PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>
>> >
>> > [...snip...]
>> >
>> >>
>> >>  #ifdef kvm_arch_has_private_mem
>> >> -bool __ro_after_init gmem_in_place_conversion = false;
>> >> +bool __ro_after_init gmem_in_place_conversion = !IS_ENABLED(CONFIG_KVM_VM_MEMORY_ATTRIBUTES);
>> >> +module_param(gmem_in_place_conversion, bool, 0444);
>> >
>> > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
>> > MMAP flag. In such cases, shared memory is allocated from different backends.
>> > This means this module parameter only enables per-gmem memory attribute and does
>> > not guarantee that gmem in-place conversion will actually occur.
>> >
>> > To avoid confusion, could we rename this module parameter to something more
>> > accurate, such as gmem_memory_attribute?
>> >
>>
>> I asked Sean about this after getting some fixes off list. Sean said
>> gmem_in_place_conversion is named for a host admin to use, and something
>> like gmem_memory_attributes is too much implementation details for the
>> admin.
> Thanks for this background.
>
> Some more context on why I'm asking:
>
> Currently, I'm testing TDX huge pages with the following two gmem components:
> 1. The gmem memory attribute in this gmem in-place conversion v8.
> 2. The gmem 2MB from buddy allocator. (for development/testing only).
>
> The gmem 2MB from buddy allocator allocates 2MB folios from buddy for private
> memory, while shared memory is allocated from a different backend.
> (To avoid fragmentation, only private mappings are split during private-to-shared
> conversions. In this approach, the 2MB folios are always retained in the gmem
> inode filemap cache without splitting.)
>
> Since shared memory is not allocated from gmem, there are no in-place conversions.
> The reason I'm using "gmem memory attribute" is that the per-VM attribute is
> being deprecated, as suggested by Sean [1].
>

v8 of the conversions series changed that slightly: per-VM attributes are
going to stay around (because of upcoming work on RWX attributes), and
RWX will stay tracked at the VM level.

For v8 and beyond, only tracking of private/shared in per-VM attributes
is being deprecated.

By extension the entire thing about using guest_memfd for private memory
and a different backing memory for shared memory is being deprecated.

> Besides my current usage,

I think you can set up guest_memfd+2M for private memory and shared
memory from some other source, and that's the deprecated usage pattern.

> there may be other scenarios where gmem memory
> attributes are preferred without allocating shared memory from gmem.
> (e.g., PAGE.ADD from a temp extra shared source memory).
>

Is this TDH.MEM.PAGE.ADD, used indirectly from
tdx_gmem_post_populate()? This use case isn't blocked. Even if
gmem_in_place_conversion=true, you can still set src_address to
non-guest_memfd memory and load from anywhere you like.

Please let me know if that is broken! I think I accidentally used that
setup in selftests and it worked. The selftests are now defaulting to
in-place conversion.

> For such use cases, I'm concerned that admins may find it confusing if they
> enable gmem_in_place_conversion but still observe extra memory consumption for
> shared memory.
>

Hmm but I guess if someone enables gmem_in_place_conversion but still
allocates from elsewhere, they'd have to figure it out?

> [1] https://lore.kernel.org/kvm/aWmEegVP_A613WIr@google.com/
>
>> Sean, would you reconsider since Yan also asked? If the admin compiled
>> the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
>> admin would also be able to use a param like gmem_memory_attributes?
>>
>> There's the additional benefit that the similar naming aids in
>> understanding for both the admin and software engineers.
>>
>> Either way, in the next revision, I'll also add this documentation for
>> this module_param:
>>
>>   Setting the module parameter gmem_in_place_conversion to true enables
>>   the KVM_SET_MEMORY_ATTRIBUTES2 guest_memfd ioctl and disables
>>   the KVM_SET_MEMORY_ATTRIBUTES VM ioctl. If gmem_in_place_conversion is
>>   true, the private/shared attribute will be tracked per-guest_memfd
>>   instead of per-VM.
>>
>> Let me know what y'all think of the wording!
>>
>> >>
>> >> [...snip...]
>> >>

^ permalink raw reply
