Linux Confidential Computing Development

Linux Confidential Computing Development
 help / color / mirror / Atom feed

* Re: [PATCH RFC v5 01/53] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
From: Liam R. Howlett @ 2026-05-07  3:34 UTC (permalink / raw)
  To: ackerleytng
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260428-gmem-inplace-conversion-v5-1-d8608ccfca22@google.com>

On 26/04/28 04:24PM, Ackerley Tng via B4 Relay wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Start plumbing in guest_memfd support for in-place private<=>shared
> conversions by tracking attributes via a maple tree.  KVM currently tracks
> private vs. shared attributes on a per-VM basis, which made sense when a
> guest_memfd _only_ supported private memory, but tracking per-VM simply
> can't work for in-place conversions as the shareability of a given page
> needs to be per-gmem_inode, not per-VM.
> 
> Use the filemap invalidation lock to protect the maple tree, as taking the
> lock for read when faulting in memory (for userspace or the guest) isn't
> expected to result in meaningful contention, and using a separate lock
> would add significant complexity (avoid deadlock is quite difficult).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Co-developed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  virt/kvm/guest_memfd.c | 139 +++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 123 insertions(+), 16 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 69c9d6d546b28..17e5a23fec0a1 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,7 @@
>  #include <linux/falloc.h>
>  #include <linux/fs.h>
>  #include <linux/kvm_host.h>
> +#include <linux/maple_tree.h>
>  #include <linux/mempolicy.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/pagemap.h>
> @@ -33,6 +34,12 @@ struct gmem_inode {
>  	struct list_head gmem_file_list;
>  
>  	u64 flags;
> +	/*
> +	 * Every index in this inode, whether memory is populated or
> +	 * not, is tracked in attributes. There are no gaps in this
> +	 * maple tree.
> +	 */
> +	struct maple_tree attributes;
>  };
>  
>  static __always_inline struct gmem_inode *GMEM_I(struct inode *inode)
> @@ -60,6 +67,31 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
>  	return gfn - slot->base_gfn + slot->gmem.pgoff;
>  }
>  
> +static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
> +{
> +	struct maple_tree *mt = &GMEM_I(inode)->attributes;
> +	void *entry = mtree_load(mt, index);
> +
> +	/*
> +	 * The lock _must_ be held for lookups, as some maple tree operations,
> +	 * e.g. append, are unsafe (return inaccurate information) with respect
> +	 * to concurrent RCU-protected lookups.
> +	 */

Can you please elaborate how you see inaccurate information and which
information is inaccurate?

Your comment is incorrect and misleading as append will not be used in
rcu mode.  Note that you have not set this tree up in rcu mode.

> +	lockdep_assert(mt_lock_is_held(mt));
> +
> +	return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
> +}
> +
> +static bool kvm_gmem_is_private_mem(struct inode *inode, pgoff_t index)
> +{
> +	return kvm_gmem_get_attributes(inode, index) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +
> +static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
> +{
> +	return !kvm_gmem_is_private_mem(inode, index);
> +}
> +
>  static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>  				    pgoff_t index, struct folio *folio)
>  {
> @@ -397,10 +429,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>  	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>  		return VM_FAULT_SIGBUS;
>  
> -	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
> -		return VM_FAULT_SIGBUS;
> +	filemap_invalidate_lock_shared(inode->i_mapping);
> +	if (kvm_gmem_is_shared_mem(inode, vmf->pgoff))
> +		folio = kvm_gmem_get_folio(inode, vmf->pgoff);
> +	else
> +		folio = ERR_PTR(-EACCES);
> +	filemap_invalidate_unlock_shared(inode->i_mapping);
>  
> -	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>  	if (IS_ERR(folio)) {
>  		if (PTR_ERR(folio) == -EAGAIN)
>  			return VM_FAULT_RETRY;
> @@ -556,6 +591,51 @@ bool __weak kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
>  	return true;
>  }
>  
> +static int kvm_gmem_init_inode(struct inode *inode, loff_t size, u64 flags)
> +{
> +	struct gmem_inode *gi = GMEM_I(inode);
> +	MA_STATE(mas, &gi->attributes, 0, (size >> PAGE_SHIFT) - 1);
> +	u64 attrs;
> +	int r;
> +
> +	inode->i_op = &kvm_gmem_iops;
> +	inode->i_mapping->a_ops = &kvm_gmem_aops;
> +	inode->i_mode |= S_IFREG;
> +	inode->i_size = size;
> +	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +
> +	/*
> +	 * guest_memfd memory is neither migratable nor swappable: set
> +	 * inaccessible to gate off both.
> +	 */
> +	mapping_set_inaccessible(inode->i_mapping);
> +	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> +
> +	gi->flags = flags;
> +
> +	mt_set_external_lock(&gi->attributes,
> +			     &inode->i_mapping->invalidate_lock);
> +
> +	/*
> +	 * Store default attributes for the entire gmem instance. Ensuring every
> +	 * index is represented in the maple tree at all times simplifies the
> +	 * conversion and merging logic.
> +	 */
> +	attrs = gi->flags & GUEST_MEMFD_FLAG_INIT_SHARED ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +
> +	/*
> +	 * Acquire the invalidation lock purely to make lockdep happy.  The
> +	 * maple tree library expects all stores to be protected via the lock,
> +	 * and the library can't know when the tree is reachable only by the
> +	 * caller, as is the case here.
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +	r = mas_store_gfp(&mas, xa_mk_value(attrs), GFP_KERNEL);
> +	filemap_invalidate_unlock(inode->i_mapping);
> +
> +	return r;
> +}
> +
>  static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  {
>  	static const char *name = "[kvm-gmem]";
> @@ -586,16 +666,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  		goto err_fops;
>  	}
>  
> -	inode->i_op = &kvm_gmem_iops;
> -	inode->i_mapping->a_ops = &kvm_gmem_aops;
> -	inode->i_mode |= S_IFREG;
> -	inode->i_size = size;
> -	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> -	mapping_set_inaccessible(inode->i_mapping);
> -	/* Unmovable mappings are supposed to be marked unevictable as well. */
> -	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
> -
> -	GMEM_I(inode)->flags = flags;
> +	err = kvm_gmem_init_inode(inode, size, flags);
> +	if (err)
> +		goto err_inode;
>  
>  	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
>  	if (IS_ERR(file)) {
> @@ -797,9 +870,13 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	if (!file)
>  		return -EFAULT;
>  
> +	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> +
>  	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> -	if (IS_ERR(folio))
> -		return PTR_ERR(folio);
> +	if (IS_ERR(folio)) {
> +		r = PTR_ERR(folio);
> +		goto out;
> +	}
>  
>  	if (!folio_test_uptodate(folio)) {
>  		clear_highpage(folio_page(folio, 0));
> @@ -815,6 +892,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	else
>  		folio_put(folio);
>  
> +out:
> +	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>  	return r;
>  }
>  EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> @@ -944,6 +1023,15 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
>  
>  	mpol_shared_policy_init(&gi->policy, NULL);
>  
> +	/*
> +	 * Memory attributes are protected by the filemap invalidation lock, but
> +	 * the lock structure isn't available at this time.  Immediately mark
> +	 * maple tree as using external locking so that accessing the tree
> +	 * before it's fully initialized results in NULL pointer dereferences
> +	 * and not more subtle bugs.
> +	 */
> +	mt_init_flags(&gi->attributes, MT_FLAGS_LOCK_EXTERN);
> +
>  	gi->flags = 0;
>  	INIT_LIST_HEAD(&gi->gmem_file_list);
>  	return &gi->vfs_inode;
> @@ -951,7 +1039,26 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
>  
>  static void kvm_gmem_destroy_inode(struct inode *inode)
>  {
> -	mpol_free_shared_policy(&GMEM_I(inode)->policy);
> +	struct gmem_inode *gi = GMEM_I(inode);
> +
> +	mpol_free_shared_policy(&gi->policy);
> +
> +	/*
> +	 * Note!  Checking for an empty tree is functionally necessary
> +	 * to avoid explosions if the tree hasn't been fully
> +	 * initialized, i.e. if the inode is being destroyed before
> +	 * guest_memfd can set the external lock, lockdep would find
> +	 * that the tree's internal ma_lock was not held.
> +	 */
> +	if (!mtree_empty(&gi->attributes)) {
> +		/*
> +		 * Acquire the invalidation lock purely to make lockdep happy,
> +		 * the inode is unreachable at this point.
> +		 */
> +		filemap_invalidate_lock(inode->i_mapping);
> +		__mt_destroy(&gi->attributes);
> +		filemap_invalidate_unlock(inode->i_mapping);
> +	}
>  }
>  
>  static void kvm_gmem_free_inode(struct inode *inode)
> 
> -- 
> 2.54.0.545.g6539524ca2-goog
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v8 10/21] x86/virt/seamldr: Shut down the current TDX module
From: Dave Hansen @ 2026-05-06 20:49 UTC (permalink / raw)
  To: Chao Gao
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <afqtylwGsoLsp/0f@intel.com>

On 5/5/26 19:56, Chao Gao wrote:
> On Thu, Apr 30, 2026 at 11:52:50AM -0700, Dave Hansen wrote:
>> On 4/27/26 08:28, Chao Gao wrote:
>>>  static int do_seamldr_install_module(void *seamldr_params)
>>>  {
>>>  	enum module_update_state newstate, curstate = MODULE_UPDATE_START;
>>> +	int cpu = smp_processor_id();
>>> +	bool primary;
>>>  	int ret = 0;
>>>  
>>> +	primary = cpumask_first(cpu_online_mask) == cpu;
>> Isn't cpumask_first(cpu_online_mask)==0, always? I thought CPU 0 could
>> never be offlined.
> Not always. On x86, CPU 0 can be offlined at runtime if the kernel is booted
> with the cpu0_hotplug command-line option. See cpu_hotplug.rst.

See e59e74dc48a309cb848ffc3d76a0d61aa6803c05.

Yes, the docs are stale.

^ permalink raw reply

* Re: [PATCH v8 09/21] x86/virt/seamldr: Introduce skeleton for TDX module updates
From: Dave Hansen @ 2026-05-06 20:43 UTC (permalink / raw)
  To: Chao Gao
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <afs7UATlAuDehCCB@intel.com>

On 5/6/26 06:00, Chao Gao wrote:
> On Thu, Apr 30, 2026 at 01:03:53PM -0700, Dave Hansen wrote:
>> On 4/27/26 08:28, Chao Gao wrote:
>>> +static struct {
>>> +	enum module_update_state state;
>>> +	int thread_ack;
>>
>> multi_stop_data has an atomic_t for this.
>>
>> You have an int.
>>
>> Which one is right?
> 
> You pointed out that using atomic_t and memory barriers for synchronization was
> overly complicated. So, I switched to use a spinlock, and thread_ack can now be
> a plain int.

Good point.

Could you make this a bit more obvious, please?

I honestly don't like guard(). Maybe I'm old-fashioned, but I really,
really like critical sections to be, well, explicit *sections* of code.

Second, you have two functions defined next to each other with similar
names:

static void ack_state(void)

static void set_target_state(enum module_update_state state)

Both of which manipulate the same data. One takes the lock. One doesn't.
That could be fixed with comments.

^ permalink raw reply

* Re: SVSM Development Call May 6th, 2026
From: Jörg Rödel @ 2026-05-06 18:10 UTC (permalink / raw)
  To: coconut-svsm, linux-coco
In-Reply-To: <45x7yirnjkd374loz5ymizxdyccuytg7rr46jvgh33gl7c2xwo@vps5xitk2umx>

Added meeting minutes to the pending PR from last week:

	https://github.com/coconut-svsm/governance/pull/106

-Joerg

^ permalink raw reply

* Re: [PATCH v4 3/3] coco: guest: arm64: Query host IPA-change alignment via RHI
From: Aneesh Kumar K.V @ 2026-05-06 14:23 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-kernel, iommu, linux-coco, linux-arm-kernel, kvmarm,
	Catalin Marinas, Jason Gunthorpe, Marek Szyprowski, Robin Murphy,
	Steven Price, Suzuki K Poulose, Thomas Gleixner, Will Deacon
In-Reply-To: <86tssvyz2v.wl-maz@kernel.org>

Marc Zyngier <maz@kernel.org> writes:

> On Tue, 28 Apr 2026 13:49:46 +0100,
> Aneesh Kumar K.V <aneesh.kumar@kernel.org> wrote:
>> 
>> Marc Zyngier <maz@kernel.org> writes:
>> 
>> > On Mon, 27 Apr 2026 07:31:08 +0100,
>> > "Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> wrote:
>> >> 
>> >> Add the Realm Host Interface support needed to query host configuration
>> >> from a Realm guest. Define the RHI hostconf SMCs, add rsi_host_call(), and
>> >> use them during Realm initialization to retrieve the host IPA-change
>> >> alignment size.
>> >
>> > I don't understand what "IPA-change" means. What you are after is the
>> > host's sharing granule size.
>> >
>> 
>> This is part of the RHI specification, and the call is named
>> RHI_HOSTCONF_GET_IPA_CHANGE_ALIGNMENT. The intent is to determine the
>> alignment requirements for changing IPA attributes (protected vs.
>> unprotected IPA
>
> This really is a terrible name. Why the 'change' part? It doesn't
> change, it is a constant.
>

How about rhi_get_host_sharing_alignment()? or can you suggest a better
name I can switch to?

> Oh well...
>
> [...]
>
>> >> +static inline unsigned long rsi_host_call(struct rsi_host_call *rhi_call)
>> >> +{
>> >> +	phys_addr_t addr = virt_to_phys(rhi_call);
>> >> +	struct arm_smccc_res res;
>> >> +
>> >> +	arm_smccc_1_1_invoke(SMC_RSI_HOST_CALL, addr, &res);
>> >
>> > Errr... What guarantees that *rhi_call is *IPA contiguous*? This is
>> > incredibly fragile. You should at the very least check that this isn't
>> > vmalloc'd.
>> >
>> 
>> 
>> I didn’t quite follow that. We have other RSI calls (even RMI calls)
>> that do similar things, and the caller understands that the address
>> should be IPA-contiguous.
>
> Does it? Where is it documented?  All you get is a pointer, so all
> bets are off.
>

We have multiple rmi and rsi calls that takes ipa values. asm/rmi_cmds.h
and asm/rsi_cmds.h. Some of them takes phys_addr_t while others take
unsigned long. The spec mention these as 64 bits values. May be we
should switch them all to u64. x86 also having similar discussion 
https://lore.kernel.org/all/afOrd7JYkUfe7wcZ@google.com

>
>> Are you suggesting that all RSI calls should
>> add checks for this?. or are you suggesting to update the API to
>> 
>> unsigned long rsi_host_call(unsigned long rhi_call_phys) ?
>
> I'm suggesting that this API is subtly broken because it makes random
> assumption about the physical contiguity of the VA space. It does so
> without any check, without any documentation.
>
> Simply changing the parameter to phys_addr_t could at the very least
> capture some of the requirements, but I'd like something in big bold
> letters.
>


virt_to_phys() emits a WARN if the address is not part of the linear
map. Are you suggesting that we should add additional checks to the call
sites that pass such addresses?

Sorry, it’s still not clear to me how you want these calls to be
updated.

The pattern I’ve been following is:

Lower-level calls that use arm_smccc_1_1_invoke() take parameters as
unsigned long. I initially wanted to switch this to u64, but since
kvm/rmi.c uses unsigned long, it was decided to keep it consistent.

This approach is used in cases where the same argument is passed across
multiple calls, for example:

phys_addr_t rd_phys = virt_to_phys(realm->rd);
rmi_vdev_create(rd_phys, ...);
rmi_vdev_lock(rd_phys, ...);

For calls like rsi_host_call(), I chose to pass a struct pointer to
maintain better type safety:

static inline unsigned long rsi_host_call(struct rsi_host_call *rhi_call)
{
	phys_addr_t addr = virt_to_phys(rhi_call);

	arm_smccc_1_1_invoke(SMC_RSI_HOST_CALL, addr, &res);
}

Note that virt_to_phys() will WARN if the address is not part of the
linear map

Could you clarify what changes you would like to see in these
interfaces?

-aneesh


^ permalink raw reply

* Re: [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf
From: Jason Gunthorpe @ 2026-05-06 13:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
	sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
	vivek.kasireddy, dan.j.williams, yilun.xu, linux-coco,
	linux-kernel, lukas, yan.y.zhao, daniel.vetter, leon, baolu.lu,
	zhenzhong.duan, tao1.su
In-Reply-To: <c0b160f8-2930-4158-9e50-b4cc4209e2ca@amd.com>

On Wed, May 06, 2026 at 12:35:42PM +1000, Alexey Kardashevskiy wrote:
> Hi!
> 
> Let's reignite this topic.
> 
> I've been using these patches + QEMU side hacks for 6+ months. And it's been fine until I got a device where MSIX BAR is in a middle of another BAR marked as TEE in the TDISP interface report. And no trusted MSIX yet.
> 
> Every time QEMU mmaps a BAR - I request a dmabuf fd from VFIO in QEMU. Since mapping of an entire MSIX BAR is allowed by default, VFIORegion::nr_mmaps==1 and it is an entire BAR.
> 
> Problem: KVM memslot mismatches the dmabuf fd size 

Huh? kvm does not care about dmabuf at all? Are you running other
patches to hook kvm and dmabuf?

Putting a slice in a dmabuf is a well understood need for MSI, so I
expect whatever kvm dmabuf interface that gets merged to accomodate
this?

> Solution2: modify logic in VFIO dmabuf to allow multiple KVM memory
> slots per dmabuf. Now it is kvm_memory_slot::dmabuf_attach with no
> offset into the dmabuf and one kvm_vfio_dmabuf per dma_buf.

Yes, when kvm learns to take in a dmabuf it needs to take in a slice,
not the whole buf. Or you need to create multiple dmabufs with the
necessary slices from the VFIO. The upstream vfio dmabuf creation
allows creating it with a slice.

Jason

^ permalink raw reply

* Re: [PATCH v8 09/21] x86/virt/seamldr: Introduce skeleton for TDX module updates
From: Chao Gao @ 2026-05-06 13:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <5dc70847-332d-496f-b0ab-03323eff7118@intel.com>

On Thu, Apr 30, 2026 at 01:03:53PM -0700, Dave Hansen wrote:
>On 4/27/26 08:28, Chao Gao wrote:
>> +static struct {
>> +	enum module_update_state state;
>> +	int thread_ack;
>
>multi_stop_data has an atomic_t for this.
>
>You have an int.
>
>Which one is right?

You pointed out that using atomic_t and memory barriers for synchronization was
overly complicated. So, I switched to use a spinlock, and thread_ack can now be
a plain int.

See https://lore.kernel.org/kvm/31936a20-929f-489a-9dc6-0f8fcb9307f1@intel.com/

^ permalink raw reply

* Re: [PATCH v8 15/21] x86/virt/tdx: Refresh TDX module version after update
From: Chao Gao @ 2026-05-06 12:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <28be5180-ff74-4e4d-b392-7ba7a9b4c1c0@intel.com>

On Thu, Apr 30, 2026 at 12:14:32PM -0700, Dave Hansen wrote:
>On 4/27/26 08:28, Chao Gao wrote:
>> The kernel exposes the TDX module version through sysfs so userspace can
>> check update compatibility. That information needs to remain accurate
>> across runtime updates.
>> 
>> A runtime update may change the module's update_version, so refresh the
>> cached version after a successful update and emit a log message to show
>> the version change.
>> 
>> Drop __ro_after_init from tdx_sysinfo because it is now updated at runtime.
>> 
>> Perform the refresh outside of stop_machine() since printk() within
>> stop_machine() would add significant latency.
>> 
>> Do not refresh the rest of tdx_sysinfo. Refreshing them at runtime could
>> disrupt running software that relies on the previously reported values.
>
>This needs more explanation. I think the reasoning is quite nuanced.
>
>tdx_sysinfo is just a cache of what the TDX module is and can do. If
>that changes, it means the TDX module changed. So you somehow need to
>argue why it's OK to hide those changes from the tdx_sysinfo users.
>
>Why would they be confused by tdx_sysinfo changes but not by the TDX
>module *itself* changing?

Good point. The key assumption here is that module updates are fully backward
compatible, so running software can continue to work without seeing the new
tdx_sysinfo. I will revise the changelog. See below.

>
>> Note that major and minor versions are not refreshed because runtime updates
>> are supported only between releases with identical major and minor versions.
>
>I'd rather have this in code than a changelog comment.
>
>If they can't change then warn if they do.

Maybe I can just drop the note as I don't want to add code to preemptively
catch theoretical module bugs.

I added it because Sashiko pointed out that assigning the whole version struct
outside stop_machine() could allow sysfs readers to observe a partially updated
version. As we don't need to print new module version, I will move that
assignment into stop_machine(), which addresses that issue. After that, there
is no need to mention that major/minor versions are identical across updates.

>
>> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
>> index 98a8d9d3ae25..c81b26c4bac1 100644
>> --- a/arch/x86/virt/vmx/tdx/seamldr.c
>> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
>> @@ -306,6 +306,8 @@ DEFINE_FREE(free_seamldr_params, struct seamldr_params *,
>>   */
>>  int seamldr_install_module(const u8 *data, u32 size)
>>  {
>> +	int ret;
>> +
>>  	struct seamldr_params *params __free(free_seamldr_params) =
>>  						init_seamldr_params(data, size);
>>  	if (IS_ERR(params))
>> @@ -314,6 +316,10 @@ int seamldr_install_module(const u8 *data, u32 size)
>>  	/* Ensure a stable set of online CPUs for the update process. */
>>  	guard(cpus_read_lock)();
>>  	set_target_state(MODULE_UPDATE_START + 1);
>> -	return stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
>> +	ret = stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return tdx_module_refresh_version();
>>  }
>
>This is one reason I rather dislike guard().
>
>Does tdx_module_refresh_version() need to be guarded by 'cpus_read_lock'?

No. It can be moved out of 'cpus_read_lock'.

>
>?
>
>
>> +	/* Major/minor versions should not change across updates. */
>> +	tdx_sysinfo.version.update_version	= new.update_version;
>
>					   ^ very odd tab
>
>Also, how much of this do you *NEED*? You don't need to print the old
>version. You don't really need to _print_ the new version either.
>
>Isn't this arguably all fluff?

I initially kept tdx module update similar to the microcode update. But yes,
printing the new version is not strictly needed. Once the unnecessary
complexity is dropped, the patch becomes quite small:

commit 90e5a66b3f54af96d5895f6707ecdeef4bc4a3ed
Author: Chao Gao <chao.gao@intel.com>
Date:   Tue Mar 31 05:41:29 2026 -0700

    x86/virt/tdx: Refresh TDX module version after update
    
    The kernel exposes the TDX module version through sysfs so userspace can
    check update compatibility. That information needs to remain accurate
    across runtime updates.
    
    A runtime update may change the module's update_version, so refresh the
    cached version after a successful update.
    
    Drop __ro_after_init from tdx_sysinfo because it is now updated at runtime.
    
    Do not refresh the rest of tdx_sysinfo even if those values change. Current
    tdx_sysinfo users (e.g., KVM) can continue to work across module updates
    without seeing the new values because module updates are required to be
    backward compatible. And those users are not written to re-validate
    tdx_sysinfo values after an update, so refreshing them could be risky.
    
    For example, a user may decide both setup and teardown behavior purely
    based on the reported capabilities. If refreshed tdx_sysinfo starts
    reporting a newly gained capability, later code may assume the
    corresponding setup exists and try to use or tear it down, even though no
    such setup was done before the update.
    
    Signed-off-by: Chao Gao <chao.gao@intel.com>
    ---
    v9:
    - don't print old and new version [Dave]
    - explain why it's OK to hide changes from the tdx_sysinfo users [Dave]
    - update versions in stop_machine context
    - don't mention major/minor versions are idential across updates. That fact is
      not relevant here.

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index deb1470185ce..ab350b705910 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -66,7 +66,7 @@ static struct tdmr_info_list tdx_tdmr_list;
 /* All TDX-usable memory regions.  Protected by mem_hotplug_lock. */
 static LIST_HEAD(tdx_memlist);
 
-static struct tdx_sys_info tdx_sysinfo __ro_after_init;
+static struct tdx_sys_info tdx_sysinfo;
 
 /*
  * Do the module global initialization once and return its result.
@@ -1305,6 +1305,10 @@ int tdx_module_run_update(void)
	if (ret)
		return ret;
 
+	/* Shouldn't fail as the update has succeeded. */
+	ret = get_tdx_sys_info_version(&tdx_sysinfo.version);
+	WARN_ON_ONCE(ret);
+
	tdx_module_state.initialized = true;
	return 0;
 }
diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
index e793dec688ab..e49c300f23d4 100644
--- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
+++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
@@ -7,7 +7,7 @@
  * Include this file to other C file instead.
  */
 
-static __init int get_tdx_sys_info_version(struct tdx_sys_info_version *sysinfo_version)
+static int get_tdx_sys_info_version(struct tdx_sys_info_version *sysinfo_version)
 {
	int ret = 0;
	u64 val;

^ permalink raw reply related

* Re: SVSM Development Call April 29, 2026
From: Jörg Rödel @ 2026-05-06 10:35 UTC (permalink / raw)
  To: coconut-svsm, linux-coco
In-Reply-To: <j3qgor73sx536uzabgyrfadoj7xznc4ir2pj6tbyimn7aqpiki@m5hzafmcng7m>

Meeting minutes are now available:

	https://github.com/coconut-svsm/governance/pull/106

-Joerg

^ permalink raw reply

* Re: [PATCH v8 11/21] x86/virt/tdx: Reset software states during TDX module shutdown
From: Chao Gao @ 2026-05-06  6:21 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <20f2d821-bfa6-4db2-a968-b5455c7b5007@intel.com>

On Thu, Apr 30, 2026 at 11:58:12AM -0700, Dave Hansen wrote:
>On 4/27/26 08:28, Chao Gao wrote:
>> +	/*
>> +	 * Clear global and per-CPU initialization flags so the new module
>> +	 * can be fully re-initialized after a successful update.
>> +	 *
>> +	 * No locks needed as no concurrent accesses can occur here.
>> +	 */
>> +	tdx_module_initialized = false;
>> +	sysinit_done = false;
>> +	sysinit_ret = 0;
>> +	for_each_possible_cpu(cpu)
>> +		per_cpu(tdx_lp_initialized, cpu) = false;
>
>This speaks to needing refactoring.
>
>If there's global TDX state, it needs to be in a global TDX state
>structure, not scattered across random global variables.
>
>Imagine the structure is:
>
>struct tdx_module_config foo;
>
>That's 0's at boot. You have to init the TDX module to bring it out of
>0's to valid state. It actually means something if you do:
>
>	memset(&foo, 0, sizeof(foo));
>
>Because it takes it right back to its bss state. That ^ is also handy
>because it naturally just works if new state gets added.
>
>Guess what will happen the next time someone adds:
>
>	int sysinit_new_fancy_thing;
>
>Someone will forget to add it to this code. Then the module gets updated
>and breaks things in fun ways.

Makes sense. I will slot in a patch like this:

From 1578bd211c732f7773475819cbf145fe9fedfcb5 Mon Sep 17 00:00:00 2001
From: Chao Gao <chao.gao@intel.com>
Date: Tue, 5 May 2026 22:46:39 -0700
Subject: [PATCH] x86/virt/tdx: Consolidate TDX global initialization state

The kernel uses several global flags to guard one-time TDX initialization
flows and prevent them from being repeated.

When the TDX module is updated, all of this state must be reset so that
the module can be initialized again. Today those flags are kept as separate
global variables, which makes the reset path awkward and easy to miss when
new state is added.

Group the flags into a single structure so they can be reset together, for
example with memset(), and so future state that needs the same handling is
easier to manage.

Drop the __ro_after_init annotation from tdx_module_initialized to make it
consistent with the other two flags.

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 arch/x86/virt/vmx/tdx/tdx.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index c0c6281b08a5..0172b432f229 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -44,6 +44,13 @@
 #include <asm/virt.h>
 #include "tdx.h"
 
+struct tdx_module_state {
+	bool initialized;
+	bool sysinit_done;
+	int sysinit_ret;
+};
+
+static struct tdx_module_state tdx_module_state;
 static u32 tdx_global_keyid __ro_after_init;
 static u32 tdx_guest_keyid_start __ro_after_init;
 static u32 tdx_nr_guest_keyids __ro_after_init;
@@ -58,7 +65,6 @@ static struct tdmr_info_list tdx_tdmr_list;
 static LIST_HEAD(tdx_memlist);
 
 static struct tdx_sys_info tdx_sysinfo __ro_after_init;
-static bool tdx_module_initialized __ro_after_init;
 
 typedef void (*sc_err_func_t)(u64 fn, u64 err, struct tdx_module_args *args);
 
@@ -113,30 +119,28 @@ static int try_init_module_global(void)
 {
	struct tdx_module_args args = {};
	static DEFINE_RAW_SPINLOCK(sysinit_lock);
-	static bool sysinit_done;
-	static int sysinit_ret;
 
	raw_spin_lock(&sysinit_lock);
 
-	if (sysinit_done)
+	if (tdx_module_state.sysinit_done)
		goto out;
 
	/* RCX is module attributes and all bits are reserved */
	args.rcx = 0;
-	sysinit_ret = seamcall_prerr(TDH_SYS_INIT, &args);
+	tdx_module_state.sysinit_ret = seamcall_prerr(TDH_SYS_INIT, &args);
 
	/*
	 * The first SEAMCALL also detects the TDX module, thus
	 * it can fail due to the TDX module is not loaded.
	 * Dump message to let the user know.
	 */
-	if (sysinit_ret == -ENODEV)
+	if (tdx_module_state.sysinit_ret == -ENODEV)
		pr_err("module not loaded\n");
 
-	sysinit_done = true;
+	tdx_module_state.sysinit_done = true;
 out:
	raw_spin_unlock(&sysinit_lock);
-	return sysinit_ret;
+	return tdx_module_state.sysinit_ret;
 }
 
 /**
@@ -1299,7 +1303,7 @@ static __init int tdx_enable(void)
 
	register_syscore(&tdx_syscore);
 
-	tdx_module_initialized = true;
+	tdx_module_state.initialized = true;
	pr_info("TDX-Module initialized\n");
	return 0;
 }
@@ -1554,7 +1558,7 @@ void __init tdx_init(void)
 
 const struct tdx_sys_info *tdx_get_sysinfo(void)
 {
-	if (!tdx_module_initialized)
+	if (!tdx_module_state.initialized)
		return NULL;
 
	return (const struct tdx_sys_info *)&tdx_sysinfo;
-- 
2.52.0



^ permalink raw reply related

* Re: [PATCH v8 10/21] x86/virt/seamldr: Shut down the current TDX module
From: Chao Gao @ 2026-05-06  2:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <0523b07b-2df2-4cf4-bf98-6efe01780698@intel.com>

On Thu, Apr 30, 2026 at 11:52:50AM -0700, Dave Hansen wrote:
>On 4/27/26 08:28, Chao Gao wrote:
>>  static int do_seamldr_install_module(void *seamldr_params)
>>  {
>>  	enum module_update_state newstate, curstate = MODULE_UPDATE_START;
>> +	int cpu = smp_processor_id();
>> +	bool primary;
>>  	int ret = 0;
>>  
>> +	primary = cpumask_first(cpu_online_mask) == cpu;
>
>Isn't cpumask_first(cpu_online_mask)==0, always? I thought CPU 0 could
>never be offlined.

Not always. On x86, CPU 0 can be offlined at runtime if the kernel is booted
with the cpu0_hotplug command-line option. See cpu_hotplug.rst.

^ permalink raw reply

* Re: [PATCH v8 07/21] coco/tdx-host: Implement firmware upload sysfs ABI for TDX module updates
From: Chao Gao @ 2026-05-06  2:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <06682111-2b63-444e-a431-b8dd0db2b0ab@intel.com>

On Wed, Apr 29, 2026 at 04:17:10PM -0700, Dave Hansen wrote:
>On 4/27/26 08:28, Chao Gao wrote:
>> Linux kernel supports two primary firmware update mechanisms:
>>   - request_firmware()
>>   - firmware upload (or fw_upload)
>
>All the stuff here is good info, but it was hard to extract the
>implementation information from the background.
>
>I think this would do:
>
>	Select fw_upload for doing TDX module updates. The process of
>	selecting among available update images is complicated and
>	nuanced. Punt the selection policy out to userspace.

Agreed. I'll add that as a TL;DR.

>
>...
>> +static int seamldr_init(struct device *dev)
>> +{
>> +	struct fw_upload *tdx_fwl;
>> +
>> +	if (!can_expose_seamldr())
>> +		return 0;
>
>can_expose_seamldr() has a not great name.
>
>Why not just have naming that says:
>
>	if (supports_runtime_update())
>		...
>
>Why abstract this out to what can or can't be exposed?

Sure, I'll use supports_runtime_update().

I originally used can_expose_seamldr() because I was focused on the
VMCS-clobbering erratum, where SEAMLDR sysfs cannot be exposed to
userspace.

^ permalink raw reply

* Re: [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf
From: Alexey Kardashevskiy @ 2026-05-06  2:35 UTC (permalink / raw)
  To: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
	sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
	jgg, vivek.kasireddy, dan.j.williams
  Cc: yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
	daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
In-Reply-To: <20250107142719.179636-5-yilun.xu@linux.intel.com>

Hi!

Let's reignite this topic.

I've been using these patches + QEMU side hacks for 6+ months. And it's been fine until I got a device where MSIX BAR is in a middle of another BAR marked as TEE in the TDISP interface report. And no trusted MSIX yet.

Every time QEMU mmaps a BAR - I request a dmabuf fd from VFIO in QEMU. Since mapping of an entire MSIX BAR is allowed by default, VFIORegion::nr_mmaps==1 and it is an entire BAR.

Problem: KVM memslot mismatches the dmabuf fd size
How: QEMU emulates MSIX table and PBA by adding emulated MemoryRegions on top of the mapped BARs. In the QEMU memory flatview this splits the BAR into 2 or 3 sections (2 if MSIX at the start/end, 3 if in a middle). QEMU tries registering 1 or 2 KVM memory slots for regions outside of MSIX which fails in kvm_vfio_dmabuf_bind() as regions are smaller than the exported dmabuf fd (which is the entire BAR == 32KB).

Solution1: use QEMU x-msix-relocation hack to move MSIX elsewhere - end of some BAR (doubles the bar size so problematic for huge BARs like 512GB+), or another BAR (there may be no available as 3 of 64bit BARs is the limit).

Solution2: modify logic in VFIO dmabuf to allow multiple KVM memory slots per dmabuf. Now it is kvm_memory_slot::dmabuf_attach with no offset into the dmabuf and one kvm_vfio_dmabuf per dma_buf.

Solution3: hack QEMU into smashing a MSIX-containing BAR in vfio_pci_fixup_msix_region. Here is what it does:

0000380004000000-0000380004002fff (prio 0, ramd): 0000:41:04.0 BAR 4 mmaps[0] KVM
0000380004003000-0000380004006fff (prio 0, i/o): msix-table
0000380004007000-0000380004007fff (prio 0, ramd): 0000:41:04.0 BAR 4 mmaps[1] KVM

The problem now is that the TDI report must have the same split of the BAR as the VM is going to validate TEE (==trusted) MMIO ranges and this has to match the QEMU view. Harder than it sounds as the size of MSIX table in bytes is not exactly specified anywhere except the report.

Solution4: the above + modify QEMU to check the TDI report for not reporting MSIX+PBA where QEMU emulates them. The problem is that when QEMU adds emulated MRs - there is no report yet (the report is created later on, some TDISP witchery). So when the device is accepted - QEMU could reinitialize those emulated msix and pba MRs to match the report exactly so the flatview generates correct KVM memory regions and then KVM can work with dmabuf fine.

Any thoughts? what is acceptable for everybody? Thanks,



On 8/1/25 01:27, Xu Yilun wrote:
> From: Vivek Kasireddy <vivek.kasireddy@intel.com>
> 
> This is a reduced version of Vivek's series [1]. Removed the
> dma_buf_ops.attach/map/unmap_dma_buf/mmap() as they are not necessary
> in this series, also because of the WIP p2p dma mapping opens [2]. Just
> focus on the private MMIO get PFN function in this early stage.
> 
>>From Jason Gunthorpe:
> "dma-buf has become a way to safely acquire a handle to non-struct page
> memory that can still have lifetime controlled by the exporter. Notably
> RDMA can now import dma-buf FDs and build them into MRs which allows for
> PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
> from PCI device BARs.
> 
> The patch design loosely follows the pattern in commit
> db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this
> does not support pinning.
> 
> Instead, this implements what, in the past, we've called a revocable
> attachment using move. In normal situations the attachment is pinned, as a
> BAR does not change physical address. However when the VFIO device is
> closed, or a PCI reset is issued, access to the MMIO memory is revoked.
> 
> Revoked means that move occurs, but an attempt to immediately re-map the
> memory will fail. In the reset case a future move will be triggered when
> MMIO access returns. As both close and reset are under userspace control
> it is expected that userspace will suspend use of the dma-buf before doing
> these operations, the revoke is purely for kernel self-defense against a
> hostile userspace."
> 
> [1] https://lore.kernel.org/kvm/20240624065552.1572580-4-vivek.kasireddy@intel.com/
> [2] https://lore.kernel.org/all/IA0PR11MB7185FDD56CFDD0A2B8D21468F83B2@IA0PR11MB7185.namprd11.prod.outlook.com/
> 
> Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
>   drivers/vfio/pci/Makefile          |   1 +
>   drivers/vfio/pci/dma_buf.c         | 223 +++++++++++++++++++++++++++++
>   drivers/vfio/pci/vfio_pci_config.c |  22 ++-
>   drivers/vfio/pci/vfio_pci_core.c   |  20 ++-
>   drivers/vfio/pci/vfio_pci_priv.h   |  25 ++++
>   include/linux/vfio_pci_core.h      |   1 +
>   include/uapi/linux/vfio.h          |  29 ++++
>   7 files changed, 316 insertions(+), 5 deletions(-)
>   create mode 100644 drivers/vfio/pci/dma_buf.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index cf00c0a7e55c..0cfdc9ede82f 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -2,6 +2,7 @@
>   
>   vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>   vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
> +vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o
>   obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>   
>   vfio-pci-y := vfio_pci.o
> diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
> new file mode 100644
> index 000000000000..1d5f46744922
> --- /dev/null
> +++ b/drivers/vfio/pci/dma_buf.c
> @@ -0,0 +1,223 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
> + */
> +#include <linux/dma-buf.h>
> +#include <linux/dma-resv.h>
> +
> +#include "vfio_pci_priv.h"
> +
> +MODULE_IMPORT_NS("DMA_BUF");
> +
> +struct vfio_pci_dma_buf {
> +	struct dma_buf *dmabuf;
> +	struct vfio_pci_core_device *vdev;
> +	struct list_head dmabufs_elm;
> +	unsigned int nr_ranges;
> +	struct vfio_region_dma_range *dma_ranges;
> +	bool revoked;
> +};
> +
> +static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
> +{
> +}
> +
> +static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
> +{
> +	/*
> +	 * Uses the dynamic interface but must always allow for
> +	 * dma_buf_move_notify() to do revoke
> +	 */
> +	return -EINVAL;
> +}
> +
> +static int vfio_pci_dma_buf_get_pfn(struct dma_buf_attachment *attachment,
> +				    pgoff_t pgoff, u64 *pfn, int *max_order)
> +{
> +	/* TODO */
> +	return -EOPNOTSUPP;
> +}
> +
> +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	/*
> +	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
> +	 * The refcount prevents both.
> +	 */
> +	if (priv->vdev) {
> +		down_write(&priv->vdev->memory_lock);
> +		list_del_init(&priv->dmabufs_elm);
> +		up_write(&priv->vdev->memory_lock);
> +		vfio_device_put_registration(&priv->vdev->vdev);
> +	}
> +	kfree(priv);
> +}
> +
> +static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
> +	.pin = vfio_pci_dma_buf_pin,
> +	.unpin = vfio_pci_dma_buf_unpin,
> +	.get_pfn = vfio_pci_dma_buf_get_pfn,
> +	.release = vfio_pci_dma_buf_release,
> +};
> +
> +static int check_dma_ranges(struct vfio_pci_dma_buf *priv, u64 *dmabuf_size)
> +{
> +	struct vfio_region_dma_range *dma_ranges = priv->dma_ranges;
> +	struct pci_dev *pdev = priv->vdev->pdev;
> +	resource_size_t bar_size;
> +	int i;
> +
> +	for (i = 0; i < priv->nr_ranges; i++) {
> +		/*
> +		 * For PCI the region_index is the BAR number like
> +		 * everything else.
> +		 */
> +		if (dma_ranges[i].region_index >= VFIO_PCI_ROM_REGION_INDEX)
> +			return -EINVAL;
> +
> +		bar_size = pci_resource_len(pdev, dma_ranges[i].region_index);
> +		if (!bar_size)
> +			return -EINVAL;
> +
> +		if (!dma_ranges[i].offset && !dma_ranges[i].length)
> +			dma_ranges[i].length = bar_size;
> +
> +		if (!IS_ALIGNED(dma_ranges[i].offset, PAGE_SIZE) ||
> +		    !IS_ALIGNED(dma_ranges[i].length, PAGE_SIZE) ||
> +		    dma_ranges[i].length > bar_size ||
> +		    dma_ranges[i].offset >= bar_size ||
> +		    dma_ranges[i].offset + dma_ranges[i].length > bar_size)
> +			return -EINVAL;
> +
> +		*dmabuf_size += dma_ranges[i].length;
> +	}
> +
> +	return 0;
> +}
> +
> +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf __user *arg,
> +				  size_t argsz)
> +{
> +	struct vfio_device_feature_dma_buf get_dma_buf;
> +	struct vfio_region_dma_range *dma_ranges;
> +	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> +	struct vfio_pci_dma_buf *priv;
> +	u64 dmabuf_size = 0;
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(get_dma_buf));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> +		return -EFAULT;
> +
> +	dma_ranges = memdup_array_user(&arg->dma_ranges,
> +				       get_dma_buf.nr_ranges,
> +				       sizeof(*dma_ranges));
> +	if (IS_ERR(dma_ranges))
> +		return PTR_ERR(dma_ranges);
> +
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	if (!priv) {
> +		kfree(dma_ranges);
> +		return -ENOMEM;
> +	}
> +
> +	priv->vdev = vdev;
> +	priv->nr_ranges = get_dma_buf.nr_ranges;
> +	priv->dma_ranges = dma_ranges;
> +
> +	ret = check_dma_ranges(priv, &dmabuf_size);
> +	if (ret)
> +		goto err_free_priv;
> +
> +	if (!vfio_device_try_get_registration(&vdev->vdev)) {
> +		ret = -ENODEV;
> +		goto err_free_priv;
> +	}
> +
> +	exp_info.ops = &vfio_pci_dmabuf_ops;
> +	exp_info.size = dmabuf_size;
> +	exp_info.flags = get_dma_buf.open_flags;
> +	exp_info.priv = priv;
> +
> +	priv->dmabuf = dma_buf_export(&exp_info);
> +	if (IS_ERR(priv->dmabuf)) {
> +		ret = PTR_ERR(priv->dmabuf);
> +		goto err_dev_put;
> +	}
> +
> +	/* dma_buf_put() now frees priv */
> +	INIT_LIST_HEAD(&priv->dmabufs_elm);
> +	down_write(&vdev->memory_lock);
> +	dma_resv_lock(priv->dmabuf->resv, NULL);
> +	priv->revoked = !__vfio_pci_memory_enabled(vdev);
> +	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> +	dma_resv_unlock(priv->dmabuf->resv);
> +	up_write(&vdev->memory_lock);
> +
> +	/*
> +	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
> +	 * will be released.
> +	 */
> +	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
> +
> +err_dev_put:
> +	vfio_device_put_registration(&vdev->vdev);
> +err_free_priv:
> +	kfree(dma_ranges);
> +	kfree(priv);
> +	return ret;
> +}
> +
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	struct vfio_pci_dma_buf *tmp;
> +
> +	lockdep_assert_held_write(&vdev->memory_lock);
> +
> +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +		/*
> +		 * Returns true if a reference was successfully obtained.
> +		 * The caller must interlock with the dmabuf's release
> +		 * function in some way, such as RCU, to ensure that this
> +		 * is not called on freed memory.
> +		 */
> +		if (!get_file_rcu(&priv->dmabuf->file))
> +			continue;
> +
> +		if (priv->revoked != revoked) {
> +			dma_resv_lock(priv->dmabuf->resv, NULL);
> +			priv->revoked = revoked;
> +			dma_buf_move_notify(priv->dmabuf);
> +			dma_resv_unlock(priv->dmabuf->resv);
> +		}
> +		dma_buf_put(priv->dmabuf);
> +	}
> +}
> +
> +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	struct vfio_pci_dma_buf *tmp;
> +
> +	down_write(&vdev->memory_lock);
> +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +		if (!get_file_rcu(&priv->dmabuf->file))
> +			continue;
> +		dma_resv_lock(priv->dmabuf->resv, NULL);
> +		list_del_init(&priv->dmabufs_elm);
> +		priv->vdev = NULL;
> +		priv->revoked = true;
> +		dma_buf_move_notify(priv->dmabuf);
> +		dma_resv_unlock(priv->dmabuf->resv);
> +		vfio_device_put_registration(&vdev->vdev);
> +		dma_buf_put(priv->dmabuf);
> +	}
> +	up_write(&vdev->memory_lock);
> +}
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index ea2745c1ac5e..5cc200e15edc 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
>   		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
>   		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
>   
> -		if (!new_mem)
> +		if (!new_mem) {
>   			vfio_pci_zap_and_down_write_memory_lock(vdev);
> -		else
> +			vfio_pci_dma_buf_move(vdev, true);
> +		} else {
>   			down_write(&vdev->memory_lock);
> +		}
>   
>   		/*
>   		 * If the user is writing mem/io enable (new_mem/io) and we
> @@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
>   		*virt_cmd &= cpu_to_le16(~mask);
>   		*virt_cmd |= cpu_to_le16(new_cmd & mask);
>   
> +		if (__vfio_pci_memory_enabled(vdev))
> +			vfio_pci_dma_buf_move(vdev, false);
>   		up_write(&vdev->memory_lock);
>   	}
>   
> @@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
>   static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
>   					  pci_power_t state)
>   {
> -	if (state >= PCI_D3hot)
> +	if (state >= PCI_D3hot) {
>   		vfio_pci_zap_and_down_write_memory_lock(vdev);
> -	else
> +		vfio_pci_dma_buf_move(vdev, true);
> +	} else {
>   		down_write(&vdev->memory_lock);
> +	}
>   
>   	vfio_pci_set_power_state(vdev, state);
> +	if (__vfio_pci_memory_enabled(vdev))
> +		vfio_pci_dma_buf_move(vdev, false);
>   	up_write(&vdev->memory_lock);
>   }
>   
> @@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
>   
>   		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
>   			vfio_pci_zap_and_down_write_memory_lock(vdev);
> +			vfio_pci_dma_buf_move(vdev, true);
>   			pci_try_reset_function(vdev->pdev);
> +			if (__vfio_pci_memory_enabled(vdev))
> +				vfio_pci_dma_buf_move(vdev, true);
>   			up_write(&vdev->memory_lock);
>   		}
>   	}
> @@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
>   
>   		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
>   			vfio_pci_zap_and_down_write_memory_lock(vdev);
> +			vfio_pci_dma_buf_move(vdev, true);
>   			pci_try_reset_function(vdev->pdev);
> +			if (__vfio_pci_memory_enabled(vdev))
> +				vfio_pci_dma_buf_move(vdev, true);
>   			up_write(&vdev->memory_lock);
>   		}
>   	}
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index c3269d708411..f69eda5956ad 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -287,6 +287,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
>   	 * semaphore.
>   	 */
>   	vfio_pci_zap_and_down_write_memory_lock(vdev);
> +	vfio_pci_dma_buf_move(vdev, true);
> +
>   	if (vdev->pm_runtime_engaged) {
>   		up_write(&vdev->memory_lock);
>   		return -EINVAL;
> @@ -370,6 +372,8 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
>   	 */
>   	down_write(&vdev->memory_lock);
>   	__vfio_pci_runtime_pm_exit(vdev);
> +	if (__vfio_pci_memory_enabled(vdev))
> +		vfio_pci_dma_buf_move(vdev, false);
>   	up_write(&vdev->memory_lock);
>   }
>   
> @@ -690,6 +694,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>   #endif
>   	vfio_pci_core_disable(vdev);
>   
> +	vfio_pci_dma_buf_cleanup(vdev);
> +
>   	mutex_lock(&vdev->igate);
>   	if (vdev->err_trigger) {
>   		eventfd_ctx_put(vdev->err_trigger);
> @@ -1234,7 +1240,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
>   	 */
>   	vfio_pci_set_power_state(vdev, PCI_D0);
>   
> +	vfio_pci_dma_buf_move(vdev, true);
>   	ret = pci_try_reset_function(vdev->pdev);
> +	if (__vfio_pci_memory_enabled(vdev))
> +		vfio_pci_dma_buf_move(vdev, false);
>   	up_write(&vdev->memory_lock);
>   
>   	return ret;
> @@ -1523,6 +1532,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>   		return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
>   	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>   		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_DMA_BUF:
> +		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
>   	default:
>   		return -ENOTTY;
>   	}
> @@ -2098,6 +2109,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
>   	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>   	INIT_LIST_HEAD(&vdev->ioeventfds_list);
>   	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
> +	INIT_LIST_HEAD(&vdev->dmabufs);
>   	init_rwsem(&vdev->memory_lock);
>   	xa_init(&vdev->ctx);
>   
> @@ -2480,11 +2492,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>   	 * cause the PCI config space reset without restoring the original
>   	 * state (saved locally in 'vdev->pm_save').
>   	 */
> -	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
> +	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) {
> +		vfio_pci_dma_buf_move(vdev, true);
>   		vfio_pci_set_power_state(vdev, PCI_D0);
> +	}
>   
>   	ret = pci_reset_bus(pdev);
>   
> +	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
> +		if (__vfio_pci_memory_enabled(vdev))
> +			vfio_pci_dma_buf_move(vdev, false);
> +
>   	vdev = list_last_entry(&dev_set->device_list,
>   			       struct vfio_pci_core_device, vdev.dev_set_list);
>   
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index 5e4fa69aee16..d27f383f3931 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -101,4 +101,29 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>   	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>   }
>   
> +#ifdef CONFIG_DMA_SHARED_BUFFER
> +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf __user *arg,
> +				  size_t argsz);
> +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
> +#else
> +static int
> +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +			      struct vfio_device_feature_dma_buf __user *arg,
> +			      size_t argsz)
> +{
> +	return -ENOTTY;
> +}
> +
> +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> +{
> +}
> +
> +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
> +					 bool revoked)
> +{
> +}
> +#endif
> +
>   #endif
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index fbb472dd99b3..da5d8955ae56 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -94,6 +94,7 @@ struct vfio_pci_core_device {
>   	struct vfio_pci_core_device	*sriov_pf_core_dev;
>   	struct notifier_block	nb;
>   	struct rw_semaphore	memory_lock;
> +	struct list_head	dmabufs;
>   };
>   
>   /* Will be exported for vfio pci drivers usage */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index c8dbf8219c4f..f43dfbde7352 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1458,6 +1458,35 @@ struct vfio_device_feature_bus_master {
>   };
>   #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>   
> +/**
> + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> + * regions selected.
> + *
> + * For struct struct vfio_device_feature_dma_buf, open_flags are the typical
> + * flags passed to open(2), eg O_RDWR, O_CLOEXEC, etc. nr_ranges is the total
> + * number of dma_ranges that comprise the dmabuf.
> + *
> + * For struct vfio_region_dma_range, region_index/offset/length specify a slice
> + * of the region to create the dmabuf from, if both offset & length are 0 then
> + * the whole region is used.
> + *
> + * Return: The fd number on success, -1 and errno is set on failure.
> + */
> +struct vfio_region_dma_range {
> +	__u32	region_index;
> +	__u32	__pad;
> +	__u64	offset;
> +	__u64	length;
> +};
> +
> +struct vfio_device_feature_dma_buf {
> +	__u32	open_flags;
> +	__u32	nr_ranges;
> +	struct vfio_region_dma_range dma_ranges[];
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> +
>   /* -------- API for Type1 VFIO IOMMU -------- */
>   
>   /**

-- 
Alexey


^ permalink raw reply

* Re: [PATCH v4 6/7] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Kalra, Ashish @ 2026-05-05 20:34 UTC (permalink / raw)
  To: Ackerley Tng, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgGP4ZHz9=MOGybGwe2A4XHkVF6nXnr_KdHavz1rR62U4w@mail.gmail.com>

Hello Ackerley,

On 5/1/2026 2:12 PM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@amd.com> writes:
> 
>>
>> [...snip...]
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 3f9c1aa39a0a..e0f4f8ebef68 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2942,6 +2942,8 @@ void sev_vm_destroy(struct kvm *kvm)
>>  	if (sev_snp_guest(kvm)) {
>>  		snp_guest_req_cleanup(kvm);
>>
>> +		snp_rmpopt_all_physmem();
>> +
> 
> I see this is what you suggested in [1]. The time-based batching you
> suggeested works because adding to the workqueue when there's already a
> job just does nothing. Thanks!
> 
> I think optimizing when the VM is destroyed makes sense, in most cases
> for SNP VMs, we don't expect large 1G blocks of memory to be shared
> anyway, so even if we try to RMPOPT on every conversion to private, most
> of those tries would be optimizing nothing.
> 

Yes.

Thanks,
Ashish

> I guess the remaining optimization would be to update based on only the
> range of pfns where guest_memfd has private memory, but that could be
> done in another patch series.
> 
> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
> 
> [1] https://lore.kernel.org/all/31040bb7-653a-40f9-8899-40bc852f7e1f@amd.com/
> 
>>  		/*
>>  		 * Decomission handles unbinding of the ASID. If it fails for
>>  		 * some unexpected reason, just leak the ASID.
>> --
>> 2.43.0

^ permalink raw reply

* Re: [PATCH v4 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Kalra, Ashish @ 2026-05-05 20:30 UTC (permalink / raw)
  To: Ackerley Tng, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgFRJNRyUf3T9TTWr9-xt76E=Z28vSKsdZ46QK3UAEd8dA@mail.gmail.com>

Hello Ackerley,

On 5/1/2026 1:57 PM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@amd.com> writes:
> 
>>
>> [...snip...]
>>
>> +/*
>> + * 'val' is a system physical address.
>> + */
>> +static void rmpopt_smp(void *val)
>> +{
>> +	u64 rax = ALIGN_DOWN((u64)val, SZ_1G);
>> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
>> +
>> +	__rmpopt(rax, rcx);
>> +}
>> +
>> +static void rmpopt(u64 pa)
>> +{
>> +	u64 rax = ALIGN_DOWN(pa, SZ_1G);
>> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
>> +
>> +	__rmpopt(rax, rcx);
>> +}
>> +
> 
> Could rmpopt_smp() call rmpopt() to remove duplicate code?

Yes. 

> 
>> +/*
>> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
>> + * range of memory does not contain any SNP guest memory.
>> + */
>> +static void rmpopt_work_handler(struct work_struct *work)
>> +{
>> +	bool current_cpu_cleared = false;
>> +	phys_addr_t pa;
>> +
>> +	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
>> +		rmpopt_pa_start, rmpopt_pa_end);
>> +
>> +	/*
>> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
>> +	 * reserved processor memory. The RMP scan is the most expensive
>> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
>> +	 * if they can see a cached result in the reserved processor memory.
>> +	 *
>> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>> +	 * on every other primary thread. This potentially allows the
>> +	 * followers to use the "cached" scan results to avoid repeating
>> +	 * full scans.
> 
> Out of curiosity, how does this caching work? Is it possible to do it
> once and then synchronize the cache to the other CPUs?

The first CPU does the full RMP scan and stores the result of the scan in reserved processor memory.
And other CPUs can skip the scan if they can see a cached result in the reserved processor memory.
So i believe the other CPUs would *still* have to issue the RMPOPT instruction, but then they will 
avoid the full RMP scan and use the cached results.

> 
>> +	 */
>> +
>> +	if (cpumask_test_cpu(smp_processor_id(), &rmpopt_cpumask)) {
>> +		cpumask_clear_cpu(smp_processor_id(), &rmpopt_cpumask);
>> +		current_cpu_cleared = true;
>> +	}
>> +
>> +	/* current CPU */
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>> +		rmpopt(pa);
>> +
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> +		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
>> +				 (void *)pa, true);
>> +
>> +		 /* Give a chance for other threads to run */
>> +		cond_resched();
>> +
>> +	}
>> +
>> +	if (current_cpu_cleared)
>> +		cpumask_set_cpu(smp_processor_id(), &rmpopt_cpumask);
> 
> Sashiko [1] pointed this out: after cond_resched(), this code might be
> on a different cpu so smp_processor_id() would return a different cpu,
> that would mess up the global cpumask.
> 
> Perhaps it's better to store the id on a stack? Or actually, what if we
> give on_each_cpu_mask a copy of rmpopt_cpumask with the current cpu
> unset?
> 
> [1] https://sashiko.dev/#/patchset/cover.1775874970.git.ashish.kalra%40amd.com
>

Yes, i think it makes sense to store the id on the stack.

Additionally, i will be moving the computing of the cpumask within this workitem,
so cpumask won't be global.
 
>> +}
>> +
>>  void snp_setup_rmpopt(void)
>>  {
>>  	u64 rmpopt_base;
>> @@ -568,9 +656,20 @@ void snp_setup_rmpopt(void)
>>  	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
>>  		return;
>>
>> +	/*
>> +	 * Create an RMPOPT-specific workqueue to avoid scheduling
>> +	 * RMPOPT workitem on the global system workqueue.
>> +	 */
>> +	rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
>> +	if (!rmpopt_wq) {
>> +		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
>> +		return;
>> +	}
>> +
>>  	/*
>>  	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
>> -	 * setup RMPOPT_BASE MSR.
>> +	 * setup RMPOPT_BASE MSR. Additionally only one thread per core needs
>> +	 * to issue the RMPOPT instruction.
>>  	 */
>>
>>  	for_each_online_cpu(cpu) {
>> @@ -590,6 +689,20 @@ void snp_setup_rmpopt(void)
>>  	 * up to 2 TB of system RAM on all CPUs.
>>  	 */
>>  	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
>> +
>> +	INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
>> +
>> +	rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
>> +
>> +	/* Limit memory scanning to the first 2 TB of RAM */
> 
> I think this is better phrased as "limit memory scanning to 2TB",
> 

Ok.

>> +	if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T)
>> +		rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
> 
> and then this could be
> 
>     rmpopt_pa_end = min(rmpopt_pa_end, rmpopt_pa_start + SZ_2T);
>

Yes.

Thanks,
Ashish

>> +
>> +	/*
>> +	 * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
>> +	 * optimizations on all physical memory.
>> +	 */
>> +	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
>>  }
>>  EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
>>
> 
> Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v4 3/7] x86/sev: Initialize RMPOPT configuration MSRs
From: Kalra, Ashish @ 2026-05-05 20:04 UTC (permalink / raw)
  To: Ackerley Tng, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgEC2NSROZWz8uxnOSD6t8s1KmmFrr92=e8s30PJzYhQ1g@mail.gmail.com>

Hello Ackerley,

On 5/1/2026 1:12 PM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@amd.com> writes:
> 
>>
>> [...snip...]
>>
>> +void snp_setup_rmpopt(void)
>> +{
>> +	u64 rmpopt_base;
>> +	int cpu;
>> +
>> +	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
>> +		return;
>> +
>> +	/*
>> +	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
>> +	 * setup RMPOPT_BASE MSR.
>> +	 */
>> +
>> +	for_each_online_cpu(cpu) {
> 
> Dave mentioned hotplug in v3 [1], which led me to check. From this
> series, it looks like RMPOPT won't ever be performed for newly-plugged
> cpus, is that okay?

Yes, with this current series, RMPOPT won't be performed on newly-plugged
CPUs, i was computing and storing the cpumask for threads to fire RMPOPT on,
statically during the setup code to save computing it again on the
worker thread, but i believe the simple fix for handling hotplugged CPUs
would be to compute the cpumask every time the worker thread executes.

> 
>> +		if (!topology_is_primary_thread(cpu))
>> +			continue;
>> +
>> +		cpumask_set_cpu(cpu, &rmpopt_cpumask);
> 
> nit: perhaps flipping the condition is easier to read:

Yes.

> 
>     if (topology_is_primary_thread(cpu))
>     	cpumask_set_cpu()
> 
>> +	}
>> +
>> +	rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
>> +	rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
>> +
>> +	/*
>> +	 * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
>> +	 * for RMP optimizations. Initialize the per-CPU RMPOPT table base
>> +	 * to the starting physical address to enable RMP optimizations for
>> +	 * up to 2 TB of system RAM on all CPUs.
>> +	 */
>> +	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
>> +}
> 
> [1] https://lore.kernel.org/all/ab41b1d8-e464-4ad6-ac07-7318686db10e@intel.com/
> 
>>
>> [...snip...]
>>

^ permalink raw reply

* SVSM Development Call May 6th, 2026
From: Jörg Rödel @ 2026-05-05 16:07 UTC (permalink / raw)
  To: coconut-svsm, linux-coco

Hi,

Here is the call for agenda items for this weeks SVSM development call.  Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.

We will use the LF Zoom instance. Details of the meeting  can be found in our
governance repository at:

	https://github.com/coconut-svsm/governance

The link to the COCONUT-SVSM calendar is:

	https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week

The meeting will be recorded and the recording eventually published.

Regards,

	Jörg

^ permalink raw reply

* Re: [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-05-01 22:21 UTC (permalink / raw)
  To: Michael Roth
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco, Jacob Xu, Darwin Guo
In-Reply-To: <CAEvNRgHRpvsEjtr1A_Qz3d4oMEaffTxESavrZ73Jtt6OobCwhA@mail.gmail.com>

Ackerley Tng <ackerleytng@google.com> writes:

>
> [...snip...]
>
>
> TLDR:
>
> + PRESERVE == guarantee that the process of setting memory attributes
>   doesn't change memory contents.
>     + implementation == do nothing in most cases, except -EOPNOTSUPP for
>       to-shared on TDX, since unmapping is a required part of setting
>       memory attributes to private, and a TDX side effect of unmapping
>       is zeroing memory,

-EOPNOTSUPP will only be for TDX, not SNP.

> + ZERO == guarantee that the process of setting memory attributes zeroes
>   memory contents.
>     + implementation == memset(zero) in most cases. For TDX, a future
>       optimization exists, where memset() can be skipped for pages that
>       were mapped in Secure EPTs before conversion
> + UNSPECIFIED == no guarantees
>     + implementation == guest_memfd does nothing explicitly about memory
>       contents. The implementation is pretty much the same as PRESERVE
>       except guest_memfd won't take into account vendor-specific side
>       effects of the process of conversion. Except for the test vehicle
>       KVM_X86_SW_PROTECTED_VMS, where memory is scrambled.
>

Found another use case internally for pre-finalize, SNP, to-shared,
PRESERVE, which works with the above smaller scope.

During SNP_LAUNCH_UPDATE, when inserting a CPUID page, the firmware will
check that the CPUID values would not lead to an insecure guest
state. SNP_LAUNCH_UPDATE will fail with an error and the page remains
shared in the RMP table.

Here's the proposed flow in the userspace VMM:

1. Load CPUID in shared guest_memfd memory
2. SET_MEMORY_ATTRIBUTES(PRIVATE, PRESERVE)
3. SNP_LAUNCH_UPDATE => get error since CPUID was insecure
4. SET_MEMORY_ATTRIBUTES(SHARED, PRESERVE)
5. Read shared guest_memfd memory, error if VMM disagrees
6. SET_MEMORY_ATTRIBUTES(PRIVATE, PRESERVE)
7. SNP_LAUNCH_UPDATE => successful, since CPUID is now corrected

Does that seem ok?

>>>
>>> [...snip...]
>>>

^ permalink raw reply

* Re: [PATCH v4 6/7] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Ackerley Tng @ 2026-05-01 19:12 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <0c15142ecf6689ebe31a9c0f6f331398fc04f6d2.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 3f9c1aa39a0a..e0f4f8ebef68 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2942,6 +2942,8 @@ void sev_vm_destroy(struct kvm *kvm)
>  	if (sev_snp_guest(kvm)) {
>  		snp_guest_req_cleanup(kvm);
>
> +		snp_rmpopt_all_physmem();
> +

I see this is what you suggested in [1]. The time-based batching you
suggeested works because adding to the workqueue when there's already a
job just does nothing. Thanks!

I think optimizing when the VM is destroyed makes sense, in most cases
for SNP VMs, we don't expect large 1G blocks of memory to be shared
anyway, so even if we try to RMPOPT on every conversion to private, most
of those tries would be optimizing nothing.

I guess the remaining optimization would be to update based on only the
range of pfns where guest_memfd has private memory, but that could be
done in another patch series.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

[1] https://lore.kernel.org/all/31040bb7-653a-40f9-8899-40bc852f7e1f@amd.com/

>  		/*
>  		 * Decomission handles unbinding of the ASID. If it fails for
>  		 * some unexpected reason, just leak the ASID.
> --
> 2.43.0

^ permalink raw reply

* Re: [PATCH v4 5/7] x86/sev: Add interface to re-enable RMP optimizations.
From: Ackerley Tng @ 2026-05-01 19:04 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <112cc1213834de1ac97c71314a565ba7dc4af30b.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v4 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ackerley Tng @ 2026-05-01 18:57 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <ad924b3fbe4154466195e0668604afe8e0b825ca.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> +/*
> + * 'val' is a system physical address.
> + */
> +static void rmpopt_smp(void *val)
> +{
> +	u64 rax = ALIGN_DOWN((u64)val, SZ_1G);
> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
> +
> +	__rmpopt(rax, rcx);
> +}
> +
> +static void rmpopt(u64 pa)
> +{
> +	u64 rax = ALIGN_DOWN(pa, SZ_1G);
> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
> +
> +	__rmpopt(rax, rcx);
> +}
> +

Could rmpopt_smp() call rmpopt() to remove duplicate code?

> +/*
> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
> + * range of memory does not contain any SNP guest memory.
> + */
> +static void rmpopt_work_handler(struct work_struct *work)
> +{
> +	bool current_cpu_cleared = false;
> +	phys_addr_t pa;
> +
> +	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
> +		rmpopt_pa_start, rmpopt_pa_end);
> +
> +	/*
> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
> +	 * reserved processor memory. The RMP scan is the most expensive
> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
> +	 * if they can see a cached result in the reserved processor memory.
> +	 *
> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
> +	 * on every other primary thread. This potentially allows the
> +	 * followers to use the "cached" scan results to avoid repeating
> +	 * full scans.

Out of curiosity, how does this caching work? Is it possible to do it
once and then synchronize the cache to the other CPUs?

> +	 */
> +
> +	if (cpumask_test_cpu(smp_processor_id(), &rmpopt_cpumask)) {
> +		cpumask_clear_cpu(smp_processor_id(), &rmpopt_cpumask);
> +		current_cpu_cleared = true;
> +	}
> +
> +	/* current CPU */
> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
> +		rmpopt(pa);
> +
> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> +		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
> +				 (void *)pa, true);
> +
> +		 /* Give a chance for other threads to run */
> +		cond_resched();
> +
> +	}
> +
> +	if (current_cpu_cleared)
> +		cpumask_set_cpu(smp_processor_id(), &rmpopt_cpumask);

Sashiko [1] pointed this out: after cond_resched(), this code might be
on a different cpu so smp_processor_id() would return a different cpu,
that would mess up the global cpumask.

Perhaps it's better to store the id on a stack? Or actually, what if we
give on_each_cpu_mask a copy of rmpopt_cpumask with the current cpu
unset?

[1] https://sashiko.dev/#/patchset/cover.1775874970.git.ashish.kalra%40amd.com

> +}
> +
>  void snp_setup_rmpopt(void)
>  {
>  	u64 rmpopt_base;
> @@ -568,9 +656,20 @@ void snp_setup_rmpopt(void)
>  	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
>  		return;
>
> +	/*
> +	 * Create an RMPOPT-specific workqueue to avoid scheduling
> +	 * RMPOPT workitem on the global system workqueue.
> +	 */
> +	rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
> +	if (!rmpopt_wq) {
> +		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
> +		return;
> +	}
> +
>  	/*
>  	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
> -	 * setup RMPOPT_BASE MSR.
> +	 * setup RMPOPT_BASE MSR. Additionally only one thread per core needs
> +	 * to issue the RMPOPT instruction.
>  	 */
>
>  	for_each_online_cpu(cpu) {
> @@ -590,6 +689,20 @@ void snp_setup_rmpopt(void)
>  	 * up to 2 TB of system RAM on all CPUs.
>  	 */
>  	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
> +
> +	INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
> +
> +	rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
> +
> +	/* Limit memory scanning to the first 2 TB of RAM */

I think this is better phrased as "limit memory scanning to 2TB",

> +	if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T)
> +		rmpopt_pa_end = rmpopt_pa_start + SZ_2T;

and then this could be

    rmpopt_pa_end = min(rmpopt_pa_end, rmpopt_pa_start + SZ_2T);

> +
> +	/*
> +	 * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
> +	 * optimizations on all physical memory.
> +	 */
> +	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
>  }
>  EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
>

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v4 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Ackerley Tng @ 2026-05-01 18:33 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <662ddafc74fa90be6fdf7dba09bccc53f821f328.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> +void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q)
> +{
> +	struct msr_info rv;
> +	int this_cpu;
> +
> +	memset(&rv, 0, sizeof(rv));
> +
> +	rv.msr_no = msr_no;
> +	rv.reg.q = q;
> +
> +	this_cpu = get_cpu();
> +
> +	if (cpumask_test_cpu(this_cpu, mask))
> +		__wrmsr_on_cpu(&rv);
> +
> +	smp_call_function_many(mask, __wrmsr_on_cpu, &rv, 1);
> +	put_cpu();

I think

  on_each_cpu_mask(mask, __wrmsr_on_cpu, (void *)&rv, true);

is more readable than get and put cpu.

I understand Dave wanted to remove the ugly casting earlier, but perhaps
it's okay now that the cast is wrapped in a function like this?

Just my 2c, feel free to go ahead either way.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

>
> [...snip...]
>

^ permalink raw reply

* Re: [PATCH v4 3/7] x86/sev: Initialize RMPOPT configuration MSRs
From: Ackerley Tng @ 2026-05-01 18:12 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <846263383f9b6a08fc87ce6edb931c912f68c60d.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> +void snp_setup_rmpopt(void)
> +{
> +	u64 rmpopt_base;
> +	int cpu;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
> +		return;
> +
> +	/*
> +	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
> +	 * setup RMPOPT_BASE MSR.
> +	 */
> +
> +	for_each_online_cpu(cpu) {

Dave mentioned hotplug in v3 [1], which led me to check. From this
series, it looks like RMPOPT won't ever be performed for newly-plugged
cpus, is that okay?

> +		if (!topology_is_primary_thread(cpu))
> +			continue;
> +
> +		cpumask_set_cpu(cpu, &rmpopt_cpumask);

nit: perhaps flipping the condition is easier to read:

    if (topology_is_primary_thread(cpu))
    	cpumask_set_cpu()

> +	}
> +
> +	rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
> +	rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
> +
> +	/*
> +	 * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
> +	 * for RMP optimizations. Initialize the per-CPU RMPOPT table base
> +	 * to the starting physical address to enable RMP optimizations for
> +	 * up to 2 TB of system RAM on all CPUs.
> +	 */
> +	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
> +}

[1] https://lore.kernel.org/all/ab41b1d8-e464-4ad6-ac07-7318686db10e@intel.com/

>
> [...snip...]
>

^ permalink raw reply

* Re: [PATCH v4 1/7] x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
From: Ackerley Tng @ 2026-05-01 16:37 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <77153c889934972efcfc3d210251564f29abcf51.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index dbe104df339b..bce1b2e2a35c 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -76,7 +76,7 @@
>  #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
>  #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
>  #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
> -/* Free                                 ( 3*32+ 7) */
> +#define X86_FEATURE_RMPOPT		( 3*32+ 7) /* Support for AMD RMPOPT instruction */
>  #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
>  #define X86_FEATURE_UP			( 3*32+ 9) /* "up" SMP kernel running on UP */
>  #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
> diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
> index 42c7eac0c387..7ac3818c4502 100644
> --- a/arch/x86/kernel/cpu/scattered.c
> +++ b/arch/x86/kernel/cpu/scattered.c
> @@ -65,6 +65,7 @@ static const struct cpuid_bit cpuid_bits[] = {
>  	{ X86_FEATURE_PERFMON_V2,		CPUID_EAX,  0, 0x80000022, 0 },
>  	{ X86_FEATURE_AMD_LBR_V2,		CPUID_EAX,  1, 0x80000022, 0 },
>  	{ X86_FEATURE_AMD_LBR_PMC_FREEZE,	CPUID_EAX,  2, 0x80000022, 0 },
> +	{ X86_FEATURE_RMPOPT,			CPUID_EDX,  0, 0x80000025, 0 },
>  	{ X86_FEATURE_AMD_HTR_CORES,		CPUID_EAX, 30, 0x80000026, 0 },
>  	{ 0, 0, 0, 0, 0 }
>  };

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Ackerley Tng @ 2026-05-01  0:58 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com,
	Zhao, Yan Y, dave.hansen@linux.intel.com
  Cc: Shahar, Sagi, Yamahata, Isaku, x86@kernel.org, kas@kernel.org,
	yilun.xu@linux.intel.com, bp@alien8.de, mingo@redhat.com,
	linux-kernel@vger.kernel.org, Huang, Kai, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Li, Xiaoyao, tglx@kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal
In-Reply-To: <231efd4e9267d3dbb8c63e57b3a43567c965e24a.camel@intel.com>

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Thu, 2026-04-30 at 11:17 -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>
>> >
>> > [...snip...]
>> >
>> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>> > index b24b81cea5ea..e5a37ea2d4a0 100644
>> > --- a/arch/x86/virt/vmx/tdx/tdx.c
>> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
>> > @@ -710,7 +710,7 @@ static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
>> >    * to normal kernel memory. Systems with the X86_BUG_TDX_PW_MCE erratum need to
>> >    * do the conversion explicitly via MOVDIR64B.
>> >    */
>> > -static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>> > +void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>>
>> Could this be updated to use phys_addr_t base and size_t size instead of
>> generic unsigned long?
>
> A type is a really good idea. It could look like a virtual address, despite the
> paddr in the name.
>
> I added it to the cleanup list. But I would prefer to keep this series focused
> on resolving the critical controversy around the struct page/pfn type, rather
> than adding in more things to debate. If you don't mind.
>

No worries, please go ahead :)

>>
>> >   {
>> >   	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
>> >   	unsigned long phys, end;
>> > @@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>> >   	 */
>> >   	mb();
>> >   }
>> > +EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
>> >
>> >   void tdx_quirk_reset_page(struct page *page)
>> >   {
>> > @@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
>> >   {
>> >   	struct tdx_module_args args = {};
>> >
>> > -	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
>> > +	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));
>>
>> Should mk_keyed_paddr() be updated to have a return type of phys_addr_t?
>> I guess in this case since mk_keyed_paddr() is pretty much an internal
>> function, returning u64 also makes sense to indicate that it should only
>> be used to set 64 bit registers.
>
> Yea, this is used to construct u64 inputs for seamcall args. So I think it
> should keep returning u64s. Maybe instead it would be better to have a function
> name pattern that denotes that purpose. We have some more helpers like this
> coming.
>
> But similarly, I'd like to not grow a snowballing cleanup series for this one.

Makes sense, let's go :)

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox