Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH v8 10/21] x86/virt/seamldr: Shut down the current TDX module
From: Chao Gao @ 2026-05-06  2:56 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <0523b07b-2df2-4cf4-bf98-6efe01780698@intel.com>

On Thu, Apr 30, 2026 at 11:52:50AM -0700, Dave Hansen wrote:
>On 4/27/26 08:28, Chao Gao wrote:
>>  static int do_seamldr_install_module(void *seamldr_params)
>>  {
>>  	enum module_update_state newstate, curstate = MODULE_UPDATE_START;
>> +	int cpu = smp_processor_id();
>> +	bool primary;
>>  	int ret = 0;
>>  
>> +	primary = cpumask_first(cpu_online_mask) == cpu;
>
>Isn't cpumask_first(cpu_online_mask)==0, always? I thought CPU 0 could
>never be offlined.

Not always. On x86, CPU 0 can be offlined at runtime if the kernel is booted
with the cpu0_hotplug command-line option. See cpu_hotplug.rst.

^ permalink raw reply

* Re: [PATCH v8 07/21] coco/tdx-host: Implement firmware upload sysfs ABI for TDX module updates
From: Chao Gao @ 2026-05-06  2:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kvm, linux-coco, linux-kernel, x86, binbin.wu, dave.hansen, djbw,
	ira.weiny, kai.huang, kas, nik.borisov, paulmck, pbonzini,
	reinette.chatre, rick.p.edgecombe, sagis, seanjc, tony.lindgren,
	vannapurve, vishal.l.verma, yilun.xu, xiaoyao.li, yan.y.zhao,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin
In-Reply-To: <06682111-2b63-444e-a431-b8dd0db2b0ab@intel.com>

On Wed, Apr 29, 2026 at 04:17:10PM -0700, Dave Hansen wrote:
>On 4/27/26 08:28, Chao Gao wrote:
>> Linux kernel supports two primary firmware update mechanisms:
>>   - request_firmware()
>>   - firmware upload (or fw_upload)
>
>All the stuff here is good info, but it was hard to extract the
>implementation information from the background.
>
>I think this would do:
>
>	Select fw_upload for doing TDX module updates. The process of
>	selecting among available update images is complicated and
>	nuanced. Punt the selection policy out to userspace.

Agreed. I'll add that as a TL;DR.

>
>...
>> +static int seamldr_init(struct device *dev)
>> +{
>> +	struct fw_upload *tdx_fwl;
>> +
>> +	if (!can_expose_seamldr())
>> +		return 0;
>
>can_expose_seamldr() has a not great name.
>
>Why not just have naming that says:
>
>	if (supports_runtime_update())
>		...
>
>Why abstract this out to what can or can't be exposed?

Sure, I'll use supports_runtime_update().

I originally used can_expose_seamldr() because I was focused on the
VMCS-clobbering erratum, where SEAMLDR sysfs cannot be exposed to
userspace.

^ permalink raw reply

* Re: [RFC PATCH 04/12] vfio/pci: Allow MMIO regions to be exported through dma-buf
From: Alexey Kardashevskiy @ 2026-05-06  2:35 UTC (permalink / raw)
  To: Xu Yilun, kvm, dri-devel, linux-media, linaro-mm-sig,
	sumit.semwal, christian.koenig, pbonzini, seanjc, alex.williamson,
	jgg, vivek.kasireddy, dan.j.williams
  Cc: yilun.xu, linux-coco, linux-kernel, lukas, yan.y.zhao,
	daniel.vetter, leon, baolu.lu, zhenzhong.duan, tao1.su
In-Reply-To: <20250107142719.179636-5-yilun.xu@linux.intel.com>

Hi!

Let's reignite this topic.

I've been using these patches + QEMU side hacks for 6+ months. And it's been fine until I got a device where MSIX BAR is in a middle of another BAR marked as TEE in the TDISP interface report. And no trusted MSIX yet.

Every time QEMU mmaps a BAR - I request a dmabuf fd from VFIO in QEMU. Since mapping of an entire MSIX BAR is allowed by default, VFIORegion::nr_mmaps==1 and it is an entire BAR.

Problem: KVM memslot mismatches the dmabuf fd size
How: QEMU emulates MSIX table and PBA by adding emulated MemoryRegions on top of the mapped BARs. In the QEMU memory flatview this splits the BAR into 2 or 3 sections (2 if MSIX at the start/end, 3 if in a middle). QEMU tries registering 1 or 2 KVM memory slots for regions outside of MSIX which fails in kvm_vfio_dmabuf_bind() as regions are smaller than the exported dmabuf fd (which is the entire BAR == 32KB).

Solution1: use QEMU x-msix-relocation hack to move MSIX elsewhere - end of some BAR (doubles the bar size so problematic for huge BARs like 512GB+), or another BAR (there may be no available as 3 of 64bit BARs is the limit).

Solution2: modify logic in VFIO dmabuf to allow multiple KVM memory slots per dmabuf. Now it is kvm_memory_slot::dmabuf_attach with no offset into the dmabuf and one kvm_vfio_dmabuf per dma_buf.

Solution3: hack QEMU into smashing a MSIX-containing BAR in vfio_pci_fixup_msix_region. Here is what it does:

0000380004000000-0000380004002fff (prio 0, ramd): 0000:41:04.0 BAR 4 mmaps[0] KVM
0000380004003000-0000380004006fff (prio 0, i/o): msix-table
0000380004007000-0000380004007fff (prio 0, ramd): 0000:41:04.0 BAR 4 mmaps[1] KVM

The problem now is that the TDI report must have the same split of the BAR as the VM is going to validate TEE (==trusted) MMIO ranges and this has to match the QEMU view. Harder than it sounds as the size of MSIX table in bytes is not exactly specified anywhere except the report.

Solution4: the above + modify QEMU to check the TDI report for not reporting MSIX+PBA where QEMU emulates them. The problem is that when QEMU adds emulated MRs - there is no report yet (the report is created later on, some TDISP witchery). So when the device is accepted - QEMU could reinitialize those emulated msix and pba MRs to match the report exactly so the flatview generates correct KVM memory regions and then KVM can work with dmabuf fine.

Any thoughts? what is acceptable for everybody? Thanks,



On 8/1/25 01:27, Xu Yilun wrote:
> From: Vivek Kasireddy <vivek.kasireddy@intel.com>
> 
> This is a reduced version of Vivek's series [1]. Removed the
> dma_buf_ops.attach/map/unmap_dma_buf/mmap() as they are not necessary
> in this series, also because of the WIP p2p dma mapping opens [2]. Just
> focus on the private MMIO get PFN function in this early stage.
> 
>>From Jason Gunthorpe:
> "dma-buf has become a way to safely acquire a handle to non-struct page
> memory that can still have lifetime controlled by the exporter. Notably
> RDMA can now import dma-buf FDs and build them into MRs which allows for
> PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
> from PCI device BARs.
> 
> The patch design loosely follows the pattern in commit
> db1a8dd916aa ("habanalabs: add support for dma-buf exporter") except this
> does not support pinning.
> 
> Instead, this implements what, in the past, we've called a revocable
> attachment using move. In normal situations the attachment is pinned, as a
> BAR does not change physical address. However when the VFIO device is
> closed, or a PCI reset is issued, access to the MMIO memory is revoked.
> 
> Revoked means that move occurs, but an attempt to immediately re-map the
> memory will fail. In the reset case a future move will be triggered when
> MMIO access returns. As both close and reset are under userspace control
> it is expected that userspace will suspend use of the dma-buf before doing
> these operations, the revoke is purely for kernel self-defense against a
> hostile userspace."
> 
> [1] https://lore.kernel.org/kvm/20240624065552.1572580-4-vivek.kasireddy@intel.com/
> [2] https://lore.kernel.org/all/IA0PR11MB7185FDD56CFDD0A2B8D21468F83B2@IA0PR11MB7185.namprd11.prod.outlook.com/
> 
> Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
>   drivers/vfio/pci/Makefile          |   1 +
>   drivers/vfio/pci/dma_buf.c         | 223 +++++++++++++++++++++++++++++
>   drivers/vfio/pci/vfio_pci_config.c |  22 ++-
>   drivers/vfio/pci/vfio_pci_core.c   |  20 ++-
>   drivers/vfio/pci/vfio_pci_priv.h   |  25 ++++
>   include/linux/vfio_pci_core.h      |   1 +
>   include/uapi/linux/vfio.h          |  29 ++++
>   7 files changed, 316 insertions(+), 5 deletions(-)
>   create mode 100644 drivers/vfio/pci/dma_buf.c
> 
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index cf00c0a7e55c..0cfdc9ede82f 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -2,6 +2,7 @@
>   
>   vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>   vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
> +vfio-pci-core-$(CONFIG_DMA_SHARED_BUFFER) += dma_buf.o
>   obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
>   
>   vfio-pci-y := vfio_pci.o
> diff --git a/drivers/vfio/pci/dma_buf.c b/drivers/vfio/pci/dma_buf.c
> new file mode 100644
> index 000000000000..1d5f46744922
> --- /dev/null
> +++ b/drivers/vfio/pci/dma_buf.c
> @@ -0,0 +1,223 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
> + */
> +#include <linux/dma-buf.h>
> +#include <linux/dma-resv.h>
> +
> +#include "vfio_pci_priv.h"
> +
> +MODULE_IMPORT_NS("DMA_BUF");
> +
> +struct vfio_pci_dma_buf {
> +	struct dma_buf *dmabuf;
> +	struct vfio_pci_core_device *vdev;
> +	struct list_head dmabufs_elm;
> +	unsigned int nr_ranges;
> +	struct vfio_region_dma_range *dma_ranges;
> +	bool revoked;
> +};
> +
> +static void vfio_pci_dma_buf_unpin(struct dma_buf_attachment *attachment)
> +{
> +}
> +
> +static int vfio_pci_dma_buf_pin(struct dma_buf_attachment *attachment)
> +{
> +	/*
> +	 * Uses the dynamic interface but must always allow for
> +	 * dma_buf_move_notify() to do revoke
> +	 */
> +	return -EINVAL;
> +}
> +
> +static int vfio_pci_dma_buf_get_pfn(struct dma_buf_attachment *attachment,
> +				    pgoff_t pgoff, u64 *pfn, int *max_order)
> +{
> +	/* TODO */
> +	return -EOPNOTSUPP;
> +}
> +
> +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	/*
> +	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
> +	 * The refcount prevents both.
> +	 */
> +	if (priv->vdev) {
> +		down_write(&priv->vdev->memory_lock);
> +		list_del_init(&priv->dmabufs_elm);
> +		up_write(&priv->vdev->memory_lock);
> +		vfio_device_put_registration(&priv->vdev->vdev);
> +	}
> +	kfree(priv);
> +}
> +
> +static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
> +	.pin = vfio_pci_dma_buf_pin,
> +	.unpin = vfio_pci_dma_buf_unpin,
> +	.get_pfn = vfio_pci_dma_buf_get_pfn,
> +	.release = vfio_pci_dma_buf_release,
> +};
> +
> +static int check_dma_ranges(struct vfio_pci_dma_buf *priv, u64 *dmabuf_size)
> +{
> +	struct vfio_region_dma_range *dma_ranges = priv->dma_ranges;
> +	struct pci_dev *pdev = priv->vdev->pdev;
> +	resource_size_t bar_size;
> +	int i;
> +
> +	for (i = 0; i < priv->nr_ranges; i++) {
> +		/*
> +		 * For PCI the region_index is the BAR number like
> +		 * everything else.
> +		 */
> +		if (dma_ranges[i].region_index >= VFIO_PCI_ROM_REGION_INDEX)
> +			return -EINVAL;
> +
> +		bar_size = pci_resource_len(pdev, dma_ranges[i].region_index);
> +		if (!bar_size)
> +			return -EINVAL;
> +
> +		if (!dma_ranges[i].offset && !dma_ranges[i].length)
> +			dma_ranges[i].length = bar_size;
> +
> +		if (!IS_ALIGNED(dma_ranges[i].offset, PAGE_SIZE) ||
> +		    !IS_ALIGNED(dma_ranges[i].length, PAGE_SIZE) ||
> +		    dma_ranges[i].length > bar_size ||
> +		    dma_ranges[i].offset >= bar_size ||
> +		    dma_ranges[i].offset + dma_ranges[i].length > bar_size)
> +			return -EINVAL;
> +
> +		*dmabuf_size += dma_ranges[i].length;
> +	}
> +
> +	return 0;
> +}
> +
> +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf __user *arg,
> +				  size_t argsz)
> +{
> +	struct vfio_device_feature_dma_buf get_dma_buf;
> +	struct vfio_region_dma_range *dma_ranges;
> +	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> +	struct vfio_pci_dma_buf *priv;
> +	u64 dmabuf_size = 0;
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(get_dma_buf));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> +		return -EFAULT;
> +
> +	dma_ranges = memdup_array_user(&arg->dma_ranges,
> +				       get_dma_buf.nr_ranges,
> +				       sizeof(*dma_ranges));
> +	if (IS_ERR(dma_ranges))
> +		return PTR_ERR(dma_ranges);
> +
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	if (!priv) {
> +		kfree(dma_ranges);
> +		return -ENOMEM;
> +	}
> +
> +	priv->vdev = vdev;
> +	priv->nr_ranges = get_dma_buf.nr_ranges;
> +	priv->dma_ranges = dma_ranges;
> +
> +	ret = check_dma_ranges(priv, &dmabuf_size);
> +	if (ret)
> +		goto err_free_priv;
> +
> +	if (!vfio_device_try_get_registration(&vdev->vdev)) {
> +		ret = -ENODEV;
> +		goto err_free_priv;
> +	}
> +
> +	exp_info.ops = &vfio_pci_dmabuf_ops;
> +	exp_info.size = dmabuf_size;
> +	exp_info.flags = get_dma_buf.open_flags;
> +	exp_info.priv = priv;
> +
> +	priv->dmabuf = dma_buf_export(&exp_info);
> +	if (IS_ERR(priv->dmabuf)) {
> +		ret = PTR_ERR(priv->dmabuf);
> +		goto err_dev_put;
> +	}
> +
> +	/* dma_buf_put() now frees priv */
> +	INIT_LIST_HEAD(&priv->dmabufs_elm);
> +	down_write(&vdev->memory_lock);
> +	dma_resv_lock(priv->dmabuf->resv, NULL);
> +	priv->revoked = !__vfio_pci_memory_enabled(vdev);
> +	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> +	dma_resv_unlock(priv->dmabuf->resv);
> +	up_write(&vdev->memory_lock);
> +
> +	/*
> +	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
> +	 * will be released.
> +	 */
> +	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
> +
> +err_dev_put:
> +	vfio_device_put_registration(&vdev->vdev);
> +err_free_priv:
> +	kfree(dma_ranges);
> +	kfree(priv);
> +	return ret;
> +}
> +
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	struct vfio_pci_dma_buf *tmp;
> +
> +	lockdep_assert_held_write(&vdev->memory_lock);
> +
> +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +		/*
> +		 * Returns true if a reference was successfully obtained.
> +		 * The caller must interlock with the dmabuf's release
> +		 * function in some way, such as RCU, to ensure that this
> +		 * is not called on freed memory.
> +		 */
> +		if (!get_file_rcu(&priv->dmabuf->file))
> +			continue;
> +
> +		if (priv->revoked != revoked) {
> +			dma_resv_lock(priv->dmabuf->resv, NULL);
> +			priv->revoked = revoked;
> +			dma_buf_move_notify(priv->dmabuf);
> +			dma_resv_unlock(priv->dmabuf->resv);
> +		}
> +		dma_buf_put(priv->dmabuf);
> +	}
> +}
> +
> +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	struct vfio_pci_dma_buf *tmp;
> +
> +	down_write(&vdev->memory_lock);
> +	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +		if (!get_file_rcu(&priv->dmabuf->file))
> +			continue;
> +		dma_resv_lock(priv->dmabuf->resv, NULL);
> +		list_del_init(&priv->dmabufs_elm);
> +		priv->vdev = NULL;
> +		priv->revoked = true;
> +		dma_buf_move_notify(priv->dmabuf);
> +		dma_resv_unlock(priv->dmabuf->resv);
> +		vfio_device_put_registration(&vdev->vdev);
> +		dma_buf_put(priv->dmabuf);
> +	}
> +	up_write(&vdev->memory_lock);
> +}
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index ea2745c1ac5e..5cc200e15edc 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
>   		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
>   		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
>   
> -		if (!new_mem)
> +		if (!new_mem) {
>   			vfio_pci_zap_and_down_write_memory_lock(vdev);
> -		else
> +			vfio_pci_dma_buf_move(vdev, true);
> +		} else {
>   			down_write(&vdev->memory_lock);
> +		}
>   
>   		/*
>   		 * If the user is writing mem/io enable (new_mem/io) and we
> @@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
>   		*virt_cmd &= cpu_to_le16(~mask);
>   		*virt_cmd |= cpu_to_le16(new_cmd & mask);
>   
> +		if (__vfio_pci_memory_enabled(vdev))
> +			vfio_pci_dma_buf_move(vdev, false);
>   		up_write(&vdev->memory_lock);
>   	}
>   
> @@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
>   static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
>   					  pci_power_t state)
>   {
> -	if (state >= PCI_D3hot)
> +	if (state >= PCI_D3hot) {
>   		vfio_pci_zap_and_down_write_memory_lock(vdev);
> -	else
> +		vfio_pci_dma_buf_move(vdev, true);
> +	} else {
>   		down_write(&vdev->memory_lock);
> +	}
>   
>   	vfio_pci_set_power_state(vdev, state);
> +	if (__vfio_pci_memory_enabled(vdev))
> +		vfio_pci_dma_buf_move(vdev, false);
>   	up_write(&vdev->memory_lock);
>   }
>   
> @@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
>   
>   		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
>   			vfio_pci_zap_and_down_write_memory_lock(vdev);
> +			vfio_pci_dma_buf_move(vdev, true);
>   			pci_try_reset_function(vdev->pdev);
> +			if (__vfio_pci_memory_enabled(vdev))
> +				vfio_pci_dma_buf_move(vdev, true);
>   			up_write(&vdev->memory_lock);
>   		}
>   	}
> @@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
>   
>   		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
>   			vfio_pci_zap_and_down_write_memory_lock(vdev);
> +			vfio_pci_dma_buf_move(vdev, true);
>   			pci_try_reset_function(vdev->pdev);
> +			if (__vfio_pci_memory_enabled(vdev))
> +				vfio_pci_dma_buf_move(vdev, true);
>   			up_write(&vdev->memory_lock);
>   		}
>   	}
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index c3269d708411..f69eda5956ad 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -287,6 +287,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
>   	 * semaphore.
>   	 */
>   	vfio_pci_zap_and_down_write_memory_lock(vdev);
> +	vfio_pci_dma_buf_move(vdev, true);
> +
>   	if (vdev->pm_runtime_engaged) {
>   		up_write(&vdev->memory_lock);
>   		return -EINVAL;
> @@ -370,6 +372,8 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
>   	 */
>   	down_write(&vdev->memory_lock);
>   	__vfio_pci_runtime_pm_exit(vdev);
> +	if (__vfio_pci_memory_enabled(vdev))
> +		vfio_pci_dma_buf_move(vdev, false);
>   	up_write(&vdev->memory_lock);
>   }
>   
> @@ -690,6 +694,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>   #endif
>   	vfio_pci_core_disable(vdev);
>   
> +	vfio_pci_dma_buf_cleanup(vdev);
> +
>   	mutex_lock(&vdev->igate);
>   	if (vdev->err_trigger) {
>   		eventfd_ctx_put(vdev->err_trigger);
> @@ -1234,7 +1240,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
>   	 */
>   	vfio_pci_set_power_state(vdev, PCI_D0);
>   
> +	vfio_pci_dma_buf_move(vdev, true);
>   	ret = pci_try_reset_function(vdev->pdev);
> +	if (__vfio_pci_memory_enabled(vdev))
> +		vfio_pci_dma_buf_move(vdev, false);
>   	up_write(&vdev->memory_lock);
>   
>   	return ret;
> @@ -1523,6 +1532,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>   		return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
>   	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>   		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_DMA_BUF:
> +		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
>   	default:
>   		return -ENOTTY;
>   	}
> @@ -2098,6 +2109,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
>   	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>   	INIT_LIST_HEAD(&vdev->ioeventfds_list);
>   	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
> +	INIT_LIST_HEAD(&vdev->dmabufs);
>   	init_rwsem(&vdev->memory_lock);
>   	xa_init(&vdev->ctx);
>   
> @@ -2480,11 +2492,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>   	 * cause the PCI config space reset without restoring the original
>   	 * state (saved locally in 'vdev->pm_save').
>   	 */
> -	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
> +	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) {
> +		vfio_pci_dma_buf_move(vdev, true);
>   		vfio_pci_set_power_state(vdev, PCI_D0);
> +	}
>   
>   	ret = pci_reset_bus(pdev);
>   
> +	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
> +		if (__vfio_pci_memory_enabled(vdev))
> +			vfio_pci_dma_buf_move(vdev, false);
> +
>   	vdev = list_last_entry(&dev_set->device_list,
>   			       struct vfio_pci_core_device, vdev.dev_set_list);
>   
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index 5e4fa69aee16..d27f383f3931 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -101,4 +101,29 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>   	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
>   }
>   
> +#ifdef CONFIG_DMA_SHARED_BUFFER
> +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +				  struct vfio_device_feature_dma_buf __user *arg,
> +				  size_t argsz);
> +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
> +#else
> +static int
> +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> +			      struct vfio_device_feature_dma_buf __user *arg,
> +			      size_t argsz)
> +{
> +	return -ENOTTY;
> +}
> +
> +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> +{
> +}
> +
> +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
> +					 bool revoked)
> +{
> +}
> +#endif
> +
>   #endif
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index fbb472dd99b3..da5d8955ae56 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -94,6 +94,7 @@ struct vfio_pci_core_device {
>   	struct vfio_pci_core_device	*sriov_pf_core_dev;
>   	struct notifier_block	nb;
>   	struct rw_semaphore	memory_lock;
> +	struct list_head	dmabufs;
>   };
>   
>   /* Will be exported for vfio pci drivers usage */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index c8dbf8219c4f..f43dfbde7352 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1458,6 +1458,35 @@ struct vfio_device_feature_bus_master {
>   };
>   #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>   
> +/**
> + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> + * regions selected.
> + *
> + * For struct struct vfio_device_feature_dma_buf, open_flags are the typical
> + * flags passed to open(2), eg O_RDWR, O_CLOEXEC, etc. nr_ranges is the total
> + * number of dma_ranges that comprise the dmabuf.
> + *
> + * For struct vfio_region_dma_range, region_index/offset/length specify a slice
> + * of the region to create the dmabuf from, if both offset & length are 0 then
> + * the whole region is used.
> + *
> + * Return: The fd number on success, -1 and errno is set on failure.
> + */
> +struct vfio_region_dma_range {
> +	__u32	region_index;
> +	__u32	__pad;
> +	__u64	offset;
> +	__u64	length;
> +};
> +
> +struct vfio_device_feature_dma_buf {
> +	__u32	open_flags;
> +	__u32	nr_ranges;
> +	struct vfio_region_dma_range dma_ranges[];
> +};
> +
> +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> +
>   /* -------- API for Type1 VFIO IOMMU -------- */
>   
>   /**

-- 
Alexey


^ permalink raw reply

* Re: [PATCH v4 6/7] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Kalra, Ashish @ 2026-05-05 20:34 UTC (permalink / raw)
  To: Ackerley Tng, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgGP4ZHz9=MOGybGwe2A4XHkVF6nXnr_KdHavz1rR62U4w@mail.gmail.com>

Hello Ackerley,

On 5/1/2026 2:12 PM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@amd.com> writes:
> 
>>
>> [...snip...]
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 3f9c1aa39a0a..e0f4f8ebef68 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2942,6 +2942,8 @@ void sev_vm_destroy(struct kvm *kvm)
>>  	if (sev_snp_guest(kvm)) {
>>  		snp_guest_req_cleanup(kvm);
>>
>> +		snp_rmpopt_all_physmem();
>> +
> 
> I see this is what you suggested in [1]. The time-based batching you
> suggeested works because adding to the workqueue when there's already a
> job just does nothing. Thanks!
> 
> I think optimizing when the VM is destroyed makes sense, in most cases
> for SNP VMs, we don't expect large 1G blocks of memory to be shared
> anyway, so even if we try to RMPOPT on every conversion to private, most
> of those tries would be optimizing nothing.
> 

Yes.

Thanks,
Ashish

> I guess the remaining optimization would be to update based on only the
> range of pfns where guest_memfd has private memory, but that could be
> done in another patch series.
> 
> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
> 
> [1] https://lore.kernel.org/all/31040bb7-653a-40f9-8899-40bc852f7e1f@amd.com/
> 
>>  		/*
>>  		 * Decomission handles unbinding of the ASID. If it fails for
>>  		 * some unexpected reason, just leak the ASID.
>> --
>> 2.43.0

^ permalink raw reply

* Re: [PATCH v4 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Kalra, Ashish @ 2026-05-05 20:30 UTC (permalink / raw)
  To: Ackerley Tng, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgFRJNRyUf3T9TTWr9-xt76E=Z28vSKsdZ46QK3UAEd8dA@mail.gmail.com>

Hello Ackerley,

On 5/1/2026 1:57 PM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@amd.com> writes:
> 
>>
>> [...snip...]
>>
>> +/*
>> + * 'val' is a system physical address.
>> + */
>> +static void rmpopt_smp(void *val)
>> +{
>> +	u64 rax = ALIGN_DOWN((u64)val, SZ_1G);
>> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
>> +
>> +	__rmpopt(rax, rcx);
>> +}
>> +
>> +static void rmpopt(u64 pa)
>> +{
>> +	u64 rax = ALIGN_DOWN(pa, SZ_1G);
>> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
>> +
>> +	__rmpopt(rax, rcx);
>> +}
>> +
> 
> Could rmpopt_smp() call rmpopt() to remove duplicate code?

Yes. 

> 
>> +/*
>> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
>> + * range of memory does not contain any SNP guest memory.
>> + */
>> +static void rmpopt_work_handler(struct work_struct *work)
>> +{
>> +	bool current_cpu_cleared = false;
>> +	phys_addr_t pa;
>> +
>> +	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
>> +		rmpopt_pa_start, rmpopt_pa_end);
>> +
>> +	/*
>> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
>> +	 * reserved processor memory. The RMP scan is the most expensive
>> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
>> +	 * if they can see a cached result in the reserved processor memory.
>> +	 *
>> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
>> +	 * on every other primary thread. This potentially allows the
>> +	 * followers to use the "cached" scan results to avoid repeating
>> +	 * full scans.
> 
> Out of curiosity, how does this caching work? Is it possible to do it
> once and then synchronize the cache to the other CPUs?

The first CPU does the full RMP scan and stores the result of the scan in reserved processor memory.
And other CPUs can skip the scan if they can see a cached result in the reserved processor memory.
So i believe the other CPUs would *still* have to issue the RMPOPT instruction, but then they will 
avoid the full RMP scan and use the cached results.

> 
>> +	 */
>> +
>> +	if (cpumask_test_cpu(smp_processor_id(), &rmpopt_cpumask)) {
>> +		cpumask_clear_cpu(smp_processor_id(), &rmpopt_cpumask);
>> +		current_cpu_cleared = true;
>> +	}
>> +
>> +	/* current CPU */
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
>> +		rmpopt(pa);
>> +
>> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
>> +		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
>> +				 (void *)pa, true);
>> +
>> +		 /* Give a chance for other threads to run */
>> +		cond_resched();
>> +
>> +	}
>> +
>> +	if (current_cpu_cleared)
>> +		cpumask_set_cpu(smp_processor_id(), &rmpopt_cpumask);
> 
> Sashiko [1] pointed this out: after cond_resched(), this code might be
> on a different cpu so smp_processor_id() would return a different cpu,
> that would mess up the global cpumask.
> 
> Perhaps it's better to store the id on a stack? Or actually, what if we
> give on_each_cpu_mask a copy of rmpopt_cpumask with the current cpu
> unset?
> 
> [1] https://sashiko.dev/#/patchset/cover.1775874970.git.ashish.kalra%40amd.com
>

Yes, i think it makes sense to store the id on the stack.

Additionally, i will be moving the computing of the cpumask within this workitem,
so cpumask won't be global.
 
>> +}
>> +
>>  void snp_setup_rmpopt(void)
>>  {
>>  	u64 rmpopt_base;
>> @@ -568,9 +656,20 @@ void snp_setup_rmpopt(void)
>>  	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
>>  		return;
>>
>> +	/*
>> +	 * Create an RMPOPT-specific workqueue to avoid scheduling
>> +	 * RMPOPT workitem on the global system workqueue.
>> +	 */
>> +	rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
>> +	if (!rmpopt_wq) {
>> +		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
>> +		return;
>> +	}
>> +
>>  	/*
>>  	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
>> -	 * setup RMPOPT_BASE MSR.
>> +	 * setup RMPOPT_BASE MSR. Additionally only one thread per core needs
>> +	 * to issue the RMPOPT instruction.
>>  	 */
>>
>>  	for_each_online_cpu(cpu) {
>> @@ -590,6 +689,20 @@ void snp_setup_rmpopt(void)
>>  	 * up to 2 TB of system RAM on all CPUs.
>>  	 */
>>  	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
>> +
>> +	INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
>> +
>> +	rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
>> +
>> +	/* Limit memory scanning to the first 2 TB of RAM */
> 
> I think this is better phrased as "limit memory scanning to 2TB",
> 

Ok.

>> +	if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T)
>> +		rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
> 
> and then this could be
> 
>     rmpopt_pa_end = min(rmpopt_pa_end, rmpopt_pa_start + SZ_2T);
>

Yes.

Thanks,
Ashish

>> +
>> +	/*
>> +	 * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
>> +	 * optimizations on all physical memory.
>> +	 */
>> +	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
>>  }
>>  EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
>>
> 
> Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v4 3/7] x86/sev: Initialize RMPOPT configuration MSRs
From: Kalra, Ashish @ 2026-05-05 20:04 UTC (permalink / raw)
  To: Ackerley Tng, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <CAEvNRgEC2NSROZWz8uxnOSD6t8s1KmmFrr92=e8s30PJzYhQ1g@mail.gmail.com>

Hello Ackerley,

On 5/1/2026 1:12 PM, Ackerley Tng wrote:
> Ashish Kalra <Ashish.Kalra@amd.com> writes:
> 
>>
>> [...snip...]
>>
>> +void snp_setup_rmpopt(void)
>> +{
>> +	u64 rmpopt_base;
>> +	int cpu;
>> +
>> +	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
>> +		return;
>> +
>> +	/*
>> +	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
>> +	 * setup RMPOPT_BASE MSR.
>> +	 */
>> +
>> +	for_each_online_cpu(cpu) {
> 
> Dave mentioned hotplug in v3 [1], which led me to check. From this
> series, it looks like RMPOPT won't ever be performed for newly-plugged
> cpus, is that okay?

Yes, with this current series, RMPOPT won't be performed on newly-plugged
CPUs, i was computing and storing the cpumask for threads to fire RMPOPT on,
statically during the setup code to save computing it again on the
worker thread, but i believe the simple fix for handling hotplugged CPUs
would be to compute the cpumask every time the worker thread executes.

> 
>> +		if (!topology_is_primary_thread(cpu))
>> +			continue;
>> +
>> +		cpumask_set_cpu(cpu, &rmpopt_cpumask);
> 
> nit: perhaps flipping the condition is easier to read:

Yes.

> 
>     if (topology_is_primary_thread(cpu))
>     	cpumask_set_cpu()
> 
>> +	}
>> +
>> +	rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
>> +	rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
>> +
>> +	/*
>> +	 * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
>> +	 * for RMP optimizations. Initialize the per-CPU RMPOPT table base
>> +	 * to the starting physical address to enable RMP optimizations for
>> +	 * up to 2 TB of system RAM on all CPUs.
>> +	 */
>> +	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
>> +}
> 
> [1] https://lore.kernel.org/all/ab41b1d8-e464-4ad6-ac07-7318686db10e@intel.com/
> 
>>
>> [...snip...]
>>

^ permalink raw reply

* SVSM Development Call May 6th, 2026
From: Jörg Rödel @ 2026-05-05 16:07 UTC (permalink / raw)
  To: coconut-svsm, linux-coco

Hi,

Here is the call for agenda items for this weeks SVSM development call.  Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.

We will use the LF Zoom instance. Details of the meeting  can be found in our
governance repository at:

	https://github.com/coconut-svsm/governance

The link to the COCONUT-SVSM calendar is:

	https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week

The meeting will be recorded and the recording eventually published.

Regards,

	Jörg

^ permalink raw reply

* Re: [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-05-01 22:21 UTC (permalink / raw)
  To: Michael Roth
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco, Jacob Xu, Darwin Guo
In-Reply-To: <CAEvNRgHRpvsEjtr1A_Qz3d4oMEaffTxESavrZ73Jtt6OobCwhA@mail.gmail.com>

Ackerley Tng <ackerleytng@google.com> writes:

>
> [...snip...]
>
>
> TLDR:
>
> + PRESERVE == guarantee that the process of setting memory attributes
>   doesn't change memory contents.
>     + implementation == do nothing in most cases, except -EOPNOTSUPP for
>       to-shared on TDX, since unmapping is a required part of setting
>       memory attributes to private, and a TDX side effect of unmapping
>       is zeroing memory,

-EOPNOTSUPP will only be for TDX, not SNP.

> + ZERO == guarantee that the process of setting memory attributes zeroes
>   memory contents.
>     + implementation == memset(zero) in most cases. For TDX, a future
>       optimization exists, where memset() can be skipped for pages that
>       were mapped in Secure EPTs before conversion
> + UNSPECIFIED == no guarantees
>     + implementation == guest_memfd does nothing explicitly about memory
>       contents. The implementation is pretty much the same as PRESERVE
>       except guest_memfd won't take into account vendor-specific side
>       effects of the process of conversion. Except for the test vehicle
>       KVM_X86_SW_PROTECTED_VMS, where memory is scrambled.
>

Found another use case internally for pre-finalize, SNP, to-shared,
PRESERVE, which works with the above smaller scope.

During SNP_LAUNCH_UPDATE, when inserting a CPUID page, the firmware will
check that the CPUID values would not lead to an insecure guest
state. SNP_LAUNCH_UPDATE will fail with an error and the page remains
shared in the RMP table.

Here's the proposed flow in the userspace VMM:

1. Load CPUID in shared guest_memfd memory
2. SET_MEMORY_ATTRIBUTES(PRIVATE, PRESERVE)
3. SNP_LAUNCH_UPDATE => get error since CPUID was insecure
4. SET_MEMORY_ATTRIBUTES(SHARED, PRESERVE)
5. Read shared guest_memfd memory, error if VMM disagrees
6. SET_MEMORY_ATTRIBUTES(PRIVATE, PRESERVE)
7. SNP_LAUNCH_UPDATE => successful, since CPUID is now corrected

Does that seem ok?

>>>
>>> [...snip...]
>>>

^ permalink raw reply

* Re: [PATCH v4 6/7] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Ackerley Tng @ 2026-05-01 19:12 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <0c15142ecf6689ebe31a9c0f6f331398fc04f6d2.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 3f9c1aa39a0a..e0f4f8ebef68 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2942,6 +2942,8 @@ void sev_vm_destroy(struct kvm *kvm)
>  	if (sev_snp_guest(kvm)) {
>  		snp_guest_req_cleanup(kvm);
>
> +		snp_rmpopt_all_physmem();
> +

I see this is what you suggested in [1]. The time-based batching you
suggeested works because adding to the workqueue when there's already a
job just does nothing. Thanks!

I think optimizing when the VM is destroyed makes sense, in most cases
for SNP VMs, we don't expect large 1G blocks of memory to be shared
anyway, so even if we try to RMPOPT on every conversion to private, most
of those tries would be optimizing nothing.

I guess the remaining optimization would be to update based on only the
range of pfns where guest_memfd has private memory, but that could be
done in another patch series.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

[1] https://lore.kernel.org/all/31040bb7-653a-40f9-8899-40bc852f7e1f@amd.com/

>  		/*
>  		 * Decomission handles unbinding of the ASID. If it fails for
>  		 * some unexpected reason, just leak the ASID.
> --
> 2.43.0

^ permalink raw reply

* Re: [PATCH v4 5/7] x86/sev: Add interface to re-enable RMP optimizations.
From: Ackerley Tng @ 2026-05-01 19:04 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <112cc1213834de1ac97c71314a565ba7dc4af30b.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v4 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ackerley Tng @ 2026-05-01 18:57 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <ad924b3fbe4154466195e0668604afe8e0b825ca.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> +/*
> + * 'val' is a system physical address.
> + */
> +static void rmpopt_smp(void *val)
> +{
> +	u64 rax = ALIGN_DOWN((u64)val, SZ_1G);
> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
> +
> +	__rmpopt(rax, rcx);
> +}
> +
> +static void rmpopt(u64 pa)
> +{
> +	u64 rax = ALIGN_DOWN(pa, SZ_1G);
> +	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
> +
> +	__rmpopt(rax, rcx);
> +}
> +

Could rmpopt_smp() call rmpopt() to remove duplicate code?

> +/*
> + * RMPOPT optimizations skip RMP checks at 1GB granularity if this
> + * range of memory does not contain any SNP guest memory.
> + */
> +static void rmpopt_work_handler(struct work_struct *work)
> +{
> +	bool current_cpu_cleared = false;
> +	phys_addr_t pa;
> +
> +	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
> +		rmpopt_pa_start, rmpopt_pa_end);
> +
> +	/*
> +	 * RMPOPT scans the RMP table, stores the result of the scan in the
> +	 * reserved processor memory. The RMP scan is the most expensive
> +	 * part. If a second RMPOPT occurs, it can skip the expensive scan
> +	 * if they can see a cached result in the reserved processor memory.
> +	 *
> +	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
> +	 * on every other primary thread. This potentially allows the
> +	 * followers to use the "cached" scan results to avoid repeating
> +	 * full scans.

Out of curiosity, how does this caching work? Is it possible to do it
once and then synchronize the cache to the other CPUs?

> +	 */
> +
> +	if (cpumask_test_cpu(smp_processor_id(), &rmpopt_cpumask)) {
> +		cpumask_clear_cpu(smp_processor_id(), &rmpopt_cpumask);
> +		current_cpu_cleared = true;
> +	}
> +
> +	/* current CPU */
> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
> +		rmpopt(pa);
> +
> +	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
> +		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
> +				 (void *)pa, true);
> +
> +		 /* Give a chance for other threads to run */
> +		cond_resched();
> +
> +	}
> +
> +	if (current_cpu_cleared)
> +		cpumask_set_cpu(smp_processor_id(), &rmpopt_cpumask);

Sashiko [1] pointed this out: after cond_resched(), this code might be
on a different cpu so smp_processor_id() would return a different cpu,
that would mess up the global cpumask.

Perhaps it's better to store the id on a stack? Or actually, what if we
give on_each_cpu_mask a copy of rmpopt_cpumask with the current cpu
unset?

[1] https://sashiko.dev/#/patchset/cover.1775874970.git.ashish.kalra%40amd.com

> +}
> +
>  void snp_setup_rmpopt(void)
>  {
>  	u64 rmpopt_base;
> @@ -568,9 +656,20 @@ void snp_setup_rmpopt(void)
>  	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
>  		return;
>
> +	/*
> +	 * Create an RMPOPT-specific workqueue to avoid scheduling
> +	 * RMPOPT workitem on the global system workqueue.
> +	 */
> +	rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
> +	if (!rmpopt_wq) {
> +		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
> +		return;
> +	}
> +
>  	/*
>  	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
> -	 * setup RMPOPT_BASE MSR.
> +	 * setup RMPOPT_BASE MSR. Additionally only one thread per core needs
> +	 * to issue the RMPOPT instruction.
>  	 */
>
>  	for_each_online_cpu(cpu) {
> @@ -590,6 +689,20 @@ void snp_setup_rmpopt(void)
>  	 * up to 2 TB of system RAM on all CPUs.
>  	 */
>  	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
> +
> +	INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
> +
> +	rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
> +
> +	/* Limit memory scanning to the first 2 TB of RAM */

I think this is better phrased as "limit memory scanning to 2TB",

> +	if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T)
> +		rmpopt_pa_end = rmpopt_pa_start + SZ_2T;

and then this could be

    rmpopt_pa_end = min(rmpopt_pa_end, rmpopt_pa_start + SZ_2T);

> +
> +	/*
> +	 * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
> +	 * optimizations on all physical memory.
> +	 */
> +	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
>  }
>  EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
>

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v4 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Ackerley Tng @ 2026-05-01 18:33 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <662ddafc74fa90be6fdf7dba09bccc53f821f328.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> +void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q)
> +{
> +	struct msr_info rv;
> +	int this_cpu;
> +
> +	memset(&rv, 0, sizeof(rv));
> +
> +	rv.msr_no = msr_no;
> +	rv.reg.q = q;
> +
> +	this_cpu = get_cpu();
> +
> +	if (cpumask_test_cpu(this_cpu, mask))
> +		__wrmsr_on_cpu(&rv);
> +
> +	smp_call_function_many(mask, __wrmsr_on_cpu, &rv, 1);
> +	put_cpu();

I think

  on_each_cpu_mask(mask, __wrmsr_on_cpu, (void *)&rv, true);

is more readable than get and put cpu.

I understand Dave wanted to remove the ugly casting earlier, but perhaps
it's okay now that the cast is wrapped in a function like this?

Just my 2c, feel free to go ahead either way.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

>
> [...snip...]
>

^ permalink raw reply

* Re: [PATCH v4 3/7] x86/sev: Initialize RMPOPT configuration MSRs
From: Ackerley Tng @ 2026-05-01 18:12 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <846263383f9b6a08fc87ce6edb931c912f68c60d.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> +void snp_setup_rmpopt(void)
> +{
> +	u64 rmpopt_base;
> +	int cpu;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
> +		return;
> +
> +	/*
> +	 * RMPOPT_BASE MSR is per-core, so only one thread per core needs to
> +	 * setup RMPOPT_BASE MSR.
> +	 */
> +
> +	for_each_online_cpu(cpu) {

Dave mentioned hotplug in v3 [1], which led me to check. From this
series, it looks like RMPOPT won't ever be performed for newly-plugged
cpus, is that okay?

> +		if (!topology_is_primary_thread(cpu))
> +			continue;
> +
> +		cpumask_set_cpu(cpu, &rmpopt_cpumask);

nit: perhaps flipping the condition is easier to read:

    if (topology_is_primary_thread(cpu))
    	cpumask_set_cpu()

> +	}
> +
> +	rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
> +	rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
> +
> +	/*
> +	 * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
> +	 * for RMP optimizations. Initialize the per-CPU RMPOPT table base
> +	 * to the starting physical address to enable RMP optimizations for
> +	 * up to 2 TB of system RAM on all CPUs.
> +	 */
> +	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
> +}

[1] https://lore.kernel.org/all/ab41b1d8-e464-4ad6-ac07-7318686db10e@intel.com/

>
> [...snip...]
>

^ permalink raw reply

* Re: [PATCH v4 1/7] x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
From: Ackerley Tng @ 2026-05-01 16:37 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, jackyli, pgonda, rientjes, jacobhxu, xin,
	pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen, darwi,
	linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <77153c889934972efcfc3d210251564f29abcf51.1775874970.git.ashish.kalra@amd.com>

Ashish Kalra <Ashish.Kalra@amd.com> writes:

>
> [...snip...]
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index dbe104df339b..bce1b2e2a35c 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -76,7 +76,7 @@
>  #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
>  #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
>  #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
> -/* Free                                 ( 3*32+ 7) */
> +#define X86_FEATURE_RMPOPT		( 3*32+ 7) /* Support for AMD RMPOPT instruction */
>  #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
>  #define X86_FEATURE_UP			( 3*32+ 9) /* "up" SMP kernel running on UP */
>  #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
> diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
> index 42c7eac0c387..7ac3818c4502 100644
> --- a/arch/x86/kernel/cpu/scattered.c
> +++ b/arch/x86/kernel/cpu/scattered.c
> @@ -65,6 +65,7 @@ static const struct cpuid_bit cpuid_bits[] = {
>  	{ X86_FEATURE_PERFMON_V2,		CPUID_EAX,  0, 0x80000022, 0 },
>  	{ X86_FEATURE_AMD_LBR_V2,		CPUID_EAX,  1, 0x80000022, 0 },
>  	{ X86_FEATURE_AMD_LBR_PMC_FREEZE,	CPUID_EAX,  2, 0x80000022, 0 },
> +	{ X86_FEATURE_RMPOPT,			CPUID_EDX,  0, 0x80000025, 0 },
>  	{ X86_FEATURE_AMD_HTR_CORES,		CPUID_EAX, 30, 0x80000026, 0 },
>  	{ 0, 0, 0, 0, 0 }
>  };

Reviewed-by: Ackerley Tng <ackerleytng@google.com>

^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Ackerley Tng @ 2026-05-01  0:58 UTC (permalink / raw)
  To: Edgecombe, Rick P, pbonzini@redhat.com, seanjc@google.com,
	Zhao, Yan Y, dave.hansen@linux.intel.com
  Cc: Shahar, Sagi, Yamahata, Isaku, x86@kernel.org, kas@kernel.org,
	yilun.xu@linux.intel.com, bp@alien8.de, mingo@redhat.com,
	linux-kernel@vger.kernel.org, Huang, Kai, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Li, Xiaoyao, tglx@kernel.org,
	binbin.wu@linux.intel.com, Annapurve, Vishal
In-Reply-To: <231efd4e9267d3dbb8c63e57b3a43567c965e24a.camel@intel.com>

"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> writes:

> On Thu, 2026-04-30 at 11:17 -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>
>> >
>> > [...snip...]
>> >
>> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
>> > index b24b81cea5ea..e5a37ea2d4a0 100644
>> > --- a/arch/x86/virt/vmx/tdx/tdx.c
>> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
>> > @@ -710,7 +710,7 @@ static __init int tdmrs_set_up_pamt_all(struct tdmr_info_list *tdmr_list,
>> >    * to normal kernel memory. Systems with the X86_BUG_TDX_PW_MCE erratum need to
>> >    * do the conversion explicitly via MOVDIR64B.
>> >    */
>> > -static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>> > +void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>>
>> Could this be updated to use phys_addr_t base and size_t size instead of
>> generic unsigned long?
>
> A type is a really good idea. It could look like a virtual address, despite the
> paddr in the name.
>
> I added it to the cleanup list. But I would prefer to keep this series focused
> on resolving the critical controversy around the struct page/pfn type, rather
> than adding in more things to debate. If you don't mind.
>

No worries, please go ahead :)

>>
>> >   {
>> >   	const void *zero_page = (const void *)page_address(ZERO_PAGE(0));
>> >   	unsigned long phys, end;
>> > @@ -729,6 +729,7 @@ static void tdx_quirk_reset_paddr(unsigned long base, unsigned long size)
>> >   	 */
>> >   	mb();
>> >   }
>> > +EXPORT_SYMBOL_FOR_KVM(tdx_quirk_reset_paddr);
>> >
>> >   void tdx_quirk_reset_page(struct page *page)
>> >   {
>> > @@ -1920,17 +1921,17 @@ u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td)
>> >   {
>> >   	struct tdx_module_args args = {};
>> >
>> > -	args.rcx = mk_keyed_paddr(tdx_global_keyid, td->tdr_page);
>> > +	args.rcx = mk_keyed_paddr(tdx_global_keyid, page_to_pfn(td->tdr_page));
>>
>> Should mk_keyed_paddr() be updated to have a return type of phys_addr_t?
>> I guess in this case since mk_keyed_paddr() is pretty much an internal
>> function, returning u64 also makes sense to indicate that it should only
>> be used to set 64 bit registers.
>
> Yea, this is used to construct u64 inputs for seamcall args. So I think it
> should keep returning u64s. Maybe instead it would be better to have a function
> name pattern that denotes that purpose. We have some more helpers like this
> coming.
>
> But similarly, I'd like to not grow a snowballing cleanup series for this one.

Makes sense, let's go :)

^ permalink raw reply

* Re: [PATCH v2 2/4] x86/tdx: Use PFN directly for unmapping guest private memory
From: Ackerley Tng @ 2026-05-01  0:57 UTC (permalink / raw)
  To: Dave Hansen, Sean Christopherson, Rick P Edgecombe
  Cc: pbonzini@redhat.com, Yan Y Zhao, dave.hansen@linux.intel.com,
	Sagi Shahar, Isaku Yamahata, x86@kernel.org, kas@kernel.org,
	yilun.xu@linux.intel.com, bp@alien8.de, mingo@redhat.com,
	linux-kernel@vger.kernel.org, Kai Huang, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Xiaoyao Li, tglx@kernel.org,
	binbin.wu@linux.intel.com, Vishal Annapurve
In-Reply-To: <3067bbb0-b528-41ae-8739-c49432f9a62c@intel.com>

Dave Hansen <dave.hansen@intel.com> writes:

> On 4/30/26 12:20, Sean Christopherson wrote:
>>> Yea, this is used to construct u64 inputs for seamcall args. So I think it
>>> should keep returning u64s.
>> +1.  IMO, we should treat the TDX-Module as an extension of hardware and pass in
>> u64s where the spec says it takes a 64-bit value.
>
> +2
>
> In this very specific case 'phys_addr_t' is 100% the *WRONG* type for
> mk_keyed_paddr(). Why? Because the thing being returned is *NOT* *A*
> *PHYSICAL* *ADDRESS*. It's a composite type that contains a special
> physical address plus some metadata in a special "hardware" format. It's
> as much of a 'phys_addr_t' as a PTE is a 'phys_addr_t'. Yeah, they
> contain and are constructed partly from physical addresses, but they are
> not physical addresses themselves.
>

Got it, thanks!

> At the same time, if the kernel has a type-safe way of representing
> something that's also a 64-bit value, we should use it. Just because the
> TDX spec takes a 64-bit virtual address doesn't mean we should use a u64
> and not a "struct foo *".
>
> Let's please just not go back to a sea of u64's everywhere. Use common
> sense. Use the kernel's type system where appropriate, but don't
> over-apply it.
>
> IOW, I agree with Sean. But please don't take Sean's advise too far here.

^ permalink raw reply

* Re: [PATCH RFC v5 00/53] guest_memfd: In-place conversion support
From: Ackerley Tng @ 2026-04-30 23:51 UTC (permalink / raw)
  To: Michael Roth
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, oupton, pankaj.gupta, qperret,
	rick.p.edgecombe, rientjes, shivankg, steven.price, tabba, willy,
	wyihan, yan.y.zhao, forkloop, pratyush, suzuki.poulose,
	aneesh.kumar, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
	Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
	Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
	linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
	linux-coco
In-Reply-To: <x7n77snnvvukofo3slopl74c5tmlb2t2un5qy5fq6eb3d2xt7e@6a6tixg5mdfc>

Michael Roth <michael.roth@amd.com> writes:

>
> [...snip...]
>
> I made a super-long-winded reply to that thread, but to summarize:
>
> PRESERVE flag has different enumeration/behavior/enforcement for pre-launch
> vs. post-launch, and similar considerations might come into play for
> other flags, so to make it easier to enumerate what flags are available
> for pre-launch/post-launch, maybe we could have 2 capabilities instead
> of 1:
>
>   KVM_CAP_MEMORY_ATTRIBUTES2_PRE_LAUNCH_FLAGS
>   KVM_CAP_MEMORY_ATTRIBUTES2_FLAGS
>
> where SNP/TDX would only advertise PRESERVE for PRE_LAUNCH, and pKVM I
> guess would enumerate it for both (or maybe just POST_LAUNCH?)
>
> That lets us keep the flags definitions more straightforward but still
> allows userspace to easily enumerate what exactly should be available at
> pre vs. post launch time, and give us some flexibility to detail
> variations in behavior between the 2 phases without documenting
> edge-cases in terms of VM types.
>

Oops Michael I only read this after the meeting today.

Sean, today at guest_memfd biweekly we also discussed this topic. I
brought up this topic because IMO the interface is starting to get
a little awkward, I'm struggling to put the awkwardness into words.

Here are some awkward points:

For PRESERVE, even though it is defined (now) as that what the host
writes will be readable in the guest, it only works for both to-private
and to-shared conversions for KVM_X86_SW_PROTECTED_VMs and pKVM. That's
because guest_memfd doesn't actually invoke encryption during the
conversion. For TDX and SNP, the encryption can only be done before the
VM is finalized, through vendor-specific ioctls that go through
kvm_gmem_populate() to load memory into the guest.

For ZERO, it is defined in api.rst that ZERO is not supported for
to-private conversions, and the rationale there was that when ZEROing,
guest_memfd/KVM can zero, but it's really the contract between the guest
and the vendor trusted firmware whether the guest sees zeros later.

Another awkward point is that ZERO was meant to enable an optimization
for TDX since the firmware zeroes memory, but it actually only zeroes
memory when the page is unmapped from Secure EPTs. guest_memfd (for now)
doesn't track whether the page was unmapped from Secure EPTs as part of
the conversion, so guest_memfd can't assume it was mapped before the
conversion request. To uphold the ZERO contract with userspace,
guest_memfd applies zeroing for TDX anyway.

Summarizing from guest_memfd biweekly today:

David suggested enumerating the combinations, something like
`SHARED_ZERO` and friends (since to-private and ZERO is not supported)
and Michael then brought up the other axis of pre/post launch. IIRC
there might be another axis since pKVM would need to determine
dynamically if a to_shared conversion can be permitted for the range
being converted, based on whether the guest had requested a to_shared
conversion.

I think this might just result in too many flags, and could paint us
into a corner if more options get supported later.


I spent even more time thinking about this today. I get that we want a
consistent contract to userspace, can we scope the contract differently?

What if we scope as "what KVM guarantees the content will look like
after guest_memfd updates attributes"? This is a smaller contract, since
it doesn't promise anything about what the guest sees. Running this
through a few examples:

+ Pre-finalize, SNP, to-private, PRESERVE: guest_memfd guarantees that
  after setting memory attributes, the contents of the pages will not
  change. The contents are then ready for populate. What populate does
  to the memory is another contract between SNP and the guest that is
  out of scope of guest_memfd's contract.

+ Post-finalize, SNP, to-private, PRESERVE: guest_memfd guarantees that
  after setting memory attributes, the contents of the pages will not
  change. SNP's contract with the guest does not, though. After the page
  gets faulted in, the guest sees scrambled data. This may be a
  meaningless operation now, but it leaves the door open so perhaps we
  could have an SNP-specific ioctl in future where step 1 is to set
  memory attributes within guest_memfd to private and step 2 is to
  encrypt in place.

+ pKVM, to-private, PRESERVE: guest_memfd guarantees that after setting
  memory attributes, the contents of the pages don't change. Separately,
  pKVM doesn't do encryption, so the pKVM guest reads the same contents
  the host wrote. The distinction here from the current state is that
  guest_memfd didn't guarantee that the pKVM guest will see the same
  content the host wrote since that's a separate contract between the
  pKVM guest and pKVM.

+ Post-finalize, TDX, to-shared, ZERO: guest_memfd guarantees that
  contents of the pages will be zeroed in the process of updating
  guest_memfd attributes. Host userspace reads zeros after faulting it
  in, which is because guest_memfd did zero the pages after conversion
  to shared. A future optimization is possible, where guest_memfd only
  zeroes the pages that were unmapped from Secure EPTs, since (this
  version of) TDX zeros memory when unmapping from Secure EPTs.

+ Post-finalize, TDX, to-shared, PRESERVE: -EOPNOTSUPP. guest_memfd is
  unable to guarantee that the process of setting memory attributes will
  not change memory contents. The process of setting memory attributes
  requires unmapping from Secure EPTs, which will zero the memory. (In
  future, if we want to relax this, we could permit this if nothing in
  the requested range was mapped in Secure EPTs)

+ Post-finalize, SNP, to-shared, PRESERVE: guest_memfd guarantees that
  after setting memory attributes, the contents of the pages will not
  change. For SNP, unmapping doesn't change memory contents? The guest
  reads garbage, and that's a separate contract between SNP and the
  guest. In the guest_memfd contract, guest_memfd PRESERVEs the memory
  contents in the process of setting memory attributes, and can fulfil
  that.

+ Post-finalize, TDX, to-private, ZERO: guest_memfd zeroes the shared
  memory before updating the attributes to be private, because it
  promised to. If this memory gets faulted in to Secure EPTs, TDX
  firmware zeros it again, because that's TDX's contract with the
  guest. I can't see any benefit to userspace in using this combination,
  but the guest_memfd contract and implementation are simple.

TLDR:

+ PRESERVE == guarantee that the process of setting memory attributes
  doesn't change memory contents.
    + implementation == do nothing in most cases, except -EOPNOTSUPP for
      to-shared on TDX, since unmapping is a required part of setting
      memory attributes to private, and a TDX side effect of unmapping
      is zeroing memory,
+ ZERO == guarantee that the process of setting memory attributes zeroes
  memory contents.
    + implementation == memset(zero) in most cases. For TDX, a future
      optimization exists, where memset() can be skipped for pages that
      were mapped in Secure EPTs before conversion
+ UNSPECIFIED == no guarantees
    + implementation == guest_memfd does nothing explicitly about memory
      contents. The implementation is pretty much the same as PRESERVE
      except guest_memfd won't take into account vendor-specific side
      effects of the process of conversion. Except for the test vehicle
      KVM_X86_SW_PROTECTED_VMS, where memory is scrambled.

>>
>> [...snip...]
>>
>
> Looking at the example you have there:
>
>   + Note: These content modes apply to the entire requested range, not
>   + just the parts of the range that underwent conversion. For example, if
>   + this was the initial state:
>   +
>   +   * [0x0000, 0x1000): shared
>   +   * [0x1000, 0x2000): private
>   +   * [0x2000, 0x3000): shared
>   + and range [0x0000, 0x3000) was set to shared, the content mode would
>   + apply to all memory in [0x0000, 0x3000), not just the range that
>   + underwent conversion [0x1000, 0x2000).
>
> Userspace would be aware of whether the range contains pages that were
> already set to private, so if it really wants to set the just the
> [0x1000, 0x2000) range to shared with appropriate content mode, it is
> fully able to do so by just issuing the ioctl for that specific range.
> If it attempts to issue it for the entire range, it only seems like it
> would defy normal expectations and cause confusion to skip ranges, and
> I'm not sure it gains us anything useful in exchange for that potential
> confusion.
>

Great that we're aligned here :) No complaints from guest_memfd biweekly
today as well :)

>>
>> [...snip...]
>>

^ permalink raw reply

* Re: [PATCH v8 12/21] x86/virt/seamldr: Install a new TDX module
From: Dave Hansen @ 2026-04-30 22:29 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <ab11b3c922c93db045c4c0dfa8f191eb6fdb0b9f.camel@intel.com>

On 4/30/26 14:48, Edgecombe, Rick P wrote:
> On Thu, 2026-04-30 at 12:00 -0700, Dave Hansen wrote:
>> On 4/27/26 08:28, Chao Gao wrote:
>>> +			case MODULE_UPDATE_CPU_INSTALL:
>>> +				ret = seamldr_install(seamldr_params);
>>> +				break;
>> I really despise this naming. Could you please clarify with comments?
>>
>> This reads like it is installing a seamldr, not telling a seamldr to
>> perform an install.
> Yea it does read that way. We kind of have a convention around the seamcall
> wrappers matching the tdx docs name. tdh_foo_bar() is pretty clear.
> Unfortunately the seamldr calls are defined like SEAMLDR.INSTALL.

The naming convention is fine, really. It does make thing easier to look
up in the documentation.

In this case, just comment it, please. It's only a single site.

^ permalink raw reply

* Re: [PATCH v8 12/21] x86/virt/seamldr: Install a new TDX module
From: Edgecombe, Rick P @ 2026-04-30 21:48 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Hansen, Dave,
	linux-kernel@vger.kernel.org, Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <f92d114c-c0f1-4eaa-a828-a4d9ac4054d4@intel.com>

On Thu, 2026-04-30 at 12:00 -0700, Dave Hansen wrote:
> On 4/27/26 08:28, Chao Gao wrote:
> > +			case MODULE_UPDATE_CPU_INSTALL:
> > +				ret = seamldr_install(seamldr_params);
> > +				break;
> 
> I really despise this naming. Could you please clarify with comments?
> 
> This reads like it is installing a seamldr, not telling a seamldr to
> perform an install.

Yea it does read that way. We kind of have a convention around the seamcall
wrappers matching the tdx docs name. tdh_foo_bar() is pretty clear.
Unfortunately the seamldr calls are defined like SEAMLDR.INSTALL.

We don't need to make matching the wrappers name match the tdx docs be a rule. I
just considered trying to make a seamldr name schema to make it clear that these
are seamldr call names. But it is probably better to just special case this one
with a clearer name we give it, like seamldr_tdx_module_install(). Which is
slightly long, but both clear on the purpose and that it is in the family of
seamldr_() functions.

^ permalink raw reply

* Re: [PATCH v8 15/21] x86/virt/tdx: Refresh TDX module version after update
From: Edgecombe, Rick P @ 2026-04-30 21:35 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Hansen, Dave,
	linux-kernel@vger.kernel.org, Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <28be5180-ff74-4e4d-b392-7ba7a9b4c1c0@intel.com>

On Thu, 2026-04-30 at 12:14 -0700, Dave Hansen wrote:
> > Do not refresh the rest of tdx_sysinfo. Refreshing them at runtime could
> > disrupt running software that relies on the previously reported values.
> 
> This needs more explanation. I think the reasoning is quite nuanced.
> 
> tdx_sysinfo is just a cache of what the TDX module is and can do. If
> that changes, it means the TDX module changed. So you somehow need to
> argue why it's OK to hide those changes from the tdx_sysinfo users.
> 
> Why would they be confused by tdx_sysinfo changes but not by the TDX
> module *itself* changing?

We shouldn't refresh them because they don't change, right Chao? It's not
because we are hiding things?

> 
> > Note that major and minor versions are not refreshed because runtime updates
> > are supported only between releases with identical major and minor versions.
> 
> I'd rather have this in code than a changelog comment.
> 
> If they can't change then warn if they do.


^ permalink raw reply

* Re: [PATCH v8 08/21] x86/virt/seamldr: Allocate and populate a module update request
From: Dave Hansen @ 2026-04-30 21:31 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org,
	Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <e4a12820bdb7fa8c2e97e028fc9a4b51e1baff1e.camel@intel.com>

On 4/30/26 14:23, Edgecombe, Rick P wrote:
> On Wed, 2026-04-29 at 17:45 -0700, Dave Hansen wrote:
>>> +/*
>>> + * Intel TDX module blob. Its format is defined at:
>>> + *
>>> https://github.com/intel/tdx-module-binaries/blob/main/blob_structure.txt
>> Heh, so URLs are not OK in changelogs because they go stale, but they're
>> fine in the code?
> Hmm, we usually could refer to a document by name. But this is just a txt in a
> github repo... Maybe just:
> 
> Intel TDX module blob is a format for distributing TDX module updates. It is
> parsed by the host before internal bits are passed to the TDX module. For
> details, see the latest TDX module update repository.
> 
> To me the confusing part in all of this is that there is a format that the host
> has to parse and then pass different bits into the TDX module. Usually something
> would just take the format it defines. So I think that is probably good context
> to have around it.

Just say:

/* Intel TDX module update ABI structure. aka. "TDX module blob" */

We don't need to tell folks how to use Google. Just give them the right
keywords to dump in there.

^ permalink raw reply

* Re: [PATCH v8 08/21] x86/virt/seamldr: Allocate and populate a module update request
From: Edgecombe, Rick P @ 2026-04-30 21:23 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Hansen, Dave,
	linux-kernel@vger.kernel.org, Gao, Chao, x86@kernel.org
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com
In-Reply-To: <a52c4701-c99d-48d5-9b63-8eb1c0e589f0@intel.com>

On Wed, 2026-04-29 at 17:45 -0700, Dave Hansen wrote:
> > +/*
> > + * Intel TDX module blob. Its format is defined at:
> > + *
> > https://github.com/intel/tdx-module-binaries/blob/main/blob_structure.txt
> 
> Heh, so URLs are not OK in changelogs because they go stale, but they're
> fine in the code?

Hmm, we usually could refer to a document by name. But this is just a txt in a
github repo... Maybe just:

Intel TDX module blob is a format for distributing TDX module updates. It is
parsed by the host before internal bits are passed to the TDX module. For
details, see the latest TDX module update repository.


To me the confusing part in all of this is that there is a format that the host
has to parse and then pass different bits into the TDX module. Usually something
would just take the format it defines. So I think that is probably good context
to have around it.

^ permalink raw reply

* Re: [PATCH v8 18/21] coco/tdx-host: Don't expose P-SEAMLDR features on CPUs with erratum
From: Dave Hansen @ 2026-04-30 20:09 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-19-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> Some TDX-capable CPUs have an erratum, as documented in Intel® Trust
> Domain CPU Architectural Extensions (May 2021 edition) Chapter 2.3:
> 
>   SEAMRET from the P-SEAMLDR clears the current VMCS structure pointed
>   to by the current-VMCS pointer. A VMM that invokes the P-SEAMLDR using
>   SEAMCALL must reload the current-VMCS, if required, using the VMPTRLD
>   instruction.
> 
> Clearing the current VMCS behind KVM's back will break KVM.
> 
> This erratum is not present when IA32_VMX_BASIC[60] is set. Add a CPU
> bug bit for this erratum and refuse to expose P-SEAMLDR features (e.g.,
> TDX module updates) on affected CPUs.

This seems totally random.

Shouldn't this be way back when can_expose_seamldr() got defined in the
first place?
> +#define X86_BUG_SEAMRET_INVD_VMCS	X86_BUG( 1*32+11) /* "seamret_invd_vmcs" SEAMRET from P-SEAMLDR clears the current VMCS */

I find myself wondering if this is worth a bug bit.


^ permalink raw reply

* Re: [PATCH v8 17/21] x86/virt/seamldr: Abort updates on failure
From: Dave Hansen @ 2026-04-30 20:06 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-18-chao.gao@intel.com>

I don't like how this is being done.

 1. Introduce this do{}while() loop
 2. Do 20 other patches
 3. Introduce a thing that can make it change
 4. Change the fundamental flow of the loop, to fix #3

I'd much rather have:

 1. Introduce this do{}while() loop
 2. Tweak fundamental flow of the loop from the last patch when I can
    remember it. Allude to future failures.
 3. Do 20 other patches
 4. Introduce a thing that uses #2


> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index c81b26c4bac1..9b8f571eb03f 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -220,6 +220,7 @@ enum module_update_state {
>  static struct {
>  	enum module_update_state state;
>  	int thread_ack;
> +	bool failed;
>  	/*
>  	 * Protect update_data. Raw spinlock as it will be acquired from
>  	 * interrupt-disabled contexts.
> @@ -284,12 +285,15 @@ static int do_seamldr_install_module(void *seamldr_params)
>  				break;
>  			}
>  
> -			ack_state();
> +			if (ret)
> +				WRITE_ONCE(update_data.failed, true);
> +			else
> +				ack_state();
>  		} else {
>  			touch_nmi_watchdog();
>  			rcu_momentary_eqs();
>  		}

I don't like how this is turning out either. I don't like all the nested
conditions or ack_state() that hides its mucking with update data while
its caller mucks with it directly. It's just all hacked together.

Defer all of the acking, and *failed* acking to the ack_state() helper.

Also, I'm kinda peeved that you copied and pasted the  		
touch_nmi_watchdog()/rcu_momentary_eqs() bits and none of the comments.
This is a rather subtle use of both. If you want this to be a normal
"spinning in stop machine" idiom, then create a helper and put the
comments there.

Also, this is a case where:

	do {
		cpu_relax();
		newstate = READ_ONCE(update_data.state);

		if (newstate == curstate) {
			// can cpu_relax() just go in here??
			touch_nmi_watchdog();
			rcu_momentary_eqs();
			continue;
		}
	
		switch() {
			// state changing here
		}
	} while (...);

is a much more sane setup. You're not paying the if() indentation cost
for the entire state transition block. You're also putting the "shut up
the warnings" code out of the way where you can forget about it.


> -	} while (curstate != MODULE_UPDATE_DONE);
> +	} while (curstate != MODULE_UPDATE_DONE && !READ_ONCE(update_data.failed));
>  
>  	return ret;
>  }
> @@ -315,6 +319,7 @@ int seamldr_install_module(const u8 *data, u32 size)
>  
>  	/* Ensure a stable set of online CPUs for the update process. */
>  	guard(cpus_read_lock)();
> +	WRITE_ONCE(update_data.failed, false);
>  	set_target_state(MODULE_UPDATE_START + 1);
>  	ret = stop_machine_cpuslocked(do_seamldr_install_module, params, cpu_online_mask);
>  	if (ret)
I kinda wish this 'update_data.failed' set was named. This is trying to
bring 'update_data' into some initial state. Let's _call_ it that.
Honestly, I wouldn't hate if that function just also did the spinlock
init since it's so ugly do to statically.

^ permalink raw reply

* Re: [PATCH v8 09/21] x86/virt/seamldr: Introduce skeleton for TDX module updates
From: Dave Hansen @ 2026-04-30 20:03 UTC (permalink / raw)
  To: Chao Gao, kvm, linux-coco, linux-kernel, x86
  Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
	nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
	sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
	yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin
In-Reply-To: <20260427152854.101171-10-chao.gao@intel.com>

On 4/27/26 08:28, Chao Gao wrote:
> +static struct {
> +	enum module_update_state state;
> +	int thread_ack;

multi_stop_data has an atomic_t for this.

You have an int.

Which one is right?

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox